Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

GO.ID	Term	Annotated	Significant	Expected	Fisher
GO:0010613	positive regulation of cardiac muscle hypertrophy	19	4	0.34	0.00031
GO:0006325	chromatin organization	465	24	8.31	0.00047
GO:0030036	actin cytoskeleton organization	462	19	8.26	0.00057
GO:0034968	histone lysine methylation	81	7	1.45	0.00059
GO:2001014	regulation of skeletal muscle cell differentiation	11	3	0.20	0.00083
GO:0000381	regulation of alternative mRNA splicing, via spliceosome	42	5	0.75	0.00086
GO:0006338	chromatin remodeling	270	13	4.82	0.00108
GO:0048813	dendrite morphogenesis	95	7	1.70	0.00152
GO:0045445	myoblast differentiation	70	6	1.25	0.00152
GO:1901888	regulation of cell junction assembly	124	8	2.22	0.00167
GO:1901897	regulation of relaxation of cardiac muscle	4	2	0.07	0.00186
GO:0031666	positive regulation of lipopolysaccharide-mediated signaling...	4	2	0.07	0.00186
GO:0071320	cellular response to cAMP	30	4	0.54	0.00188
GO:0061049	cell growth involved in cardiac muscle cell development	15	3	0.27	0.00218
GO:0010595	positive regulation of endothelial cell migration	76	6	1.36	0.00232
GO:0002063	chondrocyte development	16	3	0.29	0.00264
GO:0045906	negative regulation of vasoconstriction	5	2	0.09	0.00306
GO:0098909	regulation of cardiac muscle cell action potential involved ...	5	2	0.09	0.00306
GO:0048842	positive regulation of axon extension involved in axon guida...	5	2	0.09	0.00306
GO:0021785	branchiomotor neuron axon guidance	5	2	0.09	0.00306

Annotated: number of genes (out of all.genes) that are annotated with that GO term
Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
Fisher: (Raw) p-value from Fisher’s Exact Test

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupColorectal.Cancer	S.Score	G2M.Score	percent_MT	nFeature_RNA
ACGGTTAGTCTCACAA_A001-C-007	1	0.0347548	0.7909218	0.5656109	1187
ACTGTCCAGGCGTTAG_A001-C-007	1	-0.0513069	-0.0204757	0.5043712	1634
AGCGTATAGCCTCAGC_A001-C-007	1	-0.0576611	-0.0549822	3.0966767	880
AGCGTATAGGTATCTC_A001-C-007	1	0.1080488	-0.0416714	0.4595588	783
AGGACTTGTACCTATG_A001-C-007	1	0.0181025	-0.0642517	0.4804393	986
AGGGTTTGTTCTTCAT_A001-C-007	1	0.0057069	-0.0475972	1.3483146	937

tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupNormal	S.Score	G2M.Score	percent_MT	nFeature_RNA
TCCACCAAGTAACGTA_B001-A-301	1	-0.0872095	-0.1505728	0.1785183	2096
TCCACCAGTGTCCACG_B001-A-301	1	-0.0697658	-0.1145196	0.3147954	1147
TCGCTTGAGACTTGTC_B001-A-301	1	-0.0989612	-0.1146185	0.1570475	1471
TCTCCGACACAGTACT_B001-A-301	1	0.0437678	-0.1455716	0.1921230	2383
TGATGGTCATAACTCG_B001-A-301	1	-0.1108661	-0.1465694	0.5749668	1413
TTCTTCCCAGCGTAGA_B001-A-301	1	-0.0361969	0.0067655	0.5504587	843

# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupColorectal.Cancer	proper.groupNormal	proper.groupPolyp	S.Score	G2M.Score	percent_MT	nFeature_RNA
CCNL2	0.0579580	0.4644867	0.5707252	-0.5229024	0.7195925	0.2619560	0.0001080
CDK11B	0.5915809	0.1722522	0.5923884	-0.6623504	0.7849839	-0.3260968	0.0000319
CDK11A	0.9175843	0.2172913	0.4343150	-1.0275000	0.9939193	-0.0600846	0.0000147
NADK	0.3736396	0.2646946	0.0662475	-0.2493852	0.1639109	0.0637775	-0.0000001
GNB1	0.7772654	0.5788333	0.8958150	-0.5519191	0.9004079	-0.0616478	0.0002935
SKI	0.7171411	0.1954765	0.4870630	-0.7409301	-0.1685434	-0.2098351	0.0000133

# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupNormal - proper.groupColorectal.Cancer
proper.groupColorectal.Cancer	-1
proper.groupNormal	1
proper.groupPolyp	0
S.Score	0
G2M.Score	0
percent_MT	0
nFeature_RNA	0

fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	logFC	AveExpr	t	P.Value	adj.P.Val	B
XIST	1.9827318	0.6682736	11.870704	0.0e+00	0.0000000	37.251244
GUCA2A	1.6727802	0.4946151	10.607512	0.0e+00	0.0000000	31.021088
MMP12	-2.0976536	0.6819202	-8.045790	0.0e+00	0.0000000	18.347651
PHGR1	1.7128625	0.6230695	7.990444	0.0e+00	0.0000000	18.079041
CKB	2.0697765	1.1605407	7.813288	0.0e+00	0.0000000	17.222231
SLC26A2	1.8826649	0.9641306	6.787927	0.0e+00	0.0000004	12.377261
SLC26A3	1.5570563	0.6022747	6.661061	0.0e+00	0.0000006	11.794657
PIGR	1.7503636	1.1102687	6.460438	0.0e+00	0.0000015	10.882534
HIVEP3	-1.4376532	0.7340200	-5.994125	0.0e+00	0.0000121	8.810834
NR3C2	1.6376297	1.0291416	5.792722	1.0e-07	0.0000276	7.939339
UTY	-1.3631601	0.7485786	-5.484691	3.0e-07	0.0000928	6.636837
ZG16	0.9778633	0.3534387	5.484091	3.0e-07	0.0000928	6.634339
MUC13	1.2139028	0.5911549	5.163231	1.1e-06	0.0003496	5.320392
PDE4D	1.6637089	1.5119900	5.120095	1.3e-06	0.0003901	5.147303
TNFAIP2	-1.4572991	1.0361014	-5.104445	1.4e-06	0.0003901	5.084722
HSP90AA1	-1.3268169	0.7386294	-5.041036	1.8e-06	0.0004794	4.832355
MACF1	-1.2728166	2.5152218	-5.024663	1.9e-06	0.0004838	4.767506
SATB2	1.2335513	0.6184811	4.931185	2.9e-06	0.0006781	4.399765
RNF213	-1.4619869	1.4301922	-4.899682	3.3e-06	0.0007080	4.276803
PARP14	-1.5705769	1.4065849	-4.895788	3.3e-06	0.0007080	4.261639
IL7R	-1.0858376	0.3242462	-4.865053	3.8e-06	0.0007591	4.142219
FOXP1	1.5111842	1.6488493	4.856249	3.9e-06	0.0007591	4.108095
AMN	0.8531736	0.2741440	4.731821	6.6e-06	0.0011476	3.630123
EPCAM	1.1109426	0.5789830	4.715942	7.0e-06	0.0011476	3.569703
PLA2G7	-1.1812964	0.5448965	-4.705319	7.3e-06	0.0011476	3.529356
ZBTB20	1.5396424	1.7559376	4.700057	7.5e-06	0.0011476	3.509395
RAPGEF1	-1.5180128	1.5356339	-4.698717	7.5e-06	0.0011476	3.504313
SAT1	-1.8924435	2.0416164	-4.696409	7.6e-06	0.0011476	3.495562
TYMP	-1.3860201	0.9203430	-4.688993	7.8e-06	0.0011476	3.467469
SCNN1B	0.9079763	0.3004799	4.668610	8.5e-06	0.0011916	3.390400

Output columns:

logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
AveExpr: Average expression across all cells in expr2
t: logFC divided by its standard error
P.Value: Raw p-value (based on t) from test that logFC differs from 0
adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
B: log-odds that gene is DE

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

A tutorial on using limma for bulk RNASeq is available here.

Prepare for the next section

Download Rmd document

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")

Session Information

sessionInfo()

R version 4.3.2 (2023-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045) Matrix products: default locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8 [3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.utf8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] org.Hs.eg.db_3.17.0 kableExtra_1.3.4 dplyr_1.1.4 [4] topGO_2.52.0 SparseM_1.81 GO.db_3.17.0 [7] AnnotationDbi_1.62.2 IRanges_2.34.1 S4Vectors_0.38.2 [10] Biobase_2.60.0 graph_1.78.0 BiocGenerics_0.46.0 [13] limma_3.56.2 Seurat_5.0.1 SeuratObject_5.0.1 [16] sp_2.1-2 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.8 [4] magrittr_2.0.3 spatstat.utils_3.0-4 rmarkdown_2.25 [7] zlibbioc_1.46.0 vctrs_0.6.5 ROCR_1.0-11 [10] memoise_2.0.1 spatstat.explore_3.2-5 RCurl_1.98-1.13 [13] webshot_0.5.5 htmltools_0.5.7 sass_0.4.8 [16] sctransform_0.4.1 parallelly_1.36.0 KernSmooth_2.23-22 [19] bslib_0.6.1 htmlwidgets_1.6.4 ica_1.0-3 [22] plyr_1.8.9 plotly_4.10.3 zoo_1.8-12 [25] cachem_1.0.8 igraph_1.5.1 mime_0.12 [28] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.6-4 [31] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.10 [34] fitdistrplus_1.1-11 future_1.33.0 shiny_1.8.0 [37] digest_0.6.33 colorspace_2.1-0 patchwork_1.1.3 [40] tensor_1.5 RSpectra_0.16-1 irlba_2.3.5.1 [43] RSQLite_2.3.4 progressr_0.14.0 fansi_1.0.6 [46] spatstat.sparse_3.0-3 httr_1.4.7 polyclip_1.10-6 [49] abind_1.4-5 compiler_4.3.2 bit64_4.0.5 [52] DBI_1.1.3 fastDummies_1.7.3 highr_0.10 [55] MASS_7.3-60 tools_4.3.2 lmtest_0.9-40 [58] httpuv_1.6.13 future.apply_1.11.0 goftest_1.2-3 [61] glue_1.6.2 nlme_3.1-164 promises_1.2.1 [64] grid_4.3.2 Rtsne_0.17 cluster_2.1.6 [67] reshape2_1.4.4 generics_0.1.3 gtable_0.3.4 [70] spatstat.data_3.0-3 tidyr_1.3.0 data.table_1.14.10 [73] xml2_1.3.6 XVector_0.40.0 utf8_1.2.4 [76] spatstat.geom_3.2-7 RcppAnnoy_0.0.21 ggrepel_0.9.4 [79] RANN_2.6.1 pillar_1.9.0 stringr_1.5.1 [82] spam_2.10-0 RcppHNSW_0.5.0 later_1.3.2 [85] splines_4.3.2 lattice_0.22-5 bit_4.0.5 [88] survival_3.5-7 deldir_2.0-2 tidyselect_1.2.0 [91] Biostrings_2.68.1 miniUI_0.1.1.1 pbapply_1.7-2 [94] knitr_1.45 gridExtra_2.3 svglite_2.1.3 [97] scattermore_1.2 xfun_0.41 matrixStats_1.1.0 [100] stringi_1.8.2 lazyeval_0.2.2 yaml_2.3.7 [103] evaluate_0.23 codetools_0.2-19 tibble_3.2.1 [106] cli_3.6.1 uwot_0.1.16 systemfonts_1.0.5 [109] xtable_1.8-4 reticulate_1.34.0 munsell_0.5.0 [112] jquerylib_0.1.4 GenomeInfoDb_1.36.4 Rcpp_1.0.11 [115] globals_0.16.2 spatstat.random_3.2-2 png_0.1-8 [118] parallel_4.3.2 ellipsis_0.3.2 blob_1.2.4 [121] ggplot2_3.4.4 dotCall64_1.1-1 bitops_1.0-7 [124] listenv_0.9.0 viridisLite_0.4.2 scales_1.3.0 [127] ggridges_0.5.4 crayon_1.5.2 leiden_0.4.3.1 [130] purrr_1.0.2 rlang_1.1.2 rvest_1.0.3 [133] KEGGREST_1.40.1 cowplot_1.1.1

☰ Menu

Single Cell RNA-Seq Analysis