Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

GO.ID	Term	Annotated	Significant	Expected	Fisher
GO:0000381	regulation of alternative mRNA splicing, via spliceosome	41	9	1.85	6.9e-05
GO:0006338	chromatin remodeling	918	66	41.41	0.00028
GO:0035176	social behavior	22	6	0.99	0.00033
GO:2000650	negative regulation of sodium ion transmembrane transporter ...	4	3	0.18	0.00035
GO:0051660	establishment of centrosome localization	9	4	0.41	0.00043
GO:0006895	Golgi to endosome transport	17	5	0.77	0.00072
GO:0045053	protein retention in Golgi apparatus	5	3	0.23	0.00085
GO:0032534	regulation of microvillus assembly	5	3	0.23	0.00085
GO:0030036	actin cytoskeleton organization	482	43	21.74	0.00093
GO:0044331	cell-cell adhesion mediated by cadherin	27	6	1.22	0.00107
GO:0009653	anatomical structure morphogenesis	1541	107	69.52	0.00111
GO:0045944	positive regulation of transcription by RNA polymerase II	825	56	37.22	0.00118
GO:1902275	regulation of chromatin organization	20	5	0.90	0.00161
GO:0042060	wound healing	261	23	11.77	0.00162
GO:0007043	cell-cell junction assembly	91	15	4.11	0.00172
GO:0090257	regulation of muscle system process	128	14	5.77	0.00184
GO:0031175	neuron projection development	591	44	26.66	0.00185
GO:0070830	bicellular tight junction assembly	40	7	1.80	0.00185
GO:0048669	collateral sprouting in absence of injury	2	2	0.09	0.00203
GO:0051674	localization of cell	2	2	0.09	0.00203

Annotated: number of genes (out of all.genes) that are annotated with that GO term
Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
Fisher: (Raw) p-value from Fisher’s Exact Test

Organisms other than human

Many organisms have a database equivalent to org.Hs.eg.db on Bioconductor.

If not, and your organism is on NCBI it’s straightforward to create your own using the AnnotationForge R package and the function makeOrgPackageFromNCBI

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupColorectal.Cancer	S.Score	G2M.Score	percent_MT	nFeature_RNA
AAACCCAAGTTATGGA_A001-C-007	1	0.0554074	-0.1461764	0.5717008	1513
AAGCCATCAAGACCTT_A001-C-007	1	0.0660183	-0.0729290	0.4841313	1265
AAGTTCGGTACCTATG_A001-C-007	1	0.0202042	-0.0451182	0.5148005	1148
AATGAAGTCAGCGTCG_A001-C-007	1	0.1225441	0.0612193	1.0398614	1272
ACAGAAATCCAGCTCT_A001-C-007	1	-0.0769269	-0.0707690	0.6334125	1637
ACGGTTACAAATCGGG_A001-C-007	1	-0.0550882	0.1286419	0.7523716	1934

tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupNormal	proper.groupPolyp	S.Score	G2M.Score	percent_MT	nFeature_RNA
TTCCACGAGTATTGCC_A001-C-104	0	1	0.4520556	0.0266469	0.3015580	4562
TTCTAGTAGATAACAC_A001-C-104	0	1	0.0218701	-0.0793631	1.3125000	1140
TTGCCTGTCCGCGATG_A001-C-104	0	1	-0.0612575	-0.0292726	0.3594352	2241
TTGGGATTCGTTCCCA_A001-C-104	0	1	-0.0709637	-0.0806137	0.4050223	1562
TTTCAGTTCGCACTCT_A001-C-104	0	1	-0.1047877	-0.0540112	1.8099548	1309
GACTCAACACACACGC_B001-A-301	1	0	0.0481511	-0.0900060	0.1574803	990

# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupColorectal.Cancer	proper.groupNormal	proper.groupPolyp	S.Score	G2M.Score	percent_MT	nFeature_RNA
ENSG00000291215	-0.2357714	-0.2248263	-0.0954233	0.5209505	-0.3817891	0.0510813	0.0001589
C1orf159	-0.0141535	-0.1325994	-0.0259810	-0.1442860	0.2517446	0.1333100	0.0001426
SDF4	0.3616255	0.1839501	0.4912284	0.2506248	1.0914474	-0.1494113	-0.0000750
CCNL2	0.4402786	1.6154478	0.0118319	0.4012170	-1.8906665	0.2768585	0.0003573
MIB2	0.0480968	-0.0562502	0.2279206	-0.1502399	-0.4402191	-0.1676190	0.0000508
CDK11B	0.6889952	-0.1659776	-0.0648067	-0.7018856	-0.2333263	0.1621756	0.0001548

# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupNormal - proper.groupColorectal.Cancer
proper.groupColorectal.Cancer	-1
proper.groupNormal	1
proper.groupPolyp	0
S.Score	0
G2M.Score	0
percent_MT	0
nFeature_RNA	0

fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	logFC	AveExpr	t	P.Value	adj.P.Val	B
UTP6	2.809558	0.1665493	5.729581	0.0000000	0.0002398	7.9550347
S100PBP	2.764765	0.2078133	4.898211	0.0000024	0.0057367	4.4838247
FDPS	2.194920	0.1608721	4.481281	0.0000142	0.0127450	2.8889818
CRKL	2.062359	0.1502419	4.431309	0.0000175	0.0127450	2.7049563
NMRAL1	2.284191	0.1670087	4.430493	0.0000175	0.0127450	2.7019633
SOAT1	2.026978	0.1554562	4.415095	0.0000187	0.0127450	2.6455817
CMSS1	2.179765	0.1843519	4.398089	0.0000200	0.0127450	2.5834884
TXNL4A	2.044588	0.1549221	4.379329	0.0000216	0.0127450	2.5152036
PROX1	2.223377	0.1608351	4.344625	0.0000249	0.0127450	2.3894760
ADAMTSL1	3.378881	0.4051661	4.315871	0.0000280	0.0127450	2.2858848
LNX2	2.802389	0.3089559	4.306615	0.0000290	0.0127450	2.2526522
SEC23B	2.024936	0.1510704	4.256737	0.0000355	0.0142802	2.0745228
TCP11L1	2.194426	0.1880198	4.112117	0.0000629	0.0218800	1.5672384
ZNF3	2.241657	0.1841910	4.097439	0.0000666	0.0218800	1.5165269
RBL1	2.107244	0.1733018	4.092354	0.0000680	0.0218800	1.4989902
ENSG00000287292	2.649530	0.2658202	4.044865	0.0000817	0.0231668	1.3360639
WDTC1	2.184789	0.1787432	4.032124	0.0000858	0.0231668	1.2926107
MED17	2.135271	0.1824321	4.009965	0.0000935	0.0231668	1.2172980
RXRA	2.060384	0.1753848	4.005170	0.0000952	0.0231668	1.2010447
CDK18	2.168277	0.1831622	4.003186	0.0000959	0.0231668	1.1943236
ADPGK	2.058125	0.1664844	3.989304	0.0001012	0.0232690	1.1473755
ZNF480	2.236793	0.2006418	3.965854	0.0001107	0.0242922	1.0683681
PPIL4	2.226664	0.2090412	3.892616	0.0001460	0.0306594	0.8240318
IRF3	2.687759	0.3212877	3.873166	0.0001571	0.0316070	0.7597636
SARS1	2.368320	0.2254349	3.835478	0.0001808	0.0346093	0.6359795
TMEM51-AS1	1.994662	0.1726518	3.808954	0.0001995	0.0346093	0.5494559
PRAG1	2.223802	0.2200442	3.793651	0.0002111	0.0346093	0.4997596
RNF13	2.212740	0.2147929	3.778006	0.0002237	0.0346093	0.4491215
KALRN	2.052896	0.1794657	3.775951	0.0002254	0.0346093	0.4424844
ST6GAL1	2.169592	0.1866848	3.770198	0.0002302	0.0346093	0.4239141

Output columns:

logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
AveExpr: Average expression across all cells in expr2
t: logFC divided by its standard error
P.Value: Raw p-value (based on t) from test that logFC differs from 0
adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
B: log-odds that gene is DE

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

A tutorial on using limma for bulk RNASeq is available here.

Prepare for the next section

Download Rmd document

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2025-July-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")

Session Information

sessionInfo()

## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.21.0  kableExtra_1.4.0     dplyr_1.1.4         
##  [4] topGO_2.60.1         SparseM_1.84-2       GO.db_3.21.0        
##  [7] AnnotationDbi_1.70.0 IRanges_2.42.0       S4Vectors_0.46.0    
## [10] Biobase_2.68.0       graph_1.86.0         BiocGenerics_0.54.0 
## [13] generics_0.1.4       limma_3.64.1         Seurat_5.3.0        
## [16] SeuratObject_5.1.0   sp_2.2-0            
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.17.1       jsonlite_2.0.0         
##   [4] magrittr_2.0.3          spatstat.utils_3.1-4    farver_2.1.2           
##   [7] rmarkdown_2.29          vctrs_0.6.5             ROCR_1.0-11            
##  [10] memoise_2.0.1           spatstat.explore_3.4-3  htmltools_0.5.8.1      
##  [13] sass_0.4.10             sctransform_0.4.2       parallelly_1.45.0      
##  [16] KernSmooth_2.23-26      bslib_0.9.0             htmlwidgets_1.6.4      
##  [19] ica_1.0-3               plyr_1.8.9              plotly_4.11.0          
##  [22] zoo_1.8-14              cachem_1.1.0            igraph_2.1.4           
##  [25] mime_0.13               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [28] Matrix_1.7-3            R6_2.6.1                fastmap_1.2.0          
##  [31] GenomeInfoDbData_1.2.14 fitdistrplus_1.2-4      future_1.58.0          
##  [34] shiny_1.11.1            digest_0.6.37           patchwork_1.3.1        
##  [37] tensor_1.5.1            RSpectra_0.16-2         irlba_2.3.5.1          
##  [40] textshaping_1.0.1       RSQLite_2.4.1           progressr_0.15.1       
##  [43] spatstat.sparse_3.1-0   httr_1.4.7              polyclip_1.10-7        
##  [46] abind_1.4-8             compiler_4.5.0          bit64_4.6.0-1          
##  [49] DBI_1.2.3               fastDummies_1.7.5       MASS_7.3-65            
##  [52] tools_4.5.0             lmtest_0.9-40           httpuv_1.6.16          
##  [55] future.apply_1.20.0     goftest_1.2-3           glue_1.8.0             
##  [58] nlme_3.1-168            promises_1.3.3          grid_4.5.0             
##  [61] Rtsne_0.17              cluster_2.1.8.1         reshape2_1.4.4         
##  [64] gtable_0.3.6            spatstat.data_3.1-6     tidyr_1.3.1            
##  [67] data.table_1.17.6       xml2_1.3.8              XVector_0.48.0         
##  [70] spatstat.geom_3.4-1     RcppAnnoy_0.0.22        ggrepel_0.9.6          
##  [73] RANN_2.6.2              pillar_1.11.0           stringr_1.5.1          
##  [76] spam_2.11-1             RcppHNSW_0.6.0          later_1.4.2            
##  [79] splines_4.5.0           lattice_0.22-7          survival_3.8-3         
##  [82] bit_4.6.0               deldir_2.0-4            tidyselect_1.2.1       
##  [85] Biostrings_2.76.0       miniUI_0.1.2            pbapply_1.7-2          
##  [88] knitr_1.50              gridExtra_2.3           svglite_2.2.1          
##  [91] scattermore_1.2         xfun_0.52               statmod_1.5.0          
##  [94] matrixStats_1.5.0       UCSC.utils_1.4.0        stringi_1.8.7          
##  [97] lazyeval_0.2.2          yaml_2.3.10             evaluate_1.0.4         
## [100] codetools_0.2-20        tibble_3.3.0            cli_3.6.5              
## [103] uwot_0.2.3              systemfonts_1.2.3       xtable_1.8-4           
## [106] reticulate_1.42.0       jquerylib_0.1.4         GenomeInfoDb_1.44.0    
## [109] Rcpp_1.1.0              globals_0.18.0          spatstat.random_3.4-1  
## [112] png_0.1-8               spatstat.univar_3.1-3   parallel_4.5.0         
## [115] ggplot2_3.5.2           blob_1.2.4              dotCall64_1.2          
## [118] listenv_0.9.1           viridisLite_0.4.2       scales_1.4.0           
## [121] ggridges_0.5.6          crayon_1.5.3            purrr_1.0.4            
## [124] rlang_1.1.6             KEGGREST_1.48.1         cowplot_1.2.0

☰ Menu

Single Cell RNA-Seq Analysis