Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression
Set up workspace
library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"
1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster
Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.
cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))
# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)
# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes
# Create topGOdata object
GOdata <- new("topGOdata",
ontology = "BP", # use biological process ontology
allGenes = geneList,
geneSelectionFun = function(x)(x == 1),
annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
GO.ID | Term | Annotated | Significant | Expected | Fisher |
---|---|---|---|---|---|
GO:0000381 | regulation of alternative mRNA splicing, via spliceosome | 41 | 9 | 1.85 | 6.9e-05 |
GO:0006338 | chromatin remodeling | 918 | 66 | 41.41 | 0.00028 |
GO:0035176 | social behavior | 22 | 6 | 0.99 | 0.00033 |
GO:2000650 | negative regulation of sodium ion transmembrane transporter ... | 4 | 3 | 0.18 | 0.00035 |
GO:0051660 | establishment of centrosome localization | 9 | 4 | 0.41 | 0.00043 |
GO:0006895 | Golgi to endosome transport | 17 | 5 | 0.77 | 0.00072 |
GO:0045053 | protein retention in Golgi apparatus | 5 | 3 | 0.23 | 0.00085 |
GO:0032534 | regulation of microvillus assembly | 5 | 3 | 0.23 | 0.00085 |
GO:0030036 | actin cytoskeleton organization | 482 | 43 | 21.74 | 0.00093 |
GO:0044331 | cell-cell adhesion mediated by cadherin | 27 | 6 | 1.22 | 0.00107 |
GO:0009653 | anatomical structure morphogenesis | 1541 | 107 | 69.52 | 0.00111 |
GO:0045944 | positive regulation of transcription by RNA polymerase II | 825 | 56 | 37.22 | 0.00118 |
GO:1902275 | regulation of chromatin organization | 20 | 5 | 0.90 | 0.00161 |
GO:0042060 | wound healing | 261 | 23 | 11.77 | 0.00162 |
GO:0007043 | cell-cell junction assembly | 91 | 15 | 4.11 | 0.00172 |
GO:0090257 | regulation of muscle system process | 128 | 14 | 5.77 | 0.00184 |
GO:0031175 | neuron projection development | 591 | 44 | 26.66 | 0.00185 |
GO:0070830 | bicellular tight junction assembly | 40 | 7 | 1.80 | 0.00185 |
GO:0048669 | collateral sprouting in absence of injury | 2 | 2 | 0.09 | 0.00203 |
GO:0051674 | localization of cell | 2 | 2 | 0.09 | 0.00203 |
- Annotated: number of genes (out of all.genes) that are annotated with that GO term
- Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
- Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
- Fisher: (Raw) p-value from Fisher’s Exact Test
Organisms other than human
Many organisms have a database equivalent to org.Hs.eg.db on Bioconductor.
If not, and your organism is on NCBI it’s straightforward to create your own using the AnnotationForge R package and the function makeOrgPackageFromNCBI
2. Model-based DE analysis in limma
limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.
Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.
# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]
# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
AAACCCAAGTTATGGA_A001-C-007 | 1 | 0 | 0 | 0.0554074 | -0.1461764 | 0.5717008 | 1513 |
AAGCCATCAAGACCTT_A001-C-007 | 1 | 0 | 0 | 0.0660183 | -0.0729290 | 0.4841313 | 1265 |
AAGTTCGGTACCTATG_A001-C-007 | 1 | 0 | 0 | 0.0202042 | -0.0451182 | 0.5148005 | 1148 |
AATGAAGTCAGCGTCG_A001-C-007 | 1 | 0 | 0 | 0.1225441 | 0.0612193 | 1.0398614 | 1272 |
ACAGAAATCCAGCTCT_A001-C-007 | 1 | 0 | 0 | -0.0769269 | -0.0707690 | 0.6334125 | 1637 |
ACGGTTACAAATCGGG_A001-C-007 | 1 | 0 | 0 | -0.0550882 | 0.1286419 | 0.7523716 | 1934 |
tail(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
TTCCACGAGTATTGCC_A001-C-104 | 0 | 0 | 1 | 0.4520556 | 0.0266469 | 0.3015580 | 4562 |
TTCTAGTAGATAACAC_A001-C-104 | 0 | 0 | 1 | 0.0218701 | -0.0793631 | 1.3125000 | 1140 |
TTGCCTGTCCGCGATG_A001-C-104 | 0 | 0 | 1 | -0.0612575 | -0.0292726 | 0.3594352 | 2241 |
TTGGGATTCGTTCCCA_A001-C-104 | 0 | 0 | 1 | -0.0709637 | -0.0806137 | 0.4050223 | 1562 |
TTTCAGTTCGCACTCT_A001-C-104 | 0 | 0 | 1 | -0.1047877 | -0.0540112 | 1.8099548 | 1309 |
GACTCAACACACACGC_B001-A-301 | 0 | 1 | 0 | 0.0481511 | -0.0900060 | 0.1574803 | 990 |
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
ENSG00000291215 | -0.2357714 | -0.2248263 | -0.0954233 | 0.5209505 | -0.3817891 | 0.0510813 | 0.0001589 |
C1orf159 | -0.0141535 | -0.1325994 | -0.0259810 | -0.1442860 | 0.2517446 | 0.1333100 | 0.0001426 |
SDF4 | 0.3616255 | 0.1839501 | 0.4912284 | 0.2506248 | 1.0914474 | -0.1494113 | -0.0000750 |
CCNL2 | 0.4402786 | 1.6154478 | 0.0118319 | 0.4012170 | -1.8906665 | 0.2768585 | 0.0003573 |
MIB2 | 0.0480968 | -0.0562502 | 0.2279206 | -0.1502399 | -0.4402191 | -0.1676190 | 0.0000508 |
CDK11B | 0.6889952 | -0.1659776 | -0.0648067 | -0.7018856 | -0.2333263 | 0.1621756 | 0.0001548 |
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupNormal - proper.groupColorectal.Cancer | |
---|---|
proper.groupColorectal.Cancer | -1 |
proper.groupNormal | 1 |
proper.groupPolyp | 0 |
S.Score | 0 |
G2M.Score | 0 |
percent_MT | 0 |
nFeature_RNA | 0 |
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|
UTP6 | 2.809558 | 0.1665493 | 5.729581 | 0.0000000 | 0.0002398 | 7.9550347 |
S100PBP | 2.764765 | 0.2078133 | 4.898211 | 0.0000024 | 0.0057367 | 4.4838247 |
FDPS | 2.194920 | 0.1608721 | 4.481281 | 0.0000142 | 0.0127450 | 2.8889818 |
CRKL | 2.062359 | 0.1502419 | 4.431309 | 0.0000175 | 0.0127450 | 2.7049563 |
NMRAL1 | 2.284191 | 0.1670087 | 4.430493 | 0.0000175 | 0.0127450 | 2.7019633 |
SOAT1 | 2.026978 | 0.1554562 | 4.415095 | 0.0000187 | 0.0127450 | 2.6455817 |
CMSS1 | 2.179765 | 0.1843519 | 4.398089 | 0.0000200 | 0.0127450 | 2.5834884 |
TXNL4A | 2.044588 | 0.1549221 | 4.379329 | 0.0000216 | 0.0127450 | 2.5152036 |
PROX1 | 2.223377 | 0.1608351 | 4.344625 | 0.0000249 | 0.0127450 | 2.3894760 |
ADAMTSL1 | 3.378881 | 0.4051661 | 4.315871 | 0.0000280 | 0.0127450 | 2.2858848 |
LNX2 | 2.802389 | 0.3089559 | 4.306615 | 0.0000290 | 0.0127450 | 2.2526522 |
SEC23B | 2.024936 | 0.1510704 | 4.256737 | 0.0000355 | 0.0142802 | 2.0745228 |
TCP11L1 | 2.194426 | 0.1880198 | 4.112117 | 0.0000629 | 0.0218800 | 1.5672384 |
ZNF3 | 2.241657 | 0.1841910 | 4.097439 | 0.0000666 | 0.0218800 | 1.5165269 |
RBL1 | 2.107244 | 0.1733018 | 4.092354 | 0.0000680 | 0.0218800 | 1.4989902 |
ENSG00000287292 | 2.649530 | 0.2658202 | 4.044865 | 0.0000817 | 0.0231668 | 1.3360639 |
WDTC1 | 2.184789 | 0.1787432 | 4.032124 | 0.0000858 | 0.0231668 | 1.2926107 |
MED17 | 2.135271 | 0.1824321 | 4.009965 | 0.0000935 | 0.0231668 | 1.2172980 |
RXRA | 2.060384 | 0.1753848 | 4.005170 | 0.0000952 | 0.0231668 | 1.2010447 |
CDK18 | 2.168277 | 0.1831622 | 4.003186 | 0.0000959 | 0.0231668 | 1.1943236 |
ADPGK | 2.058125 | 0.1664844 | 3.989304 | 0.0001012 | 0.0232690 | 1.1473755 |
ZNF480 | 2.236793 | 0.2006418 | 3.965854 | 0.0001107 | 0.0242922 | 1.0683681 |
PPIL4 | 2.226664 | 0.2090412 | 3.892616 | 0.0001460 | 0.0306594 | 0.8240318 |
IRF3 | 2.687759 | 0.3212877 | 3.873166 | 0.0001571 | 0.0316070 | 0.7597636 |
SARS1 | 2.368320 | 0.2254349 | 3.835478 | 0.0001808 | 0.0346093 | 0.6359795 |
TMEM51-AS1 | 1.994662 | 0.1726518 | 3.808954 | 0.0001995 | 0.0346093 | 0.5494559 |
PRAG1 | 2.223802 | 0.2200442 | 3.793651 | 0.0002111 | 0.0346093 | 0.4997596 |
RNF13 | 2.212740 | 0.2147929 | 3.778006 | 0.0002237 | 0.0346093 | 0.4491215 |
KALRN | 2.052896 | 0.1794657 | 3.775951 | 0.0002254 | 0.0346093 | 0.4424844 |
ST6GAL1 | 2.169592 | 0.1866848 | 3.770198 | 0.0002302 | 0.0346093 | 0.4239141 |
Output columns:
- logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
- AveExpr: Average expression across all cells in expr2
- t: logFC divided by its standard error
- P.Value: Raw p-value (based on t) from test that logFC differs from 0
- adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
- B: log-odds that gene is DE
Save files
write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")
A note on pseudobulk DE
Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.
Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.
Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.
The Seurat function AggregateExpression()
can be used to sum counts as described here.
A tutorial on using limma for bulk RNASeq is available here.
Prepare for the next section
Download Rmd document
download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2025-July-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")
Session Information
sessionInfo()
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.21.0 kableExtra_1.4.0 dplyr_1.1.4
## [4] topGO_2.60.1 SparseM_1.84-2 GO.db_3.21.0
## [7] AnnotationDbi_1.70.0 IRanges_2.42.0 S4Vectors_0.46.0
## [10] Biobase_2.68.0 graph_1.86.0 BiocGenerics_0.54.0
## [13] generics_0.1.4 limma_3.64.1 Seurat_5.3.0
## [16] SeuratObject_5.1.0 sp_2.2-0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_2.0.0
## [4] magrittr_2.0.3 spatstat.utils_3.1-4 farver_2.1.2
## [7] rmarkdown_2.29 vctrs_0.6.5 ROCR_1.0-11
## [10] memoise_2.0.1 spatstat.explore_3.4-3 htmltools_0.5.8.1
## [13] sass_0.4.10 sctransform_0.4.2 parallelly_1.45.0
## [16] KernSmooth_2.23-26 bslib_0.9.0 htmlwidgets_1.6.4
## [19] ica_1.0-3 plyr_1.8.9 plotly_4.11.0
## [22] zoo_1.8-14 cachem_1.1.0 igraph_2.1.4
## [25] mime_0.13 lifecycle_1.0.4 pkgconfig_2.0.3
## [28] Matrix_1.7-3 R6_2.6.1 fastmap_1.2.0
## [31] GenomeInfoDbData_1.2.14 fitdistrplus_1.2-4 future_1.58.0
## [34] shiny_1.11.1 digest_0.6.37 patchwork_1.3.1
## [37] tensor_1.5.1 RSpectra_0.16-2 irlba_2.3.5.1
## [40] textshaping_1.0.1 RSQLite_2.4.1 progressr_0.15.1
## [43] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
## [46] abind_1.4-8 compiler_4.5.0 bit64_4.6.0-1
## [49] DBI_1.2.3 fastDummies_1.7.5 MASS_7.3-65
## [52] tools_4.5.0 lmtest_0.9-40 httpuv_1.6.16
## [55] future.apply_1.20.0 goftest_1.2-3 glue_1.8.0
## [58] nlme_3.1-168 promises_1.3.3 grid_4.5.0
## [61] Rtsne_0.17 cluster_2.1.8.1 reshape2_1.4.4
## [64] gtable_0.3.6 spatstat.data_3.1-6 tidyr_1.3.1
## [67] data.table_1.17.6 xml2_1.3.8 XVector_0.48.0
## [70] spatstat.geom_3.4-1 RcppAnnoy_0.0.22 ggrepel_0.9.6
## [73] RANN_2.6.2 pillar_1.11.0 stringr_1.5.1
## [76] spam_2.11-1 RcppHNSW_0.6.0 later_1.4.2
## [79] splines_4.5.0 lattice_0.22-7 survival_3.8-3
## [82] bit_4.6.0 deldir_2.0-4 tidyselect_1.2.1
## [85] Biostrings_2.76.0 miniUI_0.1.2 pbapply_1.7-2
## [88] knitr_1.50 gridExtra_2.3 svglite_2.2.1
## [91] scattermore_1.2 xfun_0.52 statmod_1.5.0
## [94] matrixStats_1.5.0 UCSC.utils_1.4.0 stringi_1.8.7
## [97] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.4
## [100] codetools_0.2-20 tibble_3.3.0 cli_3.6.5
## [103] uwot_0.2.3 systemfonts_1.2.3 xtable_1.8-4
## [106] reticulate_1.42.0 jquerylib_0.1.4 GenomeInfoDb_1.44.0
## [109] Rcpp_1.1.0 globals_0.18.0 spatstat.random_3.4-1
## [112] png_0.1-8 spatstat.univar_3.1-3 parallel_4.5.0
## [115] ggplot2_3.5.2 blob_1.2.4 dotCall64_1.2
## [118] listenv_0.9.1 viridisLite_0.4.2 scales_1.4.0
## [121] ggridges_0.5.6 crayon_1.5.3 purrr_1.0.4
## [124] rlang_1.1.6 KEGGREST_1.48.1 cowplot_1.2.0