Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression
Set up workspace
library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"
1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster
Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.
cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))
# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)
# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes
# Create topGOdata object
GOdata <- new("topGOdata",
ontology = "BP", # use biological process ontology
allGenes = geneList,
geneSelectionFun = function(x)(x == 1),
annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
GO.ID | Term | Annotated | Significant | Expected | Fisher |
---|---|---|---|---|---|
GO:0006338 | chromatin remodeling | 464 | 39 | 20.46 | 7.3e-05 |
GO:0000381 | regulation of alternative mRNA splicing, via spliceosome | 39 | 8 | 1.72 | 0.00025 |
GO:0051660 | establishment of centrosome localization | 9 | 4 | 0.40 | 0.00039 |
GO:0006895 | Golgi to endosome transport | 17 | 5 | 0.75 | 0.00065 |
GO:0042060 | wound healing | 252 | 23 | 11.11 | 0.00074 |
GO:0045944 | positive regulation of transcription by RNA polymerase II | 810 | 55 | 35.71 | 0.00076 |
GO:0045053 | protein retention in Golgi apparatus | 5 | 3 | 0.22 | 0.00080 |
GO:0032534 | regulation of microvillus assembly | 5 | 3 | 0.22 | 0.00080 |
GO:0007043 | cell-cell junction assembly | 88 | 15 | 3.88 | 0.00113 |
GO:0070830 | bicellular tight junction assembly | 39 | 7 | 1.72 | 0.00139 |
GO:0035176 | social behavior | 20 | 5 | 0.88 | 0.00146 |
GO:2000650 | negative regulation of sodium ion transmembrane transporter ... | 6 | 3 | 0.26 | 0.00154 |
GO:0048669 | collateral sprouting in absence of injury | 2 | 2 | 0.09 | 0.00194 |
GO:0031175 | neuron projection development | 570 | 42 | 25.13 | 0.00195 |
GO:0002064 | epithelial cell development | 134 | 14 | 5.91 | 0.00230 |
GO:0007015 | actin filament organization | 317 | 28 | 13.98 | 0.00235 |
GO:0030155 | regulation of cell adhesion | 482 | 35 | 21.25 | 0.00242 |
GO:0042752 | regulation of circadian rhythm | 80 | 10 | 3.53 | 0.00259 |
GO:0030644 | intracellular chloride ion homeostasis | 7 | 3 | 0.31 | 0.00261 |
GO:0030050 | vesicle transport along actin filament | 7 | 3 | 0.31 | 0.00261 |
- Annotated: number of genes (out of all.genes) that are annotated with that GO term
- Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
- Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
- Fisher: (Raw) p-value from Fisher’s Exact Test
2. Model-based DE analysis in limma
limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.
Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.
# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]
# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
AAACCCAAGTTATGGA_A001-C-007 | 1 | 0 | 0 | 0.0554074 | -0.1461764 | 0.5717008 | 1513 |
AAGCCATCAAGACCTT_A001-C-007 | 1 | 0 | 0 | 0.0660183 | -0.0729290 | 0.4841313 | 1265 |
AAGTTCGGTACCTATG_A001-C-007 | 1 | 0 | 0 | 0.0202042 | -0.0451182 | 0.5148005 | 1148 |
AATGAAGTCAGCGTCG_A001-C-007 | 1 | 0 | 0 | 0.1225441 | 0.0612193 | 1.0398614 | 1272 |
ACAGAAATCCAGCTCT_A001-C-007 | 1 | 0 | 0 | -0.0769269 | -0.0707690 | 0.6334125 | 1637 |
ACGGTTACAAATCGGG_A001-C-007 | 1 | 0 | 0 | -0.0550882 | 0.1286419 | 0.7523716 | 1934 |
tail(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
TTCCACGAGTATTGCC_A001-C-104 | 0 | 0 | 1 | 0.4520556 | 0.0266469 | 0.3015580 | 4562 |
TTCTAGTAGATAACAC_A001-C-104 | 0 | 0 | 1 | 0.0218701 | -0.0793631 | 1.3125000 | 1140 |
TTGCCTGTCCGCGATG_A001-C-104 | 0 | 0 | 1 | -0.0612575 | -0.0292726 | 0.3594352 | 2241 |
TTGGGATTCGTTCCCA_A001-C-104 | 0 | 0 | 1 | -0.0709637 | -0.0806137 | 0.4050223 | 1562 |
TTTCAGTTCGCACTCT_A001-C-104 | 0 | 0 | 1 | -0.1047877 | -0.0540112 | 1.8099548 | 1309 |
GACTCAACACACACGC_B001-A-301 | 0 | 1 | 0 | 0.0481511 | -0.0900060 | 0.1574803 | 990 |
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
ENSG00000291215 | -0.2357714 | -0.2248263 | -0.0954233 | 0.5209505 | -0.3817891 | 0.0510813 | 0.0001589 |
C1orf159 | -0.0141535 | -0.1325994 | -0.0259810 | -0.1442860 | 0.2517446 | 0.1333100 | 0.0001426 |
SDF4 | 0.3616255 | 0.1839501 | 0.4912284 | 0.2506248 | 1.0914474 | -0.1494113 | -0.0000750 |
CCNL2 | 0.4402786 | 1.6154478 | 0.0118319 | 0.4012170 | -1.8906665 | 0.2768585 | 0.0003573 |
MIB2 | 0.0480968 | -0.0562502 | 0.2279206 | -0.1502399 | -0.4402191 | -0.1676190 | 0.0000508 |
CDK11B | 0.6889952 | -0.1659776 | -0.0648067 | -0.7018856 | -0.2333263 | 0.1621756 | 0.0001548 |
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupNormal - proper.groupColorectal.Cancer | |
---|---|
proper.groupColorectal.Cancer | -1 |
proper.groupNormal | 1 |
proper.groupPolyp | 0 |
S.Score | 0 |
G2M.Score | 0 |
percent_MT | 0 |
nFeature_RNA | 0 |
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|
UTP6 | 2.809558 | 0.1665493 | 5.729581 | 0.0000000 | 0.0002398 | 7.9550347 |
S100PBP | 2.764765 | 0.2078133 | 4.898211 | 0.0000024 | 0.0057367 | 4.4838247 |
FDPS | 2.194920 | 0.1608721 | 4.481281 | 0.0000142 | 0.0127450 | 2.8889818 |
CRKL | 2.062359 | 0.1502419 | 4.431309 | 0.0000175 | 0.0127450 | 2.7049563 |
NMRAL1 | 2.284191 | 0.1670087 | 4.430493 | 0.0000175 | 0.0127450 | 2.7019633 |
SOAT1 | 2.026978 | 0.1554562 | 4.415095 | 0.0000187 | 0.0127450 | 2.6455817 |
CMSS1 | 2.179765 | 0.1843519 | 4.398089 | 0.0000200 | 0.0127450 | 2.5834884 |
TXNL4A | 2.044588 | 0.1549221 | 4.379329 | 0.0000216 | 0.0127450 | 2.5152036 |
PROX1 | 2.223377 | 0.1608351 | 4.344625 | 0.0000249 | 0.0127450 | 2.3894760 |
ADAMTSL1 | 3.378881 | 0.4051661 | 4.315871 | 0.0000280 | 0.0127450 | 2.2858848 |
LNX2 | 2.802389 | 0.3089559 | 4.306615 | 0.0000290 | 0.0127450 | 2.2526522 |
SEC23B | 2.024936 | 0.1510704 | 4.256737 | 0.0000355 | 0.0142802 | 2.0745228 |
TCP11L1 | 2.194426 | 0.1880198 | 4.112117 | 0.0000629 | 0.0218800 | 1.5672384 |
ZNF3 | 2.241657 | 0.1841910 | 4.097439 | 0.0000666 | 0.0218800 | 1.5165269 |
RBL1 | 2.107244 | 0.1733018 | 4.092354 | 0.0000680 | 0.0218800 | 1.4989902 |
ENSG00000287292 | 2.649530 | 0.2658202 | 4.044865 | 0.0000817 | 0.0231668 | 1.3360639 |
WDTC1 | 2.184789 | 0.1787432 | 4.032124 | 0.0000858 | 0.0231668 | 1.2926107 |
MED17 | 2.135271 | 0.1824321 | 4.009965 | 0.0000935 | 0.0231668 | 1.2172980 |
RXRA | 2.060384 | 0.1753848 | 4.005170 | 0.0000952 | 0.0231668 | 1.2010447 |
CDK18 | 2.168277 | 0.1831622 | 4.003186 | 0.0000959 | 0.0231668 | 1.1943236 |
ADPGK | 2.058125 | 0.1664844 | 3.989304 | 0.0001012 | 0.0232690 | 1.1473755 |
ZNF480 | 2.236793 | 0.2006418 | 3.965854 | 0.0001107 | 0.0242922 | 1.0683681 |
PPIL4 | 2.226664 | 0.2090412 | 3.892616 | 0.0001460 | 0.0306594 | 0.8240318 |
IRF3 | 2.687759 | 0.3212877 | 3.873166 | 0.0001571 | 0.0316070 | 0.7597636 |
SARS1 | 2.368320 | 0.2254349 | 3.835478 | 0.0001808 | 0.0346093 | 0.6359795 |
TMEM51-AS1 | 1.994662 | 0.1726518 | 3.808954 | 0.0001995 | 0.0346093 | 0.5494559 |
PRAG1 | 2.223802 | 0.2200442 | 3.793651 | 0.0002111 | 0.0346093 | 0.4997596 |
RNF13 | 2.212740 | 0.2147929 | 3.778006 | 0.0002237 | 0.0346093 | 0.4491215 |
KALRN | 2.052896 | 0.1794657 | 3.775951 | 0.0002254 | 0.0346093 | 0.4424844 |
ST6GAL1 | 2.169592 | 0.1866848 | 3.770198 | 0.0002302 | 0.0346093 | 0.4239141 |
Output columns:
- logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
- AveExpr: Average expression across all cells in expr2
- t: logFC divided by its standard error
- P.Value: Raw p-value (based on t) from test that logFC differs from 0
- adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
- B: log-odds that gene is DE
Save files
write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")
A note on pseudobulk DE
Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.
Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.
Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.
The Seurat function AggregateExpression()
can be used to sum counts as described here.
A tutorial on using limma for bulk RNASeq is available here.
Prepare for the next section
Download Rmd document
download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2025-July-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")
Session Information
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.19.1 kableExtra_1.4.0 dplyr_1.1.4
## [4] topGO_2.56.0 SparseM_1.82 GO.db_3.19.1
## [7] AnnotationDbi_1.66.0 IRanges_2.38.0 S4Vectors_0.44.0
## [10] Biobase_2.64.0 graph_1.82.0 BiocGenerics_0.50.0
## [13] limma_3.60.2 Seurat_5.2.1 SeuratObject_5.0.2
## [16] sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.8
## [4] magrittr_2.0.3 spatstat.utils_3.1-2 rmarkdown_2.27
## [7] zlibbioc_1.50.0 vctrs_0.6.5 ROCR_1.0-11
## [10] memoise_2.0.1 spatstat.explore_3.2-7 htmltools_0.5.8.1
## [13] sass_0.4.9 sctransform_0.4.1 parallelly_1.37.1
## [16] KernSmooth_2.23-26 bslib_0.7.0 htmlwidgets_1.6.4
## [19] ica_1.0-3 plyr_1.8.9 plotly_4.10.4
## [22] zoo_1.8-12 cachem_1.1.0 igraph_2.0.3
## [25] mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3
## [28] Matrix_1.7-2 R6_2.5.1 fastmap_1.2.0
## [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.1-11 future_1.33.2
## [34] shiny_1.8.1.1 digest_0.6.35 colorspace_2.1-0
## [37] patchwork_1.2.0 tensor_1.5 RSpectra_0.16-1
## [40] irlba_2.3.5.1 RSQLite_2.3.7 progressr_0.14.0
## [43] fansi_1.0.6 spatstat.sparse_3.0-3 httr_1.4.7
## [46] polyclip_1.10-6 abind_1.4-5 compiler_4.4.3
## [49] bit64_4.0.5 DBI_1.2.3 fastDummies_1.7.3
## [52] highr_0.11 MASS_7.3-64 tools_4.4.3
## [55] lmtest_0.9-40 httpuv_1.6.15 future.apply_1.11.2
## [58] goftest_1.2-3 glue_1.7.0 nlme_3.1-167
## [61] promises_1.3.0 grid_4.4.3 Rtsne_0.17
## [64] cluster_2.1.8 reshape2_1.4.4 generics_0.1.3
## [67] gtable_0.3.5 spatstat.data_3.0-4 tidyr_1.3.1
## [70] data.table_1.15.4 xml2_1.3.6 XVector_0.44.0
## [73] utf8_1.2.4 spatstat.geom_3.2-9 RcppAnnoy_0.0.22
## [76] ggrepel_0.9.5 RANN_2.6.1 pillar_1.9.0
## [79] stringr_1.5.1 spam_2.10-0 RcppHNSW_0.6.0
## [82] later_1.3.2 splines_4.4.3 lattice_0.22-6
## [85] bit_4.0.5 survival_3.8-3 deldir_2.0-4
## [88] tidyselect_1.2.1 Biostrings_2.72.0 miniUI_0.1.1.1
## [91] pbapply_1.7-2 knitr_1.47 gridExtra_2.3
## [94] svglite_2.1.3 scattermore_1.2 xfun_0.44
## [97] statmod_1.5.0 matrixStats_1.3.0 UCSC.utils_1.0.0
## [100] stringi_1.8.4 lazyeval_0.2.2 yaml_2.3.8
## [103] evaluate_0.23 codetools_0.2-20 tibble_3.2.1
## [106] cli_3.6.2 uwot_0.2.2 systemfonts_1.1.0
## [109] xtable_1.8-4 reticulate_1.39.0 munsell_0.5.1
## [112] jquerylib_0.1.4 GenomeInfoDb_1.40.1 Rcpp_1.0.12
## [115] globals_0.16.3 spatstat.random_3.2-3 png_0.1-8
## [118] parallel_4.4.3 blob_1.2.4 ggplot2_3.5.1
## [121] dotCall64_1.1-1 listenv_0.9.1 viridisLite_0.4.2
## [124] scales_1.3.0 ggridges_0.5.6 crayon_1.5.2
## [127] purrr_1.0.2 rlang_1.1.3 KEGGREST_1.44.0
## [130] cowplot_1.1.3