Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression
Set up workspace
library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"
1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster
Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.
cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))
# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)
# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes
# Create topGOdata object
GOdata <- new("topGOdata",
ontology = "BP", # use biological process ontology
allGenes = geneList,
geneSelectionFun = function(x)(x == 1),
annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
GO.ID | Term | Annotated | Significant | Expected | Fisher |
---|---|---|---|---|---|
GO:0006338 | chromatin remodeling | 464 | 19 | 5.61 | 2.6e-06 |
GO:0043484 | regulation of RNA splicing | 148 | 8 | 1.79 | 0.00041 |
GO:1903613 | regulation of protein tyrosine phosphatase activity | 3 | 2 | 0.04 | 0.00043 |
GO:0006306 | DNA methylation | 33 | 4 | 0.40 | 0.00063 |
GO:0061061 | muscle structure development | 381 | 13 | 4.61 | 0.00067 |
GO:0031580 | membrane raft distribution | 4 | 2 | 0.05 | 0.00086 |
GO:0009791 | post-embryonic development | 65 | 5 | 0.79 | 0.00110 |
GO:0000512 | lncRNA-mediated post-transcriptional gene silencing | 5 | 2 | 0.06 | 0.00142 |
GO:1903978 | regulation of microglial cell activation | 5 | 2 | 0.06 | 0.00142 |
GO:0045060 | negative thymic T cell selection | 5 | 2 | 0.06 | 0.00142 |
GO:2000627 | positive regulation of miRNA catabolic process | 6 | 2 | 0.07 | 0.00211 |
GO:0072711 | cellular response to hydroxyurea | 7 | 2 | 0.08 | 0.00292 |
GO:0071320 | cellular response to cAMP | 26 | 3 | 0.31 | 0.00366 |
GO:2000045 | regulation of G1/S transition of mitotic cell cycle | 123 | 6 | 1.49 | 0.00370 |
GO:0000398 | mRNA splicing, via spliceosome | 241 | 11 | 2.91 | 0.00377 |
GO:0030889 | negative regulation of B cell proliferation | 8 | 2 | 0.10 | 0.00387 |
GO:0006376 | mRNA splice site recognition | 27 | 3 | 0.33 | 0.00408 |
GO:0010720 | positive regulation of cell development | 263 | 9 | 3.18 | 0.00452 |
GO:0060218 | hematopoietic stem cell differentiation | 28 | 3 | 0.34 | 0.00453 |
GO:0010558 | negative regulation of macromolecule biosynthetic process | 1370 | 31 | 16.56 | 0.00463 |
- Annotated: number of genes (out of all.genes) that are annotated with that GO term
- Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
- Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
- Fisher: (Raw) p-value from Fisher’s Exact Test
2. Model-based DE analysis in limma
limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.
Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.
# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]
# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
AAACGCTTCTCTGCTG_A001-C-007 | 1 | 0 | 0 | 0.2751272 | 0.8966284 | 1.1409396 | 1038 |
AACAGGGGTCCCTGAG_A001-C-007 | 1 | 0 | 0 | -0.0068710 | 0.0132416 | 0.7835455 | 719 |
AAGCATCCATCCCACT_A001-C-007 | 1 | 0 | 0 | 0.0518141 | 0.0255571 | 0.5510810 | 1391 |
AAGCGAGCACGAGAAC_A001-C-007 | 1 | 0 | 0 | -0.0403860 | -0.0059914 | 0.4604758 | 980 |
AAGCGTTCAGCCTATA_A001-C-007 | 1 | 0 | 0 | -0.0316891 | 0.0644262 | 0.7960199 | 761 |
ACGGTTAGTCTCACAA_A001-C-007 | 1 | 0 | 0 | 0.0445603 | 0.7918382 | 0.5617978 | 1199 |
tail(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
TGGGTTACAAGAATGT_B001-A-301 | 0 | 1 | 0 | 0.0913972 | -0.1247840 | 0.2367798 | 894 |
TGTTTGTTCACTACTT_B001-A-301 | 0 | 1 | 0 | -0.1148748 | -0.1289778 | 0.4276115 | 1075 |
TTCCGTGTCCGCTGTT_B001-A-301 | 0 | 1 | 0 | -0.1008870 | -0.0119225 | 0.6284916 | 1009 |
TTCTTCCAGTCCCAAT_B001-A-301 | 0 | 1 | 0 | 0.0325015 | -0.1043837 | 0.3673095 | 822 |
TTCTTCCCAGCGTAGA_B001-A-301 | 0 | 1 | 0 | -0.0226923 | 0.0085482 | 0.5474453 | 850 |
TTTACGTGTGTCTTAG_B001-A-301 | 0 | 1 | 0 | -0.0945735 | -0.1263372 | 0.5700326 | 891 |
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
CCNL2 | 0.2242247 | 0.1990672 | 0.2470341 | -0.4213528 | 0.4293688 | 0.1358057 | 0.0002361 |
CDK11B | 0.6242932 | 0.1148185 | 0.5999028 | -0.4532588 | 0.6436532 | -0.2417947 | 0.0000213 |
SLC35E2B | 0.0037741 | 0.0495442 | 0.0914102 | 0.5887303 | 0.0651147 | 0.1952115 | 0.0001180 |
CDK11A | 0.5310057 | -0.1306109 | -0.0306059 | -1.0572770 | 0.9785408 | -0.0026322 | 0.0003089 |
NADK | 0.9047053 | 0.5465035 | 0.5874302 | -0.2504437 | -0.0059736 | -0.2493937 | -0.0001804 |
GNB1 | 0.5492680 | 0.7200203 | 0.6733315 | -0.5749704 | 0.9643106 | 0.2342349 | 0.0001687 |
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupNormal - proper.groupColorectal.Cancer | |
---|---|
proper.groupColorectal.Cancer | -1 |
proper.groupNormal | 1 |
proper.groupPolyp | 0 |
S.Score | 0 |
G2M.Score | 0 |
percent_MT | 0 |
nFeature_RNA | 0 |
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|
XIST | 2.1155304 | 0.8383589 | 13.661092 | 0e+00 | 0.00e+00 | 54.324855 |
SLC26A2 | 2.2940612 | 1.2102978 | 11.450240 | 0e+00 | 0.00e+00 | 40.637829 |
PIGR | 2.1206204 | 1.3048963 | 9.653033 | 0e+00 | 0.00e+00 | 29.658194 |
SLC26A3 | 1.6660558 | 0.7474037 | 9.247113 | 0e+00 | 0.00e+00 | 27.235093 |
CCND3 | 2.1487014 | 1.1948400 | 8.593783 | 0e+00 | 0.00e+00 | 23.403324 |
GUCA2A | 1.3966234 | 0.5377794 | 8.235661 | 0e+00 | 0.00e+00 | 21.345829 |
PHGR1 | 1.5875349 | 0.7790538 | 8.094118 | 0e+00 | 0.00e+00 | 20.542213 |
MMP12 | -1.5478178 | 0.4498059 | -8.042518 | 0e+00 | 0.00e+00 | 20.250687 |
CKB | 1.8461063 | 1.2010000 | 7.809511 | 0e+00 | 0.00e+00 | 18.944244 |
TYMP | -1.5086803 | 0.5986687 | -7.449051 | 0e+00 | 0.00e+00 | 16.957641 |
PDE3A | 1.2645878 | 0.5870357 | 7.227138 | 0e+00 | 0.00e+00 | 15.757169 |
MUC12 | 1.2465258 | 0.7228761 | 6.611899 | 0e+00 | 2.00e-07 | 12.529424 |
RNF213 | -1.5582052 | 1.5052670 | -6.586467 | 0e+00 | 2.00e-07 | 12.399425 |
MBNL1 | 1.6531241 | 2.4693744 | 6.362969 | 0e+00 | 5.00e-07 | 11.269562 |
NXPE1 | 1.3190223 | 0.7592336 | 6.322599 | 0e+00 | 6.00e-07 | 11.067939 |
FKBP5 | 1.7002914 | 1.3458515 | 6.307107 | 0e+00 | 6.00e-07 | 10.990775 |
ZG16 | 0.9881296 | 0.4012025 | 6.191486 | 0e+00 | 1.00e-06 | 10.418473 |
PARP14 | -1.4952975 | 1.0742512 | -6.145782 | 0e+00 | 1.20e-06 | 10.194040 |
MT-CO2 | -1.1821944 | 2.3699957 | -6.134752 | 0e+00 | 1.20e-06 | 10.140028 |
UTY | -1.1717486 | 0.6887650 | -6.006045 | 0e+00 | 2.20e-06 | 9.514278 |
MAML2 | 1.5838731 | 1.5289834 | 5.912160 | 0e+00 | 3.30e-06 | 9.063133 |
SON | -1.4616397 | 1.2857812 | -5.900829 | 0e+00 | 3.30e-06 | 9.008987 |
CLCA4 | 1.0303576 | 0.4594471 | 5.763777 | 0e+00 | 6.30e-06 | 8.359471 |
SAMD9L | -1.0383851 | 0.5141782 | -5.583241 | 1e-07 | 1.46e-05 | 7.519295 |
TNFAIP2 | -1.1456072 | 0.5712894 | -5.560157 | 1e-07 | 1.56e-05 | 7.413159 |
CCDC88A | -1.2538645 | 0.8548085 | -5.553025 | 1e-07 | 1.56e-05 | 7.380426 |
PLA2G7 | -0.9520283 | 0.3656429 | -5.536909 | 1e-07 | 1.62e-05 | 7.306572 |
ARHGAP15 | 1.4385488 | 2.0487528 | 5.406200 | 2e-07 | 2.92e-05 | 6.712990 |
LINC00996 | -1.0889424 | 0.5261653 | -5.356665 | 3e-07 | 3.56e-05 | 6.490603 |
ATP1A1 | 1.1449796 | 0.7971706 | 5.325406 | 3e-07 | 3.99e-05 | 6.350999 |
Output columns:
- logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
- AveExpr: Average expression across all cells in expr2
- t: logFC divided by its standard error
- P.Value: Raw p-value (based on t) from test that logFC differs from 0
- adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
- B: log-odds that gene is DE
Save files
write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")
A note on pseudobulk DE
Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.
Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.
Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.
The Seurat function AggregateExpression()
can be used to sum counts as described here.
A tutorial on using limma for bulk RNASeq is available here.
Prepare for the next section
Download Rmd document
download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")
Session Information
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.5.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.19.1 kableExtra_1.4.0 dplyr_1.1.4
## [4] topGO_2.56.0 SparseM_1.82 GO.db_3.19.1
## [7] AnnotationDbi_1.66.0 IRanges_2.38.0 S4Vectors_0.42.0
## [10] Biobase_2.64.0 graph_1.82.0 BiocGenerics_0.50.0
## [13] limma_3.60.2 Seurat_5.1.0 SeuratObject_5.0.2
## [16] sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.8
## [4] magrittr_2.0.3 spatstat.utils_3.0-4 rmarkdown_2.27
## [7] zlibbioc_1.50.0 vctrs_0.6.5 ROCR_1.0-11
## [10] memoise_2.0.1 spatstat.explore_3.2-7 htmltools_0.5.8.1
## [13] sass_0.4.9 sctransform_0.4.1 parallelly_1.37.1
## [16] KernSmooth_2.23-22 bslib_0.7.0 htmlwidgets_1.6.4
## [19] ica_1.0-3 plyr_1.8.9 plotly_4.10.4
## [22] zoo_1.8-12 cachem_1.1.0 igraph_2.0.3
## [25] mime_0.12 lifecycle_1.0.4 pkgconfig_2.0.3
## [28] Matrix_1.7-0 R6_2.5.1 fastmap_1.2.0
## [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.1-11 future_1.33.2
## [34] shiny_1.8.1.1 digest_0.6.35 colorspace_2.1-0
## [37] patchwork_1.2.0 tensor_1.5 RSpectra_0.16-1
## [40] irlba_2.3.5.1 RSQLite_2.3.7 progressr_0.14.0
## [43] fansi_1.0.6 spatstat.sparse_3.0-3 httr_1.4.7
## [46] polyclip_1.10-6 abind_1.4-5 compiler_4.4.0
## [49] bit64_4.0.5 DBI_1.2.3 fastDummies_1.7.3
## [52] highr_0.11 MASS_7.3-60.2 tools_4.4.0
## [55] lmtest_0.9-40 httpuv_1.6.15 future.apply_1.11.2
## [58] goftest_1.2-3 glue_1.7.0 nlme_3.1-164
## [61] promises_1.3.0 grid_4.4.0 Rtsne_0.17
## [64] cluster_2.1.6 reshape2_1.4.4 generics_0.1.3
## [67] gtable_0.3.5 spatstat.data_3.0-4 tidyr_1.3.1
## [70] data.table_1.15.4 xml2_1.3.6 XVector_0.44.0
## [73] utf8_1.2.4 spatstat.geom_3.2-9 RcppAnnoy_0.0.22
## [76] ggrepel_0.9.5 RANN_2.6.1 pillar_1.9.0
## [79] stringr_1.5.1 spam_2.10-0 RcppHNSW_0.6.0
## [82] later_1.3.2 splines_4.4.0 lattice_0.22-6
## [85] bit_4.0.5 survival_3.5-8 deldir_2.0-4
## [88] tidyselect_1.2.1 Biostrings_2.72.0 miniUI_0.1.1.1
## [91] pbapply_1.7-2 knitr_1.47 gridExtra_2.3
## [94] svglite_2.1.3 scattermore_1.2 xfun_0.44
## [97] statmod_1.5.0 matrixStats_1.3.0 UCSC.utils_1.0.0
## [100] stringi_1.8.4 lazyeval_0.2.2 yaml_2.3.8
## [103] evaluate_0.23 codetools_0.2-20 tibble_3.2.1
## [106] cli_3.6.2 uwot_0.2.2 systemfonts_1.1.0
## [109] xtable_1.8-4 reticulate_1.37.0 munsell_0.5.1
## [112] jquerylib_0.1.4 GenomeInfoDb_1.40.1 Rcpp_1.0.12
## [115] globals_0.16.3 spatstat.random_3.2-3 png_0.1-8
## [118] parallel_4.4.0 blob_1.2.4 ggplot2_3.5.1
## [121] dotCall64_1.1-1 listenv_0.9.1 viridisLite_0.4.2
## [124] scales_1.3.0 ggridges_0.5.6 crayon_1.5.2
## [127] leiden_0.4.3.1 purrr_1.0.2 rlang_1.1.3
## [130] KEGGREST_1.44.0 cowplot_1.1.3