☰ Menu

      Single Cell RNA-Seq Analysis

Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
GO.ID Term Annotated Significant Expected Fisher
GO:0010613 positive regulation of cardiac muscle hypertrophy 19 4 0.34 0.00031
GO:0006325 chromatin organization 465 24 8.31 0.00047
GO:0030036 actin cytoskeleton organization 462 19 8.26 0.00057
GO:0034968 histone lysine methylation 81 7 1.45 0.00059
GO:2001014 regulation of skeletal muscle cell differentiation 11 3 0.20 0.00083
GO:0000381 regulation of alternative mRNA splicing, via spliceosome 42 5 0.75 0.00086
GO:0006338 chromatin remodeling 270 13 4.82 0.00108
GO:0048813 dendrite morphogenesis 95 7 1.70 0.00152
GO:0045445 myoblast differentiation 70 6 1.25 0.00152
GO:1901888 regulation of cell junction assembly 124 8 2.22 0.00167
GO:1901897 regulation of relaxation of cardiac muscle 4 2 0.07 0.00186
GO:0031666 positive regulation of lipopolysaccharide-mediated signaling... 4 2 0.07 0.00186
GO:0071320 cellular response to cAMP 30 4 0.54 0.00188
GO:0061049 cell growth involved in cardiac muscle cell development 15 3 0.27 0.00218
GO:0010595 positive regulation of endothelial cell migration 76 6 1.36 0.00232
GO:0002063 chondrocyte development 16 3 0.29 0.00264
GO:0045906 negative regulation of vasoconstriction 5 2 0.09 0.00306
GO:0098909 regulation of cardiac muscle cell action potential involved ... 5 2 0.09 0.00306
GO:0048842 positive regulation of axon extension involved in axon guida... 5 2 0.09 0.00306
GO:0021785 branchiomotor neuron axon guidance 5 2 0.09 0.00306

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer proper.groupNormal proper.groupPolyp S.Score G2M.Score percent_MT nFeature_RNA
ACGGTTAGTCTCACAA_A001-C-007 1 0 0 0.0347548 0.7909218 0.5656109 1187
ACTGTCCAGGCGTTAG_A001-C-007 1 0 0 -0.0513069 -0.0204757 0.5043712 1634
AGCGTATAGCCTCAGC_A001-C-007 1 0 0 -0.0576611 -0.0549822 3.0966767 880
AGCGTATAGGTATCTC_A001-C-007 1 0 0 0.1080488 -0.0416714 0.4595588 783
AGGACTTGTACCTATG_A001-C-007 1 0 0 0.0181025 -0.0642517 0.4804393 986
AGGGTTTGTTCTTCAT_A001-C-007 1 0 0 0.0057069 -0.0475972 1.3483146 937
tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer proper.groupNormal proper.groupPolyp S.Score G2M.Score percent_MT nFeature_RNA
TCCACCAAGTAACGTA_B001-A-301 0 1 0 -0.0872095 -0.1505728 0.1785183 2096
TCCACCAGTGTCCACG_B001-A-301 0 1 0 -0.0697658 -0.1145196 0.3147954 1147
TCGCTTGAGACTTGTC_B001-A-301 0 1 0 -0.0989612 -0.1146185 0.1570475 1471
TCTCCGACACAGTACT_B001-A-301 0 1 0 0.0437678 -0.1455716 0.1921230 2383
TGATGGTCATAACTCG_B001-A-301 0 1 0 -0.1108661 -0.1465694 0.5749668 1413
TTCTTCCCAGCGTAGA_B001-A-301 0 1 0 -0.0361969 0.0067655 0.5504587 843
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer proper.groupNormal proper.groupPolyp S.Score G2M.Score percent_MT nFeature_RNA
CCNL2 0.0579580 0.4644867 0.5707252 -0.5229024 0.7195925 0.2619560 0.0001080
CDK11B 0.5915809 0.1722522 0.5923884 -0.6623504 0.7849839 -0.3260968 0.0000319
CDK11A 0.9175843 0.2172913 0.4343150 -1.0275000 0.9939193 -0.0600846 0.0000147
NADK 0.3736396 0.2646946 0.0662475 -0.2493852 0.1639109 0.0637775 -0.0000001
GNB1 0.7772654 0.5788333 0.8958150 -0.5519191 0.9004079 -0.0616478 0.0002935
SKI 0.7171411 0.1954765 0.4870630 -0.7409301 -0.1685434 -0.2098351 0.0000133
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupNormal - proper.groupColorectal.Cancer
proper.groupColorectal.Cancer -1
proper.groupNormal 1
proper.groupPolyp 0
S.Score 0
G2M.Score 0
percent_MT 0
nFeature_RNA 0
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
logFC AveExpr t P.Value adj.P.Val B
XIST 1.9827318 0.6682736 11.870704 0.0e+00 0.0000000 37.251244
GUCA2A 1.6727802 0.4946151 10.607512 0.0e+00 0.0000000 31.021088
MMP12 -2.0976536 0.6819202 -8.045790 0.0e+00 0.0000000 18.347651
PHGR1 1.7128625 0.6230695 7.990444 0.0e+00 0.0000000 18.079041
CKB 2.0697765 1.1605407 7.813288 0.0e+00 0.0000000 17.222231
SLC26A2 1.8826649 0.9641306 6.787927 0.0e+00 0.0000004 12.377261
SLC26A3 1.5570563 0.6022747 6.661061 0.0e+00 0.0000006 11.794657
PIGR 1.7503636 1.1102687 6.460438 0.0e+00 0.0000015 10.882534
HIVEP3 -1.4376532 0.7340200 -5.994125 0.0e+00 0.0000121 8.810834
NR3C2 1.6376297 1.0291416 5.792722 1.0e-07 0.0000276 7.939339
UTY -1.3631601 0.7485786 -5.484691 3.0e-07 0.0000928 6.636837
ZG16 0.9778633 0.3534387 5.484091 3.0e-07 0.0000928 6.634339
MUC13 1.2139028 0.5911549 5.163231 1.1e-06 0.0003496 5.320392
PDE4D 1.6637089 1.5119900 5.120095 1.3e-06 0.0003901 5.147303
TNFAIP2 -1.4572991 1.0361014 -5.104445 1.4e-06 0.0003901 5.084722
HSP90AA1 -1.3268169 0.7386294 -5.041036 1.8e-06 0.0004794 4.832355
MACF1 -1.2728166 2.5152218 -5.024663 1.9e-06 0.0004838 4.767506
SATB2 1.2335513 0.6184811 4.931185 2.9e-06 0.0006781 4.399765
RNF213 -1.4619869 1.4301922 -4.899682 3.3e-06 0.0007080 4.276803
PARP14 -1.5705769 1.4065849 -4.895788 3.3e-06 0.0007080 4.261639
IL7R -1.0858376 0.3242462 -4.865053 3.8e-06 0.0007591 4.142219
FOXP1 1.5111842 1.6488493 4.856249 3.9e-06 0.0007591 4.108095
AMN 0.8531736 0.2741440 4.731821 6.6e-06 0.0011476 3.630123
EPCAM 1.1109426 0.5789830 4.715942 7.0e-06 0.0011476 3.569703
PLA2G7 -1.1812964 0.5448965 -4.705319 7.3e-06 0.0011476 3.529356
ZBTB20 1.5396424 1.7559376 4.700057 7.5e-06 0.0011476 3.509395
RAPGEF1 -1.5180128 1.5356339 -4.698717 7.5e-06 0.0011476 3.504313
SAT1 -1.8924435 2.0416164 -4.696409 7.6e-06 0.0011476 3.495562
TYMP -1.3860201 0.9203430 -4.688993 7.8e-06 0.0011476 3.467469
SCNN1B 0.9079763 0.3004799 4.668610 8.5e-06 0.0011916 3.390400

Output columns:

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

A tutorial on using limma for bulk RNASeq is available here.

Prepare for the next section

Download Rmd document

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")

Session Information

sessionInfo()
R version 4.3.2 (2023-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045) Matrix products: default locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8 [3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.utf8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] org.Hs.eg.db_3.17.0 kableExtra_1.3.4 dplyr_1.1.4 [4] topGO_2.52.0 SparseM_1.81 GO.db_3.17.0 [7] AnnotationDbi_1.62.2 IRanges_2.34.1 S4Vectors_0.38.2 [10] Biobase_2.60.0 graph_1.78.0 BiocGenerics_0.46.0 [13] limma_3.56.2 Seurat_5.0.1 SeuratObject_5.0.1 [16] sp_2.1-2 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.8 [4] magrittr_2.0.3 spatstat.utils_3.0-4 rmarkdown_2.25 [7] zlibbioc_1.46.0 vctrs_0.6.5 ROCR_1.0-11 [10] memoise_2.0.1 spatstat.explore_3.2-5 RCurl_1.98-1.13 [13] webshot_0.5.5 htmltools_0.5.7 sass_0.4.8 [16] sctransform_0.4.1 parallelly_1.36.0 KernSmooth_2.23-22 [19] bslib_0.6.1 htmlwidgets_1.6.4 ica_1.0-3 [22] plyr_1.8.9 plotly_4.10.3 zoo_1.8-12 [25] cachem_1.0.8 igraph_1.5.1 mime_0.12 [28] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.6-4 [31] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.10 [34] fitdistrplus_1.1-11 future_1.33.0 shiny_1.8.0 [37] digest_0.6.33 colorspace_2.1-0 patchwork_1.1.3 [40] tensor_1.5 RSpectra_0.16-1 irlba_2.3.5.1 [43] RSQLite_2.3.4 progressr_0.14.0 fansi_1.0.6 [46] spatstat.sparse_3.0-3 httr_1.4.7 polyclip_1.10-6 [49] abind_1.4-5 compiler_4.3.2 bit64_4.0.5 [52] DBI_1.1.3 fastDummies_1.7.3 highr_0.10 [55] MASS_7.3-60 tools_4.3.2 lmtest_0.9-40 [58] httpuv_1.6.13 future.apply_1.11.0 goftest_1.2-3 [61] glue_1.6.2 nlme_3.1-164 promises_1.2.1 [64] grid_4.3.2 Rtsne_0.17 cluster_2.1.6 [67] reshape2_1.4.4 generics_0.1.3 gtable_0.3.4 [70] spatstat.data_3.0-3 tidyr_1.3.0 data.table_1.14.10 [73] xml2_1.3.6 XVector_0.40.0 utf8_1.2.4 [76] spatstat.geom_3.2-7 RcppAnnoy_0.0.21 ggrepel_0.9.4 [79] RANN_2.6.1 pillar_1.9.0 stringr_1.5.1 [82] spam_2.10-0 RcppHNSW_0.5.0 later_1.3.2 [85] splines_4.3.2 lattice_0.22-5 bit_4.0.5 [88] survival_3.5-7 deldir_2.0-2 tidyselect_1.2.0 [91] Biostrings_2.68.1 miniUI_0.1.1.1 pbapply_1.7-2 [94] knitr_1.45 gridExtra_2.3 svglite_2.1.3 [97] scattermore_1.2 xfun_0.41 matrixStats_1.1.0 [100] stringi_1.8.2 lazyeval_0.2.2 yaml_2.3.7 [103] evaluate_0.23 codetools_0.2-19 tibble_3.2.1 [106] cli_3.6.1 uwot_0.1.16 systemfonts_1.0.5 [109] xtable_1.8-4 reticulate_1.34.0 munsell_0.5.0 [112] jquerylib_0.1.4 GenomeInfoDb_1.36.4 Rcpp_1.0.11 [115] globals_0.16.2 spatstat.random_3.2-2 png_0.1-8 [118] parallel_4.3.2 ellipsis_0.3.2 blob_1.2.4 [121] ggplot2_3.4.4 dotCall64_1.1-1 bitops_1.0-7 [124] listenv_0.9.0 viridisLite_0.4.2 scales_1.3.0 [127] ggridges_0.5.4 crayon_1.5.2 leiden_0.4.3.1 [130] purrr_1.0.2 rlang_1.1.2 rvest_1.0.3 [133] KEGGREST_1.40.1 cowplot_1.1.1