☰ Menu

      Single Cell RNA-Seq Analysis

Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
GO.ID Term Annotated Significant Expected Fisher
GO:0006338 chromatin remodeling 464 19 5.61 2.6e-06
GO:0043484 regulation of RNA splicing 148 8 1.79 0.00041
GO:1903613 regulation of protein tyrosine phosphatase activity 3 2 0.04 0.00043
GO:0006306 DNA methylation 33 4 0.40 0.00063
GO:0061061 muscle structure development 381 13 4.61 0.00067
GO:0031580 membrane raft distribution 4 2 0.05 0.00086
GO:0009791 post-embryonic development 65 5 0.79 0.00110
GO:0000512 lncRNA-mediated post-transcriptional gene silencing 5 2 0.06 0.00142
GO:1903978 regulation of microglial cell activation 5 2 0.06 0.00142
GO:0045060 negative thymic T cell selection 5 2 0.06 0.00142
GO:2000627 positive regulation of miRNA catabolic process 6 2 0.07 0.00211
GO:0072711 cellular response to hydroxyurea 7 2 0.08 0.00292
GO:0071320 cellular response to cAMP 26 3 0.31 0.00366
GO:2000045 regulation of G1/S transition of mitotic cell cycle 123 6 1.49 0.00370
GO:0000398 mRNA splicing, via spliceosome 241 11 2.91 0.00377
GO:0030889 negative regulation of B cell proliferation 8 2 0.10 0.00387
GO:0006376 mRNA splice site recognition 27 3 0.33 0.00408
GO:0010720 positive regulation of cell development 263 9 3.18 0.00452
GO:0060218 hematopoietic stem cell differentiation 28 3 0.34 0.00453
GO:0010558 negative regulation of macromolecule biosynthetic process 1370 31 16.56 0.00463

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer proper.groupNormal proper.groupPolyp S.Score G2M.Score percent_MT nFeature_RNA
AAACGCTTCTCTGCTG_A001-C-007 1 0 0 0.2751272 0.8966284 1.1409396 1038
AACAGGGGTCCCTGAG_A001-C-007 1 0 0 -0.0068710 0.0132416 0.7835455 719
AAGCATCCATCCCACT_A001-C-007 1 0 0 0.0518141 0.0255571 0.5510810 1391
AAGCGAGCACGAGAAC_A001-C-007 1 0 0 -0.0403860 -0.0059914 0.4604758 980
AAGCGTTCAGCCTATA_A001-C-007 1 0 0 -0.0316891 0.0644262 0.7960199 761
ACGGTTAGTCTCACAA_A001-C-007 1 0 0 0.0445603 0.7918382 0.5617978 1199
tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer proper.groupNormal proper.groupPolyp S.Score G2M.Score percent_MT nFeature_RNA
TGGGTTACAAGAATGT_B001-A-301 0 1 0 0.0913972 -0.1247840 0.2367798 894
TGTTTGTTCACTACTT_B001-A-301 0 1 0 -0.1148748 -0.1289778 0.4276115 1075
TTCCGTGTCCGCTGTT_B001-A-301 0 1 0 -0.1008870 -0.0119225 0.6284916 1009
TTCTTCCAGTCCCAAT_B001-A-301 0 1 0 0.0325015 -0.1043837 0.3673095 822
TTCTTCCCAGCGTAGA_B001-A-301 0 1 0 -0.0226923 0.0085482 0.5474453 850
TTTACGTGTGTCTTAG_B001-A-301 0 1 0 -0.0945735 -0.1263372 0.5700326 891
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer proper.groupNormal proper.groupPolyp S.Score G2M.Score percent_MT nFeature_RNA
CCNL2 0.2242247 0.1990672 0.2470341 -0.4213528 0.4293688 0.1358057 0.0002361
CDK11B 0.6242932 0.1148185 0.5999028 -0.4532588 0.6436532 -0.2417947 0.0000213
SLC35E2B 0.0037741 0.0495442 0.0914102 0.5887303 0.0651147 0.1952115 0.0001180
CDK11A 0.5310057 -0.1306109 -0.0306059 -1.0572770 0.9785408 -0.0026322 0.0003089
NADK 0.9047053 0.5465035 0.5874302 -0.2504437 -0.0059736 -0.2493937 -0.0001804
GNB1 0.5492680 0.7200203 0.6733315 -0.5749704 0.9643106 0.2342349 0.0001687
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
proper.groupNormal - proper.groupColorectal.Cancer
proper.groupColorectal.Cancer -1
proper.groupNormal 1
proper.groupPolyp 0
S.Score 0
G2M.Score 0
percent_MT 0
nFeature_RNA 0
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)
logFC AveExpr t P.Value adj.P.Val B
XIST 2.1155304 0.8383589 13.661092 0e+00 0.00e+00 54.324855
SLC26A2 2.2940612 1.2102978 11.450240 0e+00 0.00e+00 40.637829
PIGR 2.1206204 1.3048963 9.653033 0e+00 0.00e+00 29.658194
SLC26A3 1.6660558 0.7474037 9.247113 0e+00 0.00e+00 27.235093
CCND3 2.1487014 1.1948400 8.593783 0e+00 0.00e+00 23.403324
GUCA2A 1.3966234 0.5377794 8.235661 0e+00 0.00e+00 21.345829
PHGR1 1.5875349 0.7790538 8.094118 0e+00 0.00e+00 20.542213
MMP12 -1.5478178 0.4498059 -8.042518 0e+00 0.00e+00 20.250687
CKB 1.8461063 1.2010000 7.809511 0e+00 0.00e+00 18.944244
TYMP -1.5086803 0.5986687 -7.449051 0e+00 0.00e+00 16.957641
PDE3A 1.2645878 0.5870357 7.227138 0e+00 0.00e+00 15.757169
MUC12 1.2465258 0.7228761 6.611899 0e+00 2.00e-07 12.529424
RNF213 -1.5582052 1.5052670 -6.586467 0e+00 2.00e-07 12.399425
MBNL1 1.6531241 2.4693744 6.362969 0e+00 5.00e-07 11.269562
NXPE1 1.3190223 0.7592336 6.322599 0e+00 6.00e-07 11.067939
FKBP5 1.7002914 1.3458515 6.307107 0e+00 6.00e-07 10.990775
ZG16 0.9881296 0.4012025 6.191486 0e+00 1.00e-06 10.418473
PARP14 -1.4952975 1.0742512 -6.145782 0e+00 1.20e-06 10.194040
MT-CO2 -1.1821944 2.3699957 -6.134752 0e+00 1.20e-06 10.140028
UTY -1.1717486 0.6887650 -6.006045 0e+00 2.20e-06 9.514278
MAML2 1.5838731 1.5289834 5.912160 0e+00 3.30e-06 9.063133
SON -1.4616397 1.2857812 -5.900829 0e+00 3.30e-06 9.008987
CLCA4 1.0303576 0.4594471 5.763777 0e+00 6.30e-06 8.359471
SAMD9L -1.0383851 0.5141782 -5.583241 1e-07 1.46e-05 7.519295
TNFAIP2 -1.1456072 0.5712894 -5.560157 1e-07 1.56e-05 7.413159
CCDC88A -1.2538645 0.8548085 -5.553025 1e-07 1.56e-05 7.380426
PLA2G7 -0.9520283 0.3656429 -5.536909 1e-07 1.62e-05 7.306572
ARHGAP15 1.4385488 2.0487528 5.406200 2e-07 2.92e-05 6.712990
LINC00996 -1.0889424 0.5261653 -5.356665 3e-07 3.56e-05 6.490603
ATP1A1 1.1449796 0.7971706 5.325406 3e-07 3.99e-05 6.350999

Output columns:

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

A tutorial on using limma for bulk RNASeq is available here.

Prepare for the next section

Download Rmd document

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")

Session Information

sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.19.1  kableExtra_1.4.0     dplyr_1.1.4         
##  [4] topGO_2.56.0         SparseM_1.82         GO.db_3.19.1        
##  [7] AnnotationDbi_1.66.0 IRanges_2.38.0       S4Vectors_0.42.0    
## [10] Biobase_2.64.0       graph_1.82.0         BiocGenerics_0.50.0 
## [13] limma_3.60.2         Seurat_5.1.0         SeuratObject_5.0.2  
## [16] sp_2.1-4            
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] magrittr_2.0.3          spatstat.utils_3.0-4    rmarkdown_2.27         
##   [7] zlibbioc_1.50.0         vctrs_0.6.5             ROCR_1.0-11            
##  [10] memoise_2.0.1           spatstat.explore_3.2-7  htmltools_0.5.8.1      
##  [13] sass_0.4.9              sctransform_0.4.1       parallelly_1.37.1      
##  [16] KernSmooth_2.23-22      bslib_0.7.0             htmlwidgets_1.6.4      
##  [19] ica_1.0-3               plyr_1.8.9              plotly_4.10.4          
##  [22] zoo_1.8-12              cachem_1.1.0            igraph_2.0.3           
##  [25] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [28] Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
##  [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.1-11     future_1.33.2          
##  [34] shiny_1.8.1.1           digest_0.6.35           colorspace_2.1-0       
##  [37] patchwork_1.2.0         tensor_1.5              RSpectra_0.16-1        
##  [40] irlba_2.3.5.1           RSQLite_2.3.7           progressr_0.14.0       
##  [43] fansi_1.0.6             spatstat.sparse_3.0-3   httr_1.4.7             
##  [46] polyclip_1.10-6         abind_1.4-5             compiler_4.4.0         
##  [49] bit64_4.0.5             DBI_1.2.3               fastDummies_1.7.3      
##  [52] highr_0.11              MASS_7.3-60.2           tools_4.4.0            
##  [55] lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.2    
##  [58] goftest_1.2-3           glue_1.7.0              nlme_3.1-164           
##  [61] promises_1.3.0          grid_4.4.0              Rtsne_0.17             
##  [64] cluster_2.1.6           reshape2_1.4.4          generics_0.1.3         
##  [67] gtable_0.3.5            spatstat.data_3.0-4     tidyr_1.3.1            
##  [70] data.table_1.15.4       xml2_1.3.6              XVector_0.44.0         
##  [73] utf8_1.2.4              spatstat.geom_3.2-9     RcppAnnoy_0.0.22       
##  [76] ggrepel_0.9.5           RANN_2.6.1              pillar_1.9.0           
##  [79] stringr_1.5.1           spam_2.10-0             RcppHNSW_0.6.0         
##  [82] later_1.3.2             splines_4.4.0           lattice_0.22-6         
##  [85] bit_4.0.5               survival_3.5-8          deldir_2.0-4           
##  [88] tidyselect_1.2.1        Biostrings_2.72.0       miniUI_0.1.1.1         
##  [91] pbapply_1.7-2           knitr_1.47              gridExtra_2.3          
##  [94] svglite_2.1.3           scattermore_1.2         xfun_0.44              
##  [97] statmod_1.5.0           matrixStats_1.3.0       UCSC.utils_1.0.0       
## [100] stringi_1.8.4           lazyeval_0.2.2          yaml_2.3.8             
## [103] evaluate_0.23           codetools_0.2-20        tibble_3.2.1           
## [106] cli_3.6.2               uwot_0.2.2              systemfonts_1.1.0      
## [109] xtable_1.8-4            reticulate_1.37.0       munsell_0.5.1          
## [112] jquerylib_0.1.4         GenomeInfoDb_1.40.1     Rcpp_1.0.12            
## [115] globals_0.16.3          spatstat.random_3.2-3   png_0.1-8              
## [118] parallel_4.4.0          blob_1.2.4              ggplot2_3.5.1          
## [121] dotCall64_1.1-1         listenv_0.9.1           viridisLite_0.4.2      
## [124] scales_1.3.0            ggridges_0.5.6          crayon_1.5.2           
## [127] leiden_0.4.3.1          purrr_1.0.2             rlang_1.1.3            
## [130] KEGGREST_1.44.0         cowplot_1.1.3