Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression

Set up workspace

library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"

1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster

Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.

cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))

# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
	GOdata <- new("topGOdata",
		ontology = "BP", # use biological process ontology
		allGenes = geneList,
		geneSelectionFun = function(x)(x == 1),
              annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
	resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
	GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

GO.ID	Term	Annotated	Significant	Expected	Fisher
GO:0006338	chromatin remodeling	464	19	5.61	2.6e-06
GO:0043484	regulation of RNA splicing	148	8	1.79	0.00041
GO:1903613	regulation of protein tyrosine phosphatase activity	3	2	0.04	0.00043
GO:0006306	DNA methylation	33	4	0.40	0.00063
GO:0061061	muscle structure development	381	13	4.61	0.00067
GO:0031580	membrane raft distribution	4	2	0.05	0.00086
GO:0009791	post-embryonic development	65	5	0.79	0.00110
GO:0000512	lncRNA-mediated post-transcriptional gene silencing	5	2	0.06	0.00142
GO:1903978	regulation of microglial cell activation	5	2	0.06	0.00142
GO:0045060	negative thymic T cell selection	5	2	0.06	0.00142
GO:2000627	positive regulation of miRNA catabolic process	6	2	0.07	0.00211
GO:0072711	cellular response to hydroxyurea	7	2	0.08	0.00292
GO:0071320	cellular response to cAMP	26	3	0.31	0.00366
GO:2000045	regulation of G1/S transition of mitotic cell cycle	123	6	1.49	0.00370
GO:0000398	mRNA splicing, via spliceosome	241	11	2.91	0.00377
GO:0030889	negative regulation of B cell proliferation	8	2	0.10	0.00387
GO:0006376	mRNA splice site recognition	27	3	0.33	0.00408
GO:0010720	positive regulation of cell development	263	9	3.18	0.00452
GO:0060218	hematopoietic stem cell differentiation	28	3	0.34	0.00453
GO:0010558	negative regulation of macromolecule biosynthetic process	1370	31	16.56	0.00463

Annotated: number of genes (out of all.genes) that are annotated with that GO term
Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
Fisher: (Raw) p-value from Fisher’s Exact Test

2. Model-based DE analysis in limma

limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.

Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.

# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]

# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupColorectal.Cancer	S.Score	G2M.Score	percent_MT	nFeature_RNA
AAACGCTTCTCTGCTG_A001-C-007	1	0.2751272	0.8966284	1.1409396	1038
AACAGGGGTCCCTGAG_A001-C-007	1	-0.0068710	0.0132416	0.7835455	719
AAGCATCCATCCCACT_A001-C-007	1	0.0518141	0.0255571	0.5510810	1391
AAGCGAGCACGAGAAC_A001-C-007	1	-0.0403860	-0.0059914	0.4604758	980
AAGCGTTCAGCCTATA_A001-C-007	1	-0.0316891	0.0644262	0.7960199	761
ACGGTTAGTCTCACAA_A001-C-007	1	0.0445603	0.7918382	0.5617978	1199

tail(mm) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupNormal	S.Score	G2M.Score	percent_MT	nFeature_RNA
TGGGTTACAAGAATGT_B001-A-301	1	0.0913972	-0.1247840	0.2367798	894
TGTTTGTTCACTACTT_B001-A-301	1	-0.1148748	-0.1289778	0.4276115	1075
TTCCGTGTCCGCTGTT_B001-A-301	1	-0.1008870	-0.0119225	0.6284916	1009
TTCTTCCAGTCCCAAT_B001-A-301	1	0.0325015	-0.1043837	0.3673095	822
TTCTTCCCAGCGTAGA_B001-A-301	1	-0.0226923	0.0085482	0.5474453	850
TTTACGTGTGTCTTAG_B001-A-301	1	-0.0945735	-0.1263372	0.5700326	891

# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupColorectal.Cancer	proper.groupNormal	proper.groupPolyp	S.Score	G2M.Score	percent_MT	nFeature_RNA
CCNL2	0.2242247	0.1990672	0.2470341	-0.4213528	0.4293688	0.1358057	0.0002361
CDK11B	0.6242932	0.1148185	0.5999028	-0.4532588	0.6436532	-0.2417947	0.0000213
SLC35E2B	0.0037741	0.0495442	0.0914102	0.5887303	0.0651147	0.1952115	0.0001180
CDK11A	0.5310057	-0.1306109	-0.0306059	-1.0572770	0.9785408	-0.0026322	0.0003089
NADK	0.9047053	0.5465035	0.5874302	-0.2504437	-0.0059736	-0.2493937	-0.0001804
GNB1	0.5492680	0.7200203	0.6733315	-0.5749704	0.9643106	0.2342349	0.0001687

# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	proper.groupNormal - proper.groupColorectal.Cancer
proper.groupColorectal.Cancer	-1
proper.groupNormal	1
proper.groupPolyp	0
S.Score	0
G2M.Score	0
percent_MT	0
nFeature_RNA	0

fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
	  kable() %>%
	  kable_styling("striped", fixed_thead = TRUE)

	logFC	AveExpr	t	P.Value	adj.P.Val	B
XIST	2.1155304	0.8383589	13.661092	0e+00	0.00e+00	54.324855
SLC26A2	2.2940612	1.2102978	11.450240	0e+00	0.00e+00	40.637829
PIGR	2.1206204	1.3048963	9.653033	0e+00	0.00e+00	29.658194
SLC26A3	1.6660558	0.7474037	9.247113	0e+00	0.00e+00	27.235093
CCND3	2.1487014	1.1948400	8.593783	0e+00	0.00e+00	23.403324
GUCA2A	1.3966234	0.5377794	8.235661	0e+00	0.00e+00	21.345829
PHGR1	1.5875349	0.7790538	8.094118	0e+00	0.00e+00	20.542213
MMP12	-1.5478178	0.4498059	-8.042518	0e+00	0.00e+00	20.250687
CKB	1.8461063	1.2010000	7.809511	0e+00	0.00e+00	18.944244
TYMP	-1.5086803	0.5986687	-7.449051	0e+00	0.00e+00	16.957641
PDE3A	1.2645878	0.5870357	7.227138	0e+00	0.00e+00	15.757169
MUC12	1.2465258	0.7228761	6.611899	0e+00	2.00e-07	12.529424
RNF213	-1.5582052	1.5052670	-6.586467	0e+00	2.00e-07	12.399425
MBNL1	1.6531241	2.4693744	6.362969	0e+00	5.00e-07	11.269562
NXPE1	1.3190223	0.7592336	6.322599	0e+00	6.00e-07	11.067939
FKBP5	1.7002914	1.3458515	6.307107	0e+00	6.00e-07	10.990775
ZG16	0.9881296	0.4012025	6.191486	0e+00	1.00e-06	10.418473
PARP14	-1.4952975	1.0742512	-6.145782	0e+00	1.20e-06	10.194040
MT-CO2	-1.1821944	2.3699957	-6.134752	0e+00	1.20e-06	10.140028
UTY	-1.1717486	0.6887650	-6.006045	0e+00	2.20e-06	9.514278
MAML2	1.5838731	1.5289834	5.912160	0e+00	3.30e-06	9.063133
SON	-1.4616397	1.2857812	-5.900829	0e+00	3.30e-06	9.008987
CLCA4	1.0303576	0.4594471	5.763777	0e+00	6.30e-06	8.359471
SAMD9L	-1.0383851	0.5141782	-5.583241	1e-07	1.46e-05	7.519295
TNFAIP2	-1.1456072	0.5712894	-5.560157	1e-07	1.56e-05	7.413159
CCDC88A	-1.2538645	0.8548085	-5.553025	1e-07	1.56e-05	7.380426
PLA2G7	-0.9520283	0.3656429	-5.536909	1e-07	1.62e-05	7.306572
ARHGAP15	1.4385488	2.0487528	5.406200	2e-07	2.92e-05	6.712990
LINC00996	-1.0889424	0.5261653	-5.356665	3e-07	3.56e-05	6.490603
ATP1A1	1.1449796	0.7971706	5.325406	3e-07	3.99e-05	6.350999

Output columns:

logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
AveExpr: Average expression across all cells in expr2
t: logFC divided by its standard error
P.Value: Raw p-value (based on t) from test that logFC differs from 0
adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
B: log-odds that gene is DE

Save files

write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")

A note on pseudobulk DE

Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.

Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.

Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.

The Seurat function AggregateExpression() can be used to sum counts as described here.

A tutorial on using limma for bulk RNASeq is available here.

Prepare for the next section

Download Rmd document

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")

Session Information

sessionInfo()

## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.19.1  kableExtra_1.4.0     dplyr_1.1.4         
##  [4] topGO_2.56.0         SparseM_1.82         GO.db_3.19.1        
##  [7] AnnotationDbi_1.66.0 IRanges_2.38.0       S4Vectors_0.42.0    
## [10] Biobase_2.64.0       graph_1.82.0         BiocGenerics_0.50.0 
## [13] limma_3.60.2         Seurat_5.1.0         SeuratObject_5.0.2  
## [16] sp_2.1-4            
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.16.0       jsonlite_1.8.8         
##   [4] magrittr_2.0.3          spatstat.utils_3.0-4    rmarkdown_2.27         
##   [7] zlibbioc_1.50.0         vctrs_0.6.5             ROCR_1.0-11            
##  [10] memoise_2.0.1           spatstat.explore_3.2-7  htmltools_0.5.8.1      
##  [13] sass_0.4.9              sctransform_0.4.1       parallelly_1.37.1      
##  [16] KernSmooth_2.23-22      bslib_0.7.0             htmlwidgets_1.6.4      
##  [19] ica_1.0-3               plyr_1.8.9              plotly_4.10.4          
##  [22] zoo_1.8-12              cachem_1.1.0            igraph_2.0.3           
##  [25] mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3        
##  [28] Matrix_1.7-0            R6_2.5.1                fastmap_1.2.0          
##  [31] GenomeInfoDbData_1.2.12 fitdistrplus_1.1-11     future_1.33.2          
##  [34] shiny_1.8.1.1           digest_0.6.35           colorspace_2.1-0       
##  [37] patchwork_1.2.0         tensor_1.5              RSpectra_0.16-1        
##  [40] irlba_2.3.5.1           RSQLite_2.3.7           progressr_0.14.0       
##  [43] fansi_1.0.6             spatstat.sparse_3.0-3   httr_1.4.7             
##  [46] polyclip_1.10-6         abind_1.4-5             compiler_4.4.0         
##  [49] bit64_4.0.5             DBI_1.2.3               fastDummies_1.7.3      
##  [52] highr_0.11              MASS_7.3-60.2           tools_4.4.0            
##  [55] lmtest_0.9-40           httpuv_1.6.15           future.apply_1.11.2    
##  [58] goftest_1.2-3           glue_1.7.0              nlme_3.1-164           
##  [61] promises_1.3.0          grid_4.4.0              Rtsne_0.17             
##  [64] cluster_2.1.6           reshape2_1.4.4          generics_0.1.3         
##  [67] gtable_0.3.5            spatstat.data_3.0-4     tidyr_1.3.1            
##  [70] data.table_1.15.4       xml2_1.3.6              XVector_0.44.0         
##  [73] utf8_1.2.4              spatstat.geom_3.2-9     RcppAnnoy_0.0.22       
##  [76] ggrepel_0.9.5           RANN_2.6.1              pillar_1.9.0           
##  [79] stringr_1.5.1           spam_2.10-0             RcppHNSW_0.6.0         
##  [82] later_1.3.2             splines_4.4.0           lattice_0.22-6         
##  [85] bit_4.0.5               survival_3.5-8          deldir_2.0-4           
##  [88] tidyselect_1.2.1        Biostrings_2.72.0       miniUI_0.1.1.1         
##  [91] pbapply_1.7-2           knitr_1.47              gridExtra_2.3          
##  [94] svglite_2.1.3           scattermore_1.2         xfun_0.44              
##  [97] statmod_1.5.0           matrixStats_1.3.0       UCSC.utils_1.0.0       
## [100] stringi_1.8.4           lazyeval_0.2.2          yaml_2.3.8             
## [103] evaluate_0.23           codetools_0.2-20        tibble_3.2.1           
## [106] cli_3.6.2               uwot_0.2.2              systemfonts_1.1.0      
## [109] xtable_1.8-4            reticulate_1.37.0       munsell_0.5.1          
## [112] jquerylib_0.1.4         GenomeInfoDb_1.40.1     Rcpp_1.0.12            
## [115] globals_0.16.3          spatstat.random_3.2-3   png_0.1-8              
## [118] parallel_4.4.0          blob_1.2.4              ggplot2_3.5.1          
## [121] dotCall64_1.1-1         listenv_0.9.1           viridisLite_0.4.2      
## [124] scales_1.3.0            ggridges_0.5.6          crayon_1.5.2           
## [127] leiden_0.4.3.1          purrr_1.0.2             rlang_1.1.3            
## [130] KEGGREST_1.44.0         cowplot_1.1.3

☰ Menu

Single Cell RNA-Seq Analysis