Introduction to Single Cell RNA-Seq Part 6: Enrichment and model-based differential expression
Set up workspace
library(Seurat)
library(limma)
library(topGO)
library(dplyr)
library(kableExtra)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-05.rds")
Idents(experiment.aggregate) <- "finalcluster"
1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster
Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are over-represented among the gene expressed in cells in a given cluster.
cluster10 <- subset(experiment.aggregate, idents = '10')
expr <- as.matrix(GetAssayData(cluster10))
# Select genes that are expressed > 0 in at least half of cells
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.5)]
all.genes <- rownames(expr)
# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes
# Create topGOdata object
GOdata <- new("topGOdata",
ontology = "BP", # use biological process ontology
allGenes = geneList,
geneSelectionFun = function(x)(x == 1),
annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol")
# Test for enrichment using Fisher's Exact Test
resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
GO.ID | Term | Annotated | Significant | Expected | Fisher |
---|---|---|---|---|---|
GO:0010613 | positive regulation of cardiac muscle hypertrophy | 19 | 4 | 0.34 | 0.00031 |
GO:0006325 | chromatin organization | 465 | 24 | 8.31 | 0.00047 |
GO:0030036 | actin cytoskeleton organization | 462 | 19 | 8.26 | 0.00057 |
GO:0034968 | histone lysine methylation | 81 | 7 | 1.45 | 0.00059 |
GO:2001014 | regulation of skeletal muscle cell differentiation | 11 | 3 | 0.20 | 0.00083 |
GO:0000381 | regulation of alternative mRNA splicing, via spliceosome | 42 | 5 | 0.75 | 0.00086 |
GO:0006338 | chromatin remodeling | 270 | 13 | 4.82 | 0.00108 |
GO:0048813 | dendrite morphogenesis | 95 | 7 | 1.70 | 0.00152 |
GO:0045445 | myoblast differentiation | 70 | 6 | 1.25 | 0.00152 |
GO:1901888 | regulation of cell junction assembly | 124 | 8 | 2.22 | 0.00167 |
GO:1901897 | regulation of relaxation of cardiac muscle | 4 | 2 | 0.07 | 0.00186 |
GO:0031666 | positive regulation of lipopolysaccharide-mediated signaling... | 4 | 2 | 0.07 | 0.00186 |
GO:0071320 | cellular response to cAMP | 30 | 4 | 0.54 | 0.00188 |
GO:0061049 | cell growth involved in cardiac muscle cell development | 15 | 3 | 0.27 | 0.00218 |
GO:0010595 | positive regulation of endothelial cell migration | 76 | 6 | 1.36 | 0.00232 |
GO:0002063 | chondrocyte development | 16 | 3 | 0.29 | 0.00264 |
GO:0045906 | negative regulation of vasoconstriction | 5 | 2 | 0.09 | 0.00306 |
GO:0098909 | regulation of cardiac muscle cell action potential involved ... | 5 | 2 | 0.09 | 0.00306 |
GO:0048842 | positive regulation of axon extension involved in axon guida... | 5 | 2 | 0.09 | 0.00306 |
GO:0021785 | branchiomotor neuron axon guidance | 5 | 2 | 0.09 | 0.00306 |
- Annotated: number of genes (out of all.genes) that are annotated with that GO term
- Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
- Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
- Fisher: (Raw) p-value from Fisher’s Exact Test
2. Model-based DE analysis in limma
limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.
Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.
# filter genes to those expressed in at least 10% of cells
keep <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.1)]
expr2 <- expr[keep,]
# Set up "design matrix" with statistical model
cluster10$proper.group <- make.names(cluster10$group)
mm <- model.matrix(~0 + proper.group + S.Score + G2M.Score + percent_MT + nFeature_RNA, data = cluster10[[]])
head(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
ACGGTTAGTCTCACAA_A001-C-007 | 1 | 0 | 0 | 0.0347548 | 0.7909218 | 0.5656109 | 1187 |
ACTGTCCAGGCGTTAG_A001-C-007 | 1 | 0 | 0 | -0.0513069 | -0.0204757 | 0.5043712 | 1634 |
AGCGTATAGCCTCAGC_A001-C-007 | 1 | 0 | 0 | -0.0576611 | -0.0549822 | 3.0966767 | 880 |
AGCGTATAGGTATCTC_A001-C-007 | 1 | 0 | 0 | 0.1080488 | -0.0416714 | 0.4595588 | 783 |
AGGACTTGTACCTATG_A001-C-007 | 1 | 0 | 0 | 0.0181025 | -0.0642517 | 0.4804393 | 986 |
AGGGTTTGTTCTTCAT_A001-C-007 | 1 | 0 | 0 | 0.0057069 | -0.0475972 | 1.3483146 | 937 |
tail(mm) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
TCCACCAAGTAACGTA_B001-A-301 | 0 | 1 | 0 | -0.0872095 | -0.1505728 | 0.1785183 | 2096 |
TCCACCAGTGTCCACG_B001-A-301 | 0 | 1 | 0 | -0.0697658 | -0.1145196 | 0.3147954 | 1147 |
TCGCTTGAGACTTGTC_B001-A-301 | 0 | 1 | 0 | -0.0989612 | -0.1146185 | 0.1570475 | 1471 |
TCTCCGACACAGTACT_B001-A-301 | 0 | 1 | 0 | 0.0437678 | -0.1455716 | 0.1921230 | 2383 |
TGATGGTCATAACTCG_B001-A-301 | 0 | 1 | 0 | -0.1108661 | -0.1465694 | 0.5749668 | 1413 |
TTCTTCCCAGCGTAGA_B001-A-301 | 0 | 1 | 0 | -0.0361969 | 0.0067655 | 0.5504587 | 843 |
# Fit model in limma
fit <- lmFit(expr2, mm)
head(coef(fit)) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupColorectal.Cancer | proper.groupNormal | proper.groupPolyp | S.Score | G2M.Score | percent_MT | nFeature_RNA | |
---|---|---|---|---|---|---|---|
CCNL2 | 0.0579580 | 0.4644867 | 0.5707252 | -0.5229024 | 0.7195925 | 0.2619560 | 0.0001080 |
CDK11B | 0.5915809 | 0.1722522 | 0.5923884 | -0.6623504 | 0.7849839 | -0.3260968 | 0.0000319 |
CDK11A | 0.9175843 | 0.2172913 | 0.4343150 | -1.0275000 | 0.9939193 | -0.0600846 | 0.0000147 |
NADK | 0.3736396 | 0.2646946 | 0.0662475 | -0.2493852 | 0.1639109 | 0.0637775 | -0.0000001 |
GNB1 | 0.7772654 | 0.5788333 | 0.8958150 | -0.5519191 | 0.9004079 | -0.0616478 | 0.0002935 |
SKI | 0.7171411 | 0.1954765 | 0.4870630 | -0.7409301 | -0.1685434 | -0.2098351 | 0.0000133 |
# Test 'Normal' - 'Colorectal.Cancer'
contr <- makeContrasts(proper.groupNormal - proper.groupColorectal.Cancer, levels = colnames(coef(fit)))
contr %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
proper.groupNormal - proper.groupColorectal.Cancer | |
---|---|
proper.groupColorectal.Cancer | -1 |
proper.groupNormal | 1 |
proper.groupPolyp | 0 |
S.Score | 0 |
G2M.Score | 0 |
percent_MT | 0 |
nFeature_RNA | 0 |
fit2 <- contrasts.fit(fit, contrasts = contr)
fit2 <- eBayes(fit2)
out <- topTable(fit2, n = Inf, sort.by = "P")
head(out, 30) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE)
logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|
XIST | 1.9827318 | 0.6682736 | 11.870704 | 0.0e+00 | 0.0000000 | 37.251244 |
GUCA2A | 1.6727802 | 0.4946151 | 10.607512 | 0.0e+00 | 0.0000000 | 31.021088 |
MMP12 | -2.0976536 | 0.6819202 | -8.045790 | 0.0e+00 | 0.0000000 | 18.347651 |
PHGR1 | 1.7128625 | 0.6230695 | 7.990444 | 0.0e+00 | 0.0000000 | 18.079041 |
CKB | 2.0697765 | 1.1605407 | 7.813288 | 0.0e+00 | 0.0000000 | 17.222231 |
SLC26A2 | 1.8826649 | 0.9641306 | 6.787927 | 0.0e+00 | 0.0000004 | 12.377261 |
SLC26A3 | 1.5570563 | 0.6022747 | 6.661061 | 0.0e+00 | 0.0000006 | 11.794657 |
PIGR | 1.7503636 | 1.1102687 | 6.460438 | 0.0e+00 | 0.0000015 | 10.882534 |
HIVEP3 | -1.4376532 | 0.7340200 | -5.994125 | 0.0e+00 | 0.0000121 | 8.810834 |
NR3C2 | 1.6376297 | 1.0291416 | 5.792722 | 1.0e-07 | 0.0000276 | 7.939339 |
UTY | -1.3631601 | 0.7485786 | -5.484691 | 3.0e-07 | 0.0000928 | 6.636837 |
ZG16 | 0.9778633 | 0.3534387 | 5.484091 | 3.0e-07 | 0.0000928 | 6.634339 |
MUC13 | 1.2139028 | 0.5911549 | 5.163231 | 1.1e-06 | 0.0003496 | 5.320392 |
PDE4D | 1.6637089 | 1.5119900 | 5.120095 | 1.3e-06 | 0.0003901 | 5.147303 |
TNFAIP2 | -1.4572991 | 1.0361014 | -5.104445 | 1.4e-06 | 0.0003901 | 5.084722 |
HSP90AA1 | -1.3268169 | 0.7386294 | -5.041036 | 1.8e-06 | 0.0004794 | 4.832355 |
MACF1 | -1.2728166 | 2.5152218 | -5.024663 | 1.9e-06 | 0.0004838 | 4.767506 |
SATB2 | 1.2335513 | 0.6184811 | 4.931185 | 2.9e-06 | 0.0006781 | 4.399765 |
RNF213 | -1.4619869 | 1.4301922 | -4.899682 | 3.3e-06 | 0.0007080 | 4.276803 |
PARP14 | -1.5705769 | 1.4065849 | -4.895788 | 3.3e-06 | 0.0007080 | 4.261639 |
IL7R | -1.0858376 | 0.3242462 | -4.865053 | 3.8e-06 | 0.0007591 | 4.142219 |
FOXP1 | 1.5111842 | 1.6488493 | 4.856249 | 3.9e-06 | 0.0007591 | 4.108095 |
AMN | 0.8531736 | 0.2741440 | 4.731821 | 6.6e-06 | 0.0011476 | 3.630123 |
EPCAM | 1.1109426 | 0.5789830 | 4.715942 | 7.0e-06 | 0.0011476 | 3.569703 |
PLA2G7 | -1.1812964 | 0.5448965 | -4.705319 | 7.3e-06 | 0.0011476 | 3.529356 |
ZBTB20 | 1.5396424 | 1.7559376 | 4.700057 | 7.5e-06 | 0.0011476 | 3.509395 |
RAPGEF1 | -1.5180128 | 1.5356339 | -4.698717 | 7.5e-06 | 0.0011476 | 3.504313 |
SAT1 | -1.8924435 | 2.0416164 | -4.696409 | 7.6e-06 | 0.0011476 | 3.495562 |
TYMP | -1.3860201 | 0.9203430 | -4.688993 | 7.8e-06 | 0.0011476 | 3.467469 |
SCNN1B | 0.9079763 | 0.3004799 | 4.668610 | 8.5e-06 | 0.0011916 | 3.390400 |
Output columns:
- logFC: log fold change (since we are working with Seurat’s natural log transformed data, will be natural log fold change)
- AveExpr: Average expression across all cells in expr2
- t: logFC divided by its standard error
- P.Value: Raw p-value (based on t) from test that logFC differs from 0
- adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
- B: log-odds that gene is DE
Save files
write.csv(GenTable(GOdata, Fisher = resultFisher), file = "cluster10_GOdata.csv")
write.csv(out, file = "cluster10_Normal-Colorectal.Cancer_topTable.csv")
A note on pseudobulk DE
Pseudobulk differential expression uses count data summed across all cells in each sample (typically within each cell type or cluster). Unlike cell-level DE, pseudobulk DE requires biological replicates so we won’t perform it on this dataset.
Once counts are summed, pseudobulk data are analyzed like bulk RNASeq data.
Pseudobulk DE may result in better false discovery rate control than cell-level DE, as shown here.
The Seurat function AggregateExpression()
can be used to sum counts as described here.
A tutorial on using limma for bulk RNASeq is available here.
Prepare for the next section
Download Rmd document
download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/07-doublet_detection.Rmd", "07-doublet_detection.Rmd")
Session Information
sessionInfo()