An object of class Seurat
36601 features across 3343 samples within 1 assay
Active assay: RNA (36601 features, 3941 variable features)
3 dimensional reductions calculated: pca, tsne, umap
Idents(experiment.merged)<-"finalcluster"
1. Gene Ontology (GO) Enrichment of Genes Expressed in a Cluster
Gene Ontology provides a controlled vocabulary for describing gene products. Here we use enrichment analysis to identify GO terms that are overrepresented among the gene expressed in cells in a given cluster.
cluster0<-subset(experiment.merged,idents='0')expr<-as.matrix(GetAssayData(cluster0))# Filter out genes that are 0 for every cell in this clusterbad<-which(rowSums(expr)==0)expr<-expr[-bad,]# Select genes that are expressed > 0 in at least half of cellsn.gt.0<-apply(expr,1,function(x)length(which(x>0)))expressed.genes<-rownames(expr)[which(n.gt.0/ncol(expr)>=0.5)]all.genes<-rownames(expr)# define geneList as 1 if gene is in expressed.genes, 0 otherwisegeneList<-ifelse(all.genes%in%expressed.genes,1,0)names(geneList)<-all.genes# Create topGOdata objectGOdata<-new("topGOdata",ontology="BP",# use biological process ontologyallGenes=geneList,geneSelectionFun=function(x)(x==1),annot=annFUN.org,mapping="org.Hs.eg.db",ID="symbol")# Test for enrichment using Fisher's Exact TestresultFisher<-runTest(GOdata,algorithm="elim",statistic="fisher")GenTable(GOdata,Fisher=resultFisher,topNodes=20,numChar=60)
GO.ID Term Annotated Significant Expected Fisher
1 GO:0002181 cytoplasmic translation 142 108 17.73 < 1e-30
2 GO:0006364 rRNA processing 215 97 26.84 6.2e-19
3 GO:0006120 mitochondrial electron transport, NADH to ubiquinone 41 28 5.12 1.3e-16
4 GO:0032981 mitochondrial respiratory chain complex I assembly 51 30 6.37 4.7e-15
5 GO:0000027 ribosomal large subunit assembly 24 17 3.00 5.8e-11
6 GO:0000398 mRNA splicing, via spliceosome 265 103 33.08 7.8e-11
7 GO:1904874 positive regulation of telomerase RNA localization to Cajal ... 15 13 1.87 1.4e-10
8 GO:0042776 mitochondrial ATP synthesis coupled proton transport 16 13 2.00 6.6e-10
9 GO:0001732 formation of cytoplasmic translation initiation complex 16 13 2.00 6.6e-10
10 GO:0006457 protein folding 173 68 21.60 9.7e-10
11 GO:0000028 ribosomal small subunit assembly 17 13 2.12 2.5e-09
12 GO:0006412 translation 570 226 71.16 2.7e-09
13 GO:0048025 negative regulation of mRNA splicing, via spliceosome 18 13 2.25 7.9e-09
14 GO:0006446 regulation of translational initiation 72 30 8.99 2.9e-08
15 GO:0006123 mitochondrial electron transport, cytochrome c to oxygen 12 10 1.50 4.6e-08
16 GO:1904851 positive regulation of establishment of protein localization... 10 9 1.25 6.4e-08
17 GO:0045727 positive regulation of translation 115 36 14.36 8.1e-08
18 GO:1904871 positive regulation of protein localization to Cajal body 11 9 1.37 3.1e-07
19 GO:0010498 proteasomal protein catabolic process 429 91 53.56 4.9e-07
20 GO:0042273 ribosomal large subunit biogenesis 69 45 8.61 5.2e-07
Annotated: number of genes (out of all.genes) that are annotated with that GO term
Significant: number of genes that are annotated with that GO term and meet our criteria for “expressed”
Expected: Under random chance, number of genes that would be expected to be annotated with that GO term and meeting our criteria for “expressed”
Fisher: (Raw) p-value from Fisher’s Exact Test
Quiz 1
Challenge Questions
If you have extra time:
Rerun the enrichment analysis for the molecular function (MF) ontology.
Think about how you write code to repeat the above enrichment analysis for every cluster (hint: ?base::sapply).
2. Model-based DE analysis in limma
limma is an R package for differential expression analysis of bulk RNASeq and microarray data. We apply it here to single cell data.
Limma can be used to fit any linear model to expression data and is useful for analyses that go beyond two-group comparisons. A detailed tutorial of model specification in limma is available here and in the limma User’s Guide.
# filter genes to those expressed in at least 10% of cellskeep<-rownames(expr)[which(n.gt.0/ncol(expr)>=0.1)]expr2<-expr[keep,]# Set up "design matrix" with statistical modelmm<-model.matrix(~0+orig.ident,data=cluster0[[]])head(mm)
Go to https://www.ncbi.nlm.nih.gov/gene/ and look up the top 5 DE genes between convMRR and convCOVID. Which ones are related to SARS-Cov-2 infection?
How would you write code to repeatedly perform the above DE analysis in each cluster? (hint: ?base::sapply)
BONUS: Cell type identification with scMRMA
scMRMA (single cell Multi-Resolution Marker-based Annotation Algorithm) classifies cells by iteratively clustering them then annotating based on a hierarchical external database.
The databases included with the current version are only for use with human and mouse, but a user-constructed hierarchichal database can be used.