Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

Introduction to Single Cell RNA-Seq Part 6: Clustering and cell type assignment

Clustering and cell type assignment are critical steps in many single cell (or single nucleus) experiments. The power of single cell experiments is to capture the heterogeneity within samples as well as between them. Clustering permits the user to organize cells into clusters that correspond to biologically and experimentally relevant populations.

Set up workspace

library(Seurat)
library(kableExtra)
library(tidyr)
library(dplyr)
library(ggplot2)
library(HGNChelper)
library(ComplexHeatmap)
set.seed(12345)
experiment.aggregate <- readRDS("scRNA_workshop-04.rds")
experiment.aggregate

## An object of class Seurat 
## 11475 features across 6368 samples within 1 assay 
## Active assay: RNA (11475 features, 7112 variable features)
##  3 layers present: counts, data, scale.data
##  2 dimensional reductions calculated: pca, umap

Construct network

Seurat implements an graph-based clustering approach. Distances between the cells are calculated based on previously identified PCs.

The default method for identifying k-nearest neighbors is annoy, an approximate nearest-neighbor approach that is widely used for high-dimensional analysis in many fields, including single-cell analysis. Extensive community benchmarking has shown that annoy substantially improves the speed and memory requirements of neighbor discovery, with negligible impact to downstream results.

experiment.aggregate <- FindNeighbors(experiment.aggregate, reduction = "pca", dims = 1:50)

Find clusters

The FindClusters function implements the neighbor based clustering procedure, and contains a resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters. This code produces a series of resolutions for us to investigate and choose from.

The clustering resolution parameter is unit-less and somewhat arbitrary. The resolutions used here were selected to produce a useable number of clusters in the example experiment.

experiment.aggregate <- FindClusters(experiment.aggregate,
                                     resolution = seq(0.1, 0.4, 0.1))

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6368
## Number of edges: 261294
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9687
## Number of communities: 10
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6368
## Number of edges: 261294
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9494
## Number of communities: 11
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6368
## Number of edges: 261294
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9343
## Number of communities: 15
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6368
## Number of edges: 261294
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9220
## Number of communities: 15
## Elapsed time: 0 seconds

Seurat adds the clustering information to the metadata table. Each FindClusters call generates a new column named with the assay, followed by “_snn_res.”, and the resolution.

cluster.resolutions <- grep("res", colnames(experiment.aggregate@meta.data), value = TRUE)
head(experiment.aggregate@meta.data[,cluster.resolutions]) %>%
  kable(caption = 'Cluster identities are added to the meta.data slot.') %>%
  kable_styling("striped")

Cluster identities are added to the meta.data slot.
	RNA_snn_res.0.1	RNA_snn_res.0.2	RNA_snn_res.0.3	RNA_snn_res.0.4
AAACCCAAGTTATGGA_A001-C-007	1	1	1	2
AAACGCTTCTCTGCTG_A001-C-007	6	7	10	10
AAAGAACGTGCTTATG_A001-C-007	1	1	5	5
AAAGAACGTTTCGCTC_A001-C-007	3	3	3	3
AAAGAACTCTGGCTGG_A001-C-007	4	5	7	8
AAAGGATTCATTACCT_A001-C-007	3	3	3	3

Explore clustering resolutions

The number of clusters produced increases with the clustering resolution.

sapply(cluster.resolutions, function(res){
  length(levels(experiment.aggregate@meta.data[,res]))
})

## RNA_snn_res.0.1 RNA_snn_res.0.2 RNA_snn_res.0.3 RNA_snn_res.0.4 
##              10              11              15              15

Visualize clustering

Dimensionality reduction plots can be used to visualize the clustering results. On these plots, we can see how each clustering resolution aligns with patterns in the data revealed by dimensionality reductions.

UMAP

# UMAP colored by cluster
lapply(cluster.resolutions, function(res){
  DimPlot(experiment.aggregate,
          group.by = res,
          reduction = "umap",
          shuffle = TRUE) +
    scale_color_viridis_d(option = "turbo")
})

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Investigate the relationship between cluster identity and sample identity

lapply(cluster.resolutions, function(res){
         tmp = experiment.aggregate@meta.data[,c(res, "group")]
         colnames(tmp) = c("cluster", "group")
         ggplot(tmp, aes(x = cluster, fill = group)) +
           geom_bar() +
           scale_fill_viridis_d(option = "mako") +
           theme_classic()
})

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Investigate the relationship between cluster identity and metadata values

Here, example plots are displayed for the lowest resolution in order to save space. To see plots for each resolution, use lapply().

VlnPlot(experiment.aggregate,
        group.by = "RNA_snn_res.0.4",
        features = "nCount_RNA",
        pt.size = 0.1) +
  scale_fill_viridis_d(option = "turbo")

VlnPlot(experiment.aggregate,
        group.by = "RNA_snn_res.0.4",
        features = "nFeature_RNA",
        pt.size = 0.1) +
  scale_fill_viridis_d(option = "turbo")

VlnPlot(experiment.aggregate,
        group.by = "RNA_snn_res.0.4",
        features = "percent_MT",
        pt.size = 0.1) +
  scale_fill_viridis_d(option = "turbo")

Visualize expression of genes of interest

FeaturePlot(experiment.aggregate,
            reduction = "umap",
            features = "KCNMA1")

VlnPlot(experiment.aggregate,
        group.by = "RNA_snn_res.0.4",
        features = "KCNMA1",
        pt.size = 0.1) +
  scale_fill_viridis_d(option = "turbo")

Select a resolution

For now, let’s use resolution 0.4. Over the remainder of this section, we will refine the clustering further.

Idents(experiment.aggregate) <- "RNA_snn_res.0.4"

Visualize cluster tree

Building a phylogenetic tree relating the ‘average’ cell from each group in default ‘Ident’ (currently “RNA_snn_res.0.1”). This tree is estimated based on a distance matrix constructed in either gene expression space or PCA space.

experiment.aggregate <- BuildClusterTree(experiment.aggregate, dims = 1:50)
PlotClusterTree(experiment.aggregate)

Merge clusters

In many experiments, the clustering resolution does not need to be uniform across all of the cell types present. While for some cell types of interest fine detail may be desirable, for others, simply grouping them into a larger parent cluster is sufficient. Merging cluster is very straightforward.

experiment.aggregate <- RenameIdents(experiment.aggregate,
                                     '11' = '10',
                                     '12' = '13')
experiment.aggregate$res.0.4_merged <- Idents(experiment.aggregate)
table(experiment.aggregate$res.0.4_merged)

## 
##   10   13    0    1    2    3    4    5    6    7    8    9   14 
##  159   86 1417  928  877  661  608  599  415  219  192  182   25

experiment.aggregate@meta.data %>%
  ggplot(aes(x = res.0.4_merged, fill = group)) +
  geom_bar() +
  scale_fill_viridis_d(option = "mako") +
  theme_classic()

DimPlot(experiment.aggregate,
        reduction = "umap",
        group.by = "res.0.4_merged",
        label = TRUE) +
  scale_color_viridis_d(option = "turbo")

VlnPlot(experiment.aggregate,
        group.by = "res.0.4_merged",
        features = "KCNMA1") +
  scale_fill_viridis_d(option = "turbo")

Reorder the clusters

Merging the clusters changed the order in which they appear on a plot. In order to reorder the clusters for plotting purposes take a look at the levels of the identity, then re-level as desired.

levels(experiment.aggregate$res.0.4_merged)

##  [1] "10" "13" "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "14"

# move one cluster to the first position
experiment.aggregate$res.0.4_merged <- relevel(experiment.aggregate$res.0.4_merged, "0")
levels(experiment.aggregate$res.0.4_merged)

##  [1] "0"  "10" "13" "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "14"

# the color assigned to some clusters will change
VlnPlot(experiment.aggregate,
        group.by = "res.0.4_merged",
        features = "CAPN9",
        pt.size = 0.1) +
  scale_fill_viridis_d(option = "turbo")

# re-level entire factor
new.order <- as.character(sort(as.numeric(levels(experiment.aggregate$res.0.4_merged))))
experiment.aggregate$res.0.4_merged <- factor(experiment.aggregate$res.0.4_merged, levels = new.order)
levels(experiment.aggregate$res.0.4_merged)

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "13" "14"

VlnPlot(experiment.aggregate,
        group.by = "res.0.4_merged",
        features = "MYRIP",
        pt.size = 0.1) +
  scale_fill_viridis_d(option = "turbo")

Subcluster

While merging clusters reduces the resolution in some parts of the experiment, sub-clustering has the opposite effect. Let’s produce sub-clusters for cluster 5.

experiment.aggregate <- FindSubCluster(experiment.aggregate,
                                       graph.name = "RNA_snn",
                                       cluster = 5,
                                       subcluster.name = "subcluster")

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 599
## Number of edges: 24112
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7177
## Number of communities: 4
## Elapsed time: 0 seconds

experiment.aggregate$subcluster <- factor(experiment.aggregate$subcluster,
                                          levels = c(as.character(0:4),
                                                     "5_0", "5_1", "5_2", "5_3",
                                                     as.character(c(6:10, 13, 14))))
DimPlot(experiment.aggregate,
        reduction = "umap",
        group.by = "subcluster",
        shuffle = TRUE) +
  scale_color_viridis_d(option = "turbo")

sort(unique(experiment.aggregate$subcluster))

##  [1] 0   1   2   3   4   5_0 5_1 5_2 5_3 6   7   8   9   10  13  14 
## Levels: 0 1 2 3 4 5_0 5_1 5_2 5_3 6 7 8 9 10 13 14

Subset experiment by cluster identity

After exploring and refining the cluster resolution, we may have identified some clusters that are composed of cells we aren’t interested in. For example, if we have identified a cluster likely composed of contaminants, this cluster can be removed from the analysis. Alternatively, if a group of clusters have been identified as particularly of interest, these can be isolated and re-analyzed.

# remove cluster 13
Idents(experiment.aggregate) <- experiment.aggregate$subcluster
experiment.tmp <- subset(experiment.aggregate, subcluster != "13")
DimPlot(experiment.tmp,
        reduction = "umap",
        group.by = "subcluster",
        shuffle = TRUE) +
  scale_color_viridis_d(option = "turbo")

# retain cells belonging only to specified clusters
experiment.tmp <- subset(experiment.aggregate, subcluster %in% c("1", "2", "5_0", "5_1", "5_2", "5_3", "7"))
DimPlot(experiment.tmp,
        reduction = "umap",
        group.by = "subcluster",
        shuffle = TRUE) +
  scale_color_viridis_d(option = "turbo")

Identify marker genes

Seurat provides several functions that can help you find markers that define clusters via differential expression:

FindMarkers identifies markers for a cluster relative to all other clusters
FindAllMarkers performs the find markers operation for all clusters
FindAllMarkersNode defines all markers that split a node from the cluster tree

FindMarkers

markers <- FindMarkers(experiment.aggregate,
                       group.by = "subcluster",
                       ident.1 = "1")
length(which(markers$p_val_adj < 0.05)) # how many are significant?

## [1] 2617

head(markers) %>%
  kable() %>%
  kable_styling("striped")

	avg_log2FC	pct.1	pct.2
KCNMA1	5.071086	0.912	0.091
MUC2	4.222438	0.874	0.139
PTPRN2	3.524203	0.878	0.169
MYRIP	3.432499	0.759	0.143
ANO7	4.997656	0.642	0.041
FER1L6	4.817227	0.637	0.059

The “pct.1” and “pct.2” columns record the proportion of cells with normalized expression above 0 in ident.1 and ident.2, respectively. The “p_val” is the raw p-value associated with the differential expression test, while the BH-adjusted value is found in “p_val_adj”. Finally, “avg_logFC” is the average log fold change difference between the two groups.

Marker genes identified this way can be visualized in violin plots, feature plots, and heat maps.

view.markers <- c(rownames(markers[markers$avg_log2FC > 0,])[1],
                  rownames(markers[markers$avg_log2FC < 0,])[1])
lapply(view.markers, function(marker){
  VlnPlot(experiment.aggregate,
          group.by = "subcluster",
          features = marker) +
    scale_fill_viridis_d(option = "turbo")
})

## [[1]]

## 
## [[2]]

FeaturePlot(experiment.aggregate,
            features = view.markers,
            ncol = 2)

DoHeatmap(experiment.aggregate,
          group.by = "subcluster",
          features = view.markers,
          group.colors = viridis::turbo(length(levels(experiment.aggregate$subcluster))))

Improved heatmap

The Seurat DoHeatmap function provided by Seurat provides a convenient look at expression of selected genes. The ComplexHeatmap library generates heat maps with a much finer level of control.

cluster.colors <- viridis::turbo(length(levels(experiment.aggregate$subcluster)))
names(cluster.colors) <- levels(experiment.aggregate$subcluster)
group.colors <- viridis::mako(length(levels(experiment.aggregate$group)))
names(group.colors) <- levels(experiment.aggregate$group)
top.annotation <- columnAnnotation(df = experiment.aggregate@meta.data[,c("group", "subcluster")],
                                   col = list(group = group.colors,
                                              subcluster = cluster.colors))
mat <- as.matrix(GetAssayData(experiment.aggregate[rownames(markers)[1:20],],
                              slot = "data"))
Heatmap(mat,
        name = "normalized\ncounts",
        show_row_dend = FALSE,
        show_column_dend = FALSE,
        show_column_names = FALSE,
        top_annotation = top.annotation)

FindAllMarkers

FindAllMarkers can be used to automate this process across all clusters.

Idents(experiment.aggregate) <- "subcluster"
markers <- FindAllMarkers(experiment.aggregate,
                          only.pos = TRUE,
                          min.pct = 0.25,
                          thresh.use = 0.25)
tapply(markers$p_val_adj, markers$cluster, function(x){
  length(x < 0.05)
})

##    0    1    2    3    4  5_0  5_1  5_2  5_3    6    7    8    9   10   13   14 
##  825  489 1323 1463 1495  262  722  241  359 1615  705  339  586  269  302  272

head(markers) %>%
  kable() %>%
  kable_styling("striped")

	avg_log2FC	pct.1	pct.2	gene
RBFOX1	2.034105	0.855	0.406	RBFOX1
NXPE1	1.678752	0.975	0.639	NXPE1
ADAMTSL1	1.646397	0.912	0.508	ADAMTSL1
XIST	1.551350	0.855	0.353	XIST
SATB2	1.235538	0.941	0.659	SATB2
HNF1A-AS1	1.390375	0.881	0.559	HNF1A-AS1

view.markers <- tapply(markers$gene, markers$cluster, function(x){head(x,1)})
# violin plots
lapply(view.markers, function(marker){
  VlnPlot(experiment.aggregate,
          group.by = "subcluster",
          features = marker) +
    scale_fill_viridis_d(option = "turbo")
})

## $`0`

## 
## $`1`

## 
## $`2`

## 
## $`3`

## 
## $`4`

## 
## $`5_0`

## 
## $`5_1`

## 
## $`5_2`

## 
## $`5_3`

## 
## $`6`

## 
## $`7`

## 
## $`8`

## 
## $`9`

## 
## $`10`

## 
## $`13`

## 
## $`14`

# feature plots
lapply(view.markers, function(marker){
  FeaturePlot(experiment.aggregate,
              features = marker)
})

## $`0`

## 
## $`1`

## 
## $`2`

## 
## $`3`

## 
## $`4`

## 
## $`5_0`

## 
## $`5_1`

## 
## $`5_2`

## 
## $`5_3`

## 
## $`6`

## 
## $`7`

## 
## $`8`

## 
## $`9`

## 
## $`10`

## 
## $`13`

## 
## $`14`

# heat map
DoHeatmap(experiment.aggregate,
          group.by = "subcluster",
          features = view.markers,
          group.colors = viridis::turbo(length(unique(experiment.aggregate$subcluster))))

# ComplexHeatmap
mat <- as.matrix(GetAssayData(experiment.aggregate[view.markers,],
                              slot = "data"))
Heatmap(mat,
        name = "normalized\ncounts",
        show_row_dend = FALSE,
        show_column_dend = FALSE,
        show_column_names = FALSE,
        column_split = experiment.aggregate$subcluster,
        top_annotation = top.annotation)

Calculate mean marker expression within clusters

You may want to get an idea of the mean expression of markers in a cluster or group of clusters. The percent expressing is provided by FindMarkers and FindAllMarkers, along with the average log fold change, but not the expression value itself. The function below calculates a mean for the supplied marker in the named cluster(s) and all other groups. Please note that this function accesses the active identity.

# ensure active identity is set to desired clustering resolution
Idents(experiment.aggregate) <- experiment.aggregate$subcluster
# define function
getGeneClusterMeans <- function(feature, idents){
  x = GetAssayData(experiment.aggregate)[feature,]
  m = tapply(x, Idents(experiment.aggregate) %in% idents, mean)
  names(m) = c("mean.out.of.idents", "mean.in.idents")
  return(m[c(2,1)])
}
# calculate means for a single marker
getGeneClusterMeans("SMOC2", c("1"))

##     mean.in.idents mean.out.of.idents 
##         0.06505446         0.27696665

# add means to marker table (example using subset)
markers.small <- markers[view.markers,]
means <- matrix(mapply(getGeneClusterMeans, view.markers, markers.small$cluster), ncol = 2, byrow = TRUE)
colnames(means) <- c("mean.in.cluster", "mean.out.of.cluster")
rownames(means) <- view.markers
markers.small <- cbind(markers.small, means)
markers.small[,c("cluster", "mean.in.cluster", "mean.out.of.cluster", "avg_log2FC", "p_val_adj")] %>%
  kable() %>%
  kable_styling("striped")

	cluster	mean.in.cluster	mean.out.of.cluster	avg_log2FC
RBFOX1	0	2.4472279	0.8916247	2.0341055
KCNMA1	1	2.9248186	0.1774590	5.0710862
CEMIP	2	1.6756835	0.1465987	4.0381823
ENSG00000227088	3	1.9786250	0.0199813	7.1264153
SCNN1B	0	1.5708278	0.9227322	0.3662769
LINC00278	2	0.7565661	0.2854624	1.1396960
SLC9A3	5_1	0.4823360	0.0151884	5.4115329
KCND3	5_2	0.5098400	0.0169088	4.9700114
CDH13	5_1	1.0216303	0.1698600	3.0163505
HDAC9	5_1	0.9094500	0.2483312	1.9275863
NUSAP1	7	1.0219056	0.0566234	4.6308676
CALD1	8	2.7791941	0.0433339	7.1432480
NOTCH2	9	2.7472839	0.1420540	5.5497559
CHST11	10	2.1137727	0.0484920	6.3438692
EGFL7	13	1.1791919	0.0034060	9.2892294
RIMS2	14	2.6670242	0.0090534	9.8329481

Advanced visualizations

Researchers may use the tree, markers, domain knowledge, and goals to identify useful clusters. This may mean adjusting PCA to use, choosing a new resolution, merging clusters together, sub-clustering, sub-setting, etc. You may also want to use automated cell type identification at this point, which will be discussed in the next section.

Address overplotting

Single cell and single nucleus experiments may include so many cells that dimensionality reduction plots sometimes suffer from overplotting, where individual points are difficult to see. The following code addresses this by adjusting the size and opacity of the points.

alpha.use <- 0.4
p <- DimPlot(experiment.aggregate,
             group.by="subcluster",
             pt.size=0.5,
             reduction = "umap",
             shuffle = TRUE) +
  scale_color_viridis_d(option = "turbo")
p$layers[[1]]$mapping$alpha <- alpha.use
p + scale_alpha_continuous(range = alpha.use, guide = FALSE)

Highlight a subset

DimPlot(experiment.aggregate,
        group.by = "subcluster",
        cells.highlight = CellsByIdentities(experiment.aggregate, idents = c("1", "2", "3")),
        cols.highlight = c(viridis::viridis(3))) +
  ggtitle("Selected clusters")

Split dimensionality reduction plots

DimPlot(experiment.aggregate,
        group.by = "subcluster",
        split.by = "group") +
  scale_color_viridis_d(option = "turbo")

Plot a subset of cells

Note that the object itself is unchanged by the subsetting operation.

DimPlot(experiment.aggregate,
        group.by = "subcluster",
        reduction = "umap",
        cells = Cells(experiment.aggregate)[experiment.aggregate$orig.ident %in% "A001-C-007"]) +
  scale_color_viridis_d(option = "turbo") +
  ggtitle("A00-C-007 subcluster")

Automated cell type detection

ScType is one of many available automated cell type detection algorithms. It has the advantage of being fast and flexible; it can be used with the large human and mouse cell type database provided by the authors, with a user-defined database, or with some combination of the two.

In this section, we will use ScType to assign our clusters from Seurat to a cell type based on a hierarchical external database. The database supplied with the package is human but users can supply their own data. More details are available on Github.

Source ScType functions from Github

The source() functions in the code box below run scripts stored in the ScType GitHub repository. These scripts add two additional functions, gene_sets_prepare() and sctype_score(), to the global environment.

source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

Create marker database

The authors of ScType have provided an extensive database of human and mouse cell type markers in the following tissues:

Immune system
Pancreas
Liver
Eye
Kidney
Brain
Lung
Adrenal
Heart
Intestine
Muscle
Placenta
Spleen
Stomach
Thymus

To set up the database for ScType, we run the gene_sets_prepare() function on a URL pointing to the full database, specifying the tissue subset to retrieve.

db.URL <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx"
tissue <- "Intestine"
gs.list <- gene_sets_prepare(db.URL, cell_type = "Intestine")

View(gs.list)

The database is composed of two lists (negative and positive) of cell type markers for the specified tissue. In this case, no negative markers have been provided, so the vectors of gene names in that part of the database are empty (length 0).

In addition to using the database supplied with the package, or substituting another database, it is possible to augment the provided database by changing the provided markers for an existing cell type, or adding another cell type to the object.

Let’s add a cell type and associated markers (markers from PanglaoDB).

gs.list$gs_positive$`Tuft cells` <- c("SUCNR1", "FABP1", "POU2F3", "SIGLECF", "CDHR2", "AVIL", "ESPN", "LRMP", "TRPM5", "DCLK1", "TAS1R3", "SOX9", "TUBB5", "CAMK2B", "GNAT3", "IL25", "PLCB2", "GFI1B", "ATOH1", "CD24A", "ASIC5", "KLF3", "KLF6", "DRD3", "NRADD", "GNG13", "NREP", "RGS2", "RAC2", "PTGS1", "IRF7", "FFAR3", "ALOX5", "TSLP", "IL4RA", "IL13RA1", "IL17RB", "PTPRC")
gs.list$gs_negative$`Tuft cells` <- NULL

Score cells

ScType scores every cell, summarizing the transformed and weighted expression of markers for each cell type. This generates a matrix with the dimensions cell types x cells.

es.max <- sctype_score(GetAssayData(experiment.aggregate, layer = "scale"),
                       scaled = TRUE,
                       gs = gs.list$gs_positive,
                       gs2 = gs.list$gs_negative)
# cell type scores of first cell
es.max[,1]

##         Smooth muscle cells              Lymphoid cells 
##                  -0.3636781                  -0.1108393 
## Intestinal epithelial cells                    ENS glia 
##                  -0.4479756                  -0.1363974 
##  Vascular endothelial cells               Stromal cells 
##                  -0.1014542                  -0.1532908 
## Lymphatic endothelial cells                  Tuft cells 
##                   2.1557870                   0.6969473

Score clusters

The “ScType score” of each cluster is calculated by summing the cell-level scores within each cluster. The cell type with the largest (positive) score is the most highly enriched cell type for that cluster.

clusters <- sort(unique(experiment.aggregate$subcluster))
tmp <- lapply(clusters, function(cluster){
  es.max.cluster = sort(rowSums(es.max[, experiment.aggregate$subcluster == cluster]), decreasing = TRUE)
  out = head(data.frame(subcluster = cluster,
                        ScType = names(es.max.cluster),
                        scores = es.max.cluster,
                        ncells = sum(experiment.aggregate$subcluster == cluster)))
  out$rank = 1:length(out$scores)
  return(out)
})
cluster.ScType <- do.call("rbind", tmp)
cluster.ScType %>%
  pivot_wider(id_cols = subcluster,
              names_from = ScType,
              values_from = scores) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped"), fixed_thead = TRUE)

subcluster	Lymphatic endothelial cells	ENS glia	Lymphoid cells	Vascular endothelial cells	Stromal cells	Smooth muscle cells	Tuft cells	Intestinal epithelial cells
0	87.7378776	-5.604127	-100.553219	-110.834542	-134.7583553	-160.546668	NA	NA
1	NA	-20.493367	-86.545933	-58.613385	-95.9900018	266.466891	276.54233	NA
2	-19.4149079	41.549581	-37.805579	-44.034424	-96.3785689	NA	NA	-39.581128
3	NA	-13.676205	-22.021587	31.148505	-30.4911573	-4.270566	NA	139.287912
4	118.6766979	-57.538190	-31.830609	-47.293829	-31.1877105	NA	-39.64303	NA
5_0	-2.1521834	6.739549	-32.525223	-34.517383	-45.1453270	NA	NA	-99.447739
5_1	6.6622500	NA	3.787856	-8.950350	-9.8834830	NA	27.57278	82.371743
5_2	-16.8508058	-5.342219	-4.376185	-8.756197	-0.2337756	NA	NA	-9.098947
5_3	30.1754620	-1.354951	-5.832314	-5.169916	-7.2313034	NA	-14.39365	NA
6	87.0996915	NA	-14.926685	-38.410281	-30.5241198	NA	258.86783	263.174753
7	43.2487916	8.538654	NA	-11.153706	-29.2848852	-8.402164	34.97969	NA
8	32.5489313	18.638932	1.528065	23.905926	539.0313173	512.913159	NA	NA
9	NA	9.689404	-11.593291	-2.777240	-17.8104451	NA	-27.71397	546.455908
10	-15.3437347	20.659288	329.428586	28.334859	-20.7395694	NA	149.27955	NA
13	-0.9443133	46.375058	5.309052	220.704912	-13.7513072	NA	211.30776	NA
14	-3.9295706	-3.319804	-2.833958	-2.579368	-3.5508860	-7.890572	NA	NA

cluster.ScType.top <- cluster.ScType %>%
  filter(rank == 1) %>%
  select(subcluster, ncells, ScType, scores)
rownames(cluster.ScType.top) <- cluster.ScType.top$subcluster
cluster.ScType.top <- cluster.ScType.top %>%
  rename("score" = scores)

Once the scores have been calculated for each cluster, they can be added to the Seurat object.

subcluster.ScType <- cluster.ScType.top[experiment.aggregate$subcluster, "ScType"]
experiment.aggregate <- AddMetaData(experiment.aggregate,
                                    metadata = subcluster.ScType,
                                    col.name = "subcluster_ScType")
DimPlot(experiment.aggregate,
        reduction = "umap",
        group.by = "subcluster_ScType",
        shuffle = TRUE) +
  scale_color_viridis_d()

The ScType developer suggests that assignments with a score less than (number of cells in cluster)/4 are low confidence and should be set to unknown:

cluster.ScType.top <- cluster.ScType.top %>%
  mutate(ScType_filtered = ifelse(score >= ncells / 4, ScType, "Unknown"))
subcluster.ScType.filtered <- cluster.ScType.top[experiment.aggregate$subcluster, "ScType_filtered"]
experiment.aggregate <- AddMetaData(experiment.aggregate,
                                    metadata = subcluster.ScType.filtered,
                                    col.name = "subcluster_ScType_filtered")
DimPlot(experiment.aggregate,
        reduction = "umap",
        group.by = "subcluster_ScType_filtered",
        shuffle = TRUE) +
  scale_color_viridis_d()

types <- grep("Unknown", unique(experiment.aggregate$subcluster_ScType_filtered), value = TRUE, invert = TRUE)
type.markers <- unlist(lapply(gs.list$gs_positive[types], function(m){
  rownames(markers[which(m %in% rownames(markers)),])
}))
type.markers <- type.markers[!duplicated(type.markers)]
type.markers.df <- data.frame(marker = type.markers,
                              type = gsub('[0-9]', '', names(type.markers)))
type.colors <- viridis::viridis(length(unique(type.markers.df$type)))
names(type.colors) <- unique(type.markers.df$type)
left.annotation <- rowAnnotation("marker" = type.markers.df$type, col = list(marker = type.colors))
mat <- as.matrix(GetAssayData(experiment.aggregate,
                              slot = "data")[type.markers.df$marker,])
Heatmap(mat,
        name = "normalized\ncounts",
        top_annotation = top.annotation,
        left_annotation = left.annotation,
        show_column_dend = FALSE,
        show_column_names = FALSE,
        show_row_dend = FALSE)

Prepare for the next section

Save the Seurat object and download the next Rmd file

# set the finalcluster to subcluster
experiment.aggregate$finalcluster <- experiment.aggregate$subcluster
# save object
saveRDS(experiment.aggregate, file="scRNA_workshop-05.rds")

Download the Rmd file

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2023-December-Single-Cell-RNA-Seq-Analysis/main/data_analysis/06-de_enrichment.Rmd", "06-de_enrichment.Rmd")

Session Information

sessionInfo()

## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.20.0 HGNChelper_0.8.14     ggplot2_3.5.1        
## [4] dplyr_1.1.4           tidyr_1.3.1           kableExtra_1.4.0     
## [7] Seurat_5.1.0          SeuratObject_5.0.2    sp_2.1-4             
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     shape_1.4.6.1          rstudioapi_0.16.0     
##   [4] jsonlite_1.8.8         magrittr_2.0.3         spatstat.utils_3.0-4  
##   [7] farver_2.1.2           rmarkdown_2.27         GlobalOptions_0.1.2   
##  [10] vctrs_0.6.5            ROCR_1.0-11            spatstat.explore_3.2-7
##  [13] htmltools_0.5.8.1      sass_0.4.9             sctransform_0.4.1     
##  [16] parallelly_1.37.1      KernSmooth_2.23-22     bslib_0.7.0           
##  [19] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
##  [22] plotly_4.10.4          zoo_1.8-12             cachem_1.1.0          
##  [25] igraph_2.0.3           mime_0.12              lifecycle_1.0.4       
##  [28] iterators_1.0.14       pkgconfig_2.0.3        Matrix_1.7-0          
##  [31] R6_2.5.1               fastmap_1.2.0          clue_0.3-65           
##  [34] fitdistrplus_1.1-11    future_1.33.2          shiny_1.8.1.1         
##  [37] digest_0.6.35          colorspace_2.1-0       S4Vectors_0.42.0      
##  [40] patchwork_1.2.0        tensor_1.5             RSpectra_0.16-1       
##  [43] irlba_2.3.5.1          labeling_0.4.3         progressr_0.14.0      
##  [46] fansi_1.0.6            spatstat.sparse_3.0-3  httr_1.4.7            
##  [49] polyclip_1.10-6        abind_1.4-5            compiler_4.4.0        
##  [52] withr_3.0.0            doParallel_1.0.17      viridis_0.6.5         
##  [55] fastDummies_1.7.3      highr_0.11             MASS_7.3-60.2         
##  [58] rjson_0.2.21           tools_4.4.0            splitstackshape_1.4.8 
##  [61] lmtest_0.9-40          ape_5.8                zip_2.3.1             
##  [64] httpuv_1.6.15          future.apply_1.11.2    goftest_1.2-3         
##  [67] glue_1.7.0             nlme_3.1-164           promises_1.3.0        
##  [70] Rtsne_0.17             cluster_2.1.6          reshape2_1.4.4        
##  [73] generics_0.1.3         gtable_0.3.5           spatstat.data_3.0-4   
##  [76] data.table_1.15.4      xml2_1.3.6             utf8_1.2.4            
##  [79] BiocGenerics_0.50.0    spatstat.geom_3.2-9    RcppAnnoy_0.0.22      
##  [82] ggrepel_0.9.5          RANN_2.6.1             foreach_1.5.2         
##  [85] pillar_1.9.0           stringr_1.5.1          limma_3.60.2          
##  [88] spam_2.10-0            RcppHNSW_0.6.0         later_1.3.2           
##  [91] circlize_0.4.16        splines_4.4.0          lattice_0.22-6        
##  [94] survival_3.5-8         deldir_2.0-4           tidyselect_1.2.1      
##  [97] miniUI_0.1.1.1         pbapply_1.7-2          knitr_1.47            
## [100] gridExtra_2.3          IRanges_2.38.0         svglite_2.1.3         
## [103] scattermore_1.2        stats4_4.4.0           xfun_0.44             
## [106] statmod_1.5.0          matrixStats_1.3.0      stringi_1.8.4         
## [109] lazyeval_0.2.2         yaml_2.3.8             evaluate_0.23         
## [112] codetools_0.2-20       tibble_3.2.1           cli_3.6.2             
## [115] uwot_0.2.2             xtable_1.8-4           reticulate_1.37.0     
## [118] systemfonts_1.1.0      munsell_0.5.1          jquerylib_0.1.4       
## [121] Rcpp_1.0.12            globals_0.16.3         spatstat.random_3.2-3 
## [124] png_0.1-8              parallel_4.4.0         presto_1.0.0          
## [127] dotCall64_1.1-1        listenv_0.9.1          viridisLite_0.4.2     
## [130] scales_1.3.0           ggridges_0.5.6         openxlsx_4.2.5.2      
## [133] crayon_1.5.2           leiden_0.4.3.1         purrr_1.0.2           
## [136] GetoptLong_1.0.5       rlang_1.1.3            cowplot_1.1.3

☰ Menu

Single Cell RNA-Seq Analysis