Last Updated: December 7, 2022
Our first Markdown document concentrates on getting data into R and setting up our initial object.
Seurat (now Version 4) is a popular R package that is designed for QC, analysis, and exploration of single cell data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. Further, the authors provide several tutorials, on their website.
The expression_data_cellranger.zip file that we have downloaded in previous step contains the single cell matrix files and HDF5 files for three single nuclei RNASeq samples from Becker et al., 2022. After uncompress the file, please make sure that you see three folders: A001-C-007, A001-C-104 and B001-A-301 in the same folder as this R markdown file.
We start each markdown document with loading needed libraries for R:
# must have Seurat
library(Seurat)
library(kableExtra)
library(ggplot2)
library(SoupX)
experiment_name = "Colon Cancer"
dataset_loc <- "./"
ids <- c("A001-C-007", "A001-C-104", "B001-A-301")
Cell Ranger provides a function cellranger aggr
that
will combine multiple samples into a single matrix file. However, when
processing data in R this is unnecessary and we can quickly aggregate
them in R.
Seurat provides a function Read10X
and
Read10X_h5
to read in 10X data folder. First we read in
data from each individual sample folder.
Later, we initialize the Seurat object
(CreateSeuratObject
) with the raw (non-normalized data).
Keep all cells with at least 200 detected genes. Also extracting sample
names, calculating and adding in the metadata mitochondrial percentage
of each cell. Adding in the metadata batchid and cell cycle. Finally,
saving the raw Seurat object.
tmp.data <- lapply(ids, function(i){
d10x <- load10X(file.path(dataset_loc, i, "/outs"))
d10x <- autoEstCont(d10x)
out <- adjustCounts(d10x, roundToInt = TRUE)
colnames(out) <- paste(sapply(strsplit(colnames(out),split="-"),'[[',1L),i,sep="_")
diff <- out@x - d10x$toc@x
return(list(x10 = out, diff = diff))
})
d10x.data <- lapply(tmp.data, function(x){x$x10})
p1 <- ggplot(data.frame(diff=tmp.data[[1]]$diff), aes(y=diff)) + geom_boxplot() + scale_y_continuous(trans=scales::pseudo_log_trans(base=10)) + ggtitle("A001-C-007")
p2 <- ggplot(data.frame(diff=tmp.data[[2]]$diff), aes(y=diff)) + geom_boxplot() + scale_y_continuous(trans=scales::pseudo_log_trans(base=10)) + ggtitle("A001-C-104")
p3 <- ggplot(data.frame(diff=tmp.data[[3]]$diff), aes(y=diff)) + geom_boxplot() + scale_y_continuous(trans=scales::pseudo_log_trans(base=10)) + ggtitle("B001-A-301")
p1 + p2 + p3
names(d10x.data) <- ids
str(d10x.data)
## List of 3
## $ A001-C-007:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:2482307] 60 145 216 237 238 247 258 285 290 298 ...
## .. ..@ p : int [1:1797] 0 1549 2236 2703 3241 4013 4642 5748 6285 7444 ...
## .. ..@ Dim : int [1:2] 36601 1796
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. .. ..$ : chr [1:1796] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
## .. ..@ x : num [1:2482307] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..@ factors : list()
## $ A001-C-104:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:4039871] 70 83 86 94 216 237 246 247 282 338 ...
## .. ..@ p : int [1:3143] 0 1303 2321 4305 5973 7072 8227 9174 10383 12432 ...
## .. ..@ Dim : int [1:2] 36601 3142
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. .. ..$ : chr [1:3142] "AAACCCACATAAGCAA_A001-C-104" "AAACCCAGTGGAACCA_A001-C-104" "AAACCCAGTTTCGCTC_A001-C-104" "AAACGAAAGAGAGTGA_A001-C-104" ...
## .. ..@ x : num [1:4039871] 1 1 3 2 2 1 2 1 1 1 ...
## .. ..@ factors : list()
## $ B001-A-301:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:6777011] 86 94 177 228 247 258 298 339 344 362 ...
## .. ..@ p : int [1:4515] 0 1572 3541 5325 6373 9028 10945 12986 14689 15717 ...
## .. ..@ Dim : int [1:2] 36601 4514
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. .. ..$ : chr [1:4514] "AAACCCACAGCTATTG_B001-A-301" "AAACCCAGTGTTCCAA_B001-A-301" "AAACGAAGTGGTATGG_B001-A-301" "AAACGAATCATTGGTG_B001-A-301" ...
## .. ..@ x : num [1:6777011] 0 1 0 2 1 1 3 2 3 1 ...
## .. ..@ factors : list()
Filter criteria: remove genes that do not occur in a minimum of 10 cells and remove cells that don’t have a minimum of 200 features
experiment.data <- do.call("cbind", d10x.data)
experiment.SoupX <- CreateSeuratObject(
experiment.data,
project = experiment_name,
min.cells = 10,
min.features = 300,
names.field = 2,
names.delim = "\\_")
experiment.SoupX
## An object of class Seurat
## 19959 features across 9452 samples within 1 assay
## Active assay: RNA (19959 features, 0 variable features)
str(experiment.SoupX)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:13268276] 39 88 124 136 137 143 151 165 169 175 ...
## .. .. .. .. .. ..@ p : int [1:9453] 0 1547 2234 2700 3237 4002 4628 5730 6263 7418 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 19959 9452
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:19959] "AL627309.1" "AL627309.5" "LINC01409" "LINC01128" ...
## .. .. .. .. .. .. ..$ : chr [1:9452] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
## .. .. .. .. .. ..@ x : num [1:13268276] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:13268276] 39 88 124 136 137 143 151 165 169 175 ...
## .. .. .. .. .. ..@ p : int [1:9453] 0 1547 2234 2700 3237 4002 4628 5730 6263 7418 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 19959 9452
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:19959] "AL627309.1" "AL627309.5" "LINC01409" "LINC01128" ...
## .. .. .. .. .. .. ..$ : chr [1:9452] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
## .. .. .. .. .. ..@ x : num [1:13268276] 1 1 1 1 1 1 1 1 1 1 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ scale.data : num[0 , 0 ]
## .. .. .. ..@ key : chr "rna_"
## .. .. .. ..@ assay.orig : NULL
## .. .. .. ..@ var.features : logi(0)
## .. .. .. ..@ meta.features:'data.frame': 19959 obs. of 0 variables
## .. .. .. ..@ misc : list()
## ..@ meta.data :'data.frame': 9452 obs. of 3 variables:
## .. ..$ orig.ident : Factor w/ 3 levels "A001-C-007","A001-C-104",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nCount_RNA : num [1:9452] 2057 850 534 597 934 ...
## .. ..$ nFeature_RNA: int [1:9452] 1535 684 461 531 755 610 1077 526 1140 1218 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 3 levels "A001-C-007","A001-C-104",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:9452] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "Colon Cancer"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 4 1 3
## ..@ commands : list()
## ..@ tools : list()
experiment.SoupX$percent.mito <- PercentageFeatureSet(experiment.SoupX, pattern = "^MT-")
summary(experiment.SoupX$percent.mito)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.3303 0.6324 0.8956 14.0784
VlnPlot(
experiment.SoupX,
features = c("nFeature_RNA", "nCount_RNA","percent.mito"),
ncol = 1, pt.size = 0.3)
Plot the distribution of number of cells each gene is represented by.
plot(sort(Matrix::rowSums(GetAssayData(experiment.SoupX) >= 3), decreasing = TRUE) , xlab="gene rank", ylab="number of cells", main="Cells per genes (reads/gene >= 3 )")
We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 8%, unique UMI counts under 1,000 or greater than 12,000 and contain at least 700 features within them.
table(experiment.SoupX$orig.ident)
##
## A001-C-007 A001-C-104 B001-A-301
## 1796 3142 4514
experiment.SoupX <- subset(experiment.SoupX, percent.mito <= 8)
experiment.SoupX <- subset(experiment.SoupX, nCount_RNA >= 500 & nCount_RNA <= 12000)
experiment.SoupX <- subset(experiment.SoupX, nFeature_RNA >= 400 & nFeature_RNA < 4000)
experiment.SoupX
## An object of class Seurat
## 19959 features across 8984 samples within 1 assay
## Active assay: RNA (19959 features, 0 variable features)
table(experiment.SoupX$orig.ident)
##
## A001-C-007 A001-C-104 B001-A-301
## 1691 2952 4341
Lets se the ridge plots now after filtering
RidgePlot(experiment.SoupX, features=c("nFeature_RNA","nCount_RNA", "percent.mito"), ncol = 2)
experiment.SoupX <- NormalizeData(
object = experiment.SoupX,
normalization.method = "LogNormalize",
scale.factor = 10000)
Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq
# this code is for human samples only!
s.genes <- (cc.genes$s.genes)
g2m.genes <- (cc.genes$g2m.genes)
experiment.SoupX <- CellCycleScoring(experiment.SoupX,
s.features = s.genes,
g2m.features = g2m.genes,
set.ident = TRUE)
table(experiment.SoupX@meta.data$Phase) %>%
kable(caption = "Number of Cells in each Cell Cycle Stage", col.names = c("Stage", "Count"), align = "c") %>%
kable_styling()
Stage | Count |
---|---|
G1 | 4863 |
G2M | 1794 |
S | 2327 |
ScaleData - Scales and centers genes in the dataset. If variables are provided in vars.to.regress, they are individually regressed against each gene, and the resulting residuals are then scaled and centered unless otherwise specified. Here we regress out cell cycle results S.Score and G2M.Score, percentage mitochondria (percent.mito) and the number of features (nFeature_RNA).
experiment.SoupX <- ScaleData(
object = experiment.SoupX,
vars.to.regress = c("S.Score", "G2M.Score", "percent.mito", "nFeature_RNA"))
saveRDS(experiment.SoupX, file="SoupX.RDS")
experiment.SoupX <- readRDS("SoupX.RDS")
table(Idents(experiment.SoupX))
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 490 452 442 438 427 385 379 370 325 320 317 305 301 271 268 266 247 246 229 227
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
## 217 182 181 175 174 154 147 131 128 117 115 102 94 94 93 78 51 43
## So lets change it back to sample name
Idents(experiment.SoupX) <- "orig.ident"
table(Idents(experiment.SoupX))
##
## A001-C-007 A001-C-104 B001-A-301
## 1693 2945 4343
experiment.SoupX <- FindVariableFeatures(
object = experiment.SoupX,
selection.method = "vst")
length(VariableFeatures(experiment.SoupX))
## [1] 2000
top10 <- head(VariableFeatures(experiment.SoupX), 10)
top10
## [1] "BEST4" "CLCA4" "NRG1" "TPH1" "CACNA1A" "LRMP" "TIMP3"
## [8] "CTNNA2" "TRPM3" "CA7"
vfp1 <- VariableFeaturePlot(experiment.SoupX)
vfp1 <- LabelPoints(plot = vfp1, points = top10, repel = TRUE)
vfp1
experiment.SoupX <- RunPCA(object = experiment.SoupX, npcs=100)
Principal components plot
DimPlot(object = experiment.SoupX, reduction = "pca")
Lets choose the first 25, based on our prior part.
use.pcs = 1:25
experiment.SoupX <- FindNeighbors(experiment.SoupX, reduction="pca", dims = use.pcs)
experiment.SoupX <- FindClusters(
object = experiment.SoupX,
resolution = seq(0.25,4,0.5),
verbose = FALSE
)
experiment.SoupX <- RunUMAP(
object = experiment.SoupX,
reduction.use = "pca",
dims = use.pcs)
Plot uMap coloring by the slot ‘ident’ (default).
DimPlot(object = experiment.SoupX, pt.size=0.5, reduction = "umap", label = T)
Load Seurat object generated without ambient RNA correction
load("clusters_seurat_object.RData")
experiment.aggregate <- subset(experiment.aggregate, cells=colnames(experiment.SoupX))
experiment.SoupX <- subset(experiment.SoupX, cells=colnames(experiment.aggregate))
Idents(experiment.aggregate) <- "RNA_snn_res.1.25"
Idents(experiment.SoupX) <- "RNA_snn_res.1.25"
identical(colnames(experiment.SoupX), colnames(experiment.aggregate))
## [1] TRUE
p1 <- DimPlot(object = experiment.SoupX, pt.size=0.5, reduction = "umap", label = T)
p2 <- DimPlot(object = experiment.aggregate, pt.size=0.5, reduction = "umap", label = T)
p1 + p2
experiment.SoupX$transfer_cluster <- experiment.aggregate$RNA_snn_res.1.25
p1 <- DimPlot(object = experiment.SoupX, group.by = "transfer_cluster", pt.size=0.5, reduction = "umap", label = T)
p2 <- DimPlot(object = experiment.SoupX, pt.size=0.5, reduction = "umap", label = T)
p1 + p2
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] SoupX_1.6.2 ggplot2_3.4.0 kableExtra_1.3.4 SeuratObject_4.1.3
## [5] Seurat_4.3.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 Rtsne_0.16 colorspace_2.0-3
## [4] deldir_1.0-6 ellipsis_0.3.2 ggridges_0.5.4
## [7] rstudioapi_0.14 spatstat.data_3.0-0 farver_2.1.1
## [10] leiden_0.4.3 listenv_0.8.0 ggrepel_0.9.2
## [13] fansi_1.0.3 xml2_1.3.3 R.methodsS3_1.8.2
## [16] codetools_0.2-18 splines_4.2.2 cachem_1.0.6
## [19] knitr_1.41 polyclip_1.10-4 jsonlite_1.8.4
## [22] ica_1.0-3 cluster_2.1.4 R.oo_1.25.0
## [25] png_0.1-8 uwot_0.1.14 shiny_1.7.3
## [28] sctransform_0.3.5 spatstat.sparse_3.0-0 compiler_4.2.2
## [31] httr_1.4.4 assertthat_0.2.1 Matrix_1.5-3
## [34] fastmap_1.1.0 lazyeval_0.2.2 cli_3.4.1
## [37] later_1.3.0 htmltools_0.5.3 tools_4.2.2
## [40] igraph_1.3.5 gtable_0.3.1 glue_1.6.2
## [43] RANN_2.6.1 reshape2_1.4.4 dplyr_1.0.10
## [46] Rcpp_1.0.9 scattermore_0.8 jquerylib_0.1.4
## [49] vctrs_0.5.1 svglite_2.1.0 nlme_3.1-160
## [52] spatstat.explore_3.0-5 progressr_0.11.0 lmtest_0.9-40
## [55] spatstat.random_3.0-1 xfun_0.35 stringr_1.4.1
## [58] globals_0.16.2 rvest_1.0.3 mime_0.12
## [61] miniUI_0.1.1.1 lifecycle_1.0.3 irlba_2.3.5.1
## [64] goftest_1.2-3 future_1.29.0 MASS_7.3-58.1
## [67] zoo_1.8-11 scales_1.2.1 promises_1.2.0.1
## [70] spatstat.utils_3.0-1 parallel_4.2.2 RColorBrewer_1.1-3
## [73] yaml_2.3.6 reticulate_1.28 pbapply_1.6-0
## [76] gridExtra_2.3 ggrastr_1.0.1 sass_0.4.4
## [79] stringi_1.7.8 highr_0.9 systemfonts_1.0.4
## [82] rlang_1.0.6 pkgconfig_2.0.3 matrixStats_0.63.0
## [85] evaluate_0.18 lattice_0.20-45 tensor_1.5
## [88] ROCR_1.0-11 purrr_0.3.5 labeling_0.4.2
## [91] patchwork_1.1.2 htmlwidgets_1.5.4 cowplot_1.1.1
## [94] tidyselect_1.2.0 parallelly_1.32.1 RcppAnnoy_0.0.20
## [97] plyr_1.8.8 magrittr_2.0.3 R6_2.5.1
## [100] generics_0.1.3 DBI_1.1.3 withr_2.5.0
## [103] pillar_1.8.1 fitdistrplus_1.1-8 survival_3.4-0
## [106] abind_1.4-5 sp_1.5-1 tibble_3.1.8
## [109] future.apply_1.10.0 crayon_1.5.2 KernSmooth_2.23-20
## [112] utf8_1.2.2 spatstat.geom_3.0-3 plotly_4.10.1
## [115] rmarkdown_2.18 grid_4.2.2 data.table_1.14.6
## [118] webshot_0.5.4 digest_0.6.30 xtable_1.8-4
## [121] tidyr_1.2.1 httpuv_1.6.6 R.utils_2.12.2
## [124] munsell_0.5.0 beeswarm_0.4.0 viridisLite_0.4.1
## [127] vipor_0.4.5 bslib_0.4.1