Part 1: Loading data from CellRanger into R

Our first Markdown document concentrates on getting data into R and setting up our initial object.

Single Cell Analysis with Seurat and some custom code!

Seurat (now Version 4) is a popular R package that is designed for QC, analysis, and exploration of single cell data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. Further, the authors provide several tutorials, on their website.

The expression_data_cellranger.zip file that we have downloaded in previous step contains the single cell matrix files and HDF5 files for three single nuclei RNASeq samples from Becker et al., 2022. After uncompress the file, please make sure that you see three folders: A001-C-007, A001-C-104 and B001-A-301 in the same folder as this R markdown file.

We start each markdown document with loading needed libraries for R:

# must have Seurat
library(Seurat)
library(kableExtra)
library(ggplot2)
library(SoupX)

Setup the experiment folder and data info

experiment_name = "Colon Cancer"
dataset_loc <- "./"
ids <- c("A001-C-007", "A001-C-104", "B001-A-301")

Load the Cell Ranger Matrix Data and create the base Seurat object.

Cell Ranger provides a function cellranger aggr that will combine multiple samples into a single matrix file. However, when processing data in R this is unnecessary and we can quickly aggregate them in R.

Seurat provides a function Read10X and Read10X_h5 to read in 10X data folder. First we read in data from each individual sample folder.

Later, we initialize the Seurat object (CreateSeuratObject) with the raw (non-normalized data). Keep all cells with at least 200 detected genes. Also extracting sample names, calculating and adding in the metadata mitochondrial percentage of each cell. Adding in the metadata batchid and cell cycle. Finally, saving the raw Seurat object.

Load the Cell Ranger Matrix Data (hdf5 file) and create the base Seurat object.

tmp.data <- lapply(ids, function(i){
  d10x <- load10X(file.path(dataset_loc, i, "/outs"))
  d10x <- autoEstCont(d10x)
  out <- adjustCounts(d10x, roundToInt = TRUE)
  colnames(out) <- paste(sapply(strsplit(colnames(out),split="-"),'[[',1L),i,sep="_")
  diff <- out@x - d10x$toc@x
  return(list(x10 = out, diff = diff))
})

d10x.data <- lapply(tmp.data, function(x){x$x10})


p1 <- ggplot(data.frame(diff=tmp.data[[1]]$diff), aes(y=diff)) + geom_boxplot() + scale_y_continuous(trans=scales::pseudo_log_trans(base=10)) + ggtitle("A001-C-007")
p2 <- ggplot(data.frame(diff=tmp.data[[2]]$diff), aes(y=diff)) + geom_boxplot() + scale_y_continuous(trans=scales::pseudo_log_trans(base=10)) + ggtitle("A001-C-104")
p3 <- ggplot(data.frame(diff=tmp.data[[3]]$diff), aes(y=diff)) + geom_boxplot() + scale_y_continuous(trans=scales::pseudo_log_trans(base=10)) + ggtitle("B001-A-301")
p1 + p2 + p3

names(d10x.data) <- ids

str(d10x.data)

## List of 3
##  $ A001-C-007:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:2482307] 60 145 216 237 238 247 258 285 290 298 ...
##   .. ..@ p       : int [1:1797] 0 1549 2236 2703 3241 4013 4642 5748 6285 7444 ...
##   .. ..@ Dim     : int [1:2] 36601 1796
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
##   .. .. ..$ : chr [1:1796] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
##   .. ..@ x       : num [1:2482307] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..@ factors : list()
##  $ A001-C-104:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:4039871] 70 83 86 94 216 237 246 247 282 338 ...
##   .. ..@ p       : int [1:3143] 0 1303 2321 4305 5973 7072 8227 9174 10383 12432 ...
##   .. ..@ Dim     : int [1:2] 36601 3142
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
##   .. .. ..$ : chr [1:3142] "AAACCCACATAAGCAA_A001-C-104" "AAACCCAGTGGAACCA_A001-C-104" "AAACCCAGTTTCGCTC_A001-C-104" "AAACGAAAGAGAGTGA_A001-C-104" ...
##   .. ..@ x       : num [1:4039871] 1 1 3 2 2 1 2 1 1 1 ...
##   .. ..@ factors : list()
##  $ B001-A-301:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:6777011] 86 94 177 228 247 258 298 339 344 362 ...
##   .. ..@ p       : int [1:4515] 0 1572 3541 5325 6373 9028 10945 12986 14689 15717 ...
##   .. ..@ Dim     : int [1:2] 36601 4514
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
##   .. .. ..$ : chr [1:4514] "AAACCCACAGCTATTG_B001-A-301" "AAACCCAGTGTTCCAA_B001-A-301" "AAACGAAGTGGTATGG_B001-A-301" "AAACGAATCATTGGTG_B001-A-301" ...
##   .. ..@ x       : num [1:6777011] 0 1 0 2 1 1 3 2 3 1 ...
##   .. ..@ factors : list()

Create the Seurat object

Filter criteria: remove genes that do not occur in a minimum of 10 cells and remove cells that don’t have a minimum of 200 features

experiment.data <- do.call("cbind", d10x.data)

experiment.SoupX <- CreateSeuratObject(
  experiment.data,
  project = experiment_name,
  min.cells = 10,
  min.features = 300,
  names.field = 2,
  names.delim = "\\_")

experiment.SoupX

## An object of class Seurat 
## 19959 features across 9452 samples within 1 assay 
## Active assay: RNA (19959 features, 0 variable features)

str(experiment.SoupX)

## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:13268276] 39 88 124 136 137 143 151 165 169 175 ...
##   .. .. .. .. .. ..@ p       : int [1:9453] 0 1547 2234 2700 3237 4002 4628 5730 6263 7418 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 19959 9452
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:19959] "AL627309.1" "AL627309.5" "LINC01409" "LINC01128" ...
##   .. .. .. .. .. .. ..$ : chr [1:9452] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
##   .. .. .. .. .. ..@ x       : num [1:13268276] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:13268276] 39 88 124 136 137 143 151 165 169 175 ...
##   .. .. .. .. .. ..@ p       : int [1:9453] 0 1547 2234 2700 3237 4002 4628 5730 6263 7418 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 19959 9452
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:19959] "AL627309.1" "AL627309.5" "LINC01409" "LINC01128" ...
##   .. .. .. .. .. .. ..$ : chr [1:9452] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
##   .. .. .. .. .. ..@ x       : num [1:13268276] 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ scale.data   : num[0 , 0 ] 
##   .. .. .. ..@ key          : chr "rna_"
##   .. .. .. ..@ assay.orig   : NULL
##   .. .. .. ..@ var.features : logi(0) 
##   .. .. .. ..@ meta.features:'data.frame':   19959 obs. of  0 variables
##   .. .. .. ..@ misc         : list()
##   ..@ meta.data   :'data.frame': 9452 obs. of  3 variables:
##   .. ..$ orig.ident  : Factor w/ 3 levels "A001-C-007","A001-C-104",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ nCount_RNA  : num [1:9452] 2057 850 534 597 934 ...
##   .. ..$ nFeature_RNA: int [1:9452] 1535 684 461 531 755 610 1077 526 1140 1218 ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 3 levels "A001-C-007","A001-C-104",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:9452] "AAACCCAAGTTATGGA_A001-C-007" "AAACCCACAACGCCCA_A001-C-007" "AAACCCACAGAAGTTA_A001-C-007" "AAACCCAGTCAGTCCG_A001-C-007" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "Colon Cancer"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 4 1 3
##   ..@ commands    : list()
##   ..@ tools       : list()

The percentage of reads that map to the mitochondrial genome

Low-quality / dying cells often exhibit extensive mitochondrial contamination.
We calculate mitochondrial QC metrics with the PercentageFeatureSet function, which calculates the percentage of counts originating from a set of features.
We use the set of all genes, in mouse these genes can be identified as those that begin with ‘mt’, in human data they begin with MT.

experiment.SoupX$percent.mito <- PercentageFeatureSet(experiment.SoupX, pattern = "^MT-")
summary(experiment.SoupX$percent.mito)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.3303  0.6324  0.8956 14.0784

VlnPlot(
  experiment.SoupX,
  features = c("nFeature_RNA", "nCount_RNA","percent.mito"),
  ncol = 1, pt.size = 0.3)

Plot the distribution of number of cells each gene is represented by.

plot(sort(Matrix::rowSums(GetAssayData(experiment.SoupX) >= 3), decreasing = TRUE) , xlab="gene rank", ylab="number of cells", main="Cells per genes (reads/gene >= 3 )")

Cell filtering

We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 8%, unique UMI counts under 1,000 or greater than 12,000 and contain at least 700 features within them.

table(experiment.SoupX$orig.ident)

## 
## A001-C-007 A001-C-104 B001-A-301 
##       1796       3142       4514

experiment.SoupX <- subset(experiment.SoupX, percent.mito <= 8)

experiment.SoupX <- subset(experiment.SoupX, nCount_RNA >= 500 & nCount_RNA <= 12000)

experiment.SoupX <- subset(experiment.SoupX, nFeature_RNA >= 400 & nFeature_RNA < 4000)

experiment.SoupX

## An object of class Seurat 
## 19959 features across 8984 samples within 1 assay 
## Active assay: RNA (19959 features, 0 variable features)

table(experiment.SoupX$orig.ident)

## 
## A001-C-007 A001-C-104 B001-A-301 
##       1691       2952       4341

Lets se the ridge plots now after filtering

RidgePlot(experiment.SoupX, features=c("nFeature_RNA","nCount_RNA", "percent.mito"), ncol = 2)

experiment.SoupX <- NormalizeData(
  object = experiment.SoupX,
  normalization.method = "LogNormalize",
  scale.factor = 10000)

Calculate Cell-Cycle with Seurat, the list of genes comes with Seurat (only for human)

Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

# this code is for human samples only!
s.genes <- (cc.genes$s.genes)
g2m.genes <- (cc.genes$g2m.genes)

experiment.SoupX <- CellCycleScoring(experiment.SoupX,
                                         s.features = s.genes,
                                         g2m.features = g2m.genes,
                                         set.ident = TRUE)
table(experiment.SoupX@meta.data$Phase) %>%
  kable(caption = "Number of Cells in each Cell Cycle Stage", col.names = c("Stage", "Count"), align = "c") %>%
  kable_styling()

Number of Cells in each Cell Cycle Stage
Stage	Count
G1	4863
G2M	1794
S	2327

Scale the data

ScaleData - Scales and centers genes in the dataset. If variables are provided in vars.to.regress, they are individually regressed against each gene, and the resulting residuals are then scaled and centered unless otherwise specified. Here we regress out cell cycle results S.Score and G2M.Score, percentage mitochondria (percent.mito) and the number of features (nFeature_RNA).

experiment.SoupX <- ScaleData(
  object = experiment.SoupX,
  vars.to.regress = c("S.Score", "G2M.Score", "percent.mito", "nFeature_RNA"))
saveRDS(experiment.SoupX, file="SoupX.RDS")

Fixing the defualt “Ident” in Seurat

experiment.SoupX <- readRDS("SoupX.RDS")
table(Idents(experiment.SoupX))

## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
## 490 452 442 438 427 385 379 370 325 320 317 305 301 271 268 266 247 246 229 227 
##  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37 
## 217 182 181 175 174 154 147 131 128 117 115 102  94  94  93  78  51  43

## So lets change it back to sample name
Idents(experiment.SoupX) <- "orig.ident"
table(Idents(experiment.SoupX))

## 
## A001-C-007 A001-C-104 B001-A-301 
##       1693       2945       4343

experiment.SoupX <- FindVariableFeatures(
  object = experiment.SoupX,
  selection.method = "vst")

length(VariableFeatures(experiment.SoupX))

## [1] 2000

top10 <- head(VariableFeatures(experiment.SoupX), 10)

top10

##  [1] "BEST4"   "CLCA4"   "NRG1"    "TPH1"    "CACNA1A" "LRMP"    "TIMP3"  
##  [8] "CTNNA2"  "TRPM3"   "CA7"

vfp1 <- VariableFeaturePlot(experiment.SoupX)
vfp1 <- LabelPoints(plot = vfp1, points = top10, repel = TRUE)
vfp1

experiment.SoupX <- RunPCA(object = experiment.SoupX, npcs=100)

Principal components plot

DimPlot(object = experiment.SoupX, reduction = "pca")

So how many features should we use? Use too few and your leaving out interesting variation that may define cell types, use too many and you add in noise? maybe?

Lets choose the first 25, based on our prior part.

use.pcs = 1:25

Identifying clusters

experiment.SoupX <- FindNeighbors(experiment.SoupX, reduction="pca", dims = use.pcs)

experiment.SoupX <- FindClusters(
    object = experiment.SoupX,
    resolution = seq(0.25,4,0.5),
    verbose = FALSE
)

experiment.SoupX <- RunUMAP(
  object = experiment.SoupX,
  reduction.use = "pca",
  dims = use.pcs)

Plot uMap coloring by the slot ‘ident’ (default).

DimPlot(object = experiment.SoupX, pt.size=0.5, reduction = "umap", label = T)

Load Seurat object generated without ambient RNA correction

load("clusters_seurat_object.RData")
experiment.aggregate <- subset(experiment.aggregate, cells=colnames(experiment.SoupX))
experiment.SoupX <- subset(experiment.SoupX, cells=colnames(experiment.aggregate))
Idents(experiment.aggregate) <- "RNA_snn_res.1.25"
Idents(experiment.SoupX) <- "RNA_snn_res.1.25"
identical(colnames(experiment.SoupX), colnames(experiment.aggregate))

## [1] TRUE

p1 <- DimPlot(object = experiment.SoupX, pt.size=0.5, reduction = "umap", label = T)
p2 <- DimPlot(object = experiment.aggregate, pt.size=0.5, reduction = "umap", label = T)
p1 + p2

experiment.SoupX$transfer_cluster <- experiment.aggregate$RNA_snn_res.1.25
p1 <- DimPlot(object = experiment.SoupX, group.by = "transfer_cluster", pt.size=0.5, reduction = "umap", label = T)
p2 <- DimPlot(object = experiment.SoupX, pt.size=0.5, reduction = "umap", label = T)
p1 + p2

Session Information

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SoupX_1.6.2        ggplot2_3.4.0      kableExtra_1.3.4   SeuratObject_4.1.3
## [5] Seurat_4.3.0      
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0       Rtsne_0.16             colorspace_2.0-3      
##   [4] deldir_1.0-6           ellipsis_0.3.2         ggridges_0.5.4        
##   [7] rstudioapi_0.14        spatstat.data_3.0-0    farver_2.1.1          
##  [10] leiden_0.4.3           listenv_0.8.0          ggrepel_0.9.2         
##  [13] fansi_1.0.3            xml2_1.3.3             R.methodsS3_1.8.2     
##  [16] codetools_0.2-18       splines_4.2.2          cachem_1.0.6          
##  [19] knitr_1.41             polyclip_1.10-4        jsonlite_1.8.4        
##  [22] ica_1.0-3              cluster_2.1.4          R.oo_1.25.0           
##  [25] png_0.1-8              uwot_0.1.14            shiny_1.7.3           
##  [28] sctransform_0.3.5      spatstat.sparse_3.0-0  compiler_4.2.2        
##  [31] httr_1.4.4             assertthat_0.2.1       Matrix_1.5-3          
##  [34] fastmap_1.1.0          lazyeval_0.2.2         cli_3.4.1             
##  [37] later_1.3.0            htmltools_0.5.3        tools_4.2.2           
##  [40] igraph_1.3.5           gtable_0.3.1           glue_1.6.2            
##  [43] RANN_2.6.1             reshape2_1.4.4         dplyr_1.0.10          
##  [46] Rcpp_1.0.9             scattermore_0.8        jquerylib_0.1.4       
##  [49] vctrs_0.5.1            svglite_2.1.0          nlme_3.1-160          
##  [52] spatstat.explore_3.0-5 progressr_0.11.0       lmtest_0.9-40         
##  [55] spatstat.random_3.0-1  xfun_0.35              stringr_1.4.1         
##  [58] globals_0.16.2         rvest_1.0.3            mime_0.12             
##  [61] miniUI_0.1.1.1         lifecycle_1.0.3        irlba_2.3.5.1         
##  [64] goftest_1.2-3          future_1.29.0          MASS_7.3-58.1         
##  [67] zoo_1.8-11             scales_1.2.1           promises_1.2.0.1      
##  [70] spatstat.utils_3.0-1   parallel_4.2.2         RColorBrewer_1.1-3    
##  [73] yaml_2.3.6             reticulate_1.28        pbapply_1.6-0         
##  [76] gridExtra_2.3          ggrastr_1.0.1          sass_0.4.4            
##  [79] stringi_1.7.8          highr_0.9              systemfonts_1.0.4     
##  [82] rlang_1.0.6            pkgconfig_2.0.3        matrixStats_0.63.0    
##  [85] evaluate_0.18          lattice_0.20-45        tensor_1.5            
##  [88] ROCR_1.0-11            purrr_0.3.5            labeling_0.4.2        
##  [91] patchwork_1.1.2        htmlwidgets_1.5.4      cowplot_1.1.1         
##  [94] tidyselect_1.2.0       parallelly_1.32.1      RcppAnnoy_0.0.20      
##  [97] plyr_1.8.8             magrittr_2.0.3         R6_2.5.1              
## [100] generics_0.1.3         DBI_1.1.3              withr_2.5.0           
## [103] pillar_1.8.1           fitdistrplus_1.1-8     survival_3.4-0        
## [106] abind_1.4-5            sp_1.5-1               tibble_3.1.8          
## [109] future.apply_1.10.0    crayon_1.5.2           KernSmooth_2.23-20    
## [112] utf8_1.2.2             spatstat.geom_3.0-3    plotly_4.10.1         
## [115] rmarkdown_2.18         grid_4.2.2             data.table_1.14.6     
## [118] webshot_0.5.4          digest_0.6.30          xtable_1.8-4          
## [121] tidyr_1.2.1            httpuv_1.6.6           R.utils_2.12.2        
## [124] munsell_0.5.0          beeswarm_0.4.0         viridisLite_0.4.1     
## [127] vipor_0.4.5            bslib_0.4.1

Introduction to Single Cell RNAseq Part 1: SoupX added

UCD Bioinformatics Core