Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Support
Slack
Zoom
Cheat Sheets
Software and Links
Scripts
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
scRNAseq Analysis - PART1
scRNAseq Analysis - PART2
scRNAseq Analysis - PART3
scRNAseq Analysis - PART4
scRNAseq Analysis - PART5
scRNAseq Analysis - PART6
ETC
Closing thoughts
Workshop Photos
Github page
Biocore website

Last Updated: March 20 2022

Part 2: Some QA/QC, filtering and normalization

Load libraries

library(Seurat)
library(biomaRt)
library(ggplot2)
library(knitr)
library(kableExtra)

Load the Seurat object from part 1

load(file="original_seurat_object.RData")
experiment.aggregate

An object of class Seurat 36601 features across 5803 samples within 1 assay Active assay: RNA (36601 features, 0 variable features)

set.seed(12345)

Some basic QA/QC of the metadata, print tables of the 5% quantiles.

Show 5% quantiles for number of genes per cell per sample

kable(do.call("cbind", tapply(experiment.aggregate$nFeature_RNA,
                      Idents(experiment.aggregate),quantile,probs=seq(0,1,0.05))),
      caption = "5% Quantiles of Genes/Cell by Sample") %>% kable_styling()

5% Quantiles of Genes/Cell by Sample
	conv_COVID	conv_MMR	conv_Tdap	norm_COVID
0%	302.0	300.0	307.00	302
5%	669.2	657.4	675.85	430
10%	846.4	847.0	873.20	530
15%	1011.1	964.0	1021.00	627
20%	1160.2	1067.6	1167.60	728
25%	1323.0	1199.0	1350.75	815
30%	1522.6	1375.2	1618.60	889
35%	1876.8	1638.5	1908.25	940
40%	2123.0	1982.6	2119.00	987
45%	2328.3	2190.6	2282.95	1018
50%	2472.0	2370.0	2451.50	1062
55%	2605.7	2518.0	2567.05	1105
60%	2756.4	2667.0	2679.80	1128
65%	2872.1	2835.3	2820.15	1165
70%	2994.8	3019.6	2971.70	1205
75%	3155.0	3173.5	3144.50	1246
80%	3347.0	3327.8	3304.80	1280
85%	3495.9	3498.9	3477.35	1324
90%	3704.8	3735.8	3712.90	1376
95%	3989.3	4117.3	4071.35	1462
100%	6251.0	6718.0	5395.00	3336

Show 5% quantiles for number of UMI per cell per sample

kable(do.call("cbind", tapply(experiment.aggregate$nCount_RNA,
                                      Idents(experiment.aggregate),quantile,probs=seq(0,1,0.05))),
      caption = "5% Quantiles of UMI/Cell by Sample") %>% kable_styling()

5% Quantiles of UMI/Cell by Sample
	conv_COVID	conv_MMR	conv_Tdap	norm_COVID
0%	394.0	392.0	397.00	410
5%	1021.5	992.0	1020.85	653
10%	1481.2	1429.0	1493.00	838
15%	1882.1	1730.0	1844.00	1004
20%	2241.4	2028.4	2300.00	1168
25%	2797.5	2455.5	2858.25	1347
30%	3858.0	2969.0	4028.20	1553
35%	5685.3	4249.1	6007.75	1858
40%	7320.6	6444.2	7485.20	2026
45%	8346.0	7825.1	8602.00	2172
50%	9222.0	8914.0	9479.00	2288
55%	10050.1	9685.4	10240.95	2384
60%	10834.0	10649.2	11206.60	2521
65%	11821.5	11633.1	12141.35	2681
70%	12781.0	12799.6	13170.80	2890
75%	13934.0	13752.0	14181.00	3092
80%	14956.2	15047.2	15545.00	3354
85%	16350.6	16541.7	16942.25	3508
90%	18092.6	18609.8	18735.90	3810
95%	20486.6	21617.9	22126.75	4173
100%	51229.0	63454.0	48334.00	13527

Show 5% quantiles for number of mitochondrial percentage per cell per sample

kable(round(do.call("cbind", tapply(experiment.aggregate$percent.mito, Idents(experiment.aggregate),quantile,probs=seq(0,1,0.05))), digits = 3),
      caption = "5% Quantiles of Percent Mitochondria by Sample") %>% kable_styling()

5% Quantiles of Percent Mitochondria by Sample
	conv_COVID	conv_MMR	conv_Tdap	norm_COVID
0%	0.000	0.000	0.000	1.473
5%	0.197	0.272	0.238	2.678
10%	0.281	0.364	0.307	3.008
15%	0.342	0.452	0.364	3.272
20%	0.389	0.529	0.425	3.487
25%	0.442	0.598	0.484	3.678
30%	0.504	0.669	0.553	3.878
35%	0.568	0.746	0.615	4.091
40%	0.622	0.834	0.691	4.264
45%	0.689	0.931	0.767	4.547
50%	0.773	1.037	0.868	4.769
55%	0.857	1.150	0.979	5.026
60%	0.964	1.256	1.119	5.352
65%	1.082	1.468	1.274	5.731
70%	1.247	1.772	1.567	6.177
75%	1.563	2.503	1.974	6.914
80%	2.140	3.562	2.752	7.943
85%	3.229	4.814	4.213	9.243
90%	4.496	6.796	5.921	11.608
95%	7.490	8.849	9.446	14.417
100%	20.332	43.994	48.264	23.400

Violin plot of 1) number of genes, 2) number of UMI and 3) percent mitochondrial genes

VlnPlot(
  experiment.aggregate,
  features = c("nFeature_RNA", "nCount_RNA","percent.mito"),
  ncol = 1, pt.size = 0.3)

plot ridge plots of the same data

RidgePlot(experiment.aggregate, features=c("nFeature_RNA","nCount_RNA", "percent.mito"), ncol = 2)

Plot the distribution of number of cells each gene is represented by.

plot(sort(Matrix::rowSums(GetAssayData(experiment.aggregate) >= 3), decreasing = TRUE) , xlab="gene rank", ylab="number of cells", main="Cells per genes (reads/gene >= 3 )")

Gene Plot, scatter plot of gene expression across cells, (colored by sample), drawing horizontal an verticale lines at proposed filtering cutoffs.

FeatureScatter(experiment.aggregate, feature1 = "nCount_RNA", feature2 = "percent.mito", shuffle = TRUE) + geom_vline(xintercept = c(1000,12000)) + geom_hline(yintercept = 8)

FeatureScatter(experiment.aggregate, feature1 = "nFeature_RNA", feature2 = "percent.mito", shuffle = TRUE) + geom_vline(xintercept = 700) + geom_hline(yintercept = 8)

FeatureScatter(
  experiment.aggregate, "nCount_RNA", "nFeature_RNA",
  pt.size = 0.5, shuffle = TRUE)  + geom_vline(xintercept = c(1000,12000)) + geom_hline(yintercept = 700)

Cell filtering

We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 8%, unique UMI counts under 1,000 or greater than 12,000 and contain at least 700 features within them.

table(experiment.aggregate$orig.ident)

conv_COVID conv_MMR conv_Tdap norm_COVID 1415 1875 1612 901

experiment.aggregate <- subset(experiment.aggregate, percent.mito <= 8)

experiment.aggregate <- subset(experiment.aggregate, nCount_RNA >= 1000 & nCount_RNA <= 12000)

experiment.aggregate <- subset(experiment.aggregate, nFeature_RNA >= 700)

experiment.aggregate

An object of class Seurat 36601 features across 3343 samples within 1 assay Active assay: RNA (36601 features, 0 variable features)

table(experiment.aggregate$orig.ident)

conv_COVID conv_MMR conv_Tdap norm_COVID 793 1033 859 658

Lets se the ridge plots now after filtering

RidgePlot(experiment.aggregate, features=c("nFeature_RNA","nCount_RNA", "percent.mito"), ncol = 2)

You may also want to filter out additional genes.

When creating the base Seurat object we did filter out some genes, recall Keep all genes expressed in >= 10 cells. After filtering cells and you may want to be more aggressive with the gene filter. Seurat doesn’t supply such a function (that I can find), so below is a function that can do so, it filters genes requiring a min.value (log-normalized) in at least min.cells, here expression of 1 in at least 400 cells.

experiment.aggregate
FilterGenes <-
 function (object, min.value=1, min.cells = 0, genes = NULL) {
   genes.use <- rownames(object)
   if (!is.null(genes)) {
     genes.use <- intersect(genes.use, genes)
     object@data <- GetAssayData(object)[genes.use, ]
   } else if (min.cells > 0) {
     num.cells <- Matrix::rowSums(GetAssayData(object) > min.value)
     genes.use <- names(num.cells[which(num.cells >= min.cells)])
     object = object[genes.use, ]
   }
  object <- LogSeuratCommand(object = object)
  return(object)
}

experiment.aggregate.genes <- FilterGenes(object = experiment.aggregate, min.value = 1, min.cells = 400)
experiment.aggregate.genes
rm(experiment.aggregate.genes)

Next we want to normalize the data

After filtering out cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method LogNormalize that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and then log-transforms the data.

?NormalizeData

experiment.aggregate <- NormalizeData(
  object = experiment.aggregate,
  normalization.method = "LogNormalize",
  scale.factor = 10000)

Calculate Cell-Cycle with Seurat, the list of genes comes with Seurat (only for human)

Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

# this code is for human samples only!
s.genes <- (cc.genes$s.genes)
g2m.genes <- (cc.genes$g2m.genes)

experiment.aggregate <- CellCycleScoring(experiment.aggregate,
                                         s.features = s.genes,
                                         g2m.features = g2m.genes,
                                         set.ident = TRUE)
table(experiment.aggregate@meta.data$Phase) %>%
  kable(caption = "Number of Cells in each Cell Cycle Stage", col.names = c("Stage", "Count"), align = "c") %>%
  kable_styling()

Number of Cells in each Cell Cycle Stage
Stage	Count
G1	1072
G2M	1486
S	785

For mouse data, we would need to convert the gene lists from human genes to their mouse orthologs using Biomart. Skip this code for the workshop data.

# Mouse Code DO NOT RUN
convertHumanGeneList <- function(x){
  require("biomaRt")
  human = useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", mirror = "uswest")
  mouse = useEnsembl("ensembl", dataset = "mmusculus_gene_ensembl", mirror = "uswest")

  genes = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=T)

  humanx <- unique(genes[, 2])

  # Print the first 6 genes found to the screen
  print(head(humanx))
  return(humanx)
}

m.s.genes <- convertHumanGeneList(cc.genes.updated.2019$s.genes)
m.g2m.genes <- convertHumanGeneList(cc.genes.updated.2019$g2m.genes)

# Create our Seurat object and complete the initialization steps
experiment.aggregate <- CellCycleScoring(experiment.aggregate, s.features = m.s.genes, g2m.features = m.g2m.genes, set.ident = TRUE)

table(experiment.aggregate@meta.data$Phase) %>% kable(caption = "Number of Cells in each Cell Cycle Stage", col.names = c("Stage", "Count"), align = "c") %>% kable_styling()

Fixing the defualt “Ident” in Seurat

table(Idents(experiment.aggregate))

G1 S G2M 1072 785 1486

## So lets change it back to sample name
Idents(experiment.aggregate) <- "orig.ident"
table(Idents(experiment.aggregate))

conv_COVID conv_MMR conv_Tdap norm_COVID 793 1033 859 658

Identify variable genes

The function FindVariableFeatures identifies the most highly variable genes (default 2000 genes) by fitting a line to the relationship of log(variance) and log(mean) using loess smoothing, uses this information to standardize the data, then calculates the variance of the standardized data. This helps avoid selecting genes that only appear variable due to their expression level.

?FindVariableFeatures

experiment.aggregate <- FindVariableFeatures(
  object = experiment.aggregate,
  selection.method = "vst")

length(VariableFeatures(experiment.aggregate))

[1] 2000

top10 <- head(VariableFeatures(experiment.aggregate), 10)

top10

[1] "CCL4L2" "IL3" "CCL4" "CCL3" "GEM" "IFNG" "CSF2" [8] "CCL20" "XCL2" "TRBV4-2"

vfp1 <- VariableFeaturePlot(experiment.aggregate)
vfp1 <- LabelPoints(plot = vfp1, points = top10, repel = TRUE)
vfp1

FindVariableFeatures isn’t the only way to set the “variable features” of a Seurat object. Another reasonable approach is to select a set of “minimally expressed” genes.

dim(experiment.aggregate)

[1] 36601 3343

min.value = 2
min.cells = 10
num.cells <- Matrix::rowSums(GetAssayData(experiment.aggregate, slot = "count") > min.value)
genes.use <- names(num.cells[which(num.cells >= min.cells)])
length(genes.use)

[1] 3941

VariableFeatures(experiment.aggregate) <- genes.use

Question(s)

Finally, lets save the filtered and normalized data

save(experiment.aggregate, file="pre_sample_corrected.RData")

Get the next Rmd file

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2022-March-Single-Cell-RNA-Seq-Analysis/main/data_analysis/scRNA_Workshop-PART3.Rmd", "scRNA_Workshop-PART3.Rmd")

Session Information

sessionInfo()

R version 4.1.2 (2021-11-01) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Monterey 12.0.1 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] kableExtra_1.3.4 knitr_1.37 ggplot2_3.3.5 biomaRt_2.50.3 [5] SeuratObject_4.0.4 Seurat_4.1.0 loaded via a namespace (and not attached): [1] systemfonts_1.0.4 BiocFileCache_2.2.1 plyr_1.8.6 [4] igraph_1.2.11 lazyeval_0.2.2 splines_4.1.2 [7] listenv_0.8.0 scattermore_0.8 GenomeInfoDb_1.30.1 [10] digest_0.6.29 htmltools_0.5.2 fansi_1.0.2 [13] magrittr_2.0.2 memoise_2.0.1 tensor_1.5 [16] cluster_2.1.2 ROCR_1.0-11 globals_0.14.0 [19] Biostrings_2.62.0 matrixStats_0.61.0 svglite_2.1.0 [22] spatstat.sparse_2.1-0 prettyunits_1.1.1 colorspace_2.0-3 [25] rvest_1.0.2 rappdirs_0.3.3 blob_1.2.2 [28] ggrepel_0.9.1 xfun_0.30 dplyr_1.0.8 [31] crayon_1.5.0 RCurl_1.98-1.6 jsonlite_1.8.0 [34] spatstat.data_2.1-2 survival_3.3-1 zoo_1.8-9 [37] glue_1.6.2 polyclip_1.10-0 gtable_0.3.0 [40] zlibbioc_1.40.0 XVector_0.34.0 webshot_0.5.2 [43] leiden_0.3.9 future.apply_1.8.1 BiocGenerics_0.40.0 [46] abind_1.4-5 scales_1.1.1 DBI_1.1.2 [49] spatstat.random_2.1-0 miniUI_0.1.1.1 Rcpp_1.0.8.3 [52] progress_1.2.2 viridisLite_0.4.0 xtable_1.8-4 [55] reticulate_1.24 spatstat.core_2.4-0 bit_4.0.4 [58] stats4_4.1.2 htmlwidgets_1.5.4 httr_1.4.2 [61] RColorBrewer_1.1-2 ellipsis_0.3.2 ica_1.0-2 [64] farver_2.1.0 pkgconfig_2.0.3 XML_3.99-0.9 [67] dbplyr_2.1.1 sass_0.4.0 uwot_0.1.11 [70] deldir_1.0-6 utf8_1.2.2 labeling_0.4.2 [73] tidyselect_1.1.2 rlang_1.0.2 reshape2_1.4.4 [76] later_1.3.0 AnnotationDbi_1.56.2 munsell_0.5.0 [79] tools_4.1.2 cachem_1.0.6 cli_3.2.0 [82] generics_0.1.2 RSQLite_2.2.10 ggridges_0.5.3 [85] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0 [88] yaml_2.3.5 goftest_1.2-3 bit64_4.0.5 [91] fitdistrplus_1.1-8 purrr_0.3.4 RANN_2.6.1 [94] KEGGREST_1.34.0 pbapply_1.5-0 future_1.24.0 [97] nlme_3.1-155 mime_0.12 xml2_1.3.3 [100] compiler_4.1.2 rstudioapi_0.13 filelock_1.0.2 [103] curl_4.3.2 plotly_4.10.0 png_0.1-7 [106] spatstat.utils_2.3-0 tibble_3.1.6 bslib_0.3.1 [109] stringi_1.7.6 highr_0.9 lattice_0.20-45 [112] Matrix_1.4-0 vctrs_0.3.8 pillar_1.7.0 [115] lifecycle_1.0.1 spatstat.geom_2.3-2 lmtest_0.9-39 [118] jquerylib_0.1.4 RcppAnnoy_0.0.19 data.table_1.14.2 [121] cowplot_1.1.1 bitops_1.0-7 irlba_2.3.5 [124] httpuv_1.6.5 patchwork_1.1.1 R6_2.5.1 [127] promises_1.2.0.1 KernSmooth_2.23-20 gridExtra_2.3 [130] IRanges_2.28.0 parallelly_1.30.0 codetools_0.2-18 [133] MASS_7.3-55 assertthat_0.2.1 withr_2.5.0 [136] sctransform_0.3.3 S4Vectors_0.32.3 GenomeInfoDbData_1.2.7 [139] hms_1.1.1 mgcv_1.8-39 parallel_4.1.2 [142] grid_4.1.2 rpart_4.1.16 tidyr_1.2.0 [145] rmarkdown_2.13 Rtsne_0.15 Biobase_2.54.0 [148] shiny_1.7.1

☰ Menu

Single Cell RNA-Seq Analysis