☰ Menu

      Advanced Single Cell RNA-Seq Workshop

Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
Single Cell Sample Preparation - Dr. Diana Burkart-Waco
Support
Cheat Sheets
Software and Links
Scripts
Prerequisites
CLI - Logging in and Transferring Files
CLI - Intro to Command-Line
CLI - Advanced Command-Line (extra)
CLI - Running jobs on the Cluster and using modules
R - Getting Started
R - Intro to R
R - Prepare Data in R (extra)
R - Data in R (extra)
More Materials (extra)
Data Reduction
Generating Expression Matrices
Expression project setup
Preprocessing reads with HTStream
Generating Expression Tables
VDJ T cell and B cell
Velocity analysis
Data analysis
scRNA analysis prepare
Mapping Comparison
Anchoring (Comparison dataset)
Shiny App Install/Overview
App Practical Usage
AWS Hosted App (Optional)
Monocle
VDJ T cell and B cell analysis
Velocity analysis
ETC
Closing thoughts
Workshop Photos
Github page
Biocore website

Generating a Single Cell Gene Expression Matrix

Most analyses have two stages: data reduction and biological analysis. Statistical analyses of scRNA-seq data take as their starting point an expression matrix, where each row represents a gene and each column represents a sample (in scRNAseq a cell). Each entry in the matrix represents the number of reads (expression level) of a particular gene in a given sample (cell). In most cases the number of unique reads (post umi filtering) assigned to that gene in that sample/cell. Generating the expression matrix often involves some, or all, of the following.

Flowchart of the scRNAseq analysis

Flowchart of scRNAseq analysis

Preprocessing and mapping reads

Raw fastq files first need to be preprocessed, extracting any elements that are a part of the sequence read.

The remaining sequences are mapped to a reference.

For 10X genomics datasets, there is the cellranger application. If you’d rather not use cellranger, or have non-10X data there is also STAR solo aligner. For large full-transcript datasets from well annotated organisms (e.g. mouse, human) pseudo-alignment method Salmon is also a good choice for alignment. For full-length datasets with tens- or hundreds of thousands of reads per cell pseudo-aligners become more appealing since their run-time can be several orders of magnitude less than traditional aligners.

Note, if spike-ins are used, the spike-in sequences should be added to the reference sequence prior to mapping.

Gene Counting

Cellranger, STAR solo, and Salmon Alevin all quantify the expression level of each gene for each cell taking into account UMI sequence as a part of its output.

Specific steps to be performed are dependent on the type of library, the element layout of the read, and the sequencing parameters.

scRNAseq Libraries

Generating scRNAseq libraries is currently an active area of research with several protocols being published in the last few years, including:

Differences between the methods are are in how they capture and quantify gene expression (either full-length or tag-based).

Full-length capture tries to achieve a uniform coverage of each transcript (many reads per transcript). Tag-based protocols only capture either the 5’- or 3’-end of each tran script (single read per transcript). Choice in method determines what types of analyses the data can be used for. Full-length capture can be used to distinguish different iso-forms, where tag-based method is best used for only gene abundance.

Regardless of the protocol, you should spend some time to understand the structure of the read after library preparation.

Elements to a 10x read

Elements to a 10x read (V3)

Packages for downstream analysis

There are many and more being developed. For smaller experiments < 5000 cells, the R packages SingleCellExperiment, scater, SC3 are good choices. For larger experiments (> 5000 cells), the R package Seurat offers a complete solution.

A nice page keeping track of single-cell software can be found here.