☰ Menu

      Single Cell RNA-Seq Analysis

Home
Introduction and Lectures
Intro to the Workshop and Core
Support
Schedule
Slack
Cheat Sheets
Software and Links
Scripts
GitHub repository
Biocore website
Prerequisites
CLI
R
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
scRNAseq Analysis
Prepare scRNAseq Analysis
Part 1- Create object
Part 2- Filtering
Part 3- Normalization and scaling
Part 4- Dimensionality reduction
Part 5- Clustering and cell type
Part 6- Enrichment and DE
Part 7- Doublet detection
Part 8- Integration

The dataset used in this workshop is a subset of a much larger dataset from a recent study that generated single nuclei transcriptome and chromatin accessibility profiles from colorectal tissue samples. The authors isolated 1000 to 10000 nuclei each from 81 samples of three types: 48 polyp samples, 27 normal tissue samples, and 6 colorectal cancer (CRC) samples from patients with or without germline APC mutations. They observed a continuum of cell state and composition changes from normal tissue, to polyps, to cancer.

For the purposes of this workshop, we will use one sample from each condition (CRC: A001-C-007, polyp: A001-C-104, and normal: B001-A-301).

Source:

Becker, W. R.; Nevins, S. A.; Chen, D. C.; Chiu, R.; Horning, A. M.; Guha, T. K.; Laquindanum, R.; Mills, M.; Chaib, H.; Ladabaum, U.; Longacre, T.; Shen, J.; Esplin, E. D.; Kundaje, A.; Ford, J. M.; Curtis, C.; Snyder, M. P.; Greenleaf, W. J. Single-Cell Analyses Define a Continuum of Cell State and Composition Changes in the Malignant Transformation of Polyps to Colorectal Cancer. Nat. Genet. 2022. https://doi.org/10.1038/s41588-022-01088-x.

Data Setup

Let’s set up a project directory for the analysis, and talk a bit about project philosophy.

1. First, create a directory for your user and the example project in the workshop directory:

cd
mkdir -p /share/workshop/scRNA_workshop/$USER/scrnaseq_example

2a. Next, go into that directory, create a raw data directory (we are going to call this 00-RawData) and cd into that directory. Let’s then create symbolic links to the fastq files that contains the raw read data.

cd /share/workshop/scRNA_workshop/$USER/scrnaseq_example
mkdir 00-RawData
cd 00-RawData/
ln -s /share/workshop/scRNA_workshop/Data/*.fastq.gz .

This directory now contains the reads for each sample.

2b. Let’s create a sample sheet for the project, and store sample names in a file called samples.txt

cd /share/workshop/scRNA_workshop/$USER/scrnaseq_example/00-RawData
ls *_R1_* |cut -d'_' -f1 - > ../samples.txt
cat ../samples.txt

Data Exploration


3. Now, take a look at the raw data directory.

ls /share/workshop/scRNA_workshop/$USER/scrnaseq_example/00-RawData

4. View the contents of the files using the ‘less’ command, when gzipped used ‘zless’ (which is just the ‘less’ command for gzipped files, q to exit):

Read 1

cd /share/workshop/scRNA_workshop/$USER/scrnaseq_example/00-RawData
zless A001-C-007_S4_R1_001.fastq.gz

and Read 2

zless A001-C-007_S4_R2_001.fastq.gz

A detailed explanation of FASTQ file can be found here. Please read on the description and make sure you can identify which lines correspond to a single read and which lines are the header, sequence, and quality values. Press ‘q’ to exit this screen.

Let’s figure out the number of reads in this file. A simple way to do that is to count the number of lines and divide by 4 (because the record of each read uses 4 lines). In order to do this use cat to output the uncompressed file and pipe that to “wc” to count the number of lines:

zcat A001-C-007_S4_R1_001.fastq.gz | wc -l

Divide this number by 4 and you have the number of reads in this file. One more thing to try is to figure out the length of the reads without counting each nucleotide. First get the first 4 lines of the file (i.e. the first record):

zcat A001-C-007_S4_R1_001.fastq.gz  | head -4

Note the header lines (1st and 3rd line) and sequence and quality lines (2nd and 4th) in each 4-line fastq block. You can isolate the sequence line:

zcat A001-C-007_S4_R1_001.fastq.gz | head -2 | tail -1

Then, copy and paste the DNA sequence line into the following command (replace [sequence] with the line):

echo -n [sequence] | wc -c

This will give you the length of the read.

Also can do the bash one liner:

echo -n $(zcat A001-C-007_S4_R1_001.fastq.gz  | head -2 | tail -1) | wc -c

See if you can figure out how this command works.

Quiz