The dataset used in this workshop is a subset of a much larger dataset from a recent study that generated single nuclei transcriptome and chromatin accessibility profiles from colorectal tissue samples.1 The authors isolated 1000 to 10000 nuclei per sample for 81 samples of three types: 48 polyp samples, 27 normal tissue samples, and 6 colorectal cancer (CRC) samples from patients with or without germline APC mutations. They observed a continuum of cell state and composition changes from normal tissue, to polyps, to cancer.
For the purposes of this workshop, we will use one sample from each condition (CRC: A001-C-007, polyp: A001-C-104, and normal: B001-A-301).
Data Setup
Let’s set up a project directory for the analysis, and talk a bit about project philosophy.
1. First, create a directory for your user and the example project in the workshop directory:
cd
mkdir -p /share/workshop/scRNA_workshop/$USER/scrnaseq_example
2a. Next, go into that directory, create a raw data directory (we are going to call this 00-RawData) and cd into that directory. Let’s then create symbolic links to the fastq files that contains the raw read data.
cd /share/workshop/scRNA_workshop/$USER/scrnaseq_example
mkdir 00-RawData
cd 00-RawData/
ln -s /share/workshop/scRNA_workshop/DATA/*.fastq.gz .
This directory now contains the reads for each sample.
2b. Let’s create a sample sheet for the project, and store sample names in a file called samples.txt
cd /share/workshop/scRNA_workshop/$USER/scrnaseq_example/00-RawData
ls *_R1_* |cut -d'_' -f1 > ../samples.txt
cat ../samples.txt
Data Exploration
3. Now, take a look at the raw data directory.
ls /share/workshop/scRNA_workshop/$USER/scrnaseq_example/00-RawData
4. View the contents of the files using the ‘less’ command, when gzipped used ‘zless’ (which is just the ‘less’ command for gzipped files, q to exit):
Read 1
cd /share/workshop/scRNA_workshop/$USER/scrnaseq_example/00-RawData
zless A001-C-007_S4_R1_001.fastq.gz
and Read 2
zless A001-C-007_S4_R2_001.fastq.gz
A detailed explanation of FASTQ file can be found here. Please read on the description and make sure you can identify which lines correspond to a single read and which lines are the header, sequence, and quality values. Press ‘q’ to exit this screen.
Let’s figure out the number of reads in this file. A simple way to do that is to count the number of lines and divide by 4 (because the record of each read uses 4 lines). In order to do this use cat to output the uncompressed file and pipe that to “wc” to count the number of lines:
zcat A001-C-007_S4_R1_001.fastq.gz | wc -l
Divide this number by 4 and you have the number of reads in this file.
expr $(zcat A001-C-007_S4_R1_001.fastq.gz | wc -l) / 4
One more thing to try is to figure out the length of the reads without counting each nucleotide. First get the first 4 lines of the file (i.e. the first record):
zcat A001-C-007_S4_R1_001.fastq.gz | head -4
Note the header lines (1st and 3rd line) and sequence and quality lines (2nd and 4th) in each 4-line fastq block. You can isolate the sequence line:
zcat A001-C-007_S4_R1_001.fastq.gz | head -2 | tail -1
Then, copy and paste the DNA sequence line into the following command (replace [sequence] with the line):
echo -n [sequence] | wc -c
This will give you the length of the read.
Also can do the bash one liner:
echo -n $(zcat A001-C-007_S4_R1_001.fastq.gz | head -2 | tail -1) | wc -c
See if you can figure out how this command works.
Quiz
-
Becker, W. R.; Nevins, S. A.; Chen, D. C.; Chiu, R.; Horning, A. M.; Guha, T. K.; Laquindanum, R.; Mills, M.; Chaib, H.; Ladabaum, U.; Longacre, T.; Shen, J.; Esplin, E. D.; Kundaje, A.; Ford, J. M.; Curtis, C.; Snyder, M. P.; Greenleaf, W. J. Single-Cell Analyses Define a Continuum of Cell State and Composition Changes in the Malignant Transformation of Polyps to Colorectal Cancer. Nat. Genet. 2022. https://doi.org/10.1038/s41588-022-01088-x. ↩