☰ Menu

      mRNA-Seq Workshop

Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
RNA Sequencing Technologies - Dr. Lutz Froenicke
Support
Using Zoom
Using Slack
Cheat Sheets
Software and Links
Scripts
Prerequisites
CLI - Logging in and Transferring Files
CLI - Intro to Command-Line
CLI - Advanced Command-Line (extra)
CLI - Running jobs on the Cluster and using modules
R - Getting Started
R - Intro to R
R - Prepare Data in R (extra)
R - Data in R (extra)
More Materials (extra)
Data Reduction
Files and Filetypes
Prepare dataset
Preprocessing raw data
Indexing a Genome
Alignment with Star
Generating counts tables
Alignment/Counts with Salmon (Extra)
Data analysis
Prepare R for data analysis
Annotation from BioMart
Differential Expression Analysis
Pathway Analysis
Comparison between STAR and Salmon
ETC
Closing thoughts
Workshop Photos
Github page
Report Errors
Biocore website

Alignment to Read Counts & Visualization in IGV

  1. Initial Setup
  2. Mapping vs Assembly
  3. Aligners/Mappers
  4. Mapping against the genome vs transcriptome
  5. Counting reads as a proxy for gene expression
  6. Alignments
  7. Running STAR on the experiment
  8. Quality Assurance - Mapping statistics as QA/QC.

Initial Setup

This document assumes preproc htstream has been completed. IF for some reason it didn’t finish, is corrupted or you missed the session, you can link over a completed copy

cp -r /share/biocore/workshops/2020_mRNAseq_July/HTS_testing /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/.
cp -r /share/biocore/workshops/2020_mRNAseq_July/01-HTS_Preproc /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/.

This document assumes reference indexing has been completed.

IF will need to run this command to update some reference changes our group made last night.

rm -rf  /share/biocore/workshops/2020_mRNAseq/References/star_2.7.3a_index_GRCm38.p6
ln -s /share/biocore/workshops/2020_mRNAseq/References/star_2.7.3a_index_GRCm38.p6 /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/References/.

Mapping vs Assembly

Assembly seeks to put together the puzzle without knowing what the picture is.

Mapping (or alignment to a reference) tries to put together the puzzle pieces directly onto an image of the picture._

Alignment concepts

alignment_figure3

Considerations when mapping

In RNAseq data, you must also consider effect of splice junctions, reads may span an intron.

alignment_figure1


Aligners/Mappers

Many alignment algorithms to choose from. Examples include:

Pseudo-aligners (salmon and kalisto)


Mapping against the genome vs transcriptome

May seem intuitive to map RNAseq data to transcriptome, but it is not that simple.

More so, a aligner will try to map every read, somewhere, provided the alignment meets its minimum requirements.

Genome and Genome Annotation

Genome sequence fasta files and annotation (gff, gtf) files go together! These should be identified at the beginning of analysis.


Counting reads as a proxy for gene expression

The more you can count (and HTS sequencing systems can count a lot) the better the measure of copy number for even rare transcripts in a population.

Technical artifacts should be considered during counting

Options are (STAR, HTSEQ, featureCounts)

The HTSEQ way

Star Implementation

Counts coincide with Htseq-counts under default parameters (union and tests all orientations). Need to specify GTF file at genome generation step or during mapping.

N_unmapped 213761 213761 213761 N_multimapping 132491 132491 132491 N_noFeature 309643 2619257 322976 N_ambiguous 116750 892 47031 ENSMUSG00000102693 0 0 0 ENSMUSG00000064842 0 0 0 ENSMUSG00000051951 0 0 0 ENSMUSG00000102851 0 0 0 ENSMUSG00000103377 0 0 0 ENSMUSG00000104017 0 0 0

Choose the appropriate column given the library preparation characteristics and generate a matrix expression table, columns are samples, rows are genes.

What does stranded and unstranded mean? Which is better and why? Stranded vs Unstraded


Alignments

  1. We are now ready to try an alignment:

     cd /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/HTS_testing
    

    and let’s run STAR (via srun) on the pair of streamed test files we created earlier:

     srun --time=15:00:00 -n 8 --mem=32g --reservation=mrnaseq_workshop --account=mrnaseq_workshop --pty /bin/bash
    

    Once you’ve been given an interactive session we can run STAR. You can ignore the two warnings/errors and you know your on a cluster node because your server will change. Here you see I’m on tadpole, then after the srun command is successful, I am now on drove-13.

    msettles@tadpole:/share/workshop/msettles/rnaseq_example/> HTS_testing$ srun --time=15:00:00 -n 8 --mem=32g --reservation=mrnaseq_workshop --account=mrnaseq_workshop --pty /bin/bash srun: job 29372920 queued and waiting for resources srun: job 29372920 has been allocated resources groups: cannot find name for group ID 2020 bash: /home/msettles/.bashrc: Permission denied msettles@drove-13:/share/workshop/msettles/rnaseq_example/> HTS_testing$
  2. Then run the star commands

     module load star
     STAR \
     --runThreadN 8 \
        --genomeDir ../References/star.overlap100.gencode.M25 \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode GeneCounts \
        --outFileNamePrefix mouse_110_WT_C_R1_ \
        --readFilesCommand zcat \
        --readFilesIn mouse_110_WT_C_R1.fastq.gz mouse_110_WT_C_R2.fastq.gz
    

    In the command, we are telling star to count reads on a gene level (‘–quantMode GeneCounts’), the prefix for all the output files will be mouse_110WT_C.streamed, the command to unzip the files (zcat), and finally, the input file pair.

    Once finished please ‘exit’ the srun session. You’ll know you were successful when your back on tadpole

    msettles@drove-13:/share/workshop/msettles/rnaseq_example/HTS_testing$ exit exit msettles@tadpole:/share/workshop/msettles/rnaseq_example/HTS_testing$

Now let’s take a look at an alignment in IGV.

  1. We first need to index the bam file, will use ‘samtools’ for this step, which is a program to manipulate SAM/BAM files. Take a look at the options for samtools and ‘samtools index’.

    module load samtools
    samtools
    samtools index
    

    We need to index the BAM file:

    cd /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/HTS_testing
    samtools index mouse_110_WT_C_R1_Aligned.sortedByCoord.out.bam
    

    IF for some reason it didn’t finish, is corrupted or you missed the session, you can copy over a completed copy

    cp -r /share/biocore/workshops/2020_mRNAseq_July/HTS_testing/mouse_110_WT_C_R1_Aligned.sortedByCoord.out.bam* /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/HTS_testing
    
  2. Transfer mouse_110_WT_C.streamed_Aligned.sortedByCoord.out.bam and mouse_110_WT_C.streamed_Aligned.sortedByCoord.out.bam.bai (the index file) to your computer using scp or winSCP, or copy/paste from cat [sometimes doesn’t work].

    In Mac/Linux, Windows users use WinSCP. In a new shell session on my laptop. NOT logged into tadpole. Replace [your_username] with your username

     mkdir ~/rnaseq_workshop
     cd ~/rnaseq_workshop
     scp [your_username]@tadpole.genomecenter.ucdavis.edu:/share/workshop/mrnaseq_workshop/[your_username]/rnaseq_example/HTS_testing/mouse_110_WT_C_R1_Aligned.sortedByCoord.out.bam* .
    

    Its ok of the mkdir command fails (“File exists”) because we aleady created the directory earlier.

  3. Now we are ready to use IGV.

    Go to the IGV page at the Broad Institute.

    index_igv1

    And then navigate to the download page, IGV download

    index_igv2

    Here you can download IGV for your respective platform (Window, Mac OSX, Linux), but we are going to use the web application they supply, IGV web app.

    index_igv3

  4. The first thing we want to do is load the Human genome. Click on “Genomes” in the menu and choose “Human (GRCm38/mm10)”.

    index_igv4

  5. Now let’s load the alignment bam and index files. Click on “Tracks” and choose “Local File …”.

    index_igv5

    Navigate to where you transferred the bam and index file and select them both.

    index_igv6

    Now your alignment is loaded. Any loaded bam file aligned to a genome is called a “track”.

    index_igv7

  6. Lets take a look at the alignment associated with the gene Fn1, and if for some reason it doesn’t find HBB (web IGV can be fickle) go to position chr1:71,610,633-71,628,073. If you don’t see any reads, this likely means your in the wrong genome, double check that it says mm10 in the top left.

    index_igv8

    index_igv9

    You can zoom in by clicking on the plus sign (top right) or zoom out by clicking the negative sign (click it twice). You also may have to move around by clicking and dragging in the BAM track window.

    You can also zoom in by clicking and dragging across the number line at the top. That section will highlight, and when you release the button, it will zoom into that section.

    index_igv10

    index_igv11

    Reset the window by searching for Fn1 again, and then zoom in 2 steps.

    index_igv12

  7. See that the reads should be aligning within the exons in the gene. This makes sense, since RNA-Seq reads are from exons. Play with the settings on the right hand side a bit and selecting reads.

    index_igv13

    index_igv14

    index_igv15

    index_igv16

    index_igv17


Running STAR on the experiment

Start Group Exercise:

  1. We can now run STAR across all samples on the real data using a SLURM script, star.slurm, that we should take a look at now.

     cd /share/workshop/mrnaseq_workshop/$USER/rnaseq_example  # We'll run this from the main directory
     wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-mRNA_Seq_Workshop/master/software_scripts/scripts/star.slurm
     less star.slurm
    
    #!/bin/bash
    
     #SBATCH --job-name=star # Job name
     #SBATCH --nodes=1
     #SBATCH --ntasks=8
     #SBATCH --time=60
     #SBATCH --mem=32000 # Memory pool for all cores (see also --mem-per-cpu)
     #SBATCH --partition=production
     #SBATCH --reservation=mrnaseq_workshop
     #SBATCH --account=mrnaseq_workshop
     #SBATCH --array=1-22
     #SBATCH --output=slurmout/star_%A_%a.out # File to which STDOUT will be written
     #SBATCH --error=slurmout/star_%A_%a.err # File to which STDERR will be written
    
     start=`date +%s`
     echo $HOSTNAME
     echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
    
     sample=`sed "${SLURM_ARRAY_TASK_ID}q;d" samples.txt`
     REF="References/star_2.7.3a_index_GRCm38.p6"
    
     outpath='02-STAR_alignment'
     [[ -d ${outpath} ]] || mkdir ${outpath}
     [[ -d ${outpath}/${sample} ]] || mkdir ${outpath}/${sample}
    
     echo "SAMPLE: ${sample}"
    
     module load star
    
     call="STAR
          --runThreadN 8 \
          --genomeDir $REF \
          --outSAMtype BAM SortedByCoordinate \
          --readFilesCommand zcat \
          --readFilesIn 01-HTS_Preproc/${sample}/${sample}_R1.fastq.gz 01-HTS_Preproc/${sample}/${sample}_R2.fastq.gz \
          --quantMode GeneCounts \
          --outFileNamePrefix ${outpath}/${sample}/${sample}_ \
          > ${outpath}/${sample}/${sample}-STAR.stdout 2> ${outpath}/${sample}/${sample}-STAR.stderr"
    
     echo $call
     eval $call
    
     end=`date +%s`
     runtime=$((end-start))
     echo $runtime
     

When you are done, type “q” to exit.

  1. After looking at the script, lets run it.

     sbatch star.slurm  # moment of truth!
    

    We can watch the progress of our task array using the ‘squeue’ command. Takes about 30 minutes to process each sample.

     squeue -u msettles  # use your username
    

Quality Assurance - Mapping statistics as QA/QC.

  1. Once your jobs have finished successfully, check the error and out logs like we did in the previous exercise.

    Use a script of ours, star_stats.sh to collect the alignment stats. Don’t worry about the script’s contents at the moment; you’ll use very similar commands to create a counts table in the next section. For now:

     cd /share/workshop/mrnaseq_workshop/$USER/rnaseq_example  # We'll run this from the main directory
     wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-mRNA_Seq_Workshop/master/software_scripts/scripts/star_stats.sh
     bash star_stats.sh
    
    #!/bin/bash
    
     echo -en "sample_names" > names.txt
     echo -en "total_in_feature\t" > totals.txt
     cat 02-STAR_alignment/*/*ReadsPerGene.out.tab | head -4 | cut -f1 > stats.txt
     cat samples.txt | while read sample; do
         echo ${sample}
         echo -en "\t${sample}" >> names.txt
         head -4 02-STAR_alignment/${sample}/${sample}_ReadsPerGene.out.tab | cut -f4 > temp1
         paste stats.txt temp1 > temp2
         mv temp2 stats.txt
         tail -n +5 02-STAR_alignment/${sample}/${sample}_ReadsPerGene.out.tab | cut -f4 | \
             perl -ne '$tot+=$_ }{ print "$tot\t"' >> totals.txt
     done
     echo -en "\n" >> names.txt
     cat names.txt stats.txt totals.txt > temp1
     mv temp1 summary_star_alignments.txt
     rm stats.txt
     rm names.txt
     rm totals.txt
     
  2. Transfer summary_star_alignments.txt to your computer using scp or winSCP, or copy/paste from cat [sometimes doesn’t work],

    In Mac/Linux, Windows users use WinSCP. In a new shell session on my laptop. NOT logged into tadpole. Replace my [your_username] with your username

     mkdir -p ~/rnaseq_workshop
     cd ~/rnaseq_workshop
     scp [your_username]@tadpole.genomecenter.ucdavis.edu:/share/workshop/mrnaseq_workshop/[your_username]/rnaseq_example/summary_star_alignments.txt .
    

Questions:

  1. Look at the script star.slurm. What does the array=1-22 mean, why is it used, and what is the usage of it in the script itself?
  2. Look through the files in an output directory and check out what is present and discuss what each of them mean. (for example: cd /share/workshop/mrnaseq_workshop/$USER/rnaseq_example/02-STAR_alignment/mouse_110_WT_C )
  3. Come up with a brief command you might use to check that all of the sample alignments using STAR have a reasonable output and/or did not produce any errors.
  4. Open summary_star_alignments.txt in excel (or excel like application), and review. The table that this script creates (“summary_star_alignments.txt”) can be pulled to your laptop via ‘scp’, or WinSCP, etc., and imported into a spreadsheet. Are all samples behaving similarly? Discuss …
  5. If time, find some other regions/genes with high expression using IGV with your group. (Looking at genes the paper references is a great place to start)

Stop Group Exercise