Variant Discovery using GATK3

In this section, we are going to run GATK (Genome Analysis Toolkit). GATK was created by the Broad Institute for variant analysis and genotyping primarily for the human genome. However, it certainly can be used for any genome.

fc04


1. First, make a directory for this section in your variant_example directory:

cd /share/workshop/<your username>/variant_example
mkdir 04-gatk
cd 04-gatk

Now, link in the files that we will be using. We have already done the alignment, so we will start with the BAM files from the final alignment step (as well as their indices):

ln -s ../02-Alignment/*.sorted.bam .
ln -s ../02-Alignment/*.sorted.bam.bai .

2.. At this point, we have generated bam files that contain all the alignment information for the downstream analysis. But before we go into GATK, there is some information that needs to be added to the BAM file, using “AddOrReplaceReadGroups” from the Picard Tools suite. First load the picard-tools module and look at the usage and available sub-commands, as well as the detailed options for AddOrReplaceReadGroups:

module load picard-tools
picard -h
picard AddOrReplaceReadGroups -H

To your BAM file, we will add A8100 as “Read Group ID”, “Read Group sample name” and “Read group library”. “Read group platform” has to be illumina as the sequencing was done using an Illumina instrument. “Read group platform unit” we are going to set as run:

picard AddOrReplaceReadGroups VALIDATION_STRINGENCY=LENIENT \
I=A8100.chr18.sorted.bam O=A8100.chr18.rg.bam \
RGID=A8100 RGLB=A8100 RGPL=illumina RGPU=run RGSM=A8100

And index the resulting BAM file so that downstream software can easily navigate the BAM file:

module load samtools
samtools index A8100.chr18.rg.bam

3. In the next step, we will start using GATK (Genome Analysis Tool Kit). In order to do so, we need to index the reference in two other ways:

cd ../ref
picard CreateSequenceDictionary R=chr18.fa O=chr18.dict
samtools faidx chr18.fa
cd ../04-gatk

Now load gatk, take a look at the options, and the list of tools:

module load gatk
gatk -h
gatk --list

This sets a script in your path called “gatk” which will call the underlying gatk java file.


4. Now we are ready to use GATK. The first step in GATK variant discovery is to do base quality recalibration using “BaseRecalibrator”. Basically, the quality scores are changed based upon machine cycle, sequence context, and any other covariates. This recalibration changes the quality scores so that the reported score is more accurate with respect to the probability of mismatching the reference genome. The output BAM file from the previous step is used as the input BAM file. We will use the chr18.vcf “ROD” (Reference Ordered Data) file as our known sites. ROD files are merely the regular format of a file, except that they are in the same order, chromosomally, as the reference. GATK expects its known variant files to be ROD files. We will leave the other parameters as default. First download the vcf to your “ref” directory and index it:

cd ../ref
wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-August-Variant-Analysis-Workshop/master/wednesday/chr18.vcf
gatk IndexFeatureFile -F chr18.vcf
cd ../04-gatk

Take a look at the options for gatk and for BaseRecalibrator:

gatk -h
gatk BaseRecalibrator -h

Then, run BaseRecalibrator:

gatk BaseRecalibrator \
-I A8100.chr18.rg.bam -R ../ref/chr18.fa \
--known-sites ../ref/chr18.vcf \
-O A8100.chr18.recal_data.grp

This step (takes about 5 minutes) generates a covariates file. The covariates file is used in the next step to actually carry out the recalibration.


5. The recalibration step is carried out by the GATK command “ApplyBQSR”. Choose your covariate file generated by the previous step as the parameter to bqsr (Base Quality Score Recalibration). The input BAM is the BAM outputted from the add read group step:

gatk ApplyBQSR \
-R ../ref/chr18.fa -I A8100.chr18.rg.bam \
-bqsr A8100.chr18.recal_data.grp \
-O A8100.chr18.recalibrated.bam

This step takes about 4 minutes.


6.. Now we are ready to do the initial per sample variant calling using “HaplotypeCaller”. HaplotypeCaller first identifies regions of interest, determines haplotypes by local re-assembly of the regions, determines the likelihoods of the genotypes, and finally assigns sample genotypes. The input BAM file is the output BAM from the previous step. We will be outputting GVCF files using the -ERC option (Emit Reference Confidence). GVCF files are basically VCF files except with variant information for every position in the genome, regardless of whether there is a variant there or not. These GVCF files are used to make it easier to call genotypes across samples and also make it easier to add new samples into your experiment. However, this step can take a long time to run, so we want to run it in the background and using a command called ‘nohup’ that will allow the command to continue even after you log out. We will also redirect the info output to a file.

nohup gatk HaplotypeCaller \
-R ../ref/chr18.fa -ERC GVCF \
-I A8100.chr18.recalibrated.bam -O A8100.chr18.g.vcf &> A8100.chr18.nohup &

Now this command is running in the background and will continue to run even if you log out. It will take about 3 hours to run. You can watch the progress using the ‘tail’ command:

tail -f A8100.chr18.nohup

Use <Ctrl>-C to exit the tail command.


7. The next step is to run all of the previous commands on the rest of the samples. To do that we need to download a Slurm script:

wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-August-Variant-Analysis-Workshop/master/wednesday/gatk.slurm

Change the permissions:

chmod a+x gatk.slurm

Take a look at it:

cat gatk.slurm

Now, run all the other samples using sbatch:

sbatch gatk.slurm A9004
sbatch gatk.slurm A9006
sbatch gatk.slurm A9014
sbatch gatk.slurm A9018

This will take ~3 hours to run. Once all of the samples finish, you can move on to the next step, which will probably be tomorrow.


8. Next, we need to combine all the gvcf files into one file for genotyping. We will use our known SNPs file as dbsnp:

gatk CombineGVCFs -R ../ref/chr18.fa --dbsnp ../ref/chr18.vcf \
--variant A8100.chr18.g.vcf \
--variant A9004.chr18.g.vcf \
--variant A9006.chr18.g.vcf \
--variant A9014.chr18.g.vcf \
--variant A9018.chr18.g.vcf \
--output all.chr18.g.vcf    

This will take about 30 minutes.


9. Now that we have generated combind GVCF file, we need to run “GenotypeGVCFs” to join together multiple samples’ GVCF files into one VCF file with aggregated genotype likelihoods and with re-annotation. We will use the “–variant” option once for each GVCF file and we will use our known SNPs file as dbsnp:

gatk GenotypeGVCFs -R ../ref/chr18.fa --dbsnp ../ref/chr18.vcf \
--variant all.chr18.g.vcf \
--output all.genotyped.chr18.vcf

This will take about 30 minutes. It will produce a regular VCF file which we will further filter in the following steps. Take a look at the file:

less all.genotyped.chr18.vcf

10. After the variant calling in the previous step, we will use “SelectVariants” to generate the output that only contains SNP variants. The input variant file should be the output from the previous step. Choose “SNP” as the selectType, which will get all variants that are SNPs:

gatk SelectVariants \
-R ../ref/chr18.fa --variant all.genotyped.chr18.vcf -select-type SNP \
-O snps.chr18.vcf

The variant file generated contains only the SNP variants called by GenotypeGVCFs.


11. After the initial variant calling, a filter process needs to be done to generate a list of good variants using some type of criteria. Here we are using a set of criteria on the SNPs. We will use the “VariantFiltration” subprogram from GATK. The variant input file should be the SNP variant (VCF) file from the previous step. Our filter expression is QD<2.0 || MQ<40.0 || FS>60.0 || MQRankSum<-12.5 || ReadPosRankSum<-8.0, which we get from looking at the GATK best practices for small datasets. The “double-pipe” (||) symbol is a logical OR, meaning that if any of those criteria are true then the variant will be tagged. Choose your canonical annotation file (chr18.vcf) as the Mask ROD file and give the filter a name:

gatk VariantFiltration \
-R ../ref/chr18.fa --variant snps.chr18.vcf \
--mask ../ref/chr18.vcf --filter-name "snpsfilter" \
--filter-expression "QD<2.0 || MQ<40.0 || FS>60.0 || MQRankSum<-12.5 || ReadPosRankSum<-8.0" \
--output snps.chr18.tagged.vcf &> varfilter.out &

We put the informational output to a file and run it in the background. This step will tag the variants with a “snpsfilter” tag if they did not pass and a “PASS” tag if they did. It does not remove the variants that did not pass.


12. After the filtration step, we need to further select only the variants that have passed the filter by running “SelectVariants” again. The input variant file is the output variant file from the previous step. Use vc.isNotFiltered() (including the parentheses) as the “select expression”. This “select expression” was found by searching and reading help forums. Leave the other parameters as default.

gatk SelectVariants \
-R ../ref/chr18.fa --variant snps.chr18.tagged.vcf \
-select 'vc.isNotFiltered()' -O snps.chr18.filtered.vcf

Now you have a file which contains SNPs which passed the filters. We can now use these for the next stage in our analysis, which is Effect Prediction.