☰ Menu

      Genome-Wide Association Studies

Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
Dr. Anthony Musolf Talk
What is Bioinformatics/Genomics?
Support
Slack
Zoom
Cheat Sheets
Software and Links
Scripts
Prerequisites
Logging In
CLI
R
Cluster Computing
Data Reduction
Files and Filetypes
Project setup
Preprocessing raw data
Alignment with BWA
Variant calling using GATK
Comparison of freebayes, GATK, and deepvariant output
Data Analysis
Plink Step by Step TDT
Plink Step by Step TDT (solutions)
wAnnovar Annotation
Plink Step by Step (Non FBAT excercise)
Setup in R
GWAS Visualization
ETC
Closing thoughts
Workshop Photos
Github page
Biocore website

Plink Step by Step

Outline


About this dataset

What is craniosynostosis?

WGS data was collected from trios with an affected child and unaffected parents. We now have a vcf file as discussed in the data reduction.

Just to be clear lets look at a pedigree of the type of trios we are interested in, and also just do a quick review of pedigrees.

Here would be a female child example of this trio:


Start Group Exercise 1: (25 mins)


Lets do some initial setup

mkdir /share/workshop/gwas_workshop/${USER}
mkdir /share/workshop/gwas_workshop/${USER}/plink
cd /share/workshop/gwas_workshop/${USER}/plink

# this will take say omitting directories and that is ok
cp /share/workshop/plink_template/* .
mkdir 00-Fixed
ln -s /share/workshop/plink_template/00-Fixed/* 00-Fixed/
module load plink/1.90p
sample='21'
outpath='01-Plink_fam_fix'
[[ -d ${outpath} ]] || mkdir ${outpath}
infile="00-Fixed/chr${sample}.subset.vqsr.vcf"

Generate the b file

makebfile="plink --vcf ${infile} --make-bed --out ${outpath}/all_${sample}"
echo $makebfile
eval $makebfile

Lets take a look at the files generated

head ${outpath}/all_${sample}.bim
cat ${outpath}/all_${sample}.bim | wc -l
cat ${outpath}/all_${sample}.log
head ${outpath}/all_${sample}.nosex

Question:

What can you say about the variant at position 10001930? What is the observed variant and what is the reference allele?


Update the .fam file

fam="8_fam.txt"
head ${fam}
famin="plink --bfile ${outpath}/all_${sample} --make-bed --fam ${fam} --out ${outpath}/all_${sample}_fam"
echo $famin
eval $famin

Question:

Amongst your group:


End Group Exercise 1, Break

Discussing the answers to the above question. 5 minute break.

After returning from break we will work on the second half of this document after I give a brief description of the portions involved.


Start Group Exercise 2: (25 mins)


Remove all variants not genotyped to 95%

genotyped="plink --bfile ${outpath}/all_${sample}_fam --geno 0.05 --make-bed --out ${outpath}/clean-missing_${sample}"
echo $genotyped
eval $genotyped
cat ${outpath}/clean-missing_${sample}.bim | wc -l

Question:

What was position of the 5th variant removed?


Remove monomorphic SNPs

mono="plink --bfile ${outpath}/clean-missing_${sample} --maf 0.000000001 --make-bed --out ${outpath}/clean-missing-monos_${sample}"
echo $mono
eval $mono
cat ${outpath}/clean-missing-monos_${sample}.bim | wc -l

Question:

What was the last variant removed?


The following line will create a temporary variant identifier in the .bim file.

cat ${outpath}/clean-missing-monos_${sample}.bim > ${outpath}/tempfile_${sample}.bim
awk '{print $1"\t"$4"\t"$3"\t"$4"\t"$5"\t"$6}' ${outpath}/tempfile_${sample}.bim > ${outpath}/clean-missing-monos_${sample}.bim

Remove all SNPs with “*” as an alt allele (hemizygous snps)

grep "*" ${outpath}/clean-missing-monos_${sample}.bim | awk '{print $4}' > ${outpath}/allHemizgyous_${sample}.txt
hemizy="plink --bfile ${outpath}/clean-missing-monos_${sample} --exclude ${outpath}/allHemizgyous_${sample}.txt --make-bed --out ${outpath}/clean-basic-QC_${sample}"
echo $hemizy
eval $hemizy
head ${outpath}/allHemizgyous_${sample}.txt
cat ${outpath}/allHemizgyous_${sample}.txt | wc -l
cat ${outpath}/clean-missing-monos_${sample}.bim | grep "*" | head
head ${outpath}/clean-missing-monos_${sample}.bim

Question:

How many hemizygous SNPs were there?


End Group Exercise 2, Break

We will take a break here and come back in about 15 minutes after discussing the answers to the above questions.

After returning from break we will work on the second half of this document after I give a brief description of the portions involved.


Start Group Exercise 3 (35 mins):


Filter on common variants for IBD

common="plink --bfile ${outpath}/clean-basic-QC_${sample} --maf 0.4 --make-bed --out ${outpath}/common_${sample}"
echo $common
eval $common

Question:

What percentage of variants are removed at this step? Why do you think the nature of this percentage is fairly extreme?


Calculate IBD

ibdcalc="plink --bfile ${outpath}/common_${sample} --make-bed --out ${outpath}/idb_${sample}"
echo $ibdcalc
eval $ibdcalc

Calculate IBD within and across families (relatives check)

relcheck="plink --bfile ${outpath}/idb_${sample} --genome --rel-check --out ${outpath}/ibd-relcheck_${sample}"
echo $relcheck
eval $relcheck
head ${outpath}/ibd-relcheck_${sample}.genome

Question:

What could be happening with the relationship between 5136-SB-0663 & 5136-SB-0666? (Hint: look at PI HAT)


Get Mendelian Errors

getmendel="plink --bfile ${outpath}/clean-basic-QC_${sample} --mendel --out ${outpath}/Mendelian_${sample}"
echo $getmendel
eval $getmendel
head ${outpath}/Mendelian_${sample}.mendel
head ${outpath}/clean-basic-QC_${sample}.bim

Remove Mendelian Errors

remmendel="plink --bfile ${outpath}/clean-basic-QC_${sample} --exclude ${outpath}/Mendelian_${sample}.mendel --make-bed --out ${outpath}/cleanedME_${sample}"
echo $remmendel
eval $remmendel
head ${outpath}/cleanedME_${sample}.bim

End Group Exercise 3, Break

We will take a break here and come back in about 5 minutes after discussing the answers to the above questions.

After returning from break we will work on the remainder of this document after I give a brief description of the portions involved.


Start Group Exercise 4 (45 mins):


Calculate TDT across families

tdtcall="plink --bfile ${outpath}/cleanedME_${sample} --tdt --freq --adjust --out ${outpath}/tdt_${sample}"
echo $tdtcall
eval $tdtcall

Lets check out the frq file and the tdt file. We will want the information from both of these for later.

head ${outpath}/tdt_${sample}.tdt
head ${outpath}/tdt_${sample}.tdt.adjusted
head ${outpath}/tdt_${sample}.frq

Now on your local computer command line or filezilla/winSCP:

Lets convert our plink output at the point prior to TDT in order to annotate

cd ${outpath}
module load annovar
wannovarvcf="plink --bfile cleanedME_${sample} --recode vcf --out cleanedME_${sample}_vcf"
echo $wannovarvcf
eval $wannovarvcf

convert2="convert2annovar.pl -format vcf4old --withfreq cleanedME_${sample}_vcf.vcf > chr${sample}_anno.avinput"
echo $convert2
eval $convert2
head chr${sample}_anno.avinput

Putting it all together in a script

cd /share/workshop/gwas_workshop/${USER}/plink
mkdir stderr
mkdir stdout

Here is the script we will run take a look with your group and see if you have any questions or if there is anything unclear to the group.


#!/bin/bash
#
#SBATCH --time=60:00 # days-hours
#SBATCH --job-name=plinkrun  # Job name
#SBATCH --array=1-2 # all chroms plus x and y
#SBATCH --nodes=1
#SBATCH --mem=20000 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --reservation=workshop
#SBATCH --reservation=workshop
#SBATCH --account=workshop
#SBATCH --output=stdout/counts-ArrayJob_%A_%a.out # File to which STDOUT will be written
#SBATCH --error=stderr/counts-ArrayJob_%A_%a.err # File to which STDERR will be written
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL
##SBATCH --mail-user=kgmitchell@ucdavis.edu # Email to which notifications will be sent (please change this if you want to uncomment this)


start=`date +%s`

hostname

module load plink/1.90p

sample=`sed "${SLURM_ARRAY_TASK_ID}q;d" chrom.txt`


cd /share/workshop/gwas_workshop/${USER}/plink

outpath='01-Plink_fam_fix'
[[ -d ${outpath} ]] || mkdir ${outpath}


infile="00-Fixed/chr${sample}.subset.vqsr.vcf"

# make b file
makebfile="plink --vcf ${infile} --make-bed --out ${outpath}/all_${sample}"
echo $makebfile
eval $makebfile


# set fam file and update our info
fam="8_fam.txt"
head ${fam}


famin="plink --bfile ${outpath}/all_${sample} --make-bed --fam ${fam} --out ${outpath}/all_${sample}_fam"
echo $famin
eval $famin


# filter based on genotyped amount
genotyped="plink --bfile ${outpath}/all_${sample}_fam --geno 0.05 --make-bed --out ${outpath}/clean-missing_${sample}"
echo $genotyped
eval $genotyped


# remove monomorphic SNPs
mono="plink --bfile ${outpath}/clean-missing_${sample} --maf 0.000000001 --make-bed --out ${outpath}/clean-missing-monos_${sample}"
echo $mono
eval $mono

# create a temp variant identifier
cat ${outpath}/clean-missing-monos_${sample}.bim > ${outpath}/tempfile_${sample}.bim
awk '{print $1"\t"$4"\t"$3"\t"$4"\t"$5"\t"$6}' ${outpath}/tempfile_${sample}.bim > ${outpath}/clean-missing-monos_${sample}.bim


# filter out hemizygous SNPs
grep "*" ${outpath}/clean-missing-monos_${sample}.bim | awk '{print $4}' > ${outpath}/allHemizgyous_${sample}.txt
hemizy="plink --bfile ${outpath}/clean-missing-monos_${sample} --exclude ${outpath}/allHemizgyous_${sample}.txt --make-bed --out ${outpath}/clean-basic-QC_${sample}"
echo $hemizy
eval $hemizy


# filter on common variants for IBD
common="plink --bfile ${outpath}/clean-basic-QC_${sample} --maf 0.005 --make-bed --out ${outpath}/common_${sample}"
echo $common
eval $common


# calculate IBD
ibdcalc="plink --bfile ${outpath}/common_${sample} --make-bed --out ${outpath}/idb_${sample}"
echo $ibdcalc
eval $ibdcalc

# perform relative check
relcheck="plink --bfile ${outpath}/idb_${sample} --genome --rel-check --out ${outpath}/ibd-relcheck_${sample}"
echo $relcheck
eval $relcheck


# get mendelian errors
getmendel="plink --bfile ${outpath}/clean-basic-QC_${sample} --mendel --out ${outpath}/Mendelian_${sample}"
echo $getmendel
eval $getmendel


# remove mendelian errors
remmendel="plink --bfile ${outpath}/clean-basic-QC_${sample} --exclude ${outpath}/Mendelian_${sample}.mendel --make-bed --out ${outpath}/cleanedME_${sample}"
echo $remmendel
eval $remmendel


# calculate tdt with p-value adjustment
# we will use the cleanedME file as that was the most filtered file prior to statistical tests
tdtcall="plink --bfile ${outpath}/cleanedME_${sample} --tdt --freq --adjust --out ${outpath}/tdt_${sample}"
echo $tdtcall
eval $tdtcall


# wAnnovar prep
cd ${outpath}
module load annovar

wannovarvcf="plink --bfile cleanedME_${sample} --recode vcf --out cleanedME_${sample}_vcf"
echo $wannovarvcf
eval $wannovarvcf

convert2="convert2annovar.pl -format vcf4old --withfreq cleanedME_${sample}_vcf.vcf > chr${sample}_anno.avinput"
echo $convert2
eval $convert2

Finally you can submit the jobs to the cluster

sbatch plink_final.slurm

Check the status of your jobs. This should not take long.

squeue -u ${USER}
cat ${outpath}/*.avinput > master.avinput

End Group Exercise 4, Break

We will take a break here and come back in about 15 minutes after discussing the answers to the above questions.

After returning from break we will move onto the wAnnovar annotation with the data we have produced here.