Home
Introduction
Introduction to the Workshop and the Core
Schedule
What is Bioinformatics/Genomics?
Some Experimental Design?
Support
Slack
Zoom
Cheat Sheets
Software and Links
Scripts
Prerequisites
Introduction to the Command Line
Introduction to R
High Performance Computing
Data Reduction ChIPseq/ATACseq
Filetype Description
ChIP and ATAC project setup
Preprocessing Reads
Mapping Reads
Filtering Reads
Peak Call with MACS2
Data Analysis ChIPseq/ATACseq
Differential ChIPseq
Differential ATACseq
R Package Installation
Methylation
Introduction to WGBS
WGBS hands-on
ETC
Closing thoughts
Workshop Photos
Github page
Biocore website

Calling Peaks using MACS2

This document assumes filtering of samples has been completed.

IF for some reason it didn’t finish, is corrupted or you missed the session, you can copy over a completed copy

#cp -r /share/biocore/workshops/2020_Epigenetics/ChIPseq/03-Filter /share/workshop/epigenetics_workshop/$USER/chipseq_example/.
#cp -r /share/biocore/workshops/2020_Epigenetics/ATACseq/03-Filter /share/workshop/epigenetics_workshop/$USER/atacseq_example/.

Model-based Analysis of ChIP-Seq (MACS2)

MACS2 is a tool for identifying “Peaks” in such data as ChIPseq and ATACseq. MACS captures the influence of genome complexity to evaluate the significance of enriched regions.

For ChIP-seq experiments, what we observe from the alignment files is a strand asymmetry with read densities on the +/- strand, centered around the binding site. The 5’ ends of the selected fragments will form groups on the positive- and negative-strand. The distributions of these groups are then assessed using statistical measures and compared against background (input or mock IP samples) to determine if the site of enrichment is likely to be a real binding site.

peaks_figure1

From Wilbanks and Faccioti, PLoS One 2010

Sliding a window of length 2 x bandwidth (= half of estimated sonication size) across genome and determine read counts
Retain windows with counts > MFOLD (fold-enrichment of treatment/back-ground)
PEAKS: probability of an enrichment being stronger than expected
- H0: reads are randomly distributed throughout the genome following a Poisson distribution
- Determine the background distribution (λ) by sliding a window of size 2x fragment size across the background to estimate the local coverage
First lets get ready to run MACS2 and look at the help documentation. We’ll save these results into the folder macs_test.
```
 cd /share/workshop/epigenetics_workshop/$USER/chipseq_example  # We'll run this from the main directory

 module load macs2
 mkdir macs_test
 macs2 callpeak -h
```
Questions
1. What version of MACS2 is this? Is it the most current version?
2. What parameters are required.
Next lets try and run MACS again the input sample on a test sample.

Here the option -t specified the “test” sample (JLDY037G) and -c the “control” sample (JLDYO37L aka “Input”).
```
 macs2 callpeak  -t 03-Filter/JLDY037G/JLDY037G_filtered_blacklisted.bam -c 03-Filter/JLDY037L/JLDY037L_filtered_blacklisted.bam -f BAMPE  -n macs_test/JLDY037G_input  -g mm  --keep-dup all
```
Questions
1. What does the -g and –keep-dup parameters do?
2. How many files did this command produce.
Lets count the number of peaks we found.
```
 wc -l macs_test/JLDY037G_input_peaks.narrowPeak
```
Spoiler Alert, we get Zero peaks.

Questions
1. Go back through our QA/QC, how did this sample look?
2. What might be a possible explaination for this?

Lets do the same, but this time with the IGG sample (JLDY037D) as control.

 macs2 callpeak  -t 03-Filter/JLDY037G/JLDY037G_filtered_blacklisted.bam -c 03-Filter/JLDY037D/JLDY037D_filtered_blacklisted.bam -f BAMPE  -n macs_test/JLDY037G_igg  -g mm  --keep-dup all

Lets count the number of peaks we found.

 wc -l macs_test/JLDY037G_igg_peaks.narrowPeak

Spoiler Alert, we get Zero peaks. Questions

Go back through our QA/QC, how did this sample look?
What might be a possible explaination for this?

Lets do the same again, but this time with the raw IGG sample (JLDY037D) as control. I mapped the raw fastq data for the IGG sample, so no read processing or filtering.
```
 macs2 callpeak  -t 03-Filter/JLDY037G/JLDY037G_filtered_blacklisted.bam -c /share/biocore/workshops/2020_Epigenetics/ChIPseq/01-BWA-INIT/JLDY037D/JLDY037D_bwa.bam -f BAMPE  -n macs_test/JLDY037G_raw_igg  -g mm  --keep-dup all
```
Lets count the number of peaks we found.
```
 wc -l macs_test/JLDY037G_raw_igg_peaks.narrowPeak
```
Spoiler Alert, we get peaks. :) Questions
1. How many peaks do we get?
2. What is the difference between processed and raw IGG Bams?
3. Given the 3 results above, what might be a possible explaination for the first 2 failures in identifying any peaks?
4. Compare (line counts) the narrowPeak file to the bed file produced. Look at the first few lines of each file, whats a difference (range wise)?
5. Take a look at the xls file (Its not excel).

MACS2 Output

sample_peaks.narrowPeak: BED6+4 format file which contains the peak locations together with peak summit, pvalue and qvalue. A narrowPeak (.narrowPeak) file is used by the ENCODE project to provide called peaks of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+4 format, which means the first 6 columns of a standard BED file with 4 additional fields:
sample_peaks.xls: a tabular file which contains information about called peaks. Additional information includes pileup and fold enrichment
sample_summits.bed: peak summits locations for every peak. To find the motifs at the binding sites, this file is recommended

OK so, do we need a control??
```
 macs2 callpeak  -t 03-Filter/JLDY037G/JLDY037G_filtered_blacklisted.bam -f BAMPE  -n macs_test/JLDY037G_noinput -g mm  --keep-dup all
```
Lets count the number of peaks we found.
```
 wc -l macs_test/JLDY037G_noinput_peaks.narrowPeak
```
Spoiler Alert, we get peaks. :) Questions
1. How many peaks do we get?
2. Which run produced the “most” peaks?
3. Ok Given the 4 results above, what should we do?

Peak Calling with MACS2 for ChIPseq

We can now run MACS2 Peak calling across all samples on the real data using a SLURM script, macs2-chipseq.slurm, that we should take a look at now.
```
 cd /share/workshop/epigenetics_workshop/$USER/chipseq_example  # We'll run this from the main directory

 wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-Epigenetics_Workshop/master/software_scripts/scripts/macs2-chipseq.slurm -O macs2-chipseq.slurm
 less macs2-chipseq.slurm
```
#!/bin/bash # #SBATCH --job-name=macs_chip # Job name #SBATCH --nodes=1 #SBATCH --ntasks=1 # Number of cores #SBATCH --mem=4000 # Memory pool for all cores (see also --mem-per-cpu) #SBATCH --time=2:00:00 #SBATCH --array=1-8 #SBATCH --partition=production # Partition to submit to #SBATCH --account=epigenetics # cluster account to use for the job #SBATCH --reservation=epigenetics-workshop # cluster account reservation #SBATCH --output=slurm_out/macs2-%A_%a.out # File to which STDOUT will be written #SBATCH --error=slurm_out/macs2-%A_%a.err # File to which STDERR will be written #SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL #SBATCH --mail-user=myemail@email.com # Email to which notifications will be sent start=`date +%s` echo $HOSTNAME echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID inpath=03-Filter sample=`sed "${SLURM_ARRAY_TASK_ID}q;d" samples.txt | awk -F '\t' '{print $1}'` bam=${inpath}/${sample}/${sample}_filtered_blacklisted.bam input="JLDY037L" input_bam=${inpath}/${input}/${input}_filtered_blacklisted.bam outpath='04-MACS2' [[ -d ${outpath} ]] || mkdir ${outpath} [[ -d ${outpath}/${sample} ]] || mkdir ${outpath}/${sample} echo "SAMPLE: ${sample}" output=${outpath}/${sample}/${sample} module load macs2/2.2.5 THREADS=${SLURM_NTASKS} #THREADS=1 #call="macs2 callpeak -t ${bam} -c ${input_bam} -f BAMPE -n ${output} -g mm --keep-dup all 2> ${output}.log" call="macs2 callpeak -t ${bam} -f BAMPE -n ${output} -g mm --keep-dup all 2> ${output}.log" echo $call eval $call end=`date +%s` runtime=$((end-start)) echo $runtime
After looking at the script, lets run it.
```
 sbatch macs2-atacseq.slurm  # moment of truth!
```
We can watch the progress of our task array using the ‘squeue’ command. Takes about 1:30 hours to process each sample.
```
 squeue -u $USER  # use your username
```

MACS2 for ATAC

We can now run MACS2 Peak calling across all samples on the real data using a SLURM script, macs2-atacseq.slurm, that we should take a look at now.
```
 cd /share/workshop/epigenetics_workshop/$USER/atacseq_example  # We'll run this from the main directory

 wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-Epigenetics_Workshop/master/software_scripts/scripts/macs2-atacseq.slurm -O macs2-atac.slurm
 less macs2-atacseq.slurm
```
#!/bin/bash # #SBATCH --job-name=macs-atac # Job name #SBATCH --nodes=1 #SBATCH --ntasks=1 # Number of cores #SBATCH --mem=4000 # Memory pool for all cores (see also --mem-per-cpu) #SBATCH --time=2 #SBATCH --array=1-6 #SBATCH --partition=production # Partition to submit to #SBATCH --account=epigenetics # cluster account to use for the job #SBATCH --reservation=epigenetics-workshop # cluster account reservation #SBATCH --output=slurm_out/macs2-%A_%a.out # File to which STDOUT will be written #SBATCH --error=slurm_out/macs2-%A_%a.err # File to which STDERR will be written #SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL #SBATCH --mail-user=myemail@email.com # Email to which notifications will be sent start=`date +%s` echo $HOSTNAME echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID inpath=03-Filter sample=`sed "${SLURM_ARRAY_TASK_ID}q;d" samples.txt | awk -F '\t' '{print $1}'` bam=${inpath}/${sample}/${sample}_shifted_filtered_blacklisted.bam outpath='04-MACS2' [[ -d ${outpath} ]] || mkdir ${outpath} [[ -d ${outpath}/${sample} ]] || mkdir ${outpath}/${sample} echo "SAMPLE: ${sample}" output=${outpath}/${sample}/${sample} module load macs2/2.2.5 THREADS=${SLURM_NTASKS} #THREADS=1 call="macs2 callpeak -t ${bam} -f BAMPE -n ${output} -g mm --keep-dup all 2> ${output}.log" echo $call eval $call end=`date +%s` runtime=$((end-start)) echo $runtime
After looking at the script, lets run it.
```
 sbatch macs2-atacseq.slurm  # moment of truth!
```
We can watch the progress of our task array using the ‘squeue’ command. Takes about 1:30 hours to process each sample.
```
 squeue -u $USER  # use your username
```

MultiQC QA/QC Summary of the filter results.

Well there is a MultiQC report for MACS but its pretty pointless to generate.

## Run multiqc to collect statistics and create a report:
cd /share/workshop/epigenetics_workshop/$USER/chipseq_example
module load multiqc/htstream.dev0
multiqc -i ChIPseq-macs2-report -o 04-MACS2-ChIPseq-report ./04-MACS2

Do the same for the ATACseq experiment

Transfer ChIPseq-macs-report_multiqc_report.html and ATACseq-filter-report_multiqc_report.html to your computer and open it in a web browser.

Or in case of emergency, download this copy: ChIPseq-macs2-report_multiqc_report.html and ATACseq-macs2-report_multiqc_report.html for the ATACseq

☰ Menu

Introduction to Analysis of Epigenetic Data 2020