Genome Assembly Workshop 2020

Snakemake Introduction

  1. Overview
  2. Initial Setup
  3. Brief Overview of Commands Using the Example Workflow
  4. Challenge activities and CLI/cluster warmup
  5. Using the config file
  6. Running the snakefile as a job on the cluster

Overview

Snakemake is a flexible, Python-based workflow system.

Snakemake differs from other workflow systems (such as CWL, the Common Workflow Language) in several ways, most notably that workflows are defined in a Python-based language, so ordinary Python code can be embedded directly in the workflow.
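
To give a flavor of the syntax before the full example below, here is a minimal, hypothetical rule (the file names are invented for illustration). A rule declares its input and output files and the command that turns one into the other; Snakemake chains rules together by matching the files they produce and consume.

# A hypothetical stand-alone rule: count the lines of an input file.
rule count_lines:
    input:
        "data/example.txt"
    output:
        "results/example.count"
    shell:
        "wc -l {input} > {output}"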

Initial Setup

keithgmitchell@barbera:/share/biocore/keith/examples/snakemake-demo$ ls
config.json  data  envs  scripts  slurm-24072402.out  Snakefile

Let’s look at the contents of the Snakefile and talk a bit about what it is doing.



# The list of samples to process (see "Using the config file" below for
# reading these from a file instead).
samples = ["A", "B"]

# The target rule: lists the final files the workflow should produce.
rule all:
    input:
        "calls/all.vcf",
        "plots/quals.svg"

# Map each sample's reads against the genome with bwa mem.
# temp() marks the unsorted BAM for deletion once it is no longer needed.
rule bwa:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        temp("mapped/{sample}.bam")
    threads: 8
    shell:
        "module load bwa; "
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

# Sort each BAM by coordinate.
rule sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    shell:
        "module load samtools; "
        "samtools sort -o {output} {input}"

# Joint variant calling across all samples.
rule call:
    input:
        fa="data/genome.fa",
        bam=expand("mapped/{sample}.sorted.bam", sample=samples)
    output:
        "calls/all.vcf"
    shell:
        "module load samtools; "
        "module load bcftools; "
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

# Plot the distribution of call quality scores; runs inside a conda
# environment instead of relying on cluster modules.
rule stats:
    input:
        "calls/all.vcf"
    output:
        report("plots/quals.svg", caption="report/calling.rst")
    conda:
        "envs/stats.yaml"
    script:
        "scripts/plot-quals.py"


Brief Overview of Commands Using the Example Workflow

  1. Prepare the environment for running snakemake:
     module load snakemake/5.6.0
     source activate snakemake
    

    Note: if you have trouble with the source activate snakemake command, try the following:

     conda init bash
     source ~/.bashrc
    
  2. Run the snakemake file as a dry run (the example workflow shown above).
    • This will build a DAG of the jobs to be run without actually executing them.
    • snakemake --dry-run
    • Why is this helpful?
  3. Executing rules of interest.
    • snakemake --dry-run all vs. snakemake --dry-run call vs. snakemake --dry-run bwa
    • Why does this happen? Where is the wildcard specified? (See the note after this list.)
  4. Run the snakemake file in order to produce an image of the DAG of jobs to be run.
    • snakemake --dag | dot -Tsvg > dag.svg
  5. Run snakemake (this time not as a dry run).
    • snakemake --use-conda
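
A note on item 3: a rule whose output contains a wildcard (such as bwa, with {sample}) cannot be requested by rule name, because Snakemake has no way to know which value the wildcard should take. You can, however, request a concrete output file and let Snakemake infer the wildcard from it, for example:

snakemake --dry-run mapped/A.bam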

Challenge activities and CLI/cluster warmup:

Please attempt the advanced questions, but do not worry if they are too difficult. Send me a private message on Slack with your answers and raise your hand on Zoom when you are finished.

  1. Transfer the svg file from the cluster to your local computer to view it after running the --dag command. Just tell me the program or command you used to do this. (5 mins)
  2. Edit the Snakefile to include all three samples. (Hint: look at data/samples to see which one can still be included) Then rerun the Snakefile. (7 mins)
    • It may say “Nothing to be done.” Why is this, and how can you overcome it? (For example, try removing the plots first and see what happens.)
  3. Which lines on the above Snakefile would not work on an HPC? (Hint: why is the stats rule not this way?) (5 mins)
  4. Advanced: We now have a graph of the VCF quality scores, but now we may want to do some more advanced analysis with the VCF file. Create a rule in the Snakefile that strips the header from the VCF file (the lines beginning with ‘##’) so that we are left with only a TSV; a possible skeleton is sketched after this list. Now run snakemake, specifying just the rule you created. Why does running snakemake with this rule not produce any plots? How can you adjust the all rule to run both this new rule and the plot rule? (15 mins)
  5. Advanced: Use the snakemake documentation or Google to find out how to avoid keeping the sorted mappings once they are no longer needed. Then implement this in the file above, clean the files, and rerun. (Hint: we already use this in the Snakefile above.) (7 mins)
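
If you get stuck on challenge 4, one possible skeleton for such a rule is sketched below (the rule name, output path, and the use of grep are only suggestions, not the official answer); you would still need to wire its output into the all rule yourself.

# Hypothetical sketch for challenge 4: drop the '##' header lines from the VCF.
rule strip_header:
    input:
        "calls/all.vcf"
    output:
        "calls/all.tsv"
    shell:
        "grep -v '^##' {input} > {output}"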

Using the config file

We are going to walk through a small example of how the config file can make our Snakefile more robust.

Let’s take a look at the config file (config.json):


 {
     "__default__" :
     {
         "samples_file" : "samples.txt"
     }
 }

Replace samples = ["A", "B"] with the following and clean up your files (rm -rf calls/ mapped/ plots/):


# The "config" dictionary is populated by snakemake itself when run with
# --configfile config.json, so no imports are needed here.
samples = []
with open(config["__default__"]['samples_file'], 'r') as samples_file:
    for line in samples_file:
        if line.strip('\n') != '':
            samples.append(line.strip('\n'))
print(samples)
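
This assumes that samples.txt simply lists one sample name per line, something like:

A
B
C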

Then run the Snakefile, pointing snakemake at our JSON config file:

snakemake --use-conda --configfile config.json

Running the snakefile as a job on the cluster and using the config file

Let’s look at a couple of ways we can run the snakemake workflow on the cluster. The first one, shown here, submits a single job to the cluster that then executes the snakemake --use-conda command. This does not fully utilize snakemake’s capabilities. Go ahead and run this command.

sbatch -t 0:30:00 -n 1 --ntasks 1 --mem 2000 --partition production --account genome_workshop --reservation genome_workshop --wrap='snakemake --use-conda'

After the command above has finished running, clean up the files (rm -rf mapped/ calls/ plots/) and run the line below. Having snakemake submit an sbatch job for each rule gives you more control over the resources for each rule in the pipeline. We will use the cluster.json file, as seen below, for this:

snakemake -j 99 \
    --cluster-config cluster.json \
    --cluster "sbatch -t {cluster.time} --output {cluster.output} --error {cluster.error} --nodes {cluster.nodes} --ntasks {cluster.ntasks} --cpus-per-task {cluster.cpus} --mem {cluster.mem} --partition {cluster.partition} --account {cluster.account} --reservation {cluster.reservation}" \
    --use-conda --latency-wait 50

{
    "__default__" :
    {
        "time" : "0:20:00",
        "nodes": 1,
        "ntasks": 1,
        "cpus" : 1,
        "mem": 2000,
        "output": "snakemake%A.out",
        "error": "snakemake%A.err",
        "reservation": "genome_workshop",
        "partition": "production",
        "account": "genome_workshop"
    }
}
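
Because __default__ is only the fallback, the cluster config can also set resources per rule: snakemake looks up each rule's name as a key before falling back to __default__. For example, to give the multi-threaded bwa rule more CPUs and memory, you could add a block like this alongside __default__ (the values here are illustrative):

"bwa" :
{
    "cpus" : 8,
    "mem" : 8000
}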

What differences do you notice between the two approaches?

Note: In my experience this can be a bit tricky, and it is not something I have entirely mastered myself, but I encourage you to keep working at it to get more comfortable!

Extra resources:

Snakemake Documentation Homepage: https://snakemake.readthedocs.io/