☰ Menu

      Genome Assembly Workshop 2020

Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
Genome Assembly
Introduction to the DNA Tech Core
A Brief Overview of Genome Annotation, with a Focus on the Use of Isoseq
Support
Cheat Sheets
Software and Links
Scripts
Prerequisites
CLI - Logging in and Transferring Files
CLI - Intro to Command-Line
CLI - Advanced Command-Line (extra)
CLI - Running jobs on the Cluster and using modules
Conda
R - Getting Started
R - Intro to R
R - Prepare Data in R (extra)
R - Data in R (extra)
More Materials (extra)
Snakemake
Introduction
Challenge Answers
K-mers
K-mers tutorial
PacBio
Introduction to PacBio HiFi Data and Applications
Genome Assembly with PacBio HiFi Data
Improved Phased Assembly (IPA) Using HiFi Data
Assembling the drosophila genome with IPA and HiFi data
ONT Assembly
Introduction to ONT
Assembly using ONT - Hands-on
Bionano
Optical mapping for accurate genome assembly, comparative genomics, and haplotype segregation
Phase Genomics
Using Proximity to Fix Assembly
Genome Assessment
BUSCO
Additional QA/QC and metrics
ETC
Closing thoughts
Workshop Photos
Github page
Report Errors
Biocore website

Make sure our workspace is setup properly and request resources on the cluster

mkdir -p /share/workshop/genome_assembly/$USER
cd /share/workshop/genome_assembly/$USER

srun -t 06:00:00 -c 24 -n 1 --mem 32000 --partition production --account genome_workshop --reservation genome_workshop --pty /bin/bash
aklog

Assessing Assembly Completeness

The VGP and EBP has published a set of Draft Assembly Statistics to aspire to.

They propose six broad quality categories (see table below)

quality_metrics

The 1st column are split into sub-metrics in the 2nd column. The recommendations for draft to finished qualities (columns 3–6) are based on those achieved in past studies, this study, and what we aspire to.

Quality of an assembly can be expressed in a 3 value notation, X.Y.Q = contig NG50 log10 value; scaffold NG50 log10 value; and QV value.

Assembly metrics

N50
N50 statistic defines assembly quality in terms of contiguity. Given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total genome length. N50 can be described as a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
L50
Given a set of contigs, each with its own length, the L50 count is defined as the smallest number of contigs whose length sum makes up half of genome size.
N90
The N90 statistic is less than or equal to the N50 statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the sum of the lengths of all contigs.
NG50
Note that N50 is calculated in the context of the assembly size rather than the genome size. Therefore, comparisons of N50 values derived from assemblies of significantly different lengths are usually not informative, even if for the same genome. To address this, the authors of the Assemblathon competition came up with a new measure called NG50. The NG50 statistic is the same as N50 except that it is 50% of the known or estimated genome size that must be of the NG50 length or longer. This allows for meaningful comparisons between different assemblies.

Additional Requirements

Our experience has shown that currently, all (combinations of) automated processes generate assemblies with a variety of remaining errors, some of which are relatively easy to address and should be corrected before submission. We therefore propose that a set of quality control criteria are required to be met including:

  1. separation of sequence of the target species from contaminants and other organisms such as symbionts/cobionts

  2. explicit identification of a primary (haploid or pseudo-haploid) assembly, with additional sequence in a secondary bin that may contain either full alternate haplotypes or a set of haplotypic/other sequence from the individual

  3. separation and explicit identification of organellar genomes

  4. only A,C,G,T and N bases and removal of trailing Ns

We also encourage:

  1. identification of discordances between raw data and resulting assembly to locate and remove structural errors (misjoins, missed joins and false duplications)

  2. identification of sex chromosomes where possible

  3. reconciliation with the known karyotype where it exists and this is possible.

Applications

quality_metrics_apps

Preforming some metrics

genometools seqstat

We can use genometools to get a number of assembly stats. In order to get NG50 values we need to provide it the estimate of the genome size determined by genomescope.

mkdir genometools
cd genometools
module load genometools/1.5.9
gt seqstat -contigs -genome 135000000 /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_diploid_ipa_assembly/RUN/14-final/final.p_ctg.fasta > diploid_ipa_assembly.stats
# number of contigs:     208
# genome length:         135000000
# total contigs length:  232197789
#    as % of genome:     172.00 %
# mean contig size:      1116335.52
# contig size first quartile: 78104
# median contig size:         148967
# contig size third quartile: 534433
# longest contig:             23464522
# shortest contig:            50222
# contigs > 500 nt:           208 (100.00 %)
# contigs > 1K nt:            208 (100.00 %)
# contigs > 10K nt:           208 (100.00 %)
# contigs > 100K nt:          135 (64.90 %)
# contigs > 1M nt:            37 (17.79 %)
# N50:                   7970785
# L50:                   9
# N80:                   1615167
# L80:                   29
# NG50:                  12556253
# LG50:                  4
# NG80:                  8941183
# LG80:                  8

Now lets take a look at the purged duplicates result.

gt seqstat -contigs -genome 135000000 /share/workshop/genome_assembly/pacbio_2020_data_drosophila/purge_dup_asm/final.purged.p_ctg.fasta > purged_ipa_assembly.stats
# number of contigs:     70
# genome length:         135000000
# total contigs length:  134186888
#    as % of genome:     99.40 %
# mean contig size:      1916955.54
# contig size first quartile: 105276
# median contig size:         314579
# contig size third quartile: 1030361
# longest contig:             23466353
# shortest contig:            50196
# contigs > 500 nt:           70 (100.00 %)
# contigs > 1K nt:            70 (100.00 %)
# contigs > 10K nt:           70 (100.00 %)
# contigs > 100K nt:          54 (77.14 %)
# contigs > 1M nt:            17 (24.29 %)
# N50:                   13488756
# L50:                   4
# N80:                   2484305
# L80:                   9
# NG50:                  13488756
# LG50:                  4
# NG80:                  2121280
# LG80:                  10

MERQURY

Merqury got alot of its inspiration from KAT K-mer Analysis Toolkit First we need some dependancies. Gonzalo Garcia from the Earlham Institute in the UK contributed to the last Genome Assembly workshop in 2018 discussing k-mers and KAT. Can see his materials here.

mkdir merqury_drosophila
cd merqury_drosophila/
ln -s /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.38X.fasta .
ln -s /share/workshop/genome_assembly/pacbio_2020_data_drosophila/purge_dup_asm/final.purged.*.fasta .

module load R
# Need argparse, ggplot2, scales
Rscript -e 'install.packages(c("argparse", "ggplot2", "scales"),repos = "http://cran.us.r-project.org")'

module load samtools
module load bedtools2
module load java/jdk11.0
module load igvtools

wget https://github.com/marbl/meryl/releases/download/v1.0/meryl-1.0.Linux-amd64.tar.xz
tar -xf meryl-1.0.Linux-amd64.tar.xz
export PATH=$PWD/meryl-1.0/Linux-amd64/bin:$PATH

Creating Meryl counts of the high quality PacBio HiFi data.

meryl count k=21 ELF_19kb.m64001_190914_015449.Q20.38X.fasta output pacbio.meryl
# This takes quite some time 30+ minutes
# Alternatively you can link mine
#ln -s /share/workshop/genome_assembly/msettles/merqury_drosophila/pacbio.meryl .
git clone https://github.com/marbl/merqury.git
export MERQURY=$PWD/merqury
ln -s $MERQURY/merqury.sh

less merqury.sh
./merqury.sh pacbio.meryl final.purged.p_ctg.fasta final.purged.a_ctg.fasta merqury_pacbio
# Took about an hour

spectra_cn and spectra_asm plots

spectra_cn
We can count the canonical k-mers observed in the assembly and in the accurate, whole-genome read set (Here the original PacBio HiFi reads). A typical k-mer spectrum for a heterozygous diploid genome consists of two primary peaks, one representing k-mers that are 1-copy in the diploid genome (heterozygous, on a single haplotype) and one representing those that are 2-copy in the diploid genome (homozygous, on both haplotypes or two copies on one haplotype). The 2-copy k-mers appear with a frequency approximately equal to the average depth of sequencing coverage, where the 1-copy k-mers appear with frequency approximately equal to half the sequencing coverage. Using the multiplicity of the k-mer counts, and modeling the k-mer survival rate (i.e. how many k-mers are unaffected by sequencing error), it is possible to predict the size and repeat content of a genome from the k-mer spectrum alone.

spectra_asm
Similar to the spectra-cn analysis, we can color each k-mer in the read set by the assembly in which it is found. This becomes useful when two haploid assemblies are evaluated. This way, we can detect k-mers that are present only in one assembly, k-mers shared in both assemblies, and k-mers not present in the assembly and only found in the read set.

file -> merqury_pacbio.spectra-cn.fl.png
file -> merqury_pacbio.spectra-asm.fl.png

QV values

We can also use k-mers to estimate the frequency of consensus errors in the assembly. We use a binomial model of k-mer survival, and assume all k-mers in the assembly should be found at least once in the read set. In brief, we estimate the probability P that a base in the assembly is correct.

file -> merqury_pacbio.qv

Assembly ASM_ONLY TOTAL QV ERROR
final.purged.p_ctg 4924 134184714 57.576 1.74744e-06
final.purged.a_ctg 10189 115249809 53.7571 4.21008e-06
Both 15113 249434523 55.3981 2.88528e-06

K-mer completeness

We define a “reliable k-mer” as a k-mer that is truly in the genome and unlikely to be caused by sequencing error. With exact k-mer counts, it is easy to filter out low-copy k-mers that are likely to represent sequencing errors. The k-mer completeness is calculated as the fraction of reliable k-mers in the read set that are also found in the assembly.

file -> completeness.stats

Assembly all ASM TOTAL  
final.purged.p_ctg all 116624064 131494201 88.6914
final.purged.a_ctg all 97678031 131494201 74.2831
both all 128428935 131494201 97.6689

Excercise

  1. Given all the above what is the final X.Y.Q Assessment for this assembly.

  2. Run the above excercise for each of the final resulting assemblies we’ve performed in this workshop, post the X.Y.Q Assessment in the Slack channel.