☰ Menu

      Genome Assembly Workshop 2020

Introduction and Lectures
Intro to the Workshop and Core
Genome Assembly
Introduction to the DNA Tech Core
A Brief Overview of Genome Annotation, with a Focus on the Use of Isoseq
Cheat Sheets
Software and Links
CLI - Logging in and Transferring Files
CLI - Intro to Command-Line
CLI - Advanced Command-Line (extra)
CLI - Running jobs on the Cluster and using modules
R - Getting Started
R - Intro to R
R - Prepare Data in R (extra)
R - Data in R (extra)
More Materials (extra)
Challenge Answers
K-mers tutorial
Introduction to PacBio HiFi Data and Applications
Genome Assembly with PacBio HiFi Data
Improved Phased Assembly (IPA) Using HiFi Data
Assembling the drosophila genome with IPA and HiFi data
ONT Assembly
Introduction to ONT
Assembly using ONT - Hands-on
Optical mapping for accurate genome assembly, comparative genomics, and haplotype segregation
Phase Genomics
Using Proximity to Fix Assembly
Genome Assessment
Additional QA/QC and metrics
Closing thoughts
Workshop Photos
Github page
Report Errors
Biocore website

Make sure our workspace is setup properly

mkdir -p /share/workshop/genome_assembly/$USER
cd /share/workshop/genome_assembly/$USER

K-mer frequency plots

We can use K-mers (sequences of length K) to estimate biases, repeat content, sequencing coverage, and heterozygosity.

It is always a good idea to plot K-mer frequencies to get a picture of the genome composition and sequencing coverage/quality. For example, haploid and diploid genomes have differing K-mer distributions.


The 31-mer histograms of paired-end sequence data. Each histogram shows a bimodal distribution typical of a diploid heterozygous genome. The relative fraction of the distribution under the left (haploid) peak is proportional to the genome heterozygosity. Using the relative proportions of the two peaks the genomes can be ranked by their heterozygosity.

Kristian A. Stevens, Keith Woeste, Sandeep Chakraborty, Marc W. Crepeau, Charles A. Leslie, Pedro J. Martínez-García, Daniela Puiu, Jeanne Romero-Severson, Mark Coggeshall, Abhaya M. Dandekar, Daniel Kluepfel, David B. Neale, Steven L. Salzberg, Charles H. Langley, Genomic Variation Among and Within Six Juglans Species, G3: GENES, GENOMES, GENETICS July 1, 2018 vol. 8 no. 7 2153-2165. doi: 10.1534/g3.118.200030."

What are K-mers

A K-mer is a substring of length K in a string of DNA bases. For example: All 2-mers of the sequence AATTGGCCG are AA, AT, TT, TG, GG, GC, CC, CG. Similarly, all 3-mers of the sequence AATTGGCCG are AAT, ATT, TTG, TGG, GGC, GCC, CCG. There are an exponentially increasing number of possible K-mers for increasing numbers of K. There are 16 possible 2-mers for DNA if we assume there are only 4 types of bases (A,T,G,C). The equation for the number of K-mers possible for a given K is then 4^K.

Bases K-mer size Total possible kmers
4 1 4
4 2 16
4 3 64
4 4 256
4 5 1,024
4 6 4,096
4 7 16,384
4 8 65,536
4 9 262,144
4 10 1,048,576
4 21 4.4e+12
4 27 1.8e+16
4 31 4.6e+18

As you can see, there are 64 possibilities for a 3-mer and over 4 Trillion possibilities for a 21-mer!

For a given sequence of length L, and a K-mer size of K, the total k-mer’s possible will be given by ( L – k ) + 1

e.g. For the sequence of length of 14 , and a K-mer length of 8, the number of K-mer’s generated will be:


n   = ( L - K ) + 1
    = (14 - 8 ) + 1
    = 7


For shorter fragments, as in the above example, the total number of K-mers estimated is n = 7, which it is not close to actual fragment size of L which is 14 bps. But for larger fragments , the total number of K-mer’s (n) provide a good approximation to the actual genome size. The following table tries to illustrate the approximation:


Genome Sizes Total K-mers of k=18 % error in genome estimation
L N=(L-K)+1  
100 83 17
1000 983 1.7
10000 9983 0.17
100000 99983 0.017
1000000 999983 0.0017

So for a genome size of 1 Mb and K-mer size of 18, the error between estimation and reality is only .0017%. Which is a very good approximation of actual size.

In choosing a K-mer size, it should be large enough to allow the K-mer to map uniquely to the genome. So the total available K-mers should be sufficiently larger than the genome size and therefor has the ability to store all the K-mers in the genome (a K-mer size of 21 is large enough for most genomes). However, too large K-mers leads to need for substantial computational resources, as well as producing more erroneous K-mers caused by sequencing errors. In other words, the higher error rate in the sequencing data, the smaller k-mer size should be used. Extremely large (haploid size »10GB) and/or very repetitive genomes may benefit from larger kmer lengths to increase the number of unique k-mers.

Genome copies

A significant issue that we face in a real genome sequencing project is non-uniform coverage of genome. This is attributed to both technical and biological variables.

ex: amplification bias of certain genomic regions during PCR (a step in preparation of Illumina sequencing libraries) [Technical] and presence of repetitive sequences in genome [Biological].

Extending the above example, if we had 10 copies (C) of 1Mb genome, then the total no of K-mer’s seen (n) of length k = 21 will be 9999700.

n   = [( L - k ) + 1 ] * C
    = [(1000000 - 21 ) + 1] * 10
    = 9999700

To get the fragement size (ala genome size), we simply need to divide the total by the number of copies:

L   = n / C
    = 9999700 / 10
    = 9999700

The above example helps to understand that we never sequence a single copy of genome but rather a population. Rather we end up sequencing C copies of genome. This is also referred as coverage in sequencing. To obtain actual genome size (N), divide the total K-mers seen (n) by coverage (C).

N  = n / C

Estimating genome characteristics

In the first step, a K-mer frequency is calculated. There are software tools that help in finding the K-mer frequency in sequencing projects. The K-mer frequency of a shotgun library follows a pseudo-normal distribution (actually it is a Poisson distribution) around the mean coverage.

K-mer counting tools

Computing K-mer frequencies with Jellyfish

Lets first get an interactive session on the cluster

cd /share/workshop/genome_assembly/$USER
srun -t 03:30:00 -c 8 -n 1 --mem 32000 --partition production --account genome_workshop --reservation genome_workshop --pty /bin/bash

Running jellyfish, here we use a K-mer size of 21. You should always use “canonical kmers” (-C) since the sequencing reads will come from both the forward and reverse strand of DNA.

module load jellyfish
jellyfish count -C -m 21 -s 1000000000 -t 8 /share/workshop/genome_assembly/pacbio_2020_data_drosophila/hifi_long_read_data/ELF_19kb.m64001_190914_015449.Q20.38X.fasta -o reads.jf

jellyfish histo -t 8 reads.jf > reads.histo

K-mer frequencies

Once the K-mer frequencies are calculated a histogram can be plotted to visualize the distribution.

Download the K-mer frequencing table

spec1_21 <- read.table("reads.histo")
colnames(spec1_21) <- c("freq","k21")


# genome size estimate
g_est = round(sum(as.numeric(spec1_21[error:nrow(spec1_21),1]*spec1_21[error:nrow(spec1_21),2]))/peak, digits=0)
# [1] 139970181

# Single copy sum
singleC = round(sum(as.numeric(spec1_21[error:singlecopy,1]*spec1_21[error:singlecopy,2]))/peak, digits=0)
# [1] 116938146

# Percent in single copy
psinglecopy <- round(singleC/g_est, digits=2)
# [1] 0.84

# Estimated coverage
total_bp <- 5319759793
cov <- round(total_bp/g_est, digits=2)
# [1] 38.01


Genomescope is an R script that uses k-mer frequencies generated from raw read data to estimate the genome size, abundance of repetitive elements and rate of heterozygosity.

There is also an online version, which offers all of the same functionality in an easy to use web interface

Running on the command line

Rscript genomescope.R histogram_file k-mer_length read_length output_dir [kmer_max] [verbose]
cd /share/workshop/genome_assembly/$USER
git clone https://github.com/schatzlab/genomescope.git
module load R
Rscript ./genomescope/genomescope.R reads.histo 21 19000 genomescope_reads

The Rscript generates a directory called “genomescope_reads” with the following files

The summary.txt show the following results

GenomeScope version 1.0
k = 21

property                      min               max
Heterozygosity                0.645301%         0.647448%
Genome Haploid Length         134,957,059 bp    134,998,768 bp
Genome Repeat Length          20,721,599 bp     20,728,003 bp
Genome Unique Length          114,235,461 bp    114,270,765 bp
Model Fit                     96.7426%          97.8227%
Read Error Rate               0.182922%         0.182922%

The two plots allow you to visualize the K-mer frequencies

The large number of unique (near unique) K-mers that have a frequency of 1-7 (the extreme left side of the graph) is due to technical error in sequencing. The big peak at 36 in the graph above is in fact the homozygous portions of the genome that account for the identical 21-mers from both strands of the DNA. The dotted line corresponds to the predicted center of that peak. The small peak/shoulder to the left of the peak corresponds to the heterozygous portions of the genome that accounts for different 21-mers from each strand of the DNA. The two dotted lines to the right of the main peak (at coverage = 54 and 72) are the duplicated heterozygous regions and duplicated homozygous regions and correspond to two smaller peaks. The shape of these peaks are affected by the sequencing errors and sequencing duplicates.

Genomescope model

Genomescope uses an equation that accurately models the shape and size of the K-mer graph using four negative binomial peaks whose shape and size are determined by % heterozygosity, % sequencing duplication, % sequencing Error. All very useful pieces of information to learn about your genome from your raw data.

For polyploid organisms, they are working on genomescope2.0 https://github.com/tbenavi1/genomescope2.0

Continued Exploration

All of the resulting histo files can be found here: