10X Genomics data and assembly with Supernova

10X Platform

Linked Reads

Laboratory Protocol

The Math

10X assembly with Supernova

10X Genomics Supernova main page

Supernova’s “achieving success”

Benefits

Range

Guidance

Annoyance

Downloading Supernova

download location

Installing is basically a matter of extracting the archive and running the test. I’ve personally never had an issue, though it would seem a few people have.

Supernova has 3 sub-applications

Fastq files

Have your sequencing provider produce fastq files as they would for any other sequencing project. There is no need to start with bcl files and run supernova mkfastq.

Barcodes - 10X Genomics libraries have sample barcodes, which is a good thing: it means we can sequence multiple libraries on a single run, which is very important now that we can use the NovaSeq. Further, 10X Genomics was aware of barcode issues and uses 4 barcodes for every library.

Instructions can be found here

Sample index sets can be found here

However, you want to make sure your sequencing provider does not demultiplex each barcode separately. You should get 1 set of fastq files that contains the data from all 4 barcodes. They should know how to do this.
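One way to sanity-check the delivered files is to count the sample index sequences recorded in the read headers. The following is a minimal sketch, assuming Casava 1.8+ style headers in which the last colon-separated field of the header comment is the index sequence; the function name `count_sample_indexes` is hypothetical and not part of Supernova. If the four barcodes were kept together, you would expect roughly four dominant index sequences, possibly with some low-count variants from sequencing errors.

```python
import gzip
from collections import Counter
from itertools import islice


def count_sample_indexes(fastq_path, max_reads=100000):
    """Count index sequences found in the header comments of a FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # Header lines are every 4th line, starting with the first.
        for header in islice(handle, 0, max_reads * 4, 4):
            fields = header.rstrip("\n").split(" ", 1)
            if len(fields) == 2:
                # Assumed comment layout: "1:N:0:GGTTTACT"; index is the last field.
                counts[fields[1].rsplit(":", 1)[-1]] += 1
    return counts
```

Calling `count_sample_indexes("reads_R1.fastq.gz").most_common(6)` on your own file (the file name here is a placeholder) lists the most frequent index sequences.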

Outputs

Output of supernova run:

A report.txt


SUMMARY
--------------------------------------------------------------------------------

Supernova mkoutput

The mkoutput sub-application traverses the graph produced by supernova run and outputs fasta files that represent your “assembly”.

The graph can be represented like this:

graph

Each edge represents a DNA sequence (here labeled 10, 20, 30, 40), and successive edges overlap each other at a k-mer junction of K-1 bases (currently K=48). Concatenating successive edges therefore requires deleting the shared K-1 bases at each junction. Each edge has left and right vertices. For example, edge 10 has left vertex 1 and right vertex 2. Any path through a Supernova graph is a list of edges; for example, 10,20,40 is a path in the above graph that extends vertices 1 .. 4.
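The concatenation rule can be sketched in a few lines of Python. This is an illustration only, not Supernova's own code: the function name `concatenate_path` and the dictionary of edge sequences are hypothetical, and the toy example uses K=4 so the overlaps are easy to see.

```python
K = 48  # k-mer size quoted in the text


def concatenate_path(edge_seqs, path, k=K):
    """Join the sequences of successive edges, dropping the shared k-1 bases."""
    if k < 2:
        raise ValueError("k must be at least 2")
    overlap = k - 1
    if not path:
        return ""
    assembled = edge_seqs[path[0]]
    for edge_id in path[1:]:
        seq = edge_seqs[edge_id]
        # Successive edges must agree on the k-1 bases at their junction.
        if assembled[-overlap:] != seq[:overlap]:
            raise ValueError(f"edge {edge_id} does not overlap the preceding edge")
        # Keep only the bases that extend past the shared junction.
        assembled += seq[overlap:]
    return assembled


# Toy graph with K=4, so neighbouring edges share 3 bases.
toy_edges = {10: "ACGTAC", 20: "TACGG", 40: "CGGTT"}
print(concatenate_path(toy_edges, [10, 20, 40], k=4))  # ACGTACGGTT
```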

1 edges=56793,1700916,733979,1039876,471726,2775518,688836,1700914,693338 left=118468 right=118473 ver=1.10 style=4
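The line above reads like the header of a single FASTA record written by mkoutput: a record number, the list of edges that make up the path, the left and right vertices, and version and style fields. Assuming that whitespace-separated `key=value` layout, a header can be split into its parts as follows; `parse_supernova_header` is a hypothetical helper, not a Supernova function.

```python
def parse_supernova_header(line):
    """Split a mkoutput-style FASTA header into a dictionary of fields."""
    tokens = line.strip().lstrip(">").split()
    record = {"id": tokens[0]}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        record[key] = value
    # The edge path is a comma-separated list of integer edge ids.
    record["edges"] = [int(e) for e in record.get("edges", "").split(",") if e]
    # Left and right vertices are integers when present.
    for key in ("left", "right"):
        if key in record:
            record[key] = int(record[key])
    return record


header = ("1 edges=56793,1700916,733979,1039876,471726,2775518,688836,"
          "1700914,693338 left=118468 right=118473 ver=1.10 style=4")
fields = parse_supernova_header(header)
print(len(fields["edges"]), fields["left"], fields["right"])  # 9 118468 118473
```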

There are four modes of output:

raw - extract all edges

megabubbles - extract all megabubbles

Here bubbles are flattened by selecting the branch having the highest coverage. Gaps are joined to adjacent sequences, resulting in longer edges that represent the gaps internally as runs of Ns. Reverse complement edges are not represented.

pseudohap - extract a single pseudohaplotype

pseudohap2 - extract both pseudohaplotypes

There is little point in ever using pseudohap; use pseudohap2 instead, since the first haplotype product is the same in both modes and pseudohap2 gives you both haplotypes.

Supernova pipeline Structure

The log file output by supernova will tell you approximately where you are in the pipeline and how much more you have to go.

proc10xG python scripts

A small set of python scripts to process linked reads. github page

Scripts ready for use

On Fastq Files

On BWA Mapped Sam/Bam files

Example

  1. First, process the reads with process_10xReads.py, which extracts the GEM barcode and primer sequence and compares the barcode to a whitelist, marking each read with a status. It then appends the status, library barcode, GEM barcode, primer sequence, and corresponding quality scores to the comment of the read ID, adds the whitelisted barcode to the beginning of the read, and writes the reads in interleaved format.
  2. Then map to the genome using bwa mem with the flags -p (interleaved input) and -C (append the FASTQ comment to the SAM output).
  3. Next, process the output with samConcat2Tag.py, which extracts the appended comment and adds it as tags.
  4. Sort using samtools sort, sorting on read ID (GEM barcode).
  5. Finally, save the log output to stderr.out and stdout.out.
process_10xReads.py -a -1 data/CaCon-sm_R1_001.fastq.gz \
  -2 data/CaCon-sm_R2_001.fastq.gz | \
  bwa mem -t 1 -p -C data/polished_p_ctg.fa - | samConcat2Tag.py | \
  samtools sort -n -o mapping.bcmapped.bam - 2> stderr.out > stdout.out
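To make the first step more concrete, here is a minimal sketch of splitting read 1 into its barcode, primer, and insert portions and checking the barcode against a whitelist. It is not the code in process_10xReads.py: the 16-base barcode and 7-base primer lengths are assumptions about the 10X linked-read layout, and the function name and the status labels `MATCH` and `UNKNOWN` are hypothetical.

```python
BARCODE_LEN = 16  # assumed GEM barcode length
PRIMER_LEN = 7    # assumed number of primer bases following the barcode


def split_read1(seq, qual, whitelist):
    """Split read 1 into barcode, primer, and insert; mark the barcode status."""
    start = BARCODE_LEN + PRIMER_LEN
    barcode = seq[:BARCODE_LEN]
    return {
        # Hypothetical status labels for whitelisted and unrecognised barcodes.
        "status": "MATCH" if barcode in whitelist else "UNKNOWN",
        "barcode": barcode,
        "barcode_qual": qual[:BARCODE_LEN],
        "primer": seq[BARCODE_LEN:start],
        "primer_qual": qual[BARCODE_LEN:start],
        # What remains is the genomic insert that gets mapped.
        "insert": seq[start:],
        "insert_qual": qual[start:],
    }


read = "A" * 16 + "C" * 7 + "GGTTGG"
parts = split_read1(read, "I" * len(read), whitelist={"A" * 16})
print(parts["status"], parts["primer"], parts["insert"])  # MATCH CCCCCCC GGTTGG
```

The real script also carries these fields forward in the read comment so that bwa mem -C can pass them into the SAM output.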

Hands on - assembling a Lizard Genome

Workflow

workflow

Data location - /share/workshop/msettles/aspidoscelis_assembly

Tasks

  1. Create folder in your workshop directory
  2. Link raw data into your directory
    • /share/workshop/msettles/aspidoscelis_assembly/2018_Sept_10X
  3. Copy all the slurm scripts to your directory
  4. Create a slurm_out directory
  5. Clone the proc10xG repo
    • git clone https://github.com/ucdavis-bioinformatics/proc10xG.git

Final directory should look like

-- /share/workshop/[username]/assembly10x
  -- /2018_Sept_10X  # linked
  -- /kat.slurm # copied
  -- /map_barcodes.slurm # copied
  -- /mkbwaref.slurm # copied
  -- /mkoutput_supernova.slurm
  -- /proc10xG.slurm
  -- /run_supernova.slurm
  -- /slurm_out
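The layout above could be prepared with a short Python helper like the one below; on the cluster, the equivalent mkdir, ln -s, and cp commands work just as well. This is a sketch under assumptions: `setup_workspace` is a hypothetical name, and the directory holding the slurm scripts is passed in as a parameter because its location is not stated here. Cloning proc10xG is left to the git clone command in the task list.

```python
import shutil
from pathlib import Path


def setup_workspace(workdir, raw_data, script_dir):
    """Create the working directory, link the raw data, and copy slurm scripts."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # Link the raw data directory rather than copying it.
    link = workdir / Path(raw_data).name
    if not link.exists() and not link.is_symlink():
        link.symlink_to(Path(raw_data).resolve())
    # Copy every *.slurm script found in script_dir.
    for script in sorted(Path(script_dir).glob("*.slurm")):
        shutil.copy(script, workdir / script.name)
    # Directory that will hold the slurm log output.
    (workdir / "slurm_out").mkdir(exist_ok=True)
    return sorted(p.name for p in workdir.iterdir())
```

The function returns the names now present in the working directory, so you can compare them against the listing above.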

With small modifications to the scripts, everything should now be set up for you to run them and create a 10X assembly using Supernova.

Your Chores