Home
Introduction
Intro to the Workshop
Schedule
Support
Using Slack in this workshop
Using Zoom in this workshop
Cheat Sheets
Software and Links
Talks
Perspective on Bioinformatics/Genomics?
Iso-Seq Library Prep
Prerequisites
Logging In
CLI
Cluster Computing
Data Reduction
PacBio data
Dr. Liz Tseng
Introduction to PacBio and Iso-Seq
Using the Iso-Seq Application on SMRT Link and BioConda
Tutorial
Richard Kuo
Transcriptome Annotation by Modular Algorithms TAMA
Tutorial
Fairlie Reese
Long-read RNA-seq transcriptomic analysis with TALON and Swan
Tutorial
Download ipython notebook
Conesa Lab
IsoAnnot and IsoAnnotLite
TappAS Install
Hands-on tappAS
ETC
Closing thoughts
Recordings
Github page
Biocore website
Pacific Biosciences website
Conesa lab website

PacBio Data

What might you expect to receive from a PacBio Sequencing Provider?

PacBio Subread and Scraps

zmwread

For Isoseq data, you could receive raw bam files, ccs result, OR isoseq pipeline within smrtlink results.

We would recommend you start from the CCS files. The SMRTlink isoseq run is ok if you don’t have compute or are new to Bioinformatics (lack of ability), however doing the isoseq analysis pipeline yourself provides you with greater control over the output.

PacBio processes each ZMW’s original full polymerase read on the Sequel II for adapters and optionally barcodes producing a scraps (low quality regions, adapter sequence and barcode sequence) and subreads (the insert). The Sequel IIe will skip these steps and give only the CCS files. Both data are now represented within a unaligned bam file. See File Types

Creating a Project Directory

First, create a directory for you and the example project in the workshop share directory:

cd
mkdir -p /share/workshop/isoseq_workshop/$USER/view_data

Link raw data files

Next, go into that directory, lets then create symbolic link to the data directories that contains the raw and secondary processed data.
```
 cd /share/workshop/isoseq_workshop/$USER/view_data
 ln -s /share/workshop/PacBio_data/* .
```
This directory now contains a folder of a single smrt cell’s isoseq data. NOTE: This is the structure (partially) that the UC Davis Sequencing Core uses and may not be the structure you get back from your provider

Getting To Know the Data a little

Now, take a look at the raw data directory. The structure here is a PacBio “Run”, followed by the Cell index (here C1) and then data folders underneath.

The data folders are:
1. 3_C01_Primary, This is the primary output
2. 0000001653, secondary analysis folder for the isoseq pipeline
3. 01-ccs, this is the ccs output, not by smrtline but rather by the command line.
Lets take a look at the primary output
```
 ls -lah /share/workshop/isoseq_workshop/$USER/view_data/r64069_20200616_170349/C1/3_C01_Primary
```
There are a number of files in this directory, but the most relevant are the scraps.bam and the subreads.bam files.

Lets take a quick look at those files.
```
 module load samtools
 cd /share/workshop/isoseq_workshop/$USER/view_data/r64069_20200616_170349/C1/3_C01_Primary
 samtools view m64069_200619_021350.scraps.bam | head -n 1
 samtools view m64069_200619_021350.subreads.bam | head -n 1
```
Questions
1. in the scraps file
  1. what is the ZMW of the first fragment in the file?
  2. what are the fragment coordinates from the polymerase read? and so what is the length of fragment, verify this by looking at the sequence.
  3. What kind of “Scrap region-type annotation” is this fragment? Use the bam reference to help
2. in the subreads file
  1. what is the ZMW of the first fragment in the file? Why would it be different from the scraps file?
  2. what are the fragment coordinates from the polymerase read? and so what is the length of fragment, verify this by looking at the sequence.
  3. What are the quality scores assigned to each base?
  4. what is the pw tag? Other tags?
Lets tale a look at the ccs output

The current implementation of the CCS algorithm is described here
```
 ls -lah /share/workshop/isoseq_workshop/$USER/view_data/r64069_20200616_170349/C1/01-ccs
```
There are a number of files in this directory, but the most relevant is the ccs.bam.

Lets take a quick look at those files.
```
 module load samtools
 cd /share/workshop/isoseq_workshop/$USER/view_data/r64069_20200616_170349/C1/01-ccs
 samtools view m64069_200619_021350.ccs.bam | head -n 1
```
Questions
1. what is the ZMW of the first fragment in the file? Is it the same as the subread?
2. how many subreads contributed to this CCS read?
3. what are some of the quality scores assigned to each base, what seems to be the range? For those with experience with Illumina, how does this compare.
4. what is the rq tag?

The secondary output folder

 ls -lah /share/workshop/isoseq_workshop/$USER/view_data/r64069_20200616_170349/C1/0000001653

 ls -lah /share/workshop/isoseq_workshop/$USER/view_data/r64069_20200616_170349/C1/0000001653/outputs

There are a number of files in this directory, but since the whole point of this workshop is to perform these, we won’t go into them.

Homework, command line practice

for the ZMW m64069_200619_021350/50 how many subreads contributed to the consensus reads?
how many subreads does this fragment have in the primary subreads file? Warning: takes a long time to run, its a BIG file OR you can cheat
```
 samtools view  m64069_200619_021350.subreads.bam | head -n 100000 | grep -c  '^m64069_200619_021350/50/'
```
take your time to look through the directories and files. Which are summaries. Advanced
how many “reads” are in our subreads file, so total unique zmw?
what is the average length of the subreads?

☰ Menu

PacBio Iso-Seq Workshop Online