Preprocessed reads to phyloseq object using dada2
If you have not already done so, launch an interactive session on the cluster before performing the operations below.
srun -t 12:00:00 -c 8 -n 1 --mem 48000 --partition production --account workshop --reservation workshop --pty /bin/bash
cd /share/workshop/mca_workshop/$USER
mkdir 03-DADA2
module load R
R
From here on out the commands will be within R
R Environment Setup
We are going to use an R package library directory for this workshop that is located inside the workshop folder. All packages have already been installed there.
R Library
Let's make sure we are in the right location and that our library path is available.
getwd()
.libPaths("/share/workshop/mca_htstream/r_lib")
.libPaths()
Packages being used for this section
## CODE to install the needed packages
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install(c("dada2", "phyloseq"))
# BiocManager::install(c("DECIPHER", "phangorn"))
##
library(dada2); packageVersion("dada2")
Data reduction with DADA2
In this section we want to process our sequence data, summarizing it to:
- Abundance table
- Taxonomic assignment
- Phylogenetic tree
The typical means for producing abundance tables are
- OTU clustering at 97% similarity (i.e. 3% dissimilarity).
- Direct classification of each read to a taxon, then summarizing by taxa assignment.
- Production of amplicon sequence variants (ASV)
Methods 1 and 2 are outdated; ASV analysis is the appropriate approach.
Sample file setup.
The first task is to get the sample files we will be processing. If you recall, HTStream produced both SE and PE files, and the SE files are the ones we want to use here.
path <- "01-HTS_Preproc"
list.files(path)
Get a list of "SE" file names
We can get a list of all SE files in the HTStream directory by using the pattern '_SE.fastq'.
fnSEs = sort(list.files(path, pattern="*_SE.fastq", full.names=T))
sample.names <- sapply(strsplit(basename(fnSEs), "_SE"), `[`, 1)
- How many samples are there? Is that the correct number?
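One quick way to check, using the fnSEs and sample.names objects created above:
length(fnSEs)     # number of SE fastq files found
sample.names      # sample names derived from those files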
Creating a metadata object from the read IDs
In this case our sample names are sufficient to produce our metadata. Usually you would have a metadata table (e.g. an Excel-like table) mapping sample names to information about each sample.
split_names <- strsplit(sample.names,split="_")
grp <- sapply(split_names, "[[",1L)
temp <- sapply(split_names,"[[",2L)
replicate <- sapply(split_names,"[[",3L)
mdata <- data.frame("SampleID"=sample.names, "Group"=grp, "Temp"=temp, "Replicate"=replicate)
- How many groups do we have here?
- How many temps? how many replicate ids?
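One possible way to answer these questions from the metadata table we just built:
table(mdata$Group)        # samples per group
table(mdata$Temp)         # samples per temperature
table(mdata$Replicate)    # samples per replicate id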
Produce and examine raw data quality profile plots:
png(file.path("03-DADA2","fnSEs_quality.png"), width=2000, height=1000)
plotQualityProfile(fnSEs[1:3])
dev.off()
- Does this look similar to the plot in our htstream multiqc?
Note: Any additional filtering or trimming should no longer be necessary if reads are preprocessed with HTStream and filtered to remove any reads containing an "N".
DADA2 Denoising
We get right to denoising our SE reads. DADA2 first generates an error model based on ~100 million base pairs by default; the algorithm uses the first n reads across samples whose bases sum to 100M. The error model estimates the run-specific error signature. If your dataset spans multiple runs, this process should be run for each run separately.
This can take a little bit; a good time for a break.
derep <- derepFastq(fnSEs)
errU <- learnErrors(derep, multithread=TRUE, nreads=50000)
For your dataset, remove the nreads=50000 parameter.
We can plot the error profiles and investigate.
png(file.path("03-DADA2","errors-profiles.png"), width=2500, height=1500)
plotErrors(errU, nominalQ=TRUE)
dev.off()
The goal is to have the solid line follow the 'dots' well; after reviewing many experiments you'll be able to tell what sort of deviation is OK. Here we could use more fine tuning.
- Which parameters of learnErrors could be modified to investigate improvement?
- On your own, after the workshop, explore these changes and their impact.
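As a hint, the nbases argument of learnErrors controls how much data is used to fit the error model (the ~100M bases mentioned above). A hypothetical re-run with more data might look like this:
## Example only: fit the error model on twice the default amount of data
# errU.more <- learnErrors(derep, multithread=TRUE, nbases=2e8)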
Then DADA2 denoises the dataset using the error model generated above: first dereplicating the sequences, then clustering and aligning, and finally denoising. It does this by incorporating the consensus quality profiles and abundances of each sequence, and then determines whether each sequence is more likely to be of biological origin or a technical artifact.
This step can be run on individual samples, on all samples pooled, or on a pseudo-pooled mix of the two. Running it on individual samples is the least computationally intensive, followed by pseudo-pooling and full pooling. Pooling the samples allows the algorithm to identify a low-abundance sequence in one sample that is more abundant in another, so the same low-abundance taxa in some samples could be filtered out when not pooling. For the sake of time we denoise samples individually; a pooled alternative is sketched after the dada() call below.
dadaUs = dada(derep, err=errU, multithread=TRUE)
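For reference, the pooled and pseudo-pooled alternatives described above would look like the calls below (pool is a standard argument of dada()); they are left commented out here because they take considerably longer to run.
# dadaUs.pooled <- dada(derep, err=errU, multithread=TRUE, pool=TRUE)
# dadaUs.pseudo <- dada(derep, err=errU, multithread=TRUE, pool="pseudo")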
- What does the object "dadaUs" look like? What is contained within it?
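One way to start exploring it; dadaUs is a list with one dada-class object per sample:
names(dadaUs)                    # one element per sample
dadaUs[[1]]                      # denoising summary for the first sample
str(dadaUs[[1]], max.level=1)    # components of a dada-class object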
Generate an ASV count table
Now we aggregate the counts of the denoised sequence data into a sequence table, i.e. the number of reads seen per sequence per sample. Here the columns are denoised sequences and the rows are samples. Save the result for later.
seqtab <- makeSequenceTable(dadaUs)
rownames(seqtab) = sample.names
save(seqtab, file=file.path("03-DADA2","seqtab.RData"))
- What are the dimensions of our sequence table? How many ASVs do we have?
- What are the column names?
- What are the row names?
- Which sequence has the most reads? How many?
- Which sequence has the fewest reads? How many?
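A few ways you might begin to answer these questions (the column names are the ASV sequences themselves, so we only print the start of the first one):
dim(seqtab)                          # samples x ASVs
substr(colnames(seqtab)[1], 1, 50)   # start of the first ASV sequence
rownames(seqtab)                     # the sample names
summary(colSums(seqtab))             # per-ASV read totals (max = most reads, min = fewest)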
Chimera removal
Next is to include a chimera removal step. This step takes ASVs with very high read-count support and then looks for low-support ASVs that have perfect matches to two higher-support ASVs over parts of their length; a chimera is thus a mixture of two other ASVs. There is typically a low percentage of these.
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
- What are the new dimensions of our sequence table?
Let's explore the object a little.
We now have our final set of denoised data. So what fraction of total reads were retained?
sum(seqtab.nochim)/sum(seqtab)
Distribution of sequence lengths before chimera removal:
table(nchar(getSequences(seqtab)))
Look at the distribution of lengths for the sequences discarded by the chimera removal step.
table(nchar(setdiff(getSequences(seqtab),getSequences(seqtab.nochim))))
Save the result for later.
save(seqtab.nochim, file=file.path("03-DADA2","seqtab.nochim.RData"))
Track reads through the pipeline
Let's create a simple table that shows us how our ASV table was generated, how many reads per sample we have, and where we lost reads.
getN <- function(x) sum(getUniques(x))
track = data.frame(input = sapply(derep, getN),
denoised = rowSums(seqtab),
nochim = rowSums(seqtab.nochim),
ASVs = rowSums(seqtab > 0),
ASVs.nochim = rowSums(seqtab.nochim > 0))
rownames(track) <- sample.names
track
write.table(data.frame(sample=rownames(track), track), file=file.path("03-DADA2","read_tracking.tsv"), sep='\t', row.names=F)
- Which sample "lost" the most/least reads?
- Which sample has the most/least ASVs?
- Do you see a relationship between sequence count and ASV count?
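Some possible ways to look at the tracking table for these questions:
track[order(track$input - track$nochim), ]   # samples ordered by reads lost
track[order(track$ASVs.nochim), ]            # samples ordered by final ASV count
plot(track$nochim, track$ASVs.nochim)        # read count vs ASV count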
Assign taxonomy to the sequence data:
DADA2 re-implements the Ribosomal Database Project's naive Bayesian classifier. This is a k-mer (15-mer) based algorithm, comparing a subset of each read's k-mers to a database of k-mers. 100 bootstraps are performed (random drawing of k-mers) to establish a confidence level (bootstrap score) in the result. The approach classifies reads to the genus level and supplies a bootstrap score of confidence.
For this you need a database of reference sequences. The developers supply references for
- silva
- rdp
- Unite
- GTDB
- and more …
For this experiment, we are going to use the SILVA database
Download the SILVA databases for both the RDP-style classifier and species identification.
dir.create("SILVA")
download.file("https://zenodo.org/record/4587955/files/silva_nr99_v138.1_train_set.fa.gz?download=1", "SILVA/silva_nr99_v138.1_train_set.fa.gz")
download.file("https://zenodo.org/record/4587955/files/silva_species_assignment_v138.1.fa.gz?download=1", "SILVA/silva_species_assignment_v138.1.fa.gz")
Classify against the SILVA database
Using assignTaxonomy we can assign our sequences to taxa. Here we set the minimum bootstrap score to 0 so that we can apply our own bootstrap cutoff afterwards, rather than rerunning the classification with new bootstrap thresholds.
taxa.silva.nochim = assignTaxonomy(seqtab.nochim, "SILVA/silva_nr99_v138.1_train_set.fa.gz", multithread = T, minBoot=0, outputBootstraps = TRUE)
The result is a list: the first element is the taxonomic assignment, the second is the bootstrap scores. Look at the bootstrap scores in the results.
table(taxa.silva.nochim[[2]][,6] >= 50, useNA ="always")
Then let's choose a score of 50, setting to NA anything less than 50.
taxa.silva.nochim.deboot <- taxa.silva.nochim[[1]]
taxa.silva.nochim.deboot[taxa.silva.nochim[[2]] < 50] <- NA
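As a quick, optional check of how many genus-level calls the cutoff set to NA:
table(is.na(taxa.silva.nochim.deboot[, "Genus"]))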
Then perform species assignment.
Species assignment is done by exact sequence match to the database, AND requires that the taxonomic path is consistent.
taxa.silva.nochim.deboot = addSpecies(taxa.silva.nochim.deboot, "SILVA/silva_species_assignment_v138.1.fa.gz")
save(taxa.silva.nochim.deboot, file=file.path("03-DADA2", "taxa.silva.nochim.RData"))
How many sequences are assigned at each rank?
colSums(!is.na(taxa.silva.nochim.deboot))
- Spend a minute exploring this object.
- How many 'unique' genera are identified? Species?
- Which genus is represented by the most sequences? Do these have different species assignments?
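One possible starting point for these questions (NAs are removed before counting unique names):
length(unique(na.omit(taxa.silva.nochim.deboot[, "Genus"])))      # unique genera
length(unique(na.omit(taxa.silva.nochim.deboot[, "Species"])))    # unique species
head(sort(table(taxa.silva.nochim.deboot[, "Genus"]), decreasing=TRUE))   # most common genera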
Make a phylogenetic tree from all of the ASV sequences
To produce a phylogenetic tree we need to:
- Extract sequences
- Perform multiple sequence alignment
- Convert to phyDat format
- Compute the maximum likelihood distance matrix
- Compute the neighbor-joining tree
- Compute the likelihood of a tree given a sequence alignment and a model
- Update and optimize its parameters.
First let's load the libraries.
library(DECIPHER); packageVersion("DECIPHER")
library(phangorn); packageVersion("phangorn")
ASVs.nochim = DNAStringSet(colnames(seqtab.nochim))
names(ASVs.nochim) = paste0("ASV", 1:ncol(seqtab.nochim))
This can take a while, so we will reduce down to the top 1000 sequences only. For your dataset, remove the [1:1000] indexing, or choose a more appropriate one.
alignment = AlignSeqs(ASVs.nochim[1:1000], anchor=NA, processors=30)
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm)
fit = pml(treeNJ, data=phang.align)
fitGTR <- update(fit, k=4, inv=0.2)
#fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0))
optim.pml can take a REALLY long time to run, which is why it's commented out here. It essentially optimizes the different model parameters, searching for a better tree using a number of stochastic rearrangement strategies.
save(fitGTR, file=file.path("03-DADA2", "fitGTR.RData"))
- Explore this object
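For example (fitGTR is a pml object from phangorn; the tree itself is stored in fitGTR$tree):
str(fitGTR, max.level=1)    # components of the pml object
fitGTR$tree                 # the phylo-class tree used for the phyloseq object below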
Let's make a phyloseq object that will then be used for data analysis.
The intent of our phyloseq object is to store ALL the information for our experiment in a single object. We need:
* 'otu' table
* sample metadata
* taxonomic table
* sequences
* phylogenetic tree
Prepare the different parts of the phyloseq object and the SILVA annotation
Let's match up the sample metadata, assign names to each sequence (ASV1 through ASVn), and generally prepare the data.
mdata2 = mdata[match(rownames(seqtab.nochim), mdata$SampleID), ]
rownames(mdata2) = mdata2$SampleID
ASVs.nochim = DNAStringSet(colnames(seqtab.nochim))
names(ASVs.nochim) = paste0("ASV", 1:ncol(seqtab.nochim))
tmp.seqtab = seqtab.nochim
colnames(tmp.seqtab) = names(ASVs.nochim)
tmp.taxa = taxa.silva.nochim.deboot
rownames(tmp.taxa) = names(ASVs.nochim)
Create the object
Load the phyloseq library and create the object.
library(phyloseq); packageVersion("phyloseq")
ps.silva.nochim = phyloseq(
otu_table(tmp.seqtab, taxa_are_rows=FALSE),
sample_data(mdata2),
tax_table(tmp.taxa),
refseq(ASVs.nochim),
phy_tree(fitGTR$tree))
ps.silva.nochim
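A few quick sanity checks using standard phyloseq accessor functions:
nsamples(ps.silva.nochim)           # number of samples
ntaxa(ps.silva.nochim)              # number of ASVs retained in the object
sample_variables(ps.silva.nochim)   # metadata columns available for analysis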
DONE!
save(ps.silva.nochim, file=file.path("03-DADA2", "phyloseq_nochim_silva.RData"))
Let's copy the RData file (and any intermediate results you'd like) over to your computer for data analysis.