0. Setup

Install topGO, KEGGREST, org.At.tair.db, and Rgraphviz if not already installed

#source("https://bioconductor.org/biocLite.R")
#biocLite(c("topGO", "KEGGREST", "org.At.tair.db", "Rgraphviz"))

Load libraries

library(topGO)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: graph

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: GO.db

## Loading required package: AnnotationDbi

## Loading required package: stats4

## Loading required package: IRanges

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

##

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

## 
## groupGOTerms:    GOBPTerm, GOMFTerm, GOCCTerm environments built.

## 
## Attaching package: 'topGO'

## The following object is masked from 'package:IRanges':
## 
##     members

library(KEGGREST)

Files for examples:

I5_v_C_time6.txt

ensembl_example_input.txt

mart_export.txt.

1. Gene Ontology (GO) Enrichment

Gene ontology (http://www.geneontology.org/) provides a controlled vocabulary for describing biological processes (BP ontology), molecular functions (MF ontology) and cellular components (CC ontology)

The GO ontologies themselves are organism-independent; terms are associated with genes for a specific organism through direct experimentation or through sequence homology with another organism and its GO annotation.

Terms are related to other terms through parent-child relationships in a directed acylic graph.

Enrichment analysis provides one way of drawing conclusions about a set of differential expression results.

2. topGO Example Using Kolmogorov-Smirnov Testing

Our first example uses Kolmogorov-Smirnov Testing for enrichment testing of our arabadopsis DE results, with GO annotation obtained from the Bioconductor database org.At.tair.db.

The first step in each topGO analysis is to create a topGOdata object. This contains the genes, the score for each gene (here we use the p-value from the DE test), the GO terms associated with each gene, and the ontology to be used (here we use the biological process ontology)

infile <- "I5_v_C_time6.txt"
tmp <- read.delim(infile)

# OR
tmp <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/friday/I5_v_C_time6.txt")

geneList <- tmp$P.Value
names(geneList) <- tmp$Gene
# Create topGOData object
GOdata <- new("topGOdata",
    ontology = "BP",
    allGenes = geneList,
    geneSelectionFun = function(x)x,
    annot = annFUN.org, mapping = "org.At.tair.db")

## 
## Building most specific GOs .....

## Loading required package: org.At.tair.db

##

##  ( 2475 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 4338 GO terms and 9505 relations. )

## 
## Annotating nodes ...............

##  ( 16216 genes annotated to the GO terms. )

The topGOdata object is then used as input for enrichment testing:

# Kolmogorov-Smirnov testing
resultKS <- runTest(GOdata, algorithm = "weight01", statistic = "ks")

## 
##           -- Weight01 Algorithm -- 
## 
##       the algorithm is scoring 4338 nontrivial nodes
##       parameters: 
##           test statistic: ks
##           score order: increasing

## 
##   Level 18:  1 nodes to be scored    (0 eliminated genes)

## 
##   Level 17:  1 nodes to be scored    (0 eliminated genes)

## 
##   Level 16:  2 nodes to be scored    (1 eliminated genes)

## 
##   Level 15:  19 nodes to be scored   (3 eliminated genes)

## 
##   Level 14:  47 nodes to be scored   (6 eliminated genes)

## 
##   Level 13:  88 nodes to be scored   (63 eliminated genes)

## 
##   Level 12:  195 nodes to be scored  (612 eliminated genes)

## 
##   Level 11:  304 nodes to be scored  (1214 eliminated genes)

## 
##   Level 10:  463 nodes to be scored  (2348 eliminated genes)

## 
##   Level 9:   575 nodes to be scored  (4513 eliminated genes)

## 
##   Level 8:   663 nodes to be scored  (6164 eliminated genes)

## 
##   Level 7:   674 nodes to be scored  (8049 eliminated genes)

## 
##   Level 6:   620 nodes to be scored  (9529 eliminated genes)

## 
##   Level 5:   391 nodes to be scored  (10347 eliminated genes)

## 
##   Level 4:   197 nodes to be scored  (11297 eliminated genes)

## 
##   Level 3:   74 nodes to be scored   (11765 eliminated genes)

## 
##   Level 2:   23 nodes to be scored   (12162 eliminated genes)

## 
##   Level 1:   1 nodes to be scored    (12302 eliminated genes)

tab <- GenTable(GOdata, raw.p.value = resultKS, topNodes = length(resultKS@score), numChar = 120)

topGO preferentially tests more specific terms, utilizing the topology of the GO graph. The algorithms used are described in detail here.

head(tab, 15)

##         GO.ID                                           Term Annotated
## 1  GO:0001510                                RNA methylation       172
## 2  GO:0006412                                    translation       620
## 3  GO:0042254                            ribosome biogenesis       332
## 4  GO:0009220 pyrimidine ribonucleotide biosynthetic process       133
## 5  GO:0046482       para-aminobenzoic acid metabolic process        38
## 6  GO:0046686                        response to cadmium ion       459
## 7  GO:0009407                        toxin catabolic process       203
## 8  GO:0042545                         cell wall modification       209
## 9  GO:0010583                     response to cyclopentenone       143
## 10 GO:0009414                  response to water deprivation       393
## 11 GO:0009664              plant-type cell wall organization       246
## 12 GO:0006626             protein targeting to mitochondrion       104
## 13 GO:0006414                       translational elongation        32
## 14 GO:0006457                                protein folding       270
## 15 GO:0000028               ribosomal small subunit assembly         6
##    Significant Expected raw.p.value
## 1          172      172     9.0e-30
## 2          620      620     3.6e-29
## 3          332      332     1.1e-14
## 4          133      133     4.0e-09
## 5           38       38     1.2e-07
## 6          459      459     1.4e-07
## 7          203      203     7.9e-07
## 8          209      209     2.8e-06
## 9          143      143     6.0e-06
## 10         393      393     7.6e-06
## 11         246      246     4.3e-05
## 12         104      104     8.6e-05
## 13          32       32     8.9e-05
## 14         270      270     0.00028
## 15           6        6     0.00037

Annotated: number of genes (in our gene list) that are annotated with the term
Significant: n/a for this example, same as Annotated here
Expected: n/a for this example, same as Annotated here
raw.p.value: P-value from Kolomogorov-Smirnov test that DE p-values annotated with the term are smaller (i.e. more significant) than those not annotated with the term.

The Kolmogorov-Smirnov test directly compares two probability distributions based on their maximum distance.

To illustrate the KS test, we plot probability distributions of p-values that are and that are not annotated with the term “RNA methylation”. (This won’t exactly match what topGO does due to their elimination algorithm):

rna.meth.terms <- genesInTerm(GOdata)[["GO:0001510"]] # get genes associated with term
p.values.in <- geneList[names(geneList) %in% rna.meth.terms]
p.values.out <- geneList[!(names(geneList) %in% rna.meth.terms)]
plot.ecdf(p.values.in, verticals = T, do.points = F, col = "red", lwd = 2, xlim = c(0,1),
          main = "Empirical Distribution of DE P-Values by Annotation with 'RNA Methylation'",
          cex.main = 0.9, xlab = "p", ylab = "Probabilty(P-Value < p)")
ecdf.out <- ecdf(p.values.out)
xx <- unique(sort(c(seq(0, 1, length = 201), knots(ecdf.out))))
lines(xx, ecdf.out(xx), col = "black", lwd = 2)
legend("bottomright", legend = c("Genes Annotated with 'RNA Methylation'", "Genes Not Annotated with 'RNA Methylation'"), lwd = 2, col = 2:1, cex = 0.9)

We can use the function showSigOfNodes to plot the GO graph for the 3 most significant terms and their parents, color coded by enrichment p-value (red is most significant):

par(cex = 0.3)
showSigOfNodes(GOdata, score(resultKS), firstSigNodes = 3, useInfo = "def")

## Loading required package: Rgraphviz

## Loading required package: grid

## 
## Attaching package: 'grid'

## The following object is masked from 'package:topGO':
## 
##     depth

## 
## Attaching package: 'Rgraphviz'

## The following objects are masked from 'package:IRanges':
## 
##     from, to

## The following objects are masked from 'package:S4Vectors':
## 
##     from, to

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 41 
## Number of Edges = 70 
## 
## $complete.dag
## [1] "A graph with 41 nodes."

par(cex = 1)

3. topGO Example Using Fisher’s Exact Test

Next, we use Fisher’s exact test to test for pathway enrichment among significantly DE genes, with GO terms from an external file.

This example uses Mus musculus, but the same procedure can be followed to input GO terms from an external file for a non-model organism.

Creating topGOdata object using GO terms from Biomart (https://www.ensembl.org/biomart):

gene.go <- read.delim("mart_export.txt", stringsAsFactors = F)

# OR

gene.go <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/friday/mart_export.txt", stringsAsFactors = F)

gene.go <- gene.go[which(gene.go$GO.term.accession != ""),] # take out genes without GO terms
head(gene.go)

##       Gene.stable.ID GO.term.accession
## 3 ENSMUSG00000064370        GO:0009055
## 4 ENSMUSG00000064370        GO:0016020
## 5 ENSMUSG00000064370        GO:0016491
## 6 ENSMUSG00000064370        GO:0006122
## 7 ENSMUSG00000064370        GO:0008121
## 8 ENSMUSG00000064370        GO:0022904

# Create list with element for each gene, containing vectors with all terms for each gene
gene2GO <- tapply(gene.go$GO.term.accession, gene.go$Gene.stable.ID, function(x)x)
head(gene2GO)

## $ENSMUSG00000000001
##  [1] "GO:0007186" "GO:0007165" "GO:0005525" "GO:0003924" "GO:0004871"
##  [6] "GO:0031683" "GO:0019001" "GO:0007188" "GO:0016020" "GO:0046872"
## [11] "GO:0005886" "GO:0005634" "GO:0005794" "GO:0005515" "GO:0005737"
## [16] "GO:0000166" "GO:0005856" "GO:0005730" "GO:0046039" "GO:0007049"
## [21] "GO:0051301" "GO:0000139" "GO:0005789" "GO:0005813" "GO:0005815"
## [26] "GO:0030496" "GO:0019003" "GO:0007193" "GO:0016239" "GO:0001664"
## [31] "GO:0005834" "GO:0007212" "GO:0019904" "GO:0031821" "GO:0032794"
## [36] "GO:0006906" "GO:0042588" "GO:0045121"
## 
## $ENSMUSG00000000003
## [1] "GO:0036094" "GO:0005576"
## 
## $ENSMUSG00000000028
##  [1] "GO:0006270" "GO:0005634" "GO:0032508" "GO:0006260" "GO:0007049"
##  [6] "GO:0051301" "GO:0005813" "GO:0003682" "GO:0003697" "GO:0003688"
## [11] "GO:0031298" "GO:0000727" "GO:0043138" "GO:0006267" "GO:0005656"
## [16] "GO:0031261" "GO:0031938" "GO:1900087" "GO:1902977"
## 
## $ENSMUSG00000000037
## [1] "GO:0006355" "GO:0005634" "GO:0005515" "GO:0034613" "GO:0006915"
## [6] "GO:0000790" "GO:0036353" "GO:0001741"
## 
## $ENSMUSG00000000049
##  [1] "GO:0005615" "GO:0009986" "GO:0042802" "GO:0005576" "GO:0031012"
##  [6] "GO:0006641" "GO:0008201" "GO:0016525" "GO:0031639" "GO:0005543"
## [11] "GO:0034361" "GO:0042627" "GO:0001937" "GO:0030195" "GO:0010596"
## [16] "GO:0034392" "GO:0008289" "GO:0051006" "GO:0034364" "GO:0033033"
## [21] "GO:0051918" "GO:0030193" "GO:0060230" "GO:0007597" "GO:0051917"
## 
## $ENSMUSG00000000056
## [1] "GO:0008150" "GO:0005634" "GO:0055114" "GO:0005730" "GO:0005521"
## [6] "GO:0005638" "GO:0031981" "GO:0003954"

Define vector that is 1 if gene is significantly DE (adj.P.Val < 0.05) and 0 otherwise:

infile <- "ensembl_example_input.txt"
pcutoff <- 0.05 # cutoff for defining significant genes
DE <- read.delim(infile, stringsAsFactors = F)

#OR
DE <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/friday/ensembl_example_input.txt", stringsAsFactors = F)

head(DE)

##                    Gene    adj.P.Val
## 1 ENSMUSG00000033857.12 7.416582e-14
## 2 ENSMUSG00000054715.11 7.416582e-14
## 3  ENSMUSG00000105940.1 7.416582e-14
## 4 ENSMUSG00000032382.14 1.098506e-13
## 5 ENSMUSG00000027208.14 1.206893e-13
## 6 ENSMUSG00000020063.16 3.185723e-13

# Define gene list as 1's if adjP < cutoff, 0, otherwise
tmp <- ifelse(DE$adj.P.Val < pcutoff, 1, 0)
geneList <- tmp

# geneList needs names that match those for GO terms, so stripping off part of name after dot
# (This is only necessary for ensembl IDs)
names(geneList) <- unlist(lapply(strsplit(DE$Gene, split = ".", fixed = T), function(x)x[1]))
head(geneList)

## ENSMUSG00000033857 ENSMUSG00000054715 ENSMUSG00000105940 
##                  1                  1                  1 
## ENSMUSG00000032382 ENSMUSG00000027208 ENSMUSG00000020063 
##                  1                  1                  1

Create topGOdata object:

GOdata <- new("topGOdata",
        ontology = "BP",
        allGenes = geneList,
        geneSelectionFun = function(x)(x == 1),
        annot = annFUN.gene2GO, gene2GO = gene2GO)

## 
## Building most specific GOs .....

##  ( 11311 GO terms found. )

## 
## Build GO DAG topology ..........

##  ( 15207 GO terms and 35988 relations. )

## 
## Annotating nodes ...............

##  ( 15188 genes annotated to the GO terms. )

Run Fisher’s Exact Test:

resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")

## 
##           -- Elim Algorithm -- 
## 
##       the algorithm is scoring 8311 nontrivial nodes
##       parameters: 
##           test statistic: fisher
##           cutOff: 0.01

## 
##   Level 19:  2 nodes to be scored    (0 eliminated genes)

## 
##   Level 18:  7 nodes to be scored    (0 eliminated genes)

## 
##   Level 17:  19 nodes to be scored   (0 eliminated genes)

## 
##   Level 16:  55 nodes to be scored   (0 eliminated genes)

## 
##   Level 15:  112 nodes to be scored  (3 eliminated genes)

## 
##   Level 14:  208 nodes to be scored  (25 eliminated genes)

## 
##   Level 13:  352 nodes to be scored  (41 eliminated genes)

## 
##   Level 12:  510 nodes to be scored  (47 eliminated genes)

## 
##   Level 11:  752 nodes to be scored  (93 eliminated genes)

## 
##   Level 10:  976 nodes to be scored  (873 eliminated genes)

## 
##   Level 9:   1121 nodes to be scored (1385 eliminated genes)

## 
##   Level 8:   1138 nodes to be scored (1462 eliminated genes)

## 
##   Level 7:   1156 nodes to be scored (1722 eliminated genes)

## 
##   Level 6:   928 nodes to be scored  (4402 eliminated genes)

## 
##   Level 5:   551 nodes to be scored  (5655 eliminated genes)

## 
##   Level 4:   284 nodes to be scored  (5673 eliminated genes)

## 
##   Level 3:   118 nodes to be scored  (5679 eliminated genes)

## 
##   Level 2:   21 nodes to be scored   (6050 eliminated genes)

## 
##   Level 1:   1 nodes to be scored    (6050 eliminated genes)

tab <- GenTable(GOdata, raw.p.value = resultFisher, topNodes = length(resultFisher@score),
                numChar = 120)
head(tab)

##        GO.ID                                                       Term
## 1 GO:0097192 extrinsic apoptotic signaling pathway in absence of ligand
## 2 GO:0032287               peripheral nervous system myelin maintenance
## 3 GO:0003283                                  atrial septum development
## 4 GO:0006667                              sphinganine metabolic process
## 5 GO:0035582                sequestering of BMP in extracellular matrix
## 6 GO:1903598               positive regulation of gap junction assembly
##   Annotated Significant Expected raw.p.value
## 1        67          17     7.05     0.00044
## 2         8           5     0.84     0.00055
## 3        17           7     1.79     0.00106
## 4         3           3     0.32     0.00116
## 5         3           3     0.32     0.00116
## 6         3           3     0.32     0.00116

Annotated: number of genes (in our gene list) that are annotated with the term
Significant: Number of significantly DE genes annotated with that term (i.e. genes where geneList = 1)
Expected: Under random chance, number of genes that would be expected to be significantly DE and annotated with that term
raw.p.value: P-value from Fisher’s Exact Test, testing for association between significance and pathway membership.

Fisher’s Exact Test is applied to the table:

Significance/Annotation	Annotated With GO Term	Not Annotated With GO Term
Significantly DE	n1	n3
Not Significantly DE	n2	n4

and compares the probability of the observed table, conditional on the row and column sums, to what would be expected under random chance.

Advantages over KS (or Wilcoxon) Tests:

*Ease of interpretation

Disadvantages:

Relies on significant/non-significant dichotomy (an interesting gene could have an adjusted p-value of 0.051 and be counted as non-significant)
Less powerful
May be less useful if there are very few (or a large number of) significant genes

4. KEGG Pathway Enrichment Testing With KEGGREST

KEGG, the Kyoto Encyclopedia of Genes and Genomes (https://www.genome.jp/kegg/), provides assignment of genes for many organisms into pathways.

We will access KEGG pathway assignments for arabadopsis through the KEGGREST Bioconductor package, and then use some homebrew code for enrichment testing.

Get all arabadopsis pathways and their genes:

# Pull all pathways for AT  
pathways.list <- keggList("pathway", "ath")
head(pathways.list)

##                                                                   path:ath00010 
##             "Glycolysis / Gluconeogenesis - Arabidopsis thaliana (thale cress)" 
##                                                                   path:ath00020 
##                "Citrate cycle (TCA cycle) - Arabidopsis thaliana (thale cress)" 
##                                                                   path:ath00030 
##                "Pentose phosphate pathway - Arabidopsis thaliana (thale cress)" 
##                                                                   path:ath00040 
## "Pentose and glucuronate interconversions - Arabidopsis thaliana (thale cress)" 
##                                                                   path:ath00051 
##          "Fructose and mannose metabolism - Arabidopsis thaliana (thale cress)" 
##                                                                   path:ath00052 
##                     "Galactose metabolism - Arabidopsis thaliana (thale cress)"

# Pull all genes for each pathway
pathway.codes <- sub("path:", "", names(pathways.list)) 
genes.by.pathway <- sapply(pathway.codes,
    function(pwid){
        pw <- keggGet(pwid)
        if (is.null(pw[[1]]$GENE)) return(NA)
        pw2 <- pw[[1]]$GENE[c(TRUE,FALSE)] # may need to modify this to c(FALSE, TRUE) for other organisms
        pw2 <- unlist(lapply(strsplit(pw2, split = ";", fixed = T), function(x)x[1]))
        return(pw2)
    }
)
head(genes.by.pathway)

## $ath00010
##   [1] "AT4G37840" "AT1G47840" "AT2G19860" "AT4G29130" "AT1G50460"
##   [6] "AT3G20040" "AT4G24620" "AT5G42740" "AT4G32840" "AT5G56630"
##  [11] "AT5G61580" "AT4G29220" "AT2G22480" "AT4G26270" "AT5G47810"
##  [16] "AT1G76550" "AT4G04040" "AT1G12000" "AT1G20950" "AT1G43670"
##  [21] "AT3G54050" "AT5G64380" "AT4G26520" "AT2G36460" "AT3G52930"
##  [26] "AT2G01140" "AT4G38970" "AT4G26530" "AT5G03690" "AT2G21330"
##  [31] "AT3G55440" "AT2G21170" "AT1G13440" "AT1G16300" "AT1G79530"
##  [36] "AT3G04120" "AT3G12780" "AT1G56190" "AT1G79550" "AT1G22170"
##  [41] "AT1G78050" "AT1G09780" "AT3G08590" "AT3G50520" "AT1G74030"
##  [46] "AT2G36530" "AT2G29560" "AT3G04050" "AT5G52920" "AT5G56350"
##  [51] "AT3G25960" "AT3G52990" "AT3G55650" "AT3G55810" "AT5G08570"
##  [56] "AT5G63680" "AT4G26390" "AT1G32440" "AT3G49160" "AT3G22960"
##  [61] "AT2G36580" "AT1G59900" "AT1G01090" "AT1G24180" "AT5G50850"
##  [66] "AT2G34590" "AT1G30120" "AT1G34430" "AT3G52200" "AT1G54220"
##  [71] "AT3G13930" "AT3G25860" "AT3G17240" "AT3G16950" "AT1G48030"
##  [76] "AT4G16155" "AT4G17260" "AT4G33070" "AT5G01320" "AT5G01330"
##  [81] "AT5G54960" "AT1G22440" "AT5G42250" "AT5G43940" "AT4G22110"
##  [86] "AT1G22430" "AT1G77120" "AT1G32780" "AT5G24760" "AT1G64710"
##  [91] "AT1G23800" "AT3G48000" "AT4G34240" "AT1G44170" "AT4G36250"
##  [96] "AT1G54100" "AT5G36880" "AT3G17940" "AT3G47800" "AT5G15140"
## [101] "AT3G01260" "AT5G51820" "AT1G23190" "AT1G70730" "AT4G23730"
## [106] "AT3G61610" "AT3G01590" "AT5G57330" "AT5G14500" "AT4G25900"
## [111] "AT5G66530" "AT1G09870" "AT2G24270" "AT5G65690" "AT4G37870"
## 
## $ath00020
##  [1] "AT2G42790" "AT3G58740" "AT3G58750" "AT3G60100" "AT2G44350"
##  [6] "AT3G06650" "AT5G49460" "AT1G60810" "AT1G10670" "AT1G09430"
## [11] "AT4G35830" "AT2G05710" "AT4G26970" "AT1G65930" "AT1G54340"
## [16] "AT5G14590" "AT4G35650" "AT5G03290" "AT2G17130" "AT4G35260"
## [21] "AT3G09810" "AT1G32480" "AT3G55410" "AT5G65750" "AT5G55070"
## [26] "AT4G26910" "AT3G17240" "AT3G16950" "AT1G48030" "AT4G16155"
## [31] "AT5G08300" "AT5G23250" "AT2G20420" "AT5G66760" "AT2G18450"
## [36] "AT5G40650" "AT5G65165" "AT3G27380" "AT4G32210" "AT5G09600"
## [41] "AT5G50950" "AT2G47510" "AT1G04410" "AT5G56720" "AT5G43330"
## [46] "AT3G47520" "AT1G53240" "AT2G22780" "AT3G15020" "AT5G09660"
## [51] "AT5G65690" "AT4G37870" "AT1G59900" "AT1G01090" "AT1G24180"
## [56] "AT5G50850" "AT2G34590" "AT1G30120" "AT1G34430" "AT3G52200"
## [61] "AT1G54220" "AT3G13930" "AT3G25860"
## 
## $ath00030
##  [1] "AT4G24620" "AT5G42740" "AT3G27300" "AT5G13110" "AT1G09420"
##  [6] "AT5G35790" "AT1G24280" "AT5G40760" "AT5G24410" "AT5G24420"
## [11] "AT3G49360" "AT1G13700" "AT5G24400" "AT1G64190" "AT5G41670"
## [16] "AT3G02360" "AT1G63290" "AT5G61410" "AT3G01850" "AT3G60750"
## [21] "AT2G45290" "AT5G13420" "AT1G12230" "AT1G71100" "AT2G01290"
## [26] "AT3G04790" "AT5G44520" "AT1G17160" "AT5G51820" "AT1G23190"
## [31] "AT1G70730" "AT2G44530" "AT1G32380" "AT2G35390" "AT2G24270"
## [36] "AT2G16790" "AT4G26520" "AT2G36460" "AT3G52930" "AT2G01140"
## [41] "AT4G38970" "AT4G26530" "AT5G03690" "AT2G21330" "AT1G43670"
## [46] "AT3G54050" "AT5G64380" "AT4G32840" "AT5G56630" "AT5G61580"
## [51] "AT4G29220" "AT2G22480" "AT4G26270" "AT5G47810" "AT1G76550"
## [56] "AT4G04040" "AT1G12000" "AT1G20950"
## 
## $ath00040
##  [1] "AT3G43270" "AT1G53830" "AT5G04970" "AT2G45220" "AT5G19730"
##  [6] "AT1G11590" "AT5G49180" "AT2G47030" "AT1G69940" "AT4G33220"
## [11] "AT3G29090" "AT4G03930" "AT2G21610" "AT1G11580" "AT3G05620"
## [16] "AT5G27870" "AT3G14310" "AT4G02320" "AT3G47400" "AT5G51500"
## [21] "AT5G51490" "AT1G02810" "AT3G10720" "AT2G47550" "AT4G02330"
## [26] "AT3G05610" "AT5G55590" "AT4G13710" "AT3G55140" "AT3G09540"
## [31] "AT3G27400" "AT5G55720" "AT5G09280" "AT3G24670" "AT1G30350"
## [36] "AT2G02720" "AT3G01270" "AT3G53190" "AT1G14420" "AT5G04310"
## [41] "AT5G15110" "AT1G11920" "AT5G63180" "AT4G22090" "AT5G48900"
## [46] "AT4G22080" "AT1G67750" "AT4G24780" "AT3G07010" "AT4G13210"
## [51] "AT3G24230" "AT1G04680" "AT3G54920" "AT3G59850" "AT5G14650"
## [56] "AT3G15720" "AT5G17200" "AT3G57510" "AT3G07850" "AT1G02790"
## [61] "AT5G14470" "AT3G01640" "AT5G52560" "AT5G39320" "AT1G26570"
## [66] "AT5G15490" "AT3G29360" "AT3G01010" "AT5G17310" "AT3G03250"
## [71] "AT1G63290" "AT5G61410" "AT3G01850" "AT5G49650" "AT5G57655"
## [76] "AT5G51970"
## 
## $ath00051
##  [1] "AT3G10900" "AT2G20680" "AT5G01930" "AT5G66460" "AT1G02310"
##  [6] "AT4G28320" "AT3G30540" "AT3G10890" "AT5G51830" "AT2G31390"
## [11] "AT4G10260" "AT1G06030" "AT3G59480" "AT1G66430" "AT1G06020"
## [16] "AT1G50390" "AT3G02570" "AT1G67070" "AT2G45790" "AT1G74910"
## [21] "AT2G39770" "AT3G55590" "AT4G30570" "AT5G66280" "AT3G51160"
## [26] "AT1G17890" "AT1G73250" "AT4G11980" "AT1G01220" "AT4G37840"
## [31] "AT1G47840" "AT2G19860" "AT4G29130" "AT1G50460" "AT3G20040"
## [36] "AT4G32840" "AT5G56630" "AT5G61580" "AT4G29220" "AT2G22480"
## [41] "AT4G26270" "AT5G47810" "AT1G76550" "AT4G04040" "AT1G12000"
## [46] "AT1G20950" "AT1G43670" "AT3G54050" "AT5G64380" "AT1G07110"
## [51] "AT5G51970" "AT5G57655" "AT4G26520" "AT2G36460" "AT3G52930"
## [56] "AT2G01140" "AT4G38970" "AT4G26530" "AT5G03690" "AT2G21330"
## [61] "AT3G55440" "AT2G21170" "AT1G48430" "AT3G17770"
## 
## $ath00052
##  [1] "AT3G17940" "AT3G47800" "AT5G15140" "AT3G01260" "AT5G18200"
##  [6] "AT1G63180" "AT1G64440" "AT1G12780" "AT4G23920" "AT4G10960"
## [11] "AT5G17310" "AT3G03250" "AT5G51820" "AT1G23190" "AT1G70730"
## [16] "AT4G37840" "AT1G47840" "AT2G19860" "AT4G29130" "AT1G50460"
## [21] "AT3G20040" "AT3G54440" "AT1G72990" "AT3G52840" "AT5G52560"
## [26] "AT2G47180" "AT1G56600" "AT1G09350" "AT1G60470" "AT5G23790"
## [31] "AT4G26250" "AT1G60450" "AT5G30500" "AT1G55740" "AT3G57520"
## [36] "AT5G20250" "AT5G40390" "AT3G56310" "AT5G08370" "AT5G08380"
## [41] "AT4G32840" "AT5G56630" "AT5G61580" "AT4G29220" "AT2G22480"
## [46] "AT4G26270" "AT5G47810" "AT2G36190" "AT3G13790" "AT1G12240"
## [51] "AT1G62660" "AT3G13784" "AT5G11920" "AT4G01970" "AT3G45940"
## [56] "AT5G11720"

Read in DE file to be used in enrichment testing:

infile <- "I5_v_C_time6.txt"
DE.table <- read.delim(infile, stringsAsFactors = F)

# OR

DE.table <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/friday/I5_v_C_time6.txt")

geneList <- DE.table$P.Value
names(geneList) <- DE.table$Gene
head(geneList)

##    AT4G12520    AT3G30720    AT5G26270    AT3G33528    AT1G64795 
## 2.206726e-10 9.108689e-10 4.101051e-09 2.741289e-07 5.985471e-07 
##    AT3G05955 
## 1.257443e-06

Apply Wilcoxon rank-sum test to each pathway, testing if “in” p-values are smaller than “out” p-values:

# Wilcoxon test for each pathway
pVals.by.pathway <- t(sapply(names(genes.by.pathway),
    function(pathway) {
        pathway.genes <- genes.by.pathway[[pathway]]
        list.genes.in.pathway <- intersect(names(geneList), pathway.genes)
        list.genes.not.in.pathway <- setdiff(names(geneList), list.genes.in.pathway)
        scores.in.pathway <- geneList[list.genes.in.pathway]
        scores.not.in.pathway <- geneList[list.genes.not.in.pathway]
        if (length(scores.in.pathway) > 0){
            p.value <- wilcox.test(scores.in.pathway, scores.not.in.pathway, alternative = "less")$p.value
        } else{
            p.value <- NA
        }
        return(c(p.value = p.value, Annotated = length(list.genes.in.pathway)))
    }
))

# Assemble output table
outdat <- data.frame(pathway.code = rownames(pVals.by.pathway))
outdat$pathway.name <- pathways.list[outdat$pathway.code]
outdat$p.value <- pVals.by.pathway[,"p.value"]
outdat$Annotated <- pVals.by.pathway[,"Annotated"]
outdat <- outdat[order(outdat$p.value),]
head(outdat)

##     pathway.code
## 106     ath03010
## 94      ath00966
## 128     ath04141
## 10      ath00071
## 107     ath03013
## 55      ath00592
##                                                                         pathway.name
## 106                                    Ribosome - Arabidopsis thaliana (thale cress)
## 94                   Glucosinolate biosynthesis - Arabidopsis thaliana (thale cress)
## 128 Protein processing in endoplasmic reticulum - Arabidopsis thaliana (thale cress)
## 10                       Fatty acid degradation - Arabidopsis thaliana (thale cress)
## 107                               RNA transport - Arabidopsis thaliana (thale cress)
## 55              alpha-Linolenic acid metabolism - Arabidopsis thaliana (thale cress)
##          p.value Annotated
## 106 5.378589e-35       301
## 94  2.572845e-06        22
## 128 6.009307e-06       185
## 10  2.040541e-05        43
## 107 1.784614e-04       152
## 55  1.145646e-03        37

p.value: P-value for Wilcoxon rank-sum testing, testing that p-values from DE analysis for genes in the pathway are smaller than those not in the pathway
Annotated: Number of genes in the pathway (regardless of DE p-value)

The Wilcoxon rank-sum test is the nonparametric analogue of the two-sample t-test. It compares the ranks of observations in two groups. It is more powerful than the Kolmogorov-Smirnov test.

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.24.0     org.At.tair.db_3.6.0 KEGGREST_1.20.0     
##  [4] topGO_2.32.0         SparseM_1.77         GO.db_3.6.0         
##  [7] AnnotationDbi_1.42.1 IRanges_2.14.10      S4Vectors_0.18.3    
## [10] Biobase_2.40.0       graph_1.58.0         BiocGenerics_0.26.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       compiler_3.5.0     XVector_0.20.0    
##  [4] tools_3.5.0        zlibbioc_1.26.0    digest_0.6.15     
##  [7] bit_1.1-14         RSQLite_2.1.1      evaluate_0.10.1   
## [10] memoise_1.1.0      lattice_0.20-35    pkgconfig_2.0.1   
## [13] png_0.1-7          DBI_1.0.0          curl_3.2          
## [16] yaml_2.1.19        stringr_1.3.1      httr_1.3.1        
## [19] knitr_1.20         Biostrings_2.48.0  rprojroot_1.3-2   
## [22] bit64_0.9-7        R6_2.2.2           rmarkdown_1.10    
## [25] blob_1.1.1         magrittr_1.5       backports_1.1.2   
## [28] htmltools_0.3.6    matrixStats_0.53.1 stringi_1.1.7

Gene Ontology and KEGG Enrichment Analysis

0. Setup

1. Gene Ontology (GO) Enrichment

2. topGO Example Using Kolmogorov-Smirnov Testing

3. topGO Example Using Fisher’s Exact Test

4. KEGG Pathway Enrichment Testing With KEGGREST