Note: Throughout, text taken directly from the vignettes will be shown in georgia font and code/output from the vignettes will be shown with a blue background. Everything else is my commentary/additions.

limma

Limma is an R package for differential expression testing of RNASeq and microarray data. The limma User’s Guide is an extensive, 100+ page summary of limma’s many capabilities. We will focus only on Chapter 15, “RNA-seq Data”.

Install limma and edgeR if you have not already done so:

# source("https://bioconductor.org/biocLite.R")
# biocLite(c("limma", "edgeR"))

Load edgeR (which loads limma as a dependency)

library(edgeR)
## Loading required package: limma
## Warning: package 'limma' was built under R version 3.5.1

Read in the counts table used in the limma examples from the course github page:

counts <- read.delim('https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-September-Bioinformatics-Prerequisites/master/friday/counts.tsv', row.names = 1)

The counts table has one column per sample and one row per gene. It shows the number of reads aligning to each gene in each sample.

dim(counts)
## [1] 25702     4
head(counts)
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 100422834       1       0       0       0
## 645520          5       3       0       0
## 79501           0       0       0       0
## 729737         70      64      24      21
## 100507658      10       7       5       4

We also need to define a model matrix since this is not done in the vignette:

group <- rep(c("A", "B"), each = 2)
design <- model.matrix(~group)

Now for the actual limma vignette (starting in 15.3):

15.3 Normalization and Filtering

Once a matrix of read counts has been created, with rows for genes and columns for samples, it is convenient to create a DGEList object using the edgeR package.

dge <- DGEList(counts=counts)
dge
## An object of class "DGEList"
## $counts
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 100422834       1       0       0       0
## 645520          5       3       0       0
## 79501           0       0       0       0
## 729737         70      64      24      21
## 25697 more rows ...
## 
## $samples
##         group lib.size norm.factors
## A_1.bam     1   383672            1
## A_2.bam     1   323786            1
## B_1.bam     1   464801            1
## B_2.bam     1   451689            1

A DGEList object is a container for counts, normalization factors, and library sizes.

The next step is to remove rows that consistently have zero or very low counts. One can for example use

keep <- filterByExpr(dge, design)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge
## An object of class "DGEList"
## $counts
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 729737         70      64      24      21
## 100132287      63      43      20      15
## 100131754   11839   10685   45338   44547
## 100133331      77      60      33      39
## 1615 more rows ...
## 
## $samples
##         group lib.size norm.factors
## A_1.bam     1   382288            1
## A_2.bam     1   322674            1
## B_1.bam     1   463301            1
## B_2.bam     1   450284            1

Our data have been filtered from 25,702 genes to 1620 genes. Note that filtering is not an exact science, and is done different ways even within the limma User’s Guide (see e.g. section 18.1.7).

Here we will assume that filtering has been done.

It is usual to apply scale normalization to RNA-seq read counts, and the TMM normalization method in particular has been found to perform well in comparative studies. This can be applied to the DGEList object:

dge <- calcNormFactors(dge)

The norm.factors column in dge$samples has now been assigned numbers other than the default value of 1:

dge
## An object of class "DGEList"
## $counts
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 729737         70      64      24      21
## 100132287      63      43      20      15
## 100131754   11839   10685   45338   44547
## 100133331      77      60      33      39
## 1615 more rows ...
## 
## $samples
##         group lib.size norm.factors
## A_1.bam     1   382288    1.1629573
## A_2.bam     1   322674    1.1573440
## B_1.bam     1   463301    0.8572287
## B_2.bam     1   450284    0.8667165

15.4 Differential expression: limma-trend

If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential expression is to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold.

In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function:

logCPM <- cpm(dge, log=TRUE, prior.count=3)

prior.count is the constant that is added to all counts before log transformation in order to avoid taking the log of 0. Its default value is 0.25.

The prior count is used here to damp down the variances of logarithms of low counts.

The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:

fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design))
##               logFC   AveExpr         t      P.Value    adj.P.Val        B
## 477        6.702730 10.097521 104.64572 1.289558e-13 1.163076e-10 19.42430
## 2752       2.822725 12.911626 103.21992 1.435896e-13 1.163076e-10 19.38206
## 23154      4.149472 11.364604  92.95400 3.262592e-13 1.761800e-10 19.03409
## 22883      2.668013 12.459653  88.78688 4.673095e-13 1.892604e-10 18.86725
## 23208      6.191143 10.078057  85.22099 6.442451e-13 2.087354e-10 18.71049
## 481        3.428246 11.584274  78.71541 1.199934e-12 3.239821e-10 18.38600
## 2023      -2.280158 13.629827 -70.07808 2.981364e-12 5.614906e-10 17.86117
## 100131754  2.047753 15.776777  70.02656 2.998580e-12 5.614906e-10 17.85766
## 10472      4.402382 10.024830  69.39116 3.220388e-12 5.614906e-10 17.81397
## 9783       5.220951  9.713354  68.74277 3.465991e-12 5.614906e-10 17.76860

Using trend=TRUE means that limma “squeezes” the genewise variances towards a trend line (rather than the default global mean).

coef = ncol(design) in the call to topTable works out to coef = 2. We are telling topTable to test the second coefficient (“groupB”):

head(coef(fit))
##           (Intercept)      groupB
## 653635      10.450526  0.07280501
## 729737       7.424122 -1.41271267
## 100132287    7.079451 -1.39220328
## 100131754   14.752900  2.04775296
## 100133331    7.445348 -0.82146092
## 79854        5.643624  0.24920993

In the output from topTable:

topTable shows the top 10 most significant genes by default, the argument n can be used to change that number.

Or, to give more weight to fold-changes in the gene ranking, one might use:

fit <- treat(fit, lfc=log2(1.2))
topTreat(fit, coef=ncol(design))
##            logFC   AveExpr         t      P.Value    adj.P.Val
## 79727  -7.967585  6.885753 -92.45334 4.136977e-08 2.500193e-05
## 10485   7.716040  6.759981  87.65236 5.144712e-08 2.500193e-05
## 51440   7.742719  8.076993  86.85248 5.346975e-08 2.500193e-05
## 3765    7.837163  6.820542  79.50812 7.733537e-08 2.500193e-05
## 127254  6.619733  6.211827  75.74582 9.257224e-08 2.500193e-05
## 477     6.702730 10.097521  71.87338 1.153532e-07 2.500193e-05
## 26052   7.525391  8.067417  72.03942 1.159934e-07 2.500193e-05
## 7143    6.078910  5.941416  67.16788 1.507634e-07 2.500193e-05
## 3208    6.857143  7.426971  67.07065 1.542778e-07 2.500193e-05
## 777     5.920557  6.272886  65.76432 1.639550e-07 2.500193e-05

The function treat() tests if the absolute log fold change is greater than the indicated value (log2(1.2)), rather than 0.

15.5 Differential expression: voom

When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend. In this approach, the voom transformation is applied to the normalized and filtered DGEList object:

v <- voom(dge, design, plot=TRUE)

In the plot, each point represents a gene, with the x-axis showing the average expression in log2 counts per million reads and the y-axis showing the square root of the residual standard deviation of expression in log2 counts per million reads.

The voom transformation uses the experiment design matrix, and produces an EList object.

It is also possible to give a matrix of counts directly to voom without TMM normalization, by

v <- voom(counts, design, plot=TRUE)

If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example:

v <- voom(counts, design, plot=TRUE, normalize="quantile")

After this, the usual limma pipelines for differential expression can be applied, for example:

fit <- lmFit(v, design)
fit <- eBayes(fit)
## Warning: Zero sample variances detected, have been offset away from zero
topTable(fit, coef=ncol(design))
##               logFC  AveExpr             t      P.Value    adj.P.Val
## 100507192 -2.929838 3.365092 -1.626574e+16 1.072348e-34 1.207500e-30
## 727721    -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 7780      -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507337 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507664 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 270       -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 3284      -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 728945    -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 730227    -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 148930     4.122552 2.376486  5.910097e+15 8.962322e-34 2.059773e-30
##                  B
## 100507192 68.59136
## 727721    66.95172
## 7780      66.95172
## 100507337 66.95172
## 100507664 66.95172
## 270       66.95172
## 3284      66.95172
## 728945    66.95172
## 730227    66.95172
## 148930    65.65863

Or, to give more weight to fold-changes in the ranking, one could use say:

fit <- treat(fit, lfc=log2(1.2))
## Warning: Zero sample variances detected, have been offset away from zero
topTreat(fit, coef=ncol(design))
##               logFC  AveExpr             t      P.Value    adj.P.Val
## 100507192 -2.929838 3.365092 -1.480544e+16 1.100817e-34 1.320896e-30
## 727721    -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 7780      -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507337 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507664 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 270       -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 3284      -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 728945    -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 730227    -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 148930     4.122552 2.376486  5.533010e+15 9.081657e-34 2.253205e-30

biomaRt

The biomaRt package provides an R interface to Ensembl databases.

Install biomaRt if not already installed

# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")

Let’s start with part 2 of the biomaRt vignette (accessible from within R by vignette(“biomaRt”)):

2 Selecting a BioMart database and dataset

Every analysis with biomaRt starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function listMarts() will display all available BioMart web services

library("biomaRt")
listMarts()
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 93
## 2   ENSEMBL_MART_MOUSE      Mouse strains 93
## 3     ENSEMBL_MART_SNP  Ensembl Variation 93
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 93

The useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by listMarts(). In the next example we choose to query the Ensembl BioMart database.

ensembl=useMart("ensembl")

BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function listDatasets().

listDatasets(ensembl)
##                            dataset
## 1       acarolinensis_gene_ensembl
## 2        amelanoleuca_gene_ensembl
## 3          amexicanus_gene_ensembl
## 4          anancymaae_gene_ensembl
## 5      aplatyrhynchos_gene_ensembl
## 6             btaurus_gene_ensembl
## 7             caperea_gene_ensembl
## 8               catys_gene_ensembl
## 9          ccapucinus_gene_ensembl
## 10         cchok1gshd_gene_ensembl
## 11            ccrigri_gene_ensembl
## 12           celegans_gene_ensembl
## 13        cfamiliaris_gene_ensembl
## 14            chircus_gene_ensembl
## 15         choffmanni_gene_ensembl
## 16      cintestinalis_gene_ensembl
## 17           cjacchus_gene_ensembl
## 18          clanigera_gene_ensembl
## 19         cpalliatus_gene_ensembl
## 20         cporcellus_gene_ensembl
## 21           csabaeus_gene_ensembl
## 22          csavignyi_gene_ensembl
## 23          csyrichta_gene_ensembl
## 24      dmelanogaster_gene_ensembl
## 25      dnovemcinctus_gene_ensembl
## 26             dordii_gene_ensembl
## 27             drerio_gene_ensembl
## 28           eburgeri_gene_ensembl
## 29          ecaballus_gene_ensembl
## 30         eeuropaeus_gene_ensembl
## 31          etelfairi_gene_ensembl
## 32        falbicollis_gene_ensembl
## 33             fcatus_gene_ensembl
## 34        fdamarensis_gene_ensembl
## 35         gaculeatus_gene_ensembl
## 36            ggallus_gene_ensembl
## 37           ggorilla_gene_ensembl
## 38            gmorhua_gene_ensembl
## 39            hfemale_gene_ensembl
## 40              hmale_gene_ensembl
## 41           hsapiens_gene_ensembl
## 42  itridecemlineatus_gene_ensembl
## 43           jjaculus_gene_ensembl
## 44          lafricana_gene_ensembl
## 45         lchalumnae_gene_ensembl
## 46          loculatus_gene_ensembl
## 47           mauratus_gene_ensembl
## 48            mcaroli_gene_ensembl
## 49         mdomestica_gene_ensembl
## 50      mfascicularis_gene_ensembl
## 51              mfuro_gene_ensembl
## 52         mgallopavo_gene_ensembl
## 53       mleucophaeus_gene_ensembl
## 54         mlucifugus_gene_ensembl
## 55           mmulatta_gene_ensembl
## 56           mmurinus_gene_ensembl
## 57          mmusculus_gene_ensembl
## 58        mnemestrina_gene_ensembl
## 59       mochrogaster_gene_ensembl
## 60            mpahari_gene_ensembl
## 61           mspretus_gene_ensembl
## 62           neugenii_gene_ensembl
## 63            ngalili_gene_ensembl
## 64        nleucogenys_gene_ensembl
## 65          oanatinus_gene_ensembl
## 66             oaries_gene_ensembl
## 67         ocuniculus_gene_ensembl
## 68             odegus_gene_ensembl
## 69         ogarnettii_gene_ensembl
## 70           olatipes_gene_ensembl
## 71         oniloticus_gene_ensembl
## 72          oprinceps_gene_ensembl
## 73            pabelii_gene_ensembl
## 74           paltaica_gene_ensembl
## 75            panubis_gene_ensembl
## 76           pbairdii_gene_ensembl
## 77          pcapensis_gene_ensembl
## 78         pcoquereli_gene_ensembl
## 79           pformosa_gene_ensembl
## 80           pmarinus_gene_ensembl
## 81          ppaniscus_gene_ensembl
## 82            ppardus_gene_ensembl
## 83          psinensis_gene_ensembl
## 84       ptroglodytes_gene_ensembl
## 85          pvampyrus_gene_ensembl
## 86             rbieti_gene_ensembl
## 87        rnorvegicus_gene_ensembl
## 88         rroxellana_gene_ensembl
## 89           saraneus_gene_ensembl
## 90       sboliviensis_gene_ensembl
## 91        scerevisiae_gene_ensembl
## 92          sharrisii_gene_ensembl
## 93            sscrofa_gene_ensembl
## 94         tbelangeri_gene_ensembl
## 95           tguttata_gene_ensembl
## 96      tnigroviridis_gene_ensembl
## 97          trubripes_gene_ensembl
## 98         ttruncatus_gene_ensembl
## 99             vpacos_gene_ensembl
## 100        xmaculatus_gene_ensembl
## 101       xtropicalis_gene_ensembl
##                                                      description
## 1                                 Anole lizard genes (AnoCar2.0)
## 2                                          Panda genes (ailMel1)
## 3                                    Cave fish genes (AstMex102)
## 4                             Ma's night monkey genes (Anan_2.0)
## 5                                      Duck genes (BGI_duck_1.0)
## 6                                             Cow genes (UMD3.1)
## 7                          Brazilian guinea pig genes (CavAp1.0)
## 8                                Sooty mangabey genes (Caty_1.0)
## 9                            Capuchin genes (Cebus_imitator-1.0)
## 10                  Chinese hamster CHOK1GS genes (CHOK1GS_HDv1)
## 11                     Chinese hamster CriGri genes (CriGri_1.0)
## 12                       Caenorhabditis elegans genes (WBcel235)
## 13                                         Dog genes (CanFam3.1)
## 14                                             Goat genes (ARS1)
## 15                                         Sloth genes (choHof1)
## 16                                     C.intestinalis genes (KH)
## 17                                  Marmoset genes (ASM275486v1)
## 18                      Long-tailed chinchilla genes (ChiLan1.0)
## 19                            Angola colobus genes (Cang.pa_1.0)
## 20                                  Guinea Pig genes (Cavpor3.0)
## 21                                  Vervet-AGM genes (ChlSab1.1)
## 22                                   C.savignyi genes (CSAV 2.0)
## 23                        Tarsier genes (Tarsius_syrichta-2.0.1)
## 24                                        Fruitfly genes (BDGP6)
## 25                                   Armadillo genes (Dasnov3.0)
## 26                                 Kangaroo rat genes (Dord_2.0)
## 27                                      Zebrafish genes (GRCz11)
## 28                          Inshore hagfish genes (Eburgeri_3.2)
## 29                                       Horse genes (Equ Cab 2)
## 30                                      Hedgehog genes (eriEur1)
## 31                         Lesser hedgehog tenrec genes (TENREC)
## 32                                 Flycatcher genes (FicAlb_1.4)
## 33                                   Cat genes (Felis_catus_9.0)
## 34                              Damara mole rat genes (DMR_v1.0)
## 35                                  Stickleback genes (BROAD S1)
## 36                             Chicken genes (Gallus_gallus-5.0)
## 37                                       Gorilla genes (gorGor4)
## 38                                           Cod genes (gadMor1)
## 39               Naked mole-rat female genes (HetGla_female_1.0)
## 40                        Naked mole-rat male genes (HetGla_1.0)
## 41                                      Human genes (GRCh38.p12)
## 42                                    Squirrel genes (SpeTri2.0)
## 43                      Lesser Egyptian jerboa genes (JacJac1.0)
## 44                                    Elephant genes (Loxafr3.0)
## 45                                    Coelacanth genes (LatCha1)
## 46                                   Spotted gar genes (LepOcu1)
## 47                              Golden Hamster genes (MesAur1.0)
## 48                          Ryukyu mouse genes (CAROLI_EIJ_v1.1)
## 49                                       Opossum genes (monDom5)
## 50           Crab-eating macaque genes (Macaca_fascicularis_5.0)
## 51                                   Ferret genes (MusPutFur1.0)
## 52                                    Turkey genes (Turkey_2.01)
## 53                                     Drill genes (Mleu.le_1.0)
## 54                                    Microbat genes (Myoluc2.0)
## 55                                    Macaque genes (Mmul_8.0.1)
## 56                                  Mouse Lemur genes (Mmur_3.0)
## 57                                       Mouse genes (GRCm38.p6)
## 58                           Pig-tailed macaque genes (Mnem_1.0)
## 59                                Prairie vole genes (MicOch1.0)
## 60                           Shrew mouse genes (PAHARI_EIJ_v1.1)
## 61                           Algerian mouse genes (SPRET_EiJ_v1)
## 62                                      Wallaby genes (Meug_1.0)
## 63  Upper Galilee mountains blind mole rat genes (S.galili_v1.0)
## 64                                       Gibbon genes (Nleu_3.0)
## 65                                        Platypus genes (OANA5)
## 66                                        Sheep genes (Oar_v3.1)
## 67                                      Rabbit genes (OryCun2.0)
## 68                                        Degu genes (OctDeg1.0)
## 69                                      Bushbaby genes (OtoGar3)
## 70                                           Medaka genes (HdrR)
## 71                                     Tilapia genes (Orenil1.0)
## 72                                    Pika genes (OchPri2.0-Ens)
## 73                                       Orangutan genes (PPYG2)
## 74                                  Amur tiger genes (PanTig1.0)
## 75                                 Olive baboon genes (Panu_3.0)
## 76                 Northern American deer mouse genes (Pman_1.0)
## 77                                         Hyrax genes (proCap1)
## 78                            Coquerel's sifaka genes (Pcoq_1.0)
## 79                   Amazon molly genes (Poecilia_formosa-5.1.2)
## 80                                  Lamprey genes (Pmarinus_7.0)
## 81                                      Bonobo genes (panpan1.1)
## 82                                     Leopard genes (PanPar1.0)
## 83                   Chinese softshell turtle genes (PelSin_1.0)
## 84                                Chimpanzee genes (Pan_tro_3.0)
## 85                                       Megabat genes (pteVam1)
## 86                   Black snub-nosed monkey genes (ASM169854v1)
## 87                                          Rat genes (Rnor_6.0)
## 88                      Golden snub-nosed monkey genes (Rrox_v1)
## 89                                         Shrew genes (sorAra1)
## 90                    Bolivian squirrel monkey genes (SaiBol1.0)
## 91                      Saccharomyces cerevisiae genes (R64-1-1)
## 92                        Tasmanian devil genes (Devil_ref v7.0)
## 93                                       Pig genes (Sscrofa11.1)
## 94                                    Tree Shrew genes (tupBel1)
## 95                               Zebra Finch genes (taeGut3.2.4)
## 96                               Tetraodon genes (TETRAODON 8.0)
## 97                                         Fugu genes (FUGU 4.0)
## 98                                       Dolphin genes (turTru1)
## 99                                        Alpaca genes (vicPac1)
## 100                                Platyfish genes (Xipmac4.4.2)
## 101                                      Xenopus genes (JGI 4.2)
##                     version
## 1                 AnoCar2.0
## 2                   ailMel1
## 3                 AstMex102
## 4                  Anan_2.0
## 5              BGI_duck_1.0
## 6                    UMD3.1
## 7                  CavAp1.0
## 8                  Caty_1.0
## 9        Cebus_imitator-1.0
## 10             CHOK1GS_HDv1
## 11               CriGri_1.0
## 12                 WBcel235
## 13                CanFam3.1
## 14                     ARS1
## 15                  choHof1
## 16                       KH
## 17              ASM275486v1
## 18                ChiLan1.0
## 19              Cang.pa_1.0
## 20                Cavpor3.0
## 21                ChlSab1.1
## 22                 CSAV 2.0
## 23   Tarsius_syrichta-2.0.1
## 24                    BDGP6
## 25                Dasnov3.0
## 26                 Dord_2.0
## 27                   GRCz11
## 28             Eburgeri_3.2
## 29                Equ Cab 2
## 30                  eriEur1
## 31                   TENREC
## 32               FicAlb_1.4
## 33          Felis_catus_9.0
## 34                 DMR_v1.0
## 35                 BROAD S1
## 36        Gallus_gallus-5.0
## 37                  gorGor4
## 38                  gadMor1
## 39        HetGla_female_1.0
## 40               HetGla_1.0
## 41               GRCh38.p12
## 42                SpeTri2.0
## 43                JacJac1.0
## 44                Loxafr3.0
## 45                  LatCha1
## 46                  LepOcu1
## 47                MesAur1.0
## 48          CAROLI_EIJ_v1.1
## 49                  monDom5
## 50  Macaca_fascicularis_5.0
## 51             MusPutFur1.0
## 52              Turkey_2.01
## 53              Mleu.le_1.0
## 54                Myoluc2.0
## 55               Mmul_8.0.1
## 56                 Mmur_3.0
## 57                GRCm38.p6
## 58                 Mnem_1.0
## 59                MicOch1.0
## 60          PAHARI_EIJ_v1.1
## 61             SPRET_EiJ_v1
## 62                 Meug_1.0
## 63            S.galili_v1.0
## 64                 Nleu_3.0
## 65                    OANA5
## 66                 Oar_v3.1
## 67                OryCun2.0
## 68                OctDeg1.0
## 69                  OtoGar3
## 70                     HdrR
## 71                Orenil1.0
## 72            OchPri2.0-Ens
## 73                    PPYG2
## 74                PanTig1.0
## 75                 Panu_3.0
## 76                 Pman_1.0
## 77                  proCap1
## 78                 Pcoq_1.0
## 79   Poecilia_formosa-5.1.2
## 80             Pmarinus_7.0
## 81                panpan1.1
## 82                PanPar1.0
## 83               PelSin_1.0
## 84              Pan_tro_3.0
## 85                  pteVam1
## 86              ASM169854v1
## 87                 Rnor_6.0
## 88                  Rrox_v1
## 89                  sorAra1
## 90                SaiBol1.0
## 91                  R64-1-1
## 92           Devil_ref v7.0
## 93              Sscrofa11.1
## 94                  tupBel1
## 95              taeGut3.2.4
## 96            TETRAODON 8.0
## 97                 FUGU 4.0
## 98                  turTru1
## 99                  vicPac1
## 100             Xipmac4.4.2
## 101                 JGI 4.2

To select a dataset we can update the Mart object using the function useDataset(). In the example below we choose to use the hsapiens dataset.

ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

3 How to build a biomaRt query

The getBM() function has three arguments that need to be introduced: filters, attributes and values. Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value ‘X’. The listFilters() function shows you all available filters in the selected dataset.

filters = listFilters(ensembl)
filters[1:70,]
##                                   name
## 1                      chromosome_name
## 2                                start
## 3                                  end
## 4                           band_start
## 5                             band_end
## 6                         marker_start
## 7                           marker_end
## 8                        encode_region
## 9                               strand
## 10                  chromosomal_region
## 11                           with_ccds
## 12                         with_chembl
## 13       with_clone_based_ensembl_gene
## 14 with_clone_based_ensembl_transcript
## 15                         with_dbass3
## 16                         with_dbass5
## 17              with_ens_hs_transcript
## 18             with_ens_hs_translation
## 19          with_entrezgene_trans_name
## 20                           with_embl
## 21                   with_arrayexpress
## 22                         with_genedb
## 23                             with_go
## 24                     with_goslim_goa
## 25                           with_ottp
## 26                           with_hgnc
## 27                with_hgnc_trans_name
## 28                            with_hpa
## 29                     with_protein_id
## 30                    with_kegg_enzyme
## 31                   with_ens_lrg_gene
## 32             with_ens_lrg_transcript
## 33                         with_merops
## 34                       with_mim_gene
## 35                     with_mim_morbid
## 36                        with_mirbase
## 37             with_mirbase_trans_name
## 38                     with_entrezgene
## 39                            with_pdb
## 40                       with_reactome
## 41                  with_reactome_gene
## 42            with_reactome_transcript
## 43                    with_refseq_mrna
## 44          with_refseq_mrna_predicted
## 45                   with_refseq_ncrna
## 46         with_refseq_ncrna_predicted
## 47                 with_refseq_peptide
## 48       with_refseq_peptide_predicted
## 49                           with_rfam
## 50                with_rfam_trans_name
## 51                     with_rnacentral
## 52                           with_ucsc
## 53                        with_uniparc
## 54                     with_uniprot_gn
## 55               with_uniprotswissprot
## 56                with_uniprotsptrembl
## 57               with_vega_translation
## 58                       with_wikigene
## 59                     ensembl_gene_id
## 60             ensembl_gene_id_version
## 61               ensembl_transcript_id
## 62       ensembl_transcript_id_version
## 63                  ensembl_peptide_id
## 64          ensembl_peptide_id_version
## 65                     ensembl_exon_id
## 66                  external_gene_name
## 67            external_transcript_name
## 68                                ccds
## 69                              chembl
## 70            clone_based_ensembl_gene
##                                                      description
## 1                                       Chromosome/scaffold name
## 2                                                          Start
## 3                                                            End
## 4                                                     Band Start
## 5                                                       Band End
## 6                                                   Marker Start
## 7                                                     Marker End
## 8                                                  Encode region
## 9                                                         Strand
## 10                        e.g. 1:100:10000:-1, 1:100000:200000:1
## 11                                               With CCDS ID(s)
## 12                                             With ChEMBL ID(s)
## 13                         With Clone-based (Ensembl) gene ID(s)
## 14                   With Clone-based (Ensembl) transcript ID(s)
## 15               With DataBase of Aberrant 3' Splice Sites ID(s)
## 16               With DataBase of Aberrant 5' Splice Sites ID(s)
## 17                           With Ensembl Human Transcript ID(s)
## 18                          With Ensembl Human Translation ID(s)
## 19                         With EntrezGene transcript name ID(s)
## 20                        With European Nucleotide Archive ID(s)
## 21                                   With Expression Atlas ID(s)
## 22                                             With GeneDB ID(s)
## 23                                                 With GO ID(s)
## 24                                         With GOSlim GOA ID(s)
## 25                                 With Havana translation ID(s)
## 26                                        With HGNC Symbol ID(s)
## 27                               With HGNC transcript name ID(s)
## 28                                With Human Protein Atlas ID(s)
## 29                                   With INSDC protein ID ID(s)
## 30                            With KEGG Pathway and Enzyme ID(s)
## 31                        With LRG display in Ensembl gene ID(s)
## 32                  With LRG display in Ensembl transcript ID(s)
## 33                    With MEROPS - the Peptidase Database ID(s)
## 34                                           With MIM gene ID(s)
## 35                                         With MIM morbid ID(s)
## 36                                            With miRBase ID(s)
## 37                            With miRBase transcript name ID(s)
## 38                                          With NCBI gene ID(s)
## 39                                                With PDB ID(s)
## 40                                           With Reactome ID(s)
## 41                                      With Reactome gene ID(s)
## 42                                With Reactome transcript ID(s)
## 43                                        With RefSeq mRNA ID(s)
## 44                              With RefSeq mRNA predicted ID(s)
## 45                                       With RefSeq ncRNA ID(s)
## 46                             With RefSeq ncRNA predicted ID(s)
## 47                                     With RefSeq peptide ID(s)
## 48                           With RefSeq peptide predicted ID(s)
## 49                                               With RFAM ID(s)
## 50                               With RFAM transcript name ID(s)
## 51                                         With RNAcentral ID(s)
## 52                                     With UCSC Stable ID ID(s)
## 53                                            With UniParc ID(s)
## 54                                With UniProtKB Gene Name ID(s)
## 55                               With UniProtKB/Swiss-Prot ID(s)
## 56                                   With UniProtKB/TrEMBL ID(s)
## 57                                   With Vega translation ID(s)
## 58                                           With WikiGene ID(s)
## 59                      Gene stable ID(s) [e.g. ENSG00000000003]
## 60      Gene stable ID(s) with version [e.g. ENSG00000000003.14]
## 61                Transcript stable ID(s) [e.g. ENST00000000233]
## 62 Transcript stable ID(s) with version [e.g. ENST00000000233.9]
## 63                   Protein stable ID(s) [e.g. ENSP00000000233]
## 64    Protein stable ID(s) with version [e.g. ENSP00000000233.5]
## 65                             Exon ID(s) [e.g. ENSE00000327880]
## 66                                   Gene Name(s) [e.g. RF00100]
## 67                      Transcript Name(s) [e.g. RF00100.12-201]
## 68                                      CCDS ID(s) [e.g. CCDS10]
## 69                             ChEMBL ID(s) [e.g. CHEMBL1075092]
## 70            Clone-based (Ensembl) gene ID(s) [e.g. AB015752.1]

Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The listAttributes() function displays all available attributes in the selected dataset.

attributes = listAttributes(ensembl)
attributes[1:20,]
##                             name
## 1                ensembl_gene_id
## 2        ensembl_gene_id_version
## 3          ensembl_transcript_id
## 4  ensembl_transcript_id_version
## 5             ensembl_peptide_id
## 6     ensembl_peptide_id_version
## 7                ensembl_exon_id
## 8                    description
## 9                chromosome_name
## 10                start_position
## 11                  end_position
## 12                        strand
## 13                          band
## 14              transcript_start
## 15                transcript_end
## 16      transcription_start_site
## 17             transcript_length
## 18                transcript_tsl
## 19      transcript_gencode_basic
## 20             transcript_appris
##                                   description         page
## 1                              Gene stable ID feature_page
## 2                      Gene stable ID version feature_page
## 3                        Transcript stable ID feature_page
## 4                Transcript stable ID version feature_page
## 5                           Protein stable ID feature_page
## 6                   Protein stable ID version feature_page
## 7                              Exon stable ID feature_page
## 8                            Gene description feature_page
## 9                    Chromosome/scaffold name feature_page
## 10                            Gene start (bp) feature_page
## 11                              Gene end (bp) feature_page
## 12                                     Strand feature_page
## 13                             Karyotype band feature_page
## 14                      Transcript start (bp) feature_page
## 15                        Transcript end (bp) feature_page
## 16             Transcription start site (TSS) feature_page
## 17 Transcript length (including UTRs and CDS) feature_page
## 18             Transcript support level (TSL) feature_page
## 19                   GENCODE basic annotation feature_page
## 20                          APPRIS annotation feature_page

The getBM() function is the main query function in biomaRt. It has four main arguments:

- attributes: is a vector of attributes that one wants to retrieve (= the output of the query).

- filters: is a vector of filters that one wil use as input to the query.

- values: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below).

- mart: is an object of class Mart, which is created by the useMart() function.

Note: for some frequently used queries to Ensembl, wrapper functions are available: getGene() and getSequence(). These functions call the getBM() function with hard coded filter and attribute names.

Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a biomaRt query. Let’s make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.

The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the listAttributes() and listFilters() function respectively. Let’s now run the query:

affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), 
      filters = 'affy_hg_u133_plus_2', 
      values = affyids, 
      mart = ensembl)
##   affy_hg_u133_plus_2 entrezgene
## 1         209310_s_at        837
## 2           207500_at        838
## 3           202763_at        836

Putting it all Together

Let’s use biomaRt to get annotation for our output from limma:

v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
tab <- topTable(fit, coef=ncol(design), n = 20)
head(tab)
##               logFC  AveExpr         t      P.Value    adj.P.Val        B
## 100131754  2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752       2.826613 12.90987  96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883      2.671140 12.45734  79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023      -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154      4.160514 11.35699  60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682       3.023933 11.74011  56.60478 3.588091e-17 9.687847e-15 29.92931

The identifiers (row names) in our top table are Entrez Gene identifiers, so let’s add a column with that information

tab$entrezgene <- rownames(tab)
head(tab)
##               logFC  AveExpr         t      P.Value    adj.P.Val        B
## 100131754  2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752       2.826613 12.90987  96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883      2.671140 12.45734  79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023      -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154      4.160514 11.35699  60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682       3.023933 11.74011  56.60478 3.588091e-17 9.687847e-15 29.92931
##           entrezgene
## 100131754  100131754
## 2752            2752
## 22883          22883
## 2023            2023
## 23154          23154
## 8682            8682

Let’s get the gene symbol and gene description matching those Entrez Gene identifiers:

anno <- getBM(ensembl, attributes = c("entrezgene", "hgnc_symbol", "description"), filters = "entrezgene", values = tab$entrezgene)
anno
##    entrezgene hgnc_symbol
## 1       10472      ZBTB18
## 2        1465       CSRP1
## 3        2023        ENO1
## 4       22883      CLSTN1
## 5       23095       KIF1B
## 6       23154        NCDN
## 7       23499       MACF1
## 8        2752        GLUL
## 9      284654       RSPO1
## 10        477      ATP1A2
## 11        481      ATP1B1
## 12       4904        YBX1
## 13      55450     CAMK2N1
## 14       5567      PRKACB
## 15       6135       RPL11
## 16       6202        RPS8
## 17       8682       PEA15
## 18       9659     PDE4DIP
## 19       9900        SV2A
##                                                                                       description
## 1                    zinc finger and BTB domain containing 18 [Source:HGNC Symbol;Acc:HGNC:13030]
## 2                          cysteine and glycine rich protein 1 [Source:HGNC Symbol;Acc:HGNC:2469]
## 3                                                    enolase 1 [Source:HGNC Symbol;Acc:HGNC:3350]
## 4                                               calsyntenin 1 [Source:HGNC Symbol;Acc:HGNC:17447]
## 5                                    kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
## 6                                               neurochondrin [Source:HGNC Symbol;Acc:HGNC:17597]
## 7                     microtubule-actin crosslinking factor 1 [Source:HGNC Symbol;Acc:HGNC:13664]
## 8                                     glutamate-ammonia ligase [Source:HGNC Symbol;Acc:HGNC:4341]
## 9                                                 R-spondin 1 [Source:HGNC Symbol;Acc:HGNC:21679]
## 10                   ATPase Na+/K+ transporting subunit alpha 2 [Source:HGNC Symbol;Acc:HGNC:800]
## 11                    ATPase Na+/K+ transporting subunit beta 1 [Source:HGNC Symbol;Acc:HGNC:804]
## 12                                     Y-box binding protein 1 [Source:HGNC Symbol;Acc:HGNC:8014]
## 13 calcium/calmodulin dependent protein kinase II inhibitor 1 [Source:HGNC Symbol;Acc:HGNC:24190]
## 14        protein kinase cAMP-activated catalytic subunit beta [Source:HGNC Symbol;Acc:HGNC:9381]
## 15                                      ribosomal protein L11 [Source:HGNC Symbol;Acc:HGNC:10301]
## 16                                       ribosomal protein S8 [Source:HGNC Symbol;Acc:HGNC:10441]
## 17              proliferation and apoptosis adaptor protein 15 [Source:HGNC Symbol;Acc:HGNC:8822]
## 18                   phosphodiesterase 4D interacting protein [Source:HGNC Symbol;Acc:HGNC:15580]
## 19                           synaptic vesicle glycoprotein 2A [Source:HGNC Symbol;Acc:HGNC:20566]