Note: Throughout, text taken directly from the vignettes will be shown in georgia font and code/output from the vignettes will be shown with a blue background. Everything else is my commentary/additions.

limma

Limma is an R package for differential expression testing of RNASeq and microarray data. The limma User’s Guide is an extensive, 100+ page summary of limma’s many capabilities. We will focus only on Chapter 15, “RNA-seq Data”.

Install edgeR if you have not already done so (installs limma as a dependency):

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("edgeR", version = "3.8")

Load edgeR (which loads limma as a dependency)

library(edgeR)
## Loading required package: limma

Read in a counts table:

counts <- read.delim('https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-September-Bioinformatics-Prerequisites/master/friday/counts.tsv', row.names = 1)

The counts table has one column per sample and one row per gene. It shows the number of reads aligning to each gene in each sample.

dim(counts)
## [1] 25702     4
head(counts)
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 100422834       1       0       0       0
## 645520          5       3       0       0
## 79501           0       0       0       0
## 729737         70      64      24      21
## 100507658      10       7       5       4

We also need to define a model matrix since this is not done in the vignette:

group <- rep(c("A", "B"), each = 2)
design <- model.matrix(~group)

Now for the actual limma vignette (starting in 15.3):

15.3 Normalization and Filtering

Once a matrix of read counts has been created, with rows for genes and columns for samples, it is convenient to create a DGEList object using the edgeR package.

dge <- DGEList(counts=counts)
dge
## An object of class "DGEList"
## $counts
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 100422834       1       0       0       0
## 645520          5       3       0       0
## 79501           0       0       0       0
## 729737         70      64      24      21
## 25697 more rows ...
## 
## $samples
##         group lib.size norm.factors
## A_1.bam     1   383672            1
## A_2.bam     1   323786            1
## B_1.bam     1   464801            1
## B_2.bam     1   451689            1

A DGEList object is a container for counts, normalization factors, and library sizes.

The next step is to remove rows that consistently have zero or very low counts. One can for example use

keep <- filterByExpr(dge, design)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge
## An object of class "DGEList"
## $counts
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 729737         70      64      24      21
## 100132287      63      43      20      15
## 100131754   11839   10685   45338   44547
## 100133331      77      60      33      39
## 1615 more rows ...
## 
## $samples
##         group lib.size norm.factors
## A_1.bam     1   382288            1
## A_2.bam     1   322674            1
## B_1.bam     1   463301            1
## B_2.bam     1   450284            1

Our data have been filtered from 25,702 genes to 1620 genes. Note that filtering is not an exact science, and is done different ways even within the limma User’s Guide (see e.g. section 18.1.7).

Here we will assume that filtering has been done.

It is usual to apply scale normalization to RNA-seq read counts, and the TMM normalization method in particular has been found to perform well in comparative studies. This can be applied to the DGEList object:

dge <- calcNormFactors(dge)

The norm.factors column in dge$samples has now been assigned numbers other than the default value of 1:

dge
## An object of class "DGEList"
## $counts
##           A_1.bam A_2.bam B_1.bam B_2.bam
## 653635        627     513     579     574
## 729737         70      64      24      21
## 100132287      63      43      20      15
## 100131754   11839   10685   45338   44547
## 100133331      77      60      33      39
## 1615 more rows ...
## 
## $samples
##         group lib.size norm.factors
## A_1.bam     1   382288    1.1629573
## A_2.bam     1   322674    1.1573440
## B_1.bam     1   463301    0.8572287
## B_2.bam     1   450284    0.8667165

15.4 Differential expression: limma-trend

If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential expression is to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold.

In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function:

logCPM <- cpm(dge, log=TRUE, prior.count=3)

prior.count is the constant that is added to all counts before log transformation in order to avoid taking the log of 0. Its default value is 0.25.

The prior count is used here to damp down the variances of logarithms of low counts.

The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:

fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design))
##               logFC   AveExpr         t      P.Value    adj.P.Val        B
## 477        6.702730 10.097521 104.64572 1.289558e-13 1.163076e-10 19.42430
## 2752       2.822725 12.911626 103.21992 1.435896e-13 1.163076e-10 19.38206
## 23154      4.149472 11.364604  92.95400 3.262592e-13 1.761800e-10 19.03409
## 22883      2.668013 12.459653  88.78688 4.673095e-13 1.892604e-10 18.86725
## 23208      6.191143 10.078057  85.22099 6.442451e-13 2.087354e-10 18.71049
## 481        3.428246 11.584274  78.71541 1.199934e-12 3.239821e-10 18.38600
## 2023      -2.280158 13.629827 -70.07808 2.981364e-12 5.614906e-10 17.86117
## 100131754  2.047753 15.776777  70.02656 2.998580e-12 5.614906e-10 17.85766
## 10472      4.402382 10.024830  69.39116 3.220388e-12 5.614906e-10 17.81397
## 9783       5.220951  9.713354  68.74277 3.465991e-12 5.614906e-10 17.76860

Using trend=TRUE means that limma “squeezes” the genewise variances towards a trend line (rather than the default global mean).

coef = ncol(design) in the call to topTable works out to coef = 2. We are telling topTable to test the second coefficient (“groupB”):

head(coef(fit))
##           (Intercept)      groupB
## 653635      10.450526  0.07280501
## 729737       7.424122 -1.41271267
## 100132287    7.079451 -1.39220328
## 100131754   14.752900  2.04775296
## 100133331    7.445348 -0.82146092
## 79854        5.643624  0.24920993

In the output from topTable:

topTable shows the top 10 most significant genes by default, the argument n can be used to change that number.

Or, to give more weight to fold-changes in the gene ranking, one might use:

fit <- treat(fit, lfc=log2(1.2))
topTreat(fit, coef=ncol(design))
##            logFC   AveExpr         t      P.Value    adj.P.Val
## 79727  -7.967585  6.885753 -92.45334 4.136977e-08 2.500193e-05
## 10485   7.716040  6.759981  87.65236 5.144712e-08 2.500193e-05
## 51440   7.742719  8.076993  86.85248 5.346975e-08 2.500193e-05
## 3765    7.837163  6.820542  79.50812 7.733537e-08 2.500193e-05
## 127254  6.619733  6.211827  75.74582 9.257224e-08 2.500193e-05
## 477     6.702730 10.097521  71.87338 1.153532e-07 2.500193e-05
## 26052   7.525391  8.067417  72.03942 1.159934e-07 2.500193e-05
## 7143    6.078910  5.941416  67.16788 1.507634e-07 2.500193e-05
## 3208    6.857143  7.426971  67.07065 1.542778e-07 2.500193e-05
## 777     5.920557  6.272886  65.76432 1.639550e-07 2.500193e-05

The function treat() tests if the absolute log fold change is greater than the indicated value (log2(1.2)), rather than 0.

15.5 Differential expression: voom

When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend. In this approach, the voom transformation is applied to the normalized and filtered DGEList object:

v <- voom(dge, design, plot=TRUE)

In the plot, each point represents a gene, with the x-axis showing the average expression in log2 counts per million reads and the y-axis showing the square root of the residual standard deviation of expression in log2 counts per million reads.

The voom transformation uses the experiment design matrix, and produces an EList object.

It is also possible to give a matrix of counts directly to voom without TMM normalization, by

v <- voom(counts, design, plot=TRUE)

If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example:

v <- voom(counts, design, plot=TRUE, normalize="quantile")

After this, the usual limma pipelines for differential expression can be applied, for example:

fit <- lmFit(v, design)
fit <- eBayes(fit)
## Warning: Zero sample variances detected, have been offset away from zero
topTable(fit, coef=ncol(design))
##               logFC  AveExpr             t      P.Value    adj.P.Val
## 100507192 -2.929838 3.365092 -1.626574e+16 1.072348e-34 1.207500e-30
## 727721    -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 7780      -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507337 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507664 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 270       -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 3284      -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 728945    -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 730227    -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 148930     4.122552 2.376486  5.910097e+15 8.962322e-34 2.059773e-30
##                  B
## 100507192 68.59136
## 727721    66.95172
## 7780      66.95172
## 100507337 66.95172
## 100507664 66.95172
## 270       66.95172
## 3284      66.95172
## 728945    66.95172
## 730227    66.95172
## 148930    65.65863

Or, to give more weight to fold-changes in the ranking, one could use say:

fit <- treat(fit, lfc=log2(1.2))
## Warning: Zero sample variances detected, have been offset away from zero
topTreat(fit, coef=ncol(design))
##               logFC  AveExpr             t      P.Value    adj.P.Val
## 100507192 -2.929838 3.365092 -1.480544e+16 1.100817e-34 1.320896e-30
## 727721    -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 7780      -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507337 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507664 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 270       -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 3284      -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 728945    -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 730227    -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 148930     4.122552 2.376486  5.533010e+15 9.081657e-34 2.253205e-30

biomaRt

The biomaRt package provides an R interface to Ensembl databases.

Install biomaRt if not already installed

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("biomaRt", version = "3.8")

Let’s start with part 2 of the biomaRt vignette (accessible from within R by vignette(“biomaRt”)):

2 Selecting a BioMart database and dataset

Every analysis with biomaRt starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function listMarts() will display all available BioMart web services

library("biomaRt")
listMarts()
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 95
## 2   ENSEMBL_MART_MOUSE      Mouse strains 95
## 3     ENSEMBL_MART_SNP  Ensembl Variation 95
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 95

The useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by listMarts(). In the next example we choose to query the Ensembl BioMart database.

ensembl=useMart("ensembl")

BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function listDatasets().

listDatasets(ensembl)
##                            dataset
## 1         acalliptera_gene_ensembl
## 2       acarolinensis_gene_ensembl
## 3        acitrinellus_gene_ensembl
## 4        amelanoleuca_gene_ensembl
## 5          amexicanus_gene_ensembl
## 6          anancymaae_gene_ensembl
## 7          aocellaris_gene_ensembl
## 8            apercula_gene_ensembl
## 9      aplatyrhynchos_gene_ensembl
## 10      apolyacanthus_gene_ensembl
## 11       atestudineus_gene_ensembl
## 12            btaurus_gene_ensembl
## 13            caperea_gene_ensembl
## 14              catys_gene_ensembl
## 15            cbellii_gene_ensembl
## 16         ccapucinus_gene_ensembl
## 17         cchok1gshd_gene_ensembl
## 18            ccrigri_gene_ensembl
## 19             cdingo_gene_ensembl
## 20           celegans_gene_ensembl
## 21        cfamiliaris_gene_ensembl
## 22            chircus_gene_ensembl
## 23         choffmanni_gene_ensembl
## 24      cintestinalis_gene_ensembl
## 25           cjacchus_gene_ensembl
## 26          clanigera_gene_ensembl
## 27         cpalliatus_gene_ensembl
## 28         cporcellus_gene_ensembl
## 29           csabaeus_gene_ensembl
## 30          csavignyi_gene_ensembl
## 31        csemilaevis_gene_ensembl
## 32          csyrichta_gene_ensembl
## 33        cvariegatus_gene_ensembl
## 34      dmelanogaster_gene_ensembl
## 35      dnovemcinctus_gene_ensembl
## 36             dordii_gene_ensembl
## 37             drerio_gene_ensembl
## 38            easinus_gene_ensembl
## 39           eburgeri_gene_ensembl
## 40          ecaballus_gene_ensembl
## 41         eeuropaeus_gene_ensembl
## 42            elucius_gene_ensembl
## 43          etelfairi_gene_ensembl
## 44        falbicollis_gene_ensembl
## 45             fcatus_gene_ensembl
## 46        fdamarensis_gene_ensembl
## 47      fheteroclitus_gene_ensembl
## 48         gaculeatus_gene_ensembl
## 49           gaffinis_gene_ensembl
## 50         gagassizii_gene_ensembl
## 51            ggallus_gene_ensembl
## 52           ggorilla_gene_ensembl
## 53            gmorhua_gene_ensembl
## 54           hburtoni_gene_ensembl
## 55             hcomes_gene_ensembl
## 56            hfemale_gene_ensembl
## 57              hmale_gene_ensembl
## 58           hsapiens_gene_ensembl
## 59         ipunctatus_gene_ensembl
## 60  itridecemlineatus_gene_ensembl
## 61           jjaculus_gene_ensembl
## 62        kmarmoratus_gene_ensembl
## 63          lafricana_gene_ensembl
## 64          lbergylta_gene_ensembl
## 65         lchalumnae_gene_ensembl
## 66          loculatus_gene_ensembl
## 67             malbus_gene_ensembl
## 68           marmatus_gene_ensembl
## 69           mauratus_gene_ensembl
## 70            mcaroli_gene_ensembl
## 71         mdomestica_gene_ensembl
## 72      mfascicularis_gene_ensembl
## 73              mfuro_gene_ensembl
## 74         mgallopavo_gene_ensembl
## 75       mleucophaeus_gene_ensembl
## 76         mlucifugus_gene_ensembl
## 77              mmola_gene_ensembl
## 78           mmulatta_gene_ensembl
## 79           mmurinus_gene_ensembl
## 80          mmusculus_gene_ensembl
## 81        mnemestrina_gene_ensembl
## 82       mochrogaster_gene_ensembl
## 83            mpahari_gene_ensembl
## 84           mspretus_gene_ensembl
## 85             mzebra_gene_ensembl
## 86         nbrichardi_gene_ensembl
## 87           neugenii_gene_ensembl
## 88            ngalili_gene_ensembl
## 89        nleucogenys_gene_ensembl
## 90          oanatinus_gene_ensembl
## 91             oaries_gene_ensembl
## 92         ocuniculus_gene_ensembl
## 93             odegus_gene_ensembl
## 94         ogarnettii_gene_ensembl
## 95               ohni_gene_ensembl
## 96              ohsok_gene_ensembl
## 97           olatipes_gene_ensembl
## 98        omelastigma_gene_ensembl
## 99         oniloticus_gene_ensembl
## 100         oprinceps_gene_ensembl
## 101           pabelii_gene_ensembl
## 102          paltaica_gene_ensembl
## 103           panubis_gene_ensembl
## 104          pbairdii_gene_ensembl
## 105         pcapensis_gene_ensembl
## 106         pcinereus_gene_ensembl
## 107        pcoquereli_gene_ensembl
## 108          pformosa_gene_ensembl
## 109       pkingsleyae_gene_ensembl
## 110        platipinna_gene_ensembl
## 111   pmagnuspinnatus_gene_ensembl
## 112          pmarinus_gene_ensembl
## 113         pmexicana_gene_ensembl
## 114        pnattereri_gene_ensembl
## 115         pnyererei_gene_ensembl
## 116         ppaniscus_gene_ensembl
## 117           ppardus_gene_ensembl
## 118       preticulata_gene_ensembl
## 119         psinensis_gene_ensembl
## 120      ptroglodytes_gene_ensembl
## 121         pvampyrus_gene_ensembl
## 122            rbieti_gene_ensembl
## 123       rnorvegicus_gene_ensembl
## 124        rroxellana_gene_ensembl
## 125          saraneus_gene_ensembl
## 126      sboliviensis_gene_ensembl
## 127       scerevisiae_gene_ensembl
## 128         sdorsalis_gene_ensembl
## 129         sdumerili_gene_ensembl
## 130         sformosus_gene_ensembl
## 131         sharrisii_gene_ensembl
## 132          smaximus_gene_ensembl
## 133         spartitus_gene_ensembl
## 134        spunctatus_gene_ensembl
## 135           sscrofa_gene_ensembl
## 136        tbelangeri_gene_ensembl
## 137          tguttata_gene_ensembl
## 138     tnigroviridis_gene_ensembl
## 139         trubripes_gene_ensembl
## 140        ttruncatus_gene_ensembl
## 141       uamericanus_gene_ensembl
## 142        umaritimus_gene_ensembl
## 143            vpacos_gene_ensembl
## 144           vvulpes_gene_ensembl
## 145       xcouchianus_gene_ensembl
## 146        xmaculatus_gene_ensembl
## 147       xtropicalis_gene_ensembl
##                                                      description
## 1                               Eastern happy genes (fAstCal1.2)
## 2                                 Anole lizard genes (AnoCar2.0)
## 3                                 Midas cichlid genes (Midas_v5)
## 4                                          Panda genes (ailMel1)
## 5                       Cave fish genes (Astyanax_mexicanus-2.0)
## 6                             Ma's night monkey genes (Anan_2.0)
## 7                            Clown anemonefish genes (AmpOce1.0)
## 8                               Orange clownfish genes (Nemo_v1)
## 9                                      Duck genes (BGI_duck_1.0)
## 10                             Spiny chromis genes (ASM210954v1)
## 11                             Climbing perch genes (fAnaTes1.1)
## 12                                        Cow genes (ARS-UCD1.2)
## 13                         Brazilian guinea pig genes (CavAp1.0)
## 14                               Sooty mangabey genes (Caty_1.0)
## 15           Painted turtle genes (Chrysemys_picta_bellii-3.0.3)
## 16                           Capuchin genes (Cebus_imitator-1.0)
## 17                  Chinese hamster CHOK1GS genes (CHOK1GS_HDv1)
## 18                     Chinese hamster CriGri genes (CriGri_1.0)
## 19                                     Dingo genes (ASM325472v1)
## 20                       Caenorhabditis elegans genes (WBcel235)
## 21                                         Dog genes (CanFam3.1)
## 22                                             Goat genes (ARS1)
## 23                                         Sloth genes (choHof1)
## 24                                     C.intestinalis genes (KH)
## 25                                  Marmoset genes (ASM275486v1)
## 26                      Long-tailed chinchilla genes (ChiLan1.0)
## 27                            Angola colobus genes (Cang.pa_1.0)
## 28                                  Guinea Pig genes (Cavpor3.0)
## 29                                  Vervet-AGM genes (ChlSab1.1)
## 30                                   C.savignyi genes (CSAV 2.0)
## 31                                  Tongue sole genes (Cse_v1.0)
## 32                        Tarsier genes (Tarsius_syrichta-2.0.1)
## 33                    Sheepshead minnow genes (C_variegatus-1.0)
## 34                                        Fruitfly genes (BDGP6)
## 35                                   Armadillo genes (Dasnov3.0)
## 36                                 Kangaroo rat genes (Dord_2.0)
## 37                                      Zebrafish genes (GRCz11)
## 38                                    Donkey genes (ASM303372v1)
## 39                                  Hagfish genes (Eburgeri_3.2)
## 40                                       Horse genes (EquCab3.0)
## 41                                      Hedgehog genes (eriEur1)
## 42                                 Northern pike genes (Eluc_V3)
## 43                         Lesser hedgehog tenrec genes (TENREC)
## 44                                 Flycatcher genes (FicAlb_1.4)
## 45                                   Cat genes (Felis_catus_9.0)
## 46                              Damara mole rat genes (DMR_v1.0)
## 47                 Mummichog genes (Fundulus_heteroclitus-3.0.2)
## 48                                  Stickleback genes (BROAD S1)
## 49                      Western mosquitofish genes (ASM309773v1)
## 50                 Agassiz's desert tortoise genes (ASM289641v1)
## 51                                        Chicken genes (GRCg6a)
## 52                                       Gorilla genes (gorGor4)
## 53                                           Cod genes (gadMor1)
## 54                       Burton's mouthbrooder genes (AstBur1.0)
## 55                    Tiger tail seahorse genes (H_comes_QL1_v1)
## 56               Naked mole-rat female genes (HetGla_female_1.0)
## 57                        Naked mole-rat male genes (HetGla_1.0)
## 58                                      Human genes (GRCh38.p12)
## 59                            Channel catfish genes (IpCoco_1.2)
## 60                                    Squirrel genes (SpeTri2.0)
## 61                      Lesser Egyptian jerboa genes (JacJac1.0)
## 62                          Mangrove rivulus genes (ASM164957v1)
## 63                                    Elephant genes (Loxafr3.0)
## 64                              Ballan wrasse genes (BallGen_V1)
## 65                                    Coelacanth genes (LatCha1)
## 66                                   Spotted gar genes (LepOcu1)
## 67                                 Swamp eel genes (M_albus_1.0)
## 68                                Zig-zag eel genes (fMasArm1.1)
## 69                              Golden Hamster genes (MesAur1.0)
## 70                          Ryukyu mouse genes (CAROLI_EIJ_v1.1)
## 71                                       Opossum genes (monDom5)
## 72           Crab-eating macaque genes (Macaca_fascicularis_5.0)
## 73                                   Ferret genes (MusPutFur1.0)
## 74                                    Turkey genes (Turkey_2.01)
## 75                                     Drill genes (Mleu.le_1.0)
## 76                                    Microbat genes (Myoluc2.0)
## 77                             Ocean sunfish genes (ASM169857v1)
## 78                                    Macaque genes (Mmul_8.0.1)
## 79                                  Mouse Lemur genes (Mmur_3.0)
## 80                                       Mouse genes (GRCm38.p6)
## 81                           Pig-tailed macaque genes (Mnem_1.0)
## 82                                Prairie vole genes (MicOch1.0)
## 83                           Shrew mouse genes (PAHARI_EIJ_v1.1)
## 84                           Algerian mouse genes (SPRET_EiJ_v1)
## 85                             Zebra mbuna genes (M_zebra_UMD2a)
## 86                            Lyretail cichlid genes (NeoBri1.0)
## 87                                      Wallaby genes (Meug_1.0)
## 88  Upper Galilee mountains blind mole rat genes (S.galili_v1.0)
## 89                                       Gibbon genes (Nleu_3.0)
## 90                                        Platypus genes (OANA5)
## 91                                        Sheep genes (Oar_v3.1)
## 92                                      Rabbit genes (OryCun2.0)
## 93                                        Degu genes (OctDeg1.0)
## 94                                      Bushbaby genes (OtoGar3)
## 95                       Japanese medaka HNI genes (ASM223471v1)
## 96                      Japanese medaka HSOK genes (ASM223469v1)
## 97                      Japanese medaka HdrR genes (ASM223467v1)
## 98                            Indian medaka genes (Om_v0.7.RACA)
## 99                                     Tilapia genes (Orenil1.0)
## 100                                   Pika genes (OchPri2.0-Ens)
## 101                                      Orangutan genes (PPYG2)
## 102                                      Tiger genes (PanTig1.0)
## 103                                Olive baboon genes (Panu_3.0)
## 104                Northern American deer mouse genes (Pman_1.0)
## 105                                        Hyrax genes (proCap1)
## 106                               Koala genes (phaCin_tgac_v2.0)
## 107                           Coquerel's sifaka genes (Pcoq_1.0)
## 108                  Amazon molly genes (Poecilia_formosa-5.1.2)
## 109                  Paramormyrops kingsleyae genes (PKINGS_0.1)
## 110                        Sailfin molly genes (P_latipinna-1.0)
## 111                  Periophthalmus magnuspinnatus genes (PM.fa)
## 112                                 Lamprey genes (Pmarinus_7.0)
## 113                        Shortfin molly genes (P_mexicana-1.0)
## 114      Red-bellied piranha genes (Pygocentrus_nattereri-1.0.2)
## 115                      Makobe Island cichlid genes (PunNye1.0)
## 116                                     Bonobo genes (panpan1.1)
## 117                                    Leopard genes (PanPar1.0)
## 118                            Guppy genes (Guppy_female_1.0_MT)
## 119                  Chinese softshell turtle genes (PelSin_1.0)
## 120                               Chimpanzee genes (Pan_tro_3.0)
## 121                                      Megabat genes (pteVam1)
## 122                  Black snub-nosed monkey genes (ASM169854v1)
## 123                                         Rat genes (Rnor_6.0)
## 124                     Golden snub-nosed monkey genes (Rrox_v1)
## 125                                        Shrew genes (sorAra1)
## 126                   Bolivian squirrel monkey genes (SaiBol1.0)
## 127                     Saccharomyces cerevisiae genes (R64-1-1)
## 128                          Yellowtail amberjack genes (Sedor1)
## 129                            Greater amberjack genes (Sdu_1.0)
## 130                         Asian bonytongue genes (ASM162426v1)
## 131                       Tasmanian devil genes (Devil_ref v7.0)
## 132                                   Turbot genes (ASM318616v1)
## 133          Bicolor damselfish genes (Stegastes_partitus-1.0.2)
## 134                                  Tuatara genes (ASM311381v1)
## 135                                      Pig genes (Sscrofa11.1)
## 136                                   Tree Shrew genes (tupBel1)
## 137                              Zebra Finch genes (taeGut3.2.4)
## 138                              Tetraodon genes (TETRAODON 8.0)
## 139                                           Fugu genes (FUGU5)
## 140                                      Dolphin genes (turTru1)
## 141                      American black bear genes (ASM334442v1)
## 142                                Polar bear genes (UrsMar_1.0)
## 143                                       Alpaca genes (vicPac1)
## 144                                    Red fox genes (VulVul2.2)
## 145     Monterrey platyfish genes (Xiphophorus_couchianus-4.0.1)
## 146                       Platyfish genes (X_maculatus-5.0-male)
## 147                                      Xenopus genes (JGI 4.2)
##                          version
## 1                     fAstCal1.2
## 2                      AnoCar2.0
## 3                       Midas_v5
## 4                        ailMel1
## 5         Astyanax_mexicanus-2.0
## 6                       Anan_2.0
## 7                      AmpOce1.0
## 8                        Nemo_v1
## 9                   BGI_duck_1.0
## 10                   ASM210954v1
## 11                    fAnaTes1.1
## 12                    ARS-UCD1.2
## 13                      CavAp1.0
## 14                      Caty_1.0
## 15  Chrysemys_picta_bellii-3.0.3
## 16            Cebus_imitator-1.0
## 17                  CHOK1GS_HDv1
## 18                    CriGri_1.0
## 19                   ASM325472v1
## 20                      WBcel235
## 21                     CanFam3.1
## 22                          ARS1
## 23                       choHof1
## 24                            KH
## 25                   ASM275486v1
## 26                     ChiLan1.0
## 27                   Cang.pa_1.0
## 28                     Cavpor3.0
## 29                     ChlSab1.1
## 30                      CSAV 2.0
## 31                      Cse_v1.0
## 32        Tarsius_syrichta-2.0.1
## 33              C_variegatus-1.0
## 34                         BDGP6
## 35                     Dasnov3.0
## 36                      Dord_2.0
## 37                        GRCz11
## 38                   ASM303372v1
## 39                  Eburgeri_3.2
## 40                     EquCab3.0
## 41                       eriEur1
## 42                       Eluc_V3
## 43                        TENREC
## 44                    FicAlb_1.4
## 45               Felis_catus_9.0
## 46                      DMR_v1.0
## 47   Fundulus_heteroclitus-3.0.2
## 48                      BROAD S1
## 49                   ASM309773v1
## 50                   ASM289641v1
## 51                        GRCg6a
## 52                       gorGor4
## 53                       gadMor1
## 54                     AstBur1.0
## 55                H_comes_QL1_v1
## 56             HetGla_female_1.0
## 57                    HetGla_1.0
## 58                    GRCh38.p12
## 59                    IpCoco_1.2
## 60                     SpeTri2.0
## 61                     JacJac1.0
## 62                   ASM164957v1
## 63                     Loxafr3.0
## 64                    BallGen_V1
## 65                       LatCha1
## 66                       LepOcu1
## 67                   M_albus_1.0
## 68                    fMasArm1.1
## 69                     MesAur1.0
## 70               CAROLI_EIJ_v1.1
## 71                       monDom5
## 72       Macaca_fascicularis_5.0
## 73                  MusPutFur1.0
## 74                   Turkey_2.01
## 75                   Mleu.le_1.0
## 76                     Myoluc2.0
## 77                   ASM169857v1
## 78                    Mmul_8.0.1
## 79                      Mmur_3.0
## 80                     GRCm38.p6
## 81                      Mnem_1.0
## 82                     MicOch1.0
## 83               PAHARI_EIJ_v1.1
## 84                  SPRET_EiJ_v1
## 85                 M_zebra_UMD2a
## 86                     NeoBri1.0
## 87                      Meug_1.0
## 88                 S.galili_v1.0
## 89                      Nleu_3.0
## 90                         OANA5
## 91                      Oar_v3.1
## 92                     OryCun2.0
## 93                     OctDeg1.0
## 94                       OtoGar3
## 95                   ASM223471v1
## 96                   ASM223469v1
## 97                   ASM223467v1
## 98                  Om_v0.7.RACA
## 99                     Orenil1.0
## 100                OchPri2.0-Ens
## 101                        PPYG2
## 102                    PanTig1.0
## 103                     Panu_3.0
## 104                     Pman_1.0
## 105                      proCap1
## 106             phaCin_tgac_v2.0
## 107                     Pcoq_1.0
## 108       Poecilia_formosa-5.1.2
## 109                   PKINGS_0.1
## 110              P_latipinna-1.0
## 111                        PM.fa
## 112                 Pmarinus_7.0
## 113               P_mexicana-1.0
## 114  Pygocentrus_nattereri-1.0.2
## 115                    PunNye1.0
## 116                    panpan1.1
## 117                    PanPar1.0
## 118          Guppy_female_1.0_MT
## 119                   PelSin_1.0
## 120                  Pan_tro_3.0
## 121                      pteVam1
## 122                  ASM169854v1
## 123                     Rnor_6.0
## 124                      Rrox_v1
## 125                      sorAra1
## 126                    SaiBol1.0
## 127                      R64-1-1
## 128                       Sedor1
## 129                      Sdu_1.0
## 130                  ASM162426v1
## 131               Devil_ref v7.0
## 132                  ASM318616v1
## 133     Stegastes_partitus-1.0.2
## 134                  ASM311381v1
## 135                  Sscrofa11.1
## 136                      tupBel1
## 137                  taeGut3.2.4
## 138                TETRAODON 8.0
## 139                        FUGU5
## 140                      turTru1
## 141                  ASM334442v1
## 142                   UrsMar_1.0
## 143                      vicPac1
## 144                    VulVul2.2
## 145 Xiphophorus_couchianus-4.0.1
## 146         X_maculatus-5.0-male
## 147                      JGI 4.2

To select a dataset we can update the Mart object using the function useDataset(). In the example below we choose to use the hsapiens dataset.

ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

3 How to build a biomaRt query

The getBM() function has three arguments that need to be introduced: filters, attributes and values. Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value ‘X’. The listFilters() function shows you all available filters in the selected dataset.

filters = listFilters(ensembl)
filters[1:70,]
##                                   name
## 1                      chromosome_name
## 2                                start
## 3                                  end
## 4                           band_start
## 5                             band_end
## 6                         marker_start
## 7                           marker_end
## 8                        encode_region
## 9                               strand
## 10                  chromosomal_region
## 11                           with_ccds
## 12                         with_chembl
## 13       with_clone_based_ensembl_gene
## 14 with_clone_based_ensembl_transcript
## 15                         with_dbass3
## 16                         with_dbass5
## 17          with_entrezgene_trans_name
## 18                           with_embl
## 19                   with_arrayexpress
## 20                         with_genedb
## 21                             with_go
## 22                     with_goslim_goa
## 23                           with_hgnc
## 24                with_hgnc_trans_name
## 25                            with_hpa
## 26                     with_protein_id
## 27                    with_kegg_enzyme
## 28                   with_ens_lrg_gene
## 29                         with_merops
## 30                       with_mim_gene
## 31                     with_mim_morbid
## 32                        with_mirbase
## 33             with_mirbase_trans_name
## 34                     with_entrezgene
## 35                            with_pdb
## 36                  with_reactome_gene
## 37            with_reactome_transcript
## 38                    with_refseq_mrna
## 39          with_refseq_mrna_predicted
## 40                   with_refseq_ncrna
## 41         with_refseq_ncrna_predicted
## 42                 with_refseq_peptide
## 43       with_refseq_peptide_predicted
## 44                           with_rfam
## 45                with_rfam_trans_name
## 46                     with_rnacentral
## 47                           with_ucsc
## 48                        with_uniparc
## 49                     with_uniprot_gn
## 50               with_uniprotswissprot
## 51                with_uniprotsptrembl
## 52                       with_wikigene
## 53                     ensembl_gene_id
## 54             ensembl_gene_id_version
## 55               ensembl_transcript_id
## 56       ensembl_transcript_id_version
## 57                  ensembl_peptide_id
## 58          ensembl_peptide_id_version
## 59                     ensembl_exon_id
## 60                  external_gene_name
## 61            external_transcript_name
## 62                                ccds
## 63                              chembl
## 64            clone_based_ensembl_gene
## 65      clone_based_ensembl_transcript
## 66                         dbass3_name
## 67                           dbass3_id
## 68                         dbass5_name
## 69                           dbass5_id
## 70               entrezgene_trans_name
##                                                      description
## 1                                       Chromosome/scaffold name
## 2                                                          Start
## 3                                                            End
## 4                                                     Band Start
## 5                                                       Band End
## 6                                                   Marker Start
## 7                                                     Marker End
## 8                                                  Encode region
## 9                                                         Strand
## 10                        e.g. 1:100:10000:-1, 1:100000:200000:1
## 11                                               With CCDS ID(s)
## 12                                             With ChEMBL ID(s)
## 13                         With Clone-based (Ensembl) gene ID(s)
## 14                   With Clone-based (Ensembl) transcript ID(s)
## 15               With DataBase of Aberrant 3' Splice Sites ID(s)
## 16               With DataBase of Aberrant 5' Splice Sites ID(s)
## 17                         With EntrezGene transcript name ID(s)
## 18                        With European Nucleotide Archive ID(s)
## 19                                   With Expression Atlas ID(s)
## 20                                             With GeneDB ID(s)
## 21                                                 With GO ID(s)
## 22                                         With GOSlim GOA ID(s)
## 23                                        With HGNC Symbol ID(s)
## 24                               With HGNC transcript name ID(s)
## 25                                With Human Protein Atlas ID(s)
## 26                                   With INSDC protein ID ID(s)
## 27                            With KEGG Pathway and Enzyme ID(s)
## 28                        With LRG display in Ensembl gene ID(s)
## 29                    With MEROPS - the Peptidase Database ID(s)
## 30                                           With MIM gene ID(s)
## 31                                         With MIM morbid ID(s)
## 32                                            With miRBase ID(s)
## 33                            With miRBase transcript name ID(s)
## 34                                          With NCBI gene ID(s)
## 35                                                With PDB ID(s)
## 36                                      With Reactome gene ID(s)
## 37                                With Reactome transcript ID(s)
## 38                                        With RefSeq mRNA ID(s)
## 39                              With RefSeq mRNA predicted ID(s)
## 40                                       With RefSeq ncRNA ID(s)
## 41                             With RefSeq ncRNA predicted ID(s)
## 42                                     With RefSeq peptide ID(s)
## 43                           With RefSeq peptide predicted ID(s)
## 44                                               With RFAM ID(s)
## 45                               With RFAM transcript name ID(s)
## 46                                         With RNAcentral ID(s)
## 47                                     With UCSC Stable ID ID(s)
## 48                                            With UniParc ID(s)
## 49                                With UniProtKB Gene Name ID(s)
## 50                               With UniProtKB/Swiss-Prot ID(s)
## 51                                   With UniProtKB/TrEMBL ID(s)
## 52                                           With WikiGene ID(s)
## 53                      Gene stable ID(s) [e.g. ENSG00000000003]
## 54      Gene stable ID(s) with version [e.g. ENSG00000000003.14]
## 55                Transcript stable ID(s) [e.g. ENST00000000233]
## 56 Transcript stable ID(s) with version [e.g. ENST00000000233.9]
## 57                   Protein stable ID(s) [e.g. ENSP00000000233]
## 58    Protein stable ID(s) with version [e.g. ENSP00000000233.5]
## 59                             Exon ID(s) [e.g. ENSE00000327880]
## 60                                   Gene Name(s) [e.g. MIR4723]
## 61                         Transcript Name(s) [e.g. MIR4723-201]
## 62                                      CCDS ID(s) [e.g. CCDS10]
## 63                             ChEMBL ID(s) [e.g. CHEMBL1075092]
## 64            Clone-based (Ensembl) gene ID(s) [e.g. AB015752.1]
## 65  Clone-based (Ensembl) transcript ID(s) [e.g. AB015752.1-201]
## 66     DataBase of Aberrant 3' Splice Sites name(s) [e.g. ABCC8]
## 67           DataBase of Aberrant 3' Splice Sites ID(s) [e.g. 1]
## 68     DataBase of Aberrant 5' Splice Sites name(s) [e.g. ABCD1]
## 69           DataBase of Aberrant 5' Splice Sites ID(s) [e.g. 1]
## 70              EntrezGene transcript name ID(s) [e.g. AA06-201]

Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The listAttributes() function displays all available attributes in the selected dataset.

attributes = listAttributes(ensembl)
attributes[1:20,]
##                             name
## 1                ensembl_gene_id
## 2        ensembl_gene_id_version
## 3          ensembl_transcript_id
## 4  ensembl_transcript_id_version
## 5             ensembl_peptide_id
## 6     ensembl_peptide_id_version
## 7                ensembl_exon_id
## 8                    description
## 9                chromosome_name
## 10                start_position
## 11                  end_position
## 12                        strand
## 13                          band
## 14              transcript_start
## 15                transcript_end
## 16      transcription_start_site
## 17             transcript_length
## 18                transcript_tsl
## 19      transcript_gencode_basic
## 20             transcript_appris
##                                   description         page
## 1                              Gene stable ID feature_page
## 2                      Gene stable ID version feature_page
## 3                        Transcript stable ID feature_page
## 4                Transcript stable ID version feature_page
## 5                           Protein stable ID feature_page
## 6                   Protein stable ID version feature_page
## 7                              Exon stable ID feature_page
## 8                            Gene description feature_page
## 9                    Chromosome/scaffold name feature_page
## 10                            Gene start (bp) feature_page
## 11                              Gene end (bp) feature_page
## 12                                     Strand feature_page
## 13                             Karyotype band feature_page
## 14                      Transcript start (bp) feature_page
## 15                        Transcript end (bp) feature_page
## 16             Transcription start site (TSS) feature_page
## 17 Transcript length (including UTRs and CDS) feature_page
## 18             Transcript support level (TSL) feature_page
## 19                   GENCODE basic annotation feature_page
## 20                          APPRIS annotation feature_page

The getBM() function is the main query function in biomaRt. It has four main arguments:

- attributes: is a vector of attributes that one wants to retrieve (= the output of the query).

- filters: is a vector of filters that one wil use as input to the query.

- values: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below).

- mart: is an object of class Mart, which is created by the useMart() function.

Note: for some frequently used queries to Ensembl, wrapper functions are available: getGene() and getSequence(). These functions call the getBM() function with hard coded filter and attribute names.

Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a biomaRt query. Let’s make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.

The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the listAttributes() and listFilters() function respectively. Let’s now run the query:

affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), 
      filters = 'affy_hg_u133_plus_2', 
      values = affyids, 
      mart = ensembl)
##   affy_hg_u133_plus_2 entrezgene
## 1           202763_at        836
## 2         209310_s_at        837
## 3           207500_at        838

Putting it all Together

Let’s use biomaRt to get annotation for our output from limma:

v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
tab <- topTable(fit, coef=ncol(design), n = 20)
head(tab)
##               logFC  AveExpr         t      P.Value    adj.P.Val        B
## 100131754  2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752       2.826613 12.90987  96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883      2.671140 12.45734  79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023      -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154      4.160514 11.35699  60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682       3.023933 11.74011  56.60478 3.588091e-17 9.687847e-15 29.92931

The identifiers (row names) in our top table are Entrez Gene identifiers, so let’s add a column with that information

tab$entrezgene <- rownames(tab)
head(tab)
##               logFC  AveExpr         t      P.Value    adj.P.Val        B
## 100131754  2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752       2.826613 12.90987  96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883      2.671140 12.45734  79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023      -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154      4.160514 11.35699  60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682       3.023933 11.74011  56.60478 3.588091e-17 9.687847e-15 29.92931
##           entrezgene
## 100131754  100131754
## 2752            2752
## 22883          22883
## 2023            2023
## 23154          23154
## 8682            8682

Let’s get the gene symbol and gene description matching those Entrez Gene identifiers:

anno <- getBM(ensembl, attributes = c("entrezgene", "hgnc_symbol", "description"), filters = "entrezgene", values = tab$entrezgene)
anno
##    entrezgene hgnc_symbol
## 1       10472      ZBTB18
## 2        1465       CSRP1
## 3        2023        ENO1
## 4       22883      CLSTN1
## 5       23095       KIF1B
## 6       23154        NCDN
## 7       23499       MACF1
## 8        2752        GLUL
## 9      284654       RSPO1
## 10        477      ATP1A2
## 11        481      ATP1B1
## 12       4904        YBX1
## 13      55450     CAMK2N1
## 14       5567      PRKACB
## 15       6135       RPL11
## 16       6202        RPS8
## 17       8682       PEA15
## 18       9659     PDE4DIP
## 19       9900        SV2A
##                                                                                       description
## 1                    zinc finger and BTB domain containing 18 [Source:HGNC Symbol;Acc:HGNC:13030]
## 2                          cysteine and glycine rich protein 1 [Source:HGNC Symbol;Acc:HGNC:2469]
## 3                                                    enolase 1 [Source:HGNC Symbol;Acc:HGNC:3350]
## 4                                               calsyntenin 1 [Source:HGNC Symbol;Acc:HGNC:17447]
## 5                                    kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
## 6                                               neurochondrin [Source:HGNC Symbol;Acc:HGNC:17597]
## 7                     microtubule-actin crosslinking factor 1 [Source:HGNC Symbol;Acc:HGNC:13664]
## 8                                     glutamate-ammonia ligase [Source:HGNC Symbol;Acc:HGNC:4341]
## 9                                                 R-spondin 1 [Source:HGNC Symbol;Acc:HGNC:21679]
## 10                   ATPase Na+/K+ transporting subunit alpha 2 [Source:HGNC Symbol;Acc:HGNC:800]
## 11                    ATPase Na+/K+ transporting subunit beta 1 [Source:HGNC Symbol;Acc:HGNC:804]
## 12                                     Y-box binding protein 1 [Source:HGNC Symbol;Acc:HGNC:8014]
## 13 calcium/calmodulin dependent protein kinase II inhibitor 1 [Source:HGNC Symbol;Acc:HGNC:24190]
## 14        protein kinase cAMP-activated catalytic subunit beta [Source:HGNC Symbol;Acc:HGNC:9381]
## 15                                      ribosomal protein L11 [Source:HGNC Symbol;Acc:HGNC:10301]
## 16                                       ribosomal protein S8 [Source:HGNC Symbol;Acc:HGNC:10441]
## 17              proliferation and apoptosis adaptor protein 15 [Source:HGNC Symbol;Acc:HGNC:8822]
## 18                   phosphodiesterase 4D interacting protein [Source:HGNC Symbol;Acc:HGNC:15580]
## 19                           synaptic vesicle glycoprotein 2A [Source:HGNC Symbol;Acc:HGNC:20566]