Note: Throughout, text taken directly from the vignettes will be shown in georgia font and code/output from the vignettes will be shown with a blue background. Everything else is my commentary/additions.
Limma is an R package for differential expression testing of RNASeq and microarray data. The limma User’s Guide is an extensive, 100+ page summary of limma’s many capabilities. We will focus only on Chapter 15, “RNA-seq Data”.
Install limma and edgeR if you have not already done so:
# source("https://bioconductor.org/biocLite.R")
# biocLite(c("limma", "edgeR"))
Load edgeR (which loads limma as a dependency)
library(edgeR)
## Loading required package: limma
## Warning: package 'limma' was built under R version 3.5.1
Read in the counts table used in the limma examples from the course github page:
counts <- read.delim('https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-September-Bioinformatics-Prerequisites/master/friday/counts.tsv', row.names = 1)
The counts table has one column per sample and one row per gene. It shows the number of reads aligning to each gene in each sample.
dim(counts)
## [1] 25702 4
head(counts)
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 100422834 1 0 0 0
## 645520 5 3 0 0
## 79501 0 0 0 0
## 729737 70 64 24 21
## 100507658 10 7 5 4
We also need to define a model matrix since this is not done in the vignette:
group <- rep(c("A", "B"), each = 2)
design <- model.matrix(~group)
Now for the actual limma vignette (starting in 15.3):
15.3 Normalization and Filtering
Once a matrix of read counts has been created, with rows for genes and columns for samples, it is convenient to create a DGEList object using the edgeR package.
dge <- DGEList(counts=counts)
dge
## An object of class "DGEList"
## $counts
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 100422834 1 0 0 0
## 645520 5 3 0 0
## 79501 0 0 0 0
## 729737 70 64 24 21
## 25697 more rows ...
##
## $samples
## group lib.size norm.factors
## A_1.bam 1 383672 1
## A_2.bam 1 323786 1
## B_1.bam 1 464801 1
## B_2.bam 1 451689 1
A DGEList object is a container for counts, normalization factors, and library sizes.
The next step is to remove rows that consistently have zero or very low counts. One can for example use
keep <- filterByExpr(dge, design)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge
## An object of class "DGEList"
## $counts
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 729737 70 64 24 21
## 100132287 63 43 20 15
## 100131754 11839 10685 45338 44547
## 100133331 77 60 33 39
## 1615 more rows ...
##
## $samples
## group lib.size norm.factors
## A_1.bam 1 382288 1
## A_2.bam 1 322674 1
## B_1.bam 1 463301 1
## B_2.bam 1 450284 1
Our data have been filtered from 25,702 genes to 1620 genes. Note that filtering is not an exact science, and is done different ways even within the limma User’s Guide (see e.g. section 18.1.7).
Here we will assume that filtering has been done.
It is usual to apply scale normalization to RNA-seq read counts, and the TMM normalization method in particular has been found to perform well in comparative studies. This can be applied to the DGEList object:
dge <- calcNormFactors(dge)
The norm.factors column in dge$samples has now been assigned numbers other than the default value of 1:
dge
## An object of class "DGEList"
## $counts
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 729737 70 64 24 21
## 100132287 63 43 20 15
## 100131754 11839 10685 45338 44547
## 100133331 77 60 33 39
## 1615 more rows ...
##
## $samples
## group lib.size norm.factors
## A_1.bam 1 382288 1.1629573
## A_2.bam 1 322674 1.1573440
## B_1.bam 1 463301 0.8572287
## B_2.bam 1 450284 0.8667165
15.4 Differential expression: limma-trend
If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential expression is to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold.
In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function:
logCPM <- cpm(dge, log=TRUE, prior.count=3)
prior.count is the constant that is added to all counts before log transformation in order to avoid taking the log of 0. Its default value is 0.25.
The prior count is used here to damp down the variances of logarithms of low counts.
The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val B
## 477 6.702730 10.097521 104.64572 1.289558e-13 1.163076e-10 19.42430
## 2752 2.822725 12.911626 103.21992 1.435896e-13 1.163076e-10 19.38206
## 23154 4.149472 11.364604 92.95400 3.262592e-13 1.761800e-10 19.03409
## 22883 2.668013 12.459653 88.78688 4.673095e-13 1.892604e-10 18.86725
## 23208 6.191143 10.078057 85.22099 6.442451e-13 2.087354e-10 18.71049
## 481 3.428246 11.584274 78.71541 1.199934e-12 3.239821e-10 18.38600
## 2023 -2.280158 13.629827 -70.07808 2.981364e-12 5.614906e-10 17.86117
## 100131754 2.047753 15.776777 70.02656 2.998580e-12 5.614906e-10 17.85766
## 10472 4.402382 10.024830 69.39116 3.220388e-12 5.614906e-10 17.81397
## 9783 5.220951 9.713354 68.74277 3.465991e-12 5.614906e-10 17.76860
Using trend=TRUE means that limma “squeezes” the genewise variances towards a trend line (rather than the default global mean).
coef = ncol(design) in the call to topTable works out to coef = 2. We are telling topTable to test the second coefficient (“groupB”):
head(coef(fit))
## (Intercept) groupB
## 653635 10.450526 0.07280501
## 729737 7.424122 -1.41271267
## 100132287 7.079451 -1.39220328
## 100131754 14.752900 2.04775296
## 100133331 7.445348 -0.82146092
## 79854 5.643624 0.24920993
In the output from topTable:
topTable shows the top 10 most significant genes by default, the argument n can be used to change that number.
Or, to give more weight to fold-changes in the gene ranking, one might use:
fit <- treat(fit, lfc=log2(1.2))
topTreat(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val
## 79727 -7.967585 6.885753 -92.45334 4.136977e-08 2.500193e-05
## 10485 7.716040 6.759981 87.65236 5.144712e-08 2.500193e-05
## 51440 7.742719 8.076993 86.85248 5.346975e-08 2.500193e-05
## 3765 7.837163 6.820542 79.50812 7.733537e-08 2.500193e-05
## 127254 6.619733 6.211827 75.74582 9.257224e-08 2.500193e-05
## 477 6.702730 10.097521 71.87338 1.153532e-07 2.500193e-05
## 26052 7.525391 8.067417 72.03942 1.159934e-07 2.500193e-05
## 7143 6.078910 5.941416 67.16788 1.507634e-07 2.500193e-05
## 3208 6.857143 7.426971 67.07065 1.542778e-07 2.500193e-05
## 777 5.920557 6.272886 65.76432 1.639550e-07 2.500193e-05
The function treat() tests if the absolute log fold change is greater than the indicated value (log2(1.2)), rather than 0.
15.5 Differential expression: voom
When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend. In this approach, the voom transformation is applied to the normalized and filtered DGEList object:
v <- voom(dge, design, plot=TRUE)
In the plot, each point represents a gene, with the x-axis showing the average expression in log2 counts per million reads and the y-axis showing the square root of the residual standard deviation of expression in log2 counts per million reads.
The voom transformation uses the experiment design matrix, and produces an EList object.
It is also possible to give a matrix of counts directly to voom without TMM normalization, by
v <- voom(counts, design, plot=TRUE)
If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example:
v <- voom(counts, design, plot=TRUE, normalize="quantile")
After this, the usual limma pipelines for differential expression can be applied, for example:
fit <- lmFit(v, design)
fit <- eBayes(fit)
## Warning: Zero sample variances detected, have been offset away from zero
topTable(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val
## 100507192 -2.929838 3.365092 -1.626574e+16 1.072348e-34 1.207500e-30
## 727721 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 7780 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507337 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507664 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 270 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 3284 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 728945 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 730227 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 148930 4.122552 2.376486 5.910097e+15 8.962322e-34 2.059773e-30
## B
## 100507192 68.59136
## 727721 66.95172
## 7780 66.95172
## 100507337 66.95172
## 100507664 66.95172
## 270 66.95172
## 3284 66.95172
## 728945 66.95172
## 730227 66.95172
## 148930 65.65863
Or, to give more weight to fold-changes in the ranking, one could use say:
fit <- treat(fit, lfc=log2(1.2))
## Warning: Zero sample variances detected, have been offset away from zero
topTreat(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val
## 100507192 -2.929838 3.365092 -1.480544e+16 1.100817e-34 1.320896e-30
## 727721 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 7780 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507337 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507664 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 270 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 3284 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 728945 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 730227 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 148930 4.122552 2.376486 5.533010e+15 9.081657e-34 2.253205e-30
The biomaRt package provides an R interface to Ensembl databases.
Install biomaRt if not already installed
# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
Let’s start with part 2 of the biomaRt vignette (accessible from within R by vignette(“biomaRt”)):
2 Selecting a BioMart database and dataset
Every analysis with biomaRt starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function listMarts() will display all available BioMart web services
library("biomaRt")
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 93
## 2 ENSEMBL_MART_MOUSE Mouse strains 93
## 3 ENSEMBL_MART_SNP Ensembl Variation 93
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 93
The useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by listMarts(). In the next example we choose to query the Ensembl BioMart database.
ensembl=useMart("ensembl")
BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function listDatasets().
listDatasets(ensembl)
## dataset
## 1 acarolinensis_gene_ensembl
## 2 amelanoleuca_gene_ensembl
## 3 amexicanus_gene_ensembl
## 4 anancymaae_gene_ensembl
## 5 aplatyrhynchos_gene_ensembl
## 6 btaurus_gene_ensembl
## 7 caperea_gene_ensembl
## 8 catys_gene_ensembl
## 9 ccapucinus_gene_ensembl
## 10 cchok1gshd_gene_ensembl
## 11 ccrigri_gene_ensembl
## 12 celegans_gene_ensembl
## 13 cfamiliaris_gene_ensembl
## 14 chircus_gene_ensembl
## 15 choffmanni_gene_ensembl
## 16 cintestinalis_gene_ensembl
## 17 cjacchus_gene_ensembl
## 18 clanigera_gene_ensembl
## 19 cpalliatus_gene_ensembl
## 20 cporcellus_gene_ensembl
## 21 csabaeus_gene_ensembl
## 22 csavignyi_gene_ensembl
## 23 csyrichta_gene_ensembl
## 24 dmelanogaster_gene_ensembl
## 25 dnovemcinctus_gene_ensembl
## 26 dordii_gene_ensembl
## 27 drerio_gene_ensembl
## 28 eburgeri_gene_ensembl
## 29 ecaballus_gene_ensembl
## 30 eeuropaeus_gene_ensembl
## 31 etelfairi_gene_ensembl
## 32 falbicollis_gene_ensembl
## 33 fcatus_gene_ensembl
## 34 fdamarensis_gene_ensembl
## 35 gaculeatus_gene_ensembl
## 36 ggallus_gene_ensembl
## 37 ggorilla_gene_ensembl
## 38 gmorhua_gene_ensembl
## 39 hfemale_gene_ensembl
## 40 hmale_gene_ensembl
## 41 hsapiens_gene_ensembl
## 42 itridecemlineatus_gene_ensembl
## 43 jjaculus_gene_ensembl
## 44 lafricana_gene_ensembl
## 45 lchalumnae_gene_ensembl
## 46 loculatus_gene_ensembl
## 47 mauratus_gene_ensembl
## 48 mcaroli_gene_ensembl
## 49 mdomestica_gene_ensembl
## 50 mfascicularis_gene_ensembl
## 51 mfuro_gene_ensembl
## 52 mgallopavo_gene_ensembl
## 53 mleucophaeus_gene_ensembl
## 54 mlucifugus_gene_ensembl
## 55 mmulatta_gene_ensembl
## 56 mmurinus_gene_ensembl
## 57 mmusculus_gene_ensembl
## 58 mnemestrina_gene_ensembl
## 59 mochrogaster_gene_ensembl
## 60 mpahari_gene_ensembl
## 61 mspretus_gene_ensembl
## 62 neugenii_gene_ensembl
## 63 ngalili_gene_ensembl
## 64 nleucogenys_gene_ensembl
## 65 oanatinus_gene_ensembl
## 66 oaries_gene_ensembl
## 67 ocuniculus_gene_ensembl
## 68 odegus_gene_ensembl
## 69 ogarnettii_gene_ensembl
## 70 olatipes_gene_ensembl
## 71 oniloticus_gene_ensembl
## 72 oprinceps_gene_ensembl
## 73 pabelii_gene_ensembl
## 74 paltaica_gene_ensembl
## 75 panubis_gene_ensembl
## 76 pbairdii_gene_ensembl
## 77 pcapensis_gene_ensembl
## 78 pcoquereli_gene_ensembl
## 79 pformosa_gene_ensembl
## 80 pmarinus_gene_ensembl
## 81 ppaniscus_gene_ensembl
## 82 ppardus_gene_ensembl
## 83 psinensis_gene_ensembl
## 84 ptroglodytes_gene_ensembl
## 85 pvampyrus_gene_ensembl
## 86 rbieti_gene_ensembl
## 87 rnorvegicus_gene_ensembl
## 88 rroxellana_gene_ensembl
## 89 saraneus_gene_ensembl
## 90 sboliviensis_gene_ensembl
## 91 scerevisiae_gene_ensembl
## 92 sharrisii_gene_ensembl
## 93 sscrofa_gene_ensembl
## 94 tbelangeri_gene_ensembl
## 95 tguttata_gene_ensembl
## 96 tnigroviridis_gene_ensembl
## 97 trubripes_gene_ensembl
## 98 ttruncatus_gene_ensembl
## 99 vpacos_gene_ensembl
## 100 xmaculatus_gene_ensembl
## 101 xtropicalis_gene_ensembl
## description
## 1 Anole lizard genes (AnoCar2.0)
## 2 Panda genes (ailMel1)
## 3 Cave fish genes (AstMex102)
## 4 Ma's night monkey genes (Anan_2.0)
## 5 Duck genes (BGI_duck_1.0)
## 6 Cow genes (UMD3.1)
## 7 Brazilian guinea pig genes (CavAp1.0)
## 8 Sooty mangabey genes (Caty_1.0)
## 9 Capuchin genes (Cebus_imitator-1.0)
## 10 Chinese hamster CHOK1GS genes (CHOK1GS_HDv1)
## 11 Chinese hamster CriGri genes (CriGri_1.0)
## 12 Caenorhabditis elegans genes (WBcel235)
## 13 Dog genes (CanFam3.1)
## 14 Goat genes (ARS1)
## 15 Sloth genes (choHof1)
## 16 C.intestinalis genes (KH)
## 17 Marmoset genes (ASM275486v1)
## 18 Long-tailed chinchilla genes (ChiLan1.0)
## 19 Angola colobus genes (Cang.pa_1.0)
## 20 Guinea Pig genes (Cavpor3.0)
## 21 Vervet-AGM genes (ChlSab1.1)
## 22 C.savignyi genes (CSAV 2.0)
## 23 Tarsier genes (Tarsius_syrichta-2.0.1)
## 24 Fruitfly genes (BDGP6)
## 25 Armadillo genes (Dasnov3.0)
## 26 Kangaroo rat genes (Dord_2.0)
## 27 Zebrafish genes (GRCz11)
## 28 Inshore hagfish genes (Eburgeri_3.2)
## 29 Horse genes (Equ Cab 2)
## 30 Hedgehog genes (eriEur1)
## 31 Lesser hedgehog tenrec genes (TENREC)
## 32 Flycatcher genes (FicAlb_1.4)
## 33 Cat genes (Felis_catus_9.0)
## 34 Damara mole rat genes (DMR_v1.0)
## 35 Stickleback genes (BROAD S1)
## 36 Chicken genes (Gallus_gallus-5.0)
## 37 Gorilla genes (gorGor4)
## 38 Cod genes (gadMor1)
## 39 Naked mole-rat female genes (HetGla_female_1.0)
## 40 Naked mole-rat male genes (HetGla_1.0)
## 41 Human genes (GRCh38.p12)
## 42 Squirrel genes (SpeTri2.0)
## 43 Lesser Egyptian jerboa genes (JacJac1.0)
## 44 Elephant genes (Loxafr3.0)
## 45 Coelacanth genes (LatCha1)
## 46 Spotted gar genes (LepOcu1)
## 47 Golden Hamster genes (MesAur1.0)
## 48 Ryukyu mouse genes (CAROLI_EIJ_v1.1)
## 49 Opossum genes (monDom5)
## 50 Crab-eating macaque genes (Macaca_fascicularis_5.0)
## 51 Ferret genes (MusPutFur1.0)
## 52 Turkey genes (Turkey_2.01)
## 53 Drill genes (Mleu.le_1.0)
## 54 Microbat genes (Myoluc2.0)
## 55 Macaque genes (Mmul_8.0.1)
## 56 Mouse Lemur genes (Mmur_3.0)
## 57 Mouse genes (GRCm38.p6)
## 58 Pig-tailed macaque genes (Mnem_1.0)
## 59 Prairie vole genes (MicOch1.0)
## 60 Shrew mouse genes (PAHARI_EIJ_v1.1)
## 61 Algerian mouse genes (SPRET_EiJ_v1)
## 62 Wallaby genes (Meug_1.0)
## 63 Upper Galilee mountains blind mole rat genes (S.galili_v1.0)
## 64 Gibbon genes (Nleu_3.0)
## 65 Platypus genes (OANA5)
## 66 Sheep genes (Oar_v3.1)
## 67 Rabbit genes (OryCun2.0)
## 68 Degu genes (OctDeg1.0)
## 69 Bushbaby genes (OtoGar3)
## 70 Medaka genes (HdrR)
## 71 Tilapia genes (Orenil1.0)
## 72 Pika genes (OchPri2.0-Ens)
## 73 Orangutan genes (PPYG2)
## 74 Amur tiger genes (PanTig1.0)
## 75 Olive baboon genes (Panu_3.0)
## 76 Northern American deer mouse genes (Pman_1.0)
## 77 Hyrax genes (proCap1)
## 78 Coquerel's sifaka genes (Pcoq_1.0)
## 79 Amazon molly genes (Poecilia_formosa-5.1.2)
## 80 Lamprey genes (Pmarinus_7.0)
## 81 Bonobo genes (panpan1.1)
## 82 Leopard genes (PanPar1.0)
## 83 Chinese softshell turtle genes (PelSin_1.0)
## 84 Chimpanzee genes (Pan_tro_3.0)
## 85 Megabat genes (pteVam1)
## 86 Black snub-nosed monkey genes (ASM169854v1)
## 87 Rat genes (Rnor_6.0)
## 88 Golden snub-nosed monkey genes (Rrox_v1)
## 89 Shrew genes (sorAra1)
## 90 Bolivian squirrel monkey genes (SaiBol1.0)
## 91 Saccharomyces cerevisiae genes (R64-1-1)
## 92 Tasmanian devil genes (Devil_ref v7.0)
## 93 Pig genes (Sscrofa11.1)
## 94 Tree Shrew genes (tupBel1)
## 95 Zebra Finch genes (taeGut3.2.4)
## 96 Tetraodon genes (TETRAODON 8.0)
## 97 Fugu genes (FUGU 4.0)
## 98 Dolphin genes (turTru1)
## 99 Alpaca genes (vicPac1)
## 100 Platyfish genes (Xipmac4.4.2)
## 101 Xenopus genes (JGI 4.2)
## version
## 1 AnoCar2.0
## 2 ailMel1
## 3 AstMex102
## 4 Anan_2.0
## 5 BGI_duck_1.0
## 6 UMD3.1
## 7 CavAp1.0
## 8 Caty_1.0
## 9 Cebus_imitator-1.0
## 10 CHOK1GS_HDv1
## 11 CriGri_1.0
## 12 WBcel235
## 13 CanFam3.1
## 14 ARS1
## 15 choHof1
## 16 KH
## 17 ASM275486v1
## 18 ChiLan1.0
## 19 Cang.pa_1.0
## 20 Cavpor3.0
## 21 ChlSab1.1
## 22 CSAV 2.0
## 23 Tarsius_syrichta-2.0.1
## 24 BDGP6
## 25 Dasnov3.0
## 26 Dord_2.0
## 27 GRCz11
## 28 Eburgeri_3.2
## 29 Equ Cab 2
## 30 eriEur1
## 31 TENREC
## 32 FicAlb_1.4
## 33 Felis_catus_9.0
## 34 DMR_v1.0
## 35 BROAD S1
## 36 Gallus_gallus-5.0
## 37 gorGor4
## 38 gadMor1
## 39 HetGla_female_1.0
## 40 HetGla_1.0
## 41 GRCh38.p12
## 42 SpeTri2.0
## 43 JacJac1.0
## 44 Loxafr3.0
## 45 LatCha1
## 46 LepOcu1
## 47 MesAur1.0
## 48 CAROLI_EIJ_v1.1
## 49 monDom5
## 50 Macaca_fascicularis_5.0
## 51 MusPutFur1.0
## 52 Turkey_2.01
## 53 Mleu.le_1.0
## 54 Myoluc2.0
## 55 Mmul_8.0.1
## 56 Mmur_3.0
## 57 GRCm38.p6
## 58 Mnem_1.0
## 59 MicOch1.0
## 60 PAHARI_EIJ_v1.1
## 61 SPRET_EiJ_v1
## 62 Meug_1.0
## 63 S.galili_v1.0
## 64 Nleu_3.0
## 65 OANA5
## 66 Oar_v3.1
## 67 OryCun2.0
## 68 OctDeg1.0
## 69 OtoGar3
## 70 HdrR
## 71 Orenil1.0
## 72 OchPri2.0-Ens
## 73 PPYG2
## 74 PanTig1.0
## 75 Panu_3.0
## 76 Pman_1.0
## 77 proCap1
## 78 Pcoq_1.0
## 79 Poecilia_formosa-5.1.2
## 80 Pmarinus_7.0
## 81 panpan1.1
## 82 PanPar1.0
## 83 PelSin_1.0
## 84 Pan_tro_3.0
## 85 pteVam1
## 86 ASM169854v1
## 87 Rnor_6.0
## 88 Rrox_v1
## 89 sorAra1
## 90 SaiBol1.0
## 91 R64-1-1
## 92 Devil_ref v7.0
## 93 Sscrofa11.1
## 94 tupBel1
## 95 taeGut3.2.4
## 96 TETRAODON 8.0
## 97 FUGU 4.0
## 98 turTru1
## 99 vicPac1
## 100 Xipmac4.4.2
## 101 JGI 4.2
To select a dataset we can update the Mart object using the function useDataset(). In the example below we choose to use the hsapiens dataset.
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
3 How to build a biomaRt query
The getBM() function has three arguments that need to be introduced: filters, attributes and values. Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value ‘X’. The listFilters() function shows you all available filters in the selected dataset.
filters = listFilters(ensembl)
filters[1:70,]
## name
## 1 chromosome_name
## 2 start
## 3 end
## 4 band_start
## 5 band_end
## 6 marker_start
## 7 marker_end
## 8 encode_region
## 9 strand
## 10 chromosomal_region
## 11 with_ccds
## 12 with_chembl
## 13 with_clone_based_ensembl_gene
## 14 with_clone_based_ensembl_transcript
## 15 with_dbass3
## 16 with_dbass5
## 17 with_ens_hs_transcript
## 18 with_ens_hs_translation
## 19 with_entrezgene_trans_name
## 20 with_embl
## 21 with_arrayexpress
## 22 with_genedb
## 23 with_go
## 24 with_goslim_goa
## 25 with_ottp
## 26 with_hgnc
## 27 with_hgnc_trans_name
## 28 with_hpa
## 29 with_protein_id
## 30 with_kegg_enzyme
## 31 with_ens_lrg_gene
## 32 with_ens_lrg_transcript
## 33 with_merops
## 34 with_mim_gene
## 35 with_mim_morbid
## 36 with_mirbase
## 37 with_mirbase_trans_name
## 38 with_entrezgene
## 39 with_pdb
## 40 with_reactome
## 41 with_reactome_gene
## 42 with_reactome_transcript
## 43 with_refseq_mrna
## 44 with_refseq_mrna_predicted
## 45 with_refseq_ncrna
## 46 with_refseq_ncrna_predicted
## 47 with_refseq_peptide
## 48 with_refseq_peptide_predicted
## 49 with_rfam
## 50 with_rfam_trans_name
## 51 with_rnacentral
## 52 with_ucsc
## 53 with_uniparc
## 54 with_uniprot_gn
## 55 with_uniprotswissprot
## 56 with_uniprotsptrembl
## 57 with_vega_translation
## 58 with_wikigene
## 59 ensembl_gene_id
## 60 ensembl_gene_id_version
## 61 ensembl_transcript_id
## 62 ensembl_transcript_id_version
## 63 ensembl_peptide_id
## 64 ensembl_peptide_id_version
## 65 ensembl_exon_id
## 66 external_gene_name
## 67 external_transcript_name
## 68 ccds
## 69 chembl
## 70 clone_based_ensembl_gene
## description
## 1 Chromosome/scaffold name
## 2 Start
## 3 End
## 4 Band Start
## 5 Band End
## 6 Marker Start
## 7 Marker End
## 8 Encode region
## 9 Strand
## 10 e.g. 1:100:10000:-1, 1:100000:200000:1
## 11 With CCDS ID(s)
## 12 With ChEMBL ID(s)
## 13 With Clone-based (Ensembl) gene ID(s)
## 14 With Clone-based (Ensembl) transcript ID(s)
## 15 With DataBase of Aberrant 3' Splice Sites ID(s)
## 16 With DataBase of Aberrant 5' Splice Sites ID(s)
## 17 With Ensembl Human Transcript ID(s)
## 18 With Ensembl Human Translation ID(s)
## 19 With EntrezGene transcript name ID(s)
## 20 With European Nucleotide Archive ID(s)
## 21 With Expression Atlas ID(s)
## 22 With GeneDB ID(s)
## 23 With GO ID(s)
## 24 With GOSlim GOA ID(s)
## 25 With Havana translation ID(s)
## 26 With HGNC Symbol ID(s)
## 27 With HGNC transcript name ID(s)
## 28 With Human Protein Atlas ID(s)
## 29 With INSDC protein ID ID(s)
## 30 With KEGG Pathway and Enzyme ID(s)
## 31 With LRG display in Ensembl gene ID(s)
## 32 With LRG display in Ensembl transcript ID(s)
## 33 With MEROPS - the Peptidase Database ID(s)
## 34 With MIM gene ID(s)
## 35 With MIM morbid ID(s)
## 36 With miRBase ID(s)
## 37 With miRBase transcript name ID(s)
## 38 With NCBI gene ID(s)
## 39 With PDB ID(s)
## 40 With Reactome ID(s)
## 41 With Reactome gene ID(s)
## 42 With Reactome transcript ID(s)
## 43 With RefSeq mRNA ID(s)
## 44 With RefSeq mRNA predicted ID(s)
## 45 With RefSeq ncRNA ID(s)
## 46 With RefSeq ncRNA predicted ID(s)
## 47 With RefSeq peptide ID(s)
## 48 With RefSeq peptide predicted ID(s)
## 49 With RFAM ID(s)
## 50 With RFAM transcript name ID(s)
## 51 With RNAcentral ID(s)
## 52 With UCSC Stable ID ID(s)
## 53 With UniParc ID(s)
## 54 With UniProtKB Gene Name ID(s)
## 55 With UniProtKB/Swiss-Prot ID(s)
## 56 With UniProtKB/TrEMBL ID(s)
## 57 With Vega translation ID(s)
## 58 With WikiGene ID(s)
## 59 Gene stable ID(s) [e.g. ENSG00000000003]
## 60 Gene stable ID(s) with version [e.g. ENSG00000000003.14]
## 61 Transcript stable ID(s) [e.g. ENST00000000233]
## 62 Transcript stable ID(s) with version [e.g. ENST00000000233.9]
## 63 Protein stable ID(s) [e.g. ENSP00000000233]
## 64 Protein stable ID(s) with version [e.g. ENSP00000000233.5]
## 65 Exon ID(s) [e.g. ENSE00000327880]
## 66 Gene Name(s) [e.g. RF00100]
## 67 Transcript Name(s) [e.g. RF00100.12-201]
## 68 CCDS ID(s) [e.g. CCDS10]
## 69 ChEMBL ID(s) [e.g. CHEMBL1075092]
## 70 Clone-based (Ensembl) gene ID(s) [e.g. AB015752.1]
Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The listAttributes() function displays all available attributes in the selected dataset.
attributes = listAttributes(ensembl)
attributes[1:20,]
## name
## 1 ensembl_gene_id
## 2 ensembl_gene_id_version
## 3 ensembl_transcript_id
## 4 ensembl_transcript_id_version
## 5 ensembl_peptide_id
## 6 ensembl_peptide_id_version
## 7 ensembl_exon_id
## 8 description
## 9 chromosome_name
## 10 start_position
## 11 end_position
## 12 strand
## 13 band
## 14 transcript_start
## 15 transcript_end
## 16 transcription_start_site
## 17 transcript_length
## 18 transcript_tsl
## 19 transcript_gencode_basic
## 20 transcript_appris
## description page
## 1 Gene stable ID feature_page
## 2 Gene stable ID version feature_page
## 3 Transcript stable ID feature_page
## 4 Transcript stable ID version feature_page
## 5 Protein stable ID feature_page
## 6 Protein stable ID version feature_page
## 7 Exon stable ID feature_page
## 8 Gene description feature_page
## 9 Chromosome/scaffold name feature_page
## 10 Gene start (bp) feature_page
## 11 Gene end (bp) feature_page
## 12 Strand feature_page
## 13 Karyotype band feature_page
## 14 Transcript start (bp) feature_page
## 15 Transcript end (bp) feature_page
## 16 Transcription start site (TSS) feature_page
## 17 Transcript length (including UTRs and CDS) feature_page
## 18 Transcript support level (TSL) feature_page
## 19 GENCODE basic annotation feature_page
## 20 APPRIS annotation feature_page
The getBM() function is the main query function in biomaRt. It has four main arguments:
- attributes: is a vector of attributes that one wants to retrieve (= the output of the query).
- filters: is a vector of filters that one wil use as input to the query.
- values: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below).
- mart: is an object of class Mart, which is created by the useMart() function.
Note: for some frequently used queries to Ensembl, wrapper functions are available: getGene() and getSequence(). These functions call the getBM() function with hard coded filter and attribute names.
Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a biomaRt query. Let’s make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.
The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the listAttributes() and listFilters() function respectively. Let’s now run the query:
affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'),
filters = 'affy_hg_u133_plus_2',
values = affyids,
mart = ensembl)
## affy_hg_u133_plus_2 entrezgene
## 1 209310_s_at 837
## 2 207500_at 838
## 3 202763_at 836
Let’s use biomaRt to get annotation for our output from limma:
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
tab <- topTable(fit, coef=ncol(design), n = 20)
head(tab)
## logFC AveExpr t P.Value adj.P.Val B
## 100131754 2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752 2.826613 12.90987 96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883 2.671140 12.45734 79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023 -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154 4.160514 11.35699 60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682 3.023933 11.74011 56.60478 3.588091e-17 9.687847e-15 29.92931
The identifiers (row names) in our top table are Entrez Gene identifiers, so let’s add a column with that information
tab$entrezgene <- rownames(tab)
head(tab)
## logFC AveExpr t P.Value adj.P.Val B
## 100131754 2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752 2.826613 12.90987 96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883 2.671140 12.45734 79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023 -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154 4.160514 11.35699 60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682 3.023933 11.74011 56.60478 3.588091e-17 9.687847e-15 29.92931
## entrezgene
## 100131754 100131754
## 2752 2752
## 22883 22883
## 2023 2023
## 23154 23154
## 8682 8682
Let’s get the gene symbol and gene description matching those Entrez Gene identifiers:
anno <- getBM(ensembl, attributes = c("entrezgene", "hgnc_symbol", "description"), filters = "entrezgene", values = tab$entrezgene)
anno
## entrezgene hgnc_symbol
## 1 10472 ZBTB18
## 2 1465 CSRP1
## 3 2023 ENO1
## 4 22883 CLSTN1
## 5 23095 KIF1B
## 6 23154 NCDN
## 7 23499 MACF1
## 8 2752 GLUL
## 9 284654 RSPO1
## 10 477 ATP1A2
## 11 481 ATP1B1
## 12 4904 YBX1
## 13 55450 CAMK2N1
## 14 5567 PRKACB
## 15 6135 RPL11
## 16 6202 RPS8
## 17 8682 PEA15
## 18 9659 PDE4DIP
## 19 9900 SV2A
## description
## 1 zinc finger and BTB domain containing 18 [Source:HGNC Symbol;Acc:HGNC:13030]
## 2 cysteine and glycine rich protein 1 [Source:HGNC Symbol;Acc:HGNC:2469]
## 3 enolase 1 [Source:HGNC Symbol;Acc:HGNC:3350]
## 4 calsyntenin 1 [Source:HGNC Symbol;Acc:HGNC:17447]
## 5 kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
## 6 neurochondrin [Source:HGNC Symbol;Acc:HGNC:17597]
## 7 microtubule-actin crosslinking factor 1 [Source:HGNC Symbol;Acc:HGNC:13664]
## 8 glutamate-ammonia ligase [Source:HGNC Symbol;Acc:HGNC:4341]
## 9 R-spondin 1 [Source:HGNC Symbol;Acc:HGNC:21679]
## 10 ATPase Na+/K+ transporting subunit alpha 2 [Source:HGNC Symbol;Acc:HGNC:800]
## 11 ATPase Na+/K+ transporting subunit beta 1 [Source:HGNC Symbol;Acc:HGNC:804]
## 12 Y-box binding protein 1 [Source:HGNC Symbol;Acc:HGNC:8014]
## 13 calcium/calmodulin dependent protein kinase II inhibitor 1 [Source:HGNC Symbol;Acc:HGNC:24190]
## 14 protein kinase cAMP-activated catalytic subunit beta [Source:HGNC Symbol;Acc:HGNC:9381]
## 15 ribosomal protein L11 [Source:HGNC Symbol;Acc:HGNC:10301]
## 16 ribosomal protein S8 [Source:HGNC Symbol;Acc:HGNC:10441]
## 17 proliferation and apoptosis adaptor protein 15 [Source:HGNC Symbol;Acc:HGNC:8822]
## 18 phosphodiesterase 4D interacting protein [Source:HGNC Symbol;Acc:HGNC:15580]
## 19 synaptic vesicle glycoprotein 2A [Source:HGNC Symbol;Acc:HGNC:20566]