Note: Throughout, text taken directly from the vignettes will be shown in georgia font and code/output from the vignettes will be shown with a blue background. Everything else is my commentary/additions.
Limma is an R package for differential expression testing of RNASeq and microarray data. The limma User’s Guide is an extensive, 100+ page summary of limma’s many capabilities. We will focus only on Chapter 15, “RNA-seq Data”.
Install edgeR if you have not already done so (installs limma as a dependency):
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("edgeR", version = "3.8")
Load edgeR (which loads limma as a dependency)
library(edgeR)
## Loading required package: limma
Read in a counts table:
counts <- read.delim('https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-September-Bioinformatics-Prerequisites/master/friday/counts.tsv', row.names = 1)
The counts table has one column per sample and one row per gene. It shows the number of reads aligning to each gene in each sample.
dim(counts)
## [1] 25702 4
head(counts)
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 100422834 1 0 0 0
## 645520 5 3 0 0
## 79501 0 0 0 0
## 729737 70 64 24 21
## 100507658 10 7 5 4
We also need to define a model matrix since this is not done in the vignette:
group <- rep(c("A", "B"), each = 2)
design <- model.matrix(~group)
Now for the actual limma vignette (starting in 15.3):
15.3 Normalization and Filtering
Once a matrix of read counts has been created, with rows for genes and columns for samples, it is convenient to create a DGEList object using the edgeR package.
dge <- DGEList(counts=counts)
dge
## An object of class "DGEList"
## $counts
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 100422834 1 0 0 0
## 645520 5 3 0 0
## 79501 0 0 0 0
## 729737 70 64 24 21
## 25697 more rows ...
##
## $samples
## group lib.size norm.factors
## A_1.bam 1 383672 1
## A_2.bam 1 323786 1
## B_1.bam 1 464801 1
## B_2.bam 1 451689 1
A DGEList object is a container for counts, normalization factors, and library sizes.
The next step is to remove rows that consistently have zero or very low counts. One can for example use
keep <- filterByExpr(dge, design)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge
## An object of class "DGEList"
## $counts
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 729737 70 64 24 21
## 100132287 63 43 20 15
## 100131754 11839 10685 45338 44547
## 100133331 77 60 33 39
## 1615 more rows ...
##
## $samples
## group lib.size norm.factors
## A_1.bam 1 382288 1
## A_2.bam 1 322674 1
## B_1.bam 1 463301 1
## B_2.bam 1 450284 1
Our data have been filtered from 25,702 genes to 1620 genes. Note that filtering is not an exact science, and is done different ways even within the limma User’s Guide (see e.g. section 18.1.7).
Here we will assume that filtering has been done.
It is usual to apply scale normalization to RNA-seq read counts, and the TMM normalization method in particular has been found to perform well in comparative studies. This can be applied to the DGEList object:
dge <- calcNormFactors(dge)
The norm.factors column in dge$samples has now been assigned numbers other than the default value of 1:
dge
## An object of class "DGEList"
## $counts
## A_1.bam A_2.bam B_1.bam B_2.bam
## 653635 627 513 579 574
## 729737 70 64 24 21
## 100132287 63 43 20 15
## 100131754 11839 10685 45338 44547
## 100133331 77 60 33 39
## 1615 more rows ...
##
## $samples
## group lib.size norm.factors
## A_1.bam 1 382288 1.1629573
## A_2.bam 1 322674 1.1573440
## B_1.bam 1 463301 0.8572287
## B_2.bam 1 450284 0.8667165
15.4 Differential expression: limma-trend
If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential expression is to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold.
In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function:
logCPM <- cpm(dge, log=TRUE, prior.count=3)
prior.count is the constant that is added to all counts before log transformation in order to avoid taking the log of 0. Its default value is 0.25.
The prior count is used here to damp down the variances of logarithms of low counts.
The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:
fit <- lmFit(logCPM, design)
fit <- eBayes(fit, trend=TRUE)
topTable(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val B
## 477 6.702730 10.097521 104.64572 1.289558e-13 1.163076e-10 19.42430
## 2752 2.822725 12.911626 103.21992 1.435896e-13 1.163076e-10 19.38206
## 23154 4.149472 11.364604 92.95400 3.262592e-13 1.761800e-10 19.03409
## 22883 2.668013 12.459653 88.78688 4.673095e-13 1.892604e-10 18.86725
## 23208 6.191143 10.078057 85.22099 6.442451e-13 2.087354e-10 18.71049
## 481 3.428246 11.584274 78.71541 1.199934e-12 3.239821e-10 18.38600
## 2023 -2.280158 13.629827 -70.07808 2.981364e-12 5.614906e-10 17.86117
## 100131754 2.047753 15.776777 70.02656 2.998580e-12 5.614906e-10 17.85766
## 10472 4.402382 10.024830 69.39116 3.220388e-12 5.614906e-10 17.81397
## 9783 5.220951 9.713354 68.74277 3.465991e-12 5.614906e-10 17.76860
Using trend=TRUE means that limma “squeezes” the genewise variances towards a trend line (rather than the default global mean).
coef = ncol(design) in the call to topTable works out to coef = 2. We are telling topTable to test the second coefficient (“groupB”):
head(coef(fit))
## (Intercept) groupB
## 653635 10.450526 0.07280501
## 729737 7.424122 -1.41271267
## 100132287 7.079451 -1.39220328
## 100131754 14.752900 2.04775296
## 100133331 7.445348 -0.82146092
## 79854 5.643624 0.24920993
In the output from topTable:
topTable shows the top 10 most significant genes by default, the argument n can be used to change that number.
Or, to give more weight to fold-changes in the gene ranking, one might use:
fit <- treat(fit, lfc=log2(1.2))
topTreat(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val
## 79727 -7.967585 6.885753 -92.45334 4.136977e-08 2.500193e-05
## 10485 7.716040 6.759981 87.65236 5.144712e-08 2.500193e-05
## 51440 7.742719 8.076993 86.85248 5.346975e-08 2.500193e-05
## 3765 7.837163 6.820542 79.50812 7.733537e-08 2.500193e-05
## 127254 6.619733 6.211827 75.74582 9.257224e-08 2.500193e-05
## 477 6.702730 10.097521 71.87338 1.153532e-07 2.500193e-05
## 26052 7.525391 8.067417 72.03942 1.159934e-07 2.500193e-05
## 7143 6.078910 5.941416 67.16788 1.507634e-07 2.500193e-05
## 3208 6.857143 7.426971 67.07065 1.542778e-07 2.500193e-05
## 777 5.920557 6.272886 65.76432 1.639550e-07 2.500193e-05
The function treat() tests if the absolute log fold change is greater than the indicated value (log2(1.2)), rather than 0.
15.5 Differential expression: voom
When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend. In this approach, the voom transformation is applied to the normalized and filtered DGEList object:
v <- voom(dge, design, plot=TRUE)
In the plot, each point represents a gene, with the x-axis showing the average expression in log2 counts per million reads and the y-axis showing the square root of the residual standard deviation of expression in log2 counts per million reads.
The voom transformation uses the experiment design matrix, and produces an EList object.
It is also possible to give a matrix of counts directly to voom without TMM normalization, by
v <- voom(counts, design, plot=TRUE)
If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example:
v <- voom(counts, design, plot=TRUE, normalize="quantile")
After this, the usual limma pipelines for differential expression can be applied, for example:
fit <- lmFit(v, design)
fit <- eBayes(fit)
## Warning: Zero sample variances detected, have been offset away from zero
topTable(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val
## 100507192 -2.929838 3.365092 -1.626574e+16 1.072348e-34 1.207500e-30
## 727721 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 7780 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507337 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 100507664 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 270 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 3284 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 728945 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 730227 -1.584963 1.107691 -8.455993e+15 4.228272e-34 1.207500e-30
## 148930 4.122552 2.376486 5.910097e+15 8.962322e-34 2.059773e-30
## B
## 100507192 68.59136
## 727721 66.95172
## 7780 66.95172
## 100507337 66.95172
## 100507664 66.95172
## 270 66.95172
## 3284 66.95172
## 728945 66.95172
## 730227 66.95172
## 148930 65.65863
Or, to give more weight to fold-changes in the ranking, one could use say:
fit <- treat(fit, lfc=log2(1.2))
## Warning: Zero sample variances detected, have been offset away from zero
topTreat(fit, coef=ncol(design))
## logFC AveExpr t P.Value adj.P.Val
## 100507192 -2.929838 3.365092 -1.480544e+16 1.100817e-34 1.320896e-30
## 727721 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 7780 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507337 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 100507664 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 270 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 3284 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 728945 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 730227 -1.584963 1.107691 -7.052668e+15 4.625346e-34 1.320896e-30
## 148930 4.122552 2.376486 5.533010e+15 9.081657e-34 2.253205e-30
The biomaRt package provides an R interface to Ensembl databases.
Install biomaRt if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("biomaRt", version = "3.8")
Let’s start with part 2 of the biomaRt vignette (accessible from within R by vignette(“biomaRt”)):
2 Selecting a BioMart database and dataset
Every analysis with biomaRt starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function listMarts() will display all available BioMart web services
library("biomaRt")
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 95
## 2 ENSEMBL_MART_MOUSE Mouse strains 95
## 3 ENSEMBL_MART_SNP Ensembl Variation 95
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 95
The useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by listMarts(). In the next example we choose to query the Ensembl BioMart database.
ensembl=useMart("ensembl")
BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function listDatasets().
listDatasets(ensembl)
## dataset
## 1 acalliptera_gene_ensembl
## 2 acarolinensis_gene_ensembl
## 3 acitrinellus_gene_ensembl
## 4 amelanoleuca_gene_ensembl
## 5 amexicanus_gene_ensembl
## 6 anancymaae_gene_ensembl
## 7 aocellaris_gene_ensembl
## 8 apercula_gene_ensembl
## 9 aplatyrhynchos_gene_ensembl
## 10 apolyacanthus_gene_ensembl
## 11 atestudineus_gene_ensembl
## 12 btaurus_gene_ensembl
## 13 caperea_gene_ensembl
## 14 catys_gene_ensembl
## 15 cbellii_gene_ensembl
## 16 ccapucinus_gene_ensembl
## 17 cchok1gshd_gene_ensembl
## 18 ccrigri_gene_ensembl
## 19 cdingo_gene_ensembl
## 20 celegans_gene_ensembl
## 21 cfamiliaris_gene_ensembl
## 22 chircus_gene_ensembl
## 23 choffmanni_gene_ensembl
## 24 cintestinalis_gene_ensembl
## 25 cjacchus_gene_ensembl
## 26 clanigera_gene_ensembl
## 27 cpalliatus_gene_ensembl
## 28 cporcellus_gene_ensembl
## 29 csabaeus_gene_ensembl
## 30 csavignyi_gene_ensembl
## 31 csemilaevis_gene_ensembl
## 32 csyrichta_gene_ensembl
## 33 cvariegatus_gene_ensembl
## 34 dmelanogaster_gene_ensembl
## 35 dnovemcinctus_gene_ensembl
## 36 dordii_gene_ensembl
## 37 drerio_gene_ensembl
## 38 easinus_gene_ensembl
## 39 eburgeri_gene_ensembl
## 40 ecaballus_gene_ensembl
## 41 eeuropaeus_gene_ensembl
## 42 elucius_gene_ensembl
## 43 etelfairi_gene_ensembl
## 44 falbicollis_gene_ensembl
## 45 fcatus_gene_ensembl
## 46 fdamarensis_gene_ensembl
## 47 fheteroclitus_gene_ensembl
## 48 gaculeatus_gene_ensembl
## 49 gaffinis_gene_ensembl
## 50 gagassizii_gene_ensembl
## 51 ggallus_gene_ensembl
## 52 ggorilla_gene_ensembl
## 53 gmorhua_gene_ensembl
## 54 hburtoni_gene_ensembl
## 55 hcomes_gene_ensembl
## 56 hfemale_gene_ensembl
## 57 hmale_gene_ensembl
## 58 hsapiens_gene_ensembl
## 59 ipunctatus_gene_ensembl
## 60 itridecemlineatus_gene_ensembl
## 61 jjaculus_gene_ensembl
## 62 kmarmoratus_gene_ensembl
## 63 lafricana_gene_ensembl
## 64 lbergylta_gene_ensembl
## 65 lchalumnae_gene_ensembl
## 66 loculatus_gene_ensembl
## 67 malbus_gene_ensembl
## 68 marmatus_gene_ensembl
## 69 mauratus_gene_ensembl
## 70 mcaroli_gene_ensembl
## 71 mdomestica_gene_ensembl
## 72 mfascicularis_gene_ensembl
## 73 mfuro_gene_ensembl
## 74 mgallopavo_gene_ensembl
## 75 mleucophaeus_gene_ensembl
## 76 mlucifugus_gene_ensembl
## 77 mmola_gene_ensembl
## 78 mmulatta_gene_ensembl
## 79 mmurinus_gene_ensembl
## 80 mmusculus_gene_ensembl
## 81 mnemestrina_gene_ensembl
## 82 mochrogaster_gene_ensembl
## 83 mpahari_gene_ensembl
## 84 mspretus_gene_ensembl
## 85 mzebra_gene_ensembl
## 86 nbrichardi_gene_ensembl
## 87 neugenii_gene_ensembl
## 88 ngalili_gene_ensembl
## 89 nleucogenys_gene_ensembl
## 90 oanatinus_gene_ensembl
## 91 oaries_gene_ensembl
## 92 ocuniculus_gene_ensembl
## 93 odegus_gene_ensembl
## 94 ogarnettii_gene_ensembl
## 95 ohni_gene_ensembl
## 96 ohsok_gene_ensembl
## 97 olatipes_gene_ensembl
## 98 omelastigma_gene_ensembl
## 99 oniloticus_gene_ensembl
## 100 oprinceps_gene_ensembl
## 101 pabelii_gene_ensembl
## 102 paltaica_gene_ensembl
## 103 panubis_gene_ensembl
## 104 pbairdii_gene_ensembl
## 105 pcapensis_gene_ensembl
## 106 pcinereus_gene_ensembl
## 107 pcoquereli_gene_ensembl
## 108 pformosa_gene_ensembl
## 109 pkingsleyae_gene_ensembl
## 110 platipinna_gene_ensembl
## 111 pmagnuspinnatus_gene_ensembl
## 112 pmarinus_gene_ensembl
## 113 pmexicana_gene_ensembl
## 114 pnattereri_gene_ensembl
## 115 pnyererei_gene_ensembl
## 116 ppaniscus_gene_ensembl
## 117 ppardus_gene_ensembl
## 118 preticulata_gene_ensembl
## 119 psinensis_gene_ensembl
## 120 ptroglodytes_gene_ensembl
## 121 pvampyrus_gene_ensembl
## 122 rbieti_gene_ensembl
## 123 rnorvegicus_gene_ensembl
## 124 rroxellana_gene_ensembl
## 125 saraneus_gene_ensembl
## 126 sboliviensis_gene_ensembl
## 127 scerevisiae_gene_ensembl
## 128 sdorsalis_gene_ensembl
## 129 sdumerili_gene_ensembl
## 130 sformosus_gene_ensembl
## 131 sharrisii_gene_ensembl
## 132 smaximus_gene_ensembl
## 133 spartitus_gene_ensembl
## 134 spunctatus_gene_ensembl
## 135 sscrofa_gene_ensembl
## 136 tbelangeri_gene_ensembl
## 137 tguttata_gene_ensembl
## 138 tnigroviridis_gene_ensembl
## 139 trubripes_gene_ensembl
## 140 ttruncatus_gene_ensembl
## 141 uamericanus_gene_ensembl
## 142 umaritimus_gene_ensembl
## 143 vpacos_gene_ensembl
## 144 vvulpes_gene_ensembl
## 145 xcouchianus_gene_ensembl
## 146 xmaculatus_gene_ensembl
## 147 xtropicalis_gene_ensembl
## description
## 1 Eastern happy genes (fAstCal1.2)
## 2 Anole lizard genes (AnoCar2.0)
## 3 Midas cichlid genes (Midas_v5)
## 4 Panda genes (ailMel1)
## 5 Cave fish genes (Astyanax_mexicanus-2.0)
## 6 Ma's night monkey genes (Anan_2.0)
## 7 Clown anemonefish genes (AmpOce1.0)
## 8 Orange clownfish genes (Nemo_v1)
## 9 Duck genes (BGI_duck_1.0)
## 10 Spiny chromis genes (ASM210954v1)
## 11 Climbing perch genes (fAnaTes1.1)
## 12 Cow genes (ARS-UCD1.2)
## 13 Brazilian guinea pig genes (CavAp1.0)
## 14 Sooty mangabey genes (Caty_1.0)
## 15 Painted turtle genes (Chrysemys_picta_bellii-3.0.3)
## 16 Capuchin genes (Cebus_imitator-1.0)
## 17 Chinese hamster CHOK1GS genes (CHOK1GS_HDv1)
## 18 Chinese hamster CriGri genes (CriGri_1.0)
## 19 Dingo genes (ASM325472v1)
## 20 Caenorhabditis elegans genes (WBcel235)
## 21 Dog genes (CanFam3.1)
## 22 Goat genes (ARS1)
## 23 Sloth genes (choHof1)
## 24 C.intestinalis genes (KH)
## 25 Marmoset genes (ASM275486v1)
## 26 Long-tailed chinchilla genes (ChiLan1.0)
## 27 Angola colobus genes (Cang.pa_1.0)
## 28 Guinea Pig genes (Cavpor3.0)
## 29 Vervet-AGM genes (ChlSab1.1)
## 30 C.savignyi genes (CSAV 2.0)
## 31 Tongue sole genes (Cse_v1.0)
## 32 Tarsier genes (Tarsius_syrichta-2.0.1)
## 33 Sheepshead minnow genes (C_variegatus-1.0)
## 34 Fruitfly genes (BDGP6)
## 35 Armadillo genes (Dasnov3.0)
## 36 Kangaroo rat genes (Dord_2.0)
## 37 Zebrafish genes (GRCz11)
## 38 Donkey genes (ASM303372v1)
## 39 Hagfish genes (Eburgeri_3.2)
## 40 Horse genes (EquCab3.0)
## 41 Hedgehog genes (eriEur1)
## 42 Northern pike genes (Eluc_V3)
## 43 Lesser hedgehog tenrec genes (TENREC)
## 44 Flycatcher genes (FicAlb_1.4)
## 45 Cat genes (Felis_catus_9.0)
## 46 Damara mole rat genes (DMR_v1.0)
## 47 Mummichog genes (Fundulus_heteroclitus-3.0.2)
## 48 Stickleback genes (BROAD S1)
## 49 Western mosquitofish genes (ASM309773v1)
## 50 Agassiz's desert tortoise genes (ASM289641v1)
## 51 Chicken genes (GRCg6a)
## 52 Gorilla genes (gorGor4)
## 53 Cod genes (gadMor1)
## 54 Burton's mouthbrooder genes (AstBur1.0)
## 55 Tiger tail seahorse genes (H_comes_QL1_v1)
## 56 Naked mole-rat female genes (HetGla_female_1.0)
## 57 Naked mole-rat male genes (HetGla_1.0)
## 58 Human genes (GRCh38.p12)
## 59 Channel catfish genes (IpCoco_1.2)
## 60 Squirrel genes (SpeTri2.0)
## 61 Lesser Egyptian jerboa genes (JacJac1.0)
## 62 Mangrove rivulus genes (ASM164957v1)
## 63 Elephant genes (Loxafr3.0)
## 64 Ballan wrasse genes (BallGen_V1)
## 65 Coelacanth genes (LatCha1)
## 66 Spotted gar genes (LepOcu1)
## 67 Swamp eel genes (M_albus_1.0)
## 68 Zig-zag eel genes (fMasArm1.1)
## 69 Golden Hamster genes (MesAur1.0)
## 70 Ryukyu mouse genes (CAROLI_EIJ_v1.1)
## 71 Opossum genes (monDom5)
## 72 Crab-eating macaque genes (Macaca_fascicularis_5.0)
## 73 Ferret genes (MusPutFur1.0)
## 74 Turkey genes (Turkey_2.01)
## 75 Drill genes (Mleu.le_1.0)
## 76 Microbat genes (Myoluc2.0)
## 77 Ocean sunfish genes (ASM169857v1)
## 78 Macaque genes (Mmul_8.0.1)
## 79 Mouse Lemur genes (Mmur_3.0)
## 80 Mouse genes (GRCm38.p6)
## 81 Pig-tailed macaque genes (Mnem_1.0)
## 82 Prairie vole genes (MicOch1.0)
## 83 Shrew mouse genes (PAHARI_EIJ_v1.1)
## 84 Algerian mouse genes (SPRET_EiJ_v1)
## 85 Zebra mbuna genes (M_zebra_UMD2a)
## 86 Lyretail cichlid genes (NeoBri1.0)
## 87 Wallaby genes (Meug_1.0)
## 88 Upper Galilee mountains blind mole rat genes (S.galili_v1.0)
## 89 Gibbon genes (Nleu_3.0)
## 90 Platypus genes (OANA5)
## 91 Sheep genes (Oar_v3.1)
## 92 Rabbit genes (OryCun2.0)
## 93 Degu genes (OctDeg1.0)
## 94 Bushbaby genes (OtoGar3)
## 95 Japanese medaka HNI genes (ASM223471v1)
## 96 Japanese medaka HSOK genes (ASM223469v1)
## 97 Japanese medaka HdrR genes (ASM223467v1)
## 98 Indian medaka genes (Om_v0.7.RACA)
## 99 Tilapia genes (Orenil1.0)
## 100 Pika genes (OchPri2.0-Ens)
## 101 Orangutan genes (PPYG2)
## 102 Tiger genes (PanTig1.0)
## 103 Olive baboon genes (Panu_3.0)
## 104 Northern American deer mouse genes (Pman_1.0)
## 105 Hyrax genes (proCap1)
## 106 Koala genes (phaCin_tgac_v2.0)
## 107 Coquerel's sifaka genes (Pcoq_1.0)
## 108 Amazon molly genes (Poecilia_formosa-5.1.2)
## 109 Paramormyrops kingsleyae genes (PKINGS_0.1)
## 110 Sailfin molly genes (P_latipinna-1.0)
## 111 Periophthalmus magnuspinnatus genes (PM.fa)
## 112 Lamprey genes (Pmarinus_7.0)
## 113 Shortfin molly genes (P_mexicana-1.0)
## 114 Red-bellied piranha genes (Pygocentrus_nattereri-1.0.2)
## 115 Makobe Island cichlid genes (PunNye1.0)
## 116 Bonobo genes (panpan1.1)
## 117 Leopard genes (PanPar1.0)
## 118 Guppy genes (Guppy_female_1.0_MT)
## 119 Chinese softshell turtle genes (PelSin_1.0)
## 120 Chimpanzee genes (Pan_tro_3.0)
## 121 Megabat genes (pteVam1)
## 122 Black snub-nosed monkey genes (ASM169854v1)
## 123 Rat genes (Rnor_6.0)
## 124 Golden snub-nosed monkey genes (Rrox_v1)
## 125 Shrew genes (sorAra1)
## 126 Bolivian squirrel monkey genes (SaiBol1.0)
## 127 Saccharomyces cerevisiae genes (R64-1-1)
## 128 Yellowtail amberjack genes (Sedor1)
## 129 Greater amberjack genes (Sdu_1.0)
## 130 Asian bonytongue genes (ASM162426v1)
## 131 Tasmanian devil genes (Devil_ref v7.0)
## 132 Turbot genes (ASM318616v1)
## 133 Bicolor damselfish genes (Stegastes_partitus-1.0.2)
## 134 Tuatara genes (ASM311381v1)
## 135 Pig genes (Sscrofa11.1)
## 136 Tree Shrew genes (tupBel1)
## 137 Zebra Finch genes (taeGut3.2.4)
## 138 Tetraodon genes (TETRAODON 8.0)
## 139 Fugu genes (FUGU5)
## 140 Dolphin genes (turTru1)
## 141 American black bear genes (ASM334442v1)
## 142 Polar bear genes (UrsMar_1.0)
## 143 Alpaca genes (vicPac1)
## 144 Red fox genes (VulVul2.2)
## 145 Monterrey platyfish genes (Xiphophorus_couchianus-4.0.1)
## 146 Platyfish genes (X_maculatus-5.0-male)
## 147 Xenopus genes (JGI 4.2)
## version
## 1 fAstCal1.2
## 2 AnoCar2.0
## 3 Midas_v5
## 4 ailMel1
## 5 Astyanax_mexicanus-2.0
## 6 Anan_2.0
## 7 AmpOce1.0
## 8 Nemo_v1
## 9 BGI_duck_1.0
## 10 ASM210954v1
## 11 fAnaTes1.1
## 12 ARS-UCD1.2
## 13 CavAp1.0
## 14 Caty_1.0
## 15 Chrysemys_picta_bellii-3.0.3
## 16 Cebus_imitator-1.0
## 17 CHOK1GS_HDv1
## 18 CriGri_1.0
## 19 ASM325472v1
## 20 WBcel235
## 21 CanFam3.1
## 22 ARS1
## 23 choHof1
## 24 KH
## 25 ASM275486v1
## 26 ChiLan1.0
## 27 Cang.pa_1.0
## 28 Cavpor3.0
## 29 ChlSab1.1
## 30 CSAV 2.0
## 31 Cse_v1.0
## 32 Tarsius_syrichta-2.0.1
## 33 C_variegatus-1.0
## 34 BDGP6
## 35 Dasnov3.0
## 36 Dord_2.0
## 37 GRCz11
## 38 ASM303372v1
## 39 Eburgeri_3.2
## 40 EquCab3.0
## 41 eriEur1
## 42 Eluc_V3
## 43 TENREC
## 44 FicAlb_1.4
## 45 Felis_catus_9.0
## 46 DMR_v1.0
## 47 Fundulus_heteroclitus-3.0.2
## 48 BROAD S1
## 49 ASM309773v1
## 50 ASM289641v1
## 51 GRCg6a
## 52 gorGor4
## 53 gadMor1
## 54 AstBur1.0
## 55 H_comes_QL1_v1
## 56 HetGla_female_1.0
## 57 HetGla_1.0
## 58 GRCh38.p12
## 59 IpCoco_1.2
## 60 SpeTri2.0
## 61 JacJac1.0
## 62 ASM164957v1
## 63 Loxafr3.0
## 64 BallGen_V1
## 65 LatCha1
## 66 LepOcu1
## 67 M_albus_1.0
## 68 fMasArm1.1
## 69 MesAur1.0
## 70 CAROLI_EIJ_v1.1
## 71 monDom5
## 72 Macaca_fascicularis_5.0
## 73 MusPutFur1.0
## 74 Turkey_2.01
## 75 Mleu.le_1.0
## 76 Myoluc2.0
## 77 ASM169857v1
## 78 Mmul_8.0.1
## 79 Mmur_3.0
## 80 GRCm38.p6
## 81 Mnem_1.0
## 82 MicOch1.0
## 83 PAHARI_EIJ_v1.1
## 84 SPRET_EiJ_v1
## 85 M_zebra_UMD2a
## 86 NeoBri1.0
## 87 Meug_1.0
## 88 S.galili_v1.0
## 89 Nleu_3.0
## 90 OANA5
## 91 Oar_v3.1
## 92 OryCun2.0
## 93 OctDeg1.0
## 94 OtoGar3
## 95 ASM223471v1
## 96 ASM223469v1
## 97 ASM223467v1
## 98 Om_v0.7.RACA
## 99 Orenil1.0
## 100 OchPri2.0-Ens
## 101 PPYG2
## 102 PanTig1.0
## 103 Panu_3.0
## 104 Pman_1.0
## 105 proCap1
## 106 phaCin_tgac_v2.0
## 107 Pcoq_1.0
## 108 Poecilia_formosa-5.1.2
## 109 PKINGS_0.1
## 110 P_latipinna-1.0
## 111 PM.fa
## 112 Pmarinus_7.0
## 113 P_mexicana-1.0
## 114 Pygocentrus_nattereri-1.0.2
## 115 PunNye1.0
## 116 panpan1.1
## 117 PanPar1.0
## 118 Guppy_female_1.0_MT
## 119 PelSin_1.0
## 120 Pan_tro_3.0
## 121 pteVam1
## 122 ASM169854v1
## 123 Rnor_6.0
## 124 Rrox_v1
## 125 sorAra1
## 126 SaiBol1.0
## 127 R64-1-1
## 128 Sedor1
## 129 Sdu_1.0
## 130 ASM162426v1
## 131 Devil_ref v7.0
## 132 ASM318616v1
## 133 Stegastes_partitus-1.0.2
## 134 ASM311381v1
## 135 Sscrofa11.1
## 136 tupBel1
## 137 taeGut3.2.4
## 138 TETRAODON 8.0
## 139 FUGU5
## 140 turTru1
## 141 ASM334442v1
## 142 UrsMar_1.0
## 143 vicPac1
## 144 VulVul2.2
## 145 Xiphophorus_couchianus-4.0.1
## 146 X_maculatus-5.0-male
## 147 JGI 4.2
To select a dataset we can update the Mart object using the function useDataset(). In the example below we choose to use the hsapiens dataset.
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
3 How to build a biomaRt query
The getBM() function has three arguments that need to be introduced: filters, attributes and values. Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value ‘X’. The listFilters() function shows you all available filters in the selected dataset.
filters = listFilters(ensembl)
filters[1:70,]
## name
## 1 chromosome_name
## 2 start
## 3 end
## 4 band_start
## 5 band_end
## 6 marker_start
## 7 marker_end
## 8 encode_region
## 9 strand
## 10 chromosomal_region
## 11 with_ccds
## 12 with_chembl
## 13 with_clone_based_ensembl_gene
## 14 with_clone_based_ensembl_transcript
## 15 with_dbass3
## 16 with_dbass5
## 17 with_entrezgene_trans_name
## 18 with_embl
## 19 with_arrayexpress
## 20 with_genedb
## 21 with_go
## 22 with_goslim_goa
## 23 with_hgnc
## 24 with_hgnc_trans_name
## 25 with_hpa
## 26 with_protein_id
## 27 with_kegg_enzyme
## 28 with_ens_lrg_gene
## 29 with_merops
## 30 with_mim_gene
## 31 with_mim_morbid
## 32 with_mirbase
## 33 with_mirbase_trans_name
## 34 with_entrezgene
## 35 with_pdb
## 36 with_reactome_gene
## 37 with_reactome_transcript
## 38 with_refseq_mrna
## 39 with_refseq_mrna_predicted
## 40 with_refseq_ncrna
## 41 with_refseq_ncrna_predicted
## 42 with_refseq_peptide
## 43 with_refseq_peptide_predicted
## 44 with_rfam
## 45 with_rfam_trans_name
## 46 with_rnacentral
## 47 with_ucsc
## 48 with_uniparc
## 49 with_uniprot_gn
## 50 with_uniprotswissprot
## 51 with_uniprotsptrembl
## 52 with_wikigene
## 53 ensembl_gene_id
## 54 ensembl_gene_id_version
## 55 ensembl_transcript_id
## 56 ensembl_transcript_id_version
## 57 ensembl_peptide_id
## 58 ensembl_peptide_id_version
## 59 ensembl_exon_id
## 60 external_gene_name
## 61 external_transcript_name
## 62 ccds
## 63 chembl
## 64 clone_based_ensembl_gene
## 65 clone_based_ensembl_transcript
## 66 dbass3_name
## 67 dbass3_id
## 68 dbass5_name
## 69 dbass5_id
## 70 entrezgene_trans_name
## description
## 1 Chromosome/scaffold name
## 2 Start
## 3 End
## 4 Band Start
## 5 Band End
## 6 Marker Start
## 7 Marker End
## 8 Encode region
## 9 Strand
## 10 e.g. 1:100:10000:-1, 1:100000:200000:1
## 11 With CCDS ID(s)
## 12 With ChEMBL ID(s)
## 13 With Clone-based (Ensembl) gene ID(s)
## 14 With Clone-based (Ensembl) transcript ID(s)
## 15 With DataBase of Aberrant 3' Splice Sites ID(s)
## 16 With DataBase of Aberrant 5' Splice Sites ID(s)
## 17 With EntrezGene transcript name ID(s)
## 18 With European Nucleotide Archive ID(s)
## 19 With Expression Atlas ID(s)
## 20 With GeneDB ID(s)
## 21 With GO ID(s)
## 22 With GOSlim GOA ID(s)
## 23 With HGNC Symbol ID(s)
## 24 With HGNC transcript name ID(s)
## 25 With Human Protein Atlas ID(s)
## 26 With INSDC protein ID ID(s)
## 27 With KEGG Pathway and Enzyme ID(s)
## 28 With LRG display in Ensembl gene ID(s)
## 29 With MEROPS - the Peptidase Database ID(s)
## 30 With MIM gene ID(s)
## 31 With MIM morbid ID(s)
## 32 With miRBase ID(s)
## 33 With miRBase transcript name ID(s)
## 34 With NCBI gene ID(s)
## 35 With PDB ID(s)
## 36 With Reactome gene ID(s)
## 37 With Reactome transcript ID(s)
## 38 With RefSeq mRNA ID(s)
## 39 With RefSeq mRNA predicted ID(s)
## 40 With RefSeq ncRNA ID(s)
## 41 With RefSeq ncRNA predicted ID(s)
## 42 With RefSeq peptide ID(s)
## 43 With RefSeq peptide predicted ID(s)
## 44 With RFAM ID(s)
## 45 With RFAM transcript name ID(s)
## 46 With RNAcentral ID(s)
## 47 With UCSC Stable ID ID(s)
## 48 With UniParc ID(s)
## 49 With UniProtKB Gene Name ID(s)
## 50 With UniProtKB/Swiss-Prot ID(s)
## 51 With UniProtKB/TrEMBL ID(s)
## 52 With WikiGene ID(s)
## 53 Gene stable ID(s) [e.g. ENSG00000000003]
## 54 Gene stable ID(s) with version [e.g. ENSG00000000003.14]
## 55 Transcript stable ID(s) [e.g. ENST00000000233]
## 56 Transcript stable ID(s) with version [e.g. ENST00000000233.9]
## 57 Protein stable ID(s) [e.g. ENSP00000000233]
## 58 Protein stable ID(s) with version [e.g. ENSP00000000233.5]
## 59 Exon ID(s) [e.g. ENSE00000327880]
## 60 Gene Name(s) [e.g. MIR4723]
## 61 Transcript Name(s) [e.g. MIR4723-201]
## 62 CCDS ID(s) [e.g. CCDS10]
## 63 ChEMBL ID(s) [e.g. CHEMBL1075092]
## 64 Clone-based (Ensembl) gene ID(s) [e.g. AB015752.1]
## 65 Clone-based (Ensembl) transcript ID(s) [e.g. AB015752.1-201]
## 66 DataBase of Aberrant 3' Splice Sites name(s) [e.g. ABCC8]
## 67 DataBase of Aberrant 3' Splice Sites ID(s) [e.g. 1]
## 68 DataBase of Aberrant 5' Splice Sites name(s) [e.g. ABCD1]
## 69 DataBase of Aberrant 5' Splice Sites ID(s) [e.g. 1]
## 70 EntrezGene transcript name ID(s) [e.g. AA06-201]
Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The listAttributes() function displays all available attributes in the selected dataset.
attributes = listAttributes(ensembl)
attributes[1:20,]
## name
## 1 ensembl_gene_id
## 2 ensembl_gene_id_version
## 3 ensembl_transcript_id
## 4 ensembl_transcript_id_version
## 5 ensembl_peptide_id
## 6 ensembl_peptide_id_version
## 7 ensembl_exon_id
## 8 description
## 9 chromosome_name
## 10 start_position
## 11 end_position
## 12 strand
## 13 band
## 14 transcript_start
## 15 transcript_end
## 16 transcription_start_site
## 17 transcript_length
## 18 transcript_tsl
## 19 transcript_gencode_basic
## 20 transcript_appris
## description page
## 1 Gene stable ID feature_page
## 2 Gene stable ID version feature_page
## 3 Transcript stable ID feature_page
## 4 Transcript stable ID version feature_page
## 5 Protein stable ID feature_page
## 6 Protein stable ID version feature_page
## 7 Exon stable ID feature_page
## 8 Gene description feature_page
## 9 Chromosome/scaffold name feature_page
## 10 Gene start (bp) feature_page
## 11 Gene end (bp) feature_page
## 12 Strand feature_page
## 13 Karyotype band feature_page
## 14 Transcript start (bp) feature_page
## 15 Transcript end (bp) feature_page
## 16 Transcription start site (TSS) feature_page
## 17 Transcript length (including UTRs and CDS) feature_page
## 18 Transcript support level (TSL) feature_page
## 19 GENCODE basic annotation feature_page
## 20 APPRIS annotation feature_page
The getBM() function is the main query function in biomaRt. It has four main arguments:
- attributes: is a vector of attributes that one wants to retrieve (= the output of the query).
- filters: is a vector of filters that one wil use as input to the query.
- values: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below).
- mart: is an object of class Mart, which is created by the useMart() function.
Note: for some frequently used queries to Ensembl, wrapper functions are available: getGene() and getSequence(). These functions call the getBM() function with hard coded filter and attribute names.
Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a biomaRt query. Let’s make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.
The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the listAttributes() and listFilters() function respectively. Let’s now run the query:
affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'),
filters = 'affy_hg_u133_plus_2',
values = affyids,
mart = ensembl)
## affy_hg_u133_plus_2 entrezgene
## 1 202763_at 836
## 2 209310_s_at 837
## 3 207500_at 838
Let’s use biomaRt to get annotation for our output from limma:
v <- voom(dge, design)
fit <- lmFit(v, design)
fit <- eBayes(fit)
tab <- topTable(fit, coef=ncol(design), n = 20)
head(tab)
## logFC AveExpr t P.Value adj.P.Val B
## 100131754 2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752 2.826613 12.90987 96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883 2.671140 12.45734 79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023 -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154 4.160514 11.35699 60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682 3.023933 11.74011 56.60478 3.588091e-17 9.687847e-15 29.92931
The identifiers (row names) in our top table are Entrez Gene identifiers, so let’s add a column with that information
tab$entrezgene <- rownames(tab)
head(tab)
## logFC AveExpr t P.Value adj.P.Val B
## 100131754 2.054703 15.77659 113.29014 3.781646e-21 6.126266e-18 38.49198
## 2752 2.826613 12.90987 96.45001 3.169468e-20 2.567269e-17 36.95688
## 22883 2.671140 12.45734 79.03031 4.398194e-19 2.375025e-16 34.32948
## 2023 -2.284219 13.62891 -75.47484 8.074817e-19 3.270301e-16 33.51309
## 23154 4.160514 11.35699 60.95788 1.351506e-17 4.378878e-15 30.75714
## 8682 3.023933 11.74011 56.60478 3.588091e-17 9.687847e-15 29.92931
## entrezgene
## 100131754 100131754
## 2752 2752
## 22883 22883
## 2023 2023
## 23154 23154
## 8682 8682
Let’s get the gene symbol and gene description matching those Entrez Gene identifiers:
anno <- getBM(ensembl, attributes = c("entrezgene", "hgnc_symbol", "description"), filters = "entrezgene", values = tab$entrezgene)
anno
## entrezgene hgnc_symbol
## 1 10472 ZBTB18
## 2 1465 CSRP1
## 3 2023 ENO1
## 4 22883 CLSTN1
## 5 23095 KIF1B
## 6 23154 NCDN
## 7 23499 MACF1
## 8 2752 GLUL
## 9 284654 RSPO1
## 10 477 ATP1A2
## 11 481 ATP1B1
## 12 4904 YBX1
## 13 55450 CAMK2N1
## 14 5567 PRKACB
## 15 6135 RPL11
## 16 6202 RPS8
## 17 8682 PEA15
## 18 9659 PDE4DIP
## 19 9900 SV2A
## description
## 1 zinc finger and BTB domain containing 18 [Source:HGNC Symbol;Acc:HGNC:13030]
## 2 cysteine and glycine rich protein 1 [Source:HGNC Symbol;Acc:HGNC:2469]
## 3 enolase 1 [Source:HGNC Symbol;Acc:HGNC:3350]
## 4 calsyntenin 1 [Source:HGNC Symbol;Acc:HGNC:17447]
## 5 kinesin family member 1B [Source:HGNC Symbol;Acc:HGNC:16636]
## 6 neurochondrin [Source:HGNC Symbol;Acc:HGNC:17597]
## 7 microtubule-actin crosslinking factor 1 [Source:HGNC Symbol;Acc:HGNC:13664]
## 8 glutamate-ammonia ligase [Source:HGNC Symbol;Acc:HGNC:4341]
## 9 R-spondin 1 [Source:HGNC Symbol;Acc:HGNC:21679]
## 10 ATPase Na+/K+ transporting subunit alpha 2 [Source:HGNC Symbol;Acc:HGNC:800]
## 11 ATPase Na+/K+ transporting subunit beta 1 [Source:HGNC Symbol;Acc:HGNC:804]
## 12 Y-box binding protein 1 [Source:HGNC Symbol;Acc:HGNC:8014]
## 13 calcium/calmodulin dependent protein kinase II inhibitor 1 [Source:HGNC Symbol;Acc:HGNC:24190]
## 14 protein kinase cAMP-activated catalytic subunit beta [Source:HGNC Symbol;Acc:HGNC:9381]
## 15 ribosomal protein L11 [Source:HGNC Symbol;Acc:HGNC:10301]
## 16 ribosomal protein S8 [Source:HGNC Symbol;Acc:HGNC:10441]
## 17 proliferation and apoptosis adaptor protein 15 [Source:HGNC Symbol;Acc:HGNC:8822]
## 18 phosphodiesterase 4D interacting protein [Source:HGNC Symbol;Acc:HGNC:15580]
## 19 synaptic vesicle glycoprotein 2A [Source:HGNC Symbol;Acc:HGNC:20566]