limma is an R package that was originally developed for differential expression (DE) analysis of microarray data.
voom is a function in the limma package that modifies RNA-Seq data for use with limma.
Together they allow fast, flexible, and powerful analyses of RNA-Seq data. Limma-voom is our tool of choice for DE analyses because it:
Allows for incredibly flexible model specification (you can include multiple categorical and continuous variables, allowing incorporation of almost any kind of metadata)
Based on simulation studies, maintains the false discovery rate at or below the nominal rate, unlike some other packages
Empirical Bayes smoothing of gene-wise standard deviations provides increased power.
Input data for this example is on the course github page.
First, install the edgeR package if not already installed (which installs limma as a dependency)
# source("https://bioconductor.org/biocLite.R")
# biocLite("edgeR")
Load the edgeR package (which loads limma as a dependency)
library(edgeR)
## Loading required package: limma
Read in the counts table
counts <- read.delim("all_counts.txt", row.names = 1)
head(counts)
## C61 C62 C63 C64 C91 C92 C93 C94 I561 I562 I563 I564 I591 I592
## AT1G01010 341 371 275 419 400 542 377 372 677 522 455 508 821 466
## AT1G01020 164 94 176 155 200 183 166 115 172 157 122 152 189 171
## AT1G03987 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AT1G01030 20 34 40 27 28 36 22 40 20 7 57 38 25 10
## AT1G01040 738 487 610 690 945 1033 836 908 857 821 770 751 848 607
## AT1G03993 1 0 0 0 0 0 0 1 3 0 1 1 1 0
## I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010 691 500 157 473 459 228 590 491 565 496
## AT1G01020 163 185 46 162 119 53 172 212 169 157
## AT1G03987 0 0 0 0 0 0 0 0 0 0
## AT1G01030 17 26 49 17 24 48 27 28 47 32
## AT1G01040 871 756 361 618 641 439 783 692 768 625
## AT1G03993 0 0 0 1 0 1 3 2 1 1
OR read the file directly from the github page:
counts <- read.delim("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2018-June-RNA-Seq-Workshop/master/thursday/all_counts.txt")
head(counts)
## C61 C62 C63 C64 C91 C92 C93 C94 I561 I562 I563 I564 I591 I592
## AT1G01010 289 317 225 343 325 449 310 299 563 438 380 407 678 386
## AT1G01020 127 78 142 130 156 146 144 95 138 129 99 118 154 140
## AT1G03987 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AT1G01030 17 25 32 24 22 25 21 35 18 6 46 33 19 8
## AT1G01040 605 415 506 565 762 854 658 753 704 692 641 601 704 508
## AT1G03993 1 1 0 0 0 0 1 1 3 0 1 1 1 0
## I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010 567 421 130 411 382 190 501 390 480 407
## AT1G01020 127 154 35 132 97 46 139 175 137 123
## AT1G03987 0 0 0 0 0 0 0 0 0 0
## AT1G01030 14 21 37 16 19 38 24 18 37 23
## AT1G01040 733 614 297 521 542 381 651 573 650 550
## AT1G03993 0 0 0 1 0 1 3 1 1 1
Create DGEList object
d0 <- DGEList(counts)
Calculate normalization factors
d0 <- calcNormFactors(d0)
d0
## An object of class "DGEList"
## $counts
## C61 C62 C63 C64 C91 C92 C93 C94 I561 I562 I563 I564 I591 I592
## AT1G01010 289 317 225 343 325 449 310 299 563 438 380 407 678 386
## AT1G01020 127 78 142 130 156 146 144 95 138 129 99 118 154 140
## AT1G03987 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AT1G01030 17 25 32 24 22 25 21 35 18 6 46 33 19 8
## AT1G01040 605 415 506 565 762 854 658 753 704 692 641 601 704 508
## I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010 567 421 130 411 382 190 501 390 480 407
## AT1G01020 127 154 35 132 97 46 139 175 137 123
## AT1G03987 0 0 0 0 0 0 0 0 0 0
## AT1G01030 14 21 37 16 19 38 24 18 37 23
## AT1G01040 733 614 297 521 542 381 651 573 650 550
## 34257 more rows ...
##
## $samples
## group lib.size norm.factors
## C61 1 10502901 1.0404937
## C62 1 9423745 0.9719425
## C63 1 9437115 1.0401597
## C64 1 10415490 1.0360806
## C91 1 10169158 1.0468854
## 19 more rows ...
Note: calcNormFactors doesn’t normalize the data, it just calculates normalization factors for use downstream.
Filter low-expressed genes
cutoff <- 1
drop <- which(apply(cpm(d0), 1, max) < cutoff)
d <- d0[-drop,]
dim(d) # number of genes left
## [1] 21080 24
“Low-expressed” is subjective and depends on the dataset.
Derive experiment information from the sample names
Our experiment has two factors, cultivar (“C”, “I5”, or “I8”) and time (6 or 9)
The sample names are the cultivar, followed by the time, followed by the replicate
snames <- colnames(counts) # Sample names
snames
## [1] "C61" "C62" "C63" "C64" "C91" "C92" "C93" "C94" "I561" "I562"
## [11] "I563" "I564" "I591" "I592" "I593" "I594" "I861" "I862" "I863" "I864"
## [21] "I891" "I892" "I893" "I894"
cultivar <- substr(snames, 1, nchar(snames) - 2)
time <- substr(snames, nchar(snames) - 1, nchar(snames) - 1)
cultivar
## [1] "C" "C" "C" "C" "C" "C" "C" "C" "I5" "I5" "I5" "I5" "I5" "I5"
## [15] "I5" "I5" "I8" "I8" "I8" "I8" "I8" "I8" "I8" "I8"
time
## [1] "6" "6" "6" "6" "9" "9" "9" "9" "6" "6" "6" "6" "9" "9" "9" "9" "6"
## [18] "6" "6" "6" "9" "9" "9" "9"
Create a new variable “group” that combines cultivar and time
group <- interaction(cultivar, time)
group
## [1] C.6 C.6 C.6 C.6 C.9 C.9 C.9 C.9 I5.6 I5.6 I5.6 I5.6 I5.9 I5.9
## [15] I5.9 I5.9 I8.6 I8.6 I8.6 I8.6 I8.9 I8.9 I8.9 I8.9
## Levels: C.6 I5.6 I8.6 C.9 I5.9 I8.9
Note: you can also enter group information manually, or read it in from an external file. If you do this, it is \(very, very, very\) important that you make sure the metadata is in the same order as the column names of the counts table.
Multidimensional scaling (MDS) plot
plotMDS(d, col = as.numeric(group))
Specify the model to be fitted. We do this before using voom since voom uses variances of the model residuals (observed - fitted)
mm <- model.matrix(~0 + group)
The above specifies a model where each coefficient corresponds to a group mean
Voom
y <- voom(d, mm, plot = T)
What is voom doing?
More details at https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
The above is a “good” voom plot. If your voom plot looks like the below, you might want to filter more:
tmp <- voom(d0, mm, plot = T)
lmFit fits a linear model using weighted least squares for each gene:
fit <- lmFit(y, mm)
head(coef(fit))
## groupC.6 groupI5.6 groupI8.6 groupC.9 groupI5.9 groupI8.9
## AT1G01010 4.837410 5.3738532 5.065354 5.043214 5.5240004 5.363809
## AT1G01020 3.530869 3.4993460 3.212860 3.689622 3.7209961 3.736297
## AT1G01030 1.250817 0.9293832 1.559242 1.285596 0.4831707 1.215591
## AT1G01040 5.676015 5.9469878 5.778889 6.182374 5.8641107 5.815498
## AT1G01050 6.598712 6.5013631 6.463936 6.619239 6.7532886 6.711772
## AT1G01060 7.807988 7.4624783 7.390741 8.966047 8.2706387 8.376129
Comparisons between groups (log fold-changes) are obtained as contrasts of these fitted linear models:
Specify which groups to compare:
Comparison between times 6 and 9 for cultivar I5
contr <- makeContrasts(groupI5.9 - groupI5.6, levels = colnames(coef(fit)))
contr
## Contrasts
## Levels groupI5.9 - groupI5.6
## groupC.6 0
## groupI5.6 -1
## groupI8.6 0
## groupC.9 0
## groupI5.9 1
## groupI8.9 0
Estimate contrast for each gene
tmp <- contrasts.fit(fit, contr)
Empirical Bayes smoothing of standard errors (shrinks standard errors that are much larger or smaller than those from other genes towards the average standard error) (see https://www.degruyter.com/doi/10.2202/1544-6115.1027)
tmp <- eBayes(tmp)
What genes are most differentially expressed?
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT5G37260 3.0480867 6.964609 24.154998 1.193341e-16 2.515563e-12
## AT3G02990 1.6484885 3.304597 13.605334 8.573977e-12 9.036971e-08
## AT2G29500 -5.0342224 5.525802 -12.062120 7.937586e-11 5.577477e-07
## AT3G24520 1.8715741 5.882965 11.710003 1.360008e-10 7.167244e-07
## AT3G46230 -6.7068648 4.544494 -11.189910 3.081581e-10 1.299195e-06
## AT5G65630 1.0468903 7.553775 10.657591 7.329270e-10 2.575017e-06
## AT3G24100 -1.2759156 5.988295 -10.237187 1.485118e-09 4.021633e-06
## AT1G53540 -7.1847583 4.316965 -10.160183 1.693866e-09 4.021633e-06
## AT4G21320 -2.8641453 4.705136 -10.152258 1.717016e-09 4.021633e-06
## AT5G48570 -4.0078194 6.422341 -10.036624 2.094832e-09 4.415906e-06
## AT5G52882 0.8786555 7.649621 9.828722 3.007116e-09 5.398811e-06
## AT5G02810 -4.8872284 3.419336 -9.816290 3.073327e-09 5.398811e-06
## AT1G09140 -1.2469280 7.027052 -9.496095 5.419786e-09 8.788392e-06
## AT4G24780 1.9288845 6.979568 9.380927 6.666788e-09 1.003828e-05
## AT5G59570 1.5315276 3.005741 9.330689 7.300769e-09 1.026001e-05
## AT4G11880 2.7436628 2.919357 9.183289 9.547854e-09 1.245739e-05
## AT4G25500 1.0517676 6.143738 9.155503 1.004628e-08 1.245739e-05
## AT1G22770 -3.5882843 3.953036 -9.090404 1.132259e-08 1.326002e-05
## AT1G31230 1.1574470 6.114996 8.721535 2.252449e-08 2.499033e-05
## AT2G27820 -0.8709118 5.735808 -8.680724 2.433146e-08 2.564536e-05
## B
## AT5G37260 26.622628
## AT3G02990 16.006216
## AT2G29500 14.645646
## AT3G24520 14.385545
## AT3G46230 12.519533
## AT5G65630 12.802559
## AT3G24100 12.109394
## AT1G53540 10.864734
## AT4G21320 11.825499
## AT5G48570 11.752325
## AT5G52882 11.430101
## AT5G02810 10.430908
## AT1G09140 10.854555
## AT4G24780 10.651922
## AT5G59570 10.146392
## AT4G11880 9.781987
## AT4G25500 10.250838
## AT1G22770 9.958652
## AT1G31230 9.459672
## AT2G27820 9.385693
AT5G37260 has higher expression at time 9 than at time 6 (logFC is positive). AT2G29500 has lower expression at time 9 than at time 6 (logFC is negative).
How many DE genes are there?
length(which(top.table$adj.P.Val < 0.05))
## [1] 4680
Write top.table to a file
top.table$Gene <- rownames(top.table)
top.table <- top.table[,c("Gene", names(top.table)[1:6])]
write.table(top.table, file = "time9_v_time6_I5.txt", row.names = F, sep = "\t", quote = F)
Let’s say we want to compare cultivars C and I5 at time 6. The only thing we have to change is the call to makeContrasts:
contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
tmp <- contrasts.fit(fit, contr)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT4G12520 -9.3396792 0.5671202 -14.118616 4.276991e-12 9.015898e-08
## AT3G30720 5.6550000 3.5130064 11.628491 1.543229e-10 1.626564e-06
## AT5G26270 2.3722487 4.4417048 9.728356 3.587045e-09 2.520497e-05
## AT3G33528 -4.7012468 -1.6468149 -8.175366 6.441093e-08 3.394456e-04
## AT3G05955 -3.7995906 -1.6958534 -7.210032 4.545141e-07 1.916232e-03
## AT4G28100 -0.8099448 4.5746806 -7.041358 6.477323e-07 1.986777e-03
## AT1G64795 -4.7147903 -1.1051093 -7.032657 6.597456e-07 1.986777e-03
## AT3G25730 1.3872163 3.9538977 6.242201 3.650963e-06 9.620287e-03
## AT5G05480 -0.4869486 4.6067884 -5.647317 1.392471e-05 3.005506e-02
## AT2G18193 1.0100187 3.8571257 5.626785 1.459359e-05 3.005506e-02
## AT4G01870 1.6373400 5.6170176 5.595305 1.568338e-05 3.005506e-02
## AT2G14878 -0.5162040 6.4733722 -5.531829 1.814030e-05 3.186647e-02
## AT2G06995 -3.2115972 -2.3585729 -5.439207 2.244863e-05 3.640132e-02
## AT4G15248 -1.6999189 2.1452779 -5.392221 2.501939e-05 3.673680e-02
## AT1G62280 -1.7551000 2.7349000 -5.373243 2.614099e-05 3.673680e-02
## AT3G28740 2.2032568 5.5413479 5.284877 3.207698e-05 4.226142e-02
## AT5G14370 0.8031693 3.7422520 5.174352 4.147430e-05 4.748115e-02
## AT2G25737 0.8732454 3.6798171 5.163677 4.251876e-05 4.748115e-02
## AT2G37760 0.7703328 5.4139388 5.141986 4.472409e-05 4.748115e-02
## AT2G30400 1.0087276 2.0253964 5.138887 4.504853e-05 4.748115e-02
## B
## AT4G12520 6.42577720
## AT3G30720 9.54391447
## AT5G26270 11.00209007
## AT3G33528 1.56894746
## AT3G05955 2.19224634
## AT4G28100 6.18609294
## AT1G64795 2.33667151
## AT3G25730 4.53222502
## AT5G05480 3.26213083
## AT2G18193 3.23531092
## AT4G01870 3.11677860
## AT2G14878 2.94274140
## AT2G06995 0.02144583
## AT4G15248 2.49861111
## AT1G62280 2.67518742
## AT3G28740 2.42959865
## AT5G14370 2.25381821
## AT2G25737 2.23281223
## AT2G37760 2.08993315
## AT2G30400 2.09909628
length(which(top.table$adj.P.Val < 0.05)) # number of DE genes
## [1] 20
top.table$Gene <- rownames(top.table)
top.table <- top.table[,c("Gene", names(top.table)[1:6])]
write.table(top.table, file = "I5_v_C_time6.txt", row.names = F, sep = "\t", quote = F)
What if we refit our model as a two-factor model (rather than using the group variable)?
Create new model matrix:
mm <- model.matrix(~cultivar*time)
We are specifying that model includes effects for cultivar, time, and the cultivar-time interaction (which allows the differences between cultivars to differ across time)
colnames(mm)
## [1] "(Intercept)" "cultivarI5" "cultivarI8"
## [4] "time9" "cultivarI5:time9" "cultivarI8:time9"
y <- voom(d, mm, plot = F)
fit <- lmFit(y, mm)
head(coef(fit))
## (Intercept) cultivarI5 cultivarI8 time9 cultivarI5:time9
## AT1G01010 4.837410 0.53644370 0.2279446 0.20580445 -0.05565729
## AT1G01020 3.530869 -0.03152318 -0.3180096 0.15875297 0.06289715
## AT1G01030 1.250817 -0.32143420 0.3084243 0.03477863 -0.48099113
## AT1G01040 5.676015 0.27097286 0.1028739 0.50635951 -0.58923660
## AT1G01050 6.598712 -0.09734846 -0.1347759 0.02052702 0.23139851
## AT1G01060 7.807988 -0.34550979 -0.4172467 1.15805850 -0.34989810
## cultivarI8:time9
## AT1G01010 0.09265044
## AT1G01020 0.36468449
## AT1G01030 -0.37842909
## AT1G01040 -0.46975071
## AT1G01050 0.22730960
## AT1G01060 -0.17267051
Let’s estimate the difference between cultivars I5 and C at time 6
tmp <- contrasts.fit(fit, coef = 2) # Directly test second coefficient
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT4G12520 -9.3396792 0.5671202 -14.118616 4.276991e-12 9.015898e-08
## AT3G30720 5.6550000 3.5130064 11.628491 1.543229e-10 1.626564e-06
## AT5G26270 2.3722487 4.4417048 9.728356 3.587045e-09 2.520497e-05
## AT3G33528 -4.7012468 -1.6468149 -8.175366 6.441093e-08 3.394456e-04
## AT3G05955 -3.7995906 -1.6958534 -7.210032 4.545141e-07 1.916232e-03
## AT4G28100 -0.8099448 4.5746806 -7.041358 6.477323e-07 1.986777e-03
## AT1G64795 -4.7147903 -1.1051093 -7.032657 6.597456e-07 1.986777e-03
## AT3G25730 1.3872163 3.9538977 6.242201 3.650963e-06 9.620287e-03
## AT5G05480 -0.4869486 4.6067884 -5.647317 1.392471e-05 3.005506e-02
## AT2G18193 1.0100187 3.8571257 5.626785 1.459359e-05 3.005506e-02
## AT4G01870 1.6373400 5.6170176 5.595305 1.568338e-05 3.005506e-02
## AT2G14878 -0.5162040 6.4733722 -5.531829 1.814030e-05 3.186647e-02
## AT2G06995 -3.2115972 -2.3585729 -5.439207 2.244863e-05 3.640132e-02
## AT4G15248 -1.6999189 2.1452779 -5.392221 2.501939e-05 3.673680e-02
## AT1G62280 -1.7551000 2.7349000 -5.373243 2.614099e-05 3.673680e-02
## AT3G28740 2.2032568 5.5413479 5.284877 3.207698e-05 4.226142e-02
## AT5G14370 0.8031693 3.7422520 5.174352 4.147430e-05 4.748115e-02
## AT2G25737 0.8732454 3.6798171 5.163677 4.251876e-05 4.748115e-02
## AT2G37760 0.7703328 5.4139388 5.141986 4.472409e-05 4.748115e-02
## AT2G30400 1.0087276 2.0253964 5.138887 4.504853e-05 4.748115e-02
## B
## AT4G12520 6.42577720
## AT3G30720 9.54391447
## AT5G26270 11.00209007
## AT3G33528 1.56894746
## AT3G05955 2.19224634
## AT4G28100 6.18609294
## AT1G64795 2.33667151
## AT3G25730 4.53222502
## AT5G05480 3.26213083
## AT2G18193 3.23531092
## AT4G01870 3.11677860
## AT2G14878 2.94274140
## AT2G06995 0.02144583
## AT4G15248 2.49861111
## AT1G62280 2.67518742
## AT3G28740 2.42959865
## AT5G14370 2.25381821
## AT2G25737 2.23281223
## AT2G37760 2.08993315
## AT2G30400 2.09909628
length(which(top.table$adj.P.Val < 0.05)) # number of DE genes
## [1] 20
We get the same results as with the model where each coefficient corresponded to a group mean. In essence, these are the same model, so use whichever is most convenient for what you are estimating.
The interaction effects cultivarI5:time9 and cultivarI8:time9 are easier to estimate and test in this setup
head(coef(fit))
## (Intercept) cultivarI5 cultivarI8 time9 cultivarI5:time9
## AT1G01010 4.837410 0.53644370 0.2279446 0.20580445 -0.05565729
## AT1G01020 3.530869 -0.03152318 -0.3180096 0.15875297 0.06289715
## AT1G01030 1.250817 -0.32143420 0.3084243 0.03477863 -0.48099113
## AT1G01040 5.676015 0.27097286 0.1028739 0.50635951 -0.58923660
## AT1G01050 6.598712 -0.09734846 -0.1347759 0.02052702 0.23139851
## AT1G01060 7.807988 -0.34550979 -0.4172467 1.15805850 -0.34989810
## cultivarI8:time9
## AT1G01010 0.09265044
## AT1G01020 0.36468449
## AT1G01030 -0.37842909
## AT1G01040 -0.46975071
## AT1G01050 0.22730960
## AT1G01060 -0.17267051
tmp <- contrasts.fit(fit, coef = 5) # Test cultivarI5:time9
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT5G38200 5.2681384 5.459342 12.169571 6.750884e-11 1.423086e-06
## AT5G06860 3.9331835 6.449356 7.667986 1.770961e-07 1.306381e-03
## AT1G08430 7.3873178 5.946190 7.640080 1.874141e-07 1.306381e-03
## AT1G76990 -0.9683284 8.651885 -7.503011 2.478901e-07 1.306381e-03
## AT5G18170 1.4739946 6.811722 7.145965 5.197419e-07 2.191232e-03
## AT5G24030 1.5597102 6.422377 6.737394 1.238209e-06 3.821427e-03
## AT1G62280 3.3975934 2.734900 6.652058 1.488488e-06 3.821427e-03
## AT2G41380 3.7331437 5.866261 6.615737 1.610271e-06 3.821427e-03
## AT2G45360 -4.8763311 1.093918 -6.592316 1.694190e-06 3.821427e-03
## AT2G46790 3.2686459 3.970742 6.530515 1.937883e-06 3.821427e-03
## AT5G15950 3.2415341 5.516374 6.517393 1.994103e-06 3.821427e-03
## AT5G67520 -1.8307730 3.290640 -6.362042 2.802299e-06 4.649936e-03
## AT1G01650 -1.1996420 5.524074 -6.351573 2.867608e-06 4.649936e-03
## AT5G15330 -1.3784824 4.115181 -6.268550 3.444132e-06 5.065418e-03
## AT1G48100 1.9296680 2.376553 6.247994 3.604425e-06 5.065418e-03
## AT4G19160 -2.3207711 7.235367 -6.179190 4.198820e-06 5.356807e-03
## AT1G05890 -0.6529356 6.455076 -6.160149 4.380439e-06 5.356807e-03
## AT5G56010 1.2941382 8.452257 6.130758 4.676715e-06 5.356807e-03
## AT4G37870 0.8880398 8.434194 6.116456 4.828242e-06 5.356807e-03
## AT1G58110 -0.7654977 5.850866 -6.004302 6.205156e-06 6.540234e-03
## B
## AT5G38200 13.058923
## AT5G06860 7.282910
## AT1G08430 6.287578
## AT1G76990 7.111815
## AT5G18170 6.408931
## AT5G24030 5.581114
## AT1G62280 4.943495
## AT2G41380 5.221262
## AT2G45360 3.503402
## AT2G46790 4.956204
## AT5G15950 5.114242
## AT5G67520 4.615689
## AT1G01650 4.778464
## AT5G15330 4.553861
## AT1G48100 3.973489
## AT4G19160 4.412505
## AT1G05890 4.372387
## AT5G56010 4.309468
## AT4G37870 4.278843
## AT1G58110 4.041773
length(which(top.table$adj.P.Val < 0.05))
## [1] 882
The log fold change here is the difference between cultivarI5 and cultivar C in the log fold changes between times 9 and 6. It is ALSO the difference between times 9 and 6 in the log fold changes between cultivarI5 and cultivar C.
A gene for which this interaction effect is significant is one for which the effect of time differs between cultivars, and for which the effect of cultivar differs between times.
Specifying a different model is simply a matter of changing the calls to model.matrix (and possibly to contrasts.fit).
Let’s say we have information on the RNA extraction batch:
batch <- factor(rep(rep(1:2, each = 2), 6))
batch
## [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
## Levels: 1 2
To adjust for batch in the analysis, add batch to the end of the call to model matrix. Everything else about the code stays the same:
mm <- model.matrix(~0 + group + batch)
y <- voom(d, mm, plot = F)
fit <- lmFit(y, mm)
contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
tmp <- contrasts.fit(fit, contr)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT4G12520 -9.1849480 0.5671202 -13.894622 1.254284e-11 2.644031e-07
## AT3G30720 5.6553502 3.5130064 11.626230 2.902705e-10 3.059451e-06
## AT5G26270 2.3756632 4.4417048 10.532157 1.568494e-09 1.102129e-05
## AT3G33528 -4.6930275 -1.6468149 -8.114771 1.054174e-07 5.555497e-04
## AT1G64795 -4.6966105 -1.1051093 -7.585967 2.920305e-07 1.231200e-03
## AT4G28100 -0.8142969 4.5746806 -7.332425 4.823635e-07 1.694704e-03
## AT3G05955 -3.7806772 -1.6958534 -7.158687 6.837456e-07 2.059051e-03
## AT1G62280 -1.7706304 2.7349000 -6.183451 5.223128e-06 1.278365e-02
## AT2G18193 1.0171979 3.8571257 6.155724 5.544006e-06 1.278365e-02
## AT3G25730 1.3861828 3.9538977 6.114089 6.064349e-06 1.278365e-02
## AT4G34138 1.2583243 6.2839069 5.874446 1.020572e-05 1.955787e-02
## AT1G67890 0.4262907 5.1117436 5.759619 1.312886e-05 2.306302e-02
## AT3G63430 -1.2020149 3.7073650 -5.617984 1.794992e-05 2.838999e-02
## AT5G05480 -0.4876253 4.6067884 -5.595805 1.885483e-05 2.838999e-02
## AT4G15248 -1.7089812 2.1452779 -5.492499 2.372592e-05 3.079319e-02
## AT2G14878 -0.5161993 6.4733722 -5.474049 2.472296e-05 3.079319e-02
## AT2G06995 -3.1599964 -2.3585729 -5.445134 2.637233e-05 3.079319e-02
## AT4G01870 1.6376531 5.6170176 5.444359 2.641803e-05 3.079319e-02
## AT1G15380 -0.6846986 7.6756786 -5.421227 2.782070e-05 3.079319e-02
## AT3G45980 -0.3280034 7.2242831 -5.399375 2.921555e-05 3.079319e-02
## B
## AT4G12520 5.6334654
## AT3G30720 8.9070420
## AT5G26270 11.6801080
## AT3G33528 1.1444487
## AT1G64795 2.4478920
## AT4G28100 6.4756663
## AT3G05955 1.7078155
## AT1G62280 4.1194174
## AT2G18193 4.1644124
## AT3G25730 4.0704267
## AT4G34138 3.5467936
## AT1G67890 3.3320337
## AT3G63430 3.0358961
## AT5G05480 2.9985093
## AT4G15248 2.5246096
## AT2G14878 2.6674208
## AT2G06995 -0.2708462
## AT4G01870 2.6397139
## AT1G15380 2.5436455
## AT3G45980 2.4978614
length(which(top.table$adj.P.Val < 0.05))
## [1] 27
What if we want to adjust for a continuous variable like RIN score:
# Generate example RIN data
set.seed(99)
RIN <- rnorm(n = 24, mean = 7.5, sd = 1)
RIN
## [1] 7.713963 7.979658 7.587829 7.943859 7.137162 7.622674 6.636155
## [8] 7.989624 7.135883 6.205758 6.754231 8.421550 8.250054 4.991446
## [15] 4.459066 7.500266 7.105981 5.754972 7.998631 7.770954 8.598922
## [22] 8.252513 7.440583 7.155431
Model adjusting for RIN score
mm <- model.matrix(~0 + group + RIN)
y <- voom(d, mm, plot = F)
fit <- lmFit(y, mm)
contr <- makeContrasts(groupI5.6 - groupC.6, levels = colnames(coef(fit)))
tmp <- contrasts.fit(fit, contr)
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT3G30720 5.6296669 3.5130064 18.264852 8.228016e-14 1.078373e-09
## AT4G12520 -9.2914672 0.5671202 -18.052498 1.023125e-13 1.078373e-09
## AT5G26270 2.4136835 4.4417048 10.642693 1.300777e-09 9.140129e-06
## AT3G33528 -4.9109883 -1.6468149 -9.609516 7.125167e-09 3.754963e-05
## AT1G64795 -4.7654967 -1.1051093 -7.254218 5.606683e-07 2.363777e-03
## AT4G28100 -0.8023348 4.5746806 -6.994622 9.486720e-07 3.333001e-03
## AT3G25730 1.4185849 3.9538977 6.546417 2.402778e-06 7.235795e-03
## AT4G15248 -1.7173926 2.1452779 -6.178114 5.259531e-06 1.385886e-02
## AT3G28740 2.3364878 5.5413479 5.900899 9.592908e-06 2.114492e-02
## AT3G05955 -3.6893043 -1.6958534 -5.880481 1.003080e-05 2.114492e-02
## AT4G01870 1.6906107 5.6170176 5.821660 1.141042e-05 2.126536e-02
## AT1G60750 1.8497644 3.8830018 5.794730 1.210552e-05 2.126536e-02
## AT1G17180 1.5714528 4.2264956 5.523743 2.205086e-05 3.575631e-02
## AT2G18193 1.0022545 3.8571257 5.472489 2.472195e-05 3.722420e-02
## AT5G07870 0.8857028 5.0986202 5.401616 2.896986e-05 3.817990e-02
## AT3G27940 1.4469855 -0.4298164 5.401474 2.897905e-05 3.817990e-02
## AT4G34135 1.0569416 6.3882644 5.238191 4.183945e-05 4.792741e-02
## AT5G48010 2.1658813 5.3481653 5.225912 4.301563e-05 4.792741e-02
## AT5G05480 -0.4793068 4.6067884 -5.191084 4.653849e-05 4.792741e-02
## AT4G34138 1.3466558 6.2839069 5.184933 4.719053e-05 4.792741e-02
## B
## AT3G30720 17.859502
## AT4G12520 11.272023
## AT5G26270 12.140983
## AT3G33528 3.415722
## AT1G64795 3.228510
## AT4G28100 5.862248
## AT3G25730 4.969968
## AT4G15248 4.106498
## AT3G28740 3.577535
## AT3G05955 0.636844
## AT4G01870 3.399840
## AT1G60750 3.421066
## AT1G17180 2.815864
## AT2G18193 2.734741
## AT5G07870 2.518925
## AT3G27940 1.914480
## AT4G34135 2.102290
## AT5G48010 2.121253
## AT5G05480 2.095733
## AT4G34138 1.986345
length(which(top.table$adj.P.Val < 0.05))
## [1] 28
What if we want to look at the correlation of gene expression with a continuous variable like pH?
# Generate example pH data
set.seed(99)
pH <- rnorm(n = 24, mean = 8, sd = 1.5)
pH
## [1] 8.320944 8.719487 8.131743 8.665788 7.455743 8.184011 6.704232
## [8] 8.734436 7.453825 6.058637 6.881346 9.382326 9.125082 4.237169
## [15] 3.438599 8.000399 7.408972 5.382459 8.747947 8.406431 9.648382
## [22] 9.128770 7.910875 7.483147
Specify model matrix:
mm <- model.matrix(~pH)
head(mm)
## (Intercept) pH
## 1 1 8.320944
## 2 1 8.719487
## 3 1 8.131743
## 4 1 8.665788
## 5 1 7.455743
## 6 1 8.184011
y <- voom(d, mm, plot = F)
fit <- lmFit(y, mm)
tmp <- contrasts.fit(fit, coef = 2) # test "pH" coefficient
tmp <- eBayes(tmp)
top.table <- topTable(tmp, sort.by = "P", n = Inf)
head(top.table, 20)
## logFC AveExpr t P.Value adj.P.Val
## AT3G63220 0.10930007 3.72231850 3.984813 0.0005224914 0.9979999
## AT4G22790 -0.18156791 2.62688789 -3.924496 0.0006095427 0.9979999
## AT3G25855 -0.18661169 0.39590825 -3.888223 0.0006686278 0.9979999
## AT5G48060 -0.14380178 3.08645012 -3.729192 0.0010015260 0.9979999
## AT1G60030 -0.18144388 1.90130680 -3.688498 0.0011101065 0.9979999
## AT5G50290 -0.19527796 0.87276139 -3.683545 0.0011240846 0.9979999
## AT1G69588 -0.18176455 1.07793155 -3.430115 0.0021225677 0.9979999
## AT1G17250 0.18143048 0.91048376 3.423565 0.0021574364 0.9979999
## AT5G66350 -0.18602484 0.59789247 -3.376519 0.0024247787 0.9979999
## AT5G49320 -0.17105888 1.05400394 -3.292867 0.0029815543 0.9979999
## AT3G17730 -0.16801297 1.56449303 -3.282393 0.0030594390 0.9979999
## AT5G61420 -0.18139376 5.05638487 -3.280444 0.0030741424 0.9979999
## AT4G05430 -0.20982687 -0.33505919 -3.272344 0.0031360082 0.9979999
## AT1G51460 -0.26599783 -0.19515578 -3.249322 0.0033184675 0.9979999
## AT1G59730 -0.26312342 2.24935746 -3.235196 0.0034354577 0.9979999
## AT2G38920 -0.25174501 0.02998871 -3.228661 0.0034909211 0.9979999
## AT4G11655 0.30473405 -0.14566023 3.217560 0.0035871102 0.9979999
## AT1G63020 -0.11884059 3.38076346 -3.210234 0.0036519841 0.9979999
## AT2G16400 -0.09043316 4.49774646 -3.198184 0.0037611636 0.9979999
## AT1G17240 0.22499668 1.07480792 3.181368 0.0039188025 0.9979999
## B
## AT3G63220 -0.3281633
## AT4G22790 -0.5717697
## AT3G25855 -1.5833958
## AT5G48060 -0.8575577
## AT1G60030 -1.1988239
## AT5G50290 -1.5961103
## AT1G69588 -1.8900360
## AT1G17250 -2.2539850
## AT5G66350 -2.1461291
## AT5G49320 -2.1077044
## AT3G17730 -1.9798846
## AT5G61420 -1.6446842
## AT4G05430 -2.6782487
## AT1G51460 -2.5644227
## AT1G59730 -1.8823323
## AT2G38920 -2.5302070
## AT4G11655 -3.0281427
## AT1G63020 -1.8275637
## AT2G16400 -1.8149709
## AT1G17240 -2.5144550
length(which(top.table$adj.P.Val < 0.05))
## [1] 0
In this case, limma is fitting a linear regression model, which here is a straight line fit, with the slope and intercept defined by the model coefficients:
AT1G60030 <- y$E["AT1G60030",]
plot(AT1G60030 ~ pH, ylim = c(0, 3.5))
intercept <- coef(fit)["AT1G60030", "(Intercept)"]
slope <- coef(fit)["AT1G60030", "pH"]
abline(a = intercept, b = slope)
In this example, the log fold change logFC is the slope of the line, or the change in gene expression (on the log2 CPM scale) for each unit increase in pH.
Here, a logFC of -0.19 means a 0.19 log2 CPM decrease in gene expression for each unit increase in pH, or a 14% decrease on the CPM scale (2^0.19 = 1.14).
Limma fits a linear model to each gene.
Linear models include analysis of variance (ANOVA) models, linear regression, and any model of the form
\[Y = \beta_0 + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{p}X_{p} + \epsilon\] The covariates X can be:
The \(\beta\)’s are unknown parameters to be estimated.
In limma, the \(\beta\)’s are the log fold changes.
The error (residual) term \(\epsilon\) is assumed to be normally distributed with a variance that is constant across the range of the data.
Normally distributed means the residuals come from a distribution that looks like this:
The log2 transformation that voom applies to the counts makes the data “normal enough”, but doesn’t completely stabilize the variance:
tmp <- voom(d, mm, plot = T)
The log2 counts per million are more variable at lower expression levels. The variance weights calculated by voom address this situation.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] edgeR_3.22.2 limma_3.36.1
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.1 Rcpp_0.12.17 lattice_0.20-35 digest_0.6.15
## [5] rprojroot_1.3-2 grid_3.5.0 backports_1.1.2 magrittr_1.5
## [9] evaluate_0.10.1 stringi_1.1.7 rmarkdown_1.10 tools_3.5.0
## [13] stringr_1.3.1 yaml_2.1.19 compiler_3.5.0 htmltools_0.3.6
## [17] knitr_1.20