UCD Bioinformatics Core Workshop

Home
Setup
Analysis in limma

Introduction

Limma is an R package (developed for use with gene expression microarrays) that is used for differential abundance/expression analysis of proteomics, metabolomics, RNA sequencing, and other ‘omics data.

Core steps of limma analysis

A linear model (e.g. ANOVA or regression) is fitted to each protein.
Empirical Bayes smoothing is used to get better estimates of standard errors of log fold changes, which are then used in differential abundance testing.
P-values are adjusted for multiple testing across proteins using the Benjamini-Hochberg false discovery rate controlling method.

Limma allows for more complicated experimental designs and models than many other packages.

References:

Matthew E. Ritchie, Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, Gordon K. Smyth, limma powers differential expression analyses for RNA-sequencing and microarray studies, Nucleic Acids Research, Volume 43, Issue 7, 20 April 2015, Page e47, https://doi.org/10.1093/nar/gkv007

Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:Article3. doi: 10.2202/1544-6115.1027. Epub 2004 Feb 12. PMID: 16646809.

https://www.bioconductor.org/packages//2.11/bioc/vignettes/limma/inst/doc/usersguide.pdf

Data

We will use proteomics data from Kistner et al. on 50 subjects with chronic lymphocytic leukemia (CLL).

The differential abundance analysis will compare subjects with mutated (M) IGHV versus unmutated (U) IGHV. CLL patients with mutated IGHV tend to have a better prognosis.

References:

Franziska Kistner, Justus L. Grossmann, Ludwig R. Sinn, Vadim Demichev bioRxiv 2023.06.20.545604; doi: https://doi.org/10.1101/2023.06.20.545604

Analysis

Load R packages:

library(dplyr)
library(limma)
library(ggplot2)
library(ggrepel)
library(ComplexHeatmap)

Read in proteomics data and metadata

Read in proteomics data, clean up:

dat0 <- read.delim("legacy-report.pg_matrix.tsv", check.names = FALSE)

# Split out protein annotation and protein abundance data
anno <- select(dat0, Protein.Group:First.Protein.Description)
dat <- select(dat0, -(Protein.Group:First.Protein.Description))

# Clean up column names so they match filenames in metadata
colnames(dat) <- gsub("D:\\PXD022216\\", "", colnames(dat), fixed = TRUE)

# Set rownames to Protein.Group
rownames(dat) <- anno$Protein.Group

Read in metadata, reorder so matches order of proteomics data. Very important!

pdata <- read.delim("Annotation_file_timsTOF.txt")
rownames(pdata) <- pdata$File.name

# Reorder to match proteomics data
pdata <- pdata[colnames(dat),]

# Check again to make sure metadata matches proteomics data
identical(pdata$File.name, colnames(dat))

## [1] TRUE

Set missing values to 0:

na.mat <- is.na(dat) 
dat[na.mat] <- 0

We will log transform the proteomics data prior to analysis. This helps make the data closer to normally distributed and makes the variability more constant between low and high expressed proteins.

Log transform data:

dat.trans <- log2(dat + 1)

Boxplot of data to evaluate need for between-sample normalization:

boxplot(dat.trans, xaxt = "n")

Between sample normalization appears to have already been performed. We could use the limma function normalizeCyclicLoess to normalize if this wasn’t the case:

Example of normalization (not run):

dat.trans <- normalizeCyclicLoess(dat.trans)

Next, we filter the data to remove proteins present in few samples. This helps with the following:

Testing fewer proteins makes the multiple testing adjustment less strict (the remaining proteins have a better chance of being significant).
Having proteins that are absent (zero) in lots of samples introduces large numbers of ties in the data, causing the variability to be underestimated.

Filter to proteins present in at least half of samples:

num.nonmissing <- rowSums(!na.mat)
keep <- num.nonmissing >= ncol(dat.trans)/2
dat.filtered <- dat.trans[keep,]

Multidimensional scaling plots

A multidimensional scaling (MDS) plot is a two dimensional view of the data showing the relative distances between protein profiles.

We will use the limma function plotMDS to get the MDS coordinates then make the actual plot in ggplot2 (https://ggplot2.tidyverse.org/) for extra flexibility.

coords <- plotMDS(dat.filtered, plot = FALSE)
plotdat <- pdata
plotdat$MDS_1 <- coords$x
plotdat$MDS_2 <- coords$y

MDS plot by IGHV status

ggplot(plotdat, aes(x = MDS_1, y = MDS_2, color = IGHV.status)) + geom_point(size = 3) + theme_bw() + labs(color = "IGHV status")

MDS plot by gender

ggplot(plotdat, aes(x = MDS_1, y = MDS_2, color = Gender)) + geom_point(size = 3) + theme_bw()

MDS plot by leukocyte count

ggplot(plotdat, aes(x = MDS_1, y = MDS_2, color = Leukocyte.count)) + geom_point(size = 3) + theme_bw() + scale_color_viridis_c() + labs(color = "Leukocyte count")

Differential abundance analysis in limma

We will compare expression of each protein between subjects with mutated and unmutated IGHV.

The design matrix is how we specify the experimental design in limma.

Set up design matrix:

mm <- model.matrix(~0 + IGHV.status, data = pdata)

Fit model to each protein:

fit <- lmFit(dat.filtered, mm)

Specify comparison (“contrast”) being tested:

contr <- makeContrasts(IGHV.statusM - IGHV.statusU, levels = colnames(coef(fit)))
contr2 <- contrasts.fit(fit, contrasts = contr)

Empirical Bayes smoothing adjusts the standard error of each protein to be closer to the average standard error, compensating for unusually high or low variability proteins.

Empirical Bayes smoothing:

contr3 <- eBayes(contr2)

Finally, we adjust for testing multiple proteins. With 7043 proteins after filtering, we would expect 352.15 to have a raw p-value less than 0.05, even if there were no true differences.

Defining significant as FDR adjusted P < 0.05 means that we expect 5% of the proteins with an adjusted p-value < 0.05 to be false discoveries.

Display differential abundance results with Benjamini-Hochberg false discovery rate adjusted p-values:

results <- topTable(contr3, sort.by = "P", n = Inf)
head(results)

##             logFC   AveExpr         t      P.Value   adj.P.Val        B
## Q5T7N2 -7.0562842  5.342274 -6.087076 1.718323e-07 0.001210215 7.066122
## P02545 -1.5809825 15.231864 -5.554732 1.126951e-06 0.002031282 5.347222
## Q7Z6L1  0.5239282 12.468625  5.445413 1.652756e-06 0.002031282 4.997042
## Q8NDB2  1.2098372 13.587557  5.425782 1.770141e-06 0.002031282 4.934294
## Q13572  0.4398251 12.845232  5.423379 1.785062e-06 0.002031282 4.926617
## Q9Y6C9 -0.4183323 15.632817 -5.393540 1.981094e-06 0.002031282 4.831330

Results include the following columns:

logFC: log2 fold change for IGHV status M - U
AveExpr: Average protein level across all samples, on transformed scale
t: t statistic, or log fold change divided by its standard error
P.Value: Raw p-value from the test that the logFC differs from 0
adj.P.Val: Benjamini-Hochberg false discovery rate adjusted p-value
B: log odds of differential abundance

How many proteins differ significantly between mutated and unmutated IGHV status subjects:

length(which(results$adj.P.Val < 0.05))

## [1] 178

Merge in annotation information:

results$Protein.Group <- rownames(results)
results <- left_join(results, anno, by = "Protein.Group")
head(results)

##        logFC   AveExpr         t      P.Value   adj.P.Val        B
## 1 -7.0562842  5.342274 -6.087076 1.718323e-07 0.001210215 7.066122
## 2 -1.5809825 15.231864 -5.554732 1.126951e-06 0.002031282 5.347222
## 3  0.5239282 12.468625  5.445413 1.652756e-06 0.002031282 4.997042
## 4  1.2098372 13.587557  5.425782 1.770141e-06 0.002031282 4.934294
## 5  0.4398251 12.845232  5.423379 1.785062e-06 0.002031282 4.926617
## 6 -0.4183323 15.632817 -5.393540 1.981094e-06 0.002031282 4.831330
##   Protein.Group Protein.Ids Protein.Names  Genes
## 1        Q5T7N2      Q5T7N2   LITD1_HUMAN  L1TD1
## 2        P02545      P02545    LMNA_HUMAN   LMNA
## 3        Q7Z6L1      Q7Z6L1   TCPR1_HUMAN TECPR1
## 4        Q8NDB2      Q8NDB2   BANK1_HUMAN  BANK1
## 5        Q13572      Q13572   ITPK1_HUMAN  ITPK1
## 6        Q9Y6C9      Q9Y6C9   MTCH2_HUMAN  MTCH2
##                             First.Protein.Description
## 1 LINE-1 type transposase domain-containing protein 1
## 2                                        Prelamin-A/C
## 3 Tectonin beta-propeller repeat-containing protein 1
## 4        B-cell scaffold protein with ankyrin repeats
## 5                 Inositol-tetrakisphosphate 1-kinase
## 6                     Mitochondrial carrier homolog 2

Volcano plot of results

A volcano plot shows -log10(p-value) plotted against the log fold change.

results$isSig <- ifelse(results$adj.P.Val < 0.05, "Significant", "Not Significant")
top.results <- results[1:15,] # we will only label the most significant proteins
ggplot(results, aes(x = logFC, y = -log10(P.Value), col = isSig)) + geom_point() + 
  scale_color_manual(values = c("grey", "blue")) + 
  geom_text_repel(data = top.results, aes(label = Genes), max.overlaps = Inf, color = "black") + 
  theme_bw() + 
  labs(color = NULL)

Heatmap of top 20 proteins

A heatmap displays the sample-level expression of a set of proteins.

We’ll use the R package ComplexHeatmap (https://jokergoo.github.io/ComplexHeatmap-reference/book/)

toplot <- results$Protein.Group[1:20] 
plotdat <- as.matrix(dat.filtered[toplot,])
new.rownames <- anno$Genes[match(rownames(plotdat), anno$Protein.Group)]

set.seed(99) # Annotation bar colors are random!
col_ha <- HeatmapAnnotation(`IGHV status` = pdata$IGHV.status, annotation_name_side = "left")

Heatmap(plotdat, heatmap_legend_param = list(title = "log2\nNormalized\nAbundance"), show_column_names = FALSE, row_labels = new.rownames, bottom_annotation = col_ha)

Plot using Z scores:

toplot <- results$Protein.Group[1:20] 
plotdat <- as.matrix(dat.filtered[toplot,])
# convert to Z scores
plotdat <- t(apply(plotdat, 1, scale))

new.rownames <- anno$Genes[match(rownames(plotdat), anno$Protein.Group)]

set.seed(99) # Annotation bar colors are random!
col_ha <- HeatmapAnnotation(`IGHV status` = pdata$IGHV.status, annotation_name_side = "left")

Heatmap(plotdat, heatmap_legend_param = list(title = "Z score"), show_column_names = FALSE, row_labels = new.rownames, bottom_annotation = col_ha)

Boxplots of individual proteins

plotdat <- pdata
plotdat$L1TD1 <- as.numeric(dat.filtered["Q5T7N2",])
plotdat$LMNA <- as.numeric(dat.filtered["P02545",])

ggplot(plotdat, aes(x = IGHV.status, y = L1TD1, fill = IGHV.status)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + labs(x = "IGHV status")

ggplot(plotdat, aes(x = IGHV.status, y = LMNA, fill = IGHV.status)) + geom_boxplot() + theme_bw() + theme(legend.position = "none") + labs(x = "IGHV status")

R session information

sessionInfo()

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.20.0 ggrepel_0.9.5         ggplot2_3.5.1        
## [4] limma_3.60.4          dplyr_1.1.4          
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9          utf8_1.2.4          generics_0.1.3     
##  [4] shape_1.4.6.1       digest_0.6.37       magrittr_2.0.3     
##  [7] RColorBrewer_1.1-3  evaluate_0.24.0     iterators_1.0.14   
## [10] circlize_0.4.16     fastmap_1.2.0       foreach_1.5.2      
## [13] doParallel_1.0.17   jsonlite_1.8.8      GlobalOptions_0.1.2
## [16] fansi_1.0.6         viridisLite_0.4.2   scales_1.3.0       
## [19] codetools_0.2-20    jquerylib_0.1.4     cli_3.6.3          
## [22] rlang_1.1.4         crayon_1.5.3        munsell_0.5.1      
## [25] withr_3.0.1         cachem_1.1.0        yaml_2.3.10        
## [28] tools_4.4.1         parallel_4.4.1      colorspace_2.1-1   
## [31] BiocGenerics_0.50.0 GetoptLong_1.0.5    vctrs_0.6.5        
## [34] R6_2.5.1            png_0.1-8           stats4_4.4.1       
## [37] matrixStats_1.3.0   lifecycle_1.0.4     S4Vectors_0.42.1   
## [40] IRanges_2.38.1      clue_0.3-65         cluster_2.1.6      
## [43] pkgconfig_2.0.3     pillar_1.9.0        bslib_0.8.0        
## [46] gtable_0.3.5        glue_1.7.0          Rcpp_1.0.13        
## [49] statmod_1.5.0       highr_0.11          xfun_0.47          
## [52] tibble_3.2.1        tidyselect_1.2.1    rstudioapi_0.16.0  
## [55] knitr_1.48          farver_2.1.2        rjson_0.2.22       
## [58] htmltools_0.5.8.1   labeling_0.4.3      rmarkdown_2.28     
## [61] compiler_4.4.1

☰ Menu

Proteomics Analysis with Limma