Introduction to box and violin plots
Box plots and their cousins, violin plots, are some of the simplest visualizations. These plots are a means of visualizing the distribution of a single continuous variable. Box and violin plots can be split to compare distributions between groups.
Box plot
The rectangular box that gives a box plot its name spans the interquartile range (IQR) of the distribution, with a line marking the median. Whiskers extend from each edge of the box to the most extreme observation within 1.5 times the IQR. Points beyond the whiskers (outliers) may be drawn individually.
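As a rough illustration of these definitions, the sketch below computes the box and whisker positions by hand for a small made-up vector (the values in `x` are invented for this example and are not workshop data). It uses `quantile()` with R's default settings; the exact quartile method used by a given plotting function may differ slightly.

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles define the box; the median is the line inside it
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond the box edges
coef <- 1.5
lower.fence <- q[1] - coef * iqr
upper.fence <- q[3] + coef * iqr

# Whiskers stop at the most extreme observations inside the fences
lower.whisker <- min(x[x >= lower.fence])
upper.whisker <- max(x[x <= upper.fence])

# Anything beyond the fences is drawn as an outlier point
outliers <- x[x < lower.fence | x > upper.fence]

print(c(lower.whisker, q, upper.whisker))
print(outliers)
```

Note that the upper whisker ends at an observed value rather than at the fence itself.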
Violin plot
Instead of quantile markings, a violin plot has a variable width designed to help the reader perceive the density of the data.
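The width of a violin is typically derived from a kernel density estimate. The sketch below uses base R's `density()` on simulated values (the bimodal sample is invented for illustration) to show the quantity that the violin's outline encodes.

```r
set.seed(1)
# Simulated bimodal sample, invented for illustration
x <- c(rnorm(200, mean = 0, sd = 0.5), rnorm(100, mean = 3, sd = 0.5))

# Kernel density estimate: d$x are positions along the value axis,
# d$y are the estimated densities at those positions
d <- density(x)

# A violin mirrors the density around a central axis, so its
# half-width at each position is proportional to d$y
half.width <- d$y / max(d$y)

# The violin is widest where the data are most concentrated
d$x[which.max(half.width)]
```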
Set-up
In this chapter, we will begin with the fundamentals of assembling a box plot, and then layer on additional information using graphical attributes like color and annotation text, and modify plot elements to customize figure appearance and improve readability.
Packages
We will be working extensively with ggplot2 over the course of this workshop. Part of the tidyverse ecosystem, ggplot2 is a comprehensive, flexible framework for producing highly customizable graphics of many types. We will also make use of dplyr and tidyr, two other tidyverse packages, to clean and reshape data before plotting.
library(dplyr)
library(tidyr)
library(magrittr)
library(kableExtra)
library(ggplot2)
library(ggsignif)
library(ggExtra)
Data
We’ll use gene expression data from a single cell experiment for these plots. Typically this sort of data is stored in a complex structure which retains expression data and experiment metadata together in a single object. In this case, we will be using a Seurat object created in one of our single cell RNA-Seq workshops.
experiment.data <- readRDS("scRNA_workshop-05.rds")
This workshop does not aim to give a comprehensive understanding of the Seurat workflow and object; the relevant data for the purposes of box and violin plots are the normalized counts and the experiment metadata.
Normalized expression data is stored in the “data” slot, and can be accessed as follows:
experiment.data@assays$RNA$data
The metadata is stored in another slot within this object.
experiment.data@meta.data %>%
slice(1:50) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE) %>%
scroll_box(height = "200px")
(row names) | orig.ident | nCount_RNA | nFeature_RNA | group | run | percent_MT | S.Score | G2M.Score | Phase | old.ident | RNA_snn_res.0.1 | RNA_snn_res.0.2 | RNA_snn_res.0.3 | RNA_snn_res.0.4 | seurat_clusters | res.0.4_merged | subcluster | subcluster_ScType | subcluster_ScType_filtered | finalcluster |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACCCAAGTTATGGA_A001-C-007 | A001-C-007 | 2043 | 1513 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.5717008 | 0.0242106 | -0.1155330 | S | A001-C-007 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | ENS glia | Unknown | 2 |
AAACGCTTCTCTGCTG_A001-C-007 | A001-C-007 | 1402 | 1038 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 1.1409396 | 0.2751272 | 0.8966284 | G2M | A001-C-007 | 6 | 7 | 10 | 10 | 10 | 10 | 10 | Lymphoid cells | Lymphoid cells | 10 |
AAAGAACGTGCTTATG_A001-C-007 | A001-C-007 | 1634 | 1136 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4191617 | -0.0708949 | -0.0599297 | G1 | A001-C-007 | 1 | 1 | 5 | 5 | 5 | 5 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | 5_1 |
AAAGAACGTTTCGCTC_A001-C-007 | A001-C-007 | 1704 | 1193 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4545455 | 0.0654961 | 1.0354100 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAAGAACTCTGGCTGG_A001-C-007 | A001-C-007 | 949 | 814 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.6937562 | -0.0377334 | 0.1175223 | G2M | A001-C-007 | 4 | 5 | 7 | 8 | 8 | 8 | 8 | Stromal cells | Stromal cells | 8 |
AAAGGATTCATTACCT_A001-C-007 | A001-C-007 | 1194 | 945 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.8071025 | -0.0244222 | 0.1248186 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAAGTGACACGCTTAA_A001-C-007 | A001-C-007 | 2731 | 1805 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.3216583 | -0.0743887 | 0.0186249 | G2M | A001-C-007 | 1 | 1 | 5 | 5 | 5 | 5 | 5_2 | Stromal cells | Unknown | 5_2 |
AACAAAGAGGCTAAAT_A001-C-007 | A001-C-007 | 1607 | 1155 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.8928571 | -0.1237838 | 0.0419251 | G2M | A001-C-007 | 4 | 5 | 7 | 8 | 8 | 8 | 8 | Stromal cells | Stromal cells | 8 |
AACAAAGTCTTGGTCC_A001-C-007 | A001-C-007 | 2538 | 1746 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.1922338 | 0.1545636 | -0.1093509 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACAAGACATAGAGGC_A001-C-007 | A001-C-007 | 1897 | 1445 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.7139215 | 0.6718581 | 0.1696297 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACAAGAGTTTAGACC_A001-C-007 | A001-C-007 | 4235 | 2540 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.2511416 | 0.5910065 | 0.1868783 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACAGGGCAATAGGAT_A001-C-007 | A001-C-007 | 2084 | 1491 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4659832 | 0.0874241 | 0.2063970 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACAGGGCATCTGTTT_A001-C-007 | A001-C-007 | 6668 | 3543 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.1450537 | 0.3685904 | 0.3005523 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACAGGGGTCCCTGAG_A001-C-007 | A001-C-007 | 980 | 719 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.7835455 | -0.0068710 | 0.0132416 | G2M | A001-C-007 | 6 | 7 | 10 | 10 | 10 | 10 | 10 | Lymphoid cells | Lymphoid cells | 10 |
AACAGGGGTGTTACAC_A001-C-007 | A001-C-007 | 1396 | 1045 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.6250000 | 0.0274687 | 0.0553581 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACCAACCATGGGCAA_A001-C-007 | A001-C-007 | 3302 | 2055 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.3233392 | 0.0038243 | 0.0125395 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACCACAGTCAACCAT_A001-C-007 | A001-C-007 | 4054 | 2387 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.2655081 | 0.2146213 | 0.5965836 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACCATGGTACGATTC_A001-C-007 | A001-C-007 | 1802 | 1300 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.2686728 | 0.4236822 | 0.1838327 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACCCAACACAGCATT_A001-C-007 | A001-C-007 | 5411 | 2959 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4321210 | 0.1802900 | 0.4591222 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACCCAAGTCGGTGTC_A001-C-007 | A001-C-007 | 1834 | 1364 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4253057 | -0.0934836 | -0.0775423 | G1 | A001-C-007 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | ENS glia | Unknown | 2 |
AACCTGAAGATTCGAA_A001-C-007 | A001-C-007 | 3848 | 2333 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4527163 | 0.5221604 | 0.6572811 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AACCTGACAGTCAGCC_A001-C-007 | A001-C-007 | 3938 | 2321 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.2229378 | 0.2688958 | 0.8493498 | G2M | A001-C-007 | 1 | 1 | 1 | 7 | 7 | 7 | 7 | Lymphatic endothelial cells | Unknown | 7 |
AACCTGAGTTGGTAGG_A001-C-007 | A001-C-007 | 1318 | 1039 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.9572901 | 0.1681216 | 0.8078883 | G2M | A001-C-007 | 1 | 1 | 1 | 7 | 7 | 7 | 7 | Lymphatic endothelial cells | Unknown | 7 |
AACCTTTTCGCTCATC_A001-C-007 | A001-C-007 | 981 | 776 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.6979063 | 0.0040267 | -0.0166809 | S | A001-C-007 | 1 | 1 | 5 | 5 | 5 | 5 | 5_2 | Stromal cells | Unknown | 5_2 |
AACGAAAGTATGTCTG_A001-C-007 | A001-C-007 | 1172 | 721 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.6639004 | -0.0661428 | -0.0747148 | G1 | A001-C-007 | 1 | 1 | 5 | 5 | 5 | 5 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | 5_1 |
AACGGGAAGAGGGTGG_A001-C-007 | A001-C-007 | 1387 | 1099 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 1.1243851 | 0.5242471 | 0.0287247 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGACAACAACACGTT_A001-C-007 | A001-C-007 | 1137 | 885 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.5132592 | 0.4208438 | 0.3445698 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGACTCCAGCTACAT_A001-C-007 | A001-C-007 | 1379 | 1038 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.3531073 | -0.0201581 | -0.0812678 | G1 | A001-C-007 | 1 | 1 | 5 | 5 | 5 | 5 | 5_2 | Stromal cells | Unknown | 5_2 |
AAGATAGCATTGGGAG_A001-C-007 | A001-C-007 | 1494 | 1114 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.7115136 | -0.0343704 | -0.0056271 | G1 | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGCATCCATCCCACT_A001-C-007 | A001-C-007 | 2231 | 1391 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.5510810 | 0.0518141 | 0.0255571 | S | A001-C-007 | 6 | 7 | 10 | 10 | 10 | 10 | 10 | Lymphoid cells | Lymphoid cells | 10 |
AAGCCATCAAGACCTT_A001-C-007 | A001-C-007 | 1814 | 1265 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4841313 | 0.0483830 | -0.0533808 | S | A001-C-007 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | Tuft cells | Tuft cells | 1 |
AAGCGAGCACGAGAAC_A001-C-007 | A001-C-007 | 1256 | 980 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4604758 | -0.0403860 | -0.0059914 | G1 | A001-C-007 | 6 | 7 | 9 | 10 | 10 | 10 | 10 | Lymphoid cells | Lymphoid cells | 10 |
AAGCGTTCAGCCTATA_A001-C-007 | A001-C-007 | 942 | 761 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.7960199 | -0.0316891 | 0.0644262 | G2M | A001-C-007 | 6 | 7 | 9 | 11 | 11 | 10 | 10 | Lymphoid cells | Lymphoid cells | 10 |
AAGGAATAGACTCCGC_A001-C-007 | A001-C-007 | 3295 | 2008 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4461630 | -0.0135791 | 0.7699682 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGGAATTCGTTCATT_A001-C-007 | A001-C-007 | 2668 | 1821 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4730713 | 0.2752857 | 0.3133345 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGTACCTCGCCACTT_A001-C-007 | A001-C-007 | 2246 | 1549 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4778454 | 0.2451425 | 0.6566478 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGTCGTGTCGCAGTC_A001-C-007 | A001-C-007 | 1591 | 1205 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.1820388 | -0.0900911 | -0.0757283 | G1 | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGTCGTGTGCGAACA_A001-C-007 | A001-C-007 | 2662 | 1773 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.4330567 | -0.0938635 | -0.1193087 | G1 | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGTTCGAGAACTGAT_A001-C-007 | A001-C-007 | 1145 | 894 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.1700680 | -0.0307752 | -0.0470116 | G1 | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGTTCGCAGAAGTTA_A001-C-007 | A001-C-007 | 1001 | 772 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 2.6239067 | 0.0998638 | 0.0772789 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AAGTTCGGTACCTATG_A001-C-007 | A001-C-007 | 1514 | 1149 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.5148005 | 0.0105391 | -0.0305894 | S | A001-C-007 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | Tuft cells | Tuft cells | 1 |
AAGTTCGTCCTCCACA_A001-C-007 | A001-C-007 | 6237 | 3355 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.1555452 | 0.1787459 | 0.4338195 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AATAGAGTCGCGTGCA_A001-C-007 | A001-C-007 | 2041 | 1402 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.5725191 | 0.0017802 | 0.0003999 | S | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AATCACGCAGCAATTC_A001-C-007 | A001-C-007 | 1413 | 1111 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.8270159 | 0.2103020 | 0.6019287 | G2M | A001-C-007 | 1 | 1 | 1 | 7 | 7 | 7 | 7 | Lymphatic endothelial cells | Unknown | 7 |
AATGAAGCAGCCTATA_A001-C-007 | A001-C-007 | 2654 | 1780 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.3289474 | -0.0022276 | -0.0878330 | G1 | A001-C-007 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | ENS glia | Unknown | 2 |
AATGAAGTCAGCGTCG_A001-C-007 | A001-C-007 | 1685 | 1272 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 1.0398614 | 0.1172836 | 0.0612130 | S | A001-C-007 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | Tuft cells | Tuft cells | 1 |
AATGACCCAAGCGATG_A001-C-007 | A001-C-007 | 1871 | 1285 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.2601457 | -0.1164721 | -0.0523448 | G1 | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AATGACCTCGTAGCCG_A001-C-007 | A001-C-007 | 1979 | 1313 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.5920079 | 0.0422544 | 0.5874028 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AATGCCACAGGTGACA_A001-C-007 | A001-C-007 | 2343 | 1535 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 0.8720930 | -0.1055068 | 0.0278103 | G2M | A001-C-007 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | Intestinal epithelial cells | Unknown | 3 |
AATGGAACAAGGGCAT_A001-C-007 | A001-C-007 | 1299 | 1015 | Colorectal Cancer | A00509:126:HTLFWDMXX:1 | 1.0534236 | 0.3728563 | 0.4312851 | G2M | A001-C-007 | 1 | 1 | 1 | 7 | 7 | 7 | 7 | Lymphatic endothelial cells | Unknown | 7 |
Let’s create a smaller data frame containing the expression information for a select number of markers and a few key metadata values for each cell.
markers <- c("SATB2", "NXPE1", "PDE3A", "CFTR", "HNF1A-AS1", "ADAMTSL1", "AC073050.1", "PID1", "NEO1", "XIST", "NR5A2", "AC019330.1", "CNTN4", "CNTN3", "SPON1", "LEFTY1")
markers <- markers[markers %in% rownames(experiment.data@assays$RNA$data)]
sc.data <- as.data.frame(t(as.matrix(experiment.data@assays$RNA$data[markers,])))
sc.data$cell <- rownames(sc.data)
metadata <- experiment.data@meta.data
metadata$cell <- rownames(metadata)
metadata <- select(metadata, cell, subcluster, subcluster_ScType_filtered)
sc.data <- inner_join(metadata, sc.data, by = "cell")
slice(sc.data, 1:50) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE) %>%
scroll_box(height = "200px")
cell | subcluster | subcluster_ScType_filtered | SATB2 | NXPE1 | PDE3A | CFTR | HNF1A-AS1 | ADAMTSL1 | PID1 | NEO1 | XIST | NR5A2 | CNTN4 | CNTN3 | SPON1 | LEFTY1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | 2.3785758 | 2.378576 | 0.000000 | 1.774064 | 1.774064 | 0.000000 | 0.000000 | 1.7740643 | 0 | 0.000000 | 0.0000000 | 1.774064 | 0.000000 | 0 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | 0.0000000 | 0.000000 | 0.000000 | 2.095889 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | 2.9632015 | 3.237886 | 1.962901 | 2.583235 | 0.000000 | 1.962901 | 2.583235 | 0.0000000 | 0 | 1.962901 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAAGAACTCTGGCTGG_A001-C-007 | 8 | Stromal cells | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAAGGATTCATTACCT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAAGTGACACGCTTAA_A001-C-007 | 5_2 | Unknown | 2.1190618 | 3.410926 | 2.119062 | 2.750257 | 3.134188 | 1.539372 | 0.000000 | 1.5393721 | 0 | 2.483655 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAAAGAGGCTAAAT_A001-C-007 | 8 | Stromal cells | 0.0000000 | 0.000000 | 1.977239 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAAAGTCTTGGTCC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 1.597388 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAAGACATAGAGGC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAAGAGTTTAGACC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 3.102408 | 2.549943 | 0.000000 | 0.000000 | 1.7444145 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAGGGCAATAGGAT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 1.757593 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAGGGCATCTGTTT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.139890 | 1.704584 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAGGGGTCCCTGAG_A001-C-007 | 10 | Lymphoid cells | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACAGGGGTGTTACAC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCAACCATGGGCAA_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 1.393386 | 0.000000 | 0.000000 | 1.3933861 | 0 | 1.393386 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCACAGTCAACCAT_A001-C-007 | 3 | Unknown | 1.2432030 | 0.000000 | 0.000000 | 2.385712 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 1.2432030 | 0.000000 | 1.243203 | 0 |
AACCATGGTACGATTC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.493104 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCCAACACAGCATT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 1.046648 | 2.326344 | 1.046648 | 1.046648 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCCAAGTCGGTGTC_A001-C-007 | 2 | Unknown | 3.3415469 | 0.000000 | 1.864477 | 2.476969 | 1.864477 | 2.476969 | 2.476969 | 2.4769690 | 0 | 1.864477 | 0.0000000 | 1.864477 | 0.000000 | 0 |
AACCTGAAGATTCGAA_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 1.824147 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCTGACAGTCAGCC_A001-C-007 | 7 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 1.804794 | 0.000000 | 1.263946 | 2.787241 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCTGAGTTGGTAGG_A001-C-007 | 7 | Unknown | 2.1502789 | 2.150279 | 2.150279 | 0.000000 | 2.783436 | 0.000000 | 2.150279 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACCTTTTCGCTCATC_A001-C-007 | 5_2 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 3.062800 | 2.415349 | 0.000000 | 2.415349 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACGAAAGTATGTCTG_A001-C-007 | 5_1 | Intestinal epithelial cells | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AACGGGAAGAGGGTGG_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 3.790055 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGACAACAACACGTT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.281880 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGACTCCAGCTACAT_A001-C-007 | 5_2 | Unknown | 2.1104110 | 2.741051 | 3.124780 | 2.110411 | 0.000000 | 0.000000 | 2.110411 | 0.0000000 | 0 | 2.110411 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGATAGCATTGGGAG_A001-C-007 | 3 | Unknown | 2.0403681 | 0.000000 | 0.000000 | 2.666317 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGCATCCATCCCACT_A001-C-007 | 10 | Lymphoid cells | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGCCATCAAGACCTT_A001-C-007 | 1 | Tuft cells | 3.5285955 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGCGAGCACGAGAAC_A001-C-007 | 10 | Lymphoid cells | 2.1929693 | 0.000000 | 0.000000 | 0.000000 | 2.192969 | 0.000000 | 2.192969 | 2.1929693 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGCGTTCAGCCTATA_A001-C-007 | 10 | Lymphoid cells | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGGAATAGACTCCGC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.313001 | 2.313001 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGGAATTCGTTCATT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTACCTCGCCACTT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 1.696049 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTCGTGTCGCAGTC_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.988510 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTCGTGTGCGAACA_A001-C-007 | 3 | Unknown | 1.5595277 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.5595277 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTTCGAGAACTGAT_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.275586 | 0.000000 | 2.275586 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTTCGCAGAAGTTA_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTTCGGTACCTATG_A001-C-007 | 1 | Tuft cells | 0.0000000 | 0.000000 | 0.000000 | 2.028808 | 0.000000 | 2.028808 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AAGTTCGTCCTCCACA_A001-C-007 | 3 | Unknown | 0.9567933 | 0.000000 | 0.000000 | 1.759581 | 2.199076 | 0.000000 | 0.000000 | 0.9567933 | 0 | 0.000000 | 0.9567933 | 0.000000 | 0.000000 | 0 |
AATAGAGTCGCGTGCA_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 1.774878 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATCACGCAGCAATTC_A001-C-007 | 7 | Unknown | 2.0890380 | 2.089038 | 0.000000 | 2.718283 | 2.089038 | 2.089038 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATGAAGCAGCCTATA_A001-C-007 | 2 | Unknown | 0.0000000 | 2.144268 | 0.000000 | 2.144268 | 3.161560 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATGAAGTCAGCGTCG_A001-C-007 | 1 | Tuft cells | 1.9365404 | 2.554855 | 0.000000 | 1.936540 | 2.554855 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATGACCCAAGCGATG_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.835224 | 1.847625 | 0.000000 | 0.000000 | 0.0000000 | 0 | 1.847625 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATGACCTCGTAGCCG_A001-C-007 | 3 | Unknown | 0.0000000 | 0.000000 | 0.000000 | 2.407496 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATGCCACAGGTGACA_A001-C-007 | 3 | Unknown | 0.0000000 | 1.661657 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 1.661657 | 0.0000000 | 0.000000 | 0.000000 | 0 |
AATGGAACAAGGGCAT_A001-C-007 | 7 | Unknown | 0.0000000 | 2.163120 | 0.000000 | 2.797065 | 2.163120 | 2.163120 | 0.000000 | 0.0000000 | 0 | 0.000000 | 0.0000000 | 2.797065 | 0.000000 | 0 |
rm(experiment.data, metadata, markers)
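For readers without access to the workshop object, the same reshaping steps can be sketched on a toy matrix. Everything below (`toy.expr`, `toy.meta`, the gene and cell names) is invented for illustration; only the sequence of operations mirrors the code above.

```r
library(dplyr)

# Toy genes-by-cells matrix, laid out like the normalized expression data
toy.expr <- matrix(c(0, 1.2, 2.4,
                     3.1, 0, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"),
                                   c("cell1", "cell2", "cell3")))

# Toy per-cell metadata with cell names as row names
toy.meta <- data.frame(subcluster = c("1", "2", "2"),
                       row.names = c("cell1", "cell2", "cell3"))

# Transpose so that cells are rows and genes are columns
toy.data <- as.data.frame(t(toy.expr))
toy.data$cell <- rownames(toy.data)
toy.meta$cell <- rownames(toy.meta)

# Join on the shared cell identifier
toy.joined <- inner_join(toy.meta, toy.data, by = "cell")
toy.joined
```

Because `inner_join()` keeps only cells present in both tables, comparing `nrow()` before and after the join is a quick way to confirm that no cells were dropped.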
Basic plot
A ggplot2 object is built in layers, with each layer inheriting the data and aesthetic mappings of the parent plot unless they are overridden. The parent plot is created by the ggplot() call, and subsequent layers are added with “geoms.” Here, we apply geom_boxplot(), which creates box plots.
The code below generates the most basic possible box plot with our data: distribution of normalized expression values for a single gene across all cells sampled.
ggplot(data = sc.data, mapping = aes(y = SATB2)) +
geom_boxplot()
The box plot geom accepts a number of arguments that allow you to tune its appearance. Take a look at the help page to get an idea of the options.
?geom_boxplot
The “notch” argument controls whether a notch will be drawn at the median. Two boxes with non-overlapping notches likely represent distributions with differing medians.
ggplot(data = sc.data, mapping = aes(y = SATB2)) +
geom_boxplot(notch = TRUE)
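According to the `geom_boxplot()` help page, notches extend 1.58 * IQR / sqrt(n) on either side of the median, which gives a roughly 95% confidence interval for comparing medians. A minimal sketch of that calculation, using a hypothetical helper `notch_limits()` and made-up values:

```r
# Hypothetical helper, not part of ggplot2: notch limits for one group
notch_limits <- function(x) {
  x <- x[!is.na(x)]
  half.width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half.width, upper = median(x) + half.width)
}

# Made-up values for two groups
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)

notch_limits(a)
notch_limits(b)

# Non-overlapping notches suggest the medians differ
notch_limits(a)["upper"] < notch_limits(b)["lower"]
```

The notch narrows as the number of observations grows, so large groups can show clearly separated notches even when the boxes themselves overlap.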
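The “outlier” arguments (outlier.colour, outlier.shape, outlier.size, and so on) control the appearance of any points plotted outside of the box and whiskers, while the “outliers” argument controls whether those points are drawn at all. Its default value, TRUE, is written out explicitly here.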
ggplot(data = sc.data, mapping = aes(y = CNTN4)) +
geom_boxplot(notch = TRUE,
outliers = TRUE)
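A sketch of the two kinds of outlier control, on simulated data (the `toy` data frame is invented for illustration). The `outlier.*` arguments restyle the points, while `outliers = FALSE` drops them from the plot; the latter assumes ggplot2 3.5.0 or later.

```r
library(ggplot2)

set.seed(1)
# Simulated right-skewed values, invented for illustration
toy <- data.frame(value = rexp(200))

# Restyle the outlier points: hollow circles, semi-transparent red
p.styled <- ggplot(toy, aes(y = value)) +
  geom_boxplot(outlier.colour = "red",
               outlier.shape = 1,
               outlier.alpha = 0.5)

# Discard outliers; the axis limits then adapt to the box and whiskers
p.hidden <- ggplot(toy, aes(y = value)) +
  geom_boxplot(outliers = FALSE)

p.styled
p.hidden
```

To hide the points while keeping the full data range on the axis, the help page suggests `outlier.shape = NA` instead.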
The computations underlying the box plot visualization are performed by stat_boxplot(). By default, each whisker extends to the most extreme point within 1.5 times the interquartile range of the box edge; the “coef” argument changes this multiplier.
ggplot(data = sc.data, mapping = aes(y = CNTN4)) +
geom_boxplot(notch = TRUE,
coef = 2)
Comparing distributions
Our basic box plot is not as informative (or decorative) as it could be. Let’s add information from the metadata.
These cells have been grouped into clusters. The box plot can be split across the groups, allowing us to compare the distribution of expression values within each cell population.
Providing a categorical variable to the x axis produces a series of box plots corresponding to the levels of the variable.
ggplot(data = sc.data, mapping = aes(x = subcluster, y = SATB2)) +
geom_boxplot()
Horizontal box plots
Sometimes it may be advantageous to flip the coordinate system on its side so that the distribution stretches from left to right instead of vertically. This is accomplished with coord_flip().
ggplot(data = sc.data, mapping = aes(x = subcluster, y = SATB2)) +
geom_boxplot() +
coord_flip()
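In ggplot2 3.3.0 and later, an alternative is to swap the aesthetics so that the continuous variable is mapped to x; `geom_boxplot()` then draws horizontal boxes without a coordinate flip. A sketch on simulated data (the `toy` data frame is invented for illustration):

```r
library(ggplot2)

set.seed(1)
# Simulated values for three groups, invented for illustration
toy <- data.frame(group = rep(c("a", "b", "c"), each = 50),
                  value = rnorm(150))

# Continuous variable on x, categorical on y: boxes run left to right
p.horizontal <- ggplot(toy, aes(x = value, y = group)) +
  geom_boxplot()

p.horizontal
```

With this approach the axis names in later calls (labels, theme elements) refer to the axes as drawn, which can be easier to reason about than a flipped coordinate system.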
Communicate additional information using graphical attributes
While the coordinate space of a box plot communicates the values of up to two variables (one continuous and the optional second categorical), other visual qualities (aesthetics in ggplot) can be used to encode additional information, both categorical and continuous.
Box plots have the following mappable aesthetics:
- fill
- alpha
- color
- shape
- size
- linetype
- linewidth
You can assign variables to any number of these aesthetics. Some caveats apply.
Shape and size apply to the outlier points. Shape is only suitable for categorical values, and cannot be used on very densely plotted points, where distinguishing shape becomes difficult. Meanwhile, size should be used with caution, as it implicitly communicates a sense of quantitative difference that is not appropriate for some qualitative measures (e.g. case vs control).
Color, linetype, and linewidth apply to the whiskers and box outlines. Line-based attributes can be difficult to distinguish on box plots, the interpretation of which relies heavily on the area of the boxes.
Alpha, which controls opacity, and fill apply to the box area. While alpha can be a difficult scale in which to visualize fine gradations of a continuous variable, fill is the most-used aesthetic for box and violin plots.
Fill and stroke are only useful with a subset of available point shapes (shapes 21 through 25, which have both an outline and a fill); see the ggplot2 documentation on point shapes to understand why.
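An aesthetic can either be mapped to a variable inside `aes()` or set to a fixed value outside it. The sketch below, on simulated data invented for illustration, maps fill to the grouping variable while setting a constant alpha, outline color, and outlier shape:

```r
library(ggplot2)

set.seed(1)
# Simulated values for two groups, invented for illustration
toy <- data.frame(group = rep(c("case", "control"), each = 50),
                  value = c(rnorm(50, mean = 1), rnorm(50, mean = 0)))

p.aes <- ggplot(toy, aes(x = group, y = value, fill = group)) +  # mapped
  geom_boxplot(alpha = 0.6,          # set: same opacity for every box
               color = "gray30",     # set: outline and whisker color
               outlier.shape = 21)   # set: circle with outline and fill

p.aes
```

Only mapped aesthetics produce a legend; values set outside `aes()` apply uniformly and carry no information about the data.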
Color fill by x-axis category
You will often see filled box plots with fill colored using the same variable as is used for the x axis. This can be useful across multi-panel visualizations to tie together the same samples visualized on different axes (e.g. a dimensionality reduction biplot and a box plot may share a color scheme to aid comprehension).
ggplot(data = sc.data, mapping = aes(x = subcluster, y = SATB2, fill = subcluster)) +
geom_boxplot()
The default colors are notoriously difficult to distinguish for colorblind users. Many libraries offer palettes to extend the default color options, or you can set palettes manually.
Now that we have code for a working box plot, we can store our plot object and add to it as we go.
p <- ggplot(data = sc.data, mapping = aes(x = subcluster, y = SATB2, fill = subcluster)) + geom_boxplot()
Built-in palettes: viridis
Viridis is one of many color palette resources. To access the viridis palettes seamlessly within ggplot2, we can call the scale_fill_viridis_*() family of functions: d for discrete data, b for binned data, and c for continuous data.
p + scale_fill_viridis_d()
Custom color palettes
The simplest way to set custom colors for a ggplot object is with scale_fill_manual(). These colors are based on a selection of web accessible colors generated by palette.es. While the 5-color version generated by the site is relatively accessible, this expanded color palette is not.
custom.palette <- c("#0074e6", "#eec1f1", "#b35e7e", "#534623", "#faa300",
"#194d80", "white", "#ff0062", "black", "#ffd380",
"#80c0ff", "#f566ff", "#663648", "#b39b59", "#ffedcc",
"#ffa6c8")
p + scale_fill_manual(values = custom.palette)
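If the colors should stay attached to particular groups regardless of their order, `scale_fill_manual()` also accepts a named vector, with names matching the levels of the fill variable. A sketch with invented group names and colors:

```r
library(ggplot2)

set.seed(1)
# Simulated values for three groups, invented for illustration
toy <- data.frame(group = rep(c("a", "b", "c"), each = 30),
                  value = rnorm(90))

# Names tie each color to a level; order in the vector does not matter
toy.palette <- c(c = "#faa300", a = "#0074e6", b = "#b35e7e")

# A quick check that every group has a color
stopifnot(all(unique(toy$group) %in% names(toy.palette)))

p.manual <- ggplot(toy, aes(x = group, y = value, fill = group)) +
  geom_boxplot() +
  scale_fill_manual(values = toy.palette)

p.manual
```

With an unnamed vector, colors are assigned to levels in order, so the palette needs at least as many entries as there are groups.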
Color fill independent of x-axis
When no value is supplied to x, the fill will be used to split the box plot.
ggplot(data = sc.data, mapping = aes(y = SATB2, fill = subcluster_ScType_filtered)) +
geom_boxplot() +
scale_fill_viridis_d(option = "turbo", name = "Putative cell type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Coloring by putative cell type reveals that cell type assignment was inconclusive for a large number of cells. Placing this box toward the center of our plot and filling it with a bright color distracts from the differences in expression we observe between cells that were successfully assigned. It would be better to treat the unassigned cells as missing values and give them a muted color.
To do this, we recode “Unknown” as NA, convert the character vector “subcluster_ScType_filtered” in our data frame to a factor, and set the levels to control the ordering.
sc.data$putative <- factor(gsub("Unknown", NA, sc.data$subcluster_ScType_filtered),
levels = c("Vascular endothelial cells",
"Lymphoid cells",
"Stromal cells",
"Tuft cells",
"Lymphatic endothelial cells",
"Intestinal epithelial cells"))
ggplot(data = sc.data, mapping = aes(y = SATB2, fill = putative)) +
geom_boxplot() +
scale_fill_viridis_d(option = "turbo",
name = "Putative cell type",
begin = 0.1,
end = 0.9,
na.value = "gray90")
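The recoding step can be checked on a toy vector. This sketch uses invented labels and an explicit assignment in place of `gsub()`; the point is only to show how the `levels` argument fixes the order and how `NA` stays outside the levels.

```r
# Toy labels, invented for illustration
x <- c("Unknown", "Tuft cells", "Stromal cells", "Unknown", "Tuft cells")

# Recode the placeholder as a true missing value
x[x == "Unknown"] <- NA

# Levels are listed in the order they should appear in the plot and legend
f <- factor(x, levels = c("Stromal cells", "Tuft cells"))

levels(f)
table(f, useNA = "ifany")
```

Any label that is not listed in `levels` also becomes `NA`, so it is worth confirming that the count of missing values matches the number of "Unknown" entries.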
The fill scale function also offers the option of setting a shade for missing data. Here we have selected a low-contrast light gray color.
With this gray color used for NA, a white background may be preferable. All graphical elements of a ggplot object are modifiable, and a number of pre-built themes exist. The default theme is theme_gray().
ggplot(data = sc.data, mapping = aes(y = SATB2, fill = putative)) +
geom_boxplot() +
scale_fill_viridis_d(option = "turbo",
name = "Putative cell type",
begin = 0.1,
end = 0.9,
na.value = "gray90") +
theme_bw()
Reshaping data for improved visualizations
If more than one marker is of interest, we can display these on a shared set of axes by reshaping the data frame to put the expression values in the same column. Associated gene symbols are moved to another column, making the data frame much longer and narrower.
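On a toy data frame the reshaping looks like this. The column names `geneA` and `geneB` are invented for illustration; `pivot_longer()` comes from tidyr.

```r
library(tidyr)

# Toy wide data: one row per cell, one column per gene
toy.wide <- data.frame(cell = c("cell1", "cell2"),
                       geneA = c(0, 1.2),
                       geneB = c(3.1, 0))

# One row per cell-gene pair after reshaping
toy.long <- pivot_longer(toy.wide,
                         cols = geneA:geneB,
                         names_to = "gene",
                         values_to = "normalized.counts")
toy.long
```

Two cells and two genes give four rows: the long table has one row for every combination of cell and gene.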
pivot_longer(sc.data, cols = SATB2:LEFTY1, names_to = "gene", values_to = "normalized.counts") %>%
slice(1:50) %>%
kable() %>%
kable_styling("striped", fixed_thead = TRUE) %>%
scroll_box(height = "200px")
cell | subcluster | subcluster_ScType_filtered | putative | gene | normalized.counts |
---|---|---|---|---|---|
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | SATB2 | 2.378576 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | NXPE1 | 2.378576 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | PDE3A | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | CFTR | 1.774064 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | HNF1A-AS1 | 1.774064 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | ADAMTSL1 | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | PID1 | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | NEO1 | 1.774064 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | XIST | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | NR5A2 | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | CNTN4 | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | CNTN3 | 1.774064 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | SPON1 | 0.000000 |
AAACCCAAGTTATGGA_A001-C-007 | 2 | Unknown | NA | LEFTY1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | SATB2 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | NXPE1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | PDE3A | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | CFTR | 2.095889 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | HNF1A-AS1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | ADAMTSL1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | PID1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | NEO1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | XIST | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | NR5A2 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | CNTN4 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | CNTN3 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | SPON1 | 0.000000 |
AAACGCTTCTCTGCTG_A001-C-007 | 10 | Lymphoid cells | Lymphoid cells | LEFTY1 | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | SATB2 | 2.963201 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | NXPE1 | 3.237886 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | PDE3A | 1.962901 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | CFTR | 2.583235 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | HNF1A-AS1 | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | ADAMTSL1 | 1.962901 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | PID1 | 2.583235 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | NEO1 | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | XIST | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | NR5A2 | 1.962901 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | CNTN4 | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | CNTN3 | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | SPON1 | 0.000000 |
AAAGAACGTGCTTATG_A001-C-007 | 5_1 | Intestinal epithelial cells | Intestinal epithelial cells | LEFTY1 | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | SATB2 | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | NXPE1 | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | PDE3A | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | CFTR | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | HNF1A-AS1 | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | ADAMTSL1 | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | PID1 | 0.000000 |
AAAGAACGTTTCGCTC_A001-C-007 | 3 | Unknown | NA | NEO1 | 0.000000 |
sc.pivot <- pivot_longer(sc.data, cols = SATB2:LEFTY1, names_to = "gene", values_to = "normalized.counts")
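The effect of the reshape is easier to see on a very small table. The following is a minimal sketch with an invented two-cell, two-gene data frame (the names `wide`, `long`, GENE_A, and GENE_B are placeholders, not part of the workshop data).

```r
library(tidyr)

# Hypothetical wide data: one row per cell, one column per gene
wide <- data.frame(cell = c("cell_1", "cell_2"),
                   GENE_A = c(1.2, 0),
                   GENE_B = c(0, 2.5))

# Stack the gene columns: gene symbols go to "gene",
# expression values go to "normalized.counts"
long <- pivot_longer(wide,
                     cols = GENE_A:GENE_B,
                     names_to = "gene",
                     values_to = "normalized.counts")
long
```

Each cell now occupies one row per gene, so two cells and two genes give four rows.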
Reshaping the data frame this way allows us to use “gene” to set fill color, while the y-axis is relabeled “normalized.counts.”
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = subcluster, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
theme_bw()
Notice that the x-axis and fill are assigned independently, resulting in three boxes plotted for each sub-cluster.
Displaying multiple categorical variables
If we want to examine the relationship between our three markers of interest from the previous figure split over each putative cell type, we can accomplish this in one of three ways:
- use putative cell type as x and gene as fill
- use gene as fill and facet by putative cell type
- create a facet grid using gene ~ putative cell type
Set fill and x independently
As in the previous figure, we can simply assign the fill and x aesthetics to different categorical variables.
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
theme_bw()
Due to the long cell type names, the x-axis is unreadable. Let’s change
the angle of the text in order to prevent overlapping. The theme()
function in ggplot2 allows access to the plot elements individually,
allowing us to fine-tune the axis text appearance (among many other
things!).
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can also suppress the unnecessary x-axis and legend titles.
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(),
legend.title = element_blank())
The labs() function lets us specify labels, captions and titles for
the plot.
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(y = "Normalized counts") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(),
legend.title = element_blank())
To change the axis tick labels, use scale_x_discrete().
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(y = "Normalized counts") +
theme_bw() +
scale_x_discrete(labels = c(levels(sc.pivot$putative), "Unassigned")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(),
legend.title = element_blank())
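Labels passed as a plain vector are matched to the axis by position. As a hedged alternative, scale_x_discrete() also accepts a named vector, so that each label is matched to its level by name. The toy data frame and label text below are invented for illustration.

```r
library(ggplot2)

# Hypothetical data with two short group codes
toy <- data.frame(group = factor(c("a", "a", "a", "b", "b", "b")),
                  value = c(1, 2, 3, 2, 4, 6))

# Names of the vector are the factor levels; values are the displayed labels
p.labels <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_boxplot() +
  scale_x_discrete(labels = c(a = "Group A", b = "Group B")) +
  theme_bw()
p.labels
```

Matching by name keeps the labels correct if the level order changes later.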
Facets
Faceting creates multiple sub-plots, which allow us to visualize more levels of categorical variation without adding additional colors.
Fill by cell type and facet by gene
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(y = normalized.counts, fill = putative)) +
geom_boxplot() +
facet_wrap(~gene) +
scale_fill_viridis_d(option = "turbo",
name = "Putative cell type",
begin = 0.1,
end = 0.9,
na.value = "gray90") +
theme_bw() +
theme(legend.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title = element_blank())
When crafting your box plot, try to make the most important comparison easy for viewers. In this case, filling by gene and faceting by cell type may be more useful.
Fill by gene and facet by cell type
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(y = normalized.counts, fill = gene)) +
geom_boxplot() +
facet_wrap(~putative) +
scale_fill_viridis_d() +
theme_linedraw() +
theme(legend.title = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title = element_blank())
Create a gene by cell type grid
If we want to examine the relationship between sample identity, cell type, and gene expression, we can create a grid that allows us to view all three simultaneously.
sc.pivot$sample <- sapply(strsplit(sc.pivot$cell, split = "_", fixed = TRUE), "[[", 2L)
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(y = normalized.counts, fill = sample)) +
geom_boxplot() +
facet_grid(gene~putative) +
scale_fill_viridis_d(option = "mako") +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(y = "Normalized counts")
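The sample extraction step used above can be checked in isolation. This sketch uses two cell names of the same form as those in the table earlier (barcode, underscore, sample identifier); the vector name `cells` is a placeholder.

```r
# Cell names follow the pattern <barcode>_<sample>
cells <- c("AAACCCAAGTTATGGA_A001-C-007", "AAACGCTTCTCTGCTG_A001-C-007")

# Split each name on the literal underscore, then keep the second piece
pieces <- strsplit(cells, split = "_", fixed = TRUE)
sample.id <- sapply(pieces, "[[", 2L)
sample.id
# "A001-C-007" "A001-C-007"
```

This approach assumes exactly one underscore per cell name; names containing additional underscores would need a different split.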
Subset the data
If some comparisons are irrelevant, you can subset the data for the sake of clarity.
filter(sc.pivot,
gene %in% c("PDE3A", "CFTR", "ADAMTSL1"),
putative %in% c("Intestinal epithelial cells", "Lymphatic endothelial cells", "Tuft cells")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_boxplot() +
scale_fill_viridis_d() +
labs(y = "Normalized counts") +
theme_linedraw() +
theme(axis.title.x = element_blank(),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
Annotations
Box plots are, by their nature, fairly simple. Adding annotations to the box plot to highlight comparisons of interest or communicate additional information can make them more informative without becoming overly complex.
One common annotation for box plots is the significance bracket. These are visual indicators of statistically significant differences between groups. They may be marked with the p-value for the test, or with asterisks.
It is possible to add significance annotations manually, but we will be using the ggsignif package.
ggplot(data = sc.data, mapping = aes(x = putative, y = PDE3A, fill = putative)) +
geom_boxplot() +
scale_fill_viridis_d(option = "turbo",
begin = 0.1,
end = 0.9,
na.value = "gray90") +
guides(fill = "none") +
geom_signif(comparisons = list(c("Tuft cells", "Intestinal epithelial cells"))) +
theme_bw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
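The bracket above relies on geom_signif() defaults, which run a Wilcoxon rank-sum test (wilcox.test) and print the p-value. As a rough sketch on simulated data (not the workshop data), the `test` and `map_signif_level` arguments can be used to switch to a t-test and display asterisks instead.

```r
library(ggplot2)
library(ggsignif)

# Simulated values for two hypothetical groups
set.seed(1)
toy <- data.frame(group = rep(c("A", "B"), each = 20),
                  value = c(rnorm(20, mean = 0), rnorm(20, mean = 2)))

# test = "t.test" replaces the default Wilcoxon test;
# map_signif_level = TRUE converts the p-value to asterisks
p.signif <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_boxplot() +
  geom_signif(comparisons = list(c("A", "B")),
              test = "t.test",
              map_signif_level = TRUE) +
  theme_bw()
p.signif
```

Whichever test is chosen, it is worth stating it in the figure legend so readers know what the bracket represents.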
The significance geom provided by ggsignif is compatible with coord_flip() and faceting, but it requires manual adjustment when the categories on the x-axis are split by the fill aesthetic and position_dodge() comes into play.
anno <- t.test(sc.data$PDE3A[sc.data$subcluster == "6"],
sc.data$PDE3A[sc.data$subcluster == "9"])
filter(sc.data, putative %in% c("Tuft cells", "Lymphatic endothelial cells", "Intestinal epithelial cells")) %>%
ggplot(mapping = (aes(x = putative, y = PDE3A, fill = subcluster))) +
geom_boxplot() +
geom_signif(annotation = formatC(anno$p.value, digits = 1),
y_position = 5.75,
xmin = 3,
xmax = 3.25,
tip_length = c(0.05,0.2)) +
scale_fill_viridis_d(option = "plasma") +
theme_minimal() +
theme(axis.title.x = element_blank(),
legend.title = element_blank())
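Choosing xmin and xmax by hand is easier with an estimate of where the dodged boxes are centered. The helper below is a hypothetical sketch, not part of ggplot2 or ggsignif. It assumes the boxes at one x position share ggplot2's default total box plot width of 0.75 and that every group is present at that position.

```r
# Hypothetical helper: centers of n dodged boxes around axis position x,
# assuming the boxes share a total width of 0.75
dodge_centers <- function(x, n, width = 0.75) {
  slot <- width / n                          # width allotted to each box
  x - width / 2 + (seq_len(n) - 0.5) * slot  # midpoint of each slot
}

# Three boxes drawn at the third position on the x-axis
centers <- dodge_centers(x = 3, n = 3)
centers
# 2.75 3.00 3.25
```

These values are consistent with the xmin = 3 and xmax = 3.25 used above if three sub-clusters are drawn at the third position. The result is only a starting point and may need adjusting by eye.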
Marginal plots
Box plots can also be useful in the context of annotating other types of
plots. For example, in a densely plotted scatter plot, you may want
marginal box plots to help readers accurately perceive the distribution
of values on one or both axes. This can help you combat misleading
over-plotting. The ggMarginal() function from ggExtra provides access
to several types of marginal plots, including a box plot.
p <- filter(sc.data, putative %in% c("Intestinal epithelial cells", "Tuft cells", "Stromal cells")) %>%
ggplot(mapping = aes(x = ADAMTSL1, y = CFTR, color = putative, fill = putative)) +
geom_point() +
scale_color_manual(values = custom.palette[c(1,2,5)]) +
coord_fixed() +
theme_classic()
ggMarginal(p + theme(legend.position = "bottom", legend.title = element_blank()), type = "boxplot", groupFill = TRUE)
Violin plots
Violin plots are very similar to box plots. In some cases, you can
simply substitute geom_violin() in place of geom_boxplot(). However,
be aware that violin plots require a mapping for the x aesthetic.
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(x = putative, y = normalized.counts, fill = gene)) +
geom_violin() +
scale_fill_viridis_d() +
labs(y = "Normalized counts") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(),
legend.title = element_blank())
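When there is no grouping variable to place on the x-axis, one common workaround for the x requirement is to map x to a constant string. The sketch below uses simulated values rather than the workshop data, and the object names are placeholders.

```r
library(ggplot2)

# Simulated values for a single, ungrouped distribution
set.seed(42)
toy <- data.frame(value = rnorm(100))

# Mapping x to an empty string satisfies the x requirement
# and draws a single violin
p.violin <- ggplot(data = toy, mapping = aes(x = "", y = value)) +
  geom_violin(fill = "gray90") +
  labs(x = NULL) +
  theme_bw()
p.violin
```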
ggplot(data = sc.data, mapping = aes(x = putative, y = SATB2, fill = putative)) +
geom_violin() +
scale_fill_viridis_d(option = "turbo",
begin = 0.1,
end = 0.9,
na.value = "gray90") +
guides(fill = "none") +
labs(x = "Putative cell type") +
coord_flip() +
theme_bw()
filter(sc.pivot, gene %in% c("PDE3A", "CFTR", "ADAMTSL1")) %>%
ggplot(mapping = aes(y = normalized.counts, x = sample, fill = sample)) +
geom_violin() +
facet_grid(gene~putative) +
scale_fill_viridis_d(option = "mako") +
theme_classic() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
labs(y = "Normalized counts")
Significance brackets are not common on violin plots, but if desired, geom_signif() works the same way as on a box plot.
ggplot(data = sc.data, mapping = aes(x = putative, y = PDE3A, fill = putative)) +
geom_violin() +
scale_fill_viridis_d(option = "turbo",
begin = 0.1,
end = 0.9,
na.value = "gray90") +
guides(fill = "none") +
geom_signif(comparisons = list(c("Tuft cells", "Intestinal epithelial cells"))) +
theme_bw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
Occasionally, you will see a box plot superimposed on the violin forms.
In this case, be sure to move the fill assignment inside the
geom_violin() call so that the box plots remain unfilled (this makes
them easier to perceive).
ggplot(data = sc.data, mapping = aes(x = putative, y = PDE3A)) +
geom_violin(mapping = aes(fill = putative)) +
geom_boxplot(width = 0.1) +
scale_fill_viridis_d(option = "turbo",
begin = 0.1,
end = 0.9,
na.value = "gray90") +
guides(fill = "none") +
geom_signif(comparisons = list(c("Tuft cells", "Intestinal epithelial cells"))) +
theme_bw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))