7 Introduction to the tidyverse

The tidyverse is a collection of R packages, developed by the creators of RStudio, that share a common approach to data science.

The authors model data science like this:

Figure: Data science life cycle.

The tidyverse packages provide alternatives to some base R functions. These alternatives are intended to be more user-friendly for data scientists who follow this life cycle.

We will only be covering a few of the packages from the tidyverse.

library(tidyr)
library(dplyr)
library(magrittr)

7.1 Defining tidy data

In “tidy” data, every column is a variable, every row is an observation, and every cell contains a single value. Is the birthweight data frame tidy? Why or why not?
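To make the definition concrete, here is a minimal sketch that contrasts an untidy and a tidy layout of the same measurements. Both data frames, and the column names "id" and "measurements", are hypothetical and are not part of the experiment data.

```r
# Hypothetical untidy data: each cell in "measurements" holds two values.
untidy <- data.frame(
  id = c(1, 2),
  measurements = c("46; 2.05", "48; 1.92")
)

# Tidy version: one variable per column and one value per cell.
tidy <- data.frame(
  id = c(1, 2),
  length = c(46, 48),
  birthweight = c(2.05, 1.92)
)

# In the tidy form, column-wise calculations work directly.
mean(tidy$birthweight)
```

The same calculation on the untidy version would first require splitting the text in every cell.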

7.2 Pipes: combining tidyverse functions

The tidyverse uses piping to send the output of one function to the next function, rather than the nesting of function calls used in base R. The “pipe” is written with a greater-than symbol sandwiched between two percent signs, like this: %>%.

experiment %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker)

Table 7.1: Tidyverse filter and select.
birth.date	length	birthweight	smoker
7/18/1968	46	2.05	TRUE
9/16/1968	48	1.92	TRUE
1/8/1968	47	2.66	TRUE
3/23/1967	50	2.51	TRUE
9/27/1968	43	2.65	FALSE
4/2/1968	48	2.37	TRUE

# equivalent to:
experiment[experiment$low.birthweight == TRUE, c("birth.date", "length", "birthweight", "smoker")]

Table 7.2: Base R subsetting.
	birth.date	length	birthweight	smoker
5	7/18/1968	46	2.05	TRUE
8	9/16/1968	48	1.92	TRUE
10	1/8/1968	47	2.66	TRUE
13	3/23/1967	50	2.51	TRUE
33	9/27/1968	43	2.65	FALSE
35	4/2/1968	48	2.37	TRUE
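The contrast between nesting and piping can be sketched on a small data frame. The "toy" data frame below is a hypothetical stand-in for the experiment data, with made-up rows.

```r
library(dplyr)

# Hypothetical stand-in for the experiment data frame.
toy <- data.frame(
  birthweight = c(2.05, 3.40, 1.92, 3.10),
  low.birthweight = c(TRUE, FALSE, TRUE, FALSE),
  smoker = c(TRUE, FALSE, TRUE, TRUE)
)

# Nested calls are read from the inside out.
nested <- select(filter(toy, low.birthweight == TRUE), birthweight, smoker)

# Piped calls are read from top to bottom.
piped <- toy %>%
  filter(low.birthweight == TRUE) %>%
  select(birthweight, smoker)

# Both forms give the same result.
identical(nested, piped)
```

The pipe passes the value on its left as the first argument of the function on its right, so the two forms perform the same steps in the same order.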

7.3 Transforming data

The separate() function makes it trivial to split the “birth.date” column into “month,” “day,” and “year” columns.

experiment %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker) %>%
  separate(col = birth.date, sep = "[/]", into = c("month", "day", "year"))

Table 7.3: Tidyverse separate.
month	day	year	length	birthweight	smoker
7	18	1968	46	2.05	TRUE
9	16	1968	48	1.92	TRUE
1	8	1968	47	2.66	TRUE
3	23	1967	50	2.51	TRUE
9	27	1968	43	2.65	FALSE
4	2	1968	48	2.37	TRUE
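By default, separate() returns the new columns as text. The sketch below, on a hypothetical two-row data frame, uses the convert argument so that the pieces are stored as numbers instead.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row stand-in for the experiment data frame.
toy <- data.frame(
  birth.date = c("7/18/1968", "9/16/1968"),
  length = c(46, 48)
)

# convert = TRUE turns the separated text into integers where possible.
dates <- toy %>%
  separate(col = birth.date, sep = "/",
           into = c("month", "day", "year"),
           convert = TRUE)

str(dates)
```

Numeric columns are useful if we later want to filter or sort on the year.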

The mutate() function adds a new column based on data contained in the existing columns.

experiment %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker) %>%
  mutate(d = birthweight / length)

Table 7.4: Tidyverse mutate.
birth.date	length	birthweight	smoker	d
7/18/1968	46	2.05	TRUE	0.0445652
9/16/1968	48	1.92	TRUE	0.0400000
1/8/1968	47	2.66	TRUE	0.0565957
3/23/1967	50	2.51	TRUE	0.0502000
9/27/1968	43	2.65	FALSE	0.0616279
4/2/1968	48	2.37	TRUE	0.0493750
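A single mutate() call can create several columns, and later columns can refer to earlier ones. This is a minimal sketch on a hypothetical two-row data frame. The column names "ratio" and "ratio.rounded" are illustrative only.

```r
library(dplyr)

# Hypothetical two-row stand-in for the experiment data frame.
toy <- data.frame(
  length = c(46, 48),
  birthweight = c(2.05, 1.92)
)

# "ratio.rounded" is computed from "ratio", created in the same call.
ratios <- toy %>%
  mutate(ratio = birthweight / length,
         ratio.rounded = round(ratio, 3))

ratios
```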

7.4 Summarizing data

The group_by() and summarize() functions work together: group_by() defines groups using one or more categorical variables, and summarize() applies a function to each group.

experiment %>%
  group_by(smoker) %>%
  summarize(mean.birthweight = mean(birthweight))

Table 7.5: Tidyverse summarize grouped by smoker.
smoker	mean.birthweight
FALSE	3.509500
TRUE	3.134091

experiment %>%
  group_by(smoker, low.birthweight) %>%
  summarize(mean.birthweight = mean(birthweight))

## `summarise()` has grouped output by 'smoker'. You can override using the `.groups` argument.

Table 7.6: Tidyverse summarize grouped by smoker and low.birthweight.
smoker	low.birthweight	mean.birthweight
FALSE	FALSE	3.554737
FALSE	TRUE	2.650000
TRUE	FALSE	3.378824
TRUE	TRUE	2.302000
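When a summary is grouped by more than one variable, summarize() reports which grouping remains in the output. The sketch below, on a hypothetical four-row data frame, sets the .groups argument so that the result is returned without any grouping.

```r
library(dplyr)

# Hypothetical four-row stand-in for the experiment data frame.
toy <- data.frame(
  smoker = c(TRUE, TRUE, FALSE, FALSE),
  low.birthweight = c(TRUE, FALSE, FALSE, FALSE),
  birthweight = c(2.0, 3.4, 3.5, 3.7)
)

# .groups = "drop" removes all grouping from the summarized output.
means <- toy %>%
  group_by(smoker, low.birthweight) %>%
  summarize(mean.birthweight = mean(birthweight), .groups = "drop")

means
```

Dropping the groups avoids surprises if we pass the summary on to further functions.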

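To change the order of rows, use arrange(). To return specific rows by position, use slice(). The related slice_max() function, used here, returns the rows with the largest values of a variable within each group.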

experiment %>%
  group_by(smoker) %>%
  select(smoker, birthweight, length, head.circumference, weeks.gestation) %>%
  slice_max(order_by = birthweight, n = 5)

Table 7.7: Tidyverse slice, ordered on birthweight.
smoker	birthweight	length	head.circumference	weeks.gestation
FALSE	4.55	56	34	44
FALSE	4.32	53	36	40
FALSE	4.10	58	39	41
FALSE	4.07	53	38	44
FALSE	3.94	54	37	42
TRUE	4.57	58	39	41
TRUE	3.87	50	33	45
TRUE	3.86	52	36	39
TRUE	3.64	53	38	40
TRUE	3.59	53	34	40
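For comparison, arrange() and slice() can be combined to reach a similar result. This is a minimal sketch on a hypothetical four-row data frame with no grouping and no tied values.

```r
library(dplyr)

# Hypothetical four-row stand-in for the experiment data frame.
toy <- data.frame(
  id = 1:4,
  birthweight = c(3.1, 4.5, 2.0, 3.9)
)

# Sort from heaviest to lightest.
sorted <- toy %>%
  arrange(desc(birthweight))

# Keep the first two rows of the sorted data.
top2 <- sorted %>%
  slice(1:2)

top2
```

Here slice_max(order_by = birthweight, n = 2) would return the same two rows in one step.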

The pivot_longer() and pivot_wider() functions reshape data. pivot_longer() gathers several columns into a pair of name and value columns, decreasing the number of columns, and pivot_wider() does the reverse. Their usefulness will become more evident during visualization.

experiment %>%
  filter(low.birthweight == TRUE) %>%
  select(smoker, `miR-16`, `miR-21`, `miR-146a`, `miR-182`) %>%
  pivot_longer(cols = c(`miR-16`, `miR-21`, `miR-146a`, `miR-182`),
               names_to = "gene",
               values_to = "expression")

Table 7.8: Tidyverse pivot_longer.
smoker	gene	expression
TRUE	miR-16	60
TRUE	miR-21	48
TRUE	miR-146a	89
TRUE	miR-182	65
TRUE	miR-16	70
TRUE	miR-21	51
TRUE	miR-146a	84
TRUE	miR-182	48
TRUE	miR-16	61
TRUE	miR-21	51
TRUE	miR-146a	97
TRUE	miR-182	59
TRUE	miR-16	47
TRUE	miR-21	40
TRUE	miR-146a	87
TRUE	miR-182	63
FALSE	miR-16	64
FALSE	miR-21	47
FALSE	miR-146a	104
FALSE	miR-182	42
TRUE	miR-16	61
TRUE	miR-21	47
TRUE	miR-146a	77
TRUE	miR-182	60
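The reverse operation can be sketched with pivot_wider(). The example assumes a hypothetical "id" column that identifies each observation, which the experiment data may not contain. Without such a column, pivot_wider() cannot tell which long rows belong to the same observation.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row data frame; the "id" column is an assumption.
toy <- data.frame(
  id = c(1, 2),
  smoker = c(TRUE, FALSE),
  `miR-16` = c(60, 64),
  `miR-21` = c(48, 47),
  check.names = FALSE
)

# Wide to long: one row per observation and gene.
long <- toy %>%
  pivot_longer(cols = c(`miR-16`, `miR-21`),
               names_to = "gene",
               values_to = "expression")

# Long to wide: one column per gene again.
wide <- long %>%
  pivot_wider(names_from = gene, values_from = expression)

wide
```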

7.5 Exercise 4: converting between base R and the tidyverse

Reproduce Table 7.5 or Table 7.6 using base R. Then use tidyverse functions to answer the question you addressed in Exercise 3.