UCD Bioinformatics Core Workshop

Home
Introduction
Intro to the Workshop and Core
Schedule
Resources
Zoom help
Slack help
Software and Links
Cheat Sheets
Intro CLI
Logging in
Introduction to the Command Line part 1
Introduction to the Command Line part 2
The cluster and modules
Introduction to Python
Python Part1, Background
Python Part2, Data Types
Python Part2, Solutions
Python Part3, Flow Control
Python Part3, Solutions
Python Part4, Working with Files
Python Part4, Solutions
Advanced CLI
Advanced Command Line
Advanced Challenge Solutions
A Simple Bioinformatics Workflow
Software Installation
Using make and cmake
Using Conda
Getting set up with AWS
AWS Event Engine
Starting and Stopping your AWS Instance
Introduction to R
Introduction to R
Introduction to tidyverse
Organizing, manipulating, and visualizing data in the tidyverse
Advanced R
Linear models in R
ETC
Closing thoughts
Github page
Biocore website

Tidyverse is a set of package for doing data science. R for Data Science by Garrett Grolemund, Hadley Wickham model Data Science in the following way.

library(knitr)
knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

We will start learning about Tidyverse tools by starting at the first step in this process, importing data.

Step 1: Import data with readr

“The goal of ‘readr’ is to provide a fast and friendly way to read rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.”

The readr package gets loaded automatically when you use library(tidyverse), or you can load it directly.

library(readr)

readr supports a number of file formats with different read_* functions including:

read_csv(): comma separated (CSV) files
read_tsv(): tab separated (TSV) files
read_delim(): general delimited files (you must supply delimiter!)
read_fwf(): fixed width files
read_table(): tabular files where columns are separated by white-space.
read_log(): web log files

Readr also has functions write data in a number of formats with various write_* functions:

write_csv(): comma separated (CSV) files
write_tsv(): tab separated (TSV) files
write_delim(): general delimited files
write_excel_csv(): comma separated files for Excel

Get some data and try out these functions:

Note that this data set is from the ggplot2 package.

read_tsv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-Bioinformatics_Prerequisites_Workshop/master/Intro_to_R/Intro2R/mpg.tsv")

## Parsed with column specification:
## cols(
##   manufacturer = col_character(),
##   model = col_character(),
##   displ = col_double(),
##   year = col_double(),
##   cyl = col_double(),
##   trans = col_character(),
##   drv = col_character(),
##   cty = col_double(),
##   hwy = col_double(),
##   fl = col_character(),
##   class = col_character()
## )

## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <dbl> <dbl> <chr>   <chr> <dbl> <dbl> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows

Alternatively we could use another of the readr functions, read_lines, to look at the first few lines of the file:

read_lines("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-Bioinformatics_Prerequisites_Workshop/master/Intro_to_R/Intro2R/mpg.tsv", n_max = 5)

## [1] "manufacturer\tmodel\tdispl\tyear\tcyl\ttrans\tdrv\tcty\thwy\tfl\tclass"
## [2] "audi\ta4\t1.8\t1999\t4\tauto(l5)\tf\t18\t29\tp\tcompact"               
## [3] "audi\ta4\t1.8\t1999\t4\tmanual(m5)\tf\t21\t29\tp\tcompact"             
## [4] "audi\ta4\t2\t2008\t4\tmanual(m6)\tf\t20\t31\tp\tcompact"               
## [5] "audi\ta4\t2\t2008\t4\tauto(av)\tf\t21\t30\tp\tcompact"

We could also check the number of lines by reading the whole file and counting the lines. This approach will be slow for large files:

length(read_lines("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-Bioinformatics_Prerequisites_Workshop/master/Intro_to_R/Intro2R/mpg.tsv"))

## [1] 235

Using the functions you just ran, answer the following questions:

How many lines does the file have?
What is the first line of the file?
What separates the values in the file?

To read the file and store it in an object, we assign the output of the read_tsv function to an object.

mpg <- read_tsv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-Bioinformatics_Prerequisites_Workshop/master/Intro_to_R/Intro2R/mpg.tsv")

## Parsed with column specification:
## cols(
##   manufacturer = col_character(),
##   model = col_character(),
##   displ = col_double(),
##   year = col_double(),
##   cyl = col_double(),
##   trans = col_character(),
##   drv = col_character(),
##   cty = col_double(),
##   hwy = col_double(),
##   fl = col_character(),
##   class = col_character()
## )

What are “Column Specifications”?

Computers use different types of containers to store different types of data. In tidyverse all numeric data (floating point and integer) is stored as a 64-bit double. Data that is not numeric is stored in character vectors. When reading a file, readr must make a guess about the type of data stored in each column. To do this, readr skims the first 1000 lines of the file, investigating the values it finds there, and using them to make a guess at the format of the file.

Now, let’s look at the object we just loaded

Does the object have the expected number of lines?
What type of object is it?

mpg

## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <dbl> <dbl> <chr>   <chr> <dbl> <dbl> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows

class(mpg)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

is_tibble(mpg)

## [1] TRUE

What is a Tibble?

Tibbles are a modified type of data frame. Everything you have learned about accessing and manipulating data frames still applies, but a tibble behaves a little differently. The documentation describes them like this:

Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.

Creating tibbles

As we saw above, importing data using the readr package creates a tibble. Additionally, tibbles can be created from existing objects using as_tibble(),

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

as_tibble(iris)

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # … with 140 more rows

created manually by specifying each column,

tibble(
        column1=1:5,
        column2=c("a","b","c","d","e"),
        column3=pi*1:5
       )

## # A tibble: 5 x 3
##   column1 column2 column3
##     <int> <chr>     <dbl>
## 1       1 a          3.14
## 2       2 b          6.28
## 3       3 c          9.42
## 4       4 d         12.6
## 5       5 e         15.7

or row-by-row with tribble().

tribble(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)

## # A tibble: 2 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

Tibbles can have terrible column names that should never be used. Good practice is to have unique column names that start with a letter and contain no special characters.

tibble(
  `terrible column 1` = 1:5,
  `no good :)` = letters[1:5],
  `;very-bad/` = sin(4:8)
)

## # A tibble: 5 x 3
##   `terrible column 1` `no good :)` `;very-bad/`
##                 <int> <chr>               <dbl>
## 1                   1 a                  -0.757
## 2                   2 b                  -0.959
## 3                   3 c                  -0.279
## 4                   4 d                   0.657
## 5                   5 e                   0.989

Challenge

Create a tibble with 100 rows and 3 columns, write your tibble to a file, and read it back in.

Write it out using write_tsv()
Which function should you use to read back in?
- Try read_csv(). Did it read in successfully?
- Try to use read_delim(). Did that work any better? hint: specify delimiter

Advanced challenge

Try to generate a parsing failure in readr by creating and writing a tibble that causes problems when read in by read_delim.

Then create R code that allows read_delim to successfully load your treacherous tibble.

Step 2: Tidy data with tidyr

First, what is “tidy” data?

Tidy data is data where:

1) Every column is a variable.

2) Every row is an observation.

3) Every cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.

Wickham, Hadley. “Tidy data.” Journal of Statistical Software 59.10 (2014): 1-23.

Definitions:

A variable stores a set of values of a particular type (height, temperature, duration)
An observation is all values measured on the same unit across variables

There is more than one way we can store data from an experiment.

d1 = data.frame(
        mouse_id = rep(c("A1","A2","A3"), 2),
        treatment = rep(c('a','b'), each = 3),
        mass = c(4,6,9,8,5,4)
)
d2 <- data.frame(
        A1=c(4,8),
        A2=c(6,5),
        A3=c(9,4)
)
rownames(d2) <- c("a","b")
d1

##   mouse_id treatment mass
## 1       A1         a    4
## 2       A2         a    6
## 3       A3         a    9
## 4       A1         b    8
## 5       A2         b    5
## 6       A3         b    4

d2

##   A1 A2 A3
## a  4  6  9
## b  8  5  4

Is either of these ways of organizing our experiment tidy? Why or why not?

Remember, for “tidy” data, we’re looking for a table where:

Every column is a variable
Every row is an observation
Every cell is a single value

In the two tables above, what are the variables and observations? Where are they stored?

How do we make our data “tidy?”

First, make the row names into a new column with the rownames_to_column() function:

d2.1 <- rownames_to_column(d2, 'treatment')
d2.1

##   treatment A1 A2 A3
## 1         a  4  6  9
## 2         b  8  5  4

Make each row an observation with pivot_longer() function:

d2.2 <- pivot_longer(d2.1, cols = -treatment, names_to = "mouse_id", values_to = "mass")
d2.2

## # A tibble: 6 x 3
##   treatment mouse_id  mass
##   <chr>     <chr>    <dbl>
## 1 a         A1           4
## 2 a         A2           6
## 3 a         A3           9
## 4 b         A1           8
## 5 b         A2           5
## 6 b         A3           4

Reorder columns for looks using select() function:

d2.3 <- select(d2.2, mouse_id, treatment, mass)
d2.3

## # A tibble: 6 x 3
##   mouse_id treatment  mass
##   <chr>    <chr>     <dbl>
## 1 A1       a             4
## 2 A2       a             6
## 3 A3       a             9
## 4 A1       b             8
## 5 A2       b             5
## 6 A3       b             4

Piping commands together with magrittr

Although the code above is fairly readable, it is not compact. It also creates three copies of the data (d2.1, d2.2, d2.3). We could use a couple of different strategies for carrying out this series of operations in a more concise manner.

The base-R strategy, Nested Functions

Traditionally, R users have used nested functions to skip the creation of intermediary objects:

d3 <- select( pivot_longer(rownames_to_column(d2, 'treatment'),
                             cols = -treatment,
                             names_to = "mouse_id",
                             values_to = "mass"),
                mouse_id, treatment, mass)
d3

## # A tibble: 6 x 3
##   mouse_id treatment  mass
##   <chr>    <chr>     <dbl>
## 1 A1       a             4
## 2 A2       a             6
## 3 A3       a             9
## 4 A1       b             8
## 5 A2       b             5
## 6 A3       b             4

identical(d2.3, d3)

## [1] TRUE

The tidyverse option, using the %>% pipe operator

The magrittr package offers a set of operators which make your code more readable by:

structuring sequences of data operations left-to-right (as opposed to from the inside and out),

avoiding nested function calls,

minimizing the need for local variables and function definitions, and

making it easy to add steps anywhere in the sequence of operations.

d4 <-
  d2 %>% rownames_to_column('treatment') %>%
    pivot_longer(cols = -treatment,
                 names_to = "mouse_id",
                 values_to = "mass") %>%
    select(mouse_id, treatment, mass)
d4

## # A tibble: 6 x 3
##   mouse_id treatment  mass
##   <chr>    <chr>     <dbl>
## 1 A1       a             4
## 2 A2       a             6
## 3 A3       a             9
## 4 A1       b             8
## 5 A2       b             5
## 6 A3       b             4

identical(d2.3, d4)

## [1] TRUE

How does %>% work?

By default, %>% works to replace the first argument in a function with the left-hand side value with basic piping:

x %>% f is equivalent to f(x)
x %>% f(y) is equivalent to f(x, y)
x %>% f %>% g %>% h is equivalent to h(g(f(x)))

In more complicated situations you can also specify the argument to pipe to using the argument placeholder:

x %>% f(y, .) is equivalent to f(y, x)
x %>% f(y, z = .) is equivalent to f(y, z = x)

Data can be made “wider” as well as “longer”

In addition to pivot_longer(), which we used above, we can change the shape of our data table using pivot_wider(), it does the inverse transformation of pivot_longer() and adds columns by removing rows.

The pivot_wider() function adds columns by removing rows. For example:

d1

##   mouse_id treatment mass
## 1       A1         a    4
## 2       A2         a    6
## 3       A3         a    9
## 4       A1         b    8
## 5       A2         b    5
## 6       A3         b    4

pivot_wider(d1, names_from = mouse_id, values_from = mass )

## # A tibble: 2 x 4
##   treatment    A1    A2    A3
##   <chr>     <dbl> <dbl> <dbl>
## 1 a             4     6     9
## 2 b             8     5     4

Separating values from a column that contains two or more variables

Sometimes a column contains two variables split by a delimiter. The separate() function can be used to create two columns from this single input. The tidyr package includes an example of this for us to practice with:

table3

## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

The third column rate contains two different values, cases and population. We can use the separate() function to split these into two columns.

separate(data = table3, col = rate, into = c("cases", "population"), sep = '/')

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362
## 4 Brazil       2000 80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Challenge

Tidy the relig_income example data table included in the tidyr package. hint: what are the variables and observations in this dataset?

What is wrong with the example data table2 from the tidyr package? What is wrong with table2? How many rows does it have per observation? Tidy it.

Step 3, Transform data with dplyr

The dplyr package provides a set of verbs for modifying, combining, and otherwise transforming variables, creating subsets based on various attributes, combining multiple tibbles, etc. These approaches are meant to enable efficient manipulation of data. Functions include:

mutate() adds new variables that are functions of existing variables
select() picks variables based on their names
filter() picks cases based on their values
summarize() reduces multiple values down to a single summary
arrange() changes the ordering of the rows

To demonstrate some of them we will use the starwars dataset included in the dplyr package.

library(dplyr)
starwars

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
##  9 Bigg…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

For example, using mutate, we can calculate the BMI of characters in the Star Wars universe.

starwars %>% mutate(BMI = mass / (height/100))

## # A tibble: 87 x 15
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4     97    32 <NA>       white, red red             NA   none  mascu…
##  9 Bigg…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # … with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, BMI <dbl>

As we saw above, select() allows us to limit the output to specified columns based on name or index.

starwars %>% mutate(BMI = mass / (height/100)) %>% select(name, species, BMI)

## # A tibble: 87 x 3
##    name               species   BMI
##    <chr>              <chr>   <dbl>
##  1 Luke Skywalker     Human    44.8
##  2 C-3PO              Droid    44.9
##  3 R2-D2              Droid    33.3
##  4 Darth Vader        Human    67.3
##  5 Leia Organa        Human    32.7
##  6 Owen Lars          Human    67.4
##  7 Beru Whitesun lars Human    45.5
##  8 R5-D4              Droid    33.0
##  9 Biggs Darklighter  Human    45.9
## 10 Obi-Wan Kenobi     Human    42.3
## # … with 77 more rows

Before we calculate any summary statistics, we need to check to see if we have appropriate levels of replication. We can determine the number of samples per species using count().

starwars %>% count(species, sort = TRUE, name = "Samples")

## # A tibble: 38 x 2
##    species  Samples
##    <chr>      <int>
##  1 Human         35
##  2 Droid          6
##  3 <NA>           4
##  4 Gungan         3
##  5 Kaminoan       2
##  6 Mirialan       2
##  7 Twi'lek        2
##  8 Wookiee        2
##  9 Zabrak         2
## 10 Aleena         1
## # … with 28 more rows

Species information is missing for a number of samples, so we can remove these with drop_na(), and also remove any samples for which mass or height is NA (since that will interfere with our BMI calculation).

starwars %>% drop_na(species, mass, height) %>% count(species, sort=T, name = "Samples")

## # A tibble: 31 x 2
##    species  Samples
##    <chr>      <int>
##  1 Human         22
##  2 Droid          4
##  3 Gungan         2
##  4 Mirialan       2
##  5 Wookiee        2
##  6 Aleena         1
##  7 Besalisk       1
##  8 Cerean         1
##  9 Clawdite       1
## 10 Dug            1
## # … with 21 more rows

To calculate the mean BMI for all species with at least three samples, we can combine several operators, piping them together with the %>% as we saw above in the section on tidying data. In order, we will:

Drop observations with missing species, mass, or height with drop_na()
Calculate BMI for remaining observations using mutate()
Group observations by species with group_by()
Remove all species with fewer than 3 observations using filter and n()
Calculate mean BMI of each species with summarize

We can drop observations for species for which there isn’t sufficient replication using a combination of filter(), group_by and a special n() function.

starwars %>% drop_na(species, mass, height) %>%
  mutate(BMI = mass / (height/100)) %>%
  group_by(species) %>%
  filter(n() >= 3) %>%
  summarize(mean_BMI = mean(BMI))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   species mean_BMI
##   <chr>      <dbl>
## 1 Droid       45.3
## 2 Human       45.8

So far, we’ve used all of the functions listed except arrange. What does arrange do? According to the documentation, arrange “changes the ordering of the rows.” This might be useful if we wanted to find the tallest inhabitant of each of the planets sampled in starwars.

starwars %>%
  arrange(homeworld, desc(height)) %>%  # sort by homeworld, then height descending
  group_by(homeworld) %>%  # use group_by to tell slice() how to select records
  slice(1)  # keep only the first observation per homeworld

## # A tibble: 49 x 14
## # Groups:   homeworld [49]
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Bail…    191    NA black      tan        brown             67 male  mascu…
##  2 Ratt…     79    15 none       grey, blue unknown           NA male  mascu…
##  3 Lobot    175    79 none       light      blue              37 male  mascu…
##  4 Jek …    180   110 brown      fair       blue              NA male  mascu…
##  5 Nute…    191    90 none       mottled g… red               NA male  mascu…
##  6 Ki-A…    198    82 white      pale       yellow            92 male  mascu…
##  7 Mas …    196    NA none       blue       blue              NA male  mascu…
##  8 Mon …    150    NA auburn     fair       blue              48 fema… femin…
##  9 Jang…    183    79 black      tan        brown             66 male  mascu…
## 10 Han …    180    80 brown      fair       brown             29 male  mascu…
## # … with 39 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Challenge

Use the data transformation functions to investigate flights into and out of NYC. This data can be found in the flights dataset from the nycflights13 package. First install the package and take a look at the data.

if (!any(rownames(installed.packages()) == "nycflights13")){
  install.packages("nycflights13")
}
library(nycflights13)
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Assuming that people traveling for the Christmas/New Years holiday depart between December 20th and 24th,

from which airport (EWR, JFK, LGA) did the most flights travel to San Francisco Airport (SFO)?
How many of these flights were delayed in departure by at least 30 minutes?
How many of these flights arrived late by at least 30 minutes?
What was the shortest in-flight time to SFO from each of the airports?

Advanced challenge

Answer the questions above using base R.

Step 4, Visualise with ggplot2

Based on The Grammar of Graphics, ggplot2 is a popular system for creating graphics in a “declarative” way. This means that you provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

Most of the following material on ggplot2 was taken directly from RStudio teaching resources that are made freely available online.

The template for creating a plot with ggplot looks like this:

ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes() )

The geom is a “geometric object” (e.g. geom_boxplot, geom_dotplot, etc.) that describes how the data is to be displayed, and the the aes or “aesthetics,” are visual properties of the graphical output (e.g. color, shape, or size). When we call ggplot using this format, we are “mapping,” or connecting, values in the data to visual element of the plot such as the color, shape, or size of points.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl))

There are many geoms and aesthetics available in ggplot2, and by combining these we can communicate a great deal about our data.

Challenge

Create scatter plots from the mpg dataset. Experiment with mapping aesthetics to different variables.

How does the behavior of ggplot differ when you use the “class” variable vs the “cty variable for color?
What happens when you use more than one aesthetic?
What happens if you try to specify an aesthetic mapping outside of the aes() function?

Create a tibble with x coordinates, y coordinates, 6 colors (color1-6), and 6 shapes (shape1-6).

Plot your tibble.
Try to set the size of all points to something larger than the default.
Does the tibble behave in the same way as the mpg data frame?

Plots can be built up piece by piece and only plotted when they are ready.

This code will not produce a plot:

q <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy))

Facets are another way to display groupings in your data. Only instead of color or shape, facets display each group in a different plot.

Challenge

Experiment with each of the modifications to the q object below.

How does facet_grid() work?
How does facet_wrap() differ from facet_grid()?
What happens when you provide two variables to facet_wrap()?

q <- ggplot(mpg) + geom_point(mapping = aes(x = displ, y = hwy))
q + facet_grid(. ~ cyl)

q + facet_grid(drv ~ .)

q + facet_grid(drv ~ cyl)

q + facet_wrap(~ class)

In addition to the aesthetics, ggplot2 allows you access to the labels on your plot, which you can modify to better communicate your findings.

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy, color = class)) +
  labs(title = "Fuel Efficiency by Engine Size",
    subtitle = "Data facetted by class",
    x = "Engine Size (displacement in liters)",
    y = "Fuel Efficiency (MPG)",
    color = "Class of\nAutomobile",
    caption = "Data from the EPA") +
    facet_wrap(~ class)

So far, we’ve mostly been working with geom_point. However, ggplot provides many other geoms, including:

geom_abline()
geom_boxplot()
geom_histogram()
geom_smooth()
geom_violin()

and many more. For each of these geoms, the available aesthetics can be mapped to the data, and may have additional parameters to modify that can change the appearance of the plot.

A plot may use one or more geoms

For example, this box plot and histogram each use a single geom. The geom_histogram function takes an additional argument that allows the user to set the binwidth and get a higher or lower resolution plot.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth=1)

If two (or more) geoms use the same type of axes, they can be displayed on the same plot. In this case, we can use both points and smoothed conditional means.

ggplot(mpg) +
  geom_point(aes(displ, hwy)) +
  geom_smooth(aes(displ, hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Aesthetic mapping may be set for the entire plot, or for each layer individually.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This concept also applies to data. If you only want a subset of your data plotted by one of your geoms, you can specify a second data set inside of the function call to that geom.

myplot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(data = filter(mpg, drv == "f"))
myplot

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Once we have finished creating our plot, we have two ways to save the plot.

Option 1, using base-R png() or pdf() functions:

pdf(file="myplot.pdf", width = 5, height = 5)
myplot

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

dev.off()

## quartz_off_screen
##                 2

png(file="myplot.png", width = 500, height = 500)
myplot

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

dev.off()

## quartz_off_screen
##                 2

Option 2, using the ggsave function. By default, ggsave will save the last plot made, to save a different plot, we can specify another plot instead.

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth()

ggsave(file = "myplot2.pdf", width = 5, height = 5)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave(file = "myplot2.pdf", width = 5, height = 5, plot = myplot)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

?ggsave

Keep exploring

Visit https://rstudio.cloud/learn/cheat-sheets and download the Data Visualization cheatsheet. Experiment with different geoms, getting a good mix of one variable, two variables, continuous, and discrete.

☰ Menu

Bioinformatics: Command Line/R Prerequisites 2020

Step 1: Import data with readr

readr supports a number of file formats with different read_* functions including:

Readr also has functions write data in a number of formats with various write_* functions:

Get some data and try out these functions:

What are “Column Specifications”?

Now, let’s look at the object we just loaded

What is a Tibble?

Creating tibbles

Challenge

Advanced challenge

Step 2: Tidy data with tidyr

First, what is “tidy” data?

There is more than one way we can store data from an experiment.

How do we make our data “tidy?”

Piping commands together with magrittr

The base-R strategy, Nested Functions

The tidyverse option, using the %>% pipe operator

How does %>% work?

Data can be made “wider” as well as “longer”

Separating values from a column that contains two or more variables

Challenge

Step 3, Transform data with dplyr

Challenge

Advanced challenge

Step 4, Visualise with ggplot2

Challenge

Challenge

A plot may use one or more geoms

Aesthetic mapping may be set for the entire plot, or for each layer individually.

Once we have finished creating our plot, we have two ways to save the plot.

Keep exploring