R is a very powerful language. However, it doesn’t always behave the way one might expect…

Strings and factors

By default, when reading in a file, R reads in character data as a factor (i.e. coded as numbers). This can lead to some unintended consequences:

dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/stringsAsFactors_example.csv")
dat
##   Subject          Age
## 1       1           23
## 2       2           24
## 3       3           55
## 4       4           33
## 5       5 not recorded
## 6       6           35
## 7       7           44
str(dat)
## 'data.frame':    7 obs. of  2 variables:
##  $ Subject: int  1 2 3 4 5 6 7
##  $ Age    : Factor w/ 7 levels "23","24","33",..: 1 2 6 3 7 4 5

Age needs to be numeric if we want to use it in an analysis. One might think that as.numeric() would help:

as.numeric(dat$Age)
## [1] 1 2 6 3 7 4 5

This is not what we want! This error is obvious with a small dataset, but easy to miss with a larger one. A few ways to fix this:

  1. Use stringsAsFactors = F when reading in the data
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/stringsAsFactors_example.csv", stringsAsFactors = F)
str(dat)
## 'data.frame':    7 obs. of  2 variables:
##  $ Subject: int  1 2 3 4 5 6 7
##  $ Age    : chr  "23" "24" "55" "33" ...
as.numeric(dat$Age)
## Warning: NAs introduced by coercion
## [1] 23 24 55 33 NA 35 44
  1. Use as.character() with as.numeric()
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/stringsAsFactors_example.csv")
str(dat)
## 'data.frame':    7 obs. of  2 variables:
##  $ Subject: int  1 2 3 4 5 6 7
##  $ Age    : Factor w/ 7 levels "23","24","33",..: 1 2 6 3 7 4 5
as.numeric(as.character(dat$Age))
## Warning: NAs introduced by coercion
## [1] 23 24 55 33 NA 35 44
  1. If you’re recording data, don’t mix character and numeric data in the same column.

Missing values in calculations

By default, summaries of data with missing values (NA’s) return NA

height <- c(55, 60, 65, 72, 66, NA, NA, NA)
mean(height)
## [1] NA

To get the mean of the non-missing values, use na.rm = T

mean(height, na.rm = T)
## [1] 63.6

Sometimes, na.rm = T leaves nothing in one subgroup of the data. This can also lead to unexpected behavior:

height <- c(55, 60, 65, 72, 66, NA, NA, NA)
sex <- c("F", "F", "F", "F", "F", "M", "M", "M")
mean(height[sex == "M"], na.rm = T)
## [1] NaN
min(height[sex == "M"], na.rm = T)
## Warning in min(height[sex == "M"], na.rm = T): no non-missing arguments to
## min; returning Inf
## [1] Inf
max(height[sex == "M"], na.rm = T)
## Warning in max(height[sex == "M"], na.rm = T): no non-missing arguments to
## max; returning -Inf
## [1] -Inf

These aren’t errors per se, but be very careful if you plan to use these results in downstream calculations.

If you’re testing for missing values, use is.na() instead of ==

x <- NA
x == NA # always returns NA
## [1] NA
is.na(x)
## [1] TRUE

Subsetting strangeness

Be careful when subsetting on data with missing values:

height <- c(55, 60, 65, 72, 66, 72, 63, 66)
sex <- c("F", "F", "F", "F", "F", "M", "M", NA)
dat <- data.frame(height = height, sex = sex)
dat[dat$sex == "M",]
##    height  sex
## 6      72    M
## 7      63    M
## NA     NA <NA>

Where did the NA’s come from? There are no missing values for height here. Let’s break this down: The statement dat$sex == "M" returns NA when applied to the NA value for sex:

dat$sex == "M"
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE    NA

When we use dat$sex == "M" to subset, the resulting NA creates a row where all the values are NA:

dat[dat$sex == "M",]
##    height  sex
## 6      72    M
## 7      63    M
## NA     NA <NA>

There are a few ways around this:

  1. Use subset()
subset(dat, sex == "M")
##   height sex
## 6     72   M
## 7     63   M
  1. Use which()
dat[which(dat$sex == "M"),]
##   height sex
## 6     72   M
## 7     63   M

Sometimes we want to remove certain observations. What happens if you try to take away something that’s not there?

dat
##   height  sex
## 1     55    F
## 2     60    F
## 3     65    F
## 4     72    F
## 5     66    F
## 6     72    M
## 7     63    M
## 8     66 <NA>
drop <- which(dat$sex == "Male")
drop
## integer(0)
dat2 <- dat[-drop,]
dat2
## [1] height sex   
## <0 rows> (or 0-length row.names)

Logically, one would think that taking nothing away from something would leave what you started with…

The solution is to check that you aren’t using an empty vector to subset:

drop <- which(dat$sex == "Male")
if (length(drop) > 0){
  dat2 <- dat[-drop,]
}else{
  dat2 <- dat
}
dat2
##   height  sex
## 1     55    F
## 2     60    F
## 3     65    F
## 4     72    F
## 5     66    F
## 6     72    M
## 7     63    M
## 8     66 <NA>

You may be aware that you can use dat[1,] to pull out the first row of the data frame dat:

dat
##   height  sex
## 1     55    F
## 2     60    F
## 3     65    F
## 4     72    F
## 5     66    F
## 6     72    M
## 7     63    M
## 8     66 <NA>
dat[1,]
##   height sex
## 1     55   F

What happens if you forget the comma?

dat[1]
##   height
## 1     55
## 2     60
## 3     65
## 4     72
## 5     66
## 6     72
## 7     63
## 8     66

Since a data frame is a list, using a single subscript gives you a list (here, another data frame) with a single element

str(dat[1])
## 'data.frame':    8 obs. of  1 variable:
##  $ height: num  55 60 65 72 66 72 63 66

However, this behaviour is totally different with a matrix

mat <- as.matrix(dat)
mat
##      height sex
## [1,] "55"   "F"
## [2,] "60"   "F"
## [3,] "65"   "F"
## [4,] "72"   "F"
## [5,] "66"   "F"
## [6,] "72"   "M"
## [7,] "63"   "M"
## [8,] "66"   NA
mat[1]
## [1] "55"

When you take a single subscript of a matrix, you get that element of the vectorized matrix

Mangled column names

By default, when reading in data R converts column names to syntactically valid R variable names. Spaces and punctuation are converted to “.”, and column names starting with a number have “X” added to the beginning:

dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/check_names_example.csv")
dat
##     X1st.Run   X2nd.Run   X3rd.Run
## 1  0.8996841  0.1321942 0.03146651
## 2 -0.7056392 -0.6499181 1.10609271
## 3  0.8223703 -0.2632767 0.11815986
## 4  0.1686693 -2.5247638 0.14299149
## 5 -1.3234993 -0.8117055 0.90373851

If you want to keep the original column names, use check.names = F:

dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/check_names_example.csv", check.names = F)
dat
##      1st Run    2nd Run    3rd Run
## 1  0.8996841  0.1321942 0.03146651
## 2 -0.7056392 -0.6499181 1.10609271
## 3  0.8223703 -0.2632767 0.11815986
## 4  0.1686693 -2.5247638 0.14299149
## 5 -1.3234993 -0.8117055 0.90373851

However, since these aren’t valid R names you have to put backticks (``) around the names when using them:

# dat$1st Run # doesn't work
dat$`1st Run`
## [1]  0.8996841 -0.7056392  0.8223703  0.1686693 -1.3234993

Package not available error messages

Upon installation of a package, R may sometimes say the package is not available. This may happen for a couple of reasons:

  1. You’re using the wrong archive (usually, trying to install a Bioconductor package from CRAN)
install.packages("limma")
## Installing package into 'C:/Users/bpdurbin/Desktop/mar2019_course/packrat/lib/x86_64-w64-mingw32/3.5.3'
## (as 'lib' is unspecified)
## Warning: package 'limma' is not available (as a binary package for R
## version 3.5.3)

(Sometimes RStudio may look in multiple archives without you asking explicitly, but I have only seen this behaviour on Linux).

(This is how you would install the package from Bioconductor):

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("limma", version = "3.8")
  1. You spelled the package name wrong.

  2. The package actually doesn’t exist (least likely)

Whitespace

Whitespace in character data (from e.g. an accidentally typed space in an Excel file) is handled like a regular character in R:

dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/whitespace_example.csv")
dat
##   Subject Sex
## 1       1   M
## 2       2   M
## 3       3   F
## 4       4   F
## 5       5  F 
## 6       6   F
table(dat$Sex)
## 
##  F F   M 
##  3  1  2

Notice that "F" and "F " are treated as separate categories.

To trim leading and trailing whitespace on reading data in, use strip.white = T

dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/whitespace_example.csv", strip.white = T)
table(dat$Sex)
## 
## F M 
## 4 2

Not every method of reading data in R has this option, but you can also use the function trimws():

dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/whitespace_example.csv")
table(dat$Sex)
## 
##  F F   M 
##  3  1  2
dat$Sex <- trimws(dat$Sex)
table(dat$Sex)
## 
## F M 
## 4 2

Alphabetically ordered factor levels

By default, R orders levels of factors (or characters that it converts to factors for analysis) alphabetically. This may not be what you want.

rating <- factor(c("Strongly Agree (5)", "Agree (4)", "Agree (4)", "Disagree (2)", "Disagree (2)", "Strongly Disagree (1)", "Agree (4)", "Neither Agree nor Disagree (3)"))
table(rating)
## rating
##                      Agree (4)                   Disagree (2) 
##                              3                              2 
## Neither Agree nor Disagree (3)             Strongly Agree (5) 
##                              1                              1 
##          Strongly Disagree (1) 
##                              1

The default order doesn’t make sense here. However, we can change the levels when creating the factor variable:

levels.use <- c("Strongly Disagree (1)", "Disagree (2)", "Neither Agree nor Disagree (3)", "Agree (4)", "Strongly Agree (5)")
rating <- factor(c("Strongly Agree (5)", "Agree (4)", "Agree (4)", "Disagree (2)", "Disagree (2)", "Strongly Disagree (1)", "Agree (4)", "Neither Agree nor Disagree (3)"),
                 levels = levels.use)
table(rating)
## rating
##          Strongly Disagree (1)                   Disagree (2) 
##                              1                              2 
## Neither Agree nor Disagree (3)                      Agree (4) 
##                              1                              3 
##             Strongly Agree (5) 
##                              1

But beware of trying to use levels() after the fact to change the ordering of the levels:

rating <- factor(c("Strongly Agree (5)", "Agree (4)", "Agree (4)", "Disagree (2)", "Disagree (2)", "Strongly Disagree (1)", "Agree (4)", "Neither Agree nor Disagree (3)"))
levels(rating) <- levels.use
table(rating)
## rating
##          Strongly Disagree (1)                   Disagree (2) 
##                              3                              2 
## Neither Agree nor Disagree (3)                      Agree (4) 
##                              1                              1 
##             Strongly Agree (5) 
##                              1

We have accidentally changed all of the “Agree” responses to “Strongly Disagree”!

To make a factor level the reference (first) level, you can use the relevel() function:

genotype <- factor(c("Mut", "WT", "Mut", "Mut", "WT", "WT"))
levels(genotype)
## [1] "Mut" "WT"

Make “WT” the reference level:

genotype <- relevel(genotype, ref = "WT")
levels(genotype)
## [1] "WT"  "Mut"

Overloaded functions

R functions (like plot() and print()) often do different things depending on the type of object you apply them to. This is useful, unless you aren’t expecting it.

Consider the function plotMDS in the limma package, which creates a multidimensional scaling plot. Let’s read in some RNA-Seq count data and plot it:

counts <- read.delim("https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/friday/all_counts.txt")
limma::plotMDS(counts) 

edgeR has an object class called a DGEList that is a container for counts and other information. By changing the class, we get a totally different plot (the data are scaled and log transformed first):

d <- edgeR::DGEList(counts)
limma::plotMDS(d)

Check the documentation to see how a function behaves for different types of input.

A few other things:

  1. Cutting and pasting an entire R error message into Google can be surprisingly helpful.
  2. R is case sensitive: you can have two different objects called “Age” and “age” (although I wouldn’t recommend it)
  3. If you’re using R on a Windows machine, file paths have to be specified using “/” instead of the usual backslash

(Inspired by The R Inferno)