R is a very powerful language. However, it doesn’t always behave the way one might expect…
By default, when reading in a file, R reads in character data as a factor (i.e. coded as numbers). This can lead to some unintended consequences:
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/stringsAsFactors_example.csv")
dat
## Subject Age
## 1 1 23
## 2 2 24
## 3 3 55
## 4 4 33
## 5 5 not recorded
## 6 6 35
## 7 7 44
str(dat)
## 'data.frame': 7 obs. of 2 variables:
## $ Subject: int 1 2 3 4 5 6 7
## $ Age : Factor w/ 7 levels "23","24","33",..: 1 2 6 3 7 4 5
Age needs to be numeric if we want to use it in an analysis. One might think that as.numeric()
would help:
as.numeric(dat$Age)
## [1] 1 2 6 3 7 4 5
This is not what we want! This error is obvious with a small dataset, but easy to miss with a larger one. A few ways to fix this:
stringsAsFactors = F
when reading in the datadat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/stringsAsFactors_example.csv", stringsAsFactors = F)
str(dat)
## 'data.frame': 7 obs. of 2 variables:
## $ Subject: int 1 2 3 4 5 6 7
## $ Age : chr "23" "24" "55" "33" ...
as.numeric(dat$Age)
## Warning: NAs introduced by coercion
## [1] 23 24 55 33 NA 35 44
as.character()
with as.numeric()
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/stringsAsFactors_example.csv")
str(dat)
## 'data.frame': 7 obs. of 2 variables:
## $ Subject: int 1 2 3 4 5 6 7
## $ Age : Factor w/ 7 levels "23","24","33",..: 1 2 6 3 7 4 5
as.numeric(as.character(dat$Age))
## Warning: NAs introduced by coercion
## [1] 23 24 55 33 NA 35 44
By default, summaries of data with missing values (NA’s) return NA
height <- c(55, 60, 65, 72, 66, NA, NA, NA)
mean(height)
## [1] NA
To get the mean of the non-missing values, use na.rm = T
mean(height, na.rm = T)
## [1] 63.6
Sometimes, na.rm = T
leaves nothing in one subgroup of the data. This can also lead to unexpected behavior:
height <- c(55, 60, 65, 72, 66, NA, NA, NA)
sex <- c("F", "F", "F", "F", "F", "M", "M", "M")
mean(height[sex == "M"], na.rm = T)
## [1] NaN
min(height[sex == "M"], na.rm = T)
## Warning in min(height[sex == "M"], na.rm = T): no non-missing arguments to
## min; returning Inf
## [1] Inf
max(height[sex == "M"], na.rm = T)
## Warning in max(height[sex == "M"], na.rm = T): no non-missing arguments to
## max; returning -Inf
## [1] -Inf
These aren’t errors per se, but be very careful if you plan to use these results in downstream calculations.
If you’re testing for missing values, use is.na()
instead of ==
x <- NA
x == NA # always returns NA
## [1] NA
is.na(x)
## [1] TRUE
Be careful when subsetting on data with missing values:
height <- c(55, 60, 65, 72, 66, 72, 63, 66)
sex <- c("F", "F", "F", "F", "F", "M", "M", NA)
dat <- data.frame(height = height, sex = sex)
dat[dat$sex == "M",]
## height sex
## 6 72 M
## 7 63 M
## NA NA <NA>
Where did the NA’s come from? There are no missing values for height here. Let’s break this down: The statement dat$sex == "M"
returns NA when applied to the NA value for sex:
dat$sex == "M"
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE NA
When we use dat$sex == "M"
to subset, the resulting NA creates a row where all the values are NA:
dat[dat$sex == "M",]
## height sex
## 6 72 M
## 7 63 M
## NA NA <NA>
There are a few ways around this:
subset()
subset(dat, sex == "M")
## height sex
## 6 72 M
## 7 63 M
which()
dat[which(dat$sex == "M"),]
## height sex
## 6 72 M
## 7 63 M
Sometimes we want to remove certain observations. What happens if you try to take away something that’s not there?
dat
## height sex
## 1 55 F
## 2 60 F
## 3 65 F
## 4 72 F
## 5 66 F
## 6 72 M
## 7 63 M
## 8 66 <NA>
drop <- which(dat$sex == "Male")
drop
## integer(0)
dat2 <- dat[-drop,]
dat2
## [1] height sex
## <0 rows> (or 0-length row.names)
Logically, one would think that taking nothing away from something would leave what you started with…
The solution is to check that you aren’t using an empty vector to subset:
drop <- which(dat$sex == "Male")
if (length(drop) > 0){
dat2 <- dat[-drop,]
}else{
dat2 <- dat
}
dat2
## height sex
## 1 55 F
## 2 60 F
## 3 65 F
## 4 72 F
## 5 66 F
## 6 72 M
## 7 63 M
## 8 66 <NA>
You may be aware that you can use dat[1,]
to pull out the first row of the data frame dat
:
dat
## height sex
## 1 55 F
## 2 60 F
## 3 65 F
## 4 72 F
## 5 66 F
## 6 72 M
## 7 63 M
## 8 66 <NA>
dat[1,]
## height sex
## 1 55 F
What happens if you forget the comma?
dat[1]
## height
## 1 55
## 2 60
## 3 65
## 4 72
## 5 66
## 6 72
## 7 63
## 8 66
Since a data frame is a list, using a single subscript gives you a list (here, another data frame) with a single element
str(dat[1])
## 'data.frame': 8 obs. of 1 variable:
## $ height: num 55 60 65 72 66 72 63 66
However, this behaviour is totally different with a matrix
mat <- as.matrix(dat)
mat
## height sex
## [1,] "55" "F"
## [2,] "60" "F"
## [3,] "65" "F"
## [4,] "72" "F"
## [5,] "66" "F"
## [6,] "72" "M"
## [7,] "63" "M"
## [8,] "66" NA
mat[1]
## [1] "55"
When you take a single subscript of a matrix, you get that element of the vectorized matrix
By default, when reading in data R converts column names to syntactically valid R variable names. Spaces and punctuation are converted to “.”, and column names starting with a number have “X” added to the beginning:
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/check_names_example.csv")
dat
## X1st.Run X2nd.Run X3rd.Run
## 1 0.8996841 0.1321942 0.03146651
## 2 -0.7056392 -0.6499181 1.10609271
## 3 0.8223703 -0.2632767 0.11815986
## 4 0.1686693 -2.5247638 0.14299149
## 5 -1.3234993 -0.8117055 0.90373851
If you want to keep the original column names, use check.names = F
:
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/check_names_example.csv", check.names = F)
dat
## 1st Run 2nd Run 3rd Run
## 1 0.8996841 0.1321942 0.03146651
## 2 -0.7056392 -0.6499181 1.10609271
## 3 0.8223703 -0.2632767 0.11815986
## 4 0.1686693 -2.5247638 0.14299149
## 5 -1.3234993 -0.8117055 0.90373851
However, since these aren’t valid R names you have to put backticks (``) around the names when using them:
# dat$1st Run # doesn't work
dat$`1st Run`
## [1] 0.8996841 -0.7056392 0.8223703 0.1686693 -1.3234993
Upon installation of a package, R may sometimes say the package is not available. This may happen for a couple of reasons:
install.packages("limma")
## Installing package into 'C:/Users/bpdurbin/Desktop/mar2019_course/packrat/lib/x86_64-w64-mingw32/3.5.3'
## (as 'lib' is unspecified)
## Warning: package 'limma' is not available (as a binary package for R
## version 3.5.3)
(Sometimes RStudio may look in multiple archives without you asking explicitly, but I have only seen this behaviour on Linux).
(This is how you would install the package from Bioconductor):
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("limma", version = "3.8")
You spelled the package name wrong.
The package actually doesn’t exist (least likely)
Whitespace in character data (from e.g. an accidentally typed space in an Excel file) is handled like a regular character in R:
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/whitespace_example.csv")
dat
## Subject Sex
## 1 1 M
## 2 2 M
## 3 3 F
## 4 4 F
## 5 5 F
## 6 6 F
table(dat$Sex)
##
## F F M
## 3 1 2
Notice that "F"
and "F "
are treated as separate categories.
To trim leading and trailing whitespace on reading data in, use strip.white = T
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/whitespace_example.csv", strip.white = T)
table(dat$Sex)
##
## F M
## 4 2
Not every method of reading data in R has this option, but you can also use the function trimws()
:
dat <- read.csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2019-March-Bioinformatics-Prerequisites/master/thursday/whitespace_example.csv")
table(dat$Sex)
##
## F F M
## 3 1 2
dat$Sex <- trimws(dat$Sex)
table(dat$Sex)
##
## F M
## 4 2
By default, R orders levels of factors (or characters that it converts to factors for analysis) alphabetically. This may not be what you want.
rating <- factor(c("Strongly Agree (5)", "Agree (4)", "Agree (4)", "Disagree (2)", "Disagree (2)", "Strongly Disagree (1)", "Agree (4)", "Neither Agree nor Disagree (3)"))
table(rating)
## rating
## Agree (4) Disagree (2)
## 3 2
## Neither Agree nor Disagree (3) Strongly Agree (5)
## 1 1
## Strongly Disagree (1)
## 1
The default order doesn’t make sense here. However, we can change the levels when creating the factor variable:
levels.use <- c("Strongly Disagree (1)", "Disagree (2)", "Neither Agree nor Disagree (3)", "Agree (4)", "Strongly Agree (5)")
rating <- factor(c("Strongly Agree (5)", "Agree (4)", "Agree (4)", "Disagree (2)", "Disagree (2)", "Strongly Disagree (1)", "Agree (4)", "Neither Agree nor Disagree (3)"),
levels = levels.use)
table(rating)
## rating
## Strongly Disagree (1) Disagree (2)
## 1 2
## Neither Agree nor Disagree (3) Agree (4)
## 1 3
## Strongly Agree (5)
## 1
But beware of trying to use levels()
after the fact to change the ordering of the levels:
rating <- factor(c("Strongly Agree (5)", "Agree (4)", "Agree (4)", "Disagree (2)", "Disagree (2)", "Strongly Disagree (1)", "Agree (4)", "Neither Agree nor Disagree (3)"))
levels(rating) <- levels.use
table(rating)
## rating
## Strongly Disagree (1) Disagree (2)
## 3 2
## Neither Agree nor Disagree (3) Agree (4)
## 1 1
## Strongly Agree (5)
## 1
We have accidentally changed all of the “Agree” responses to “Strongly Disagree”!
To make a factor level the reference (first) level, you can use the relevel()
function:
genotype <- factor(c("Mut", "WT", "Mut", "Mut", "WT", "WT"))
levels(genotype)
## [1] "Mut" "WT"
Make “WT” the reference level:
genotype <- relevel(genotype, ref = "WT")
levels(genotype)
## [1] "WT" "Mut"
R functions (like plot() and print()) often do different things depending on the type of object you apply them to. This is useful, unless you aren’t expecting it.
Consider the function plotMDS in the limma package, which creates a multidimensional scaling plot. Let’s read in some RNA-Seq count data and plot it:
counts <- read.delim("https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/friday/all_counts.txt")
limma::plotMDS(counts)
edgeR has an object class called a DGEList that is a container for counts and other information. By changing the class, we get a totally different plot (the data are scaled and log transformed first):
d <- edgeR::DGEList(counts)
limma::plotMDS(d)
Check the documentation to see how a function behaves for different types of input.