Home
Introduction
Introduction to the Workshop and the Core
Schedule
Support
Slack
Zoom
Cheat Sheets
Introduction to R
Introduction to R Day 1
Introduction to R Day 2
Introduction to R Day 3
Challenge solutions
ETC
Closing thoughts
Github page
Biocore website

embed_day3.knit

Intro to R Day 3

Load your day 2 workspace data:

load("day2.RData")

Loops

In programming, it is common that one has to do one set of specific operation on a sequence of elements. In this case, for loop is very useful to achieve the goal.

The basic structure of for loop is:

for (value in sequence){

some operation(s)

}

For example, we would like to calculate the sum of a row for every row in the matrix we created earlier. We are going to use a for loop to do it.

for (i in 1:dim(my_matrix)[1]) {
  out <- sum(my_matrix[i, ])
  print(out)
}

## [1] 11
## [1] 58
## [1] 302
## [1] 38

There is a while loop in R similarly as in command line or any other programming language. The basic structure of a while loop is:

while (condition){

some operation

}

Here is the same row sum calculation using a while loop:

i=1 # need to initialize i
while (i <= dim(my_matrix)[1]){
  out <- sum(my_matrix[i,])
  print(out)
  i <- i + 1
}

## [1] 11
## [1] 58
## [1] 302
## [1] 38

The “apply” family of functions

A few useful functions: apply(), lapply(), sapply(), and tapply() to replace for loop

apply() takes an array or matrix as input and outputs a vector, array or list.

# recall my_matrix
my_matrix

##      col1 col2 col3
## row1    1    2    8
## row2    3   18   37
## row3    8   27  267
## row4    9   10   19

# check the usage of apply() function
?apply()
# calculate sums for each row
apply(my_matrix, MARGIN=1, sum)

## row1 row2 row3 row4 
##   11   58  302   38

lapply() takes a list, vector or data frame as input and outputs a list.

?lapply()

# generate some random data matrix
data3 <- as.data.frame(matrix(rnorm(49), ncol=7), stringsAsFactors=F)
dim(data3)

## [1] 7 7

# calculate the sum for each row
lapply(1:dim(data3)[1], function(x){sum(data3[x,])})

## [[1]]
## [1] 2.035963
## 
## [[2]]
## [1] 1.514792
## 
## [[3]]
## [1] -0.01635569
## 
## [[4]]
## [1] -1.877682
## 
## [[5]]
## [1] -5.900493
## 
## [[6]]
## [1] 0.8841496
## 
## [[7]]
## [1] -1.084613

# comparing the results to apply() results
apply(data3, MARGIN=1, sum)

## [1]  2.03596280  1.51479170 -0.01635569 -1.87768180 -5.90049304  0.88414957
## [7] -1.08461324

# calculate log10 of the sum of each row
lapply(1:dim(data3)[1], function(x){log10(sum(data3[x,]))})

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## [[1]]
## [1] 0.3087698
## 
## [[2]]
## [1] 0.1803529
## 
## [[3]]
## [1] NaN
## 
## [[4]]
## [1] NaN
## 
## [[5]]
## [1] NaN
## 
## [[6]]
## [1] -0.05347426
## 
## [[7]]
## [1] NaN

The function sapply() works like function lapply(), but tries to simplify the output to the simplest data structure possible. As a matter of fact, sapply() is a “wrapper” function for lapply(). By default, it returns a vector.

# To check the syntax of using sapply():
?sapply()

sapply(1:dim(data3)[1], function(x){log10(sum(data3[x,]))})

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## [1]  0.30876984  0.18035292         NaN         NaN         NaN -0.05347426
## [7]         NaN

If the “simplify” parameter is turned off, sapply() will produced exactly the same results as lapply(), in the form of a list. By default, “simplify” is turned on.

sapply(1:dim(data3)[1], function(x){log10(sum(data3[x,]))}, simplify=FALSE)

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## [[1]]
## [1] 0.3087698
## 
## [[2]]
## [1] 0.1803529
## 
## [[3]]
## [1] NaN
## 
## [[4]]
## [1] NaN
## 
## [[5]]
## [1] NaN
## 
## [[6]]
## [1] -0.05347426
## 
## [[7]]
## [1] NaN

The function tapply() applys a function to each subset of a vector based on a second vector of factors.

?tapply()

# Let's use Fisher's Iris data to demonstrate the usage of tapply().
# First, load the Iris dataset
data(iris)

# Take a look at what the data includes
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Generate a summary of the sepal lengths for each iris species.
tapply(iris$Sepal.Length, iris$Species, summary)

## $setosa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   4.800   5.000   5.006   5.200   5.800 
## 
## $versicolor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   5.600   5.900   5.936   6.300   7.000 
## 
## $virginica
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   6.225   6.500   6.588   6.900   7.900

User defined functions

Even though there are a lot of R packages available, there are always situations where one might have to write one’s own function to accomplish some very specific goals. Functions are defined by code with a specific format:

function.name <- function(arg1=arg1, arg2, …){

var <- sin(arg1) + sin(arg2) # carry out tasks

var / 2

}

Here, we are going to write a function to calculate the area of a triangle given the lengths of three sides.

my.area <- function(side1=side1, side2=side2, side3=side3){
    circumference <- (side1 + side2 + side3) / 2
    area <- sqrt(circumference * (circumference - side1) * (circumference - side2) * (circumference - side3))
    return(area)
}

# let's carry out a test
my.area(side1=3, side2=4, side3=5)

## [1] 6

Quiz 5

CHALLENGE

Write a function to calculate min-max normalization for a single row. Here is the formula:

Then use apply to normalize every row of data2. You might want to take a small piece of data2 to test it with, e.g. the first 3 rows and first 3 columns.

HARD CHALLENGE

Now take your normalized data and write a function to find the log2-fold change (i.e. the log of the ratio of normalized counts) between any two samples across all genes. Then use one of the apply functions to calculate log2-fold change across ALL samples, given one sample. Finally, use a for loop to find the pair-wise log2-fold changes for every pair of samples.

Topic 6. Simple data visulization in R

Scatter plot and line plot can be produced using the function plot().

x <- c(1:50)
y <- 1 + sqrt(x)/2
plot(x,y)

plot(x,y, type="l")

# plot both the points and lines
## first plot points
plot(x,y)
lines(x,y, type="l")

## lines() can only be used to add information to a graph, while it cannot produce a graph on its own.

boxplot() can be used to summarize data.

boxplot(data3, xlab="Sample ID", ylab="Raw Counts")

x <- rnorm(1000)
boxplot(x)

hist() can be used to create histograms of data.

hist(x)

# use user defined break points
hist(x, breaks=seq(range(x)[1]-1, range(x)[2]+1, by=0.5))

# clear plotting device/area
dev.off()

## null device 
##           1

Topic 7. Install packages in R

Starting from Bioconductor version 3.8, the installation of packages is recommended to use BiocManager.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
## install core packages
BiocManager::install()
## install specific packages
BiocManager::install(c("devtools", "tidyverse","bsseq","DSS"))

Bioconductor has a repository and release schedule that differ from R (Bioconductor has a ‘devel’ branch to which new packages and updates are introduced, and a stable ‘release’ branch emitted once every 6 months to which bug fixes but not new features are introduced). This mismatch causes that the version detected by install.packages() is sometimes not the most recent ‘release’.
A consequence of the distinct ‘devel’ branch is that install.packages() sometimes points only to the ‘release’ repository, while users might want to have access to the leading-edge features in the develop version.
An indirect consequence of Bioconductor’s structured release is that packages generally have more extensive dependences with one another.

It is always recommended to update to the most current version of R and Bioconductor. If it is not possible and R < 3.5.0, please use the legacy approach to install Bioconductor packages

source("http://bioconductor.org/biocLite.R")
## install core packages
biocLite()
## install specific packages
biocLite(c("devtools", "tidyverse","bsseq","DSS"))

The R function install.packages() can be used to install packages that are not part of Bioconductor.

install.packages("ggplot2")
install.packages(c("kableExtra","knitr","dplyr"))

Install from source of github.

library(devtools)
install_github("stephenturner/qqman")

Topic 8. Slightly more advanced visualization

Working with an R notebook, we will load the Iris data as we did earlier in this documentation, generate a table that lists the median of each measurement (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each species. Then we will generate plots based on the result. Finally produce an html report with the table and the plot.

Install some packages

In order to output a nice looking table in the final report, one may consider the R package kableExtra https://github.com/haozhu233/kableExtra. One may also check out the documentation at https://cran.r-project.org/web/packages/htmlTable/vignettes/
Figure 1 can be produced using base R functions: plot(), points(), axis(), text().
Figure 2 can be produced using functions in R package lattice https://cran.r-project.org/web/packages/lattice/index.html
Figure 3 can be produced using function boxplot()

library(lattice)
library(reshape2)
library(kableExtra)
tmp <- sapply(1:4, function(x){tapply(iris[,x], iris[[5]], median)})
colnames(tmp) <- colnames(iris)[1:4]
nms <- colnames(tmp)
kable(data.frame(tmp, stringsAsFactors=F), align='c') %>% kable_styling(bootstrap_options=c("striped", "hover", "responsive"), full_width=F, position="center")

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
setosa	5.0	3.4	1.50	0.2
versicolor	5.9	2.8	4.35	1.3
virginica	6.5	3.0	5.55	2.0

scatter plot of mean measurement

species <- as.vector(levels(iris$Species))
x <- c(1, 2, 3, 4)
plot(x, tmp["setosa",], pch=20, col='red', ylim=c(0, max(tmp)), xaxt="n", xlab="Measurement type", ylab="Measurement results", cex.lab=1.0)
points(x, tmp["virginica",], pch=20, col='orange')
points(x, tmp["versicolor",], pch=20, col='blue')
axis(1, at=x, labels=nms, las=2, cex.axis=0.7)
text(c(1.5,1.5,1.5), c(0, 0.7, 1.4), labels=species, col=c("red", "blue", "orange"), cex=1.5)

scatter plot of measurement by species

dd <- melt(iris)

## Using Species as id variables

xyplot(value ~ variable | Species, data=dd, scales=list(x=list(rot=90)), xlab="Measurements", ylab="Values")

boxplot by group

cols <- c("red", "blue", "orange")
boxplot(value ~ Species + variable, data=dd, col = cols, xaxt="n", yaxt="n", xlab="Measurement Type", ylab="Values")
axis(side=1, labels=FALSE)
axis(side=2, las=2)
text(x=1:12, y=par("usr")[3] - 0.85, labels=c("", "Sepal.Length", "", "", "Sepal.Width", "", "", "Petal.Length", "", "", "Petal.Width", ""), xpd=NA, srt=35, cex=1)
legend("topright", fill=cols, legend=levels(dd$Species))

VISUALIZATION CHALLENGE

Download the gapminder dataset to your Rstudio session using the following URL:

https://github.com/ucdavis-bioinformatics-training/2023-February-Introduction-to-R-for-Bioinformatics/raw/main/R/gapminder.csv

Take a look at the dataset. Subset the data to only look at rows from 1982. Then make a scatterplot of the subset, adding x and y labels. Find out how to log scale the x axis from the plot documentation.

Next, make a named vector of continents to colors. Use the vector to add colors to each point based on the continent for that point. Add a legend to the plot showing which colors correspond to which continent.

Finally, create a function that takes in a numeric vector and a minimum and maximum circle size. Within the function, take the square root of the vector, then use the min-max normalization method to normalize each element of the vector. Once you have normalized data, use those values to scale between the minimum and maximum circle size. Use this function to change the size of the circles to correspond to the population of the country using the “cex” parameter for the scatter plot.

☰ Menu

Introduction to R for Bioinformatics