☰ Menu

      mRNA-Seq Workshop

Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
What is Bioinformatics/Genomics?
Experimental Design and Cost Estimation
RNA Sequencing Technologies - Dr. Lutz Froenicke
Support
Using Zoom
Using Slack
Cheat Sheets
Software and Links
Scripts
Prerequisites
CLI - Logging in and Transferring Files
CLI - Intro to Command-Line
CLI - Advanced Command-Line (extra)
CLI - Running jobs on the Cluster and using modules
R - Getting Started
R - Intro to R
R - Prepare Data in R (extra)
R - Data in R (extra)
More Materials (extra)
Data Reduction
Files and Filetypes
Prepare dataset
Preprocessing raw data
Indexing a Genome
Alignment with Star
Generating counts tables
Alignment/Counts with Salmon (Extra)
Data analysis
Prepare R for data analysis
Annotation from BioMart
Differential Expression Analysis
Pathway Analysis
Comparison between STAR and Salmon
ETC
Closing thoughts
Workshop Photos
Github page
Report Errors
Biocore website

Topics covered in this introduction to R

  1. Basic concepts
  2. Basic data types in R
  3. Import and export data in R
  4. Functions in R
  5. Basic statistics in R
  6. Simple data visulization in R
  7. Install packages in R
  8. Save data in R session

Topic 1. Basic concepts

There are three concepts that we should be familiar with before working in R:

Operators

Assignment Operators in R
Operator Description
<-, = Assignment
Arithmetic Operators in R
Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponent
%% Modulus
%/% Integer Division
Relational Operators in R
Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
Logical Operators in R
Operator Description
! Logical NOT
& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
|| Logical OR

Functions

Functions are essential in all programming languages. A function takes zero or more parameters and return a result. The way to use a function in R is:

function.name(parameter1=value1, …)

read.delim("rnaseq_workshop_counts_mm.txt", row.names = 1)
mouse_110_WT_C mouse_110_WT_NC mouse_148_WT_C mouse_148_WT_NC mouse_158_WT_C mouse_158_WT_NC mouse_183_KOMIR150_C mouse_183_KOMIR150_NC mouse_198_KOMIR150_C mouse_198_KOMIR150_NC mouse_206_KOMIR150_C mouse_206_KOMIR150_NC mouse_2670_KOTet3_C mouse_2670_KOTet3_NC mouse_7530_KOTet3_C mouse_7530_KOTet3_NC mouse_7531_KOTet3_C mouse_7532_WT_NC mouse_H510_WT_C mouse_H510_WT_NC mouse_H514_WT_C mouse_H514_WT_NC
ENSMUSG00000102693 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000064842 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000051951 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000102851 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000103377 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ENSMUSG00000104017 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Variables

A variable is a named storage. The name of a variable can have letters, numbers, dot and underscore. However, a valid variable name cannot start with a underscore or a number, or start with a dot that is followed by a number.

Using an assignment operator, we can store the results of a function inside a variable.

counts <- read.delim("rnaseq_workshop_counts_mm.txt", row.names = 1)
CHALLENGE

Which of the following assignments will work? Do you notice anything different about the behavior of any of these variables?


Topic 2. Basic data types in R

Basic Data Types
Type Example
Numeric 3, 3e-2
Character "ENSMUSG00000102693"
Logical TRUE, FALSE
Factor any categorical variable

Simple variables: variables that have a numeric value, a character value (such as a string), or a logical value (True or False)

We can check the type of data stored within a variable using the class function.

a <- 1
class(a)
## [1] "numeric"
a <- "ENSMUSG00000102693"
class(a)
## [1] "character"
CHALLENGE

Try the following assignments. What is the class of each of the variables produced? Why? What is happening when converting between types using the as.numeric, as.logical, and as.character functions?

Vectors: a vector is a combination of multiple values(numeric, character or logical) in the same object. A vector is created using the function c() (for concatenate).

The values in the vector must all be of the same type.

gene_ids <- c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951")
gene_ids
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
of_interest <- c(TRUE, FALSE, FALSE)
of_interest
## [1]  TRUE FALSE FALSE
length(of_interest)
## [1] 3

Elements of a vector can be named by providing names to an existing vector, or assigning names when creating a vector.

names(of_interest) <- gene_ids
of_interest
## ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
##               TRUE              FALSE              FALSE
of_interest <- c(ENSMUSG00000102693 = TRUE, ENSMUSG00000064842 = FALSE, ENSMUSG00000051951 = FALSE)
of_interest
## ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
##               TRUE              FALSE              FALSE

Elements of a vector can be accessed by index, or by name if it is a named vector. The element or elements to be accessed are specified using square brackets.

of_interest[2]
## ENSMUSG00000064842
##              FALSE
of_interest["ENSMUSG00000064842"]
## ENSMUSG00000064842
##              FALSE
of_interest[c(1,3)]
## ENSMUSG00000102693 ENSMUSG00000051951
##               TRUE              FALSE
of_interest[c("ENSMUSG00000102693", "ENSMUSG00000051951")]
## ENSMUSG00000102693 ENSMUSG00000051951
##               TRUE              FALSE

Additionally, a subset of a vector can be selected using a logical vector.

gene_ids[of_interest]
## [1] "ENSMUSG00000102693"
NOTE: a vector can only hold elements of the same type.

Factors: a factor represents categorical or groups in data. The function factor() can be used to create a factor variable.

treatment <- factor(c(1,2,1,2))
treatment
## [1] 1 2 1 2
## Levels: 1 2
class(treatment)
## [1] "factor"

In R, categories are called factor levels. The function levels() can be used to access the factor levels.

levels(treatment)
## [1] "1" "2"

Factor levels can also be changed by assigning new levels to a factor.

levels(treatment) <- c("A", "C")
treatment
## [1] A C A C
## Levels: A C

Change the order of levels.

levels(treatment) <- c("C", "A")
treatment
## [1] C A C A
## Levels: C A

By default, the order of factor levels is taken in the order of numeric or alphabetic.

treatment <- factor(c("C", "A", "C", "A"))
treatment
## [1] C A C A
## Levels: A C

The factor levels can be specified when creating the factor, if the order does not follow the default rule.

treatment <- factor(c("C", "A", "C", "A"), levels=c("C", "A"))
treatment
## [1] C A C A
## Levels: C A

If you want to know the number of individuals at each levels, there are two functions.

summary(treatment)
## C A
## 2 2
table(treatment)
## treatment
## C A
## 2 2

Matrices: A matrix is like a speadsheet containing multiple rows and columns. It is used to combine vectors of the same type.

col1 <- c(1,0,0,0)
col2 <- c(0,1,0,0)
col3 <- c(0,0,1,0)

my_matrix <- cbind(col1, col2, col3)
my_matrix
##      col1 col2 col3
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## [4,]    0    0    0
rownames(my_matrix) <- c("row1", "row2", "row3", "row4")
my_matrix
##      col1 col2 col3
## row1    1    0    0
## row2    0    1    0
## row3    0    0    1
## row4    0    0    0
dim(my_matrix) # matrix dimensions
## [1] 4 3
nrow(my_matrix)
## [1] 4
ncol(my_matrix)
## [1] 3
t(my_matrix) # transpose
##      row1 row2 row3 row4
## col1    1    0    0    0
## col2    0    1    0    0
## col3    0    0    1    0

Accessing elements of a matrix is done in similar ways to accessing elements of a vector.

my_matrix[1,3]
## [1] 0
my_matrix["row1", "col3"]
## [1] 0
my_matrix[1,]
## col1 col2 col3
##    1    0    0
my_matrix[,3]
## row1 row2 row3 row4
##    0    0    1    0
my_matrix[col3 > 0,]
## col1 col2 col3
##    0    0    1

Matrices may be used in calculations. We will not be focusing on calculations or operations on matrices today, but you may want to explore the results of the code below later.

my_matrix * 3
##      col1 col2 col3
## row1    3    0    0
## row2    0    3    0
## row3    0    0    3
## row4    0    0    0
log10(my_matrix)
##      col1 col2 col3
## row1    0 -Inf -Inf
## row2 -Inf    0 -Inf
## row3 -Inf -Inf    0
## row4 -Inf -Inf -Inf

Total of each row.

rowSums(my_matrix)
## row1 row2 row3 row4
##    1    1    1    0

Total of each column.

colSums(my_matrix)
## col1 col2 col3
##    1    1    1

It is also possible to use the function apply() to apply any statistical functions to rows/columns of matrices. The advantage of using apply() is that it can take a function created by user.

The simplified format of apply() is as following:

apply(X, MARGIN, FUN)

X: data matrix MARGIN: possible values are 1 (for rows) and 2 (for columns) FUN: the function to apply on rows/columns

We can calculate the mean or the median of each row with the following:

apply(my_matrix, 1, mean)
##      row1      row2      row3      row4
## 0.3333333 0.3333333 0.3333333 0.0000000
apply(my_matrix, 1, median)
## row1 row2 row3 row4
##    0    0    0    0
CHALLENGE

How would you calculate the mean and median of each column, instead?

Data frames: a data frame is like a matrix but can have columns with different types (numeric, character, logical).

A data frame can be created using the function data.frame(), from new or previously defined vectors.

genes <- data.frame(ensembl = gene_ids, interest = of_interest)
genes
##                               ensembl interest
## ENSMUSG00000102693 ENSMUSG00000102693     TRUE
## ENSMUSG00000064842 ENSMUSG00000064842    FALSE
## ENSMUSG00000051951 ENSMUSG00000051951    FALSE
genes <- data.frame(ensembl = c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951"), interest = c(TRUE, FALSE, FALSE))
genes
##              ensembl interest
## 1 ENSMUSG00000102693     TRUE
## 2 ENSMUSG00000064842    FALSE
## 3 ENSMUSG00000051951    FALSE
is.data.frame(genes)
## [1] TRUE

To obtain a subset of a data frame can be done in similar ways to vectors and matrices: by index, by row or column names, or by logical values.

genes[1,]
##              ensembl interest
## 1 ENSMUSG00000102693     TRUE
genes[, "ensembl"]
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
genes[of_interest,]
##              ensembl interest
## 1 ENSMUSG00000102693     TRUE

A column of a data frame can be specified using the “$”.

genes$ensembl
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
genes[genes$interest,]
##              ensembl interest
## 1 ENSMUSG00000102693     TRUE

Data frames are easily extended using the “$” as well.

genes$name <- c("4933401J01Rik", "Gm26206", "Xkr4")
genes
##              ensembl interest          name
## 1 ENSMUSG00000102693     TRUE 4933401J01Rik
## 2 ENSMUSG00000064842    FALSE       Gm26206
## 3 ENSMUSG00000051951    FALSE          Xkr4

When we used the function read.delim() at the very beginning, we created a data frame.

class(counts)
## [1] "data.frame"
CHALLENGE

What type of data is stored in counts? Can you access a subset of that data? Can you add a column to counts? Explore counts using the following functions:


Topic 3. Import and export data in R

R base function read.table() is a general function that can be used to read a file in table format. There are also several variants of read.table that can be used to make reading a file even easier, like read.delim, which we used above to create counts. The data will be imported as a data frame.

# To read a local file, provide read.table with the path for to file location (same as taught in the command line session). Here we assume rnaseq_workshop_counts_mm.txt is in our current working directory
counts2 <- read.table(file="rnaseq_workshop_counts_mm.txt", sep="\t", header=T, stringsAsFactors=F)

# To read a file from the internet, provide read.table with the URL.
counts3 <- read.table(file="https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-mRNA_Seq_Workshop/master/datasets/rnaseq_workshop_counts_mm.txt", sep="\t", header=T, stringsAsFactors=F)

Take a look at the beginning part of the data frame.

head(counts2)
##                    mouse_110_WT_C mouse_110_WT_NC mouse_148_WT_C
## ENSMUSG00000102693              0               0              0
## ENSMUSG00000064842              0               0              0
## ENSMUSG00000051951              0               0              0
## ENSMUSG00000102851              0               0              0
## ENSMUSG00000103377              0               0              0
## ENSMUSG00000104017              0               0              0
##                    mouse_148_WT_NC mouse_158_WT_C mouse_158_WT_NC
## ENSMUSG00000102693               0              0               0
## ENSMUSG00000064842               0              0               0
## ENSMUSG00000051951               0              0               0
## ENSMUSG00000102851               0              0               0
## ENSMUSG00000103377               0              0               0
## ENSMUSG00000104017               0              0               0
##                    mouse_183_KOMIR150_C mouse_183_KOMIR150_NC
## ENSMUSG00000102693                    0                     0
## ENSMUSG00000064842                    0                     0
## ENSMUSG00000051951                    0                     0
## ENSMUSG00000102851                    0                     0
## ENSMUSG00000103377                    0                     0
## ENSMUSG00000104017                    0                     0
##                    mouse_198_KOMIR150_C mouse_198_KOMIR150_NC
## ENSMUSG00000102693                    0                     0
## ENSMUSG00000064842                    0                     0
## ENSMUSG00000051951                    0                     0
## ENSMUSG00000102851                    0                     0
## ENSMUSG00000103377                    0                     0
## ENSMUSG00000104017                    0                     0
##                    mouse_206_KOMIR150_C mouse_206_KOMIR150_NC
## ENSMUSG00000102693                    0                     0
## ENSMUSG00000064842                    0                     0
## ENSMUSG00000051951                    0                     0
## ENSMUSG00000102851                    0                     0
## ENSMUSG00000103377                    0                     0
## ENSMUSG00000104017                    0                     0
##                    mouse_2670_KOTet3_C mouse_2670_KOTet3_NC mouse_7530_KOTet3_C
## ENSMUSG00000102693                   0                    0                   0
## ENSMUSG00000064842                   0                    0                   0
## ENSMUSG00000051951                   0                    0                   0
## ENSMUSG00000102851                   0                    0                   0
## ENSMUSG00000103377                   0                    0                   0
## ENSMUSG00000104017                   0                    0                   0
##                    mouse_7530_KOTet3_NC mouse_7531_KOTet3_C mouse_7532_WT_NC
## ENSMUSG00000102693                    0                   0                0
## ENSMUSG00000064842                    0                   0                0
## ENSMUSG00000051951                    0                   0                0
## ENSMUSG00000102851                    0                   0                0
## ENSMUSG00000103377                    0                   0                0
## ENSMUSG00000104017                    0                   0                0
##                    mouse_H510_WT_C mouse_H510_WT_NC mouse_H514_WT_C
## ENSMUSG00000102693               0                0               0
## ENSMUSG00000064842               0                0               0
## ENSMUSG00000051951               0                0               0
## ENSMUSG00000102851               0                0               0
## ENSMUSG00000103377               0                0               0
## ENSMUSG00000104017               0                0               0
##                    mouse_H514_WT_NC
## ENSMUSG00000102693                0
## ENSMUSG00000064842                0
## ENSMUSG00000051951                0
## ENSMUSG00000102851                0
## ENSMUSG00000103377                0
## ENSMUSG00000104017                0

Depending on the format of the file, several variants of read.table() are available to make reading a file easier.

read.csv(): for reading “comma separated value” files (.csv).

read.csv2(): variant used in countries that use a comma “,” as decimal point and a semicolon “;” as field separators.

read.delim(): for reading “tab separated value” files (“.txt”). By default, point(“.”) is used as decimal point.

read.delim2(): for reading “tab separated value” files (“.txt”). By default, comma (“,”) is used as decimal point.

Choosing the correct function (or parameters) is important! If we use read.csv() to read our tab-delimited file, it becomes a mess.

counts2 <- read.csv(file="rnaseq_workshop_counts_mm.txt", stringsAsFactors=F)

head(counts2)
##   mouse_110_WT_C.mouse_110_WT_NC.mouse_148_WT_C.mouse_148_WT_NC.mouse_158_WT_C.mouse_158_WT_NC.mouse_183_KOMIR150_C.mouse_183_KOMIR150_NC.mouse_198_KOMIR150_C.mouse_198_KOMIR150_NC.mouse_206_KOMIR150_C.mouse_206_KOMIR150_NC.mouse_2670_KOTet3_C.mouse_26 ...
## 1                                                                                                                                                                           ENSMUSG00000102693\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 2                                                                                                                                                                           ENSMUSG00000064842\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 3                                                                                                                                                                           ENSMUSG00000051951\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 4                                                                                                                                                                           ENSMUSG00000102851\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 5                                                                                                                                                                           ENSMUSG00000103377\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 6                                                                                                                                                                           ENSMUSG00000104017\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0

Since the data contained in these files is the same, we don’t need to keep three copies.

rm(counts2, counts3)

R base function write.table() can be used to export data to a file.

# To write to a file called "genes.txt" in your current working directory.
write.table(genes, file="genes.txt", sep="\t", quote=F, row.names=T, col.names=T)

It is also possible to export data to a csv file.

write.csv()

write.csv2()


Topic 4. Functions in R

We have already used a number of functions, including read.delim, class, mean, and many others. Functions can be invoked with 0 or more arguments.

# No arguments: use getwd() to find out the current working directory
getwd()
## [1] "/Users/hannah/Desktop/workshop_de_analysis_2020_07_30"
# No arguments: use ls() to list all variables in the environment
ls()
##  [1] "a"           "col1"        "col2"        "col3"        "colFmt"     
##  [6] "counts"      "gene_ids"    "genes"       "my_matrix"   "of_interest"
## [11] "treatment"
# One argument: use str() to find out information on a variable
str(genes)
## 'data.frame':	3 obs. of  3 variables:
##  $ ensembl : chr  "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
##  $ interest: logi  TRUE FALSE FALSE
##  $ name    : chr  "4933401J01Rik" "Gm26206" "Xkr4"
# More arguments: use seq() to generate a vector of sequential numbers
?seq # how does seq() work?
seq(from = 2, to = 3, by = 0.1)
##  [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
seq(2, 3, 0.1) # less explicit, same result
##  [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0

One useful function to find out information on a variable: str().


Topic 5. Basic statistics in R

Description R_function
Mean mean()
Standard deviation sd()
Variance var()
Minimum min()
Maximum max()
Median median()
Range of values: minimum and maximum range()
Sample quantiles quantile()
Generic function summary()
Interquartile range IQR()

Calculate the mean expression for each sample.

apply(counts, 2, mean)
##        mouse_110_WT_C       mouse_110_WT_NC        mouse_148_WT_C
##              42.06295              51.04228              50.62418
##       mouse_148_WT_NC        mouse_158_WT_C       mouse_158_WT_NC
##              47.02723              53.47496              47.81311
##  mouse_183_KOMIR150_C mouse_183_KOMIR150_NC  mouse_198_KOMIR150_C
##              45.60650              33.48397              51.08072
## mouse_198_KOMIR150_NC  mouse_206_KOMIR150_C mouse_206_KOMIR150_NC
##              52.55656              24.99558              17.16092
##   mouse_2670_KOTet3_C  mouse_2670_KOTet3_NC   mouse_7530_KOTet3_C
##              52.23667              52.85384              46.98483
##  mouse_7530_KOTet3_NC   mouse_7531_KOTet3_C      mouse_7532_WT_NC
##              51.88361              47.93512              48.82374
##       mouse_H510_WT_C      mouse_H510_WT_NC       mouse_H514_WT_C
##              46.50284              51.01271              41.38043
##      mouse_H514_WT_NC
##              47.66608
CHALLENGE

Topic 6. Simple data visulization in R

Scatter plot and line plot can be produced using the function plot().

x <- c(1:50)
y <- 1 + sqrt(x)/2
plot(x,y)

plot(x,y, type="l")

# plot both the points and lines
## first plot points
plot(x,y)
lines(x,y, type="l")

## lines() can only be used to add information to a graph, while it cannot produce a graph on its own.

boxplot() can be used to summarize data.

boxplot(counts, xlab="Sample ID", ylab="Counts")

hist() can be used to create histograms of data.

hist(counts$mouse_110_WT_C)

# use user defined break points
hist(counts$mouse_110_WT_C, breaks=seq(0, 25000, by=50))

# lower counts only
hist(counts$mouse_110_WT_C, breaks=seq(0, 25000, by=5), xlim = c(0, 250))


Topic 7. Install packages in R

Starting from Bioconductor version 3.8, the installation of packages is recommended to use BiocManager.

if (!requireNamespace("BiocManager"))
	install.packages("BiocManager")
# install core packages
BiocManager::install()
# install specific packages
BiocManager::install(c("ggplot2", "ShortRead"))
source("http://bioconductor.org/biocLite.R")
# install core packages
biocLite()
# install specific packages
biocLite("RCircos")
biocLite(c("IdeoViz", "devtools"))

The R function install.packages() can be used to install packages that are not part of Bioconductor.

install.packages("ggplot2", repos="http://cran.us.r-project.org")

Topic 8. Save data in R session

To save objects in R session

save(list=c("genes", "counts"), file="intro_20200603.RData")