Topics covered in this introduction to R
- Basic concepts
- Basic data types in R
- Import and export data in R
- Functions in R
- Basic statistics in R
- Simple data visulization in R
- Install packages in R
- Save data in R session
Topic 1. Basic concepts
There are three concepts that we should be familiar with before working in R:
- Operators
- Functions
- Variables
Operators
Operator | Description |
---|---|
<-, = | Assignment |
Operator | Description |
---|---|
+ | Addition |
- | Subtraction |
* | Multiplication |
/ | Division |
^ | Exponent |
%% | Modulus |
%/% | Integer Division |
Operator | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to |
!= | Not equal to |
Operator | Description |
---|---|
! | Logical NOT |
& | Element-wise logical AND |
&& | Logical AND |
| | Element-wise logical OR |
|| | Logical OR |
Functions
Functions are essential in all programming languages. A function takes zero or more parameters and return a result. The way to use a function in R is:
function.name(parameter1=value1, …)
read.delim("rnaseq_workshop_counts_mm.txt", row.names = 1)
mouse_110_WT_C | mouse_110_WT_NC | mouse_148_WT_C | mouse_148_WT_NC | mouse_158_WT_C | mouse_158_WT_NC | mouse_183_KOMIR150_C | mouse_183_KOMIR150_NC | mouse_198_KOMIR150_C | mouse_198_KOMIR150_NC | mouse_206_KOMIR150_C | mouse_206_KOMIR150_NC | mouse_2670_KOTet3_C | mouse_2670_KOTet3_NC | mouse_7530_KOTet3_C | mouse_7530_KOTet3_NC | mouse_7531_KOTet3_C | mouse_7532_WT_NC | mouse_H510_WT_C | mouse_H510_WT_NC | mouse_H514_WT_C | mouse_H514_WT_NC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSMUSG00000102693 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000064842 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000051951 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000102851 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000103377 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ENSMUSG00000104017 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Variables
A variable is a named storage. The name of a variable can have letters, numbers, dot and underscore. However, a valid variable name cannot start with a underscore or a number, or start with a dot that is followed by a number.
Using an assignment operator, we can store the results of a function inside a variable.
counts <- read.delim("rnaseq_workshop_counts_mm.txt", row.names = 1)
CHALLENGE
Which of the following assignments will work? Do you notice anything different about the behavior of any of these variables?
- a <- 1
- b = 2
- .c <- 3
- 1a <- 4
- .1a <- 5
- a.1 <- 6
Topic 2. Basic data types in R
Type | Example |
---|---|
Numeric | 3, 3e-2 |
Character | "ENSMUSG00000102693" |
Logical | TRUE, FALSE |
Factor | any categorical variable |
Simple variables: variables that have a numeric value, a character value (such as a string), or a logical value (True or False)
We can check the type of data stored within a variable using the class function.
a <- 1
class(a)
## [1] "numeric"
a <- "ENSMUSG00000102693"
class(a)
## [1] "character"
CHALLENGE
Try the following assignments. What is the class of each of the variables produced? Why? What is happening when converting between types using the as.numeric, as.logical, and as.character functions?
- a <- 150
- b <- 3e-2
- c <- “ENSMUSG00000102693”
- d <- TRUE
- e <- a < 1
- f <- b < 1
- g <- is.numeric(a)
- h <- is.character(a)
- i <- as.numeric(d)
- j <- as.character(a)
- l <- as.logical(b)
- m <- as.numeric(f)
- n <- as.logical(0)
Vectors: a vector is a combination of multiple values(numeric, character or logical) in the same object. A vector is created using the function c() (for concatenate).
The values in the vector must all be of the same type.
gene_ids <- c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951")
gene_ids
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
of_interest <- c(TRUE, FALSE, FALSE)
of_interest
## [1] TRUE FALSE FALSE
length(of_interest)
## [1] 3
Elements of a vector can be named by providing names to an existing vector, or assigning names when creating a vector.
names(of_interest) <- gene_ids
of_interest
## ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
## TRUE FALSE FALSE
of_interest <- c(ENSMUSG00000102693 = TRUE, ENSMUSG00000064842 = FALSE, ENSMUSG00000051951 = FALSE)
of_interest
## ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
## TRUE FALSE FALSE
Elements of a vector can be accessed by index, or by name if it is a named vector. The element or elements to be accessed are specified using square brackets.
of_interest[2]
## ENSMUSG00000064842
## FALSE
of_interest["ENSMUSG00000064842"]
## ENSMUSG00000064842
## FALSE
of_interest[c(1,3)]
## ENSMUSG00000102693 ENSMUSG00000051951
## TRUE FALSE
of_interest[c("ENSMUSG00000102693", "ENSMUSG00000051951")]
## ENSMUSG00000102693 ENSMUSG00000051951
## TRUE FALSE
Additionally, a subset of a vector can be selected using a logical vector.
gene_ids[of_interest]
## [1] "ENSMUSG00000102693"
NOTE: a vector can only hold elements of the same type.
Factors: a factor represents categorical or groups in data. The function factor() can be used to create a factor variable.
treatment <- factor(c(1,2,1,2))
treatment
## [1] 1 2 1 2
## Levels: 1 2
class(treatment)
## [1] "factor"
In R, categories are called factor levels. The function levels() can be used to access the factor levels.
levels(treatment)
## [1] "1" "2"
Factor levels can also be changed by assigning new levels to a factor.
levels(treatment) <- c("A", "C")
treatment
## [1] A C A C
## Levels: A C
Change the order of levels.
levels(treatment) <- c("C", "A")
treatment
## [1] C A C A
## Levels: C A
By default, the order of factor levels is taken in the order of numeric or alphabetic.
treatment <- factor(c("C", "A", "C", "A"))
treatment
## [1] C A C A
## Levels: A C
The factor levels can be specified when creating the factor, if the order does not follow the default rule.
treatment <- factor(c("C", "A", "C", "A"), levels=c("C", "A"))
treatment
## [1] C A C A
## Levels: C A
If you want to know the number of individuals at each levels, there are two functions.
summary(treatment)
## C A
## 2 2
table(treatment)
## treatment
## C A
## 2 2
Matrices: A matrix is like a speadsheet containing multiple rows and columns. It is used to combine vectors of the same type.
col1 <- c(1,0,0,0)
col2 <- c(0,1,0,0)
col3 <- c(0,0,1,0)
my_matrix <- cbind(col1, col2, col3)
my_matrix
## col1 col2 col3
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [4,] 0 0 0
rownames(my_matrix) <- c("row1", "row2", "row3", "row4")
my_matrix
## col1 col2 col3
## row1 1 0 0
## row2 0 1 0
## row3 0 0 1
## row4 0 0 0
dim(my_matrix) # matrix dimensions
## [1] 4 3
nrow(my_matrix)
## [1] 4
ncol(my_matrix)
## [1] 3
t(my_matrix) # transpose
## row1 row2 row3 row4
## col1 1 0 0 0
## col2 0 1 0 0
## col3 0 0 1 0
Accessing elements of a matrix is done in similar ways to accessing elements of a vector.
my_matrix[1,3]
## [1] 0
my_matrix["row1", "col3"]
## [1] 0
my_matrix[1,]
## col1 col2 col3
## 1 0 0
my_matrix[,3]
## row1 row2 row3 row4
## 0 0 1 0
my_matrix[col3 > 0,]
## col1 col2 col3
## 0 0 1
Matrices may be used in calculations. We will not be focusing on calculations or operations on matrices today, but you may want to explore the results of the code below later.
my_matrix * 3
## col1 col2 col3
## row1 3 0 0
## row2 0 3 0
## row3 0 0 3
## row4 0 0 0
log10(my_matrix)
## col1 col2 col3
## row1 0 -Inf -Inf
## row2 -Inf 0 -Inf
## row3 -Inf -Inf 0
## row4 -Inf -Inf -Inf
Total of each row.
rowSums(my_matrix)
## row1 row2 row3 row4
## 1 1 1 0
Total of each column.
colSums(my_matrix)
## col1 col2 col3
## 1 1 1
It is also possible to use the function apply() to apply any statistical functions to rows/columns of matrices. The advantage of using apply() is that it can take a function created by user.
The simplified format of apply() is as following:
apply(X, MARGIN, FUN)
X: data matrix MARGIN: possible values are 1 (for rows) and 2 (for columns) FUN: the function to apply on rows/columns
We can calculate the mean or the median of each row with the following:
apply(my_matrix, 1, mean)
## row1 row2 row3 row4
## 0.3333333 0.3333333 0.3333333 0.0000000
apply(my_matrix, 1, median)
## row1 row2 row3 row4
## 0 0 0 0
CHALLENGE
How would you calculate the mean and median of each column, instead?
Data frames: a data frame is like a matrix but can have columns with different types (numeric, character, logical).
A data frame can be created using the function data.frame(), from new or previously defined vectors.
genes <- data.frame(ensembl = gene_ids, interest = of_interest)
genes
## ensembl interest
## ENSMUSG00000102693 ENSMUSG00000102693 TRUE
## ENSMUSG00000064842 ENSMUSG00000064842 FALSE
## ENSMUSG00000051951 ENSMUSG00000051951 FALSE
genes <- data.frame(ensembl = c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951"), interest = c(TRUE, FALSE, FALSE))
genes
## ensembl interest
## 1 ENSMUSG00000102693 TRUE
## 2 ENSMUSG00000064842 FALSE
## 3 ENSMUSG00000051951 FALSE
is.data.frame(genes)
## [1] TRUE
To obtain a subset of a data frame can be done in similar ways to vectors and matrices: by index, by row or column names, or by logical values.
genes[1,]
## ensembl interest
## 1 ENSMUSG00000102693 TRUE
genes[, "ensembl"]
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
genes[of_interest,]
## ensembl interest
## 1 ENSMUSG00000102693 TRUE
A column of a data frame can be specified using the “$”.
genes$ensembl
## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
genes[genes$interest,]
## ensembl interest
## 1 ENSMUSG00000102693 TRUE
Data frames are easily extended using the “$” as well.
genes$name <- c("4933401J01Rik", "Gm26206", "Xkr4")
genes
## ensembl interest name
## 1 ENSMUSG00000102693 TRUE 4933401J01Rik
## 2 ENSMUSG00000064842 FALSE Gm26206
## 3 ENSMUSG00000051951 FALSE Xkr4
When we used the function read.delim() at the very beginning, we created a data frame.
class(counts)
## [1] "data.frame"
CHALLENGE
What type of data is stored in counts? Can you access a subset of that data? Can you add a column to counts? Explore counts using the following functions:
- rownames()
- colnames()
- class()
Topic 3. Import and export data in R
R base function read.table() is a general function that can be used to read a file in table format. There are also several variants of read.table that can be used to make reading a file even easier, like read.delim, which we used above to create counts. The data will be imported as a data frame.
# To read a local file, provide read.table with the path for to file location (same as taught in the command line session). Here we assume rnaseq_workshop_counts_mm.txt is in our current working directory
counts2 <- read.table(file="rnaseq_workshop_counts_mm.txt", sep="\t", header=T, stringsAsFactors=F)
# To read a file from the internet, provide read.table with the URL.
counts3 <- read.table(file="https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-mRNA_Seq_Workshop/master/datasets/rnaseq_workshop_counts_mm.txt", sep="\t", header=T, stringsAsFactors=F)
Take a look at the beginning part of the data frame.
head(counts2)
## mouse_110_WT_C mouse_110_WT_NC mouse_148_WT_C
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
## mouse_148_WT_NC mouse_158_WT_C mouse_158_WT_NC
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
## mouse_183_KOMIR150_C mouse_183_KOMIR150_NC
## ENSMUSG00000102693 0 0
## ENSMUSG00000064842 0 0
## ENSMUSG00000051951 0 0
## ENSMUSG00000102851 0 0
## ENSMUSG00000103377 0 0
## ENSMUSG00000104017 0 0
## mouse_198_KOMIR150_C mouse_198_KOMIR150_NC
## ENSMUSG00000102693 0 0
## ENSMUSG00000064842 0 0
## ENSMUSG00000051951 0 0
## ENSMUSG00000102851 0 0
## ENSMUSG00000103377 0 0
## ENSMUSG00000104017 0 0
## mouse_206_KOMIR150_C mouse_206_KOMIR150_NC
## ENSMUSG00000102693 0 0
## ENSMUSG00000064842 0 0
## ENSMUSG00000051951 0 0
## ENSMUSG00000102851 0 0
## ENSMUSG00000103377 0 0
## ENSMUSG00000104017 0 0
## mouse_2670_KOTet3_C mouse_2670_KOTet3_NC mouse_7530_KOTet3_C
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
## mouse_7530_KOTet3_NC mouse_7531_KOTet3_C mouse_7532_WT_NC
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
## mouse_H510_WT_C mouse_H510_WT_NC mouse_H514_WT_C
## ENSMUSG00000102693 0 0 0
## ENSMUSG00000064842 0 0 0
## ENSMUSG00000051951 0 0 0
## ENSMUSG00000102851 0 0 0
## ENSMUSG00000103377 0 0 0
## ENSMUSG00000104017 0 0 0
## mouse_H514_WT_NC
## ENSMUSG00000102693 0
## ENSMUSG00000064842 0
## ENSMUSG00000051951 0
## ENSMUSG00000102851 0
## ENSMUSG00000103377 0
## ENSMUSG00000104017 0
Depending on the format of the file, several variants of read.table() are available to make reading a file easier.
read.csv(): for reading “comma separated value” files (.csv).
read.csv2(): variant used in countries that use a comma “,” as decimal point and a semicolon “;” as field separators.
read.delim(): for reading “tab separated value” files (“.txt”). By default, point(“.”) is used as decimal point.
read.delim2(): for reading “tab separated value” files (“.txt”). By default, comma (“,”) is used as decimal point.
Choosing the correct function (or parameters) is important! If we use read.csv()
to read our tab-delimited file, it becomes a mess.
counts2 <- read.csv(file="rnaseq_workshop_counts_mm.txt", stringsAsFactors=F)
head(counts2)
## mouse_110_WT_C.mouse_110_WT_NC.mouse_148_WT_C.mouse_148_WT_NC.mouse_158_WT_C.mouse_158_WT_NC.mouse_183_KOMIR150_C.mouse_183_KOMIR150_NC.mouse_198_KOMIR150_C.mouse_198_KOMIR150_NC.mouse_206_KOMIR150_C.mouse_206_KOMIR150_NC.mouse_2670_KOTet3_C.mouse_26 ...
## 1 ENSMUSG00000102693\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 2 ENSMUSG00000064842\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 3 ENSMUSG00000051951\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 4 ENSMUSG00000102851\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 5 ENSMUSG00000103377\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
## 6 ENSMUSG00000104017\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0
Since the data contained in these files is the same, we don’t need to keep three copies.
rm(counts2, counts3)
R base function write.table() can be used to export data to a file.
# To write to a file called "genes.txt" in your current working directory.
write.table(genes, file="genes.txt", sep="\t", quote=F, row.names=T, col.names=T)
It is also possible to export data to a csv file.
write.csv()
write.csv2()
Topic 4. Functions in R
We have already used a number of functions, including read.delim, class, mean, and many others. Functions can be invoked with 0 or more arguments.
# No arguments: use getwd() to find out the current working directory
getwd()
## [1] "/Users/hannah/Desktop/workshop_de_analysis_2020_07_30"
# No arguments: use ls() to list all variables in the environment
ls()
## [1] "a" "col1" "col2" "col3" "colFmt"
## [6] "counts" "gene_ids" "genes" "my_matrix" "of_interest"
## [11] "treatment"
# One argument: use str() to find out information on a variable
str(genes)
## 'data.frame': 3 obs. of 3 variables:
## $ ensembl : chr "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"
## $ interest: logi TRUE FALSE FALSE
## $ name : chr "4933401J01Rik" "Gm26206" "Xkr4"
# More arguments: use seq() to generate a vector of sequential numbers
?seq # how does seq() work?
seq(from = 2, to = 3, by = 0.1)
## [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
seq(2, 3, 0.1) # less explicit, same result
## [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
One useful function to find out information on a variable: str().
Topic 5. Basic statistics in R
Description | R_function |
---|---|
Mean | mean() |
Standard deviation | sd() |
Variance | var() |
Minimum | min() |
Maximum | max() |
Median | median() |
Range of values: minimum and maximum | range() |
Sample quantiles | quantile() |
Generic function | summary() |
Interquartile range | IQR() |
Calculate the mean expression for each sample.
apply(counts, 2, mean)
## mouse_110_WT_C mouse_110_WT_NC mouse_148_WT_C
## 42.06295 51.04228 50.62418
## mouse_148_WT_NC mouse_158_WT_C mouse_158_WT_NC
## 47.02723 53.47496 47.81311
## mouse_183_KOMIR150_C mouse_183_KOMIR150_NC mouse_198_KOMIR150_C
## 45.60650 33.48397 51.08072
## mouse_198_KOMIR150_NC mouse_206_KOMIR150_C mouse_206_KOMIR150_NC
## 52.55656 24.99558 17.16092
## mouse_2670_KOTet3_C mouse_2670_KOTet3_NC mouse_7530_KOTet3_C
## 52.23667 52.85384 46.98483
## mouse_7530_KOTet3_NC mouse_7531_KOTet3_C mouse_7532_WT_NC
## 51.88361 47.93512 48.82374
## mouse_H510_WT_C mouse_H510_WT_NC mouse_H514_WT_C
## 46.50284 51.01271 41.38043
## mouse_H514_WT_NC
## 47.66608
CHALLENGE
- What is the highest number of counts for a gene in sample mouse_110_WT_C?
- What is the range of counts for ENSMUSG00000026739?
Topic 6. Simple data visulization in R
Scatter plot and line plot can be produced using the function plot().
x <- c(1:50)
y <- 1 + sqrt(x)/2
plot(x,y)
plot(x,y, type="l")
# plot both the points and lines
## first plot points
plot(x,y)
lines(x,y, type="l")
## lines() can only be used to add information to a graph, while it cannot produce a graph on its own.
boxplot() can be used to summarize data.
boxplot(counts, xlab="Sample ID", ylab="Counts")
hist() can be used to create histograms of data.
hist(counts$mouse_110_WT_C)
# use user defined break points
hist(counts$mouse_110_WT_C, breaks=seq(0, 25000, by=50))
# lower counts only
hist(counts$mouse_110_WT_C, breaks=seq(0, 25000, by=5), xlim = c(0, 250))
Topic 7. Install packages in R
Starting from Bioconductor version 3.8, the installation of packages is recommended to use BiocManager.
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
# install core packages
BiocManager::install()
# install specific packages
BiocManager::install(c("ggplot2", "ShortRead"))
-
Bioconductor has a repository and release schedule that differ from R (Bioconductor has a ‘devel’ branch to which new packages and updates are introduced, and a stable ‘release’ branch emitted once every 6 months to which bug fixes but not new features are introduced). This mismatch causes that the version detected by install.packages() is sometimes not the most recent ‘release’.
-
A consequence of the ‘devel’ branch is that install.packages() sometimes points only to the ‘release’ repository, while users might want to have access to the leading-edge features in the ‘devel’ version.
-
An indirect consequence of Bioconductor’s structured release is that packages generally have more extensive dependences with one another.
It is always recommended to update to the most current version of R and Bioconductor. If it is not possible and R < 3.5.0, please use the legacy approach to install Bioconductor packages
source("http://bioconductor.org/biocLite.R")
# install core packages
biocLite()
# install specific packages
biocLite("RCircos")
biocLite(c("IdeoViz", "devtools"))
The R function install.packages() can be used to install packages that are not part of Bioconductor.
install.packages("ggplot2", repos="http://cran.us.r-project.org")
Topic 8. Save data in R session
To save objects in R session
save(list=c("genes", "counts"), file="intro_20200603.RData")