To download the R notebook for this portion of the course, execute the following command in your R console:

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-August-intro-scRNAseq/master/prerequisites/intro2R/Intro2R.Rmd", "Intro2R.Rmd")

Topics covered in this introduction to R

Basic concepts
Basic data types in R
Import and export data in R
Basic statistics in R
Simple data visualization in R
Install packages in R

Topic 1. Basic concepts

Everything we will deal with in R during this workshop is either:

an operator
a function
a variable / an object

Operators

We use operators to tell R to perform arithmetic or logical operations. Arithmetic operations are things like addition and subtraction. Logical operations allow us to ask R questions like “which samples came from females?”

Assignment Operators in R
Operator	Description
<-, =	Assignment

Arithmetic Operators in R
Operator	Description
+	Addition
-	Subtraction
*	Multiplication
/	Division
^	Exponent
%%	Modulus
%/%	Integer Division

Relational Operators in R
Operator	Description
<	Less than
>	Greater than
<=	Less than or equal to
>=	Greater than or equal to
==	Equal to
!=	Not equal to

Logical Operators in R
Operator	Description
!	Logical NOT
&	Element-wise logical AND
&&	Logical AND
|	Element-wise logical OR
||	Logical OR

# arithmetic
1 + 3

## [1] 4

# logical
1 > 3

## [1] FALSE
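
The less familiar operators in the tables above are easier to understand with an example. The following is a minimal sketch; the variable names and the values in expr are made up for illustration and are not part of the workshop data.

```r
# modulus and integer division split a division into remainder and quotient
remainder <- 7 %% 2    # 1
quotient <- 7 %/% 2    # 3

# hypothetical expression values for three samples
expr <- c(1, 5, 10)

# "&" and "|" compare element by element and return one value per element
in_range <- expr > 2 & expr < 8        # FALSE TRUE FALSE
out_of_range <- expr <= 2 | expr >= 8  # TRUE FALSE TRUE

# "&&" and "||" expect single TRUE/FALSE values, as used in if() statements
both_true <- (remainder == 1) && (quotient == 3)  # TRUE
```

As a rule of thumb, use the element-wise forms when working with vectors and the double forms when testing a single condition.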

Functions

Functions are essential in all programming languages. A function takes zero or more parameters and returns a result. The way to use a function in R is:

function.name(parameter1=value1, …)

Help documentation about functions can be displayed by putting a “?” before the function name.

?getwd # what does getwd() do?
getwd()

## [1] "/Users/hannah/Documents/Jobs/bioinformatics_training_program/workshop_documentation/intro_scRNA_aug/prerequisites/intro2R"

ls()

## [1] "colFmt"

sum(1, 3)

## [1] 4

seq(from = 2, to = 3, by = 0.1)

##  [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
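
Parameters can be matched by name or by position. The sketch below uses the same seq() call as above to show that the different styles give the same result; the variable names are only for illustration.

```r
# parameters given by name
by_name <- seq(from = 2, to = 3, by = 0.1)

# parameters given by position, in the order listed in ?seq
by_position <- seq(2, 3, 0.1)

# named parameters can be given in any order
reordered <- seq(by = 0.1, to = 3, from = 2)

identical(by_name, by_position)  # TRUE
identical(by_name, reordered)    # TRUE

# parameters that have a default value can be left out; here "by" defaults to 1
one_to_five <- seq(from = 1, to = 5)
```

Naming parameters takes more typing, but it makes code easier to read and protects against mixing up the order.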

Variables

Variables (and objects) give us a way to store data for later use and refer to it by name. The name of a variable can contain letters, numbers, dots, and underscores. However, a valid variable name cannot start with an underscore or a number, or with a dot that is followed by a number.

Using an assignment operator, we can store the result of a function in a variable. In many cases, the assignment operators “<-” and “=” are equivalent. There are, however, some differences. In this introductory material, we will use “<-” for assignment and “=” to define parameters within functions.

total <- sum(1, 3)
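
One of the differences between the two operators shows up inside a function call. The following is a small sketch; doubled and vals_arrow are names chosen for this example only.

```r
# a stored result can be reused by name
total <- sum(1, 3)
doubled <- total * 2   # 8

# inside a call, "=" only names the parameter; no variable called x is created
mean(x = c(2, 4, 6))

# inside a call, "<-" creates a variable AND passes its value to the function
mean(vals_arrow <- c(2, 4, 6))
vals_arrow             # the vector now exists in the workspace
```

This is the main reason for the convention used here: “<-” for assignment, “=” for parameters.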

CHALLENGE

Which of the following assignments will work? Do you notice anything different about the behavior of any of these variables?

a <- 1
b = 2
.c <- 3
1a <- 4
.1a <- 5
a.1 <- 6

Topic 2. Basic data types in R

The data types we will encounter in this workshop are numeric, character, logical, and factor.

Basic Data Types
Type	Example
Numeric	3, 3e-2
Character	"ENSMUSG00000102693"
Logical	TRUE, FALSE
Factor	any categorical variable

We will cover factors in greater depth later. First let’s explore the other data types. To check the type of a variable, use the class function.

a <- 1
class(a)

## [1] "numeric"

a <- "ENSMUSG00000102693"
class(a)

## [1] "character"

CHALLENGE

Try the following assignments. What is the class of each of the variables produced? Why? What is happening when converting between types using the as.numeric, as.logical, and as.character functions?

a <- 150
b <- 3e-2
c <- "ENSMUSG00000102693"
d <- TRUE
e <- a < 1
f <- b < 1
g <- is.numeric(a)
h <- is.character(a)
i <- as.numeric(d)
j <- as.character(a)
l <- as.logical(b)
m <- as.numeric(f)
n <- as.logical(0)

To store more than a single data point, R uses data structures, including vectors, matrices, data frames, lists, and complex objects. We will focus on vectors and data frames for now.

Vectors

Vectors store a group of values of the same type within a single object. A vector is created by combining values using the function c().

gene_ids <- c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951")
gene_ids

## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"

of_interest <- c(TRUE, FALSE, FALSE)
of_interest

## [1]  TRUE FALSE FALSE

length(of_interest)

## [1] 3
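
Because all values in a vector must share one type, R silently converts mixed input to the most general type present. A short sketch with made-up values:

```r
# numbers and logicals combined with a string all become character
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"

# logicals combined with numbers become numeric (TRUE is 1, FALSE is 0)
nums <- c(1, TRUE, FALSE)
class(nums)     # "numeric"
```

This conversion happens without a warning, so it is worth checking class() after building a vector from mixed sources.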

Elements of a vector can be named by providing names to an existing vector, or assigning names when creating a vector.

names(of_interest) <- gene_ids
of_interest

## ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
##               TRUE              FALSE              FALSE

of_interest <- c(ENSMUSG00000102693 = TRUE, ENSMUSG00000064842 = FALSE, ENSMUSG00000051951 = FALSE)
of_interest

## ENSMUSG00000102693 ENSMUSG00000064842 ENSMUSG00000051951
##               TRUE              FALSE              FALSE

Elements of a vector can be accessed by index, or by name if it is a named vector. The element or elements to be accessed are specified using square brackets.

of_interest[2]

## ENSMUSG00000064842
##              FALSE

of_interest["ENSMUSG00000064842"]

## ENSMUSG00000064842
##              FALSE

of_interest[c(1,3)]

## ENSMUSG00000102693 ENSMUSG00000051951
##               TRUE              FALSE

of_interest[c("ENSMUSG00000102693", "ENSMUSG00000051951")]

## ENSMUSG00000102693 ENSMUSG00000051951
##               TRUE              FALSE

Additionally, a subset of a vector can be selected using a logical vector.

gene_ids[of_interest]

## [1] "ENSMUSG00000102693"
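
In practice, the logical vector is usually produced by a comparison rather than typed by hand. The sketch below assumes a hypothetical named vector of counts; the gene names and values are invented for illustration.

```r
# hypothetical read counts for three genes
counts <- c(geneA = 0, geneB = 12, geneC = 3)

# a comparison on a vector returns a logical vector of the same length
expressed <- counts > 0

# use the logical vector to keep only the elements where it is TRUE
counts[expressed]

# the same logical vector can subset any vector of matching length
names(counts)[expressed]

# which() converts a logical vector into the positions that are TRUE
which(expressed)
```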

A factor represents categorical data, or groups in data. The function factor() can be used to create a factor.

treatment <- factor(c(1,2,1,2))
treatment

## [1] 1 2 1 2
## Levels: 1 2

class(treatment)

## [1] "factor"

In R, categories are called factor levels. The function levels() can be used to access the factor levels.

levels(treatment)

## [1] "1" "2"

Factor levels can also be changed by assigning new levels to a factor.

levels(treatment) <- c("A", "C")
treatment

## [1] A C A C
## Levels: A C

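Be careful when assigning levels in a different order: this relabels the levels rather than reordering them, so the data values themselves change (every A becomes C and every C becomes A).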

levels(treatment) <- c("C", "A")
treatment

## [1] C A C A
## Levels: C A

By default, factor levels are sorted in numeric or alphabetical order.

treatment <- factor(c("C", "A", "C", "A"))
treatment

## [1] C A C A
## Levels: A C

The factor levels can be specified when creating the factor, if the order does not follow the default rule.

treatment <- factor(c("C", "A", "C", "A"), levels=c("C", "A"))
treatment

## [1] C A C A
## Levels: C A

If you want to know the number of individuals at each level, there are two simple functions we can use to summarize the information contained in a factor.

summary(treatment)

## C A
## 2 2

table(treatment)

## treatment
## C A
## 2 2
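
The difference between reordering the levels of a factor and relabeling them is easy to miss. The following sketch contrasts the two; it uses a separate variable, trt, so that the treatment factor above is left unchanged.

```r
trt <- factor(c("C", "A", "C", "A"))
levels(trt)                  # "A" "C" (default alphabetical order)

# reorder: pass the existing factor to factor() with the desired level order
reordered <- factor(trt, levels = c("C", "A"))
levels(reordered)            # "C" "A"
as.character(reordered)      # "C" "A" "C" "A" (data unchanged)

# relabel: assigning to levels() renames the levels in place
relabeled <- trt
levels(relabeled) <- c("C", "A")
as.character(relabeled)      # "A" "C" "A" "C" (data changed)
```

Use factor() with the levels parameter when only the order should change.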

Data frames

A data frame looks a bit like a spreadsheet, and can contain different types of data in each column. A data frame can be created using the function data.frame(), from new or previously defined vectors.

genes <- data.frame(ensembl = gene_ids, interest = of_interest)
genes

##                               ensembl interest
## ENSMUSG00000102693 ENSMUSG00000102693     TRUE
## ENSMUSG00000064842 ENSMUSG00000064842    FALSE
## ENSMUSG00000051951 ENSMUSG00000051951    FALSE

genes <- data.frame(ensembl = c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951"), interest = c(TRUE, FALSE, FALSE))
genes

##              ensembl interest
## 1 ENSMUSG00000102693     TRUE
## 2 ENSMUSG00000064842    FALSE
## 3 ENSMUSG00000051951    FALSE

is.data.frame(genes)

## [1] TRUE

The syntax to specify a subset of data within a data frame looks a lot like the syntax to specify a subset of a vector, except that for a data frame we need to specify both the row and the column, rather than a single index:

dataframe[row, column]

We can use the row and column number, row and column names, or logical vectors to subset a data frame.

genes[1,]

##              ensembl interest
## 1 ENSMUSG00000102693     TRUE

genes[, "ensembl"]

## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"

genes[of_interest,]

##              ensembl interest
## 1 ENSMUSG00000102693     TRUE

A column of a data frame can be specified using the “$”.

genes$ensembl

## [1] "ENSMUSG00000102693" "ENSMUSG00000064842" "ENSMUSG00000051951"

genes[genes$interest,]

##              ensembl interest
## 1 ENSMUSG00000102693     TRUE

Data frames are easily extended using the “$” as well.

genes$name <- c("4933401J01Rik", "Gm26206", "Xkr4")
genes

##              ensembl interest          name
## 1 ENSMUSG00000102693     TRUE 4933401J01Rik
## 2 ENSMUSG00000064842    FALSE       Gm26206
## 3 ENSMUSG00000051951    FALSE          Xkr4
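
Row and column selections can be combined in a single step. The sketch below rebuilds a small copy of the data frame under the name genes_demo (a name chosen for this example) so that it can be run on its own.

```r
genes_demo <- data.frame(
  ensembl = c("ENSMUSG00000102693", "ENSMUSG00000064842", "ENSMUSG00000051951"),
  interest = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# number of rows and columns
dim(genes_demo)

# compact overview of the column types
str(genes_demo)

# rows selected by a logical vector, column selected by name
hits <- genes_demo[genes_demo$interest, "ensembl"]

# "!" inverts the logical vector to select the remaining rows
others <- genes_demo[!genes_demo$interest, c("ensembl", "interest")]
```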

CHALLENGE

Let’s explore a data frame. R has several built-in data sets. We will use “ToothGrowth,” which contains the results of an experiment investigating the effect of vitamin C supplementation on tooth growth in guinea pigs.

class(ToothGrowth)

## [1] "data.frame"

What type of data is stored in ToothGrowth? Can you access a subset of that data? Can you add a column to ToothGrowth? Explore ToothGrowth using the following functions:

rownames()
colnames()
class()

Topic 3. Import and export data in R

The base R function read.table() is a general function for reading a file in table format. There are also several variants of read.table() that make reading a file even easier, like read.delim(). The data will be imported as a data frame.

To read a local file, provide read.table with the path to the file’s location.

download.file("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-August-intro-scRNAseq/master/datasets/scran_biomart.txt", "scran.txt")
scran <- read.table(file="scran.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)

To read a file from the internet, provide read.table with the URL.

scran2 <- read.table(file="https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2020-August-intro-scRNAseq/master/datasets/scran_biomart.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)

Take a look at the beginning part of the data frame.

head(scran)

##      mgi_symbol    ensembl_gene_id
## 1 0610040J01Rik ENSMUSG00000060512
## 2 1110008P14Rik ENSMUSG00000039195
## 3 1110012L19Rik ENSMUSG00000045237
## 4 1110032F04Rik ENSMUSG00000046999
## 5 1110051M20Rik ENSMUSG00000040591
## 6 1110065P20Rik ENSMUSG00000078570

Depending on the format of the file, several variants of read.table() are available to make reading a file easier, including read.csv, read.csv2, read.delim, and read.delim2.

Choosing the correct function (or parameters) is important!

CHALLENGE

What happens when you use read.csv on the file instead? Why?

scran2 <- read.csv(file="scran.txt", stringsAsFactors=FALSE)
head(scran2)
rm(scran, scran2)

The base R function write.table() can be used to export data to a file.

write.table(genes, file="genes.txt", sep="\t", quote=FALSE, row.names=TRUE, col.names=TRUE)

It is also possible to export data to a csv file using the functions write.csv() and write.csv2().
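
A quick way to check that an export worked is to read the file back in. The sketch below is a self-contained round trip under a few assumptions: it uses a small made-up data frame (demo), writes to a temporary file rather than the working directory, and leaves out row names so that the file has one header field per column.

```r
demo <- data.frame(
  ensembl = c("ENSMUSG00000102693", "ENSMUSG00000064842"),
  name = c("4933401J01Rik", "Gm26206"),
  stringsAsFactors = FALSE
)

# write a tab-delimited file to a temporary location
out_file <- tempfile(fileext = ".txt")
write.table(demo, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() assumes tab-separated input with a header line
back <- read.delim(out_file, stringsAsFactors = FALSE)

# compare the re-imported columns to the originals
identical(back$ensembl, demo$ensembl)
identical(back$name, demo$name)
```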

Topic 4. Basic statistics in R

Description	R function
Mean	mean()
Standard deviation	sd()
Variance	var()
Minimum	min()
Maximum	max()
Median	median()
Range of values: minimum and maximum	range()
Sample quantiles	quantile()
Summary statistics (generic function)	summary()
Interquartile range	IQR()

Summarize the ToothGrowth data, then calculate the mean tooth length for each supplement type.

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

mean(ToothGrowth[ToothGrowth$supp == "OJ", "len"])

## [1] 20.66333

mean(ToothGrowth[ToothGrowth$supp == "VC", "len"])

## [1] 16.96333
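
Repeating the same calculation for every group becomes tedious when there are many groups. As one possible alternative, the base R functions tapply() and aggregate() apply a function to each group in a single call. A minimal sketch:

```r
# mean tooth length for each supplement type
supp_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
supp_means

# mean tooth length for each combination of supplement and dose,
# returned as a data frame
dose_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
dose_means
```

The values in supp_means should match the two means calculated above.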

CHALLENGE

What is the largest tooth growth measured?
Which treatment had a larger range of length values?

# use the functions listed above to answer the questions

Topic 5. Simple data visualization in R

To summarize the tooth growth data, we can create a box plot. Take a look at the help for the boxplot() function to get an idea of what the various parameters do.

boxplot(ToothGrowth$len ~ ToothGrowth$dose, xlab = "dose", ylab = "length")

boxplot(ToothGrowth$len ~ ToothGrowth$supp + ToothGrowth$dose,
        xlab = "supplement.dose", ylab = "length")

We can also look at the tooth growth data as a histogram, using the hist() function.

hist(ToothGrowth$len)

hist(ToothGrowth$len, breaks = seq(0,36, by = 2), xlab = "length", main = "Tooth Growth")

Scatter and line plots can be produced using the function plot(). Because ToothGrowth doesn’t contain any data that makes an interesting scatter plot, we will use another of R’s built-in data sets, “mtcars.”

plot(mtcars$wt, mtcars$mpg, xlab = "weight", ylab = "mpg")
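
Plots made with plot() can be extended by adding layers to them. The following sketch, which goes slightly beyond the material above, fits a simple linear model with lm() and draws the fitted line over the scatter plot with abline(); the color and line width are arbitrary choices.

```r
# scatter plot of fuel efficiency against weight
plot(mtcars$wt, mtcars$mpg, xlab = "weight", ylab = "mpg")

# fit a straight line: mpg as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)

# add the fitted line to the existing plot
abline(fit, col = "red", lwd = 2)

# intercept and slope of the fitted line
coef(fit)
```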

Topic 6. Install packages in R

Since Bioconductor version 3.8, BiocManager has been the recommended way to install packages.

if (!any(rownames(installed.packages()) == "Seurat")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("Seurat")
}
library(Seurat)

if (!any(rownames(installed.packages()) == "sva")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("sva")
}
library(sva)

if (!any(rownames(installed.packages()) == "scran")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("scran")
}
library(scran)

if (!any(rownames(installed.packages()) == "ggplot2")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("ggplot2")
}
library(ggplot2)

if (!any(rownames(installed.packages()) == "dplyr")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("dplyr")
}
library(dplyr)

if (!any(rownames(installed.packages()) == "knitr")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("knitr")
}
library(knitr)

if (!any(rownames(installed.packages()) == "kableExtra")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("kableExtra")
}
library(kableExtra)

if (!any(rownames(installed.packages()) == "limma")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("limma")
}
library(limma)

if (!any(rownames(installed.packages()) == "biomaRt")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("biomaRt")
}
library(biomaRt)

if (!any(rownames(installed.packages()) == "topGO")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("topGO")
}
library(topGO)

if (!any(rownames(installed.packages()) == "org.Mm.eg.db")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("org.Mm.eg.db")
}
library(org.Mm.eg.db)

if (!any(rownames(installed.packages()) == "WGCNA")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("WGCNA")
}
library(WGCNA)

if (!any(rownames(installed.packages()) == "rmarkdown")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("rmarkdown")
}
library(rmarkdown)

if (!any(rownames(installed.packages()) == "tinytex")){
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  BiocManager::install("tinytex")
}
library(tinytex)
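
The blocks above all follow the same pattern, so they can also be written more compactly. The following is a sketch of one possible approach, not the workshop's official setup script: pkgs lists only a few of the packages as an example, install_missing() is a helper defined here for illustration, and the call that actually installs is commented out so that nothing is downloaded by accident.

```r
# a subset of the packages used in the workshop
pkgs <- c("Seurat", "sva", "scran", "ggplot2", "dplyr")

# find the packages that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs

# illustrative helper: install a vector of packages through BiocManager
install_missing <- function(pkgs) {
  if (length(pkgs) == 0) {
    return(invisible(character(0)))
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# uncomment to install whatever is missing
# install_missing(missing_pkgs)
```

After installation, each package still needs to be loaded with library() before use.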

Bioconductor has a repository and release schedule that differ from R’s: new packages and updates are introduced in a ‘devel’ branch, and a stable ‘release’ branch is issued once every 6 months, which receives bug fixes but no new features. Because of this mismatch, the Bioconductor version that install.packages() finds is sometimes not the most recent ‘release’.
A consequence of the separate ‘devel’ branch is that install.packages() sometimes points only to the ‘release’ repository, while users might want access to the leading-edge features in the ‘devel’ version.
An indirect consequence of Bioconductor’s structured release is that packages generally have more extensive dependencies on one another.

Introduction to Single Cell RNA-Seq Workshop
