Home
Introduction
Introduction to the Workshop and the Core
Schedule
Support
Slack
Zoom
Cheat Sheets
Introduction to R
Introduction to R
Challenge Solutions
ETC
Closing thoughts
Github page
Biocore website

R and RStudio

What is R?

R is a language and environment for statistical computing and graphics developed in 1993. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, …), and is highly extensible, meaning that the user community can write new R tools. It is a GNU project (Free and Open Source).

The R language has its roots in the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and now, R is developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of S. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

Some of R’s strengths:

The ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. Such as examples like the following (extracted from http://web.stanford.edu/class/bios221/book/Chap-Graphics.html):

It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
R can be extended (easily) via packages.
R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.
It has a vast community both in academia and in business.
It’s FREE!

The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes

an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. The R group prefers to think of it of an environment within which statistical techniques are implemented.

The R Homepage

The R homepage has a wealth of information on it,

R-project.org

On the homepage you can:

Learn more about R
Download R
Get Documentation (official and user supplied)
Get access to CRAN ‘Comprehensive R archival network’

Interface for R

There are many ways one can interface with R language. Here are a few popular ones:

RStudio
RGui
Jupyter and R notebooks
text editors, such as vi(m), Emacs…

RStudio

RStudio started in 2010, to offer R a more full featured integrated development environment (IDE) and modeled after matlab’s IDE.

RStudio has many features:

syntax highlighting
code completion
smart indentation
“Projects”
workspace browser and data viewer
embedded plots
Markdown notebooks, Sweave authoring and knitr with one click pdf or html
runs on all platforms and over the web
etc. etc. etc.

RStudio and its team have contributed to many R packages.[13] These include:

Tidyverse – R packages for data science, including ggplot2, dplyr, tidyr, and purrr
Shiny – An interactive web technology
RMarkdown – Insert R code into markdown documents
knitr – Dynamic reports combining R, TeX, Markdown & HTML
packrat – Package dependency tool
devtools – Package development tool

RStudio Cheat Sheets: rstudio-ide.pdf

Programming fundamentals

There are three concepts that are essential in any programming language:

VARIABLES

A variable is a named storage. Creating a variable is to reserve some space in memory. In R, the name of a variable can have letters, numbers, dot and underscore. However, a valid variable name cannot start with a underscore or a number, or start with a dot that is followed by a number.

FUNCTIONS

A function is a block of organized, reusable code that is used to perform a set of predefined operations. A function may take zero or more parameters and return a result.

The way to use a function in R is:

function.name(parameter1=value1, …)

In R, to get help information on a funciton, one may use the command:

?function.name

OPERATIONS

Assignment Operators in R
Operator	Description
<-, =	Assignment

Arithmetic Operators in R
Operator	Description
+	Addition
-	Subtraction
*	Multiplication
/	Division
^	Exponent
%%	Modulus
%/%	Integer Division

Relational Operators in R
Operator	Description
<	Less than
>	Greater than
<=	Less than or equal to
>=	Greater than or equal to
==	Equal to
!=	Not equal to

Logical Operators in R
Operator	Description
!	Logical NOT
&	Element-wise logical AND
&&	Logical AND
\|	Element-wise logical OR
\|\|	Logical OR

<!DOCTYPE html> embed.utf8

Start an R session

BEFORE YOU BEGIN, YOU NEED TO START AN R SESSION

You can run this tutorial in an IDE (like Rstudio) on your laptop, or you can run R on the command-line on tadpole by logging into tadpole in a terminal and running the following commands:

module load R

R

NOTE: Below, the text in the yellow boxes is code to input (by typing it or copy/pasting) into your R session, the text in the white boxes is the expected output.

Topics covered in this introduction to R

Basic data types in R
Basic data structures in R
Import and export data in R
Functions in R
Basic statistics in R
Simple data visulization in R
Install packages in R
Save data in R session
R markdown and R notebooks

Topic 1. Basic data types in R

There are 5 basic atomic classes: numeric (integer, complex), character, logical

Examples of numeric values.

# assign number 150 to variable a.
a <- 150
a

## [1] 150

# assign a number in scientific format to variable b.
b <- 3e-2
b

## [1] 0.03

Examples of character values.

# assign a string "BRCA1" to variable gene
gene <- "BRCA1"
gene

## [1] "BRCA1"

# assign a string "Hello World" to variable hello
hello <- "Hello World"
hello

## [1] "Hello World"

Examples of logical values.

# assign logical value "TRUE" to variable brca1_expressed
brca1_expressed <- TRUE
brca1_expressed

## [1] TRUE

# assign logical value "FALSE" to variable her2_expressed
her2_expressed <- FALSE
her2_expressed

## [1] FALSE

# assign logical value to a variable by logical operation
her2_expression_level <- 0
her2_expressed <- her2_expression_level > 0
her2_expressed

## [1] FALSE

To find out the type of variable.

class(her2_expressed)

## [1] "logical"

# To check whether the variable is a specific type
is.numeric(gene)

## [1] FALSE

is.numeric(a)

## [1] TRUE

is.character(gene)

## [1] TRUE

In the case that one compares two different classes of data, the coersion rule in R is logical -> integer -> numeric -> complex -> character . The following is an example of converting a numeric variable to character.

## [1] 0.03

as.character(b)

## [1] "0.03"

What happens when one converts a logical variable to numeric?

# recall her2_expressed
her2_expressed

## [1] FALSE

# conversion
as.numeric(her2_expressed)

## [1] 0

her2_expressed + 1

## [1] 1

A logical TRUE is converted to integer 1 and a logical FALSE is converted to integer 0.

Quiz 1

Topic 2. Basic data structures in R

	Homogeneous	Heterogeneous
1d	Atomic vector	List
2d	Matrix	Data frame
Nd	Array

Atomic vectors: an atomic vector is a combination of multiple values(numeric, character or logical) in the same object. An atomic vector is created using the function c().

gene_names <- c("ESR1", "p53", "PI3K", "BRCA1", "EGFR")
gene_names

## [1] "ESR1"  "p53"   "PI3K"  "BRCA1" "EGFR"

gene_expression <- c(0, 100, 50, 200, 80)
gene_expression

## [1]   0 100  50 200  80

One can give names to the elements of an atomic vector.

# assign names to a vector by specifying them
names(gene_expression) <- c("ESR1", "p53", "PI3K", "BRCA1", "EGFR")
gene_expression

##  ESR1   p53  PI3K BRCA1  EGFR 
##     0   100    50   200    80

# assign names to a vector using another vector
names(gene_expression) <- gene_names
gene_expression

##  ESR1   p53  PI3K BRCA1  EGFR 
##     0   100    50   200    80

Or One may create a vector with named elements from scratch.

gene_expression <- c(ESR1=0, p53=100, PI3K=50, BRCA1=200, EGFR=80)
gene_expression

##  ESR1   p53  PI3K BRCA1  EGFR 
##     0   100    50   200    80

To find out the length of a vector:

length(gene_expression)

## [1] 5

NOTE: a vector can only hold elements of the same type. If there are a mixture of data types, they will be coerced according to the coersion rule mentioned earlier in this documentation.

Factors: a factor is a special vector. It stores categorical data, which are important in statistical modeling and can only take on a limited number of pre-defined values. The function factor() can be used to create a factor.

disease_stage <- factor(c("Stage1", "Stage2", "Stage2", "Stage3", "Stage1", "Stage4"))
disease_stage

## [1] Stage1 Stage2 Stage2 Stage3 Stage1 Stage4
## Levels: Stage1 Stage2 Stage3 Stage4

In R, categories of the data are stored as factor levels. The function levels() can be used to access the factor levels.

levels(disease_stage)

## [1] "Stage1" "Stage2" "Stage3" "Stage4"

A function to compactly display the internal structure of an R object is str(). Please use str() to display the internal structure of the object we just created disease_stage. It shows that disease_stage is a factor with four levels: “Stage1”, “Stage2”, “Stage3”, etc… The integer numbers after the colon shows that these levels are encoded under the hood by integer values: the first level is 1, the second level is 2, and so on. Basically, when factor function is called, R first scan through the vector to determine how many different categories there are, then it converts the character vector to a vector of integer values, with each integer value labeled with a category.

str(disease_stage)

##  Factor w/ 4 levels "Stage1","Stage2",..: 1 2 2 3 1 4

By default, R infers the factor levels by ordering the unique elements in a factor alphanumerically. One may specifically define the factor levels at the creation of the factor.

disease_stage <- factor(c("Stage1", "Stage2", "Stage2", "Stage3", "Stage1", "Stage4"), levels=c("Stage2", "Stage1", "Stage3", "Stage4"))
# The encoding for levels are different from above.
str(disease_stage)

##  Factor w/ 4 levels "Stage2","Stage1",..: 2 1 1 3 2 4

If you want to know the number of individuals at each levels, there are two functions: summary and table.

summary(disease_stage)

## Stage2 Stage1 Stage3 Stage4 
##      2      2      1      1

table(disease_stage)

## disease_stage
## Stage2 Stage1 Stage3 Stage4 
##      2      2      1      1

Quiz 2

Matrices: A matrix is like an Excel sheet containing multiple rows and columns. It is used to combine vectors of the same type.

col1 <- c(1,3,8,9)
col2 <- c(2,18,27,10)
col3 <- c(8,37,267,19)

my_matrix <- cbind(col1, col2, col3)
my_matrix

##      col1 col2 col3
## [1,]    1    2    8
## [2,]    3   18   37
## [3,]    8   27  267
## [4,]    9   10   19

One other way to create a matrix is to use matrix() function.

nums <- c(col1, col2, col3)
nums

##  [1]   1   3   8   9   2  18  27  10   8  37 267  19

matrix(nums, ncol=2)

##      [,1] [,2]
## [1,]    1   27
## [2,]    3   10
## [3,]    8    8
## [4,]    9   37
## [5,]    2  267
## [6,]   18   19

rownames(my_matrix) <- c("row1", "row2", "row3", "row4")
my_matrix

##      col1 col2 col3
## row1    1    2    8
## row2    3   18   37
## row3    8   27  267
## row4    9   10   19

t(my_matrix)

##      row1 row2 row3 row4
## col1    1    3    8    9
## col2    2   18   27   10
## col3    8   37  267   19

To find out the dimension of a matrix:

ncol(my_matrix)

## [1] 3

nrow(my_matrix)

## [1] 4

dim(my_matrix)

## [1] 4 3

Calculations with numeric matrices.

my_matrix * 3

##      col1 col2 col3
## row1    3    6   24
## row2    9   54  111
## row3   24   81  801
## row4   27   30   57

log10(my_matrix)

##           col1     col2     col3
## row1 0.0000000 0.301030 0.903090
## row2 0.4771213 1.255273 1.568202
## row3 0.9030900 1.431364 2.426511
## row4 0.9542425 1.000000 1.278754

Total of each row.

rowSums(my_matrix)

## row1 row2 row3 row4 
##   11   58  302   38

Total of each column.

colSums(my_matrix)

## col1 col2 col3 
##   21   57  331

There is a data structure Array in R, that holds multi-dimensional (d > 2) data and is a generalized version of a matrix. Matrix is used much more commonly than Array, therefore we are not going to talk about Array here.

Data frames: a data frame is like a matrix but can have columns with different types (numeric, character, logical).

A data frame can be created using the function data.frame().

# creating a data frame using pre-defined vectors
patients_name=c("Patient1", "Patient2", "Patient3", "Patient4", "Patient5", "Patient6")
Family_history=c("Y", "N", "Y", "N", "Y", "Y")
patients_age=c(31, 40, 39, 50, 45, 65)
meta.data <- data.frame(patients_name=patients_name, disease_stage=disease_stage, Family_history=Family_history, patients_age=patients_age)
meta.data

##   patients_name disease_stage Family_history patients_age
## 1      Patient1        Stage1              Y           31
## 2      Patient2        Stage2              N           40
## 3      Patient3        Stage2              Y           39
## 4      Patient4        Stage3              N           50
## 5      Patient5        Stage1              Y           45
## 6      Patient6        Stage4              Y           65

To check whether a data is a data frame, use the function is.data.frame().

is.data.frame(meta.data)

## [1] TRUE

is.data.frame(my_matrix)

## [1] FALSE

One can convert a matrix object to a data frame using the function as.data.frame().

class(my_matrix)

## [1] "matrix" "array"

my_data <- as.data.frame(my_matrix)
class(my_data)

## [1] "data.frame"

A data frame can be transposed in the similar way as a matrix. However, the result of transposing a data frame might not be a data frame anymore.

my_data

##      col1 col2 col3
## row1    1    2    8
## row2    3   18   37
## row3    8   27  267
## row4    9   10   19

t(my_data)

##      row1 row2 row3 row4
## col1    1    3    8    9
## col2    2   18   27   10
## col3    8   37  267   19

A data frame can be extended.

# add a column that has the information on harmful mutations in BRCA1/BRCA2 genes for each patient.
meta.data

##   patients_name disease_stage Family_history patients_age
## 1      Patient1        Stage1              Y           31
## 2      Patient2        Stage2              N           40
## 3      Patient3        Stage2              Y           39
## 4      Patient4        Stage3              N           50
## 5      Patient5        Stage1              Y           45
## 6      Patient6        Stage4              Y           65

meta.data$BRCA <- c("YES", "NO", "YES", "YES", "YES", "NO")
meta.data

##   patients_name disease_stage Family_history patients_age BRCA
## 1      Patient1        Stage1              Y           31  YES
## 2      Patient2        Stage2              N           40   NO
## 3      Patient3        Stage2              Y           39  YES
## 4      Patient4        Stage3              N           50  YES
## 5      Patient5        Stage1              Y           45  YES
## 6      Patient6        Stage4              Y           65   NO

A data frame can also be extended using the functions cbind() and rbind(), for adding columns and rows respectively. When using cbind(), the number of values in the new column must match the number of rows in the data frame. When using rbind(), the two data frames must have the same variables/columns.

# add a column that has the information on the racial information for each patient.
cbind(meta.data, Race=c("AJ", "AS", "AA", "NE", "NE", "AS"))

##   patients_name disease_stage Family_history patients_age BRCA Race
## 1      Patient1        Stage1              Y           31  YES   AJ
## 2      Patient2        Stage2              N           40   NO   AS
## 3      Patient3        Stage2              Y           39  YES   AA
## 4      Patient4        Stage3              N           50  YES   NE
## 5      Patient5        Stage1              Y           45  YES   NE
## 6      Patient6        Stage4              Y           65   NO   AS

# rbind can be used to add more rows to a data frame.
rbind(meta.data, data.frame(patients_name="Patient7", disease_stage="S4", Family_history="Y", patients_age=48, BRCA="YES"))

##   patients_name disease_stage Family_history patients_age BRCA
## 1      Patient1        Stage1              Y           31  YES
## 2      Patient2        Stage2              N           40   NO
## 3      Patient3        Stage2              Y           39  YES
## 4      Patient4        Stage3              N           50  YES
## 5      Patient5        Stage1              Y           45  YES
## 6      Patient6        Stage4              Y           65   NO
## 7      Patient7            S4              Y           48  YES

One may use the function merge to merge two data frames horizontally, based on one or more common key variables.

expression.data <- data.frame(patients_name=c("Patient3", "Patient4", "Patient5", "Patient1", "Patient2", "Patient6"), EGFR=c(10, 472, 103784, 1782, 187, 18289), TP53=c(16493, 72, 8193, 1849, 173894, 1482))
expression.data

##   patients_name   EGFR   TP53
## 1      Patient3     10  16493
## 2      Patient4    472     72
## 3      Patient5 103784   8193
## 4      Patient1   1782   1849
## 5      Patient2    187 173894
## 6      Patient6  18289   1482

md2 <- merge(meta.data, expression.data, by="patients_name")
md2

##   patients_name disease_stage Family_history patients_age BRCA   EGFR   TP53
## 1      Patient1        Stage1              Y           31  YES   1782   1849
## 2      Patient2        Stage2              N           40   NO    187 173894
## 3      Patient3        Stage2              Y           39  YES     10  16493
## 4      Patient4        Stage3              N           50  YES    472     72
## 5      Patient5        Stage1              Y           45  YES 103784   8193
## 6      Patient6        Stage4              Y           65   NO  18289   1482

Quiz 3

CHALLENGE

Using the mtcars built-in dataset (Type “mtcars” to see it), add a row that has the averages of each column and name it “Averages”. Now, add a column to mtcars called “hp.gt.100” that is TRUE or FALSE depending on whether the horsepower (hp) for that car is greater than 100 or not.

Lists: a list is an ordered collection of objects, which can be any type of R objects (vectors, matrices, data frames, even lists).

A list is constructed using the function list().

my_list <- list(1:5, "a", c(TRUE, FALSE, FALSE), c(3.2, 103.0, 82.3))
my_list

## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1]  TRUE FALSE FALSE
## 
## [[4]]
## [1]   3.2 103.0  82.3

str(my_list)

## List of 4
##  $ : int [1:5] 1 2 3 4 5
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE FALSE
##  $ : num [1:3] 3.2 103 82.3

One could construct a list by giving names to elements.

my_list <- list(Ranking=1:5, ID="a", Test=c(TRUE, FALSE, FALSE), Score=c(3.2, 103.0, 82.3))

# display the names of elements in the list using the function *names*, or *str*. Compare the output of *str* with the above results to see the difference.
names(my_list)

## [1] "Ranking" "ID"      "Test"    "Score"

str(my_list)

## List of 4
##  $ Ranking: int [1:5] 1 2 3 4 5
##  $ ID     : chr "a"
##  $ Test   : logi [1:3] TRUE FALSE FALSE
##  $ Score  : num [1:3] 3.2 103 82.3

# number of elements in the list
length(my_list)

## [1] 4

Subsetting data

Subsetting allows one to access the piece of data of interest. When combinded with assignment, subsetting can modify selected pieces of data. The operators that can be used to subset data are: [, $, and [[.

First, we are going to talk about subsetting data using [, which is the most commonly used operator. We will start by looking at vectors and talk about four ways to subset a vector.

Positive integers return elements at the specified positions

# first to recall what are stored in gene_names
gene_names

## [1] "ESR1"  "p53"   "PI3K"  "BRCA1" "EGFR"

# obtain the first and the third elements
gene_names[c(1,3)]

## [1] "ESR1" "PI3K"

R uses 1 based indexing, meaning the first element is at the position 1, not at position 0.

Negative integers omit elements at the specified positions

gene_names[-c(1,3)]

## [1] "p53"   "BRCA1" "EGFR"

One may not mixed positive and negative integers in one single subset operation.

# The following command will produce an error.
gene_names[c(-1, 2)]

## Error in gene_names[c(-1, 2)]: only 0's may be mixed with negative subscripts

Logical vectors select elements where the corresponding logical value is TRUE, This is very useful because one may write the expression that creates the logical vector.

gene_names[c(TRUE, FALSE, TRUE, FALSE, FALSE)]

## [1] "ESR1" "PI3K"

Recall that we have created one vector called gene_expression. Let’s assume that gene_expression stores the expression values correspond to the genes in gene_names. Then we may subset the genes based on expression values.

gene_expression

##  ESR1   p53  PI3K BRCA1  EGFR 
##     0   100    50   200    80

gene_names[gene_expression > 50]

## [1] "p53"   "BRCA1" "EGFR"

If the logical vector is shorter in length than the data vector that we want to subset, then it will be recycled to be the same length as the data vector.

gene_names[c(TRUE, FALSE)]

## [1] "ESR1" "PI3K" "EGFR"

If the logical vector has “NA” in it, the corresponding value will be “NA” in the output. “NA” in R is a symbol for missing value.

gene_names[c(TRUE, NA, FALSE, TRUE, NA)]

## [1] "ESR1"  NA      "BRCA1" NA

Character vectors return elements with matching names, when the vector is named.

gene_expression

##  ESR1   p53  PI3K BRCA1  EGFR 
##     0   100    50   200    80

gene_expression[c("ESR1", "p53")]

## ESR1  p53 
##    0  100

Nothing returns the original vector, This is more useful for matrices, data frames than for vectors.

gene_names[]

## [1] "ESR1"  "p53"   "PI3K"  "BRCA1" "EGFR"

Subsetting a list works in the same way as subsetting an atomic vector. Using [ will always return a list.

my_list[1]

## $Ranking
## [1] 1 2 3 4 5

Subsetting a matrix can be done by simply generalizing the one dimension subsetting: one may supply a one dimension index for each dimension of the matrix. Blank/Nothing subsetting is now useful in keeping all rows or all columns.

my_matrix[c(TRUE, FALSE), ]

##      col1 col2 col3
## row1    1    2    8
## row3    8   27  267

Subsetting a data frame can be done similarly as subsetting a matrix. In addition, one may supply only one 1-dimensional index to subset a data frame. In this case, R will treat the data frame as a list with each column is an element in the list.

# recall a data frame created from above: *meta.data*
meta.data

##   patients_name disease_stage Family_history patients_age BRCA
## 1      Patient1        Stage1              Y           31  YES
## 2      Patient2        Stage2              N           40   NO
## 3      Patient3        Stage2              Y           39  YES
## 4      Patient4        Stage3              N           50  YES
## 5      Patient5        Stage1              Y           45  YES
## 6      Patient6        Stage4              Y           65   NO

# subset the data frame similarly to a matrix
meta.data[c(TRUE, FALSE, FALSE, TRUE),]

##   patients_name disease_stage Family_history patients_age BRCA
## 1      Patient1        Stage1              Y           31  YES
## 4      Patient4        Stage3              N           50  YES
## 5      Patient5        Stage1              Y           45  YES

# subset the data frame using one vector
meta.data[c("patients_age", "disease_stage")]

##   patients_age disease_stage
## 1           31        Stage1
## 2           40        Stage2
## 3           39        Stage2
## 4           50        Stage3
## 5           45        Stage1
## 6           65        Stage4

Subsetting operators: [[ and $

[[ is similar to [, except that it returns the content of the element.

# recall my_list
my_list

## $Ranking
## [1] 1 2 3 4 5
## 
## $ID
## [1] "a"
## 
## $Test
## [1]  TRUE FALSE FALSE
## 
## $Score
## [1]   3.2 103.0  82.3

# comparing [[ with [ in subsetting a list
my_list[[1]]

## [1] 1 2 3 4 5

my_list[1]

## $Ranking
## [1] 1 2 3 4 5

[[ is very useful when working with a list. Because when [ is applied to a list, it always returns a list. While [[ returns the contents of the list. [[ can only extrac/return one element, so it only accept one integer/string as input.

Because data frames are implemented as lists of columns, one may use [[ to extract a column from data frames.

meta.data[["disease_stage"]]

## [1] Stage1 Stage2 Stage2 Stage3 Stage1 Stage4
## Levels: Stage2 Stage1 Stage3 Stage4

$ is a shorthand for [[ combined with character subsetting.

# subsetting a list using $ 
my_list$Score

## [1]   3.2 103.0  82.3

# subsetting a data frame using
meta.data$disease_stage

## [1] Stage1 Stage2 Stage2 Stage3 Stage1 Stage4
## Levels: Stage2 Stage1 Stage3 Stage4

Simplifying vs. preserving subsetting

We have seen some examples of simplying vs. preserving subsetting, for example:

# simplifying subsetting
my_list[[1]]

## [1] 1 2 3 4 5

# preserving subsetting
my_list[1]

## $Ranking
## [1] 1 2 3 4 5

Basically, simplying subsetting returns the simplest possible data structure that can represent the output. While preserving subsetting keeps the structure of the output as the same as the input. In the above example, [[ simplifies the output to a vector, while [ keeps the output as a list.

Because the syntax of carrying out simplifying and preserving subsetting differs depending on the data structure, the table below provides the information for the most basic data structure.

	Simplifying	Preserving
Vector	x[[1]]	x[1]
List	x[[1]]	x[1]
Factor	x[1:3, drop=T]	x[1:3]
Data frame	x[, 1] or x[[1]]	x[, 1, drop=F] or x[1]

CHALLENGES

Using the built-in dataset iris, first subset the dataframe keeping only those rows where the sepal length is greater than 6. Then find the total number for each Species in that subset.

Using iris, remove the width columns and then create a new dataframe with the Species and the sum of the rows.

Topic 3. Import and export data in R

R base function read.table() is a general funciton that can be used to read a file in table format. The data will be imported as a data frame.

# There is a very convenient way to read files from the internet.
data1 <- read.table(file="https://github.com/ucdavis-bioinformatics-training/courses/raw/master/Intro2R/raw_counts.txt", sep="\t", header=T, stringsAsFactors=F)

# To read a local file. If you have downloaded the raw_counts.txt file to your local machine, you may use the following command to read it in, by providing the full path for the file location. The way to specify the full path is the same as taught in the command line session.
download.file("https://github.com/ucdavis-bioinformatics-training/courses/raw/master/Intro2R/raw_counts.txt", "./raw_counts.txt")
data1 <- read.table(file="./raw_counts.txt", sep="\t", header=T, stringsAsFactors=F)

To check what type of object data1 is in and take a look at the beginning part of the data.

is.data.frame(data1)

## [1] TRUE

head(data1)

##            C61  C62  C63  C64  C91  C92  C93 C94 I561 I562 I563 I564 I591 I592
## AT1G01010  322  346  256  396  372  506  361 342  638  488  440  479  770  430
## AT1G01020  149   87  162  144  189  169  147 108  163  141  119  147  182  156
## AT1G01030   15   32   35   22   24   33   21  35   18    8   54   35   23    8
## AT1G01040  687  469  568  651  885  978  794 862  799  769  725  715  811  567
## AT1G01046    1    1    5    4    5    3    0   2    4    3    1    0    2    8
## AT1G01050 1447 1032 1083 1204 1413 1484 1138 938 1247 1516  984 1044 1374 1355
##           I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010  656  467  143  453  429  206  567  458  520  474
## AT1G01020  153  177   43  144  114   50  161  195  157  144
## AT1G01030   16   24   42   17   22   39   26   28   39   30
## AT1G01040  831  694  345  575  605  404  735  651  725  591
## AT1G01046    8    1    0    4    0    3    5    7    0    5
## AT1G01050 1437 1577  412 1338 1051  621 1434 1552 1248 1186

Depending on the format of the file, several variants of read.table() are available to make reading a file easier.

read.csv(): for reading “comma separated value” files (.csv).
read.csv2(): variant used in countries that use a comma “,” as decimal point and a semicolon “;” as field separators.
read.delim(): for reading “tab separated value” files (“.txt”). By default, point(“.”) is used as decimal point.
read.delim2(): for reading “tab separated value” files (“.txt”). By default, comma (“,”) is used as decimal point.

# We are going to read a file over the internet by providing the url of the file.
data2 <- read.csv(file="https://github.com/ucdavis-bioinformatics-training/courses/raw/master/Intro2R/raw_counts.csv", stringsAsFactors=F)

# To look at the file:
head(data2)

##            C61  C62  C63  C64  C91  C92  C93 C94 I561 I562 I563 I564 I591 I592
## AT1G01010  322  346  256  396  372  506  361 342  638  488  440  479  770  430
## AT1G01020  149   87  162  144  189  169  147 108  163  141  119  147  182  156
## AT1G01030   15   32   35   22   24   33   21  35   18    8   54   35   23    8
## AT1G01040  687  469  568  651  885  978  794 862  799  769  725  715  811  567
## AT1G01046    1    1    5    4    5    3    0   2    4    3    1    0    2    8
## AT1G01050 1447 1032 1083 1204 1413 1484 1138 938 1247 1516  984 1044 1374 1355
##           I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010  656  467  143  453  429  206  567  458  520  474
## AT1G01020  153  177   43  144  114   50  161  195  157  144
## AT1G01030   16   24   42   17   22   39   26   28   39   30
## AT1G01040  831  694  345  575  605  404  735  651  725  591
## AT1G01046    8    1    0    4    0    3    5    7    0    5
## AT1G01050 1437 1577  412 1338 1051  621 1434 1552 1248 1186

R base function write.table() can be used to export data to a file.

# To write to a file called "output.txt" in your current working directory.
write.table(data2[1:20,], file="output.txt", sep="\t", quote=F, row.names=T, col.names=T)

It is also possible to export data to a csv file.

write.csv()

write.csv2()

Quiz 4

Topic 4. R markdown and R notebooks

Markdown is a system that allow easy incorporation of annotations/comments together with computing code. Both the raw source of markdown file and the rendered output are easy to read. R markdown allows both interactive mode with R and producing a reproducible document. An R notebook is an R markdown document with code chunks that can be executed independently and interactively, with output visible immediately beneath the input. In RStudio, by default, all R markdown documents are run in R notebook mode. Under the R notebook mode, when executing a chunk, the code is sent to the console to be run one line at a time. This allows execution to stop if a line raises an error.

In RStudio, creating an R notebook can be done by going to the menu command ** File -> New File -> R Notebook **.

An example of an R notebook looks like:

The way to run the R code inside the code chunk is to use the green arrow located at the top right corner of each of the code chunk, or use ** Ctrl + Shift + Enter ** on Windows, or ** Cmd + Shift + Enter ** on Mac to run the current code chunk. To run each individual code line, one uses ** Ctrl + Enter ** on Windows, or ** Cmd + Enter ** on Mac.

To render R notebook to html/pdf/word documents can be done using the Preview menu.

Topic 5. Functions in R

Invoking a function by its name, followed by the parenthesis and zero or more arguments.

# to find out the current working directory
getwd()

## [1] "/share/biocore/joshi/projects/workshops/2021-March-Introduction-to-R-for-Bioinformatics/R"

# to set a different working directory, use setwd
#setwd("/Users/jli/Desktop")

# to list all objects in the environment
ls()

##  [1] "a"                     "b"                     "brca1_expressed"      
##  [4] "col1"                  "col2"                  "col3"                 
##  [7] "colFmt"                "data1"                 "data2"                
## [10] "disease_stage"         "expression.data"       "Family_history"       
## [13] "gene"                  "gene_expression"       "gene_names"           
## [16] "hello"                 "her2_expressed"        "her2_expression_level"
## [19] "md2"                   "meta.data"             "my_data"              
## [22] "my_list"               "my_matrix"             "nums"                 
## [25] "patients_age"          "patients_name"

# to create a vector from 2 to 3, using increment of 0.1
seq(2, 3, by=0.1)

##  [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0

# to create a vector with repeated elements
rep(1:3, times=3)

## [1] 1 2 3 1 2 3 1 2 3

rep(1:3, each=3)

## [1] 1 1 1 2 2 2 3 3 3

# to get help information on a function in R: ?function.name
?seq
?sort
?rep

One useful function to find out information on an R object: str(). It compactly display the internal structure of an R object.

str(data2)

## 'data.frame':    33602 obs. of  24 variables:
##  $ C61 : int  322 149 15 687 1 1447 2667 297 0 74 ...
##  $ C62 : int  346 87 32 469 1 1032 2472 226 0 79 ...
##  $ C63 : int  256 162 35 568 5 1083 2881 325 0 138 ...
##  $ C64 : int  396 144 22 651 4 1204 2632 341 0 85 ...
##  $ C91 : int  372 189 24 885 5 1413 5120 199 0 68 ...
##  $ C92 : int  506 169 33 978 3 1484 6176 180 0 41 ...
##  $ C93 : int  361 147 21 794 0 1138 7088 195 0 110 ...
##  $ C94 : int  342 108 35 862 2 938 6810 107 0 81 ...
##  $ I561: int  638 163 18 799 4 1247 2258 377 0 72 ...
##  $ I562: int  488 141 8 769 3 1516 1808 534 0 76 ...
##  $ I563: int  440 119 54 725 1 984 2279 300 0 184 ...
##  $ I564: int  479 147 35 715 0 1044 2299 223 0 156 ...
##  $ I591: int  770 182 23 811 2 1374 4755 298 0 96 ...
##  $ I592: int  430 156 8 567 8 1355 3128 318 0 70 ...
##  $ I593: int  656 153 16 831 8 1437 4419 397 0 77 ...
##  $ I594: int  467 177 24 694 1 1577 3726 373 0 77 ...
##  $ I861: int  143 43 42 345 0 412 1452 86 0 174 ...
##  $ I862: int  453 144 17 575 4 1338 1516 266 0 113 ...
##  $ I863: int  429 114 22 605 0 1051 1455 281 0 69 ...
##  $ I864: int  206 50 39 404 3 621 1429 164 0 176 ...
##  $ I891: int  567 161 26 735 5 1434 3867 230 0 69 ...
##  $ I892: int  458 195 28 651 7 1552 4718 270 0 80 ...
##  $ I893: int  520 157 39 725 0 1248 4580 220 0 81 ...
##  $ I894: int  474 144 30 591 5 1186 3575 229 0 62 ...

Conditional structure

Decision making is important in programming. This can be achieved using an if…else statement.

The basic structure of an if…else statement is

if (condition statement){

some operation

}

Two examples of if…else statement

Temperature <- 30

if (Temperature < 32) {
  print("Very cold")
}

## [1] "Very cold"

# recall gene_expression, we are going to design a *if...else* statement to decide treatment plans based on gene expression.

if (gene_expression["ESR1"] > 0) {
  print("Treatment plan 1")
} else if (gene_expression["BRCA1"] > 0) {
  print("Treatment plan 2")
} else if (gene_expression["p53"] > 0) {
  print("Treatment plan 3")
} else {
  print("Treatment plan 4")
}

## [1] "Treatment plan 2"

Loop structure

In programming, it is common that one has to do one set of specific operation on a sequence of elements. In this case, for loop is very useful to achieve the goal.

The basic structure of for loop is:

for (value in sequence){

some operation

}

For example, we would like to calculate the sum of a row for every row in the matrix we created earlier. We are going to use a for loop to do it.

for (i in 1:dim(my_matrix)[1]) {
  out <- sum(my_matrix[i, ])
  print(out)
}

## [1] 11
## [1] 58
## [1] 302
## [1] 38

There is a while loop in R similarly as in command line or any other programming language. The basic structure of a while loop is:

while (condition){

some operation

}

Here is the same row sum calculation using a while loop:

while (i <= dim(my_matrix)[1]){
  out <- sum(my_matrix[i,])
  print(out)
  i <- i + 1
}

## [1] 38

A few useful functions: apply(), lapply(), sapply(), and tapply() to replace for loop

apply() takes an array or matrix as input and outputs a vector, array or list.

# recall my_matrix
my_matrix

##      col1 col2 col3
## row1    1    2    8
## row2    3   18   37
## row3    8   27  267
## row4    9   10   19

# check the usage of apply() function
?apply()
# calculate sums for each row
apply(my_matrix, MARGIN=1, sum)

## row1 row2 row3 row4 
##   11   58  302   38

lapply() takes a list, vector or data frame as input and outputs a list.

?lapply()

# generate some random data matrix
data3 <- as.data.frame(matrix(rnorm(49), ncol=7), stringsAsFactors=F)
dim(data3)

## [1] 7 7

# calculate the sum for each row
lapply(1:dim(data3)[1], function(x){sum(data3[x,])})

## [[1]]
## [1] -1.350476
## 
## [[2]]
## [1] 0.09070074
## 
## [[3]]
## [1] 2.705578
## 
## [[4]]
## [1] 1.186491
## 
## [[5]]
## [1] 0.5893242
## 
## [[6]]
## [1] -1.244707
## 
## [[7]]
## [1] 0.131479

# comparing the results to apply() results
apply(data3, MARGIN=1, sum)

## [1] -1.35047577  0.09070074  2.70557830  1.18649091  0.58932419 -1.24470654
## [7]  0.13147903

# calculate log10 of the sum of each row
lapply(1:dim(data3)[1], function(x){log10(sum(data3[x,]))})

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## [[1]]
## [1] NaN
## 
## [[2]]
## [1] -1.042389
## 
## [[3]]
## [1] 0.4322601
## 
## [[4]]
## [1] 0.07426441
## 
## [[5]]
## [1] -0.2296457
## 
## [[6]]
## [1] NaN
## 
## [[7]]
## [1] -0.8811435

The function sapply() works like function lapply(), but tries to simplify the output to the simplest data structure possible. As a matter of fact, sapply() is a “wrapper” function for lapply(). By default, it returns a vector.

# To check the syntax of using sapply():
?sapply()

sapply(1:dim(data3)[1], function(x){log10(sum(data3[x,]))})

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## [1]         NaN -1.04238918  0.43226011  0.07426441 -0.22964573         NaN
## [7] -0.88114352

If the “simplify” parameter is turned off, sapply() will produced exactly the same results as lapply(), in the form of a list. By default, “simplify” is turned on.

sapply(1:dim(data3)[1], function(x){log10(sum(data3[x,]))}, simplify=FALSE)

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

## [[1]]
## [1] NaN
## 
## [[2]]
## [1] -1.042389
## 
## [[3]]
## [1] 0.4322601
## 
## [[4]]
## [1] 0.07426441
## 
## [[5]]
## [1] -0.2296457
## 
## [[6]]
## [1] NaN
## 
## [[7]]
## [1] -0.8811435

The function tapply() applys a function to each subset of a vector based on a second vector of factors.

?tapply()

# Let's use Fisher's Iris data to demonstrate the usage of tapply().
# First, load the Iris dataset
data(iris)

# Take a look at what the data includes
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Generate a summary of the sepal lengths for each iris species.
tapply(iris$Sepal.Length, iris$Species, summary)

## $setosa
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   4.800   5.000   5.006   5.200   5.800 
## 
## $versicolor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   5.600   5.900   5.936   6.300   7.000 
## 
## $virginica
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   6.225   6.500   6.588   6.900   7.900

Write one’s own functions

Even though there are a lot of R packages available, there are always situations where one might have to write one’s own function to accomplish some very specific goals. Functions are defined by code with a specific format:

function.name <- function(arg1=arg1, arg2, ...){
    var <- sin(arg1) + sin(arg2)  # carry out tasks
    var / 2
}

Here, we are going to write a function to calculate the area of a triangle given the lengths of three sides.

my.area <- function(side1=side1, side2=side2, side3=side3){
    circumference <- (side1 + side2 + side3) / 2
    area <- sqrt(circumference * (circumference - side1) * (circumference - side2) * (circumference - side3))
    return(area)
}

# let's carry out some test
my.area(side1=3, side2=4, side3=5)

## [1] 6

Quiz 5

CHALLENGE

Write a function to calculate min-max normalization for a single row. Here is the formula:

Then use apply to normalize every row of data2. You might want to take a small piece of data2 to test it with, e.g. the first 3 rows and first 3 columns.

HARD CHALLENGE

Now take your normalized data and write a function to find the log2-fold change (i.e. the log of the ratio of normalized counts) between any two samples across all genes. Then use one of the apply functions to calculate log2-fold change across ALL samples, given one sample. Finally, use a for loop to find the pair-wise log2-fold changes for every pair of samples.

Topic 6. Basic statistics in R

Description	R_function
Mean	mean()
Standard deviation	sd()
Variance	var()
Minimum	min()
Maximum	max()
Median	median()
Range of values: minimum and maximum	range()
Sample quantiles	quantile()
Generic function	summary()
Interquartile range	IQR()

Calculate the mean expression for each sample.

apply(data3, 2, mean)

##         V1         V2         V3         V4         V5         V6         V7 
## -0.1659378  0.4394602  0.8915552  0.2118826 -0.3783782 -0.3751873 -0.3221960

Calculate the range of expression for each sample.

apply(data3, 2, range)

##              V1         V2         V3         V4         V5        V6
## [1,] -0.8825093 -0.8945144 -0.2597303 -0.8238730 -1.2652317 -2.248503
## [2,]  0.5756467  1.5552837  2.1465491  0.8924542  0.8004843  1.101128
##              V7
## [1,] -1.7170773
## [2,]  0.5627799

Calculate the quantiles of each samples.

apply(data3, 2, quantile)

##               V1         V2         V3          V4         V5          V6
## 0%   -0.88250933 -0.8945144 -0.2597303 -0.82387303 -1.2652317 -2.24850311
## 25%  -0.75427070 -0.2798758  0.4435817 -0.05025611 -0.9920285 -0.60989225
## 50%  -0.02864424  0.5263068  0.7454713  0.15831049 -0.4412469 -0.45288128
## 75%   0.34124190  1.2244485  1.3607164  0.67839925  0.1207021  0.09686497
## 100%  0.57564667  1.5552837  2.1465491  0.89245421  0.8004843  1.10112759
##               V7
## 0%   -1.71707730
## 25%  -0.86850036
## 50%   0.07852304
## 75%   0.27870153
## 100%  0.56277993

Topic 7. Simple data visulization in R

Scatter plot and line plot can be produced using the function plot().

x <- c(1:50)
y <- 1 + sqrt(x)/2
plot(x,y)

plot(x,y, type="l")

# plot both the points and lines
## first plot points
plot(x,y)
lines(x,y, type="l")

## lines() can only be used to add information to a graph, while it cannot produce a graph on its own.

boxplot() can be used to summarize data.

boxplot(data3, xlab="Sample ID", ylab="Raw Counts")

x <- rnorm(1000)
boxplot(x)

hist() can be used to create histograms of data.

hist(x)

# use user defined break points
hist(x, breaks=seq(range(x)[1]-1, range(x)[2]+1, by=0.5))

# clear plotting device/area
dev.off()

## null device 
##           1

Topic 8. Install packages in R

Starting from Bioconductor version 3.8, the installation of packages is recommended to use BiocManager.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
## install core packages
BiocManager::install()
## install specific packages
BiocManager::install(c("devtools", "tidyverse","bsseq","DSS"))

Bioconductor has a repository and release schedule that differ from R (Bioconductor has a ‘devel’ branch to which new packages and updates are introduced, and a stable ‘release’ branch emitted once every 6 months to which bug fixes but not new features are introduced). This mismatch causes that the version detected by install.packages() is sometimes not the most recent ‘release’.
A consequence of the distinct ‘devel’ branch is that install.packages() sometimes points only to the ‘release’ repository, while users might want to have access to the leading-edge features in the develop version.
An indirect consequence of Bioconductor’s structured release is that packages generally have more extensive dependences with one another.

It is always recommended to update to the most current version of R and Bioconductor. If it is not possible and R < 3.5.0, please use the legacy approach to install Bioconductor packages

source("http://bioconductor.org/biocLite.R")
## install core packages
biocLite()
## install specific packages
biocLite(c("devtools", "tidyverse","bsseq","DSS"))

The R function install.packages() can be used to install packages that are not part of Bioconductor.

install.packages("ggplot2", repos="http://cran.us.r-project.org")
install.packages(c("kableExtra","knitr","dplyr"))

Install from source of github.

library(devtools)
install_github("stephenturner/qqman")

Topic 9. Save data in R session

To save history in R session

#savehistory(file="Oct08.history")

#loadhistory(file="Oct08.history")

To save objects in R session

save(list=c("x", "data"), file="Oct08.RData")

#load("Oct08.RData")

Final challenge

Working with an R notebook, load the Iris data as we did earlier in this documentation, generate a table that lists the median of each measurement (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each species. Then generate a plot based on the result. Finally produce an html report with the table and the plot. Below serves as an example output.

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
setosa	5.0	3.4	1.50	0.2
versicolor	5.9	2.8	4.35	1.3
virginica	6.5	3.0	5.55	2.0

Hints:

In order to output a nice looking table in the final report, one may consider the R package kableExtra https://github.com/haozhu233/kableExtra. One may also check out the documentation at https://cran.r-project.org/web/packages/htmlTable/vignettes/
Figure 1 can be produced using base R functions: plot(), points(), axis(), text().
Figure 2 can be produced using functions in R package lattice https://cran.r-project.org/web/packages/lattice/index.html
Figure 3 can be produced using function boxplot()

Solution:

Download solution.Rmd

☰ Menu

Introduction to R for Bioinformatics

R and RStudio

What is R?

The R environment

The R Homepage

Interface for R

RStudio

Programming fundamentals

Start an R session

Topics covered in this introduction to R

Topic 1. Basic data types in R

There are 5 basic atomic classes: numeric (integer, complex), character, logical

Examples of numeric values.

Examples of character values.

Examples of logical values.

To find out the type of variable.

In the case that one compares two different classes of data, the coersion rule in R is logical -> integer -> numeric -> complex -> character . The following is an example of converting a numeric variable to character.

A logical TRUE is converted to integer 1 and a logical FALSE is converted to integer 0.

Quiz 1

Topic 2. Basic data structures in R

Atomic vectors: an atomic vector is a combination of multiple values(numeric, character or logical) in the same object. An atomic vector is created using the function c().

One can give names to the elements of an atomic vector.

Or One may create a vector with named elements from scratch.

To find out the length of a vector:

NOTE: a vector can only hold elements of the same type. If there are a mixture of data types, they will be coerced according to the coersion rule mentioned earlier in this documentation.

Factors: a factor is a special vector. It stores categorical data, which are important in statistical modeling and can only take on a limited number of pre-defined values. The function factor() can be used to create a factor.

In R, categories of the data are stored as factor levels. The function levels() can be used to access the factor levels.

By default, R infers the factor levels by ordering the unique elements in a factor alphanumerically. One may specifically define the factor levels at the creation of the factor.

Quiz 2

Matrices: A matrix is like an Excel sheet containing multiple rows and columns. It is used to combine vectors of the same type.

One other way to create a matrix is to use matrix() function.

To find out the dimension of a matrix:

Calculations with numeric matrices.

There is a data structure Array in R, that holds multi-dimensional (d > 2) data and is a generalized version of a matrix. Matrix is used much more commonly than Array, therefore we are not going to talk about Array here.

Data frames: a data frame is like a matrix but can have columns with different types (numeric, character, logical).

A data frame can be created using the function data.frame().

To check whether a data is a data frame, use the function is.data.frame().

One can convert a matrix object to a data frame using the function as.data.frame().

A data frame can be transposed in the similar way as a matrix. However, the result of transposing a data frame might not be a data frame anymore.

A data frame can be extended.

One may use the function merge to merge two data frames horizontally, based on one or more common key variables.

Quiz 3

CHALLENGE

Lists: a list is an ordered collection of objects, which can be any type of R objects (vectors, matrices, data frames, even lists).

A list is constructed using the function list().

One could construct a list by giving names to elements.

Subsetting data

Subsetting allows one to access the piece of data of interest. When combinded with assignment, subsetting can modify selected pieces of data. The operators that can be used to subset data are: [, $, and [[.

First, we are going to talk about subsetting data using [, which is the most commonly used operator. We will start by looking at vectors and talk about four ways to subset a vector.

Subsetting a list works in the same way as subsetting an atomic vector. Using [ will always return a list.

Subsetting a matrix can be done by simply generalizing the one dimension subsetting: one may supply a one dimension index for each dimension of the matrix. Blank/Nothing subsetting is now useful in keeping all rows or all columns.

Subsetting a data frame can be done similarly as subsetting a matrix. In addition, one may supply only one 1-dimensional index to subset a data frame. In this case, R will treat the data frame as a list with each column is an element in the list.

Subsetting operators: [[ and $

[[ is similar to [, except that it returns the content of the element.

$ is a shorthand for [[ combined with character subsetting.

Simplifying vs. preserving subsetting

CHALLENGES

Topic 3. Import and export data in R

R base function read.table() is a general funciton that can be used to read a file in table format. The data will be imported as a data frame.

Depending on the format of the file, several variants of read.table() are available to make reading a file easier.

R base function write.table() can be used to export data to a file.

Quiz 4

Topic 4. R markdown and R notebooks

Topic 5. Functions in R

Invoking a function by its name, followed by the parenthesis and zero or more arguments.

One useful function to find out information on an R object: str(). It compactly display the internal structure of an R object.

Conditional structure

Loop structure

A few useful functions: apply(), lapply(), sapply(), and tapply() to replace for loop

apply() takes an array or matrix as input and outputs a vector, array or list.

lapply() takes a list, vector or data frame as input and outputs a list.

The function sapply() works like function lapply(), but tries to simplify the output to the simplest data structure possible. As a matter of fact, sapply() is a “wrapper” function for lapply(). By default, it returns a vector.

If the “simplify” parameter is turned off, sapply() will produced exactly the same results as lapply(), in the form of a list. By default, “simplify” is turned on.

The function tapply() applys a function to each subset of a vector based on a second vector of factors.

Write one’s own functions

Quiz 5

CHALLENGE

HARD CHALLENGE

Topic 6. Basic statistics in R