R and RStudio
What is R?
R is a language and environment for statistical computing and graphics developed in 1993. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, …), and is highly extensible, meaning that the user community can write new R tools. It is a GNU project (Free and Open Source).
The R language has its roots in the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and now, R is developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of S. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
Some of R’s strengths:
- The ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. Such as examples like the following (extracted from http://web.stanford.edu/class/bios221/book/Chap-Graphics.html):
- It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
- R can be extended (easily) via packages.
- R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.
- It has a vast community both in academia and in business.
- It’s FREE!
The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.
Many users think of R as a statistics system. The R group prefers to think of it of an environment within which statistical techniques are implemented.
The R Homepage
The R homepage has a wealth of information on it,
On the homepage you can:
- Learn more about R
- Download R
- Get Documentation (official and user supplied)
- Get access to CRAN ‘Comprehensive R archival network’
Interface for R
There are many ways one can interface with R language. Here are a few popular ones:
RStudio
RStudio started in 2010, to offer R a more full featured integrated development environment (IDE) and modeled after matlab’s IDE.
RStudio has many features:
- syntax highlighting
- code completion
- smart indentation
- “Projects”
- workspace browser and data viewer
- embedded plots
- Markdown notebooks, Sweave authoring and knitr with one click pdf or html
- runs on all platforms and over the web
- etc. etc. etc.
RStudio and its team have contributed to many R packages.[13] These include:
- Tidyverse – R packages for data science, including ggplot2, dplyr, tidyr, and purrr
- Shiny – An interactive web technology
- RMarkdown – Insert R code into markdown documents
- knitr – Dynamic reports combining R, TeX, Markdown & HTML
- packrat – Package dependency tool
- devtools – Package development tool
RStudio Cheat Sheets: rstudio-ide.pdf
Programming fundamentals
There are three concepts that are essential in any programming language:
- VARIABLES
A variable is a named storage. Creating a variable is to reserve some space in memory. In R, the name of a variable can have letters, numbers, dot and underscore. However, a valid variable name cannot start with a underscore or a number, or start with a dot that is followed by a number.
- FUNCTIONS
A function is a block of organized, reusable code that is used to perform a set of predefined operations. A function may take zero or more parameters and return a result.
The way to use a function in R is:
function.name(parameter1=value1, …)
In R, to get help information on a funciton, one may use the command:
?function.name
- OPERATIONS
Operator | Description |
---|---|
<-, = | Assignment |
Operator | Description |
---|---|
+ | Addition |
- | Subtraction |
* | Multiplication |
/ | Division |
^ | Exponent |
%% | Modulus |
%/% | Integer Division |
Operator | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to |
!= | Not equal to |
Operator | Description |
---|---|
! | Logical NOT |
& | Element-wise logical AND |
&& | Logical AND |
| | Element-wise logical OR |
|| | Logical OR |
Start an R session
BEFORE YOU BEGIN, YOU NEED TO START AN R SESSION
You can run this tutorial in an IDE (like Rstudio) on your laptop, or you can run R on the command-line on tadpole by logging into tadpole in a terminal and running the following commands:
module load R
R
NOTE: Below, the text in the yellow boxes is code to input (by typing it or copy/pasting) into your R session, the text in the white boxes is the expected output.
Topics covered in this introduction to R
- Basic data types in R
- Basic data structures in R
- Import and export data in R
- Functions in R
- Basic statistics in R
- Simple data visulization in R
- Install packages in R
- Save data in R session
- R markdown and R notebooks
Topic 1. Basic data types in R
There are 5 basic atomic classes: numeric (integer, complex), character, logical
Examples of numeric values.
# assign number 150 to variable a.
a <- 150
a
## [1] 150
# assign a number in scientific format to variable b.
b <- 3e-2
b
## [1] 0.03
Examples of character values.
# assign a string "BRCA1" to variable gene
gene <- "BRCA1"
gene
## [1] "BRCA1"
# assign a string "Hello World" to variable hello
hello <- "Hello World"
hello
## [1] "Hello World"
Examples of logical values.
# assign logical value "TRUE" to variable brca1_expressed
brca1_expressed <- TRUE
brca1_expressed
## [1] TRUE
# assign logical value "FALSE" to variable her2_expressed
her2_expressed <- FALSE
her2_expressed
## [1] FALSE
# assign logical value to a variable by logical operation
her2_expression_level <- 0
her2_expressed <- her2_expression_level > 0
her2_expressed
## [1] FALSE
To find out the type of variable.
class(her2_expressed)
## [1] "logical"
# To check whether the variable is a specific type
is.numeric(gene)
## [1] FALSE
is.numeric(a)
## [1] TRUE
is.character(gene)
## [1] TRUE
In the case that one compares two different classes of data, the coersion rule in R is logical -> integer -> numeric -> complex -> character . The following is an example of converting a numeric variable to character.
b
## [1] 0.03
as.character(b)
## [1] "0.03"
What happens when one converts a logical variable to numeric?
# recall her2_expressed
her2_expressed
## [1] FALSE
# conversion
as.numeric(her2_expressed)
## [1] 0
her2_expressed + 1
## [1] 1
A logical TRUE is converted to integer 1 and a logical FALSE is converted to integer 0.
Quiz 1
Topic 2. Basic data structures in R
Homogeneous | Heterogeneous | |
---|---|---|
1d | Atomic vector | List |
2d | Matrix | Data frame |
Nd | Array |
Atomic vectors: an atomic vector is a combination of multiple values(numeric, character or logical) in the same object. An atomic vector is created using the function c().
gene_names <- c("ESR1", "p53", "PI3K", "BRCA1", "EGFR")
gene_names
## [1] "ESR1" "p53" "PI3K" "BRCA1" "EGFR"
gene_expression <- c(0, 100, 50, 200, 80)
gene_expression
## [1] 0 100 50 200 80
One can give names to the elements of an atomic vector.
# assign names to a vector by specifying them
names(gene_expression) <- c("ESR1", "p53", "PI3K", "BRCA1", "EGFR")
gene_expression
## ESR1 p53 PI3K BRCA1 EGFR
## 0 100 50 200 80
# assign names to a vector using another vector
names(gene_expression) <- gene_names
gene_expression
## ESR1 p53 PI3K BRCA1 EGFR
## 0 100 50 200 80
Or One may create a vector with named elements from scratch.
gene_expression <- c(ESR1=0, p53=100, PI3K=50, BRCA1=200, EGFR=80)
gene_expression
## ESR1 p53 PI3K BRCA1 EGFR
## 0 100 50 200 80
To find out the length of a vector:
length(gene_expression)
## [1] 5
NOTE: a vector can only hold elements of the same type. If there are a mixture of data types, they will be coerced according to the coersion rule mentioned earlier in this documentation.
Factors: a factor is a special vector. It stores categorical data, which are important in statistical modeling and can only take on a limited number of pre-defined values. The function factor() can be used to create a factor.
disease_stage <- factor(c("Stage1", "Stage2", "Stage2", "Stage3", "Stage1", "Stage4"))
disease_stage
## [1] Stage1 Stage2 Stage2 Stage3 Stage1 Stage4
## Levels: Stage1 Stage2 Stage3 Stage4
In R, categories of the data are stored as factor levels. The function levels() can be used to access the factor levels.
levels(disease_stage)
## [1] "Stage1" "Stage2" "Stage3" "Stage4"
A function to compactly display the internal structure of an R object is str(). Please use str() to display the internal structure of the object we just created disease_stage. It shows that disease_stage is a factor with four levels: “Stage1”, “Stage2”, “Stage3”, etc… The integer numbers after the colon shows that these levels are encoded under the hood by integer values: the first level is 1, the second level is 2, and so on. Basically, when factor function is called, R first scan through the vector to determine how many different categories there are, then it converts the character vector to a vector of integer values, with each integer value labeled with a category.
str(disease_stage)
## Factor w/ 4 levels "Stage1","Stage2",..: 1 2 2 3 1 4
By default, R infers the factor levels by ordering the unique elements in a factor alphanumerically. One may specifically define the factor levels at the creation of the factor.
disease_stage <- factor(c("Stage1", "Stage2", "Stage2", "Stage3", "Stage1", "Stage4"), levels=c("Stage2", "Stage1", "Stage3", "Stage4"))
# The encoding for levels are different from above.
str(disease_stage)
## Factor w/ 4 levels "Stage2","Stage1",..: 2 1 1 3 2 4
If you want to know the number of individuals at each levels, there are two functions: summary and table.
summary(disease_stage)
## Stage2 Stage1 Stage3 Stage4
## 2 2 1 1
table(disease_stage)
## disease_stage
## Stage2 Stage1 Stage3 Stage4
## 2 2 1 1
Quiz 2
Matrices: A matrix is like an Excel sheet containing multiple rows and columns. It is used to combine vectors of the same type.
col1 <- c(1,3,8,9)
col2 <- c(2,18,27,10)
col3 <- c(8,37,267,19)
my_matrix <- cbind(col1, col2, col3)
my_matrix
## col1 col2 col3
## [1,] 1 2 8
## [2,] 3 18 37
## [3,] 8 27 267
## [4,] 9 10 19
One other way to create a matrix is to use matrix() function.
nums <- c(col1, col2, col3)
nums
## [1] 1 3 8 9 2 18 27 10 8 37 267 19
matrix(nums, ncol=2)
## [,1] [,2]
## [1,] 1 27
## [2,] 3 10
## [3,] 8 8
## [4,] 9 37
## [5,] 2 267
## [6,] 18 19
rownames(my_matrix) <- c("row1", "row2", "row3", "row4")
my_matrix
## col1 col2 col3
## row1 1 2 8
## row2 3 18 37
## row3 8 27 267
## row4 9 10 19
t(my_matrix) # transposing the matrix
## row1 row2 row3 row4
## col1 1 3 8 9
## col2 2 18 27 10
## col3 8 37 267 19
To find out the dimension of a matrix:
ncol(my_matrix)
## [1] 3
nrow(my_matrix)
## [1] 4
dim(my_matrix)
## [1] 4 3
Calculations with numeric matrices.
my_matrix * 3
## col1 col2 col3
## row1 3 6 24
## row2 9 54 111
## row3 24 81 801
## row4 27 30 57
log10(my_matrix)
## col1 col2 col3
## row1 0.0000000 0.301030 0.903090
## row2 0.4771213 1.255273 1.568202
## row3 0.9030900 1.431364 2.426511
## row4 0.9542425 1.000000 1.278754
Total of each row.
rowSums(my_matrix)
## row1 row2 row3 row4
## 11 58 302 38
Total of each column.
colSums(my_matrix)
## col1 col2 col3
## 21 57 331
There is a data structure Array in R, that holds multi-dimensional (d > 2) data and is a generalized version of a matrix. Matrix is used much more commonly than Array, therefore we are not going to talk about Array here.
Data frames: a data frame is like a matrix but can have columns with different types (numeric, character, logical).
A data frame can be created using the function data.frame().
# creating a data frame using pre-defined vectors
patients_name=c("Patient1", "Patient2", "Patient3", "Patient4", "Patient5", "Patient6")
Family_history=c("Y", "N", "Y", "N", "Y", "Y")
patients_age=c(31, 40, 39, 50, 45, 65)
meta.data <- data.frame(patients_name=patients_name, disease_stage=disease_stage, Family_history=Family_history, patients_age=patients_age)
meta.data
## patients_name disease_stage Family_history patients_age
## 1 Patient1 Stage1 Y 31
## 2 Patient2 Stage2 N 40
## 3 Patient3 Stage2 Y 39
## 4 Patient4 Stage3 N 50
## 5 Patient5 Stage1 Y 45
## 6 Patient6 Stage4 Y 65
To check whether a data is a data frame, use the function is.data.frame().
is.data.frame(meta.data)
## [1] TRUE
is.data.frame(my_matrix)
## [1] FALSE
One can convert a matrix object to a data frame using the function as.data.frame().
class(my_matrix)
## [1] "matrix" "array"
my_data <- as.data.frame(my_matrix)
class(my_data)
## [1] "data.frame"
A data frame can be transposed in the similar way as a matrix. However, the result of transposing a data frame might not be a data frame anymore.
my_data
## col1 col2 col3
## row1 1 2 8
## row2 3 18 37
## row3 8 27 267
## row4 9 10 19
t(my_data)
## row1 row2 row3 row4
## col1 1 3 8 9
## col2 2 18 27 10
## col3 8 37 267 19
A data frame can be extended.
# add a column that has the information on harmful mutations in BRCA1/BRCA2 genes for each patient.
meta.data
## patients_name disease_stage Family_history patients_age
## 1 Patient1 Stage1 Y 31
## 2 Patient2 Stage2 N 40
## 3 Patient3 Stage2 Y 39
## 4 Patient4 Stage3 N 50
## 5 Patient5 Stage1 Y 45
## 6 Patient6 Stage4 Y 65
meta.data$BRCA <- c("YES", "NO", "YES", "YES", "YES", "NO")
meta.data
## patients_name disease_stage Family_history patients_age BRCA
## 1 Patient1 Stage1 Y 31 YES
## 2 Patient2 Stage2 N 40 NO
## 3 Patient3 Stage2 Y 39 YES
## 4 Patient4 Stage3 N 50 YES
## 5 Patient5 Stage1 Y 45 YES
## 6 Patient6 Stage4 Y 65 NO
A data frame can also be extended using the functions cbind() and rbind(), for adding columns and rows respectively. When using cbind(), the number of values in the new column must match the number of rows in the data frame. When using rbind(), the two data frames must have the same variables/columns.
# add a column that has the information on the racial information for each patient.
cbind(meta.data, Race=c("AJ", "AS", "AA", "NE", "NE", "AS"))
## patients_name disease_stage Family_history patients_age BRCA Race
## 1 Patient1 Stage1 Y 31 YES AJ
## 2 Patient2 Stage2 N 40 NO AS
## 3 Patient3 Stage2 Y 39 YES AA
## 4 Patient4 Stage3 N 50 YES NE
## 5 Patient5 Stage1 Y 45 YES NE
## 6 Patient6 Stage4 Y 65 NO AS
# rbind can be used to add more rows to a data frame.
rbind(meta.data, data.frame(patients_name="Patient7", disease_stage="Stage4", Family_history="Y", patients_age=48, BRCA="YES"))
## patients_name disease_stage Family_history patients_age BRCA
## 1 Patient1 Stage1 Y 31 YES
## 2 Patient2 Stage2 N 40 NO
## 3 Patient3 Stage2 Y 39 YES
## 4 Patient4 Stage3 N 50 YES
## 5 Patient5 Stage1 Y 45 YES
## 6 Patient6 Stage4 Y 65 NO
## 7 Patient7 Stage4 Y 48 YES
One may use the function merge to merge two data frames horizontally, based on one or more common key variables.
expression.data <- data.frame(patients_name=c("Patient3", "Patient4", "Patient5", "Patient1", "Patient2", "Patient6"), EGFR=c(10, 472, 103784, 1782, 187, 18289), TP53=c(16493, 72, 8193, 1849, 173894, 1482))
expression.data
## patients_name EGFR TP53
## 1 Patient3 10 16493
## 2 Patient4 472 72
## 3 Patient5 103784 8193
## 4 Patient1 1782 1849
## 5 Patient2 187 173894
## 6 Patient6 18289 1482
md2 <- merge(meta.data, expression.data, by="patients_name")
md2
## patients_name disease_stage Family_history patients_age BRCA EGFR TP53
## 1 Patient1 Stage1 Y 31 YES 1782 1849
## 2 Patient2 Stage2 N 40 NO 187 173894
## 3 Patient3 Stage2 Y 39 YES 10 16493
## 4 Patient4 Stage3 N 50 YES 472 72
## 5 Patient5 Stage1 Y 45 YES 103784 8193
## 6 Patient6 Stage4 Y 65 NO 18289 1482
Save your workspace to a file so we can load it for day 2:
save.image("day1.RData")
Quiz 3
HOMEWORK
Using the mtcars built-in dataset (Type “mtcars” to see it), add a row that has the averages of each column and name it “Averages”. Now, add a column to mtcars called “hp.gt.100” that is TRUE or FALSE depending on whether the horsepower (hp) for that car is greater than 100 or not.