Examples of numeric values.
# assign number 150 to variable a.
a <- 150
a
## [1] 150
# assign a number in scientific format to variable b.
b <- 3e-2
b
## [1] 0.03
Examples of character values.
# assign a string "Professor" to variable title
title <- "Professor"
title
## [1] "Professor"
# assign a string "Hello World" to variable hello
hello <- "Hello World"
hello
## [1] "Hello World"
Examples of logical values.
# assign logical value "TRUE" to variable is_female
is_female <- TRUE
is_female
## [1] TRUE
# assign logical value "FALSE" to variable is_male
is_male <- FALSE
is_male
## [1] FALSE
# assign logical value to a variable by logical operation
age <- 20
is_adult <- age > 18
is_adult
## [1] TRUE
To find out the type of a variable:
class(is_female)
## [1] "logical"
# To check whether the variable is a specific type
is.numeric(hello)
## [1] FALSE
is.numeric(a)
## [1] TRUE
is.character(hello)
## [1] TRUE
The rule for converting a logical variable to numeric: TRUE becomes 1, FALSE becomes 0.
as.numeric(is_female)
## [1] 1
as.numeric(is_male)
## [1] 0
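Because TRUE is treated as 1 and FALSE as 0, logical values can also be summed or averaged directly. A minimal sketch; the vector passed is hypothetical and not part of the examples above.

```r
# hypothetical logical vector, used only for illustration
passed <- c(TRUE, FALSE, TRUE, TRUE)
# sum() counts the TRUE values, because each TRUE is treated as 1
n_passed <- sum(passed)
n_passed
## [1] 3
# mean() gives the proportion of TRUE values
prop_passed <- mean(passed)
prop_passed
## [1] 0.75
```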
R can convert a numeric variable to a character variable with as.character().
b
## [1] 0.03
as.character(b)
## [1] "0.03"
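The reverse conversion, from character to numeric, works only when the string looks like a number. A minimal sketch, using illustrative variable names that do not appear elsewhere in this document:

```r
# a string that looks like a number converts cleanly
num_text <- "0.03"
num_value <- as.numeric(num_text)
num_value
## [1] 0.03
# a string that is not a number becomes NA;
# suppressWarnings() hides the coercion warning R would otherwise print
not_a_number <- suppressWarnings(as.numeric("Professor"))
is.na(not_a_number)
## [1] TRUE
```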
friend_ages <- c(21, 27, 26, 32)
friend_ages
## [1] 21 27 26 32
friend_names <- c("Mina", "Ella", "Anna", "Cora")
friend_names
## [1] "Mina" "Ella" "Anna" "Cora"
One can give names to the elements of a vector.
# assign names to a vector by specifying them
names(friend_ages) <- c("Mina", "Ella", "Anna", "Carla")
friend_ages
## Mina Ella Anna Carla
## 21 27 26 32
# assign names to a vector using another vector
names(friend_ages) <- friend_names
friend_ages
## Mina Ella Anna Cora
## 21 27 26 32
Alternatively, one may create a vector with named elements from scratch.
friend_ages <- c(Mina=21, Ella=27, Anna=26, Cora=32)
friend_ages
## Mina Ella Anna Cora
## 21 27 26 32
To find out the length of a vector:
length(friend_ages)
## [1] 4
To access elements of a vector: by index, or by name if it is a named vector.
friend_ages[2]
## Ella
## 27
friend_ages["Ella"]
## Ella
## 27
friend_ages[c(1,3)]
## Mina Anna
## 21 26
friend_ages[c("Mina", "Anna")]
## Mina Anna
## 21 26
# selecting elements of a vector by excluding some of them.
friend_ages[-3]
## Mina Ella Cora
## 21 27 32
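A subset of a vector can be selected with a logical vector. Here has_child stores the strings "TRUE" and "FALSE", so the comparison has_child == "TRUE" produces the logical vector used for selection.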
my_friends <- c("Mina", "Ella", "Anna", "Cora")
my_friends
## [1] "Mina" "Ella" "Anna" "Cora"
has_child <- c("TRUE", "TRUE", "FALSE", "TRUE")
has_child
## [1] "TRUE" "TRUE" "FALSE" "TRUE"
my_friends[has_child == "TRUE"]
## [1] "Mina" "Ella" "Cora"
*** NOTE: a vector can only hold elements of the same type.
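To illustrate the note above, the sketch below shows what happens when values of different types are combined with c(): R silently converts them to a common type. The variable names mixed and mixed_num are made up for this illustration.

```r
# mixing numbers, strings and logicals: every element is coerced to character
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
class(mixed)
## [1] "character"
# mixing numbers and logicals: the logicals are coerced to numeric
mixed_num <- c(1.5, TRUE, FALSE)
mixed_num
## [1] 1.5 1.0 0.0
```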
col1 <- c(1,3,8,9)
col2 <- c(2,18,27,10)
col3 <- c(8,37,267,19)
my_matrix <- cbind(col1, col2, col3)
my_matrix
## col1 col2 col3
## [1,] 1 2 8
## [2,] 3 18 37
## [3,] 8 27 267
## [4,] 9 10 19
rownames(my_matrix) <- c("row1", "row2", "row3", "row4")
my_matrix
## col1 col2 col3
## row1 1 2 8
## row2 3 18 37
## row3 8 27 267
## row4 9 10 19
t(my_matrix)
## row1 row2 row3 row4
## col1 1 3 8 9
## col2 2 18 27 10
## col3 8 37 267 19
To find out the dimension of a matrix:
ncol(my_matrix)
## [1] 3
nrow(my_matrix)
## [1] 4
dim(my_matrix)
## [1] 4 3
Accessing elements of a matrix is done in similar ways to accessing elements of a vector.
my_matrix[1,3]
## [1] 8
my_matrix["row1", "col3"]
## [1] 8
my_matrix[1,]
## col1 col2 col3
## 1 2 8
my_matrix[,3]
## row1 row2 row3 row4
## 8 37 267 19
my_matrix[col3 > 20,]
## col1 col2 col3
## row2 3 18 37
## row3 8 27 267
Calculations with matrices.
my_matrix * 3
## col1 col2 col3
## row1 3 6 24
## row2 9 54 111
## row3 24 81 801
## row4 27 30 57
log10(my_matrix)
## col1 col2 col3
## row1 0.0000000 0.301030 0.903090
## row2 0.4771213 1.255273 1.568202
## row3 0.9030900 1.431364 2.426511
## row4 0.9542425 1.000000 1.278754
Total of each row.
rowSums(my_matrix)
## row1 row2 row3 row4
## 11 58 302 38
Total of each column.
colSums(my_matrix)
## col1 col2 col3
## 21 57 331
It is also possible to use the function apply() to apply any statistical function to the rows or columns of a matrix. The advantage of using apply() is that it can take a function created by the user.
The simplified format of apply() is as follows:
apply(X, MARGIN, FUN)
X: the data matrix
MARGIN: possible values are 1 (for rows) and 2 (for columns)
FUN: the function to apply to rows/columns
To calculate the mean of each row.
apply(my_matrix, 1, mean)
## row1 row2 row3 row4
## 3.666667 19.333333 100.666667 12.666667
To calculate the median of each row
apply(my_matrix, 1, median)
## row1 row2 row3 row4
## 2 18 27 10
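As an illustration of passing a user-defined function to apply(), the sketch below computes the spread (maximum minus minimum) of each row. The matrix is rebuilt with the same values so the block runs on its own; the helper name value_spread is hypothetical.

```r
# rebuild the example matrix so this block is self-contained
col1 <- c(1, 3, 8, 9)
col2 <- c(2, 18, 27, 10)
col3 <- c(8, 37, 267, 19)
my_matrix <- cbind(col1, col2, col3)
rownames(my_matrix) <- c("row1", "row2", "row3", "row4")

# hypothetical user-defined function: largest value minus smallest value
value_spread <- function(v) {
  max(v) - min(v)
}

# apply the function to each row (MARGIN = 1)
row_spread <- apply(my_matrix, 1, value_spread)
row_spread
## row1 row2 row3 row4
##    7   34  259   10
```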
friend_groups <- factor(c(1,2,1,2))
friend_groups
## [1] 1 2 1 2
## Levels: 1 2
In R, categories are called factor levels. The function levels() can be used to access the factor levels.
levels(friend_groups)
## [1] "1" "2"
Change the factor levels.
levels(friend_groups) <- c("best_friend", "not_best_friend")
friend_groups
## [1] best_friend not_best_friend best_friend not_best_friend
## Levels: best_friend not_best_friend
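Swap the labels of the levels. Note that assigning to levels() relabels the existing levels rather than reordering them, so the values of the elements change as well.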
levels(friend_groups) <- c("not_best_friend", "best_friend")
friend_groups
## [1] not_best_friend best_friend not_best_friend best_friend
## Levels: not_best_friend best_friend
By default, factor levels are ordered numerically or alphabetically.
friend_groups <- factor(c("not_best_friend", "best_friend", "not_best_friend", "best_friend"))
friend_groups
## [1] not_best_friend best_friend not_best_friend best_friend
## Levels: best_friend not_best_friend
The factor levels can be specified when creating the factor, if the order does not follow the default rule.
friend_groups <- factor(c("not_best_friend", "best_friend", "not_best_friend", "best_friend"), levels=c("not_best_friend", "best_friend"))
friend_groups
## [1] not_best_friend best_friend not_best_friend best_friend
## Levels: not_best_friend best_friend
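The difference between reordering levels and relabelling them is easy to miss, so here is a small sketch of both. The vector sizes and its values are made up for illustration.

```r
# hypothetical data, for illustration only
sizes <- factor(c("small", "large", "small"))
levels(sizes)
## [1] "large" "small"

# reordering with factor(): the data stay the same, only the level order changes
reordered <- factor(sizes, levels = c("small", "large"))
as.character(reordered)
## [1] "small" "large" "small"

# assigning to levels(): the labels are swapped, so the data change
relabelled <- sizes
levels(relabelled) <- c("small", "large")
as.character(relabelled)
## [1] "large" "small" "large"
```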
If you want to know the number of individuals at each level, there are two functions.
summary(friend_groups)
## not_best_friend best_friend
## 2 2
table(friend_groups)
## friend_groups
## not_best_friend best_friend
## 2 2
A data frame can be created using the function data.frame().
# creating a data frame using previously defined vectors
friends <- data.frame(name=friend_names, age=friend_ages, child=has_child)
friends
## name age child
## Mina Mina 21 TRUE
## Ella Ella 27 TRUE
## Anna Anna 26 FALSE
## Cora Cora 32 TRUE
To check whether an object is a data frame, use the function is.data.frame().
is.data.frame(friends)
## [1] TRUE
is.data.frame(my_matrix)
## [1] FALSE
One can convert an object to a data frame using the function as.data.frame().
class(my_matrix)
## [1] "matrix"
my_data <- as.data.frame(my_matrix)
class(my_data)
## [1] "data.frame"
A data frame can be transposed in the same way as a matrix.
my_data
## col1 col2 col3
## row1 1 2 8
## row2 3 18 37
## row3 8 27 267
## row4 9 10 19
t(my_data)
## row1 row2 row3 row4
## col1 1 3 8 9
## col2 2 18 27 10
## col3 8 37 267 19
A subset of a data frame can be obtained in the ways already discussed: by index, by row/column names, or by logical values.
friends["Mina",]
## name age child
## Mina Mina 21 TRUE
# The columns of a data frame can be referred to by the names of the columns
friends
## name age child
## Mina Mina 21 TRUE
## Ella Ella 27 TRUE
## Anna Anna 26 FALSE
## Cora Cora 32 TRUE
friends$age
## [1] 21 27 26 32
friends[friends$age > 26,]
## name age child
## Ella Ella 27 TRUE
## Cora Cora 32 TRUE
friends[friends$child == "TRUE",]
## name age child
## Mina Mina 21 TRUE
## Ella Ella 27 TRUE
## Cora Cora 32 TRUE
The function subset() can also be used to get a subset of a data frame.
# select friends that are older than 26
subset(friends, age > 26)
## name age child
## Ella Ella 27 TRUE
## Cora Cora 32 TRUE
# select the information of the ages of friends
subset(friends, select=age)
## age
## Mina 21
## Ella 27
## Anna 26
## Cora 32
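The row condition and the select argument of subset() can also be combined in a single call. A minimal sketch; friends_demo is a stand-in data frame with the same values as above (but default row names), built here so the block runs on its own.

```r
# stand-in for the friends data frame, so this block is self-contained
friends_demo <- data.frame(
  name = c("Mina", "Ella", "Anna", "Cora"),
  age = c(21, 27, 26, 32),
  child = c("TRUE", "TRUE", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)
# keep the rows where age > 26, and only the name and age columns
older_friends <- subset(friends_demo, age > 26, select = c(name, age))
older_friends
##   name age
## 2 Ella  27
## 4 Cora  32
```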
A data frame can be extended.
# add a column that has the information on the marital status of friends
friends$married <- c("YES", "YES", "NO", "YES")
friends
## name age child married
## Mina Mina 21 TRUE YES
## Ella Ella 27 TRUE YES
## Anna Anna 26 FALSE NO
## Cora Cora 32 TRUE YES
A data frame can also be extended using the functions cbind() and rbind().
# add a column that has the information on the salaries of friends
cbind(friends, salary=c(4000, 8000, 2000, 6000))
## name age child married salary
## Mina Mina 21 TRUE YES 4000
## Ella Ella 27 TRUE YES 8000
## Anna Anna 26 FALSE NO 2000
## Cora Cora 32 TRUE YES 6000
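Only cbind() is shown above. As a sketch of the row-wise counterpart, rbind() can append a row when the new row has the same column names. The data frame friends_small and the new friend are made up for illustration.

```r
# small stand-in data frame (made-up subset), so this block runs on its own
friends_small <- data.frame(name = c("Mina", "Ella"), age = c(21, 27),
                            stringsAsFactors = FALSE)
# the new row must have the same column names as the existing data frame
new_friend <- data.frame(name = "Lena", age = 30, stringsAsFactors = FALSE)
friends_extended <- rbind(friends_small, new_friend)
friends_extended
##   name age
## 1 Mina  21
## 2 Ella  27
## 3 Lena  30
```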
A list can be created using the function list().
my_list <- list(mother="Sophia", father="John", sisters=c("Anna", "Emma"), sister_age=c(5, 10))
my_list
## $mother
## [1] "Sophia"
##
## $father
## [1] "John"
##
## $sisters
## [1] "Anna" "Emma"
##
## $sister_age
## [1] 5 10
# names of elements in the list
names(my_list)
## [1] "mother" "father" "sisters" "sister_age"
# number of elements in the list
length(my_list)
## [1] 4
Elements of a list can be accessed by name or by index.
my_list$mother
## [1] "Sophia"
my_list[["mother"]]
## [1] "Sophia"
my_list[[1]]
## [1] "Sophia"
my_list[[3]]
## [1] "Anna" "Emma"
my_list[[3]][2]
## [1] "Emma"
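The examples above all use double brackets or $. Single brackets also work on a list, but they return a smaller list rather than the element itself. A short sketch of the difference, with the list rebuilt so the block runs on its own; sub_list and sisters are illustrative names.

```r
# rebuild the example list so this block is self-contained
my_list <- list(mother="Sophia", father="John", sisters=c("Anna", "Emma"), sister_age=c(5, 10))
# single brackets return a list containing the selected element(s)
sub_list <- my_list["sisters"]
class(sub_list)
## [1] "list"
# double brackets return the element itself
sisters <- my_list[["sisters"]]
class(sisters)
## [1] "character"
```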
The base R function read.table() is a general function that can be used to read a file in table format. The data will be imported as a data frame.
# If you have downloaded the raw_counts.txt file to your working directory, you may use the following command to read it in.
data <- read.table(file="raw_counts.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
# Alternatively, read the file directly from the internet by providing its URL.
data <- read.table(file="https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2017-August-Variant-Analysis-Workshop/master/friday/Intro2R/raw_counts.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
Take a look at the beginning part of the data frame.
head(data)
## C61 C62 C63 C64 C91 C92 C93 C94 I561 I562 I563 I564 I591
## AT1G01010 322 346 256 396 372 506 361 342 638 488 440 479 770
## AT1G01020 149 87 162 144 189 169 147 108 163 141 119 147 182
## AT1G01030 15 32 35 22 24 33 21 35 18 8 54 35 23
## AT1G01040 687 469 568 651 885 978 794 862 799 769 725 715 811
## AT1G01046 1 1 5 4 5 3 0 2 4 3 1 0 2
## AT1G01050 1447 1032 1083 1204 1413 1484 1138 938 1247 1516 984 1044 1374
## I592 I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010 430 656 467 143 453 429 206 567 458 520 474
## AT1G01020 156 153 177 43 144 114 50 161 195 157 144
## AT1G01030 8 16 24 42 17 22 39 26 28 39 30
## AT1G01040 567 831 694 345 575 605 404 735 651 725 591
## AT1G01046 8 8 1 0 4 0 3 5 7 0 5
## AT1G01050 1355 1437 1577 412 1338 1051 621 1434 1552 1248 1186
Depending on the format of the file, several variants of read.table() are available to make reading a file easier.
read.csv(): for reading “comma separated value” files (.csv).
read.csv2(): variant used in countries that use a comma “,” as decimal point and a semicolon “;” as field separator.
read.delim(): for reading “tab separated value” files (“.txt”). By default, point(“.”) is used as decimal point.
read.delim2(): for reading “tab separated value” files (“.txt”). By default, comma (“,”) is used as decimal point.
# If you have downloaded the raw_counts.csv file to your working directory, you may use the following command to read it in.
data2 <- read.csv(file="raw_counts.csv", stringsAsFactors=FALSE)
# Otherwise, you may read the file by providing the url to the read.csv() function.
data2 <- read.csv(file="https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2017-August-Variant-Analysis-Workshop/master/friday/Intro2R/raw_counts.csv", stringsAsFactors=FALSE)
# To look at the file:
head(data2)
## C61 C62 C63 C64 C91 C92 C93 C94 I561 I562 I563 I564 I591
## AT1G01010 322 346 256 396 372 506 361 342 638 488 440 479 770
## AT1G01020 149 87 162 144 189 169 147 108 163 141 119 147 182
## AT1G01030 15 32 35 22 24 33 21 35 18 8 54 35 23
## AT1G01040 687 469 568 651 885 978 794 862 799 769 725 715 811
## AT1G01046 1 1 5 4 5 3 0 2 4 3 1 0 2
## AT1G01050 1447 1032 1083 1204 1413 1484 1138 938 1247 1516 984 1044 1374
## I592 I593 I594 I861 I862 I863 I864 I891 I892 I893 I894
## AT1G01010 430 656 467 143 453 429 206 567 458 520 474
## AT1G01020 156 153 177 43 144 114 50 161 195 157 144
## AT1G01030 8 16 24 42 17 22 39 26 28 39 30
## AT1G01040 567 831 694 345 575 605 404 735 651 725 591
## AT1G01046 8 8 1 0 4 0 3 5 7 0 5
## AT1G01050 1355 1437 1577 412 1338 1051 621 1434 1552 1248 1186
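To make the difference between read.csv() and read.csv2() concrete, the sketch below reads the same small table written in both conventions. The table contents are made up, and the text argument is used here only to avoid creating files.

```r
# the same small table written in two conventions (made-up values)
csv_text  <- "gene,score\ng1,1.5\ng2,2.25"
csv2_text <- "gene;score\ng1;1,5\ng2;2,25"

# read.csv(): comma as field separator, point as decimal point
tab1 <- read.csv(text = csv_text, stringsAsFactors = FALSE)
# read.csv2(): semicolon as field separator, comma as decimal point
tab2 <- read.csv2(text = csv2_text, stringsAsFactors = FALSE)

# both calls give the same numeric column
tab1$score
## [1] 1.50 2.25
tab2$score
## [1] 1.50 2.25
```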
The base R function write.table() can be used to export a data frame or matrix to a file.
# To write to a file called "output.txt" in your current working directory.
write.table(data2[1:20,], file="output.txt", sep="\t", quote=FALSE, row.names=TRUE, col.names=TRUE)
It is also possible to export data to a CSV file, using one of the following functions:
write.csv()
write.csv2()
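A minimal round-trip sketch of write.csv(): write a small data frame to a file and read it back. The object names are illustrative, the two rows reuse values from the head(data2) output above, and tempfile() is used so nothing is written to the working directory.

```r
# small data frame with row names (values copied from the first two rows above)
counts <- data.frame(C61 = c(322, 149), C62 = c(346, 87),
                     row.names = c("AT1G01010", "AT1G01020"))
# write to a temporary file
out_file <- tempfile(fileext = ".csv")
write.csv(counts, file = out_file, quote = FALSE)
# read the file back, using the first column as row names
counts_back <- read.csv(out_file, row.names = 1)
counts_back
##           C61 C62
## AT1G01010 322 346
## AT1G01020 149  87
```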
# to find out the current working directory
getwd()
## [1] "/Users/mattsettles/projects/src/github.com-ucdavis-bioinformatics-training/2018-September-Bioinformatics-Prerequisites/thursday/Intro2R"
# to set a different working directory, use setwd
setwd("/Users/jli/Desktop")
# to list all variables in the environment
ls()
## [1] "a" "age" "b" "col1"
## [5] "col2" "col3" "data" "data2"
## [9] "friend_ages" "friend_groups" "friend_names" "friends"
## [13] "has_child" "hello" "is_adult" "is_female"
## [17] "is_male" "my_data" "my_friends" "my_list"
## [21] "my_matrix" "title"
# to create a vector from 2 to 3, using an increment of 0.1
seq(2, 3, by=0.1)
## [1] 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
# to create a vector with repeated elements
rep(1:3, times=3)
## [1] 1 2 3 1 2 3 1 2 3
rep(1:3, each=3)
## [1] 1 1 1 2 2 2 3 3 3
# to get help information on a function in R: ?function.name()
?seq()
?sort()
?rep()
str(data2)
## 'data.frame': 33602 obs. of 24 variables:
## $ C61 : int 322 149 15 687 1 1447 2667 297 0 74 ...
## $ C62 : int 346 87 32 469 1 1032 2472 226 0 79 ...
## $ C63 : int 256 162 35 568 5 1083 2881 325 0 138 ...
## $ C64 : int 396 144 22 651 4 1204 2632 341 0 85 ...
## $ C91 : int 372 189 24 885 5 1413 5120 199 0 68 ...
## $ C92 : int 506 169 33 978 3 1484 6176 180 0 41 ...
## $ C93 : int 361 147 21 794 0 1138 7088 195 0 110 ...
## $ C94 : int 342 108 35 862 2 938 6810 107 0 81 ...
## $ I561: int 638 163 18 799 4 1247 2258 377 0 72 ...
## $ I562: int 488 141 8 769 3 1516 1808 534 0 76 ...
## $ I563: int 440 119 54 725 1 984 2279 300 0 184 ...
## $ I564: int 479 147 35 715 0 1044 2299 223 0 156 ...
## $ I591: int 770 182 23 811 2 1374 4755 298 0 96 ...
## $ I592: int 430 156 8 567 8 1355 3128 318 0 70 ...
## $ I593: int 656 153 16 831 8 1437 4419 397 0 77 ...
## $ I594: int 467 177 24 694 1 1577 3726 373 0 77 ...
## $ I861: int 143 43 42 345 0 412 1452 86 0 174 ...
## $ I862: int 453 144 17 575 4 1338 1516 266 0 113 ...
## $ I863: int 429 114 22 605 0 1051 1455 281 0 69 ...
## $ I864: int 206 50 39 404 3 621 1429 164 0 176 ...
## $ I891: int 567 161 26 735 5 1434 3867 230 0 69 ...
## $ I892: int 458 195 28 651 7 1552 4718 270 0 80 ...
## $ I893: int 520 157 39 725 0 1248 4580 220 0 81 ...
## $ I894: int 474 144 30 591 5 1186 3575 229 0 62 ...
#?lapply()
data <- as.data.frame(matrix(rnorm(49), ncol=7), stringsAsFactors=FALSE)
dim(data)
## [1] 7 7
lapply(1:dim(data)[1], function(x){sum(data[x,])})
## [[1]]
## [1] -3.196251
##
## [[2]]
## [1] 4.629736
##
## [[3]]
## [1] -2.386937
##
## [[4]]
## [1] -4.244555
##
## [[5]]
## [1] -4.26624
##
## [[6]]
## [1] -1.920159
##
## [[7]]
## [1] 0.4293733
apply(data, MARGIN=1, sum)
## [1] -3.1962511 4.6297360 -2.3869367 -4.2445553 -4.2662399 -1.9201587
## [7] 0.4293733
lapply(1:dim(data)[1], function(x){log10(sum(data[x,]))})
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## [[1]]
## [1] NaN
##
## [[2]]
## [1] 0.6655562
##
## [[3]]
## [1] NaN
##
## [[4]]
## [1] NaN
##
## [[5]]
## [1] NaN
##
## [[6]]
## [1] NaN
##
## [[7]]
## [1] -0.367165
# To check the syntax of using sapply():
#?sapply()
sapply(1:dim(data)[1], function(x){log10(sum(data[x,]))})
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## [1] NaN 0.6655562 NaN NaN NaN NaN
## [7] -0.3671650
sapply(1:dim(data)[1], function(x){log10(sum(data[x,]))}, simplify=FALSE)
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning in FUN(X[[i]], ...): NaNs produced
## [[1]]
## [1] NaN
##
## [[2]]
## [1] 0.6655562
##
## [[3]]
## [1] NaN
##
## [[4]]
## [1] NaN
##
## [[5]]
## [1] NaN
##
## [[6]]
## [1] NaN
##
## [[7]]
## [1] -0.367165
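The numbers above change on every run because rnorm() draws random values. The sketch below fixes the random seed and computes the same kind of row sums in three ways. The seed value and the name demo are arbitrary choices, and vapply() is a stricter relative of sapply() that is not used elsewhere in this document.

```r
# fix the random seed so the random numbers are the same on every run
set.seed(1)
demo <- as.data.frame(matrix(rnorm(49), ncol = 7))

# three ways to compute the sum of each row
sums_sapply <- sapply(seq_len(nrow(demo)), function(i) sum(demo[i, ]))
# vapply() additionally checks that every result is a single number
sums_vapply <- vapply(seq_len(nrow(demo)), function(i) sum(demo[i, ]), numeric(1))
# rowSums() returns a named vector, so the names are dropped for comparison
sums_rowsums <- unname(rowSums(demo))

# all three give the same numeric vector of length 7
all.equal(sums_sapply, sums_vapply)
## [1] TRUE
all.equal(sums_sapply, sums_rowsums)
## [1] TRUE
```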
Description | R_function |
---|---|
Mean | mean() |
Standard deviation | sd() |
Variance | var() |
Minimum | min() |
Maximum | max() |
Median | median() |
Range of values: minimum and maximum | range() |
Sample quantiles | quantile() |
Generic function | summary() |
Interquartile range | IQR() |
Calculate the mean expression for each sample.
apply(data, 2, mean)
## V1 V2 V3 V4 V5 V6
## -0.07493860 0.09016817 -0.07351656 -0.41336725 -0.35316022 -0.46762713
## V7
## -0.27256304
Calculate the range of expression for each sample.
apply(data, 2, range)
## V1 V2 V3 V4 V5 V6
## [1,] -1.660427 -1.191180 -1.536469 -2.256240 -1.496846 -1.6728333
## [2,] 1.391026 0.980829 1.177698 1.078647 2.134961 0.7939947
## V7
## [1,] -1.298585
## [2,] 0.716350
Calculate the quantiles of each sample.
apply(data, 2, quantile)
## V1 V2 V3 V4 V5 V6
## 0% -1.6604271 -1.1911804 -1.5364690 -2.2562401 -1.4968461 -1.6728333
## 25% -0.8622092 -0.3864804 -0.8237535 -1.4613897 -1.1177519 -1.2923110
## 50% -0.2017710 0.0567153 -0.2640791 0.1169598 -0.4286974 -0.4084193
## 75% 0.8355103 0.7788870 0.8778705 0.5449211 -0.2230176 0.2992450
## 100% 1.3910257 0.9808290 1.1776980 1.0786468 2.1349609 0.7939947
## V7
## 0% -1.2985853
## 25% -0.5358018
## 50% -0.3978581
## 75% 0.0718778
## 100% 0.7163500
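Extra arguments given to apply() after FUN are passed on to the function being applied. The sketch below uses this to request only the 25% and 75% quantiles; the matrix expr and its values are made up for illustration.

```r
# small made-up matrix: two samples (columns), five values each
expr <- cbind(s1 = c(1, 2, 3, 4, 5), s2 = c(10, 20, 30, 40, 50))
# arguments after FUN are passed on to FUN; here probs goes to quantile()
q <- apply(expr, 2, quantile, probs = c(0.25, 0.75))
q
##     s1 s2
## 25%  2 20
## 75%  4 40
```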
Scatter plots and line plots can be produced using the function plot().
x <- c(1:50)
y <- 1 + sqrt(x)/2
plot(x,y)
plot(x,y, type="l")
# plot both the points and lines
## first plot points
plot(x,y)
lines(x,y, type="l")
## lines() can only add information to an existing graph; it cannot produce a graph on its own.
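Points and lines can also be drawn in a single call with type="b". In the sketch below the plot is sent to a temporary PDF file, which is an assumption made so that the block runs without a graphics window; plot_file is an illustrative name.

```r
x <- 1:50
y <- 1 + sqrt(x)/2
# open a PDF device that writes to a temporary file
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
# type = "b" draws both the points and the connecting lines
plot(x, y, type = "b")
# close the PDF device so the file is written to disk
dev.off()
```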
boxplot() can be used to summarize data.
boxplot(data, xlab="Sample ID", ylab="Raw Counts")
x <- rnorm(1000)
boxplot(x)
hist() can be used to create histograms of data.
hist(x)
# use user defined break points
hist(x, breaks=seq(range(x)[1]-1, range(x)[2]+1, by=0.5))
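hist() also returns the bin boundaries and counts it computed, which can be inspected without drawing anything. A minimal sketch; the seed and the name values are arbitrary, and the break points are rounded outward so that they cover the whole range of the data.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
values <- rnorm(1000)
# plot = FALSE returns the histogram object without drawing it
h <- hist(values, breaks = seq(floor(min(values)), ceiling(max(values)), by = 0.5),
          plot = FALSE)
# every value falls into exactly one bin, so the counts add up to 1000
sum(h$counts)
## [1] 1000
# there is one more break point than there are bins
length(h$breaks) - length(h$counts)
## [1] 1
```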
# clear plotting device/area
dev.off()
## null device
## 1
There are two ways to install Bioconductor packages in R: biocLite() and install.packages().
source("http://bioconductor.org/biocLite.R")
## Bioconductor version 3.7 (BiocInstaller 1.30.0), ?biocLite for help
## install core packages
#biocLite()
## install specific packages
#biocLite("RCircos")
#biocLite(c("IdeoViz", "devtools"))
If this is the first time you run the “source” command above, R also prints messages while it installs the BiocInstaller package.
#install.packages("ggplot2", repos="http://cran.us.r-project.org")
Install from source on GitHub.
library(devtools)
install_github("stephenturner/qqman")
## Skipping install of 'qqman' from a github remote, the SHA1 (0c3aa54e) has not changed since last install.
## Use `force = TRUE` to force installation
Bioconductor has a repository and release schedule that differ from R’s (Bioconductor has a ‘devel’ branch to which new packages and updates are introduced, and a stable ‘release’ branch, issued once every 6 months, to which bug fixes but not new features are introduced). Because of this mismatch, the version detected by install.packages() is sometimes not the most recent ‘release’.
A consequence of the distinct ‘devel’ branch is that install.packages() sometimes points only to the ‘release’ repository, while users might want access to the leading-edge features in the ‘devel’ version.
An indirect consequence of Bioconductor’s structured release is that packages generally have more extensive dependencies on one another.
#biocLite("BiocUpgrade")
savehistory(file="Sept6.history")
#loadhistory(file="Sept6.history")
save(list=c("x", "data"), file="Sept6.RData")
#load("Sept6.RData")
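A sketch of the save()/load() round trip that can be run safely: the objects and file name below are made up, the file is temporary, and the objects are loaded into a separate environment (an assumption made here so that nothing in the current workspace is overwritten).

```r
# made-up objects to save
x_demo <- c(1, 2, 3)
data_demo <- data.frame(id = 1:2, value = c(0.5, 1.5))
# save both objects to a temporary .RData file
rdata_file <- tempfile(fileext = ".RData")
save(list = c("x_demo", "data_demo"), file = rdata_file)
# load them into a fresh environment; load() returns the names of the loaded objects
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
# the restored copy is identical to the original
identical(restored$x_demo, x_demo)
## [1] TRUE
```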