3 Data frames | Introduction to R for Bioinformatics

3.1 Structure of a data frame

What is the birthweight object? In the environment browser, you should see that it is “42 obs. of 18 variables.” It’s probably a data table with 42 rows and 18 columns. We can verify this using the class() and dim() functions.

class(birthweight)

## [1] "data.frame"

dim(birthweight)

## [1] 42 18

A note on formatting: in this documentation, lines beginning with ## are the output of the R code shown. Running dim(birthweight) asked R for the dimensions of the birthweight object; the answer is 42 (rows) x 18 (columns). The “[1]” is not part of the data. It is an index that R adds to help you keep track of the values when an operation outputs many of them. We will see other examples later that should make this clearer. For now, we can safely ignore the “[1]”.

A data frame organizes data into rows and columns. The object must be “rectangular”: every row has the same number of fields, and all values in a column are of the same type.

Each column of a data frame is a vector. A vector is an ordered collection of values of the same type.
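To make the "rectangular" and "one type per column" rules concrete, here is a minimal sketch that builds a small data frame by hand. The object name bw is an assumption made for this illustration; the values are copied from the first five rows of Table 3.1, and only four of the 18 columns are included.

```r
# A toy data frame with four columns taken from the first five rows of Table 3.1.
# stringsAsFactors = FALSE keeps the text column as plain character values
# regardless of the R version in use.
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  location    = c("General", "Silver Hill", "Silver Hill", "Silver Hill", "Memorial"),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74),
  stringsAsFactors = FALSE
)

class(bw)             # the object is a data frame
dim(bw)               # 5 rows, 4 columns
str(bw)               # one line per column, showing the type of each
is.vector(bw$length)  # each column is itself a vector
```

Every column passed to data.frame() has five values, which is what keeps the result rectangular.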

Let’s take a look at the contents.

birthweight

Table 3.1: The birthweight data set.
ID	birth.date	location	length	birthweight	head.circumference	weeks.gestation	smoker	maternal.age	maternal.cigarettes	maternal.height	maternal.prepregnant.weight	paternal.age	paternal.education	paternal.cigarettes	paternal.height	low.birthweight	geriatric.pregnancy
1107	1/25/1967	General	52	3.23	36	38	no	31	0	164	57	NA	NA	NA	NA	0	FALSE
697	2/6/1967	Silver Hill	48	3.03	35	39	no	27	0	162	62	27	14	0	178	0	FALSE
1683	2/14/1967	Silver Hill	53	3.35	33	41	no	27	0	164	62	37	14	0	170	0	FALSE
27	3/9/1967	Silver Hill	53	3.55	37	41	yes	37	25	161	66	46	NA	0	175	0	TRUE
1522	3/13/1967	Memorial	50	2.74	33	39	yes	21	17	156	53	24	12	7	179	0	FALSE
569	3/23/1967	Memorial	50	2.51	35	39	yes	22	7	159	52	23	14	25	NA	1	FALSE
365	4/23/1967	Memorial	52	3.53	37	40	yes	26	25	170	62	30	10	25	181	0	FALSE
808	5/5/1967	Silver Hill	48	2.92	33	34	no	26	0	167	64	25	12	25	175	0	FALSE
1369	6/4/1967	Silver Hill	49	3.18	34	38	yes	31	25	162	57	32	16	50	194	0	FALSE
1023	6/7/1967	Memorial	52	3.00	35	38	yes	30	12	165	64	38	14	50	180	0	FALSE
822	6/14/1967	Memorial	50	3.42	35	38	no	20	0	157	48	22	14	0	179	0	FALSE
1272	6/20/1967	Memorial	53	2.75	32	40	yes	37	50	168	61	31	16	0	173	0	TRUE
1262	6/25/1967	Silver Hill	53	3.19	34	41	yes	27	35	163	51	31	16	25	185	0	FALSE
575	7/12/1967	Memorial	50	2.78	30	37	yes	19	7	165	60	20	14	0	183	0	FALSE
1016	7/13/1967	Silver Hill	53	4.32	36	40	no	19	0	171	62	19	12	0	183	0	FALSE
792	9/7/1967	Memorial	53	3.64	38	40	yes	20	2	170	59	24	12	12	185	0	FALSE
820	10/7/1967	General	52	3.77	34	40	no	24	0	157	50	31	16	0	173	0	FALSE
752	10/19/1967	General	49	3.32	36	40	yes	27	12	152	48	37	12	25	170	0	FALSE
619	11/1/1967	Memorial	52	3.41	33	39	yes	23	25	181	69	23	16	2	181	0	FALSE
1764	12/7/1967	Silver Hill	58	4.57	39	41	yes	32	12	173	70	38	14	25	180	0	FALSE
1081	12/14/1967	Silver Hill	54	3.63	38	38	no	18	0	172	50	20	12	7	172	0	FALSE
516	1/8/1968	Silver Hill	47	2.66	33	35	yes	20	35	170	57	23	12	50	186	1	FALSE
272	1/10/1968	Memorial	52	3.86	36	39	yes	30	25	170	78	40	16	50	178	0	FALSE
321	1/21/1968	Silver Hill	48	3.11	33	37	no	28	0	158	54	39	10	0	171	0	FALSE
1636	2/2/1968	Silver Hill	51	3.93	38	38	no	29	0	165	61	NA	NA	NA	NA	0	FALSE
1360	2/16/1968	General	56	4.55	34	44	no	20	0	162	57	23	10	35	179	0	FALSE
1388	2/22/1968	Memorial	51	3.14	33	41	yes	22	7	160	53	24	16	12	176	0	FALSE
1363	4/2/1968	General	48	2.37	30	37	yes	20	7	163	47	20	10	35	185	1	FALSE
1058	4/24/1968	Silver Hill	53	3.15	34	40	no	29	0	167	60	30	16	NA	182	0	FALSE
755	4/25/1968	Memorial	53	3.20	33	41	no	21	0	155	55	25	14	25	183	0	FALSE
462	6/19/1968	Silver Hill	58	4.10	39	41	no	35	0	172	58	31	16	25	185	0	TRUE
300	7/18/1968	Silver Hill	46	2.05	32	35	yes	41	7	166	57	37	14	25	173	1	TRUE
1088	7/24/1968	General	51	3.27	36	40	no	24	0	168	53	29	16	0	181	0	FALSE
57	8/12/1968	Memorial	51	3.32	38	39	yes	23	17	157	48	NA	NA	NA	NA	0	FALSE
553	8/17/1968	Silver Hill	54	3.94	37	42	no	24	0	175	66	30	12	0	184	0	FALSE
1191	9/7/1968	General	53	3.65	33	42	no	21	0	165	61	21	10	25	185	0	FALSE
431	9/16/1968	Silver Hill	48	1.92	30	33	yes	20	7	161	50	20	10	35	180	1	FALSE
1313	9/27/1968	Silver Hill	43	2.65	32	33	no	24	0	149	45	26	16	0	169	1	FALSE
1600	10/9/1968	General	53	2.90	34	39	no	19	0	165	57	NA	NA	NA	NA	0	FALSE
532	10/25/1968	General	53	3.59	34	40	yes	31	12	163	49	41	12	50	191	0	FALSE
223	12/11/1968	General	50	3.87	33	45	yes	28	25	163	54	30	16	0	183	0	FALSE
1187	12/19/1968	Silver Hill	53	4.07	38	44	no	20	0	174	68	26	14	25	189	0	FALSE

The data frame format should look familiar. It’s a lot like a spreadsheet.

Generally, we don’t want to operate on the entire data frame. For example, to calculate the mean birth weight, we don’t need the information in the “paternal.education” column.

There are three ways to have R subset the data frame: $, [[, and [.

3.2 Selecting a single column using the $ and [[ operators

The simplest way to get all the values in the “birthweight” column is with the $ operator.

birthweight$birthweight

##  [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07

Notice that there are now three numbers inside brackets: one at the beginning of each line of output. Each is the index (location) of the value that follows it within the output vector. They give us a general idea of the length of the vector, and allow us to determine the value of a particular observation at a glance. For example, we can answer the question “what was the birth weight of the 34th baby?” The third line starts at index 31, so the 34th value is the fourth one on that line: 3.32.

Once the vector of birth weights has been extracted from the rest of the data frame, it can be used to calculate a mean.

mean(birthweight$birthweight)

## [1] 3.312857

This $ operator is a shortcut for the [[ subsetting operator, which takes six characters to type instead of one (two pairs of square brackets and a pair of quotation marks). Given a full column name, both return the value of the named element.

birthweight[["birthweight"]]

##  [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07

mean(birthweight[["birthweight"]])

## [1] 3.312857

One difference to note is that while [[ works with the index, or column number, $ does not.

# which column contains the birth weight?
# lines beginning with a '#' are comments, and are not executed by R
colnames(birthweight)

##  [1] "ID"                          "birth.date"                 
##  [3] "location"                    "length"                     
##  [5] "birthweight"                 "head.circumference"         
##  [7] "weeks.gestation"             "smoker"                     
##  [9] "maternal.age"                "maternal.cigarettes"        
## [11] "maternal.height"             "maternal.prepregnant.weight"
## [13] "paternal.age"                "paternal.education"         
## [15] "paternal.cigarettes"         "paternal.height"            
## [17] "low.birthweight"             "geriatric.pregnancy"

birthweight[[5]]

##  [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07

mean(birthweight[[5]])

## [1] 3.312857

# the $ operator can't take an index; R rejects this line with a syntax error
birthweight$5
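The relationship between $ and [[ can be checked on a small example. The sketch below uses a toy data frame named bw (a name assumed for this illustration) holding two columns from the first five rows of Table 3.1. It also shows a second practical difference: [[ accepts a column name stored in a variable, while $ takes whatever is typed after it literally.

```r
# Toy data frame: two columns from the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

# Given the full column name, $ and [[ return the same vector
identical(bw$birthweight, bw[["birthweight"]])

# [[ accepts a column number or a name held in a variable
col <- "birthweight"
bw[[2]]
bw[[col]]

# $ does not look inside the variable: it searches for a column named "col"
bw$col   # NULL, because no such column exists
```

This is why [[ tends to be the safer choice inside scripts where the column of interest is decided at run time.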

3.3 Selecting a subset of the data frame using the [ operator

Unlike $ and [[, which return the value(s) contained in the specified element, [ returns an object of the same type it is used to subset. Using [ to retrieve the fifth column will return a data frame with 42 rows and 1 column. This may not seem like a big difference, but it can be an important distinction in some cases.

birthweight[5]

Table 3.2: Results of birthweight[5].
birthweight
3.23
3.03
3.35
3.55
2.74
2.51
3.53
2.92
3.18
3.00
3.42
2.75
3.19
2.78
4.32
3.64
3.77
3.32
3.41
4.57
3.63
2.66
3.86
3.11
3.93
4.55
3.14
2.37
3.15
3.20
4.10
2.05
3.27
3.32
3.94
3.65
1.92
2.65
2.90
3.59
3.87
4.07
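The distinction between [ and [[ is easiest to see by asking R what each one returns. This is a minimal sketch on a toy data frame named bw (an assumed name), built from the first five rows of Table 3.1 with only three columns, so the birth weight is column 3 here rather than column 5.

```r
# Toy data frame: three columns from the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

class(bw[3])     # "data.frame": still a table, with one column
dim(bw[3])       # 5 rows, 1 column
class(bw[[3]])   # "numeric": the bare vector of values

# Functions that expect a vector need the [[ (or $) form
mean(bw[[3]])
# mean(bw[3]) would return NA with a warning, because a data frame
# is not a numeric vector
```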

Because the [ operator returns a new data frame, it can be used to select multiple rows and/or columns at once.

birthweight[c(1,5)]

Table 3.3: Results of birthweight[c(1,5)].
ID	birthweight
1107	3.23
697	3.03
1683	3.35
27	3.55
1522	2.74
569	2.51
365	3.53
808	2.92
1369	3.18
1023	3.00
822	3.42
1272	2.75
1262	3.19
575	2.78
1016	4.32
792	3.64
820	3.77
752	3.32
619	3.41
1764	4.57
1081	3.63
516	2.66
272	3.86
321	3.11
1636	3.93
1360	4.55
1388	3.14
1363	2.37
1058	3.15
755	3.20
462	4.10
300	2.05
1088	3.27
57	3.32
553	3.94
1191	3.65
431	1.92
1313	2.65
1600	2.90
532	3.59
223	3.87
1187	4.07

The c() function creates a vector. This allows R to treat indices 1 and 5 as a single argument. This is critical, because birthweight[1,5] does not produce the same effect at all.

birthweight[1, 5]

## [1] 3.23

What happened?

When there are two arguments provided to [, R interprets these as the index on the first (row) and second (column) dimension of the object. The value returned is the content of the first row, fifth column: the birth weight of individual 1107.

The default behavior of [ is to return the entire object. The first argument acts as a sort of filter on the first dimension, the second argument as a filter on the second dimension, and so on. Leaving the space before the comma blank will return all rows (no filter applied), while leaving the space following the comma blank will return all columns. Be sure to try variations on the example code below to see what happens.

birthweight[c(2,7,29), c(1,5)]

Table 3.4: Results of birthweight[c(2,7,29), c(1,5)].
	ID	birthweight
2	697	3.03
7	365	3.53
29	1058	3.15
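A few of the variations suggested above, sketched on a toy data frame named bw (an assumed name) with three columns from the first five rows of Table 3.1. One detail worth knowing: when the two-argument form selects a single column, base R simplifies the result to a vector unless drop = FALSE is given.

```r
# Toy data frame: three columns from the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

bw[2, ]                 # row 2, all columns (nothing after the comma)
bw[, c(1, 3)]           # all rows (nothing before the comma), columns 1 and 3
bw[c(2, 4), c(1, 3)]    # rows 2 and 4, columns 1 and 3

bw[, 3]                 # a single column is simplified to a vector
bw[, 3, drop = FALSE]   # drop = FALSE keeps it as a one-column data frame
```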

Using a minus sign before an index or group of indices will exclude the specified rows / columns.

colnames(birthweight)

##  [1] "ID"                          "birth.date"                 
##  [3] "location"                    "length"                     
##  [5] "birthweight"                 "head.circumference"         
##  [7] "weeks.gestation"             "smoker"                     
##  [9] "maternal.age"                "maternal.cigarettes"        
## [11] "maternal.height"             "maternal.prepregnant.weight"
## [13] "paternal.age"                "paternal.education"         
## [15] "paternal.cigarettes"         "paternal.height"            
## [17] "low.birthweight"             "geriatric.pregnancy"

# exclude paternal data (columns 13-16)
birthweight[c(1,3,5:13), -c(13:16)]

Table 3.5: Results of birthweight[c(1,3,5:13), -c(13:16)].
	ID	birth.date	location	length	birthweight	head.circumference	weeks.gestation	smoker	maternal.age	maternal.cigarettes	maternal.height	maternal.prepregnant.weight	low.birthweight	geriatric.pregnancy
1	1107	1/25/1967	General	52	3.23	36	38	no	31	0	164	57	0	FALSE
3	1683	2/14/1967	Silver Hill	53	3.35	33	41	no	27	0	164	62	0	FALSE
5	1522	3/13/1967	Memorial	50	2.74	33	39	yes	21	17	156	53	0	FALSE
6	569	3/23/1967	Memorial	50	2.51	35	39	yes	22	7	159	52	1	FALSE
7	365	4/23/1967	Memorial	52	3.53	37	40	yes	26	25	170	62	0	FALSE
8	808	5/5/1967	Silver Hill	48	2.92	33	34	no	26	0	167	64	0	FALSE
9	1369	6/4/1967	Silver Hill	49	3.18	34	38	yes	31	25	162	57	0	FALSE
10	1023	6/7/1967	Memorial	52	3.00	35	38	yes	30	12	165	64	0	FALSE
11	822	6/14/1967	Memorial	50	3.42	35	38	no	20	0	157	48	0	FALSE
12	1272	6/20/1967	Memorial	53	2.75	32	40	yes	37	50	168	61	0	TRUE
13	1262	6/25/1967	Silver Hill	53	3.19	34	41	yes	27	35	163	51	0	FALSE
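Negative indices can be tried out on a smaller object. The sketch below uses a toy data frame named bw (an assumed name) with four columns from the first five rows of Table 3.1. Two limits are worth noting: positive and negative indices cannot be mixed in one vector, and the minus sign works with positions, not with column names. The setdiff() line is one possible way to drop a column by name and is not part of the original example.

```r
# Toy data frame: four columns from the first five rows of Table 3.1
bw <- data.frame(
  ID           = c(1107, 697, 1683, 27, 1522),
  length       = c(52, 48, 53, 53, 50),
  birthweight  = c(3.23, 3.03, 3.35, 3.55, 2.74),
  paternal.age = c(NA, 27, 37, 46, 24)
)

bw[-1]            # every column except the first
bw[-c(1, 2), ]    # every row except the first two
bw[-1, -4]        # drop row 1 and column 4 in one step

# bw[c(-1, 2)]    # error: positive and negative indices cannot be mixed

# Dropping a column by name: keep the names that are not "paternal.age"
bw[setdiff(names(bw), "paternal.age")]
```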

R will also accept row or column names in quotation marks as a way to subset the data frame.

birthweight[c("maternal.cigarettes", "birthweight")]

Table 3.6: Results of birthweight[c("maternal.cigarettes", "birthweight")].
maternal.cigarettes	birthweight
0	3.23
0	3.03
0	3.35
25	3.55
17	2.74
7	2.51
25	3.53
0	2.92
25	3.18
12	3.00
0	3.42
50	2.75
35	3.19
7	2.78
0	4.32
2	3.64
0	3.77
12	3.32
25	3.41
12	4.57
0	3.63
35	2.66
25	3.86
0	3.11
0	3.93
0	4.55
7	3.14
7	2.37
0	3.15
0	3.20
0	4.10
7	2.05
0	3.27
17	3.32
0	3.94
0	3.65
7	1.92
0	2.65
0	2.90
12	3.59
25	3.87
0	4.07

Finally, vectors of logical (TRUE/FALSE) values can be used to subset data. Rows or columns corresponding to “TRUE” elements will be returned, while rows or columns corresponding to “FALSE” elements will be excluded.

birthweight[c(1,3,5:13), c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)]

Table 3.7: Results of birthweight[c(1,3,5:13), c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)].
	ID	birth.date	location	length	birthweight	head.circumference	weeks.gestation	smoker	maternal.age	maternal.cigarettes	maternal.height	maternal.prepregnant.weight	low.birthweight	geriatric.pregnancy
1	1107	1/25/1967	General	52	3.23	36	38	no	31	0	164	57	0	FALSE
3	1683	2/14/1967	Silver Hill	53	3.35	33	41	no	27	0	164	62	0	FALSE
5	1522	3/13/1967	Memorial	50	2.74	33	39	yes	21	17	156	53	0	FALSE
6	569	3/23/1967	Memorial	50	2.51	35	39	yes	22	7	159	52	1	FALSE
7	365	4/23/1967	Memorial	52	3.53	37	40	yes	26	25	170	62	0	FALSE
8	808	5/5/1967	Silver Hill	48	2.92	33	34	no	26	0	167	64	0	FALSE
9	1369	6/4/1967	Silver Hill	49	3.18	34	38	yes	31	25	162	57	0	FALSE
10	1023	6/7/1967	Memorial	52	3.00	35	38	yes	30	12	165	64	0	FALSE
11	822	6/14/1967	Memorial	50	3.42	35	38	no	20	0	157	48	0	FALSE
12	1272	6/20/1967	Memorial	53	2.75	32	40	yes	37	50	168	61	0	TRUE
13	1262	6/25/1967	Silver Hill	53	3.19	34	41	yes	27	35	163	51	0	FALSE

This is much more useful than it may sound, because the logical vector does not have to be typed by hand: it can be produced by a comparison.

birthweight$length

##  [1] 52 48 53 53 50 50 52 48 49 52 50 53 53 50 53 53 52 49 52 58 54 47 52 48 51
## [26] 56 51 48 53 53 58 46 51 51 54 53 48 43 53 53 50 53

birthweight$length < 50

##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [37]  TRUE  TRUE FALSE FALSE FALSE FALSE

Since the result of the birthweight$length < 50 operation is a vector of TRUE / FALSE values, it can be used to subset the data frame.

birthweight[birthweight$length < 50, c(1,4:12,17,18)]

Table 3.8: Results of birthweight[birthweight$length < 50, c(1,4:12,17,18)].
	ID	length	birthweight	head.circumference	weeks.gestation	smoker	maternal.age	maternal.cigarettes	maternal.height	maternal.prepregnant.weight	low.birthweight	geriatric.pregnancy
2	697	48	3.03	35	39	no	27	0	162	62	0	FALSE
8	808	48	2.92	33	34	no	26	0	167	64	0	FALSE
9	1369	49	3.18	34	38	yes	31	25	162	57	0	FALSE
18	752	49	3.32	36	40	yes	27	12	152	48	0	FALSE
22	516	47	2.66	33	35	yes	20	35	170	57	1	FALSE
24	321	48	3.11	33	37	no	28	0	158	54	0	FALSE
28	1363	48	2.37	30	37	yes	20	7	163	47	1	FALSE
32	300	46	2.05	32	35	yes	41	7	166	57	1	TRUE
37	431	48	1.92	30	33	yes	20	7	161	50	1	FALSE
38	1313	43	2.65	32	33	no	24	0	149	45	1	FALSE
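Logical vectors can also be counted, converted to row numbers, and combined. The sketch below is a minimal illustration on a toy data frame named bw (an assumed name) with columns from the first five rows of Table 3.1. It also shows one pitfall that matters for this data set: a comparison against a column containing NA yields NA, and subsetting with NA produces a row filled with NA values. Wrapping the comparison in which() is one common way to avoid that.

```r
# Toy data frame: columns from the first five rows of Table 3.1
bw <- data.frame(
  ID           = c(1107, 697, 1683, 27, 1522),
  length       = c(52, 48, 53, 53, 50),
  smoker       = c("no", "no", "no", "yes", "yes"),
  paternal.age = c(NA, 27, 37, 46, 24)
)

short <- bw$length < 50
sum(short)      # TRUE counts as 1, so this is the number of matching rows
which(short)    # the row numbers where the condition is TRUE

# Conditions can be combined with & (and) or | (or)
bw[bw$length >= 50 & bw$smoker == "yes", ]

# paternal.age is NA in row 1, so the comparison is NA there
bw$paternal.age > 30
bw[bw$paternal.age > 30, ]          # includes an extra row made of NA values
bw[which(bw$paternal.age > 30), ]   # which() keeps only the TRUE positions
```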

3.3.1 Subsetting a vector

A vector, such as a column of a data frame, can be subset using the [ operator with a single index or a vector of indices.

birthweight$length[1]

## [1] 52

birthweight$length[c(1,2)]

## [1] 52 48
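The other forms of indexing shown for data frames (ranges, negative indices, and logical vectors) work on vectors as well. A short sketch, using a vector named len (an assumed name) that holds the first five values of the length column:

```r
# The first five values of birthweight$length
len <- c(52, 48, 53, 53, 50)

len[2:4]        # a range of positions
len[-1]         # everything except the first value
len[len > 50]   # only the values for which the comparison is TRUE
```

Because a vector has a single dimension, no comma is used inside the brackets.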

3.4 Exercise 1: exploring the data

Use the min(), max(), and mean() functions on subsets of the data frame to answer one or more of the following questions. Work together. Often there will be more than one way to arrive at the answer.

What is the range of paternal ages in the data set?
What is the mean maternal age?
What is the age of the mother with the highest maternal.cigarettes value?
Is the mean pre-pregnant weight higher or lower among women who gave birth to low birth weight children?

What other questions are of interest to you? Can you answer any of them with the functions we have used so far?
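One practical hint: several columns in Table 3.1, including paternal.age, contain NA (missing) values, and min(), max(), and mean() return NA when any input is missing unless na.rm = TRUE is supplied. The sketch below shows this on the first five paternal.age values only, so it does not give away the answers for the full data set. The vector name ages is an assumption made for this illustration.

```r
# The first five values of birthweight$paternal.age, including one NA
ages <- c(NA, 27, 37, 46, 24)

mean(ages)                 # NA, because one value is missing
mean(ages, na.rm = TRUE)   # the mean of the four known values
min(ages, na.rm = TRUE)
max(ages, na.rm = TRUE)
```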