3 Data frames

3.1 Structure of a data frame

What is the birthweight object? In the environment browser, you should see that it is “42 obs. of 18 variables.” It’s probably a data table with 42 rows and 18 columns. We can verify this using the class() and dim() functions.

class(birthweight)
## [1] "data.frame"
dim(birthweight)
## [1] 42 18

A note on formatting: in this documentation, lines beginning with ## are the output of the R code shown. Running dim(birthweight) asked R for the dimensions of the birthweight object; the answer is 42 (rows) x 18 (columns). The “[1]” is not part of the value itself. It is an index that R prints to help you keep track of positions when an operation outputs a large number of values. We will see other examples later that should make this clearer. For now, we can safely ignore the “[1]”.

A data frame organizes data into rows and columns. The object must be “rectangular,” with all rows having the same number of fields, and all values in a column must be of the same type.

Each column of a data frame is a vector. A vector is an ordered collection of values of the same type.
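To make the "one type per column" rule concrete, here is a minimal sketch. The object bw is a stand-in built by hand from the first five rows of Table 3.1 (four columns only); it is an assumption for illustration, not the course data file.

```r
# Stand-in for the first five rows of Table 3.1 (a subset of columns)
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74),
  smoker      = c("no", "no", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

class(bw)   # "data.frame"
dim(bw)     # 5 rows, 4 columns

# One type per column: three numeric columns and one character column
col_types <- sapply(bw, class)
col_types

# Mixing types in a single vector forces a common type (here, character)
mixed <- c(1, "a")
class(mixed)   # "character"
```

The last two lines show why a column cannot hold both numbers and text: R converts everything in the vector to one type.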

Let’s take a look at the contents.

birthweight
Table 3.1: The birthweight data set.
ID birth.date location length birthweight head.circumference weeks.gestation smoker maternal.age maternal.cigarettes maternal.height maternal.prepregnant.weight paternal.age paternal.education paternal.cigarettes paternal.height low.birthweight geriatric.pregnancy
1107 1/25/1967 General 52 3.23 36 38 no 31 0 164 57 NA NA NA NA 0 FALSE
697 2/6/1967 Silver Hill 48 3.03 35 39 no 27 0 162 62 27 14 0 178 0 FALSE
1683 2/14/1967 Silver Hill 53 3.35 33 41 no 27 0 164 62 37 14 0 170 0 FALSE
27 3/9/1967 Silver Hill 53 3.55 37 41 yes 37 25 161 66 46 NA 0 175 0 TRUE
1522 3/13/1967 Memorial 50 2.74 33 39 yes 21 17 156 53 24 12 7 179 0 FALSE
569 3/23/1967 Memorial 50 2.51 35 39 yes 22 7 159 52 23 14 25 NA 1 FALSE
365 4/23/1967 Memorial 52 3.53 37 40 yes 26 25 170 62 30 10 25 181 0 FALSE
808 5/5/1967 Silver Hill 48 2.92 33 34 no 26 0 167 64 25 12 25 175 0 FALSE
1369 6/4/1967 Silver Hill 49 3.18 34 38 yes 31 25 162 57 32 16 50 194 0 FALSE
1023 6/7/1967 Memorial 52 3.00 35 38 yes 30 12 165 64 38 14 50 180 0 FALSE
822 6/14/1967 Memorial 50 3.42 35 38 no 20 0 157 48 22 14 0 179 0 FALSE
1272 6/20/1967 Memorial 53 2.75 32 40 yes 37 50 168 61 31 16 0 173 0 TRUE
1262 6/25/1967 Silver Hill 53 3.19 34 41 yes 27 35 163 51 31 16 25 185 0 FALSE
575 7/12/1967 Memorial 50 2.78 30 37 yes 19 7 165 60 20 14 0 183 0 FALSE
1016 7/13/1967 Silver Hill 53 4.32 36 40 no 19 0 171 62 19 12 0 183 0 FALSE
792 9/7/1967 Memorial 53 3.64 38 40 yes 20 2 170 59 24 12 12 185 0 FALSE
820 10/7/1967 General 52 3.77 34 40 no 24 0 157 50 31 16 0 173 0 FALSE
752 10/19/1967 General 49 3.32 36 40 yes 27 12 152 48 37 12 25 170 0 FALSE
619 11/1/1967 Memorial 52 3.41 33 39 yes 23 25 181 69 23 16 2 181 0 FALSE
1764 12/7/1967 Silver Hill 58 4.57 39 41 yes 32 12 173 70 38 14 25 180 0 FALSE
1081 12/14/1967 Silver Hill 54 3.63 38 38 no 18 0 172 50 20 12 7 172 0 FALSE
516 1/8/1968 Silver Hill 47 2.66 33 35 yes 20 35 170 57 23 12 50 186 1 FALSE
272 1/10/1968 Memorial 52 3.86 36 39 yes 30 25 170 78 40 16 50 178 0 FALSE
321 1/21/1968 Silver Hill 48 3.11 33 37 no 28 0 158 54 39 10 0 171 0 FALSE
1636 2/2/1968 Silver Hill 51 3.93 38 38 no 29 0 165 61 NA NA NA NA 0 FALSE
1360 2/16/1968 General 56 4.55 34 44 no 20 0 162 57 23 10 35 179 0 FALSE
1388 2/22/1968 Memorial 51 3.14 33 41 yes 22 7 160 53 24 16 12 176 0 FALSE
1363 4/2/1968 General 48 2.37 30 37 yes 20 7 163 47 20 10 35 185 1 FALSE
1058 4/24/1968 Silver Hill 53 3.15 34 40 no 29 0 167 60 30 16 NA 182 0 FALSE
755 4/25/1968 Memorial 53 3.20 33 41 no 21 0 155 55 25 14 25 183 0 FALSE
462 6/19/1968 Silver Hill 58 4.10 39 41 no 35 0 172 58 31 16 25 185 0 TRUE
300 7/18/1968 Silver Hill 46 2.05 32 35 yes 41 7 166 57 37 14 25 173 1 TRUE
1088 7/24/1968 General 51 3.27 36 40 no 24 0 168 53 29 16 0 181 0 FALSE
57 8/12/1968 Memorial 51 3.32 38 39 yes 23 17 157 48 NA NA NA NA 0 FALSE
553 8/17/1968 Silver Hill 54 3.94 37 42 no 24 0 175 66 30 12 0 184 0 FALSE
1191 9/7/1968 General 53 3.65 33 42 no 21 0 165 61 21 10 25 185 0 FALSE
431 9/16/1968 Silver Hill 48 1.92 30 33 yes 20 7 161 50 20 10 35 180 1 FALSE
1313 9/27/1968 Silver Hill 43 2.65 32 33 no 24 0 149 45 26 16 0 169 1 FALSE
1600 10/9/1968 General 53 2.90 34 39 no 19 0 165 57 NA NA NA NA 0 FALSE
532 10/25/1968 General 53 3.59 34 40 yes 31 12 163 49 41 12 50 191 0 FALSE
223 12/11/1968 General 50 3.87 33 45 yes 28 25 163 54 30 16 0 183 0 FALSE
1187 12/19/1968 Silver Hill 53 4.07 38 44 no 20 0 174 68 26 14 25 189 0 FALSE

The data frame format should look familiar. It’s a lot like a spreadsheet.

Generally, we don’t want to operate on the entire data frame. For example, to calculate the mean birth weight, we don’t need the information in the “paternal.education” column.

There are three ways to have R subset the data frame: $, [[, and [.

3.2 Selecting a single column using the $ and [[ operators

The simplest way to get all the values in the “birthweight” column is with the $ operator.

birthweight$birthweight
##  [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07

Notice that there are now three numbers inside brackets: one at the beginning of each line of output. These are the indices (locations) of the following number within the output vector. They give us a general idea of the length of the vector, and allow us to determine the value of a particular observation at a glance. For example, we can answer the question “what was the birth weight of the 34th baby?” The third line starts at index 31, so the 34th value is the fourth number on that line: 3.32.

Once the vector of birth weights has been extracted from the rest of the data frame, it can be used to calculate a mean.

mean(birthweight$birthweight)
## [1] 3.312857

This $ operator is a shortcut for the [[ subsetting operator, which requires typing five additional characters (two pairs of square brackets and a pair of quotation marks in place of the single $). When given a full column name, they function in the same way, returning the value of the element named.

birthweight[["birthweight"]]
##  [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07
mean(birthweight[["birthweight"]])
## [1] 3.312857
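A quick way to check that the two operators return the same thing is identical(). The sketch below uses a small hand-built stand-in (bw, the first five rows of Table 3.1), so the mean differs from the full data set.

```r
# Stand-in for the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

by_dollar  <- bw$birthweight
by_bracket <- bw[["birthweight"]]

# Both are the same numeric vector
identical(by_dollar, by_bracket)   # TRUE

# So either form can be passed to mean()
mean(by_dollar)
mean(by_bracket)
```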

One difference to note is that while [[ works with the index, or column number, $ does not.

# which column contains the birth weight?
# lines beginning with a '#' are comments, and are not executed by R
colnames(birthweight)
##  [1] "ID"                          "birth.date"                 
##  [3] "location"                    "length"                     
##  [5] "birthweight"                 "head.circumference"         
##  [7] "weeks.gestation"             "smoker"                     
##  [9] "maternal.age"                "maternal.cigarettes"        
## [11] "maternal.height"             "maternal.prepregnant.weight"
## [13] "paternal.age"                "paternal.education"         
## [15] "paternal.cigarettes"         "paternal.height"            
## [17] "low.birthweight"             "geriatric.pregnancy"
birthweight[[5]]
##  [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07
mean(birthweight[[5]])
## [1] 3.312857
# the $ operator can't take an index: the next line is a syntax error, so R reports an error instead of returning a column
birthweight$5
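The difference between the two operators can be sketched on a small stand-in data frame (bw, built by hand from the first five rows of Table 3.1). The variable col_name is hypothetical. The tryCatch() wrapper is only there so the script keeps running after the syntax error.

```r
# Stand-in for the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

# [[ accepts a position ...
third <- bw[[3]]

# ... or a name stored in a variable
col_name <- "birthweight"
from_var <- bw[[col_name]]

# $ takes the name literally: it looks for a column called "col_name"
from_dollar <- bw$col_name   # NULL, no such column

# A number after $ is not valid R syntax, so the code cannot even be parsed
parse_result <- tryCatch(
  parse(text = "bw$3"),
  error = function(e) "syntax error"
)
parse_result
```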

3.3 Selecting a subset of the data frame using the [ operator

Unlike $ and [[, which return the value(s) contained in the specified element, [ returns an object of the same type it is used to subset. Using [ to retrieve the fifth column will return a data frame with 42 rows and 1 column. This may not seem like a big difference, but it can be an important distinction in some cases.

birthweight[5]
Table 3.2: Results of birthweight[5].
birthweight
3.23
3.03
3.35
3.55
2.74
2.51
3.53
2.92
3.18
3.00
3.42
2.75
3.19
2.78
4.32
3.64
3.77
3.32
3.41
4.57
3.63
2.66
3.86
3.11
3.93
4.55
3.14
2.37
3.15
3.20
4.10
2.05
3.27
3.32
3.94
3.65
1.92
2.65
2.90
3.59
3.87
4.07
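One case where the distinction matters is passing the result to a function that expects a vector. A minimal sketch, using a hand-built stand-in (bw, the first five rows of Table 3.1):

```r
# Stand-in for the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

as_df  <- bw[3]     # a data frame with one column
as_vec <- bw[[3]]   # a numeric vector

class(as_df)    # "data.frame"
class(as_vec)   # "numeric"
dim(as_df)      # 5 rows, 1 column

# mean() works on the vector ...
mean(as_vec)

# ... but returns NA (with a warning) for the one-column data frame
not_a_mean <- suppressWarnings(mean(as_df))
not_a_mean
```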

Because the [ operator returns a new data frame rather than a single column's values, it can be used to select multiple rows and/or columns at once.

birthweight[c(1,5)]
Table 3.3: Results of birthweight[c(1,5)].
ID birthweight
1107 3.23
697 3.03
1683 3.35
27 3.55
1522 2.74
569 2.51
365 3.53
808 2.92
1369 3.18
1023 3.00
822 3.42
1272 2.75
1262 3.19
575 2.78
1016 4.32
792 3.64
820 3.77
752 3.32
619 3.41
1764 4.57
1081 3.63
516 2.66
272 3.86
321 3.11
1636 3.93
1360 4.55
1388 3.14
1363 2.37
1058 3.15
755 3.20
462 4.10
300 2.05
1088 3.27
57 3.32
553 3.94
1191 3.65
431 1.92
1313 2.65
1600 2.90
532 3.59
223 3.87
1187 4.07

The c() function creates a vector. This allows R to treat indices 1 and 5 as a single argument. This is critical, because birthweight[1,5] does not produce the same effect at all.

birthweight[1, 5]
## [1] 3.23

What happened?

When there are two arguments provided to [, R interprets these as the index on the first (row) and second (column) dimension of the object. The value returned is the content of the first row, fifth column: the birth weight of individual 1107.

The default behavior of [ is to return the entire object. The first argument acts as a sort of filter on the first dimension, the second argument as a filter on the second dimension, and so on. Leaving the space before the comma blank will return all rows (no filter applied), while leaving the space following the comma blank will return all columns. Be sure to try variations on the example code below to see what happens.

birthweight[c(2,7,29), c(1,5)]
Table 3.4: Results of birthweight[c(2,7,29), c(1,5)].
ID birthweight
2 697 3.03
7 365 3.53
29 1058 3.15
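Some of the variations worth trying can be sketched as follows, again on a hand-built stand-in (bw, the first five rows of Table 3.1). The drop = FALSE argument is part of base R's [ for data frames. It is not used elsewhere in this chapter.

```r
# Stand-in for the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

# Blank after the comma: row 2, all columns
one_row <- bw[2, ]

# Blank before the comma: all rows, columns 1 and 3
two_cols <- bw[, c(1, 3)]

# A single column selected this way is simplified to a vector ...
col_vec <- bw[, 3]

# ... unless we ask R to keep the data frame structure
col_df <- bw[, 3, drop = FALSE]

dim(one_row)    # 1 row, 3 columns
dim(two_cols)   # 5 rows, 2 columns
class(col_vec)  # "numeric"
class(col_df)   # "data.frame"
```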

Using a minus sign before an index or group of indices will exclude the specified rows / columns.

colnames(birthweight)
##  [1] "ID"                          "birth.date"                 
##  [3] "location"                    "length"                     
##  [5] "birthweight"                 "head.circumference"         
##  [7] "weeks.gestation"             "smoker"                     
##  [9] "maternal.age"                "maternal.cigarettes"        
## [11] "maternal.height"             "maternal.prepregnant.weight"
## [13] "paternal.age"                "paternal.education"         
## [15] "paternal.cigarettes"         "paternal.height"            
## [17] "low.birthweight"             "geriatric.pregnancy"
# exclude paternal data (columns 13-16)
birthweight[c(1,3,5:13), -c(13:16)]
Table 3.5: Results of birthweight[c(1,3,5:13), -c(13:16)].
ID birth.date location length birthweight head.circumference weeks.gestation smoker maternal.age maternal.cigarettes maternal.height maternal.prepregnant.weight low.birthweight geriatric.pregnancy
1 1107 1/25/1967 General 52 3.23 36 38 no 31 0 164 57 0 FALSE
3 1683 2/14/1967 Silver Hill 53 3.35 33 41 no 27 0 164 62 0 FALSE
5 1522 3/13/1967 Memorial 50 2.74 33 39 yes 21 17 156 53 0 FALSE
6 569 3/23/1967 Memorial 50 2.51 35 39 yes 22 7 159 52 1 FALSE
7 365 4/23/1967 Memorial 52 3.53 37 40 yes 26 25 170 62 0 FALSE
8 808 5/5/1967 Silver Hill 48 2.92 33 34 no 26 0 167 64 0 FALSE
9 1369 6/4/1967 Silver Hill 49 3.18 34 38 yes 31 25 162 57 0 FALSE
10 1023 6/7/1967 Memorial 52 3.00 35 38 yes 30 12 165 64 0 FALSE
11 822 6/14/1967 Memorial 50 3.42 35 38 no 20 0 157 48 0 FALSE
12 1272 6/20/1967 Memorial 53 2.75 32 40 yes 37 50 168 61 0 TRUE
13 1262 6/25/1967 Silver Hill 53 3.19 34 41 yes 27 35 163 51 0 FALSE
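A short sketch of negative indexing on a hand-built stand-in (bw, the first five rows of Table 3.1). It also shows why the parentheses (or c()) around the range matter: the minus sign is applied before the colon operator, so -2:3 is not the same as -(2:3).

```r
# Stand-in for the first five rows of Table 3.1
bw <- data.frame(
  ID          = c(1107, 697, 1683, 27, 1522),
  length      = c(52, 48, 53, 53, 50),
  birthweight = c(3.23, 3.03, 3.35, 3.55, 2.74)
)

# Drop the first column
no_id <- bw[-1]
colnames(no_id)   # "length" "birthweight"

# Drop the first two rows, keep all columns
later_rows <- bw[-c(1, 2), ]
nrow(later_rows)  # 3

# Operator precedence: the minus sign binds more tightly than the colon
excluded <- -(2:3)   # -2 -3: excludes positions 2 and 3
sequence <- -2:3     # -2 -1 0 1 2 3: a sequence from -2 to 3
```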

R will also accept row or column names in quotation marks as a way to subset the data frame.

birthweight[c("maternal.cigarettes", "birthweight")]
Table 3.6: Results of birthweight[c("maternal.cigarettes", "birthweight")].
maternal.cigarettes birthweight
0 3.23
0 3.03
0 3.35
25 3.55
17 2.74
7 2.51
25 3.53
0 2.92
25 3.18
12 3.00
0 3.42
50 2.75
35 3.19
7 2.78
0 4.32
2 3.64
0 3.77
12 3.32
25 3.41
12 4.57
0 3.63
35 2.66
25 3.86
0 3.11
0 3.93
0 4.55
7 3.14
7 2.37
0 3.15
0 3.20
0 4.10
7 2.05
0 3.27
17 3.32
0 3.94
0 3.65
7 1.92
0 2.65
0 2.90
12 3.59
25 3.87
0 4.07

Finally, vectors of logical (TRUE/FALSE) values can be used to subset data. Rows or columns corresponding to “TRUE” elements will be returned, while rows or columns corresponding to “FALSE” elements will be excluded.

birthweight[c(1,3,5:13), c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)]
Table 3.7: Results of birthweight[c(1,3,5:13), c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)].
ID birth.date location length birthweight head.circumference weeks.gestation smoker maternal.age maternal.cigarettes maternal.height maternal.prepregnant.weight low.birthweight geriatric.pregnancy
1 1107 1/25/1967 General 52 3.23 36 38 no 31 0 164 57 0 FALSE
3 1683 2/14/1967 Silver Hill 53 3.35 33 41 no 27 0 164 62 0 FALSE
5 1522 3/13/1967 Memorial 50 2.74 33 39 yes 21 17 156 53 0 FALSE
6 569 3/23/1967 Memorial 50 2.51 35 39 yes 22 7 159 52 1 FALSE
7 365 4/23/1967 Memorial 52 3.53 37 40 yes 26 25 170 62 0 FALSE
8 808 5/5/1967 Silver Hill 48 2.92 33 34 no 26 0 167 64 0 FALSE
9 1369 6/4/1967 Silver Hill 49 3.18 34 38 yes 31 25 162 57 0 FALSE
10 1023 6/7/1967 Memorial 52 3.00 35 38 yes 30 12 165 64 0 FALSE
11 822 6/14/1967 Memorial 50 3.42 35 38 no 20 0 157 48 0 FALSE
12 1272 6/20/1967 Memorial 53 2.75 32 40 yes 37 50 168 61 0 TRUE
13 1262 6/25/1967 Silver Hill 53 3.19 34 41 yes 27 35 163 51 0 FALSE
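Typing out eighteen TRUE/FALSE values by hand is error-prone. One possible alternative is to let R build the logical vector from the column names. The sketch below assumes a hand-built stand-in (bw, three rows and five columns taken from Table 3.1) and uses the base R function grepl(), which is not introduced in this chapter.

```r
# Stand-in: first three rows of Table 3.1, five of the eighteen columns
bw <- data.frame(
  ID              = c(1107, 697, 1683),
  birthweight     = c(3.23, 3.03, 3.35),
  paternal.age    = c(NA, 27, 37),
  paternal.height = c(NA, 178, 170),
  low.birthweight = c(0, 0, 0)
)

# TRUE for every column whose name starts with "paternal"
is_paternal <- grepl("^paternal", colnames(bw))
is_paternal

# Negate with ! to keep everything except the paternal columns
no_paternal <- bw[, !is_paternal]
colnames(no_paternal)
```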

This is much more useful than it may sound.

birthweight$length
##  [1] 52 48 53 53 50 50 52 48 49 52 50 53 53 50 53 53 52 49 52 58 54 47 52 48 51
## [26] 56 51 48 53 53 58 46 51 51 54 53 48 43 53 53 50 53
birthweight$length < 50
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [37]  TRUE  TRUE FALSE FALSE FALSE FALSE

Since the result of the birthweight$length < 50 operation is a vector of TRUE / FALSE values, it can be used to subset the data frame.

birthweight[birthweight$length < 50, c(1,4:12,17,18)]
Table 3.8: Results of birthweight[birthweight$length < 50, c(1,4:12,17,18)].
ID length birthweight head.circumference weeks.gestation smoker maternal.age maternal.cigarettes maternal.height maternal.prepregnant.weight low.birthweight geriatric.pregnancy
2 697 48 3.03 35 39 no 27 0 162 62 0 FALSE
8 808 48 2.92 33 34 no 26 0 167 64 0 FALSE
9 1369 49 3.18 34 38 yes 31 25 162 57 0 FALSE
18 752 49 3.32 36 40 yes 27 12 152 48 0 FALSE
22 516 47 2.66 33 35 yes 20 35 170 57 1 FALSE
24 321 48 3.11 33 37 no 28 0 158 54 0 FALSE
28 1363 48 2.37 30 37 yes 20 7 163 47 1 FALSE
32 300 46 2.05 32 35 yes 41 7 166 57 1 TRUE
37 431 48 1.92 30 33 yes 20 7 161 50 1 FALSE
38 1313 43 2.65 32 33 no 24 0 149 45 1 FALSE
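Logical vectors can also be counted and combined. The following is a small sketch on a hand-built stand-in (bw, the first eight rows of Table 3.1, three columns). The & operator and sum() are base R but have not been introduced in the text above.

```r
# Stand-in for the first eight rows of Table 3.1
bw <- data.frame(
  ID     = c(1107, 697, 1683, 27, 1522, 569, 365, 808),
  length = c(52, 48, 53, 53, 50, 50, 52, 48),
  smoker = c("no", "no", "no", "yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# A logical vector, one element per row
short <- bw$length < 50

# TRUE counts as 1, so sum() gives the number of matching rows
n_short <- sum(short)

# The same vector can subset a single column
short_ids <- bw$ID[short]

# Two conditions combined with & (both must be TRUE)
long_smokers <- bw[bw$length > 50 & bw$smoker == "yes", ]
long_smokers
```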

3.3.1 Subsetting a vector

A vector, such as a column of a data frame, can be subset using the [ operator with a single index or with another vector of indices.

birthweight$length[1]
## [1] 52
birthweight$length[c(1,2)]
## [1] 52 48
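The other index types described for data frames work on vectors too. A minimal sketch, using the first five length values from Table 3.1 as a stand-alone vector (len is a hypothetical name):

```r
# First five values of the length column in Table 3.1
len <- c(52, 48, 53, 53, 50)

# A range of positions
middle <- len[2:3]        # 48 53

# A negative index excludes a position
without_first <- len[-1]  # 48 53 53 50

# A logical vector keeps the TRUE positions
under_50 <- len[len < 50] # 48
```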

3.4 Exercise 1: exploring the data

Use the min(), max(), and mean() functions on subsets of the data frame to answer one or more of the following questions. Work together. Often there will be more than one way to arrive at the answer.

  • What is the range of paternal ages in the data set?
  • What is the mean maternal age?
  • What is the age of the mother with the highest maternal.cigarettes value?
  • Is the mean pre-pregnant weight higher or lower among women who gave birth to low birth weight children?

What other questions are of interest to you? Can you answer any of them with the functions we have used so far?
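A hint that may help: some columns in Table 3.1 contain missing values (NA), and min(), max(), and mean() return NA when any input is missing unless told otherwise. The sketch below uses only the first five paternal.age values from Table 3.1, so its results are not the answers to the exercise. The na.rm argument is part of base R.

```r
# First five paternal.age values from Table 3.1 (the first is missing)
ages <- c(NA, 27, 37, 46, 24)

# With a missing value present, the result is NA
mean(ages)

# na.rm = TRUE removes missing values before calculating
mean_age <- mean(ages, na.rm = TRUE)
min_age  <- min(ages, na.rm = TRUE)
max_age  <- max(ages, na.rm = TRUE)

c(min_age, max_age, mean_age)
```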