3 Data frames
3.1 Structure of a data frame
What is the birthweight object? In the environment browser, you should see that it is “42 obs. of 18 variables.” It’s probably a data table with 42 rows and 18 columns. We can verify this using the class() and dim() functions.
## [1] "data.frame"
## [1] 42 18
A note on formatting: in this documentation, lines beginning with ## are the output of the R code shown. Running dim(birthweight) asked R what the dimensions of the birthweight object are; the answer is 42 (rows) x 18 (columns). The “[1]” is not part of the value returned. It is an index added by R to help you keep track of the values when an operation outputs a large number of values. We will see other examples later that will hopefully make this clearer. For now, we can safely ignore that “[1]”.
A data frame organizes data into rows and columns. The object must be “rectangular,” with all rows having the same number of fields, and all values in a column must be of the same type.
Each column of a data frame is a vector. A vector is an ordered collection of values of the same type.
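As a minimal sketch of these ideas, the code below builds a small stand-in data frame from the first three rows of the birthweight table (only four of the 18 columns) and inspects its structure. The name `bw_small` is an assumption used for illustration; with the real data you would call the same functions on `birthweight`.

```r
# Stand-in for the full data: first three rows, four columns (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683),
  location    = c("General", "Silver Hill", "Silver Hill"),
  birthweight = c(3.23, 3.03, 3.35),
  smoker      = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

class(bw_small)   # "data.frame"
dim(bw_small)     # 3 rows, 4 columns

# Each column is a vector whose values all share one type
sapply(bw_small, class)
```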
Let’s take a look at the contents.
ID | birth.date | location | length | birthweight | head.circumference | weeks.gestation | smoker | maternal.age | maternal.cigarettes | maternal.height | maternal.prepregnant.weight | paternal.age | paternal.education | paternal.cigarettes | paternal.height | low.birthweight | geriatric.pregnancy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1107 | 1/25/1967 | General | 52 | 3.23 | 36 | 38 | no | 31 | 0 | 164 | 57 | NA | NA | NA | NA | 0 | FALSE |
697 | 2/6/1967 | Silver Hill | 48 | 3.03 | 35 | 39 | no | 27 | 0 | 162 | 62 | 27 | 14 | 0 | 178 | 0 | FALSE |
1683 | 2/14/1967 | Silver Hill | 53 | 3.35 | 33 | 41 | no | 27 | 0 | 164 | 62 | 37 | 14 | 0 | 170 | 0 | FALSE |
27 | 3/9/1967 | Silver Hill | 53 | 3.55 | 37 | 41 | yes | 37 | 25 | 161 | 66 | 46 | NA | 0 | 175 | 0 | TRUE |
1522 | 3/13/1967 | Memorial | 50 | 2.74 | 33 | 39 | yes | 21 | 17 | 156 | 53 | 24 | 12 | 7 | 179 | 0 | FALSE |
569 | 3/23/1967 | Memorial | 50 | 2.51 | 35 | 39 | yes | 22 | 7 | 159 | 52 | 23 | 14 | 25 | NA | 1 | FALSE |
365 | 4/23/1967 | Memorial | 52 | 3.53 | 37 | 40 | yes | 26 | 25 | 170 | 62 | 30 | 10 | 25 | 181 | 0 | FALSE |
808 | 5/5/1967 | Silver Hill | 48 | 2.92 | 33 | 34 | no | 26 | 0 | 167 | 64 | 25 | 12 | 25 | 175 | 0 | FALSE |
1369 | 6/4/1967 | Silver Hill | 49 | 3.18 | 34 | 38 | yes | 31 | 25 | 162 | 57 | 32 | 16 | 50 | 194 | 0 | FALSE |
1023 | 6/7/1967 | Memorial | 52 | 3.00 | 35 | 38 | yes | 30 | 12 | 165 | 64 | 38 | 14 | 50 | 180 | 0 | FALSE |
822 | 6/14/1967 | Memorial | 50 | 3.42 | 35 | 38 | no | 20 | 0 | 157 | 48 | 22 | 14 | 0 | 179 | 0 | FALSE |
1272 | 6/20/1967 | Memorial | 53 | 2.75 | 32 | 40 | yes | 37 | 50 | 168 | 61 | 31 | 16 | 0 | 173 | 0 | TRUE |
1262 | 6/25/1967 | Silver Hill | 53 | 3.19 | 34 | 41 | yes | 27 | 35 | 163 | 51 | 31 | 16 | 25 | 185 | 0 | FALSE |
575 | 7/12/1967 | Memorial | 50 | 2.78 | 30 | 37 | yes | 19 | 7 | 165 | 60 | 20 | 14 | 0 | 183 | 0 | FALSE |
1016 | 7/13/1967 | Silver Hill | 53 | 4.32 | 36 | 40 | no | 19 | 0 | 171 | 62 | 19 | 12 | 0 | 183 | 0 | FALSE |
792 | 9/7/1967 | Memorial | 53 | 3.64 | 38 | 40 | yes | 20 | 2 | 170 | 59 | 24 | 12 | 12 | 185 | 0 | FALSE |
820 | 10/7/1967 | General | 52 | 3.77 | 34 | 40 | no | 24 | 0 | 157 | 50 | 31 | 16 | 0 | 173 | 0 | FALSE |
752 | 10/19/1967 | General | 49 | 3.32 | 36 | 40 | yes | 27 | 12 | 152 | 48 | 37 | 12 | 25 | 170 | 0 | FALSE |
619 | 11/1/1967 | Memorial | 52 | 3.41 | 33 | 39 | yes | 23 | 25 | 181 | 69 | 23 | 16 | 2 | 181 | 0 | FALSE |
1764 | 12/7/1967 | Silver Hill | 58 | 4.57 | 39 | 41 | yes | 32 | 12 | 173 | 70 | 38 | 14 | 25 | 180 | 0 | FALSE |
1081 | 12/14/1967 | Silver Hill | 54 | 3.63 | 38 | 38 | no | 18 | 0 | 172 | 50 | 20 | 12 | 7 | 172 | 0 | FALSE |
516 | 1/8/1968 | Silver Hill | 47 | 2.66 | 33 | 35 | yes | 20 | 35 | 170 | 57 | 23 | 12 | 50 | 186 | 1 | FALSE |
272 | 1/10/1968 | Memorial | 52 | 3.86 | 36 | 39 | yes | 30 | 25 | 170 | 78 | 40 | 16 | 50 | 178 | 0 | FALSE |
321 | 1/21/1968 | Silver Hill | 48 | 3.11 | 33 | 37 | no | 28 | 0 | 158 | 54 | 39 | 10 | 0 | 171 | 0 | FALSE |
1636 | 2/2/1968 | Silver Hill | 51 | 3.93 | 38 | 38 | no | 29 | 0 | 165 | 61 | NA | NA | NA | NA | 0 | FALSE |
1360 | 2/16/1968 | General | 56 | 4.55 | 34 | 44 | no | 20 | 0 | 162 | 57 | 23 | 10 | 35 | 179 | 0 | FALSE |
1388 | 2/22/1968 | Memorial | 51 | 3.14 | 33 | 41 | yes | 22 | 7 | 160 | 53 | 24 | 16 | 12 | 176 | 0 | FALSE |
1363 | 4/2/1968 | General | 48 | 2.37 | 30 | 37 | yes | 20 | 7 | 163 | 47 | 20 | 10 | 35 | 185 | 1 | FALSE |
1058 | 4/24/1968 | Silver Hill | 53 | 3.15 | 34 | 40 | no | 29 | 0 | 167 | 60 | 30 | 16 | NA | 182 | 0 | FALSE |
755 | 4/25/1968 | Memorial | 53 | 3.20 | 33 | 41 | no | 21 | 0 | 155 | 55 | 25 | 14 | 25 | 183 | 0 | FALSE |
462 | 6/19/1968 | Silver Hill | 58 | 4.10 | 39 | 41 | no | 35 | 0 | 172 | 58 | 31 | 16 | 25 | 185 | 0 | TRUE |
300 | 7/18/1968 | Silver Hill | 46 | 2.05 | 32 | 35 | yes | 41 | 7 | 166 | 57 | 37 | 14 | 25 | 173 | 1 | TRUE |
1088 | 7/24/1968 | General | 51 | 3.27 | 36 | 40 | no | 24 | 0 | 168 | 53 | 29 | 16 | 0 | 181 | 0 | FALSE |
57 | 8/12/1968 | Memorial | 51 | 3.32 | 38 | 39 | yes | 23 | 17 | 157 | 48 | NA | NA | NA | NA | 0 | FALSE |
553 | 8/17/1968 | Silver Hill | 54 | 3.94 | 37 | 42 | no | 24 | 0 | 175 | 66 | 30 | 12 | 0 | 184 | 0 | FALSE |
1191 | 9/7/1968 | General | 53 | 3.65 | 33 | 42 | no | 21 | 0 | 165 | 61 | 21 | 10 | 25 | 185 | 0 | FALSE |
431 | 9/16/1968 | Silver Hill | 48 | 1.92 | 30 | 33 | yes | 20 | 7 | 161 | 50 | 20 | 10 | 35 | 180 | 1 | FALSE |
1313 | 9/27/1968 | Silver Hill | 43 | 2.65 | 32 | 33 | no | 24 | 0 | 149 | 45 | 26 | 16 | 0 | 169 | 1 | FALSE |
1600 | 10/9/1968 | General | 53 | 2.90 | 34 | 39 | no | 19 | 0 | 165 | 57 | NA | NA | NA | NA | 0 | FALSE |
532 | 10/25/1968 | General | 53 | 3.59 | 34 | 40 | yes | 31 | 12 | 163 | 49 | 41 | 12 | 50 | 191 | 0 | FALSE |
223 | 12/11/1968 | General | 50 | 3.87 | 33 | 45 | yes | 28 | 25 | 163 | 54 | 30 | 16 | 0 | 183 | 0 | FALSE |
1187 | 12/19/1968 | Silver Hill | 53 | 4.07 | 38 | 44 | no | 20 | 0 | 174 | 68 | 26 | 14 | 25 | 189 | 0 | FALSE |
The data frame format should look familiar. It’s a lot like a spreadsheet.
Generally, we don’t want to operate on the entire data frame. For example, to calculate the mean birth weight, we don’t need the information in the “paternal.education” column.
There are three ways to have R subset the data frame: $, [[, and [.
3.2 Selecting a single column using the $ and [[ operators
The simplest way to get all the values in the “birthweight” column is with the $ operator.
## [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07
Notice that there are now three numbers inside brackets: one at the beginning of each line of output. These are the indices (locations) of the following number within the output vector. They give us a general idea of the length of the vector, and allow us to determine the value of a particular observation at a glance. For example, we can answer the question “what was the birth weight of the 34th baby?” The third line of output starts at index 31, so the 34th value is the fourth number on that line: 3.32.
Once the vector of birth weights has been extracted from the rest of the data frame, it can be used to calculate a mean.
## [1] 3.312857
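A minimal sketch of this step, assuming a stand-in data frame `bw` (a hypothetical name) that holds only the birth weight column, with values copied from the output above. With the full data, the equivalent calls would be `birthweight$birthweight` and `mean(birthweight$birthweight)`.

```r
# Stand-in data frame with just the birthweight column (42 values from the output above)
bw <- data.frame(birthweight = c(
  3.23, 3.03, 3.35, 3.55, 2.74, 2.51, 3.53, 2.92, 3.18, 3.00, 3.42, 2.75, 3.19, 2.78, 4.32,
  3.64, 3.77, 3.32, 3.41, 4.57, 3.63, 2.66, 3.86, 3.11, 3.93, 4.55, 3.14, 2.37, 3.15, 3.20,
  4.10, 2.05, 3.27, 3.32, 3.94, 3.65, 1.92, 2.65, 2.90, 3.59, 3.87, 4.07
))

weights <- bw$birthweight   # $ extracts the column as a numeric vector
length(weights)             # 42
weights[34]                 # birth weight of the 34th baby: 3.32
mean(weights)               # about 3.312857
```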
This $ operator is a shortcut for the [[ subsetting operator, which requires typing five additional characters (two pairs of square brackets and a pair of quotation marks in place of the $). They function in the same way, returning the value of the element named.
## [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07
## [1] 3.312857
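To illustrate the equivalence, here is a small sketch on a three-row stand-in (`bw_small` is a hypothetical name; the values are the first three rows of the table). On the full data the corresponding call would be `birthweight[["birthweight"]]`.

```r
# Three-row stand-in for the full data frame (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683),
  birthweight = c(3.23, 3.03, 3.35)
)

by_dollar  <- bw_small$birthweight        # shortcut form
by_bracket <- bw_small[["birthweight"]]   # long form, column name in quotation marks

identical(by_dollar, by_bracket)          # TRUE: both return the same numeric vector
mean(by_bracket)
```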
One difference to note is that while [[ works with the index, or column number, $ does not.
# which column contains the birth weight?
# lines beginning with a '#' are comments, and are not executed by R
colnames(birthweight)
## [1] "ID" "birth.date"
## [3] "location" "length"
## [5] "birthweight" "head.circumference"
## [7] "weeks.gestation" "smoker"
## [9] "maternal.age" "maternal.cigarettes"
## [11] "maternal.height" "maternal.prepregnant.weight"
## [13] "paternal.age" "paternal.education"
## [15] "paternal.cigarettes" "paternal.height"
## [17] "low.birthweight" "geriatric.pregnancy"
## [1] 3.23 3.03 3.35 3.55 2.74 2.51 3.53 2.92 3.18 3.00 3.42 2.75 3.19 2.78 4.32
## [16] 3.64 3.77 3.32 3.41 4.57 3.63 2.66 3.86 3.11 3.93 4.55 3.14 2.37 3.15 3.20
## [31] 4.10 2.05 3.27 3.32 3.94 3.65 1.92 2.65 2.90 3.59 3.87 4.07
## [1] 3.312857
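The output above is consistent with calls such as `birthweight[[5]]` and `mean(birthweight[[5]])`, since "birthweight" is the fifth column name. A minimal sketch on a stand-in with the first five columns of the first three rows (`bw_small` and `col_index` are hypothetical names):

```r
# Stand-in with the first five columns of the first three rows (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683),
  birth.date  = c("1/25/1967", "2/6/1967", "2/14/1967"),
  location    = c("General", "Silver Hill", "Silver Hill"),
  length      = c(52, 48, 53),
  birthweight = c(3.23, 3.03, 3.35),
  stringsAsFactors = FALSE
)

# Look up the position of the column instead of counting by hand
col_index <- which(colnames(bw_small) == "birthweight")
col_index                  # 5

bw_small[[col_index]]      # same vector as bw_small$birthweight
mean(bw_small[[5]])
```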
3.3 Selecting a subset of the data frame using the [ operator
Unlike $ and [[, which return the value(s) contained in the specified element, [ returns an object of the same type it is used to subset. Using [ to retrieve the fifth column will return a data frame with 42 rows and 1 column. This may not seem like a big difference, but it can be an important distinction in some cases.
birthweight |
---|
3.23 |
3.03 |
3.35 |
3.55 |
2.74 |
2.51 |
3.53 |
2.92 |
3.18 |
3.00 |
3.42 |
2.75 |
3.19 |
2.78 |
4.32 |
3.64 |
3.77 |
3.32 |
3.41 |
4.57 |
3.63 |
2.66 |
3.86 |
3.11 |
3.93 |
4.55 |
3.14 |
2.37 |
3.15 |
3.20 |
4.10 |
2.05 |
3.27 |
3.32 |
3.94 |
3.65 |
1.92 |
2.65 |
2.90 |
3.59 |
3.87 |
4.07 |
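The distinction can be checked directly. This sketch uses a three-row stand-in (`bw_small` is a hypothetical name); on the full data the one-column table above would correspond to `birthweight[5]`.

```r
# Stand-in with the first five columns of the first three rows (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683),
  birth.date  = c("1/25/1967", "2/6/1967", "2/14/1967"),
  location    = c("General", "Silver Hill", "Silver Hill"),
  length      = c(52, 48, 53),
  birthweight = c(3.23, 3.03, 3.35),
  stringsAsFactors = FALSE
)

one_col_df  <- bw_small[5]     # [ keeps the data frame structure
one_col_vec <- bw_small[[5]]   # [[ pulls out the column itself

class(one_col_df)    # "data.frame"
dim(one_col_df)      # 3 rows, 1 column
class(one_col_vec)   # "numeric"

# Functions that expect a vector, such as mean(), need the [[ or $ form
mean(one_col_vec)
```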
Because the [ operator returns a new data frame, it can be used to specify multiple rows and/or columns.
ID | birthweight |
---|---|
1107 | 3.23 |
697 | 3.03 |
1683 | 3.35 |
27 | 3.55 |
1522 | 2.74 |
569 | 2.51 |
365 | 3.53 |
808 | 2.92 |
1369 | 3.18 |
1023 | 3.00 |
822 | 3.42 |
1272 | 2.75 |
1262 | 3.19 |
575 | 2.78 |
1016 | 4.32 |
792 | 3.64 |
820 | 3.77 |
752 | 3.32 |
619 | 3.41 |
1764 | 4.57 |
1081 | 3.63 |
516 | 2.66 |
272 | 3.86 |
321 | 3.11 |
1636 | 3.93 |
1360 | 4.55 |
1388 | 3.14 |
1363 | 2.37 |
1058 | 3.15 |
755 | 3.20 |
462 | 4.10 |
300 | 2.05 |
1088 | 3.27 |
57 | 3.32 |
553 | 3.94 |
1191 | 3.65 |
431 | 1.92 |
1313 | 2.65 |
1600 | 2.90 |
532 | 3.59 |
223 | 3.87 |
1187 | 4.07 |
The c() function creates a vector. This allows R to treat indices 1 and 5 as a single argument. This is critical, because birthweight[1,5] does not produce the same effect at all.
## [1] 3.23
What happened?
When there are two arguments provided to [, R interprets these as the index on the first (row) and second (column) dimension of the object. The value returned is the content of the first row, fifth column: the birth weight of individual 1107.
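The contrast between one vector argument and two separate arguments can be sketched on a small stand-in (`bw_small` is a hypothetical name holding the first three rows and five columns). On the full data the two calls would be `birthweight[c(1,5)]` and `birthweight[1,5]`.

```r
# Stand-in with the first five columns of the first three rows (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683),
  birth.date  = c("1/25/1967", "2/6/1967", "2/14/1967"),
  location    = c("General", "Silver Hill", "Silver Hill"),
  length      = c(52, 48, 53),
  birthweight = c(3.23, 3.03, 3.35),
  stringsAsFactors = FALSE
)

# One argument (a vector of column indices): columns 1 and 5, all rows
two_cols <- bw_small[c(1, 5)]
colnames(two_cols)   # "ID" "birthweight"

# Two arguments: row 1, column 5, which is a single value
one_value <- bw_small[1, 5]
one_value            # 3.23
```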
The default behavior of [
is to return the entire object. The first argument acts as a sort of filter on the first dimension, the second argument as a filter on the second dimension, and so on. Leaving the space before the comma blank will return all rows (no filter applied), while leaving the space following the comma blank will return all columns. Be sure to try variations on the example code below to see what happens.
row | ID | birthweight |
---|---|---|
2 | 697 | 3.03 |
7 | 365 | 3.53 |
29 | 1058 | 3.15 |
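The table above is consistent with a call like `birthweight[c(2, 7, 29), c(1, 5)]`. The sketch below tries the same variations on a four-row stand-in (`bw_small` is a hypothetical name), so the row indices differ from those in the table.

```r
# Stand-in with the first five columns of the first four rows (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683, 27),
  birth.date  = c("1/25/1967", "2/6/1967", "2/14/1967", "3/9/1967"),
  location    = c("General", "Silver Hill", "Silver Hill", "Silver Hill"),
  length      = c(52, 48, 53, 53),
  birthweight = c(3.23, 3.03, 3.35, 3.55),
  stringsAsFactors = FALSE
)

rows_and_cols <- bw_small[c(2, 4), c(1, 5)]  # rows 2 and 4, columns 1 and 5
rows_only     <- bw_small[2, ]               # row 2, all columns
cols_only     <- bw_small[, c(1, 5)]         # all rows, columns 1 and 5

rows_and_cols
dim(rows_only)   # 1 row, 5 columns
dim(cols_only)   # 4 rows, 2 columns
```

Note that the row names in the result (2 and 4 here) are carried over from the original data frame, which is why the table above shows 2, 7, and 29 in its first column.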
Using a minus sign before an index or group of indices will exclude the specified rows / columns.
## [1] "ID" "birth.date"
## [3] "location" "length"
## [5] "birthweight" "head.circumference"
## [7] "weeks.gestation" "smoker"
## [9] "maternal.age" "maternal.cigarettes"
## [11] "maternal.height" "maternal.prepregnant.weight"
## [13] "paternal.age" "paternal.education"
## [15] "paternal.cigarettes" "paternal.height"
## [17] "low.birthweight" "geriatric.pregnancy"
row | ID | birth.date | location | length | birthweight | head.circumference | weeks.gestation | smoker | maternal.age | maternal.cigarettes | maternal.height | maternal.prepregnant.weight | low.birthweight | geriatric.pregnancy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1107 | 1/25/1967 | General | 52 | 3.23 | 36 | 38 | no | 31 | 0 | 164 | 57 | 0 | FALSE |
3 | 1683 | 2/14/1967 | Silver Hill | 53 | 3.35 | 33 | 41 | no | 27 | 0 | 164 | 62 | 0 | FALSE |
5 | 1522 | 3/13/1967 | Memorial | 50 | 2.74 | 33 | 39 | yes | 21 | 17 | 156 | 53 | 0 | FALSE |
6 | 569 | 3/23/1967 | Memorial | 50 | 2.51 | 35 | 39 | yes | 22 | 7 | 159 | 52 | 1 | FALSE |
7 | 365 | 4/23/1967 | Memorial | 52 | 3.53 | 37 | 40 | yes | 26 | 25 | 170 | 62 | 0 | FALSE |
8 | 808 | 5/5/1967 | Silver Hill | 48 | 2.92 | 33 | 34 | no | 26 | 0 | 167 | 64 | 0 | FALSE |
9 | 1369 | 6/4/1967 | Silver Hill | 49 | 3.18 | 34 | 38 | yes | 31 | 25 | 162 | 57 | 0 | FALSE |
10 | 1023 | 6/7/1967 | Memorial | 52 | 3.00 | 35 | 38 | yes | 30 | 12 | 165 | 64 | 0 | FALSE |
11 | 822 | 6/14/1967 | Memorial | 50 | 3.42 | 35 | 38 | no | 20 | 0 | 157 | 48 | 0 | FALSE |
12 | 1272 | 6/20/1967 | Memorial | 53 | 2.75 | 32 | 40 | yes | 37 | 50 | 168 | 61 | 0 | TRUE |
13 | 1262 | 6/25/1967 | Silver Hill | 53 | 3.19 | 34 | 41 | yes | 27 | 35 | 163 | 51 | 0 | FALSE |
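The table above drops the four paternal columns (13 to 16), which is what a column argument like `-c(13:16)` would do. A minimal sketch of negative indexing on a four-row stand-in (`bw_small` is a hypothetical name):

```r
# Stand-in with the first five columns of the first four rows (hypothetical name)
bw_small <- data.frame(
  ID          = c(1107, 697, 1683, 27),
  birth.date  = c("1/25/1967", "2/6/1967", "2/14/1967", "3/9/1967"),
  location    = c("General", "Silver Hill", "Silver Hill", "Silver Hill"),
  length      = c(52, 48, 53, 53),
  birthweight = c(3.23, 3.03, 3.35, 3.55),
  stringsAsFactors = FALSE
)

without_row2 <- bw_small[-2, ]                 # every row except the second
without_cols <- bw_small[, -c(2, 3)]           # drop columns 2 and 3
both         <- bw_small[-c(1, 2), -c(2, 3)]   # drop rows 1-2 and columns 2-3

rownames(without_row2)   # "1" "3" "4"
colnames(without_cols)   # "ID" "length" "birthweight"
dim(both)                # 2 rows, 3 columns
```

Positive and negative indices cannot be mixed in the same vector, so a selection is written either as what to keep or as what to drop.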
R will also accept row or column names in quotation marks as a way to subset the data frame.
maternal.cigarettes | birthweight |
---|---|
0 | 3.23 |
0 | 3.03 |
0 | 3.35 |
25 | 3.55 |
17 | 2.74 |
7 | 2.51 |
25 | 3.53 |
0 | 2.92 |
25 | 3.18 |
12 | 3.00 |
0 | 3.42 |
50 | 2.75 |
35 | 3.19 |
7 | 2.78 |
0 | 4.32 |
2 | 3.64 |
0 | 3.77 |
12 | 3.32 |
25 | 3.41 |
12 | 4.57 |
0 | 3.63 |
35 | 2.66 |
25 | 3.86 |
0 | 3.11 |
0 | 3.93 |
0 | 4.55 |
7 | 3.14 |
7 | 2.37 |
0 | 3.15 |
0 | 3.20 |
0 | 4.10 |
7 | 2.05 |
0 | 3.27 |
17 | 3.32 |
0 | 3.94 |
0 | 3.65 |
7 | 1.92 |
0 | 2.65 |
0 | 2.90 |
12 | 3.59 |
25 | 3.87 |
0 | 4.07 |
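A table like the one above would result from a call such as `birthweight[, c("maternal.cigarettes", "birthweight")]`. The sketch below shows name-based subsetting on a four-row stand-in (`bw_small` is a hypothetical name with three of the original columns).

```r
# Stand-in with three columns of the first four rows (hypothetical name)
bw_small <- data.frame(
  ID                  = c(1107, 697, 1683, 27),
  birthweight         = c(3.23, 3.03, 3.35, 3.55),
  maternal.cigarettes = c(0, 0, 0, 25)
)

# Columns are returned in the order the names are given
by_name <- bw_small[, c("maternal.cigarettes", "birthweight")]
colnames(by_name)   # "maternal.cigarettes" "birthweight"

# Row names work the same way; the default row names are "1", "2", ...
fourth_row <- bw_small["4", ]
fourth_row$ID       # 27
```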
Finally, vectors of logical (TRUE/FALSE) values can be used to subset data. Rows or columns corresponding to “TRUE” elements will be returned, while rows or columns corresponding to “FALSE” elements will be excluded.
birthweight[c(1,3,5:13), c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)]
row | ID | birth.date | location | length | birthweight | head.circumference | weeks.gestation | smoker | maternal.age | maternal.cigarettes | maternal.height | maternal.prepregnant.weight | low.birthweight | geriatric.pregnancy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1107 | 1/25/1967 | General | 52 | 3.23 | 36 | 38 | no | 31 | 0 | 164 | 57 | 0 | FALSE |
3 | 1683 | 2/14/1967 | Silver Hill | 53 | 3.35 | 33 | 41 | no | 27 | 0 | 164 | 62 | 0 | FALSE |
5 | 1522 | 3/13/1967 | Memorial | 50 | 2.74 | 33 | 39 | yes | 21 | 17 | 156 | 53 | 0 | FALSE |
6 | 569 | 3/23/1967 | Memorial | 50 | 2.51 | 35 | 39 | yes | 22 | 7 | 159 | 52 | 1 | FALSE |
7 | 365 | 4/23/1967 | Memorial | 52 | 3.53 | 37 | 40 | yes | 26 | 25 | 170 | 62 | 0 | FALSE |
8 | 808 | 5/5/1967 | Silver Hill | 48 | 2.92 | 33 | 34 | no | 26 | 0 | 167 | 64 | 0 | FALSE |
9 | 1369 | 6/4/1967 | Silver Hill | 49 | 3.18 | 34 | 38 | yes | 31 | 25 | 162 | 57 | 0 | FALSE |
10 | 1023 | 6/7/1967 | Memorial | 52 | 3.00 | 35 | 38 | yes | 30 | 12 | 165 | 64 | 0 | FALSE |
11 | 822 | 6/14/1967 | Memorial | 50 | 3.42 | 35 | 38 | no | 20 | 0 | 157 | 48 | 0 | FALSE |
12 | 1272 | 6/20/1967 | Memorial | 53 | 2.75 | 32 | 40 | yes | 37 | 50 | 168 | 61 | 0 | TRUE |
13 | 1262 | 6/25/1967 | Silver Hill | 53 | 3.19 | 34 | 41 | yes | 27 | 35 | 163 | 51 | 0 | FALSE |
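Typing out a long logical vector by hand is error-prone, so in practice it is usually computed. As one possible sketch (not necessarily how the table above was produced), the base R function `grepl()` can flag the paternal columns by name. `bw_small` and `is_paternal` are hypothetical names, and the stand-in holds four columns of the first three rows.

```r
# Stand-in with two paternal columns from the first three rows (hypothetical name)
bw_small <- data.frame(
  ID              = c(1107, 697, 1683),
  birthweight     = c(3.23, 3.03, 3.35),
  paternal.age    = c(NA, 27, 37),
  paternal.height = c(NA, 178, 170)
)

# TRUE for every column whose name starts with "paternal"
is_paternal <- grepl("^paternal", colnames(bw_small))
is_paternal   # FALSE FALSE TRUE TRUE

# ! flips TRUE and FALSE, so this keeps only the non-paternal columns
no_paternal <- bw_small[, !is_paternal]
colnames(no_paternal)   # "ID" "birthweight"
```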
This is much more useful than it may sound.
## [1] 52 48 53 53 50 50 52 48 49 52 50 53 53 50 53 53 52 49 52 58 54 47 52 48 51
## [26] 56 51 48 53 53 58 46 51 51 54 53 48 43 53 53 50 53
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## [25] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [37] TRUE TRUE FALSE FALSE FALSE FALSE
Since the result of the birthweight$length < 50 operation is a vector of TRUE / FALSE values, it can be used to subset the data frame.
row | ID | length | birthweight | head.circumference | weeks.gestation | smoker | maternal.age | maternal.cigarettes | maternal.height | maternal.prepregnant.weight | low.birthweight | geriatric.pregnancy |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 697 | 48 | 3.03 | 35 | 39 | no | 27 | 0 | 162 | 62 | 0 | FALSE |
8 | 808 | 48 | 2.92 | 33 | 34 | no | 26 | 0 | 167 | 64 | 0 | FALSE |
9 | 1369 | 49 | 3.18 | 34 | 38 | yes | 31 | 25 | 162 | 57 | 0 | FALSE |
18 | 752 | 49 | 3.32 | 36 | 40 | yes | 27 | 12 | 152 | 48 | 0 | FALSE |
22 | 516 | 47 | 2.66 | 33 | 35 | yes | 20 | 35 | 170 | 57 | 1 | FALSE |
24 | 321 | 48 | 3.11 | 33 | 37 | no | 28 | 0 | 158 | 54 | 0 | FALSE |
28 | 1363 | 48 | 2.37 | 30 | 37 | yes | 20 | 7 | 163 | 47 | 1 | FALSE |
32 | 300 | 46 | 2.05 | 32 | 35 | yes | 41 | 7 | 166 | 57 | 1 | TRUE |
37 | 431 | 48 | 1.92 | 30 | 33 | yes | 20 | 7 | 161 | 50 | 1 | FALSE |
38 | 1313 | 43 | 2.65 | 32 | 33 | no | 24 | 0 | 149 | 45 | 1 | FALSE |
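The whole sequence can be sketched end to end. This example assumes a stand-in data frame `bw` (a hypothetical name) with only the length and birthweight columns, using the 42 values printed earlier. On the full data the row filter would be `birthweight[birthweight$length < 50, ]`, optionally with a column argument to drop columns as in the table above.

```r
# Stand-in with two columns, values copied from the printed vectors (hypothetical name)
bw <- data.frame(
  length = c(
    52, 48, 53, 53, 50, 50, 52, 48, 49, 52, 50, 53, 53, 50, 53, 53, 52, 49, 52, 58, 54,
    47, 52, 48, 51, 56, 51, 48, 53, 53, 58, 46, 51, 51, 54, 53, 48, 43, 53, 53, 50, 53
  ),
  birthweight = c(
    3.23, 3.03, 3.35, 3.55, 2.74, 2.51, 3.53, 2.92, 3.18, 3.00, 3.42, 2.75, 3.19, 2.78,
    4.32, 3.64, 3.77, 3.32, 3.41, 4.57, 3.63, 2.66, 3.86, 3.11, 3.93, 4.55, 3.14, 2.37,
    3.15, 3.20, 4.10, 2.05, 3.27, 3.32, 3.94, 3.65, 1.92, 2.65, 2.90, 3.59, 3.87, 4.07
  )
)

is_short <- bw$length < 50   # one TRUE/FALSE per row
sum(is_short)                # 10 rows satisfy the condition
which(is_short)              # 2 8 9 18 22 24 28 32 37 38

short_babies <- bw[is_short, ]   # keep only the TRUE rows, all columns
mean(short_babies$birthweight)   # mean birth weight within the subset
```

The row positions returned by `which()` match the row labels in the table above, which is a quick way to confirm that the filter selected the intended rows.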
3.4 Exercise 1: exploring the data
Use the min(), max(), and mean() functions on subsets of the data frame to answer one or more of the following questions. Work together. Often there will be more than one way to arrive at the answer.
- What is the range of paternal ages in the data set?
- What is the mean maternal age?
- What is the age of the mother with the highest maternal.cigarettes value?
- Is the mean pre-pregnant weight higher or lower among women who gave birth to low birth weight children?
What other questions are of interest to you? Can you answer any of them with the functions we have used so far?