4 Basic data types

We have already said that logical values can be used to subset a data frame, and that all the values in a given column of a data frame must be of the same type or class. But what does this mean?

4.1 Understanding class

R has the following basic data classes:

  • numeric (includes integer and double)
  • character
  • logical
  • complex
  • raw

Generally, in bioinformatics, values belong to one of the first three classes. To read more about the complex and raw data types, see R’s built-in help pages (?complex and ?raw).

class(birthweight$birthweight)
## [1] "numeric"
class(birthweight$smoker)
## [1] "character"
class(birthweight$geriatric.pregnancy)
## [1] "logical"
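
As a quick check that does not depend on the birthweight data, the sketch below asks R for the class of one literal value of each basic type. The values are arbitrary and chosen only for illustration. Note that class() reports “integer” for an integer, even though integers and doubles both count as numeric.

```r
# One literal value of each basic type (arbitrary illustrative values).
values <- list(
  integer   = 1L,          # the L suffix requests an integer
  double    = 1.5,
  character = "a",
  logical   = TRUE,
  complex   = 1 + 2i,
  raw       = as.raw(255)
)

# Ask R for the class of each value.
classes <- sapply(values, class)
print(classes)

# Integers and doubles both pass the is.numeric() test.
is.numeric(1L)    # TRUE
is.numeric(1.5)   # TRUE
```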

The numeric category is fairly self-explanatory. What are character and logical?

Character values are exactly what they sound like: stored text, made up of letters, digits, or other symbols. In the birthweight table, the “birth.date” and “location” columns contain character values.

head(birthweight$location)
## [1] "General"     "Silver Hill" "Silver Hill" "Silver Hill" "Memorial"   
## [6] "Memorial"

Characters are recognizable by the quotation marks that appear around them in the output. R cannot perform mathematical operations on numbers stored as characters.

1 + "1"
## Error in 1 + "1": non-numeric argument to binary operator
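
The following is a minimal sketch of what happens when arithmetic is attempted on a number stored as a character, and of one way around it. It uses tryCatch() only so that the error message can be captured without stopping the script.

```r
# Adding a number to a character string is an error, even when the string
# looks like a number. tryCatch() captures the message so the script continues.
msg <- tryCatch(1 + "1", error = function(e) conditionMessage(e))
print(msg)

# Converting the string to numeric first makes the addition possible.
total <- 1 + as.numeric("1")
print(total)   # 2
```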

Logical values are TRUE, FALSE, or NA (missing). Logical values are the result of comparing one item to another with relational operators.

The relational operators in R are:

  • > greater than
  • >= greater than or equal to
  • < less than
  • <= less than or equal to
  • == equal to
  • != not equal to
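
Before applying these operators to the data frame, it may help to see what they return on a small vector. The values below are hypothetical and are not taken from the birthweight data. Each comparison is made element by element and returns a logical vector of the same length.

```r
# Hypothetical head circumferences in cm (not taken from the birthweight data).
circumference <- c(32, 35, 36, 40)

circumference > 35    # FALSE FALSE  TRUE  TRUE
circumference >= 35   # FALSE  TRUE  TRUE  TRUE
circumference == 35   # FALSE  TRUE FALSE FALSE
circumference != 35   #  TRUE FALSE  TRUE  TRUE

# A logical vector can be stored and then used to subset.
is_large <- circumference > 35
large <- circumference[is_large]
print(large)          # 36 40
```
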

birthweight[birthweight$head.circumference > 35, c("length", "weeks.gestation", "maternal.height", "paternal.height")]
Table 4.1: Results of birthweight[birthweight$head.circumference > 35, c("length", "weeks.gestation", "maternal.height", "paternal.height")].
    length  weeks.gestation  maternal.height  paternal.height
 1      52               38              164               NA
 4      53               41              161              175
 7      52               40              170              181
15      53               40              171              183
16      53               40              170              185
18      49               40              152              170
20      58               41              173              180
21      54               38              172              172
23      52               39              170              178
25      51               38              165               NA
31      58               41              172              185
33      51               40              168              181
34      51               39              157               NA
35      54               42              175              184
42      53               44              174              189

birthweight[birthweight$maternal.age <= 20, c("location", "maternal.age", "paternal.age")]
Table 4.2: Results of birthweight[birthweight$maternal.age <= 20, c("location", "maternal.age", "paternal.age")].
    location     maternal.age  paternal.age
11  Memorial               20            22
14  Memorial               19            20
15  Silver Hill            19            19
16  Memorial               20            24
21  Silver Hill            18            20
22  Silver Hill            20            23
26  General                20            23
28  General                20            20
37  Silver Hill            20            20
39  General                19            NA
42  Silver Hill            20            26

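Notice that when R is asked to compare a number with a missing value, the result is a missing value. When such a comparison is used to subset a data frame, every NA in the logical vector produces a row filled with NA, as the next example shows.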
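
This behavior can be reproduced on a toy data frame. The sketch below uses made-up column names and values, not the birthweight data. It also shows which(), one common way to drop the NA rows, which the text above does not cover.

```r
# Toy data frame; the names and values are made up for illustration.
toy <- data.frame(id = 1:4, education = c(10, NA, 12, 10))

# Comparing a missing value with a number gives a missing value.
NA == 10                       # NA

# The comparison is NA wherever education is missing.
keep <- toy$education == 10    # TRUE NA FALSE TRUE

# Subsetting with an NA yields a row in which every value is NA.
with_na <- toy[keep, ]
print(with_na)

# which() returns only the positions that are TRUE, dropping the NA.
without_na <- toy[which(keep), ]
print(without_na)
```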

birthweight[birthweight$paternal.education == 10, c(1,13:16)]
Table 4.3: Results of birthweight[birthweight$paternal.education == 10, c(1,13:16)].
        ID  paternal.age  paternal.education  paternal.cigarettes  paternal.height
NA      NA            NA                  NA                   NA               NA
NA.1    NA            NA                  NA                   NA               NA
7      365            30                  10                   25              181
24     321            39                  10                    0              171
NA.2    NA            NA                  NA                   NA               NA
26    1360            23                  10                   35              179
28    1363            20                  10                   35              185
NA.3    NA            NA                  NA                   NA               NA
36    1191            21                  10                   25              185
37     431            20                  10                   35              180
NA.4    NA            NA                  NA                   NA               NA

birthweight[birthweight$weeks.gestation != 40, "weeks.gestation"]
##  [1] 38 39 41 41 39 39 34 38 38 38 41 37 39 41 38 35 39 37 38 44 41 37 41 41 35
## [26] 39 42 42 33 33 39 45 44

birthweight[birthweight$location == "General",]
Table 4.4: Results of birthweight[birthweight$location == "General",].
ID birth.date location length birthweight head.circumference weeks.gestation smoker maternal.age maternal.cigarettes maternal.height maternal.prepregnant.weight paternal.age paternal.education paternal.cigarettes paternal.height low.birthweight geriatric.pregnancy
1 1107 1/25/1967 General 52 3.23 36 38 no 31 0 164 57 NA NA NA NA 0 FALSE
17 820 10/7/1967 General 52 3.77 34 40 no 24 0 157 50 31 16 0 173 0 FALSE
18 752 10/19/1967 General 49 3.32 36 40 yes 27 12 152 48 37 12 25 170 0 FALSE
26 1360 2/16/1968 General 56 4.55 34 44 no 20 0 162 57 23 10 35 179 0 FALSE
28 1363 4/2/1968 General 48 2.37 30 37 yes 20 7 163 47 20 10 35 185 1 FALSE
33 1088 7/24/1968 General 51 3.27 36 40 no 24 0 168 53 29 16 0 181 0 FALSE
36 1191 9/7/1968 General 53 3.65 33 42 no 21 0 165 61 21 10 25 185 0 FALSE
39 1600 10/9/1968 General 53 2.90 34 39 no 19 0 165 57 NA NA NA NA 0 FALSE
40 532 10/25/1968 General 53 3.59 34 40 yes 31 12 163 49 41 12 50 191 0 FALSE
41 223 12/11/1968 General 50 3.87 33 45 yes 28 25 163 54 30 16 0 183 0 FALSE

Many of R’s functions also return logical values.

is.numeric(birthweight$ID)
## [1] TRUE
is.numeric(birthweight$smoker)
## [1] FALSE
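
A few other base R functions that return logical values are sketched below on a small hypothetical vector. The selection is illustrative rather than exhaustive.

```r
x <- c(2.5, NA, 4)          # hypothetical measurements

is_missing <- is.na(x)      # FALSE  TRUE FALSE
print(is_missing)
is.character("General")     # TRUE
is.logical(is_missing)      # TRUE
anyNA(x)                    # TRUE

# Because is.na() returns a logical vector, it can be used to subset.
complete <- x[!is_missing]
print(complete)             # 2.5 4.0
```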

4.2 Coercion: converting between classes

The birthweight data frame has three columns that should probably be logical values: “smoker”, “low.birthweight”, and “geriatric.pregnancy”. Each of these answers a question that can be answered with TRUE or FALSE. However, only “geriatric.pregnancy” is stored as a logical value. Storing “smoker” and “low.birthweight” as logical values would be more useful, since that would allow us to subset the data frame more easily.

Changing the class of data is known as coercion.

as.logical(birthweight$low.birthweight)
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [25] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [37]  TRUE  TRUE FALSE FALSE FALSE FALSE
as.logical(birthweight$smoker)
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

The as.logical() function converted “low.birthweight” to a logical vector, but could not convert “smoker,” and returned a vector of missing data denoted by NA. Why is this?

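The coercion rule in R orders the basic types from least to most general:

logical < integer < numeric < complex < character

A value can always be converted to a more general type, one further to the right. R can convert logical values to integers (FALSE becomes 0 and TRUE becomes 1), store integers as the more general numeric type, or represent numeric data as a character, but these coercion operations cannot always be reversed without losing information. In particular, as.logical() only recognizes character strings such as "TRUE", "true", "T", "FALSE", "false", and "F". Any other string, including "yes" and "no", becomes NA, which is why the “smoker” column could not be converted.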
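
One way to see this ordering in action is to combine values of different types with c(). A vector can hold only one type, so R coerces everything to the most general type present. The values below are arbitrary, and the sketch also shows two conversions in the opposite direction that lose information.

```r
# c() coerces its arguments to the most general type among them.
mix_int <- c(TRUE, 2L)      # logical + integer   -> integer
mix_num <- c(2L, 2.5)       # integer + double    -> numeric
mix_cpl <- c(2.5, 1i)       # double + complex    -> complex
mix_chr <- c(2.5, "a")      # double + character  -> character
class(mix_int); class(mix_num); class(mix_cpl); class(mix_chr)

# Moving toward a more general type.
as.integer(TRUE)            # 1
as.character(2.5)           # "2.5"

# Moving toward a less general type can lose information.
truncated <- as.integer(2.7)                       # 2 (fraction dropped)
not_a_number <- suppressWarnings(as.numeric("abc")) # NA (warning suppressed)
print(truncated)
print(not_a_number)
```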

as.numeric(birthweight$geriatric.pregnancy)
##  [1] 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
## [39] 0 0 0 0

The as.logical() function only operates on “low.birthweight” the way we want because the data was encoded as 0s and 1s. as.logical() converts 0 to FALSE and every other number to TRUE, so a column that uses any other encoding gives unexpected results.

as.logical(birthweight$maternal.age)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
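
The behavior of as.logical() on numbers and on character strings can be checked on small hypothetical vectors, as in the sketch below. None of these values come from the birthweight data.

```r
# Numbers: 0 becomes FALSE, any other number becomes TRUE, NA stays NA.
from_numbers <- as.logical(c(0, 1, 2, -3, NA))
print(from_numbers)    # FALSE  TRUE  TRUE  TRUE    NA

# Strings: only a few spellings of true and false are recognized.
from_strings <- as.logical(c("TRUE", "true", "T", "FALSE", "yes", "no"))
print(from_strings)    # TRUE  TRUE  TRUE FALSE    NA    NA
```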

Let’s convert the “low.birthweight” column to logical.

birthweight$low.birthweight <- as.logical(birthweight$low.birthweight)
birthweight
Table 4.5: The birthweight data frame.
ID birth.date location length birthweight head.circumference weeks.gestation smoker maternal.age maternal.cigarettes maternal.height maternal.prepregnant.weight paternal.age paternal.education paternal.cigarettes paternal.height low.birthweight geriatric.pregnancy
1107 1/25/1967 General 52 3.23 36 38 no 31 0 164 57 NA NA NA NA FALSE FALSE
697 2/6/1967 Silver Hill 48 3.03 35 39 no 27 0 162 62 27 14 0 178 FALSE FALSE
1683 2/14/1967 Silver Hill 53 3.35 33 41 no 27 0 164 62 37 14 0 170 FALSE FALSE
27 3/9/1967 Silver Hill 53 3.55 37 41 yes 37 25 161 66 46 NA 0 175 FALSE TRUE
1522 3/13/1967 Memorial 50 2.74 33 39 yes 21 17 156 53 24 12 7 179 FALSE FALSE
569 3/23/1967 Memorial 50 2.51 35 39 yes 22 7 159 52 23 14 25 NA TRUE FALSE
365 4/23/1967 Memorial 52 3.53 37 40 yes 26 25 170 62 30 10 25 181 FALSE FALSE
808 5/5/1967 Silver Hill 48 2.92 33 34 no 26 0 167 64 25 12 25 175 FALSE FALSE
1369 6/4/1967 Silver Hill 49 3.18 34 38 yes 31 25 162 57 32 16 50 194 FALSE FALSE
1023 6/7/1967 Memorial 52 3.00 35 38 yes 30 12 165 64 38 14 50 180 FALSE FALSE
822 6/14/1967 Memorial 50 3.42 35 38 no 20 0 157 48 22 14 0 179 FALSE FALSE
1272 6/20/1967 Memorial 53 2.75 32 40 yes 37 50 168 61 31 16 0 173 FALSE TRUE
1262 6/25/1967 Silver Hill 53 3.19 34 41 yes 27 35 163 51 31 16 25 185 FALSE FALSE
575 7/12/1967 Memorial 50 2.78 30 37 yes 19 7 165 60 20 14 0 183 FALSE FALSE
1016 7/13/1967 Silver Hill 53 4.32 36 40 no 19 0 171 62 19 12 0 183 FALSE FALSE
792 9/7/1967 Memorial 53 3.64 38 40 yes 20 2 170 59 24 12 12 185 FALSE FALSE
820 10/7/1967 General 52 3.77 34 40 no 24 0 157 50 31 16 0 173 FALSE FALSE
752 10/19/1967 General 49 3.32 36 40 yes 27 12 152 48 37 12 25 170 FALSE FALSE
619 11/1/1967 Memorial 52 3.41 33 39 yes 23 25 181 69 23 16 2 181 FALSE FALSE
1764 12/7/1967 Silver Hill 58 4.57 39 41 yes 32 12 173 70 38 14 25 180 FALSE FALSE
1081 12/14/1967 Silver Hill 54 3.63 38 38 no 18 0 172 50 20 12 7 172 FALSE FALSE
516 1/8/1968 Silver Hill 47 2.66 33 35 yes 20 35 170 57 23 12 50 186 TRUE FALSE
272 1/10/1968 Memorial 52 3.86 36 39 yes 30 25 170 78 40 16 50 178 FALSE FALSE
321 1/21/1968 Silver Hill 48 3.11 33 37 no 28 0 158 54 39 10 0 171 FALSE FALSE
1636 2/2/1968 Silver Hill 51 3.93 38 38 no 29 0 165 61 NA NA NA NA FALSE FALSE
1360 2/16/1968 General 56 4.55 34 44 no 20 0 162 57 23 10 35 179 FALSE FALSE
1388 2/22/1968 Memorial 51 3.14 33 41 yes 22 7 160 53 24 16 12 176 FALSE FALSE
1363 4/2/1968 General 48 2.37 30 37 yes 20 7 163 47 20 10 35 185 TRUE FALSE
1058 4/24/1968 Silver Hill 53 3.15 34 40 no 29 0 167 60 30 16 NA 182 FALSE FALSE
755 4/25/1968 Memorial 53 3.20 33 41 no 21 0 155 55 25 14 25 183 FALSE FALSE
462 6/19/1968 Silver Hill 58 4.10 39 41 no 35 0 172 58 31 16 25 185 FALSE TRUE
300 7/18/1968 Silver Hill 46 2.05 32 35 yes 41 7 166 57 37 14 25 173 TRUE TRUE
1088 7/24/1968 General 51 3.27 36 40 no 24 0 168 53 29 16 0 181 FALSE FALSE
57 8/12/1968 Memorial 51 3.32 38 39 yes 23 17 157 48 NA NA NA NA FALSE FALSE
553 8/17/1968 Silver Hill 54 3.94 37 42 no 24 0 175 66 30 12 0 184 FALSE FALSE
1191 9/7/1968 General 53 3.65 33 42 no 21 0 165 61 21 10 25 185 FALSE FALSE
431 9/16/1968 Silver Hill 48 1.92 30 33 yes 20 7 161 50 20 10 35 180 TRUE FALSE
1313 9/27/1968 Silver Hill 43 2.65 32 33 no 24 0 149 45 26 16 0 169 TRUE FALSE
1600 10/9/1968 General 53 2.90 34 39 no 19 0 165 57 NA NA NA NA FALSE FALSE
532 10/25/1968 General 53 3.59 34 40 yes 31 12 163 49 41 12 50 191 FALSE FALSE
223 12/11/1968 General 50 3.87 33 45 yes 28 25 163 54 30 16 0 183 FALSE FALSE
1187 12/19/1968 Silver Hill 53 4.07 38 44 no 20 0 174 68 26 14 25 189 FALSE FALSE

Note that the output of as.logical(birthweight$low.birthweight) must be assigned to the “low.birthweight” column in order for the values in the column to change.
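
The difference between calling a coercion function and assigning its result can be sketched with a toy data frame. The column name and values below are hypothetical.

```r
# Toy data frame with a 0/1 column (hypothetical name and values).
toy <- data.frame(flag = c(0, 1, 0))

# Calling as.logical() alone returns a new vector; toy is unchanged.
as.logical(toy$flag)
class_before <- class(toy$flag)
print(class_before)    # "numeric"

# Assigning the result back to the column changes the data frame.
toy$flag <- as.logical(toy$flag)
class_after <- class(toy$flag)
print(class_after)     # "logical"
```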

4.3 Exercise 2: converting “smoker” from character to logical

Simple coercion is not going to convert the “smoker” column from character to logical.

How can you solve this problem?