-
Notifications
You must be signed in to change notification settings - Fork 1
/
foundations_ls07_data_structures.qmd
291 lines (198 loc) Β· 8.74 KB
/
foundations_ls07_data_structures.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
---
title: 'Data structures'
output:
html_document:
number_sections: true
toc: true
toc_float: true
css: !expr here::here("global/style/style.css")
highlight: kate
editor_options:
markdown:
wrap: 100
canonical: true
chunk_output_type: console
---
```{r, include = FALSE, warning = FALSE, message = FALSE}
## Load packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, knitr, here)
## knitr settings
knitr::opts_chunk$set(warning = F, message = F,
class.source = "tgc-code-block", error = T)
```
## Intros
In this lesson, we'll take a brief look at data structures in R. Understanding data structures is crucial for data manipulation and analysis. We will start by exploring vectors, the basic data structure in R. Then, we will learn how to combine vectors into data frames, the most common structure for organizing and analyzing data.
## Learning objectives
1. You can create vectors with the `c()` function.
2. You can combine vectors into data frames.
3. You understand the difference between a tibble and a data frame.
## Packages
Please load the packages needed for this lesson with the code below:
```{r}
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse)
```
## Introducing vectors
The most basic data structures in R are vectors. Vectors are a collection of values that all share the same class (e.g., all numeric or all character). It may be helpful to think of a vector as a column in an Excel spreadsheet.
## Creating vectors
Vectors can be created using the `c()` function, with the components of the vector separated by commas. For example, the code `c(1, 2, 3)` defines a vector with the elements `1`, `2` and `3`.
In your script, define the following vectors:
```{r}
age <- c(18, 25, 46)
sex <- c('M', 'F', 'F')
positive_test <- c(T, T, F)
id <- 1:3 # the colon creates a sequence of numbers
```
You can also check the classes of these vectors:
```{r}
class(age)
class(sex)
class(positive_test)
```
::: {.callout-tip title='Practice'}
Each line of code below tries to define a vector with three elements but has a mistake. Fix the mistakes and perform the assignment.
```{r eval = FALSE}
my_vec_1 <- (1,2,3)
my_vec_2 <- c("Obi", "Chika" "Nonso")
```
:::
::: {.callout-note title='Vocab'}
The individual values within a vector are called *components* or elements. So the vector `c(1, 2, 3)` has three components/elements.
:::
## Manipulating vectors
Many of the functions and operations you have encountered so far in the course can be applied to vectors.
For example, we can multiply our `age` object by 2:
```{r}
age
age * 2
```
Notice that every element in the vector was multiplied by 2.
Or, below we take the square root of `age`:
```{r}
age
sqrt(age)
```
------------------------------------------------------------------------
You can also can add (numeric) vectors to each other:
```{r}
age + id
```
Note that the first element of `age` is added to the first element of `id` and the second element of `age` is added to the second element of `id` and so on.
------------------------------------------------------------------------
## From vectors to data frames
Now that we have a handle on creating vectors, let's move on to the most commonly used object in R: data frames. A data frame is just a collection of vectors of the same length with some helpful metadata. We can create one using the `data.frame()` function.
We previously created vector variables (id, age, sex and positive_test) for three individuals:
We can now use the `data.frame()` function to combine these into a single tabular structure:
```{r}
data_epi <- data.frame(id, age, sex, positive_test)
data_epi
```
Note that instead of creating each vector separately, you can create your data frame defining each of the vectors inside the `data.frame()` function.
```{r}
data_epi_2 <- data.frame(age = c(18, 25, 46),
sex = c('M', 'F', 'F'))
data_epi_2
```
::: {.callout-note title='Side Note'}
Most of the time you work with data in R, you will be importing it from external contexts. But it is sometimes useful to create datasets *within* R itself. It is in such cases that the `data.frame()` function will come in handy.
:::
To extract the vectors back out of the data frame, use the `$` syntax. Run the following lines of code in your console to observe this.
```{r eval = FALSE}
data_epi$age
is.vector(data_epi$age) # verify that this column is indeed a vector
class(data_epi$age) # check the class of the vector
```
::: {.callout-tip title='Practice'}
Combine the vectors below into a data frame, with the following column names: "name" for the character vector, "number_of_children" for the numeric vector and "is_married" for the logical vector.
```{r}
character_vec <- c("Bob", "Jane", "Joe")
numeric_vec <- c(1, 2, 3)
logical_vec <- c(T, F, F)
```
:::
::: {.callout-tip title='Practice'}
Use the `data.frame()` function to define a data frame in R that resembles the following table:
| room | num_windows |
|---------|-------------|
| dining | 3 |
| kitchen | 2 |
| bedroom | 5 |
:::
## Tibbles
The default version of tabular data in R is called a data frame, but there is another representation of tabular data provided by the *tidyverse* package. It's called a `tibble`, and it is an improved version of the data frame.
You can convert from a data frame to a tibble with the `as_tibble()` function:
```{r}
data_epi
tibble_epi <- as_tibble(data_epi)
tibble_epi
```
Notice that the tibble gives the data dimensions in the first line:
```
π# A tibble: 3 Γ 4π
id age sex positive_test
<int> <dbl> <chr> <lgl>
1 1 18 M TRUE
2 2 25 F TRUE
3 3 46 F FALSE
```
And also tells you the data types, at the top of each column:
```
## A tibble: 3 Γ 4
id age sex positive_test
π <int> <dbl> <chr> <lgl> π
1 1 18 M TRUE
2 2 25 F TRUE
3 3 46 F FALSE
```
There, "int" stands for integer, dbl" stands for double (which is a kind of numeric class), "chr" stands for character, and "lgl" for logical.
------------------------------------------------------------------------
The other benefit of tibbles is they avoid flooding your console when you print a long table.
Consider the console output of the lines below, for example:
```{r eval = F}
## print the infert data frame (a built in R dataset)
infert # Veryyy long print
as_tibble(infert) # more manageable print
```
For your most of your data analysis needs, you should prefer tibbles over regular data frames.
### `read_csv()` creates tibbles
When you import data with the `read_csv()` function from {readr}, you get a tibble:
```{r}
ebola_tib <- read_csv("https://tinyurl.com/ebola-data-sample") # Needs internet to run
class(ebola_tib)
```
But when you import data with the base `read.csv()` function, you get a data.frame:
```{r}
ebola_df <- read.csv("https://tinyurl.com/ebola-data-sample") # Needs internet to run
class(ebola_df)
```
Try printing `ebola_tib` and `ebola_df` to your console to observe the different printing behavior of tibbles and data frames.
This is one reason we recommend using `read_csv()` instead of `read.csv()`.
## Wrap-up
With your understanding of data classes and structures, you are now well-equipped to perform data manipulation tasks in R. In the upcoming lessons, we will explore the powerful data transformation capabilities of the dplyr package, which will further enhance your data analysis skills.
Congratulations on making it this far! You have covered a lot and should be proud of yourself.
## Solutions
Solution to the first r-practice block:
```{r}
my_vec_1 <- c(1,2,3) # Use 'c' function to create a vector
my_vec_2 <- c("Obi", "Chika", "Nonso") # Separate each string with a comma
```
Solution to the second r-practice block:
```{r}
df <- data.frame(name = character_vec,
number_of_children = numeric_vec,
is_married = logical_vec)
```
Solution to the third r-practice block:
```{r}
## Solution to the third r-practice block
rooms <- data.frame(room = c("dining", "kitchen", "bedroom"),
num_windows = c(3, 2, 5))
```
<!-- Only team members who contributed "substantially" to a specific lesson should be listed here -->
<!-- See https://tinyurl.com/icjme-authorship for notes on "substantial" contribution-->
### References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Wickham, H., & Grolemund, G. (n.d.). *R for data science*. 15 Factors \| R for Data Science. Accessed October 26, 2022. <https://r4ds.had.co.nz/factors.html>.
<!-- (Chicago format. You can use https://www.citationmachine.net) -->
`r tgc_license()`