-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathproject6.Rmd
353 lines (254 loc) · 11.6 KB
/
project6.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
---
title: "Chapter 3: Data manipulation using dplyr (part 2)"
description: |
Learn how to manipulate your data with the dplyr package. In the continuation with the previous chapter.
author:
- name: Jewel Johnson
date: 12-13-2021
base_url: https://jeweljohnsonj.github.io/jeweljohnson.github.io/
output:
distill::distill_article:
toc_depth: 3
preview: photos/dplyr.png
---
```{r setup, include=FALSE}
library(dplyr)
library(tidyr)
library(palmerpenguins)
library(rmarkdown)
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
```
```{r, xaringanExtra-clipboard, echo=FALSE}
htmltools::tagList(
xaringanExtra::use_clipboard(
button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
),
rmarkdown::html_dependency_font_awesome()
)
```
---
# xaringanExtra package will help us to have inbuilt tabs insdie the article
---
```{r panelset, echo=FALSE}
xaringanExtra::use_panelset()
```
---
# loading javascipt file which helps in folding outputs similar to code folding
---
<script src="js/output_folding.js"></script>
---
#start editing from here
---
## Continuation from the previous chapter
In the [previous chapter](https://jeweljohnsonj.github.io/jeweljohnson.github.io/project5.html) we have seen quite a lot of functions from the `dplyr` package. In this chapter, we will see the rest of the functions where we learn how to handle row names, how to join columns and rows and different set operations in the `dplyr` package.
```{r, eval=FALSE}
#loading neessary packages
library(dplyr)
```
### rownames_to_column() & column_to_rownames()
Tidy data does not use row names. So use `rownames_to_column()` command to convert row names to a new column to the data. The function `column_to_rownames()` does the exact opposite of `rownames_to_column()` as it converts a column into rownames but make sure that the column you are converting into rownames does not contain `NA` values.
```{r, eval=FALSE}
# mtcars dataset contains rownames
# creates new column called car_names which contains row names
mtcars %>% rownames_to_column(var = "car_names")
# returns the original mtcars dataset
mtcars %>% rownames_to_column(var = "car_names") %>%
column_to_rownames(var = "car_names")
```
## Combine tables/columns
### bind_cols()
Joins columns with other columns. Similar function as that of `cbind()` from base R.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = letters[1:5], y = c(1:5))
df2 <- tidytable::data.table(x = letters[3:7], y = c(6:10))
bind_cols(df1,df2)
#similar functionality
cbind(df1,df2)
```
### bind_rows()
Joins rows with other rows. Similar function as that of `rbind()` from base R.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = letters[1:5], y = c(1:5))
df2 <- tidytable::data.table(x = letters[3:7], y = c(6:10))
bind_rows(df1,df2)
#similar functionality
rbind(df1,df2)
```
The functions that are described below have the same functionality as that of `bind_cols()` but give you control over how the columns are joined.
## Mutating joins and filtering joins
Mutating joins include `left_join()`, `right_join()`, `inner_join()` and `full_join()` and filtering joins include `semi_join()` and `anti_join()`.
::: {.l-page}
::: {.panelset}
::: {.panel}
### left_join() {.panel-name}
In the code below, matching variables of df2 are joined with df1. In the final data, you can see that only kevin and sam from df2 are matched with df1, and only those row values are joined with df1. For those variables which didn't get a match, the row values for those are filled with `NA`. You can interpret the variables with `NA` values as; both john and chris are not present in df2.
If you are familiar with set theory in mathematics, what we are doing essentially is similar to $(df1 \cap df2) \cup df1$.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df1 %>% left_join(df2)
```
```{r, echo=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
paged_table(df1 %>% left_join(df2))
```
:::
::: {.panel}
### right_join() {.panel-name}
Similar to `left_join()` but here, you will be joining matching values from df1 to df2, the opposite of what we did earlier. As you can see only kevin and sam from the df1 is matched with df2, and only those row values are joined with df2. For the variables which didn't get a match, the row values for those are filled with `NA`. You can interpret the variables with `NA` values as; bob is not present in df1.
This function, in the manner used here, is similar to $(df1 \cap df2) \cup df2$.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df1 %>% right_join(df2)
```
```{r, echo=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
paged_table(df1 %>% right_join(df2))
```
:::
::: {.panel}
### inner_join() {.panel-name}
The function `inner_join()` compares both df1 and df2 variables and only joins rows with the same variables. Here only kevin and sam are common in both the dataframes so the row values of only those columns are joined and others are omitted.
This function is similar to $df1 \cap df2$.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df1 %>% inner_join(df2)
```
```{r, echo=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
paged_table(df1 %>% inner_join(df2))
```
:::
::: {.panel}
### full_join() {.panel-name}
The function `full_join()` compares both df1 and df2 variables and joins all possible matches while retaining both mistakes in df1 and df2 with `NA` values.
This function is similar to $df1 \cup df2$.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df1 %>% full_join(df2)
```
```{r, echo=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
paged_table(df1 %>% full_join(df2))
```
:::
::: {.panel}
### anti_join() {.panel-name}
This is an example of filtering join. The function `anti_join()` compares df1 variables to and df2 variables and only outputs those variables of df1 which didn't get a match with df2.
This function, in the manner used here, is similar to $df1 \cap df2^c$.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df1 %>% anti_join(df2)
```
```{r, echo=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
paged_table(df1 %>% anti_join(df2))
```
:::
::: {.panel}
### semi_join() {.panel-name}
This is an example of filtering join. The function `semi_join()` is similar to `inner_join()` but it only gives variables of df1 which has a match with df2.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
df1 %>% semi_join(df2)
```
```{r, echo=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
paged_table(df1 %>% semi_join(df2))
```
:::
:::
:::
Here is a nice graphical representation of the functions we just described now. Image [source](https://rpubs.com/frasermyers/627597).
```{r echo=FALSE, fig.cap = "Source: RPubs.com"}
knitr::include_graphics('photos/joins.jpg')
```
```{r echo=FALSE, fig.cap = "Source: RPubs.com"}
knitr::include_graphics('photos/joins1.jpg')
```
## Additional commands for joins
Additionally, you can specify which common columns to match.
```{r, eval=FALSE}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"), y = 1:5)
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"), z = 10:12)
# match with column 'x'
df1 %>% left_join(df2, by = "x")
df3 <- tidytable::data.table(a = c("john","kevin","chris","sam","sam"), y = 1:5)
df4 <- tidytable::data.table(b = c("kevin","sam", "bob"), z = 10:12)
# matching with column having different names, a and b in this case
df3 %>% left_join(df4, by = c("a" = "b"))
```
## Set operations
Similar to the mutating join functions that we had seen, there are different functions related to set theory operations.
### intersect()
Outputs common rows in the dataset.
```{r}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"))
intersect(df1, df2)
```
### setdiff()
Outputs rows in first data frame but not in second data frame.
```{r}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"))
setdiff(df1, df2)
```
### union()
Outputs all the rows in both dataframes
```{r}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"))
union(df1, df2)
```
### setequal()
Checks whether two datasets have same number of rows.
```{r}
df1 <- tidytable::data.table(x = c("john","kevin","chris","sam","sam"))
df2 <- tidytable::data.table(x = c("kevin","sam", "bob"))
setequal(df1, df2)
```
## Summary
In this chapter, we have seen how to handle row names, how to combine columns and rows, what are mutating and filtering joins and various set operations. Thus to conclude this chapter, we have now learned almost all functions in the dplyr package and have seen how to manipulate data efficiently. With the knowledge of the pipe operator that we have seen in chapter 1, we are now equipped to write codes compactly and more clearly. I hope this chapter was useful for you and I will see you next time. Have a good day!
<br>
<a href="project5.html" class="btn button_round" style="float: left;">Previous chapter:<br> 2: Data manipulation using dplyr (part 1)</a>
<!-- adding share buttons on the right side of the page -->
<!-- AddToAny BEGIN -->
<div class="a2a_kit a2a_kit_size_32 a2a_floating_style a2a_vertical_style" style="right:0px; top:150px; data-a2a-url="https://jeweljohnsonj.github.io/jeweljohnson.github.io/" data-a2a-title="One-carat Blog">
<a class="a2a_button_twitter"></a>
<a class="a2a_button_whatsapp"></a>
<a class="a2a_button_telegram"></a>
<a class="a2a_button_google_gmail"></a>
<a class="a2a_button_pinterest"></a>
<a class="a2a_button_reddit"></a>
<a class="a2a_button_facebook"></a>
<a class="a2a_button_facebook_messenger"></a>
</div>
<script>
var a2a_config = a2a_config || {};
a2a_config.onclick = 1;
</script>
<script async src="https://static.addtoany.com/menu/page.js"></script>
<!-- AddToAny END -->
## References
1. Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R
package version 1.0.7. https://CRAN.R-project.org/package=dplyr
Here is the [link](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf) to the cheat sheet explaining each function in `dplyr`.
## Last updated on {.appendix}
```{r, echo=FALSE}
Sys.time()
```