# Review of R and New tricks
```{r, echo = FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(knitr)
library(blogdown)
library(stringr)
library(tweetrmd)
library(emo)
library(tufte)
library(cowplot)
library(lubridate)
library(ggthemes)
library(ggforce)
library(DT)
library(datasauRus)
library(ggridges)
library(palmerpenguins)
options(crayon.enabled = FALSE)
iris <- tibble(iris)
```
By now, we have hopefully gotten the hang of getting things done in `R`. But we're all in a different place. Some of us still have some fundamental things we don't understand. Others feel ok on the whole, but there might be something that you don't quite get, or a bunch of confusing errors that you haven't understood.
Maybe you have `R` questions related to work you have done outside of this class, or you are curious about base R.
Or you want to make nice reproducible reports in `RMarkdown` and/or learn how to organize an `RProject`, or make a collaborative open project on GitHub.
This lesson is designed to give you a chance to catch up on `R`, connect the dots between what you have learned, review, and/or learn new things. This should help you prepare for your first major assignment of an exploratory data analysis, due on Monday.
## When `R` goes wrong
`R` can be fun, useful and empowering, but it can also be a bummer. Sometimes it seems like you've done everything right, but things still aren't working. Other times, you just can't figure out how to do what you want. When this happens it is natural to tense up, curse, throw $h!t etc... While some frustration is OK, you should try to avoid sulking. Here's what I do.
1. Take a **deep breath**.
2. **Copy and paste and try again**. If that doesn't work, go to 3.
3. **Look at the code for any common issues** (like those in Section \@ref(gotcha), below). If I see one of these issues, I fix it and feel like a champ. If not, I go to step 4.
4. Remember that **google is my friend**, use it. I usually copy the error message R gives and paste it into google. Does this give me a path towards solving this? If so give it a shot, if not, go to 5.
5. **Take a walk** around the house, grab a drink of water, get away from the computer for a minute. Then think about what I was trying to do and return to the computer to see if I've done it right and if that time away allowed me to see mistakes I missed before. If I see the issue, I fix it and feel like a champ. If not, I go to step 6.
6. **Move on, do something else** and come back to it later (7). If you have more R to work on, try that (unless you need a break). If you're not going to do more R, you should probably close your RStudio session.
7. OK, back to it. Reopen RStudio and work through your code. Do you see the issue now? If so fix it, if not move on to 8.
8. **Explain the issue** to a friend / peer. I often figure out what I did wrong, when I explain it. Like "I said add 5+5 and it kept giving me 10, when the answer should have been 25." And then I realize I added when I wanted to multiply. Or maybe your friend figures it out.
9. How important is this thing? Can I **do something slightly different that is good enough**? If so I try that. If it's essential, I go to step 10.
10. **Find an expert** or [stackoverflow](https://stackoverflow.com/) or something.
```{r, echo=FALSE}
include_tweet("https://twitter.com/allison_horst/status/1213275783675822080?s=20")
```
#### Run only a few lines
When you type more complex stuff, errors are bound to show up somewhere. I suggest running each line one at a time (or at least running a few lines at a time) to unstick yourself if you find an error (Figure \@ref(fig:afew)).
```{r afew, fig.cap = 'You can run a few lines by highlighting what you want to run (make sure not to end on a pipe %>%).', message=FALSE, warning=FALSE}
include_graphics("images/run a few lines.jpeg")
```
### Warnings and Errors, Mistakes and Impasses
Before digging into the common R errors, let's go over the four ways R can go wrong.
1. **A warning.** We did something that got `R` nervous, and it gives us a brief message. It is possible everything is ok, but just have a look. I think of this as a yellow light. The most common warning I get is the harmless `summarise() ungrouping output (override with .groups argument)`.
2. **An error.** We did something that R does not understand or cannot do. R will stop working and make us fix this. I think of this as a red light.
3. **A mistake.** Our communication with R broke down -- it thought we were doing one thing but we wanted it to do another. Mistakes are the most likely to cause a big problem. So, remember that just because R works without an error or warning, does not mean it did what we hoped.
4. **An impasse.** There's something we want to do, but can't figure it out.
Be mindful of these types of issues with R as you code and as you read the common errors below.
### Common gotchas {#gotcha}
I share my most common mistakes, below. I note that these are my common mistakes. If you find that you often make different sorts of mistakes, email me with them, and I'll add them.
#### Spelling / capitalization etc
`R` can't read your mind <span style="color:LightGrey">(although tab-completion in RStudio is awesome)</span>, and pays attention to capital letters. Spelling errors are my most common mistake. For example, the column `Sepal.Length` in the `iris` dataset has sepal lengths for a bunch of individuals. Trying to select the column by typing any one of the options below will yield a similar error:
```{r, eval=FALSE}
dplyr::select(iris, sepal.length)
dplyr::select(iris, Sepal_Length)
dplyr::select(iris, Sepal.Lngth)
```
```{r, error=TRUE, echo=FALSE}
dplyr::select(iris, Sepal.Lngth)
```
Similarly, you might misspell the function:
```{r, error=TRUE}
dplyr::selct(iris, Sepal.Length)
```
So, check for these mistakes and fix the code above to look like this:
```{r, eval=FALSE}
dplyr::select(iris, Sepal.Length)
```
```{r, echo=FALSE}
dplyr::select(iris, Sepal.Length) %>% head()
```
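When I suspect a spelling problem, one quick check (a sketch, not the only way) is to print the exact column names and test my guess against them. The name `column_names` is made up for this illustration, and `datasets::iris` is the built-in copy of the data.

```{r}
# Print the exact column names so we can copy their spelling and capitalization
column_names <- names(datasets::iris)
column_names

# %in% asks whether a name appears in that list
"Sepal.Length" %in% column_names   # spelled and capitalized correctly
"sepal.length" %in% column_names   # wrong capitalization, so no match
```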
```{block2, type='rmdtip'}
When it comes to spelling errors, a helpful hint might be to have a consistent (as possible) way of naming vectors, e.g. always_separate_with_underscores, or AlwaysCapitalizeTheFirstLetter, or always.separate.with.periods
```
#### Confusing `==` and `=`
Consecutive equals signs `==` ask if the thing on the right equals the thing on the left. For example, if I want to know if two equals six divided by two, I type `2 == (6/2)` and R says `r 2 == (6/2)`. But what if I accidentally type `2 = (6/2)`? In this case R gets very confused.
```{r, error=TRUE}
2 = (6/2)
```
This confusion arises because we told R to make two equal six divided by two, which is nonsense.
```{r, }
two <- 2
```
But it could be worse than nonsense. Say the value two is assigned to `two`, and now we ask if `two` equals (6/2). Asking with a `==`, as in `two == (6/2)`, gives us our expected answer: `r two == (6/2)`. But typing `two = 6/2` does not ask if `two` equals `6/2`, rather it tells R that `two` equals `6/2` for now, returning unexpected results like that below.
```{r, echo=FALSE}
two = 6/2
```
```{r}
two^2
```
**This is one of many cases in which `R` does what we say and does not warn of an error, but does not do what we hoped it would.**
One more note while we're here: **The clarity and utility of R's error messages vary tremendously.**
For example, if we confuse `=` and `==` inside the [`filter()`](https://dplyr.tidyverse.org/reference/filter.html) function in the [`dplyr`](https://dplyr.tidyverse.org/) package, R gives a very useful error message. Say we mess up in asking R to only return data for *Iris setosa* from the `iris` data set.
```{r, error = TRUE}
filter(.data = iris, Species = "setosa")
```
By contrast doing a similar operation in base R <span style="color:LightGrey">(which you may have seen previously, but we don't cover in this course) </span>, yields a less clear error message:
```{r, error = TRUE}
iris[ iris$Species = "setosa", ]
```
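For comparison, here is a sketch of the version that asks the question we meant to ask, using `==` inside `filter()`. The name `setosa_only` is made up for this example.

```{r}
library(dplyr)

# == asks "is Species equal to setosa?" for every row,
# and filter() keeps the rows where the answer is TRUE
setosa_only <- filter(.data = iris, Species == "setosa")

# Every remaining row should be Iris setosa
nrow(setosa_only)
unique(as.character(setosa_only$Species))
```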
#### Confusing `=` and `<-`
As we've seen throughout the course, we use the *assignment operator*, `<-`, to assign values to variables. But inside a function call we use the `=` sign to assign values to the function's arguments. Using `<-` inside a function call does a bunch of bad things.
Say we wanted to sample a letter from the alphabet at random. Typing `sample(x = letters, size =1)`, will do the trick and will not assign any value to x outside of that function.
```{r, error=TRUE}
#### DO THIS
sample(x = letters[1:10], size =1)
x
```
The *error* above is a good thing -- we wanted to sample letters, not have `x` equal the letters. By contrast, using `<-` to assign values to arguments has bad consequences.
```{r, error=TRUE}
#### DONT DO THIS
sample(x <- letters, size =1)
x
```
This is BAD. We did not want to assign the alphabet to `x`, we just wanted the [`sample()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html) function to sample from the alphabet. Bottom line:
- Use the `=` sign to assign values to arguments in a function.
- Use `<-` to assign values to variables outside of a function.
- Use `==` to ask if one thing equals another.
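The three rules above can be sketched in one short example. The names `my_letter` and `is_it_a` are made up for illustration.

```{r}
# <- assigns the result to a variable outside of the function
# =  assigns values to the arguments x and size inside sample()
my_letter <- sample(x = letters, size = 1)

# == asks a question and returns TRUE or FALSE
is_it_a <- my_letter == "a"

my_letter
is_it_a
```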
#### Dealing with missing data
Often our data includes missing values. When we do math on a vector including missing values, `R` will return `NA` unless we tell it to do something else. See below for an example:
```{r}
my_vector <- c(3,1,NA)
mean(my_vector)
```
Depending on the function, we have different ways of telling R what to do with missing data. In the [`mean()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html) function, we type:
```{r}
mean(my_vector, na.rm = TRUE)
```
We have to be extra careful of missing values when doing math ourselves. If we found the mean of `my_vector` by dividing its sum (ignoring missing values) by its length, like this: `sum(my_vector, na.rm = TRUE) / length(my_vector) =` `r sum(my_vector, na.rm = TRUE) / length(my_vector)`, we would have the wrong answer, because `length()` still counts the missing value. So be careful and avoid this mistake.
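If we do want to compute a mean by hand, a minimal sketch of one safe approach is to count only the values that are not missing. The names below (`heights_with_na`, `n_observed`, `manual_mean`) are made up for this example.

```{r}
heights_with_na <- c(3, 1, NA)

# is.na() flags missing values, so !is.na() flags the values we actually observed
n_observed <- sum(!is.na(heights_with_na))

# Divide the sum of the observed values by the number of observed values
manual_mean <- sum(heights_with_na, na.rm = TRUE) / n_observed

manual_mean
manual_mean == mean(heights_with_na, na.rm = TRUE)
```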
#### Conflicts in function names
Function names within a single package must be unique. But functions in different packages can have the same name and do different things. This means we might be using a totally different function than we think. If we're lucky this results in an error, and we can fix it. If we're unlucky, this results in a bad mistake.
We can avoid this mistake by typing the package name and two colons before the function name (e.g. `dplyr::filter`) whenever we use a function. But this is quite tedious. An alternative is to install and load the [`conflicted` package](https://www.tidyverse.org/blog/2018/06/conflicted/), which tells us when we use a function name that is shared by more than one loaded package by raising an error that we can fix!
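As a sketch of how different two same-named functions can be, compare `dplyr::filter()` with `stats::filter()`. The tibble `toy_values` and the other names below are made up for this illustration.

```{r}
library(dplyr)

toy_values <- tibble(id = 1:4, value = c(10, 20, 30, 40))

# dplyr::filter() keeps the rows of a tibble that satisfy a condition
big_values <- dplyr::filter(toy_values, value > 25)
big_values

# stats::filter() is an unrelated function: it applies a moving (linear) filter
# to a numeric series and returns a time series object
smoothed <- stats::filter(toy_values$value, filter = rep(1/2, 2))
class(smoothed)
```

Writing the package name explicitly, as above, removes any doubt about which `filter()` is being called.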
#### Mistakes in assignment
I often mess up in assigning values to variables. I do so in a few different ways:
- I forget to assign a variable to memory,
- I use the variable before it's assigned,
- I don't update my assignment after doing something, or
- I overwrite my old assignment.
I'll show you what I mean and how to spot and fix these common issues...
##### **Mistakes in assignment:** Not in memory
```{r nomem, echo=FALSE, fig.cap='I did not assign the value one to x.', out.width='60%', out.extra='style="float:right; padding:10px"'}
include_graphics("images/nox.jpeg")
```
As the example in Figure \@ref(fig:nomem) shows, typing a value into an `R` script is not enough. We need to enter it into memory. You can do this by hitting `ctrl + enter` (or `command + return` on a Mac), by hitting the `run` button in the RStudio IDE, or by copying and pasting into the console window.
To see all the variables in `R`'s memory, check the environment tab in the RStudio IDE, or use the [`ls()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/ls.html) function with no arguments.
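Here is a small sketch of checking memory from code rather than from the environment tab. The variable name `snack_count` is made up for this example.

```{r}
# Assign a value so that it is stored in memory
snack_count <- 3

# ls() lists the names in memory; %in% asks if ours is among them
"snack_count" %in% ls()

# exists() asks the same question about a single name
exists("snack_count")
exists("a_name_i_never_assigned")
```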
##### **Mistakes in assignment:** Wrong order
If you want to square `x`, (as in Figure \@ref(fig:nomem)), you must assign it a value first. This mistake is quite similar to the one above.
```{r, eval=FALSE}
x^2
x <- 1
```
```{r, error=TRUE, echo=FALSE}
x^2
```
```{r, echo=FALSE}
x <-1
```
##### **Mistakes in assignment:** Not updating assignment
Another common mistake is to run some code but not assign the output to anything. For example, say we wanted to make a density plot of the ratio of petal width to sepal width. Can you spot the error in the code below?
```{r, error=TRUE,out.width='0%'}
iris %>%
  mutate(petal_to_sepal_width = Petal.Width / Sepal.Width)
ggplot(iris, aes(x = petal_to_sepal_width, fill = Species )) +
  geom_density(alpha = .5)
```
We clearly calculated `petal_to_sepal_width`, above, so why can't `R` find it? The answer is that we did not save the results. Let's fix this by assigning our new modifications to `iris`.
```{r,fig.height=1.4, fig.width=3}
iris <- iris %>%
  mutate(petal_to_sepal_width = Petal.Width / Sepal.Width)
ggplot(iris, aes(x = petal_to_sepal_width, fill = Species )) +
  geom_density(alpha = .5)
```
##### **Mistakes in assignment:** Overwriting assignments
Above, I showed how failing to reassign after doing some calculation can get us in trouble. But other times, reassigning can cause its own problems.
For example, let's say I want to calculate the [`mean()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html) petal to sepal width ratio for each species and then make the same plot as above.
```{r,fig.height=0.1, fig.width=3, error = TRUE, warning=FALSE}
iris <- iris %>%
  group_by(Species) %>%
  dplyr::summarize(mean_petal_to_sepal_width = mean(petal_to_sepal_width))
ggplot(iris, aes(x = petal_to_sepal_width, fill = Species )) +
  geom_density(alpha = .5)
```
So, what went wrong here? Let's take a look at what I did to `iris`:
```{r}
iris
```
Ooops, we just have species means. By combining summarise with a reassignment, we replaced the whole iris dataset with a summary of means. In this case, it's better to assign your output to a new variable...
```{r, echo=FALSE, message=FALSE, warning=FALSE}
rm(iris)
iris <- tibble(iris)
iris <- iris %>%
mutate(petal_to_sepal_width = Petal.Width / Sepal.Width)
```
```{r,fig.height=1.4, fig.width=3, warning=FALSE, message=FALSE}
iris_petal2sepalw_bysp <- iris %>%
  group_by(Species) %>%
  dplyr::summarize(mean_petal_to_sepal_width = mean(petal_to_sepal_width), .groups= "drop_last")
ggplot(iris, aes(x = petal_to_sepal_width, fill = Species )) +
  geom_density(alpha = .5)
```
No worries though. If you have your code well laid out, you can just rerun everything up to the point before you made this mistake and you're back in business.
**So when should we reassign to the old variable name, and when should we assign to a new name?** My rule of thumb is to reassign to the same variable when I add things to a tibble, but do not change existing data, while I assign to a new variable when values change or are removed <span style="color:LightGrey">(with some exceptions)</span>.
But what if **you wanted the mean in the same tibble as the initial data** so you could add lines for species means? You can do this with [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) instead of [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html).
```{r,fig.height=1.4, fig.width=3, warning=FALSE, message=FALSE}
iris <- iris %>%
  group_by(Species) %>%
  dplyr::mutate(mean_petal_to_sepal_width = mean(petal_to_sepal_width)) %>% # mutate() has no .groups argument
  ungroup()                                                                 # so we drop the grouping explicitly
ggplot(iris, aes(x = petal_to_sepal_width, fill = Species)) +
  geom_density(alpha = .5) +
  geom_vline(aes(xintercept = mean_petal_to_sepal_width, color = Species), lty = 2)
```
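To see the difference between the two verbs, here is a small sketch with a made-up tibble (`toy_widths` and its columns are hypothetical, not part of the iris data).

```{r}
library(dplyr)

toy_widths <- tibble(group = c("a", "a", "b"),
                     value = c(1, 3, 10))

# summarise() collapses the data to one row per group
toy_summary <- toy_widths %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

# mutate() keeps every row and repeats each group's mean next to the original data
toy_mutated <- toy_widths %>%
  group_by(group) %>%
  mutate(mean_value = mean(value)) %>%
  ungroup()

nrow(toy_summary)
nrow(toy_mutated)
```

The summarised tibble has one row per group, while the mutated tibble keeps all of the original rows.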
#### Just because you didn't get an error doesn't mean R is doing what you want (or think):
Say you want to reassign `x` to the value `10`, and you want `y` to equal $x^2$ (so `y` should be `100`). The code below messes this up by assigning the value `x^2` to `y` before setting `x` to `10`. This means that `y` uses the older value of `x`, which equals `1`, set above.
```{r}
y <- x^2
x <- 10
print(y)
```
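A sketch of the same calculation in the right order, using made-up names (`side_length`, `side_length_squared`) so as not to disturb `x` and `y`:

```{r}
# Assign the input first ...
side_length <- 10

# ... and only then compute the value that depends on it
side_length_squared <- side_length^2

side_length_squared
```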
#### Unbalanced parentheses, quotes, etc...
```{r,echo=FALSE, out.width='15%',out.extra='style="float:right; padding:10px"'}
include_graphics("images/keepon.jpeg")
```
Some things, like parentheses and quotes come in pairs. Too many or too few will cause trouble.
- If you have one quote or parenthesis open, without its partner (e.g. `"p`), R will wait for you to close it. With a short piece of code you can usually figure out where to place the partner. If the code is long and it's not easy to spot, hit `escape` to start that line of code over.
- Also, in a script, clicking the space after a parenthesis should highlight its partner and bring up a little X in the sidebar to flag missing parentheses.
- Unbalanced parentheses cause R to report an error (below).
```{r, error=TRUE}
c((1)
```
#### Common issues in ggplot
##### Using `%>%` instead of `+`
We see that tidyverse has two ways to take what it has and keep moving.
- When dealing with data, we pipe operations forward with the `%>%` operator. For example, to tell `R` to take our tibble and then pull out our desired column, we type `name_of_tibble %>% pull(var = name_of_column)`.
- When building a plot, we add elements to it with the `+` sign. For example, in a scatterplot, we type: `ggplot(data = <name_of_tibble>, aes(x=x_var, y = y_var)) + geom_point()`.
I occasionally confuse these and get errors like this:
```{r, error=TRUE}
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) %>%
  geom_point()
```
Or
```{r, error=TRUE}
iris +
  summarise(mean_petal_length = mean(Petal.Length))
```
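For contrast, here is a sketch of both operations with the operators in their proper places. I use `datasets::iris`, the untouched built-in copy of the data, and the names `petal_plot` and `petal_summary` are made up for this example.

```{r, fig.height=1.7, fig.width=3}
library(dplyr)
library(ggplot2)

# Plot layers are added with +
petal_plot <- ggplot(datasets::iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()
petal_plot

# Data operations are piped with %>%
petal_summary <- datasets::iris %>%
  summarise(mean_petal_length = mean(Petal.Length))
petal_summary
```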
##### Specifying a color in the `aes` argument
Say we want to plot the relationship between petal and sepal length in *Iris setosa* with points in <span style="color:Blue">blue</span>.
The code below is a common way to do this wrong.
```{r,fig.height=1.7, fig.width=3, warning=FALSE, message=FALSE}
filter(iris, Species == "setosa") %>%
  ggplot(aes(x=Petal.Length, y = Sepal.Length, color = "blue")) +
  geom_point()
```
The right way to do this is
```{r,fig.height=1.7, fig.width=2.15, warning=FALSE, message=FALSE}
filter(iris, Species == "setosa") %>%
  ggplot(aes(x=Petal.Length, y = Sepal.Length)) +
  geom_point(color = "blue")
```
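One way to see what goes wrong is sketched below with ggplot2's `layer_data()` helper and the built-in `datasets::iris`. Inside `aes()`, `"blue"` is treated as a data value to be mapped onto a color scale, so ggplot picks its own default color and adds a legend. Outside `aes()`, `"blue"` is taken literally. The names `setosa_rows`, `mapped_plot` and `literal_plot` are made up for this example.

```{r}
library(ggplot2)

setosa_rows <- subset(datasets::iris, Species == "setosa")

# "blue" inside aes(): treated as a one-level variable and mapped to a default color
mapped_plot <- ggplot(setosa_rows, aes(x = Petal.Length, y = Sepal.Length, color = "blue")) +
  geom_point()

# "blue" outside aes(): used literally as the point color
literal_plot <- ggplot(setosa_rows, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point(color = "blue")

# layer_data() shows the colors ggplot will actually draw
unique(layer_data(mapped_plot)$colour)
unique(layer_data(literal_plot)$colour)
```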
##### Fill vs color
There are two types of things to color in `R` -- the `fill` argument fills space with a color and the `color` argument colors lines and points. To demonstrate let's first make two histograms of *Iris setosa* petal length with decorative color:
```{r, fig.height=2, fig.width=5.75, echo=FALSE, warning=FALSE}
plotA <- filter(iris, Species == "setosa") %>%
  ggplot(aes(x=Petal.Length)) +
  geom_histogram(color = "blue", bins = 10)+
  labs(title = 'geom_histogram(color = "blue")')
plotB <- filter(iris, Species == "setosa") %>%
  ggplot(aes(x=Petal.Length)) +
  geom_histogram(fill = "blue", bins = 10)+
  labs(title = 'geom_histogram(fill = "blue")')
plot_grid(plotA, plotB, labels = c("a","b"))
```
*So we usually want to make histograms and density plots by specifying our desired color with `fill`. I usually add `color = "white"` to separate the bars in the histogram.*
Now let's try the same with a scatterplot of sepal length against petal length, mapping species onto color or fill.
```{r, fig.height=1.7}
plotA <- ggplot(iris, aes(x=Petal.Length, y = Sepal.Length, color = Species)) +
  geom_point()+ labs(title = 'aes(color = Species)')
plotB <- ggplot(iris, aes(x=Petal.Length, y = Sepal.Length, fill = Species)) +
  geom_point()+ labs(title = 'aes(fill = Species)')
# using plot_grid in the cowplot package to combine plots
plot_grid(plotA, plotB, labels = c("a","b"))
```
*So we usually want to make scatterplots and lineplots by specifying our desired color with `color`*.
##### Bar plots with count data
Imagine we wanted to plot hair color and eye color for men and women. Because this is count data, some form of bar chart would be good. If our data looks like this:
```{r, echo=FALSE, warning=FALSE}
hair_eye <- as_tibble(HairEyeColor) %>%
uncount(weight = n) %>%
sample_n(size = 592)
HairEyeColor <- as_tibble(HairEyeColor)
DT::datatable(hair_eye, options = list(pageLength = 5, lengthMenu = c(5, 30, 90)))
```
We can make a bar plot with `geom_bar`:
```{r, fig.height=1.7}
ggplot(hair_eye, aes(x = Sex, fill = Eye))+
  geom_bar(position = "fill", color = "white")+
  facet_wrap(~Hair, nrow = 1, labeller = "label_both")+
  scale_fill_manual(values = c("blue","brown","green","gold"))
```
But if our data looked like this:
```{r, echo=FALSE, warning=FALSE}
DT::datatable(HairEyeColor, options = list(pageLength = 5, lengthMenu = c(5, 30, 90)))
```
`geom_bar` would result in an error.
```{r, fig.height=.1, error = TRUE}
ggplot(HairEyeColor, aes(x = Sex, y = n, fill = Eye))+
  geom_bar(position = "fill", color = "white")+
  facet_wrap(~Hair, nrow = 1, labeller = "label_both")+
  scale_fill_manual(values = c("blue","brown","green","gold"))
```
We could overcome this error by using [`geom_col()`](https://ggplot2.tidyverse.org/reference/geom_bar.html) instead of [`geom_bar()`](https://ggplot2.tidyverse.org/reference/geom_bar.html), or by typing `geom_bar(position = "fill", color = "white", stat = 'identity')`.
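A minimal sketch of the `geom_col()` approach is below. To keep it self-contained I start from the built-in `datasets::HairEyeColor` table, which `as.data.frame()` converts to one row per combination with the counts in a column called `Freq` (rather than `n`). The name `hair_eye_counts` is made up for this example.

```{r, fig.height=1.7}
library(ggplot2)

# One row per Hair x Eye x Sex combination, with the count stored in Freq
hair_eye_counts <- as.data.frame(datasets::HairEyeColor)

# geom_col() uses the y aesthetic (here the counts in Freq) as the bar heights
ggplot(hair_eye_counts, aes(x = Sex, y = Freq, fill = Eye)) +
  geom_col(position = "fill", color = "white") +
  facet_wrap(~Hair, nrow = 1, labeller = "label_both")
```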
#### When you give `R` a tibble
Throughout this course, we deal mostly with data in [`tibbles`](https://r4ds.had.co.nz/tibbles.html), a really nice way to store a bunch of variables of different classes -- each as its own vector in a column. However, occasionally we need to `pull` a vector from its tibble. To do so, type:
```{r, eval=FALSE}
pull(.data = <name_of_tibble>, var = <name_of_column>)
## or
name_of_tibble %>%
  pull(var = name_of_column)
```
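As a concrete sketch of the template above, here is `pull()` applied to a small made-up tibble (`toy_plants` and its columns are hypothetical).

```{r}
library(dplyr)

toy_plants <- tibble(plant     = c("a", "b", "c"),
                     height_cm = c(1.5, 2.5, 3.5))

# pull() returns the column as a plain vector rather than a one-column tibble
plant_heights <- toy_plants %>%
  pull(var = height_cm)

plant_heights
mean(plant_heights)
```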
## Making Reproducible examples to get help
In R it's good to seek help, but great to help people help you. Watch this video on making reproducible examples, so that people can help you with your R (or you can help yourself).
```{r, echo=FALSE}
include_url("https://www.youtube.com/embed/G5Nm-GpmrLw")
```
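As a rough sketch of what a small reproducible example might look like (my own illustration, not taken from the video): it loads the packages it needs, builds a tiny data set anyone can recreate, and contains only the code that shows the puzzle. The data and names below are made up.

```{r}
library(dplyr)

# 1. A tiny, made-up data set that anyone can recreate by running this line
example_data <- tibble(group = c("a", "a", "b"),
                       value = c(1, NA, 3))

# 2. The smallest piece of code that shows the behavior I am asking about
example_summary <- example_data %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

# 3. The output, so helpers can see what I see:
#    why is mean_value missing for group "a"?
example_summary
```

Someone reading this can run it as is, see the same output, and quickly spot the missing value in group "a".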
## Readable and usable R code
Remember, the major benefits of `R` (or any scripting language) over, e.g., doing a bunch of calculations in Excel are:
1. You have a record of exactly what you did,
2. Which you can share with others,
3. And/or you can update / change as your analyses progress.
For this reason it is important to have a reliable way to go from the R code you wrote one day to the output you got that day. There are two broad strategies you could take to accomplish this -- you could save your work as a well organized script or you could write your code in RMarkdown (or comparably as an RNotebook). I discuss how to do each below.
### Saving well-organized R scripts
Saving your `R` script is a great way to keep a shareable, replicable, reusable and editable record of what you have done. However, simply saving your R script does not guarantee that you will achieve these goals. Below I have some tips about what to include in your R script, what to exclude, and examples of good and bad R scripts.
#### Things that should be in an R script
An R script should have all the commands and variable assignments etc that are necessary to reproduce your results. This includes loading the appropriate libraries, data sets etc etc etc.
Additionally, saved `R` scripts should be heavily commented (remember that comments start with `#` to tell R that we're not writing code). Our goal here is not just that someone could run our code and get our result, but that they could understand the intermediate steps and why we did them.
#### Things that should not be in an R script
Because we will often share our R scripts with others, it is generally bad practice to point to your home directory (that is, use R projects rather than `setwd()`).
It is also considered unfriendly to begin your code by clearing R's memory (do not start your code with `rm(list = ls())`). If you want to clear R's memory (which is often a very good idea), type `rm(list = ls())` in the console rather than in your saved script.
Additionally, you should only have commands in your saved scripts that are necessary to get through your analysis. So, for example, although we should use the `glimpse()` and `view()` functions extensively as we develop our analysis, there is no reason to include these functions in the code you save.
#### Examples of good and bad `R` scripts
So our goal in writing an R script is not just to have it work immediately, but to (1) have it work if we exited `R`, reopened it, and ran our code without thinking, and (2) have a sense of what the code was trying to do and how it was trying to do it.
Here is a bad `R` script. Note that this does not state our goal, it does not load the required library and it will not work if you simply run it. That is not to say this didn't work when you first coded it -- you could have had `tidyverse` loaded elsewhere, and you could have entered code into the console in an order which differed from how it is seen in your script. But it won't work as is.
```{r, eval=FALSE}
mean_iris_sepal_length
mean_iris_sepal_length <- summarise(grouped_iris_data, mean(Sepal.Length))
grouped_iris_data <- group_by(iris, Species)
```
Here is a good R script
```{r, eval=FALSE}
# Yaniv Brandvain
# Feb 6 2022
# Calculating means with group_by
library(tidyverse) # load the tidyverse library
# Today our goal is to calculate the mean Sepal.Length
# for each species in the iris data set and save it to
# mean_iris_sepal_length
grouped_iris_data <- iris %>%   # Starting with iris dataset
  group_by(Species)             # When dealing with this tibble, do commands separately for each Species
mean_iris_sepal_length <- grouped_iris_data %>%
  summarise(mean_sepal_length = mean(Sepal.Length))   # calculate the mean Sepal.Length
mean_iris_sepal_length          # print our results to console
```
### RMarkdown
```{r, echo=FALSE}
include_graphics("https://github.com/allisonhorst/stats-illustrations/blob/master/rstats-artwork/rmarkdown_wizards.png?raw=true")
```
RMarkdown is a file format that allows us to seamlessly combine text, R code, results and plots. You use RMarkdown by writing plain text interspersed with *code chunks*. See the video below (Figure \@ref(fig:rmarkdownoverview)) for a brief overview.
```{r rmarkdownoverview, fig.cap='A brief (4 min and 37 sec) overview of RMarkdown from Stat 545.', echo=FALSE, out.extra= 'allowfullscreen'}
include_url("https://www.youtube.com/embed/ZzDSkBgt9xQ" )
```
You can use RMarkdown to make pdfs, html, or word documents that you can share with peers, employers etc... RMarkdown is especially useful for communicating results of complex data analysis, as your numbers, figures, and code will all match. This also means that anyone (especially future you, see Fig. \@ref(fig:reproducible)) can recreate, learn from, and build off of your work.
```{r reproducible, fig.cap='Why make a reproducible workflow (A dramatic 1 min and 44 sec video).', echo=FALSE, out.extra= 'allowfullscreen'}
include_url("https://www.youtube.com/embed/s3JldKoA0zw")
```
Many students in this course like to turn in their homeworks as html documents generated by RMarkdown, because they can share their code, figures and ideas all in one place. Outside of class, the benefits are similar -- people can see your code and results as they read your explanation. RMarkdown is pretty flexible -- you can write lab reports, scientific papers, or even this book in RMarkdown.
To get started with RMarkdown, I suggest you click `File > New File > RMarkdown` and start exploring. For a longer introduction, check out [Chapter 27 of R for Data Science](https://r4ds.had.co.nz/r-markdown.html) [@grolemund2018]. Push on to [Chapter 2 of RMarkdown: The definitive guide](https://bookdown.org/yihui/rmarkdown/basics.html#basics) [@xie2018] to dig even deeper.
*A few RMarkdown tips:*
- You can control figure size by specifying `fig.height` and `fig.width`, and you can show the code or not with the `echo = TRUE` or `echo = FALSE` options in the beginning of your code chunk (`{r, fig.height = ..., fig.width = ..., echo = ...}`).
- The `DT` and `kableExtra` packages can help make prettier tables.
If you have the time and energy, I strongly recommend that you **turn in your first homework as an html generated by RMarkdown.**
```{r fig.cap='download the [RMarkdown cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)', echo=FALSE}
knitr::include_graphics('https://d33wubrfki0l68.cloudfront.net/374f4c769f97c4ded7300d521eb59b24168a7261/c72ad/lesson-images/cheatsheets-1-cheatsheet.png')
```
## R again Quiz
```{r echo = FALSE}
include_app("https://brandvain.shinyapps.io/rreview/")
```
```{r, echo=FALSE}
rm(list = ls())
```