forked from OHI-Science/data-science-training
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathoverview.Rmd
134 lines (72 loc) · 7.97 KB
/
overview.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
# Overview {#overview}
Welcome.
In this training you will learn R, RStudio, Git, and GitHub. It's going to be fun and empowering. You will learn a reproducible workflow that can be used in research and analyses of all kinds, including Ocean Health Index assessments. This is really powerful, cool stuff, and not just for data: I made and published this book using those four tools and workflow.
We will practice learning three main things all at the same time: coding with best practices (R/RStudio), collaborative version control (Git/GitHub), and communication/publishing (RMarkdown/GitHub). This training will teach these all together to reinforce skills and best practices, and get you comfortable with a workflow that you can use in your own projects.
## What to expect
This is going to be a fun workshop.
The plan is to expose you to a lot of great tools that you can have confidence using in your research. You'll be working hands-on and doing the same things on your own computer as we do live on up on the screen. We're going to go through a lot in these two days and it's less important that you remember it all. More imporatantly, you'll have experience with it and confidence that you can do it. The main thing to take away is that there *are* good ways to approach your analyses; we will teach you to expect that so you can find what you need and use it! And, you can use these materials as a reference as you go forward with your analyses.
We'll be talking about :
- how to THINK about data. And not just any data; tidy data.
- how to increase reproducibility in your science
- how to more easily collaborate with others--including your future self!
- how the #rstats community is fantastic. The tools we're using are developed by real people. They are building great stuff and helping people of all skill-levels learn how to use it.
Everyone in this workshop is coming from a different place with different experiences and expectations. But everyone will learn something new here, because there is so much innovation in the data science world. Even instructors and helpers learn something new every time, from each other and from your questions. You are all welcome here and encouraged to help each other.
Are you familiar with some of this material already? Then focus on how you might teach it to others: A big part of this is not only you learning these skills, but increasing these practices in science as a whole.
Here are some important themes throughout (these are joke book covers):
![](img/practical_dev_both.png)
### Tidy data workflow
We will be learning about tidy data.
[**Hadley Wickham**](http://hadley.nz/) has developed a ton of the tools we'll use today.
Here's an overview of techniques to be covered in Hadley Wickham and Garrett Grolemund of RStudio's book [R for Data Science](http://r4ds.had.co.nz/):
![](img/r4ds_data-science.png)
We will be focusing on:
- **Tidy**: `tidyr` to organize rows of data into unique values
- **Transform**: `dplyr` to manipulate/wrangle data based on subsetting by rows or columns, sorting and joining
- **Visualize**:
- `ggplot2` static plots, using grammar of graphics principles
- **Communicate**
- dynamic documents with *R Markdown*
## Gapminder data
One of the most important things we hope you learn from this book is how to think about data separately from your own research context. Said in another way, you'll learn to distinguish your data questions from your research questions. Here, we are focusing on data questions, and we will use data that is not specific to your research. We learn through metaphor, and you will likely see parallels to your own data, which will ultimately help you in your research.
We'll be using the [gapminder dataset](http://www.gapminder.org/world), which represents the health and wealth of nations. It was pioneered by [Hans Rosling](https://www.ted.com/speakers/hans_rosling), who is famous for describing the prosperity of nations over time through famines, wars and other historic events with this beautiful data visualization in his [2006 TED Talk: The best stats you've ever seen](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen):
[Gapminder Motion Chart<br\>![](https://github.com/remi-daigle/2016-04-15-UCSB/raw/gh-pages/viz/img/gapminder-world_motion-chart.png)](http://www.gapminder.org/world)
While these data are not specifically oriented around your research, it is a fantastically rich data set with many parallels to data you may have and wrangling you will need to do. There are there are various indicators that are tracked across multiple study sites over many years.
## By the end of the course...
By the end of the course, you'll wrangle the gapminder data, and make your own graphics that you'll publish on a webpage you've built with GitHub and RMarkdown. Woop!
I made this training book with GitHub and RStudio's RMarkdown, which is what we'll be learning in the workshop.
## Prerequisites
Before the training, please make sure you have done the following:
1. Have up-to-date versions of `R` and RStudio and have RStudio configured with Git/GitHub
- Download and install R: https://cloud.r-project.org
- Download and install RStudio: http://www.rstudio.com/download
- Create a GitHub account: https://github.com *Note! Shorter names that kind of identify you are better, and use your work email!*
1. Get comfortable: if you're not in a physical workshop, be set up with two screens if possible. You will be following along in RStudio on your own computer while also watching a virtual training or following this tutorial on your own.
<!---
## Motivation
More often than not, there are more than one way to do things. I'm going to focus mostly on what I have ended up using day-to-day; I try to incorporate better practices as I come upon them but that's not always the case. RStudio has some built-in redundancy too that I'll try to show you so that you can approach things in different ways and ease in.
- based on literature: best and good enough practices
- also based on our team's experience of how to do better science in less time
## Collaboration
Everything we learn today is to going to help you collaborate with your most important collaborator — YOU. Science is collaborative, starting with Future You, your current collaborators, and anyone wanting to build off your science later on.
## Reproducibility
- record of your analyses.
- rerun them!
- modify them, maybe change a threshold, try a different coefficient, etc, maybe today
- modify them, make a new figure, in 6 months!
## Mindset
New but will become increasingly familiar. We’ll start you off with some momentum, like if you were going to learn to ride a bike or ...
Expect that there is a way to do what you want to do
- stop confounding data science with your science. Expect that someone has had your problem before or done what you want to do.
If you plan to program mostly in one particular language on a single platform (such as Mac or Windows), you might try an integrated development environment (IDE). IDEs integrate text editing, syntax highlighting, version control, help, build tools, and debugging in one interface, simplifying development.
http://r-bio.github.io/intro-git-rstudio/
## Data science is a discipline
It has theories, methods, and tools.
Tidyverse and Hadley’s graphic. Tidy data.
Going to teach you how to think differently, get into some of the theory but in the context of hands-on work.
--->
## Credit
This material builds from a lot of fantastic materials developed by others in the open data science community. In particular, it pulls from the following resources, which are highly recommended for further learning and as resources later on. Specific lessons will also cite more resources.
- [R for Data Science](http://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund
- [STAT 545](http://stat545.com/) by Jenny Bryan
- [Happy Git with R](http://happygitwithr.com) by Jenny Bryan
- [Software Carpentry](https://software-carpentry.org/lessons/) by the Carpentries