# Software for modeling
**Learning objectives:**
- **Recognize the principles** around which the `{tidymodels}` packages were designed.
- Classify models as **descriptive, inferential,** and/or **predictive.**
- Define **descriptive model.**
- Define **inferential model.**
- Define **predictive model.**
- Differentiate between **supervised** and **unsupervised** models.
- Differentiate between **regression** and **classification** models.
- Differentiate between **quantitative** and **qualitative** data.
- Understand the **roles that data can have** in an analysis.
- Apply the **data science process.**
- Recognize the **phases of modeling.**
>The utility of a model hinges on its ability to be *reductive*. The primary influences in the data can be captured mathematically in a useful way, such as in a relationship that can be expressed as an equation.
<blockquote> <img src="https://www.tmwr.org/images/robot.png" class="robot"> There are two reasons that models permeate our lives today: an abundance of software exists to create models and it has become easier to record data and make it accessible. </blockquote>
## The pit of success
`{tidymodels}` aims to help us fall into the Pit of Success:
> The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.
- **Avoid confusion:** Software should facilitate proper usage.
- **Avoid mistakes:** Software should make it easy for users to do the right thing.
Examples of creating a pit of success (discussed in more detail later):
- internal consistency
- sensible defaults
- fail with meaningful error messages rather than silently producing incorrect results
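
The last point can be illustrated with a minimal base R sketch. The helper `rmse_checked()` is hypothetical (it is not part of `{tidymodels}` or any other package); it only shows a function that stops with an informative message where plain arithmetic would quietly recycle the shorter vector.

```r
# Hypothetical helper (not from any package): root mean squared error
# that refuses to run on inputs it cannot handle sensibly.
rmse_checked <- function(truth, estimate) {
  if (!is.numeric(truth) || !is.numeric(estimate)) {
    stop("`truth` and `estimate` must both be numeric.", call. = FALSE)
  }
  if (length(truth) != length(estimate)) {
    stop(
      "`truth` has length ", length(truth),
      " but `estimate` has length ", length(estimate), ".",
      call. = FALSE
    )
  }
  sqrt(mean((truth - estimate)^2))
}

# Without the check, R recycles the shorter vector and returns a number
# without any warning, because 4 is a multiple of 2.
silent_result <- sqrt(mean((c(1, 2, 3, 4) - c(1, 2))^2))

# With the check, the mistake surfaces immediately as a readable message.
result <- tryCatch(
  rmse_checked(c(1, 2, 3, 4), c(1, 2)),
  error = function(e) conditionMessage(e)
)
result
```

The design choice is to validate inputs up front, so the error message can name the arguments involved.
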
## Types of models
- **Descriptive models:** Describe or illustrate characteristics of data.
- **Inferential models:** Make some statement of truth regarding a predefined conjecture or idea.
  - Inferential techniques typically produce some type of probabilistic output, such as a p-value, confidence interval, or posterior probability.
  - There is usually a delay between making the inference and observing the actual result.
- **Predictive models:** Produce the most accurate possible prediction for new data. The problem is one of *estimation* ("How much?") rather than *inference* ("Will it?").
  - **Mechanistic models** are derived from first principles to produce a model equation.
    - They depend on the assumptions that define their model equations.
    - Unlike inferential models, it is easy to make data-driven statements about how well the model performs, based on how well it predicts the existing data.
  - **Empirically driven models** have vaguer assumptions and are derived directly from the data.
    - No theoretical or probabilistic assumptions are made about the equations or the variables.
    - The primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data.
<sub>1. Broader discussions of these distinctions can be found in Breiman ([2001b](https://www.tmwr.org/software-modeling.html#ref-breiman2001)) and Shmueli ([2010](https://www.tmwr.org/software-modeling.html#ref-shmueli2010)).</sub>
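
As a rough sketch of how the same data can serve descriptive, inferential, and predictive purposes, the base R example below uses the built-in `mtcars` dataset. The choice of dataset, variables, and the 3,000 lb "new car" are assumptions made for illustration only.

```r
# Built-in dataset: fuel economy (mpg) and weight (wt, in 1000 lbs) of 32 cars.
data(mtcars)

# Descriptive: summarize a characteristic of the data (mean mpg per cylinder count).
descriptive <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Inferential: examine a predefined conjecture ("weight is related to mpg")
# and report probabilistic output (a p-value and a confidence interval).
fit <- lm(mpg ~ wt, data = mtcars)
p_value <- summary(fit)$coefficients["wt", "Pr(>|t|)"]
conf_int <- confint(fit, "wt", level = 0.95)

# Predictive: estimate "how much?" for a car that is not in the data.
new_car <- data.frame(wt = 3.0)
predicted_mpg <- predict(fit, newdata = new_car)
```

The fitted object `fit` is the same in the last two steps; what changes is the question being asked of it.
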
## Terminology
- **Unsupervised models** are used to understand relationships between variables or sets of variables without an explicit relationship between variables and an outcome.
  - Examples: PCA, clustering, autoencoders.
- **Supervised models** have an outcome variable.
  - Examples: linear regression, decision trees, neural networks.
  - **Regression:** the outcome is numeric.
  - **Classification:** the outcome is an ordered or unordered set of qualitative values.
- **Quantitative** data: numbers.
- **Qualitative** (nominal) data: non-numbers.
  - *Qualitative data might still be coded as numbers, e.g. with one-hot encoding or dummy variable encoding.*
- Data can have different roles in analyses:
  - **Outcomes** (labels, endpoints, dependent variables): the value being predicted in supervised models.
  - **Predictors** (independent variables): the variables used to predict the outcome.
  - Identifiers
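
To make the note about coding qualitative data as numbers concrete, here is a small base R sketch. The two terms are often used interchangeably; the sketch assumes the stricter usage in which dummy variable encoding drops a reference level and one-hot encoding keeps one column per level. The toy `species` vector is made up for illustration.

```r
# A qualitative (nominal) predictor stored as a factor with three levels.
species <- factor(c("setosa", "versicolor", "virginica", "setosa"))

# Dummy variable encoding: k levels become k - 1 indicator columns
# (the first level is the reference), alongside an intercept column.
dummy <- model.matrix(~ species)

# One-hot encoding: removing the intercept yields one indicator column per level.
one_hot <- model.matrix(~ species - 1)

colnames(dummy)
colnames(one_hot)
```
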
Choosing a model type will depend on the type of question we want to answer / problem to solve and on the available data, among other things.
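
The terms above can be sketched with base R and the built-in `iris` data. The particular models (PCA, a linear model, and logistic regression), the variables, and the 0.5 probability cutoff are illustrative assumptions, not recommendations.

```r
data(iris)
measurements <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Unsupervised: PCA looks only at relationships among the variables; no outcome is used.
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Supervised regression: the outcome (Petal.Length) is numeric.
reg_fit <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)

# Supervised classification: the outcome is qualitative (two classes here).
two_class <- droplevels(subset(iris, Species != "setosa"))
cls_fit <- glm(Species ~ Petal.Length, data = two_class, family = binomial)

# glm() treats the first factor level as "failure", so the fitted probability
# refers to the second level.
pred_class <- ifelse(
  predict(cls_fit, type = "response") > 0.5,
  levels(two_class$Species)[2],
  levels(two_class$Species)[1]
)
```

In this sketch, `Species` plays the role of a predictor in the regression and of the outcome in the classification, which is why roles are assigned per analysis and are not fixed properties of a column.
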
## The data analysis process
1. Cleaning the data: investigate the data to make sure that they are applicable to the project goals, accurate, and appropriate.
2. Understanding the data: often referred to as exploratory data analysis (EDA). EDA brings to light how the different variables are related to one another, their distributions, typical ranges, and other attributes.
   - "How did I come by *these* data?"
   - "Are the data *relevant*?"
3. Developing clear expectations of the goal of your model and how performance will be judged ([Chapter 9](https://www.tmwr.org/performance.html)).
   - "What are the *performance metrics* and *realistic goals* for what can be achieved?"
::: {style="text-align:center;"}
![The data science process (from R for Data Science by Wickham and Grolemund).](https://www.tmwr.org/premade/data-science-model.svg)
:::
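
A first pass at the cleaning and understanding steps might look like the base R sketch below. The `mtcars` dataset and the three columns chosen are assumptions for illustration; a real project would look at far more than this.

```r
data(mtcars)
vars <- c("mpg", "wt", "hp")

# Distributions and typical ranges of each variable.
summary(mtcars[, vars])

# A basic accuracy check: count missing values in every column.
missing_per_column <- colSums(is.na(mtcars))

# How the variables are related to one another.
correlations <- cor(mtcars[, vars])
```
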
## The modeling process
::: {style="text-align:center;"}
![The modeling process.](https://www.tmwr.org/premade/modeling-process.svg)
:::
- **Exploratory data analysis:** Explore the data to see what they might tell you. (See the previous section.)
- **Feature engineering:** Create specific model terms that make it easier to accurately model the observed data. Covered in [Chapter 6](https://www.tmwr.org/recipes.html#recipes).
- **Model tuning and selection:** Generate a variety of models and compare performance.
  - Some models require **hyperparameter tuning**.
- **Model evaluation:** Use EDA-like analyses and compare model performance metrics to choose the best model for your situation.
The final model may be used to draw a conclusion and/or to produce predictions on new data.
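
The phases above can be sketched end to end in base R. Everything here is an illustrative assumption: the `mtcars` data, the 24/8 split, the two candidate formulas, and the locally defined `rmse()` helper. A real analysis would typically compare candidates with resampling, not a single held-out split.

```r
data(mtcars)
set.seed(123)  # make the random split reproducible

# Hold out part of the data so performance is judged on rows the models never saw.
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]

# Performance metric, defined locally for this sketch.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Feature engineering: the second candidate adds a log-transformed horsepower term.
candidates <- list(
  simple = mpg ~ wt,
  engineered = mpg ~ wt + log(hp)
)

# Model selection and evaluation: fit each candidate on the training rows,
# then compare error on the held-out rows.
test_rmse <- sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  rmse(test$mpg, predict(fit, newdata = test))
})
test_rmse
```

Keeping the candidates in a named list means another formula can be added without touching the evaluation code.
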
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/jrBiEppKt_0")`
<details>
<summary> Meeting chat log </summary>
```
00:10:57 Andrew G. Farina: Sorry guys, I have a sleeping baby in the room, so I am stuck with only chat tonight. Looking forward to the discussion though.
00:11:10 mayagans: Hi baby!!!
00:11:31 mayagans: (Hi everyone else too — Im also in a loud house right now super stoked for this!)
00:11:38 Jim Gruman: Hello everyone
00:11:39 Tony ElHabr: the chat is where all of the fun happens anyways!
00:11:59 Tan Ho: Obviously!
00:12:01 Scott Nestler: It's been way too long since I've seen many of you. Hope everyone is doing well. I'm excited for this.
00:12:04 Jeremy: Yep, I’ve got a puppy who believes she’s an attack dog going crazy so I’ll probably mute for a while
00:12:36 Tyler Grant Smith: on kid bath duty for the start of this
00:15:16 Yoni Sidi: It’s a gitbook!
00:15:37 Tan Ho: It's a book about a book
00:15:40 Tan Ho: classic Jon
00:15:46 Joe Sydlowski: Metabook
00:15:54 Scott Nestler: Very meta.
00:16:03 Tony ElHabr: presentation + book seems like it is prime for a package
00:16:17 Tony ElHabr: counting on jon to jump on that idea
00:26:44 shamsuddeen: The utility of a model hinges on its ability to be reductive. What is the meaning of this from the book?
00:28:03 Tony ElHabr: I think that means "a model should be interpretable"
00:28:28 Tony ElHabr: yeah, "simpler" is a better word
00:28:41 Yoni Sidi: Sparse model means less overfitting
00:28:41 shamsuddeen: sure
00:29:15 Gabriela Palomo: Perhaps it may also mean that a model uses a bunch of data and simplifies it in an equation or model?
00:29:29 Jacob Miller: As someone who is an intermediate user of caret, how useful would it be to switch completely over to tidymodels and not revert back to caret? Or are there benefits to using both consistently?
00:29:56 Gabriela Palomo: So in a way it's simpler to understand as well vs seeing all the raw data
00:30:09 Tony ElHabr: i feel you have much more "low-level" control with tidymodels
00:30:36 Scott Nestler: I had a similar question to Jacob's, but with regard to mle & mle3.
00:31:01 Tan Ho: Caret is broader but tidymodels is deeper (see yesterday's xkcd :P)
00:31:10 Arjun’s iPhone: you can mix tidymodels and caret.... preprocess using tidymodels and feed it to caret
00:32:12 Scott Nestler: TYPO ALERT. I meant mlr and mlr3.
00:32:18 Asmae Toumi: Agreed with David. For example, weighted RMSE stuff is only on caret (for now) and there’s a GitHub issue reply by max basically saying its too hard to add to tidy models right now. Either way tidy models seems the way to go to not be behind in say, 2 years, when its well developed
00:32:20 Conor Tompkins: My understanding is that caret is deprecated. It still works, but tidymodels is where its at now. Like dplyr in 2015. Not 100% coverage compared to base R or data.table, but heading in the right direction fast.
00:32:25 Maria: Yes, usemodels is great!
00:32:38 Conor Tompkins: Yeah not officially depreacted
00:34:30 mayagans: My only comment is that I love how many people are here!!!! I can only imagine the range of domain expertise in this “room” - I HATE ice breakers but do people want to throw in the chat what domain they want to write models in/why they’re reading this book? Im a pharma person but Im also obsessed with music analytics :) I look forward to seeing how presenters apply their chapters!
00:34:43 Connor Krenzer: The book says in the Empirically driven models section: "No theoretical or probabilistic assumptions are made about the sales numbers or the variables that are used to define similarity. In fact, the primary method of evaluating the appropriateness of the model is to assess its accuracy using existing data. If the structure of this type of model was a good choice, the predictions would be close to the actual values."
How does the significance of the model's variables play into the above? Let's take linear regression for example. Does this mean we are only supposed to care about R-square instead of p-values?
00:34:51 Yoni Sidi: https://github.com/topepo/workflowsets
00:35:41 Asmae Toumi: Sure Maya, great idea, my domain is healthcare/medtech and for fun, sports analytics!
00:36:44 Jordan Krogmann: Live and die by the pun :)
00:37:04 Kevin Kent: @Maya - I originally learned ML stuff in sklearn but do 80% of my work in R, so I’d like to move that all over to tidy models. I work in healthcare technology, in the devops area
00:37:26 Tony ElHabr: good question Connor. i think it depends on your intention. for an inferential model, the variables and p values matter more, but that's not to say that your model's R2 is "allowed" to be really bad. for a predictive model, it would all be about maximizing R2
00:37:38 Yoni Sidi: Modeling and simulation in pharma
00:37:59 Conor Tompkins: I don’t use modeling professionally, would like to get there. I use R to avoid using Excel. Very interested in sports data and civic hacking
00:38:12 Scott Nestler: I guide many students in capstone projects building models in all kinds of domains. Much of my own work is in sports analytics, either for fun or with some of our teams here on campus.
00:38:24 Tony ElHabr: electricity markets. sports for fun
00:38:36 Jonathan Trattner: Undergrad studying computational neuroscience!
00:38:43 Jonathan Trattner: I can volunteer for the tidyverse primer!
00:38:45 Jim Gruman: Im in industry/agriculture - marketing/geospatial/IoT events/survival
00:38:46 Maria: I also in healthcare/research
00:38:58 Tyler Grant Smith: im a predictive modeling actuary working in p&c insurance
00:39:02 Vasant M: @Connor Krenzer yes, the less you use p-values as a metric to assess models the better. R-square is one metric, but not always the most reliable one to use. For instance r-square doesn’t mean anything for non-linear models. I would rather depend on model accuracy to guide model building
00:39:05 Jonathan Trattner: Sounds good (:
00:39:08 Tan Ho: I work in homebuilding, so finance-ish data - and work on fantasy football data as well
00:39:13 Stephen - Computer - No Mic: Degree in Health Data Analytics - currently working on an automated trading algorithm (which is built with tidymodels)
00:39:29 Aashish Cheruvu: I’m a student and I’m interesting in healthcare analytics and tech
00:39:38 Miles Ott (he/him/his): Hi everyone! Excited to be here :)I am a stats/data science prof at Smith College and my work/research stuff is in social network analysis and sampling applied to public health
00:39:43 Vasant M: I am Bioinformatician - Work in Biomedical research, currently doing Lipidomics in Sleep Mediccine
00:39:51 shamsuddeen: Student interested in natural language processing
00:39:55 Andrew G. Farina: I am a grad candidate currently, trying to build a solid base in modeling to use in the future.
00:39:56 Stephen - Computer - No Mic: I have been using R to run a text messaging campaign for the Senate run-offs in Georgia recently
00:40:13 Tim Moloney: I work in environmental consulting, do a lot of geospatial and/or statistics analyses with R
00:40:19 Adrienne St Clair: Hi all, I'm a botanist and work in plant conservation in public parks. I am a nascent data nerd and want to learn all I can about data analysis.
00:40:21 Conor Tompkins: I am currently using tidymodels to build a model to predict house sale prices in Pittsburgh
00:40:34 Jonathan Leslie: I work in data science consulting...I work with businesses/government agencies to design data science projects.
00:40:43 Vasant M: @Stephen that’s very cool.
00:40:45 ErickKnackstedt: Business intelligence developer in the mental health/mindfulness space, no real modeling experience really excited to learn tidymodels
00:40:53 Andrew G (he/him): I work in App Analytics. Will be starting a new gig in app/game analytics soon. Historically modeling on the job has been few and far between so I’m looking forward to understanding best practices, workflow, etc…
00:41:23 Ben Gramza: Hi I'm Ben, I just graduated with a stats degree (and thus am unemployed and without a domain). I've done some work with COVID survey data and redistricting/gerrymandering in the past. I also keep up with the sports analytics scene in my free time.
00:41:53 Giovani Ferreira: Tech Team Leader here, data hobbyist, usually very interested in NLP and Topic Modelling, decided to use this bookclub to level my modelling skills
00:42:25 Jacob Miller: Senior studying stats, done actuarial consulting internships, and planning on grad school in stats. Sports analytics is the hobby/passion
00:43:01 mayagans: Aaaahhh so many cool domains!! Everyone is a bad ass wow - I selfishly hope everyone talks ties in the content with their passions and maybe Ill even know something about #SPORTS by the time we’re done LOL
00:44:00 Stephen - Computer - No Mic: Thanks @ Vasant !
00:44:21 Conor Tompkins: Deployment seems very domain specific
00:44:38 Conor Tompkins: Tech stack = domain
00:45:57 David Severski: Oh, do I have thoughts on cloudy… ;P
00:46:04 David Severski: S/cloudy/cloudyr/
00:46:32 Jordan Krogmann: https://github.com/wlandau/targets
00:46:39 tim: To get some more background in machine learning, in addition to learning tidymodels, any suggestions for books? I was thinking Applied Predictive Modeling - but keep changing my mind and need something to stick to. I guess it uses caret too? So that might be useful.
00:47:36 mayagans: ……Is Yoni in an aquarium of pizzas?
00:48:07 Tan Ho: asking the important questions :D
00:48:16 Scott Nestler: Responding to Tim's question … I'm currently working my way through Machine Learning with R, the tidyverse, and mlr (Rhys).
00:48:25 Vasant M: @Tim Statistical Learning PDF link http://www.ime.unicamp.br/~dias/Intoduction%20to%20Statistical%20Learning.pdf
00:48:34 Connor Krenzer: @tim I hear Introduction to Statistical Learning is a classic
00:49:06 Vasant M: @Tim if you like a course https://www.edx.org/course/statistical-learning
00:49:15 Tony ElHabr: i'm just glad we won't have to nag people to volunteer to do presentations since we have so many participants lol
00:49:21 ErickKnackstedt: https://dtkaplan.github.io/SM2-bookdown/preface-to-this-electronic-version.html
00:49:38 tim: Thanks, this is awesome! Now I just need to pick something and stick with it, haha
00:49:38 ErickKnackstedt: That book is legit
00:49:59 Jonathan Leslie: @Tim I second the recommendation for Introduction to Statistical Learning. It’s a great overview of different modelling approaches and how to interpret model outputs.
00:50:18 Miles Ott (he/him/his): nice to meet you all!
00:50:19 David Severski: Have a great one, everyone!
00:50:24 Jordan Krogmann: thanks take it ea y
00:50:28 Yoni Sidi: Bye and thanks for all the fish
00:50:29 Tan Ho: Cheers gang!
00:50:32 Aashish Cheruvu: Bye everyone and thank you
00:50:33 mayagans: Thanks Jon!!
00:50:36 Maria: Cheers!
00:51:03 Arjun’s iPhone: p
```
</details>
### Cohort 2
`r knitr::include_url("https://www.youtube.com/embed/nHZu-as7O7w")`
<details>
<summary> Meeting chat log </summary>
```
00:06:50 Stephen Holsenbeck: https://docs.google.com/spreadsheets/d/1vD4LG4_nhsxSAxXiBi42iKIvZXQtNxgB5C_PUkIZ0wo/edit#gid=0
00:08:48 Amélie Gourdon-Kanhukamwe: Questions (if forgotten): name (pronunciation), location, fun fact about yourself, why are you here?
00:21:23 Carmen Santana: when the pandemic is over you should come to Portugal, great wine here!!
00:21:50 Kevin Kent: There was a wine ratings tidytuesday a while ago https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-05-28
00:21:50 shamsuddeen: Hahhaha, yes. Specially Porto wine -:)
00:22:30 Layla Bouzoubaa: Yes to portugal
00:22:37 Layla Bouzoubaa: Thanks, Kevin!
00:26:36 Stephen Holsenbeck: https://github.com/r4ds/bookclub-tmwr
00:28:15 Amélie Gourdon-Kanhukamwe: I love Portugal too, but other fun fact: I am non-wine drinking French.
00:28:49 Layla Bouzoubaa: 😱
01:02:53 Kevin Kent: I really like fast ai’s chapter on ethics - https://github.com/fastai/fastbook/blob/master/03_ethics.ipynb (free by the way)
01:03:14 Carmen Santana: thanx
01:03:44 Layla Bouzoubaa: Thank you, kevin
01:04:14 Layla Bouzoubaa: Yes, let’s add it to google doc
01:07:34 Amélie Gourdon-Kanhukamwe: Possibly after FE too?
01:08:45 Layla Bouzoubaa: After
01:08:48 Layla Bouzoubaa: Agree with auust
01:08:51 Kevin Kent: After is ok with me
01:11:03 Layla Bouzoubaa: There was also rstudio conf from last year from googlebrain for ethics to predict lending
01:14:55 shamsuddeen: Cheers guys
```
</details>
### Cohort 3
`r knitr::include_url("https://www.youtube.com/embed/h8aGFPVj3C8")`
<details>
<summary> Meeting chat log </summary>
```
00:13:08 Morgan Grovenburg: Daniel I love your Blue's Clues background!
00:13:34 Daniel Chen: ( : #steve
00:13:56 Morgan Grovenburg: I miss Steve!
00:17:58 priyanka gagneja: nice
00:36:34 Daniel Chen: e.g., mechanistic model is maybe how gasses are absorbed in our cells at depth (in scuba diving) -- buhlmann decompression algorithm: https://en.wikipedia.org/wiki/B%C3%BChlmann_decompression_algorithm
00:37:00 Daniel Chen: ^ not sure, but that was the first thing that came to mind about something that is purely based on physical properties
00:39:13 Chris Martin: Similarly - mechanistic models made me think of models (may be systems of differential equations ...) of the chemical reactions taking place in the Earth's atmosphere.
00:41:01 Daniel Chen: for those who do not know "one-hot encoding" is the engineering/computer science term for "dummy variables" in statistics -- that took me a long time to realize
00:41:36 Morgan Grovenburg: Thank you Daniel! I had no idea!
00:42:10 priyanka gagneja: I need to run early today . See you all later
00:43:24 Chris Martin: Thanks, yes a new term to me too!
00:46:53 Daniel Chen: one-hot encoding is what they use in sci-kit learn in python. it confused the heck out of me when I was trying to learn how to fit models there
00:57:33 Morgan Grovenburg: I got to go. Thanks Ildiko!
00:59:22 Daniel Chen: https://github.com/r4ds/bookclub-tmwr
01:00:32 Daniel Chen: i can do next week
01:00:40 Ildiko Czeller: https://github.com/r4ds/bookclub-tmwr/blob/main/README.md
01:02:58 Daniel Chen: see/talk to everyone on slack :)
```
</details>
### Cohort 4
`r knitr::include_url("https://www.youtube.com/embed/93MbbtczA7M")`
<details>
<summary> Meeting chat log </summary>
```
00:32:07 Federica Gazzelloni: https://www.tmwr.org/index.html
00:32:34 Federica Gazzelloni: https://www.tidymodels.org/
00:34:44 Federica Gazzelloni: https://www.youtube.com/c/JuliaSilge
00:37:03 Federica Gazzelloni: https://github.com/r4ds/bookclub-tmwr
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/U8-Lj0AKgZw")`
<details>
<summary> Meeting chat log </summary>
```
00:09:51 Brandon Hurr: https://docs.google.com/spreadsheets/d/1-S1UbKWay_TeR5n9LkztZY2XXrMjZr3snl1srPvTvH4/edit#gid=0
00:45:33 Ryan Metcalf: What if we were to augment / change / modify the current variables and then repredict?
00:46:12 Brandon Hurr: Happens all the time in my world. You try and explain with the least amount of variables as you can and then dig more when you don’t cover all the variation in the data that you want.
00:46:36 Brandon Hurr: Even the imperfect models are considered “mechanistic”
01:03:01 Isabella Velásquez: And I’m in the US ! In Central time right now but will be in Pacific time soon - thank you so much for pushing back the time a bit 🙌
```
</details>