-
Notifications
You must be signed in to change notification settings - Fork 288
/
index.Rmd
113 lines (90 loc) · 7.76 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
knit: "bookdown::render_book"
title: "Tidy Modeling with R"
author: ["Max Kuhn and Julia Silge"]
date: "`r tmwr_version()`"
site: bookdown::bookdown_site
description: "The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. This book provides a thorough introduction to how to use tidymodels, and an outline of good methodology and statistical practice for phases of the modeling process."
github-repo: tidymodels/TMwR
twitter-handle: topepos
cover-image: images/cover.png
documentclass: book
classoption: 11pt
bibliography: [TMwR.bib]
biblio-style: apalike
link-citations: yes
colorlinks: yes
---
# Hello World {-}
<a href="https://amzn.to/35Hn96s"><img src="images/cover.png" width="350" height="460" alt="Buy from Amazon" class="cover" /></a>
Welcome to _Tidy Modeling with R_! This book is a guide to using a collection of software in the R programming language for model building called `r pkg(tidymodels)`, and it has two main goals:
- First and foremost, this book provides a practical introduction to **how to use** these specific R packages to create models. We focus on a dialect of R called [the tidyverse](https://www.tidyverse.org/) that is designed with a consistent, human-centered philosophy, and demonstrate how the tidyverse and the `r pkg(tidymodels)` packages can be used to produce high quality statistical and machine learning models.
- Second, this book will show you how to **develop good methodology and statistical practices**. Whenever possible, our software, documentation, and other materials attempt to prevent common pitfalls.
In Chapter \@ref(software-modeling), we outline a taxonomy for models and highlight what good software for modeling is like. The ideas and syntax of the tidyverse, which we introduce (or review) in Chapter \@ref(tidyverse), are the basis for the tidymodels approach to these challenges of methodology and practice. Chapter \@ref(base-r) provides a quick tour of conventional base R modeling functions and summarizes the unmet needs in that area.
After that, this book is separated into parts, starting with the basics of modeling with tidy data principles. Chapters \@ref(ames) through \@ref(performance) introduces an example data set on house prices and demonstrates how to use the fundamental tidymodels packages: `r pkg(recipes)`, `r pkg(parsnip)`, `r pkg(workflows)`, `r pkg(yardstick)`, and others.
The next part of the book moves forward with more details on the process of creating an effective model. Chapters \@ref(resampling) through \@ref(workflow-sets) focus on creating good estimates of performance as well as tuning model hyperparameters.
Finally, the last section of this book, Chapters \@ref(dimensionality) through \@ref(inferential), covers other important topics for model building. We discuss more advanced feature engineering approaches like dimensionality reduction and encoding high cardinality predictors, as well as how to answer questions about why a model makes certain predictions and when to trust your model predictions.
We do not assume that readers have extensive experience in model building and statistics. Some statistical knowledge is required, such as random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course. We do assume that the reader is at least slightly familiar with dplyr, ggplot2, and the `%>%` "pipe" operator in R, and is interested in applying these tools to modeling. For users who don't yet have this background R knowledge, we recommend books such as [*R for Data Science*](https://r4ds.had.co.nz/) by Wickham and Grolemund (2016). Investigating and analyzing data are an important part of any model process.
This book is not intended to be a comprehensive reference on modeling techniques; we suggest other resources to learn more about the statistical methods themselves. For general background on the most common type of model, the linear model, we suggest @fox08. For predictive models, @apm and @fes are good resources. For machine learning methods, @Goodfellow is an excellent (but formal) source of information. In some cases, we do describe the models we use in some detail, but in a way that is less mathematical, and hopefully more intuitive.
## Acknowledgments {-}
```{r, eval = FALSE, echo = FALSE}
library(tidyverse)
contribs_all_json <- gh::gh("/repos/:owner/:repo/contributors",
owner = "tidymodels",
repo = "TMwR",
.limit = Inf
)
contribs_all <- tibble(
login = contribs_all_json %>% map_chr("login"),
n = contribs_all_json %>% map_int("contributions")
)
contribs_old <- read_csv("contributors.csv", col_types = list())
contribs_new <- contribs_all %>% anti_join(contribs_old, by = "login")
# Get info for new contributors
needed_json <- map(
contribs_new$login,
~ gh::gh("/users/:username", username = .x)
)
info_new <- tibble(
login = contribs_new$login,
name = map_chr(needed_json, "name", .default = NA),
blog = map_chr(needed_json, "blog", .default = NA)
)
contribs_new <- contribs_new %>% left_join(info_new, by = "login")
contribs_all <- bind_rows(contribs_old, contribs_new) %>% arrange(login)
write_csv(contribs_all, "contributors.csv")
```
We are so thankful for the contributions, help, and perspectives of people who have supported us in this project. There are several we would like to thank in particular.
We would like to thank our RStudio colleagues on the `r pkg(tidymodels)` team (Davis Vaughan, Hannah Frick, Emil Hvitfeldt, and Simon Couch) as well as the rest of our coworkers on the RStudio open source team. Thank you to Desirée De Leon for the site design of the online work. We would also like to thank our technical reviewers, Chelsea Parlett-Pelleriti and Dan Simpson, for their detailed, insightful feedback that substantively improved this book, as well as our editors, Nicole Tache and Rita Fernando, for their perspective and guidance during the process of writing and publishing.
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)
contributors <- read.csv("contributors.csv", stringsAsFactors = FALSE)
contributors <- contributors %>%
filter(!login %in% c("topepo", "juliasilge", "dcossyleon")) %>%
mutate(
login = paste0("\\@", login),
desc = ifelse(is.na(name), login, paste0(name, " (", login, ")"))
)
cat("This book was written in the open, and multiple people contributed via pull requests or issues. Special thanks goes to the ", xfun::n2w(nrow(contributors)), " people who contributed via GitHub pull requests (in alphabetical order by username): ", sep = "")
cat(paste0(contributors$desc, collapse = ", "))
cat(".\n")
```
## Using Code Examples {-}
```{r pkg-list, echo = FALSE}
deps <- desc::desc_get_deps()
pkgs <- sort(deps$package[deps$type == "Imports"])
pkgs <- sessioninfo::package_info(pkgs, dependencies = FALSE)
df <- tibble::tibble(
package = pkgs$package,
version = pkgs$ondiskversion,
source = pkgs$source
) %>%
mutate(
source = stringr::str_split(source, " "),
source = purrr::map_chr(source, ~ .x[1]),
info = paste0(package, " (", version, ", ", source, ")")
)
pkg_info <- knitr::combine_words(df$info)
```
This book was written with [RStudio](http://www.rstudio.com/ide/) using [bookdown](http://bookdown.org/). The [website](https://tmwr.org) is hosted via [Netlify](http://netlify.com/), and automatically built after every push by [GitHub Actions](https://help.github.com/actions). The complete source is available on [GitHub](https://github.com/tidymodels/TMwR). We generated all plots in this book using [ggplot2](https://ggplot2.tidyverse.org/) and its black and white theme (`theme_bw()`).
This version of the book was built with `r R.version.string`, [pandoc](https://pandoc.org/) version `r rmarkdown::pandoc_version()`, and the following packages: `r pkg_info`.