index.Rmd

--- 
title: "Data Wrangling with R"
author: "Bradley Boehmke"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: references.bib
biblio-style: apalike
link-citations: yes
github-repo: bradleyboehmke/uc-bana-7025
twitter-handle: bradleyboehmke
description: "Master the art of data wrangling with the R programming language."
---

`r if (knitr::is_latex_output()) '<!--'` 

# Syllabus {-}

```{block, type = "note"}
This is the primary "textbook" for the UC BANA 7025 Data Wrangling course. The following is a truncated syllabus; for the full syllabus along with complete course content please visit the online course content in [Canvas](https://uc.instructure.com/). 
```

Welcome to ___Data Wrangling with R___! This course provides an intensive, hands-on introduction to Data Wrangling with the R programming language. You will learn the fundamental skills required to acquire, munge, transform, manipulate, and visualize data in a computing environment that fosters reproducibility.

Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, janitor work, etc. can be a painstakingly laborious process. In fact, it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data [@tidy-data; @dasu2003exploratory]. However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it's essential that you become fluent *and* efficient in data wrangling techniques.

```{block, type='video'}
<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/1492301/sp/149230100/embedIframeJs/uiconf_id/49148882/partner_id/1492301?iframeembed=true&playerId=kaltura_player&entry_id=1_b0j6r2xv&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en_US&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=1_i0rvqaeq" width="640" height="610" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="BANA 7025 | Class Intro"></iframe>
```


## Learning Objectives {-}

This course will guide you through the data wrangling process along with give you a solid foundation of the basics of working with data in R. My goal is to teach you how to easily wrangle your data so you can spend more time focused on understanding the content of your data via visualization, modeling, and reporting your results. Upon successfully completing this course, you will be able to:

* Perform your data analysis in a literate programming environment
* Manage different types of data
* Manage different data structures
* Import and export data
* Index, subset, reshape and transform your data
* Compute descriptive statistics
* Visualize data
* Make your code efficient by using control statements & iteration
* Write your own functions
* Train and evaluate predictive models

...all with R!

```{block, type = "note"}
This course assumes no prior knowledge of R. Experience with programming concepts or another programming language will help, but is not required to understand the material. 
```

## Material {-}

The bulk of the classroom material will be provided via this book, the recorded lectures, and any additional resources provided via Canvas. In some cases there may be additional recommended readings, all of which are readily available online.

## Class Structure {-}

__Modules__: For this class each module is covered over the course of week. In the "Overview" section for each module you will find overall learning objectives, a short description of the learning content covered in that module, along with all tasks that are required of you for that module (i.e. quizzes, lab). Each module will have two or more primary lessons and associated quizzes along with a lab.

__Lessons__: For each lesson you will read and work through the tutorial. Short videos will be sprinkled throughout the lesson to further discuss and reinforce lesson concepts. Each lesson will have various "TODO" exercises throughout, along with end-of-lesson exercises. I highly recommend you work through these exercises as they will prepare you for the quizzes, labs, and project work.

__Quizzes__: There will be a short quiz associated with _each lesson_. These quizzes will be hosted in the course website on Canvas. Please check Canvas for due dates for these quizzes.

__Labs__: There will be a lab associated with _each module_. For these labs students will be guided through a case study step-by-step. The aim is to provide a detailed view on how to manage a variety of complex real-world data; how to convert real problems into data wrangling and analysis problems; and to apply R to address these problems and extract insights from the data. These labs will be provided via the course website on Canvas and the submission of these labs will also be done through the course website on Canvas. Please check Canvas for due dates for these labs.

__Project__: The final project is designed for you to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits. 
- It will provide you with more experience using data wrangling tools on real life data sets.
- It helps you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to find insights in data.
- It starts to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.


## Schedule {-}

```{block, type = "note"}
See the [Canvas](https://uc.instructure.com/) course webpage for a detailed schedule with due dates for quizzes, labs, etc.
```

| Module        | Description                                         |
|:-------------:|:----------------------------------------------------|
| **1**         | **Introduction**                                    |
|               | R fundamentals & the Rstudio IDE                    |
|               | Deeper understanding of vectors                     |
| **2**         | **Reproducible Documents and Importing Data**       |
|               | Managing your workflow and reproducibility          |
|               | Data structures & importing data                    |
| **3**         | **Tidy Data and Data Manipulation**                 |
|               | Data manipulation & summarization                   |
|               | Tidy data                                           |
| **4**         | **Relational Data and More Tidyverse Packages**     |
|               | Relational data                                     |
|               | Leveraging the Tidyverse to analyze text & date-time data   |
| **5**         | **Data Visualization & Exploration**                |
|               | Data visualization                                  |
|               | Exploratory data analysis                           |
| **6**         | **Creating Efficient Code in R**                    |
|               | Control statements & iteration                      |
|               | Writing functions                                   |
| **7**         | **Introduction to Applied Modeling**                |
|               | Introduction to tidymodels                          |
|               | Feature engineering & model evaluation/selection    |

## Conventions used in this book {-}

The following typographical conventions are used in this book:

* ___strong italic___: indicates new terms,
* __bold__: indicates package & file names,
* `inline code`: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
* code chunk: indicates commands or other text that could be typed literally by the user

```{r, first-code-chunk, collapse=TRUE}
1 + 2
```

In addition to the general text used throughout, you will notice the following cells that provide additional context for improved learning:

```{block, type = "video"}
A video that further discusses and demonstrates this topic is available.
```

```{block, type = "tip"}
A tip or suggestion that will likely produce better results.
```

```{block, type = "note"}
A general note that could improve your understanding but is not required for the course requirements.
```

```{block, type = "warning"}
Warning or caution to look out for.
```

```{block, type = "todo"}
Knowledge check exercises to gauge your learning progress.
```

## Feedback {-}

To report errors or bugs that you find in this course material please post an issue at https://github.com/bradleyboehmke/uc-bana-7025/issues. For all other communication be sure to use Canvas or the university email. 

```{block, type = "tip"}
When communicating with me via email, please always include **BANA7025** in the subject line.
```

## Acknowledgements {-}

This course and its materials have been influenced by the following resources:

- Jenny Bryan, [STAT 545: Data wrangling, exploration, and analysis with R](http://stat545.com/)
- Garrett Grolemund & Hadley Wickham, [R for Data Science](http://r4ds.had.co.nz/index.html)
- Stephanie Hicks, [Statistical Computing](https://www.stephaniehicks.com/jhustatcomputing2021/)
- Chester Ismay & Albert Kim, [ModernDive](https://moderndive.netlify.app/index.html)
- Alex Douglas et al., [An Introduction to R](https://intro2r.com/)