18-data_wrangling.Rmd

# Appendix B: Data wrangling

In this lecture, we will take a look at how to wrangle data using the [dplyr](https://dplyr.tidyverse.org/) package. Again, getting our data into shape is something we'll need to do throughout the course, so it's worth spending some time getting a good sense for how this works. The nice thing about R is that (thanks to the `tidyverse`), both visualization and data wrangling are particularly powerful. 

## Learning goals

- Review R basics (incl. variable modes, data types, operators, control flow, and functions). 
- Learn how the pipe operator `%>%` works. 
- See different ways for getting a sense of one's data. 
- Master key data manipulation verbs from the `dplyr` package (incl. `filter()`, `arrange()`, `rename()`, `relocate()`, `select()`, `mutate()`) as well as the helper functions `across()` and `where()`.

## Load packages

Let's first load the packages that we need for this chapter. 

```{r, message=FALSE}
library("knitr")        # for rendering the RMarkdown file
library("skimr")        # for visualizing data 
library("visdat")       # for visualizing data 
library("DT")           # for visualizing data 
library("tidyverse")    # for data wrangling

opts_chunk$set(comment = "",
               fig.show = "hold")
```

## Some R basics

To test your knowledge of the R basics, I recommend taking the free interactive tutorial on datacamp: [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r). Here, I will just give a very quick overview of some of the basics. 

### Modes

Variables in R can have different modes. Table \@ref(tab:variable-modes) shows the most common ones. 

```{r variable-modes, echo=FALSE}
name = c("numeric", "character", "logical", "not available")
example = c("`1`, `3`, `48`",
            "`'Steve'`, `'a'`, `'78'`",
            "`TRUE`, `FALSE`","`NA`")
kable(x = tibble(name, example), 
      caption = "Most commonly used variable modes in R.",
      align = c("r", "l"),
      booktabs = TRUE)
```

For characters you can either use `"` or `'`. R has a number of functions to convert a variable from one mode to another. `NA` is used for missing values.

```{r}
tmp1 = "1" # we start with a character
str(tmp1) 

tmp2 = as.numeric(tmp1) # turn it into a numeric
str(tmp2) 

tmp3 = as.factor(tmp2) # turn that into a factor
str(tmp3)

tmp4 = as.character(tmp3) # and go full cycle by turning it back into a character
str(tmp4)

identical(tmp1, tmp4) # checks whether tmp1 and tmp4 are the same

```

The `str()` function displays the structure of an R object. Here, it shows us what mode the variable is. 

### Data types

R has a number of different data types. Table \@ref(tab:data-types) shows the ones you're most likely to come across (taken from [this source](https://www.statmethods.net/input/datatypes.html)): 

```{r data-types, echo=FALSE}
name = c("vector", "factor", "matrix", "array", "data frame", "list") 
description = c(
  "list of values with of the same variable mode",
  "for ordinal variables",
  "2D data structure",
  "same as matrix for higher dimensional data",
  "similar to matrix but with column names",
  "flexible type that can contain different other variable types"
  )
kable(x = tibble(name, description), 
      align = c("r", "l"),
      caption = "Most commonly used data types in R.",
      booktabs = TRUE)
```

#### Vectors

We build vectors using the concatenate function `c()`, and we use `[]` to access one or more elements of a vector.  

```{r}
numbers = c(1, 4, 5) # make a vector
numbers[2] # access the second element 
numbers[1:2] # access the first two elements
numbers[c(1, 3)] # access the first and last element
```

In R (unlike in Python for example), 1 refers to the first element of a vector (or list). 

#### Matrix

We build a matrix using the `matrix()` function, and we use `[]` to access its elements. 

```{r}
matrix = matrix(data = c(1, 2, 3, 4, 5, 6),
                nrow = 3,
                ncol = 2)
matrix # the full matrix
matrix[1, 2] # element in row 1, column 2
matrix[1, ] # all elements in the first row 
matrix[ , 1] # all elements in the first column 
matrix[-1, ] # a matrix which excludes the first row
```

Note how we use an empty placeholder to indicate that we want to select all the values in a row or column, and `-` to indicate that we want to remove something.

#### Array

Arrays work the same was as matrices with data of more than two dimensions. 

#### Data frame

```{r}
df = tibble(participant_id = c(1, 2, 3),
            participant_name = c("Leia", "Luke", "Darth")) # make the data frame 

df # the complete data frame
df[1, 2] # a single element using numbers 

df$participant_id # all participants 
df[["participant_id"]] # same as before but using [[]] instead of $

df$participant_name[2] # name of the second participant
df[["participant_name"]][2] # same as above
```

We'll use data frames a lot. Data frames are like a matrix with column names. Data frames are also more general than matrices in that different columns can have different modes. For example, one column might be a character, another one numeric, and another one a factor. 

Here we used the `tibble()` function to create the data frame. A `tibble` is almost the same as a data frame but it has better defaults for formatting output in the console (more information on tibbles is [here](http://r4ds.had.co.nz/tibbles.html)). 

#### Lists

```{r}
l.mixed = list(number = 1, 
               character = "2", 
               factor = factor(3), 
               matrix = matrix(1:4, ncol = 2),
               df = tibble(x = c(1, 2), y = c(3, 4)))
l.mixed

# three different ways of accessing a list
l.mixed$character
l.mixed[["character"]]
l.mixed[[2]] 
```

Lists are a very flexible data format. You can put almost anything in a list.

### Operators

Table \@ref(tab:logical-operators) shows the comparison operators that result in logical outputs. 

```{r logical-operators, echo=FALSE}
operators = c("`==`", "`!=`", "`>`, `<`", "`>=`, `<=`", "`&`, `|`, `!`", "`%in%`")
explanation = c("equal to", "not equal to", "greater/less than", 
                "greater/less than or equal", "logical operators: and, or, not", 
                "checks whether an element is in an object")
kable(tibble(symbol = operators, name = explanation), 
      caption = "Table of comparison operators that result in 
      boolean (TRUE/FALSE) outputs.", 
      booktabs = TRUE)
```

The `%in%` operator is very useful, and we can use it like so: 

```{r data-10}
x = c(1, 2, 3)
2 %in% x 
c(3, 4) %in% x
```

It's particularly useful for filtering data as we will see below. 

### Control flow

#### if-then {#if-else}

```{r}
number = 3

if (number == 1) {
  print("The number is 1.")
} else if (number == 2) {
  print("The number is 2.")
} else {
  print("The number is neither 1 nor 2.")
}
```

As a shorthand version, we can also use the `ifelse()` function like so: 

```{r}
number = 3
ifelse(test = number == 1, yes = "correct", no = "false")
```

#### for loop

```{r}
sequence = 1:10

for(i in 1:length(sequence)){
  print(i)
}
```

#### while loop

```{r}
number = 1 

while(number <= 10){
  print(number)
  number = number + 1
}
```

### Functions

```{r}
fun.add_two_numbers = function(a, b){
  x = a + b
  return(str_c("The result is ", x))
}

fun.add_two_numbers(1, 2)
```

I've used the `str_c()` function here to concatenate the string with the number. (R converts the number `x` into a string for us.) Note, R functions can only return a single object. However, this object can be a list (which can contain anything). 

#### Some often used functions

```{r, echo=FALSE}
name = c(
"`length()`",
"`dim()`",
"`rm()  `",
"`seq()`",
"`rep()`",
"`max()`",
"`min()`",
"`which.max()`",
"`which.min()`",
"`mean()`",
"`median()`",
"`sum()`",
"`var()`",
"`sd()`"
)
description = c(
"length of an object",
"dimensions of an object (e.g. number of rows and columns)",
"remove an object",
"generate a sequence of numbers",
"repeat something n times",
"maximum",
"minimum",
"index of the maximum",
"index of the maximum",
"mean",
"median",
"sum",
"variance",
"standard deviation"
)
kable(x = tibble(name, description), 
      caption = "Some frequently used functions.", 
      align = c("r", "l"),
      booktabs = TRUE)
```

### The pipe operator `%>%`

```{r, out.width = "80%", echo=FALSE, fig.cap="Inspiration for the `magrittr` package name."}
include_graphics("figures/pipe.jpg")
```

```{r, out.width = '40%', echo=FALSE, fig.cap="The `magrittr` package logo."}
include_graphics("figures/magrittr.png")
```

The pipe operator `%>%` is a special operator introduced in the `magrittr` package. It is used heavily in the tidyverse. The basic idea is simple: this operator allows us to "pipe" several functions into one long chain that matches the order in which we want to do stuff.  

Let's consider the following example of making and eating a cake (thanks to https://twitter.com/dmi3k/status/1191824875842879489?s=09). This would be the traditional way of writing some code: 

```{r, eval=F}
eat(
  slice(
    bake(
      put(
        pour(
          mix(ingredients),
          into = baking_form),
        into = oven),
      time = 30),
    pieces = 6),
  1)
```

To see what's going on here, we need to read the code inside out. That is, we have to start in the innermost bracket, and then work our way outward. However, there is a natural causal ordering to these steps and wouldn't it be nice if we could just write code in that order? Thanks to the pipe operator `%>%` we can! Here is the same example using the pipe: 

```{r, eval=F}
ingredients %>% 
  mix %>% 
  pour(into = baking_form) %>% 
  put(into = oven) %>% 
  bake(time = 30) %>% 
  slice(pieces = 6) %>% 
  eat(1)
```

This code is much easier to read and write, since it represents the order in which we want to do things! 

Abstractly, the pipe operator does the following: 

> `f(x)` can be rewritten as `x %>% f()`

For example, in standard R, we would write: 

```{r}
x = 1:3

# standard R 
sum(x)
```

With the pipe, we can rewrite this as: 

```{r}
x = 1:3

# with the pipe  
x %>% sum()
```

This doesn't seem super useful yet, but just hold on a little longer. 

> `f(x, y)` can be rewritten as `x %>% f(y)`

So, we could rewrite the following standard R code ... 

```{r}
# rounding pi to 6 digits, standard R 
round(pi, digits = 6)
```

... by using the pipe: 

```{r}
# rounding pi to 6 digits, standard R 
pi %>% round(digits = 6)
```

Here is another example: 

```{r}
a = 3
b = 4
sum(a, b) # standard way 
a %>% sum(b) # the pipe way 
```

The pipe operator inserts the result of the previous computation as a first element into the next computation. So, `a %>% sum(b)` is equivalent to `sum(a, b)`. We can also specify to insert the result at a different position via the `.` operator. For example:  

```{r}
a = 1
b = 10 
b %>% seq(from = a, to = .)
```

Here, I used the `.` operator to specify that I woud like to insert the result of `b` where I've put the `.` in the `seq()` function. 

> `f(x, y)` can be rewritten as `y %>% f(x, .)`

Still not to thrilled about the pipe? We can keep going though (and I'm sure you'll be convinced eventually.)

> `h(g(f(x)))` can be rewritten as `x %>% f() %>% g() %>% h()`

For example, consider that we want to calculate the root mean squared error (RMSE) between prediction and data. 

Here is how the RMSE is defined: 

$$
\text{RMSE} = \sqrt\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}
$$
where $\hat{y}_i$ denotes the prediction, and $y_i$ the actually observed value.

In base R, we would do the following. 

```{r}
data = c(1, 3, 4, 2, 5)
prediction = c(1, 2, 2, 1, 4)

# calculate root mean squared error
rmse = sqrt(mean((prediction-data)^2))
print(rmse)
```

Using the pipe operator makes the operation more intuitive: 

```{r}
data = c(1, 3, 4, 2, 5)
prediction = c(1, 2, 2, 1, 4)

# calculate root mean squared error the pipe way 
rmse = (prediction-data)^2 %>% 
  mean() %>% 
  sqrt() %>% 
  print() 
```

First, we calculate the squared error, then we take the mean, then the square root, and then print the result. 

The pipe operator `%>%` is similar to the `+` used in `ggplot2`. It allows us to take step-by-step actions in a way that fits the causal ordering of how we want to do things. 

> __Tip__: The keyboard shortcut for the pipe operator is:   
> `cmd/ctrl + shift + m`   
> __Definitely learn this one__ -- we'll use the pipe a lot!! 

> __Tip__: Code is generally easier to read when the pipe `%>%` is at the end of a line (just like the `+` in `ggplot2`).

A key advantage of using the pipe is that you don't have to save intermediate computations as new variables and this helps to keep your environment nice and clean! 

#### Practice 1

Let's practice the pipe operator. 

```{r}
# here are some numbers
x = seq(from = 1, to = 5, by = 1)

# taking the log the standard way
log(x)

# now take the log the pipe way (write your code underneath)
```

```{r}
# some more numbers
x = seq(from = 10, to = 5, by = -1)

# the standard way
mean(round(sqrt(x), digits = 2))

# the pipe way (write your code underneath)
```

## A quick note on naming things

Personally, I like to name things in a (pretty) consistent way so that I have no trouble finding stuff even when I open up a project that I haven't worked on for a while. I try to use the following naming conventions: 

```{r, echo=FALSE}
name = c("df.thing",
         "l.thing",
         "fun.thing",
         "tmp.thing")
use = c("for data frames",
        "for lists",
        "for functions",
        "for temporary variables")
kable(x = tibble(name, use), 
      caption = "Some naming conventions I adopt to make my life easier.", 
      align = c("r", "l"),
      booktabs = TRUE)
```

## Looking at data

The package `dplyr` which we loaded as part of the tidyverse, includes a data set with information about starwars characters. Let's store this as  `df.starwars`. 

```{r}
df.starwars = starwars
```

> Note: Unlike in other languages (such as Python or Matlab), a `.` in a variable name has no special meaning and can just be used as part of the name. I've used `df` here to indicate for myself that this variable is a data frame. 
Before visualizing the data, it's often useful to take a quick direct look at the data. 

There are several ways of taking a look at data in R. Personally, I like to look at the data within RStudio's data viewer. To do so, you can: 

- click on the `df.starwars` variable in the "Environment" tab  
- type `View(df.starwars)` in the console 
- move your mouse over (or select) the variable in the editor (or console) and hit `F2` 

I like the `F2` route the best as it's fast and flexible. 

Sometimes it's also helpful to look at data in the console instead of the data viewer. Particularly when the data is very large, the data viewer can be sluggish. 

Here are some useful functions: 

### `head()`

Without any extra arguments specified, `head()` shows the top six rows of the data. 

```{r}
head(df.starwars)
```

### `glimpse()`

`glimpse()` is helpful when the data frame has many columns. The data is shown in a transposed way with columns as rows. 

```{r}
glimpse(df.starwars)
```

### `distinct()`

`distinct()` shows all the distinct values for a character or factor column. 

```{r}
df.starwars %>% 
  distinct(species)
```

### `count()`

`count()` shows a count of all the different distinct values in a column. 

```{r}
df.starwars %>% 
  count(eye_color)
```

It's possible to do grouped counts by combining several variables.

```{r}
df.starwars %>% 
  count(eye_color, gender) %>% 
  head(n = 10)
```

### `datatable()`

For RMardkown files specifically, we can use the `datatable()` function from the `DT` package to get an interactive table widget.

```{r}
df.starwars %>% 
  DT::datatable()
```

### Other tools for taking a quick look at data

#### `vis_dat()`

The `vis_dat()` function from the `visdat` package, gives a visual summary that makes it easy to see the variable types and whether there are missing values in the data. 

```{r}
visdat::vis_dat(df.starwars)
```

```{block, type='info'}
When R loads packages, functions loaded in earlier packages are overwritten by functions of the same name from later packages. This means that the order in which packages are loaded matters. To make sure that a function from the correct package is used, you can use the `package_name::function_name()` construction. This way, the `function_name()` from the `package_name` is used, rather than the same function from a different package. 

This is why, in general, I recommend to load the tidyverse package last (since it contains a large number of functions that we use a lot).
```

#### `skim()`

The `skim()` function from the `skimr` package provides a nice overview of the data, separated by variable types. 

```{r}
# install.packages("skimr")
skimr::skim(df.starwars)
```

#### `dfSummary()`

The `summarytools` package is another great package for taking a look at the data. It renders a nice html output for the data frame including a lot of helpful information. You can find out more about this package [here](https://cran.r-project.org/web/packages/summarytools/index.html).

```{r, eval=FALSE}
df.starwars %>% 
  select(where(~ !is.list(.))) %>% # this removes all list columns
  summarytools::dfSummary() %>% 
  summarytools::view()
```

> Note: The summarytools::view() function will not show up here in the html. It generates a summary of the data that is displayed in the Viewer in RStudio. 

Once we've taken a look at the data, the next step would be to visualize relationships between variables of interest. 

## Wrangling data

We use the functions in the package `dplyr` to manipulate our data. 

### `filter()`

`filter()` lets us apply logical (and other) operators (see Table \@ref(tab:logical-operators)) to subset the data. Here, I've filtered out the masculine characters. 

```{r}
df.starwars %>% 
  filter(gender == "masculine")
```

We can combine multiple conditions in the same call. Here, I've filtered out masculine characters, whose height is greater than the median height (i.e. they are in the top 50 percentile), and whose mass was not `NA`. 

```{r}
df.starwars %>% 
  filter(gender == "masculine",
         height > median(height, na.rm = T),
         !is.na(mass))
```

Many functions like `mean()`, `median()`, `var()`, `sd()`, `sum()` have the argument `na.rm` which is set to `FALSE` by default. I set the argument to `TRUE` here (or `T` for short), which means that the `NA` values are ignored, and the `median()` is calculated based on the remaining values.

You can use `,` and `&` interchangeably in `filter()`. Make sure to use parentheses when combining several logical operators to indicate which logical operation should be performed first: 

```{r}
df.starwars %>% 
  filter((skin_color %in% c("dark", "pale") | sex == "hermaphroditic") & height > 170)
```

The starwars characters that have either a `"dark"` or a `"pale"` skin tone, or whose sex is `"hermaphroditic"`, and whose height is at least `170` cm. The `%in%` operator is useful when there are multiple options. Instead of `skin_color %in% c("dark", "pale")`, I could have also written `skin_color == "dark" | skin_color == "pale"` but this gets cumbersome as the number of options increases. 

### `arrange()`

`arrange()` allows us to sort the values in a data frame by one or more column entries. 

```{r}
df.starwars %>% 
  arrange(hair_color, desc(height))
```

Here, I've sorted the data frame first by `hair_color`, and then by `height`. I've used the `desc()` function to sort `height` in descending order. Bail Prestor Organa is the tallest black character in starwars. 

### `rename() `

`rename()` renames column names.

```{r}
df.starwars %>% 
  rename(person = name,
         mass_kg = mass)
```

The new variable names goes on the LHS of the`=` sign, and the old name on the RHS.  

To rename all variables at the same time use `rename_with()`: 

```{r}
df.starwars %>%
  rename_with(.fn = ~ toupper(.))
```

Notice that I used the `~` here in the function call. I will explain what this does shortly. 

### `relocate()`

`relocate()` moves columns. For example, the following piece of code moves the `species` column to the front of the data frame: 

```{r}
df.starwars %>% 
  relocate(species)
```

We could also move the `species` column after the name column like so: 

```{r}
df.starwars %>% 
  relocate(species, .after = name)
```

### `select()`

`select()` allows us to select a subset of the columns in the data frame. 

```{r}
df.starwars %>% 
  select(name, height, mass)
```

We can select multiple columns using the `(from:to)` syntax: 

```{r}
df.starwars %>%  
  select(name:birth_year) # from name to birth_year
```

Or use a variable for column selection: 

```{r}
columns = c("name", "height", "species")

df.starwars %>% 
  select(one_of(columns)) # useful when using a variable for column selection
```

We can also _deselect_ (multiple) columns:

```{r}
df.starwars %>% 
  select(-name, -(birth_year:vehicles))
```

And select columns by partially matching the column name:

```{r}
df.starwars %>% 
  select(contains("_")) # every column that contains the character "_"
```

```{r}
df.starwars %>% 
  select(starts_with("h")) # every column that starts with an "h"
```

We can rename some of the columns using `select()` like so: 

```{r}
df.starwars %>% 
  select(person = name, height, mass_kg = mass)
```

#### `where()`

`where()` is a useful helper function that comes in handy, for example, when we want to select columns based on their data type. 

```{r}
df.starwars %>% 
  select(where(fn = is.numeric)) # just select numeric columns
```

The following selects all columns that are not numeric: 

```{r}
df.starwars %>% 
  select(where(fn = ~ !is.numeric(.))) # selects all columns that are not numeric
```

Note that I used `~` here to indicate that I'm creating an anonymous function to check whether column type is numeric. A one-sided formula (expression beginning with `~`) is interpreted as `function(x)`, and wherever `x` would go in the function is represented by `.`.

So, I could write the same code like so: 

```{r}
df.starwars %>% 
  select(where(function(x) !is.numeric(x))) # selects all columns that are not numeric
```

For more details, take a look at the help file for `select()`, and this [this great tutorial](https://suzan.rbind.io/2018/01/dplyr-tutorial-1/) in which I learned about some of the more advanced ways of using `select()`.
 
### Practice 2

Create a data frame that: 
- only has the species `Human` and `Droid` 
- with the following data columns (in this order): name, species, birth_year, homeworld
- is arranged according to birth year (with the lowest entry at the top of the data frame)
- and has the `name` column renamed to `person`

```{r}
# write your code here 
```

### `mutate() `

`mutate()` is used to change existing columns or make new ones. 

```{r}
df.starwars %>% 
  mutate(height = height / 100, # to get height in meters
         bmi = mass / (height^2)) %>% # bmi = kg / (m^2)
  select(name, height, mass, bmi)
```

Here, I've calculated the bmi for the different starwars characters. I first mutated the height variable by going from cm to m, and then created the new column "bmi".

A useful helper function for `mutate()` is `ifelse()` which is a shorthand for the if-else control flow (Section \@ref(if-else)). Here is an example: 

```{r}
df.starwars %>% 
  mutate(height_categorical = ifelse(height > median(height, na.rm = T),
                                     "tall",
                                     "short")) %>% 
  select(name, contains("height"))
```

`ifelse()` works in the following way: we first specify the condition, then what should be returned if the condition is true, and finally what should be returned otherwise. The more verbose version of the statement above would be: `ifelse(test = height > median(height, na.rm = T), yes = "tall", no = "short")` 

In previous versions of `dplyr` (the package we use for data wrangling), there were a variety of additional mutate functions such as `mutate_at()`, `mutate_if()`, and `mutate_all()`. In the most recent version of `dplyr`, these additional functions have been deprecated, and replaced with the flexible `across()` helper function. 

#### `across()`

`across()` allows us to use the syntax that we've learned for `select()` to select particular variables and apply a function to each of the selected variables. 

For example, let's imagine that we want to z-score a number of variables in our data frame. We can do this like so: 

```{r}
df.starwars %>%  
  mutate(across(.cols = c(height, mass, birth_year),
                .fns = scale))
```

In the `.cols = ` argument of `across()`, I've specified what variables to mutate. In the `.fns = ` argument, I've specified that I want to use the function `scale`. Note that I wrote the function without `()`. The `.fns` argument expects allows these possible values: 

- the function itself, e.g. `mean`
- a call to the function with `.` as a dummy argument, `~ mean(.)` (note the `~` before the function call)
- a list of functions `list(mean = mean, median = ~ median(.))` (where I've mixed both of the other ways)

We can also use names to create new columns:

```{r}
df.starwars %>%  
  mutate(across(.cols = c(height, mass, birth_year),
                .fns = scale,
                .names = "{.col}_z")) %>% 
  select(name, contains("height"), contains("mass"), contains("birth_year"))
```

I've specified how I'd like the new variables to be called by using the `.names = ` argument of `across()`. `{.col}` stands of the name of the original column, and here I've just added `_z` to each column name for the scaled columns. 

We can also apply several functions at the same time. 

```{r}
df.starwars %>% 
  mutate(across(.cols = c(height, mass, birth_year),
                .fns = list(z = scale,
                            centered = ~ scale(., scale = FALSE)))) %>%
  select(name, contains("height"), contains("mass"), contains("birth_year"))
```

Here, I've created z-scored and centered (i.e. only subtracted the mean but didn't divide by the standard deviation) versions of the `height`, `mass`, and `birth_year` columns in one go. 

You can use the `everything()` helper function if you want to apply a function to all of the columns in your data frame. 

```{r}
df.starwars %>% 
  select(height, mass) %>%
  mutate(across(.cols = everything(),
                .fns = as.character)) # transform all columns to characters
```

Here, I've selected some columns first, and then changed the mode to character in each of them. 

Sometimes, you want to apply a function only to those columns that have a particular data type. This is where `where()` comes in handy! 

For example, the following code changes all the numeric columns to character columns:

```{r}
df.starwars %>% 
  mutate(across(.cols = where(~ is.numeric(.)),
                .fns = ~ as.character(.)))
```

Or we could round all the numeric columns to one digit: 

```{r}
df.starwars %>% 
  mutate(across(.cols = where(~ is.numeric(.)),
                .fns = ~ round(., digits = 1)))
```

### Practice 3

Compute the body mass index for `masculine` characters who are `human`.

- select only the columns you need 
- filter out only the rows you need 
- make the new variable with the body mass index 
- arrange the data frame starting with the highest body mass index 

```{r}
# write your code here 
```

### Summarizing data

OK, let's load the `starwars` data set again:

```{r}
df.starwars = starwars
```

A particularly powerful way of interacting with data is by grouping and summarizing it. `summarize()` returns a single value for each summary that we ask for:

```{r}
df.starwars %>%
  summarize(height_mean = mean(height, na.rm = T),
            height_max = max(height, na.rm = T),
            n = n())
```

Here, I computed the mean height, the maximum height, and the total number of observations (using the function `n()`).
Let's say we wanted to get a quick sense for how tall starwars characters from different species are. To do that, we combine grouping with summarizing:

```{r}
df.starwars %>%
  group_by(species) %>%
  summarize(height_mean = mean(height, na.rm = T))
```

I've first used `group_by()` to group our data frame by the different species, and then used `summarize()` to calculate the mean height of each species.

It would also be useful to know how many observations there are in each group.

```{r}
df.starwars %>%
  group_by(species) %>%
  summarize(height_mean = mean(height, na.rm = T),
            group_size = n()) %>%
  arrange(desc(group_size))
```

Here, I've used the `n()` function to get the number of observations in each group, and then I've arranged the data frame according to group size in descending order.

Note that `n()` always yields the number of observations in each group. If we don't group the data, then we get the overall number of observations in our data frame (i.e. the number of rows).

So, Humans are the largest group in our data frame, followed by Droids (who are considerably smaller) and Gungans (who would make for good Basketball players).

Sometimes `group_by()` is also useful without summarizing the data. For example, we often want to z-score (i.e. normalize) data on the level of individual participants. To do so, we first group the data on the level of participants, and then use `mutate()` to scale the data. Here is an example:

```{r}
# first let's generate some random data
set.seed(1) # to make this reproducible

df.summarize = tibble(participant = rep(1:3, each = 5),
                      judgment = sample(0:100, size = 15, replace = TRUE)) %>%
  print()
```

```{r}
df.summarize %>%
  group_by(participant) %>% # group by participants
  mutate(judgment_zscored = scale(judgment)) %>% # z-score data of individual participants
  ungroup() %>% # ungroup the data frame
  head(n = 10) # print the top 10 rows
```

First, I've generated some random data using the repeat function `rep()` for making a `participant` column, and the `sample()` function to randomly choose values from a range between 0 and 100 with replacement. (We will learn more about these functions later when we look into how to simulate data.) I've then grouped the data by participant, and used the scale function to z-score the data.

> __TIP__: Don't forget to `ungroup()` your data frame. Otherwise, any subsequent operations are applied per group.

Sometimes, I want to run operations on each row, rather than per column. For example, let's say that I wanted each character's average combined height and mass.

Let's see first what doesn't work:

```{r}
df.starwars %>%
  mutate(mean_height_mass = mean(c(height, mass), na.rm = T)) %>%
  select(name, height, mass, mean_height_mass)
```

Note that all the values are the same. The value shown here is just the mean of all the values in `height` and `mass`.

```{r}
df.starwars %>%
  select(height, mass) %>%
  unlist() %>% # turns the data frame into a vector
  mean(na.rm = T)
```

To get the mean by row, we can either spell out the arithmetic

```{r}
df.starwars %>%
  mutate(mean_height_mass = (height + mass) / 2) %>% # here, I've replaced the mean() function
  select(name, height, mass, mean_height_mass)
```

or use the `rowwise()` helper function which is like `group_by()` but treats each row like a group:

```{r}
df.starwars %>%
  rowwise() %>% # now, each row is treated like a separate group
  mutate(mean_height_mass = mean(c(height, mass), na.rm = T)) %>%
  ungroup() %>%
  select(name, height, mass, mean_height_mass)
```

#### Practice 1

Find out what the average `height` and `mass` (as well as the standard deviation) is from different `species` in different `homeworld`s. Why is the standard deviation `NA` for many groups?

```{r}
# write your code here
```

Who is the tallest member of each species? What eye color do they have? The `top_n()` function or the `row_number()` function (in combination with `filter()`) will be useful here.

```{r}
# write your code here

```

### Reshaping data

We want our data frames to be tidy. What's tidy?

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

For more information on tidy data frames see the [Tidy data](http://r4ds.had.co.nz/tidy-data.html) chapter in Hadley Wickham's R for Data Science book.

> "Happy families are all alike; every unhappy family is unhappy in its own way." –– Leo Tolstoy

> "Tidy datasets are all alike, but every messy dataset is messy in its own way." –– Hadley Wickham

#### `pivot_longer()` and `pivot_wider()`

Let's first generate a data set that is _not_ tidy.

```{r}
# construct data frame
df.reshape = tibble(participant = c(1, 2),
                    observation_1 = c(10, 25),
                    observation_2 = c(100, 63),
                    observation_3 = c(24, 45)) %>%
  print()
```

Here, I've generated data from two participants with three observations. This data frame is not tidy since each row contains more than a single observation. Data frames that have one row per participant but many observations are called _wide_ data frames.

We can make it tidy using the `pivot_longer()` function.

```{r}
df.reshape.long = df.reshape %>%
  pivot_longer(cols = -participant,
               names_to = "index",
               values_to = "rating") %>%
  arrange(participant) %>%
  print()
```

`df.reshape.long` now contains one observation in each row. Data frames with one row per observation are called _long_ data frames.

The `pivot_longer()` function takes at least four arguments:

1. the data which I've passed to it via the pipe `%>%`
2. a specification for which columns we want to gather -- here I've specified that we want to gather the values from all columns except the `participant` column
3. a `names_to` argument which specifies the name of the column which will contain the column names of the original data frame
4. a `values_to` argument which specifies the name of the column which will contain the values that were spread across different columns in the original data frame

`pivot_wider()` is the counterpart of `pivot_longer()`. We can use it to go from a data frame that is in _long_ format, to a data frame in _wide_ format, like so:

```{r}
df.reshape.wide = df.reshape.long %>%
  pivot_wider(names_from = index,
              values_from = rating) %>%
  print()
```

For my data, I often have a wide data frame that contains demographic information about participants, and a long data frame that contains participants' responses in the experiment. In Section \@ref(joining-multiple-data-frames), we will learn how to combine information from multiple data frames (with potentially different formats).

Here is a more advanced example that involves reshaping a data frame. Let's consider the following data frame to start with:

```{r}
# construct data frame
df.reshape2 = tibble(participant = c(1, 2),
                     stimulus_1 = c("flower", "car"),
                     observation_1 = c(10, 25),
                     stimulus_2 = c("house", "flower"),
                     observation_2 = c(100, 63),
                     stimulus_3 = c("car", "house"),
                     observation_3 = c(24, 45)) %>%
  print()
```

The data frame contains in each row: which stimuli a participant saw, and what rating she gave. The participants saw a picture of a flower, car, and house, and rated how much they liked the picture on a scale from 0 to 100. The order at which the pictures were presented was randomized between participants. I will use a combination of `pivot_longer()`, and `pivot_wider()` to turn this into a data frame in long format.

```{r}
df.reshape2 %>%
  pivot_longer(cols = -participant,
               names_to = c("index", "order"),
               names_sep = "_",
               values_to = "rating",
               values_transform = list(rating = as.character)) %>%
  pivot_wider(names_from = "index",
              values_from = "rating") %>%
  mutate(across(.cols = c(order, observation),
                .fns = ~ as.numeric(.))) %>%
  select(participant, order, stimulus, rating = observation)
```

Voilà! Getting the desired data frame involved a few new tricks. Let's take it step by step.

First, I use `pivot_longer()` to make a long table.

```{r}
df.reshape2 %>%
  pivot_longer(cols = -participant,
               names_to = c("index", "order"),
               names_sep = "_",
               values_to = "rating",
               values_transform = list(rating = as.character))
```

Notice how I've used a combination of the `names_to = ` and `names_sep = ` arguments to create two columns. Because I'm combining data of two different types ("character" and "numeric"), I needed to specify what I want the resulting data type to be via the `values_transform = ` argument.

I would like to have the information about the stimulus and the observation in the same row. That is, I want to see what rating a participant gave to the flower stimulus, for example. To get there, I can use the `pivot_wider()` function to make a separate column for each entry in `index` that contains the values in `rating`.

```{r}
df.reshape2 %>%
  pivot_longer(cols = -participant,
               names_to = c("index", "order"),
               names_sep = "_",
               values_to = "rating",
               values_transform = list(rating = as.character)) %>%
  pivot_wider(names_from = "index",
              values_from = "rating")
```

That's pretty much it. Now, each row contains information about the order in which a stimulus was presented, what the stimulus was, and the judgment that a participant made in this trial.

```{r}
df.reshape2 %>%
  pivot_longer(cols = -participant,
               names_to = c("index", "order"),
               names_sep = "_",
               values_to = "rating",
               values_transform = list(rating = as.character)) %>%
  pivot_wider(names_from = "index",
              values_from = "rating") %>%
  mutate(across(.cols = c(order, observation),
                .fns = ~ as.numeric(.))) %>%
  select(participant, order, stimulus, rating = observation)
```

The rest is familiar. I've used `mutate()` with `across()` to turn `order` and `observation` into numeric columns, `select()` to change the order of the columns (and renamed the `observation` column to `rating` along the way).

Getting familiar with `pivot_longer()` and `pivot_wider()` takes some time plus trial and error. So don't be discouraged if you don't get what you want straight away. Once you've mastered these functions, they will make it much easier to beat your data frames into shape.

After having done some transformations like this, it's worth checking that nothing went wrong. I often compare a few values in the transformed and original data frame to make sure everything is legit.

When reading older code, you will often see `gather()` (instead of `pivot_longer()`), and `spread()` (instead of `pivot_wider()`). `gather` and `spread` are not developed anymore now, and their newer counterparts have additional functionality that comes in handy.

#### `separate()` and `unite()`

Sometimes, we want to separate one column into multiple columns. For example, we could have achieved the same result we did above slightly differently, like so:

```{r}
df.reshape2 %>%
  pivot_longer(cols = -participant,
               names_to = "index",
               values_to = "rating",
               values_transform = list(rating = as.character)) %>%
  separate(col = index,
           into = c("index", "order"),
           sep = "_")
```

Here, I've used the `separate()` function to separate the original `index` column into two columns. The `separate()` function takes four arguments:

1. the data which I've passed to it via the pipe `%>%`
2. the name of the column `col` which we want to separate
3. the names of the columns `into` into which we want to separate the original column
4. the separator `sep` that we want to use to split the columns.

Note, like `pivot_longer()` and `pivot_wider()`, there is a partner for `separate()`, too. It's called `unite()` and it allows you to combine several columns into one, like so:

```{r}
tibble(index = c("flower", "observation"),
       order = c(1, 2)) %>%
  unite("combined", index, order)
```

Sometimes, we may have a data frame where data is recorded in a long string.

```{r}
df.reshape3 = tibble(participant = 1:2,
                     judgments = c("10, 4, 12, 15", "3, 4")) %>%
  print()
```

Here, I've created a data frame with data from two participants. For whatever reason, we have four judgments from participant 1 and only two judgments from participant 2 (data is often messy in real life, too!).

We can use the `separate_rows()` function to turn this into a tidy data frame in long format.

```{r}
df.reshape3 %>%
  separate_rows(judgments)
```

#### Practice 2

Load this data frame first.

```{r}
df.practice2 = tibble(participant = 1:10,
                      initial = c("AR", "FA", "IR", "NC", "ER", "PI", "DH", "CN", "WT", "JD"),
                      judgment_1 = c(12, 13, 1, 14, 5, 6, 12, 41, 100, 33),
                      judgment_2 = c(2, 20, 10, 89, 94, 27, 29, 19, 57, 74),
                      judgment_3 = c(2, 20, 10, 89, 94, 27, 29, 19, 57, 74))
```

- Make the `df.practice2` data frame tidy (by turning into a long format).
- Compute the z-score of each participants' judgments (using the `scale()` function).
- Calculate the mean and standard deviation of each participants' z-scored judgments.
- Notice anything interesting? Think about what [z-scoring](https://www.statisticshowto.com/probability-and-statistics/z-score/) does ...

```{r}
# write your code here

```


### Joining multiple data frames

It's nice to have all the information we need in a single, tidy data frame. We have learned above how to go from a single untidy data frame to a tidy one. However, often our situation to start off with is even worse. The information we need sits in several, messy data frames.

For example, we may have one data frame `df.stimuli` with information about each stimulus, and then have another data frame with participants' responses `df.responses` that only contains a stimulus index but no other infromation about the stimuli.

```{r}
set.seed(1) # setting random seed to make this example reproducible

# data frame with stimulus information
df.stimuli = tibble(index = 1:5,
  height = c(2, 3, 1, 4, 5),
  width = c(4, 5, 2, 3, 1),
  n_dots = c(12, 15, 5, 13, 7),
  color = c("green", "blue", "white", "red", "black")) %>%
  print()

# data frame with participants' responses
df.responses = tibble(participant = rep(1:3, each = 5),
  index = rep(1:5, 3),
  response = sample(0:100, size = 15, replace = TRUE)) %>% # randomly sample 15 values from 0 to 100
  print()
```

The `df.stimuli` data frame contains an `index`, information about the `height`, and `width`, as well as the number of `dots`, and their `color`. Let's imagine that participants had to judge how much they liked each image from a scale of 0 ("not liking this dot pattern at all") to 100 ("super thrilled about this dot pattern").

Let's say that I now wanted to know what participants' average response for the differently colored dot patterns are. Here is how I would do this:

```{r}
df.responses %>%
  left_join(df.stimuli %>%
              select(index, color),
            by = "index") %>%
  group_by(color) %>%
  summarize(response_mean = mean(response))
```

Let's take it step by step. The key here is to add the information from the `df.stimuli` data frame to the `df.responses` data frame.

```{r}
df.responses %>%
  left_join(df.stimuli %>%
              select(index, color),
            by = "index")
```

I've joined the `df.stimuli` table in which I've only selected the `index` and `color` column, with the `df.responses` table, and specified the `index` column as the one by which the tables should be joined. This is the only column that both of the data frames have in common.

To specify multiple columns by which we would like to join tables, we specify the `by` argument as follows: `by = c("one_column", "another_column")`.

Sometimes, the tables I want to join don't have any column names in common. In that case, we can tell the `left_join()` function which column pair(s) should be used for joining.

```{r}
df.responses %>%
  rename(stimuli = index) %>% # I've renamed the index column to stimuli
  left_join(df.stimuli %>%
              select(index, color),
            by = c("stimuli" = "index"))
```

Here, I've first renamed the index column (to create the problem) and then used the `by = c("stimuli" = "index")` construction (to solve the problem).

In my experience, it often takes a little bit of playing around to make sure that the data frames were joined as intended. One very good indicator is the row number of the initial data frame, and the joined one. For a `left_join()`, most of the time, we want the row number of the original data frame ("the one on the left") and the joined data frame to be the same. If the row number changed, something probably went wrong.

Take a look at the `join` help file to see other operations for combining two or more data frames into one (make sure to look at the one from the `dplyr` package).

#### Practice 3

Load these three data frames first:

```{r}
set.seed(1)

df.judgments = tibble(participant = rep(1:3, each = 5),
                      stimulus = rep(c("red", "green", "blue"), 5),
                      judgment = sample(0:100, size = 15, replace = T))

df.information = tibble(number = seq(from = 0, to = 100, length.out = 5),
                        color = c("red", "green", "blue", "black", "white"))
```

Create a new data frame called `df.join` that combines the information from both `df.judgments` and `df.information`. Note that column with the colors is called `stimulus` in `df.judgments` and `color` in `df.information`. At the end, you want a data frame that contains the following columns: `participant`, `stimulus`, `number`, and `judgment`.

```{r}
# write your code here

```


### Dealing with missing data

There are two ways for data to be missing.

- __implicit__: data is not present in the table
- __explicit__: data is flagged with `NA`

We can check for explicit missing values using the `is.na()` function like so:

```{r}
tmp.na = c(1, 2, NA, 3)
is.na(tmp.na)
```

I've first created a vector `tmp.na` with a missing value at index 3. Calling the `is.na()` function on this vector yields a logical vector with `FALSE` for each value that is not missing, and `TRUE` for each missing value.

Let's say that we have a data frame with missing values and that we want to replace those missing values with something else. Let's first create a data frame with missing values.

```{r}
df.missing = tibble(x = c(1, 2, NA),
                    y = c("a", NA, "b"))
print(df.missing)
```

We can use the `replace_na()` function to replace the missing values with something else.

```{r}
df.missing %>%
  mutate(x = replace_na(x, replace = 0),
         y = replace_na(y, replace = "unknown"))
```

We can also remove rows with missing values using the `drop_na()` function.

```{r}
df.missing %>%
  drop_na()
```

If we only want to drop values from specific columns, we can specify these columns within the `drop_na()` function call. So, if we only want to drop rows that have missing values in the `x` column, we can write:

```{r}
df.missing %>%
  drop_na(x)
```

To make the distinction between implicit and explicit missing values more concrete, let's consider the following example (taken from [here](https://r4ds.had.co.nz/tidy-data.html#missing-values-3)):

```{r}
df.stocks = tibble(year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
                   qtr    = c(   1,    2,    3,    4,    2,    3,    4),
                   return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66))
```

There are two missing values in this dataset:

- The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`.
- The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

We can use the `complete()` function to make implicit missing values explicit:

```{r}
df.stocks %>%
  complete(year, qtr)
```

Note how now, the data frame contains an additional row in which `year = 2016`, `qtr = 1` and `return = NA` even though we didn't originally specify this.

We can also directly tell the `complete()` function to replace the `NA` values via passing a list to its `fill` argument like so:

```{r}
df.stocks %>%
  complete(year, qtr, fill = list(return = 0))
```

This specifies that we would like to replace any `NA` in the `return` column with `0`. Again, if we had multiple columns with `NA`s, we could speficy for each column separately how to replace it.

## Reading in data

So far, we've used data sets that already came with the packages we've loaded. In the visualization chapters, we used the `diamonds` data set from the `ggplot2` package, and in the data wrangling chapters, we used the `starwars` data set from the `dplyr` package.

```{r, echo=FALSE}
file_type = c("`csv`", "`RData`", "`xls`", "`json`", "`feather`")
platform = c("general",
             "R",
             "excel",
             "general",
             "python & R")
description = c("medium-size data frames",
                "saving the results of intensive computations",
                "people who use excel",
                "more complex data structures",
                "fast interaction between R and python")

kable(tibble(`file type` = file_type,
             platform = platform,
             description = description),
      align = c("r", "l", "l"))
```


The `foreign` [package](https://cran.r-project.org/web/packages/foreign/index.html) helps with importing data that was saved in SPSS, Stata, or Minitab.

For data in a json format, I highly recommend the `tidyjson` [package](https://github.com/sailthru/tidyjson).

### csv

I've stored some data files in the `data/` subfolder. Let's first read a csv (= **c**omma-**s**eparated-**v**alue) file.

```{r}
df.csv = read_csv("data/movies.csv")
```

The `read_csv()` function gives us information about how each column was parsed. Here, we have some columns that are characters (such as `title` and `genre`), and some columns that are numeric (such as `year` and `duration`). Note that it says `double()` in the specification but double and numeric are identical.

And let's take a quick peek at the data:

```{r}
df.csv %>% glimpse()
```

The data frame contains a bunch of movies with information about their genre, director, rating, etc.

The `readr` package (which contains the `read_csv()` function) has a number of other functions for reading data. Just type `read_` in the console below and take a look at the suggestions that autocomplete offers.

### RData

RData is a data format native to R. Since this format can only be read by R, it's not a good format for sharing data. However, it's a useful format that allows us to flexibly save and load R objects. For example, consider that we always start our script by reading in and structuring data, and that this takes quite a while. One thing we can do is to save the output of intermediate steps as an RData object, and then simply load this object (instead of re-running the whole routine every time).

We read (or load) an RData file in the following way:

```{r}
load("data/test.RData", verbose = TRUE)
```

I've set the `verbose = ` argument to `TRUE` here so that the `load()` function tells me what objects it added to the environment. This is useful for checking whether existing objects were overwritten.

## Saving data

### csv

To save a data frame as a csv file, we simply write:

```{r}
df.test = tibble(x = 1:3,
                 y = c("test1", "test2", "test3"))

write_csv(df.test, file = "data/test.csv")
```

Just like for reading in data, the `readr` package has a number of other functions for saving data. Just type `write_` in the console below and take a look at the autocomplete suggestions.

### RData

To save objects as an RData file, we write:

```{r}
save(df.test, file = "data/test.RData")
```

We can add multiple objects simply by adding them at the beginning, like so:

```{r}
save(df.test, df.starwars, file = "data/test_starwars.RData")
```


## Additional resources

### Cheatsheets

- [base R](figures/base-r.pdf) --> summary of how to use base R (we will mostly use the tidyverse but it's still important to know how to do things in base R)
- [data transformation](figures/data-transformation.pdf) --> transforming data using `dplyr`

### Data camp courses

- [dplyr](https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial)
- [tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse)
- [working with data in the tidyverse](https://www.datacamp.com/courses/working-with-data-in-the-tidyverse)
- [cleaning data](https://www.datacamp.com/courses/importing-cleaning-data-in-r-case-studies)
- [cleaning data: case studies](https://www.datacamp.com/courses/importing-cleaning-data-in-r-case-studies)
- [string manipulation in R](https://www.datacamp.com/courses/string-manipulation-in-r-with-stringr)
- [Intermediate R](https://www.datacamp.com/courses/intermediate-r)
- [Writing functions in R](https://www.datacamp.com/courses/introduction-to-function-writing-in-r)

### Books and chapters

- [Chapters 9-15 in "R for Data Science"](https://r4ds.had.co.nz/wrangle-intro.html)
- [Chapter 5 in "Data Visualization - A practical introduction"](http://socviz.co/workgeoms.html#workgeoms)

## Session info

Information about this R session including which version of R was used, and what packages were loaded. 

```{r}
sessionInfo()
```