diff --git a/01-functions-01.qmd b/01-functions-01.qmd index a71115f..ff29700 100644 --- a/01-functions-01.qmd +++ b/01-functions-01.qmd @@ -35,6 +35,8 @@ Emma Rand 🐘[\@3mma\@mastodon.social](https://mastodon.social/@3mma) +Garrett Grolemund + Stephanie Hazlitt ::: @@ -49,17 +51,17 @@ Mouna Belaid   -Standing on the shoulders of +## Standing on the shoulders of + -::: {style="font-size: 80%;"} - [R for Data Science (2e)](https://r4ds.hadley.nz/) @wickham2023 - [The tidyverse style guide](https://style.tidyverse.org/index.html) @wickham-style - [Programming with dplyr vignette](https://dplyr.tidyverse.org/articles/programming.html) @dplyr -::: -\ + + ### WiFi TOFIX @@ -68,7 +70,37 @@ Standing on the shoulders of ## Introductions -To each other! +## Introductions + +To each other! With help from Yorkshire! + +. . . + +::: columns +::: {.column width="36%"} +::: {style="font-size: 40%;"} +![](images/quality-street.png){width="300"} + +*Ingredients*: Sugar, Glucose syrup, Cocoa mass, Vegetable fats (Palm, Rapeseed, Sunflower, Coconut,Mango kernel/ Sal/ Shea), Sweetened condensed skimmed milk (Skimmed milk, Sugar), Cocoa butter, Dried whole milk, Glucose-fructose syrup, Coconut, Lactose and proteins from whey (from Milk), Whey powder (from Milk), Hazelnuts, Skimmed milk powder, Butter (from Milk), Emulsifiers (Sunflower lecithin, E471), Flavourings, Butterfat (from Milk), Fat-reduced cocoa powder, Salt, Lactic acid. +::: +::: + +::: {.column width="32%"} +::: {style="font-size: 40%;"} +![](images/after-eight-thin-mint-squares-25-piece-box.jpg){width="300"} + +*Ingredients*: Sugar, Semi-Sweet Chocolate (Sugar, Chocolate, Cocoa Butter, Milkfat, Soy and Sunflower Lecithin, Natural Vanilla Flavor), Glucose Syrup, Peppermint Oil, Citric Acid, Invertase. 
+::: +::: + +::: {.column width="32%"} +::: {style="font-size: 40%;"} +![](images/haribo-strawbs.jpeg){width="300"} + +*Ingredients*: Glucose Syrup, Sugar, Starch, Acid: Citric Acid, Flavouring, Fruit and Plant Concentrates: Aronia, Blackcurrant, Elderberry, Grape, Lemon, Orange, Safflower Spirulina, Caramelised Sugar Syrup, Glazing Agents: Beeswax, Carnauba Wax, Elderberry Extract. +::: +::: +::: ## Code of Conduct TOFIX @@ -674,8 +706,7 @@ sum_sq <- function(x){ } ``` - -. . . +. . . 🎬 Try it out @@ -683,9 +714,6 @@ sum_sq <- function(x){ sum_sq(penguins$bill_length_mm) ``` - - - ## Types of function We will cover two types of function @@ -696,9 +724,7 @@ We will cover two types of function ii. ✔️ summary functions: input is vector, output is a single value -**2. ➡️ data frame functions: df as input and df as output** - - +**2. ➡️ data frame functions: df as input and df as output** # Dataframe functions @@ -752,11 +778,13 @@ my_summary(penguins, bill_length_mm) `tidyverse` functions like `dplyr::summarise()` use "tidy evaluation" so you can refer to the names of variables inside dataframes. For example, you can use: either + ``` r penguins |> summarise(mean = mean(bill_depth_mm)) ``` Or + ``` r summarise(penguins, mean = mean(bill_depth_mm)) ``` @@ -766,7 +794,8 @@ rather than `$` notation ``` r summarise(penguins, mean = mean(penguins$bill_depth_mm)) ``` -. . . + +. . . This is known as data-masking: the dataframe environment masks the user environment by giving priority to the dataframe. @@ -774,17 +803,16 @@ This is known as data-masking: the dataframe environment masks the user environm and makes life easier when working interactively -. . . +. . . But not so useful in functions Because of data-masking, `summarise()` in `my_summary()` is looking for a column literally called `column` in the dataframe that has been passed in. It is not looking in the variable `column` for the name of column you want to give it. -. . . +. . . 
[Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) - ## Fix `my_summary()` function The solution is to use embracing: `{{ var }}` @@ -822,7 +850,6 @@ When tidy evaluation is used 🎬 Write a function to calculate the median, maximum and minimum values of a variable grouped by another variable. - ## A solution - 1 ```{r} @@ -836,7 +863,6 @@ my_summary <- function(df, summary_var, group_var){ } ``` - ## Your turn 🎬 Try it out @@ -860,7 +886,6 @@ my_summary <- function(df, summary_var, group_var = NULL){ } ``` - ## Your turn 🎬 Try it out @@ -902,7 +927,6 @@ my_summary <- function(df, summary_var, group_var = NULL){ my_summary(penguins, bill_length_mm, c(species, island)) ``` - ## Extras - Short cuts: @@ -913,8 +937,11 @@ my_summary(penguins, bill_length_mm, c(species, island)) ## Summary - + - + - + - ## References diff --git a/03-iteration-01.qmd b/03-iteration-01.qmd index 23de251..b92f605 100644 --- a/03-iteration-01.qmd +++ b/03-iteration-01.qmd @@ -4,7 +4,7 @@ subtitle: "Iteration 1" author: "Emma Rand and Ian Lyttle" format: revealjs: - theme: [simple] + theme: [simple, styles.scss] slide-number: true chalkboard: true code-link: true @@ -23,28 +23,33 @@ brief intro At the end of this section you will be able to: ::: {style="font-size: 70%;"} -- - -- +- recognise that much iteration comes free with R -- +- iterate across columns using `across()` -- + - use selection functions to select columns for iteration + - use anonymous functions to pass arguments + - give more than one function for iteration + - use `.names` to control the output -- +- use `across()` in functions ::: -## What is iteration +## What is iteration? -lksdjfjksdf +- Iteration means repeating steps multiple times until a condition is met -## Iteration in R - -Iteration is different in R because much is an inherent part of the language. +- In other languages, iteration is performed with loops: `for`, `while` . . . 
-For example, if +- Iteration is different in R + +- You *can* use loops... but you often don't *need* to + +## Iteration in R + +Iteration is an inherent part of the language. For example, if ```{r} nums <- c(3, 1, 6, 4) @@ -60,7 +65,7 @@ Then is -. . . +## Iteration in R ``` r [1] 6 2 12 8 @@ -74,17 +79,24 @@ and NOT [1] 6 2 12 8 6 2 12 8 ``` -other languages, a for loop would be right after hello world - ## Iteration in R -For examples +We have: + +- the `apply()` family - `group_by()` with `summarize()` - `facet_wrap()` -## slide +- `across()` and the purrr package + + +. . . + +In other languages, a `for` loop would come right after hello world + +## Functional programming "functional programming" because functions take other functions as input @@ -96,17 +108,15 @@ For examples # Set up -## Project and Packages - -🎬 Create a Project: +## Create a `.R` ```{r} #| eval: false -usethis::create_project("workshop-iterations") +usethis::use_r("functions-01") ``` -. . . +## Packages 🎬 Load packages: @@ -115,13 +125,26 @@ library(tidyverse) library(palmerpenguins) ``` -## Data +``` +── Attaching core tidyverse packages ──────────────────────────────────────────────────────────── tidyverse 2.0.0 ── +✔ dplyr 1.1.2 ✔ readr 2.1.4 +✔ forcats 1.0.0 ✔ stringr 1.5.0 +✔ ggplot2 3.4.2 ✔ tibble 3.2.1 +✔ lubridate 1.9.2 ✔ tidyr 1.3.0 +✔ purrr 1.0.1 ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── +✖ dplyr::filter() masks stats::filter() +✖ dplyr::lag() masks stats::lag() +ℹ Use the conflicted package to force all conflicts to become errors +``` + +## Load `penguins` 🎬 Load `penguins` data set ```{r} data(penguins) glimpse(penguins) + ``` # Modifying multiple columns @@ -152,6 +175,8 @@ penguins |> ⚠️ Code repetition! +How can we iterate over columns? 
+ ## Solution: `across()` ```{r} @@ -165,15 +190,14 @@ penguins |> 3 important arguments +## `across()` Arguments - which columns you want to iterate over: `.cols = bill_length_mm:body_mass_g` . . . - what you want to do to each column: `.fns = sd_error` - - single function - - include arguments to that function - - more than one function + - single function, include arguments, more than one function . . . @@ -183,6 +207,10 @@ penguins |> - we could use colon notation, `bill_length_mm:body_mass_g`, because columns are adjacent +. . . + +but + - `.cols` uses same specification as `select()`: `starts_with()`, `ends_with()`, `contains()`, `matches()` ## selecting columns with `.cols` @@ -204,6 +232,19 @@ penguins |> ## selecting columns with `.cols` +```{r} +#| eval: false +penguins |> + group_by(species, island, sex) |> + summarise(across(everything(), sd_error)) +``` + +- variables in `group_by()` are excluded + +- all of `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`, `year` + +## selecting columns with `.cols` + - `everything()`: all non-grouping columns without year ```{r} @@ -215,7 +256,11 @@ penguins |> ## selecting columns with `.cols` -- all the numeric columns *without* grouping: `where()` +- My columns have very different names and I don't want to group! + +. . . + +- all the *numeric* columns: `where()` ```{r} penguins |> @@ -223,41 +268,59 @@ penguins |> summarise(across(where(is.numeric), sd_error)) ``` -## Your turn - -🎬 Write a function ....... 
- -(maybe based on gapminder) - ## `.funs`: calling one function - we can pass a function, `sd_error` to `across()` since R is a functional programming language -- note we are not calling `sd_error()` +- note, we are not calling `sd_error()` - instead we pass `sd_error` so `across()` can call it - thus function name is **not** followed by `()` -show the error - easy to forget in functions eg my_summary \<- function(df, cols = where(is.numeric)){...} +## function name is **not** followed by `()` + +📢 -## Include arguments ```{r} +#| error: true penguins |> - summarise(across(ends_with("mm"), mean)) + select(-year) |> + summarise(across(where(is.numeric), sd_error())) ``` . . . +This error is easy to make! + +## Include arguments + +```{r} +penguins |> + summarise(across(ends_with("mm"), mean)) +``` + We get the NA because we have missing values[^1]. ## Include arguments -`mean()` has an `na.rm` argument. How can we pass on `na.rm = TRUE`? +`mean()` has an `na.rm` argument. + +How can we pass on `na.rm = TRUE`? . . . +We might try: + +```{r} +#| error: true +penguins |> + summarise(across(ends_with("mm"), mean(na.rm = TRUE))) +``` + +## Include arguments + The solution is to create a new function that calls `mean()` with `na.rm = TRUE` . . . @@ -268,26 +331,58 @@ penguins |> function(x) mean(x, na.rm = TRUE))) ``` +. . . + +`mean` is replaced by a function definition + +## Anonymous functions + +``` r +penguins |> + summarise(across(ends_with("mm"), + function(x) mean(x, na.rm = TRUE))) +``` + +- This is called an **anonymous** or **lambda** function. + +- It is anonymous because we do not give it a name with `<-` + ## Anonymous functions +Shorthand + +. . . 
+ Instead of writing `function` we can use `\` ```{r} penguins |> - summarise(across(ends_with("mm"), \(x) mean(x, na.rm = TRUE))) + summarise(across(ends_with("mm"), + \(x) mean(x, na.rm = TRUE))) ``` + +## Anonymous functions + +Note: you might also see: + +```{r} +penguins |> + summarise(across(ends_with("mm"), + ~ mean(.x, na.rm = TRUE))) +``` . . . -- This is called an **anonymous** or **lambda** function. +- `\(x)` is base R syntax, new in 4.1.0. **Recommended** -- It is anonymous because we do not give it a name with `<-` +- `~ .x` is fine but only works in tidyverse functions -## `.funs`: calling \> one function + +## `.fns`: calling more than one function How can we use more than one function across the columns? -``` +``` r penguins |> summarise(across(ends_with("mm"), _MORE THAN ONE FUNCTION_)) ``` @@ -296,7 +391,29 @@ penguins |> by using a list -## `.funs`: calling \> 1 function +## `.fns`: calling more than one function + +Using a list: + +``` r +penguins |> + summarise(across(where(is.numeric), list( + sd_error, + length))) +``` + +. . . + +Or, with anonymous functions: + +``` r +penguins |> + summarise(across(ends_with("mm"), list( + \(x) mean(x, na.rm = TRUE), + \(x) sd(x, na.rm = TRUE)))) +``` + +## `.fns`: calling more than one function ```{r} penguins |> @@ -307,11 +424,11 @@ penguins |> . . . -the `_1` and `_2` are not very useful. +Problem: the suffixes `_1` and `_2` for functions are not very useful. -## `.funs`: calling \> one function +## `.fns`: calling more than one function -We can improve with naming the elements in the list +We can improve by naming the elements in the list ```{r} penguins |> @@ -324,9 +441,11 @@ penguins |> The column name is `{.col}_{.fn}`: `bill_length_mm_mean` +fn: **f**unction **n**ame + . . . 
-We can change using `.names` +We can change this using the `.names` argument ## `.names` to control output @@ -340,7 +459,7 @@ penguins |> ## `.names` to control output -Especially important for mutate because column names are used in `across()` +Especially important for `mutate()`. Recall our `to_z()` function @@ -351,6 +470,8 @@ to_z <- function(x, middle = 1) { } ``` +## `to_z()` function in `mutate()` + which we used like this ```{r} @@ -365,6 +486,8 @@ penguins |> ## `.names` to control output +It makes sense to use `across()` to apply the transformation to all three variables + ```{r} penguins |> mutate(across(ends_with("mm"), @@ -373,7 +496,7 @@ penguins |> glimpse() ``` -Results go into existing columns +😮 Results go into existing columns! ## @@ -386,56 +509,94 @@ penguins |> glimpse() ``` -## A note on dots in argument names + -- + -- + + + + + + +## Your turn -## Iteration over columns in `filter()` +Time to bring together functions and iteration! + +🎬 Write a function that summarises multiple specified columns of a data frame + + + +``` r +my_summary <- function(df, cols) { + + . . . . + +} + +``` + + +``` r +my_summary(penguins, ends_with("mm")) +``` -?? -## `across()` in functions +## A solution ```{r} -my_summary <- function(df, cols){ +my_summary <- function(df, cols) { df |> - summarise(across({{cols}}, + summarise(across({{ cols }}, list(mean = \(x) mean(x, na.rm = TRUE), - sdev = \(x) sd(x, na.rm = TRUE)), - .names = "{.fn}_of_{.col}"), + sdev = \(x) sd(x, na.rm = TRUE))), .groups = "drop") } ``` +## Try it out + ```{r} -my_summary(penguins, ends_with("mm")) +penguins |> + group_by(species) |> + my_summary(ends_with("mm")) ``` + +## An improved solution + +Include a default. 
+ ```{r} -my_summary <- function(df, cols = where(is.numeric)){ +my_summary <- function(df, cols = where(is.numeric)) { df |> summarise(across({{cols}}, list(mean = \(x) mean(x, na.rm = TRUE), - sdev = \(x) sd(x, na.rm = TRUE)), - .names = "{.fn}_of_{.col}"), + sdev = \(x) sd(x, na.rm = TRUE))), .groups = "drop") } ``` + +## Try it out + ```{r} -my_summary(penguins) +penguins |> + select(-year) |> + my_summary() ``` -## Your turn -## link between across and pivot_longer +## Summary -?? +- -## Summary +- + +- + +- [^1]: There is no problem when we use `sd_error()` because we accounted for NA in our function definition diff --git a/images/after-eight-thin-mint-squares-25-piece-box.jpg b/images/after-eight-thin-mint-squares-25-piece-box.jpg new file mode 100644 index 0000000..51699f1 Binary files /dev/null and b/images/after-eight-thin-mint-squares-25-piece-box.jpg differ diff --git a/images/haribo-strawbs.jpeg b/images/haribo-strawbs.jpeg new file mode 100644 index 0000000..de14548 Binary files /dev/null and b/images/haribo-strawbs.jpeg differ diff --git a/images/quality-street.png b/images/quality-street.png new file mode 100644 index 0000000..a834d17 Binary files /dev/null and b/images/quality-street.png differ