From 8c662529388a64d3ff2a34d84ba9f33acae7d55e Mon Sep 17 00:00:00 2001
From: Nic Crane <thisisnic@gmail.com>
Date: Sun, 17 Sep 2023 05:14:12 -0500
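Subject: [PATCH] Fix highlighted code lines in the namespacing slides

Highlight the `time_of_day = ifelse(...)` line in both "Morning vs
afternoon" slides: `code-line-numbers` changes from 2 to 3 in the
namespaced chunk and from 4 to 5 in the chunk without namespacing.
The frozen HTML output is regenerated to match.
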
---
 .../materials/2_data_manipulation_1/execute-results/html.json | 4 ++--
 materials/2_data_manipulation_1.qmd                           | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/_freeze/materials/2_data_manipulation_1/execute-results/html.json b/_freeze/materials/2_data_manipulation_1/execute-results/html.json
index 69f4d1d..558da2e 100644
--- a/_freeze/materials/2_data_manipulation_1/execute-results/html.json
+++ b/_freeze/materials/2_data_manipulation_1/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "582b36bee10442ef8a0af96d0f49ebaf",
+  "hash": "ed57973bde547ddb275688a9aa4faa4c",
   "result": {
-    "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n  echo: true\nformat:\n  revealjs: \n    theme: default\nengine: knitr\neditor: source\n---\n\n\n# Data Manipulation---Part 1 {#data-manip-1}\n\n\n::: {.cell}\n\n:::\n\n\n## Goals\n\nAvoiding these! But...don't worry!\n\n![](images/segfault.png)\n\n\n## dplyr API in arrow\n\n![](images/dplyr-backend.png)\n\n## An Arrow Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 120 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n## Arrow Datasets\n\n![](images/nyc_taxi_dataset.png)\n\n\n## Constructing queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nshared_rides <- nyc_taxi |>\n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) \n\nclass(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshared_rides\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nyear: int32\nall_trips: int64\nshared_trips: uint64\npct_shared: double 
(multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})), 100))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n-   query has been constructed but not evaluated\n-   nothing has been pulled into memory\n\n## To `collect()` or to `compute()`?\n\n-   `compute()` evaluates the query, in-memory output stays in Arrow\n-   `collect()` evaluates the query, in-memory output returns to R\n\n## `compute()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTable\n10 rows x 4 columns\n$year <int32>\n$all_trips <int64>\n$shared_trips <uint64>\n$pct_shared <double>\n```\n:::\n:::\n\n\n## `collect()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 4\n    year all_trips shared_trips pct_shared\n   <int>     <int>        <int>      <dbl>\n 1  2012 178544324     53313752       29.9\n 2  2013 173179759     51215013       29.6\n 3  2014 165114361     48816505       29.6\n 4  2015 146112989     43081091       29.5\n 5  2016 131165043     38163870       29.1\n 6  2017 113495512     32296166       28.5\n 7  2018 102797401     28796633       28.0\n 8  2019  84393604     23515989       27.9\n 9  2020  24647055      5837960       23.7\n10  2021  30902618      7221844       23.4\n```\n:::\n:::\n\n\n## Calling `nrow()` to see how much data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year %in% 2017:2021) |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 356236190\n```\n:::\n:::\n\n\n## Calling `nrow()` 
doesn't work with intermediate step\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year %in% 2017:2021) |>\n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\n## Use `compute()` to execute intermediate steps\n\n\n::: {.cell}\n\n```{.r .cell-code  code-line-numbers=\"9\"}\nnyc_taxi |>\n  filter(year %in% 2017:2021) |>\n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  compute() |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n\n## Your Turn\n\nUse the function `nrow()` to work out the answers to these questions:\n\n1.  How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2.  How many distinct pickup locations (distinct combinations of the `pickup_latitude` and `pickup_longitude` columns) are in the dataset since 2016? 
\n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n\n## Previewing output for large queries\n\nHow much were fares in GBP (£)?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds <- nyc_taxi |>\n  mutate(\n    fare_amount_pounds = fare_amount * 0.79\n  )\n```\n:::\n\n\nHow many rows?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1150352666\n```\n:::\n:::\n\n\n## Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year == 2020) |>\n  mutate(fare_pounds = fare_amount * 0.79) |>\n  select(fare_amount, fare_pounds) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n  fare_amount fare_pounds\n        <dbl>       <dbl>\n1         8          6.32\n2        17         13.4 \n3         6.5        5.14\n4         7          5.53\n5         6.5        5.14\n6        42         33.2 \n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp <- nyc_taxi |>\n  mutate(across(ends_with(\"amount\"), list(pounds = ~.x * 0.79)))\n\ntaxis_gbp\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\nfare_amount_pounds: double (multiply_checked(fare_amount, 0.79))\ntip_amount_pounds: double 
(multiply_checked(tip_amount, 0.79))\ntolls_amount_pounds: double (multiply_checked(tolls_amount, 0.79))\ntotal_amount_pounds: double (multiply_checked(total_amount, 0.79))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp |>\n  select(contains(\"amount\")) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 8\n  fare_amount tip_amount tolls_amount total_amount fare_amount_pounds\n        <dbl>      <dbl>        <dbl>        <dbl>              <dbl>\n1        29.7       6.04            0        36.2               23.5 \n2         9.3       0               0         9.8                7.35\n3         4.1       1.38            0         5.98               3.24\n4         4.5       1               0         6                  3.56\n5         4.5       0               0         5.5                3.56\n6         4.1       0               0         5.6                3.24\n# ℹ 3 more variables: tip_amount_pounds <dbl>, tolls_amount_pounds <dbl>,\n#   total_amount_pounds <dbl>\n```\n:::\n:::\n\n\n## Summary\n\n-   Use `nrow()` to work out how many rows of data your analyses will return\n-   Use `compute()` when you need to execute intermediate steps\n-   Use `collect()` to pull all of the data into your R session\n-   Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n-   Use `across()` to manipulate data in multiple columns at once\n\n# dplyr verbs API in arrow - alternatives\n\n## Example - `slice()`\n\nFirst three trips in the dataset in 2021 where distance \\> 100 miles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 <- nyc_taxi |>\n  filter(year == 2021 & trip_distance > 100) |>\n  select(pickup_datetime, year, trip_distance)\n\nlong_rides_2021 |>\n  slice(1:3)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in UseMethod(\"slice\"): no applicable method for 'slice' 
applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## Head to the docs!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at <https://arrow.apache.org/docs/r/reference/acero.html>\n\n## A different function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n  slice_max(n = 3, order_by = trip_distance, with_ties = FALSE) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n  pickup_datetime      year trip_distance\n  <dttm>              <int>         <dbl>\n1 2021-11-16 12:55:00  2021       351613.\n2 2021-10-27 17:46:00  2021       345124.\n3 2021-12-11 10:48:00  2021       335094.\n```\n:::\n:::\n\n\n## Or call `collect()` first\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n  collect() |>\n  slice(1:3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n  pickup_datetime      year trip_distance\n  <dttm>              <int>         <dbl>\n1 2021-10-03 16:45:02  2021          134 \n2 2021-10-03 17:29:35  2021          218.\n3 2021-10-03 17:58:15  2021          225.\n```\n:::\n:::\n\n\n## tidyr functions - pivot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\n\nnyc_taxi |> \n  group_by(vendor_name) |>\n  summarise(max_fare = max(fare_amount)) |>\n  pivot_longer(!vendor_name, names_to = \"metric\") |> \n  collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in UseMethod(\"pivot_longer\"): no applicable method for 'pivot_longer' applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## duckdb\n\n![](images/dplyr-arrow-duckdb.png)\n\n## tidyr functions - pivot with duckdb!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(duckdb)\n\nnyc_taxi |> \n  group_by(vendor_name) |>\n  summarise(max_fare = max(fare_amount)) |>\n  to_duckdb() |> # send data to duckdb\n  pivot_longer(!vendor_name, names_to = \"metric\") |> \n  to_arrow() |> # return data back to arrow\n  collect()\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n# A tibble: 3 × 3\n  vendor_name metric     value\n  <chr>       <chr>      <dbl>\n1 CMT         max_fare 998310.\n2 VTS         max_fare  10000.\n3 <NA>        max_fare   3555.\n```\n:::\n:::\n\n\n::: {.callout-caution collapse=\"true\"}\n## Requires arrow 13.0.0\n\nThis code requires arrow 13.0.0 or above to run, due to a bugfix in this version\n:::\n\n# Using functions inside verbs\n\n## Using functions inside verbs\n\n-   lots of the [lubridate](https://lubridate.tidyverse.org/) and [stringr](https://stringr.tidyverse.org/) APIs supported!\n-   base R and others too - always good to check the docs\n\n## Morning vs afternoon with namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code  code-line-numbers=\"2\"}\nnyc_taxi |>\n  group_by(\n    time_of_day = ifelse(lubridate::am(pickup_datetime), \"morning\", \"afternoon\")\n  ) |>\n  count() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups:   time_of_day [2]\n  time_of_day         n\n  <chr>           <int>\n1 afternoon   736491676\n2 morning     413860990\n```\n:::\n:::\n\n\n## Morning vs afternoon - without namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code  code-line-numbers=\"4\"}\nlibrary(lubridate)\n\nnyc_taxi |>\n  group_by(\n    time_of_day = ifelse(am(pickup_datetime), \"morning\", \"afternoon\")\n  ) |>\n  count() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups:   time_of_day [2]\n  time_of_day         n\n  <chr>           <int>\n1 afternoon   736491676\n2 morning     413860990\n```\n:::\n:::\n\n\n## What if a function isn't implemented?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  mutate(vendor_name = na_if(vendor_name, \"CMT\")) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression na_if(vendor_name, \"CMT\") not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Head to the docs again to see what's 
implemented!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at <https://arrow.apache.org/docs/r/reference/acero.html>\n\n## Option 1 - find a workaround!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  mutate(vendor_name = ifelse(vendor_name == \"CMT\", NA, vendor_name)) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 24\n  vendor_name pickup_datetime     dropoff_datetime    passenger_count\n  <chr>       <dttm>              <dttm>                        <int>\n1 <NA>        2012-01-20 14:09:36 2012-01-20 14:42:25               1\n2 <NA>        2012-01-20 14:54:10 2012-01-20 15:06:55               1\n3 <NA>        2012-01-20 08:08:01 2012-01-20 08:11:02               1\n4 <NA>        2012-01-20 08:36:22 2012-01-20 08:39:44               1\n5 <NA>        2012-01-20 20:58:32 2012-01-20 21:03:04               1\n6 <NA>        2012-01-20 19:40:20 2012-01-20 19:43:43               2\n# ℹ 20 more variables: trip_distance <dbl>, pickup_longitude <dbl>,\n#   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,\n#   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,\n#   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,\n#   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,\n#   congestion_surcharge <dbl>, pickup_location_id <int>,\n#   dropoff_location_id <int>, year <int>, month <int>\n```\n:::\n:::\n\n\n## Option 2\n\n-   In data manipulation part 2!\n\n## Your Turn\n\n1.  Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2.  Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3.  
Bonus question: see if you can find a different way of completing the task in question 2.\n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n## Summary\n\n\n-   Working with Arrow Datasets allow you to manipulate data which is larger-than-memory\n-   You can use many dplyr functions with arrow - run `` ?`arrow-dplyr` `` to view the docs\n-   You can pass data to duckdb to use functions implemented in duckdb but not arrow\n",
+    "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n  echo: true\nformat:\n  revealjs: \n    theme: default\nengine: knitr\neditor: source\n---\n\n\n# Data Manipulation---Part 1 {#data-manip-1}\n\n\n::: {.cell}\n\n:::\n\n\n## Goals\n\nAvoiding these! But...don't worry!\n\n![](images/segfault.png)\n\n\n## dplyr API in arrow\n\n![](images/dplyr-backend.png)\n\n## An Arrow Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 120 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n## Arrow Datasets\n\n![](images/nyc_taxi_dataset.png)\n\n\n## Constructing queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nshared_rides <- nyc_taxi |>\n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) \n\nclass(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshared_rides\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nyear: int32\nall_trips: int64\nshared_trips: uint64\npct_shared: double 
(multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false})), 100))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n-   query has been constructed but not evaluated\n-   nothing has been pulled into memory\n\n## To `collect()` or to `compute()`?\n\n-   `compute()` evaluates the query, in-memory output stays in Arrow\n-   `collect()` evaluates the query, in-memory output returns to R\n\n## `compute()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTable\n10 rows x 4 columns\n$year <int32>\n$all_trips <int64>\n$shared_trips <uint64>\n$pct_shared <double>\n```\n:::\n:::\n\n\n## `collect()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 4\n    year all_trips shared_trips pct_shared\n   <int>     <int>        <int>      <dbl>\n 1  2012 178544324     53313752       29.9\n 2  2013 173179759     51215013       29.6\n 3  2014 165114361     48816505       29.6\n 4  2015 146112989     43081091       29.5\n 5  2016 131165043     38163870       29.1\n 6  2017 113495512     32296166       28.5\n 7  2018 102797401     28796633       28.0\n 8  2019  84393604     23515989       27.9\n 9  2020  24647055      5837960       23.7\n10  2021  30902618      7221844       23.4\n```\n:::\n:::\n\n\n## Calling `nrow()` to see how much data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year %in% 2017:2021) |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 356236190\n```\n:::\n:::\n\n\n## Calling `nrow()` 
doesn't work with intermediate step\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year %in% 2017:2021) |>\n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\n## Use `compute()` to execute intermediate steps\n\n\n::: {.cell}\n\n```{.r .cell-code  code-line-numbers=\"9\"}\nnyc_taxi |>\n  filter(year %in% 2017:2021) |>\n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  compute() |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n\n## Your Turn\n\nUse the function `nrow()` to work out the answers to these questions:\n\n1.  How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2.  How many distinct pickup locations (distinct combinations of the `pickup_latitude` and `pickup_longitude` columns) are in the dataset since 2016? 
\n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n\n## Previewing output for large queries\n\nHow much were fares in GBP (£)?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds <- nyc_taxi |>\n  mutate(\n    fare_amount_pounds = fare_amount * 0.79\n  )\n```\n:::\n\n\nHow many rows?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds |>\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1150352666\n```\n:::\n:::\n\n\n## Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year == 2020) |>\n  mutate(fare_pounds = fare_amount * 0.79) |>\n  select(fare_amount, fare_pounds) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n  fare_amount fare_pounds\n        <dbl>       <dbl>\n1         8          6.32\n2        17         13.4 \n3         6.5        5.14\n4         7          5.53\n5         6.5        5.14\n6        42         33.2 \n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp <- nyc_taxi |>\n  mutate(across(ends_with(\"amount\"), list(pounds = ~.x * 0.79)))\n\ntaxis_gbp\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\nfare_amount_pounds: double (multiply_checked(fare_amount, 0.79))\ntip_amount_pounds: double 
(multiply_checked(tip_amount, 0.79))\ntolls_amount_pounds: double (multiply_checked(tolls_amount, 0.79))\ntotal_amount_pounds: double (multiply_checked(total_amount, 0.79))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp |>\n  select(contains(\"amount\")) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 8\n  fare_amount tip_amount tolls_amount total_amount fare_amount_pounds\n        <dbl>      <dbl>        <dbl>        <dbl>              <dbl>\n1        29.7       6.04            0        36.2               23.5 \n2         9.3       0               0         9.8                7.35\n3         4.1       1.38            0         5.98               3.24\n4         4.5       1               0         6                  3.56\n5         4.5       0               0         5.5                3.56\n6         4.1       0               0         5.6                3.24\n# ℹ 3 more variables: tip_amount_pounds <dbl>, tolls_amount_pounds <dbl>,\n#   total_amount_pounds <dbl>\n```\n:::\n:::\n\n\n## Summary\n\n-   Use `nrow()` to work out how many rows of data your analyses will return\n-   Use `compute()` when you need to execute intermediate steps\n-   Use `collect()` to pull all of the data into your R session\n-   Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n-   Use `across()` to manipulate data in multiple columns at once\n\n# dplyr verbs API in arrow - alternatives\n\n## Example - `slice()`\n\nFirst three trips in the dataset in 2021 where distance \\> 100 miles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 <- nyc_taxi |>\n  filter(year == 2021 & trip_distance > 100) |>\n  select(pickup_datetime, year, trip_distance)\n\nlong_rides_2021 |>\n  slice(1:3)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in UseMethod(\"slice\"): no applicable method for 'slice' 
applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## Head to the docs!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at <https://arrow.apache.org/docs/r/reference/acero.html>\n\n## A different function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n  slice_max(n = 3, order_by = trip_distance, with_ties = FALSE) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n  pickup_datetime      year trip_distance\n  <dttm>              <int>         <dbl>\n1 2021-11-16 06:55:00  2021       351613.\n2 2021-10-27 11:46:00  2021       345124.\n3 2021-12-11 04:48:00  2021       335094.\n```\n:::\n:::\n\n\n## Or call `collect()` first\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n  collect() |>\n  slice(1:3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n  pickup_datetime      year trip_distance\n  <dttm>              <int>         <dbl>\n1 2021-01-03 03:01:26  2021          216.\n2 2021-01-03 05:36:52  2021          268.\n3 2021-01-06 01:27:55  2021          271.\n```\n:::\n:::\n\n\n## tidyr functions - pivot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\n\nnyc_taxi |> \n  group_by(vendor_name) |>\n  summarise(max_fare = max(fare_amount)) |>\n  pivot_longer(!vendor_name, names_to = \"metric\") |> \n  collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in UseMethod(\"pivot_longer\"): no applicable method for 'pivot_longer' applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## duckdb\n\n![](images/dplyr-arrow-duckdb.png)\n\n## tidyr functions - pivot with duckdb!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(duckdb)\n\nnyc_taxi |> \n  group_by(vendor_name) |>\n  summarise(max_fare = max(fare_amount)) |>\n  to_duckdb() |> # send data to duckdb\n  pivot_longer(!vendor_name, names_to = \"metric\") |> \n  to_arrow() |> # return data back to arrow\n  collect()\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n# A tibble: 3 × 3\n  vendor_name metric     value\n  <chr>       <chr>      <dbl>\n1 CMT         max_fare 998310.\n2 VTS         max_fare  10000.\n3 <NA>        max_fare   3555.\n```\n:::\n:::\n\n\n::: {.callout-caution collapse=\"true\"}\n## Requires arrow 13.0.0\n\nThis code requires arrow 13.0.0 or above to run, due to a bugfix in this version\n:::\n\n# Using functions inside verbs\n\n## Using functions inside verbs\n\n-   lots of the [lubridate](https://lubridate.tidyverse.org/) and [stringr](https://stringr.tidyverse.org/) APIs supported!\n-   base R and others too - always good to check the docs\n\n## Morning vs afternoon with namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code  code-line-numbers=\"3\"}\nnyc_taxi |>\n  group_by(\n    time_of_day = ifelse(lubridate::am(pickup_datetime), \"morning\", \"afternoon\")\n  ) |>\n  count() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups:   time_of_day [2]\n  time_of_day         n\n  <chr>           <int>\n1 afternoon   736491676\n2 morning     413860990\n```\n:::\n:::\n\n\n## Morning vs afternoon - without namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code  code-line-numbers=\"5\"}\nlibrary(lubridate)\n\nnyc_taxi |>\n  group_by(\n    time_of_day = ifelse(am(pickup_datetime), \"morning\", \"afternoon\")\n  ) |>\n  count() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups:   time_of_day [2]\n  time_of_day         n\n  <chr>           <int>\n1 afternoon   736491676\n2 morning     413860990\n```\n:::\n:::\n\n\n## What if a function isn't implemented?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  mutate(vendor_name = na_if(vendor_name, \"CMT\")) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression na_if(vendor_name, \"CMT\") not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Head to the docs again to see what's 
implemented!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at <https://arrow.apache.org/docs/r/reference/acero.html>\n\n## Option 1 - find a workaround!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  mutate(vendor_name = ifelse(vendor_name == \"CMT\", NA, vendor_name)) |>\n  head() |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 24\n  vendor_name pickup_datetime     dropoff_datetime    passenger_count\n  <chr>       <dttm>              <dttm>                        <int>\n1 <NA>        2012-01-20 08:09:36 2012-01-20 08:42:25               1\n2 <NA>        2012-01-20 08:54:10 2012-01-20 09:06:55               1\n3 <NA>        2012-01-20 02:08:01 2012-01-20 02:11:02               1\n4 <NA>        2012-01-20 02:36:22 2012-01-20 02:39:44               1\n5 <NA>        2012-01-20 14:58:32 2012-01-20 15:03:04               1\n6 <NA>        2012-01-20 13:40:20 2012-01-20 13:43:43               2\n# ℹ 20 more variables: trip_distance <dbl>, pickup_longitude <dbl>,\n#   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,\n#   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,\n#   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,\n#   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,\n#   congestion_surcharge <dbl>, pickup_location_id <int>,\n#   dropoff_location_id <int>, year <int>, month <int>\n```\n:::\n:::\n\n\n## Option 2\n\n-   In data manipulation part 2!\n\n## Your Turn\n\n1.  Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2.  Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3.  
Bonus question: see if you can find a different way of completing the task in question 2.\n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n## Summary\n\n\n-   Working with Arrow Datasets allow you to manipulate data which is larger-than-memory\n-   You can use many dplyr functions with arrow - run `` ?`arrow-dplyr` `` to view the docs\n-   You can pass data to duckdb to use functions implemented in duckdb but not arrow\n",
     "supporting": [
       "2_data_manipulation_1_files"
     ],
diff --git a/materials/2_data_manipulation_1.qmd b/materials/2_data_manipulation_1.qmd
index 5ea88e1..17d71c3 100644
--- a/materials/2_data_manipulation_1.qmd
+++ b/materials/2_data_manipulation_1.qmd
@@ -318,7 +318,7 @@ This code requires arrow 13.0.0 or above to run, due to a bugfix in this version
 
 ## Morning vs afternoon with namespacing
 
-```{r, `code-line-numbers`="2"}
+```{r, `code-line-numbers`="3"}
 #| label: namespacing
 
 nyc_taxi |>
@@ -331,7 +331,7 @@ nyc_taxi |>
 
 ## Morning vs afternoon - without namespacing
 
-```{r, `code-line-numbers`="4"}
+```{r, `code-line-numbers`="5"}
 #| label: no-namespacing
 
 library(lubridate)