From 8c662529388a64d3ff2a34d84ba9f33acae7d55e Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Sun, 17 Sep 2023 05:14:12 -0500
Subject: [PATCH] fix row highlighting

---
 .../materials/2_data_manipulation_1/execute-results/html.json | 4 ++--
 materials/2_data_manipulation_1.qmd | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/_freeze/materials/2_data_manipulation_1/execute-results/html.json b/_freeze/materials/2_data_manipulation_1/execute-results/html.json
index 69f4d1d..558da2e 100644
--- a/_freeze/materials/2_data_manipulation_1/execute-results/html.json
+++ b/_freeze/materials/2_data_manipulation_1/execute-results/html.json
@@ -1,7 +1,7 @@
 {
- "hash": "582b36bee10442ef8a0af96d0f49ebaf",
+ "hash": "ed57973bde547ddb275688a9aa4faa4c",
 "result": {
- "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Data Manipulation---Part 1 {#data-manip-1}\n\n\n::: {.cell}\n\n:::\n\n\n## Goals\n\nAvoiding these! 
But...don't worry!\n\n![](images/segfault.png)\n\n\n## dplyr API in arrow\n\n![](images/dplyr-backend.png)\n\n## An Arrow Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 120 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n## Arrow Datasets\n\n![](images/nyc_taxi_dataset.png)\n\n\n## Constructing queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nshared_rides <- nyc_taxi |>\n group_by(year) |>\n summarize(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) \n\nclass(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshared_rides\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nyear: int32\nall_trips: int64\nshared_trips: uint64\npct_shared: double (multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, 
allow_float_truncate=false, allow_invalid_utf8=false})), 100))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n- query has been constructed but not evaluated\n- nothing has been pulled into memory\n\n## To `collect()` or to `compute()`?\n\n- `compute()` evaluates the query, in-memory output stays in Arrow\n- `collect()` evaluates the query, in-memory output returns to R\n\n## `compute()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTable\n10 rows x 4 columns\n$year \n$all_trips \n$shared_trips \n$pct_shared \n```\n:::\n:::\n\n\n## `collect()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 4\n year all_trips shared_trips pct_shared\n \n 1 2012 178544324 53313752 29.9\n 2 2013 173179759 51215013 29.6\n 3 2014 165114361 48816505 29.6\n 4 2015 146112989 43081091 29.5\n 5 2016 131165043 38163870 29.1\n 6 2017 113495512 32296166 28.5\n 7 2018 102797401 28796633 28.0\n 8 2019 84393604 23515989 27.9\n 9 2020 24647055 5837960 23.7\n10 2021 30902618 7221844 23.4\n```\n:::\n:::\n\n\n## Calling `nrow()` to see how much data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year %in% 2017:2021) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 356236190\n```\n:::\n:::\n\n\n## Calling `nrow()` doesn't work with intermediate step\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year %in% 2017:2021) |>\n group_by(year) |>\n summarize(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\n## Use `compute()` to execute intermediate steps\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"9\"}\nnyc_taxi |>\n filter(year %in% 2017:2021) |>\n group_by(year) |>\n summarize(\n 
all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n compute() |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n\n## Your Turn\n\nUse the function `nrow()` to work out the answers to these questions:\n\n1. How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2. How many distinct pickup locations (distinct combinations of the `pickup_latitude` and `pickup_longitude` columns) are in the dataset since 2016? \n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n\n## Previewing output for large queries\n\nHow much were fares in GBP (£)?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds <- nyc_taxi |>\n mutate(\n fare_amount_pounds = fare_amount * 0.79\n )\n```\n:::\n\n\nHow many rows?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1150352666\n```\n:::\n:::\n\n\n## Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year == 2020) |>\n mutate(fare_pounds = fare_amount * 0.79) |>\n select(fare_amount, fare_pounds) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n fare_amount fare_pounds\n \n1 8 6.32\n2 17 13.4 \n3 6.5 5.14\n4 7 5.53\n5 6.5 5.14\n6 42 33.2 \n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp <- nyc_taxi |>\n mutate(across(ends_with(\"amount\"), list(pounds = ~.x * 0.79)))\n\ntaxis_gbp\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: 
string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\nfare_amount_pounds: double (multiply_checked(fare_amount, 0.79))\ntip_amount_pounds: double (multiply_checked(tip_amount, 0.79))\ntolls_amount_pounds: double (multiply_checked(tolls_amount, 0.79))\ntotal_amount_pounds: double (multiply_checked(total_amount, 0.79))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp |>\n select(contains(\"amount\")) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 8\n fare_amount tip_amount tolls_amount total_amount fare_amount_pounds\n \n1 29.7 6.04 0 36.2 23.5 \n2 9.3 0 0 9.8 7.35\n3 4.1 1.38 0 5.98 3.24\n4 4.5 1 0 6 3.56\n5 4.5 0 0 5.5 3.56\n6 4.1 0 0 5.6 3.24\n# ℹ 3 more variables: tip_amount_pounds , tolls_amount_pounds ,\n# total_amount_pounds \n```\n:::\n:::\n\n\n## Summary\n\n- Use `nrow()` to work out how many rows of data your analyses will return\n- Use `compute()` when you need to execute intermediate steps\n- Use `collect()` to pull all of the data into your R session\n- Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n- Use `across()` to manipulate data in multiple columns at once\n\n# dplyr verbs API in arrow - alternatives\n\n## Example - `slice()`\n\nFirst three trips in the dataset in 2021 where distance \\> 100 miles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 <- nyc_taxi |>\n filter(year == 2021 & trip_distance > 100) |>\n select(pickup_datetime, year, trip_distance)\n\nlong_rides_2021 |>\n slice(1:3)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in 
UseMethod(\"slice\"): no applicable method for 'slice' applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## Head to the docs!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at \n\n## A different function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n slice_max(n = 3, order_by = trip_distance, with_ties = FALSE) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n pickup_datetime year trip_distance\n \n1 2021-11-16 12:55:00 2021 351613.\n2 2021-10-27 17:46:00 2021 345124.\n3 2021-12-11 10:48:00 2021 335094.\n```\n:::\n:::\n\n\n## Or call `collect()` first\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n collect() |>\n slice(1:3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n pickup_datetime year trip_distance\n \n1 2021-10-03 16:45:02 2021 134 \n2 2021-10-03 17:29:35 2021 218.\n3 2021-10-03 17:58:15 2021 225.\n```\n:::\n:::\n\n\n## tidyr functions - pivot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\n\nnyc_taxi |> \n group_by(vendor_name) |>\n summarise(max_fare = max(fare_amount)) |>\n pivot_longer(!vendor_name, names_to = \"metric\") |> \n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in UseMethod(\"pivot_longer\"): no applicable method for 'pivot_longer' applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## duckdb\n\n![](images/dplyr-arrow-duckdb.png)\n\n## tidyr functions - pivot with duckdb!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(duckdb)\n\nnyc_taxi |> \n group_by(vendor_name) |>\n summarise(max_fare = max(fare_amount)) |>\n to_duckdb() |> # send data to duckdb\n pivot_longer(!vendor_name, names_to = \"metric\") |> \n to_arrow() |> # return data back to arrow\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n vendor_name metric value\n \n1 CMT max_fare 998310.\n2 VTS max_fare 10000.\n3 max_fare 3555.\n```\n:::\n:::\n\n\n::: 
{.callout-caution collapse=\"true\"}\n## Requires arrow 13.0.0\n\nThis code requires arrow 13.0.0 or above to run, due to a bugfix in this version\n:::\n\n# Using functions inside verbs\n\n## Using functions inside verbs\n\n- lots of the [lubridate](https://lubridate.tidyverse.org/) and [stringr](https://stringr.tidyverse.org/) APIs supported!\n- base R and others too - always good to check the docs\n\n## Morning vs afternoon with namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2\"}\nnyc_taxi |>\n group_by(\n time_of_day = ifelse(lubridate::am(pickup_datetime), \"morning\", \"afternoon\")\n ) |>\n count() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups: time_of_day [2]\n time_of_day n\n \n1 afternoon 736491676\n2 morning 413860990\n```\n:::\n:::\n\n\n## Morning vs afternoon - without namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4\"}\nlibrary(lubridate)\n\nnyc_taxi |>\n group_by(\n time_of_day = ifelse(am(pickup_datetime), \"morning\", \"afternoon\")\n ) |>\n count() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups: time_of_day [2]\n time_of_day n\n \n1 afternoon 736491676\n2 morning 413860990\n```\n:::\n:::\n\n\n## What if a function isn't implemented?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = na_if(vendor_name, \"CMT\")) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression na_if(vendor_name, \"CMT\") not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Head to the docs again to see what's implemented!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at \n\n## Option 1 - find a workaround!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = ifelse(vendor_name == \"CMT\", NA, vendor_name)) |>\n head() |>\n collect()\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n# A tibble: 6 × 24\n vendor_name pickup_datetime dropoff_datetime passenger_count\n \n1 2012-01-20 14:09:36 2012-01-20 14:42:25 1\n2 2012-01-20 14:54:10 2012-01-20 15:06:55 1\n3 2012-01-20 08:08:01 2012-01-20 08:11:02 1\n4 2012-01-20 08:36:22 2012-01-20 08:39:44 1\n5 2012-01-20 20:58:32 2012-01-20 21:03:04 1\n6 2012-01-20 19:40:20 2012-01-20 19:43:43 2\n# ℹ 20 more variables: trip_distance , pickup_longitude ,\n# pickup_latitude , rate_code , store_and_fwd ,\n# dropoff_longitude , dropoff_latitude , payment_type ,\n# fare_amount , extra , mta_tax , tip_amount ,\n# tolls_amount , total_amount , improvement_surcharge ,\n# congestion_surcharge , pickup_location_id ,\n# dropoff_location_id , year , month \n```\n:::\n:::\n\n\n## Option 2\n\n- In data manipulation part 2!\n\n## Your Turn\n\n1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3. Bonus question: see if you can find a different way of completing the task in question 2.\n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n## Summary\n\n\n- Working with Arrow Datasets allow you to manipulate data which is larger-than-memory\n- You can use many dplyr functions with arrow - run `` ?`arrow-dplyr` `` to view the docs\n- You can pass data to duckdb to use functions implemented in duckdb but not arrow\n", + "markdown": "---\nfooter: \"[🔗 posit.io/arrow](https://posit-conf-2023.github.io/arrow)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Data Manipulation---Part 1 {#data-manip-1}\n\n\n::: {.cell}\n\n:::\n\n\n## Goals\n\nAvoiding these! 
But...don't worry!\n\n![](images/segfault.png)\n\n\n## dplyr API in arrow\n\n![](images/dplyr-backend.png)\n\n## An Arrow Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 120 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n## Arrow Datasets\n\n![](images/nyc_taxi_dataset.png)\n\n\n## Constructing queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nshared_rides <- nyc_taxi |>\n group_by(year) |>\n summarize(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) \n\nclass(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshared_rides\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nyear: int32\nall_trips: int64\nshared_trips: uint64\npct_shared: double (multiply_checked(divide(cast(shared_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}), cast(all_trips, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, 
allow_float_truncate=false, allow_invalid_utf8=false})), 100))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## arrow dplyr queries\n\n- query has been constructed but not evaluated\n- nothing has been pulled into memory\n\n## To `collect()` or to `compute()`?\n\n- `compute()` evaluates the query, in-memory output stays in Arrow\n- `collect()` evaluates the query, in-memory output returns to R\n\n## `compute()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncompute(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nTable\n10 rows x 4 columns\n$year \n$all_trips \n$shared_trips \n$pct_shared \n```\n:::\n:::\n\n\n## `collect()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect(shared_rides)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 10 × 4\n year all_trips shared_trips pct_shared\n \n 1 2012 178544324 53313752 29.9\n 2 2013 173179759 51215013 29.6\n 3 2014 165114361 48816505 29.6\n 4 2015 146112989 43081091 29.5\n 5 2016 131165043 38163870 29.1\n 6 2017 113495512 32296166 28.5\n 7 2018 102797401 28796633 28.0\n 8 2019 84393604 23515989 27.9\n 9 2020 24647055 5837960 23.7\n10 2021 30902618 7221844 23.4\n```\n:::\n:::\n\n\n## Calling `nrow()` to see how much data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year %in% 2017:2021) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 356236190\n```\n:::\n:::\n\n\n## Calling `nrow()` doesn't work with intermediate step\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year %in% 2017:2021) |>\n group_by(year) |>\n summarize(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n:::\n\n\n## Use `compute()` to execute intermediate steps\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"9\"}\nnyc_taxi |>\n filter(year %in% 2017:2021) |>\n group_by(year) |>\n summarize(\n 
all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n compute() |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n\n## Your Turn\n\nUse the function `nrow()` to work out the answers to these questions:\n\n1. How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2. How many distinct pickup locations (distinct combinations of the `pickup_latitude` and `pickup_longitude` columns) are in the dataset since 2016? \n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n\n## Previewing output for large queries\n\nHow much were fares in GBP (£)?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds <- nyc_taxi |>\n mutate(\n fare_amount_pounds = fare_amount * 0.79\n )\n```\n:::\n\n\nHow many rows?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfares_pounds |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1150352666\n```\n:::\n:::\n\n\n## Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n filter(year == 2020) |>\n mutate(fare_pounds = fare_amount * 0.79) |>\n select(fare_amount, fare_pounds) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 2\n fare_amount fare_pounds\n \n1 8 6.32\n2 17 13.4 \n3 6.5 5.14\n4 7 5.53\n5 6.5 5.14\n6 42 33.2 \n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp <- nyc_taxi |>\n mutate(across(ends_with(\"amount\"), list(pounds = ~.x * 0.79)))\n\ntaxis_gbp\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset (query)\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: 
string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\nfare_amount_pounds: double (multiply_checked(fare_amount, 0.79))\ntip_amount_pounds: double (multiply_checked(tip_amount, 0.79))\ntolls_amount_pounds: double (multiply_checked(tolls_amount, 0.79))\ntotal_amount_pounds: double (multiply_checked(total_amount, 0.79))\n\nSee $.data for the source Arrow object\n```\n:::\n:::\n\n\n## Use `across()` to transform data in multiple columns\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxis_gbp |>\n select(contains(\"amount\")) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 6 × 8\n fare_amount tip_amount tolls_amount total_amount fare_amount_pounds\n \n1 29.7 6.04 0 36.2 23.5 \n2 9.3 0 0 9.8 7.35\n3 4.1 1.38 0 5.98 3.24\n4 4.5 1 0 6 3.56\n5 4.5 0 0 5.5 3.56\n6 4.1 0 0 5.6 3.24\n# ℹ 3 more variables: tip_amount_pounds , tolls_amount_pounds ,\n# total_amount_pounds \n```\n:::\n:::\n\n\n## Summary\n\n- Use `nrow()` to work out how many rows of data your analyses will return\n- Use `compute()` when you need to execute intermediate steps\n- Use `collect()` to pull all of the data into your R session\n- Use `head()`, `select()`, `filter()`, and `collect()` to preview results\n- Use `across()` to manipulate data in multiple columns at once\n\n# dplyr verbs API in arrow - alternatives\n\n## Example - `slice()`\n\nFirst three trips in the dataset in 2021 where distance \\> 100 miles\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 <- nyc_taxi |>\n filter(year == 2021 & trip_distance > 100) |>\n select(pickup_datetime, year, trip_distance)\n\nlong_rides_2021 |>\n slice(1:3)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in 
UseMethod(\"slice\"): no applicable method for 'slice' applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## Head to the docs!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at \n\n## A different function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n slice_max(n = 3, order_by = trip_distance, with_ties = FALSE) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n pickup_datetime year trip_distance\n \n1 2021-11-16 06:55:00 2021 351613.\n2 2021-10-27 11:46:00 2021 345124.\n3 2021-12-11 04:48:00 2021 335094.\n```\n:::\n:::\n\n\n## Or call `collect()` first\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlong_rides_2021 |>\n collect() |>\n slice(1:3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n pickup_datetime year trip_distance\n \n1 2021-01-03 03:01:26 2021 216.\n2 2021-01-03 05:36:52 2021 268.\n3 2021-01-06 01:27:55 2021 271.\n```\n:::\n:::\n\n\n## tidyr functions - pivot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyr)\n\nnyc_taxi |> \n group_by(vendor_name) |>\n summarise(max_fare = max(fare_amount)) |>\n pivot_longer(!vendor_name, names_to = \"metric\") |> \n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in UseMethod(\"pivot_longer\"): no applicable method for 'pivot_longer' applied to an object of class \"arrow_dplyr_query\"\n```\n:::\n:::\n\n\n## duckdb\n\n![](images/dplyr-arrow-duckdb.png)\n\n## tidyr functions - pivot with duckdb!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(duckdb)\n\nnyc_taxi |> \n group_by(vendor_name) |>\n summarise(max_fare = max(fare_amount)) |>\n to_duckdb() |> # send data to duckdb\n pivot_longer(!vendor_name, names_to = \"metric\") |> \n to_arrow() |> # return data back to arrow\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 3\n vendor_name metric value\n \n1 CMT max_fare 998310.\n2 VTS max_fare 10000.\n3 max_fare 3555.\n```\n:::\n:::\n\n\n::: 
{.callout-caution collapse=\"true\"}\n## Requires arrow 13.0.0\n\nThis code requires arrow 13.0.0 or above to run, due to a bugfix in this version\n:::\n\n# Using functions inside verbs\n\n## Using functions inside verbs\n\n- lots of the [lubridate](https://lubridate.tidyverse.org/) and [stringr](https://stringr.tidyverse.org/) APIs supported!\n- base R and others too - always good to check the docs\n\n## Morning vs afternoon with namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3\"}\nnyc_taxi |>\n group_by(\n time_of_day = ifelse(lubridate::am(pickup_datetime), \"morning\", \"afternoon\")\n ) |>\n count() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups: time_of_day [2]\n time_of_day n\n \n1 afternoon 736491676\n2 morning 413860990\n```\n:::\n:::\n\n\n## Morning vs afternoon - without namespacing\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5\"}\nlibrary(lubridate)\n\nnyc_taxi |>\n group_by(\n time_of_day = ifelse(am(pickup_datetime), \"morning\", \"afternoon\")\n ) |>\n count() |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n# Groups: time_of_day [2]\n time_of_day n\n \n1 afternoon 736491676\n2 morning 413860990\n```\n:::\n:::\n\n\n## What if a function isn't implemented?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = na_if(vendor_name, \"CMT\")) |>\n head() |>\n collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression na_if(vendor_name, \"CMT\") not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\n## Head to the docs again to see what's implemented!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?`arrow-dplyr`\n```\n:::\n\n\nor view them at \n\n## Option 1 - find a workaround!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n mutate(vendor_name = ifelse(vendor_name == \"CMT\", NA, vendor_name)) |>\n head() |>\n collect()\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n# A tibble: 6 × 24\n vendor_name pickup_datetime dropoff_datetime passenger_count\n \n1 2012-01-20 08:09:36 2012-01-20 08:42:25 1\n2 2012-01-20 08:54:10 2012-01-20 09:06:55 1\n3 2012-01-20 02:08:01 2012-01-20 02:11:02 1\n4 2012-01-20 02:36:22 2012-01-20 02:39:44 1\n5 2012-01-20 14:58:32 2012-01-20 15:03:04 1\n6 2012-01-20 13:40:20 2012-01-20 13:43:43 2\n# ℹ 20 more variables: trip_distance , pickup_longitude ,\n# pickup_latitude , rate_code , store_and_fwd ,\n# dropoff_longitude , dropoff_latitude , payment_type ,\n# fare_amount , extra , mta_tax , tip_amount ,\n# tolls_amount , total_amount , improvement_surcharge ,\n# congestion_surcharge , pickup_location_id ,\n# dropoff_location_id , year , month \n```\n:::\n:::\n\n\n## Option 2\n\n- In data manipulation part 2!\n\n## Your Turn\n\n1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3. 
Bonus question: see if you can find a different way of completing the task in question 2.\n\n➡️ [Data Manipulation Part I Exercises Page](2_data_manipulation_1-exercises.html)\n\n## Summary\n\n\n- Working with Arrow Datasets allow you to manipulate data which is larger-than-memory\n- You can use many dplyr functions with arrow - run `` ?`arrow-dplyr` `` to view the docs\n- You can pass data to duckdb to use functions implemented in duckdb but not arrow\n",
 "supporting": [
 "2_data_manipulation_1_files"
 ],
diff --git a/materials/2_data_manipulation_1.qmd b/materials/2_data_manipulation_1.qmd
index 5ea88e1..17d71c3 100644
--- a/materials/2_data_manipulation_1.qmd
+++ b/materials/2_data_manipulation_1.qmd
@@ -318,7 +318,7 @@ This code requires arrow 13.0.0 or above to run, due to a bugfix in this version
 
 ## Morning vs afternoon with namespacing
 
-```{r, `code-line-numbers`="2"}
+```{r, `code-line-numbers`="3"}
 #| label: namespacing
 
 nyc_taxi |>
@@ -331,7 +331,7 @@ nyc_taxi |>
 
 ## Morning vs afternoon - without namespacing
 
-```{r, `code-line-numbers`="4"}
+```{r, `code-line-numbers`="5"}
 #| label: no-namespacing
 
 library(lubridate)