Merge pull request #11 from posit-conf-2023/use-13.0.0

Update code to use 13.0.0
posit-conf-2023 · Aug 31, 2023 · 820dd89
2 parents cfcd256 + 5a764b7
Showing 17 changed files with 54 additions and 67 deletions.
diff --git a/_freeze/materials/1_hello_arrow-exercises/execute-results/html.json b/_freeze/materials/1_hello_arrow-exercises/execute-results/html.json
@@ -1,8 +1,10 @@
 {
-  "hash": "741bd535116c5b43069f2373bcc57e78",
+  "hash": "53f610ff8cc8524ff8fdda04614a7b6f",
   "result": {
-    "markdown": "---\ntitle: \"Hello Arrow Exercises\"\nexecute:\n  echo: true\n  messages: false\n  warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 122 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1155795912\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year %in% 2014:2017) |> \n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n   year all_trips shared_trips pct_shared\n  <int>     <int>        <int>      <dbl>\n1  2014 165114361     48816505       29.6\n2  2015 146112989     43081091       29.5\n3  2016 131165043     38163870       29.1\n4  2017 113495512     32296166       28.5\n```\n:::\n:::\n\n\n::: {#exercise-hello-nyc-taxi .callout-tip}\n## Exercises: First {dplyr} pipeline with Arrow\n\n::: panel-tabset\n## Problems\n\n1.  Calculate the total number of rides for every month in 2019\n2.  
About how long did this query of 1.15 billion rows take?\n\n## Solution 1\n\nTotal number of rides for every month in 2019:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  filter(year == 2019) |>\n  count(month) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n   month       n\n   <int>   <int>\n 1     1 7667255\n 2    12 6895933\n 3    11 6877463\n 4    10 7213588\n 5     2 7018750\n 6     3 7832035\n 7     4 7432826\n 8     5 7564884\n 9     6 6940489\n10     7 6310134\n11     8 6072851\n12     9 6567396\n```\n:::\n:::\n\n\n## Solution 2\n\nCompute time for querying the 1.15 billion rows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  filter(year == 2019) |>\n  group_by(month) |>\n  summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n  arrange(month) |> \n  collect() |> \n  system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   user  system elapsed \n  2.962   0.209   0.364 \n```\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |> \n  filter(year == 2019) |>\n  group_by(month) |>\n  summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n  arrange(month) |> \n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n   month longest_trip\n   <int>        <dbl>\n 1     1         832.\n 2     2         702.\n 3     3         237.\n 4     4         831.\n 5     5         401.\n 6     6       45977.\n 7     7         312.\n 8     8         602.\n 9     9         604.\n10    10         308.\n11    11         701.\n12    12       19130.\n```\n:::\n\n```{.r .cell-code}\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.324 sec elapsed\n```\n:::\n:::\n\n:::\n:::\n",
-    "supporting": [],
+    "markdown": "---\ntitle: \"Hello Arrow Exercises\"\nexecute:\n  echo: true\n  messages: false\n  warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 122 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1155795912\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |>\n  filter(year %in% 2014:2017) |> \n  group_by(year) |>\n  summarize(\n    all_trips = n(),\n    shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n  ) |>\n  mutate(pct_shared = shared_trips / all_trips * 100) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n   year all_trips shared_trips pct_shared\n  <int>     <int>        <int>      <dbl>\n1  2014 165114361     48816505       29.6\n2  2015 146112989     43081091       29.5\n3  2016 131165043     38163870       29.1\n4  2017 113495512     32296166       28.5\n```\n:::\n:::\n\n\n::: {#exercise-hello-nyc-taxi .callout-tip}\n## Exercises: First {dplyr} pipeline with Arrow\n\n::: panel-tabset\n## Problems\n\n1.  Calculate the total number of rides for every month in 2019\n2.  
About how long did this query of 1.15 billion rows take?\n\n## Solution 1\n\nTotal number of rides for every month in 2019:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  filter(year == 2019) |>\n  count(month) |>\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n   month       n\n   <int>   <int>\n 1     1 7667255\n 2    11 6877463\n 3    12 6895933\n 4    10 7213588\n 5     2 7018750\n 6     3 7832035\n 7     4 7432826\n 8     5 7564884\n 9     6 6940489\n10     7 6310134\n11     8 6072851\n12     9 6567396\n```\n:::\n:::\n\n\n## Solution 2\n\nCompute time for querying the 1.15 billion rows:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n  filter(year == 2019) |>\n  group_by(month) |>\n  summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n  arrange(month) |> \n  collect() |> \n  system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n   user  system elapsed \n  2.844   0.175   0.331 \n```\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |> \n  filter(year == 2019) |>\n  group_by(month) |>\n  summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n  arrange(month) |> \n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 12 × 2\n   month longest_trip\n   <int>        <dbl>\n 1     1         832.\n 2     2         702.\n 3     3         237.\n 4     4         831.\n 5     5         401.\n 6     6       45977.\n 7     7         312.\n 8     8         602.\n 9     9         604.\n10    10         308.\n11    11         701.\n12    12       19130.\n```\n:::\n\n```{.r .cell-code}\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n0.379 sec elapsed\n```\n:::\n:::\n\n:::\n:::\n",
+    "supporting": [
+      "1_hello_arrow-exercises_files"
+    ],
     "filters": [
       "rmarkdown/pagebreak.lua"
     ],

diff --git a/_freeze/materials/2_data_manipulation_1-exercises/execute-results/html.json b/_freeze/materials/2_data_manipulation_1-exercises/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "62602546d12ab790ca112e42f2d3549a",
+  "hash": "4e2ed176da5e01d7cca35ff8a9067c99",
   "result": {
-    "markdown": "---\ntitle: \"Data Manipulation Part 1 - Exercises\"\nexecute:\n  echo: true\n  messages: false\n  warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(stringr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 122 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n::: {#exercise-compute-collect .callout-tip}\n# Using `compute()` and `collect()`\n\n::: panel-tabset\n## Problem\n\n1.  How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2.  How many distinct pickup locations are in the dataset?\n\n## Solution 1\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-1_6f0b91138fe8ef9057e815121068628b'}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(total_amount > 100) %>%\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1529191\n```\n:::\n:::\n\n\n## Solution 2\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-2_b6ea5034a000a75cef933166dbea5e4e'}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  distinct(pickup_longitude, pickup_latitude) %>%\n  compute() %>%\n  nrow()\n```\n:::\n\n:::\n:::\n\n::: {#exercise-dplyr-api .callout-tip}\n# Using the dplyr API in arrow\n\n::: panel-tabset\n## Problem\n\n1.  
Use the `dplyr::filter()` and `stringr::str_ends()` to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2.  Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3.  Bonus question: see if you can find a different way of completing the task in question 2.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(str_ends(vendor_name, \"S\"), year == 2020,  month == 9) %>%\n  collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 847,149 × 24\n   vendor_name pickup_datetime     dropoff_datetime    passenger_count\n   <chr>       <dttm>              <dttm>                        <int>\n 1 VTS         2020-09-03 14:27:50 2020-09-03 14:43:50               1\n 2 VTS         2020-09-03 14:53:22 2020-09-03 15:07:33               3\n 3 VTS         2020-09-03 14:32:22 2020-09-03 14:41:19               2\n 4 VTS         2020-09-03 14:48:33 2020-09-03 15:06:47               3\n 5 VTS         2020-09-03 14:54:54 2020-09-03 15:13:48               1\n 6 VTS         2020-09-03 14:23:52 2020-09-03 14:26:03               2\n 7 VTS         2020-09-03 14:31:24 2020-09-03 14:35:20               1\n 8 VTS         2020-09-03 14:20:13 2020-09-03 14:49:34               2\n 9 VTS         2020-09-03 14:06:08 2020-09-03 14:19:54               1\n10 VTS         2020-09-03 14:29:26 2020-09-03 14:32:45               1\n# ℹ 847,139 more rows\n# ℹ 20 more variables: trip_distance <dbl>, pickup_longitude <dbl>,\n#   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,\n#   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,\n#   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,\n#   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,\n#   congestion_surcharge <dbl>, pickup_location_id <int>, 
…\n```\n:::\n:::\n\n\n## Solution 2 and 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\")) %>%\n  collect()\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: Expression stringr::str_replace_na(vendor_name, \"No vendor\") not supported in Arrow\nCall collect() first to pull data into R.\n```\n:::\n:::\n\n\nThis won't work as `stringr::str_replace_na()` hasn't been implemented in Arrow. You could try using `mutate()` and `ifelse()` here instead.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  mutate(vendor_name = ifelse(is.na(vendor_name), \"No vendor\", vendor_name)) %>%\n  collect()\n```\n:::\n\n\nOr, if you only needed a subset of the data, you could apply the function after collecting it into R memory.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(year == 2019, month == 10) %>% # smaller subset of the data\n  collect() %>%\n  mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7,213,588 × 24\n   vendor_name pickup_datetime     dropoff_datetime    passenger_count\n   <chr>       <dttm>              <dttm>                        <int>\n 1 VTS         2019-10-01 21:41:22 2019-10-01 21:52:59               1\n 2 CMT         2019-10-01 21:53:46 2019-10-01 22:13:09               1\n 3 CMT         2019-10-01 21:05:22 2019-10-01 21:14:06               1\n 4 CMT         2019-10-01 21:19:59 2019-10-01 21:39:04               1\n 5 CMT         2019-10-01 21:45:45 2019-10-01 22:06:14               1\n 6 CMT         2019-10-01 21:03:44 2019-10-01 21:09:16               1\n 7 CMT         2019-10-01 21:15:40 2019-10-01 21:31:26               1\n 8 CMT         2019-10-01 21:34:57 2019-10-01 21:42:53               1\n 9 CMT         2019-10-01 21:57:55 2019-10-01 22:04:22               1\n10 CMT         2019-10-01 21:19:21 2019-10-01 21:29:08               1\n# ℹ 7,213,578 more 
rows\n# ℹ 20 more variables: trip_distance <dbl>, pickup_longitude <dbl>,\n#   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,\n#   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,\n#   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,\n#   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,\n#   congestion_surcharge <dbl>, pickup_location_id <int>, …\n```\n:::\n:::\n\n:::\n:::\n",
+    "markdown": "---\ntitle: \"Data Manipulation Part 1 - Exercises\"\nexecute:\n  echo: true\n  messages: false\n  warning: false\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(stringr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(here::here(\"data/nyc-taxi\"))\nnyc_taxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nFileSystemDataset with 122 Parquet files\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\nyear: int32\nmonth: int32\n```\n:::\n:::\n\n\n::: {#exercise-compute-collect .callout-tip}\n# Using `compute()` and `collect()`\n\n::: panel-tabset\n## Problem\n\n1.  How many taxi fares in the dataset had a total amount greater than \\$100?\n\n2.  
How many distinct pickup locations are in the dataset since 2016?\n\n## Solution 1\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-1_6f0b91138fe8ef9057e815121068628b'}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(total_amount > 100) %>%\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1529191\n```\n:::\n:::\n\n\n## Solution 2\n\n\n::: {.cell hash='2_data_manipulation_1-exercises_cache/html/compute-collect-2_31838425beb6cb58051570c1c799a7ff'}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(year >= 2016) %>%\n  distinct(pickup_longitude, pickup_latitude) %>%\n  compute() %>%\n  nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 29105801\n```\n:::\n:::\n\n:::\n:::\n\n::: {#exercise-dplyr-api .callout-tip}\n# Using the dplyr API in arrow\n\n::: panel-tabset\n## Problem\n\n1.  Use the `dplyr::filter()` and `stringr::str_ends()` to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter \"S\".\n\n2.  Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string \"No vendor\" instead. What happens, and why?\n\n3.  Bonus question: see if you can find a different way of completing the task in question 2.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(str_ends(vendor_name, \"S\"), year == 2020,  month == 9) %>%\n  collect()\n```\n:::\n\n\n## Solution 2 and 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\")) %>%\n  head() %>%\n  collect()\n```\n:::\n\n\nThis won't work as `stringr::str_replace_na()` hasn't been implemented in Arrow. 
You could try using `mutate()` and `ifelse()` here instead.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  mutate(vendor_name = ifelse(is.na(vendor_name), \"No vendor\", vendor_name)) %>%\n  head() %>%\n  collect()\n```\n:::\n\n\nOr, if you only needed a subset of the data, you could apply the function after collecting it into R memory.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi %>%\n  filter(year == 2019, month == 10) %>% # smaller subset of the data\n  collect() %>%\n  mutate(vendor_name = stringr::str_replace_na(vendor_name, \"No vendor\"))\n```\n:::\n\n:::\n:::\n",
     "supporting": [],
     "filters": [
       "rmarkdown/pagebreak.lua"