From 6fd0912d97fda1226b9d6326651b321484a5053a Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Tue, 18 Jun 2024 09:37:14 +0100 Subject: [PATCH 01/13] Create Writing R Code that Runs in Parallel.md Initial commit. Requires further work. --- R/Writing R Code that Runs in Parallel.md | 79 +++++++++++++++++++++++ 1 file changed, 79 insertions(+) create mode 100644 R/Writing R Code that Runs in Parallel.md diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md new file mode 100644 index 0000000..732e033 --- /dev/null +++ b/R/Writing R Code that Runs in Parallel.md @@ -0,0 +1,79 @@ +# Writing R Code that Runs in Parallel + +## Purpose + +This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides. + +## Summary and Key Points + +Parallel processing involves dividing a large computational task into smaller, more manageable tasks that can be executed simultaneously across multiple processors or cores. This approach can significantly speed up the execution time of complex computations, especially for data-intensive applications. + +`{dplyr}` is a powerful package for data manipulation in R, but it operates on a single core (or CPU). `{multidplyr}` extends `{dplyr}` by allowing operations to be performed in parallel across multiple cores, potentially reducing computation time. + +We will create a large dataset with 10 million rows and 256 numeric columns, then measure the time it takes to perform a data manipulation task using both `{dplyr}` and `{multidplyr}`. + +## Parallel Processing + +### What is Parallel Processing? + +Parallel processing is a method of computation where many calculations or processes are carried out simultaneously. 
Large problems can often be divided into smaller ones, which can then be solved at the same time. This is particularly useful for tasks that require significant computational power and time. + +### How Does Parallel Processing Relate to R? + +R is a single-threaded language by default, meaning it processes tasks sequentially, one after the other. This can be a limitation when working with large datasets or performing complex computations. The `{multidplyr}` package addresses this limitation by enabling parallel processing within the `{dplyr}` framework. + +`{multidplyr}` allows you to partition your data across multiple cores and perform `{dplyr}` operations in parallel. This can lead to significant performance improvements by utilising the full computational power of modern multi-core processors. By distributing the workload, `{multidplyr}` can reduce the time required for data manipulation tasks. + +### Why Use Parallel Processing? + +- **Speed**: By dividing tasks across multiple cores, parallel processing can significantly reduce the time required to complete large computations. +- **Efficiency**: It allows for better utilisation of available computational resources, making it possible to handle larger datasets and more complex analyses. +- **Scalability**: Parallel processing can scale with the number of available cores, making it suitable for both small and large-scale data processing tasks. + +### Reasons You Might Not Use Parallel Processing + +- **Overhead**: Setting up and managing parallel processes can introduce overhead, which might negate the performance benefits for smaller tasks. +- **Complexity**: Writing and debugging parallel code can be more complex than writing sequential code. +- **Memory Usage**: Parallel processing can increase memory usage, as each core may require its own copy of the data. 
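The trade-offs above can be seen in a minimal sketch using base R's `{parallel}` package (this sketch is not part of the document's main example; the worker count of 2 is an illustrative assumption):

```r
library(parallel)

# One independent subtask: summarise a chunk of data
summarise_chunk <- function(chunk) mean(chunk)

# Divide a large vector into one chunk per worker
set.seed(123)
x <- rnorm(1e5)
n_workers <- 2  # illustrative; in practice e.g. parallel::detectCores() - 1
chunks <- split(x, cut(seq_along(x), n_workers, labels = FALSE))

# Sequential version
seq_res <- lapply(chunks, summarise_chunk)

# Parallel version: each chunk is handled by a separate R worker process
cl <- makeCluster(n_workers)
par_res <- parLapply(cl, chunks, summarise_chunk)
stopCluster(cl)

# Both routes produce the same answer; only the execution model differs
identical(seq_res, par_res)
```

On small inputs the parallel version may well be slower: the cluster start-up and data transfer are exactly the overhead described above.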
+ +## Example + +### Creating a Large Dataset with 256 Numeric Columns + +First, we will create a large dataset with 10 million rows and 256 numeric columns. + +```r +# Load necessary libraries +library(dplyr) +library(multidplyr) +library(lubridate) +library(microbenchmark) + +# Set seed for reproducibility +set.seed(123) + +# Create a large dataset with 10 million rows and 256 numeric columns +n <- 10000000 +num_cols <- 256 +data <- data.frame( + id = 1:n, + dt = sample(seq(as.Date('2000/01/01'), as.Date('2024/01/01'), by="day"), n, replace = TRUE) +) + +# Add 256 numeric columns +for (i in 1:num_cols) { + col_name <- paste0("num", i) + data[[col_name]] <- rnorm(n) +} + +# Display the first few rows of the dataset +head(data) +``` + +| id | dt | num1 | num2 | num3 | num4 | num5 | ... | num256 | +|----|------------|-----------|----------|----------|----------|----------|-----|----------| +| 1 | 2001-12-10 | -0.5604756| 9.073164 | 99.37355 | 0.487429 | 0.738325 | ... | 0.718781 | +| 2 | 2011-01-15 | -0.2301775| 9.183643 | 99.18364 | 0.738324 | 0.575781 | ... | 0.158325 | +| 3 | 2010-11-25 | 1.5587083 | 8.164371 | 98.16437 | 0.575781 | 0.694611 | ... | 0.368781 | +| 4 | 2004-01-01 | 0.0705084 | 9.595281 | 99.59528 | 0.694611 | 0.511781 | ... | 0.638325 | +| 5 | 2012-05-20 | 0.1292877 | 9.329508 | 99.32951 | 0.511781 | 0.738325 | ... 
| 0.498611 | From 7c82bedcf560e84f63e2e982ca80311b0b02f962 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Tue, 18 Jun 2024 10:27:13 +0100 Subject: [PATCH 02/13] Update Writing R Code that Runs in Parallel.md Add detail on inherently parallel data manipulation tasks and the {multidplyr} R package --- R/Writing R Code that Runs in Parallel.md | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index 732e033..e5f939d 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -36,9 +36,28 @@ R is a single-threaded language by default, meaning it processes tasks sequentia - **Complexity**: Writing and debugging parallel code can be more complex than writing sequential code. - **Memory Usage**: Parallel processing can increase memory usage, as each core may require its own copy of the data. -## Example +### Inherently (or Embarrassingly) Parallel Data Manipulation Tasks -### Creating a Large Dataset with 256 Numeric Columns +Inherently (or Embarrassingly) parallel tasks are those that can be easily divided into independent subtasks, each of which can be processed simultaneously without requiring communication between the subtasks. This type of parallelism is particularly efficient because it minimises the overhead associated with inter-process communication. + +#### Examples of Inherently Parallel Tasks + +1. Summarising data by groups (e.g., calculating the mean or sum for each group) can be done independently for each group. +2. Running the same computation with different sets of parameters; each parameter set can be processed independently. + +## The `{multidplyr}` R package + +The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores.
The package is part of the [Tidyverse](https://www.tidyverse.org/) + +To use `{multidplyr}`, users first need to create a cluster of worker processes. Each worker is an independent R process that the operating system allocates to different cores. + +The `partition()` function divides the data frame into chunks that are processed independently by each worker, ensuring that all observations within a group are assigned to the same worker, thus maintaining the integrity of grouped operations. Once the data is partitioned, users can perform various dplyr operations such as `mutate()`, `summarise()`, and `filter()` in parallel, and then collect the results using the `collect()` function. + +For simpler operations or smaller datasets (less than ~10 million observations), the overhead of communication between nodes may outweigh the benefits of parallel processing. + +### Example + +#### Creating a Large Dataset with 256 Numeric Columns First, we will create a large dataset with 10 million rows and 256 numeric columns. From 800939f16b87cd7fdfb97ebf0429491119ad9777 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Tue, 18 Jun 2024 10:59:49 +0100 Subject: [PATCH 03/13] Update Writing R Code that Runs in Parallel.md Ready for initial review --- R/Writing R Code that Runs in Parallel.md | 63 ++++++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index e5f939d..cb8885a 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -57,7 +57,7 @@ For simpler operations or smaller datasets (less than ~10 million observations), ### Example -#### Creating a Large Dataset with 256 Numeric Columns +#### Step 1: Creating a Large Dataset with 256 Numeric Columns First, we will create a large dataset with 10 million rows and 256 numeric columns. 
@@ -96,3 +96,64 @@ head(data) | 3 | 2010-11-25 | 1.5587083 | 8.164371 | 98.16437 | 0.575781 | 0.694611 | ... | 0.368781 | | 4 | 2004-01-01 | 0.0705084 | 9.595281 | 99.59528 | 0.694611 | 0.511781 | ... | 0.638325 | | 5 | 2012-05-20 | 0.1292877 | 9.329508 | 99.32951 | 0.511781 | 0.738325 | ... | 0.498611 | + +#### Step 2: Perform Data Manipulation with `{dplyr}` + +We'll group the data by `dt` and calculate the mean of all 256 numeric columns. + +```r +# Measure the time taken by dplyr +dplyr_time <- microbenchmark( + dplyr = { + result_dplyr <- data %>% + group_by(dt) %>% + summarise(across(starts_with("num"), \(x) mean(x, na.rm = TRUE))) + }, + times = 3 +) + +# Print the summary of the benchmark +print(dplyr_time) +``` + +#### Step 3: Perform the same Data Manipulation in Parallel with `{multidplyr}` + +We'll use `{multidplyr}` to parallelise the same operation across multiple cores. + +```r +# Create a cluster with the desired number of workers +cluster <- new_cluster(parallelly::availableCores() - 1) +cluster_library(cluster, "dplyr") + +# Partition the data across the cluster +data_partitioned <- data %>% + group_by(dt) %>% + partition(cluster) + +# Measure the time taken by multidplyr +multidplyr_time <- microbenchmark( + multidplyr = { + result_multidplyr <- data_partitioned %>% + summarise(across(starts_with("num"), \(x) mean(x, na.rm = TRUE))) %>% + collect() + }, + times = 3 +) + +# Print the summary of the benchmark +print(multidplyr_time) +``` + +#### Benchmarking Results + +The results of running the example detailed above in a Posit Workbench session with 16 CPUs are summarised below. + +The results show the minimum, lower quartile (lq), mean, median, upper quartile (uq), and maximum times taken for each method over 3 iterations. 
+ +| Method | Min (s) | LQ (s) | Mean (s) | Median (s) | UQ (s) | Max (s) | Evaluations | +|-------------|---------|--------|----------|------------|--------|---------|-------------| +| dplyr | 61.229 | 61.335 | 61.720 | 61.442 | 61.966 | 62.490 | 3 | +| multidplyr | 6.849 | 6.976 | 7.962 | 7.104 | 8.519 | 9.934 | 3 | + +The results clearly indicate that `{multidplyr}` significantly outperforms `{dplyr}` in terms of execution time for the given data manipulation task. The mean execution time for `{multidplyr}` is approximately 7.96 seconds, compared to 61.72 seconds for dplyr. This demonstrates the potential performance benefits of using parallel processing with `{multidplyr}` for large datasets. + From fb96ab9d6a7c37214eb6a41abf3db58b0caa75cb Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Mon, 1 Jul 2024 15:07:00 +0100 Subject: [PATCH 04/13] Update R/Writing R Code that Runs in Parallel.md Co-authored-by: James Fixter <74598550+JFix89@users.noreply.github.com> --- R/Writing R Code that Runs in Parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index cb8885a..bc49668 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -2,7 +2,7 @@ ## Purpose -This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides. +This document aims to provide R users in Public Health Scotland with an introduction to the concepts of parallel processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides. 
## Summary and Key Points From e618aa3cfd856ec52e1f58f7674c07129bd5c7a4 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Mon, 1 Jul 2024 15:07:24 +0100 Subject: [PATCH 05/13] Update R/Writing R Code that Runs in Parallel.md Co-authored-by: James Fixter <74598550+JFix89@users.noreply.github.com> --- R/Writing R Code that Runs in Parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index bc49668..f5166a6 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -38,7 +38,7 @@ R is a single-threaded language by default, meaning it processes tasks sequentia ### Inherently (or Embarrassingly) Parallel Data Manipulation Tasks -Inherently (or Embarrassingly) parallel tasks are those that can be easily divided into independent subtasks, each of which can be processed simultaneously without requiring communication between the subtasks. This type of parallelism is particularly efficient because it minimises the overhead associated with inter-process communication. +Inherently (or embarrassingly) parallel tasks are those that can be easily divided into independent subtasks, each of which can be processed simultaneously without requiring communication between the subtasks. This type of parallelism is particularly efficient because it minimises the overhead associated with inter-process communication.
#### Examples of Inherently Parallel Tasks From c0d8630a4b78031fbe8ecf5c62faa7681e79e955 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Mon, 1 Jul 2024 15:07:46 +0100 Subject: [PATCH 06/13] Update R/Writing R Code that Runs in Parallel.md Co-authored-by: James Fixter <74598550+JFix89@users.noreply.github.com> --- R/Writing R Code that Runs in Parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index f5166a6..4af846d 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -47,7 +47,7 @@ Inherently (or embarrassingly) parallel tasks are those that can be easily divid ## The `{multidplyr}` R package -The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores. The package is part of the [Tidyverse](https://www.tidyverse.org/) +The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores. The package is part of the [Tidyverse](https://www.tidyverse.org/). To use `{multidplyr}`, users first need to create a cluster of worker processes. Each worker is an independent R process that the operating system allocates to different cores. 
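The workflow described here can be sketched end-to-end on a toy data frame (a hedged sketch, not the document's benchmark; the worker count, group column and values are illustrative assumptions):

```r
library(dplyr)
library(multidplyr)

# Start a cluster of independent R worker processes
# (2 workers purely for illustration; sessions may offer more cores)
cluster <- new_cluster(2)
cluster_library(cluster, "dplyr")

# A toy grouped summary: partition by group, summarise on the workers,
# then collect the pieces back into a single local tibble
df <- tibble(g = rep(c("a", "b"), each = 50), value = rnorm(100))

result <- df %>%
  group_by(g) %>%
  partition(cluster) %>%
  summarise(mean_value = mean(value)) %>%
  collect() %>%
  arrange(g)

result
```

Because `partition()` keeps all rows of a group on the same worker, the grouped summary gives the same result as it would sequentially.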
From 4408c61ff4dafc698ff949d17ce5dd105ae82bcb Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Mon, 1 Jul 2024 15:09:46 +0100 Subject: [PATCH 07/13] Update R/Writing R Code that Runs in Parallel.md Co-authored-by: James Fixter <74598550+JFix89@users.noreply.github.com> --- R/Writing R Code that Runs in Parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index 4af846d..b8b6dfb 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -8,7 +8,7 @@ This document aims to provide R users in Public Health Scotland with an introduc Parallel processing involves dividing a large computational task into smaller, more manageable tasks that can be executed simultaneously across multiple processors or cores. This approach can significantly speed up the execution time of complex computations, especially for data-intensive applications. -`{dplyr}` is a powerful package for data manipulation in R, but it operates on a single core (or CPU). `{multidplyr}` extends `{dplyr}` by allowing operations to be performed in parallel across multiple cores, potentially reducing computation time. +`{dplyr}` is a powerful package for data manipulation in R, but it operates on a single core (i.e. a central processing unit, or CPU for short). `{multidplyr}` extends `{dplyr}` by allowing operations to be performed in parallel across multiple cores, potentially reducing computation time. We will create a large dataset with 10 million rows and 256 numeric columns, then measure the time it takes to perform a data manipulation task using both `{dplyr}` and `{multidplyr}`. 
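The timing comparisons in this document use the `{microbenchmark}` package, which evaluates each expression several times and reports the distribution of timings. A toy sketch of how it is used (the expressions here are illustrative, not the document's actual benchmark):

```r
library(microbenchmark)

# Compare two toy implementations of the same computation; each expression
# is run 5 times and the distribution of timings is reported
mb <- microbenchmark(
  vectorised   = sqrt(1:1e5),
  element_wise = vapply(1:1e5, sqrt, numeric(1)),
  times = 5
)

summary(mb)
```

Repeated runs matter because a single timing can be distorted by garbage collection or other activity on the machine; the document's benchmarks below use `times = 3` to keep total run time manageable.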
From e2a6dc8ff3a65b035bc63f7fb3ddf705123cf20f Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Mon, 1 Jul 2024 15:11:23 +0100 Subject: [PATCH 08/13] Update R/Writing R Code that Runs in Parallel.md Co-authored-by: James McMahon --- R/Writing R Code that Runs in Parallel.md | 1 + 1 file changed, 1 insertion(+) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index b8b6dfb..a8e2dbd 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -102,6 +102,7 @@ head(data) We'll group the data by `dt` and calculate the mean of all 256 numeric columns. ```r +library(microbenchmark) # Measure the time taken by dplyr dplyr_time <- microbenchmark( dplyr = { From ab3b807ac0b3d44aad5caeab3468089de5611fd3 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin Date: Thu, 8 Aug 2024 09:22:31 +0100 Subject: [PATCH 09/13] Update R/Writing R Code that Runs in Parallel.md Draft commit --- R/Writing R Code that Runs in Parallel.md | 43 +++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index a8e2dbd..6b1ffbe 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -45,6 +45,33 @@ Inherently (or embarrassingly) parallel tasks are those that can be easily divid 1. Summarising data by groups (e.g., calculating the mean or sum for each group) can be done independently for each group. 2. Running the same computation with different sets of parameters; each parameter set can be processed independently. +## Multithreading + +### What is multithreading? + +Multithreading is a method of executing code in parallel, rather than sequentially, making better use of computer resources, and potentially reducing the amount of time it takes to execute the code. 
The tasks carried out using multithreading do not necessarily need to be related to one another and do not need to wait for each other to complete. + +### How does multithreading relate to R? + +As stated previously, R is a single-threaded language by default, meaning it processes tasks sequentially. It is not possible to implement multithreading in R without calling on additional R packages, such as the [`{data.table}`](https://rdatatable.gitlab.io/data.table/) package in which many common operations will execute using multiple CPU threads. + +### Multithreading inside a Kubernetes container in Azure + +Your R session on Posit Workbench runs inside a Kubernetes container. That container is allocated a certain amount of CPU resource. This resource is provided by the Azure Kubernetes Service (AKS), where 1 CPU corresponds to 1 vCPU (virtual CPU). 1 vCPU is the equivalent of a single hyper-thread on a physical CPU core. + +If you attempt to run multithreaded code in a session with just 1 CPU in Posit Workbench, the multiple threads will be executed by taking turns on the single thread. This will give the illusion of multithreading, but all that is happening is that each thread is using a slice of CPU time, running sequentially. + +In order to run multithreaded code in Posit Workbench, you must open a session with more than 1 CPU. To demonstrate this, below are the results of running a computationally expensive operation on a large `{data.table}` of 600 million rows in Posit Workbench sessions with 1, 2, 4 and 8 CPUs, and using 1, 2, 4 and 8 threads: + +| Threads | 1 vCPU | 2 vCPUs | 4 vCPUs | 8 vCPUs | +|---------|--------|---------|---------|---------| +| 1 | 72.285 | 72.612 | 67.869 | 67.244 | +| 2 | 75.373 | 48.828 | 44.957 | 45.019 | +| 4 | 87.979 | 52.276 | 34.109 | 34.641 | +| 8 | 100.207| 60.684 | 37.004 | 29.259 | + +The results are the execution times in seconds for each combination of number of CPUs and threads.
As you can see, running code with multiple threads on 1 vCPU takes longer than running the same code in a single thread of execution. A reduction in execution time is only seen if the number of CPUs is increased, and the optimal combination is where the number of CPUs matches the number of threads of execution. + ## The `{multidplyr}` R package The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores. The package is part of the [Tidyverse](https://www.tidyverse.org/). @@ -158,3 +185,19 @@ The results show the minimum, lower quartile (lq), mean, median, upper quartile The results clearly indicate that `{multidplyr}` significantly outperforms `{dplyr}` in terms of execution time for the given data manipulation task. The mean execution time for `{multidplyr}` is approximately 7.96 seconds, compared to 61.72 seconds for dplyr. This demonstrates the potential performance benefits of using parallel processing with `{multidplyr}` for large datasets. +## The `{furrr}` R package + +The [`{furrr}`]{https://furrr.futureverse.org/} R package is an extension of the [`{purrr}`]{https://purrr.tidyverse.org/} package that enables parallel processing across multiple cores with minimal changes to existing purrr-based code. The package is part of the [Futureverse](https://www.futureverse.org/). + +To use `{furrr}`, users first need to set up a parallel processing plan using the [`{future}`]{https://future.futureverse.org/} package. Then, `{purrr}` functions like `map()` can be replaced with their `{furrr}` equivalents e.g. `future_map()`. The `{furrr}` functions will automatically distribute the iterations across the number of cores defined in the processing plan.
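That plan-then-map workflow can be sketched as follows (a hedged sketch: the two workers and the toy computation are illustrative assumptions, not part of the original document):

```r
library(future)
library(furrr)

# Set up the parallel processing plan: two background R sessions (illustrative)
plan(multisession, workers = 2)

# future_map_dbl() is a drop-in replacement for purrr::map_dbl();
# iterations are distributed across the workers defined in the plan
squares <- future_map_dbl(1:10, ~ .x^2)
squares

# Revert to sequential execution when finished
plan(sequential)
```

With `plan(multisession)` each worker is a separate background R session, so the same code works in RStudio or Posit Workbench as well as at the command line.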
+ +The `partition()` function divides the data frame into chunks that are processed independently by each worker, ensuring that all observations within a group are assigned to the same worker, thus maintaining the integrity of grouped operations. Once the data is partitioned, users can perform various dplyr operations such as `mutate()`, `summarise()`, and `filter()` in parallel, and then collect the results using the `collect()` function. + +For simpler operations or smaller datasets (less than ~10 million observations), the overhead of communication between nodes may outweigh the benefits of parallel processing. + +### Example + +Timing Comparison: +Sequential: 150.0840 seconds +Parallel: 30.8370 seconds +Speedup: 4.87x From 85b097aa7ad2aff380a8f272c282c481077612ab Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Thu, 8 Aug 2024 14:25:37 +0100 Subject: [PATCH 10/13] Update R/Writing R Code that Runs in Parallel.md Fix links - wrong type of brackets Co-authored-by: James McMahon --- R/Writing R Code that Runs in Parallel.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index 6b1ffbe..ec20204 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -187,9 +187,9 @@ The results clearly indicate that `{multidplyr}` significantly outperforms `{dpl ## The `{furrr}` R package -The [`{furrr}`]{https://furrr.futureverse.org/} R package is an extension of the [`{purrr}`]{https://purrr.tidyverse.org/} package that enables parallel processing across multiple cores with minimal changes to existing purrr-based code. The package is part of the [Futureverse](https://www.futureverse.org/). 
+The [`{furrr}`](https://furrr.futureverse.org/) R package is an extension of the [`{purrr}`](https://purrr.tidyverse.org/) package that enables parallel processing across multiple cores with minimal changes to existing purrr-based code. The package is part of the [Futureverse](https://www.futureverse.org/). -To use `{furrr}`, users first need to set up a parallel processing plan using the [`{future}`]{https://future.futureverse.org/} package. Then, `{purrr}` functions like `map()` can be replaced with their `{furrr}` equivalents e.g. `future_map()`. The `{furrr}` functions will automatically distribute the iterations across the number of cores defined in the processing plan. +To use `{furrr}`, users first need to set up a parallel processing plan using the [`{future}`](https://future.futureverse.org/) package. Then, `{purrr}` functions like `map()` can be replaced with their `{furrr}` equivalents e.g. `future_map()`. The `{furrr}` functions will automatically distribute the iterations across the number of cores defined in the processing plan. The `partition()` function divides the data frame into chunks that are processed independently by each worker, ensuring that all observations within a group are assigned to the same worker, thus maintaining the integrity of grouped operations. Once the data is partitioned, users can perform various dplyr operations such as `mutate()`, `summarise()`, and `filter()` in parallel, and then collect the results using the `collect()` function. 
From b2394c9571b4486532e914acba552aea7cdfab16 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Thu, 8 Aug 2024 14:33:24 +0100 Subject: [PATCH 11/13] Update R/Writing R Code that Runs in Parallel.md Minor word change Co-authored-by: James McMahon --- R/Writing R Code that Runs in Parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index ec20204..e6c60c2 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -80,7 +80,7 @@ To use `{multidplyr}`, users first need to create a cluster of worker processes. The `partition()` function divides the data frame into chunks that are processed independently by each worker, ensuring that all observations within a group are assigned to the same worker, thus maintaining the integrity of grouped operations. Once the data is partitioned, users can perform various dplyr operations such as `mutate()`, `summarise()`, and `filter()` in parallel, and then collect the results using the `collect()` function. -For simpler operations or smaller datasets (less than ~10 million observations), the overhead of communication between nodes may outweigh the benefits of parallel processing. +For simpler operations or smaller datasets (fewer than ~10 million observations), the overhead of communication between nodes may outweigh the benefits of parallel processing. 
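The overhead point can be demonstrated directly with a hedged sketch using base R's `{parallel}` package on a deliberately small input (absolute timings will vary by machine and session size):

```r
library(parallel)

x <- 1:1000  # a deliberately tiny workload

# Sequential: no set-up cost
seq_time <- system.time(seq_res <- lapply(x, sqrt))[["elapsed"]]

# Parallel: starting the workers and shipping data dominates on inputs this small
par_time <- system.time({
  cl <- makeCluster(2)
  par_res <- parLapply(cl, x, sqrt)
  stopCluster(cl)
})[["elapsed"]]

# The results are identical either way...
identical(seq_res, par_res)

# ...but on a workload this small the parallel route is typically slower
c(sequential = seq_time, parallel = par_time)
```

This is the same reason the document recommends `{multidplyr}` mainly for larger datasets: below a certain size, the communication cost swamps the saving.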
### Example From 47ab853a0df1facfc9cc55b53fa7bb9d93d72147 Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Thu, 8 Aug 2024 14:34:05 +0100 Subject: [PATCH 12/13] Update R/Writing R Code that Runs in Parallel.md Removing explicit loading of microbenchmark package Co-authored-by: James McMahon --- R/Writing R Code that Runs in Parallel.md | 1 - 1 file changed, 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index e6c60c2..ee9eb90 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -93,7 +93,6 @@ First, we will create a large dataset with 10 million rows and 256 numeric colum library(dplyr) library(multidplyr) library(lubridate) -library(microbenchmark) # Set seed for reproducibility set.seed(123) From 2ab352f8349f1faf61b019b8e4b0e60356132a7d Mon Sep 17 00:00:00 2001 From: Terry McLaughlin <45657289+terrymclaughlin@users.noreply.github.com> Date: Thu, 8 Aug 2024 14:34:32 +0100 Subject: [PATCH 13/13] Update R/Writing R Code that Runs in Parallel.md Style changes Co-authored-by: James McMahon --- R/Writing R Code that Runs in Parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/Writing R Code that Runs in Parallel.md b/R/Writing R Code that Runs in Parallel.md index ee9eb90..de3a51d 100644 --- a/R/Writing R Code that Runs in Parallel.md +++ b/R/Writing R Code that Runs in Parallel.md @@ -102,7 +102,7 @@ n <- 10000000 num_cols <- 256 data <- data.frame( id = 1:n, - dt = sample(seq(as.Date('2000/01/01'), as.Date('2024/01/01'), by="day"), n, replace = TRUE) + dt = sample(seq(as.Date("2000/01/01"), as.Date("2024/01/01"), by = "day"), n, replace = TRUE) ) # Add 256 numeric columns