Writing R code that will run in Parallel #110
base: main
Conversation
Initial commit. Requires further work.
Add detail on inherently parallel data manipulation tasks and the {multidplyr} R package
Ready for initial review
Hi Terry, overall I think this is very comprehensive and easy to follow! I've just left a few (very minor) suggestions for things you might consider changing. Thanks for putting this together.
| expr       | min    | lq     | mean   | median | uq     | max    | neval |
|------------|--------|--------|--------|--------|--------|--------|-------|
| dplyr      | 61.229 | 61.335 | 61.720 | 61.442 | 61.966 | 62.490 | 3     |
| multidplyr | 6.849  | 6.976  | 7.962  | 7.104  | 8.519  | 9.934  | 3     |
The results clearly indicate that `{multidplyr}` significantly outperforms `{dplyr}` in terms of execution time for the given data manipulation task. The mean execution time for `{multidplyr}` is approximately 7.96 seconds, compared to 61.72 seconds for `{dplyr}`. This demonstrates the potential performance benefits of using parallel processing with `{multidplyr}` for large datasets.
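For context, a minimal sketch of how a comparison like this might be set up is below. The exact benchmark code is not shown in this excerpt, so the data frame `big_df`, its columns `group` and `value`, and the cluster size of 4 are illustrative assumptions rather than the code behind the figures above.

```r
# Sketch only: `big_df`, `group`, `value` and the 4-worker cluster are
# illustrative, not the dataset or code used for the timings above.
library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)

microbenchmark::microbenchmark(
  dplyr = big_df %>%
    group_by(group) %>%
    summarise(mean_value = mean(value)),
  multidplyr = big_df %>%
    group_by(group) %>%
    partition(cluster) %>%                # spread the groups across workers
    summarise(mean_value = mean(value)) %>%
    collect(),                            # gather the results back together
  times = 3
)
```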
Possibly include a 'Conclusion' section at the same level as the Summary at the start of the document? The structure seems to end quite abruptly at the moment.
I agree @JFix89 that a 'Conclusion' section is needed. I am going to add further content to the document, and then add a 'Conclusion' section that summarises everything.
## Purpose
This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides.
Add links to package docs. Suggested change:

This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run [`{dplyr}` code](https://dplyr.tidyverse.org/) in parallel using the [`{multidplyr}` package](https://multidplyr.tidyverse.org/), explain the benefits of doing so, along with some of the downsides.
Since the practical part of this doc exclusively references multidplyr, I think the title and / or doc name should be changed to reflect that?
Or is the intention that it will be expanded with sections / chapters on other 'parallel packages' in the future?
I am definitely going to add an example using the {furrr} package. I have been working up that example and will add it in very shortly.
### What is Parallel Processing?

Parallel processing is a method of computation where many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. This is particularly useful for tasks that require significant computational power and time.
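As a concrete illustration of the idea, here is a minimal sketch using base R's `{parallel}` package: one large job is split into smaller pieces, each piece is computed on a separate worker process at the same time, and the partial results are combined at the end. The number of workers (4) and the task (a large sum) are arbitrary choices for illustration.

```r
library(parallel)

# Start 4 worker processes (an arbitrary choice for illustration)
cl <- makeCluster(4)

# Divide one large problem into 4 smaller ones
pieces <- split(1:1e6, cut(1:1e6, 4))

# Solve the smaller problems simultaneously on the workers...
partial_sums <- parLapply(cl, pieces, sum)

# ...then combine the partial results
total <- Reduce(`+`, partial_sums)

stopCluster(cl)
```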
### How Does Parallel Processing Relate to R?

R is a single-threaded language by default, meaning it processes tasks sequentially, one after the other. This can be a limitation when working with large datasets or performing complex computations. The `{multidplyr}` package addresses this limitation by enabling parallel processing within the `{dplyr}` framework.
`{multidplyr}` allows you to partition your data across multiple cores and perform `{dplyr}` operations in parallel. This can lead to significant performance improvements by utilising the full computational power of modern multi-core processors. By distributing the workload, `{multidplyr}` can reduce the time required for data manipulation tasks.
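The basic pattern looks like this (a sketch with made-up data, column names and worker count):

```r
# Sketch of the basic {multidplyr} pattern: `my_data`, its columns and the
# number of workers are made up for illustration.
library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)               # start 4 worker processes

result <- my_data %>%
  group_by(area_code) %>%               # whole groups stay together on one worker
  partition(cluster) %>%                # spread the groups across the workers
  summarise(total = sum(count)) %>%     # ordinary dplyr code, run in parallel
  collect()                             # gather the pieces back into one data frame
```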
These two sections are very similar to the summary and key points section above. I'd suggest merging the non-multidplyr content into summary and keypoints, then having a multidplyr section for the rest.
### Reasons You Might Not Use Parallel Processing

- **Overhead**: Setting up and managing parallel processes can introduce overhead, which might negate the performance benefits for smaller tasks.
I think overhead might need an explanation.
The only thing I'm thinking of is an analogy: a car can go faster than a person can walk, but for short journeys the 'overhead' of finding your keys, getting in the car, finding a parking space etc. means that just walking is quicker.
@Moohan Great analogy!
Co-authored-by: James Fixter <[email protected]>
Co-authored-by: James McMahon <[email protected]>
I answered a question recently, using furrr in a way I'd not previously considered: https://teams.microsoft.com/l/message/19:[email protected]/1721726497682?tenantId=10efe0bd-a030-4bca-809c-b5e6745e499a&groupId=ec4250f9-b70a-4f32-9372-a232ccb4f713&parentMessageId=1721652167281&teamName=PHS%20Data%20and%20Intelligence%20Forum&channelName=Beginners%20Channel&createdTime=1721726497682

Used in this way it's very similar to how multidplyr works, but gets round the issue that arbitrary functions aren't 'translated' into multidplyr.
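The pattern being described is roughly the following sketch (the data frame, grouping column and function body are hypothetical). Because `{furrr}` simply runs ordinary R functions on each chunk, nothing needs to be 'translated':

```r
# Sketch: split the data into groups, apply an arbitrary function to each
# group in parallel with {furrr}, then recombine. `my_data`, `group` and
# the function body are hypothetical.
library(dplyr)
library(future)
library(furrr)

plan(multisession, workers = 4)

arbitrary_function <- function(df) {
  # any R code at all, including functions {multidplyr} can't translate
  df %>% mutate(cumulative = cumsum(value))
}

result <- my_data %>%
  group_split(group) %>%
  future_map_dfr(arbitrary_function)

plan(sequential)   # return to normal sequential processing
```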
Draft commit: 0a74b03 to ab3b807
@Moohan Can you let me know what you think of the section on multithreading at https://github.com/Public-Health-Scotland/technical-docs/blob/10-req-guidance-on-writing-r-code-that-will-run-in-parallel/R/Writing%20R%20Code%20that%20Runs%20in%20Parallel.md#multithreading? Hopefully this makes sense?
Other than the comments, a thought that was triggered by your mention of data.table: it would be nice to see a list / table of 'common' packages that do and don't take advantage of >1 CPU. It's something I'm never 100% sure about, e.g. is all of the tidyverse entirely single-threaded?
### Multithreading inside a Kubernetes container in Azure

Your R session on Posit Workbench runs inside a Kubernetes container. That container is allocated a certain amount of CPU resource. This resource is provided by the Azure Kubernetes Service (AKS), where 1 CPU corresponds to 1 vCPU (virtual CPU). 1 vCPU is the equivalent of a single hyper-thread on a physical CPU core.

If you attempt to run multithreaded code in a session with just 1 CPU in Posit Workbench, the multiple threads will be executed by taking turns on the single thread. This will give the illusion of multithreading, but all that is happening is that each thread is using a slice of CPU time, running sequentially.

In order to run multithreaded code in Posit Workbench, you must open a session with more than 1 CPU. To demonstrate this, below are the results of running a computationally expensive operation on a large `{data.table}` of 600 million rows in Posit Workbench sessions with 1, 2, 4 and 8 CPUs, and using 1, 2, 4 and 8 threads:

| Threads | 1 vCPU  | 2 vCPUs | 4 vCPUs | 8 vCPUs |
|---------|---------|---------|---------|---------|
| 1       | 72.285  | 72.612  | 67.869  | 67.244  |
| 2       | 75.373  | 48.828  | 44.957  | 45.019  |
| 4       | 87.979  | 52.276  | 34.109  | 34.641  |
| 8       | 100.207 | 60.684  | 37.004  | 29.259  |

The results are the execution times in seconds for each combination of number of CPUs and threads. As you can see, running code with multiple threads on 1 vCPU takes longer than running the same code in a single thread of execution. A reduction in execution time is only seen if the number of CPUs is increased, and the most optimal combination is where the number of CPUs matches the number of threads of execution.
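For reference, the number of threads `{data.table}` uses can be controlled with `data.table::setDTthreads()`. The sketch below shows the shape of such a test; the data here are made up and much smaller than the 600-million-row table used for the timings above, so the numbers it produces will differ.

```r
library(data.table)

# A much smaller, made-up table than the one used for the timings above
dt <- data.table(group = sample(1e5, 1e7, replace = TRUE),
                 value = rnorm(1e7))

for (threads in c(1, 2, 4, 8)) {
  setDTthreads(threads)   # cap the number of threads data.table may use
  elapsed <- system.time(
    dt[, .(mean_value = mean(value)), by = group]
  )["elapsed"]
  cat(getDTthreads(), "thread(s):", round(elapsed, 3), "seconds\n")
}
```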
I think the gist of this section is good but the language is overly technical in places:

I don't think the distinction of CPU vs. vCPU and hyper-threads is particularly useful; I would remove a lot of the technical language to focus on n CPUs when starting a Workbench session = n threads / workers / possible parallel jobs.

The table is nice but again I think the takeaway message is a bit lost. For me the takeaway is that matching the CPUs requested with actual requirements (in this case threads for data.table) is key. Possibly reduce to 1 or 2 dps, and maybe highlight (italics?) the main diagonal where vCPUs = threads.

I like the last paragraph, maybe highlighting (bold?) "the most optimal combination is where the number of CPUs matches the number of threads of execution". This also seems a good place to mention the 'overhead' associated with making code parallel, possibly just with a link to a different section.
## The `{multidplyr}` R package

The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores. The package is part of the [Tidyverse](https://www.tidyverse.org/).
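As a taste of what this looks like in practice, a minimal sketch of setting up a cluster is below; the `{lubridate}` package and the `lookup_table` object sent to the workers are hypothetical examples. Because the workers are separate R processes, any packages or extra objects they need have to be sent to them explicitly, and sizing the cluster with `parallelly::availableCores()` keeps it in line with the CPUs actually allocated to the Posit Workbench session.

```r
# Sketch only: "lookup_table" and the {lubridate} example are hypothetical.
library(dplyr)
library(multidplyr)

# One worker per CPU actually available to this session
cluster <- new_cluster(parallelly::availableCores())

# Workers are separate R processes, so they need packages and any
# non-partitioned objects copied to them explicitly
cluster_library(cluster, "lubridate")
cluster_copy(cluster, "lookup_table")
```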
It would probably be useful to either fudge the terminology and use one of CPU / core / thread, or have a mini-glossary to highlight the difference. I think it's confusing to have one section talk about threads and vCPUs and the next to talk about cores.
You make a good point - and also above in the multithreading section. The terminology is confusing, and unless you have some prior understanding of CPU architecture, and know something about the concept of virtual CPUs, it quite literally will mean nothing. I need to go through this again and use the term CPU throughout, because that's what is referred to in Posit Workbench.
Fix links - wrong type of brackets Co-authored-by: James McMahon <[email protected]>
Minor word change Co-authored-by: James McMahon <[email protected]>
Removing explicit loading of microbenchmark package Co-authored-by: James McMahon <[email protected]>
Style changes Co-authored-by: James McMahon <[email protected]>
Pull Request Details
Issue Number: #10
Type: Documentation
Description of the Change
I've focused on writing up an example of using `{multidplyr}` to run R code in parallel.

Verification Process
I've tested the code runs on Posit Workbench, and the benchmark results are from this testing.
Additional Work Required
More background on the difference between single-threaded execution, multithreaded execution, parallel processing and sequential processing.
Need to also include examples of using `{furrr}`.

Also detail why we need to use the `parallelly::availableCores()` function on Posit Workbench running in Kubernetes (a rough sketch of the difference is below).
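For when that section gets written, the key point as I understand it is that `parallel::detectCores()` reports the CPUs of the underlying node, whereas `parallelly::availableCores()` respects the limits placed on the container (e.g. by Kubernetes), so it reflects what the session can actually use. A minimal sketch, with illustrative numbers in the comments:

```r
# Sketch: the numbers in the comments are illustrative, not measured.
library(multidplyr)

parallel::detectCores()        # CPUs on the underlying node (e.g. 16)
parallelly::availableCores()   # CPUs the container is actually allowed (e.g. 4)

# So size the cluster from availableCores(), not detectCores()
cluster <- new_cluster(parallelly::availableCores())
```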
Release Notes
This is the first version of this documentation.
Closes #10