
Writing R code that will run in Parallel #110

Draft: terrymclaughlin wants to merge 13 commits into main from 10-req-guidance-on-writing-r-code-that-will-run-in-parallel
Conversation

terrymclaughlin (Contributor) commented:

Pull Request Details

Issue Number: #10

Type: Documentation

Description of the Change

I've focused on writing up an example of using {multidplyr} to run R code in parallel.

Verification Process

I've tested that the code runs on Posit Workbench, and the benchmark results are from this testing.

Additional Work Required

- More background on the difference between single-threaded execution, multithreaded execution, parallel processing and sequential processing.
- Examples of using {furrr}.
- Detail on why we need to use the parallelly::availableCores() function on Posit Workbench running in Kubernetes (see the sketch below).
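
As a first pass at that explanation, here is a minimal sketch of the difference; the printed values are illustrative and depend on the underlying node and the session's CPU allocation:

```r
# parallel::detectCores() reports the cores of the underlying machine (the
# whole AKS node), not the CPUs allocated to this Kubernetes container, so it
# can massively overstate what the session is actually allowed to use.
# parallelly::availableCores() respects cgroup limits and related settings,
# returning the CPUs actually available to the session.
parallel::detectCores()
parallelly::availableCores()

# Safer default when deciding how many workers to create:
n_workers <- parallelly::availableCores()
```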

Release Notes

This is the first version of this documentation.

Closes #10

Initial commit.  Requires further work.
Add detail on inherently parallel data manipulation tasks and the {multidplyr} R package
terrymclaughlin linked an issue on Jun 18, 2024 that may be closed by this pull request
terrymclaughlin self-assigned this on Jun 18, 2024
terrymclaughlin added the 'documentation' label (Improvements or additions to documentation) on Jun 18, 2024
@JFix89 (Contributor) left a comment:

Hi Terry, overall I think this is very comprehensive and easy to follow! I've just left a few (very minor) suggestions for things you might consider changing. Thanks for putting this together.

Execution times in seconds:

| expression | min | lq | mean | median | uq | max | neval |
|------------|--------|--------|--------|--------|--------|--------|-------|
| dplyr | 61.229 | 61.335 | 61.720 | 61.442 | 61.966 | 62.490 | 3 |
| multidplyr | 6.849 | 6.976 | 7.962 | 7.104 | 8.519 | 9.934 | 3 |

The results clearly indicate that `{multidplyr}` significantly outperforms `{dplyr}` in terms of execution time for the given data manipulation task. The mean execution time for `{multidplyr}` is approximately 7.96 seconds, compared to 61.72 seconds for `{dplyr}`. This demonstrates the potential performance benefits of using parallel processing with `{multidplyr}` for large datasets.
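
For readers unfamiliar with where a summary like this comes from, below is a minimal, generic `{microbenchmark}` sketch; the expressions and `times = 3` are illustrative, not the code used to produce the table above:

```r
library(microbenchmark)

# Each expression is evaluated `times` times; the printed summary reports
# min / lq / mean / median / uq / max / neval, the same columns as the
# table above.
microbenchmark(
  vectorised = sqrt(1:1e6),
  looped     = vapply(1:1e6, sqrt, numeric(1)),
  times = 3
)
```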
Contributor commented:

Possibly include a 'Conclusion' section at the same level as the Summary at the start of the document? The structure seems to end quite abruptly at the moment.

terrymclaughlin (Contributor, Author) replied:

I agree @JFix89 that a 'Conclusion' section is needed. I am going to add further content to the document, and then add a 'Conclusion' section that summarises everything.


## Purpose

This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides.
Member commented:

Add links to package docs

Suggested change
This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides.
This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run [`{dplyr}` code](https://dplyr.tidyverse.org/) in parallel using the [`{multidplyr}` package](https://multidplyr.tidyverse.org/), explain the benefits of doing so, along with some of the downsides.

Member commented:

Since the practical part of this doc exclusively references multidplyr, I think the title and / or doc name should be changed to reflect that?

Or is the intention that it will be expanded with sections / chapters on other 'parallel packages' in the future?

terrymclaughlin (Contributor, Author) replied:

I am definitely going to add an example using the {furrr} package. I have been working up that example and will add it in very shortly.

Comment on lines +17 to +25
### What is Parallel Processing?

Parallel processing is a method of computation where many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. This is particularly useful for tasks that require significant computational power and time.

### How Does Parallel Processing Relate to R?

R is a single-threaded language by default, meaning it processes tasks sequentially, one after the other. This can be a limitation when working with large datasets or performing complex computations. The `{multidplyr}` package addresses this limitation by enabling parallel processing within the `{dplyr}` framework.

`{multidplyr}` allows you to partition your data across multiple cores and perform `{dplyr}` operations in parallel. This can lead to significant performance improvements by utilising the full computational power of modern multi-core processors. By distributing the workload, `{multidplyr}` can reduce the time required for data manipulation tasks.
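
A minimal sketch of what that looks like in practice; the data frame and the grouped summary are illustrative placeholders, not code from this document:

```r
library(dplyr)
library(multidplyr)

# One worker process per CPU available to this session
cluster <- new_cluster(parallelly::availableCores())

df <- tibble(
  group = sample.int(1000, 1e6, replace = TRUE),
  value = rnorm(1e6)
)

result <- df |>
  group_by(group) |>
  partition(cluster) |>                    # spread the groups across the workers
  summarise(mean_value = mean(value)) |>   # each worker summarises its own partition
  collect()                                # gather the results back into one tibble
```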
Member commented:

These two sections are very similar to the Summary and Key Points sections above. I'd suggest merging the non-multidplyr content into Summary and Key Points, then having a multidplyr section for the rest.


### Reasons You Might Not Use Parallel Processing

- **Overhead**: Setting up and managing parallel processes can introduce overhead, which might negate the performance benefits for smaller tasks.
Member commented:

I think 'overhead' might need an explanation.

The only thing I'm thinking of is an analogy: a car can go faster than a person can walk, but for short journeys the 'overhead' of finding your keys, getting in the car, finding a parking space, etc. means that just walking is quicker.
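
In code terms, a hedged illustration of that analogy using only base R's `{parallel}` package; the tiny task and the worker count are arbitrary:

```r
x <- 1:10

# Sequential: effectively instant for such a small task
system.time(lapply(x, sqrt))

# Parallel: the time is dominated by starting and stopping the workers
# (finding the keys and the car), not by the computation itself
system.time({
  cl <- parallel::makeCluster(2)
  res <- parallel::parLapply(cl, x, sqrt)
  parallel::stopCluster(cl)
})
```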

terrymclaughlin (Contributor, Author) replied:

@Moohan Great analogy!

terrymclaughlin marked this pull request as draft on July 1, 2024 14:13
@Moohan (Member) commented on Jul 24, 2024:

I answered a question recently, using furrr in a way I'd not previously considered: https://teams.microsoft.com/l/message/19:[email protected]/1721726497682?tenantId=10efe0bd-a030-4bca-809c-b5e6745e499a&groupId=ec4250f9-b70a-4f32-9372-a232ccb4f713&parentMessageId=1721652167281&teamName=PHS%20Data%20and%20Intelligence%20Forum&channelName=Beginners%20Channel&createdTime=1721726497682

Used in this way it's very similar to how multidplyr works but gets round the issue that arbitrary functions aren't 'translated' into multidplyr.
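
For reference, a hedged sketch of that pattern: split the data into chunks and apply an arbitrary function to each chunk in parallel with `{furrr}`. The data frame, grouping and model fit are illustrative assumptions, not the code from the linked thread:

```r
library(dplyr)
library(furrr)

# One background R process per CPU available to this session
future::plan(future::multisession, workers = parallelly::availableCores())

df <- tibble(
  group = rep(1:8, each = 1e4),
  x = rnorm(8e4),
  y = rnorm(8e4)
)

# Split into one chunk per group, then fit a model to each chunk in parallel;
# unlike {multidplyr}, the function applied to each chunk can be anything.
results <- df |>
  group_split(group) |>
  future_map(\(chunk) lm(y ~ x, data = chunk))

future::plan(future::sequential)  # shut the workers down again
```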

terrymclaughlin force-pushed the 10-req-guidance-on-writing-r-code-that-will-run-in-parallel branch from 0a74b03 to ab3b807 on August 8, 2024 08:24
@Moohan (Member) left a comment:

Other than the comments, a thought that was triggered by your mention of data.table: it would be nice to see a list/table of 'common' packages that do and don't take advantage of more than 1 CPU. It's something I'm never 100% sure about, e.g. is all of the tidyverse entirely single-threaded?

Comment on lines +58 to +73
### Multithreading inside a Kubernetes container in Azure

Your R session on Posit Workbench runs inside a Kubernetes container. That container is allocated a certain amount of CPU resource. This resource is provided by the Azure Kubernetes Service (AKS), where 1 CPU corresponds to 1 vCPU (virtual CPU). 1 vCPU is the equivalent of a single hyper-thread on a physical CPU core.

If you attempt to run multithreaded code in a session with just 1 CPU in Posit Workbench, the multiple threads will be executed by taking turns on the single vCPU. This will give the illusion of multithreading, but all that is happening is that each thread is using a slice of CPU time, running sequentially.

In order to run multithreaded code in Posit Workbench, you must open a session with more than 1 CPU. To demonstrate this, below are the results of running a computationally expensive operation on a large `{data.table}` of 600 million rows in Posit Workbench sessions with 1, 2, 4 and 8 CPUs, and using 1, 2, 4 and 8 threads:

| Threads | 1 vCPU | 2 vCPUs | 4 vCPUs | 8 vCPUs |
|---------|--------|---------|---------|---------|
| 1 | 72.285 | 72.612 | 67.869 | 67.244 |
| 2 | 75.373 | 48.828 | 44.957 | 45.019 |
| 4 | 87.979 | 52.276 | 34.109 | 34.641 |
| 8 | 100.207| 60.684 | 37.004 | 29.259 |

The results are the execution times in seconds for each combination of number of CPUs and threads. As you can see, running code with multiple threads on 1 vCPU takes longer than running the same code in a single thread of execution. A reduction in execution time is only seen if the number of CPUs is increased, and the most optimal combination is where the number of CPUs matches the number of threads of execution.
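
A hedged sketch of putting that advice into practice, matching `{data.table}`'s thread count to the session's CPU allocation; the data and computation are illustrative, not the 600-million-row benchmark above:

```r
library(data.table)

# availableCores() reports the CPUs allocated to this Workbench session,
# so the thread count matches what the container can actually use
setDTthreads(parallelly::availableCores())
getDTthreads()   # confirm how many threads {data.table} will use

dt <- data.table(
  group = sample.int(1e4, 1e7, replace = TRUE),
  value = runif(1e7)
)

# A grouped aggregation that {data.table} can parallelise across the threads set above
system.time(dt[, .(mean_value = mean(value)), by = group])
```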
Member commented:

I think the gist of this section is good but the language is overly technical in places:

I don't think the distinction between CPUs, vCPUs and hyper-threads is particularly useful. I would remove a lot of the technical language and focus on: n CPUs when starting a Workbench session = n threads / workers / possible parallel jobs.

The table is nice, but I think the takeaway message is a bit lost; for me, the takeaway is that matching the CPUs requested to the actual requirements (in this case, threads for data.table) is key. Possibly reduce to 1 or 2 decimal places, and maybe highlight (italics?) the main diagonal where vCPUs = threads.

I like the last paragraph; maybe highlight (bold?) "the most optimal combination is where the number of CPUs matches the number of threads of execution". This also seems a good place to mention the 'overhead' associated with making code parallel, possibly just with a link to a different section.


## The `{multidplyr}` R package

The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores. The package is part of the [Tidyverse](https://www.tidyverse.org/).
Member commented:

It would probably be useful to either fudge the terminology and use just one of CPU / core / thread, or have a mini-glossary to highlight the differences. I think it's confusing to have one section talk about threads and vCPUs and the next talk about cores.

terrymclaughlin (Contributor, Author) replied:

@Moohan You make a good point, and also above in the multithreading section. The terminology is confusing, and unless you have some prior understanding of CPU architecture and know something about the concept of virtual CPUs, it will quite literally mean nothing. I need to go through this again and use the term CPU throughout, because that's what is referred to in Posit Workbench.

terrymclaughlin and others added 4 commits August 8, 2024 14:25
Fix links - wrong type of brackets

Co-authored-by: James McMahon <[email protected]>
Minor word change

Co-authored-by: James McMahon <[email protected]>
Removing explicit loading of microbenchmark package

Co-authored-by: James McMahon <[email protected]>
Style changes

Co-authored-by: James McMahon <[email protected]>
Labels: documentation (Improvements or additions to documentation)
Projects: None yet
Development: Successfully merging this pull request may close these issues: REQ - Guidance on writing R code that will run in parallel
3 participants