Writing R code that will run in Parallel #110
base: main
Conversation
Initial commit. Requires further work.
Add detail on inherently parallel data manipulation tasks and the {multidplyr} R package
Ready for initial review
Hi Terry, overall I think this is very comprehensive and easy to follow! I've just left a few (very minor) suggestions for things you might consider changing. Thanks for putting this together.
| expr       | min    | lq     | mean   | median | uq     | max    | neval |
|------------|--------|--------|--------|--------|--------|--------|-------|
| dplyr      | 61.229 | 61.335 | 61.720 | 61.442 | 61.966 | 62.490 | 3     |
| multidplyr | 6.849  | 6.976  | 7.962  | 7.104  | 8.519  | 9.934  | 3     |
The results clearly indicate that `{multidplyr}` significantly outperforms `{dplyr}` in terms of execution time for the given data manipulation task. The mean execution time for `{multidplyr}` is approximately 7.96 seconds, compared to 61.72 seconds for `{dplyr}`. This demonstrates the potential performance benefits of using parallel processing with `{multidplyr}` for large datasets.
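For context, a minimal sketch of how a comparison like this might be set up is below. The exact benchmark code is not shown in this excerpt, so the data frame `big_df`, its columns `group` and `value`, and the cluster size of 4 are illustrative assumptions rather than the code behind the figures above.

```r
# Sketch only: `big_df`, `group`, `value` and the 4-worker cluster are
# illustrative, not the dataset or code used for the timings above.
library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)

microbenchmark::microbenchmark(
  dplyr = big_df %>%
    group_by(group) %>%
    summarise(mean_value = mean(value)),
  multidplyr = big_df %>%
    group_by(group) %>%
    partition(cluster) %>%                # spread the groups across workers
    summarise(mean_value = mean(value)) %>%
    collect(),                            # gather the results back together
  times = 3
)
```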
Possibly include a 'Conclusion' section at the same level as the Summary at the start of the document? The structure seems to end quite abruptly at the moment.
I agree @JFix89 that a 'Conclusion' section is needed. I am going to add further content to the document, and then add a 'Conclusion' section that summarises everything.
## Purpose
This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run `{dplyr}` code in parallel using the `{multidplyr}` package, explain the benefits of doing so, along with some of the downsides.
Add links to package docs. Suggested change:

This document aims to provide R users in Public Health Scotland with an introduction to the concepts of Parallel Processing, describe how to run [`{dplyr}` code](https://dplyr.tidyverse.org/) in parallel using the [`{multidplyr}` package](https://multidplyr.tidyverse.org/), explain the benefits of doing so, along with some of the downsides.
Since the practical part of this doc exclusively references multidplyr, I think the title and / or doc name should be changed to reflect that?
Or is the intention that it will be expanded with sections / chapters on other 'parallel packages' in the future?
I am definitely going to add an example using the {furrr} package. I have been working up that example and will add it in very shortly.
### What is Parallel Processing?

Parallel processing is a method of computation where many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. This is particularly useful for tasks that require significant computational power and time.
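As a concrete illustration of the idea, here is a minimal sketch using base R's `{parallel}` package: one large job is split into smaller pieces, each piece is computed on a separate worker process at the same time, and the partial results are combined at the end. The number of workers (4) and the task (a large sum) are arbitrary choices for illustration.

```r
library(parallel)

# Start 4 worker processes (an arbitrary choice for illustration)
cl <- makeCluster(4)

# Divide one large problem into 4 smaller ones
pieces <- split(1:1e6, cut(1:1e6, 4))

# Solve the smaller problems simultaneously on the workers...
partial_sums <- parLapply(cl, pieces, sum)

# ...then combine the partial results
total <- Reduce(`+`, partial_sums)

stopCluster(cl)
```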
### How Does Parallel Processing Relate to R?

R is a single-threaded language by default, meaning it processes tasks sequentially, one after the other. This can be a limitation when working with large datasets or performing complex computations. The `{multidplyr}` package addresses this limitation by enabling parallel processing within the `{dplyr}` framework.
`{multidplyr}` allows you to partition your data across multiple cores and perform `{dplyr}` operations in parallel. This can lead to significant performance improvements by utilising the full computational power of modern multi-core processors. By distributing the workload, `{multidplyr}` can reduce the time required for data manipulation tasks.
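The basic pattern looks like this (a sketch with made-up data, column names and worker count):

```r
# Sketch of the basic {multidplyr} pattern: `my_data`, its columns and the
# number of workers are made up for illustration.
library(dplyr)
library(multidplyr)

cluster <- new_cluster(4)               # start 4 worker processes

result <- my_data %>%
  group_by(area_code) %>%               # whole groups stay together on one worker
  partition(cluster) %>%                # spread the groups across the workers
  summarise(total = sum(count)) %>%     # ordinary dplyr code, run in parallel
  collect()                             # gather the pieces back into one data frame
```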
These two sections are very similar to the summary and key points section above. I'd suggest merging the non-multidplyr content into summary and keypoints, then having a multidplyr section for the rest.
### Reasons You Might Not Use Parallel Processing

- **Overhead**: Setting up and managing parallel processes can introduce overhead, which might negate the performance benefits for smaller tasks.
I think overhead might need an explanation.
The only thing I'm thinking of is an analogy: a car can go faster than a person can walk, but for short journeys the 'overhead' of finding your keys, getting in the car, finding a parking space etc. means that just walking is quicker.
@Moohan Great analogy!
Co-authored-by: James Fixter <[email protected]>
Co-authored-by: James McMahon <[email protected]>
I answered a question recently, using furrr in a way I'd not previously considered: https://teams.microsoft.com/l/message/19:[email protected]/1721726497682?tenantId=10efe0bd-a030-4bca-809c-b5e6745e499a&groupId=ec4250f9-b70a-4f32-9372-a232ccb4f713&parentMessageId=1721652167281&teamName=PHS%20Data%20and%20Intelligence%20Forum&channelName=Beginners%20Channel&createdTime=1721726497682

Used in this way it's very similar to how multidplyr works, but gets round the issue that arbitrary functions aren't 'translated' into multidplyr.
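The pattern being described is roughly the following sketch (the data frame, grouping column and function body are hypothetical). Because `{furrr}` simply runs ordinary R functions on each chunk, nothing needs to be 'translated':

```r
# Sketch: split the data into groups, apply an arbitrary function to each
# group in parallel with {furrr}, then recombine. `my_data`, `group` and
# the function body are hypothetical.
library(dplyr)
library(future)
library(furrr)

plan(multisession, workers = 4)

arbitrary_function <- function(df) {
  # any R code at all, including functions {multidplyr} can't translate
  df %>% mutate(cumulative = cumsum(value))
}

result <- my_data %>%
  group_split(group) %>%
  future_map_dfr(arbitrary_function)

plan(sequential)   # return to normal sequential processing
```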
Draft commit: 0a74b03 to ab3b807
@Moohan Can you let me know what you think of the section on multithreading at https://github.com/Public-Health-Scotland/technical-docs/blob/10-req-guidance-on-writing-r-code-that-will-run-in-parallel/R/Writing%20R%20Code%20that%20Runs%20in%20Parallel.md#multithreading? Hopefully this makes sense?
Other than the comments, a thought that was triggered by your mention of data.table: it would be nice to see a list / table of 'common' packages that do and don't take advantage of >1 CPU. It's something I'm never 100% sure about, e.g. is all of the tidyverse entirely single-threaded?
### Multithreading inside a Kubernetes container in Azure

Your R session on Posit Workbench runs inside a Kubernetes container. That container is allocated a certain amount of CPU resource. This resource is provided by the Azure Kubernetes Service (AKS), where 1 CPU corresponds to 1 vCPU (virtual CPU). 1 vCPU is the equivalent of a single hyper-thread on a physical CPU core.

If you attempt to run multithreaded code in a session with just 1 CPU in Posit Workbench, the multiple threads will be executed by taking turns on the single thread. This will give the illusion of multithreading, but all that is happening is that each thread is using a slice of CPU time, running sequentially.

In order to run multithreaded code in Posit Workbench, you must open a session with more than 1 CPU. To demonstrate this, below are the results of running a computationally expensive operation on a large `{data.table}` of 600 million rows in Posit Workbench sessions with 1, 2, 4 and 8 CPUs, and using 1, 2, 4 and 8 threads:

| Threads | 1 vCPU  | 2 vCPUs | 4 vCPUs | 8 vCPUs |
|---------|---------|---------|---------|---------|
| 1       | 72.285  | 72.612  | 67.869  | 67.244  |
| 2       | 75.373  | 48.828  | 44.957  | 45.019  |
| 4       | 87.979  | 52.276  | 34.109  | 34.641  |
| 8       | 100.207 | 60.684  | 37.004  | 29.259  |

The results are the execution times in seconds for each combination of number of CPUs and threads. As you can see, running code with multiple threads on 1 vCPU takes longer than running the same code in a single thread of execution. A reduction in execution time is only seen if the number of CPUs is increased, and the most optimal combination is where the number of CPUs matches the number of threads of execution.
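For reference, the number of threads `{data.table}` uses can be controlled with `data.table::setDTthreads()`. The sketch below shows the shape of such a test; the data here are made up and much smaller than the 600-million-row table used for the timings above, so the numbers it produces will differ.

```r
library(data.table)

# A much smaller, made-up table than the one used for the timings above
dt <- data.table(group = sample(1e5, 1e7, replace = TRUE),
                 value = rnorm(1e7))

for (threads in c(1, 2, 4, 8)) {
  setDTthreads(threads)   # cap the number of threads data.table may use
  elapsed <- system.time(
    dt[, .(mean_value = mean(value)), by = group]
  )["elapsed"]
  cat(getDTthreads(), "thread(s):", round(elapsed, 3), "seconds\n")
}
```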
I think the gist of this section is good but the language is overly technical in places:

I don't think the distinction of CPU vs. vCPU and hyper-threads is particularly useful; I would remove a lot of the technical language to focus on n CPUs when starting a Workbench session = n threads / workers / possible parallel jobs.

The table is nice but again I think the takeaway message is a bit lost. For me the takeaway is that matching the CPUs requested with actual requirements (in this case threads for data.table) is key. Possibly reduce to 1 or 2 dps, and maybe highlight (italics?) the main diagonal where vCPUs = threads.

I like the last paragraph, maybe highlighting (bold?) "the most optimal combination is where the number of CPUs matches the number of threads of execution". This also seems a good place to mention the 'overhead' associated with making code parallel, possibly just with a link to a different section.
## The `{multidplyr}` R package

The `{multidplyr}` R package is a backend for `{dplyr}` that facilitates parallel processing by partitioning data frames across multiple cores. The package is part of the [Tidyverse](https://www.tidyverse.org/).
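As a taste of what this looks like in practice, a minimal sketch of setting up a cluster is below; the `{lubridate}` package and the `lookup_table` object sent to the workers are hypothetical examples. Because the workers are separate R processes, any packages or extra objects they need have to be sent to them explicitly, and sizing the cluster with `parallelly::availableCores()` keeps it in line with the CPUs actually allocated to the Posit Workbench session.

```r
# Sketch only: "lookup_table" and the {lubridate} example are hypothetical.
library(dplyr)
library(multidplyr)

# One worker per CPU actually available to this session
cluster <- new_cluster(parallelly::availableCores())

# Workers are separate R processes, so they need packages and any
# non-partitioned objects copied to them explicitly
cluster_library(cluster, "lubridate")
cluster_copy(cluster, "lookup_table")
```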
It would probably be useful to either fudge the terminology and use one of CPU / core / thread, or have a mini-glossary to highlight the difference. I think it's confusing to have one section talk about threads and vCPUs and the next to talk about cores.
You make a good point - and also above in the multithreading section. The terminology is confusing, and unless you have some prior understanding of CPU architecture, and know something about the concept of virtual CPUs, it quite literally will mean nothing. I need to go through this again and use the term CPU throughout, because that's what is referred to in Posit Workbench.
Fix links - wrong type of brackets Co-authored-by: James McMahon <[email protected]>
Minor word change Co-authored-by: James McMahon <[email protected]>
Removing explicit loading of microbenchmark package Co-authored-by: James McMahon <[email protected]>
Style changes Co-authored-by: James McMahon <[email protected]>
Pull Request Details
Issue Number: #10
Type: Documentation
Description of the Change
I've focused on writing up an example of using `{multidplyr}` to run R code in parallel.

Verification Process
I've tested the code runs on Posit Workbench, and the benchmark results are from this testing.
Additional Work Required
More background on the difference between single-threaded execution, multithreaded execution, parallel processing and sequential processing.
Need to also include examples of using `{furrr}`.

Also detail why we need to use the `parallelly::availableCores()` function on Posit Workbench running in Kubernetes (a rough sketch of the difference is below).
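For when that section gets written, the key point as I understand it is that `parallel::detectCores()` reports the CPUs of the underlying node, whereas `parallelly::availableCores()` respects the limits placed on the container (e.g. by Kubernetes), so it reflects what the session can actually use. A minimal sketch, with illustrative numbers in the comments:

```r
# Sketch: the numbers in the comments are illustrative, not measured.
library(multidplyr)

parallel::detectCores()        # CPUs on the underlying node (e.g. 16)
parallelly::availableCores()   # CPUs the container is actually allowed (e.g. 4)

# So size the cluster from availableCores(), not detectCores()
cluster <- new_cluster(parallelly::availableCores())
```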
Release Notes
This is the first version of this documentation.
Closes #10