Adding OpenACC statements to accelerate MYNN surface scheme performance through GPU offloading #97

Merged
merged 4 commits into ufs-community:ufs/dev on Sep 14, 2023

Conversation

timsliwinski-noaa

Overview:

With minimal changes to the original code, the MYNN surface scheme has been enhanced with OpenACC statements that allow computational execution to be offloaded to OpenACC-supported accelerator devices (e.g. Nvidia GPUs). Since the scheme loops repeatedly over independent vertical columns, its computational structure maps well to GPU hardware, where multiple iterations of each loop can run in parallel using SIMD methods. Data movement has been optimized so that transfers between host and device memory are limited, since data movement is a significant source of latency when offloading to accelerator devices. Performance on a GPU ranged from a 3.3x slowdown to a 41.9x speedup relative to CPU execution (see the Performance section for details).
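To make the offload pattern concrete, here is a minimal, self-contained sketch (placeholder arrays, sizes, and flux formula -- not the actual scheme code) of a column loop run in parallel on the device with a single copyin/copyout pair around it:

program column_offload_sketch
  ! Minimal sketch of the offload pattern; placeholder data, not MYNN code.
  implicit none
  integer, parameter :: ncol = 150000
  real, allocatable  :: t1d(:), q1d(:), flux(:)
  integer :: i

  allocate(t1d(ncol), q1d(ncol), flux(ncol))
  t1d = 290.0
  q1d = 0.01

  ! One transfer in, one transfer out: data stays resident for the whole region.
  !$acc data copyin(t1d, q1d) copyout(flux)
  !$acc parallel loop
  do i = 1, ncol
     ! each vertical column is independent, so iterations map to GPU threads
     flux(i) = 1004.0 * (t1d(i) - 273.15) * q1d(i)
  end do
  !$acc end data

  print *, 'flux(1) =', flux(1)
end program column_offload_sketch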

MYNN Scheme Code Changes:

A few minor code changes were unavoidable due to certain limitations on what OpenACC is able to execute on the accelerator within kernel and parallel blocks. A complete list of these changes is below:

  1. Adding preprocessor directives to disable multiple standard-output statements, including those used for debug output. The challenges differ depending on whether the statements run on the host or the accelerator. When run in parallel on the accelerator, the output is not guaranteed to appear in order. In some cases, these statements would also require variables that were not otherwise needed on the GPU, introducing extra transfer overhead purely for output. Further, with hundreds of threads executing at once, the output could be quite large and unwieldy. Some of these statements could instead have been run on the host, but that would have required device-to-host transfers to ensure the values being printed were current, again adding transfer overhead. Disabling these statements for accelerator builds only seemed the best course of action. They are disabled based on the presence of the _OPENACC compile-time macro, so CPU execution is unaffected. (A schematic example of these guard patterns is shown after this list.)

  2. Changing the CCPP errmsg variable declaration on line 349 of module_sf_mynn.F90 to a fixed-length character variable of length 200. Since this variable is sometimes set inside accelerator kernel regions, it must be present on the accelerator; however, when declared with "len=*" it is an assumed-length character, which OpenACC does not support on the device. Rather than disabling the variable entirely, giving it a fixed length allows it to be transferred to and from the accelerator and used there. This change is guarded by preprocessor directives based on the _OPENACC compile-time macro, so it only takes effect when the code is compiled for accelerator use and does not affect CPU execution.

  3. Adding preprocessor directives to move the return statement on line 1399 of module_sf_mynn.F90 out of the main i-loop and instead execute it at line 2006 if errflg is set to 1. This change is necessary because OpenACC accelerator code cannot execute this kind of branching, so the conditional return can only be executed by the host. The change is guarded by preprocessor directives based on the _OPENACC compile-time macro, so it only takes effect when the code is compiled for accelerator use and does not affect CPU execution.

  4. Commenting out the zLhux local variable in the zolri function over lines 3671 to 3724. The zLhux variable appears to have been used only to capture values of zolri over multiple iterations, and is never used or passed along after this collection is completed. Because this array would be sized at runtime based on the value of nmax, it would not have been supported by OpenACC. Since it is unused, the choice was made to simply comment out the variable and all lines related to it, allowing the remaining code of the function to execute on the accelerator.
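As a schematic illustration of the guard patterns in items 1-3 (a sketch only; the subroutine body, loop, and error condition are placeholders, not lines from module_sf_mynn.F90):

subroutine guarded_sketch(n, errmsg, errflg)
  implicit none
  integer, intent(in)  :: n
#ifdef _OPENACC
  ! fixed length so the variable can be present on the device
  character(len=200), intent(inout) :: errmsg
#else
  character(len=*),   intent(inout) :: errmsg
#endif
  integer, intent(out) :: errflg
  integer :: i

  errflg = 0
  !$acc parallel loop copy(errmsg, errflg)
  do i = 1, n
#ifndef _OPENACC
     ! standard/debug output is compiled only for CPU builds
     if (mod(i, 1000) == 0) write(*,*) 'processing column', i
#endif
     if (i < 0) then                  ! placeholder error condition
        errflg = 1
        errmsg = 'example error'
#ifndef _OPENACC
        return                        ! early return is only taken on the host
#endif
     end if
  end do
#ifdef _OPENACC
  if (errflg == 1) return             ! deferred return, executed on the host
#endif
  ! ... remainder of the subroutine ...
end subroutine guarded_sketch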

Performance:

Performance testing was performed on a single Nvidia P100 GPU versus a single 10-core Haswell CPU on Hera. Since the MYNN surface scheme is serial code, parallelization on the 10-core Haswell was done with simple data partitioning across the 10 cores using OpenMP threads, such that each thread received a near-equal share of the data. When data movement was fully optimized for the accelerator -- meaning all CCPP Physics input variables were pre-loaded on the GPU, as they would be once the CCPP infrastructure fully supports accelerator offloading -- GPU speedups ranged between 11.9x and 41.9x over the 10-core Haswell as the number of vertical columns (i) was varied from 150k to 750k.

Performance Timings (optimized data movement):

Columns (i) |     CPU |     GPU | GPU Speedup
    150,000 |  263 ms |   22 ms |       11.9x
    450,000 |  766 ms |   28 ms |       27.0x
    750,000 | 1314 ms |   31 ms |       41.9x

However, standalone performance -- meaning all CCPP Physics input variables were loaded onto the GPU only after being declared in the MYNN subroutine calls -- fell behind the 10-core Haswell by roughly a factor of 2 to 3 due to the additional overhead of the data transfers. The GPU's lag behind the CPU shrinks as the number of columns increases because the GPU's computational throughput scales better with more data than the CPU's, even though more data must be transferred to the device. (A schematic comparison of the two configurations appears after the table below.)

Performance Timings (standalone):

Columns (i) |     CPU |     GPU | GPU Speedup
    150,000 |  263 ms |  862 ms |       -3.3x
    450,000 |  766 ms | 1767 ms |       -2.3x
    750,000 | 1314 ms | 2776 ms |       -2.1x
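The difference between the two timed configurations can be sketched roughly as follows (illustrative only; the subroutine names, arguments, and loop body are placeholders, not the actual wrapper interface):

subroutine scheme_standalone(n, a, b)
  ! "standalone" timing: the scheme pays for host<->device transfers every call
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a(n)
  real,    intent(inout) :: b(n)
  integer :: i

  !$acc data copyin(a) copy(b)
  !$acc parallel loop
  do i = 1, n
     b(i) = b(i) + a(i)
  end do
  !$acc end data
end subroutine scheme_standalone

subroutine scheme_preloaded(n, a, b)
  ! "optimized data movement" timing: the caller (eventually the CCPP cap) has
  ! already placed the data on the device, so the scheme only asserts presence
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a(n)
  real,    intent(inout) :: b(n)
  integer :: i

  !$acc parallel loop present(a, b)
  do i = 1, n
     b(i) = b(i) + a(i)
  end do
end subroutine scheme_preloaded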

With these results, it is clear that this scheme will perform at its best on accelerators once the CCPP infrastructure also supports OpenACC.

Contact Information:

This enhancement was performed by Timothy Sliwinski at NOAA GSL. Questions regarding these changes should be directed to [email protected]

…ce through GPU offloading

@dustinswales
Collaborator

@timsliwinski-noaa Thanks for this contribution.
A couple comments/questions.

  • I'm not sure the code works as is "inline using GPUs". See my comment on the changes you made to errmsg in the code. We don't test the GPU functionality as part of our testing system, which is something we are working on, but I'm pretty sure the pre-processor directives around a local declaration of a host-side interstitial might break things. As a workaround you could declare a local copy of errmsg, with explicit extent, that you pass to the GPU.
  • Do we need this new pre-processor directive, _OPENACC, which is acting to bypass the physics-to-host-model communication used for displaying debugging information? Maybe this could be resolved with some changes in the scheme.
  • I don't see pull-requests into the ccpp-physics supermodules fv3atm or the UFS weather model. Have these changes been tested in the UFS?

character(len=*), intent(inout) :: errmsg
#else
Collaborator

This looks fine for non-GPU applications, but I doubt that this works when running the physics within a host w/ GPU support enabled?
errmsg is defined by the host, and used throughout the physics, so changing it in this one place is not sufficient, you would need to conditionally change it everywhere. At least for this to work inline.

Author

Hi Dustin. I'm pushing a new change up to this branch which removes these preprocessor substitutions in favor of a local-variable approach with explicit extent, as you previously suggested. Unfortunately, this does add new complexity because the mid-loop return statements have to be handled specially, so 4 new local variables were necessary. Please see the commit message for full details.

Also, note that some preprocessor directives are still unavoidable using _OPENACC despite this change. However, _OPENACC is set by the compiler automatically whenever OpenACC is enabled at compile time, so CCPP need not track this.

@@ -626,7 +666,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, &
!JOE-end

! CCPP error handling
#ifndef _OPENACC
Collaborator

Ditto

@@ -1041,6 +1170,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, &
WSPD(I) = MAX(WSPD_ice,WSPD_wat)
WSPD(I) = MAX(WSPD_lnd,WSPD(I))

#ifndef _OPENACC
Collaborator

Maybe this is a comment for the scheme developer @joeolson42, but there are print/write statements sprinkled throughout this scheme, leading @timsliwinski-noaa to put in pre-processor directives to guard against printing them in GPU applications. Some of these are conditional on the state (e.g. if Temp > blahblah), others are controlled with a flag, like this one here.
To avoid needing a pre-processor directive, we could control all the debug information with a debug flag, which is set by the host and not when using GPUs. Thoughts?

Collaborator

Do you mean, use the "debug_code" flag check with all print statements and only allow debug_code to be > 0 for CPU applications? If so, yes, I think that is reasonable.
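For illustration, a minimal sketch of that convention (hypothetical subroutine and variable names; debug_code is the scheme's existing flag):

subroutine debug_guard_sketch(debug_code, n, wspd)
  ! All prints sit behind the existing debug_code flag; the host sets
  ! debug_code = 0 for GPU runs so no output reaches the accelerated path.
  implicit none
  integer, intent(in) :: debug_code, n
  real,    intent(in) :: wspd(n)
  integer :: i

  if (debug_code >= 1) then
     do i = 1, n
        write(*,*) 'i =', i, '  wspd =', wspd(i)
     end do
  end if
end subroutine debug_guard_sketch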

Collaborator

@joeolson42 Exactly.

Collaborator

@joeolson42 left a comment

To be honest, I'm a bit worried about not being able to print out values within GPU applications. When the compilers can't catch the problems, that is often the only way to debug.

@timsliwinski-noaa
Author

Hi @dustinswales and @joeolson42. Thanks for your feedback. I'm reviewing your suggestions and looking at alternatives. I did want to address some things I didn't mention in my initial PR that may be helpful based on your responses:

First off, I want to mention that I approached this effort with an explicit goal of minimizing overall code changes. This limited what I felt I could change, but was done so that code maintainers could still recognize their original scientific code, leaving the scheme more or less intact. If there is willingness to engage in limited, more invasive code changes, I'm open to that discussion if you are, as it would probably reduce the number of preprocessor directives scattered about.

Also, since I didn't mention testing -- other than performance testing -- in much detail in my initial PR (apologies!), I wanted to add that right away:

Testing was done, but not with the UFS. Instead, the CCPP Single Column Model was used, though indirectly: input to and output from the MYNN surface scheme during single-column-model runs was captured and fed back into a standalone version of the scheme isolated with a custom driver. That driver was built just to read the data in and write the results out, allowing lightweight testing and accuracy-of-solution evaluation. Overall, the expected output was reproduced with nothing more than slight differences in last-digit rounding for real numbers, as would be expected from differences between GPU and CPU floating-point computations.

Having said that, if there is a proper test case for running the full UFS or CCPP framework, I would gladly test with it to see how these changes work as a drop-in replacement. As of right now, I was unaware that testing CCPP physics with the full UFS was necessary; I was under the impression that CCPP was meant to be model independent (is that wrong?). I'm also unaware of any established baselines to test physics accuracy of solution against. I am open to suggestions if there are testing best practices I'm missing and should incorporate.

Thanks, and I will return with more information for your specific review points soon after I look at them more closely again.

…essor variable

substitutions through the use of new local variables.

The changes in this commit affect 3 main areas of module_sf_mynn.F90:
1.) Subroutine SFCLAY_mynn
2.) Subroutine SFCLAY1D_mynn
3.) Subroutine GFS_zt_wat
Each of these areas is described in more detail below.

1.) SFCLAY_mynn

In the SFCLAY_mynn subroutine, it was possible to remove all #ifdef
substitutions of errmsg(len=*) for errmsg(len=200) because errmsg is not used in
any code regions of this subroutine that may be run on an accelerator device.
Since this is the case, errmsg(len=*) is perfectly acceptable, and can be left
alone. The OpenACC data statements within the subroutine were also updated to
remove references to errmsg as well since, again, it was not necessary to have
errmsg on the device for this subroutine.

2.) SFCLAY1D_mynn

- Creation of device_errmsg and device_errflg and proper syncing with errmsg
  and errflg

In the SFCLAY1D_mynn subroutine, it was also possible to remove all #ifdef
substitutions by instead creating a new local variable called device_errmsg
that is a copy of errmsg but with a fixed length of 512, making it acceptable
for use on the device. This is necessary because, at certain points in the
subroutine, loops that are well suited for offloading to the device set errmsg
under certain conditions. Since these areas cannot be isolated from the parent
loop without a major rework of the loop, we must preserve a way for errmsg to
be set on the device; because device_errmsg has a fixed length, it can do that.
However, this complicates error handling a bit, as we now have errmsg and
device_errmsg, which must be synced properly to ensure error messages are
returned to CCPP as expected. Therefore, we must keep track of when
device_errmsg is set so we know to sync it back to errmsg. This is done with a
new local variable called device_errflg, which is device_errmsg's complement on
the device just as errflg is errmsg's complement on the host. When
device_errflg is set to a nonzero integer, we know that device_errmsg must be
synced back to errmsg. This is simple to do at the end of the subroutine, after
device_errmsg is copied out from the device by OpenACC, and a new IF-block has
been added for this general case.

- Special case of mid-loop return (line 1417), and the creation of
  device_special_errflg and device_special_errmsg

However, there is a special case we must handle a bit differently. For the
mid-loop return statement near line 1417, we must also perform this sync to
ensure the proper errmsg is returned in the event the return is taken.
Therefore, a similar IF-block has been created within the corresponding #ifdef
near line 2027 to ensure errmsg has the proper value before the subroutine
returns. However, since this block sits in the middle of the code and is
executed only on the host, we must first perform an OpenACC sync (update)
operation to make sure that device_errmsg and device_errflg on the host match
device_errmsg and device_errflg on the device; otherwise, stale values could
lead to the return statement not executing as expected.

This special case seems simple, but an extra trap lies exposed. If
device_errmsg and device_errflg are set on the device at any point before
this IF-block, then the return statement we moved out of the loop will be
executed for *ANY* error message, whether that was the intended course or
not. Therefore, we need to ensure this special path is only triggered for
this specific case. Unfortunately, there appears to be no other way than to
create two additional variables (device_special_errmsg and
device_special_errflg) to isolate this case from all other error cases. With
these used in place of device_errmsg and device_errflg at this point, the
special return case is now properly handled.

- Complete Ifdef/Ifndef removal not possible

Overall, due to the nature of this special case, we have no choice but to
leave the #ifdef and #ifndef preprocessor statements in place as they are
the only things capable of moving this return statement out of the loop
without additional invasive changes to how the code operates.
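A condensed sketch of the arrangement described in this section (names follow the commit text, but the sizes, loop body, and error conditions are placeholders, not the actual SFCLAY1D_mynn code):

subroutine sfclay1d_error_sketch(its, ite, errmsg, errflg)
  implicit none
  integer,          intent(in)    :: its, ite
  character(len=*), intent(inout) :: errmsg
  integer,          intent(out)   :: errflg
  character(len=512) :: device_errmsg, device_special_errmsg
  integer            :: device_errflg, device_special_errflg
  integer :: i

  errflg = 0
  device_errflg = 0;         device_errmsg = ' '
  device_special_errflg = 0; device_special_errmsg = ' '

  !$acc enter data copyin(device_errmsg, device_errflg, &
  !$acc                   device_special_errmsg, device_special_errflg)

  !$acc parallel loop present(device_errmsg, device_errflg, &
  !$acc                       device_special_errmsg, device_special_errflg)
  do i = its, ite
     if (i < its) then                  ! placeholder "fatal" condition
        device_special_errflg = 1
        device_special_errmsg = 'fatal condition: abort subroutine'
     else if (i > ite) then             ! placeholder "non-fatal" condition
        device_errflg = 1
        device_errmsg = 'non-fatal condition: report at end'
     end if
  end do

#ifdef _OPENACC
  ! special case: refresh the host copies, then take the deferred return
  ! that replaces the original mid-loop return
  !$acc update host(device_special_errmsg, device_special_errflg)
  if (device_special_errflg /= 0) then
     errflg = device_special_errflg
     errmsg = trim(device_special_errmsg)
     !$acc exit data delete(device_errmsg, device_errflg, &
     !$acc                  device_special_errmsg, device_special_errflg)
     return
  end if
#endif

  ! ... remainder of the subroutine ...

  !$acc exit data copyout(device_errmsg, device_errflg) &
  !$acc           delete(device_special_errmsg, device_special_errflg)

  ! general sync block at the end of the subroutine
  if (device_errflg /= 0) then
     errflg = device_errflg
     errmsg = trim(device_errmsg)
  end if
end subroutine sfclay1d_error_sketch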

3.) GFS_zt_wat

In the GFS_zt_wat subroutine, since this subroutine is called on the
device from within the main I-loop of SFCLAY1D_mynn, we have no choice but
to change all errmsg and errflg usage to device_errmsg or device_errflg,
otherwise this subroutine and the entire parent loop could not be run on
the device. Therefore, all errmsg and errflg lines have been commented out
and new, comparable lines using device_errmsg and device_errflg added in
their place. Additionally, the subroutine call variable list was updated.
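For illustration, a device-callable helper in this style might look like the following (a sketch under assumptions: the `!$acc routine seq` directive and the argument list are illustrative, not the actual GFS_zt_wat interface):

subroutine zt_wat_sketch(zt, device_errmsg, device_errflg)
!$acc routine seq
  implicit none
  real,               intent(inout) :: zt
  character(len=512), intent(inout) :: device_errmsg
  integer,            intent(inout) :: device_errflg

  if (zt <= 0.0) then
     device_errflg = 1
     device_errmsg = 'invalid roughness length'     ! placeholder message
  else
     zt = 0.01 * zt                                 ! placeholder computation
  end if
end subroutine zt_wat_sketch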
@timsliwinski-noaa
Author

timsliwinski-noaa commented Aug 24, 2023

I'm pushing up a new change to this branch that hopefully addresses the issue of variable substitution by preprocessor directive. However, some changes still require preprocessor directives based on the _OPENACC compiler macro. Please see the commit message for more details.

Next, I'm going to look more closely at trying to re-enable some of the debug output print statements, but didn't want to intermingle those changes with this latest commit in case the addition of new, extra errmsg variables sparked more concerns.

… for debug and other conditions.

Original problem:
-----------------

Following feedback that debug information was still desirable for OpenACC device-
executed code where possible, this change removes all preprocessor directives that
guarded against the compilation of statements writing to standard output. These
directives were originally used because debug statements and other standard output
had the potential to greatly reduce performance, since certain variables would have
to be copied from the host to the device solely for debug output. Additionally, when
such statements were located within parallel-execution regions, the output was not
guaranteed to be presented in any specific order, and the additional IF-branches in
the code would also have reduced performance, as branching is not efficient on SIMD
architectures.

Resolutions:
------------

However, with a bit of extra work, a few of these issues are alleviated to allow output
to work again as requested. First, on the data-optimization side of the problem, the
impact of pulling in variables just for debugging was minimized by ensuring the data
was pulled in and kept resident on the GPU for the entire subroutine execution. While
this increases the memory footprint on the device, which may have very limited memory,
it reduces the data-transfer-related performance hit. Next, in the cases where debug
output was not within parallel regions but still needed to execute on the GPU to show
the proper values at that point of the overall program execution, OpenACC serial
regions were used. These allow the data to be shown as debug output without being
transferred off the GPU mid-execution and also partially solve the problem of
out-of-order output. Since debug regions are guarded by IF blocks, these serial regions
do not significantly impact performance when debug output is turned off (debug_code=0).
The slowdown is significant for other debug levels, but that should be acceptable in
debugging situations.
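A rough sketch of the serial-region pattern described here (placeholder names; not the actual debug blocks in the scheme):

subroutine debug_serial_sketch(debug_code, its, ite, hfx)
  ! When debugging is on, an OpenACC serial region prints device-resident
  ! values without transferring them back to the host mid-execution.
  implicit none
  integer, intent(in) :: debug_code, its, ite
  real,    intent(in) :: hfx(its:ite)

  if (debug_code >= 1) then
     !$acc serial present(hfx)
     print *, 'hfx(its) =', hfx(its), '  hfx(ite) =', hfx(ite)
     !$acc end serial
  end if
end subroutine debug_serial_sketch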

Performance Changes:
--------------------

Overall, these changes accomplish the goal of re-enabling debugging output, but not
entirely without cost. Runtime on the GPU increased slightly when tested with 150k
and 750k vertical columns (the value of ite used in the i-loops) and debugging
turned off (debug_code=0). For 150k columns, GPU runtime increased from the original
baseline of 22 ms to 30 ms. For 750k columns, GPU runtime increased from the original
baseline of 31 ms to 70 ms. The impact is greater for the larger number of columns
because the mid-loop IF branches are evaluated more times on the GPU. While these are
slight declines in performance, they are still significant speedups over the CPU-only
tests (8.7x and 18.7x for 150k and 750k columns, respectively).

Compilation Time Changes:
-------------------------

One additional observation regarding performance concerns compilation time. When all
debug output is disabled (debug_code=0), compilation time is approximately 90 seconds,
since the additional serial blocks, IF-branches, and so forth each require more work
from the OpenACC compiler to generate code for the GPU. This problem is compounded
when the debug_code option is increased to either 1 (some debug output) or 2 (full
debug output). At a value of 1, compilation time jumps to approximately 12.5 minutes
on the Hera GPU nodes. At a value of 2, it increases further to approximately 18.5
minutes on the same nodes. The explanation is that the OpenACC compiler must enable
greater amounts of serial and branching code that (again) are less optimal on the GPU,
so the compiler must do more work to optimize them as best it can.
@timsliwinski-noaa
Author

@joeolson42 and @dustinswales -- I have pushed up the final round of requested changes, this time regarding the disabled debugging output. The output has been re-enabled, and a few data optimizations and use of the OpenACC serial construct were necessary to accomplish this. Overall, though, performance was not majorly impacted (except when debugging is enabled) because the majority of the IO operations were already optimized. There is a pretty big hit to compilation time with debugging turned on since the OpenACC compiler must try to optimize a lot of things that are not well-suited for the accelerator, so keep that in mind when trying to test with debugging enabled. Further details are in the commit message. Please let me know if you have any other questions, comments, or suggested changes.

@grantfirl
Collaborator

It seems like we should use this PR as an opportunity to define some best practices with respect to CCPP error handling and debugging print statements when running on GPUs. I don't think it would be crazy, for example, to declare errmsg with an explicit length. I don't think that doing this with preprocessor directives (like using _OPENACC) would work because of the CCPP prebuild check on the "kind" attribute, but just redefining errmsg as len=256 or something permanently (whether or not we're using GPUs) seems reasonable to me.

@timsliwinski-noaa What is the preferred way to handle debugging write statements with GPU-accelerated code? Do we just need to make sure that we're not branching inside an accelerated section? E.g. if we wanted to conditionally write out anything within an accelerated loop, do we need to just redo the same loop on the CPU side when a debug flag is active or something?

Also, I have a question about the "optimized data movement". You're referring to all acc enter/exit data copyX calls being done in the calling layer of the top-level scheme, right? For example, for mynnsfc_wrapper_run, we would want all of the data copying openacc calls in the CCPP autogenerated cap, right?

@grantfirl
Collaborator

grantfirl commented Aug 30, 2023

As for testing, I think that the bar that we need to jump over for this PR is to just make sure that we're passing the current regression tests without GPU. At some point, we'll need to add some new tests to actually use Hera's GPUs on at least one suite that exercises the current crop of GPU-accelerated physics from GSL. We've already accepted the GPU-accelerated GF changes and didn't require a GPU test to be added at that time, so it seems unfair to require one for this PR.

@timsliwinski-noaa
Author

What is the preferred way to handle debugging write statements with GPU-accelerated code? Do we just need to make sure that we're not branching inside an accelerated section? E.g. if we wanted to conditionally write out anything within an accelerated loop, do we need to just redo the same loop on the CPU side when a debug flag is active or something?

@grantfirl - I had to stop and think about this after @joeolson42 made his comment about missing the debugging statements on the GPU, and I came to the following thoughts. If we're enabling debugging, the question becomes: how performant do we need it to be? If debugging is turned off, we want the code to be as performant as possible, and how we handle any conditionals checking whether debugging is on or off is important to ensure they have a minimal impact. But when debugging is on, does performance still matter as much? Ideally, we'd avoid branching and writing out from the GPU during parallel regions, leaving the GPU to handle just the calculations. Branching is bad for GPUs since, because of the SIMD architecture, each thread has to do something during that branch whether it applies or not. So, for your accelerated-loop conditional write-out example, removing the debug write-outs from the accelerated loop would be best, and they would be good to have as an ACC serial region outside the parallel region after it closes. I know serial sounds bad on a GPU - and it is generally to be avoided - but transferring data is a heavier operation/performance hit for something as simple as writing out debug info, and a serial region is likely to be (I'm guessing here, as I don't know for certain) less impactful. Rerunning the loop on the CPU side for debugging may not have the desired effect, since the problem being debugged may lie in the parallelization on the GPU itself, which is why I was willing to go back and add in the debug statements at Joe's request; it's something that can't be tested if it's not on the GPU. Another option could be having smaller parallel regions to stick debug output between if it's really necessary, but this option has to be considered carefully, since each parallel region (and serial region, for that matter) has some startup overhead on the GPU, so fewer overheads are better.

Also, I have a question about the "optimized data movement". You're referring to all acc enter/exit data copyX calls being done in the calling layer of the top-level scheme, right? For example, for mynnsfc_wrapper_run, we would want all of the data copying openacc calls in the CCPP autogenerated cap, right?

The way I'm using "optimized data movement" here can refer to two things, and yes, the enter/exit data regions are key. First, the local data is optimized so that it stays on the GPU from the start of the subroutine to the end, avoiding transfers as much as possible; this really only applies to local variables, though. The second and broader use of "optimized data movement" refers to the variables coming into the scheme through mynn_wrapper_run having already been staged on the GPU by the CCPP infrastructure, which is currently not done. If the CCPP infrastructure is changed to work this way, the MYNN scheme (and others) doesn't have to re-transfer everything back and forth to the GPU every time it is called; it just checks whether the data already exists (through OpenACC present clauses) and uses what is already there. Having every scheme called through CCPP handle data this way makes everything more GPU-resident and more optimized: when a scheme finishes whatever it does, the updates are already on the GPU and ready for the next scheme that needs them, with no transfers necessary.
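As a rough illustration of that arrangement (hypothetical names; this is not the autogenerated CCPP cap), the calling layer would stage the data once and the scheme would only assert presence:

subroutine calling_layer_sketch(n, t1d, flux)
  ! hypothetical stand-in for the layer that calls the scheme (e.g. a cap)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: t1d(n), flux(n)

  !$acc enter data copyin(t1d) create(flux)
  call scheme_body_sketch(n, t1d, flux)      ! ...possibly many scheme calls...
  !$acc exit data copyout(flux) delete(t1d)
end subroutine calling_layer_sketch

subroutine scheme_body_sketch(n, t1d, flux)
  ! the scheme itself only checks that the data is already device-resident
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: t1d(n), flux(n)
  integer :: i

  !$acc parallel loop present(t1d, flux)
  do i = 1, n
     flux(i) = 0.5 * t1d(i)
  end do
end subroutine scheme_body_sketch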

I hope that helps, but I'm open to more questions if needed.

Collaborator

@dustinswales left a comment

@timsliwinski-noaa Great work. Thanks for making the requested changes.

Collaborator

@joeolson42 left a comment

Thanks for the good work, Tim. I can add some "if (debug_code >= 1) then" around the prints that do not have them. Tanya Smirnova and I are planning to thoroughly revise this surface-layer scheme soon, so I can take care of it then. I may need your eyes on the revised version at some point to make sure I don't scramble your GPU work.

@grantfirl
Collaborator

Addresses #100

@grantfirl
Collaborator

@timsliwinski-noaa Please pull in the latest ufs/dev branch into your PR branch in anticipation of regression testing.

@timsliwinski-noaa
Author

@grantfirl Pull complete. Latest ufs/dev now merged into this branch.

@zach1221

Hi, @grantfirl.

Testing is complete on ufs-wm PR #1891. Can you please merge this sub-component PR for us?

@grantfirl merged commit 31a99de into ufs-community:ufs/dev on Sep 14, 2023
3 checks passed