# OpenACC

OpenACC is a set of standardized, high-level pragmas that enable C/C++ and Fortran programmers
to exploit parallel (co)processors, especially GPUs. OpenACC pragmas can be used to annotate
codes to enable data location, data transfer, and loop or code block parallelism.

Though OpenACC has much in common with OpenMP, the syntax of the directives is different.
More importantly, OpenACC can best be described as having
a *descriptive* model, in contrast to the more *prescriptive* model presented by OpenMP.
This difference in philosophy can most readily be seen by, e.g., comparing the ``acc loop`` directive
to the OpenMP implementation of the equivalent construct. In OpenMP, the programmer is responsible
for specifying how the parallelism in a loop is distributed (e.g., via ``distribute`` and ``schedule`` clauses).
In OpenACC, the runtime determines how to decompose the iterations across gangs, workers, and vectors.
At an even higher level, an OpenACC programmer can use the ``acc kernels`` construct to allow the compiler complete freedom
to map the available parallelism in a code block to the available hardware.

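To make the contrast concrete, the sketch below annotates the same simple loop both ways. The loop and variable names (`n`, `x`, `y`, `a`) are illustrative assumptions rather than code taken from this page.

```Fortran
! Illustrative sketch (hypothetical loop, not from this page): the same update
! expressed descriptively with OpenACC and prescriptively with OpenMP offload.

! OpenACC: the compiler and runtime decide how to split the iterations
! across gangs, workers, and vector lanes.
!$acc parallel loop
do i = 1, n
  y(i) = y(i) + a * x(i)
enddo

! OpenMP: the programmer spells out how the parallelism is distributed.
!$omp target teams distribute parallel do
do i = 1, n
  y(i) = y(i) + a * x(i)
enddo
!$omp end target teams distribute parallel do
```
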
## OpenACC at a glance

Some of the most important data and control clauses for two of the most
used constructs in OpenACC programming - ``$acc parallel`` and ``$acc kernels`` - are
listed below. The data placement and movement clauses also appear in ``$acc data`` constructs.
``$acc loop`` provides control of parallelism similar to ``$acc parallel``, but at the loop level.

Much more detail can be found at:


|construct | important clauses | description |
|:---|:---|:---|
|``$acc parallel`` | | |
| |`num_gangs(expression)`| Controls how many parallel gangs are created |
| |`num_workers(expression)`| Controls how many workers are created in each gang |
| |`vector_length(expression)`| Controls the vector length of each worker |
| |`private(list)`| A copy of each variable in list is allocated to each gang |
| |`firstprivate(list)`| Private variables initialized from the host |
| |`reduction(operator:list)`| Private variables combined across gangs |
|``$acc kernels`` | | |
| | `copy(list)`| Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region |
| | `copyin(list)` | Allocates memory on the GPU and copies data from host to GPU when entering the region |
| | `copyout(list)` | Allocates memory on the GPU and copies data to the host when exiting the region |
| | `create(list)` | Allocates memory on the GPU but does not copy data |
| | `present(list)` | Data is already present on the GPU from another containing data region |

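As a hedged illustration of how these clauses combine in practice (the array names, sizes, and clause values below are assumptions, not part of the Jacobi example later on this page):

```Fortran
! Illustrative sketch only: hypothetical arrays a, b, c of length n.
total = 0.0
!$acc data copyin(a(1:n), b(1:n)) copyout(c(1:n))
!$acc parallel loop num_gangs(64) vector_length(128) reduction(+:total)
do i = 1, n
  c(i) = a(i) + b(i)
  total = total + c(i)
enddo
!$acc end data
```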

## Optimizing compute kernels for CPUs and GPUs

OpenACC provides descriptive directives which the programmer can use to
indicate to the compiler that a particular kernel should achieve high
performance. The programmer may then instruct the compiler to optimize the
decorated OpenACC kernels for a particular computer architecture, including
both CPUs and GPUs.

For example, one may start with a simple Jacobi iterative solver, as provided
in the [OpenACC GitHub
page](https://raw.githubusercontent.com/OpenACCUserGroup/openacc-users-group/master/Contributed_Sample_Codes/Tutorial1/solver/jsolvef.F90).
One may then decorate one of the loops in this code with the `acc kernels`
directive, which is a very general suggestion to the compiler that the kernel
which follows should be optimized for a particular compute architecture.

```Fortran
!$acc kernels
  do i = 1, nsize
    ! Accumulate the off-diagonal contributions for row i
    rsum = 0
    do j = 1, nsize
      if( i /= j ) rsum = rsum + A(j,i) * xold(j)
    enddo
    ! Jacobi update for the i-th unknown
    xnew(i) = (b(i) - rsum) / A(i,i)
  enddo
!$acc end kernels
```

One may then compile this code for a multi-core CPU with the following
command:

```console
pgfortran -o jacobi_CPU.ex -fast -Minfo=all -tp=skylake jacobi.F90 &> compile_CPU.log
```

which will print diagnostic messages indicating the optimizations made for the
Intel 'Skylake' CPU architecture:

```console
init_simple_diag_dom:
48, Loop not vectorized/parallelized: contains call
56, Zero trip check eliminated
Generated vector simd code for the loop
main:
106, Memory zero idiom, loop replaced by call to __c_mzero8
107, Memory zero idiom, loop replaced by call to __c_mzero8
108, Loop not vectorized/parallelized: contains call
123, Loop not vectorized/parallelized: potential early exits
FMA (fused multiply-add) instruction(s) generated
131, Loop not fused: different loop trip count
133, Zero trip check eliminated
143, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
164, Loop not fused: function call before adjacent loop
166, Zero trip check eliminated
Generated vector simd code for the loop containing reductions
```

One can compile the same code with different flags, targeting NVIDIA V100
'Volta' GPUs:

```console
pgfortran -o jacobi_GPU.ex -fast -Minfo=all -ta=tesla:cc70 -Mcuda=cc70,cuda10.1,lineinfo jacobi.F90 &> compile_GPU.log
```

which will yield a different set of diagnostic messages:

```console
init_simple_diag_dom:
48, Loop not vectorized/parallelized: contains call
56, Zero trip check eliminated
Generated vector simd code for the loop
main:
106, Memory zero idiom, loop replaced by call to __c_mzero8
107, Memory zero idiom, loop replaced by call to __c_mzero8
108, Loop not vectorized/parallelized: contains call
123, Loop not vectorized/parallelized: potential early exits
FMA (fused multiply-add) instruction(s) generated
130, Generating implicit copyout(xnew(1:nsize))
Generating implicit copyin(b(1:nsize),a(1:nsize,1:nsize))
Generating implicit copyin(xold(1:nsize))
131, Loop carried dependence of xnew prevents parallelization
Loop carried backward dependence of xnew prevents vectorization
Complex loop carried dependence of xold prevents parallelization
Generating Tesla code
131, !$acc loop seq
133, !$acc loop vector(128) ! threadidx%x
134, Generating implicit reduction(+:rsum)
133, Loop is parallelizable
143, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
164, Loop not fused: function call before adjacent loop
166, Zero trip check eliminated
Generated vector simd code for the loop containing reductions
```

By setting the target compute architecture in the compiler invocation, one can
compile the same code to be optimized for each architecture; this portability
is a very useful feature of OpenACC.

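Note that in the GPU diagnostics above the compiler reports a loop-carried dependence on `xnew` and keeps the outer loop sequential (`!$acc loop seq`). A common refinement, sketched below as a suggestion rather than part of the original example, is to use the more explicit `acc parallel loop` directive, which treats the outer iterations as independent and keeps `rsum` private to each iteration:

```Fortran
! Hedged refinement of the original kernel (a suggestion, not the page's code):
! within an acc parallel construct, loops are treated as independent, so the
! outer loop can be distributed across gangs.
!$acc parallel loop private(rsum)
do i = 1, nsize
  rsum = 0
  do j = 1, nsize
    if( i /= j ) rsum = rsum + A(j,i) * xold(j)
  enddo
  xnew(i) = (b(i) - rsum) / A(i,i)
enddo
```
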
## How to use OpenACC on ASCR facilities

```console
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
```

### NERSC

The PGI compilers are provided on the [Cori GPU
nodes](https://docs-dev.nersc.gov/cgpu) at NERSC via the `pgi` modules.

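A minimal sketch of what a build on those nodes might look like, assuming the `pgi` module puts `pgfortran` on the path and reusing the GPU flags from the example above; consult the linked NERSC documentation for the exact module names and recommended flags:

```console
$ module load pgi
$ pgfortran -o jacobi_GPU.ex -fast -Minfo=all -ta=tesla:cc70 jacobi.F90
```
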
## Benefits and Challenges
