body.tex

\section{Introduction}
\subsection{Motivation and Background}
Since the Fortran 2008 standard was published in 2010~\cite{iso2010information}, a Fortran program executes as
a fixed number of instances that the standard refers to as ``images.''  Each image has its own state, including
data and input/output streams, corresponding to a \gls{spmd} programming style.  Images execute asynchronously
except at program launch, termination, and program-specified synchronization points.  Images may define
so-called coarray data structures to provide one-sided access to data by other images in a \gls{pgas}.

A few published studies have provided encouraging assessments of using Fortran's \gls{pgas} features
in research applications running at scale~\cite{preissl2011multithreaded,garain2015comparing,mozdzynski2015partitioned}.
One of the most attractive characteristics of these features is the ability to express parallel algorithms in a single
language without embedding compiler directives or referencing communication layers that are not part of the
language standard (e.g., no \gls{mpi} or OpenMP).  In theory, this should delay choice of communication layers
to link-time.  Ideally, an application developer writes a standard, Fortran-only application, compiles it into object code,
and then links to one or more communication layers.

Such flexibility poses the computing equivalent of what food author Michael Pollan termed ``The Omnivore's Dilemma:'' if I
belong to a species that owes some of its evolutionary advantage to being able to eat a wide variety of foods, what food is
best for me to eat?~\cite{pollan2006omnivore}  Similarly, if I can express my parallel algorithm once using \gls{caf} and then consume cycles atop a
multitude of software stacks and hardware platforms, where best to consume cycles and using which supporting software stack?
Here we report the results of an initial study of several current options for compiling, linking, and executing a \gls{mini-app}
designed to be representative of the parallel numerical algorithms and physics employed in \gls{icar}.

\gls{icar} simulates the motion of the atmosphere at kilometer length scales and produces flow patterns with a fidelity that is
attractive to the hydrology community for studying surface water.  Figure~\ref{figure:icar} depicts results from  \gls{icar}
simulations over North America.

\begin{figure}
   \vspace{-18pt}
   % Animate frames at 7 fps created via 'ffmpeg -i qvmovie.mp4 -r 7 frame%d.pngr', add control buttons,
   % and automatically play when the page is viewed:
   \vbox{\hspace{-24pt}
   %\animategraphics[width=\columnwidth,controls,autoplay]{7}{figures/icar/frames/frame}{1}{30}
   \includegraphics[width=1.1\columnwidth]{figures/icar/frames/frame22.png}
   }
   \vspace{-24pt}
   \caption{A visualization of the atmospheric distribution of water vapor (blues) and the resulting precipitation (green to red) simulated by \href{https://github.com/gutmann/icar}{\gls{icar}}.%{\it For the animated version of this graphic, please access https://github.com/sourceryinstitute/coarray-icar-paw17/releases/download/1.0/main.pdf and use Adobe Acrobat Reader 7.0 or later.}
\label{figure:icar}}
\end{figure}

\begin{figure*}
  \includegraphics[width=\textwidth]{figures/fig1_speedup.png}
  \includegraphics[width=\textwidth]{figures/fig2_timing.png}
  \caption{Speedup and timing results for two different domain sizes, two different communications backends (OSH=OpenSHMEM, MPI=MPI) and two different communication methods (``get'' and ``put'') using \href{https://github.com/gutmann/coarray_icar}{Coarray ICAR}.\label{fig1-2}}
\end{figure*}

\begin{figure*}
  \includegraphics[width=\textwidth]{figures/fig3_cross_platform.png}
  \vspace{-6pt}
  \includegraphics[width=\textwidth]{figures/fig4_extreme_scaling.png}
   \caption{Cross-platform and extreme scaling results using \href{https://github.com/gutmann/coarray_icar}{Coarray ICAR}.\label{fig3-4}}
   \vspace{-6pt}
\end{figure*}

\subsection{Objectives}
This paper evaluates alternatives for compiling, linking, and executing one
\gls{mini-app} \gls{caf} source code using the following technologies:
\begin{itemize}
  \item \gls{mpi}~\cite{mpiforum2016mpi} and OpenSHMEM~\cite{openshmem2016} communication layers,
  \item \gls{gcc} and Cray compilers, and
  \item Multi- and many-core processors.
\end{itemize}
We also tested the Intel Fortran compiler versions 16 and 17. The resulting executable
program crashes on Cray platforms and hangs at launch on when run on more than 16 cores on the SGI platform.

We show performance for two one-sided data access patterns:
\begin{enumerate}
  \item Gets: an image retrieves data from memory managed by another
        image without the providing image's involvement, and
  \item Puts: an image stores data in memory managed by
        another image without the receiving image's involvement.
\end{enumerate}
We also explore the performance and scalability of multi- versus many-core processors.

\section{Methodology}
\subsection{Physics and numerics}
The coarray-\gls{icar} implemention employed here uses the Thompson Eidhammer microphysics parameterization~\cite{Thompson:2014cw} and the first-order MP-DATA advection algorithm~\cite{Smolarkiewicz:1998il}.
The microphysics parameterization requires 11 primary input variables (e.g.\ air pressure, water vapor, cloud water), 9 of which are modified by the parameterization and advected throughout the domain requiring communication between processes.

The test case described here uses an idealized hill: a sine curve representing
a 1000 m high mountain in the middle of the domain.
The air is initially near 100\% relative humidity with a background wind of approximately 10 m s$^{-1}$.

At every time step, the outer edges of the local domain are processed first,
then passed to their neighbors while the interior of the domain is processed.
This is tested using both a \gls{caf} ``put'' operation, in which one-sided communication is initialized asynchronously to send data to a neighbor,
and a \gls{caf} ``get'' operation, in which one-sided blocking communication retrieves data from a neighbor.

\subsection{Compilers, runtimes, and hardware}

We performed the experiments presented here on two supercomputers at \gls{nersc},
located at \gls{lbnl}, and one supercomputer at \gls{ncar}. The \gls{lbnl} systems are
\begin{itemize}

\item Edison: a Cray XC30 featuring 5586 nodes with two sockets of 12-core Intel Xeon Processor E5-2695 v2 (``Ivy Bridge''), running at \num{2.4}~\si{\giga\hertz}.
\item Cori: a Cray XC40 with \num{12076} compute nodes, spanning two architectures; \num{2388} nodes each have two sockets of 16-core Intel Xeon Processor E5-2698 v3 (``Haswell'') operating at \num{2.3}~\si{\giga\hertz}; the other \num{9688} nodes are single-socket, 68-core Intel Xeon Phi Processor 7250 (``Knights Landing'') at \num{1.4}~\si{\giga\hertz}.
\end{itemize}
Edison and Cori have the same ``Aries'' interconnect with a dragonfly topology.
All Cori measurements on the Xeon Phi nodes ran in the ``quadrant'' NUMA configuration with the MCDRAM configured as a transparent cache to the DDR4 memory.

The \gls{ncar} system is ``Cheyenne,'' an SGI ICE XA Cluster with \num{4032} compute nodes, each containing two sockets of 18-core Intel Xeon Processor E5-2697 v4 (``Broadwell''), running at \num{2.3}~\si{\giga\hertz}.
Cheyenne uses a Mellanox EDR Infiniband interconnect with a partial 9D Enhanced Hypercube single-plane topology.
We compiled coarray-\gls{icar} on the \gls{nersc} systems using the Cray Fortran compiler version 8.6.0.  We compiled
at \gls{ncar} using the \gls{gcc} version 6.3 Fortran front end, which uses the
OpenCoarrays \gls{abi}~\cite{fanfarillo2014opencoarrays} to support \gls{caf}.  We tested two parallel runtime libraries that implement the OpenCoarrays \gls{abi}:
\begin{enumerate}
  \item The default \gls{mpi} library using the SGI MPT \gls{mpi},
  \item The recently released OpenSHMEM library.
\end{enumerate}
We distributed \gls{caf} images as follows:
\begin{itemize}
  \item 68 images per Xeon Phi node on Cori,
  \item 24 images per Xeon node on Edison, and
  \item 36 images per node on Cheyenne,
\end{itemize}
corresponding to one image per physical core in each case.

\section{Discussion of Results}
Figure~\ref{fig1-2} shows results with two domain sizes ($500 \times 500 \times 20$ and $2000 \times 2000 \times 20$ grid cells) running on Cheyenne.
These tests compare scaling with blocking ``get'' and non-blocking ``put'' operations, and compare scaling using \gls{mpi} or OpenSHMEM.
The differences between \gls{mpi} and OpenSHMEM are minor, with OpenSHMEM performing only slightly better.
However, the differences between the ``get'' and ``put'' operations are substantial.
For the $500 \times 500 \times 20$ domain, ``puts'' scale well to 1800 cores, then scale poorly to \num{3600} cores;
however, ``gets'' only scale well to \num{504} cores, scale poorly to \num{1224}, and no further improvement comes from up to \num{3600} cores.

\gls{mpi} ``gets'' are not reported because a problem with the \gls{mpi} implementation makes them several orders of magnitude less efficient.
Similarly, \gls{mpi} ``put'' results are not reported beyond \num{1800} cores because a problem in the \gls{mpi} implementation caused memory
allocations to become a bottleneck at higher process counts.
The $2000 \times 2000 \times 20$ domain exhibits the same pattern but the scaling of ``puts'' remains strong up to 3600 cores; whereas
scaling of ``gets'' continues, albeit poorly, to \num{3600} cores.

Figure~\ref{fig3-4} (top) depicts results across multiple compilers, machines, and architectures.
In this case, only ``puts'' are tested, and only for the $2000 \times 2000\times 20$ domain.
The GNU compiler was used on Cheyenne (Xeons). The Cray compiler was used on Edison (Xeons) and Cori (KNL).
To aid interpretation, the scaling results are reported as a fraction of the ideal scaling for each machine in the top left.
These results show that the Cray compiler+system scales better than the gfortran+SGI system,
with 75\% efficiency at >10k cores, while on Cheyenne, only 55\% of ideal was achieved.
This also shows that the KNL system scales well out to large core counts (60\% of the ideal with \num{19200} cores),
but that the total runtimes are significantly slower than the equivalent runtimes on Xeons (top right).
The KNL performance might be improved in the future by implementing OpenMP threaded parallelism within a node.

To further test the scaling performance in the best case, ``puts'' on Cray Xeons, a limited set of tests were performed up to nearly \num{100000} cores (fig.~\ref{fig3-4} bottom).
This system scaled well to over \num{10000} cores, and continued scaling with nearly 50\% of the single-node efficiency at \num{98304} cores.

\section{Conclusions}
We have presented scaling and performance analysis using \gls{caf} on different architectures, compilers, and communications backends.
These tests have utilized modern Fortran standards and as a result no changes to the code were required for different configurations.

This work demonstrates the capability of one-sided non-blocking communications protocols to readily scale up to very high core counts.
It is expected that with moderate performance optimizations and a larger total problem size, this system will be able scale to even higher core counts.


%%%
%%% Sample tables
%%%

%\begin{table}
%  \caption{Frequency of Special Characters}
%  \label{tab:freq}
%  \begin{tabular}{ccl}
%    \toprule
%    Non-English or Math&Frequency&Comments\\
%    \midrule
%    \O & 1 in 1,000& For Swedish names\\
%    $\pi$ & 1 in 5& Common in math\\
%    \$ & 4 in 5 & Used in business\\
%    $\Psi^2_1$ & 1 in 40,000& Unexplained usage\\
%  \bottomrule
%\end{tabular}
%\end{table}

%\begin{table*}
%  \caption{Some Typical Commands}
%  \label{tab:commands}
%  \begin{tabular}{ccl}
%    \toprule
%    Command &A Number & Comments\\
%    \midrule
%    \texttt{{\char'134}author} & 100& Author \\
%    \texttt{{\char'134}table}& 300 & For tables\\
%    \texttt{{\char'134}table*}& 400& For wider tables\\
%    \bottomrule
%  \end{tabular}
%\end{table*}
% end the environment with {table*}, NOTE not {table}!

%It is strongly recommended to use the package booktabs~\cite{Fear05}
%and follow its main principles of typography with respect to tables:
%\begin{enumerate}
%\item Never, ever use vertical rules.
%\item Never use double rules.
%\end{enumerate}
%It is also a good idea not to overuse horizontal rules.


%%%
%%% Sample figures
%%%

%\begin{figure}
%\includegraphics{fly}
%\caption{A sample black and white graphic.}
%\end{figure}

%\begin{figure}
%\includegraphics[height=1in, width=1in]{fly}
%\caption{A sample black and white graphic
%that has been resized with the \texttt{includegraphics} command.}
%\end{figure}

%\begin{figure*}
%\includegraphics{flies}
%\caption{A sample black and white graphic
%that needs to span two columns of text.}
%\end{figure*}
%
%\begin{figure}
%\includegraphics[height=1in, width=1in]{rosette}
%\caption{A sample black and white graphic that has
%been resized with the \texttt{includegraphics} command.}
%\end{figure}

%\end{document}  % This is where a 'short' article might terminate


%\appendix
%Appendix A
%\section{Code snippets: collective subroutines?}
% This next section command marks the start of
% Appendix B, and does not continue the present hierarchy
%\section{Anything else?}

\begin{acks}
  The first author would like to thank the Visitor Programs of the Computational Information Systems Laboratory
  and the Research Applications Laboratory of \gls{ncar} for travel support for a visit during which much of
  the work presented in this paper was performed.

  This work was funded in part through a contract from the U.S. Army Corps of Engineers.

\end{acks}