From 033e238cd3c2939940e266730f5b72810881aae9 Mon Sep 17 00:00:00 2001
From: Gregor von Laszewski
Date: Wed, 15 Feb 2023 17:46:51 -0500
Subject: [PATCH] update contents

---
 .gitignore                 |   2 +
 section-data.tex           | 121 +++++++++++
 section-dev.tex            | 119 ++++++-----
 section-earthquake.tex     |  20 +-
 section-eq-performance.tex | 249 +++++++++++++++++++++++
 section-workflow.tex       | 112 +++++++++++
 vonLaszewski-frontiers.tex | 403 +------------------------------------
 7 files changed, 557 insertions(+), 469 deletions(-)
 create mode 100644 section-data.tex
 create mode 100644 section-eq-performance.tex
 create mode 100644 section-workflow.tex

diff --git a/.gitignore b/.gitignore
index 00995bd..7b253ec 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,8 @@
+.#*
 ~*
 .DS_Store
 *.aux
+*.toc
 *.bbl
 *.blg
 *.fdb_latexmk
diff --git a/section-data.tex b/section-data.tex
new file mode 100644
index 0000000..1f85dd4
--- /dev/null
+++ b/section-data.tex
@@ -0,0 +1,121 @@
+\section{Insights into Data Management}
+
+In data management we are currently concerned with various aspects of
+the data set, the data compression and storage, as well as the data
+access speed. We discuss insights into each of them in the next
+sections.
+
+\subsection{Data Sets}
+
+When dealing with datasets we typically encounter several issues.
+These issues are addressed by the MLCommons benchmarks and data
+management activities so that they provide ideal candidates for
+education without spending an exorbitant amount of time on data. Such
+issues typically include access to data without privacy restrictions,
+data preprocessing that makes the data suitable for deep learning, and
+data labeling in case the data are part of a well-defined MLCommons
+benchmark. Other issues include data bias, noisy or missing data, as
+well as overfitting to the training data. Typically, the MLCommons
+benchmarks are designed to have no such issues, or only minimal ones.
+However, some benchmarks, such as the Science group benchmarks, which
+are concerned with improving the science, will potentially have to
+address these issues in order to improve the accuracy. This could even
+include injecting new data and using different preprocessing methods.
+
+
+\subsection{Data compression}
+
+An issue that is of utmost importance, especially for large data sets,
+is how the data is represented. For example, for the earthquake
+benchmark we found that the original dataset was 11 GB in size.
+However, we found that the data can easily be compressed by a factor
+of 100. This is significant, as in this case the entire dataset can be
+stored in GitHub. The compressed xz archive file is only 21 MB, and
+downloading only the archive file using wget takes 0.253 s. In case
+the dataset and its repository are downloaded with Git, we note that
+the entire Git repository is
+108 MB~\citep{mlcommons-earthquake-data}; downloading it takes
+7.723 s. Thus it is preferred to download just the explicitly used
+data using, for example, wget. In both cases the data is compressed.
+Uncompressing the data takes an additional 1 minute and 2.522 seconds.
+However, if we were to download the data in uncompressed form, it
+would take approximately 3 hours and 51 seconds.
+
+From this simple example it is clear that MLCommons benchmarks can
+provide insights into how data is managed and delivered to, for
+example, large-scale compute clusters with many nodes, while utilizing
+compression algorithms.
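+
+To make the recommended access path concrete, the following minimal
+sketch shows the idea in Python. The archive URL and file name are
+placeholders for the location published with the data
+repository~\citep{mlcommons-earthquake-data}, and we assume here that
+the archive is an xz-compressed tar file.
+
+\begin{verbatim}
+import tarfile
+import urllib.request
+
+URL = "https://example.org/earthquake-data.tar.xz"  # placeholder URL
+ARCHIVE = "earthquake-data.tar.xz"
+
+# download only the compressed archive (about 21 MB)
+urllib.request.urlretrieve(URL, ARCHIVE)
+
+# unpack the archive; the extracted data is about 11 GB
+with tarfile.open(ARCHIVE, mode="r:xz") as tar:
+    tar.extractall(path="data")
+\end{verbatim}
+
+Fetching only the archive in this fashion, rather than cloning the
+repository, also avoids transferring the Git history.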
+
+We will next discuss insights into infrastructure management while
+using file systems on HPC resources. While object stores are often
+discussed as hosts for such large datasets, it is imperative to
+identify the units of storage in such object stores. In our case an
+object store that would host individual data records is not useful due
+to the vast number of data points. Therefore the best way to store
+this data, even in an object store, is as a single entry of the
+compressed overall data.
+
+
+\subsection{Data Access}
+
+Besides having proper data and being able to download it efficiently
+from the location of storage, it is imperative to be able to access it
+in such a way that the GPUs used for deep learning are fed with enough
+data without being idle. The performance results were somewhat
+surprising and had a devastating effect on the overall execution time:
+the benchmark ran twice as fast on a personal computer with an RTX3090
+as on an A100 using the file systems recommended by the HPC center.
+For this reason we made a simple test and measured the read
+performance of the various file systems (a minimal sketch of such a
+test is shown at the end of this section). The results are shown in
+Table~\ref{tab:file-performance}, which includes various file systems
+at the University of Virginia's Rivanna HPC as well as a comparison
+with a personal computer from a student.
+
+Based on this observation it was infeasible to consider running the
+earthquake benchmark on the regularly configured HPC nodes, as they
+ran on some resources for almost 24 hours. This is also the limit the
+Rivanna system allows for one job. Hence we were allowed to use a
+special compute node that has additional NVMe storage available and
+accessible to us. On those nodes (listed in the table as
+\verb|/localscratch|) we were able to obtain a very suitable
+performance for this application, with a speedup of 12.6 over the
+shared scratch file system and almost four times the performance given
+to us on the project file system. The /tmp file system, although fast,
+was not sufficiently large for our application and also performed
+slower than the \verb|/localscratch| storage set up for us. In
+addition, we also ran an experiment using a file system hosted in
+shared memory in the node's RAM.
+
+
+What we learn from this experience is that an HPC system must provide
+a fast file system locally available on the nodes to serve the GPUs
+adequately. The computer should be designed from the start to not only
+have the fastest possible GPUs for large data processing, but also a
+very fast file system that can keep up with the data input
+requirements presented by the GPUs. Furthermore, in case updated GPUs
+are purchased, it is not sufficient to just reuse the previous
+generation motherboard, CPU, and memory; the other hardware components
+must be updated as well so that the result is a state-of-the-art
+compute node. This often prevents the repurposing of an existing node
+by just adding new GPUs.
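+
+A minimal sketch of the kind of sequential read test behind
+Table~\ref{tab:file-performance} is given below. The file path is a
+placeholder, and the file read should be larger than the available
+page cache so that cached reads do not distort the measurement; on
+Linux, standard tools such as dd can be used for the same purpose.
+
+\begin{verbatim}
+import time
+
+def read_bandwidth(path, block_size=16 * 1024 * 1024):
+    """Read the file sequentially and return the bandwidth in MiB/s."""
+    total = 0
+    start = time.perf_counter()
+    with open(path, "rb") as f:
+        while True:
+            block = f.read(block_size)
+            if not block:
+                break
+            total += len(block)
+    elapsed = time.perf_counter() - start
+    return total / (1024 * 1024) / elapsed
+
+# placeholder path for a large file on the file system under test
+print(read_bandwidth("/localscratch/earthquake/data.tar.xz"))
+\end{verbatim}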
+
+\begin{table}[htb]
+  \caption{File transfer performance of various file systems on
+    Rivanna and on a personal computer}
+  \label{tab:file-performance}
+  \begin{center}
+    {\footnotesize
+    \begin{tabular}{lllllp{4.5cm}}
+     Machine & File system & \multicolumn{2}{l}{Bandwidth Performance} & Speedup & Description \\
+     \hline
+     Rivanna & \verb|/scratch/$USER (sbatch)| & 30.6MiB/s & 32.1MB/s & 1.0 & shared scratch space when running in batch mode \\
+     Rivanna & \verb|/scratch/$USER (interactive)| & 33.2MiB/s & 34.8MB/s & 1.1 & shared scratch space when running interactively \\
+     Rivanna & \verb|/home/$USER| & 40.9MiB/s & 42.9MB/s & 1.3 & user's home directory \\
+     Rivanna & \verb|/project/$PROJECTID| & 100MiB/s & 105MB/s & 3.3 & project-specific file system \\
+     Personal Computer & \verb|c:| & 187MiB/s & 196MB/s & 6.1 & file system on a personal computer \\
+     Rivanna & \verb|/tmp| & 271MiB/s & 285MB/s & 8.9 & temporary file system on a node \\
+     \hline
+     Selected Nodes Rivanna & \verb|/localscratch| & 384MiB/s & 403MB/s & 12.6 & special access to NVMe storage of a special node in the cluster\\
+     RAM disk Rivanna & \verb|/dev/shm/*| & 461MiB/s & 483MB/s & 15.1 & simulated file system in a RAM disk\\
+     \hline
+  \end{tabular}
+  }
+  \end{center}
+\end{table}
+
diff --git a/section-dev.tex b/section-dev.tex
index 8b88836..f57649c 100644
--- a/section-dev.tex
+++ b/section-dev.tex
@@ -1,61 +1,58 @@
-\section{Insights into Development from the Earthquake Code}
-
-The original code was developed by a single researcher with the goal to create a DL method called tvelop to apply spacial timeseries evolution for multiplw applications including eartquake, hydrology and COVID prediction. The code was developed in a large Python Jupyter notebook on Google Collab. The total number of lines of code was \TODO{line number}. The code included all definitions of variables and hyperparemetes in the code itself.
-
-
-difficult to maintain and understand for others
-
-easy to develop by author, many experimental aspects
-
-all varables defined in code not config file
-
-lots of graphic outputs for interactive development
-
-How many lines of code??
-
-
-no use of libraries
-limited use of functions
-if conditions for different science applications
-
-
-large code is too dificult to maintain in colab
-
-papermill
-
-mlcommons focus on one science application at a time
-
-students can not comprehend code
-
-rewritten code to just focus on earth quake
-
-rewritten code to add selected hyperparameters into a configuration file
-
-
-setup
-
-for
-
-training
-
-valiadation
-
-comparing output
-
-not much use of libraries
-
-choices
-
-development of multiple runs based on variation of additional time based internal hyperparameters,
---> long runtime, no changes to evaluation section in code
-
-take these parameters out and place them in a configuration fil --> multiple runs needed and caomparision has to be separated fromprg, ;lots of changes to the program, program will run shorter,
-
-
-libraries for mlcommons benchmarking, cloudmesh
-portable way to define data locations via config
-experiment permutation over hyperparameters.
-* repeated experiements
-* separate evaluation and comparision of accuray which was not in original code.
-* comparision of accuracy across different hyperparameter searches.
\ No newline at end of file
+\subsection{Insights into Development from the Earthquake Code}
+
+The original code was developed with the goal to create a DL method
+called {\em tevelop} to apply spatial time series evolution to
+multiple applications, including earthquake, hydrology, and COVID
+prediction. The code was presented in a large Python Jupyter notebook
+on Google Colab. Due to the integration of multiple applications, the
+code was difficult to understand and maintain. For this reason the
+total number of lines of code, 13,500, was reduced by more than 2,400
+lines when the hydrology and COVID code were removed. However, at the
+same time we restructured the code and reached a final length of about
+11,100 lines of code. The original code contained all definitions of
+variables and hyperparameters in the code itself and therefore needed
+to be changed every time a hyperparameter was modified.
+
+As we can see from this, the code has some major issues that future
+versions ought to address. First, the code includes every aspect that
+is not covered by TensorFlow and also contains a customized version of
+TFT. Second, due to this, the code is very large, and manipulating and
+editing the code is time-consuming and error-prone. Third, as many
+code-related parameters are still managed in the code, running the
+same code with various parameters becomes cumbersome. In fact,
+multiple copies of the code need to be maintained when new parameters
+are chosen, instead of making such parameters part of a configuration
+file. Hence we started moving towards the simplification of the code
+by introducing the concept of libraries that can be pip installed, as
+well as gradually adding more parameters to a configuration file that
+is used by the program.
+
+The advantage of using a notebook is that it can be augmented with
+many graphs that give in-situ updates on the progress and the measured
+accuracy. However, it is infeasible for students to use and replicate
+the run of this notebook, as the runtime can be up to two days.
+Students certainly have to use their computers for other things and
+need to be able to use them on the go. Often HPC centers provide
+interactive jobs in the batch queues, but this is also not sufficient
+here. Instead, we adapted the Jupyter notebooks to run in full batch
+mode under the HPC queuing system by generating a special batch script
+that internally uses papermill to execute the notebook in the
+background (a minimal sketch of such an invocation is given at the end
+of this section). Papermill also preserves in the output notebook all
+cells that are updated during runtime, including graphics. The script
+we developed, however, needed to be run multiple times and with
+different hyperparameters, such as the number of epochs, to give just
+one example. As the HPC system is a heterogeneous GPU system providing
+access to A100, V100, P100, and RTX2080 GPUs, the choice of GPU must
+be configurable. Hence the batch script includes the ability to also
+read in the configuration file and adapt itself to the needed
+parameters. This is controlled by a sophisticated but simple batch job
+generator, which we discuss in a later section.
+
+
+
+%libraries for mlcommons benchmarking, cloudmesh
+%portable way to define data locations via config
+%experiment permutation over hyperparameters.
+%* repeated experiements
+%* separate evaluation and comparision of accuray which was not in original code.
+%* comparision of accuracy across different hyperparameter searches.
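+
+The following minimal sketch illustrates how such a papermill-based
+batch execution looks in principle. The notebook name and the
+parameter names are placeholders; in our setup the concrete values are
+filled in from the configuration file by the batch job generator.
+
+\begin{verbatim}
+# run_notebook.py -- called from the generated batch script
+import papermill as pm
+
+# The input notebook must contain a cell tagged "parameters" so that
+# papermill can inject the values given below.
+pm.execute_notebook(
+    "earthquake.ipynb",                 # input notebook (placeholder)
+    "earthquake-epochs-2-a100.ipynb",   # executed copy with all outputs
+    parameters={
+        "epochs": 2,                    # e.g., number of epochs
+        "gpu": "a100",                  # e.g., GPU type of the node
+    },
+)
+\end{verbatim}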
\ No newline at end of file diff --git a/section-earthquake.tex b/section-earthquake.tex index 2c7e481..3e61bd2 100644 --- a/section-earthquake.tex +++ b/section-earthquake.tex @@ -8,8 +8,8 @@ \section{Earthquake Forecasting} forecasting methods rely on statistical techniques, we use ML for extracting the evolution and testing the effectiveness of the forecast. As a metric, we use the Nash-Sutcliffe Efficiency (NSE) -\cite{nash-79}. Other qualitative predictions are discussed in -~\cite{fox2022-jm}. +\citep{nash-79}. Other qualitative predictions are discussed in +~\citep{fox2022-jm}. One of the common tasks when dealing with time series is the ability to predict or forecast them in advance. Time series capture the @@ -30,7 +30,7 @@ \section{Earthquake Forecasting} \subsection{Earthquake Data} The data for this earthquake is described in -\cite{las-22-mlcommons-science}. It uses a subset of the earthquake +\citep{las-22-mlcommons-science}. It uses a subset of the earthquake data from the United States Geological Survey (USGS) focused on Southern California between latitude: $32^\circ$N to $36^\circ$N and longitude: $-120^\circ$S to $-114^\circ$S). The data for this region covers all @@ -48,7 +48,7 @@ \subsection{Earthquake Data} from a fortnight up to four years. Furthermore, we calculate summed magnitudes and depths and counts of significant quakes (magnitude $< 3.29$).'' Table~\ref{tab:eq-summary} depicts the key features of the -benchmark \cite{las-22-mlcommons-science}. +benchmark \citep{las-22-mlcommons-science}. \begin{table} @@ -58,7 +58,7 @@ \subsection{Earthquake Data} {\footnotesize \begin{tabular}{p{0.2\columnwidth}p{0.2\columnwidth}p{0.45\columnwidth}} \hline -{\bf Area} & \multicolumn{2}{l}{Earthquake Forecasting~\cite{fox2022-jm,TFT-21,eq-code,eq-data}.}\\ +{\bf Area} & \multicolumn{2}{l}{Earthquake Forecasting~\citep{fox2022-jm,TFT-21,eq-code,eq-data}.}\\ \hline {\bf Objectives} & \multicolumn{2}{l}{Improve the quality of Earthquake forecasting in a region of Southern California.}\\ @@ -70,9 +70,9 @@ \subsection{Earthquake Data} & Size: & 11.3GB (Uncompressed), 21.3MB (Compressed)\\ & Training samples: & 2,400 spatial bins\\ & Validation samples: & 100 spatial bins\\ - & Source: & USGS Servers~\cite{eq-data}\\ + & Source: & USGS Servers~\citep{eq-data}\\ \hline -{\bf Reference Implementation} & \cite{eq-code} & \\ +{\bf Reference Implementation} & \citep{eq-code} & \\ % \hline \hline \end{tabular} @@ -87,14 +87,14 @@ \subsection{Implementation} The reference implementation of the benchmark includes three distinct deep learning-based reference implementations. These are Long short-term memory (LSTM)-based model, Google Temporal Fusion -Transformer (TFT)~\cite{TFT-21}-based model and a custom hybrid +Transformer (TFT)~\citep{TFT-21}-based model and a custom hybrid transformer model. The TFT-based model uses two distinct LSTMs, covering an encoder and a decoder with a temporal attention-based transformer. The custom model includes a space-time transformer for the Decoder and a two-layer LSTM for the encoder. Each model predicts NSE and generates visualizations illustrating the TFT for interpretable multi-horizon time series -forecasting~\cite{TFT-21}. Details of the current reference models can -be found in~\cite{fox2022-jm}. In this paper, we only focus on the +forecasting~\citep{TFT-21}. Details of the current reference models can +be found in~\citep{fox2022-jm}. In this paper, we only focus on the LSTM implementation. 
diff --git a/section-eq-performance.tex b/section-eq-performance.tex new file mode 100644 index 0000000..50f9ffc --- /dev/null +++ b/section-eq-performance.tex @@ -0,0 +1,249 @@ +\section*{Figure captions} + +%%% max 15 figures abd table, subfig is one figure + +%%% NB logo1.eps is required in the path in order to correctly compile front page header %%% + + + +\begin{figure}[htb] + + \begin{center} + + \begin{minipage}[b]{0.45\textwidth} + \includegraphics[width=1.0\linewidth]{images/2_training-MSE-and-NNSE.pdf} + {\bf (A)} MSE and NNSE - 2 epochs training. + \end{minipage} + \ \ + \begin{minipage}[b]{0.45\textwidth} + \includegraphics[width=1.0\linewidth]{images/2_validation-MSE-and-NNSE.pdf} + {\bf (B)} MSE and NNSE - 2 epochs validation. + \end{minipage} + + \begin{minipage}[b]{0.45\textwidth} + \includegraphics[width=1.0\linewidth]{images/30_training-MSE-and-NNSE.pdf} + {\bf (C)} MSE and NNSE - 30 epochs training. + \end{minipage} + \ \ + \begin{minipage}[b]{0.45\textwidth} + \includegraphics[width=1.0\linewidth]{images/30_validation-MSE-and-NNSE.pdf} + {\bf (D)} MSE and NNSE - 30 epochs validation. + \end{minipage} + + \begin{minipage}[b]{0.45\textwidth} + \includegraphics[width=1.0\linewidth]{images/70_training-MSE-and-NNSE.pdf} + {\bf (E)} MSE and NNSE - 70 epochs training. + \end{minipage} + \ \ + \begin{minipage}[b]{0.45\textwidth} + \includegraphics[width=1.0\linewidth]{images/70_validation-MSE-and-NNSE.pdf} + {\bf (F)} MSE and NNSE - 70 epochs validation. + \end{minipage} +\end{center} + + \caption{NNSE and MSE values for training and validation for epochs 2 (A, B), 30 (C, D), 70 (E, F).} + \label{fig:six graphs} +\end{figure} + +%we have finalized the EQ code but want to make absolutely sure that we +%look at the correct values for the scientific comparison. +% +%This also requires a small sentence to each variable. Could we have a +%small meeting and I take then some notes on what these values are +%tomorrow. I will then add the explanations to the MLCommons EQ benchmark +%policy document. +% +%I just want to make sure I understand over which domain we average and +%sum up. +% +%Also if we were to just do one value (just in case they ask, I think we +%would use the summed up total right. However I think it is better to +%keep all of them.) +% +%Also I forgot what the +26 refers to +% +%I think something like this is almost correct, but we need to +26 +%explanation and get verification from you. +% +%The Magnification based on a years worth of back data, while looking two +%weeks ahead + 26 what? + +\begin{table}[p] + + \caption{Training and validation with time-based hyperparameters + sorted by NNSE accuracy. The table includes the best two + values highlighted in the training and validation results to + showcase the accuracy of the validation. In the validation, + we see that the best value for training is in rank four for the + validation. The number of Epochs for this experiment is 2. 
+    26 is half of 52 and so 26 2-week intervals is a year.}
+  \label{tab:training-2}
+
+  \renewcommand{\arraystretch}{1.2}
+  \begin{center}
+    {\footnotesize
+\begin{tabular}{|r|rl||rl|}
+  \hline
+{\bf Rank} & \multicolumn{2}{c||}{\bfseries Training} & \multicolumn{2}{c|}{\bfseries Validation} \\
+ & {\bf NNSE} & {\bf Hyperparameters} & {\bf NNSE} & {\bf Hyperparameters} \\
+\hline
+1 & \color{red} 0.191300 & \color{red} Year Back & \color{blue} 0.195200 & \color{blue} 6M 2wk+7AVG \\
+2 & 0.192700 & \color{blue} 6M 2wk+7AVG & \color{teal} 0.201000 & \color{teal} 6 Months Back \\
+3 & 0.197000 & 6M 2wk+13AVG & 0.201600 & 6M 2wk+13AVG \\
+4 & \color{teal} 0.201600 & \color{teal} 6 Months Back & \color{red} 0.204500 & \color{red} Year Back \\
+5 & 0.232600 & 1Y 2wk+13AVG & 0.219700 & 3 Months Back \\
+6 & 0.233000 & 3 Months Back & 0.228900 & 3M 2wk+7AVG \\
+7 & 0.235800 & 1Y 2wk+7AVG & 0.238200 & 1Y 2wk+13AVG \\
+8 & 0.243000 & 3M 2wk+7AVG & 0.249500 & 1Y 2wk+7AVG \\
+9 & 0.251600 & 1Y 2wk+26AVG & 0.264400 & 6M 2wk+26AVG \\
+10 & 0.251700 & 6M 2wk+26AVG & 0.266200 & 3M 2wk+13AVG \\
+11 & 0.278800 & 3M 2wk+13AVG & 0.270300 & 1Y 2wk+26AVG \\
+12 & 0.302500 & 3M 2wk+26AVG & 0.295800 & 3M 2wk+26AVG \\
+13 & 0.405600 & Now 2wk+7AVG & 0.379700 & Now 2wk+7AVG \\
+14 & 0.429900 & Now 2wk+13AVG & 0.412700 & Now 2wk+13AVG \\
+15 & 0.506800 & 2 weeks Now & 0.470100 & 2 weeks Now \\
+16 & 0.521800 & Now 2wk+26AVG & 0.502300 & Now 2wk+26AVG \\
+\hline
+\end{tabular}
+}
+\end{center}
+
+%\end{table}
+
+%\begin{table}[htb]
+
+  \caption{Training and validation with time-based hyperparameters
+    sorted by NNSE accuracy. The table includes the best two
+    values highlighted in the training and validation results to
+    showcase the accuracy of the validation. In the validation,
+    we see that the best value for training is also the best value
+    for the validation. The number of Epochs for this experiment is 30.
+  }
+  \label{tab:training-30}
+
+  \renewcommand{\arraystretch}{1.2}
+  \begin{center}
+    {\footnotesize
+\begin{tabular}{|r|rl||rl|}
+\hline
+{\bf Rank} &
+\multicolumn{2}{c||}{\bfseries Training} &
+\multicolumn{2}{c|}{\bfseries Validation} \\
+ & {\bf NNSE} &
+ {\bf Hyperparameters} &
+ {\bf NNSE} &
+ {\bf Hyperparameters} \\
+\hline
+ 1 & \color{red} 0.047600 & \color{red} Year Back & \color{red} 0.050500 & \color{red} Year Back \\
+ 2 & \color{blue} 0.069500 & \color{blue} 6 Months Back & \color{blue} 0.070300 & \color{blue} 6 Months Back \\
+ 3 & 0.082900 & 1Y 2wk+7AVG & 0.076500 & 1Y 2wk+7AVG \\
+ 4 & 0.089700 & 3 Months Back & 0.090400 & 3 Months Back \\
+ 5 & 0.171600 & 1Y 2wk+13AVG & 0.153600 & 1Y 2wk+13AVG \\
+ 6 & 0.208100 & 6M 2wk+7AVG & 0.186200 & 6M 2wk+7AVG \\
+ 7 & 0.319600 & 1Y 2wk+26AVG & 0.290100 & 1Y 2wk+26AVG \\
+ 8 & 0.330300 & 3M 2wk+7AVG & 0.291900 & 3M 2wk+7AVG \\
+ 9 & 0.341800 & 6M 2wk+13AVG & 0.302800 & 6M 2wk+13AVG \\
+10 & 0.394600 & 3M 2wk+13AVG & 0.343400 & 3M 2wk+13AVG \\
+11 & 0.418900 & 6M 2wk+26AVG & 0.374500 & 6M 2wk+26AVG \\
+12 & 0.450800 & 3M 2wk+26AVG & 0.384100 & 2 weeks Now \\
+13 & 0.488800 & 2 weeks Now & 0.398900 & 3M 2wk+26AVG \\
+14 & 0.517900 & Now 2wk+7AVG & 0.409300 & Now 2wk+7AVG \\
+15 & 0.559200 & Now 2wk+13AVG & 0.453000 & Now 2wk+13AVG \\
+16 & 0.586000 & Now 2wk+26AVG & 0.484100 & Now 2wk+26AVG \\
+\hline
+\end{tabular}
+}
+
+\end{center}
+\end{table}
+
+
+\begin{table}[htb]
+
+  \caption{Training and validation with time-based hyperparameters
+    sorted by NNSE accuracy.
The table includes the best two
+    values highlighted in the training and validation results to
+    showcase the accuracy of the validation. In the validation,
+    we see that the best value for training is also the best value
+    for the validation. The number of Epochs for this experiment is 70.}
+  \label{tab:training-70}
+
+  \renewcommand{\arraystretch}{1.2}
+  \begin{center}
+    {\footnotesize
+\begin{tabular}{|r|rl||rl|}
+  \hline
+{\bf Rank} & \multicolumn{2}{c||}{\bfseries Training} & \multicolumn{2}{c|}{\bfseries Validation} \\
+ & {\bf NNSE} & {\bf Hyperparameters} & {\bf NNSE} & {\bf Hyperparameters} \\
+  \hline
+ 1 & \color{red} 0.067400 & \color{red} 3 Months Back & \color{red}0.069800 & \color{red} 3 Months Back \\
+ 2 & \color{blue} 0.073500 & \color{blue} Year Back & \color{blue} 0.071200 & \color{blue} Year Back \\
+ 3 & 0.083100 & 1Y 2wk+7AVG & 0.084300 & 1Y 2wk+7AVG \\
+ 4 & 0.105300 & 6 Months Back & 0.102200 & 6 Months Back \\
+ 5 & 0.138400 & 6M 2wk+7AVG & 0.133700 & 6M 2wk+7AVG \\
+ 6 & 0.153500 & 1Y 2wk+13AVG & 0.142800 & 1Y 2wk+13AVG \\
+ 7 & 0.252100 & 6M 2wk+13AVG & 0.235400 & 6M 2wk+13AVG \\
+ 8 & 0.295900 & 6M 2wk+26AVG & 0.269700 & 6M 2wk+26AVG \\
+ 9 & 0.318800 & 1Y 2wk+26AVG & 0.291100 & 3M 2wk+7AVG \\
+10 & 0.335400 & 3M 2wk+7AVG & 0.293500 & 1Y 2wk+26AVG \\
+11 & 0.385200 & 3M 2wk+13AVG & 0.333000 & 3M 2wk+13AVG \\
+12 & 0.421000 & 3M 2wk+26AVG & 0.344500 & 2 weeks Now \\
+13 & 0.425700 & 2 weeks Now & 0.359400 & Now 2wk+7AVG \\
+14 & 0.441300 & Now 2wk+7AVG & 0.370700 & 3M 2wk+26AVG \\
+15 & 0.465800 & Now 2wk+13AVG & 0.385800 & Now 2wk+13AVG \\
+16 & 0.490400 & Now 2wk+26AVG & 0.412500 & Now 2wk+26AVG \\
+\hline
+\end{tabular}
+}
+\end{center}
+
+\end{table}
+
+
+\section{Energy}
+
+\begin{figure}[htb]
+
+  \begin{center}
+  \begin{minipage}[t]{0.30\textwidth}
+    \includegraphics[width=1.0\linewidth]{images/card-name-v100-gpu-count-1-cpu-num-6-mem-32gb-repeat-1-tfttransformerepochs-2.png}
+    {\bf (A)} Energy consumption for 2 epochs training and validation.
+  \end{minipage}
+  \ \
+  \begin{minipage}[t]{0.30\textwidth}
+    \includegraphics[width=1.0\linewidth]{images/card-name-v100-gpu-count-1-cpu-num-6-mem-32gb-repeat-1-tfttransformerepochs-30.png}
+    {\bf (B)} Energy consumption for 30 epochs training and validation.
+  \end{minipage}
+  \ \
+  \begin{minipage}[t]{0.30\textwidth}
+    \includegraphics[width=1.0\linewidth]{images/card-name-v100-gpu-count-1-cpu-num-6-mem-32gb-repeat-1-tfttransformerepochs-70.png}
+    {\bf (C)} Energy consumption for 70 epochs training and validation.
+  \end{minipage}
+  \end{center}
+
+  \caption{Energy monitoring for 2, 30, and 70 epochs for training and validation.}
+  \label{fig:energy}
+
+\end{figure}
+
+\begin{figure}[p]
+
+  \begin{center}
+  \begin{minipage}[t]{0.65\textwidth}
+    \includegraphics[width=1.0\linewidth]{images/NNSE-all-epochs-training}
+    {\bf (A)} NNSE for training.
+  \end{minipage}
+  \end{center}
+  \ \
+  \begin{center}
+  \begin{minipage}[t]{0.65\textwidth}
+    \includegraphics[width=1.0\linewidth]{images/NNSE-all-epochs-validation}
+    {\bf (B)} NNSE for validation.
+  \end{minipage}
+  \end{center}
+
+  \caption{NNSE comparison for training and validation.}
+  \label{fig:NNSE-comparison}
+
+\end{figure}
+
diff --git a/section-workflow.tex b/section-workflow.tex
new file mode 100644
index 0000000..b18783d
--- /dev/null
+++ b/section-workflow.tex
@@ -0,0 +1,112 @@
+\section{Insights into DL Workflows}
+
+
+
+
+
+\subsection{Cloudmesh-sbatch}
+
+We describe cloudmesh-sbatch, the batch job generator introduced in
+the previous section, which reads the experiment configuration file
+and generates the batch scripts for the requested hyperparameter
+permutations (see Figure~\ref{fig:cm-sbatch}).
+
+\subsection{Cloudmesh-cc}
+
+With cloudmesh-cc we coordinate workflows that, for example, cover
+
+\begin{itemize}
+\item different graphics cards,
+
+\item different numbers of training epochs, and
+
+\item the creation of a workflow for cloudmask.
+\end{itemize}
+
+
+\subsubsection{Analytics Service Pipelines}
+
+\paragraph{Motivation.}
+In many cases, a big data analysis is split up into multiple
+subtasks. These subtasks may be reusable in other analytics
+pipelines. Hence it is desirable to be able to specify and use them in
+a coordinated fashion, allowing the reuse of the logic represented by
+the analysis. Users must have a clear understanding of what the
+analysis is doing and how it can be invoked and integrated.
+
+\paragraph{Access Requirements.}
+The analysis must include a clear and easy-to-understand specification
+that encourages reuse and provides sufficient details about its
+functionality, data dependency, and performance. Analytics services
+may have authentication, authorization, and access controls built in
+that enable access by users controlled by the service providers.
+
+
+
+\begin{figure}[htb]
+\centering\includegraphics[width=0.75\columnwidth]{images/processes-nist.pdf}
+\caption{Service Interaction.}
+\label{fig:service-interaction}
+\end{figure}
+
+
+\subsubsection{Workflow Compute Coordinator}
+
+High-performance computing (HPC) has been a very important tool for
+science for decades. Scientific tasks can leverage the processing
+power of a supercomputer so they can run at previously unobtainable
+high speeds or utilize specialized hardware for acceleration that
+otherwise is not available to the user. HPC can be used for analytics
+programs that leverage machine learning applied to large data sets to,
+for example, predict future values or to model current states. For
+such high-complexity projects, there are often multiple complex
+programs that may be running repeatedly in either competition or
+cooperation. This may include resources in the same or different data
+centers. We developed a hybrid multi-cloud analytics service framework
+that was created to manage heterogeneous and remote workflows, queues,
+and jobs. It can be used through a Python API, the command line, and a
+REST service. It is supported on multiple operating systems such as
+macOS, Linux, and Windows 10 and 11. The workflow is specified via an
+easy-to-define YAML file. Specifically, we have developed a library
+called Cloudmesh Compute Coordinator (cloudmesh-cc) that adds workflow
+features to control the execution of jobs on remote compute resources,
+while at the same time leveraging capabilities provided by the local
+compute environments to directly interface with graphical
+visualizations better suited for the desktop. The goal is to provide
+numerous workflows that in cooperation enhance the experience of the
+analytics tasks. This includes a REST service and command line tools
+to interact with it.
+
+
+\begin{figure}[htb]
+\centering\includegraphics[width=0.7\columnwidth]{images/fastapi-service.png}
+\caption{Fast API Workflow Service.}
+% better resolution
+\label{fig:fastapi-cc}
+\end{figure}
+
+\begin{figure}[htb]
+  \centering
+  \includegraphics[width=0.50\columnwidth]{images/cloudmesh-cc-new.pdf}
+  \caption{Architecture Workflow Service.}
+  \label{fig:cc-2}
+\end{figure}
+
+\begin{figure}[htb]
+  \centering
+  \includegraphics[width=0.70\columnwidth]{images/cloudmesh-sbatch-new.pdf}
+  \caption{Workflow Script Batch Generator.}
+  \label{fig:cm-sbatch}
+\end{figure}
+
+\begin{figure}[htb]
+  \centering
+  \includegraphics[width=0.70\columnwidth]{images/cc-1.png}
+  \caption{Workflow user interface.}
+  \label{fig:cc-3}
+\end{figure}
+
+
+We have tested the framework while running various MNIST application
+examples, including Multilayer Perceptron, LSTM (Long short-term
+memory), Auto-Encoder, Convolutional, and Recurrent Neural Networks,
+Distributed Training, and PyTorch training. A much larger application
+using earthquake prediction has also been used.
+
+Figure~\ref{fig:fastapi-cc} shows the REST specification, and
+Figure~\ref{fig:cc-2} shows the architecture.
diff --git a/vonLaszewski-frontiers.tex b/vonLaszewski-frontiers.tex
index f842933..6353a96 100644
--- a/vonLaszewski-frontiers.tex
+++ b/vonLaszewski-frontiers.tex
@@ -122,8 +122,8 @@ \section{}
 \end{abstract}
 
-\cite{las-infogram}
-\cite{las-workflow,las07-workflow}
+\citep{las-infogram}
+\citep{las-workflow,las07-workflow}
 
 % recording of presentation
 \url{https://myuva-my.sharepoint.com/:v:/r/personal/dje5dj_virginia_edu/Documents/icbicc.mp4?csf=1&web=1&e=oXV6cx}
@@ -155,157 +155,10 @@ \section{Introduction}
 \input{section-edu-mlcommons}
 \input{section-earthquake}
-
-\section{Erathquake Forecasting}
-
-Application domain
-
-
-
-TODO: describe the application
-
 \input{section-dev}
-
-\section{Insights into Data Management}
-
-
-\subsection{Data Sets}
-
-
-\subsection{Data compression}
-
-
-The earthquake data is compressed by a factor of 100.
-
-Downloading the git repository takes 4.723s. The git repository is 108M and
-is hosted on GitHub~\cite{mlcommons-earthquake-data}.
-
-The compressed xz archive file is 21 MB. To download only the archive file
-using wget takes 0.253s.
-To uncompress the archive takes 1m2.522s and the uncompressed files
-are 11 GB total.
-
-
-\subsection{Data Access}
-
-Fast file system
-
-
-
-\section{Insights into DL Workflows}
-
-
-
-
-
-\subsection{Cloudmesh-sbatch}
-
-We describe cloudmesh-sbatch
-
-\subsection{Cloudmesh-cc}
-
-\begin{itemize}
-\item Different graphics cards
-
-\item Different epochs of training
-
-\item Create workflow for cloudmask
-\end{itemize}
-
-
-\subsubsection{Analytics Service Pipelines}
-
-\paragraph{Motivation.}
-In many cases, a big data analysis is split up into multiple
-subtasks. These subtasks may be reusable in other analytics
-pipelines. Hence it is desirable to be able to specify and use them in
-a coordinated fashion allowing the reuse of the logic represented by the
-analysis. Users must have a clear understanding of what the analysis
-is doing and how it can be invoked and integrated.
-
-\paragraph{Access Requirements.}
-The analysis must include a clear and easy-to-understand specification
-that encourages reuse and provides sufficient details about its
-functionality, data dependency, and performance. Analytics services may
-have authentication, autorotation, and access controls built in that
-enable access by users controlled by the service providers.
- - - -\begin{figure}[htb] -\centering\includegraphics[width=0.75\columnwidth]{images/processes-nist.pdf} -\label{fig:service-interaction} -\caption{Service Interaction.} -\end{figure} - - -\subsubsection{Workflow Compute Coordinator} - -High-performance computing (HPC) is for decades a very important tool -for science. Scientific tasks can be leveraging the processing power -of a supercomputer so they can run at previously unobtainable high -speeds or utilize specialized hardware for acceleration that otherwise -are not available to the user. HPC can be used for analytic programs -that leverage machine learning applied to large data sets to, for -example, predict future values or to model current states. For such -high-complexity projects, there are often multiple complex programs -that may be running repeatedly in either competition or cooperation. -This may include resources in the same or different data centers. We -developed a hybrid multi-cloud analytics service framework that was -created to manage heterogeneous and remote workflows, queues, and -jobs. It can be used through a Python API, the command line, and a -REST service. It is supported on multiple operating systems like -macOS, Linux, and Windows 10 and 11. The workflow is specified via an -easy-to-define YAML file. Specifically, we have developed a library -called Cloudmesh Compute Coordinator (cloudmesh-cc) that adds workflow -features to control the execution of jobs on remote compute resources, -while at the same time leveraging capabilities provided by the local -compute environments to directly interface with graphical -visualizations better suited for the desktop. The goal is to provide -numerous workflows that in cooperation enhances the experience of the -analytics tasks. This includes a REST service and command line tools -to interact with it. - - -\begin{figure}[htb] -\centering\includegraphics[width=0.7\columnwidth]{images/fastapi-service.png} -\caption{Fast API Workflow Service.} -% better resolution -\label{fig:fastapi-cc} -\end{figure} - -\begin{figure}[htb] - \centering - \includegraphics[width=0.50\columnwidth]{images/cloudmesh-cc-new.pdf} - \caption{Architecture Workflow Service.} - \label{fig:cc-2} -\end{figure} - -\begin{figure}[htb] - \centering - \includegraphics[width=0.70\columnwidth]{images/cloudmesh-sbatch-new.pdf} - \caption{Workflow Script Batch Generator.} - \label{fig:cm-sbatch} -\end{figure} - - - -\begin{figure}[htb] - \centering - \includegraphics[width=0.70\columnwidth]{images/cc-1.png} - \caption{Workflow user interface. } - \label{fig:cc-3} -\end{figure} - - -We have tested the framework while running various MNIST application -examples, including include Multilayer Perceptron, LSTM (Long -short-term memory), Auto-Encoder, Convolutional, and Recurrent Neural -Networks, Distributed Training, and PyTorch training. A much lager -application using earthquake prediction has also been used. - -Figure \ref{fig:fastapi-cc} shows the REST specification and -\ref{fig:cc-2} shows the architecture. 
+\input{section-data} +\input{section-workflow} +\input{section-eq-performance} % \subsubsection{Federated Analytics Service Catalogue} % \subsubsection{Catalogue Attributes} @@ -316,9 +169,6 @@ \subsubsection{Workflow Compute Coordinator} % \subsubsection{Resource Management} % \subsubsection{Security} - - - % three axis graph: system, software, science \section{Nomenclature} @@ -406,247 +256,4 @@ \section*{Data Availability Statement} \bibliography{vonLaszewski-references} -\section*{Figure captions} - -%%% max 15 figures abd table, subfig is one figure - -%%% NB logo1.eps is required in the path in order to correctly compile front page header %%% - - - -\begin{figure}[htb] - - \begin{center} - - \begin{minipage}[b]{0.45\textwidth} - \includegraphics[width=1.0\linewidth]{images/2_training-MSE-and-NNSE.pdf} - {\bf (A)} MSE and NNSE - 2 epochs training. - \end{minipage} - \ \ - \begin{minipage}[b]{0.45\textwidth} - \includegraphics[width=1.0\linewidth]{images/2_validation-MSE-and-NNSE.pdf} - {\bf (B)} MSE and NNSE - 2 epochs validation. - \end{minipage} - - \begin{minipage}[b]{0.45\textwidth} - \includegraphics[width=1.0\linewidth]{images/30_training-MSE-and-NNSE.pdf} - {\bf (C)} MSE and NNSE - 30 epochs training. - \end{minipage} - \ \ - \begin{minipage}[b]{0.45\textwidth} - \includegraphics[width=1.0\linewidth]{images/30_validation-MSE-and-NNSE.pdf} - {\bf (D)} MSE and NNSE - 30 epochs validation. - \end{minipage} - - \begin{minipage}[b]{0.45\textwidth} - \includegraphics[width=1.0\linewidth]{images/70_training-MSE-and-NNSE.pdf} - {\bf (E)} MSE and NNSE - 70 epochs training. - \end{minipage} - \ \ - \begin{minipage}[b]{0.45\textwidth} - \includegraphics[width=1.0\linewidth]{images/70_validation-MSE-and-NNSE.pdf} - {\bf (F)} MSE and NNSE - 70 epochs validation. - \end{minipage} -\end{center} - - \caption{NNSE and MSE values for training and validation for epochs 2 (A, B), 30 (C, D), 70 (E, F).} - \label{fig:six graphs} -\end{figure} - -%we have finalized the EQ code but want to make absolutely sure that we -%look at the correct values for the scientific comparison. -% -%This also requires a small sentence to each variable. Could we have a -%small meeting and I take then some notes on what these values are -%tomorrow. I will then add the explanations to the MLCommons EQ benchmark -%policy document. -% -%I just want to make sure I understand over which domain we average and -%sum up. -% -%Also if we were to just do one value (just in case they ask, I think we -%would use the summed up total right. However I think it is better to -%keep all of them.) -% -%Also I forgot what the +26 refers to -% -%I think something like this is almost correct, but we need to +26 -%explanation and get verification from you. -% -%The Magnification based on a years worth of back data, while looking two -%weeks ahead + 26 what? - -\begin{table}[htb] - - \caption{Training and validation with time-based hyperparameters - sorted by NNSE accuracy. The table includes the best two - values highlighted in the training and validation results to - showcase the accuracy of the validation. In the validation, - we see that the best value for training is in rank four for the - validation. The number of Epochs for this experiment is 2. 
- 26 is half of 52 and so 26 2-week intervals is a year.} - \label{tab:training-2} - - \renewcommand{\arraystretch}{1.2} -\begin{center} -\begin{tabular}{|r|rl||rl|} - \hline -{\bf Rank} & \multicolumn{2}{c||}{\bfseries Training} & \multicolumn{2}{c|}{\bfseries Validation} \\ - & {\bf NNSE} & {\bf Hyperparameters} & {\bf NNSE} & {\bf Hyperparameters} \\ -\hline -1 & \color{red} 0.191300 & \color{red} Year Back & \color{blue} 0.195200 & \color{blue} 6M 2wk+7AVG \\ -2 & 0.192700 & \color{blue} 6M 2wk+7AVG & \color{teal} 0.201000 & \color{teal} 6 Months Back \\ -3 & 0.197000 & 6M 2wk+13AVG & 0.201600 & 6M 2wk+13AVG \\ -4 & \color{teal} 0.201600 & \color{teal} 6 Months Back & \color{red} 0.204500 & \color{red} Year Back \\ -5 & 0.232600 & 1Y 2wk+13AVG & 0.219700 & 3 Months Back \\ -6 & 0.233000 & 3 Months Back & 0.228900 & 3M 2wk+7AVG \\ -7 & 0.235800 & 1Y 2wk+7AVG & 0.238200 & 1Y 2wk+13AVG \\ -8 & 0.243000 & 3M 2wk+7AVG & 0.249500 & 1Y 2wk+7AVG \\ -9 & 0.251600 & 1Y 2wk+26AVG & 0.264400 & 6M 2wk+26AVG \\ -10 & 0.251700 & 6M 2wk+26AVG & 0.266200 & 3M 2wk+13AVG \\ -11 & 0.278800 & 3M 2wk+13AVG & 0.270300 & 1Y 2wk+26AVG \\ -12 & 0.302500 & 3M 2wk+26AVG & 0.295800 & 3M 2wk+26AVG \\ -13 & 0.405600 & Now 2wk+7AVG & 0.379700 & Now 2wk+7AVG \\ -14 & 0.429900 & Now 2wk+13AVG & 0.412700 & Now 2wk+13AVG \\ -15 & 0.506800 & 2 weeks Now & 0.470100 & 2 weeks Now \\ -16 & 0.521800 & Now 2wk+26AVG & 0.502300 & Now 2wk+26AVG \\ -\hline -\end{tabular} -\end{center} - -\end{table} - -\begin{table}[htb] - - \caption{Training and validation with time-based hyperparameters - sorted by NNSE accuracy. The table includes the best two - values highlighted in the training and validation results to - showcase the accuracy of the validation. In the validation, - we see that the best value for training is in rank four for the - validation. The number of Epochs for this experiment is 30. - } - \label{tab:training-30} - - \renewcommand{\arraystretch}{1.2} -\begin{center} -\begin{tabular}{|r|rl||rl|} -\hline -{\bf Rank} & -\multicolumn{2}{c||}{\bfseries Training} & -\multicolumn{2}{c|}{\bfseries Validation} \\ - {\bf NNSE} & - {\bf Hyperparameters} & - {\bf NNSE} & - {\bf Hyperparameters} \\ -\hline - 1 & \color{red} 0.047600 & \color{red} Year Back & \color{red} 0.050500 & \color{red} Year Back \\ - 2 & \color{blue} 0.069500 & \color{blue} 6 Months Back & \color{blue} 0.070300 & \color{blue} 6 Months Back \\ - 3 & 0.082900 & 1Y 2wk+7AVG & 0.076500 & 1Y 2wk+7AVG \\ - 4 & 0.089700 & 3 Months Back & 0.090400 & 3 Months Back \\ - 5 & 0.171600 & 1Y 2wk+13AVG & 0.153600 & 1Y 2wk+13AVG \\ - 6 & 0.208100 & 6M 2wk+7AVG & 0.186200 & 6M 2wk+7AVG \\ - 7 & 0.319600 & 1Y 2wk+26AVG & 0.290100 & 1Y 2wk+26AVG \\ - 8 & 0.330300 & 3M 2wk+7AVG & 0.291900 & 3M 2wk+7AVG \\ - 9 & 0.341800 & 6M 2wk+13AVG & 0.302800 & 6M 2wk+13AVG \\ -10 & 0.394600 & 3M 2wk+13AVG & 0.343400 & 3M 2wk+13AVG \\ -11 & 0.418900 & 6M 2wk+26AVG & 0.374500 & 6M 2wk+26AVG \\ -12 & 0.450800 & 3M 2wk+26AVG & 0.384100 & 2 weeks Now \\ -13 & 0.488800 & 2 weeks Now & 0.398900 & 3M 2wk+26AVG \\ -14 & 0.517900 & Now 2wk+7AVG & 0.409300 & Now 2wk+7AVG \\ -15 & 0.559200 & Now 2wk+13AVG & 0.453000 & Now 2wk+13AVG \\ -16 & 0.586000 & Now 2wk+26AVG & 0.484100 & Now 2wk+26AVG \\ -\hline -\end{tabular} - -\end{center} -\end{table} - - -\begin{table}[htb] - - \caption{Training and validation with time-based hyperparameters - sorted by NNSE accuracy. The table includes the best two - values highlighted in the training and validation results to - showcase the accuracy of the validation. 
In the validation, - we see that the best value for training is in rank four for the - validation. The number of Epochs for this experiment is 70.} - \label{tab:training-70} - - \renewcommand{\arraystretch}{1.2} -\begin{center} -\begin{tabular}{|r|rl||rl|} - \hline -{\bf Rank} & \multicolumn{2}{c||}{\bfseries Training} & \multicolumn{2}{c|}{\bfseries Validation} \\ - & {\bf NNSE} & {\bf Hyperparameters} & {\bf NNSE} & {\bf Hyperparameters} \\ - \hline - 1 & \color{red} 0.067400 & \color{red} 3 Months Back & \color{red}0.069800 & \color{red} 3 Months Back \\ - 2 & \color{blue} 0.073500 & \color{blue} Year Back & \color{blue} 0.071200 & \color{blue} Year Back \\ - 3 & 0.083100 & 1Y 2wk+7AVG & 0.084300 & 1Y 2wk+7AVG \\ - 4 & 0.105300 & 6 Months Back & 0.102200 & 6 Months Back \\ - 5 & 0.138400 & 6M 2wk+7AVG & 0.133700 & 6M 2wk+7AVG \\ - 6 & 0.153500 & 1Y 2wk+13AVG & 0.142800 & 1Y 2wk+13AVG \\ - 7 & 0.252100 & 6M 2wk+13AVG & 0.235400 & 6M 2wk+13AVG \\ - 8 & 0.295900 & 6M 2wk+26AVG & 0.269700 & 6M 2wk+26AVG \\ - 9 & 0.318800 & 1Y 2wk+26AVG & 0.291100 & 3M 2wk+7AVG \\ -10 & 0.335400 & 3M 2wk+7AVG & 0.293500 & 1Y 2wk+26AVG \\ -11 & 0.385200 & 3M 2wk+13AVG & 0.333000 & 3M 2wk+13AVG \\ -12 & 0.421000 & 3M 2wk+26AVG & 0.344500 & 2 weeks Now \\ -13 & 0.425700 & 2 weeks Now & 0.359400 & Now 2wk+7AVG \\ -14 & 0.441300 & Now 2wk+7AVG & 0.370700 & 3M 2wk+26AVG \\ -15 & 0.465800 & Now 2wk+13AVG & 0.385800 & Now 2wk+13AVG \\ -16 & 0.490400 & Now 2wk+26AVG & 0.412500 & Now 2wk+26AVG \\ -\hline -\end{tabular} -\end{center} - -\end{table} - - -\section{Energy} - -\begin{figure}[htb] - - \begin{center} - \begin{minipage}[t]{0.30\textwidth} - \includegraphics[width=1.0\linewidth]{images/card-name-v100-gpu-count-1-cpu-num-6-mem-32gb-repeat-1-tfttransformerepochs-2.png} - {\bf (A)} Energy consumption for 2 epochs training and validation. - \end{minipage} - \ \ - \begin{minipage}[t]{0.30\textwidth} - \includegraphics[width=1.0\linewidth]{images/card-name-v100-gpu-count-1-cpu-num-6-mem-32gb-repeat-1-tfttransformerepochs-30.png} - {\bf (B)} Energy consumption for 30 epochs training and validation. - \end{minipage} - \ \ - \begin{minipage}[t]{0.30\textwidth} - \includegraphics[width=1.0\linewidth]{images/card-name-v100-gpu-count-1-cpu-num-6-mem-32gb-repeat-1-tfttransformerepochs-70.png} - {\bf (C)} Energy consumption for 70 epochs training and validation. - \end{minipage} - \end{center} - - \caption {Energy monitoring for 2, 30, and 70 epochs for training and validation.} - \label{fig:energy} - -\end{figure} - -\begin{figure}[htb] - - \begin{center} - \begin{minipage}[t]{0.50\textwidth} - \includegraphics[width=1.0\linewidth]{images/NNSE-all-epochs-training} - {\bf (A)} NNSE for training. - \end{minipage} - \end{center} - \ \ - \begin{center} - \begin{minipage}[t]{0.50\textwidth} - \includegraphics[width=1.0\linewidth]{images/NNSE-all-epochs-validation} - {\bf (B)} NNSE for validation. - \end{minipage} - \end{center} - - \caption {NNSE comparison} - \label{fig:NNSE-comparison} - -\end{figure} - \end{document}