Commit: update contents

laszewsk committed Feb 15, 2023
1 parent 759ddeb commit 033e238

Showing 7 changed files with 557 additions and 469 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,6 +1,8 @@
.#*
~*
.DS_Store
*.aux
*.toc
*.bbl
*.blg
*.fdb_latexmk
121 changes: 121 additions & 0 deletions section-data.tex
@@ -0,0 +1,121 @@
\section{Insights into Data Management}

In data management we are currently concerned with various aspects of
the data set, its compression and storage, as well as the data access
speed. We discuss insights into each of them in the next sections.

\subsection{Data Sets}

When dealing with datasets we typically encounter several issues.
These issues are addressed by the MLCommons benchmarks and data
management activities so that they provide ideal candidates for
education without spending an exorbitant amount of time on data. Such
issues typically include access to data without privacy restrictions,
data preprocessing that makes the data suitable for deep learning, and
data labeling in case they are part of a well-defined MLCommons
benchmark. Other issues include data bias, noisy or missing data, as
well as overfitting while using training data. Typically, the
MLCommons benchmarks are designed to have no such issues, or only
minimal ones. However, some benchmarks, such as the science group
benchmarks, which are concerned with improving the science, will
potentially have to address these issues in order to improve the
accuracy. This could even include injecting new data and different
preprocessing methods.


\subsection{Data Compression}

An issue that is of utmost importance, especially for large data sets,
is how the data is represented. For example, for the earthquake
benchmark we found that the original dataset was 11GB big. However, we
found that the data can easily be compressed by a factor of 100. This
is significant, as in this case the entire dataset can, for example,
be stored in GitHub. The compressed xz archive file is only 21MB, and
downloading only the archive file using wget takes 0.253s. In case the
dataset and its repository are downloaded with Git, we note that the
entire Git repository is 108MB~\citep{mlcommons-earthquake-data}.
Downloading this compressed dataset takes only 7.723s. Thus it is
preferred to download just the explicitly used data, for example with
wget. In both cases the data is compressed. Uncompressing the data
takes an additional 1 minute and 2.522 seconds. However, if we were to
download the data in uncompressed form, it would take approximately
3 hours and 51 seconds.
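
The following Python sketch illustrates this kind of timing
measurement; the URL and file names are placeholders and not the
actual data location.

{\footnotesize
\begin{verbatim}
# Sketch: time the download and decompression of the compressed
# archive. URL and file names are placeholders.
import tarfile, time, urllib.request

URL = "https://example.org/earthquake-data.tar.xz"  # placeholder
ARCHIVE = "earthquake-data.tar.xz"

start = time.time()
urllib.request.urlretrieve(URL, ARCHIVE)
print(f"download:   {time.time() - start:.3f}s")

start = time.time()
with tarfile.open(ARCHIVE, mode="r:xz") as tar:
    tar.extractall("data")
print(f"decompress: {time.time() - start:.3f}s")
\end{verbatim}
}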

From this simple example it is clear that MLCommons benchmarks can
provide insights into how data is managed and delivered to, for
example, large-scale compute clusters with many nodes, while utilizing
compression algorithms. We will next discuss insights into
infrastructure management while using filesystems in HPC resources.
While object stores are often discussed for hosting such large
datasets, it is imperative to identify the units of storage in such
object stores. In our case an object store that hosts individual data
records is not useful due to the vast number of data points.
Therefore, the best way to store this data, even in an object store,
is as a single entry of the compressed overall data.


\subsection{Data Access}

Besides having proper data and being able to download it efficiently
from the location of storage, it is imperative to be able to access it
in such a way that the GPUs used for deep learning are fed with enough
data without being idle. The performance results were somewhat
surprising and had a devastating effect on the overall execution time:
runs were twice as fast on a personal computer using an RTX3090, in
contrast to using the HPC center's recommended filesystems with an
A100. For this reason we implemented a simple test and measured the
read performance of the various file systems. The results are shown in
Table~\ref{tab:file-performance}, which includes various file systems
at the University of Virginia's Rivanna HPC as well as a comparison
with a student's personal computer.
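
Such numbers can be obtained with a simple sequential-read probe like
the following Python sketch; the file path and block size are
illustrative assumptions, not the exact script we used.

{\footnotesize
\begin{verbatim}
# Sketch of a sequential read-bandwidth probe; the path and block
# size are illustrative assumptions.
import os, time

def read_bandwidth(path, block=16 * 1024 * 1024):
    size = os.path.getsize(path)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(block):
            pass
    return size / (time.time() - start) / 1e6  # MB/s

print(read_bandwidth("/scratch/user/earthquake-data.tar.xz"))
\end{verbatim}
}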

Based on this observation it was infeasible to consider running the
earthquake benchmark on the regularly configured HPC nodes, as on some
resources it ran for almost 24 hours. This is also the limit the
Rivanna system allows for one job. Hence we were allowed to use a
special compute node that has additional NVMe storage available and
accessible to us. On those nodes (listed in the table as
\verb|/localscratch|) we were able to obtain a very suitable
performance for this application, with a tenfold increase in access
speed in contrast to the scratch file system and almost double the
performance given to us on the project file system. The /tmp file
system, although being fast, was not sufficiently large for our
application and also performed slower than the \verb|/localscratch|
set up for us. In addition, we also made an experiment using a
shared-memory-based filesystem hosted in the node's RAM.


What we learn from this experience is that an HPC system must provide
a fast file system locally available on the nodes to serve the GPUs
adequately. The computer should be designed from the start to not only
have the fastest possible GPUs for large data processing, but also a
very fast filesystem that can keep up with the data input requirements
presented by the GPUs. Furthermore, in case updated GPUs are
purchased, it is not sufficient to just take the previous generation
motherboard, CPU, and memory; the other hardware components must be
updated as well to constitute a state-of-the-art compute node. This
often prevents the repurposing of a node by merely adding new GPUs.

\begin{table}[htb]
\caption{File transfer performance of various file systems on Rivanna}
\label{tab:file-performance}
\begin{center}
{\footnotesize
\begin{tabular}{lllllp{4.5cm}}
Machine & File system & \multicolumn{2}{l}{Bandwidth Performance} & Speedup & Description \\
\hline
Rivanna & \verb|/scratch/$USER (sbatch)| & 30.6MiB/s & 32.1MB/s & 1.0 & shared scratch space when running in batch mode \\
Rivanna & \verb|/scratch/$USER (interactive)| & 33.2MiB/s & 34.8MB/s & 1.1 & shared scratch space when running interactively \\
Rivanna & \verb|/home/$USER| & 40.9MiB/s & 42.9MB/s & 1.3 & user's home directory \\
Rivanna & \verb|/project/$PROJECTID| & 100MiB/s & 105MB/s & 3.3 & project-specific filesystem \\
Personal Computer & \verb|c:| & 187MiB/s & 196MB/s & 6.1 & file system on a personal computer \\
Rivanna & \verb|/tmp| & 271MiB/s & 285MB/s & 8.9 & temporary file system on a node \\
\hline
Selected Nodes Rivanna & \verb|/localscratch| & 384MiB/s & 403MB/s & 12.6 & special access to NVMe storage of a special node in the cluster \\
RAM disk Rivanna & \verb|/dev/shm/*| & 461MiB/s & 483MB/s & 15.1 & simulated filesystem in a RAM disk \\
\hline
\end{tabular}
}
\end{center}
\end{table}

119 changes: 58 additions & 61 deletions section-dev.tex
@@ -1,61 +1,58 @@
\section{Insights into Development from the Earthquake Code}

The original code was developed by a single researcher with the goal
to create a DL method called tvelop to apply spatial timeseries
evolution for multiple applications including earthquake, hydrology,
and COVID prediction. The code was developed in a large Python Jupyter
notebook on Google Colab. The total number of lines of code was
\TODO{line number}. The code included all definitions of variables and
hyperparameters in the code itself.


difficult to maintain and understand for others

easy to develop by author, many experimental aspects

all variables defined in code, not in a config file

lots of graphic outputs for interactive development

How many lines of code??


no use of libraries
limited use of functions
if conditions for different science applications


large code is too difficult to maintain in Colab

papermill

mlcommons focus on one science application at a time

students can not comprehend code

rewritten code to just focus on earthquake

rewritten code to add selected hyperparameters into a configuration file


setup

for

training

validation

comparing output

not much use of libraries

choices

development of multiple runs based on variation of additional time-based internal hyperparameters,
--> long runtime, no changes to evaluation section in code

take these parameters out and place them in a configuration file
-> multiple runs needed and comparison has to be separated from the program; lots of changes to the program, program will run shorter,


libraries for mlcommons benchmarking, cloudmesh
portable way to define data locations via config
experiment permutation over hyperparameters.
* repeated experiments
* separate evaluation and comparison of accuracy, which was not in the original code.
* comparison of accuracy across different hyperparameter searches.
\subsection{Insights into Development from the Earthquake Code}

The original code was developed with the goal to create a DL method
called {\em tevelop} to apply spatial timeseries evolution for
multiple applications including earthquake, hydrology, and COVID
prediction. The code was presented in a large Python Jupyter notebook
on Google Colab. Due to the integration of multiple applications the
code was difficult to understand and maintain. For this reason the
total of 13,500 lines of code was reduced by more than 2,400 lines
when the hydrology and the COVID code were removed. However, at the
same time we restructured the code and reached a final length of about
11,100 lines of code. The original code contained all hyperparameters
and needed to be changed every time a hyperparameter was modified. The
code included all definitions of variables and hyperparameters in the
code itself.

As we can see from this, the code has some major issues that future
versions ought to address. First, the code includes every aspect that
is not covered by TensorFlow and also contains a customized version of
TFT. Second, due to this the code is very large, and manipulating and
editing the code is time consuming and error prone. Third, as many
code-related parameters are still managed in the code, running the
same code with various parameters becomes cumbersome. In fact,
multiple copies of the code need to be maintained when new parameters
are chosen, instead of making such parameters part of a configuration
file. Hence we started moving towards the simplification of the code
by introducing the concept of libraries that can be pip installed, as
well as gradually adding more parameters to a configuration file that
is used by the program.
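
A minimal sketch of this configuration-driven approach is shown below;
the file name and keys are illustrative assumptions and not the
benchmark's actual schema.

{\footnotesize
\begin{verbatim}
# Sketch: read hyperparameters from a YAML configuration file instead
# of hard-coding them; file name and keys are illustrative.
import yaml

with open("config.yaml") as stream:
    config = yaml.safe_load(stream)

epochs   = config["run"]["epochs"]
gpu      = config["run"]["gpu"]        # e.g. "a100", "v100", "p100"
data_dir = config["data"]["directory"]
\end{verbatim}
}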

The advantage of using a notebook is that it can be augmented with
lots of graphs that give in-situ updates on the progress and its
measured accuracy. It is infeasible for students to use and replicate
the run of this notebook, as the runtime can be up to two days.
Students certainly have to use their computers for other things and
need to be able to use them on the go. Often HPC centers provide
interactive jobs in the batch queues, but also here this is not
sufficient. Instead, we adapted the Jupyter notebook to run in full
batch mode under the HPC queuing system by generating a special batch
script that internally uses papermill to execute the notebook in the
background. Papermill will also include all cells that have to be
updated during runtime, including graphics. The script we developed,
however, needed to be run multiple times and with different
hyperparameters, such as the number of epochs, to give just one
example. As the HPC system is a heterogeneous GPU system with access
to A100, V100, P100, and RTX2080 GPUs, the choice of the GPU system
must be configurable. Hence the batch script includes the ability to
also read in the configuration file and adapt itself to the needed
parameters. This is controlled by a sophisticated but simple batch job
generator, which we discuss in a later section.
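
A minimal sketch of such a papermill-driven run is shown below; the
notebook names and parameter values are placeholders, not the actual
benchmark files.

{\footnotesize
\begin{verbatim}
# Sketch: execute the notebook unattended with papermill, injecting
# hyperparameters; notebook names and values are placeholders.
import papermill as pm

pm.execute_notebook(
    "earthquake.ipynb",          # placeholder input notebook
    "earthquake-output.ipynb",   # executed copy incl. updated graphics
    parameters={"epochs": 30, "gpu": "a100"},
)
\end{verbatim}
}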



%libraries for mlcommons benchmarking, cloudmesh
%portable way to define data locations via config
%experiment permutation over hyperparameters.
%* repeated experiments
%* separate evaluation and comparison of accuracy which was not in original code.
%* comparison of accuracy across different hyperparameter searches.
20 changes: 10 additions & 10 deletions section-earthquake.tex
@@ -8,8 +8,8 @@ \section{Earthquake Forecasting}
forecasting methods rely on statistical techniques, we use ML
for extracting the evolution and testing the effectiveness of the
forecast. As a metric, we use the Nash-Sutcliffe Efficiency (NSE)
\cite{nash-79}. Other qualitative predictions are discussed in
~\cite{fox2022-jm}.
\citep{nash-79}. Other qualitative predictions are discussed in
~\citep{fox2022-jm}.
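
For reference, we recall the standard formulation of the NSE (stated
here as the common definition, not taken from the benchmark code):
\begin{equation}
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T} (o_t - m_t)^2}{\sum_{t=1}^{T} (o_t - \bar{o})^2},
\end{equation}
where $o_t$ are the observed values, $m_t$ the forecasts, and
$\bar{o}$ the mean of the observations; a value of 1 indicates a
perfect forecast.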

One of the common tasks when dealing with time series is the ability
to predict or forecast them in advance. Time series capture the
@@ -30,7 +30,7 @@ \section{Earthquake Forecasting}
\subsection{Earthquake Data}

The data for this earthquake is described in
\cite{las-22-mlcommons-science}. It uses a subset of the earthquake
\citep{las-22-mlcommons-science}. It uses a subset of the earthquake
data from the United States Geological Survey (USGS) focused on Southern
California between latitude: $32^\circ$N to $36^\circ$N and longitude:
$-120^\circ$ to $-114^\circ$. The data for this region covers all
@@ -48,7 +48,7 @@ \subsection{Earthquake Data}
from a fortnight up to four years. Furthermore, we calculate summed
magnitudes and depths and counts of significant quakes (magnitude $<
3.29$).'' Table~\ref{tab:eq-summary} depicts the key features of the
benchmark \cite{las-22-mlcommons-science}.
benchmark \citep{las-22-mlcommons-science}.


\begin{table}
@@ -58,7 +58,7 @@ \subsection{Earthquake Data}
{\footnotesize
\begin{tabular}{p{0.2\columnwidth}p{0.2\columnwidth}p{0.45\columnwidth}}
\hline
{\bf Area} & \multicolumn{2}{l}{Earthquake Forecasting~\cite{fox2022-jm,TFT-21,eq-code,eq-data}.}\\
{\bf Area} & \multicolumn{2}{l}{Earthquake Forecasting~\citep{fox2022-jm,TFT-21,eq-code,eq-data}.}\\
\hline
{\bf Objectives} & \multicolumn{2}{l}{Improve the quality of Earthquake
forecasting in a region of Southern California.}\\
@@ -70,9 +70,9 @@ \subsection{Earthquake Data}
& Size: & 11.3GB (Uncompressed), 21.3MB (Compressed)\\
& Training samples: & 2,400 spatial bins\\
& Validation samples: & 100 spatial bins\\
& Source: & USGS Servers~\cite{eq-data}\\
& Source: & USGS Servers~\citep{eq-data}\\
\hline
{\bf Reference Implementation} & \cite{eq-code} & \\
{\bf Reference Implementation} & \citep{eq-code} & \\
% \hline
\hline
\end{tabular}
@@ -87,14 +87,14 @@ \subsection{Implementation}
The reference implementation of the benchmark includes three
distinct deep learning-based reference implementations. These are Long
short-term memory (LSTM)-based model, Google Temporal Fusion
Transformer (TFT)~\cite{TFT-21}-based model and a custom hybrid
Transformer (TFT)~\citep{TFT-21}-based model and a custom hybrid
transformer model. The TFT-based model uses two distinct LSTMs,
covering an encoder and a decoder with a temporal attention-based
transformer. The custom model includes a space-time transformer for
the Decoder and a two-layer LSTM for the encoder. Each model predicts
NSE and generates visualizations illustrating the TFT for
interpretable multi-horizon time series
forecasting~\cite{TFT-21}. Details of the current reference models can
be found in~\cite{fox2022-jm}. In this paper, we only focus on the
forecasting~\citep{TFT-21}. Details of the current reference models can
be found in~\citep{fox2022-jm}. In this paper, we only focus on the
LSTM implementation.
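
As a purely illustrative aid, the following Keras sketch shows a
minimal LSTM-based multi-step forecaster; it is not the reference
implementation, and the window, feature, and horizon sizes are
placeholder assumptions.

{\footnotesize
\begin{verbatim}
# Illustrative minimal LSTM forecaster; NOT the benchmark's reference
# implementation. Shapes are placeholder assumptions.
import tensorflow as tf

window, features, horizon = 26, 9, 8   # placeholders

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, features)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(horizon),    # multi-step forecast head
])
model.compile(optimizer="adam", loss="mse")
\end{verbatim}
}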

