Commit: update contents

laszewsk committed Feb 15, 2023
1 parent 759ddeb commit 033e238

Showing 7 changed files with 557 additions and 469 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,6 +1,8 @@
.#*
~*
.DS_Store
*.aux
*.toc
*.bbl
*.blg
*.fdb_latexmk
121 changes: 121 additions & 0 deletions section-data.tex
@@ -0,0 +1,121 @@
\section{Insights into Data Management}

In data management we are currently concerned with various aspects of
the data set, its compression and storage, as well as the data access
speed. We discuss insights into each of them in the next sections.

\subsection{Data Sets}

When dealing with datasets we typically encounter several issues.
These issues are addressed by the MLCommons benchmarks and data
management activities so that they provide ideal candidates for
education without spending an exorbitant amount of time on data. Such
issues typically include access to data without privacy restrictions,
data preprocessing that makes the data suitable for deep learning, and
data labeling in case they are part of a well-defined MLCommons
benchmark. Other issues include data bias, noisy or missing data, as
well as overfitting while using training data. Typically, the
MLCommons benchmarks are designed to have no such issues, or only
minimal ones. However, some benchmarks, such as the science group
benchmarks, which are concerned with improving the science, will
potentially have to address these issues in order to improve the
accuracy. This could even include injecting new data and different
preprocessing methods.


\subsection{Data Compression}

An issue that is of utmost importance, especially for large data sets,
is how the data is represented. For example, for the earthquake
benchmark we found that the original dataset was 11GB big. However, we
found that the data can easily be compressed by a factor of 100. This
is significant, as in this case the entire dataset can, for example,
be stored in GitHub. The compressed xz archive file is only 21MB, and
downloading only the archive file using wget takes 0.253s. In case the
dataset and its repository are downloaded with Git, we note that the
entire Git repository is 108MB~\citep{mlcommons-earthquake-data}.
Downloading this compressed dataset takes only 7.723s. Thus it is
preferred to download just the explicitly used data, for example with
wget. In both cases the data is compressed. Uncompressing the data
takes an additional 1 minute and 2.522 seconds. However, if we were to
download the data in uncompressed form, it would take approximately
3 hours and 51 seconds.
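
The following Python sketch illustrates this kind of timing
measurement; the URL and file names are placeholders and not the
actual data location.

{\footnotesize
\begin{verbatim}
# Sketch: time the download and decompression of the compressed
# archive. URL and file names are placeholders.
import tarfile, time, urllib.request

URL = "https://example.org/earthquake-data.tar.xz"  # placeholder
ARCHIVE = "earthquake-data.tar.xz"

start = time.time()
urllib.request.urlretrieve(URL, ARCHIVE)
print(f"download:   {time.time() - start:.3f}s")

start = time.time()
with tarfile.open(ARCHIVE, mode="r:xz") as tar:
    tar.extractall("data")
print(f"decompress: {time.time() - start:.3f}s")
\end{verbatim}
}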

From this simple example it is clear that MLCommons benchmarks can
provide insights into how data is managed and delivered to, for
example, large-scale compute clusters with many nodes, while utilizing
compression algorithms. We will next discuss insights into
infrastructure management while using filesystems in HPC resources.
While object stores are often discussed for hosting such large
datasets, it is imperative to identify the units of storage in such
object stores. In our case an object store that hosts individual data
records is not useful due to the vast number of data points.
Therefore, the best way to store this data, even in an object store,
is as a single entry of the compressed overall data.


\subsection{Data Access}

Besides having proper data and being able to download it efficiently
from the location of storage, it is imperative to be able to access it
in such a way that the GPUs used for deep learning are fed with enough
data without being idle. The performance results were somewhat
surprising and had a devastating effect on the overall execution time:
runs were twice as fast on a personal computer using an RTX3090, in
contrast to using the HPC center's recommended filesystems with an
A100. For this reason we implemented a simple test and measured the
read performance of the various file systems. The results are shown in
Table~\ref{tab:file-performance}, which includes various file systems
at the University of Virginia's Rivanna HPC as well as a comparison
with a student's personal computer.
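
Such numbers can be obtained with a simple sequential-read probe like
the following Python sketch; the file path and block size are
illustrative assumptions, not the exact script we used.

{\footnotesize
\begin{verbatim}
# Sketch of a sequential read-bandwidth probe; the path and block
# size are illustrative assumptions.
import os, time

def read_bandwidth(path, block=16 * 1024 * 1024):
    size = os.path.getsize(path)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(block):
            pass
    return size / (time.time() - start) / 1e6  # MB/s

print(read_bandwidth("/scratch/user/earthquake-data.tar.xz"))
\end{verbatim}
}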

Based on this observation it was infeasible to consider running the
earthquake benchmark on the regularly configured HPC nodes, as on some
resources it ran for almost 24 hours. This is also the limit the
Rivanna system allows for one job. Hence we were allowed to use a
special compute node that has additional NVMe storage available and
accessible to us. On those nodes (listed in the table as
\verb|/localscratch|) we were able to obtain a very suitable
performance for this application, with a tenfold increase in access
speed in contrast to the scratch file system and almost double the
performance given to us on the project file system. The /tmp file
system, although being fast, was not sufficiently large for our
application and also performed slower than the \verb|/localscratch|
set up for us. In addition, we also made an experiment using a
shared-memory-based filesystem hosted in the node's RAM.


What we learn from this experience is that an HPC system must provide
a fast file system locally available on the nodes to serve the GPUs
adequately. The computer should be designed from the start to not only
have the fastest possible GPUs for large data processing, but also a
very fast filesystem that can keep up with the data input requirements
presented by the GPUs. Furthermore, in case updated GPUs are
purchased, it is not sufficient to just take the previous generation
motherboard, CPU, and memory; the other hardware components must be
updated as well to constitute a state-of-the-art compute node. This
often prevents the repurposing of a node by merely adding new GPUs.

\begin{table}[htb]
\caption{File transfer performance of various file systems on Rivanna}
\label{tab:file-performance}
\begin{center}
{\footnotesize
\begin{tabular}{lllllp{4.5cm}}
Machine & File system & \multicolumn{2}{l}{Bandwidth Performance} & Speedup & Description \\
\hline
Rivanna & \verb|/scratch/$USER (sbatch)| & 30.6MiB/s & 32.1MB/s & 1.0 & shared scratch space when running in batch mode \\
Rivanna & \verb|/scratch/$USER (interactive)| & 33.2MiB/s & 34.8MB/s & 1.1 & shared scratch space when running interactively \\
Rivanna & \verb|/home/$USER| & 40.9MiB/s & 42.9MB/s & 1.3 & user's home directory \\
Rivanna & \verb|/project/$PROJECTID| & 100MiB/s & 105MB/s & 3.3 & project-specific filesystem \\
Personal Computer & \verb|c:| & 187MiB/s & 196MB/s & 6.1 & file system on a personal computer \\
Rivanna & \verb|/tmp| & 271MiB/s & 285MB/s & 8.9 & temporary file system on a node \\
\hline
Selected Nodes Rivanna & \verb|/localscratch| & 384MiB/s & 403MB/s & 12.6 & special access to NVMe storage of a special node in the cluster \\
RAM disk Rivanna & \verb|/dev/shm/*| & 461MiB/s & 483MB/s & 15.1 & simulated filesystem in a RAM disk \\
\hline
\end{tabular}
}
\end{center}
\end{table}

119 changes: 58 additions & 61 deletions section-dev.tex
@@ -1,61 +1,58 @@
\section{Insights into Development from the Earthquake Code}

The original code was developed by a single researcher with the goal
to create a DL method called tvelop to apply spatial timeseries
evolution for multiple applications including earthquake, hydrology,
and COVID prediction. The code was developed in a large Python Jupyter
notebook on Google Colab. The total number of lines of code was
\TODO{line number}. The code included all definitions of variables and
hyperparameters in the code itself.


difficult to maintain and understand for others

easy to develop by author, many experimental aspects

all variables defined in code, not in a config file

lots of graphic outputs for interactive development

How many lines of code??


no use of libraries
limited use of functions
if conditions for different science applications


large code is too difficult to maintain in Colab

papermill

mlcommons focus on one science application at a time

students can not comprehend code

rewritten code to just focus on earthquake

rewritten code to add selected hyperparameters into a configuration file


setup

for

training

validation

comparing output

not much use of libraries

choices

development of multiple runs based on variation of additional time-based internal hyperparameters,
--> long runtime, no changes to evaluation section in code

take these parameters out and place them in a configuration file
-> multiple runs needed and comparison has to be separated from the program; lots of changes to the program, program will run shorter,


libraries for mlcommons benchmarking, cloudmesh
portable way to define data locations via config
experiment permutation over hyperparameters.
* repeated experiments
* separate evaluation and comparison of accuracy, which was not in the original code.
* comparison of accuracy across different hyperparameter searches.
\subsection{Insights into Development from the Earthquake Code}

The original code was developed with the goal to create a DL method
called {\em tevelop} to apply spatial timeseries evolution for
multiple applications including earthquake, hydrology, and COVID
prediction. The code was presented in a large Python Jupyter notebook
on Google Colab. Due to the integration of multiple applications the
code was difficult to understand and maintain. For this reason the
total of 13,500 lines of code was reduced by more than 2,400 lines
when the hydrology and the COVID code were removed. However, at the
same time we restructured the code and reached a final length of about
11,100 lines of code. The original code contained all hyperparameters
and needed to be changed every time a hyperparameter was modified. The
code included all definitions of variables and hyperparameters in the
code itself.

As we can see from this, the code has some major issues that future
versions ought to address. First, the code includes every aspect that
is not covered by TensorFlow and also contains a customized version of
TFT. Second, due to this the code is very large, and manipulating and
editing the code is time consuming and error prone. Third, as many
code-related parameters are still managed in the code, running the
same code with various parameters becomes cumbersome. In fact,
multiple copies of the code need to be maintained when new parameters
are chosen, instead of making such parameters part of a configuration
file. Hence we started moving towards the simplification of the code
by introducing the concept of libraries that can be pip installed, as
well as gradually adding more parameters to a configuration file that
is used by the program.
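
A minimal sketch of this configuration-driven approach is shown below;
the file name and keys are illustrative assumptions and not the
benchmark's actual schema.

{\footnotesize
\begin{verbatim}
# Sketch: read hyperparameters from a YAML configuration file instead
# of hard-coding them; file name and keys are illustrative.
import yaml

with open("config.yaml") as stream:
    config = yaml.safe_load(stream)

epochs   = config["run"]["epochs"]
gpu      = config["run"]["gpu"]        # e.g. "a100", "v100", "p100"
data_dir = config["data"]["directory"]
\end{verbatim}
}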

The advantage of using a notebook is that it can be augmented with
lots of graphs that give in-situ updates on the progress and its
measured accuracy. It is infeasible for students to use and replicate
the run of this notebook, as the runtime can be up to two days.
Students certainly have to use their computers for other things and
need to be able to use them on the go. Often HPC centers provide
interactive jobs in the batch queues, but also here this is not
sufficient. Instead, we adapted the Jupyter notebook to run in full
batch mode under the HPC queuing system by generating a special batch
script that internally uses papermill to execute the notebook in the
background. Papermill will also include all cells that have to be
updated during runtime, including graphics. The script we developed,
however, needed to be run multiple times and with different
hyperparameters, such as the number of epochs, to give just one
example. As the HPC system is a heterogeneous GPU system with access
to A100, V100, P100, and RTX2080 GPUs, the choice of the GPU system
must be configurable. Hence the batch script includes the ability to
also read in the configuration file and adapt itself to the needed
parameters. This is controlled by a sophisticated but simple batch job
generator, which we discuss in a later section.
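
A minimal sketch of such a papermill-driven run is shown below; the
notebook names and parameter values are placeholders, not the actual
benchmark files.

{\footnotesize
\begin{verbatim}
# Sketch: execute the notebook unattended with papermill, injecting
# hyperparameters; notebook names and values are placeholders.
import papermill as pm

pm.execute_notebook(
    "earthquake.ipynb",          # placeholder input notebook
    "earthquake-output.ipynb",   # executed copy incl. updated graphics
    parameters={"epochs": 30, "gpu": "a100"},
)
\end{verbatim}
}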



%libraries for mlcommons benchmarking, cloudmesh
%portable way to define data locations via config
%experiment permutation over hyperparameters.
%* repeated experiments
%* separate evaluation and comparison of accuracy which was not in original code.
%* comparison of accuracy across different hyperparameter searches.
20 changes: 10 additions & 10 deletions section-earthquake.tex
@@ -8,8 +8,8 @@ \section{Earthquake Forecasting}
forecasting methods rely on statistical techniques, we use ML
for extracting the evolution and testing the effectiveness of the
forecast. As a metric, we use the Nash-Sutcliffe Efficiency (NSE)
\cite{nash-79}. Other qualitative predictions are discussed in
~\cite{fox2022-jm}.
\citep{nash-79}. Other qualitative predictions are discussed in
~\citep{fox2022-jm}.
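
For reference, we recall the standard formulation of the NSE (stated
here as the common definition, not taken from the benchmark code):
\begin{equation}
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T} (o_t - m_t)^2}{\sum_{t=1}^{T} (o_t - \bar{o})^2},
\end{equation}
where $o_t$ are the observed values, $m_t$ the forecasts, and
$\bar{o}$ the mean of the observations; a value of 1 indicates a
perfect forecast.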

One of the common tasks when dealing with time series is the ability
to predict or forecast them in advance. Time series capture the
@@ -30,7 +30,7 @@ \section{Earthquake Forecasting}
\subsection{Earthquake Data}

The data for this earthquake is described in
\cite{las-22-mlcommons-science}. It uses a subset of the earthquake
\citep{las-22-mlcommons-science}. It uses a subset of the earthquake
data from the United States Geological Survey (USGS) focused on Southern
California between latitude: $32^\circ$N to $36^\circ$N and longitude:
$-120^\circ$ to $-114^\circ$. The data for this region covers all
@@ -48,7 +48,7 @@ \subsection{Earthquake Data}
from a fortnight up to four years. Furthermore, we calculate summed
magnitudes and depths and counts of significant quakes (magnitude $<
3.29$).'' Table~\ref{tab:eq-summary} depicts the key features of the
benchmark \cite{las-22-mlcommons-science}.
benchmark \citep{las-22-mlcommons-science}.


\begin{table}
@@ -58,7 +58,7 @@ \subsection{Earthquake Data}
{\footnotesize
\begin{tabular}{p{0.2\columnwidth}p{0.2\columnwidth}p{0.45\columnwidth}}
\hline
{\bf Area} & \multicolumn{2}{l}{Earthquake Forecasting~\cite{fox2022-jm,TFT-21,eq-code,eq-data}.}\\
{\bf Area} & \multicolumn{2}{l}{Earthquake Forecasting~\citep{fox2022-jm,TFT-21,eq-code,eq-data}.}\\
\hline
{\bf Objectives} & \multicolumn{2}{l}{Improve the quality of Earthquake
forecasting in a region of Southern California.}\\
@@ -70,9 +70,9 @@ \subsection{Earthquake Data}
& Size: & 11.3GB (Uncompressed), 21.3MB (Compressed)\\
& Training samples: & 2,400 spatial bins\\
& Validation samples: & 100 spatial bins\\
& Source: & USGS Servers~\cite{eq-data}\\
& Source: & USGS Servers~\citep{eq-data}\\
\hline
{\bf Reference Implementation} & \cite{eq-code} & \\
{\bf Reference Implementation} & \citep{eq-code} & \\
% \hline
\hline
\end{tabular}
@@ -87,14 +87,14 @@ \subsection{Implementation}
The reference implementation of the benchmark includes three
distinct deep learning-based reference implementations. These are Long
short-term memory (LSTM)-based model, Google Temporal Fusion
Transformer (TFT)~\cite{TFT-21}-based model and a custom hybrid
Transformer (TFT)~\citep{TFT-21}-based model and a custom hybrid
transformer model. The TFT-based model uses two distinct LSTMs,
covering an encoder and a decoder with a temporal attention-based
transformer. The custom model includes a space-time transformer for
the Decoder and a two-layer LSTM for the encoder. Each model predicts
NSE and generates visualizations illustrating the TFT for
interpretable multi-horizon time series
forecasting~\cite{TFT-21}. Details of the current reference models can
be found in~\cite{fox2022-jm}. In this paper, we only focus on the
forecasting~\citep{TFT-21}. Details of the current reference models can
be found in~\citep{fox2022-jm}. In this paper, we only focus on the
LSTM implementation.
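
As a purely illustrative aid, the following Keras sketch shows a
minimal LSTM-based multi-step forecaster; it is not the reference
implementation, and the window, feature, and horizon sizes are
placeholder assumptions.

{\footnotesize
\begin{verbatim}
# Illustrative minimal LSTM forecaster; NOT the benchmark's reference
# implementation. Shapes are placeholder assumptions.
import tensorflow as tf

window, features, horizon = 26, 9, 8   # placeholders

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, features)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(horizon),    # multi-step forecast head
])
model.compile(optimizer="adam", loss="mse")
\end{verbatim}
}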

