Showing 7 changed files with 557 additions and 469 deletions.

@@ -1,6 +1,8 @@
.#*
~*
.DS_Store
*.aux
*.toc
*.bbl
*.blg
*.fdb_latexmk

@@ -0,0 +1,121 @@
\section{Insights into Data Management}

In data management we are currently concerned with various aspects of
the data set, the data compression and storage, as well as the data
access speed. We discuss insights into each of them in the next
sections.

\subsection{Data Sets}

When dealing with datasets we typically encounter several issues.
These issues are addressed by the MLCommons benchmarks and data
management activities so that they provide ideal candidates for
education without spending an exorbitant amount of time on data. Such
issues typically include access to data without privacy restrictions,
data preprocessing that makes the data suitable for deep learning, and
data labeling in case they are part of a well-defined MLCommons
benchmark. Other issues include data bias, noisy or missing data, as
well as overfitting while using training data. Typically the MLCommons
benchmarks will be designed to have no such issues, or to have only
minimal issues. However, some benchmarks, such as the science group
benchmarks, which are concerned with improving the science, will
potentially have to address these issues in order to improve the
accuracy. This could even include injecting new data and different
preprocessing methods.

\subsection{Data Compression}

An issue that is of utmost importance, especially for large data sets,
is how the data is represented. For example, for the earthquake
benchmark we found that the original dataset was 11GB in size. However,
we found that the data can easily be compressed by a factor of 100.
This is significant, as in this case the entire dataset can, for
example, be stored in GitHub. The compressed xz archive file is only
21MB, and downloading only the archive file using wget takes 0.253s. In
case the dataset and its repository are downloaded with Git, we note
that the entire Git repository is
108MB~\citep{mlcommons-earthquake-data}. Downloading this compressed
dataset only takes 7.723s. Thus it is preferable to download just the
explicitly used data, for example with wget. In both cases the data is
compressed. Uncompressing the data takes an additional 1 minute and
2.522 seconds. However, if we were to download the data in uncompressed
form, it would take approximately 3 hours and 51 seconds.

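The following minimal Python sketch illustrates this workflow of
fetching only the compressed archive and unpacking it locally. The URL
and archive name are placeholders and not the actual location of the
benchmark data.

\begin{verbatim}
# Sketch: download a compressed .tar.xz archive and unpack it.
# The URL below is a placeholder, not the real dataset location.
import tarfile
import time
import urllib.request

URL = "https://example.org/data/earthquake-data.tar.xz"  # placeholder
ARCHIVE = "earthquake-data.tar.xz"

start = time.time()
urllib.request.urlretrieve(URL, ARCHIVE)        # download only the archive
print(f"download: {time.time() - start:.3f}s")

start = time.time()
with tarfile.open(ARCHIVE, mode="r:xz") as tar:  # xz decompression
    tar.extractall(path="data")
print(f"extract:  {time.time() - start:.3f}s")
\end{verbatim}
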
From this simple example it is clear that MLCommons benchmarks can
provide insights into how data is managed and delivered to, for
example, large scale compute clusters with many nodes, while utilizing
compression algorithms. We will next discuss insights into
infrastructure management while using filesystems in HPC resources.
While often object stores are discussed to host such large datasets, it
is imperative to identify the units of storage in such object stores.
In our case an object store that would host individual data records is
not useful due to the vast number of data points. Therefore the best
way to store this data, even in an object store, is as a single entry
of compressed overall data.

\subsection{Data Access}

Besides having proper data and being able to download it efficiently
from the location of storage, it is imperative to be able to access it
in such a way that the GPUs used for deep learning are fed with enough
data without being idle. The performance results were somewhat
surprising and had a devastating effect on the overall execution time:
runs on a personal computer using an RTX3090 were twice as fast as runs
using an A100 with the filesystems recommended by the HPC center. For
this reason we have made a simple test and measured the read
performance of the various file systems. The results are shown in
Table~\ref{tab:file-performance}, which includes various file systems
at the University of Virginia's Rivanna HPC, but also a comparison with
a personal computer from a student.

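A minimal sketch of such a read test is shown next; it simply streams a
large existing file from the filesystem under test in fixed-size chunks
and reports the effective bandwidth. The file path and chunk size are
illustrative assumptions rather than the exact parameters we used.

\begin{verbatim}
# Sketch: measure sequential read bandwidth of a filesystem.
# The test file path and chunk size are illustrative assumptions.
import os
import time

TEST_FILE = "/scratch/testfile.bin"  # large file on the filesystem under test
CHUNK = 1024 * 1024                  # read in 1 MiB chunks

size = os.path.getsize(TEST_FILE)
start = time.time()
with open(TEST_FILE, "rb") as f:
    while f.read(CHUNK):             # stream the file sequentially
        pass
elapsed = time.time() - start
print(f"{size / elapsed / 2**20:.1f} MiB/s "
      f"({size / elapsed / 1e6:.1f} MB/s)")
\end{verbatim}

Such a test only approximates the access pattern of the deep learning
code, but it is sufficient to expose the order-of-magnitude differences
reported in the table.
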
Based on this observation it was infeasible to consider running the
earthquake benchmark on the regularly configured HPC nodes, as they ran
on some resources for almost 24 hours. This is also the limit that the
Rivanna system allows for one job. Hence we were allowed to use a
special compute node that has additional NVMe storage available and
accessible to us. On those nodes (listed in the table as
\verb|/localscratch|) we were able to obtain a very suitable
performance for this application, with a ten-fold increase in access
speed in contrast to the scratch file system and almost double the
performance given to us on the project file system. The \verb|/tmp|
system, although being fast, was not sufficiently large for our
application and also performed slower than the \verb|/localscratch| set
up for us. In addition we also made an experiment using a shared-memory
based filesystem hosted in the node's RAM.

What we learn from this experience is that an HPC system must provide a
fast file system locally available on the nodes to serve the GPUs
adequately. The computer should be designed from the start to not only
have the fastest possible GPUs for large data processing, but also a
very fast filesystem that can keep up with the data input requirements
presented by the GPUs. Furthermore, in case updated GPUs are purchased,
it is not sufficient to just take the previous generation motherboard,
CPU, and memory; the other hardware components must also be updated to
obtain a state-of-the-art compute node. This often prevents the
repurposing of a node by just adding new GPUs.

\begin{table}[htb]
\caption{File transfer performance of various file systems on Rivanna}
\label{tab:file-performance}
\begin{center}
{\footnotesize
\begin{tabular}{lllllp{4.5cm}}
Machine & File system & \multicolumn{2}{l}{Bandwidth Performance} & Speedup & Description \\
\hline
Rivanna & \verb|/scratch/$USER (sbatch)| & 30.6MiB/s & 32.1MB/s & 1.0 & shared scratch space when running in batch mode \\
Rivanna & \verb|/scratch/$USER (interactive)| & 33.2MiB/s & 34.8MB/s & 1.1 & shared scratch space when running interactively \\
Rivanna & \verb|/home/$USER| & 40.9MiB/s & 42.9MB/s & 1.3 & user's home directory \\
Rivanna & \verb|/project/$PROJECTID| & 100MiB/s & 105MB/s & 3.3 & project-specific filesystem \\
Personal Computer & \verb|c:| & 187MiB/s & 196MB/s & 6.1 & file system on a personal computer \\
Rivanna & \verb|/tmp| & 271MiB/s & 285MB/s & 8.9 & temporary file system on a node \\
\hline
Selected Nodes Rivanna & \verb|/localscratch| & 384MiB/s & 403MB/s & 12.6 & special access to NVMe storage of a special node in the cluster \\
RAM disk Rivanna & \verb|/dev/shm/*| & 461MiB/s & 483MB/s & 15.1 & simulated filesystem in a RAM disk \\
\hline
\end{tabular}
}
\end{center}
\end{table}

@@ -1,61 +1,58 @@
\section{Insights into Development from the Earthquake Code}

The original code was developed by a single researcher with the goal to
create a DL method called tevelop to apply spatial timeseries evolution
for multiple applications including earthquake, hydrology, and COVID
prediction. The code was developed in a large Python Jupyter notebook
on Google Colab. The total number of lines of code was \TODO{line
number}. The code included all definitions of variables and
hyperparameters in the code itself.

difficult to maintain and understand for others

easy to develop by author, many experimental aspects

all variables defined in code, not in a config file

lots of graphic outputs for interactive development

How many lines of code?

no use of libraries
limited use of functions
if conditions for different science applications

large code is too difficult to maintain in Colab

papermill

mlcommons focus on one science application at a time

students can not comprehend code

rewritten code to just focus on earthquake

rewritten code to add selected hyperparameters into a configuration file

setup

for

training

validation

comparing output

not much use of libraries

choices

development of multiple runs based on variation of additional time based
internal hyperparameters,
--> long runtime, no changes to evaluation section in code

take these parameters out and place them in a configuration file
-> multiple runs needed and comparison has to be separated from the
program, lots of changes to the program, program will run shorter

libraries for mlcommons benchmarking, cloudmesh
portable way to define data locations via config
experiment permutation over hyperparameters.
* repeated experiments
* separate evaluation and comparison of accuracy which was not in original code.
* comparison of accuracy across different hyperparameter searches.
\subsection{Insights into Development from the Earthquake Code}

The original code was developed with the goal to create a DL method
called {\em tevelop} to apply spatial timeseries evolution for multiple
applications including earthquake, hydrology, and COVID prediction. The
code was presented in a large Python Jupyter notebook on Google Colab.
Due to the integration of multiple applications the code was difficult
to understand and maintain. For this reason the total number of lines
of 13,500 was reduced by more than 2,400 lines when the hydrology and
the COVID code were removed. However, at the same time we restructured
the code and reached a final length of about 11,100 lines of code. The
original code contained all definitions of variables and
hyperparameters in the code itself and needed to be changed every time
a hyperparameter was modified.

As we can see from this, the code has some major issues that future
versions ought to address. First, the code includes every aspect that
is not covered by TensorFlow and also contains a customized version of
TFT. Second, due to this the code is very large, and manipulating and
editing the code is time consuming and error prone. Third, as many
code-related parameters are still managed in the code, running the same
code with various parameters becomes cumbersome. In fact, multiple
copies of the code need to be maintained when new parameters are
chosen, instead of making such parameters part of a configuration file.
Hence we started moving towards the simplification of the code by
introducing the concept of libraries that can be pip installed, as well
as gradually adding more parameters to a configuration file that is
used by the program.

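To illustrate this direction, a minimal sketch of such a configuration
file and of the code that reads it is shown below. The file name, the
parameter names, and the values are hypothetical and only serve to
demonstrate the separation of hyperparameters from the program.

\begin{verbatim}
# Hypothetical contents of config.yaml:
#   epochs: 30
#   batch_size: 64
#   data_dir: /localscratch/earthquake

# Sketch: read the hyperparameters from the configuration file
# instead of hard-coding them in the notebook.
import yaml  # provided by the PyYAML package

with open("config.yaml") as f:
    config = yaml.safe_load(f)

epochs = config["epochs"]
batch_size = config["batch_size"]
data_dir = config["data_dir"]
\end{verbatim}
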
The advantage of using a notebook is that it can be augmented with lots
of graphs that give in situ updates on the progress and its measured
accuracy. It is infeasible for students to use and replicate the run of
this notebook, as the runtime can be up to two days. Students certainly
have to use their computers for other things and need to be able to use
them on the go. Often HPC centers provide interactive jobs in the batch
queues, but this is not sufficient here either. Instead we adapted to
using Jupyter notebooks in full batch mode through the HPC queueing
system by generating a special batch script that internally uses
papermill to execute the notebook in the background. Papermill will
also include all cells that have to be updated during runtime,
including graphics. The script we developed needed, however, to be run
multiple times and with different hyperparameters such as the number of
epochs, to give just one example. As the HPC system is a heterogeneous
GPU system with access to A100, V100, P100, and RTX2080 GPUs, the
choice of the GPU system must be configurable. Hence the batch script
includes the ability to also read in the configuration file and adapt
itself to the needed parameters. This is controlled by a sophisticated
but simple batch job generator which we discuss in a later section.

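The following sketch shows how papermill can execute such a notebook
non-interactively while injecting hyperparameters. The notebook names
and the parameter values shown here are placeholders rather than the
actual files used by the benchmark.

\begin{verbatim}
# Sketch: run a parameterized notebook in batch mode with papermill.
# Notebook names and parameter values are placeholders. The input
# notebook must contain a cell tagged "parameters" so that papermill
# can inject the values given below.
import papermill as pm

pm.execute_notebook(
    "earthquake.ipynb",           # input notebook (placeholder name)
    "earthquake-output.ipynb",    # executed copy with all cell outputs
    parameters={"epochs": 30, "gpu": "a100"},
)
\end{verbatim}

The executed copy retains the graphical output cells, so the same plots
that were available interactively can be inspected after the batch job
finishes.
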
%libraries for mlcommons benchmarking, cloudmesh
%portable way to define data locations via config
%experiment permutation over hyperparameters.
%* repeated experiments
%* separate evaluation and comparison of accuracy which was not in original code.
%* comparison of accuracy across different hyperparameter searches.