A repository for replicating the experiments found in "Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines" (MLSys '22).
Plumber consists of two parts: a Tensorflow installation and a Plumber "app",
which acts as a front-end to the data generated by Plumber.
You can find the Tensorflow release here.
This repository contains the application layer, which lives in plumber_analysis.
The application layer is installed on top of the Tensorflow release, so you will
want to install the Tensorflow release first and then install plumber_analysis.
We use the following hardware setups in our experiments.
Setup A: These experiments were run on a consumer-grade desktop machine. The data is read off of ZFS.
Node hardware: CPU: 8-core AMD Ryzen 7 2700X; RAM: 32GiB; Drives: HP EX920 1TB SSD, WDC WD60EFRX-68L HDD (3x in a ZFS configuration)
Setup B: These experiments were run on the Carnegie Mellon University Parallel Data Lab Orca cluster. Nodes are connected with a Cisco Nexus 3264-Q 64-port QSFP+ 40GbE switch. Each node usually has an HDD and an SSD. The data is read off a Network File System.
Node hardware: CPU: 16-core Intel E5-2698Bv3 Xeon 2GHz; RAM: 64GiB; Drives: 400GB Intel P3600 NVMe SSD, 4TB 7200RPM Seagate ST4000NM0023 HDD; NIC: Mellanox MCX314A-BCCT 40GbE
Setup C: These are cloud TPU VMs as described here. You can readily access them on Google Cloud. The data is read from Google Cloud Storage. According to CLI tools, they have 300GB of memory and 96 Intel Xeon cores along with the TPUv3-8 accelerators.
The Tensorflow component of Plumber is just a fork of Tensorflow. Building Tensorflow from source is well described in the official documentation, but we summarize it below. If you just want to run some code, we provide the TPU install steps below.
It is useful to use Bazelisk to build Tensorflow (we rename the downloaded file to bazel).
The refresh_tf_build.sh script (and its TPU variant) assumes bazel is in the Tensorflow directory,
which is why it calls ./bazel when building.
Feel free to change this if you placed bazel elsewhere.
wget https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 bazel
chmod +x bazel
We also recommend using Miniconda to create a standardized environment. We use a Python3.7 environment (py37) for both building and installing the following software.
To install:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
You will have to scroll through the license and type "yes" at the end. You should leave the options at their defaults.
After install, you will want to create the environment. To create it:
conda create -n py37 python=3.7
conda activate py37
Tensorflow has its own stated dependencies, which are quite general. We recommend the following for maximum compatibility, though we also provide Tensorflow's build-from-source recommendations below.
Using Miniconda, please install these pip packages inside the py37 environment.
For maximum compatibility, specify the version of numpy explicitly:
sudo apt install python3-dev python3-pip
pip install pip numpy==1.21.5 wheel
pip install keras_preprocessing --no-deps
Tensorflow's own recommended dependencies for building from source are:
sudo apt install python3-dev python3-pip
pip install -U --user pip numpy wheel
pip install -U --user keras_preprocessing --no-deps
Note that numpy WILL CAUSE BINARY INCOMPATIBILITY if the version present at runtime does not exactly match the version used at build time. This is especially problematic with Python wheels, which may uninstall the current numpy version after it was used for building.
As shown in tensorflow/tools/pip_package/setup.py, a standard value of numpy==1.19.2 was used.
We tested with numpy==1.21.5, which requires overriding the dependency implied by Tensorflow,
so we bumped the declared version to 1.21.5, though you may decide to change it back.
In any case, be consistent with this value: either install the declared version before building
or change the declaration to the version you are actually using, so the generated wheel
does not overwrite your numpy install with a different version and trigger the binary
incompatibility issue above.
If you get an error about version mismatch (e.g., RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
), it may be helpful to run:
pip install numpy --upgrade
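As a quick sanity check (a minimal sketch, assuming you are in the Tensorflow source directory), you can compare the numpy version in your environment against the one pinned in the Tensorflow source:
python3 -c "import numpy; print(numpy.__version__)"
grep -n numpy tensorflow/tools/pip_package/setup.py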
We provide convenience scripts for building Tensorflow inside the Tensorflow directory. If you don't want to use these, skip to the Manual Build section.
To build (see the consolidated CPU-only sketch after this list):
1. Clone the Tensorflow repository with the Plumber patch.
2. (For TPU builds) Copy libtpu.so into the Tensorflow directory.
3. Run ./configure, setting the appropriate flags. Using Miniconda requires activating the environment before running this part, so the paths are found automatically. The default options are fine, unless you are adding GPU support.
4. Run the build script refresh_tf_build.sh (CPU-only) or refresh_tf_build_tpu.sh (TPU).
5. Install the plumber_analysis application layer by cloning this repository, changing directory to plumber_analysis, and running the install.sh script.
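Putting the CPU-only path together, the steps above look roughly like the following sketch (repository URLs are the ones used in the TPU walkthrough below; adjust to your setup):
git clone --recurse-submodules https://github.com/mkuchnik/PlumberTensorflow.git
cd PlumberTensorflow
# Get bazel (Bazelisk), called as ./bazel by the build script
wget https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 bazel
chmod +x bazel
conda activate py37
./configure                 # defaults are fine for a CPU-only build
bash refresh_tf_build.sh
cd ..
git clone https://github.com/mkuchnik/PlumberApp.git
cd PlumberApp/plumber_analysis
bash install.sh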
Install miniconda (as described above).
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Add a Python 3.7 env, py37.
conda create python=3.7 -n py37
You need to install libtpu.so
from a JAX installation.
DO NOT INSTALL UNDER MINICONDA.
We are simply trying to get the libtpu.so
file consistent across the system, which we will use to both
build and run Tensorflow.
conda deactivate # Get out of conda
conda deactivate # Get out of conda
conda deactivate # Get out of conda
python3 -m pip install "jax[tpu]==0.2.19" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html # Match libtpu.so to version used
sudo cp $HOME/.local/libtpu.so/libtpu.so /usr/lib/libtpu.so # Overwrite existing libtpu.so
You can then do a Tensorflow build with that libtpu.so
version, which we'll copy
into the source directory for Tensorflow, below.
When done correctly, that version of libtpu.so
should return the following
checksum.
$ sha1sum /usr/lib/libtpu.so
44c88d0d50f2f8ffc9eedc17dde72e60b28f3963 /usr/lib/libtpu.so
Assuming Miniconda is installed with the py37 environment and libtpu.so is the version you want to use:
conda activate py37
git clone --recurse-submodules https://github.com/mkuchnik/PlumberTensorflow.git
cd PlumberTensorflow
cp /usr/lib/libtpu.so . # Copy libtpu version to Tensorflow for building
# Get bazel. We will call it with ./bazel in the TF Build script.
wget https://github.com/bazelbuild/bazelisk/releases/download/v1.11.0/bazelisk-linux-amd64
mv bazelisk-linux-amd64 bazel
chmod +x bazel
# Install Dependencies
sudo apt -y update
sudo apt -y install python3-venv
conda activate py37
# Standard Tensorflow Dependencies
pip install -U pip numpy==1.21.5 wheel
pip install -U keras_preprocessing --no-deps
# Default options are probably fine (make sure you see a path to miniconda)
./configure
# Build
bash refresh_tf_build_tpu.sh
# Exit Directory
cd ..
# Install Plumber Analysis App
git clone https://github.com/mkuchnik/PlumberApp.git
cd PlumberApp/plumber_analysis
bash install.sh
To test that everything is working, start a Python shell and import Tensorflow.
If tf.ones(1) initializes all (8) TPU devices, the TPU is working.
If you can import plumber_analysis, it is installed.
import tensorflow as tf
tf.ones(1) # Should show 8 TPU "StreamExecutor" devices initializing in logs
tpu_devices = tf.config.list_physical_devices("TPU")
print("tpu_devices: {}".format(tpu_devices)) # Should be 8 TPUs
print("num TPU: {}".format(len(tpu_devices))) # Should be 8
import plumber_analysis # Should not throw error
Similarly, if using JAX for an end-to-end run:
import jax
jax.device_count() # Should show 8 TPU devices
If you see a timeout connecting to the TPU, it is likely that some other process is
holding a lock on libtpu.so. You can try to resolve this by killing all
python processes (that are not root).
ps aux | grep python | grep -v "grep python" | awk '{print $2}' | xargs kill -9
You can also try to find the process using libtpu
with:
sudo lsof -w /dev/accel0
As a last resort, the lock is on /tmp/libtpu_lockfile, so you can delete that
file at the risk of causing libtpu to enter an undefined state.
To build:
1. Clone the Tensorflow repository with the Plumber patch.
2. Copy libtpu.so into the Tensorflow directory.
3. Run ./configure, setting the appropriate flags. Using Miniconda requires activating the environment before running this part, so the paths are found automatically. The default options are fine, unless you are adding GPU support.
4. Run the build. The TPU build is shown below, since it's the most complicated (requiring an extra --config=tpu flag). Note that for TPU you need a libtpu.so file to build against, which is distributed with the JAX TPU Python packages. Otherwise, you can find it pre-installed on TPU machines under /usr/lib/libtpu.so. Place libtpu.so in the Tensorflow source directory so it can be found during the build. Set N_JOBS to the thread count you want to use for the build (an example follows the build command below).
5. Install the plumber_analysis application layer by cloning this repository, changing directory to plumber_analysis, and running the install.sh script.
The Tensorflow build in step 4 will look something like:
bazel build \
--config=opt \
--config=tpu \
-c opt \
-j $N_JOBS \
--verbose_failures \
//tensorflow/tools/pip_package:build_pip_package
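N_JOBS must be set before running the command above; a common (though not required) choice is the machine's core count:
N_JOBS=$(nproc)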
If you get a TPU-related error, it is almost always because the appropriate function calls
cannot be found in libtpu.so; if this is the case, you may not have placed libtpu.so in the
Tensorflow directory.
If you don't use a TPU, omit that config:
bazel build \
--config=opt \
-c opt \
-j $N_JOBS \
--verbose_failures \
//tensorflow/tools/pip_package:build_pip_package
This will generate an executable that builds the pip package. To generate the Python wheel:
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
You can then install with:
pip install /tmp/tensorflow_pkg/tensorflow-${VERSION}-${TAGS}.whl
where VERSION (the release version, e.g., 2.7.0) and TAGS (a function of the Python version, e.g., py37)
are set appropriately for the selected Tensorflow build.
You can look in the /tmp/tensorflow_pkg directory to see what was generated.
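For example (a sketch, assuming only one wheel was generated), a glob avoids spelling out VERSION and TAGS:
ls /tmp/tensorflow_pkg/
pip install /tmp/tensorflow_pkg/tensorflow-*.whl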
A typical set of dependencies after installation is given in
requirements_post_install.txt
.
Note that if you also wish to run the unit tests, you may need to install Java.
sudo apt install openjdk-11-jdk
It's also hard to run the tests with the TPU build; CPU-only is better for testing.
TPUs need a driver and library to run, much like GPUs have CUDA.
To get the libtpu.so
file, run the following (shown above in install):
python3 -m pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
You then have to find where the file was installed and place it in the Tensorflow build directory. You will also want to copy it over the system library (so you don't get version mismatch errors).
sudo cp $HOME/.local/libtpu.so/libtpu.so /usr/lib/libtpu.so
As noted below, we used jax[tpu]==0.2.19
.
This gives us libtpu-nightly-0.1.dev20210809
.
Only one process can run with libtpu
at a time.
Therefore, if you encounter a problem with:
libtpu.so is already in use by another process. Not attempting to load libtpu.so in this process.
(or some variant of a TPU timeout), you will need to find and stop that process.
You can try to resolve this by killing all python processes (that are not root).
ps aux | grep python | grep -v "grep python" | awk '{print $2}' | xargs kill -9
You can also try to find the process using libtpu
with:
sudo lsof -w /dev/accel0
As a last resort, the lock is on /tmp/libtpu_lockfile, so you can delete that
file at the risk of causing libtpu to enter an undefined state.
Also, note that libtpu.so writes logs to /tmp/tpu_logs, so you should
clear these logs if you see warnings about overwriting these files. The typical
reason for this is that another user wrote to /tmp/tpu_logs and your
user doesn't have permission to overwrite it.
sudo rm -rf /tmp/tpu_logs/
In the notebook
directory, we show how to run Plumber to analyze a simple
pipeline.
Below, we describe the main experiments from the paper.
We use MLPerf along with Official Tensorflow pipelines to evaluate Plumber. The subset of MLPerf we use consists of the five MLPerf v0.6 models: ResNet50, Mask-RCNN, SSD, Transformer, and GNMT. These models use the following datasets, respectively: ImageNet, COCO, COCO, WMT17, and WMT16. Note that due to API differences between older Tensorflow and the current version (namely, that Tensorflow 1 didn't have eager mode and relied on graph mode), along with the fact that some internal code was redacted in the MLPerf code, we used at least the MLPerf v0.7 variants and had to make some minor modifications to the code to get it to run.
Each pipeline uses tf.data. We care about 4 configurations:
- Naive: no prefetching and parallelism set to 1.
- AUTOTUNE: prefetching and parallelism set to tf.data.AUTOTUNE.
- HEURISTIC: prefetching and parallelism set to the number of cores.
- Plumber: prefetching, parallelism, and caching set to what Plumber recommends.
What we care about is the rate (speed/throughput) of the pipeline under these configurations. You can measure this either by counting the number of samples processed in a given amount of time or by looking at the rate estimated per epoch. We provide bash scripts which go through these permutations for the experiments; a toy sketch of the configurations and the rate measurement is shown below.
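For illustration only, the following sketch measures the rate of a toy tf.data pipeline under the three non-Plumber settings (the dataset and map function here are placeholders, not the benchmark pipelines):
python3 - <<'EOF'
# Toy illustration: how the Naive/AUTOTUNE/HEURISTIC settings map onto tf.data
# knobs, and how a rate can be measured by draining the pipeline in a loop.
import os
import time
import tensorflow as tf

def make_dataset(parallelism, prefetch):
    # Placeholder pipeline; the real benchmarks decode and augment records.
    ds = tf.data.Dataset.range(100000)
    ds = ds.map(lambda x: x * 2, num_parallel_calls=parallelism)
    if prefetch:
        ds = ds.prefetch(prefetch)
    return ds

configs = {
    "Naive": dict(parallelism=1, prefetch=0),
    "AUTOTUNE": dict(parallelism=tf.data.AUTOTUNE, prefetch=tf.data.AUTOTUNE),
    "HEURISTIC": dict(parallelism=os.cpu_count(), prefetch=os.cpu_count()),
}

for name, cfg in configs.items():
    start, count = time.perf_counter(), 0
    for _ in make_dataset(**cfg):   # drain the pipeline as fast as possible
        count += 1
    rate = count / (time.perf_counter() - start)
    print(f"{name}: {rate:.1f} elements/sec")
EOF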
The microbenchmarks are CPU-only and are run as fast as they can go in a loop.
The rate measured is therefore the maximum possible rate sustained by that
pipeline under that configuration.
These are run on Setups A and B, and do not need a Tensorflow built with libtpu.so for
TPUs; a CPU-only Tensorflow build suffices.
In fact, if you are using the TPU build, you may hit errors like
tensorflow.python.framework.errors_impl.UnauthenticatedError: ioctl failed,
caused by JAX/Tensorflow's TPU libraries attempting authentication (which is not
used). In that case, you can block libtpu from being loaded by intentionally
breaking out of a running libtpu session with CTRL+C (which creates /tmp/libtpu_lockfile).
Note that you will still need to install plumber_analysis with the install.sh script.
To run these, we provide high-level runner scripts, which you can run from the
microbenchmark directory.
bash run_all.sh
bash run_all_PDL.sh
Dataset Path: We leave the dataset paths hardcoded. Please adjust these in
the respective microbenchmark's train_sweep.sh
script.
For example, ResNet has the path set here.
After running the above runner scripts, each microbenchmark directory will contain a new
directory with the results of each of the experiments.
To plot the results showing throughput, local decision optimality, and throughput
predictions, we provide an example script in microbenchmarks/plot_stats.py.
You can point this script at the directory created during the microbenchmark to
create the corresponding throughput plots.
For example, for ResNet, run the following from the microbenchmarks
directory:
python3 plot_stats.py simple_resnet/MLPerf/train_sweep_0_resnet_test
The end-to-end experiments run the pipeline on the CPU and the machine learning model on a Tensor Processing Unit (TPU). We specifically use a TPUv3-8 (Setup C). The rate measured is therefore the minimum of the model performance and the pipeline performance. Just like for the microbenchmarks, you will likely have to adjust the path to the corresponding dataset in the script that runs the experiment.
After installing Tensorflow and the plumber_analysis library, you will need to get JAX and the JAX libtpu for most
end-to-end runs. For example, to get jax[tpu] version 0.2.16, run:
pip install "jax[tpu]==0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
Note that if this libtpu version is different from the system version, you will
have to copy it into the system location to avoid a version mismatch.
When using Miniconda with a Python 3.7 environment named py37, you will find it
in $HOME/miniconda3/envs/py37/lib/python3.7/site-packages/libtpu/libtpu.so.
That file needs to go to /usr/lib.
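For example (assuming the Miniconda path above):
sudo cp $HOME/miniconda3/envs/py37/lib/python3.7/site-packages/libtpu/libtpu.so /usr/lib/libtpu.so
sha1sum /usr/lib/libtpu.so   # compare against the checksum listed earlier if using jax[tpu]==0.2.19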
We used jax[tpu]==0.2.19 in our experiments, along with jaxlib==0.1.70. The
script we use installs 0.2.16 first and then these versions, though you can probably just
install these versions directly.
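For example, to install the pinned versions directly (an assumption based on the versions above):
pip install "jax[tpu]==0.2.19" jaxlib==0.1.70 -f https://storage.googleapis.com/jax-releases/libtpu_releases.html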
Run Order. We recommend starting with ResNet before moving to other end-to-end runs, because it ensures that the dependencies are working.
What is using the TPU. For some runs, we use Tensorflow to control the model, and therefore the TPU; in other runs, we use JAX to do the same. In all cases, Tensorflow is in control of the data pipeline with tf.data.
Constraints.
TF2 with native ops is the main target of Plumber. TF1 workflows currently
cannot be resumed, and therefore require either benchmarking the input pipeline
in a TF2 context (i.e., a microbenchmark)
or tracing the pipeline in a TF1 context and resuming it in a
TF2 context. For the latter, we provide the relevant pipelines (i.e., RCNN/GNMT) with
a get_params.py script, which can start Plumber optimization in a TF2 context after tracing.
Similarly, pipelines with custom kernels (i.e., Flax WMT) cannot be
easily deserialized in either TF2 or TF1, so we provide a show_params.sh
to
print out the currently recommended parameters after tracing.
The parameters can then be copied over to the input pipeline.
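As a purely hypothetical illustration of that last step (the variable names and values below are made up; substitute whatever get_params.py or show_params.sh actually print):
python3 - <<'EOF'
# Hypothetical sketch: plugging recommended values into a tf.data pipeline.
import tensorflow as tf

MAP_PARALLELISM = 12   # made-up value; substitute the printed recommendation
PREFETCH_BUFFER = 4    # made-up value; substitute the printed recommendation

ds = tf.data.Dataset.range(1000)
ds = ds.map(lambda x: x + 1, num_parallel_calls=MAP_PARALLELISM)
ds = ds.prefetch(PREFETCH_BUFFER)
for _ in ds.take(5):
    pass
EOF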
This pipeline uses JAX (TPU) + Tensorflow. You should install the minimal requirements to avoid clobbering dependencies (e.g., tensorflow-gpu being installed):
bash install_jax.sh # Install JAX + plumber_analysis
bash reinstall_tpu_lib.sh # Sync libtpu.so version (note: may need sudo)
pip install -r requirements_very_minimal.txt # Get some pip packages
Dependency Notes.
Note that sometimes keras gets installed alongside keras-nightly, causing an
error when the model is run. If this happens, make sure there are not multiple
versions of keras/keras-nightly installed.
Note that we provide a requirements.txt file with a full dependency list to compare
against with pip freeze if dependency issues come up.
Installing from requirements_very_minimal.txt (above) is recommended because the number of
dependencies in requirements.txt makes them difficult to satisfy all at once and
causes issues with pip (e.g., the pip dependency solver becomes slow or
complains about incompatible versions).
Then to run with standard 96 threads/cores for resnet18:
bash official_runners/resnet18_model_96.sh
Then to run with standard 96 threads/cores for the linear model:
bash official_runners/linear_model_96.sh
These runs will output something like the following (Linear Autotune runs first and is shown):
I0227 07:24:55.529147 140132377122816 train.py:888] Images per loop: 1282048
I0227 07:24:55.529258 140132377122816 train.py:889] Loop bounds: 5
I0227 07:24:55.529352 140132377122816 train.py:896] Loop 0, rate=0.0, current_loop rate=5072923636.407547
I0227 07:24:55.610690 140132377122816 train.py:898] LR=0.0
I0227 07:27:24.209707 140119247529728 train.py:474] train epoch: 1, loss: 6.8660, accuracy: 0.32
I0227 07:27:31.735340 140132377122816 train.py:896] Loop 1, rate=8207.412015521126, current_loop rate=8211.708982486187
I0227 07:27:31.769288 140132377122816 train.py:898] LR=0.08000000566244125
I0227 07:27:31.775491 140119247529728 train.py:474] eval epoch: 1, loss: 6.7389, accuracy: 0.53
I0227 07:27:31.776827 140119247529728 mllog.py:168] :::MLLOG {"namespace": 0, "time_ms": 1645946851775, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.00528, "metadata": {"file": "/home/mkuchnik_andrew_cmu_edu/PlumberApp/end_to_end/TPU/resnet-research-JAX-Distributed-Shampoo-tpu-v4-256/train.py", "lineno": 478, "epoch_num": 1}}
I0227 07:27:31.777864 140119247529728 mllog.py:168] :::MLLOG {"namespace": 0, "time_ms": 1645946851777, "event_type": "INTERVAL_END", "key": "block_stop", "value": null, "metadata": {"file": "/home/mkuchnik_andrew_cmu_edu/PlumberApp/end_to_end/TPU/resnet-research-JAX-Distributed-Shampoo-tpu-v4-256/train.py", "lineno": 482, "first_epoch_num": 1, "epoch_count": 1}}
I0227 07:27:31.780810 140119247529728 mllog.py:168] :::MLLOG {"namespace": 0, "time_ms": 1645946851780, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "/home/mkuchnik_andrew_cmu_edu/PlumberApp/end_to_end/TPU/resnet-research-JAX-Distributed-Shampoo-tpu-v4-256/train.py", "lineno": 489, "first_epoch_num": 2, "epoch_count": 1}}
I0227 07:29:29.230601 140119247529728 train.py:474] train epoch: 2, loss: 6.6795, accuracy: 0.67
I0227 07:29:29.403882 140132377122816 train.py:896] Loop 2, rate=9362.29673340281, current_loop rate=10898.607160054378
I0227 07:29:31.711492 140132377122816 train.py:898] LR=0.01600000075995922
I0227 07:29:31.717248 140119247529728 train.py:474] eval epoch: 2, loss: 6.6185, accuracy: 0.77
I0227 07:29:31.718048 140119247529728 mllog.py:168] :::MLLOG {"namespace": 0, "time_ms": 1645946971717, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.00772, "metadata": {"file": "/home/mkuchnik_andrew_cmu_edu/PlumberApp/end_to_end/TPU/resnet-research-JAX-Distributed-Shampoo-tpu-v4-256/train.py", "lineno": 478, "epoch_num": 2}}
I0227 07:29:31.718366 140119247529728 mllog.py:168] :::MLLOG {"namespace": 0, "time_ms": 1645946971718, "event_type": "INTERVAL_END", "key": "block_stop", "value": null, "metadata": {"file": "/home/mkuchnik_andrew_cmu_edu/PlumberApp/end_to_end/TPU/resnet-research-JAX-Distributed-Shampoo-tpu-v4-256/train.py", "lineno": 482, "first_epoch_num": 2, "epoch_count": 1}}
I0227 07:29:31.719552 140119247529728 mllog.py:168] :::MLLOG {"namespace": 0, "time_ms": 1645946971719, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "/home/mkuchnik_andrew_cmu_edu/PlumberApp/end_to_end/TPU/resnet-research-JAX-Distributed-Shampoo-tpu-v4-256/train.py", "lineno": 489, "first_epoch_num": 3, "epoch_count": 1}}
I0227 07:31:52.656227 140119247529728 train.py:474] train epoch: 3, loss: 6.6371, accuracy: 0.76
I0227 07:31:52.775688 140132377122816 train.py:896] Loop 3, rate=9217.917191839828, current_loop rate=9088.433776361755
Where the "rate" is the throughput for that epoch (around 9-10k images/second).
For a naive run, for example, we'd expect much worse performance.
You can run the experiment in a window using tmux
and check the CPU utilization while it is running---you should see many cores
being used for the ResNet runs.
This pipeline uses Tensorflow (TPU). Run the dependency install script.
bash install_deps.sh
Then to run with standard 96 threads/cores:
bash official_runners/run_96.sh
To obtain the Plumber parameters, please run the naive pipeline with tracing and then run
get_params.py
with the emitted stats.pb
placed in the same directory as you
are running get_params.py
from.
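For example (a sketch; the location of the emitted stats.pb depends on the run):
cp /path/to/emitted/stats.pb .
python3 get_params.py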
This pipeline uses JAX (TPU) + Tensorflow. Run the dependency install script.
bash install_deps.sh
We can disable 48/96 cores with:
bash cores_sweep_48.sh
Then to run with 48 threads/cores:
bash official_runners/run.sh
To turn all cores back online:
sudo ./enable_cores.sh
Then to run with 96 threads/cores:
bash official_runners/run_96.sh
This pipeline uses JAX (TPU) + Tensorflow. To run:
bash official_runners/run.sh
This pipeline uses JAX (TPU) + Tensorflow + Tensorflow-text.
You need to install Tensorflow-text, which likely requires building from source
for compatibility with the Tensorflow fork.
Since we are building a 2.7.0 Tensorflow fork, check out Tensorflow-text commit
02587f8b6c6871079d35d6ebe22c97b2c358cc3b.
To ensure you don't clobber Tensorflow with standard 2.6.0 installs, modify the
project_version and the tensorflow version to reflect 2.7.0 in oss_scripts/pip_package/setup.py.
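A sketch of that checkout (the repository URL and the sed pattern below are assumptions; verify the edits to oss_scripts/pip_package/setup.py by hand):
git clone https://github.com/tensorflow/text.git tensorflow-text
cd tensorflow-text
git checkout 02587f8b6c6871079d35d6ebe22c97b2c358cc3b
# Bump project_version and the tensorflow pin from 2.6.0 to 2.7.0 (assumed pattern; check the file afterwards)
sed -i 's/2\.6\.0/2.7.0/g' oss_scripts/pip_package/setup.py
Then build and install the package following Tensorflow-text's own build-from-source instructions.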
After installing this dependency, you can run:
bash official_runners/run.sh
Because this pipeline uses custom operators which cannot be easily
deserialized, we utilize the provided show_params.sh script to
print out Plumber's recommendation when optimizing the stats.pb file emitted by
tracing the naive pipeline.
Note that this requires running the optimization at least twice to get the final
parameters, in order to account for cache insertion.
Each of the pipeline configurations is provided as a separate file which can be
copied over input_pipeline.py.
Run the dependency install script.
bash install_deps.sh
Then to run:
bash official_runners/run.sh
To obtain the Plumber parameters, please run the naive pipeline with tracing and then run
get_params.py
with the emitted stats.pb
placed in the same directory as you
are running get_params.py
from.