Releases: ml-energy/zeus
Zeus Daemon v0.2.0
Change Highlights
CPU and DRAM energy measurements
Zeus daemon now supports CPU and DRAM energy measurements with RAPL, which requires root privileges even for measurement alone. Zeus daemon has also been integrated into the Zeus Python library, so as long as you have the daemon deployed and set the `ZEUSD_SOCK_PATH` environment variable, you'll be all set!
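As a minimal sketch of the setup described above, the snippet below checks that `ZEUSD_SOCK_PATH` is set and points at a Unix domain socket before the Zeus library is used. The helper name `zeusd_socket_status` is hypothetical and not part of Zeus; only the environment variable name comes from the release note.

```python
import os
import stat


def zeusd_socket_status(environ=os.environ):
    """Describe whether ZEUSD_SOCK_PATH looks usable (hypothetical helper)."""
    path = environ.get("ZEUSD_SOCK_PATH")
    if not path:
        return "unset"
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # The path does not exist or cannot be inspected.
        return "missing"
    return "ok" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    print(zeusd_socket_status())
```

Running a check like this before starting a job can help tell apart a daemon that is not deployed from an environment variable that was simply not exported.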
What's Changed
- [Feat] Implement CPU and DRAM monitoring for `zeusd` by @wbjin in #137
- Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor by @michahn01 in #150
- Trace GPU ID in Zeusd GPU routes by @jaywonchung in #152
Zeus v0.11.0
Change Highlights
Renamed to `zeus`!
Until now we used `zeus-ml` because the name `zeus` was taken on PyPI, but now we're finally able to move to `zeus`:
pip install zeus
Prometheus Metrics
Zeus power and energy measurements can now be exported as Prometheus metrics! We currently support three metrics:
- Energy consumption of a fixed code range (Histogram)
- Power draw over time (Gauge)
- Cumulative energy consumption over time (Counter)
We wrote up a detailed metric monitoring guide and integration examples.
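To make the three metric types concrete, here is a rough, dependency-free sketch of their semantics. It does not use Zeus's metric API or the `prometheus_client` library; the class names `ToyCounter`, `ToyGauge`, and `ToyHistogram` are made up for illustration, and the bucket bounds are whatever the caller passes in.

```python
import bisect


class ToyCounter:
    """Monotonically increasing total, e.g. cumulative energy in joules."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class ToyGauge:
    """Latest sampled value, e.g. current power draw in watts."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class ToyHistogram:
    """Cumulative bucket counts, e.g. energy per run of a fixed code range."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One count per upper bound, plus the implicit +Inf bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # Buckets are cumulative: a value counts toward every bucket
        # whose upper bound is greater than or equal to the value.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.counts)):
            self.counts[i] += 1
```

The distinction matters when querying: a counter is meant to be rated over time, a gauge is read as-is, and a histogram supports quantile estimates over many executions of the same code range.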
AMD GPU enhancements
We created ROCm AMDSMI Python bindings (GitHub, PyPI) and integrated them with Zeus. Before this, users had to `cd` into their ROCm installation's AMDSMI distribution directory and run `pip install`, which isn't very convenient.
Our bindings are unofficial & community-maintained. But AMDSMI maintainers did take a look (ROCm/amdsmi#8).
Carbon Emission Estimations
The new `zeus.monitor.carbon.CarbonEmissionMonitor` takes in a carbon intensity provider (e.g., from ElectricityMaps) and provides an estimate for operational carbon emissions. The window-based API is essentially the same as `ZeusMonitor`.
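The underlying arithmetic is simple: operational emissions are the energy consumed multiplied by the grid's carbon intensity. The sketch below is not the `CarbonEmissionMonitor` implementation; the function name and the example numbers are illustrative, and it assumes a single constant carbon intensity (in gCO2eq/kWh) over the whole measurement window.

```python
JOULES_PER_KWH = 3.6e6


def estimate_emissions_g(energy_joules, intensity_g_per_kwh):
    """Return estimated operational emissions in gCO2eq (hypothetical helper)."""
    energy_kwh = energy_joules / JOULES_PER_KWH
    return energy_kwh * intensity_g_per_kwh


if __name__ == "__main__":
    # 7.2 MJ (2 kWh) at an assumed intensity of 400 gCO2eq/kWh.
    print(estimate_emissions_g(7.2e6, 400.0))
```

In practice carbon intensity varies over time, so a long window would need to be split into intervals and summed.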
Full Changelog
- [Misc] Reorganize Zeus NSDI 23 paper artifacts by @jaywonchung in #126
- [Docs] Add `BUILD_SOCIAL_CARD` env, skip social card build by default by @jaywonchung in #130
- [Feat] `CarbonIntensityProvider` and ElectricityMaps implementation by @danielhou0515 in #129
- [Misc] Fix link in PLO example README by @jaywonchung in #136
- Fix typo in profiler script by @dkopczyk in #138
- [Feat] `amdsmi` bindings integration by @parthraut in #132
- Make sure to assign EmptyCPUs to cpus if there is a permission error by @wbjin in #139
- [Feat] Implement CPU and DRAM monitoring for `zeusd` by @wbjin in #137
- [Fix] Fix tests failing due to deprecated `app` argument in httpx client by @jaywonchung in #140
- Out of Bounds Power Limit in `GlobalPowerLimitOptimizer` by @parthraut in #143
- [CI] Upgrade `actions/cache` to V4 by @jaywonchung in #144
- [Misc] Update Perseus paper link by @jaywonchung in #145
- [feat] `CarbonEmissionMonitor` by @danielhou0515 in #148
- Update `zeusd` dependencies following dependabot suggestions by @jaywonchung in #149
- [Feat] Prometheus metric export by @sharonsyh in #134
- Pytorch Fully Sharded Data Parallel (FSDP) Integration by @parthraut in #147
- Rename package from `zeus-ml` to `zeus` by @jaywonchung in #151
- Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor by @michahn01 in #150
- Trace GPU ID in Zeusd GPU routes by @jaywonchung in #152
New Contributors
- @dkopczyk made their first contribution in #138
- @michahn01 made their first contribution in #150
Zeus v0.10.1
This is a maintenance release aimed at enhancing usability and fixing small bugs.
What's Changed
- Feat: Catch `PermissionError` and raise with more information by @wbjin in #111
- Feat: Alternative RAPL directory inside Docker containers by @wbjin in #115
- Feat: added utility function to retrieve CPU index from PID by @danielhou0515 in #117
- Docs: More documentation on CPU monitoring by @wbjin in #118
- Feat: `python -m zeus.show_env` by @jaywonchung in #119
- Feat: `getAverageMemoryPowerUsage` by @jaywonchung in #122
- Fix: Add `getAverageMemoryPowerUsage` to `GPUs` as well by @jaywonchung in #124
New Contributors 🎉
- @danielhou0515 made their first contribution in #117
Full Changelog: zeus-v0.10.0...zeus-v0.10.1
Zeus v0.10.0: Broader support
What's New
CPU and DRAM energy measurement
We implemented support for Intel RAPL, which allows CPU and DRAM energy measurement on supported CPUs.
Generally speaking, most Intel CPUs support both, and some AMD CPUs support RAPL, albeit only for CPU measurement.
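For context, on Linux RAPL counters are typically exposed through the powercap sysfs interface as `energy_uj` files (microjoules) that wrap around at `max_energy_range_uj`. The sketch below shows how an energy delta between two readings could be computed with a wraparound taken into account. It is not Zeus's implementation; the function name is hypothetical, and it assumes the counter wrapped at most once between the two readings.

```python
def rapl_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy consumed between two counter readings, in joules.

    Assumes the counter wrapped at most once between the readings.
    """
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:
        # The counter passed max_range_uj and restarted from zero.
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj / 1e6
```

If readings are spaced far enough apart for the counter to wrap more than once, a single subtraction like this undercounts, which is why periodic polling of the counter is needed for long windows.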
JAX support
We added preliminary JAX support. Check out our full example here.
API usage is mostly identical:
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(sync_execution_with="jax")  # JAX!
monitor.begin_window("computations")
# Run computation
measurement = monitor.end_window("computations")
Zeus Daemon
Our energy optimizers require changing settings on the GPU, including power limit and frequency, which requires admin privileges. More details in our docs.
Zeus Daemon lets you work around this: it runs as a standalone daemon process on the node and performs privileged operations on your behalf, so that you don't have to give the entire Zeus-integrated application admin privileges.
We wrote the Zeus Daemon in Rust: Check out the source code and crates.io for details.
Breaking Changes
`ZeusMonitor.begin_window` and `ZeusMonitor.end_window`'s second parameter `sync_cuda` was renamed to `sync_execution`.
This is because JAX asynchronously runs CPU code as well, and we would like to synchronize both CUDA and CPU computations. This created the need to generalize `sync_cuda` to `sync_execution`.
Changelog
- Docs: Add warnings about instantiating `ZeusMonitor` as a global variable. by @jaywonchung in #68
- Docs: Fix typo by @Sunt-ing in #69
- Docs: Improve the GPU energy monitoring demo by @Sunt-ing in #70
- Feat: Detect and reject unofficial `pynvml` bindings by @jaywonchung in #71
- Fix: Pandas warnings from `PowerMonitor` by @jaywonchung in #75
- Feat: Zeus daemon by @jaywonchung in #81
- Test: Allow `zeusd` dev and testing on MacOS by @jaywonchung in #82
- Refactor: Reorg `zeus.device.gpu` by @jaywonchung in #83
- Feat: Integrate `zeusd` into `zeus.device.gpu` by @jaywonchung in #85
- Chore: Fix typo in GitHub Actions by @jaywonchung in #86
- Chore: `zeusd` debug outputs and doc comments by @jaywonchung in #87
- Feat: Add CPU measurement (via Intel RAPL) to ZeusMonitor by @wbjin in #90
- Fix: RAPL DRAM measurements not to be included in package measurements by @wbjin in #92
- Chore: Run checks in PRs from forks by @jaywonchung in #95
- Docs: Fix attribute name in `ZeusMonitor` example by @HGangloff in #96
- Feat: Add zero energy warning in `ZeusMonitor` by @sharonsyh in #93
- Feat: Add jax support in CUDA sync by @HGangloff in #97
- Docs: Refine JAX integration and example by @jaywonchung in #99
- Feat: Multi arch docker build by @sharonsyh in #104
- News: Add Perseus news and write Perseus blog by @jaywonchung in #107
- Feat: Multi-Arch Docker Build - Pushing to symbioticlab/zeus and mlenergy/zeus by @sharonsyh in #106
- Feat: RAPL Monitor for monitoring wraparounds for a rapl file by @wbjin in #105
- Test: Tests for CPU monitoring onn ZeusMonitor by @wbjin in #100
- Chore: Fix lint warnings from ruff by @wbjin in #108
New Contributors 🎉
- @Sunt-ing made their first contribution in #69
- @wbjin made their first contribution in #90
- @HGangloff made their first contribution in #96
- @sharonsyh made their first contribution in #93
Full Changelog: v0.9.1...zeus-v0.10.0
v0.9.1
What's new
- For GPU power draw, we use `nvmlDeviceGetFieldValues`, which gives us instant power draw (instead of average power draw) for any microarchitecture.
v0.9.0: Batch size optimizer and big cleanups
What's new
- The batch size optimizer is now a full-fledged server that can be deployed independently, with Docker Compose, or on Kubernetes + KubeFlow.
- GPU abstraction: We created an abstraction layer over GPU vendors (NVIDIA and AMD). We're on our way to supporting AMD GPUs.
- Completely revamped documentation under https://ml.energy/zeus.
Deprecated
- See #20 (`ZeusDataLoader`, `ZeusMaster`, and the C++ Zeus monitor)
v0.8.0: Energy-efficient large model training
This release features Perseus, an optimizer for energy-efficient large model training.
See the Perseus docs for details.
v0.7.1: Moved to under `ml-energy`!
We moved our repository to under `ml-energy`. No feature changes :)
v0.7.0: Python-based power monitor
What's New
- We used to have a C++ power monitor under `zeus_monitor`, but we've deprecated that. There's no need for high-speed polling because NVML power counters do not update that quickly anyway.
  - In order to poll power consumption programmatically, use `zeus.monitor.power.PowerMonitor`.
- CLI power & energy monitor:
python -m zeus.monitor power
python -m zeus.monitor energy
- We switched from the old `setup.py` to the new package metadata standard `pyproject.toml`.
- Docker image sizes are drastically smaller now! The compressed image used to be 8.48 GB, but now it's down to 2.71 GB.
v0.6.1: `approx_instant_energy`
What's New
`approx_instant_energy` in `ZeusMonitor`
- Sometimes, the NVML energy counter update period is longer than the measurement window, in which case energy consumption may be returned as `0.0`. In this case, when `approx_instant_energy=True`, `ZeusMonitor` will approximate the energy consumption of the window as instant power consumption multiplied by the duration of the measurement window:
$$\textrm{Energy} = \int_0^T \textrm{Power}(t) \, dt \approx \textrm{Power}(T) \cdot T$$
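The fallback can be sketched roughly as follows. This is an illustration of the approximation only, not `ZeusMonitor`'s actual code; the function name `window_energy` and its parameters are hypothetical.

```python
def window_energy(counter_delta_joules, instant_power_watts, duration_s,
                  approx_instant_energy=False):
    """Return the window's energy, falling back to Power(T) * T if enabled.

    The fallback applies only when the energy counter did not advance
    during the window (a delta of exactly 0.0).
    """
    if counter_delta_joules == 0.0 and approx_instant_energy:
        return instant_power_watts * duration_s
    return counter_delta_joules
```

For example, a 20 ms window with an instant power reading of 250 W would be approximated as about 5 J, whereas without the flag it would be reported as 0.0.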