Releases: ml-energy/zeus

Zeus Daemon v0.2.0

03 Feb 07:18

Change Highlights

CPU and DRAM energy measurements

Zeus daemon now also supports CPU and DRAM energy measurement via RAPL, which requires root privileges even for read-only measurement. Zeus daemon has also been integrated into the Zeus Python library, so as long as you have the daemon deployed and the ZEUSD_SOCK_PATH environment variable set, you'll be all set!
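
Before starting a job, it can help to confirm that ZEUSD_SOCK_PATH actually points at a Unix domain socket. The following is a minimal sketch using only the Python standard library; the helper name zeusd_socket_status and its return values are hypothetical and not part of the Zeus API.

```python
import os
import stat


def zeusd_socket_status(environ=os.environ):
    """Classify the ZEUSD_SOCK_PATH setting (hypothetical helper, not a Zeus API).

    Returns one of: "unset", "missing", "not-a-socket", "ok".
    """
    sock_path = environ.get("ZEUSD_SOCK_PATH")
    if not sock_path:
        return "unset"
    try:
        mode = os.stat(sock_path).st_mode
    except OSError:
        # Covers a nonexistent path as well as a path we may not inspect.
        return "missing"
    # A deployed daemon is expected to listen on a Unix domain socket.
    return "ok" if stat.S_ISSOCK(mode) else "not-a-socket"
```

Calling zeusd_socket_status() with no arguments inspects the current process environment. A result of "ok" only shows that a socket file exists at that path, not that the daemon behind it is responsive.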

What's Changed

  • [Feat] Implement CPU and DRAM monitoring for zeusd by @wbjin in #137
  • Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor by @michahn01 in #150
  • Trace GPU ID in Zeusd GPU routes by @jaywonchung in #152

Zeus v0.11.0

03 Feb 07:10

Change Highlights

Renamed to zeus!

Until now we used zeus-ml because the name zeus was taken on PyPI, but now we're finally able to move to zeus:

pip install zeus

Prometheus Metrics

Zeus power and energy measurements can now be exported as Prometheus metrics! We currently support three metrics:

  • Energy consumption of a fixed code range (Histogram)
  • Power draw over time (Gauge)
  • Cumulative energy consumption over time (Counter)

We wrote up a detailed metric monitoring guide and integration examples.
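
To make the difference between the three metric types concrete, here is a dependency-free sketch of their semantics. It does not use a Prometheus client library and is not how Zeus exports metrics; the class name, method names, and bucket bounds are all hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


class EnergyMetricsSketch:
    """Toy model of Gauge, Counter, and Histogram semantics (illustration only)."""

    def __init__(self, bucket_bounds_joules):
        self.bounds = sorted(bucket_bounds_joules)
        # One count per upper bound, plus a final +Inf bucket.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.power_gauge_watts = 0.0
        self.energy_counter_joules = 0.0

    def record_power_sample(self, watts, seconds_since_last):
        # Gauge: holds the latest value and may go up or down.
        self.power_gauge_watts = watts
        # Counter: only ever increases (power times elapsed time).
        self.energy_counter_joules += watts * seconds_since_last

    def observe_window_energy(self, joules):
        # Histogram: count the observation in the first bucket whose
        # upper bound is >= the observed value ("le" semantics).
        self.bucket_counts[bisect_left(self.bounds, joules)] += 1

    def cumulative_buckets(self):
        # Prometheus exposes histogram buckets as cumulative counts.
        return list(accumulate(self.bucket_counts))
```

In this model, the energy of a fixed code range is something you observe once per execution (Histogram), power draw is a value you overwrite on each sample (Gauge), and cumulative energy is a running total (Counter).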

AMD GPU enhancements

We created ROCm AMDSMI Python bindings (GitHub, PyPI) and integrated them with Zeus. Before this, users had to cd into the AMDSMI distribution directory of their ROCm installation and run pip install, which isn't very convenient.

Our bindings are unofficial and community-maintained, but the AMDSMI maintainers did take a look (ROCm/amdsmi#8).

Carbon Emission Estimations

The new zeus.monitor.carbon.CarbonEmissionMonitor takes in a carbon intensity provider (e.g., from ElectricityMaps) and provides an estimate for operational carbon emissions. The window-based API is essentially the same as ZeusMonitor.
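
The underlying arithmetic is a unit conversion followed by a multiplication. The sketch below assumes carbon intensity is given in gCO2eq/kWh and energy in joules; the function names are hypothetical, and this is not the CarbonEmissionMonitor implementation.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def estimate_emissions_g(energy_joules, intensity_g_per_kwh):
    """Estimate operational emissions in gCO2eq for one measurement."""
    return energy_joules / JOULES_PER_KWH * intensity_g_per_kwh


def estimate_emissions_over_intervals(intervals):
    """Sum emissions over (energy_joules, intensity_g_per_kwh) pairs.

    Useful when the carbon intensity changes during a long-running job.
    """
    return sum(estimate_emissions_g(e, i) for e, i in intervals)
```

For example, 3.6 MJ (1 kWh) consumed at 400 gCO2eq/kWh corresponds to an estimate of 400 g.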


Zeus v0.10.1

10 Sep 22:01

This is a maintenance release aimed at enhancing usability and fixing small bugs.

What's Changed

  • Feat: Catch PermissionError and raise with more information by @wbjin in #111
  • Feat: Alternative RAPL directory inside Docker containers by @wbjin in #115
  • Feat: added utility function to retrieve CPU index from PID by @danielhou0515 in #117
  • Docs: More documentation on CPU monitoring by @wbjin in #118
  • Feat: python -m zeus.show_env by @jaywonchung in #119
  • Feat: getAverageMemoryPowerUsage by @jaywonchung in #122
  • Fix: Add getAverageMemoryPowerUsage to GPUs as well by @jaywonchung in #124

Full Changelog: zeus-v0.10.0...zeus-v0.10.1

Zeus v0.10.0: Broader support

16 Aug 20:49

What's New

CPU and DRAM energy measurement

We implemented support for Intel RAPL, which allows CPU and DRAM energy measurement on supported CPUs.
Generally speaking, most Intel CPUs support both, and some AMD CPUs support RAPL as well, albeit only for CPU measurement.
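
On Linux, RAPL counters are typically exposed through the powercap sysfs interface as files such as energy_uj and max_energy_range_uj, in microjoules. These counters wrap around, so a measurement has to account for that. The sketch below shows one way to compute the energy between two readings; the function names are hypothetical, the wraparound handling assumes at most one wrap between readings, and this is not necessarily how Zeus implements it.

```python
from pathlib import Path


def read_counter_uj(path):
    """Read an integer microjoule value from a powercap sysfs file.

    Example path (assumed, varies by machine):
    /sys/class/powercap/intel-rapl:0/energy_uj
    Reading it may require root privileges.
    """
    return int(Path(path).read_text().strip())


def rapl_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy in joules between two counter readings, allowing one wraparound."""
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:
        # The counter wrapped past max_range_uj and restarted near zero.
        delta_uj = end_uj + max_range_uj - start_uj
    return delta_uj / 1e6
```

If the measurement window is long enough for the counter to wrap more than once, the result is an underestimate, so long windows need periodic sampling.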

JAX support

We added preliminary JAX support. Check out our full example here.

API usage is mostly identical:

from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(sync_execution_with="jax")  # JAX!

monitor.begin_window("computations")
# Run computation
measurement = monitor.end_window("computations")

Zeus Daemon

Our energy optimizers require changing settings on the GPU, including the power limit and frequency. This requires admin privileges. More details in our docs.

Zeus Daemon lets you circumvent this by running as a standalone daemon process on the node that performs privileged operations on your behalf, so that you don't have to give the entire Zeus-integrated application admin privileges.

We wrote the Zeus Daemon in Rust: Check out the source code and crates.io for details.

Breaking Changes

The second parameter of ZeusMonitor.begin_window and ZeusMonitor.end_window, sync_cuda, was renamed to sync_execution.
This is because JAX runs CPU code asynchronously as well, and we would like to synchronize both CUDA and CPU computations, so sync_cuda had to be generalized to sync_execution.

Full Changelog: v0.9.1...zeus-v0.10.0

v0.9.1

07 May 04:07
cf8324c

What's new

  • For GPU power draw, we use nvmlDeviceGetFieldValues, which gives us instant power draw (instead of average power draw) for any microarchitecture.

v0.9.0: Batch size optimizer and big cleanups

06 May 16:07
0ae4de1

What's new

  • The batch size optimizer is now a full-fledged server that can be deployed independently, with Docker Compose, or on Kubernetes + KubeFlow.
  • GPU abstraction: We created an abstraction layer over GPU vendors (NVIDIA and AMD). We're on our way to supporting AMD GPUs.
  • Completely revamped documentation under https://ml.energy/zeus.

Deprecated

  • See #20 (ZeusDataLoader, ZeusMaster, and the C++ Zeus monitor)

v0.8.0: Energy-efficient large model training

13 Oct 21:34
076df3d

This release features Perseus, an optimizer for energy-efficient large model training.

See the Perseus docs for details.

v0.7.1: Moved to under `ml-energy`!

24 Sep 04:10
6082db4

We moved our repository to under ml-energy. No feature changes :)

v0.7.0: Python-based power monitor

24 Aug 21:22

What's New

  • We used to have a C++ power monitor under zeus_monitor, but we've deprecated that. There's no need for high-speed polling because NVML power counters do not update that quickly anyway.
    • In order to poll power consumption programmatically, use zeus.monitor.power.PowerMonitor.
  • CLI power & energy monitor:
    • python -m zeus.monitor power
    • python -m zeus.monitor energy
  • We switched from the old setup.py to the new package metadata standard pyproject.toml.
  • Docker image sizes are drastically smaller now! The compressed image used to be 8.48 GB, but now it's down to 2.71 GB.

v0.6.1: `approx_instant_energy`

07 Aug 21:18

What's New

approx_instant_energy in ZeusMonitor

  • Sometimes, the NVML energy counter update period is longer than the measurement window, in which case energy consumption may be returned as 0.0. In this case, when approx_instant_energy=True, ZeusMonitor will approximate the energy consumption of the window as instant power consumption multiplied by the duration of the measurement window: $$\textrm{Energy} = \int_0^T \textrm{Power}(t) \, dt \approx \textrm{Power}(T) \cdot T$$
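
The fallback described above can be sketched as a small pure function. This is an illustration of the logic only, not the ZeusMonitor source; the function name and parameters other than approx_instant_energy are hypothetical.

```python
def window_energy_joules(counter_delta_j, instant_power_w, duration_s,
                         approx_instant_energy=False):
    """Return the energy of a measurement window in joules.

    counter_delta_j: difference of the energy counter over the window.
    instant_power_w: instant power draw read at the end of the window.
    duration_s: length of the window in seconds.
    """
    if counter_delta_j == 0.0 and approx_instant_energy:
        # The counter did not update within the window, so fall back to
        # Energy ~= Power(T) * T.
        return instant_power_w * duration_s
    return counter_delta_j
```

For a 4 ms window at 250 W where the counter did not move, the approximation gives roughly 1 J instead of 0.0. When the counter did update, its value is used unchanged.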