datadog_nvml

Monitoring NVIDIA GPU status using Datadog

An Agent Check script for monitoring the state of NVIDIA GPUs with Datadog. It uses the nvidia-ml-py module.

Currently monitored items

The following items are currently collected for each GPU.

Metrics

nvml.util.gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
nvml.util.memory: Percent of time over the past sample period during which global (device) memory was being read or written.
nvml.mem.total: Total memory
nvml.mem.used: Memory in use
nvml.mem.free: Free memory
nvml.temp: Temperature
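
To illustrate where these values come from, here is a minimal sketch of reading the same six metrics per GPU through the pynvml module that nvidia-ml-py installs. It is not the code in nvml.py. The function name `collect_gpu_metrics` and the `gpu_index` tag are hypothetical, chosen for illustration. The NVML handle is passed in as a parameter so the function can be exercised without a physical GPU.

```python
def collect_gpu_metrics(nvml):
    """Return a list of (metric_name, value, tags) tuples for every GPU.

    `nvml` is any object exposing the pynvml functions used below;
    in normal use, pass the imported `pynvml` module itself.
    """
    samples = []
    nvml.nvmlInit()
    try:
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            # Hypothetical tag name; the real check may tag differently.
            tags = ["gpu_index:%d" % index]

            # Utilization over the past sample period, in percent.
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            samples.append(("nvml.util.gpu", util.gpu, tags))
            samples.append(("nvml.util.memory", util.memory, tags))

            # Device memory as reported by NVML.
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append(("nvml.mem.total", mem.total, tags))
            samples.append(("nvml.mem.used", mem.used, tags))
            samples.append(("nvml.mem.free", mem.free, tags))

            # Temperature of the GPU sensor.
            temp = nvml.nvmlDeviceGetTemperature(
                handle, nvml.NVML_TEMPERATURE_GPU)
            samples.append(("nvml.temp", temp, tags))
    finally:
        # Always release NVML, even if a query fails.
        nvml.nvmlShutdown()
    return samples
```

On a machine with an NVIDIA driver installed, calling `collect_gpu_metrics(pynvml)` after `import pynvml` should return six samples per GPU.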

REQUIRES

The nvidia-ml-py module is required.

https://pypi.python.org/pypi/nvidia-ml-py

$ sudo /opt/datadog-agent/embedded/bin/pip install nvidia-ml-py
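
As a quick sanity check after installation, a small sketch like the following can confirm that the module is importable. It assumes it is run with the same Python interpreter that the Agent uses (presumably the one alongside the embedded pip above). The helper name `nvml_module_available` is hypothetical.

```python
def nvml_module_available():
    """Return True if the pynvml module (from nvidia-ml-py) can be imported."""
    try:
        import pynvml  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("pynvml available: %s" % nvml_module_available())
```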

SETUP

Copy the two files into the checks.d and conf.d directories under /etc/dd-agent.

nvml.py: /etc/dd-agent/checks.d
nvml.yaml.default: /etc/dd-agent/conf.d

$ git clone https://github.com/ngi644/datadog_nvml.git
$ cd datadog_nvml
$ sudo cp nvml.py /etc/dd-agent/checks.d
$ sudo cp nvml.yaml.default /etc/dd-agent/conf.d

Restart the Datadog Agent.

$ sudo service datadog-agent restart
