[Sandbox] HAMi #97

Closed
2 tasks done
wawa0210 opened this issue Apr 15, 2024 · 27 comments
@wawa0210

wawa0210 commented Apr 15, 2024

Application contact emails

[email protected],[email protected]

Project Summary

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a Kubernetes (k8s) cluster.

Project Description

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a Kubernetes (k8s) cluster. It includes everything you would expect, such as:

  1. Heterogeneous AI computing device support: currently NVIDIA, Cambricon, Hygon, Huawei Ascend, and Iluvatar.
  2. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  3. Device memory control: A task can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), and it cannot exceed the specified boundary.
  4. Device type specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  5. Device UUID specification: You can specify the UUID of the device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  6. Task priority: Tasks that share the same AI computing device can be given different priorities. When resources are preempted, high-priority tasks get higher QoS.
  7. CUDA unified memory: When GPU memory is insufficient, a task can spill over into node memory.
  8. Easy to use: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than nvidia.com/gpu if you prefer.
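To make the memory-control and device-type items above concrete, here is a minimal Python sketch that builds a Pod manifest as a plain dictionary. The annotation keys and `nvidia.com/gpu` come from the list above. The resource names `nvidia.com/gpumem` and `nvidia.com/gpumem-percentage`, the comma-separated annotation value, the helper name `build_shared_gpu_pod`, and the image are assumptions made for illustration, so check them against the HAMi documentation before relying on them.

```python
import json


def build_shared_gpu_pod(name, image, gpu_count=1, gpumem_mib=None,
                         gpumem_percent=None, use_gputype=None,
                         nouse_gputype=None):
    """Return a Pod manifest (plain dict) that asks for a slice of a GPU."""
    if gpumem_mib is not None and gpumem_percent is not None:
        raise ValueError("give an absolute size or a percentage, not both")

    limits = {"nvidia.com/gpu": gpu_count}
    if gpumem_mib is not None:
        # Assumed resource name for an absolute device-memory request.
        limits["nvidia.com/gpumem"] = gpumem_mib
    if gpumem_percent is not None:
        # Assumed resource name for a percentage device-memory request.
        limits["nvidia.com/gpumem-percentage"] = gpumem_percent

    # Annotation values must be strings; the comma-separated form is an assumption.
    annotations = {}
    if use_gputype:
        annotations["nvidia.com/use-gputype"] = ",".join(use_gputype)
    if nouse_gputype:
        annotations["nvidia.com/nouse-gputype"] = ",".join(nouse_gputype)

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "annotations": annotations},
        "spec": {
            "containers": [
                {"name": name, "image": image,
                 "resources": {"limits": limits}}
            ]
        },
    }


# Hypothetical pod: one GPU share capped at 3000M, restricted to A100 devices.
pod = build_shared_gpu_pod("demo", "ubuntu:22.04",
                           gpumem_mib=3000, use_gputype=["A100"])
print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.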

The core features of HAMi are as follows:

  • Imposes a hard limit on device memory.
  • Allows partial device allocation by specifying device memory.
  • Imposes a hard limit on streaming multiprocessors.
  • Provides flexible binpack and spread scheduling policies based on GPU device and node.
  • Permits partial device allocation by specifying device core usage.
  • Requires zero changes to existing programs.
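As a rough illustration of what a hard limit on device memory means, the sketch below turns a request such as "3000M" or "50%" into a limit and checks an allocation against it. This is not HAMi's enforcement code, which intercepts device calls inside the container. The function names are hypothetical, and treating "M" as MiB is an assumption.

```python
def resolve_memory_limit_mib(request, device_total_mib):
    """Turn a request such as "3000M" or "50%" into a limit in MiB."""
    text = request.strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        if not 0 < percent <= 100:
            raise ValueError(f"percentage out of range: {request!r}")
        limit = int(device_total_mib * percent / 100)
    elif text.upper().endswith("M"):
        limit = int(text[:-1])
    else:
        raise ValueError(f"unsupported memory request: {request!r}")
    if limit <= 0 or limit > device_total_mib:
        raise ValueError(f"request does not fit the device: {request!r}")
    return limit


def allocation_allowed(current_usage_mib, new_alloc_mib, limit_mib):
    """A hard limit rejects any allocation that would exceed the cap."""
    return current_usage_mib + new_alloc_mib <= limit_mib


# Hypothetical 24 GiB device shared by two tasks.
total = 24 * 1024
print(resolve_memory_limit_mib("3000M", total))
print(resolve_memory_limit_mib("50%", total))
print(allocation_allowed(2900, 200, 3000))
```

With a hard limit, the task that asked for 3000M sees an out-of-memory error at its own cap instead of affecting the other tasks on the device.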

The HAMi architecture is as follows

[Image: HAMi architecture diagram]

Application Scenarios

  1. Device sharing (or device virtualization) on Kubernetes.
  2. Scenarios where pods need to be allocated a specific amount of device memory.
  3. Balancing GPU usage in a cluster with multiple GPU nodes.
  4. Low utilization of device memory and computing units, such as running 10 TensorFlow Serving instances on one GPU.
  5. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is shared by multiple students, and cloud platforms that offer small GPU instances.

Org repo URL (provide if all repos under the org are in scope of the application)

https://github.com/Project-HAMi

Project repo URL in scope of application

Core repo: https://github.com/Project-HAMi/HAMi

Along with the other public repos under the organization: https://github.com/Project-HAMi/

Additional repos in scope of the application

No response

Website URL

http://project-hami.io/

Roadmap

https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#roadmap

Roadmap context

Product Manufacturer MemoryIsolation CoreIsolation MultiCard support
GPU NVIDIA
MLU Cambricon
DCU Hygon
Ascend Huawei In progress In progress
GPU iluvatar In progress In progress
DPU Teco In progress In progress
  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)
  • Support Flexible scheduling policies
    • binpack
    • spread
    • numa affinity
  • Integrate gpu-operator
  • Rich observability support
  • DRA Support
  • Support Intel GPU device
  • Support AMD GPU device

Contributing Guide

https://github.com/Project-HAMi/HAMi/blob/master/CONTRIBUTING.md

Here are our community meeting minutes

https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit?usp=sharing

Code of Conduct (CoC)

https://github.com/Project-HAMi/HAMi/blob/master/CODE_OF_CONDUCT.md

Adopters

We ran a survey and found that dozens of adopters are already using HAMi. We will maintain the adopter list in the HAMi documentation later. Online survey results

Contributing or Sponsoring Org

4paradigm, DaoCloud, HuaweiCloud, Rise Union

Maintainers file

https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md

IP Policy

  • If the project is accepted, I agree the project will follow the CNCF IP Policy

Trademark and accounts

  • If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF

Why CNCF?

The CNCF is the premier organization for cloud-native technologies and is backed by many leading companies in the industry. It also provides a platform for collaboration and community-building, which can lead to increased visibility, adoption, and contributions to HAMi.

At the same time, HAMi can be combined with other outstanding projects in the ecosystem (such as Volcano, KubeRay, and Kueue) to provide a one-stop service for AI infrastructure.

Benefit to the Landscape

As AI becomes more popular, many kinds of AI accelerators are emerging. NVIDIA is the best-known vendor, but many others are also actively embracing K8s and CNCF. Giving these GPUs, NPUs, and other devices a consistent user experience on one platform is therefore particularly important, and this is exactly what HAMi focuses on. HAMi greatly simplifies the management and operation of these GPUs and NPUs on K8s, and the application layer does not need to be aware of the differences in the underlying hardware.

Cloud Native 'Fit'

HAMi is built using cloud native technology. It currently uses scheduler-plugin, webhook, device-plugin, and other technologies to manage and schedule heterogeneous AI computing devices. In the future, we will consider using DRA to optimize the architecture.

Cloud Native 'Integration'

HAMi reuses part of the source code of the NVIDIA device-plugin project to support basic NVIDIA GPU features. On top of this, we provide the following extensions for NVIDIA GPUs.

  1. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  2. Device memory control: A task can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), and it cannot exceed the specified boundary.
  3. Device type specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  4. Device UUID specification: You can specify the UUID of the device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  5. Scheduling enhancements: HAMi builds on kube-scheduler and supports binpack and spread policies at both the node level and the GPU device level.
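The difference between binpack and spread can be sketched in a few lines of Python. This is an illustration of the two policies in general, not HAMi's actual scoring function. The `Candidate` class, the `pick` helper, and the numbers are hypothetical. The same selection logic could be applied once to choose a node and again to choose a GPU on that node.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A node or a GPU, described by its device-memory capacity and usage."""
    name: str
    total_mib: int
    used_mib: int

    def usage_after(self, request_mib):
        # Fraction of capacity in use if the request were placed here.
        return (self.used_mib + request_mib) / self.total_mib


def pick(candidates, request_mib, policy):
    """Choose a candidate that fits the request under the given policy."""
    fits = [c for c in candidates
            if c.used_mib + request_mib <= c.total_mib]
    if not fits:
        return None
    if policy == "binpack":
        # Fill the busiest candidate first to keep others completely free.
        return max(fits, key=lambda c: c.usage_after(request_mib))
    if policy == "spread":
        # Prefer the least loaded candidate to even out utilization.
        return min(fits, key=lambda c: c.usage_after(request_mib))
    raise ValueError(f"unknown policy: {policy!r}")


gpus = [
    Candidate("gpu-0", 16000, 12000),
    Candidate("gpu-1", 16000, 2000),
    Candidate("gpu-2", 16000, 15000),  # too full for a 3000 MiB request
]
print(pick(gpus, 3000, "binpack").name)  # gpu-0
print(pick(gpus, 3000, "spread").name)   # gpu-1
```

Binpack tends to leave whole devices free for large jobs, while spread reduces contention between tasks that share a device.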

Cloud Native Overlap

We do not think there is direct overlap with other CNCF projects at this time. However, we do touch on some areas that other projects are investigating, namely device plugins and scheduler enhancements.

Volcano also provides the ability to share GPUs. In v1.8, the volcano-vgpu features were contributed to the Volcano repo by a HAMi maintainer. However, after discussions with the Volcano maintainers, and in order to support the independent development of the HAMi community, it was decided to release it in v1.9. Later, this functionality was transferred to the HAMi project and is now maintained by the HAMi community (repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin).

Similar projects

Some comparisons between HAMi and similar projects:
[Image: comparison of HAMi with similar projects]

Highlights

  • nvidia-device-plugin and k8s-dra-driver support only NVIDIA devices, not other heterogeneous AI computing devices.
  • nvidia-device-plugin and k8s-dra-driver focus on integrating GPUs with K8s, not on scheduling enhancements or rich observability metrics.

Comparison of GPU sharing solutions

[Image: comparison of GPU sharing solutions]

Landscape

yes

image

HAMi is in the landscape and also in the cnai group

image https://landscape.cncf.io/?group=cnai

Business Product or Service to Project separation

N/A

Project presentations

No response

Project champions

No response

Additional information

No response

@wawa0210 wawa0210 added the New New Application label Apr 15, 2024
@amye amye moved this to 📋 New in Sandbox Application Board Apr 30, 2024
@raravena80

TAG-Runtime

@angellk angellk added the Runtime label Jul 9, 2024
@mrbobbytables mrbobbytables moved this from 📋 New to 🏗 Upcoming in Sandbox Application Board Jul 12, 2024
@dims
Member

dims commented Jul 23, 2024

  • Project repo URL in scope of application lists just the main repo, are the other repos out of scope for donation?
  • is the k8s-dra-driver fork for convenience or is it really going to be a fork?

@archlitchi

  • Project repo URL in scope of application lists just the main repo, are the other repos out of scope for donation?
  • is the k8s-dra-driver fork for convenience or is it really going to be a fork?

All public repos are in scope for donation.

k8s-dra-driver was forked for convenience; we plan to build our own DRA driver.

@wawa0210
Author

  • Project repo URL in scope of application lists just the main repo, are the other repos out of scope for donation?
  • is the k8s-dra-driver fork for convenience or is it really going to be a fork?

We've been exploring how to combine HAMi with DRA, and this is currently on the roadmap as well.

@angellk
Contributor

angellk commented Jul 30, 2024

@raravena80 has TAG Runtime reviewed this project, and does it have a recommendation for the TOC?

@raravena80

They presented on May 16th, 2024.

Info:

TAG-Runtime is good with the project going to Sandbox, provided they fulfill the CNCF Sandbox admission checklist.

cc: @srust @miao0miao @rajaskakodkar

@zanetworker

zanetworker commented Aug 9, 2024

Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

FYI @angellk @raravena80 @srust @rajaskakodkar

Some feedback is still needed from the authors in the doc for completeness.

@wawa0210
Author

wawa0210 commented Aug 9, 2024

Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

FYI @angellk @raravena80 @srust @rajaskakodkar

Some feedback is still needed from the authors in the doc for completeness.

Thank you very much. I have made comments in the document and look forward to your reply.

@wawa0210
Author

wawa0210 commented Aug 9, 2024

Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

FYI @angellk @raravena80 @srust @rajaskakodkar

Some feedback is still needed from the authors in the doc for completeness.

@zanetworker

Thanks again for the very detailed and high-quality review of HAMi. I have replied to all the comments. If you have any questions, please leave a message.

I would like to clarify a few points.

1. Risk of single vendor contribution.

Because contributions were previously made in a non-standard way (direct commits, no PRs), the statistics are inaccurate. At present, DaoCloud and 4paradigm have made a similar amount of contributions.

These are the current contributor statistics: https://github.com/Project-HAMi/HAMi/graphs/contributors?from=2021-07-04&to=2024-08-09&type=c

The top eight contributors come from four different vendors (sorted by commits): 4paradigm, DaoCloud, SAP, and NIVIC.

@archlitchi 4paradigm
@wawa0210 DaoCloud
@peizhaoyou 4paradigm
@lengrongfu DaoCloud
@chaunceyjiang DaoCloud
@CoderTH DaoCloud
@haitwang-cloud SAP
@whybeyoung NIVIC
@gsakun independent

Therefore, as I understand it, there is no risk of single-vendor contribution.

Of course, we will standardize the contribution process and look for more contributors in the future.

@zanetworker

Thanks @wawa0210, I have incorporated your comments, and amended the context. Thank you for your collaboration and swift responses on the review.

@jberkus

jberkus commented Aug 9, 2024

TAG Contributor strategy has reviewed this project and found the following:

  • The contributor guide is very basic, particularly as it does not cover the current actual contributor process (as mentioned upthread).
  • HAMi does not have written governance, yet.
  • The roadmap is a brief checklist in the project README, mainly focused on future devices and device features to support. It appears to have been updated a few times over the last year.
  • There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.
  • Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.

@wawa0210
Author

wawa0210 commented Aug 12, 2024

TAG Contributor strategy has reviewed this project and found the following:

  • The contributor guide is very basic, particularly as it does not cover the current actual contributor process (as mentioned upthread).
  • HAMi does not have written governance, yet.
  • The roadmap is a brief checklist in the project README, mainly focused on future devices and device features to support. It appears to have been updated a few times over the last year.
  • There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.
  • Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.

After discussion with HAMi maintainers, we added a governance document, https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#governance

There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.

HAMi has three maintainers, and eleven community members

Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

We currently have a weekly community meeting in Chinese (this is our calendar). There is also a developer WeChat group, which currently has 137 members. Public meeting minutes and screen recordings are indeed missing, and this needs to be improved. At the same time, we also need to pay attention to internationalization.

@jberkus

jberkus commented Aug 12, 2024

We currently have a weekly community meeting in Chinese (this is our calendar). There is also a developer WeChat group, which currently has 137 members. Public meeting minutes and screen recordings are indeed missing, and this needs to be improved. At the same time, we also need to pay attention to internationalization.

Yeah, that's challenging. But, if your contributors speak Chinese, that makes sense for your meetings. And if you can get meeting notes up in Chinese, other folks can use Google Translate. For that reason, notes are better than recordings.

If you get accepted into the CNCF, you'll want to eventually cultivate a second, English-speaking community as well as your Chinese one.

@william-wang

Regarding cloud native overlap, to elaborate further: Volcano and HAMi each concentrate on distinct aspects, and the two projects collaborate closely. Taking GPU sharing as an example, Volcano offers policy-based scheduling of GPU virtualization resources, while HAMi provides the isolation of GPU memory and cores on the node. The combination of the two projects has been adopted by a number of users and has received great feedback.

@mrbobbytables
Member

/vote

git-vote bot commented Aug 20, 2024

Vote created

@mrbobbytables has called for a vote on [Sandbox] HAMi (#97).

The members of the following teams have binding votes:

Team
@cncf/cncf-toc

Non-binding votes are also appreciated as a sign of support!

How to vote

You can cast your vote by reacting to this comment. The following reactions are supported:

In favor Against Abstain
👍 👎 👀

Please note that voting for multiple options is not allowed and those votes won't be counted.

The vote will be open for 2 months 30 days 2h 52m 48s. It will pass if at least 66% of the users with binding votes vote In favor 👍. Once it's closed, results will be published here as a new comment.

@TheFoxAtWork
Contributor

The TOC would also like the project to engage with the following Kubernetes groups in addition to completing the recommendations from the TAG:

  • SIG Node,
  • SIG Scheduling,
  • Batch WG,
  • Device Management WG

@wawa0210
Author

The TOC would also like the project to engage with the following Kubernetes groups in addition to completing the recommendations from the TAG:

  • SIG Node,
  • SIG Scheduling,
  • Batch WG,
  • Device Management WG

Thank you very much for the reminder. KubeCon Hong Kong happens to start on August 21st, and HAMi maintainers will attend. We will actively reach out to members of these groups, listen to their suggestions for HAMi's future, and enrich the roadmap accordingly.

@mrbobbytables
Member

/check-vote

git-vote bot commented Aug 20, 2024

Vote status

So far 36.36% of the users with binding vote are in favor (passing threshold: 66%).

Summary

In favor Against Abstain Not voted
4 0 0 7

Binding votes (4)

User Vote Timestamp
angellk In favor 2024-08-20 21:45:19.0 +00:00:00
kevin-wangzefeng In favor 2024-08-20 19:17:22.0 +00:00:00
TheFoxAtWork In favor 2024-08-20 15:24:34.0 +00:00:00
cathyhongzhang In favor 2024-08-20 15:24:10.0 +00:00:00
@dims Pending
@rochaporto Pending
@mauilion Pending
@linsun Pending
@dzolotusky Pending
@nikhita Pending
@kgamanji Pending

Non-binding votes (1)

User Vote Timestamp
wawa0210 In favor 2024-08-20 15:51:30.0 +00:00:00

git-vote bot commented Aug 21, 2024

Votes can only be checked once a day.

@wawa0210
Author

/check-vote

git-vote bot commented Aug 21, 2024

Vote status

So far 63.64% of the users with binding vote are in favor (passing threshold: 66%).

Summary

In favor Against Abstain Not voted
7 0 0 4

Binding votes (7)

User Vote Timestamp
dzolotusky In favor 2024-08-21 13:39:57.0 +00:00:00
linsun In favor 2024-08-21 13:43:54.0 +00:00:00
angellk In favor 2024-08-20 21:45:19.0 +00:00:00
cathyhongzhang In favor 2024-08-20 15:24:10.0 +00:00:00
rochaporto In favor 2024-08-21 7:27:51.0 +00:00:00
TheFoxAtWork In favor 2024-08-20 15:24:34.0 +00:00:00
kevin-wangzefeng In favor 2024-08-20 19:17:22.0 +00:00:00
@dims Pending
@mauilion Pending
@nikhita Pending
@kgamanji Pending

Non-binding votes (4)

User Vote Timestamp
raravena80 In favor 2024-08-20 23:35:09.0 +00:00:00
archlitchi In favor 2024-08-21 1:34:09.0 +00:00:00
zanetworker In favor 2024-08-21 11:07:37.0 +00:00:00
wawa0210 In favor 2024-08-21 15:16:48.0 +00:00:00

git-vote bot commented Aug 23, 2024

Vote closed

The vote passed! 🎉

72.73% of the users with binding vote were in favor (passing threshold: 66%).

Summary

In favor Against Abstain Not voted
8 0 0 3

Binding votes (8)

User Vote Timestamp
@cathyhongzhang In favor 2024-08-20 15:24:10.0 +00:00:00
@kevin-wangzefeng In favor 2024-08-20 19:17:22.0 +00:00:00
@TheFoxAtWork In favor 2024-08-20 15:24:34.0 +00:00:00
@dzolotusky In favor 2024-08-21 13:39:57.0 +00:00:00
@linsun In favor 2024-08-21 13:43:54.0 +00:00:00
@nikhita In favor 2024-08-23 10:43:43.0 +00:00:00
@angellk In favor 2024-08-20 21:45:19.0 +00:00:00
@rochaporto In favor 2024-08-21 7:27:51.0 +00:00:00

Non-binding votes (4)

User Vote Timestamp
@raravena80 In favor 2024-08-20 23:35:09.0 +00:00:00
@archlitchi In favor 2024-08-21 1:34:09.0 +00:00:00
@zanetworker In favor 2024-08-21 11:07:37.0 +00:00:00
@wawa0210 In favor 2024-08-21 15:16:48.0 +00:00:00
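
The percentages reported by the bot follow directly from the binding vote counts: 11 TOC members are eligible, and the vote passes once at least 66% of them are in favor. The Python sketch below reproduces the three reported figures. The function name is hypothetical, and rounding to two decimals is an assumption about how the bot formats its output.

```python
def binding_vote_status(in_favor, eligible, threshold_percent=66.0):
    """Return (percentage in favor rounded to 2 decimals, whether it passes)."""
    percent = 100 * in_favor / eligible
    return round(percent, 2), percent >= threshold_percent


# Binding tallies from the three bot comments (11 eligible voters each time).
for in_favor in (4, 7, 8):
    print(in_favor, binding_vote_status(in_favor, 11))
# 4 (36.36, False)
# 7 (63.64, False)
# 8 (72.73, True)
```

Non-binding votes do not enter the calculation, which is why the fourth non-binding vote on Aug 21 left the percentage unchanged.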

@git-vote git-vote bot removed the vote open label Aug 23, 2024
@Cmierly

Cmierly commented Aug 29, 2024

Welcome and congrats on getting accepted as a CNCF Sandbox project!

You can get started on your on-boarding checklist here: #132

and if you have any questions, please don't hesitate to reach out!

@archlitchi

#132

Thanks, we'll work on it.

@mrbobbytables
Member

With #132 created we can go ahead and close this out :)

Congrats again!
