KEP-2170: Kubeflow Training V2 API #2171

Merged: google-oss-prow merged 24 commits into kubeflow:master from andreyvelich:training-v2-proposal on Aug 6, 2024.
Changes from 1 commit. Commits (24):
- 6335970 KEP-2170: Kubeflow Training V2 API (andreyvelich)
- e6ca5e1 Fix some comments (andreyvelich)
- f6692c7 Add user roles diagram (andreyvelich)
- aaa88f2 Move diagrams after design (andreyvelich)
- 39c23ba Update diagram (andreyvelich)
- 3b43220 Refactor Model and Dataset configs (andreyvelich)
- edefff2 Update runtime timelines (andreyvelich)
- 38ed8f9 Address readability comments (andreyvelich)
- a87177b Explaination for Trainer (andreyvelich)
- 47dbd31 Update LLM Fine-Tuning Diagram (andreyvelich)
- 57d9591 Fix Llama model name (andreyvelich)
- bc373d9 Add goal for integration with Kueue (andreyvelich)
- 3f5d7bb Add links for Job run policies (andreyvelich)
- 619d167 Add some alternatives (andreyvelich)
- 1c423ab Fix more API types (andreyvelich)
- ca23867 Fix empty number of nodes (andreyvelich)
- 77170da Rename to Coscheduling (andreyvelich)
- 9acf8a3 Change parameters to env (andreyvelich)
- 287a4a4 Update PodSpecOverride with scheduling directives (andreyvelich)
- be8177b Fix TrainingRuntime field (andreyvelich)
- f80e780 Refactor PodGroupSpec APIs (andreyvelich)
- 08fec42 Add note about scheduler name (andreyvelich)
- d1c1994 Add initial TrainJob status field (andreyvelich)
- 87ed153 Fix pre-commit (andreyvelich)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@@ -402,13 +402,14 @@ spec:
         nvidia.com/gpu: 2
 ```

-The above command will be converted to:
+The container's `torchrun` command in the above YAML will be converted into:

 ```bash
 torchrun --nnodes=5 --nproc-per-node=2 train.py
 ```

-Additionally, the Kubeflow Training SDK allows the user to create the above `TrainJob` using the Python API:
+Additionally, the Kubeflow Training SDK allows the user to create the above `TrainJob` using
+the Python API:

 ```python
 def train_func():
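The flag mapping described in this hunk (numNodes becomes `--nnodes`, numProcPerNode becomes `--nproc-per-node`) can be sketched in a few lines. `build_torchrun_command` is a hypothetical helper for illustration, not part of the Training Operator codebase:

```python
def build_torchrun_command(num_nodes: int, num_proc_per_node: int, script: str) -> str:
    """Sketch of the described rewrite of the trainer container command:
    numNodes -> --nnodes, numProcPerNode -> --nproc-per-node."""
    return f"torchrun --nnodes={num_nodes} --nproc-per-node={num_proc_per_node} {script}"

# For the runtime above (5 nodes, 2 processes per node):
print(build_torchrun_command(5, 2, "train.py"))
# -> torchrun --nnodes=5 --nproc-per-node=2 train.py
```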
@@ -800,9 +801,6 @@ type PodSpecOverride struct {

-	// Override Pod's tolerations. This is needed to integrate TrainJob and Kueue
-	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
-
 	// Custom scheduler for TrainJob, for example YuniKorn.
 	SchedulerName string `json:"schedulerName,omitempty"`
 }

 // Override for each container.
@@ -906,8 +904,8 @@ type TrainingRuntime struct {
 	// JobSet spec.
 	JobSetSpec *batchv1.JobSetSpec `json:",inline"`

-	// For gang-scheduling using volcano or scheduler plugins, supported for all frameworks.
-	GangScheduler *GangScheduler `json:"gangScheduler,omitempty"`
+	// Spec to create PodGroup for gang-scheduling using volcano or coscheduling.
+	PodGroupSpec *PodGroupSpec `json:"podGroupSpec,omitempty"`
 }

 // One of the specs can be selected.
@@ -921,12 +919,15 @@ type MLSpec struct {
 }
 ```

-### The Gang Scheduler API
+### The PodGroupSpec API

-Gang scheduler plugin is used to create the appropriate `PodGroup` for Volcano or scheduler plugins.
+The `PodGroupSpec` is used to create the appropriate `PodGroup` for gang-scheduling. It can
+be used with Volcano or Coscheduling.
+User should add the scheduler name into Pod's `.spec.schedulerName` if the default scheduler is
+not the same as the `PodGroup` plugin.

 ```golang
-type GangScheduler struct {
+type PodGroupSpec struct {
 	// Plugin for gang scheduling.
 	Plugin *GangSchedulerPlugin `json:"plugin,omitempty"`
@@ -942,6 +943,56 @@ const (
 )
 ```

+Here is an example of a runtime with gang-scheduling using the coscheduling plugin.
+
+```yaml
+apiVersion: kubeflow.org/v2alpha1
+kind: ClusterTrainingRuntime
+metadata:
+  name: torch-distributed-multi-node
+spec:
+  mlSpec:
+    torch:
+      numProcPerNode: 5
+  podGroupSpec:
+    plugin: coscheduling
+    scheduleTimeoutSeconds: 100
+  replicatedJobs:
+    - name: node
+      template:
+        spec:
+          template:
+            spec:
+              schedulerName: coscheduling
+              containers:
+                - name: trainer
+                  image: docker.io/kubeflow/pytorch-mnist
+                  resources:
+                    limits:
+                      nvidia.com/gpu: 1
+                  env:
+                    - name: MASTER_ADDR
+                      value: "pytorch-node-0-0.pytorch"
+                    - name: MASTER_PORT
+                      value: "29400"
+                  command:
+                    - torchrun train.py
+```
+
+Training Operator will create the `PodGroup` using the following spec:
+
+```yaml
+apiVersion: scheduling.x-k8s.io/v1alpha1
+kind: PodGroup
+metadata:
+  name: nginx
+spec:
+  scheduleTimeoutSeconds: 100
+  minMember: 5
+```
Comment: That is a great PodGroup specification unveil!
+
+The `TrainJob` will be started only when 5 GPUs are available in the cluster.
+
 ### The Torch Spec API

 The `TorchSpec` API represents the configuration for the PyTorch distributed training. This configuration
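The `PodGroup` creation rule the example diff implies (copy `scheduleTimeoutSeconds` from the runtime, derive `minMember` from the node count) can be sketched as follows. `build_coscheduling_pod_group` is a hypothetical helper, not the actual Training Operator implementation:

```python
def build_coscheduling_pod_group(name: str, num_nodes: int, timeout_seconds: int) -> dict:
    """Sketch of deriving a coscheduling PodGroup from runtime values.
    minMember equals the number of training nodes, so the whole gang must be
    schedulable before any Pod starts."""
    return {
        "apiVersion": "scheduling.x-k8s.io/v1alpha1",
        "kind": "PodGroup",
        "metadata": {"name": name},
        "spec": {
            "scheduleTimeoutSeconds": timeout_seconds,
            "minMember": num_nodes,
        },
    }

# Mirrors the PodGroup spec shown in the diff above:
pg = build_coscheduling_pod_group("nginx", 5, 100)
```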
Comment: This makes sense. Naive question: does this imply that any custom gang scheduler used is required to create a `PodGroup` to function? Or if you have a plugin that doesn't create a `PodGroup` (and does its own thing), would the group be made anyway? And from the example above, how is the `minMember` of 5 derived? You would probably need custom logic per scheduler backend for this API. For example, the `PodGroup` created by coscheduling vs. volcano.sh are different underlying objects (https://github.com/kubernetes-sigs/scheduler-plugins/blob/d67d2c15199595ccc6218574bd8e07b68b7146f4/apis/scheduling/v1alpha1/types.go#L128 and https://github.com/volcano-sh/apis/blob/78d912ce096c9fd9120488b5c7b999d4996f2cb5/pkg/apis/scheduling/types.go#L147).
Comment: If this Training API is going to take on the challenge of orchestrating the logic for creating the needed `PodGroup` (varying by the plugin), then this probably works OK. Otherwise, I'd still place the responsibility on the user for now for creating the `PodGroup`, but just allow the `schedulerName` and other labels to use it in whatever turns into the actual job or Pod spec.
Comment: The `PodGroup` creation will be implemented only for the supported gang-schedulers (initially Volcano and Coscheduling), since different schedulers require different `PodGroup` objects to be created. `minMember` is always equal to `numNodes`: for ML training, all gangs should be alive to execute training. In the future, if we find other use-cases (e.g. HPC), we can discuss it again.

Yeah, that's correct.

Btw, this also will be supported, since users can just ignore the `PodGroupSpec` parameter and set the `.spec.schedulerName` field in the PodSpec.
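The per-backend branching discussed in this thread could look roughly like the sketch below. `pod_group_for_plugin` is a hypothetical helper with deliberately simplified specs; the real coscheduling and Volcano `PodGroup` types carry more fields than shown:

```python
def pod_group_for_plugin(plugin: str, name: str, num_nodes: int) -> dict:
    """Sketch of per-plugin PodGroup construction. Each supported
    gang-scheduler expects its own PodGroup object, so the controller
    has to branch per backend."""
    if plugin == "coscheduling":
        api_version = "scheduling.x-k8s.io/v1alpha1"
    elif plugin == "volcano":
        api_version = "scheduling.volcano.sh/v1beta1"
    else:
        raise ValueError(f"no PodGroup support for plugin {plugin!r}")
    return {
        "apiVersion": api_version,
        "kind": "PodGroup",
        "metadata": {"name": name},
        # minMember == numNodes: all training nodes must be schedulable.
        "spec": {"minMember": num_nodes},
    }
```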