KEP-2170: Kubeflow Training V2 API #2171

Merged Aug 6, 2024 · 24 commits · Changes from 1 commit

71 changes: 61 additions & 10 deletions docs/proposals/2170-kubeflow-training-v2/README.md
@@ -402,13 +402,14 @@ spec:
nvidia.com/gpu: 2
```

The container's `torchrun` command in the above YAML will be converted into:

```bash
torchrun --nnodes=5 --nproc-per-node=2 train.py
```

Additionally, the Kubeflow Training SDK allows the user to create the above `TrainJob` using
the Python API:

```python
def train_func():
@@ -800,9 +801,6 @@ type PodSpecOverride struct {

// Override Pod's tolerations. This is needed to integrate TrainJob and Kueue
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`

}

// Override for each container.
@@ -906,8 +904,8 @@ type TrainingRuntime struct {
// JobSet spec.
JobSetSpec *batchv1.JobSetSpec `json:",inline"`

	// Spec to create PodGroup for gang-scheduling using volcano or coscheduling.
	PodGroupSpec *PodGroupSpec `json:"podGroupSpec,omitempty"`
}

// One of the specs can be selected.
@@ -921,12 +919,15 @@ type MLSpec struct {
}
```

### The PodGroupSpec API

The `PodGroupSpec` is used to create the appropriate `PodGroup` for gang-scheduling. It can
be used with Volcano or Coscheduling.
Users should set the scheduler name in the Pod's `.spec.schedulerName` field if the default
scheduler is not the same as the `PodGroup` plugin's scheduler.

```golang
type PodGroupSpec struct {
	// Plugin for gang scheduling.
	Plugin *GangSchedulerPlugin `json:"plugin,omitempty"`

@@ -942,6 +943,56 @@ const (
)
```
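
The diff view above collapses part of this block. Based on the `plugin` and `scheduleTimeoutSeconds` fields used in the YAML example below, the full API presumably looks roughly like the following sketch; the exact field names, comments, and constant values here are assumptions rather than verbatim proposal text.

```golang
// Hypothetical reconstruction of the collapsed portion of the API (assumed, not verbatim).
type PodGroupSpec struct {
	// Plugin for gang scheduling.
	Plugin *GangSchedulerPlugin `json:"plugin,omitempty"`

	// Time threshold after which a gang that cannot be fully scheduled is rejected.
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
}

// GangSchedulerPlugin identifies which gang scheduler owns the created PodGroup.
type GangSchedulerPlugin string

const (
	GangSchedulerPluginVolcano      GangSchedulerPlugin = "volcano"
	GangSchedulerPluginCoscheduling GangSchedulerPlugin = "coscheduling"
)
```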

Here is an example of a runtime with gang-scheduling using the coscheduling plugin.

```yaml
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-multi-node
spec:
  mlSpec:
    torch:
      numProcPerNode: 5
  podGroupSpec:
    plugin: coscheduling
    scheduleTimeoutSeconds: 100
  replicatedJobs:
    - name: node
      template:
        spec:
          template:
            spec:
              schedulerName: coscheduling
              containers:
                - name: trainer
                  image: docker.io/kubeflow/pytorch-mnist
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  env:
                    - name: MASTER_ADDR
                      value: "pytorch-node-0-0.pytorch"
                    - name: MASTER_PORT
                      value: "29400"
                  command:
                    - torchrun
                    - train.py
```
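
For illustration, a user would then reference this runtime from a `TrainJob`. The runtime-reference field name in this sketch is an assumption for illustration, not a field confirmed by the excerpt above:

```yaml
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: torch-mnist
spec:
  # Assumed field name for referencing the runtime in this sketch.
  trainingRuntimeRef:
    name: torch-distributed-multi-node
```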

Training Operator will create the `PodGroup` using the following spec:
Review comment:

This makes sense. Naive question: does this imply that any custom gang scheduler used is required to make a `PodGroup` to function? Or if you have a plugin that doesn't create a `PodGroup` (and does its own thing), would the group be made anyway? And from the example above, how is `minMember` of 5 derived? You would probably need custom logic per scheduler backend for this API. For example, the `PodGroup` objects created by coscheduling and volcano.sh are different underlying objects. (https://github.com/kubernetes-sigs/scheduler-plugins/blob/d67d2c15199595ccc6218574bd8e07b68b7146f4/apis/scheduling/v1alpha1/types.go#L128 and https://github.com/volcano-sh/apis/blob/78d912ce096c9fd9120488b5c7b999d4996f2cb5/pkg/apis/scheduling/types.go#L147)

Review comment:

If this Training API is going to take on the challenge of orchestrating the logic for creation of the needed `PodGroup` (varying by the plugin), then this probably works OK. Otherwise, I'd still place the responsibility on the user for now for creating the pod group, but just allow the schedulerName and other labels to use it in whatever turns into the actual job or podspec.

Member Author:

> does this imply that any custom gang scheduler used is required to make a PodGroup to function?

The `PodGroup` creation will be implemented only for the supported gang schedulers (initially Volcano and Coscheduling), since different schedulers require different `PodGroup` resources to be created.

> And from the example above, how is minMember of 5 derived?

`minMember` is always equal to `numNodes`. For ML training, all gang members must be alive to execute training. In the future, if we find other use cases (e.g. HPC), we can discuss it again.

> If this Training API is going to take on the challenge of orchestrating the logic for creation of the needed PodGroup (varying by the plugin), then this probably works OK.

Yeah, that's correct.

> Otherwise, I'd still place the responsibility on the user for now for creating the pod group, but just allow the schedulerName and other labels to use it in whatever turns into the actual job or podspec.

Btw, this will also be supported, since users can simply omit the `PodGroupSpec` parameter and set the `.spec.schedulerName` field in the PodSpec.


```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  scheduleTimeoutSeconds: 100
  minMember: 5
```
Member:

That is a great PodGroup specification unveil!


The `TrainJob` will be started only when 5 GPUs are available in the cluster.
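
To make the `minMember` rule from the review thread concrete, a controller building the coscheduling `PodGroup` could look roughly like the sketch below. The helper name and its parameters are illustrative, not actual Training Operator code; the `PodGroup` types come from the scheduler-plugins API linked in the thread, and Volcano would need a different underlying object.

```golang
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	schedulingv1alpha1 "sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// buildCoschedulingPodGroup is an illustrative helper applying the rule discussed above:
// minMember is always equal to numNodes, so the TrainJob starts only when every training
// node can be scheduled at the same time.
func buildCoschedulingPodGroup(trainJobName, namespace string, numNodes, scheduleTimeoutSeconds int32) *schedulingv1alpha1.PodGroup {
	return &schedulingv1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:      trainJobName,
			Namespace: namespace,
		},
		Spec: schedulingv1alpha1.PodGroupSpec{
			MinMember:              numNodes,
			ScheduleTimeoutSeconds: &scheduleTimeoutSeconds,
		},
	}
}
```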

### The Torch Spec API

The `TorchSpec` API represents the configuration for PyTorch distributed training. This configuration