# KEP-498: Group Scheduling Support

<!--
This is the title of your KEP. Keep it short, simple, and descriptive. A good
title can help communicate what the KEP is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template.

Ensure the TOC is wrapped with
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
tags, and then generate with `hack/update-toc.sh`.
-->

<!-- toc -->
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1: Group scheduling of all replicated jobs](#story-1-group-scheduling-of-all-replicated-jobs)
  - [Example JobSet spec using group scheduling for all replicated jobs:](#example-jobset-spec-using-group-scheduling-for-all-replicated-jobs)
    - [Story 2: Group scheduling of a specific replicated job](#story-2-group-scheduling-of-a-specific-replicated-job)
  - [Example JobSet spec using group scheduling for a specific replicated job:](#example-jobset-spec-using-group-scheduling-for-a-specific-replicated-job)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Proposed Group Scheduling API](#proposed-group-scheduling-api)
  - [Constraints](#constraints)
  - [Implementation](#implementation)
    - [Example JobSet spec AFTER webhook injection:](#example-jobset-spec-after-webhook-injection)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit Tests](#unit-tests)
    - [Integration tests](#integration-tests)
  - [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->


## Motivation

Many users of managed K8s services on cloud providers make use of NAP (node auto provisioning), which creates node pools for pending/unschedulable pods based on those pods' requirements (e.g., CPU/memory requirements, GPU/TPU requirements, etc.).

Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and its pods land there and begin running, but eventually time out before the other slices finish provisioning and their pods become ready.

For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, for JobSets running independently (without Kueue) on dynamically provisioned clusters, we need to support group scheduling semantics in order to avoid these timeout issues.

### Goals

- Enhance the JobSet API with options allowing the user to define a group scheduling policy.

### Non-Goals

- Upstream Kubernetes changes that would improve/simplify group scheduling support in JobSet.

## Proposal

### User Stories (Optional)

#### Story 1: Group scheduling of all replicated jobs

As a user, in order to make efficient use of expensive accelerator (GPU/TPU) resources, I use node
auto-provisioning to provision infrastructure on an as-needed basis when a pending workload requires
it, then deprovision it once it's no longer in use. Since the completion time of these provisioning
operations is variable, I want to make sure JobSet pods which land on the earliest provisioned nodes
do not start executing the main container until all the infrastructure is provisioned and
all pods have started. Otherwise, the distributed initialization step will time out in the earliest
pods while waiting for the remaining provisioning to complete.


### Example JobSet spec using group scheduling for all replicated jobs:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset
  namespace: default
spec:
  # main containers in pods part of all replicatedJobs will be blocked
  # from execution until all pods have started, or timeout after 5min.
  groupSchedulingConfig:
    timeoutAfterSeconds: 300
  replicatedJobs:
  - name: workers
    replicas: 3
    template:
      spec:
        backoffLimit: 0
        completions: 100
        parallelism: 100
        template:
          spec:
            containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command:
              - python3
              - train.py
```


#### Story 2: Group scheduling of a specific replicated job

As a user, I have a JobSet with 2 replicated jobs:
- `workers` which contains my primary batch workload, running on nodes with accelerator chips (GPUs)
- `auxiliary` which contains auxiliary workloads (proxy server, metrics service) running on CPU nodes.

I want my batch workload (`workers`) to be scheduled as a group, but the `auxiliary` pods can start up
individually at any time.

### Example JobSet spec using group scheduling for a specific replicated job:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset
  namespace: default
spec:
  # main containers in pods part of replicatedJob `workers` will be blocked
  # from execution until all pods have started, or timeout after 5min.
  groupSchedulingConfig:
    timeoutAfterSeconds: 300
    targetReplicatedJobs:
    - workers
  replicatedJobs:
  - name: workers
    replicas: 3
    template:
      spec:
        backoffLimit: 0
        completions: 100
        parallelism: 100
        template:
          spec:
            containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command:
              - python3
              - train.py
  - name: auxiliary
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: auxiliary
              image: python:3.10
              command:
              - python3
              - run.py
```

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations

<!--
What are the risks of this proposal, and how do we mitigate? Think broadly.
For example, consider both security and how this will impact the larger
Kubernetes ecosystem.

How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->

## Design Details

### Proposed Group Scheduling API

```go
// JobSetSpec defines the desired state of JobSet
type JobSetSpec struct {
  // GroupSchedulingConfig defines the desired group-scheduling behavior
  // for the JobSet. If unspecified, every pod part of this JobSet will
  // start as soon as possible, without waiting for the others.
  // +optional
  GroupSchedulingConfig *GroupSchedulingConfig `json:"groupSchedulingConfig,omitempty"`
}

// GroupSchedulingConfig defines the desired group-scheduling behavior for the
// JobSet. If defined, the main container in pods part of the
// TargetReplicatedJobs will be blocked from executing until all pods in these
// jobs have started.
type GroupSchedulingConfig struct {
  // TimeoutAfterSeconds defines the period after which the injected initContainer
  // (which blocks execution of the main container until all pods are started)
  // will time out and exit with an error if not all pods have started yet.
  TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds"`

  // TargetReplicatedJobs are the names of the replicated jobs which will
  // be subject to group-scheduling.
  // A null or empty list will apply to all replicatedJobs.
  // +optional
  // +listType=atomic
  TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
}
```

### Constraints

- The in-order startup policy is incompatible with a group scheduling config that spans
  multiple replicated jobs, since they are conceptually mutually exclusive. We must
  add validation to our JobSet webhook to enforce this (see the sketch below).
- The injected initContainer will use the reserved name `group-scheduling-init`, which
  cannot be used by the main container.
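
To make the first constraint concrete, here is a minimal sketch of the extra check the JobSet
validating webhook could perform. The startup policy field names (`StartupPolicy`,
`StartupPolicyOrder`, `InOrder`) are written from memory and should be treated as illustrative
rather than authoritative; `GroupSchedulingConfig` is the field proposed above.

```go
package webhooks

import (
  "fmt"

  jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2"
)

// validateGroupScheduling rejects JobSets that combine an in-order startup
// policy with a group scheduling config spanning multiple replicated jobs.
// Field names for the startup policy are illustrative, not final.
func validateGroupScheduling(js *jobset.JobSet) error {
  cfg := js.Spec.GroupSchedulingConfig
  if cfg == nil {
    return nil
  }
  // A nil or empty target list means the config applies to all replicated jobs.
  spansMultipleJobs := len(cfg.TargetReplicatedJobs) != 1 && len(js.Spec.ReplicatedJobs) > 1
  inOrder := js.Spec.StartupPolicy != nil &&
    js.Spec.StartupPolicy.StartupPolicyOrder == jobset.InOrder
  if inOrder && spansMultipleJobs {
    return fmt.Errorf("in-order startup policy cannot be combined with a group scheduling config that spans multiple replicated jobs")
  }
  return nil
}
```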

### Implementation


A new pod controller will be added in `group_scheduling_pod_controller.go`, which
reconciles leader pods that are scheduled and whose JobSet has a group scheduling
config defined.

When a GroupSchedulingConfig is defined, the flow is as follows (a sketch of the controller logic appears after the list):

1. A bash initContainer will be injected into the JobTemplateSpec of every
TargetReplicatedJob by the existing pod mutating webhook.
The initContainer will run a simple bash loop, checking whether a particular
directory path exists. Once the path exists, the initContainer will exit successfully.
2. A shared ConfigMap will be created and mounted into the initContainer as an
optional volume mount. The ConfigMap will be empty to start, so the mount
path will not exist in the initContainer yet.
3. The pod controller will reconcile on leader pods, listing and
counting the number of pods which have started in the JobSet. If the started
count is less than the expected count, re-enqueue the pod for reconciliation
after a short interval.
4. Once the expected number of pods for the JobSet have started (derived from the
JobSet spec), the pod controller will update the ConfigMap with some arbitrary data
so that it is no longer empty. This will trigger Kubelet to populate
the initContainer mount point.
5. Now that the mount point directory path exists in the initContainer, it
will exit and allow the main container to start.
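
Below is a minimal sketch of the controller logic in steps 3 and 4, using controller-runtime.
The ConfigMap naming convention (`{jobset-name}-group-scheduling`), the definition of "started"
(simplified here to the `Running` phase), the label key, and the 5-second requeue interval are
illustrative assumptions, not final implementation details.

```go
package controllers

import (
  "context"
  "time"

  corev1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/types"
  ctrl "sigs.k8s.io/controller-runtime"
  "sigs.k8s.io/controller-runtime/pkg/client"

  jobset "sigs.k8s.io/jobset/api/jobset/v1alpha2"
)

// GroupSchedulingPodReconciler is a sketch of the proposed pod controller.
type GroupSchedulingPodReconciler struct {
  client.Client
}

// Reconcile counts the started pods of the leader pod's JobSet and, once the
// expected count is reached, writes a key into the shared ConfigMap so kubelet
// materializes the mount path and the injected initContainers exit.
func (r *GroupSchedulingPodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  var leaderPod corev1.Pod
  if err := r.Get(ctx, req.NamespacedName, &leaderPod); err != nil {
    return ctrl.Result{}, client.IgnoreNotFound(err)
  }
  // JobSet's well-known label identifying which JobSet a pod belongs to.
  jobSetName := leaderPod.Labels["jobset.sigs.k8s.io/jobset-name"]

  // Derive the expected pod count from the JobSet spec.
  var js jobset.JobSet
  if err := r.Get(ctx, types.NamespacedName{Namespace: leaderPod.Namespace, Name: jobSetName}, &js); err != nil {
    return ctrl.Result{}, err
  }
  expected := 0
  for _, rj := range js.Spec.ReplicatedJobs {
    // Filtering by TargetReplicatedJobs is omitted for brevity.
    parallelism := int32(1)
    if rj.Template.Spec.Parallelism != nil {
      parallelism = *rj.Template.Spec.Parallelism
    }
    expected += int(rj.Replicas) * int(parallelism)
  }

  // Count pods of this JobSet that have started.
  var pods corev1.PodList
  if err := r.List(ctx, &pods, client.InNamespace(leaderPod.Namespace),
    client.MatchingLabels{"jobset.sigs.k8s.io/jobset-name": jobSetName}); err != nil {
    return ctrl.Result{}, err
  }
  started := 0
  for _, p := range pods.Items {
    if p.Status.Phase == corev1.PodRunning {
      started++
    }
  }
  if started < expected {
    // Not all pods have started yet; check again shortly.
    return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
  }

  // All pods have started: add data to the shared ConfigMap so it is no longer
  // empty. The "started" key becomes the file the initContainers wait for.
  var cm corev1.ConfigMap
  cmKey := types.NamespacedName{Namespace: leaderPod.Namespace, Name: jobSetName + "-group-scheduling"}
  if err := r.Get(ctx, cmKey, &cm); err != nil {
    return ctrl.Result{}, err
  }
  if cm.Data == nil {
    cm.Data = map[string]string{}
  }
  cm.Data["started"] = "true"
  return ctrl.Result{}, r.Update(ctx, &cm)
}
```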


#### Example JobSet spec AFTER webhook injection:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset
  namespace: default
spec:
  # main containers in pods part of all replicatedJobs will be blocked
  # from execution until all pods have started, or timeout after 5min.
  groupSchedulingConfig:
    timeoutAfterSeconds: 300
  replicatedJobs:
  - name: workers
    replicas: 3
    template:
      spec:
        backoffLimit: 0
        completions: 100
        parallelism: 100
        template:
          spec:
            initContainers:
            # Check if the started file is present before exiting and starting the main container.
            - name: group-scheduling-init
              image: bash
              command: # TODO: add support for configurable timeout parameter
              - bash
              - -c
              - while true; do if [ -f /mnt/group-scheduling/started ]; then break; fi; sleep 1; done
              volumeMounts:
              - mountPath: /mnt/group-scheduling
                name: group-scheduling
            containers:
            - name: pytorch
              image: pytorch/pytorch:latest
              command:
              - python3
              - train.py
            volumes:
            - name: group-scheduling
              configMap:
                # Format: {jobset-name}-group-scheduling
                name: my-jobset-group-scheduling
                optional: true
```
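
The `TODO` above notes that `timeoutAfterSeconds` is not yet wired into the injected command.
One way the mutating webhook could render it into the bash loop is sketched below; the helper
name, mount path, and exact script are illustrative assumptions rather than the final design.

```go
package webhooks

import "fmt"

// buildInitCommand renders the blocking loop for the injected initContainer.
// The loop exits successfully once the ConfigMap-backed file appears, or exits
// non-zero once timeoutAfterSeconds has elapsed.
func buildInitCommand(timeoutAfterSeconds int32) []string {
  script := fmt.Sprintf(
    "end=$((SECONDS+%d)); "+
      "while [ ! -f /mnt/group-scheduling/started ]; do "+
      "if [ \"$SECONDS\" -ge \"$end\" ]; then echo 'timed out waiting for group start' >&2; exit 1; fi; "+
      "sleep 1; "+
      "done",
    timeoutAfterSeconds)
  return []string{"bash", "-c", script}
}
```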


### Test Plan

The testing plan will focus on:
- Unit and integration tests to validate the webhook injection works as intended
- E2E tests, since we need pods to be created to inject the initContainer and ConfigMap volume mount.
- Scale tests, performance benchmarking

##### Prerequisite testing updates

<!--
Based on reviewers feedback describe what additional tests need to be added prior
implementing this enhancement to ensure the enhancements have also solid foundations.
-->

#### Unit Tests

<!--
In principle every added code should have complete unit test coverage, so providing
the exact set of tests will not bring additional value.
However, if complete unit test coverage is not possible, explain the reason of it
together with explanation why this is acceptable.
-->

<!--
Additionally, try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
in the form of:
- <package>: <date> - <current test coverage>

This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `controllers`: `04/05/2024` - `25.4%`

#### Integration tests
- JobSet webhook integration tests validating that for JobSets with group scheduling
config defined, the ConfigMap is created, and the correct child Jobs have the expected
initContainers and ConfigMap volume mounts injected.
- JobSet controller integration tests validating that for JobSets with group scheduling
config defined, the ConfigMap used for broadcasting the startup signal is created.

### Graduation Criteria

<!--

Clearly define what it means for the feature to be implemented and
considered stable.

If the feature you are introducing has high complexity, consider adding graduation
milestones with these graduation criteria:
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
- [Feature gate][feature gate] lifecycle
- [Deprecation policy][deprecation-policy]

[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
-->


## Implementation History

- KEP published April 5th, 2024

## Drawbacks

Once the ConfigMap has data, it can take up to a full Kubelet sync period (~1 minute) for the mount
path to be populated on the nodes, which delays startup of the workload. We could force Kubelet to
update the pod earlier by updating the pod annotations, but this would require O(number of pods)
calls to the apiserver and writes to etcd, which is not scalable.

## Alternatives

One alternative would be for the initContainer to run Go code which uses a ConfigMap Lister
to maintain a local cache of ConfigMaps in the cluster and receive updates via watch events.
The initContainer in each pod could increment a counter in the ConfigMap, and once the counter
equals the expected number of pods in the JobSet, the initContainer would exit.

However, even using Listers to reduce strain on the apiserver, this is not scalable, since there
would be many request retries when incrementing the counter (the ConfigMap resource version
changes with every write, and writes referencing an old resource version are rejected and must
be retried).
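
For illustration, each initContainer would need roughly the following optimistic-concurrency loop
(a sketch using client-go's `retry.RetryOnConflict`; the key name and helper are hypothetical).
With N pods racing to increment the same key, most writes conflict and must be retried, which is
what makes this approach scale poorly.

```go
package main

import (
  "context"
  "strconv"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/util/retry"
)

// incrementStartedCount bumps a counter key in the shared ConfigMap, retrying
// whenever the write conflicts with another pod's concurrent update.
func incrementStartedCount(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
  return retry.RetryOnConflict(retry.DefaultRetry, func() error {
    cm, err := c.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
      return err
    }
    if cm.Data == nil {
      cm.Data = map[string]string{}
    }
    count, _ := strconv.Atoi(cm.Data["started"]) // a missing key parses as 0
    cm.Data["started"] = strconv.Itoa(count + 1)
    // The Update is rejected with a conflict error if the ConfigMap's
    // resourceVersion changed since the Get above (another pod won the race);
    // RetryOnConflict then re-reads and retries.
    _, err = c.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
    return err
  })
}
```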