
KEP 498: Synchronized Startup Support for JobSets #499

Conversation

@danielvegamyhre (Contributor) commented Apr 5, 2024

Design for #498

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 5, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2024
netlify bot commented Apr 5, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | bc9f18a |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/6610515e717d350008df2864 |

@kannon92 (Contributor) commented Apr 5, 2024

nit: Fixes means that this issue will get closed when the KEP PR merges.

@danielvegamyhre (Contributor, Author)

> nit: Fixes means that this issue will get closed when the KEP PR merges.

Updated, thanks!


For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, for dynamically provisioned clusters, we have no support for group scheduling in JobSet yet.
Contributor:

So kueue will support dynamically provisioned clusters if there is support in it for jobset?

Contributor (Author):

I don't think Kueue will support dynamic provisioning: we need pods to be pending in order to trigger dynamic provisioning for them, but Kueue keeps the Jobs in a suspended state (i.e., no pods created) until there are sufficient resources for them.

I meant we have no group scheduling support for JobSets that are being run independently without Kueue. This specifically becomes an issue on clusters using dynamic node pool provisioning. I updated this section for clarity.

initContainers:
# Check if the started file is present before exiting and starting the main container.
- name: group-scheduling-init
image: bash
Contributor:

Will we provide configuration on the init container that gets injected?

Contributor:

Generally I wonder about the image, registry settings, image pull secrets.

Contributor (Author):

I was thinking of either using bash:latest or bash at a specific tag we've tested.

The command executed in the initContainer will be injected by the JobSet webhook, see the example of the JobSet spec after injection in the implementation section.
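
Purely as an illustration of the injection mechanism (not the KEP's actual implementation), a webhook-side sketch could look roughly like the following; the function name, image choice, marker-file path, and polling command are assumptions, not part of the proposal:

```go
// Sketch only: how the JobSet webhook might append the synchronization
// init container to a pod spec. The function name, image, and marker-file
// path are illustrative assumptions, not the KEP's actual design.
package webhook

import (
	corev1 "k8s.io/api/core/v1"
)

func injectGroupStartupInitContainer(spec *corev1.PodSpec) {
	spec.InitContainers = append(spec.InitContainers, corev1.Container{
		Name:  "group-scheduling-init",
		Image: "bash", // could be pinned to a tested tag or mirrored into a private registry
		// Block until a marker file indicates that all pods in the group have started.
		Command: []string{
			"bash", "-c",
			"until [ -f /tmp/jobset/started ]; do sleep 1; done",
		},
	})
}
```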

Contributor:

Yea, I'm concerned that some users may require that all containers have image pull secrets. Or they want to use a special registry for their containers. Maybe they want to avoid pulling images from dockerhub or want to use quay as their registry.

Contributor:

Maybe this is YAGNI.

Contributor:

For example, in OpenShift we also have a lot of cases where people use security contexts, and volumes may have SELinux label issues.
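
To make these concerns (custom registries, image pull secrets, security contexts) concrete, a hypothetical configuration surface, not proposed in this KEP, might look like the sketch below; all field names are illustrative:

```go
// Hypothetical configuration surface for the injected init container;
// none of these fields are part of the KEP. They only illustrate what a
// registry / pull-secret / security-context knob could look like.
package v1alpha2

import (
	corev1 "k8s.io/api/core/v1"
)

type GroupStartupInitContainerConfig struct {
	// Image for the injected init container, e.g. a bash image mirrored
	// into a private registry instead of Docker Hub.
	Image string `json:"image,omitempty"`

	// ImagePullSecrets added to the pod so the image above can be pulled.
	ImagePullSecrets []corev1.LocalObjectReference `json:"imagePullSecrets,omitempty"`

	// SecurityContext applied to the injected container (e.g. to satisfy
	// OpenShift SELinux constraints).
	SecurityContext *corev1.SecurityContext `json:"securityContext,omitempty"`
}
```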


For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
Contributor:

Suggested change
However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
However, JobSets running independently (without Kueue) on dynamically provisioned clusters, we need to support group


Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and pods land there and begin running, but eventually timeout before the other slices all finish provisioning and pods land there and become ready.

For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.
Contributor:

Maybe @tenzen-y, @alculquicondor or @mimowo can comment on Kueue and dynamic clusters?

I know that wg-batch did some work in autoscaling and I'm curious how this feature is related.

@danielvegamyhre (Contributor, Author), Apr 5, 2024:

I'm curious as well. For additional context, we are talking about NAP (not CA) here.

@danielvegamyhre danielvegamyhre changed the title KEP 498: Group Scheduling Support for JobSets KEP 498: Group Startup Support for JobSets Apr 9, 2024
@alculquicondor

/cc

@tenzen-y (Member)

/cc

// Timeout defines the period after which the injected initContainer
// (which blocks execution of the main container until all pods are started)
// will timeout and exit with an error if not all pods have started yet.
TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds"`
Contributor:

Reading 5 minutes to mean 300s, I wonder if we should consider durations as an API?

I guess k8s has a lot of APIs in seconds, so maybe it's not a big deal.
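
As a sketch of the two options being weighed here (illustrative only, not what the KEP adopts), the field could stay as integer seconds, as in the quoted snippet, or use metav1.Duration:

```go
// Sketch of the two alternatives discussed above; only one would exist in
// practice, and the struct/field names are illustrative rather than final.
package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type StartupSyncOptions struct {
	// Option 1: integer seconds, consistent with fields such as
	// activeDeadlineSeconds elsewhere in Kubernetes APIs.
	TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds,omitempty"`

	// Option 2: a metav1.Duration, which accepts values like "5m" and
	// spares callers from converting minutes to seconds by hand.
	Timeout *metav1.Duration `json:"timeout,omitempty"`
}
```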

@danielvegamyhre danielvegamyhre changed the title KEP 498: Group Startup Support for JobSets KEP 498: Synchronized Startup Support for JobSets Apr 10, 2024

### Constraints

- In Order Startup Policy is incompatible with a group scheduling config that spans
Contributor:

I think we should comment on when someone should use startup policy or this feature.

For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
scheduling semantics in order to avoid these timeout issues.


Suggested change
scheduling semantics in order to avoid these timeout issues.
startup semantics in order to avoid these timeout issues.


### User Stories (Optional)

#### Story 1: Group scheduling of all replicated jobs


Suggested change
#### Story 1: Group scheduling of all replicated jobs
#### Story 1: Group startup of all replicated jobs

#### Story 1: Group scheduling of all replicated jobs

As a user, in order to make efficient use of expensive accelerator (GPU/TPU) resources, I use node
auto-provisioning to provision infrastructure on an as-needed basis when a pending workload requires


The node auto-provisioner is GKE-specific. However, a similar problem arises in any autoscaled environment.


Many users of managed K8s services on cloud providers make use of NAP (node auto provisioning) which creates node pools for pending/unschedulable pods, based on those pods requirements (i.e., CPU/memory requirements, GPU/TPU requirements, etc).

Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and pods land there and begin running, but eventually timeout before the other slices all finish provisioning and pods land there and become ready.


We discussed motivations other than provisioning; could you please add them?



#### Story 2: Group scheduling of a specific replicated job


Suggested change
#### Story 2: Group scheduling of a specific replicated job
#### Story 2: Group startup of a specific replicated job

- `workers` which contains my primary batch workload, running on nodes with accelerator chips (GPUs)
- `auxiliary` which contains auxiliary workloads (proxy server, metrics service) running on CPU nodes.

I want my batch workload workers to be scheduled as a group, but the auxiliary pods can start up


Suggested change
I want my batch workload workers to be scheduled as a group, but the auxiliary pods can start up
I want my batch workload workers to start as a group, but the auxiliary pods can start up

@danielvegamyhre (Contributor, Author)

/hold

I will revisit this and address comments sometime this week. I also want to change some things about the proposed design.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2024
@danielvegamyhre mentioned this pull request Apr 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 14, 2024
@danielvegamyhre danielvegamyhre removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 8, 2024
@danielvegamyhre (Contributor, Author)

Deprioritizing this for now since k8s lacks the right primitives to implement this in a non-hacky and performant way. For now we can recommend users use Kueue for group scheduling which will be good enough to avoid distributed init timeouts for most use cases.
