KEP 498: Synchronized Startup Support for JobSets #499
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danielvegamyhre The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
nit: "Fixes" means that this issue will get closed when the KEP PR merges.
Updated, thanks!
keps/498-GroupScheduling/README.md (Outdated)
For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, for dynamically provisioned clusters, we have no support for group scheduling in JobSet yet.
So Kueue will support dynamically provisioned clusters if there is support in it for JobSet?
I don't think Kueue will support dynamic provisioning. We need pods to be pending in order to trigger the dynamic provisioning for them, but Kueue keeps the Jobs in a suspended state (i.e., no pods created) until there are sufficient resources for them.
I meant we have no group scheduling support for JobSets that are being run independently without Kueue. This specifically becomes an issue on clusters using dynamic node pool provisioning. I updated this section for clarity.
initContainers:
  # Check if the started file is present before exiting and starting the main container.
  - name: group-scheduling-init
    image: bash
Will we provide configuration on the init container that gets injected?
Generally I wonder about the image, registry settings, image pull secrets.
I was thinking about either just using bash:latest, or bash at a specific tag we've tested. The command executed in the initContainer will be injected by the JobSet webhook; see the example of the JobSet spec after injection in the implementation section.
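To make that concrete, here is a rough, hypothetical sketch of what the webhook could inject, based only on the discussion above; the package and function names, sentinel file path, image tag, and polling loop are illustrative assumptions, not the actual implementation.

```go
package webhook // hypothetical package name, for illustration only

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// groupSchedulingInitContainer sketches an init container that blocks the main
// container until all pods in the group have started, or fails after a timeout.
func groupSchedulingInitContainer(timeoutSeconds int32) corev1.Container {
	// Poll once per second for a sentinel "started" file (assumed to be written
	// once all pods in the group have started), giving up after the timeout.
	script := fmt.Sprintf(
		"for i in $(seq 1 %d); do [ -f /tmp/jobset/started ] && exit 0; sleep 1; done; "+
			"echo 'timed out waiting for all pods in the group to start' >&2; exit 1",
		timeoutSeconds)
	return corev1.Container{
		Name:    "group-scheduling-init",
		Image:   "bash:5.2", // a pinned, tested tag rather than bash:latest, per the discussion above
		Command: []string{"bash", "-c", script},
	}
}
```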
Yeah, I'm concerned that some users may require that all containers have image pull secrets. Or they may want to use a special registry for their containers. Maybe they want to avoid pulling images from Docker Hub, or want to use Quay as their registry.
Maybe this is YAGNI.
For example, in OpenShift we also have a lot of cases where people use security contexts, and volumes may have SELinux label issues.
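One purely illustrative way the concerns in this thread could be addressed is a small configuration surface for the injected container; none of these fields exist in the KEP, and the type and field names below are placeholders.

```go
package config // placeholder package name

import (
	corev1 "k8s.io/api/core/v1"
)

// InitContainerConfig is a hypothetical set of knobs for the injected init
// container, covering the concerns raised above: private registries, image
// pull secrets, and pod security requirements (e.g. SELinux on OpenShift).
type InitContainerConfig struct {
	// Image overrides the default bash image, e.g. to point at an internal registry.
	Image string `json:"image,omitempty"`
	// ImagePullSecrets are added for registries that require credentials.
	ImagePullSecrets []corev1.LocalObjectReference `json:"imagePullSecrets,omitempty"`
	// SecurityContext is applied to the injected container.
	SecurityContext *corev1.SecurityContext `json:"securityContext,omitempty"`
}
```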
For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
Suggested change:
However, JobSets running independently (without Kueue) on dynamically provisioned clusters, we need to support group
Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and pods land there and begin running, but eventually timeout before the other slices all finish provisioning and pods land there and become ready.

For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.
Maybe @tenzen-y, @alculquicondor or @mimowo can comment on Kueue and dynamic clusters?
I know that wg-batch did some work in autoscaling and I'm curious how this feature is related.
I'm curious as well. For additional context, we are talking about NAP (node auto-provisioning) here, not the Cluster Autoscaler.
/cc
/cc
// Timeout defines the period after which the injected initContainer
// (which blocks execution of the main container until all pods are started)
// will timeout and exit with an error if not all pods have started yet.
TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds"`
Reading 5 minutes to mean 300s, I wonder if we should consider durations as an API? I guess k8s has a lot of APIs in seconds, so maybe it's not a big deal.
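For comparison, here is a minimal sketch of the two API shapes being weighed; the package, type, and field names are placeholders rather than the KEP's actual API.

```go
package v1alpha2 // placeholder package name

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Option A: plain seconds, consistent with core fields such as
// terminationGracePeriodSeconds and activeDeadlineSeconds.
type GroupSchedulingWithSeconds struct {
	TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds,omitempty"`
}

// Option B: a metav1.Duration, serialized as a string, so users can write
// values like "5m" or "300s" directly.
type GroupSchedulingWithDuration struct {
	Timeout *metav1.Duration `json:"timeout,omitempty"`
}
```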
### Constraints

- In Order Startup Policy is incompatible with a group scheduling config that spans
I think we should comment on when someone should use startup policy or this feature.
For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
scheduling semantics in order to avoid these timeout issues.
Suggested change:
startup semantics in order to avoid these timeout issues.
### User Stories (Optional)

#### Story 1: Group scheduling of all replicated jobs
Suggested change:
#### Story 1: Group startup of all replicated jobs
As a user, in order to make efficient use of expensive accelerator (GPU/TPU) resources, I use node
auto-provisioning to provision infrastructure on an as-needed basis when a pending workload requires
The node auto-provisioner is GKE-specific. However, a similar problem arises in any autoscaled environment.
Many users of managed K8s services on cloud providers make use of NAP (node auto provisioning) which creates node pools for pending/unschedulable pods, based on those pods requirements (i.e., CPU/memory requirements, GPU/TPU requirements, etc).

Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and pods land there and begin running, but eventually timeout before the other slices all finish provisioning and pods land there and become ready.
We discussed motivations other than provisioning; could you please add them?
#### Story 2: Group scheduling of a specific replicated job
Suggested change:
#### Story 2: Group startup of a specific replicated job
- `workers` which contains my primary batch workload, running on nodes with accelerator chips (GPUs)
- `auxiliary` which contains auxiliary workloads (proxy server, metrics service) running on CPU nodes.

I want my batch workload workers to be scheduled as a group, but the auxiliary pods can start up
Suggested change:
I want my batch workload workers to start as a group, but the auxiliary pods can start up
/hold I will revisit this and address comments sometime this week. I also want to change some things about the proposed design.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
Deprioritizing this for now since k8s lacks the right primitives to implement this in a non-hacky and performant way. For now we can recommend users use Kueue for group scheduling, which will be good enough to avoid distributed init timeouts for most use cases.
Design for #498