
KEP 498: Synchronized Startup Support for JobSets #499

Conversation

@danielvegamyhre (Contributor) commented Apr 5, 2024

Design for #498

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 5, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2024
netlify bot commented Apr 5, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | bc9f18a |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/6610515e717d350008df2864 |

@kannon92 (Contributor) commented Apr 5, 2024

nit: Fixes means that this issue will get closed when the KEP PR merges.

@danielvegamyhre (Contributor, Author)

> nit: Fixes means that this issue will get closed when the KEP PR merges.

Updated, thanks!


For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, for dynamically provisioned clusters, we have no support for group scheduling in JobSet yet.
Contributor:

So kueue will support dynamically provisioned clusters if there is support in it for jobset?

Contributor (Author):

I don't think Kueue will support dynamic provisioning: we need pods to be pending in order to trigger dynamic provisioning for them, but Kueue keeps the Jobs in a suspended state (i.e., no pods created) until there are sufficient resources for them.

I meant we have no group scheduling support for JobSets that are being run independently without Kueue. This specifically becomes an issue on clusters using dynamic node pool provisioning. I updated this section for clarity.

initContainers:
# Check if the started file is present before exiting and starting the main container.
- name: group-scheduling-init
image: bash
Contributor:

Will we provide configuration on the init container that gets injected?

Contributor:

Generally I wonder about the image, registry settings, image pull secrets.

Contributor (Author):

I was thinking of either using bash:latest or bash at a specific tag we've tested.

The command executed in the initContainer will be injected by the JobSet webhook, see the example of the JobSet spec after injection in the implementation section.
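
Purely as an illustration of the injection mechanism (not the KEP's actual implementation), a webhook-side sketch could look roughly like the following; the function name, image choice, marker-file path, and polling command are assumptions, not part of the proposal:

```go
// Sketch only: how the JobSet webhook might append the synchronization
// init container to a pod spec. The function name, image, and marker-file
// path are illustrative assumptions, not the KEP's actual design.
package webhook

import (
	corev1 "k8s.io/api/core/v1"
)

func injectGroupStartupInitContainer(spec *corev1.PodSpec) {
	spec.InitContainers = append(spec.InitContainers, corev1.Container{
		Name:  "group-scheduling-init",
		Image: "bash", // could be pinned to a tested tag or mirrored into a private registry
		// Block until a marker file indicates that all pods in the group have started.
		Command: []string{
			"bash", "-c",
			"until [ -f /tmp/jobset/started ]; do sleep 1; done",
		},
	})
}
```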

Contributor:

Yea, I'm concerned that some users may require that all containers have image pull secrets. Or they want to use a special registry for their containers. Maybe they want to avoid pulling images from dockerhub or want to use quay as their registry.

Contributor:

Maybe this is YAGNI.

Contributor:

For example, in OpenShift we also have a lot of cases where people use security contexts, and volumes may have SELinux label issues.
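
To make these concerns (custom registries, image pull secrets, security contexts) concrete, a hypothetical configuration surface, not proposed in this KEP, might look like the sketch below; all field names are illustrative:

```go
// Hypothetical configuration surface for the injected init container;
// none of these fields are part of the KEP. They only illustrate what a
// registry / pull-secret / security-context knob could look like.
package v1alpha2

import (
	corev1 "k8s.io/api/core/v1"
)

type GroupStartupInitContainerConfig struct {
	// Image for the injected init container, e.g. a bash image mirrored
	// into a private registry instead of Docker Hub.
	Image string `json:"image,omitempty"`

	// ImagePullSecrets added to the pod so the image above can be pulled.
	ImagePullSecrets []corev1.LocalObjectReference `json:"imagePullSecrets,omitempty"`

	// SecurityContext applied to the injected container (e.g. to satisfy
	// OpenShift SELinux constraints).
	SecurityContext *corev1.SecurityContext `json:"securityContext,omitempty"`
}
```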


For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
Contributor:

Suggested change
However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
However, JobSets running independently (without Kueue) on dynamically provisioned clusters, we need to support group


Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and pods land there and begin running, but eventually timeout before the other slices all finish provisioning and pods land there and become ready.

For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.
Contributor:

Maybe @tenzen-y, @alculquicondor or @mimowo can comment on Kueue and dynamic clusters?

I know that wg-batch did some work in autoscaling and I'm curious how this feature is related.

@danielvegamyhre (Contributor, Author), Apr 5, 2024:

I'm curious as well. For additional context, we are talking about NAP (not CA) here.

@danielvegamyhre danielvegamyhre changed the title KEP 498: Group Scheduling Support for JobSets KEP 498: Group Startup Support for JobSets Apr 9, 2024
@alculquicondor

/cc

@tenzen-y (Member)

/cc

// Timeout defines the period after which the injected initContainer
// (which blocks execution of the main container until all pods are started)
// will timeout and exit with an error if not all pods have started yet.
TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds"`
Contributor:

Reading 5 minutes to mean 300s, I wonder if we should consider durations as an API?

I guess k8s has a lot of APIs in seconds, so maybe it's not a big deal.
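
As a sketch of the two options being weighed here (illustrative only, not what the KEP adopts), the field could stay as integer seconds, as in the quoted snippet, or use metav1.Duration:

```go
// Sketch of the two alternatives discussed above; only one would exist in
// practice, and the struct/field names are illustrative rather than final.
package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type StartupSyncOptions struct {
	// Option 1: integer seconds, consistent with fields such as
	// activeDeadlineSeconds elsewhere in Kubernetes APIs.
	TimeoutAfterSeconds *int32 `json:"timeoutAfterSeconds,omitempty"`

	// Option 2: a metav1.Duration, which accepts values like "5m" and
	// spares callers from converting minutes to seconds by hand.
	Timeout *metav1.Duration `json:"timeout,omitempty"`
}
```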

@danielvegamyhre danielvegamyhre changed the title KEP 498: Group Startup Support for JobSets KEP 498: Synchronized Startup Support for JobSets Apr 10, 2024

### Constraints

- In Order Startup Policy is incompatible with a group scheduling config that spans
Contributor:

I think we should comment on when someone should use startup policy or this feature.

For statically provisioned clusters, we have recommended users use Kueue to handle group scheduling of JobSets once sufficient resources are available.

However, JobSets running independently (without Kueue) on dynmaically provisioned clusters, we need to support group
scheduling semantics in order to avoid these timeout issues.


Suggested change
scheduling semantics in order to avoid these timeout issues.
startup semantics in order to avoid these timeout issues.


### User Stories (Optional)

#### Story 1: Group scheduling of all replicated jobs


Suggested change
#### Story 1: Group scheduling of all replicated jobs
#### Story 1: Group startup of all replicated jobs

#### Story 1: Group scheduling of all replicated jobs

As a user, in order to make efficient use of expensive accelerator (GPU/TPU) resources, I use node
auto-provisioning to provision infrastructure on an as-needed basis when a pending workload requires


The node auto-provisioner is GKE-specific. However, a similar problem arises in any autoscaled environment.


Many users of managed K8s services on cloud providers make use of NAP (node auto provisioning) which creates node pools for pending/unschedulable pods, based on those pods requirements (i.e., CPU/memory requirements, GPU/TPU requirements, etc).

Since node pool provisioning takes a variable amount of time, users are running into issues where the first slice finishes provisioning and pods land there and begin running, but eventually timeout before the other slices all finish provisioning and pods land there and become ready.


We discussed motivations other than provisioning; could you please add them?



#### Story 2: Group scheduling of a specific replicated job


Suggested change
#### Story 2: Group scheduling of a specific replicated job
#### Story 2: Group startup of a specific replicated job

- `workers` which contains my primary batch workload, running on nodes with accelerator chips (GPUs)
- `auxiliary` which contains auxiliary workloads (proxy server, metrics service) running on CPU nodes.

I want my batch workload workers to be scheduled as a group, but the auxiliary pods can start up


Suggested change
I want my batch workload workers to be scheduled as a group, but the auxiliary pods can start up
I want my batch workload workers to start as a group, but the auxiliary pods can start up

@danielvegamyhre (Contributor, Author)

/hold

I will revisit this and address comments sometime this week. I also want to change some things about the proposed design.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2024
@danielvegamyhre mentioned this pull request Apr 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 14, 2024
@danielvegamyhre danielvegamyhre removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 8, 2024
@danielvegamyhre (Contributor, Author)

Deprioritizing this for now since k8s lacks the right primitives to implement this in a non-hacky and performant way. For now we can recommend users use Kueue for group scheduling which will be good enough to avoid distributed init timeouts for most use cases.
