
Investigate using Elastic Index Job in exclusive placement implementation #482

Open
Tracked by #523 ...
danielvegamyhre opened this issue Mar 27, 2024 · 15 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. release-0.7

Comments

@danielvegamyhre
Contributor

Right now, the implementation of exclusive job placement per topology domain (node pool, zone, etc.) relies on a pod webhook which allows leader pods (index 0) to be admitted, created, and scheduled, but blocks follower pods (all other indexes) until the leader pod for that child Job is scheduled.

The leader pods have pod affinity/anti-affinity constraints ensuring each leader pod lands in a different topology domain. Follower pods have nodeSelectors injected by the pod mutating webhook ensuring they land on the same topology domain as their leader.
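
To illustrate the shape of these constraints, here is a minimal Go sketch using the upstream API types. The label keys and topology key are placeholders, not the actual values the JobSet webhook injects:

```go
// Sketch only: approximates the constraints described above. The label keys
// and topology key are hypothetical placeholders, not JobSet's real ones.
package placement

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// leaderAntiAffinity keeps leader pods (completion index 0) of the same
// JobSet out of each other's topology domain, so each leader lands in a
// different node pool / zone.
func leaderAntiAffinity(jobSetName, topologyKey string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{
						"example.com/jobset-name": jobSetName, // hypothetical label
						"example.com/leader":      "true",     // hypothetical label
					},
				},
				TopologyKey: topologyKey, // e.g. a zone or node-pool label key
			}},
		},
	}
}

// followerNodeSelector pins follower pods to the topology domain the leader
// was scheduled into, once that domain is known.
func followerNodeSelector(topologyKey, leaderDomain string) map[string]string {
	return map[string]string{topologyKey: leaderDomain}
}
```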

This is an improvement over the original implementation of using pod affinity/anti-affinity constraints on all pods (which did not scale well, since pod affinity rule computation time scales linearly with the number of pods and nodes). However, the repeated, wasted follower pod creation attempts put unnecessary pressure on the apiserver.

One possible option is to use Elastic Indexed Jobs: first create every child Job with completions == parallelism == 1, so that only the index 0 (leader) pod is created. The pod webhook will still inject pod affinity/anti-affinity constraints into the leader pods as it currently does.
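
As a rough sketch (not the actual JobSet controller code; the helper name and the use of k8s.io/utils/ptr are assumptions), the initial child Job could be built along these lines:

```go
// Sketch only: build an Indexed child Job with a single index so that only
// the leader pod (completion index 0) is created at first.
package placement

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

func leaderOnlyJob(name, namespace string, template corev1.PodTemplateSpec) *batchv1.Job {
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			CompletionMode: &mode,
			// Start with a single index; the Job is resized later, once the
			// leader pod has been scheduled.
			Completions: ptr.To[int32](1),
			Parallelism: ptr.To[int32](1),
			Template:    template,
		},
	}
}
```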

Once the leader pod for a given Job is scheduled, resize the Job so that completions == parallelism == the values defined in the JobSet spec. The follower pods will then be created and have the nodeSelector injected to follow the leader, as they currently do. This minimizes pressure on the apiserver by avoiding unnecessary pod creation attempts.
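
A minimal sketch of that resize step, assuming a controller-runtime client (the `job-name` label and `batch.kubernetes.io/job-completion-index` annotation are the standard ones set by the Job controller; the function itself is hypothetical):

```go
// Sketch only: scale the child Job up to the JobSet spec's values once the
// leader pod (completion index 0) has been assigned a node.
package placement

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func scaleUpIfLeaderScheduled(ctx context.Context, c client.Client, job *batchv1.Job, completions, parallelism int32) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(job.Namespace),
		client.MatchingLabels{"job-name": job.Name}); err != nil {
		return err
	}
	leaderScheduled := false
	for _, p := range pods.Items {
		// The Job controller annotates each pod with its completion index;
		// index 0 is the leader.
		if p.Annotations["batch.kubernetes.io/job-completion-index"] == "0" && p.Spec.NodeName != "" {
			leaderScheduled = true
			break
		}
	}
	if !leaderScheduled {
		return nil // leader not scheduled yet; check again on the next reconcile
	}
	patch := client.MergeFrom(job.DeepCopy())
	job.Spec.Completions = ptr.To(completions)
	job.Spec.Parallelism = ptr.To(parallelism)
	return c.Patch(ctx, job, patch)
}
```

Note that mutating completions on an Indexed Job relies on the ElasticIndexedJob feature and is only allowed when spec.completions equals spec.parallelism, which holds in both the initial (1/1) and resized states here.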

@kannon92
Contributor

#463

I did look into supporting elastic jobs with jobset. I opened up a PR but I found out that we enforce immutability on all replicated jobs.

I'm still waiting for some comments on what kind of validation we would want for replicated jobs.

@danielvegamyhre
Contributor Author

> #463
>
> I did look into supporting elastic jobs with jobset. I opened up a PR but I found out that we enforce immutability on all replicated jobs.
>
> I'm still waiting for some comments on what kind of validation we would want for replicated jobs.

We don't need to mutate replicatedJobs for this issue; the controller would be performing this logic on individual Jobs.

The thing you're exploring is different (elastic replicated jobs, enabling the number of replicas to scale up or down). This issue is about scaling the number of a Job's pods up after creation, which doesn't require mutating the replicatedJob.

@kannon92
Contributor

Yikes... That's a good distinction, but I'm not sure we want to support it in that way.

I guess the main thing you'd want to see is whether that breaks JobSet status counting. JobSet is going to converge to whatever values are in its spec, so if users patch the underlying Jobs, I wonder what happens to the JobSet statuses.

I think this is why Deployment/StatefulSet encourage users to use the scale subresource on those workloads rather than editing the pods directly.

@danielvegamyhre
Contributor Author

danielvegamyhre commented Mar 28, 2024

> Yikes... That's a good distinction, but I'm not sure we want to support it in that way.
>
> I guess the main thing you'd want to see is whether that breaks JobSet status counting. JobSet is going to converge to whatever values are in its spec, so if users patch the underlying Jobs, I wonder what happens to the JobSet statuses.
>
> I think this is why Deployment/StatefulSet encourage users to use the scale subresource on those workloads rather than editing the pods directly.

It's not the user patching the Jobs; it's the JobSet controller.

It creates each child Job with 1 pod (the leader pod) and, once that pod is scheduled, updates the child Job's completions and parallelism to match what's in the JobSet spec. This way we avoid spamming the apiserver with pod creation requests that we know will be rejected; otherwise the Job controller undergoes exponential backoff if the leader pod takes too long to schedule, delaying the time until all the workers are ready.

Regardless, this is just an issue to prototype it and see how it performs versus the current webhook-based implementation.

@kannon92
Contributor

Okay. I was just thinking that this would make reproducibility a bit challenging, as the spec is not the desired state.

But for a prototype, I think it's worth exploring.

@dejanzele
Contributor

What is the current state of this work? I see it is unassigned; if nobody is actively working on it, I am interested in picking it up.

@danielvegamyhre
Contributor Author

@dejanzele That would be great!

Let me know if you have any questions on the idea or the implementation.

@dejanzele
Contributor

Thanks, I'll get familiar and ping you soon

/assign

@danielvegamyhre
Contributor Author

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 29, 2024
@danielvegamyhre
Contributor Author

@dejanzele are you still planning to explore this? No rush, this isn't needed urgently but just want to check in.

@dejanzele
Contributor

@danielvegamyhre yes, I plan to start this week

@danielvegamyhre
Contributor Author

@dejanzele how is this going?

@dejanzele
Contributor

@danielvegamyhre I got sidetracked with some other work; I don't think I'll have capacity in the next 2-3 weeks.
I'll unassign myself in case somebody else wants to take it. If it is still unassigned when I have more capacity, I'll revisit it.

/unassign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2024
@danielvegamyhre
Contributor Author

/remove-lifecycle stale

I still think this would be useful to prototype and benchmark against the current implementation; in theory it should be more performant. I can prototype it, but I currently don't have access to large clusters for scale testing.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 5, 2024