cache assigned pod count #708
base: master
Conversation
Hi @KunWuLuan. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: KunWuLuan
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from ed85ccd to 07beae1
/ok-to-test
Could you help fix the CI failures?
switch t := obj.(type) {
case *corev1.Pod:
	pod := t
	pgMgr.Unreserve(context.Background(), pod)
PodDelete event consists of 3 types of events:
- Pod failed
- Pod completed (successfully)
- Pod get deleted
but for a completed Pod, we should still count it as part of the gang, right? Could you also check whether the integration test covers this case?
When a pod completes, it is removed from NodeInfo. CalculateAssignedPods counts the pods in NodeInfo, so we did not count completed pods previously.
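For reference, a rough sketch of what a NodeInfo-based count like CalculateAssignedPods amounts to (illustrative only; the function name, parameters, and label key are assumptions, not the plugin's actual code). A completed Pod has already been dropped from the snapshot's NodeInfo, so it never contributes to the total:

```go
package coscheduling

import "k8s.io/kubernetes/pkg/scheduler/framework"

// countAssignedPods walks the snapshot's NodeInfos and counts Pods that carry
// the given pod-group label value in the given namespace. Completed Pods are
// no longer present in any NodeInfo, so they are never reached here.
func countAssignedPods(nodeInfos []*framework.NodeInfo, pgLabelKey, pgName, namespace string) int {
	count := 0
	for _, nodeInfo := range nodeInfos {
		for _, podInfo := range nodeInfo.Pods {
			pod := podInfo.Pod
			if pod.Namespace == namespace && pod.Labels[pgLabelKey] == pgName {
				count++
			}
		}
	}
	return count
}
```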
Ok, I will check whether the integration test covers this case.
so we did not count completed pods previously.
True. I'm wondering if we should fix this glitch in this PR: in DeleteFunc(), additionally check whether the Pod is completed; if so, do NOT invalidate it from the assignedPodsByPG cache. WDYT?
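For reference, a minimal sketch of the check being proposed here (not the PR's actual code; the handler and callback names are made up for illustration, and the thread below ends up postponing the idea):

```go
package coscheduling

import corev1 "k8s.io/api/core/v1"

// handlePodDelete is an illustrative DeleteFunc body: a successfully completed
// Pod is left in the assignedPodsByPG cache (it still counts as part of the
// gang), while any other deleted Pod is unreserved as before. The unreserve
// callback stands in for pgMgr.Unreserve from the snippet above.
func handlePodDelete(obj interface{}, unreserve func(*corev1.Pod)) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return
	}
	if pod.Status.Phase == corev1.PodSucceeded {
		return // keep the completed Pod counted; do not invalidate the cache
	}
	unreserve(pod)
}
```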
We have discussed in this issue whether we should count completed pods.
Is there a new situation that requires counting completed pods?
I see. It seems restarting the whole Job is more conventional for now, so let's postpone the idea until a new requirement emerges.
sure
Signed-off-by: KunWuLuan <[email protected]>
Force-pushed from 07beae1 to e728afd
Force-pushed from 7814f35 to bd79ef7
Signed-off-by: KunWuLuan <[email protected]>
Force-pushed from bd79ef7 to 9bfafc9
Force-pushed from dd1b7f3 to d7b2a45
@Huang-Wei Hi, I have fixed the CI failures. Please have a look when you have time, thanks.
I forgot one thing about the cache's consistency during one scheduling cycle - we will need to:
- snapshot the pg->podNames map at the beginning of the scheduling cycle (PreFilter), so that we can treat it as the source of truth during the whole scheduling cycle
- support preemption:
  - implement the Clone() function
  - for each PodAddition dryrun, if the pod is hit, add it
  - for each PodDeletion dryrun, if the pod is hit, remove it
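A minimal sketch of the snapshot idea above (assuming a plain map keyed by pod-group name; none of these names come from the PR). The copy taken in PreFilter stays untouched while a preemption dry-run mutates its own Clone():

```go
package main

import "fmt"

// snapshotState stands in for the per-cycle state data holding the
// pod-group -> assigned-pod-names map.
type snapshotState struct {
	assignedPodsByPG map[string]map[string]struct{}
}

// Clone deep-copies the snapshot so a dry-run can add/remove pods without
// touching the source of truth taken in PreFilter.
func (s *snapshotState) Clone() *snapshotState {
	out := &snapshotState{assignedPodsByPG: make(map[string]map[string]struct{}, len(s.assignedPodsByPG))}
	for pg, pods := range s.assignedPodsByPG {
		cp := make(map[string]struct{}, len(pods))
		for name := range pods {
			cp[name] = struct{}{}
		}
		out.assignedPodsByPG[pg] = cp
	}
	return out
}

// addPod / removePod are what the PodAddition / PodDeletion dry-run hooks
// would call when the pod belongs to a tracked pod group ("if the pod is hit").
func (s *snapshotState) addPod(pg, podName string) {
	if pods, ok := s.assignedPodsByPG[pg]; ok {
		pods[podName] = struct{}{}
	}
}

func (s *snapshotState) removePod(pg, podName string) {
	if pods, ok := s.assignedPodsByPG[pg]; ok {
		delete(pods, podName)
	}
}

func main() {
	src := &snapshotState{assignedPodsByPG: map[string]map[string]struct{}{
		"ns/pg-a": {"pod-1": {}, "pod-2": {}},
	}}
	dry := src.Clone()
	dry.removePod("ns/pg-a", "pod-2") // a dry-run deletion only affects the clone
	dry.addPod("ns/pg-a", "pod-3")    // likewise for a dry-run addition
	fmt.Println(len(src.assignedPodsByPG["ns/pg-a"]), len(dry.assignedPodsByPG["ns/pg-a"])) // 2 2
}
```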
We only check the number of assigned pods in Permit, so I think there is no inconsistency during one scheduling cycle. And PostFilter will not check the Permit plugin, so implementing PodAddition and PodDeletion will have no effect on preemption, right? What we can do is return framework.Unschedulable if the PodDeletion would make a pod group rejected, but I think that is not enough for preemption in coscheduling. I think supporting preemption for coscheduling is complicated; maybe in another issue.
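A hedged illustration of the "return framework.Unschedulable" idea mentioned above (hypothetical, not part of this PR): a PodDeletion dry-run could be rejected once removing the pod would push its pod group below its minimum member count. The map layout and all names here are assumptions of this sketch:

```go
package coscheduling

import "k8s.io/kubernetes/pkg/scheduler/framework"

// rejectIfGroupBroken reports Unschedulable when a dry-run removal of podName
// would leave fewer than minMember assigned pods in the pod group pgKey.
// assigned maps pod-group key -> set of assigned pod names.
func rejectIfGroupBroken(assigned map[string]map[string]struct{}, pgKey, podName string, minMember int) *framework.Status {
	pods, ok := assigned[pgKey]
	if !ok {
		return nil // pod group not tracked; nothing to enforce
	}
	if _, hit := pods[podName]; hit && len(pods)-1 < minMember {
		return framework.NewStatus(framework.Unschedulable, "removing this pod would break the pod group's quorum")
	}
	return nil
}
```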
Yes, the current preemption skeleton code assumes each plugin only uses PreFilter to pre-calculate state. But for coscheduling, PreFilter can fail early (upon an inadequate quorum). I think the scheduler framework should open up a hook for out-of-tree plugins to choose whether or not to run PreFilter as part of preemption; otherwise, an out-of-tree plugin has to rewrite the PostFilter implementation to hack around that part.
Let's consolidate all the cases and use a new PR to try to tackle it. Thanks.
@KunWuLuan are you OK with postponing this PR's merge until after I cut the release for v0.28, so that we have more time for soak testing? And could you add a release note to highlight that it's a performance enhancement?
Ok, no problem.
Ok. I will try to design a preemption framework in PostFilter, and if an implementation in PostFilter is enough, I will create a new PR to track the KEP. Otherwise I will try to open a discussion in kubernetes/scheduling-sigs.
/cc
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR speeds up the Coscheduling plugin's counting of Pods that have already been assumed.
Which issue(s) this PR fixes:
Fix #707
Special notes for your reviewer:
Does this PR introduce a user-facing change?