Support Autoscaling replicatedJob #570

tenzen-y · 2024-05-14T04:58:11Z

What would you like to be added:
I would like JobSet to support the scale subresource and the metrics used by the HPA resource, like this:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: autoscaling-sample
spec:
  scalePolicy:
    replicatedJobName: workers   # replicatedJob name
    replicas: 2                  # scaling target
    autoScaling:
      minReplicas: 1
      maxReplicas: 10
      metrics:                   # typed `[]autoscalingv2.MetricSpec`
        [...]
  replicatedJobs:
  - name: workers
[...]
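
To make the intent of the proposed fields concrete, here is a minimal Go sketch of how a controller might propagate `scalePolicy.replicas` to the named replicated job. Everything in it is an assumption for illustration: the `ScalePolicy`, `AutoScaling`, and `ReplicatedJob` types are simplified stand-ins rather than the JobSet API, `applyScalePolicy` is a hypothetical helper, and the `metrics` list is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicatedJob mirrors only the two fields this sketch needs.
type ReplicatedJob struct {
	Name     string
	Replicas int32
}

// AutoScaling is a hypothetical shape for spec.scalePolicy.autoScaling.
// The metrics list from the proposal is omitted here.
type AutoScaling struct {
	MinReplicas int32
	MaxReplicas int32
}

// ScalePolicy is a hypothetical shape for the proposed spec.scalePolicy.
type ScalePolicy struct {
	ReplicatedJobName string
	Replicas          int32
	AutoScaling       AutoScaling
}

// applyScalePolicy returns a copy of jobs in which the replicated job named
// by the policy gets policy.Replicas, clamped to [MinReplicas, MaxReplicas].
// All other jobs are left unchanged.
func applyScalePolicy(jobs []ReplicatedJob, p ScalePolicy) ([]ReplicatedJob, error) {
	if p.AutoScaling.MinReplicas > p.AutoScaling.MaxReplicas {
		return nil, errors.New("minReplicas must not exceed maxReplicas")
	}
	want := p.Replicas
	if want < p.AutoScaling.MinReplicas {
		want = p.AutoScaling.MinReplicas
	}
	if want > p.AutoScaling.MaxReplicas {
		want = p.AutoScaling.MaxReplicas
	}
	out := make([]ReplicatedJob, len(jobs))
	copy(out, jobs)
	for i := range out {
		if out[i].Name == p.ReplicatedJobName {
			out[i].Replicas = want
			return out, nil
		}
	}
	return nil, fmt.Errorf("replicatedJob %q not found", p.ReplicatedJobName)
}

func main() {
	jobs := []ReplicatedJob{{Name: "driver", Replicas: 1}, {Name: "workers", Replicas: 1}}
	policy := ScalePolicy{
		ReplicatedJobName: "workers",
		Replicas:          2,
		AutoScaling:       AutoScaling{MinReplicas: 1, MaxReplicas: 10},
	}
	scaled, err := applyScalePolicy(jobs, policy)
	if err != nil {
		panic(err)
	}
	fmt.Println(scaled)
}
```

The point of the single `replicas` field is that the HPA only ever has to read and write one integer, while the controller decides which replicated job that integer applies to.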

Why is this needed:
Machine learning workloads often need to scale worker nodes elastically, as PyTorch Elastic does.
Here is Kubeflow TrainingOperator Example: https://github.com/kubeflow/training-operator/blob/e31d11faa9f6ce5111b60c01079d39295589e0ef/pkg/apis/kubeflow.org/v1/pytorch_types.go#L98-L135

tenzen-y · 2024-05-14T04:59:02Z

cc: @andreyvelich @ahg-g

tenzen-y · 2024-05-14T05:12:29Z

/kind feature

danielvegamyhre · 2024-05-14T16:06:53Z

Thanks @tenzen-y, support for scale subresource was part of the original JobSet design https://bit.ly/k8s-jobset but we haven't been able to prioritize it yet. We will likely need other developers to do the implementation here, since I am currently busy with other work.

tenzen-y · 2024-05-14T16:21:53Z

> Thanks @tenzen-y, support for scale subresource was part of the original JobSet design https://bit.ly/k8s-jobset but we haven't been able to prioritize it yet. We will likely need other developers to do the implementation here, since I am currently busy with other work.

Yeah, I actually mentioned the scale subresource during the JobSet design in https://bit.ly/k8s-jobset :)
If no one else has the bandwidth, I may be able to take this issue, but I'm not sure I have enough bandwidth for it right now.

danielvegamyhre · 2024-05-16T18:01:47Z

@tenzen-y it would be helpful if you or others could take this on. A short KEP would be great so we can align on the API changes to JobSet.

tenzen-y · 2024-05-17T05:39:02Z

> @tenzen-y it would be helpful if you or others could take this on. A short KEP would be great so we can align on the API changes to JobSet.

Yeah, we should definitely create a small KEP. Once I find the time, I will try to assign this issue to myself.

k8s-triage-robot · 2024-08-15T06:14:54Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tenzen-y · 2024-08-15T07:09:10Z

/remove-lifecycle stale

googs1025 · 2024-09-10T03:07:48Z

I'd like to take this issue and try to implement it. :)

googs1025 · 2024-09-10T03:07:59Z

/assign

googs1025 · 2024-09-12T01:10:31Z

After a PoC test, I ran into a problem.
I wanted to implement the scale subresource with a kubebuilder marker like this:
// +kubebuilder:subresource:scale:specpath=.spec.replicatedJobs[*].replicas,statuspath=.status.replicatedJobsStatus[*].active,selectorpath=

but the HPA could not read the scale:

root@VM-0-6-ubuntu:/home/ubuntu# kubectl get hpa
NAME             REFERENCE               TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
network-jobset   JobSet/network-jobset   cpu: <unknown>/80%   1         3         0          10h
root@VM-0-6-ubuntu:/home/ubuntu# kubectl describe hpa
Name:                                                  network-jobset
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2024 22:46:13 +0800
Reference:                                             JobSet/network-jobset
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 80%
Min replicas:                                          1
Max replicas:                                          3
JobSet pods:                                           0 current / 0 desired
Conditions:
  Type         Status  Reason          Message
  ----         ------  ------          -------
  AbleToScale  False   FailedGetScale  the HPA controller was unable to get the target's current scale: Internal error occurred: the spec replicas field ".spec.replicatedJobs[*].replicas" does not exist
Events:
  Type     Reason          Age                     From                       Message
  ----     ------          ----                    ----                       -------
  Warning  FailedGetScale  2m43s (x2461 over 10h)  horizontal-pod-autoscaler  Internal error occurred: the spec replicas field ".spec.replicatedJobs[*].replicas" does not exist
root@VM-0-6-ubuntu:/home/ubuntu#

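I'd like to implement this without changing too much of the existing API, but the fact that ReplicatedJobs is a list ([]ReplicatedJob) seems to be the main obstacle to implementing the scale subresource: the spec replicas path has to point at a single integer field, and .spec.replicatedJobs[*].replicas does not.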

// JobSetSpec defines the desired state of JobSet
type JobSetSpec struct {
	// ReplicatedJobs is the group of jobs that will form the set.
	// +listType=map
	// +listMapKey=name
	ReplicatedJobs []ReplicatedJob `json:"replicatedJobs,omitempty"`
...
}
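
The FailedGetScale error above is consistent with the CRD scale subresource only accepting JSON paths without array notation. Below is a minimal sketch of why a wildcard path finds nothing, assuming the path is resolved by stripping the leading dot, splitting on dots, and looking up each segment as a plain map key. `nestedInt64` is a hypothetical helper written for this example, and `.spec.scalePolicy.replicas` is the hypothetical scalar field from the proposal at the top of this issue, not an existing JobSet field.

```go
package main

import (
	"fmt"
	"strings"
)

// nestedInt64 walks obj one map key at a time and reports whether an int64
// was found at the end of the path. A segment such as "replicatedJobs[*]" is
// treated as a literal key, so it never matches the "replicatedJobs" list.
func nestedInt64(obj map[string]any, path string) (int64, bool) {
	fields := strings.Split(strings.TrimPrefix(path, "."), ".")
	var cur any = obj
	for _, f := range fields {
		m, ok := cur.(map[string]any)
		if !ok {
			return 0, false // reached a list or scalar before the path ended
		}
		cur, ok = m[f]
		if !ok {
			return 0, false // no such key at this level
		}
	}
	n, ok := cur.(int64)
	return n, ok
}

func main() {
	jobSet := map[string]any{
		"spec": map[string]any{
			// Hypothetical top-level scalar, as in the proposed scalePolicy.
			"scalePolicy": map[string]any{"replicas": int64(2)},
			"replicatedJobs": []any{
				map[string]any{"name": "workers", "replicas": int64(2)},
			},
		},
	}
	paths := []string{
		".spec.replicatedJobs[*].replicas",
		".spec.scalePolicy.replicas",
	}
	for _, p := range paths {
		_, found := nestedInt64(jobSet, p)
		fmt.Printf("%s found=%v\n", p, found)
	}
}
```

Under that assumption, exposing one scalar replicas field in the spec (and a matching one in the status) would give the scale subresource something it can resolve, at the cost of an API addition.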

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 14, 2024

tenzen-y mentioned this issue May 14, 2024

Graduate the API to v1 #380

Open

tenzen-y mentioned this issue Aug 1, 2024

KEP-2170: Kubeflow Training V2 API kubeflow/training-operator#2171

Merged

andreyvelich mentioned this issue Aug 12, 2024

KEP-2170: Kubeflow Training V2 API kubeflow/training-operator#2170

Open

18 tasks

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2024

k8s-ci-robot assigned googs1025 Sep 10, 2024
