KEP-3619: updated Production Readiness Review Questionnaire for beta release #4895

Open · wants to merge 6 commits into base: master
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/3619.yaml
@@ -4,3 +4,5 @@
kep-number: 3619
alpha:
approver: "@johnbelamaric"
beta:

Member:
Not needed for this PR now?

Contributor Author (@everpeace, Oct 8, 2024):

@johnbelamaric Sorry for taking up your time with this PRR. We have postponed the beta promotion to v1.33 or later (depending on the containerd v2 situation), so I think you don't need to review this KEP for now (at least until the v1.33 release cycle).

I'm not sure how to withdraw the PRR request. I'd be very glad if you could guide me.

approver: "@johnbelamaric"
johnbelamaric marked this conversation as resolved.
103 changes: 98 additions & 5 deletions keps/sig-node/3619-supplemental-groups-policy/README.md
@@ -790,13 +790,44 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

A rollout may fail when at least one of the following components is too old, because this KEP introduces a new Kubernetes API field:

Member:
First, a comment on the effect of enabling/disabling this on running Pods (I can't comment on the unchanged lines above). In the answers above, you say that the permission may change. Under what conditions? If the container restarts? Or just if a Pod is recreated?

Contributor Author:

> Under what conditions? If the container restarts? Or just if a Pod is recreated?

The permissions (process identities) may change only when:

  • the pod is set to SupplementalGroupsPolicy: "Strict", and
  • it is recreated.

Member:
Ok, can you fix this above (lines 752, 756 are not clear on this).


| Component | `supplementalGroupsPolicy` value that will cause an error |
|----------------|-----------------------------------------------------------|
| kube-apiserver | not null |
| kubelet | not null |
| CRI runtime | `Strict` |


For example, an error like the following will be returned if kube-apiserver is too old:

Member:
I think you need to revise your version skew strategy (above) and think about how each component reacts during an upgrade. You can't say "kubelet must be at least the version of control-plane components". That's not realistic. It's not possible during an upgrade, and in fact people often run for extended periods of time with older kubelets. In that case, the kubelet won't see the new field. What sort of failure does that cause? Similarly for the CRI runtime.

Similarly, with enablement, you could enable the feature gate in the control plane, then only enable it in some nodes at the kubelet level. What's the behavior in this case?

If you enable it everywhere, then you create some pods with the Strict policy, then you disable it, will the kubelet see the new field or not? If it sees it, have you feature-gated the kubelet behavior to ignore the field?

Contributor Author:

> You can't say "kubelet must be at least the version of control-plane components"

I did not say this. I think I respected the version skew policy.

> It's not possible during an upgrade, and in fact people often run for extended periods of time with older kubelets. In that case, the kubelet won't see the new field. What sort of failure does that cause?

This is a common issue when adding new API fields to Pod. For this KEP, the matrix below describes what will happen:

| kubelet version | Feature gate | CRI runtime supports the KEP? | Pod's policy | Outcome |
|-----------------|--------------|-------------------------------|-----------------|---------|
| <1.31 (does not know this field) | N/A | Yes/No | Strict | The pod can run, but its policy is just ignored. And, `.containerStatuses.user` will not be reported. |
| <1.31 (does not know this field) | N/A | Yes/No | Merge/(not set) | The pod can run normally as expected. And, `.containerStatuses.user` will not be reported. |
| >=1.31 | True | Yes | Strict | The pod and its policy can run as expected. And, `.containerStatuses.user` will be reported. |
| >=1.31 | True | Yes | Merge/(not set) | The pod and its policy can run as expected. And, `.containerStatuses.user` will be reported. |
| >=1.31 | True | No | Strict | The pod will be rejected in kubelet's admission. |
| >=1.31 | True | No | Merge/(not set) | The pod and its policy can run as expected. And, `.containerStatuses.user` will be reported. |
| >=1.31 | False | Yes | Strict | The pod, which was created when the feature gate was previously enabled, and its policy can run as expected. But, `.containerStatuses.user` will not be reported. |
| >=1.31 | False | Yes | Merge/(not set) | The pod and its policy can run as expected. But, `.containerStatuses.user` will not be reported. |
| >=1.31 | False | No | Strict | The pod, which was created when the feature gate was previously enabled, will be rejected in kubelet's admission. |
| >=1.31 | False | No | Merge/(not set) | The pod and the policy can run as expected. But, `.containerStatuses.user` will not be reported. |
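
For operators, one possible way to see kubelet version and reported support side by side (a sketch; `.status.features.supplementalGroupsPolicy` is the node field referenced in this thread, and nodes without it will show `<none>`):

```console
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,SUPPLEMENTAL_GROUPS_POLICY:.status.features.supplementalGroupsPolicy'
```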

> Similarly, with enablement, you could enable the feature gate in the control plane, then only enable it in some nodes at the kubelet level. What's the behavior in this case?

Please see the above matrix.

> If you enable it everywhere, then you create some pods with the Strict policy, then you disable it, will the kubelet see the new field or not? If it sees it, have you feature-gated the kubelet behavior to ignore the field?

Please see the above matrix.

Member:
That was a quote from line 676, please fix it up there and add in the matrix. Thanks!

```console
$ kubectl apply -f supplementalgroupspolicy.yaml
Error from server (BadRequest): error when creating "supplementalgroupspolicy.yaml": Pod in version "v1" cannot be handled as a Pod:
strict decoding error: unknown field "spec.securityContext.supplementalGroupsPolicy"
```

No impact on already running workloads.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

Look for an event indicating that SupplementalGroupsPolicy is not supported by the runtime.
```console
$ kubectl get events -o json -w
...
{
...
"kind": "Event",
"message": "Error: SupplementalGroupsPolicyNotSupported",
...
}
...
```
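
A narrower filter for just this signal (a sketch; it matches on the message string shown above and guards against events without a message):

```console
$ kubectl get events -A -o json \
    | jq '.items[] | select((.message // "") | test("SupplementalGroupsPolicyNotSupported"))'
```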
Comment on lines +818 to +829

Contributor Author (@everpeace, Oct 2, 2024):

For this, I plan to add a kubelet admission check that raises an error event if:

  • the pod sets pod.spec.securityContext.supplementalGroupsPolicy=Strict (a non-default value), and
  • the node reports node.status.features.supplementalGroupsPolicy=false
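
For reference, one way to check what an individual node reports (a sketch against the `node.status.features.supplementalGroupsPolicy` field proposed above; the node name is a placeholder, and the `true` output is illustrative):

```console
$ kubectl get node <node-name> -o jsonpath='{.status.features.supplementalGroupsPolicy}'
true
```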

Member:
ok, and so the pod doesn't get admitted, can there be a metric when this happens for this reason? Events are not useful if you are managing 50,000 clusters...

Contributor Author:
> can there be a metric when this happens for this reason?

Hmm. If I understood correctly, there are currently no metrics for specific kubelet admission errors. This kind of admission error is a common issue for node (or runtime handler) feature-based admission (e.g. user namespaces or recursive read-only mounts).

As KEP-127 (Support User Namespaces) states, we can compare the kubelet_running_pods and kubelet_desired_pods metrics.
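
For example, one way to eyeball the gap on a single node (a sketch; the endpoint is the standard kubelet metrics proxy path and the node name is a placeholder):

```console
$ kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" \
    | grep -E '^kubelet_(desired|running)_pods'
```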

WDYT??

Member:
yeah, that looks reasonable, please pull that in here. thanks


###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
@@ -805,12 +836,22 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

During the beta phase, the following test will be manually performed:
haircommander marked this conversation as resolved.
- Enable the `SupplementalGroupsPolicy` feature gate for kube-apiserver and kubelet.
- Create a pod with `supplementalGroupsPolicy` specified.
- Disable the `SupplementalGroupsPolicy` feature gate for kube-apiserver, and confirm that the pod gets rejected.
- Enable the `SupplementalGroupsPolicy` feature gate again, and confirm that the pod gets scheduled again.
- Do the same for kubelet too.


###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

No.

### Monitoring Requirements

<!--
@@ -828,6 +869,12 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

Inspect the `supplementalGroupsPolicy` fields in Pods. You can check whether the following `jq` command prints a non-zero number:

Member:
If you have 50,000 clusters this is not helpful. Is there a metric we can use, can kube state metrics help here?

Contributor Author:
Yeah, kube state metrics can help. Let me update this section.


```bash
kubectl get pods -A -o json | jq '[.items[].spec.securityContext? | select(.supplementalGroupsPolicy)] | length'
```
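
A finer-grained variant of the same check (a sketch; it counts pods per `supplementalGroupsPolicy` value using the field above):

```bash
kubectl get pods -A -o json \
  | jq '[.items[].spec.securityContext.supplementalGroupsPolicy // empty] | group_by(.) | map({policy: .[0], count: length})'
```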

###### How can someone using this feature know that it is working for their instance?

<!--
@@ -841,8 +888,8 @@ Recall that end users cannot usually observe component logs or access metrics.

- [ ] Events
- Event Reason:
- [ ] API .status
- Condition name:
- [x] API .status
- Condition name: `containerStatuses.user`
- Other field:
- [ ] Other (treat as last resort)
- Details:
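
For example, one quick way to see the reported identity on a running pod (a sketch; the pod name is a placeholder, and the path follows the `.status.containerStatuses[].user` field added by this KEP):

```console
$ kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].user}'
```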
@@ -864,16 +911,22 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

- `supplementalGroupsPolicy=Strict`: 100% of pods were scheduled into a node with the feature supported.

Member:
I don't see any scheduler integration incorporated into the KEP. How will this happen in clusters where some nodes support this and some do not?

Contributor Author (@everpeace, Oct 8, 2024):

Yeah, I described this topic in a later section. I'm thinking of updating this line like this. WDYT?

Suggested change
- `supplementalGroupsPolicy=Strict`: 100% of pods were scheduled into a node with the feature supported.
- `supplementalGroupsPolicy=Strict`: 100% of pods were scheduled into a node with the feature supported. This KEP does NOT support scheduler integration. Please see the section "Are there any missing metrics that would be useful to have to improve observability of this feature?".

Member:
I think that's helpful, but perhaps you can also note, up in the discussion of the feature and how it works, that users should target nodes using a node label.


- `supplementalGroupsPolicy=Merge`: 100% of pods were scheduled into a node with or without the feature supported.

- `supplementalGroupsPolicy` is unset: 100% of pods were scheduled into a node with or without the feature supported.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

- [ ] Metrics
- [x] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [Optional] Aggregation method: `kubectl get events -o json -w`
- Components exposing the metric: kubelet -> kube-apiserver
- [ ] Other (treat as last resort)
- Details:

@@ -884,6 +937,15 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

Potentially, kube-scheduler could be extended to avoid scheduling a pod with `supplementalGroupsPolicy: Strict`
to a node running a CRI runtime that does not support this feature.

In this way, the error Event described above would not be emitted, and users would instead see `Pending` pods
as the error signal.

However, this is not planned to be implemented in kube-scheduler, as it seems like overengineering.
Users may use `nodeSelector`, `nodeAffinity`, etc. to work around this.
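
For illustration only (a sketch; the node label below is hypothetical and would be maintained by the cluster administrator, e.g. via NFD, based on `node.status.features.supplementalGroupsPolicy`):

```console
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: strict-policy-example
spec:
  nodeSelector:
    example.com/supplemental-groups-policy: "supported"  # hypothetical, admin-maintained label
  securityContext:
    supplementalGroupsPolicy: Strict
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
EOF
```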

Member:
Ok, this needs to be clearly documented, I didn't see this anywhere above (I may have missed it). Is the assumption then that during the time period while this is working its way through the ecosystem (maybe a couple years?), NFD/a label + node selector will be needed to ensure that workloads that need the Strict policy are properly scheduled?

Contributor Author:

> NFD/a label + node selector will be needed to ensure that workloads that need the Strict policy are properly scheduled?

Yes. However, this topic is not specific to this KEP. This is a common issue for node/runtime handler features (e.g. user namespaces, recursive read-only mounts).

Member:

yes...and I think this is a big usability oversight. but you are right, that's not specific to this KEP.

Member:

I agree it's miserable and the containerd 2.0 situation is exacerbating it.

The only paths I see are basically:

  1. Back-rev nodes ignore unknown fields; the API may say one thing and the node does something else.

  2. Scheduler does not know about features, and so might assign pods which use feature X to nodes which do not support feature X; back-rev nodes will detect the "unknown fields" and reject those pods.

  3. Scheduler knows how to map pods which need feature X to nodes which support feature X; scheduling may fail entirely if no nodes support it.

  4. We stall any feature which has a node-based implementation until old kubelets are out of support; CRI support is not version locked or controlled by us, so fall back on 1, 2, or 3.


### Dependencies

<!--
@@ -907,6 +969,12 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the feature:
-->

Container runtimes supporting [CRI api v0.31.0](https://github.com/kubernetes/cri-api/tree/v0.31.0) or above.

For example,
- containerd: v2.0 or later
- CRI-O: v1.31 or later
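
One quick way to check which runtime version each node is running (a sketch; `containerRuntimeVersion` is standard node status info):

```console
$ kubectl get nodes -o custom-columns='NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'
```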

### Scalability

<!--
@@ -919,6 +987,16 @@ For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->


A pod with `supplementalGroupsPolicy: Strict` may be rejected by kubelet with a probability of $$B/A$$,

Member:
Only if the user fails to target the pod via a nodeSelector, right?

Contributor Author:

Yes, if the cluster administrator maintains node labels propagated from node.status.features.supplementalGroupsPolicy.

where $$A$$ is the number of all the nodes that may potentially accept the pod,
and $$B$$ is the number of nodes that may potentially accept the pod but do not support this feature.
This may affect scalability.

To evaluate this risk, users may run
`kubectl get nodes -o json | jq '[.items[].status.features]'`
to see how many nodes report `supplementalGroupsPolicy: true`.
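
A slightly more direct sketch of the same check, counting supporting vs. total nodes (same field path as above):

```console
$ kubectl get nodes -o json \
    | jq '{supported: ([.items[] | select(.status.features.supplementalGroupsPolicy == true)] | length), total: (.items | length)}'
```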

###### Will enabling / using this feature result in any new API calls?

<!--
@@ -1024,6 +1102,8 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

A pod cannot be created, just as with any other pod.

###### What are other known failure modes?

<!--
@@ -1039,8 +1119,21 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

None.

Member:
I think the "pod failing to schedule because the node doesn't support it" is a failure mode you can document here.

Contributor Author:

Thanks. But this section asks "What are OTHER known failure modes?", right?

I think the failure mode "the node does not support it" was clearly mentioned above. This KEP proposes to raise a kubelet admission error for that case.

Would you like me to describe the failure mode here even though it was already stated?

Member:

"other" here means "other than the API server and/or etcd being available". But I think it's ok to have it where you do, since it's really a user error not a failure of the feature.


###### What steps should be taken if SLOs are not being met to determine the problem?

- Make sure that the node is running a CRI runtime that supports this feature.
- Make sure that `crictl info` (with the latest crictl)
reports that `supplemental_groups_policy` is supported.
Comment on lines +1127 to +1128

Contributor Author:

crictl needs to be updated to display the features field in the `crictl info` output.

Otherwise upgrade the CRI runtime, and make sure that no relevant error is printed in
the CRI runtime's log.
- Make sure that `kubectl get nodes -o json | jq '[.items[].status.features]'`
(with the latest kubectl and control plane)
reports that `supplementalGroupsPolicy` is supported.
Otherwise upgrade the CRI runtime, and make sure that no relevant error is printed in
kubelet's log.

## Implementation History

<!--
8 changes: 6 additions & 2 deletions keps/sig-node/3619-supplemental-groups-policy/kep.yaml
@@ -19,7 +19,7 @@ see-also: []
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
@@ -29,7 +29,11 @@ latest-milestone: "v1.31"
# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.31"
beta: "v1.xx"
# NOTE: This is a tentative milestone,
# because the beta promotion will wait until containerd v2 has been released
# and has become widely adopted.
# ref: https://github.com/kubernetes/enhancements/pull/4895#discussion_r1790492414
beta: "v1.33"
stable: "v1.yy"

# The following PRR answers are required at alpha release