
KEP-4444: add "prefer same node" semantics to PreferClose #4931

Open
danwinship wants to merge 2 commits into master

Conversation

@danwinship (Contributor) commented Oct 24, 2024

Previously introduced as internalTrafficPolicy: PreferLocal in #3016 and then reworked as "node-level topology" in #3293 (which was closed before we came up with any specific API).

The goal here is "pods should connect to the local CoreDNS endpoint, if there is one".

Given that "PreferClose" has an intentionally vague name (which, in particular, does not mention "zones"), it seemed to make more sense to add prefer-same-node semantics to "PreferClose" than to add a new traffic distribution ("PreferREALLYClose"?) for this. The proposed rule for deciding when prefer-same-node can be used is:

  * the Service has `PreferClose` traffic distribution, and
  * the Service has endpoints on all or all-but-one of the Ready nodes, and
  * the endpoints of the Service are Pods that have `OwnerReferences` indicating
    that they are all part of the same DaemonSet.
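
For illustration, here is a rough sketch of how the EndpointSlice controller might evaluate this rule. The function name, signature, and wiring are hypothetical and only meant to make the three conditions concrete; this is not proposed code or API.

    package sketch

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/apimachinery/pkg/util/sets"
    )

    // preferSameNodeEligible is a hypothetical helper showing the proposed
    // heuristic; it is not actual controller code.
    func preferSameNodeEligible(svc *v1.Service, endpointPods []*v1.Pod, readyNodes []*v1.Node) bool {
        // The Service must have `PreferClose` traffic distribution.
        if svc.Spec.TrafficDistribution == nil ||
            *svc.Spec.TrafficDistribution != v1.ServiceTrafficDistributionPreferClose {
            return false
        }

        // Every endpoint Pod must be controlled by the same DaemonSet.
        var daemonSetUID types.UID
        nodesWithEndpoints := sets.New[string]()
        for _, pod := range endpointPods {
            ref := metav1.GetControllerOf(pod)
            if ref == nil || ref.Kind != "DaemonSet" {
                return false
            }
            if daemonSetUID == "" {
                daemonSetUID = ref.UID
            } else if ref.UID != daemonSetUID {
                return false
            }
            nodesWithEndpoints.Insert(pod.Spec.NodeName)
        }

        // Endpoints must exist on all, or all-but-one, of the Ready nodes.
        missing := 0
        for _, node := range readyNodes {
            if !nodesWithEndpoints.Has(node.Name) {
                missing++
            }
        }
        return len(endpointPods) > 0 && missing <= 1
    }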

(Bringing this up now because I have reason to want to kill off our downstream "prefer same node just for CoreDNS" patch again.)

I'm not sure whether we'd need a new feature gate for this new behavior.

In the past, it has been suggested that this is what NodeLocal DNSCache is for, but NodeLocal DNSCache is just a bad partial reimplementation of Services, to make up for the fact that the real Service API doesn't have a prefer-same-node feature... (Among the reasons that it is bad: it doesn't support fallback-to-other-nodes when the DNS pods are being upgraded.)

/cc @aojea @thockin @gauravkghildiyal @robscott

@k8s-ci-robot added the cncf-cla: yes label Oct 24, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the kind/kep, approved, sig/network, and size/M labels Oct 24, 2024
@@ -266,6 +267,18 @@ NOTE: Implementations reserve the right to refine the behavior associated with
implementation. This simplifies their deployment process and reduces the
complexity of managing cross-cluster applications.

#### Story 4 (DNS)

* **Requirement:** As a cluster administrator, I plan to deploy CoreDNS so that there
Member

Yeah, this is common: having the DNS resolver per node. This also removes the nodeLocalDNS hack to divert the traffic, and people/installers (kops, kubeadm, ...) can just replace the DNS Deployment with a DaemonSet.

Comment on lines +346 to +348
* the Service has endpoints on all or all-but-one of the Ready nodes, and
* the endpoints of the Service are Pods that have `OwnerReferences` indicating
that they are all part of the same DaemonSet.
Member

Isn't it a bit of a stretch to rely on these specifics?
Wouldn't it be better to add a new algorithm like PreferNode than to try to special-case the existing one?

@aojea (Member) commented Oct 24, 2024

I completely agree that we should do this; my concern is whether we should overload the existing term instead of adding a new one.

@danwinship (Contributor, Author)

It's not even overloading though; PreferClose already means this:

// Indicates a preference for routing traffic to endpoints that are
// topologically proximate to the client. The interpretation of "topologically
// proximate" may vary across implementations and could encompass endpoints
// within the same node, rack, zone, or even region. Setting this value gives
// implementations permission to make different tradeoffs, e.g. optimizing for
// proximity rather than equal distribution of load. Users should not set this
// value if such tradeoffs are not acceptable.
ServiceTrafficDistributionPreferClose = "PreferClose"

The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.

@aojea (Member) commented Oct 24, 2024

The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.

I don't disagree with that, but I expect the interpretation to be always the same for the same implementation, not different depending on factors like whether it is a DaemonSet or has fewer than 3 endpoints. That is a slippery slope in my opinion, and it will make this more complex to debug, explain, and troubleshoot.

Comment on lines +331 to +333
then it attempts to route traffic to endpoints on the same node as the client. If
no endpoints are available on the same node, traffic would be routed to other
nodes (though still preferring the same zone, if relevant).
Member

Just a comment: This is super great!
Having used internalTrafficPolicy: Local, its one shortfall is when that endpoint is briefly down (when we upgrade the DaemonSet, for example).
As a user I'd prefer traffic to stay local to the node, but I don't mind if it occasionally leaves the node (or the zone), as long as it's temporary.

This is just a comment, so you're welcome to resolve this.

clients on the same node as this endpoint should prefer it. This will be set
when:
* the Service has `PreferClose` traffic distribution, and
* the Service has endpoints on all or all-but-one of the Ready nodes, and
@adrianmoisey (Member) commented Oct 24, 2024

This comment may be similar to what Antonio is saying below.
If a DaemonSet is configured with updateStrategy.rollingUpdate.maxUnavailable > 1, this won't match.

Also, a use case we have is that we run a statsd DaemonSet on all nodes, except for a handful of nodes that are short-lived and have a very specific task, and running the DaemonSet there costs too much.

As a user I'd want the "SameNode" functionality to be the "preferred" method, but I also understand that this feature is covering a wide range of use-cases.

Contributor (Author)

And would you want prefer-same-node semantics for that Service?

There is definitely a tradeoff if we want to try to "auto-detect" this.

Maybe I should try to figure out the TopologySpreadConstraints version of this...

Member

And would you want prefer-same-node semantics for that Service?

Definitely!
Since we are running the DaemonSet on most of our nodes, we wouldn't worry about overloading a specific node.

What is the thought behind having the safeguards described here? Is the worry that a single node could be overloaded with traffic, and that the correct choice might be to send traffic to a different Pod (at the cost of going over the network)?

I may be missing something here (context: I've never used the topology hints feature, but would really like to, so maybe I haven't encountered its downsides yet), but the name "PreferClose" makes me assume that traffic would be sent to a pod on the same node if available; if it can't go to a local pod, then to a pod in the same zone; and if that fails, then to a random pod in the cluster.

Contributor (Author)

The issue is that we don't know anything about the clients of the service, and we can't just assume that they are distributed in a way that ensures prefer-same-node will result in good behavior. If, for some reason, there are endpoints on Nodes A, B, and C, but most of the clients are on Node A, then prefer-same-node would result in the Node A endpoint getting overloaded while the Node B and C endpoints were idle. (If the nodes are heterogeneous in some way that affects scheduling this might even be a predictable outcome.)

If I was attaching the prefer-same-node semantics to a new traffic distribution value then this wouldn't be a problem because we could just say "don't use trafficDistribution: PreferSameNode if you don't want that". But since there are already people using PreferClose, we have to be careful to not break them.

@adrianmoisey (Member) commented Oct 28, 2024

Right, so I guess this KEP is about PreferClose; does it make sense for the thing I'm asking for to be under a different option?

I'm still not sure I like this test:

the Service has endpoints on all or all-but-one of the Ready nodes

But I also can't think of a better way to determine if the DaemonSet is running everywhere.
I guess the situation I'm worried about is when a DaemonSet is being updated and multiple Pods are not ready at the same time. However, in theory this scenario is temporary, so it isn't a big deal.

Maybe it's worth excluding control-plane nodes from the set of nodes?

@danwinship (Contributor, Author)

I expect the interpretation to be always the same for the same implementation

AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.

@aojea (Member) commented Oct 25, 2024

AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.

The previous implementation was "smart" and turned out to be very confusing for users; that is why, IIRC, we moved to a simpler and more deterministic solution.

I may have lost track of these details, and I recognize I don't remember most of them (@gauravkghildiyal and @robscott should be the most knowledgeable), but I expect PreferClose to always work the same way in kube-proxy, and would add a new traffic distribution mode for this.

@danwinship (Contributor, Author) commented Oct 25, 2024

You mean Topology Keys? That was designed to let the user be "smart" and required the proxy to be correspondingly "dumb". (The proxy had to do exactly what the user said, even if it knew that would have bad effects). TrafficDistribution, like Topology-Aware Hints, was designed to allow the proxy to be smart, while also allowing the user to give hints about what they think would be smart for the proxy to do.

I'm open to the argument that we shouldn't change PreferClose to include prefer-same-node because we think the heuristic I suggested doesn't sufficiently capture the user's intent. But if you don't think it would ever be legitimate to change "PreferClose" to include prefer-same-node at all, then we should deprecate "PreferClose" and rename it to "PreferSameZone". The entire point of having the name be vague was that it wasn't supposed to refer to any specific behavior. If we think we can't get away with changing the behavior, then we should give up on the vagueness idea.

@danwinship (Contributor, Author)

Maybe I should try to figure out the TopologySpreadConstraints version of this...

(TopologySpreadConstraints is a feature for telling the scheduler how to schedule pods to nodes in a way that ensures they are evenly distributed relative to some topological constraint (eg, host, zone, region, etc). It is fairly complicated, but the complication is all in the "figuring out when you can and can't schedule a pod" part. The actual topology definition is a single field, and that's all we'd need to look at here.)

So this would be something like:

  • Add a new ForNodeSelector *metav1.LabelSelector field to discoveryv1.EndpointHints.

  • If a Service has trafficDistribution: PreferClose (or maybe trafficDistribution: TopologyConstrained), then for each endpoint Pod that has .spec.topologySpreadConstraints (which in the normal case would be all of them), the EndpointSlice controller will set the forNodeSelector field of the corresponding Endpoint to an appropriate selector based on the constraints' topologyKey values. E.g., if there is an endpoint pod on "node1", with a single TopologySpreadConstraint that has topologyKey: "kubernetes.io/hostname", then the corresponding Endpoint would get:

    hints:
      forNodeSelector:
        matchLabels:
          "kubernetes.io/hostname": node1
    
  • If kube-proxy sees an endpoint with the forNodeSelector hint set (and topology is enabled), then it would check if the node selector applies to its own host, and ignore the endpoint otherwise (unless this resulted in no endpoints being available, in which case it would ignore the forNodeSelector hints for this service).

(Theoretically we could make the hints more compact by having a single top-level slice.hints.forTopologyKeys: ["kubernetes.io/hostname"] rather than exploding that out into per-endpoint nodeSelectors. This would require the service proxy to keep track of the labels of all nodes rather than only its own node, so that if it sees slice.hints.forTopologyKeys: ["topology.kubernetes.io/region"] it can figure out which nodes have the same region as its own node.)

From kube-proxy's perspective, this is basically the same as Topology Keys, but the difference is that you only get to use it if you've also told the scheduler about the plan, so it avoids the problem of "you said to prefer-same-zone but there is only 1 endpoint in this zone (and 10 in another)" which Topology Keys was vulnerable to.

One downside of this approach is that DaemonSets don't actually use topologySpreadConstraints... So you'd have to redundantly specify a dummy TSC as part of your DaemonSet pod template... Or maybe we'd say that trafficDistribution: TopologyConstrained applied to either pods using TopologySpreadConstraints or pods deployed by a DaemonSet.
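
For concreteness, here is a rough sketch of the pieces described above. The ForNodeSelector field, the hint derivation, and the proxy-side filtering are all hypothetical (none of this is existing API), and the names are only illustrative.

    package sketch

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/labels"
    )

    // EndpointHints mirrors discoveryv1.EndpointHints with the hypothetical
    // ForNodeSelector field added (this field does not exist today).
    type EndpointHints struct {
        ForNodeSelector *metav1.LabelSelector `json:"forNodeSelector,omitempty"`
    }

    // Endpoint is a simplified stand-in for a discovery/v1 Endpoint.
    type Endpoint struct {
        Address string
        Hints   *EndpointHints
    }

    // hintForPod sketches how the EndpointSlice controller could derive the
    // forNodeSelector from a Pod's topologySpreadConstraints and the node it
    // runs on, e.g. topologyKey "kubernetes.io/hostname" for a pod on "node1"
    // yields matchLabels {"kubernetes.io/hostname": "node1"}.
    func hintForPod(pod *corev1.Pod, node *corev1.Node) *EndpointHints {
        matchLabels := map[string]string{}
        for _, tsc := range pod.Spec.TopologySpreadConstraints {
            if val, ok := node.Labels[tsc.TopologyKey]; ok {
                matchLabels[tsc.TopologyKey] = val
            }
        }
        if len(matchLabels) == 0 {
            return nil
        }
        return &EndpointHints{ForNodeSelector: &metav1.LabelSelector{MatchLabels: matchLabels}}
    }

    // filterForNode sketches the proxy side: keep only endpoints whose
    // forNodeSelector matches this node's labels, unless that would leave no
    // endpoints, in which case fall back to ignoring the hints entirely.
    func filterForNode(endpoints []Endpoint, nodeLabels map[string]string) []Endpoint {
        var matching []Endpoint
        for _, ep := range endpoints {
            if ep.Hints == nil || ep.Hints.ForNodeSelector == nil {
                matching = append(matching, ep)
                continue
            }
            sel, err := metav1.LabelSelectorAsSelector(ep.Hints.ForNodeSelector)
            if err != nil {
                continue // a malformed selector just excludes this endpoint in the sketch
            }
            if sel.Matches(labels.Set(nodeLabels)) {
                matching = append(matching, ep)
            }
        }
        if len(matching) == 0 {
            return endpoints
        }
        return matching
    }

As noted in the preceding paragraph, DaemonSet pods would not normally carry topologySpreadConstraints, so the derivation step would need either the dummy-TSC workaround or the DaemonSet special case.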

Labels: approved, cncf-cla: yes, kind/kep, sig/network, size/M
4 participants