KEP-4444: add "prefer same node" semantics to PreferClose #4931
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danwinship. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@@ -266,6 +267,18 @@ NOTE: Implementations reserve the right to refine the behavior associated with
implementation. This simplifies their deployment process and reduces the
complexity of managing cross-cluster applications.

#### Story 4 (DNS)

* **Requirement:** As a cluster administrator, I plan to deploy CoreDNS so that there …
Yeah, this is common, to have the DNS resolver per node... this also removes the nodeLocalDNS hack to divert the traffic, and people/installers (kops, kubeadm, ...) can just replace the DNS Deployment with a DaemonSet.
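For concreteness, here is a minimal sketch of the kind of setup being discussed: CoreDNS run as a DaemonSet, fronted by a Service that uses the existing `trafficDistribution: PreferClose` field. The names, namespace, and image below are illustrative, not taken from any particular installer.

```yaml
# Illustrative only: a per-node DNS resolver as a DaemonSet instead of a Deployment.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - name: coredns
        image: registry.k8s.io/coredns/coredns:v1.11.1  # illustrative version
        ports:
        - containerPort: 53
          protocol: UDP
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  trafficDistribution: PreferClose   # existing field; this KEP refines what it may mean
  ports:
  - name: dns
    port: 53
    protocol: UDP
```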
* the Service has endpoints on all or all-but-one of the Ready nodes, and
* the endpoints of the Service are Pods that have `OwnerReferences` indicating
  that they are all part of the same DaemonSet.
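For reference, the `OwnerReferences` check above would look at metadata like the following on each endpoint Pod. The Pod name, UID, and image are placeholders; the `controller: true` reference to a DaemonSet is the real shape of such metadata.

```yaml
# Placeholder metadata for a DaemonSet-managed Pod; the heuristic would check that
# all endpoint Pods point at the same controlling DaemonSet.
apiVersion: v1
kind: Pod
metadata:
  name: coredns-x7k2q                  # placeholder name
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    kind: DaemonSet
    name: coredns
    uid: 0a1b2c3d-0000-0000-0000-placeholder0000   # placeholder UID
    controller: true
    blockOwnerDeletion: true
spec:
  containers:
  - name: coredns
    image: registry.k8s.io/coredns/coredns:v1.11.1   # placeholder image
```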
Is this not too much of a stretch, to rely on these specifics? Would it not be better to add a new algorithm like `PreferNode` than to try to special-case the existing one?

Completely agree that we should do it; my concern is about whether we should overload the existing term instead of adding a new one.
It's not even overloading, though; the consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.
I don't disagree on that, but I expect the interpretation to always be the same for the same implementation, and not to differ depending on factors like whether the backend is a DaemonSet or whether it has fewer than 3 endpoints. It is a slippery slope in my opinion... and it will make this more complex to debug, explain, and troubleshoot.
then it attempts to route traffic to endpoints on the same node as the client. If
no endpoints are available on the same node, traffic would be routed to other
nodes (though still preferring the same zone, if relevant).
Just a comment: this is super great!

Having used `internalTrafficPolicy: Local`, its one shortfall is when that endpoint is briefly down (when we upgrade the DaemonSet, for example). As a user I'd prefer traffic to stay local to the node, but I don't mind if it occasionally leaves the node (or the zone), as long as it's temporary.

This is just a comment, so you're welcome to resolve this.
clients on the same node as this endpoint should prefer it. This will be set
when:

* the Service has `PreferClose` traffic distribution, and
* the Service has endpoints on all or all-but-one of the Ready nodes, and …
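By analogy with the existing `hints.forZones` field on EndpointSlice endpoints, a per-endpoint "same node" hint might look something like the sketch below. The `forNodes` field name is an assumption chosen by analogy, not settled API from this KEP.

```yaml
# Sketch only: a hypothetical per-endpoint node hint alongside the existing zone hint.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: kube-dns-abc12                 # placeholder name
  namespace: kube-system
  labels:
    kubernetes.io/service-name: kube-dns
addressType: IPv4
endpoints:
- addresses: ["10.1.2.3"]
  nodeName: node-a
  zone: us-east-1a
  hints:
    forZones:
    - name: us-east-1a
    forNodes:                          # hypothetical field, named by analogy with forZones
    - name: node-a
ports:
- name: dns
  port: 53
  protocol: UDP
```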
This comment may be similar to what Antonio is saying below.

If a DaemonSet is configured with `updateStrategy.rollingUpdate.maxUnavailable` > 1, this won't match.

Also, a use case we have is that we run a statsd DaemonSet on all nodes, except for a handful of nodes that are short-lived and have a very specific task, where running the DaemonSet costs too much.

As a user I'd want the "SameNode" functionality to be the "preferred" method, but I also understand that this feature is covering a wide range of use cases.
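For example, a DaemonSet rollout configured like the sketch below (values illustrative) can have two Pods unready at once, which would fail the "all or all-but-one of the Ready nodes" check for the duration of the rollout.

```yaml
# Illustrative rollout settings that can take more than one Pod down at a time.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: statsd
spec:
  selector:
    matchLabels:
      app: statsd
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2   # > 1, so two nodes can be missing a Ready endpoint at once
  template:
    metadata:
      labels:
        app: statsd
    spec:
      containers:
      - name: statsd
        image: statsd/statsd:latest   # placeholder image
```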
And would you want prefer-same-node semantics for that Service?

There is definitely a tradeoff if we want to try to "auto-detect" this. Maybe I should try to figure out the TopologySpreadConstraints version of this...
> And would you want prefer-same-node semantics for that Service?

Definitely! Since we are running the DaemonSet on most of our nodes, we wouldn't worry about overloading a specific node.

What is the thought behind having the safeguards described here? Is the worry that a single node could be overloaded with traffic, and that maybe the correct choice would have been to send traffic to a different Pod (at the cost of going over the network)?

I may be missing something here (context: I've never used the topology hints feature, but would really like to, so maybe I haven't encountered its downsides yet), but the name "PreferClose" makes me assume that traffic would be sent to a Pod on the same node if available; if it can't go to a local Pod, then to a Pod in the same zone; and if that fails, then to a random Pod in the cluster.
The issue is that we don't know anything about the clients of the service, and we can't just assume that they are distributed in a way that ensures prefer-same-node will result in good behavior. If, for some reason, there are endpoints on Nodes A, B, and C, but most of the clients are on Node A, then prefer-same-node would result in the Node A endpoint getting overloaded while the Node B and C endpoints were idle. (If the nodes are heterogeneous in some way that affects scheduling, this might even be a predictable outcome.)

If I were attaching the prefer-same-node semantics to a new traffic distribution value then this wouldn't be a problem, because we could just say "don't use `trafficDistribution: PreferSameNode` if you don't want that". But since there are already people using `PreferClose`, we have to be careful not to break them.
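In other words, the explicit opt-in alternative being weighed would look something like this. `PreferSameNode` is not a value proposed by this KEP, just the hypothetical from the comment above.

```yaml
# Hypothetical: an explicit opt-in value instead of changing PreferClose's behavior.
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  trafficDistribution: PreferSameNode   # hypothetical value, per the comment above
  ports:
  - port: 53
    protocol: UDP
```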
Right, so I guess this KEP is about `PreferClose`; does it make sense for the thing I'm asking for to be under a different option?

I'm still not sure I like this test:

> the Service has endpoints on all or all-but-one of the Ready nodes

But I also can't think of a better way to determine whether the DaemonSet is running everywhere. I guess the situation I'm worried about is when a DaemonSet is being updated and multiple Pods are not ready at a time. However, in theory this scenario is temporary, so it isn't a big deal.

Maybe it's worth excluding control-plane nodes from the set of nodes?
AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.
The previous implementation was "smart" and turned out to be very confusing for users; that is why, IIRC, we moved to a simpler and more deterministic solution. I may have lost track of these details and I recognize I don't remember most of them, but @gauravkghildiyal and @robscott should be the most knowledgeable... but I expect …
You mean Topology Keys? That was designed to let the user be "smart" and required the proxy to be correspondingly "dumb". (The proxy had to do exactly what the user said, even if it knew that would have bad effects.) TrafficDistribution, like Topology-Aware Hints, was designed to allow the proxy to be smart, while also allowing the user to give hints about what they think would be smart for the proxy to do.

I'm open to the argument that we shouldn't change PreferClose to include prefer-same-node because we think the heuristic I suggested doesn't sufficiently capture the user's intent. But if you don't think it would ever be legitimate to change "PreferClose" to include prefer-same-node at all, then we should deprecate "PreferClose" and rename it to "PreferSameZone". The entire point of having the name be vague was that it wasn't supposed to refer to any specific behavior. If we think we can't get away with changing the behavior, then we should give up on the vagueness idea.
So this would be something like: […]

(Theoretically we could make the hints more compact by having a single top-level […].)

From kube-proxy's perspective, this is basically the same as Topology Keys, but the difference is that you only get to use it if you've also told the scheduler about the plan, so it avoids the problem of "you said to prefer-same-zone but there is only 1 endpoint in this zone (and 10 in another)", which Topology Keys was vulnerable to.

One downside of this approach is that DaemonSets don't actually use topologySpreadConstraints.
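As a rough illustration of the "tell the scheduler about the plan" idea, and only under the assumption that the scheduler-side half would be ordinary `topologySpreadConstraints` on the backing workload (the Service/hints side of such a design is not spelled out in this thread), the workload declaration might look like this:

```yaml
# Sketch only: backing Pods spread one-per-node via the scheduler, so a
# same-node preference on the Service side could rely on that placement.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-local-backend              # placeholder name
spec:
  replicas: 6                           # illustrative
  selector:
    matchLabels:
      app: node-local-backend
  template:
    metadata:
      labels:
        app: node-local-backend
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: node-local-backend
      containers:
      - name: backend
        image: registry.example/backend:1.0   # placeholder image
```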
Previously introduced as `internalTrafficPolicy: PreferLocal` in #3016 and then reworked as "node-level topology" in #3293 (which was closed before we came up with any specific API). The goal here is "pods should connect to the local CoreDNS endpoint, if there is one".
Given that "PreferClose" has an intentionally vague name (which, in particular, does not mention "zones"), it seemed to make more sense to add prefer-same-node semantics to "PreferClose" than to add a new traffic distribution ("PreferREALLYClose"?) for this. The proposed rule for deciding when prefer-same-node can be used is:
(Bringing this up now because I have reason to want to kill off our downstream "prefer same node just for CoreDNS" patch again.)
I'm not sure whether we'd need a new feature gate for this new behavior.
In the past, it has been suggested that this is what NodeLocal DNSCache is for, but NodeLocal DNSCache is just a bad partial reimplementation of Services, to make up for the fact that the real Service API doesn't have a prefer-same-node feature... (Among the reasons that it is bad: it doesn't support fallback-to-other-nodes when the DNS pods are being upgraded.)
/cc @aojea @thockin @gauravkghildiyal @robscott