
KEP-4444: add "prefer same node" semantics to PreferClose #4931

Open
danwinship wants to merge 2 commits into master

Conversation

@danwinship (Contributor) commented Oct 24, 2024

Previously introduced as internalTrafficPolicy: PreferLocal in #3016 and then reworked as "node-level topology" in #3293 (which was closed before we came up with any specific API).

The goal here is "pods should connect to the local CoreDNS endpoint, if there is one".

Given that "PreferClose" has an intentionally vague name (which, in particular, does not mention "zones"), it seemed to make more sense to add prefer-same-node semantics to "PreferClose" than to add a new traffic distribution ("PreferREALLYClose"?) for this. The proposed rule for deciding when prefer-same-node can be used is:

  * the Service has `PreferClose` traffic distribution, and
  * the Service has endpoints on all or all-but-one of the Ready nodes, and
  * the endpoints of the Service are Pods that have `OwnerReferences` indicating
    that they are all part of the same DaemonSet.
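
For illustration, here is a rough sketch of how the EndpointSlice controller might evaluate this rule. The function name, signature, and wiring are hypothetical and only meant to make the three conditions concrete; this is not proposed code or API.

    package sketch

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/apimachinery/pkg/util/sets"
    )

    // preferSameNodeEligible is a hypothetical helper showing the proposed
    // heuristic; it is not actual controller code.
    func preferSameNodeEligible(svc *v1.Service, endpointPods []*v1.Pod, readyNodes []*v1.Node) bool {
        // The Service must have `PreferClose` traffic distribution.
        if svc.Spec.TrafficDistribution == nil ||
            *svc.Spec.TrafficDistribution != v1.ServiceTrafficDistributionPreferClose {
            return false
        }

        // Every endpoint Pod must be controlled by the same DaemonSet.
        var daemonSetUID types.UID
        nodesWithEndpoints := sets.New[string]()
        for _, pod := range endpointPods {
            ref := metav1.GetControllerOf(pod)
            if ref == nil || ref.Kind != "DaemonSet" {
                return false
            }
            if daemonSetUID == "" {
                daemonSetUID = ref.UID
            } else if ref.UID != daemonSetUID {
                return false
            }
            nodesWithEndpoints.Insert(pod.Spec.NodeName)
        }

        // Endpoints must exist on all, or all-but-one, of the Ready nodes.
        missing := 0
        for _, node := range readyNodes {
            if !nodesWithEndpoints.Has(node.Name) {
                missing++
            }
        }
        return len(endpointPods) > 0 && missing <= 1
    }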

(Bringing this up now because I have reason to want to kill off our downstream "prefer same node just for CoreDNS" patch again.)

I'm not sure whether we'd need a new feature gate for this new behavior.

In the past, it has been suggested that this is what NodeLocal DNSCache is for, but NodeLocal DNSCache is just a bad partial reimplementation of Services, to make up for the fact that the real Service API doesn't have a prefer-same-node feature... (Among the reasons that it is bad: it doesn't support fallback-to-other-nodes when the DNS pods are being upgraded.)

/cc @aojea @thockin @gauravkghildiyal @robscott

@k8s-ci-robot added the cncf-cla: yes label Oct 24, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the kind/kep, approved, sig/network, and size/M labels Oct 24, 2024
@@ -266,6 +267,18 @@ NOTE: Implementations reserve the right to refine the behavior associated with
implementation. This simplifies their deployment process and reduces the
complexity of managing cross-cluster applications.

#### Story 4 (DNS)

* **Requirement:** As a cluster administrator, I plan to deploy CoreDNS so that there
Member

Yeah, this is common: having the DNS resolver per node. This also removes the nodeLocalDNS hack to divert the traffic, and people/installers (kops, kubeadm, ...) can just replace the DNS Deployment with a DaemonSet.

Comment on lines +346 to +348
* the Service has endpoints on all or all-but-one of the Ready nodes, and
* the endpoints of the Service are Pods that have `OwnerReferences` indicating
that they are all part of the same DaemonSet.
Member

Isn't it a bit of a stretch to rely on these specifics?
Wouldn't it be better to add a new algorithm like PreferNode than to try to special-case the existing one?

@aojea (Member) commented Oct 24, 2024

I completely agree that we should do this; my concern is whether we should overload the existing term instead of adding a new one.

@danwinship (Contributor, Author)

It's not even overloading though; PreferClose already means this:

// Indicates a preference for routing traffic to endpoints that are
// topologically proximate to the client. The interpretation of "topologically
// proximate" may vary across implementations and could encompass endpoints
// within the same node, rack, zone, or even region. Setting this value gives
// implementations permission to make different tradeoffs, e.g. optimizing for
// proximity rather than equal distribution of load. Users should not set this
// value if such tradeoffs are not acceptable.
ServiceTrafficDistributionPreferClose = "PreferClose"

The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.

@aojea (Member) commented Oct 24, 2024

The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.

I don't disagree with that, but I expect the interpretation to be always the same for the same implementation, not different depending on factors like whether it is a DaemonSet or has fewer than 3 endpoints. That is a slippery slope in my opinion, and it will make this more complex to debug, explain, and troubleshoot.

Comment on lines +331 to +333
then it attempts to route traffic to endpoints on the same node as the client. If
no endpoints are available on the same node, traffic would be routed to other
nodes (though still preferring the same zone, if relevant).
Member

Just a comment: This is super great!
Having used internalTrafficPolicy: Local, its one shortfall is when that endpoint is briefly down (when we upgrade the DaemonSet, for example).
As a user I'd prefer traffic to stay local to the node, but I don't mind if it occasionally leaves the node (or the zone), as long as it's temporary.

This is just a comment, so you're welcome to resolve this.

clients on the same node as this endpoint should prefer it. This will be set
when:
* the Service has `PreferClose` traffic distribution, and
* the Service has endpoints on all or all-but-one of the Ready nodes, and
@adrianmoisey (Member) commented Oct 24, 2024

This comment may be similar to what Antonio is saying below.
If a DaemonSet is configured with updateStrategy.rollingUpdate.maxUnavailable > 1, this won't match.

Also, a use case we have is that we run a statsd DaemonSet on all nodes, except for a handful of nodes that are short-lived and have a very specific task, and running the DaemonSet there costs too much.

As a user I'd want the "SameNode" functionality to be the "preferred" method, but I also understand that this feature is covering a wide range of use-cases.

Contributor (Author)

And would you want prefer-same-node semantics for that Service?

There is definitely a tradeoff if we want to try to "auto-detect" this.

Maybe I should try to figure out the TopologySpreadConstraints version of this...

Member

And would you want prefer-same-node semantics for that Service?

Definitely!
Since we are running the DaemonSet on most of our nodes, we wouldn't worry about overloading a specific node.

What is the thought behind having the safeguards described here? Is the worry that a single node could be overloaded with traffic, and that the correct choice might be to send traffic to a different Pod (at the cost of going over the network)?

I may be missing something here (context: I've never used the topology hints feature, but would really like to, so maybe I haven't encountered its downsides yet), but the name "PreferClose" makes me assume that traffic would be sent to a pod on the same node if available; if it can't go to a local pod, then to a pod in the same zone; and if that fails, then to a random pod in the cluster.

Contributor (Author)

The issue is that we don't know anything about the clients of the service, and we can't just assume that they are distributed in a way that ensures prefer-same-node will result in good behavior. If, for some reason, there are endpoints on Nodes A, B, and C, but most of the clients are on Node A, then prefer-same-node would result in the Node A endpoint getting overloaded while the Node B and C endpoints were idle. (If the nodes are heterogeneous in some way that affects scheduling this might even be a predictable outcome.)

If I was attaching the prefer-same-node semantics to a new traffic distribution value then this wouldn't be a problem because we could just say "don't use trafficDistribution: PreferSameNode if you don't want that". But since there are already people using PreferClose, we have to be careful to not break them.

@adrianmoisey (Member) commented Oct 28, 2024

Right, so I guess this KEP is about PreferClose; does it make sense for the thing I'm asking for to be under a different option?

I'm still not sure I like this test:

the Service has endpoints on all or all-but-one of the Ready nodes

But I also can't think of a better way to determine if the DaemonSet is running everywhere.
I guess the situation I'm worried about is when a DaemonSet is being updated and multiple Pods are not ready at the same time. However, in theory this scenario is temporary, so it isn't a big deal.

Maybe it's worth excluding control-plane nodes from the set of nodes?

@danwinship (Contributor, Author)

I expect the interpretation to be always the same for the same implementation

AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.

@aojea (Member) commented Oct 25, 2024

AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.

The previous implementation was "smart" and turned out to be very confusing for users; that is why, IIRC, we moved to a simpler and more deterministic solution.

I may have lost track of these details, and I recognize I don't remember most of them (@gauravkghildiyal and @robscott should be the most knowledgeable), but I expect PreferClose to always work the same way in kube-proxy, and would add a new traffic distribution mode for this.

@danwinship (Contributor, Author) commented Oct 25, 2024

You mean Topology Keys? That was designed to let the user be "smart" and required the proxy to be correspondingly "dumb". (The proxy had to do exactly what the user said, even if it knew that would have bad effects). TrafficDistribution, like Topology-Aware Hints, was designed to allow the proxy to be smart, while also allowing the user to give hints about what they think would be smart for the proxy to do.

I'm open to the argument that we shouldn't change PreferClose to include prefer-same-node because we think the heuristic I suggested doesn't sufficiently capture the user's intent. But if you don't think it would ever be legitimate to change "PreferClose" to include prefer-same-node at all, then we should deprecate "PreferClose" and rename it to "PreferSameZone". The entire point of having the name be vague was that it wasn't supposed to refer to any specific behavior. If we think we can't get away with changing the behavior, then we should give up on the vagueness idea.

@danwinship (Contributor, Author)

Maybe I should try to figure out the TopologySpreadConstraints version of this...

(TopologySpreadConstraints is a feature for telling the scheduler how to schedule pods to nodes in a way that ensures they are evenly distributed relative to some topological constraint (eg, host, zone, region, etc). It is fairly complicated, but the complication is all in the "figuring out when you can and can't schedule a pod" part. The actual topology definition is a single field, and that's all we'd need to look at here.)

So this would be something like:

  • Add a new ForNodeSelector *metav1.LabelSelector field to discoveryv1.EndpointHints.

  • If a Service has trafficDistribution: PreferClose (or maybe trafficDistribution: TopologyConstrained), then for each endpoint Pod that has .spec.topologySpreadConstraints (which in the normal case would be all of them), the EndpointSlice controller will set the forNodeSelector field of the corresponding Endpoint to an appropriate selector based on the constraints' topologyKey values. E.g., if there is an endpoint pod on "node1", with a single TopologySpreadConstraint that has topologyKey: "kubernetes.io/hostname", then the corresponding Endpoint would get:

    hints:
      forNodeSelector:
        matchLabels:
          "kubernetes.io/hostname": node1
    
  • If kube-proxy sees an endpoint with the forNodeSelector hint set (and topology is enabled), then it would check if the node selector applies to its own host, and ignore the endpoint otherwise (unless this resulted in no endpoints being available, in which case it would ignore the forNodeSelector hints for this service).

(Theoretically we could make the hints more compact by having a single top-level slice.hints.forTopologyKeys: ["kubernetes.io/hostname"] rather than exploding that out into per-endpoint nodeSelectors. This would require the service proxy to keep track of the labels of all nodes rather than only its own node, so that if it sees slice.hints.forTopologyKeys: ["topology.kubernetes.io/region"] it can figure out which nodes have the same region as its own node.)

From kube-proxy's perspective, this is basically the same as Topology Keys, but the difference is that you only get to use it if you've also told the scheduler about the plan, so it avoids the problem of "you said to prefer-same-zone but there is only 1 endpoint in this zone (and 10 in another)" which Topology Keys was vulnerable to.

One downside of this approach is that DaemonSets don't actually use topologySpreadConstraints... So you'd have to redundantly specify a dummy TSC as part of your DaemonSet pod template... Or maybe we'd say that trafficDistribution: TopologyConstrained applied to either pods using TopologySpreadConstraints or pods deployed by a DaemonSet.
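
For concreteness, here is a rough sketch of the pieces described above. The ForNodeSelector field, the hint derivation, and the proxy-side filtering are all hypothetical (none of this is existing API), and the names are only illustrative.

    package sketch

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/labels"
    )

    // EndpointHints mirrors discoveryv1.EndpointHints with the hypothetical
    // ForNodeSelector field added (this field does not exist today).
    type EndpointHints struct {
        ForNodeSelector *metav1.LabelSelector `json:"forNodeSelector,omitempty"`
    }

    // Endpoint is a simplified stand-in for a discovery/v1 Endpoint.
    type Endpoint struct {
        Address string
        Hints   *EndpointHints
    }

    // hintForPod sketches how the EndpointSlice controller could derive the
    // forNodeSelector from a Pod's topologySpreadConstraints and the node it
    // runs on, e.g. topologyKey "kubernetes.io/hostname" for a pod on "node1"
    // yields matchLabels {"kubernetes.io/hostname": "node1"}.
    func hintForPod(pod *corev1.Pod, node *corev1.Node) *EndpointHints {
        matchLabels := map[string]string{}
        for _, tsc := range pod.Spec.TopologySpreadConstraints {
            if val, ok := node.Labels[tsc.TopologyKey]; ok {
                matchLabels[tsc.TopologyKey] = val
            }
        }
        if len(matchLabels) == 0 {
            return nil
        }
        return &EndpointHints{ForNodeSelector: &metav1.LabelSelector{MatchLabels: matchLabels}}
    }

    // filterForNode sketches the proxy side: keep only endpoints whose
    // forNodeSelector matches this node's labels, unless that would leave no
    // endpoints, in which case fall back to ignoring the hints entirely.
    func filterForNode(endpoints []Endpoint, nodeLabels map[string]string) []Endpoint {
        var matching []Endpoint
        for _, ep := range endpoints {
            if ep.Hints == nil || ep.Hints.ForNodeSelector == nil {
                matching = append(matching, ep)
                continue
            }
            sel, err := metav1.LabelSelectorAsSelector(ep.Hints.ForNodeSelector)
            if err != nil {
                continue // a malformed selector just excludes this endpoint in the sketch
            }
            if sel.Matches(labels.Set(nodeLabels)) {
                matching = append(matching, ep)
            }
        }
        if len(matching) == 0 {
            return endpoints
        }
        return matching
    }

As noted in the preceding paragraph, DaemonSet pods would not normally carry topologySpreadConstraints, so the derivation step would need either the dummy-TSC workaround or the DaemonSet special case.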

Labels: approved, cncf-cla: yes, kind/kep, sig/network, size/M
4 participants