HTTP/3 should support hot restart #19454
Comments
I'd like to help to implement this feature, although it seems not straightforward and may take some time to research 🤣 |
@danzh2010 @RyanTheOptimist would one of you be up for outlining what the steps might look like here? |
Sure, I'll take a stab! |
+@ggreenway who was also interested in this behavior. So the challenge here is that HTTP/3 runs on top of QUIC + UDP which is quite different from TCP. With TCP each accepted connection results in a new socket and a new file descriptor. Closing the listening socket does not close the accepted per-connection sockets. This means that the "old" listening socket can be closed and a "new" listening socket can be opened with the new config. Meanwhile, the existing sockets will remain open to handle existing connections. With UDP, there is no such thing. There is only a single socket which receives packets for all connections. If that socket is closed and a new one is opened with the new config, the packets will only be delivered to the new instance. However, all is not lost... It may still be possible to make this work, at the cost of a much more complex solution. Instead of relying on the operating system's sockets to send packets to the right connection, we will have to do this ourselves. Namely, we will need some method of determining if a packet we receive should be sent to the "old" code, or to the "new" code. For QUIC (in contrast to generic UDP) each packet contains a "connection ID" which is provided to the client by the server when the connection is created. This connection ID is used to allow the server to deliver this packet to the right connection in memory. Since this ID is created by the server, it could be constructed in such a way that all connection IDs created by the "old" config look different from the connection IDs created by the "new" config. If this were the case, then the old socket would be closed, and the new socket would be opened. When packets are received on the new socket, their connection IDs would be extracted from the packet and parsed to see if they are "old" or "new" connection IDs. If they are "new" then they would be handled as normal. On the other hand, if they are "old" then they would be delivered to the "old" code (sent to the old QuicDispatcher). There is an IETF draft which describes a mechanism that could be used to allocate QUIC connection IDs for load balancing purposes, of which this use case is a fairly simple subset. Doing the work to support this behavior would likely be split to some degree between QUICHE (where the QUIC code is implemented) and Envoy. I hope this helps provide an overview. |
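For illustration, here is a minimal C++ sketch of that classification idea, assuming a hypothetical 8-byte CID layout whose first byte carries a "config epoch"; this is not the QUICHE connection ID generator API, just the shape of the old-vs-new decision:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical 8-byte connection ID whose first byte records which config
// ("epoch") issued it. Not the QUICHE API, just the classification idea.
struct ConnectionId {
  std::array<uint8_t, 8> bytes;
};

constexpr uint8_t kOldEpoch = 0x00;  // CIDs issued before the hot restart.
constexpr uint8_t kNewEpoch = 0x01;  // CIDs issued by the new config.

ConnectionId GenerateCid(uint8_t epoch) {
  static std::mt19937_64 rng{std::random_device{}()};
  uint64_t r = rng();
  ConnectionId cid;
  for (size_t i = 0; i < cid.bytes.size(); ++i) {
    cid.bytes[i] = static_cast<uint8_t>(r >> (8 * i));
  }
  cid.bytes[0] = epoch;  // Stamp the issuing config into the first byte.
  return cid;
}

// On receipt, send "old" CIDs to the old QuicDispatcher, everything else to
// the new one.
bool BelongsToOldInstance(const ConnectionId& cid) {
  return cid.bytes[0] == kOldEpoch;
}
```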
It would be ideal if we could route the packets correctly in the kernel via EBPF, but I don't know if that is possible or not. If it's not, we could forward packets over the hot-restart communication channel between the envoy instances, but that would have an extra CPU cost. |
Oh, right! I keep forgetting about BPF. Yes, good point. AIUI, Envoy has the ability today to route packets to the correct thread using BPF based on the first word of the connection id. https://github.com/envoyproxy/envoy/blob/main/source/common/quic/active_quic_listener.cc#L286 |
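For readers unfamiliar with that mechanism, a deliberately oversimplified sketch of attaching such a classic BPF program with SO_ATTACH_REUSEPORT_CBPF is below; the real program in active_quic_listener.cc handles long/short headers and both IP versions, and the payload offset here is only illustrative:

```cpp
#include <linux/filter.h>
#include <sys/socket.h>
#include <cstdint>

// Pick the worker socket as (one word of the QUIC connection ID) % concurrency.
// Assumes concurrency > 0 and that fd is already part of a SO_REUSEPORT group.
bool AttachWorkerRoutingBpf(int fd, uint32_t concurrency) {
  struct sock_filter prog[] = {
      // For UDP reuseport programs the filter sees the UDP payload, so
      // offset 1 is roughly where a short-header QUIC CID begins.
      BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 1),             // A = 4 bytes of the CID
      BPF_STMT(BPF_ALU | BPF_MOD | BPF_K, concurrency),  // A = A % worker count
      BPF_STMT(BPF_RET | BPF_A, 0),                      // return socket index A
  };
  struct sock_fprog fprog = {
      static_cast<unsigned short>(sizeof(prog) / sizeof(prog[0])), prog};
  return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &fprog,
                    sizeof(fprog)) == 0;
}
```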
I remember we discussed this topic with @mattklein123 not too long ago. Since there was a refactoring of the listener socket factory and a socket lifetime change made in #17259, the reuse_port listeners actually duplicate the socket during hot restart. So theoretically the new listener and the old listener listen on the same socket, as opposed to creating a new listen socket for the new listener. Essentially at startup we make a socket for each worker, then just keep using that for the life of all updates. In this way the kernel reuse_port socket group doesn't change, so the packets on the same connection can land on the same worker. I think the complexity is how to make the new listener forward packets which belong to the old listener, assuming only the new listener will be listening on READ|WRITE events, while we can still use a classic BPF filter. |
Regardless of whether we can accelerate it with EBPF, we'll probably need a non-kernel solution for systems that don't support EBPF (older linux kernels, windows, mac). I imagine we'll need some code in QUICHE to look at a packet and decline to process it, but signal that it is intended for the older envoy. |
Hey there @alyssawilk 👋🏼 are you aware if there is any ongoing work on this issue? Thank you! 🙇🏼 |
We currently are not actively working on this. @relvira do you have any use case for this feature? |
hey @danzh2010! thanks for getting back to me. We are exploring rolling out QUIC support to all our ingress stacks, one of which is based mainly on long-lived connections. We rely on hot restart to gracefully close connections when a process restart is needed. |
I am beginning to actively work on this now. |
@ravenblackx what approach are you using? eBPF, or forwarding UDP over the hot-restart UDS, or something else? |
I'm thinking forwarding UDP - I don't think eBPF can be used helpfully without a significant restructure, because of the way hot-restart currently works (forking the new process so it has access to all the same handles). Since the sockets on the new and old instances are the exact same sockets, I don't think there's a way to give them distinct eBPF programs. It's difficult to open actual new sockets listening on the same port numbers and retain cross-platform compatibility, so I think the simplest option is to allow the packets to be delivered to either instance as will happen by default right now, and then forward the packet to the other instance if the receiving instance doesn't know what to do with it. (Though I haven't dug into what happens in the per-worker packet receiving area yet so I expect this to still be a pretty painful option!) Meanwhile I've noticed that ActiveQuicListener::destination assumes the eBPF program is the default one, while |
@ravenblackx that sounds like a reasonable starting point. We can always do the more involved refactoring later if someone thinks the performance of this approach is inadequate. I agree that |
I think I have a viable plan for how to make the packet forwarding work, with a few mechanical issues to iron out.
|
Thanks for the comprehensive list! A few thoughts added inline.
I think QUIC listener is already doing that: https://sourcegraph.com/github.com/envoyproxy/envoy@7bba38b743bb3bca22dffb4a21c38ccc155fbef8/-/blob/source/common/quic/active_quic_listener.cc?L167. But you may need to pass down the special state to distinguish the draining is for hot restarting.
Any reason why not relying on
|
Ah, nice, thanks! Then yes, the hot restart part just needs to be provided for the
Yeah, makes sense to use |
I was thinking that we can reuse this block: https://github.com/google/quiche/blob/043e5c45fc27199546c6004e9003306ae250061b/quiche/quic/core/quic_dispatcher.cc#L512-L523. We can let both redirecting case and dropping case fall through to this if condition, and override the existing logic in QUICHE to redirect the packet if in hot restart. Handling the packet in |
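As a rough sketch of the kind of hook being discussed (hypothetical names, not the actual QUICHE signatures), the dispatcher's "can't use this packet" branch could consult an optional forwarding callback before dropping:

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Callback that hands a raw packet to the other Envoy instance.
using ForwardPacketFn = std::function<void(const void* data, size_t len)>;

class HotRestartAwareDispatcher /* : would derive from the QUICHE dispatcher */ {
 public:
  explicit HotRestartAwareDispatcher(ForwardPacketFn forwarder)
      : forwarder_(std::move(forwarder)) {}

  // Invoked where the dispatcher currently decides it cannot use a packet
  // (the drop/buffer branch referenced above).
  void OnUndispatchablePacket(const void* data, size_t len) {
    if (forwarder_) {
      // Hot restart in progress: redirect instead of dropping.
      forwarder_(data, len);
      return;
    }
    // Existing behavior: drop (or buffer) the packet.
  }

 private:
  ForwardPacketFn forwarder_;  // Empty when no hot restart is in progress.
};
```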
I've now proposed similar but opposite as a quiche issue - I think |
I realized a problem in my plan; the new instance also needs to forward packets to the old instance, since a packet is arbitrarily delivered to either instance. Since a new packet's connection id is just noise, we can't very well use a signal in the connection id to indicate which instance expects the packet. It could help, but there are edge cases where it would be problematic, and trying to include a signal there adds a lot of complexity. My idea for resolving this is instead to have the "consume a packet" function take an extra optional parameter for "this packet was already forwarded to you", and the new instance does not re-forward. This way it resolves like:
Notably, for this purpose the old instance does not refuse to forward an already-forwarded packet, because if neither instance recognizes the packet it should be the new instance that deals with it, not the draining instance. This double-forwarding sounds a bit slow, but I think it's fine for it to be a bit slow because it's on a rare path - the common path is that half the packets are resolved immediately by the receiving instance, and half are forwarded once and recognized by the second instance. |
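Condensing those rules into a short C++ sketch (the helper functions are assumptions for illustration, not existing Envoy code):

```cpp
#include <cstddef>

enum class Instance { kOld, kNew };

// Assumed helpers, not existing Envoy functions:
bool TryDispatchLocally(const void* packet, size_t len);    // known CID here?
void ForwardToPeerInstance(const void* packet, size_t len);
void HandleAsNewConnection(const void* packet, size_t len);

void ConsumePacket(Instance self, const void* packet, size_t len,
                   bool already_forwarded) {
  if (TryDispatchLocally(packet, len)) {
    return;  // Common case: the receiving instance recognizes the packet.
  }
  if (self == Instance::kOld) {
    // The draining instance always hands unrecognized packets over, even if
    // they were already forwarded once: packets nobody recognizes must end
    // up with the new instance, not the draining one.
    ForwardToPeerInstance(packet, len);
    return;
  }
  if (!already_forwarded) {
    ForwardToPeerInstance(packet, len);  // Maybe the old instance knows it.
  } else {
    HandleAsNewConnection(packet, len);  // Neither knew it: new connection.
  }
}
```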
Is this the effect of dup()? Did you observe such behavior in your prototype? |
I didn't do a prototype but I'm fairly confident that's the behavior of a UDP socket with two identically configured instances. That said, I'm not 100% sure because I don't really know what the BPF situation is - when we fork does the socket group double in size such that everything will be messed up no matter what we do, or do we have two instances of each socket in the group? It turns out we don't explicitly I guess this means I do have to make a prototype to check what actually happens when there's BPF and cloned sockets like this, before I go deeper. |
Do you think putting the process ID or some per-process information into the CID would work? Are packets on the existing connections to the old instance distinguishable from packets on the existing connections to the new instance or packets on new connections which should be handled by the new instance?
Alternatively can you make an RPC call from the old instance to the new one about the existing CIDs the old instance is interested in? In this way, in step 6, the new instance only needs to forward packets with those CIDs to the old. |
From the fork() man page:
It says sockets are not doubled, only the file descriptors are. So the socket group should remain unchanged. But I'm not sure how the I/O events are surfaced. |
If you put a lot of bits into the CID then you're undermining the whole "CID should not be traceable" thing, and also either increasing the size of the header or dramatically decreasing the entropy. And it still wouldn't really help very strongly because new packets won't match.
Considered it, but extracting CIDs from QuicSessions is pretty rough if it's even possible, and getting them from all the per-thread instances to transfer for hot restart would be even worse. Bearing in mind that the forwarding only happens for CIDs that didn't match an expected CID on the instance that received it, it seems like the double-forwarding is a pretty small price. It would essentially only ever happen on the first few packets of a new connection during hot restart.
Yeah, I'm proceeding with a manual experiment to try to figure out exactly what happens. A coworker indicates https://lwn.net/Articles/762101/ |
Packet routing is one of the purposes of using CID, and probably only needs one bit to distinguish the child process from the parent. And there is an IETF effort to encrypt the CID in the future. As to new packets, those should only be processed by the new instance, right? So if certain bits of the CID don't match the expected bits of the old instance, it should forward the packet to the new instance.
Agree, double-forwarding the first packet on each new connection isn't too costly. If the child process can receive packets on existing connections belonging to the parent process, we need to forward packets in both directions anyway. |
Right, but if you're only using one bit for this then new connections would have a 50% chance of appearing to be for the old instance, at which point having all the special handling for this (cascading into mandatory changes to any connection id generator extensions) seems like it would be a huge waste of effort for such a small benefit.
Still struggling to get a working experiment to determine whether this is the case - my simple "just some udp sockets" testbed is unhelpfully complaining "invalid argument" on the |
Yeah, 1 bit is probably not sufficient for effectively distinguishing parent and child processes. But I think it's a good CPU improvement to use X bits of CID for this. And only 1/2^X chance that the first packet of the new connection would be double-forwarded.
Why do you need to install CBPF? |
Trying to mimic the setup that envoy has with the sockets in question - if I don't reproduce the environment reasonably accurately I can't test what it does. |
Couldn't you use a single bit in the CID to encode Also, when forwarding to the other envoy, we should make sure we're only sending to active-quic-listeners with a matching listener-address (or addresses) as the original listener that got the packet. |
Yes, but that's the same problem again - a new 'noise' CID has 50% chance of matching it so you'd still have to support the whole forwarding to everyone just in case thing if it doesn't match the current instance's packets. But with the added cost that now you also have to transport the extra data to the CID generation extension. It could be useful in the more advanced solution later (using
Yes. Anyway, finally got my manual experiment to work - it seems that if you have n UDP sockets on the same port with BPF directing packets to the right worker, and you fork so those n sockets get duplicated to the new instance, delivery is to the correct workers and next-active-listener, which (assuming your listening path has equally fast turnaround) is essentially round-robin delivery between the old and new instances. This is good news in that it's not messing with the kernel group and therefore my proposed solution will at least work. Adapting to the I believe it would also require a fairly large kernel-version-based behavior split, because if we do the "open another new socket" operation that's required for |
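A rough reconstruction of that kind of experiment (an assumption, not the exact test that was run; the BPF program and multiple worker sockets are omitted for brevity, and error handling is skipped):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int MakeReuseportUdpSocket(uint16_t port) {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  int one = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  return fd;
}

int main() {
  int fd = MakeReuseportUdpSocket(12345);
  pid_t pid = fork();  // The child inherits a duplicate fd for the same socket.
  char buf[1500];
  for (;;) {
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n < 0) {
      break;
    }
    // Each datagram is delivered to exactly one of the two processes; send a
    // stream of packets at 127.0.0.1:12345 and watch how delivery is shared
    // between "parent" and "child".
    std::printf("%s received %zd bytes\n", pid == 0 ? "child " : "parent", n);
  }
  return 0;
}
```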
Thanks for the experiment! Given the round-robin delivery, would it be easier to make the new listener not register for read events while the old one is draining? That means only the old listener will be receiving packets and forwarding packets with unseen CIDs to the new listener. How do you plan to toss packets across processes? Would tossing every packet on new connections be too costly? |
Could we combine the single bit in the CID with whether it's a long vs short packet header? IIRC we should only see random CIDs with long-headers, which are only used at connection setup. But the goal of this exercise (marking something in the CID) is just performance I believe. How much computational effort is it to check whether the current process wants this packet or not? If it's very low, then I agree it's probably not worth doing and we just try the packet and forward if it isn't accepted locally. |
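For reference, distinguishing long from short headers only needs the Header Form bit of the first byte, e.g.:

```cpp
#include <cstddef>
#include <cstdint>

// In QUIC v1 the most significant bit of the first byte is the Header Form
// bit: 1 = long header (connection setup), 0 = short header (established
// connection carrying a server-issued CID).
bool IsLongHeader(const uint8_t* payload, size_t len) {
  return len > 0 && (payload[0] & 0x80) != 0;
}
```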
I think it's over a unix-domain socket, possibly wrapped in protobuf (grpc?) to add metadata.
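A minimal sketch of what tossing a datagram over a unix-domain socket could look like (illustrative only; the real hot-restart channel has its own message framing and metadata):

```cpp
#include <sys/socket.h>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Forward a received datagram to the other instance over an already-connected
// SOCK_DGRAM unix-domain socket, prefixing the peer address as crude metadata.
bool ForwardPacket(int uds_fd, const sockaddr_storage& peer,
                   const uint8_t* data, size_t len) {
  std::vector<uint8_t> msg(sizeof(peer) + len);
  std::memcpy(msg.data(), &peer, sizeof(peer));         // who sent it
  std::memcpy(msg.data() + sizeof(peer), data, len);    // the original payload
  return send(uds_fd, msg.data(), msg.size(), 0) ==
         static_cast<ssize_t>(msg.size());
}
```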
Yeah, I think it will have pretty high overhead. When the process starts, nearly all the packets need to end up with the parent process. By the end of draining, nearly all the packets should end up with the child. So I don't think it matters which side is reading from the socket. When this is all working, we should consider publishing some benchmarks on the overhead during hot-restart. |
I think pre-checking a bit would actually be more computational effort than using the existing failed-to-deliver check (in that it would be hard to avoid impacting the common path, where using the existing failed-to-deliver check only adds any processing to the uncommon path). IMO it'd be worth pre-checking with BPF but not worth pre-checking without.
I like this idea, but I'm not sure how well it would play with the BPF programming. It looks like it could work, there's a |
It feels like having only the parent process listening instead of both processes listening doesn't make more packets need to be tossed, given that half of the packets land on the wrong instance. And in this way we can eliminate the complexity of tossing packets back and forth. WDYT?
+1 |
Agree, this seems reasonable. Also it conceptually plays well with the later eBPF addition, since both patterns involve "not doing the default thing" with the child socket. |
…packets (#28664) Add capability for special UDP packet handling during hot restart Adds a flexible argument for shutdownListeners, making it possible to pass an additional argument to QUIC listeners (or others); uses that to pass a callback to ActiveQuicListener and EnvoyQuicDispatcher. This is a no-op in production since the parameter is currently always nullptr, but is a first step towards allowing us to catch new connections on a QUIC listener and forward them to a new instance during hot restart. An important part of #19454 Risk Level: Low, essentially a no-op. Adds one nullptr check in the common path per nonroutable quic packet; zero touch to the common routable packet path. Testing: Added unit test in both EnvoyQuicDispatcher and ActiveQuicListener. Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Raven Black <[email protected]>
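To illustrate the shape of that change (the names below are hypothetical, not the actual Envoy API):

```cpp
#include <cstddef>
#include <functional>
#include <memory>

// Handler for UDP packets that the QUIC dispatcher could not use.
struct NonDispatchedUdpPacketHandler {
  std::function<void(const void* data, size_t len)> forward;
};

// Optional extra arguments threaded through to the QUIC listener/dispatcher.
struct ExtraShutdownListenerOptions {
  std::shared_ptr<NonDispatchedUdpPacketHandler> udp_packet_handler;  // null today
};

void shutdownListeners(const ExtraShutdownListenerOptions& options = {}) {
  // ... existing shutdown logic; options.udp_packet_handler would be passed
  // down to ActiveQuicListener / EnvoyQuicDispatcher so that, during hot
  // restart, unrecognized packets can be forwarded instead of dropped ...
}
```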
Reading between the lines, it appears that hot restart does not work for "plain" UDP in general, and only works for QUIC (e.g. by taking CIDs into consideration). |
You are correct that hot restart doesn't work for UDP in general. The work done in this thread only added support for QUIC (there's an interface in QUIC in which the protocol reports "I couldn't use this packet", which is used to determine whether to send the packet to the other envoy instance). I believe the QUIC support is functional at this point. And I think nobody is working on doing the same for UDP. I'm a bit surprised at UDP ceasing to work during the hot restart transition; my expectation would have been that it would drop all existing "connections" but work immediately for new connections, associating them with the new instance. (But I haven't looked closely at the code - if someone wants to try to make it functional, and it is currently as you describe, my suggestion would be to first have the parent process close the listener as soon as it knows there is a child process starting up, so as to at least not have a window of not-working-at-all, then work from there to try to do forwarding based on unrecognized tuples.) |
@ravenblackx Thanks very much for your reply and the suggestion of how to proceed. Another avenue (which does not cover the exact use case of hot-restart) is to use e.g. the filesystem-based xDS. At this point, however, this mechanism also seems broken for UDP. |
Second-guessing my previous response a bit, if the UDP behavior were to mirror the QUIC behavior then the implementation of forwarding the packets would be forwarding from the old instance to the new instance, which would mean instead of the old instance stopping listening, you'd need the new instance to not start listening until the old instance is about to stop (which makes sense because the old instance "knows" which address-tuples are already in use and the new instance does not). But same overall thrust that the first step would be to do just that, without the forwarding, so as to at least have UDP half-functioning during the overlapping period, which would be an improvement over not functioning at all. :) |
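A sketch of what "forwarding based on unrecognized tuples" could look like for plain UDP (a hypothetical helper, not existing Envoy code; it keys on the raw peer sockaddr for simplicity where a real implementation would normalize the 4-tuple):

```cpp
#include <sys/socket.h>
#include <cstring>
#include <set>

struct PeerKey {
  sockaddr_storage addr{};
  bool operator<(const PeerKey& other) const {
    return std::memcmp(&addr, &other.addr, sizeof(addr)) < 0;
  }
};

class UdpDrainRouter {
 public:
  // Record peers the draining (old) instance already has sessions for.
  void noteActiveSession(const sockaddr_storage& peer) {
    PeerKey key;
    std::memcpy(&key.addr, &peer, sizeof(peer));
    known_.insert(key);
  }
  // True: keep handling the datagram here; false: hand it to the new instance.
  bool belongsToOldInstance(const sockaddr_storage& peer) const {
    PeerKey key;
    std::memcpy(&key.addr, &peer, sizeof(peer));
    return known_.count(key) > 0;
  }

 private:
  std::set<PeerKey> known_;
};
```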
I think it makes sense to close this issue as done, and open a new one for UDP support and another one for eBPF-based support. |
(This space has been seized by @ravenblackx)
Some loose design was agreed with @RyanTheOptimist and @danzh2010
Work in progress on no-eBPF implementation, with intent for a PR for each checkbox: