
HTTP/3 should support hot restart #19454

Closed
6 tasks done
alyssawilk opened this issue Jan 10, 2022 · 43 comments
@alyssawilk
Contributor

alyssawilk commented Jan 10, 2022

(This space has been seized by @ravenblackx)

A loose design was agreed on with @RyanTheOptimist and @danzh2010.
Work is in progress on a no-eBPF implementation, with the intent of a PR for each checkbox:

@alyssawilk added the enhancement, no stalebot, and area/quic labels Jan 10, 2022
@YaoZengzeng
Member

I'd like to help implement this feature, although it seems not straightforward and may take some time to research 🤣

@alyssawilk
Contributor Author

@danzh2010 @RyanTheOptimist would one of you be up for outlining what the steps might look like here?

@RyanTheOptimist
Contributor

Sure, I'll take a stab!

@RyanTheOptimist
Contributor

+@ggreenway who was also interested in this behavior.

So the challenge here is that HTTP/3 runs on top of QUIC + UDP, which is quite different from TCP. With TCP, each accepted connection results in a new socket and a new file descriptor. Closing the listening socket does not close the accepted per-connection sockets. This means that the "old" listening socket can be closed and a "new" listening socket can be opened with the new config. Meanwhile, the existing sockets will remain open to handle existing connections.

With UDP, there is no such thing. There is only a single socket which receives packets for all connections. If that socket is closed and a new one is opened with the new config, packets will only be delivered to the new instance.

However, all is not lost... It may still be possible to make this work, at the cost of a much more complex solution. Instead of relying on the operating system's sockets to send packets to the right connection, we will have to do this ourselves. Namely, we will need some method of determining if a packet we receive should be sent to the "old" code, or to the "new" code. For QUIC (in contrast to generic UDP) each packet contains a "connection ID" which is provided to the client by the server when the connection is created. This connection ID is used to allow the server to deliver this packet to the right connection in memory. Since this ID is created by the server, it could be constructed in such a way that all connection IDs created by the "old" config look different from the connection IDs created by the "new" config. If this were the case, then the old socket would be closed, and the new socket would be opened. When packets are received on the new socket, their connection IDs would be extracted and parsed to see if they are "old" or "new" connection IDs. If they are "new" then they would be handled as normal. On the other hand, if they are "old" then they would be delivered to the "old" code (sent to the old QuicDispatcher).
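
A minimal sketch of this connection-ID partitioning idea (illustrative only; the tag byte and helper names below are hypothetical, not an existing Envoy or QUICHE API):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical tag values marking which config generated a connection ID.
constexpr uint8_t kOldConfigTag = 0x00;
constexpr uint8_t kNewConfigTag = 0x01;

// Generate a server-chosen connection ID whose first byte records the config.
std::vector<uint8_t> generateConnectionId(uint8_t config_tag, size_t length = 8) {
  static std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<int> byte_dist(0, 255);
  std::vector<uint8_t> cid(length);
  for (auto& byte : cid) {
    byte = static_cast<uint8_t>(byte_dist(rng));
  }
  cid[0] = config_tag;  // Reserve the first byte to mark the owning config.
  return cid;
}

// On receipt, the tag byte decides whether to dispatch locally or hand the
// packet to the "old" QuicDispatcher.
bool belongsToOldConfig(const std::vector<uint8_t>& cid) {
  return !cid.empty() && cid[0] == kOldConfigTag;
}
```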

There is an IETF draft which describes a mechanism that could be used to allocate QUIC connection IDs for load-balancing purposes, of which this use case is a fairly simple subset.
https://quicwg.org/load-balancers/draft-ietf-quic-load-balancers.html

Doing the work to support this behavior would likely be split to some degree between QUICHE (where the QUIC code is implemented) and Envoy.

I hope this helps provide an overview.

@ggreenway
Contributor

It would be ideal if we could route the packets correctly in the kernel via eBPF, but I don't know whether that is possible. If it's not, we could forward packets over the hot-restart communication channel between the Envoy instances, but that would have an extra CPU cost.

@RyanTheOptimist
Contributor

Oh, right! I keep forgetting about BPF. Yes, good point. AIUI, Envoy has the ability today to route packets to the correct thread using BPF based on the first word of the connection id. https://github.com/envoyproxy/envoy/blob/main/source/common/quic/active_quic_listener.cc#L286
I suspect that could be extended to detect the old vs new connection ID state. I'm not familiar enough, though, with BPF to understand how we would route those to the right place in Envoy. Do you have a sense for how this would work?
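
For reference, a heavily simplified sketch of that style of classic BPF routing (not the exact program Envoy installs, which handles more cases; this assumes Linux, a short-header packet, and a destination connection ID starting at byte 1 of the UDP payload):

```cpp
#include <linux/filter.h>
#include <sys/socket.h>
#include <cstdint>

// Attach a classic BPF program that picks a worker socket in the SO_REUSEPORT
// group from the first 32 bits of the destination connection ID.
int attachReuseportRouting(int fd, uint32_t concurrency) {
  sock_filter code[] = {
      {BPF_LD | BPF_W | BPF_ABS, 0, 0, 1},             // A = CID bytes 1..4
      {BPF_ALU | BPF_MOD | BPF_K, 0, 0, concurrency},  // A = A % concurrency
      {BPF_RET | BPF_A, 0, 0, 0},                      // return worker index
  };
  sock_fprog prog = {};
  prog.len = sizeof(code) / sizeof(code[0]);
  prog.filter = code;
  return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));
}
```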

@danzh2010
Contributor

danzh2010 commented Jan 12, 2022

I remember we discussed this topic with @mattklein123 not too long ago. Since the refactoring of the listener socket factory and the socket lifetime changes made in #17259, reuse_port listeners actually duplicate the socket during hot restart. So theoretically the new listener and the old listener listen on the same socket, as opposed to creating a new listen socket for the new listener. Essentially at startup we make a socket for each worker, then just keep using that for the life of all updates. In this way the kernel reuse-port socket group doesn't change, so packets on the same connection can land on the same worker.

I think the complexity is in how to make the new listener forward packets which belong to the old listener, assuming only the new listener will be listening for READ|WRITE events. Meanwhile, we can still use a classic BPF filter.

@ggreenway
Contributor

Regardless of whether we can accelerate it with eBPF, we'll probably need a non-kernel solution for systems that don't support eBPF (older Linux kernels, Windows, macOS).

I imagine we'll need some code in QUICHE to look at a packet and decline to process it, but signal that it is intended for the older Envoy.

@relvira

relvira commented Nov 11, 2022

Hey there @alyssawilk 👋🏼 are you aware if there is any ongoing work on this issue?

Thank you! 🙇🏼

@danzh2010
Contributor

We currently are not actively working on this. @relvira do you have any use case for this feature?

@relvira

relvira commented Nov 14, 2022

hey @danzh2010! thanks for getting back to me. We are exploring rolling out QUIC support to all our ingress stacks, one of which is based mainly on long-lived connections.

We rely on hot restart to gracefully close connections when a process restart is needed.

@ravenblackx ravenblackx self-assigned this Jul 14, 2023
@ravenblackx
Contributor

I am now starting to actively work on this.
@YaoZengzeng have you made any progress / given up / other? :)

@ggreenway
Contributor

@ravenblackx what approach are you using? eBPF, or forwarding UDP over the hot-restart UDS, or something else?

@ravenblackx
Contributor

I'm thinking forwarding UDP - I don't think eBPF can be used helpfully without a significant restructure, because of the way hot-restart currently works (forking the new process so it has access to all the same handles). Since the sockets on the new and old instances are the exact same sockets, I don't think there's a way to give them distinct eBPF programs.

It's difficult to open actual new sockets listening on the same port numbers and retain cross-platform compatibility, so I think the simplest option is to allow the packets to be delivered to either instance as will happen by default right now, and then forward the packet to the other instance if the receiving instance doesn't know what to do with it. (Though I haven't dug into what happens in the per-worker packet receiving area yet so I expect this to still be a pretty painful option!)

Meanwhile I've noticed that ActiveQuicListener::destination assumes the eBPF program is the default one, while QuicConnectionIdGenerator extensions can override createCompatibleLinuxBpfSocketOption; I think QuicConnectionIdGenerator probably needs to also be able to override the behavior of destination to make this correct.

@ggreenway
Contributor

@ravenblackx that sounds like a reasonable starting point. We can always do the more involved refactoring later if someone thinks the performance of this approach is inadequate.

I agree that destination() needs to be correct in any configuration, so do whatever needs to be done to make that happen.

@ravenblackx
Contributor

I think I have a viable plan for how to make the packet forwarding work, with a few mechanical issues to iron out.

  1. stopListeners needs to put ActiveQuicListeners into a custom draining state rather than hard-stopping them, since hard-stopping the listener effectively terminates existing quic connections immediately.
  2. The draining state on an ActiveQuicListener should cause OnFailedToDispatchPacket in the EnvoyQuicDispatcher to return true, which prevents accept_new_connections_ from even being checked, making it equivalent to closing just the 'listener' for a TCP socket while keeping existing connections open; we can also perform additional actions on those packets in OnFailedToDispatchPacket, in the case that we're hot restarting.
  3. I think on hot restart we need to open another RPC socket between parent and child process, for forwarding these packets. The existing RPC socket is a poor fit for this as it only supports calls from child to parent, and the non-blocking call-response pattern is polling.
  4. An interface to the hot restart packet forwarding socket would need to be passed in to every ActiveQuicListener (and its EnvoyQuicDispatcher since that's where the OnFailedToDispatchPacket action is) that's potentially going to be forwarding. I think that could be rolled into the behavior of the pass_listen_socket RPC.
  5. The child / receiving end of the hot restart packet forwarding socket (which I think might need to live on its own short-term thread) catches any forwarded packets and passes them to udp_listener_worker_router_.deliver, which I think should get them into the normal flow on the new instance. I'm not 100% sure that the quic::ReceivedPacketInfo can be devolved back into an Envoy::Network::UdpRecvData though, which would be a requirement for this method to work.

@danzh2010
Contributor

Thanks for the comprehensive list! A few thoughts added inline.

  1. stopListeners needs to put ActiveQuicListeners into a custom draining state rather than hard-stopping them, since hard-stopping the listener effectively terminates existing quic connections immediately.

I think QUIC listener is already doing that: https://sourcegraph.com/github.com/envoyproxy/envoy@7bba38b743bb3bca22dffb4a21c38ccc155fbef8/-/blob/source/common/quic/active_quic_listener.cc?L167. But you may need to pass down the special state to distinguish the draining is for hot restarting.

  2. The draining state on an ActiveQuicListener should cause OnFailedToDispatchPacket in the EnvoyQuicDispatcher to return true, which prevents accept_new_connections_ from even being checked, making it equivalent to closing just the 'listener' for a TCP socket while keeping existing connections open; we can also perform additional actions on those packets in OnFailedToDispatchPacket, in the case that we're hot restarting.

Any reason why not relying on accept_new_connections_ to pause further processing? Probably a new callback in QuicDispatcher is needed in that case to decide whether to STATELESS_RESET the packet or to forward the packet to the new process.

  3. I think on hot restart we need to open another RPC socket between parent and child process, for forwarding these packets. The existing RPC socket is a poor fit for this as it only supports calls from child to parent, and the non-blocking call-response pattern is polling.
  4. An interface to the hot restart packet forwarding socket would need to be passed in to every ActiveQuicListener (and its EnvoyQuicDispatcher since that's where the OnFailedToDispatchPacket action is) that's potentially going to be forwarding. I think that could be rolled into the behavior of the pass_listen_socket RPC.
  5. The child / receiving end of the hot restart packet forwarding socket (which I think might need to live on its own short-term thread) catches any forwarded packets and passes them to udp_listener_worker_router_.deliver, which I think should get them into the normal flow on the new instance. I'm not 100% sure that the quic::ReceivedPacketInfo can be devolved back into an Envoy::Network::UdpRecvData though, which would be a requirement for this method to work.

quic::ReceivedPacketInfo contains pretty much the same information as Envoy::Network::UdpRecvData so I think they are interchangeable.

@ravenblackx
Contributor

ravenblackx commented Jul 26, 2023

I think QUIC listener is already doing that:

Ah, nice, thanks! Then yes, the hot restart part just needs to be provided for the OnFailedToDispatchPacket handler.

Any reason why not relying on accept_new_connections_ to pause further processing? Probably a new callback in QuicDispatcher is needed in that case to decide whether to STATELESS_RESET the packet or to forward the packet to the new process.

Yeah, makes sense to use accept_new_connections_ since it's already set. In this case OnFailedToDispatchPacket can just return false whenever it's not redispatched to the other instance, in which case it will fall through to the same accept_new_connections_ handling it does now, and true when it is redispatched which leaves the end result up to the new instance.
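
A rough sketch of that shape (the forwarding callback name is hypothetical and the signature is approximated from this discussion, not copied from QUICHE):

```cpp
// Sketch: if a hot restart is in progress and a forwarding callback was
// provided, hand unrouted packets to the other instance; otherwise fall
// through to the normal QUICHE handling (including accept_new_connections_).
bool EnvoyQuicDispatcher::OnFailedToDispatchPacket(
    const quic::ReceivedPacketInfo& packet_info) {
  if (forward_packet_callback_ != nullptr &&
      forward_packet_callback_(packet_info)) {
    return true;  // Consumed here; the other instance owns the outcome.
  }
  return false;  // Not forwarded; existing dispatcher logic applies.
}
```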

@danzh2010
Contributor

Any reason why not relying on accept_new_connections_ to pause further processing? Probably a new callback in QuicDispatcher is needed in that case to decide whether to STATELESS_RESET the packet or to forward the packet to the new process.

Yeah, makes sense to use accept_new_connections_ since it's already set. In this case OnFailedToDispatchPacket can just return false whenever it's not redispatched to the other instance, in which case it will fall through to the same accept_new_connections_ handling it does now, and true when it is redispatched which leaves the end result up to the new instance.

I was thinking that we can reuse this block: https://github.com/google/quiche/blob/043e5c45fc27199546c6004e9003306ae250061b/quiche/quic/core/quic_dispatcher.cc#L512-L523. We can let both the redirecting case and the dropping case fall through to this if condition, and override the existing logic in QUICHE to redirect the packet if in hot restart. Handling the packet in OnFailedToDispatchPacket() may cause packets on connections already in the time-wait list to be redirected.

@ravenblackx
Contributor

I've now proposed similar but opposite as a quiche issue - I think OnFailedToDispatchPacket is the right place for this (because it mostly works without changing quiche at all), and the fix for time-wait would be to make it higher priority than OnFailedToDispatchPacket. If we wanted to handle it anywhere else in that block I think it would require much more significant changes to quiche.

@ravenblackx
Contributor

I realized a problem in my plan; the new instance also needs to forward packets to the old instance, since a packet is arbitrarily delivered to either instance.

Since a new packet's connection id is just noise, we can't very well use a signal in the connection id to indicate which instance expects the packet. It could help, but there are edge cases where it would be problematic, and trying to include a signal there adds a lot of complexity.

My idea for resolving this is instead to have the "consume a packet" function take an extra optional parameter for "this packet was already forwarded to you", and the new instance does not re-forward. This way it resolves like:

  1. packet comes in to old instance that belongs to old instance. It gets resolved.
  2. a packet comes in to new instance that belongs to new instance. It gets resolved.
  3. a packet comes in to new instance that belongs to old instance. It gets forwarded, and resolved by old instance.
  4. a packet comes in to old instance that belongs to new instance. It gets forwarded, and resolved by new instance.
  5. a packet comes in to old instance that belongs to nobody. It gets forwarded, recorded as forwarded, and therefore is resolved by new instance as an unrouted packet (maybe being kept or starting a new connection, maybe rejected).
  6. a packet comes in to new instance that belongs to nobody. It gets forwarded to old instance, then gets forwarded back to new instance, and concludes as in 5.

Note that for this purpose, the old instance does not refuse to forward an already-forwarded packet, because if neither instance recognizes the packet it should be the new instance that deals with it, not the draining instance.

This double-forwarding sounds a bit slow, but I think it's fine for it to be a bit slow because it's on a rare path - the common path is half the packets are resolved immediately by the receiving instance, and half are forwarded once and recognized by the second instance.
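
In sketch form, the decision described above might look like this (all names are illustrative, not real Envoy interfaces):

```cpp
struct ReceivedPacket;  // placeholder for the decoded UDP/QUIC packet
void forwardToPeerInstance(const ReceivedPacket& packet, bool already_forwarded);
void handleAsNewConnectionOrReject(const ReceivedPacket& packet);

// Decide what to do with a packet no local QUIC session claimed, given the
// "was this already forwarded to me" flag carried over the hot-restart channel.
void onUnroutedPacket(const ReceivedPacket& packet, bool already_forwarded,
                      bool is_draining_instance) {
  if (is_draining_instance) {
    // Old (draining) instance: always forward, even if already forwarded,
    // because unknown packets should ultimately be decided by the new instance.
    forwardToPeerInstance(packet, /*already_forwarded=*/true);
    return;
  }
  if (already_forwarded) {
    // New instance, second hop: neither instance recognized it, so treat it
    // as an unrouted packet locally (maybe a new connection, maybe rejected).
    handleAsNewConnectionOrReject(packet);
    return;
  }
  // New instance, first hop: the old instance may still own this connection.
  forwardToPeerInstance(packet, /*already_forwarded=*/true);
}
```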

@danzh2010
Contributor

a packet is arbitrarily delivered to either instance.

Is this the effect of dup()? Did you observe such behavior in your prototype?

@ravenblackx
Contributor

I didn't do a prototype but I'm fairly confident that's the behavior of a UDP socket with two identically configured instances. That said, I'm not 100% sure, because I don't really know what the BPF situation is - when we fork, does the socket group double in size such that everything will be messed up no matter what we do, or do we have two instances of each socket in the group?

It turns out we don't explicitly dup(), we fork the whole process so the sockets are implicitly duplicated.

I guess this means I do have to make a prototype to check what actually happens when there's BPF and cloned sockets like this, before I go deeper.

@danzh2010
Contributor

Since a new packet's connection id is just noise, we can't very well use a signal in the connection id to indicate which instance expects the packet. It could help, but there are edge cases where it would be problematic, and trying to include a signal there adds a lot of complexity.

Do you think if putting process ID or some per-process information into CID would work? That would make packets on existing connections to the old instance distinguishable from packets on existing connections to the new instance, and from packets on new connections, which should be handled by the new instance.

My idea for resolving this is instead to have the "consume a packet" function take an extra optional parameter for "this packet was already forwarded to you", and the new instance does not re-forward. This way it resolves like:

  1. packet comes in to old instance that belongs to old instance. It gets resolved.
  2. a packet comes in to new instance that belongs to new instance. It gets resolved.
  3. a packet comes in to new instance that belongs to old instance. It gets forwarded, and resolved by old instance.
  4. a packet comes in to old instance that belongs to new instance. It gets forwarded, and resolved by new instance.
  5. a packet comes in to old instance that belongs to nobody. It gets forwarded, recorded as forwarded, and therefore is resolved by new instance as an unrouted packet (maybe being kept or starting a new connection, maybe rejected).
  6. a packet comes in to new instance that belongs to nobody. It gets forwarded to old instance, then gets forwarded back to new instance, and concludes as in 5.

Alternatively can you make an RPC call from the old instance to the new one about the existing CIDs the old instance is interested in? In this way, in step 6, the new instance only needs to forward packets with those CIDs to the old.

@danzh2010
Contributor

I didn't do a prototype but I'm fairly confident that's the behavior of a UDP socket with two identically configured instances. That said I'm not 100% because I don't really know what the BPF situation is - when we fork does the socket group double in size such that everything will be messed up no matter what we do, or do we have two instances of each socket in the group?

From fork() man page:

The child inherits copies of the parent's set of open file
          descriptors.  Each file descriptor in the child refers to the
          same open file description (see [open(2)](https://man7.org/linux/man-pages/man2/open.2.html)) as the corresponding
          file descriptor in the parent.  This means that the two file
          descriptors share open file status flags, file offset, and
          signal-driven I/O attributes (see the description of F_SETOWN
          and F_SETSIG in [fcntl(2)](https://man7.org/linux/man-pages/man2/fcntl.2.html)).

It says sockets are not doubled, only the file descriptors are. So the socket group should remain unchanged. But I'm not sure how the I/O events are surfaced.

@ravenblackx
Contributor

Do you think if putting process ID or some per-process information into CID would work?

If you put a lot of bits into the CID then you're undermining the whole "CID should not be traceable" thing, and also either increasing the size of the header or dramatically decreasing the entropy. And it still wouldn't really help very strongly because new packets won't match.

Alternatively can you make an RPC call from the old instance to the new one about the existing CIDs the old instance is interested in?

Considered it, but extracting CIDs from QuicSessions is pretty rough if it's even possible, and getting them from all the per-thread instances to transfer for hot restart would be even worse. Bearing in mind that the forwarding only happens for CIDs that didn't match an expected CID on the instance that received it, it seems like the double-forwarding is a pretty small price. It would essentially only ever happen on the first few packets of a new connection during hot restart.

But I'm not sure how the I/O events are surfaced.

Yeah, I'm proceeding with a manual experiment to try to figure out exactly what happens.

A coworker indicates that BPF_MAP_TYPE_REUSEPORT_SOCKARRAY (https://lwn.net/Articles/762101/) may be useful, but I think that is a thing you'd do when opening a new socket with SO_REUSEPORT; based on my understanding right now, switching to doing that would be a hefty change to how things work and a significant pain for other cross-platform related purposes (and would still require doing something like what I'm doing anyway, for contexts where BPF isn't available).

@danzh2010
Contributor

danzh2010 commented Jul 31, 2023

Do you think if putting process ID or some per-process information into CID would work?

If you put a lot of bits into the CID then you're undermining the whole "CID should not be traceable" thing, and also either increasing the size of the header or dramatically decreasing the entropy. And it still wouldn't really help very strongly because new packets won't match.

Packet routing is one of the purposes of using the CID, and it probably only needs one bit to distinguish the child process from the parent. And there is an IETF effort to encrypt the CID in the future. As to new packets, those should only be processed by the new instance, right? So if certain bits of the CID don't match the expected bits of the old instance, it should forward the packet to the new instance.

Alternatively can you make an RPC call from the old instance to the new one about the existing CIDs the old instance is interested in?

Considered it, but extracting CIDs from QuicSessions is pretty rough if it's even possible, and getting them from all the per-thread instances to transfer for hot restart would be even worse. Bearing in mind that the forwarding only happens for CIDs that didn't match an expected CID on the instance that received it, it seems like the double-forwarding is a pretty small price. It would essentially only ever happen on the first few packets of a new connection during hot restart.

Agree double-forwarding the first packet on each new connection isn't too costly. If the child process can receive packets on existing connections belonging to the parent process, we need to forward packets in both directions anyway.

@ravenblackx
Contributor

Packet routing is one of the purposes of using the CID, and it probably only needs one bit to distinguish the child process from the parent. And there is an IETF effort to encrypt the CID in the future. As to new packets, those should only be processed by the new instance, right? So if certain bits of the CID don't match the expected bits of the old instance, it should forward the packet to the new instance.

Right, but if you're only using one bit for this then new connections would have a 50% chance of appearing to be for the old instance, at which point having all the special handling for this (cascading into mandatory changes to any connection id generator extensions) seems like it would be a huge waste of effort for such a small benefit.

Agree double-forwarding the first packet on each new connection isn't too costly. If the child process can receive packets on existing connections belonging to the parent process, we need to forward packets in both directions anyway.

Still struggling to get a working experiment to determine whether this is the case - my simple "just some udp sockets" testbed is unhelpfully complaining "invalid argument" on the setsockopt(SO_ATTACH_REUSEPORT_CBPF) step.

@danzh2010
Contributor

Right, but if you're only using one bit for this then new connections would have a 50% chance of appearing to be for the old instance, at which point having all the special handling for this (cascading into mandatory changes to any connection id generator extensions) seems like it would be a huge waste of effort for such a small benefit.

Yeah, 1 bit is probably not sufficient for effectively distinguishing parent and child processes. But I think it would be a good CPU improvement to use X bits of the CID for this, with only a 1/2^X chance that the first packet of a new connection would be double-forwarded.

Still struggling to get a working experiment to determine whether this is the case - my simple "just some udp sockets" testbed is unhelpfully complaining "invalid argument" on the setsockopt(SO_ATTACH_REUSEPORT_CBPF) step.

Why do you need to install CBPF?

@ravenblackx
Contributor

Why do you need to install CBPF?

Trying to mimic the setup that envoy has with the sockets in question - if I don't reproduce the environment reasonably accurately I can't test what it does.

@ggreenway
Contributor

Couldn't you use a single bit in the CID to encode hot_restart_epoch & 0x01?

Also, when forwarding to the other envoy, we should make sure we're only sending to active-quic-listeners with a matching listener-address (or addresses) as the original listener that got the packet.

@ravenblackx
Contributor

Couldn't you use a single bit in the CID to encode hot_restart_epoch & 0x01?

Yes, but that's the same problem again - a new 'noise' CID has 50% chance of matching it so you'd still have to support the whole forwarding to everyone just in case thing if it doesn't match the current instance's packets. But with the added cost that now you also have to transport the extra data to the CID generation extension. It could be useful in the more advanced solution later (using BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, if that's even feasible), but for a first pass it seems like simultaneously overkill and insufficiently useful to be worth it.

Also, when forwarding to the other envoy, we should make sure we're only sending to active-quic-listeners with a matching listener-address (or addresses) as the original listener that got the packet.

Yes.

Anyway, finally got my manual experiment to work - it seems that if you have n UDP sockets on the same port with BPF directing packets to the right worker, and you fork so those n sockets get duplicated to the new instance, delivery is to the correct workers and next-active-listener, which (assuming your listening path has equally fast turnaround) is essentially round-robin delivery between the old and new instances. This is good news in that it's not messing with the kernel group and therefore my proposed solution will at least work.
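
For concreteness, a minimal illustration of the kind of forked SO_REUSEPORT setup being described (not the actual testbed; Linux-only, with the BPF step and error handling omitted):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Open a UDP socket in the SO_REUSEPORT group for the given loopback port.
static int openReuseportSocket(uint16_t port) {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  int one = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  return fd;
}

int main() {
  // Two "worker" sockets in one reuseport group, as a stand-in for n workers.
  pollfd fds[2] = {{openReuseportSocket(9999), POLLIN, 0},
                   {openReuseportSocket(9999), POLLIN, 0}};

  // fork() duplicates the file descriptors; both processes now share the very
  // same sockets, so the kernel's reuseport group itself is unchanged.
  const char* who = (fork() == 0) ? "child" : "parent";

  char buf[1500];
  while (true) {
    poll(fds, 2, -1);
    for (pollfd& p : fds) {
      if (p.revents & POLLIN) {
        ssize_t n = recv(p.fd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n >= 0) {
          std::printf("%s read %zd bytes on fd %d\n", who, n, p.fd);
        }
      }
    }
  }
}
```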

Adapting to the BPF_MAP_TYPE_REUSEPORT_SOCKARRAY model would be significantly more complicated (and won't help in non-BPF contexts so something like my solution would still need to exist); I, for one, don't find the motivation to make that more difficult change particularly compelling at this time.

I believe it would also require a fairly large kernel-version-based behavior split, because if we do the "open another new socket" operation that's required for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY to work, that would actively break the existing behavior (by mangling the group indexing) on a kernel that doesn't support specifying SOCKARRAY. I'm pretty sure there exists some kernel version at which SO_ATTACH_REUSEPORT_CBPF is available and BPF_MAP_TYPE_REUSEPORT_SOCKARRAY is not.

@danzh2010
Contributor

Anyway, finally got my manual experiment to work - it seems that if you have n UDP sockets on the same port with BPF directing packets to the right worker, and you fork so those n sockets get duplicated to the new instance, delivery is to the correct workers and next-active-listener, which (assuming your listening path has equally fast turnaround) is essentially round-robin delivery between the old and new instances. This is good news in that it's not messing with the kernel group and therefore my proposed solution will at least work.

Thanks for the experiment! Given the round-robin delivery, would it be easier to make the new listener not register for read events while the old one is draining? That means only the old listener will be receiving packets and forwarding packets with unseen CIDs to the new listener. How do you plan to toss packets across processes? Would tossing every packet on new connections be too costly?

@ggreenway
Contributor

Yes, but that's the same problem again - a new 'noise' CID has 50% chance of matching it so you'd still have to support the whole forwarding to everyone just in case thing if it doesn't match the current instance's packets.

Could we combine the single bit in the CID with whether it's a long vs short packet header? IIRC we should only see random CIDs with long-headers, which are only used at connection setup.

But the goal of this exercise (marking something in the CID) is just performance I believe. How much computational effort is it to check whether the current process wants this packet or not? If it's very low, then I agree it's probably not worth doing and we just try the packet and forward if it isn't accepted locally.

@ggreenway
Contributor

How do you plan to toss packets across processes?

I think it's over a unix-domain socket, possibly wrapped in protobuf (grpc?) to add metadata.
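
Purely as an illustration of that shape (hypothetical struct and function names, no protobuf framing, and not Envoy's actual hot-restart protocol):

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstdint>

// Minimal metadata a forwarded datagram would need so the receiving instance
// can reconstruct an Envoy::Network::UdpRecvData-like object.
struct ForwardedPacketHeader {
  sockaddr_storage peer_address;   // original client source address
  sockaddr_storage local_address;  // listener address the packet arrived on
  uint32_t payload_length;
};

// Send one forwarded packet (header + payload) over a SOCK_DGRAM unix socket.
void forwardPacket(int uds_fd, const ForwardedPacketHeader& header,
                   const char* payload) {
  iovec iov[2] = {
      {const_cast<ForwardedPacketHeader*>(&header), sizeof(header)},
      {const_cast<char*>(payload), header.payload_length},
  };
  msghdr msg = {};
  msg.msg_iov = iov;
  msg.msg_iovlen = 2;
  sendmsg(uds_fd, &msg, 0);  // error handling omitted
}
```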

Would tossing every packet on new connections be too costly?

Yeah, I think it will have pretty high overhead. When the process starts, nearly all the packets need to end up with the parent process. By the end of draining, nearly all the packets should end up with the child. So I don't think it matters which side is reading from the socket.

When this is all working, we should consider publishing some benchmarks on the overhead during hot-restart.

@ravenblackx
Contributor

But the goal of this exercise (marking something in the CID) is just performance I believe. How much computational effort is it to check whether the current process wants this packet or not? If it's very low, then I agree it's probably not worth doing and we just try the packet and forward if it isn't accepted locally.

I think pre-checking a bit would actually be more computational effort than using the existing failed-to-deliver check (in that it would be hard to avoid impacting the common path, where using the existing failed-to-deliver check only adds any processing to the uncommon path). IMO it'd be worth pre-checking with BPF but not worth pre-checking without.

Could we combine the single bit in the CID with whether it's a long vs short packet header?

I like this idea, but I'm not sure how well it would play with the BPF programming. It looks like it could work: there's a bpf_map_update_elem instruction which I think could be used to switch "unrecognized" packets from going to the current instance to going to the new instance. It's not compatible with the same simple BPF program we have right now; I think it changes the requirements to needing eBPF support, plus the aforementioned need to change from just using duplicate sockets to closing the duplicate sockets and opening new sockets sharing the same port but using a different BPF program-map (to avoid modifying the existing group).
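
As a sketch of what that could look like from user space with libbpf (assuming a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map and an sk_reuseport program were already installed, which is not how Envoy is set up today):

```cpp
#include <bpf/bpf.h>    // libbpf user-space map helpers
#include <linux/bpf.h>  // BPF_ANY
#include <cstdint>

// Hypothetical sketch: repoint one slot of a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
// at the new instance's socket, so the kernel starts delivering that slot's
// packets to the new socket instead of the old one.
bool repointReuseportSlot(int map_fd, uint32_t index, int new_socket_fd) {
  uint64_t value = static_cast<uint64_t>(new_socket_fd);
  return bpf_map_update_elem(map_fd, &index, &value, BPF_ANY) == 0;
}
```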

@danzh2010
Contributor

Would tossing every packet on new connections be too costly?

Yeah, I think it will have pretty high overhead. When the process starts, nearly all the packets need to end up with the parent process. By the end of draining, nearly all the packets should end up with the child. So I don't think it matters which side is reading from the socket.

It feels like having only the parent process listening instead of both processes listening doesn't make more packets need to be tossed, given that half of the packets land on the wrong instance. And in this way we can eliminate the complexity of tossing packets back and forth. WDYT?

When this is all working, we should consider publishing some benchmarks on the overhead during hot-restart.

+1

@ravenblackx
Contributor

It feels like having only the parent process listening instead of both processes listening doesn't make more packets need to be tossed, given that half of the packets land on the wrong instance. And in this way we can eliminate the complexity of tossing packets back and forth. WDYT?

Agree, this seems reasonable. Also it conceptually plays well with the later eBPF addition, since both patterns involve "not doing the default thing" with the child socket.

RyanTheOptimist pushed a commit that referenced this issue Aug 23, 2023
…packets (#28664)

Add capability for special UDP packet handling during hot restart

Adds a flexible argument for shutdownListeners, making it possible to pass an additional argument to QUIC listeners (or others); uses that to pass a callback to ActiveQuicListener and EnvoyQuicDispatcher. This is a no-op in production since the parameter is currently always nullptr, but is a first step towards allowing us to catch new connections on a QUIC listener and forward them to a new instance during hot restart. An important part of #19454
Risk Level: Low, essentially a no-op. Adds one nullptr check in the common path per nonroutable quic packet; zero touch to the common routable packet path.
Testing: Added unit test in both EnvoyQuicDispatcher and ActiveQuicListener.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Raven Black <[email protected]>
@eranr

eranr commented Jun 11, 2024

Reading between the lines, it appears that hot restart does not work for "plain" UDP in general, and not only for QUIC (e.g. taking CIDs into consideration).
In an initial attempt to perform a hot restart while using the UDP proxy listener, UDP traffic ceases to work as soon as the child process starts. UDP returns to work as soon as the parent process exits.
By hot restart support for UDP I refer to the possibility of maintaining the UDP sessions between the parent and child processes (e.g. packets with the same 4-tuple of source IP/port and local IP/port are routed to the same backend within a given timeout).
Can someone please confirm that my understanding is correct?
Also, does the work addressed in this thread cover also "plain" udp or only the QUIC listeners?
Thanks very much in advance.

@ravenblackx
Contributor

You are correct that hot restart doesn't work for UDP in general.

The work done in this thread only added support for QUIC (there's an interface in QUIC in which the protocol reports "I couldn't use this packet", which is used to determine whether to send the packet to the other envoy instance).

I believe the QUIC support is functional at this point. And I think nobody is working on doing the same for UDP. I'm a bit surprised at UDP ceasing to work during the hot restart transition; my expectation would have been that it would drop all existing "connections" but work immediately for new connections, associating them with the new instance. (But I haven't looked closely at the code - if someone wants to try to make it functional, and it is currently as you describe, my suggestion would be to first have the parent process close the listener as soon as it knows there is a child process starting up, so as to at least not have a window of not-working-at-all, then work from there to try to do forwarding based on unrecognized tuples.)

@eranr

eranr commented Jun 13, 2024

@ravenblackx Thanks very much for your reply and the suggestion of how to proceed. Another avenue (which does not cover the exact use case of hot restart) is to use e.g. filesystem-based xDS. At this point, however, this mechanism also seems broken for UDP.

@ravenblackx
Contributor

Second-guessing my previous response a bit, if the UDP behavior were to mirror the QUIC behavior then the implementation of forwarding the packets would be forwarding from the old instance to the new instance, which would mean instead of the old instance stopping listening, you'd need the new instance to not start listening until the old instance is about to stop (which makes sense because the old instance "knows" which address-tuples are already in use and the new instance does not).

But same overall thrust that the first step would be to do just that, without the forwarding, so as to at least have UDP half-functioning during the overlapping period, which would be an improvement over not functioning at all. :)

@ravenblackx
Contributor

I think it makes sense to close this issue as done, and open a new one for UDP support and another one for eBPF-based support.
