RabbitMQ Microservice reliability
This is a condensation of a short tech talk discussing findings and recommendations from issue #565.
RabbitMQ uses acknowledgments to ensure that queue consumers reliably receive messages. A consumer can listen for messages on a queue using one of two settings:
- Auto-ack (no-ack) - Once RabbitMQ has successfully written a message to the TCP socket, its job is done and it forgets about the message. RabbitMQ does not expect any acknowledgment back from the consumer. From the consumer's perspective, receiving the message implies acknowledgment (TCP's own segment acknowledgments confirm transport-level delivery); no further action is needed to tell RabbitMQ to forget about the message.
- Explicit ack - The consumer is responsible for explicitly sending an acknowledgment back to RabbitMQ. If it doesn't (due to buggy code, or a system failure that the consumer can't handle), RabbitMQ will wait a certain amount of time (30 minutes by default), then re-queue the message for delivery to another consumer. Note that an ack can mean either (1) the consumer received the message (redundant with TCP's own acknowledgment) or (2) the consumer successfully processed the message (which may not be needed if the caller is expecting a response to the request). The semantics are ours to decide, but we should be consistent.
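As a rough illustration of the difference between the two settings, the sketch below models a single queue in plain Python. It does not use a RabbitMQ client: `ToyBroker` and its methods are hypothetical names invented for this example, and the model only captures when the broker forgets a message, not real AMQP behavior.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a single queue; not a real RabbitMQ client."""

    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery_tag -> message awaiting an explicit ack
        self._next_tag = 0

    def publish(self, message):
        self.ready.append(message)

    def deliver(self, auto_ack):
        """Hand the next message to a consumer; returns (delivery_tag, message)."""
        message = self.ready.popleft()
        self._next_tag += 1
        if not auto_ack:
            # Explicit ack: remember the message until the consumer acks it.
            self.unacked[self._next_tag] = message
        # Auto-ack: nothing is recorded, so the broker has already forgotten it.
        return self._next_tag, message

    def ack(self, delivery_tag):
        del self.unacked[delivery_tag]

    def requeue_unacked(self):
        """Put every unacknowledged message back on the queue for redelivery."""
        while self.unacked:
            _, message = self.unacked.popitem()
            self.ready.append(message)
```

In this model, a consumer that crashes after an auto-ack delivery leaves nothing behind to re-queue, while one that crashes after an explicit-ack delivery leaves the message in `unacked`, where `requeue_unacked` can recover it.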
The remainder of this page describes a few scenarios for how VRO microservices could interact with RabbitMQ to ensure reliability (or not), starting with some obviously unworkable ones.
The microservice consumes requests with auto-ack on, and performs the work without communicating out about any success or failure.
sequenceDiagram
App->>RabbitMQ: publish request
RabbitMQ->>Microservice: consume & auto-ack request
Note right of Microservice: do some work
This does not provide any safeguards in case of failure, and should only be used if the microservice is intended to perform optional "best-effort" work.
The most basic way to address the above issue is to have the microservice explicitly acknowledge that it performed its work successfully.
sequenceDiagram
App->>RabbitMQ: publish request
RabbitMQ->>Microservice: consume request
Note right of Microservice: do some work
Microservice->>RabbitMQ: ack request
See the re-queuing section below for a description of RabbitMQ's behavior if it never receives the explicit ack.
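A minimal sketch of this scenario in Python, assuming a client that uses pika-style consumer callbacks (`channel`, `method`, `properties`, `body`) and a `basic_ack` method on the channel. `do_work` and the JSON request format are hypothetical placeholders, not VRO's actual code.

```python
import json


def do_work(request):
    """Hypothetical placeholder for the microservice's real task."""
    return {"status": "done", "echo": request}


def on_request_explicit_ack(channel, method, properties, body):
    """Consumer callback: process the request, then ack it explicitly.

    The ack is sent only after do_work returns, so under this convention an
    ack means "processed successfully", not merely "received".
    """
    request = json.loads(body)
    do_work(request)
    # If parsing or do_work raises, this line is never reached and the
    # message stays unacknowledged on the broker.
    channel.basic_ack(delivery_tag=method.delivery_tag)
```

With pika, a callback like this would typically be registered through `basic_consume` with `auto_ack=False`.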
Currently, every microservice in VRO is invoked by a caller (for example the API server, labeled here as "App") that depends on some sort of result from the microservice. As with the request sent to the microservice, the response is published to RabbitMQ, and subsequently consumed by the caller.
sequenceDiagram
App->>RabbitMQ: publish request
RabbitMQ->>Microservice: consume & auto-ack request
Note right of Microservice: do some work
Microservice->>RabbitMQ: publish response
RabbitMQ->>App: consume response
In this case, the response message also serves conceptually as an acknowledgment from the microservice that it performed its work successfully. The next two scenarios describe what happens when a failure occurs.
If the microservice encounters a failure and can recover (in Java and Python microservices that wrap the entire task in a try/catch, this likely accounts for nearly all failures), the microservice can send an error response back to the app.
sequenceDiagram
App->>RabbitMQ: publish request
RabbitMQ->>Microservice: consume & auto-ack request
Note right of Microservice: try to do some work
Note right of Microservice: handle a failure
Microservice->>RabbitMQ: publish error response
RabbitMQ->>App: consume error response
This is nearly identical to the previous scenario, just with different content in the response message -- the error response acts like the ack.
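The success and error responses can share one code path. The sketch below assumes pika-style callbacks and `basic_publish`, and further assumes that the caller names its response queue in the message's `reply_to` property; VRO's actual routing may differ. `do_work` and the response fields (`status`, `result`, `message`) are hypothetical.

```python
import json


def do_work(request):
    """Hypothetical placeholder for the microservice's real task."""
    if "fail" in request:
        raise ValueError(request["fail"])
    return {"echo": request}


def on_request_with_reply(channel, method, properties, body):
    """Consume a request (auto-ack assumed) and always publish a reply."""
    try:
        response = {"status": "ok", "result": do_work(json.loads(body))}
    except Exception as exc:
        # Broad on purpose: any failure the process survives becomes a reply.
        response = {"status": "error", "message": str(exc)}
    channel.basic_publish(
        exchange="",                      # default exchange routes by queue name
        routing_key=properties.reply_to,  # assumption: caller set reply_to
        body=json.dumps(response).encode("utf-8"),
    )
```

Either way the caller receives exactly one message per request, which is what lets the response double as the ack.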
Occasionally there are failures that can't be handled by the microservice, e.g. if its container gets killed for running out of memory (OOM). In this case, the microservice isn't able to send back any response, so it's up to the App to time out after a reasonable amount of time.
sequenceDiagram
App->>RabbitMQ: publish request
RabbitMQ->>Microservice: consume & auto-ack request
Note right of Microservice: unhandled failure
Note left of App: timeout waiting for response
This works well when the App itself is bound by a request/response context that is expected to return within a reasonable amount of time. Other callers/requesters must set a response timeout and handle it by resubmitting the request (themselves or via RabbitMQ's features) or by raising an error for the originating caller to handle.
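A caller-side timeout with optional resubmission could look roughly like the following. This is a sketch using only the standard library: it assumes the caller's response consumer places replies on a `queue.Queue`, and `call_with_timeout`, `ResponseTimeout`, and `publish_request` are hypothetical names.

```python
import queue


class ResponseTimeout(Exception):
    """Raised when no response arrives within the allowed time."""


def call_with_timeout(publish_request, responses, timeout_s, max_attempts=1):
    """Publish a request and wait up to timeout_s seconds for its response.

    publish_request: zero-argument callable that sends the request.
    responses: queue.Queue on which the response consumer places replies.
    Resubmits until max_attempts is reached, then raises ResponseTimeout.
    """
    for _ in range(max_attempts):
        publish_request()
        try:
            return responses.get(timeout=timeout_s)
        except queue.Empty:
            continue  # no reply in time; resubmit if attempts remain
    raise ResponseTimeout(f"no response after {max_attempts} attempt(s)")
```

Note that resubmitting is only safe if the microservice tolerates receiving the same request twice, since the first attempt may have been slow rather than lost.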
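If the microservice is responsible for work that finishes at some indeterminate time in the future, RabbitMQ's re-queuing of unacknowledged messages can be used. When the microservice consumes messages with explicit acks, RabbitMQ waits a certain amount of time (30 minutes by default) for the ack; after the time-out it closes the consumer's channel and re-queues the message for redelivery. If the consumer's connection or channel closes before then, for example because its container was killed, RabbitMQ re-queues the message right away rather than waiting for the time-out.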
sequenceDiagram
participant App
participant RabbitMQ
participant MS1 as Microservice
participant MS2 as Microservice
App->>RabbitMQ: publish request
RabbitMQ->>MS1: consume request
Note right of MS1: unhandled failure
Note right of RabbitMQ: timeout waiting for ack
RabbitMQ->>MS2: consume request
Note right of MS2: do some work
MS2->>RabbitMQ: ack request
This sequence diagram shows the flow when a second microservice is available to consume the message.
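The flow in the diagram can be approximated with a small time-driven model. This is a plain-Python sketch, not RabbitMQ itself: `RequeueingQueue` and its methods are hypothetical, time is passed in explicitly as `now` (in seconds) so the time-out can be exercised without waiting, and the `redelivered` value loosely mirrors the flag RabbitMQ sets on re-delivered messages.

```python
from collections import deque

ACK_TIMEOUT_S = 30 * 60  # mirrors the 30-minute default mentioned above


class RequeueingQueue:
    """Toy model of a queue with explicit acks and an ack time-out."""

    def __init__(self, ack_timeout_s=ACK_TIMEOUT_S):
        self.ack_timeout_s = ack_timeout_s
        self.ready = deque()   # (message, redelivered) pairs awaiting delivery
        self.unacked = {}      # delivery_tag -> (message, delivered_at)
        self._next_tag = 0

    def publish(self, message):
        self.ready.append((message, False))

    def consume(self, now):
        """Deliver the next message; returns (tag, message, redelivered)."""
        message, redelivered = self.ready.popleft()
        self._next_tag += 1
        self.unacked[self._next_tag] = (message, now)
        return self._next_tag, message, redelivered

    def ack(self, tag):
        del self.unacked[tag]

    def expire_unacked(self, now):
        """Re-queue every delivery whose ack is overdue at time `now`."""
        for tag, (message, delivered_at) in list(self.unacked.items()):
            if now - delivered_at >= self.ack_timeout_s:
                del self.unacked[tag]
                self.ready.append((message, True))


# MS1 consumes the request and dies without acking; after the time-out,
# MS2 receives the same request marked as redelivered and acks it.
q = RequeueingQueue()
q.publish("request")
q.consume(now=0)                        # MS1: unhandled failure, no ack
q.expire_unacked(now=ACK_TIMEOUT_S)     # broker: time-out waiting for ack
tag, message, redelivered = q.consume(now=ACK_TIMEOUT_S)  # MS2
q.ack(tag)
```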