Events skipped in group rebalancing #3703
Comments
There is a Kafka property …
Hi @artembilan, To make sure I understand your answer correctly, the suggestion will minimize the frequency of consecutive rebalances. Is this correct?
Right. See more info in the official Apache Kafka docs: https://kafka.apache.org/documentation/
Ok, unfortunately we had a case today where a new consumer joined the group (so only a rebalance occurred) and the problem still appeared. Therefore, I believe the property will not be useful in this case, sadly. I hope you don't mind answering a few questions: Is it true that … Thanks for your help!
I don't know those details of the Apache Kafka consumer implementation. We will look into this together with @sobychacko when we have a chance. Meanwhile, a project that reproduces the issue would be great. I don't deny that there may be something in Spring that could be fixed.
I will do my best to find a way to reproduce the problem in a project, thank you!
@ejba You might want to double-check these in the application, but here are some explanations for your questions. The … When the fetch response is received, the Kafka consumer stores it in an internal buffer. On the next … And finally, regarding subscription data and metadata: the metadata handling in the Kafka consumer maintains its own cycle, separate from message fetching. The consumer keeps cluster information cached, but fetch responses don't directly update this cache. Instead, when the broker detects stale metadata, it returns specific errors (like …).
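To illustrate the fetch/buffer relationship described above, here is a minimal plain kafka-clients poll loop; the bootstrap servers, group id, topic name and max.poll.records value are assumptions for illustration only. Records already sitting in the consumer's internal buffer from an earlier fetch response are returned by subsequent poll() calls without a new network fetch.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollLoopSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "3"); // caps how many buffered records one poll() returns

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic")); // assumption
            while (true) {
                // poll() first drains records already buffered from an earlier fetch response;
                // only when the buffer is empty does it wait for a new fetch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition %d, offset %d%n", record.partition(), record.offset());
                }
                // synchronously commit the offsets of the records just returned
                consumer.commitSync();
            }
        }
    }
}
```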
@sobychacko Thank you so much for answering the questions. So I have no clue what could be causing the offset to advance after the commit fails with a rebalance-in-progress exception. For example, in the attached logs you can see the consumer attempt to commit offset 2148212 without success, and then fetch offset 2148260 after the group rebalance completes and the old partition is re-assigned to the consumer, thereby losing 48 records.
I will try to create a project to reproduce this problem. Thanks!
I see in your logs: …
That means that this …
Aside from what @artembilan asked above, I have a few questions and suggestions. Who is committing the offset 2148260? …
The consumers are created via a DefaultKafkaConsumerFactory instance and launched dynamically in the presence of custom consumer implementations. I recently came across the Creating Containers Dynamically page. I was planning to reimplement the container creation based on the page's recommendations in the coming weeks, but if you believe this could be the root cause, I will prioritise the work.
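For reference, here is a minimal sketch of the "Creating Containers Dynamically" pattern, building a ConcurrentMessageListenerContainer directly from a DefaultKafkaConsumerFactory; the topic, group id, bootstrap servers and listener body are illustrative assumptions, not this application's actual code.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.BatchMessageListener;
import org.springframework.kafka.listener.ConcurrentMessageListenerContainer;
import org.springframework.kafka.listener.ContainerProperties;

public class DynamicContainerSketch {

    public static ConcurrentMessageListenerContainer<String, String> createAndStartContainer() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dynamic-group");           // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        DefaultKafkaConsumerFactory<String, String> consumerFactory =
                new DefaultKafkaConsumerFactory<>(props);

        ContainerProperties containerProps = new ContainerProperties("example-topic"); // assumption
        containerProps.setAckMode(ContainerProperties.AckMode.BATCH);
        // Batch listener matching AckMode.BATCH; replace the body with the real processing logic.
        containerProps.setMessageListener((BatchMessageListener<String, String>) records ->
                records.forEach(record -> System.out.println(record.value())));

        ConcurrentMessageListenerContainer<String, String> container =
                new ConcurrentMessageListenerContainer<>(consumerFactory, containerProps);
        container.setConcurrency(1);
        container.start();
        return container;
    }
}
```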
Originally, the acknowledgment mode was manual_immediate and the acknowledgment was invoked at the end of batch processing. I switched to batch as soon as the problem started to appear among consumers. The problem persisted.
Yes, the team was discussing this today with the intention of storing the offsets in external storage (e.g. a database).
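If the external-storage route is taken, a common shape is to persist the offset alongside the processing result and to seek to the stored offsets whenever partitions are (re-)assigned. Below is a minimal sketch using Spring Kafka's ConsumerSeekAware callback; the OffsetStore interface, topic and group id are hypothetical placeholders, not anything from this issue.

```java
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.ConsumerSeekAware;

public class ExternallyStoredOffsetListener implements ConsumerSeekAware {

    // Hypothetical abstraction over external storage (e.g. a database table keyed by
    // topic/partition); not part of Spring Kafka.
    public interface OffsetStore {
        Long findCommitted(TopicPartition partition);
        void save(TopicPartition partition, long nextOffset);
    }

    private final OffsetStore offsetStore;

    public ExternallyStoredOffsetListener(OffsetStore offsetStore) {
        this.offsetStore = offsetStore;
    }

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // After a rebalance, position each assigned partition from the external store
        // instead of relying on the broker-side committed offset.
        assignments.keySet().forEach(tp -> {
            Long stored = this.offsetStore.findCommitted(tp);
            if (stored != null) {
                callback.seek(tp.topic(), tp.partition(), stored);
            }
        });
    }

    @KafkaListener(topics = "example-topic", groupId = "example-group") // assumptions
    public void listen(ConsumerRecord<String, String> record) {
        // process the record, then store the next offset to read in the external store
        this.offsetStore.save(new TopicPartition(record.topic(), record.partition()), record.offset() + 1);
    }
}
```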
This application contains only one consumer (single thread). I don't see the typical "Commit list:" log entry showing the intent to commit 2148260. Therefore I believe it's something the kafka-client does internally (not the Spring consumer).
Thanks for the updates. In your troubleshooting and analysis, let us know if you find any issues within the framework. We will be happy to take a look.
In what version(s) of Spring for Apache Kafka are you seeing this issue?
Between 3.1.7 and 3.2.5
Describe the bug
The topic has 12 partitions. Consumers are automatically scaled up as soon as event lag is detected (typically 1 to 12 replicas). The group rebalancing currently takes some time (>10 seconds). The events fetched before the group rebalance are processed successfully, but the offset commit fails because the current partition was revoked during the rebalance (a non-fatal failure).
The problem appears after the partition assignment is complete: the fetch position on the partition has advanced to an uncommitted offset.
Example:
Consumer A assigned to partition P
Consumer A fetches offset 2148209
Consumer A reads 3 events
11 consumers join the group
Consumer A tries to commit 2148212 (2148209 + 3)
Commit 2148212 fails because partition P was revoked
Consumer A is assigned to partition P again
Consumer A fetches offset 2148260
48 events were skipped (2148260 - 2148212 = 48)
To Reproduce
AckMode = Batch
Partition Assignment Strategy = CooperativeSticky
A consumer continually reads events from the partition.
11 new consumers suddenly join the group.
The consumer is assigned to the previous partition.
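A minimal Spring for Apache Kafka configuration sketch matching these reproduction settings (AckMode BATCH plus the CooperativeStickyAssignor); the bootstrap servers, group id and topic name are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class ReproKafkaConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "repro-group");             // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Partition Assignment Strategy = CooperativeSticky
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        // AckMode = BATCH
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
        return factory;
    }
}
```

Scaling this application from 1 to 12 replicas against the 12-partition topic then triggers the cooperative rebalance described above.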
Expected behavior
Once the group rebalance completes, the newly assigned partition is expected to resume fetching from the last committed offset, so no events are skipped.
Sample
I wasn't able to create a sample project yet, but I gathered the debug logs that explain this issue in detail.
logs.txt
Thanks in advance!