A lost partition is no longer fatal #1252

erikvanoosten · 2024-06-10T19:26:47Z

Before 2.7.0, a lost partition was treated as a revoked partition. Since a lost partition is already assigned to another node, this could lead to duplicate processing of records.

Zio-kafka 2.7.0 assumes that a lost partition is a fatal event. It leads to an interrupt in the stream that handles the partition. The other streams are ended, and the consumer closes with an error. Usually, a full program restart is needed to resume consuming.

Note that stream processing is not interrupted immediately: the interrupt is only observed when the stream requests new records. Unfortunately, we have not found a clean way to interrupt the stream consumer directly.

Meanwhile, from bug reports (#1233, #1250), we understand that partitions are usually lost when no records have been received for a long time.

In conclusion, 1) it is not possible to immediately interrupt user stream processing, and 2) it is most likely not needed anyway because the stream is already done processing and awaiting more records.

With this change, a lost partition no longer leads to an interrupt. Instead, we first drain the stream's internal queue (just to be sure; it is probably already empty), and then we end the stream gracefully (that is, without error, as we do with revoked partitions). Other streams are not affected, and the consumer continues to work.
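To make the new behavior concrete, here is a minimal sketch in plain Scala. It is not the code from this PR: `PartitionStreamModel` and `LostPartitionDemo` are hypothetical names, the model uses a mutable queue instead of ZIO, and it assumes that "drain" means discarding whatever is still queued.

```scala
import scala.collection.mutable

// Hypothetical model of one partition stream; not zio-kafka's PartitionStreamControl.
final class PartitionStreamModel(val partition: Int) {
  private val queue = mutable.Queue.empty[String]
  private var ended = false

  // Fetched records wait here until the stream asks for them.
  def offer(record: String): Unit =
    if (!ended) queue.enqueue(record)

  // Lost partition: discard anything still queued (assumption: "drain" means
  // discard), then mark a normal end. Returns the number of discarded records.
  def lost(): Int = {
    val discarded = queue.size
    queue.clear()
    ended = true
    discarded
  }

  // Some(record) while data is queued; None when nothing is queued.
  def pull(): Option[String] =
    if (queue.nonEmpty) Some(queue.dequeue()) else None

  def isEnded: Boolean = ended
}

object LostPartitionDemo {
  // Two streams; only partition 0 is lost.
  def run(): (Int, Option[String], Option[String], Boolean) = {
    val p0 = new PartitionStreamModel(0)
    val p1 = new PartitionStreamModel(1)
    p0.offer("a")
    p1.offer("b")
    val discarded = p0.lost()
    (discarded, p0.pull(), p1.pull(), p1.isEnded)
  }
}
```

In this model the lost stream ends without raising an error, and the stream for partition 1 still delivers its record and stays open.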

Lost partitions do not affect the features rebalanceSafeCommits and restartStreamsOnRebalancing; they do not hold up a rebalance waiting for commits to complete, and they do not lead to restarts of other streams.

Since we currently have no way to test lost partitions, there is no change to the tests.

Fixes #1233 and #1250.

erikvanoosten · 2024-06-10T19:31:10Z

A note for reviewers: we assume that when a partition is lost, the underlying Java consumer no longer returns records for that partition.

zio-kafka/src/main/scala/zio/kafka/consumer/internal/PartitionStreamControl.scala

svroonland

I think this is a usability improvement in the situations that we have identified.

We should mention in the release notes that clients who want to handle the partition-lost event can no longer do so by handling the failed stream. Instead, they need to create their own RebalanceListener and handle the onLost call.
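As a rough illustration of that suggestion, the sketch below shows the shape of a listener that reacts only to lost partitions. `ListenerSketch` is a hypothetical stand-in with plain callbacks over partition numbers; zio-kafka's actual RebalanceListener is effect-based and its exact signature should be checked against the version in use.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a rebalance listener; not zio-kafka's RebalanceListener.
final case class ListenerSketch(
  onAssigned: Set[Int] => Unit,
  onRevoked: Set[Int] => Unit,
  onLost: Set[Int] => Unit
)

object OnLostDemo {
  def run(): List[String] = {
    val events = mutable.ListBuffer.empty[String]
    val listener = ListenerSketch(
      onAssigned = _ => (),
      onRevoked = _ => (),
      // Application-specific reaction, for example raising an alert
      // or dropping local state kept for these partitions.
      onLost = ps => {
        events += s"lost: ${ps.toList.sorted.mkString(",")}"
        ()
      }
    )
    // Simulate the consumer reporting that partitions 2 and 0 were lost.
    listener.onLost(Set(2, 0))
    events.toList
  }
}
```

The point of the design is that the reaction to a lost partition moves out of stream error handling and into the listener callback.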

erikvanoosten requested review from svroonland and guizmaii June 10, 2024 19:26

typo

605f536

svroonland reviewed Jun 10, 2024

erikvanoosten added 4 commits June 11, 2024 19:24

Add logging

ff8e92d

Merge branch 'master' into lost

dcf61c2

Merge branch 'master' into lost

736c7ed

Merge branch 'master' into lost

e52e19e

svroonland mentioned this pull request Jun 13, 2024

Consumer not reconnecting on lost session on proxy connectivity #1233

Open

Merge branch 'master' into lost

cf3516d

svroonland approved these changes Jul 9, 2024

erikvanoosten merged commit a8596f3 into master Jul 9, 2024
14 checks passed

erikvanoosten deleted the lost branch July 9, 2024 19:22

erikvanoosten mentioned this pull request Jul 20, 2024

Consumer doesn't consume after onLost #1288

Open
