
Simplify KafkaConsumer.commitAsync #1375

Open
wants to merge 3 commits into series/3.x

Conversation

rodrigo-molina
Contributor

`KafkaConsumerActor.manualCommitAsync` no longer semantically blocks the actor polling loop. It shares the same implementation as `CommittableConsumerRecord.commit`, except that it does not apply the commit recovery policy, and it leverages the actor's existing queue mechanism rather than spawning a new fiber.
The combination of actor blocking and fiber spawning is assumed to contribute to the observed performance degradation.
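
As a rough illustration of the pattern described above, here is a minimal cats-effect sketch of routing a manual commit through a request queue instead of blocking the polling loop or forking a fiber per commit. The `Request`/`ManualCommit` types and the `commitAsyncSketch` helper are illustrative stand-ins, not the library's internals.

```scala
import cats.effect.{Deferred, IO}
import cats.effect.std.Queue
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition

// Hypothetical request type, standing in for the actor's existing queue mechanism.
sealed trait Request
final case class ManualCommit(
  offsets: Map[TopicPartition, OffsetAndMetadata],
  done: Deferred[IO, Either[Throwable, Unit]] // completed by the commit callback
) extends Request

// The caller enqueues a request and waits on its own Deferred; the polling loop only
// hands the offsets to the underlying client and carries on, so it is never
// semantically blocked and no extra fiber is forked per commit.
def commitAsyncSketch(
  requests: Queue[IO, Request],
  offsets: Map[TopicPartition, OffsetAndMetadata]
): IO[Unit] =
  for {
    done   <- Deferred[IO, Either[Throwable, Unit]]
    _      <- requests.offer(ManualCommit(offsets, done))
    result <- done.get
    _      <- IO.fromEither(result)
  } yield ()
```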

Motivation

In our model, we manage over 1,000 integrations, where each integration translates to N Kafka consumers (N being the number of partitions per topic), assigned rather than subscribed to the respective topics. On average, each topic has 24 partitions, resulting in approximately 24,000 active consumers.
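
For context, a minimal sketch of what such an assigned (rather than subscribed) consumer looks like, assuming the fs2-kafka 3.x `assign(NonEmptySet[TopicPartition])` overload; the bootstrap servers, group id, topic name, and partition number are placeholders.

```scala
import cats.Order
import cats.data.NonEmptySet
import cats.effect.{IO, Resource}
import fs2.kafka.{AutoOffsetReset, ConsumerSettings, KafkaConsumer}
import org.apache.kafka.common.TopicPartition

// Order instance so a NonEmptySet of partitions can be built.
implicit val topicPartitionOrder: Order[TopicPartition] =
  Order.by((tp: TopicPartition) => (tp.topic, tp.partition))

val settings: ConsumerSettings[IO, String, String] =
  ConsumerSettings[IO, String, String]
    .withBootstrapServers("localhost:9092") // placeholder
    .withGroupId("integration-consumer")    // placeholder
    .withAutoOffsetReset(AutoOffsetReset.Earliest)

// One consumer per partition: the partition is assigned explicitly instead of
// joining the group-managed subscription protocol.
def assignedConsumer(partition: Int): Resource[IO, KafkaConsumer[IO, String, String]] =
  KafkaConsumer
    .resource(settings)
    .evalTap(_.assign(NonEmptySet.one(new TopicPartition("integration-topic", partition))))
```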

Previously, we used CommitBatch.commit and observed p95 latencies (averaged across all hosts) of 70ms during quiet hours and 140ms during busy ones.

These times followed the traffic pattern and reflected normal circumstances.

When switching to commitAsync, however, the observed p95 latencies more than doubled: 240ms during quiet hours and 300ms during busy ones.

While these measurements include the overhead of Cats Effect scheduling and CPU competition, the observation conditions remained consistent during the comparison:

  • No changes were made to the code except for the commit mechanism,
  • The number of integrations remained constant,
  • The total number of hosts/replicas did not change.

In short, switching from `CommittableConsumerRecord.commit` to `KafkaConsumer.commitAsync` caused p95 latencies to more than triple during quiet hours (70ms → 240ms) and more than double during peak hours (140ms → 300ms) in our production systems.
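
For reference, a minimal sketch of the two commit paths being compared, assuming the fs2-kafka 3.x API (`CommittableOffset.commit` on one side, `KafkaConsumer.commitAsync(offsets)` on the other); the helper names are illustrative.

```scala
import cats.effect.IO
import fs2.kafka.{CommittableConsumerRecord, KafkaConsumer}

// Path A: commit through the record's CommittableOffset, which goes through the
// configured commit recovery policy.
def commitViaOffset(record: CommittableConsumerRecord[IO, String, String]): IO[Unit] =
  record.offset.commit

// Path B: commit the same offsets directly via the consumer's commitAsync, the code
// path this PR routes through the actor's existing queue (no recovery policy applied).
def commitViaConsumer(
  consumer: KafkaConsumer[IO, String, String],
  record: CommittableConsumerRecord[IO, String, String]
): IO[Unit] =
  consumer.commitAsync(record.offset.offsets)
```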

rodrigo-molina changed the title from Improve KafkaConsumer.commitAsync performance to Simplify KafkaConsumer.commitAsync on Dec 30, 2024
@aartigao
Contributor

I once tried to refactor all this actor stuff, but the changes were gigantic (it was basically a rewrite from scratch but respecting the current APIs). I agree with you, your measurements show a non-optimal implementation due to the actor overhead. I'm tempted to merge this if this is a quick win for clients but I wonder how much benefit could be achieved if other parts are refactored...

@rodrigo-molina
Contributor Author

> I once tried to refactor all this actor stuff, but the changes were gigantic (it was basically a rewrite from scratch but respecting the current APIs). I agree with you, your measurements show a non-optimal implementation due to the actor overhead. I'm tempted to merge this if this is a quick win for clients but I wonder how much benefit could be achieved if other parts are refactored...

Ideally, we should find an iterative approach that lets us move forward without going down a never-ending rabbit hole. I think this PR takes us in that direction while also converging the two implementations of the same commit-async functionality.

Which other features would benefit the most from refactoring?
