
fix: handle error in kafka consumer loop to prevent premature return #2100

Merged
4 commits merged into alibaba:main on Feb 24, 2025

Conversation

Gezi-lzq
Contributor

When the Kafka cluster experiences short-term network jitter or a brief service outage, Logtail fails to consume and exits its consumption loop. As a result, even after Kafka returns to normal, Logtail cannot resume consuming messages and they keep accumulating; the only remedy is to restart Logtail.

Root Cause:

After reviewing the relevant parts of the open-source ilogtail code, we found that the error handling in LoongCollector's Kafka input (input_kafka) does not match what the Sarama client expects. Specifically, when k.consumerGroupClient.Consume() returns an error (e.g., because network instability has exhausted Sarama's internal retries), the current code simply returns, exiting the consumption goroutine. There is no upper-layer fault tolerance or retry mechanism. Our tests confirmed this issue.

According to Sarama issue #2381, when network jitter or short-term network unavailability persists through multiple retries, the current ConsumerGroupSession exits and Consume returns an error; the caller is expected to handle the error and perform recovery and retry.

I believe that in Logtail's scenario, LoongCollector should keep retrying until Kafka recovers, instead of exiting outright.
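To illustrate the idea, here is a minimal sketch of a retry loop around Sarama's Consume call. The package and function names, the fixed 5-second delay, and the log message are illustrative assumptions, not the exact code merged in this PR.

```go
// Minimal sketch of the retry approach; assumptions noted in comments.
package kafkainput

import (
	"context"
	"log"
	"time"

	"github.com/IBM/sarama" // or github.com/Shopify/sarama, depending on the module in use
)

// consumeLoop keeps the consumer-group session alive across transient failures.
func consumeLoop(ctx context.Context, group sarama.ConsumerGroup, topics []string, handler sarama.ConsumerGroupHandler) {
	for {
		// Consume blocks for the lifetime of one consumer-group session and
		// returns an error once Sarama's internal retries are exhausted
		// (e.g. after sustained network jitter).
		if err := group.Consume(ctx, topics, handler); err != nil {
			// Previously the goroutine simply returned here, which stopped
			// consumption permanently. Instead, log the error and retry
			// after a short delay.
			log.Printf("kafka consume error, will retry: %v", err)
			time.Sleep(5 * time.Second) // basic fixed delay between retries (assumed value)
		}
		// Exit only when the plugin itself is shutting down.
		if ctx.Err() != nil {
			return
		}
	}
}
```

A fixed delay keeps the sketch simple; an exponential backoff or configurable retry parameters could be substituted later.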

@CLAassistant

CLAassistant commented Feb 17, 2025

CLA assistant check
All committers have signed the CLA.

@Gezi-lzq
Contributor Author

@messixukejia

@Gezi-lzq
Contributor Author

@EvanLjp

@zhouxinyu zhouxinyu (Member) left a comment


LGTM~

@Gezi-lzq
Contributor Author

I've implemented the retry logic with a basic delay approach. Could you help review it? As for adding configurable retry parameters, would you recommend handling that in this PR or in a separate one?

@shalousun

messixukejia merged commit ae061ff into alibaba:main on Feb 24, 2025
15 checks passed
henryzhx8 added the bug (Something isn't working) label on Feb 26, 2025
henryzhx8 added this to the v3.0 milestone on Feb 26, 2025