-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] Fix rate limiter token bucket and clock consistency issues causing excessive throttling and connection timeouts #23930
[fix][broker] Fix rate limiter token bucket and clock consistency issues causing excessive throttling and connection timeouts #23930
Conversation
…yed when there are races
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #23930 +/- ##
============================================
+ Coverage 73.57% 74.29% +0.72%
+ Complexity 32624 31923 -701
============================================
Files 1877 1853 -24
Lines 139502 143855 +4353
Branches 15299 16357 +1058
============================================
+ Hits 102638 106877 +4239
+ Misses 28908 28586 -322
- Partials 7956 8392 +436
Flags with carried forward coverage won't be shown. Click here to find out more.
|
pulsar-broker/src/main/java/org/apache/pulsar/broker/qos/DefaultMonotonicSnapshotClock.java
Outdated
Show resolved
Hide resolved
pulsar-broker/src/main/java/org/apache/pulsar/broker/qos/DefaultMonotonicSnapshotClock.java
Outdated
Show resolved
Hide resolved
pulsar-broker/src/main/java/org/apache/pulsar/broker/qos/DefaultMonotonicSnapshotClock.java
Outdated
Show resolved
Hide resolved
pulsar-broker/src/main/java/org/apache/pulsar/broker/qos/DefaultMonotonicSnapshotClock.java
Outdated
Show resolved
Hide resolved
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/PublishRateLimiterImpl.java
Show resolved
Hide resolved
pulsar-broker/src/main/java/org/apache/pulsar/broker/qos/DefaultMonotonicSnapshotClock.java
Outdated
Show resolved
Hide resolved
…nsistency should be fine
I fixed the flaky throttling tests and moved them to broker-api group. There were multiple issues in the flaky tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM
Fixes #23920
Motivation
Prior to this PR, connections in Pulsar could experience excessive throttling and timeouts due to issues in rate limiting implementation. This was caused by both calculation errors in token bucket logic and problems with clock source handling in virtualized environments. The PR addresses these issues to prevent unintended long throttling periods and connection timeouts.
Two key issues are addressed in this PR:
Calculation Issue in AsyncTokenBucket:
Clock Source Consistency:
System.nanoTime()
isn't strictly monotonic and consistent across multiple CPUsSystem.nanoTime()
was read on different threads.For details about timekeeping challenges in virtualized environments, see Linux kernel documentation on timekeeping.
Without this fix, when time leaps backwards (for example due to
System.nanoTime
being read by a thread running on a different CPU with unsynchronized TSC), the rate limiter would incorrectly throttle connections for the duration of the time leap. In virtualized environments, the time difference between CPU cores can be significant. Even though such large variances are uncommon, they can cause extended connection timeouts when they occur. This PR significantly improves reliability by preventing such time-variance-induced throttling entirely.While both forward and backward time leaps can occur in these environments, only backward leaps are explicitly handled. Forward leaps cannot be reliably detected within the eventually consistent logic used in AsyncTokenBucket, and their impact is bounded - at most one rate period (default 1 second) of potential overuse. This design choice favors implementation simplicity and performance while still protecting against the more problematic backward leap case that could cause extended throttling.
Although this PR also covers clock source consistency issues, the most probable root cause for the reported issue #23920 is, however, the calculation logic that was assuming that the DefaultMonotonicSnapshotClock's update thread would always be executing consistently.
Modifications
AsyncTokenBucket improvements:
DefaultMonotonicSnapshotClock enhancements:
Rate limiter implementations:
Performance
AsyncTokenBucketBenchmark and DefaultMonotonicSnapshotClockBenchmark results with JMH demonstrate high throughput (test results with Apple M3 Max):
Results with Dell XPS 15 7590 i9-9980HK on Linux:
Documentation
doc
doc-required
doc-not-needed
doc-complete