Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] defer the ownership checks if the owner is inactive (ExtensibleLoadManager) #21811

Merged
merged 2 commits into from
Dec 31, 2023

Conversation

heesung-sn
Copy link
Contributor

Motivation

When a broker shutdown or gets killed, the leader broker will get notified and soon reassign the orphan bundle ownerships. Meanwhile, the ownership lookup to these orphan bundles should be deferred til the new brokers own them. This will further reduce the lookup request retries when brokers are shutdown or get killed.

Modifications

  • In ServiceUnitStateChannelImpl.getOwnerAsync, additionally, check if the current owner broker is active or not in the broker registry. If active, return the current owner. Otherwise, defer the ownership lookup requests til the new owner owns it.
  • During the orphan ownership cleanup, when there is no selected owner, the leader overrides the orphan states to Free.
  • Do not throw exception from writeLoadReportOnZookeeper writeResourceQuotasToZooKeeper

Verifying this change

  • Make sure that the change passes the CI checks.

Added a test case testActiveGetOwner to cover this.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#58

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 27, 2023
Copy link
Contributor

@dragosvictor dragosvictor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, had some cosmetic recommendations.

Comment on lines 503 to 504
ownerAfterDeferred -> ownerAfterDeferred == null ? Optional.empty()
: Optional.of(ownerAfterDeferred))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify this to Optional.ofNullable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. updated.

Comment on lines 1255 to 1256
var override = getOverrideInactiveBrokerStateData(
orphanData, selectedBroker, inactiveBroker);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var override = getOverrideInactiveBrokerStateData(
orphanData, selectedBroker, inactiveBroker);
var override = getOverrideInactiveBrokerStateData(orphanData, selectedBroker, inactiveBroker);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. updated.

Comment on lines 143 to 145
} catch (Throwable e) {
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this and add ignoreExceptions to the Awaitility chain above, similar to instances below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@@ -259,9 +281,11 @@ public void testStopBroker() throws Exception {
}
}

String broker1 = admin.lookups().lookupTopic(topicName);
Awaitility.waitAtMost(60, TimeUnit.SECONDS).ignoreExceptions().untilAsserted(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wait period here is larger than the test timeout itself.

Suggested change
Awaitility.waitAtMost(60, TimeUnit.SECONDS).ignoreExceptions().untilAsserted(() -> {
Awaitility.waitAtMost(40, TimeUnit.SECONDS).ignoreExceptions().untilAsserted(() -> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increased the test time, instead.

@@ -380,8 +404,7 @@ public void testIsolationPolicy() throws Exception {
fail();
} catch (Exception ex) {
log.error("Failed to lookup topic: ", ex);
assertThat(ex.getMessage()).containsAnyOf("Failed to look up a broker",
"Failed to select the new owner broker for bundle");
assertThat(ex.getMessage()).containsAnyOf("Failed to select the new owner broker for bundle");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
assertThat(ex.getMessage()).containsAnyOf("Failed to select the new owner broker for bundle");
assertThat(ex.getMessage()).contains("Failed to select the new owner broker for bundle");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 1565 to 1568
public void testActiveGetOwner()
throws IllegalAccessException, ExecutionException, InterruptedException, TimeoutException {


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void testActiveGetOwner()
throws IllegalAccessException, ExecutionException, InterruptedException, TimeoutException {
public void testActiveGetOwner() throws Exception {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. updated.

Comment on lines 1589 to 1598
try {
channel1.getOwnerAsync(bundle).get();
fail();
} catch (Exception e) {
if (e.getCause() instanceof TimeoutException) {
// expected
} else {
fail();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads a bit more clearly IMO, but I haven't seen it widely used in our tests:

Suggested change
try {
channel1.getOwnerAsync(bundle).get();
fail();
} catch (Exception e) {
if (e.getCause() instanceof TimeoutException) {
// expected
} else {
fail();
}
}
var ex = expectThrows(ExecutionException.class, () -> channel1.getOwnerAsync(bundle).get());
assertTrue(ex.getCause() instanceof TimeoutException);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. updated.

doReturn(CompletableFuture.completedFuture(Optional.empty()))
.when(loadManager).selectAsync(any(), any());
var leaderChannel = channel1;
String leader = channel1.getChannelOwnerAsync().get(2, TimeUnit.SECONDS).get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we rename this to leader1, to stay consistent with the naming scheme?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. updated.

Copy link
Contributor

@dragosvictor dragosvictor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Comment on lines +1566 to +1567


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@heesung-sn heesung-sn added this to the 3.3.0 milestone Dec 29, 2023
@Demogorgon314 Demogorgon314 reopened this Dec 29, 2023
@codecov-commenter
Copy link

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (99d06b9) 73.58% compared to head (3535156) 73.54%.
Report is 11 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21811      +/-   ##
============================================
- Coverage     73.58%   73.54%   -0.04%     
- Complexity    32254    32278      +24     
============================================
  Files          1858     1858              
  Lines        138021   138110      +89     
  Branches      15111    15137      +26     
============================================
+ Hits         101558   101579      +21     
- Misses        28606    28670      +64     
- Partials       7857     7861       +4     
Flag Coverage Δ
inttests 24.16% <16.36%> (-0.02%) ⬇️
systests 23.68% <21.81%> (-0.11%) ⬇️
unittests 72.84% <90.90%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...lance/extensions/ExtensibleLoadManagerWrapper.java 58.62% <ø> (ø)
...ache/pulsar/client/impl/MemoryLimitController.java 93.75% <89.47%> (-1.81%) ⬇️
...xtensions/channel/ServiceUnitStateChannelImpl.java 84.46% <91.66%> (+0.30%) ⬆️

... and 65 files with indirect coverage changes

Copy link
Contributor

@gaoran10 gaoran10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants