__vertx.subs has stale entries for downed nodes after ungraceful shutdowns #141
Is your node leaving because it crashed, or is it a graceful shutdown? Not sure if I could follow your idea fully. I fear an update on the NodeInfo might just as well get lost when a node leaves without running a graceful shutdown. As a backup solution, any join or leave event could trigger fetching the member list to make sure they are still in sync. In general, I would like to make sure that the Vert.x side does not compensate for a missing feature or bug on the Ignite side.
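A minimal sketch of that backup idea, assuming a helper that can see both the Ignite topology and the subs cache; the RegistrationKey type and its getNodeId() accessor are placeholders, not the actual vertx-ignite key layout:

import java.util.Set;
import java.util.UUID;
import java.util.stream.Collectors;

import javax.cache.Cache;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cluster.ClusterNode;

public class SubsConsistencyCheck {

  // Illustrative stand-in for the real subs cache key.
  public interface RegistrationKey {
    UUID getNodeId();
  }

  // Could be triggered from any join/leave event as a backup cleanup.
  public static void purgeStaleSubs(Ignite ignite, IgniteCache<RegistrationKey, Object> subsCache) {
    // Member list as currently seen by Ignite.
    Set<UUID> liveNodes = ignite.cluster().nodes().stream()
        .map(ClusterNode::id)
        .collect(Collectors.toSet());

    // Drop every registration whose owning node is no longer in the topology.
    for (Cache.Entry<RegistrationKey, Object> entry : subsCache) {
      if (!liveNodes.contains(entry.getKey().getNodeId())) {
        subsCache.remove(entry.getKey());
      }
    }
  }
}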
For details: the service is hosted in Amazon ECS, and ECS can terminate a container quite ungracefully (at times multiple containers at once, since we run 9 cluster members). The solution is to ensure that the subs cleanup still happens, instead of the current "first deleter wins" approach. I can attempt to write a test, but it can also be triggered in a Kubernetes / other container-management setup where 2+ nodes are killed at roughly the same time.
Might not be the full extent of what you thought of, but have a look at #142 and let me know if this would go in the right direction. As the cache is configured as atomic, there is even a retry behind the scenes for the atomic remove methods, so this could be promising.
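For context, "configured as atomic" refers to Ignite's per-entry atomicity mode: single-key operations are retried internally when a primary node fails, but there are no multi-key transactions. A rough illustration of such a cache configuration follows; the REPLICATED mode shown is an assumption for the sketch, not necessarily the shipped default:

import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class SubsCacheConfigExample {

  static CacheConfiguration<Object, Object> subsCacheConfig() {
    CacheConfiguration<Object, Object> cfg = new CacheConfiguration<>("__vertx.subs");
    cfg.setCacheMode(CacheMode.REPLICATED);          // assumption for illustration
    cfg.setAtomicityMode(CacheAtomicityMode.ATOMIC); // single-key ops retried, no transactions
    return cfg;
  }
}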
I extended the test as follows:

@Test
public void testSubsRemovedForKilledNode() throws Exception {
  int start = 20;
  int stop = 18;
  int messages = 30;
  // Nodes 1..stop leave the cluster; the latch is released as soon as the
  // first leave completes.
  testSubsRemoved(start, stop, messages, latch -> {
    for (int i = 1; i <= stop; i++) {
      VertxInternal vi = (VertxInternal) vertices[i];
      Promise<Void> promise = vi.getOrCreateContext().promise();
      vi.getClusterManager().leave(promise);
      promise.future().onComplete(onSuccess(v -> {
        latch.countDown();
      }));
    }
  });
}

private void testSubsRemoved(int start, int stop, int messages, Consumer<CountDownLatch> action) throws Exception {
  if ((start - stop) < 2) {
    fail("start count needs to be at least stop count + 2");
  }
  startNodes(start);
  CountDownLatch regLatch = new CountDownLatch(stop);
  AtomicInteger cnt = new AtomicInteger();
  // Node 0 keeps a consumer that should receive all messages once the other
  // consumers are gone.
  vertices[0].eventBus().consumer(ADDRESS1, msg -> {
    int c = cnt.getAndIncrement();
    assertEquals(msg.body(), "foo" + c);
    if (c == messages - 1) {
      testComplete();
    }
    if (c >= messages) {
      fail("too many messages");
    }
  }).completionHandler(onSuccess(v -> {
    // Register consumers on the nodes that will be stopped.
    for (int i = 1; i <= stop; i++) {
      vertices[i].eventBus().consumer(ADDRESS1, msg -> {
        fail("shouldn't get message");
      }).completionHandler(onSuccess(v2 -> {
        regLatch.countDown();
      }));
    }
  }));
  awaitLatch(regLatch);
  CountDownLatch closeLatch = new CountDownLatch(1);
  action.accept(closeLatch);
  awaitLatch(closeLatch);
  // Allow time for the kill to be propagated
  Thread.sleep(2000);
  vertices[start - 1].runOnContext(v -> {
    // Now send some messages from the last node - they should ALL go to node 0
    EventBus ebSender = vertices[start - 1].eventBus();
    for (int i = 0; i < messages; i++) {
      ebSender.send(ADDRESS1, "foo" + i);
    }
  });
  await();
}

I get some Ignite exceptions, but it might still be too graceful a shutdown...
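To make the shutdown in that test less graceful, one option might be to bypass ClusterManager.leave() and stop the underlying Ignite node directly. This is only a sketch; it assumes the cluster manager in the test is an IgniteClusterManager exposing its Ignite instance via getIgniteInstance():

// Hypothetical "kill" action for testSubsRemoved: tear down the Ignite node
// without the Vert.x-side cleanup that leave() would perform.
testSubsRemoved(start, stop, messages, latch -> {
  for (int i = 1; i <= stop; i++) {
    IgniteClusterManager cm = (IgniteClusterManager) ((VertxInternal) vertices[i]).getClusterManager();
    String igniteName = cm.getIgniteInstance().name();
    // true = cancel in-flight jobs; the closest in-process approximation of a hard kill
    Ignition.stop(igniteName, true);
    latch.countDown();
  }
});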
@dspangen what do you think?
I think this is a useful addition, but I'm not sure it will be everything that is needed. For context, we've been using the following in prod for the last few weeks and are still getting some instability (we use Vert.x in Quarkus, fwiw). After a restart we may get a few straggler addresses that contain node info for departed nodes, triggering errors when sending to those addresses.
I am not sure if this is only valid for the Ignite cluster manager. I would go with that solution as a last resort and maybe have it as a toggleable option. I would prefer to have something like an interface where such a recovery can be hooked into, like handling clustered event bus exceptions when a peer cannot be reached. Such an event could trigger a consistency check to make sure this node is still part of the cluster and/or has no subscriptions left behind that were not cleared yet. @tsegismont what do you think about the above problem, or do you see something obvious on the Ignite side that could go wrong?
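A hypothetical shape for such a hook, just to make the idea concrete; none of these types or methods exist in vertx-ignite today:

// Hypothetical recovery hook. The cluster manager (or the clustered event bus)
// would invoke it when a peer cannot be reached, so the application or the
// manager itself can re-check membership and purge stale subscriptions.
public interface ClusterRecoveryHandler {

  /**
   * @param nodeId the peer that could not be reached
   * @param cause  the event bus failure that triggered the check
   */
  void onPeerUnreachable(String nodeId, Throwable cause);
}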
A recovery interface seems like a good idea... at least until the root cause is identified. Today it's not possible to respond to a subs map inconsistency programmatically in any way (besides forking the cluster manager), hence this heavy-handed approach.
@dspangen I'm not familiar with ECS; is there any way to configure rolling updates and up/down scaling so that they do not confuse the data grid? See https://vertx.io/docs/vertx-hazelcast/java/#_rolling_updates - I know it's not the same data grid, but I think the concepts apply as well.
Apologies in advance for a long explanation. Looking into this further, I believe I found the root cause: a misunderstanding of the Ignite distributed cache contract. Here's how it manifests:
I tested this hypothesis by forking the cluster manager. Importantly, with this change we have observed no inconsistency in a week. Previously, after every update to the cluster (approximately once a day), several registrations would stick around in perpetuity, failing approximately 0.5-1.0% of all requests to our application with the infamous error. Given that this state is maintained across two unrelated objects, it will take some work to clean it all up into a PR for submission, but I am happy to do so if there is interest.

Aside: Explo

Explo provides embedded analytics and BI to companies looking to share data with their customers. We provide a SQL-first analytics engine on top of customer databases and data warehouses. We use Vert.x via Quarkus 3 to maintain the connection-pooling infrastructure for our customer data warehouse connections. Given that we serve hundreds of customers and thousands of databases and data warehouses, we needed a sharding solution to maintain pools of database connections across a cluster of nodes; we use Vert.x to manage that. Our application is architected as follows:
Thanks for the long explanation, that helps a lot in tracking down the potential problem. The reason for using ... To stay consistent by default for the rest of the shared data caches, it's ... Can you refer to the Ignite documentation where this is stated? Were you able to test my PR draft to check whether it gets you a lower failure rate?
This is a continuation of #94, as it's hitting us in production. Looking into it, it seems the culprit is in the handling of a node leave event: the subscriptions are cleared by the cluster member that gets to delete the node info from the cache. But that can take a little while, during which the member handling the removal (essentially the leader for this event) could crash or be shut down, leaving the subs map in a bad state.
I think the solution here is to add a status field to IgniteNodeInfo of {STARTED, STOPPING} and mark the entry for a node as stopping. Then, whoever gets to update that entry gets the first chance to clear the subs. There would also be a background executor service that would poll for stopping members, maybe synchronized by a semaphore id'd to the removed member.

Does this seem like a reasonable solution to the problem?
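A rough sketch of that proposal; the status field, enum, and cleanup comments below are illustrative additions, not the current IgniteNodeInfo API:

import java.io.Serializable;

// Illustrative only: node info carrying a lifecycle status so that any
// surviving member can take over subs cleanup for a STOPPING node.
public class IgniteNodeInfo implements Serializable {

  public enum Status { STARTED, STOPPING }

  private String host;
  private int port;
  private Status status = Status.STARTED;

  // Instead of deleting the entry outright ("first deleter wins"), a leaving
  // or removed node is first marked STOPPING. Whoever updates the entry gets
  // the first chance to clear the subs; a background executor on every member
  // polls for STOPPING entries and finishes the cleanup if that member crashed.
  public void markStopping() {
    this.status = Status.STOPPING;
  }

  public Status getStatus() {
    return status;
  }
}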
Version
4.5.7