Edit: I think I might have found at least part of the cause of this ... is passed into the leader state in the following line: I have tested this idea, but it did not fully fix the issue, so there might be more problems, as described below. There must be some task overhead, or the wrong dispatch loop waits for completions ... I am not sure. See the end of the issue for more.

Original question: During development I often observe my cluster going into a leader switch (ping pong) when I take a 3-member cluster and let the leader die. The two followers elect one of them as leader. The remaining follower then appears to think that the newly elected leader is dead (no heartbeat) and decides that it wants another election. It requests a pre-vote from the leader, who answers yes because the requester has a higher term. The other member is then all set to call an actual election: it votes for itself and gets the vote from the former leader. Now they have switched roles. Unfortunately, the new follower repeats the process and again seems to think the just-elected leader is dead ... and this keeps going on forever.

Using a very, very frequent heartbeat (0.1) minimizes the problem, but it feels like that should not be the way to fix it. I have tried the default settings for everything and with an ... It also appears that this is far more likely to happen when going from a cluster of 3 to 2. If a cluster of 4 is reduced to 3, it appears to be stable with the same settings. Right now I am using these settings:

```csharp
memberConfiguration.HeartbeatThreshold = 0.5;
memberConfiguration.LowerElectionTimeout = 2500;
memberConfiguration.UpperElectionTimeout = 5000;
memberConfiguration.RequestTimeout = TimeSpan.FromMinutes(1);
memberConfiguration.Partitioning = false;
```

In general I would expect this to stop after a few iterations, because eventually the randomly chosen timeout should make one of them win and keep the leader state. The heartbeats should be frequent enough to always arrive in time ... though it appears they don't?
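A quick sanity check of those numbers (a sketch only; it assumes the heartbeat period is derived as `LowerElectionTimeout * HeartbeatThreshold`, which is worth verifying against the dotNext version in use):

```csharp
// Hypothetical timing check for the settings above; the derivation of the
// heartbeat period is an assumption, not confirmed dotNext behavior.
var lowerElectionTimeout = TimeSpan.FromMilliseconds(2500);
var heartbeatPeriod = lowerElectionTimeout * 0.5;          // 1250 ms between heartbeat rounds
var requestTimeout = TimeSpan.FromMinutes(1);              // 60000 ms per in-flight request

Console.WriteLine(heartbeatPeriod < lowerElectionTimeout); // True: two rounds per timeout window
Console.WriteLine(requestTimeout < lowerElectionTimeout);  // False: one slow member can hold a
                                                           // blocking loop far past the timeout
```

If that derivation is right, the cadence itself looks fine on paper, which points the suspicion at request handling rather than the configured period.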
Steps to reproduce:

My debugging attempts: Whoever is elected leader must always send a heartbeat to all other members as the first thing he does; this is described in the Raft specification. I suspect the current implementation does not do this and instead waits one heartbeat interval before sending them. That would mean the other cluster member experiences a race between timing out and receiving the heartbeat from the just-elected leader, making it very likely to consider the newly elected member "dead on arrival". Or heartbeats are simply not sent often enough.

Here are some of the adjustments I did to debug, and the outputs. I added console output to the configured `DoHeartbeats` loop:

```csharp
private async Task DoHeartbeats(TimeSpan period, IAuditTrail<IRaftLogEntry> auditTrail, IClusterConfigurationStorage configurationStorage, CancellationToken token)
{
    using var cancellationSource = token.LinkTo(LeadershipToken);

    // reuse this buffer to place responses from other nodes
    using var taskBuffer = new AsyncResultSet(Members.Count);

    Console.WriteLine($"{DateTime.Now}: Starting sending heartbeats with interval of {period}");

    for (var forced = false; await DoHeartbeats(taskBuffer, auditTrail, configurationStorage, token).ConfigureAwait(false); forced = await WaitForReplicationAsync(period, token).ConfigureAwait(false))
    {
        Console.WriteLine($"{DateTime.Now}: WaitForReplicationAsync completed, forced={forced}");

        if (forced)
            DrainReplicationQueue();

        taskBuffer.Clear(true);
        Console.WriteLine($"{DateTime.Now}: Beginning DoHeartbeats loop again");
    }
}
```
And another console log in the overload:

```csharp
private async Task<bool> DoHeartbeats(AsyncResultSet taskBuffer, IAuditTrail<IRaftLogEntry> auditTrail, IClusterConfigurationStorage configurationStorage, CancellationToken token)
{
    Console.WriteLine($"{DateTime.Now}: Added task to send heartbeat to a member");
    ...
}
```

This resulted in the following log on the leader:
Whatever causes it, the guarantee must be that a) the heartbeats are configured to be sent faster than the lower election timeout, and b) whatever processing time is attached to them does not stop them from being queued to be sent at the fixed rate (see the sketch below). Sorry for the wall of text. Here is a cookie 🍪
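To make requirement (b) concrete, here is a minimal sketch of a dispatch loop that keeps a fixed cadence by never awaiting slow followers inline. This is not dotNext's actual implementation; `sendHeartbeatAsync` is a hypothetical stand-in for the real AppendEntries call:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class HeartbeatSketch
{
    // Sends one heartbeat round per tick. A follower that needs the full
    // RequestTimeout (one minute in the setup above) to fail does not delay
    // the next round, because its task is not awaited inside the loop.
    public static async Task RunAsync(
        IReadOnlyList<string> memberIds,
        Func<string, CancellationToken, Task> sendHeartbeatAsync, // hypothetical transport call
        TimeSpan period,
        CancellationToken token)
    {
        using var timer = new PeriodicTimer(period);
        while (await timer.WaitForNextTickAsync(token).ConfigureAwait(false))
        {
            foreach (var id in memberIds)
            {
                _ = sendHeartbeatAsync(id, token); // fire-and-forget per member
            }
        }
    }
}
```

By contrast, a loop that awaits all responses before scheduling the next round inherits the slowest member's latency, which with a one-minute `RequestTimeout` easily exceeds the 2500 ms lower election timeout.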
---
@Arkensor , request timeout is too high in your setup. On Windows, in the case of localhost communication, the leader has to wait out this timeout to detect an unavailable member; this is the expected behavior of a TCP socket on Windows. On Linux, it behaves differently. Anyway, you should set the request timeout lower than `lowerElectionTimeout`.
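If that diagnosis is right, the configuration-side fix looks roughly like this (values are illustrative, reusing the numbers from the question):

```csharp
// RequestTimeout below LowerElectionTimeout, so an unreachable member fails
// fast enough that heartbeats keep flowing within the election window.
memberConfiguration.HeartbeatThreshold = 0.5;
memberConfiguration.LowerElectionTimeout = 2500;
memberConfiguration.UpperElectionTimeout = 5000;
memberConfiguration.RequestTimeout = TimeSpan.FromSeconds(2); // was FromMinutes(1)
memberConfiguration.Partitioning = false;
```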
---
Probably, that may be an issue. I can fix it quickly.