Skip to content
This repository has been archived by the owner on Sep 6, 2018. It is now read-only.

Race condition during leader stepdown #168

Open
zenazn opened this issue Jan 27, 2014 · 6 comments
Open

Race condition during leader stepdown #168

zenazn opened this issue Jan 27, 2014 · 6 comments

Comments

@zenazn
Copy link

zenazn commented Jan 27, 2014

  1. Node 1 is leader. It has 20 entries in its log, the first 10 of which are committed. It has pushed all 20 to Node 2, but let's assume Nodes 3, 4, and 5 haven't seen any of these uncommitted entries.
  2. Somehow Node 3 gets elected while Node 1 isn't looking. It only has 10 entries in its log.
  3. Node 3 sends Node 1 an AppendEntries with a higher term number, and Node 1 steps down. It truncates its log to entry 10.
  4. In a separate goroutine, Node 1 prepares to send an AppendEntries request to Node 2 (that goroutine has no idea a step-down is in progress). It notes that the last entry it replicated to Node 2 was entry 20, and attempts to retrieve all entries in the log from 20 onwards.
  5. getEntriesAfter panics with the message "raft: Index is beyond end of log: 10 20"

In general, the locking around log operations is super sketchy. I suspect, for instance, that the entirety of https://github.com/goraft/raft/blob/master/server.go#L904-920 ought to be protected by a single lock acquisition / release, for instance.

@mathiasbynens
Copy link

@xiang90
Copy link
Contributor

xiang90 commented Jan 27, 2014

@zenazn @mathiasbynens I think this is fixed by #167
We stop all the heartbeats go-routines before leader steps-down and truncates its log.

@xiang90
Copy link
Contributor

xiang90 commented Jan 27, 2014

@zenazn Yes. Go-raft needs some cleaning up work.

@xiang90
Copy link
Contributor

xiang90 commented Jan 27, 2014

@zenazn One more thing, how did you find out the problem? Are you trying to audit the codes against the raft paper?

@zenazn
Copy link
Author

zenazn commented Jan 27, 2014

We discovered this bug in the process of writing the reference solution to Level 4 of Stripe's CTF (https://stripe-ctf.com). The network simulator I wrote for the event, Octopus (https://github.com/stripe-ctf/octopus), turns out to be very good at reproducing this particular bug :)

The patch in #167 introduces a deadlock, which I'll comment on there.

@xiang90
Copy link
Contributor

xiang90 commented Jan 27, 2014

@zenazn CTF is interesting! I found this bug several days ago when I was doing a benchmark.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants