Mute/unmute overhead interferes with checkpoint barriers #3120

Open
slfritchie opened this issue Mar 11, 2020 · 0 comments
Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

Intermittent crash during checkpoint processing

What is the expected behavior?

No crash

What OS and version of Wallaroo are you using?

Ubuntu Bionic/18.04 LTS + Wallaroo @ commit 35d2038

Steps to reproduce?

See README.md in tarball at http://wallaroolabs-dev.s3.amazonaws.com/scott/count2.tar.gz. Instructions include options for building & running a demonstration test via a VM or Docker.

reset.sh
start-cluster.sh 4

... can occasionally yield a crash a few seconds after the start-cluster.sh script finishes. Full logs are at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1583892856.tar.gz. On a 1 CPU/5 GB RAM virtual machine, the crash seems to happen roughly 50% of the time.

The crash becomes more likely as the cluster size increases. It always seems to occur during the 2nd checkpoint operation.

$ tail /tmp/wallaroo.2
1583892717.934412,Unmuting DataChannel
1583892717.934418,Unmuting DataChannel
1583892717.934425,Unmuting DataChannel
1583892717.934431,Unmuting DataChannel
1583892718.090417,Sent control message to initializer: EventLogAckCheckpointMsg
1583892718.091238,Sent control message to initializer: WorkerAckBarrierMsg
1583892718.102630,Sent control message to initializer: EventLogAckCheckpointIdWrittenMsg
1583892719.111070,ERROR,Step,Invariant violation: received barrier CheckpointBarrierToken(2) is greater than current barrier CheckpointBarrierToken(1) at Step 193591313640807744353045639962347611769

Invariant violated in /build2/.deps/wallaroolabs/wallaroo/lib/wallaroo/core/step/step_phase.pony at line 219
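
To check several workers' logs for the same failure, the ERROR line above can be matched mechanically. The following is a minimal sketch in Python, assuming every violation is logged in exactly the format shown in the tail output above. The names `BarrierViolation` and `find_violations` are hypothetical helpers for illustration and are not part of Wallaroo.

```python
import re
from typing import Iterable, List, NamedTuple


class BarrierViolation(NamedTuple):
    """One 'received barrier is greater than current barrier' log entry."""
    timestamp: str   # kept as text to preserve the microsecond digits
    received: int    # checkpoint barrier token id that arrived
    current: int     # checkpoint barrier token id the step was still on
    step_id: int     # step id printed at the end of the message


# Assumption: the pattern mirrors the single ERROR line quoted in this report.
_VIOLATION = re.compile(
    r"^(?P<ts>\d+\.\d+),ERROR,Step,Invariant violation: "
    r"received barrier CheckpointBarrierToken\((?P<received>\d+)\) "
    r"is greater than current barrier CheckpointBarrierToken\((?P<current>\d+)\) "
    r"at Step (?P<step>\d+)"
)


def find_violations(lines: Iterable[str]) -> List[BarrierViolation]:
    """Return every barrier-ordering violation found in the given log lines."""
    found = []
    for line in lines:
        match = _VIOLATION.match(line.strip())
        if match:
            found.append(
                BarrierViolation(
                    timestamp=match.group("ts"),
                    received=int(match.group("received")),
                    current=int(match.group("current")),
                    step_id=int(match.group("step")),
                )
            )
    return found
```

A file object can be passed directly, for example `find_violations(open("/tmp/wallaroo.2"))`. Comparing `received` and `current` across runs would confirm whether the failure is always token 2 arriving while token 1 is still current.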