Harden restate-server::raft_metadata_cluster raft_metadata_cluster_chaos_test #2828

Closed
tillrohrmann opened this issue Mar 4, 2025 · 3 comments · Fixed by #2863

@tillrohrmann
Contributor

The raft_metadata_cluster_chaos_test seems to be unstable. It looks like the cluster sometimes does not start up fast enough for the initial health checks to pass. Maybe we should give it a bit more time.

https://github.com/restatedev/restate/actions/runs/13657796133/job/38181318353?pr=2825#step:12:2853

@tillrohrmann tillrohrmann self-assigned this Mar 7, 2025
@tillrohrmann
Contributor Author

It seems that in both failing cases, we were restarting a killed node with a completely different configuration:

https://github.com/restatedev/restate/actions/runs/13721166634/job/38376765214#step:12:2510
https://github.com/restatedev/restate/actions/runs/13657796133/job/38181318353?pr=2825#step:12:2684

@tillrohrmann
Contributor Author

It looks as if the node has been started with an empty configuration. Maybe the file hasn't been fully written before the process starts?

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue Mar 7, 2025
Dropping the file alone does not guarantee that the file content is
written to disk and that the file is closed immediately. If the config
file has not been written when the Restate process starts, the process
starts with the default configuration, which causes the
raft_metadata_cluster_chaos_test to fail every now and then.

This fixes restatedev#2828.
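
The idea described in the commit message can be sketched roughly as follows. This is only an illustration using the Rust standard library, not the code from the actual fix: `write_config_durably` is a hypothetical helper name, and the real test harness may use a different file API (for example an async one, where dropping a file handle behaves differently). The point is that the write is completed and synced explicitly before the child process is started, instead of relying on the handle being dropped.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper (not from the Restate code base): writes `contents`
/// to `path` and returns only after the data has been handed to the OS and
/// synced, so a process spawned afterwards cannot observe an empty file.
pub fn write_config_durably(path: &Path, contents: &str) -> io::Result<()> {
    let mut file = File::create(path)?;

    // Write the full configuration; write_all retries on partial writes.
    file.write_all(contents.as_bytes())?;

    // Flush any user-space buffering. For std::fs::File this is a no-op,
    // but it matters if the writer is later wrapped in a buffered type.
    file.flush()?;

    // Ask the OS to persist file data and metadata before continuing.
    file.sync_all()?;

    // Close the handle explicitly before the caller spawns the process.
    drop(file);
    Ok(())
}
```

A caller would invoke `write_config_durably(&config_path, &serialized_config)?` and start the node process only after it returns `Ok(())`.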