Nats streaming unable to restore messages on startup #1271

Open
fowlerp-qlik opened this issue Oct 4, 2022 · 8 comments

@fowlerp-qlik

We are using nats-streaming version 0.24.4 running as three pods (Kubernetes). When nats-streaming was deployed, the pods rolled in an order that did not take the NATS Streaming leader into account. We have 96 channels. During startup we received about 10 occurrences of

[1] 2022/10/03 19:20:39.135630 [ERR] STREAM: channel "system-events.user-identity" - unable to restore messages (snapshot 75859347/75979326, store 75872433/75962095, cfs 75859347): nats: timeout

every three seconds, after which that nats-streaming pod would abort/exit. Kubernetes would then start a new instance, which hit the same issue.

Our message store is on a RAM disk, so we eventually shut down all pods and restarted from scratch (losing all messages). This recovered nats-streaming. The nats pods were not rolled during the nats-streaming deployment.

In terms of order, system-events.user-identity is neither the first nor the last channel, based on the channel creation order in the nats-streaming logs.

What would cause this problem?

@kozlovic
Member

kozlovic commented Oct 4, 2022

This error indicates that this node needed to restore messages based on the snapshot it got and the current state of its store: some messages were missing, and it then needs to get those messages from the leader. If no leader was available, it would not be able to get those messages back and cannot proceed. This seems to indicate that either connectivity was missing between this node and the rest of the cluster, or the other servers were also restarted and no leader election was possible because none of the restarted nodes were in a situation where they could proceed?

@fowlerp-qlik
Author

Would setting cluster_proceed_on_restore_failure=true be wise even if it means potential loss of messages? Is there any way to avoid/mitigate this issue?

@kozlovic
Member

kozlovic commented Oct 4, 2022

Would setting cluster_proceed_on_restore_failure=true be wise even if it means potential loss of messages

It depends. If you are in a situation where no leader can be elected, then it will allow you to start (with the understanding that some channels may not be recovered). It is a bit better than removing the whole state since some of the channels may not have had any problem. But again, this is a decision that you have to make after judging the impact. And this is likely something that you may not want to leave "on" by default, but just enable in a bad situation.
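As a rough sketch only (not taken from this thread beyond the option name): the option can be passed as a flag when starting a clustered node, and removed again once the cluster is healthy. The surrounding flags and values below are illustrative, so check nats-streaming-server -h for the exact names supported by your version.

  # Enable only while recovering from a stuck cluster, then remove it again.
  nats-streaming-server -clustered \
    -cluster_id my-cluster \
    -cluster_node_id ns-0 \
    -cluster_peers ns-1,ns-2 \
    -cluster_proceed_on_restore_failure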

Is there any way to avoid/mitigate this issue?

Make sure that there is a leader when you restart nodes, which also means you may want to start by recycling followers instead of the leader.
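One way to check which pod is currently the leader before recycling anything is the streaming monitoring endpoint: in clustered mode, /streaming/serverz reports the node's role (Leader/Follower). The sketch below assumes the monitoring port is enabled (e.g. -m 8222) and uses hypothetical pod names.

  # Hypothetical pod names; assumes the monitoring port (-m 8222) is enabled.
  for pod in nats-streaming-0 nats-streaming-1 nats-streaming-2; do
    kubectl port-forward "pod/$pod" 8222:8222 >/dev/null 2>&1 &
    pf=$!
    sleep 2
    role=$(curl -s http://localhost:8222/streaming/serverz | grep -o '"role": *"[^"]*"')
    echo "$pod: ${role:-role not reported}"
    kill "$pf"
  done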

@kozlovic
Member

Any update on this? Should I close?

@fowlerp-qlik
Author

Could you keep it open a bit longer? We are actively reviewing the logs to see how Kubernetes pod rolling may have impacted leadership and whether we can mitigate it somehow.

@kozlovic
Member

Could you keep it open a bit longer

No problem!

@fowlerp-qlik
Author

Sequence

  1. NS-2 is the leader; the Kubernetes update strategy is a rolling update (not onDelete)
  2. NS-2 rolls, leadership is lost, NS-1 becomes leader
  3. NS-1 rolls, NS-0 becomes leader but doesn't finish its leadership promotion actions
  4. NS-1 and NS-2 begin updating their channel snapshots (around 100 for each)
  5. NS-0 rolls, stays in a terminating state for a while, then a new instance starts. It was only leader for 20 seconds before being rolled
  6. No leader is elected; NS-1 and NS-2 fail to update various channel snapshots due to NATS timeouts. This continues for many minutes, with each node aborting after 30 seconds and restarting
  7. The new NS-0 instance shows the typical startup logs but nothing else

So could there be a weak area where the leader is killed/restarted while the followers are restoring their channel snapshots?

@kozlovic
Member

Yes, you should ensure that you recycle one node at a time and wait for it to be fully recovered/active before moving to the next. Ideally, you would start with a non-leader, but it is understood that leadership may change while a node is restarted.
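A hedged sketch of one way to get that behaviour on Kubernetes: switch the StatefulSet to the OnDelete update strategy so pods only restart when you delete them, then recycle the followers one by one and the current leader last. The StatefulSet/pod names are assumptions, and which pod is the leader should be verified first (for example via /streaming/serverz as sketched above).

  # Illustrative names; adjust to your deployment. With OnDelete, pods only
  # restart when deleted manually.
  kubectl patch statefulset nats-streaming \
    -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

  # Recycle followers first, waiting for each pod to become Ready (and for the
  # server to report itself active in its logs) before touching the next one.
  # If the wait races the pod re-creation, simply re-run it.
  kubectl delete pod nats-streaming-1
  kubectl wait --for=condition=Ready pod/nats-streaming-1 --timeout=5m

  kubectl delete pod nats-streaming-2
  kubectl wait --for=condition=Ready pod/nats-streaming-2 --timeout=5m

  # Delete the current leader last; a new leader is elected among the
  # already-updated followers.
  kubectl delete pod nats-streaming-0
  kubectl wait --for=condition=Ready pod/nats-streaming-0 --timeout=5m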
