Nats streaming unable to restore messages on startup #1271

Open
fowlerp-qlik opened this issue Oct 4, 2022 · 8 comments

@fowlerp-qlik

We are using nats-streaming version 0.24.4 running as three pods (Kubernetes). When nats-streaming was deployed, the pods rolled in an order that did not take the NATS Streaming leader into account. We have 96 channels. During startup we received about 10 occurrences of

[1] 2022/10/03 19:20:39.135630 [ERR] STREAM: channel "system-events.user-identity" - unable to restore messages (snapshot 75859347/75979326, store 75872433/75962095, cfs 75859347): nats: timeout

every three seconds, after which that nats-streaming pod would abort/exit. Kubernetes would then start a new instance, which hit the same issue.

Our message store is on a RAM disk, so we eventually shut down all pods and restarted from scratch (losing all messages). This recovered nats-streaming. The nats pods were not rolled during the nats-streaming deployment.

In terms of order, system-events.user-identity is neither the first nor the last channel, based on the channel creation order in the nats-streaming logs.

What would cause this problem?

@kozlovic
Member

kozlovic commented Oct 4, 2022

This error indicates that this node needed to restore messages based on the snapshot it got and the current state of its store: some messages were missing, and it then needs to get those messages from the leader. If no leader was available, it would not be able to get those messages back and cannot proceed. This seems to indicate that either connectivity was missing between this node and the rest of the cluster, or the other servers were also restarted and no leader election was possible because none of the restarted nodes were in a situation where they could proceed?

@fowlerp-qlik
Author

Would setting cluster_proceed_on_restore_failure=true be wise even if it means potential loss of messages? Is there any way to avoid/mitigate this issue?

@kozlovic
Member

kozlovic commented Oct 4, 2022

Would setting cluster_proceed_on_restore_failure=true be wise even if it means potential loss of messages

It depends. If you are in a situation where no leader can be elected, then it will allow you to start (with the understanding that some channels may not be recovered). It is a bit better than removing the whole state since some of the channels may not have had any problem. But again, this is a decision that you have to make after judging the impact. And this is likely something that you may not want to leave "on" by default, but just enable in a bad situation.
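As a rough sketch only (not taken from this thread beyond the option name): the option can be passed as a flag when starting a clustered node, and removed again once the cluster is healthy. The surrounding flags and values below are illustrative, so check nats-streaming-server -h for the exact names supported by your version.

  # Enable only while recovering from a stuck cluster, then remove it again.
  nats-streaming-server -clustered \
    -cluster_id my-cluster \
    -cluster_node_id ns-0 \
    -cluster_peers ns-1,ns-2 \
    -cluster_proceed_on_restore_failure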

Is there any way to avoid/mitigate this issue?

Make sure that there is a leader when you restart nodes, which also means you may want to start by recycling followers instead of the leader.
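One way to check which pod is currently the leader before recycling anything is the streaming monitoring endpoint: in clustered mode, /streaming/serverz reports the node's role (Leader/Follower). The sketch below assumes the monitoring port is enabled (e.g. -m 8222) and uses hypothetical pod names.

  # Hypothetical pod names; assumes the monitoring port (-m 8222) is enabled.
  for pod in nats-streaming-0 nats-streaming-1 nats-streaming-2; do
    kubectl port-forward "pod/$pod" 8222:8222 >/dev/null 2>&1 &
    pf=$!
    sleep 2
    role=$(curl -s http://localhost:8222/streaming/serverz | grep -o '"role": *"[^"]*"')
    echo "$pod: ${role:-role not reported}"
    kill "$pf"
  done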

@kozlovic
Member

Any update on this? Should I close?

@fowlerp-qlik
Author

Could you keep it open a bit longer? We are actively reviewing the logs to see how Kubernetes pod rolling may have impacted leadership and whether we can mitigate it somehow.

@kozlovic
Member

Could you keep it open a bit longer

No problem!

@fowlerp-qlik
Author

Sequence

  1. NS-2 is the leader; the Kubernetes update strategy is a rolling update (not onDelete)
  2. NS-2 rolls, leadership is lost, NS-1 becomes leader
  3. NS-1 rolls, NS-0 becomes leader but doesn't finish its leadership promotion actions
  4. NS-1 and NS-2 begin updating their channel snapshots (around 100 for each)
  5. NS-0 rolls, stays in a terminating state for a while, then a new instance starts. It was only leader for 20 seconds before being rolled
  6. No leader is elected; NS-1 and NS-2 fail to update various channel snapshots due to NATS timeouts. This continues for many minutes, with each node aborting after 30 seconds and restarting
  7. The new NS-0 instance shows the typical startup logs but nothing else

So could there be a weak area where the leader is killed/restarted while the followers are restoring their channel snapshots?

@kozlovic
Member

Yes, you should ensure that you recycle one node at a time and wait for it to be fully recovered/active before moving to the next. Ideally, you would start with a non-leader, but it is understood that leadership may change while a node is restarted.
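A hedged sketch of one way to get that behaviour on Kubernetes: switch the StatefulSet to the OnDelete update strategy so pods only restart when you delete them, then recycle the followers one by one and the current leader last. The StatefulSet/pod names are assumptions, and which pod is the leader should be verified first (for example via /streaming/serverz as sketched above).

  # Illustrative names; adjust to your deployment. With OnDelete, pods only
  # restart when deleted manually.
  kubectl patch statefulset nats-streaming \
    -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'

  # Recycle followers first, waiting for each pod to become Ready (and for the
  # server to report itself active in its logs) before touching the next one.
  # If the wait races the pod re-creation, simply re-run it.
  kubectl delete pod nats-streaming-1
  kubectl wait --for=condition=Ready pod/nats-streaming-1 --timeout=5m

  kubectl delete pod nats-streaming-2
  kubectl wait --for=condition=Ready pod/nats-streaming-2 --timeout=5m

  # Delete the current leader last; a new leader is elected among the
  # already-updated followers.
  kubectl delete pod nats-streaming-0
  kubectl wait --for=condition=Ready pod/nats-streaming-0 --timeout=5m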
