Support leader election mechanism for Spark operator HA #458
Comments
This is at least useful in case of network partitions. There is this approach, which is fairly old; the related repo is retired, but it is still referenced in related PRs. There is also this one, which requires less work. StatefulSets provide some guarantees, but if the node goes down there will be no service available, afaik.
The issue is so interesting! But I wasn't able to find an easy way to achieve it, except Package leaderelection. Package leaderelection might be useful to enable leader election for the operator. We can enable leader election, just like the code below:

// leader election for multiple operators
go wait.Forever(func() {
	leaderelection.RunOrDie(controllerCtx, leaderelection.LeaderElectionConfig{
		Lock:          &rl,
		LeaseDuration: leaseDuration,
		RenewDeadline: renewDuration,
		RetryPeriod:   retryPeriod,
		Callbacks: leaderelection.LeaderCallbacks{
			// Only the elected leader starts the controllers.
			// client-go versions whose RunOrDie takes a context pass one to this callback.
			OnStartedLeading: func(ctx context.Context) {
				if err := applicationController.Start(*controllerThreads, stopCh); err != nil {
					glog.Fatal(err)
				}
				if err := scheduledApplicationController.Start(*controllerThreads, stopCh); err != nil {
					glog.Fatal(err)
				}
			},
			// Left empty in this sketch: nothing is done when leadership is lost.
			OnStoppedLeading: func() {
			},
		},
	})
}, waitDuration)

After we enable leader election for the operator, we can deploy a Deployment with multiple replicas to achieve HA. There can be multiple endpoints behind But I'm not sure this approach is good enough, so I'm hoping to discuss it with you and maybe take a stab at the implementation :)
@tkanng you are right. The webhook service can route mutating admission requests to any replica. This is still fine, as mutating admission requests can indeed be handled by any replica. The leader is only applicable to receiving an
@tkanng @liyinan926 Is this feature still under discussion? We are facing the same problem. I am a contributor to swarm, and we are developing a master-slave HA setup for the Spark operator for our production cluster.
This has been resolved in #518. The latest image at
Currently, the Spark operator is deployed as a single replica and doesn't provide any HA if that only replica is not functioning or not working reliably.
To solve this, we should be able to increase the replica count for spark-operator and add a leader election mechanism, which should provide high availability for the Spark operator.