Support leader election mechanism for Spark operator HA #458

Closed
sarjeet2013 opened this issue Mar 29, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@sarjeet2013
Contributor

Currently, the Spark operator is deployed as a single replica and doesn't provide any HA if that one replica stops functioning or working reliably.

To solve this, we should be able to run the spark-operator with multiple replicas, along with a leader election mechanism, which would provide high availability for the Spark operator.

@liyinan926 added the enhancement (New feature or request) label Apr 4, 2019
@skonto
Contributor

skonto commented May 24, 2019

This is at least useful in the case of network partitions. There is this approach, which is somewhat old and whose related repo is retired, but it is still referenced in related PRs. There is also this one, which requires less work. StatefulSets provide some guarantees, but if the node goes down there will be no service available, AFAIK.

@tkanng
Contributor

tkanng commented Jun 8, 2019

This issue is very interesting! But I wasn't able to find an easy way to achieve it, except the leaderelection package.

The leaderelection package might be useful for enabling leader election for the operator. We could enable leader election with code like the following:

// Leader election among multiple operator replicas: RunOrDie blocks while
// this replica is campaigning for or holding leadership, and wait.Forever
// restarts the campaign if leadership is ever lost.
go wait.Forever(func() {
    leaderelection.RunOrDie(controllerCtx, leaderelection.LeaderElectionConfig{
        Lock:          &rl, // resource lock shared by all replicas
        LeaseDuration: leaseDuration,
        RenewDeadline: renewDuration,
        RetryPeriod:   retryPeriod,
        Callbacks: leaderelection.LeaderCallbacks{
            // Invoked once this replica becomes the leader: only the
            // leader starts the controllers.
            OnStartedLeading: func(ctx context.Context) {
                if err := applicationController.Start(*controllerThreads, stopCh); err != nil {
                    glog.Fatal(err)
                }
                if err := scheduledApplicationController.Start(*controllerThreads, stopCh); err != nil {
                    glog.Fatal(err)
                }
            },
            // Invoked when this replica loses leadership.
            OnStoppedLeading: func() {
            },
        },
    })
}, waitDuration)
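
For reference, the rl lock above has to be a resource lock shared by all replicas. A minimal sketch of how it might be constructed, assuming a Lease-based lock; the namespace, lock name, and identity below are illustrative assumptions, not taken from the operator's code:

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// buildLock is a hypothetical helper; all names below are assumptions.
func buildLock(kubeClient kubernetes.Interface, id string) resourcelock.LeaseLock {
    return resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Namespace: "spark-operator",      // assumed namespace
            Name:      "spark-operator-lock", // assumed Lease name
        },
        Client: kubeClient.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            // Each replica must use a unique identity, e.g. its pod name.
            Identity: id,
        },
    }
}

Something like rl := buildLock(kubeClient, podName) would then be passed as &rl above.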

After we enable leader election for the operator, we can deploy it as a Deployment with multiple replicas to achieve HA. There can be multiple endpoints behind the webhook service, which makes no difference to the component as a whole.

But I'm not sure this approach is good enough, so I'm hoping to discuss it with you and maybe take a stab at the implementation :)

@liyinan926
Collaborator

@tkanng you are right. The webhook service can route mutating admission requests to any replica, and that is fine because mutating admission requests can indeed be handled by any replica. Leadership only matters for receiving a SparkApplication and actually submitting the application to run.
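
To make this split concrete, here is a sketch of the resulting startup flow, reusing the rl lock from the earlier snippet; webhookMux, startControllers, and the TLS paths are hypothetical placeholders, not the operator's actual API:

import (
    "context"
    "net/http"
    "time"

    "github.com/golang/glog"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func run(ctx context.Context, rl resourcelock.Interface, webhookMux *http.ServeMux, startControllers func()) {
    // Every replica serves the webhook, so the webhook Service can route
    // admission requests to any pod.
    go func() {
        if err := http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", webhookMux); err != nil {
            glog.Fatal(err)
        }
    }()

    // Only the elected leader runs the controllers that receive
    // SparkApplications and submit them, so each app is submitted once.
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          rl,
        LeaseDuration: 15 * time.Second,
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) { startControllers() },
            OnStoppedLeading: func() { glog.Fatal("lost leadership") },
        },
    })
}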

@ringtail

@tkanng @liyinan926 Is this feature still under discussion? We are facing the same problem. I am a contributor to swarm, and we are developing master-slave HA for the Spark operator for our production cluster.

@liyinan926
Collaborator

This has been resolved in #518. The latest image at gcr.io/spark-operator/spark-operator:v2.4.0-v1beta1-latest contains the changes.
