Support leader election mechanism for Spark operator HA #458

Closed
sarjeet2013 opened this issue Mar 29, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@sarjeet2013
Contributor

Currently, the Spark operator is deployed as a single replica and doesn't provide any HA if that one replica stops functioning or working reliably.

To solve this, we should be able to run the spark-operator with multiple replicas, along with a leader election mechanism, which would provide high availability for the Spark operator.

@liyinan926 added the enhancement (New feature or request) label Apr 4, 2019
@skonto
Contributor

skonto commented May 24, 2019

This is at least useful in the case of network partitions. There is this approach, which is somewhat old and whose related repo is retired, but it is still referenced in related PRs. There is also this one, which requires less work. StatefulSets provide some guarantees, but if the node goes down there will be no service available, AFAIK.

@tkanng
Contributor

tkanng commented Jun 8, 2019

This issue is very interesting! But I wasn't able to find an easy way to achieve it, except the leaderelection package.

The leaderelection package might be useful for enabling leader election for the operator. We could enable leader election with code like the following:

// Leader election among multiple operator replicas: RunOrDie blocks while
// this replica is campaigning for or holding leadership, and wait.Forever
// restarts the campaign if leadership is ever lost.
go wait.Forever(func() {
    leaderelection.RunOrDie(controllerCtx, leaderelection.LeaderElectionConfig{
        Lock:          &rl, // resource lock shared by all replicas
        LeaseDuration: leaseDuration,
        RenewDeadline: renewDuration,
        RetryPeriod:   retryPeriod,
        Callbacks: leaderelection.LeaderCallbacks{
            // Invoked once this replica becomes the leader: only the
            // leader starts the controllers.
            OnStartedLeading: func(ctx context.Context) {
                if err := applicationController.Start(*controllerThreads, stopCh); err != nil {
                    glog.Fatal(err)
                }
                if err := scheduledApplicationController.Start(*controllerThreads, stopCh); err != nil {
                    glog.Fatal(err)
                }
            },
            // Invoked when this replica loses leadership.
            OnStoppedLeading: func() {
            },
        },
    })
}, waitDuration)
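
For reference, the rl lock above has to be a resource lock shared by all replicas. A minimal sketch of how it might be constructed, assuming a Lease-based lock; the namespace, lock name, and identity below are illustrative assumptions, not taken from the operator's code:

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// buildLock is a hypothetical helper; all names below are assumptions.
func buildLock(kubeClient kubernetes.Interface, id string) resourcelock.LeaseLock {
    return resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Namespace: "spark-operator",      // assumed namespace
            Name:      "spark-operator-lock", // assumed Lease name
        },
        Client: kubeClient.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            // Each replica must use a unique identity, e.g. its pod name.
            Identity: id,
        },
    }
}

Something like rl := buildLock(kubeClient, podName) would then be passed as &rl above.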

After we enable leader election for the operator, we can deploy it as a Deployment with multiple replicas to achieve HA. There can be multiple endpoints behind the webhook service, which makes no difference to the component as a whole.

But I'm not sure this approach is good enough, so I'm hoping to discuss it with you and maybe take a stab at the implementation :)

@liyinan926
Collaborator

@tkanng you are right. The webhook service can route mutating admission requests to any replica, and that is fine because mutating admission requests can indeed be handled by any replica. Leadership only matters for receiving a SparkApplication and actually submitting the application to run.
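
To make this split concrete, here is a sketch of the resulting startup flow, reusing the rl lock from the earlier snippet; webhookMux, startControllers, and the TLS paths are hypothetical placeholders, not the operator's actual API:

import (
    "context"
    "net/http"
    "time"

    "github.com/golang/glog"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func run(ctx context.Context, rl resourcelock.Interface, webhookMux *http.ServeMux, startControllers func()) {
    // Every replica serves the webhook, so the webhook Service can route
    // admission requests to any pod.
    go func() {
        if err := http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", webhookMux); err != nil {
            glog.Fatal(err)
        }
    }()

    // Only the elected leader runs the controllers that receive
    // SparkApplications and submit them, so each app is submitted once.
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          rl,
        LeaseDuration: 15 * time.Second,
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) { startControllers() },
            OnStoppedLeading: func() { glog.Fatal("lost leadership") },
        },
    })
}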

@ringtail

@tkanng @liyinan926 Is this feature still under discussion? We are facing the same problem. I am a contributor to swarm, and we are developing master-slave HA for the Spark operator for our production cluster.

@liyinan926
Collaborator

This has been resolved in #518. The latest image at gcr.io/spark-operator/spark-operator:v2.4.0-v1beta1-latest contains the changes.
