
Add statefulset partition controller #633

Merged: 10 commits merged into main on Aug 30, 2024
Conversation

@d-kuro (Contributor) commented Jan 10, 2024

@d-kuro self-assigned this on Jan 10, 2024
@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from 98c05a1 to 08421b3 on January 17, 2024
@d-kuro force-pushed the d-kuro/partition branch 5 times, most recently from 615115a to 2306d09 on May 6, 2024
@d-kuro changed the title from "WIP: Add statefulset partition controller" to "Add statefulset partition controller" on May 8, 2024
@d-kuro marked this pull request as ready for review on May 8, 2024
@masa213f requested review from masa213f and YZ775 on May 9, 2024
@masa213f (Contributor) left a comment
I could not update the MySQL Pods after the following operations (reproduction steps below).
This means we can't roll back MySQLClusters with Argo CD, etc.

If a Pod's phase is not Running (i.e. Pending, Succeeded, or Failed), how about advancing the partition anyway so that the broken Pod gets updated?
This behavior would match the PodDisruptionBudget's unhealthy-pod eviction policy. Please see the following note:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/#unhealthy-pod-eviction-policy

NOTE: Pods in Pending, Succeeded or Failed phase are always considered for eviction.
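
A minimal sketch of that phase check, assuming Go in the controllers package (the helper name needsForcedUpdate is illustrative, not code from this PR):

package controllers

import corev1 "k8s.io/api/core/v1"

// needsForcedUpdate reports whether the partition should advance past a Pod
// that is not Running. This mirrors the PDB note quoted above: Pods in the
// Pending, Succeeded, or Failed phase are always considered for eviction.
func needsForcedUpdate(pod *corev1.Pod) bool {
	switch pod.Status.Phase {
	case corev1.PodPending, corev1.PodSucceeded, corev1.PodFailed:
		return true
	default:
		return false
	}
}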

  1. Create a MySQLCluster
$ kubectl create ns test

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  2. Wait for the cluster to become Healthy.

  3. Update the MySQLCluster with a wrong image.

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        # invalid image
        image: foo/bar:baz
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  4. Wait for the mysql Pod moco-test-2 to become an error.
$ kubectl get pod -n test -w
NAME          READY   STATUS        RESTARTS   AGE
...
moco-test-2   3/3     Terminating   0          4m7s
...
moco-test-2   0/3     Pending       0          0s
...
moco-test-2   0/3     Init:ErrImagePull   0          17s
moco-test-2   0/3     Init:ImagePullBackOff   0          33s

Review thread: controllers/partition_controller.go (outdated, resolved)
@d-kuro force-pushed the d-kuro/partition branch 6 times, most recently from 2bdea15 to f665979 on July 3, 2024
@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from 74dae8e to 088bac5 on July 4, 2024
@d-kuro (Contributor, Author) commented Jul 9, 2024

@YZ775 @masa213f
Sorry for the delay in fixing this.
I have significantly rewritten the code to check Pod readiness when updating the partition.
The isRolloutReady function has changed significantly:

func (r *StatefulSetPartitionReconciler) isRolloutReady(ctx context.Context, cluster *mocov1beta2.MySQLCluster, sts *appsv1.StatefulSet) (bool, error) {
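
For orientation, a sketch of what this readiness gate could look like; this is an assumption, not the PR's actual body (podIsReady is an illustrative helper, and the reconciler is assumed to embed client.Client):

// Assumed imports: context, appsv1 "k8s.io/api/apps/v1",
// corev1 "k8s.io/api/core/v1", "sigs.k8s.io/controller-runtime/pkg/client".
func (r *StatefulSetPartitionReconciler) isRolloutReady(ctx context.Context, cluster *mocov1beta2.MySQLCluster, sts *appsv1.StatefulSet) (bool, error) {
	var pods corev1.PodList
	if err := r.List(ctx, &pods,
		client.InNamespace(sts.Namespace),
		client.MatchingLabels(sts.Spec.Selector.MatchLabels)); err != nil {
		return false, err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// Pods that are not Running (Pending, Succeeded, Failed) are treated
		// as evictable, like the PDB unhealthy-pod policy, so a broken Pod
		// cannot block the rollout that would fix it.
		if pod.Status.Phase != corev1.PodRunning {
			continue
		}
		if !podIsReady(pod) {
			return false, nil
		}
	}
	return true, nil
}

// podIsReady reports whether the Pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}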

I have also added E2E test cases for this:

moco/e2e/partition_test.go, lines 163 to 241 in 088bac5:

It("should image pull backoff", func() {
kubectlSafe(fillTemplate(imagePullBackoffApplyYAML), "apply", "-f", "-")
Eventually(func() error {
out, err := kubectl(nil, "get", "-n", "partition", "pod", "moco-test-2", "-o", "json")
if err != nil {
return err
}
pod := &corev1.Pod{}
if err := json.Unmarshal(out, pod); err != nil {
return err
}
status := make([]corev1.ContainerStatus, 0, len(pod.Status.ContainerStatuses)+len(pod.Status.InitContainerStatuses))
status = append(status, pod.Status.ContainerStatuses...)
status = append(status, pod.Status.InitContainerStatuses...)
for _, s := range status {
if s.Image != "ghcr.io/cybozu-go/moco/mysql:invalid-image" {
continue
}
if s.State.Waiting != nil && s.State.Waiting.Reason == "ImagePullBackOff" {
return nil
}
}
return errors.New("image pull backoff Pod not found")
}).Should(Succeed())
})
It("should partition updates have stopped", func() {
out, err := kubectl(nil, "get", "-n", "partition", "statefulset", "moco-test", "-o", "json")
Expect(err).NotTo(HaveOccurred())
sts := &appsv1.StatefulSet{}
err = json.Unmarshal(out, sts)
Expect(err).NotTo(HaveOccurred())
Expect(sts.Spec.UpdateStrategy.RollingUpdate).NotTo(BeNil())
Expect(sts.Spec.UpdateStrategy.RollingUpdate.Partition).NotTo(BeNil())
Expect(*sts.Spec.UpdateStrategy.RollingUpdate.Partition).To(Equal(int32(1)))
})
It("should rollback succeed", func() {
kubectlSafe(fillTemplate(partitionApplyYAML), "apply", "-f", "-")
Eventually(func() error {
cluster, err := getCluster("partition", "test")
if err != nil {
return err
}
for _, cond := range cluster.Status.Conditions {
if cond.Type != mocov1beta2.ConditionHealthy {
continue
}
if cond.Status == metav1.ConditionTrue {
return nil
}
return fmt.Errorf("cluster is not healthy: %s", cond.Status)
}
return errors.New("no health condition")
}).Should(Succeed())
})
It("should partition updates succeed", func() {
Eventually(func() error {
out, err := kubectl(nil, "get", "-n", "partition", "statefulset", "moco-test", "-o", "json")
if err != nil {
return err
}
sts := &appsv1.StatefulSet{}
if err := json.Unmarshal(out, sts); err != nil {
return err
}
if sts.Spec.UpdateStrategy.RollingUpdate == nil || sts.Spec.UpdateStrategy.RollingUpdate.Partition == nil {
return errors.New("partition is nil")
}
if *sts.Spec.UpdateStrategy.RollingUpdate.Partition == int32(0) {
return nil
}
return errors.New("partition is not 0")
}).Should(Succeed())
})

@d-kuro requested review from YZ775 and masa213f on July 9, 2024
@masa213f (Contributor) left a comment

Please check and fix this behavior.
I am still reviewing, but I will comment first because this issue is important. :)

  1. Create a MySQLCluster
$ kubectl create ns test

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  2. Wait for the cluster to become Healthy.

  3. Rollout restart the MySQL StatefulSet. (It's OK to update the MySQLCluster instead.)

$ kubectl rollout restart sts -n test moco-test

At this time, even though the rolling update has not finished, the partition becomes 0.

$ kubectl get sts -n test moco-test -o json | jq .spec.updateStrategy
{
  "rollingUpdate": {
    "partition": 0
  },
  "type": "RollingUpdate"
}

$ kubectl get pod -n test
NAME          READY   STATUS        RESTARTS   AGE
moco-test-0   3/3     Running       0          2m32s
moco-test-1   3/3     Running       0          2m32s
moco-test-2   3/3     Terminating   0          2m32s

$ kubectl events -n test
LAST SEEN               TYPE      REASON                    OBJECT                                         MESSAGE
...
28s                     Normal    Killing                   Pod/moco-test-2                                Stopping container slow-log
28s                     Normal    Killing                   Pod/moco-test-2                                Stopping container agent
28s                     Normal    Killing                   Pod/moco-test-2                                Stopping container mysqld
28s                     Normal    PartitionUpdate           StatefulSet/moco-test                          Updated partition from 3 to 2
28s                     Normal    PartitionUpdate           StatefulSet/moco-test                          Updated partition from 2 to 1
28s                     Normal    PartitionUpdate           StatefulSet/moco-test                          Updated partition from 1 to 0
6s (x10 over 28s)       Normal    SuccessfulDelete          StatefulSet/moco-test                          delete Pod moco-test-2 in StatefulSet moco-test successful
6s                      Normal    Scheduled                 Pod/moco-test-2                                Successfully assigned test/moco-test-2 to moco-worker2
6s (x9 over 7s)         Normal    RecreatingTerminatedPod   StatefulSet/moco-test                          StatefulSet test/moco-test is recreating terminated Pod

If the partition is 0 and a mysql-0 Pod is accidentally deleted during a rolling update, it is re-created with the new revision.
If the order of updating the MySQL Pods is swapped, the data may be corrupted during the MySQL upgrade.

https://github.com/cybozu-go/moco/blob/v0.23.2/docs/upgrading.md?plain=1#L58-L60

Review thread: api/v1beta2/statefulset_webhhok_test.go (resolved)
Review thread: controllers/partition_controller.go (resolved)
@masa213f (Contributor) commented

@d-kuro
IMO, it would be easier to implement using the StatefulSet status and the MySQLCluster status.
For example, how about the following logic?

if cluster.Generation != cluster.Status.ReconcileInfo.Generation {
	// In this case, the reconciliation of MySQLClusterReconciler has not been completed.
	// Wait until completion.
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

if sts.Generation != sts.Status.ObservedGeneration {
	// In this case, the reconciliation of the StatefulSet controller (kube-controller-manager)
	// has not been completed. Wait until completion.
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

if *sts.Spec.Replicas == sts.Status.UpdatedReplicas {
	// In this case, a rolling update has been completed.
	// Update the partition to the initial value (`sts.spec.replicas`) and finish the reconciliation.
	updatePartition(*sts.Spec.Replicas) // return an error if a conflict occurs
	return reconcile.Result{}, nil
}

if *sts.Spec.UpdateStrategy.RollingUpdate.Partition == 0 {
	// In this case, just wait for mysql-pod-0 to update.
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

nextIndex := *sts.Spec.UpdateStrategy.RollingUpdate.Partition - 1
if podList.Items[nextIndex].Labels[appsv1.ControllerRevisionHashLabelKey] == sts.Status.UpdateRevision {
	// In this case, the nextIndex pod is already updated. Just decrement the partition.
	// The nextIndex pod will not be restarted, so we need not check the cluster status.
	updatePartition(nextIndex) // return an error if a conflict occurs
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

if clusterIsHealthy(cluster) /* check cluster.status */ || !podIsRunning(podList.Items[nextIndex]) /* if possible, check whether the pod is ready */ {
	// Restart the nextIndex pod.
	updatePartition(nextIndex) // return an error if a conflict occurs
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

// Wait until the cluster becomes healthy.
return reconcile.Result{RequeueAfter: 10 * time.Second}, nil

@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from f8b23dd to abecf35 on July 31, 2024
@d-kuro (Contributor, Author) commented Jul 31, 2024

@masa213f
Thank you for the review and the proposed revisions. I have revised the logic based on your suggestions. Please check commit abecf35.

@masa213f self-requested a review on August 1, 2024
@masa213f (Contributor) left a comment

When I updated volumeClaimTemplates, the PartitionReconciler did not work.
I guess we need to add the partition when a StatefulSet is created (see the sketch after the steps below).

  1. Create a MySQLCluster
$ kubectl create ns test

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF

# wait for the cluster to become healthy.
  2. Watch the following in other terminals.
# other terminals
$ watch kubectl get mysqlcluster,pod -n test --show-labels
$ watch "kubectl get sts -n test moco-test -o json | jq .metadata.generation,.spec.updateStrategy,.status"
  3. Create a DB.
$ kubectl moco mysql -n test test -u moco-writable -- -e "CREATE DATABASE test;"
  4. Crash MySQL Pod 0: keep killing mysqld.
# another terminal
$ watch kubectl exec -n test moco-test-0 -c mysqld -- kill 1
  5. Update the podTemplate and volumeClaimTemplates.
$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    metadata:
      labels:
        hoge: piyo # add
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
      labels:
        foo: bar # add
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
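
A minimal sketch of the "partition on creation" idea, assuming the initial partition is set when the StatefulSet object is first built (the function name is illustrative; ptr is k8s.io/utils/ptr):

// setInitialPartition makes a newly created StatefulSet start fully partitioned,
// so that the partition controller, not kube-controller-manager, drives the rollout.
func setInitialPartition(sts *appsv1.StatefulSet) {
	replicas := int32(1)
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	if sts.Spec.UpdateStrategy.RollingUpdate == nil {
		sts.Spec.UpdateStrategy.RollingUpdate = &appsv1.RollingUpdateStatefulSetStrategy{}
	}
	sts.Spec.UpdateStrategy.RollingUpdate.Partition = ptr.To(replicas)
}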

Review thread: controllers/partition_controller.go (outdated, resolved)
Review thread: controllers/partition_controller.go (outdated, resolved)
@masa213f (Contributor) commented on the new e2e test file (package e2e):
Could you add a test that, while a MySQL Pod is crash looping, the rolling update will not start?
I think we can implement this test with the following steps:

  1. Create a MySQLCluster.
  2. Wait for the MySQLCluster to become Healthy.
  3. Create a DB.
    • e.g. kubectl moco mysql <MySQLCluster> -u moco-writable -- -e "CREATE DATABASE test;"
  4. Keep killing mysql-0 or mysql-1 in a goroutine (see the sketch below).
    • e.g. keep executing kubectl exec moco-<MySQLCluster>-0 -c mysqld -- kill 1
  5. Update the MySQLCluster.

Then, check that the mysql Pods do not start restarting.

Also, I want two test cases: one where the StatefulSet is simply updated, and another where the StatefulSet is re-created.
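
A sketch of step 4 under assumptions: kubectl is the e2e helper already used in this PR, while the stop channel and the 2-second interval are illustrative:

stop := make(chan struct{})
go func() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Ignore errors: the container may be restarting between kills.
			_, _ = kubectl(nil, "exec", "-n", "partition", "moco-test-0", "-c", "mysqld", "--", "kill", "1")
		}
	}
}()
// ...update the MySQLCluster and assert that the Pods do not start restarting...
close(stop)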

@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from ddae9eb to 8d17448 on August 14, 2024
Review thread: api/v1beta2/statefulset_webhhok_test.go (outdated, resolved)
Review threads (2): controllers/partition_controller.go (outdated, resolved)
@masa213f (Contributor) left a comment

LGTM. Thank you.

@masa213f merged commit 0c4d537 into main on Aug 30, 2024 (18 checks passed)
@masa213f deleted the d-kuro/partition branch on August 30, 2024