
Add statefulset partition controller #633

Merged: 10 commits merged into main on Aug 30, 2024
Conversation

@d-kuro (Contributor) commented Jan 10, 2024

@d-kuro self-assigned this on Jan 10, 2024
@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from 98c05a1 to 08421b3 on January 17, 2024
@d-kuro force-pushed the d-kuro/partition branch 5 times, most recently from 615115a to 2306d09 on May 6, 2024
@d-kuro changed the title from "WIP: Add statefulset partition controller" to "Add statefulset partition controller" on May 8, 2024
@d-kuro marked this pull request as ready for review on May 8, 2024
@masa213f requested review from masa213f and YZ775 on May 9, 2024
@masa213f (Contributor) left a comment
I could not update the MySQL Pods after the following operations (reproduction steps below).
This means we can't roll back MySQLClusters with Argo CD, etc.

If a Pod's phase is not Running (i.e. Pending, Succeeded, or Failed), how about advancing the partition anyway so that the broken Pod gets updated?
This behavior would match the PodDisruptionBudget's unhealthy-pod eviction policy. Please see the following note:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/#unhealthy-pod-eviction-policy

NOTE: Pods in Pending, Succeeded or Failed phase are always considered for eviction.
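
A minimal sketch of that phase check, assuming Go in the controllers package (the helper name needsForcedUpdate is illustrative, not code from this PR):

package controllers

import corev1 "k8s.io/api/core/v1"

// needsForcedUpdate reports whether the partition should advance past a Pod
// that is not Running. This mirrors the PDB note quoted above: Pods in the
// Pending, Succeeded, or Failed phase are always considered for eviction.
func needsForcedUpdate(pod *corev1.Pod) bool {
	switch pod.Status.Phase {
	case corev1.PodPending, corev1.PodSucceeded, corev1.PodFailed:
		return true
	default:
		return false
	}
}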

  1. Create a MySQLCluster
$ kubectl create ns test

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  2. Wait for the cluster to become Healthy.

  3. Update the MySQLCluster with a wrong image.

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        # invalid image
        image: foo/bar:baz
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  4. Wait for the mysql Pod moco-test-2 to become an error.
$ kubectl get pod -n test -w
NAME          READY   STATUS        RESTARTS   AGE
...
moco-test-2   3/3     Terminating   0          4m7s
...
moco-test-2   0/3     Pending       0          0s
...
moco-test-2   0/3     Init:ErrImagePull   0          17s
moco-test-2   0/3     Init:ImagePullBackOff   0          33s

Review thread: controllers/partition_controller.go (outdated, resolved)
@d-kuro force-pushed the d-kuro/partition branch 6 times, most recently from 2bdea15 to f665979 on July 3, 2024
@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from 74dae8e to 088bac5 on July 4, 2024
@d-kuro (Contributor, Author) commented Jul 9, 2024

@YZ775 @masa213f
Sorry for the delay in fixing this.
I have significantly rewritten the code to check Pod readiness when updating the partition.
The isRolloutReady function has changed significantly:

func (r *StatefulSetPartitionReconciler) isRolloutReady(ctx context.Context, cluster *mocov1beta2.MySQLCluster, sts *appsv1.StatefulSet) (bool, error) {
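
For orientation, a sketch of what this readiness gate could look like; this is an assumption, not the PR's actual body (podIsReady is an illustrative helper, and the reconciler is assumed to embed client.Client):

// Assumed imports: context, appsv1 "k8s.io/api/apps/v1",
// corev1 "k8s.io/api/core/v1", "sigs.k8s.io/controller-runtime/pkg/client".
func (r *StatefulSetPartitionReconciler) isRolloutReady(ctx context.Context, cluster *mocov1beta2.MySQLCluster, sts *appsv1.StatefulSet) (bool, error) {
	var pods corev1.PodList
	if err := r.List(ctx, &pods,
		client.InNamespace(sts.Namespace),
		client.MatchingLabels(sts.Spec.Selector.MatchLabels)); err != nil {
		return false, err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// Pods that are not Running (Pending, Succeeded, Failed) are treated
		// as evictable, like the PDB unhealthy-pod policy, so a broken Pod
		// cannot block the rollout that would fix it.
		if pod.Status.Phase != corev1.PodRunning {
			continue
		}
		if !podIsReady(pod) {
			return false, nil
		}
	}
	return true, nil
}

// podIsReady reports whether the Pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}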

I have also added E2E test cases for this:

moco/e2e/partition_test.go, lines 163 to 241 in 088bac5:

It("should image pull backoff", func() {
kubectlSafe(fillTemplate(imagePullBackoffApplyYAML), "apply", "-f", "-")
Eventually(func() error {
out, err := kubectl(nil, "get", "-n", "partition", "pod", "moco-test-2", "-o", "json")
if err != nil {
return err
}
pod := &corev1.Pod{}
if err := json.Unmarshal(out, pod); err != nil {
return err
}
status := make([]corev1.ContainerStatus, 0, len(pod.Status.ContainerStatuses)+len(pod.Status.InitContainerStatuses))
status = append(status, pod.Status.ContainerStatuses...)
status = append(status, pod.Status.InitContainerStatuses...)
for _, s := range status {
if s.Image != "ghcr.io/cybozu-go/moco/mysql:invalid-image" {
continue
}
if s.State.Waiting != nil && s.State.Waiting.Reason == "ImagePullBackOff" {
return nil
}
}
return errors.New("image pull backoff Pod not found")
}).Should(Succeed())
})
It("should partition updates have stopped", func() {
out, err := kubectl(nil, "get", "-n", "partition", "statefulset", "moco-test", "-o", "json")
Expect(err).NotTo(HaveOccurred())
sts := &appsv1.StatefulSet{}
err = json.Unmarshal(out, sts)
Expect(err).NotTo(HaveOccurred())
Expect(sts.Spec.UpdateStrategy.RollingUpdate).NotTo(BeNil())
Expect(sts.Spec.UpdateStrategy.RollingUpdate.Partition).NotTo(BeNil())
Expect(*sts.Spec.UpdateStrategy.RollingUpdate.Partition).To(Equal(int32(1)))
})
It("should rollback succeed", func() {
kubectlSafe(fillTemplate(partitionApplyYAML), "apply", "-f", "-")
Eventually(func() error {
cluster, err := getCluster("partition", "test")
if err != nil {
return err
}
for _, cond := range cluster.Status.Conditions {
if cond.Type != mocov1beta2.ConditionHealthy {
continue
}
if cond.Status == metav1.ConditionTrue {
return nil
}
return fmt.Errorf("cluster is not healthy: %s", cond.Status)
}
return errors.New("no health condition")
}).Should(Succeed())
})
It("should partition updates succeed", func() {
Eventually(func() error {
out, err := kubectl(nil, "get", "-n", "partition", "statefulset", "moco-test", "-o", "json")
if err != nil {
return err
}
sts := &appsv1.StatefulSet{}
if err := json.Unmarshal(out, sts); err != nil {
return err
}
if sts.Spec.UpdateStrategy.RollingUpdate == nil || sts.Spec.UpdateStrategy.RollingUpdate.Partition == nil {
return errors.New("partition is nil")
}
if *sts.Spec.UpdateStrategy.RollingUpdate.Partition == int32(0) {
return nil
}
return errors.New("partition is not 0")
}).Should(Succeed())
})

@d-kuro requested review from YZ775 and masa213f on July 9, 2024
@masa213f (Contributor) left a comment

Please check and fix this behavior.
I am still reviewing, but I will comment first because this issue is important. :)

  1. Create a MySQLCluster
$ kubectl create ns test

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
  2. Wait for the cluster to become Healthy.

  3. Rollout restart the MySQL StatefulSet. (It's OK to update the MySQLCluster instead.)

$ kubectl rollout restart sts -n test moco-test

At this time, even though the rolling update has not finished, the partition becomes 0.

$ kubectl get sts -n test moco-test -o json | jq .spec.updateStrategy
{
  "rollingUpdate": {
    "partition": 0
  },
  "type": "RollingUpdate"
}

$ kubectl get pod -n test
NAME          READY   STATUS        RESTARTS   AGE
moco-test-0   3/3     Running       0          2m32s
moco-test-1   3/3     Running       0          2m32s
moco-test-2   3/3     Terminating   0          2m32s

$ kubectl events -n test
LAST SEEN               TYPE      REASON                    OBJECT                                         MESSAGE
...
28s                     Normal    Killing                   Pod/moco-test-2                                Stopping container slow-log
28s                     Normal    Killing                   Pod/moco-test-2                                Stopping container agent
28s                     Normal    Killing                   Pod/moco-test-2                                Stopping container mysqld
28s                     Normal    PartitionUpdate           StatefulSet/moco-test                          Updated partition from 3 to 2
28s                     Normal    PartitionUpdate           StatefulSet/moco-test                          Updated partition from 2 to 1
28s                     Normal    PartitionUpdate           StatefulSet/moco-test                          Updated partition from 1 to 0
6s (x10 over 28s)       Normal    SuccessfulDelete          StatefulSet/moco-test                          delete Pod moco-test-2 in StatefulSet moco-test successful
6s                      Normal    Scheduled                 Pod/moco-test-2                                Successfully assigned test/moco-test-2 to moco-worker2
6s (x9 over 7s)         Normal    RecreatingTerminatedPod   StatefulSet/moco-test                          StatefulSet test/moco-test is recreating terminated Pod

If the partition is 0 and a mysql-0 Pod is accidentally deleted during a rolling update, it is re-created with the new revision.
If the order of updating the MySQL Pods is swapped, the data may be corrupted during the MySQL upgrade.

https://github.com/cybozu-go/moco/blob/v0.23.2/docs/upgrading.md?plain=1#L58-L60

Review thread: api/v1beta2/statefulset_webhhok_test.go (resolved)
Review thread: controllers/partition_controller.go (resolved)
@masa213f (Contributor) commented

@d-kuro
IMO, it would be easier to implement using the StatefulSet status and the MySQLCluster status.
For example, how about the following logic?

if cluster.Generation != cluster.Status.ReconcileInfo.Generation {
	// In this case, the reconciliation of MySQLClusterReconciler has not been completed.
	// Wait until completion.
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

if sts.Generation != sts.Status.ObservedGeneration {
	// In this case, the reconciliation of the StatefulSet controller (kube-controller-manager)
	// has not been completed. Wait until completion.
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

if *sts.Spec.Replicas == sts.Status.UpdatedReplicas {
	// In this case, a rolling update has been completed.
	// Update the partition to the initial value (`sts.spec.replicas`) and finish the reconciliation.
	updatePartition(*sts.Spec.Replicas) // return an error if a conflict occurs
	return reconcile.Result{}, nil
}

if *sts.Spec.UpdateStrategy.RollingUpdate.Partition == 0 {
	// In this case, just wait for mysql-pod-0 to update.
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

nextIndex := *sts.Spec.UpdateStrategy.RollingUpdate.Partition - 1
if podList.Items[nextIndex].Labels[appsv1.ControllerRevisionHashLabelKey] == sts.Status.UpdateRevision {
	// In this case, the nextIndex pod is already updated. Just decrement the partition.
	// The nextIndex pod will not be restarted, so we need not check the cluster status.
	updatePartition(nextIndex) // return an error if a conflict occurs
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

if clusterIsHealthy(cluster) /* check cluster.status */ || !podIsRunning(podList.Items[nextIndex]) /* if possible, check whether the pod is ready */ {
	// Restart the nextIndex pod.
	updatePartition(nextIndex) // return an error if a conflict occurs
	return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
}

// Wait until the cluster becomes healthy.
return reconcile.Result{RequeueAfter: 10 * time.Second}, nil

@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from f8b23dd to abecf35 on July 31, 2024
@d-kuro (Contributor, Author) commented Jul 31, 2024

@masa213f
Thank you for the review and the proposed revisions. I have revised the logic based on your suggestions. Please check commit abecf35.

@masa213f self-requested a review on August 1, 2024
@masa213f (Contributor) left a comment

When I updated volumeClaimTemplates, the PartitionReconciler did not work.
I guess we need to add the partition when a StatefulSet is created (see the sketch after the steps below).

  1. Create a MySQLCluster
$ kubectl create ns test

$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF

# wait for the cluster to become healthy.
  2. Watch the following in other terminals.
# other terminals
$ watch kubectl get mysqlcluster,pod -n test --show-labels
$ watch "kubectl get sts -n test moco-test -o json | jq .metadata.generation,.spec.updateStrategy,.status"
  3. Create a DB.
$ kubectl moco mysql -n test test -u moco-writable -- -e "CREATE DATABASE test;"
  4. Crash MySQL Pod 0: keep killing mysqld.
# another terminal
$ watch kubectl exec -n test moco-test-0 -c mysqld -- kill 1
  5. Update the podTemplate and volumeClaimTemplates.
$ kubectl apply -f - << EOF
apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: test
  name: test
spec:
  replicas: 3
  podTemplate:
    metadata:
      labels:
        hoge: piyo # add
    spec:
      containers:
      - name: mysqld
        image: ghcr.io/cybozu-go/moco/mysql:8.0.36.2
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
      labels:
        foo: bar # add
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF
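
A minimal sketch of the "partition on creation" idea, assuming the initial partition is set when the StatefulSet object is first built (the function name is illustrative; ptr is k8s.io/utils/ptr):

// setInitialPartition makes a newly created StatefulSet start fully partitioned,
// so that the partition controller, not kube-controller-manager, drives the rollout.
func setInitialPartition(sts *appsv1.StatefulSet) {
	replicas := int32(1)
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	if sts.Spec.UpdateStrategy.RollingUpdate == nil {
		sts.Spec.UpdateStrategy.RollingUpdate = &appsv1.RollingUpdateStatefulSetStrategy{}
	}
	sts.Spec.UpdateStrategy.RollingUpdate.Partition = ptr.To(replicas)
}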

Review thread: controllers/partition_controller.go (outdated, resolved)
Review thread: controllers/partition_controller.go (outdated, resolved)
@masa213f (Contributor) commented on the new e2e test file (package e2e):
Could you add a test that, while a MySQL Pod is crash looping, the rolling update will not start?
I think we can implement this test with the following steps:

  1. Create a MySQLCluster.
  2. Wait for the MySQLCluster to become Healthy.
  3. Create a DB.
    • e.g. kubectl moco mysql <MySQLCluster> -u moco-writable -- -e "CREATE DATABASE test;"
  4. Keep killing mysql-0 or mysql-1 in a goroutine (see the sketch below).
    • e.g. keep executing kubectl exec moco-<MySQLCluster>-0 -c mysqld -- kill 1
  5. Update the MySQLCluster.

Then, check that the mysql Pods do not start restarting.

Also, I want two test cases: one where the StatefulSet is simply updated, and another where the StatefulSet is re-created.
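
A sketch of step 4 under assumptions: kubectl is the e2e helper already used in this PR, while the stop channel and the 2-second interval are illustrative:

stop := make(chan struct{})
go func() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Ignore errors: the container may be restarting between kills.
			_, _ = kubectl(nil, "exec", "-n", "partition", "moco-test-0", "-c", "mysqld", "--", "kill", "1")
		}
	}
}()
// ...update the MySQLCluster and assert that the Pods do not start restarting...
close(stop)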

@d-kuro force-pushed the d-kuro/partition branch 2 times, most recently from ddae9eb to 8d17448 on August 14, 2024
Review thread: api/v1beta2/statefulset_webhhok_test.go (outdated, resolved)
Review threads (2): controllers/partition_controller.go (outdated, resolved)
@masa213f (Contributor) left a comment

LGTM. Thank you.

@masa213f merged commit 0c4d537 into main on Aug 30, 2024 (18 checks passed)
@masa213f deleted the d-kuro/partition branch on August 30, 2024