Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix previously unset "Ready" conditions #311

Merged
merged 1 commit into from
Jun 11, 2024

Conversation

weinimo
Copy link
Collaborator

@weinimo weinimo commented May 28, 2024

... of types:

  • RabbitMqTransportURLReady
  • OctaviaHealthManagerReady
  • OctaviaHousekeepingReady
  • OctaviaWorkerReady

Jira: OSPRH-6678

Copy link
Contributor

openshift-ci bot commented May 28, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@weinimo
Copy link
Collaborator Author

weinimo commented May 28, 2024

/test all

@weinimo
Copy link
Collaborator Author

weinimo commented Jun 3, 2024

/retest

@weinimo weinimo marked this pull request as ready for review June 4, 2024 13:11
@openshift-ci openshift-ci bot requested review from beagles and slawqo June 4, 2024 13:11
@weinimo
Copy link
Collaborator Author

weinimo commented Jun 5, 2024

/hold

@fmount since you did the initial work on conditions in #290 and #291 I want to check with you about this. Do you think this is the right approach for setting those conditions to ready?

I saw your discussion with @gibizer in https://github.com/openstack-k8s-operators/octavia-operator/pull/291/files#r1557376428 about mirroring, and I am not sure I understand how it is supposed to work.

That said, this patch does not seem to be complete yet. In my install_yamls environment I still see some of the conditions for Octavia unset:

  Conditions:
    Last Transition Time:  2024-06-04T13:27:39Z
    Message:               Setup started
    Reason:                Init
    Status:                Unknown
    Type:                  Ready
--
    Last Transition Time:  2024-06-04T13:26:30Z
    Message:               Deployment not started
    Reason:                Init
    Status:                Unknown
    Type:                  DeploymentReady
--
    Last Transition Time:  2024-06-04T13:26:30Z
    Message:               Exposing service not started
    Reason:                Init
    Status:                Unknown
    Type:                  ExposeServiceReady

I will need to investigate that further.

@gthiemonge
Copy link
Contributor

@weinimo FYI "Deployment" and "Exposing service" conditions are only used when uploading an amphora image.

@fmount
Copy link
Contributor

fmount commented Jun 10, 2024

/hold

@fmount since you did the initial work on conditions in #290 and #291 I want to check with you about this. Do you think this is the right approach for setting those conditions to ready?

I saw your discussion with @gibizer in https://github.com/openstack-k8s-operators/octavia-operator/pull/291/files#r1557376428 about mirroring, and I am not sure I understand how it is supposed to work.

That said, this patch does not seem to be complete yet. In my install_yamls environment I still see some of the conditions for Octavia unset:

  Conditions:
    Last Transition Time:  2024-06-04T13:27:39Z
    Message:               Setup started
    Reason:                Init
    Status:                Unknown
    Type:                  Ready
--
    Last Transition Time:  2024-06-04T13:26:30Z
    Message:               Deployment not started
    Reason:                Init
    Status:                Unknown
    Type:                  DeploymentReady
--
    Last Transition Time:  2024-06-04T13:26:30Z
    Message:               Exposing service not started
    Reason:                Init
    Status:                Unknown
    Type:                  ExposeServiceReady

I will need to investigate that further.

/hold

@fmount since you did the initial work on conditions in #290 and #291 I want to check with you about this. Do you think this is the right approach for setting those conditions to ready?

I saw your discussion with @gibizer in https://github.com/openstack-k8s-operators/octavia-operator/pull/291/files#r1557376428 about mirroring, and I am not sure I understand how it is supposed to work.

That said, this patch does not seem to be complete yet. In my install_yamls environment I still see some of the conditions for Octavia unset:

  Conditions:
    Last Transition Time:  2024-06-04T13:27:39Z
    Message:               Setup started
    Reason:                Init
    Status:                Unknown
    Type:                  Ready
--
    Last Transition Time:  2024-06-04T13:26:30Z
    Message:               Deployment not started
    Reason:                Init
    Status:                Unknown
    Type:                  DeploymentReady
--
    Last Transition Time:  2024-06-04T13:26:30Z
    Message:               Exposing service not started
    Reason:                Init
    Status:                Unknown
    Type:                  ExposeServiceReady

I will need to investigate that further.

I did a few investigation on the current code and I think the problem is not related to the observedGeneration pattern that is reduced to a new reconciliation triggered when the CR is not the most recent one.
In the current code we see that ReadyCondition is still Unknown: as per the recent changes, the ReadyCondition is marked True only when all the SubConditions are verified.
For this reason, having both DeploymentReady and ExposeServiceReady still not True let the Octavia controller to mark tha main condition as False, or better, Unknown, because it didn't detect a fatal error in the operator that requires an external intervention (e.g. the human operator that needs to make a change in the CR).
Let's analyze the two conditions that are not True.

ExposeServiceReady

Both are part of the same function, reconcileAmphoraImages, which means that if something goes wrong there (a reconciliation is triggered because of the Requeue), the ExposeServiceReady condition got reset, and it's never marked True unless it goes to the same function.
In theory, if reconcileAmphoraImages works without returning an err, then we can mark ExposeServiceReady as True, because we are 100% sure the service has been created, and the controller can move to the next function of the main reconciliation.

DeploymentReady

As per the current code, this condition is useless: a deployment is created and deleted within the same function (unless I'm overlooking something relevant), but in general, we already have a octaviav1.OctaviaAmphoraImagesReadyCondition that is marked True if reconcileAmphoraImages ends without errors.
IMO we can simply remove DeploymentCondition, and mark octaviav1.OctaviaAmphoraImagesReadyCondition as False or Unknown when something goes wrong.
For example:

+       depl := deployment.NewDeployment(
+               octavia.ImageUploadDeployment(instance, serviceLabels),
+               time.Duration(5)*time.Second,
+       )
+       ctrlResult, err = depl.CreateOrPatch(ctx, helper)
+       if err != nil {
+               instance.Status.Conditions.Set(condition.FalseCondition(
+                       octaviav1.OctaviaAmphoraImagesReadyCondition,
+                       condition.ErrorReason,
+                       condition.SeverityWarning,
+                       octaviav1.OctaviaAmphoraImagesReadyErrorMessage,
+                       err.Error()))
+               return ctrlResult, err
+       } else if (ctrlResult != ctrl.Result{}) {
+               instance.Status.Conditions.Set(condition.FalseCondition(
+                       octaviav1.OctaviaAmphoraImagesReadyCondition,
+                       condition.ErrorReason,
+                       condition.SeverityWarning,
+                       octaviav1.OctaviaAmphoraImagesReadyErrorMessage,
+                       err.Error()))
+               return ctrlResult, nil
+       }

If everything goes well, when this function returns everything is marked True and mirrored to the main condition.
I captured all the above in the following patch [1], and I can see a Ready octavia service:

$ oc get octavia
NAME      STATUS   MESSAGE
octavia   True     Setup complete

$ oc get octaviaapi
NAME          NETWORKATTACHMENTS   STATUS   MESSAGE
octavia-api                        True     Setup complete

$ oc get octaviaamphoracontroller
NAME                    NETWORKATTACHMENTS                                                            STATUS   MESSAGE
octavia-healthmanager   {"openstack/octavia":["172.23.0.30"],"ovn-kubernetes":["10.217.1.204"]}       True     Setup complete
octavia-housekeeping    {"openstack/internalapi":["172.17.0.32"],"ovn-kubernetes":["10.217.1.205"]}   True     Setup complete
octavia-worker          {"openstack/octavia":["172.23.0.32"],"ovn-kubernetes":["10.217.1.206"]}       True     Setup complete

We can dig more into the above, but [1] should cover what we need to unblock the conditions related issues.
As a follow up, I think we should make the controller code more modular, and not use the same function to compute conditions there can see failures for unrelated reasons.

[1] https://paste.opendev.org/show/bjIB5FPgNxrf9SaPV441/

... of types:

    RabbitMqTransportURLReady
    OctaviaHealthManagerReady
    OctaviaHousekeepingReady
    OctaviaWorkerReady

Jira: OSPRH-6678
@weinimo
Copy link
Collaborator Author

weinimo commented Jun 11, 2024

/hold cancel

Thanks a lot for your analysis and explanation @fmount. It makes sense to me. I tested your proposed change (combined with the Rabbit MQ transport URL conditions fix) and it seems to resolve all conditions issues. Yay.

Copy link
Contributor

@fmount fmount left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fmount, weinimo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 3b6b876 into main Jun 11, 2024
8 checks passed
@openshift-merge-bot openshift-merge-bot bot deleted the fix-ready-conditions branch June 11, 2024 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants