Skip to content

Commit

Permalink
fix: do not overwrite manually created operator configuration resource
Browse files Browse the repository at this point in the history
...with values provided via Helm.

Previously, when instructed to create/update the operator configuration
resource via Helm, the operator manager would update any operator
configuration resource it finds with the provided value. This would
likely lead to unexpected results if people transition from a manually
configured operator configuration resource to a Helm-managed values.

Now the operator will refuse to overwrite the existing resource in this
scenario and log an error.

Also: Describe the behavior around creating the operator configuration
resource via Helm in more detail in the docs.
  • Loading branch information
basti1302 committed Mar 10, 2025
1 parent 7c5f65b commit e982343
Show file tree
Hide file tree
Showing 7 changed files with 134 additions and 62 deletions.
4 changes: 2 additions & 2 deletions cmd/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -841,7 +841,7 @@ func executeStartupTasks(
isIPv6Cluster bool,
logger *logr.Logger,
) *dash0v1alpha1.Dash0OperatorConfiguration {
operatorConfigurationResource := createAutoOperatorConfigurationResource(
operatorConfigurationResource := createOrUpdateAutoOperatorConfigurationResource(
ctx,
startupTasksK8sClient,
operatorConfigurationValues,
Expand Down Expand Up @@ -883,7 +883,7 @@ func instrumentAtStartup(
startupInstrumenter.InstrumentAtStartup(ctx, startupTasksK8sClient, &setupLog)
}

func createAutoOperatorConfigurationResource(
func createOrUpdateAutoOperatorConfigurationResource(
ctx context.Context,
k8sClient client.Client,
operatorConfigurationValues *startup.OperatorConfigurationValues,
Expand Down
38 changes: 38 additions & 0 deletions helm-chart/dash0-operator/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,11 @@ You can consult the chart's
[values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) file for a
complete list of available configuration settings.

See the section
[Notes on Creating the Operator Configuration Resource Via Helm](#notes-on-creating-the-operator-configuration-resource-via-helm)
for more information on providing Dash0 export settings via Helm and how it affects manual changes to the operator
configuration resource.

Last but not least, you can also install the operator without providing a Dash0 backend configuration:

```console
Expand Down Expand Up @@ -217,6 +222,39 @@ Helm chart auto-create this resource, as explained in the section [Installation]
chart's [values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) file
for a complete list of available configuration settings.

### Notes on Creating the Operator Configuration Resource Via Helm

Providing the backend connection settings to the operator via Helm parameters is a _convenience mechanism_ to get
monitoring started right away when installing the operator.
Setting `operator.dash0Export.enabled` to `true` and providing other necessary `operator.dash0Export.*` values like
`operator.dash0Export.endpoint` will instruct the operator manager to create an operator configuration resource with the
provided values at startup.
This automatically created operator configuration resource will have the name
`dash0-operator-configuration-auto-resource`.

If an operator configuration resource with any other name already exists in the cluster (e.g. a manually created
operator configuration resource), the operator will treat this as an error and refuse to overwrite the existing operator
configuration resource with the values provided via Helm.

If an operator configuration resource with the name `dash0-operator-configuration-auto-resource` already exists in the
cluster (e.g. a previous startup of the operator manager has created the resource), the operator manager will
update/overwrite this resource with the values provided via Helm.

Manual changes to the `dash0-operator-configuration-auto-resource` are permissible for quickly experimenting with
configuration changes, without an operator restart, but you need to be aware that they will be overwritten with the
settings provided via Helm the next time the operator manager pod is restarted.
Possible reasons for a restart of the operator manager pod include upgrading to a new operator version, running
`helm upgrade ... dash0-operator dash0-operator/dash0-operator`, or Kubernetes moving the operator manager pod to a
different node.

For this reason, when using this feature, it is recommended to treat the _Helm values_ as the source of truth for the
operator configuration.
Any changes you want to be permanent should be applied via Helm and the `operator.dash0Export.*` settings.

If you would rather retain manual control over the operator configuration resource, you should omit any
`operator.dash0Export.*` Helm values and create and manage the operator configuration resource manually (that is, via
kubectl, ArgoCD etc.).

### Enable Dash0 Monitoring For a Namespace

*Note:* By default, when enabling Dash0 monitoring for a namespace, all workloads in this namespace will be restarted
Expand Down
13 changes: 4 additions & 9 deletions internal/controller/operator_configuration_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -279,21 +279,16 @@ func (r *OperatorConfigurationReconciler) applyApiAccessSettings(
authToken = operatorConfigurationResource.Spec.Export.Dash0.Authorization.Token
usesSecretRef = operatorConfigurationResource.Spec.Export.Dash0.Authorization.SecretRef != nil
}

if usesSecretRef {
for _, apiClient := range r.ApiClients {
apiClient.RemoveAuthToken(ctx, &logger)
}
} else {
if !usesSecretRef {
if authToken != nil && *authToken != "" {
for _, apiClient := range r.ApiClients {
apiClient.SetAuthToken(ctx, *authToken, &logger)
}
} else {
// If the auth token is provided via a secret ref, we cannot remove it here, as we might accidentally
// delete a token that has been resolved via the secret ref resolver. But if the operator configuration
// resource does not have a secret ref, we can and should remove the auth token to keep the API client's
// state consistent with the operator configuration resource's state.
// delete a token that has been resolved via the secret ref resolver already. But if the operator
// configuration resource does not use a secret ref, we can and should remove the auth token to keep the
// API client's state consistent with the operator configuration resource's state.
logger.Info("The Dash0 auth token required for managing dashboards or check rules via the operator " +
"is missing or has been removed, the operator will not update dashboards nor check rules in Dash0.")
for _, apiClient := range r.ApiClients {
Expand Down
14 changes: 3 additions & 11 deletions internal/controller/operator_configuration_controller_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -253,14 +253,6 @@ var _ = Describe("The operation configuration resource controller", Ordered, fun
}
},

// | operator config resource | expected calls |
// |-------------------------------------------|----------------|
// | no Dash0 export | remove |
// | no API endpoint, token | remove |
// | no API endpoint, secret ref | remove |
// | API endpoint, token | set |
// | API endpoint, secret ref | set |

// | no Dash0 export | remove |
Entry("no API endpoint, no Dash0 export", ApiClientSetRemoveTestConfig{
operatorConfigurationResourceSpec: OperatorConfigurationResourceWithoutExport,
Expand All @@ -283,7 +275,7 @@ var _ = Describe("The operation configuration resource controller", Ordered, fun
expectSetApiEndpointAndDataset: false,
expectRemoveApiEndpointAndDataset: true,
expectSetAuthToken: false,
expectRemoveAuthToken: true,
expectRemoveAuthToken: false,
}),
// | API endpoint, token | set |
Entry("API endpoint, Dash0 export with token", ApiClientSetRemoveTestConfig{
Expand All @@ -299,7 +291,7 @@ var _ = Describe("The operation configuration resource controller", Ordered, fun
expectSetApiEndpointAndDataset: true,
expectRemoveApiEndpointAndDataset: false,
expectSetAuthToken: false,
expectRemoveAuthToken: true,
expectRemoveAuthToken: false,
}),
// | API endpoint, token, custom dataset | set |
Entry("API endpoint, Dash0 export with token, custom dataset", ApiClientSetRemoveTestConfig{
Expand All @@ -317,7 +309,7 @@ var _ = Describe("The operation configuration resource controller", Ordered, fun
expectSetApiEndpointAndDataset: true,
expectRemoveApiEndpointAndDataset: false,
expectSetAuthToken: false,
expectRemoveAuthToken: true,
expectRemoveAuthToken: false,
}),
)
})
Expand Down
51 changes: 16 additions & 35 deletions internal/controller/third_party_crd_common.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,6 @@ package controller

import (
"context"
"errors"
"fmt"
"io"
"maps"
Expand Down Expand Up @@ -152,15 +151,6 @@ type preconditionValidationResult struct {
k8sName string
}

type retryableError struct {
err error
retryable bool
}

func (e *retryableError) Error() string {
return e.err.Error()
}

func SetupThirdPartyCrdReconcilerWithManager(
ctx context.Context,
k8sClient client.Client,
Expand Down Expand Up @@ -778,26 +768,22 @@ func executeSingleHttpRequestWithRetry(
req.Request.URL.String(),
))

return retry.OnError(
wait.Backoff{
Steps: 3,
Duration: resourceReconciler.GetHttpRetryDelay(),
Factor: 1.5,
},
func(err error) bool {
var retryErr *retryableError
if errors.As(err, &retryErr) {
return retryErr.retryable
}
return false
},
return util.RetryWithCustomBackoff(
fmt.Sprintf("http request to %s", req.Request.URL.String()),
func() error {
return executeSingleHttpRequest(
resourceReconciler,
req,
logger,
)
},
wait.Backoff{
Steps: 3,
Duration: resourceReconciler.GetHttpRetryDelay(),
Factor: 1.5,
},
true,
logger,
)
}

Expand All @@ -815,29 +801,24 @@ func executeSingleHttpRequest(
req.ItemName,
req.Request.URL.String(),
))
return &retryableError{
err: err,
retryable: true,
}
return util.NewRetryableErrorWithFlag(err, true)
}

if res.StatusCode < http.StatusOK || res.StatusCode >= http.StatusMultipleChoices {
// HTTP status is not 2xx, treat this as an error
// convertNon2xxStatusCodeToError will also consume and close the response body
statusCodeError := convertNon2xxStatusCodeToError(resourceReconciler, req, res)
retryableStatusCodeError := &retryableError{
err: statusCodeError,
}
err = convertNon2xxStatusCodeToError(resourceReconciler, req, res)
retryableStatusCodeError := util.NewRetryableError(err)

if res.StatusCode >= http.StatusBadRequest && res.StatusCode < http.StatusInternalServerError {
// HTTP 4xx status codes are not retryable
retryableStatusCodeError.retryable = false
logger.Error(statusCodeError, "unexpected status code")
retryableStatusCodeError.SetRetryable(false)
logger.Error(err, "unexpected status code")
return retryableStatusCodeError
} else {
// everything else, in particular HTTP 5xx status codes can be retried
retryableStatusCodeError.retryable = true
logger.Error(statusCodeError, "unexpected status code, request might be retried")
retryableStatusCodeError.SetRetryable(true)
logger.Error(err, "unexpected status code, request might be retried")
return retryableStatusCodeError
}
}
Expand Down
22 changes: 20 additions & 2 deletions internal/startup/auto_operator_configuration_handler.go
Original file line number Diff line number Diff line change
Expand Up @@ -170,7 +170,7 @@ func (r *AutoOperatorConfigurationResourceHandler) createOrUpdateOperatorConfigu
return util.RetryWithCustomBackoff(
"create/update operator configuration resource at startup",
func() error {
return r.createOperatorConfigurationResourceOnce(ctx, operatorConfigurationResource, logger)
return r.createOrUpdateOperatorConfigurationResourceOnce(ctx, operatorConfigurationResource, logger)
},
wait.Backoff{
Duration: 3 * time.Second,
Expand All @@ -182,7 +182,7 @@ func (r *AutoOperatorConfigurationResourceHandler) createOrUpdateOperatorConfigu
)
}

func (r *AutoOperatorConfigurationResourceHandler) createOperatorConfigurationResourceOnce(
func (r *AutoOperatorConfigurationResourceHandler) createOrUpdateOperatorConfigurationResourceOnce(
ctx context.Context,
operatorConfigurationResource *dash0v1alpha1.Dash0OperatorConfiguration,
logger *logr.Logger,
Expand All @@ -191,10 +191,28 @@ func (r *AutoOperatorConfigurationResourceHandler) createOperatorConfigurationRe
if err := r.List(ctx, allOperatorConfigurationResources); err != nil {
return fmt.Errorf("failed to list all Dash0 operator configuration resources: %w", err)
}

if len(allOperatorConfigurationResources.Items) >= 1 {
// The validation webhook for the operator configuration resource guarantees that there is only ever one
// resource per cluster. Thus, we can arbitrarily update the first item in the list.
existingOperatorConfigurationResource := allOperatorConfigurationResources.Items[0]
// If this is a manually created operator configuration resource, we refuse to overwrite it.
if existingOperatorConfigurationResource.Name != operatorConfigurationAutoResourceName {
return util.NewRetryableErrorWithFlag(fmt.Errorf(
"The configuration provided via Helm instructs the operator manager to create an operator "+
"configuration resource at startup, that is, operator.dash0Export.enabled is true and "+
"operator.dash0Export.endpoint has been provided. But there is already an operator configuration "+
"resource in the cluster with the name %s that has not been created by the operator "+
"manager. Replacing a manually created operator configuration resource with values provided via "+
"Helm is not supported. Please either delete the existing operator configuration resource or "+
"change the Helm values to not create an operator configuration resource at startup, "+
"e.g. set operator.dash0Export.enabled to false or remove all operator.dash0Export.* values.",
existingOperatorConfigurationResource.Name),
// do not retry
false,
)
}

existingOperatorConfigurationResource.Spec = operatorConfigurationResource.Spec
if err := r.Update(ctx, &existingOperatorConfigurationResource); err != nil {
return fmt.Errorf("failed to update the Dash0 operator configuration resource: %w", err)
Expand Down
54 changes: 51 additions & 3 deletions internal/util/retry.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
package util

import (
"errors"
"fmt"
"time"

Expand Down Expand Up @@ -68,6 +69,18 @@ func RetryWithCustomBackoff(
return retry.OnError(
backoff,
func(err error) bool {
// Per default, we assume errors are retriable, use RetryableError with retryable=false to stop retrying.
canBeRetried := true

var retryErr *RetryableError
if errors.As(err, &retryErr) {
canBeRetried = retryErr.retryable
}

if !canBeRetried {
return canBeRetried
}

attempt += 1
if attempt < backoff.Steps {
if logAttempts {
Expand All @@ -79,12 +92,47 @@ func RetryWithCustomBackoff(
logger.Error(err,
fmt.Sprintf(
"%s failed after attempt %d/%d, no more retries left.", operationLabel, attempt, backoff.Steps))
// Note: retry.OnError stops retrying correctly after the specified number of Steps, returning this
// most recent error, no matter whether we return true or false. The bool return value is only used to
// inspect errors and decide whether to they even can be retried or not.
}

// Note: retry.OnError stops retrying correctly after the specified number of Steps, returning this
// most recent error, no matter whether we return true or false. The bool return value is only used to
// inspect errors and decide whether they even can be retried or not.
return true
},
operation,
)
}

type RetryableError struct {
err error
retryable bool
}

func NewRetryableError(err error) *RetryableError {
return &RetryableError{err: err, retryable: false}
}

func NewRetryableErrorWithFlag(err error, retryable bool) *RetryableError {
return &RetryableError{err: err, retryable: retryable}
}

func (e *RetryableError) Error() string {
if e == nil || e.err == nil {
return "unknown retryable error"
}
return e.err.Error()
}

func (e *RetryableError) SetRetryable(retryable bool) {
if e == nil {
return
}
e.retryable = retryable
}

func (e *RetryableError) IsRetryable() bool {
if e == nil {
return false
}
return e.retryable
}

0 comments on commit e982343

Please sign in to comment.