feat: enable self monitoring and api access without self-restart #290

basti1302 · 2025-02-21T08:06:01Z

Before this change, the operator manager container deliberately
restarted itself once when it reconciled the operator configuration
resource for the first time (and again whenever certain settings in that
resource changed).

The export endpoint, the API endpoint and the auth token (or
alternatively the secret ref) are all required for self monitoring as
well as API access. They need to be available at runtime from within the
operator manager process.

One reason for this self-restart was that environment variables were
used to configure the Golang OTel SDK for self monitoring. Hence, the
export settings found in the operator configuration resource were added
to the operator manager deployment as environment variables, which
triggered a single restart.
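
To illustrate why env-var-based configuration forced a restart, here is a minimal sketch. It assumes the standard OTel variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` were the ones involved (the description does not name them); `exportSettings`, `settingsFromEnv` and `requiresRestart` are hypothetical names, not taken from the operator code base.

```go
package main

import (
	"fmt"
	"os"
)

// exportSettings is a hypothetical snapshot of the export configuration.
type exportSettings struct {
	Endpoint string
	Headers  string
}

// settingsFromEnv reads the settings through the given lookup function.
// A process can only see the environment it was started with, so this
// snapshot never reflects later changes to the deployment spec.
func settingsFromEnv(getenv func(string) string) exportSettings {
	return exportSettings{
		Endpoint: getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		Headers:  getenv("OTEL_EXPORTER_OTLP_HEADERS"),
	}
}

// requiresRestart reports whether the desired settings differ from the
// ones captured at startup. With env-var-based configuration, any
// difference can only be applied by starting a new process.
func requiresRestart(atStartup, desired exportSettings) bool {
	return atStartup != desired
}

func main() {
	atStartup := settingsFromEnv(os.Getenv)
	desired := exportSettings{Endpoint: "https://ingress.example.invalid:4317"}
	fmt.Println("restart required:", requiresRestart(atStartup, desired))
}
```

Passing the lookup function as a parameter is only a convenience for testing; the point of the sketch is that the snapshot is fixed once the process is running.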

Another reason for the self-restart was the support for Kubernetes
secrets for the Dash0 authorization token. To resolve the secret ref
into an actual token, the secret was added to the operator manager
deployment as an environment variable.

This self-restart was problematic for several reasons:

* When the operator configuration resource is deployed automatically
  via Helm, and a user later tries to update or delete it, the following
  happens: A reconcile for the changed/deleted operator configuration is
  triggered. This reconcile sets different self-monitoring/API access
  env vars on the operator manager deployment, and the deployment is
  updated via the K8s client, which restarts the operator manager
  process. When starting up again, the operator manager is started with
  the same command line parameters (the ones determined by the Helm
  values that were originally used for the Helm install), so it
  recreates the deleted operator configuration resource or overwrites
  the changed values. This effectively led to the user's changes to the
  resource being ignored entirely.
* The auto-restart leads to longer operator manager startup times. The
  first start of the operator manager is relatively quick; it then gets
  a leader election lease and often gets restarted shortly after that.
  When the new pod comes up after the auto-restart, the old one is
  not yet terminated (due to how rolling updates work for K8s
  deployments), which means that the new pod needs to wait for a long
  time (often > 30 seconds) until it gets the leader election lease.
* Last but not least, the auto-restart can happen at any time, in the
  middle of whatever the operator manager is doing at the moment:
  reconciling custom resources, setting up the OTel collectors etc.

This commit solves these problems and removes the self-restart entirely:

* The OTel SDK in the operator manager is now configured in code with
  values based on the settings in the operator configuration resource.
* The OTel SDK in the operator manager is started/shut down/restarted
  as required, in particular when the operator configuration resource is
  reconciled and changes that are relevant for self monitoring are
  detected.
* If the auth token (be it for self monitoring or for API access) is
  provided as a reference to a Kubernetes secret, this is resolved via a
  separate auxiliary process called secret ref resolver, which can be
  restarted as necessary without any impact on the operator manager.


sonarqubecloud · 2025-03-07T15:38:50Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

basti1302 changed the title ~~Do not restart the manager for self monitoring and api access~~ feat: enable self monitoring and api access without self-restart Feb 21, 2025

github-advanced-security bot found potential problems Feb 21, 2025

helm-chart/dash0-operator/templates/operator/deployment-and-webhooks.yaml Dismissed

basti1302 force-pushed the do-not-restart-the-manager-for-self-monitoring-and-api-access branch from 546d745 to c5a72f3 Compare February 21, 2025 11:37

basti1302 force-pushed the do-not-restart-the-manager-for-self-monitoring-and-api-access branch 7 times, most recently from 9a5ff03 to 22b9c08 Compare March 7, 2025 15:10

basti1302 force-pushed the do-not-restart-the-manager-for-self-monitoring-and-api-access branch from 22b9c08 to 073295d Compare March 7, 2025 15:38

basti1302 marked this pull request as ready for review March 7, 2025 15:38

basti1302 merged commit 7e06662 into main Mar 7, 2025
11 checks passed

basti1302 deleted the do-not-restart-the-manager-for-self-monitoring-and-api-access branch March 7, 2025 16:02
