feat: enable self monitoring and api access without self-restart #290
Merged: basti1302 merged 1 commit into `main` from `do-not-restart-the-manager-for-self-monitoring-and-api-access` on Mar 7, 2025
Before this change, the operator manager container deliberately restarted itself once when reconciling the operator configuration resource for the first time (or whenever certain settings in that resource changed).

The export endpoint, the API endpoint and the auth token (or alternatively the secret ref) are all required for self monitoring as well as API access. They need to be available at runtime from within the operator manager process.

One reason for the self-restart was that environment variables were used to configure the Go OTel SDK for self monitoring. To hand the export settings found in the operator configuration resource over to the SDK, they were added to the operator manager deployment as environment variables, which triggered a single restart.

Another reason for the self-restart was the support for Kubernetes secrets for the Dash0 authorization token. To resolve the secret ref into an actual token, the secret was added to the operator manager deployment as an environment variable.

This self-restart was problematic for several reasons:

* When the operator configuration resource is deployed automatically via Helm and a user later tries to update or delete it, the following happens: a reconcile for the changed/deleted operator configuration is triggered; this reconcile sets different self-monitoring/API access env vars on the operator manager deployment; the deployment is updated via the K8s client; this leads to a restart of the operator manager process. When starting up again, the operator manager is started with the same command line parameters (the ones determined by the Helm values that were originally used for Helm install), so it recreates the deleted operator configuration resource or overwrites the changed values. In effect, the changes the user made to the resource were ignored entirely.
* The auto-restart leads to longer operator manager startup times. The first start of the operator manager is relatively quick; it then gets a leader election lease and often gets restarted shortly after that. When the changed pod comes up after the auto-restart, the old one is not yet terminated (due to how rolling updates work for K8s deployments), which means that the new pod needs to wait for a long time (often > 30 seconds) until it gets the leader election lease.
* Last but not least, the auto-restart can happen at any time, in the middle of whatever the operator manager is doing at the moment (reconciling custom resources, setting up the OTel collectors, etc.).

This commit solves these problems and removes the self-restart entirely:

* The OTel SDK in the operator manager is now configured in code with values based on the settings in the operator configuration resource.
* The OTel SDK in the operator manager is started/shut down/restarted as required, in particular when the operator configuration resource is reconciled and changes that are relevant for self monitoring are detected.
* If the auth token (be it for self monitoring or for API access) is provided as a reference to a Kubernetes secret, this is resolved via a separate auxiliary process called secret ref resolver, which can be restarted as necessary without any impact on the operator manager.
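The start/shut down/restart behaviour described in the list above can be illustrated with a small sketch. This is not the operator's actual code: `SelfMonitoringSettings`, `telemetrySDK`, `SelfMonitoringManager` and `recordingSDK` are hypothetical names invented for this example, and the real OTel SDK is replaced by a fake that only records calls. It only shows the idea of comparing the desired settings with the currently active ones on each reconcile and acting on the SDK in-process instead of restarting the pod.

```go
package main

import "sync"

// SelfMonitoringSettings is a hypothetical, simplified view of the export
// settings read from the operator configuration resource.
type SelfMonitoringSettings struct {
	Enabled  bool
	Endpoint string
	Token    string
}

// Action names what a reconcile did to the SDK.
type Action string

const (
	ActionNone     Action = "none"
	ActionStart    Action = "start"
	ActionShutdown Action = "shutdown"
	ActionRestart  Action = "restart"
)

// telemetrySDK is a hypothetical stand-in for the code that configures the
// OTel SDK programmatically. A real implementation would likely also take a
// context.Context.
type telemetrySDK interface {
	Start(s SelfMonitoringSettings) error
	Shutdown() error
}

// SelfMonitoringManager remembers the active settings and only touches the
// SDK when a reconcile brings settings that differ from them.
type SelfMonitoringManager struct {
	mu      sync.Mutex
	sdk     telemetrySDK
	current SelfMonitoringSettings
	running bool
}

func NewSelfMonitoringManager(sdk telemetrySDK) *SelfMonitoringManager {
	return &SelfMonitoringManager{sdk: sdk}
}

// Reconcile starts, shuts down or restarts the SDK as needed and reports
// which of these happened.
func (m *SelfMonitoringManager) Reconcile(desired SelfMonitoringSettings) (Action, error) {
	m.mu.Lock()
	defer m.mu.Unlock()

	if !desired.Enabled {
		if !m.running {
			return ActionNone, nil
		}
		if err := m.sdk.Shutdown(); err != nil {
			return ActionNone, err
		}
		m.running, m.current = false, SelfMonitoringSettings{}
		return ActionShutdown, nil
	}
	if m.running && desired == m.current {
		return ActionNone, nil // nothing relevant changed
	}
	action := ActionStart
	if m.running {
		// Relevant settings changed: stop the old SDK before starting a new one.
		if err := m.sdk.Shutdown(); err != nil {
			return ActionNone, err
		}
		m.running, m.current = false, SelfMonitoringSettings{}
		action = ActionRestart
	}
	if err := m.sdk.Start(desired); err != nil {
		return ActionNone, err
	}
	m.running, m.current = true, desired
	return action, nil
}

// recordingSDK is a fake telemetrySDK that only records the calls it receives.
type recordingSDK struct{ calls []string }

func (r *recordingSDK) Start(s SelfMonitoringSettings) error {
	r.calls = append(r.calls, "start:"+s.Endpoint)
	return nil
}

func (r *recordingSDK) Shutdown() error {
	r.calls = append(r.calls, "shutdown")
	return nil
}
```

Calling `Reconcile` repeatedly with unchanged settings leaves the SDK alone, changing the endpoint or token triggers a shutdown followed by a start, and disabling self monitoring triggers a shutdown only. The operator manager process itself keeps running in all cases, so it keeps its leader election lease.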