feat: enable self monitoring and api access without self-restart #290

@basti1302 basti1302 commented Feb 21, 2025

Before this change, the operator manager container deliberately
restarted itself once when reconciling the operator configuration
resource for the first time (or whenever certain settings in that
resource were changed).

The export endpoint, the API endpoint and the auth token (or
alternatively the secret ref) are all required for self monitoring as
well as API access. They need to be available at runtime from within the
operator manager process.

One reason for this self-restart was that environment variables were
used to configure the Go OTel SDK for self monitoring. Hence, the export
settings found in the operator configuration resource were added to the
operator manager deployment as environment variables, which triggered a
single restart.

Another reason for the self-restart was the support for Kubernetes
secrets for the Dash0 authorization token. To resolve the secret ref
into an actual token, the secret was added to the operator manager
deployment as an environment variable.

This self-restart was problematic for a couple of reasons:

  • When the operator configuration resource is deployed automatically
    via Helm and a user later updates or deletes it, the change is
    effectively ignored. The sequence is as follows: a reconcile for the
    changed/deleted operator configuration is triggered; this reconcile
    sets different self-monitoring/API access env vars on the operator
    manager deployment; the deployment is updated via the K8s client,
    which restarts the operator manager process; on startup, the
    operator manager receives the same command line parameters as before
    (the ones determined by the Helm values originally used for Helm
    install) and therefore recreates the deleted operator configuration
    resource or overwrites the changed values.
  • The auto-restart leads to longer operator manager startup times. The
    first start of the operator manager is relatively quick; it then
    acquires the leader election lease and is often restarted shortly
    after that. When the new pod comes up after the auto-restart, the old
    one is not yet terminated (due to how rolling updates work for K8s
    deployments), which means that the new pod needs to wait for a long
    time (often > 30 seconds) until it gets the leader election lease.
  • Last but not least, the auto-restart can happen at any time, in the
    middle of whatever the operator manager is doing at the moment, for
    example reconciling custom resources or setting up the OTel collectors.

This commit solves these problems and removes the self-restart entirely:

  • The OTel SDK in the operator manager is now configured in code with
    values based on the settings in the operator configuration resource.
  • The OTel SDK in the operator manager is started/shut down/restarted
    as required, in particular when the operator configuration resource is
    reconciled and changes that are relevant for self monitoring are
    detected.
  • If the auth token (be it for self monitoring or for API access) is
    provided as a reference to a Kubernetes secret, the reference is
    resolved by a separate auxiliary process called the secret ref
    resolver, which can be restarted as necessary without any impact on
    the operator manager.

@basti1302 basti1302 changed the title Do not restart the manager for self monitoring and api access feat: enable self monitoring and api access without self-restart Feb 21, 2025
@basti1302 basti1302 force-pushed the do-not-restart-the-manager-for-self-monitoring-and-api-access branch from 546d745 to c5a72f3 Compare February 21, 2025 11:37
@basti1302 basti1302 force-pushed the do-not-restart-the-manager-for-self-monitoring-and-api-access branch 7 times, most recently from 9a5ff03 to 22b9c08 Compare March 7, 2025 15:10
@basti1302 basti1302 force-pushed the do-not-restart-the-manager-for-self-monitoring-and-api-access branch from 22b9c08 to 073295d Compare March 7, 2025 15:38
@basti1302 basti1302 marked this pull request as ready for review March 7, 2025 15:38
sonarqubecloud bot commented Mar 7, 2025

@basti1302 basti1302 merged commit 7e06662 into main Mar 7, 2025
11 checks passed
@basti1302 basti1302 deleted the do-not-restart-the-manager-for-self-monitoring-and-api-access branch March 7, 2025 16:02