feat: enable self monitoring and api access without self-restart
Before this change, the operator manager container deliberately restarted itself once when reconciling the operator configuration resource for the first time (or whenever certain settings in that resource changed).

The export endpoint, the API endpoint and the auth token (or alternatively the secret ref) are all required for self monitoring as well as API access. They need to be available at runtime from within the operator manager process.

One reason for this self-restart was that environment variables were used to configure the Golang OTel SDK for self monitoring. Hence, the export settings found in the operator configuration resource were added to the operator manager deployment as environment variables, which triggered a single restart.

Another reason for the self-restart was the support for Kubernetes secrets for the Dash0 authorization token. To resolve the secret ref into an actual token, the secret was added to the operator manager deployment as an environment variable.

This self-restart was problematic for several reasons:

* When the operator configuration resource is deployed automatically via Helm, and a user later tries to update or delete it, the following happens:
  * A reconcile for the changed/deleted operator configuration is triggered.
  * This reconcile sets different self-monitoring/API access env vars on the operator manager deployment.
  * The deployment is updated via the K8s client, which leads to a restart of the operator manager process.
  * When starting up again, the operator manager is started with the same command line parameters (the ones determined by the Helm values that were originally used for Helm install).
  * This recreates the deleted operator configuration resource or overwrites the changed values.

  In effect, the changes the user made to the resource were ignored entirely.
* The auto-restart leads to longer operator manager startup times. The first start of the operator manager is relatively quick; it then gets a leader election lease and is often restarted shortly after that. When the changed pod comes up after the auto-restart, the old one is not yet terminated (due to how rolling updates work for K8s deployments), which means that the new pod needs to wait for a long time (often > 30 seconds) until it gets the leader election lease.
* Last but not least, the auto-restart can happen at any time, in the middle of whatever the operator manager is doing at the moment: reconciling custom resources, setting up the OTel collectors, etc.

This commit solves these problems and removes the self-restart entirely:

* The OTel SDK in the operator manager is now configured in code with values based on the settings in the operator configuration resource.
* The OTel SDK in the operator manager is started/shut down/restarted as required, in particular when the operator configuration resource is reconciled and changes that are relevant for self monitoring are detected.
* If the auth token (be it for self monitoring or for API access) is provided as a reference to a Kubernetes secret, this is resolved via a separate auxiliary process called secret ref resolver, which can be restarted as necessary without any impact on the operator manager.