feat: enable self monitoring and api access without self-restart

Before this change, the operator manager container deliberately restarted itself once when reconciling the operator configuration resource for the first time (or whenever certain settings in that resource were changed).

The export endpoint, the API endpoint and the auth token (or alternatively the secret ref) are all required for self monitoring as well as API access. They need to be available at runtime from within the operator manager process.

One reason for this self-restart was that environment variables were used to configure the Golang OTel SDK for self monitoring. Hence, the export settings found in the operator configuration resource were added to the operator manager deployment as environment variables, which triggered a single restart.

Another reason for the self-restart was the support for Kubernetes secrets for the Dash0 authorization token. To resolve the secret ref into an actual token, the secret was added to the operator manager deployment as an environment variable.

This self-restart was problematic for a couple of reasons:

* When the operator configuration resource is deployed automatically via Helm, and a user later tries to update or delete it, the following happens: A reconcile for the changed/deleted operator configuration is triggered. This reconcile sets different self-monitoring/API access env vars on the operator manager deployment, and the deployment is updated via the K8s client. This leads to a restart of the operator manager process. When starting up again, the operator manager is started with the same command line parameters (the ones determined by the Helm values that were originally used for Helm install), which recreates the deleted operator configuration resource or overwrites the changed values. This effectively led to the user's changes to the resource being ignored entirely.
* The auto-restart leads to longer operator manager startup times. The first start of the operator manager is relatively quick; it then gets a leader election lease and often gets restarted shortly after that. When the changed pod comes up after the auto-restart, the old one is not yet terminated (due to how rolling updates work for K8s deployments), which means that the new pod needs to wait for a long time (often > 30 seconds) until it gets the leader election lease.
* Last but not least, the auto-restart can happen at any time, in the middle of whatever the operator manager is doing at the moment (reconciling custom resources, setting up the OTel collectors, etc.).

This commit solves these problems and removes the self-restart entirely:

* The OTel SDK in the operator manager is now configured in code with values based on the settings in the operator configuration resource.
* The OTel SDK in the operator manager is started/shut down/restarted as required, in particular when the operator configuration resource is reconciled and changes that are relevant for self monitoring are detected.
* If the auth token (be it for self monitoring or for API access) is provided as a reference to a Kubernetes secret, it is resolved via a separate auxiliary process called secret ref resolver, which can be restarted as necessary without any impact on the operator manager.
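
The start/shut down/restart behavior described above can be illustrated with a minimal sketch. This is not the code from this commit: the names `selfMonitoringSettings`, `telemetrySDK`, `selfMonitoringManager`, `Reconcile` and the example endpoints are all hypothetical, and a fake SDK stands in for the real OTel SDK so that the example only needs the standard library. It shows the idea of comparing the desired settings against the running ones and restarting only the in-process SDK instead of the whole operator manager.

```go
package main

import (
	"context"
	"fmt"
)

// selfMonitoringSettings is a hypothetical, simplified view of the values the
// operator configuration resource provides for self monitoring.
type selfMonitoringSettings struct {
	Enabled  bool
	Endpoint string
	Token    string
}

// telemetrySDK is a hypothetical stand-in for whatever the real OTel SDK setup returns.
type telemetrySDK interface {
	Shutdown(ctx context.Context) error
}

// startFunc creates and starts an SDK instance for the given settings.
type startFunc func(ctx context.Context, s selfMonitoringSettings) (telemetrySDK, error)

// selfMonitoringManager owns the in-process SDK. It is not goroutine-safe;
// a real implementation would need to guard its state with a mutex.
type selfMonitoringManager struct {
	start   startFunc
	current selfMonitoringSettings
	sdk     telemetrySDK
}

// Reconcile brings the SDK in line with the desired settings and reports
// what it did: "unchanged", "started", "restarted" or "stopped".
func (m *selfMonitoringManager) Reconcile(ctx context.Context, desired selfMonitoringSettings) (string, error) {
	if !desired.Enabled {
		desired = selfMonitoringSettings{} // endpoint/token are irrelevant when disabled
	}
	running := m.sdk != nil
	if desired == m.current && running == desired.Enabled {
		return "unchanged", nil
	}
	if running {
		if err := m.sdk.Shutdown(ctx); err != nil {
			return "", fmt.Errorf("shutting down telemetry SDK: %w", err)
		}
		m.sdk = nil
		m.current = selfMonitoringSettings{}
	}
	if !desired.Enabled {
		return "stopped", nil
	}
	sdk, err := m.start(ctx, desired)
	if err != nil {
		return "", fmt.Errorf("starting telemetry SDK: %w", err)
	}
	m.sdk, m.current = sdk, desired
	if running {
		return "restarted", nil
	}
	return "started", nil
}

// fakeSDK does nothing; it only exists to make the sketch runnable.
type fakeSDK struct{}

func (fakeSDK) Shutdown(context.Context) error { return nil }

func newFakeSDK(context.Context, selfMonitoringSettings) (telemetrySDK, error) {
	return fakeSDK{}, nil
}

// demoActions feeds a sequence of settings through the manager, as repeated
// reconciles of the operator configuration resource would.
func demoActions() ([]string, error) {
	ctx := context.Background()
	m := &selfMonitoringManager{start: newFakeSDK}
	a := selfMonitoringSettings{Enabled: true, Endpoint: "https://a.example.com:4317", Token: "t1"}
	b := selfMonitoringSettings{Enabled: true, Endpoint: "https://b.example.com:4317", Token: "t1"}
	var actions []string
	for _, s := range []selfMonitoringSettings{{}, a, a, b, {}} {
		action, err := m.Reconcile(ctx, s)
		if err != nil {
			return nil, err
		}
		actions = append(actions, action)
	}
	return actions, nil
}

func main() {
	actions, err := demoActions()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(actions)
}
```

In this sketch a settings change only replaces the SDK instance, so the operator manager process keeps running and keeps its leader election lease.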
dash0hq · Mar 7, 2025 · 7e06662
1 parent d1593ff
Showing 60 changed files with 3,705 additions and 2,671 deletions.
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
@@ -14,6 +14,7 @@ updates:
       - "/images/configreloader/src"
       - "/images/filelogoffsetsynch/src"
       - "/images/pkg/common"
+      - "/images/secretrefresolver"
     schedule:
       interval: "weekly"
       day: "tuesday"

diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -152,6 +152,17 @@ jobs:
           imageUrl: https://github.com/dash0hq/dash0-operator/tree/main/images/instrumentation
           context: images/instrumentation
 
+      - name: build secret ref resolver image
+        uses: ./.github/actions/build-image
+        with:
+          githubToken: ${{ secrets.GITHUB_TOKEN }}
+          imageName: secret-ref-resolver
+          imageTitle: Dash0 Kubernetes Secret Ref Resolver
+          imageDescription: the secret ref resolver for the Dash0 operator for Kubernetes
+          imageUrl: https://github.com/dash0hq/dash0-operator/tree/main/images/secretrefresolver
+          context: images
+          file: images/secretrefresolver/Dockerfile
+
       - name: build collector image
         uses: ./.github/actions/build-image
         with:

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -37,7 +37,10 @@ After that, you can deploy the operator to your cluster:
     CONFIGURATION_RELOADER_IMG_PULL_POLICY=""
     FILELOG_OFFSET_SYNCH_IMG_REPOSITORY=ghcr.io/dash0hq/filelog-offset-synch \
     FILELOG_OFFSET_SYNCH_IMG_TAG=main-dev \
-    FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY=""
+    FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY="" \
+    SECRET_REF_RESOLVER_IMG_REPOSITORY=ghcr.io/dash0hq/secret-ref-resolver \
+    SECRET_REF_RESOLVER_IMG_TAG=main-dev \
+    SECRET_REF_RESOLVER_IMG_PULL_POLICY=""
   ```
 * The custom resource definition will automatically be installed when deploying the operator. However, you can also do
   that separately via kustomize if required via `make install`.
@@ -90,6 +93,8 @@ CONTROLLER_IMG_REPOSITORY=ghcr.io/dash0hq/operator-controller \
   CONFIGURATION_RELOADER_IMG_TAG=main-dev \
   FILELOG_OFFSET_SYNCH_IMG_REPOSITORY=ghcr.io/dash0hq/filelog-offset-synch \
   FILELOG_OFFSET_SYNCH_IMG_TAG=main-dev \
+  SECRET_REF_RESOLVER_IMG_REPOSITORY=ghcr.io/dash0hq/secret-ref-resolver \
+  SECRET_REF_RESOLVER_IMG_TAG=main-dev \
   make test-e2e
 ```
 
@@ -217,6 +222,8 @@ If you want to report telemetry to a Dash0 backend, set `DASH0_AUTHORIZATION_TOK
     * `OPERATOR_CONFIGURATION_VIA_HELM_SELF_MONITORING_ENABLED`: Set this to false to set the respective Helm value to
       false, disabling self-monitoring.
       This defaults to "true".
+    * `OPERATOR_CONFIGURATION_VIA_HELM_USE_TOKEN`: Set this to true to use an auth token
+      (`DASH0_AUTHORIZATION_TOKEN`) in the operator configuration resource instead of a secret ref.
     * `OPERATOR_HELM_CHART_VERSION`: Set this to use a specific version of the Helm chart. This is meant to be used
       together with `OPERATOR_HELM_CHART=dash0-operator/dash0-operator` or similar, where `OPERATOR_HELM_CHART` refers
       to an already installed remote Helm repository (e.g. https://dash0hq.github.io/dash0-operator) that contains the
@@ -268,6 +275,9 @@ If you want to report telemetry to a Dash0 backend, set `DASH0_AUTHORIZATION_TOK
     * `FILELOG_OFFSET_SYNCH_IMG_TAG`
     * `FILELOG_OFFSET_SYNCH_IMG_DIGEST`
     * `FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY`
+    * `SECRET_REF_RESOLVER_IMG_REPOSITORY`
+    * `SECRET_REF_RESOLVER_IMG_TAG`
+    * `SECRET_REF_RESOLVER_IMG_PULL_POLICY`
 * To run the scenario with the images that have been built from the main branch and pushed to ghcr.io most recently:
     ```
     CONTROLLER_IMG_REPOSITORY=ghcr.io/dash0hq/operator-controller \
@@ -280,6 +290,8 @@ If you want to report telemetry to a Dash0 backend, set `DASH0_AUTHORIZATION_TOK
       CONFIGURATION_RELOADER_IMG_TAG=main-dev \
       FILELOG_OFFSET_SYNCH_IMG_REPOSITORY=ghcr.io/dash0hq/filelog-offset-synch \
       FILELOG_OFFSET_SYNCH_IMG_TAG=main-dev \
+      SECRET_REF_RESOLVER_IMG_REPOSITORY=ghcr.io/dash0hq/secret-ref-resolver \
+      SECRET_REF_RESOLVER_IMG_TAG=main-dev \
       test-resources/bin/test-scenario-01-aum-operator-cr.sh
     ```
    * To run the scenario with the helm chart from the official remote repository and the default images referenced in

diff --git a/Makefile b/Makefile
@@ -81,6 +81,11 @@ FILELOG_OFFSET_SYNCH_IMG_TAG ?= latest
 FILELOG_OFFSET_SYNCH_IMG ?= $(FILELOG_OFFSET_SYNCH_IMG_REPOSITORY):$(FILELOG_OFFSET_SYNCH_IMG_TAG)
 FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY ?= Never
 
+SECRET_REF_RESOLVER_IMG_REPOSITORY ?= secret-ref-resolver
+SECRET_REF_RESOLVER_IMG_TAG ?= latest
+SECRET_REF_RESOLVER_IMG ?= $(SECRET_REF_RESOLVER_IMG_REPOSITORY):$(SECRET_REF_RESOLVER_IMG_TAG)
+SECRET_REF_RESOLVER_IMG_PULL_POLICY ?= Never
+
 # ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
 ENVTEST_K8S_VERSION = 1.28.3
 
@@ -254,6 +259,7 @@ run: manifests generate fmt vet ## Run a controller from your host.
 .PHONY: docker-build
 docker-build: \
   docker-build-controller \
+  docker-build-secret-ref-resolver \
   docker-build-instrumentation \
   docker-build-collector \
   docker-build-config-reloader \
@@ -300,6 +306,10 @@ docker-build-config-reloader: ## Build the config reloader container image.
 docker-build-filelog-offset-synch: ## Build the filelog offset synch container image.
 	@$(call build_container_image,$(FILELOG_OFFSET_SYNCH_IMG_REPOSITORY),$(FILELOG_OFFSET_SYNCH_IMG_TAG),images,images/filelogoffsetsynch/Dockerfile)
 
+.PHONY: docker-build-secret-ref-resolver
+docker-build-secret-ref-resolver: ## Build the secret ref resolver container image.
+	@$(call build_container_image,$(SECRET_REF_RESOLVER_IMG_REPOSITORY),$(SECRET_REF_RESOLVER_IMG_TAG),images,images/secretrefresolver/Dockerfile)
+
 ifndef ignore-not-found
   ignore-not-found = false
 endif
@@ -333,6 +343,9 @@ deploy-via-helm: ## Deploy the controller via helm to the K8s cluster specified
 		--set operator.filelogOffsetSynchImage.repository=$(FILELOG_OFFSET_SYNCH_IMG_REPOSITORY) \
 		--set operator.filelogOffsetSynchImage.tag=$(FILELOG_OFFSET_SYNCH_IMG_TAG) \
 		--set operator.filelogOffsetSynchImage.pullPolicy=$(FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY) \
+		--set operator.secretRefResolverImage.repository=$(SECRET_REF_RESOLVER_IMG_REPOSITORY) \
+		--set operator.secretRefResolverImage.tag=$(SECRET_REF_RESOLVER_IMG_TAG) \
+		--set operator.secretRefResolverImage.pullPolicy=$(SECRET_REF_RESOLVER_IMG_PULL_POLICY) \
 		--set operator.developmentMode=true \
 		dash0-operator \
 		$(OPERATOR_HELM_CHART)