feat: enable self monitoring and api access without self-restart
Before this change, the operator manager container deliberately
restarted itself once, when reconciling the operator configuration
resource for the first time (or when certain settings in that resource
were changed).

The export endpoint, the API endpoint and the auth token (or
alternatively the secret ref) are all required for self monitoring as
well as API access. They need to be available at runtime from within the
operator manager process.

One reason for this self-restart was that environment variables were
used to configure the Golang OTel SDK for self monitoring. Hence, the
export settings found in the operator configuration resource were
added to the operator manager deployment as environment variables,
which triggered a single restart.

Another reason for the self-restart was the support for Kubernetes
secrets for the Dash0 authorization token. To resolve the secret ref
into an actual token, the secret was added to the operator manager
deployment as an environment variable.

This self-restart was problematic for a couple of reasons:
* When the operator configuration resource is deployed automatically
  via Helm and a user later tries to update or delete it, the following
  happens: A reconcile for the changed/deleted operator configuration
  resource is triggered. This reconcile sets different
  self-monitoring/API access env vars on the operator manager
  deployment, and the deployment is updated via the K8s client, which
  restarts the operator manager process. On startup, the operator
  manager runs with the same command line parameters as before (the
  ones determined by the Helm values that were originally used when
  doing Helm install), so it recreates the deleted operator
  configuration resource or overwrites the changed values. This
  effectively led to the user's changes to the resource being ignored
  entirely.
* The auto-restart leads to longer operator manager startup times. The
  first start of the operator manager is relatively quick; it then
  acquires a leader election lease and is often restarted shortly after
  that. When the changed pod comes up after the auto-restart, the old
  one is not yet terminated (due to how rolling updates work for K8s
  deployments), which means that the new pod needs to wait for a long
  time (often > 30 seconds) until it gets the leader election lease.
* Last but not least, the auto-restart can happen at any time, in the
  middle of whatever the operator manager is doing at the moment:
  reconciling custom resources, setting up the OTel collectors, etc.

This commit solves these problems and removes the self-restart entirely:
* The OTel SDK in the operator manager is now configured in code with
  values based on the settings in the operator configuration resource.
* The OTel SDK in the operator manager is started/shut down/restarted
  as required, in particular when the operator configuration resource is
  reconciled and changes that are relevant for self monitoring are
  detected.
* If the auth token (be it for self monitoring or for API access) is
  provided as a reference to a Kubernetes secret, the reference is
  resolved by a separate auxiliary process, the secret ref resolver,
  which can be restarted as necessary without any impact on the
  operator manager.
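
The start/shut down/restart behavior described above can be sketched
roughly as follows. This is a minimal illustration, not the code from
this commit: `Settings`, `Exporter` and `Reconcile` are hypothetical
names, and `Exporter` is only a stand-in that records lifecycle events
instead of driving the real OTel SDK.

```go
package main

import "fmt"

// Settings holds the self-monitoring values read from the operator
// configuration resource. The type and field names are hypothetical.
type Settings struct {
	Enabled  bool
	Endpoint string
	Token    string
}

// Exporter is a stand-in for the OTel SDK. It only records lifecycle
// events so that the start/stop/restart decisions are visible.
type Exporter struct {
	running *Settings // nil while the exporter is stopped
	Events  []string
}

func (e *Exporter) start(s Settings) {
	e.running = &s
	e.Events = append(e.Events, "start "+s.Endpoint)
}

func (e *Exporter) stop() {
	e.running = nil
	e.Events = append(e.Events, "stop")
}

// Reconcile is called with the desired settings whenever the operator
// configuration resource is reconciled. It acts only on relevant changes.
func (e *Exporter) Reconcile(desired Settings) {
	usable := desired.Enabled && desired.Endpoint != "" && desired.Token != ""
	switch {
	case e.running == nil && usable:
		e.start(desired)
	case e.running != nil && !usable:
		e.stop()
	case e.running != nil && usable && *e.running != desired:
		// A relevant setting changed: restart in-process, no pod restart.
		e.stop()
		e.start(desired)
	}
}

func main() {
	e := &Exporter{}
	initial := Settings{Enabled: true, Endpoint: "a.example:4317", Token: "t1"}
	e.Reconcile(initial) // first reconcile starts the exporter
	e.Reconcile(initial) // unchanged settings: nothing happens
	changed := initial
	changed.Endpoint = "b.example:4317"
	e.Reconcile(changed)    // endpoint changed: stop, then start again
	e.Reconcile(Settings{}) // disabled or resource deleted: stop
	fmt.Println(e.Events)
}
```

The point of the sketch is that only the exporter is cycled when a
relevant setting changes, while the surrounding process keeps running,
so a leader election lease would not have to be reacquired.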
basti1302 committed Mar 7, 2025
1 parent d1593ff commit 7e06662
Showing 60 changed files with 3,705 additions and 2,671 deletions.
1 change: 1 addition & 0 deletions .github/dependabot.yml
@@ -14,6 +14,7 @@ updates:
- "/images/configreloader/src"
- "/images/filelogoffsetsynch/src"
- "/images/pkg/common"
- "/images/secretrefresolver"
schedule:
interval: "weekly"
day: "tuesday"
11 changes: 11 additions & 0 deletions .github/workflows/ci.yaml
@@ -152,6 +152,17 @@ jobs:
imageUrl: https://github.com/dash0hq/dash0-operator/tree/main/images/instrumentation
context: images/instrumentation

- name: build secret ref resolver image
uses: ./.github/actions/build-image
with:
githubToken: ${{ secrets.GITHUB_TOKEN }}
imageName: secret-ref-resolver
imageTitle: Dash0 Kubernetes Secret Ref Resolver
imageDescription: the secret ref resolver for the Dash0 operator for Kubernetes
imageUrl: https://github.com/dash0hq/dash0-operator/tree/main/images/secretrefresolver
context: images
file: images/secretrefresolver/Dockerfile

- name: build collector image
uses: ./.github/actions/build-image
with:
14 changes: 13 additions & 1 deletion CONTRIBUTING.md
@@ -37,7 +37,10 @@ After that, you can deploy the operator to your cluster:
CONFIGURATION_RELOADER_IMG_PULL_POLICY=""
FILELOG_OFFSET_SYNCH_IMG_REPOSITORY=ghcr.io/dash0hq/filelog-offset-synch \
FILELOG_OFFSET_SYNCH_IMG_TAG=main-dev \
FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY=""
FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY="" \
SECRET_REF_RESOLVER_IMG_REPOSITORY=ghcr.io/dash0hq/secret-ref-resolver \
SECRET_REF_RESOLVER_IMG_TAG=main-dev \
SECRET_REF_RESOLVER_IMG_PULL_POLICY=""
```
* The custom resource definition will automatically be installed when deploying the operator. However, you can also do
that separately via kustomize if required via `make install`.
@@ -90,6 +93,8 @@ CONTROLLER_IMG_REPOSITORY=ghcr.io/dash0hq/operator-controller \
CONFIGURATION_RELOADER_IMG_TAG=main-dev \
FILELOG_OFFSET_SYNCH_IMG_REPOSITORY=ghcr.io/dash0hq/filelog-offset-synch \
FILELOG_OFFSET_SYNCH_IMG_TAG=main-dev \
SECRET_REF_RESOLVER_IMG_REPOSITORY=ghcr.io/dash0hq/secret-ref-resolver \
SECRET_REF_RESOLVER_IMG_TAG=main-dev \
make test-e2e
```

@@ -217,6 +222,8 @@ If you want to report telemetry to a Dash0 backend, set `DASH0_AUTHORIZATION_TOK
* `OPERATOR_CONFIGURATION_VIA_HELM_SELF_MONITORING_ENABLED`: Set this to false to set the respective Helm value to
false, disabling self-monitoring.
This defaults to "true".
* `OPERATOR_CONFIGURATION_VIA_HELM_USE_TOKEN`: Set this to true to use an auth token
(`DASH0_AUTHORIZATION_TOKEN`) in the operator configuration resource instead of a secret ref.
* `OPERATOR_HELM_CHART_VERSION`: Set this to use a specific version of the Helm chart. This is meant to be used
together with `OPERATOR_HELM_CHART=dash0-operator/dash0-operator` or similar, where `OPERATOR_HELM_CHART` refers
to an already installed remote Helm repository (e.g. https://dash0hq.github.io/dash0-operator) that contains the
@@ -268,6 +275,9 @@ If you want to report telemetry to a Dash0 backend, set `DASH0_AUTHORIZATION_TOK
* `FILELOG_OFFSET_SYNCH_IMG_TAG`
* `FILELOG_OFFSET_SYNCH_IMG_DIGEST`
* `FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY`
* `SECRET_REF_RESOLVER_IMG_REPOSITORY`
* `SECRET_REF_RESOLVER_IMG_TAG`
* `SECRET_REF_RESOLVER_IMG_PULL_POLICY`
* To run the scenario with the images that have been built from the main branch and pushed to ghcr.io most recently:
```
CONTROLLER_IMG_REPOSITORY=ghcr.io/dash0hq/operator-controller \
@@ -280,6 +290,8 @@ If you want to report telemetry to a Dash0 backend, set `DASH0_AUTHORIZATION_TOK
CONFIGURATION_RELOADER_IMG_TAG=main-dev \
FILELOG_OFFSET_SYNCH_IMG_REPOSITORY=ghcr.io/dash0hq/filelog-offset-synch \
FILELOG_OFFSET_SYNCH_IMG_TAG=main-dev \
SECRET_REF_RESOLVER_IMG_REPOSITORY=ghcr.io/dash0hq/secret-ref-resolver \
SECRET_REF_RESOLVER_IMG_TAG=main-dev \
test-resources/bin/test-scenario-01-aum-operator-cr.sh
```
* To run the scenario with the helm chart from the official remote repository and the default images referenced in
13 changes: 13 additions & 0 deletions Makefile
@@ -81,6 +81,11 @@ FILELOG_OFFSET_SYNCH_IMG_TAG ?= latest
FILELOG_OFFSET_SYNCH_IMG ?= $(FILELOG_OFFSET_SYNCH_IMG_REPOSITORY):$(FILELOG_OFFSET_SYNCH_IMG_TAG)
FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY ?= Never

SECRET_REF_RESOLVER_IMG_REPOSITORY ?= secret-ref-resolver
SECRET_REF_RESOLVER_IMG_TAG ?= latest
SECRET_REF_RESOLVER_IMG ?= $(SECRET_REF_RESOLVER_IMG_REPOSITORY):$(SECRET_REF_RESOLVER_IMG_TAG)
SECRET_REF_RESOLVER_IMG_PULL_POLICY ?= Never

# ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
ENVTEST_K8S_VERSION = 1.28.3

@@ -254,6 +259,7 @@ run: manifests generate fmt vet ## Run a controller from your host.
.PHONY: docker-build
docker-build: \
docker-build-controller \
docker-build-secret-ref-resolver \
docker-build-instrumentation \
docker-build-collector \
docker-build-config-reloader \
@@ -300,6 +306,10 @@ docker-build-config-reloader: ## Build the config reloader container image.
docker-build-filelog-offset-synch: ## Build the filelog offset synch container image.
@$(call build_container_image,$(FILELOG_OFFSET_SYNCH_IMG_REPOSITORY),$(FILELOG_OFFSET_SYNCH_IMG_TAG),images,images/filelogoffsetsynch/Dockerfile)

.PHONY: docker-build-secret-ref-resolver
docker-build-secret-ref-resolver: ## Build the secret ref resolver container image.
@$(call build_container_image,$(SECRET_REF_RESOLVER_IMG_REPOSITORY),$(SECRET_REF_RESOLVER_IMG_TAG),images,images/secretrefresolver/Dockerfile)

ifndef ignore-not-found
ignore-not-found = false
endif
@@ -333,6 +343,9 @@ deploy-via-helm: ## Deploy the controller via helm to the K8s cluster specified
--set operator.filelogOffsetSynchImage.repository=$(FILELOG_OFFSET_SYNCH_IMG_REPOSITORY) \
--set operator.filelogOffsetSynchImage.tag=$(FILELOG_OFFSET_SYNCH_IMG_TAG) \
--set operator.filelogOffsetSynchImage.pullPolicy=$(FILELOG_OFFSET_SYNCH_IMG_PULL_POLICY) \
--set operator.secretRefResolverImage.repository=$(SECRET_REF_RESOLVER_IMG_REPOSITORY) \
--set operator.secretRefResolverImage.tag=$(SECRET_REF_RESOLVER_IMG_TAG) \
--set operator.secretRefResolverImage.pullPolicy=$(SECRET_REF_RESOLVER_IMG_PULL_POLICY) \
--set operator.developmentMode=true \
dash0-operator \
$(OPERATOR_HELM_CHART)