feat(backendconnection): deploy collectors via operator configuration
Deploy the Dash0 OpenTelemetry collectors as soon as an operator
configuration resource with an export is available. In particular,
the slightly arbitrary requirement of having at least one Dash0
monitoring resource monitoring a namespace for getting _any_ telemetry
from a cluster is removed with this. This enables collecting
non-namespace scoped metrics as soon as export settings are available.
basti1302 committed Feb 5, 2025
1 parent f4d87b4 commit e5ca830
Showing 20 changed files with 1,252 additions and 599 deletions.
47 changes: 29 additions & 18 deletions helm-chart/dash0-operator/README.md
@@ -97,8 +97,8 @@ See the section
for more information on using Kubernetes secrets with the Dash0 operator.

You can consult the chart's
[values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) for a complete
list of available configuration settings.
[values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) file for a
complete list of available configuration settings.

Last but not least, you can also install the operator without providing a Dash0 backend configuration:

@@ -118,7 +118,8 @@ That is, providing `--set operator.dash0Export.enabled=true` and the other backend
On its own, the operator will not do much.
To actually have the operator monitor your cluster, two more things need to be set up:
1. a [Dash0 backend connection](#configuring-the-dash0-backend-connection) has to be configured and
2. monitoring workloads and collecting metrics has to be [enabled per namespace](#enable-dash0-monitoring-for-a-namespace).
2. monitoring namespaces and their workloads to collect logs, traces and metrics has to be
[enabled per namespace](#enable-dash0-monitoring-for-a-namespace).

Both steps are described in the following sections.

@@ -207,6 +208,11 @@ kubectl apply -f dash0-operator-configuration.yaml
The Dash0 operator configuration resource is cluster-scoped, so a specific namespace should not be provided when
applying it.
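
For orientation, since the full `dash0-operator-configuration.yaml` example referenced above sits in a collapsed part of
this diff, here is a rough sketch of a cluster-scoped operator configuration resource with an export. The `apiVersion`
and field names are assumptions based on the operator's v1alpha1 API, not a verbatim copy of the chart's example:

```yaml
# Sketch only: apiVersion, kind and field names are assumptions, verify them against the operator's documentation.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0OperatorConfiguration
metadata:
  name: dash0-operator-configuration-resource
spec:
  export:
    dash0:
      endpoint: ingress.eu-west-1.aws.dash0.com:4317 # replace with your Dash0 ingress endpoint
      authorization:
        token: auth_... # alternatively, reference a Kubernetes secret instead of an inline token
```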

Note: All configuration options available in the operator configuration resource can also be configured when letting the
Helm chart auto-create this resource, as explained in the section [Installation](#installation). You can consult the
chart's [values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) file
for a complete list of available configuration settings.
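
As a sketch of the Helm-based alternative mentioned in this note, the same export settings could be supplied as chart
values so that the operator configuration resource is auto-created at install time. Only `operator.dash0Export.enabled`
appears in this README; the other keys below are assumptions to be verified against the chart's values.yaml:

```yaml
# Hypothetical excerpt of a Helm values file; key names other than operator.dash0Export.enabled are assumptions.
operator:
  dash0Export:
    enabled: true
    endpoint: ingress.eu-west-1.aws.dash0.com:4317 # assumed key; your Dash0 ingress endpoint
    token: auth_...                                # assumed key; alternatively reference a Kubernetes secret
```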

### Enable Dash0 Monitoring For a Namespace

For _each namespace_ that you want to monitor with Dash0, enable monitoring by installing a _Dash0 monitoring
@@ -234,8 +240,10 @@ If you want to monitor the `default` namespace with Dash0, use the following command:
kubectl apply -f dash0-monitoring.yaml
```
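
The content of the `dash0-monitoring.yaml` file applied above is part of the collapsed section of this diff; as a
non-authoritative sketch, a minimal monitoring resource for the `default` namespace might look roughly like this
(`apiVersion` and `kind` are assumptions based on the operator's v1alpha1 API):

```yaml
# Sketch only: field names are assumptions, verify them against the operator's documentation.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0Monitoring
metadata:
  name: dash0-monitoring-resource
  namespace: default # the namespace to be monitored
spec: {} # export settings can be omitted if the operator configuration resource already provides them
```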

Note: Collecting Kubernetes infrastructure metrics (which are not neccessarily related to specific workloads or
namespaces) also requires that at least one namespace has a Dash0Monitoring resource.
Note: Even when no monitoring resource has been installed and no namespace is being monitored by Dash0, the Dash0
operator's collector will collect Kubernetes infrastructure metrics that are not namespace-scoped, like node-related
metrics. The only prerequisite for this is an [operator configuration](#configuring-the-dash0-backend-connection) with
export settings.

### Additional Configuration Per Namespace

@@ -291,14 +299,13 @@ The Dash0 monitoring resource supports additional configuration settings:
* `spec.synchronizePersesDashboards`: A namespace-wide opt-out for synchronizing Perses dashboard resources found in the
target namespace. If enabled, the operator will watch Perses dashboard resources in this namespace and create
corresponding dashboards in Dash0 via the Dash0 API.
See https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/README.md#managing-dash0-dashboards
for details. This setting is optional, it defaults to true.
See [Managing Dash0 Dashboards](#managing-dash0-dashboards) for details. This setting is optional, it defaults to true.

* `spec.synchronizePrometheusRules`: A namespace-wide opt-out for synchronizing Prometheus rule resources found in the
target namespace. If enabled, the operator will watch Prometheus rule resources in this namespace and create
corresponding check rules in Dash0 via the Dash0 API.
See https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/README.md#managing-dash0-check-rules
for details. This setting is optional, it defaults to true.
See [Managing Dash0 Check Rules](#managing-dash0-check-rules) for details. This setting is optional, it defaults to
true.

* `spec.prometheusScrapingEnabled`: A namespace-wide opt-out for Prometheus scraping for the target namespace.
If enabled, the operator will configure its OpenTelemetry collector to scrape metrics from pods in the namespace
@@ -462,8 +469,6 @@ spec:

### Configure Metrics Collection

Note: Collecting metrics requires that at least one namespace has a Dash0Monitoring resource.

By default, the operator collects metrics as follows:
* The operator collects node, pod, container, and volume metrics from the API server on
[kubelets](https://kubernetes.io/docs/concepts/architecture/#kubelet)
@@ -477,13 +482,14 @@ By default, the operator collects metrics as follows:
`false` when deploying the operator configuration resource via the Helm chart).
* Namespace-scoped metrics (e.g. metrics related to a workload running in a specific namespace) will only be collected
if the namespace is monitored, that is, there is a Dash0 monitoring resource in that namespace.
* Metrics which are not namespace-scoped (for example node metrics like `k8s.node.*`) will always be collected, unless
metrics collection is disabled globally for the cluster (`kubernetesInfrastructureMetricsCollectionEnabled: false`,
see above). For technical reasons, metrics collection also does not start if there is no Dash0 monitoring resource at
all in the cluster, that is, if no namespace is monitored (this is subject to change in a future version).
* The Dash0 operator scrapes Prometheus endpoints on pods annotated with the `prometheus.io/*` annotations, as
described in the section [Scraping Prometheus endpoints](#scraping-prometheus-endpoints). This can be disabled per
namespace by explicitly setting `prometheusScrapingEnabled: false` in the Dash0 monitoring resource.
* The Dash0 operator scrapes Prometheus endpoints on pods annotated with the `prometheus.io/*` annotations in monitored
namespaces, as described in the section [Scraping Prometheus endpoints](#scraping-prometheus-endpoints). This can be
disabled per namespace by explicitly setting `prometheusScrapingEnabled: false` in the Dash0 monitoring resource.
* Metrics which are not namespace-scoped (for example node metrics like `k8s.node.*` or host metrics like
`system.cpu.utilization`) will always be collected, unless metrics collection is disabled globally for the cluster
(`kubernetesInfrastructureMetricsCollectionEnabled: false`, see above). An operator configuration resource with
[export settings](#configuring-the-dash0-backend-connection) has to be present in the cluster, otherwise no metrics
collection takes place.

Disabling or enabling individual metrics via configuration is not supported.
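
To illustrate the global opt-out mentioned in the list above: the `kubernetesInfrastructureMetricsCollectionEnabled`
flag name is documented here, but its exact location in the operator configuration resource's spec is an assumption in
the following sketch:

```yaml
# Sketch only: the spec path of the flag is an assumption; the flag name itself is documented above.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0OperatorConfiguration
metadata:
  name: dash0-operator-configuration-resource
spec:
  kubernetesInfrastructureMetricsCollectionEnabled: false
  export:
    dash0:
      endpoint: ingress.eu-west-1.aws.dash0.com:4317
      authorization:
        token: auth_...
```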

@@ -781,6 +787,11 @@ The scraping of a pod is executed from the same Kubernetes node the pod resides on.
This feature can be disabled for a namespace by explicitly setting `prometheusScrapingEnabled: false` in the Dash0
monitoring resource.
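
As a sketch, this per-namespace opt-out could be expressed in the Dash0 monitoring resource along the following lines;
`apiVersion` and `kind` are assumptions, while the `prometheusScrapingEnabled` setting itself is documented above:

```yaml
# Sketch only: the resource structure is an assumption; prometheusScrapingEnabled is documented above.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0Monitoring
metadata:
  name: dash0-monitoring-resource
  namespace: my-namespace
spec:
  prometheusScrapingEnabled: false # opt out of prometheus.io/* annotation-based scraping in this namespace
```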

Note: To also have metrics from [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) (which are used
extensively in [Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/)) scraped and delivered
to Dash0, you can annotate the kube-state-metrics pod with `prometheus.io/scrape: "true"` and add a Dash0 monitoring
resource to the namespace it is running in.
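
For example, the annotation could be added to the pod template of the kube-state-metrics deployment roughly as follows;
the deployment name, namespace and metrics port are assumptions that depend on how kube-state-metrics was installed:

```yaml
# Sketch only: excerpt of a kube-state-metrics Deployment's pod template (not a complete manifest);
# the name, namespace and port are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080" # the default kube-state-metrics metrics port in many setups
```
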
## Managing Dash0 Dashboards
You can manage your Dash0 dashboards via the Dash0 operator.
38 changes: 2 additions & 36 deletions internal/backendconnection/backend_connection_controller.go
@@ -7,7 +7,6 @@ import (
"context"
"slices"

"github.com/go-logr/logr"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
rbacv1 "k8s.io/api/rbac/v1"
@@ -20,7 +19,6 @@ import (
"sigs.k8s.io/controller-runtime/pkg/predicate"
"sigs.k8s.io/controller-runtime/pkg/reconcile"

dash0v1alpha1 "github.com/dash0hq/dash0-operator/api/dash0monitoring/v1alpha1"
"github.com/dash0hq/dash0-operator/internal/backendconnection/otelcolresources"
"github.com/dash0hq/dash0-operator/internal/util"
)
@@ -117,18 +115,11 @@ func (r *BackendConnectionReconciler) Reconcile(
logger := log.FromContext(ctx)
logger.Info("reconciling backend connection resources", "request", request)

arbitraryMonitoringResource, err := r.findArbitraryMonitoringResource(ctx, &logger)
if err != nil {
return reconcile.Result{}, err
} else if arbitraryMonitoringResource == nil {
return reconcile.Result{}, nil
}

if err = r.BackendConnectionManager.ReconcileOpenTelemetryCollector(
if err := r.BackendConnectionManager.ReconcileOpenTelemetryCollector(
ctx,
r.Images,
r.OperatorNamespace,
arbitraryMonitoringResource,
nil,
TriggeredByWatchEvent,
); err != nil {
logger.Error(err, "Failed to create/update backend connection resources.")
Expand All @@ -143,28 +134,3 @@ func (r *BackendConnectionReconciler) Reconcile(

return reconcile.Result{}, nil
}

func (r *BackendConnectionReconciler) findArbitraryMonitoringResource(
ctx context.Context,
logger *logr.Logger,
) (*dash0v1alpha1.Dash0Monitoring, error) {
allDash0MonitoringResouresInCluster := &dash0v1alpha1.Dash0MonitoringList{}
if err := r.List(
ctx,
allDash0MonitoringResouresInCluster,
&client.ListOptions{},
); err != nil {
logger.Error(err, "Failed to list all Dash0 monitoring resources when reconciling backend connection resources.")
return nil, err
}

if len(allDash0MonitoringResouresInCluster.Items) == 0 {
logger.Info("No Dash0 monitoring resources in cluster, aborting the backend connection resources reconciliation.")
return nil, nil
}

// TODO this needs to be fixed when we start to support sending telemetry to different backends per namespace.
// Ultimately we need to derive one consistent configuration including multiple pipelines and routing across all
// monitored namespaces.
return &allDash0MonitoringResouresInCluster.Items[0], nil
}
153 changes: 92 additions & 61 deletions internal/backendconnection/backendconnection_manager.go
@@ -5,6 +5,7 @@ package backendconnection

import (
"context"
"fmt"
"sync/atomic"

"github.com/go-logr/logr"
@@ -34,11 +35,18 @@ const (
TriggeredByDash0ResourceReconcile BackendConnectionReconcileTrigger = "resource"
)

// ReconcileOpenTelemetryCollector can be triggered by:
// 1. a reconcile request from the Dash0OperatorConfiguration resource.
// 2. a reconcile request from a Dash0Monitoring resource in the cluster.
// 3. a change event on one of the OpenTelemetry collector related resources that the operator manages (a change to one
// of "our" config maps or similar).
//
// The parameter triggeringMonitoringResource is only != nil for case (2).
func (m *BackendConnectionManager) ReconcileOpenTelemetryCollector(
ctx context.Context,
images util.Images,
operatorNamespace string,
monitoringResource *dash0v1alpha1.Dash0Monitoring,
triggeringMonitoringResource *dash0v1alpha1.Dash0Monitoring,
trigger BackendConnectionReconcileTrigger,
) error {
logger := log.FromContext(ctx)
@@ -68,24 +76,75 @@ func (m *BackendConnectionManager) ReconcileOpenTelemetryCollector(
m.updateInProgress.Store(false)
}()

operatorConfigurationResource, err := m.findOperatorConfigurationResource(ctx, &logger)
if err != nil {
return err
}
allMonitoringResources, err := m.findAllMonitoringResources(ctx, &logger)
if err != nil {
return err
}
if len(allMonitoringResources) == 0 {
var export *dash0v1alpha1.Export
if operatorConfigurationResource != nil && operatorConfigurationResource.Spec.Export != nil {
export = operatorConfigurationResource.Spec.Export
}
if export == nil && triggeringMonitoringResource != nil &&
triggeringMonitoringResource.IsAvailable() &&
triggeringMonitoringResource.Spec.Export != nil {
export = triggeringMonitoringResource.Spec.Export
}
if export == nil {
// Using the export setting of an arbitrary monitoring resource is a bandaid as long as we do not allow
// exporting telemetry to different backends per namespace.
for _, monitoringResource := range allMonitoringResources {
if monitoringResource.Spec.Export != nil {
export = monitoringResource.Spec.Export
break
}
}
}

if export != nil {
return m.createOrUpdateOpenTelemetryCollector(
ctx,
operatorNamespace,
images,
operatorConfigurationResource,
allMonitoringResources,
export,
&logger,
)
} else {
if operatorConfigurationResource != nil {
logger.Info(
fmt.Sprintf("There is an operator configuration resource (\"%s\"), but it has no export "+
"configuration, no Dash0 OpenTelemetry collector will be created, existing Dash0 OpenTelemetry "+
"collectors will be removed.", operatorConfigurationResource.Name),
)
}
return m.removeOpenTelemetryCollector(ctx, operatorNamespace, &logger)
}
}

func (m *BackendConnectionManager) createOrUpdateOpenTelemetryCollector(
ctx context.Context,
operatorNamespace string,
images util.Images,
operatorConfigurationResource *dash0v1alpha1.Dash0OperatorConfiguration,
allMonitoringResources []dash0v1alpha1.Dash0Monitoring,
export *dash0v1alpha1.Export,
logger *logr.Logger,
) error {
resourcesHaveBeenCreated, resourcesHaveBeenUpdated, err :=
m.OTelColResourceManager.CreateOrUpdateOpenTelemetryCollectorResources(
ctx,
operatorNamespace,
images,
operatorConfigurationResource,
allMonitoringResources,
monitoringResource,
&logger,
export,
logger,
)

if err != nil {
logger.Error(
err,
@@ -94,85 +153,57 @@ func (m *BackendConnectionManager) ReconcileOpenTelemetryCollector(
)
return err
}

if resourcesHaveBeenCreated {
if resourcesHaveBeenCreated && resourcesHaveBeenUpdated {
logger.Info("OpenTelemetry collector Kubernetes resources have been created and updated.")
} else if resourcesHaveBeenCreated {
logger.Info("OpenTelemetry collector Kubernetes resources have been created.")
} else if resourcesHaveBeenUpdated {
logger.Info("OpenTelemetry collector Kubernetes resources have been updated.")
}
return nil
}

func (m *BackendConnectionManager) RemoveOpenTelemetryCollectorIfNoMonitoringResourceIsLeft(
func (m *BackendConnectionManager) removeOpenTelemetryCollector(
ctx context.Context,
operatorNamespace string,
dash0MonitoringResourceToBeDeleted *dash0v1alpha1.Dash0Monitoring,
logger *logr.Logger,
) error {
m.resourcesHaveBeenDeletedByOperator.Store(true)
m.updateInProgress.Store(true)
defer func() {
m.updateInProgress.Store(false)
}()

logger := log.FromContext(ctx)
list := &dash0v1alpha1.Dash0MonitoringList{}
err := m.Client.List(
resourcesHaveBeenDeleted, err := m.OTelColResourceManager.DeleteResources(
ctx,
list,
operatorNamespace,
logger,
)

if err != nil {
logger.Error(err, "Error when checking whether there are any Dash0 monitoring resources left in the cluster.")
logger.Error(
err,
"Failed to delete the OpenTelemetry collector Kubernetes resources, requeuing reconcile request.",
)
return err
}
if len(list.Items) > 1 {
// There is still more than one Dash0 monitoring resource in the namespace, do not remove the backend connection.
return nil
if resourcesHaveBeenDeleted {
logger.Info("OpenTelemetry collector Kubernetes resources have been deleted.")
}

if len(list.Items) == 1 && list.Items[0].UID != dash0MonitoringResourceToBeDeleted.UID {
// There is only one Dash0 monitoring resource left, but it is *not* the one that is about to be deleted.
// Do not remove the backend connection.
logger.Info(
"There is only one Dash0 monitoring resource left, but it is not the one being deleted.",
"to be deleted/UID",
dash0MonitoringResourceToBeDeleted.UID,
"to be deleted/namespace",
dash0MonitoringResourceToBeDeleted.Namespace,
"to be deleted/name",
dash0MonitoringResourceToBeDeleted.Name,
"existing resource/UID",
list.Items[0].UID,
"existing resource/namespace",
list.Items[0].Namespace,
"existing resource/name",
list.Items[0].Name,
)
return nil
}

// Either there is no Dash0 monitoring resource left, or only one and that one is about to be deleted. Delete the
// backend connection.
return m.removeOpenTelemetryCollector(ctx, operatorNamespace, &logger)
return nil
}

func (m *BackendConnectionManager) removeOpenTelemetryCollector(
func (m *BackendConnectionManager) findOperatorConfigurationResource(
ctx context.Context,
operatorNamespace string,
logger *logr.Logger,
) error {
if err := m.OTelColResourceManager.DeleteResources(
) (*dash0v1alpha1.Dash0OperatorConfiguration, error) {
operatorConfigurationResource, err := util.FindUniqueOrMostRecentResourceInScope(
ctx,
operatorNamespace,
m.Client,
"", /* cluster-scope, thus no namespace */
&dash0v1alpha1.Dash0OperatorConfiguration{},
logger,
); err != nil {
logger.Error(
err,
"Failed to delete the OpenTelemetry collector Kuberenetes resources, requeuing reconcile request.",
)
return err
)
if err != nil {
return nil, err
}
return nil
if operatorConfigurationResource == nil {
return nil, nil
}
return operatorConfigurationResource.(*dash0v1alpha1.Dash0OperatorConfiguration), nil
}

func (m *BackendConnectionManager) findAllMonitoringResources(
