feat(backendconnection): deploy collectors via operator configuration
Deploy the Dash0 OpenTelemetry collectors as soon as an operator
configuration resource with an export is available. In particular,
the slightly arbitrary requirement of having at least one Dash0
monitoring resource monitoring a namespace for getting _any_ telemetry
from a cluster is removed with this. This enables collecting
non-namespace scoped metrics as soon as export settings are available.
basti1302 committed Feb 5, 2025
1 parent f4d87b4 commit e5ca830
Showing 20 changed files with 1,252 additions and 599 deletions.
47 changes: 29 additions & 18 deletions helm-chart/dash0-operator/README.md
@@ -97,8 +97,8 @@ See the section
for more information on using Kubernetes secrets with the Dash0 operator.

You can consult the chart's
[values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) for a complete
list of available configuration settings.
[values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) file for a
complete list of available configuration settings.

Last but not least, you can also install the operator without providing a Dash0 backend configuration:

@@ -118,7 +118,8 @@ That is, providing `--set operator.dash0Export.enabled=true` and the other backend
On its own, the operator will not do much.
To actually have the operator monitor your cluster, two more things need to be set up:
1. a [Dash0 backend connection](#configuring-the-dash0-backend-connection) has to be configured and
2. monitoring workloads and collecting metrics has to be [enabled per namespace](#enable-dash0-monitoring-for-a-namespace).
2. monitoring namespaces and their workloads to collect logs, traces and metrics has to be
[enabled per namespace](#enable-dash0-monitoring-for-a-namespace).

Both steps are described in the following sections.

@@ -207,6 +208,11 @@ kubectl apply -f dash0-operator-configuration.yaml
The Dash0 operator configuration resource is cluster-scoped, so a specific namespace should not be provided when
applying it.
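
For orientation, since the full `dash0-operator-configuration.yaml` example referenced above sits in a collapsed part of
this diff, here is a rough sketch of a cluster-scoped operator configuration resource with an export. The `apiVersion`
and field names are assumptions based on the operator's v1alpha1 API, not a verbatim copy of the chart's example:

```yaml
# Sketch only: apiVersion, kind and field names are assumptions, verify them against the operator's documentation.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0OperatorConfiguration
metadata:
  name: dash0-operator-configuration-resource
spec:
  export:
    dash0:
      endpoint: ingress.eu-west-1.aws.dash0.com:4317 # replace with your Dash0 ingress endpoint
      authorization:
        token: auth_... # alternatively, reference a Kubernetes secret instead of an inline token
```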

Note: All configuration options available in the operator configuration resource can also be configured when letting the
Helm chart auto-create this resource, as explained in the section [Installation](#installation). You can consult the
chart's [values.yaml](https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/values.yaml) file
for a complete list of available configuration settings.
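
As a sketch of the Helm-based alternative mentioned in this note, the same export settings could be supplied as chart
values so that the operator configuration resource is auto-created at install time. Only `operator.dash0Export.enabled`
appears in this README; the other keys below are assumptions to be verified against the chart's values.yaml:

```yaml
# Hypothetical excerpt of a Helm values file; key names other than operator.dash0Export.enabled are assumptions.
operator:
  dash0Export:
    enabled: true
    endpoint: ingress.eu-west-1.aws.dash0.com:4317 # assumed key; your Dash0 ingress endpoint
    token: auth_...                                # assumed key; alternatively reference a Kubernetes secret
```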

### Enable Dash0 Monitoring For a Namespace

For _each namespace_ that you want to monitor with Dash0, enable monitoring by installing a _Dash0 monitoring
@@ -234,8 +240,10 @@ If you want to monitor the `default` namespace with Dash0, use the following command:
kubectl apply -f dash0-monitoring.yaml
```
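
The content of the `dash0-monitoring.yaml` file applied above is part of the collapsed section of this diff; as a
non-authoritative sketch, a minimal monitoring resource for the `default` namespace might look roughly like this
(`apiVersion` and `kind` are assumptions based on the operator's v1alpha1 API):

```yaml
# Sketch only: field names are assumptions, verify them against the operator's documentation.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0Monitoring
metadata:
  name: dash0-monitoring-resource
  namespace: default # the namespace to be monitored
spec: {} # export settings can be omitted if the operator configuration resource already provides them
```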

Note: Collecting Kubernetes infrastructure metrics (which are not neccessarily related to specific workloads or
namespaces) also requires that at least one namespace has a Dash0Monitoring resource.
Note: Even when no monitoring resource has been installed and no namespace is being monitored by Dash0, the Dash0
operator's collector will collect Kubernetes infrastructure metrics that are not namespace-scoped, like node-related
metrics. The only prerequisite for this is an [operator configuration](#configuring-the-dash0-backend-connection) with
export settings.

### Additional Configuration Per Namespace

@@ -291,14 +299,13 @@ The Dash0 monitoring resource supports additional configuration settings:
* `spec.synchronizePersesDashboards`: A namespace-wide opt-out for synchronizing Perses dashboard resources found in the
target namespace. If enabled, the operator will watch Perses dashboard resources in this namespace and create
corresponding dashboards in Dash0 via the Dash0 API.
See https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/README.md#managing-dash0-dashboards
for details. This setting is optional, it defaults to true.
See [Managing Dash0 Dashboards](#managing-dash0-dashboards) for details. This setting is optional, it defaults to true.

* `spec.synchronizePrometheusRules`: A namespace-wide opt-out for synchronizing Prometheus rule resources found in the
target namespace. If enabled, the operator will watch Prometheus rule resources in this namespace and create
corresponding check rules in Dash0 via the Dash0 API.
See https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/README.md#managing-dash0-check-rules
for details. This setting is optional, it defaults to true.
See [Managing Dash0 Check Rules](#managing-dash0-check-rules) for details. This setting is optional, it defaults to
true.

* `spec.prometheusScrapingEnabled`: A namespace-wide opt-out for Prometheus scraping for the target namespace.
If enabled, the operator will configure its OpenTelemetry collector to scrape metrics from pods in the namespace
@@ -462,8 +469,6 @@ spec:

### Configure Metrics Collection

Note: Collecting metrics requires that at least one namespace has a Dash0Monitoring resource.

By default, the operator collects metrics as follows:
* The operator collects node, pod, container, and volume metrics from the API server on
[kubelets](https://kubernetes.io/docs/concepts/architecture/#kubelet)
@@ -477,13 +482,14 @@ By default, the operator collects metrics as follows:
`false` when deploying the operator configuration resource via the Helm chart).
* Namespace-scoped metrics (e.g. metrics related to a workload running in a specific namespace) will only be collected
if the namespace is monitored, that is, there is a Dash0 monitoring resource in that namespace.
* Metrics which are not namespace-scoped (for example node metrics like `k8s.node.*`) will always be collected, unless
metrics collection is disabled globally for the cluster (`kubernetesInfrastructureMetricsCollectionEnabled: false`,
see above). For technical reasons, metrics collection also does not start if there is no Dash0 monitoring resource at
all in the cluster, that is, if no namespace is monitored (this is subject to change in a future version).
* The Dash0 operator scrapes Prometheus endpoints on pods annotated with the `prometheus.io/*` annotations, as
described in the section [Scraping Prometheus endpoints](#scraping-prometheus-endpoints). This can be disabled per
namespace by explicitly setting `prometheusScrapingEnabled: false` in the Dash0 monitoring resource.
* The Dash0 operator scrapes Prometheus endpoints on pods annotated with the `prometheus.io/*` annotations in monitored
namespaces, as described in the section [Scraping Prometheus endpoints](#scraping-prometheus-endpoints). This can be
disabled per namespace by explicitly setting `prometheusScrapingEnabled: false` in the Dash0 monitoring resource.
* Metrics which are not namespace-scoped (for example node metrics like `k8s.node.*` or host metrics like
`system.cpu.utilization`) will always be collected, unless metrics collection is disabled globally for the cluster
(`kubernetesInfrastructureMetricsCollectionEnabled: false`, see above). An operator configuration resource with
[export settings](#configuring-the-dash0-backend-connection) has to be present in the cluster, otherwise no metrics
collection takes place.

Disabling or enabling individual metrics via configuration is not supported.
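
To illustrate the global opt-out mentioned in the list above: the `kubernetesInfrastructureMetricsCollectionEnabled`
flag name is documented here, but its exact location in the operator configuration resource's spec is an assumption in
the following sketch:

```yaml
# Sketch only: the spec path of the flag is an assumption; the flag name itself is documented above.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0OperatorConfiguration
metadata:
  name: dash0-operator-configuration-resource
spec:
  kubernetesInfrastructureMetricsCollectionEnabled: false
  export:
    dash0:
      endpoint: ingress.eu-west-1.aws.dash0.com:4317
      authorization:
        token: auth_...
```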

@@ -781,6 +787,11 @@ The scraping of a pod is executed from the same Kubernetes node the pod resides on.
This feature can be disabled for a namespace by explicitly setting `prometheusScrapingEnabled: false` in the Dash0
monitoring resource.
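
As a sketch, this per-namespace opt-out could be expressed in the Dash0 monitoring resource along the following lines;
`apiVersion` and `kind` are assumptions, while the `prometheusScrapingEnabled` setting itself is documented above:

```yaml
# Sketch only: the resource structure is an assumption; prometheusScrapingEnabled is documented above.
apiVersion: operator.dash0.com/v1alpha1
kind: Dash0Monitoring
metadata:
  name: dash0-monitoring-resource
  namespace: my-namespace
spec:
  prometheusScrapingEnabled: false # opt out of prometheus.io/* annotation-based scraping in this namespace
```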

Note: To also have metrics from [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) (which are used
extensively in [Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/)) scraped and delivered
to Dash0, you can annotate the kube-state-metrics pod with `prometheus.io/scrape: "true"` and add a Dash0 monitoring
resource to the namespace it is running in.
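
For example, the annotation could be added to the pod template of the kube-state-metrics deployment roughly as follows;
the deployment name, namespace and metrics port are assumptions that depend on how kube-state-metrics was installed:

```yaml
# Sketch only: excerpt of a kube-state-metrics Deployment's pod template (not a complete manifest);
# the name, namespace and port are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080" # the default kube-state-metrics metrics port in many setups
```
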
## Managing Dash0 Dashboards
You can manage your Dash0 dashboards via the Dash0 operator.
38 changes: 2 additions & 36 deletions internal/backendconnection/backend_connection_controller.go
@@ -7,7 +7,6 @@ import (
"context"
"slices"

"github.com/go-logr/logr"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
rbacv1 "k8s.io/api/rbac/v1"
@@ -20,7 +19,6 @@ import (
"sigs.k8s.io/controller-runtime/pkg/predicate"
"sigs.k8s.io/controller-runtime/pkg/reconcile"

dash0v1alpha1 "github.com/dash0hq/dash0-operator/api/dash0monitoring/v1alpha1"
"github.com/dash0hq/dash0-operator/internal/backendconnection/otelcolresources"
"github.com/dash0hq/dash0-operator/internal/util"
)
@@ -117,18 +115,11 @@ func (r *BackendConnectionReconciler) Reconcile(
logger := log.FromContext(ctx)
logger.Info("reconciling backend connection resources", "request", request)

arbitraryMonitoringResource, err := r.findArbitraryMonitoringResource(ctx, &logger)
if err != nil {
return reconcile.Result{}, err
} else if arbitraryMonitoringResource == nil {
return reconcile.Result{}, nil
}

if err = r.BackendConnectionManager.ReconcileOpenTelemetryCollector(
if err := r.BackendConnectionManager.ReconcileOpenTelemetryCollector(
ctx,
r.Images,
r.OperatorNamespace,
arbitraryMonitoringResource,
nil,
TriggeredByWatchEvent,
); err != nil {
logger.Error(err, "Failed to create/update backend connection resources.")
Expand All @@ -143,28 +134,3 @@ func (r *BackendConnectionReconciler) Reconcile(

return reconcile.Result{}, nil
}

func (r *BackendConnectionReconciler) findArbitraryMonitoringResource(
ctx context.Context,
logger *logr.Logger,
) (*dash0v1alpha1.Dash0Monitoring, error) {
allDash0MonitoringResouresInCluster := &dash0v1alpha1.Dash0MonitoringList{}
if err := r.List(
ctx,
allDash0MonitoringResouresInCluster,
&client.ListOptions{},
); err != nil {
logger.Error(err, "Failed to list all Dash0 monitoring resources when reconciling backend connection resources.")
return nil, err
}

if len(allDash0MonitoringResouresInCluster.Items) == 0 {
logger.Info("No Dash0 monitoring resources in cluster, aborting the backend connection resources reconciliation.")
return nil, nil
}

// TODO this needs to be fixed when we start to support sending telemetry to different backends per namespace.
// Ultimately we need to derive one consistent configuration including multiple pipelines and routing across all
// monitored namespaces.
return &allDash0MonitoringResouresInCluster.Items[0], nil
}
153 changes: 92 additions & 61 deletions internal/backendconnection/backendconnection_manager.go
@@ -5,6 +5,7 @@ package backendconnection

import (
"context"
"fmt"
"sync/atomic"

"github.com/go-logr/logr"
@@ -34,11 +35,18 @@ const (
TriggeredByDash0ResourceReconcile BackendConnectionReconcileTrigger = "resource"
)

// ReconcileOpenTelemetryCollector can be triggered by:
// 1. a reconcile request from the Dash0OperatorConfiguration resource.
// 2. a reconcile request from a Dash0Monitoring resource in the cluster.
// 3. a change event on one of the OpenTelemetry collector related resources that the operator manages (a change to one
// of "our" config maps or similar).
//
// The parameter triggeringMonitoringResource is only != nil for case (2).
func (m *BackendConnectionManager) ReconcileOpenTelemetryCollector(
ctx context.Context,
images util.Images,
operatorNamespace string,
monitoringResource *dash0v1alpha1.Dash0Monitoring,
triggeringMonitoringResource *dash0v1alpha1.Dash0Monitoring,
trigger BackendConnectionReconcileTrigger,
) error {
logger := log.FromContext(ctx)
@@ -68,24 +76,75 @@ func (m *BackendConnectionManager) ReconcileOpenTelemetryCollector(
m.updateInProgress.Store(false)
}()

operatorConfigurationResource, err := m.findOperatorConfigurationResource(ctx, &logger)
if err != nil {
return err
}
allMonitoringResources, err := m.findAllMonitoringResources(ctx, &logger)
if err != nil {
return err
}
if len(allMonitoringResources) == 0 {
var export *dash0v1alpha1.Export
if operatorConfigurationResource != nil && operatorConfigurationResource.Spec.Export != nil {
export = operatorConfigurationResource.Spec.Export
}
if export == nil && triggeringMonitoringResource != nil &&
triggeringMonitoringResource.IsAvailable() &&
triggeringMonitoringResource.Spec.Export != nil {
export = triggeringMonitoringResource.Spec.Export
}
if export == nil {
// Using the export setting of an arbitrary monitoring resource is a bandaid as long as we do not allow
// exporting telemetry to different backends per namespace.
for _, monitoringResource := range allMonitoringResources {
if monitoringResource.Spec.Export != nil {
export = monitoringResource.Spec.Export
break
}
}
}

if export != nil {
return m.createOrUpdateOpenTelemetryCollector(
ctx,
operatorNamespace,
images,
operatorConfigurationResource,
allMonitoringResources,
export,
&logger,
)
} else {
if operatorConfigurationResource != nil {
logger.Info(
fmt.Sprintf("There is an operator configuration resource (\"%s\"), but it has no export "+
"configuration, no Dash0 OpenTelemetry collector will be created, existing Dash0 OpenTelemetry "+
"collectors will be removed.", operatorConfigurationResource.Name),
)
}
return m.removeOpenTelemetryCollector(ctx, operatorNamespace, &logger)
}
}

func (m *BackendConnectionManager) createOrUpdateOpenTelemetryCollector(
ctx context.Context,
operatorNamespace string,
images util.Images,
operatorConfigurationResource *dash0v1alpha1.Dash0OperatorConfiguration,
allMonitoringResources []dash0v1alpha1.Dash0Monitoring,
export *dash0v1alpha1.Export,
logger *logr.Logger,
) error {
resourcesHaveBeenCreated, resourcesHaveBeenUpdated, err :=
m.OTelColResourceManager.CreateOrUpdateOpenTelemetryCollectorResources(
ctx,
operatorNamespace,
images,
operatorConfigurationResource,
allMonitoringResources,
monitoringResource,
&logger,
export,
logger,
)

if err != nil {
logger.Error(
err,
@@ -94,85 +153,57 @@ func (m *BackendConnectionManager) ReconcileOpenTelemetryCollector(
)
return err
}

if resourcesHaveBeenCreated {
if resourcesHaveBeenCreated && resourcesHaveBeenUpdated {
logger.Info("OpenTelemetry collector Kubernetes resources have been created and updated.")
} else if resourcesHaveBeenCreated {
logger.Info("OpenTelemetry collector Kubernetes resources have been created.")
} else if resourcesHaveBeenUpdated {
logger.Info("OpenTelemetry collector Kubernetes resources have been updated.")
}
return nil
}

func (m *BackendConnectionManager) RemoveOpenTelemetryCollectorIfNoMonitoringResourceIsLeft(
func (m *BackendConnectionManager) removeOpenTelemetryCollector(
ctx context.Context,
operatorNamespace string,
dash0MonitoringResourceToBeDeleted *dash0v1alpha1.Dash0Monitoring,
logger *logr.Logger,
) error {
m.resourcesHaveBeenDeletedByOperator.Store(true)
m.updateInProgress.Store(true)
defer func() {
m.updateInProgress.Store(false)
}()

logger := log.FromContext(ctx)
list := &dash0v1alpha1.Dash0MonitoringList{}
err := m.Client.List(
resourcesHaveBeenDeleted, err := m.OTelColResourceManager.DeleteResources(
ctx,
list,
operatorNamespace,
logger,
)

if err != nil {
logger.Error(err, "Error when checking whether there are any Dash0 monitoring resources left in the cluster.")
logger.Error(
err,
"Failed to delete the OpenTelemetry collector Kubernetes resources, requeuing reconcile request.",
)
return err
}
if len(list.Items) > 1 {
// There is still more than one Dash0 monitoring resource in the namespace, do not remove the backend connection.
return nil
if resourcesHaveBeenDeleted {
logger.Info("OpenTelemetry collector Kubernetes resources have been deleted.")
}

if len(list.Items) == 1 && list.Items[0].UID != dash0MonitoringResourceToBeDeleted.UID {
// There is only one Dash0 monitoring resource left, but it is *not* the one that is about to be deleted.
// Do not remove the backend connection.
logger.Info(
"There is only one Dash0 monitoring resource left, but it is not the one being deleted.",
"to be deleted/UID",
dash0MonitoringResourceToBeDeleted.UID,
"to be deleted/namespace",
dash0MonitoringResourceToBeDeleted.Namespace,
"to be deleted/name",
dash0MonitoringResourceToBeDeleted.Name,
"existing resource/UID",
list.Items[0].UID,
"existing resource/namespace",
list.Items[0].Namespace,
"existing resource/name",
list.Items[0].Name,
)
return nil
}

// Either there is no Dash0 monitoring resource left, or only one and that one is about to be deleted. Delete the
// backend connection.
return m.removeOpenTelemetryCollector(ctx, operatorNamespace, &logger)
return nil
}

func (m *BackendConnectionManager) removeOpenTelemetryCollector(
func (m *BackendConnectionManager) findOperatorConfigurationResource(
ctx context.Context,
operatorNamespace string,
logger *logr.Logger,
) error {
if err := m.OTelColResourceManager.DeleteResources(
) (*dash0v1alpha1.Dash0OperatorConfiguration, error) {
operatorConfigurationResource, err := util.FindUniqueOrMostRecentResourceInScope(
ctx,
operatorNamespace,
m.Client,
"", /* cluster-scope, thus no namespace */
&dash0v1alpha1.Dash0OperatorConfiguration{},
logger,
); err != nil {
logger.Error(
err,
"Failed to delete the OpenTelemetry collector Kuberenetes resources, requeuing reconcile request.",
)
return err
)
if err != nil {
return nil, err
}
return nil
if operatorConfigurationResource == nil {
return nil, nil
}
return operatorConfigurationResource.(*dash0v1alpha1.Dash0OperatorConfiguration), nil
}

func (m *BackendConnectionManager) findAllMonitoringResources(
