The goal of the Telemetry module is to support you in collecting all relevant metrics of a workload in a Kyma cluster and ship them to a backend for further analysis. Kyma modules like Istio or Serverless contribute metrics instantly, and the Telemetry module enriches the data. You can choose among multiple vendors for OTLP-based backends.
Observability is all about exposing the internals of the components belonging to a distributed application and making that data analysable at a central place. While application logs and traces usually provide request-oriented data, metrics are aggregated statistics that a component exposes to reflect its internal state. Typical statistics, like the number of processed requests or the number of registered users, help you monitor the current state and the health of a component. You can also define proactive alerts that trigger when metrics are about to reach a threshold, and reactive alerts that trigger when a threshold has already been passed.
The Telemetry module provides a metric gateway and, optionally, an agent for the collection and shipment of metrics of any container running in the Kyma runtime.
You can configure the metric gateway with external systems using runtime configuration with a dedicated Kubernetes API (CRD) named MetricPipeline. The Metric feature is optional. If you don't want to use it, simply don't set up a MetricPipeline.
- Before you can collect metrics data from a component, it must expose (or instrument) the metrics. Typically, it instruments specific metrics for the used language runtime (like Node.js) and custom metrics specific to the business logic. Also, the exposure can be in different formats, like the pull-based Prometheus format or the push-based OTLP format.
- If you want to use Prometheus-based metrics, you must have instrumented your application using a library like the Prometheus client library, with a port in your workload exposed serving as a Prometheus metrics endpoint.
- For the instrumentation, you typically use an SDK, namely the Prometheus client libraries or the OpenTelemetry SDKs. Both provide extensions to activate language-specific auto-instrumentation, for example for Node.js, and an API to implement custom instrumentation.
In the Telemetry module, a central in-cluster Deployment of an OTel Collector acts as a gateway. The gateway exposes endpoints for the OpenTelemetry Protocol (OTLP) for gRPC and HTTP-based communication using the dedicated telemetry-otlp-metrics service, to which all Kyma modules and users' applications send the metrics data.
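As an illustration of how a workload could target this service, the following sketch sets the standard OpenTelemetry SDK environment variables on a container. The Deployment name, the image, and the kyma-system namespace in the service URL are assumptions made for this example; verify the actual namespace and port of the telemetry-otlp-metrics service in your cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app                  # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: app
          image: example.com/sample-app:latest   # placeholder image
          env:
            # Standard OTel SDK variables. The namespace "kyma-system" is an assumption.
            - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
              value: http://telemetry-otlp-metrics.kyma-system:4317
            - name: OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
              value: grpc
```

If the application exports over HTTP instead, the protocol value would typically be http/protobuf with port 4318. Note that OpenTelemetry SDKs use the signal-specific endpoint variable as-is for HTTP, so the URL usually needs the /v1/metrics path.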
Optionally, the Telemetry module provides a DaemonSet of an OTel Collector acting as an agent. This agent can retrieve metrics of a workload and the Istio sidecar in the Prometheus pull-based format and can provide runtime-specific metrics for the workload.
- An application (exposing metrics in OTLP) sends metrics to the central metric gateway service.
- An application (exposing metrics in Prometheus protocol) activates the agent to scrape the metrics with an annotation-based configuration.
- Additionally, you can activate the agent to pull metrics of each Istio sidecar.
- The agent supports collecting metrics from the Kubelet and Kubernetes APIServer.
- The agent converts and sends all collected metric data to the gateway in OTLP.
- The gateway discovers the metadata and enriches all received data with typical metadata of the source by communicating with the Kubernetes APIServer. Furthermore, it filters data according to the pipeline configuration.
- Telemetry Manager configures the agent and gateway according to the MetricPipeline resource specification, including the target backend for the metric gateway. Also, it observes the metrics flow to the backend and reports problems in the MetricPipeline status.
- The metric gateway sends the data to the observability system that's specified in your MetricPipeline resource: either within the Kyma cluster or, if authentication is set up, to an external observability backend.
- You can analyze the metric data with your preferred backend system.
The MetricPipeline resource is watched by Telemetry Manager, which is responsible for generating the custom parts of the OTel Collector configuration.
- Telemetry Manager watches all MetricPipeline resources and related Secrets.
- Furthermore, Telemetry Manager takes care of the full lifecycle of the gateway Deployment and the agent DaemonSet. The gateway and agent are deployed only if you defined a MetricPipeline.
- Whenever the user configuration changes, Telemetry Manager validates it and generates a single configuration for the gateway and agent.
- Referenced Secrets are copied into one Secret that is mounted to the gateway as well.
In a Kyma cluster, the metric gateway is the central component to which all components can send their individual metrics. The gateway collects, enriches, and dispatches the data to the configured backend. For more information, see Telemetry Gateways.
If a MetricPipeline configures a feature in the input section, an additional DaemonSet is deployed acting as an agent. The agent is also based on an OTel Collector and handles the collection and conversion of Prometheus-based metrics. For this, the workload puts a prometheus.io/scrape annotation on the specification of the Pod or Service, and the agent scrapes the annotated endpoint. The agent sends all data in OTLP to the central gateway.
In the following steps, you can see how to construct and deploy a typical MetricPipeline. Learn more about the available parameters and attributes.
To ship metrics to a new OTLP output, create a resource of the kind MetricPipeline and save the file (named, for example, metricpipeline.yaml).
This configures the underlying OTel Collector of the gateway with a pipeline for metrics. It defines that the receiver of the pipeline is of the OTLP type and is accessible with the telemetry-otlp-metrics service.
The default protocol is gRPC, but you can choose HTTP instead. Depending on the configured protocol, an otlp or an otlphttp exporter is used. Ensure that the correct port is configured as part of the endpoint. Typically, port 4317 is used for gRPC and port 4318 for HTTP.
For gRPC, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
For HTTP, use the protocol attribute:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      protocol: http
      endpoint:
        value: https://backend.example.com:4318
To integrate with external systems, you must configure authentication details. You can use mutual TLS (mTLS), Basic Authentication, or custom headers.
For mTLS, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      tls:
        cert:
          value: |
            -----BEGIN CERTIFICATE-----
            ...
        key:
          value: |
            -----BEGIN RSA PRIVATE KEY-----
            ...
For Basic Authentication, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      authentication:
        basic:
          user:
            value: myUser
          password:
            value: myPwd
For custom headers, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      headers:
        - name: Authorization
          prefix: Bearer
          value: "myToken"
Integrations with external systems usually need authentication details, which are sensitive data. To keep that data in Secrets, MetricPipeline supports references to Secrets.
Using the valueFrom attribute, you can map Secret keys for mutual TLS (mTLS), Basic Authentication, or custom headers.
For custom headers, you can store the token in the referenced Secret without any prefix or scheme and configure the prefix in the headers section of the MetricPipeline. In the custom headers example, the token has the prefix "Bearer".
For mTLS, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      tls:
        cert:
          valueFrom:
            secretKeyRef:
              name: backend
              namespace: default
              key: cert
        key:
          valueFrom:
            secretKeyRef:
              name: backend
              namespace: default
              key: key
For Basic Authentication, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        valueFrom:
          secretKeyRef:
            name: backend
            namespace: default
            key: endpoint
      authentication:
        basic:
          user:
            valueFrom:
              secretKeyRef:
                name: backend
                namespace: default
                key: user
          password:
            valueFrom:
              secretKeyRef:
                name: backend
                namespace: default
                key: password
For custom headers, use:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
      headers:
        - name: Authorization
          prefix: Bearer
          valueFrom:
            secretKeyRef:
              name: backend
              namespace: default
              key: token
The related Secret must have the referenced name, be located in the referenced namespace, and contain the mapped key. See the following example:
kind: Secret
apiVersion: v1
metadata:
  name: backend
  namespace: default
stringData:
  endpoint: https://backend.example.com:4317
  user: myUser
  password: XXX
  token: YYY
Telemetry Manager continuously watches the Secret referenced with the secretKeyRef construct. You can update the Secret’s values, and Telemetry Manager detects the changes and applies the new Secret to the setup.
Tip
If you use a Secret owned by the SAP BTP Service Operator, you can configure an automated rotation using a credentialsRotationPolicy with a specific rotationFrequency and don't have to intervene manually.
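As a rough sketch of what such a rotation setup might look like, the following ServiceBinding enables a rotation policy. The resource names, the referenced ServiceInstance, and the duration values are hypothetical, and the exact fields should be checked against the SAP BTP Service Operator documentation for your version. The keys that the operator writes into the Secret depend on the bound service and may differ from the keys used in the examples above.

```yaml
apiVersion: services.cloud.sap.com/v1
kind: ServiceBinding
metadata:
  name: backend-binding                   # hypothetical binding name
  namespace: default
spec:
  serviceInstanceName: backend-instance   # hypothetical ServiceInstance
  secretName: backend                     # Secret that the MetricPipeline references
  credentialsRotationPolicy:
    enabled: true
    rotationFrequency: 720h               # example value: rotate roughly every 30 days
    rotatedBindingTTL: 24h                # example value: keep the old credentials for a day
```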
Note
For the following approach, you must have instrumented your application using a library like the Prometheus client library, with a port in your workload exposed serving as a Prometheus metrics endpoint.
To enable collection of Prometheus-based metrics, define a MetricPipeline that has the prometheus section enabled as input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    prometheus:
      enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
The Metric agent is configured with a generic scrape configuration, which uses annotations to specify the endpoints to scrape in the cluster.
For metrics ingestion to start automatically, simply apply the following annotations either to a Service that resolves your metrics port, or directly to the Pod:
Annotation Key | Example Values | Default Value | Description |
---|---|---|---|
prometheus.io/scrape (mandatory) | true, false | none | Controls whether Prometheus automatically scrapes metrics from this target. |
prometheus.io/port (mandatory) | 8080, 9100 | none | Specifies the port where the metrics are exposed. |
prometheus.io/path | /metrics, /custom_metrics | /metrics | Defines the HTTP path where Prometheus can find metrics data. |
prometheus.io/scheme | http, https | If Istio is active, https is supported; otherwise, only http is available. The default scheme is http unless an Istio sidecar is present, denoted by the label security.istio.io/tlsMode=istio, in which case https becomes the default. | Determines the protocol used for scraping metrics: either HTTPS with mTLS or plain HTTP. |
If you're running the Pod targeted by a Service with Istio, Istio must be able to derive the appProtocol from the Service port definition; otherwise the communication for scraping the metric endpoint cannot be established. You must either prefix the port name with the protocol like in http-metrics, or explicitly define the appProtocol attribute.
For example, see the following Service configuration:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
  name: sample
spec:
  ports:
    - name: http-metrics
      appProtocol: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: sample
  type: ClusterIP
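Because the annotations can also be applied directly to the Pod, a minimal sketch of the Pod-level variant could look as follows. The Deployment name, labels, and image are placeholders chosen for this example; the optional path and scheme annotations are shown with the values from the table above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample                      # placeholder name
spec:
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
      annotations:
        # Annotation values must be strings, so quote numbers and booleans.
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"   # optional, shown for completeness
        prometheus.io/scheme: "http"     # optional, see the table for the default
    spec:
      containers:
        - name: sample
          image: example.com/sample:latest   # placeholder image
          ports:
            - name: http-metrics
              containerPort: 8080
```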
Note
The Metric agent can scrape endpoints even if the workload is a part of the Istio service mesh and accepts mTLS communication. However, there's a constraint: For scraping through HTTPS, Istio must configure the workload using 'STRICT' mTLS mode. Without 'STRICT' mTLS mode, you can set up scraping through HTTP by applying the annotation prometheus.io/scheme=http. For related troubleshooting, see Log entry: Failed to scrape Prometheus endpoint.
By default, a MetricPipeline emits metrics about the health of all pipelines managed by the Telemetry module. Based on these metrics, you can track the status of every individual pipeline and set up alerting for it.
Metrics for Pipelines and the Telemetry Module:
Metric | Description | Availability |
---|---|---|
kyma.resource.status.conditions | Value represents the status of the different conditions reported by the resource. Possible values are 1 ("True"), 0 ("False"), and -1 (other status values) | Available for both the pipeline and the Telemetry resource |
kyma.resource.status.state | Value represents the state of the resource (if present) | Available for the Telemetry resource |
Metric Attributes for Monitoring:
Name | Description |
---|---|
metric.attributes.type | Type of the condition |
metric.attributes.status | Status of the condition |
metric.attributes.reason | Contains a programmatic identifier indicating the reason for the condition's last transition |
To set up alerting, use an alert rule. In the following example, the alert is triggered if metrics are not delivered to the backend:
min by (k8s_resource_name) ((kyma_resource_status_conditions{type="TelemetryFlowHealthy",k8s_resource_kind="metricpipelines"})) == 0
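If your backend accepts Prometheus-style alerting rule files, the expression above could be wrapped as sketched below. The group name, alert name, the 5m duration, and the severity label are assumptions chosen for illustration; only the expression is taken from the example above.

```yaml
groups:
  - name: kyma-telemetry-pipelines             # hypothetical group name
    rules:
      - alert: MetricPipelineFlowUnhealthy     # hypothetical alert name
        # Fires when the TelemetryFlowHealthy condition of a MetricPipeline is "False" (value 0).
        expr: |
          min by (k8s_resource_name) (
            kyma_resource_status_conditions{type="TelemetryFlowHealthy",k8s_resource_kind="metricpipelines"}
          ) == 0
        for: 5m                                # example: tolerate short interruptions
        labels:
          severity: warning
        annotations:
          summary: "MetricPipeline {{ $labels.k8s_resource_name }} does not deliver metrics to the backend"
```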
To enable collection of runtime metrics, define a MetricPipeline that has the runtime section enabled as input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
By default, container and Pod metrics are collected.
To enable or disable the collection of metrics for a specific resource, use the resources section in the runtime input.
The following example collects only DaemonSet, Deployment, StatefulSet, and Job metrics:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      resources:
        pod:
          enabled: false
        container:
          enabled: false
        node:
          enabled: false
        volume:
          enabled: false
        daemonset:
          enabled: true
        deployment:
          enabled: true
        statefulset:
          enabled: true
        job:
          enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
If Pod metrics are enabled, the following metrics are collected:
- From the kubeletstatsreceiver:
k8s.pod.cpu.capacity
k8s.pod.cpu.usage
k8s.pod.filesystem.available
k8s.pod.filesystem.capacity
k8s.pod.filesystem.usage
k8s.pod.memory.available
k8s.pod.memory.major_page_faults
k8s.pod.memory.page_faults
k8s.pod.memory.rss
k8s.pod.memory.usage
k8s.pod.memory.working_set
k8s.pod.network.errors
k8s.pod.network.io
- From the k8sclusterreceiver:
k8s.pod.phase
If container metrics are enabled, the following metrics are collected:
- From the kubeletstatsreceiver:
container.cpu.time
container.cpu.usage
container.filesystem.available
container.filesystem.capacity
container.filesystem.usage
container.memory.available
container.memory.major_page_faults
container.memory.page_faults
container.memory.rss
container.memory.usage
container.memory.working_set
- From the k8sclusterreceiver:
k8s.container.cpu_request
k8s.container.cpu_limit
k8s.container.memory_request
k8s.container.memory_limit
If Node metrics are enabled, the following metrics are collected:
- From the kubeletstatsreceiver:
k8s.node.cpu.usage
k8s.node.filesystem.available
k8s.node.filesystem.capacity
k8s.node.filesystem.usage
k8s.node.memory.available
k8s.node.memory.usage
k8s.node.memory.rss
k8s.node.memory.working_set
If Volume metrics are enabled, the following metrics are collected:
- From the kubeletstatsreceiver:
k8s.volume.available
k8s.volume.capacity
k8s.volume.inodes
k8s.volume.inodes.free
k8s.volume.inodes.used
If Deployment metrics are enabled, the following metrics are collected:
- From the k8sclusterreceiver:
k8s.deployment.available
k8s.deployment.desired
If DaemonSet metrics are enabled, the following metrics are collected:
- From the k8sclusterreceiver:
k8s.daemonset.current_scheduled_nodes
k8s.daemonset.desired_scheduled_nodes
k8s.daemonset.misscheduled_nodes
k8s.daemonset.ready_nodes
If StatefulSet metrics are enabled, the following metrics are collected:
- From the k8sclusterreceiver:
k8s.statefulset.current_pods
k8s.statefulset.desired_pods
k8s.statefulset.ready_pods
k8s.statefulset.updated_pods
If Job metrics are enabled, the following metrics are collected:
- From the k8sclusterreceiver:
k8s.job.active_pods
k8s.job.desired_successful_pods
k8s.job.failed_pods
k8s.job.max_parallel_pods
k8s.job.successful_pods
To enable collection of Istio metrics, define a MetricPipeline that has the istio section enabled as input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    istio:
      enabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
With this, the agent starts collecting all Istio metrics from Istio sidecars.
By default, the otlp input is enabled.
To drop the push-based OTLP metrics that are received by the Metric gateway, define a MetricPipeline that has the otlp section disabled as an input:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    istio:
      enabled: true
    otlp:
      disabled: true
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
With this, the agent starts collecting all Istio metrics from Istio sidecars, and the push-based OTLP metrics are dropped.
To filter metrics by namespaces, define a MetricPipeline that has the namespaces section defined in one of the inputs. For example, you can specify the namespaces from which metrics are collected or the namespaces from which metrics are dropped. Learn more about the available parameters and attributes.
The following example collects runtime metrics only from the foo and bar namespaces:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      namespaces:
        include:
          - foo
          - bar
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
The following example collects runtime metrics from all namespaces except the foo and bar namespaces:
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    runtime:
      enabled: true
      namespaces:
        exclude:
          - foo
          - bar
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
Note
The default settings depend on the input:
If no namespace selector is defined for the prometheus or runtime input, then metrics from system namespaces are excluded by default.
However, if the namespace selector is not defined for the istio and otlp input, then metrics from system namespaces are included by default.
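Because the istio input includes system namespaces by default, you might want to exclude them explicitly. The following sketch does that with the exclude selector shown earlier. The listed namespaces are an assumption about what counts as a system namespace in your cluster, so adjust the list to your setup.

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  input:
    istio:
      enabled: true
      namespaces:
        # Assumed list of system namespaces; adapt it to your cluster.
        exclude:
          - kube-system
          - istio-system
          - kyma-system
  output:
    otlp:
      endpoint:
        value: https://backend.example.com:4317
```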
If you use the prometheus or istio input, typical scrape metrics are produced for every metric source, such as up, scrape_duration_seconds, scrape_samples_scraped, scrape_samples_post_metric_relabeling, and scrape_series_added.
By default, they are disabled.
If you want to use them for debugging and diagnostic purposes, you can activate them. To activate diagnostic metrics, define a MetricPipeline that has the diagnosticMetrics section defined.
- The following example collects diagnostic metrics only for input istio:
  apiVersion: telemetry.kyma-project.io/v1alpha1
  kind: MetricPipeline
  metadata:
    name: backend
  spec:
    input:
      istio:
        enabled: true
        diagnosticMetrics:
          enabled: true
    output:
      otlp:
        endpoint:
          value: https://backend.example.com:4317
- The following example collects diagnostic metrics only for input prometheus:
  apiVersion: telemetry.kyma-project.io/v1alpha1
  kind: MetricPipeline
  metadata:
    name: backend
  spec:
    input:
      prometheus:
        enabled: true
        diagnosticMetrics:
          enabled: true
    output:
      otlp:
        endpoint:
          value: https://backend.example.com:4317
Note
Diagnostic metrics are only available for inputs prometheus and istio. Learn more about the available parameters and attributes.
To activate the MetricPipeline, apply the metricpipeline.yaml resource file in your cluster:
kubectl apply -f metricpipeline.yaml
You activated a MetricPipeline and metrics start streaming to your backend.
To check that the pipeline is running, wait until the status conditions of the MetricPipeline in your cluster have status True:
kubectl get metricpipeline
NAME      CONFIGURATION GENERATED   GATEWAY HEALTHY   AGENT HEALTHY   FLOW HEALTHY
backend   True                      True              True            True
A MetricPipeline runs several OTel Collector instances in your cluster. This Deployment serves OTLP endpoints and ships received data to the configured backend.
The Telemetry module ensures that the OTel Collector instances are operational and healthy at any time, for example, with buffering and retries. However, there may be situations when the instances drop metrics, or cannot handle the metric load.
To detect and fix such situations, check the pipeline status and check out Troubleshooting.
- Throughput: Assuming an average metric with 20 metric data points and 10 labels, the default metric gateway setup has a maximum throughput of 34K metric data points/sec. If more data is sent to the gateway, it is refused. To increase the maximum throughput, manually scale out the gateway by increasing the number of replicas for the Metric gateway. The metric agent setup has a maximum throughput of 14K metric data points/sec per instance. If more data must be ingested, it is refused. If a metric data endpoint emits more than 50,000 metric data points per scrape loop, the metric agent refuses all the data.
- Load Balancing With Istio: To ensure availability, the metric gateway runs with multiple instances. If you want to increase the maximum throughput, use manual scaling and enter a higher number of instances. By design, the connections to the gateway are long-living connections (because OTLP is based on gRPC and HTTP/2). For optimal scaling of the gateway, the clients or applications must balance the connections across the available instances, which is automatically achieved if you use an Istio sidecar. If your application has no Istio sidecar, the data is always sent to one instance of the gateway.
- Unavailability of Output: If the destination is unavailable, sending the data is retried for up to 5 minutes. After that, the data is dropped.
- No Guaranteed Delivery: The used buffers are volatile. If the gateway or agent instances crash, metric data can be lost.
- Multiple MetricPipeline Support: The maximum number of MetricPipeline resources is 3.
Symptom:
- No metrics arrive at the backend.
- In the MetricPipeline status, the TelemetryFlowHealthy condition has status AllDataDropped.
Cause: Incorrect backend endpoint configuration (such as using the wrong authentication credentials) or the backend is unreachable.
Remedy:
- Check the telemetry-metric-gateway Pods for error logs by calling kubectl logs -n kyma-system {POD_NAME}.
- Check if the backend is up and reachable.
- Fix the errors.
Symptom:
- The backend is reachable and the connection is properly configured, but some metrics are refused.
- In the MetricPipeline status, the TelemetryFlowHealthy condition has status SomeDataDropped.
Cause: This can happen for a variety of reasons, for example, because the backend is limiting the ingestion rate.
Remedy:
- Check the telemetry-metric-gateway Pods for error logs by calling kubectl logs -n kyma-system {POD_NAME}. Also, check your observability backend to investigate potential causes.
- If the backend is limiting the rate by refusing metrics, try the options described in Gateway Buffer Filling Up.
- Otherwise, take the actions appropriate to the cause indicated in the logs.
Symptom: Custom metrics don't arrive at the backend, but Istio metrics do.
Cause: Your SDK version is incompatible with the OTel Collector version.
Remedy:
- Check which SDK version you are using for instrumentation.
- Investigate whether it is compatible with the OTel Collector version.
- If required, upgrade to a supported SDK version.
Symptom: Custom metrics don't arrive at the destination and the OTel Collector produces log entries saying "Failed to scrape Prometheus endpoint", such as the following example:
2023-08-29T09:53:07.123Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/app-pods", "data_type": "metrics", "scrape_timestamp": 1693302787120, "target_labels": "{__name__=\"up\", instance=\"10.42.0.18:8080\", job=\"app-pods\"}"}
Cause 1: The workload is not configured to use 'STRICT' mTLS mode. For details, see Activate Prometheus-based metrics.
Remedy 1: You can either set up 'STRICT' mTLS mode or HTTP scraping:
- Configure the workload using 'STRICT' mTLS mode (for example, by applying a corresponding PeerAuthentication).
- Set up scraping through HTTP by applying the prometheus.io/scheme=http annotation.
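For the first option, a PeerAuthentication along the following lines could be applied. This is a minimal sketch: the resource name, namespace, and the app: sample selector are placeholders that must match your workload.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: sample-strict-mtls     # placeholder name
  namespace: default           # namespace of the workload to be scraped
spec:
  # Restrict the policy to the workload; omit the selector to cover the whole namespace.
  selector:
    matchLabels:
      app: sample
  mtls:
    mode: STRICT
```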
Cause 2: The Service definition enabling the scrape with Prometheus annotations does not reveal the application protocol to use in the port definition. For details, see Activate Prometheus-based metrics.
Remedy 2: Define the application protocol in the Service port definition, either by prefixing the port name with the protocol, like in http-metrics, or by defining the appProtocol attribute.
Symptom: In the MetricPipeline status, the TelemetryFlowHealthy condition has status BufferFillingUp.
Cause: The backend export rate is too low compared to the gateway ingestion rate.
Remedy:
- Option 1: Increase the maximum backend ingestion rate, for example, by scaling out the SAP Cloud Logging instances.
- Option 2: Reduce emitted metrics by re-configuring the MetricPipeline (for example, by disabling certain inputs or applying namespace filters).
- Option 3: Reduce emitted metrics in your applications.
Symptom: In the MetricPipeline status, the TelemetryFlowHealthy condition has status GatewayThrottling.
Cause: The gateway cannot receive metrics at the given rate.
Remedy: Manually scale out the gateway by increasing the number of replicas for the Metric gateway. See Module Configuration and Status.
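As a sketch of what manual scaling might look like, the following Telemetry resource sets a static replica count for the metric gateway. The resource name default, the kyma-system namespace, the field layout under spec.metric.gateway.scaling, and the replica count are assumptions based on typical module configuration; confirm them in Module Configuration and Status before applying.

```yaml
apiVersion: operator.kyma-project.io/v1alpha1
kind: Telemetry
metadata:
  name: default              # assumed name of the module resource
  namespace: kyma-system     # assumed namespace
spec:
  metric:
    gateway:
      scaling:
        type: Static
        static:
          replicas: 4        # example value; size it to your ingestion rate
```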