
health_check doesn't work #3688

Open
JorTurFer opened this issue Feb 3, 2025 · 3 comments
Labels
bug (Something isn't working), needs triage

Comments

@JorTurFer
Contributor

JorTurFer commented Feb 3, 2025

Component(s)

collector

What happened?

Description

While upgrading the operator, I noticed that since v0.104.0 the health_check extension doesn't work: the probes fail all the time.

Steps to Reproduce

  1. Install the operator v0.117.0 (this has happened since v0.104.0)
  2. Deploy this YAML (copied from the README, with the health_check extension added):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: collector-with-ta
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
  config:
    extensions:
      health_check: {}
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ '0.0.0.0:8888' ]
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
              replacement: $$1
    exporters:
      debug: {}
    service:
      extensions:
        - health_check
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```

Expected Result

The collector starts

Actual Result

The collector doesn't start because the probes return 502

Kubernetes Version

1.29 and 1.31

Operator version

0.104.0

Collector version

0.104.0

Environment information

No response

Log output

Pod logs:

```
2025-02-03T13:50:15.076Z	info	[email protected]/service.go:115	Setting up own telemetry...
2025-02-03T13:50:15.076Z	info	[email protected]/telemetry.go:96	Serving metrics	{"address": "0.0.0.0:8888", "level": "Normal"}
2025-02-03T13:50:15.076Z	info	[email protected]/exporter.go:280	Development component. May change in the future.	{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2025-02-03T13:50:15.078Z	info	[email protected]/service.go:193	Starting otelcol-contrib...	{"Version": "0.104.0", "NumCPU": 4}
2025-02-03T13:50:15.078Z	info	extensions/extensions.go:34	Starting extensions...
2025-02-03T13:50:15.078Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "health_check"}
2025-02-03T13:50:15.078Z	info	[email protected]/healthcheckextension.go:32	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2025-02-03T13:50:15.079Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "health_check"}
2025-02-03T13:50:15.080Z	info	[email protected]/metrics_receiver.go:279	Starting discovery manager	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2025-02-03T13:50:15.080Z	info	[email protected]/metrics_receiver.go:121	Starting target allocator discovery	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2025-02-03T13:50:15.083Z	info	[email protected]/metrics_receiver.go:257	Scrape job added	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "otel-collector"}
2025-02-03T13:50:15.083Z	info	healthcheck/handler.go:132	Health Check state change	{"kind": "extension", "name": "health_check", "status": "ready"}
2025-02-03T13:50:15.084Z	info	[email protected]/service.go:219	Everything is ready. Begin running and processing data.
2025-02-03T13:50:15.083Z	info	[email protected]/metrics_receiver.go:344	Starting scrape manager	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
```

Events:

```
Liveness probe failed: Get "http://10.96.7.107:13133/": dial tcp 10.96.7.107:13133: connect: connection refused
```

Additional context

No response

@JorTurFer added the bug and needs triage labels on Feb 3, 2025
@jaronoff97
Contributor

I think this may be because the health_check extension is binding to localhost rather than 0.0.0.0 or the pod's IP address, which prevents it from responding to the kubelet's probes. There's other documentation for this somewhere, but the recommendation is to set it to the pod's own IP, as is done in the helm chart here
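For reference, a minimal sketch of that change in the collector CR (the exact endpoint values here are assumptions, not the operator's defaults; the log above shows the extension binding to `localhost:13133`):

```yaml
# Bind the health_check extension to all interfaces so the kubelet's
# HTTP probes (which target the pod IP, e.g. 10.96.7.107) can reach it.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

# Alternatively, bind to the pod's own IP injected via the downward API
# (assumes an env var such as MY_POD_IP is defined on the container):
#
# extensions:
#   health_check:
#     endpoint: ${env:MY_POD_IP}:13133
```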

@JorTurFer
Contributor Author

This actually solved the issue! Thanks for pointing me there ❤
Does it make sense for the operator to add this configuration automatically, so that it works out of the box? If not, I'll just close the issue

@jaronoff97
Contributor

Hm... I think that would be reasonable; we already do something similar here
