
health_check doesn't work #3688

Open
JorTurFer opened this issue Feb 3, 2025 · 3 comments
Labels
bug (Something isn't working), needs triage

Comments

@JorTurFer
Contributor

JorTurFer commented Feb 3, 2025

Component(s)

collector

What happened?

Description

While upgrading the operator, I noticed that since v0.104.0 the health_check extension doesn't work: the probes fail all the time.

Steps to Reproduce

  1. Install the operator v0.117.0 (this has happened since v0.104.0)
  2. Deploy this YAML (copied from the README, with the health_check extension added):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: collector-with-ta
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
  config:
    extensions:
      health_check: {}
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ '0.0.0.0:8888' ]
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
              replacement: $$1
    exporters:
      debug: {}
    service:
      extensions:
        - health_check
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```

Expected Result

The collector starts

Actual Result

The collector doesn't start because the probes return 502

Kubernetes Version

1.29 and 1.31

Operator version

0.104.0

Collector version

0.104.0

Environment information

No response

Log output

Pod logs:

```
2025-02-03T13:50:15.076Z	info	[email protected]/service.go:115	Setting up own telemetry...
2025-02-03T13:50:15.076Z	info	[email protected]/telemetry.go:96	Serving metrics	{"address": "0.0.0.0:8888", "level": "Normal"}
2025-02-03T13:50:15.076Z	info	[email protected]/exporter.go:280	Development component. May change in the future.	{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2025-02-03T13:50:15.078Z	info	[email protected]/service.go:193	Starting otelcol-contrib...	{"Version": "0.104.0", "NumCPU": 4}
2025-02-03T13:50:15.078Z	info	extensions/extensions.go:34	Starting extensions...
2025-02-03T13:50:15.078Z	info	extensions/extensions.go:37	Extension is starting...	{"kind": "extension", "name": "health_check"}
2025-02-03T13:50:15.078Z	info	[email protected]/healthcheckextension.go:32	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2025-02-03T13:50:15.079Z	info	extensions/extensions.go:52	Extension started.	{"kind": "extension", "name": "health_check"}
2025-02-03T13:50:15.080Z	info	[email protected]/metrics_receiver.go:279	Starting discovery manager	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2025-02-03T13:50:15.080Z	info	[email protected]/metrics_receiver.go:121	Starting target allocator discovery	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
2025-02-03T13:50:15.083Z	info	[email protected]/metrics_receiver.go:257	Scrape job added	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "otel-collector"}
2025-02-03T13:50:15.083Z	info	healthcheck/handler.go:132	Health Check state change	{"kind": "extension", "name": "health_check", "status": "ready"}
2025-02-03T13:50:15.084Z	info	[email protected]/service.go:219	Everything is ready. Begin running and processing data.
2025-02-03T13:50:15.083Z	info	[email protected]/metrics_receiver.go:344	Starting scrape manager	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
```

Events:

```
Liveness probe failed: Get "http://10.96.7.107:13133/": dial tcp 10.96.7.107:13133: connect: connection refused
```

Additional context

No response

@JorTurFer added the bug and needs triage labels on Feb 3, 2025
@jaronoff97
Contributor

I think this may be because the health_check extension is binding to localhost rather than 0.0.0.0 or the pod's IP address, which prevents it from responding to the kubelet's probes. There's other documentation for this somewhere, but the recommendation is to set it to the pod's own IP, as is done in the helm chart here
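For reference, a minimal sketch of that change in the collector CR (the exact endpoint values here are assumptions, not the operator's defaults; the log above shows the extension binding to `localhost:13133`):

```yaml
# Bind the health_check extension to all interfaces so the kubelet's
# HTTP probes (which target the pod IP, e.g. 10.96.7.107) can reach it.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

# Alternatively, bind to the pod's own IP injected via the downward API
# (assumes an env var such as MY_POD_IP is defined on the container):
#
# extensions:
#   health_check:
#     endpoint: ${env:MY_POD_IP}:13133
```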

@JorTurFer
Contributor Author

This actually solved the issue! Thanks for pointing me there ❤
Does it make sense for the operator to add this configuration automatically, so that it works out of the box? If not, I'll just close the issue

@jaronoff97
Contributor

Hm... I think that would be reasonable; we already do something similar here
