Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WorkerNameMetricsExporter to not export unless otel is enabled. #5869

Merged
merged 1 commit into from
Oct 7, 2024

Conversation

ggainey
Copy link
Contributor

@ggainey ggainey commented Oct 6, 2024

No description provided.

@lubosmj
Copy link
Member

lubosmj commented Oct 7, 2024

Am I understanding this correctly that the idea is to get rid of the following error in the logs?

2024-10-07T03:30:38.9240394Z Traceback (most recent call last):
2024-10-07T03:30:38.9241022Z   File "/usr/local/lib/python3.9/site-packages/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 541, in _receive_metrics
2024-10-07T03:30:38.9241127Z     self._exporter.export(
2024-10-07T03:30:38.9241619Z   File "/usr/local/lib/python3.9/site-packages/pulpcore/app/wsgi.py", line 34, in export
2024-10-07T03:30:38.9241819Z     return super().export(metrics_data, timeout_millis, **kwargs)
2024-10-07T03:30:38.9242478Z   File "/usr/local/lib/python3.9/site-packages/opentelemetry/exporter/otlp/proto/http/metric_exporter/__init__.py", line 203, in export
2024-10-07T03:30:38.9242670Z     resp = self._export(serialized_data.SerializeToString())
2024-10-07T03:30:38.9243343Z   File "/usr/local/lib/python3.9/site-packages/opentelemetry/exporter/otlp/proto/http/metric_exporter/__init__.py", line 173, in _export
2024-10-07T03:30:38.9243452Z     return self._session.post(
2024-10-07T03:30:38.9243833Z   File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 637, in post
2024-10-07T03:30:38.9244048Z     return self.request("POST", url, data=data, json=json, **kwargs)
2024-10-07T03:30:38.9244441Z   File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
2024-10-07T03:30:38.9244568Z     resp = self.send(prep, **send_kwargs)
2024-10-07T03:30:38.9244932Z   File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
2024-10-07T03:30:38.9245052Z     r = adapter.send(request, **kwargs)
2024-10-07T03:30:38.9245411Z   File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 700, in send
2024-10-07T03:30:38.9245570Z     raise ConnectionError(e, request=request)
2024-10-07T03:30:38.9246901Z requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=4318): Max retries exceeded with url: /v1/metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb58b42bfd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
2024-10-07T03:30:38.9247653Z pulp [f277dd6008674f80acb0b772d7adb68b]: pulpcore.tasking.tasks:INFO: Starting task 019264fc-211f-718a-a11a-6707b9213d64 in domain: default
2024-10-07T03:30:38.9249247Z pulp [None]: opentelemetry.sdk.metrics._internal.export:ERROR: Exception while exporting metrics HTTPConnectionPool(host='localhost', port=4318): Max retries exceeded with url: /v1/metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb58b584190>: Failed to establish a new connection: [Errno 111] Connection refused'))

@lubosmj
Copy link
Member

lubosmj commented Oct 7, 2024

I think that this piece shows one of the many side effects triggered by the improper configuration of OTEL. We should investigate why we still do report some metrics. OTEL_SDK_DISABLED should be set to False by default (https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/). Thus, no metrics coming from the SDK should be emitted once we flip it to True. Would you mind trying that instead? Perhaps, we could put it here:

pulp_env:
.

cc: @decko

@mdellweg
Copy link
Member

mdellweg commented Oct 7, 2024

I still don't get, why this middleware is so special, it cannot be configured like any other in the django config.

@lubosmj
Copy link
Member

lubosmj commented Oct 7, 2024

@mdellweg, we aim to pass custom attributes into the autoinstrumentation, such as the worker name (pid@socketname) and service name (pulp-app), to help users accurately interpret the metrics from the Prometheus database. For example, when running multiple pods and workers in a cloud environment, Prometheus (and its backup) cannot properly distinguish between metrics coming from different sources, making them interchangeable and hard to identify.

To address this, it might be also beneficial to make this change on upstream.

@ggainey
Copy link
Contributor Author

ggainey commented Oct 7, 2024

Am I understanding this correctly that the idea is to get rid of the following error in the logs?

Yes - it makes the logs basically useless, as it repeats every...minute? or so? And, esp when doing development, means I have to repeatedly scan the traceback to see if it's "my" broken code traceback, or a meaningless error.

@mdellweg
Copy link
Member

mdellweg commented Oct 7, 2024

I guess the real answer is here:
Django middleware is not wsgi middleware.
https://docs.djangoproject.com/en/5.1/howto/deployment/wsgi/#applying-wsgi-middleware

But first and foremost, the OTEL is not holding the promise to not interfere when not configured, and this PR fixes that.

pulpcore/app/wsgi.py Outdated Show resolved Hide resolved
@ggainey
Copy link
Contributor Author

ggainey commented Oct 7, 2024

I think that this piece shows one of the many side effects triggered by the improper configuration of OTEL. We should investigate why we still do report some metrics. OTEL_SDK_DISABLED should be set to False by default (https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/). Thus, no metrics coming from the SDK should be emitted once we flip it to True. Would you mind trying that instead? Perhaps, we could put it here:

pulp_env:

.

cc: @decko

I don't know - but I do know that we use this pattern in two other places in core, which is where I stole the pattern from:

Personally, I don't care how it gets fixed - but I need it fixed, like, now - this makes it really hard to develop/debug code.

@ggainey
Copy link
Contributor Author

ggainey commented Oct 7, 2024

Would you mind trying that instead? Perhaps, we could put it here:

Changing things in the template requires rebuilding the containers, yeah? I am in the middle of testing a thing that requires a certain amount of data setup, and would rather not, honestly.

@mdellweg mdellweg enabled auto-merge (rebase) October 7, 2024 13:00
@mdellweg mdellweg merged commit f5c439c into pulp:main Oct 7, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants