Feature request: semantic conventions for (non-rejection) ingestion errors leading to truncation/mutation #1098
Comments
Does anyone know if the |
For 2:
This is already possible with the OTLP Partial success spec. That can tell how many were accepted vs rejected and, via the error_message, back-ends can give info on what was limited, etc. We had discussions about introducing a more fine-grained, typed response, but that can get complicated very quickly: for example, the receiver would need to keep an index or some sort of order to tell the exporter which log/metric/span was rejected and why. See open-telemetry/opentelemetry-proto#470 and open-telemetry/opentelemetry-proto#404 for some prior discussions.
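As a rough illustration of what a partial success response carries, the sketch below models it with a plain Python dataclass rather than the generated protobuf classes. The field names `rejected_spans` and `error_message` follow the OTLP specification; the helper `summarize_export` and its output wording are hypothetical, made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ExportTracePartialSuccess:
    """Stand-in for the OTLP message of the same name (not the generated class)."""
    rejected_spans: int = 0
    error_message: str = ""


def summarize_export(sent: int, partial: ExportTracePartialSuccess) -> str:
    """Hypothetical helper: turn a partial success into a log line for the exporter."""
    accepted = sent - partial.rejected_spans
    if partial.rejected_spans == 0 and not partial.error_message:
        return f"accepted {accepted}/{sent} spans"
    if partial.rejected_spans == 0:
        # Nothing was rejected, but the server attached a warning,
        # e.g. because data was accepted in altered form.
        return f"accepted {accepted}/{sent} spans with warning: {partial.error_message}"
    return (
        f"accepted {accepted}/{sent} spans, "
        f"rejected {partial.rejected_spans}: {partial.error_message}"
    )
```

Note that the response holds only a count and a free-form string, so on its own it cannot say which span, metric, or log was affected.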
Thanks for pointing to the partial success spec. This request is intended
to cover gaps that exist with the current specification...
Firstly, partial success addresses batch-level failure; it is possible to
accept part of a batch. However, items within the batch are either accepted
or rejected. There isn't a way to partially accept an individual span such
as by accepting some of its attributes but not others.
Secondly, partial success/failure reports the failure to the client which
may or may not be logging these failures in a way that is visible or
obvious to downstream viewers/consumers of the information. When a span has
been modified to become accepted, it is desirable for the warnings or
errors related to its ingestion (and the fact that the data may not be
100% faithful to what was originally written) to be surfaced and easily
available in whatever context the span is available/displayed.
…On Mon, Jun 17, 2024, 7:52 AM Joao Grassi ***@***.***> wrote:
For 2:
Surface partial acceptance/mutation/truncation some other,
vendor-specific, non-standard way.
This is already possible with the OTLP Partial success spec
<https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#partial-success>.
That can tell how many were accepted vs rejected, and, via the
error_message back-ends can give info on what was limited and etc.
We had discussion about introducing more fine-grained, typed response but
that can get complicated very quickly - for example the receiver would need
to keep an index or some sort of order to tell the exporter which
log/metric/span was rejected and why. See this
<open-telemetry/opentelemetry-proto#470> and
this <open-telemetry/opentelemetry-proto#404>
for some prior discussions.
Hmm, why do you think there isn't a way? This is solely the responsibility of the receiver. For example, in Dynatrace we have validations on the attributes and their format. We accept telemetry that has invalid attributes by either dropping them or massaging them to fit our requirements. In both cases our OTLP APIs return a partial success, where these changes to the original telemetry are reported to the client. The partial success spec partially covers such cases - the
This should be implemented in each SDK and, to my knowledge, it is. There's an entry in the compatibility matrix, so that can be used to keep track (not sure if it's up to date). We also created issues in each repo to tell them SDKs should log partial success messages. My feeling is that while it would be possible to come up with consistent conventions to surface such errors/warnings, I'm not sure we should, or that it even makes sense to do it. As I said, to be able to pinpoint exactly which span/metric/log had issues, OTLP receivers need to keep state, and all of this puts pressure on them. In high-load scenarios this is definitely not ideal. I feel what we have now with partial success is a good middle ground that offers enough info to troubleshoot and identify problems in the telemetry.
@michaelsafyan ping on this. I'm inclined to close this as nothing to do, but please let me know if you'd like to continue this discussion or have other arguments.
Area(s)
area:telemetry
Is your change request related to a problem? Please describe.
If a backend telemetry system has certain limits on the size, number, etc. of attributes, it is possible to fail ingestion of the entire batch or of individual signals in the batch, but there is no standard way to surface these issues to users while accepting the signal albeit with some kind of truncation.
Describe the solution you'd like
I would suggest standardizing certain attributes related to instrumentation ingestion that provide a way to indicate that the signal was accepted but had to be mutated/modified/truncated in order to be accepted by the system.
As a strawman proposal, something along the lines of:
Standardize telemetry.backend.ingestion_log as the name of a special span event to be created by backends.
Standardize the following attributes of telemetry.backend.ingestion_log:
- severity: ERROR or WARNING
- subject: names the property it is about (e.g. resource.attributes, span.attributes, events[0].attributes)
- message: free-form message (e.g. "Limit exceeded", "Nearing limit", etc.)
- limit: the numeric value of the limit if having to do with a limit
- unit: if not a count, specifies the unit (e.g. bytes)
- actual: the observed value if available
- consequences: a list of records that describe various consequences like:
Describe alternatives you've considered
With respect to 1, it would be good to give users more control over the behavior. A possible option would be to be lenient by default, but to support certain additional headers or options in OTLP for enabling stricter validation. If both modes exist, then there needs to be some way to surface ingestion errors.
With respect to 2, it is useful to have standardization here, especially as it relates to export; if truncation happens before export, it is useful for the errors to be represented in the OTel data format rather than some format outside of OTel.
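To make the strawman more concrete, here is a minimal sketch of how a backend (or an exporter truncating before export) might record the proposed event when it trims span attributes. It uses plain Python dataclasses, not the OpenTelemetry SDK. The event name and attribute keys come from the proposal in this issue; `SpanEvent`, `truncate_attributes`, and the `strict` flag are hypothetical names for illustration, and `consequences` is left out because its record shape is not spelled out here.

```python
from dataclasses import dataclass, field
from typing import Any

EVENT_NAME = "telemetry.backend.ingestion_log"  # name from the strawman proposal


@dataclass
class SpanEvent:
    """Hypothetical stand-in for a span event (name plus attributes)."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


def truncate_attributes(attrs: dict[str, Any], max_count: int, strict: bool = False):
    """Keep at most max_count attributes; return (kept, event or None).

    In strict mode (an assumed opt-in), exceeding the limit raises instead
    of silently mutating the span.
    """
    if len(attrs) <= max_count:
        return dict(attrs), None
    if strict:
        raise ValueError(
            f"span.attributes has {len(attrs)} entries, limit is {max_count}"
        )
    # Lenient mode: keep the first max_count attributes in insertion order.
    kept = dict(list(attrs.items())[:max_count])
    event = SpanEvent(
        EVENT_NAME,
        {
            "severity": "WARNING",
            "subject": "span.attributes",
            "message": "Limit exceeded",
            "limit": max_count,  # a count, so no "unit" attribute is set
            "actual": len(attrs),
        },
    )
    return kept, event
```

Attaching the returned event to the stored span would keep the warning visible wherever the span is displayed, which is the gap this request describes.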
Additional context
No response