Feature request: semantic conventions for (non-rejection) ingestion errors leading to truncation/mutation #1098

Open
michaelsafyan opened this issue May 30, 2024 · 6 comments
Labels
area:telemetry enhancement New feature or request

Comments

@michaelsafyan
Contributor

Area(s)

area:telemetry

Is your change request related to a problem? Please describe.

If a backend telemetry system enforces limits on attributes (size, count, etc.), it can fail ingestion of the entire batch or of individual signals in the batch. There is, however, no standard way to accept a signal in truncated or otherwise modified form and surface that fact to users.

Describe the solution you'd like

I would suggest standardizing certain attributes related to instrumentation ingestion that provide a way to indicate that the signal was accepted but had to be mutated/modified/truncated in order to be accepted by the system.

As a strawman proposal, something along the lines of:

  • Standardize telemetry.backend.ingestion_log as the name of a special span event to be created by backends.

  • Standardize the following attributes of telemetry.backend.ingestion_log:

    • severity: ERROR or WARNING

    • subject: names the property it is about (e.g. resource.attributes, span.attributes, events[0].attributes)

    • message: free-form message (e.g. "Limit exceeded", "Nearing limit", etc.)

    • limit: the numeric value of the limit if having to do with a limit

    • unit: if not a count, specifies the unit (e.g. bytes)

    • actual: the observed value if available

    • consequences: a list of records that describe various consequences like:

         ```
         {
             "type": "CONTAINER_DROPPED",  # e.g. dropped entire event
         }
         ```
         ```
         {
             "type": "ITEMS_DROPPED",  # e.g. dropped entire attributes
             "count": 50,
         }
         ```
         ```
         {
             "type": "ITEMS_TRUNCATED",  # e.g. attributes kept but modified
             "count": 50,
         }
         ```
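
To make the strawman above more concrete, here is a minimal sketch of how a backend might assemble such an event. It uses plain Python data structures rather than any OpenTelemetry SDK API; the class names `IngestionLog` and `Consequence`, the `to_event` helper, and the example values (a limit of 128, an observed count of 178) are assumptions made for illustration only.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Event name as proposed in this issue (a strawman, not an accepted convention).
EVENT_NAME = "telemetry.backend.ingestion_log"


@dataclass
class Consequence:
    type: str                    # e.g. "CONTAINER_DROPPED", "ITEMS_DROPPED"
    count: Optional[int] = None  # left unset for CONTAINER_DROPPED


@dataclass
class IngestionLog:
    severity: str                # "ERROR" or "WARNING"
    subject: str                 # e.g. "span.attributes"
    message: str                 # free-form message
    limit: Optional[int] = None
    unit: Optional[str] = None   # only set when the limit is not a count
    actual: Optional[int] = None
    consequences: List[Consequence] = field(default_factory=list)

    def to_event(self) -> dict:
        """Render as a span-event-like dict, omitting unset fields."""
        attrs = {k: v for k, v in asdict(self).items() if v is not None}
        attrs["consequences"] = [
            {k: v for k, v in c.items() if v is not None}
            for c in attrs["consequences"]
        ]
        return {"name": EVENT_NAME, "attributes": attrs}


# Hypothetical case: 178 span attributes arrived, the backend keeps 128.
event = IngestionLog(
    severity="WARNING",
    subject="span.attributes",
    message="Limit exceeded",
    limit=128,
    actual=178,
    consequences=[Consequence(type="ITEMS_DROPPED", count=50)],
).to_event()
```

Unset fields such as `unit` are omitted from the rendered event, which matches the "if not a count" and "if available" wording of the proposal.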
      

Describe alternatives you've considered

  1. More strict validation by backends (either accept or reject entire spans in whole).
  2. Surface partial acceptance/mutation/truncation some other, vendor-specific, non-standard way.

With respect to 1, it would be good to give users more control over the behavior. A possible option would be to be lenient by default, but to support certain additional headers or options in OTLP for enabling stricter validation. If both modes exist, then there needs to be some way to surface ingestion errors.

With respect to 2, it is useful to have standardization here, especially as it relates to export; if truncation happens before export, it is useful for the errors to be represented in the OTel data format rather than some format outside of OTel.

Additional context

No response

@michaelsafyan michaelsafyan added the enhancement, experts needed, and triage:needs-triage labels May 30, 2024
@dashpole
Contributor

Does anyone know if the dropped_attributes_count, etc. fields are intended to be modified outside of the SDK? For example, if a lower attribute count limit is imposed by a backend, can/should it increment the dropped_attributes_count?
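
For illustration only, this is roughly what it would look like if a backend did adjust the counter. Whether the spec intends this is exactly the open question; the `Span` class below is a simplified stand-in rather than the real OTLP protobuf type, and `enforce_backend_limit` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Span:
    """Simplified stand-in for an OTLP span (not the real protobuf message)."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    dropped_attributes_count: int = 0  # may already be non-zero from the SDK


def enforce_backend_limit(span: Span, max_attributes: int) -> int:
    """Keep the first max_attributes attributes and add the rest to the counter.

    Returns the number of attributes dropped by this call.
    """
    excess = len(span.attributes) - max_attributes
    if excess <= 0:
        return 0
    kept = list(span.attributes.items())[:max_attributes]
    span.attributes = dict(kept)
    span.dropped_attributes_count += excess
    return excess


# The SDK already dropped 2 attributes; the backend then enforces a limit of 4.
span = Span("GET /", {f"k{i}": "v" for i in range(10)}, dropped_attributes_count=2)
dropped = enforce_backend_limit(span, max_attributes=4)
```

After the call the counter mixes SDK-side and backend-side drops, so a consumer can no longer tell where the loss happened.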

@joaopgrassi
Member

joaopgrassi commented Jun 17, 2024

For 2:

Surface partial acceptance/mutation/truncation some other, vendor-specific, non-standard way.

This is already possible with the OTLP Partial success spec. It can report how many items were accepted vs. rejected and, via the error_message field, back-ends can give info on what was limited, etc.

We had discussions about introducing more fine-grained, typed responses, but that can get complicated very quickly. For example, the receiver would need to keep an index or some sort of order to tell the exporter which log/metric/span was rejected and why. See this and this for some prior discussions.

@michaelsafyan
Contributor Author

michaelsafyan commented Jun 17, 2024 via email

@joaopgrassi
Member

joaopgrassi commented Jun 18, 2024

There isn't a way to partially accept an individual span such
as by accepting some of its attributes but not others.

Hmm, why do you think there isn't a way? This is solely the responsibility of the receiver. For example, in Dynatrace we have validations on the attributes and their format. We accept telemetry that has invalid attributes by either dropping them or massaging them to fit our requirements. In both cases our OTLP APIs return a partial success, where these changes to the original telemetry are reported to the client. The partial success spec partially covers such cases; the error_message field can be used to surface such things:

Servers MAY also use the partial_success field to convey warnings/suggestions to clients even when the server fully accepts the request. In such cases, the rejected_<signal> field MUST have a value of 0, and the error_message field MUST be non-empty.
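
As a rough sketch of how a client could tell these cases apart for traces: the `PartialSuccess` class below is a simplified stand-in for the OTLP partial success message (using its `rejected_spans` and `error_message` field names), and `classify` is a hypothetical helper, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartialSuccess:
    """Simplified stand-in for the OTLP trace partial success message."""
    rejected_spans: int = 0
    error_message: str = ""


def classify(partial: Optional[PartialSuccess]) -> str:
    """Classify an export response by its partial_success field."""
    if partial is None or (partial.rejected_spans == 0 and not partial.error_message):
        return "success"          # nothing to report
    if partial.rejected_spans == 0:
        return "warning"          # fully accepted, but the server left a message
    return "partial_failure"      # some spans were rejected


# Hypothetical response: everything accepted, some attributes were modified.
outcome = classify(PartialSuccess(0, "3 attributes truncated to 255 characters"))
```

The "warning" branch is the one that matters for this issue: the data was accepted, and the only record of the mutation is a free-form string that the exporter may or may not log.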

Secondly, partial success/failure reports the failure to the client which
may or may not be logging these failures in a way that is visible or
obvious to downstream viewers/consumers of the information

This should be implemented in each SDK, and to my knowledge it is. There's an entry in the compatibility matrix that can be used to keep track (not sure if it's up to date). We also created issues in each repo asking that SDKs log partial success messages.

My feeling is that while it would be possible to come up with consistent conventions to surface such errors/warnings, I'm not sure we should, or that it even makes sense to do so.

As I said, to be able to pinpoint exactly which span/metric/log had issues, OTLP receivers need to keep state, and all of this puts pressure on them. In high-load scenarios this is definitely not ideal. I feel what we have now with partial success is a good middle ground that offers enough info to troubleshoot and identify problems in the telemetry.

@joaopgrassi joaopgrassi removed the experts needed label Jul 9, 2024
@joaopgrassi
Member

@michaelsafyan ping on this. I'm inclined to close this as nothing to do, but please let me know if you'd like to continue this discussion or have other arguments.

@lmolkova
Contributor

I believe the best place to start would be to define something similar to #1580 and #1631 (if looking specifically for events/logs).

It could also be useful to socialize this in the Maintainer call (Monday 9am PT) and see if any of the OTel SDKs report something similar in a non-documented (non-standard) way.


5 participants