Open telemetry usage (Proof of Concept) #379

AcquaDiGiorgio · 2025-01-20T10:33:13Z

See #257.
This pull request showcases one possible way of implementing OpenTelemetry traces and metrics. This code is NOT final and is not intended to be merged.

Important notes:

With OTel we can not only keep track of selected information about variables, but also see, for example, how long each function takes. This might be useful for locating unusual agent executions or slow response times.

First access to the Auth API

Second access (after caching)

Look at the execution time of the 3rd and 4th spans (diracx.routers.auth.utils.initiate_authorization_flow_with_iam and diracx.routers.auth.utils.get_server_metadata).

Currently the OTel code lives in the "routers" package, but we could move it to the core of DiracX and trace everything, from the headers used at the API level to the specific variables involved in executing a database query.
I implemented tracing using decorators. This is the most elegant solution I was able to come up with, but decorating every function could be a pain and also quite over the top. Instead, we could decorate only strategic functions or methods that are problematic or simply interesting to look at, or we could offer a second option: tracing inside the body of a function using a context manager.
Because we are using custom FastAPI routers with already decorated entry points, those entry points must NOT be decorated. Tracing with OTel clashes with an already traced function, so these decorators should only be used on internal functions.
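To make the two options concrete, here is a minimal sketch of a tracing decorator plus a context manager for tracing inside a function body. It is not the code from this pull request: the names traced, span, load_metadata and start_flow are hypothetical, and the ImportError fallback is an assumption added so the sketch still runs where the opentelemetry-api package is not installed. The only OTel calls used are trace.get_tracer and Tracer.start_as_current_span.

```python
import asyncio
import functools
import inspect
from contextlib import contextmanager

try:
    from opentelemetry import trace

    _tracer = trace.get_tracer(__name__)
except ImportError:  # assumption: run untraced if opentelemetry-api is missing
    _tracer = None


@contextmanager
def span(name, **attributes):
    """Trace a block of code inside a function body."""
    if _tracer is None:
        yield None
        return
    with _tracer.start_as_current_span(name, attributes=attributes) as current:
        yield current


def traced(func):
    """Wrap a sync or async function in a span named after the function."""
    name = f"{func.__module__}.{func.__qualname__}"

    if inspect.iscoroutinefunction(func):
        # The span must stay open until the coroutine has actually finished.
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            with span(name):
                return await func(*args, **kwargs)

        return async_wrapper

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with span(name):
            return func(*args, **kwargs)

    return wrapper


# Hypothetical internal functions, used only to exercise the sketch.
@traced
def load_metadata(url):
    return {"issuer": url}


@traced
async def start_flow(url):
    # Second option: trace one step inside the function body.
    with span("build-request", url=url):
        return load_metadata(url)


result = asyncio.run(start_flow("https://iam.example"))
```

With this shape, only the functions worth looking at carry the decorator, and the context manager covers the cases where a single step inside a larger function is the interesting part.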
Lastly, we should consider using Propagators, which provide a way of distributing traces. For example, we could start a trace in a Python script on the client side, serialize the span context and inject it into the request headers, and continue the trace on the server side, giving a full trace from the very first operation a user submitted to the last result that DiracX returned.
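As an illustration of what a propagator does, the sketch below hand-rolls the W3C Trace Context traceparent header (version, 32-hex trace id, 16-hex parent span id, 2-hex flags) using only the standard library. In real code the inject and extract functions from opentelemetry.propagate would do this job; everything defined here (SpanContext, new_root, child_of, inject_traceparent, extract_traceparent) is a hypothetical stand-in, not OTel API.

```python
import re
import secrets
from dataclasses import dataclass

# traceparent: version "00", trace id, parent span id, trace flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 lowercase hex characters
    span_id: str  # 16 lowercase hex characters
    sampled: bool = True


def new_root():
    """Start a new trace, as the client-side script would."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent):
    """Continue the parent's trace under a fresh span id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject_traceparent(ctx, headers):
    """Serialize the span context into the outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract_traceparent(headers):
    """Rebuild the remote span context, or return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero ids are invalid in the W3C format
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


# Client side: start a trace and attach it to the request.
client_span = new_root()
request_headers = {}
inject_traceparent(client_span, request_headers)

# Server side: pick the trace up again and continue it.
remote = extract_traceparent(request_headers)
server_span = child_of(remote)
```

Because the server span reuses the client's trace id, both sides end up in the same trace, which is what would give the end-to-end view described above.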

For the metrics part, the only interesting use I was able to find (there are surely a hundred more) is measuring CE job slot availability, with counters that go up when a job is submitted and down when it completes or fails. We could also compare these counters with the values reported by the CE status, for example to see the percentage of jobs submitted by the different research groups.
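A rough sketch of that idea, under stated assumptions: SlotTracker, the instrument name "ce.jobs.in_flight", and the CE and group names are all hypothetical, and like the metrics in this pull request the OTel side is untested. The tracker keeps its own per-CE counts and, when opentelemetry-api is installed, mirrors every change to an UpDownCounter created with Meter.create_up_down_counter.

```python
from collections import Counter

try:
    from opentelemetry import metrics

    _instrument = metrics.get_meter(__name__).create_up_down_counter(
        "ce.jobs.in_flight",  # hypothetical instrument name
        unit="1",
        description="Jobs submitted to a CE that have not finished yet",
    )
except ImportError:  # assumption: keep local counts only if the API is missing
    _instrument = None


class SlotTracker:
    """Counts in-flight jobs per (CE, group) and mirrors changes to OTel."""

    def __init__(self, instrument=None):
        self._instrument = instrument
        self.in_flight = Counter()

    def _change(self, ce, group, delta):
        self.in_flight[(ce, group)] += delta
        if self._instrument is not None:
            self._instrument.add(delta, {"ce": ce, "group": group})

    def submitted(self, ce, group):
        self._change(ce, group, +1)

    def finished(self, ce, group):
        # Called when a job completes or fails; both free a slot.
        self._change(ce, group, -1)

    def share(self, ce):
        """Percentage of in-flight jobs on one CE, broken down by group."""
        counts = {g: n for (c, g), n in self.in_flight.items() if c == ce and n > 0}
        total = sum(counts.values())
        return {g: 100 * n / total for g, n in counts.items()} if total else {}


tracker = SlotTracker(_instrument)
for _ in range(4):
    tracker.submitted("ce-a", "group-x")
tracker.submitted("ce-a", "group-y")
tracker.finished("ce-a", "group-x")

shares = tracker.share("ce-a")  # {"group-x": 75.0, "group-y": 25.0}
```

The local counts are what would be compared against the slot numbers reported by the CE itself; the OTel instrument is only there to export the same signal.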

note: metrics are untested

AcquaDiGiorgio added 2 commits December 11, 2024 16:06

feat(otel): proof of concept for tracing using decorators

3ae673e

note: metrics are untested

Merge branch 'DIRACGrid:main' into issue-257-open-telemetry-usage

c963551
