Open telemetry usage (Proof of Concept) #379

Draft
wants to merge 2 commits into
base: main

AcquaDiGiorgio

See #257.
This pull request showcases one possible way of implementing OpenTelemetry (OTel) traces and metrics. The code is NOT final and is not intended to be merged.

Important notes:

  1. With OTel we can not only keep track of information about variables, but also see, for example, how long each function takes. This might be useful for locating unusual agent executions or slow response times.

First access to the Auth API
Screenshot from 2024-12-11 14-41-26
Second access (after caching)
Screenshot from 2024-12-11 14-43-53
Compare the execution time of the 3rd and 4th spans (diracx.routers.auth.utils.initiate_authorization_flow_with_iam and diracx.routers.auth.utils.get_server_metadata) in the two screenshots.

  2. Currently the OTel code lives in the "routers" package, but we could move it to the core of DiracX and trace everything, from the headers used at the API level to the specific variables involved in the execution of a database query.

  3. I implemented tracing using decorators. This is the most elegant solution I was able to come up with, but decorating every function would be tedious and probably overkill. Instead, we could decorate only strategic functions or methods that are problematic or simply interesting to look at, or we could offer a secondary option: tracing inside the body of a function using a context manager.

  4. Because we are using custom FastAPI routers whose entry points are already decorated, those entry points must NOT be decorated again. Tracing with OTel clashes with an already traced function, so these decorators should only be used on internal functions.

  5. Lastly, we should consider the use of Propagators, which implement a way of distributing traces. For example, we could start a trace in a Python script on the client side, serialize the span context, inject it into the request headers, and continue the trace on the server side. This would give a full trace from the very first operation a user submitted to the last response that DiracX returned.

For the metrics part, the only interesting use I was able to find (there are surely a hundred more) is measuring CE job slot availability, with counters that go up when a job is submitted and down when it completes or fails. We could also compare these counters with the values reported by the CE status, for example, to see the percentage of jobs submitted by the different investigation groups.
