Open telemetry usage (Proof of Concept) #379
Draft
See #257.
This pull request showcases one possible way of adding OpenTelemetry (OTel) traces and metrics. This code is NOT final and is not intended to be merged.
Important notes:
Currently OTel lives in the "routers" package, but we could move it to the core of DiracX and trace everything, from the headers used at the API level to the specific variables involved in the execution of a database query.
I implemented tracing using decorators. This is the most elegant solution I was able to come up with, but decorating every function could be a pain and would also be overkill. Instead, we could decorate only strategic functions or class methods that are problematic or simply interesting to look at, or we could offer a secondary option of tracing inside the body of a function using a context manager.
Because we are using custom FastAPI routers whose entry points are already decorated, those entry points must NOT be decorated again. Tracing with OTel clashes with an already traced function, so these decorators should only be used on internal functions.
Lastly, we should consider using Propagators, which implement a way of distributing traces across processes. For example, we could start a trace in a Python script on the client side, serialize and inject the span context into the request headers, and continue the trace on the server side, giving a full trace from the very first operation a user submitted to the last response that DiracX returned.
For the metrics part, the only interesting use I was able to find (there are surely a hundred more) is measuring CE job slot availability, with counters that go up when a job is submitted and down when it completes or fails. We could also compare these counters with the values reported by the CE status, for example, to see the percentage of jobs submitted by the different investigation groups.