stats/opentelemetry: refactor and make e2e test stats verification deterministic #8077

purnesh42H · 2025-02-11T19:35:51Z

The flakiness in the stats/opentelemetry e2e tests comes from asynchronous metric and trace collection. gRPC server metrics, specifically those generated by stats.End calls, are processed asynchronously, so the server-side metrics for an RPC may not be available immediately after the client-side RPC completes. To handle this, the tests poll for the expected metrics until they appear or a timeout occurs. Polling makes the tests wait for the server to finish processing the metrics, which prevents premature validation against incomplete data.

However, the following limitations remain and cause the flakiness:

The validation logic for histogram metrics, particularly sent_total_compressed_message_size, did not adequately account for asynchronous data point population. The tests would sometimes check the histogram before all expected data points were present, resulting in false negatives.
The validation logic for duration metrics, which are recorded as histograms, did not use the polling mechanism described above.
The trace exporter was queried only once for collected spans, so spans that had not yet been exported were missed. This resulted in incomplete trace verification and occasional test failures.

This PR fixes these flakiness issues by adding the appropriate polling before the metric and trace validation logic.

waitForServerCompletedRPCs now waits not only for the required metric to be present but also for the number of histogram data points to match the expected value before proceeding with validation.
The duration metrics validation check now also goes through waitForServerCompletedRPCs.
The trace exporter is polled repeatedly until all expected spans are available, ensuring complete trace collection.

RELEASE NOTES: None

zasweq · 2025-02-11T19:51:35Z

Sorry, lots of work on new team so no bandwidth to review this. Glad you worked on this though!

codecov · 2025-02-11T19:53:28Z

Codecov Report

Attention: Patch coverage is 59.45946% with 15 lines in your changes missing coverage. Please review.

Project coverage is 82.33%. Comparing base (e95a4b7) to head (0184242).
Report is 22 commits behind head on master.

Files with missing lines	Patch %	Lines
...tats/opentelemetry/internal/testutils/testutils.go	58.33%	10 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8077      +/-   ##
==========================================
+ Coverage   82.15%   82.33%   +0.18%     
==========================================
  Files         387      387              
  Lines       39067    38967     -100     
==========================================
- Hits        32094    32082      -12     
+ Misses       5643     5572      -71     
+ Partials     1330     1313      -17

Files with missing lines	Coverage Δ
stats/opentelemetry/trace.go	`83.33% <100.00%> (ø)`
...tats/opentelemetry/internal/testutils/testutils.go	`94.69% <58.33%> (-0.88%)`	⬇️

... and 43 files with indirect coverage changes
