Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats/openetelemetry: refactor and make e2e test stats verification deterministic #8077

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

purnesh42H
Copy link
Contributor

@purnesh42H purnesh42H commented Feb 11, 2025

Fixes: #8053 #7423 #8076

The flakiness in stats/opentelemetry e2e tests flakiness is related to asynchronous metrics and trace collection. gRPC server metrics, specifically those generated by stats.End calls, are processed asynchronously. This means that immediately after a client-side RPC completes, the corresponding server-side metrics may not be available yet. To handle this asynchronicity, the tests employ a polling mechanism that repeatedly checks for the expected metrics until they appear or a timeout occurs. This polling strategy ensures that the tests wait for the server to finish processing the metrics, preventing premature validation against incomplete data.

However, the following limitations are still there which are causing the flakiness

  1. the validation logic for histogram metrics, particularly sent_total_compressed_message_size, did not adequately account for asynchronous data point population. The tests would sometimes check the histogram before all expected data points were present, resulting in false negatives.
  2. the validation logic for duration metrics did not use the above polling mechanism, specifically for histogram metrics.
  3. the trace exporter was only queried once for collected spans, which could lead to missing spans due to the asynchronous nature of trace exporting. This resulted in incomplete trace verification and occasional test failures.

This PR fixes these flakiness issues by adding the appropriate polling before the metric and trace validation logic.

  1. the waitForServerCompletedRPCs now not only wait for required metric to be present but also wait for the number of histogram data points to match the expected values before proceeding with validation.
  2. The duration metrics validation check now also waitForServerCompletedRPCs.
  3. the trace exporter is polled repeatedly until all expected spans are available, ensuring complete trace collection.

RELEASE NOTES: None

@zasweq
Copy link
Contributor

zasweq commented Feb 11, 2025

Sorry, lots of work on new team so no bandwidth to review this. Glad you worked on this though!

@zasweq zasweq removed their request for review February 11, 2025 19:51
@zasweq zasweq removed their assignment Feb 11, 2025
Copy link

codecov bot commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 59.45946% with 15 lines in your changes missing coverage. Please review.

Project coverage is 82.33%. Comparing base (e95a4b7) to head (0184242).
Report is 22 commits behind head on master.

Files with missing lines Patch % Lines
...tats/opentelemetry/internal/testutils/testutils.go 58.33% 10 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8077      +/-   ##
==========================================
+ Coverage   82.15%   82.33%   +0.18%     
==========================================
  Files         387      387              
  Lines       39067    38967     -100     
==========================================
- Hits        32094    32082      -12     
+ Misses       5643     5572      -71     
+ Partials     1330     1313      -17     
Files with missing lines Coverage Δ
stats/opentelemetry/trace.go 83.33% <100.00%> (ø)
...tats/opentelemetry/internal/testutils/testutils.go 94.69% <58.33%> (-0.88%) ⬇️

... and 43 files with indirect coverage changes

@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch 2 times, most recently from 2038431 to dc1e2ba Compare February 12, 2025 17:11
@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch 4 times, most recently from 171c066 to 82ba13e Compare February 14, 2025 02:41
@arjan-bal arjan-bal removed their assignment Feb 18, 2025
@purnesh42H purnesh42H requested a review from arjan-bal February 18, 2025 13:44
@arjan-bal arjan-bal assigned purnesh42H and unassigned arjan-bal Feb 20, 2025
@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch from b563f15 to c50e00c Compare February 20, 2025 14:46
@purnesh42H purnesh42H requested a review from arjan-bal February 20, 2025 15:07
@purnesh42H purnesh42H assigned arjan-bal and unassigned purnesh42H Feb 20, 2025
// Verify spans has error status with rpcErrorMsg as error message.
for ; len(exporter.GetSpans()) == 0 && ctx.Err() == nil; <-time.After(time.Millisecond) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we wait till len(exporter.GetSpans()) < 3 since the removed lines were checking if the length was 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was not needed as we are only checking the first span anyway. The error is enforced on first event itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would change the test behaviour though. Earlier the test was asserting that exactly 3 spans are received and the first span's status message is as expected. Now it's just asserting the second part. Is it okay to skip checking the span count?

@arjan-bal arjan-bal assigned purnesh42H and unassigned arjan-bal Feb 24, 2025
@purnesh42H purnesh42H requested a review from arjan-bal February 24, 2025 19:42
@purnesh42H purnesh42H assigned arjan-bal and unassigned easwars Feb 24, 2025
@purnesh42H purnesh42H force-pushed the stats-otel-flaky-tests branch from 7d96428 to f83d6aa Compare February 25, 2025 04:43
@arjan-bal arjan-bal removed their assignment Feb 25, 2025
@purnesh42H purnesh42H requested a review from arjan-bal February 25, 2025 12:12
@purnesh42H purnesh42H assigned arjan-bal and unassigned purnesh42H Feb 25, 2025
Comment on lines +297 to +299
eventsSort := cmpopts.SortSlices(func(a, b trace.Event) bool {
return a.Name < b.Name
})
Copy link
Contributor

@arjan-bal arjan-bal Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to ignore the order of events during comparison? From the test case, I see that the expected order for unary clients is:

  1. Outbound compressed message
  2. Inbound compressed message

For unary servers the event order is:

  1. Inbound compressed message
  2. Outbound compressed message

This seems to follow the order of messages being exchanged b/w the client and servers. So it may be worth verifying the event order.

// Verify spans has error status with rpcErrorMsg as error message.
for ; len(exporter.GetSpans()) == 0 && ctx.Err() == nil; <-time.After(time.Millisecond) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would change the test behaviour though. Earlier the test was asserting that exactly 3 spans are received and the first span's status message is as expected. Now it's just asserting the second part. Is it okay to skip checking the span count?

@arjan-bal arjan-bal assigned purnesh42H and unassigned arjan-bal Feb 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky Test: MetricsAndTracesOptionEnabled
5 participants