-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats/openetelemetry: refactor and make e2e test stats verification deterministic #8077
base: master
Are you sure you want to change the base?
Conversation
dc1badd
to
67b38cd
Compare
Sorry, lots of work on new team so no bandwidth to review this. Glad you worked on this though! |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8077 +/- ##
==========================================
+ Coverage 82.15% 82.33% +0.18%
==========================================
Files 387 387
Lines 39067 38967 -100
==========================================
- Hits 32094 32082 -12
+ Misses 5643 5572 -71
+ Partials 1330 1313 -17
|
2038431
to
dc1e2ba
Compare
171c066
to
82ba13e
Compare
82ba13e
to
3e8fc54
Compare
b563f15
to
c50e00c
Compare
// Verify spans has error status with rpcErrorMsg as error message. | ||
for ; len(exporter.GetSpans()) == 0 && ctx.Err() == nil; <-time.After(time.Millisecond) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we wait till len(exporter.GetSpans()) < 3
since the removed lines were checking if the length was 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was not needed as we are only checking the first span anyway. The error is enforced on first event itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would change the test behaviour though. Earlier the test was asserting that exactly 3 spans are received and the first span's status message is as expected. Now it's just asserting the second part. Is it okay to skip checking the span count?
7d96428
to
f83d6aa
Compare
eventsSort := cmpopts.SortSlices(func(a, b trace.Event) bool { | ||
return a.Name < b.Name | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct to ignore the order of events during comparison? From the test case, I see that the expected order for unary clients is:
- Outbound compressed message
- Inbound compressed message
For unary servers the event order is:
- Inbound compressed message
- Outbound compressed message
This seems to follow the order of messages being exchanged b/w the client and servers. So it may be worth verifying the event order.
// Verify spans has error status with rpcErrorMsg as error message. | ||
for ; len(exporter.GetSpans()) == 0 && ctx.Err() == nil; <-time.After(time.Millisecond) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would change the test behaviour though. Earlier the test was asserting that exactly 3 spans are received and the first span's status message is as expected. Now it's just asserting the second part. Is it okay to skip checking the span count?
Fixes: #8053 #7423 #8076
The flakiness in stats/opentelemetry e2e tests flakiness is related to asynchronous metrics and trace collection. gRPC server metrics, specifically those generated by
stats.End
calls, are processed asynchronously. This means that immediately after a client-side RPC completes, the corresponding server-side metrics may not be available yet. To handle this asynchronicity, the tests employ a polling mechanism that repeatedly checks for the expected metrics until they appear or a timeout occurs. This polling strategy ensures that the tests wait for the server to finish processing the metrics, preventing premature validation against incomplete data.However, the following limitations are still there which are causing the flakiness
sent_total_compressed_message_size
, did not adequately account for asynchronous data point population. The tests would sometimes check the histogram before all expected data points were present, resulting in false negatives.duration
metrics did not use the above polling mechanism, specifically for histogram metrics.This PR fixes these flakiness issues by adding the appropriate polling before the metric and trace validation logic.
waitForServerCompletedRPCs
now not only wait for required metric to be present but also wait for the number of histogram data points to match the expected values before proceeding with validation.duration
metrics validation check now alsowaitForServerCompletedRPCs
.RELEASE NOTES: None