-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats/opentelemetry: add trace event for name resolution delay #8074
base: master
Are you sure you want to change the base?
stats/opentelemetry: add trace event for name resolution delay #8074
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8074 +/- ##
==========================================
- Coverage 82.29% 82.20% -0.10%
==========================================
Files 387 387
Lines 39065 38953 -112
==========================================
- Hits 32150 32020 -130
- Misses 5584 5608 +24
+ Partials 1331 1325 -6
|
clientconn.go
Outdated
func (cc *ClientConn) waitForResolvedAddrs(ctx context.Context) error { | ||
// context expires, whichever happens first. | ||
// | ||
// If the name resolution took longer than expected (indicating a delay before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If the name resolution did not succeed in first attempt, it returns true indicating delay in name resolution completion. In all other cases, including name resolution failure and name resolution succeeding in first attempt, it returns false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/handlers.go
Outdated
@@ -38,6 +38,8 @@ type RPCTagInfo struct { | |||
// FailFast indicates if this RPC is failfast. | |||
// This field is only valid on client side, it's always false on server side. | |||
FailFast bool | |||
// NameResolutionDelay indicates if the RPC was delayed due to address resolution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NameResolutionDelay indicates if there was a delay in the name resolution.
// This field is only valid on client side, it's always false on server side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry/opentelemetry.go
Outdated
@@ -211,6 +211,8 @@ type attemptInfo struct { | |||
countSentMsg uint32 | |||
countRecvMsg uint32 | |||
previousRPCAttempts uint32 | |||
// nameResolutionDelayed indicates if the RPC was delayed by address resolution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above about docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth | |||
} | |||
// Provide an opportunity for the first RPC to see the first service config | |||
// provided by the resolver. | |||
if err := cc.waitForResolvedAddrs(ctx); err != nil { | |||
isDelayed, err := cc.waitForResolvedAddrs(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: isDelayed -> nameResDelayed/nameResolutionDelayed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
return nil, err | ||
} | ||
if isDelayed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its fine to always add the nameResolutionDelay key. It will be either true/false. That we don't have to check again and again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
@@ -416,8 +424,9 @@ func (cs *clientStream) newAttemptLocked(isTransparent bool) (*csAttempt, error) | |||
method := cs.callHdr.Method | |||
var beginTime time.Time | |||
shs := cs.cc.dopts.copts.StatsHandlers | |||
isDelayed, _ := ctx.Value(nameResolutionDelayKey).(bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about naming the variable as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
return nil, err | ||
} | ||
if isDelayed { | ||
ctx = context.WithValue(ctx, nameResolutionDelayKey, isDelayed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding it to context, can't we add to the rpcInfo below at line 233? and then we can pass that rpcInfo to newClientStreamWithParams. We can also just pass the bool value
Adding something less scoped like this to context has several downsides like context pollution, ambiguity, potential for misuse. It is because context is passed to different parts of application or even beyond current application
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion. I’ve updated the code to move nameResolutionDelay to rpcInfo as recommended. rpcInfo is now passed to newClientStreamWithParams,
stream.go
Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth | |||
} | |||
// Provide an opportunity for the first RPC to see the first service config | |||
// provided by the resolver. | |||
if err := cc.waitForResolvedAddrs(ctx); err != nil { | |||
isDelayed, err := cc.waitForResolvedAddrs(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning a bool and error is not good practice for go. It breaks the established pattern of error handling in Go because returned bool indicates success/failure in general. Can we do something better? It might be fine if we can't but we can try to look for better ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an approach as follows:
Add a nameResolutionDelay field: Add a new nameResolutionDelay field to the ClientConn struct to store the delay state.
Modify waitForResolvedAddrs: Set the nameResolutionDelay field directly in ClientConn instead of returning a boolean.
Access in newAttemptLocked: Use the nameResolutionDelay field from ClientConn within newAttemptLocked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we want to add the field to clientconn because clientconn is not restricted to only single rpc. Returning a struct sounds better but we don't have any other field apart from boolean field. Let's keep bool, err for now. But make sure docstring is updated to indicate the bool correctly.
just to give you more context
|
Test Case 1: Fast Path (Line 699) Setup: Test Case 2: Waiting Path (Line 703) Setup: |
Done |
clientconn.go
Outdated
func (cc *ClientConn) waitForResolvedAddrs(ctx context.Context) error { | ||
// context expires, whichever happens first. | ||
// | ||
// If the name resolution did not succeed in first attempt, it returns true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement is not entirely correct. Please modify this based on my other comment where I provided more background. If firstResolveEvent.HasFired()
returns true, it means the resolver has already provided addresses at least once. In this case, there is no need to wait, so we can immediately return nil to signal that resolution has already occurred in the past. This is the fast path indicating no delay in name resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other case is specifically waiting for the resolver to send the first address update. Until then, it will block. So, its indicated delayed name resolution.
Please modify the docstring accordingly and mention when we return true and when not. Your first statement should mention that the bool return value indicate whether there wait for resolved address happened or not. If it didn't happen or there was an error, its false. It we were blocked for name resolution then it happened, so value is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stream.go
Outdated
var mc serviceconfig.MethodConfig | ||
var onCommit func() | ||
newStream := func(ctx context.Context, done func()) (iresolver.ClientStream, error) { | ||
return newClientStreamWithParams(ctx, desc, cc, method, mc, onCommit, done, opts...) | ||
return newClientStreamWithParams(ctx, desc, cc, method, mc, onCommit, done, rpcInfo, opts...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we just send the bool value instead of struct? Does it work? If yes, then that's more simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
@@ -53,7 +55,7 @@ func (s) TestClientConnClose_WithPendingRPC(t *testing.T) { | |||
go func() { | |||
// This RPC would block until the ClientConn is closed, because the | |||
// resolver has not provided its first update yet. | |||
_, err := client.EmptyCall(ctx, &testpb.Empty{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep it testpb
please. Don't change this. This workaround is done only for google3 because there protos are separated from server defs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
|
||
// EmptyCall is a simple RPC that returns an empty response. | ||
func (s *server) EmptyCall(_ context.Context, _ *testgrpc.Empty) (*testgrpc.Empty, error) { | ||
return &testgrpc.Empty{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same. It should be testpb.Empty{}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
|
||
// startTestGRPCServer initializes a test gRPC server on a random port. | ||
// Returns the server address and a cleanup function. | ||
func startTestGRPCServer(t *testing.T) (string, func()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use stubserver instead? Please see the other tests in the package for stubserver and use in the same for empty call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
client := testgrpc.NewTestServiceClient(clientConn) | ||
|
||
// First RPC call should succeed immediately | ||
if _, err := client.EmptyCall(ctx, &testgrpc.Empty{}); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't verify if name resolution field is set or not. We should have some way of verifying the nameResolutionDelay field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you pass the statsHandler in grpc.NewClient
then you can check the value of the bool field that you are setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
defer cleanup() | ||
|
||
// Create a stats handler to track name resolution delay | ||
statsHandler := &testStatsHandler{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do the same for non delay test as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this one for the non-delay test as well
} | ||
t.Log("Second RPC succeeded after resolver update, confirming resolution completion.") | ||
|
||
if !statsHandler.nameResolutionDelayed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do the same in non delay case. It will be reverse of this there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this one for the non-delay test as well
test/clientconn_test.go
Outdated
}() | ||
|
||
// Simulate a delay before updating resolver | ||
time.Sleep(2 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not use time.Sleep. You should use the resolutionReady channel to unblock resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer this test https://github.com/grpc/grpc-go/blob/master/internal/resolver/dns/dns_resolver_test.go#L641. I think you can structure this test in similar way.
Block on resolutionReady in a select block and for case <- resolutionReady:
, update resolver state to unblock. Is there a way we can make the first RPC wait for resolution instead of failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
resolverBuilder.UpdateState(resolver.State{Addresses: []resolver.Address{{Addr: serverAddress}}}) | ||
|
||
// Second RPC should succeed after resolver update | ||
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use single context throughout test which is cancelable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
defer cancel() | ||
client := testgrpc.NewTestServiceClient(clientConn) | ||
// First RPC call should succeed immediately. | ||
if _, err := client.EmptyCall(ctx, &testgrpc.Empty{}); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be a testpb.Empty{}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
defer cleanup() | ||
|
||
statsHandler := &testStatsHandler{} | ||
resolverBuilder := manual.NewBuilderWithScheme("instant") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep it simple like rb := manual.NewBuilderWithScheme("")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/clientconn_test.go
Outdated
resolverBuilder := manual.NewBuilderWithScheme("instant") | ||
resolverBuilder.InitialState(resolver.State{Addresses: []resolver.Address{{Addr: stub.Address}}}) | ||
// Create a ClientConn using the manual resolver. | ||
clientConn, err := grpc.NewClient(resolverBuilder.Scheme()+":///test.server", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep it simple like cc, err := grpc.NewClient(resolverBuilder.Scheme()+":///test.server",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
close(resolutionReady) | ||
case <-rpcCompleted: | ||
t.Fatal("RPC completed prematurely before resolution was updated!") | ||
case <-time.After(5 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use test context for timeout
case <-ctx.Done():
t.Fatalf("Test setup timed out: %v", ctx.Err())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/clientconn_test.go
Outdated
t.Fatalf("RPC failed after resolution: %v", err) | ||
} | ||
t.Log("RPC completed successfully after resolution.") | ||
case <-time.After(5 * time.Second): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, use test context for timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
grpc.WithStatsHandler(statsHandler), | ||
) | ||
if err != nil { | ||
t.Fatalf("grpc.NewClient error: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be t.Fatalf("NewClient() failed: %v", err)
test/clientconn_test.go
Outdated
defer cleanup() | ||
|
||
statsHandler := &testStatsHandler{} | ||
clientConn, resolverBuilder := createTestClient(t, "delayed", statsHandler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be cc, rb := createTestClient(t, "delayed", statsHandler)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
stats/opentelemetry: add trace event for name resolution delay.
RELEASE NOTES: None