- Author(s): dapengzhang0, dfawley, ejona86, markdroth
- Approver: markdroth
- Status: Implemented
- Implemented in: C-core, Java
- Last updated: 2021-08-24
- Discussion at: https://groups.google.com/g/grpc-io/c/EAbFqaJWp5w
This document proposes how a gRPC client should expose OpenCensus metrics and tracing when retry is enabled. It can be regarded as an extension of the retry gRFC.
There is a collection of OpenCensus metrics for gRPC. When adding retry support to gRPC, we need to clearly define the measurement points for those metrics and add some retry-specific metrics to the OpenCensus library to present a clearer picture of retry attempts. The retry gRFC discussed exposing three additional per-method metrics for retry, but it did not detail how to record and expose them. Those metrics also provide very limited information, and the histogram view (the third metric) is not very useful because `maxAttempts` is always less than or equal to 5. Besides the number of retry attempts, users may also be interested in the total number of transparent retries and the total delay caused by retry backoff.
Per-call stats from the perspective of the application should continue to be gathered from the interceptor/filter level, just as they are today.
For each call, there should be some way to attach an object that will gather data for individual call attempts. Events on the underlying call attempts will be recorded via that object.
On the parent call object, there should be some way to record an event when we do the first LB pick attempt for a given call attempt.
There should be some way to indicate whether a given call attempt was triggered via a transparent retry.
With that general structure in the gRPC stats interface, whatever plugs into the stats interface (e.g., Census or OpenCensus) should be able to determine the data to export for the metrics defined here.
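The structure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `CallStats`, `AttemptStats`, and all of their members are hypothetical names, not part of any gRPC API, and the sketch ignores thread safety.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-attempt object: gathers data for one call attempt. */
final class AttemptStats {
  /** True if this attempt was triggered via a transparent retry. */
  final boolean transparentRetry;
  /** Set once the first LB pick for this attempt has been recorded. */
  boolean firstLbPickRecorded;

  AttemptStats(boolean transparentRetry) {
    this.transparentRetry = transparentRetry;
  }
}

/** Hypothetical parent-call object: one per call as seen by the application. */
final class CallStats {
  private final List<AttemptStats> attempts = new ArrayList<>();

  /** Attaches a new per-attempt object to this call. */
  AttemptStats newAttempt(boolean transparentRetry) {
    AttemptStats attempt = new AttemptStats(transparentRetry);
    attempts.add(attempt);
    return attempt;
  }

  /** Records the event of the first LB pick for the given attempt. */
  void recordFirstLbPick(AttemptStats attempt) {
    attempt.firstLbPickRecorded = true;
  }

  /** Retry or hedging attempts, excluding transparent retries and the original attempt. */
  int retries() {
    int nonTransparent = 0;
    for (AttemptStats a : attempts) {
      if (!a.transparentRetry) {
        nonTransparent++;
      }
    }
    return Math.max(0, nonTransparent - 1);
  }

  /** Attempts that were triggered via a transparent retry. */
  int transparentRetries() {
    int count = 0;
    for (AttemptStats a : attempts) {
      if (a.transparentRetry) {
        count++;
      }
    }
    return count;
  }
}
```

A stats plugin built on such objects would have what it needs to tell ordinary retries from transparent ones when the call completes.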
gRPC will treat each retry attempt or hedged RPC as a distinct RPC with regard to the current set of metrics.
We add the following per-overall-client-call metrics to OpenCensus:
Measure name | Unit | Description |
---|---|---|
grpc.io/client/retries_per_call | 1 | Number of retry or hedging attempts excluding transparent retries made during the client call. The original attempt is not counted as a retry/hedging attempt. |
grpc.io/client/transparent_retries_per_call | 1 | Number of transparent retries made during the client call. |
grpc.io/client/retry_delay_per_call | ms | Total time of delay while there is no active attempt during the client call. |

View name | Measure | Aggregation | Tags |
---|---|---|---|
grpc.io/client/retries_per_call | retries_per_call | distribution | grpc_client_method |
grpc.io/client/retries | retries_per_call | sum | grpc_client_method |
grpc.io/client/transparent_retries_per_call | transparent_retries_per_call | distribution | grpc_client_method |
grpc.io/client/transparent_retries | transparent_retries_per_call | sum | grpc_client_method |
grpc.io/client/retry_delay_per_call | retry_delay_per_call | distribution | grpc_client_method |
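As an illustration of the `retry_delay_per_call` measure, here is a minimal sketch of accumulating the time during which a call has no active attempt. The class `RetryDelayTracker` and its methods are hypothetical, and timestamps are passed in explicitly to keep the example deterministic. The sketch assumes that only gaps followed by another attempt count as delay, so the time after the final attempt ends is not included.

```java
/** Hypothetical helper: sums the time during which a call has no active attempt. */
final class RetryDelayTracker {
  private int activeAttempts;
  private boolean idleBetweenAttempts;
  private long idleSinceNanos;
  private long totalDelayNanos;

  /** Call when an attempt (original, retry, or hedge) starts. */
  synchronized void attemptStarted(long nowNanos) {
    if (idleBetweenAttempts) {
      // The gap since the last attempt ended is retry delay.
      totalDelayNanos += nowNanos - idleSinceNanos;
      idleBetweenAttempts = false;
    }
    activeAttempts++;
  }

  /** Call when an attempt ends. */
  synchronized void attemptEnded(long nowNanos) {
    activeAttempts--;
    if (activeAttempts == 0) {
      // No attempt is active any more; start measuring a potential gap.
      idleBetweenAttempts = true;
      idleSinceNanos = nowNanos;
    }
  }

  /** Value to record for retry_delay_per_call, in milliseconds. */
  synchronized double retryDelayMillis() {
    return totalDelayNanos / 1_000_000.0;
  }
}
```

With hedging, overlapping attempts keep `activeAttempts` above zero, so no delay accumulates while at least one attempt is in flight.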
The gRPC tracing module should create a tracing span for each call from the perspective of the application, and a child span for each individual call attempt, including transparent retry attempts. The parent span is named in the form `Sent.<service name>.<method name>`. The span for each individual call attempt is named in the form `Attempt.<service name>.<method name>` and carries the following additional span attributes:
- `previous-rpc-attempts`: an integer value, the number of preceding attempts, not including transparent retries.
- `transparent-retry`: a boolean value, whether the attempt is a transparent retry.
The child span should record the message events of the attempt.
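The naming scheme and attributes above can be sketched in plain Java, without depending on the OpenCensus API. The helper class `RetrySpanNames` is hypothetical, and the sketch assumes the method is given in gRPC's `<service name>/<method name>` full-method-name form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for span names and per-attempt span attributes. */
final class RetrySpanNames {
  private RetrySpanNames() {}

  /** Name of the parent span, one per call as seen by the application. */
  static String parentSpanName(String fullMethodName) {
    return "Sent." + fullMethodName.replace('/', '.');
  }

  /** Name of the child span, one per individual call attempt. */
  static String attemptSpanName(String fullMethodName) {
    return "Attempt." + fullMethodName.replace('/', '.');
  }

  /** Additional attributes carried by each attempt span. */
  static Map<String, Object> attemptAttributes(
      int previousRpcAttempts, boolean transparentRetry) {
    Map<String, Object> attrs = new LinkedHashMap<>();
    // Number of preceding attempts, not including transparent retries.
    attrs.put("previous-rpc-attempts", previousRpcAttempts);
    attrs.put("transparent-retry", transparentRetry);
    return attrs;
  }
}
```

For example, a transparent retry that follows the original attempt would get `previous-rpc-attempts` set to 1 and `transparent-retry` set to true.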
Transparent retries should also be treated as distinct RPCs with regard to the pre-existing OpenCensus metrics. For OpenCensus tracing it is natural to create a child span for a transparent retry, just as for an ordinary retry attempt, so the OpenCensus metrics should be consistent with that. Also, there is no effective way to distinguish the two types of transparent retry, and one of them actually sends data on the wire, so the outbound bytes should be recorded.
API plumbing to expose the metrics is needed in each language.
Add the following API to `io.grpc.ClientStreamTracer`:

    /**
     * The stream is being created on a ready transport.
     *
     * @param headers the mutable initial metadata.
     *     Modifications to it will be sent to the socket but
     *     not be seen by client interceptors and the application.
     */
    public void streamCreated(Attributes transportAttrs, Metadata headers) {}

Deprecate `io.grpc.ClientStreamTracer.StreamInfo.getTransportAttrs()` and `io.grpc.ClientStreamTracer.StreamInfo.Builder.setTransportAttrs()`.
We will no longer call `Factory.newClientStreamTracer(info, headers)` to create a tracer instance after a ready transport for the stream is available. Instead, we create a tracer instance before picking a subchannel for the stream for the first time, and once we are about to create a real stream on a ready transport we call `ClientStreamTracer.streamCreated(transportAttrs, headers)`.
Change the signature of `ClientTransport.newStream()` by adding a `tracers` argument:

    /**
     * @param tracers a non-empty array of tracers.
     *     The last element in it is reserved to be set by
     *     the load balancer's pick result and otherwise is null.
     */
    ClientStream newStream(
        MethodDescriptor<?, ?> method, Metadata headers, CallOptions callOptions,
        ClientStreamTracer[] tracers);

To create a `PendingStream` or `FailingClientStream`, we call this method with a tracers array of length `callOptions.getStreamTracerFactories().size() + 1`, with the last element left unset or set to a no-op tracer and the other elements created from `callOptions.getStreamTracerFactories()`.
To create a real stream, we build the same tracers array as above, set its last element to a tracer created from the LB result, and then call this method.
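The array layout described above might be built along these lines. This is a sketch with stand-in types only: `StreamTracer` stands in for `io.grpc.ClientStreamTracer`, a `Supplier` stands in for a tracer factory, and `TracerArrays` with its methods is hypothetical. It also assumes the reserved last slot holds a no-op tracer when the LB result supplies none.

```java
import java.util.List;
import java.util.function.Supplier;

/** Stand-in for io.grpc.ClientStreamTracer (hypothetical, for illustration only). */
interface StreamTracer {}

/** Hypothetical helper that builds the tracers array passed to newStream(). */
final class TracerArrays {
  /** Placeholder for the reserved last slot when no LB tracer is available. */
  static final StreamTracer NOOP = new StreamTracer() {};

  private TracerArrays() {}

  /** Array for a pending or failing stream: factories.size() + 1 elements. */
  static StreamTracer[] forPendingOrFailingStream(List<Supplier<StreamTracer>> factories) {
    StreamTracer[] tracers = new StreamTracer[factories.size() + 1];
    for (int i = 0; i < factories.size(); i++) {
      tracers[i] = factories.get(i).get();
    }
    // The last element is reserved for the tracer from the LB pick result.
    tracers[tracers.length - 1] = NOOP;
    return tracers;
  }

  /** Array for a real stream: same layout, last slot filled from the LB result. */
  static StreamTracer[] forRealStream(
      List<Supplier<StreamTracer>> factories, StreamTracer fromLbResult) {
    StreamTracer[] tracers = forPendingOrFailingStream(factories);
    if (fromLbResult != null) {
      tracers[tracers.length - 1] = fromLbResult;
    }
    return tracers;
  }
}
```

Keeping the array length fixed in both cases means the position of the LB-provided tracer is always known to be the last index.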
See the `CallTracer` class.
TBD