Alternate mechanism to keepalive for detecting broken connections #2710
The channel can only enter the SHUTDOWN state if the application calls the `close` method.

The unsafety of retrying requests is inherent to the problem. If the client sends a request and, as far as it knows, never gets a response, it has no way of knowing whether or not the server has processed it. The simplest solution there is to make your RPCs themselves safe to retry, generally by making them idempotent. If you have an alternative mechanism to detect broken connections, feel free to suggest it.
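As a sketch of the "make it safe to retry" approach: grpc-js supports per-method retry policies through the `grpc.service_config` channel option, so idempotent methods can be retried transparently while non-idempotent ones (like an increment) are left alone. The service and method names below are hypothetical and the numbers are only illustrative:

```typescript
import * as grpc from '@grpc/grpc-js';

// Hypothetical service/method names. retryableStatusCodes deliberately
// excludes DEADLINE_EXCEEDED, since a timed-out request may still have
// been processed by the server.
const serviceConfig = {
  methodConfig: [
    {
      name: [{service: 'cache.CacheService', method: 'Get'}], // idempotent read
      retryPolicy: {
        maxAttempts: 3,
        initialBackoff: '0.1s',
        maxBackoff: '1s',
        backoffMultiplier: 2,
        retryableStatusCodes: ['UNAVAILABLE'],
      },
    },
  ],
};

const channelOptions: grpc.ChannelOptions = {
  'grpc.service_config': JSON.stringify(serviceConfig),
  'grpc.enable_retries': 1,
};
// A generated client would be constructed with these options, e.g.:
// const client = new CacheServiceClient(address, credentials, channelOptions);
```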
That's interesting. We have an IdleConnection wrapper which re-initializes the client once it has been idle for more than `maxIdleMillis` (see the code in the appendix below).

The `close` method prevents new calls from being started and frees up resources once all active calls have finished, but it does allow those active calls to finish normally. If you want to cancel active calls, you need to explicitly cancel them individually.
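To illustrate the close-versus-cancel distinction in grpc-js (the `get` method here is hypothetical): `close()` only stops new calls from being started, while an in-flight call has to be cancelled explicitly.

```typescript
// Assuming `client` is a generated @grpc/grpc-js client and `get` is a unary method.
const call = client.get(request, (err, response) => {
  // After close(), this callback still fires normally when the server responds.
  // After cancel(), it fires with an error whose code is CANCELLED.
});

client.close();   // no new calls can start; the in-flight `call` is allowed to finish
// call.cancel(); // only an explicit cancel aborts the in-flight call
```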
Going by the above comment, shouldn't the existing request complete even if I closed the channel?

Appendix - Code where we call `close`:

```typescript
export class IdleGrpcClientWrapper<T extends CloseableGrpcClient>
  implements GrpcClientWrapper<T>
{
  private readonly logger: MomentoLogger;
  private client: T;
  private readonly clientFactoryFn: () => T;
  private readonly maxIdleMillis: number;
  private lastAccessTime: number;

  constructor(props: IdleGrpcClientWrapperProps<T>) {
    this.logger = props.loggerFactory.getLogger(this);
    this.clientFactoryFn = props.clientFactoryFn;
    this.client = this.clientFactoryFn();
    this.maxIdleMillis = props.maxIdleMillis;
    this.lastAccessTime = Date.now();
  }

  getClient(): T {
    this.logger.trace(
      `Checking to see if client has been idle for more than ${this.maxIdleMillis} ms`
    );
    if (Date.now() - this.lastAccessTime > this.maxIdleMillis) {
      this.logger.info(
        `Client has been idle for more than ${this.maxIdleMillis} ms; reconnecting.`
      );
      this.client.close();
      this.client = this.clientFactoryFn();
    }
    this.lastAccessTime = Date.now();
    return this.client;
  }
}
```
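For context, the snippet above references several types that aren't shown in the issue. Their real definitions are not part of this thread, so the following is only a guess at minimal declarations that would make the wrapper compile:

```typescript
// Assumed shapes; the actual Momento SDK definitions may differ.
interface CloseableGrpcClient {
  close(): void;
}

interface GrpcClientWrapper<T extends CloseableGrpcClient> {
  getClient(): T;
}

interface MomentoLogger {
  trace(msg: string): void;
  info(msg: string): void;
}

interface MomentoLoggerFactory {
  getLogger(context: unknown): MomentoLogger;
}

interface IdleGrpcClientWrapperProps<T extends CloseableGrpcClient> {
  loggerFactory: MomentoLoggerFactory;
  clientFactoryFn: () => T;
  maxIdleMillis: number;
}
```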
I think you misunderstood. I'm not saying that closing the channel caused the RPC to time out. All I'm saying is that knowing that the channel state was SHUTDOWN at the end of the RPC lifetime gives us no more information than that the channel was closed before the call ended.
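The `getConnectivityState` check the reporter describes (see the issue description further down) can be wired up as a grpc-js client interceptor. This is only a rough sketch; it assumes the interceptor is built with some accessor for the underlying channel, e.g. `() => client.getChannel()`, and uses `console.warn` as a stand-in for a real logger:

```typescript
import * as grpc from '@grpc/grpc-js';

// Sketch: an interceptor that records the channel's connectivity state
// whenever a call ends with DEADLINE_EXCEEDED.
function stateOnDeadlineInterceptor(getChannel: () => grpc.Channel) {
  return (options: grpc.InterceptorOptions, nextCall: any) =>
    new grpc.InterceptingCall(nextCall(options), {
      start(metadata, listener, next) {
        next(metadata, {
          onReceiveStatus(status, nextStatus) {
            if (status.code === grpc.status.DEADLINE_EXCEEDED) {
              const state = getChannel().getConnectivityState(false);
              console.warn(
                `Deadline exceeded; channel state: ${grpc.connectivityState[state]}`
              );
            }
            nextStatus(status);
          },
        });
      },
    });
}

// The interceptor would typically be passed in the client or call options,
// e.g. { interceptors: [stateOnDeadlineInterceptor(() => client.getChannel())] }.
```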
Thank you @murgatroid99! I had one question about some recent observations, and whether the logs tell us anything now that you've added new context to them. There were 5 consecutive requests that resulted in a deadline error.

The first deadline request had an LB pick in the error message. The next two requests didn't have LB pick in the error messages. Then the channel/connection was idle for roughly 1.5 seconds and had two more deadline errors, again the first one with LB pick; this time, the next deadline error had another property. After these 5 requests, subsequent requests were succeeding.
All of those indicate the same thing: these requests timed out a full second after the client actually sent the request to the server. That generally indicates server-side latency.
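If the timeouts really are dominated by server-side latency, one client-side mitigation is a longer per-call deadline, which grpc-js supports via call options. A minimal sketch, where `client` and its unary `get` method are hypothetical:

```typescript
import * as grpc from '@grpc/grpc-js';

const callOptions: grpc.CallOptions = {
  // Give the server up to 5 seconds instead of ~1 second.
  deadline: new Date(Date.now() + 5000),
};

client.get(request, callOptions, (err, response) => {
  if (err) {
    console.error(`call failed: ${err.code} ${err.details}`);
    return;
  }
  // handle response
});
```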
Is your feature request related to a problem? Please describe.
This issue describes why it is problematic to enable keepalive in serverless environments like AWS Lambda or Cloud Functions. Because of that, we turned off client-side keepalives only for serverless environments. That helped us eliminate all of the sustained timeouts our clients were experiencing. However, sporadically (about twice a day), they still experience some `timeout` or `internal` errors.

The logs added as a part of this issue were indeed helpful. In combination with those logs, we also perform a `getConnectivityState` in our interceptor when a request fails with `deadline exceeded`. Here's an example log from a client in production:

What I believe is happening is that the connection was broken, as indicated by `SHUTDOWN`. In the absence of `keepalive`, the client can only learn about the broken connection once it sends the request, so it is probably expected that the first request resulted in a `DEADLINE`. Given the caveat of enabling keepalive in serverless environments, is there an alternative for handling these? For clients whose requests are idempotent, this is not an issue, as they can safely retry. However, some of our clients use `increment` APIs that we vend, which aren't safe to retry.

Describe the solution you'd like
Additional context