Alternate mechanism to keepalive for detecting broken connections #2710

Open
pratik151192 opened this issue Apr 6, 2024 · 5 comments

@pratik151192

Is your feature request related to a problem? Please describe.

This issue describes why it is problematic to enable keepalive in serverless environments like AWS Lambda or Cloud Functions. Because of that, we turned off client-side keepalives only for serverless environments. That helped us eliminate all the sustained timeouts our clients were experiencing. However, sporadically (about twice a day), they still experience some timeout or internal errors. The logs added as part of this issue were indeed helpful. In combination with those logs, we also perform a getConnectivityState check in our interceptor when a request fails with deadline exceeded. Here's an example log from a client in production:

[2024-04-05T02:07:58.060Z] WARN (Momento: grpc-interceptor): Deadline Exceeded! Received status: 4 Deadline exceeded after 1.101s,remote_addr=34.192.43.23:443 and grpc connection status: SHUTDOWN

What I believe is happening is that the connection was broken, as indicated by SHUTDOWN. In the absence of keepalive, the client can only learn about the broken connection once it sends a request, so it is probably expected that the first request resulted in a DEADLINE_EXCEEDED. Given the caveat of enabling keepalive in serverless environments, is there an alternative for handling these? For clients whose requests are idempotent this is not an issue, as they can safely retry. However, some of our clients use increment APIs that we vend, and those aren't safe to retry.
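For reference, a minimal sketch of the kind of interceptor check described above, using the standard @grpc/grpc-js interceptor API; the logging hook and the way the channel is passed in are assumptions, not the actual Momento SDK code:

import {
  Channel,
  InterceptingCall,
  Interceptor,
  connectivityState,
  status as grpcStatus,
} from '@grpc/grpc-js';

// Builds an interceptor that logs the channel-level connectivity state when a
// call fails with DEADLINE_EXCEEDED, similar to the log line shown above.
export function deadlineLoggingInterceptor(
  channel: Channel,              // e.g. obtained via client.getChannel()
  log: (msg: string) => void     // assumed logging hook
): Interceptor {
  return (options, nextCall) =>
    new InterceptingCall(nextCall(options), {
      start(metadata, listener, next) {
        next(metadata, {
          onReceiveStatus(callStatus, nextStatus) {
            if (callStatus.code === grpcStatus.DEADLINE_EXCEEDED) {
              // Capture the channel state at the moment the call failed.
              const state = channel.getConnectivityState(false);
              log(
                `Deadline Exceeded! Received status: ${callStatus.code} ` +
                  `${callStatus.details} and grpc connection status: ${connectivityState[state]}`
              );
            }
            nextStatus(callStatus);
          },
        });
      },
    });
}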

Describe the solution you'd like

  • One possible solution would be some way to determine whether these operations are safe to retry. Other than that, I am looking for ideas as well :)

Additional context

@murgatroid99
Member

The channel can only enter the SHUTDOWN state if the application calls the close() method. So its presence in that error message only means that you started that request, then you closed the channel/client, and then the request timed out. That channel-level connectivity state gives no information about the state of the connection the call was made on. In fact, the remote_addr component in the DEADLINE_EXCEEDED error directly indicates that, as far as the client knows, the connection was functional.

The unsafety of retrying requests is inherent to the problem. If the client sends a request (as far as it knows) and never gets a response, it has no way of knowing whether or not the server has processed it. The simplest solution there is to make your RPCs themselves safe to retry, generally by making them idempotent.
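As an illustration of that idempotency approach, here is a sketch of one common pattern: the client attaches a generated idempotency key that the server can deduplicate across retries. The IncrementClient interface, the increment method, and the metadata header name are hypothetical:

import { Metadata } from '@grpc/grpc-js';
import { randomUUID } from 'crypto';

// Hypothetical promise-based client for a non-idempotent increment RPC.
interface IncrementClient {
  increment(
    request: { key: string; amount: number },
    metadata: Metadata
  ): Promise<number>;
}

// Retry wrapper that reuses one client-generated idempotency key across
// attempts so the server can recognize a request it has already applied.
export async function incrementWithRetry(
  client: IncrementClient,
  key: string,
  amount: number,
  maxAttempts = 3
): Promise<number> {
  const idempotencyKey = randomUUID();
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const metadata = new Metadata();
    metadata.set('idempotency-key', idempotencyKey); // hypothetical header name
    try {
      return await client.increment({ key, amount }, metadata);
    } catch (err) {
      lastError = err; // the next attempt carries the same key
    }
  }
  throw lastError;
}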

If you have an alternative mechanism to detect broken connections, feel free to suggest it.

@pratik151192
Author

pratik151192 commented Apr 6, 2024

The channel can only enter the SHUTDOWN state if the application calls the close() method

That's interesting. We have an IdleConnection wrapper which re-initializes the client every x minutes or so. That's the only place where we call close(). But that code path only triggers for an idle client, so a shutdown shouldn't have impacted any new call, as we just dispose of the previous idle channel (code added in the Appendix).

This was before the grpc channel options vended a setting for max client age; we can switch to that. However, I'd expect close() to gracefully complete existing RPCs on the channel anyway, as per this comment (#1211 (comment)):

The close method prevents new calls from being started and frees up resources once all active calls have finished, but it does allow those active calls to finish normally. If you want to cancel active calls you need to explicitly cancel them individually.

So its presence in that error message only means that you called that request, and then you closed the channel/client, and then the request timed out.

Going by the above comment, shouldn't the existing request complete even if I closed the channel?

Appendix - Code where we call close

export class IdleGrpcClientWrapper<T extends CloseableGrpcClient>
  implements GrpcClientWrapper<T>
{
  private readonly logger: MomentoLogger;

  private client: T;
  private readonly clientFactoryFn: () => T;

  private readonly maxIdleMillis: number;
  private lastAccessTime: number;

  constructor(props: IdleGrpcClientWrapperProps<T>) {
    this.logger = props.loggerFactory.getLogger(this);
    this.clientFactoryFn = props.clientFactoryFn;
    this.client = this.clientFactoryFn();
    this.maxIdleMillis = props.maxIdleMillis;
    this.lastAccessTime = Date.now();
  }

  // Return the wrapped client, recreating it first if it has been idle
  // for longer than maxIdleMillis.
  getClient(): T {
    this.logger.trace(
      `Checking to see if client has been idle for more than ${this.maxIdleMillis} ms`
    );
    if (Date.now() - this.lastAccessTime > this.maxIdleMillis) {
      this.logger.info(
        `Client has been idle for more than ${this.maxIdleMillis} ms; reconnecting.`
      );
      // Dispose of the idle client and replace it with a freshly created one.
      this.client.close();
      this.client = this.clientFactoryFn();
    }

    this.lastAccessTime = Date.now();
    return this.client;
  }
}
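
For context, the wrapper above assumes supporting types roughly like the following. This is a sketch inferred from usage; the real definitions in the Momento SDK may differ.

// Sketch of the supporting types the wrapper relies on, inferred from usage.
interface CloseableGrpcClient {
  close(): void;
}

interface GrpcClientWrapper<T extends CloseableGrpcClient> {
  getClient(): T;
}

interface MomentoLogger {
  trace(msg: string): void;
  info(msg: string): void;
}

interface MomentoLoggerFactory {
  getLogger(target: unknown): MomentoLogger;
}

interface IdleGrpcClientWrapperProps<T extends CloseableGrpcClient> {
  loggerFactory: MomentoLoggerFactory;
  clientFactoryFn: () => T;
  maxIdleMillis: number;
}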

@murgatroid99
Member

I think you misunderstood. I'm not saying that closing the channel caused the RPC to time out. All I'm saying is that knowing that the channel state was SHUTDOWN at the end of the RPC lifetime gives us no more information than that the channel was closed before the call ended.

@pratik151192
Author

pratik151192 commented Apr 15, 2024

Thank you @murgatroid99! I had one question about some recent observations and whether the logs tell us anything, since you added some new context to them. There were 5 consecutive requests that resulted in a deadline error:

The first deadline error had an LB pick component. Does this indicate that a DNS resolution happened for this request?

Deadline exceeded after 1.102s,LB pick: 0.001s,remote_addr=3.222.218.75:443 

The next two requests didn't have LB pick in the error messages:

Deadline exceeded after 1.101s,remote_addr=3.222.218.75:443
Deadline exceeded after 1.101s,remote_addr=3.222.218.75:443

Then the channel/connection was idle for roughly 1.5 seconds and had two more deadline errors:

Again, the first had an LB pick:

Deadline exceeded after 1.102s,LB pick: 0.001s,remote_addr=3.222.218.75:443

This time, the next deadline error had another property, metadata filters:

Deadline exceeded after 1.103s,metadata filters: 0.001s,remote_addr=3.222.218.75:443

After these 5 requests, subsequent requests were succeeding. I'm curious whether this indicates something.

@murgatroid99
Member

All of those have a remote_addr property, which means that the call was associated with an active connection to the server with that IP address. The other parts of the extra details show how much time was spent in each step before associating the call with a connection. For example, the first error message shows that it took 1ms in the LB pick step. Values less than 1ms are omitted, so the absence of "LB pick" in the others just means that that step took less than 1ms.

What this indicates is that all of these requests timed out a full second after the client actually sent the request to the server. That generally indicates server-side latency.
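
For reference, a minimal sketch of where a ~1.1s deadline like the one in those messages is set on the client side, using the standard grpc-js deadline call option; the client shape and method below are hypothetical:

import { ServiceError } from '@grpc/grpc-js';

// Hypothetical unary call shape; grpc-js clients accept a `deadline` in the
// call options as a Date or a millisecond timestamp.
declare const client: {
  get(
    request: { key: string },
    options: { deadline: Date | number },
    callback: (err: ServiceError | null, value?: { value: string }) => void
  ): void;
};

// Give the call 1100 ms from now; if no response arrives in time, the call
// fails with DEADLINE_EXCEEDED plus the timing breakdown discussed above.
client.get({ key: 'foo' }, { deadline: Date.now() + 1100 }, (err, value) => {
  if (err) {
    console.error(err.message); // e.g. "Deadline exceeded after 1.101s,..."
    return;
  }
  console.log(value);
});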
