
What can I do about Retry header did not contain a valid timeout? #1421

Open
1 of 6 tasks
meichstedt opened this issue Feb 11, 2022 · 21 comments
Assignees
Labels
bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented pkg:web-api applies to `@slack/web-api` server-side-issue

Comments

@meichstedt

Hey there,

I've been running into 3 occurrences of this error: Retry header did not contain a valid timeout (from here)

The client was using the outdated 5.15.0 web-api, so the error didn't tell me the URL.

My main questions are:

  • Is there a way to prevent getting rate-limit responses without retrySec information?
  • Are there any ideas about what causes this?
  • Is there a best practice for how to handle these?

Thanks in advance

Packages:

Select all that apply:

  • @slack/web-api
  • @slack/rtm-api
  • @slack/webhooks
  • @slack/oauth
  • @slack/socket-mode
  • I don't know

Reproducible in:

npm ls | grep -o "\S\+@\S\+$" | tr @ ' ' | awk -v q='"' '{print q$1q": "q"^"$2q","}' | grep slack
node --version
sw_vers && uname -v # or `ver`

The Slack SDK version

@slack/web-api: 5.15.0

Node.js runtime version

v15.14.0

OS info

Chrome/98.0.4758.80
Macintosh; Intel Mac OS X 10_15_7

Steps to reproduce:

(Share the commands to run, source code, and project settings)

  1. Send a slack message
  2. Receive a 429
  3. Miss the timeout

Expected result:

I don't know, assume a default timeoutSec?

Actual result:

(Tell what actually happened with logs, screenshots)

Requirements

For general questions/issues about the Slack API platform or its server side, please submit questions at https://my.slack.com/help/requests/new instead. 🙇

Please read the Contributing guidelines and Code of Conduct before creating this issue or pull request. By submitting, you are agreeing to those rules.

@meichstedt meichstedt changed the title (Set a clear title describing your question) What can I do about Retry header did not contain a valid timeout? Feb 11, 2022
@filmaj filmaj self-assigned this Feb 11, 2022
@filmaj filmaj added needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info pkg:web-api applies to `@slack/web-api` question M-T: User needs support to use the project and removed untriaged labels Feb 11, 2022
@filmaj
Contributor

filmaj commented Feb 11, 2022

Hey @meichstedt, sorry about the trouble you're experiencing. If at all possible, could you share some logs from around when this happened, and which URL/API you saw returning no retry delay information in its header? That would help me dig into this further; I can then dig into the relevant backend API code to see how this could come about.

At this point, I don't have any good answer for your first two questions, but with more information like logs and which APIs, we could take a deeper look.

As for workarounds, what about wrapping the API invocation in a try/catch, inspecting the error message for "Retry header did not contain a valid timeout", and, if it matches, waiting some reasonable default delay before trying again? Certainly not ideal, and the intention behind the retry and queueing features of web-api is that you shouldn't have to think about that; ideally we'd get to the bottom of the underlying issue and address it.
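
Roughly something like this – just a sketch, not an official recommendation; the sleep helper and the 30-second fallback value are made up for illustration:

import { WebClient } from "@slack/web-api";

const client = new WebClient(process.env.SLACK_TOKEN);

// Made-up fallback delay, used only when the SDK could not parse the Retry-After header.
const FALLBACK_RETRY_MS = 30_000;
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function listConversations(cursor?: string) {
  const params = { cursor, types: "public_channel,private_channel,im,mpim" };
  try {
    return await client.conversations.list(params);
  } catch (error) {
    // Brittle: matches on the message text, since there is no dedicated error code for this case.
    if (error instanceof Error && error.message.includes("Retry header did not contain a valid timeout")) {
      await sleep(FALLBACK_RETRY_MS);
      return client.conversations.list(params);
    }
    throw error;
  }
}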

@meichstedt
Author

meichstedt commented Feb 11, 2022

Hey @filmaj thanks for getting back to me so quickly.
Unfortunately I don't have logs, since the requests were made on a client machine; however, I do have timestamps:

timestamp: 1644428130.537
timestamp: 1644514610.975
timestamp: 1644516295.745

I also have IP addresses of the machine that did the requests – I can share those if you have a non-public way to exchange.

The code in question is only using the conversations.list endpoint with these parameters:

return this.ctx.client.conversations.list({
  cursor,
  types: "public_channel,private_channel,im,mpim",
});

where cursor is string | undefined and set via results.response_metadata?.next_cursor || ""

I'm happy to catch those and retry later; we have a similar workaround in other places. The main thing that worries me in this case is that the client doesn't seem to have made many requests at all – if you could verify that using logs/IP addresses, that would be great.

Let me know if the timestamps help, and whether I can share the IP somehow.

@filmaj
Contributor

filmaj commented Feb 11, 2022

Hey @meichstedt, thanks a lot for the extra information, that's really helpful. I am asking around internally to see if/how I can get access to the production logs for these APIs based on the timestamps and potentially the IPs you could provide. I am relatively new to the company, so I'm still getting my bearings and it might take me a little bit to figure out where to look for this stuff – but as soon as I figure it out I'll report back. FYI it probably will have to wait until after the weekend.

@filmaj
Contributor

filmaj commented Feb 14, 2022

Hey @meichstedt, I can indeed peruse our logs to try to identify the requests and see what happened. Beyond timestamps, can you also provide me with the timezone of your machines (unless the timestamps are UTC)? I can also use the IP addresses to help filter through the information. Any other info you can provide to help narrow down the requests, like channel IDs, bot IDs, or app IDs, would be helpful too. Ideally you could provide all this information over a secure channel. I use keybase.io (keybase.io/filmaj) – would that work for you?

@meichstedt
Author

Hi @filmaj perfect thank you so much. Yes the timestamps are all UTC; I'll send you all that information via keybase.

@filmaj
Contributor

filmaj commented Feb 15, 2022

Got the info, thanks, I will dig around!

@filmaj
Contributor

filmaj commented Feb 15, 2022

Alright, I've dug around, and I believe I can find the relevant logs from the backend for these calls. I see many calls to the conversations.list API with a status=ok log, followed by a single error=rate_limited log. Note that the conversations.list API is a Tier 2 rate limited API, which officially only supports about 20 requests per minute. By my rough calculations, your app is issuing about 10 requests per second. So most definitely your app is bound to hit the rate limit for this API quite quickly 😄

Unfortunately, on our side, while we do log the request parameters and some other helpful tidbits, we do not log the full response returned. As such, I am limited to trying to work backwards through the logic of the backend code to see how this situation could arise. That said, we do log a specific SHA of the backend code as well as a kind of stack trace through the backend code - so I can see the separate paths of logic executed between the rate-limited call and the successful calls.

I will list out my assumptions based on what I've seen from the backend code and how the node-slack-sdk parses the headers:

  1. I am 99.9% sure that some kind of value is returned by the backend in the Retry-After HTTP response header - what kind of value, I am not sure and it is hard for me to infer as it comes from a separate "quota" system. There is some float math applied to this value before writing it to the response header as it seems the quota system stores rate limits in nanoseconds whereas the Retry-After header returns it to clients as seconds. So perhaps there is an issue here in the float math followed by casting to an integer.
  2. Looking at the node-slack-sdk's parseRetryHeaders method, if the header is defined and can be parsed as an integer, then the exception you noted in your original post should never arise. However, if the header is not present or it cannot be parsed as an integer / ends up as a NaN, then the exception you noted is raised.
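
In other words, roughly the following behaviour – a sketch of the parsing logic as I understand it, not the SDK's actual source:

// Rough sketch of the header parsing described in point 2 above (not the SDK's actual source).
function parseRetryAfterSeconds(headers: Record<string, string | undefined>): number {
  const raw = headers["retry-after"];
  const seconds = raw === undefined ? NaN : parseInt(raw, 10);
  if (Number.isNaN(seconds)) {
    // Header missing or not parseable as an integer -> the exception from the original post.
    throw new Error("Retry header did not contain a valid timeout");
  }
  return seconds;
}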

I think the key to solving this problem lies in identifying what was returned in these rate-limited response headers.

Perhaps as a baby step to helping with this issue, I can improve the exception being raised such that it records the values returned in the HTTP response's Retry-After header? This way, whenever this situation arises again, at least we can get a clue that would be helpful in determining how this could arise, and could inform our next step after that.

Sorry this is so indirect / taking so long! However, if you have the time and patience to work with me on this issue, I hope we can get to the bottom of it 😃

filmaj added a commit that referenced this issue Feb 15, 2022
…eader value when throwing an error due to unparseable/invalid header value
seratch pushed a commit that referenced this issue Feb 15, 2022
…eader value when throwing an error due to unparseable/invalid header value (#1426)
@meichstedt
Author

Thanks for the details and for improving the error already @filmaj!

I'll look into what we're doing there; we should definitely not hammer the API like this 🤔
I haven't written this integration so I'll also need some time to dig into this.

@filmaj filmaj added bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented enhancement M-T: A feature request for new functionality server-side-issue and removed question M-T: User needs support to use the project labels Feb 16, 2022
@filmaj
Contributor

filmaj commented Feb 16, 2022

@meichstedt with the improved error being merged to main, if you are able to try the latest commit there, perhaps that could provide us with some additional information if this issue comes up again. If that's not possible, I can comment here after we do the next release and you could move to that version to try it out.

Another idea that came up as I was talking with a backend engineer at Slack about the potential issue with the rate-limiting retry headers was to change the behaviour of this SDK in the case that an invalid header is returned. Perhaps, instead of throwing an exception, the API client could fall back to some sensible default behaviour? Just some ideas, but perhaps introducing a long retry delay of, for example, 30 seconds instead of throwing an exception? I am not sure myself, but thought it is worth a discussion.
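
For example, purely illustrative (the 30-second default is just the number floated above, not anything we've decided on):

// Variation on the parsing sketch earlier in this thread: fall back instead of throwing.
const DEFAULT_RETRY_SEC = 30; // illustrative only

function parseRetryAfterWithFallback(headers: Record<string, string | undefined>): number {
  const seconds = parseInt(headers["retry-after"] ?? "", 10);
  return Number.isNaN(seconds) ? DEFAULT_RETRY_SEC : seconds;
}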

@meichstedt
Author

@filmaj I'm happy to use the main branch if that's considered stable – I planned on upgrading from 5.15.0 anyway. Unless there's a specific commit I should look for, I'll just pick the latest.

We are trying to release weekly, so this change should go out soon once I get to it. I will also check our retry behavior and the API usage in general – I haven't written our Slack integration, but I have a hunch about where the issues originate:

  • The default for exclude_archived seems to be false. We aren't setting it to true (even though we're not interested in archived channels), so we may be getting a ton of archived channels.
  • The conversations.list endpoint has a limit default of 100 – not sure whether leaving it at 100 counts as using it in a paginated fashion, but I'd guess so. Apart from that I wonder whether the rate limit correlates with page size?

Anyway, this combination of loading all channels, including archived ones, is clearly unfavorable in our case. I don't know how many channels that user has in their workspace, but with 300 people it may easily be >1000 depending on how long they've been using Slack; and we're fetching types: "public_channel,private_channel,im,mpim", i.e. "the world".

Re the discussion about what to do when no retry header was given or could be parsed, both of these options would be fine IMO:

  • return a coded error that indicates we're rate limited but no retrySecs is given (meaning: the client using the SDK should apply a default)
  • Default retrySec to a value that is very likely above the max delay (30s sounds OK'ish).

As a consumer, anything that allows me to figure out this case is fine, meaning an error code I can rely on that won't change. I just wouldn't want to infer this case from an arbitrary error string that says *did not contain a valid timeout* – a string which may change over time.

I'll report back once I've adjusted the settings. Thanks again so much for diving into this ❤️

@filmaj filmaj added this to the [email protected] milestone Feb 22, 2022
@filmaj
Contributor

filmaj commented Feb 22, 2022

Re: your particular usage of conversations.list – those all sound like good things to investigate to see if that API's usage could be optimized. Frankly, I am surprised exclude_archived defaults to false – that seems like unnecessary overhead. Regarding the rate limit correlating with page size, I do not believe it does. The rate limit is on gross request counts per API token, so I think your theory is good: for very large result sets, increasing the page size to bring the number of requests down should reduce your likelihood of hitting the rate limit. It makes sense to me to set the limit parameter to its maximum value of 1000 in your case.
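
Putting those two tweaks together, the pagination loop could look roughly like this (a sketch only; adapt it to how your app stores results):

import { WebClient } from "@slack/web-api";

const client = new WebClient(process.env.SLACK_TOKEN);

// Sketch: skip archived conversations and use the maximum page size to keep the request count down.
async function listAllConversations() {
  const channels: unknown[] = [];
  let cursor: string | undefined;
  do {
    const page = await client.conversations.list({
      cursor,
      types: "public_channel,private_channel,im,mpim",
      exclude_archived: true,
      limit: 1000,
    });
    channels.push(...(page.channels ?? []));
    cursor = page.response_metadata?.next_cursor || undefined;
  } while (cursor);
  return channels;
}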

My one concern about introducing a default value for a retry if no valid retry-header is provided is that I worry this could introduce a significant bottleneck for the request queue in the SDK (since a single queue is used to control outgoing requests). Using your particular situation as an example, consider a request queue that gets ~10 requests per second added to it. If suddenly one of the responses for these requests comes back with an invalid retry-header and the SDK defaults to waiting some number of seconds before retrying the request (say, 30 seconds) - then ~300 requests would pile up in the queue while waiting for the default retry timeout! I think that could negatively impact the perceived performance of your application.

I will get the rest of the team to chime in on this thread to see what further potential improvements could be made to address this situation.

@meichstedt
Author

100% agreed that excludeArchived: false is an unfortunate default – not only because it implies many results unless clients restrict the scope, but also because it's effectively a double negation and therefore hard to grok. I had to point it out to one of the smartest people I know, because they naturally assumed the default was true 🤷
(Changing this default could be considered a breaking change though, so I don't expect this to happen – but I believe it would spare a lot of confusion to change it to a default of includeArchived: false).

I've adjusted the settings for conversations.list and users.list, those are the only listings we fetch. Seeing you tagged this with [[email protected]](https://github.com/slackapi/node-slack-sdk/milestone/35) – are you planning on cutting a release any time soon? We're using yarn 2 and to my knowledge that doesn't support adding a commit as a dependency (yarn 1 did).

I wasn't aware the SDK queues requests! That's actually pretty neat; other SDKs I've seen simply provide types as their main value-add. With the default maxRequestConcurrency = 3, yeah, that queue can easily get stuck when requests assume a default of 30s or so. One might queue these outliers separately (so they don't block the queue of well-behaving requests), but honestly, I also wouldn't apply too many workarounds for these special cases.
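
For reference, if I'm reading the WebClientOptions right, those queue/retry knobs are configurable per client – values here are illustrative (3 is the default concurrency mentioned above):

import { WebClient, retryPolicies } from "@slack/web-api";

// Illustrative only: concurrency set to the default mentioned above,
// retryConfig set to one of the exported retry policies.
const client = new WebClient(process.env.SLACK_TOKEN, {
  maxRequestConcurrency: 3,
  retryConfig: retryPolicies.tenRetriesInAboutThirtyMinutes,
});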

@filmaj
Contributor

filmaj commented Feb 22, 2022

Yes, I think we will aim to release a new version very soon. It has been about a month since the previous release and we aim to generally release on a monthly cadence. I can tackle the releasing steps during my business hours (east coast of Canada) tomorrow. While I wouldn't count this issue as resolved with the release, at least we should be able to get better errors next time the error does occur.

@filmaj
Contributor

filmaj commented Feb 23, 2022

FYI @meichstedt I just published web-api 6.7.0 to npm.

@meichstedt
Author

Hi @filmaj, our updated version using 6.7.0 with adjusted filters went live a couple of days ago. I haven't seen any issues so far and I expect this to be resolved. I greatly appreciate your involvement and immediate help on this! 🙌

Unless you want to keep this issue open for anything, it can be closed.

@filmaj
Contributor

filmaj commented Mar 7, 2022

Hi @meichstedt, sounds good, I will close this issue, but if you do catch any more logs around responses coming back with no retry headers, please feel free to re-open and at-message me (this way I will be sure to get a notification).

@filmaj filmaj closed this as completed Mar 7, 2022
@meichstedt
Author

Hi @filmaj, actually I completely missed this issue in our reporting; a single user has been running into this regularly, and since March 5 I've been getting this updated error message from 6.7.0:

Retry header did not contain a valid timeout (url: chat.postMessage, retry-after header: undefined)

the latest event is from ts 1646760189 (that's Mar 8, 2022 5:23:09 PM UTC)

I'm not sure how exactly this user is using the integration, but I'm seeing many previous errors, so that user may be misusing our app and trying to send too many events – still, maybe this helps you debug why the header is missing?

@meichstedt
Author

@filmaj looks like I'm not able to re-open the issue

@filmaj filmaj reopened this Mar 8, 2022
@filmaj
Contributor

filmaj commented Mar 8, 2022

Hey @meichstedt, yes I see the rate limited requests around this time, coming from your existing App ID (assuming it is the same one you confirmed with me last time we spoke), from a single IP address, using the chat.postMessage API.

I see about 200 requests coming in to the chat.postMessage API in a ~15 second timespan beginning at the timestamp that you reported. Of these 200 requests, 139 are tagged with a status of error=ratelimited. It seems like this is all coming from a single user (all requests hitting chat.postMessage are using the same xoxp- user token). It is also trying to post to the same channel in this span. Not sure how your app works or what it is doing but figured I would post details here just to have a better paper trail. These requests are also using the as_user HTTP request parameter.

Since these requests are relatively fresh (happened about an hour ago) I am going to try to cross-reference the request ID with other logs in our systems to try to get a better idea of what is going on in other parts of the system.

@filmaj
Contributor

filmaj commented Mar 8, 2022

Digging into the logs did not reveal any clues. I have asked the backend team for assistance to see what we can do to address this – either by adding additional logging on the backend to get more visibility into the situation, or by applying some default/fallback Retry-After value when no such header is attached to rate-limited responses (or maybe both).

Sorry for the trouble – this may take some time to resolve, but as I get updates and we take steps to address this, I will keep you informed.

@filmaj filmaj removed enhancement M-T: A feature request for new functionality needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info labels Mar 8, 2022
@meichstedt
Author

Thanks for the details @filmaj. To provide a bit of background, Bardeen is a no-code tool that allows users to integrate various services and automate workflows. E.g. a use case could be (not saying this one is enormously useful)

get current calendar event -> find slack users from emails -> send slack message (e.g. with meeting link)

Now, depending on what the user does, this can trigger a lot of invocations on the right-hand side of the automation – in this example the postMessage – e.g. if the user wants to send a notification for every single item found by some search.

I've reached out to that user but haven't heard back and don't know whether they appreciate my investigation ;-)
I don't think they want to have 200 slack notifications sent to themselves within 15 seconds (or even worse to a channel with more people). So I hope I'll get the chance to teach them how to prevent it.

Since the automation is running from the user's browser, there's only so much I can do to understand their use case without getting input from them. We aim to eventually throttle these invocations, but that will take some time.
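
When we do get to it, even a crude client-side throttle would probably help – a hypothetical sketch, with made-up names and timing:

import { WebClient } from "@slack/web-api";

const client = new WebClient(process.env.SLACK_TOKEN);

// Hypothetical throttle: serialize outgoing calls and enforce a minimum gap between them,
// leaving roughly a second between messages.
const MIN_GAP_MS = 1200;
let chain: Promise<unknown> = Promise.resolve();

function throttled<T>(task: () => Promise<T>): Promise<T> {
  const next = chain.then(async () => {
    const result = await task();
    await new Promise((resolve) => setTimeout(resolve, MIN_GAP_MS));
    return result;
  });
  chain = next.catch(() => undefined); // keep the chain alive if a task fails
  return next;
}

// usage: throttled(() => client.chat.postMessage({ channel: "C0123456789", text: "hello" }));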
