
SSO auth renewal is blocking #232

Open
1 of 3 tasks
cben opened this issue Aug 17, 2020 · 6 comments
Comments

@cben
Contributor

cben commented Aug 17, 2020

tokensContext() grabs a mutex:

ocm-sdk-go/token.go, lines 75 to 79 in 323cd42:

```go
func (c *Connection) tokensContext(ctx context.Context) (code int, access, refresh string, err error) {
	// We need to make sure that this method isn't executed concurrently, as we will be updating
	// multiple attributes of the connection:
	c.tokenMutex.Lock()
	defer c.tokenMutex.Unlock()
```

IIUC this means that while a refresh/re-auth request to SSO is in progress, all concurrent API requests are blocked.

This is sub-optimal: we usually renew some time before the previous token expires, so the still-valid token could be used meanwhile without delaying requests.
Even the original request that triggered the renewal could proceed without blocking!

A second question is what happens with retries and concurrent requests. The mutex is only held for a single try, but if a retry is needed we still don't have valid auth, so the second request will immediately take the mutex; the first retry loop doesn't know that happened, so we can end up with several retry loops running in parallel. 🔀
Once one of them succeeds, all following tokensContext() tries will see valid auth and succeed immediately, and all retry loops will complete.

  • So this sounds internally safe to me;
  • however, externally we're not keeping the backoff we intended. Are we risking a "thundering herd" problem where we make an SSO outage worse by amplifying the number of requests?
  • also (very minor), this may confuse the retries logging/metrics ("Better logging and metrics when retrying SSO" #231).

I propose we move renewal off the requesting goroutine into a separate dedicated goroutine, one per Connection object. This would simplify management of retries.
We need to decide whether renewal should wait for a request that needs auth (as done now, meaning zero SSO traffic when the connection is unused) or be timer-based (which minimizes latency).

@igoihman @nimrodshn @tzvatot @vkareh @zgalor @petli-openshift WDYT?

@petli-openshift
Contributor

Overall it sounds like a good plan. It's probably better to still block the request if the access token is completely expired. It doesn't hurt to use a timer on each request: if it times out, it should fail the current request but not impact the renewal thread.
Also, for the renewal thread, is it possible to implement it so that it sends a single request to SSO no matter how many concurrent requests call the goroutine? The token could be delivered through a channel, which decouples the renewal goroutine from the requests.
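The "single request to SSO, result delivered through a channel" idea is essentially request collapsing (the same idea as golang.org/x/sync/singleflight). A stdlib-only sketch, with hypothetical names:

```go
package main

import "sync"

// inflight represents one SSO call in progress; done is closed when the
// result is available, which wakes every waiter at once.
type inflight struct {
	done  chan struct{}
	token string
}

type sharedRefresh struct {
	mu   sync.Mutex
	call *inflight // non-nil while an SSO call is in flight
}

// token collapses concurrent renewals: the first caller performs the fetch,
// everyone else blocks on the same channel and gets the same result.
func (s *sharedRefresh) token(fetch func() string) string {
	s.mu.Lock()
	if c := s.call; c != nil {
		s.mu.Unlock()
		<-c.done // another goroutine is already talking to SSO
		return c.token
	}
	c := &inflight{done: make(chan struct{})}
	s.call = c
	s.mu.Unlock()

	c.token = fetch() // the single SSO round trip, no lock held
	close(c.done)     // broadcast the result to every waiter

	s.mu.Lock()
	s.call = nil
	s.mu.Unlock()
	return c.token
}
```

This answers the "how many requests hit SSO" half; whether waiters should additionally time out and fail their own request (as suggested above) would be a select on `c.done` and a timer.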

@tzvatot
Contributor

tzvatot commented Aug 18, 2020

Access tokens live for 5 min. And it takes a few milliseconds to get it. So we are "paying" a few milliseconds every 5 minutes. I think that's a very small price to pay for the sake of thread safety.
In other words - premature optimization IMO.

@cben
Contributor Author

cben commented Aug 18, 2020

Good point. However, what if SSO is intermittently down, and we need retries? That takes longer (backoff), and IIUC all this time no API requests will be made, despite auth being good. IOW, we're not as "decoupled" from SSO as we hoped.

However, I'm fine with shelving this indefinitely, until actual need is proven 👍.
For starters, we'll soon start collecting stats on retries, and after a few weeks we'll know something about the actual impact.

@cben
Contributor Author

cben commented Aug 18, 2020

An easier change idea:
what if, for some period before blocking for retries, we start "soft refresh" attempts that make a single request without retrying, and just fall back to using the existing token on failure?
(Risk: this may also amplify SSO traffic without enforced backoff.)

hmm, actually that may already be happening!
The retries are wrapped around the whole tokensContext(), which tries a few strategies; but if those fail, it warns and returns without an error:

ocm-sdk-go/token.go, lines 152 to 168 in b9d2a1d:

```go
// Here we know that the access and refresh tokens are unavailable, expired or about to
// expire. We also know that we don't have credentials to request new ones. But we could
// still use the refresh token if it isn't completely expired.
if c.refreshToken != nil && refreshLeft > 0 {
	c.logger.Warn(
		ctx,
		"OCM auth: refresh token expires in only %s, but there is no other mechanism to "+
			"obtain a new token, so will try to use it anyhow",
		refreshLeft,
	)
	code, _, err = c.sendRefreshTokenForm(ctx, attempt)
	if err != nil {
		return
	}
	access, refresh = c.currentTokens()
	return
}
```

This should make backoff.Retry happy and not retry.

So the blocking-for-retries scenario won't kick in until the tokens are actually expired (at which point we must block anyway)?

@tzvatot is this last analysis correct?

@cben
Contributor Author

cben commented Aug 18, 2020

No, I don't think this happens now: when sendRefreshTokenForm() (or one of the other token request helpers) fails, it returns the error immediately.
WDYT of changing that?

@cben
Contributor Author

cben commented Aug 18, 2020

Token request latency data from prod:
https://prometheus.app-sre-prod-04.devshift.net/graph?g0.range_input=1w&g0.expr=histogram_quantile(0.8%2C%20sum%20without(instance%2Cpod)%20(api_outbound_token_request_duration_bucket))&g0.tab=0&g1.range_input=1h&g1.expr=api_outbound_token_request_duration_bucket&g1.tab=1

The data is too coarse to take the quantile seriously, but what we do know is that these requests almost never complete in ≤100 ms 😞 and almost always complete in ≤1 s.
