-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
source-mysql: Connection read/write timeouts redux #2080
base: main
Are you sure you want to change the base?
Conversation
@willdonnelly your upstream PR has been merged, so I think we can now just update the client pin instead of this workaround? go-mysql-org/go-mysql#925 |
Yeah, I'm going to upload a separate PR to bump the client library to that commit. Once that's live (and we've given it a little while to be sure it didn't break anything else in production that might force us to revert) I'll remove the workaround and we can merge this as just the original change to set the timeouts. |
The latest commit to the go-mysql driver is a fix for a bug in how packet timeouts are handled when connecting with TLS: go-mysql-org/go-mysql@ff1dab4 As discussed in #2080, we would like to have this fix ASAP so we're upgrading to that specific commit on master. Once this is live (and we've confirmed that nothing else breaks in production which would necessitate a revert) we'll be free to add the timeout connect option back in without any workarounds.
The latest commit to the go-mysql driver is a fix for a bug in how packet timeouts are handled when connecting with TLS: go-mysql-org/go-mysql@ff1dab4 As discussed in #2080, we would like to have this fix ASAP so we're upgrading to that specific commit on master. Once this is live (and we've confirmed that nothing else breaks in production which would necessitate a revert) we'll be free to add the timeout connect option back in without any workarounds.
2a038cb
to
f9cc901
Compare
This conflicts very slightly with the changes in #2097 so I'll have to make sure they both land correctly, but otherwise I think we're about ready to try these timeouts again. The upstream client library bug has been fixed, we've updated to the latest commit of that dependency, and nothing caught fire today so it's unlikely to require a revert. |
source-mysql-batch/main.go
Outdated
} else if connWithoutTLS, errWithoutTLS := client.Connect(cfg.Address, cfg.User, cfg.Password, cfg.Advanced.DBName); errWithoutTLS == nil { | ||
log.WithField("addr", cfg.Address).Info("connected without TLS") | ||
conn = connWithoutTLS | ||
} else if errors.As(errWithoutTLS, &mysqlErr) && mysqlErr.Code == mysql.ER_ACCESS_DENIED_ERROR { | ||
log.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would reword this a little bit to avoid ambiguity for users about the "unrelated" error, although my suggestion is a bit of a mouthful, feel free to use whatever you think suits
log.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason") | |
log.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for a non-TLS-related reason") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, though it should probably be addressed in the other PR -- this one just includes that diff because I wanted to get the merge conflict between the two resolved ahead of time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@willdonnelly feel free to merge as-is 👍🏽
source-mysql/main.go
Outdated
logrus.WithField("addr", address).Info("connected without TLS") | ||
db.conn = connWithoutTLS | ||
} else if errors.As(errWithoutTLS, &mysqlErr) && mysqlErr.Code == mysql.ER_ACCESS_DENIED_ERROR { | ||
logrus.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logrus.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason") | |
logrus.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for a non-TLS-related reason") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left two nits that are up to you
Previously when we set read/write timeouts on the MySQL connection we saw a bunch of captures in production consistently fail exactly one timeout-duration after they connected. I have sinced debugged this, gotten a fix into the upstream client library, and upgraded our dependency version to that commit. So we can try again once we're sure the version bump won't need a revert.
f9cc901
to
6a33c3a
Compare
The spurious diffs have been resolved now that the other MySQL connections PR is merged. Going to hold off on this for a little bit while we make sure nothing goes wrong from the other connection changes, and then merge this one. |
Description:
Previously when we set read/write timeouts on the MySQL connection we saw a bunch of captures in production consistently fail exactly one timeout-duration after they connected.
I have sinced debugged this (see the comment which makes up the majority of this commit's lines changed) and believe that I have a viable workaround while we try to get the client library fixed.
Note that even with this workaround, we won't have timeouts on subsequent database communications when using TLS until the bug is fixed upstream (and we've upgraded the dependency version). Once it's fixed upstream this workaround is basically a no-op and can be removed at our leisure.
Even without timeouts on subsequent interactions, the timeout is still useful to have since it prevents indefinite hangs during the initial database handshake, which is something we've seen happen pretty frequently in production.
This change is