Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-mysql: Connection read/write timeouts redux #2080

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Oct 23, 2024

Description:

Previously when we set read/write timeouts on the MySQL connection we saw a bunch of captures in production consistently fail exactly one timeout-duration after they connected.

I have sinced debugged this (see the comment which makes up the majority of this commit's lines changed) and believe that I have a viable workaround while we try to get the client library fixed.

Note that even with this workaround, we won't have timeouts on subsequent database communications when using TLS until the bug is fixed upstream (and we've upgraded the dependency version). Once it's fixed upstream this workaround is basically a no-op and can be removed at our leisure.

Even without timeouts on subsequent interactions, the timeout is still useful to have since it prevents indefinite hangs during the initial database handshake, which is something we've seen happen pretty frequently in production.


This change is Reviewable

@willdonnelly willdonnelly added the change:unplanned Unplanned change, useful for things like doc updates label Oct 23, 2024
@willdonnelly willdonnelly requested a review from a team October 23, 2024 19:56
@mdibaiee
Copy link
Member

@willdonnelly your upstream PR has been merged, so I think we can now just update the client pin instead of this workaround? go-mysql-org/go-mysql#925

@willdonnelly
Copy link
Member Author

Yeah, I'm going to upload a separate PR to bump the client library to that commit. Once that's live (and we've given it a little while to be sure it didn't break anything else in production that might force us to revert) I'll remove the workaround and we can merge this as just the original change to set the timeouts.

willdonnelly added a commit that referenced this pull request Oct 25, 2024
The latest commit to the go-mysql driver is a fix for a bug in
how packet timeouts are handled when connecting with TLS:

    go-mysql-org/go-mysql@ff1dab4

As discussed in #2080,
we would like to have this fix ASAP so we're upgrading to that
specific commit on master.

Once this is live (and we've confirmed that nothing else breaks
in production which would necessitate a revert) we'll be free to
add the timeout connect option back in without any workarounds.
willdonnelly added a commit that referenced this pull request Oct 28, 2024
The latest commit to the go-mysql driver is a fix for a bug in
how packet timeouts are handled when connecting with TLS:

    go-mysql-org/go-mysql@ff1dab4

As discussed in #2080,
we would like to have this fix ASAP so we're upgrading to that
specific commit on master.

Once this is live (and we've confirmed that nothing else breaks
in production which would necessitate a revert) we'll be free to
add the timeout connect option back in without any workarounds.
@willdonnelly willdonnelly force-pushed the wgd/mysql-packet-timeouts-20241023 branch from 2a038cb to f9cc901 Compare October 28, 2024 23:28
@willdonnelly
Copy link
Member Author

This conflicts very slightly with the changes in #2097 so I'll have to make sure they both land correctly, but otherwise I think we're about ready to try these timeouts again. The upstream client library bug has been fixed, we've updated to the latest commit of that dependency, and nothing caught fire today so it's unlikely to require a revert.

} else if connWithoutTLS, errWithoutTLS := client.Connect(cfg.Address, cfg.User, cfg.Password, cfg.Advanced.DBName); errWithoutTLS == nil {
log.WithField("addr", cfg.Address).Info("connected without TLS")
conn = connWithoutTLS
} else if errors.As(errWithoutTLS, &mysqlErr) && mysqlErr.Code == mysql.ER_ACCESS_DENIED_ERROR {
log.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would reword this a little bit to avoid ambiguity for users about the "unrelated" error, although my suggestion is a bit of a mouthful, feel free to use whatever you think suits

Suggested change
log.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason")
log.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for a non-TLS-related reason")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, though it should probably be addressed in the other PR -- this one just includes that diff because I wanted to get the merge conflict between the two resolved ahead of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willdonnelly feel free to merge as-is 👍🏽

logrus.WithField("addr", address).Info("connected without TLS")
db.conn = connWithoutTLS
} else if errors.As(errWithoutTLS, &mysqlErr) && mysqlErr.Code == mysql.ER_ACCESS_DENIED_ERROR {
logrus.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logrus.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for an unrelated reason")
logrus.WithField("withTLS", errWithTLS).Warn("connecting with TLS failed for a non-TLS-related reason")

Copy link
Member

@mdibaiee mdibaiee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left two nits that are up to you

Previously when we set read/write timeouts on the MySQL connection
we saw a bunch of captures in production consistently fail exactly
one timeout-duration after they connected.

I have sinced debugged this, gotten a fix into the upstream client
library, and upgraded our dependency version to that commit. So we
can try again once we're sure the version bump won't need a revert.
@willdonnelly willdonnelly force-pushed the wgd/mysql-packet-timeouts-20241023 branch from f9cc901 to 6a33c3a Compare October 29, 2024 15:22
@willdonnelly
Copy link
Member Author

The spurious diffs have been resolved now that the other MySQL connections PR is merged. Going to hold off on this for a little bit while we make sure nothing goes wrong from the other connection changes, and then merge this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:unplanned Unplanned change, useful for things like doc updates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants