backup: add resume and onfail/cancel logic to compactions #141753
base: master
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from afcd540 to ced14ba
pkg/backup/backup_compaction_test.go
Outdated
jobutils.WaitForJobToPause(t, db, jobID)
db.Exec(t, "RESUME JOB $1", jobID)
waitForSuccessfulJob(t, tc, jobID)
validateCompactedBackupForTables(t, db, []string{"foo", "bar"}, "'nodelocal://1/backup'")
I've realized in my testing that this lets through a couple of false negatives. validateCompactedBackupForTables simply restores from the latest backup. If the tests were created in isolation, this might not be too big of an issue, but since they all share the same cluster, a test that failed to create a compaction would actually just restore from a previous test's backup.
I think we can fix this by having validateCompactedBackupForTables take in the start and end of the compaction, and it should also validate that a backup spanning that timespan exists. But I'll do this in another PR.
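A rough sketch of what that extended helper could look like; the helper name, signature, SHOW BACKUP query, and timestamp comparison here are illustrative assumptions, not the existing test code:

package backup

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
)

// validateCompactedBackupForTablesWithTimes is a hypothetical extension of
// validateCompactedBackupForTables: besides restoring the tables, it asserts
// that the backup chain contains a backup covering exactly [start, end], so a
// compaction that silently failed to run cannot pass by restoring data left
// over from a previous subtest.
func validateCompactedBackupForTablesWithTimes(
	t *testing.T, db *sqlutils.SQLRunner, tables []string, collectionURI, start, end string,
) {
	t.Helper()
	// Existing behavior (restore the tables from the latest backup and check
	// their contents) is omitted from this sketch.

	// New check: look for a backup whose span matches the compaction's
	// [start, end] interval in the SHOW BACKUP output.
	rows := db.QueryStr(t,
		"SELECT DISTINCT start_time, end_time FROM [SHOW BACKUP FROM LATEST IN "+collectionURI+"]")
	for _, row := range rows {
		if row[0] == start && row[1] == end {
			return
		}
	}
	t.Fatalf("no backup spanning [%s, %s] found in %s", start, end, collectionURI)
}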
could you instead use a different collection dir for each subtest instead of nodelocal://1/backup for everything?
Oh yeah, that would do the trick. I do think having an extra check that makes sure there exists a backup that spans the entire time from start to end is good though.
I'll do this in a separate PR though
Force-pushed from 1a0f8d2 to ec5a023
I took a pass at the production code. I'll take a look at the tests later.
// TODO (kev-cao): Add telemetry for backup compactions.
}

// For all backups, partitioned or not, the main BACKUP manifest is stored at
Once we get all the resolution logic in, I do think there are opportunities to cleanly combine more parts of the compaction job and regular backup jobs. But I think it's important to first lay out exactly what we need (resumption logic/checkpointing, distSQL flow), with all the duplicate code, before refactoring.
To avoid if/else conditionals, perhaps we'll create a backupDistSQLPrep interface that implements the prep work for regular and compacted backups. But let's think through that after we land all the basic blocks.
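As a very rough sketch of that idea (all names, method sets, and placeholder types below are illustrative, not actual CockroachDB code), the shared planning code could accept an interface and each job type could supply its own prep:

package backup

import "context"

// backupDistSQLPrep sketches the interface floated above: everything the
// distSQL planning code needs from a specific backup flavor.
type backupDistSQLPrep interface {
	// resolveDest resolves (or, on resumption, re-uses) the destination the
	// backup writes its SSTs and manifests to.
	resolveDest(ctx context.Context) (resolvedDest, error)
	// spansToExport returns the key spans the distSQL flow needs to export.
	spansToExport(ctx context.Context) ([]keySpan, error)
}

// Placeholder types standing in for the real destination and span types.
type resolvedDest struct{ defaultURI string }
type keySpan struct{ startKey, endKey []byte }

// vanillaBackupPrep and compactionPrep would each implement backupDistSQLPrep,
// so the shared planning code can accept the interface instead of branching
// with if/else on the job type.
type vanillaBackupPrep struct{}
type compactionPrep struct{}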
Yeah, I agree. After splitting this up instead of integrating it early, I found some areas where improvements could be made precisely because the code was now split up.
@@ -338,12 +338,9 @@ func GetEncryptionFromBase(
  user username.SQLUsername,
  makeCloudStorage cloud.ExternalStorageFromURIFactory,
  baseBackupURI string,
- encryptionParams jobspb.BackupEncryptionOptions,
+ encryptionParams *jobspb.BackupEncryptionOptions,
This kind of small refactor should be in a separate commit.
I'm also not sure what it is seeking to accomplish: previously we could avoid the makeCloudStorage call if there was no encryption, but now we have to make the call regardless?
Oh, this issue was caused by the fact that initialDetails would start off with a non-nil encryption because it was always passed an encryption from the builtin. But through our flow of creating the updated details, if we aren't using encryption, encryption actually becomes nil in the updated details (from GetEncryptionFromBaseStore).
So on resume, initialDetails.EncryptionOptions is nil, which led to some nil pointer exceptions because some of our code expects a *jobspb.BackupEncryptionOptions and other parts expect a jobspb.BackupEncryptionOptions. Classic backup happens to avoid this because on resume, we skip all of this work. But with compaction, since we have to validate the compaction chain, we pull all of the manifests for the backup chain and, as a consequence, end up requiring the encryption from initialDetails.
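A minimal, self-contained illustration of that value-versus-pointer mismatch (the names here are placeholders, not CockroachDB code):

package main

import "fmt"

// Some helpers take the encryption options by value and others by pointer, so
// a nil pointer coming out of resumed job details only trips up the
// value-taking paths, which have no way to represent "no encryption".
type encryptionOptions struct{ mode string }

func pointerConsumer(opts *encryptionOptions) string {
	if opts == nil { // pointer consumers can guard against "no encryption"
		return "none"
	}
	return opts.mode
}

func valueConsumer(opts encryptionOptions) string {
	return opts.mode
}

func main() {
	var fromResumedDetails *encryptionOptions // nil when the backup is unencrypted
	fmt.Println(pointerConsumer(fromResumedDetails)) // fine: prints "none"
	if fromResumedDetails != nil {
		// Without this guard, *fromResumedDetails panics with a nil pointer
		// dereference, which is the failure mode described above.
		fmt.Println(valueConsumer(*fromResumedDetails))
	}
}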
I see. Could you put the encryption patch in a separate, first commit? And further, could you add
if encryptionParams == nil || encryptionParams.Mode == jobspb.EncryptionMode_None {
return nil, nil
}
to GetEncryptionFromBase before making cloud storage to avoid that extra work if the backup is not encrypted?
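A rough sketch of where that guard would sit; the package name, simplified parameter list, and body here are assumptions based on the hunk above, not the actual function:

package backupencryption

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cloud"
	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
	"github.com/cockroachdb/cockroach/pkg/security/username"
)

// GetEncryptionFromBase (simplified): the early return runs before any
// external storage is opened, so unencrypted backups skip that work entirely.
func GetEncryptionFromBase(
	ctx context.Context,
	user username.SQLUsername,
	makeCloudStorage cloud.ExternalStorageFromURIFactory,
	baseBackupURI string,
	encryptionParams *jobspb.BackupEncryptionOptions,
) (*jobspb.BackupEncryptionOptions, error) {
	if encryptionParams == nil || encryptionParams.Mode == jobspb.EncryptionMode_None {
		// No encryption configured: nothing to resolve and no reason to pay
		// for an external storage connection.
		return nil, nil
	}
	store, err := makeCloudStorage(ctx, baseBackupURI, user)
	if err != nil {
		return nil, err
	}
	defer store.Close()
	// ...read the ENCRYPTION-INFO file from the base backup and fill in the
	// resolved key/KMS information (omitted in this sketch)...
	return encryptionParams, nil
}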
Gotcha, I'll push up another PR for this then.
pkg/backup/backup_compaction.go
Outdated
return &m, nil
}

func getRestoreChain(
nit: I think we should use the term backup chain, not restore chain.
Force-pushed from 354dbbe to 10f1357
looking better!
pkg/backup/backup_compaction.go
Outdated
// Resolve the backup destination. If we have already resolved and persisted
// the destination during a previous resumption of this job, we can re-use
// the previous resolution.
if needsInitializing {
This block could be put in the needsInitializing block below; that way, we only have one initializing block. Then we don't need to declare BackupDest above, I think.
Oh right, the dependency on backupDest got moved into the block so this can be moved too now. That's awesome, I hated this extra block.
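A rough sketch of the shape being suggested, with hypothetical helper and type names (not actual CockroachDB code): destination resolution moves inside the single initialization block, so nothing has to be declared in the enclosing scope.

package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the compaction job details and resolver.
type compactionDetails struct{ destinationURI string }

func resolveDest(ctx context.Context) (string, error) {
	return "nodelocal://1/backup", nil
}

// maybeInitialize folds destination resolution into the one needsInitializing
// block instead of keeping it in a separate block above.
func maybeInitialize(
	ctx context.Context, d compactionDetails, needsInitializing bool,
) (compactionDetails, error) {
	if needsInitializing {
		dest, err := resolveDest(ctx)
		if err != nil {
			return compactionDetails{}, err
		}
		d.destinationURI = dest
		// ...the rest of the one-time initialization (persisting the updated
		// details, etc.) would go here...
	}
	return d, nil
}

func main() {
	d, err := maybeInitialize(context.Background(), compactionDetails{}, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(d.destinationURI)
}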
pkg/backup/backup_compaction_test.go
Outdated
db.Exec(t, "INSERT INTO foo VALUES (2, 2)") | ||
db.Exec(t, incBackupCmd) | ||
db.Exec(t, "CREATE TABLE bar (a INT, b INT)") | ||
db.Exec(t, "INSERT INTO bar VALUES (1, 1)") |
why does this test require 4 inc backups?
Oh, I suppose it doesn't. I've gotten into the habit of trying to make sure there are "meaningful" compactions, but the other tests cover that, so there isn't really a need for it in this test.
pkg/backup/backup_compaction_test.go
Outdated
db.Exec(t, "SET CLUSTER SETTING jobs.debug.pausepoints = ''")
db.Exec(t, "INSERT INTO bar VALUES (4, 4)")
db.Exec(t, incBackupCmd)
is this extra vanilla inc backup necessary?
Nope, had it for the same reason above.
Force-pushed from 10f1357 to e7b2cfe
Force-pushed from e7b2cfe to 5044582
Compactions were jobified in #141183 but were missing support for job resumption and cancellation handling. This commit adds support for handling those cases.
An additional PR should be made to add telemetry for compactions.
Epic: none
Release note: None