
backup: add resume and onfail/cancel logic to compactions #141753

Open · kev-cao wants to merge 1 commit into master from backup/resume-compactions
Conversation

kev-cao
Contributor

@kev-cao kev-cao commented Feb 19, 2025

Compactions were jobified in #141183, but lacked support for job resumption and cancellation handling. This commit adds handling for those cases.

A follow-up PR should add telemetry for compactions.

Epic: none

Release note: None


blathers-crl bot commented Feb 19, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@kev-cao kev-cao force-pushed the backup/resume-compactions branch 12 times, most recently from afcd540 to ced14ba Compare February 25, 2025 19:56
jobutils.WaitForJobToPause(t, db, jobID)
db.Exec(t, "RESUME JOB $1", jobID)
waitForSuccessfulJob(t, tc, jobID)
validateCompactedBackupForTables(t, db, []string{"foo", "bar"}, "'nodelocal://1/backup'")
Contributor Author

I've realized in my testing that this lets a couple of false negatives through. validateCompactedBackupForTables simply restores from the latest backup. If the tests ran in isolation, this might not be too big of an issue, but since they all share the same cluster, a test that failed to create a compaction would actually just restore from a previous test's backup and still pass.

I think we can fix this by having validateCompactedBackupForTables take the start and end of the compaction and also validate that a backup spanning that timespan exists. But I'll do this in another PR.
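The proposed extra check can be sketched with toy types (plain integer timestamps instead of real backup manifests; hasBackupSpanning is a hypothetical helper, not CockroachDB code):

```go
package main

import "fmt"

// backup is a toy stand-in for a backup manifest's coverage window.
type backup struct {
	start, end int64 // real code would use cluster timestamps
}

// hasBackupSpanning reports whether any single backup in the chain covers
// exactly the compaction's [start, end] window. This is the extra check
// proposed above: a backup must exist whose span matches the compaction
// bounds, so a restore from a previous test's backup cannot masquerade as
// a successful compaction.
func hasBackupSpanning(chain []backup, start, end int64) bool {
	for _, b := range chain {
		if b.start == start && b.end == end {
			return true
		}
	}
	return false
}

func main() {
	// full backup, incremental, and a compacted backup covering both.
	chain := []backup{{0, 10}, {10, 20}, {0, 20}}
	fmt.Println(hasBackupSpanning(chain, 0, 20))  // compacted backup present
	fmt.Println(hasBackupSpanning(chain, 10, 30)) // no such span
}
```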

Collaborator

could you use a different collection dir for each subtest instead of nodelocal://1/backup for everything?

Contributor Author

Oh yeah, that would do the trick. I do still think an extra check that a backup exists spanning the entire time from start to end is worthwhile, though.

Contributor Author

I'll do this in a separate PR though
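The reviewer's per-subtest collection dir might look like the following sketch, deriving a distinct nodelocal path from the subtest name (collectionURI is a hypothetical helper, not the PR's code):

```go
package main

import (
	"fmt"
	"strings"
)

// collectionURI derives a distinct nodelocal collection directory from a
// subtest name, so subtests sharing one cluster cannot read each other's
// backups. A sketch of the reviewer's suggestion, not the actual test code.
func collectionURI(subtest string) string {
	// Sanitize characters that are awkward in a URI path.
	dir := strings.NewReplacer(" ", "_", "/", "_").Replace(subtest)
	return fmt.Sprintf("nodelocal://1/backup/%s", dir)
}

func main() {
	// In a real test this would be collectionURI(t.Name()).
	fmt.Println(collectionURI("resume after pause")) // nodelocal://1/backup/resume_after_pause
	fmt.Println(collectionURI("cancel/mid-flight"))  // nodelocal://1/backup/cancel_mid-flight
}
```

With a unique directory per subtest, validateCompactedBackupForTables can only ever see backups written by its own subtest, which removes the false-negative path described above.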

@kev-cao kev-cao marked this pull request as ready for review February 25, 2025 19:58
@kev-cao kev-cao requested a review from a team as a code owner February 25, 2025 19:58
@kev-cao kev-cao requested review from jeffswenson, dt, msbutler and a team and removed request for a team February 25, 2025 19:58
@kev-cao kev-cao force-pushed the backup/resume-compactions branch 3 times, most recently from 1a0f8d2 to ec5a023 Compare February 25, 2025 23:48
Collaborator

@msbutler msbutler left a comment

I took a pass at the production code. I'll take a look at the tests later.

// TODO (kev-cao): Add telemetry for backup compactions.
}

// For all backups, partitioned or not, the main BACKUP manifest is stored at
Collaborator

once we get all the resolution logic in, I do think there are opportunities to cleanly combine more parts of the compaction job and regular backup jobs. But I think it's important to first lay out exactly what we need (resumption logic/checkpointing, distSQL flow), with all the duplicate code, before refactoring.

to avoid if-else conditionals, perhaps we'll create a backupDistSQLPrep interface that implements the prep work for regular and compacted backups. But let's think through that after we land all the basic blocks.

Contributor Author

Yeah, I agree. After splitting this up instead of integrating it early, I found some areas where improvements could be made precisely because the code was now split up.

@@ -338,12 +338,9 @@ func GetEncryptionFromBase(
	user username.SQLUsername,
	makeCloudStorage cloud.ExternalStorageFromURIFactory,
	baseBackupURI string,
-	encryptionParams jobspb.BackupEncryptionOptions,
+	encryptionParams *jobspb.BackupEncryptionOptions,
Collaborator

this kind of small refactor should be in a separate commit.

I'm also not sure what it's seeking to accomplish: previously we could avoid the makeCloudStorage call if there was no encryption, but now we have to make the call regardless?

Contributor Author

Oh, this issue was caused by the fact that initialDetails would start off with a non-nil encryption because it was always passed an encryption from the builtin. But through our flow of creating the updated details, if we aren't using encryption, encryption actually becomes nil in the updated details (from GetEncryptionFromBaseStore).

So on resume, initialDetails.EncryptionOptions is nil, which led to some nil pointer panics because some of our code expects a *jobspb.BackupEncryptionOptions and other parts expect a jobspb.BackupEncryptionOptions. Classic backup happens to avoid this because on resume we skip all of this work. But with compaction, since we have to validate the compaction chain, we pull all of the manifests for the backup chain and, as a consequence, end up requiring the encryption from initialDetails.

Collaborator

I see. Could you put the encryption patch in a separate, first commit? And further, could you add

if encryptionParams == nil || encryptionParams.Mode == jobspb.EncryptionMode_None {
		return nil, nil
	}

to GetEncryptionFromBase before making cloud storage to avoid that extra work if the backup is not encrypted?
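Sketched in isolation, the suggested guard short-circuits before any storage dial. The types below are toy stand-ins for jobspb and cloud (getEncryptionFromBase and dial are hypothetical names, not CockroachDB's actual signatures):

```go
package main

import "fmt"

// Toy stand-ins for jobspb.BackupEncryptionOptions and its mode enum.
type encryptionMode int

const encryptionModeNone encryptionMode = 0

type encryptionOptions struct {
	Mode encryptionMode
}

// getEncryptionFromBase sketches the reviewer's suggestion: return early
// when the backup is unencrypted so the (potentially expensive) cloud
// storage dial is skipped entirely. dial stands in for makeCloudStorage.
func getEncryptionFromBase(
	opts *encryptionOptions,
	dial func(uri string) error,
	baseURI string,
) (*encryptionOptions, error) {
	if opts == nil || opts.Mode == encryptionModeNone {
		return nil, nil // unencrypted: no storage access needed
	}
	if err := dial(baseURI); err != nil {
		return nil, err
	}
	return opts, nil
}

func main() {
	dialed := false
	dial := func(string) error { dialed = true; return nil }

	// nil options: guard fires, dial never happens.
	enc, err := getEncryptionFromBase(nil, dial, "nodelocal://1/backup")
	fmt.Println(enc == nil, err == nil, dialed)

	// encrypted options: dial happens.
	_, _ = getEncryptionFromBase(&encryptionOptions{Mode: 1}, dial, "nodelocal://1/backup")
	fmt.Println(dialed)
}
```

The early return also sidesteps the nil-pointer issue described above: callers holding a nil *jobspb.BackupEncryptionOptions get a clean nil result instead of a dereference.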

Contributor Author

Gotcha, I'll push up another PR for this then.

return &m, nil
}

func getRestoreChain(
Collaborator

nit: I think we should use the term backup chain, not restore chain.

@kev-cao kev-cao force-pushed the backup/resume-compactions branch 3 times, most recently from 354dbbe to 10f1357 Compare February 26, 2025 18:35
@kev-cao kev-cao requested a review from msbutler February 26, 2025 19:05
Collaborator

@msbutler msbutler left a comment

looking better!

// Resolve the backup destination. If we have already resolved and persisted
// the destination during a previous resumption of this job, we can re-use
// the previous resolution.
if needsInitializing {
Collaborator

this block could be put in the needsInitializing block below; that way we only have one initializing block, and then we don't need to declare backupDest above, I think.

Contributor Author

Oh right, the dependency on backupDest got moved into the block, so this can be moved too now. That's great; I hated this extra block.
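The consolidation being discussed, one initialization block that both resolves the destination and does the remaining one-time prep so nothing escapes it, might look like this sketch (initializeIfNeeded and the details struct are hypothetical, not the PR's code):

```go
package main

import "fmt"

// details is a toy stand-in for persisted job details.
type details struct {
	Destination string // empty until resolved and persisted
}

// initializeIfNeeded sketches the single-initialization pattern: when the
// job has not yet persisted its resolved destination, resolve it and do all
// remaining prep inside one block, so no intermediate variable (the
// backupDest from this thread) needs to be declared outside it.
func initializeIfNeeded(d *details, resolve func() string) {
	if d.Destination == "" { // needsInitializing
		dest := resolve() // scoped to the initializing block
		d.Destination = dest
		// ...remaining one-time prep would go here...
	}
	// On resumption, the persisted destination is reused as-is.
}

func main() {
	calls := 0
	resolve := func() string { calls++; return "nodelocal://1/backup" }

	d := &details{}
	initializeIfNeeded(d, resolve) // first run: resolves the destination
	initializeIfNeeded(d, resolve) // resumption: reuses the persisted value
	fmt.Println(d.Destination, calls)
}
```

Running resolution only inside the guarded block is also what makes resumption idempotent: a re-run never re-resolves a destination that a prior attempt already persisted.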


db.Exec(t, "INSERT INTO foo VALUES (2, 2)")
db.Exec(t, incBackupCmd)
db.Exec(t, "CREATE TABLE bar (a INT, b INT)")
db.Exec(t, "INSERT INTO bar VALUES (1, 1)")
Collaborator

why does this test require 4 inc backups?

Contributor Author

Oh, I suppose it doesn't. I've gotten into the habit of trying to make sure there are "meaningful" compactions, but the other tests cover that, so there isn't really a need for it in this test.


db.Exec(t, "SET CLUSTER SETTING jobs.debug.pausepoints = ''")
db.Exec(t, "INSERT INTO bar VALUES (4, 4)")
db.Exec(t, incBackupCmd)
Collaborator

is this extra vanilla inc backup necessary?

Contributor Author

Nope, I had it for the same reason as above.

@kev-cao kev-cao force-pushed the backup/resume-compactions branch from 10f1357 to e7b2cfe Compare February 27, 2025 01:26
@kev-cao kev-cao force-pushed the backup/resume-compactions branch from e7b2cfe to 5044582 Compare February 27, 2025 01:27