Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Change RemoveSnapshots to remove unused schemas #12089

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gaborkaszab
Copy link
Collaborator

Similarly to removing partition specs, this PR improves RemoveSnapshots to also remove unused schemas.

@github-actions github-actions bot added the core label Jan 24, 2025
@gaborkaszab
Copy link
Collaborator Author

For the record, here is the PR that introduced the same for partition specs: #10755
And here is the PR required for the new metadata update in REST spec: #12022

@gaborkaszab
Copy link
Collaborator Author

cc @RussellSpitzer @rdblue @danielcweeks Would you mind taking a look?

@gaborkaszab
Copy link
Collaborator Author

cc @advancedxy

.map(ManifestFile::partitionSpecId)
.forEach(reachableSpecs::add));
snapshotId -> {
base.snapshot(snapshotId).allManifests(ops.io()).stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you might want to extract base.snapshot(snapshotId) into a separate variable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return this;
}

Builder removeSchemas(Iterable<Integer> schemaIds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add a test where schemaIds is empty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

schemas.stream()
.filter(s -> !schemaIdsToRemove.contains(s.schemaId()))
.collect(Collectors.toList());
changes.add(new MetadataUpdate.RemoveSchemas(schemaIdsToRemove));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only create a change when schemaIdsToRemove was non-empty. This would most likely also apply for removePartitionSpecs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done

Will create a separate PR for the same for RemovePartitionSpecs once we sort this one out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prepared another PR for the same RemovePartitionSpecs improvement. Will publish it once this is merged (that also depends on the TestTables change in his PR and not sure how to publish stacked changes as separate PRs TBH)

assertThat(table.schemas().size()).isEqualTo(3);

Set<String> deletedFiles = Sets.newHashSet();
// Expire all snapshots except the current one. Also expire all schemas except the current one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Expire all snapshots except the current one. Also expire all schemas except the current one.
// Expire all snapshots/schemas except the current one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator Author

@gaborkaszab gaborkaszab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, @nastra !

.map(ManifestFile::partitionSpecId)
.forEach(reachableSpecs::add));
snapshotId -> {
base.snapshot(snapshotId).allManifests(ops.io()).stream()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return this;
}

Builder removeSchemas(Iterable<Integer> schemaIds) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

schemas.stream()
.filter(s -> !schemaIdsToRemove.contains(s.schemaId()))
.collect(Collectors.toList());
changes.add(new MetadataUpdate.RemoveSchemas(schemaIdsToRemove));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done

Will create a separate PR for the same for RemovePartitionSpecs once we sort this one out.

assertThat(table.schemas().size()).isEqualTo(3);

Set<String> deletedFiles = Sets.newHashSet();
// Expire all snapshots except the current one. Also expire all schemas except the current one.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@gaborkaszab gaborkaszab requested a review from nastra February 12, 2025 12:46
@@ -72,6 +72,26 @@ public static TestTable create(
return new TestTable(ops, name);
}

public static TestTable create(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to change the other create method to call this method and pass ops so that we don't duplicate the same implementations:

public static TestTable create(
      File temp,
      String name,
      Schema schema,
      PartitionSpec spec,
      SortOrder sortOrder,
      int formatVersion) {
    TestTableOperations ops = new TestTableOperations(name, temp);
    return create(temp, name, schema, spec, sortOrder, formatVersion, ops);
  }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Done.

Anyway, I was thinking of introducing a Builder for these test tables. I see that we have a number of variations now (with-without MetricsReporter, TableOperations and now in the partition stats PR we'll have one for props too) and might be cleaner to have a builder instead of these different create() methods. WDYT @nastra ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine if we start exploring and seeing how a Builder turns out with TestTable but I wouldn't do it as part of this PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not within this PR :)

Similarly to removing partition specs, this PR improves RemoveSnapshots
to also remove unused schemas.
@nastra
Copy link
Contributor

nastra commented Feb 12, 2025

I'll wait a bit with merging to give others some time to review this as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants