Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, API, Spec: Metadata Row Lineage #11948

Merged
merged 22 commits into from
Feb 1, 2025

Conversation

RussellSpitzer
Copy link
Member

@RussellSpitzer RussellSpitzer commented Jan 10, 2025

Adds required metadata fields and operations for working with Row Lineage. Does not cover implementation within manifest entries or propagation.

Major Changes

Metadata -

  • Adds field last-row-id for tracking the last used row-id in the table
  • Adds Snapshot field first-row-id for tracking the first row-id assigned within the snapshot
    *~~ Adds Snapshot field addedRows for explictly tracking the number of added rows in the snapshot This was not in the design doc ~~ Done in a separate PR Spec: Add added-rows field to Snapshot #11976

API-

  • adds TableMetadata operations for enabling row lineage
  • adds getters for all of the above fields

Implementation

  • Changes "addSnapshot" in Table Metadata to use the addedRows from the snapshot to increment last-row-id
  • Changed SnapshotProducer to populate added-rows and set first-row-id

@github-actions github-actions bot added API core Specification Issues that may introduce spec changes. labels Jan 10, 2025
@RussellSpitzer
Copy link
Member Author

@nastra Very interested on your thoughts on my test suite here. I'm thinking about how to get away from relying on the Junit4 Test wrapper annotation in TestBase. I know I did something similar in the V3 Metadata change a while back and I'm wondering if you are on board with this direction.

@RussellSpitzer RussellSpitzer changed the title Metadata Row Lineage Core, API, Spec: Metadata Row Lineage Jan 14, 2025
@@ -43,7 +43,16 @@ public void testJsonConversion() throws IOException {

Snapshot expected =
new BaseSnapshot(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JonasJ-ap I believe has a WIP to add test builder classes for Metadata and Snapshot to make these sorts of changes less painful in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builders will be added in #11947

Copy link
Contributor

@nastra nastra Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that builders in test code are the right path going forward. For BaseSnapshot there's an easy way to avoid all the changes by having an overriding constructor. We could most likely do the same for TableMetadata, but let me first do a deep review of this PR to see what options we have here.
However, I'm curious what others think here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm strongly against multiple constructors because I think the use cases we have here (mostly test cases) don't actually need to specify all of the parameters. Currently we have tests like

testParameter5  () {
  Snapshot = (foo, bar, x ,null, 4, ...,....,.,,..)
}

Which I think would be much cleaner as

testParameter5() {
   testSnapshot().withParameter5(true) ... some beahvior
   testSnapshot().withParameter5(false) .. some other behavior
}

Not only would the new tests be cleaner, but they would also be future proofed.

For the general use in the library I don't think there are any places in the code where we actually want to be using a constructor with less parameters. For Snapshot, we always want everything specified and for TableMetadata we expect users to go through the Public Builder and not the constructor.

String manifestList) {
String manifestList,
Long firstRowId,
Long addedRows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth adding an overriding constructor that takes these two new arguments? And the existing constructor would then just pass nulls for these? That way you wouldn't have to update all the constructor calls

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think there is more of an argument to have multiple constructors since there isn't a builder, but I"m still hesitant to add another constructor here. What instances to we have where we want to make a new Snapshot and not have these fields explicitly specified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one alternative could be to introduce a Snapshot Builder instead of relying on the constructor. But if we decide to do that, I'd add that builder to the core module rather than the test module (as is being done by #11947).
It would be good to get some other opinions here in order to decide what the best path forward is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough I think we can also punt on this for now and have it be a cleanup later since it's not really integral to this pr

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to add a Builder or some other kinda refactoring, I'd also prefer to punt. The PR is fairly sizeable and I'd prefer to keep it focused on the row lineage core metadata changes (and all of this rather internal at the moment)

@@ -288,7 +293,9 @@ public String toString() {
Map<String, SnapshotRef> refs,
List<StatisticsFile> statisticsFiles,
List<PartitionStatisticsFile> partitionStatisticsFiles,
List<MetadataUpdate> changes) {
List<MetadataUpdate> changes,
boolean rowLineage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we'd also want to have a separate constructor that takes these two new flags. The existing code that don't need to set this wouldn't have to change, wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we have a builder then it doesn't really make sense to have multiple constructors. All usages should be going through the builder, the test classes are the only ones (other than RewritePaths) that use the raw constructor and they do so incorrectly imho and should be using their own builder as well.

}
}

public Builder enableRowLineage() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels a bit weird to have both methods to effectively do the same. Maybe we should just reduce this to having a single method?

Copy link
Member Author

@RussellSpitzer RussellSpitzer Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, when I meant to do here is we only allow a public method that turns on RowLineage. There is no public way to turn off table lineage. So I think this is good as is. The method above this is just a private helper to determine whether or not enableRowLineage should be called. I added it when I added the method for turning on row-lineage via a table property but I don't want anyone calling this outside of this class.

@RussellSpitzer
Copy link
Member Author

@nastra @amogh-jahagirdar @HonahX @stevenzwu @nastra , Could you please take another pass? I'd like to get the next chunk of work done as well which is all based on this.

@@ -60,6 +60,7 @@ private MetadataUpdateParser() {}
static final String SET_PARTITION_STATISTICS = "set-partition-statistics";
static final String REMOVE_PARTITION_STATISTICS = "remove-partition-statistics";
static final String REMOVE_PARTITION_SPECS = "remove-partition-specs";
static final String ENABLE_ROW_LINEAGE = "enable-row-lineage";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI that this also requires changes in the OpenAPI spec

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer
Copy link
Member Author

Checks for equality deletes - #12075 Separate PR for now but can merge it here if we want to discuss that. I think we may have a conversation about relying on summary stats so I wanted to break it out

String manifestList) {
String manifestList,
Long firstRowId,
Long addedRows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to add a Builder or some other kinda refactoring, I'd also prefer to punt. The PR is fairly sizeable and I'd prefer to keep it focused on the row lineage core metadata changes (and all of this rather internal at the moment)

Long addedRows = null;
Long lastRowId = null;
if (base.rowLineageEnabled()) {
addedRows = calculateAddedRows(manifests);
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an optimization so we can always just do this later (and arguably makes it a bit harder to read the code) but instead of waiting until all the manifests are written what if we set the addedRows as we add manifests to the writer in the try with-resources-above

Something like

// remove the writer.addAll
for (ManifestFile manifest: manifestFIles) {
 if (rowLineageEnabled()) {
     if (manifest.snapshotId() == null || (manifest.snapshotId() == this.snapshotId) {
                     Preconditions.checkArgument(
                  manifest.addedRowsCount() != null,
                  "Cannot determine number of added rows in snapshot because"
                      + " the entry for manifest %s is missing the field `added-rows-count`",
          addedRowsCount += manifest.addedRowsCount();
     }
}
      writer.add(manifest);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth thinking about, but here I'm not sure it's that much of an optimization. The amount of new manifests should be pretty small, even if it was thousands of manifests the overhead should be very low imho and we already have them in memory.

@pvary
Copy link
Contributor

pvary commented Jan 24, 2025

Checks for equality deletes - #12075 Separate PR for now but can merge it here if we want to discuss that. I think we may have a conversation about relying on summary stats so I wanted to break it out

Left a comment on #12075

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. just left one minor comment

.filter(
manifest ->
manifest.snapshotId() == null
|| Objects.equals(manifest.snapshotId(), this.snapshotId))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use snapshotId() method (instead of this.snapshotId)?

there is no practical difference, since earlier call of snapshotId() already set it. but it is probably better for safety to avoid such implicit code ordering dependency

@amogh-jahagirdar amogh-jahagirdar self-requested a review January 29, 2025 00:27
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RussellSpitzer, these changes LGTM now!

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one minor comment around testing

@nastra nastra merged commit 1fcd6a9 into apache:main Feb 1, 2025
47 checks passed
jbonofre pushed a commit to jbonofre/iceberg that referenced this pull request Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API core Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants