-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core, API, Spec: Metadata Row Lineage #11948
Conversation
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
@nastra Very interested on your thoughts on my test suite here. I'm thinking about how to get away from relying on the Junit4 Test wrapper annotation in TestBase. I know I did something similar in the V3 Metadata change a while back and I'm wondering if you are on board with this direction. |
@@ -43,7 +43,16 @@ public void testJsonConversion() throws IOException { | |||
|
|||
Snapshot expected = | |||
new BaseSnapshot( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JonasJ-ap I believe has a WIP to add test builder classes for Metadata and Snapshot to make these sorts of changes less painful in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The builders will be added in #11947
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that builders in test code are the right path going forward. For BaseSnapshot
there's an easy way to avoid all the changes by having an overriding constructor. We could most likely do the same for TableMetadata, but let me first do a deep review of this PR to see what options we have here.
However, I'm curious what others think here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm strongly against multiple constructors because I think the use cases we have here (mostly test cases) don't actually need to specify all of the parameters. Currently we have tests like
testParameter5 () {
Snapshot = (foo, bar, x ,null, 4, ...,....,.,,..)
}
Which I think would be much cleaner as
testParameter5() {
testSnapshot().withParameter5(true) ... some beahvior
testSnapshot().withParameter5(false) .. some other behavior
}
Not only would the new tests be cleaner, but they would also be future proofed.
For the general use in the library I don't think there are any places in the code where we actually want to be using a constructor with less parameters. For Snapshot, we always want everything specified and for TableMetadata we expect users to go through the Public Builder and not the constructor.
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
String manifestList) { | ||
String manifestList, | ||
Long firstRowId, | ||
Long addedRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth adding an overriding constructor that takes these two new arguments? And the existing constructor would then just pass nulls for these? That way you wouldn't have to update all the constructor calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think there is more of an argument to have multiple constructors since there isn't a builder, but I"m still hesitant to add another constructor here. What instances to we have where we want to make a new Snapshot and not have these fields explicitly specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess one alternative could be to introduce a Snapshot Builder instead of relying on the constructor. But if we decide to do that, I'd add that builder to the core module rather than the test module (as is being done by #11947).
It would be good to get some other opinions here in order to decide what the best path forward is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough I think we can also punt on this for now and have it be a cleanup later since it's not really integral to this pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to add a Builder or some other kinda refactoring, I'd also prefer to punt. The PR is fairly sizeable and I'd prefer to keep it focused on the row lineage core metadata changes (and all of this rather internal at the moment)
@@ -288,7 +293,9 @@ public String toString() { | |||
Map<String, SnapshotRef> refs, | |||
List<StatisticsFile> statisticsFiles, | |||
List<PartitionStatisticsFile> partitionStatisticsFiles, | |||
List<MetadataUpdate> changes) { | |||
List<MetadataUpdate> changes, | |||
boolean rowLineage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we'd also want to have a separate constructor that takes these two new flags. The existing code that don't need to set this wouldn't have to change, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we have a builder then it doesn't really make sense to have multiple constructors. All usages should be going through the builder, the test classes are the only ones (other than RewritePaths) that use the raw constructor and they do so incorrectly imho and should be using their own builder as well.
} | ||
} | ||
|
||
public Builder enableRowLineage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels a bit weird to have both methods to effectively do the same. Maybe we should just reduce this to having a single method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops sorry, when I meant to do here is we only allow a public method that turns on RowLineage. There is no public way to turn off table lineage. So I think this is good as is. The method above this is just a private helper to determine whether or not enableRowLineage should be called. I added it when I added the method for turning on row-lineage via a table property but I don't want anyone calling this outside of this class.
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
123d1a4
to
f048cc3
Compare
@nastra @amogh-jahagirdar @HonahX @stevenzwu @nastra , Could you please take another pass? I'd like to get the next chunk of work done as well which is all based on this. |
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
@@ -60,6 +60,7 @@ private MetadataUpdateParser() {} | |||
static final String SET_PARTITION_STATISTICS = "set-partition-statistics"; | |||
static final String REMOVE_PARTITION_STATISTICS = "remove-partition-statistics"; | |||
static final String REMOVE_PARTITION_SPECS = "remove-partition-specs"; | |||
static final String ENABLE_ROW_LINEAGE = "enable-row-lineage"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI that this also requires changes in the OpenAPI spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checks for equality deletes - #12075 Separate PR for now but can merge it here if we want to discuss that. I think we may have a conversation about relying on summary stats so I wanted to break it out |
String manifestList) { | ||
String manifestList, | ||
Long firstRowId, | ||
Long addedRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to add a Builder or some other kinda refactoring, I'd also prefer to punt. The PR is fairly sizeable and I'd prefer to keep it focused on the row lineage core metadata changes (and all of this rather internal at the moment)
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestRowLineageMetadata.java
Outdated
Show resolved
Hide resolved
Long addedRows = null; | ||
Long lastRowId = null; | ||
if (base.rowLineageEnabled()) { | ||
addedRows = calculateAddedRows(manifests); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an optimization so we can always just do this later (and arguably makes it a bit harder to read the code) but instead of waiting until all the manifests are written what if we set the addedRows as we add manifests to the writer in the try with-resources-above
Something like
// remove the writer.addAll
for (ManifestFile manifest: manifestFIles) {
if (rowLineageEnabled()) {
if (manifest.snapshotId() == null || (manifest.snapshotId() == this.snapshotId) {
Preconditions.checkArgument(
manifest.addedRowsCount() != null,
"Cannot determine number of added rows in snapshot because"
+ " the entry for manifest %s is missing the field `added-rows-count`",
addedRowsCount += manifest.addedRowsCount();
}
}
writer.add(manifest);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth thinking about, but here I'm not sure it's that much of an optimization. The amount of new manifests should be pretty small, even if it was thousands of manifests the overhead should be very low imho and we already have them in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. just left one minor comment
.filter( | ||
manifest -> | ||
manifest.snapshotId() == null | ||
|| Objects.equals(manifest.snapshotId(), this.snapshotId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use snapshotId()
method (instead of this.snapshotId
)?
there is no practical difference, since earlier call of snapshotId()
already set it. but it is probably better for safety to avoid such implicit code ordering dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RussellSpitzer, these changes LGTM now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one minor comment around testing
Adds required metadata fields and operations for working with Row Lineage. Does not cover implementation within manifest entries or propagation.
Major Changes
Metadata -
*~~ Adds Snapshot field addedRows for explictly tracking the number of added rows in the snapshot This was not in the design doc ~~ Done in a separate PR Spec: Add added-rows field to Snapshot #11976
API-
Implementation
added-rows
and setfirst-row-id