-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: ignore crc files when checking if provided path correspond to a valid delta table #3122
fix: ignore crc files when checking if provided path correspond to a valid delta table #3122
Conversation
48d4fb5
to
d9dbefd
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3122 +/- ##
=======================================
Coverage 72.26% 72.26%
=======================================
Files 134 134
Lines 42903 42973 +70
Branches 42903 42973 +70
=======================================
+ Hits 31003 31056 +53
- Misses 9923 9927 +4
- Partials 1977 1990 +13 ☔ View full report in Codecov by Sentry. |
@@ -28,6 +28,7 @@ lazy_static! { | |||
static ref CHECKPOINT_FILE_PATTERN: Regex = | |||
Regex::new(r"\d+\.checkpoint(\.\d+\.\d+)?\.parquet").unwrap(); | |||
static ref DELTA_FILE_PATTERN: Regex = Regex::new(r"^\d+\.json$").unwrap(); | |||
static ref CRC_FILE_PATTERN: Regex = Regex::new(r"^(\.)?\d+(\.crc|\.json)?\.crc$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are valid files?
00001.crc.crc
00001.json.crc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't it just r"^\d+\.crc$"
?
I have not seen .json.crc or .crc.crc before tbh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those get generated when I write a delta table locally using Spark 3.5.4 and delta-spark 3.3.0. iirc I also saw them with Spark 3.2 / delta-spark 2.3.0, so I assume they're legit?
pyspark --packages io.delta:delta-spark_2.12:3.3.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
> spark.sql("select 1 as id, 'value' as value").write.format("delta").save("./test/resources/test-table")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess :P I just wanted some confirmation on this, because I hadn't seen them before
e630501
to
6bd14f9
Compare
…valid delta table Signed-off-by: Julien Guillot <[email protected]>
Signed-off-by: Julien Guillot <[email protected]>
6bd14f9
to
27b0151
Compare
Thanks for the fix! |
Description
The current implementation of
is_delta_table_location
only check the first file in the_delta_log
folder to check for the presence of a commit or checkpoint file. This creates issue when checking tables created using the Spark driver since.crc
files are created and appear before commit files in the directory.This MR modifies the behavior of
is_delta_table_location
so that files are checked in a loop..crc
files are ignored and the loop execution stops as soon as a commit, checkpoint or invalid file is found.Related Issue(s)
is_deltatable()
returns False for tables in S3 created by delta-spark #3115