Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ignore crc files when checking if provided path correspond to a valid delta table #3122

Merged
merged 2 commits into from
Jan 12, 2025

Conversation

guillotjulien
Copy link
Contributor

Description

The current implementation of is_delta_table_location only check the first file in the _delta_log folder to check for the presence of a commit or checkpoint file. This creates issue when checking tables created using the Spark driver since .crc files are created and appear before commit files in the directory.

This MR modifies the behavior of is_delta_table_location so that files are checked in a loop. .crc files are ignored and the loop execution stops as soon as a commit, checkpoint or invalid file is found.

Related Issue(s)

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jan 12, 2025
@guillotjulien guillotjulien force-pushed the fix/handle_crc_log_files branch from 48d4fb5 to d9dbefd Compare January 12, 2025 16:18
Copy link

codecov bot commented Jan 12, 2025

Codecov Report

Attention: Patch coverage is 89.33333% with 8 lines in your changes missing coverage. Please review.

Project coverage is 72.26%. Comparing base (c56d6c0) to head (27b0151).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/logstore/mod.rs 88.57% 2 Missing and 6 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3122   +/-   ##
=======================================
  Coverage   72.26%   72.26%           
=======================================
  Files         134      134           
  Lines       42903    42973   +70     
  Branches    42903    42973   +70     
=======================================
+ Hits        31003    31056   +53     
- Misses       9923     9927    +4     
- Partials     1977     1990   +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -28,6 +28,7 @@ lazy_static! {
static ref CHECKPOINT_FILE_PATTERN: Regex =
Regex::new(r"\d+\.checkpoint(\.\d+\.\d+)?\.parquet").unwrap();
static ref DELTA_FILE_PATTERN: Regex = Regex::new(r"^\d+\.json$").unwrap();
static ref CRC_FILE_PATTERN: Regex = Regex::new(r"^(\.)?\d+(\.crc|\.json)?\.crc$").unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are valid files?

00001.crc.crc
00001.json.crc

Copy link
Contributor Author

@guillotjulien guillotjulien Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not from what I observed. The only .crc file generated without . prefix follow the ^\d+\.crc$ pattern. We can further restrict the regex like this: ^(\.\d+(\.crc|\.json)|\d+)\.crc$.

It should cover every case:
image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't it just r"^\d+\.crc$"?

I have not seen .json.crc or .crc.crc before tbh

Copy link
Contributor Author

@guillotjulien guillotjulien Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those get generated when I write a delta table locally using Spark 3.5.4 and delta-spark 3.3.0. iirc I also saw them with Spark 3.2 / delta-spark 2.3.0, so I assume they're legit?

pyspark --packages io.delta:delta-spark_2.12:3.3.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"


> spark.sql("select 1 as id, 'value' as value").write.format("delta").save("./test/resources/test-table")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess :P I just wanted some confirmation on this, because I hadn't seen them before

@guillotjulien guillotjulien force-pushed the fix/handle_crc_log_files branch from e630501 to 6bd14f9 Compare January 12, 2025 17:08
@ion-elgreco ion-elgreco force-pushed the fix/handle_crc_log_files branch from 6bd14f9 to 27b0151 Compare January 12, 2025 17:18
@ion-elgreco
Copy link
Collaborator

Thanks for the fix!

@ion-elgreco ion-elgreco enabled auto-merge January 12, 2025 17:22
@ion-elgreco ion-elgreco added this pull request to the merge queue Jan 12, 2025
Merged via the queue into delta-io:main with commit 1521a08 Jan 12, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

is_deltatable() returns False for tables in S3 created by delta-spark
2 participants