-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for fsspec>=2023.9.0
#6244
Conversation
The documentation is not available anymore as the PR was closed or merged. |
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Some comments below.
Maybe we should update the docstring of get_data_patterns
accordingly? Currently it only gives examples of outputs with **
not in a single path segment (i.e. not with a /
as prefix or suffix)
glob.glob, Path.glob, Path.match or fnmatch do not support ** with a prefix/suffix other than a forward slash /. | ||
For instance, this means **.json is the same as *.json. On the contrary, the fsspec glob has no limits regarding the ** prefix/suffix, | ||
resulting in **.json being equivalent to **/*.json. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it seems that the only documented usage of "**" is within a single path segment (between slashes). Before, fsspec
was more lax with it, but they seem that they are aligning with the former's. So I agree we better align as well and use double asterisk only between slashes.
src/datasets/data_files.py
Outdated
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**/*[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"] | ||
KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**/*[{sep}/]{keyword}[{sep}/]**"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current behavior (double asterisk only between slashes meaning zero or more path segments) I find confusing several patterns here:
- why do we add the slash to the sep characters here:
[{sep}/]
? I think it would be clearer to remove it as we already prefix the pattern with**/
- as they are:
- the first pattern matches
dir/train.csv
,my-train.csv
anddir/my-train.csv
, - whereas the second pattern only matches
train.csv
- the first pattern matches
- I think it would be clearer to use these patterns:
["**/*[{sep}]{keyword}[{sep}]*", "**/{keyword}[{sep}]*"]
, so that- the first pattern matches:
my-train.csv
anddir/my-train.csv
- and the second:
train.csv
anddir/train.csv
- the first pattern matches:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current patterns are indeed hard to read. Unfortunately, fsspec
's ad-hoc conversion from a glob to a regex pattern doesn't work (as expected) for ["**/*[{sep}]{keyword}[{sep}]*", "**/{keyword}[{sep}]*"]
- for instance, it converts "**/{keyword}[{sep}]*".format(keyword="eval", sep="-._ 0-9")
to "^.*eval[-\\._ 0-9][^/]*$"
, which leads to a failure in the data_files
tests as this matches "data/seqeval_results.txt"
, but it shouldn't.
So I suggest reporting this behavior in fsspec
(IMO, they should use fnmatch.translate
for the conversion) and merging this PR in the current state (and improving this when/if it's fixed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see... they introduced new bugs... 😕
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Any comment on my comment below?
Maybe we should update the docstring of
get_data_patterns
accordingly? Currently it only gives examples of outputs with**
not in a single path segment (i.e. not with a/
as prefix or suffix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ! I think the fix is future proof so we should be fine
Yea right we need to update it indeed, the outputs are the ones from older versions of fsspec, and from older patterns that we don't use anymore. In general in docstrings I also think we should encourage users to use |
Also just noticed that (noticed this by checking that the data pattern is the same for both the dir name and filename examples in the get_data_patterns docstring) |
Co-authored-by: Quentin Lhoest <[email protected]>
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
I've addressed the comments. Let me know if it looks all good now :) |
Actually just found out that the current So there might be a risk that this pattern breaks in the future no ? |
@lhoestq |
Yea after more thoughts I also think it's fine. Feel free to merge ! |
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
* Add support for `fsspec>=2023.9.0` * Fixes * Style * Fix mock fs for files in nested directories * Nit * More fixes * Nit * Remove print * Update tests/test_data_files.py Co-authored-by: Quentin Lhoest <[email protected]> * Address some more comments --------- Co-authored-by: Quentin Lhoest <[email protected]>
* Add support for `fsspec>=2023.9.0` * Fixes * Style * Fix mock fs for files in nested directories * Nit * More fixes * Nit * Remove print * Update tests/test_data_files.py Co-authored-by: Quentin Lhoest <[email protected]> * Address some more comments --------- Co-authored-by: Quentin Lhoest <[email protected]>
* Add support for `fsspec>=2023.9.0` * Fixes * Style * Fix mock fs for files in nested directories * Nit * More fixes * Nit * Remove print * Update tests/test_data_files.py Co-authored-by: Quentin Lhoest <[email protected]> * Address some more comments --------- Co-authored-by: Quentin Lhoest <[email protected]>
Fix #6214