Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Match only within relative path when indexer is validating #859

Merged
merged 8 commits into from
Jul 22, 2022

Conversation

oesteban
Copy link
Collaborator

One major problem of the current behavior is, if one wants to exclude dot-folders and dot-files then a pattern like re.compile(r"/\.") will also match patterns above the root folder. E.g., the path /home/username/.cache/dataset/sub-01 would match, as well as /home/username/.cache/dataset/.datalad. However, we only the latter to be filtered out.

Resolves: #858.

@oesteban oesteban requested a review from effigies May 18, 2022 10:49
@codecov
Copy link

codecov bot commented May 18, 2022

Codecov Report

Merging #859 (ba43619) into master (9f01802) will decrease coverage by 0.26%.
The diff coverage is 88.88%.

@@            Coverage Diff             @@
##           master     #859      +/-   ##
==========================================
- Coverage   86.29%   86.03%   -0.27%     
==========================================
  Files          35       35              
  Lines        3954     3973      +19     
  Branches     1026     1031       +5     
==========================================
+ Hits         3412     3418       +6     
- Misses        353      363      +10     
- Partials      189      192       +3     
Impacted Files Coverage Δ
bids/layout/validation.py 86.51% <78.57%> (+0.63%) ⬆️
bids/layout/index.py 84.70% <93.54%> (-3.45%) ⬇️
bids/layout/layout.py 88.07% <0.00%> (-0.42%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f01802...ba43619. Read the comment docs.

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible. Need to look at this CI failure...

@oesteban
Copy link
Collaborator Author

It'd be nice to write some unit tests, but I'm not sure I have the time :S

@effigies
Copy link
Collaborator

What about:

def test__check_path_matches_patterns():
    hidden_ds = "/home/user/.cache/datasets/myds"
    failing_path = f"{hidden_ds}/myfile"
    sensible_exclusions = [re.compile(r"/.*")]
    assert _check_path_matches_patterns(failing_path, sensible_exclusions) is False
    assert _check_path_matches_patterns(failing_path, sensible_exclusions, root=hidden_ds) is True

@effigies
Copy link
Collaborator

Also please rebase on master to fix tests.

@oesteban oesteban force-pushed the fix/matching-paths branch from 606c091 to 13ab11c Compare May 18, 2022 13:54
@effigies
Copy link
Collaborator

Seems to be breaking expectations.

Comment on lines 41 to 42
if isinstance(patt, (str, Path)):
if str(patt) in path:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@effigies I hope I did not misinterpreted these two lines.

Here is the string matching pattern:

  1. Shouldn't we allow strings, not just Paths?
  2. Shouldn't we just check the presence of the substring, rather than the exact match?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, the changes were much more involved - probably this doesn't have much relevance now.

@oesteban
Copy link
Collaborator Author

Okay, I seem to have opened a can of worms.

The implementation of path matching and validation was pretty brittle, almost tailored to the few tests we have here. Although I haven't done an exhaustive revision of the truth table of possible combinations of ignore and force_index, I believe this implementation resolves all the reasonable edge cases.

@oesteban oesteban requested review from adelavega and effigies May 19, 2022 11:28
Copy link
Collaborator

@adelavega adelavega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had time to test this out myself but looks reasonable.

Taking a step back force_index seems like a bit a mistake of the API, as it complicates things quite a bit. I also don't like how rules for what is indexed or excluded by default is in the code, rather than driven the BIDS schema-- but that will be a future effort.

Either way those comments don't have bearing on this PR and I think in the meantime this is good, and I'm glad we have more coverage on this.

@oesteban
Copy link
Collaborator Author

Taking a step back force_index seems like a bit a mistake of the API, as it complicates things quite a bit.

I totally second this thought 👍

@oesteban
Copy link
Collaborator Author

oesteban commented May 31, 2022

I've been thinking about @adelavega's feelings against force_index, and I think he is totally right. In terms of performance, not having force_index would allow to "forget" excluded paths from the highest point (e.g., the full derivatives/ folder) without attempting to check if some content inside them are forced. The speed of indexing would increase dramatically with the right exclude list (e.g., .git or .datalad will be dropped in just one iteration).

That should go in a different PR, though.

@effigies care to give this a final check?

@adelavega
Copy link
Collaborator

@oesteban about such an API breaking change, you should also be aware about #863

We will be eventually replacing BIDSLayout with a different, much more performant impementation/backend.

At that point, we will allow a fair amount of API breaking changes so that would be a good time for that.

The plan is to also having the current BIDSLayout available for a version for backwards compatibility.

@oesteban
Copy link
Collaborator Author

For now, it could be useful to:

  • raise a warning when something is set in force_index
  • when not force_index, reverse the sorting (i.e., from shallower to deeper), and make sure excluded directories are not entered.

@adelavega
Copy link
Collaborator

That works for me

@effigies
Copy link
Collaborator

Will merge if tests pass. Thanks for your patience.

@effigies effigies added this to the 0.15.2 milestone Jul 22, 2022
@effigies effigies merged commit 107de32 into bids-standard:master Jul 22, 2022
@oesteban oesteban deleted the fix/matching-paths branch November 16, 2022 14:42
oesteban added a commit to templateflow/python-client that referenced this pull request Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BIDSLayoutIndexer should not match force/ignore patterns above BIDS root
3 participants