Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework ExtractionFilter to adept to boolean values #423

Merged
merged 7 commits into from
Apr 20, 2024

Conversation

MaxDall
Copy link
Collaborator

@MaxDall MaxDall commented Apr 18, 2024

With #422, 'free_access' is computed correctly. Because not all publishers set this information correctly, I felt the need to rework the extraction filter again.

  • I added a new parameter to Requires, skip_boolean which enables one to skip boolean values for evaluation
  • There are now two predefined extraction filters: Requires, RequiresAll,
  • RequiresAll is a name wrap for Requires() and propagates the skip_bool parameter to the user.

@MaxDall MaxDall added DON'T MERGE Current PR is based on another PR. When used, indicate in title with [Based on #...] rework Reworks parts of the project labels Apr 18, 2024
@MaxDall MaxDall requested review from dobbersc and addie9800 April 18, 2024 13:58
Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest also updating the default value for the only_complete attribute in the crawl() method of crawler.py

@MaxDall
Copy link
Collaborator Author

MaxDall commented Apr 19, 2024

I would suggest also updating the default value for the only_complete attribute in the crawl() method of crawler.py

Hmm, what would be the reasoning behind this? Maybe I'm missing something, but the behavior of Requires("title", "body", "publishing_date") doesn't change at all with this PR.

Base automatically changed from fix-a-bug-in-ld-bfsearch to master April 19, 2024 10:35
@addie9800
Copy link
Collaborator

Hmm, what would be the reasoning behind this? Maybe I'm missing something, but the behavior of Requires("title", "body", "publishing_date") doesn't change at all with this PR.

From what I see, the current default value is Requires("title", "body", "publishing_date", [skip_bool = False]). As far as I understood, the goal of this PR was to update the Requires() functionality to ignore the boolean values per default.

@MaxDall
Copy link
Collaborator Author

MaxDall commented Apr 19, 2024

From what I see, the current default value is Requires("title", "body", "publishing_date", [skip_bool = False]). As far as I understood, the goal of this PR was to update the Requires() functionality to ignore the boolean values per default.

Yeah, that's right, but neither title, body nor publishing_date is a boolean value.

Update: Wait, maybe there is a misunderstanding here. Requires shouldn't ignore booleans per default so that Requires("free_access") actually requires the articles to have free_access=True. RequiresAll() should skip booleans per default.

@addie9800
Copy link
Collaborator

addie9800 commented Apr 19, 2024

Yeah, that's right, but neither title, body nor publishing_date is a boolean value.

Update: Wait, maybe there is a misunderstanding here. Requires shouldn't ignore booleans per default so that Requires("free_access") actually requires the articles to have free_access=True. RequiresAll() should skip booleans per default.

OK, I think I see my mistake. So just to be sure: the expected behavior of Requires("title", "body", "publishing_date") is the same as Requires("title", "body", "publishing_date", "free_access", skip_bool = True)?

@MaxDall
Copy link
Collaborator Author

MaxDall commented Apr 19, 2024

OK, I think I see my mistake. So just to be sure: the expected behavior of Requires("title", "body", "publishing_date") is the same as Requires("title", "body", "publishing_date", "free_access", skip_bool = True)?

No, Requires checks only passed attributes. So Requires("title", "body", "publishing_date") behaves differently than Requires("title", "body", "publishing_date", "free_access") in a way that the latter checks the extraction for the free_access attribute as well. skip_boolean only determines if boolean values are evaluated with bool or if it is checked if they are present in the extraction at all.

@addie9800
Copy link
Collaborator

OK, now I think I got the idea. I guess the name confused me a bit, perhaps we could change it to something along the lines of eval_bools or eval_bool_values? Also, should we mention this change in the docs?

@MaxDall
Copy link
Collaborator Author

MaxDall commented Apr 19, 2024

OK, now I think I got the idea. I guess the name confused me a bit, perhaps we could change it to something along the lines of eval_bools or eval_bool_values? Also, should we mention this change in the docs?

That's a good point. I agree, skip_boolean seems quite bad 😅. Do you have any ideas? IMO eval_bools fits better but isn't quite there.

@addie9800
Copy link
Collaborator

Do you have any ideas?

Not really, but maybe something like filter_by_bool_value or require_existence

Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@MaxDall MaxDall merged commit f9816f6 into master Apr 20, 2024
4 checks passed
@MaxDall MaxDall deleted the reqork-extraction-filter-for-boolean-values branch April 20, 2024 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DON'T MERGE Current PR is based on another PR. When used, indicate in title with [Based on #...] rework Reworks parts of the project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants