Rework `ExtractionFilter` to adapt to boolean values #423

MaxDall · 2024-04-18T13:57:40Z

With #422, 'free_access' is computed correctly. Because not all publishers set this information correctly, I felt the need to rework the extraction filter again.

I added a new parameter to Requires, skip_boolean, which lets one skip boolean values during evaluation.
There are now two predefined extraction filters: Requires and RequiresAll.
RequiresAll is a named wrapper around Requires() and exposes the skip_bool parameter to the user.

addie9800

I would suggest also updating the default value for the only_complete attribute in the crawl() method of crawler.py

MaxDall · 2024-04-19T10:34:37Z

> I would suggest also updating the default value for the only_complete attribute in the crawl() method of crawler.py

Hmm, what would be the reasoning behind this? Maybe I'm missing something, but the behavior of Requires("title", "body", "publishing_date") doesn't change at all with this PR.

addie9800 · 2024-04-19T10:38:37Z

> Hmm, what would be the reasoning behind this? Maybe I'm missing something, but the behavior of Requires("title", "body", "publishing_date") doesn't change at all with this PR.

From what I see, the current default value is Requires("title", "body", "publishing_date", [skip_bool = False]). As far as I understood, the goal of this PR was to update the Requires() functionality to ignore the boolean values per default.

MaxDall · 2024-04-19T10:46:12Z

> From what I see, the current default value is Requires("title", "body", "publishing_date", [skip_bool = False]). As far as I understood, the goal of this PR was to update the Requires() functionality to ignore the boolean values per default.

Yeah, that's right, but neither title, body nor publishing_date is a boolean value.

Update: Wait, maybe there is a misunderstanding here. Requires shouldn't ignore booleans per default so that Requires("free_access") actually requires the articles to have free_access=True. RequiresAll() should skip booleans per default.
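
To illustrate the defaults described here, a minimal sketch follows. The class names `RequiresSketch` and `RequiresAllSketch`, the `failed()` method, and the convention that an empty attribute list means "check every extracted attribute" are assumptions made for illustration; this is not the actual Fundus implementation.

```python
from typing import Any, Dict, List


class RequiresSketch:
    """Hypothetical stand-in for Requires; not the Fundus implementation."""

    def __init__(self, *required: str, skip_bool: bool = False) -> None:
        self.required = required
        self.skip_bool = skip_bool

    def failed(self, extraction: Dict[str, Any]) -> List[str]:
        # Assumption: with no names given, every extracted attribute is checked.
        names = self.required or tuple(extraction)
        result = []
        for name in names:
            value = extraction.get(name)
            if self.skip_bool and isinstance(value, bool):
                continue  # a boolean only needs to be present, True or False
            if not value:
                result.append(name)
        return result


class RequiresAllSketch(RequiresSketch):
    """Hypothetical stand-in for RequiresAll: checks everything, skips booleans by default."""

    def __init__(self, skip_bool: bool = True) -> None:
        super().__init__(skip_bool=skip_bool)


extraction = {"title": "A title", "body": "Some text", "free_access": False}
print(RequiresSketch("free_access").failed(extraction))       # ['free_access']
print(RequiresAllSketch().failed(extraction))                  # []
print(RequiresAllSketch(skip_bool=False).failed(extraction))   # ['free_access']
```

In this sketch, `RequiresSketch("free_access")` rejects an article with `free_access=False`, while `RequiresAllSketch()` lets it through because booleans are skipped unless the caller opts in.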

addie9800 · 2024-04-19T11:23:48Z

> Yeah, that's right, but neither title, body nor publishing_date is a boolean value.
>
> Update: Wait, maybe there is a misunderstanding here. Requires shouldn't ignore booleans per default so that Requires("free_access") actually requires the articles to have free_access=True. RequiresAll() should skip booleans per default.

OK, I think I see my mistake. So just to be sure: the expected behavior of Requires("title", "body", "publishing_date") is the same as Requires("title", "body", "publishing_date", "free_access", skip_bool = True)?

MaxDall · 2024-04-19T11:37:46Z

> OK, I think I see my mistake. So just to be sure: the expected behavior of Requires("title", "body", "publishing_date") is the same as Requires("title", "body", "publishing_date", "free_access", skip_bool = True)?

No, Requires checks only the attributes passed to it. So Requires("title", "body", "publishing_date") behaves differently from Requires("title", "body", "publishing_date", "free_access"): the latter also checks the extraction for the free_access attribute. skip_boolean only determines whether boolean values are evaluated with bool or merely checked for being present in the extraction at all.
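
The distinction can be sketched with a small helper. The function name `failed_attributes` and the plain-dict extraction are hypothetical and chosen only for illustration; the real filter in Fundus may be structured differently.

```python
from typing import Any, Dict, List


def failed_attributes(
    extraction: Dict[str, Any], *required: str, skip_boolean: bool = False
) -> List[str]:
    """Return the required attributes the extraction does not satisfy (illustrative only)."""
    failed = []
    for name in required:  # only the attributes passed in are looked at
        value = extraction.get(name)  # None if the attribute was not extracted
        if skip_boolean and isinstance(value, bool):
            continue  # presence check only: True and False both pass
        if not bool(value):
            failed.append(name)
    return failed


extraction = {"title": "A title", "body": "Some text", "free_access": False}

# free_access is not passed, so it is never checked
print(failed_attributes(extraction, "title", "body"))  # []
# free_access is passed and evaluated with bool -> fails
print(failed_attributes(extraction, "title", "body", "free_access"))  # ['free_access']
# free_access is passed but only has to be present
print(failed_attributes(extraction, "title", "body", "free_access", skip_boolean=True))  # []
# an attribute that is absent fails even when booleans are skipped
print(failed_attributes({"title": "A title"}, "title", "free_access", skip_boolean=True))  # ['free_access']
```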

addie9800 · 2024-04-19T11:50:13Z

OK, now I think I got the idea. I guess the name confused me a bit; perhaps we could change it to something along the lines of eval_bools or eval_bool_values? Also, should we mention this change in the docs?

MaxDall · 2024-04-19T15:03:01Z

> OK, now I think I got the idea. I guess the name confused me a bit; perhaps we could change it to something along the lines of eval_bools or eval_bool_values? Also, should we mention this change in the docs?

That's a good point. I agree, skip_boolean seems quite bad 😅. Do you have any ideas? IMO eval_bools fits better but isn't quite there.

addie9800 · 2024-04-19T17:07:38Z

> Do you have any ideas?

Not really, but maybe something like filter_by_bool_value or require_existence

addie9800

👍

MaxDall added 5 commits April 18, 2024 15:31

rework ExtractionFilter to adapt to boolean values

bf45841

add some test cases for ExtractionFilter

aaec222

update documentation

84f5447

RequiresAllSkipBoolean -> RequiresAll

ebb5723

add skip_bool parameter to RequiresAll

a3ff496

MaxDall added the "DON'T MERGE" (Current PR is based on another PR. When used, indicate in title with [Based on #...]) and "rework" (Reworks parts of the project) labels Apr 18, 2024

MaxDall requested review from dobbersc and addie9800 April 18, 2024 13:58

MaxDall mentioned this pull request Apr 18, 2024

Fix a bug in bf_search regarding boolean values #422

Merged

addie9800 reviewed Apr 19, 2024

Base automatically changed from fix-a-bug-in-ld-bfsearch to master April 19, 2024 10:35

MaxDall added 2 commits April 20, 2024 10:31

rename skip_boolean -> eval_bools and change logic accordingly

489a2b6

coherency and rename to eval_booleans

1461bae

addie9800 approved these changes Apr 20, 2024

MaxDall merged commit f9816f6 into master Apr 20, 2024
4 checks passed

MaxDall deleted the reqork-extraction-filter-for-boolean-values branch April 20, 2024 08:41
