Add Images #606
Merged
Changes from 1 commit
Commits
154 commits
673627b
add Image datatype
addie9800 02348df
parse first images
addie9800 c2ad68d
simplify images
addie9800 95136f0
first working version
addie9800 3fcc108
first working version for all publishers with ld json
addie9800 1bf8344
format comment
addie9800 ab34220
simplify image parsing to only support standard formats
addie9800 01f862f
add cover property
addie9800 0905b7a
save
addie9800 00272e5
Update documentation from @ 7aa4a47284cfbe5f02cc78a61b5173ddaae07665
addie9800 9c3082f
remove bloat files
addie9800 754f38d
remove bloat files
addie9800 ccb619a
identify img element by url similarity
addie9800 35ac88e
data extraction from html
addie9800 1d5d53c
restrict to Article and Blog JSON objects
addie9800 0edab02
rework utility methods for more flexibility
addie9800 bc8f7c0
add functionality to merge image objects
addie9800 6559f1d
implement images for the namibian
addie9800 f7fdf9a
documentation
addie9800 4c398a1
implement images for at, na
addie9800 a9b6e07
cbc
addie9800 00be4de
finish ca
addie9800 ad9dc96
documentation
addie9800 de231a7
remove json extraction
addie9800 d1cbfce
author cleaning
addie9800 32dbcc0
fix default images
addie9800 c81be6c
add default author selector and remove author from caption
addie9800 cb43637
br
addie9800 d5700ea
add author filter
addie9800 4235520
funke
addie9800 b745eae
publisher bis BSZ
addie9800 b2fccdd
ch
addie9800 ab7c5ca
cn
addie9800 27b15c7
bi_de
addie9800 44645f1
remove url parameter
addie9800 d512c47
no
addie9800 6fbe3dd
rewrite core logic
MaxDall 2adfa3a
add `images` attribute to guidelines and `Article`
MaxDall b460866
add serialization for `Image` class
MaxDall b2d298d
fix image extraction for `TheNamibian`
MaxDall d5acc61
add `images` to unit tests
MaxDall f26d7c8
Update documentation from @ d512c4791f40706a86e02c0f85519e789f8f8cf2
MaxDall 2a0967e
Merge branch 'images' into images-suggestions
MaxDall b5b6444
add test cases for `no` publishers
MaxDall 977bd66
Update src/fundus/parser/utility.py
MaxDall 83affdc
rename `parse_image_node` -> `parse_image_nodes`
MaxDall cfcc480
Merge remote-tracking branch 'origin/images-suggestions' into images-…
MaxDall b9b1e49
Merge pull request #640 from flairNLP/images-suggestions
MaxDall 6b6cac1
boersenzeitung
addie9800 f4c5397
add images to dw - focus
addie9800 c01f9e2
strip urls
addie9800 da1bae1
Update documentation from @ 5d3f301cd4077a4b7f3fb92d8da1ae368438b273
addie9800 b54914a
FAZ
addie9800 8fad8d3
Update documentation from @ 5d3f301cd4077a4b7f3fb92d8da1ae368438b273
addie9800 9fa1a32
add comment about images in v1
addie9800 6168c02
fr
addie9800 63f98ae
minor changes to images utility
addie9800 d34725f
add images to `FreiePresse` - `MitteldeutscheZeitung`
addie9800 c8db15f
Update documentation from @ 6168c02013124257d4b7d0007b6c5c00354bdbc1
addie9800 2752bee
Add images to `MDR` - `RuhrNachrichten`
addie9800 79c49f3
add images for `UK` publishers
MaxDall a926cf3
apply patch
MaxDall a8617fc
Update documentation from @ 79c49f345a2976fea052852d55b85ea8550280e8
MaxDall e82b325
simplify kicker image extraction
MaxDall 779e62b
Merge remote-tracking branch 'origin/images-suggestions' into images-…
MaxDall 738af53
Merge pull request #653 from flairNLP/images-suggestions
MaxDall 1308d2f
finish `at`, `ca`, `fr` and `ind`
MaxDall be17481
Update documentation from @ db2d4c594a3d139b2bb71634b2fe20b6cef6c8a2
MaxDall 7fe1fdd
add `lt`, `my` and `tr`
MaxDall 14718ff
`People`
addie9800 5290449
`People`
addie9800 8e2786f
`RuhrNachrichten` - `WDR`
addie9800 7449c7c
Update documentation from @ f06969f1c0ead73f7a8b2ec2bbbe73e79df42c66
addie9800 a36a0db
Finish `DE`
addie9800 8f4c727
`JungeWelt`, `Merkur` - `RheinischePost`
addie9800 4aa680e
`NDR`
addie9800 3125fcf
`APNews`, `BusinessInsider`
addie9800 5d7d913
remove video preview images from `Welt`
MaxDall 2f60495
adjust image selector for `TheIndependent`
MaxDall 9e46bd9
`TheNewYorker` - `Wired`
addie9800 bcc3d2d
add version parsing
MaxDall 62a57c6
`TheNation`
addie9800 3f5f9ac
`FoxNews` - `TheIntercept`
addie9800 5d390bb
Update documentation from @ f4b31d90b017a22a1b57892c7924f3adc8aed707
addie9800 3f66e49
fix typo
MaxDall 246e74c
parse `max-width` and rename `min-width` -> `query-width`
MaxDall f8338ae
Merge branch 'images' into add-version-parsing
addie9800 3296f17
Update utility.py
addie9800 978c7c3
Update utility.py
addie9800 3a7fed7
resolve forwarded types
MaxDall 5cd8439
Merge pull request #661 from flairNLP/add-version-parsing
MaxDall f69217f
Merge branch 'master' into images
MaxDall 272d840
Update documentation from @ f4b31d90b017a22a1b57892c7924f3adc8aed707
MaxDall 2a51169
remove leftover test case
MaxDall 5966036
bug fixes
MaxDall 4e50454
fix `__lt__` for `ImageVersion`
MaxDall e6e4ef4
fix a bug in `src` and `srcset` parsing
MaxDall 4b656d8
Fix `WDR`, add testcase for `ORF`
addie9800 c473901
Overwrite test-case for `TheIndependent`
addie9800 c938b33
`LeFigaro` test case overwrite
addie9800 f3f8f54
`MalayMail` testcase update
addie9800 a673c71
fix image extraction for `APNews` and `TheNation`
MaxDall 5b4ca5d
fix a bug with sorting test jsons
MaxDall c8e2cc5
add image extraction for `WestAustralian`
MaxDall 35986ac
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
MaxDall 162cd0a
Update `FreiePresse`
addie9800 0243a9e
remove duplicate selectors
addie9800 f64587b
remove test files
addie9800 f80bce6
beatify list comprehension
addie9800 79a5cbf
add test file for `FreiePresse` version `V1_1`
MaxDall 4605e15
add immage extraction for `TagesAnzeiger`
MaxDall fddaee7
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
MaxDall 5eb93ab
fix v1 images parsing
addie9800 eda8323
overwrite json
addie9800 8cbe5f5
Add image extraction for `Bhaskar`
addie9800 645441e
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
addie9800 b80b829
Add image extraction for `TheJapanNews`
addie9800 27f9261
Add image extraction for `YomiuriShimbun`
addie9800 113e1df
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
addie9800 2a85504
image_extraction documentation
addie9800 a5e5ce1
add image example to README.md
addie9800 960288d
update image example in README.md
addie9800 dc7c14f
add images to article documentation
addie9800 2a31d36
update `TechCrunch`
addie9800 d9e83ed
remove author_filter usage
addie9800 bcd9f8d
remove image author bloat
addie9800 de31f6f
guard `Optional[str]` for mypy
MaxDall dd2ddfb
Merge pull request #663 from flairNLP/update-freie-presse
MaxDall c540cbc
add image extraction for `ElPais`
MaxDall 8bb273d
some improvements regarding printouts and documentation
MaxDall fc40318
fix FreeBeacon
addie9800 1d519db
fix FrankfurterRundschau
addie9800 97e7bb6
JSON reordering, clean image authors
addie9800 a2255bc
update Merkur
addie9800 e8aa4d8
json reordering
addie9800 6a7ffcd
json reordering
addie9800 3343e1f
improve image author parsing
addie9800 1baee60
remove selected image author bloat
addie9800 cf34efc
update WDR
addie9800 25f79d6
Merge remote-tracking branch 'origin/images' into images
addie9800 3799401
remove author_filter
addie9800 4b40a3a
remove author_filter
addie9800 27a6766
black
addie9800 3d64b09
fix pytest
addie9800 ab846d7
simplify credit_keywords
addie9800 fdc74ac
Merge branch 'master' into images
addie9800 95f5424
update metro tests
addie9800 c7c5d33
catch invalid width and height values
addie9800 489a6aa
remove author replacement in description
addie9800 c64ec68
Merge remote-tracking branch 'origin/images' into images
addie9800 a123bce
remove try - except from float parsing
addie9800 12d2895
Merge branch 'master' into images
addie9800 8957e9b
update test data
addie9800 c5c9c98
mypy
addie9800
@@ -432,26 +432,29 @@ def preprocess_url(url: str, domain: str) -> str:
     return url


-def image_author_parsing(authors: Union[str, List[str]], author_filter: Optional[Pattern[str]] = None) -> List[str]:
+def image_author_parsing(authors: Union[str, List[str]]) -> List[str]:
     credit_keywords = [
         "credits?",
         "quellen?",
         "bild(rechte)?",
         "sources?",
-        r"(((f|ph)otos?(graph)?|image|illustration)\s*)+(by|:)",
+        r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?)\s*)+(by|:|courtesy)",
         "©",
         "– alle rechte vorbehalten",
         "copyright",
         "all rights reserved",
-        "pictures?"
+        "pictures?( by|:)",
         "courtesy of",
-        "="
+        "=",
     ]
     author_filter = re.compile(r"(?is)^(" + r"|".join(credit_keywords) + r"):?\s*")

     def clean(author: str):
         author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
+        # filtering credit keywords
         author = re.sub(author_filter, "", author, count=1)
+        # filtering bloat follwing the author
+        author = re.sub(r"(?i)/?copyright.*", "", author)
         return author.strip()

     if isinstance(authors, list):

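To see what the revised keyword list does in practice, here is a minimal standalone sketch. The keyword list and the three substitutions are copied from the hunk above; the module-level names `CREDIT_KEYWORDS` and `AUTHOR_FILTER` and the helper `clean_all` are hypothetical, and dropping empty results is an assumption rather than something the diff shows.

```python
import re
from typing import List

# Keyword list as it appears on the new side of the hunk above.
CREDIT_KEYWORDS = [
    "credits?",
    "quellen?",
    "bild(rechte)?",
    "sources?",
    r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?)\s*)+(by|:|courtesy)",
    "©",
    "– alle rechte vorbehalten",
    "copyright",
    "all rights reserved",
    "pictures?( by|:)",
    "courtesy of",
    "=",
]
# Anchored at the start of the string, so only a leading keyword is removed.
AUTHOR_FILTER = re.compile(r"(?is)^(" + r"|".join(CREDIT_KEYWORDS) + r"):?\s*")


def clean(author: str) -> str:
    # Unwrap a credit that is fully enclosed in parentheses.
    author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
    # Drop one leading credit keyword such as "Photo by" or "Credit:".
    author = re.sub(AUTHOR_FILTER, "", author, count=1)
    # Drop trailing copyright text following the author.
    author = re.sub(r"(?i)/?copyright.*", "", author)
    return author.strip()


def clean_all(authors: List[str]) -> List[str]:
    # Hypothetical helper: clean every entry and skip those that end up empty.
    return [cleaned for author in authors if (cleaned := clean(author))]


print(clean_all(["Photo by Jane Doe", "(Credit: Reuters)", "© AFP/Copyright 2024 AFP", "©"]))
```

Because the filter is anchored with `^` and applied with `count=1`, a keyword that appears later in the string (for example inside an agency name) is left alone; only the trailing `copyright` substitution reaches past the start.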
@@ -599,7 +602,6 @@ def parse_image_nodes(
     caption_selector: XPath,
     alt_selector: XPath,
     author_selector: Union[XPath, Pattern[str]],
-    author_filter: Optional[Pattern[str]] = None,
     domain: Optional[str] = None,
     size_pattern: Optional[Pattern[str]] = None,
 ) -> Iterator[Image]:

@@ -611,8 +613,6 @@ def parse_image_nodes(
         alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
         author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
             figure with copyright or credit in its class attribute.
-        author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
-            remove unwanted substrings
         domain: If set, the domain will be prepended to URLs in case they are relative
         size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
             will be matched with re.findall and overwrites existing values. Defaults to None.

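The `domain` argument documented above is prepended to relative image URLs. The diff does not show how that is done, so the following is only one possible sketch: it assumes an `https` scheme and relies on `urllib.parse.urljoin`. The function name `absolutize` is hypothetical and is not the library's `preprocess_url`.

```python
from urllib.parse import urljoin


def absolutize(url: str, domain: str) -> str:
    # Assumption: the publisher serves images over https.
    base = f"https://{domain}/"
    # urljoin leaves absolute URLs untouched and resolves relative
    # or protocol-relative ones against the base.
    return urljoin(base, url.strip())


print(absolutize("/images/a.jpg", "www.example.com"))
print(absolutize("//cdn.example.com/a.jpg", "www.example.com"))
```

Delegating to `urljoin` avoids hand-written checks for leading slashes or schemes, at the cost of fixing the scheme up front.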
@@ -637,21 +637,24 @@ def nodes_to_text(nodes: List[Union[lxml.html.HtmlElement, str]]) -> Optional[st
         # parse caption
         caption = nodes_to_text(caption_selector(node))

+        # parse description
+        description = nodes_to_text(alt_selector(node))
+
         # parse authors
         authors = []
         if isinstance(author_selector, Pattern):
             # author is part of the caption
             if caption and (match := re.search(author_selector, caption)):
                 authors = [match.group("credits")]
                 caption = re.sub(author_selector, "", caption).strip() or None
+            elif description and (match := re.search(author_selector, description)):
+                authors = [match.group("credits")]
+                description = re.sub(author_selector, "", description).strip() or None
         else:
             # author is selectable as node
             if author_nodes := author_selector(node):
                 authors = generic_nodes_to_text(author_nodes, normalize=True)
-        authors = image_author_parsing(authors, author_filter)
-
-        # parse description
-        description = nodes_to_text(alt_selector(node))
+        authors = image_author_parsing(authors)

         yield Image(
             versions=versions,

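The pattern branch in this hunk can be tried in isolation. The sketch below mirrors the caption-then-description fallback and the `credits` named group from the diff; the function name `split_credits`, the example `PATTERN`, and the `strip_description` flag are hypothetical. The flag is there only to illustrate the alternative raised in review, where the `alt` text is returned untouched.

```python
import re
from typing import List, Optional, Tuple


def split_credits(
    author_pattern: re.Pattern,
    caption: Optional[str],
    description: Optional[str],
    strip_description: bool = True,  # hypothetical flag, not part of the PR
) -> Tuple[List[str], Optional[str], Optional[str]]:
    """Return (authors, caption, description) after extracting the credits."""
    authors: List[str] = []
    if caption and (match := re.search(author_pattern, caption)):
        # Credits are part of the caption: extract and remove them.
        authors = [match.group("credits")]
        caption = re.sub(author_pattern, "", caption).strip() or None
    elif description and (match := re.search(author_pattern, description)):
        # Fall back to the description (alt text).
        authors = [match.group("credits")]
        if strip_description:
            description = re.sub(author_pattern, "", description).strip() or None
    return authors, caption, description


# Hypothetical publisher pattern: credits in trailing parentheses.
PATTERN = re.compile(r"\s*\(Photo: (?P<credits>[^)]+)\)$")

print(split_credits(PATTERN, "A bridge at dusk (Photo: Jane Doe)", "bridge"))
print(split_credits(PATTERN, None, "Skyline (Photo: AP)", strip_description=False))
```

The `or None` after `strip()` means a caption that consisted only of the credit collapses to `None` instead of an empty string.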
@@ -707,7 +710,6 @@ def image_extraction(
     author_selector: Union[XPath, Pattern[str]] = XPath(
         "(./ancestor::figure//*[(contains(@class, 'copyright') or contains(@class, 'credit')) and text()])[1]"
     ),
-    author_filter: Optional[Pattern[str]] = None,
     relative_urls: Union[bool, XPath] = False,

Review comment: @MaxDall I suggest changing this slightly, to cover unusual cases. Refer to Update: Also used for

     size_pattern: Pattern[str] = re.compile(
         r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"

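As a rough illustration of what the default `size_pattern` picks up from an image URL, the sketch below applies the regular expression exactly as it appears in the hunk above. The docstring says the library matches with `re.findall`; this sketch uses `re.finditer` instead so the named groups can be read directly. The function name `parse_size` and the decision to skip unparsable values are assumptions.

```python
import re
from typing import Dict

# Pattern string copied from the hunk above.
SIZE_PATTERN = re.compile(
    r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"
)


def parse_size(url: str) -> Dict[str, float]:
    # Hypothetical helper: collect every named group that matched a value.
    values: Dict[str, float] = {}
    for match in SIZE_PATTERN.finditer(url):
        for name, raw in match.groupdict().items():
            if not raw:
                continue  # group did not take part in this match, or was empty
            try:
                values[name] = float(raw)
            except ValueError:
                # Assumption: silently skip values such as "1.2.3".
                continue
    return values


print(parse_size("https://example.com/img.jpg?width=1200&height=800&dpr=2"))
```

Each alternative of the pattern matches on its own, so one URL yields several matches, and each match contributes at most one of `width`, `height`, or `dpr`.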
@@ -733,8 +735,6 @@ def image_extraction(
         alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
         author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
             figure with copyright or credit in its class attribute.
-        author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
-            remove unwanted substrings.
         relative_urls: If True, the extractor assumes that image src URLs are relative and prepends the publisher
             domain
         size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp

@@ -774,7 +774,6 @@ def image_extraction(
             caption_selector=caption_selector,
             alt_selector=alt_selector,
             author_selector=author_selector,
-            author_filter=author_filter,
             domain=domain,
             size_pattern=size_pattern,
         )

I would suggest leaving `description` as is and not applying filters here. If I remember correctly, we stated in the documentation that it's the parsed `alt` attribute of the image, so I would argue one would expect the raw data.