Add Images #606

Merged
merged 154 commits into from
Jan 2, 2025
Changes from 1 commit
154 commits
673627b
add Image datatype
addie9800 Sep 3, 2024
02348df
parse first images
addie9800 Sep 3, 2024
c2ad68d
simplify images
addie9800 Sep 3, 2024
95136f0
first working version
addie9800 Sep 3, 2024
3fcc108
first working version for all publishers with ld json
addie9800 Sep 3, 2024
1bf8344
format comment
addie9800 Sep 3, 2024
ab34220
simplify image parsing to only support standard formats
addie9800 Sep 11, 2024
01f862f
add cover property
addie9800 Sep 11, 2024
0905b7a
save
addie9800 Sep 12, 2024
00272e5
Update documentation from @ 7aa4a47284cfbe5f02cc78a61b5173ddaae07665
addie9800 Sep 12, 2024
9c3082f
remove bloat files
addie9800 Sep 14, 2024
754f38d
remove bloat files
addie9800 Sep 14, 2024
ccb619a
identify img element by url similarity
addie9800 Sep 17, 2024
35ac88e
data extraction from html
addie9800 Sep 17, 2024
1d5d53c
restrict to Article and Blog JSON objects
addie9800 Sep 19, 2024
0edab02
rework utility methods for more flexibility
addie9800 Sep 23, 2024
bc8f7c0
add functionality to merge image objects
addie9800 Sep 23, 2024
6559f1d
implement images for the namibian
addie9800 Sep 24, 2024
f7fdf9a
documentation
addie9800 Sep 24, 2024
4c398a1
implement images for at, na
addie9800 Sep 24, 2024
a9b6e07
cbc
addie9800 Sep 24, 2024
00be4de
finish ca
addie9800 Sep 25, 2024
ad9dc96
documentation
addie9800 Sep 25, 2024
de231a7
remove json extraction
addie9800 Oct 1, 2024
d1cbfce
author cleaning
addie9800 Oct 1, 2024
32dbcc0
fix default images
addie9800 Oct 1, 2024
c81be6c
add default author selector and remove author from caption
addie9800 Oct 1, 2024
cb43637
br
addie9800 Oct 1, 2024
d5700ea
add author filter
addie9800 Oct 1, 2024
4235520
funke
addie9800 Oct 1, 2024
b745eae
publisher bis BSZ
addie9800 Oct 1, 2024
b2fccdd
ch
addie9800 Oct 1, 2024
ab7c5ca
cn
addie9800 Oct 1, 2024
27b15c7
bi_de
addie9800 Oct 1, 2024
44645f1
remove url parameter
addie9800 Oct 5, 2024
d512c47
no
addie9800 Oct 13, 2024
6fbe3dd
rewrite core logic
MaxDall Oct 15, 2024
2adfa3a
add `images` attribute to guidelines and `Article`
MaxDall Oct 15, 2024
b460866
add serialization for `Image` class
MaxDall Oct 15, 2024
b2d298d
fix image extraction for `TheNamibian`
MaxDall Oct 15, 2024
d5acc61
add `images` to unit tests
MaxDall Oct 15, 2024
f26d7c8
Update documentation from @ d512c4791f40706a86e02c0f85519e789f8f8cf2
MaxDall Oct 15, 2024
2a0967e
Merge branch 'images' into images-suggestions
MaxDall Oct 15, 2024
b5b6444
add test cases for `no` publishers
MaxDall Oct 15, 2024
977bd66
Update src/fundus/parser/utility.py
MaxDall Oct 17, 2024
83affdc
rename `parse_image_node` -> `parse_image_nodes`
MaxDall Oct 17, 2024
cfcc480
Merge remote-tracking branch 'origin/images-suggestions' into images-…
MaxDall Oct 17, 2024
b9b1e49
Merge pull request #640 from flairNLP/images-suggestions
MaxDall Oct 17, 2024
6b6cac1
boersenzeitung
addie9800 Oct 22, 2024
f4c5397
add images to dw - focus
addie9800 Oct 23, 2024
c01f9e2
strip urls
addie9800 Oct 23, 2024
da1bae1
Update documentation from @ 5d3f301cd4077a4b7f3fb92d8da1ae368438b273
addie9800 Oct 23, 2024
b54914a
FAZ
addie9800 Oct 23, 2024
8fad8d3
Update documentation from @ 5d3f301cd4077a4b7f3fb92d8da1ae368438b273
addie9800 Oct 23, 2024
9fa1a32
add comment about images in v1
addie9800 Oct 28, 2024
6168c02
fr
addie9800 Oct 28, 2024
63f98ae
minor changes to images utility
addie9800 Oct 29, 2024
d34725f
add images to `FreiePresse` - `MitteldeutscheZeitung`
addie9800 Oct 29, 2024
c8db15f
Update documentation from @ 6168c02013124257d4b7d0007b6c5c00354bdbc1
addie9800 Oct 29, 2024
2752bee
Add images to `MDR` - `RuhrNachrichten`
addie9800 Oct 30, 2024
79c49f3
add images for `UK` publishers
MaxDall Nov 4, 2024
a926cf3
apply patch
MaxDall Nov 5, 2024
a8617fc
Update documentation from @ 79c49f345a2976fea052852d55b85ea8550280e8
MaxDall Nov 5, 2024
e82b325
simplify kicker image extraction
MaxDall Nov 5, 2024
779e62b
Merge remote-tracking branch 'origin/images-suggestions' into images-…
MaxDall Nov 5, 2024
738af53
Merge pull request #653 from flairNLP/images-suggestions
MaxDall Nov 5, 2024
1308d2f
finish `at`, `ca`, `fr` and `ind`
MaxDall Nov 5, 2024
be17481
Update documentation from @ db2d4c594a3d139b2bb71634b2fe20b6cef6c8a2
MaxDall Nov 5, 2024
7fe1fdd
add `lt`, `my` and `tr`
MaxDall Nov 5, 2024
14718ff
`People`
addie9800 Nov 7, 2024
5290449
`People`
addie9800 Nov 7, 2024
8e2786f
`RuhrNachrichten` - `WDR`
addie9800 Nov 7, 2024
7449c7c
Update documentation from @ f06969f1c0ead73f7a8b2ec2bbbe73e79df42c66
addie9800 Nov 7, 2024
a36a0db
Finish `DE`
addie9800 Nov 8, 2024
8f4c727
`JungeWelt`, `Merkur` - `RheinischePost`
addie9800 Nov 8, 2024
4aa680e
`NDR`
addie9800 Nov 8, 2024
3125fcf
`APNews`, `BusinessInsider`
addie9800 Nov 8, 2024
5d7d913
remove video preview images from `Welt`
MaxDall Nov 12, 2024
2f60495
adjust image selector for `TheIndependent`
MaxDall Nov 12, 2024
9e46bd9
`TheNewYorker` - `Wired`
addie9800 Nov 12, 2024
bcc3d2d
add version parsing
MaxDall Nov 13, 2024
62a57c6
`TheNation`
addie9800 Nov 13, 2024
3f5f9ac
`FoxNews` - `TheIntercept`
addie9800 Nov 14, 2024
5d390bb
Update documentation from @ f4b31d90b017a22a1b57892c7924f3adc8aed707
addie9800 Nov 14, 2024
3f66e49
fix typo
MaxDall Nov 15, 2024
246e74c
parse `max-width` and rename `min-width` -> `query-width`
MaxDall Nov 15, 2024
f8338ae
Merge branch 'images' into add-version-parsing
addie9800 Nov 18, 2024
3296f17
Update utility.py
addie9800 Nov 18, 2024
978c7c3
Update utility.py
addie9800 Nov 18, 2024
3a7fed7
resolve forwarded types
MaxDall Nov 19, 2024
5cd8439
Merge pull request #661 from flairNLP/add-version-parsing
MaxDall Nov 19, 2024
f69217f
Merge branch 'master' into images
MaxDall Nov 19, 2024
272d840
Update documentation from @ f4b31d90b017a22a1b57892c7924f3adc8aed707
MaxDall Nov 19, 2024
2a51169
remove leftover test case
MaxDall Nov 19, 2024
5966036
bug fixes
MaxDall Nov 19, 2024
4e50454
fix `__lt__` for `ImageVersion`
MaxDall Nov 20, 2024
e6e4ef4
fix a bug in `src` and `srcset` parsing
MaxDall Nov 21, 2024
4b656d8
Fix `WDR`, add testcase for `ORF`
addie9800 Nov 22, 2024
c473901
Overwrite test-case for `TheIndependent`
addie9800 Nov 22, 2024
c938b33
`LeFigaro` test case overwrite
addie9800 Nov 22, 2024
f3f8f54
`MalayMail` testcase update
addie9800 Nov 22, 2024
a673c71
fix image extraction for `APNews` and `TheNation`
MaxDall Nov 22, 2024
5b4ca5d
fix a bug with sorting test jsons
MaxDall Nov 22, 2024
c8e2cc5
add image extraction for `WestAustralian`
MaxDall Nov 26, 2024
35986ac
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
MaxDall Nov 26, 2024
162cd0a
Update `FreiePresse`
addie9800 Nov 26, 2024
0243a9e
remove duplicate selectors
addie9800 Nov 26, 2024
f64587b
remove test files
addie9800 Nov 26, 2024
f80bce6
beautify list comprehension
addie9800 Nov 26, 2024
79a5cbf
add test file for `FreiePresse` version `V1_1`
MaxDall Nov 29, 2024
4605e15
add image extraction for `TagesAnzeiger`
MaxDall Nov 29, 2024
fddaee7
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
MaxDall Nov 29, 2024
5eb93ab
fix v1 images parsing
addie9800 Nov 29, 2024
eda8323
overwrite json
addie9800 Nov 29, 2024
8cbe5f5
Add image extraction for `Bhaskar`
addie9800 Dec 2, 2024
645441e
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
addie9800 Dec 2, 2024
b80b829
Add image extraction for `TheJapanNews`
addie9800 Dec 3, 2024
27f9261
Add image extraction for `YomiuriShimbun`
addie9800 Dec 3, 2024
113e1df
Update documentation from @ 0957415dc22763e7fea7397af2635779a636fbe5
addie9800 Dec 3, 2024
2a85504
image_extraction documentation
addie9800 Dec 15, 2024
a5e5ce1
add image example to README.md
addie9800 Dec 15, 2024
960288d
update image example in README.md
addie9800 Dec 15, 2024
dc7c14f
add images to article documentation
addie9800 Dec 15, 2024
2a31d36
update `TechCrunch`
addie9800 Dec 15, 2024
d9e83ed
remove author_filter usage
addie9800 Dec 15, 2024
bcd9f8d
remove image author bloat
addie9800 Dec 15, 2024
de31f6f
guard `Optional[str]` for mypy
MaxDall Dec 16, 2024
dd2ddfb
Merge pull request #663 from flairNLP/update-freie-presse
MaxDall Dec 16, 2024
c540cbc
add image extraction for `ElPais`
MaxDall Dec 16, 2024
8bb273d
some improvements regarding printouts and documentation
MaxDall Dec 16, 2024
fc40318
fix FreeBeacon
addie9800 Dec 16, 2024
1d519db
fix FrankfurterRundschau
addie9800 Dec 16, 2024
97e7bb6
JSON reordering, clean image authors
addie9800 Dec 16, 2024
a2255bc
update Merkur
addie9800 Dec 16, 2024
e8aa4d8
json reordering
addie9800 Dec 16, 2024
6a7ffcd
json reordering
addie9800 Dec 16, 2024
3343e1f
improve image author parsing
addie9800 Dec 16, 2024
1baee60
remove selected image author bloat
addie9800 Dec 16, 2024
cf34efc
update WDR
addie9800 Dec 16, 2024
25f79d6
Merge remote-tracking branch 'origin/images' into images
addie9800 Dec 16, 2024
3799401
remove author_filter
addie9800 Dec 16, 2024
4b40a3a
remove author_filter
addie9800 Dec 16, 2024
27a6766
black
addie9800 Dec 16, 2024
3d64b09
fix pytest
addie9800 Dec 16, 2024
ab846d7
simplify credit_keywords
addie9800 Dec 16, 2024
fdc74ac
Merge branch 'master' into images
addie9800 Dec 16, 2024
95f5424
update metro tests
addie9800 Dec 16, 2024
c7c5d33
catch invalid width and height values
addie9800 Dec 17, 2024
489a6aa
remove author replacement in description
addie9800 Dec 21, 2024
c64ec68
Merge remote-tracking branch 'origin/images' into images
addie9800 Dec 21, 2024
a123bce
remove try - except from float parsing
addie9800 Dec 21, 2024
12d2895
Merge branch 'master' into images
addie9800 Dec 21, 2024
8957e9b
update test data
addie9800 Dec 21, 2024
c5c9c98
mypy
addie9800 Dec 21, 2024
29 changes: 14 additions & 15 deletions src/fundus/parser/utility.py
@@ -432,26 +432,29 @@ def preprocess_url(url: str, domain: str) -> str:
     return url


-def image_author_parsing(authors: Union[str, List[str]], author_filter: Optional[Pattern[str]] = None) -> List[str]:
+def image_author_parsing(authors: Union[str, List[str]]) -> List[str]:
     credit_keywords = [
         "credits?",
         "quellen?",
         "bild(rechte)?",
         "sources?",
-        r"(((f|ph)otos?(graph)?|image|illustration)\s*)+(by|:)",
+        r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?)\s*)+(by|:|courtesy)",
         "©",
         "– alle rechte vorbehalten",
         "copyright",
         "all rights reserved",
-        "pictures?"
+        "pictures?( by|:)",
         "courtesy of",
-        "="
+        "=",
     ]
     author_filter = re.compile(r"(?is)^(" + r"|".join(credit_keywords) + r"):?\s*")

     def clean(author: str):
         author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
+        # filtering credit keywords
         author = re.sub(author_filter, "", author, count=1)
+        # filtering bloat following the author
+        author = re.sub(r"(?i)/?copyright.*", "", author)
         return author.strip()

     if isinstance(authors, list):
@@ -599,7 +602,6 @@ def parse_image_nodes(
     caption_selector: XPath,
     alt_selector: XPath,
     author_selector: Union[XPath, Pattern[str]],
-    author_filter: Optional[Pattern[str]] = None,
     domain: Optional[str] = None,
     size_pattern: Optional[Pattern[str]] = None,
 ) -> Iterator[Image]:
@@ -611,8 +613,6 @@ def parse_image_nodes(
         alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
         author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
             figure with copyright or credit in its class attribute.
-        author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
-            remove unwanted substrings
         domain: If set, the domain will be prepended to URLs in case they are relative
         size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
             will be matched with re.findall and overwrites existing values. Defaults to None.
@@ -637,21 +637,24 @@ def nodes_to_text(nodes: List[Union[lxml.html.HtmlElement, str]]) -> Optional[st
         # parse caption
         caption = nodes_to_text(caption_selector(node))

+        # parse description
+        description = nodes_to_text(alt_selector(node))
+
         # parse authors
         authors = []
         if isinstance(author_selector, Pattern):
             # author is part of the caption
             if caption and (match := re.search(author_selector, caption)):
                 authors = [match.group("credits")]
                 caption = re.sub(author_selector, "", caption).strip() or None
+            elif description and (match := re.search(author_selector, description)):
+                authors = [match.group("credits")]
+                description = re.sub(author_selector, "", description).strip() or None
Collaborator

I would suggest leaving description as is and not applying filters here. If I remember correctly, we stated in the documentation that it's the parsed alt attribute of the image, so I would argue one would expect the raw data.
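
To make the suggestion concrete, here is a minimal sketch (not the code in this PR) of the `Pattern` branch applied to the caption only, with the alt text passed through unchanged. `split_credits` and the sample pattern are hypothetical names chosen for illustration; the only convention taken from the diff is the named group `credits`.

```python
import re
from typing import List, Optional, Pattern, Tuple


def split_credits(caption: Optional[str], author_pattern: Pattern[str]) -> Tuple[Optional[str], List[str]]:
    """Pull the credit out of a caption; hypothetical helper, not part of Fundus."""
    if caption and (match := re.search(author_pattern, caption)):
        authors = [match.group("credits")]
        # remove the matched credit from the caption, fall back to None if nothing is left
        caption = re.sub(author_pattern, "", caption).strip() or None
        return caption, authors
    return caption, []


# assumed pattern: a credit in trailing parentheses
pattern = re.compile(r"\s*\((?P<credits>[^)]+)\)$")

caption, authors = split_credits("Protesters in Tbilisi (Jane Doe)", pattern)
description = "Protesters in Tbilisi (Jane Doe)"  # raw alt text, deliberately left untouched

print(caption)      # Protesters in Tbilisi
print(authors)      # ['Jane Doe']
print(description)  # Protesters in Tbilisi (Jane Doe)
```

With this shape the credit is still recovered whenever it appears in the caption, while `description` stays identical to the `alt` attribute as documented.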

         else:
             # author is selectable as node
             if author_nodes := author_selector(node):
                 authors = generic_nodes_to_text(author_nodes, normalize=True)
-        authors = image_author_parsing(authors, author_filter)
-
-        # parse description
-        description = nodes_to_text(alt_selector(node))
+        authors = image_author_parsing(authors)

         yield Image(
             versions=versions,
@@ -707,7 +710,6 @@ def image_extraction(
     author_selector: Union[XPath, Pattern[str]] = XPath(
         "(./ancestor::figure//*[(contains(@class, 'copyright') or contains(@class, 'credit')) and text()])[1]"
     ),
-    author_filter: Optional[Pattern[str]] = None,
     relative_urls: Union[bool, XPath] = False,
Collaborator Author

@addie9800 addie9800 Nov 7, 2024

@MaxDall I suggest changing this slightly to cover unusual cases. Refer to People for an example use case.

Update: Also used for NDR

     size_pattern: Pattern[str] = re.compile(
         r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"
@@ -733,8 +735,6 @@ def image_extraction(
         alt_selector: Selector selecting the descriptive text of an image. Defaults to selecting alt value.
         author_selector: Selector selecting the credits for an image. Defaults to selecting an arbitrary child of
             figure with copyright or credit in its class attribute.
-        author_filter: In case the author_selector cannot adequately select the author, this filter can be used to
-            remove unwanted substrings.
         relative_urls: If True, the extractor assumes that image src URLs are relative and prepends the publisher
             domain
         size_pattern: Regular expression to select <width>, <height> and <dpr> from the image URL. The given regExp
@@ -774,7 +774,6 @@ def image_extraction(
         caption_selector=caption_selector,
         alt_selector=alt_selector,
         author_selector=author_selector,
-        author_filter=author_filter,
         domain=domain,
         size_pattern=size_pattern,
     )
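
As a rough illustration of what the `size_pattern` argument is for, the sketch below applies the portion of the default pattern that is visible in the `image_extraction` signature to a made-up image URL. The docstring says Fundus matches the expression with `re.findall`; this sketch uses `finditer` with the named groups instead, which is an assumption made for readability. `sizes_from_url` and the example URL are hypothetical.

```python
import re
from typing import Dict

# the part of the default pattern visible in the diff; the hunk ends right after this line
SIZE_PATTERN = re.compile(
    r"width([=-])(?P<width>[0-9.]+)|height([=-])(?P<height>[0-9.]+)|dpr=(?P<dpr>[0-9.]+|)"
)


def sizes_from_url(url: str) -> Dict[str, float]:
    """Collect width, height and dpr hints from an image URL (hypothetical helper)."""
    found: Dict[str, float] = {}
    for match in SIZE_PATTERN.finditer(url):
        for name, value in match.groupdict().items():
            # each match fills only one named group; skip unset and empty ones
            if value:
                found[name] = float(value)
    return found


print(sizes_from_url("https://example.com/img.jpg?width=1200&height=675&dpr=2"))
# {'width': 1200.0, 'height': 675.0, 'dpr': 2.0}
```

Publishers that encode dimensions differently would need their own expression with the same group names.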
2 changes: 1 addition & 1 deletion tests/resources/parser/test_data/de/EuronewsDE.json
@@ -108,7 +108,7 @@
         "description": "Tausende haben am Montag in Tiflis an einer regierungsfreundlichen Kundgebung teilgenommen.",
         "caption": null,
         "authors": [
-          "Shakh Aivazov/2024"
+          "Shakh Aivazov"
         ],
         "position": 472
       }
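
For readers who want to try the credit cleaning outside of Fundus, the following standalone sketch mirrors the new side of the `image_author_parsing` hunk. The keyword list and the three `re.sub` calls are copied from the diff; the names `clean_image_credits` and `_clean`, the handling of string versus list input, and the dropping of empty results are assumptions, since that part of the function is collapsed in the diff view. The sample inputs are invented and are not taken from the test data.

```python
import re
from typing import List, Union

# keyword list as it appears on the new side of the diff
CREDIT_KEYWORDS = [
    "credits?",
    "quellen?",
    "bild(rechte)?",
    "sources?",
    r"(((f|ph)oto(graph)?s?|image|illustrations?|cartoons?)\s*)+(by|:|courtesy)",
    "©",
    "– alle rechte vorbehalten",
    "copyright",
    "all rights reserved",
    "pictures?( by|:)",
    "courtesy of",
    "=",
]
AUTHOR_FILTER = re.compile(r"(?is)^(" + r"|".join(CREDIT_KEYWORDS) + r"):?\s*")


def _clean(author: str) -> str:
    # unwrap credits that are fully enclosed in parentheses
    author = re.sub(r"^\((.*)\)$", r"\1", author).strip()
    # drop one leading credit keyword such as "Photo:" or "©"
    author = re.sub(AUTHOR_FILTER, "", author, count=1)
    # drop a copyright notice that follows the author
    author = re.sub(r"(?i)/?copyright.*", "", author)
    return author.strip()


def clean_image_credits(authors: Union[str, List[str]]) -> List[str]:
    # assumption: a single string is treated as a one-element list,
    # and entries that end up empty are dropped
    if isinstance(authors, str):
        authors = [authors]
    return [cleaned for author in authors if (cleaned := _clean(author))]


print(clean_image_credits("Photo: Jane Doe/Copyright 2024 Example Agency"))  # ['Jane Doe']
print(clean_image_credits(["(© Reuters)", "Foto: dpa"]))                     # ['Reuters', 'dpa']
```

Because the keyword filter is anchored at the start of the string and applied once, a keyword that appears later in a name is left alone; only the trailing `copyright` rule looks further into the string.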