Merge pull request #606 from flairNLP/images
Add Images
addie9800 authored Jan 2, 2025
2 parents 6ec184f + c5c9c98 commit c550a1a
Showing 250 changed files with 25,523 additions and 868 deletions.
64 changes: 58 additions & 6 deletions README.md
@@ -68,24 +68,25 @@ That's already it!
If you run this code, it should print out something like this:

```console
Fundus-Article:
Fundus-Article including 1 image(s):
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees
through committee votes on Thursday thanks to a last-minute [...]"
- Text: "89-year-old California senator arrived hour late to Judiciary Committee hearing
to advance President Biden's stalled nominations Democrats [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
- From: The Washington Free Beacon (2023-05-11 18:41)

Fundus-Article:
Fundus-Article including 3 image(s):
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)
- From: Fox News (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the number of images included in the article
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
@@ -146,6 +147,57 @@ for article in crawler.crawl(max_articles=1000000):
````


## Example 4: Crawl some images

By default, Fundus tries to parse the images included in every crawled article.
Let's crawl an article and print its images to see what information is available.

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The LA Times
crawler = Crawler(PublisherCollection.us.LATimes)

# crawl 1 article and print the images
for article in crawler.crawl(max_articles=1):
    for image in article.images:
        print(image)
```

For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:

```console
Fundus-Article Cover-Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
-Description: 'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
-Caption: 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
-Authors: ['Abbie Parr / Associated Press']
-Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

Fundus-Article Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
-Description: 'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
-Caption: 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
-Authors: ['Abbie Parr / Associated Press']
-Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

Fundus-Article Image:
-URL: 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
-Description: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
-Caption: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
-Authors: ['Gina Ferazzi / Los Angeles Times']
-Versions: [320x218, 568x387, 768x524, 1024x698, 1200x818]
```

For each image, the printout details:
- The cover image designation (if applicable).
- The URL for the highest-resolution version of the image.
- A description of the image.
- The image's caption.
- The name of the copyright holder.
- A list of all available versions of the image.


## Tutorials

We provide **quick tutorials** to get you started with the library:
17 changes: 17 additions & 0 deletions docs/3_the_article_class.md
@@ -4,6 +4,7 @@
* [What is an `Article`](#what-is-an-article)
* [The articles' body](#the-articles-body)
* [HTML](#html)
* [Images](#images)
* [Language detection](#language-detection)
* [Saving an Article](#saving-an-article)

@@ -117,6 +118,22 @@ Here you have access to the following information:
4. `crawl_date: datetime`: The exact timestamp the article was crawled.
5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purposes.

## Images

Some publishers provide images with their articles.
To capture all relevant information, the article's `images` attribute returns a list of custom `Image` objects.
Each `Image` object has the following attributes:
- `url`: the URL of the image with the largest dimensions.
- `versions`: a list of custom `ImageVersion` objects, each with the following attributes:
  - `url`: the URL of the image at these specific dimensions.
  - `size`: a `Dimension` object with attributes `width` and `height`.
  - `type`: the image format (e.g. `jpeg`, `png`).
- `is_cover`: a boolean indicating whether the image is the cover image of the article.
- `description`: a string describing the image (usually the alt-text).
- `caption`: the image caption as used in the article.
- `authors`: a list of strings representing the authors of the image.
- `position`: an integer describing the position of the image in the DOM-tree.
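
To illustrate how these attributes relate, here is a minimal sketch that picks the smallest image version that is at least a given width.
The dataclasses below are hypothetical stand-ins that only mirror the attribute names listed above; they are not the Fundus classes themselves.
The helper `pick_version` and the `example.com` URLs are likewise assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins mirroring the documented attribute names (not the Fundus classes).
@dataclass
class Dimension:
    width: int
    height: int


@dataclass
class ImageVersion:
    url: str
    size: Optional[Dimension] = None
    type: Optional[str] = None


@dataclass
class Image:
    versions: List[ImageVersion]
    is_cover: bool = False
    description: Optional[str] = None
    caption: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    position: int = 0

    @property
    def url(self) -> str:
        # URL of the version with the largest dimensions, as described above
        largest = max(self.versions, key=lambda v: v.size.width * v.size.height if v.size else 0)
        return largest.url


def pick_version(image: Image, min_width: int) -> ImageVersion:
    """Illustrative helper: smallest version at least `min_width` wide, else the widest one."""
    sized = sorted((v for v in image.versions if v.size), key=lambda v: v.size.width)
    if not sized:
        raise ValueError("no version with a known size")
    for version in sized:
        if version.size.width >= min_width:
            return version
    return sized[-1]


image = Image(
    versions=[
        ImageVersion("https://example.com/img-320.jpg", Dimension(320, 213), "jpeg"),
        ImageVersion("https://example.com/img-768.jpg", Dimension(768, 512), "jpeg"),
        ImageVersion("https://example.com/img-1200.jpg", Dimension(1200, 800), "jpeg"),
    ],
    is_cover=True,
    authors=["Abbie Parr / Associated Press"],
)

chosen = pick_version(image, min_width=600)
print(chosen.url)
print(image.url)
```

With real articles, the same idea should apply to the objects in `article.images`, provided the versions carry a `size`.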

## Language detection

Sometimes publishers support articles in different languages.
7 changes: 7 additions & 0 deletions docs/attribute_guidelines.md
@@ -66,4 +66,11 @@ Those attributes will be validated with unit tests when used.
<td><code>bool</code></td>
<td></td>
</tr>
<tr>
<td>images</td>
<td>A list of <code>Image</code> objects (Fundus' own data type for representing images) included in the article.
Each <code>Image</code> includes metadata such as caption, authors, and position, if available.</td>
<td><code>List[Image]</code></td>
<td><code>image_extraction</code></td>
</tr>
</table>
38 changes: 36 additions & 2 deletions docs/how_to_add_a_publisher.md
@@ -17,7 +17,8 @@
* [Working with `lxml`](#working-with-lxml)
* [CSS-Select](#css-select)
* [XPath](#xpath)
* [Extract the ArticleBody](#extract-the-articlebody)
* [Extracting the ArticleBody](#extracting-the-articlebody)
* [Extracting the Images](#extracting-the-images)
* [Checking the free_access attribute](#checking-the-free_access-attribute)
* [Finishing the Parser](#finishing-the-parser)
* [6. Generate unit tests and update tables](#6-generate-unit-tests-and-update-tables)
@@ -533,7 +534,7 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation.
We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.

### Extract the ArticleBody
### Extracting the ArticleBody

In the context of Fundus, an article's body typically includes multiple paragraphs, and optionally, a summary and several subheadings.
It's important to note that article layouts can vary significantly between publishers, with the most common layouts being:
@@ -546,6 +547,39 @@ To accurately extract the body of an article, use the `extract_article_body_with
This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
For practical examples, refer to existing parser implementations to understand how everything integrates.

### Extracting the Images

Fundus offers a utility function `image_extraction` to extract images from the article.
This function requires only the `doc` element of the article and the parser's `_paragraph_selector`; further optional arguments can be passed if necessary.
The skeleton of the function looks like this:

```python
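from typing import List

from fundus.parser import Image, attribute
from fundus.parser.utility import image_extraction


# defined inside your parser class
@attribute
def images(self) -> List[Image]:
    # extract the article's images from the parsed document
    return image_extraction(
        doc=self.precomputed.doc,
        paragraph_selector=self._paragraph_selector,
    )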
```

Once you have implemented this, you can try to extract your first images from the article body!
At this point, you may run into an `IndexError`.
This happens when the `upper_boundary_selector` does not select any element.
Adjust it so that it selects an element above the cover image; all images located before this upper boundary are discarded.
Once you get your first images, you can fine-tune the results with the following optional parameters:

- `image_selector`: This selector filters which image elements are selected.
- `lower_boundary_selector`: By default, all images after the last paragraph are discarded. With this selector, you can define a custom boundary.
- `caption_selector`: This selector extracts the caption of the image and should usually be of the form `XPath("./ancestor::...")`.
- `alt_selector`: This selector selects the alt text (description) of the image.
- `author_selector`: You have two options for selecting the author of the image:
  - Preferably, the credits are in their own HTML element and can be addressed directly with an XPath selector.
  - Alternatively, a `re.Pattern` object can be passed to select the authors from the caption. In this case, the capture group named `credits` is saved as the author, and the entire `Match` is removed from the caption.
- `relative_urls`: If set, an attempt will be made to complete relative URLs.
- `size_pattern`: A `re.Pattern` object that can be used to extract the image sizes.
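
The regex variant of `author_selector` can be sketched with the standard library alone.
The snippet below is an illustration of the behaviour described above, not Fundus's implementation: the pattern `CREDIT_PATTERN`, the helper `split_caption`, and the assumption that credits sit in parentheses at the end of the caption are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: credits appear in parentheses at the very end of the caption.
# The named group `credits` holds the author text.
CREDIT_PATTERN = re.compile(r"\s*\((?P<credits>[^()]+)\)\s*$")


def split_caption(caption: str, pattern: re.Pattern) -> Tuple[str, List[str]]:
    """Return the caption without the matched credits, plus the extracted authors."""
    match = pattern.search(caption)
    if match is None:
        # no credits found: leave the caption untouched
        return caption, []
    authors = [match.group("credits").strip()]
    # remove the entire match from the caption, not only the named group
    cleaned = (caption[: match.start()] + caption[match.end():]).strip()
    return cleaned, authors


cleaned, authors = split_caption(
    "Lakers coach JJ Redick talks with Anthony Davis. (Abbie Parr / Associated Press)",
    CREDIT_PATTERN,
)
print(cleaned)
print(authors)
```

A pattern like this would need to be adapted to how each publisher formats its credits.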

### Checking the free_access attribute

18 changes: 14 additions & 4 deletions docs/supported_publishers.md
@@ -393,7 +393,9 @@
<span>www.dw.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
</tr>
<tr>
@@ -1697,7 +1699,9 @@
<span>www.cnbc.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>
<code>key_points</code>
</td>
@@ -1748,7 +1752,9 @@
<span>occupydemocrats.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>
<code>description</code>
</td>
@@ -1767,7 +1773,9 @@
<span>www.reuters.com</span>
</a>
</td>
<td>&#160;</td>
<td>
<code>images</code>
</td>
<td>&#160;</td>
</tr>
<tr>
@@ -1897,6 +1905,7 @@
</a>
</td>
<td>
<code>images</code>
<code>topics</code>
</td>
<td>&#160;</td>
@@ -1931,6 +1940,7 @@
</a>
</td>
<td>
<code>images</code>
<code>topics</code>
</td>
<td>&#160;</td>
6 changes: 5 additions & 1 deletion scripts/generate_parser_test_files.py
@@ -143,7 +143,11 @@ def main() -> None:
test_data[type(versioned_parser).__name__] = new
else:
entry.update(new)
test_data[type(versioned_parser).__name__] = dict(sorted(entry.items()))

# sort entries
test_data[type(versioned_parser).__name__] = dict(
sorted(test_data[type(versioned_parser).__name__].items())
)

test_data_file.write(test_data)
bar.update()
4 changes: 2 additions & 2 deletions src/fundus/parser/__init__.py
@@ -1,4 +1,4 @@
from .base_parser import BaseParser, ParserProxy, attribute, function
from .data import ArticleBody
from .data import ArticleBody, Image

__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody"]
__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody", "Image"]