Merge pull request #606 from flairNLP/images

Add Images
flairNLP · Jan 2, 2025 · c550a1a
2 parents 6ec184f + c5c9c98
commit c550a1a

Showing 250 changed files with 25,523 additions and 868 deletions.
diff --git a/README.md b/README.md
@@ -68,24 +68,25 @@ That's already it!
 If you run this code, it should print out something like this:
 
 ```console
-Fundus-Article:
+Fundus-Article including 1 image(s):
 - Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
-- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
-          through committee votes on Thursday thanks to a last-minute [...]"
+- Text:  "89-year-old California senator arrived hour late to Judiciary Committee hearing
+          to advance President Biden's stalled nominations  Democrats [...]"
 - URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
-- From:   FreeBeacon (2023-05-11 18:41)
+- From:   The Washington Free Beacon (2023-05-11 18:41)
 
-Fundus-Article:
+Fundus-Article including 3 image(s):
 - Title: "Northwestern student government freezes College Republicans funding over [...]"
 - Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
           the funds of the university's chapter of College Republicans [...]"
 - URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
-- From:   FoxNews (2023-05-09 14:37)
+- From:   Fox News (2023-05-09 14:37)
 ```
 
 This printout tells you that you successfully crawled two articles!
 
 For each article, the printout details:
+- the number of images included in the article
 - the "Title" of the article, i.e. its headline 
 - the "Text", i.e. the main article body text
 - the "URL" from which it was crawled
@@ -146,6 +147,57 @@ for article in crawler.crawl(max_articles=1000000):
 ````
 
 
+## Example 4: Crawl some images
+
+By default, Fundus tries to parse the images included in every crawled article.
+Let's crawl an article and print its images to see more details.
+
+```python
+from fundus import PublisherCollection, Crawler
+
+# initialize the crawler for The LA Times
+crawler = Crawler(PublisherCollection.us.LATimes)
+
+# crawl 1 article and print the images
+for article in crawler.crawl(max_articles=1):
+    for image in article.images:
+        print(image)
+```
+
+For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:
+
+```console
+Fundus-Article Cover-Image:
+-URL:			 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
+-Description:	         'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
+-Caption:		 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
+-Authors:		 ['Abbie Parr / Associated Press']
+-Versions:		 [320x213, 568x379, 768x512, 1024x683, 1200x800]
+
+Fundus-Article Image:
+-URL:			 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
+-Description:	         'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
+-Caption:		 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
+-Authors:		 ['Abbie Parr / Associated Press']
+-Versions:		 [320x213, 568x379, 768x512, 1024x683, 1200x800]
+
+Fundus-Article Image:
+-URL:			 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
+-Description:	         'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
+-Caption:		 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
+-Authors:		 ['Gina Ferazzi / Los Angeles Times']
+-Versions:		 [320x218, 568x387, 768x524, 1024x698, 1200x818]
+```
+
+For each image, the printout details:
+- The cover image designation (if applicable).
+- The URL for the highest-resolution version of the image.
+- A description of the image.
+- The image's caption.
+- The name of the copyright holder.
+- A list of all available versions of the image.
+
+
 ## Tutorials
 
 We provide **quick tutorials** to get you started with the library:

diff --git a/docs/3_the_article_class.md b/docs/3_the_article_class.md
@@ -4,6 +4,7 @@
   * [What is an `Article`](#what-is-an-article)
   * [The articles' body](#the-articles-body)
   * [HTML](#html)
+  * [Images](#images)
   * [Language detection](#language-detection)
   * [Saving an Article](#saving-an-article)
 
@@ -117,6 +118,22 @@ Here you have access to the following information:
 4. `crawl_date: datetime`: The exact timestamp the article was crawled.
 5. `source_info: SourceInfo`: Some information about the HTML's origins, mostly for debugging purpose.
 
+## Images
+
+Some publishers provide images with their articles.
+To capture all relevant information, the article's `images` attribute returns a list of custom `Image` objects.
+Each `Image` object contains the following attributes:
+- `url`: the URL of the image with the largest dimensions.
+- `versions`: a list of custom `ImageVersion` objects, each containing the following attributes:
+  - `url`: the URL of the image with the specific dimensions.
+  - `size`: a `Dimension` object with attributes `width` and `height`.
+  - `type`: the image format (e.g. `jpeg`, `png`).
+- `is_cover`: a boolean indicating whether the image is the cover image of the article.
+- `description`: a string describing the image (usually the alt-text).
+- `caption`: the image caption as used in the article.
+- `authors`: a list of strings representing the authors of the image.
+- `position`: an integer describing the position of the image in the DOM-tree.
+
 ## Language detection
 
 Sometimes publishers support articles in different languages.

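Note on the `Image` attributes documented in the hunk above: the sketch below shows how they might be used together. The dataclasses are hypothetical stand-ins that only mirror the documented attribute names; they are not the classes from `fundus.parser.data`, and the helper functions and example URLs are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins mirroring the documented attribute names only;
# these are NOT the classes shipped with Fundus.
@dataclass
class Dimension:
    width: int
    height: int


@dataclass
class ImageVersion:
    url: str
    size: Dimension
    type: str


@dataclass
class Image:
    url: str
    versions: List[ImageVersion]
    is_cover: bool = False
    description: Optional[str] = None
    caption: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    position: int = 0


def cover_image(images: List[Image]) -> Optional[Image]:
    """Return the first image flagged as cover, or None if there is none."""
    return next((image for image in images if image.is_cover), None)


def largest_version(image: Image) -> ImageVersion:
    """Return the version with the largest pixel area."""
    return max(image.versions, key=lambda v: v.size.width * v.size.height)


def in_document_order(images: List[Image]) -> List[Image]:
    """Sort images by their position in the DOM tree."""
    return sorted(images, key=lambda image: image.position)


# Made-up example data (placeholder URLs).
images = [
    Image(
        url="https://example.com/second-1024.jpg",
        versions=[ImageVersion("https://example.com/second-1024.jpg", Dimension(1024, 683), "jpeg")],
        caption="Second image",
        position=40,
    ),
    Image(
        url="https://example.com/cover-1200.jpg",
        versions=[
            ImageVersion("https://example.com/cover-320.jpg", Dimension(320, 213), "jpeg"),
            ImageVersion("https://example.com/cover-1200.jpg", Dimension(1200, 800), "jpeg"),
        ],
        is_cover=True,
        caption="Cover image",
        position=3,
    ),
]

cover = cover_image(images)
best = largest_version(cover)
print(cover.caption, best.size.width, best.size.height)
```

With real articles, the same helpers would presumably work on `article.images`, provided the shipped classes expose the attributes as documented.
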
diff --git a/docs/attribute_guidelines.md b/docs/attribute_guidelines.md
@@ -66,4 +66,11 @@ Those attributes will be validated with unit tests when used.
         <td><code>bool</code></td>
         <td></td>
     </tr>
+    <tr>
+        <td>images</td>
+        <td>A list of <code>Image</code> objects (Fundus' own data type for representing images) included within the article.
+        The <code>Image</code> objects include metadata such as caption, authors, and position, if available.</td>
+        <td><code>List[Image]</code></td>
+        <td><code>image_extraction</code></td>
+    </tr>
 </table>
diff --git a/docs/how_to_add_a_publisher.md b/docs/how_to_add_a_publisher.md
@@ -17,7 +17,8 @@
       * [Working with `lxml`](#working-with-lxml)
       * [CSS-Select](#css-select)
       * [XPath](#xpath)
-    * [Extract the ArticleBody](#extract-the-articlebody)
+    * [Extracting the ArticleBody](#extracting-the-articlebody)
+    * [Extracting the Images](#extracting-the-images)
     * [Checking the free_access attribute](#checking-the-free_access-attribute)
     * [Finishing the Parser](#finishing-the-parser)
   * [6. Generate unit tests and update tables](#6-generate-unit-tests-and-update-tables)
@@ -533,7 +534,7 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
 Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation. 
 We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.
 
-### Extract the ArticleBody
+### Extracting the ArticleBody
 
 In the context of Fundus, an article's body typically includes multiple paragraphs, and optionally, a summary and several subheadings.
 It's important to note that article layouts can vary significantly between publishers, with the most common layouts being:
@@ -546,6 +547,39 @@ To accurately extract the body of an article, use the `extract_article_body_with
 This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
 For practical examples, refer to existing parser implementations to understand how everything integrates.
 
+### Extracting the Images
+
+Fundus offers a utility function `image_extraction` to extract images from the article.
+It requires only the `doc` element of the article and the parser's `_paragraph_selector`; further optional parameters can be passed if necessary.
+A skeleton of the `images` attribute looks like this:
+
+```python
+from fundus.parser.utility import image_extraction
+from fundus.parser import Image
+
+@attribute
+def images(self) -> List[Image]:
+    return image_extraction(
+        doc=self.precomputed.doc,
+        paragraph_selector=self._paragraph_selector,
+    )
+```
+
+Once you have implemented this, you can try to extract your first images from the article body!
+At this point, you may get an `IndexError`.
+This happens when the `upper_boundary_selector` does not select any element.
+Adjust it so that it selects an element above the cover image; all images located before this upper boundary are discarded.
+Once you get your first images, you can further fine-tune your results:
+
+- `image_selector`: This selector is used to filter which image elements are selected.
+- `lower_boundary_selector`: By default, all images after the last paragraph are discarded. With this selector, you can define your custom boundary.
+- `caption_selector`: This selector is used to extract the caption of the image and should usually be of the form `XPath("./ancestor::...")`.
+- `alt_selector`: This selector selects the alt text (description) of the image.
+- `author_selector`: There are two ways to select the authors of the image:
+    - Preferably, the credits are within their own HTML element and can be addressed directly using an XPath selector.
+    - Alternatively, a `re.Pattern` object can be passed to select the authors from the caption. In this case, the named group `credits` is saved as the author, while the entire match is removed from the caption.
+- `relative_urls`: If set, an attempt will be made to complete relative URLs.
+- `size_pattern`: A `re.Pattern` object that can be used to extract the image sizes.
 
 ### Checking the free_access attribute
 

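Note on the `re.Pattern` variant of `author_selector` described in the hunk above: the behaviour can be sketched in isolation with the standard `re` module. This snippet does not call `image_extraction`; the helper `split_credits`, the pattern, and the sample caption are hypothetical and only illustrate the documented idea (the named group `credits` becomes the author, and the whole match is removed from the caption).

```python
import re
from typing import List, Tuple

# Hypothetical pattern: credits follow the caption in parentheses,
# e.g. "Some caption text. (Jane Doe / Example Agency)"
CREDIT_PATTERN = re.compile(r"\s*\((?P<credits>[^()]+)\)\s*$")


def split_credits(caption: str, pattern: re.Pattern) -> Tuple[str, List[str]]:
    """Split a caption into (caption without credits, list of authors)."""
    match = pattern.search(caption)
    if match is None:
        # No credits found: leave the caption untouched.
        return caption, []
    # The named group `credits` is kept as the author ...
    authors = [match.group("credits").strip()]
    # ... while the entire match is cut out of the caption.
    cleaned = caption[: match.start()] + caption[match.end():]
    return cleaned.strip(), authors


caption, authors = split_credits(
    "A player drives toward the basket. (Jane Doe / Example Agency)",
    CREDIT_PATTERN,
)
print(caption)
print(authors)
```

Anchoring the pattern at the end of the caption (`$`) is a design choice of this sketch; a publisher that places credits elsewhere would need a different pattern.
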
diff --git a/docs/supported_publishers.md b/docs/supported_publishers.md
@@ -393,7 +393,9 @@
           <span>www.dw.com</span>
         </a>
       </td>
-      <td>&#160;</td>
+      <td>
+        <code>images</code>
+      </td>
       <td>&#160;</td>
     </tr>
     <tr>
@@ -1697,7 +1699,9 @@
           <span>www.cnbc.com</span>
         </a>
       </td>
-      <td>&#160;</td>
+      <td>
+        <code>images</code>
+      </td>
       <td>
         <code>key_points</code>
       </td>
@@ -1748,7 +1752,9 @@
           <span>occupydemocrats.com</span>
         </a>
       </td>
-      <td>&#160;</td>
+      <td>
+        <code>images</code>
+      </td>
       <td>
         <code>description</code>
       </td>
@@ -1767,7 +1773,9 @@
           <span>www.reuters.com</span>
         </a>
       </td>
-      <td>&#160;</td>
+      <td>
+        <code>images</code>
+      </td>
       <td>&#160;</td>
     </tr>
     <tr>
@@ -1897,6 +1905,7 @@
         </a>
       </td>
       <td>
+        <code>images</code>
         <code>topics</code>
       </td>
       <td>&#160;</td>
@@ -1931,6 +1940,7 @@
         </a>
       </td>
       <td>
+        <code>images</code>
         <code>topics</code>
       </td>
       <td>&#160;</td>

diff --git a/scripts/generate_parser_test_files.py b/scripts/generate_parser_test_files.py
@@ -143,7 +143,11 @@ def main() -> None:
                     test_data[type(versioned_parser).__name__] = new
                 else:
                     entry.update(new)
-                    test_data[type(versioned_parser).__name__] = dict(sorted(entry.items()))
+
+                # sort entries
+                test_data[type(versioned_parser).__name__] = dict(
+                    sorted(test_data[type(versioned_parser).__name__].items())
+                )
 
             test_data_file.write(test_data)
             bar.update()

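Note on the hunk above: judging from the visible lines, the old code sorted the stored dict only in the `else` branch, so a freshly created entry kept its insertion order; the new code sorts in both cases. A stand-alone sketch of that behaviour follows. The `store` helper, the branch condition, and the keys are assumptions made for illustration, since the surrounding code is not shown in the diff.

```python
from typing import Any, Dict


def store(test_data: Dict[str, Dict[str, Any]], parser_name: str, new: Dict[str, Any]) -> None:
    """Insert or update an entry, then sort its keys in both cases."""
    entry = test_data.get(parser_name)
    if entry is None:  # assumed condition; the real check is outside the hunk
        test_data[parser_name] = new
    else:
        entry.update(new)

    # sort entries (now applied regardless of which branch ran)
    test_data[parser_name] = dict(sorted(test_data[parser_name].items()))


data: Dict[str, Dict[str, Any]] = {}
store(data, "ExampleParser", {"title": "t", "authors": ["a"]})
store(data, "ExampleParser", {"body": "b"})
print(list(data["ExampleParser"]))
```
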
diff --git a/src/fundus/parser/__init__.py b/src/fundus/parser/__init__.py
@@ -1,4 +1,4 @@
 from .base_parser import BaseParser, ParserProxy, attribute, function
-from .data import ArticleBody
+from .data import ArticleBody, Image
 
-__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody"]
+__all__ = ["ParserProxy", "BaseParser", "attribute", "function", "ArticleBody", "Image"]