v0.4.0

@MaxDall MaxDall released this 17 Jul 15:51
· 537 commits to master since this release
72e7ff0

🚀 Crawl 1 million articles in 7 hours on local hardware*

This release improves the stability of our CC-NEWS pipeline and introduces several quality-of-life (QoL) features:

  • a timeout parameter for the crawler
  • article serialization
  • improved logging
  • redesign of the PublisherCollection
  • redesign of the Article class

We also added two new publishers (golem, Heise), updated several existing publishers, and fixed a number of general bugs.

*Testing involved crawling 100,000 articles, which took 41.5 minutes, and scaling the timing up by a factor of 10. The test ran on a machine with 1000 Mbit/s bandwidth, a Core i9-13905H, 64 GB RAM, and Windows 11, using the complete PublisherCollection. Results may vary with use case and bandwidth.
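
The extrapolation in the footnote can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation: it assumes throughput stays constant when scaling from 100,000 to 1 million articles, and the variable names are illustrative.

```python
# Figures taken from the footnote above.
measured_articles = 100_000
measured_minutes = 41.5
target_articles = 1_000_000

# Assumption: throughput scales linearly with the number of articles.
scale = target_articles / measured_articles
estimated_hours = measured_minutes * scale / 60
articles_per_second = measured_articles / (measured_minutes * 60)

print(f"{estimated_hours:.2f} hours")           # → 6.92 hours
print(f"{articles_per_second:.1f} articles/s")  # → 40.2 articles/s
```

The result of about 6.92 hours is what the headline rounds to 7 hours.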

CC-NEWS pipeline and documentation

  • Slow down WARC path requests by @MaxDall in #538
  • Guard download and streaming of WARC files by @MaxDall in #537
  • Spread parallel requests for CCNewsCrawler by @MaxDall in #539
  • Fix upper bound for retries and catch urllib3.exceptions.HTTPError by @MaxDall in #541
  • Add progress bar for WARC file processing by @MaxDall in #542
  • Rework examples and tutorials regarding CC-NEWS by @MaxDall in #560
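
To illustrate what an upper bound on retries means in practice, here is a minimal, generic sketch. It is not the code from #541: `call_with_retries` and its parameters are hypothetical names, and `OSError` stands in for whatever exception type the caller wants to retry on (the pull request mentions `urllib3.exceptions.HTTPError`).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    base_delay: float = 0.5,
) -> T:
    """Call `operation`, retrying at most `max_retries` times on the listed errors."""
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            # Upper bound reached: re-raise instead of looping forever.
            if attempt >= max_retries:
                raise
            # Exponential backoff slows down repeated requests.
            time.sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With `max_retries=3`, the operation is called at most four times: the initial attempt plus three retries.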

QoL

New timeout parameter for crawl method
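
These notes do not spell out the semantics of the new parameter. The sketch below assumes one plausible reading, namely that iteration stops once no new item has arrived within `timeout` seconds. It is a standalone illustration and not the Fundus implementation; `iterate_with_timeout` and `slow_source` are hypothetical names.

```python
import queue
import threading
import time
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the source


def iterate_with_timeout(source: Iterable[T], timeout: Optional[float]) -> Iterator[T]:
    """Yield items from `source`; stop if none arrives within `timeout` seconds."""
    buffer: "queue.Queue[object]" = queue.Queue()

    def producer() -> None:
        try:
            for item in source:
                buffer.put(item)
        finally:
            buffer.put(_DONE)

    # Daemon thread so a stalled source cannot keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        try:
            item = buffer.get(timeout=timeout)  # None waits indefinitely
        except queue.Empty:
            return  # nothing arrived in time
        if item is _DONE:
            return
        yield item  # type: ignore[misc]


def slow_source() -> Iterator[int]:
    yield 1
    yield 2
    time.sleep(1.0)  # simulated stall, longer than the timeout below
    yield 3


print(list(iterate_with_timeout(slow_source(), timeout=0.2)))  # → [1, 2]
```

A bound like this is useful for long-running crawls, where a source that has stalled should end the loop rather than block it indefinitely.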

New article serialization

Improved logging

  • Expose loggers and update documentation by @MaxDall in #540
  • Rework logging and fix overwritten config by @MaxDall in #553

Redesigned PublisherCollection class

Redesigned Article class

Publishers

New Publishers

Fixes

Misc

  • Add timeout to publisher_coverage.py by @MaxDall in #508
  • Remove _parser from file names by @addie9800 in #516
  • Catch errors in coverage only if no complete articles were received by @MaxDall in #515
  • Remove previous file when using -o option in test case script by @MaxDall in #517
  • Set PYTHONPATH to the Root of the Repository for the Publisher Coverage Actions by @dobbersc in #519
  • Refactor metadata parsing to include multiple values using the same key by @MaxDall in #523
  • Deprecated Flag for Uncrawlable Publishers by @addie9800 in #534
  • Show details about incomplete articles in Publisher Coverage by @addie9800 in #531
  • Use timeout parameter in coverage script instead of wrapper by @MaxDall in #548

Bug Fixes

New Contributors

Full Changelog: v0.3.1...v0.4.0