v0.4.0
Crawl 1 million articles in 7 hours on local hardware*
This release improves the stability of our CC-NEWS pipeline and introduces several quality-of-life (QoL) features:
- a timeout parameter for the crawler
- article serialization
- improved logging
- redesign of the PublisherCollection
- redesign of the Article class
We also added two new publishers (golem, Heise), updated several existing publishers, and fixed a number of general bugs.
*Testing involved crawling 100,000 articles, which took 41.5 minutes, and scaling the timing up by a factor of 10. The test ran on a machine with 1000 Mbit/s bandwidth, a Core i9-13905H, 64 GB RAM, and Windows 11, using the complete PublisherCollection. Results may vary based on the use case and bandwidth.
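The extrapolation in the footnote can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative, and linear scaling is the footnote's own assumption.

```python
# Measured values taken from the footnote above.
articles_measured = 100_000
minutes_measured = 41.5
scale = 10  # linear scale-up factor used in the footnote

articles_projected = articles_measured * scale
hours_projected = minutes_measured * scale / 60
articles_per_second = articles_measured / (minutes_measured * 60)

print(f"{articles_projected:,} articles in about {hours_projected:.1f} hours")
# prints: 1,000,000 articles in about 6.9 hours
print(f"about {articles_per_second:.0f} articles per second")
# prints: about 40 articles per second
```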
CC-NEWS pipeline and documentation
- Slow down WARC path requests by @MaxDall in #538
- Guard download and streaming of WARC files by @MaxDall in #537
- Spread parallel requests for CCNewsCrawler by @MaxDall in #539
- Fix upper bound for retries and catch urllib3.exceptions.HTTPError by @MaxDall in #541
- Add progress bar for WARC file processing by @MaxDall in #542
- Rework examples and tutorials regarding CC-NEWS by @MaxDall in #560
QoL
New timeout parameter for crawl method
- Add crawl timeout functionality by @olaughter in #536
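The general idea behind a crawl timeout can be sketched with the standard library alone. The example below is not the implementation from #536; it assumes that "timeout" means "stop when no new item arrives within the given number of seconds", and the names `iterate_with_timeout` and `slow_source` are hypothetical.

```python
import queue
import threading
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the source


def iterate_with_timeout(source: Iterable[T], timeout: float) -> Iterator[T]:
    """Yield items from `source`; stop once no item arrives within `timeout` seconds."""
    buffer: "queue.Queue[object]" = queue.Queue()

    def produce() -> None:
        # Runs in a background thread so the consumer can wait with a deadline.
        for item in source:
            buffer.put(item)
        buffer.put(_DONE)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        try:
            item = buffer.get(timeout=timeout)
        except queue.Empty:
            return  # nothing arrived in time, so give up
        if item is _DONE:
            return  # source finished normally
        yield item  # type: ignore[misc]


def slow_source() -> Iterator[int]:
    # Hypothetical stand-in for a crawler that stalls after two articles.
    yield 1
    yield 2
    time.sleep(1.0)
    yield 3


stalled = list(iterate_with_timeout(slow_source(), timeout=0.2))
complete = list(iterate_with_timeout([1, 2, 3], timeout=0.2))
```

Here `stalled` holds only the items received before the stall, while `complete` holds every item because the source never pauses longer than the timeout.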
New article serialization
- Add export feature for Articles by @addie9800 in #530
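As a rough illustration of what exporting an article to a serializable form involves, the sketch below converts a small dataclass into a JSON-compatible dictionary. `ExampleArticle` and `export_article` are hypothetical names chosen for this example and do not reflect the actual API added in #530.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class ExampleArticle:
    # Hypothetical stand-in for a crawled article.
    title: str
    authors: List[str] = field(default_factory=list)
    publishing_date: Optional[datetime] = None
    body: str = ""


def export_article(article: ExampleArticle, *attributes: str) -> Dict[str, Any]:
    """Return a JSON-compatible dict, optionally restricted to `attributes`."""
    data = asdict(article)
    if attributes:
        data = {key: data[key] for key in attributes}
    # datetime objects are not JSON serializable, so store them as ISO 8601 strings.
    return {
        key: value.isoformat() if isinstance(value, datetime) else value
        for key, value in data.items()
    }


article = ExampleArticle(
    title="Example headline",
    authors=["Jane Doe"],
    publishing_date=datetime(2024, 1, 15, 9, 30),
    body="Example body text.",
)
exported = export_article(article, "title", "publishing_date")
serialized = json.dumps(exported)
```

Restricting the export to selected attributes keeps the output small when only a few fields are needed downstream.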
Improved logging
- Expose loggers and update documentation by @MaxDall in #540
- Rework logging and fix overwritten config by @MaxDall in #553
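The pattern behind exposed loggers can be sketched with Python's standard logging module: the library logs through a named logger and leaves configuration to the user, so the user's settings are not overwritten. The logger name `example_crawler` and the function `crawl_step` are assumptions made for this example, not names from the library.

```python
import io
import logging

# Library side: obtain a named logger and do not configure the root logger.
logger = logging.getLogger("example_crawler")  # hypothetical logger name


def crawl_step(url: str) -> None:
    logger.debug("requesting %s", url)
    logger.warning("skipping %s: request timed out", url)


# User side: attach a handler and choose the level for this logger only.
stream = io.StringIO()  # in-memory target so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep records away from the root logger's handlers

crawl_step("https://example.com/a")
captured = stream.getvalue()
```

With the level set to `WARNING`, `captured` contains the warning record but not the debug record.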
Redesigned PublisherCollection class
- Publisher Collection Rework by @addie9800 in #526
Redesigned Article class
Publishers
New Publishers
- Adds new publisher Heise by @addie9800 in #426
- added golem as publisher by @Feyrbrand in #484
Fixes
- Fix author parsing for BSZ by @MaxDall in #518
- Update TechCrunch by @MaxDall in #522
- Remove unreachable source for FreeBeacon by @MaxDall in #521
- Add sitemap filter to BusinessInsiderDE by @MaxDall in #520
- Fix sitemap_filter for FreeBeacon by @MaxDall in #527
- Mark Occupy Democrats as deprecated by @addie9800 in #543
- Fix The Mirror by @addie9800 in #547
- Fix Heise by @addie9800 in #545
- Update EveningStandard parser by @MaxDall in #549
- Fix Freie Presse by @addie9800 in #554
- Fix haberturk selectors by @MaxDall in #551
- Fix Funke topics by @addie9800 in #555
Misc
- Add timeout to publisher_coverage.py by @MaxDall in #508
- Remove _parser from file names by @addie9800 in #516
- Catch errors in coverage only if no complete articles were received by @MaxDall in #515
- Remove previous file when using -o option in test case script by @MaxDall in #517
- Set PYTHONPATH to the Root of the Repository for the Publisher Coverage Actions by @dobbersc in #519
- Refactor metadata parsing to include multiple values using the same key by @MaxDall in #523
- Deprecated Flag for Uncrawlable Publishers by @addie9800 in #534
- Show details about incomplete articles in Publisher Coverage by @addie9800 in #531
- Use timeout parameter in coverage script instead of wrapper by @MaxDall in #548
Bug Fixes
- Update LD Selector by @addie9800 in #514
- Documentation Fix Requires by @addie9800 in #535
- Fix an error message related to summary parsing by @MaxDall in #552
New Contributors
- @Feyrbrand made their first contribution in #484
- @olaughter made their first contribution in #536
Full Changelog: v0.3.1...v0.4.0