diff --git a/README.md b/README.md
index a092f0f74..9018bdd29 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Developed at
-[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md)
+[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md) | [Paper](https://arxiv.org/abs/2403.15279)
@@ -143,6 +143,35 @@ You can find the publishers currently supported [**here**](/docs/supported_publi
 Also: **Adding a new publisher is easy - consider contributing to the project!**
+## Evaluation benchmark
+
+Check out our evaluation [benchmark](https://github.com/dobbersc/fundus-evaluation).
+
+| **Scraper** | **Precision** | **Recall** | **F1-Score** |
+|-------------|---------------------------|---------------------------|---------------------------|
+| [Fundus](https://github.com/flairNLP/fundus) | **99.89**±0.57 | 96.75±12.75 | **97.69**±9.75 |
+| [Trafilatura](https://github.com/adbar/trafilatura) | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 |
+| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py) | 81.09±19.41 | **98.23**±8.61 | 87.14±15.48 |
+| [jusText](https://github.com/miso-belica/jusText) | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 |
+| [news-please](https://github.com/fhamborg/news-please) | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 |
+| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet) | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 |
+| [Boilerpipe](https://github.com/kohlschutter/boilerpipe) | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 |
+
+## Cite
+
+Please cite the following [paper](https://arxiv.org/abs/2403.15279) when using Fundus or building upon our work:
+
+```bibtex
+@misc{dallabetta2024fundus,
+  title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions},
+  author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
+  year={2024},
+  eprint={2403.15279},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```
+
 ## Contact
 Please email your questions or comments to [**Max Dallabetta**](mailto:max.dallabetta@googlemail.com?subject=[GitHub]%20Fundus)