bhaddow/pmindia-crawler

PMIndia Crawler

Overview

This repository contains the code for creating a parallel corpus from the website of the Indian Prime Minister (www.pmindia.gov.in). It covers crawling, document and sentence alignment, and language-code-based filtering.

The latest releases of the corpus can be found at http://data.statmt.org/pmindia
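
As an illustration of the language-code-based filtering step mentioned above, the following is a minimal sketch and not the repository's actual implementation. It assumes sentence pairs are held as tuples of strings; the function names `detect_language` and `filter_pairs` are hypothetical. The default detector uses `pycld2.detect`, which returns a reliability flag, the number of text bytes found, and a tuple of per-language details.

```python
def detect_language(text):
    """Return the language code pycld2 reports for text, or None if unreliable."""
    import pycld2  # imported lazily so filter_pairs can be used with another detector

    is_reliable, _, details = pycld2.detect(text)
    # details[0] is the top guess: (language name, language code, percent, score)
    return details[0][1] if is_reliable else None


def filter_pairs(pairs, src_lang, tgt_lang, detect=detect_language):
    """Keep only the pairs whose two sides are detected as the expected languages."""
    kept = []
    for src, tgt in pairs:
        if detect(src) == src_lang and detect(tgt) == tgt_lang:
            kept.append((src, tgt))
    return kept
```

Passing the detector as a parameter keeps the filter independent of any one language-identification library, which also makes it easy to test with a stub.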

Usage

The following dependencies are required:

  • Snakemake: a Python-based workflow management system, similar to make.
  • Beautiful Soup: an HTML scraping toolkit.
  • Alcazar: for extracting text from HTML.
  • pycld2: for language detection.
  • hunalign: a heuristic sentence aligner.
  • vecalign: a recent sentence aligner based on sentence embeddings (optional).
  • The Pavlick Dictionaries: crowd-sourced dictionaries available in many languages.
  • Moses: we use the Moses sentence splitter.
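
Before running the pipeline, it can help to check which dependencies are visible from the current environment. The sketch below is one possible check, not part of the repository. The module names (`snakemake`, `bs4`, `alcazar`, `pycld2`) and executable names (`snakemake`, `hunalign`) are assumptions about a typical installation and may differ on your system. Tools that are usually located by path, such as vecalign, Moses, and the dictionaries, are not covered.

```python
import importlib.util
import shutil

# Assumed import and executable names; adjust them to match your installation.
PYTHON_MODULES = ["snakemake", "bs4", "alcazar", "pycld2"]
EXECUTABLES = ["snakemake", "hunalign"]


def find_missing(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing_modules, missing_executables) for the current environment."""
    # find_spec returns None when a top-level module cannot be located.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables
```

Calling `find_missing()` with no arguments checks the assumed defaults. Two empty lists mean everything in those lists was found.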

To run the crawling and alignment, invoke snakemake with one of the targets listed at the top of the Snakefile. Assuming the configuration variables are set correctly and the dependencies are installed, you can crawl with:

snakemake crawl_all

To run the full pipeline, including all alignment:

snakemake release_all
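
If you want to drive these targets from a script, a thin wrapper along the following lines may be useful. It is a sketch under stated assumptions, not something the repository provides: `snakemake_command` and `run_target` are hypothetical helper names, and the wrapper relies only on Snakemake's standard `--cores` and `-n` (dry run) options. A dry run prints the jobs Snakemake would execute without running them, which is a cheap way to check the configuration before starting a long crawl.

```python
import subprocess


def snakemake_command(target, cores=1, dry_run=False):
    """Build the argument list for invoking snakemake on a single target."""
    cmd = ["snakemake", "--cores", str(cores)]
    if dry_run:
        cmd.append("-n")  # show the planned jobs without executing them
    cmd.append(target)
    return cmd


def run_target(target, cores=1, dry_run=False):
    """Run snakemake in the current directory; raises CalledProcessError on failure."""
    return subprocess.run(snakemake_command(target, cores, dry_run), check=True)
```

For example, `run_target("crawl_all", dry_run=True)` previews the crawl, and `run_target("release_all", cores=4)` would run the full pipeline on four cores.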

Reference

If you use the code or corpus, please cite:

@ARTICLE{2020arXiv200109907H,
       author = {{Haddow}, Barry and {Kirefu}, Faheem},
        title = "{PMIndia -- A Collection of Parallel Corpora of Languages of India}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language},
         year = "2020",
        month = "Jan",
          eid = {arXiv:2001.09907},
        pages = {arXiv:2001.09907},
archivePrefix = {arXiv},
       eprint = {2001.09907},
}

About

Code for extracting parallel corpora from pmindia
