This repository contains the code for creating a parallel corpus from the website of the Indian Prime Minister (www.pmindia.gov.in). It covers crawling, document and sentence alignment, and language-code-based filtering.
The latest releases of the corpus can be found at http://data.statmt.org/pmindia
The following dependencies are required:
- Snakemake: a Python-based workflow management system, similar to make.
- Beautiful Soup: an HTML scraping toolkit.
- Alcazar: for extracting text from HTML.
- pycld2: for language detection.
- hunalign: a heuristic sentence aligner.
- vecalign: a recent sentence aligner based on sentence embeddings (optional).
- The Pavlick Dictionaries: crowd-sourced dictionaries available in many languages.
- Moses: we use the Moses sentence splitter.
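As a quick sanity check before running the pipeline, the Python-side dependencies can be probed with a short script. This is only a sketch: the import names used here (`snakemake`, `bs4`, `alcazar`, `pycld2`) are assumptions about how the packages are installed, and `missing_modules` is a hypothetical helper, not part of this repository. hunalign, vecalign, the Pavlick dictionaries and Moses are external tools or data and are not covered by this check.

```python
import importlib.util

# Assumed import names for the Python dependencies listed above.
PYTHON_DEPENDENCIES = ["snakemake", "bs4", "alcazar", "pycld2"]


def missing_modules(names):
    """Return the module names that cannot be located on the Python path."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat any lookup failure as "not installed".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(PYTHON_DEPENDENCIES)
    if absent:
        print("Missing Python modules:", ", ".join(absent))
    else:
        print("All Python dependencies were found.")
```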
To run the crawling and alignment, use snakemake with one of the targets listed at the top of the Snakefile. Assuming the configuration variables are set correctly and the dependencies are installed, you can crawl with:
snakemake crawl_all
To run the full pipeline, including all alignment:
snakemake release_all
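Before launching a long crawl, it can help to preview what Snakemake would do. The sketch below wraps the two commands above in Python. `snakemake_command` and `run_target` are hypothetical helper names, not part of this repository. It relies on Snakemake's `--dry-run` and `--cores` options; running on a single core is an assumption and may need adjusting for your setup.

```python
import shutil
import subprocess


def snakemake_command(target, cores=1, dry_run=False):
    """Build the argument list for running one Snakefile target."""
    cmd = ["snakemake", "--cores", str(cores)]
    if dry_run:
        # Only list the jobs that would run; nothing is executed.
        cmd.append("--dry-run")
    cmd.append(target)
    return cmd


def run_target(target, cores=1, dry_run=False):
    """Run a target from the directory containing the Snakefile."""
    if shutil.which("snakemake") is None:
        raise RuntimeError("snakemake was not found on PATH")
    completed = subprocess.run(
        snakemake_command(target, cores, dry_run), check=True
    )
    return completed.returncode
```

For example, `run_target("crawl_all", dry_run=True)` would preview the crawl, and `run_target("release_all")` would then run the full pipeline.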
If you use the code or corpus, then please cite:
@ARTICLE{2020arXiv200109907H,
author = {{Haddow}, Barry and {Kirefu}, Faheem},
title = "{PMIndia -- A Collection of Parallel Corpora of Languages of India}",
journal = {arXiv e-prints},
keywords = {Computer Science - Computation and Language},
year = "2020",
month = "Jan",
eid = {arXiv:2001.09907},
pages = {arXiv:2001.09907},
archivePrefix = {arXiv},
eprint = {2001.09907},
}