PMIndia Crawler

Overview

This repository contains the code for creating a parallel corpus from the website of the Indian Prime Minister (www.pmindia.gov.in). It covers crawling, document and sentence alignment, and language-code-based filtering.

The latest releases of the corpus can be found at http://data.statmt.org/pmindia

Usage

The following dependencies are required:

Snakemake: A Python-based workflow management system, similar to make.
Beautiful Soup: An HTML scraping toolkit.
Alcazar: For extraction of text from HTML.
pycld2: For language detection.
hunalign: A heuristic sentence aligner.
vecalign: A recent sentence aligner based on sentence embeddings (optional).
The Pavlick Dictionaries: Crowd-sourced dictionaries available in many languages.
Moses: We use the Moses sentence splitter.
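
Before running the pipeline, it can help to check which of these dependencies are visible from the current environment. The following is a minimal sketch and not part of the repository: the import names (snakemake, bs4, alcazar, pycld2) and the executable name hunalign are assumptions about how these packages are typically installed, and the helper function names are hypothetical.

```python
import importlib.util
import shutil

# Assumed import names for the Python dependencies (not taken from the repository).
PYTHON_MODULES = ["snakemake", "bs4", "alcazar", "pycld2"]
# Assumed executable name for the command-line aligner.
EXECUTABLES = ["hunalign"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_executables(names):
    """Return the executables that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def report():
    """Print a short summary of anything that appears to be missing."""
    for name in missing_modules(PYTHON_MODULES):
        print(f"Python module not found: {name}")
    for name in missing_executables(EXECUTABLES):
        print(f"Executable not found on PATH: {name}")
```

Calling report() prints one line per missing item and nothing if everything is found. vecalign, the Pavlick dictionaries and Moses are left out because how they are located depends on the local configuration.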

To run the crawling and alignment, use snakemake with one of the targets listed at the top of the Snakefile. Assuming the configuration variables are set correctly and the dependencies are installed, you can crawl with:

snakemake crawl_all

To run the full pipeline, including all alignment:

snakemake release_all
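
If you prefer to drive these targets from a script, the commands above can be wrapped as shown in this sketch. It is an illustration under stated assumptions, not code from the repository: the helper names are hypothetical, it assumes snakemake is on the PATH and is run from the directory containing the Snakefile, and it passes --cores explicitly because recent Snakemake versions require it.

```python
import subprocess


def snakemake_command(target, cores=1, dry_run=False):
    """Build the argument list for running one Snakefile target."""
    cmd = ["snakemake", "--cores", str(cores)]
    if dry_run:
        # --dry-run lists the planned jobs without executing them.
        cmd.append("--dry-run")
    cmd.append(target)
    return cmd


def run_target(target, cores=1, dry_run=False):
    """Run snakemake in the current directory.

    Raises subprocess.CalledProcessError if snakemake exits with an error.
    """
    subprocess.run(snakemake_command(target, cores, dry_run), check=True)
```

For example, run_target("crawl_all", dry_run=True) should list the crawl jobs without fetching anything, which is a cheap way to check the configuration before starting a full release_all run.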

Reference

If you use the code or corpus, then please cite:

@ARTICLE{2020arXiv200109907H,
       author = {{Haddow}, Barry and {Kirefu}, Faheem},
        title = "{PMIndia -- A Collection of Parallel Corpora of Languages of India}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language},
         year = "2020",
        month = "Jan",
          eid = {arXiv:2001.09907},
        pages = {arXiv:2001.09907},
archivePrefix = {arXiv},
       eprint = {2001.09907},
}

