Releases: bitextor/bitextor-data
Bitextor test data - WARCs v1.1
Collection of WARC files that comply with the WARC-1.0 standard and can be used to run regression tests with Bitextor. This release includes three websites that were crawled between January 25 and 28, 2019. The websites are:
- https://greenpeace.org/canada, which is under Creative Commons Attribution 2.0,
- http://kremlin.ru/, which is under Creative Commons Attribution 4.0,
- https://primeminister.gr/, which is under Creative Commons Attribution-NoDerivatives 4.0.
25/11/2022: Added the documents.tar.gz file, which contains the documents needed to test dir2warc.
Bitextor test data - WARCs v1.0
Collection of XZ-compressed files that can be used to run regression tests with Bitextor (run-tests.sh). The tests can be run on three websites crawled between January 25 and 28, 2019. The three websites are:
- greenpeace.org/canada, which is under Creative Commons Attribution 2.0,
- http://kremlin.ru/, which is under Creative Commons Attribution 4.0,
- https://primeminister.gr/, which is under Creative Commons Attribution-NoDerivatives 4.0.
The kremlin-many-small.tar.xz package is a test that uses the content of kremlin.warc.xz, but each WARC contains only one pair of documents (taken from a Bitextor run on kremlin.warc.xz).
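As a quick sanity check before running the tests, one might confirm that a downloaded archive decompresses and begins with a WARC version line. The following is a minimal sketch that uses only the Python standard library; the helper name first_warc_version is hypothetical and is not part of Bitextor or run-tests.sh.

```python
import lzma


def first_warc_version(path):
    """Return the version line of the first record in an XZ-compressed WARC file.

    Hypothetical helper for illustration only; it is not part of Bitextor.
    """
    # lzma.open decompresses transparently, so only the first line is read.
    with lzma.open(path, "rb") as stream:
        line = stream.readline().strip()
    # Every WARC record starts with a version line such as "WARC/1.0".
    if not line.startswith(b"WARC/"):
        raise ValueError(f"{path} does not start with a WARC version line")
    return line.decode("ascii")
```

Calling first_warc_version("kremlin.warc.xz") would return the version string declared by the first record, or raise ValueError if the decompressed file does not look like a WARC.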
Bitextor dictionaries v1.0
Bitextor document aligner dictionaries: https://github.com/bitextor/bitextor/
- en-ar.dic: generated using OpenSubtitles2018.
- ca-es.dic: generated (mostly) using https://object.pouta.csc.fi/OPUS-DOGC/v2/moses/ca-es.txt.zip.
- en-ru-morpheme.QED.dic and en-ru.QED.dic: generated using the QED corpora.
- hu-en.hunalign.dic: taken from the original Hunalign code.
- kk-ru.dic: generated using all data available in OPUS in 2017.
The remaining dictionaries were trained using JRC-Acquis in 2017.