robflynnyh/opensubtitles_parser

opensubtitles_parser

Code to download and parse the OpenSubtitles corpus, specifically for MTCue (ACL 2023).

Installation

A few Python packages are required to run the parser. Install them with pip, preferably inside a conda environment:

pip install pycld3 mosestokenizer tqdm
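
To confirm that the installation succeeded, a quick import check can help. The sketch below assumes the three packages expose the import names `cld3` (for `pycld3`), `mosestokenizer`, and `tqdm`. The helper name `missing_modules` is hypothetical and not part of this repository.

```python
import importlib.util

# Import names assumed for the three pip packages (pycld3 is imported as "cld3").
REQUIRED_MODULES = ["cld3", "mosestokenizer", "tqdm"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If any module is reported missing, re-run the pip command above inside the active environment.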

To download the OpenSubtitles XML files, run:

bash src/download_os_xml.sh

By default, this downloads the files needed for four language pairs: English to and from Polish, German, French and Russian. To skip a language, comment out its lines in the script.

To extract context files, you must obtain an API key from OMDb by subscribing to its Patreon (Basic tier or higher). It costs only $1 and grants access to the API.
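
Before running the full extraction, it may be worth checking that the key works. The following is a minimal sketch of an OMDb lookup by IMDb ID, using the public API's `i` and `apikey` query parameters. How `src/extract_bitext.py` actually queries OMDb is not shown here, so the function names `build_omdb_url` and `fetch_metadata` and the example ID are assumptions for illustration only.

```python
import json
import urllib.parse
import urllib.request

OMDB_ENDPOINT = "https://www.omdbapi.com/"


def build_omdb_url(imdb_id, api_key):
    """Build an OMDb lookup URL for an IMDb ID such as 'tt0111161'."""
    query = urllib.parse.urlencode({"i": imdb_id, "apikey": api_key})
    return f"{OMDB_ENDPOINT}?{query}"


def fetch_metadata(imdb_id, api_key, timeout=10):
    """Fetch metadata for one title; return None when OMDb reports no match."""
    url = build_omdb_url(imdb_id, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    # OMDb signals success with the string "True" in the "Response" field.
    return payload if payload.get("Response") == "True" else None
```

Calling `fetch_metadata("tt0111161", "<your key>")` should return a dictionary with fields such as `Title` and `Year` when the key is valid. An invalid key typically results in an HTTP error, which `urlopen` raises as an exception.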

Once the files are downloaded, run:

python src/extract_bitext.py --language [de/fr/pl/ru] --split_set [train/dev/test] --apikey [OMDb API Key]

The extracted files are saved under data/en-[de/fr/pl/ru], and the context files under data/en-[de/fr/pl/ru]/context.
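
To process every language and split in one go, the command above can be wrapped in a small driver. This sketch assumes the script takes one language and one split per invocation, as the bracketed options suggest, and that it is run from the repository root. The names `build_commands` and `run_all` are hypothetical helpers, not part of the repository.

```python
import itertools
import subprocess
import sys

LANGUAGES = ["de", "fr", "pl", "ru"]
SPLITS = ["train", "dev", "test"]


def build_commands(api_key, languages=LANGUAGES, splits=SPLITS):
    """Return one argument list per (language, split) combination."""
    return [
        [sys.executable, "src/extract_bitext.py",
         "--language", lang, "--split_set", split, "--apikey", api_key]
        for lang, split in itertools.product(languages, splits)
    ]


def run_all(api_key):
    """Run the extraction sequentially, stopping at the first failure."""
    for cmd in build_commands(api_key):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with the API key, and `check=True` stops the loop as soon as one run fails, so a partially populated data directory is easier to spot.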
