Anuvaad-zee-30042021-eng-ben ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben: en-bn/*.en matched []; expected one file #108

Open
XapaJIaMnu opened this issue Feb 28, 2022 · 1 comment


@XapaJIaMnu
Contributor

XapaJIaMnu commented Feb 28, 2022

mtdata get -l bn-en -tr Anuvaad-zee-30042021-eng-ben -o Anuvaad-zee-30042021-eng-ben --compress
2022-02-28 14:43:08 entry.lang_pair:24 INFO:: Suggestion: Use codes ben-eng instead of bn-en. Let's make a little space for all languages of our planet 😢.
2022-02-28 14:43:08 main.get_data:32 WARNING:: Args are ignored: {'verbose': False, 'reindex': False, 'task': 'get'}
2022-02-28 14:43:08 __init__.get_instance:48 INFO:: Loading index from cache /home/nikolay/.mtdata/mtdata.index.0.3.3.pkl
2022-02-28 14:43:10 cache.__post_init__:34 INFO:: Local cache is at /home/nikolay/.mtdata
2022-02-28 14:43:10 data.add_parts:280 ERROR:: Unable to add Anuvaad-zee-30042021-eng-ben:  en-bn/*.en matched []; expected one file
2022-02-28 14:43:10 data.add_parts:283 WARNING::  en-bn/*.en matched []; expected one file

This seems to be an issue for a few of the Anuvaad* datasets. Also confirmed for Anuvaad-toi-20210320-eng-ben, Anuvaad-anuvaad_general-corpus-eng-ben, mtdata_Anuvaad-prothomalo_2014-2020-eng-ben, and Anuvaad-ik_2021-v1-eng-ben.
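
One way to see why a pattern such as `en-bn/*.en` matches nothing is to list the archive members and test them against the glob directly. The following is a minimal sketch using only the Python standard library. It is not mtdata code: `matching_members` is a hypothetical helper, and the in-memory archive and its member names are made up for the demo, since the real layout of the Anuvaad zips is not shown in this thread.

```python
import fnmatch
import io
import zipfile


def matching_members(archive, pattern):
    """Return the members of a zip archive whose path matches a glob pattern.

    `archive` may be a path or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(archive) as zf:
        # fnmatchcase avoids platform-dependent case folding
        return [name for name in zf.namelist()
                if fnmatch.fnmatchcase(name, pattern)]


# Demo on an in-memory archive; the member names are invented for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('en-bn/train.en', 'hello\n')
    zf.writestr('en-bn/train.bn', 'placeholder\n')

print(matching_members(buf, 'en-bn/*.en'))   # ['en-bn/train.en']
print(matching_members(buf, 'en-bn/*.txt'))  # []
```

Running the same helper on a downloaded archive with an empty pattern result, and then with `'*'`, would show which directory and extension names the archive actually uses.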

@thammegowda
Owner

Thanks for reporting. The Anuvaad corpus has inconsistent formats and IDs.
I notified them: project-anuvaad/anuvaad-parallel-corpus#1
but I got no reply.

I ended up adding them on a best-effort basis while trying to fix the inconsistencies, so a few dataset IDs are failing.

Here is the relevant code:

assert url.startswith('http') and url.endswith('.zip')
file_name = url.split('/')[-1]
file_name = file_name[:-4]  # .zip
char_count = coll.Counter(list(file_name))
n_hyps = char_count.get('-', 0)
n_unders = char_count.get('_', 0)
if n_hyps > n_unders:
    parts = file_name.split('-')
else:
    assert '_' in file_name
    parts = file_name.split('_')
name, version = '?', '?'
l1, l2 = 'en', '?'
if parts[-2] == l1 and parts[-1] in langs:
    l2 = parts[-1]
    version = parts[-3]
elif parts[-3] == l1 and parts[-2] in langs:
    l2 = parts[-2]
    version = parts[-1]
else:
    log.warn(f"Unable to parse {file_name} :: {parts}")
    continue
name = '_'.join(parts[:-3])
name = name.replace('-', '_')
f1 = f'{l1}-{l2}/*.{l1}'
f2 = f'{l1}-{l2}/*.{l2}'
if name == 'wikipedia':
    f1 = f'{l1}-{l2}/{l1}.txt'
    f2 = f'{l1}-{l2}/{l2}.txt'
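
To try out how a given archive name would be split, the logic above can be wrapped in a standalone function. This is only a sketch under assumptions, not the project's code: `parse_anuvaad_name`, the `langs` set passed in, and the sample file names are hypothetical. It returns `None` where the original logs a warning and hits `continue`, and it omits the `wikipedia` special case.

```python
from collections import Counter


def parse_anuvaad_name(file_name, langs):
    """Split an archive name (without '.zip') into name, version, languages, globs.

    Hypothetical helper that follows the parsing steps quoted above.
    """
    counts = Counter(file_name)
    # Split on whichever separator occurs more often; ties fall back to '_'
    sep = '-' if counts.get('-', 0) > counts.get('_', 0) else '_'
    parts = file_name.split(sep)
    if len(parts) < 3:
        return None
    l1 = 'en'
    if parts[-2] == l1 and parts[-1] in langs:
        l2, version = parts[-1], parts[-3]
    elif parts[-3] == l1 and parts[-2] in langs:
        l2, version = parts[-2], parts[-1]
    else:
        return None  # the original logs a warning and skips the entry
    name = '_'.join(parts[:-3]).replace('-', '_')
    return name, version, l1, l2, f'{l1}-{l2}/*.{l1}', f'{l1}-{l2}/*.{l2}'


# Made-up file names showing the two layouts the code distinguishes
print(parse_anuvaad_name('zee_30042021_en_bn', {'bn'}))
print(parse_anuvaad_name('toi_en_bn_20210320', {'bn'}))
```

Both layouts produce the globs `en-bn/*.en` and `en-bn/*.bn`, so the "matched []" error would follow whenever an archive's internal directory or file extensions differ from that guess.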

If you find a simple fix, please send a pull request.
