
Is there a way to see the dataset size before starting the download #129

Open
XapaJIaMnu opened this issue Oct 5, 2022 · 5 comments
@XapaJIaMnu
Contributor

Hi,

Is there a way to get information about the dataset size (number of sentences, etc.) before downloading it? Is this available through the API somehow? Is this what `cols` is? https://github.com/thammegowda/mtdata/blob/master/mtdata/entry.py#L101

Thanks,
Nick

@thammegowda
Owner

Hi @XapaJIaMnu
Thank you for the question!
We have NOT collected stats for datasets (stats were unavailable or unreliable for most datasets). The `cols` field is not related to stats; it is for parsing TSV/CSV files.

BTW, you may be aware there is a command to report stats after downloading/caching a dataset:

mtdata/mtdata/main.py, lines 200 to 201 at commit b1c0b21:

    stats_p = sub_ps.add_parser('stats', formatter_class=MyFormatter)
    stats_p.add_argument('did', nargs='+', type=DatasetId.parse, help="Show stats of dataset IDs")

I admit I need to improve this feature in future versions with better caching.
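
One possible shape for that caching (a sketch only; the cache path and names here are hypothetical, not mtdata's actual layout):

    import json
    from pathlib import Path

    STATS_CACHE = Path.home() / '.mtdata' / 'stats_cache.json'  # hypothetical location

    def cached_stats(did: str, compute_stats) -> dict:
        """Return stats for a dataset ID, computing and caching them on first use."""
        cache = json.loads(STATS_CACHE.read_text()) if STATS_CACHE.exists() else {}
        if did not in cache:
            cache[did] = compute_stats(did)  # expensive: may need to download the dataset
            STATS_CACHE.parent.mkdir(parents=True, exist_ok=True)
            STATS_CACHE.write_text(json.dumps(cache, indent=2))
        return cache[did]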

@thammegowda
Owner

Here is an example:

    $ mtdata stats Statmt-europarl-10-deu-eng Statmt-newstest_ende-2020-eng-deu 2> /dev/null
    {'id': 'Statmt-europarl-10-deu-eng', 'segs': 1817758, 'segs_err': 10763, 'segs_noise': 0, 'deu_toks': 42413399, 'eng_toks': 45510191, 'deu_chars': 305112155, 'eng_chars': 305112155}
    {'id': 'Statmt-newstest_ende-2020-eng-deu', 'segs': 1418, 'segs_err': 0, 'segs_noise': 0, 'deu_toks': 45855, 'eng_toks': 44018, 'deu_chars': 323536, 'eng_chars': 323536}

@XapaJIaMnu
Contributor Author

Thanks for your reply!

Would it be possible to include an approximate size in MB/GB, to give users some idea of the size prior to download?

@kpu
Collaborator

kpu commented Oct 8, 2022

There are two ways to do this: a HEAD request to the URLs on the fly, or a cached set of statistics about each corpus, including number of segments etc. To have the cached version without downloading, what you are effectively asking for is a continuous release system that downloads the data and then puts metadata in the release. This continuous release system could also cache things and be branded OPUS...
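
A minimal sketch of the HEAD-request option (using the requests library; remote_size is a hypothetical helper, not part of mtdata):

    import requests

    def remote_size(url: str):
        """Ask the server for the download size without fetching the body.
        Returns Content-Length in bytes, or None if the server omits it."""
        resp = requests.head(url, allow_redirects=True, timeout=10)
        size = resp.headers.get('Content-Length')
        return int(size) if size is not None else None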

@thammegowda
Owner

I like the HEAD request approach, and we can easily do that. One edge case I am concerned about: we have indexed many zip/tarball files that get mapped to multiple datasets. I could show the overall tarball file size; though inaccurate, it would be a good start.

Also, as shown in my previous comment, mtdata stats outputs character counts. I will revise it to output byte counts, and also a human-readable size (kB, MB, GB, etc.). This will be more accurate but costly (we have to download the dataset once).
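
For the human-readable part, a small helper like this would do (hypothetical name, decimal units):

    def human_size(n_bytes: int) -> str:
        """Format a byte count as B/kB/MB/GB..., e.g. 305112155 -> '305.1 MB'."""
        size = float(n_bytes)
        for unit in ('B', 'kB', 'MB', 'GB', 'TB'):
            if size < 1000:
                return f'{size:.1f} {unit}'
            size /= 1000
        return f'{size:.1f} PB'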

Also, +1 for caching these stats and distributing them as part of a release. OPUS is the largest source, and fortunately precomputed stats are available from its API. For the remaining datasets, I will have to run a job on one of our servers and collect stats overnight. For any new additions, we can rerun a script to update the cached stats and make them available in the next release.


So, to summarize, the action items (for me, for the next release):

  • Show byte counts and total size in the output of mtdata stats <DataID>
    • Cache the stats once we compute them
  • Add an mtdata stats --quick <DataID> option that performs a HEAD request and shows the Content-Length header
  • Preserve stats from OPUS -- put them in the cache (a refresh sketch follows this list). From the current index loader:
    data_file = Path(__file__).parent / 'opus_index.tsv'
    """ To refresh the data_file from OPUS:
    $ curl "https://opus.nlpl.eu/opusapi/?preprocessing=moses" > opus_all.json
    $ cat opus_all.json | jq -r '.corpora[] | [.corpus, .version, .source, .target] | @tsv' | sort > opus_all.tsv
  • A script to update the stats that are missing in the cache
    • Run it on a server with sufficient delays between requests
  • Modify the release process to include the stats cache and automatically update stats for newly added datasets
  • (Optional/nice to have) Show stats in search and visualizations: http://gowda.ai/mtdata/search
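
For the OPUS item above, here is a rough Python equivalent of the curl + jq refresh (a sketch; it keeps only the four fields the jq filter selects and assumes nothing else about the API response):

    import csv
    import json
    from urllib.request import urlopen

    # Same endpoint as the curl command in the snippet above
    URL = 'https://opus.nlpl.eu/opusapi/?preprocessing=moses'

    with urlopen(URL) as resp:
        corpora = json.load(resp)['corpora']

    # Mirror the jq filter [.corpus, .version, .source, .target] | @tsv, then sort
    fields = ('corpus', 'version', 'source', 'target')
    rows = sorted(tuple(str(c.get(f) or '') for f in fields) for c in corpora)
    with open('opus_all.tsv', 'w', newline='') as out:
        csv.writer(out, delimiter='\t').writerows(rows)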
