Is there a way to see the dataset size before starting the download? #129
Comments
Hi @XapaJIaMnu, BTW, you may be aware there is a command to report stats after downloading/caching a dataset (see lines 200 to 201 at commit b1c0b21).
I admit I need to improve this feature in future versions with better caching.
Here is an example:
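In the same spirit, a minimal standalone Python sketch of the kind of segment/token counts such a post-download report covers; the file paths and helper name here are hypothetical, and this is not mtdata's own code:

```python
from pathlib import Path


def corpus_stats(src_path: str, tgt_path: str) -> dict:
    """Basic stats over a downloaded parallel corpus stored as one segment per line."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    return {
        "segments": len(src_lines),
        "src_tokens": sum(len(line.split()) for line in src_lines),
        "tgt_tokens": sum(len(line.split()) for line in tgt_lines),
        "parallel": len(src_lines) == len(tgt_lines),
    }


if __name__ == "__main__":
    # Hypothetical paths to files that have already been downloaded/cached
    print(corpus_stats("train.en", "train.de"))
```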
Thanks for your reply! Would it be possible to include the approximate size in MB/GB, to give users some idea of the size prior to download?
There are two ways to do this: a HEAD request to the URLs on the fly, or a cached set of statistics about each corpus, including the number of segments etc. To have the cached version without downloading, what you are effectively asking for is a continuous release system that downloads the data and then puts the metadata in the release. This continuous release system could also cache things and be branded OPUS...
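For the first option, a minimal sketch of the on-the-fly HEAD request; the URL below is a placeholder, and this only illustrates the idea rather than how mtdata would necessarily implement it:

```python
from typing import Optional

import requests


def remote_size_bytes(url: str) -> Optional[int]:
    """Ask the server for Content-Length without downloading the file."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    resp.raise_for_status()
    size = resp.headers.get("Content-Length")
    return int(size) if size is not None else None


if __name__ == "__main__":
    # Placeholder URL for illustration only
    url = "https://example.org/corpora/en-de.tsv.gz"
    size = remote_size_bytes(url)
    print("unknown size" if size is None else f"{size / 2**20:.1f} MiB")
```

Note that some servers do not advertise Content-Length (e.g. for dynamically generated or chunked responses), which is one reason the cached-statistics route is attractive.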
I like these ideas. Also, as shown in my previous comment, there is already a command that reports these stats after downloading. +1 for caching the stats and distributing them as part of a release. OPUS is the largest source, and fortunately precomputed stats are available from their API. For the remaining datasets, I will have to run a job on one of our servers and collect stats overnight. For any new additions, we can rerun a script to update the cached stats and make them available in the next release. So, to summarize, the action items for me for the next release are to collect these stats (from the OPUS API where available, via an overnight run on our servers for the rest) and ship them with the release.
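Since OPUS exposes precomputed statistics through its API, a sketch of pulling them could look like the following; the endpoint, query parameters, and response fields reflect my understanding of the public OPUS API and should be double-checked, and none of this is mtdata code:

```python
import requests

# Assumed public OPUS API endpoint; verify against https://opus.nlpl.eu/opusapi/
OPUS_API = "https://opus.nlpl.eu/opusapi/"


def opus_corpus_records(source: str, target: str) -> list:
    """Fetch precomputed corpus records (sizes, alignment counts, ...) for a language pair."""
    params = {
        "source": source,
        "target": target,
        "preprocessing": "moses",
        "version": "latest",
    }
    resp = requests.get(OPUS_API, params=params, timeout=30)
    resp.raise_for_status()
    # The response is JSON; the list of corpora is assumed to sit under the "corpora" key.
    return resp.json().get("corpora", [])


if __name__ == "__main__":
    for record in opus_corpus_records("en", "de"):
        # Field names vary by API version, so just dump each record as-is.
        print(record)
```

Records like these could then be bundled into the cached stats shipped with a release, as discussed above.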
Hi,
Is there a way to get information about the dataset size (num sentences, etc.) before downloading it? Is this available through the API somehow? Is this what cols is (https://github.com/thammegowda/mtdata/blob/master/mtdata/entry.py#L101)?
Thanks,
Nick