Download only split data #5243
Comments
Hi @capsabogdan! Unfortunately, this is hard to implement because quite often a dataset's data is hosted in a single archive for all splits :( So we have to download the whole archive to split it into splits. This is the case for CommonVoice too.
However, for cases when data is distributed in separate archives for different splits, I suppose it can (and will) be implemented someday.
Btw, for a quick check of the dataset you can use streaming:
cv = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
cv = iter(cv)
print(next(cv))
>> {'client_id': 'a07b17f8234ded5e847443ea6f423cef745cbbc7537fb637d58326000aa751e829a21c4fd0a35fc17fb833aa7e95ebafce5efd19beeb8d843887b85e4eb35f5b',
>> 'path': None,
>> 'audio': {'path': 'cv-corpus-11.0-2022-09-21/en/clips/common_voice_en_100363.mp3',
>> 'array': array([ 0.0000000e+00, 1.1748125e-14, 1.5450088e-14, ...,
>> 1.3011958e-06, -6.3548953e-08, -9.9098514e-08], dtype=float32),
>> ...}
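For grabbing only a portion of a split via streaming, a minimal sketch (it uses IterableDataset.take, which is available in recent datasets versions; the count of 1000 is an arbitrary illustrative choice):

from datasets import load_dataset

# Stream the split and keep only the first N examples;
# only those examples are actually fetched.
cv = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en",
    split="test", streaming=True,
)
subset = cv.take(1000)  # returns a new streaming dataset limited to 1000 examples
for example in subset:
    ...  # process each example as it streams in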
Thank you for the answer, but I am not sure this will be helpful, as we may need only about 10% of the dataset for some experiment.
Can we get just a portion of the dataset with streaming?
Is there really no solution? :(
Maybe it would be nice if you could do some sort of sharding before loading the dataset, so users can download just chunks of data :)
I think this would be very helpful.
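For reference, datasets already supports split slicing at load time, and Dataset.shard can split a loaded dataset into chunks. A minimal sketch (note the caveat: for single-archive datasets the full archive is still downloaded first; slicing and sharding only limit what ends up in the returned Dataset):

from datasets import load_dataset

# Keep only the first 10% of the test split via the slicing syntax.
cv_small = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en",
    split="test[:10%]",
)

# Alternatively, split an already-loaded dataset into 10 equal shards
# and keep the first one.
cv_shard = cv_small.shard(num_shards=10, index=0)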
+1 on this feature request - I am running into the same problem, where I only need the test set for a dataset that has a huge training set.
Hey, I'm also interested in this as a feature. I'm having the same problem with Common Voice 13.0: the dataset is super big, but I only want the test data to benchmark multilingual models, and I don't have the terabytes of storage for the whole dataset...
Consider this approach: download and save individual audio files by streaming each split, then compile a CSV file that contains the file names and the corresponding text.

import os

import pandas as pd
import soundfile
from datasets import load_dataset

# Stream the split so nothing is downloaded up front
dataset = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)

download_path = os.path.join(os.getcwd(), "librispeech", "clips")
csv_name = os.path.join(os.getcwd(), "librispeech", "clean_train_100.csv")
os.makedirs(download_path, exist_ok=True)

rows = []
for i, row in enumerate(dataset):
    print(i)
    # Write the decoded audio array to a local file
    path = os.path.join(download_path, row["audio"]["path"])
    soundfile.write(path, row["audio"]["array"], row["audio"]["sampling_rate"])
    # Keep the remaining metadata (file name, transcript, etc.) for the CSV
    del row["audio"]
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv(csv_name, index=False, header=True)
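To load the saved files back as an audio dataset later, one option is a sketch like the following (it uses the real datasets CSV loader and Audio feature; the "file" column name is an assumption about what the streamed rows contained, so adjust it to match your CSV):

import os

from datasets import Audio, load_dataset

# Load the CSV, rebuild absolute paths, and decode them as audio.
ds = load_dataset("csv", data_files=csv_name, split="train")
ds = ds.map(lambda r: {"audio": os.path.join(download_path, r["file"])})
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))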
Faced this issue as well, so I wrote a short script that pulls a hub dataset, creates a small sample of it, and pushes the sample data to the hub as a new dataset.

import os

import datasets


def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a directory to cache the downloaded dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Build a name for the sampled dataset
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{dataset_name}-sample-{sample_count}"

    # Load the full dataset
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample `sample_count` rows from the train and test splits (modify for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push both samples to the hub
    train_sample.push_to_hub(dataset_name_sample, split="train")
    print("INFO: Train split pushed to the hub successfully")
    test_sample.push_to_hub(dataset_name_sample, split="test")
    print("INFO: Test split pushed to the hub successfully")

Once sampled and pushed, you have a smaller version of your dataset on the hub to pull from. The full gist is here.
Feature request
Is it possible to download only the data that I am requesting and not the entire dataset? I run out of disk space, as it seems to download the entire dataset instead of only the part needed.
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test",
cache_dir="cache/path...",
use_auth_token=True,
download_config=DownloadConfig(delete_extracted='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')
)
Motivation
Efficiency improvement.
Your contribution
n/a