
Download only split data #5243

Open
capsabogdan opened this issue Nov 15, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@capsabogdan

Feature request

Is it possible to download only the data that I am requesting and not the entire dataset? I run out of disk space as it seems to download the entire dataset instead of only the part needed.

common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test",
cache_dir="cache/path...",
use_auth_token=True,
download_config=DownloadConfig(delete_extracted='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')
)

Motivation

efficiency improvement

Your contribution

n/a

capsabogdan added the enhancement (New feature or request) label on Nov 15, 2022
@polinaeterna
Contributor

polinaeterna commented Nov 15, 2022

Hi @capsabogdan! Unfortunately, it's hard to implement because quite often a dataset's data is hosted in a single archive for all splits :( So we have to download the whole archive to split it into splits. This is the case for CommonVoice too.

However, for cases where the data is distributed in separate archives for different splits, I suppose it can (and will) be implemented someday.

Btw, for a quick check of the dataset you can use streaming:

cv = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
cv = iter(cv)
print(next(cv))

>> {'client_id': 'a07b17f8234ded5e847443ea6f423cef745cbbc7537fb637d58326000aa751e829a21c4fd0a35fc17fb833aa7e95ebafce5efd19beeb8d843887b85e4eb35f5b',
>>  'path': None,
>>  'audio': {'path': 'cv-corpus-11.0-2022-09-21/en/clips/common_voice_en_100363.mp3',
>>  'array': array([ 0.0000000e+00,  1.1748125e-14,  1.5450088e-14, ...,
>>          1.3011958e-06, -6.3548953e-08, -9.9098514e-08], dtype=float32),
>> ...}
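
If you only need a small slice for local testing, a minimal sketch along the same lines (an assumption on my part, not from the original comment) is to stream just the first few test examples and materialize them in memory with take(); this requires authentication for the gated Common Voice repo:

from datasets import load_dataset

# Stream the test split and keep only the first 16 examples,
# without downloading the full archive
cv_stream = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True
)
small_sample = list(cv_stream.take(16))
print(len(small_sample), small_sample[0]["sentence"])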


@rbawden

rbawden commented Jan 5, 2023

+1 on this feature request - I am running into the same problem, where I only need the test set for a dataset that has a huge training set

@thomas-ferraz

Hey, I'm also interested in this feature. I'm having the same problem with Common Voice 13.0. The dataset is huge, but I only want the test data to benchmark multilingual models, and I don't have terabytes of storage for the whole dataset...

@VladimirVincan

Consider this approach: Download and save individual audio files by streaming each split, then compile a CSV file that contains the file names and corresponding text.

import os

import pandas as pd
import soundfile
from datasets import load_dataset


# Stream the split so only the requested examples are downloaded
dataset = load_dataset("librispeech_asr", 'clean', split="train.100", streaming=True)

download_path = os.path.join(os.getcwd(), 'librispeech', 'clips')
csv_name = os.path.join(os.getcwd(), 'librispeech', 'clean_train_100.csv')
os.makedirs(download_path, exist_ok=True)

rows = []
for i, row in enumerate(dataset):
    print(i)
    # Save the decoded audio to disk under its original clip name
    path = os.path.join(download_path, row['audio']['path'])
    soundfile.write(path, row['audio']['array'], row['audio']['sampling_rate'])

    # Keep only the metadata (file name, text, ...) for the CSV
    del row['audio']
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv(csv_name, index=False, header=True)
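
As a follow-up sketch (not part of the original comment), the exported clips and CSV can be reloaded as a regular datasets Dataset with a decodable audio column; the 'file' and 'text' column names below are assumptions based on librispeech_asr and may need adjusting to whatever your CSV actually contains:

import os

from datasets import Audio, load_dataset

clips_dir = os.path.join(os.getcwd(), 'librispeech', 'clips')
csv_name = os.path.join(os.getcwd(), 'librispeech', 'clean_train_100.csv')

# Load the metadata CSV as a Dataset
ds = load_dataset('csv', data_files=csv_name, split='train')

# Rebuild absolute paths to the saved clips (assumes a 'file' column holding the clip name)
ds = ds.map(lambda ex: {'audio': os.path.join(clips_dir, ex['file'])})

# Cast the path column to Audio so examples are decoded lazily on access
ds = ds.cast_column('audio', Audio(sampling_rate=16_000))

print(ds[0]['audio']['array'].shape, ds[0]['text'])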

@neonwatty

neonwatty commented Feb 25, 2025

Faced this issue as well, so I wrote a short script that pulls a hub dataset, creates a small sample of it, and pushes the sample data to the hub as a new dataset.

import os

import datasets


def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a directory to save the sampled dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Get the dataset name
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{dataset_name}-sample-{sample_count}"

    # Load the dataset
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample `sample_count` rows from the train and test splits (or modify for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push to hub (pushed under the namespace of the authenticated account)
    train_sample.push_to_hub(dataset_name_sample, split="train")
    print("INFO: Train split pushed to the hub successfully")

    test_sample.push_to_hub(dataset_name_sample, split="test")
    print("INFO: Test split pushed to the hub successfully")

Once sampled and pushed, you have a smaller version of your dataset on the hub to pull from.

The full gist is here.
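
For illustration, a hedged usage sketch of the helper above; the dataset and user names are hypothetical placeholders, and the sample repo is created under whatever namespace your token authenticates as:

# Hypothetical dataset and user names, for illustration only
create_sample_dataset("some-org/my-big-dataset", sample_count=100, username="my-username")

# Later, pull just the small sample instead of the full dataset
sample = datasets.load_dataset("my-username/my-big-dataset-sample-100", split="test")
print(sample)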
