
Download only split data #5243

Open
capsabogdan opened this issue Nov 15, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@capsabogdan

Feature request

Is it possible to download only the data that I am requesting and not the entire dataset? I run out of disk space as it seems to download the entire dataset instead of only the part needed.

common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test",
cache_dir="cache/path...",
use_auth_token=True,
download_config=DownloadConfig(delete_extracted='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')
)

Motivation

efficiency improvement

Your contribution

n/a

capsabogdan added the enhancement (New feature or request) label on Nov 15, 2022
@polinaeterna
Contributor

polinaeterna commented Nov 15, 2022

Hi @capsabogdan! Unfortunately, it's hard to implement because quite often a dataset's data is hosted in a single archive for all splits :( So we have to download the whole archive to split it into splits. This is the case for CommonVoice too.

However, for cases where the data is distributed in separate archives for different splits, I suppose it can (and will) be implemented someday.

Btw, for a quick check of the dataset you can use streaming:

cv = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True)
cv = iter(cv)
print(next(cv))

>> {'client_id': 'a07b17f8234ded5e847443ea6f423cef745cbbc7537fb637d58326000aa751e829a21c4fd0a35fc17fb833aa7e95ebafce5efd19beeb8d843887b85e4eb35f5b',
>>  'path': None,
>>  'audio': {'path': 'cv-corpus-11.0-2022-09-21/en/clips/common_voice_en_100363.mp3',
>>  'array': array([ 0.0000000e+00,  1.1748125e-14,  1.5450088e-14, ...,
>>          1.3011958e-06, -6.3548953e-08, -9.9098514e-08], dtype=float32),
>> ...}
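
If you only need a small slice for local testing, a minimal sketch along the same lines (an assumption on my part, not from the original comment) is to stream just the first few test examples and materialize them in memory with take(); this requires authentication for the gated Common Voice repo:

from datasets import load_dataset

# Stream the test split and keep only the first 16 examples,
# without downloading the full archive
cv_stream = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True
)
small_sample = list(cv_stream.take(16))
print(len(small_sample), small_sample[0]["sentence"])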


@rbawden

rbawden commented Jan 5, 2023

+1 on this feature request - I am running into the same problem, where I only need the test set for a dataset that has a huge training set

@thomas-ferraz

Hey, I'm also interested in this feature. I'm having the same problem with Common Voice 13.0. The dataset is huge, but I only want the test data to benchmark multilingual models, and I don't have terabytes of storage for the whole dataset...

@VladimirVincan

Consider this approach: Download and save individual audio files by streaming each split, then compile a CSV file that contains the file names and corresponding text.

import os

import pandas as pd
import soundfile
from datasets import load_dataset


# Stream the split so only the requested examples are downloaded
dataset = load_dataset("librispeech_asr", 'clean', split="train.100", streaming=True)

download_path = os.path.join(os.getcwd(), 'librispeech', 'clips')
csv_name = os.path.join(os.getcwd(), 'librispeech', 'clean_train_100.csv')
os.makedirs(download_path, exist_ok=True)

rows = []
for i, row in enumerate(dataset):
    print(i)
    # Save the decoded audio to disk under its original clip name
    path = os.path.join(download_path, row['audio']['path'])
    soundfile.write(path, row['audio']['array'], row['audio']['sampling_rate'])

    # Keep only the metadata (file name, text, ...) for the CSV
    del row['audio']
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv(csv_name, index=False, header=True)
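
As a follow-up sketch (not part of the original comment), the exported clips and CSV can be reloaded as a regular datasets Dataset with a decodable audio column; the 'file' and 'text' column names below are assumptions based on librispeech_asr and may need adjusting to whatever your CSV actually contains:

import os

from datasets import Audio, load_dataset

clips_dir = os.path.join(os.getcwd(), 'librispeech', 'clips')
csv_name = os.path.join(os.getcwd(), 'librispeech', 'clean_train_100.csv')

# Load the metadata CSV as a Dataset
ds = load_dataset('csv', data_files=csv_name, split='train')

# Rebuild absolute paths to the saved clips (assumes a 'file' column holding the clip name)
ds = ds.map(lambda ex: {'audio': os.path.join(clips_dir, ex['file'])})

# Cast the path column to Audio so examples are decoded lazily on access
ds = ds.cast_column('audio', Audio(sampling_rate=16_000))

print(ds[0]['audio']['array'].shape, ds[0]['text'])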

@neonwatty

neonwatty commented Feb 25, 2025

Faced this issue as well, so I wrote a short script that pulls a hub dataset, creates a small sample of it, and pushes the sample data to the hub as a new dataset.

import os

import datasets


def create_sample_dataset(full_dataset_name, sample_count=100, username="my-username", cache_dir="./dataset"):
    # Create a directory to save the sampled dataset
    os.makedirs(cache_dir, exist_ok=True)

    # Get the dataset name
    dataset_name = full_dataset_name.split("/")[-1]
    dataset_name_sample = f"{dataset_name}-sample-{sample_count}"

    # Load the dataset
    dataset = datasets.load_dataset(full_dataset_name, cache_dir=cache_dir)

    # Sample `sample_count` rows from the train and test splits (or modify for other splits)
    train_sample = dataset["train"].shuffle(seed=42).select(range(sample_count))
    test_sample = dataset["test"].shuffle(seed=42).select(range(sample_count))

    # Push to hub (pushed under the namespace of the authenticated account)
    train_sample.push_to_hub(dataset_name_sample, split="train")
    print("INFO: Train split pushed to the hub successfully")

    test_sample.push_to_hub(dataset_name_sample, split="test")
    print("INFO: Test split pushed to the hub successfully")

Once sampled and pushed, you have a smaller version of your dataset on the hub to pull from.

The full gist is here.
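
For illustration, a hedged usage sketch of the helper above; the dataset and user names are hypothetical placeholders, and the sample repo is created under whatever namespace your token authenticates as:

# Hypothetical dataset and user names, for illustration only
create_sample_dataset("some-org/my-big-dataset", sample_count=100, username="my-username")

# Later, pull just the small sample instead of the full dataset
sample = datasets.load_dataset("my-username/my-big-dataset-sample-100", split="test")
print(sample)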
