Add the option of saving in parquet instead of arrow #6903
Comments
On May 17, 2024, Frédéric Branchaud-Charron wrote:
I think Dataset.to_parquet is what you're looking for. Let me know if I'm wrong.
No, it does not save the metadata JSON; we would have to recode all of the metadata JSON loading and saving with custom functions.
save_to_disk and load_from_disk should have a "parquet" option instead of "arrow", since Arrow is never used in production (only Parquet).
Thanks!
On Jun 14, 2024, Quentin Lhoest wrote:
You can use to_parquet and ds.info.write_to_directory() to save the dataset info.
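A minimal sketch of that two-step save, assuming ds is a datasets.Dataset (the "rotten_tomatoes" dataset and the my_dataset/ output directory are only examples, not from the thread):

import os
from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")

# step 1: write the data itself as Parquet
os.makedirs("my_dataset", exist_ok=True)
ds.to_parquet("my_dataset/data.parquet")

# step 2: write the dataset metadata (features, description, ...) as JSON
ds.info.write_to_directory("my_dataset")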
Ok, what about loading? Should we do it in two steps?
On Jun 14, 2024, Quentin Lhoest wrote:
Yes, and there is DatasetInfo.from_directory() to reload the info.
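And a matching sketch of the two-step reload, assuming the my_dataset/ layout written above:

from datasets import Dataset, DatasetInfo

# reload the data from the Parquet file
ds = Dataset.from_parquet("my_dataset/data.parquet")

# reload the metadata written earlier by ds.info.write_to_directory()
info = DatasetInfo.from_directory("my_dataset")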
Isn't it easier to combine both into load_dataset and save_dataset with parquet options?
2) Another question: how can we download a large dataset directly to disk, without loading it all in memory (!)?
On Jun 15, 2024, Quentin Lhoest wrote:
load_dataset doesn't load the dataset in memory: it progressively writes to disk in Arrow format and then memory-maps the Arrow files. This allows loading datasets bigger than memory without filling your RAM.

Sure.
How is the memory map managed? Managed by the OS?
Why the need for save_dataset()?
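The memory map is indeed managed by the OS: pages of the Arrow files are read in and evicted on demand, so RAM usage stays low regardless of dataset size. A small sketch to see the on-disk Arrow files backing a loaded dataset (the dataset name is just an example):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")

# the dataset is backed by memory-mapped Arrow files in the local cache,
# not by data held in RAM
print(ds.cache_files)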
can …
Not at the moment, but you can shard it yourself:

num_shards = 32
for index in range(num_shards):
    shard = ds.shard(index=index, num_shards=num_shards)
    shard.to_parquet(f"data-{index:05d}.parquet")
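A sketch of reading those shards back in one call, assuming the files were written by the loop above (the "parquet" builder and the data_files glob are standard load_dataset usage):

from datasets import load_dataset

# matches data-00000.parquet, data-00001.parquet, ...
ds = load_dataset("parquet", data_files="data-*.parquet", split="train")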
Thanks! That's good!
Feature request
In dataset.save_to_disk('/path/to/save/dataset'), add the option to save in Parquet format:
dataset.save_to_disk('/path/to/save/dataset', format="parquet")
because Arrow is not used for production big data (only Parquet).
Motivation
Because Arrow is not used for production big data (only Parquet).
Your contribution
I can do the testing!