Add the option of saving in parquet instead of arrow #6903
Comments
On May 17, 2024, Frédéric Branchaud-Charron wrote:
I think Dataset.to_parquet is what you're looking for. Let me know if I'm wrong.
No, it does not save the metadata JSON; we would have to recode all of the metadata JSON loading and saving with custom functions.
save_to_disk and load_from_disk should have a "parquet" option instead of "arrow", since Arrow is never used in production (only Parquet).
Thanks!
On Jun 14, 2024, Quentin Lhoest wrote:
You can use to_parquet and ds.info.write_to_directory() to save the dataset info.
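A minimal sketch of that two-step save, assuming ds is a datasets.Dataset (the "rotten_tomatoes" dataset and the my_dataset/ output directory are only examples, not from the thread):

import os
from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")

# step 1: write the data itself as Parquet
os.makedirs("my_dataset", exist_ok=True)
ds.to_parquet("my_dataset/data.parquet")

# step 2: write the dataset metadata (features, description, ...) as JSON
ds.info.write_to_directory("my_dataset")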
Ok, what about loading? Should we do it in two steps?
On Jun 14, 2024, Quentin Lhoest wrote:
Yes, and there is DatasetInfo.from_directory() to reload the info.
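And a matching sketch of the two-step reload, assuming the my_dataset/ layout written above:

from datasets import Dataset, DatasetInfo

# reload the data from the Parquet file
ds = Dataset.from_parquet("my_dataset/data.parquet")

# reload the metadata written earlier by ds.info.write_to_directory()
info = DatasetInfo.from_directory("my_dataset")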
Isn't it easier to combine both into load_dataset and save_dataset with parquet options?
2) Another question: how can we download a large dataset directly to disk, without loading it all in memory (!)?
On Jun 15, 2024, Quentin Lhoest wrote:
load_dataset doesn't load the dataset in memory: it progressively writes to disk in Arrow format and then memory-maps the Arrow files. This allows loading datasets bigger than memory without filling your RAM.

Sure.
How is the memory map managed? Managed by the OS?
Why the need for save_dataset()?
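The memory map is indeed managed by the OS: pages of the Arrow files are read in and evicted on demand, so RAM usage stays low regardless of dataset size. A small sketch to see the on-disk Arrow files backing a loaded dataset (the dataset name is just an example):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")

# the dataset is backed by memory-mapped Arrow files in the local cache,
# not by data held in RAM
print(ds.cache_files)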
can …
Not at the moment, but you can shard it yourself:

num_shards = 32
for index in range(num_shards):
    shard = ds.shard(index=index, num_shards=num_shards)
    shard.to_parquet(f"data-{index:05d}.parquet")
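A sketch of reading those shards back in one call, assuming the files were written by the loop above (the "parquet" builder and the data_files glob are standard load_dataset usage):

from datasets import load_dataset

# matches data-00000.parquet, data-00001.parquet, ...
ds = load_dataset("parquet", data_files="data-*.parquet", split="train")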
Thanks! That's good!
Feature request
In dataset.save_to_disk('/path/to/save/dataset'), add the option to save in Parquet format:
dataset.save_to_disk('/path/to/save/dataset', format="parquet")
because Arrow is not used for production big data (only Parquet).
Motivation
Because Arrow is not used for production big data (only Parquet).
Your contribution
I can do the testing!