Fix export to JSON when dataset larger than batch size #7039

Draft
wants to merge 2 commits into
base: main
from

Conversation

albertvillanova
Member

@albertvillanova albertvillanova commented Jul 11, 2024

Fix export to JSON (lines=False) when dataset larger than batch size.

Fix #7037.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova
Member Author

albertvillanova commented Jul 11, 2024

The test before confirms the bug.

There are different possible solutions to this issue:

  • the easiest would be to write multiple JSON files, one for each batch; this solution can be done in parallel if num_proc is passed
  • alternatively, we could tweak the writing and remove the extra [ and ] characters; this solution will only be valid if orient="records"
  • others?
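A minimal sketch of the second option, assuming `orient="records"`: concatenating per-batch arrays gives `[...][...]`, which is not one valid JSON document, whereas stripping each batch's own `[` and `]` and joining the batches with commas gives a single array. The helper `write_records_in_batches` is hypothetical, and the standard-library `json` module stands in for whatever serializer the library actually uses.

```python
import io
import json


def write_records_in_batches(rows, batch_size, fp):
    """Write `rows` (a list of dicts) to `fp` as ONE JSON array, batch by batch.

    Hypothetical helper for illustration only; not the library's writer.
    """
    fp.write("[")
    first = True
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Serialize the batch as its own array, then drop its outer brackets.
        inner = json.dumps(batch)[1:-1]
        if not first:
            fp.write(",")  # separator between consecutive batches
        fp.write(inner)
        first = False
    fp.write("]")


rows = [{"id": i, "text": f"row {i}"} for i in range(5)]

# Naive approach: one complete array per batch, simply concatenated.
naive = "".join(json.dumps(rows[i:i + 2]) for i in range(0, len(rows), 2))

# Bracket-stripping approach: a single array spanning all batches.
buf = io.StringIO()
write_records_in_batches(rows, 2, buf)
fixed = buf.getvalue()
```

Here `json.loads(fixed)` round-trips to `rows`, while `json.loads(naive)` raises `json.JSONDecodeError` because of the extra data after the first array.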

@varadhbhatnagar
Contributor

@albertvillanova I was planning to take the second approach for orient="records", orient="values" and orient="index". For orient="split", the columns and index can be written in one go and the data can then be streamed batch by batch. For orient="columns", each column can be written in a streaming way. Let me know if I should go ahead with this.
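A rough sketch of the orient="split" idea, assuming the usual split layout of `{"columns": [...], "index": [...], "data": [[...], ...]}` and a default 0..n-1 index: the columns and index are written once, and only the `data` rows are streamed per batch. The helper `write_split_in_batches` is hypothetical and uses the standard-library `json` module rather than the library's own writer.

```python
import io
import json


def write_split_in_batches(columns, rows, batch_size, fp):
    """Write a split-oriented JSON object, streaming `rows` (a list of lists).

    Hypothetical helper; assumes a default integer index 0..len(rows)-1.
    """
    # Columns and index are small relative to the data, so write them in one go.
    fp.write('{"columns":' + json.dumps(columns))
    fp.write(',"index":' + json.dumps(list(range(len(rows)))))
    fp.write(',"data":[')
    first = True
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Drop the batch's outer brackets so all batches share one "data" array.
        inner = json.dumps(batch)[1:-1]
        if not first:
            fp.write(",")
        fp.write(inner)
        first = False
    fp.write("]}")


columns = ["id", "text"]
rows = [[i, f"row {i}"] for i in range(5)]

buf = io.StringIO()
write_split_in_batches(columns, rows, 2, buf)
split_json = buf.getvalue()
```

The result parses as a single object whose `data` key holds all five rows, regardless of the batch size.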


Development

Successfully merging this pull request may close these issues.

A bug of Dataset.to_json() function
3 participants