Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Add option to async load and save in PartitionedDatasets #696

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

puneeter
Copy link

@puneeter puneeter commented May 23, 2024

Description

  • This PR provides the user to load and save PartitionedDataset asynchronously for partitions provided.
  • PartitionedDatasets already provide a way to do lazy loading, which solves for memory complexity. With this PR the time complexity is also reduced if the user wants to save/load these partitions in parallel with the help of use_async argument.

Development notes

  • Additional use_async argument to PartitionedDataset constructor is used to control the async load/save.
  • Based on this argument, _save and _load methods call different private functions.
  • Leveraged existing tests for PartitionedDataset by parameterizing value for use_async using @pytest.mark.parametrize("use_async", [True, False])

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@puneeter puneeter changed the title Add async load and save methods to PartitionedDatasets feat(datasets): Add option to async load and save in PartitionedDatasets May 23, 2024
@puneeter puneeter marked this pull request as ready for review May 23, 2024 18:20
Signed-off-by: puneeter <[email protected]>
@merelcht
Copy link
Member

Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

@puneeter
Copy link
Author

Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

I updated the description. Please let me know if it needs any refactoring.

Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
@puneeter puneeter requested review from astrojuanlu and noklam and removed request for astrojuanlu October 18, 2024 11:06
@puneeter
Copy link
Author

puneeter commented Oct 18, 2024

Would need team's help to point to the right documentation to be changed because of this change. Maybe: docs/source/data/partitioned_and_incremental_datasets.md?

@astrojuanlu
Copy link
Member

Hey @puneeter, sorry for the long delay. Indeed, partitioned_and_incremental_datasets.md corresponds to https://docs.kedro.org/en/0.19.10/data/partitioned_and_incremental_datasets.html

In the end, is the usage similar to what I wrote here #696 (comment) or is it different?

Aside from that, I'll leave one more comment

Comment on lines 309 to +311
def save(self, data: dict[str, Any]) -> None:
if self._use_async:
asyncio.run(self._async_save(data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, asyncio.run creates a new event loop, so if there's already an event loop running (for example, in a Jupyter notebook), calling this will raise an error.

This is essentially the red/blue function problem... Most of Kedro is synchronous anyway AFAIK, but I think this might set an API expectation that could be difficult to satisfy cleanly.

@merelcht @ElenaKhaustova do you have more thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, asyncio.run creates a new event loop, so if there's already an event loop running (for example, in a Jupyter notebook), calling this will raise an error.

That is indeed correct. Maybe it's alright though to say async saving doesn't work in interactive envs?

but I think this might set an API expectation that could be difficult to satisfy cleanly

@astrojuanlu can you elaborate what you mean with this?

@merelcht
Copy link
Member

@puneeter I see all the tests have been modified to take the use_async argument, but is there a way to also check that the async functionality is working?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

5 participants