-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(datasets): Add option to async load and save in PartitionedDatasets #696
base: main
Are you sure you want to change the base?
feat(datasets): Add option to async load and save in PartitionedDatasets #696
Conversation
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review. |
I updated the description. Please let me know if it needs any refactoring. |
kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py
Outdated
Show resolved
Hide resolved
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Signed-off-by: puneeter <[email protected]>
Would need team's help to point to the right documentation to be changed because of this change. Maybe: |
Hey @puneeter, sorry for the long delay. Indeed, In the end, is the usage similar to what I wrote here #696 (comment) or is it different? Aside from that, I'll leave one more comment |
def save(self, data: dict[str, Any]) -> None: | ||
if self._use_async: | ||
asyncio.run(self._async_save(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, asyncio.run
creates a new event loop, so if there's already an event loop running (for example, in a Jupyter notebook), calling this will raise an error.
This is essentially the red/blue function problem... Most of Kedro is synchronous anyway AFAIK, but I think this might set an API expectation that could be difficult to satisfy cleanly.
@merelcht @ElenaKhaustova do you have more thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, asyncio.run creates a new event loop, so if there's already an event loop running (for example, in a Jupyter notebook), calling this will raise an error.
That is indeed correct. Maybe it's alright though to say async saving doesn't work in interactive envs?
but I think this might set an API expectation that could be difficult to satisfy cleanly
@astrojuanlu can you elaborate what you mean with this?
@puneeter I see all the tests have been modified to take the |
Description
PartitionedDataset
asynchronously for partitions provided.use_async
argument.Development notes
use_async
argument toPartitionedDataset
constructor is used to control the async load/save.argument
,_save
and_load
methods call different private functions.PartitionedDataset
by parameterizing value foruse_async
using@pytest.mark.parametrize("use_async", [True, False])
Checklist
RELEASE.md
file