Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datasets): Add parameter to enable/disable lazy saving for PartitionedDataset #978

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Jan 7, 2025

Description

Fixes #759

Cause of the issue: #759 (comment)

Dataset lazy saving docs: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-lazy-saving

Development notes

  1. Added a parameter (save_lazily) to enable/disable lazy saving for PartitionedDataset. The parameter is enabled by default but users still need to wrap objects with callable as it was required before to apply lazy saving, so the change is not breaking.
  2. We didn't add a similar argument for save function as was suggested initially since it contradicts the definition of AbstractDataset.save
  3. Added callable object saving and loading test
  4. Following docs update PR: Update partitioned dataset lazy saving docs kedro#4402

How to test without TensorFlow

The following code will execute without errors and save two objects if save_lazily is set to False. The code will fail and TestSaveCallable.__calll__ will be called if save_lazily is set to True or not set.

from kedro_datasets.partitions import PartitionedDataset


class TestSaveCallable:
    def __init__(self, a: int):
        print(f"--- Initialized {a} ---")

    def __call__(self, *args, **kwargs):
        print("--- Called ---")


class TestSaveNotCallable:
    def __init__(self, a: int):
        print(f"--- Initialized {a} ---")


def main():
    partitioned_dataset = PartitionedDataset(
        path="data/01_raw/pickle",
        dataset="kedro_datasets.pickle.PickleDataset",
        filename_suffix=".pkl",
        save_lazily=False
    )

    save_dict = {
        "object_1": TestSaveNotCallable(1),
        "object_2": TestSaveCallable(2),
    }

    partitioned_dataset.save(save_dict)


if __name__ == "__main__":
    main()

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think callable data (that is not intend to be lazy saved) is rare and is not worth to break PartitionedDataset for it. Left the comment in the original issue since I saw two PRs are associated.

#759 (comment)

@ElenaKhaustova ElenaKhaustova marked this pull request as draft January 14, 2025 14:45
@ElenaKhaustova ElenaKhaustova changed the title fix(datasets): Save callable with PartitionedDataset fix(datasets): Add parameter to enable/disable lazy saving for PartitionedDataset Jan 15, 2025
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review January 16, 2025 12:08
Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @ElenaKhaustova ! LGTM!

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean solution 👌

Signed-off-by: Elena Khaustova <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when saving TensorFlowModelDataset as partition
5 participants