
Error when saving TensorFlowModelDataset as partition #759

Open
anabelchuinard opened this issue Aug 21, 2023 · 16 comments · May be fixed by #978
Labels
bug Something isn't working

Comments

@anabelchuinard

anabelchuinard commented Aug 21, 2023

Description

Can't save TensorFlowModelDataset objects as partition.

Context

I am working on a project where I have to train several models concurrently. I started writing my code using PartitionedDataset, where each partition corresponds to the data for one training run. When I try to save the resulting TensorFlow models as a partition, I get an error. I wonder if this has to do with the fact that these inherit from AbstractVersionedDataset instead of AbstractDataset, and if so, I would like to know whether there is any workaround for batch-saving them.

This is the instance of my catalog corresponding to the models I want to save:

tensorflow_models:
  type: PartitionedDataset
  path: data/derived/ML/models
  filename_suffix: ".hdf5"
  dataset:
    type: kedro.extras.datasets.tensorflow.TensorFlowModelDataset

Note: Saving one model (not as partition) works.

Steps to Reproduce

  1. Generate a bunch of trained models
  2. Try to save them in a partition as TensorFlowModelDataset objects
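For context, a hedged sketch of what these two steps look like in a node feeding a PartitionedDataset (the name `train_models` is hypothetical, and plain lists stand in for trained Keras models; the returned dict's keys become the partition file names):

```python
from typing import Any, Callable

def train_models(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Hypothetical node: produce one trained model per input partition."""
    models = {}
    for partition_id, load_partition in partitions.items():
        data = load_partition()  # PartitionedDataset loads lazily via callables
        # Stand-in for training; in practice this would be a tf.keras model
        models[partition_id] = data
    return models

# Each key of the returned dict becomes one output file, e.g. "run_a.hdf5"
result = train_models({"run_a": lambda: [1, 2], "run_b": lambda: [3, 4]})
print(sorted(result))  # ['run_a', 'run_b']
```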

Expected Result

Should save one .hdf5 file per partition, with the file name being the associated dictionary key.

Actual Result

Getting this error:

DatasetError: Failed while saving data to data set PartitionedDataset(dataset_config={}, dataset_type=TensorFlowModelDataset,
path=...).
The first argument to `Layer.call` must always be passed.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.12
  • Python version used (python -V): 3.9.16
  • Operating system and version: Mac M2
@astrojuanlu
Member

Hi @anabelchuinard, thanks for opening this issue and sorry for the delay. It will take us some time but I'm labeling this issue so we don't lose track of it.

@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Sep 5, 2023
@merelcht
Member

merelcht commented Jul 8, 2024

Hi @anabelchuinard, do you still need help fixing this issue?

@anabelchuinard
Author

@merelcht I found a non-kedronic workaround for this, but I would love to know if there is now a kedronic way of batch-saving those models.

@merelcht
Member

merelcht commented Jul 9, 2024

Using the PartitionedDataset is definitely the recommended Kedro way for batch saving. I've done some digging and it seems that the following lines are causing issues for using the TensorFlowModelDataset with PartitionedDataset:

if callable(partition_data):
    partition_data = partition_data()  # noqa: PLW2901

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Jul 9, 2024
@merelcht merelcht changed the title Saving TensorFlowModelDataset as partition Error when saving TensorFlowModelDataset as partition Jul 9, 2024
@merelcht merelcht transferred this issue from kedro-org/kedro Jul 9, 2024
@merelcht merelcht added the bug Something isn't working label Jul 9, 2024
@merelcht merelcht moved this to To Do in Kedro Framework Aug 5, 2024
@ElenaKhaustova ElenaKhaustova self-assigned this Jan 6, 2025
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jan 6, 2025
@ElenaKhaustova
Contributor

ElenaKhaustova commented Jan 7, 2025

Cause of the issue

The issue is in how we implement partitioned dataset lazy saving. To postpone materialising the data, we expect the dictionary fed to PartitionedDataset to contain Callable values instead of the actual objects.

if callable(partition_data):
    partition_data = partition_data()  # noqa: PLW2901

When saving the data, we check whether a Callable was passed and call it to get the actual object. Since a TensorFlow model is itself callable, we make this call when saving, which causes the above error, even though the user never intended to apply lazy saving.

So PartitionedDataset currently cannot save callable objects unless they are wrapped in another Callable, for example a lambda.
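A minimal stand-in (no TensorFlow required) illustrating the misfire; `FakeModel` and `save_partition` are simplified stand-ins for a Keras model and PartitionedDataset's save logic, not the actual Kedro code:

```python
class FakeModel:
    """Model-like object that implements __call__, as Keras models do."""
    def __call__(self, inputs):
        return inputs

def save_partition(partition_data):
    # Simplified version of PartitionedDataset's save logic
    if callable(partition_data):
        partition_data = partition_data()  # misfires on model objects
    return partition_data

model = FakeModel()
try:
    save_partition(model)  # TypeError: __call__ is missing its argument
except TypeError as exc:
    print("save failed:", exc)

# Wrapping in a lambda (the lazy-saving convention) makes it work:
result = save_partition(lambda: model)
assert result is model
```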

Current workaround

@anabelchuinard - To make PartitionedDataset save a Callable in the current Kedro version, you need to wrap the object as if you wanted to do lazy saving:

# This fails: the models are callable, so PartitionedDataset calls them on save
save_dict = {
    "tensorflow_model_32": models["tensorflow_model_32"],
    "tensorflow_model_64": models["tensorflow_model_64"],
}

# Wrap each TensorFlow model in a lambda, so the save-time call returns the
# model instead of invoking it
save_dict = {
    "tensorflow_model_32": lambda: models["tensorflow_model_32"],
    "tensorflow_model_64": lambda: models["tensorflow_model_64"],
}

Suggested fix

Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

Follow-up PR to update the docs:

kedro-org/kedro#4402

@noklam
Contributor

noklam commented Jan 7, 2025

Suggested fix
Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

To me this seems like a niche case, and changing PartitionedDataset to accept only lambdas is a bigger breaking change. Any useful callable will likely be more complicated than a simple lambda. Maybe we can allow disabling lazy loading/saving (enabled by default) when specified?

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jan 8, 2025

Suggested fix
Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

To me this seems to be a niche case, and changing PartitionedDataset to only accept lambda is a bigger breaking change. Any useful callable will likely be more complicated than a simple lambda. Maybe we can disable lazy loading/saving (default enable) when specified?

I see the point, but I think the issue is a little broader than this case. In particular, I don't think it's right to call an arbitrary callable object and use that check to decide whether to apply lazy saving. This affects all the ML-model cases (TensorFlow, PyTorch, scikit-learn, etc.) and can potentially execute unwanted code implemented in `__call__`. Moreover, it's not intuitive for users to have to wrap their objects to avoid such behaviour.

In the suggested solution I tried to narrow these cases down from any callable to lambdas, so there is less chance of hitting them.

As an alternative, we can consider making lazy saving the default behaviour, so we wrap and unwrap objects internally and automatically. But then the question is whether it should be the only option (as it is for lazy loading) or whether we should provide an interface to disable it.

@DimedS
Member

DimedS commented Jan 8, 2025

Thanks for the investigation and PR, @ElenaKhaustova! I agree with @noklam that relying solely on lambda functions for lazy saving doesn't seem like a generic solution. While it is a breaking change, it's hard to determine how much it will impact users. In my opinion, it would be better to avoid treating all Callables as participants in lazy saving by default. However, this would also be a breaking change. As a simpler alternative, we could provide an option to disable lazy saving, as you suggested.

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jan 10, 2025

@noklam, @DimedS, @astrojuanlu

Based on the above arguments, my suggestion would be to make lazy saving a default behaviour like it's done for lazy loading now. For that, we can wrap and unwrap objects internally (instead of asking users to do so manually like we do now), which will guarantee that the Callable we get is expected to be called.

The other question is whether we should provide an option to disable lazy saving. Are there any known cases when disabling it might be critical? Note that we don't have such an option for lazy loading, so it's always enabled.

Please see the edited suggestion below.

@merelcht merelcht moved this from In Review to To Do in Kedro Framework Jan 13, 2025
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jan 13, 2025
@DimedS
Member

DimedS commented Jan 14, 2025

Based on the above arguments, my suggestion would be to make lazy saving a default behaviour like it's done for lazy loading now. For that, we can wrap and unwrap objects internally (instead of asking users to do so manually like we do now), which will guarantee that the Callable we get is expected to be called.

Hi @ElenaKhaustova,
Could you please explain how lazy saving will work? For instance, if I want to enable lazy saving and have a function in one partition that executes some code and returns a pandas DataFrame, how should I modify my function to align with your proposal of wrapping all partitions?

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jan 14, 2025

@DimedS

Could you please explain how lazy saving will work?

I think the easiest way, with minimal changes, will be to add a `lazy` argument to the `save()` function with a default value of `True`:

def save(self, data: dict[str, Any], lazy: bool = True) -> None:

Then:

  • If the input object is callable and lazy=True we unwrap it
  • If the input object is not callable and lazy=True we do nothing
  • If the input object is callable and lazy=False we do nothing
  • If the input object is not callable and lazy=False we do nothing
  • Lazy saving will be enabled by default similar to lazy loading

So a user will still need to wrap the object as was required before, and this behaviour won't change, but there will be a proper option to disable it. Currently, when working with a callable such as a TF model, one needs to wrap it to avoid it being called: #759 (comment)
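The four cases above could be sketched as follows (a simplified stand-in for the proposed `PartitionedDataset.save`, returning what would be written so the behaviour is visible):

```python
from typing import Any

def save(data: dict[str, Any], lazy: bool = True) -> dict[str, Any]:
    """Sketch of the proposed behaviour; returns what would be written."""
    saved = {}
    for partition_id, partition_data in data.items():
        # Unwrap only when lazy saving is enabled AND the value is callable;
        # with lazy=False, callables such as TF models are stored as-is.
        if lazy and callable(partition_data):
            partition_data = partition_data()
        saved[partition_id] = partition_data
    return saved

model = lambda x: x  # callable stand-in for a TF model
assert save({"m": lambda: "weights"})["m"] == "weights"  # lazy unwraps
assert save({"m": model}, lazy=False)["m"] is model      # stored as-is
assert save({"m": 42})["m"] == 42                        # non-callable: no-op
```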

@DimedS
Member

DimedS commented Jan 14, 2025

Thanks, @ElenaKhaustova! If I understand correctly, the default behavior will remain the same as the current one. However, we are adding the option to use lazy=False, which will prevent Callables from being unwrapped, allowing users to apply it in scenarios like TensorFlow. Is that correct? If so, I really like this idea!

@ElenaKhaustova
Contributor

Thanks, @ElenaKhaustova! If I understand correctly, the default behavior will remain the same as the current one. However, we are adding the option to use lazy=False, which will prevent Callables from being unwrapped, allowing users to apply it in scenarios like TensorFlow. Is that correct? If so, I really like this idea!

Yes, exactly! That's the way to avoid a breaking change.

@merelcht
Member

I think the easiest way with minimal changes will be to add lazy argument to save() function with True default value:

def save(self, data: dict[str, Any], lazy: bool = True) -> None:

This sounds like a good and clean solution to me.

@astrojuanlu
Member

@ElenaKhaustova Would users be able to toggle that from catalog.yml?

@ElenaKhaustova
Contributor

ElenaKhaustova commented Jan 14, 2025

@ElenaKhaustova Would users be able to toggle that from catalog.yml?

Yes, I think we should also add it to make sure disabling is possible not only programmatically but with kedro run as well.
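A catalog entry with such a toggle might look like this, mirroring the entry from the issue description (the option name `save_lazily` is illustrative, not a confirmed API; check the released kedro-datasets documentation for the actual name):

```yaml
tensorflow_models:
  type: PartitionedDataset
  path: data/derived/ML/models
  filename_suffix: ".hdf5"
  save_lazily: false  # hypothetical option: disable callable unwrapping on save
  dataset:
    type: kedro.extras.datasets.tensorflow.TensorFlowModelDataset
```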

@ElenaKhaustova ElenaKhaustova moved this from In Progress to In Review in Kedro Framework Jan 16, 2025
Projects
Status: In Review
6 participants