Add an attribute to dataset classes to flag persistence #1910

jmholzer · 2022-10-06T16:11:41Z

Description

Adding a boolean attribute (e.g. IS_PERSISTENT) to dataset objects would create a clean, consistent interface to check whether a dataset is persistent or not. This will help the development of new features in the long term and will have a direct positive impact on at least two open issues (#1802, #1830).

Currently, the best-possible check for persistence is to check if the dataset object is an instance of MemoryDataSet. This is a problem because not all non-persistent DataSets are necessarily instances of MemoryDataSet; there are user-defined dataset objects to consider as well as some defined elsewhere in Kedro (_SharedMemoryDataSet, for instance).
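
To make the limitation concrete, here is a minimal sketch. The classes are simplified stand-ins defined locally, not Kedro's real implementations, and `InMemoryCacheDataSet` is a hypothetical user-defined dataset invented for illustration.

```python
class AbstractDataSet:
    """Stand-in for Kedro's dataset base class (simplified for illustration)."""


class MemoryDataSet(AbstractDataSet):
    """Stand-in for Kedro's in-memory dataset."""


class CSVDataSet(AbstractDataSet):
    """Stand-in for a dataset that writes to disk."""


class InMemoryCacheDataSet(AbstractDataSet):
    """Hypothetical user-defined dataset that keeps data only in memory."""


def looks_persistent(data_set: AbstractDataSet) -> bool:
    # The isinstance-based check described above.
    return not isinstance(data_set, MemoryDataSet)


print(looks_persistent(CSVDataSet()))            # True, which is correct
print(looks_persistent(MemoryDataSet()))         # False, which is correct
print(looks_persistent(InMemoryCacheDataSet()))  # True, although nothing is persisted
```

The last call shows the misclassification: the check only recognises subclasses of `MemoryDataSet`, so any other in-memory dataset is reported as persistent.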

Context

This change is similar to design decisions we have made in the past. For instance, the attribute _SINGLE_PROCESS is used to flag whether a dataset can be used with ParallelRunner.

This problem has been important in several contexts recently:

Possible Implementation

I see two ways to implement this interface.

1. We selectively implement the flag on any datasets that are not persistent.
2. We add it to AbstractDataSet, with inheriting classes overriding the default definition as necessary.

I think that option 2 is the better one, since it would define a consistent interface across all dataset objects. It also presents a cleaner interface than option 1, which would require all checks for persistence to take the form if getattr(data_set, "IS_PERSISTENT", False): ....
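
A rough sketch of what option 2 could look like follows. The classes are local stand-ins rather than Kedro's actual code, the default of `True` on the base class is an assumption, and `persistent_names` is a hypothetical helper used only to show the call site.

```python
from typing import Dict, List


class AbstractDataSet:
    """Stand-in for Kedro's base class; the attribute name follows the proposal."""

    # Assumed default: a dataset is persistent unless a subclass says otherwise.
    IS_PERSISTENT = True


class MemoryDataSet(AbstractDataSet):
    """In-memory dataset, so it overrides the default."""

    IS_PERSISTENT = False


class CSVDataSet(AbstractDataSet):
    """Inherits IS_PERSISTENT = True from the base class."""


def persistent_names(catalog: Dict[str, AbstractDataSet]) -> List[str]:
    # Hypothetical helper: with option 2 the check is a plain attribute
    # access, with no getattr fallback needed.
    return [name for name, data_set in catalog.items() if data_set.IS_PERSISTENT]


catalog = {"raw": CSVDataSet(), "intermediate": MemoryDataSet()}
print(persistent_names(catalog))  # ['raw']
```

Because every dataset inherits the attribute, callers never need to guard against it being absent.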


merelcht · 2022-10-17T13:36:47Z

Suggestion from @idanov: Change IS_PERSISTENT to _EPHEMERAL instead, because most datasets are persistent.
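
A sketch of the inverted flag, under the assumption that the base class defaults to `False` and only in-memory datasets set it to `True`. All classes are local stand-ins, and `ThirdPartyDataSet` and `is_ephemeral` are hypothetical names used for illustration.

```python
class AbstractDataSet:
    """Stand-in for Kedro's base class (simplified for illustration)."""

    # Most datasets are persistent, so the inverted flag defaults to False.
    _EPHEMERAL = False


class MemoryDataSet(AbstractDataSet):
    """In-memory dataset, so it sets the flag."""

    _EPHEMERAL = True


class ParquetDataSet(AbstractDataSet):
    """Inherits _EPHEMERAL = False."""


class ThirdPartyDataSet:
    """Hypothetical dataset that does not inherit from the base class."""


def is_ephemeral(data_set: object) -> bool:
    # A missing attribute falls back to False, i.e. "assume persistent".
    return getattr(data_set, "_EPHEMERAL", False)


for ds in (MemoryDataSet(), ParquetDataSet(), ThirdPartyDataSet()):
    print(type(ds).__name__, is_ephemeral(ds))
```

One possible reading of the suggestion is that the fallback value and the common case then coincide: an object without the attribute is treated like the majority of datasets.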

yetudada · 2023-06-30T13:46:38Z

This will be needed to support debugging a node in a Jupyter notebook: you need to know whether the previous dataset in the pipeline is available to be loaded and, if it is not, re-run the pipeline up to that point.

@noklam We might be able to delay this; it's an improvement on top of having the line magic for loading a node implemented.
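
The debugging use case could be sketched roughly as below. This is not how Kedro implements it: the classes are local stand-ins, `inputs_needing_rerun` is a hypothetical helper, the flag name `_EPHEMERAL` follows the suggestion earlier in the thread, and treating names missing from the catalog as in-memory is an assumption of this sketch.

```python
from typing import Dict, List


class AbstractDataSet:
    """Stand-in base class carrying the proposed flag."""

    _EPHEMERAL = False


class MemoryDataSet(AbstractDataSet):
    """In-memory dataset: its data is gone once the run ends."""

    _EPHEMERAL = True


class CSVDataSet(AbstractDataSet):
    """Stand-in for a dataset that writes to disk."""


def inputs_needing_rerun(
    node_inputs: List[str], catalog: Dict[str, AbstractDataSet]
) -> List[str]:
    """Return the inputs that cannot be loaded from storage."""
    missing = []
    for name in node_inputs:
        data_set = catalog.get(name)
        # Assumption: a name absent from the catalog is treated as in-memory.
        if data_set is None or data_set._EPHEMERAL:
            missing.append(name)
    return missing


catalog = {"companies": CSVDataSet(), "features": MemoryDataSet()}
print(inputs_needing_rerun(["companies", "features", "scores"], catalog))
# ['features', 'scores']
```

Any input returned here would have to be recomputed by running the upstream part of the pipeline before the node can be debugged. The flag only says whether a dataset can persist data, not whether the data has actually been written yet.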

jmholzer added the Issue: Feature Request label Oct 6, 2022

merelcht added this to Kedro Framework Oct 17, 2022

merelcht added this to the Improve the Interactive Jupyter notebook workflow milestone Feb 6, 2023

yetudada modified the milestones: Improve the Interactive Jupyter notebook workflow, Improving the debugging experience with Jupyter Notebook Jun 30, 2023

merelcht moved this to To Do in Kedro Framework Jan 8, 2024

merelcht assigned merelcht and lrcouto Jan 8, 2024

lrcouto moved this from To Do to In Progress in Kedro Framework Jan 15, 2024

lrcouto linked a pull request Jan 17, 2024 that will close this issue

Add attribute to flag persistence in Dataset classes #3520

Merged

7 tasks

merelcht mentioned this issue Jan 19, 2024

Add attribute to flag persistence in Dataset classes #3520

Merged

7 tasks

lrcouto moved this from In Progress to In Review in Kedro Framework Jan 19, 2024

lrcouto moved this from In Review to Done in Kedro Framework Jan 24, 2024

lrcouto closed this as completed Jan 24, 2024
