Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an attribute to dataset classes to flag persistence #1910

Closed
jmholzer opened this issue Oct 6, 2022 · 2 comments · Fixed by #3520
Closed

Add an attribute to dataset classes to flag persistence #1910

jmholzer opened this issue Oct 6, 2022 · 2 comments · Fixed by #3520
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@jmholzer
Copy link
Contributor

jmholzer commented Oct 6, 2022

Description

Adding a boolean attribute (e.g. IS_PERSISTENT) to dataset objects would create a clean, consistent interface to check whether a dataset is persistent or not. This will help the development of new features in the long term and will have a direct positive impact on at least two open issues (#1802, #1830).

Currently, the best-possible check for persistence is to check if the dataset object is an instance of MemoryDataSet. This is a problem because not all non-persistent DataSets are necessarily instances of MemoryDataSet; there are user-defined dataset objects to consider as well as some defined elsewhere in Kedro (_SharedMemoryDataSet, for instance).

Context

The context for this change is that it is similar to other design decisions that we made in the past. For instance, the attribute _SINGLE_PROCESS is used to flag whether a dataset can be used with ParallelRunner.

This problem has been important in several contexts recently:

Possible Implementation

I see two ways to implement this interface.

  1. We selectively implement the flag on any datasets that are not persistent.
  2. We add it to AbstractDataSet, with inheriting classes overriding the default definition as necessary.

I think that option 2 is the better one, since it would define a consistent interface across all dataset objects. It also presents a cleaner interface than option 1, which would require all checks for persistence to take the form if getattr(data_set, "IS_PERSISTENT", False): ....

@jmholzer jmholzer added the Issue: Feature Request New feature or improvement to existing feature label Oct 6, 2022
@merelcht
Copy link
Member

Suggestion from @idanov: Change IS_PERSISTENT to _EPHEMERAL instead, because most datasets are persistent.

@yetudada
Copy link
Contributor

This will be needed to support being able to debug a node in a Jupyter notebook because you need to know if the previous dataset in the pipeline is available to be loaded or not, and if not then re-run the pipeline up to the point.

@noklam We might be able to delay this, it's an improvement on having the line magic for loading a node implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants