Adding a boolean attribute (e.g. IS_PERSISTENT) to dataset objects would create a clean, consistent interface to check whether a dataset is persistent or not. This will help the development of new features in the long term and will have a direct positive impact on at least two open issues (#1802, #1830).
Currently, the best available check for persistence is to test whether the dataset object is an instance of MemoryDataSet. This is a problem because not all non-persistent datasets are instances of MemoryDataSet; there are user-defined dataset objects to consider, as well as some defined elsewhere in Kedro (_SharedMemoryDataSet, for instance).
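To make the gap concrete, here is a minimal sketch of the isinstance-based check and how it misclassifies a non-persistent dataset. The classes are simplified stand-ins rather than the real Kedro classes, and CachedInRamDataSet and looks_persistent are hypothetical names used only for illustration.

```python
# Stand-in classes, NOT the real Kedro ones: just enough to show the gap.
class AbstractDataSet:
    """Stand-in for the common dataset base class."""


class MemoryDataSet(AbstractDataSet):
    """Stand-in for Kedro's in-memory dataset."""


class CachedInRamDataSet(AbstractDataSet):
    """Hypothetical user-defined dataset that also keeps data only in memory."""


def looks_persistent(data_set):
    # The isinstance-based check described above.
    return not isinstance(data_set, MemoryDataSet)


# Correct: the in-memory dataset is reported as non-persistent.
assert looks_persistent(MemoryDataSet()) is False

# Wrong: this dataset is not persistent either, but it does not inherit
# from MemoryDataSet, so the check reports it as persistent.
assert looks_persistent(CachedInRamDataSet()) is True
```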
Context
This change is similar to design decisions that we have made in the past. For instance, the attribute _SINGLE_PROCESS is used to flag whether a dataset can be used with ParallelRunner.
This problem has been important in several contexts recently, including #1802.
Possible Implementation
I see two ways to implement this interface:
1. We selectively implement the flag on any datasets that are not persistent.
2. We add it to AbstractDataSet, with inheriting classes overriding the default definition as necessary.
I think that option 2 is the better one, since it would define a consistent interface across all dataset objects. It also presents a cleaner interface than option 1, which would require all checks for persistence to take the form if getattr(data_set, "IS_PERSISTENT", False): ....
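A rough sketch of what option 2 could look like is below. It uses stand-in classes instead of the real Kedro ones. The default value of True on the base class is an assumption (the proposal does not fix a default), and CSVLikeDataSet and is_persistent are hypothetical names.

```python
# Stand-in classes, NOT the real Kedro implementation.
class AbstractDataSet:
    # Assumption: datasets count as persistent unless a subclass says otherwise.
    IS_PERSISTENT = True


class MemoryDataSet(AbstractDataSet):
    # Non-persistent datasets override the default.
    IS_PERSISTENT = False


class CSVLikeDataSet(AbstractDataSet):
    """Hypothetical file-backed dataset; it simply inherits the default."""


def is_persistent(data_set):
    # Every dataset inherits the attribute, so no getattr fallback is needed.
    return data_set.IS_PERSISTENT


assert is_persistent(CSVLikeDataSet()) is True
assert is_persistent(MemoryDataSet()) is False
```

Because the attribute lives on the base class, callers can read it directly on any dataset, including user-defined ones that never mention the flag.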
This will be needed to support debugging a node in a Jupyter notebook: you need to know whether the previous dataset in the pipeline is available to be loaded and, if not, re-run the pipeline up to that point.
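As a hedged illustration of that use case, the sketch below splits a node's inputs into those that could be loaded directly and those that would have to be recomputed. It assumes datasets expose the proposed IS_PERSISTENT attribute. The stub classes, the dataset names, and split_inputs are hypothetical and not part of Kedro.

```python
class PersistentStub:
    """Hypothetical dataset whose data is stored and can be loaded later."""
    IS_PERSISTENT = True


class InMemoryStub:
    """Hypothetical dataset whose data is gone once the run ends."""
    IS_PERSISTENT = False


def split_inputs(node_inputs):
    """Split a {name: dataset} mapping into loadable and to-recompute names."""
    loadable, to_recompute = [], []
    for name, data_set in node_inputs.items():
        if data_set.IS_PERSISTENT:
            loadable.append(name)
        else:
            to_recompute.append(name)
    return loadable, to_recompute


loadable, to_recompute = split_inputs(
    {"raw_cars": PersistentStub(), "features": InMemoryStub()}
)
```

Here raw_cars could be loaded as-is, while features would require re-running the upstream part of the pipeline.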
@noklam We might be able to delay this; it's an improvement on top of having the line magic for loading a node implemented.