
Should we change the output of session.run? #1802

Closed

noklam opened this issue Aug 23, 2022 · 7 comments
Labels
Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@noklam
Contributor

noklam commented Aug 23, 2022

Background

What does session.run() return? Currently this is not as clear as you might think, and it isn't documented anywhere. The logic is defined in runner.py; it can be counter-intuitive in some cases. Is there a good reason why we want to do this?

# Outputs that are not registered in the catalog ("free" outputs).
free_outputs = pipeline.outputs() - set(catalog.list())
# Register a default MemoryDataSet for every dataset the pipeline uses
# that has no catalog entry.
unregistered_ds = pipeline.data_sets() - set(catalog.list())
for ds_name in unregistered_ds:
    catalog.add(ds_name, self.create_default_data_set(ds_name))

if self._is_async:
    self._logger.info(
        "Asynchronous mode is enabled for loading and saving data"
    )
self._run(pipeline, catalog, hook_manager, session_id)

self._logger.info("Pipeline execution completed successfully.")
# Only the free outputs are loaded back and returned to the caller.
return {ds_name: catalog.load(ds_name) for ds_name in free_outputs}
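
To make that set arithmetic concrete, here is a small self-contained illustration (the dataset names are hypothetical):

pipeline_outputs = {"model", "report"}                   # pipeline.outputs()
pipeline_datasets = {"raw", "clean", "model", "report"}  # pipeline.data_sets()
catalog_entries = {"raw", "model"}                       # set(catalog.list())

free_outputs = pipeline_outputs - catalog_entries      # {"report"}
unregistered_ds = pipeline_datasets - catalog_entries  # {"clean", "report"}

# session.run() would return only {"report": ...}; "model" was persisted via
# the catalog, so the caller must fetch it with catalog.load("model").
print(free_outputs, unregistered_ds)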

Kedro has improved a lot in terms of running a pipeline as a standalone application with packaging & KedroSession; #1423 documents different ways to do it. Personally, I think it is still not easy enough for someone inexperienced with kedro to integrate with it. #1423 mentions how a pipeline can be called programmatically, but even though invoking the pipeline is a function call, it doesn't behave like a function: you can't easily pass an input as an argument (it has to be a Catalog entry), and the output of the pipeline is also very restricted.

Motivation

Kedro works really well within the kedro world, but it also means that kedro works very differently from the rest of the Python world.

This issue mainly focuses on the output side; improving it will make it easier to integrate a kedro pipeline as an upstream component. In an over-simplified world, this should be straightforward to do. Currently I think we make a strong assumption that people work within a "Kedro Project", but if we are moving towards a kedro package, i.e. using from kedro_package import main, it should behave just like a Python function. I think this is a reasonable expectation:

1. df = get_some_data()
2. model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
3. app = PredictionWebService(model)
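
For contrast, here is a minimal sketch of what this takes today with the programmatic API (all names are hypothetical, and the exact runner signature varies between Kedro versions - some require a hook manager argument):

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import node, pipeline
from kedro.runner import SequentialRunner

def train(df):
    # Toy stand-in for a real training function.
    return "a-model"

# The input has to exist as a catalog entry before the run.
catalog = DataCatalog({"my_pipeline_input_df": MemoryDataSet([1, 2, 3])})
my_pipeline = pipeline([node(train, "my_pipeline_input_df", "model")])

# Only the free outputs come back, here {"model": "a-model"}.
outputs = SequentialRunner().run(my_pipeline, catalog)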

Questions

  • What should session.run return?

Things to consider

  • How can any Python developer integrate with the kedro pipeline easily? Can it behave just like a function?
  • In an interactive workflow, it may make sense to keep all intermediate output in the resulting dict
  • Is there a known reason why the output is defined as it is?

Related Issue:

@noklam noklam added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Aug 23, 2022
@antonymilne
Contributor

This is a very interesting question. I think it's right to focus just on the output side here so I'll save my comments on input for another time 🙂

I think we'd need @idanov or maybe even @tsanikgr to explain exactly why session.run returns what it does. AFAIK it's always been this way. Intuitively it kind of feels like the right thing to me, since those are the "unprocessed" datasets which you might want to work with further. All intermediate datasets have already been consumed by the pipeline and so shouldn't be required further downstream. If you really want to make them available then you could make a mock identity node that copies them to a free output. Returning all intermediate datasets feels like too much to me.

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk, i.e. we could have free_output = pipeline.outputs()

Also, technically it looks to me like the code that finds free_outputs is not quite right. If I define something explicitly as a MemoryDataSet in my catalog (unusual, but not unheard of, e.g. to change the copy_mode) then it won't count as a free_output when it probably should. It's an edge case, but worth mentioning since we're discussing it here. What free_output means in the code is just "output that's not defined in the catalog", which is a subset of "output that's a MemoryDataSet".
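
A minimal sketch of that edge case (names hypothetical): register an unconsumed output as an explicit MemoryDataSet, e.g. to change copy_mode, and it silently drops out of free_outputs:

from kedro.io import DataCatalog, MemoryDataSet

# "model" is an unconsumed pipeline output, registered only to set copy_mode.
catalog = DataCatalog({"model": MemoryDataSet(copy_mode="assign")})

pipeline_outputs = {"model"}  # what pipeline.outputs() would give
free_outputs = pipeline_outputs - set(catalog.list())
print(free_outputs)  # set() -- "model" lives in memory but won't be returned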

@noklam
Contributor Author

noklam commented Oct 3, 2022

Adding this related SO question - How to run a kedro pipeline interactively like a function. This issue only focuses on the output of a pipeline; what about input? I think that will be the next question.

@noklam
Contributor Author

noklam commented Oct 4, 2022

Notes for Tech Design

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk, i.e. we could have free_output = pipeline.outputs()

  1. Less controversial - change the default: the definition of free_output is a bit buggy, so we should change it.
  2. Open up an optional argument for session.run to return any targeted datasets - even intermediate or persisted ones. This is more useful for interactive workflows (i.e. notebooks) or debugging purposes, but currently it's tricky to make it work (see the interface sketch below this list). This is highly related to Workflow of debugging Kedro pipeline in notebook #1832.
    2.1. If it's an intermediate memory dataset - you can't really get it.
    2.2. If it's an intermediate persisted dataset - you need to first session.run and then do catalog.load.
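
A purely hypothetical shape for such an argument (this is an interface sketch, not an existing Kedro API):

from typing import Any, Dict, Iterable, Optional

class KedroSession:  # sketch only, not the real class
    def run(
        self,
        pipeline_name: Optional[str] = None,
        # Hypothetical: datasets to load and return even if they are
        # intermediate or persisted, on top of the usual free outputs.
        return_datasets: Optional[Iterable[str]] = None,
    ) -> Dict[str, Any]:
        ...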

@merelcht
Member

merelcht commented Oct 5, 2022

Notes from Technical Design session:

There was agreement that the "free outputs" returned by session isn't very clear. It was suggested to simply return all node outputs that are not consumed, even those defined in the catalog. However, this could lead to very large amounts of data being returned. Instead, we'll change it to return all free outputs and additionally any MemoryDataSets that are defined in the catalog.
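
A sketch of what that agreed definition could look like in runner.py (it peeks at the catalog's private _data_sets mapping purely for illustration):

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline

def find_free_outputs(pipeline: Pipeline, catalog: DataCatalog) -> set:
    # Catalog entries that are plain MemoryDataSets (private attr, sketch only).
    registered_memory = {
        name
        for name, ds in catalog._data_sets.items()
        if isinstance(ds, MemoryDataSet)
    }
    # Unconsumed outputs that are either unregistered or held in memory.
    return pipeline.outputs() - (set(catalog.list()) - registered_memory)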

The second point about adding an optional argument for session.run() to return any targeted datasets was discussed briefly, but it was decided to talk about it more thoroughly in a separate workstream about node debugging.

@noklam
Contributor Author

noklam commented Oct 5, 2022

Supplementing the above comments to address @AntonyMilneQB's question:

i.e. we could have free_output = pipeline.outputs()

The answer is that there is a catalog.load call at the end; it's an expensive call and potentially memory-hungry. Persisted datasets are therefore released from memory as soon as they are no longer needed. A MemoryDataSet is already loaded in memory, so there is no harm in returning it.

@noklam
Contributor Author

noklam commented Oct 6, 2022

I gave it a go to see what it would take to make the initial idea work, partly because I wanted to test how the nbdev system works. See DebugRunner:

https://noklam.github.io/kedro-debug-runner/core.html

@noklam
Contributor Author

noklam commented Mar 23, 2023

Adding this as inspiration on whether we should have some kind of argument or debug mode that can return specific outputs easily without editing configuration.

At the moment, the proper way to inspect data is:

  • For "free memory data" - it is returned by session.run
  • For "intermediate memory data" - it is released as soon as it is no longer needed (it cannot be returned by a session; the user needs to do a session.run that ends at the "targeted dataset" so it becomes a "free output")
  • For a "persisted dataset" - the user needs to do catalog.load("dataset_name") after the run (see the sketch below)

The complication is mainly due to kedro runs needing to be efficient: some data is released on the fly to reduce the memory footprint.

The question is: how can we improve the user experience? It's hard to reason about what is a "free output" and what is not.
I would also argue that a significant number of users work with moderately sized data, where keeping everything in memory isn't a problem and makes the development experience smoother (kedro-org/kedro-plugins#44).
Is there a way to let users do what they want without touching any configuration?

This issue was moved to a discussion. You can continue the conversation there.
