Improve Kedro run as a Package (2023) #3237
Comments
So I see there are two different ways of going about this:
I. Improve the current approach with CLI entrypoints - Parent #3237. It consists of 3 sub-tasks: […]
II. Use […]
|
I'd like to explore the idea of using the […]. In the end, users will want to define their own scripts and ways of launching the […].
Could you explain this in a bit more detail? |
Extended questions: […]
|
Oh, I commented on #2682 after I saw your comment on #3680. Is there anything left over? To clarify, I was withdrawing my opposition in case we ever have to do that. For this particular issue, I still think pursuing the […]
Yes, that's what I understood. Isn't it possible to extend the CLI by developing a normal plugin, like […]? |
Correct, you can extend it and add subcommands, but you cannot change the existing ones. |
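As an aside, a minimal sketch of what such a plugin could look like, assuming a made-up `kedro-hello` package (the `kedro.project_commands` entry point is the mechanism Kedro uses to discover plugin commands):

```python
# cli.py of a hypothetical "kedro-hello" plugin
import click


@click.group(name="hello-plugin")
def commands():
    """Command group that Kedro collects from the plugin's entry point."""


@commands.command(name="hello")
@click.option("--name", default="world", help="Who to greet.")
def hello(name: str) -> None:
    """A new subcommand that appears alongside the built-in `kedro` commands."""
    click.echo(f"Hello, {name}!")


# Registered in the plugin's pyproject.toml, e.g.:
# [project.entry-points."kedro.project_commands"]
# kedro_hello = "kedro_hello.cli:commands"
```

Commands added this way sit next to the built-in ones, but the built-in `kedro run` itself cannot be changed this way.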
Got it. Assuming something like this (from your comment above):

```python
from pathlib import Path

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main(*args, **kwargs):
    package_name = Path(__file__).parent.name
    configure_project(package_name)
    session = KedroSession.create()  # need to handle `env` and `extra_params`
    result = session.run(*args, **kwargs)
```

Can't the users add any CLI arguments they want, with argparse, click, fire, tyro, or anything else? |
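To make the question concrete, a sketch of the kind of wrapper a user could write around the `main` above with plain argparse (the argument names here are illustrative, not a Kedro API):

```python
import argparse


def cli() -> None:
    parser = argparse.ArgumentParser(description="Run my packaged Kedro project")
    parser.add_argument("--pipeline", default=None, help="Name of the pipeline to run")
    parser.add_argument("--tags", nargs="*", default=None, help="Only run nodes with these tags")
    args = parser.parse_args()
    # `pipeline_name` and `tags` are forwarded by `main` to `KedroSession.run`
    main(pipeline_name=args.pipeline, tags=args.tags)


if __name__ == "__main__":
    cli()
```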
I think a redesign of the […]. Is it possible to do […] in a non-breaking way first and then tackle the […]? |
Enabling a Kedro project/package to be run programmatically was one of my main focuses while developing kedro-boot. I enumerated these three entry points for running Kedro: […]
These three entry points could reuse the same `run_function`:

```python
from typing import Any

import click
from kedro.framework.session import KedroSession


def run_function(**kwargs):
    # some args preprocessing (sketch: normalise the raw CLI kwargs)
    kedro_args = kwargs
    tuple_tags = tuple(kedro_args.get("tags") or ())
    tuple_node_names = tuple(kedro_args.get("node_names") or ())
    runner = kedro_args.get("runner")  # an AbstractRunner instance, or None for the default

    # Session creation & running
    with KedroSession.create(
        env=kedro_args.get("env", ""),
        extra_params=kedro_args.get("params", ""),
        conf_source=kedro_args.get("conf_source", ""),
    ) as session:
        return session.run(
            tags=tuple_tags,
            runner=runner,
            node_names=tuple_node_names,
            from_nodes=kedro_args.get("from_nodes", ""),
            to_nodes=kedro_args.get("to_nodes", ""),
            from_inputs=kedro_args.get("from_inputs", ""),
            to_outputs=kedro_args.get("to_outputs", ""),
            load_versions=kedro_args.get("load_versions", {}),
            pipeline_name=kedro_args.get("pipeline", ""),
            namespace=kedro_args.get("namespace", ""),
        )


@click.command(name="run", short_help="")
def run(**kwargs) -> Any:
    return run_function(**kwargs)


run_params = [
    click.option("--pipeline", type=str, help=""),
    click.option("--env", type=str, help=""),
    # ... one click.option per remaining run argument
]
for param in run_params:
    run = param(run)
```
|
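As a usage note, the same `run_function` can then be called programmatically (from a notebook, a test, or an orchestrator task) as well as through the packaged CLI; for example, with hypothetical pipeline and environment names:

```python
# Programmatic entry point: same code path as the CLI `run` command above.
result = run_function(pipeline="data_processing", env="databricks")
```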
All subtasks are completed, so I'm closing this issue as complete as well! 🎉 |
Context
I sat down with @idanov today to try to recall our memory of #1423, which was made by @antonymilne. Completing this PR would make Kedro easier to integrate anywhere (particularly Databricks) and could potentially simplify our Databricks documentation.
#1423 summarizes how `kedro run` is currently supported. In summary, there are 3 things that #1423 attempts to fix, and we can break them down:
- `click` is emitting a `sys.exit`, which makes it hard to integrate Kedro and causes Databricks Job "failure" despite a successful run - `databricks_run.py` to keep a single `__main__` entrypoint across the project - Move `_find_run_command` to Framework #3051
- The incomplete change in Improve kedro run as a package #1423 - `kedro/framework/project/__init__.py` is trying to address this.
- `kedro run` or Kedro's entrypoint does not return anything - `features/steps/test_starter/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/__main__.py`
Added one more:
- `run(standalone_mode=True)` to fix packaged Kedro project getting a `sys.exit` #2682
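For background on the `sys.exit` problem above: click runs commands in "standalone mode" by default, which discards the return value and terminates the process; passing `standalone_mode=False` makes the call return instead. A generic click illustration (not Kedro's exact code):

```python
import click


@click.command()
@click.option("--pipeline", default="__default__")
def run(pipeline: str) -> dict:
    """Stand-in for a packaged project's `run` command."""
    click.echo(f"running {pipeline}")
    return {"status": "success"}


if __name__ == "__main__":
    # run(["--pipeline", "ds"]) would end the process via sys.exit() here;
    # with standalone_mode=False the return value comes back to the caller.
    result = run.main(["--pipeline", "ds"], standalone_mode=False)
    print(result)  # {'status': 'success'}
```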