Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #4277
Replies: 5 comments
-
Hey @Galileo-Galilei I haven't had time to read this in detail - but at a high level, the runner is something we want to completely rewrite; because it works well enough today, it's not an immediate priority. We recently had a hack sprint where @jiriklein on our team worked on this topic, from a slightly different perspective.
-
Hello @Galileo-Galilei! Firstly let me say thank you very much for all your thoughts on this and for writing it all up so carefully. Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!

Now the good news is that your short term solution of injecting a custom `runner` class at runtime already works. Why does this already work? It boils down to this line in `load_obj`, which means that a fully qualified class path passed to `--runner` is imported dynamically.

Some more food for thought: some of our deployment targets (at least AWS Batch and Dask) rely on writing a new custom runner and wiring it in through the project's `cli.py`.
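A small sketch of the mechanism, assuming kedro's `load_obj(obj_path, default_obj_path)` utility (the dotted path in the comment is hypothetical):

```python
from kedro.utils import load_obj

# A bare name falls back to the default location, kedro.runner...
runner_class = load_obj("SequentialRunner", "kedro.runner")

# ...but a fully qualified path resolves any importable runner class:
# runner_class = load_obj("my_package.runners.MyCustomRunner", "kedro.runner")
runner = runner_class(is_async=False)
```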
-
I've given this a bit more thought while reviewing #1248, and I think what you say here makes a lot of sense. I have a couple of suggested modifications to your proposal. A
My thinking here is:
@Galileo-Galilei what do you think of this? It seems to me there are maybe 3 options here along the lines we're thinking:
Any of these options would make the AWS and Dask runner configuration that I mentioned above much more elegant; there would no longer be a need to create custom `cli.py` run commands.
-
Hi @AntonyMilneQB, sorry for the long response delay and thanks for taking the time to answer. I'll try to answer all your points:
Thank you very much. It is always hard to identify clearly what I want to cover in the issues (and somehow figuring out what I do not cover is sometimes even harder), to digest my trials and errors while deploying kedro projects, and to create something understandable and hopefully useful for the community. Glad to see it gives you food for thought, even if you do not end up implementing it in the core library.
I actually figured this out after I wrote this issue, and I was afraid I had written it all up for no reason 😅 But it turns out that it is not currently possible to pass arguments to the constructor of such a custom runner, which makes this solution hardly usable in practice.
I am aware of this, and I really advocate against overriding the `run` command.
While reading my own issue again, I found out that I did not mention this, but in my mind it is completely natural that the
Given the clarification above, we both agree that 1. is not the right solution. Regarding the 2 remaining solutions, I would be inclined to pick solution 3. Solution 2 seems very elegant (we can argue that getting rid of an argument in the run command may make the user experience easier here), but it feels restrictive. It is quite common for me to have different environments (e.g. a "debug" one which persists all intermediary datasets, and an "integration" one where I log some datasets remotely instead of logging them locally). I want to be able to run the project in different fashions for the same environment (at least the usual "run" while developing; a "prod-like" runner which recreates a virtual environment to test locally how the project would behave in the CI; a "service" runner which serves the pipeline instead of running it, to test the project as an API locally). It would increase the maintenance burden if I had to duplicate my environments each time I want to run the project with a different runner.
-
If I understand this correctly, another possible solution could be having some sort of pre-run / post-run hooks? Those would trigger before actually instantiating the `KedroSession`. At the same time, the proposal would turn

Regardless of my personal opinion about the proposal, as part of an issue cleanup we're doing to use Discussions for enhancement proposals #3767, I'm moving this to a Discussion so that we can continue the conversation there.
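For illustration only, such pre-run / post-run hooks might look like the pluggy-style sketch below; neither the spec nor the method names exist in Kedro today, this is just the shape they could take:

```python
import pluggy

hook_spec = pluggy.HookspecMarker("kedro_run")
hook_impl = pluggy.HookimplMarker("kedro_run")


class RunLifecycleSpecs:
    """Hypothetical hooks that fire OUTSIDE the KedroSession lifecycle."""

    @hook_spec
    def before_session_created(self, project_path: str, env: str) -> None:
        """Called before the KedroSession is instantiated."""

    @hook_spec
    def after_session_closed(self, project_path: str, env: str) -> None:
        """Called after the run finishes and the session is torn down."""
```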
-
Preamble
This is the third part of my series of design documentation on refactoring Kedro to make deployment easier:

- Part 1 discusses `DataCatalog` entries which have a compute/storage backend different from "python / in memory operations" (including SQL, Spark...).
- Part 2 discusses the `KedroSession`.

Defining the feature: Modifying the running logic and distributing the modifier
Current state of Kedro's extensibility
There are currently several ways to extend Kedro natively, described hereafter:

- hooks, registered either automatically through a plugin's entry points OR by manual declaration in settings.py (see the sketch after this list), which extend the execution logic, e.g. to:
  - log data remotely (mlflow, neptune, dolt, store kedro-viz static files...)
  - profile a catalog entry
- CLI plugins, which add new commands, e.g. to:
  - convert a kedro pipeline to an orchestrator
  - visualize the pipeline in a web browser…
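As a concrete example of the manual declaration route mentioned in the first bullet (`ProjectHooks` and `my_package` are placeholder names):

```python
# src/my_package/settings.py: manual hook registration, the alternative
# to auto-registration through a plugin's entry points
from my_package.hooks import ProjectHooks

HOOKS = (ProjectHooks(),)
```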
Use cases not covered by previous mechanisms
However, I've encountered a bunch of use cases where people want to extend the running logic (= how to run the pipeline) rather than the execution logic (= how the pipeline behaves during runtime, which is achieved by hooks). Some examples include:
These are real life use-cases which cannot be achieved by hooks, because we want to perform operations outside of a `KedroSession`.

Current workaround pros and cons analysis
Actually, I can think of two ways to achieve the previous use cases in Kedro:

- override the `cli.py:run` command at the project level (or in a plugin) with custom logic
- create a custom `AbstractRunner` which contains the execution logic and manually inject it in your `cli.py` at the project level.

These solutions have strong issues:
- if you override the `run` command at the project level, you cannot reuse it from another project or plugin; you have to recode everything at the project level. At least the `runner` solution enables composing logics through inheritance, but it is not easy to maintain.
- if you override the `run` command in a plugin, other users can `pip install` it and benefit from the new logic; however, you have to give up the possibility to extend your own cli at the project level. Even worse, plugin import order can lead to inconsistent behaviour if several plugins implement a `run` command.
- it is hard to tell which `run` command is running in case of concurrent overriding of the command, which can obfuscate a lot of running errors.
- there is no easy way to switch between the default `run` command and the custom one (e.g. you want to run your pipeline normally most of the time while developing, and have another logic sometimes, e.g. one of the ones described above).

The best workflow I could come up with to implement such "running logic" changes is the following:
- create a custom `AbstractRunner` subclass
- modify `cli.py` on a per project basis to use my custom runner
- launch `kedro run`.
So I can at least reuse my custom `runner` in other projects by importing it and modifying each project's `cli.py`, which is not very convenient.
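As an illustration of the first step of this workflow, a minimal custom runner could look like the sketch below (the `_run` signature shown matches older Kedro versions and varies across releases; `CustomRunner` and its framing logic are purely hypothetical):

```python
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.runner import SequentialRunner


class CustomRunner(SequentialRunner):
    """Hypothetical runner wrapping extra 'running logic' around a run."""

    def _run(self, pipeline: Pipeline, catalog: DataCatalog, run_id: str = None):
        # e.g. recreate a virtual environment, start a server, notify a
        # scheduler... any logic that frames the run itself.
        print("custom pre-run logic")
        super()._run(pipeline, catalog, run_id)
        print("custom post-run logic")
```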
Potential solutions:

A short term solution: injecting the `runner` class at runtime
elementary bricks
to create custom running logic and choose it at runtime: therun
command and theAbstractRunner
class.The main default is that we can't easility distribute this logic to other users. I suggest to modify the default
run
command to be able to flexibly specify the runner at runtime with a similar logic as customDataSet
in theDataCatalog
by specifying its path.https://github.com/quantumblacklabs/kedro/blob/c2c984a260132cdb9c434099485eae05707ad116/kedro/framework/cli/project.py#L351-L392
Advantages for kedro users:
Towards more flexibility: configure runners in a configuration file
The previous solution does not make it possible to inject additional parameters into the runner, and it currently "feels" poorly managed (there are "if" conditions inside the `run` command to check whether a parameter can be used with the given runner or not...). A solution could be to have a `runner.yml` file behaving in a catalog-like way to enable parametrization; it would also enable using the same runner with different parameters. Such a file could look like this:
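A hypothetical sketch, assuming a catalog-like schema where `type` holds the runner class path and the remaining keys are passed to the runner constructor (all key names invented for illustration):

```yaml
# conf/base/runner.yml (hypothetical): catalog-like runner configuration
debug_runner:
  type: kedro.runner.SequentialRunner
  is_async: false

parallel_runner:
  type: kedro.runner.ParallelRunner
  max_workers: 4
```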
And the `run` command could resolve a name in this `RunnerCatalog` and use it in the following fashion:
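For instance, a minimal resolution helper might look like this (a sketch only: the `resolve_runner` name, the config path, and the schema sketched above are all assumptions):

```python
import yaml

from kedro.utils import load_obj


def resolve_runner(name: str, config_path: str = "conf/base/runner.yml"):
    """Instantiate a runner from a catalog-like runner.yml entry (sketch)."""
    with open(config_path) as f:
        entry = yaml.safe_load(f)[name]
    # "type" holds a full class path; every other key is a constructor kwarg
    runner_class = load_obj(entry.pop("type"))
    return runner_class(**entry)


# Hypothetical usage inside the run command:
# runner = resolve_runner("parallel_runner")
# session.run(runner=runner)
```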