Replies: 3 comments 3 replies
-
I'm going to paste what I wrote last night on Discord:
Questions for the community:
-
@broehrig It's a tradeoff, just as you described it. Packaging modular pipelines and using them as ordinary pip-installable Python packages is the best case, but then you run into the problem of not being able to slightly modify the code to fit your use case. On the other hand, pulling the code into your project gives you that flexibility at the expense of harder updates of those pipelines.

Some projects approach the problem by creating "versions" of the pipeline, e.g. your pipeline package contains multiple subpackages, one per version.

Another approach I'd suggest is to make sure your pipelines are made up of truly modular nodes. Then modifying the pipeline happens through node replacement: you pull your pipeline and never change the actual code in it (so you preserve the ability to upgrade it later). Instead, you only modify the pipeline-creation part: take the pipeline provided by the package, remove the node you want to change, and add a modified replica of it.

None of these options is perfect, but code sharing that still allows modifications is hard to do in any case, and it is fundamentally limited by the same problems any distributed system faces.
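The node-replacement idea can be sketched without any framework at all. Below, a pipeline is modelled as an ordered list of named steps; in Kedro the same pattern applies to `Pipeline`/`node` objects. All function and node names here are hypothetical, purely for illustration:

```python
# Framework-agnostic sketch of the "node replacement" pattern: the shared
# pipeline's code is never edited; the consuming project only rebuilds the
# pipeline with one node swapped out.

def clean(data):
    return [x for x in data if x is not None]

def score(data):
    return sum(data)

def shared_pipeline():
    """Pipeline as shipped by the shared package -- never edited locally."""
    return [("clean", clean), ("score", score)]

def replace_node(pipeline, name, new_func):
    """Return a copy of the pipeline with one node's function swapped."""
    return [(n, new_func if n == name else f) for n, f in pipeline]

def run(pipeline, data):
    for _, func in pipeline:
        data = func(data)
    return data

# Local variant: same pipeline, but with a project-specific scoring node.
def score_weighted(data):
    return sum(2 * x for x in data)

local_pipeline = replace_node(shared_pipeline(), "score", score_weighted)

print(run(shared_pipeline(), [1, None, 2]))  # 3
print(run(local_pipeline, [1, None, 2]))     # 6
```

Because the upstream `shared_pipeline()` is called unchanged, pulling a new version of the shared package stays a clean update; only the small replacement shim is project-specific.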
-
Before I mention my ideas, I should note that I have never shared Kedro pipelines with other people; I've just thought about how I'd do it. I believe distributing ready-made pipelines as packaged modules wouldn't work well, precisely because of what @datajoely mentioned: more often than not you have to adapt the external pipeline to work with your data, or just add functionality to it. Given that the best way to distribute these would be through source code, a simple idea would be to have each of these shared pipelines as a Git submodule or something like that. People would be able to clone it into their own projects.
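The submodule route can be sketched end to end. To keep the example self-contained, a local directory named `upstream` stands in for the shared pipeline's remote repository; all paths and names are made up:

```shell
# Sketch of sharing a pipeline via a Git submodule. "upstream" stands in
# for the shared pipeline's remote repo; all paths are hypothetical.
set -e
cd "$(mktemp -d)"

git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "shared pipeline v1"

git init -q consumer
cd consumer
# Pull the shared pipeline into this project's source tree:
git -c protocol.file.allow=always submodule add -q ../upstream src/shared_pipeline

# Later, pick up upstream changes without forking the code:
git -c protocol.file.allow=always submodule update -q --remote src/shared_pipeline
ls .gitmodules
```

Each consuming project pins a specific commit of the shared pipeline, so updates are explicit and local modifications show up as an obvious divergence from upstream.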
The second issue, the config files, is I think the hardest one to tackle. A quick Google search suggested that some tools for dealing with YAML files do support merging or including other files. So overall, I think Git submodules would offer a good solution to sharing pipelines, at the cost of having to manage a large number of repositories (most likely one per pipeline) and their config files, or of unifying the configs somehow.
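The config-unification step mentioned above can be sketched as a recursive merge of a project's overrides onto the shared pipeline's defaults. Plain dicts stand in for the result of parsing YAML (e.g. with `yaml.safe_load`); the keys and values are purely illustrative:

```python
# Recursive merge of two config trees, as one might obtain from parsed YAML.
# Project-level values override the shared pipeline's defaults; nested dicts
# are merged key by key rather than replaced wholesale.

def deep_merge(defaults, overrides):
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

shared_defaults = {"train": {"epochs": 10, "lr": 0.01}, "seed": 42}
project_overrides = {"train": {"lr": 0.001}}

print(deep_merge(shared_defaults, project_overrides))
# {'train': {'epochs': 10, 'lr': 0.001}, 'seed': 42}
```

A project then only versions its small override file, while the full defaults travel with the shared pipeline repo.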
-
Hi everyone,
We have a few steps that all our Kedro pipelines use, just with different parameters. We are thinking about using modular pipelines to implement this in a reusable way.
However, I am unsure how to do the actual sharing. Using `kedro pipeline pull` copies the source code into the project. I am afraid that this would lead to the different usages of the modular pipeline diverging over time; since the pipeline is reused in many different repositories, that seems like a big risk to me. On the other hand, actually sharing the code in this way would let each project that reuses the pipeline make the small adjustments that might be needed for specific use cases. If we go for an importable library rather than copying the code, we might run into the issue that the shared pipeline grows conditionals to satisfy all the possible use cases.
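One middle ground between copying code and a conditional-laden library is to ship a pipeline *factory* that takes parameters (and optional step overrides) instead of a finished pipeline. A minimal, framework-agnostic sketch; every name here is hypothetical:

```python
# Sketch of the "importable library" option: the shared package exposes a
# factory, and each project passes its own parameters instead of forking
# the code or adding use-case conditionals to the shared source.

def create_pipeline(threshold=0.5, steps=None):
    """Build the shared sequence of steps, parameterised per project.

    `steps` lets a project override individual stages without copying the
    whole pipeline.
    """
    def filter_rows(rows):
        return [r for r in rows if r >= threshold]

    def aggregate(rows):
        return sum(rows) / len(rows)

    pipeline = {"filter": filter_rows, "aggregate": aggregate}
    pipeline.update(steps or {})
    return pipeline

def run(pipeline, rows):
    return pipeline["aggregate"](pipeline["filter"](rows))

# Project A: default behaviour with a custom threshold.
print(run(create_pipeline(threshold=1.0), [0.2, 1.0, 3.0]))  # 2.0

# Project B: same pipeline, but one stage swapped out locally.
custom = create_pipeline(steps={"aggregate": max})
print(run(custom, [0.2, 1.0, 3.0]))  # 3.0
```

The shared code stays importable and updatable, while per-project variation lives in the arguments each project passes, not in conditionals inside the shared source.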
So I would really like to hear your opinions on this: how do you deal with sharing code / modular pipelines between Kedro projects? What works and what doesn't work for you?