📝 Document type : this document is mostly a "User Research" piece, which focuses on what functional needs should be added to kedro, but it sometimes slips into a "Design document" which suggests how they could be integrated into the core kedro framework. The two parts are clearly separated, because I think the user research (mainly a compilation of existing kedro issues) is worth sharing even if we do not end up with the proposed implementation.
👥 Target audience : the kedro core team, plugin developers, and MLOps engineers who put kedro in production. I try to keep the issue as self-contained as possible, but I still assume the reader knows the default kedro objects (runner, pipeline, catalog...) and how KedroSession.run works under the hood.
🙏 Credits : the mind behind most of the design and the thoughts described hereafter is @takikadiri. I mostly reformulate, clarify and try to give a comprehensive overview of the issues we are trying to solve and how we solve them.
📚 Additional resources : most of the features described hereafter are implemented in the kedro-boot plugin, even though the technical implementation sometimes differs for subtle reasons. You can find examples of how to use the features described in the "User research" section in the kedro-boot examples repo.
‼️Important note : The kedro-boot plugin also provides other features, especially to launch an app from a kedro entrypoint (called standalone mode in this comment). This is out of scope of this issue.
TL;DR
We need to make the session runnable multiple times and optimize its latency, to serve many use cases (serving, dynamic pipelines, ...). In summary, we should make the pseudo-code below possible:
```python
session = KedroSession.create(project_path)
session.preload(artifacts=["my_model", "my_encoder"])  # ✨ FEATURE 1: preload artifacts, cache them and do not release them between runs, for speed

session.run(pipeline="my_pipeline", runtime_data={"data": data1})  # ✨ FEATURE 2: inject data at runtime from the session
session.run(pipeline="my_pipeline", runtime_params={"customer_id": id})  # ✨ FEATURE 3: run the same session multiple times + ✨ FEATURE 4: inject runtime params at... runtime (as the name says!) instead of at instantiation time

session.release_cache(artifacts=["my_model", "my_encoder"])  # free memory when needed
```
A number of technical optimisations are needed under the hood (for speed, for kedro-viz compatibility, and for the ability to inject parameters truly at runtime).
User research : embedded deployment patterns
Functional need 1 : Triggering kedro from third party applications which own the entrypoint
There is a very common "deployment pattern" which consists in running the KedroSession programmatically inside another python program, with (more or less) the following snippet:
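A minimal sketch of this pattern (referred to as SNIPPET 1 in the rest of this issue):

```python
# SNIPPET 1 : naive pseudo-code running a kedro pipeline programmatically
from kedro.framework.project import bootstrap_project
from kedro.framework.session import KedroSession

bootstrap_project(project_path)
session = KedroSession.create(project_path)
session.run(pipeline_name="my_pipeline")
```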
Overall, this is well described in #2169, where multiple use cases are identified, e.g. serving a pipeline behind an API or from a frontend such as a streamlit app (a code example can be found here: https://github.com/takikadiri/kedro-boot-examples/blob/29438e45adb581b060f43c953c2557badeafd7d1/src/spaceflights_streamlit/app.py#L1-L33). We can eventually imagine people deploying kedro pipelines by running them in airflow tasks "programmatically" (with session.run()) instead of trying to "map" kedro objects (e.g. nodes or namespaced pipelines) to airflow tasks automatically, as sketched below.
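For illustration, a hedged sketch of what such a "programmatic" airflow task could look like (assuming airflow 2.x; task names and the project path are illustrative):

```python
# hedged sketch, not a recommended deployment pattern: one airflow task runs
# one whole kedro pipeline programmatically
from airflow.operators.python import PythonOperator

def run_kedro_pipeline(pipeline_name: str, **_):
    from kedro.framework.project import bootstrap_project
    from kedro.framework.session import KedroSession

    bootstrap_project("/opt/my_kedro_project")  # hypothetical project path
    with KedroSession.create("/opt/my_kedro_project") as session:
        session.run(pipeline_name=pipeline_name)

run_my_pipeline = PythonOperator(
    task_id="run_my_pipeline",
    python_callable=run_kedro_pipeline,
    op_kwargs={"pipeline_name": "my_pipeline"},
)
```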
✅ The "naive" SNIPPET 1 above is valid in kedro>=0.18. So... is that already OK? Clearly not, because of the next paragraph.
Functional need 2 : Passing data at runtime and getting the results
This long-demanded feature is described in detail in #2169, so I'll try not to duplicate it here. The point is that almost all the use cases described in the previous paragraph need to inject "some data" at runtime (the API input, data coming from a frontend, parameters or literally anything computed on the fly...) and to retrieve the results in memory.
Without the ability to pass data at runtime, none of the use cases presented above is feasible, so SNIPPET 1 is hardly useful. We need to find a way to circumvent the current limitations to pass data at runtime.
Injecting data & parameters
This is what is covered by #2169.

```python
# SNIPPET 2 : naive pseudo-code with runtime data injection
from kedro.framework.project import bootstrap_project
from kedro.framework.session import KedroSession

# data is anything retrieved from the context: the API input in case 1, the
# frontend in case 2, some parameters in case 4, literally anything computed
# on the fly in case 5...
data = get_data()

bootstrap_project(project_path)
session = KedroSession.create(project_path)
session.run(pipeline="my_pipeline", runtime_data={"my_catalog_dataset": data})  # this "runtime_data" key does not exist, this is the suggestion of #2169
```
The current workaround consists in rewriting the KedroSession.run method in your app with many private methods, which causes a lot of maintenance issues because it becomes really hard to upgrade your kedro version. We'll cover this in a later paragraph.
Injecting globals & runtime_params
A couple of issues suggest that users do not want to override full datasets but only to pass some globals or runtime_params to be resolved in the catalog: #1723.
The typical use cases consist in using the catalog to (a catalog sketch follows the list):
trigger a parametrized SQL request (with a date, a customer_id...)
call a REST API endpoint with API dataset
change a filepath where the result is read/saved dynamically
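For illustration, a hedged sketch of catalog entries for these use cases (dataset class names and the availability of the runtime_params resolver vary across kedro / kedro-datasets versions; all entry names are hypothetical):

```yaml
# illustrative catalog.yml entries parametrized by a runtime value
customer_orders:
  type: pandas.SQLQueryDataset   # pandas.SQLQueryDataSet in older kedro-datasets
  sql: SELECT * FROM orders WHERE customer_id = ${runtime_params:customer_id}
  credentials: db_credentials

predictions:
  type: pandas.CSVDataset
  filepath: data/07_model_output/predictions_${runtime_params:customer_id}.csv
```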
```python
# SNIPPET 3 : naive pseudo-code with runtime globals injection
import time

from kedro.framework.project import bootstrap_project
from kedro.framework.session import KedroSession

bootstrap_project(project_path)
session = KedroSession.create(project_path)
session.run(pipeline="my_pipeline", runtime_globals={"my_global": time.time()})  # this "runtime_globals" key could as well be passed as "runtime_params"
```
There are two big problems in the current implementation:
The extra_params key exists (note that the name is inconsistent: it is called params in the CLI, extra_params in the session and runtime_params for the resolver; I strongly suggest we normalize this confusing naming), but it is currently a KedroSession.create argument instead of a KedroSession.run argument. This means that we cannot override these extra params on each run, which tightly couples the two methods create and run. It currently makes sense because kedro makes the strong assumption that 1 session = 1 run, and it even raises an exception if a user attempts to run a session multiple times (kedro/framework/session/session.py, lines 324 to 329 at 2e64459):

```python
" active KedroSession. KedroSession has a 1-1 mapping with"
" runs, and thus only one run should be executed per session."
```

This is a strong limitation, discussed further below.
The KedroSession.run method accesses the catalog through the context.catalog attribute. However, the catalog is already resolved at this step and it is no longer possible to inject globals without rebuilding the entire catalog manually (including rewriting all the load/merge logic between environments...). The issue Spike: Enable access to OmegaConfigLoader DictConfig instead of dict only #2973 describes in detail what the problem is and suggests a potential solution. There is a more general issue about decoupling loading and resolving configuration: [Proposal] - Replace ConfigLoader with ConfigLoader and ConfigResolver #2481. kedro-boot attempts to solve this by enabling a new type of parametrized query in the catalog with a `[[ ]]` syntax instead of the `${ }` templating syntax, but this feels like a bad workaround that is not sustainable in the long run. We'll do better with a custom resolver in a next version, sketched below.
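For illustration, a minimal sketch of what such a custom resolver could look like with omegaconf (which backs kedro's OmegaConfigLoader); RUNTIME_PARAMS is a hypothetical mutable store, not kedro API:

```python
# minimal sketch: an omegaconf resolver that looks values up at resolution time,
# so the catalog can be re-resolved on each run with fresh runtime params
from omegaconf import OmegaConf

RUNTIME_PARAMS = {}  # hypothetical store, updated before each run

OmegaConf.register_new_resolver(
    "runtime_params",
    lambda key, default=None: RUNTIME_PARAMS.get(key, default),
    replace=True,
)

RUNTIME_PARAMS["customer_id"] = 42
cfg = OmegaConf.create(
    {"sql": "SELECT * FROM orders WHERE customer_id = ${runtime_params:customer_id}"}
)
print(cfg.sql)  # -> SELECT * FROM orders WHERE customer_id = 42
```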
Technical requirement 1 : Speed of execution
Above "business" use cases (especially API serving) requires speed of execution, hence any overhead induced by kedro negatively impact their feasibility. A couple of seconds of overhead is acceptable on a 5 mn batch, but less on an API serving preidction swith high latency (say <100ms).
There are at least 3 known performance issues of the KedroSession:
Each session run calls session.load_context(), which is slow. The goal is to pass the save and load versions directly by using context._get_catalog(...), as described in kedro/framework/session/session.py (line 334 at 2e64459). The aforementioned issue makes clear why it is currently done this way, but this feels like a high cost for a very low benefit. If we can avoid calling load_context on each run, it will speed up the session run.
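For illustration, a hedged sketch of the kind of workaround people use today to avoid rebuilding the context on every run (it relies on private APIs whose signatures may change, which is exactly the maintenance problem discussed at the end of this issue):

```python
# hedged sketch: build the context and catalog once, then reuse them across runs
from kedro.framework.project import pipelines
from kedro.runner import SequentialRunner

context = session.load_context()   # expensive: call it once, not on every run
catalog = context._get_catalog()   # private method, illustration only
runner = SequentialRunner()
runner.run(pipelines["my_pipeline"], catalog, hook_manager=session._hook_manager)
```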
Artifacts preloading : kedro loads data lazily when the first node that needs it runs, and releases it once the last node that needs it has finished computing. This is the expected behaviour in batch mode, but in API mode you run the pipeline multiple times, and you don't want the I/O to become the performance bottleneck. It would be useful to be able to force loading some datasets (e.g. a ML model) in memory before running the session, and not release them until explicitly told to. Mlflow calls such datasets "artifacts", see https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0033c5072c929a4c26cfaeaf61fcedf93d36522/kedro_mlflow/mlflow/kedro_pipeline_model.py#L174-L219.
```python
# SNIPPET 4 : naive pseudo-code with artifact preloading
session = KedroSession.create(project_path)
session.preload(artifacts=["my_model", "my_encoder"])  # syntax TBD

session.run(pipeline="my_pipeline", runtime_data={"data": data1})  # catalog.load("my_model") should use the preloaded cached data. Maybe reuse CachedDataSet?
session.run(pipeline="my_pipeline", runtime_data={"data": data2})  # same: "my_model" is served from the cache

session.release_cache(artifacts=["my_model", "my_encoder"])  # syntax TBD
```
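For illustration, a minimal sketch of how preloading could be emulated with today's catalog API (the preload helper is hypothetical; MemoryDataset is the kedro>=0.19 name, it was MemoryDataSet before):

```python
# hedged sketch: load each artifact once and pin the result in the catalog, so
# that subsequent runs load it from memory instead of hitting storage again
from kedro.io import MemoryDataset

def preload(catalog, names):
    for name in names:
        data = catalog.load(name)                              # hit storage once
        catalog.add(name, MemoryDataset(data), replace=True)   # later loads are in-memory
```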
Technical requirement 2 : Running a single session multiple times
If we summarize the last paragraphs, it boils down to this: we need to make the session runnable multiple times without recreating it for each run, which currently creates a tight coupling between the create and run methods. We showed that it would enable us to:
Pass data dynamically to KedroSession.run
Move extra_params to KedroSession.run instead of KedroSession.create
Pass runtime_params (or globals) to KedroSession.run and resolve the catalog on each run instead of at session instantiation
Preload some datasets after create and not release them at the end of each run, to increase speed of execution
However, deep investigation is needed before enabling this, because kedro makes the 1 session = 1 run assumption on purpose. According to the issue, this is done to simplify the kedro-viz experiment tracking functionality. Changing this assumption may make it impossible to stay backward compatible with existing experiment tracking in kedro-viz, as described in #1273. However, this potential breaking change should be weighed against kedro-org/kedro-viz#1624, which suggests low adoption of the experiment tracking functionality; my feeling is that enabling multiple runs per session is worth breaking experiment tracking, but I am overly biased here :)
A positive side effect of getting rid of this assumption is that it would solve the original motivation of #1273, i.e. letting people customize the session run_id, which is particularly useful when deploying a kedro pipeline to an orchestrator which assumes task independence (like airflow).
Technical requirement 3 : Ease of use
Database connection lazy loading
One current issue people face when using the session, particularly (but not only) in interactive mode, is that database connections & API clients are instantiated on KedroSession.create, which means that you cannot run your pipeline if another pipeline is not correctly instantiated. Tackling #2829 would simplify debugging and improve ease of use.
In general, we should be able to run a pipeline from a session even if other pipelines are broken (import errors, invalid data connections...), but this is a hard problem and not the top priority.
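For illustration, a minimal sketch of the lazy connection idea inside a custom dataset (class and attribute names are hypothetical; a real kedro dataset would subclass AbstractDataset and implement _load):

```python
# hedged sketch: the engine is only created on first use, not at catalog
# instantiation, so broken connections elsewhere don't block this pipeline
from functools import cached_property

import pandas as pd
import sqlalchemy

class LazySQLQueryDataset:
    def __init__(self, sql: str, con: str):
        self._sql, self._con = sql, con  # no connection opened here

    @cached_property
    def _engine(self):
        # executed only the first time the dataset is actually loaded
        return sqlalchemy.create_engine(self._con)

    def load(self) -> pd.DataFrame:
        return pd.read_sql(self._sql, self._engine)
```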
Create a public API for consistent access to kedro objects between versions
The lack of a public API makes code supporting the above use cases become obsolete very fast as kedro versions change, because it relies on a lot of private methods. This severely degrades the developer experience for kedro users:
seasoned developers deploying kedro in production accumulate tech debt and maintenance costs, which can be a hindrance to kedro adoption
plugin developers need to test their plugins in depth and keep them very up to date; unfortunately this is not something we can reasonably expect from unpaid open source developers, and public plugins for serving are already obsolete (https://pypi.org/project/kedro-fast-api/, https://github.com/Galileo-Galilei/kedro-serving), which can add even more confusion for users.
There is a long-standing issue about making access to kedro objects consistent for plugin developers, #779, and creating a public API to support such use cases is a good step forward.