Conduct external market research #94
Postponing this for now.
Expanding the scope of this to
Possibly intersecting with kedro-org/kedro#3012
More axes worth exploring. All of my "conclusions" here are preliminary and should be treated as starting points for further exploration.

**Data orchestration vs workflow orchestration**

I contend that Kedro is a great data orchestrator (allow me to abuse the term "orchestrator" here to refer to pipelines) but not such a good workflow orchestrator. In fact, we've seen time and time again how users resort to "dummy datasets" to artificially connect two nodes that aren't otherwise connected, with the goal of controlling the execution order (sketched below). Is this something Kedro should improve? Or should it continue to stay away from workflow orchestration?

**Data pipelines**

Speaking of ETL vs ELT, I contend that Kedro is an excellent framework if you're doing ETL, less so if you're doing ELT. Why? Because ELT more or less assumes direct storage of structured data in a data warehouse, and structured data is very amenable to SQL. Many teams will want to do ELT with Python though, and Kedro will serve them well.

**Machine learning pipelines**

Following Hopsworks' FTI (Feature, Training, Inference) mental map, I contend that Kedro is perfect for Feature and Training pipelines, but not very useful for Inference pipelines (which are basically model serving). This mental map, by the way, greatly helps make sense of architecture diagrams like these: https://ml-ops.org/content/state-of-mlops, https://mymlops.com/

**What do data practitioners care about?**

There's sufficient evidence that data scientists (or, to avoid somewhat outdated categorizations, "machine learning scientists") don't care about orchestration or pipelines. They do care about data modelling, statistical significance, confounding factors, experiment tracking, and many other things. (See the strawman "how much data scientists care" pyramid, originally from https://venturebeat.com/business/mlops-vs-devops-why-data-makes-it-different/ and later reproduced in https://outerbounds.com/metaflow/.) So, if "data scientists" don't care about orchestration, how do we serve them well? And what do data engineers and machine learning engineers care about?

**The "infant mortality" problem of ML**

Some (a few? many?) models don't make it to production ("early failures" in the bathtub curve). But is that a bad thing? Or a natural result of the experimentation process? (https://ml-ops.org/content/crisp-ml) And if it's a natural result, does it constitute a problem worth solving?
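For reference, here is a minimal, hypothetical sketch of the "dummy dataset" workaround mentioned above: one node emits a throwaway output whose only purpose is to force another node to run after it. Function, node, and dataset names (`create_tables`, `ingest_data`, `tables_ready_flag`, etc.) are made up for illustration; only `node` and `pipeline` come from Kedro itself.

```python
# Hypothetical sketch of the "dummy dataset" workaround for controlling
# execution order between two otherwise unconnected Kedro nodes.
from kedro.pipeline import node, pipeline


def create_tables():
    # e.g. run DDL against a database; the return value is only a dummy flag
    return "tables ready"


def ingest_data(_tables_ready, raw_data):
    # consuming the dummy flag forces this node to run after create_tables
    return raw_data


ordering_demo = pipeline(
    [
        node(create_tables, inputs=None, outputs="tables_ready_flag"),
        node(ingest_data, inputs=["tables_ready_flag", "raw_data"], outputs="ingested_data"),
    ]
)
```

The dummy dataset carries no real data; it exists solely to express a control-flow dependency that Kedro's data-centric DAG has no first-class way to express.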
And one last thing I forgot.

**Batch vs streaming**

Kedro is not a streaming system. If anything, it can simulate streaming the way most people do: using a micro-batch approach (see the sketch below). But Kedro startup times are notoriously high (kedro-org/kedro#1476), so the latency would be noticeable.
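To make the micro-batch point concrete, here is a hypothetical sketch that simulates streaming by re-running a Kedro pipeline on a schedule via the `kedro run` CLI. The pipeline name (`micro_batch`) and interval are made up; the point is that every invocation pays the full startup cost discussed in kedro-org/kedro#1476.

```python
# Hypothetical micro-batch loop: re-run a Kedro pipeline every N seconds.
# Each subprocess call pays Kedro's full startup cost, so the effective
# end-to-end latency is startup time + pipeline run time.
import subprocess
import time

INTERVAL_SECONDS = 60  # made-up micro-batch interval

while True:
    started = time.monotonic()
    subprocess.run(["kedro", "run", "--pipeline", "micro_batch"], check=True)
    elapsed = time.monotonic() - started
    print(f"micro-batch finished in {elapsed:.1f}s")
    time.sleep(max(0.0, INTERVAL_SECONDS - elapsed))
```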
From The State of Applied Machine Learning 2023 (https://resources.tecton.ai/the-state-of-applied-machine-learning-2023-report): 1,700+ respondents during the month of February; very good insights. It defines 5 pieces of an MLOps stack:
More:
Also, "building production data pipelines" was the second most cited challenge for both groups. On the other hand:
But (1) it doesn't explain why, or what "in production" entails! And (2) a 10% improvement doesn't seem like a particularly ambitious target to me (only 30% want to make it 50% faster, and only 3.6% want to make it 2x faster). A 10% improvement sounds to me like incremental progress, i.e. not a bottleneck. More insights:
From https://www.comet.com/site/ty/report-2023-machine-learning-practitioner-survey/: "41% of their machine learning experiments had to be scrapped", mainly due to "API integration errors (26%), lack of resources (25%), inaccurate or misrepresentative data (25%) and manual mismanagement (25%)", and "machine learning practitioners surveyed say it takes their team seven months to deploy a single machine learning project". And from https://imerit.net/the-2023-state-of-mlops-report/: "Data's often the culprit for model failures" ("when evaluating the reason for the failure of ML projects, almost half of professionals (46%) said lack of data quality or precision was the number-one reason, followed by a lack of expertise").
Azure also separates data pipelines from machine learning pipelines.
Split the research in two: data pipelines (ETL/ELT) and machine learning pipelines.

**Data pipelines**

Tool survey from August 2021 on Reddit (n=597): https://www.reddit.com/r/dataengineering/comments/pbaw2f/what_etl_tool_do_you_use/?utm_source=share&utm_medium=web2x&context=3. Another survey from 2023 (n=189, 89% were Metabase customers): https://www.metabase.com/data-stack-report-2023/#data-ingestion-in-house

In conclusion:
**Machine learning pipelines**

"Pipelines are a buzzword" (https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2lai3/?utm_source=share&utm_medium=web2x&context=3, https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie1xdro/?utm_source=share&utm_medium=web2x&context=3) and "pipelines are just automation of data processing" (https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2a1xc/?utm_source=share&utm_medium=web2x&context=3). Hence "machine learning pipelines" is basically MLOps, or a subset of it: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

The one tool that was consistently mentioned was MLflow, not only for experiment tracking but as a broader MLOps solution. According to the Tecton report, commercial MLOps platforms are much more widespread than open source solutions, with adoption numbers reflecting broader trends in cloud market share: https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/

So in this space there's no clear winner either, but it's evident that commercial platforms beat open source solutions.
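Since MLflow keeps coming up here primarily for experiment tracking, a minimal sketch of what that workflow typically looks like may be useful context. The experiment name, parameters, and metric values below are made up; only the `mlflow` calls are real API.

```python
# Minimal experiment-tracking sketch with MLflow: log the parameters and
# metrics of a single run so it can be compared against other runs later.
import mlflow

mlflow.set_experiment("churn-model-baseline")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("train_rows", 10_000)
    mlflow.log_metric("roc_auc", 0.81)  # illustrative value
```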
Adding one more interesting industry survey about data engineering: https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61
Objective: Assess perceptions of Kedro and competitors.