Conduct external market research #94

Closed
astrojuanlu opened this issue Jun 22, 2023 · 9 comments

@astrojuanlu

Objective: Assess perceptions of Kedro and competitors.

@astrojuanlu astrojuanlu self-assigned this Jun 22, 2023
@astrojuanlu astrojuanlu moved this to To Do in Kedro Framework Jun 22, 2023
@stichbury stichbury removed the status in Kedro Framework Jun 26, 2023
@stichbury stichbury moved this to To Do in Kedro Framework Jun 26, 2023
@astrojuanlu

Postponing this for now.


astrojuanlu commented Sep 15, 2023

Expanding the scope of this to

  1. Reassess the evolution of the more established Kedro competitors (DVC and MLflow gained pipelines, dbt gained Python support)
  2. Evaluate nascent competitors (Hamilton, Databricks bundles, brickflow)
  3. Understand Kedro's connection with other pieces of a typical MLOps stack (feature stores, experiment tracking solutions, data & model observability)
  4. Explore the current status of using Kedro with large structured and semi-structured data

Possibly intersecting with kedro-org/kedro#3012


astrojuanlu commented Oct 16, 2023

More axes worth exploring. All of my "conclusions" here are preliminary and should be treated as starting points for further exploration.

Data orchestration vs workflow orchestration

I contend that Kedro is a great data orchestrator (allow me to abuse the term "orchestrator" here to refer to pipelines) but not such a good workflow orchestrator. In fact, we've seen time and again that users create "dummy datasets" to artificially connect two otherwise unrelated nodes, purely to control the execution order.
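A toy sketch of why the dummy-dataset trick works (this is a hand-rolled dependency resolver for illustration only, not Kedro's actual runner, and the node/dataset names are made up): execution order falls out of dataset dependencies, so the only way to order two unrelated nodes is to fabricate a shared dataset.

```python
# Toy illustration: each node is declared as (inputs, outputs), and the
# runner derives execution order purely from who produces what.

def topo_order(nodes):
    """Order nodes so every node runs after the producers of its inputs."""
    produced_by = {out: name for name, (ins, outs) in nodes.items() for out in outs}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for inp in nodes[name][0]:
            if inp in produced_by:  # visit the producer first
                visit(produced_by[inp])
        order.append(name)

    for name in sorted(nodes):
        visit(name)
    return order

# "cleanup" and "download" share no datasets: their relative order is arbitrary.
independent = {
    "cleanup": ([], ["tmp_cleared"]),
    "download": ([], ["raw_data"]),
}

# Adding a "dummy" dataset edge forces "cleanup" to run before "download".
forced = {
    "cleanup": ([], ["tmp_cleared", "dummy"]),
    "download": (["dummy"], ["raw_data"]),
}

print(topo_order(independent))
print(topo_order(forced))  # "cleanup" is now guaranteed to precede "download"
```

The pattern works, but it pollutes the catalog with datasets that exist only to encode control flow, which is exactly the smell described above.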

Is this something Kedro should improve? Or should it continue to stay away from workflow orchestration?

Data pipelines

Speaking of ETL vs ELT: I contend that Kedro is an excellent framework if you're doing ETL, less so if you're doing ELT. Why? Because ELT generally assumes loading structured data directly into a data warehouse, and structured data is very amenable to SQL. Many teams will still want to do ELT with Python, though, and Kedro will serve them well.
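To make the distinction concrete, here's a toy in-memory contrast (all names are hypothetical and the "warehouse" is just a dict standing in for Snowflake, BigQuery, etc.):

```python
# Toy contrast between ETL and ELT. In ETL the transformation happens in
# Python *before* loading; in ELT raw data is loaded first and transformed
# inside the warehouse (normally with SQL, simulated in Python here).

warehouse = {}  # stands in for warehouse tables

def extract(source):
    # Stand-in for reading an API, file drop, etc.
    return [{"value": 1}, {"value": -2}, {"value": 3}]

def load(rows, table):
    warehouse[table] = rows

# ETL: extract -> transform in Python -> load. This is Kedro's sweet spot.
rows = extract("source")
load([r for r in rows if r["value"] > 0], table="analytics.clean_etl")

# ELT: extract -> load raw -> transform in the warehouse. In practice this
# last step is SQL, which is where Kedro has less to offer.
load(extract("source"), table="raw.events")
warehouse["analytics.clean_elt"] = [
    r for r in warehouse["raw.events"] if r["value"] > 0
]

print(len(warehouse["analytics.clean_etl"]), len(warehouse["analytics.clean_elt"]))  # 2 2
```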


Machine learning pipelines

Following Hopsworks' FTI (Feature, Training, Inference) mental map, I contend that Kedro is perfect for Feature and Training pipelines, but not very useful for Inference pipelines (which are basically model serving).


This mental map, by the way, greatly helps make sense of architecture diagrams like these:

[Screenshots: "Build your MLOps stack" (MyMLOps) and the ml-ops.org stack diagram]

(https://ml-ops.org/content/state-of-mlops, https://mymlops.com/)

What do data practitioners care about?

There's sufficient evidence that data scientists (or, to avoid somewhat outdated categorizations, "machine learning scientists") don't care about orchestration or pipelines. They do care about data modelling, statistical significance, confounding factors, experiment tracking, and many other things.


(strawman proposal of a "how much data scientists care" pyramid, originally from https://venturebeat.com/business/mlops-vs-devops-why-data-makes-it-different/ then reproduced in https://outerbounds.com/metaflow/)

So, if "data scientists" don't care about orchestration, how do we serve them well? And what do data engineers and machine learning engineers care about?

The "infant mortality" problem of ML

Some (a few? many?) models don't make it to production ("early failures" in the Bathtub curve). But is that a bad thing? Or a natural result of the experimentation process?


(https://ml-ops.org/content/crisp-ml)

And if it's a natural result, does it constitute a problem worth solving?

@astrojuanlu

And one last thing I forgot

Batch vs streaming

Kedro is not a streaming system. If anything, it can simulate streaming the way most people do: with a micro-batch approach. But Kedro startup times are notoriously high (kedro-org/kedro#1476), so the latency would be noticeable.
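A back-of-the-envelope sketch of that latency floor (the numbers are illustrative assumptions, not measurements):

```python
# In a micro-batch setup, every batch pays the full pipeline startup cost,
# which puts a hard floor under end-to-end data freshness.

def worst_case_latency(startup_s, processing_s, batch_interval_s):
    """Worst-case freshness of a record: it waits up to one full batch
    interval, then pays startup plus processing on top."""
    return batch_interval_s + startup_s + processing_s

# Hypothetical figures: 5 s startup (startup time is the known pain point
# tracked in kedro-org/kedro#1476), 2 s of actual work, batches every 30 s.
print(worst_case_latency(startup_s=5.0, processing_s=2.0, batch_interval_s=30.0))  # 37.0
```

Shrinking the batch interval makes the fixed startup cost dominate, which is why high startup times and micro-batching compose so badly.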


astrojuanlu commented Oct 29, 2023

From The State of Applied Machine Learning 2023, https://resources.tecton.ai/the-state-of-applied-machine-learning-2023-report (1,700+ respondents during the month of February).

Very good insights. It defines five pieces of an MLOps stack:

  1. model serving
  2. model registry and versioning
  3. feature store
  4. model monitoring
  5. data monitoring

The Feature Store / Feature Platform and Monitoring & Observability components will see the largest increases (~43 percentage points increase for both) in adoption in the next 12 months
...
Nearly 70% of respondents say they either have or plan to have a central MLOps platform in the next 12 months

More:

Respondents who shared that their companies have only batch models in production also shared that they struggle more with simpler organizational problems, such as demonstrating business ROI (41.5%) and lack of engineering and data science resources (21.5% and 24.8%, respectively)
Meanwhile, respondents who shared that their companies have real-time models in production struggle more with “advanced” challenges, such as collaboration between engineering and data science teams (28.0%) and serving models with enterprise SLAs (21.5%)

Also, "building production data pipelines" was the second most cited challenge for both groups.

On the other hand:

Deploying a new model to production is a long process (>1 month for 65.0% of respondents and >3 months for 31.7%)
71.4% of respondents shared that their companies aim to improve deployment time by at least 10% in the next 12 months.

But (1) it doesn't explain why, or what "in production" entails! And (2) a 10 % improvement doesn't seem like a particularly ambitious target to me (only 30 % want to make it 50 % faster, and only 3.6 % want to make it 2x faster). A 10 % improvement sounds to me like incremental progress = not a bottleneck.

More insights:

  • > 51 % use Amazon S3, 25 % use GCP Cloud Storage
  • tie between Databricks and Snowflake
  • different cloud providers seem to have different strengths: for example, Azure Blob Storage is the 3rd most used blob storage solution, but Azure ML is right behind Amazon SageMaker
  • 44 % use the built-in monitoring of their cloud provider, 31.8 % use in-house or open-source solutions. It's unclear how this translates to other pieces of the MLOps ecosystem, but it's a good starting point.
  • open-source model-serving solutions (BentoML, Seldon) are an order of magnitude behind built-in cloud solutions

@astrojuanlu

From https://www.comet.com/site/ty/report-2023-machine-learning-practitioner-survey/: "41% of their machine learning experiments had to be scrapped", mainly due to "API integration errors (26%), lack of resources (25%), inaccurate or misrepresentative data (25%) and manual mismanagement (25%)". Also, "machine learning practitioners surveyed say it takes their team seven months to deploy a single machine learning project".

And https://imerit.net/the-2023-state-of-mlops-report/: "Data’s often the culprit for model failures"; "when evaluating the reason for the failure of ML projects, almost half of professionals (46%) said lack of data quality or precision was the number-one reason, followed by a lack of expertise".


@astrojuanlu

Split the research into two parts: data pipelines (ETL/ELT) and machine learning pipelines.

Data pipelines

Tool survey from August 2021 on Reddit (n=597) https://www.reddit.com/r/dataengineering/comments/pbaw2f/what_etl_tool_do_you_use/?utm_source=share&utm_medium=web2x&context=3


Another survey from 2023 (n=189, 89% were Metabase customers) https://www.metabase.com/data-stack-report-2023/#data-ingestion-in-house


In conclusion:

Machine learning pipelines

"Pipelines are a buzzword" https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2lai3/?utm_source=share&utm_medium=web2x&context=3, https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie1xdro/?utm_source=share&utm_medium=web2x&context=3 and "pipelines are just automation of data processing" https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2a1xc/?utm_source=share&utm_medium=web2x&context=3

Hence "machine learning pipelines" is basically MLOps, or a subset of it https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning


The one tool that was consistently mentioned was MLflow, not only for experiment tracking but as a broader MLOps solution.

According to the Tecton report, commercial MLOps platforms are much more widespread than open source solutions:


with adoption numbers reflecting broader trends on cloud market share https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/


So in this space there's no clear winner either, but it's evident that commercial platforms beat open-source solutions.

@astrojuanlu

Adding one more interesting industry survey about data engineering https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61

