Conduct external market research #94
Postponing this for now.
Expanding the scope of this to
Possibly intersecting with kedro-org/kedro#3012
More axes worth exploring. All of my "conclusions" here are preliminary and should be treated as starting points for further exploration.

**Data orchestration vs workflow orchestration**

I contend that Kedro is a great data orchestrator (allow me to abuse the term "orchestrator" here to refer to pipelines) but not such a good workflow orchestrator. In fact, we've seen time and time again how users resort to "dummy datasets" to artificially connect two nodes that aren't otherwise connected, with the goal of controlling the execution order (sketched below). Is this something Kedro should improve? Or should it continue to stay away from workflow orchestration?

**Data pipelines**

Speaking of ETL vs ELT, I contend that Kedro is an excellent framework if you're doing ETL, less so if you're doing ELT. Why? Because ELT more or less assumes direct storage of structured data in a data warehouse, and structured data is very amenable to SQL. Many teams will want to do ELT with Python though, and Kedro will serve them well.

**Machine learning pipelines**

Following Hopsworks' FTI (Feature, Training, Inference) mental map, I contend that Kedro is perfect for Feature and Training pipelines, but not very useful for Inference pipelines (which are basically model serving). This mental map, by the way, greatly helps make sense of architecture diagrams like these: https://ml-ops.org/content/state-of-mlops, https://mymlops.com/

**What do data practitioners care about?**

There's sufficient evidence that data scientists (or, to avoid somewhat outdated categorizations, "machine learning scientists") don't care about orchestration or pipelines. They do care about data modelling, statistical significance, confounding factors, experiment tracking, and many other things. (See the strawman "how much data scientists care" pyramid, originally from https://venturebeat.com/business/mlops-vs-devops-why-data-makes-it-different/ and later reproduced in https://outerbounds.com/metaflow/.) So, if "data scientists" don't care about orchestration, how do we serve them well? And what do data engineers and machine learning engineers care about?

**The "infant mortality" problem of ML**

Some (a few? many?) models don't make it to production ("early failures" in the bathtub curve). But is that a bad thing? Or a natural result of the experimentation process? (https://ml-ops.org/content/crisp-ml) And if it's a natural result, does it constitute a problem worth solving?
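For reference, here is a minimal, hypothetical sketch of the "dummy dataset" workaround mentioned above: one node emits a throwaway output whose only purpose is to force another node to run after it. Function, node, and dataset names (`create_tables`, `ingest_data`, `tables_ready_flag`, etc.) are made up for illustration; only `node` and `pipeline` come from Kedro itself.

```python
# Hypothetical sketch of the "dummy dataset" workaround for controlling
# execution order between two otherwise unconnected Kedro nodes.
from kedro.pipeline import node, pipeline


def create_tables():
    # e.g. run DDL against a database; the return value is only a dummy flag
    return "tables ready"


def ingest_data(_tables_ready, raw_data):
    # consuming the dummy flag forces this node to run after create_tables
    return raw_data


ordering_demo = pipeline(
    [
        node(create_tables, inputs=None, outputs="tables_ready_flag"),
        node(ingest_data, inputs=["tables_ready_flag", "raw_data"], outputs="ingested_data"),
    ]
)
```

The dummy dataset carries no real data; it exists solely to express a control-flow dependency that Kedro's data-centric DAG has no first-class way to express.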
And one last thing I forgot.

**Batch vs streaming**

Kedro is not a streaming system. If anything, it can simulate streaming the way most people do: using a micro-batch approach (see the sketch below). But Kedro startup times are notoriously high (kedro-org/kedro#1476), so the latency would be noticeable.
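To make the micro-batch point concrete, here is a hypothetical sketch that simulates streaming by re-running a Kedro pipeline on a schedule via the `kedro run` CLI. The pipeline name (`micro_batch`) and interval are made up; the point is that every invocation pays the full startup cost discussed in kedro-org/kedro#1476.

```python
# Hypothetical micro-batch loop: re-run a Kedro pipeline every N seconds.
# Each subprocess call pays Kedro's full startup cost, so the effective
# end-to-end latency is startup time + pipeline run time.
import subprocess
import time

INTERVAL_SECONDS = 60  # made-up micro-batch interval

while True:
    started = time.monotonic()
    subprocess.run(["kedro", "run", "--pipeline", "micro_batch"], check=True)
    elapsed = time.monotonic() - started
    print(f"micro-batch finished in {elapsed:.1f}s")
    time.sleep(max(0.0, INTERVAL_SECONDS - elapsed))
```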
From The State of Applied Machine Learning 2023 (https://resources.tecton.ai/the-state-of-applied-machine-learning-2023-report): 1,700+ respondents during the month of February; very good insights. It defines 5 pieces of an MLOps stack:
More:
Also, "building production data pipelines" was the second most cited challenge for both groups. On the other hand:
But (1) it doesn't explain why, or what "in production" entails! And (2) a 10% improvement doesn't seem like a particularly ambitious target to me (only 30% want to make it 50% faster, and only 3.6% want to make it 2x faster). A 10% improvement sounds to me like incremental progress, i.e. not a bottleneck. More insights:
From https://www.comet.com/site/ty/report-2023-machine-learning-practitioner-survey/: "41% of their machine learning experiments had to be scrapped", mainly due to "API integration errors (26%), lack of resources (25%), inaccurate or misrepresentative data (25%) and manual mismanagement (25%)", and "machine learning practitioners surveyed say it takes their team seven months to deploy a single machine learning project". And from https://imerit.net/the-2023-state-of-mlops-report/: "Data's often the culprit for model failures" ("when evaluating the reason for the failure of ML projects, almost half of professionals (46%) said lack of data quality or precision was the number-one reason, followed by a lack of expertise").
Azure also separates data pipelines from machine learning pipelines.
Split the research in two: data pipelines (ETL/ELT) and machine learning pipelines.

**Data pipelines**

Tool survey from August 2021 on Reddit (n=597): https://www.reddit.com/r/dataengineering/comments/pbaw2f/what_etl_tool_do_you_use/?utm_source=share&utm_medium=web2x&context=3. Another survey from 2023 (n=189, 89% were Metabase customers): https://www.metabase.com/data-stack-report-2023/#data-ingestion-in-house

In conclusion:
**Machine learning pipelines**

"Pipelines are a buzzword" (https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2lai3/?utm_source=share&utm_medium=web2x&context=3, https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie1xdro/?utm_source=share&utm_medium=web2x&context=3) and "pipelines are just automation of data processing" (https://www.reddit.com/r/datascience/comments/vmhurh/comment/ie2a1xc/?utm_source=share&utm_medium=web2x&context=3). Hence "machine learning pipelines" is basically MLOps, or a subset of it: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

The one tool that was consistently mentioned was MLflow, not only for experiment tracking but as a broader MLOps solution. According to the Tecton report, commercial MLOps platforms are much more widespread than open source solutions, with adoption numbers reflecting broader trends in cloud market share: https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/

So in this space there's no clear winner either, but it's evident that commercial platforms beat open source solutions.
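Since MLflow keeps coming up here primarily for experiment tracking, a minimal sketch of what that workflow typically looks like may be useful context. The experiment name, parameters, and metric values below are made up; only the `mlflow` calls are real API.

```python
# Minimal experiment-tracking sketch with MLflow: log the parameters and
# metrics of a single run so it can be compared against other runs later.
import mlflow

mlflow.set_experiment("churn-model-baseline")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("train_rows", 10_000)
    mlflow.log_metric("roc_auc", 0.81)  # illustrative value
```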
Adding one more interesting industry survey about data engineering: https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61
Objective: Assess perceptions of Kedro and competitors.