Create massive pipeline to test with flowchart on Kedro-viz #1064

rashidakanchwala · 2022-09-14T11:53:44Z

Description

Create a massive kedro-viz pipeline to stress-test flowchart features.

Context

The fluidity of flowchart interactions depends on the size of the pipeline. We currently don't have any massive pipelines, so we cannot stress-test many Kedro-Viz features. We know that many data science projects have huge pipelines; this issue is to make sure Kedro-Viz can handle them too.

Possible Implementation

Maybe we can just create a big json file with multiple large pipelines
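
One way to sketch this idea is to generate a layered synthetic DAG and dump it to JSON. The `nodes`/`edges` layout below is an illustrative assumption, not the schema Kedro-Viz actually consumes, and `build_synthetic_graph` is a hypothetical helper name.

```python
import json
import random


def build_synthetic_graph(n_layers: int, width: int, fan_in: int = 2, seed: int = 0) -> dict:
    """Build a layered DAG where each node reads from nodes in the previous layer.

    The {"nodes": [...], "edges": [...]} layout is an illustrative assumption,
    not the format Kedro-Viz expects.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    nodes, edges = [], []
    previous_layer = []
    for layer in range(n_layers):
        current_layer = [f"node_{layer}_{i}" for i in range(width)]
        for node_id in current_layer:
            nodes.append({"id": node_id, "layer": layer})
            # Pick a few parents from the previous layer (none for the first layer).
            n_parents = min(fan_in, len(previous_layer))
            for parent in rng.sample(previous_layer, n_parents):
                edges.append({"source": parent, "target": node_id})
        previous_layer = current_layer
    return {"nodes": nodes, "edges": edges}


# 50 layers of 20 nodes each gives a 1000-node graph.
graph = build_synthetic_graph(n_layers=50, width=20)
payload = json.dumps(graph)
```

Increasing `n_layers` and `width` scales the graph; the payload would still need to be mapped onto whatever structure the flowchart really loads.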

Checklist

Include labels so that we can categorise your feature request

rashidakanchwala · 2022-09-20T15:48:18Z

@jmholzer recently did this kedro-org/kedro#1795 (comment) where he tested the runner with 1000 nodes. I am wondering if we can create a json from the pipeline with 1000 nodes and use it for the above.

tynandebold · 2022-10-04T15:20:05Z

Great idea. Let's try to build this into the demo project so we don't have to maintain two data sources.

Thoughts from backlog grooming.

Default pipeline is our current view
In the pipeline dropdown we have an item that, when selected, loops through and generates a massive pipeline.

tynandebold · 2023-09-25T14:55:00Z

Another idea: find a team that has a massive pipeline and get it from them.

astrojuanlu · 2023-09-25T15:15:40Z

I know a few of them 😄

tynandebold · 2023-09-25T15:32:15Z

Please let us know where we can get one!

rashidakanchwala · 2024-01-15T16:08:28Z

We will use the insurex (QB vertical team) sanitized pipeline for this.

ravi-kumar-pilla · 2024-04-24T02:01:01Z

Hi Team,

Update:

I reached out to Shubham from CommercialX and got one of their pipelines. He also shared a Box link that goes over the setup. I have set it up locally, and kedro viz run seems to load fairly normally, though I had to comment out the Spark session initialization step.

Observations:

If the Spark session is instantiated without using hooks, ignoring hooks by default will have no effect.
Since it is a huge pipeline, an option to align nodes horizontally or vertically would be a great help.
It is not possible to quickly filter the DAG by dataset type (for example, to see only SparkDatasets). At the moment our filter panel is limited; we should add more filter options.
The load time of the Kedro-Viz DAG is not bad (for this pipeline at least), but it might take longer due to Spark sessions. (Each step needs further investigation.)

I would like to get some help from the framework team (@SajidAlamQB, @ankatiyar, if anyone has some time) to speed up the local Spark setup and successfully execute kedro run.

Thank you

ravi-kumar-pilla · 2024-04-25T01:16:51Z

CommercialX Kedro Viz Testing -

Observations:

Populating the pipelines dict (pipelines) takes 50% of the time to start the server
Kedro Catalog creation takes up considerable time as well

Size of the data -

RUN 1 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 2.6968612670898438
Time taken to create a kedro session:: 0.44796109199523926
[04/24/24 19:43:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(

Time taken to create a kedro context:: 0.12806415557861328
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 15.315791845321655
[04/24/24 19:44:31] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(

Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to create stats dictionary:: 7.510185241699219e-05
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 19:44:33] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 19:44:34] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
Kedro Viz started successfully.

RUN 2 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.7348659038543701
Time taken to create a kedro session:: 0.2879657745361328
[04/24/24 19:59:22] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(

Time taken to create a kedro context:: 0.12883210182189941
Time taken to create a kedro session store:: 0.0
Time taken to create a kedro catalog:: 13.26403284072876
[04/24/24 19:59:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(

Time taken to create pipeline dictionary:: 21.121844053268433
Time taken to create stats dictionary:: 6.508827209472656e-05
Time taken to load kedro project data:: 36.5377631187439
Time taken to populate pipelines:: 1.1920928955078125e-06
[04/24/24 19:59:57] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.4388270378112793
Time taken to start uvicorn server:: 37.98678135871887
Kedro Viz started successfully.

Immediate RUN 3 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.6473729610443115
Time taken to create a kedro session:: 0.2387540340423584
[04/24/24 20:01:57] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(

Time taken to create a kedro context:: 0.12455415725708008
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 9.044120073318481
[04/24/24 20:02:15] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(

Time taken to create pipeline dictionary:: 9.573238134384155
Time taken to create stats dictionary:: 4.982948303222656e-05
Time taken to load kedro project data:: 20.628222227096558
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 20:02:16] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 20:02:17] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3532860279083252
Time taken to start uvicorn server:: 21.99152898788452
Kedro Viz started successfully.
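
The timing lines in the runs above can be turned into percentages with a small parser, which makes the two observations easier to check. This is a minimal sketch: `parse_timings` and `share_of_total` are hypothetical helper names, and it assumes that the "start uvicorn server" figure is the cumulative time up to server start, which the numbers suggest but the log does not state.

```python
import re

# Timing lines copied from RUN 1 above (warnings omitted).
RUN_1_LOG = """\
Time taken to configure/bootstrap project:: 2.6968612670898438
Time taken to create a kedro session:: 0.44796109199523926
Time taken to create a kedro context:: 0.12806415557861328
Time taken to create a kedro catalog:: 15.315791845321655
Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
"""

TIMING_LINE = re.compile(r"^Time taken to (?P<stage>.+?):: (?P<seconds>[0-9.e+-]+)$")


def parse_timings(log_text: str) -> dict:
    """Map each 'Time taken to <stage>:: <seconds>' line to a float."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings[match["stage"]] = float(match["seconds"])
    return timings


def share_of_total(timings: dict, total_stage: str = "start uvicorn server") -> dict:
    """Express every stage as a percentage of the assumed cumulative total."""
    total = timings[total_stage]
    return {
        stage: 100 * seconds / total
        for stage, seconds in timings.items()
        if stage != total_stage
    }


shares = share_of_total(parse_timings(RUN_1_LOG))
for stage, percent in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{percent:5.1f}%  {stage}")
```

For RUN 1 this puts the pipeline dictionary at about 54% and the catalog at about 35% of the time to server start. The stages are nested ("load kedro project data" includes the catalog and pipeline dictionary steps), so the percentages are not expected to sum to 100.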

astrojuanlu · 2024-04-29T09:53:56Z

Populating the pipelines dict (pipelines) takes 50% of the time to start the server

Kedro Catalog creation takes up considerable time as well

Good to know. What are the next steps?

The logs are a bit difficult to read. Maybe it would help to see a flamegraph, like this kedro-org/kedro#3033 (comment)
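
A flamegraph needs an external tool. As a rougher stand-in, the standard-library `cProfile` can rank startup functions by cumulative time. In the sketch below, `load_project_data`, `load_catalog`, and `load_pipelines` are hypothetical placeholders that only sleep; they are not Kedro or Kedro-Viz functions.

```python
import cProfile
import io
import pstats
import time


def load_catalog():
    # Placeholder for catalog creation.
    time.sleep(0.02)


def load_pipelines():
    # Placeholder for building the pipelines dictionary.
    time.sleep(0.03)


def load_project_data():
    load_catalog()
    load_pipelines()


profiler = cProfile.Profile()
profiler.enable()
load_project_data()
profiler.disable()

# Collect the report in memory, sorted by cumulative time per function.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
profile_text = report.getvalue()
print(profile_text)
```

The same `Stats` object can be written to disk with `stats.dump_stats(...)` and opened in a visual profile viewer if a graphical view is preferred.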

astrojuanlu · 2024-04-29T09:55:16Z

Also notice that, while testing with internal projects is useful, for us to confidently move forward with this we will probably have to generate some open source synthetic projects to test. See kedro-org/kedro#3790 for past discussion about this

ravi-kumar-pilla · 2024-05-02T01:57:56Z

Hi @astrojuanlu, thank you for the suggestions. I tested with the tools you mentioned and also prepared some rough notes on the next steps here.

To summarize, as a first step, loading Kedro data asynchronously (async loading test branch) would help improve the Kedro-Viz load time for larger pipelines. If there are any new findings on the internal implementation of Kedro, I would be happy to discuss them in the next Tech Design session.

Thank you

astrojuanlu · 2024-05-08T12:40:54Z

Thanks @ravi-kumar-pilla. To summarize from the internal document:

Insights

It takes a long time to initialise the Kedro modules and reach the actual kedro viz run command (already sort of known, [spike] Improve Kedro CLI startup time kedro#1476)
The expensive operation before starting the viz server is loading the data from the Kedro session (possibly related to Lazy Loading of Catalog Items kedro#2829 ?)
Most of the time taken to load the data is from catalog and pipelines_dict resolution, which worsens as the pipeline count increases

Next steps

Stress test with https://github.com/noklam/kedro-example/tree/master/stress-test-pipeline and summarize the results
Check the internals of _get_catalog() and pipelines for further optimization

And if I may add, I think

we need Create QA Kedro test projects for stress testing and performance and evaluation kedro#3790 to do this properly (beyond @noklam's pipeline linked above), and
the Framework team needs to be involved.

astrojuanlu · 2024-05-08T14:27:46Z

Adding a bit more context after a quick discussion:

These performance bottlenecks affect all projects, not only large ones, because startup times for Kedro are exceedingly long, and also the data is seemingly loaded in sequence cc @yetudada
We will likely need not 1, but several "massive pipelines" to do a comprehensive performance analysis, where "massive" means
- 1 pipeline with increasingly large number of nodes (essentially Create QA Kedro test projects for stress testing and performance and evaluation kedro#3790)
- N pipelines of 1 node
- 1 pipeline and 1 node with increasingly large number of datasets
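
The three shapes listed above can be described as plain data before any real project is generated. This is a sketch under assumptions: `NodeSpec` and the three builder functions are hypothetical names, and treating "datasets" in the third shape as node inputs is a guess at the intent.

```python
from typing import Dict, List, NamedTuple


class NodeSpec(NamedTuple):
    """Minimal description of a node: its name, input datasets, output datasets."""
    name: str
    inputs: List[str]
    outputs: List[str]


def long_chain(n_nodes: int) -> List[NodeSpec]:
    """1 pipeline with n_nodes nodes, each consuming the previous node's output."""
    return [
        NodeSpec(f"node_{i}", [f"data_{i}"], [f"data_{i + 1}"])
        for i in range(n_nodes)
    ]


def many_small_pipelines(n_pipelines: int) -> Dict[str, List[NodeSpec]]:
    """N pipelines of 1 node each, with no datasets shared between them."""
    return {
        f"pipeline_{p}": [NodeSpec(f"node_{p}", [f"in_{p}"], [f"out_{p}"])]
        for p in range(n_pipelines)
    }


def wide_node(n_datasets: int) -> List[NodeSpec]:
    """1 pipeline with 1 node that reads n_datasets inputs (assumed reading of the list)."""
    return [NodeSpec("wide_node", [f"in_{i}" for i in range(n_datasets)], ["out"])]


chain = long_chain(1000)
small = many_small_pipelines(50)
wide = wide_node(200)
```

Each `NodeSpec` could then be mapped onto a `kedro.pipeline.node(...)` call with a dummy function, so the same size parameters drive both the benchmark and the generated project.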

astrojuanlu · 2024-11-07T10:38:46Z

I'm not sure there's anything else for us to do here.

We did extensive benchmarking and found that, because of how Kedro Viz waits for all the data to be ready, the main bottleneck is Kedro itself
We addressed all the issues we found, and added extensive benchmarks
We haven't found, or heard from users, that the rendering step is slow
We opened Enhancing Kedro-Viz Performance with Lazy Loading #1806 to track the possibility of making the Kedro Viz UI launch before collecting the data

Let's close this issue as completed until we have more concrete actions.

rashidakanchwala added Issue: Feature Request Testing labels Sep 14, 2022

rashidakanchwala changed the title ~~<Title>~~ Create massive pipeline to test with flowchart with on Kedro-viz Sep 14, 2022

rashidakanchwala changed the title ~~Create massive pipeline to test with flowchart with on Kedro-viz~~ Create massive pipeline to test with flowchart on Kedro-viz Sep 14, 2022

tynandebold added this to Kedro-Viz Sep 14, 2022

tynandebold moved this to Inbox in Kedro-Viz Sep 14, 2022

tynandebold moved this from Inbox to Backlog in Kedro-Viz Oct 4, 2022

tynandebold removed the Issue: Feature Request label Sep 25, 2023

rashidakanchwala added the Type: Technical Design label Nov 9, 2023

rashidakanchwala added Technical Design and removed Type: Technical Design labels Dec 5, 2023

NeroOkwa added this to the Improve large pipeline experience milestone Jan 15, 2024

NeroOkwa mentioned this issue Jan 31, 2024

Evaluating the Kedro-Viz experience for large pipelines #1726

Closed

rashidakanchwala moved this from Backlog to Todo in Kedro-Viz Apr 15, 2024

rashidakanchwala assigned ravi-kumar-pilla Apr 15, 2024

ravi-kumar-pilla moved this from Todo to In Progress in Kedro-Viz Apr 22, 2024

ravi-kumar-pilla moved this from In Progress to Todo in Kedro-Viz Apr 25, 2024

ravi-kumar-pilla moved this from Todo to In Progress in Kedro-Viz May 1, 2024

astrojuanlu mentioned this issue May 13, 2024

[Stress Testing] - Create example projects to assess Kedro performance for complex pipelines kedro-org/kedro#3866

Closed

rashidakanchwala moved this from In Progress to Backlog in Kedro-Viz Jul 4, 2024

ravi-kumar-pilla mentioned this issue Aug 30, 2024

Enhancing Kedro-Viz Performance with Lazy Loading #1806

Open

1 task

astrojuanlu closed this as completed Nov 7, 2024

github-project-automation bot moved this from Backlog to Done in Kedro-Viz Nov 7, 2024
