Environment Forking for Flexible Data Source Configuration #4298
Replies: 4 comments
-
Hi @pascalwhoop, thanks for bringing these suggestions, they seem worth exploring!
-
Turning this into a discussion 🙏🏼 Let's continue the conversation there.
-
About |
Beta Was this translation helpful? Give feedback.
-
About |
Beta Was this translation helpful? Give feedback.
-
We're considering adding a feature to allow more flexible configuration of data sources across environments. The primary use case is to enable testing part of the pipeline using production data without needing to copy data manually. Thought I'd share here to see if others find this useful as well.
Proposed Features:
Environment Forking Flag:
Example:
```
kedro run --from-nodes a,b,c --fork-from prod --env dev
```
This would read initial datasets from the 'prod' environment and then execute the rest of the pipeline in the 'dev' environment.
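To make the forking idea concrete, here is a minimal sketch of how a `--fork-from` flag could resolve the catalog: free pipeline inputs come from the fork-source environment, everything else from the target environment. The catalogs below are plain dicts and `fork_catalog` is a hypothetical helper, not Kedro's actual API.

```python
# Illustrative only: catalogs modelled as plain dicts of dataset configs.
prod_catalog = {
    "a": {"filepath": "s3://prod-bucket/a.parquet"},
    "b": {"filepath": "s3://prod-bucket/b.parquet"},
    "c": {"filepath": "s3://prod-bucket/c.parquet"},
}
dev_catalog = {
    "a": {"filepath": "data/dev/a.parquet"},
    "b": {"filepath": "data/dev/b.parquet"},
    "c": {"filepath": "data/dev/c.parquet"},
    "d": {"filepath": "data/dev/d.parquet"},
}

def fork_catalog(fork_from: dict, env: dict, free_inputs: set) -> dict:
    """Read the run's free inputs from the fork-source environment;
    read and write everything else in the target environment."""
    merged = dict(env)
    for name in free_inputs:
        if name in fork_from:
            merged[name] = fork_from[name]
    return merged

# Running `--from-nodes a,b,c` means datasets a and b are free inputs.
catalog = fork_catalog(prod_catalog, dev_catalog, free_inputs={"a", "b"})
print(catalog["a"]["filepath"])  # s3://prod-bucket/a.parquet
print(catalog["d"]["filepath"])  # data/dev/d.parquet
```

The key design question is which datasets count as "initial": here it is the run's free inputs, so intermediate and output datasets never touch prod.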
Dataset Copying Command:
Example:
```
kedro copy --datasets a,b,c --from prod --to dev
```
This would manually copy specified datasets from 'prod' to 'dev' environment before running the pipeline.
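Under the hood, such a command would presumably just load each named dataset via the source environment's catalog and save it via the target's. A toy sketch, where `Catalog` is a stand-in for Kedro's `DataCatalog`:

```python
class Catalog:
    """Minimal stand-in for a Kedro DataCatalog (illustrative only)."""

    def __init__(self, store: dict):
        self.store = store

    def load(self, name):
        return self.store[name]

    def save(self, name, data):
        self.store[name] = data

def copy_datasets(names, source: Catalog, target: Catalog) -> None:
    """Copy each named dataset from the source env to the target env."""
    for name in names:
        target.save(name, source.load(name))

prod = Catalog({"a": [1, 2], "b": [3], "c": [4, 5, 6]})
dev = Catalog({})
copy_datasets(["a", "b", "c"], source=prod, target=dev)
print(dev.load("a"))  # [1, 2]
```

A real implementation would round-trip through each dataset's I/O layer, so copying large datasets could be slow; a flag for a direct storage-level copy might be worth considering.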
Inverse Tag Filtering:
Example:
```
kedro run --without-tags tag1,tag2
```
This would filter out nodes based on tags, the inverse of the existing `--tags` option.

Use Case:
Current Limitations:
Potential Implementation:
Long-term Consideration:
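The inverse tag filter proposed above amounts to keeping only the nodes that carry none of the excluded tags. A self-contained sketch, where `Node` is a minimal stand-in for a pipeline node rather than Kedro's own class:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal stand-in for a pipeline node (illustrative only)."""
    name: str
    tags: set = field(default_factory=set)

def without_tags(nodes, excluded):
    """Keep nodes that have no tag in common with the excluded set
    (the complement of the existing --tags selection)."""
    excluded = set(excluded)
    return [n for n in nodes if not (n.tags & excluded)]

pipeline = [
    Node("a", {"tag1"}),
    Node("b", {"tag2", "tag3"}),
    Node("c", {"tag3"}),
]
print([n.name for n in without_tags(pipeline, ["tag1", "tag2"])])  # ['c']
```

One open question is how this should compose with `--tags` if both are passed; treating `--without-tags` as a filter applied after all other selections would keep the semantics predictable.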