[Feature Request]: Add support for Lineage Graphs in Beam's direct runner through job graph traversal #33980

Open
2 of 17 tasks
rohitsinha54 opened this issue Feb 13, 2025 · 2 comments

Comments

@rohitsinha54
Contributor

What would you like to happen?

Since version 2.63, Apache Beam officially supports tracking data lineage from various IOs.
Each IO emits its lineage information individually as metrics.

Once the lineage information has been collected from the IOs in the pipeline, the entries need to be linked to each other according to which IOs are connected in the job graph. The Dataflow backend already does this by:

  1. Traversing the job graph to identify all connected paths (note: a Beam job graph can have more than one connected component).
  2. Identifying the sources in these paths (an intermediate node can be a source too).
  3. Forming pairs of sources and sinks based on the connected nodes found above.
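
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the Dataflow implementation: the function name `link_lineage`, the step names, and the identifier strings are all hypothetical, and it assumes "connected" means that the sink is reachable downstream of the source in the directed job graph.

```python
from collections import defaultdict, deque


def link_lineage(edges, sources, sinks):
    """Pair every reported source with every sink reachable downstream of it.

    edges:   iterable of (upstream_step, downstream_step) names in the job graph.
    sources: dict mapping step name -> set of source identifiers it reported.
    sinks:   dict mapping step name -> set of sink identifiers it reported.
    """
    downstream = defaultdict(list)
    for up, down in edges:
        downstream[up].append(down)

    pairs = set()
    # Any step that reported a source is a starting point, including
    # steps in the middle of the graph.
    for start, source_ids in sources.items():
        seen = {start}
        queue = deque([start])
        while queue:
            step = queue.popleft()
            # Assumption: a step is not paired with a sink it reports itself.
            if step != start:
                for sink_id in sinks.get(step, ()):
                    pairs.update((src, sink_id) for src in source_ids)
            for nxt in downstream[step]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return pairs


# Hypothetical job graph with two connected components.
edges = [
    ("ReadA", "Join"), ("ReadB", "Join"), ("Join", "WriteC"),
    ("ReadD", "WriteE"),
]
sources = {
    "ReadA": {"table_a"},
    "ReadB": {"table_b"},
    "Join": {"lookup_table"},  # an intermediate step that is also a source
    "ReadD": {"file_d"},
}
sinks = {"WriteC": {"table_c"}, "WriteE": {"file_e"}}

pairs = link_lineage(edges, sources, sinks)
```

Because the traversal starts separately from each source step, sources in one connected component are never paired with sinks in another.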

Once the lineage information is linked, it can be served by:

  1. A simple API that, given a source or sink, shows the data lineage it is connected to.
    or
  2. Sending this information to a lineage graph visualization tool.
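
A rough sketch of what both serving options could look like once the source/sink pairs exist. The class `LineageIndex` and its methods are hypothetical names chosen for illustration; nothing here is an existing Beam API. The `to_dot` method emits Graphviz DOT text as one possible input for a visualization tool.

```python
from collections import defaultdict


class LineageIndex:
    """Hypothetical lookup structure built from (source, sink) pairs."""

    def __init__(self, pairs):
        self._sinks_by_source = defaultdict(set)
        self._sources_by_sink = defaultdict(set)
        for source, sink in pairs:
            self._sinks_by_source[source].add(sink)
            self._sources_by_sink[sink].add(source)

    def sinks_of(self, source):
        """Return the sinks that receive data from the given source."""
        return sorted(self._sinks_by_source.get(source, ()))

    def sources_of(self, sink):
        """Return the sources that feed the given sink."""
        return sorted(self._sources_by_sink.get(sink, ()))

    def to_dot(self):
        """Render the lineage as Graphviz DOT text, one edge per pair."""
        lines = ["digraph lineage {"]
        for source in sorted(self._sinks_by_source):
            for sink in sorted(self._sinks_by_source[source]):
                lines.append(f'  "{source}" -> "{sink}";')
        lines.append("}")
        return "\n".join(lines)


# Illustrative identifiers only.
index = LineageIndex([("table_a", "table_c"), ("table_b", "table_c")])
upstream_of_c = index.sources_of("table_c")
dot_text = index.to_dot()
```

Keeping both directions indexed lets the same structure answer "where does this sink's data come from" and "where does this source's data go" without re-traversing the job graph.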

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@rohitsinha54
Contributor Author

.set-labels gsoc, gsoc2025, lineage, infra, python, java, runner

Contributor

Label cannot be managed because it does not exist in the repo. Please check your spelling.
