Apache Beam 2.63 and later officially supports tracking data lineage from various IOs.
Each IO does this by emitting its own lineage information as metrics.
Once the lineage information has been collected from the IOs in a pipeline, the entries need to be linked to each other according to which IOs are connected in the job graph. The Dataflow backend already does this by:
Traversing the job graph to identify all the connected paths (note: a Beam job graph can have more than one connected component).
Identifying the sources in these paths (an intermediate node can be a source too).
Forming pairs of sources and sinks based on the connected nodes found above.
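The three steps above can be sketched in plain Python. This is only an illustration of the idea, not how the Dataflow backend is actually implemented: the function name `link_lineage`, the step names, and the lineage names are all made up for the example, and it assumes the job graph is available as a list of directed edges together with the lineage names each step reported.

```python
from collections import defaultdict, deque


def link_lineage(edges, source_fqns, sink_fqns):
    """Return sorted (source, sink) lineage pairs for a job graph.

    edges: iterable of (upstream_step, downstream_step) names.
    source_fqns / sink_fqns: dicts mapping a step name to the set of
    lineage names that step reported as sources / sinks.
    All names here are hypothetical, chosen for illustration only.
    """
    downstream = defaultdict(list)
    for upstream, step in edges:
        downstream[upstream].append(step)

    pairs = set()
    for start, sources in source_fqns.items():
        # Breadth-first walk over every step reachable from this source step.
        seen = {start}
        queue = deque([start])
        while queue:
            step = queue.popleft()
            # Any sink reported on a reachable step is linked to the source.
            for sink in sink_fqns.get(step, ()):
                for source in sources:
                    pairs.add((source, sink))
            for nxt in downstream.get(step, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return sorted(pairs)


# Example graph with two connected components; "Enrich" is an
# intermediate step that also reports a source (e.g. a lookup table).
edges = [("ReadA", "Enrich"), ("Enrich", "WriteX"), ("ReadC", "WriteY")]
source_fqns = {
    "ReadA": {"bigquery:proj.ds.a"},
    "Enrich": {"bigtable:proj.inst.lookup"},
    "ReadC": {"kafka:topic_c"},
}
sink_fqns = {
    "WriteX": {"bigquery:proj.ds.x"},
    "WriteY": {"gcs:bucket.y"},
}
pairs = link_lineage(edges, source_fqns, sink_fqns)
```

In this sketch, sources and sinks in different connected components are never paired, and a source reported by an intermediate step is linked only to sinks downstream of that step.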
Once the lineage information is linked, it can be served by:
A simple API that, given a source and a sink connected to each other, shows the data lineage.
or
Sending this information to a lineage graph visualization tool.
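Both serving options could start from the same list of linked pairs. The following is a minimal sketch under that assumption: the `LineageIndex` class, its method names, and the sample lineage names are hypothetical and not part of Beam. It offers a lookup in each direction and an export to Graphviz DOT text, chosen here only as one example of a format a visualization tool could consume.

```python
from collections import defaultdict


class LineageIndex:
    """In-memory lookup over linked (source, sink) pairs (illustrative only)."""

    def __init__(self, pairs):
        self._sinks_of = defaultdict(set)
        self._sources_of = defaultdict(set)
        for source, sink in pairs:
            self._sinks_of[source].add(sink)
            self._sources_of[sink].add(source)

    def downstream(self, source):
        """Sinks that the given source feeds into."""
        return sorted(self._sinks_of.get(source, ()))

    def upstream(self, sink):
        """Sources that the given sink is fed from."""
        return sorted(self._sources_of.get(sink, ()))

    def to_dot(self):
        """Render the pairs as a Graphviz DOT digraph, one edge per pair."""
        lines = ["digraph lineage {"]
        for source in sorted(self._sinks_of):
            for sink in sorted(self._sinks_of[source]):
                lines.append(f'  "{source}" -> "{sink}";')
        lines.append("}")
        return "\n".join(lines)


# Made-up pairs, as a linking step might produce them.
index = LineageIndex([
    ("bigquery:proj.ds.a", "bigquery:proj.ds.x"),
    ("bigtable:proj.inst.lookup", "bigquery:proj.ds.x"),
    ("kafka:topic_c", "gcs:bucket.y"),
])
dot_text = index.to_dot()
```

Keeping the index separate from the export means the same linked data can back either an API endpoint or a push to an external tool.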
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Infrastructure
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner