Integration

Details for integrating each pipeline are provided below, along with visual documentation.

Streaming Pipeline

This real-time streaming pipeline uses the Polygon Python library to create a WebSocket client that subscribes to configurable data streams, which deliver JSON messages every second. The subscriptions are defined in a CSV file stored in an S3 bucket. Given the large volume of real-time financial data, key design decisions include:

  • The Polygon WebSocket client is containerized with Docker and deployed on ECS Fargate, enabling scalable compute resources.
  • Kafka acts as the message broker between the Polygon WebSocket producer and the ClickHouse consumer, providing scalable, fault-tolerant real-time data flow. It also allows additional consumers to be added in the future (a rough producer sketch follows this list).
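
The sketch below is a minimal outline of that flow, assuming the boto3 and kafka-python clients and illustrative bucket, key, column, broker, and field names; it is not the repository's exact code, only an illustration of reading the subscription CSV from S3, opening the Polygon WebSocket, and forwarding each message to Kafka.

```python
import csv
import io
import json

import boto3
from kafka import KafkaProducer
from polygon import WebSocketClient

# Load the subscription list (e.g. "AM.AAPL" for per-second aggregates) from a
# CSV in S3 -- the bucket, key, and column name here are assumptions.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-subscriptions-bucket", Key="subscriptions.csv")["Body"]
subscriptions = [row["subscription"] for row in csv.DictReader(io.StringIO(body.read().decode()))]

# Kafka producer that serializes each record as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def handle_msg(msgs):
    # Forward each per-second aggregate to the incoming Kafka topic; the field
    # names follow the Polygon aggregate message model.
    for m in msgs:
        record = {
            "symbol": m.symbol,
            "open": m.open,
            "high": m.high,
            "low": m.low,
            "close": m.close,
            "volume": m.volume,
            "timestamp": m.end_timestamp,
        }
        producer.send("polygon_stream_aggregates", record)


# Blocks and processes incoming messages; this is the loop that runs inside the
# Docker container on ECS Fargate.
client = WebSocketClient(api_key="YOUR_POLYGON_KEY", subscriptions=subscriptions)
client.run(handle_msg)
```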

Two Kafka topics are used: polygon_stream_aggregates for incoming messages and custom_filtered_stock_data_topic for pre-processed data. A Kafka ClickHouse sink connector streams this processed data into ClickHouse, which stores it in a MergeTree table optimized for real-time analytics, indexed by timestamp and stock symbol. This architecture supports continuous and scalable data ingestion.
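
As an illustration of the storage layer, the MergeTree table might be defined roughly as follows using the clickhouse-connect Python driver; the table name, column names, and connection details are assumptions, not the project's actual schema.

```python
import clickhouse_connect

# Connection details are placeholders.
client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Illustrative MergeTree table for the filtered stream, ordered (i.e. indexed)
# by timestamp and stock symbol as described above.
client.command("""
    CREATE TABLE IF NOT EXISTS filtered_stock_data (
        event_time DateTime64(3),
        symbol     LowCardinality(String),
        open       Float64,
        high       Float64,
        low        Float64,
        close      Float64,
        volume     UInt64
    )
    ENGINE = MergeTree
    ORDER BY (event_time, symbol)
""")
```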

Batch Pipeline

For data ingestion, we initially considered using the Polygon Python library's REST client. However, we opted to build a custom Airbyte connector for the following reasons:

  • Scalability: Airbyte is designed for scalable, automated, and repeatable data ingestion workflows.
  • Integration: Airbyte allows seamless integration of multiple data streams.
  • Centralized orchestration: Dagster can be used to manage and orchestrate the data pipelines, particularly for batch processing.
  • Ease of data replication: Airbyte simplifies the process of replicating data from Polygon to Snowflake.

The Airbyte connector was configured to support incremental data extraction using the last_updated_utc field. Data collection was structured by date and stock, ensuring that only the most recent daily data was ingested efficiently. Initially, we planned to deploy Airbyte on an EC2 instance. Although Airbyte installed successfully on a t2.medium instance, we found that the updated version required a t2.large instance. By the time this issue was discovered and resolved, the custom Airbyte connector was already integrated into Airbyte Cloud, so we decided to continue with the cloud-based setup for ease of use and scalability.
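
For illustration only, an incremental stream in a custom connector built with the Airbyte Python CDK could look roughly like the sketch below, using Polygon's reference tickers endpoint with last_updated_utc as the cursor. The class name, parameters, and endpoint details are assumptions, pagination via the endpoint's next_url is omitted, and the stream's JSON schema would be supplied separately in the connector package.

```python
from typing import Any, Iterable, Mapping, MutableMapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class Tickers(HttpStream):
    """Hypothetical incremental stream over Polygon's reference tickers endpoint."""

    url_base = "https://api.polygon.io/"
    primary_key = "ticker"
    cursor_field = "last_updated_utc"  # incremental cursor field

    def __init__(self, api_key: str, **kwargs):
        super().__init__(**kwargs)
        self.api_key = api_key

    def path(self, **kwargs) -> str:
        return "v3/reference/tickers"

    def request_params(self, stream_state=None, **kwargs) -> MutableMapping[str, Any]:
        return {"apiKey": self.api_key, "market": "stocks", "limit": 1000}

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # next_url pagination is omitted in this sketch

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json().get("results", [])

    def get_updated_state(
        self, current_stream_state: MutableMapping[str, Any], latest_record: Mapping[str, Any]
    ) -> Mapping[str, Any]:
        # Advance the cursor so the next sync only ingests newer records.
        latest = latest_record.get(self.cursor_field, "")
        current = (current_stream_state or {}).get(self.cursor_field, "")
        return {self.cursor_field: max(latest, current)}
```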

Screenshots for Streaming Pipeline

  • Policy for S3 WebSocket Subscriptions
  • Kafka Incoming Topic from Producer
  • Kafka Outgoing Topic for Consumer
  • ClickHouse MergeTree Table

Screenshots for Batch Pipeline

  • Airbyte Custom Stream of Grouped Daily Endpoint
  • Airbyte Custom Stream of Tickers Endpoint
  • Airbyte Connection
  • Airbyte Ingestion Schema
  • Logs of Airbyte Sync