Add Google Dataflow docs #3148

Open
wants to merge 24 commits into base: main

24 commits
7cd2c2b
add google dataflow docs
BentsiLeviav Jan 27, 2025
80803a9
Fix extra space
BentsiLeviav Jan 27, 2025
44017ea
exclude words from spell check
BentsiLeviav Jan 27, 2025
067f256
Merge branch 'main' into add-dataflow-docs
BentsiLeviav Jan 27, 2025
a11a2ef
rename folder name to images
BentsiLeviav Jan 28, 2025
22c4444
Merge remote-tracking branch 'origin/add-dataflow-docs' into add-data…
BentsiLeviav Jan 28, 2025
316bbad
put settings in ``
BentsiLeviav Jan 28, 2025
3ec5b59
put settings in ``
BentsiLeviav Jan 28, 2025
83007ea
add a file level specific exclusion
BentsiLeviav Jan 28, 2025
a09a844
add troubleshooting section
BentsiLeviav Jan 28, 2025
4474f6f
Update docs/en/integrations/data-ingestion/google-dataflow/dataflow.md
BentsiLeviav Jan 28, 2025
74db9c4
replacing dataflow based on alphabet
BentsiLeviav Jan 28, 2025
d4d64f0
Update docs/en/integrations/index.mdx
BentsiLeviav Jan 28, 2025
c67d425
Merge remote-tracking branch 'origin/add-dataflow-docs' into add-data…
BentsiLeviav Jan 28, 2025
5afe3a8
move links from headers
BentsiLeviav Jan 28, 2025
a522b2c
Add github tracking issues for gcs and pubsub
BentsiLeviav Jan 28, 2025
ab21da2
add beam parameters and link dataflow parameters to it
BentsiLeviav Jan 28, 2025
adaa9f8
change links to relative
BentsiLeviav Feb 5, 2025
c7fd488
apply spelling for specific file
BentsiLeviav Feb 5, 2025
1cebe75
add more words
BentsiLeviav Feb 5, 2025
7b2d278
Merge branch 'main' into add-dataflow-docs
BentsiLeviav Feb 5, 2025
fc9a448
add more words
BentsiLeviav Feb 5, 2025
1bafb12
Merge remote-tracking branch 'origin/add-dataflow-docs' into add-data…
BentsiLeviav Feb 5, 2025
f41d9df
remove problematic chars
BentsiLeviav Feb 5, 2025
31 changes: 31 additions & 0 deletions docs/en/integrations/data-ingestion/google-dataflow/dataflow.md
@@ -0,0 +1,31 @@
---
sidebar_label: Integrating Dataflow with ClickHouse
slug: /en/integrations/google-dataflow/dataflow
sidebar_position: 1
description: Users can ingest data into ClickHouse using Google Dataflow
---

# Integrating Google Dataflow with ClickHouse

[Google Dataflow](https://cloud.google.com/dataflow) is a fully managed stream and batch data processing service. It supports pipelines written in Java or Python and is built on the Apache Beam SDK.

There are two main ways to use Google Dataflow with ClickHouse, both of which leverage the [`ClickHouseIO` Apache Beam connector](../../apache-beam):

## 1. **[Java Runner](./java-runner)**
The Java Runner allows users to implement custom Dataflow pipelines with the Apache Beam SDK's `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements.
However, this option requires knowledge of Java programming and familiarity with the Apache Beam framework.

### Key Features:
- High degree of customization.
- Ideal for complex or advanced use cases.
- Requires coding and understanding of the Beam API.

## 2. **[Predefined Templates](./templates)**
ClickHouse offers predefined templates designed for specific use cases, such as importing data from BigQuery into ClickHouse. These templates are ready-to-use and simplify the integration process, making them an excellent choice for users who prefer a no-code solution.

### Key Features:
- No Beam coding required.
- Quick and easy setup for simple use cases.
- Also suitable for users with minimal programming expertise.

Both approaches are fully compatible with Google Cloud and the ClickHouse ecosystem, offering flexibility depending on your technical expertise and project requirements.
20 changes: 20 additions & 0 deletions docs/en/integrations/data-ingestion/google-dataflow/java-runner.md
@@ -0,0 +1,20 @@
---
sidebar_label: Java Runner
slug: /en/integrations/google-dataflow/java-runner
sidebar_position: 2
description: Users can ingest data into ClickHouse using Google Dataflow Java Runner
---

# Dataflow Java Runner

The Dataflow Java Runner lets you execute custom Apache Beam pipelines on Google Cloud's Dataflow service. This approach provides maximum flexibility and is well-suited for advanced ETL workflows.

## How It Works

1. **Pipeline Implementation**
   To use the Java Runner, you need to implement your Beam pipeline using `ClickHouseIO`, the official ClickHouse Apache Beam connector. For code examples and usage instructions, see [ClickHouse Apache Beam](../../apache-beam).

2. **Deployment**
Once your pipeline is implemented and configured, you can deploy it to Dataflow using Google Cloud's deployment tools. Comprehensive deployment instructions are provided in the [Google Cloud Dataflow documentation - Java Pipeline](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-java).

**Note**: This approach assumes familiarity with the Beam framework and coding expertise. If you prefer a no-code solution, consider using [ClickHouse's predefined templates](./templates).
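For orientation, the pipeline implementation described in step 1 can be sketched as follows. This is a minimal illustration, not the official example: it assumes the Beam SDK and its ClickHouse IO module are on the classpath, and the table name `events` and its two-column schema are hypothetical placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.clickhouse.ClickHouseIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.Row;

public class ClickHousePipeline {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // Rows written through ClickHouseIO must carry a Beam schema whose
    // fields match the columns of the target ClickHouse table.
    Schema schema = Schema.builder()
        .addStringField("id")
        .addInt64Field("value")
        .build();
    Row row = Row.withSchema(schema).addValues("a", 1L).build();

    pipeline
        .apply("CreateSample", Create.of(row).withRowSchema(schema))
        // Hypothetical connection string and table name; adjust to your setup.
        .apply("WriteToClickHouse",
            ClickHouseIO.<Row>write("jdbc:clickhouse://localhost:8123/default", "events"));

    // Launching with --runner=DataflowRunner (plus project/region options)
    // runs this same pipeline on Google Dataflow instead of locally.
    pipeline.run().waitUntilFinish();
  }
}
```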
30 changes: 30 additions & 0 deletions docs/en/integrations/data-ingestion/google-dataflow/templates.md
@@ -0,0 +1,30 @@
---
sidebar_label: Templates
slug: /en/integrations/google-dataflow/templates
sidebar_position: 3
description: Users can ingest data into ClickHouse using Google Dataflow Templates
---

# Google Dataflow Templates

Google Dataflow templates provide a convenient way to execute prebuilt, ready-to-use data pipelines without the need to write custom code. These templates are designed to simplify common data processing tasks and are built using [Apache Beam](https://beam.apache.org/), leveraging connectors like `ClickHouseIO` for seamless integration with ClickHouse databases. By running these templates on Google Dataflow, you can achieve highly scalable, distributed data processing with minimal effort.




## Why Use Dataflow Templates?

- **Ease of Use**: Templates eliminate the need for coding by offering preconfigured pipelines tailored to specific use cases.
- **Scalability**: Dataflow ensures your pipeline scales efficiently, handling large volumes of data with distributed processing.
- **Cost Efficiency**: Pay only for the resources you consume, with the ability to optimize pipeline execution costs.

## How to Run Dataflow Templates

Currently, the official ClickHouse template can be run via the Google Cloud CLI or the Dataflow REST API.
For detailed step-by-step instructions, refer to the [Google Dataflow Run Pipeline From a Template Guide](https://cloud.google.com/dataflow/docs/templates/provided-templates).
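As a rough orientation, a CLI launch of a Dataflow Flex Template has the following shape. This is a sketch only: the template bucket path and the parameter names below are illustrative placeholders, not guaranteed values — consult the template guide linked above for the exact flags.

```shell
# Launch a Dataflow Flex Template job from the CLI (illustrative values only).
# The --template-file-gcs-location path and the --parameters keys shown here
# are placeholders; check the BigQuery To ClickHouse template docs for the
# real bucket path and parameter names.
gcloud dataflow flex-template run "bigquery-to-clickhouse-$(date +%Y%m%d%H%M%S)" \
  --region us-central1 \
  --template-file-gcs-location "gs://<template-bucket>/<template-metadata-file>" \
  --parameters jdbcUrl="jdbc:clickhouse://<host>:8443/<database>?ssl=true" \
  --parameters clickHouseTable="<target-table>" \
  --parameters inputTableSpec="<project>:<dataset>.<table>"
```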


## List of ClickHouse Templates
* [BigQuery To ClickHouse](./templates/bigquery-to-clickhouse)
* GCS To ClickHouse (coming soon!)
* Pub/Sub To ClickHouse (coming soon!)

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/en/integrations/index.mdx
@@ -242,6 +242,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
|------|----|----------------|------------------|-------------|
|Apache Airflow|<img src={require('./images/logos/logo_airflow.png').default} className="image" alt="Airflow logo" style={{width: '3rem', 'backgroundColor': 'transparent'}}/>|Data ingestion|Open-source workflow management platform for data engineering pipelines|[Github](https://github.com/bryzgaloff/airflow-clickhouse-plugin)|
|Apache Beam|<img src={require('./images/logos/logo_beam.png').default} className="image" alt="Beam logo" style={{width: '3rem', 'backgroundColor': 'transparent'}}/>|Data ingestion|Open source, unified model and set of language-specific SDKs for defining and executing data processing workflows. Compatible with Google Dataflow.|[Documentation](https://clickhouse.com/docs/en/integrations/apache-beam),<br/>[Examples](https://github.com/ClickHouse/clickhouse-beam-connector/)|
|Google Dataflow|<img src={require('./images/logos/dataflow_logo.png').default} className="image" alt="Dataflow logo" style={{width: '3rem', 'backgroundColor': 'transparent'}}/>|Data ingestion|Google Dataflow is a serverless service for running batch and streaming data pipelines using Apache Beam.|[Documentation](/docs/en/integrations/google-dataflow/dataflow)|
|Apache InLong|<img src={require('./images/logos/logo_inlong.png').default} className="image" alt="InLong logo" style={{width: '3rem', 'backgroundColor': 'transparent'}}/>|Data ingestion|One-stop integration framework for massive data|[Documentation](https://inlong.apache.org/docs/data_node/load_node/clickhouse)|
|Apache NiFi|<img src={require('./images/logos/logo_nifi.png').default} className="image" alt="NiFi logo" style={{width: '3rem', 'backgroundColor': 'transparent'}}/>|Data ingestion|Automates the flow of data between software systems|[Documentation](/docs/en/integrations/nifi)|
|Apache SeaTunnel|<img src={require('./images/logos/logo_seatunnel.png').default} className="image" alt="SeaTunnel logo" style={{width: '3rem', 'backgroundColor': 'transparent'}}/>|Data ingestion|SeaTunnel is a very easy-to-use ultra-high-performance distributed data integration platform|[Website](https://seatunnel.apache.org/docs/2.3.0/connector-v2/sink/Clickhouse)|
2 changes: 1 addition & 1 deletion scripts/aspell-ignore/en/aspell-dict.txt
@@ -3452,4 +3452,4 @@ znode
znodes
zookeeperSessionUptime
zstd
Okta
Okta
@@ -0,0 +1,4 @@
DataFlow
Dataflow
DataflowTemplates
GoogleSQL
22 changes: 22 additions & 0 deletions sidebars.js
@@ -868,6 +868,28 @@ const sidebars = {
"en/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse",
"en/integrations/data-ingestion/aws-glue/index",
"en/integrations/data-ingestion/etl-tools/apache-beam",
{
type: "category",
label: "Google Dataflow",
className: "top-nav-item",
collapsed: true,
collapsible: true,
items: [
"en/integrations/data-ingestion/google-dataflow/dataflow",
"en/integrations/data-ingestion/google-dataflow/java-runner",
"en/integrations/data-ingestion/google-dataflow/templates",
{
type: "category",
label: "Dataflow Templates",
className: "top-nav-item",
collapsed: true,
collapsible: true,
items: [
"en/integrations/data-ingestion/google-dataflow/templates/bigquery-to-clickhouse",
],
},
],
},
"en/integrations/data-ingestion/etl-tools/dbt/index",
"en/integrations/data-ingestion/etl-tools/dlt-and-clickhouse",
"en/integrations/data-ingestion/etl-tools/fivetran/index",