Dataflow BigQuery to ClickHouse template
The BigQuery to ClickHouse template is a batch pipeline that ingests data from a BigQuery table into a ClickHouse table. The template can either read the entire table or read specific records using a provided query.
Pipeline requirements
- The source BigQuery table must exist.
- The target ClickHouse table must exist.
- The ClickHouse host must be accessible from the Dataflow worker machines.
Template Parameters
Parameter Name | Parameter Description | Required | Notes |
---|---|---|---|
`jdbcUrl` | The ClickHouse JDBC URL in the format `jdbc:clickhouse://{host}:{port}/{schema}`. | ✅ | Don't add the username and password as JDBC options. Any other JDBC option can be appended to the end of the JDBC URL. For ClickHouse Cloud users, add `ssl=true&sslmode=NONE` to the `jdbcUrl`. |
`clickHouseUsername` | The ClickHouse username to authenticate with. | ✅ | |
`clickHousePassword` | The ClickHouse password to authenticate with. | ✅ | |
`clickHouseTable` | The target ClickHouse table to insert the data into. | ✅ | |
`maxInsertBlockSize` | The maximum block size for insertion, if we control the creation of blocks for insertion. | | A ClickHouseIO option. |
`insertDistributedSync` | If enabled, an INSERT query into a distributed table waits until the data has been sent to all nodes in the cluster. | | A ClickHouseIO option. |
`insertQuorum` | For INSERT queries into a replicated table, wait for writes to the specified number of replicas and linearize the addition of the data. 0 - disabled. | | A ClickHouseIO option. This setting is disabled in the default server settings. |
`insertDeduplicate` | For INSERT queries into a replicated table, specifies that deduplication of inserted blocks should be performed. | | A ClickHouseIO option. |
`maxRetries` | The maximum number of retries per insert. | | A ClickHouseIO option. |
`inputTableSpec` | The BigQuery table to read from, for example `<BIGQUERY_PROJECT>:<DATASET_NAME>.<INPUT_TABLE>`. Specify either `inputTableSpec` or `query`. When both are set, the `query` parameter takes precedence. | | Reads data directly from BigQuery storage using the BigQuery Storage Read API. Be aware of the Storage Read API limitations. |
`outputDeadletterTable` | The BigQuery table for messages that failed to reach the output table, for example `<PROJECT_ID>:<DATASET_NAME>.<DEADLETTER_TABLE>`. If the table doesn't exist, it is created during pipeline execution. If not specified, `<outputTableSpec>_error_records` is used. | | You can inspect failed rows with the example below. |
`query` | The SQL query to use to read data from BigQuery, for example `SELECT * FROM sampledb.sample_table`. If the BigQuery dataset is in a different project than the Dataflow job, specify the full dataset name in the SQL query, for example `<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>`. Defaults to GoogleSQL unless `useLegacySql` is `true`. | | You must specify either `inputTableSpec` or `query`. If you set both parameters, the template uses the `query` parameter. |
`useLegacySql` | Set to `true` to use legacy SQL. This parameter only applies when using the `query` parameter. Defaults to `false`. | | |
`queryLocation` | Needed when reading from an authorized view without the underlying table's permission, for example `US`. | | |
`queryTempDataset` | An existing dataset in which to create the temporary table that stores the results of the query, for example `temp_dataset`. | | |
`KMSEncryptionKey` | If reading from BigQuery using the query source, use this Cloud KMS key to encrypt any temporary tables created, for example `projects/your-project/locations/global/keyRings/your-keyring/cryptoKeys/your-key`. | | |
The default values for all ClickHouseIO parameters can be found in the ClickHouseIO Apache Beam connector documentation.
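If some rows fail to insert, you can inspect the dead-letter table with the `bq` CLI. A minimal sketch, assuming the default `<outputTableSpec>_error_records` naming; the project, dataset, and table names below are placeholders:

```bash
# Hypothetical dead-letter table name; substitute the table your job used.
bq query --use_legacy_sql=false \
  'SELECT * FROM `my_project.my_dataset.my_table_error_records` LIMIT 10'
```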
Source and Target Table Schemas
To load the BigQuery dataset into ClickHouse effectively, the template performs a column-matching process with the following phases:
- The template builds a schema object based on the target ClickHouse table.
- The template iterates over the BigQuery dataset and tries to match columns by name.
Consequently, your BigQuery dataset (either a table or a query) must have exactly the same column names as your ClickHouse target table.
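For example, if the column names differ, you can alias them in the `query` parameter so they match the ClickHouse table. A hedged sketch; the table and column names are made up for illustration:

```bash
# Suppose the ClickHouse target table has columns (user_id, signup_date),
# but the BigQuery source exposes (id, created_at). Aliasing the columns in
# the template's `query` parameter makes the names line up:
QUERY='SELECT id AS user_id, DATE(created_at) AS signup_date FROM my_project.my_dataset.users'

# Optionally validate the query before launching the Dataflow job.
bq query --use_legacy_sql=false --dry_run "$QUERY"
```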
Data Types Mapping
The BigQuery types are converted based on your ClickHouse table definition. The table below lists the recommended mappings for your target ClickHouse table (for a given BigQuery table or query):
BigQuery Type | ClickHouse Type | Notes |
---|---|---|
Array Type | Array Type | The inner type must be one of the supported primitive data types listed in this table. |
Boolean Type | Bool Type | |
Date Type | Date Type | |
Datetime Type | Datetime Type | |
String Type | String Type | Works as well with `Enum8`, `Enum16`, and `FixedString`. |
Numeric - Integer Types | Integer Types | In BigQuery, all integer types (`INT`, `SMALLINT`, `INTEGER`, `BIGINT`, `TINYINT`, `BYTEINT`) are aliases for `INT64`. We recommend setting the right integer size in ClickHouse, because the template converts the column based on the defined column type (`Int8`, `Int16`, `Int32`, `Int64`). The template also converts unsigned integer types if they are used in the ClickHouse table (`UInt8`, `UInt16`, `UInt32`, `UInt64`). |
Numeric - Float Types | Float Types | Supported ClickHouse types: `Float32` and `Float64`. |
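As an illustration, here is a hedged example of a target table that follows the mappings above; the table and column names are hypothetical, and the `Int32` choice assumes the source values are known to fit:

```bash
# Hypothetical target table following the recommended mappings:
# INT64 -> Int32 (values known to fit), BOOL -> Bool, STRING -> String,
# DATE -> Date, FLOAT64 -> Float64, ARRAY<INT64> -> Array(Int64).
clickhouse-client --query "
  CREATE TABLE IF NOT EXISTS default.events (
      event_id   Int32,
      is_active  Bool,
      event_name String,
      event_date Date,
      score      Float64,
      tags       Array(Int64)
  ) ENGINE = MergeTree
  ORDER BY (event_date, event_id)"
```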
Running the Template
The BigQuery to ClickHouse template is available for execution via the Google Cloud CLI.
Be sure to review this document, and specifically the above sections, to fully understand the template's configuration requirements and prerequisites.
Install & Configure the gcloud CLI
- If not already installed, install the gcloud CLI.
- Follow the "Before you begin" section in this guide to set up the required configurations, settings, and permissions for running the Dataflow template.
Run Command
Use the `gcloud dataflow flex-template run` command to run a Dataflow job that uses the Flex Template.
Below is an example of the command:
gcloud dataflow flex-template run "bigquery-clickhouse-dataflow-$(date +%Y%m%d-%H%M%S)" \
--template-file-gcs-location "gs://clickhouse-dataflow-templates/bigquery-clickhouse-metadata.json" \
--parameters inputTableSpec="{bigquery table id}",jdbcUrl="jdbc:clickhouse://{clickhouse host}:{clickhouse port}/{schema}?ssl=true&sslmode=NONE",clickHouseUsername="{username}",clickHousePassword="{password}",clickHouseTable="{clickhouse target table}"
Command Breakdown
- Job Name: The text following the `run` keyword is the unique job name.
- Template File: The JSON file specified by `--template-file-gcs-location` defines the template structure and details about the accepted parameters. The mentioned file path is public and ready to use.
- Parameters: Parameters are separated by commas. For string-based parameters, enclose the values in double quotes.
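If you read with a query instead of a whole table, the same command takes the `query` parameter (and, if needed, `queryTempDataset`) in place of `inputTableSpec`. A hedged variant of the example above; the project, dataset, and table names are placeholders. Note that gcloud splits the `--parameters` value on commas, so a query containing commas needs the delimiter-escaping syntax described in `gcloud topic escaping`:

```bash
gcloud dataflow flex-template run "bigquery-clickhouse-dataflow-$(date +%Y%m%d-%H%M%S)" \
  --template-file-gcs-location "gs://clickhouse-dataflow-templates/bigquery-clickhouse-metadata.json" \
  --parameters query="SELECT * FROM my_project.my_dataset.users WHERE signup_date >= '2025-01-01'",queryTempDataset="temp_dataset",jdbcUrl="jdbc:clickhouse://{clickhouse host}:{clickhouse port}/{schema}?ssl=true&sslmode=NONE",clickHouseUsername="{username}",clickHousePassword="{password}",clickHouseTable="{clickhouse target table}"
```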
Expected Response
After running the command, you should see a response similar to the following:
job:
createTime: '2025-01-26T14:34:04.608442Z'
currentStateTime: '1970-01-01T00:00:00Z'
id: 2025-01-26_06_34_03-13881126003586053150
location: us-central1
name: bigquery-clickhouse-dataflow-20250126-153400
projectId: ch-integrations
startTime: '2025-01-26T14:34:04.608442Z'
Monitor the Job
Navigate to the Dataflow Jobs tab in your Google Cloud Console to monitor the status of the job. There you'll find the job details, including progress and any errors.
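You can also check the status from the command line. A small sketch using the `id` and `location` values from the example response above; substitute your own:

```bash
# Show the state of the job using the id and location from the run response.
gcloud dataflow jobs describe 2025-01-26_06_34_03-13881126003586053150 \
  --region=us-central1

# Or list recent jobs in the region.
gcloud dataflow jobs list --region=us-central1
```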
Troubleshooting
Code: 241. DB::Exception: Memory limit (total) exceeded
This error occurs when ClickHouse runs out of memory while processing large batches of data. To resolve this issue:
- Increase the instance resources: Upgrade your ClickHouse server to a larger instance with more memory to handle the data processing load.
- Decrease the batch size: Adjust the batch size in your Dataflow job configuration to send smaller chunks of data to ClickHouse, reducing memory consumption per batch (see the example below).
These changes might help balance resource usage during data ingestion.
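One way to reduce the batch size is through the ClickHouseIO options listed earlier, such as `maxInsertBlockSize`. A hedged sketch; the value is illustrative, not a tuned recommendation:

```bash
# Hypothetical re-run with a smaller insert block size.
gcloud dataflow flex-template run "bigquery-clickhouse-dataflow-$(date +%Y%m%d-%H%M%S)" \
  --template-file-gcs-location "gs://clickhouse-dataflow-templates/bigquery-clickhouse-metadata.json" \
  --parameters inputTableSpec="{bigquery table id}",jdbcUrl="jdbc:clickhouse://{clickhouse host}:{clickhouse port}/{schema}?ssl=true&sslmode=NONE",clickHouseUsername="{username}",clickHousePassword="{password}",clickHouseTable="{clickhouse target table}",maxInsertBlockSize=100000
```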
Template Source Code
The template's source code is available in ClickHouse's DataflowTemplates fork.