feat(incremental): copy multiple tables in parallel (#1237) #1413

AxelThevenot · 2024-11-25T23:06:30Z

resolves dbt-labs/dbt-adapters#559
N/A dbt-labs/docs.getdbt.com/#

Problem

Copy partitions and tables in parallel instead of sequentially which is slow for large partition management

Solution

Run jobs in parallel and waits for the results.

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

leohoare

Not a maintainer but this is a much needed feature, thanks @AxelThevenot 🙌🏻
Large datasets with a lot of partitions currently performs quite poorly.

I looked into doing this directly with one call copy_table call but it doesn't look like it's possible currently (unless we delete the partitions upfront and use write_append)

leohoare · 2025-01-13T05:35:02Z

dbt/adapters/bigquery/connections.py

+            # Runs all the copy jobs in parallel
+            for source_ref in source_ref_array:
+
+                for partition_id in partition_ids or [None]:
+                    source_ref_partition = (
+                        f"{source_ref}${partition_id}" if partition_id else source_ref
+                    )
+                    destination_ref_partition = (
+                        f"{destination_ref}${partition_id}" if partition_id else destination_ref
+                    )
+                    copy_job = client.copy_table(
+                        source_ref_partition,
+                        destination_ref_partition,
+                        job_config=CopyJobConfig(write_disposition=write_disposition),
+                        retry=self._retry.create_reopen_with_deadline(conn),
+                    )
+                    copy_jobs.append(copy_job)


Would being explicit here clarify the logic?

if partition_ids: copy_jobs = [ client.copy_table( f"{source_ref}${partition_id}", f"{destination_ref}${partition_id}", job_config=CopyJobConfig(write_disposition=write_disposition), retry=self._retry.create_reopen_with_deadline(conn), ) for partition_id in partition_ids for source_ref in source_ref_array ] else: copy_jobs = [client.copy_table( source_ref_array, destination_ref, job_config=CopyJobConfig(write_disposition=write_disposition), retry=self._retry.create_reopen_with_deadline(conn), )]

leohoare · 2025-01-13T05:37:31Z

dbt/adapters/bigquery/connections.py

+                    copy_job = client.copy_table(
+                        source_ref_partition,
+                        destination_ref_partition,
+                        job_config=CopyJobConfig(write_disposition=write_disposition),


Will there be more than one element in source_ref_partition?

If we ever be in the scenario where we have source_ref_array greater than one and write_disposition set to WRITE_TRUNCATE, we'll be overwriting the same data.

feat(incremental): copy multiple tables in parallel (#1237)

8c2d9bb

AxelThevenot requested a review from a team as a code owner November 25, 2024 23:06

leohoare reviewed Jan 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(incremental): copy multiple tables in parallel (#1237) #1413

feat(incremental): copy multiple tables in parallel (#1237) #1413

AxelThevenot commented Nov 25, 2024

leohoare left a comment

leohoare Jan 13, 2025

leohoare Jan 13, 2025

feat(incremental): copy multiple tables in parallel (#1237) #1413

Are you sure you want to change the base?

feat(incremental): copy multiple tables in parallel (#1237) #1413

Conversation

AxelThevenot commented Nov 25, 2024

Problem

Solution

Checklist

leohoare left a comment

Choose a reason for hiding this comment

leohoare Jan 13, 2025

Choose a reason for hiding this comment

leohoare Jan 13, 2025

Choose a reason for hiding this comment