Add support for inserts via GCP cloud function and pub/sub #670

Open · wants to merge 1 commit into base: master
Conversation

jordanwillifordcruise

Added support for handle_tests_results and insert_rows to use a new insert_rows_method called gcp-cloud-function, which calls a UDF to push results into BigQuery via Pub/Sub rather than via direct insert queries. This significantly increases Elementary's capacity to insert records/test results when using BigQuery.

Here is an example configuration in dbt_project.yml. You point insert_rows_udf at a UDF that publishes to a topic (the topic is the first argument of the SELECT), and then define three Pub/Sub topics to send data to. Those topics use BigQuery subscriptions that pass records straight into the tables, which is a simple option. I also defined schemas for those topics (all of this is created in Terraform), which is not ideal because it couples this setup to schema/code changes for those tables. (A rough sketch of the topic/subscription provisioning follows the config below.)

```yaml
  # This is to prevent BQ from exceeding the query size limit, specifically used to prevent errors writing Elementary metadata.
  insert_rows_method: gcp-cloud-function
  insert_rows_udf: "_project_._dataset_.publish_to_pubsub_function"
  insert_rows_topics: {
    "data_monitoring_metrics": "projects/_project_/topics/elementary-monitoring-metrics-topic",
    "test_result_rows": "projects/_project_/topics/elementary-test-result-rows-topic",
    "elementary_test_results": "projects/_project_/topics/elementary-elementary-test-results-topic"
  }
  query_max_size: 100000
```
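
For context (not part of the PR), here is a rough sketch of what provisioning one of these topics with a BigQuery subscription could look like using the google-cloud-pubsub client instead of Terraform. The project, topic, and table names are placeholders, and the details are assumptions about this setup rather than anything the PR specifies.

```python
# Hypothetical sketch: create one Pub/Sub topic plus a BigQuery subscription
# (the PR provisions this via Terraform; all names below are placeholders).
from google.cloud import pubsub_v1

project = "_project_"
topic_id = "elementary-test-result-rows-topic"
table = "_project_._dataset_.test_result_rows"  # destination BigQuery table

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project, topic_id)
publisher.create_topic(request={"name": topic_path})

# A BigQuery subscription delivers each message straight into the table, so no
# consumer code sits between Pub/Sub and BigQuery. Schema-mapping options would
# also go in bigquery_config; the PR defines those schemas in Terraform.
subscription_path = subscriber.subscription_path(project, f"{topic_id}-bq")
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "bigquery_config": {"table": table},
    }
)
```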

This is an example for publish_to_pubsub_function:

```sql
CREATE OR REPLACE FUNCTION project.dataset.publish_to_pubsub_function(
  pubsub_topic STRING,
  json_data STRING,
  attributes STRING
)
RETURNS STRING
REMOTE WITH CONNECTION ....
OPTIONS (endpoint = ......, max_batching_rows = 500);
```
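
The Cloud Function behind that endpoint is not included in this PR, but a minimal sketch of what it might look like is below: BigQuery remote functions POST a JSON body with a `calls` array (one entry per row, here the UDF arguments `(pubsub_topic, json_data, attributes)`) and expect a `replies` array back. The function name and the assumption that `attributes` is a JSON-encoded map of string attributes are both hypothetical.

```python
# Hypothetical sketch of the Cloud Function backing the remote UDF above:
# it receives BigQuery remote-function calls and publishes each row to Pub/Sub.
import json

import functions_framework
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()


@functions_framework.http
def publish_to_pubsub(request):
    body = request.get_json()
    futures = []
    replies = []
    # Each entry in "calls" holds the UDF arguments: (pubsub_topic, json_data, attributes).
    for pubsub_topic, json_data, attributes in body["calls"]:
        attrs = json.loads(attributes) if attributes else {}
        futures.append(publisher.publish(pubsub_topic, json_data.encode("utf-8"), **attrs))
        replies.append("ok")
    # Wait for the publishes so failures surface as an HTTP error to BigQuery.
    for future in futures:
        future.result()
    # BigQuery expects exactly one reply per call.
    return {"replies": replies}
```

The max_batching_rows = 500 option in the DDL caps how many rows BigQuery sends per HTTP request, which bounds the size of each `calls` array the function has to handle.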

@haritamar
Collaborator

Hi @jordanwillifordcruise!
Thanks a lot for this contribution, and sorry for the delay in our response.

This is pretty cool! I am a bit worried that this setup would be too complicated for most users, though maybe it's fine as an advanced flag.
It would help, though, if you could provide instructions on how to set this up, ideally without Terraform (I'm fine with manual commands in the GCP UI), so we can reproduce it and add it to our docs.

Also - we recently merged another contribution that was related to BQ insert_rows issues. Would that help your use case?

Thanks,
Itamar

@jordanwillifordcruise
Author

I totally agree that the setup is pretty complicated. I will try to add some more content around the infra setup without Terraform.
I'm happy to see that chunking PR, and we have seen that issue too, but the one this PR fixes relates to the limit on the number of small writes to a BQ table in a day.
