
Set up Cloud SQL Postgres database for dagster storage #2996

Merged
merged 11 commits into from
Nov 3, 2023

Conversation

bendnorman
Member

@bendnorman bendnorman commented Oct 31, 2023

PR Overview

This PR sets up a Cloud SQL Postgres database for dagster storage in our nightly builds. This will hopefully prevent the dagster backend storage database from locking up and killing our builds.

Changes:

  • Creates a Cloud SQL Postgres instance with 2 CPUs and 8 GB of RAM. @jdangerx and I decided that handling the instance lifecycle from within the container might cause some jankiness, so we will keep the instance running 24/7, which will cost about $101 a month. We will explore running Postgres within our VM container to save cost and avoid the overhead of managing an external database (see Spin up a postgres database within our nightly build container #3003).
  • Adds logic in docker/gcp_pudl_etl.sh to start and stop the dagster-storage Cloud SQL instance and create and delete a dagster-storage-$ACTION_SHA-$GITHUB_REF database. This way, we have a fresh database for each nightly build run and won't incur Cloud SQL instance charges when we aren't running a nightly build.
  • Add static external IP addresses to the pudl-deployment-tag and pudl-deployment-dev VMs so they can be whitelisted by the Cloud SQL instance.
  • Points the Dagster deployment at the new Postgres database.
  • Passes PUDL_SETTINGS_YAML to the container. I think this var was previously set manually in the GCP Console.
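
The lifecycle logic described above can be sketched roughly like this (a hypothetical outline, not the exact contents of `docker/gcp_pudl_etl.sh`; the instance and database names come from this PR):

```shell
# Wake the Cloud SQL instance before the build starts.
gcloud sql instances patch dagster-storage --activation-policy=ALWAYS

# Create a fresh database for this run, named after the commit and ref.
DAGSTER_PG_DB="dagster-storage-$ACTION_SHA-$GITHUB_REF"
gcloud sql databases create "$DAGSTER_PG_DB" --instance=dagster-storage

# ... run the nightly build against $DAGSTER_PG_DB ...

# Tear down this run's database and let the instance spin down, so we
# only pay for Cloud SQL while a build is actually running.
gcloud sql databases delete "$DAGSTER_PG_DB" --instance=dagster-storage --quiet
gcloud sql instances patch dagster-storage --activation-policy=NEVER
```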

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@bendnorman bendnorman marked this pull request as ready for review November 1, 2023 01:49
@bendnorman bendnorman linked an issue Nov 1, 2023 that may be closed by this pull request

codecov bot commented Nov 1, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparing base (b376141) at 88.7% to head (c15e063) at 88.7%.
Report is 78 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2996   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         91      91           
  Lines      11007   11007           
=====================================
  Hits        9766    9766           
  Misses      1241    1241           
Files Coverage Δ
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/transform/ferc714.py 100.0% <100.0%> (ø)
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/output/ferc1.py 88.2% <69.2%> (ø)
src/pudl/transform/ferc1.py 96.7% <94.8%> (ø)

☔ View full report in Codecov by Sentry.

@bendnorman bendnorman requested a review from jdangerx November 1, 2023 07:05
Member

@jdangerx jdangerx left a comment


This should work, I think, though lots of aspects of the postgres setup aren't actually in this PR.

I have some blocking questions about the lifecycle management stuff - I don't think we need to fully figure out how to make it robust, but we should at least talk about that before moving on.

@@ -85,6 +85,8 @@ jobs:

# Deploy PUDL image to GCE
- name: Deploy
env:
DAGSTER_PG_PASSWORD: ${{ secrets.DAGSTER_PG_PASSWORD }}
Member


In the future we should probably store this secret in Google Secret Manager instead - but this is fine for now. We need to get our nightly builds working consistently again, so there's urgency. And, I think we're building up momentum to revamp the infra completely - what's one more piece of jankiness at this juncture?

Member Author


You're saying we should pass secrets from Google Secret Manager directly to the VM, instead of storing GCP secrets in GitHub Actions and passing them to the VM?
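
If so, a hypothetical version of that approach would have the VM fetch the secret itself at startup. This assumes a secret named `dagster-pg-password` exists in Secret Manager and the VM's service account has `roles/secretmanager.secretAccessor`:

```shell
# Hypothetical: fetch the password from Secret Manager on the VM,
# so it never has to pass through GitHub Actions at all.
DAGSTER_PG_PASSWORD="$(gcloud secrets versions access latest \
  --secret=dagster-pg-password)"
export DAGSTER_PG_PASSWORD
```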

@@ -110,6 +112,11 @@ jobs:
--container-env AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} \
--container-env AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} \
--container-env AWS_DEFAULT_REGION=${{ secrets.AWS_DEFAULT_REGION }} \
--container-env DAGSTER_PG_USERNAME="postgres" \
--container-env DAGSTER_PG_PASSWORD="$DAGSTER_PG_PASSWORD" \
--container-env DAGSTER_PG_HOST="104.154.182.24" \
Member


It might be nice to run Postgres in the container itself, so we don't have to do all the weird Cloud SQL lifecycle management stuff.

On the other hand, it is weird to run your application and database in the same container. But gcloud compute only supports one container per VM (unless you want to run docker compose manually on the VM). So I'm OK with the Cloud SQL stuff for now as well.

Member Author


Based on my experience, it seems like it's best to keep the app and database containers separate. I went with Cloud SQL because it's what dagster recommends and it seemed simpler than figuring out how to launch multiple containers.

I guess it's a little strange to spin the database up and down but I don't want to pay for the time it goes unused.
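
For reference, pointing Dagster at Postgres happens in `dagster.yaml`. The configuration Dagster documents for Postgres-backed storage looks roughly like this (a sketch; the env var names match the ones passed to the container in this PR):

```yaml
storage:
  postgres:
    postgres_db:
      username:
        env: DAGSTER_PG_USERNAME
      password:
        env: DAGSTER_PG_PASSWORD
      hostname:
        env: DAGSTER_PG_HOST
      db_name:
        env: DAGSTER_PG_DB
      port: 5432
```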

# Start dagster-storage Cloud SQL instance
gcloud sql instances patch dagster-storage --activation-policy=ALWAYS
# Create database
gcloud sql databases create $DAGSTER_PG_DB --instance=dagster-storage
Member


  1. Does this wait for the database to spin up before returning?
  2. Does create throw an error if the DB already exists? Do we expect the DB to persist data over runs, or do we want to drop everything between runs? I see we delete the database below, but what happens if we error out and get interrupted before we delete the database?

Member Author


  1. I'll double check the database is created before returning.
  2. I figured it was fine to delete the database even if the build fails because the dagster event logs aren't actually that helpful when debugging build failures.
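
One hypothetical way to guarantee cleanup even when the build errors out partway through would be a bash `trap` on EXIT in the script (a sketch, not part of this PR):

```shell
# Hypothetical sketch: register cleanup so the run's database is deleted
# and the instance spun down even if the build fails midway.
cleanup() {
    gcloud sql databases delete "$DAGSTER_PG_DB" --instance=dagster-storage --quiet
    gcloud sql instances patch dagster-storage --activation-policy=NEVER
}
trap cleanup EXIT
```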

@bendnorman
Member Author

I removed the Cloud SQL lifecycle logic from the gcp_pudl_etl.sh script.

@bendnorman bendnorman marked this pull request as ready for review November 2, 2023 18:47
@bendnorman bendnorman merged commit 18bb60f into dev Nov 3, 2023
@bendnorman bendnorman deleted the setup-dagster-postgres branch November 3, 2023 00:01
Successfully merging this pull request may close these issues.

Resolve Event database is locked error in nightly builds
2 participants