
Update nightly build script to quietly publish public Parquet outputs. #3680

Merged 8 commits into main from quiet-parquet on Jun 19, 2024

Conversation

@zaneselvans (Member) commented Jun 18, 2024

Overview

  • Publish Parquet files to the public GCS & S3 buckets when the builds succeed.
  • Create a pudl_parquet.zip archive containing all the Parquet outputs that we can point Kaggle at.
  • Switch from gzip to zip for compressing the SQLite files, because Windows can natively open only zip archives and we don't want Windows users to have to download a third-party archive/compression utility.

Closes #3678

Testing

  • Is there an easy way to give this a test besides doing a whole deployment?


@zaneselvans added the output (Exporting data from PUDL into other platforms or interchange formats), cloud (Stuff that has to do with adapting PUDL to work in cloud computing context), parquet (Issues related to the Apache Parquet file format which we use for long tables), and kaggle (Sharing our data and analysis with the Kaggle community) labels Jun 18, 2024
@zaneselvans zaneselvans requested a review from jdangerx June 18, 2024 00:27
@zaneselvans zaneselvans self-assigned this Jun 18, 2024
gzip --verbose "$PUDL_OUTPUT"/*.sqlite && \
# Grab hourly tables which are only written to Parquet for distribution
cp "$PUDL_OUTPUT"/parquet/*__hourly_*.parquet "$PUDL_OUTPUT" && \
cd "$PUDL_OUTPUT" && \
@zaneselvans (Member, Author) commented:

Go into the directory with the files to avoid capturing their directory paths in the zip archive.
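
For context, a minimal sketch of that path-capturing behavior (paths are hypothetical, assuming PUDL_OUTPUT=/home/user/pudl_output):

# Zipping from outside the directory records the directory path in the archive
# (zip strips the leading "/" from absolute paths but keeps the rest):
zip /tmp/pudl.sqlite.zip "$PUDL_OUTPUT/pudl.sqlite"   # entry: home/user/pudl_output/pudl.sqlite
# Zipping from inside the directory records only the bare filename:
cd "$PUDL_OUTPUT" && zip pudl.sqlite.zip pudl.sqlite  # entry: pudl.sqlite
# "zip -j" (junk paths) would also drop the directory components.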

Comment on lines +212 to +216
for file in *.sqlite; do
echo "Compressing $file" && \
zip "$file.zip" "$file" && \
rm "$file"
done
@zaneselvans (Member, Author) commented:

Is this for loop going to break the chain of && dependencies and mess up our exit codes?

@jdangerx (Member) replied:

I don't think so: as far as I know, the && operator just glues two things together with a boolean AND. Though I'm not sure whether there will simply be a syntax error here.

One way around this would be to just pull the compression logic into its own function.

@zaneselvans (Member, Author) replied:

I thought what was being ANDed together was the exit codes? So if they're all 0, then the exit code for the whole chain of commands ends up being 0, but if any of them is non-zero, then the remaining commands are not executed and the exit code of the chain ends up being whatever non-zero value the failing command returned?

@jdangerx (Member) replied:

Yeah, I think the exit codes are the things that are being ANDed together. We're using the short-circuiting behavior of && to say that each expression only executes if everything to its left returned 0.

But we have several separate &&-chains, and we don't exit immediately if one of those chains has a non-zero return value. I think that's intentional, so that we can still send the Slack notifications at the end.

The body of the for loop is its own &&-chain because we don't have && after rm "$file" - so in my mental model, a failure in zip would stop us from trying to rm "$file", but we would effectively just skip over that file completely.
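
A minimal sketch of the "pull the compression logic into its own function" idea suggested above, so the loop can sit in an &&-chain like any other command (the function name and placement are illustrative, not the actual change):

# Compress each SQLite file into its own zip archive. Returns non-zero if any
# file fails to compress, so callers can chain it with && as usual.
function zip_sqlite_files() {
    local status=0
    for file in *.sqlite; do
        echo "Compressing $file" && \
        zip "$file.zip" "$file" && \
        rm "$file"
        # $? is the exit code of the last command that ran in this iteration:
        # if zip failed, rm never ran and we record the failure here.
        if [[ $? -ne 0 ]]; then
            status=1
        fi
    done
    return $status
}

# Usage in the existing chain:
# cd "$PUDL_OUTPUT" && zip_sqlite_files && ...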

@jdangerx (Member) left a review:

If we want to stick to using bash, we should use pushd/popd; it seems more robust to changes in the rest of the script.
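
For example, a rough sketch of what that could look like (the zip command is illustrative, not the actual diff):

# pushd saves the current directory on a stack and popd restores it, so the rest
# of the script doesn't depend on where this step leaves the working directory.
pushd "$PUDL_OUTPUT" > /dev/null && \
    zip pudl_parquet.zip ./*.parquet && \
    popd > /dev/null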

Apart from that, I do think this script is due for either a refactor along the lines of #3643 or even encapsulating logic within Python - that would make the testing of individual pieces of the logic much simpler.

An incremental-improvement option is to start taking some of the trickier logic and putting it in tested Python code, and calling that Python code from within this bash script. That dodges "one big scary refactor" but runs into "what if we never finish refactoring?"

If the idea of running unit tests on our deployment code without refactoring everything in one go sounds good, I think it would make sense to have pudl.deployment.clean_up_outputs_for_distribution in the main pudl package, and then a simple script in the docker/ directory that just calls that function:

#! /usr/bin/env python

from pudl.deployment import clean_up_outputs_for_distribution

if __name__ == "__main__":
    clean_up_outputs_for_distribution()

For testing with a temp dir: you could either pass PUDL_OUTPUT in as an argument from the bash script, or mock the env var to point at a temp dir in pytest. Personally I think passing in the argument is better; it's nice to push all the environment interactions as far toward the edge of your system as possible.
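
A rough sketch of the argument-passing call site in the bash script (the script name and flag are hypothetical):

# Pass the output directory explicitly so the Python code never has to read the
# environment itself; tests can call the same function with a pytest tmp_path.
python docker/clean_up_outputs.py --output-dir "$PUDL_OUTPUT"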

We could follow up with moving more and more of the functions to Python until all this bash script does is call a bunch of Python functions based on environment variables. At which point we can decide to finish the port or leave the simple bash wrapper around.


@zaneselvans (Member, Author) commented Jun 18, 2024

A bunch of the deployment complexity stems from the different needs of the different deployment targets, combined with the fact that we're trying to use the same directory to prepare all of those different outputs. Maybe it would be simpler to never touch the build outputs, and instead populate a temporary directory for each of the places the data needs to be sent, so that each one can be set up independently for the needs of its target (a rough sketch follows the list below). Those targets would include:

  • Datasette / Fly.io
  • Zenodo data release
  • Public GCS & AWS buckets (which also need to contain whatever Kaggle needs)
  • Private Parquet bucket (to be deprecated soon)
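
A rough sketch of that per-target staging idea (target names, paths, and the sync command are all illustrative):

# Leave the build outputs in "$PUDL_OUTPUT" untouched; give each target its own staging copy.
for target in datasette zenodo public_buckets parquet; do
    staging_dir=$(mktemp -d "/tmp/pudl_dist_${target}.XXXXXX")
    # Copy the build outputs, then apply target-specific compression, renaming,
    # and pruning to the staging copy only.
    cp -r "$PUDL_OUTPUT"/. "$staging_dir"
    # Finally push the staging directory to its destination, e.g.:
    # gsutil -m rsync -r "$staging_dir" "gs://example-public-bucket/"
done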

@jdangerx (Member) commented Jun 18, 2024

I also think a lot of our woes come from having one piece of logic handling too many different deployment targets.*

So I think the separate-temporary-directories thing is a good idea! Though I think we still run into testing problems because of the various 3rd party services we need to stub out. If we're still emphasizing testing the deploy code, I think that we should split clean_up... into multiple python functions and test those individually.

* Though we also have the related problem that the logic for each deployment target is spread over multiple pieces of code that are all trying to do too much.

@zaneselvans (Member, Author) commented:

I'd love to refactor this more deeply, bring it into Python, and make it testable (and avoid the need to install and depend on the gcloud and aws CLI tools), but at least for me I think that would take the better part of a work week, which feels like more resources than we want to spend on just getting a basic version of the Parquet files out. Maybe we can put the refactor into our project menu for the summer? I think it would fall under the infrastructure portion of the POSE project, if it comes through.

@zaneselvans zaneselvans marked this pull request as ready for review June 19, 2024 03:50
@zaneselvans zaneselvans requested a review from jdangerx June 19, 2024 03:52
@jdangerx (Member) replied:

> I'd love to refactor this more deeply, bring it into Python, and make it testable (and avoid the need to install and depend on the gcloud and aws CLI tools), but at least for me I think that would take the better part of a work week, which feels like more resources than we want to spend on just getting a basic version of the Parquet files out. Maybe we can put the refactor into our project menu for the summer? I think it would fall under the infrastructure portion of the POSE project, if it comes through.

I think the reason to put work into testability now would be if we think it would save us time over "try running the nightly build and hope it works" - hence my suggestion of the "incremental" (read: "half-assed") way of getting some testability just for the bit of logic you're trying to add.

I totally trust you and your judgement on what would be the fastest expected path to shipping the dang Parquet, though - you're the one doing the development!

@jdangerx (Member) left a review:

Seems like it should work; let's give it a shot!

@zaneselvans zaneselvans added this pull request to the merge queue Jun 19, 2024
@zaneselvans (Member, Author) commented:

🙈 I will be pleasantly amazed if it actually works on the first try. 🙈

Merged via the queue into main with commit b096fbc Jun 19, 2024
12 checks passed
@zaneselvans zaneselvans deleted the quiet-parquet branch June 19, 2024 18:34