
Update nightly build script to quietly publish public Parquet outputs. #3680

Merged 8 commits into main from quiet-parquet on Jun 19, 2024

Conversation

@zaneselvans (Member) commented Jun 18, 2024

Overview

  • Publish Parquet files to the public GCS & S3 buckets when the builds succeed.
  • Create a pudl_parquet.zip archive containing all the Parquet outputs that we can point Kaggle at.
  • Switch from gzip to zip for compressing the SQLite files, because Windows can natively open only zip archives and we don't want Windows users to have to download a third-party archive/compression utility.

Closes #3678

Testing

  • Is there an easy way to give this a test besides doing a whole deployment?


@zaneselvans added the output (Exporting data from PUDL into other platforms or interchange formats), cloud (Stuff that has to do with adapting PUDL to work in cloud computing context), parquet (Issues related to the Apache Parquet file format which we use for long tables), and kaggle (Sharing our data and analysis with the Kaggle community) labels Jun 18, 2024
@zaneselvans zaneselvans requested a review from jdangerx June 18, 2024 00:27
@zaneselvans zaneselvans self-assigned this Jun 18, 2024
gzip --verbose "$PUDL_OUTPUT"/*.sqlite && \
# Grab hourly tables which are only written to Parquet for distribution
cp "$PUDL_OUTPUT"/parquet/*__hourly_*.parquet "$PUDL_OUTPUT" && \
cd "$PUDL_OUTPUT" && \
@zaneselvans (Member, Author) commented:

Go into the directory with the files to avoid capturing their directory paths in the zip archive.
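
For context, a minimal sketch of that path-capturing behavior (paths are hypothetical, assuming PUDL_OUTPUT=/home/user/pudl_output):

# Zipping from outside the directory records the directory path in the archive
# (zip strips the leading "/" from absolute paths but keeps the rest):
zip /tmp/pudl.sqlite.zip "$PUDL_OUTPUT/pudl.sqlite"   # entry: home/user/pudl_output/pudl.sqlite
# Zipping from inside the directory records only the bare filename:
cd "$PUDL_OUTPUT" && zip pudl.sqlite.zip pudl.sqlite  # entry: pudl.sqlite
# "zip -j" (junk paths) would also drop the directory components.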

Comment on lines +212 to +216
for file in *.sqlite; do
echo "Compressing $file" && \
zip "$file.zip" "$file" && \
rm "$file"
done
@zaneselvans (Member, Author) commented:

Is this for loop going to break the chain of && dependencies and mess up our exit codes?

@jdangerx (Member) replied:

I don't think so: as far as I know, the && operator just glues two things together with a boolean AND. Though I'm not sure whether there will simply be a syntax error here.

One way around this would be to just pull the compression logic into its own function.

@zaneselvans (Member, Author) replied:

I thought what was being ANDed together was the exit codes? So if they're all 0, then the exit code for the whole chain of commands ends up being 0, but if any of them is non-zero, then the remaining commands are not executed and the exit code of the chain ends up being whatever non-zero value the failing command returned?

@jdangerx (Member) replied:

Yeah, I think the exit codes are the things that are being ANDed together. We're using the short-circuiting behavior of && to say that each expression only executes if everything to its left returned 0.

But we have several separate &&-chains, and we don't exit immediately if one of those chains has a non-zero return value. I think that's intentional, so that we can still send the Slack notifications at the end.

The body of the for loop is its own &&-chain because we don't have && after rm "$file" - so in my mental model, a failure in zip would stop us from trying to rm "$file", but we would effectively just skip over that file completely.
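
A minimal sketch of the "pull the compression logic into its own function" idea suggested above, so the loop can sit in an &&-chain like any other command (the function name and placement are illustrative, not the actual change):

# Compress each SQLite file into its own zip archive. Returns non-zero if any
# file fails to compress, so callers can chain it with && as usual.
function zip_sqlite_files() {
    local status=0
    for file in *.sqlite; do
        echo "Compressing $file" && \
        zip "$file.zip" "$file" && \
        rm "$file"
        # $? is the exit code of the last command that ran in this iteration:
        # if zip failed, rm never ran and we record the failure here.
        if [[ $? -ne 0 ]]; then
            status=1
        fi
    done
    return $status
}

# Usage in the existing chain:
# cd "$PUDL_OUTPUT" && zip_sqlite_files && ...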

@jdangerx (Member) left a review:

If we want to stick to using bash, we should use pushd/popd; it seems more robust to changes in the rest of the script.
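
For example, a rough sketch of what that could look like (the zip command is illustrative, not the actual diff):

# pushd saves the current directory on a stack and popd restores it, so the rest
# of the script doesn't depend on where this step leaves the working directory.
pushd "$PUDL_OUTPUT" > /dev/null && \
    zip pudl_parquet.zip ./*.parquet && \
    popd > /dev/null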

Apart from that, I do think this script is due for either a refactor along the lines of #3643 or even encapsulating logic within Python - that would make the testing of individual pieces of the logic much simpler.

An incremental-improvement option is to start taking some of the trickier logic and putting it in tested Python code, and calling that Python code from within this bash script. That dodges "one big scary refactor" but runs into "what if we never finish refactoring?"

If the idea of running unit tests on our deployment code without refactoring everything in one go sounds good, I think it would make sense to have pudl.deployment.clean_up_outputs_for_distribution in the main pudl package, and then a simple script in the docker/ directory that just calls that function:

#! /usr/bin/env python

from pudl.deployment import clean_up_outputs_for_distribution

if __name__ == "__main__":
    clean_up_outputs_for_distribution()

For testing with a temp dir: you could either pass PUDL_OUTPUT in as an argument from the bash script, or mock the env var to point at a temp dir in pytest. Personally I think passing in the argument is better; it's nice to push all the environment interactions as far toward the edge of your system as possible.
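
A rough sketch of the argument-passing call site in the bash script (the script name and flag are hypothetical):

# Pass the output directory explicitly so the Python code never has to read the
# environment itself; tests can call the same function with a pytest tmp_path.
python docker/clean_up_outputs.py --output-dir "$PUDL_OUTPUT"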

We could follow up with moving more and more of the functions to Python until all this bash script does is call a bunch of Python functions based on environment variables. At which point we can decide to finish the port or leave the simple bash wrapper around.


@zaneselvans (Member, Author) commented Jun 18, 2024

A bunch of the deployment complexity stems from the different needs of the different deployment targets, combined with the fact that we're trying to use the same directory to prepare all of those different outputs. Maybe it would be simpler to never touch the build outputs, and instead populate a temporary directory for each of the places the data needs to be sent, so that each one can be set up independently for the needs of its target (a rough sketch follows the list below). Those targets would include:

  • Datasette / Fly.io
  • Zenodo data release
  • Public GCS & AWS buckets (which also need to contain whatever Kaggle needs)
  • Private Parquet bucket (to be deprecated soon)
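
A rough sketch of that per-target staging idea (target names, paths, and the sync command are all illustrative):

# Leave the build outputs in "$PUDL_OUTPUT" untouched; give each target its own staging copy.
for target in datasette zenodo public_buckets parquet; do
    staging_dir=$(mktemp -d "/tmp/pudl_dist_${target}.XXXXXX")
    # Copy the build outputs, then apply target-specific compression, renaming,
    # and pruning to the staging copy only.
    cp -r "$PUDL_OUTPUT"/. "$staging_dir"
    # Finally push the staging directory to its destination, e.g.:
    # gsutil -m rsync -r "$staging_dir" "gs://example-public-bucket/"
done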

@jdangerx (Member) commented Jun 18, 2024

I also think a lot of our woes come from having one piece of logic handling too many different deployment targets.*

So I think the separate-temporary-directories thing is a good idea! Though I think we still run into testing problems because of the various 3rd party services we need to stub out. If we're still emphasizing testing the deploy code, I think that we should split clean_up... into multiple python functions and test those individually.

* Though we also have the related problem that the logic for each deployment target is spread over multiple pieces of code that are all trying to do too much.

@zaneselvans (Member, Author) commented:

I'd love to refactor this more deeply, bring it into Python, and make it testable (and avoid the need to install and depend on the gcloud and aws CLI tools), but at least for me I think that would take the better part of a work week, which feels like more resources than we want to spend on just getting a basic version of the Parquet files out. Maybe we can put the refactor into our project menu for the summer? I think it would fall under the infrastructure portion of the POSE project, if it comes through.

@zaneselvans zaneselvans marked this pull request as ready for review June 19, 2024 03:50
@zaneselvans zaneselvans requested a review from jdangerx June 19, 2024 03:52
@jdangerx (Member) replied:

> I'd love to refactor this more deeply, bring it into Python, and make it testable (and avoid the need to install and depend on the gcloud and aws CLI tools), but at least for me I think that would take the better part of a work week, which feels like more resources than we want to spend on just getting a basic version of the Parquet files out. Maybe we can put the refactor into our project menu for the summer? I think it would fall under the infrastructure portion of the POSE project, if it comes through.

I think the reason to put work into testability now would be if we think it would save us time over "try running the nightly build and hope it works" - hence my suggestion of the "incremental" (read: "half-assed") way of getting some testability just for the bit of logic you're trying to add.

I totally trust you and your judgement on what would be the fastest expected path to shipping the dang Parquet, though - you're the one doing the development!

@jdangerx (Member) left a review:

Seems like it should work; let's give it a shot!

@zaneselvans zaneselvans added this pull request to the merge queue Jun 19, 2024
@zaneselvans (Member, Author) commented:

🙈 I will be pleasantly amazed if it actually works on the first try. 🙈

Merged via the queue into main with commit b096fbc Jun 19, 2024
12 checks passed
@zaneselvans zaneselvans deleted the quiet-parquet branch June 19, 2024 18:34