Update nightly build script to quietly publish public Parquet outputs. #3680
Conversation
docker/gcp_pudl_etl.sh (Outdated)
```shell
gzip --verbose "$PUDL_OUTPUT"/*.sqlite && \
# Grab hourly tables which are only written to Parquet for distribution
cp "$PUDL_OUTPUT"/parquet/*__hourly_*.parquet "$PUDL_OUTPUT" && \
cd "$PUDL_OUTPUT" && \
```
Go into the directory with the files to avoid capturing their directory paths in the zip archive.
```shell
for file in *.sqlite; do
    echo "Compressing $file" && \
    zip "$file.zip" "$file" && \
    rm "$file"
done
```
Is this for loop going to break the chain of `&&` dependencies and mess up our exit codes?
I don't think so: the `&&` operator just glues two things together with a boolean AND, as far as I know. Though I'm not sure whether there will simply be a syntax error here.

One way around this would be to pull the compression logic into its own function.
I thought what was being ANDed together was the exit codes? So if they're all 0, the exit code for the whole chain ends up being 0; but if any of them is non-zero, the remaining commands are not executed, and the exit code ends up being whatever non-zero value the failing command returned?
Yeah, I think the exit codes are the things that are being ANDed together. We're using the short-circuiting behavior of `&&` to say that each expression only executes if everything to its left returns 0.

But we have several separate `&&`-chains, and we don't exit immediately if one of the chains itself has a non-0 return value. I think that's intentional, so we can make the Slack notifications at the end.

The body of the for loop is its own `&&`-chain because we don't have `&&` after `rm "$file"` - so in my mental model, a failure in `zip` would stop us from trying to `rm "$file"`, but we would effectively just skip over that file completely.
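To check that mental model, here's a runnable sketch (not from the PR; the file names are made up, and a plain `[ … ]` test stands in for `zip` failing on one file):

```shell
#!/usr/bin/env bash
# Sketch (not from the PR). Each loop iteration is its own &&-chain: a
# failure mid-chain skips the rest of THAT iteration, but the loop keeps
# going. The loop's own exit status is only the exit status of whatever
# ran last, i.e. the final iteration.

compress_all() {
    for file in "$@"; do
        # The test below stands in for `zip` failing on the file "bad".
        echo "processing $file" && \
        [ "$file" != "bad" ] && \
        echo "compressed $file"
    done
}

compress_all good bad || echo "failure in the last iteration surfaces: $?"
compress_all bad good || echo "never printed: an early failure is masked by a later success"
```

So a mid-loop failure doesn't abort the script, but it also doesn't reliably surface in the loop's exit code.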
If we want to stick to using bash, we should use `pushd`/`popd`; it seems more robust to changes in the rest of the script.
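For illustration, a sketch of what that could look like (the temp dir and `gzip` here are stand-ins; the real script would use `$PUDL_OUTPUT` and `zip`):

```shell
#!/usr/bin/env bash
# Sketch of the pushd/popd suggestion: enter the output directory only for
# the duration of the compression step, so file names land in the archives
# without their directory paths, then return to wherever the rest of the
# script expects to be. The temp dir stands in for "$PUDL_OUTPUT".
outdir=$(mktemp -d)
touch "$outdir/demo.sqlite"

pushd "$outdir" > /dev/null
for file in *.sqlite; do
    # `gzip` stands in for the script's `zip`, which may not be installed.
    echo "Compressing $file" && \
    gzip "$file"
done
popd > /dev/null
```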
Apart from that, I do think this script is due for either a refactor along the lines of #3643 or even encapsulating the logic in Python - that would make testing individual pieces of the logic much simpler.
An incremental-improvement option is to start taking some of the trickier logic and putting it in tested Python code, and calling that Python code from within this bash script. That dodges "one big scary refactor" but runs into "what if we never finish refactoring?"
If the idea of running unit tests on our deployment code without refactoring everything in one go sounds good, I think it would make sense to have `pudl.deployment.clean_up_outputs_for_distribution` in the main `pudl` package, and then a simple script in the `docker/` directory that just calls that function:

```python
#! /usr/bin/env python
from pudl.deployment import clean_up_outputs_for_distribution

if __name__ == "__main__":
    clean_up_outputs_for_distribution()
```
For testing with a temp dir: you could either pass in `PUDL_OUTPUT` as an argument in the bash script, or mock the env var to point at a temp dir in `pytest`. Personally I think passing in the argument is better; it's nice to push all the environment interactions as far toward the edge of your system as possible.
We could follow up with moving more and more of the functions to Python until all this bash script does is call a bunch of Python functions based on environment variables. At which point we can decide to finish the port or leave the simple bash wrapper around.
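As a purely illustrative sketch (the function name comes from the suggestion above, but this signature and body are made up, not real PUDL code), passing the directory in explicitly might look like:

```python
# Illustrative sketch only: not the real PUDL implementation.
import zipfile
from pathlib import Path


def clean_up_outputs_for_distribution(pudl_output: Path) -> None:
    """Zip each SQLite file in-place within the given output directory.

    Because the directory is a parameter rather than being read from the
    PUDL_OUTPUT env var inside the function, a test can point it at a
    temp dir with no environment mocking.
    """
    for db in sorted(pudl_output.glob("*.sqlite")):
        with zipfile.ZipFile(f"{db}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
            # arcname keeps directory paths out of the archive.
            zf.write(db, arcname=db.name)
        db.unlink()
```

A pytest test could then call `clean_up_outputs_for_distribution(tmp_path)` directly, while the bash script passes `"$PUDL_OUTPUT"` in, keeping the env-var read at the edge.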
A bunch of the deployment complexity stems from the different needs of the different deployment targets, combined with the fact that we're trying to use the same directory to prepare all of those different outputs. Maybe it would be simpler to never touch the build outputs, and instead populate a temporary directory for each of the different places that data needs to be sent, such that each of them can be independently set up for the needs of that target? Which would include:
I also think a lot of our woes come from having one piece of logic handling too many different deployment targets.* So I think the separate-temporary-directories thing is a good idea! Though I think we still run into testing problems because of the various 3rd-party services we need to stub out. If we're still emphasizing testing the deploy code, I think that we should split

* Though also we have "the logic for each deployment target is spread over multiple pieces of logic that are all trying to do too much."
I'd love to refactor this more deeply, bring it into Python, make it testable (and avoid the need to install and depend on the …)
I think the reason to put work into testability now would be if we think that saves us time over "try running the nightly build and hope it works" - hence me suggesting the "incremental" (read "half-assed") way of getting some testability just for the bit of logic you're trying to add. I totally trust you and your judgement on what would be the fastest expected path to shipping the dang Parquet, though - you're the one doing the development!
Seems like it should work, let's give it a shot!
🙈 I will be pleasantly amazed if it actually works on the first try. 🙈
Overview

- A `pudl_parquet.zip` archive containing all the Parquet outputs that we can point Kaggle at.
- Use `zip` instead of `gzip` to compress SQLite files, because Windows can only open `zip` archives and we don't want Windows users to have to download a 3rd-party archive/compression utility.

Closes #3678
Testing
To-do list