Update nightly build script to quietly publish public Parquet outputs. #3680

Merged
merged 8 commits into from
Jun 19, 2024
24 changes: 18 additions & 6 deletions docker/gcp_pudl_etl.sh
@@ -208,10 +208,21 @@ function merge_tag_into_branch() {

 function clean_up_outputs_for_distribution() {
     # Compress the SQLite DBs for easier distribution
-    gzip --verbose "$PUDL_OUTPUT"/*.sqlite && \
-    # Grab hourly tables which are only written to Parquet for distribution
-    cp "$PUDL_OUTPUT"/parquet/*__hourly_*.parquet "$PUDL_OUTPUT" && \
-    # Remove all other parquet output, which we are not yet distributing.
+    pushd "$PUDL_OUTPUT" && \
+    for file in *.sqlite; do
+        echo "Compressing $file" && \
+        zip "$file.zip" "$file" && \
+        rm "$file"
+    done
Comment on lines +212 to +216
Member Author

Is this for loop going to break the chain of && dependencies and mess up our exit codes?

Member

I don't think so: as far as I know, the && operator just joins two commands with a boolean AND. I'm not sure whether this will simply be a syntax error, though.

One way around this would be to pull the compression logic into its own function.
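
A minimal sketch of what that could look like. This is not code from the PR: the function name `compress_sqlite_dbs` and its directory argument are hypothetical, and it assumes `zip` is on the PATH. The body runs in a subshell, so the `cd` does not leak into the caller and no `pushd`/`popd` pair is needed.

```shell
# Hypothetical refactor, not code from this PR.
# Zips each *.sqlite file in the directory given as $1 and deletes the
# original only after its zip succeeded. Returns 0 if every file was
# handled cleanly and 1 otherwise.
compress_sqlite_dbs() (
    cd "$1" || exit 1
    status=0
    for file in *.sqlite; do
        # If the glob matched nothing, $file is the literal pattern: skip it.
        [ -e "$file" ] || continue
        echo "Compressing $file"
        # Record a failure of either zip or rm, but keep processing files.
        zip "$file.zip" "$file" && rm "$file" || status=1
    done
    exit "$status"
)
```

Because the function returns a single exit code, it could sit in the existing chain like any other command, e.g. `compress_sqlite_dbs "$PUDL_OUTPUT" && \`.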

Member Author

I thought what was being ANDed together was the exit codes? So if they're all 0, then the exit code for the whole chain of commands ends up being 0, but if any of them is non-zero, then the remaining commands are not executed, and the exit code ends up being whatever non-zero value the failing command returned?

Member

Yeah, I think the exit codes are the things that are being ANDed together. We're using the short-circuiting behavior of && to say that each expression only executes if everything to its left returns 0.
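
The short-circuit behavior can be checked with a small standalone sketch. The `step` helper is hypothetical and exists only for this illustration: it prints a label and returns the exit code it is given.

```shell
# Hypothetical helper: prints its label and returns the given exit code.
step() {
    echo "ran $1"
    return "$2"
}

# Every command returns 0, so the whole chain returns 0.
step a 0 && step b 0 && step c 0
all_ok=$?

# "b" returns 3: "c" is never run, and the chain returns 3.
out=$(step a 0 && step b 3 && step c 0)
failed=$?

echo "all_ok=$all_ok failed=$failed"   # prints: all_ok=0 failed=3
echo "$out"                            # prints "ran a" and "ran b", never "ran c"
```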

But we have several separate &&-chains, and we don't exit immediately if one of the chains itself has a non-0 return value. I think that's intentional so we can make the Slack notifications at the end.

The body of the for loop is its own &&-chain because we don't have && after rm "$file" - so in my mental model, a failure in zip would stop us from trying to rm "$file" but we would effectively just skip over that file completely.
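
That mental model can be tried out with stand-ins for `zip` and `rm`. The `fake_zip` and `fake_rm` functions below are hypothetical and touch no files. One further point the sketch makes visible: a `for` loop's exit status is that of the last command it executed, so a failure on any file other than the last one does not show up in the loop's status at all.

```shell
# Hypothetical stand-ins: "zipping" fails only for bad.sqlite, and "removing"
# just records the file name.
fake_zip() { [ "$1" != "bad.sqlite" ]; }
removed=""
fake_rm() { removed="$removed $1"; }

for file in good.sqlite bad.sqlite other.sqlite; do
    echo "Compressing $file" && \
    fake_zip "$file" && \
    fake_rm "$file"
done
loop_status=$?

# bad.sqlite was skipped (never "removed"), the loop carried on to
# other.sqlite, and the loop still reports success.
echo "removed:$removed"            # prints: removed: good.sqlite other.sqlite
echo "loop status: $loop_status"   # prints: loop status: 0
```

If the failing file happened to be the last one in the list, the loop status would be non-zero instead, so the status on its own is not a reliable signal that every file was compressed.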

+    popd && \
+    # Create a zip file of all the parquet outputs for distribution on Kaggle
+    # Don't try to compress the already compressed Parquet files with Zip.
+    pushd "$PUDL_OUTPUT/parquet" && \
+    zip -0 "$PUDL_OUTPUT/pudl_parquet.zip" ./*.parquet && \
+    # Move the individual parquet outputs to the output directory for direct access
+    mv ./*.parquet "$PUDL_OUTPUT" && \
+    popd && \
+    # Remove any remaining files and directories we don't want to distribute
     rm -rf "$PUDL_OUTPUT/parquet" && \
     rm -f "$PUDL_OUTPUT/metadata.yml"
 }
@@ -277,8 +288,9 @@ if [[ $ETL_SUCCESS == 0 ]]; then
     # Copy cleaned up outputs to the S3 and GCS distribution buckets
     copy_outputs_to_distribution_bucket | tee -a "$LOGFILE"
     DISTRIBUTION_BUCKET_SUCCESS=${PIPESTATUS[0]}
-    # TODO: this currently just makes a sandbox release, for testing. Should be
-    # switched to production and only run on push of a version tag eventually.
+    # Remove individual parquet outputs and distribute just the zipped parquet
+    # archives on Zenodo, due to their number of files limit
+    rm -f "$PUDL_OUTPUT"/*.parquet && \
     # Push a data release to Zenodo for long term accessibility
     zenodo_data_release "$ZENODO_TARGET_ENV" 2>&1 | tee -a "$LOGFILE"
     ZENODO_SUCCESS=${PIPESTATUS[0]}