Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run, and document ,the CDMLite marine workflow #68

Open
agstephens opened this issue Feb 26, 2021 · 1 comment
Open

Run, and document ,the CDMLite marine workflow #68

agstephens opened this issue Feb 26, 2021 · 1 comment

Comments

@agstephens
Copy link
Collaborator

agstephens commented Feb 26, 2021

Test run, and document, the CDMLite marine workflow to:

docs/ingest-workflow-lite-marine.md

The r3.0 marine files are located in:

/gws/nopw/j04/glamod_marine/data/marine-user-guide/v5/level2
@agstephens
Copy link
Collaborator Author

Here is a dump of what we know so far...

Re-structuring the input files

The first stage involves restructuring the input data so that the files are in a structure that can be directly loaded into PostgreSQL database partitions (using the \COPY SQL command). This involves grouping all the source decks data into single directories by year. The script is run on LOTUS and the intermediate data is written to the /work/scratch-nopw/ directory:

cd $WORK_DIR/
git clone https:// github.com/glamod/glamod-ingest 
cd glamod-ingest/
export RELEASE=r2.0
nohup ./scripts/marine/prepare-all-marine-lite.sh $RELEASE batch 2>&1 > 
prepare.marine.output.txt &

These tasks are run on the high-memory nodes (as defined in the script) and can take up 24 hours to complete. The restructuring task also includes various, functions, checks and fixes such as:

  • Merge header and observation records
  • Set "report_type" to 0
  • Set the "height_above_surface" value from input fields
  • Interpret a number of missing value indicators and set to NULL

Generating SQL scripts

After restructuring the files, a separate script is run as a set of LOTUS jobs to generate SQL scripts that will load the restructured pipe-separated (PSV) files:

cd $WORK_DIR/glamod-ingest/
./scripts/marine/create-sqls-marine.sh $RELEASE r2.0

This writes a set of commands to load entire PSV files directly in to database partitions, e.g.:

$ cat /gws/nopw/j04/c3s311a_lot2/data/ingest/marine/sql/0/load-0-1999.sql 
\cd '/work/scratch-nopw/astephen/glamod/r2.0/cdmlite/prepare/marine/1999/'
\COPY lite_2_0.observations_1999_marine_0 FROM '063-714-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '100-792-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '103-792-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '112-926-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '113-927-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '114-992-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'

Loading the data into the database partitions

The loader script is run as a single process for marine data. It is spawned under "nohup" so that it will complete even if the SSH connection to the server is interrupted:

cd $WORK_DIR/glamod-ingest/
nohup ./scripts/marine/load-marine-sql.sh 0 > load.marine.0.txt &

Log files are written to:

/gws/nopw/j04/c3s311a_lot2/data/ingest/marine/populate/

The logs can be analysed for any errors. A successful process will report the number of records copied into the database partition per PSV file, e.g.:

$ cat /gws/nopw/j04/c3s311a_lot2/data/ingest/marine/populate/load-0-1999.sql.log 
COPY 1964636
COPY 1175883
COPY 50241
COPY 4942575
COPY 172003
COPY 814021

@agstephens agstephens changed the title Document the CDMLite marine workflow Run, and document ,the CDMLite marine workflow Mar 1, 2021
@jhaigh0 jhaigh0 removed their assignment Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants