Run, and document ,the CDMLite marine workflow #68

agstephens · 2021-02-26T09:30:17Z

Test run, and document, the CDMLite marine workflow to:

docs/ingest-workflow-lite-marine.md

The r3.0 marine files are located in:

/gws/nopw/j04/glamod_marine/data/marine-user-guide/v5/level2

The text was updated successfully, but these errors were encountered:

agstephens · 2021-02-26T09:39:36Z

Here is a dump of what we know so far...

Re-structuring the input files

The first stage involves restructuring the input data so that the files are in a structure that can be directly loaded into PostgreSQL database partitions (using the \COPY SQL command). This involves grouping all the source decks data into single directories by year. The script is run on LOTUS and the intermediate data is written to the /work/scratch-nopw/ directory:

cd $WORK_DIR/
git clone https:// github.com/glamod/glamod-ingest 
cd glamod-ingest/
export RELEASE=r2.0
nohup ./scripts/marine/prepare-all-marine-lite.sh $RELEASE batch 2>&1 > 
prepare.marine.output.txt &

These tasks are run on the high-memory nodes (as defined in the script) and can take up 24 hours to complete. The restructuring task also includes various, functions, checks and fixes such as:

Merge header and observation records
Set "report_type" to 0
Set the "height_above_surface" value from input fields
Interpret a number of missing value indicators and set to NULL

Generating SQL scripts

After restructuring the files, a separate script is run as a set of LOTUS jobs to generate SQL scripts that will load the restructured pipe-separated (PSV) files:

cd $WORK_DIR/glamod-ingest/
./scripts/marine/create-sqls-marine.sh $RELEASE r2.0

This writes a set of commands to load entire PSV files directly in to database partitions, e.g.:

$ cat /gws/nopw/j04/c3s311a_lot2/data/ingest/marine/sql/0/load-0-1999.sql 
\cd '/work/scratch-nopw/astephen/glamod/r2.0/cdmlite/prepare/marine/1999/'
\COPY lite_2_0.observations_1999_marine_0 FROM '063-714-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '100-792-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '103-792-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '112-926-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '113-927-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'
\COPY lite_2_0.observations_1999_marine_0 FROM '114-992-1999-r2.0-000000.psv' WITH CSV HEADER DELIMITER AS '|' NULL AS 'NULL'

Loading the data into the database partitions

The loader script is run as a single process for marine data. It is spawned under "nohup" so that it will complete even if the SSH connection to the server is interrupted:

cd $WORK_DIR/glamod-ingest/
nohup ./scripts/marine/load-marine-sql.sh 0 > load.marine.0.txt &

Log files are written to:

/gws/nopw/j04/c3s311a_lot2/data/ingest/marine/populate/

The logs can be analysed for any errors. A successful process will report the number of records copied into the database partition per PSV file, e.g.:

$ cat /gws/nopw/j04/c3s311a_lot2/data/ingest/marine/populate/load-0-1999.sql.log 
COPY 1964636
COPY 1175883
COPY 50241
COPY 4942575
COPY 172003
COPY 814021

agstephens assigned jhaigh0 Feb 26, 2021

agstephens changed the title ~~Document the CDMLite marine workflow~~ Run, and document ,the CDMLite marine workflow Mar 1, 2021

jhaigh0 removed their assignment Oct 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run, and document ,the CDMLite marine workflow #68

Run, and document ,the CDMLite marine workflow #68

agstephens commented Feb 26, 2021 •

edited

Loading

agstephens commented Feb 26, 2021

Run, and document ,the CDMLite marine workflow #68

Run, and document ,the CDMLite marine workflow #68

Comments

agstephens commented Feb 26, 2021 • edited Loading

agstephens commented Feb 26, 2021

Re-structuring the input files

Generating SQL scripts

Loading the data into the database partitions

agstephens commented Feb 26, 2021 •

edited

Loading