### Scripts for extraction of climateprediction.net data
- wah_extract_local.py extracts data on machines that have direct access to the data
- wah_extract_wget.py extracts data on remote machines by first downloading the zip files from the server
First clone the repository using git:

```
git clone https://github.com/pfuhe1/cpdn_extract_scripts.git
```
The scripts require Python 2.7 with the numpy and netCDF4 packages.
Miniconda is a standalone distribution of Python that lets users create environments with the packages they require on systems where they don't have control of the system Python installation.
If miniconda is already installed on a shared system (e.g. by a system administrator), create a conda environment:

```
conda create --name pyenv python=2.7
source activate pyenv
conda install -y numpy
conda install -y netCDF4
```

NOTE: you will have to run 'source activate pyenv' each time you open a new terminal before you run the scripts.
If you don't already have miniconda:

- Download the miniconda installer from https://conda.io/miniconda.html
- Install miniconda to your home directory (following the directions on the website above)
- Close and reopen your terminal so your environment is updated
- Install numpy and netCDF4 into your miniconda environment:

```
conda install numpy netCDF4
```
You should now be ready to run the python scripts.
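As a quick sanity check (this one-liner is just a suggestion, not part of the repository), confirm that both packages import in the active environment:

```
python -c "import numpy, netCDF4; print 'numpy', numpy.__version__; print 'netCDF4', netCDF4.__version__"
```

The scripts accept the command line options listed below.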
wah_extract_local.py ONLY:
- -i / --in_dir: input directory containing subfolders for each task (e.g. /gpfs/projects/cpdn/storage/boinc/upload/batch_440/successful/)
wah_extract_wget.py ONLY:
- -u / --urls_file: File containing the list of zip file URLs (the list file itself is gzipped)
BOTH SCRIPTS:
- -o / --out_dir: output directory for extracted data
- -s / --start_zip: First zip to extract
- -e / --end_zip: Last zip to extract
- -f / --fields: Comma separated list of fields to extract (see the field specification below)
- --output-freq: Output frequency of model data zips [month|year], defaults to month
- --structure: Directory structure of the extracted data (after the output directory). Options for structure are:
  - std (default): out_dir/region[_subregion]/variable/
  - startdate-dir: out_dir/region[_subregion]/variable/startdate

  where region is one of [ocean,atmos,region] (depending on the file stream) and startdate is yyyymm
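For illustration, invocations might look like the following; the output directory, urls file, zip range and field values are placeholders, and the exact quoting of the -f argument may vary with your shell (the fields syntax is described below):

```
# extract locally, reading zips straight from the batch directory
python wah_extract_local.py \
    -i /gpfs/projects/cpdn/storage/boinc/upload/batch_440/successful/ \
    -o ./extracted -s 0 -e 100 \
    -f "[ga.pd,3236,[],mean,0,400,24,mean,'']"

# extract remotely, first downloading the zips listed in a gzipped urls file
python wah_extract_wget.py \
    -u batch_440_urls.txt.gz \
    -o ./extracted --output-freq month --structure std \
    -f "[ga.pd,3236,[],mean,0,400,24,mean,'']"
```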
Each field in the -f / --fields list is specified in the format:

```
[file_stream,stash_code,[subregion],process,valid_min,valid_max,time_freq,cell_method,vert_lev]
```
where:
- file_stream = ga.pd|ga.pe|ma.pc
- stash_code = stash_section * 1000 + stash_item
- [subregion] = [lon_NW,lat_NW,lon_SW,lat_SW] or []
- process = time post-processing: min|max|mean|sum|all
- valid_min / valid_max = minimum and maximum valid values for the variable
- time_freq = input variable data frequency in hours (e.g. 24 = daily, 720 = monthly)
- cell_method = input variable time cell method: minimum|maximum|mean
- vert_lev = name of the vertical level of the input variable in the netcdf file, or ''
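As a worked example (the values here are illustrative assumptions, not taken from the repository): STASH section 3, item 236 is 1.5 m air temperature, so stash_code = 3 * 1000 + 236 = 3236, and a daily-mean extraction of that variable over the full domain from the ga.pd stream could be written as:

```
[ga.pd,3236,[],mean,0,400,24,mean,'']
```

Here [] selects the full region, 0 and 400 are loose valid_min/valid_max bounds (in kelvin), 24 marks daily input data with a mean cell method, and '' means no vertical level name is given.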
Note that older batches use a different directory structure, so the examples include a few alternative options (commented out).