Remove "options" for file locations in favor of automatic cache #46

Open
mcgibbon opened this issue Jan 11, 2020 · 1 comment

Comments

@mcgibbon
Collaborator

We should move towards a system that no longer has "options" available for data, and instead uses remote URLs that get cached locally. Motivations for this change are listed below.

This involves a few changes:

  • Replace get_default_config() with get_config(location), which accepts a remote file location. Initially, get_default_config should remain in place but emit a DeprecationWarning, and return a dictionary that uses remote paths instead of "options".
  • Remove initial_conditions and forcing keys from the configuration dictionary.
  • Rename patch_files to file_sources.
  • Allow strings as well as asset dictionaries in file_sources. A string represents a path (either a file or a directory) that is copied into the run directory, recursively for directories. Strings would be converted to asset dictionaries internally. To place files in subdirectories, users should either put the files under a subdirectory in the source location or use the more extensive asset dict representation (@oliverwm1 has found that for loops generating asset dicts make for a very smooth workflow).
  • Add keys orographic_data and field_tables to the config dict. These locations should be directories which have the same structure as the current cache locations. Orographic data should be placed in resolution subfolders and field_tables should be labelled by scheme.
  • Remove the data_table key and treat it as a forcing file.
  • Treat "default" as a filename instead of an option for diag_table.
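
The deprecation path in the first bullet could be sketched roughly as follows. This is only an illustration, not the fv3config implementation: the `loader` argument is a hypothetical stand-in for whatever fetches and parses the remote file, and `DEFAULT_CONFIG_LOCATION` is a placeholder URL, not a real bucket.

```python
import warnings
from typing import Callable, Dict

# Placeholder location for illustration only; the real default location
# is not specified in this issue.
DEFAULT_CONFIG_LOCATION = "gs://example-bucket/config/default/v1.0/fv3config.yml"


def get_config(location: str, loader: Callable[[str], Dict]) -> Dict:
    """Return the configuration dictionary stored at `location`.

    `loader` is a hypothetical callable that downloads (or reads from the
    local cache) and parses the file at `location`.
    """
    return loader(location)


def get_default_config(loader: Callable[[str], Dict]) -> Dict:
    """Deprecated shim kept in place during the transition."""
    warnings.warn(
        "get_default_config is deprecated; use get_config(location) instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this shim
    )
    return get_config(DEFAULT_CONFIG_LOCATION, loader)
```

Passing the loader in explicitly keeps the sketch runnable without network access; a real implementation would presumably call the cached get_file operation internally.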

A sample configuration dictionary (excluding the namelist) might look like the following YAML (note that this is a mockup and doesn't represent a valid run directory):

experiment_name: default_experiment
file_sources: [
    gs://vcm-fv3config/data/base_forcing/v1.1,
    gs://vcm-fv3config/config/data_table/v1.0/data_table,
    {
        source_location: gs://vcm-fv3config/data/initial_conditions/gfs_initial_conditions/v1.0,
        source_name: file.nc,
        target_location: ,
        target_name: file.nc,
        copy_method: copy,
    },
]
orographic_data: gs://vcm-fv3config/data/orographic_data/v1.0
field_tables: gs://vcm-fv3config/config/field_tables/v1.0
diag_table: gs://vcm-fv3config/config/diag_table/v1.0/diag_table
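
The conversion of string entries in file_sources into asset dictionaries might look something like the sketch below. The function names are hypothetical, the asset keys are taken from the mockup above, and the `"copy"` default for `copy_method` is an assumption. The sketch treats every string as a single path and splits it at the last `/`; expanding a directory into one asset per contained file would need a remote listing step that is left out here.

```python
import posixpath
from typing import Dict, List, Union

Asset = Dict[str, str]


def asset_from_path(path: str, target_location: str = "") -> Asset:
    """Build an asset dict for a path, keeping the same name in the target."""
    # Split "gs://bucket/dir/name" into ("gs://bucket/dir", "name").
    source_location, source_name = posixpath.split(path.rstrip("/"))
    return {
        "source_location": source_location,
        "source_name": source_name,
        "target_location": target_location,
        "target_name": source_name,
        "copy_method": "copy",  # assumed default
    }


def normalize_file_sources(file_sources: List[Union[str, Asset]]) -> List[Asset]:
    """Replace string entries with asset dicts; pass asset dicts through."""
    assets = []
    for entry in file_sources:
        if isinstance(entry, str):
            assets.append(asset_from_path(entry))
        elif isinstance(entry, dict):
            assets.append(dict(entry))  # shallow copy, leave the input untouched
        else:
            raise TypeError(f"unsupported file_sources entry: {entry!r}")
    return assets
```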

Motivations:

  • @oliverwm1 has found it useful to use the patch_files feature to specify all of the input data. He'd like to be able to disable the "initial_conditions" option by setting it to "None", but this seems a little hackish. The problem stems from the fact that initial conditions can be provided in two ways (patch_files or initial_conditions).
  • @nbren12 has pointed out that it is not ideal to have strings behave as file paths under some conditions and as lookup keys under others.
  • In moving to the new fv3atm repo, we found that changes are necessary in the forcing data structure and in the model configuration. I have opted to use remote paths instead of changing or adding new "options" for data, alongside the automatic caching offered by #45 (Add local caching to get_file operation).
  • We are also looking at releasing a "for public use" Docker implementation of the model soon. The current set-up with built-in options is not very publicly digestible, because it is unclear what a "default" configuration should be, and this "default" configuration is also model-version dependent and likely to break. The "option" method doesn't have great ways to version the data.
@mcgibbon
Collaborator Author

When we do this, file_sources should always be a list, even if it contains a single item.
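
A check along these lines could enforce that rule. It is a minimal sketch: the function name is hypothetical, and treating a missing file_sources key as an empty list is an assumption, not something decided in this thread.

```python
from typing import Any, Dict, List


def get_file_sources(config: Dict[str, Any]) -> List[Any]:
    """Return config["file_sources"], insisting that it is a list."""
    # Assumption: a missing key means "no extra files".
    file_sources = config.get("file_sources", [])
    if not isinstance(file_sources, list):
        raise TypeError(
            "file_sources must be a list, even for a single item; "
            f"got {type(file_sources).__name__}"
        )
    return file_sources
```

Rejecting a bare string up front avoids the surprise of iterating over it character by character later on.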
