Identify minimal required schema changes #29

Open
jpnavarro opened this issue Jan 23, 2025 · 6 comments

Comments

@jpnavarro
Collaborator

jpnavarro commented Jan 23, 2025

The minimum necessary schema changes to merge US Solr indexes into a single ESGF2-US Globus Search Index.

Additional requirements:

  1. Simplify access to Globus data service by splitting url = "globus:<multiple_field_values>" into separate fields. This eliminates the need to parse these values for use in Globus data service interfaces.
@jpnavarro
Collaborator Author

jpnavarro commented Jan 23, 2025

ESGF2-US metadata schema follows WCRP specifications and will not change in ESGF-1.5 except as follows:

Index entries of type=Dataset will not change.

Index entries of type=File with a "url" attribute of "globus:|<mime_type>|" will be replaced by:

  1. New "url" attribute value "globus_file_attributes:" (which informs client to use the following separate new attributes for a file)
  2. New attribute: mime_type
  3. New globus file attributes: collection (string), path (string), globus_type (string)

Change existing attribute values:
5. index_node attribute will have a value of "esgf2-us-globus-search"

Unchanged existing attribute values:
6. data_node - Public facing replica location
7. id - unique id
8. title - filename
9. format
10. checksum and checksum_type
11. size
12. all others

For values of "globus_type" see: https://docs.globus.org/api/transfer/file_operations/#file_document
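
A minimal Python sketch of the changes listed above, for illustration only and not the project's migration code. The function name `apply_schema_changes` is hypothetical, and the collection, path, and mime_type values are assumed to be supplied already split out, since the exact layout of the legacy "globus:" url is not spelled out here.

```python
import copy

NEW_INDEX_NODE = "esgf2-us-globus-search"
MARKER = "globus_file_attributes:"


def apply_schema_changes(entry, collection, path, mime_type, globus_type="file"):
    """Return a copy of a type=File entry with the proposed changes applied.

    collection, path, and mime_type are passed in already separated; how they
    are extracted from the legacy "globus:..." url is deliberately left out.
    """
    new_entry = copy.deepcopy(entry)
    # Replace each legacy "globus:..." url with the marker; keep other urls as-is.
    new_entry["url"] = [
        MARKER if u.startswith("globus:") else u for u in entry.get("url", [])
    ]
    # New attributes.
    new_entry["mime_type"] = mime_type
    new_entry["collection"] = collection
    new_entry["path"] = path
    new_entry["globus_type"] = globus_type
    # Changed existing attribute.
    new_entry["index_node"] = NEW_INDEX_NODE
    # Everything else (data_node, id, title, checksum, size, ...) is untouched.
    return new_entry
```

Returning a copy rather than mutating the input keeps the original Solr document available for comparison during a merge.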

@jpnavarro
Collaborator Author

jpnavarro commented Jan 23, 2025

Example selected File entry attributes:

        "title":"bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc",
        "data_node":"eagle.alcf.anl.gov",
        "index_node":"esgf2-us-globus-search",
        "size":330810526,
        "checksum":["8f6f7f0bf0c8efc8f7ecc489323cd83bf3611fb1650f09324ecee90fcf25dc54"],
        "checksum_type":["SHA256"],
        "url": [
           "https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|application/netcdf|HTTPServer",
           "globus_file_attributes:"
        ],
        "mime_type": "application/netcdf",
        "collection": "8896f38e-68d1-4708-bce4-b1b3a3405809",
        "path": "/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/",
        "globus_type": "file",
        "id":"CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1.AERmon.bldep.gn.v20210603.bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|eagle.alcf.anl.gov",

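For readers unfamiliar with the url array in the example above: the https value carries three "|"-separated parts, while the new marker value carries none. A small Python sketch of reading such values follows; the helper name `split_url_value` and the key names it returns are illustrative, not part of the schema.

```python
def split_url_value(value):
    """Split 'location|mime_type|service' into its parts.

    Values without the trailing parts (such as "globus_file_attributes:")
    get None for the missing pieces.
    """
    parts = value.split("|")
    parts += [None] * (3 - len(parts))  # pad short values; no-op when already 3+
    location, mime_type, service = parts[:3]
    return {"location": location, "mime_type": mime_type, "service": service}
```
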
@bstrdsmkr

@jpnavarro I think maybe the formatting didn't come through above -- is the globus_file_attributes key meant to be contained in the url array? Or is it a new key at the same level as url, with collection, path, and globus_type nested underneath?

Also, looking at those docs, is there ever a valid case where globus_type is not file in an ESGF index? It doesn't seem so to me, so initially it feels like we could drop that key?

Are the new globus_attributes necessary in order to download the file via HTTPS? If so, we'd now need to differentiate the Globus HTTPS file from all other HTTPS links (previously they were distinguishable via the | delimited data at the end).

Is the consolidation down to one index in scope here? If so, we need to either tackle combining the access attributes into a nested key or arrays or face triplicate index entries (ORNL, LLNL, and ANL)

@jpnavarro
Collaborator Author

jpnavarro commented Jan 24, 2025

@bstrdsmkr, I think I corrected the formatting.

We have agreed to merge existing Solr indexes into a single consolidated Globus Search index in a single step. These schema changes address the requirements for doing this, including a decision to make the minimal schema changes necessary to implement what will soon be a legacy esgf-1.5 system.

The merge will involve loading a single Dataset entry into the Globus Index, most likely from LLNL, and the corresponding File entries from each of the replica locations. Each of those File index entries has its own data_node value. This means that the consolidated index would have multiple File entries for the same file at different replica locations.
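
To illustrate what multiple File entries per file means for a client, here is a hedged Python sketch that groups File entries by file and lists the replica data_node values. It assumes the File id has the form "<file_id>|<data_node>", as in the example entry earlier in this thread; the function name `replicas_by_file` is hypothetical.

```python
from collections import defaultdict


def replicas_by_file(file_entries):
    """Map each file identifier to the data_node values that hold a replica."""
    grouped = defaultdict(list)
    for entry in file_entries:
        # Assumption: id is "<file_id>|<data_node>"; keep the part before "|".
        file_id = entry["id"].split("|", 1)[0]
        grouped[file_id].append(entry["data_node"])
    return dict(grouped)
```

With three replica locations, a single file identifier would map to three data_node values.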

You may be correct about the globus_type. I'm confirming with the Globus folks that gave us that field what their use case is.

As I understand it, the url array represents one or more paths on the same data_node to a single file. In our current Solr indexes, Globus accessible files generally have two urls, an https one and a "globus:<bunch_of_embedded_values>".

We are proposing not to change https urls, but to replace urls of the form "globus:<bunch_of_embedded_values>" with "globus_file_attributes:" and move the <bunch_of_embedded_values> to separate new attributes: collection, path, and title. Actually, title already has the file name.

Clients will continue to be able to use the https url as they do today or, if the url array contains the value "globus_file_attributes:", look in the separate collection, path, and title fields for the values used to construct Globus transfer requests. The reason for having a url of "globus_file_attributes:" at all is to support other uses of the new attributes collection and path.
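
A minimal Python sketch of that client-side check, assuming path holds the directory and title the file name, as in the example entry. The function name `globus_source` is hypothetical, and submitting the actual transfer through a Globus client is not shown.

```python
GLOBUS_MARKER = "globus_file_attributes:"


def globus_source(entry):
    """Return (collection, full_path) when the entry advertises Globus access.

    Returns None when the url array does not contain the marker value, in
    which case the client would fall back to the https url.
    """
    if GLOBUS_MARKER not in entry.get("url", []):
        return None
    # Assumption: path is the containing directory and title the file name.
    full_path = entry["path"].rstrip("/") + "/" + entry["title"]
    return entry["collection"], full_path
```

The returned pair is the kind of input a transfer request needs: a source collection and a source path.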

@bstrdsmkr

@jpnavarro, to make my own developers' lives easier, I'd propose keeping the necessary access information encoded as a URI. Any new fields could be added as query parameters, so that all the entries in the url array can be parsed by standard language tools. That would look something like:

"title":"bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc",
"data_node":"eagle.alcf.anl.gov",
"index_node":"esgf2-us-globus-search",
"size":330810526,
"checksum":["8f6f7f0bf0c8efc8f7ecc489323cd83bf3611fb1650f09324ecee90fcf25dc54"],
"checksum_type":["SHA256"],
"url": [
    "https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|application/netcdf|HTTPServer",
    "globus://8896f38e-68d1-4708-bce4-b1b3a3405809/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc?globus_type=file|application/netcdf|Globus"
],
"id":"CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1.AERmon.bldep.gn.v20210603.bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|eagle.alcf.anl.gov",

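As a sketch of what parsing by standard language tools could look like for the globus:// value above, using only Python's standard library. The "globus" scheme and the query parameter come from this proposal rather than any registered standard, and the function name `parse_globus_uri` is hypothetical. Note that the trailing "|mime_type|service" suffix still has to be split off first, since it is not part of the URI itself.

```python
from urllib.parse import parse_qs, urlsplit


def parse_globus_uri(value):
    """Parse 'globus://<collection>/<path>?<query>|<mime_type>|<service>'."""
    # Separate the URI from the "|"-delimited suffix used in the url array.
    uri, _, rest = value.partition("|")
    mime_type, _, service = rest.partition("|")

    parts = urlsplit(uri)
    if parts.scheme != "globus":
        raise ValueError("not a globus:// URI: " + uri)

    query = parse_qs(parts.query)
    return {
        "collection": parts.netloc,  # collection UUID sits in the authority slot
        "path": parts.path,          # full path, including the file name
        "globus_type": query.get("globus_type", [None])[0],
        "mime_type": mime_type or None,
        "service": service or None,
    }
```
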
@jpnavarro
Collaborator Author

@bstrdsmkr I just updated the issue description above with a recommendation/requirement that was given to the team that proposed the schema changes: Simplify access to Globus data service by splitting url = "globus:<multiple_field_values>" into separate fields, eliminating the need to parse a string to obtain values used in various Globus data service interfaces.

It is simpler to explicitly store different values in their own fields so that clients can use the subset of fields they need for different Globus interfaces.

2 participants