Identify minimal required schema changes #29

jpnavarro · 2025-01-23T16:48:50Z

The minimum necessary schema changes to merge US Solr indexes into a single ESGF2-US Globus Search Index.

Additional requirements:

Simplify access to Globus data service by splitting url = "globus:<multiple_field_values>" into separate fields. This eliminates the need to parse these values for use in Globus data service interfaces.

jpnavarro · 2025-01-23T17:03:05Z

ESGF2-US metadata schema follows WCRP specifications and will not change in ESGF-1.5 except as follows:

Index entries of type=Dataset will not change.

Index entries of type=File with a "url" attribute of "globus:|<mime_type>|" will be replaced by:

New "url" attribute value "globus_file_attributes:" (which informs client to use the following separate new attributes for a file)
New attribute: mime_type
New globus file attributes: collection (string), path (string), globus_type (string)

Change existing attribute values:
5. index_node attribute will have a value of "esgf2-us-globus-search"

Unchanged existing attribute values:
6. data_node - Public facing replica location
7. id - unique id
8. title - filename
9. format
10. checksum and checksum_type
11. size
12. all others

For values of "globus_type" see: https://docs.globus.org/api/transfer/file_operations/#file_document
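A minimal sketch of the File entry change described above, in Python. The function name `migrate_file_entry` is hypothetical, and the collection, path, and mime_type values are assumed to be supplied by the caller: the layout of the legacy "globus:<multiple_field_values>" string is not spelled out in this thread, so the sketch does not try to parse it.

```python
from copy import deepcopy

SENTINEL = "globus_file_attributes:"
INDEX_NODE = "esgf2-us-globus-search"


def migrate_file_entry(entry, collection, path, mime_type, globus_type="file"):
    """Return a copy of a type=File entry using the proposed separate attributes.

    Hypothetical helper: the Globus values are passed in by the caller rather
    than parsed out of the legacy "globus:..." url string.
    """
    new_entry = deepcopy(entry)
    urls = []
    for url in entry.get("url", []):
        if url.startswith("globus:"):
            # Replace the legacy Globus url with the sentinel value (once).
            if SENTINEL not in urls:
                urls.append(SENTINEL)
        else:
            # https and any other urls are left unchanged.
            urls.append(url)
    new_entry["url"] = urls
    if SENTINEL in urls:
        new_entry["mime_type"] = mime_type
        new_entry["collection"] = collection
        new_entry["path"] = path
        new_entry["globus_type"] = globus_type
    # The only existing attribute whose value changes.
    new_entry["index_node"] = INDEX_NODE
    return new_entry
```

All other attributes (data_node, id, title, format, checksum, size, and so on) are copied through untouched.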

jpnavarro · 2025-01-23T17:04:38Z

Example selected File entry attributes:

        "title":"bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc",
        "data_node":"eagle.alcf.anl.gov",
        "index_node":"esgf2-us-globus-search",
        "size":330810526,
        "checksum":["8f6f7f0bf0c8efc8f7ecc489323cd83bf3611fb1650f09324ecee90fcf25dc54"],
        "checksum_type":["SHA256"],
        "url": [
           "https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|application/netcdf|HTTPServer",
           "globus_file_attributes:"
        ],
        "mime_type": "application/netcdf",
        "collection": "8896f38e-68d1-4708-bce4-b1b3a3405809",
        "path": "/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/",
        "globus_type": "file",
        "id":"CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1.AERmon.bldep.gn.v20210603.bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|eagle.alcf.anl.gov",

bstrdsmkr · 2025-01-23T23:12:18Z

@jpnavarro I think maybe the formatting didn't come through above -- is the globus_file_attributes key meant to be contained in the url array? Or is it a new key at the same level as url with collection, path, and globus_type nested underneath?

Also, in looking at those docs, is there ever a valid case where globus_type is not file in an ESGF index? It doesn't seem so to me, so initially it feels like we could drop that key.

Are the new globus_attributes necessary in order to download the file via HTTPS? If so, we'd now need to differentiate the Globus HTTPS file from all other HTTPS links (previously they were distinguishable via the | delimited data at the end).

Is the consolidation down to one index in scope here? If so, we need to either tackle combining the access attributes into a nested key or arrays or face triplicate index entries (ORNL, LLNL, and ANL)

jpnavarro · 2025-01-24T14:14:33Z

@bstrdsmkr, I think I corrected the formatting.

We have agreed to merge existing Solr indexes into a single consolidated Globus Search index in a single step. These schema changes address the requirements for doing this, including a decision to make the minimal schema changes necessary to implement what will soon be a legacy ESGF-1.5 system.

The merge will involve loading a single Dataset entry into the Globus Index, most likely from LLNL, and the corresponding File entries from each of the replica locations. Each of those File index entries has its own data_node value. This means that the consolidated Index would have multiple File entries for the same file at different replica locations.
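To illustrate what multiple File entries per file could look like to a client, here is a small Python sketch that groups File entries by file. It assumes the File id has the form "<file id>|<data_node>", as in the example entry above; the function name `group_replicas` is hypothetical.

```python
from collections import defaultdict


def group_replicas(file_entries):
    """Map each file id (the part of "id" before "|") to its replica data_nodes."""
    replicas = defaultdict(list)
    for entry in file_entries:
        # Assumption: id looks like "<file id>|<data_node>".
        # If there is no "|", the whole id is used as the file id.
        file_id, _, _ = entry["id"].partition("|")
        replicas[file_id].append(entry["data_node"])
    return dict(replicas)
```

A file replicated at three locations would then map to a list of three data_node values.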

You may be correct about the globus_type. I'm confirming with the Globus folks that gave us that field what their use case is.

As I understand it, the url array represents one or more paths on the same data_node to a single file. In our current Solr indexes, Globus accessible files generally have two urls, an https one and a "globus:<bunch_of_embedded_values>".

We are proposing to not change https urls, but to replace urls of the form "globus:<bunch_of_embedded_values>" with "globus_file_attributes:" and move the <bunch_of_embedded_values> to separate new attributes: collection, path, and title. Actually, title already has the file name.

Clients will continue to be able to use the https url as they do today or, if the url array contains the value "globus_file_attributes:", look in the separate collection, path, and title fields for the values used to construct Globus transfer requests. The reason for having a url of "globus_file_attributes:" at all is to support other uses of the new attributes collection and path.
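The client behavior described above could be sketched in Python roughly as follows. The function name `access_options` and the shape of the returned dict are hypothetical; the sketch assumes https urls keep their existing "|<mime_type>|<service>" suffix and that the full file path is path plus title.

```python
SENTINEL = "globus_file_attributes:"


def access_options(entry):
    """Collect the https urls and, if advertised, the Globus location of a File entry."""
    options = {"https": [], "globus": None}
    for url in entry.get("url", []):
        if url == SENTINEL:
            # The sentinel tells the client to read the separate attributes.
            options["globus"] = {
                "collection": entry["collection"],
                # Assumption: path is the directory and title is the file name.
                "path": entry["path"].rstrip("/") + "/" + entry["title"],
            }
        elif url.startswith("https://"):
            # Drop the "|<mime_type>|<service>" suffix to get the plain url.
            options["https"].append(url.split("|", 1)[0])
    return options
```

The collection and path values returned here are the ones a client would pass on to whichever Globus interface it uses.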

bstrdsmkr · 2025-01-24T15:09:35Z

@jpnavarro in order to make ~~my own~~ developers' lives easier, I'd propose keeping the necessary access information encoded as a URI. Any new fields could be added as query parameters; that way all the entries in the url array can be parsed by standard language tools. That would look something like:

        "title":"bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc",
        "data_node":"eagle.alcf.anl.gov",
        "index_node":"esgf2-us-globus-search",
        "size":330810526,
        "checksum":["8f6f7f0bf0c8efc8f7ecc489323cd83bf3611fb1650f09324ecee90fcf25dc54"],
        "checksum_type":["SHA256"],
        "url": [
           "https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|application/netcdf|HTTPServer",
           "globus://8896f38e-68d1-4708-bce4-b1b3a3405809/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc?globus_type=file|application/netcdf|Globus"
        ],
        "id":"CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1.AERmon.bldep.gn.v20210603.bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|eagle.alcf.anl.gov",
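As a rough illustration of parsing such an entry with standard tools, here is a Python sketch using `urllib.parse`. The function name `parse_globus_url` is hypothetical, and the sketch assumes the "globus://<collection>/<path>?globus_type=...|<mime_type>|<service>" layout shown in the example.

```python
from urllib.parse import parse_qs, urlsplit


def parse_globus_url(value):
    """Split a "globus://...|<mime_type>|<service>" url entry into its parts."""
    # The last two "|"-delimited fields are the mime type and service name.
    uri, mime_type, service = value.rsplit("|", 2)
    parts = urlsplit(uri)
    if parts.scheme != "globus":
        raise ValueError(f"not a globus url: {value!r}")
    query = parse_qs(parts.query)
    return {
        "collection": parts.netloc,  # collection UUID in the authority position
        "path": parts.path,          # full path including the file name
        # Assumption: globus_type defaults to "file" when not given.
        "globus_type": query.get("globus_type", ["file"])[0],
        "mime_type": mime_type,
        "service": service,
    }
```

With this encoding the https and globus entries go through the same parsing path, and any new field only adds a query parameter.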

jpnavarro · 2025-01-24T18:24:23Z

@bstrdsmkr I just updated the issue description above with a recommendation/requirement that was given to the team that proposed the schema changes: Simplify access to Globus data service by splitting url = "globus:<multiple_field_values>" into separate fields, eliminating the need to parse a string to obtain values used in various Globus data service interfaces.

It is simpler to explicitly store different values in their own fields so that clients can use the subset of fields they need for different Globus interfaces.
