Identify minimal required schema changes #29
Comments
The ESGF2-US metadata schema follows WCRP specifications and will not change in ESGF-1.5 except as follows: Index entries of type=Dataset will not change. Index entries of type=File with a "url" attribute of "globus:|<mime_type>|" will be replaced by:
Change existing attribute values: Unchanged existing attribute values: For values of "globus_type" see: https://docs.globus.org/api/transfer/file_operations/#file_document |
Example selected File entry attributes:
@jpnavarro I think maybe the formatting didn't come through above -- is the Also, in looking at those docs, is there ever a valid case where Are the new Is the consolidation down to one index in scope here? If so, we need to either combine the access attributes into a nested key or arrays, or face triplicate index entries (ORNL, LLNL, and ANL).
@bstrdsmkr, I think I corrected the formatting.

We have agreed to merge the existing Solr indexes into a single consolidated Globus Search index in a single step. These schema changes address the requirements for doing this, including a decision to make only the minimal schema changes necessary to implement what will soon be a legacy esgf-1.5 system.

The merge will involve loading a single Dataset entry into the Globus index, most likely from LLNL, and the corresponding File entries from each of the replica locations. Each of those File index entries has its own data_node value. This means that the consolidated index would have multiple File entries for the same file at different replica locations.

You may be correct about the globus_type. I'm confirming with the Globus folks who gave us that field what their use case is.

As I understand it, the url array represents one or more paths on the same data_node to a single file. In our current Solr indexes, Globus-accessible files generally have two urls: an https one and a "globus:<bunch_of_embedded_values>" one. We are proposing to leave https urls unchanged, but to replace urls of the form "globus:<bunch_of_embedded_values>" with "globus_file_attributes:" and move the <bunch_of_embedded_values> to separate new attributes: collection, path, and title. Actually, title already has the file name. Clients will continue to be able to use the https url as they do today or, if the url contains the value "globus_file_attributes:", look in the separate collection, path, and title fields for the values used to construct Globus transfer requests. The reason for having a url of "globus_file_attributes:" at all is to support other uses of the new attributes collection and path.
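The client-side choice described in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names and the sample entry are hypothetical, and it assumes that `path` holds the directory while `title` holds the file name, which the thread suggests but does not confirm.

```python
from typing import Optional, Tuple

# Marker value proposed in the thread for File entries reachable via Globus.
GLOBUS_MARKER = "globus_file_attributes:"


def globus_source(entry: dict) -> Optional[Tuple[str, str]]:
    """Return (collection, file_path) if the entry advertises Globus access."""
    urls = entry.get("url", [])
    if not any(u.startswith(GLOBUS_MARKER) for u in urls):
        return None
    # Assumption: `path` is the directory and `title` is the file name.
    directory = entry["path"].rstrip("/")
    return entry["collection"], f"{directory}/{entry['title']}"


def https_url(entry: dict) -> Optional[str]:
    """Return the first https location, dropping the |mime|service suffix."""
    for u in entry.get("url", []):
        location = u.split("|", 1)[0]
        if location.startswith("https://"):
            return location
    return None


# Hypothetical File entry in the proposed shape (all values are made up).
sample_entry = {
    "title": "example.nc",
    "collection": "00000000-0000-0000-0000-000000000000",
    "path": "/data/example/",
    "url": [
        "https://example.invalid/data/example/example.nc|application/netcdf|HTTPServer",
        "globus_file_attributes:",
    ],
}
```

A client that only speaks https can keep calling something like `https_url` and ignore the new fields, while a Globus-aware client reads the separate attributes without any string parsing.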
@jpnavarro in order to make
"title":"bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc",
"data_node":"eagle.alcf.anl.gov",
"index_node":"esgf2-us-globus-search",
"size":330810526,
"checksum":["8f6f7f0bf0c8efc8f7ecc489323cd83bf3611fb1650f09324ecee90fcf25dc54"],
"checksum_type":["SHA256"],
"url": [
"https://g-52ba3.fd635.8443.data.globus.org/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|application/netcdf|HTTPServer",
"globus://8896f38e-68d1-4708-bce4-b1b3a3405809/css03_data/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/hist-piNTCF/r1i1p1f1/AERmon/bldep/gn/v20210603/bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc?globus_type=file|application/netcdf|Globus"
],
"id":"CMIP6.AerChemMIP.AS-RCEC.TaiESM1.hist-piNTCF.r1i1p1f1.AERmon.bldep.gn.v20210603.bldep_AERmon_TaiESM1_hist-piNTCF_r1i1p1f1_gn_185001-201412.nc|eagle.alcf.anl.gov",
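In the entry above, the `id` ends with `|eagle.alcf.anl.gov`, which matches its `data_node`. Assuming that pattern holds for every File entry (an inference from this one example, not a documented rule), a client could group the replica entries of a consolidated index by the part of the `id` before the `|`. The helper name below is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_replicas(file_entries: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each file id (without the data node suffix) to its data nodes."""
    replicas = defaultdict(list)
    for entry in file_entries:
        # Assumption: id has the form "<file id>|<data_node>".
        file_id, _, node_from_id = entry["id"].partition("|")
        # Prefer the explicit data_node field when it is present.
        replicas[file_id].append(entry.get("data_node") or node_from_id)
    return dict(replicas)
```

With one File entry per replica location, each key would map to as many data nodes as there are replicas of that file.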
@bstrdsmkr I just updated the issue description above with a recommendation/requirement that was given to the team that proposed the schema changes: Simplify access to the Globus data service by splitting url = "globus:<multiple_field_values>" into separate fields, eliminating the need to parse a string to obtain the values used in various Globus data service interfaces. It is simpler to store the different values explicitly in their own fields so that clients can use the subset of fields they need for different Globus interfaces.
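For comparison, here is roughly the string parsing that the current combined url forces on clients, based on the "globus://..." entry shown earlier in this thread. This is a sketch only: the function name is hypothetical, and splitting the path into a directory (`path`) and file name (`title`) is an assumption about how the new fields would be populated.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def split_globus_url(entry_url: str) -> dict:
    """Split 'globus://<collection>/<path>?<query>|<mime_type>|<service>'."""
    location, mime_type, service = entry_url.split("|")
    parts = urlsplit(location)
    if parts.scheme != "globus":
        raise ValueError(f"not a globus url: {entry_url!r}")
    query = parse_qs(parts.query)
    return {
        "collection": parts.netloc,
        # Assumption: directory goes in `path`, file name in `title`.
        "path": posixpath.dirname(parts.path),
        "title": posixpath.basename(parts.path),
        "globus_type": query.get("globus_type", [None])[0],
        "mime_type": mime_type,
        "service": service,
    }
```

Storing these values in their own fields at index time would let each client skip this step entirely.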
The minimum necessary schema changes to merge US Solr indexes into a single ESGF2-US Globus Search Index.
Additional requirements: