Update `eml_validate` to match Metacat #133

isteves · 2018-05-10T00:04:59Z

The EML package's `eml_validate` function currently skips some of the checks that Metacat runs. Ideally, both would report the same result for the same document.

From Slack:
"it would be nice to add the extra schema-validity rules to the eml_validate function so it will show the same results as Metacat" (Matt)

"Once you send it across the network to the Member Node, Metacat runs a suite of custom validation rules on the EML.
It's stuff that XML schemas can't help us enforce, such as the match between a custom unit and its definition (as here) or ids and references." (Bryce)

Open question:
"I'm not sure what additional schema rules would be needed? Right now EML::eml_validate is basically a wrapper for xml2::xml_validate, which feeds that function the EML schema (EML/xsd/eml-2.1.1/eml.xsd). What additional checks are needed?" (Mitchell)

mbjones · 2018-05-10T01:43:06Z

@maier-m The additional rules beyond schema validation are written in section 3.3 Reusable Content in the EML spec: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html#reusableContent I list them here for convenience:

- An ID is required on the eml root element (packageId).
- IDs are optional on all other elements.
- If an ID is not provided, that content must be interpreted as representing a distinct object.
- If an ID is provided for content then that content is distinct from all other content except for that content that references its ID.
- If a user wants to reuse content to indicate the repetition of an object, a reference must be used. Two identical ids with the same system attribute cannot exist in a single document.
  - document scope is defined as identifiers unique only to a single instance document (if a document does not have a system attribute or if scope is set to 'document' then all IDs are defined as distinct content).
  - system scope is defined as identifiers unique to an entire data management system (if two documents share a system string, then any IDs in those two documents that are identical refer to the same object).
- If an element references another element, it must not have an ID itself. The system attribute must have the same value in both the target and referencing elements or it must be absent in both.
- All EML packages must have the 'eml' module as the root.
- The system and scope attribute are always optional except for at the 'eml' module where the scope attribute is fixed as 'system'. The scope attribute defaults to 'document' for all other modules.
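
To make a few of these rules concrete, here is a minimal sketch that checks three of them against an already-loaded document tree: the root is `eml` with a `packageId`, no two elements share the same id and system, and an element holding a `references` child has no id of its own. It is written in Python with the standard library only, purely as an illustration; it is not the Metacat or EML package implementation, and `check_id_rules` and `local_name` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def check_id_rules(xml_text):
    """Return a list of rule violations (empty if none were found)."""
    root = ET.fromstring(xml_text)
    errors = []

    # Rule: the root must be the 'eml' module and carry a packageId.
    if local_name(root.tag) != "eml":
        errors.append("root element is not 'eml'")
    if not root.get("packageId"):
        errors.append("root element has no packageId")

    seen = Counter()
    for elem in root.iter():
        elem_id = elem.get("id")
        if elem_id is not None:
            # Key on (id, system) so equal ids under different systems
            # are not reported as duplicates.
            seen[(elem_id, elem.get("system"))] += 1
        # Rule: an element that references another must not have an id.
        has_reference = any(local_name(c.tag) == "references" for c in elem)
        if has_reference and elem_id is not None:
            errors.append(
                f"<{local_name(elem.tag)}> has both id '{elem_id}' "
                "and a references child"
            )

    # Rule: two identical ids with the same system cannot coexist.
    for (elem_id, system), count in seen.items():
        if count > 1:
            errors.append(
                f"id '{elem_id}' (system={system!r}) appears {count} times"
            )
    return errors
```

A document would then count as valid only if schema validation passes and this list comes back empty. The scope rules and the system-matching rule for references are left out of this sketch.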

What would be great is if we wrote a function that could check all of these issues as an XML document is being parsed, and then call that after the xml2::xml_validate call is made. Both must be valid for the EML document to be considered valid.

The current EML parser is slow in part because it tries to do these checks in memory by loading the XML document as a DOM, and then querying the DOM for matches. A better algorithm is planned to fix the Java EMLParser (Issue NCEAS/eml#1). In this approach, we would 1) use a SAX parser to parse the EML document, and 2) record all id, reference, and element details in a data structure as they are encountered, and 3) once the whole document is parsed, do the id/ref comparisons for uniqueness and for following the rules. The same approach could be implemented in R, but I'm not sure if xml2 supports SAX parsing. If not, the loaded XML document might be usable to directly query for rule checking. Let's discuss.
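
The three-step approach can be sketched roughly as follows. This is an illustration in Python using the standard-library `xml.sax` module, not the planned Java EMLParser change or an R implementation; `IdReferenceCollector` and `check_references` are hypothetical names. It also assumes that the system value to compare is the one on the `<references>` element itself, which is one reading of the rule rather than something this thread confirms.

```python
import xml.sax


class IdReferenceCollector(xml.sax.ContentHandler):
    """Record ids and references as the document streams past."""

    def __init__(self):
        super().__init__()
        self.ids = []          # (id, system, element name)
        self.references = []   # (target id, system, referencing element)
        self._stack = []       # names of currently open elements
        self._ref_text = None  # text buffer while inside <references>
        self._ref_system = None

    def startElement(self, name, attrs):
        if name == "references":
            self._ref_text = []
            self._ref_system = attrs.get("system")
        elif "id" in attrs:
            self.ids.append((attrs["id"], attrs.get("system"), name))
        self._stack.append(name)

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self._ref_text is not None:
            self._ref_text.append(content)

    def endElement(self, name):
        self._stack.pop()
        if name == "references":
            parent = self._stack[-1] if self._stack else None
            target = "".join(self._ref_text).strip()
            self.references.append((target, self._ref_system, parent))
            self._ref_text = None


def check_references(xml_bytes):
    """Parse once, then compare the collected references against the ids."""
    handler = IdReferenceCollector()
    xml.sax.parseString(xml_bytes, handler)
    known = {(elem_id, system) for elem_id, system, _ in handler.ids}
    return [
        f"<{parent}> references unknown id '{target}' (system={system!r})"
        for target, system, parent in handler.references
        if (target, system) not in known
    ]
```

Because the handler only appends to flat lists during parsing, memory use grows with the number of ids and references rather than with the size of the whole document, and the comparison happens in a single pass at the end.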

mbjones mentioned this issue May 11, 2018

eml_validate doesn't check all EML validity rules ropensci/EML#244

Closed
