Update eml_validate to match Metacat #133

Open
isteves opened this issue May 10, 2018 · 1 comment
Comments

Collaborator

isteves commented May 10, 2018

The EML package's eml_validate function currently skips some of the checks that Metacat runs. Ideally, the two would report the same result.


From Slack:
"it would be nice to add the extra schema-validity rules to the eml_validate function so it will show the same results as Metacat" (Matt)

"Once you send it across the network to the Member Node, Metacat runs a suite of custom validation rules on the EML.
It's stuff that XML schemas can't help us enforce, such as the match between a custom unit and its definition (as here) or ids and references." (Bryce)

Open question:
"I'm not sure what additional schema rules would be needed? Right now EML::eml_validate is basically a wrapper for xml2::xml_validate, which feeds that function the EML schema (EML/xsd/eml-2.1.1/eml.xsd). What additional checks are needed?" (Mitchell)

Member

mbjones commented May 10, 2018

@maier-m The additional rules beyond schema validation are written in section 3.3 Reusable Content in the EML spec: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html#reusableContent I list them here for convenience:

  • An ID is required on the eml root element (packageId)
  • IDs are optional on all other elements.
  • If an ID is not provided, that content must be interpreted as representing a distinct object.
  • If an ID is provided for content then that content is distinct from all other content except for that content that references its ID.
  • If a user wants to reuse content to indicate the repetition of an object, a reference must be used. Two identical ids with the same system attribute cannot exist in a single document.
    • document scope is defined as identifiers unique only to a single instance document (if a document does not have a system attribute or if scope is set to 'document' then all IDs are defined as distinct content).
    • system scope is defined as identifiers unique to an entire data management system (if two documents share a system string, then any IDs in those two documents that are identical refer to the same object).
  • If an element references another element, it must not have an ID itself. The system attribute must have the same value in both the target and referencing elements or it must be absent in both.
  • All EML packages must have the 'eml' module as the root.
  • The system and scope attributes are always optional except at the 'eml' module, where the scope attribute is fixed as 'system'. The scope attribute defaults to 'document' for all other modules.
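
As a rough illustration of the root-level rules in the list above (root element is 'eml', packageId is present, scope is 'system' when given), here is a minimal sketch. It is written in Python only to keep it self-contained; `check_root_rules` is a hypothetical name, not part of the EML package or Metacat, and an R version would need the equivalent xml2 calls.

```python
import xml.etree.ElementTree as ET


def check_root_rules(xml_text):
    """Return a list of error strings for the root-level rules (hypothetical helper)."""
    errors = []
    root = ET.fromstring(xml_text)

    # ElementTree reports namespaced tags as "{namespace-uri}localname",
    # so strip the namespace part before comparing.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "eml":
        errors.append(f"root element is '{local_name}', expected 'eml'")

    # The root carries its identifier in packageId rather than id.
    if not root.get("packageId"):
        errors.append("root element has no packageId")

    # scope is optional, but if present on the root it must be 'system'.
    scope = root.get("scope")
    if scope is not None and scope != "system":
        errors.append(f"root scope is '{scope}', expected 'system'")

    return errors
```

An empty list means the document passed these three checks; the id and reference rules need a pass over the whole document and are not covered here.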

What would be great is if we wrote a function that could check all of these issues as an XML document is being parsed, and then call that after the xml2::xml_validate call is made. Both must be valid for the EML document to be considered valid.

The current EML parser is slow in part because it performs these checks in memory: it loads the XML document as a DOM and then queries the DOM for matches. A better algorithm is planned to fix the Java EMLParser (Issue NCEAS/eml#1). In this approach, we would 1) use a SAX parser to parse the EML document, 2) record all id, reference, and element details in a data structure as they are encountered, and 3) once the whole document is parsed, compare the ids and references to check uniqueness and the other rules. The same approach could be implemented in R, but I'm not sure whether xml2 supports SAX parsing. If it doesn't, we might be able to query the loaded XML document directly to check the rules. Let's discuss.
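
The three-step SAX idea could look roughly like the sketch below. It uses Python's standard `xml.sax` module purely for illustration; `IdRefCollector` and `check_id_rules` are hypothetical names, and this is not the planned Java EMLParser algorithm or anything in the EML package. It assumes that ids live in an `id` attribute and that a reference is a `references` child element whose text is the target id. It covers duplicate ids, unresolved references, and referencing elements that carry their own id; the rule that the system attributes must match is left out.

```python
import xml.sax
from collections import Counter


class IdRefCollector(xml.sax.ContentHandler):
    """Record ids and references as elements stream past (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.ids = []          # (id, system) for every element with an id
        self.references = []   # (target id, parent element name, parent id)
        self._stack = []       # (name, id) for each currently open element
        self._ref_text = None  # text chunks of the <references> being read

    def startElement(self, name, attrs):
        if name == "references" and self._stack:
            self._ref_text = []
        elem_id = attrs.get("id")
        if elem_id is not None:
            self.ids.append((elem_id, attrs.get("system")))
        self._stack.append((name, elem_id))

    def characters(self, content):
        if self._ref_text is not None:
            self._ref_text.append(content)

    def endElement(self, name):
        self._stack.pop()
        if name == "references" and self._ref_text is not None:
            target = "".join(self._ref_text).strip()
            parent_name, parent_id = self._stack[-1]
            self.references.append((target, parent_name, parent_id))
            self._ref_text = None


def check_id_rules(xml_text):
    """Parse once, then compare the collected ids and references."""
    handler = IdRefCollector()
    xml.sax.parseString(xml_text, handler)
    errors = []
    # Two identical ids with the same system attribute are not allowed.
    for (elem_id, system), count in Counter(handler.ids).items():
        if count > 1:
            errors.append(f"duplicate id '{elem_id}' (system={system})")
    known = {elem_id for elem_id, _ in handler.ids}
    for target, parent_name, parent_id in handler.references:
        if target not in known:
            errors.append(f"<{parent_name}> references unknown id '{target}'")
        if parent_id is not None:
            errors.append(f"<{parent_name}> has id '{parent_id}' and a reference")
    return errors
```

Because the handler only keeps a list of ids and references rather than the whole tree, memory use grows with the number of identified elements instead of the size of the document.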
