Tool for converting metadata standards to SOSO #238

Open
clnsmth opened this issue Oct 28, 2022 · 12 comments

Comments

@clnsmth

clnsmth commented Oct 28, 2022

While discussing strategies to help data repositories adopt SOSO conventions at yesterday's meeting, the idea was raised to develop a tool for converting metadata standards to the SOSO representation. One implementation of this could be a crosswalk and record builder within a Python package.

Is this worthwhile? Has this already been done? What would be a “good” design for such a tool? Other thoughts?

@mbjones
Collaborator

mbjones commented Oct 28, 2022

Thanks, @clnsmth. I think your idea is useful and would help adoption. Off the top of my head, I know of a few tools set up for metadata crosswalking and conversion that may be of interest or inspiration.

Interested to see where this goes.

@clnsmth
Author

clnsmth commented Oct 31, 2022

Thanks @mbjones. I'll review these resources, think more about design, and circle back with some thoughts.

@clnsmth
Author

clnsmth commented Nov 10, 2022

@mbjones Thanks again for pointers to these tools. I found them helpful in understanding how to design a crosswalk. While I couldn't see an opportunity to integrate with one of these projects, I might be missing something obvious and am happy to be convinced otherwise.

Below is a design that draws inspiration from codemeta and codemetar. Guidance on modifying this design is welcome and much appreciated. I've never done something like this, and can use all kinds of help. Thanks in advance.

Goals

  • Help with the adoption of SOSO by providing a tool to convert different metadata standards to the SOSO representation.
  • Develop and maintain this tool as a community of publishers/users in a GitHub (or other) repository, so that implementations can be shared and kept in sync as new versions of metadata standards and SOSO are released.

Design

A Python package containing a semantic mapping, conversion specification, and implementation for each supported metadata standard, which are accessed through a common workflow framework.

soso_crosswalk

A schematic of the proposed design:
(1) A metadata file is read in as a Python object.
(2) The metadata object and SOSO version are passed through a wrapper to a mapping function, along with the metadata standard and version number which are parsed by the wrapper.
(3) Based on the metadata standard, standard version, and SOSO version, the mapping function sources and calls the corresponding implementation, which returns the equivalent SOSO record as a Python object.
(4) The SOSO object is validated via SHACL and a report is returned to the user.
(5) The SOSO object is written to file.
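
To make the five steps above a bit more concrete, here is a minimal sketch of how the wrapper and mapping function could fit together. Everything in it is an assumption for illustration: the registry, the function names, the toy EML detection rule, and the "1.2.3" SOSO version label are all hypothetical, and the validation step is a stand-in for real SHACL validation.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Tuple

# Registry of converters keyed by (standard, standard version, SOSO version).
# All names here are hypothetical; they only illustrate the dispatch idea.
Converter = Callable[[ET.Element], dict]
REGISTRY: Dict[Tuple[str, str, str], Converter] = {}


def register(standard: str, version: str, soso_version: str):
    """Decorator that files a converter under its lookup key."""
    def wrap(func: Converter) -> Converter:
        REGISTRY[(standard, version, soso_version)] = func
        return func
    return wrap


@register("EML", "2.2.0", "1.2.3")
def eml_to_soso(root: ET.Element) -> dict:
    """Toy implementation: maps only the dataset title."""
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": root.findtext("dataset/title"),
    }


def detect_standard(root: ET.Element) -> Tuple[str, str]:
    """Step 2: parse the standard and its version from the root namespace."""
    namespace = root.tag.partition("}")[0].lstrip("{")
    if "eml-" in namespace:
        return "EML", namespace.rsplit("eml-", 1)[1]
    raise ValueError(f"Unrecognized metadata standard: {root.tag}")


def validate(record: dict) -> list:
    """Step 4 placeholder: a real tool would run SHACL validation here."""
    return [key for key in ("@type", "name") if not record.get(key)]


def convert(xml_text: str, soso_version: str) -> str:
    root = ET.fromstring(xml_text)                     # step 1: read metadata
    standard, version = detect_standard(root)          # step 2: wrapper parsing
    implementation = REGISTRY[(standard, version, soso_version)]  # step 3
    record = implementation(root)
    problems = validate(record)                        # step 4: validation report
    if problems:
        raise ValueError(f"Missing required properties: {problems}")
    return json.dumps(record, indent=2)                # step 5: serialize
```

Keying the registry on all three values means that supporting a new standard version or a new SOSO release is a matter of registering another function, which bears on Questions 1 and 2 below.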

Questions

  1. Should previous versions of a metadata standard be supported? If so, how many?
  2. Likewise, should previous versions of SOSO be supported? How many?
  3. What could be a good model for the semantic mapping and design specification? Some blend of human/machine readability seems optimal, but too much complexity (e.g. DITA?) may preclude the goal of community contribution/maintenance. Perhaps a simple table (e.g. codemeta/DataCite) with a few additional fields for notes etc. would be better?

Other thoughts?

@nicholascar

I need to create SPARQL scripts, and perhaps an RDFLib Python script to run them along with any other supporting logic, to convert spatial DCAT to schema.org. This is part of the ANZGeoDCAT Profile (formal definition: https://linked.data.gov.au/def/anzgeodcat).

So this will be a simple enough tool BUT that same profile will also maintain a mapping from (ANZGeo)DCAT to ISO19115-1/-3 and tooling for such conversions (RDF to XML), so we will have chained mappings from SOSO to ISO19115-1/-3 via (ANZGeo)DCAT.

We are really aiming at CKAN delivering ANZGeoDCAT and then being able to convert to SOSO and/or ISO19115.
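
As a rough illustration of the chained mappings described above, two property-level crosswalks that share a pivot vocabulary can be composed into one. The tiny tables below are assumptions made up for this sketch, not the ANZGeoDCAT profile's actual mappings, and `chain` is a hypothetical helper.

```python
from typing import Dict, Optional

# Illustrative property alignments only; these tables are assumptions,
# not the ANZGeoDCAT profile's actual mappings.
SDO_TO_DCAT: Dict[str, str] = {
    "schema:name": "dcterms:title",
    "schema:description": "dcterms:description",
    "schema:keywords": "dcat:keyword",
}
DCAT_TO_ISO: Dict[str, str] = {
    "dcterms:title": "cit:title",
    "dcterms:description": "mri:abstract",
}


def chain(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Compose two mappings through their shared pivot vocabulary.

    Properties with no second hop map to None, which makes gaps in the
    chain visible instead of silently dropping them.
    """
    return {source: second.get(pivot) for source, pivot in first.items()}


SDO_TO_ISO = chain(SDO_TO_DCAT, DCAT_TO_ISO)
```

Here `SDO_TO_ISO["schema:keywords"]` is `None` because the second table has no entry for `dcat:keyword`. Gaps like that are where a chained mapping can lose information compared with a direct one.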

@clnsmth
Author

clnsmth commented Jan 2, 2023

Hi @nicholascar. This sounds great. If I'm understanding, you’re mapping to SOSO from two profiles/standards:

  1. ANZGeoDCAT to SOSO
  2. ICSM ISO 19115-1/-3 to ANZGeoDCAT to SOSO

Do you have any interest in developing and maintaining this work within a community-supported tool like the one being pitched above? If not, where could others find your work? https://github.com/Kurrawong/anzgeodcat?

@nicholascar

@clnsmth yes, I will have to maintain this with a community comprised of multiple Australian and New Zealand government agencies, at the very least - we would love wider involvement!

That's right, the ANZGeoDCAT work lives at https://github.com/Kurrawong/anzgeodcat

I've done a first pass at a DCAT-variant-to-SOSO converter. It's not for ANZGeoDCAT, though, but for the Australian Indigenous Data Network's profile of DCAT. All that profile requires is the use of qualified attribution roles rather than direct roles (e.g. dcterms:creator) so that all agent roles can be drawn from a vocab. I've done the mapping here first just due to project requirements.

Here is the IDN CP DCAT profile resource listing: https://w3id.org/idn/def/cp

Here is the IDN CP's specification document: https://w3id.org/idn/def/cp/spec

But probably more interesting is the schema.org mapping: https://w3id.org/idn/def/cp/sdo

Alongside the conceptual mapping, I've made an RDF mapping https://w3id.org/idn/def/cp/sdo.ttl

And now I've made a conversion Python script: https://w3id.org/idn/def/cp/sdo.py

Here is a before and after IDN CP / schema.org result:

You'll surely notice the 'before' is really just DCAT with a couple of small additions that aren't really indigenous per se.

We have 5 more months of active development on that profile and the mappings, so we have some time yet to improve the conceptual, RDF & scripted mapping.

@clnsmth
Author

clnsmth commented Feb 25, 2023

We discussed this topic in yesterday's meeting to gain perspective on which of two implementation pathways to pursue (thanks @ashepherd , @nein09, @datadavev, @pbuttigieg, and Bill Manley (sorry Bill I couldn't find your GitHub handle)). A summary:

Option 1 (Direct Transform)

A direct transformation of metadata dialects from their typical format to SOSO via a programming language (e.g. EML.xml => Python list => SOSO.jsonld).

  • Con: Business logic is locked up in localized code, which requires refactoring when changes are needed.
  • Con: Hard to extend because new code is required for each dialect.
  • Pro: Business logic can handle variance in the placement of information within a dialect and decide between preferred/non-preferred SOSO representations based on available content.
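
A minimal sketch of what Option 1 might look like for EML, assuming a heavily simplified EML document and using only the standard library. The function names are hypothetical. The point is the last "Pro": the code can inspect which elements are present and choose the SOSO representation accordingly.

```python
import json
import xml.etree.ElementTree as ET


def creator_to_soso(creator: ET.Element) -> dict:
    """Pick a schema.org type based on which EML child elements are present."""
    surname = creator.findtext("individualName/surName")
    if surname:
        given = creator.findtext("individualName/givenName") or ""
        return {"@type": "Person", "name": f"{given} {surname}".strip()}
    # No individual name: fall back to treating the creator as an organization.
    return {"@type": "Organization", "name": creator.findtext("organizationName")}


def direct_transform(xml_text: str) -> str:
    """EML XML text => Python dict => SOSO JSON-LD text."""
    dataset = ET.fromstring(xml_text).find("dataset")
    record = {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": dataset.findtext("title"),
        "creator": [creator_to_soso(c) for c in dataset.findall("creator")],
    }
    return json.dumps(record, indent=2)
```

The two "Con" points show up here as well: every element path is hard-coded, so another dialect would need its own version of both functions.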

Option 2 (JSON-LD Framing)

Taking a dialect in JSON-LD, applying a crosswalk to get the equivalent SOSO properties, and then structuring the result with a JSON-LD Frame (e.g. EML.jsonld => crosswalk => Frame.jsonld => SOSO.jsonld). This is one of @mbjones's original recommendations.

  • Con: Some dialects might not yet be available in JSON-LD/RDF.
  • Con: Framing is a bit cumbersome.
  • Pro: Extending support to other dialects can be done by simply adding to the crosswalk.
  • Pro: Can leverage a mapping standard like SSSOM to express the semantics of property alignment (example SSSOM file).
  • Pro: Maintenance is better/easier for the community (i.e. over time changing out/modifying the crosswalk file).
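
For comparison, here is a sketch of the crosswalk step of Option 2 only, applied to expanded JSON-LD. The `https://example.org/eml#` IRIs are placeholders, not real EML vocabulary terms, and `apply_crosswalk` is a hypothetical helper. The frame is shown as plain data; handing it to a JSON-LD processor's framing operation is not shown.

```python
from typing import Dict, List

# Hypothetical crosswalk from dialect IRIs to schema.org IRIs.
CROSSWALK: Dict[str, str] = {
    "https://example.org/eml#Dataset": "https://schema.org/Dataset",
    "https://example.org/eml#title": "https://schema.org/name",
    "https://example.org/eml#abstract": "https://schema.org/description",
}

# A frame asking a JSON-LD processor to root its output at a Dataset node.
SOSO_FRAME = {"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset"}


def apply_crosswalk(expanded: List[dict]) -> List[dict]:
    """Rename type and property IRIs in top-level nodes of expanded JSON-LD.

    Unmapped properties are dropped. Nested nodes are not visited in this
    sketch; a real implementation would need to recurse into values.
    """
    result = []
    for node in expanded:
        renamed = {}
        for key, value in node.items():
            if key == "@type":
                renamed[key] = [CROSSWALK.get(t, t) for t in value]
            elif key.startswith("@"):
                renamed[key] = value
            elif key in CROSSWALK:
                renamed[CROSSWALK[key]] = value
        result.append(renamed)
    return result
```

Adding support for another property is a one-line change to `CROSSWALK`, which is the extensibility argument made in the list above.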

Did I miss anything?

I'm going to push forward with Option 2, using EML as a test case, and report back at next month's meeting.

P.S. I'm revisiting @nicholascar's ANZGeoDCAT work in light of all this (see above), and it kind of looks like a nice blend of the two options. I'm going to take a closer look (thanks @nicholascar!).

@yvanlebras

Hi @clnsmth , @mbjones ,

Happy to see you there!

A topic of major interest is being discussed here: standards mapping!

In the French biodiversity e-infrastructure, PNDB, we use EML as a pivot format from which we create several other metadata and data standards through mappings. So far we have first versions for ISO19115, INSPIRE Europe, and DCAT, and we are testing the use of EML annotation fields on attributes to help create data standards such as Darwin Core from a raw EML-based data package.

We are working on these mappings because we have to harvest all biodiversity information systems in France, convert their metadata to EML, enrich it, validate it, and then publish the enriched metadata in our catalog. Where possible, we also want to send the enriched metadata back to the original information systems, transformed into their own standard, to help raise the FAIRness of biodiversity data in all systems. We are testing and validating the entire workflow this year.

Moreover, we are a partner in an eight-year French project called "GAIA DATA", which focuses on creating a common distributed national infrastructure for biodiversity, climate, and the Earth system. In this project we use a pivot metadata standard derived from geoDCAT and based on O&M, and we need to create mappings between each infrastructure's data silos and this standard, which for biodiversity means EML. We are still discussing the method. So far we have been considering the same two options, but to start, the first option ("simple tabular files") seems better, at least to help everyone begin the process. We will then improve the process using complex cases as examples.

So, it seems to me that (1) there are opportunities to build on existing or ongoing initiatives to get started (for example, the work I have done on EML-DCAT might help in combination with the DCAT to SOSO mapping work mentioned by @nicholascar), and (2) there are opportunities to pool upcoming efforts to create such mappings. An umbrella for this could be the GO FAIR "BiodiFAIRse" implementation network, which I coordinate with Anne-Sophie Archambeau from the France GBIF node, together with the new roadmap we are proposing, which notably links to the GEO BON work on EBV and "BON in a Box", and to @jmlord's comment on the EML to JSON-LD issue.

Please don't hesitate to comment. I will try to gather all the cited information here, but I am away this week without a computer, so it's not easy ;)

@clnsmth
Author

clnsmth commented May 1, 2023

We discussed implementation options at the meeting last week (2023-04-27). A summary is provided here.

While pursuing "Option 2 (JSON-LD Framing)" listed above, it became apparent that transformation using only JSON-LD algorithms would lead to information loss. Specifically, the JSON-LD flatten, expand, and compaction process doesn't handle nested structures common to metadata dialects, and JSON-LD framing doesn't allow for the construction of new data encountered when combining a dialect's properties into SOSO. After some discussion, a third option emerged.
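
One example of the kind of "new data" meant here is a value that has to be assembled from several source properties. The sketch below combines four EML bounding-coordinate elements into a single schema.org `box` string. The function name is hypothetical, and the coordinate ordering is an assumption noted in the code. Framing alone can rearrange existing values but cannot build a concatenated string like this one.

```python
import xml.etree.ElementTree as ET


def bounding_box_to_geoshape(coords: ET.Element) -> dict:
    """Combine four EML coordinate elements into one schema.org box string."""
    south = coords.findtext("southBoundingCoordinate")
    west = coords.findtext("westBoundingCoordinate")
    north = coords.findtext("northBoundingCoordinate")
    east = coords.findtext("eastBoundingCoordinate")
    # Assumed ordering: lower corner then upper corner, latitude before longitude.
    return {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"}
```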

Option 3 (Mapping and Python Transform)

Map a metadata dialect to SOSO and use it to drive a Python based transformer. The benefits of this approach include:

  • Extensibility - New dialects can be supported via creation of a new mapping table and (likely) a little glue code to help the transformer handle variation across dialects.
  • Fidelity - Nested structures of dialects are supported and decision logic enables rendering of preferred SOSO representations at run time.
  • Expressivity - Using the Simple Standard for Sharing Ontological Mappings (SSSOM) for the mapping enables semantic definition of the match between SOSO and the dialect.
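
A minimal sketch of how a mapping table could drive a transformer. It uses SSSOM's TSV column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`), but the rows, the use of element paths as `object_id` values, and both function names are assumptions for illustration. This is not the content of soso-eml.sssom.tsv.

```python
import csv
import xml.etree.ElementTree as ET

# A tiny mapping table in SSSOM TSV layout. The rows are made up for this
# sketch, and using element paths as object_id values is an assumption.
SSSOM_TSV = """\
# mapping_set_id: https://example.org/soso-eml
subject_id\tpredicate_id\tobject_id\tmapping_justification
schema:name\tskos:exactMatch\tdataset/title\tsemapv:ManualMappingCuration
schema:description\tskos:closeMatch\tdataset/abstract/para\tsemapv:ManualMappingCuration
"""


def load_mappings(tsv_text: str) -> list:
    """Parse SSSOM TSV rows, skipping the '#'-prefixed metadata header."""
    lines = [line for line in tsv_text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def transform(xml_text: str, mappings: list) -> dict:
    """Walk the mapping table and copy each matched value into a SOSO record."""
    root = ET.fromstring(xml_text)
    record = {
        "@context": {"schema": "https://schema.org/"},
        "@type": "schema:Dataset",
    }
    for row in mappings:
        value = root.findtext(row["object_id"])
        if value is not None:
            record[row["subject_id"]] = value.strip()
    return record
```

In this sketch a new dialect only needs a new table, as long as its values can be copied one-to-one. Cases that combine or restructure values are where the glue code mentioned in the first bullet would come in.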

A first draft of the crosswalk for EML to SOSO is complete and entering a process of review (see soso-eml.sssom.yml and soso-eml.sssom.tsv). Next steps are to design a generalized transformer based on the SSSOM input and implement with Python. I'll report progress at the next SOSO meeting.

@clnsmth
Author

clnsmth commented Jul 14, 2023

I have set up a GitHub repository for prototyping this idea and will now begin testing it out on EML.

@clnsmth
Author

clnsmth commented Jul 30, 2024

Hi Folks.

The EML to SOSO conversion functionality is now implemented and ready for use. You can find user documentation on how to run it here.

The package architecture is designed to support conversion of other metadata standards in the future. You can learn more about this in the project design document.

If the overall implementation looks reasonable, perhaps it should move to https://github.com/ESIPFed for broader community development and maintenance?

Comments and suggestions are appreciated.

Thanks!

@yvanlebras

Amazing work @clnsmth! I am (or in fact @PaulineSGN is ;)) about to start working on EML -> JSON-LD conversion to try things out. I was planning to use the emld R package, but maybe it would be worth using this EML -> SOSO converter as well, or instead? The converted results and/or the conversion method might then be of interest to us if we want to generate DCAT. Looking forward to taking a deeper look!

4 participants