Datatractor Schema:
Schemas for Metadata Extractors

A repository containing the LinkML-based schemas backing the registry of extractors at Datatractor Yard and powering the reference implementation of the extractor framework at Datatractor Beam.

This repository continues the work of the MaRDA WG7 on Automated Metadata Extractors.

For more information, see the preprint:

Datatractor: Metadata, automation, and registries for extractor interoperability in the chemical and materials sciences
Matthew L. Evans, Gian-Marco Rignanese, David Elbert & Peter Kraus
arXiv:2410.18839 (2024)

Contents

The repository contains two user-facing schemas:

  • FileType schema, used to specify the types of files passed to the extractors by users. The schema definition is located in filetype.yaml.

  • Extractor schema, used to specify the download, installation, and usage instructions that allow machine execution of the defined extractor/parser code, together with the list of FileTypes compatible with the Extractor. The schema definition is located in extractor.yaml.

Usage

Validation

The schema definitions contained in this repository can be used to locally validate your own FileTypes and Extractors. Several examples are provided for this purpose in the examples folder.

To get started, first make sure LinkML is installed in your Python environment. You may use the provided requirements.txt file for this purpose:

pip install -r requirements.txt

Then, you can check the validity of your filetype or extractor definition against the provided schemas using linkml-validate. For example, to validate the provided example filetype definition in FileType-netcdf.yaml against the FileType class from schemas/filetype.yaml, run:

linkml-validate -s schemas/filetype.yaml -C FileType examples/FileType-netcdf.yaml

If successful, you should see "✓ No problems found" returned by linkml-validate.
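To check several definitions at once, the same command can be driven from Python. The following is a minimal sketch rather than part of this repository: the helper names `build_command` and `validate_all` are hypothetical, it assumes every file in the examples folder is named `<Class>-<name>.yaml` (as FileType-netcdf.yaml is), and it assumes linkml-validate exits with a nonzero status when it reports problems.

```python
import subprocess
from pathlib import Path

# Class name -> schema file, as listed in the Contents section.
SCHEMAS = {
    "FileType": "schemas/filetype.yaml",
    "Extractor": "schemas/extractor.yaml",
}


def build_command(data_file):
    """Return the linkml-validate command for one definition file.

    Assumption: the file is named '<Class>-<name>.yaml', so the target
    class can be read from the part of the file name before the first '-'.
    """
    target_class = Path(data_file).name.split("-", 1)[0]
    if target_class not in SCHEMAS:
        raise ValueError(f"cannot infer schema class from {data_file!r}")
    return [
        "linkml-validate",
        "-s", SCHEMAS[target_class],
        "-C", target_class,
        str(data_file),
    ]


def validate_all(folder="examples"):
    """Validate every YAML file in folder and return the files that failed."""
    failed = []
    for path in sorted(Path(folder).glob("*.yaml")):
        # Assumption: a nonzero exit status means problems were reported.
        result = subprocess.run(build_command(path))
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Calling `validate_all()` from the repository root should return an empty list when every example validates.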

Translation

The LinkML schemas provided here can be automatically translated to other formats, including JSON Schema, Python dataclasses, or Pydantic classes:

gen-json-schema schemas/extractor.yaml > extractor.json
gen-python schemas/filetype.yaml > filetype.py
gen-pydantic schemas/extractor.yaml > extractor.py

The generated files can be used in downstream code, such as the validation function of Datatractor Beam.
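If the generated files need to be refreshed regularly, the three generator commands can be wrapped in a short script. This is only a sketch: the `TARGETS` list and the `regenerate` function are hypothetical names, and the script assumes it is run from the repository root with the LinkML generators on the PATH. Each output file is opened in write mode, so a rerun replaces the previous contents instead of appending to them.

```python
import subprocess

# (generator command, input schema, output file), matching the commands above.
TARGETS = [
    ("gen-json-schema", "schemas/extractor.yaml", "extractor.json"),
    ("gen-python", "schemas/filetype.yaml", "filetype.py"),
    ("gen-pydantic", "schemas/extractor.yaml", "extractor.py"),
]


def regenerate(targets=TARGETS):
    """Run each generator and write its standard output to the output file."""
    for generator, schema, output in targets:
        # Mode "w" truncates the file, so stale content never accumulates.
        with open(output, "w") as handle:
            # check=True raises CalledProcessError if a generator fails.
            subprocess.run([generator, schema], stdout=handle, check=True)
```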

Contributing

Contributions are welcome. We pledge to follow the Contributor Covenant Code of Conduct.

If you wish to contribute a new FileType or a new Extractor to the Registry, please open a pull request at the Datatractor Yard repo.

If you have any suggestions, technical queries, or feature requests related to the schemas, please do not hesitate to open an issue in this repository. For general questions related to the Datatractor project, please use the Datatractor discussion board.