Specify target collection metadata in UDP (& batch jobs) #514

Open
jdries opened this issue Oct 3, 2023 · 5 comments

jdries commented Oct 3, 2023

2 new use cases came up, with a similar solution:

  1. Our UDP users want to treat a UDP basically as a 'virtual collection'. To allow this, they would like to know the STAC (collection) metadata of the data cube that is generated when the UDP is invoked. Of course, there can be some unknowns, depending on the parameters in the UDP. Some UDPs are fairly constrained, while others can output any raster cube. This use case is relevant for the constrained kind, where for instance a UDP wants to communicate constraints on the output. Some examples:
  • produces only data over Europe
  • output from 2017 onwards
  • output resolution is 300m
  • output has 4 bands, with detailed band metadata

Note that this collection metadata also acts as a definition of constraints: if the output AOI is Europe, then it will probably not accept an input AOI in North America. So UDP tools can use this for input field validation, which is very useful for generic wizards like the one in the openEO editor.

  2. The second case is perhaps easier to understand: batch jobs try to fill in as much STAC metadata as possible when generating output, but cannot know everything. For instance, a job that generated categorical data cannot really know which colors would be suited for visualization. As a user, I would like to submit a kind of metadata template in STAC format, so that I can immediately generate output with more complete STAC metadata.

My proposed solution is to simply add a property with the target STAC collection metadata to the UDP and batch job schema:

https://api.openeo.org/#tag/User-Defined-Processes/operation/store-custom-process

I'll probably experiment with this myself, but also wanted to share the idea. These cases are triggered by user projects.
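
To make the input validation idea concrete, here is a minimal Python sketch of how a client could check requested inputs against target collection metadata. Everything in it is an assumption for illustration: the `TARGET_COLLECTION` document, its extent values, and the helper names are hypothetical and not part of the openEO API.

```python
from datetime import datetime

# Hypothetical target collection metadata for a constrained UDP
# (Europe only, output from 2017 onwards). Values are illustrative.
TARGET_COLLECTION = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "example-udp-output",
    "description": "Hypothetical output of a constrained UDP.",
    "license": "proprietary",
    "links": [],
    "extent": {
        "spatial": {"bbox": [[-25.0, 34.0, 45.0, 72.0]]},
        "temporal": {"interval": [["2017-01-01T00:00:00Z", None]]},
    },
}


def parse_time(value):
    """Parse an RFC 3339 UTC timestamp such as 2017-01-01T00:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def bbox_within(inner, outer):
    """True if inner [west, south, east, north] lies inside outer."""
    west, south, east, north = inner
    o_west, o_south, o_east, o_north = outer
    return o_west <= west and o_south <= south and east <= o_east and north <= o_north


def validate_request(collection, bbox, start, end):
    """Return a list of problems; the list is empty if the request fits."""
    problems = []
    extent = collection["extent"]
    # The first bbox and interval describe the overall extent in STAC.
    if not bbox_within(bbox, extent["spatial"]["bbox"][0]):
        problems.append("bbox is outside the spatial extent")
    lower, upper = extent["temporal"]["interval"][0]
    if lower is not None and parse_time(start) < parse_time(lower):
        problems.append("start is before the temporal extent")
    if upper is not None and parse_time(end) > parse_time(upper):
        problems.append("end is after the temporal extent")
    return problems
```

A wizard could call `validate_request` before submitting the process graph and show the returned messages next to the input fields.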

@jdries jdries self-assigned this Jan 16, 2024
Author

jdries commented Jun 6, 2024

Update for myself: example of metadata that users are asking for.


{
  "geometry" : "τ1{tend=946684800000,tstart=0,ttype=logical}S2(43199,21599){bbox=[-180.0 180.0 -90.0 90.0],proj=EPSG:4326}",

  "metadata" : {
    "im:keywords" : "global, climate, weather, Average temperature",
    "dc:comment" : "This is WorldClim version 2.1 climate data for 1970-2000. This version was released in January 2020.\r\nThere are monthly climate data for average temperature (°C).\r\nThe data is available at 30 seconds (~1 km2).\r\nFor \"time\", the month scope is inside the semantics data annotation",
    "im:notes" : "",
    "dc:title" : "WorldClim Historical climate data version 2.1 data 30s for 1970-2000 average temperature January",
    "dc:url" : "https://worldclim.org/data/worldclim21.html",
    "dc:creator" : "",
    "im:thematic-area" : "Earth",
    "dc:originator" : "Worldclim",
    "im:geographic-area" : "Global",
    "dc:source" : "Fick, S.E. and R.J. Hijmans, 2017. Worldclim 2: New 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37(12):4302-4315."
  },


}

Member

m-mohr commented Sep 20, 2024

With regard to use case 2: The GEE driver has a parameter for save_result that can contain STAC metadata, which is then added to the output. Isn't stac_modify also suitable here?

Author

jdries commented Jan 6, 2025

I created a PR here as one alternative to solve this:
ESA-APEx/apex_algorithms#79

The idea is not to extend the UDP spec, but rather to link to another STAC document that describes the collection that is generated by the UDP. So this is a kind of 'virtual' STAC object, that only materializes when the UDP is executed.

Advantages:

  • The STAC metadata can put constraints on the spatiotemporal output extent. This is an often-repeated requirement for UDPs, which is not yet possible with JSON Schema.
  • Fixed output properties, like a resolution (gsd, EPSG code), can be described using STAC.
  • Bands of the output cube can be formally documented. The UDP spec does not allow this.

Challenges/questions:

  1. How to indicate that a certain output property, like a resolution, depends on a parameter?
  2. What is a good 'rel' for the link from UDP to STAC?
  3. What is a good 'rel' for the link from STAC to UDP?
  4. Would it be better to use a STAC Item rather than a Collection? The Item is also a feature, and could then perhaps be merged with the OGC API Record that APEx already uses to describe UDPs. But then we wouldn't have summaries, which allow listing variable property values that depend on UDP parameters...

Member

m-mohr commented Jan 6, 2025

Generally, the link seems like a good idea. I think I'd need a bit more time to investigate the use case to come up with potential solutions.

Member

m-mohr commented Jan 6, 2025

> The idea is not to extend the UDP spec, but rather to link to another STAC document that describes the collection that is generated by the UDP. So this is a kind of 'virtual' STAC object, that only materializes when the UDP is executed.

I'm not sure I understand this. I thought the STAC Collection should already be present before the UDP is executed, to inform users about the scope and potentially for validation in clients, e.g. the Web Editor Wizard. The STAC Items would only materialize when the UDP is executed though (and the Collection may be updated, too).


Assuming my comment above is correct, my initial thoughts on this:

A link to a STAC Collection seems like the most reasonable approach. It pretty much is a STAC Collection for which you expect the Items to be generated. So the STAC Collection should be valid under that assumption: no STAC Items should be generated that do not match the Collection metadata. The primary issue that I see is that you can only easily specify a bbox, not a geometry (unless we use proj:geometry on the top level, but that is pretty uncommon).

Relation types

  1. describedby could be a potential candidate for UDP to STAC, which is also used by OGC APIs. The STAC Collection could link back to UDP using describes. My preference for now.
  2. about works similarly, could be an alternative if not happy with describedby.
  3. STAC uses the link relation type collection from an item, maybe that works for us as well. But it might be a bit misleading as you could also expect a collection of processes under that URL.
  4. child seems to be a bit generic and ambiguous in this case.

Remark: STAC has no dedicated media type for Collections, so it's always a bit ambiguous whether something links to STAC or not. If a client detects a link with rel = describedby and type = application/json, it still needs to load it and check for the existence of stac_version and type = Collection to ensure it's really a STAC Collection.
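
The detection logic from the remark can be sketched in Python roughly as follows. This is only an illustration under assumptions: `describedby` is the preferred candidate above and not a settled choice, the function names are hypothetical, and `fetch` is a caller-supplied function (for example a thin wrapper around an HTTP client) that returns the parsed JSON for a URL.

```python
def is_stac_collection(document):
    """Check the two markers named above: stac_version and type = Collection."""
    return (
        isinstance(document, dict)
        and document.get("type") == "Collection"
        and "stac_version" in document
    )


def candidate_links(process, rel="describedby"):
    """Links on the UDP that might point to a STAC Collection.

    Links without a media type are kept as candidates too (an assumption),
    because the media type alone cannot confirm STAC anyway.
    """
    return [
        link
        for link in process.get("links", [])
        if link.get("rel") == rel
        and link.get("type", "application/json") == "application/json"
    ]


def find_target_collection(process, fetch):
    """Load each candidate and return the first real STAC Collection, if any."""
    for link in candidate_links(process):
        document = fetch(link["href"])
        if is_stac_collection(document):
            return document
    return None
```

Passing `fetch` in keeps the sketch independent of a specific HTTP library and makes it easy to try with a plain dictionary lookup.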

Populating STAC Items

You can't simply provide the Collection via stac_modify (due to JSON Merge Patch), but if back-ends find such a STAC Collection in the process metadata, they can automatically fill in some of the metadata:

  • Assets in STAC Items can be populated using the corresponding item asset definitions (assigned via the asset keys), covering e.g. the "categorical data" use case.
  • Summaries could also be used to populate the Item properties, but only if a single value is provided in the summaries (i.e. a single value in an array). In all other cases the mapping is ambiguous, unfortunately. Note: Populating Item properties that are arrays (e.g. instruments) needs special handling.
  • The Collection or UDP could potentially also link to an (optional) JSON Merge Patch file if populating the Items is of high importance. JSON Merge Patch has a separate media type (application/merge-patch+json), so it is easy to detect. There's no obvious rel type, maybe items, item-changeset or so.
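
As a rough illustration of the first two bullets, a back-end could populate a generated Item along these lines. This is a sketch under assumptions, not an existing back-end implementation: `populate_item` and `ARRAY_VALUED` are hypothetical names, and array-valued properties are simply skipped here rather than given the special handling they need.

```python
import copy

# Item properties that are arrays themselves; a single-element summary is
# ambiguous for them, so this sketch leaves them alone.
ARRAY_VALUED = {"instruments"}


def populate_item(item, collection):
    """Return a copy of item with defaults filled in from the Collection."""
    result = copy.deepcopy(item)
    properties = result.setdefault("properties", {})

    for key, summary in collection.get("summaries", {}).items():
        if key in ARRAY_VALUED or key in properties:
            continue
        # Only a list with exactly one value maps unambiguously; ranges
        # ({"minimum": ..., "maximum": ...}) and longer lists are skipped.
        if isinstance(summary, list) and len(summary) == 1:
            properties[key] = summary[0]

    # Item asset definitions are matched to assets via the asset key.
    item_assets = collection.get("item_assets", {})
    for asset_key, asset in result.get("assets", {}).items():
        for field, value in item_assets.get(asset_key, {}).items():
            # Values the job already produced take precedence.
            asset.setdefault(field, copy.deepcopy(value))

    return result
```

Metadata that the job already wrote is never overwritten, so the Collection only acts as a source of defaults.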

Mapping the example

The metadata described in #514 (comment) is not really STAC compliant. Here's a mapping to a Collection:

  • geometry => extent (temporal.interval = tstart and tend, spatial.bbox = bbox, proj = proj:code)
  • im:keywords => keywords
  • dc:comment => description
  • dc:title => title
  • dc:url => links, (rel potentially about or describedby)
  • im:thematic-area => not present, potentially in keywords
  • dc:originator => providers (role potentially producer/licensor)
  • im:geographic-area => not present (implicitly through bbox), potentially in keywords
  • dc:source => sci:citation
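
The mapping above could be sketched in Python like this. Several points are assumptions rather than facts from the example: that tstart and tend are milliseconds since the Unix epoch, that the source bbox is ordered west, east, south, north, that proj:code is placed in summaries, and that the producer role applies. The function name is hypothetical, the collection id and license are not in the source metadata and must be supplied by the caller, and the stac_extensions entries for the projection and scientific extensions are omitted.

```python
import re
from datetime import datetime, timezone


def _millis_to_iso(millis):
    """Convert epoch milliseconds (assumed unit) to an RFC 3339 UTC string."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def to_stac_collection(source, collection_id, license_id):
    geometry = source["geometry"]
    meta = source["metadata"]

    tstart = int(re.search(r"tstart=(\d+)", geometry).group(1))
    tend = int(re.search(r"tend=(\d+)", geometry).group(1))
    bbox_text = re.search(r"bbox=\[([^\]]+)\]", geometry).group(1)
    # Assumed source order: west east south north. STAC wants W, S, E, N.
    west, east, south, north = (float(v) for v in bbox_text.split())
    proj = re.search(r"proj=([^,}]+)", geometry).group(1)

    keywords = [k.strip() for k in meta.get("im:keywords", "").split(",") if k.strip()]
    seen = {k.lower() for k in keywords}
    # Thematic and geographic area have no STAC field; fold them into keywords.
    for extra in (meta.get("im:thematic-area"), meta.get("im:geographic-area")):
        if extra and extra.lower() not in seen:
            keywords.append(extra)
            seen.add(extra.lower())

    return {
        "type": "Collection",
        "stac_version": "1.0.0",
        "id": collection_id,
        "title": meta.get("dc:title", ""),
        "description": meta.get("dc:comment", ""),
        "license": license_id,
        "keywords": keywords,
        "providers": [{"name": meta["dc:originator"], "roles": ["producer"]}],
        "extent": {
            "spatial": {"bbox": [[west, south, east, north]]},
            "temporal": {"interval": [[_millis_to_iso(tstart), _millis_to_iso(tend)]]},
        },
        "summaries": {"proj:code": [proj]},
        "sci:citation": meta.get("dc:source", ""),
        "links": [{"rel": "about", "href": meta["dc:url"]}],
    }
```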

Open questions/challenges

I think most questions, challenges and use cases would be solved with the approach described above.
What is open and unanswered so far is:

> how to indicate that a certain output property, like a resolution, depends on a parameter?

I can't think of a simple solution right now, especially not one that is STAC compliant.
I could think of a solution that uses e.g. JSON Path/Pointer (i.e. an equivalent of $ref in JSON Schema) to point between the files, but this would be an openEO specific extension. Not sure whether it's worth the effort?
