Specify target collection metadata in UDP (& batch jobs) #514

jdries · 2023-10-03T14:00:28Z

2 new use cases came up, with a similar solution:

Our UDP users want to treat a UDP basically as a 'virtual collection'. To allow this, they would like to know the STAC (collection) metadata of the data cube that is generated when the UDP is invoked. Of course, there can be some unknowns, depending on the parameters of the UDP. Some UDPs are fairly constrained, while others can output any raster cube. This use case is relevant for the constrained kind, where for instance a UDP wants to communicate constraints on the output. Some examples:

produces only data over Europe
output from 2017 onwards
output resolution is 300m
output has 4 bands, with detailed band metadata

Note that this collection metadata also acts as a definition of constraints: if the output AOI is Europe, then the UDP will probably not accept an input AOI in North America. So UDP tools can use this for input field validation, which is very useful for generic wizards like the one in the openEO editor.
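As a rough illustration of the validation idea above, a client could compare the user's input AOI with the overall bounding box in the collection extent. This is only a sketch: the function names and the Europe bounding box values are made up for the example, and it assumes boxes in [west, south, east, north] order without antimeridian handling.

```python
from typing import Sequence


def bbox_intersects(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if two [west, south, east, north] boxes overlap.

    Assumption: neither box crosses the antimeridian.
    """
    a_west, a_south, a_east, a_north = a
    b_west, b_south, b_east, b_north = b
    return (
        a_west <= b_east
        and b_west <= a_east
        and a_south <= b_north
        and b_south <= a_north
    )


def validate_aoi(collection: dict, input_bbox: Sequence[float]) -> bool:
    """Check an input AOI against the first (overall) bbox of a STAC Collection."""
    overall_bbox = collection["extent"]["spatial"]["bbox"][0]
    return bbox_intersects(overall_bbox, input_bbox)


# Hypothetical target collection for a UDP that only produces data over Europe
# from 2017 onwards. The coordinates are illustrative, not an official extent.
europe_only = {
    "type": "Collection",
    "extent": {
        "spatial": {"bbox": [[-25.0, 34.0, 45.0, 72.0]]},
        "temporal": {"interval": [["2017-01-01T00:00:00Z", None]]},
    },
}

inside = validate_aoi(europe_only, [4.0, 50.0, 6.0, 52.0])         # AOI in the Benelux
outside = validate_aoi(europe_only, [-105.0, 35.0, -100.0, 40.0])  # AOI in North America
```

A wizard could run a check like this before submitting the request and warn the user when the result is False.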

The second case is perhaps easier to understand: batch jobs try to fill in as much STAC metadata as possible when generating output, but cannot know everything. For instance, a job that generates categorical data cannot really know which colors would be suited for visualization. As a user, I would like to submit a kind of metadata template in STAC format, so that I can immediately generate output with more complete STAC metadata.

My proposed solution is to simply add a property with the target STAC collection metadata to the UDP and batch job schema:

https://api.openeo.org/#tag/User-Defined-Processes/operation/store-custom-process

I'll probably experiment with this myself, but also wanted to share the idea. These cases are triggered by user projects.
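To make the proposal a bit more concrete, here is a sketch of what a stored UDP could look like with such a property. The property name target_collection, the UDP id, the collection id and all values are hypothetical and only serve as illustration; the other top-level fields (id, summary, parameters, process_graph) are the ones the UDP schema already has.

```python
import json

# Hypothetical UDP document. "target_collection" is NOT part of the openEO API;
# it is an assumed name for the proposed property.
udp = {
    "id": "europe_composite_300m",  # hypothetical UDP id
    "summary": "Composite over Europe at 300m resolution",
    "parameters": [
        {
            "name": "bbox",
            "description": "Area of interest, must lie within Europe.",
            "schema": {"type": "object", "subtype": "bounding-box"},
        }
    ],
    "process_graph": {
        "load": {
            "process_id": "load_collection",
            "arguments": {
                "id": "SOME_COLLECTION",  # placeholder collection id
                "spatial_extent": {"from_parameter": "bbox"},
                "temporal_extent": ["2017-01-01", None],
            },
            "result": True,
        }
    },
    # Proposed addition: STAC Collection metadata describing the output.
    "target_collection": {
        "type": "Collection",
        "extent": {
            "spatial": {"bbox": [[-25.0, 34.0, 45.0, 72.0]]},
            "temporal": {"interval": [["2017-01-01T00:00:00Z", None]]},
        },
        "summaries": {"gsd": [300]},
    },
}

udp_json = json.dumps(udp, indent=2)
```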

jdries · 2024-06-06T06:24:12Z

Update for myself: example of metadata that users are asking for.


{
  "geometry" : "τ1{tend=946684800000,tstart=0,ttype=logical}S2(43199,21599){bbox=[-180.0 180.0 -90.0 90.0],proj=EPSG:4326}",

  "metadata" : {
    "im:keywords" : "global, climate, weather, Average temperature",
    "dc:comment" : "This is WorldClim version 2.1 climate data for 1970-2000. This version was released in January 2020.\r\nThere are monthly climate data for average temperature (°C).\r\nThe data is available at 30 seconds (~1 km2).\r\nFor \"time\", the month scope is inside the semantics data annotation",
    "im:notes" : "",
    "dc:title" : "WorldClim Historical climate data version 2.1 data 30s for 1970-2000 average temperature January",
    "dc:url" : "https://worldclim.org/data/worldclim21.html",
    "dc:creator" : "",
    "im:thematic-area" : "Earth",
    "dc:originator" : "Worldclim",
    "im:geographic-area" : "Global",
    "dc:source" : "Fick, S.E. and R.J. Hijmans, 2017. Worldclim 2: New 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37(12):4302-4315."
  }
}

m-mohr · 2024-09-20T00:48:27Z

With regards to 2: The GEE driver has a parameter for save_result which can contain STAC metadata that is added to the output. Isn't stac_modify also suitable here?

Open-EO/openeo-api#514

jdries · 2025-01-06T11:54:07Z

I created a PR here as one alternative to solve this:
ESA-APEx/apex_algorithms#79

The idea is not to extend the UDP spec, but rather to link to another STAC document that describes the collection that is generated by the UDP. So this is a kind of 'virtual' STAC object, that only materializes when the UDP is executed.

Advantages:

The STAC metadata can put constraints on the spatiotemporal output extent. This is an often repeated requirement for UDPs, which is not yet possible with JSON Schema.
Fixed output properties, like a resolution (gsd, EPSG code), can be described using STAC.
Bands of the output cube can be formally documented. The UDP spec does not allow this.

Challenges/questions:

how to indicate that a certain output property, like a resolution, depends on a parameter?
what is a good 'rel' for the link from UDP to STAC?
what is a good 'rel' for the link from STAC to UDP?
would it be better to use a STAC Item rather than a Collection? An Item is also a feature, and could then perhaps be merged with the OGC API Record that APEx uses anyway to describe UDPs. But then we won't have summaries, which allow listing variable property values that depend on UDP parameters...

m-mohr · 2025-01-06T12:16:04Z

Generally, the link seems like a good idea. I think I'd need a bit more time to investigate the usecase to come up with potential solutions.

m-mohr · 2025-01-06T15:37:19Z

> The idea is not to extend the UDP spec, but rather to link to another STAC document that describes the collection that is generated by the UDP. So this is a kind of 'virtual' STAC object, that only materializes when the UDP is executed.

I'm not sure I understand this. I thought the STAC Collection should already be present before the UDP is executed, to inform users about the scope and potentially to allow validation in clients, e.g. the Web Editor Wizard. The STAC Items would only materialize when the UDP is executed, though (and the Collection may be updated, too).

Assuming my comment above is correct, my initial thoughts on this:

A link to a STAC Collection seems like the most reasonable approach. It pretty much is a STAC Collection for which you expect the Items to be generated. So the STAC Collection should be valid under that assumption: no STAC Items should be generated that do not match the Collection metadata. The primary issue that I see is that you can only easily specify a bbox, not a geometry (unless we use proj:geometry on the top level, but that is pretty uncommon).

Relation types

describedby could be a potential candidate for UDP to STAC, which is also used by OGC APIs. The STAC Collection could link back to UDP using describes. My preference for now.
about works similarly, could be an alternative if not happy with describedby.
STAC uses the link relation type collection from an item, maybe that works for us as well. But it might be a bit misleading as you could also expect a collection of processes under that URL.
child seems to be a bit generic and ambiguous in this case.

Remark: STAC has no dedicated media type for Collections, so it's always a bit ambiguous whether something links to STAC or not. If a client detects a link with rel = describedby and type = application/json, it still needs to load it and check for the existence of stac_version and type = Collection to ensure it's really a STAC Collection.
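The detection logic from the remark could be sketched as follows. The function names are made up for this example, and fetch_json is an assumed callable that the client supplies to load a URL and return the parsed JSON (kept abstract here so the sketch does not depend on a particular HTTP library).

```python
from typing import Callable, Optional


def is_stac_collection(document: dict) -> bool:
    """A document only counts as a STAC Collection if both markers are present."""
    return (
        isinstance(document.get("stac_version"), str)
        and document.get("type") == "Collection"
    )


def find_target_collection(
    process: dict, fetch_json: Callable[[str], dict]
) -> Optional[dict]:
    """Follow describedby links of a UDP until one resolves to a STAC Collection."""
    for link in process.get("links", []):
        if link.get("rel") != "describedby":
            continue
        if link.get("type") != "application/json":
            continue
        # The media type alone is ambiguous, so the document has to be loaded.
        document = fetch_json(link["href"])
        if is_stac_collection(document):
            return document
    return None


# Usage with an in-memory stand-in for HTTP; the URLs are placeholders.
fake_web = {
    "https://example.com/docs.json": {"title": "Some other JSON document"},
    "https://example.com/collection.json": {
        "stac_version": "1.0.0",
        "type": "Collection",
        "id": "virtual-collection",
    },
}
udp = {
    "id": "my_udp",
    "links": [
        {"rel": "describedby", "type": "application/json", "href": "https://example.com/docs.json"},
        {"rel": "describedby", "type": "application/json", "href": "https://example.com/collection.json"},
    ],
}
found = find_target_collection(udp, fake_web.__getitem__)
```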

Populating STAC Items

You can't simply provide the Collection via stac_modify (due to JSON Merge Patch), but if back-ends find such a STAC Collection in the process metadata, they can automatically fill in some of the metadata:

Assets in STAC Items can be populated using the corresponding item asset definitions (assigned via the asset keys), covering e.g. the "categorical data" use case.
Summaries could also be used to populate the Item properties, but only if a single value is provided in the summaries (i.e. a single value in an array). In all other cases the mapping is ambiguous, unfortunately. Note: The population of Item properties that are arrays (e.g. instruments) needs special handling.
The Collection or UDP could potentially also link to an (optional) JSON Merge Patch file if populating the Items is of high importance. JSON Merge Patch has a separate media type (application/merge-patch+json), so it is easy to detect. There's no obvious rel type, maybe items, item-changeset or so.
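A back-end could implement the first two points roughly like this. It is a sketch under assumptions: the function name is made up, existing Item values are never overwritten, and array-valued properties are simply skipped via a hard-coded set rather than handled properly. The classification:classes values are illustrative.

```python
import copy

# Assumption: properties known to be arrays are skipped instead of handled.
ARRAY_PROPERTIES = {"instruments"}


def populate_item(item: dict, collection: dict) -> dict:
    """Return a copy of the Item, filled in from item_assets and summaries."""
    result = copy.deepcopy(item)

    # 1. Copy item asset definitions onto assets that share the same key.
    item_assets = collection.get("item_assets", {})
    for key, asset in result.get("assets", {}).items():
        for field, value in item_assets.get(key, {}).items():
            asset.setdefault(field, copy.deepcopy(value))

    # 2. Copy summaries that list exactly one value; anything else is ambiguous.
    properties = result.setdefault("properties", {})
    for name, summary in collection.get("summaries", {}).items():
        if name in ARRAY_PROPERTIES:
            continue
        if isinstance(summary, list) and len(summary) == 1:
            properties.setdefault(name, copy.deepcopy(summary[0]))
    return result


collection = {
    "type": "Collection",
    "item_assets": {
        "landcover": {
            "title": "Land cover classes",
            "classification:classes": [
                {"value": 1, "name": "forest", "color_hint": "00FF00"},
                {"value": 2, "name": "water", "color_hint": "0000FF"},
            ],
        }
    },
    "summaries": {"gsd": [300], "platform": ["a", "b"], "instruments": ["msi"]},
}
item = {
    "type": "Feature",
    "properties": {"datetime": "2020-01-01T00:00:00Z"},
    "assets": {"landcover": {"href": "landcover.tif"}},
}
populated = populate_item(item, collection)
```

Here gsd is copied because the summary has a single value, platform is left out because two values are listed, and instruments is skipped as an array property.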

Mapping the example

The metadata described in #514 (comment) is not really STAC-compliant. Here's a mapping to a Collection:

geometry => extent (temporal.interval = tstart and tend, spatial.bbox = bbox), proj => proj:code
im:keywords => keywords
dc:comment => description
dc:title => title
dc:url => links (rel potentially about or describedby)
im:thematic-area => not present, potentially in keywords
dc:originator => providers (role potentially producer/licensor)
im:geographic-area => not present (implicitly through bbox), potentially in keywords
dc:source => sci:citation
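The mapping could be sketched in code as below. Several things are assumptions: the geometry string is not parsed (its grammar is not specified here), so tstart, tend, bbox and proj are passed in as already extracted values; tstart/tend are taken to be milliseconds since the Unix epoch; the source bbox is taken to be in [west, east, south, north] order; proj:code is placed in summaries; and required Collection fields without a source (such as license) are left out, so the result is not a schema-valid Collection.

```python
from datetime import datetime, timezone


def ms_to_rfc3339(ms: int) -> str:
    """Convert milliseconds since the Unix epoch to an RFC 3339 UTC timestamp."""
    moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def to_stac_collection(collection_id, metadata, tstart_ms, tend_ms, bbox_wesn, proj_code):
    """Map the dc:/im: metadata to a partial STAC Collection."""
    west, east, south, north = bbox_wesn  # assumed order of the source bbox
    keywords = [k.strip() for k in metadata.get("im:keywords", "").split(",") if k.strip()]
    # Thematic and geographic area have no STAC field, so they become keywords.
    for extra in (metadata.get("im:thematic-area"), metadata.get("im:geographic-area")):
        if extra and extra.lower() not in [k.lower() for k in keywords]:
            keywords.append(extra)

    collection = {
        "type": "Collection",
        "stac_version": "1.1.0",
        "id": collection_id,  # not derivable from the source metadata
        "title": metadata.get("dc:title", ""),
        "description": metadata.get("dc:comment", ""),
        "keywords": keywords,
        "extent": {
            "spatial": {"bbox": [[west, south, east, north]]},
            "temporal": {"interval": [[ms_to_rfc3339(tstart_ms), ms_to_rfc3339(tend_ms)]]},
        },
        "summaries": {"proj:code": [proj_code]},
        "links": [],
    }
    if metadata.get("dc:url"):
        collection["links"].append({"rel": "about", "href": metadata["dc:url"]})
    if metadata.get("dc:originator"):
        collection["providers"] = [{"name": metadata["dc:originator"], "roles": ["producer"]}]
    if metadata.get("dc:source"):
        collection["sci:citation"] = metadata["dc:source"]
    return collection


# Values taken from the example above (dc:comment shortened to its first sentences).
example = to_stac_collection(
    "worldclim-tavg",  # hypothetical id
    {
        "im:keywords": "global, climate, weather, Average temperature",
        "dc:comment": "This is WorldClim version 2.1 climate data for 1970-2000. This version was released in January 2020.",
        "dc:url": "https://worldclim.org/data/worldclim21.html",
        "im:thematic-area": "Earth",
        "dc:originator": "Worldclim",
        "im:geographic-area": "Global",
    },
    tstart_ms=0,
    tend_ms=946684800000,
    bbox_wesn=[-180.0, 180.0, -90.0, 90.0],
    proj_code="EPSG:4326",
)
```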

Open questions/challenges

I think most questions, challenges and usecases would be solved with the approach described above.
What is open and unanswered so far is:

> how to indicate that a certain output property, like a resolution, depends on a parameter?

I can't think of a simple solution right now, especially not one that is STAC compliant.
I could think of a solution that uses e.g. JSON Path/Pointer (i.e. an equivalent of $ref in JSON Schema) to point between the files, but this would be an openEO-specific extension. Not sure whether it's worth the effort?
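Purely as a thought experiment, such a pointer-based extension might look like the sketch below. Nothing here is specified anywhere: the use of a "$ref" key inside summaries, the convention that the pointer targets a parameter definition in the UDP, and the function names are all hypothetical. Only the pointer resolution itself follows the JSON Pointer rules (RFC 6901).

```python
def resolve_pointer(document, pointer: str):
    """Resolve a JSON Pointer (RFC 6901) against a parsed JSON document."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("JSON Pointer must be empty or start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape in the order required by RFC 6901: "~1" first, then "~0".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


def resolve_property(value, udp: dict, arguments: dict):
    """Hypothetical: replace a {"$ref": pointer} value by the parameter value.

    The pointer is assumed to target a parameter definition in the UDP; the
    value comes from the invocation arguments, else the parameter default.
    """
    if isinstance(value, dict) and "$ref" in value:
        parameter = resolve_pointer(udp, value["$ref"])
        return arguments.get(parameter["name"], parameter.get("default"))
    return value


udp = {
    "id": "my_udp",
    "parameters": [
        {"name": "bbox", "schema": {"type": "object"}},
        {"name": "resolution", "schema": {"type": "number"}, "default": 300},
    ],
}
# Hypothetical summaries entry in the linked Collection.
gsd_summary = {"$ref": "/parameters/1"}

gsd_with_argument = resolve_property(gsd_summary, udp, {"resolution": 100})
gsd_with_default = resolve_property(gsd_summary, udp, {})
```

A design like this keeps the Collection readable by generic STAC clients only if they tolerate the non-standard value, which is one reason it may not be worth the effort.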

jdries self-assigned this Jan 16, 2024

jdries added a commit to ESA-APEx/apex_algorithms that referenced this issue Jan 6, 2025

first small example of 'virtual collection'

ee9986d

Open-EO/openeo-api#514

jdries mentioned this issue Jan 6, 2025

prototype: describe UDP output as virtual STAC collection ESA-APEx/apex_algorithms#79

Open
