Describing data accessed from static endpoints (e.g. S3 object stores)? #240

Open
cboettig opened this issue Feb 7, 2023 · 2 comments
Labels
Needs clarification Issue needs to be clarified Update Documentation updates to the guidance docs

Comments

@cboettig

cboettig commented Feb 7, 2023

Schema.org's Dataset model, and the documentation here, describe accessing data either through Service Endpoints or as Data Downloads.

It is not entirely clear to me how to document data that is accessed through an object store and is not meant to be downloaded as individual assets. For example, consider the GBIF parquet snapshots on AWS S3. Yes, technically we can define a contentUrl for each of the 2053 parquet shards in an occurrence.parquet sub-directory, but really such data is intended to operate in a model somewhat closer to a service endpoint, where a tool like Apache Arrow is used to open an over-the-wire connection to the database root. However, a service endpoint doesn't seem to be the right choice either, as this approach is not intended as a set of curl-based REST calls.

  • Can a distribution element include a list-valued argument to contentUrl (e.g. for a multi-part file)?
  • Does science-on-schema have advice about URI construction in this context? i.e. should bucket protocol notation, like s3:// or abfs://, be used?
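
On the first question: JSON-LD syntax allows any property to carry an array of values, so a list-valued contentUrl is at least well-formed. Here is a minimal sketch in Python of what that could look like. The shard names are hypothetical, and the Parquet media type used for encodingFormat is an assumption. Whether harvesters would treat such a list as one multi-part file is not something this thread settles.

```python
import json

# Root taken from the GBIF example in this issue; shard names are hypothetical.
ROOT = "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/"
# Assumption: media type used to label Parquet files.
PARQUET = "application/vnd.apache.parquet"


def multipart_download(root, shard_names):
    """Return one DataDownload whose contentUrl lists every shard."""
    return {
        "@type": "DataDownload",
        "encodingFormat": PARQUET,
        "contentUrl": [root + name for name in shard_names],
    }


# Three made-up shard names stand in for the 2053 real ones.
shards = [f"part-{i:05d}.parquet" for i in range(3)]
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "GBIF occurrence snapshot (illustrative)",
    "distribution": multipart_download(ROOT, shards),
}
print(json.dumps(dataset, indent=2))
```

With 2053 shards the list becomes long and has to be regenerated for every snapshot, which is part of why pointing at the root seems more attractive.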

FWIW, I find the examples of STAC metadata documentation instructive and very practical here, e.g. the GBIF STAC JSON on Azure. Notably, this identifies only the parquet 'root' as the href, and uses the bucket URI notation. That approach seems to work well with existing tooling and workflows.
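
A rough sketch of how that root-only pattern might carry over to a schema.org distribution, written in Python so the two URI notations can be derived from the same bucket, region and prefix. The helper names `s3_uri` and `https_url` are made up for illustration, the Parquet media type is an assumption, and using a DataDownload for a directory-like root is exactly the point in question here, not settled guidance.

```python
import json


def s3_uri(bucket, prefix):
    """Bucket-protocol notation, as used by the STAC example."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def https_url(bucket, region, prefix):
    """Equivalent virtual-hosted-style HTTPS form for the same S3 location."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{prefix.lstrip('/')}"


# Values read off the GBIF URL discussed in this issue.
BUCKET = "gbif-open-data-us-east-1"
REGION = "us-east-1"
PREFIX = "occurrence/2023-02-01/occurrence.parquet/"

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "GBIF occurrence snapshot (illustrative)",
    "distribution": {
        "@type": "DataDownload",
        # Only the parquet root is referenced, not the individual shards.
        "contentUrl": s3_uri(BUCKET, PREFIX),
        # Assumption: media type used to label Parquet data.
        "encodingFormat": "application/vnd.apache.parquet",
    },
}
print(json.dumps(dataset, indent=2))
print(https_url(BUCKET, REGION, PREFIX))
```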

@mbjones
Collaborator

mbjones commented Feb 7, 2023

Great questions, @cboettig, and thanks for raising them. I think updating our guidance to address the issues you raise would be really helpful to many groups. @ashepherd maybe we can add this to the list of next priorities we generated at the last meeting, and discuss it at the next meeting? I'll miss the next meeting while I am on vacation, but I'll put a vote in here for addressing this issue, as we are grappling with similar concerns wrt STAC metadata for collections.

@mbjones mbjones added Update Documentation updates to the guidance docs Needs clarification Issue needs to be clarified labels Feb 7, 2023
@fils
Collaborator

fils commented Feb 7, 2023

@cboettig interesting question....

I think I would start with schema:url pointing to https://gbif-open-data-us-east-1.s3.us-east-1.amazonaws.com/index.html#occurrence/2023-02-01/occurrence.parquet/ as it is a more human-readable URL.

schema:distribution would be more for a single download of the data though, which is not the case in a sharded parquet file/directory of files.

Just a first thought would be to use something like potentialAction to point to an Action type. Once there is an Action, you can define a target, and it is then easy to layer in the URL in the s3:// format.

This is from some unrelated approaches, but might be an interesting starting point.

 "potentialAction": {
    "@type": "Action",
    "name": "Use My API",
    "description": "Use the API to retrieve data from my organization.",
    "@id": "https://us-central1-top-operand-112611.cloudfunctions.net/function-1",
    "result": {
      "@type": "DataDownload",
      "encodingFormat": "text/plain",
      "description": "a simple text result for the RGB counts"
    },
    "target": {
      "@type": "EntryPoint",
      "urlTemplate": "https://us-central1-top-operand-112611.cloudfunctions.net/function-1",
      "httpMethod": "POST",
      "contentType": [
        "image/jpeg",
        "image/png"
      ]
    },
    "object": {
      "@type": "ImageObject",
      "description": "A JPEG or PNG to analyze the RGB counts"
    }
  },
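
Adapting that pattern to the GBIF case might look roughly like the following, sketched in Python and serialized to JSON-LD. The action name and description are invented for illustration, the Parquet media type is an assumption, and httpMethod is left out because the target is not an HTTP call. This is one possible reading of the idea, not agreed guidance.

```python
import json

# Parquet root from the GBIF example, in bucket URI notation.
ROOT = "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/"

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "GBIF occurrence snapshot (illustrative)",
    # Human-oriented landing page, as suggested for schema:url.
    "url": "https://gbif-open-data-us-east-1.s3.us-east-1.amazonaws.com/"
           "index.html#occurrence/2023-02-01/occurrence.parquet/",
    "potentialAction": {
        "@type": "Action",
        # Name and description are hypothetical wording.
        "name": "Open parquet dataset",
        "description": "Open the sharded parquet dataset at its root, "
                       "e.g. with Apache Arrow.",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": ROOT,
            # Assumption: media type used to label Parquet data.
            "contentType": "application/vnd.apache.parquet",
        },
    },
}
print(json.dumps(dataset, indent=2))
```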
