
Best practice for using DRS and Data Connect together #394

Open
briandoconnor opened this issue May 22, 2023 · 11 comments

Comments

@briandoconnor
Contributor

briandoconnor commented May 22, 2023

Background

Feature Branch: https://github.com/ga4gh/data-repository-service-schemas/tree/feature/issue-394-drs-plus-connect-docs-v1

I'm opening this issue as a follow-up to the "DRS and Data Connect" session at the April 20th, 2023 GA4GH Connect meeting. The session explored how standards from the Cloud and Discovery work streams can be used together to address the two needs listed below:

  • Address the need to obtain additional data about a DRS object
  • Revisit how Data Connect handles the need for bundles

Some resources of interest:

Key Takeaways from GA4GH Connect

Metadata + DRS

We agreed that best practices for working with metadata were important, and largely agreed on two guiding principles:

    1. DRS doesn’t know about metadata, and shouldn’t. Instead, we should lean into the fact that systems that use DRS typically have some database-like component that does know about object metadata.
    2. No new APIs (or API changes for DRS) are needed. Instead, we should add an appendix to the DRS spec documenting best practices for building systems that use DRS and care about metadata.

Compound Objects

We agreed with the way the DRS 1.3.0 develop branch frames the need for compound object support:

  • Some content (e.g. DICOM images) is best represented as a compound object consisting of a structured collection of atomic DrsObjects.
  • Each compound object should have a DRS ID, that clients can use to retrieve the object structure and its constituent atomic objects.
    We discussed two possible ways to represent and retrieve compound object contents, but didn’t have time to discuss their tradeoffs:
    1. The approach documented in the develop branch (Best Practice: Manifests), where the compound object’s DRS ID provides access to a manifest file listing the object contents. Manifest format is datatype-specific and outside the scope of the DRS spec (but could for example be a JSON file).
    2. An alternate approach where the compound object’s DRS ID provides access to a Data Connect table describing the object contents. Table format is datatype-specific and outside the scope of the DRS spec.

Goal for this Issue

This issue is to give us a place to discuss the use of Data Connect and DRS together (and link PRs to). The immediate goal of this Issue is to get a corresponding PR that addresses the best practice of using Data Connect together with DRS to provide 1) more metadata about DRS objects and 2) a scalable alternative to bundles. The intention is a documentation only change with a best practice appendix to the DRS spec.

@ianfore

ianfore commented Jul 24, 2023

Three suggestions:

  1. The use of Data Connect to obtain lists of DRS ids as an alternative to bundles is only feasible because of the ability to make DRS calls in bulk. Where a Data Connect query returns multiple DRS ids, it is likely to be common practice to request all DRS ids for a specific DRS server in bulk.

  2. It is a Data Connect rather than a DRS matter, but a specific data type for DRS URIs in the Data Connect schema is likely to be helpful in indicating that a given table column contains DRS URIs.

  3. It is expected that Data Connect tables store DRS URIs as opposed to DRS ids.
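
As a rough illustration of suggestions 1 and 3, the sketch below groups DRS URIs (as they might appear in a Data Connect result column) by DRS server, so that each server can then be asked for its objects in one bulk request. It assumes hostname-based URIs of the form drs://<host>/<id>; the function name and the example hosts and ids are hypothetical, and the bulk request itself is not shown.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_drs_uris_by_server(drs_uris):
    """Group hostname-based DRS URIs (drs://<host>/<id>) by host.

    Returns a dict mapping each host to the list of object ids on that host.
    Compact-identifier DRS URIs are not handled by this sketch.
    """
    ids_by_host = defaultdict(list)
    for uri in drs_uris:
        parsed = urlparse(uri)
        object_id = parsed.path.lstrip("/")
        if parsed.scheme != "drs" or not parsed.netloc or not object_id:
            raise ValueError(f"not a hostname-based DRS URI: {uri!r}")
        ids_by_host[parsed.netloc].append(object_id)
    return dict(ids_by_host)


# Hypothetical values standing in for a DRS URI column in a query result.
uris = [
    "drs://drs.example.org/obj-1",
    "drs://drs.example.org/obj-2",
    "drs://other.example.com/obj-9",
]
batches = group_drs_uris_by_server(uris)
```

Each entry in `batches` is then a candidate for a single bulk call to that server.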

@briandoconnor
Contributor Author

To respond to Ian's comment 1, we now have DRS bulk so hopefully item 1 is solved.

Agree with 2

Agree with DRS URIs

Do we have bi-directional links between a DRS object and a Data Connect query? Do we have info in the service-info about the Data Connect server linked to this DRS server?

@dglazer
Member

dglazer commented Apr 23, 2024

I (still) strongly agree with the premise above that "No new APIs (or API changes for DRS) are needed. Instead, we should add an appendix to the DRS spec documenting best practices for building systems that use DRS and care about metadata." And now that compound objects are well-documented in the spec, I don't think we need to say more about how to handle them with Data Connect. Therefore, the simplest thing that could possibly work is to add a few sentences to the DRS doc saying roughly:

  • Repeat that DRS is just about fetching bytes; it doesn't handle semantic information about objects.
  • Therefore, DRS servers aren't useful in isolation. Systems typically use DRS paired with an object catalog that supports search and discovery, and provides a pointer to the DRS id for actually fetching the bytes when needed.
  • A recommended, but not required, best practice is to use Data Connect as the API to that data catalog.

And maybe add:

  • There may be use cases where users find a DRS id without any context. If so, they could look up the object in the catalog to understand what that's an ID for.

Or:

  • DRS ids aren't useful without context -- they should always come from a catalog.
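
A minimal sketch of the pairing described above, using a toy in-memory table in place of a real catalog. In practice the catalog would sit behind a search API such as Data Connect; the column names, values, and function names here are made up for illustration.

```python
# Toy "catalog": each row carries descriptive columns plus a DRS URI.
# A real catalog would be queried through an API; this is only a stand-in.
CATALOG = [
    {"sample": "S1", "file_type": "bam", "drs_uri": "drs://drs.example.org/obj-1"},
    {"sample": "S1", "file_type": "vcf", "drs_uri": "drs://drs.example.org/obj-2"},
    {"sample": "S2", "file_type": "bam", "drs_uri": "drs://drs.example.org/obj-3"},
]


def find_drs_uris(catalog, **criteria):
    """Discovery: return the DRS URIs of rows matching all given column values."""
    return [
        row["drs_uri"]
        for row in catalog
        if all(row.get(column) == value for column, value in criteria.items())
    ]


def describe(catalog, drs_uri):
    """Reverse lookup: return the catalog row for a DRS URI, or None if unknown."""
    return next((row for row in catalog if row["drs_uri"] == drs_uri), None)


bam_uris = find_drs_uris(CATALOG, file_type="bam")
context = describe(CATALOG, "drs://drs.example.org/obj-2")
```

`find_drs_uris` covers the usual direction (search first, then fetch bytes via DRS), while `describe` covers the case of a DRS id found without context.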

@mattions

Link to the issue where we have been discussing this: #336 (comment)

@bheavner

@dglazer re: "There may be use cases where users find a DRS id without any context. If so, they could look up the object in the catalog to understand what that's an ID for."

  • how would a user find the catalog for a free DRS id? (I'm not aware of a system like DNS for DRS ids - is there one?)

@mattions

@bheavner the main idea is that you do not search for drs_uri values.
As @dglazer wrote, DRS alone does not tell you much: just how to get the bytes it points to, and which authorization you need to pass to get them.

Discoverability is a huge topic, but it is better handled via FHIR, Data Connect, or cohort portals that provide a list of DRS URIs at the end.

Usually the search is like "give me all the files for patients that have this disease with these conditions".
This question needs to go to a clinical/phenotypic server, which will then return the patients with connected samples, which will have a drs_uri attached. (For example, see the model of the NCPI-FHIR IG: https://nih-ncpi.github.io/ncpi-fhir-ig/index.html as one possible way to connect clinical information with a drs_uri.)

Discovering DRS_uris per se does not make too much sense.

@bheavner

@mattions - oh, I certainly agree that DRS shouldn't do the work of FHIR!

I don't mean to solve the problem of discoverability. Instead, I was hoping there might be a way to include a breadcrumb for a receiving system to know where to go for more context about the bytes the DRS URI is pointing to. That could be something like a FHIR endpoint, or a landing page, or a homebrew API from some external system that can resolve DRS URIs and provide authorization information as required by the spec.

Informally, a conversation like:

System executing a workflow: "please resolve DRS://FOO"

Data hosting system: "DRS://FOO points to file BAR in cloud location GS://BAT. It requires this kind of authorization token/passport/credentials: MAGIC_KEY. You can learn more about it at: ENDPOINT_URI, which is a FHIR_API."

System executing a workflow: "Thanks. Here's my MAGIC_KEY. Please give me file BAR from location GS://BAT. (p.s. I'll be sure to pass that ENDPOINT_URI along to my user interface and record it in my log for provenance tracking purposes.)"

That sentence "You can learn more about it at: ENDPOINT_URI, which is a FHIR_API" is the one that I mean to propose, and wonder if it might benefit the spec by giving some flexibility and increase opportunity to link data with the context of that data.

@bheavner

p.s. Perhaps a key:value approach to the "you can learn more about it at" bit could also enable something like "This DRS URI is being included in a list of DRS URIs that are gathered in response to SEARCHQUERY_REFERENCE (of some sort)"

@mattions

@bheavner yes, I'm 100% aligned with that.

The comment on #336 (comment) was hinting at exactly that.
It would be good to have the "learn more" ENDPOINT, as you call it, in the DRS response.

The idea was to include something like:

"additional_info" : [ {"type" : "fhir", "uri" : "<url_to_the_drsDocumentReference>"},
{"type" : "dataconnect", "uri" : "<url_to_the_server/drs_id>"}
]

where additional_info is a new field that the DRS server could provide, which will point to the source of the clinical/phenotypic metadata.
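
To make the idea concrete, here is a sketch of how a client might read such a field. Note that additional_info is only the field proposed in this thread and is not part of the DRS spec; the enumeration of types, the function name, and every URL below are placeholders.

```python
import json

# A DRS-object-like response. "id" and "self_uri" are existing DrsObject
# fields; "additional_info" is the PROPOSED field and does not exist in DRS.
response_text = """
{
  "id": "obj-1",
  "self_uri": "drs://drs.example.org/obj-1",
  "additional_info": [
    {"type": "fhir", "uri": "https://fhir.example.org/DocumentReference/obj-1"},
    {"type": "dataconnect", "uri": "https://connect.example.org/files/obj-1"}
  ]
}
"""

# Hypothetical enumeration of pointer types a client knows how to follow.
KNOWN_TYPES = {"fhir", "dataconnect"}


def learn_more_links(drs_object, known_types=KNOWN_TYPES):
    """Return {type: uri} for recognized entries, skipping unknown or malformed ones."""
    links = {}
    for entry in drs_object.get("additional_info", []):
        if entry.get("type") in known_types and "uri" in entry:
            links[entry["type"]] = entry["uri"]
    return links


links = learn_more_links(json.loads(response_text))
```

Because unrecognized types are skipped rather than rejected, a client written this way would keep working if the enumeration grew later.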

On the systems I work with, these tend to live in a FHIR system. However, we can make the type an enumeration, so people can implement navigation for the types that are most widely adopted (I gave dataconnect as an example, but I do not have anything that uses it).

We could call it metadata, however that means lots of different things to lots of different people, and I really like the idea (BTW, thanks for the great comment!) of having a "learn_more" pointer.

@bheavner

@mattions yes - that word "metadata" is always causing problems! Let's avoid it.

Sounds like we're on the same page then, thanks for following up.

Given that this is fairly early stage and in a brainstorming/conversation phase, do you think it's a better approach to have a general "additional_info" field and see what people do with it or how it grows (potentially splitting into more specific kinds of additional information at some point in the future), or is it better to limit it a bit?

If someone wants to use "additional_info" to link to some sort of search provenance information, would that be good?

Personally, I like the approach currently proposed - enumerated values for additional_info, but we can always expand the allowable values, and perhaps it could be refined at some point in the future if people are actually using it for real functionality.

@briandoconnor
Contributor Author

In the Cloud WS meeting on Aug 12th, 2024 we decided to just add text to the spec saying that you should have a catalog, such as Data Connect, and to have that be sufficient for DRS 1.5. As a result, I'll merge #406.
