Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indicating sort behavior across VRS classes #411

Closed
ahwagner opened this issue Nov 3, 2022 · 4 comments
Closed

indicating sort behavior across VRS classes #411

ahwagner opened this issue Nov 3, 2022 · 4 comments
Assignees

Comments

@ahwagner
Copy link
Member

ahwagner commented Nov 3, 2022

Concern 1. we need to sort some JSON arrays and not sort others for digest serialization

JSON Schema does not differentiate between arrays and sets, but VRS does. In general, we represent all sets in VRS (e.g. VariationSet.members, Genotype.members) as arrays in JSON Schema, and this represents the typical case. However, when we introduced ComposedSequenceExpressions (CSEs), we needed a way of differentiating between ordered arrays for CSEs and unordered arrays for everything else. At the time, our digest serialization rules were sufficient to resolve this, as we only reordered arrays if they contained identifiers (as pointed out by @reece, here). Since CSEs were not identifiable, they did not get changed into identifiers, and no sorting took place.

Later, however, we added Genotype, which contained an unordered members array with non-identifiable GenotypeMember objects. Because these are non-identifiable our current digest serialization rules treat this array the same way it treats a CSE array: as ordered (in this case, incorrectly). This issue (#410) and the associated PR (#409) were created to prompt discussion about updating the serialization rules to handle this situation. In addition to the need to sort this array, we need to define a mechanism that allows us to sort JSON Objects (which are evaluated as dicts in Python and have no default sorting comparison behavior). This proposal above defines a sorting behavior.

As an aside, one potential decision that would sidestep this issue (for now) is to make GenotypeMember objects identifiable, but I think this is not the correct design choice as GenotypeMembers are intended to be used only as a nested class within Genotype. Making them globally identifiable is contrary to that intended use. If there is disagreement on that underlying decision the conversation should start there.

Concern 2. define a mechanism that allows us to uniformly indicate sort behavior across classes

During the GKS-Pilot work we opted to use the ordered attribute in the JSON Schema (as opposed to within-message solutions) to handle the challenge of Concern 1. As the JSON Schema is parsed by VRS-Python and the JSON Schema specification allows for custom attributes, this seemed like a good solution to explicitly define ordered/unordered array behavior on a per-array level without increasing message size or changing digests of VRS objects. We discussed the advantages of a schema-level attribute on this thread. @larrybabb I think we saw eye-to-eye on this at the time, but @reece and @andreasprlic did not weigh in there, so it seems appropriate to me to revisit this decision and get their comments and come to a consensus decision.

There are many approaches we may take to address this concern, including some previously suggested ones:

Schema-based approaches (not in message)

  • ordered boolean property for arrays in JSON Schema (currently implemented solution)

Message-based approaches (defined in schema and explicit in message)

Documentation approaches (implementation concern only; not in message or schema)

  • Revise serialization rules and/or class-specific implementation guidance to describe the expected sort behavior of arrays

I ask that we keep the discussion on Concern 1: array sorting proposal in this thread and discuss Concern 2: indicating sort behavior (which is dependent on resolving the first concern) in a separate issue (#411).

Originally posted by @ahwagner in #410 (comment)

This issue (#411) is to discuss Concern 2 above

@ahwagner
Copy link
Member Author

ahwagner commented Nov 6, 2022

@ahwagner I get what you are saying. Thank you (again) for taking the time and effort to lay out the details above.

It seems like the 2 issues are

  1. When an array contains items with no specified ordering and are also un-identifiable (no computed digest) we need a mechanism to make sure we can digest the parent class of the array element consistently. This is the Genotype.member[] problem we are currently trying to address.

  2. When an array has a need to be in a specific order whether the elements are identifiable or not, we need a mechanism to assure the ordering is preserved. This one is less clear to me here because in the CSEs example you cite above there is no value in the json that would allow one to understand what the intended order is. I'm guessing we simply presume it is in the correct order with no way of verifying it? This seems a bit risky, and I would need some more clarity on why this is acceptable to us. If we are saying that we are not responsible for the ordering and presume however we create the data is the order it is in and the data does not need to contain any values (i.e. indices) to clarify the accuracy of the order, then we really shouldn't order anything ever.

The CSE case should probably be set up as a linked list or nesting construct to make sure that the chain/hierarchy is semantically included in the data. Maybe all ordered lists should use a designed construct to preserve those semantics?

For any data that has no ordering we would only need to solve the issue of digesting un-identifiable components in lists. For this we should assume that any items in a Value object array would meet the requirement of being a value object too and we should simply digest these elements and sort them by their value object digest, even though we don't ever persist or preserve these non-identifiable element's ids.

Again, I apologize for re-surfacing these issues. I think the idea of having an ordered flag attribute in any/all array parent classes is a bit difficult to accept as a final solution to this issue.

Originally posted by @larrybabb in #410 (comment). Replicated here to address point 2.

@ahwagner
Copy link
Member Author

ahwagner commented Nov 6, 2022

  1. When an array has a need to be in a specific order whether the elements are identifiable or not, we need a mechanism to assure the ordering is preserved. This one is less clear to me here because in the CSEs example you cite above there is no value in the json that would allow one to understand what the intended order is. I'm guessing we simply presume it is in the correct order with no way of verifying it? This seems a bit risky, and I would need some more clarity on why this is acceptable to us. If we are saying that we are not responsible for the ordering and presume however we create the data is the order it is in and the data does not need to contain any values (i.e. indices) to clarify the accuracy of the order, then we really shouldn't order anything ever.

Ordered behavior is something we get for free with JSON Arrays, and it is explicit in the JSON Schema Array definition. We originally discussed this in this issue. If there is any ambiguity in CSEs it may be in the computational definition of the CSE class, which reads:

An expression of a sequence composed from multiple other Sequence Expressions objects. MUST have at least one component that is not a LiteralSequenceExpression. CANNOT be composed from nested composed sequence expressions.

I think that what may need to happen here is an updating of the definition of CSEs to explicitly indicate that the composed sequence is created by concatenating sequences in the order expressed in the components Array.

We will also still need to discuss the mechanism for indicating that the order of a given Array in a VRS class is meaningful. It seems more correct to me to define this as a non-standard Array attribute in schema (e.g. our current implementation using ordered) than alter the message (since the meaningful order is a property of the class, not an instance).

@ahwagner
Copy link
Member Author

ahwagner commented Nov 7, 2022

On 11/7 leads call, @andreasprlic @larrybabb and @ahwagner walked through the current implementation and evaluated alternatives. We agreed that the current implementation solves our concerns and is enabling for other implementations without impacting message structures.

Proposed resolution: keep the ordered property as a mandatory custom attribute of all JSON Schema arrays used by VRS and use this property to determine if order is meaningful (ordered: true) or not (ordered: false).

@ahwagner
Copy link
Member Author

ahwagner commented Nov 9, 2022

Implemented in #409

@ahwagner ahwagner closed this as completed Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants