Filters other than dataset_id break verbatim handover to Terra #6898

Open
hannes-ucsc opened this issue Feb 12, 2025 · 4 comments
Labels: + ([priority] High), demo ([process] To be demonstrated at the end of the sprint), manifests ([subject] Generation and contents of manifests), orange ([process] Done by the Azul team)

Comments

@hannes-ucsc
Member

… as they create dangling references. These dangling references will be apparent to Terra because the relations are now (starting with PR #6815 for #6066) declared in the PFB preamble.

#6843 (comment)

This will break anvilprod when filtering by non-dataset fields. As an example, consider a manifest with a filter to select files smaller than 1 gigabyte, and a primary bundle with the following structure:

[Image: structure of the primary bundle]

The manifest will include file 1 and every entity upstream from it, but will omit file 2. The sequencing activity contains foreign key references to both files via its generated_file_id column. The reference to file 2 will then become a dangling relation that will cause Terra to reject the manifest.
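The scenario above can be sketched with a toy model. This is only an illustration, not Azul's code: the entity IDs, the sizes, the `refs` field and both helper functions are hypothetical, and "upstream" is assumed to mean any entity that references an included one.

```python
GB = 1024 ** 3

# Hypothetical toy bundle; IDs, sizes and field names are illustrative only.
ENTITIES = {
    "file-1": {"size": GB // 2, "refs": []},
    "file-2": {"size": 2 * GB, "refs": []},
    # The activity references both files, as generated_file_id does.
    "activity-1": {"size": None, "refs": ["file-1", "file-2"]},
}


def manifest_ids(entities, max_size):
    """Return files smaller than max_size plus every entity upstream of them."""
    included = {
        entity_id
        for entity_id, entity in entities.items()
        if entity["size"] is not None and entity["size"] < max_size
    }
    # Repeatedly pull in any entity that references an included one.
    changed = True
    while changed:
        changed = False
        for entity_id, entity in entities.items():
            if entity_id not in included and included.intersection(entity["refs"]):
                included.add(entity_id)
                changed = True
    return included


def dangling_refs(entities, included):
    """Return referenced IDs that are missing from the manifest."""
    return sorted({
        ref
        for entity_id in included
        for ref in entities[entity_id]["refs"]
        if ref not in included
    })


included = manifest_ids(ENTITIES, max_size=GB)
print(sorted(included))                   # ['activity-1', 'file-1']
print(dangling_refs(ENTITIES, included))  # ['file-2']
```

The activity is pulled in because it references file 1, and it carries its reference to file 2 along with it, even though file 2 was filtered out.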

#6843 (comment)

Other observations:

The foreign key anvil_assayactivity.antibody_id references the anvil_antibody table, but since we don't index contributions from that table, its replicas only occur as orphans (from replica bundles). If a manifest is created with filters that exclude orphans, dangling reference errors will occur for each anvil_assayactivity replica with a non-null antibody_id column. However, the anvil_assayactivity table is empty for all snapshots currently indexed on anvilprod, so this should be unobservable.

anvil_project presents a similar edge case (being a table described by the schema that we only index as orphans), but there are no foreign keys that reference that table, so it can't cause the dangling reference error.

@hannes-ucsc hannes-ucsc added the orange [process] Done by the Azul team label Feb 12, 2025
@hannes-ucsc
Member Author

hannes-ucsc commented Feb 12, 2025

I hope that this can be solved by making the inclusion of relations conditional upon whether only the dataset_id filter is used, which is the same condition as for whether replicas of orphans are included. This will be easy to explain to stakeholders.

A more complicated solution would be to segregate the relations into safe and unsafe. A safe relation would never contain dangling references, no matter the filters. Unsafe relations would only be included when filtering by dataset_id.
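Both options could be sketched roughly as follows. The function names and the relation placeholders are hypothetical and not taken from Azul, and treating "only the dataset_id filter" as "exactly that one filter key" is an assumption; how an empty filter set should behave is left open here.

```python
def only_dataset_id(filters):
    """True when dataset_id is the one and only filter in use (assumed reading)."""
    return set(filters) == {"dataset_id"}


def include_relations(filters):
    """Option 1: declare relations only when filtering by dataset_id alone."""
    return only_dataset_id(filters)


def relations_to_declare(filters, safe, unsafe):
    """Option 2: always declare safe relations, unsafe ones only conditionally."""
    if only_dataset_id(filters):
        return list(safe) + list(unsafe)
    return list(safe)


# Placeholder relation names, for illustration only.
safe = ["example_safe_relation"]
unsafe = ["example_unsafe_relation"]

by_dataset = {"dataset_id": {"is": ["example-dataset"]}}
by_format = {"dataset_id": {"is": ["example-dataset"]},
             "files.file_format": {"is": [".crai"]}}

print(include_relations(by_dataset))                   # True
print(include_relations(by_format))                    # False
print(relations_to_declare(by_format, safe, unsafe))   # ['example_safe_relation']
```

Option 1 needs no knowledge of individual relations, which is what makes it easy to explain; option 2 requires deciding, per relation, whether any filter could ever leave it dangling.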

@nadove-ucsc
Contributor

nadove-ucsc commented Feb 12, 2025

Here is a reproduction of the scenario described above, using the manifest filters:

{
  "bundle_uuid": {
    "is": [
      "bc98344a-a02a-ae02-8645-e0557f98e1fe"
    ]
  },
  "files.file_format": {
    "is": [
      ".crai"
    ]
  }
}

Downloaded as bundle_bisect.avro:

$ pfb show -i bundle_bisect.avro | jq -r '.relations[].dst_id' | sort | uniq >bisect_relation_ids.txt
$ pfb show -i bundle_bisect.avro | jq -r '.id' | sort | uniq >bisect_entity_ids.txt
$ comm -13 bisect_entity_ids.txt bisect_relation_ids.txt 
473ae075-2294-4e67-a07f-fb13bb038c05
f0462861-43bf-4660-b173-bbbe0fbe8fa9

The UUIDs in the output of comm are dangling relations (they appear in the relation IDs, but not the entity IDs).
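The same check can be written as a small Python function, which may be handier in an automated test. This is a sketch that assumes, as the jq expressions above imply, that `pfb show` prints one JSON object per line with an `id` field and a `relations` list whose items carry `dst_id`; the function name and the sample records are made up for illustration.

```python
import json


def dangling_relation_ids(lines):
    """Return relation target IDs that do not occur as entity IDs."""
    entity_ids = set()
    relation_ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        entity_ids.add(record.get("id"))
        # Records without relations contribute no target IDs.
        for relation in record.get("relations") or []:
            relation_ids.add(relation["dst_id"])
    return sorted(relation_ids - entity_ids)


# Made-up sample: "a" points at "b" (present) and "c" (absent).
sample = [
    '{"id": "a", "relations": [{"dst_id": "b"}, {"dst_id": "c"}]}',
    '{"id": "b", "relations": []}',
]
print(dangling_relation_ids(sample))  # ['c']
```

To run it against a real manifest, pipe the output of `pfb show -i bundle_bisect.avro` into a script that calls `dangling_relation_ids(sys.stdin)`; an empty result means no dangling relations were found.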

@hannes-ucsc
Member Author

hannes-ucsc commented Feb 14, 2025

For the demo in anvildev, perform two handovers, one filtered by dataset_id and one filtered by file type. Ideally, the handovers should be done from the Data Browser.

Demo again when this lands in anvilprod, following the same instructions.

@nadove-ucsc
Contributor

Demoed on anvildev.
