Fix H3 vaccine strains #174

Closed
6 tasks done
joverlee521 opened this issue Dec 18, 2024 · 3 comments

joverlee521 commented Dec 18, 2024

Details on Slack.

TODOs

General improvements:

Data clean up:

  • remove A/Croatia/10136/RV/2023 sequences since they are duplicates of A/Croatia/10136RV/2023
  • remove A/DistrictofColumbia/27/2023 sequences and reupload so they have A/DistrictOfColumbia/27/2023 strain names
  • rename all test and reference virus names in titers for A/Croatia/10136RV/2023 and A/DistrictOfColumbia/27/2023

Post clean up:

joverlee521 commented Jan 16, 2025

Standardize capitalization of location in strain names

This has proven to be more complicated than I initially thought...

I wanted to use geo_synonyms.tsv to standardize locations in strain names: compare each location against the label column case-insensitively (lowercasing both sides), then standardize on whatever capitalization the label column uses. However, some labels for different locations collide once lowercased, e.g. Hanam/South Korea and HaNam/Vietnam would both match hanam.
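A minimal sketch of how such collisions could be detected before relying on lowercase matching. The rows are hard-coded (label, country) pairs rather than read from geo_synonyms.tsv, and the function name is hypothetical; only the label column and the Hanam/HaNam example come from the description above.

```python
from collections import defaultdict


def find_case_collisions(rows):
    """Return lowercased labels that map to more than one distinct (label, country) pair."""
    by_key = defaultdict(set)
    for label, country in rows:
        by_key[label.lower()].add((label, country))
    # Keep only keys where lowercasing merged distinct entries.
    return {key: sorted(pairs) for key, pairs in by_key.items() if len(pairs) > 1}


# Illustrative rows only; the real file may have more columns and different values.
rows = [
    ("Hanam", "South Korea"),
    ("HaNam", "Vietnam"),
    ("DistrictOfColumbia", "USA"),
]

collisions = find_case_collisions(rows)
for key, pairs in collisions.items():
    print(key, pairs)
```

Any key reported here cannot be resolved by capitalization alone and needs extra context, such as the country, to pick the right label.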

Duplicate location labels are already an issue for geolocation assignments, so I thought I would tackle that here as well. I wanted to use the GISAID-provided location metadata to make sure the label and the country matched. Then I realized that the virus strain name and the sequence strain name are curated separately, and the sequence data do not include the GISAID location metadata. I can create a map of the virus strain name to the GISAID EPI ISL that can be used to match the sequence strain name instead of curating them separately. However, this idea does not work for the strain names in the titer data because they do not have location metadata. Titer data mostly do not include the GISAID EPI ISL, so I cannot think of how to reliably match titer strain names to the sequence strain names...
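A rough sketch of the EPI ISL mapping idea and of why it breaks down for titers. The record shapes, field names (`epi_isl`, `strain`, `virus_strain`), helper names, and the accession value are all placeholders for illustration, not fauna's actual schema.

```python
def build_strain_by_accession(virus_records):
    """Map each GISAID EPI ISL accession to its curated virus strain name."""
    return {rec["epi_isl"]: rec["strain"] for rec in virus_records}


def curated_strain_for(record, strain_by_accession):
    """Look up the curated strain name via the record's accession.

    Returns None when the record has no accession or the accession is unknown.
    """
    return strain_by_accession.get(record.get("epi_isl"))


# Placeholder records; the accession below is made up.
viruses = [{"epi_isl": "EPI_ISL_0000001", "strain": "A/DistrictOfColumbia/27/2023"}]
sequence = {"epi_isl": "EPI_ISL_0000001", "strain": "A/DistrictofColumbia/27/2023"}
titer = {"virus_strain": "A/DistrictofColumbia/27/2023"}  # no accession available

lookup = build_strain_by_accession(viruses)
sequence_strain = curated_strain_for(sequence, lookup)
titer_strain = curated_strain_for(titer, lookup)
```

The sequence record picks up the curated name through its accession, while the titer record has nothing to join on, which is the gap described above.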

Even if I can tackle all of the above, this will only standardize strain names for new uploads. If geo_synonyms.tsv is updated, it does not update virus records already in fauna. This is further complicated by the fact that we use strain as the index field for viruses and the virus_strain and serum_strain are used to create the index for titer records.

All that is to say, this is taking longer than I would like for fixing this specific data issue, so I plan to do some short-term fixes before tackling the larger issue of standardizing locations in strain names.

  1. Add entries to flu_fix_location_label.tsv and flu_strain_name_fix.tsv to fix these specific data issues. These files are used in both the vdb and tdb uploads, so these fixes would apply across sequence and titer strain names:

    • DistrictofColumbia -> DistrictOfColumbia
    • A/Croatia/10136/RV/2023 -> A/Croatia/10136RV/2023
  2. Remove the "bad" strains from the vdb/flu_* tables and re-upload them to fix their strain names.

  3. Remove the titer records with "bad" strains from tdb/* tables and re-upload them to fix their strain names.
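The effect of the two fix entries in step 1 can be sketched as follows. This uses in-memory dictionaries in place of flu_fix_location_label.tsv and flu_strain_name_fix.tsv, assumes the location is the second slash-delimited field of the strain name, and applies the whole-name fix first; the actual file formats and order of application in the upload scripts may differ.

```python
# Stand-ins for the two fix files; only the entries listed in step 1 are included.
LOCATION_LABEL_FIXES = {"DistrictofColumbia": "DistrictOfColumbia"}
STRAIN_NAME_FIXES = {"A/Croatia/10136/RV/2023": "A/Croatia/10136RV/2023"}


def fix_strain_name(strain):
    """Apply the whole-name fix, then the location label fix (hypothetical helper)."""
    strain = STRAIN_NAME_FIXES.get(strain, strain)
    parts = strain.split("/")
    # Assumption: names look like A/<location>/<id>/<year>, so the location is parts[1].
    if len(parts) >= 2:
        parts[1] = LOCATION_LABEL_FIXES.get(parts[1], parts[1])
    return "/".join(parts)


fixed_dc = fix_strain_name("A/DistrictofColumbia/27/2023")
fixed_croatia = fix_strain_name("A/Croatia/10136/RV/2023")
unchanged = fix_strain_name("A/Croatia/10136RV/2023")
```

Because the same function would run on both sequence and titer strain names, one entry fixes both sides, but only for records uploaded after the entry is added. That is why steps 2 and 3 remove and re-upload the existing records.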

joverlee521 added a commit that referenced this issue Jan 16, 2025
Fixes the specific strain name issues raised in
<#174> for future uploads.

Records already in the database need to be manually deleted and
re-uploaded.

joverlee521 commented

I manually cleaned up the sequence and titer data in fauna and am now running the uploads to S3 in seasonal-flu.

joverlee521 added a commit to nextstrain/seasonal-flu that referenced this issue Jan 22, 2025
As part of clean up in nextstrain/fauna#174,
the strain name "A/DistrictofColumbia/27/2023" has been updated to
"A/DistrictOfColumbia/27/2023" to follow our standard
capitalization of DC in strain names.

Removed the now duplicate entries from references_for_titer_plots.

joverlee521 commented

Closing since this specific data issue has been fixed in fauna. In the long term, the geolocation standardization issues will be easier to fix once we tackle #162.
