HTAN2 Data Model & BigQuery Architecture Planning #468

Open
5 tasks
aditigopalan opened this issue Nov 21, 2024 · 3 comments

@aditigopalan

After discussion with Darya on 11/20, creating this ticket to track our BigQuery architecture and data flow planning for Phase 2.

Action items:

  • Create UML diagram for BigQuery tables. (Dar'ya made this diagram)
  • Define which tables are mutable or automated.
  • Determine if new tables are needed for Phase 2.
  • Review whether cloud workflows need updates. Discuss roles in relation to Synapse uploads and schematic releases.
  • Draft a design document for Phase 2 architecture.
@aditigopalan changed the title from "Data model / BQ Architecture" to "HTAN2 Data Model & BigQuery Architecture Planning" on Nov 21, 2024
@aclayton555

24-11/12 Close-out: Dar'ya has been working on this with Aditi and Adam. Link to Figma: https://www.figma.com/board/7MaRE8wDQABuNg5GqJXN9x/HTAN2-BQ-Medallion-Architecture?node-id=0-1&p=f&t=hGRTfR7YVkW5jbqq-0

Notes from preliminary discussion on medallion tiers:

  • Any time we use a SynapseID, we should add a version number
  • Data model version will soon be added as an annotation to manifests
  • Also chat with Ino about how the (gold?) tier will be used with the portal

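As a minimal sketch of the first note above, a small helper could reject any Synapse reference that lacks a version before it is written to a table. This assumes the dotted `syn<id>.<version>` form for versioned references; `SynapseRef`, `parse_ref`, and `require_version` are hypothetical names for illustration, not part of any existing HTAN or Synapse code.

```python
import re
from typing import NamedTuple, Optional


class SynapseRef(NamedTuple):
    """A Synapse ID with an optional version (hypothetical helper type)."""
    synapse_id: str
    version: Optional[int]


# Assumed format: "syn" + digits, optionally followed by ".<version>".
_REF_PATTERN = re.compile(r"^(syn\d+)(?:\.(\d+))?$")


def parse_ref(text: str) -> SynapseRef:
    """Split a reference such as 'syn123.4' into its ID and version."""
    match = _REF_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a Synapse reference: {text!r}")
    synapse_id, version = match.groups()
    return SynapseRef(synapse_id, int(version) if version is not None else None)


def require_version(text: str) -> str:
    """Return the normalized versioned reference, or fail if no version is given."""
    ref = parse_ref(text)
    if ref.version is None:
        raise ValueError(f"{ref.synapse_id} is missing a version number")
    return f"{ref.synapse_id}.{ref.version}"
```

A check like this could run wherever IDs enter the tables, so unversioned references fail early instead of surfacing later in downstream tiers.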
Also discussed if/how BigQuery tables can be used as a tool for monitoring sparseness. (@PozhidayevaDarya not sure if this is within scope for the designs in progress here, or if it should be developed longer term in a separate ticket?)
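To make the sparseness idea concrete, here is a rough sketch of one possible metric: the fraction of missing values per column. It works on rows already pulled into Python as dictionaries; the function name `column_sparseness` and the choice to treat `None`, empty strings, and absent keys as missing are assumptions for illustration, not an agreed definition.

```python
from typing import Any, Dict, List


def column_sparseness(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return, for each column, the fraction of rows where the value is missing.

    A value counts as missing if the key is absent, or the value is None
    or an empty string (an assumed definition for this sketch).
    """
    if not rows:
        return {}
    # Collect every column name that appears in any row.
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    result: Dict[str, float] = {}
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        result[column] = missing / total
    return result
```

The same per-column ratio could presumably be computed directly in BigQuery; doing it in Python here just keeps the sketch self-contained.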

@aclayton555

25-1 Close out: Have made a first pass at refactoring the BQ code into a medallion architecture. Met with Ino and Onur to understand how this works with the portal and what would be most useful for informing the refactoring of the portal code.

Suggestions:

  • Sanity check / review with the BQ team at ISB. Adam also suggests engaging with Tom Yu at Sage
  • Bring for review and feedback at an upcoming HTAN DCC call
  • Current testing and prototyping is on HTAN1 data, but we will need to think about how this will look with HTAN2 and potential changes from LinkML. We are not expecting the data model to make a difference here, other than in the format of the data being pulled into BQ (CSV, a table, or a scrape of Synapse annotations). The structure of the output files and what the portal communicates with will be an important consideration.
  • Need to better assess what the design and implementation needs are here, especially for the portal. A clear understanding of the challenges in HTAN1, and of why this change is needed, will be helpful to flesh out.
  • Also need to understand the impact on, and relation to, the dashboard.

Next steps:

  • Bring to the upcoming Monday call to request feedback. We will likely adapt this into a data flow RFC.

Ideal outcome is to streamline information and tables to help inform release numbers

@aclayton555

25-2/3
