[RFC] Target data file formats and organisation #10
Conversation
Thanks, @annakrystalli--this is great and very timely, since we're currently working on target data for variant-nowcast-hub.
I have several comments and questions, mostly trying to separate the proposal's file formats and partitioning schemes from the many mentions of arrow as a specific tool.
Thank you for writing this up! I really appreciate the examples and links to existing issues that we can work on once we adopt this. It matches three of the four guiding principles (clear, general, and format-agnostic) I outlined in hubverse-org/hubUtils#197 (comment).
In general, I agree with @bsweger's assessment that we should focus on the outputs and not the tools.
I have two questions:
- Does this proposal exclude using a single file for these target data?
- Is there an alternative to prescribing file names? I noticed that we prescribed `timeseries.[csv|parquet]` and `oracle-output.[csv|parquet]`, but this would require existing hubs to adapt their data to this naming scheme (e.g. FluSight uses `target-hospital-admissions.csv` and even the example hub uses `time-series.csv`).
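For concreteness, a hypothetical layout under the prescribed naming scheme (directory name and extensions are illustrative, not taken from the RFC):

```
target-data/
├── timeseries.parquet       # prescribed name (or timeseries.csv)
└── oracle-output.parquet    # prescribed name (or oracle-output.csv)
```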
Thanks for the comments @bsweger, @zkamvar & @elray1. I've done some additional digging and also responded to your comments in the RFC. More specifically:
It would be really useful to get feedback on any assumptions I've made about how actual hubs might update their target data. Such understanding would certainly help in evaluating whether the proposal is fit for purpose!
I think this looks very good and is super thorough!
One minor comment (which should not necessarily preclude merging) is that I found it confusing to have all these files named `2025-01-06-rfc-target-data-formats`. Maybe the files in the directory named `2025-01-06-rfc-target-data-formats` could have "supplement" added to their names, so it's clear that they are a supplement and not the source file for the file in `decisions/`.
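If helpful, a sketch of the suggested renaming (all file names here are hypothetical):

```
decisions/
└── 2025-01-06-rfc-target-data-formats.md              # the decision document
2025-01-06-rfc-target-data-formats/
├── 2025-01-06-rfc-target-data-formats-supplement.md   # "supplement" added
└── ...
```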
Approve modulo Becky's comments re: describing formats rather than specific tools, which I support.
Love how this is evolving, especially the focus on data access!
I made a few wording suggestions to take or leave; the things I feel strongly about are:
- removing the references to specific Python file access methods, since I don't think that's been decided yet
- removing the section about writing target data files with extra columns (because that's a data creation concern, and we've already stated the hubs are responsible for creating target data via whatever method they choose)
And I'll surface this idea here (it's already in a comment):
What if this first iteration states that partitioned target data files must be in parquet format? I know we've already written a lot of code to handle the .csv format for model-output files, so you're a better judge of whether or not that can be re-used here.
But it's worth considering that we don't need to do everything in the first implementation. If limiting partitioned files to .parquet speeds up implementation/reduces amount of code and tests, we can do that and see if there's demand for partitioned .csv files.
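To make the parquet-only option concrete, a minimal, non-normative R sketch using the arrow package; the column names and partition key below are illustrative assumptions, not part of the proposal:

```r
library(arrow)

# Hypothetical target time-series data; columns are illustrative only.
timeseries <- data.frame(
  target_end_date = as.Date(c("2025-01-04", "2025-01-04", "2025-01-11")),
  location        = c("US", "MA", "US"),
  observation     = c(1200, 85, 1340)
)

# Write hive-style partitioned parquet files under target-data/.
write_dataset(
  timeseries,
  path         = "target-data/timeseries",
  format       = "parquet",
  partitioning = "target_end_date"
)
```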
Just filed a variant-nowcast-hub bug thanks to this RFC and its focus on what's required for successful data access 😄
Thank you for incorporating the changes and the absolutely heroic effort that it took to investigate the CSV partition chaos 😱 💪🏽
I left some comments and updated my suggestions for clarity. I agree with @bsweger's assessment, specifically on two points:
- we should focus less on the specific tools and more on the expected formats.
- for partitioned data, we should start with parquet (non partitioned can be anything). The value add for partitioned CSV is so small and we don't have any evidence of hubs that have partitioned CSV data.
> - `arrow::open_dataset()` works for single files also so we can use the same function to read both single and partitioned target data.
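As a small illustration of the quoted point (paths and column name are hypothetical), the same `arrow::open_dataset()` call reads either form:

```r
library(arrow)
library(dplyr)

# A single target data file...
single <- open_dataset("target-data/timeseries.parquet")

# ...or a hive-partitioned directory of files.
partitioned <- open_dataset("target-data/timeseries")

# Downstream querying is identical either way.
single |>
  filter(location == "US") |>
  collect()
```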
> Large un-partitioned data could be stored locally in a `target-data-raw` directory (which is not committed?) and then written and partitioned into the `target-data` directory (once it has been validated and correct schema is set? see below).
This seems out-of-scope for this RFC as it deals with the raw data. I would suggest removing it.
I would prefer to get feedback from hub administrators before removing it. It might be a tip that helps them think about how they will manage the process of partitioning and updating their target data.
I did go ahead and remove this
Thanks everyone for the comments. I've addressed the comments individually but just wanted to add a couple overall responses:
Overall, because we are not providing tooling for creating target data, we will need to provide clear documentation for hub admins to ensure they succeed. So detail and specific examples of available tooling will still be important to make our documentation helpful to hub admins. I view this RFC and the experimentation within it as an important starting source of information for our documentation. I've added a comment regarding documentation in the proposal as well (76bfbb1).
If folks are happy with the current version, I can share it more widely to get some input from hub admins as well.
Thanks for all the work on this and for the thrashing that came with dogfooding our new RFC process!
Thanks everyone for your feedback and patience, and for helping work through the RFC process too. Merging!
This PR is a Request For Comment on an initial proposal for target data file formats and organisation.