
[RFC] Target data file formats and organisation #10

Merged: 23 commits, Jan 15, 2025

Conversation

annakrystalli (Collaborator)

This PR is a Request For Comment on an initial proposal for target data file formats and organisation

@bsweger (Contributor) left a comment

Thanks, @annakrystalli--this is great and very timely, since we're currently working on target data for variant-nowcast-hub.

I had several comments and questions, mostly trying to tease apart the proposal's file formats and partitioning schemes from the many mentions of arrow as a specific tool.

6 review comments on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)
@zkamvar (Member) left a comment

Thank you for writing this up! I really appreciate the examples and links to existing issues that we can work on once we adopt this. It matches three of the four guiding principles (clear, general, and format-agnostic) I outlined in hubverse-org/hubUtils#197 (comment).

In general, I agree with @bsweger's assessment that we should focus on the outputs and not the tools.

I have two questions:

  1. Does this proposal exclude using a single file for these target data?
  2. Is there an alternative to prescribing file names? I noticed that we prescribed timeseries.[csv|parquet] and oracle-output.[csv|parquet], but this would require existing hubs to adapt their data to this naming scheme (e.g. FluSight uses target-hospital-admissions.csv and even the example hub uses time-series.csv).

7 review comments on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)
@annakrystalli (Collaborator, Author)

Thanks for the comments @bsweger, @zkamvar, and @elray1

I've done some additional digging and also responded to your comments in the RFC. More specifically:

  • I've clarified that single files are acceptable (I had neglected to state that explicitly at the beginning of the RFC)
  • I've clarified that the tools used to create the required structure are not prescribed, but mention the arrow tooling as a recommendation
  • I've made clearer that the goal of the proposal is to enable accessing the data with arrow as an arrow dataset
  • I added sections to both the supplementary materials and the RFC on:
    • applying a schema to target data datasets
    • ways of splitting data into files that include all columns but can still be accessed as an arrow dataset
    • more detail on file naming, and an example of splitting up files that do not use default hive filenames. This may actually be closer to how actual hubs will update their target data.

It would be really useful to get feedback on any assumptions I've made about how actual hubs might update their target data. That understanding would certainly help in evaluating whether the proposal is fit for purpose!
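As a rough illustration of the partitioning pattern discussed in the RFC (a sketch using only the Python standard library; the directory layout, column names, and `part-0` filename are hypothetical, and the RFC recommends the arrow tooling rather than any particular implementation), hive-style partitioning encodes a column's values in the directory path, so a dataset reader such as `arrow::open_dataset()` can recover that column without it being stored in the files themselves:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical target data rows; the columns are illustrative, not prescribed.
rows = [
    {"location": "US", "target_end_date": "2025-01-04", "observation": 120},
    {"location": "US", "target_end_date": "2025-01-11", "observation": 95},
    {"location": "FR", "target_end_date": "2025-01-04", "observation": 40},
]

def write_hive_partitioned(rows, root, partition_col):
    """Split rows into <root>/<partition_col>=<value>/part-0.csv files.

    The partition column is dropped from the file contents because
    hive-style readers recover it from the directory name.
    """
    root = Path(root)
    by_value = {}
    for row in rows:
        by_value.setdefault(row[partition_col], []).append(
            {k: v for k, v in row.items() if k != partition_col}
        )
    written = []
    for value, part_rows in by_value.items():
        part_dir = root / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(part_rows[0]))
            writer.writeheader()
            writer.writerows(part_rows)
        written.append(path.relative_to(root).as_posix())
    return sorted(written)

with tempfile.TemporaryDirectory() as tmp:
    paths = write_hive_partitioned(rows, Path(tmp) / "target-data" / "time-series", "location")
    print(paths)  # ['location=FR/part-0.csv', 'location=US/part-0.csv']
```

The same layout works for parquet; only the per-file writer changes, which is why the proposal can stay format-agnostic about tooling while still prescribing the on-disk structure.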

@nickreich (Member) left a comment

I think this looks very good and is super thorough!

One minor comment (which need not block merging): I found it confusing that all these files are named 2025-01-06-rfc-target-data-formats. Perhaps the files in the directory named 2025-01-06-rfc-target-data-formats could have "supplement" appended to their names, so it's clear that they are a supplement and not the source of the file in decisions/.

@elray1 (Contributor) left a comment

Approve, modulo Becky's comments re: describing formats rather than specific tools, which I support.

@bsweger (Contributor) left a comment

Love how this is evolving, especially the focus on data access!

I made a few wording suggestions to take or leave; the things I feel strongly about are:

  • removing the references to specific Python file access methods, since I don't think that's been decided yet
  • removing the section about writing target data files with extra columns (that's a data creation concern, and we've already stated that hubs are responsible for creating target data via whatever method they choose)

And I'll surface this idea here (it's already in a comment):

What if this first iteration states that partitioned target data files must be in parquet format? I know we've already written a lot of code to handle the .csv format for model-output files, so you're a better judge of whether or not that can be re-used here.

But it's worth considering that we don't need to do everything in the first implementation. If limiting partitioned files to .parquet speeds up implementation and reduces the amount of code and tests, we can do that and see if there's demand for partitioned .csv files.

@bsweger (Contributor)

bsweger commented Jan 9, 2025

Just filed a variant-nowcast-hub bug thanks to this RFC and its focus on what's required for successful data access 😄
reichlab/variant-nowcast-hub#265

@zkamvar (Member) left a comment

Thank you for incorporating the changes and the absolutely heroic effort that it took to investigate the CSV partition chaos 😱 💪🏽

I left some comments and updated my suggestions for clarity. I agree with @bsweger's assessment, specifically on two points:

  1. we should focus less on the specific tools and more on the expected formats.
  2. for partitioned data, we should start with parquet (non-partitioned data can be any accepted format). The added value of partitioned CSV is small, and we have no evidence of hubs with partitioned CSV data.

Review comment on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)

- `arrow::open_dataset()` works for single files also so we can use the same function to read both single and partitioned target data.

Large un-partitioned data could be stored locally in a `target-data-raw` directory (which is not committed?) and then written and partitioned into the `target-data` directory (once it has been validated and correct schema is set? see below).
Member

This seems out-of-scope for this rfc as it deals with the raw data. I would suggest to remove it.

Suggested change
Large un-partitioned data could be stored locally in a `target-data-raw` directory (which is not committed?) and then written and partitioned into the `target-data` directory (once it has been validated and correct schema is set? see below).

Collaborator Author

I would prefer to get feedback from hub administrators before removing. It might be a tip that helps them think how they will manage the process of partitioning and updating their target data.

Collaborator Author

I did go ahead and remove this

3 review comments on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)
@annakrystalli (Collaborator, Author)

annakrystalli commented Jan 13, 2025

Thanks everyone for the comments.

I've addressed the comments individually but just wanted to add a couple of overall responses:

  1. I would be very happy with only supporting parquet to start off with. However, I feel we should get some feedback from a few hub administrators themselves to ensure we are not being too prescriptive. In any case, I have for now removed mention of csv as an accepted file format and moved it to the Other Options Considered section, along with the reasoning for rejecting it as an acceptable format for the time being. I feel we should get some approvals from the community before finalising this, as it would mean that the FluSight hub, for example, is already violating the requirement. (That said, they will already need to change the filename to oracle-output.)

  2. I want to be clear that I am making a specific and actionable proposal on tooling to access the data. I have kept mention of tooling for writing data, as well as Python versions of functions, for informational purposes only, without making their use prescriptive. However, I am proposing the use of `arrow::open_dataset()` in soon-to-be-developed hubverse functionality, so I don't feel it needs to be removed from the section where I'm explicitly listing the benefits of using it. In my view, the proposal has described the format we are expecting, and I'm then making a case for how a specific tool can handle the required format.

  3. There was an initial comment by @zkamvar that I missed previously regarding flexibility in filenames (which also pointed out that the example complex forecast hub was using time-series.csv instead of the originally proposed timeseries.csv).

    Is there an alternative to prescribing file names? I noticed that we prescribed timeseries.[csv|parquet] and oracle-output.[csv|parquet], but this would require existing hubs to adapt their data to this naming scheme (e.g. FluSight uses target-hospital-admissions.csv and even the example hub uses time-series.csv).

    I have corrected all mentions in the RFC of timeseries to time-series, but my feeling is that we should just be prescriptive about the filename rather than allowing flexibility. Flexibility would likely require a property in one of the config files, which for now we have not considered. It may be that at some point we will require information from hub admins (in the form of config) to successfully make use of time-series data, so perhaps we could revisit this when we reach that point? Thoughts from others welcome.
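For concreteness, the prescribed names under discussion could sit in a hub like this (a sketch only; which dataset is partitioned, the partition column, and the part filenames are hypothetical illustrations, not part of the proposal):

```
target-data/
├── oracle-output.parquet            # single-file form, prescribed name
└── time-series/                     # partitioned form (parquet only)
    ├── location=US/part-0.parquet
    └── location=FR/part-0.parquet
```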

Overall, because we are not providing tooling for creating target data, we will need to provide clear documentation for hub admins to ensure they succeed. So detail and specific examples of available tooling will still be important to making our documentation helpful to hub admins. I view this RFC and the experimentation within it as an important starting source of information for that documentation. I've also added a comment regarding documentation in the proposal (76bfbb1).

@annakrystalli (Collaborator, Author)

If folks are happy with the current version, I can share it more widely to get some input from hub admins as well.

@bsweger (Contributor) left a comment

Thanks for all the work on this and for the thrashing that came with dogfooding our new RFC process!

@annakrystalli (Collaborator, Author)

Thanks everyone for your feedback and patience and helping work through the RFC process too. Merging!

@annakrystalli annakrystalli merged commit a76fbee into main Jan 15, 2025
@annakrystalli annakrystalli deleted the ak/rfc/target-data-formats branch January 15, 2025 09:58
5 participants