
[RFC] Target data file formats and organisation #10

Merged: 23 commits, Jan 15, 2025

Conversation

annakrystalli (Collaborator)

This PR is a Request For Comment on an initial proposal for target data file formats and organisation

@bsweger (Contributor) left a comment

Thanks, @annakrystalli--this is great and very timely, since we're currently working on target data for variant-nowcast-hub.

I had several comments and questions, mostly trying to tease apart the proposal's file formats and partitioning schemes from the many mentions of arrow as a specific tool.

6 review comments on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)
@zkamvar (Member) left a comment

Thank you for writing this up! I really appreciate the examples and links to existing issues that we can work on once we adopt this. It matches three of the four guiding principles (clear, general, and format-agnostic) I outlined in hubverse-org/hubUtils#197 (comment).

In general, I agree with @bsweger's assessment that we should focus on the outputs and not the tools.

I have two questions:

  1. Does this proposal exclude using a single file for these target data?
  2. Is there an alternative to prescribing file names? I noticed that we prescribed timeseries.[csv|parquet] and oracle-output.[csv|parquet], but this would require existing hubs to adapt their data to this naming scheme (e.g. FluSight uses target-hospital-admissions.csv and even the example hub uses time-series.csv).

7 review comments on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)
@annakrystalli (Collaborator, Author)

Thanks for the comments @bsweger, @zkamvar, and @elray1

I've done some additional digging and also responded to your comments in the RFC. More specifically:

  • I've clarified that single files are acceptable (I had neglected to state that explicitly at the beginning of the RFC)
  • I've clarified that the tools used to create the required structure are not prescribed, but mention the arrow tooling as a recommendation
  • I've made clearer that the goal of the proposal is to enable accessing the data with arrow as an arrow dataset
  • I added sections to both the supplementary materials and the RFC on:
    • applying a schema to target data datasets
    • ways of splitting data into files that include all columns but can still be accessed as an arrow dataset
    • more detail on file naming, and an example of splitting up files that do not use default hive filenames. This may actually be closer to how actual hubs will update their target data.

It would be really useful to get feedback on any assumptions I've made about how actual hubs might update their target data. That understanding would certainly help in evaluating whether the proposal is fit for purpose!
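As a rough illustration of the partitioning pattern discussed in the RFC (a sketch using only the Python standard library; the directory layout, column names, and `part-0` filename are hypothetical, and the RFC recommends the arrow tooling rather than any particular implementation), hive-style partitioning encodes a column's values in the directory path, so a dataset reader such as `arrow::open_dataset()` can recover that column without it being stored in the files themselves:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical target data rows; the columns are illustrative, not prescribed.
rows = [
    {"location": "US", "target_end_date": "2025-01-04", "observation": 120},
    {"location": "US", "target_end_date": "2025-01-11", "observation": 95},
    {"location": "FR", "target_end_date": "2025-01-04", "observation": 40},
]

def write_hive_partitioned(rows, root, partition_col):
    """Split rows into <root>/<partition_col>=<value>/part-0.csv files.

    The partition column is dropped from the file contents because
    hive-style readers recover it from the directory name.
    """
    root = Path(root)
    by_value = {}
    for row in rows:
        by_value.setdefault(row[partition_col], []).append(
            {k: v for k, v in row.items() if k != partition_col}
        )
    written = []
    for value, part_rows in by_value.items():
        part_dir = root / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(part_rows[0]))
            writer.writeheader()
            writer.writerows(part_rows)
        written.append(path.relative_to(root).as_posix())
    return sorted(written)

with tempfile.TemporaryDirectory() as tmp:
    paths = write_hive_partitioned(rows, Path(tmp) / "target-data" / "time-series", "location")
    print(paths)  # ['location=FR/part-0.csv', 'location=US/part-0.csv']
```

The same layout works for parquet; only the per-file writer changes, which is why the proposal can stay format-agnostic about tooling while still prescribing the on-disk structure.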

@nickreich (Member) left a comment

I think this looks very good and is super thorough!

One minor comment (which need not block merging): I found it confusing that all these files are named 2025-01-06-rfc-target-data-formats. Perhaps the files in the directory named 2025-01-06-rfc-target-data-formats could have "supplement" appended to their names, so it's clear that they are a supplement and not the source of the file in decisions/.

@elray1 (Contributor) left a comment

Approve, modulo Becky's comments re: describing formats rather than specific tools, which I support.

@bsweger (Contributor) left a comment

Love how this is evolving, especially the focus on data access!

I made a few wording suggestions to take or leave; the things I feel strongly about are:

  • removing the references to specific Python file access methods, since I don't think that's been decided yet
  • removing the section about writing target data files with extra columns (that's a data creation concern, and we've already stated that hubs are responsible for creating target data via whatever method they choose)

And I'll surface this idea here (it's already in a comment):

What if this first iteration states that partitioned target data files must be in parquet format? I know we've already written a lot of code to handle the .csv format for model-output files, so you're a better judge of whether or not that can be re-used here.

But it's worth considering that we don't need to do everything in the first implementation. If limiting partitioned files to .parquet speeds up implementation and reduces the amount of code and tests, we can do that and see if there's demand for partitioned .csv files.

@bsweger (Contributor)

bsweger commented Jan 9, 2025

Just filed a variant-nowcast-hub bug thanks to this RFC and its focus on what's required for successful data access 😄
reichlab/variant-nowcast-hub#265

@zkamvar (Member) left a comment

Thank you for incorporating the changes and the absolutely heroic effort that it took to investigate the CSV partition chaos 😱 💪🏽

I left some comments and updated my suggestions for clarity. I agree with @bsweger's assessment, specifically on two points:

  1. we should focus less on the specific tools and more on the expected formats.
  2. for partitioned data, we should start with parquet (non-partitioned data can be any accepted format). The added value of partitioned CSV is small, and we have no evidence of hubs with partitioned CSV data.

Review comment on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)

- `arrow::open_dataset()` works for single files also so we can use the same function to read both single and partitioned target data.

Large un-partitioned data could be stored locally in a `target-data-raw` directory (which is not committed?) and then written and partitioned into the `target-data` directory (once it has been validated and correct schema is set? see below).
Member

This seems out-of-scope for this rfc as it deals with the raw data. I would suggest to remove it.

Suggested change
Large un-partitioned data could be stored locally in a `target-data-raw` directory (which is not committed?) and then written and partitioned into the `target-data` directory (once it has been validated and correct schema is set? see below).

Collaborator Author

I would prefer to get feedback from hub administrators before removing. It might be a tip that helps them think how they will manage the process of partitioning and updating their target data.

Collaborator Author

I did go ahead and remove this

3 review comments on decisions/2025-01-06-rfc-target-data-formats.md (outdated, resolved)
@annakrystalli (Collaborator, Author)

annakrystalli commented Jan 13, 2025

Thanks everyone for the comments.

I've addressed the comments individually but just wanted to add a couple of overall responses:

  1. I would be very happy with only supporting parquet to start off with. However, I feel we should get some feedback from a few hub administrators themselves to ensure we are not being too prescriptive. In any case, I have for now removed mention of csv as an accepted file format and moved it to the Other Options Considered section, along with the reasoning for rejecting it as an acceptable format for the time being. I feel we should get some approvals from the community before finalising this, as it would mean that the FluSight hub, for example, is already violating the requirement. (That said, they will already need to change the filename to oracle-output.)

  2. I want to be clear that I am making a specific and actionable proposal on tooling to access the data. I have kept mention of tooling for writing data, as well as Python versions of functions, for informational purposes only, without making their use prescriptive. However, I am proposing the use of `arrow::open_dataset()` in soon-to-be-developed hubverse functionality, so I don't feel it needs to be removed from the section where I'm explicitly listing the benefits of using it. In my view, the proposal has described the format we are expecting, and I'm then making a case for how a specific tool can handle the required format.

  3. There was an initial comment by @zkamvar that I missed previously regarding flexibility in filenames (which also pointed out that the example complex forecast hub was using time-series.csv instead of the originally proposed timeseries.csv).

    Is there an alternative to prescribing file names? I noticed that we prescribed timeseries.[csv|parquet] and oracle-output.[csv|parquet], but this would require existing hubs to adapt their data to this naming scheme (e.g. FluSight uses target-hospital-admissions.csv and even the example hub uses time-series.csv).

    I have corrected all mentions in the RFC of timeseries to time-series, but my feeling is that we should just be prescriptive about the filename rather than allowing flexibility. Flexibility would likely require a property in one of the config files, which for now we have not considered. It may be that at some point we will require information from hub admins (in the form of config) to successfully make use of time-series data, so perhaps we could revisit this when we reach that point? Thoughts from others welcome.
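For concreteness, the prescribed names under discussion could sit in a hub like this (a sketch only; which dataset is partitioned, the partition column, and the part filenames are hypothetical illustrations, not part of the proposal):

```
target-data/
├── oracle-output.parquet            # single-file form, prescribed name
└── time-series/                     # partitioned form (parquet only)
    ├── location=US/part-0.parquet
    └── location=FR/part-0.parquet
```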

Overall, because we are not providing tooling for creating target data, we will need to provide clear documentation for hub admins to ensure they succeed. So detail and specific examples of available tooling will still be important to making our documentation helpful to hub admins. I view this RFC and the experimentation within it as an important starting source of information for that documentation. I've also added a comment regarding documentation in the proposal (76bfbb1).

@annakrystalli (Collaborator, Author)

If folks are happy with the current version, I can share it more widely to get some input from hub admins as well.

@bsweger (Contributor) left a comment

Thanks for all the work on this and for the thrashing that came with dogfooding our new RFC process!

@annakrystalli (Collaborator, Author)

Thanks everyone for your feedback and patience and helping work through the RFC process too. Merging!

@annakrystalli annakrystalli merged commit a76fbee into main Jan 15, 2025
@annakrystalli annakrystalli deleted the ak/rfc/target-data-formats branch January 15, 2025 09:58
5 participants