Refactor pandas-based header detection #74

Open
sbesson opened this issue Apr 19, 2022 · 0 comments
Labels
enhancement New feature or request

sbesson commented Apr 19, 2022

#67 introduces a new strategy, based on the pandas library, for parsing the columns of a CSV file and choosing the appropriate OMERO.table column types when running populate metadata with the default ParsingContext. The initial implementation lives at the MetadataControl level, where it generates a column_types list and passes it to the existing HeaderResolver API.

A downside of this approach is that any non-CLI usage of the new functionality has to go through the omero_metadata.cli.MetadataControl class; see #67 (comment). A minimal fix would be to move the column type detection logic into the omero_metadata.library module.
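As a rough illustration of what a library-level helper might look like, the sketch below infers one type code per column from the pandas dtypes. The function name `detect_column_types` and the single-letter codes are assumptions made for this example; they are not taken from #67 or from the existing HeaderResolver API.

```python
import io

import pandas as pd
from pandas.api import types as pdt


def detect_column_types(csv_source):
    """Return one type code per CSV column, inferred from pandas dtypes.

    Hypothetical helper: the codes 'b', 'l', 'd' and 's' (boolean, long,
    double, string) are assumed here for illustration only.
    """
    df = pd.read_csv(csv_source)
    codes = []
    for name in df.columns:
        dtype = df[name].dtype
        if pdt.is_bool_dtype(dtype):
            codes.append("b")
        elif pdt.is_integer_dtype(dtype):
            codes.append("l")
        elif pdt.is_float_dtype(dtype):
            codes.append("d")
        else:
            # Anything pandas cannot parse as a number or boolean is kept as text.
            codes.append("s")
    return codes


# Small in-memory CSV standing in for a real file path.
sample = io.StringIO(
    "Well,Count,Intensity,Valid\n"
    "A1,3,0.5,True\n"
    "B2,7,1.25,False\n"
)
column_types = detect_column_types(sample)
print(column_types)
```

Because such a helper would only depend on pandas, it could be called from a script or notebook without instantiating any CLI class.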

Capturing a few wider thoughts about the migration of this API down at the library level:

  • Are we expecting to support the former column type detection strategy alongside the new approach? If not, the HeaderResolver logic could potentially be deprecated in favor of a new implementation, e.g. HeaderResolver2/PandasResolver/...
  • At the moment, the metadata population code makes several full reads of the CSV file even after detecting the columns: first to perform the object resolution, and then to populate each row of the table. For very large analytical tables, these multiple reads could become a bottleneck of the annotation workflow (I have no numbers, but this is definitely worth benchmarking). In that case, a secondary advantage of the pandas approach is that it loads the CSV file into an in-memory DataFrame, which could then be modified, e.g. by appending columns, and used for generating the table.
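The single-read idea from the second point could be sketched as follows, assuming a plain dictionary in place of the real object resolution step. The column names and the `image_ids` lookup are hypothetical and only serve to show the DataFrame being read once, extended, and then consumed column by column.

```python
import io

import pandas as pd

csv_text = "Image Name,Area\nimg_a.tif,10.5\nimg_b.tif,20.0\n"

# Single read of the CSV; every later step works on the in-memory frame.
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical name-to-ID lookup standing in for server-side object resolution.
image_ids = {"img_a.tif": 101, "img_b.tif": 102}

# Append the resolved IDs as a new column instead of re-reading the file.
df["Image"] = df["Image Name"].map(image_ids)

# Column-wise values, one list per column, ready to hand to a table writer.
columns = {name: df[name].tolist() for name in df.columns}
print(columns)
```

Whether this is actually faster than the current multiple-read approach would still need to be measured on large files, and holding the whole table in memory has its own cost.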
sbesson added the enhancement label Apr 19, 2022