Refactor pandas-based header detection #74

sbesson · 2022-04-19T14:46:57Z

#67 introduces a new strategy based on the pandas library for parsing the columns of a CSV file and choosing the appropriate OMERO.table column types when running populate metadata with the default ParsingContext. The initial implementation was introduced at the MetadataControl level, where it generates a column_types list and passes it to the existing API of the HeaderResolver.

A downside of this approach is that any non-CLI usage of the new functionality requires the omero_metadata.cli.MetadataControl class to be used; see #67 (comment). A minimal approach would be to migrate the column type detection logic under the omero_metadata.library module.
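As a rough illustration of what a library-level helper could look like, here is a minimal sketch that infers a column_types list from pandas dtypes without touching any CLI class. The function name detect_column_types is hypothetical, and the single-letter codes ("s", "d", "l", "b") are an assumption about what the HeaderResolver expects; this is not the implementation from #67.

```python
import io

import pandas as pd
from pandas.api import types as pdt


def detect_column_types(csv_source):
    """Return one type code per CSV column, inferred from pandas dtypes.

    Hypothetical helper: the codes below are assumed, not taken from
    the omero-metadata source.
    """
    table = pd.read_csv(csv_source)
    codes = []
    for name in table.columns:
        dtype = table[name].dtype
        if pdt.is_bool_dtype(dtype):
            codes.append("b")  # boolean column
        elif pdt.is_integer_dtype(dtype):
            codes.append("l")  # integer (long) column
        elif pdt.is_float_dtype(dtype):
            codes.append("d")  # floating-point (double) column
        else:
            codes.append("s")  # anything else is treated as a string
    return codes


# Usage on an in-memory CSV; a file path would work the same way.
sample = io.StringIO("Well,Count,Area,Valid\nA1,3,1.5,True\nA2,4,2.25,False\n")
column_types = detect_column_types(sample)
```

A function of this shape could be called from both the CLI and any script that builds a ParsingContext directly, which is the point of moving the logic out of MetadataControl.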

Capturing a few wider thoughts about the migration of this API down at the library level:

- Are we expecting to support the former column type detection strategy alongside the new approach? If not, the HeaderResolver logic could potentially be deprecated in favor of a new implementation, e.g. HeaderResolver2/PandasResolver/...
- At the moment, the metadata population code makes several full reads of the CSV file even after detecting the columns: first to perform the object resolution and then to populate each row of the table. For very large analytical tables, these multiple reads could become a bottleneck of the annotation workflow (I have no numbers, but this is definitely worth benchmarking). In that case, a secondary advantage of the pandas approach is that it loads the CSV file into an in-memory DataFrame, which could then be modified, e.g. by appending columns, and used for generating the table.
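To make the single-read idea concrete, here is a minimal sketch in which the CSV is parsed once and the resulting DataFrame is reused for both a resolution step and the extraction of per-column values. The well_ids mapping is a hypothetical stand-in for object resolution against the server, and the WellID column name is made up for illustration; neither reflects the actual omero-metadata code.

```python
import io

import pandas as pd

csv_text = "Well,Count\nA1,3\nA2,4\n"

# Single read: every later step works on this in-memory DataFrame.
table = pd.read_csv(io.StringIO(csv_text))

# Hypothetical stand-in for object resolution (name -> object ID).
well_ids = {"A1": 101, "A2": 102}

# Append the resolved IDs as a new column instead of re-reading the file.
table["WellID"] = table["Well"].map(well_ids)

# Column-wise values, ready to be handed to whatever builds the table.
columns = {name: table[name].tolist() for name in table.columns}
```

Whether this actually beats multiple streaming reads depends on the file size and available memory, so it would need the benchmarking mentioned above.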

sbesson added the enhancement label on Apr 19, 2022