Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix header detection for tables with sparse numerical data #77

Merged
merged 9 commits into from
Jun 28, 2022

Conversation

sbesson
Copy link
Member

@sbesson sbesson commented Jun 5, 2022

Fixes #76

Reproducible scenario

First create a minimal dataset/image hierarchy e.g. as follows:

touch test1.fake test2.fake
dataset=$(omero obj new Dataset name=sparse_table)
omero import -T $dataset test1.fake test2.fake

CSV files with sparse string data such as the one below are correctly handled by the current HEAD of omero-metadata.

$ cat sparse_string_column.csv 
Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1,high,low
test2.fake,0.5,2,,low

The columns with missing values are mapped as s/StringCOlumn and the missing value are turned into empty strings where running omero metadata populate --file sparse_string_column.csv $dataset e.g.

CSV files with sparse numerical columns such as the one below currently fail during the population command:

$ cat sparse_numeric_column.csv
Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1.2,high,low
test2.fake,,2.1,,low

Here, the meas1 column is currently mapped into a d header type/DoubleColumn by the pandas detection logic, With the default omero metadata populate --file command, the table population fails with ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN`.

Proposed changes

Since the library already includes some logic allowing the user to control whether NaN values are allowed in the OMERO.table (introduced in #60), this PR proposes the following changes

  • the MetadataControl.detect_headers API supports an extra keep_default_na argument (True by default). Its value is forwarded to pandas.read_csvand determines controls how pandas should handle NA (missing) values and whether the column is detected asdvss` (see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
  • unit tests are added to cover the new MetadataControl.detect_headers API with different combination of tables/arguments
  • the populate command is updated to use the value of args.allow_nan (False by default) as pass it as keep_default_na to MetadataControl.detect_headers

fe73a17 adds a cosmetic change defining GNU-style aliases of the command-line arguments (--manual-header, --allow-nan) using hyphen as separator. The existing underscore separated flags are preserved.

Testing

With these changes, annotating of sparse CSV tables using the default header detection should be functional in all cases.

  1. if the tabular data is dense or containing sparse string columns, the behavior of the command should be unchanged

  2. if the tabular data is dense or containing sparse numerical columns,, the behavior of the command will depend on the --allow-nan flag

    omero metadata populate --file sparse_numeric_column.csv $data
    

    will detect the sparse numeric column as aStringColumn and store the missing values as empty strings

    omero metadata populate --file sparse_numeric_column.csv --allow-nan $data
    

    will detect the sparse numeric column as a DoubleColumn and store missing values as nan

@sbesson sbesson requested review from will-moore and muhanadz June 5, 2022 20:23
@snoopycrimecop
Copy link
Member

Conflicting PR. Removed from build OMERO-plugins-push#1169. See the console output for more details.
Possible conflicts:

--conflicts

@sbesson sbesson added this to the 0.11.0 milestone Jun 10, 2022
@snoopycrimecop
Copy link
Member

snoopycrimecop commented Jun 11, 2022

Conflicting PR. Removed from build OMERO-plugins-push#1176. See the console output for more details.
Possible conflicts:

--conflicts Conflict resolved in build OMERO-plugins-push#1179. See the console output for more details.

@sbesson sbesson mentioned this pull request Jun 21, 2022
Copy link
Member

@muhanadz muhanadz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

In conclusion, what was described in the PR description is reasonable, valid, and in my opinion necessary.

Following the steps described, without the PR there are cases where the script will fail (with a ValueError exception) in the presence of a sparse numeric column and the lack of --allow-nan. This PR fixes that behavior such that in the case of a sparse numeric column you can choose to have the sparse numeric column read as 'd' (DoubleColumn) or 's' (StringColumn) by passing --allow-nan or not

My current understanding is that nan values or empty values will be processed regardless of if they're in a numeric or string column. And the default is to read the nan values as string type.

If so I'd suggest adding something like "without --allow-nan, by default, the column will be read as a string column in the presence of nans in the column" somewhere either in the help of the argument or the README.

Review details

Using sparse_string_column.csv (4th column sparse):

Without PR detected ['s', 'd', 'l', 's', 's']. Added empty string to final result.
With PR changes, detected ['s', 'd', 'l', 's', 's'] and with --allow-nan, similary detected ['s', 'd', 'l', 's', 's']. Added empty string to final result.

Using sparse_numeric_column.csv (2nd and 4th column sparse)

Without PR detected ['s', 'd', 'd', 's', 's'], however will get the following exception:

ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN

With PR changes, detected ['s', 's', 'd', 's', 's'] which a dded empty string to final result and with --allow-nan detected ['s', 'd', 'd', 's', 's'] which added 'nan' to final result.

@sbesson
Copy link
Member Author

sbesson commented Jun 28, 2022

Thanks @muhanadz for the review. 27050f2 should amend the README where most of the information about the library usage is captured at the moment.

@sbesson sbesson requested a review from muhanadz June 28, 2022 08:31
Copy link
Member

@muhanadz muhanadz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@sbesson sbesson merged commit dea8834 into ome:master Jun 28, 2022
@sbesson
Copy link
Member Author

sbesson commented Jun 28, 2022

Thanks @muhanadz. The new column detection behavior is now released as omero-metadata 0.11.0 🎉

@sbesson sbesson deleted the detect_headers_sparse branch September 12, 2022 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Header detection: default behavior and handling sparse columns
3 participants