Data type inconsistencies between database, processing, and output #216

CarlinLiao · 2023-09-26T17:54:13Z

There are at least two instances of this I'd like to point out:

1. histological_structure is stored as a string (VARCHAR) in the database but is coerced to, and assumed to be, int in more recent features, such as selection of cells by ID in FeatureMatrixExtractor. The database schema could be updated to make int the canonical format.
2. Channel expression is represented as 0/1 values for the sake of data representation but interpreted as bool for the purposes of DataFrame selection. This was discussed to some extent in this PR. Instead of coercing to bool only when we want to use the values for DataFrame indexing, they could be converted to bool as early as possible, staying ints only when coming out of the postgres query and when being saved to and loaded from the compressed bytes file.

I think one or both of these situations would benefit from consistency.
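To make the two cases concrete, here is a minimal sketch of converting once at the boundary instead of at each point of use. The column names `CD3` and `CD8` and the literal rows are hypothetical stand-ins for a query result, not actual SPT data or code.

```python
import pandas as pd

# Hypothetical stand-in for rows coming out of the database query:
# identifiers arrive as strings, channel expression as 0/1 integers.
cells = pd.DataFrame(
    {
        "histological_structure": ["10", "11", "12"],
        "CD3": [1, 0, 1],
        "CD8": [0, 0, 1],
    }
)
channels = ["CD3", "CD8"]  # hypothetical channel names

# Case 1: coerce the string identifiers to int once, right after loading.
cells["histological_structure"] = cells["histological_structure"].astype(int)
cells = cells.set_index("histological_structure")

# Case 2: convert 0/1 to bool as early as possible.
cells = cells.astype({name: bool for name in channels})

# Downstream code can then select by integer ID and index with boolean
# columns without any further coercion.
selected = cells.loc[[10, 12]]
double_positive = cells[cells["CD3"] & cells["CD8"]]
```

With this arrangement the only places that see strings or 0/1 ints are the load and save steps.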

jimmymathews · 2023-09-27T18:43:16Z

As for (1), the use of string values for all identifiers in the scstudies schema is a deliberate design choice, motivated by the aim of consistency across the schema. This greatly simplifies schema authoring and inter-table referencing. histological_structure is no exception, and the apparent integrality of its identifiers is an implementation-specific artifact of the identifier-issuing portion of the (SPT) data import process. SPT does make several schema alterations and additions for performance purposes, since the ADI schema does not prioritize database performance, but I prefer to use such a mechanism only if there is a genuine performance-related purpose.

As for (2), "channel expression being expressed as 0/1" happens only in a very narrow intermediate processing step. As I noted elsewhere in comments, the database tables do not use 0/1 values for expression. Boolean values are supported by the database but not used, since the expression values in the schema are not necessarily dichotomous.

Moreover, consistency between the storage format in the database and the feature matrices' values is not a high priority, because they have different semantics. In the feature matrix, dichotomous values are the paradigm; in the database storage format this is not so. The inconsistency has a reason. ADI-compliant datasets could use trinary expression values, for example "high/low/absent", which would cause SPT's feature matrix functionality to fail, but that is SPT's problem, not scstudies' problem.
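As a rough illustration of that boundary, the sketch below maps stored expression values to the dichotomous values a feature matrix expects and fails loudly on anything else. The encodings "positive"/"negative" and the function name are hypothetical; the actual stored values are not given in this thread.

```python
# Hypothetical encodings for dichotomous expression values.
DICHOTOMOUS = {"positive": True, "negative": False}


def to_feature_values(raw_values):
    """Map stored expression values to bool, rejecting non-dichotomous input."""
    unexpected = sorted(set(raw_values) - set(DICHOTOMOUS))
    if unexpected:
        raise ValueError(f"non-dichotomous expression values: {unexpected}")
    return [DICHOTOMOUS[value] for value in raw_values]


# A dichotomous dataset converts cleanly.
binary = to_feature_values(["positive", "negative", "positive"])

# A trinary dataset is valid for the schema but rejected at this step.
try:
    to_feature_values(["high", "low", "absent"])
except ValueError as error:
    failure = str(error)
else:
    failure = None
```

The schema stays general, and the restriction to two values lives only in the feature matrix step.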

jimmymathews · 2023-10-24T19:59:27Z

Closing for now pending a proposal for followup action.

CarlinLiao added the refactor Code change preserving functionality label Sep 26, 2023

jimmymathews added the wontimplement New feature needs to be broken down or deferred label Oct 24, 2023

jimmymathews closed this as completed Oct 24, 2023
