Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data type inconsistencies between database, processing, and output #216

Closed
CarlinLiao opened this issue Sep 26, 2023 · 2 comments
Closed
Labels
refactor Code change preserving functionality wontimplement New feature needs to be broken down or deferred

Comments

@CarlinLiao
Copy link
Collaborator

There are at least two instances of this I'd like to point out

  1. histological_structure being stored as string/VARCHAR in the database but coerced and assumed to be int in more recent features, like selection of cells by ID in FeatureMatrixExtractor. The database schema can be updated to canonize the int format.
  2. Channel expression being expressed as 0/1 values for the sake of data expression but interpreted as bool for the purposes of DataFrame selection. Discussed to some extent in this PR. Instead of coercing to bool when we want to use it for DataFrame indexing, they could be converted to bool as soon as possible, and only stay ints when coming out of the postgres query and when saving and loading from compressed bytes file.

I think one or both of these situations would benefit from consistency.

@CarlinLiao CarlinLiao added the refactor Code change preserving functionality label Sep 26, 2023
@jimmymathews
Copy link
Collaborator

jimmymathews commented Sep 27, 2023

As for (1), the string values for all identifiers in the scstudies schema is a deliberate design choice, motivated by the aim of consistency across the schema for all identifiers. This greatly simplifies schema authoring and inter-table referencing. histological_structure is no exception and the apparent integrality of its identifiers is an implementation-specific artifact of the identifier-issuing portion of the (SPT) data import process. There are several schema alterations and additions that SPT does for performance purposes, since the ADI schema does not prioritize database performance, but I prefer to use such a mechanism only if there is a genuine performance-related purpose.

As for (2), "channel expression being expressed as 0/1" happens in only a very slim intermediate processing step. As I noted elsewhere in comments, the database tables do not use 0/1 values for expression, and boolean values are supported by the database but not used, since the expression values in the schema are not necessarily dichotomous.

Moreover the aim of consistency between the storage format in the database and the feature matrices' values is not highly prioritized, because they have different semantics. In the feature matrix dichotomous values is the paradigm, and in the database storage format this is not so. The inconsistency has a reason. ADI-compliant datasets could use trinary expression values, for example, like "high/low/absent", which would cause SPT's feature matrix functionality to fail, but that is SPT's problem not scstudies' problem.

@jimmymathews jimmymathews added the wontimplement New feature needs to be broken down or deferred label Oct 24, 2023
@jimmymathews
Copy link
Collaborator

Closing for now pending a proposal for followup action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Code change preserving functionality wontimplement New feature needs to be broken down or deferred
Projects
None yet
Development

No branches or pull requests

2 participants