[builder] normalized layer improvements #884
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##    bkmartinjr/schema-four    #884   +/- ##
==========================================
- Coverage   86.76%   86.49%   -0.27%
==========================================
  Files          72       72
  Lines        5228     5311      +83
==========================================
+ Hits         4536     4594      +58
- Misses        692      717      +25
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/experiment_builder.py
Spot checking of the data yielded no errors, and an increase in precision is observed.
Precision increase

The min/max row sums of the normalized layer moved closer to 1 by more than two orders of magnitude in a small subset of the Census (10K cells):

- `min(row_sum_norm)` went from 0.9946784973144531 to 0.9999870657920837
- `max(row_sum_norm)` went from 1.0054092407226562 to 1.0000145435333252
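As a rough illustration of how such a spot check can be done (a minimal sketch, not the actual validation code; it assumes the normalized layer has already been read into a SciPy CSR matrix called `norm`):

```python
import numpy as np
from scipy import sparse

def row_sum_bounds(norm: sparse.csr_matrix) -> tuple[float, float]:
    # Each cell (row) of the normalized layer should sum to ~1.0; the spread
    # of the min/max row sums measures the precision of the layer.
    row_sums = np.asarray(norm.sum(axis=1)).ravel()
    return float(row_sums.min()), float(row_sums.max())
```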
No rank-order differences between normalized and raw layers

There is virtually no reordering between the normalized and raw layers; the minimum per-cell Spearman correlation is 0.9999999999999998.
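A sketch of this kind of rank-order check, assuming `raw` and `norm` are dense (cells × genes) arrays extracted from the two layers (names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def min_row_spearman(raw: np.ndarray, norm: np.ndarray) -> float:
    # Spearman correlation of each cell's raw vs. normalized values;
    # ~1.0 for every cell means normalization preserves rank order.
    corrs = []
    for i in range(raw.shape[0]):
        rho, _ = spearmanr(raw[i], norm[i])
        corrs.append(rho)
    return float(np.min(corrs))
```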
Correct feature-length normalization of smart-seq data
There are only negligible differences between the full-precision normalization as done by Gene Expression (@atarashansky) and the reduced-precision normalization done by the Census:
- 98% of values agree to 7 decimal places
- 99.7% of values agree to 6 decimal places
Test data: `assay == 'Smart-seq2' and dataset_id == 'cee11228-9f0b-4e57-afe2-cfe15ee56312'`
| Decimal places | Values differing (n) | Fraction |
| ---: | ---: | ---: |
| 10 | 2778536 | 0.754564533509455 |
| 9 | 1614282 | 0.43838911724833146 |
| 8 | 515363 | 0.1399566684336763 |
| 7 | 90236 | 0.024505309719132368 |
| 6 | 11112 | 0.0030176758898776417 |
| 5 | 1190 | 0.0003231672344271413 |
| 4 | 142 | 3.856281284760845e-05 |
| 3 | 7 | 1.9009837319243603e-06 |
| 2 | 2 | 5.431382091212458e-07 |
| 1 | 0 | 0.0 |
| 0 | 0 | 0.0 |
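One way to reproduce a comparison of this shape (a hypothetical helper, not the notebook's exact code), given `a` and `b` as the matched normalized values from the two calculations:

```python
import numpy as np

def agreement_by_decimals(a: np.ndarray, b: np.ndarray, max_decimals: int = 10) -> None:
    # For each precision level, count values that still differ after
    # rounding both arrays to that many decimal places.
    for d in range(max_decimals, -1, -1):
        n = int(np.count_nonzero(np.round(a, d) != np.round(b, d)))
        print(f"Decimal places: {d} values differing: n = {n}, fraction = {n / a.size}")
```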
See the agreement between the two calculations in this notebook: https://colab.research.google.com/drive/1tX_M1BW9ai_4joXOXlAmuthJVCUdIsVR#scrollTo=yfRuxOUemuKC
LGTM!
LGTM. (I did not review `validate_soma_bounding_box()`, which seems unrelated to this PR.)
For Smart-Seq assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined as `normalized[i,j] = (X[i,j] / var[j].feature_length) / sum(X[i, ] / var.feature_length)`. For all other assays, for a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined as `normalized[i,j] = X[i,j] / sum(X[i, ])`.
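A minimal dense-matrix sketch of the two definitions above (illustrative only; the builder itself operates on sparse data):

```python
import numpy as np

def normalize(X: np.ndarray, feature_length: np.ndarray, smart_seq: bool) -> np.ndarray:
    # X: raw counts (cells x genes); feature_length: per-gene lengths (var.feature_length).
    X = X.astype(np.float64)
    if smart_seq:
        # Smart-Seq: length-normalize each count first, so the per-cell
        # library size is the row sum of X[i, ] / feature_length.
        X = X / feature_length[np.newaxis, :]
    # Library-size normalization: each cell's values sum to 1.
    return X / X.sum(axis=1, keepdims=True)
```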
Should the schema also mention the precision reduction being performed? Or are we satisfied that its near-equivalence justifies not mentioning it? (I did see Pablo's verification.)
This PR does not introduce the precision reduction; it only improves precision. I'll leave this up to @pablo-gar to resolve at a later date, at his discretion.
* schema 4
* update dep pins
* AnnData version update allows for compat code cleanup
* fix bug in feature_length
* bump tiledbsoma dependency to latest
* bump schema version
* update census schema version
* more dependency updates
* update to use production REST API
* [builder] normalized layer improvements (#884)
  * improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
  * fix int32 overflow in sparse matrix code
  * add check for tiledb issue 1969
  * bump dependency versions
  * work around SOMA bug temporarily
  * pr feedback
* [builder] port to use enums in schema (#896)
  * first pass at using enum types
  * add better error logging for file size assertion
  * add feature flag for dict schema fields
  * update a few dependencies
  * remove debugging print
  * update comment
  * bump compression level
  * pr feedback
  * fix typos in comments
  * add schema_util tests and fix a bug found by those tests
  * lint
Changes:
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback

Note to reviewers: this PR is stacked on top of the schema four PR.