
[builder] normalized layer improvements #884

Merged
bkmartinjr merged 20 commits into bkmartinjr/schema-four from bkmartinjr/norm-layer on Dec 21, 2023

Conversation

Contributor

@bkmartinjr bkmartinjr commented Dec 6, 2023

Changes:

Note to reviewers: this PR is stacked on top of the schema four PR (it targets the bkmartinjr/schema-four branch).


codecov bot commented Dec 6, 2023

Codecov Report

Attention: 25 lines in your changes are missing coverage. Please review.

Comparison is base (409b86e) 86.76% compared to head (c8fc4bc) 86.49%.

❗ Current head c8fc4bc differs from pull request most recent head 0d9ba30. Consider uploading reports for the commit 0d9ba30 to get more accurate results

| Files | Patch % | Lines |
| --- | --- | --- |
| ...ne_census_builder/build_soma/experiment_builder.py | 51.16% | 21 Missing ⚠️ |
| ...llxgene_census_builder/build_soma/validate_soma.py | 91.83% | 4 Missing ⚠️ |
Additional details and impacted files
@@                    Coverage Diff                     @@
##           bkmartinjr/schema-four     #884      +/-   ##
==========================================================
- Coverage                   86.76%   86.49%   -0.27%     
==========================================================
  Files                          72       72              
  Lines                        5228     5311      +83     
==========================================================
+ Hits                         4536     4594      +58     
- Misses                        692      717      +25     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 86.49% <73.11%> (-0.27%) ⬇️ |

Flags with carried forward coverage won't be shown.


@bkmartinjr bkmartinjr changed the title normalized layer improvements [builder] normalized layer improvements Dec 10, 2023
@bkmartinjr bkmartinjr marked this pull request as ready for review December 18, 2023 19:58
Contributor

@pablo-gar pablo-gar left a comment


Spot checking of the data yielded no errors, and an increase in precision was observed.

Precision increase

The min/max row sums of the normalized layer moved closer to 1 by roughly two to three orders of magnitude on a small slice of the Census (10K cells):

  • min(row_sum_norm) went from 0.9946784973144531 to 0.9999870657920837
  • max(row_sum_norm) went from 1.0054092407226562 to 1.0000145435333252
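
A minimal sketch of how such a row-sum check could be done, assuming the normalized layer has been materialized as a SciPy CSR matrix; the function and variable names here are illustrative, not taken from the review (see the linked notebook below for the actual verification):

```python
import numpy as np
import scipy.sparse as sp

def row_sum_range(X_normalized: sp.csr_matrix) -> tuple[float, float]:
    """Return (min, max) of the per-cell row sums of a normalized layer.

    For a correctly library-size-normalized layer every row should sum to ~1.0;
    the spread around 1.0 reflects the precision of the stored values.
    """
    row_sums = np.asarray(X_normalized.sum(axis=1)).ravel()
    return float(row_sums.min()), float(row_sums.max())

# Toy stand-in for a small Census slice: normalize random values and check sums.
raw = sp.random(1_000, 200, density=0.1, format="csr", random_state=0)
norm = sp.csr_matrix(raw.multiply(1.0 / raw.sum(axis=1)))
print(row_sum_range(norm))  # both values should be very close to 1.0
```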

No rank order differences between normalized and raw layers

There is virtually no change in rank order between the normalized and raw layers; the minimum per-cell Spearman correlation is 0.9999999999999998.
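
One way such a rank-order check could be sketched, using scipy.stats.spearmanr per cell; the helper name and the CSR inputs are assumptions for the sketch, not the reviewer's code:

```python
import numpy as np
import scipy.sparse as sp
from scipy.stats import spearmanr

def min_rowwise_spearman(X_raw: sp.csr_matrix, X_norm: sp.csr_matrix) -> float:
    """Minimum per-cell Spearman correlation between raw and normalized rows.

    Library-size normalization rescales each row by a positive constant, so the
    within-cell rank order of genes should be preserved and every correlation
    should be ~1.0 up to floating-point rounding.
    """
    worst = 1.0
    for i in range(X_raw.shape[0]):
        raw_row = X_raw.getrow(i).toarray().ravel()
        norm_row = X_norm.getrow(i).toarray().ravel()
        if raw_row.min() == raw_row.max():  # skip constant rows (correlation undefined)
            continue
        rho, _ = spearmanr(raw_row, norm_row)
        worst = min(worst, rho)
    return worst
```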

Correct feature-length normalization of Smart-seq data

There are negligible differences between the full-precision normalization as computed
by Gene Expression (@atarashansky) and the reduced-precision values stored by the Census.

  • 98% of values agree at 7 decimal points
  • 99.7% of values agree at 6 decimal points

Test data:

"assay == 'Smart-seq2' and dataset_id == 'cee11228-9f0b-4e57-afe2-cfe15ee56312'"
| Decimal points | Values differing (n) | Fraction |
| --- | --- | --- |
| 10 | 2778536 | 0.754564533509455 |
| 9 | 1614282 | 0.43838911724833146 |
| 8 | 515363 | 0.1399566684336763 |
| 7 | 90236 | 0.024505309719132368 |
| 6 | 11112 | 0.0030176758898776417 |
| 5 | 1190 | 0.0003231672344271413 |
| 4 | 142 | 3.856281284760845e-05 |
| 3 | 7 | 1.9009837319243603e-06 |
| 2 | 2 | 5.431382091212458e-07 |
| 1 | 0 | 0.0 |
| 0 | 0 | 0.0 |
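
The decimal-place comparison above could be reproduced with a sketch like the following, assuming two aligned NumPy arrays of normalized values; `full` (the full-precision reference) and `reduced` (the Census values) are hypothetical names:

```python
import numpy as np

def fraction_differing(a: np.ndarray, b: np.ndarray, decimals: int) -> tuple[int, float]:
    """Count values that still differ after rounding both arrays to `decimals` places."""
    differs = np.round(a, decimals) != np.round(b, decimals)
    n = int(differs.sum())
    return n, n / differs.size

# Reproduce the style of the report above:
# for d in range(10, -1, -1):
#     n, frac = fraction_differing(full, reduced, d)
#     print(f"Decimal points: {d}  values differing: n = {n}, fraction = {frac}")
```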

See the agreement between the two calculations:

[image: agreement between the Gene Expression and Census normalized values]

See notebook https://colab.research.google.com/drive/1tX_M1BW9ai_4joXOXlAmuthJVCUdIsVR#scrollTo=yfRuxOUemuKC

Member

@ebezzi ebezzi left a comment


LGTM!

Collaborator

@atolopko-czi atolopko-czi left a comment


LGTM. (I did not review `validate_soma_bounding_box()`, which seems unrelated to this PR.)

Referenced text from the diff:

For Smart-Seq assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
as `normalized[i,j] = (X[i,j] / var[j].feature_length) / sum(X[i, ] / var.feature_length)`.
For all other assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
as `normalized[i,j] = X[i,j] / sum(X[i, ])`.
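
These two definitions could be sketched in NumPy roughly as follows, assuming a dense counts matrix `X` (cells × genes) and a `feature_length` vector aligned with `var`; the function name and the dense representation are illustrative, not the builder's implementation:

```python
import numpy as np

def library_size_normalize(X: np.ndarray, feature_length: np.ndarray, is_smart_seq: bool) -> np.ndarray:
    """Illustrative library-size normalization per the definitions above.

    Smart-Seq assays: divide each count by its gene's feature length, then by the
    row sum of the length-corrected counts, so each row sums to ~1.
    All other assays: divide each count by its row (cell) sum.
    """
    if is_smart_seq:
        length_corrected = X / feature_length  # broadcasts across genes (columns)
        return length_corrected / length_corrected.sum(axis=1, keepdims=True)
    return X / X.sum(axis=1, keepdims=True)
```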

Collaborator

Should the schema also mention the precision reduction being performed? Or are we satisfied that its near equivalence justifies not having to mention it? (I did see Pablo's verification.)

Contributor Author

This PR does not introduce the precision reduction; it only improves the precision. I'll leave this to @pablo-gar to resolve at a later date, at his discretion.

@bkmartinjr bkmartinjr merged commit 930dd6a into bkmartinjr/schema-four Dec 21, 2023
6 of 14 checks passed
@bkmartinjr bkmartinjr deleted the bkmartinjr/norm-layer branch December 21, 2023 01:02
bkmartinjr pushed a commit that referenced this pull request Dec 21, 2023
* schema 4

* update dep pins

* AnnData version update allows for compat code cleanup

* fix bug in feature_length

* bump tiledbsoma dependency to latest

* bump schema version

* update census schema version

* more dependency updates

* update to use production REST API

* [builder] normalized layer improvements (#884)

* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays

* fix int32 overflow in sparse matrix code

* add check for tiledb issue 1969

* bump dependency versions

* work around SOMA bug temporarily

* pr feedback

* [builder] port to use enums in schema (#896)

* first pass at using enum types

* add better error logging for file size assertion

* add feature flag for dict schema fields

* update a few dependencies

* remove debugging print

* update comment

* bump compression level

* pr feedback

* fix typos in comments

* add schema_util tests and fix a bug found by those tests

* lint