[builder] normalized layer improvements #884
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##    bkmartinjr/schema-four    #884   +/- ##
==========================================
- Coverage   86.76%   86.49%   -0.27%
==========================================
  Files          72       72
  Lines        5228     5311      +83
==========================================
+ Hits         4536     4594      +58
- Misses        692      717      +25
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/experiment_builder.py
Spot checking of the data yielded no errors, and an increase in precision is observed.
Precision increase

The min/max row sums of the normalized layer moved closer to 1 by more than two orders of magnitude in a small subset of the Census (10K cells):

- `min(row_sum_norm)` went from 0.9946784973144531 to 0.9999870657920837
- `max(row_sum_norm)` went from 1.0054092407226562 to 1.0000145435333252
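As a rough illustration of how such a spot check can be done (a minimal sketch, not the actual validation code; it assumes the normalized layer has already been read into a SciPy CSR matrix called `norm`):

```python
import numpy as np
from scipy import sparse

def row_sum_bounds(norm: sparse.csr_matrix) -> tuple[float, float]:
    # Each cell (row) of the normalized layer should sum to ~1.0; the spread
    # of the min/max row sums measures the precision of the layer.
    row_sums = np.asarray(norm.sum(axis=1)).ravel()
    return float(row_sums.min()), float(row_sums.max())
```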
No rank-order differences between normalized and raw layers

There is virtually no reordering between the normalized and raw layers; the minimum per-cell Spearman correlation is 0.9999999999999998.
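A sketch of this kind of rank-order check, assuming `raw` and `norm` are dense (cells × genes) arrays extracted from the two layers (names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def min_row_spearman(raw: np.ndarray, norm: np.ndarray) -> float:
    # Spearman correlation of each cell's raw vs. normalized values;
    # ~1.0 for every cell means normalization preserves rank order.
    corrs = []
    for i in range(raw.shape[0]):
        rho, _ = spearmanr(raw[i], norm[i])
        corrs.append(rho)
    return float(np.min(corrs))
```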
Correct feature-length normalization of smart-seq data
There are only negligible differences between the full-precision normalization as done by Gene Expression (@atarashansky) and the reduced-precision normalization done by the Census:
- 98% of values agree to 7 decimal places
- 99.7% of values agree to 6 decimal places
Test data: `assay == 'Smart-seq2' and dataset_id == 'cee11228-9f0b-4e57-afe2-cfe15ee56312'`
| Decimal places | Values differing (n) | Fraction |
| ---: | ---: | ---: |
| 10 | 2778536 | 0.754564533509455 |
| 9 | 1614282 | 0.43838911724833146 |
| 8 | 515363 | 0.1399566684336763 |
| 7 | 90236 | 0.024505309719132368 |
| 6 | 11112 | 0.0030176758898776417 |
| 5 | 1190 | 0.0003231672344271413 |
| 4 | 142 | 3.856281284760845e-05 |
| 3 | 7 | 1.9009837319243603e-06 |
| 2 | 2 | 5.431382091212458e-07 |
| 1 | 0 | 0.0 |
| 0 | 0 | 0.0 |
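One way to reproduce a comparison of this shape (a hypothetical helper, not the notebook's exact code), given `a` and `b` as the matched normalized values from the two calculations:

```python
import numpy as np

def agreement_by_decimals(a: np.ndarray, b: np.ndarray, max_decimals: int = 10) -> None:
    # For each precision level, count values that still differ after
    # rounding both arrays to that many decimal places.
    for d in range(max_decimals, -1, -1):
        n = int(np.count_nonzero(np.round(a, d) != np.round(b, d)))
        print(f"Decimal places: {d} values differing: n = {n}, fraction = {n / a.size}")
```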
See the agreement between the two calculations in this notebook: https://colab.research.google.com/drive/1tX_M1BW9ai_4joXOXlAmuthJVCUdIsVR#scrollTo=yfRuxOUemuKC
LGTM!
LGTM. (I did not review `validate_soma_bounding_box()`, which seems unrelated to this PR.)
For Smart-Seq assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined as `normalized[i,j] = (X[i,j] / var[j].feature_length) / sum(X[i, ] / var.feature_length)`. For all other assays, for a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined as `normalized[i,j] = X[i,j] / sum(X[i, ])`.
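A minimal dense-matrix sketch of the two definitions above (illustrative only; the builder itself operates on sparse data):

```python
import numpy as np

def normalize(X: np.ndarray, feature_length: np.ndarray, smart_seq: bool) -> np.ndarray:
    # X: raw counts (cells x genes); feature_length: per-gene lengths (var.feature_length).
    X = X.astype(np.float64)
    if smart_seq:
        # Smart-Seq: length-normalize each count first, so the per-cell
        # library size is the row sum of X[i, ] / feature_length.
        X = X / feature_length[np.newaxis, :]
    # Library-size normalization: each cell's values sum to 1.
    return X / X.sum(axis=1, keepdims=True)
```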
Should the schema also mention the precision reduction being performed? Or are we satisfied that its near-equivalence justifies not mentioning it? (I did see Pablo's verification.)
This PR does not introduce the precision reduction; it only improves precision. I'll leave this up to @pablo-gar to resolve at a later date, at his discretion.
* schema 4
* update dep pins
* AnnData version update allows for compat code cleanup
* fix bug in feature_length
* bump tiledbsoma dependency to latest
* bump schema version
* update census schema version
* more dependency updates
* update to use production REST API
* [builder] normalized layer improvements (#884)
  * improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
  * fix int32 overflow in sparse matrix code
  * add check for tiledb issue 1969
  * bump dependency versions
  * work around SOMA bug temporarily
  * pr feedback
* [builder] port to use enums in schema (#896)
  * first pass at using enum types
  * add better error logging for file size assertion
  * add feature flag for dict schema fields
  * update a few dependencies
  * remove debugging print
  * update comment
  * bump compression level
  * pr feedback
  * fix typos in comments
  * add schema_util tests and fix a bug found by those tests
  * lint
Changes:
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback

Note to reviewers: this PR is stacked on top of the schema four PR.