hotfix cggnn scripts, streamline calls #208

CarlinLiao · 2023-09-12T23:16:02Z

While using set cggnn explore_classes and extract for the first time, I noticed a few fixes and improvements I could make.

explore_classes didn't pass the config file location to the cursor correctly, resulting in an error. This is fixed.
extract now organizes files it extracts related to a study into a directory with that study's name, for easy organization.
extract does nothing if the files it intends to create are already created.
extract is amended to only have FME pull specimens that are of the strata the user requested, instead of all specimens from a study

CarlinLiao · 2023-09-13T13:29:59Z

@jimmymathews I don't think FeatureMatrixExtractor does what we want it to do when given a single specimen.

When calling a single specimen, it seems to pull sparse entries from every study in the database, and then tries to match the specimen against ever study it just pulled, resulting in an error. (I'm not wholly confident on the latter part of that statement, but something's erroring.)

This may be a relic of when FME supported extraction from every study at once? I'll look into it.

(The rest of spt cggnn extract works now. The only study where we'd only want a subset of stratifiers is "Melanoma intralesional IL2".)

Here's the full output:

(spt_use) cliao@LTDV0308 SPT % spt cggnn extract --study 'Melanoma intralesional IL2' --spt_db_config_location 'spt_db.config' --strata 5 7 --output_location 'zzz'

09-13 01:57:11 [  INFO   ] db.feature_matrix_extractor: Retrieving stratification from database.
09-13 01:57:12 [  INFO   ] db.feature_matrix_extractor: Done retrieving stratification.
09-13 01:57:12 [  INFO   ] db.feature_matrix_extractor: Retrieving expression data from database.
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller: Will pull feature matrices for studies:
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Breast cancer IMC - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     LUAD progression - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Melanoma CyTOF ICI - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Melanoma IMC TME protein - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Melanoma intralesional IL2 - measurement
09-13 01:57:12 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'H3F3': '0', 'H3Lys28trimethyl': '1', 'KRT5': '2', 'FN1': '3', 'KRT19': '4', 'KRT8KRT18': '5', 'TWIST1': '6', 'CD68': '7', 'KRT14': '8', 'ACTA2': '9', 'VIM': '10', 'MYC': '11', 'ERBB2': '12', 'CD3E': '13', 'SNAI2': '14', 'ESR1': '15', 'IgG': '16', 'PGR': '17', 'TP53': '18', 'CD44': '19', 'CD45': '20', 'GATA3': '21', 'CD20': '22', 'CA9': '23', 'CDH1CDH3': '24', 'MKI67': '25', 'EGFR': '26', 'VWF': '27', 'PECAM1': '28', 'KRT7': '29', 'CK': '30', 'PARPAsp214cleaved': '31', 'CASP3cleaved': '32'}
09-13 01:57:12 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Breast cancer IMC - measurement".
09-13 01:59:36 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 01:59:37 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 01:59:37 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 01:59:37 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 01:59:37 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'CD68': '7', 'FOXP3': '59', 'ITGAX': '64', 'MPO': '75', 'KIT': '118', 'CD14': '119', 'CD163': '120', 'FCGR3A': '121', 'CD20': '122', 'CD31': '123', 'CD3': '124', 'CD4': '125', 'CD8A': '126', 'KLRD1': '127', 'DNA1': '128', 'HLA-DR': '129', 'HistoneH3': '130', 'PanCK': '131'}
09-13 01:59:37 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "LUAD progression - measurement".
09-13 02:02:05 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:02:06 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:02:06 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:02:06 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:02:06 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'MKI67': '25', 'HDAC': '34', 'SMN1': '35', 'CD45RA': '43', 'CD68': '47', 'CD45RO': '58', 'FOXP3': '59', 'ICOS': '60', 'CTNNB1': '61', 'CD8A': '62', 'CD4': '66', 'SOX10': '68', 'PTPRC': '74', 'CD14': '100', 'PECAM1': '101', 'ITGB1': '102', 'CD274': '103', 'TNFRSF4': '104', 'LAG3': '105', 'HAVCR2': '106', 'CCR7': '107', 'VSIR': '108', 'MS4A1': '109', 'IGLL5': '110', 'CD40': '111', 'COL1': '112', 'CD3': '113', 'p-ERK1/2': '114', 'cleav-Casp3': '115', 'HLA-DR': '116', 'S100': '117'}
09-13 02:02:06 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Melanoma CyTOF ICI - measurement".
09-13 02:04:28 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:04:29 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:04:29 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:04:29 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:04:29 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'VIM': '10', 'CAV1': '33', 'HDAC': '34', 'SMN1': '35', 'FUT4': '36', 'H3K27me3': '37', 'CD7': '38', 'CXCR2': '39', 'HLA-DRB1': '40', 'S100': '41', 'CD19': '42', 'CD45RA': '43', 'SOX9': '44', 'TOX': '45', 'CD20': '46', 'CD68': '47', 'pERK1/2': '48', 'CD3': '49', 'CD36': '50', 'NGFR': '51', 'PDCD1': '52', 'MITF': '53', 'ITGAM': '54', 'GZMB': '55', 'PDL1': '56', 'TCF1/7': '57', 'CD45RO': '58', 'FOXP3': '59', 'ICOS': '60', 'CTNNB1': '61', 'CD8A': '62', 'COL1A1': '63', 'MKI67': '25', 'ITGAX': '64', 'TAS2R63P': '65', 'CD4': '66', 'IDO1': '67', 'SOX10': '68', 'CLEC4C': '69', 'MRC1': '70', 'PARP1': '71', 'DNA1': '72', 'DNA2': '73', 'PTPRC': '74', 'MPO': '75'}
09-13 02:04:29 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Melanoma IMC TME protein - measurement".
09-13 02:06:52 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:06:53 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:06:53 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:06:53 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:06:53 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'IDO1': '67', 'SOX10': '68', 'B2M': '76', 'B7H3': '77', 'CD14': '78', 'CD163': '79', 'CD20': '80', 'CD25': '81', 'CD27': '82', 'CD3': '83', 'CD4': '84', 'CD56': '85', 'CD68': '86', 'CD8': '87', 'DAPI': '88', 'FOXP3': '89', 'KI67': '90', 'LAG3': '91', 'MHCI': '92', 'MHCII': '93', 'MRC1': '94', 'PD1': '95', 'PDL1': '96', 'S100B': '97', 'TGM2': '98', 'TIM3': '99'}
09-13 02:06:53 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Melanoma intralesional IL2 - measurement".
09-13 02:09:25 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:09:26 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:09:26 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:09:26 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:09:26 [  INFO   ] db.feature_matrix_extractor: Done retrieving expression data from database.
09-13 02:09:26 [  INFO   ] db.feature_matrix_extractor: Retrieving polygon centroids from shapefiles in database.
09-13 02:09:41 [  INFO   ] workflow.common.structure_centroids_puller: Done parsing shapefiles for Melanoma intralesional IL2 - measurement. (1 iterations)
09-13 02:09:41 [  INFO   ] db.feature_matrix_extractor: Done retrieving centroids.
09-13 02:09:41 [  INFO   ] db.feature_matrix_extractor: Retrieving phenotypes from database.
09-13 02:09:42 [  INFO   ] db.feature_matrix_extractor: Done retrieving phenotypes.
09-13 02:09:42 [  INFO   ] db.feature_matrix_extractor: Aggregating channel information for one study.
09-13 02:09:42 [  INFO   ] db.feature_matrix_extractor: Done aggregating channel information.
Traceback (most recent call last):
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 57, in <module>
    df_cell, df_label, label_to_result = extract_cggnn_data(
                                         ^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 115, in extract_cggnn_data
    df_cell = _create_cell_df({
                              ^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 116, in <dictcomp>
    specimen: extractor.extract(specimen=specimen)[specimen].dataframe
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 109, in extract
    extraction = self._extract(
                 ^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 142, in _extract
    self._create_channel_information(data_arrays),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 244, in _create_channel_information
    return [
           ^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 245, in <listcomp>
    symbols[targets[i]] for i in sorted([int(index) for index in targets.keys()])
    ~~~~~~~^^^^^^^^^^^^
KeyError: '67'

CarlinLiao · 2023-09-13T14:03:04Z

I made two changes to SparseMatrixPuller to avoid pulling from every study when provided only 1 specimen.

A couple of notes:

There's still a lot of redundant computation involved in that a new SparseMatrixPuller is spun up and garbage collected for every single specimen. A strategy where we give SparseMatrixPuller an iterable of specimens instead of a single specimen might be better.
I added a new import to SparseMatrixPuller, StudyAccess.
- The import statement will need to be changed in order to merge with Upload importance scores via cggnn command #204
- SparseMatrixPuller does stuff internally that could be handled by StudyAccess, but it forms its own custom SQL queries itself instead. Should one method or the other be preferred for consistency, or leave as is?

I'm not planning to make any changes for now since it all Just Works™️, but it might be better for maintainability to so I will if I'm asked to.

jimmymathews · 2023-09-13T14:36:11Z

I think this is due to some relatively recently added optimizations to the SparseMatrixPuller and friends system.
This optimization relies on a precomputed database index (relating specimens to cells) to save lots of time on initialization of the application.
This is great for that use case but for single specimens in the current implementation it entails ignoring the "study name" parameter, i.e. if the provided specimen isn't part of a given study it proceeds anyway, looping through all studies. This is a mistake of the implementation.

CarlinLiao · 2023-09-13T14:45:09Z

One more issue: somewhere in FME or cggnn extract, the histological_structure ID is getting reset per specimen instead of retaining unique IDs. It might be pulling the table identifier instead of the global ID? Need to look into this...

jimmymathews · 2023-09-13T14:59:39Z

spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py

+        continuous_also: bool = False,
+    ) -> CompressedDataArrays:
+        if specimen is not None:
+            study = StudyAccess(self.cursor).get_study_from_specimen(specimen)


This should work to specialize to just the one study in case a specimen is provided.

CarlinLiao · 2023-09-13T15:00:49Z

Looks like FME doesn't preserve the histological_structure ID when it forms feature matrices for each specimen.

I'd like to change it for index by HSID. Would that break anything downstream?

SPT/spatialprofilingtoolbox/db/feature_matrix_extractor.py

Lines 178 to 205 in eb3e928

    
           def _create_feature_matrices(self, 
        
               study: dict[str, dict[str, Any]], 
        
               centroid_coordinates: dict[str, Any], 
        
               phenotypes: dict[str, PhenotypeCriteria], 
        
               channel_information: list[str], 
        
           ) -> dict[str, MatrixBundle]: 
        
               logger.info('Creating feature matrices from binary data arrays and centroids.') 
        
               matrices: dict[str, MatrixBundle] = {} 
        
               for j, specimen in enumerate(sorted(list(study['data arrays by specimen'].keys()))): 
        
                   logger.debug('Specimen %s .', specimen) 
        
                   expressions = study['data arrays by specimen'][specimen] 
        
                   rows = [ 
        
                       self._create_feature_matrix_row( 
        
                           centroid_coordinates[specimen][i], 
        
                           expressions[i], 
        
                           len(study['target index lookup']), 
        
                       ) for i in range(len(expressions)) 
        
                   ] 
        
                   dataframe = DataFrame( 
        
                       rows, 
        
                       columns=['pixel x', 'pixel y'] + [f'C {cs}' for cs in channel_information], 
        
                   ) 
        
                   for symbol, criteria in phenotypes.items(): 
        
                       dataframe[f'P {symbol}'] = ( 
        
                           dataframe[[f'C {m}' for m in criteria.positive_markers]].all(axis=1) & 
        
                           ~dataframe[[f'C {m}' for m in criteria.negative_markers]].any(axis=1) 
        
                       ).astype(int) 
        
                   matrices[specimen] = MatrixBundle(dataframe, f'{j}.tsv')

jimmymathews · 2023-09-13T15:05:28Z

I seem to remember deliberately omitting this ID from the feature matrix output. Seemed like unnecessary coupling between database internal format and the output data structure.
I can't remember if the index for the dataframe is assumed somewhere to be 0-based integers. I don't think so. So it should be ok to use it.

CarlinLiao · 2023-09-13T15:07:26Z

I need the histological structure IDs for cell importance scoring because otherwise there's no way to map the CG-GNN output back to the cell it came from. So we're good to reverse this intentional omission? Or add a histological structure ID column instead of indexing on it. That might break even more things, though.

jimmymathews · 2023-09-13T15:08:29Z

Yeah, give it a try.

CarlinLiao · 2023-09-13T15:10:01Z

Do you happen to know if you dropped the HSID in SparseMatrixPuller, FeatureMatrixExtractor, or somewhere else?

EDIT: I think it's in SparseMatrixPuller... downstream effects likelihood increases.

jimmymathews · 2023-09-13T15:16:27Z

The cells are supposed to be sorted by histological structure ID, because of this line:

SPT/spatialprofilingtoolbox/workflow/common/sparse_matrix_puller.py

Line 271 in eb3e928

sparse_entries.sort(key=lambda x: (x[3], x[0]))

(x[3] refers to the specimen ID string, and x[0] refers to the histological structure ID string.)

So you could just pull all the histological structure IDs for a specimen and sort them.

CarlinLiao · 2023-09-13T15:50:02Z

I'm leveraging this assumption for this but we've already used it elsewhere (e.g., StudyAccess): each specimen string is uniquely identified to a study. In other words, two studies in the same database are not allowed to have specimens with the same name. This could be a problem in the future.

CarlinLiao · 2023-09-13T16:17:04Z

The new commit assumes that the cells are in the order specified in the Puller and assumes that specimen name strings are uniquely attached to a single study. I have my reservations about doing it this way instead of editing the Puller to retain HSIDs instead of discarding them, but this Just Works™️ so maybe we'll worry about it later.

jimmymathews · 2023-09-14T22:04:22Z

Editing the Puller to retain HSIDs as you say is fine with me, just might not be worth the effort at the moment.

All tests pass for me, I'm ready to merge.

hotfix cggnn extract, streamline calls

9155401

CarlinLiao added the refactor Code change preserving functionality label Sep 12, 2023

CarlinLiao requested a review from jimmymathews September 12, 2023 23:16

CarlinLiao self-assigned this Sep 12, 2023

more fixes

4cff3c9

SparseMatrixPuller fix for single specimen

8ff47d9

jimmymathews reviewed Sep 13, 2023

View reviewed changes

have FME preserve hist struct id

3144569

booled extracts

c200f16

jimmymathews merged commit 970d06d into main Sep 14, 2023
1 check passed

jimmymathews deleted the extract_fix branch September 14, 2023 22:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hotfix cggnn scripts, streamline calls #208

hotfix cggnn scripts, streamline calls #208

CarlinLiao commented Sep 12, 2023

CarlinLiao commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 13, 2023 •

edited

Loading

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023 •

edited

Loading

jimmymathews commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 14, 2023

hotfix cggnn scripts, streamline calls #208

hotfix cggnn scripts, streamline calls #208

Conversation

CarlinLiao commented Sep 12, 2023

CarlinLiao commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews Sep 13, 2023

Choose a reason for hiding this comment

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 13, 2023 • edited Loading

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023 • edited Loading

jimmymathews commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

CarlinLiao commented Sep 13, 2023

jimmymathews commented Sep 14, 2023

jimmymathews commented Sep 13, 2023 •

edited

Loading

CarlinLiao commented Sep 13, 2023 •

edited

Loading