Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hotfix cggnn scripts, streamline calls #208

Merged
merged 5 commits into from
Sep 14, 2023
Merged

hotfix cggnn scripts, streamline calls #208

merged 5 commits into from
Sep 14, 2023

Conversation

CarlinLiao
Copy link
Collaborator

While using set cggnn explore_classes and extract for the first time, I noticed a few fixes and improvements I could make.

  • explore_classes didn't pass the config file location to the cursor correctly, resulting in an error. This is fixed.
  • extract now organizes files it extracts related to a study into a directory with that study's name, for easy organization.
  • extract does nothing if the files it intends to create are already created.
  • extract is amended to only have FME pull specimens that are of the strata the user requested, instead of all specimens from a study

@CarlinLiao CarlinLiao added the refactor Code change preserving functionality label Sep 12, 2023
@CarlinLiao CarlinLiao self-assigned this Sep 12, 2023
@CarlinLiao
Copy link
Collaborator Author

@jimmymathews I don't think FeatureMatrixExtractor does what we want it to do when given a single specimen.

When calling a single specimen, it seems to pull sparse entries from every study in the database, and then tries to match the specimen against ever study it just pulled, resulting in an error. (I'm not wholly confident on the latter part of that statement, but something's erroring.)

This may be a relic of when FME supported extraction from every study at once? I'll look into it.

(The rest of spt cggnn extract works now. The only study where we'd only want a subset of stratifiers is "Melanoma intralesional IL2".)

Here's the full output:

(spt_use) cliao@LTDV0308 SPT % spt cggnn extract --study 'Melanoma intralesional IL2' --spt_db_config_location 'spt_db.config' --strata 5 7 --output_location 'zzz'

09-13 01:57:11 [  INFO   ] db.feature_matrix_extractor: Retrieving stratification from database.
09-13 01:57:12 [  INFO   ] db.feature_matrix_extractor: Done retrieving stratification.
09-13 01:57:12 [  INFO   ] db.feature_matrix_extractor: Retrieving expression data from database.
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller: Will pull feature matrices for studies:
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Breast cancer IMC - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     LUAD progression - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Melanoma CyTOF ICI - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Melanoma IMC TME protein - measurement
09-13 01:57:12 [  INFO   ] workflow.common.sparse_matrix_puller:     Melanoma intralesional IL2 - measurement
09-13 01:57:12 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'H3F3': '0', 'H3Lys28trimethyl': '1', 'KRT5': '2', 'FN1': '3', 'KRT19': '4', 'KRT8KRT18': '5', 'TWIST1': '6', 'CD68': '7', 'KRT14': '8', 'ACTA2': '9', 'VIM': '10', 'MYC': '11', 'ERBB2': '12', 'CD3E': '13', 'SNAI2': '14', 'ESR1': '15', 'IgG': '16', 'PGR': '17', 'TP53': '18', 'CD44': '19', 'CD45': '20', 'GATA3': '21', 'CD20': '22', 'CA9': '23', 'CDH1CDH3': '24', 'MKI67': '25', 'EGFR': '26', 'VWF': '27', 'PECAM1': '28', 'KRT7': '29', 'CK': '30', 'PARPAsp214cleaved': '31', 'CASP3cleaved': '32'}
09-13 01:57:12 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Breast cancer IMC - measurement".
09-13 01:59:36 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 01:59:37 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 01:59:37 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 01:59:37 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 01:59:37 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'CD68': '7', 'FOXP3': '59', 'ITGAX': '64', 'MPO': '75', 'KIT': '118', 'CD14': '119', 'CD163': '120', 'FCGR3A': '121', 'CD20': '122', 'CD31': '123', 'CD3': '124', 'CD4': '125', 'CD8A': '126', 'KLRD1': '127', 'DNA1': '128', 'HLA-DR': '129', 'HistoneH3': '130', 'PanCK': '131'}
09-13 01:59:37 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "LUAD progression - measurement".
09-13 02:02:05 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:02:06 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:02:06 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:02:06 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:02:06 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'MKI67': '25', 'HDAC': '34', 'SMN1': '35', 'CD45RA': '43', 'CD68': '47', 'CD45RO': '58', 'FOXP3': '59', 'ICOS': '60', 'CTNNB1': '61', 'CD8A': '62', 'CD4': '66', 'SOX10': '68', 'PTPRC': '74', 'CD14': '100', 'PECAM1': '101', 'ITGB1': '102', 'CD274': '103', 'TNFRSF4': '104', 'LAG3': '105', 'HAVCR2': '106', 'CCR7': '107', 'VSIR': '108', 'MS4A1': '109', 'IGLL5': '110', 'CD40': '111', 'COL1': '112', 'CD3': '113', 'p-ERK1/2': '114', 'cleav-Casp3': '115', 'HLA-DR': '116', 'S100': '117'}
09-13 02:02:06 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Melanoma CyTOF ICI - measurement".
09-13 02:04:28 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:04:29 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:04:29 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:04:29 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:04:29 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'VIM': '10', 'CAV1': '33', 'HDAC': '34', 'SMN1': '35', 'FUT4': '36', 'H3K27me3': '37', 'CD7': '38', 'CXCR2': '39', 'HLA-DRB1': '40', 'S100': '41', 'CD19': '42', 'CD45RA': '43', 'SOX9': '44', 'TOX': '45', 'CD20': '46', 'CD68': '47', 'pERK1/2': '48', 'CD3': '49', 'CD36': '50', 'NGFR': '51', 'PDCD1': '52', 'MITF': '53', 'ITGAM': '54', 'GZMB': '55', 'PDL1': '56', 'TCF1/7': '57', 'CD45RO': '58', 'FOXP3': '59', 'ICOS': '60', 'CTNNB1': '61', 'CD8A': '62', 'COL1A1': '63', 'MKI67': '25', 'ITGAX': '64', 'TAS2R63P': '65', 'CD4': '66', 'IDO1': '67', 'SOX10': '68', 'CLEC4C': '69', 'MRC1': '70', 'PARP1': '71', 'DNA1': '72', 'DNA2': '73', 'PTPRC': '74', 'MPO': '75'}
09-13 02:04:29 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Melanoma IMC TME protein - measurement".
09-13 02:06:52 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:06:53 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:06:53 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:06:53 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:06:53 [  DEBUG  ] workflow.common.sparse_matrix_puller:327: Target by symbol: {'IDO1': '67', 'SOX10': '68', 'B2M': '76', 'B7H3': '77', 'CD14': '78', 'CD163': '79', 'CD20': '80', 'CD25': '81', 'CD27': '82', 'CD3': '83', 'CD4': '84', 'CD56': '85', 'CD68': '86', 'CD8': '87', 'DAPI': '88', 'FOXP3': '89', 'KI67': '90', 'LAG3': '91', 'MHCI': '92', 'MHCII': '93', 'MRC1': '94', 'PD1': '95', 'PDL1': '96', 'S100B': '97', 'TGM2': '98', 'TIM3': '99'}
09-13 02:06:53 [  DEBUG  ] workflow.common.sparse_matrix_puller:144: Pulling sparse entries for study "Melanoma intralesional IL2 - measurement".
09-13 02:09:25 [  DEBUG  ] workflow.common.sparse_matrix_puller:219: Received 2578368 entries from DB.
09-13 02:09:26 [  DEBUG  ] workflow.common.sparse_matrix_puller:297: Done parsing 99168 feature vectors from lesion 0_1.
09-13 02:09:26 [  INFO   ] workflow.common.sparse_matrix_puller: 100% finished with pulling sparse entries from the study. (lesion 0_1 ...)
09-13 02:09:26 [  INFO   ] workflow.common.sparse_matrix_puller: Done pulling sparse entries from the study. (1 iterations)
09-13 02:09:26 [  INFO   ] db.feature_matrix_extractor: Done retrieving expression data from database.
09-13 02:09:26 [  INFO   ] db.feature_matrix_extractor: Retrieving polygon centroids from shapefiles in database.
09-13 02:09:41 [  INFO   ] workflow.common.structure_centroids_puller: Done parsing shapefiles for Melanoma intralesional IL2 - measurement. (1 iterations)
09-13 02:09:41 [  INFO   ] db.feature_matrix_extractor: Done retrieving centroids.
09-13 02:09:41 [  INFO   ] db.feature_matrix_extractor: Retrieving phenotypes from database.
09-13 02:09:42 [  INFO   ] db.feature_matrix_extractor: Done retrieving phenotypes.
09-13 02:09:42 [  INFO   ] db.feature_matrix_extractor: Aggregating channel information for one study.
09-13 02:09:42 [  INFO   ] db.feature_matrix_extractor: Done aggregating channel information.
Traceback (most recent call last):
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 57, in <module>
    df_cell, df_label, label_to_result = extract_cggnn_data(
                                         ^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 115, in extract_cggnn_data
    df_cell = _create_cell_df({
                              ^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 116, in <dictcomp>
    specimen: extractor.extract(specimen=specimen)[specimen].dataframe
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 109, in extract
    extraction = self._extract(
                 ^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 142, in _extract
    self._create_channel_information(data_arrays),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 244, in _create_channel_information
    return [
           ^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/db/feature_matrix_extractor.py", line 245, in <listcomp>
    symbols[targets[i]] for i in sorted([int(index) for index in targets.keys()])
    ~~~~~~~^^^^^^^^^^^^
KeyError: '67'

@CarlinLiao
Copy link
Collaborator Author

I made two changes to SparseMatrixPuller to avoid pulling from every study when provided only 1 specimen.

A couple of notes:

  • There's still a lot of redundant computation involved in that a new SparseMatrixPuller is spun up and garbage collected for every single specimen. A strategy where we give SparseMatrixPuller an iterable of specimens instead of a single specimen might be better.
  • I added a new import to SparseMatrixPuller, StudyAccess.
    • The import statement will need to be changed in order to merge with Upload importance scores via cggnn command #204
    • SparseMatrixPuller does stuff internally that could be handled by StudyAccess, but it forms its own custom SQL queries itself instead. Should one method or the other be preferred for consistency, or leave as is?

I'm not planning to make any changes for now since it all Just Works™️, but it might be better for maintainability to so I will if I'm asked to.

@jimmymathews
Copy link
Collaborator

I think this is due to some relatively recently added optimizations to the SparseMatrixPuller and friends system.
This optimization relies on a precomputed database index (relating specimens to cells) to save lots of time on initialization of the application.
This is great for that use case but for single specimens in the current implementation it entails ignoring the "study name" parameter, i.e. if the provided specimen isn't part of a given study it proceeds anyway, looping through all studies. This is a mistake of the implementation.

@CarlinLiao
Copy link
Collaborator Author

One more issue: somewhere in FME or cggnn extract, the histological_structure ID is getting reset per specimen instead of retaining unique IDs. It might be pulling the table identifier instead of the global ID? Need to look into this...

continuous_also: bool = False,
) -> CompressedDataArrays:
if specimen is not None:
study = StudyAccess(self.cursor).get_study_from_specimen(specimen)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work to specialize to just the one study in case a specimen is provided.

@CarlinLiao
Copy link
Collaborator Author

Looks like FME doesn't preserve the histological_structure ID when it forms feature matrices for each specimen.

I'd like to change it for index by HSID. Would that break anything downstream?

def _create_feature_matrices(self,
study: dict[str, dict[str, Any]],
centroid_coordinates: dict[str, Any],
phenotypes: dict[str, PhenotypeCriteria],
channel_information: list[str],
) -> dict[str, MatrixBundle]:
logger.info('Creating feature matrices from binary data arrays and centroids.')
matrices: dict[str, MatrixBundle] = {}
for j, specimen in enumerate(sorted(list(study['data arrays by specimen'].keys()))):
logger.debug('Specimen %s .', specimen)
expressions = study['data arrays by specimen'][specimen]
rows = [
self._create_feature_matrix_row(
centroid_coordinates[specimen][i],
expressions[i],
len(study['target index lookup']),
) for i in range(len(expressions))
]
dataframe = DataFrame(
rows,
columns=['pixel x', 'pixel y'] + [f'C {cs}' for cs in channel_information],
)
for symbol, criteria in phenotypes.items():
dataframe[f'P {symbol}'] = (
dataframe[[f'C {m}' for m in criteria.positive_markers]].all(axis=1) &
~dataframe[[f'C {m}' for m in criteria.negative_markers]].any(axis=1)
).astype(int)
matrices[specimen] = MatrixBundle(dataframe, f'{j}.tsv')

@jimmymathews
Copy link
Collaborator

jimmymathews commented Sep 13, 2023

I seem to remember deliberately omitting this ID from the feature matrix output. Seemed like unnecessary coupling between database internal format and the output data structure.
I can't remember if the index for the dataframe is assumed somewhere to be 0-based integers. I don't think so. So it should be ok to use it.

@CarlinLiao
Copy link
Collaborator Author

I need the histological structure IDs for cell importance scoring because otherwise there's no way to map the CG-GNN output back to the cell it came from. So we're good to reverse this intentional omission? Or add a histological structure ID column instead of indexing on it. That might break even more things, though.

@jimmymathews
Copy link
Collaborator

Yeah, give it a try.

@CarlinLiao
Copy link
Collaborator Author

CarlinLiao commented Sep 13, 2023

Do you happen to know if you dropped the HSID in SparseMatrixPuller, FeatureMatrixExtractor, or somewhere else?

EDIT: I think it's in SparseMatrixPuller... downstream effects likelihood increases.

@jimmymathews
Copy link
Collaborator

The cells are supposed to be sorted by histological structure ID, because of this line:

sparse_entries.sort(key=lambda x: (x[3], x[0]))

(x[3] refers to the specimen ID string, and x[0] refers to the histological structure ID string.)

So you could just pull all the histological structure IDs for a specimen and sort them.

@CarlinLiao
Copy link
Collaborator Author

I'm leveraging this assumption for this but we've already used it elsewhere (e.g., StudyAccess): each specimen string is uniquely identified to a study. In other words, two studies in the same database are not allowed to have specimens with the same name. This could be a problem in the future.

@CarlinLiao
Copy link
Collaborator Author

The new commit assumes that the cells are in the order specified in the Puller and assumes that specimen name strings are uniquely attached to a single study. I have my reservations about doing it this way instead of editing the Puller to retain HSIDs instead of discarding them, but this Just Works™️ so maybe we'll worry about it later.

@jimmymathews
Copy link
Collaborator

Editing the Puller to retain HSIDs as you say is fine with me, just might not be worth the effort at the moment.

All tests pass for me, I'm ready to merge.

@jimmymathews jimmymathews merged commit 970d06d into main Sep 14, 2023
1 check passed
@jimmymathews jimmymathews deleted the extract_fix branch September 14, 2023 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Code change preserving functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants