
create cg-gnn extraction script #185

Merged
merged 35 commits into main from cggnn_dfs on Sep 12, 2023
Conversation

CarlinLiao
Collaborator

This creates a spt cggnn extract command that pulls information cg-gnn needs from a SPT db instance and saves it to two DataFrames and a JSON.

@CarlinLiao CarlinLiao self-assigned this Aug 8, 2023
@CarlinLiao CarlinLiao added the enhancement New feature or request label Aug 8, 2023
spatialprofilingtoolbox/cggnn/extract.py (review thread, outdated)
spatialprofilingtoolbox/cggnn/extract.py (review thread)
Comment on lines 35 to 37
df = merge(df_assignments, df_strata, on='stratum identifier', how='left')[
    ['specimen', 'subject diagnosed result']].rename(
    {'specimen': 'slide', 'subject diagnosed result': 'result'}, axis=1)
Collaborator Author

(This code is from what was already implemented in main in run_sql.py, just moved to a new file.) Is this how we want to define the labels to classify against? It appears to be a little different from what we discussed over email.
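To make the labeling logic under discussion concrete, here is a minimal runnable sketch of what the merge above produces; the input values are hypothetical toy data, not real SPT query results.

```python
from pandas import DataFrame, merge

# Hypothetical stand-ins for the real query results.
df_assignments = DataFrame({
    'specimen': ['slide1', 'slide2'],
    'stratum identifier': [1, 2],
})
df_strata = DataFrame({
    'stratum identifier': [1, 2],
    'subject diagnosed result': ['responder', 'non-responder'],
})

# Left-join the per-specimen stratum assignments onto the stratum metadata,
# keep the two label columns, and rename them as the extraction script does.
df = merge(df_assignments, df_strata, on='stratum identifier', how='left')[
    ['specimen', 'subject diagnosed result']].rename(
    {'specimen': 'slide', 'subject diagnosed result': 'result'}, axis=1)
print(df)
```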

Collaborator Author


@jimmymathews Regarding stratification, this implementation uses only "subject diagnosed result" but not "stratum_identifier", "local temporal position indicator", or "subject diagnosed condition".

What do we want the model that spt cggnn extract supports to do? In prior iterations, the task has been to predict, from pre-treatment slides only, whether the patient the tissue sample came from will respond to treatment. Do we want to continue this, or generalize?

Collaborator


Prediction of treatment response from pre-treatment slides is certainly not the only study design that we support.

Although one cannot, in advance, write a single description of the outcome variable that will always be of greatest interest in any study, the SampleStratificationCreator does a pretty good job of automatically creating a best guess at a reasonable granularity, given the facts about the samples available according to the schema.

The result of SampleStratificationCreator's work is the outcome variable that I endorse for the use case of general-purpose downstream analysis.

It is defined by the fields named stratum_identifier (an integer) and sample, in a supplementary table called sample_strata.

In this use case, one shouldn't be reading local_temporal_position_indicator or subject_diagnosed_condition or subject_diagnosed_result at all. In this use case, these values are just metadata tags for the strata.
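A minimal sketch of reading the endorsed outcome variable. The table and column names (sample_strata, sample, stratum_identifier) are taken from the comment above; the in-memory sqlite setup and values are purely illustrative, standing in for the real SPT database.

```python
import sqlite3

# Illustrative stand-in for the real database connection.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE sample_strata (sample TEXT, stratum_identifier INTEGER)')
connection.executemany(
    'INSERT INTO sample_strata VALUES (?, ?)',
    [('slide1', 1), ('slide2', 1), ('slide3', 2)])

# The outcome variable for general-purpose downstream analysis: sample -> stratum.
stratum_by_sample = dict(
    connection.execute('SELECT sample, stratum_identifier FROM sample_strata'))
print(stratum_by_sample)
```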

Collaborator Author


(sample here means the tissue sample, right? I don't see why the sample is relevant to the stratification aside from samples being the granularity at which we're stratifying.)

As I envision it, the spt cggnn extract command isn't for "general-purpose downstream analysis" but to extract the specific information the user wants to train a model on. (If this isn't the case, we should write a separate command that does so, as I don't think this should be within the scope of the standalone cg-gnn package.)

So, if prediction of treatment response isn't the only study design we want to support, it sounds like what we should do is provide the person calling spt cggnn extract some way to identify which strata they want to use. For example, in the treatment response use case, they should be given the option of only selecting strata from before treatment. In most use cases, I don't think the user would want every stratum possible.

Collaborator Author


Also, stratum_identifier being an integer isn't user-interpretable; what's nice about result is that it is a descriptive string that translates the integer class to something the user can understand, like "the model predicts that this patient will be a non-responder". I'd like some equivalent of this in any extract.
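One way to get such an equivalent is to derive the integer class labels from the descriptive result strings while keeping a map back to the readable form; this is a sketch with hypothetical values, not the extract command's actual implementation.

```python
# Hypothetical per-sample result strings.
results_by_sample = {
    'slide1': 'responder',
    'slide2': 'non-responder',
    'slide3': 'responder',
}

# Integer class labels plus a user-interpretable translation table.
label_to_result = dict(enumerate(sorted(set(results_by_sample.values()))))
result_to_label = {result: label for label, result in label_to_result.items()}

labels_by_sample = {
    sample: result_to_label[result]
    for sample, result in results_by_sample.items()
}
print(label_to_result)
```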

Collaborator


The whole design constraint for SampleStratificationCreator was automation with no user intervention, but that's the opposite of what you're describing.

Collaborator Author


Would this be built into the FeatureMatrixExtractor like the convenience stratifier is, or handled separately?

And to do that we'd have to guess at all the different ways a user would want to slice up and stratify the samples... or just give them every option in the database. At which point, the user may as well write their own SQL queries.

Surely we can just support treatment response only for now, and leave the implementation of alternative classifications as ~~an exercise for the reader~~ future work?

Collaborator


I don't think any of our other pre-curated studies have 2 cohorts that are both pre-treatment and distinguished by response.

Collaborator


Maybe 1 other?

Collaborator Author


So we'd have to think of at least one classification method for each study, and they're all different.

spatialprofilingtoolbox/cggnn/scripts/run.py (review thread, outdated)
@CarlinLiao
Collaborator Author

Installed this branch to test its functionality. It errored on this line after testing with "Breast cancer IMC":

Traceback (most recent call last):
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/scripts/extract.py", line 39, in <module>
    df_cell, df_label, label_to_result = extract_cggnn_data(args.spt_db_config_location, args.study)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cliao/miniconda3/envs/spt_use/lib/python3.11/site-packages/spatialprofilingtoolbox/cggnn/extract.py", line 50, in extract_cggnn_data
    {slide: data['dataframe'] for slide, data in study_data['feature matrices'].items()},
                                                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'feature matrices'

study_data is extractor.extract(study=study) so I don't see why it would break here.

@jimmymathews
Collaborator

Hm it seems that FeatureMatrixExtractor is pedantically interested in the specimen_measurement_study names only. This is related to the fact that it does not bother to look for composite phenotypes, since these are part of the data analysis study phase, not the data measurement phase of the study.

Still, it would be much more convenient if FeatureMatrixExtractor took responsibility for looking up the measurement study name from the overall study handle string. In the current implementation, you have to look up the measurement study name yourself (most likely "Breast cancer IMC - measurement"?).

@CarlinLiao
Collaborator Author

Yes. In the SQL query workflow of the standalone cg-gnn pip package it queries the measurement study for the targets and shape strings, the analysis study for phenotypes, and the specimen collection study for the stratification. The FeatureMatrixExtractor has elements sourced from all three (sub)studies, so it's interesting that they aren't wholly reflected in the current implementation.

@jimmymathews
Collaborator

Yes, it is a legacy aspect (like the "Feature Matrix" in the name), reflecting its original purpose of providing a familiar feature matrix representation of the portion of the dataset which is representable this way.

@CarlinLiao
Collaborator Author

The complex phenotype features would be consistent with the feature matrix conceptualization.

@CarlinLiao
Collaborator Author

CarlinLiao commented Aug 15, 2023

I wrote a test for the extract function and it errors because SPT is written in Python 3.11 but the Docker image for spt cggnn is Python 3.7 for DGL and PyTorch compatibility. This causes the test to error on newer syntax, like the match statement (3.10+) in FeatureMatrixExtractor. How should we proceed?

(It also complains about unary operators in these lines, but they're identical to lines in other tests, so I assume these Bash syntax errors aren't actually breaking anything and are only reported when something else fails, as above.)

[ $status -eq 0 ] || echo "cggnn extract failed."

if [ $status -eq 0 ];

@jimmymathews
Collaborator

Well 3.7 is pretty old, released 2018. According to this, DGL now supports python 3.11. PyTorch I believe also does.
I'm trying a newer base image than pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime, and it seems to be building OK.
I had to rebuild the cg-gnn wheel since it is built with a declared requirement of python ~3.9.

@jimmymathews
Collaborator

As for the unary operator complaint, it is because your status=... line does not return anything, status is empty. I changed to:

$([ $? -gt 0 ] && [ -e "label_to_results.json" ] && [ -e "cells.h5" ] && [ -e "labels.h5" ])
status="$?"
...

and the error reports correctly (no unary operator complaint).
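A self-contained reproduction of the unary operator complaint, for reference; the filenames are omitted since they don't matter to the failure mode.

```shell
# When status is empty, [ $status -eq 0 ] expands to [ -eq 0 ],
# a malformed test expression ("unary operator expected").
status=""
[ $status -eq 0 ] 2>/dev/null || echo "malformed test when status is empty"

# Capturing the exit code of the previous command avoids the problem:
false
status="$?"
[ "$status" -eq 0 ] || echo "command failed with status $status"
```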

@jimmymathews
Collaborator

jimmymathews commented Aug 15, 2023

The new docker image works for me, the test extraction works. I pushed a proof of concept to branch cggnn_dfs3.11. It relies on a locally-built cg-gnn wheel that removes the python ~3.9 constraint.

@CarlinLiao
Collaborator Author

CarlinLiao commented Aug 15, 2023

Extraction shouldn't run into any issues; it's the run pipeline I'm worried about. Histocartography model training and testing use esoteric features of DGL and PyTorch that break in newer versions of both, and of Python (as of my testing last year, at least). I had to install specific older versions of DGL and PyTorch, not the newest. (In fact, I wouldn't be surprised if they were the reason we're only getting zero importance scores.)

@CarlinLiao
Collaborator Author

Were you able to take another look at the complex phenotype extraction as well as the stratification method?

@jimmymathews
Collaborator

If the dependency conflicts are persistent, it probably makes sense for us to keep major functionality isolated to separate CLI tool instances, not requiring the same python runtime for SPT and for cg-gnn.

@jimmymathews
Collaborator

I guess that's what you had in mind by creating HDF5 files. It seems fine for this extraction-specific code to be in the spt cggnn submodule, where it will run best on Python 3.11.

@CarlinLiao
Collaborator Author

I'm going to test if the latest changes in the histocartography repo fix the dependency issues, but in the event they don't help, yes we'll need to further isolate cggnn running, either by removing spt cggnn run entirely or by having it call a separate container instance that has Python 3.9 and the cg-gnn pip package.

@jimmymathews
Collaborator

What would you like me to look at for complex phenotype extraction and stratification?
As for phenotype extraction, it seems that what is needed is a classifier utility that pulls out the signatures and assesses each cell in the cell dataframe for membership in the class associated with each given signature.
That should be pretty easy. This snippet will get you started with pulling phenotypes:

import json

from pandas import read_sql

from spatialprofilingtoolbox.db.database_connection import DatabaseConnectionMaker
from spatialprofilingtoolbox.db.exchange_data_formats.metrics import PhenotypeCriteria

class PhenotypeDefinitionPuller:
    @staticmethod
    def _get_phenotype(criteria_dataframe):
        positives, negatives = [
            sorted(map(str, list(criteria_dataframe[
                criteria_dataframe['coded_value']==coded_value
            ]['channel_symbol'])))
            for coded_value in (1, 0)
        ]
        return PhenotypeCriteria(positive_markers=positives, negative_markers=negatives)

    @classmethod
    def pull(cls, study):
        query = r'''
        SELECT
            cp.name as phenotype_name,
            cs.symbol as channel_symbol,
            CASE WHEN cpc.polarity='positive' THEN 1 ELSE 0 END AS coded_value
        FROM cell_phenotype_criterion cpc
        JOIN cell_phenotype cp ON cp.identifier=cpc.cell_phenotype
        JOIN chemical_species cs ON cs.identifier=cpc.marker
        JOIN study_component sc ON sc.component_study=cpc.study
        WHERE sc.primary_study='%s'
        ;
        '''
        with DatabaseConnectionMaker() as dcm:
            df = read_sql(query % study, dcm.get_connection())
        return {
            key: cls._get_phenotype(group)
            for key, group in df.groupby('phenotype_name')
        }

phenotypes = PhenotypeDefinitionPuller.pull('Breast cancer IMC')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)

phenotypes = PhenotypeDefinitionPuller.pull('Melanoma intralesional IL2')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)

If you have a local db running, you can run the above as

SINGLE_CELL_DATABASE_HOST=localhost SINGLE_CELL_DATABASE_USER=postgres SINGLE_CELL_DATABASE_PASSWORD=postgres python snippet.py

The output:

Luminal
positive_markers=['ESR1', 'PGR'] negative_markers=[]

Myoepithelial
positive_markers=['ACTA2', 'CK', 'EGFR', 'KRT14', 'MYC'] negative_markers=[]

Proliferative
positive_markers=['MKI67'] negative_markers=[]



Adipocyte or Langerhans cell
positive_markers=['S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

B cell
positive_markers=['CD20'] negative_markers=['CD3', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD163+MHCII+ macrophage
positive_markers=['CD163', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD163+MHCII- macrophage
positive_markers=['CD163'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']

CD4+ T cell
positive_markers=['CD3', 'CD4'] negative_markers=['CD20', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD4+ natural killer T cell
positive_markers=['CD3', 'CD4', 'CD56'] negative_markers=['CD20', 'CD8', 'FOXP3', 'SOX10']

CD4+ regulatory T cell
positive_markers=['CD3', 'CD4', 'FOXP3'] negative_markers=['CD20', 'CD56', 'CD8', 'SOX10']

CD4+/CD8+ T cell
positive_markers=['CD3', 'CD4', 'CD8'] negative_markers=['CD20', 'CD56', 'FOXP3', 'SOX10']

CD68+MHCII+ macrophage
positive_markers=['CD68', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD68+MHCII- macrophage
positive_markers=['CD68'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']

CD8+ T cell
positive_markers=['CD3', 'CD8'] negative_markers=['CD20', 'CD4', 'CD56', 'FOXP3', 'SOX10']

CD8+ natural killer T cell
positive_markers=['CD3', 'CD56', 'CD8'] negative_markers=['CD20', 'CD4', 'FOXP3', 'SOX10']

CD8+ regulatory T cell
positive_markers=['CD3', 'CD8', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'SOX10']

Double negative regulatory T cell
positive_markers=['CD3', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'SOX10']

Natural killer T cell
positive_markers=['CD3', 'CD56'] negative_markers=['CD20', 'CD4', 'CD8', 'FOXP3', 'SOX10']

Natural killer cell
positive_markers=['CD56'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3', 'S100B', 'SOX10']

Nerve
positive_markers=['CD56', 'S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD68', 'CD8', 'FOXP3', 'SOX10']

Other macrophage/monocyte CD14+
positive_markers=['CD14'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

Other macrophage/monocyte CD4+
positive_markers=['CD4'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

T cell/null phenotype
positive_markers=['CD3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']

Tumor
positive_markers=['SOX10'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3']

@CarlinLiao
Collaborator Author

CarlinLiao commented Aug 15, 2023

I've yet to set up a workflow that gives me a local db to query. Perhaps a tutorial in docs/ would be good for that.

Is your suggestion to add in functionality to spt cggnn extract so that it generates complex phenotype features itself (thus the snippet example), or to extend FeatureMatrixExtractor so that it extracts complex phenotypes in addition to simple ones?

@jimmymathews
Collaborator

I think the functionality is appropriate for a new class or functions near FeatureMatrixExtractor.
If you want deep integration with FeatureMatrixExtractor, such that the cell dataframe has new columns for each phenotype, you'll have to make some decision for naming these new columns. You could just use the long names as in the snippet, but as you have pointed out above this hides the distinction between the original-channel-associated columns and the phenotype columns. (Note that "simple" and "complex" phenotype is personal terminology of my own that will be confusing to most others. "Phenotype" refers to the latter concept normally.)
I would probably keep the phenotype columns in a separate dataframe to avoid that problem.
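A sketch of that separate-dataframe approach, under stated assumptions: the channel columns are already boolean, and plain (positives, negatives) tuples stand in for PhenotypeCriteria; the channel and phenotype names here are illustrative, not the real study's.

```python
from pandas import DataFrame

# Hypothetical boolean channel matrix (one row per cell).
df_channels = DataFrame({
    'CD3': [True, True, False],
    'CD8': [True, False, False],
    'CD20': [False, False, True],
})

# Hypothetical signatures: (positive markers, negative markers).
signatures = {
    'CD8+ T cell': (['CD3', 'CD8'], ['CD20']),
    'B cell': (['CD20'], ['CD3']),
}

# One membership column per phenotype, kept apart from the channel columns.
df_phenotypes = DataFrame({
    name: (df_channels[positives].all(axis=1)
           & (~df_channels[negatives]).all(axis=1))
    for name, (positives, negatives) in signatures.items()
})
print(df_phenotypes)
```

Keeping this as a second dataframe in the result dict preserves the distinction between channel columns and phenotype columns without inventing a column-naming convention.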

@CarlinLiao
Collaborator Author

If the phenotypes are kept in a separate DataFrame, then FeatureMatrixExtractor would just return another DataFrame in its result dictionary, right? That's simple to understand. spt cggnn extract can handle merging the chemical species(?) and phenotypes together in its own logic.

Would you like me to try implementing this, or would you rather do it yourself since you're more familiar with the architecture of the extractor?

@jimmymathews
Collaborator

Sure, please give it a try. I don't think it requires too much architectural knowledge.
And yeah I was imagining another dataframe in the dict, beside the existing one.

(I'm thinking of the primary columns as "channels" as in "imaging channels". The "chemical species" terminology is the pedantic description of the target indicated by the imaging modality. The column name is the protein's gene name, so yeah this is the name of the "chemical species" indicated, but again no one else is going to call this the "chemical species".)

@CarlinLiao
Collaborator Author

I'm not sure we're aligned on what a sufficient amount of architectural knowledge is, but I'll give it a crack.

@@ -1,18 +1,28 @@
-FROM pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
+FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
Collaborator Author


I've updated cg-gnn and in the original repo it targets cuda 11.8, but there doesn't appear to be a Docker image for this version yet.

@CarlinLiao
Collaborator Author

Fixing test_proximity_pipeline was as simple as fixing some regex, but now it's failing in the same feature specification area as test_autocomputed_squidpy. Hopefully that makes it easier for me to debug, since any squidpy tests will auto-fail on my machine (squidpy can't install on my MacBook).

Comment on lines 43 to +55
 def create_and_transcribe_one_sample(
     sample: str,
     df: DataFrame,
     channel_symbols_by_column_name: dict[str, str],
     feature_uploader: ADIFeaturesUploader,
 ) -> None:
-    for column, symbol in channel_symbols_by_column_name.items():
-        criteria = PhenotypeCriteria(positive_markers=[column], negative_markers=[])
-        value = compute_squidpy_metric_for_one_sample(df, [criteria], 'spatial autocorrelation')
-        if value is None:
-            continue
-        feature_uploader.stage_feature_value((symbol,), sample, value)
+    for column in df.columns:
+        if column.startswith('C '):
+            symbol = column[2:]
+            criteria = PhenotypeCriteria(positive_markers=[symbol], negative_markers=[])
+            value = compute_squidpy_metric_for_one_sample(df, [criteria], 'spatial autocorrelation')
+            if value is None:
+                continue
+            feature_uploader.stage_feature_value((symbol,), sample, value)
Collaborator Author


Both old and new versions of this function create and transcribe spatial autocorrelation only for channels (simple phenotypes). Do we want to do autocorrelation for (complex) phenotypes as well?

(It might require another input parameter, since the columns will have the complex phenotype name but not its signature, so the signature will have to be fetched from elsewhere.)

Collaborator


I don't think it's necessary to do this, since the primary use case is the ondemand anyway.

@CarlinLiao
Collaborator Author

The latest commit should fix the inconsistently minted feature specifications without relaxing the tests. (The tests that can run on my M1 Mac pass, but that leaves out a lot of tests.)

@CarlinLiao CarlinLiao linked an issue Sep 7, 2023 that may be closed by this pull request
@CarlinLiao
Collaborator Author

Almost all tests pass now, and merging this PR will quash several open issues. The only exceptions are the proximity and squidpy module tests in apiserver, which run but produce different values than expected. I'm not totally sure why, but this could be a case of updated computations leading to new values rather than the output being wrong.

@jimmymathews
Collaborator

The tests that check computed feature values (squidpy and proximity) are failing for the first time at the commit that introduces a new way to compute masks, see these lines:

df_cell = df_cell.rename({
    column: (column[2:] if (column.startswith('C ') or column.startswith('P ')) else column)
    for column in cells.columns
}, axis=1)
masks: list[Series[bool]] = [
    (df_cell[signature.positive_markers].all(axis=1) &
     (~df_cell[signature.negative_markers]).all(axis=1))
    for signature in phenotypes
]

Most likely these masks differ. I'll try to find out which one is correct, and if the new one is not correct we can correct it.

@jimmymathews
Collaborator

The masks in this version were wrong according to one example I checked, and the previously computed masks were correct. I think this was due to the .all pandas function being applied to 0 and 1 values. I converted to bool in the previous commit, and the mask in the example I checked is now correct.
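The failure mode is easy to demonstrate in isolation: bitwise negation on a 0/1 integer column does not behave like boolean negation, so the negative-marker mask is wrong until the columns are converted to bool. A minimal illustration with a made-up column:

```python
from pandas import DataFrame

# A 0/1 integer column, as the cell dataframe originally had.
df_int = DataFrame({'CD20': [0, 1]})

# Bitwise NOT of 0 and 1 gives -1 and -2: both nonzero, so both rows
# count as "negative-marker absent" under .all(), which is wrong.
int_negation = list(~df_int['CD20'])
print(int_negation)  # [-1, -2]

# After converting to bool, ~ is the intended logical negation.
bool_negation = list(~df_int.astype(bool)['CD20'])
print(bool_negation)  # [True, False]
```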

@jimmymathews
Collaborator

The squidpy metrics test now passes.
Working on proximity metrics test.

@jimmymathews
Collaborator

All tests pass!

 template = '{0:0%sb}' % number_channels  # pylint: disable=consider-using-f-string
-feature_vector = [int(value) for value in list(template.format(binary)[::-1])]
+feature_vector: list[int] = [int(value) for value in list(template.format(binary)[::-1])]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the line we would change if we wanted to turn this from ints to bools (not including all the downstream changes we'd need to do...)
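For context, a worked example of what this line decodes (the channel count and value are illustrative): the integer binary encodes per-channel positivity, and the template unpacks it into a least-significant-bit-first vector of 0/1 ints.

```python
number_channels = 4
# Zero-padded binary format template, e.g. '{0:04b}' for four channels.
template = '{0:0%sb}' % number_channels  # pylint: disable=consider-using-f-string

binary = 5  # bits 0 and 2 set: channels 0 and 2 are positive

# '{0:04b}'.format(5) -> '0101'; reversing puts bit 0 first.
feature_vector = [int(value) for value in list(template.format(binary)[::-1])]
print(feature_vector)  # [1, 0, 1, 0]
```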

Collaborator

@jimmymathews left a comment


I did a pass through each file, looks OK to me. Now that tests pass it is ready to merge.

@CarlinLiao CarlinLiao merged commit dda3cbd into main Sep 12, 2023
1 check passed
@CarlinLiao CarlinLiao deleted the cggnn_dfs branch September 12, 2023 20:33