create cg-gnn extraction script #185
Conversation
```python
df = merge(df_assignments, df_strata, on='stratum identifier', how='left')[
    ['specimen', 'subject diagnosed result']].rename(
    {'specimen': 'slide', 'subject diagnosed result': 'result'}, axis=1)
```
(This code is from what was already implemented in main in `run_sql.py`, just moved to a new file.) Is this how we want to define the labels to classify against? It appears to be a little different from what we discussed over email.
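For context, the merge above can be exercised on toy stand-ins for the two tables; the column names match the snippet, but all values here are invented:

```python
from pandas import DataFrame, merge

# Toy stand-ins for the assignments and strata tables (values are assumptions).
df_assignments = DataFrame({
    'specimen': ['slide1', 'slide2'],
    'stratum identifier': [1, 2],
})
df_strata = DataFrame({
    'stratum identifier': [1, 2],
    'subject diagnosed result': ['Responder', 'Non-responder'],
})

# Same merge/rename as in the snippet above: left-join on the stratum, keep
# only the specimen and result columns, and rename to the extract's names.
df = merge(df_assignments, df_strata, on='stratum identifier', how='left')[
    ['specimen', 'subject diagnosed result']].rename(
    {'specimen': 'slide', 'subject diagnosed result': 'result'}, axis=1)

print(df.columns.tolist())   # ['slide', 'result']
print(df['result'].tolist()) # ['Responder', 'Non-responder']
```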
@jimmymathews Regarding stratification, this implementation uses only "subject diagnosed result" but not "stratum_identifier", "local temporal position indicator", or "subject diagnosed condition".

What do we want the model that `spt cggnn extract` supports to do? In prior iterations, it's been to predict, on pre-treatment slides only, whether the patient the tissue sample is from will respond to treatment. Do we want to continue this, or generalize?
Prediction of treatment response from pre-treatment slides is certainly not the only study design that we support.

Although one cannot, in advance, write a single description of the outcome variable that will always be of greatest interest in any study, the `SampleStratificationCreator` does a pretty good job of automatically creating a best guess at a reasonable granularity, given the facts about the samples available according to the schema.

The result of `SampleStratificationCreator`'s work is the outcome variable that I endorse for the use case of general-purpose downstream analysis. It is defined by the fields named `stratum_identifier` (an integer) and `sample`, in a supplementary table called `sample_strata`.

In this use case, one shouldn't be reading `local_temporal_position_indicator` or `subject_diagnosed_condition` or `subject_diagnosed_result` at all. In this use case, these values are just metadata tags for the strata.
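A toy sketch of the `sample_strata` shape described above; only the column names come from this comment, and every row value is invented:

```python
from pandas import DataFrame

# Invented example rows; only the column names come from the discussion above.
sample_strata = DataFrame({
    'sample': ['s1', 's2', 's3', 's4'],
    'stratum_identifier': [1, 1, 2, 2],
    # Metadata tags describing each stratum; in the general-purpose use case
    # these are not read directly:
    'local_temporal_position_indicator': ['Before intervention'] * 4,
    'subject_diagnosed_condition': ['Melanoma'] * 4,
    'subject_diagnosed_result':
        ['Responder', 'Responder', 'Non-responder', 'Non-responder'],
})

# The endorsed outcome variable is just the sample -> stratum assignment:
outcome = dict(zip(sample_strata['sample'], sample_strata['stratum_identifier']))
print(outcome)  # {'s1': 1, 's2': 1, 's3': 2, 's4': 2}
```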
(`sample` here means the tissue sample, right? I don't see why the `sample` is relevant to the stratification aside from samples being the granularity at which we're stratifying.)

As I envision it, the `spt cggnn extract` command isn't for "general-purpose downstream analysis" but to extract the specific information the user wants to train a model on. (If this isn't the case, we should write a separate command that does so, as I don't think this should be within the scope of the standalone `cg-gnn` package.)

So, if prediction of treatment response isn't the only study design we want to support, it sounds like what we should do is provide the person calling `spt cggnn extract` some way to identify which strata they want to use. For example, in the treatment response use case, they should be given the option of only selecting strata from before treatment. In most use cases, I don't think the user would want every stratum possible.
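One hypothetical shape for such a strata-selection option, sketched on an invented table; the function name and the indicator strings are assumptions, not the schema's actual vocabulary:

```python
from pandas import DataFrame

# Invented strata table; column names follow the earlier discussion.
sample_strata = DataFrame({
    'sample': ['s1', 's2', 's3'],
    'stratum_identifier': [1, 2, 3],
    'local_temporal_position_indicator':
        ['Before intervention', 'Before intervention', 'After intervention'],
})

def select_strata(strata, wanted_strata=None, temporal_position=None):
    """Hypothetical filter: keep only the strata the extract caller asked for."""
    selected = strata
    if temporal_position is not None:
        selected = selected[
            selected['local_temporal_position_indicator'] == temporal_position]
    if wanted_strata is not None:
        selected = selected[selected['stratum_identifier'].isin(wanted_strata)]
    return selected

# Treatment-response use case: pre-treatment strata only.
pre = select_strata(sample_strata, temporal_position='Before intervention')
print(pre['sample'].tolist())  # ['s1', 's2']
```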
Also, `stratum_identifier` being an integer isn't user-interpretable; what's nice about `result` is that it is a descriptive string that translates the integer class to something the user can understand, like "the model predicts that this patient will be a non-responder". I'd like some equivalent of this in any extract.
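A minimal sketch of one such equivalent: carry a class-to-description mapping alongside the integer labels. The label strings and the JSON round trip here are assumptions, not the implemented format:

```python
import json

# Hypothetical mapping from integer stratum/class to a descriptive string.
label_to_result = {1: 'Responder', 2: 'Non-responder'}

# JSON object keys must be strings, so serialize and restore the integer
# keys explicitly.
serialized = json.dumps(
    {str(label): result for label, result in label_to_result.items()})
restored = {int(label): result for label, result in json.loads(serialized).items()}
print(restored[2])  # Non-responder
```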
The whole design constraint for `SampleStratificationCreator` was automation with no user intervention, but that's the opposite of what you're describing.
Would this be built into the `FeatureMatrixExtractor` like the convenience stratifier is, or handled separately?

And to do that we'd have to guess at all the different ways a user would want to slice up and stratify the samples... or just give them every option in the database. At which point, the user may as well write their own SQL queries.

Can't we just support only treatment response for now, and leave the implementation of alternative classifications as ~~an exercise for the reader~~ future work?
I don't think any of our other pre-curated studies have 2 cohorts that are both pre-treatment and distinguished by response.
Maybe 1 other?
So we'd have to think of at least one classification method for each study, and they're all different.
Installed this branch to test its functionality. It errored on this line after testing with "Breast cancer IMC":

Hm, it seems that... Still, it would be much more convenient if...
Yes. In the SQL query workflow of the standalone...

Yes, it is a legacy aspect (like the "Feature Matrix" in the name), reflecting its original purpose of providing a familiar feature matrix representation of the portion of the dataset which is representable this way.

The complex phenotype features would be consistent with the feature matrix conceptualization.
I wrote a test for the extract function, and it errors because SPT is written in Python 3.11 but the Docker image for...

(It also complains about unary operators in these lines, but they're identical to other tests. I assume Bash SyntaxErrors aren't breaking, and are only reported if something else breaks, as above.)
Well, 3.7 is pretty old, released in 2018. According to this, DGL now supports Python 3.11. PyTorch, I believe, also does.
As for the unary operator complaint, it is because your

```sh
$([ $? -gt 0 ] && [ -e "label_to_results.json" ] && [ -e "cells.h5" ] && [ -e "labels.h5" ])
status="$?"
```

... and the error reports correctly (no unary operator complaint).
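A runnable sketch of the pattern under discussion: keep the checks as a plain command list (not inside `$(...)` command substitution), quote `"$?"` so `[` never sees an empty operand (the source of "unary operator expected" errors), and capture the status immediately. The file names mirror the test artifacts mentioned earlier; the `false` is a stand-in for a failing pipeline step.

```shell
touch label_to_results.json cells.h5 labels.h5   # stand-in extraction artifacts
false                                            # simulate a prior failing command, so $? is nonzero
[ "$?" -gt 0 ] && [ -e "label_to_results.json" ] && [ -e "cells.h5" ] && [ -e "labels.h5" ]
status="$?"                                      # capture immediately, before $? is clobbered
rm -f label_to_results.json cells.h5 labels.h5
echo "status=$status"                            # status=0: failure detected and artifacts exist
```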
The new docker image works for me, and the test extraction works. I pushed a proof of concept branch.

Extraction shouldn't run into any issues; it's the run pipeline I'm worried about. Histocartography model training and testing use esoteric features of DGL and PyTorch that break in newer versions of both, and of Python (as of my testing last year, at least). I had to install specific older versions of DGL and PyTorch, not the newest versions. (In fact, I wouldn't be surprised if they were the reason we're only getting zero importance scores.)

Were you able to take another look at the complex phenotype extraction as well as the stratification method?

If the dependency conflicts are persistent, it probably makes sense for us to keep major functionality isolated to separate CLI tool instances, not requiring the same Python runtime for SPT and for cg-gnn.

I guess that's what you had in mind by creating HDF5 files. It seems fine for this extraction-specific code to be in the `spt cggnn` submodule, where it will run best on Python 3.11.

I'm going to test if the latest changes in the histocartography repo fix the dependency issues, but in the event they don't help, yes, we'll need to further isolate cggnn running, either by removing...
What would you like me to look at for complex phenotype extraction and stratification?

```python
import json
from pandas import read_sql
from spatialprofilingtoolbox.db.database_connection import DatabaseConnectionMaker
from spatialprofilingtoolbox.db.exchange_data_formats.metrics import PhenotypeCriteria


class PhenotypeDefinitionPuller:
    @staticmethod
    def _get_phenotype(criteria_dataframe):
        positives, negatives = [
            sorted(map(str, list(criteria_dataframe[
                criteria_dataframe['coded_value'] == coded_value
            ]['channel_symbol'])))
            for coded_value in (1, 0)
        ]
        return PhenotypeCriteria(positive_markers=positives, negative_markers=negatives)

    @classmethod
    def pull(cls, study):
        query = r'''
        SELECT
            cp.name as phenotype_name,
            cs.symbol as channel_symbol,
            CASE WHEN cpc.polarity='positive' THEN 1 ELSE 0 END AS coded_value
        FROM cell_phenotype_criterion cpc
        JOIN cell_phenotype cp ON cp.identifier=cpc.cell_phenotype
        JOIN chemical_species cs ON cs.identifier=cpc.marker
        JOIN study_component sc ON sc.component_study=cpc.study
        WHERE sc.primary_study='%s'
        ;
        '''
        with DatabaseConnectionMaker() as dcm:
            df = read_sql(query % study, dcm.get_connection())
        return {
            key: cls._get_phenotype(group)
            for key, group in df.groupby('phenotype_name')
        }


phenotypes = PhenotypeDefinitionPuller.pull('Breast cancer IMC')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)

phenotypes = PhenotypeDefinitionPuller.pull('Melanoma intralesional IL2')
for key, phenotype_definition in phenotypes.items():
    print('')
    print(key)
    print(phenotype_definition)
```

If you have a local db running, you can run the above as:

```sh
SINGLE_CELL_DATABASE_HOST=localhost SINGLE_CELL_DATABASE_USER=postgres SINGLE_CELL_DATABASE_PASSWORD=postgres python snippet.py
```

The output:

```
Luminal
positive_markers=['ESR1', 'PGR'] negative_markers=[]

Myoepithelial
positive_markers=['ACTA2', 'CK', 'EGFR', 'KRT14', 'MYC'] negative_markers=[]

Proliferative
positive_markers=['MKI67'] negative_markers=[]

Adipocyte or Langerhans cell
positive_markers=['S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

B cell
positive_markers=['CD20'] negative_markers=['CD3', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD163+MHCII+ macrophage
positive_markers=['CD163', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD163+MHCII- macrophage
positive_markers=['CD163'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']

CD4+ T cell
positive_markers=['CD3', 'CD4'] negative_markers=['CD20', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD4+ natural killer T cell
positive_markers=['CD3', 'CD4', 'CD56'] negative_markers=['CD20', 'CD8', 'FOXP3', 'SOX10']

CD4+ regulatory T cell
positive_markers=['CD3', 'CD4', 'FOXP3'] negative_markers=['CD20', 'CD56', 'CD8', 'SOX10']

CD4+/CD8+ T cell
positive_markers=['CD3', 'CD4', 'CD8'] negative_markers=['CD20', 'CD56', 'FOXP3', 'SOX10']

CD68+MHCII+ macrophage
positive_markers=['CD68', 'MHCII'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'SOX10']

CD68+MHCII- macrophage
positive_markers=['CD68'] negative_markers=['CD20', 'CD3', 'CD56', 'CD8', 'FOXP3', 'MHCII', 'SOX10']

CD8+ T cell
positive_markers=['CD3', 'CD8'] negative_markers=['CD20', 'CD4', 'CD56', 'FOXP3', 'SOX10']

CD8+ natural killer T cell
positive_markers=['CD3', 'CD56', 'CD8'] negative_markers=['CD20', 'CD4', 'FOXP3', 'SOX10']

CD8+ regulatory T cell
positive_markers=['CD3', 'CD8', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'SOX10']

Double negative regulatory T cell
positive_markers=['CD3', 'FOXP3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'SOX10']

Natural killer T cell
positive_markers=['CD3', 'CD56'] negative_markers=['CD20', 'CD4', 'CD8', 'FOXP3', 'SOX10']

Natural killer cell
positive_markers=['CD56'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3', 'S100B', 'SOX10']

Nerve
positive_markers=['CD56', 'S100B'] negative_markers=['CD14', 'CD163', 'CD20', 'CD3', 'CD4', 'CD68', 'CD8', 'FOXP3', 'SOX10']

Other macrophage/monocyte CD14+
positive_markers=['CD14'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

Other macrophage/monocyte CD4+
positive_markers=['CD4'] negative_markers=['CD163', 'CD20', 'CD3', 'CD56', 'CD68', 'CD8', 'FOXP3', 'SOX10']

T cell/null phenotype
positive_markers=['CD3'] negative_markers=['CD20', 'CD4', 'CD56', 'CD8', 'FOXP3', 'SOX10']

Tumor
positive_markers=['SOX10'] negative_markers=['CD20', 'CD3', 'CD4', 'CD8', 'FOXP3']
```
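The grouping comprehension in `_get_phenotype` above can be checked in isolation on a toy criteria frame; the rows here are invented (the real ones come from the SQL query):

```python
from pandas import DataFrame

# Invented criteria rows: coded_value 1 marks positive markers, 0 negative.
criteria_dataframe = DataFrame({
    'channel_symbol': ['CD3', 'CD4', 'CD8', 'SOX10'],
    'coded_value': [1, 1, 0, 0],
})

# Same comprehension as in _get_phenotype above.
positives, negatives = [
    sorted(map(str, list(criteria_dataframe[
        criteria_dataframe['coded_value'] == coded_value
    ]['channel_symbol'])))
    for coded_value in (1, 0)
]
print(positives)  # ['CD3', 'CD4']
print(negatives)  # ['CD8', 'SOX10']
```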
I've yet to set up a workflow that gives me a local db to query. Perhaps a tutorial in... Is your suggestion to add in functionality to...?

I think the functionality is appropriate for a new class or functions near...

If the phenotypes are kept in a separate DataFrame, then... Would you like me to try implementing this, or would you rather do it yourself, since you're more familiar with the architecture of the extractor?
Sure, please give it a try. I don't think it requires too much architectural knowledge. (I'm thinking of the primary columns as "channels", as in "imaging channels". The "chemical species" terminology is the pedantic description of the target indicated by the imaging modality. The column name is the protein's gene name, so yeah, this is the name of the "chemical species" indicated, but again, no one else is going to call this the "chemical species".)

I'm not sure we're aligned on what a sufficient amount of architectural knowledge is, but I'll give it a crack.
```diff
@@ -1,18 +1,28 @@
-FROM pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
+FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
```
I've updated `cg-gnn`, and in the original repo it targets CUDA 11.8, but there doesn't appear to be a Docker image for this version yet.
```diff
 def create_and_transcribe_one_sample(
     sample: str,
     df: DataFrame,
-    channel_symbols_by_column_name: dict[str, str],
     feature_uploader: ADIFeaturesUploader,
 ) -> None:
-    for column, symbol in channel_symbols_by_column_name.items():
-        criteria = PhenotypeCriteria(positive_markers=[column], negative_markers=[])
-        value = compute_squidpy_metric_for_one_sample(df, [criteria], 'spatial autocorrelation')
-        if value is None:
-            continue
-        feature_uploader.stage_feature_value((symbol,), sample, value)
+    for column in df.columns:
+        if column.startswith('C '):
+            symbol = column[2:]
+            criteria = PhenotypeCriteria(positive_markers=[symbol], negative_markers=[])
+            value = compute_squidpy_metric_for_one_sample(df, [criteria], 'spatial autocorrelation')
+            if value is None:
+                continue
+            feature_uploader.stage_feature_value((symbol,), sample, value)
```
Both old and new versions of this function create and transcribe spatial autocorrelation only for channels (simple phenotypes). Do we want to do autocorrelation for (complex) phenotypes as well?

(It might require another input parameter, since the columns will have the complex phenotype name but not its signature, so that will have to be fetched from elsewhere.)
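One hypothetical shape for that extra parameter: a name-to-criteria mapping fetched separately, merged with the per-channel single-marker criteria. Everything here (the stand-in dataclass, the function, its names) is an assumption for illustration, not the package's API:

```python
from dataclasses import dataclass, field

# Minimal stand-in for the real PhenotypeCriteria exchange format (assumption).
@dataclass
class PhenotypeCriteria:
    positive_markers: list = field(default_factory=list)
    negative_markers: list = field(default_factory=list)

def autocorrelation_targets(channel_symbols, phenotype_criteria_by_name):
    """Yield (feature name, criteria) pairs: one single-marker criteria per
    channel, plus named complex phenotypes whose signatures were fetched
    elsewhere (hypothetical extra parameter)."""
    for symbol in channel_symbols:
        yield symbol, PhenotypeCriteria(positive_markers=[symbol])
    yield from phenotype_criteria_by_name.items()

targets = dict(autocorrelation_targets(
    ['CD3', 'CD20'],
    {'B cell': PhenotypeCriteria(['CD20'], ['CD3'])},
))
print(sorted(targets))  # ['B cell', 'CD20', 'CD3']
```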
I don't think it's necessary to do this, since the primary use case is the ondemand anyway.
The latest commit should fix the inconsistently minted feature specifications without relaxing the tests. (The tests that can run on my M1 Mac pass, but that leaves out a lot of tests.)
Almost all tests pass now, and merging this PR will quash several open issues. The only exceptions are the proximity and squidpy module tests in apiserver, which run but produce different values than expected. Not totally sure why, but this could be a case of updated computations leading to new values rather than the output being wrong.
The tests that check computed feature values (squidpy and proximity) are failing for the first time at the commit that introduces a new way to compute masks; see `SPT/spatialprofilingtoolbox/workflow/common/squidpy.py`, lines 39 to 47, at commit f3f6178.
Most likely these masks differ. I'll try to find out which one is correct, and if the new one is not correct we can correct it.

The masks in this version were wrong according to one example I checked, and the previously computed masks were correct. I think this was due to the...
The squidpy metrics test now passes.

All tests pass!
```diff
 template = '{0:0%sb}' % number_channels  # pylint: disable=consider-using-f-string
-feature_vector = [int(value) for value in list(template.format(binary)[::-1])]
+feature_vector: list[int] = [int(value) for value in list(template.format(binary)[::-1])]
```
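For reference, the template expands like this on toy numbers (the channel count and bitmask here are invented; `binary` is the mask of positive channels, and the reverse puts the least-significant bit at index 0):

```python
number_channels = 5
template = '{0:0%sb}' % number_channels  # expands to '{0:05b}': zero-padded binary
binary = 6  # bitmask 0b00110: channels 1 and 2 positive
# Format to a fixed-width binary string, then reverse so index i is channel i.
feature_vector = [int(value) for value in list(template.format(binary)[::-1])]
print(feature_vector)  # [0, 1, 1, 0, 0]
# The int -> bool change discussed above would be the same expression with bool:
feature_vector_bool = [bool(int(value)) for value in list(template.format(binary)[::-1])]
print(feature_vector_bool)  # [False, True, True, False, False]
```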
This is the line we would change if we wanted to turn this from ints to bools (not including all the downstream changes we'd need to do...)
I did a pass through each file, looks OK to me. Now that tests pass it is ready to merge.
This creates a `spt cggnn extract` command that pulls information `cg-gnn` needs from an SPT db instance and saves it to two DataFrames and a JSON.