
create cg-gnn extraction script #185

Merged · 35 commits · Sep 12, 2023

Commits
0b20fdf: created cg-gnn extraction script (CarlinLiao, Aug 8, 2023)
ae2170d: added a quick test for cggnn extract (CarlinLiao, Aug 9, 2023)
ad8e61b: hotfix importance saving (CarlinLiao, Aug 11, 2023)
c002b6d: Use updated pytorch image for Python 3.11 support, cggnn extract test… (jimmymathews, Aug 15, 2023)
ccdf771: Use python3.11 directly in test case usage of pip, now that python3.1… (jimmymathews, Aug 16, 2023)
c9aab1b: Same fix applied to second test. (jimmymathews, Aug 16, 2023)
61e855e: undo cggnn docker change (CarlinLiao, Aug 16, 2023)
fed65fc: Merge squidpy changes into cggnn_dfs (CarlinLiao, Aug 16, 2023)
3c1827b: Merge remote-tracking branch 'origin/main' into cggnn_dfs (CarlinLiao, Aug 16, 2023)
bdda7d7: added pheno, removed multistudy to FME (CarlinLiao, Aug 21, 2023)
bce56e8: Merge branch 'main' into cggnn_dfs (CarlinLiao, Aug 21, 2023)
0c2c658: logic hotfix (CarlinLiao, Aug 21, 2023)
3269864: fix study-substudy references (CarlinLiao, Aug 21, 2023)
aa012c7: fix test to account for phenotype columns (CarlinLiao, Aug 21, 2023)
30e8ef1: make phenotype dict key consistent (CarlinLiao, Aug 21, 2023)
f5123d6: Update usage of extractor to omit study reference. (jimmymathews, Aug 22, 2023)
91be9ba: remove phenotypes from continuous dataframes (CarlinLiao, Aug 22, 2023)
70008eb: fix pheno neg expression match (CarlinLiao, Aug 24, 2023)
7f14832: split extract stratification, symbols in col names (CarlinLiao, Aug 24, 2023)
61c53b5: a little formatting on FME (CarlinLiao, Aug 24, 2023)
d10fdd8: explore classes for cggnn extraction (CarlinLiao, Aug 24, 2023)
94b504b: fix FME test (CarlinLiao, Aug 25, 2023)
96634c9: Make test more diagnosable. (jimmymathews, Aug 25, 2023)
c26eeb4: cggnn extract clarity refactors (CarlinLiao, Sep 5, 2023)
f3f6178: update providers for new feature column names (CarlinLiao, Sep 6, 2023)
aced7a1: add cg-gnn to toml checking (CarlinLiao, Sep 6, 2023)
434f474: adjust squidpy clustering (CarlinLiao, Sep 7, 2023)
702d631: actually handling this without a try except is better (CarlinLiao, Sep 7, 2023)
d2eb9c1: any typo (CarlinLiao, Sep 7, 2023)
dac2cf9: handle malformed squidpy returns more gracefully (CarlinLiao, Sep 7, 2023)
9779a78: cggnn extract docstring (CarlinLiao, Sep 8, 2023)
db8d2e9: Merge branch 'main' into cggnn_dfs (CarlinLiao, Sep 8, 2023)
f4a0f24: Change dataframe to bool values to permit "all" & "all" syntax. (jimmymathews, Sep 12, 2023)
bb67799: Fix accidentally booleanization of pixel position columns. (jimmymathews, Sep 12, 2023)
5dfc001: Make operator order precedence explicit, booleanization and negation. (jimmymathews, Sep 12, 2023)
2 changes: 1 addition & 1 deletion build/build_scripts/check_dockerfiles_consistency.py
@@ -10,7 +10,7 @@ def load_dockerfile(submodule):

def check_exists(dependency, dockerfile):
for line in dockerfile.split('\n'):
- if re.search(f'RUN python3? -m pip install "?{dependency}"?$', line):
+ if re.search(f'RUN python3?(.11)? -m pip install "?{dependency}"?$', line):
return True
if re.search(dependency, line):
print(f'Dependency "{dependency}" is mentioned in Dockerfile, but something isn\'t quite right with installation command.')
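The broadened pattern can be checked against representative Dockerfile lines. A minimal sketch, assuming a dependency name and install lines that are illustrative only:

```python
import re

dependency = 'cg-gnn'
pattern = f'RUN python3?(.11)? -m pip install "?{dependency}"?$'

# The optional '3?', '(.11)?', and '"?' groups make every variant of the
# install command match, including the new python3.11 form.
lines = [
    'RUN python -m pip install cg-gnn',
    'RUN python3 -m pip install cg-gnn',
    'RUN python3.11 -m pip install cg-gnn',
    'RUN python3.11 -m pip install "cg-gnn"',
]
matches = [bool(re.search(pattern, line)) for line in lines]
```

Note that `(.11)?` uses an unescaped dot, so it matches any character before `11`; that is faithful to the pattern in the diff, which is loose but sufficient for this consistency check.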
2 changes: 1 addition & 1 deletion build/build_scripts/create_pyproject.py
@@ -2,7 +2,7 @@
import toml

def validate_dependencies_all(project):
- modules = ['apiserver', 'db', 'ondemand', 'workflow']
+ modules = ['apiserver', 'cggnn', 'db', 'ondemand', 'workflow']
dependencies = set()
for module in modules:
dependencies = dependencies.union(set(project['project']['optional-dependencies'][module]))
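The union-and-compare logic in validate_dependencies_all can be sketched on toy data; the dependency lists below are hypothetical stand-ins, not the real pyproject contents:

```python
# Toy stand-in for the parsed pyproject.toml (hypothetical dependency lists).
project = {'project': {'optional-dependencies': {
    'apiserver': ['pydantic==2.0.2'],
    'cggnn': ['cg-gnn'],
    'db': ['pandas==2.0.2'],
    'ondemand': ['squidpy==1.3.0'],
    'workflow': ['pandas==2.0.2', 'matplotlib==3.7.1'],
    'all': ['cg-gnn', 'matplotlib==3.7.1', 'pandas==2.0.2',
            'pydantic==2.0.2', 'squidpy==1.3.0'],
}}}

# Same union as validate_dependencies_all: with 'cggnn' added to the module
# list, its dependencies must also appear in the "all" extra.
modules = ['apiserver', 'cggnn', 'db', 'ondemand', 'workflow']
dependencies = set()
for module in modules:
    dependencies = dependencies.union(set(project['project']['optional-dependencies'][module]))
consistent = dependencies == set(project['project']['optional-dependencies']['all'])
```

This is why the corresponding `all` list in pyproject.toml.unversioned also gains a `"cg-gnn"` entry in this PR.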
26 changes: 17 additions & 9 deletions build/cggnn/Dockerfile
@@ -1,18 +1,26 @@
- FROM pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
+ FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
Review comment (CarlinLiao, author): I've updated cg-gnn and in the original repo it targets CUDA 11.8, but there doesn't appear to be a Docker image for this version yet.

ENV DEBIAN_FRONTEND=noninteractive
- RUN apt update && apt install -y gcc libpq-dev && rm -rf /var/lib/apt/lists/*
+ RUN apt update && apt install -y gcc libpq-dev
WORKDIR /usr/src/app
- RUN python -m pip install dgl-cu116 dglgo -f https://data.dgl.ai/wheels/repo.html
- RUN python -m pip install psycopg2==2.9.6
- RUN python -m pip install adiscstudies==0.11.0
- RUN python -m pip install numba==0.57.0
- RUN python -m pip install attrs==23.1.0
- RUN python -m pip install cg-gnn
+ RUN apt install software-properties-common -y
+ RUN add-apt-repository ppa:deadsnakes/ppa
+ RUN apt update
+ RUN apt install python3.11 -y
+ RUN apt install python3.11-dev -y
+ RUN apt install python3.11-venv -y
+ RUN apt install python3.11-distutils -y
+ RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11 && python3.11 -m ensurepip
+ RUN python3.11 -m pip install dgl-cu117 dglgo -f https://data.dgl.ai/wheels/repo.html
+ RUN python3.11 -m pip install psycopg2==2.9.6
+ RUN python3.11 -m pip install adiscstudies==0.11.0
+ RUN python3.11 -m pip install numba==0.57.0
+ RUN python3.11 -m pip install attrs==23.1.0
+ RUN python3.11 -m pip install cg-gnn
ARG version
ARG service_name
ARG WHEEL_FILENAME
LABEL version=$version
LABEL service_name=$service_name
ENV service_name $service_name
COPY $WHEEL_FILENAME ./
- RUN python -m pip install "$WHEEL_FILENAME"
+ RUN python3.11 -m pip install "$WHEEL_FILENAME"
7 changes: 4 additions & 3 deletions pyproject.toml.unversioned
@@ -41,6 +41,9 @@ apiserver = [
"Pillow==9.5.0",
"pydantic==2.0.2"
]
+ cggnn = [
+     "cg-gnn"
+ ]
db = [
"pandas==2.0.2",
"pyshp==2.2.0",
@@ -55,9 +58,6 @@ ondemand = [
"pydantic==2.0.2",
"squidpy==1.3.0"
]
- cggnn = [
-     "cg-gnn"
- ]
workflow = [
"matplotlib==3.7.1",
"umap-learn==0.5.3",
@@ -71,6 +71,7 @@
"Pillow==9.5.0"
]
all = [
"cg-gnn",
"matplotlib==3.7.1",
"umap-learn==0.5.3",
"uvicorn>=0.15.0,<0.16.0",
2 changes: 2 additions & 0 deletions spatialprofilingtoolbox/cggnn/__init__.py
@@ -1,2 +1,4 @@
"""Cell-graph graph neural network functionality."""
__version__ = '0.2.1'
+
+ from spatialprofilingtoolbox.cggnn.extract import extract_cggnn_data
112 changes: 112 additions & 0 deletions spatialprofilingtoolbox/cggnn/extract.py
@@ -0,0 +1,112 @@
"""Extract information cg-gnn needs from SPT."""

from pandas import DataFrame, concat, merge # type: ignore
from numpy import sort # type: ignore

from spatialprofilingtoolbox.db.feature_matrix_extractor import FeatureMatrixExtractor


def _create_cell_df(dfs_by_specimen: dict[str, DataFrame]) -> DataFrame:
    """Merge simple and complex phenotype expressions, and cell locations, into one DataFrame."""
for specimen, df_specimen in dfs_by_specimen.items():
df_specimen['specimen'] = specimen

df = concat(dfs_by_specimen.values(), axis=0)
df.index.name = 'histological_structure'
# Reorder columns so it's specimen, xy, channels, and phenotypes
column_order = ['specimen', 'pixel x', 'pixel y']
column_order.extend(df.columns[df.columns.str.startswith('C ')])
column_order.extend(df.columns[df.columns.str.startswith('P ')])
return df[column_order]
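The column reordering above can be seen on a toy frame; the column names beyond the fixed three, and all values, are hypothetical:

```python
from pandas import DataFrame

# Toy cell table with columns in arbitrary order (hypothetical data).
df = DataFrame({
    'C CD3': [1, 0],
    'pixel x': [10.0, 20.0],
    'P T cell': [1, 0],
    'pixel y': [5.0, 6.0],
    'specimen': ['lesion 1', 'lesion 1'],
})

# Same rule as _create_cell_df: specimen first, then positions, then
# channel ('C ') columns, then phenotype ('P ') columns.
column_order = ['specimen', 'pixel x', 'pixel y']
column_order.extend(df.columns[df.columns.str.startswith('C ')])
column_order.extend(df.columns[df.columns.str.startswith('P ')])
df = df[column_order]
```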


def _create_label_df(
df_assignments: DataFrame,
df_strata: DataFrame,
strata_to_use: list[int] | None,
) -> tuple[DataFrame, dict[int, str]]:
"""Get slide-level results."""
df_assignments = df_assignments.set_index('specimen')
df_strata = df_strata.set_index('stratum identifier')
df_strata = _filter_for_strata(strata_to_use, df_strata)
df_strata = _drop_unneeded_columns(df_strata)
df_strata = _compress_df(df_strata)
return _label(df_assignments, df_strata)


def _filter_for_strata(strata_to_use: list[int] | None, df_strata: DataFrame) -> DataFrame:
if strata_to_use is not None:
df_strata = df_strata.loc[sorted(strata_to_use)]
if df_strata.shape[0] < 2:
raise ValueError(f'Need at least 2 strata to classify, there are {df_strata.shape[0]}.')
return df_strata


def _drop_unneeded_columns(df_strata: DataFrame) -> DataFrame:
    """Drop columns whose contents are identical across all rows."""
for col in df_strata.columns.tolist():
if df_strata[col].nunique() == 1:
df_strata = df_strata.drop(col, axis=1)
return df_strata


def _compress_df(df_strata: DataFrame) -> DataFrame:
    """Compress remaining columns into a single string."""
df_strata['label'] = '(' + df_strata.iloc[:, 0].astype(str)
for i in range(1, df_strata.shape[1]):
df_strata['label'] += df_strata.iloc[:, i].astype(str)
df_strata['label'] += ')'
df_strata = df_strata[['label']]
return df_strata


def _label(df_assignments: DataFrame, df_strata: DataFrame) -> tuple[DataFrame, dict[int, str]]:
"""Merge with specimen assignments, keeping only selected strata."""
df = merge(df_assignments, df_strata, on='stratum identifier', how='inner')[['label']]
label_to_result = dict(enumerate(sort(df['label'].unique())))
return df.replace({res: i for i, res in label_to_result.items()}), label_to_result


def extract_cggnn_data(
spt_db_config_location: str,
study: str,
strata_to_use: list[int] | None,
) -> tuple[DataFrame, DataFrame, dict[int, str]]:
"""Extract information cg-gnn needs from SPT.

Parameters
----------
spt_db_config_location : str
Location of the SPT DB config file.
study : str
Name of the study to query data for.
strata_to_use : list[int] | None
Specimen strata to use as labels, identified by the "stratum identifier" reported by
`explore_classes`. If None, all strata are used.

Returns
-------
df_cell: DataFrame
Rows are individual cells, indexed by an integer ID.
Column or column groups are, named and in order:
1. The 'specimen' the cell is from
2. Cell centroid positions 'pixel x' and 'pixel y'
3. Channel expressions starting with 'C ' and followed by human-readable symbol text
4. Phenotype expressions starting with 'P ' followed by human-readable symbol text
df_label: DataFrame
Rows are specimens, the sole column 'label' is its class label as an integer.
label_to_result_text: dict[int, str]
Mapping from class integer label to human-interpretable result text.
"""
extractor = FeatureMatrixExtractor(database_config_file=spt_db_config_location)
df_cell = _create_cell_df({
slide: data.dataframe for slide, data in extractor.extract(study=study).items()
})
cohorts = extractor.extract_cohorts(study)
df_label, label_to_result_text = _create_label_df(
cohorts['assignments'],
cohorts['strata'],
strata_to_use,
)
return df_cell, df_label, label_to_result_text
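The label encoding performed by _label, where sorted unique label strings are mapped to integer codes, can be sketched on toy data; the label strings and specimen index are hypothetical:

```python
from numpy import sort
from pandas import DataFrame

# Toy stand-in for the merged assignments/strata frame (hypothetical labels).
df = DataFrame({'label': ['(tumor)', '(normal)', '(tumor)']}, index=['s1', 's2', 's3'])

# Same encoding as _label: sorted unique labels get consecutive integer codes,
# and the frame's label column is replaced by those codes.
label_to_result = dict(enumerate(sort(df['label'].unique())))
encoded = df.replace({res: i for i, res in label_to_result.items()})
```

Sorting before enumerating makes the integer assignment deterministic across runs, so the returned mapping can be saved alongside the labels.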
33 changes: 33 additions & 0 deletions spatialprofilingtoolbox/cggnn/scripts/explore_classes.py
@@ -0,0 +1,33 @@
"""Report the different strata available to classify with."""

from argparse import ArgumentParser

from spatialprofilingtoolbox.db.feature_matrix_extractor import FeatureMatrixExtractor


def parse_arguments():
"""Process command line arguments."""
parser = ArgumentParser(
prog='spt cggnn explore_classes',
description='See the strata available to classify on.'
)
parser.add_argument(
'--spt_db_config_location',
type=str,
help='Location of the SPT DB config file.',
required=True
)
parser.add_argument(
'--study',
type=str,
help='Name of the study to query data for.',
required=True
)
return parser.parse_args()


if __name__ == "__main__":
args = parse_arguments()
extractor = FeatureMatrixExtractor(args.spt_db_config_location)
strata = extractor.extract_cohorts(study=args.study)['strata']
print(strata.to_string())
63 changes: 63 additions & 0 deletions spatialprofilingtoolbox/cggnn/scripts/extract.py
@@ -0,0 +1,63 @@
"""Extract information cg-gnn needs from SPT and save to file."""

from argparse import ArgumentParser
from os.path import join, exists
from json import dump

from spatialprofilingtoolbox.cggnn import extract_cggnn_data


def parse_arguments():
"""Process command line arguments."""
parser = ArgumentParser(
prog='spt cggnn extract',
description='Extract information cg-gnn needs from SPT and save to file.'
)
parser.add_argument(
'--spt_db_config_location',
type=str,
help='Location of the SPT DB config file.',
required=True
)
parser.add_argument(
'--study',
type=str,
help='Name of the study to query data for.',
required=True
)
parser.add_argument(
'--strata',
nargs='+',
type=int,
help='Specimen strata to use as labels, identified according to the "stratum identifier" '
'in `explore_classes`. This should be given as space separated integers.\n'
'If not provided, all strata will be used.',
required=False,
default=None
)
parser.add_argument(
'--output_location',
type=str,
help='Directory to save extracted data to.',
required=True
)
return parser.parse_args()


if __name__ == "__main__":
args = parse_arguments()
df_cell, df_label, label_to_result = extract_cggnn_data(
args.spt_db_config_location,
args.study,
args.strata,
)

assert isinstance(args.output_location, str)
dict_filename = join(args.output_location, 'label_to_results.json')
cells_filename = join(args.output_location, 'cells.h5')
labels_filename = join(args.output_location, 'labels.h5')
if not (exists(dict_filename) and exists(cells_filename) and exists(labels_filename)):
df_cell.to_hdf(cells_filename, 'cells')
df_label.to_hdf(labels_filename, 'labels')
with open(dict_filename, 'w', encoding='utf-8') as f:
dump(label_to_result, f)
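The space-separated-integer behavior of `--strata` can be checked with a minimal re-creation of the parser above; the argument values passed in are hypothetical:

```python
from argparse import ArgumentParser

# Re-creation of the 'spt cggnn extract' argument parser, for illustration.
parser = ArgumentParser(prog='spt cggnn extract')
parser.add_argument('--spt_db_config_location', type=str, required=True)
parser.add_argument('--study', type=str, required=True)
parser.add_argument('--strata', nargs='+', type=int, required=False, default=None)
parser.add_argument('--output_location', type=str, required=True)

# nargs='+' with type=int turns the space-separated tokens into a list of ints.
args = parser.parse_args([
    '--spt_db_config_location', 'db.config',   # hypothetical path
    '--study', 'Example study',                # hypothetical study name
    '--strata', '1', '3',
    '--output_location', 'out',
])
```

Omitting `--strata` leaves `args.strata` as None, which extract_cggnn_data interprets as "use all strata".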