This repo uses an LLM to extract cancer hallmarks from text. The scripts call the model through the Amazon Bedrock API to process text and identify relevant cancer hallmarks. This approach is applied to abstracts from grants and publications in the Cancer Complexity Knowledge Portal.
- Python 3.7 or higher
- Required Python packages (listed in requirements.txt)
- AWS credentials configured for the htan-dev profile
Clone the repository:
git clone https://github.com/mc2-center/infer_hallmarks_of_cancer.git
cd infer_hallmarks_of_cancer
Install the required packages:
pip install -r requirements.txt
Set up your AWS credentials for the htan-dev profile:
aws sso login --profile htan-dev
The profile name is currently hardcoded to htan-dev.
Run the script:
python hallmarks_llm.py <abstract>
The script will process the abstract and extract cancer hallmarks, printing the results to the console.
For example:
% python hallmarks_llm.py "This is my abstract which is about gene mutation and cellular senesence"
{
  "extracted_hallmarks": [
    { "hallmark": "Genome instability and mutation", "confidence_score": 0.9 },
    { "hallmark": "Senescent cells", "confidence_score": 0.9 }
  ]
}
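If you want to consume this output from another program, a minimal sketch might look like the following. It assumes only the JSON shape shown in the example above; the helper name `hallmarks_above` and the `threshold` parameter are hypothetical and not part of this repo.

```python
import json

# Example output copied from the run shown above.
raw_output = """
{
  "extracted_hallmarks": [
    {"hallmark": "Genome instability and mutation", "confidence_score": 0.9},
    {"hallmark": "Senescent cells", "confidence_score": 0.9}
  ]
}
"""


def hallmarks_above(raw, threshold=0.5):
    """Return (hallmark, score) pairs whose score is at least `threshold`.

    Hypothetical helper: assumes the JSON shape printed by the script.
    """
    data = json.loads(raw)
    return [
        (item["hallmark"], item["confidence_score"])
        for item in data["extracted_hallmarks"]
        if item["confidence_score"] >= threshold
    ]


if __name__ == "__main__":
    for name, score in hallmarks_above(raw_output, threshold=0.8):
        print(f"{name}: {score}")
```

Filtering on the confidence score is only one possible use; the threshold value here is arbitrary.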
Prepare your Synapse table with an abstract column containing the text to be processed. Note that the Synapse table ID is currently hardcoded.
Run the cckp_publication_abstracts.py script:
python cckp_publication_abstracts.py
The script will:
- Load the data from the Synapse table.
- Initialize new columns for storing hallmarks and scores.
- Use the Bedrock API to process each abstract and extract cancer hallmarks.
- Save the results to a new CSV file.
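The steps above can be sketched roughly as follows. This is not the script's actual code: it uses the standard-library `csv` module and an in-memory list in place of a Synapse table, and `extract_stub`, `annotate_rows`, `write_csv`, and the column names `hallmarks` and `scores` are all assumptions made for illustration.

```python
import csv
import io


def extract_stub(abstract):
    """Hypothetical stand-in for the Bedrock call; returns a fixed result."""
    return [{"hallmark": "Tumor-promoting inflammation", "confidence_score": 0.95}]


def annotate_rows(rows, extractor):
    """Add 'hallmarks' and 'scores' columns (assumed names) to each row."""
    annotated = []
    for row in rows:
        # Initialize the new columns, then fill them from the extractor output.
        new_row = dict(row, hallmarks="", scores="")
        results = extractor(row["abstract"])
        new_row["hallmarks"] = "; ".join(r["hallmark"] for r in results)
        new_row["scores"] = "; ".join(str(r["confidence_score"]) for r in results)
        annotated.append(new_row)
    return annotated


def write_csv(rows, handle):
    """Write the annotated rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    table = [{"id": "1", "abstract": "Obesity and adipose inflammation ..."}]
    buffer = io.StringIO()
    write_csv(annotate_rows(table, extract_stub), buffer)
    print(buffer.getvalue())
```

Passing the extractor in as an argument keeps the row handling testable without AWS access.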
The main functionality is implemented in the hallmarks_llm.py and cckp_publication_abstracts.py files.
hallmarks_llm.py
- Loads the input data from a CSV file.
- Initializes the columns for storing hallmarks and scores.
- Uses the Bedrock API to process each abstract and extract cancer hallmarks.
- Saves the results to a new CSV file.
cckp_publication_abstracts.py
- Loads the data from a Synapse table.
- Initializes new columns for storing hallmarks and scores.
- Uses the Bedrock API to process each abstract and extract cancer hallmarks.
- Saves the results to a new CSV file.
- validate_response(response): Validates the response from the API against a JSON schema.
- extract_cancer_hallmarks(abstract, max_retries=5): Sends the abstract to the API and extracts hallmarks.
- main(): Main function to process an abstract provided as a command-line argument.
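To illustrate what validating a response might involve, here is a minimal hand-written check based only on the output shape shown in the usage example. It is a sketch, not the repo's schema or its `validate_response` implementation: the real function validates against a JSON schema, and the 0 to 1 score range below is an assumption.

```python
def validate_response(response):
    """Return True if `response` matches the shape of the example output.

    Illustrative stand-in for schema validation; the score range is assumed.
    """
    if not isinstance(response, dict):
        return False
    items = response.get("extracted_hallmarks")
    if not isinstance(items, list):
        return False
    for item in items:
        if not isinstance(item, dict):
            return False
        if not isinstance(item.get("hallmark"), str):
            return False
        score = item.get("confidence_score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return False
        if not 0.0 <= score <= 1.0:
            return False
    return True


if __name__ == "__main__":
    good = {"extracted_hallmarks": [
        {"hallmark": "Senescent cells", "confidence_score": 0.9}
    ]}
    print(validate_response(good))
```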
The scripts include retry logic to handle API request failures. They will attempt to process each request up to max_retries times before logging an error.
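The retry behavior described above can be sketched as a small wrapper. This is an illustration under assumptions rather than the scripts' own code: `call_with_retries`, `request_fn`, and `delay_seconds` are hypothetical names, and only `max_retries` (default 5) comes from the function signature listed in this README.

```python
import logging
import time

logger = logging.getLogger(__name__)


def call_with_retries(request_fn, max_retries=5, delay_seconds=0.0):
    """Call `request_fn` up to `max_retries` times; return None if all fail.

    Hypothetical helper showing the retry pattern, not the repo's code.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return request_fn()
        except Exception as exc:
            logger.warning("Attempt %d of %d failed: %s", attempt, max_retries, exc)
            # Optionally pause before the next attempt.
            if attempt < max_retries and delay_seconds:
                time.sleep(delay_seconds)
    # Every attempt failed: log an error and signal failure to the caller.
    logger.error("Giving up after %d attempts", max_retries)
    return None
```

A caller would pass a zero-argument function that performs one API request, for example a `lambda` wrapping the request for a single abstract.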
Abstract:
Obesity is a risk factor for the development of post-menopausal breast cancer. Breast white adipose tissue (WAT) inflammation, which is commonly found in women with excess body fat, is also associated with increased breast cancer risk. Both local and systemic effects are probably important for explaining the link between excess body fat, adipose inflammation and breast cancer. The first goal of this cross-sectional study of 196 women was to carry out transcriptome profiling to define the molecular changes that occur in the breast related to excess body fat and WAT inflammation. A second objective was to determine if commonly measured blood biomarkers of risk and prognosis reflect molecular changes in the breast. Breast WAT inflammation was assessed by immunohistochemistry. Bulk RNA-sequencing was carried out to assess gene expression in non-tumorous breast. Obesity and WAT inflammation were associated with a large number of differentially expressed genes and changes in multiple pathways linked to the development and progression of breast cancer. Altered pathways included inflammatory response, complement, KRAS signaling, tumor necrosis factor alpha signaling via NFkB, interleukin (IL)6-JAK-STAT3 signaling, epithelial mesenchymal transition, angiogenesis, interferon gamma response and transforming growth factor (TGF)-beta signaling. Increased expression of several drug targets such as aromatase, TGF-beta1, IDO-1 and PD-1 were observed. Levels of various blood biomarkers including high sensitivity C-reactive protein, IL6, leptin, adiponectin, triglycerides, high-density lipoprotein cholesterol and insulin were altered and correlated with molecular changes in the breast. Collectively, this study helps to explain both the link between obesity and breast cancer and the utility of blood biomarkers for determining risk and prognosis.
Hallmarks:
- Tumor-promoting inflammation (0.95)
- Activating invasion and metastasis (0.85)
- Inducing or accessing angiogenesis (0.85)
- Evading immune destruction (0.85)
- Unlocking phenotypic plasticity (0.85)
A mathematical model for predicting the spatiotemporal response of breast cancer cells treated with doxorubicin
Abstract:
Tumor heterogeneity contributes significantly to chemoresistance, a leading cause of treatment failure. To better personalize therapies, it is essential to develop tools capable of identifying and predicting intra- and inter-tumor heterogeneities. Biology-inspired mathematical models are capable of attacking this problem, but tumor heterogeneity is often overlooked in in-vivo modeling studi... (Read more on PubMed)
Hallmarks:
- Genome instability and mutation (0.9)
- Unlocking phenotypic plasticity (0.85)
Abstract:
Previous studies have developed vascularized tumor spheroid models to demonstrate the impact of intravascular flow on tumor progression and treatment. However, these models have not been widely adopted so the vascularization of tumor spheroids in vitro is generally lower than vascularized tumor tissues in vivo. To improve the tumor vascularization level, a new strategy is introduced to form tumor spheroids by adding fibroblasts (FBs) sequentially to a pre-formed tumor spheroid and demonstrate this method with tumor cell lines from kidney, lung, and ovary cancer. Tumor spheroids made with the new strategy have higher FB densities on the periphery of the tumor spheroid, which tend to enhance vascularization. The vessels close to the tumor spheroid made with this new strategy are more perfusable than the ones made with other methods. Finally, chimeric antigen receptor (CAR) T cells are perfused under continuous flow into vascularized tumor spheroids to demonstrate immunotherapy evaluation using vascularized tumor-on-a-chip model. This new strategy for establishing tumor spheroids leads to increased vascularization in vitro, allowing for the examination of immune, endothelial, stromal, and tumor cell responses under static or flow conditions.
Hallmarks:
- Inducing or accessing angiogenesis (0.95)
- Evading immune destruction (0.85)