Genentech/PerturbQA


Contextualizing biological perturbation experiments through language

This is the official repository for PerturbQA. If you find our work interesting, please check out our paper to learn more!

@inproceedings{
    wu2025perturbqa,
    title={Contextualizing biological perturbation experiments through language},
    author={Menghua Wu and Russell Littman and Jacob Levine and Lin Qiu and Tommaso Biancalani and David Richmond and Jan-Christian Huetter},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
}

Installation

git clone git@github.com:Genentech/PerturbQA.git
cd PerturbQA
pip install -e .

Specifically, the following packages are required to run our evaluation.

- scikit-learn
- numpy
- (optional, for ROUGE and BERT scores) torchmetrics, which requires torch
- (optional, for BERT score) transformers
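Before running the evaluation, it can help to confirm which of the packages listed above are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of pertqa. Note that scikit-learn is imported as sklearn.

```python
from importlib.util import find_spec

# Import names for the packages listed above (scikit-learn imports as "sklearn").
REQUIRED = ["sklearn", "numpy"]
OPTIONAL = ["torchmetrics", "torch", "transformers"]


def missing_packages(names):
    """Return the subset of top-level import names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```

An empty list for the required packages means the AUC evaluation should be runnable; the optional packages only matter for the ROUGE and BERT scores.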

This code distribution contains the PerturbQA input and label pairs. For additional materials, including processed knowledge graphs and model predictions, please see the data distribution.

PerturbQA benchmark

Differential expression and direction of change

Datasets can be loaded as follows.

from pertqa import load_de, load_dir

# options: "k562" "rpe1" "hepg2" "jurkat" "k562_set"
data_de = load_de("k562")
# train/test splits
X_train = data_de["train"]
X_test = data_de["test"]

data_dir = load_dir("k562")
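As a quick sanity check after loading, one might summarize a split. The sketch below assumes each record is a dict with "pert", "gene" and "label" keys, which is how the evaluation snippet in this section accesses them. The records and the summarize helper are hypothetical stand-ins, not part of pertqa.

```python
from collections import Counter

# Hypothetical stand-in for data_de["train"]; real records come from load_de.
records = [
    {"pert": "PERT_A", "gene": "GENE_X", "label": 1},
    {"pert": "PERT_A", "gene": "GENE_Y", "label": 0},
    {"pert": "PERT_B", "gene": "GENE_X", "label": 0},
    {"pert": "PERT_B", "gene": "GENE_Y", "label": 0},
]


def summarize(records):
    """Count records, distinct perturbations, distinct genes, and label frequencies."""
    return {
        "n_records": len(records),
        "n_perts": len({r["pert"] for r in records}),
        "n_genes": len({r["gene"] for r in records}),
        "labels": dict(Counter(r["label"] for r in records)),
    }


if __name__ == "__main__":
    print(summarize(records))
```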

To evaluate your predictions (additional example in examples/results.ipynb):

import numpy as np
from pertqa import auc_per_gene

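# one (perturbation, gene) key per test example
keys = [(x["pert"], x["gene"]) for x in X_test]
# placeholder scores; replace with your model's predictions (list / numpy array of floats)
pred = np.random.randn(len(keys))
# ground-truth labels from load_de / load_dir
true = [x["label"] for x in X_test]
auc = auc_per_gene(keys, pred, true)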
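To illustrate the general idea of a grouped AUC, here is a minimal pure-Python sketch. It is not the pertqa implementation. It assumes binary 0/1 labels, groups examples by one element of the key (the gene by default, which is only one plausible reading of the function name), computes a ROC AUC per group, and averages the groups where AUC is defined. The names binary_auc, grouped_auc and group_index are hypothetical.

```python
from collections import defaultdict


def binary_auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grouped_auc(keys, pred, true, group_index=1):
    """Average per-group AUC; group_index=1 groups (pert, gene) keys by gene (an assumption)."""
    groups = defaultdict(lambda: ([], []))
    for key, p, t in zip(keys, pred, true):
        scores, labels = groups[key[group_index]]
        scores.append(p)
        labels.append(t)
    aucs = [binary_auc(s, l) for s, l in groups.values()]
    aucs = [a for a in aucs if a is not None]  # skip single-class groups
    return sum(aucs) / len(aucs) if aucs else None


if __name__ == "__main__":
    keys = [("p1", "g1"), ("p2", "g1"), ("p1", "g2"), ("p2", "g2")]
    print(grouped_auc(keys, [0.9, 0.1, 0.2, 0.8], [1, 0, 1, 0]))
```

Averaging within groups, rather than pooling all examples, keeps groups with many examples from dominating the score. How auc_per_gene handles single-class groups or ties may differ from this sketch.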

Gene set enrichment

Set the skip_empty flag to skip entries without manual annotation (defaults to True).

from pertqa import load_gse

# options: "pert" "gene"
data = load_gse("pert", skip_empty=True)

To evaluate your predictions (requires torchmetrics):

from pertqa import rouge1_recall

pred = ["hello world"]  # list of predictions
true = ["hello"]  # list of labels, e.g. from load_gse
score = rouge1_recall(pred, true)
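For intuition, ROUGE-1 recall is the fraction of unigrams in the label that also appear in the prediction. The sketch below is a simplified standalone version that assumes lowercasing and whitespace tokenization. The torchmetrics-based rouge1_recall may normalize and tokenize differently, so scores can differ. The name rouge1_recall_sketch is hypothetical.

```python
from collections import Counter


def rouge1_recall_sketch(pred, true):
    """Clipped unigram overlap divided by the number of unigrams in the label."""
    pred_counts = Counter(pred.lower().split())
    true_counts = Counter(true.lower().split())
    total = sum(true_counts.values())
    if total == 0:
        return 0.0  # no label tokens to recall
    overlap = sum(min(count, pred_counts[tok]) for tok, count in true_counts.items())
    return overlap / total


if __name__ == "__main__":
    print(rouge1_recall_sketch("hello world", "hello"))
    print(rouge1_recall_sketch("hello", "hello world"))
```

Because recall only asks how much of the label is covered, a longer prediction is not penalized for extra words.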

The transformers library is required to compute BERTScore, and we recommend having access to a GPU.

from pertqa import bert_score

pred = ["hello world"]  # list of predictions
true = ["hello"]  # list of labels, e.g. from load_gse
scores = bert_score(pred, true)

Knowledge graph to prompts

Processed knowledge graphs are available in the data distribution under the archive kg.zip.

See examples/kg_to_prompt.ipynb for details about how to load these files and how to generate gene summary prompts. Please place kg at perturbqa/datasets/kg if you wish to run these examples.

Models

LLMs

Please see examples/summer for more details.

  • All prompt templates may be found at examples/summer/prompts.
  • Raw LLM outputs can be found in the data distribution, in the archives named:
    • summer_outputs.zip
    • llm-nocot.zip
    • llm-noretrieve.zip

Baselines

  • Code or instructions required to run the baselines can be found under examples.
  • Each baseline has its own installation requirements.

Data attribution and license

This codebase is licensed under the Genentech Non-Commercial Software License Version 1.0. For more information, please see the attached LICENSE.txt file.

The PerturbQA datasets in the data folder of this repository are licensed under the CC BY 4.0 license. They are derived from the following datasets:

Datasets | Reference | License
DE/Dir: k562, k562_set, rpe1. Gene set enrichment | Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell, 185(14):2559–2575.e28, 2022. ISSN 0092-8674. doi:505 (link) | CC BY 4.0
DE/Dir: hepg2, jurkat | Transcriptome-wide characterization of genetic perturbations. bioRxiv, 07 2024. doi: 10.1101/2024.07.03.601903 (link) | CC BY 4.0

The LLM outputs in the data distribution (summer_outputs.zip, summer_enrichment.zip, llm-nocot.zip, llm-noretrieve.zip) and results tables (results.zip) are licensed under the CC BY 4.0 license.

The knowledge graph entries and gene summaries (kg.zip, gene_summary.zip of the data distribution, respectively) are derived from the following datasets and are governed by the original licenses of these datasets:

Database | Reference | License
UniProt | UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51:D523–D531 (2023) (link) | CC BY 4.0
Ensembl | Ensembl 2024. Nucleic Acids Res. 2024, 52(D1):D891–D899. PMID: 37953337. doi: 10.1093/nar/gkad1049 (link) | Apache 2.0
Gene Ontology | 2024-01-17 release (DOI:10.5281/zenodo.10536401) | CC BY 4.0
CORUM | CORUM: the comprehensive resource of mammalian protein complexes–2022. Nucleic Acids Research, 51(D1):D539–D545 (link) | CC BY NC 4.0
STRINGDB | Szklarczyk et al. Nucleic Acids Research 51.D1 (2023): D638–D646 (link) | CC BY 4.0
Reactome | The Reactome Pathway Knowledgebase 2024. Nucleic Acids Research. 2024. doi: 10.1093/nar/gkad1025 (link) | CC BY 4.0
Bioplex | Huttlin et al. (2021) Cell 184(11):3022–3040. doi: 10.1016/j.cell.2021.04.011 (link) |

Please note that CORUM is licensed under CC BY NC 4.0, which does not permit commercial use.
