Contextualizing biological perturbation experiments through language

This is the official repository for PerturbQA. If you find our work interesting, please check out our paper to learn more!

@inproceedings{
    wu2025perturbqa,
    title={Contextualizing biological perturbation experiments through language},
    author={Menghua Wu and Russell Littman and Jacob Levine and Lin Qiu and Tommaso Biancalani and David Richmond and Jan-Christian Huetter},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
}

Installation

git clone git@github.com:Genentech/PerturbQA.git
cd PerturbQA
pip install -e .

Specifically, the following packages are required to run our evaluation.

- scikit-learn
- numpy
- (optional, for ROUGE and BERT scores) torchmetrics, which requires torch
- (optional, for BERT score) transformers

This code distribution contains the PerturbQA input and label pairs. For additional materials, including processed knowledge graphs and model predictions, please see the data distribution.

PerturbQA benchmark

Differential expression and direction of change

Datasets can be loaded as follows.

from perturbqa import load_de, load_dir

# options: "k562" "rpe1" "hepg2" "jurkat" "k562_set"
data_de = load_de("k562")
# train/test splits
X_train = data_de["train"]
X_test = data_de["test"]

data_dir = load_dir("k562")
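
The evaluation snippet in this README reads the "pert", "gene", and "label" fields from each entry. Assuming each split is a list of dictionaries with at least those three keys, a quick sanity check of a split might look like the sketch below. The helper names and toy records are hypothetical and are not part of the package.

```python
from collections import Counter

def label_counts(records):
    """Count how often each label value occurs in a split."""
    return Counter(x["label"] for x in records)

def unique_perturbations(records):
    """Return the sorted set of perturbation names in a split."""
    return sorted({x["pert"] for x in records})

# Toy records with made-up names; real entries come from load_de / load_dir.
toy_split = [
    {"pert": "P1", "gene": "G1", "label": 1},
    {"pert": "P1", "gene": "G2", "label": 0},
    {"pert": "P2", "gene": "G1", "label": 0},
]

counts = label_counts(toy_split)
perts = unique_perturbations(toy_split)
```

Checking the label balance before training is useful because a heavily skewed split makes accuracy uninformative, which is one reason to report AUC instead.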

To evaluate your predictions (additional example in examples/results.ipynb):

import numpy as np
from perturbqa import auc_per_gene

keys = [(x["pert"], x["gene"]) for x in X_test]
pred = np.random.randn(len(keys))  # list / numpy array of floats
true = [x["label"] for x in X_test]  # from load_de/dir
auc = auc_per_gene(keys, pred, true)
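
For intuition, here is a minimal pure-Python sketch of one plausible reading of a per-gene AUC: group predictions by the gene in each (pert, gene) key, compute a rank-based ROC AUC within each group, and average over genes. It assumes binary 0/1 labels and skips genes with only one class. The function names are hypothetical, and the packaged auc_per_gene may differ in grouping, tie handling, or aggregation.

```python
from collections import defaultdict

def binary_auc(scores, labels):
    """Rank-based ROC AUC (Mann-Whitney U); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auc_by_gene(keys, pred, true):
    """Group (pert, gene) predictions by gene and average the per-gene AUCs."""
    groups = defaultdict(lambda: ([], []))
    for (_, gene), p, t in zip(keys, pred, true):
        groups[gene][0].append(p)
        groups[gene][1].append(t)
    aucs = [binary_auc(s, y) for s, y in groups.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else float("nan")

# Toy data with made-up perturbation and gene names.
toy_keys = [("P1", "G1"), ("P2", "G1"), ("P3", "G1"), ("P1", "G2"), ("P2", "G2")]
toy_pred = [0.5, 0.2, 0.6, 0.3, 0.8]
toy_true = [1, 0, 0, 0, 1]
toy_auc = mean_auc_by_gene(toy_keys, toy_pred, toy_true)
```

Averaging within groups rather than pooling all pairs keeps genes with many measured perturbations from dominating the score.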

Gene set enrichment

Set the skip_empty flag to skip entries without manual annotation (defaults to True).

from perturbqa import load_gse

# options: "pert" "gene"
data = load_gse("pert", skip_empty=True)

To evaluate your predictions (requires torchmetrics):

from perturbqa import rouge1_recall

pred = ["hello world"]  # list of predictions
true = ["hello"]  # list of labels, e.g. from load_gse
score = rouge1_recall(pred, true)
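
As a rough illustration of what ROUGE-1 recall measures, the sketch below computes the fraction of reference unigrams that also appear in the prediction, using lowercased whitespace tokenization. This is a simplification under stated assumptions: the function name is hypothetical, and the torchmetrics-backed rouge1_recall may tokenize, normalize, or aggregate differently.

```python
from collections import Counter

def unigram_recall(prediction, reference):
    """Fraction of reference unigrams covered by the prediction (clipped counts)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # empty reference: define recall as 0 to avoid division by zero
    overlap = sum(min(c, pred_counts[tok]) for tok, c in ref_counts.items())
    return overlap / total

recall_full = unigram_recall("hello world", "hello")  # reference fully covered
recall_half = unigram_recall("hello", "hello world")  # half of the reference covered
```

Recall rewards covering the annotated terms and does not penalize extra words in the prediction, so longer predictions tend to score higher.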

The transformers library is required to compute BERTScore, and we recommend having access to a GPU.

from perturbqa import bert_score

pred = ["hello world"]  # list of predictions
true = ["hello"]  # list of labels, e.g. from load_gse
scores = bert_score(pred, true)

Knowledge graph to prompts

Processed knowledge graphs are available in the data distribution under the archive kg.zip.

See examples/kg_to_prompt.ipynb for details about how to load these files and how to generate gene summary prompts. Please place kg at perturbqa/datasets/kg if you wish to run these examples.

Models

LLMs

Please see examples/summer for more details.

All prompt templates may be found at examples/summer/prompts.
Raw LLM outputs can be found in the data distribution, in the archives named:
- summer_outputs.zip
- llm-nocot.zip
- llm-noretrieve.zip

Baselines

Code or instructions required to run baselines can be found under examples.
Baselines have their own installation requirements.

Data attribution and license

This codebase is licensed under the Genentech Non-Commercial Software License Version 1.0. For more information, please see the attached LICENSE.txt file.

The PerturbQA datasets in the data folder of this repository are licensed under the CC BY 4.0 license. They are derived from the following datasets:

Datasets	Reference	License
DE/Dir: k562, k562_set, rpe1. Gene set enrichment	Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell, 185(14):2559–2575.e28, 2022. ISSN 0092-8674. doi: 10.1016/j.cell.2022.05.013 (link)	CC BY 4.0
DE/Dir: hepg2, jurkat	Transcriptome-wide characterization of genetic perturbations. bioRxiv, 07 2024. doi: 10.1101/2024.07.03.601903 (link)	CC BY 4.0

The LLM outputs in the data distribution (summer_outputs.zip, summer_enrichment.zip, llm-nocot.zip, llm-noretrieve.zip) and results tables (results.zip) are licensed under the CC BY 4.0 license.

The knowledge graph entries and gene summaries (kg.zip, gene_summary.zip of the data distribution, respectively) are derived from the following datasets and are governed by the original licenses of these datasets:

Database	Reference	License
UniProt	UniProt: the Universal Protein Knowledgebase in 2023 Nucleic Acids Res. 51:D523–D531 (2023) (link)	CC BY 4.0
Ensembl	Ensembl 2024 Nucleic Acids Res. 2024, 52(D1):D891–D899 PMID: 37953337 10.1093/nar/gkad1049 (link)	Apache 2.0
Gene Ontology	2024-01-17 release (DOI:10.5281/zenodo.10536401)	CC BY 4.0
CORUM	CORUM: the comprehensive resource of mammalian protein complexes–2022 Nucleic Acids Research, 51(D1):D539–D545 (link)	CC BY NC 4.0
STRINGDB	Szklarczyk et al. Nucleic acids research 51.D1 (2023): D638-D646 (link)	CC BY 4.0
Reactome	The Reactome Pathway Knowledgebase 2024. Nucleic Acids Research. 2024. doi: 10.1093/nar/gkad1025. (link)	CC BY 4.0
Bioplex	Huttlin et al. (2021) Cell 184(11):3022-3040. doi: 10.1016/j.cell.2021.04.011. (link)

Please note that CORUM is licensed under CC BY NC 4.0.
