This is the official repository for PerturbQA. If you find our work interesting, please check out our paper to learn more!
@inproceedings{
wu2025perturbqa,
title={Contextualizing biological perturbation experiments through language},
author={Menghua Wu and Russell Littman and Jacob Levine and Lin Qiu and Tommaso Biancalani and David Richmond and Jan-Christian Huetter},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
}
git clone git@github.com:Genentech/PerturbQA.git
cd PerturbQA
pip install -e .
Specifically, the following packages are required to run our evaluation.
- scikit-learn
- numpy
- (optional, for ROUGE and BERT scores) torchmetrics, which requires torch
- (optional, for BERT score) transformers
This code distribution contains the PerturbQA input and label pairs. For additional materials, including processed knowledge graphs and model predictions, please see the data distribution.
Datasets can be loaded as follows.
from pertqa import load_de, load_dir
# options: "k562" "rpe1" "hepg2" "jurkat" "k562_set"
data_de = load_de("k562")
# train/test splits
X_train = data_de["train"]
X_test = data_de["test"]
data_dir = load_dir("k562")
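Before training on a split, it can help to check how the labels are distributed. The sketch below is only an illustration: it assumes each split is an iterable of dictionaries carrying the `"pert"`, `"gene"`, and `"label"` fields used in the evaluation snippet in this README. The records and the helper names `label_counts` and `rows_per_pert` are invented for the example and are not part of the package.

```python
from collections import Counter

# Hypothetical records mimicking the fields used in this README
# ("pert", "gene", "label"); the values are invented for illustration.
records = [
    {"pert": "GENE_A", "gene": "GENE_X", "label": 1},
    {"pert": "GENE_A", "gene": "GENE_Y", "label": 0},
    {"pert": "GENE_B", "gene": "GENE_X", "label": 0},
    {"pert": "GENE_B", "gene": "GENE_Y", "label": 0},
]


def label_counts(rows):
    """Count how often each label value occurs."""
    return Counter(row["label"] for row in rows)


def rows_per_pert(rows):
    """Count how many (pert, gene) pairs each perturbation contributes."""
    return Counter(row["pert"] for row in rows)


counts = label_counts(records)
per_pert = rows_per_pert(records)
print(dict(counts))
print(dict(per_pert))
```

With real data, `records` would be replaced by a split such as `data_de["train"]`, provided the split has the structure assumed here.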
To evaluate your predictions (additional example in examples/results.ipynb):
import numpy as np
from pertqa import auc_per_gene
keys = [(x["pert"], x["gene"]) for x in X_test]
pred = np.random.randn(len(keys)) # list / numpy array of floats
true = [x["label"] for x in X_test] # from load_de/dir
auc = auc_per_gene(keys, pred, true)
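To make the shape of the inputs concrete, here is a rough pure-Python sketch of a grouped AUC. It is not the package's implementation. It assumes binary labels, groups by the second element of each key, and averages the per-group AUCs, skipping groups that lack one of the two classes. Both the grouping and the averaging are assumptions, and `binary_auc` and `mean_auc_by_gene` are hypothetical names. In practice, scikit-learn's `roc_auc_score` would normally replace the hand-written AUC.

```python
from collections import defaultdict


def binary_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc_by_gene(keys, pred, true):
    """Group by the second key element (assumed grouping), then average the AUCs."""
    groups = defaultdict(lambda: ([], []))
    for (_, gene), score, label in zip(keys, pred, true):
        groups[gene][0].append(score)
        groups[gene][1].append(label)
    aucs = [binary_auc(s, y) for s, y in groups.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None


# Toy inputs, invented for illustration
toy_keys = [("P1", "G1"), ("P2", "G1"), ("P3", "G1"), ("P1", "G2"), ("P2", "G2")]
toy_pred = [0.9, 0.2, 0.4, 0.1, 0.8]
toy_true = [1, 0, 0, 1, 0]
toy_auc = mean_auc_by_gene(toy_keys, toy_pred, toy_true)
print(toy_auc)
```

In the toy data, the single positive for `G1` outranks both negatives and the single positive for `G2` is outranked by its negative, so the two groups sit at opposite extremes.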
Set flag skip_empty to skip entries without manual annotation (defaults to True).
from pertqa import load_gse
# options: "pert" "gene"
data = load_gse("pert", skip_empty=True)
To evaluate your predictions (requires torchmetrics):
from pertqa import rouge1_recall
pred = ["hello world"] # list of predictions
true = ["hello"] # list of labels, e.g. from load_gse
score = rouge1_recall(pred, true)
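For intuition about what a ROUGE-1 recall measures, the sketch below computes unigram recall by hand: the fraction of reference tokens that also occur in the prediction, with clipped counts. It is a simplification and not the package's code. Lowercasing and whitespace tokenization are assumptions, torchmetrics applies its own normalization, and `unigram_recall` and `mean_unigram_recall` are hypothetical names.

```python
from collections import Counter


def unigram_recall(pred, true):
    """Fraction of reference unigrams that also appear in the prediction."""
    pred_counts = Counter(pred.lower().split())
    true_counts = Counter(true.lower().split())
    total = sum(true_counts.values())
    if total == 0:
        return 0.0  # no reference tokens to recall
    # Counter intersection keeps the minimum count per token (clipping)
    overlap = sum((pred_counts & true_counts).values())
    return overlap / total


def mean_unigram_recall(preds, trues):
    """Average the per-example recall over paired predictions and labels."""
    return sum(unigram_recall(p, t) for p, t in zip(preds, trues)) / len(preds)


example_score = mean_unigram_recall(["hello world"], ["hello"])
print(example_score)
```

Because recall only asks how much of the reference is covered, extra words in the prediction (such as "world" above) are not penalized.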
The transformers library is required to compute BERTScore, and we recommend having access to a GPU.
from pertqa import bert_score
pred = ["hello world"] # list of predictions
true = ["hello"] # list of labels, e.g. from load_gse
scores = bert_score(pred, true)
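As background on the metric itself, BERTScore greedily matches each token embedding in one text to its most similar token embedding in the other, using cosine similarity. The sketch below shows that matching step on toy vectors. It is not the package's `bert_score`, whose return format is not described here. The two-dimensional "embeddings" are invented, and refinements such as importance weighting or baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(pred_vecs, true_vecs):
    """Return (precision, recall, f1) from greedy cosine matching of token vectors."""
    # Precision: each predicted token is matched to its closest reference token
    precision = sum(max(cosine(p, t) for t in true_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its closest predicted token
    recall = sum(max(cosine(t, p) for p in pred_vecs) for t in true_vecs) / len(true_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy 2-d "embeddings", invented for illustration
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for "hello", "world"
true_vecs = [[1.0, 0.0]]              # stand-in for "hello"
precision, recall, f1 = greedy_match_scores(pred_vecs, true_vecs)
print(precision, recall, f1)
```

In a real run the vectors come from a pretrained language model via transformers, which is why a GPU is recommended.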
Processed knowledge graphs are available in the data distribution under the archive kg.zip.
See examples/kg_to_prompt.ipynb for details about how to load these files and how to generate gene summary prompts.
Please place kg at perturbqa/datasets/kg if you wish to run these examples.
Please see examples/summer for more details.
- All prompt templates may be found at examples/summer/prompts.
- Raw LLM outputs can be found in the data distribution, in the archives named:
  - summer_outputs.zip
  - llm-nocot.zip
  - llm-noretrieve.zip
- Code or instructions required to run baselines can be found under examples.
- Baselines have their own installation requirements.
This codebase is licensed under the Genentech Non-Commercial Software License Version 1.0. For more information, please see the attached LICENSE.txt file.
The PerturbQA datasets in the data folder of this repository are licensed under the CC BY 4.0 license. They are derived from the following datasets:
Datasets | Reference | License |
---|---|---|
DE/Dir: k562, k562_set, rpe1. Gene set enrichment | Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell, 185(14):2559–2575.e28, 2022. ISSN 0092-8674. doi:505 (link) | CC BY 4.0 |
DE/Dir: hepg2, jurkat | Transcriptome-wide characterization of genetic perturbations. bioRxiv, 07 2024. doi: 10.1101/2024.07.03.601903 (link) | CC BY 4.0 |
The LLM outputs in the data distribution (summer_outputs.zip, summer_enrichment.zip, llm-nocot.zip, llm-noretrieve.zip) and results tables (results.zip) are licensed under the CC BY 4.0 license.
The knowledge graph entries and gene summaries (kg.zip and gene_summary.zip of the data distribution, respectively) are derived from the following datasets and are governed by the original licenses of these datasets:
Database | Reference | License |
---|---|---|
UniProt | UniProt: the Universal Protein Knowledgebase in 2023 Nucleic Acids Res. 51:D523–D531 (2023) (link) | CC BY 4.0 |
Ensembl | Ensembl 2024 Nucleic Acids Res. 2024, 52(D1):D891–D899 PMID: 37953337 10.1093/nar/gkad1049 (link) | Apache 2.0 |
Gene Ontology | 2024-01-17 release (DOI:10.5281/zenodo.10536401) | CC BY 4.0 |
CORUM | CORUM: the comprehensive resource of mammalian protein complexes–2022 Nucleic Acids Research, 51(D1):D539–D545 (link) | CC BY NC 4.0 |
STRINGDB | Szklarczyk et al. Nucleic acids research 51.D1 (2023): D638-D646 (link) | CC BY 4.0 |
Reactome | The Reactome Pathway Knowledgebase 2024. Nucleic Acids Research. 2024. doi: 10.1093/nar/gkad1025. (link) | CC BY 4.0 |
Bioplex | Huttlin et al. (2021) Cell 184(11):3022-3040. doi: 10.1016/j.cell.2021.04.011. (link) |
Please note that CORUM is licensed under CC BY NC 4.0.