From e3a555e939d7170facf990f1a393d1d9267b84e4 Mon Sep 17 00:00:00 2001
From: johli
Date: Tue, 1 Oct 2024 13:58:10 -0700
Subject: [PATCH 1/7] Update README.md

---
 README.md | 49 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index d81ff13..498f17c 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ Code repository for Borzoi models, which are convolutional neural networks train
 
 [https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1).
 
-Borzoi was trained on a large set of RNA-seq experiments from ENCODE and GTEx, as well as re-processed versions of the original Enformer training data (including ChIP-seq and DNase data from ENCODE, ATAC-seq data from CATlas, and CAGE data from FANTOM5). Click [here](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt) for a list of trained-on experiments.
+Borzoi was trained on a large set of RNA-seq experiments from ENCODE and GTEx, as well as re-processed versions of the original Enformer training data (including ChIP-seq and DNase data from ENCODE, ATAC-seq data from CATlas, and CAGE data from FANTOM5). Here is a list of trained-on experiments: [human](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt) / [mouse](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_mouse.txt).
 
 The repository contains example usage code (including jupyter notebooks for predicting and visualizing genetic variants) as well as links for downloading model weights, training data, QTL benchmark tasks, etc.
 
@@ -30,20 +30,51 @@ cd borzoi
 pip install -e .
 ```
 
-These repositories further depend on a number of python packages (which are automatically installed with borzoi). See **setup.cfg** for a complete list. The most important version dependencies are:
-- Python == 3.9
-- Tensorflow == 2.12.x (see [https://www.tensorflow.org/install/pip](https://www.tensorflow.org/install/pip))
+To train new models, the [westminster repository](https://github.com/calico/westminster.git) is also required and can be installed with these commands:
+```sh
+git clone https://github.com/calico/westminster.git
+cd westminster
+pip install -e .
+```
+
+These repositories further depend on a number of python packages (which are automatically installed with borzoi). See **pyproject.toml** for a complete list. The most important version dependencies are:
+- Python == 3.10
+- Tensorflow == 2.15.x (see [https://www.tensorflow.org/install/pip](https://www.tensorflow.org/install/pip))
 
 *Note*: The example notebooks require jupyter, which can be installed with `pip install notebook`.
-A new conda environment can be created with `conda create -n borzoi_py39 python=3.9`.
+A new conda environment can be created with `conda create -n borzoi_py310 python=3.10`.
+
+Finally, the code base relies on a number of environment variables. For convenience, these can be configured in the active conda environment with the 'env_vars.sh' script.
+```sh
+cd borzoi
+conda activate borzoi_py310
+./env_vars.sh
+```
+
+Alternatively, these environment variables can be set manually:
+```sh
+export BORZOI_DIR=/home//borzoi
+export PATH=$BORZOI_DIR/src/scripts:$PATH
+export PYTHONPATH=$BORZOI_DIR/src/scripts:$PYTHONPATH
+
+export BORZOI_CONDA=/home//anaconda3/etc/profile.d/conda.sh
+export BORZOI_HG38=$BORZOI_DIR/examples/hg38
+export BORZOI_MM10=$BORZOI_DIR/examples/mm10
+```
 
 ### Model Availability
 The model weights can be downloaded as .h5 files from the URLs below. We trained a total of 4 model replicates with identical train, validation and test splits (test = fold3, validation = fold4 from [sequences_human.bed.gz](https://github.com/calico/borzoi/blob/main/data/sequences_human.bed.gz)).
 
-[Borzoi V2 Replicate 0](https://storage.googleapis.com/seqnn-share/borzoi/f0/model0_best.h5)
-[Borzoi V2 Replicate 1](https://storage.googleapis.com/seqnn-share/borzoi/f1/model0_best.h5)
-[Borzoi V2 Replicate 2](https://storage.googleapis.com/seqnn-share/borzoi/f2/model0_best.h5)
-[Borzoi V2 Replicate 3](https://storage.googleapis.com/seqnn-share/borzoi/f3/model0_best.h5)
+[Borzoi Replicate 0 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f0/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f0/model1_best.h5)
+[Borzoi Replicate 1 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f1/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f1/model1_best.h5)
+[Borzoi Replicate 2 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f2/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f2/model1_best.h5)
+[Borzoi Replicate 3 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f3/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f3/model1_best.h5)
+ +For convenience, users can run *download_models.sh* to download model replicates and annotations into the 'examples/' folder. +```sh +cd borzoi +./download_models.sh +``` #### Mini Borzoi Models We have trained a collection of (smaller) model instances on various subsets of data modalities (or on all data modalities but with architectural changes compared to the original architecture). For example, some models are trained only on RNA-seq data while others are trained on DNase-, ATAC- and RNA-seq. Similarly, some model instances are trained on human-only data while others are trained on human- and mouse data. The models were trained with either 2- or 4-fold cross-validation and are available at the following URL: From 900be7cb760c189f27475b8d47b6f1931e80d5ba Mon Sep 17 00:00:00 2001 From: johli Date: Tue, 1 Oct 2024 13:59:36 -0700 Subject: [PATCH 2/7] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 498f17c..4c2a314 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,7 @@ conda activate borzoi_py310 ./env_vars.sh ``` -Alternatively, these environment variables can be set manually: +Alternatively, the environment variables can be set manually: ```sh export BORZOI_DIR=/home//borzoi export PATH=$BORZOI_DIR/src/scripts:$PATH From 9c5df564d0db3acf721d3712db0f2e497c96d3d3 Mon Sep 17 00:00:00 2001 From: johli Date: Tue, 1 Oct 2024 14:00:27 -0700 Subject: [PATCH 3/7] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4c2a314..4a3941d 100644 --- a/README.md +++ b/README.md @@ -70,7 +70,7 @@ The model weights can be downloaded as .h5 files from the URLs below. We trained [Borzoi Replicate 2 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f2/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f2/model1_best.h5)
[Borzoi Replicate 3 (human)](https://storage.googleapis.com/seqnn-share/borzoi/f3/model0_best.h5) | [(mouse)](https://storage.googleapis.com/seqnn-share/borzoi/f3/model1_best.h5)
 
-For convenience, users can run *download_models.sh* to download model replicates and annotations into the 'examples/' folder.
+Users can run the script *download_models.sh* to download all model replicates and annotations into the 'examples/' folder.
 ```sh
 cd borzoi
 ./download_models.sh

From 6f8184d483202a0af726283849ec5a9a510d854b Mon Sep 17 00:00:00 2001
From: johli
Date: Tue, 1 Oct 2024 15:16:25 -0700
Subject: [PATCH 4/7] Update README.md

---
 README.md | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4a3941d..85cca31 100644
--- a/README.md
+++ b/README.md
@@ -91,7 +91,7 @@ For example, here are the weights, targets, and parameter file of a model traine
 ### Data Availability
 The training data for Borzoi can be downloaded from the following URL:
 
-[Borzoi V2 Training Data](https://storage.googleapis.com/borzoi-paper/data/)
+[Borzoi Training Data](https://storage.googleapis.com/borzoi-paper/data/)
 
 *Note*: This data bucket is very large and thus set to "Requester Pays".
 
@@ -103,6 +103,24 @@ The curated e-/s-/pa-/ipaQTL benchmarking data can be downloaded from the follow
 [paQTL Data](https://storage.googleapis.com/borzoi-paper/qtl/paqtl/)
[ipaQTL Data](https://storage.googleapis.com/borzoi-paper/qtl/ipaqtl/)
 
+### Paper Replication
+To replicate the results presented in the paper, visit the [borzoi-paper repository](https://github.com/calico/borzoi-paper.git). This repository contains scripts for **training**, **evaluating**, and **analyzing** the published model.
+
+### Tutorials
+Todo.
+
+#### Data Processing
+Todo.
+
+#### Model Training
+Todo.
+
+#### Variant Scoring
+Todo.
+
+#### Sequence Attribution
+Todo.
+
 ### Example Notebooks
 The following notebooks contain example code for predicting and interpreting genetic variants.
 

From 933f8c1a5b102435d11f8c7070f96143733dd22e Mon Sep 17 00:00:00 2001
From: johli
Date: Tue, 1 Oct 2024 15:25:29 -0700
Subject: [PATCH 5/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 85cca31..1bbf9ce 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ The curated e-/s-/pa-/ipaQTL benchmarking data can be downloaded from the follow
 [ipaQTL Data](https://storage.googleapis.com/borzoi-paper/qtl/ipaqtl/)
 
 ### Paper Replication
-To replicate the results presented in the paper, visit the [borzoi-paper repository](https://github.com/calico/borzoi-paper.git). This repository contains scripts for **training**, **evaluating**, and **analyzing** the published model.
+To replicate the results presented in the paper, visit the [borzoi-paper repository](https://github.com/calico/borzoi-paper.git). This repository contains scripts for **training**, **evaluating**, and **analyzing** the published model, and for processing the **training data**.
 
 ### Tutorials
 Todo.

From dd8db20a45cb49299e26d4a8476b8c594d356ab0 Mon Sep 17 00:00:00 2001
From: johli
Date: Tue, 1 Oct 2024 15:39:00 -0700
Subject: [PATCH 6/7] Update README.md

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index 1bbf9ce..50a627b 100644
--- a/README.md
+++ b/README.md
@@ -49,6 +49,10 @@ Finally, the code base relies on a number of environment variables. For convenie
 cd borzoi
 conda activate borzoi_py310
 ./env_vars.sh
+cd ../baskerville
+./env_vars.sh
+cd ../westminster
+./env_vars.sh
 ```
 
 Alternatively, the environment variables can be set manually:
@@ -57,11 +61,21 @@ export BORZOI_DIR=/home//borzoi
 export PATH=$BORZOI_DIR/src/scripts:$PATH
 export PYTHONPATH=$BORZOI_DIR/src/scripts:$PYTHONPATH
 
+export BASKERVILLE_DIR=/home//baskerville
+export PATH=$BASKERVILLE_DIR/src/baskerville/scripts:$PATH
+export PYTHONPATH=$BASKERVILLE_DIR/src/baskerville/scripts:$PYTHONPATH
+
+export WESTMINSTER_DIR=/home//westminster
+export PATH=$WESTMINSTER_DIR/src/westminster/scripts:$PATH
+export PYTHONPATH=$WESTMINSTER_DIR/src/westminster/scripts:$PYTHONPATH
+
 export BORZOI_CONDA=/home//anaconda3/etc/profile.d/conda.sh
 export BORZOI_HG38=$BORZOI_DIR/examples/hg38
 export BORZOI_MM10=$BORZOI_DIR/examples/mm10
 ```
 
+*Note*: The *baskerville* and *westminster* variables are only required for data processing and model training.
+
 ### Model Availability
 The model weights can be downloaded as .h5 files from the URLs below. We trained a total of 4 model replicates with identical train, validation and test splits (test = fold3, validation = fold4 from [sequences_human.bed.gz](https://github.com/calico/borzoi/blob/main/data/sequences_human.bed.gz)).
 

From 10ed86703662f54e4594768522cf8a3faaf336b6 Mon Sep 17 00:00:00 2001
From: johli
Date: Tue, 1 Oct 2024 20:24:04 -0700
Subject: [PATCH 7/7] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 50a627b..0cbc43b 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,8 @@ These repositories further depend on a number of python packages (which are auto
 - Tensorflow == 2.15.x (see [https://www.tensorflow.org/install/pip](https://www.tensorflow.org/install/pip))
 
 *Note*: The example notebooks require jupyter, which can be installed with `pip install notebook`.
-A new conda environment can be created with `conda create -n borzoi_py310 python=3.10`.
+A new conda environment can be created with `conda create -n borzoi_py310 python=3.10`.
+Some of the scripts in this repository start multi-process jobs and require [slurm](https://slurm.schedmd.com/).
 
 Finally, the code base relies on a number of environment variables. For convenience, these can be configured in the active conda environment with the 'env_vars.sh' script.
 ```sh