- Date: Sunday, Dec 3, 2023
- Author: ChuNan Liu
- Email: [email protected]
docker build -t $USER/esmfold:base .
- -t $USER/esmfold:base: tag the image with the name $USER/esmfold and the tag base.
  - You can omit $USER if you want; the image is then named esmfold:base, which is the name used in the docker run commands below.
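The docker run commands later in this document refer to the image as esmfold:base, without the $USER/ prefix, so the name used at build time and at run time has to match. One way to keep them consistent is to compute the name once. The sketch below is only an illustration; the helper name image_name is hypothetical and not part of this repository.

```shell
#!/bin/sh
# image_name is a hypothetical helper, not part of this repository.
# It prints "<prefix>/esmfold:base", or "esmfold:base" when the prefix is empty.
image_name() {
    prefix=$1
    echo "${prefix:+$prefix/}esmfold:base"
}

IMAGE=$(image_name "${USER:-}")
# Printed only; remove the echo to actually run the build.
echo "docker build -t $IMAGE ."
```

If you build with the prefix, use the same $IMAGE value in place of esmfold:base in the run commands.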
This image is based on the nvidia/cuda:11.3.1-devel-ubuntu20.04 image.
As you may have noticed, some of the files installed in the Dockerfile are downloaded with gdown, a Python package that downloads files from Google Drive. These files are:
- openfold.tar.gz: the official release of OpenFold
  - My modifications: I commented out the flash-attn package in the default environment.yml file because it is not compatible with the latest version of ESM.
- esm-main.tar.gz: the official release of ESM.
- esm2_t36_3B_UR50D.pt: the pre-trained ESM2 model.
- esm2_t36_3B_UR50D-contact-regression.pt: the pre-trained ESM2 model with contact regression.
- esmfold_3B_v1.pt: the pre-trained ESMFold model.
Although the three .pt checkpoint files would otherwise be downloaded on the first run of the container, it is better to include them in the image so that they are not downloaded again every time a container is started.
The Google Drive folder containing the above files is esmfold.
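If you fetch these files by hand instead, a quick check that all five are present can save a failed build. The following is a minimal sketch: check_assets is a hypothetical helper, and the directory argument is simply wherever you downloaded the files to. Where the Dockerfile expects them is not covered here.

```shell
#!/bin/sh
# check_assets is a hypothetical helper: it lists which of the files named
# above are missing from a directory and returns non-zero if any are missing.
check_assets() {
    dir=$1
    missing=0
    for f in openfold.tar.gz esm-main.tar.gz \
             esm2_t36_3B_UR50D.pt \
             esm2_t36_3B_UR50D-contact-regression.pt \
             esmfold_3B_v1.pt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the current directory and report the result.
if check_assets .; then
    echo "all files present"
fi
```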
The default entrypoint for the image, as specified in the Dockerfile, is:
ENTRYPOINT ["zsh", "run-esm-fold.sh"]
The content of run-esm-fold.sh is:
#!/bin/zsh
# init conda
source $HOME/.zshrc
# activate py39-esmfold
conda activate py39-esmfold
# run esm-fold
esm-fold $@
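Because the entrypoint is given in exec form, anything placed after the image name in docker run is appended to it and therefore reaches esm-fold as arguments. The sketch below imitates that forwarding without Docker; fake_esm_fold and entrypoint are stand-ins invented for this illustration, and the quoted form "$@" is used so the sketch behaves the same under any POSIX shell.

```shell
#!/bin/sh
# fake_esm_fold stands in for the real esm-fold CLI: it prints each argument
# it receives on its own line, wrapped in brackets.
fake_esm_fold() {
    printf '[%s]\n' "$@"
}

# entrypoint stands in for `zsh run-esm-fold.sh`: it forwards its arguments.
entrypoint() {
    fake_esm_fold "$@"
}

# Similar in spirit to: docker run --rm esmfold:base -i ... -o ...
entrypoint -i /home/vscode/input/1a2y-HLC.fasta -o /home/vscode/output
```

Each flag and value arrives as a separate argument, which is why the esm-fold options can be written directly after the image name.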
Run the following command to see the help information for esm-fold:
docker run --rm esmfold:base --help
stdout:
usage: esm-fold [-h] -i FASTA -o PDB [-m MODEL_DIR]
[--num-recycles NUM_RECYCLES]
[--max-tokens-per-batch MAX_TOKENS_PER_BATCH]
[--chunk-size CHUNK_SIZE] [--cpu-only] [--cpu-offload]
optional arguments:
-h, --help show this help message and exit
-i FASTA, --fasta FASTA
Path to input FASTA file
-o PDB, --pdb PDB Path to output PDB directory
-m MODEL_DIR, --model-dir MODEL_DIR
Parent path to the pre-trained ESM data directory.
--num-recycles NUM_RECYCLES
Number of recycles to run. Defaults to number used in
training (4).
--max-tokens-per-batch MAX_TOKENS_PER_BATCH
Maximum number of tokens per gpu forward-pass. This
will group shorter sequences together for batched
prediction. Lowering this can help with out of memory
issues, if these occur on short sequences.
--chunk-size CHUNK_SIZE
Chunks axial attention computation to reduce memory
usage from O(L^2) to O(L). Equivalent to running a for
loop over chunks of of each dimension. Lower values
will result in lower memory usage at the cost of
speed. Recommended values: 128, 64, 32. Default: None.
--cpu-only CPU only
--cpu-offload Enable CPU offloading
If GPUs are available:
$ mkdir -p ./example/{input,output,logs}
$ docker run --rm --gpus all \
-v ./example/input:/home/vscode/input \
-v ./example/output:/home/vscode/output \
esmfold:base \
-i /home/vscode/input/1a2y-HLC.fasta \
-o /home/vscode/output \
> ./example/logs/pred.log 2>./example/logs/pred.err
If no GPUs are available, add the --cpu-only flag:
$ mkdir -p ./example/{input,output,logs}
$ docker run --rm \
-v ./example/input:/home/vscode/input \
-v ./example/output:/home/vscode/output \
esmfold:base \
--cpu-only \
-i /home/vscode/input/1a2y-HLC.fasta \
-o /home/vscode/output \
> ./example/logs/pred.log 2>./example/logs/pred.err
- -i /home/vscode/input/1a2y-HLC.fasta: input FASTA file
- -o /home/vscode/output: path to the directory for the predicted structure
- > ./example/logs/pred.log 2>./example/logs/pred.err: redirect stdout and stderr to log files
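The GPU and CPU variants differ only in the --gpus all option and the --cpu-only flag, so the choice can be scripted. The following is a rough sketch rather than part of the image: esmfold_cmd and detect_mode are hypothetical helpers, GPU detection by calling nvidia-smi on the host is an assumption, and host paths are made absolute with $PWD because some Docker versions do not accept relative paths in -v.

```shell
#!/bin/sh
# Print the docker run command for a FASTA file in GPU or CPU mode.
# Usage: esmfold_cmd <gpu|cpu> <fasta-file-name>
esmfold_cmd() {
    mode=$1
    fasta=$2
    if [ "$mode" = "gpu" ]; then
        set -- docker run --rm --gpus all
    else
        set -- docker run --rm
    fi
    set -- "$@" \
        -v "$PWD/example/input:/home/vscode/input" \
        -v "$PWD/example/output:/home/vscode/output" \
        esmfold:base
    # esm-fold flags go after the image name.
    [ "$mode" = "gpu" ] || set -- "$@" --cpu-only
    set -- "$@" -i "/home/vscode/input/$fasta" -o /home/vscode/output
    printf '%s\n' "$*"
}

# Assumption: a GPU is usable if nvidia-smi exists on the host and succeeds.
detect_mode() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo gpu
    else
        echo cpu
    fi
}

# Prints the command instead of running it, so it can be reviewed first.
esmfold_cmd "$(detect_mode)" 1a2y-HLC.fasta
```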
For other ESMFold flags, refer to the documentation section of the ESMFold repo:
- --num-recycles NUM_RECYCLES: Number of recycles to run. Defaults to the number used in training (4).
- --max-tokens-per-batch MAX_TOKENS_PER_BATCH: Maximum number of tokens per GPU forward pass. This will group shorter sequences together for batched prediction. Lowering this can help with out-of-memory issues, if these occur on short sequences.
- --chunk-size CHUNK_SIZE: Chunks axial attention computation to reduce memory usage from O(L^2) to O(L). Equivalent to running a for loop over chunks of each dimension. Lower values will result in lower memory usage at the cost of speed. Recommended values: 128, 64, 32. Default: None.
- --cpu-only: CPU only
- --cpu-offload: Enable CPU offloading
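The recommended --chunk-size values suggest a simple fallback when memory is tight: try 128 first, then 64, then 32. The sketch below is an illustration under stated assumptions, not part of the image. try_chunk_sizes is a hypothetical helper; it assumes a successful prediction leaves at least one .pdb file in the output directory and that the directory is empty beforehand. The exit status of the run is deliberately ignored, since it is not assumed that a failed prediction returns non-zero.

```shell
#!/bin/sh
# Run a command with decreasing --chunk-size values (128, 64, 32) and stop
# after the first run that leaves a .pdb file in the output directory.
# Prints the chunk size that worked; returns 1 if none did.
# Usage: try_chunk_sizes <host-output-dir> <command> [args...]
try_chunk_sizes() {
    out_dir=$1
    shift
    for size in 128 64 32; do
        # The flag is appended last, so it lands after the image name
        # when the command is a docker run line.
        "$@" --chunk-size "$size" || true
        if ls "$out_dir"/*.pdb >/dev/null 2>&1; then
            echo "$size"
            return 0
        fi
    done
    return 1
}

# Example (not executed here):
#   try_chunk_sizes ./example/output docker run --rm --gpus all \
#       -v ./example/input:/home/vscode/input \
#       -v ./example/output:/home/vscode/output \
#       esmfold:base \
#       -i /home/vscode/input/1a2y-HLC.fasta -o /home/vscode/output
```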
If you want to override the entrypoint, pass the --entrypoint option to docker run before the image name; any arguments for the new entrypoint go after the image name:
docker run --rm --gpus all --entrypoint "/bin/zsh" esmfold:base -c "echo 'hello world'"
docker run --rm --gpus all --entrypoint "nvidia-smi" esmfold:base