Here we introduce Machine Learning-guided Antigenic Evolution Prediction (MLAEP), which combines structure modeling, multi-task learning, and a genetic algorithm to model the viral fitness landscape and explore antigenic evolution via in silico directed evolution.
- data: required data
- scripts: Bash scripts
- src: Python code
- analysis: Jupyter notebooks for all the analysis and plots
- result: directory for the results
The model was trained and tested on one NVIDIA Tesla V100 GPU with 32 GB of memory.
Storing all intermediate files for all methods and all datasets requires approximately 100 GB of disk space.
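Because the disk requirement is large, it may be worth checking free space before starting. The following is a minimal sketch using only the Python standard library; the helper name `has_free_space` is hypothetical and not part of this repository, and the 100 GB threshold is simply the approximate figure quoted above.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


if __name__ == "__main__":
    # 100 GB is the approximate requirement stated in this README.
    if not has_free_space(".", 100):
        print("Warning: less than 100 GB free in the current directory.")
```

Run it from the directory where the intermediate files will be written, since `shutil.disk_usage` reports on the filesystem that contains the given path.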
The code has been tested on CentOS Linux release 7.9.2009 with conda 4.13.0 and Python 3.8.5. The software dependencies are listed in the environment.yml file.
- Create the conda environment from the environment.yml file:
conda env create -f environment.yml
- Activate the new conda environment:
conda activate covid_predict
- Update the huggingface_hub package:
conda install huggingface_hub=0.2.1 --force
The GISAID dataset requires authentication, and registration is needed to access the data, so we cannot provide it directly. You can download the data from the GISAID website: https://www.gisaid.org.
The trained model can be downloaded from https://drive.google.com/file/d/1em8015ooDVihvyKbcva9ty70mzoBFvgS/view?usp=sharing
Place the downloaded model under the folder trained_model
Here, we provide an example of model inference and variant synthesis using the selected high-risk variant (HRV) RBD sequences in data/pVNT_seq.csv (also used in Fig. 2b). After running the corresponding commands, you will find the model embeddings, the predicted escape/binding potentials, and the possible successors along the antigenic evolution direction in the result directory.
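Before running the example, it can help to confirm that the input file and the model folder are in place. The sketch below checks only the two paths named in this README; the function `missing_inputs` is a hypothetical helper, not part of the repository, and it does not validate file contents.

```python
from pathlib import Path

# Paths named in this README; adjust them if your checkout differs.
REQUIRED_PATHS = ["data/pVNT_seq.csv", "trained_model"]


def missing_inputs(repo_root=".", required=REQUIRED_PATHS):
    """Return the required paths (relative to repo_root) that do not exist."""
    root = Path(repo_root)
    return [p for p in required if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing before running the scripts:", ", ".join(missing))
    else:
        print("All expected inputs found.")
```

Run it from the repository root so that the relative paths resolve correctly.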
We also provide code to train the model with the deep mutational scanning data. Training the entire model takes around 10 hours on a V100 GPU. Inference completes in less than one minute. For the synthesis process, we used the OpenAttack toolkit to run and visualize the process. It was designed for natural language, and we extended it to protein sequences; the original repository can be found here: https://github.com/thunlp/OpenAttack/tree/bfedfa74f37c69db6d7092d9cc61822ee324919d. Synthesizing one variant takes approximately one minute.
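To illustrate the general idea of using a genetic algorithm for in silico directed evolution, here is a toy sketch in plain Python. It is not the MLAEP or OpenAttack implementation: the scoring function `toy_score` is a hypothetical stand-in for a learned fitness score, and the operators, population size, and selection scheme are assumptions chosen for brevity.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a model-predicted score: fraction of 'K' residues."""
    return seq.count("K") / len(seq)


def mutate(seq: str, rng: random.Random) -> str:
    """Substitute one randomly chosen position with a random amino acid."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover between two equal-length sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(seed_seq, score=toy_score, pop_size=20, generations=10, seed=0):
    """Run a small elitist genetic algorithm and return the best sequence found."""
    rng = random.Random(seed)
    population = [seed_seq] + [mutate(seed_seq, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Keep the top half as parents (elitism), then refill with offspring.
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=score)


if __name__ == "__main__":
    start = "ACDEFGHIKLMNPQRSTVWY"  # arbitrary example string, not real RBD data
    best = evolve(start)
    print(best, toy_score(best))
```

Because the top half of the population is carried over unchanged each generation, the best score never decreases. In a realistic setting, `toy_score` would be replaced by the model's predicted escape and binding potentials.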
To get the predictions and embeddings for the variants:
bash scripts/run_infer.sh
To synthesize the high-risk variants:
bash scripts/run_synthetic.sh
If you find our work useful or use our software in your research, please cite our paper: Han, Wenkai, et al. "Predicting the antigenic evolution of SARS-COV-2 with deep learning." bioRxiv (2022): 2022-06.