Attention-Based Conditional Variational Autoencoder.
ACoVAE is an attention-based conditional variational autoencoder, a neural network architecture that builds on recent developments in attention-based methods.
- Operating system: Linux with CUDA support
- Hardware: GPU compatible with TensorFlow
- Python environment manager: Anaconda, Miniconda, or similar
└── ACoVAE/
├── doc/
│ └── Pharms_transformer.pdf # A visual representation of the ACoVAE neural network
├── utils/
│ ├── # Imports
│ ├── # data preparation - only for GTM universal maps.
│ ├── # Network implementation
│ └── # Utilities (SMILESParser class)
├── training_data/
│ └──
├── LICENSE # GNU General Public License v3.0
├── model_parameters_standard.yaml # Standard parameters for model training
└── requirements.txt # Requirements to create the python environment
git clone
mkdir model
conda create --name ACoVAE --file requirements.txt
pip install adabelief-tf CGRtools
conda list | grep tensorflow
conda list | grep cudatoolkit
conda list | grep cudnn
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
The input file is a descriptor matrix file in the .svm format.
It can be generated using the ISIDA/Fragmentor software (Open access, request the software using the form :
Please refer to the ISIDA/Fragmentor documentation to generate the descriptor matrix.
- Column 1: SMILES
- Columns 2 to end: the descriptor matrix, following the libSVM format. Each of these columns is a pair of values separated by a ":". The first value is the fragment's index in the header file (the .hdr file created by ISIDA/Fragmentor); the second value is the fragment count.
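As an illustration of the layout described above, here is a minimal sketch that splits one input line into its first column and a fragment-count mapping. It assumes that columns are whitespace-separated and that fragment counts are integers; the function name `parse_svm_line` and the example line are hypothetical and are not taken from the ACoVAE code or the example dataset.

```python
from typing import Dict, Tuple


def parse_svm_line(line: str) -> Tuple[str, Dict[int, int]]:
    """Split one line into (first column, {fragment_index: fragment_count}).

    Assumption: columns are separated by whitespace and counts are integers.
    """
    first_column, *pairs = line.split()
    counts: Dict[int, int] = {}
    for pair in pairs:
        # Each descriptor column has the form "index:count".
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    return first_column, counts


# Made-up line for illustration: a SMILES followed by three index:count pairs.
example_line = "CCO 1:2 5:1 12:3"
smiles, fragment_counts = parse_svm_line(example_line)
print(smiles, fragment_counts)
```

Only the fragments present in a compound appear on its line, so a dictionary keyed by fragment index is a natural sparse representation; the indices can be mapped back to fragment names through the .hdr file.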
An example dataset can be downloaded here :
conda activate ACoVAE
python3 -i ./training_data/chembl23_umap1.svm -m ./model/model_name -mp model_parameters_standard.yaml --log log_19-07-2022
For extended functions, consult the help command:
python --help
Check the log file for the model quality scores recorded at each epoch: epoch, loss, mask_acc, rec_rate, val_loss, val_mask_acc, val_rec_rate.
The values to follow are val_mask_acc (accuracy, i.e. the character-specific reconstruction rate) and val_rec_rate (reconstruction rate).
Select the model that suits your needs from the ./model folder.
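Picking the epoch with the best score can be automated. The sketch below assumes the log is a comma-separated file whose header row carries the column names listed above; the actual log layout is not documented here, so adjust the parsing if it differs. The function name `best_epoch` and the numbers in `sample_log` are invented for illustration only.

```python
import csv
import io


def best_epoch(log_text: str, metric: str = "val_rec_rate") -> dict:
    """Return the log row with the highest value of `metric`.

    Assumption: the log is CSV text with a header row naming each column.
    """
    rows = list(csv.DictReader(io.StringIO(log_text)))
    return max(rows, key=lambda row: float(row[metric]))


# Invented values, used only to show the mechanics.
sample_log = """epoch,loss,mask_acc,rec_rate,val_loss,val_mask_acc,val_rec_rate
0,1.90,0.61,0.02,1.75,0.65,0.03
1,1.20,0.80,0.35,1.15,0.82,0.40
2,0.95,0.88,0.55,1.05,0.86,0.38
"""

best = best_epoch(sample_log)
print(best["epoch"], best["val_rec_rate"])
```

Passing `metric="val_mask_acc"` ranks the epochs by character-level accuracy instead. The two metrics do not necessarily peak at the same epoch, so it is worth checking both before choosing a model.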
First, generate the descriptor vector for a known compound using ISIDA/Fragmentor. Note: the first column of the output file must be the ID of the compound, not its SMILES (see ./training_data/1.svm for an example).
Then, use the generated vector as a seed for generating new compounds.
mkdir sampled_smi
python -f ./training_data/1.svm -n 1000 -m ./model/model_name_99_0.98 -sp ./model/model_name_smi_parser.pkl -mp model_parameters_standard.yaml -o ./sampled_smi/known_compound_vector_sampled.smi
- sp : SMILES parser pickle object created during the network training. It is needed for sampling.
- m : the model created during the network training.
- mp : the model yaml created during the network training.
- n : Number of batches sampled per query.
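Sampling many batches from a single seed may return the same structure more than once. The sketch below removes exact duplicates while preserving order. It assumes that the output .smi file holds one SMILES per line as the first whitespace-separated field, which is not confirmed by this README; the function name `unique_smiles` is hypothetical.

```python
from typing import Iterable, List


def unique_smiles(lines: Iterable[str]) -> List[str]:
    """Return the distinct SMILES strings in first-seen order.

    Assumption: the SMILES is the first whitespace-separated field of a line.
    """
    seen = set()
    unique = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        smiles = fields[0]
        if smiles not in seen:
            seen.add(smiles)
            unique.append(smiles)
    return unique


# Made-up lines standing in for the contents of a sampled .smi file.
sampled = ["CCO", "CCN", "CCO", "", "c1ccccc1 extra_field"]
print(unique_smiles(sampled))
```

To apply it to real output, pass an open file handle, for example `unique_smiles(open("./sampled_smi/known_compound_vector_sampled.smi"))`. This compares strings only: two different SMILES spellings of the same molecule are kept as separate entries unless they are canonicalized first.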
- Arkadii Lin, Daniyar Mazitov, William Bort, Timur Madzhidov and Alexandre Varnek
- Kazan Federal University, Russia
- University of Strasbourg, France
Distributed under the GNU General Public License, version 3. See the LICENSE link in the additional resources for more information.