Johannes Zenn · Dominik Gond · Fabian Jirasek · Robert Bamler
This is the official GitHub repository for our work "Balancing Molecular Information and Empirical Data in the Prediction of Physico-Chemical Properties", in which we propose a hybrid method for combining molecular descriptors with representation learning for the (exemplary) task of predicting activity coefficients.
Predicting the physico-chemical properties of pure substances and mixtures is a central task in thermodynamics. Established prediction methods range from fully physics-based ab-initio calculations, which are only feasible for very simple systems, over descriptor-based methods that use some information on the molecules to be modeled together with fitted model parameters (e.g., quantitative-structure-property relationship methods or classical group contribution methods), to representation-learning methods, which may, in extreme cases, completely ignore molecular descriptors and extrapolate only from existing data on the property to be modeled (e.g., matrix completion methods). In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine learning literature, which uses uncertainty estimates to trade off between the two approaches. The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example. The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.
We recommend using a virtual environment to avoid conflicts with packages
already installed on the system.
First, clone the repository using git clone git@github.com:jzenn/gnn-mcm.git
and navigate to the repository folder with cd gnn-mcm.
You can create a virtual environment with either of the two methods given below.
- install miniconda
- create a new environment
conda create python=3.9 --name gnn-mcm
- activate the environment
conda activate gnn-mcm
Please make sure that Python 3.9 is used for the installation; otherwise, you might
run into version conflicts with torch.
- create a new environment
python3.9 -m venv venv
- activate the environment
source venv/bin/activate
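Once either environment is active, the interpreter version can be checked before installing anything. The following is a minimal sketch, not part of the repository; the function name is_required_python is hypothetical.

```python
import sys

# Python version that the installation instructions ask for.
REQUIRED = (3, 9)


def is_required_python(version_info=None):
    """Return True if the (major, minor) version equals REQUIRED.

    Hypothetical helper; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    if not is_required_python():
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Expected Python 3.9, found {found}; torch may fail to install.")
```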
First, make sure that pip==24.3.1 is installed (python3.9 -m pip install --upgrade pip==24.3.1).
Then, install the packages listed in requirements.txt via
python3.9 -m pip install -r requirements.txt
After all requirements have been installed, run the following.
# replace torch-1.10.0+cpu by torch-1.10.0+{cu102,cu113,cu111}
# depending on availability of accelerator
python3.9 -m pip install torch-cluster==1.6.0 -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
python3.9 -m pip install torch-scatter==2.0.9 -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
python3.9 -m pip install torch-sparse==0.6.13 -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
python3.9 -m pip install torch-spline-conv==1.2.1 -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
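The four commands differ from each other only in the package pin, and from their accelerator variants only in the build tag at the end of the wheel index URL. The sketch below generates them for a chosen tag. It is an illustration rather than a script shipped with the repository, and the names wheel_index_url and install_commands are hypothetical. The version pins and tags are copied from the commands and comment above.

```python
TORCH_VERSION = "1.10.0"
# Build tags named in the comment above; "cpu" is what the commands use.
KNOWN_TAGS = ("cpu", "cu102", "cu111", "cu113")
# Package pins copied from the install commands above.
PACKAGES = {
    "torch-cluster": "1.6.0",
    "torch-scatter": "2.0.9",
    "torch-sparse": "0.6.13",
    "torch-spline-conv": "1.2.1",
}


def wheel_index_url(tag="cpu"):
    """Return the wheel index URL for the given build tag (hypothetical helper)."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown build tag {tag!r}; expected one of {KNOWN_TAGS}")
    return f"https://data.pyg.org/whl/torch-{TORCH_VERSION}+{tag}.html"


def install_commands(tag="cpu"):
    """Return the four pip commands as strings, in the order listed above."""
    url = wheel_index_url(tag)
    return [
        f"python3.9 -m pip install {name}=={version} -f {url}"
        for name, version in PACKAGES.items()
    ]


if __name__ == "__main__":
    for command in install_commands("cpu"):
        print(command)
```

Passing "cu113" instead of "cpu" yields the same commands with the CUDA 11.3 wheel index.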
We provide a detailed description of the data preparation process in the following.
We provide the processed CSV file for the data used in Medina et al. (2022)
(taken from their repository) at data/medina_2022/medina_data.csv.
Additionally, we provide the JSON file that contains the embeddings for the
molecules.
The Dortmund Data Bank 2019 (DDB) is not publicly available but can be downloaded
with a paid subscription.
If you want to train on the DDB dataset, the data should be processed as described in
Jirasek et al. (2020).
The provided CSV file should contain the following columns:
- log_gamma_exp: log of the experimental activity coefficient
- solute_idx: index of the solute
- solvent_idx: index of the solvent
- solute_smiles: SMILES string of the solute
- solvent_smiles: SMILES string of the solvent
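Before training on your own data, it can help to confirm that the CSV header contains all of these columns. The following is a minimal sketch using only the standard library; the function name missing_columns is hypothetical and the check is not part of the repository.

```python
import csv

# Column names copied from the list above.
REQUIRED_COLUMNS = (
    "log_gamma_exp",
    "solute_idx",
    "solvent_idx",
    "solute_smiles",
    "solvent_smiles",
)


def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header.

    csv_file is an open text file (or any iterable of lines).
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # None for an empty file
    return [column for column in REQUIRED_COLUMNS if column not in header]
```

For example, `with open("data/medina_2022/medina_data.csv", newline="") as f: print(missing_columns(f))` should print an empty list if the provided file follows the same layout.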
The values of the *_smiles keys are matched with the corresponding objects in a
JSON file.
The JSON file has the same structure as for the data of Medina et al. (2022)
(see data/medina_2022/medina_data.csv and
data/medina_2022/featurized_molecules.json).
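A quick consistency check is to confirm that every SMILES string in the CSV has a counterpart in the JSON file. The sketch below assumes that the loaded JSON is a mapping keyed by SMILES string; this is an assumption and should be verified against data/medina_2022/featurized_molecules.json. The function name unmatched_smiles is hypothetical.

```python
def unmatched_smiles(rows, featurized):
    """Return the sorted SMILES strings in rows that have no entry in featurized.

    rows: iterable of dicts with "solute_smiles" and "solvent_smiles" keys,
          e.g. as produced by csv.DictReader.
    featurized: mapping assumed (not confirmed) to be keyed by SMILES string,
          e.g. the result of json.load on the featurized-molecules file.
    """
    smiles = set()
    for row in rows:
        smiles.add(row["solute_smiles"])
        smiles.add(row["solvent_smiles"])
    return sorted(s for s in smiles if s not in featurized)
```

An empty result means that every solute and solvent in the CSV can be matched.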
The file train_medina_example.sh is an executable script for training the
GNN-MCM on the dataset used by Medina et al. (2022).
You can run it with
./train_medina_example.sh
to test whether the installed libraries work. The resulting model will not be useful, because real training requires many more epochs. If the script runs for a few minutes and then prints test results (including test MSE and MAE) to the terminal, your installation works.
To replicate the results in our paper, run a command of the following form,
python3.9 main.py <arguments for training>
where the exact <arguments for training> that we used in our experiments are listed
in the files in the directory hyperparameters.
When taking these arguments, make sure that you
- replace each <insert-path> by a suitable path (cf. example in file train_medina_example.sh);
- replace <insert-name> with an identifier of your choice (the training script will create a subdirectory with this name in the directory specified by --experiment_base_path, where it will store checkpoints and results);
- replace <ensemble-id> with a number from 1 to 10 to specify the current train/test split for 10-fold cross validation;
- replace <M> and <N> by the number of solutes and solvents that the dataset contains;
- replace <M'> and <N'> (if present) by the number of solutes and solvents that should be excluded from the training set for zero-shot prediction;
- concatenate all arguments into a single space-separated line.
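The substitution and concatenation steps above can be sketched as a small helper. This is an illustration under assumptions, not a tool provided by the repository. It assumes the argument text has been read into a string; the function name fill_arguments is hypothetical. The helper refuses to return a line that still contains a placeholder of the form <...>.

```python
import re

# Matches placeholders such as <insert-path>, <ensemble-id>, <M>, or <N'>.
PLACEHOLDER = re.compile(r"<[^<>\s]+>")


def fill_arguments(template, values):
    """Fill placeholders in template and return one space-separated line.

    template: argument text, possibly spread over several lines.
    values: mapping from placeholder (e.g. "<insert-name>") to its replacement.
    Raises ValueError if any placeholder is left unreplaced.
    """
    filled = template
    for placeholder, value in values.items():
        filled = filled.replace(placeholder, str(value))
    leftover = PLACEHOLDER.findall(filled)
    if leftover:
        raise ValueError(f"unreplaced placeholders: {sorted(set(leftover))}")
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(filled.split())
```

Calling the helper once per value of <ensemble-id> from 1 to 10 gives one argument line per train/test split, each of which can be appended to python3.9 main.py.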
Distributed under the MIT License. See LICENSE.MIT for more information.
@article{zenn2024balancing,
title={Balancing Molecular Information and Empirical Data in the Prediction of Physico-Chemical Properties},
author={Johannes Zenn and Dominik Gond and Fabian Jirasek and Robert Bamler},
journal={arXiv preprint arXiv:2406.08075},
year={2024}
}