This repository implements a pipeline for posterior sampling in simulation-based inference settings using diffusion models. It combines neural guidance and denoising models to enable robust Bayesian inference for scientific simulators. It also includes a reference implementation of the Generalized Bayesian Inference (GBI) pipeline from *Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation* (Gao et al., 2023) for benchmarking purposes.
- Diffusion-based posterior sampling pipeline
- Neural guidance and denoising model training
- Implementation of multiple scientific simulators
- MCMC sampling with NUTS kernel
- Comprehensive benchmarking tools
- Support for custom datasets
The pipeline implements conditional, guidance-controlled diffusion sampling based on *Score-Based Generative Modeling through Stochastic Differential Equations* (Song et al., 2020). The posterior gradient is decomposed via Bayes' rule as:

$$\nabla_{\theta_\tau} \log p(\theta_\tau \mid x_t) = \nabla_{\theta_\tau} \log p_\psi(x_t \mid \theta_\tau) + \nabla_{\theta_\tau} \log p_\psi(\theta_\tau)$$

where:

$$p_\psi(x_t \mid \theta_\tau) = \frac{1}{Z} \exp\left(-\beta\, s_\psi(\theta_\tau, x_t, \tau)\right), \qquad \nabla_{\theta_\tau} \log p_\psi(\theta_\tau) = f_\psi(\theta_\tau, \tau)$$
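Conceptually, the two terms combine into a single guided score. The sketch below is illustrative only: the network call signatures and the autograd-based guidance gradient are assumptions, not the repository's API.

```python
import torch

def guided_score(theta_t, x_target, tau, guidance_net, denoiser_net, beta):
    """Combine guidance and prior terms into the posterior score.

    guidance_net : s_psi(theta_tau, x_t, tau) -> predicted distance (per sample)
    denoiser_net : f_psi(theta_tau, tau)      -> prior score term
    """
    with torch.enable_grad():
        theta_t = theta_t.detach().requires_grad_(True)
        # Likelihood term: grad log p_psi(x_t | theta_tau) = -beta * grad s_psi
        cost = guidance_net(theta_t, x_target, tau).sum()
        (grad_cost,) = torch.autograd.grad(cost, theta_t)
    likelihood_score = -beta * grad_cost
    prior_score = denoiser_net(theta_t, tau)  # grad log p_psi(theta_tau)
    return likelihood_score + prior_score
```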
- Python >= 3.12
- pip
- git
- Clone the repository:

  ```bash
  git clone [email protected]:mackelab/neuralgbi_diffusion.git
  ```

- Install Poetry (dependency management):

  ```bash
  pip install poetry
  ```

- Install dependencies:

  ```bash
  poetry install --no-root
  ```
The project provides a unified command-line interface for all operations:

```bash
python -m gbi_diff <action> <options>
```

Use `--help` or `-h` with any command to see the available options.
The pipeline supports multiple scientific simulators:
- Two Moons (`two_moons`)
- SIR Epidemiological Model (`SIR`)
- Lotka-Volterra Population Dynamics (`lotka_volterra`)
- Inverse Kinematics (`inverse_kinematics`)
- Gaussian Mixture (`gaussian_mixture`)
- Linear Gaussian (`linear_gaussian`)
- Uniform 1D (`uniform`)
Generate data for a specific simulator:
```bash
python -m gbi_diff generate-data --dataset-type <type> --size <n_samples> --path data/
```
Recommended dataset sizes:
- Training: 10,000 samples
- Validation: 1,000 samples
- Observed data: 10 samples
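For example, the training split for the Two Moons simulator could be generated with (values illustrative):

```bash
python -m gbi_diff generate-data --dataset-type two_moons --size 10000 --path data/
```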
For bulk dataset generation, use the provided script:
```bash
./generate_datasets.sh
```
When adding custom datasets (`*.pt` files), include the following fields:

| Field | Description | Shape/Type |
| --- | --- | --- |
| `_theta` | Parameter features | `(n_samples, n_param_features)` |
| `_x` | Simulator outcomes | `(n_samples, n_sim_out_features)` |
| `_target_noise_std` | Noise standard deviation | `float` |
| `_seed` | Random seed | `int` |
| `_diffusion_scale` | Misspecification parameter | `float` |
| `_max_diffusion_steps` | Misspecification parameter | `int` |
| `_n_misspecified` | Number of misspecified samples | `int` |
| `_n_noised` | Number of noised samples | `int` |
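For example, a compliant file could be written as follows. This is a sketch: the flat-dictionary layout and all concrete values are assumptions; only the field names come from the table above.

```python
import torch

n_samples, n_param_features, n_sim_out_features = 10_000, 2, 2  # illustrative

torch.save(
    {
        "_theta": torch.rand(n_samples, n_param_features),   # parameter features
        "_x": torch.randn(n_samples, n_sim_out_features),    # simulator outcomes
        "_target_noise_std": 0.05,
        "_seed": 42,
        "_diffusion_scale": 0.1,
        "_max_diffusion_steps": 100,
        "_n_misspecified": 0,
        "_n_noised": 0,
    },
    "data/custom_train.pt",  # hypothetical path
)
```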
The guidance model ($s_\psi(\theta_\tau, x_t, \tau)$) is trained using a modified loss function:
$$ \mathcal{L} = \mathbb{E}_{\theta, x \sim \mathcal{D},\; \tau \sim U[0, T-1]}\left[\left\| s_\psi(\theta_\tau, x_t, \tau) - d(x, x_t) \right\|^2\right] $$
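A minimal sketch of this objective, assuming a Euclidean distance $d$; `noise_fn` (the forward-noising step, cf. the DDPM sketch below) and all signatures are illustrative:

```python
import torch

def guidance_loss(guidance_net, noise_fn, theta, x, x_target, tau):
    """One guidance training step: theta, x are a batch from D, x_target the
    conditioning targets x_t, and tau is drawn uniformly from U[0, T-1]."""
    theta_tau = noise_fn(theta, tau)                     # forward-noised parameters
    d = torch.linalg.vector_norm(x - x_target, dim=-1)   # assumed Euclidean d(x, x_t)
    pred = guidance_net(theta_tau, x_target, tau)        # s_psi(theta_tau, x_t, tau)
    return ((pred - d) ** 2).mean()                      # MSE regression onto d
```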
Train the guidance model:
```bash
python -m gbi_diff train-guidance
```

Configuration: modify `config/train_guidance.yaml`.
The diffusion model ($f_\psi(\theta_\tau, \tau)$) follows the approach from *Denoising Diffusion Probabilistic Models* (Ho et al., 2020).
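In this parameterization the denoiser is trained to predict the noise $\epsilon$ added during the forward process; the prior score used above then follows as $f_\psi(\theta_\tau, \tau) = -\epsilon_\psi(\theta_\tau, \tau) / \sqrt{1 - \bar{\alpha}_\tau}$. A minimal training-step sketch (the linear schedule and signatures are illustrative):

```python
import torch

def ddpm_loss(denoiser, theta, T=1000):
    """One DDPM training step on a parameter batch (Ho et al., 2020)."""
    betas = torch.linspace(1e-4, 0.02, T)            # illustrative linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    tau = torch.randint(0, T, (theta.shape[0],))     # tau ~ U[0, T-1]
    eps = torch.randn_like(theta)
    ab = alpha_bar[tau].unsqueeze(-1)
    theta_tau = ab.sqrt() * theta + (1.0 - ab).sqrt() * eps  # q(theta_tau | theta)
    return ((denoiser(theta_tau, tau) - eps) ** 2).mean()    # noise-prediction loss
```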
Train the diffusion model:
```bash
python -m gbi_diff train-diffusion
```

Configuration: modify `config/train_diffusion.yaml`.
Sample from the posterior distribution:
```bash
python -m gbi_diff diffusion-sample --diffusion-ckpt <path> --guidance-ckpt <path> --n-samples <count> [--plot]
```

Configuration: modify `config/sampling_diffusion.yaml`.

Key parameter: `beta` controls sample diversity. Since $p_\psi(x_t \mid \theta_\tau) \propto \exp(-\beta\, s_\psi(\theta_\tau, x_t, \tau))$, larger values sharpen the likelihood term and concentrate samples around the observation, while smaller values yield more diverse samples.
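To illustrate where the guided score enters, below is a simplified annealed-Langevin-style sampling loop. Step sizes, schedule, and signatures are illustrative assumptions; this is not the repository's exact sampler.

```python
import torch

@torch.no_grad()
def sample_posterior(score_fn, n_samples, dim, T=1000, steps_per_level=2):
    """Annealed Langevin sampling driven by the guided posterior score.

    score_fn(theta, tau) is assumed to return the combined score, e.g. the
    `guided_score` sketch above with networks, target x_t, and beta bound in.
    """
    step_sizes = torch.linspace(1e-2, 1e-5, T)  # larger steps at high noise levels
    theta = torch.randn(n_samples, dim)         # initialize from pure noise
    for t in reversed(range(T)):                # anneal from tau = T-1 down to 0
        tau = torch.full((n_samples,), t, dtype=torch.long)
        eps = step_sizes[T - 1 - t]
        for _ in range(steps_per_level):        # Langevin updates per noise level
            score = score_fn(theta, tau)
            theta = theta + 0.5 * eps * score + eps.sqrt() * torch.randn_like(theta)
    return theta
```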
Train the potential function:
```bash
python -m gbi_diff train-potential
```

Sample using MCMC with a NUTS kernel:

```bash
python -m gbi_diff mcmc-sample --checkpoint <path> --size <count> [--plot]
```
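For orientation, here is a minimal sketch of potential-based NUTS sampling, assuming a Pyro-style kernel. Whether the repository actually uses Pyro, as well as `potential_net` and `x_o` (a trained potential network and the observation), are assumptions.

```python
import torch
from pyro.infer import MCMC, NUTS

def potential_fn(params):
    # Hypothetical: trained potential network evaluated at the observation x_o
    return potential_net(params["theta"], x_o).sum()

kernel = NUTS(potential_fn=potential_fn)                # NUTS over the potential
mcmc = MCMC(kernel, num_samples=1000, warmup_steps=200,
            initial_params={"theta": torch.zeros(2)})   # explicit init required
mcmc.run()
samples = mcmc.get_samples()["theta"]
```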
```
├── config/                   # Configuration files
│   ├── train_guidance.yaml
│   ├── train_diffusion.yaml
│   ├── sampling_diffusion.yaml
│   └── train.yaml
├── data/                     # Dataset storage
├── gbi_diff/                 # Core implementation
│   ├── dataset/              # Dataset handling and implementation
│   ├── diffusion/            # Toolset for training guidance and denoiser
│   ├── model/                # Network architectures + Lightning module
│   ├── sampling/             # Toolset for sampling
│   ├── scripts/              # Functions called by the entrypoint
│   ├── utils/                # Utilities
│   ├── __init__.py
│   ├── __main__.py           # Main entrypoint (generated by pyargwriter)
│   └── entrypoint.py         # Contains the entrypoint class
├── results/                  # Training and sampling outputs
├── generate_datasets.sh      # Dataset generation script
├── poetry.lock               # Dependency lock file
└── pyproject.toml            # Project metadata and dependencies
```
Please ensure any contributions:
- Follow the existing code style
- Include appropriate tests
- Update documentation as needed
- Maintain compatibility with the existing data format
MIT License
If you use this code in your research, please cite:
```bibtex
@misc{vetter2025gbidiff,
  title={Generalized Diffusion Simulation Based Inference},
  author={Vetter, Julius and Uhrich, Robin},
  year={2025},
  url={https://github.com/mackelab/neuralgbi_diffusion}
}
```