CSuite is a collection of synthetic datasets for benchmarking causal machine learning algorithms. Each dataset consists of:
- the true causal graph, for benchmarking causal discovery;
- 4000 rows of observational training data;
- 2000 rows of observational test data;
- interventional test data, for benchmarking estimation of average treatment effect (ATE) and conditional average treatment effect (CATE), 2000 rows per interventional environment.
The data was generated from known hand-crafted structural equation models (SEMs). Different datasets are intended to test different features of causal discovery and inference algorithms. CSuite was originally introduced in this paper. The data generation code for CSuite is publicly available.
CSuite datasets are versioned so that we can amend and add datasets, whilst ensuring backwards compatibility with older versions of the data. Full reproducibility with CSuite requires specifying the correct version.
The download URLs here are for the latest version.
Each dataset consists of the following files:
- `adj_matrix.csv`, which describes the causal graph used to generate the data; a value `1` in row `i`, column `j` indicates an edge from node `i` to node `j`;
- `train.csv`, the observational training data;
- `test.csv`, the observational test data;
- `interventions.json`, a JSON file containing interventional test data.
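As a minimal sketch of the adjacency-matrix convention described above, the following Python snippet converts a matrix into a list of directed edges. The helper name `edges_from_adjacency` and the toy matrix are illustrative and not part of CSuite. Loading a real file with something like `np.loadtxt("adj_matrix.csv", delimiter=",")` assumes the CSV has no header row, which should be checked against the downloaded file.

```python
import numpy as np


def edges_from_adjacency(adj):
    """Return (i, j) pairs for every entry adj[i, j] == 1 (edge from node i to node j)."""
    adj = np.asarray(adj)
    rows, cols = np.nonzero(adj == 1)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


# Toy 3-node chain 0 -> 1 -> 2 (hypothetical, not a CSuite graph).
toy_adj = np.array([[0, 1, 0],
                    [0, 0, 1],
                    [0, 0, 0]])
print(edges_from_adjacency(toy_adj))
```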
The interventional data JSON consists of pairs of interventional environments, which can be used to estimate (C)ATE. The two environments are the 'primary' and 'reference' environments. Conditional data was generated using HMC. The format of the interventional data is
{
"environments": [
{
"conditioning_idxs": <optional list containing indices of nodes that were conditioned on>,
"conditioning_values": <list of values set on the conditioning nodes>,
"effect_idxs": <list containing indices of nodes to be considered effect variables>,
"intervention_idxs": <list of indices of nodes that were acted on with do-intervention>,
"intervention_values": <list of values set on the intervention nodes in the primary do-intervention: for example, receiving a medicine>,
"intervention_reference": <list of values set on the intervention nodes in the reference do-intervention: for example, not receiving the medicine>,
"test_data": <array of data from the primary do-intervention, same number of columns as train.csv>,
"reference_data": <array of data from the reference do-intervention>
},
...
],
"metadata": {
"columns_to_nodes": <maps columns to their corresponding nodes, only important for vector-valued nodes>
}
}
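To illustrate how a primary/reference pair might be used, here is a hedged sketch that estimates the ATE as the difference in mean effect-variable values between `test_data` and `reference_data`. The function name `estimate_ate` and the tiny hand-written JSON are assumptions for illustration only. The sketch also assumes every node is scalar, so that node indices coincide with column indices; for vector-valued nodes the `columns_to_nodes` metadata would be needed.

```python
import json

import numpy as np


def estimate_ate(environment):
    """Difference in mean effect-variable values between primary and reference data."""
    test = np.asarray(environment["test_data"], dtype=float)
    ref = np.asarray(environment["reference_data"], dtype=float)
    effect = environment["effect_idxs"]
    # Assumes scalar nodes, so node indices can be used directly as column indices.
    return test[:, effect].mean(axis=0) - ref[:, effect].mean(axis=0)


# Tiny stand-in for interventions.json; all values are made up.
raw = json.dumps({
    "environments": [{
        "effect_idxs": [1],
        "intervention_idxs": [0],
        "intervention_values": [1.0],
        "intervention_reference": [0.0],
        "test_data": [[1.0, 2.5], [1.0, 3.5]],
        "reference_data": [[0.0, 0.5], [0.0, 1.5]],
    }],
    # The exact format of this field is an assumption.
    "metadata": {"columns_to_nodes": [0, 1]},
})
interventions = json.loads(raw)
ate = estimate_ate(interventions["environments"][0])
print(ate)
```

For an environment with conditioning nodes, the same difference would be read as a CATE estimate, on the assumption that both data arrays were sampled given the conditioning values.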
You can download CSuite datasets from any previous version using the following URL pattern
$ curl -L -O https://github.com/microsoft/csuite/releases/download/v<version>/csuite_<name>.zip
where `<name>` and `<version>` should be set appropriately.
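The same URL pattern can be filled in programmatically. The helper below is a small sketch; `release_url` is a hypothetical name, and the arguments shown are placeholders rather than real release identifiers.

```python
def release_url(name, version):
    """Fill in the release URL pattern for a dataset name and version string."""
    return (
        "https://github.com/microsoft/csuite/releases/download/"
        f"v{version}/csuite_{name}.zip"
    )


# Placeholder arguments; substitute a real dataset name and released version.
print(release_url("<name>", "<version>"))
```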
The uncompressed files listed under Data format are also directly available from a public storage account. These may be accessed either through their HTTP links, e.g. https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv, or through their equivalent Azure blob storage paths. To load these directly in Python:
import pandas as pd
# Load over HTTP
df = pd.read_csv("https://azuastoragepublic.blob.core.windows.net/datasets/csuite_linexp/train.csv")
# Load using `adlfs` (`pip install adlfs`)
df = pd.read_csv("az://[email protected]/csuite_linexp/train.csv")
If you use CSuite datasets in your work, please cite the following paper, which originally introduced these datasets:
@article{geffner2022deep,
title={Deep End-to-end Causal Inference},
author={Geffner, Tomas and Antoran, Javier and Foster, Adam and Gong, Wenbo and Ma, Chao and Kiciman, Emre and Sharma, Amit and Lamb, Angus and Kukla, Martin and Pawlowski, Nick and Allamanis, Miltiadis and Zhang, Cheng},
journal={arXiv preprint arXiv:2202.02195},
year={2022}
}
A two-node linear Gaussian system. The structural equations are
where
A two-node linear system with exponentially distributed noise. The structural equations are
where
A two-node non-linear system with Gaussian distributed noise. The structural equations are
where
The dataset is constructed so that
An example of Simpson's Paradox using a continuous SEM. The dataset is constructed so that
The structural equations are
where
A dataset exhibiting multi-modality that is suitable for benchmarking CATE estimation. Nonlinear function estimation is important since
The structural equations are
where
A larger dataset with a pyramidal graph structure. This dataset is constructed so that there are many possible choices of backdoor adjustment set for estimating the treatment effect of
A complete description of the structural equations can be found in the data generation code for CSuite.
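To make the idea of a backdoor adjustment set concrete, here is a minimal sketch on a toy linear SEM. The graph (Z -> T, Z -> Y, T -> Y), the coefficients, and the helper `ols_coefficients` are all hypothetical and unrelated to the actual CSuite structural equations. It compares a naive regression of the outcome on the treatment with one that also adjusts for the confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical SEM (not CSuite's): Z -> T, Z -> Y, T -> Y, true effect of T on Y is 2.
z = rng.normal(size=n)
t = z + rng.normal(size=n)
y = 2.0 * t + 3.0 * z + rng.normal(size=n)


def ols_coefficients(features, target):
    """Least-squares coefficients, with an intercept column appended last."""
    design = np.column_stack(features + [np.ones_like(target)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


naive_effect = ols_coefficients([t], y)[0]        # ignores the confounder Z
adjusted_effect = ols_coefficients([t, z], y)[0]  # adjusts for the backdoor set {Z}
print(naive_effect, adjusted_effect)
```

In this toy setting the adjusted coefficient should land near the true effect of 2, while the naive one is biased upward by the open backdoor path through Z. With several valid adjustment sets, as in this dataset, each choice would give a consistent estimate under correct model specification.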
A larger dataset that is similar to `large_backdoor`, but with many additional edges. The causal discovery challenge revolves around finding all arrows, which are scaled to be relatively weak, but which have significant predictive power for
A complete description of the structural equations can be found in the data generation code for CSuite.
Variable | Discrete/continuous |
---|---|
| | Discrete on |
| | Continuous |
A two-node system with one categorical and one continuous variable. The structural equations are
where
Variable | Discrete/continuous |
---|---|
| | Continuous |
| | Discrete on |
A two-node system with one categorical and one continuous variable. The structural equations are
Variable | Discrete/continuous |
---|---|
| | Discrete on |
| | Continuous |
| | Discrete on |
| | Continuous |
Another example of Simpson's Paradox using a mixed-type SEM. The dataset is constructed so that
The structural equations are
where
Variable | Discrete/continuous |
---|---|
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Discrete on |
| | Continuous |
An adaptation of `large_backdoor` with a binary variable
A complete description of the structural equations can be found in the data generation code for CSuite.
Variable | Discrete/continuous |
---|---|
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
| | Discrete on |
| | Continuous |
An adaptation of `weak_arrows` with a binary variable
A complete description of the structural equations can be found in the data generation code for CSuite.
Variable | Discrete/continuous |
---|---|
| | Discrete on |
| | Continuous |
| | Continuous |
| | Continuous |
| | Discrete on |
| | Discrete on |
| | Continuous |
| | Discrete on |
| | Continuous |
| | Continuous |
| | Continuous |
| | Continuous |
A larger dataset with treatment node
A complete description of the structural equations can be found in the data generation code for CSuite.
Variable | Discrete/continuous |
---|---|
| | Discrete on |
| | Discrete on |
| | Discrete on |
A chain graph with discrete variables. The structural equations are
Variable | Discrete/continuous |
---|---|
| | Discrete on |
| | Discrete on |
| | Discrete on |
A collider graph with discrete variables. The structural equations are
This project welcomes contributions and suggestions.
Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.