In the second cycle of the DSRM project, design principle DP2 was refined by exploring methods for synthetic data generation in response to earlier expert feedback. A systematic literature review identified candidate algorithms such as GANs, Gaussian mixture models, and variational autoencoders, which were then tested on a financial fraud detection dataset. Privacy checks confirmed the safety of the generated data. The findings underscored the importance of combining synthetic and real data, which led to a new design principle (DP5) and an updated system architecture.
- literatureReview.xlsx - lists the papers that were used to select the algorithms for the later experiments.
To run the code and reproduce the experiments described in the paper, follow the steps below:
- Use the information provided in DATA.MD to download the necessary data.
- (Optional) Run 00_assessFeatureImportance.py to assess the importance of the features in the dataset and generate some descriptive statistics.
- Run 01_preprocessData.py to preprocess the data and generate the necessary files for the experiments.
- In the 02_paramSearch folder, run each of the listed scripts to conduct the hyperparameter search for the respective algorithm. Note that you need to be logged in to wandb.ai for this to work.
- Run 03_generateSyntheticData.py to generate the synthetic data using the best hyperparameters found in the previous step.
- Run 04_performanceEvaluation.py to evaluate the performance of the synthetic data generated in the previous step.
- Run 05_privacyEvaluation.py to evaluate the privacy of the synthetic data generated in the previous step.
- Run 06_visualizeEvaluation.py to visualize the results of the performance evaluation and generate the plots displayed in the paper.
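
The steps above can also be chained from a small driver script. The following is a minimal sketch and not part of the repository: the helper names `build_steps` and `run_steps` are hypothetical, and it assumes that every `.py` file in 02_paramSearch is a search script, that the order of those scripts does not matter, and that all scripts can be started from the repository root. It also assumes the data has already been downloaded as described in DATA.MD and that you have authenticated with `wandb login` beforehand.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the steps listed above.
OPTIONAL_STEP = "00_assessFeatureImportance.py"
PREPROCESS_STEP = "01_preprocessData.py"
PARAM_SEARCH_DIR = "02_paramSearch"
LATER_STEPS = [
    "03_generateSyntheticData.py",
    "04_performanceEvaluation.py",
    "05_privacyEvaluation.py",
    "06_visualizeEvaluation.py",
]


def build_steps(repo_root, include_optional=False):
    """Return the scripts to run, in order, as absolute paths."""
    root = Path(repo_root).resolve()
    steps = []
    if include_optional:
        steps.append(root / OPTIONAL_STEP)
    steps.append(root / PREPROCESS_STEP)
    # Assumption: every .py file in the folder is a hyperparameter search
    # script; they are sorted by name only to get a stable order.
    steps.extend(sorted((root / PARAM_SEARCH_DIR).glob("*.py")))
    steps.extend(root / name for name in LATER_STEPS)
    return steps


def run_steps(steps, repo_root=".", dry_run=False):
    """Run each script with the current interpreter; stop at the first failure."""
    count = 0
    for script in steps:
        print(f"Running {script.name}")
        if not dry_run:
            # Assumption: scripts expect the repository root as working directory.
            # check=True raises CalledProcessError if a script exits non-zero.
            subprocess.run([sys.executable, str(script)], check=True, cwd=repo_root)
        count += 1
    return count
```

Calling `run_steps(build_steps(".", include_optional=True), dry_run=True)` from the repository root only prints the planned order, which is a cheap way to check the sequence before starting the long-running hyperparameter search.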