Authors: Judith Sáinz-Pardo Díaz and Álvaro López García (IFCA - CSIC).
Abstract: Anonymization techniques based on applying different levels of hierarchies to quasi-identifiers are widely used to achieve pre-established levels of privacy. In order to prevent different types of attacks against database privacy, it is necessary to apply several anonymization techniques beyond the classical ones, such as k-anonymity or l-diversity. However, applying these methods directly reduces the usefulness of the data for prediction tasks, decision making, etc. Four classical machine learning methods currently used for classification tasks are studied in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them. First, different values of k for k-anonymity are established; then, with k=5 fixed, three other techniques are applied: l-diversity, t-closeness and δ-disclosure privacy. The anonymization process has been carried out using the ARX software [1].
Note: The paper associated with this work has been accepted for publication at the IEEE International Conference on Cyber Security and Resilience 2023 (IEEE CSR 2023), under the title "Comparison of machine learning models applied on anonymized data with different techniques" (see the preprint).
The results of applying four machine learning models (kNN, Random Forest, Adaptive Boosting and Gradient Tree Boosting) in the following five scenarios on the adult dataset are shown by means of ROC curves in Figure 1; a sketch of the evaluation pipeline is given after the list.
- Scenario 1: raw data (adult_raw.ipynb).
- Scenario 2: k-anonymity with k=5 (adult_k5.ipynb).
- Scenario 3: k-anonymity with k=5 and l-diversity with l=2 (adult_k5_l2.ipynb).
- Scenario 4: k-anonymity with k=5 and t-closeness with t=0.7 (adult_k5_t07.ipynb).
- Scenario 5: k-anonymity with k=5 and δ-disclosure privacy with δ=1.5 (adult_k5_delta15.ipynb).
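As a minimal sketch of this evaluation pipeline (not the exact code of the notebooks), the four classifiers can be trained and compared with scikit-learn as follows; here make_classification is only a synthetic stand-in for the preprocessed adult features:

```python
# Minimal sketch of the ROC comparison; the notebooks use the real
# (raw or anonymized) adult tables instead of synthetic data.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "kNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Adaptive Boosting": AdaBoostClassifier(random_state=42),
    "Gradient Tree Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_test, y_score)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], "k--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```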
Using the Python library pyCANON [2], we have analyzed the metrics that are verified for each of the anonymization techniques used (see check_anonymity.py). Specifically, it can be checked that when δ=1.5, the value of t for t-closeness is 0.47 (lower than the value set in the scenario where t-closeness is satisfied for t=0.7). This makes δ=1.5 the most restrictive scenario, and the results in each case are consistent with the corresponding ranking metric (see anonymity_metrics.py).
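The check can be reproduced with pyCANON roughly as follows (a sketch based on its documented API: the file name is illustrative, and the quasi-identifier and sensitive attribute lists follow the usual choice for the adult dataset):

```python
# Sketch of the kind of check performed in check_anonymity.py, assuming the
# anonymized table has been exported to CSV (file name is illustrative).
import pandas as pd
from pycanon import anonymity, report

QI = ["age", "education", "occupation", "relationship", "sex", "native-country"]
SA = ["salary-class"]
data = pd.read_csv("adult_anonymized.csv")

# Levels actually verified by the anonymized data:
k = anonymity.k_anonymity(data, QI)
l_div = anonymity.l_diversity(data, QI, SA)
t = anonymity.t_closeness(data, QI, SA)
delta = anonymity.delta_disclosure(data, QI, SA)
print(f"k={k}, l={l_div}, t={t:.2f}, delta={delta:.2f}")

# Full report with all the metrics computed by pyCANON:
report.print_report(data, QI, SA)
```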
An analogous analysis has also been carried out for different values of k for k-anonymity, in particular for k=2, 5, 10, 15, 20, 25, 50, 75 and 100 (see ml_varying_k.py). The average equivalence class size metric has also been analyzed together with the accuracy and the AUC obtained with each ML model.
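A condensed sketch of that loop, assuming one anonymized CSV per value of k (file names, quasi-identifiers and target column are illustrative, and only one of the four models is shown for brevity):

```python
# Sketch of the varying-k analysis: for each k, load the corresponding
# anonymized table, compute the average equivalence class size metric and
# evaluate a classifier on a held-out split.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

QI = ["age", "education", "occupation", "relationship", "sex", "native-country"]
TARGET = "salary-class"

for k in [2, 5, 10, 15, 20, 25, 50, 75, 100]:
    df = pd.read_csv(f"adult_k{k}.csv")  # one anonymized export per k
    c_avg = len(df) / (df.groupby(QI).ngroups * k)  # avg equivalence class size
    X = pd.get_dummies(df.drop(columns=TARGET))  # simple one-hot encoding
    y = (df[TARGET] == ">50K").astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"k={k}: C_AVG={c_avg:.2f}, accuracy={acc:.3f}, AUC={auc:.3f}")
```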
Two utility metrics are also analyzed: the average equivalence class size metric (C_AVG) and the classification metric (CM); a sketch of both computations is given below.
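Both metrics admit a direct implementation; as a sketch (function names are illustrative), assuming df holds the anonymized table, qi the quasi-identifier columns and target the class label:

```python
import pandas as pd

def avg_equivalence_class_size(df: pd.DataFrame, qi: list, k: int) -> float:
    """C_AVG = |T| / (|EQs| * k): average records per equivalence class,
    relative to k. Values close to 1 indicate little generalization beyond
    what k-anonymity requires."""
    n_eqs = df.groupby(qi).ngroups  # number of equivalence classes
    return len(df) / (n_eqs * k)

def classification_metric(df: pd.DataFrame, qi: list, target: str) -> float:
    """CM: fraction of records whose class label differs from the majority
    label of their equivalence class (in the original definition, suppressed
    records, if any, are also penalized)."""
    penalty = 0
    for _, group in df.groupby(qi):
        majority = group[target].mode().iloc[0]
        penalty += int((group[target] != majority).sum())
    return penalty / len(df)
```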
This project is licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/).
[1] Prasser, Fabian, and Florian Kohlmayer. "Putting statistical disclosure control into practice: The ARX data anonymization tool." Medical Data Privacy Handbook (2015): 111-148.
[2] Sáinz-Pardo Díaz, Judith, and Álvaro López García. "A Python library to check the level of anonymity of a dataset." Scientific Data 9.1 (2022): 785.
The authors would like to thank the funding through the European Union - NextGenerationEU (Regulation EU 2020/2094) through CSIC's Global Health Platform (PTI+ Salud Global), as well as the support from the AI4EOSC project ("Artificial Intelligence for the European Open Science Cloud"), which has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement number 101058593.