- What: the difference between two datasets in terms of their clustering characteristics
- Why: measure the ability of a differential privacy algorithm to conserve clustering characteristics
- How: computed by comparing n-column (i.e. n-feature) marginal density distributions between any two similarly structured datasets.
- N-marginal: the set of records corresponding to an n-feature value combination.
- N-feature combination: a set of n features chosen from the feature space.
- N-feature value combination: a set of n values, where each value is selected from the set of possible values for one of the features in an n-feature combination.
- Bin scalar features to prevent a large number of possible values from causing an explosion of feature value combinations, which would greatly increase runtime.
- Select n features - for example 3. For each possible 3-feature value combination corresponding to these 3 features, sum the results of the following computation (this yields a score between 0 and 2):
- Compute the density of the 3-feature value combination for both data sets (the number of observations matching the combination, divided by the total number of observations).
- Compute the absolute difference between the two densities.
- Repeat the summation above and average the scores across either 1) a random subset of all possible 3-feature combinations or 2) all possible 3-feature combinations, to obtain an aggregate score in the range 0-2.
- Option 2 can become infeasible with high-dimensional data.
- 0 --> perfectly matching density distributions (for the marginals used in the comparison).
- 2 --> no overlap whatsoever (for the marginals used in the comparison).
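The per-combination computation above can be sketched as follows, assuming two pandas DataFrames with identical columns; the function name `marginal_score` and its signature are illustrative, not the library's API:

```python
# Illustrative sketch of the marginal-difference score for one feature
# combination; the function name and signature are assumptions, not the
# library's API.
import pandas as pd

def marginal_score(df_a: pd.DataFrame, df_b: pd.DataFrame, features: list) -> float:
    """Sum of absolute density differences over all observed value
    combinations of `features`: 0 = identical distributions, 2 = disjoint."""
    dens_a = df_a.groupby(features).size() / len(df_a)
    dens_b = df_b.groupby(features).size() / len(df_b)
    # Align on the union of observed value combinations; a combination
    # absent from one frame contributes its full density from the other.
    return dens_a.subtract(dens_b, fill_value=0).abs().sum()
```

Two identical frames score 0; frames whose value combinations never overlap score 2.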
- Bin features as deemed best - fewer possible values per feature means less computational complexity for the marginal metric evaluation.
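For example, a scalar feature can be binned with `pandas.cut`; the column name and bin count here are arbitrary illustrative choices:

```python
# Illustrative binning of a scalar feature before marginal evaluation;
# the column name and bin count are arbitrary choices.
import pandas as pd

df = pd.DataFrame({"age": [5, 17, 23, 41, 58, 72]})
# Integer bin labels for 3 equal-width bins.
df["age_binned"] = pd.cut(df["age"], bins=3, labels=False)
```

Fewer, coarser bins reduce the number of possible value combinations per feature, at the cost of distributional detail.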
- Instantiate a MarginalMetric object with the following parameters:
- `data_frame_a` (`pandas.DataFrame`): first of two data frames to evaluate against one another.
- `data_frame_b` (`pandas.DataFrame`): second of two data frames to evaluate against one another.
- `marginal_dimensionality` (`int`): the number of features used to create marginals. The higher this number, the greater the number of features considered at a time when the clustering characteristics of the two data sets are evaluated against each other. The default value is 3. Larger values can drastically increase the computational complexity of the marginal metric evaluation.
- `picking_strategy` (`Enum`): determines how feature combinations are selected. Possible values are:
    - `lexicographic` --> all possible combinations, in lexicographic order
    - `rolling` --> incrementally shifting sets. Example: (1,2,3), (2,3,4), ... (8,9,n)
    - `random` --> random selection from all possible combinations
- `sample_ratio` (`float`): the proportion of available combinations to use, given a `picking_strategy`. If the value is 1 and the picking strategy is lexicographic or random, all possible combinations will be used in the marginal metric evaluation. If the value is 1 and the picking strategy is rolling, all rolling combinations will be used. This value can have a significant impact on runtime.
- Optional parameters include:
- `stable_features` (`list`): a list of the feature names to include in the marginal metric. The default value is the empty list `[]`.
- Make a call to MarginalMetric.compute_results(), which returns a Result object containing the results of the evaluation
- Pass the Result instance to a Report instance (e.g. ConsoleReport), and call the produce_report() method for the Report instance.
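Putting the steps together, a minimal usage sketch might look like the following; the import path and the enum name `PickingStrategy` are assumptions, so consult the package for the exact names:

```python
# Minimal usage sketch; the import path and the PickingStrategy enum name
# are assumptions -- check the package for the exact identifiers.
import pandas as pd
from marginal_metric import MarginalMetric, PickingStrategy, ConsoleReport

data_frame_a = pd.read_csv("original.csv")
data_frame_b = pd.read_csv("synthetic.csv")

metric = MarginalMetric(
    data_frame_a=data_frame_a,
    data_frame_b=data_frame_b,
    marginal_dimensionality=3,
    picking_strategy=PickingStrategy.random,
    sample_ratio=0.001,
)

result = metric.compute_results()  # returns a Result object
ConsoleReport(result).produce_report()
```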
- `marginal_dimensionality` --> 3
- `picking_strategy` --> random
- `sample_ratio` --> 0.001