diff --git a/src/main/scala/com/amazon/deequ/examples/ConstraintSuggestionExample.scala b/src/main/scala/com/amazon/deequ/examples/ConstraintSuggestionExample.scala
index 8aa0fb6c5..fc8f458bf 100644
--- a/src/main/scala/com/amazon/deequ/examples/ConstraintSuggestionExample.scala
+++ b/src/main/scala/com/amazon/deequ/examples/ConstraintSuggestionExample.scala
@@ -17,6 +17,8 @@
 package com.amazon.deequ.examples
 
 import com.amazon.deequ.examples.ExampleUtils.withSpark
+import com.amazon.deequ.suggestions.rules.RetainCompletenessRule
+import com.amazon.deequ.suggestions.rules.interval.WilsonScoreIntervalStrategy
 import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
 
 private[examples] object ConstraintSuggestionExample extends App {
@@ -51,6 +53,10 @@ private[examples] object ConstraintSuggestionExample extends App {
     val suggestionResult = ConstraintSuggestionRunner()
       .onData(data)
       .addConstraintRules(Rules.EXTENDED)
+      // We can also add our own constraint and customize constraint parameters
+      .addConstraintRule(
+        RetainCompletenessRule(intervalStrategy = WilsonScoreIntervalStrategy())
+      )
       .run()
 
     // We can now investigate the constraints that deequ suggested. We get a textual description
diff --git a/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md b/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md
index df159a9c9..472f63c7d 100644
--- a/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md
+++ b/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md
@@ -43,6 +43,17 @@ val suggestionResult = ConstraintSuggestionRunner()
   .run()
 ```
 
+Alternatively, we also support customizing and adding an individual constraint rule using `addConstraintRule()`:
+```scala
+val suggestionResult = ConstraintSuggestionRunner()
+  .onData(data)
+
+  .addConstraintRule(
+    RetainCompletenessRule(intervalStrategy = WilsonScoreIntervalStrategy())
+  )
+  .run()
+```
+
 We can now investigate the constraints that deequ suggested. We get a textual description and the corresponding scala code for each suggested constraint. Note that the constraint suggestion is based on heuristic rules and assumes that the data it is shown is 'static' and correct, which might often not be the case in the real world. Therefore the suggestions should always be manually reviewed before being applied in real deployments.
 ```scala
 suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
@@ -92,3 +103,5 @@ The corresponding scala code is .isContainedIn("status", Array("DELAYED", "UNKNO
 Currently, we leave it up to the user to decide whether they want to apply the suggested constraints or not, and provide the corresponding Scala code for convenience. For larger datasets, it makes sense to evaluate the suggested constraints on some held-out portion of the data to see whether they hold or not. You can test this by adding an invocation of `.useTrainTestSplitWithTestsetRatio(0.1)` to the `ConstraintSuggestionRunner`. With this configuration, it would compute constraint suggestions on 90% of the data and evaluate the suggested constraints on the remaining 10%.
 
 Finally, we would also like to note that the constraint suggestion code provides access to the underlying [column profiles](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md) that it computed via `suggestionResult.columnProfiles`.
+
+An [executable and extended version of this example](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/.scala) is part of our code base.
diff --git a/src/main/scala/com/amazon/deequ/suggestions/rules/interval/WaldIntervalStrategy.scala b/src/main/scala/com/amazon/deequ/suggestions/rules/interval/WaldIntervalStrategy.scala
index ecfd6fb77..15574bf17 100644
--- a/src/main/scala/com/amazon/deequ/suggestions/rules/interval/WaldIntervalStrategy.scala
+++ b/src/main/scala/com/amazon/deequ/suggestions/rules/interval/WaldIntervalStrategy.scala
@@ -21,12 +21,15 @@ import com.amazon.deequ.suggestions.rules.interval.ConfidenceIntervalStrategy.{C
 import scala.math.BigDecimal.RoundingMode
 
 /**
- * Implements the Wald Interval method for creating a binomial proportion confidence interval.
- *
+ * Implements the Wald Interval method for creating a binomial proportion confidence interval. Provided for backwards
+ * compatibility. Using [[WaldIntervalStrategy]] to calculate a confidence interval can be problematic when dealing
+ * with small sample sizes or proportions close to 0 or 1. It also has poorer coverage and might produce confidence
+ * limits outside the range [0, 1].
  * @see
  * Normal approximation interval (Wikipedia)
  */
+@deprecated("WilsonScoreIntervalStrategy is recommended for calculating confidence interval")
 case class WaldIntervalStrategy() extends ConfidenceIntervalStrategy {
   def calculateTargetConfidenceInterval(
       pHat: Double,
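Review note on the deprecation above: the new Scaladoc says the Wald interval misbehaves for small samples and for proportions close to 0 or 1. The following is a minimal standalone sketch of that effect, not deequ's implementation. The names `IntervalComparison`, `wald`, and `wilson` are hypothetical, and `z = 1.96` assumes a two-sided 95% confidence level.

```scala
// Standalone illustration only; these helpers are hypothetical and are
// not the code behind WaldIntervalStrategy or WilsonScoreIntervalStrategy.
object IntervalComparison {
  // Assumed z-score for a two-sided 95% confidence level.
  val z = 1.96

  /** Wald (normal approximation) interval: pHat +/- z * sqrt(pHat * (1 - pHat) / n). */
  def wald(pHat: Double, n: Long): (Double, Double) = {
    val margin = z * math.sqrt(pHat * (1 - pHat) / n)
    (pHat - margin, pHat + margin)
  }

  /** Wilson score interval: the center is pulled toward 0.5 and the limits stay inside [0, 1]. */
  def wilson(pHat: Double, n: Long): (Double, Double) = {
    val zSq = z * z
    val denominator = 1 + zSq / n
    val center = (pHat + zSq / (2 * n)) / denominator
    val margin = z * math.sqrt(pHat * (1 - pHat) / n + zSq / (4.0 * n * n)) / denominator
    (center - margin, center + margin)
  }

  def main(args: Array[String]): Unit = {
    // 19 complete values out of 20: a small sample with a proportion close to 1.
    println(s"Wald:   ${wald(0.95, 20)}")
    println(s"Wilson: ${wilson(0.95, 20)}")
  }
}
```

For 19 complete values out of 20, the Wald upper limit in this sketch exceeds 1, while both Wilson limits stay strictly between 0 and 1, which matches the reasoning given for recommending `WilsonScoreIntervalStrategy`.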