Text2sql execution accuracy metric updates #1604

Open · oktie wants to merge 14 commits into main from new-text2sql-metrics-scores

Conversation

oktie (Member) commented Feb 12, 2025

The goal is to add more metrics and results to the output of the text2sql execution accuracy metric implementation. Previously we produced just one number: 1 if the dataframes produced by the SQL queries in pred and gold match, 0 otherwise. The metric now reports 12 scores/outputs (a minimal sketch of how such checks could be computed follows the list):

  1. execution_result: 1 if the predicted and gold dataframes match (same as before)
  2. non_empty_execution_result: 1 if both dataframes are non-empty and match
  3. subset_non_empty_execution_result: 1 if both dataframes are non-empty and the gold dataframe is a subset of the predicted dataframe
  4. non_empty_gold_df: 1 if the gold dataframe is non-empty
  5. gold_sql_runtime: runtime of the gold (ground truth) query
  6. predicted_sql_runtime: runtime of the predicted query
  7. pred_to_gold_runtime_ratio: ratio of the predicted query runtime to the gold query runtime
  8. gold_error: 1 if the gold query raises an error
  9. predicted_error: 1 if the predicted query raises an error
  10. the gold (ground truth) dataframe
  11. the predicted query's dataframe
  12. the error message (if any)
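
The actual checks live in the metric code touched by this PR; the snippet below is only a minimal, self-contained sketch of how per-instance scores of this kind could be derived, assuming pandas dataframes and a DB-API connection. The function name `execution_scores`, the string-based row comparison, and the use of `pd.read_sql` are illustrative assumptions, not the PR's implementation.

```python
import time
import pandas as pd


def execution_scores(conn, gold_sql: str, pred_sql: str) -> dict:
    """Sketch: run gold and predicted SQL and derive per-instance scores."""

    def run(sql):
        # Execute one query, returning (dataframe, runtime_seconds, error_message).
        start = time.perf_counter()
        try:
            df = pd.read_sql(sql, conn)
            return df, time.perf_counter() - start, None
        except Exception as exc:  # query failed to execute
            return None, time.perf_counter() - start, str(exc)

    gold_df, gold_rt, gold_err = run(gold_sql)
    pred_df, pred_rt, pred_err = run(pred_sql)

    def rows(df):
        # Compare result sets as sorted lists of stringified row tuples
        # (order-insensitive; ignores column names).
        return sorted(map(tuple, df.astype(str).values.tolist()))

    def frames_match(a, b):
        return a is not None and b is not None and rows(a) == rows(b)

    def subset_of(small, big):
        # Every gold row appears somewhere in the predicted result.
        return (small is not None and big is not None
                and set(rows(small)) <= set(rows(big)))

    match = frames_match(gold_df, pred_df)
    gold_non_empty = gold_df is not None and not gold_df.empty
    pred_non_empty = pred_df is not None and not pred_df.empty

    return {
        "execution_accuracy": float(match),
        "non_empty_execution_accuracy": float(match and gold_non_empty and pred_non_empty),
        "subset_non_empty_execution_result": float(
            gold_non_empty and pred_non_empty and subset_of(gold_df, pred_df)
        ),
        "non_empty_gold_df": float(gold_non_empty),
        "gold_sql_runtime": gold_rt,
        "predicted_sql_runtime": pred_rt,
        "pred_to_gold_runtime_ratio": pred_rt / gold_rt if gold_rt else 0.0,
        "gold_error": float(gold_err is not None),
        "predicted_error": float(pred_err is not None),
        "gold_df": gold_df,
        "predicted_df": pred_df,
        "error_message": pred_err or gold_err,
    }
```

For quick experimentation, a sqlite3 in-memory connection works, e.g. `execution_scores(sqlite3.connect(":memory:"), "SELECT 1", "SELECT 1")`.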

What we used to get (output of examples/evaluate_text2sql.py):

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.2
score_name (str):
    execution_accuracy
execution_accuracy (float):
    0.2
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6
score_ci_low (float64):
    0.0
score_ci_high (float64):
    0.6

What we get with the new additions:

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.0
score_name (str):
    non_empty_execution_accuracy
non_empty_execution_accuracy (float):
    0.0
subset_non_empty_execution_result (float):
    0.0
pred_to_gold_runtime_ratio (float):
    0.9950211077562516
predicted_error (float):
    0.1
predicted_sql_runtime (float):
    0.8206439448520542
gold_error (float):
    0.0
non_empty_gold_df (float):
    0.0
gold_sql_runtime (float):
    0.8285779342986643
execution_accuracy (float):
    0.2
predicted_sql_runtime_ci_low (float64):
    0.7456668988301784
predicted_sql_runtime_ci_high (float64):
    1.0325724025184853
gold_sql_runtime_ci_low (float64):
    0.7711773584951769
gold_sql_runtime_ci_high (float64):
    0.9317167796579513
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6

oktie requested a review from perlitz on February 12, 2025 at 14:02.
oktie force-pushed the new-text2sql-metrics-scores branch from dd76590 to 7e905b6 on February 12, 2025 at 14:03.
oktie (Member, Author) commented Feb 12, 2025

I also added non-execution accuracy metrics that measure the validity and equivalence of the SQL queries without running them (a minimal sketch of what such parse-based checks could look like follows the output). Example output from `python examples/evaluate_text2sql.py`:

num_of_instances (int):
    100
anls (float):
    0.5397801104939468
score (float):
    0.43
score_name (str):
    non_empty_execution_accuracy
sqlglot_optimized_equivalence (float):
    0.08
sqlglot_validity (float):
    0.99
sqlparse_equivalence (float):
    0.04
sqlparse_validity (float):
    1.0
sqlglot_equivalence (float):
    0.06
sql_exact_match (float):
    0.04
sqlglot_optimized_equivalence_ci_low (float64):
    0.04
sqlglot_optimized_equivalence_ci_high (float64):
    0.15
sqlglot_validity_ci_low (float64):
    0.9382717593768168
sqlglot_validity_ci_high (float64):
    1.0
sqlparse_equivalence_ci_low (float64):
    0.01
sqlparse_equivalence_ci_high (float64):
    0.1
sqlglot_equivalence_ci_low (float64):
    0.02
sqlglot_equivalence_ci_high (float64):
    0.12
score_ci_low (float64):
    0.34
score_ci_high (float64):
    0.53
sql_exact_match_ci_low (float64):
    0.01
sql_exact_match_ci_high (float64):
    0.09
pred_to_gold_runtime_ratio (float):
    1.1074899426266915
non_empty_gold_df (float):
    0.95
predicted_sql_runtime (float):
    0.004269861523061991
predicted_error (float):
    0.05
non_empty_execution_accuracy (float):
    0.43
gold_sql_runtime (float):
    0.004916642438620329
gold_error (float):
    0.08
execution_accuracy (float):
    0.43
subset_non_empty_execution_result (float):
    0.48
predicted_sql_runtime_ci_low (float64):
    0.0036396513714974156
predicted_sql_runtime_ci_high (float64):
    0.004940390751999117
non_empty_execution_accuracy_ci_low (float64):
    0.34
non_empty_execution_accuracy_ci_high (float64):
    0.53
gold_sql_runtime_ci_low (float64):
    0.004248886822267619
gold_sql_runtime_ci_high (float64):
    0.005753239914050517
execution_accuracy_ci_low (float64):
    0.34
execution_accuracy_ci_high (float64):
    0.53
subset_non_empty_execution_result_ci_low (float64):
    0.38
subset_non_empty_execution_result_ci_high (float64):
    0.58
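
As above, this is not the PR's implementation, just a rough sketch of what no-execution validity and equivalence checks of this flavor could look like, using `sqlglot` and `sqlparse` as the metric names suggest. The helper names and the normalization strategy (comparing re-rendered SQL strings) are assumptions.

```python
import sqlglot
import sqlparse
from sqlglot.optimizer import optimize


def non_execution_scores(pred_sql: str, gold_sql: str) -> dict:
    """Sketch: static checks on predicted vs. gold SQL, no database needed."""

    def sqlglot_parses(sql):
        try:
            sqlglot.parse_one(sql)
            return True
        except sqlglot.errors.SqlglotError:
            return False

    def sqlglot_canonical(sql, optimized=False):
        # Normalize by re-rendering the parsed (optionally optimized) expression.
        expr = sqlglot.parse_one(sql)
        if optimized:
            try:
                # optimize() may need schema information (e.g. for SELECT *);
                # fall back to the unoptimized expression if it fails.
                expr = optimize(expr)
            except Exception:
                pass
        return expr.sql()

    def sqlparse_canonical(sql):
        # sqlparse is non-validating, so this is purely a formatting normalization.
        return sqlparse.format(
            sql, reindent=True, keyword_case="upper", strip_comments=True
        ).strip()

    both_parse = sqlglot_parses(pred_sql) and sqlglot_parses(gold_sql)
    return {
        "sql_exact_match": float(pred_sql.strip() == gold_sql.strip()),
        "sqlglot_validity": float(sqlglot_parses(pred_sql)),
        # sqlparse accepts almost anything, so this is a very weak validity signal.
        "sqlparse_validity": float(len(sqlparse.parse(pred_sql)) > 0),
        "sqlglot_equivalence": float(
            both_parse and sqlglot_canonical(pred_sql) == sqlglot_canonical(gold_sql)
        ),
        "sqlglot_optimized_equivalence": float(
            both_parse
            and sqlglot_canonical(pred_sql, optimized=True)
            == sqlglot_canonical(gold_sql, optimized=True)
        ),
        "sqlparse_equivalence": float(
            sqlparse_canonical(pred_sql) == sqlparse_canonical(gold_sql)
        ),
    }
```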
