Text2sql execution accuracy metric updates #1604

Open · oktie wants to merge 14 commits into main from new-text2sql-metrics-scores

Conversation

oktie (Member) commented Feb 12, 2025

The goal is to add more metrics and results to the output of the text2sql execution accuracy metric implementation. Previously we produced just one number: 1 if the dataframes produced by the SQL queries in pred and gold match, 0 otherwise. The metric now reports 12 scores/outputs (a minimal sketch of how such checks could be computed follows the list):

  1. execution_result: 1 if the predicted and gold dataframes match (same as before)
  2. non_empty_execution_result: 1 if both dataframes are non-empty and match
  3. subset_non_empty_execution_result: 1 if both dataframes are non-empty and the gold dataframe is a subset of the predicted dataframe
  4. non_empty_gold_df: 1 if the gold dataframe is non-empty
  5. gold_sql_runtime: runtime of the gold (ground truth) query
  6. predicted_sql_runtime: runtime of the predicted query
  7. pred_to_gold_runtime_ratio: ratio of the predicted query runtime to the gold query runtime
  8. gold_error: 1 if the gold query raises an error
  9. predicted_error: 1 if the predicted query raises an error
  10. the gold (ground truth) dataframe
  11. the predicted query's dataframe
  12. the error message (if any)
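
The actual checks live in the metric code touched by this PR; the snippet below is only a minimal, self-contained sketch of how per-instance scores of this kind could be derived, assuming pandas dataframes and a DB-API connection. The function name `execution_scores`, the string-based row comparison, and the use of `pd.read_sql` are illustrative assumptions, not the PR's implementation.

```python
import time
import pandas as pd


def execution_scores(conn, gold_sql: str, pred_sql: str) -> dict:
    """Sketch: run gold and predicted SQL and derive per-instance scores."""

    def run(sql):
        # Execute one query, returning (dataframe, runtime_seconds, error_message).
        start = time.perf_counter()
        try:
            df = pd.read_sql(sql, conn)
            return df, time.perf_counter() - start, None
        except Exception as exc:  # query failed to execute
            return None, time.perf_counter() - start, str(exc)

    gold_df, gold_rt, gold_err = run(gold_sql)
    pred_df, pred_rt, pred_err = run(pred_sql)

    def rows(df):
        # Compare result sets as sorted lists of stringified row tuples
        # (order-insensitive; ignores column names).
        return sorted(map(tuple, df.astype(str).values.tolist()))

    def frames_match(a, b):
        return a is not None and b is not None and rows(a) == rows(b)

    def subset_of(small, big):
        # Every gold row appears somewhere in the predicted result.
        return (small is not None and big is not None
                and set(rows(small)) <= set(rows(big)))

    match = frames_match(gold_df, pred_df)
    gold_non_empty = gold_df is not None and not gold_df.empty
    pred_non_empty = pred_df is not None and not pred_df.empty

    return {
        "execution_accuracy": float(match),
        "non_empty_execution_accuracy": float(match and gold_non_empty and pred_non_empty),
        "subset_non_empty_execution_result": float(
            gold_non_empty and pred_non_empty and subset_of(gold_df, pred_df)
        ),
        "non_empty_gold_df": float(gold_non_empty),
        "gold_sql_runtime": gold_rt,
        "predicted_sql_runtime": pred_rt,
        "pred_to_gold_runtime_ratio": pred_rt / gold_rt if gold_rt else 0.0,
        "gold_error": float(gold_err is not None),
        "predicted_error": float(pred_err is not None),
        "gold_df": gold_df,
        "predicted_df": pred_df,
        "error_message": pred_err or gold_err,
    }
```

For quick experimentation, a sqlite3 in-memory connection works, e.g. `execution_scores(sqlite3.connect(":memory:"), "SELECT 1", "SELECT 1")`.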

What we used to get (output of examples/evaluate_text2sql.py):

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.2
score_name (str):
    execution_accuracy
execution_accuracy (float):
    0.2
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6
score_ci_low (float64):
    0.0
score_ci_high (float64):
    0.6

What we get with the new additions:

num_of_instances (int):
    10
anls (float):
    0.12179476763080924
score (float):
    0.0
score_name (str):
    non_empty_execution_accuracy
non_empty_execution_accuracy (float):
    0.0
subset_non_empty_execution_result (float):
    0.0
pred_to_gold_runtime_ratio (float):
    0.9950211077562516
predicted_error (float):
    0.1
predicted_sql_runtime (float):
    0.8206439448520542
gold_error (float):
    0.0
non_empty_gold_df (float):
    0.0
gold_sql_runtime (float):
    0.8285779342986643
execution_accuracy (float):
    0.2
predicted_sql_runtime_ci_low (float64):
    0.7456668988301784
predicted_sql_runtime_ci_high (float64):
    1.0325724025184853
gold_sql_runtime_ci_low (float64):
    0.7711773584951769
gold_sql_runtime_ci_high (float64):
    0.9317167796579513
execution_accuracy_ci_low (float64):
    0.0
execution_accuracy_ci_high (float64):
    0.6

oktie requested a review from perlitz on February 12, 2025 at 14:02.
oktie force-pushed the new-text2sql-metrics-scores branch from dd76590 to 7e905b6 on February 12, 2025 at 14:03.
oktie (Member, Author) commented Feb 12, 2025

I also added non-execution accuracy metrics that measure the validity and equivalence of the SQL queries without running them (a minimal sketch of what such parse-based checks could look like follows the output). Example output from `python examples/evaluate_text2sql.py`:

num_of_instances (int):
    100
anls (float):
    0.5397801104939468
score (float):
    0.43
score_name (str):
    non_empty_execution_accuracy
sqlglot_optimized_equivalence (float):
    0.08
sqlglot_validity (float):
    0.99
sqlparse_equivalence (float):
    0.04
sqlparse_validity (float):
    1.0
sqlglot_equivalence (float):
    0.06
sql_exact_match (float):
    0.04
sqlglot_optimized_equivalence_ci_low (float64):
    0.04
sqlglot_optimized_equivalence_ci_high (float64):
    0.15
sqlglot_validity_ci_low (float64):
    0.9382717593768168
sqlglot_validity_ci_high (float64):
    1.0
sqlparse_equivalence_ci_low (float64):
    0.01
sqlparse_equivalence_ci_high (float64):
    0.1
sqlglot_equivalence_ci_low (float64):
    0.02
sqlglot_equivalence_ci_high (float64):
    0.12
score_ci_low (float64):
    0.34
score_ci_high (float64):
    0.53
sql_exact_match_ci_low (float64):
    0.01
sql_exact_match_ci_high (float64):
    0.09
pred_to_gold_runtime_ratio (float):
    1.1074899426266915
non_empty_gold_df (float):
    0.95
predicted_sql_runtime (float):
    0.004269861523061991
predicted_error (float):
    0.05
non_empty_execution_accuracy (float):
    0.43
gold_sql_runtime (float):
    0.004916642438620329
gold_error (float):
    0.08
execution_accuracy (float):
    0.43
subset_non_empty_execution_result (float):
    0.48
predicted_sql_runtime_ci_low (float64):
    0.0036396513714974156
predicted_sql_runtime_ci_high (float64):
    0.004940390751999117
non_empty_execution_accuracy_ci_low (float64):
    0.34
non_empty_execution_accuracy_ci_high (float64):
    0.53
gold_sql_runtime_ci_low (float64):
    0.004248886822267619
gold_sql_runtime_ci_high (float64):
    0.005753239914050517
execution_accuracy_ci_low (float64):
    0.34
execution_accuracy_ci_high (float64):
    0.53
subset_non_empty_execution_result_ci_low (float64):
    0.38
subset_non_empty_execution_result_ci_high (float64):
    0.58
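
As above, this is not the PR's implementation, just a rough sketch of what no-execution validity and equivalence checks of this flavor could look like, using `sqlglot` and `sqlparse` as the metric names suggest. The helper names and the normalization strategy (comparing re-rendered SQL strings) are assumptions.

```python
import sqlglot
import sqlparse
from sqlglot.optimizer import optimize


def non_execution_scores(pred_sql: str, gold_sql: str) -> dict:
    """Sketch: static checks on predicted vs. gold SQL, no database needed."""

    def sqlglot_parses(sql):
        try:
            sqlglot.parse_one(sql)
            return True
        except sqlglot.errors.SqlglotError:
            return False

    def sqlglot_canonical(sql, optimized=False):
        # Normalize by re-rendering the parsed (optionally optimized) expression.
        expr = sqlglot.parse_one(sql)
        if optimized:
            try:
                # optimize() may need schema information (e.g. for SELECT *);
                # fall back to the unoptimized expression if it fails.
                expr = optimize(expr)
            except Exception:
                pass
        return expr.sql()

    def sqlparse_canonical(sql):
        # sqlparse is non-validating, so this is purely a formatting normalization.
        return sqlparse.format(
            sql, reindent=True, keyword_case="upper", strip_comments=True
        ).strip()

    both_parse = sqlglot_parses(pred_sql) and sqlglot_parses(gold_sql)
    return {
        "sql_exact_match": float(pred_sql.strip() == gold_sql.strip()),
        "sqlglot_validity": float(sqlglot_parses(pred_sql)),
        # sqlparse accepts almost anything, so this is a very weak validity signal.
        "sqlparse_validity": float(len(sqlparse.parse(pred_sql)) > 0),
        "sqlglot_equivalence": float(
            both_parse and sqlglot_canonical(pred_sql) == sqlglot_canonical(gold_sql)
        ),
        "sqlglot_optimized_equivalence": float(
            both_parse
            and sqlglot_canonical(pred_sql, optimized=True)
            == sqlglot_canonical(gold_sql, optimized=True)
        ),
        "sqlparse_equivalence": float(
            sqlparse_canonical(pred_sql) == sqlparse_canonical(gold_sql)
        ),
    }
```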
