
feat(wren-ai-service): change sql through reasoning #1273

Merged
19 commits merged into main from feat/ai-service/change-sql-through-reasoning on Feb 12, 2025

Conversation

Member

@cyyeh cyyeh commented Feb 6, 2025

  • remove sql_explanation and sql_regeneration related code, since it's outdated
  • add sql_regeneration api, service, pipeline for feedback loop
  • update demo ui code
  • fix sql-pairs api issue
  • refactor SQL_GENERATION_MODEL_KWARGS

Summary by CodeRabbit

  • New Features

    • Introduced advanced SQL regeneration powered by updated language models for more accurate automated query generation.
    • Added interactive feedback endpoints that allow users to submit and track feedback on generated SQL.
    • Integrated SQL generation reasoning to enhance query precision.
  • Refactor

    • Streamlined SQL processing workflows by consolidating and removing legacy explanation features.
    • Optimized configuration settings for improved SQL pair retrieval and overall performance.
    • Enhanced error handling and configurability in SQL processing.
  • Tests

    • Removed outdated test coverage for SQL regeneration to align with current functionality.

@cyyeh cyyeh added the module/ai-service and ci/ai-service labels Feb 6, 2025
@cyyeh cyyeh marked this pull request as draft February 6, 2025 09:41
Contributor

coderabbitai bot commented Feb 6, 2025

Walkthrough

The changes update SQL processing across the project by adding and modifying pipeline entries and functions related to SQL regeneration while removing the legacy SQL explanation functionality. Configuration files now include a new sql_regeneration entry with explicit model and engine settings. Application code is refactored to remove several session state variables and endpoints for SQL explanation. Additionally, new functionality for providing feedback on SQL generation has been introduced via updated API endpoints and service methods. Changes also add new configuration fields for managing SQL pairs retrieval and adjust document filtering parameters.

Changes

  • deployment/kustomizations/base/cm.yaml, docker/config.example.yaml, docs/config_examples/*, tools/config/config.example.yaml, tools/config/config.full.yaml: Added new sql_regeneration pipeline entry with specified model and engine; removed sql_explanation entry; updated/reordered pipeline definitions.
  • wren-ai-service/demo/app.py, wren-ai-service/demo/utils.py: Removed session state variables related to SQL explanation and analysis; introduced new sql_generation_reasoning variable and functions to handle SQL generation reasoning, regeneration, and feedback.
  • wren-ai-service/src/config.py, wren-ai-service/src/globals.py: Added new configuration fields (sql_pairs_similarity_threshold, sql_pairs_retrieval_max_size); consolidated SQL regeneration into the ask_service.
  • wren-ai-service/src/pipelines/generation/*: Renamed and updated functions/constants across SQL pipelines (followup, correction, expansion, generation, regeneration, and reasoning), removing SQL explanation components and consolidating model kwarg definitions.
  • wren-ai-service/src/web/v1/routers/*, wren-ai-service/src/web/v1/services/*: New endpoints for ask feedback introduced in ask.py; removed routers for SQL explanations and regenerations; updated service implementations to support feedback processing.
  • wren-ai-service/tests/pytest/test_main.py: Removed the test_sql_regenerations function from the test suite.
  • wren-ai-service/src/pipelines/common.py, wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py, wren-ai-service/src/pipelines/indexing/sql_pairs.py: Updated document filtering and indexing logic for SQL pairs; introduced additional parameters to adjust scoring thresholds and limit results.

Possibly related PRs

  • chore(wren-ai-service): minor update #1210: both PRs touch SQL generation; #1210 clarified the instructions used for generating SQL queries, while this PR adds and removes SQL regeneration pipeline entries in the configuration files.
  • fix(wren-ai-service): column pruning step in retrieval pipeline and correct code for sql generation in evaluation #1225: directly related; #1225 introduced the sql_generation_reasoning parameter in the Generation and Ask pipelines, which the SQL regeneration changes here build on.
  • chore(wren-ai-service): add separate reasoning pipeline #1212: both PRs modify the wren-ai-service-config ConfigMap; #1212 added a sql_generation_reasoning entry, while this PR adds a sql_regeneration entry.

Suggested reviewers

  • paopa

Poem

In coding meadows I happily hop,
Through pipelines renaming that never stop.
SQL queries spring fresh from the ground,
With new endpoints where feedback is found.
I'm a rabbit with joy in every line—
Hopping through code, feeling simply divine!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6f0f9d and 9862955.

📒 Files selected for processing (1)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: pytest
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)

96-96: Verify if the test assertion aligns with the intended behavior.

The test name test_sql_pairs_deletion implies testing document deletion, but the assertion suggests that documents are not being deleted after calling clean(). This could indicate either:

  1. An intentional change where clean() no longer deletes documents
  2. A potential issue where deletion is not working as expected

Please clarify:

  1. Is this change intentional as part of the new SQL regeneration pipeline?
  2. If yes, should we rename the test to better reflect the new behavior?
  3. If no, should we fix the clean() implementation or revert the assertion?

Let's verify the implementation of the clean() method:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Search for the SqlPairs class implementation to verify the clean() method behavior
ast-grep --pattern 'class SqlPairs {
  $$$
  async def clean($$$) {
    $$$
  }
  $$$
}'

Length of output: 90


I'll generate another shell script to search comprehensively for the implementation of the SqlPairs class and its clean() method by looking for both "class SqlPairs" and "async def clean(" patterns in the codebase. This should help us verify if the intended deletion behavior is implemented or if there are any discrepancies.


🏁 Script executed:

#!/bin/bash
# Searching for the SqlPairs class in the repository
rg "class SqlPairs" -n

echo "-----"

# Searching for any async clean method definitions in the repository
rg "async def clean(" -n

Length of output: 580


I'll run an updated shell script to search for the async def clean( implementation, ensuring that the regex is escaped properly. This will help us determine whether the clean() method is defined and what its behavior is.

Once we have this output, we can better determine if the deletion behavior in the test is intentional.


🏁 Script executed:

#!/bin/bash
# Searching for the async clean method with proper escaping
rg "async def clean\\(" -n

Length of output: 806


I'll gather additional context by showing the regions around the clean() method implementations in the wren-ai-service/src/pipelines/indexing/sql_pairs.py file. This will help us verify whether deletion is intended or if there's a discrepancy.


🏁 Script executed:

#!/bin/bash
echo "Context for first occurrence of async def clean in sql_pairs.py (around line 125):"
sed -n '120,140p' wren-ai-service/src/pipelines/indexing/sql_pairs.py

echo "-----"
echo "Context for second occurrence of async def clean in sql_pairs.py (around line 215):"
sed -n '210,230p' wren-ai-service/src/pipelines/indexing/sql_pairs.py

Length of output: 1598


Potential Mismatch in Deletion Behavior

The test name (test_sql_pairs_deletion) implies that documents should be deleted, but based on the implementation in wren-ai-service/src/pipelines/indexing/sql_pairs.py, the clean() functions (both the standalone function and the class method) do not manipulate document counts—instead, they call the cleaner’s run() method using an empty documents list. This behavior hints at an intentional change where deletion is no longer performed as before.

Next Steps to Resolve:

  • Confirm Intent: Verify with the product or feature owner if the change to not delete documents is intentional.
  • Test Naming: If intentional, consider renaming the test to reflect that the “clean” operation now only resets the embedding data rather than removing documents.
  • Cleaner Behavior: Double-check the implementation of SqlPairsCleaner.run() to ensure no deletion is expected, and update test documentation accordingly.

@cyyeh cyyeh force-pushed the feat/ai-service/change-sql-through-reasoning branch 3 times, most recently from 7024507 to 3db391b Compare February 8, 2025 06:57
@cyyeh cyyeh force-pushed the feat/ai-service/change-sql-through-reasoning branch from 3db391b to f61a024 Compare February 10, 2025 05:10
@cyyeh cyyeh marked this pull request as ready for review February 12, 2025 02:28
@paopa paopa self-requested a review February 12, 2025 02:32
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
wren-ai-service/src/pipelines/generation/sql_regeneration.py (2)

24-38: Validate the output format.
Including a JSON format requirement is good practice, but consider verifying or parsing the final LLM-generated string to confirm it strictly follows the expected JSON shape.


132-135: Default mutable parameter usage.
Passing a mutable default (Configuration()) can cause unexpected behavior if the object is modified. Consider defaulting to None and instantiating inside the method to avoid shared state.
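A minimal sketch of the suggested fix; Configuration here is a stand-in for the project's class, and the method body is illustrative:

from typing import Optional

class Configuration:
    """Stand-in for the project's Configuration class."""

async def run(
    sql: str,
    sql_generation_reasoning: str,
    configuration: Optional[Configuration] = None,
) -> None:
    # A Configuration() default would be evaluated once at definition time
    # and shared across all calls; instantiating per call avoids that.
    configuration = configuration or Configuration()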

wren-ai-service/src/web/v1/services/ask.py (3)

101-115: Consider direct property definition instead of private _query_id.
This class replicates the getter-setter pattern from AskRequest, but you might simplify by storing query_id directly unless there's a special need for these property methods.


122-134: Similar approach to StopAskRequest.
No immediate concerns, but consider normalizing naming and usage if you integrate both for consistency.


551-706: Unified logic opportunity in ask_feedback.
Most logic in ask_feedback strongly resembles the main ask method. Extract common parts into a shared function or module to reduce duplication and potential inconsistency. Also be mindful of concurrency if multiple feedback requests for the same query ID arrive simultaneously.
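One hypothetical shape for such an extraction; the names are illustrative, not the service's actual API:

from typing import Any, Awaitable, Callable

async def run_shared_pipeline(
    query_id: str,
    steps: list[Callable[[dict], Awaitable[dict]]],
    state: dict[str, Any],
) -> dict[str, Any]:
    # Both ask() and ask_feedback() could delegate here: each step is an
    # async stage (retrieval, generation, correction, ...) that receives
    # and returns the accumulated state.
    for step in steps:
        state = await step(state)
    state["query_id"] = query_id
    return state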

wren-ai-service/demo/utils.py (2)

587-624: Refactor common error handling pattern.

The error handling pattern is duplicated across multiple functions (ask_feedback, save_sql_pair, etc.). Consider extracting it into a reusable helper function.

Create a helper function like this:

def handle_api_response(response_future, operation_name: str):
    """
    Generic handler for API responses with status polling.

    Args:
        response_future: Callable that returns a response with a status field
        operation_name: Name of the operation for toast messages
    """
    status = None
    while status not in ("finished", "failed", "stopped"):
        response = response_future()
        assert response.status_code == 200
        status = response.json()["status"]
        st.toast(f"The {operation_name} status: {status}")
        time.sleep(POLLING_INTERVAL)

    if status == "finished":
        return response.json()
    elif status == "failed":
        st.error(
            f"An error occurred while processing the {operation_name}: {response.json()['error']}",
            icon="🚨",
        )
        return None

626-662: Add retry mechanism for API calls.

The save_sql_pair function makes API calls without any retry mechanism. Consider adding retries for better reliability.

Add a retry decorator:

from functools import wraps
import time

def retry_on_exception(retries=3, delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == retries - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_on_exception()
def save_sql_pair(question: str, sql: str):
    # existing implementation
wren-ai-service/docs/config_examples/config.ollama.yaml (1)

118-121: Fix trailing whitespace.

Remove trailing whitespace on line 121 to maintain consistent formatting.

Apply this diff to fix the formatting:

  - name: sql_regeneration
    llm: litellm_llm.openai/phi4:14b
    engine: wren_ui
-  
🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 121-121: trailing spaces

(trailing-spaces)

wren-ai-service/tools/config/config.example.yaml (1)

145-147: Consider adding documentation for the sql_regeneration pipeline.

Since this is an example configuration file that serves as a template, consider adding comments explaining the purpose and requirements of the sql_regeneration pipeline.

Add documentation above the sql_regeneration entry:

+  # SQL regeneration pipeline for improving SQL queries through reasoning
+  # Requires: 
+  # - LLM model with SQL generation capabilities
+  # - wren_ui engine for SQL execution
   - name: sql_regeneration
     llm: litellm_llm.gpt-4o-mini-2024-07-18
     engine: wren_ui
deployment/kustomizations/base/cm.yaml (1)

179-181: Verify the intentional model difference between configurations.

The sql_regeneration pipeline entry is using a different model compared to config.deepseek.yaml:

  • cm.yaml: gpt-4o-mini-2024-07-18
  • config.deepseek.yaml: deepseek/deepseek-coder

While this might be intentional for different environments, please confirm this is the desired configuration.

Consider documenting the rationale for using different models in different environments to help with maintenance.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3793455 and 8d29578.

📒 Files selected for processing (28)
  • deployment/kustomizations/base/cm.yaml (1 hunks)
  • docker/config.example.yaml (1 hunks)
  • wren-ai-service/demo/app.py (1 hunks)
  • wren-ai-service/demo/utils.py (6 hunks)
  • wren-ai-service/docs/config_examples/config.azure.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.deepseek.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.groq.yaml (1 hunks)
  • wren-ai-service/docs/config_examples/config.ollama.yaml (1 hunks)
  • wren-ai-service/src/globals.py (1 hunks)
  • wren-ai-service/src/pipelines/generation/__init__.py (1 hunks)
  • wren-ai-service/src/pipelines/generation/followup_sql_generation.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_correction.py (4 hunks)
  • wren-ai-service/src/pipelines/generation/sql_expansion.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_explanation.py (0 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
  • wren-ai-service/src/pipelines/generation/sql_regeneration.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/utils/sql.py (3 hunks)
  • wren-ai-service/src/web/v1/routers/__init__.py (0 hunks)
  • wren-ai-service/src/web/v1/routers/ask.py (2 hunks)
  • wren-ai-service/src/web/v1/routers/sql_explanations.py (0 hunks)
  • wren-ai-service/src/web/v1/routers/sql_pairs.py (1 hunks)
  • wren-ai-service/src/web/v1/routers/sql_regenerations.py (0 hunks)
  • wren-ai-service/src/web/v1/services/__init__.py (0 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (11 hunks)
  • wren-ai-service/tests/pytest/test_main.py (0 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
💤 Files with no reviewable changes (6)
  • wren-ai-service/src/web/v1/routers/__init__.py
  • wren-ai-service/tests/pytest/test_main.py
  • wren-ai-service/src/web/v1/services/__init__.py
  • wren-ai-service/src/web/v1/routers/sql_regenerations.py
  • wren-ai-service/src/web/v1/routers/sql_explanations.py
  • wren-ai-service/src/pipelines/generation/sql_explanation.py
🧰 Additional context used
🪛 YAMLlint (1.35.1)
wren-ai-service/docs/config_examples/config.ollama.yaml

[error] 121-121: trailing spaces

(trailing-spaces)

🔇 Additional comments (34)
wren-ai-service/src/pipelines/generation/sql_regeneration.py (8)

3-3: No concerns with the updated import statement.


13-18: No issues with these newly imported symbols.


19-19: Import of Configuration looks fine.


78-79: No concerns with the new regenerate_sql function definition.


82-82: No issues with returning the generator result here.


86-94: Add fallback when replies is missing.
If regenerate_sql doesn’t contain a replies key, post_processor.run() might fail. A defensive check ensures robust error handling.
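A defensive sketch, assuming regenerate_sql is the dict returned by the generator node and post_processor exposes the run() coroutine used elsewhere in this pipeline:

async def post_process(regenerate_sql: dict, post_processor, timeout: float) -> dict:
    replies = regenerate_sql.get("replies")
    if not replies:
        # Fail fast with a clear payload instead of letting
        # post_processor.run() raise on a missing or empty "replies" key.
        return {"results": [], "error": "LLM returned no replies"}
    return await post_processor.run(replies, timeout=timeout)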


129-129: Observer decorator rename is consistent.


161-162: No issues with the dry-run usage for testing.

wren-ai-service/src/web/v1/services/ask.py (5)

95-95: Renaming to sql_generation_reasoning is consistent with the new structure.


118-119: No issues with the simple response model.


136-138: No concerns with the StopAskFeedbackResponse.


140-156: No issues with the new feedback result request/response models.


171-173: Caution with concurrent updates to _ask_feedback_results.
TTLCache is not thread-safe. Consider adding locks or ensuring single-thread usage to avoid race conditions in high-concurrency environments.
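One way to serialize access, sketched with a plain threading.Lock around a cachetools TTLCache; the maxsize and ttl values are placeholders:

import threading

from cachetools import TTLCache

_ask_feedback_results: TTLCache = TTLCache(maxsize=1_000_000, ttl=120)
_ask_feedback_lock = threading.Lock()

def set_feedback_result(query_id: str, result: object) -> None:
    # TTLCache mutations are not thread-safe; guard each read-modify-write.
    with _ask_feedback_lock:
        _ask_feedback_results[query_id] = result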

wren-ai-service/src/pipelines/generation/__init__.py (1)

36-36: Re-adding SQLRegeneration to __all__.
No objections; this aligns with the recent pipeline changes focusing on regeneration.

wren-ai-service/src/pipelines/generation/sql_correction.py (3)

28-28: LGTM! Improved prompt clarity.

The system prompt has been simplified by combining two sentences into one, making it more concise while maintaining the same meaning.


41-46: LGTM! Added conditional rendering of database schema.

The addition of the conditional block ensures that the database schema section is only included when documents are available, improving prompt relevance.


109-111: Verify the impact of using shared model kwargs.

The switch to SQL_GENERATION_MODEL_KWARGS suggests a standardization of model configuration. Ensure this change maintains the necessary response format for SQL correction.

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check if SQL_GENERATION_MODEL_KWARGS maintains required response format

# Search for SQL_GENERATION_MODEL_KWARGS definition
rg -A 5 "SQL_GENERATION_MODEL_KWARGS\s*=" src/

Length of output: 93


I've noticed that the initial search command looked in the wrong directory (it used src/ rather than the correct path) and, as a result, returned no results. To verify the impact of using shared model kwargs, we need to locate the definition of SQL_GENERATION_MODEL_KWARGS in its proper context and inspect its settings to ensure it preserves the expected SQL correction response format.

Let's re-run the search in the correct directory.

Please execute the following script and provide the output so we can continue our verification.


🏁 Script executed:

#!/bin/bash
# Search for the definition of SQL_GENERATION_MODEL_KWARGS in the appropriate directory (wren-ai-service/src/)
rg -A 5 "SQL_GENERATION_MODEL_KWARGS\s*=" wren-ai-service/src/

Length of output: 609


Verification of Shared Model Kwargs for SQL Correction Response

  • The SQL_GENERATION_MODEL_KWARGS is defined in wren-ai-service/src/pipelines/generation/utils/sql.py.
  • It specifies a "response_format" with "type": "json_schema".
  • The "json_schema" is set using SqlGenerationResult.model_json_schema(), ensuring the expected SQL generation result format is maintained.
wren-ai-service/src/pipelines/generation/sql_expansion.py (2)

13-16: LGTM! Simplified imports.

The imports have been consolidated to include only the necessary components from the SQL utils module.


104-104: Verify the impact of using shared model kwargs.

Similar to the correction pipeline, ensure that using SQL_GENERATION_MODEL_KWARGS maintains the required response format for SQL expansion.

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check if SQL_GENERATION_MODEL_KWARGS is used consistently

# Search for all usages of SQL_GENERATION_MODEL_KWARGS
rg "SQL_GENERATION_MODEL_KWARGS" src/

Length of output: 82


I'll run a new shell script to search for the definition and usage of SQL_GENERATION_MODEL_KWARGS specifically in the wren-ai-service directory. This should help us determine whether it’s defined correctly and maintained consistently.


🏁 Script executed:

#!/bin/bash
# Search for all occurrences of SQL_GENERATION_MODEL_KWARGS in the wren-ai-service directory
rg -n "SQL_GENERATION_MODEL_KWARGS" wren-ai-service

Length of output: 1266


Shared Model Kwargs Usage is Consistent Across Pipelines

The constant SQL_GENERATION_MODEL_KWARGS is defined centrally in wren-ai-service/src/pipelines/generation/utils/sql.py and is used in the same manner in the SQL expansion pipeline as in the correction, generation, and regeneration pipelines. This confirms that the shared model kwargs maintain the required response format for SQL expansion.

wren-ai-service/src/pipelines/generation/sql_generation.py (1)

13-18: LGTM! Consistent model configuration.

The import changes align with the standardization of SQL generation model configuration across pipelines.

wren-ai-service/src/web/v1/routers/sql_pairs.py (1)

147-147: LGTM! Enhanced status tracking.

The addition of "indexing" status to the Literal type provides better visibility into the SQL pairs preparation process, aligning with the status set in the prepare endpoint (line 100).

wren-ai-service/src/pipelines/generation/followup_sql_generation.py (1)

13-18: LGTM! Improved code organization by using shared constants.

The change to use SQL_GENERATION_MODEL_KWARGS from sql.py instead of a local constant improves maintainability by centralizing the configuration.

wren-ai-service/src/web/v1/routers/ask.py (3)

146-166: LGTM! Well-structured feedback endpoint implementation.

The POST endpoint follows the same pattern as the existing /asks endpoint, with proper background task handling and status initialization.


169-181: LGTM! Consistent implementation of feedback cancellation.

The PATCH endpoint maintains consistency with the existing stop functionality in /asks/{query_id}.


184-191: LGTM! Clean implementation of feedback result retrieval.

The GET endpoint follows the same pattern as the existing result retrieval endpoints.

wren-ai-service/demo/app.py (1)

61-62: LGTM! Added session state for SQL generation reasoning.

The addition of sql_generation_reasoning aligns with the PR's objective to change SQL through reasoning.

wren-ai-service/src/globals.py (1)

111-114: LGTM! Consolidated SQL regeneration into ask service.

The addition of the sql_regeneration pipeline to ask_service improves service organization by consolidating related functionality.

wren-ai-service/src/pipelines/generation/utils/sql.py (1)

556-564: LGTM! Well-structured model kwargs.

The SQL_GENERATION_MODEL_KWARGS constant is well-defined using the Pydantic model's JSON schema, ensuring type safety and validation.

docker/config.example.yaml (1)

131-133: LGTM! Consistent pipeline configuration.

The sql_regeneration pipeline entry is well-configured with appropriate model and engine settings.

wren-ai-service/docs/config_examples/config.azure.yaml (1)

125-127: LGTM! Verify Azure model availability.

The SQL regeneration pipeline configuration is correctly structured with the appropriate model and engine specifications.

✅ Verification successful

🌐 Web query:

Is GPT-4 available through Azure OpenAI Service with API version 2024-02-15-preview?

💡 Result:

GPT-4 is available through Azure OpenAI Service using API version 2024-02-15-preview, though with specific access requirements[1][7]. Microsoft introduced GPT-4 integration into Azure OpenAI Service in mid-May 2023[1][7], though initial access required submitting a business case for approval during its private preview phase[7][9].

The 2024-02-15-preview API version specifically supports:

  • Assistants API for building AI-powered conversational agents[8]
  • Text-to-speech capabilities[8]
  • GPT models including GPT-3.5-Turbo and GPT-4[2][7]

Key implementation details:

  1. Access requires an approved Azure OpenAI Service subscription[7]
  2. Users must deploy GPT-4 as a separate model within their Azure OpenAI resource[3][7]
  3. Latest updates recommend migrating to newer API versions like 2025-01-01-preview for enhanced features[5][8]

For current implementations:

# Sample API call using Python SDK
from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-02-15-preview",
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="gpt-4", # Your deployment name
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

Microsoft continues expanding capabilities through newer API versions while maintaining backward compatibility with 2024-02-15-preview for existing implementations[5][8].



SQL Regeneration Pipeline Configuration is Correct & Azure GPT-4 is Available

The Azure model configuration in the SQL regeneration pipeline is properly structured. The web query confirms that GPT-4 is available through Azure OpenAI Service with API version 2024-02-15-preview (with the noted access requirements). No follow-up changes are needed.

wren-ai-service/tools/config/config.full.yaml (1)

145-147: LGTM! Verify OpenAI model availability.

The SQL regeneration pipeline configuration is correctly structured with the appropriate model and engine specifications.

✅ Verification successful

🌐 Web query:

Is gpt-4o-mini-2024-07-18 a valid OpenAI model?

💡 Result:

Yes, gpt-4o-mini-2024-07-18 is a valid OpenAI model. Here's what the evidence shows:


Official Documentation Confirmation

OpenAI's API documentation explicitly lists gpt-4o-mini-2024-07-18 as a valid model ID under its GPT‑4o mini section[7][10]. Documented limits:

  • Context window: 128,000 tokens
  • Max output tokens: 16,384

Key Features

  1. Multimodal Inputs: Accepts text and image inputs (with plans for audio/video)[1][3][10].
  2. Performance: Outperforms GPT‑3.5 Turbo and rivals Claude Haiku/Gemini Flash on benchmarks like MMLU (82%) and multimodal reasoning (59.4% on MMMU)[1][3].
  3. Cost Efficiency: Priced at $0.15/million input tokens and $0.60/million output tokens—60% cheaper than GPT‑3.5 Turbo[1][7].

Use Cases

The model is designed for:

  • Applications requiring parallel API calls or large context windows (e.g., full codebases)[1].
  • Fine-tuning workflows where outputs from larger models like GPT‑4o can be distilled into GPT‑4o mini for cost savings[7][10].

Community Validation

Users have reported fine-tuning experiments with gpt‑4o‑mini‑2024‑07‑18, noting its stability compared to larger models like GPT‑4o[6]. While some faced temporary API errors (e.g., "Unknown model" issues), these appear resolved[9].


In summary, gpt‑4o‑mini‑2024‑07‑18 is both officially documented and actively used in real-world applications[1][7][10].



SQL Regeneration Configuration Verified
The configuration at lines 145-147 is correctly structured, and we've confirmed that the OpenAI model gpt-4o-mini-2024-07-18 is valid per OpenAI’s official documentation.

wren-ai-service/docs/config_examples/config.groq.yaml (1)

119-121: LGTM! Verify Groq model availability.

The SQL regeneration pipeline configuration is correctly structured with the appropriate model and engine specifications.

✅ Verification successful

🌐 Web query:

Is llama-3.3-70b-specdec available through Groq's API?

💡 Result:

Yes, llama-3.3-70b-specdec is available through Groq's API[1][4]. The model was released on GroqCloud as part of a performance enhancement initiative, offering >6x speed improvements over previous versions by leveraging speculative decoding techniques[4]. Developers can access it via:

  • GroqCloud Developer Console: Available to paying customers with an API key[4][7].
  • On-demand pricing: $0.59 per million input tokens and $0.99 per million output tokens[4].

Key details:

  • Performance: Achieves 1,660 tokens per second on Groq’s 14nm LPU architecture[4].
  • Use cases: Agentic workflows, real-time conversational AI, and decision-making applications[4][7].
  • Integration: Supported in tools like Promptfoo and Helicone for observability and testing[6][7].

For free-tier access, some platforms like App Inventor’s GroqText extension include limited daily tokens for experimentation[10].



SQL Regeneration Pipeline and Groq Model Availability Verified

The configuration at lines 119-121, which adds the sql_regeneration pipeline using the llama-3.3-70b-specdec model with the wren_ui engine, is correct. The Groq API confirms that llama-3.3-70b-specdec is indeed available, with notable performance improvements as detailed in the web query results.

wren-ai-service/docs/config_examples/config.google_ai_studio.yaml (1)

126-128: LGTM! Verify Gemini model availability.

The SQL regeneration pipeline configuration is correctly structured with the appropriate model and engine specifications.

✅ Verification successful

🌐 Web query:

Is gemini-2.0-flash-exp available through Google AI Studio?

💡 Result:

Yes, gemini-2.0-flash-exp is available through Google AI Studio as of February 2025[3][5]. Here are key details:

Availability status

  • General availability: The updated Gemini 2.0 Flash (including gemini-2.0-flash-exp) became generally available via Google AI Studio on February 6, 2025[3].
  • Experimental access: Earlier experimental versions were first made available in December 2024 through Google AI Studio[5][10].

Key features

  • 1M token context window for processing large inputs[2][8]
  • Multimodal input (text/images/audio/video)[2][8]
  • Native tool integration with Google Search and code execution[5][8]
  • Free tier limits: 15 requests/minute, 1M tokens/minute, 1,500 requests/day[1]

How to access

  1. Visit Google AI Studio
  2. Obtain a free API key
  3. Use model ID gemini-2.0-flash-exp in your API calls[1][7]

The model remains experimental for certain features like text-to-speech generation and image output, which require allowlist access[9]. Developers can transition from experimental to production use through Vertex AI's Gen AI SDK[6].



Gemini Model Verified

The SQL regeneration pipeline configuration is correct, and the gemini-2.0-flash-exp model is confirmed to be available through Google AI Studio as of February 2025.

wren-ai-service/docs/config_examples/config.deepseek.yaml (1)

137-139: LGTM! Configuration is consistent with other SQL-related pipelines.

The new sql_regeneration pipeline entry follows the established pattern and uses appropriate model and engine choices.

Let's verify that this configuration is consistent across all config files:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Description: Check for sql_regeneration configuration consistency across config files
rg -A 2 "sql_regeneration:" deployment/ docker/ wren-ai-service/

Length of output: 64


SQL Regeneration Configuration Verification

The configured sql_regeneration pipeline in wren-ai-service/docs/config_examples/config.deepseek.yaml appears to follow the established pattern. However, the initial search across directories (deployment/, docker/, and wren-ai-service/) returned no additional instances. This might be expected if the new configuration is only meant to reside in the provided example file.

Please verify manually whether the new pipeline entry should also appear in other related config files or environments.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
wren-ai-service/src/pipelines/generation/sql_regeneration.py (2)

24-38: Enhance SQL standards specification in the system prompt.

While the system prompt is clear, it could be more specific about SQL standards to ensure consistent output.

Consider updating the system prompt to explicitly mention supported SQL standards and any specific requirements:

 ### TASK ###
-You are a great ANSI SQL expert. Now you are given a SQL generation reasoning and an original SQL query, 
+You are a great ANSI SQL expert specializing in SQL:2016 standard. Now you are given a SQL generation reasoning and an original SQL query, 
 please carefully review the reasoning, and then generate a new SQL query that matches the reasoning.
 While generating the new SQL query, you should use the original SQL query as a reference.

158-163: Use more realistic test data in dry run.

The current test data is overly simplistic. Consider using more realistic examples that cover common use cases.

Example:

 dry_run_pipeline(
     SQLRegeneration,
     "sql_regeneration",
-    sql_generation_reasoning="this is a test query",
-    sql="select * from users",
+    sql_generation_reasoning="Add a WHERE clause to filter active users with more than 100 points",
+    sql="SELECT user_id, username, points FROM users",
 )
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1)

131-131: Consider addressing the TODO comment.

The comment suggests adding an LLM filter as a backup for the ScoreFilter. With the recent changes to make the scoring threshold configurable, this might be a good time to implement this enhancement.

Would you like me to help implement the LLM filter or create an issue to track this task?

wren-ai-service/src/web/v1/services/ask.py (1)

554-677: Consider refactoring error handling logic.

The error handling logic in ask_feedback is very similar to the logic in ask. Consider extracting common error handling patterns into a shared method to improve maintainability.

Example refactor:

def _handle_error(self, error: Exception, container: dict, query_id: str, results: dict):
    logger.exception(f"pipeline - OTHERS: {error}")
    
    container[query_id] = AskFeedbackResultResponse(
        status="failed",
        error=AskError(
            code="OTHERS",
            message=str(error),
        ),
    )
    
    results["metadata"]["error_type"] = "OTHERS"
    results["metadata"]["error_message"] = str(error)
    return results
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d29578 and 6f5f93c.

📒 Files selected for processing (8)
  • wren-ai-service/src/config.py (1 hunks)
  • wren-ai-service/src/globals.py (2 hunks)
  • wren-ai-service/src/pipelines/common.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation_reasoning.py (6 hunks)
  • wren-ai-service/src/pipelines/generation/sql_regeneration.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/utils/sql.py (3 hunks)
  • wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (4 hunks)
  • wren-ai-service/src/web/v1/services/ask.py (13 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/src/pipelines/generation/utils/sql.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Analyze (python)
  • GitHub Check: pytest
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (14)
wren-ai-service/src/pipelines/generation/sql_regeneration.py (1)

1-20: LGTM! Well-organized imports.

The imports are properly organized and include all necessary dependencies for the SQL regeneration functionality.

wren-ai-service/src/pipelines/common.py (2)

49-51: LGTM!

The function call has been reformatted for better readability while maintaining the same functionality.


67-72: LGTM!

The addition of the max_size parameter enhances the filtering functionality by allowing control over the number of documents returned. The implementation correctly applies the limit after sorting to ensure the top-scoring documents are returned.

Also applies to: 78-78
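The order of operations described above, sketched with a stand-in document type:

from dataclasses import dataclass

@dataclass
class ScoredDocument:  # stand-in for the pipeline's document type
    content: str
    score: float

def filter_documents(
    docs: list[ScoredDocument], score_threshold: float, max_size: int
) -> list[ScoredDocument]:
    # Filter by score, sort descending, and only then truncate, so the
    # max_size cap keeps the top-scoring documents.
    kept = [d for d in docs if d.score >= score_threshold]
    kept.sort(key=lambda d: d.score, reverse=True)
    return kept[:max_size]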

wren-ai-service/src/config.py (1)

33-34: LGTM!

The new configuration fields are well-placed in the indexing and retrieval section, properly typed, and have sensible default values that align with their usage in the SqlPairsRetrieval class.

wren-ai-service/src/pipelines/generation/sql_generation_reasoning.py (3)

30-30: LGTM!

The changes effectively integrate SQL examples into the prompt structure, with proper conditional handling to include the examples section only when examples are provided.

Also applies to: 46-51


67-67: LGTM!

The sql_samples parameter is properly integrated into the function signature and correctly passed to the prompt builder.

Also applies to: 74-74


138-138: LGTM!

The changes properly handle the optional sql_samples parameter with appropriate default values, ensuring backward compatibility through the empty list fallback.

Also applies to: 147-147
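The fallback pattern in question, in a minimal illustrative form:

from typing import Optional

def build_prompt_inputs(query: str, sql_samples: Optional[list[dict]] = None) -> dict:
    # A None default keeps existing callers working; the empty-list
    # fallback lets the template's examples section render nothing.
    return {"query": query, "sql_samples": sql_samples or []}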

wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (2)

85-90: LGTM!

The function properly integrates the configurable threshold and max size parameters, correctly passing them to the score filter.

Also applies to: 92-96


119-120: LGTM!

The constructor properly handles the new parameters with default values that match the global settings, storing them appropriately in the _configs dictionary.

Also applies to: 134-137

wren-ai-service/src/globals.py (1)

92-93: LGTM! Configuration parameters added for SQL pairs retrieval.

The addition of sql_pairs_similarity_threshold and sql_pairs_retrieval_max_size parameters enhances the configurability of the SQL pairs retrieval pipeline.

Also applies to: 113-115

wren-ai-service/src/web/v1/services/ask.py (4)

102-156: LGTM! Well-structured feedback-related model classes.

The new model classes maintain consistency with existing patterns and provide a clear structure for handling SQL feedback functionality.


177-183: LGTM! Improved reusability of the stop check logic.

The _is_stopped method has been refactored to accept a container parameter, making it reusable for both ask and feedback results.
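The parameterized check amounts to roughly this; the status attribute name follows the response models above:

def _is_stopped(query_id: str, container: dict) -> bool:
    # Works for both _ask_results and _ask_feedback_results: the caller
    # passes whichever result store it is polling.
    result = container.get(query_id)
    return result is not None and getattr(result, "status", None) == "stopped"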


343-348: LGTM! Enhanced SQL generation with similar SQL pairs.

The addition of SQL samples from similar queries can improve the quality of generated SQL.


617-617: Consider providing context for SQL correction.

The SQL correction pipeline is called with an empty contexts list, which might limit its effectiveness.

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Description: Check if the sql_correction pipeline requires contexts.
ast-grep --pattern $'sql_correction = generation.SQLCorrection(
  $$$
  contexts=$_,
  $$$
)'

Length of output: 94


SQL Correction Pipeline: Manual Verification Needed

The current implementation instantiates the SQL correction pipeline with an empty contexts list. Our automated search (using ast-grep for a pattern like contexts=$_) didn’t produce evidence to confirm whether other parts of the codebase populate this parameter or if it’s expected to remain empty. Please manually verify if non-empty context data is needed for effective SQL regeneration.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

125-135: Consider adding error handling for the cleaner.run call.

While the empty check is good, the function should handle potential errors from the document store operation to ensure pipeline robustness.

Consider wrapping the call in a try-catch block:

 async def clean(
     cleaner: SqlPairsCleaner,
     sql_pairs: List[SqlPair],
     embedding: Dict[str, Any],
     project_id: Optional[str] = "",
 ) -> Dict[str, Any]:
     sql_pair_ids = [sql_pair.id for sql_pair in sql_pairs]
     if sql_pair_ids:
-        await cleaner.run(sql_pair_ids=sql_pair_ids, project_id=project_id)
+        try:
+            await cleaner.run(sql_pair_ids=sql_pair_ids, project_id=project_id)
+        except Exception as e:
+            logger.error(f"Error cleaning SQL pairs: {e}")
+            # Consider whether to re-raise or continue with embedding
 
     return embedding
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a5f2ed5 and 1afc8d3.

📒 Files selected for processing (2)
  • wren-ai-service/src/pipelines/generation/sql_generation_reasoning.py (6 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/src/pipelines/generation/sql_generation_reasoning.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: pytest
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

132-133: LGTM! Good defensive programming practice.

The added check prevents unnecessary calls to the document store when there are no SQL pairs to clean, which is a good practice for robustness.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
wren-ai-service/src/pipelines/generation/sql_generation.py (2)

95-105: Enhance error handling in post_process function.

The function could benefit from explicit error handling for timeout and SQL validation scenarios.

Consider adding try-catch blocks:

 @observe(capture_input=False)
 async def post_process(
     generate_sql: dict,
     post_processor: SQLGenPostProcessor,
     engine_timeout: float,
     project_id: str | None = None,
 ) -> dict:
+    try:
         return await post_processor.run(
             generate_sql.get("replies"),
             timeout=engine_timeout,
             project_id=project_id,
         )
+    except TimeoutError:
+        logger.error("SQL validation timed out")
+        raise
+    except Exception as e:
+        logger.error(f"Error in post processing: {str(e)}")
+        raise

111-166: Enhance type safety with additional type hints.

The class could benefit from more explicit type hints and validation.

Consider these improvements:

 class SQLGeneration(BasicPipeline):
     def __init__(
         self,
         llm_provider: LLMProvider,
         engine: Engine,
         engine_timeout: Optional[float] = 30.0,
         **kwargs,
-    ):
+    ) -> None:
         self._components = {
             "generator": llm_provider.get_generator(
                 system_prompt=sql_generation_system_prompt,
                 generation_kwargs=SQL_GENERATION_MODEL_KWARGS,
             ),
             "prompt_builder": PromptBuilder(
                 template=sql_generation_user_prompt_template
             ),
             "post_processor": SQLGenPostProcessor(engine=engine),
         }

         self._configs = {
             "engine_timeout": engine_timeout,
         }

         super().__init__(
             AsyncDriver({}, sys.modules[__name__], result_builder=base.DictResult())
         )

     @observe(name="SQL Generation")
     async def run(
         self,
         query: str,
         contexts: List[str],
         sql_generation_reasoning: Optional[str] = None,
         configuration: Configuration = Configuration(),
         sql_samples: Optional[List[Dict[str, str]]] = None,
         project_id: Optional[str] = None,
         has_calculated_field: bool = False,
         has_metric: bool = False,
-    ):
+    ) -> Dict[str, Any]:
wren-ai-service/eval/utils.py (2)

220-229: Swap hard-coded print statements for logging.
Using print("datasource is duckdb") may be too rigid. If WrenEngine supports other data sources, it might be misleading as well. Consider Python’s logging module.

-        print("datasource is duckdb")
+        import logging
+        logger = logging.getLogger(__name__)
+        logger.info("Detected WrenEngine; defaulting to DuckDB datasource.")

570-587: Rewriting YAML may discard comments & formatting.
The approach with yaml.safe_dump_all can lose existing inline comments or formatting. Consider whether preserving these is necessary and use a more selective editing strategy if so.

wren-ai-service/eval/prediction.py (1)

109-114: Substring check for “spider_” may be brittle.
Using "spider_" in path works, but a more robust detection approach or a dedicated CLI argument might be preferable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1afc8d3 and ef48398.

📒 Files selected for processing (6)
  • wren-ai-service/eval/__init__.py (1 hunks)
  • wren-ai-service/eval/pipelines.py (8 hunks)
  • wren-ai-service/eval/prediction.py (3 hunks)
  • wren-ai-service/eval/utils.py (4 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation_reasoning.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/src/pipelines/generation/sql_generation_reasoning.py
🔇 Additional comments (13)
wren-ai-service/src/pipelines/generation/sql_generation.py (2)

42-42: Verify the purpose of the empty line change.

An empty line was modified at line 42 between the SQL sample sections. While this might be intentional for formatting, ensure it doesn't affect the template rendering.


13-14: ⚠️ Potential issue

Critical: Resolve the missing SQL_GENERATION_MODEL_KWARGS dependency.

The code still imports and uses SQL_GENERATION_MODEL_KWARGS which has been removed according to the AI summary. This will cause runtime errors.

Let's verify if the variable exists in the utils file:

Consider one of these solutions:

  1. Restore the variable if it's still needed
  2. Replace the usage with inline configuration
  3. Move the configuration to a dedicated config file

Example of inline configuration:

generation_kwargs={
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "sql": {"type": "string"},
                "reasoning": {"type": "string"}
            },
            "required": ["sql", "reasoning"]
        }
    }
}

Also applies to: 122-122

❌ Incorrect review comment

🏁 Script executed:

#!/bin/bash
# Check if SQL_GENERATION_MODEL_KWARGS exists in the utils file
rg -A 5 "SQL_GENERATION_MODEL_KWARGS" "src/pipelines/generation/utils/sql.py"

Length of output: 153


I'll further investigate by searching the entire repository for any occurrence of SQL_GENERATION_MODEL_KWARGS. This will help confirm whether the variable was indeed removed and if it's referenced elsewhere.


🏁 Script executed:

#!/bin/bash
# Recursively search the repository for SQL_GENERATION_MODEL_KWARGS
rg "SQL_GENERATION_MODEL_KWARGS"

Length of output: 1208


Clarification: SQL_GENERATION_MODEL_KWARGS is defined in the repository.

A recursive search shows that the variable is indeed defined in
wren-ai-service/src/pipelines/generation/utils/sql.py
and is used in multiple pipeline modules (e.g., in SQL regeneration, correction, and generation). The concern that it was removed is incorrect.

Likely an incorrect or invalid review comment.

wren-ai-service/eval/__init__.py (1)

13-13: LGTM! Well-structured configuration addition.

The new allow_sql_samples boolean flag is a clean addition that maintains backward compatibility with its default value of True. The attribute name is descriptive and follows the class's naming convention.

wren-ai-service/eval/pipelines.py (4)

23-24: LGTM! Enhanced metrics coverage.

The addition of ExactMatchAccuracy and ExecutionAccuracy metrics enriches the evaluation capabilities.


229-230: LGTM! Clean settings integration.

The settings: EvalSettings parameter is properly integrated into the constructor, and the _allow_sql_samples flag is correctly initialized from settings.

Also applies to: 238-239


254-254: LGTM! Robust SQL samples handling.

The conditional inclusion of SQL samples using prediction.get("samples", []) if self._allow_sql_samples else [] is a clean implementation that respects the settings flag.


289-291: Verify Spider dataset compatibility.

The comment indicates these metrics are for the Spider dataset but are temporarily disabled. However, the metrics are now uncommented and active.

Please confirm if these metrics should be enabled for the Spider dataset evaluation:

  1. Are there any specific requirements or considerations for the Spider dataset?
  2. Could enabling these metrics affect other dataset evaluations?
✅ Verification successful

🌐 Web query:

What is the Spider dataset in SQL and how is it typically evaluated?

💡 Result:

The Spider dataset is a large-scale benchmark designed for complex and cross-domain text-to-SQL tasks, focusing on translating natural language questions into SQL queries across diverse database schemas. Here’s an overview of its structure and typical evaluation methods:


Overview of the Spider Dataset

  • Scope: Contains 10,181 questions paired with 5,693 unique SQL queries across 200 databases spanning 138 domains (e.g., college admissions, flight tracking)[1][2][4].
  • Complexity: SQL queries include advanced features like:
    • Multi-table joins
    • Nested subqueries
    • Aggregations (GROUP BY, HAVING)
    • Sorting (ORDER BY)[1][2].
  • Cross-Domain Split: Databases in training and test sets do not overlap, forcing models to generalize to unseen schemas and query structures[1][2].

Key Features

  1. Schema Diversity: Each database averages 28 columns and 9 foreign keys, with domain-specific schemas[2].
  2. Human Annotation: Created by 11 annotators over 1,000 hours to ensure quality[1][4].
  3. Task Design: Tests generalization to new databases and new SQL structures[1][4].

Evaluation Metrics

Spider uses three primary evaluation methods:

  1. Exact Matching (EM) Accuracy:
    • Checks if the predicted SQL exactly matches the gold query.
    • State-of-the-art models initially achieved only 9.7–12.4% EM accuracy[1][2][4].
  2. Component Matching:
    • Decomposes SQL into clauses (e.g., SELECT, WHERE) and compares sets of components[8].
  3. Execution Accuracy:
    • Runs predicted SQL against databases to verify correctness of results[8].
    • Limited by reliance on value prediction in some cases[8].

Challenges Highlighted by Spider

  • Models struggle with:
    • Schema linking (matching questions to tables/columns).
    • Handling unseen database structures.
    • Generating complex nested queries[1][2][5].
  • Subsequent benchmarks like Spider 2.0 and Dr.Spider further test robustness against perturbations (e.g., schema variations or paraphrased questions)[3][5].

Example Use Case

For a question like “List all TV shows with more than 10 episodes”, models must:

  1. Identify relevant tables (TV_show, Episodes).
  2. Join tables using foreign keys.
  3. Generate:
    SELECT TV_show.name FROM TV_show 
    JOIN Episodes ON TV_show.id = Episodes.show_id 
    GROUP BY TV_show.id HAVING COUNT(*) > 10;

The Spider dataset remains a critical benchmark for advancing text-to-SQL systems due to its emphasis on real-world complexity and cross-domain generalization[1][2][4].



Verification of Spider Dataset Compatibility in Metrics Activation

Based on the Spider dataset overview and its standard evaluation practices, here are the key points:

  • Spider Evaluation Standards: The Spider dataset is a benchmark for complex text-to-SQL tasks that typically evaluates predictions via:

    • Exact Matching Accuracy: Ensuring the predicted SQL exactly matches the gold query.
    • Execution Accuracy: Validating correctness by running the SQL query.

    These are standard and expected metrics for Spider dataset evaluation.

  • No Special Requirements: There are no additional, unique configuration requirements for Spider beyond accommodating its inherent complexity (e.g., handling multi-table joins, nested subqueries, etc.). What’s essential is ensuring that the evaluation pipeline applies these metrics specifically when the Spider dataset is used.

  • Impact on Other Evaluations: Enabling these metrics in the Spider evaluation pipeline should not affect evaluations for other datasets, provided that the pipeline is set up to conditionally apply dataset-specific metrics. Each dataset’s evaluation should only trigger the metrics intended for it, and there is no intrinsic conflict between these metrics and other evaluation routines.

Action: The activated metrics (ExactMatchAccuracy and ExecutionAccuracy) are in line with Spider’s evaluation protocol. As long as the pipeline correctly segregates evaluation per dataset, they should remain enabled for Spider without adverse effects on other dataset evaluations.

wren-ai-service/eval/utils.py (2)

13-13: New YAML import looks fine.
No concerns with adding yaml as a dependency for reading/writing YAML files.


18-18: Import usage is correct.
No issues importing WrenEngine for specialized engine checks.

wren-ai-service/eval/prediction.py (4)

2-2: New import for base64.
No concerns; used properly for encoding.


9-9: Use of orjson for faster serialization.
No issues; it’s generally a performant JSON library.


19-19: Imported functions align with new logic.
No concerns about referencing parse_toml and replace_wren_engine_env_variables.


105-105: Parsing dataset earlier in the flow.
No issues. This can simplify subsequent logic by ensuring the dataset is available upfront.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-ai-service/eval/metrics/spider/__init__.py (2)

567-567: Consider using more specific exception types.

While changing from bare except to except Exception is an improvement, consider catching specific exceptions that you expect (e.g., ValueError, KeyError) for better error handling.

-    except Exception:
+    except (ValueError, KeyError) as e:
+        print(f"Error parsing SQL structure: {e}")

628-639: Consider using logging instead of print for error messages.

While the error handling is good, using the logging module would be more appropriate for error tracking in a production environment.

+import logging
+
 def build_foreign_key_map_from_json(table):
     try:
         import json
 
         with open(table) as f:
             data = json.load(f)
         tables = {}
         for entry in data:
             tables[entry["db_id"]] = build_foreign_key_map(entry)
         return tables
     except Exception as e:
-        print(f"Error building foreign key map from JSON: {e}")
+        logging.error(f"Error building foreign key map from JSON: {e}")
         return {}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef48398 and e6f0f9d.

📒 Files selected for processing (2)
  • wren-ai-service/eval/metrics/spider/__init__.py (4 hunks)
  • wren-ai-service/eval/prediction.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/eval/prediction.py
🔇 Additional comments (2)
wren-ai-service/eval/metrics/spider/__init__.py (2)

159-162: LGTM! Improved type checking using isinstance().

The change from type() is to isinstance() follows Python best practices and makes the code more robust by properly handling inheritance and subclassing.
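The difference in two lines:

class Base: ...
class Child(Base): ...

obj = Child()
print(type(obj) is Base)      # False: ignores inheritance
print(isinstance(obj, Base))  # True: respects subclassing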


501-509: LGTM! Consistent type checking improvements.

The change to isinstance() maintains consistency with other type checking improvements and follows Python best practices.

Member

@paopa paopa left a comment


overall lgtm, but one test case needs to be modified: the assertion in test_sql_pairs::test_sql_pairs_deletion

        await pipe.clean(sql_pairs=[], project_id="fake-id")
>       assert await store.count_documents() == 0
E       assert 2 == 0

@cyyeh
Member Author

cyyeh commented Feb 12, 2025

overall lgtm, but one test case needs to be modified: the assertion in test_sql_pairs::test_sql_pairs_deletion

        await pipe.clean(sql_pairs=[], project_id="fake-id")
>       assert await store.count_documents() == 0
E       assert 2 == 0

fixed

@paopa paopa merged commit dfed227 into main Feb 12, 2025
9 checks passed
@paopa paopa deleted the feat/ai-service/change-sql-through-reasoning branch February 12, 2025 09:26