chore(wren-ai-service): add bird eval dataset #1321
Walkthrough
The pull request updates dataset preparation and DuckDB configuration within the service. The CLI commands in the Justfile now require a dataset parameter (e.g.,
Changes
Actionable comments posted: 0
🧹 Nitpick comments (13)
wren-ai-service/eval/preparation.py (8)
11-11
: Import for URL retrieval.
Useful for downloading the bird dataset. Make sure to properly handle timeouts or errors if you anticipate unreliable network conditions.
64-85
: Ensure robust error handling in bird data download logic.
The _download_and_extract function is straightforward, but consider catching exceptions for file I/O or network issues to provide clearer feedback.
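A minimal sketch of what that error handling could look like, using only the standard library. The function name, the `download.zip` scratch file, and the 30-second default are assumptions for illustration; this is not the code in preparation.py.

```python
import urllib.error
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url: str, dest_dir: Path, timeout: float = 30.0) -> Path:
    """Download a zip archive from `url` and unpack it into `dest_dir`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / "download.zip"  # hypothetical scratch file name
    try:
        # `timeout` bounds each blocking socket operation, not the whole transfer.
        with urllib.request.urlopen(url, timeout=timeout) as resp, open(archive, "wb") as out:
            while chunk := resp.read(1 << 20):
                out.write(chunk)
    except (urllib.error.URLError, OSError) as exc:
        raise RuntimeError(f"Failed to download {url}: {exc}") from exc

    try:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest_dir)
    except zipfile.BadZipFile as exc:
        raise RuntimeError(f"{archive} is not a valid zip archive") from exc
    finally:
        # Remove the scratch archive whether or not extraction succeeded.
        archive.unlink(missing_ok=True)
    return dest_dir
```

Wrapping both failure modes in one exception type keeps the calling code simple while the chained cause still shows whether the network or the archive was at fault.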
114-151
: Column descriptions logic looks solid, but watch out for corner cases.
The approach to merge value_description and column_description is good, but some columns may be missing these fields or have unexpected data. A fallback or validation step could improve robustness.
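One possible shape for such a fallback, as a sketch. It assumes each column arrives as a dict-like row, that empty CSV cells show up as None or NaN, and that the two descriptions are joined with a comma; the helper names and the separator are hypothetical.

```python
import math
from typing import Any, Optional


def _clean(value: Any) -> Optional[str]:
    """Return a stripped string, or None for missing, NaN, or blank values."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):  # empty cells often load as NaN
        return None
    text = str(value).strip()
    return text or None


def merge_description(row: dict, fallback: str = "") -> str:
    """Join column_description and value_description, skipping absent parts."""
    parts = [_clean(row.get("column_description")), _clean(row.get("value_description"))]
    merged = ", ".join(p for p in parts if p)
    return merged or fallback
```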
155-181
: Handling composite primary keys.
Currently, composite primary keys are filtered out. If you plan to support them in the future, you may want to log or track them for clarity.
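A sketch of how the skipped keys could be tracked. It assumes the primary key list mixes single column indices with lists of indices for composite keys; the function name is hypothetical.

```python
import logging
from typing import Iterable, List, Tuple, Union

logger = logging.getLogger(__name__)


def split_primary_keys(
    keys: Iterable[Union[int, List[int]]],
) -> Tuple[List[int], List[List[int]]]:
    """Separate single-column keys from composite ones, logging the latter."""
    single: List[int] = []
    composite: List[List[int]] = []
    for key in keys:
        if isinstance(key, list):
            composite.append(key)
            logger.info("Skipping composite primary key on column indices %s", key)
        else:
            single.append(key)
    return single, composite
```

Returning the composite keys alongside the usable ones makes it easy to report how many tables are affected without changing the current filtering behavior.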
184-207
: Relationship building.
The many-to-many relationship assumption is acceptable for now, but be aware that some foreign key relationships are one-to-many. If you require more accurate modeling, consider refining this join type in the future.
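If the join type is refined later, one heuristic is to look at whether each side of the foreign key is unique (for example, a single-column primary key). The sketch below assumes that uniqueness information is available as a set of (table, column) pairs; the join type labels are illustrative and should be checked against the values the engine actually accepts.

```python
from typing import Set, Tuple

Column = Tuple[str, str]  # (table name, column name)


def infer_join_type(fk: Column, ref: Column, unique_columns: Set[Column]) -> str:
    """Guess a join type from which side of a foreign key is unique."""
    fk_unique = fk in unique_columns
    ref_unique = ref in unique_columns
    if fk_unique and ref_unique:
        return "ONE_TO_ONE"
    if ref_unique:
        # Many referencing rows can point at one referenced row.
        return "MANY_TO_ONE"
    if fk_unique:
        return "ONE_TO_MANY"
    # Neither side is known to be unique, so keep the conservative default.
    return "MANY_TO_MANY"
```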
280-330
: CSV reading for bird database descriptions.
Reading with encoding="ISO-8859-1" may be necessary for the given dataset. If other files use UTF-8, unify encodings if possible to avoid confusion.
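Where unifying the files is not practical, trying encodings in order is another option. This sketch uses the standard csv module rather than pandas to stay self-contained; the function name and the encoding order are assumptions. ISO-8859-1 must come last because it accepts any byte sequence and would otherwise mask UTF-8 files.

```python
import csv
from pathlib import Path
from typing import Dict, List, Sequence


def read_csv_rows(
    path: Path,
    encodings: Sequence[str] = ("utf-8-sig", "ISO-8859-1"),
) -> List[Dict[str, str]]:
    """Read a CSV file, falling back through `encodings` on decode errors."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, newline="", encoding=encoding) as fh:
                # Decoding happens lazily, so consume the reader inside the try.
                return list(csv.DictReader(fh))
        except UnicodeDecodeError as exc:
            last_error = exc
    raise ValueError(f"Could not decode {path} with any of {list(encodings)}") from last_error
```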
418-419
: DuckDB initialization.
This call is a key step for engine setup. If logging or error handling is needed, consider adding it here.
455-474
: Final dataset creation and logging.
The final step of creating a TOML dataset is clear. Basic error handling for file write operations might be worth considering.
wren-ai-service/eval/__init__.py (1)
14-14
: New DuckDB path setting.
Storing an explicit path for DuckDB is a clean approach. Consider adding a short docstring or comment explaining intended usage.
wren-ai-service/eval/prediction.py (1)
111-119
: Consider moving dataset paths to configuration.
The hardcoded database paths for both spider and bird datasets should be moved to configuration for better maintainability and flexibility.
Consider moving the paths to a configuration file:
- settings.db_path_for_duckdb = "etc/spider1.0/database"
+ settings.db_path_for_duckdb = settings.get_dataset_path("spider")
- settings.db_path_for_duckdb = "etc/bird/minidev/MINIDEV/dev_databases"
+ settings.db_path_for_duckdb = settings.get_dataset_path("bird")
wren-ai-service/eval/data_curation/app.py (1)
119-123
: Use configuration for database path.
For consistency with the suggested changes in prediction.py, the hardcoded database path should be moved to configuration.
 prepare_duckdb_init_sql(
     WREN_ENGINE_ENDPOINT,
     st.session_state["mdl_json"]["catalog"],
-    "etc/spider1.0/database",
+    settings.get_dataset_path("spider"),
 )
wren-ai-service/eval/utils.py (2)
165-167
: Use proper logging instead of print.
Replace print statement with proper logging for better error tracking and consistency with Python best practices.
- print(f"Error in quoting SQL: {sql}")
+ logging.warning(f"Error in quoting SQL: {sql}", exc_info=True)
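For context, a self-contained sketch of the pattern with a module-level logger. The wrapper name, the `quote` callable, and the choice to return the input unchanged on failure are assumptions; the actual function in utils.py may behave differently.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def quote_sql_safely(sql: str, quote: Callable[[str], str]) -> str:
    """Apply `quote` to `sql`; on failure, log with traceback and return the input."""
    try:
        return quote(sql)
    except Exception:
        # exc_info=True attaches the active traceback to the log record.
        # %s-style arguments defer string formatting until the record is emitted.
        logger.warning("Error in quoting SQL: %s", sql, exc_info=True)
        return sql
```

A named logger obtained through `logging.getLogger(__name__)` lets callers filter or route these warnings per module, which the root-level `logging.warning` call does not.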
545-547
: Add path validation.
The function should validate that the provided path exists and is accessible before attempting to use it.
 def prepare_duckdb_init_sql(api_endpoint: str, db: str, path: str):
+    db_path = Path(f"{path}/{db}/{db}.sqlite")
+    if not db_path.exists():
+        raise FileNotFoundError(f"Database file not found at {db_path}")
+
-    init_sql = f"ATTACH '{path}/{db}/{db}.sqlite' AS {db} (TYPE sqlite);"
+    init_sql = f"ATTACH '{db_path}' AS {db} (TYPE sqlite);"
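The validation step can be exercised in isolation. The sketch below covers only the SQL-building part under a hypothetical name, `build_attach_sql`; the `api_endpoint` handling of the real function is omitted, and escaping single quotes in the path is an extra precaution that is not part of the suggestion above.

```python
from pathlib import Path


def build_attach_sql(db: str, path: str) -> str:
    """Build the ATTACH statement for `<path>/<db>/<db>.sqlite`, failing early if absent."""
    db_path = Path(path) / db / f"{db}.sqlite"
    if not db_path.is_file():
        raise FileNotFoundError(f"Database file not found at {db_path}")
    # Double any single quotes so the path stays a valid SQL string literal.
    escaped = str(db_path).replace("'", "''")
    return f"ATTACH '{escaped}' AS {db} (TYPE sqlite);"
```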
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
wren-ai-service/Justfile (1 hunks)
wren-ai-service/eval/__init__.py (1 hunks)
wren-ai-service/eval/data_curation/app.py (3 hunks)
wren-ai-service/eval/pipelines.py (3 hunks)
wren-ai-service/eval/prediction.py (1 hunks)
wren-ai-service/eval/preparation.py (8 hunks)
wren-ai-service/eval/utils.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (19)
wren-ai-service/eval/preparation.py (15)
2-2
: Good documentation update.
Adding "bird" to the high-level docstring clarifies the file’s multi-dataset support.
4-4
: New import seems correct.
Argparse is necessary for handling the new CLI argument for dataset selection.
15-15
: Pandas import.
Pandas is used to read CSV files for the bird dataset descriptions. Ensure that the CSV encoding is correct to avoid decoding errors.
27-27
: Established new path for the bird dataset.
Having a separate path for the bird dataset improves clarity.
91-96
: Key-based JSON parsing.
The function returns a dictionary keyed by the specified field. This is simple and clear, but be mindful of potential missing or malformed fields in the JSON.
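A defensive variant might skip and report records that lack the key instead of raising. This is a sketch with a hypothetical function name; it assumes the JSON file holds a list of objects.

```python
import logging
from typing import Any, Dict, Hashable, Iterable

logger = logging.getLogger(__name__)


def index_by_key(records: Iterable[Any], key: str) -> Dict[Hashable, dict]:
    """Build a dict keyed by `record[key]`, skipping malformed records."""
    indexed: Dict[Hashable, dict] = {}
    for position, record in enumerate(records):
        if not isinstance(record, dict) or key not in record:
            logger.warning("Record %d has no %r field; skipping", position, key)
            continue
        # Later records with the same key value overwrite earlier ones.
        indexed[record[key]] = record
    return indexed
```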
98-112
: Check for potential length mismatches in _merge_column_info.
When zipping column_names_original and column_types, if the lengths disagree, this could throw an error or cause incomplete merges. Consider adding a safety check or error handling.
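Note that a plain zip() stops silently at the shorter input, so a mismatch would drop columns rather than raise. A sketch of an explicit check follows; it assumes column_names_original is a list of [table_index, name] pairs and column_types a parallel list of strings, and the output field names are illustrative.

```python
from typing import Dict, List, Sequence, Union


def merge_column_info(
    column_names_original: Sequence[Sequence[Union[int, str]]],
    column_types: Sequence[str],
) -> List[Dict[str, Union[int, str]]]:
    """Pair each column with its type, refusing inputs of different lengths."""
    if len(column_names_original) != len(column_types):
        raise ValueError(
            f"Got {len(column_names_original)} column names "
            f"but {len(column_types)} column types"
        )
    return [
        {"table_index": table_index, "name": name, "type": column_type}
        for (table_index, name), column_type in zip(column_names_original, column_types)
    ]
```

On Python 3.10 and later, `zip(..., strict=True)` gives the same protection without the manual length comparison.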
211-220
: No immediate issues.
Aggregating ground truths by db id is straightforward. Ensure any missing keys are handled gracefully upstream.
222-240
: spider1.0 model building.
Logic is consistent with the existing code structure. No critical issues found.
247-277
: Question-SQL pairs extraction is correct.
Implementation aligns well with the existing spider approach.
332-356
: Extraction of question-sql pairs for bird.
Logic parallels spider approach effectively. Looks good.
367-377
: Argparse for dataset selection.
Clear user interface improvement by exposing a --dataset parameter.
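For reference, a minimal argparse setup of this kind could look as follows. The accepted values ("spider1.0" and "bird") and the decision to make the flag required are assumptions based on the names used elsewhere in this review, not the exact definitions in preparation.py.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the dataset selection from the command line."""
    parser = argparse.ArgumentParser(description="Prepare an evaluation dataset.")
    parser.add_argument(
        "--dataset",
        choices=["spider1.0", "bird"],  # assumed dataset identifiers
        required=True,
        help="Which benchmark dataset to prepare.",
    )
    return parser.parse_args(argv)
```

Restricting the flag with `choices` makes argparse reject unknown dataset names with a usage message before any download starts.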
379-390
: Downloading bird data.
Manually verifying that the download URL is reliable or hosting a fallback would make this more robust.
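If a fallback host is ever added, trying the sources in order is straightforward. This is a sketch only: the function name is hypothetical, and no mirror URL is known to exist for the bird dataset.

```python
import urllib.request
from typing import Iterable, List


def fetch_first_available(urls: Iterable[str], timeout: float = 30.0) -> bytes:
    """Return the body of the first URL that can be fetched, trying each in order."""
    errors: List[str] = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            errors.append(f"{url}: {exc}")
    raise RuntimeError("All download sources failed:\n" + "\n".join(errors))
```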
392-404
: Building dataset for spider or bird.
Switch-case logic is readable. No issues here.
409-410
: Initializing questions_size.
Straightforward. No issues.
411-413
: Setting duckdb_init_path for bird.
This is consistent with newly introduced code.
wren-ai-service/Justfile (1)
35-36
: Parameterizing the prep command.
Requiring a dataset argument better reflects the multi-dataset approach. This is a good extension for future expansions.
wren-ai-service/eval/pipelines.py (2)
250-253
: LGTM!
The update to include the database path in the engine configuration is correct and consistent with the changes in prediction.py.
344-346
: LGTM!
The update to include the database path in the engine configuration is consistent with the GenerationPipeline implementation.
wren-ai-service/eval/data_curation/app.py (1)
27-34
: LGTM!
The restoration of necessary imports is correct and well-organized.
lgtm!
Summary by CodeRabbit
New Features
Refactor