Use duckdb loading throughout pyprophet #131
Open
With export-parquet, pyprophet already depends on duckdb, which allows fast SQL queries, especially those involving many joins.
This PR rolls out duckdb SQL queries across pyprophet for more efficient data loading.
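As a rough illustration of the idea (not the code in this PR), one way duckdb can run a join directly against a SQLite-based .osw file is through its sqlite extension. The table and column names below (FEATURE, PRECURSOR, PRECURSOR_ID, EXP_RT) and both helper functions are assumptions made for this sketch, and `.df()` requires pandas to be installed.

```python
def build_join_query(alias: str = "osw") -> str:
    """Build a join over two tables of an attached database.

    The table and column names are assumptions for illustration only.
    """
    return (
        f"SELECT f.ID AS feature_id, p.ID AS precursor_id, f.EXP_RT "
        f"FROM {alias}.FEATURE AS f "
        f"JOIN {alias}.PRECURSOR AS p ON p.ID = f.PRECURSOR_ID"
    )


def load_with_duckdb(osw_path: str):
    """Attach a SQLite file in duckdb and return the join as a DataFrame."""
    import duckdb  # imported lazily so build_join_query works without it

    con = duckdb.connect()  # in-memory duckdb database
    try:
        # The sqlite extension lets duckdb read SQLite files in place.
        con.execute("INSTALL sqlite")
        con.execute("LOAD sqlite")
        con.execute(f"ATTACH '{osw_path}' AS osw (TYPE sqlite)")
        return con.execute(build_join_query("osw")).df()
    finally:
        con.close()
```

Calling `load_with_duckdb("run.osw")` (a hypothetical file name) would return the joined rows as a pandas DataFrame. The join itself runs inside duckdb instead of going through sqlite3 and pandas.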
Examples
Timings were measured on a Dell XPS running Ubuntu.
Export Command
time pyprophet export --in=39041_Hela_500ng_15SPD_DIA_Py3_1_S2-A7_1_4502.osw
Old timings:
real 0m56.284s
user 0m35.997s
sys 0m15.130s
New timings:
real 0m12.832s
user 0m40.578s
sys 0m8.378s
Score Command
time pyprophet score --in=39041_Hela_500ng_15SPD_DIA_Py3_1_S2-A7_1_4502.osw --ss_num_iter=1
Old timings:
real 0m59.466s
user 1m30.275s
sys 0m11.004s
New timings:
real 0m30.482s
user 1m21.186s
sys 0m9.460s
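For reference, the wall-clock ("real") times above can be turned into speedup factors with a few lines of Python. The helper names are made up for this sketch; the numbers are copied from the timings above.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '0m56.284s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def speedup(old: str, new: str) -> float:
    """Ratio of old to new wall-clock time."""
    return to_seconds(old) / to_seconds(new)


# "real" times reported above for each command
export_speedup = speedup("0m56.284s", "0m12.832s")
score_speedup = speedup("0m59.466s", "0m30.482s")

print(f"export: {export_speedup:.2f}x")
print(f"score:  {score_speedup:.2f}x")
```

This works out to roughly 4.4x for export and about 2x for score on this single file and machine.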