Add pg_duckdb benchmarking for existing postgres tables. #311

Open
wants to merge 4 commits into
base: main

Conversation


@saurabhojha saurabhojha commented Feb 25, 2025

Resolves #306

Motivation:

The current pg_duckdb benchmark relies on a Parquet dataset, which produces good results because it benefits from columnar storage and vectorized query processing.

One of the advantages of pg_duckdb is that it can query Postgres tables directly using DuckDB execution, without moving the data into a DuckDB table. This is useful for ad hoc analytics queries. From the pg_duckdb README:

SELECT queries executed by the DuckDB engine can directly read Postgres tables. 
(If you only query Postgres tables you need to run SET duckdb.force_execution TO true, see the IMPORTANT section above for details)

It would be interesting to benchmark this feature of pg_duckdb and see whether some queries can outperform native postgres execution.

This benchmark creates a table named hits and populates it the same way the postgres benchmark does.

Once the table is populated, the queries are run with pg_duckdb's duckdb.force_execution setting set to true.

With this setting, all the queries run using DuckDB execution (verified with EXPLAIN ANALYSE).
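
The verification step can be automated. The following is a minimal sketch, not the script used in this PR: it builds a psql-ready script that turns on duckdb.force_execution before explaining a query, and checks the plan text for the DuckDBScan custom scan node. The helper names explain_script and ran_on_duckdb are hypothetical.

```python
# Sketch only: helper names are hypothetical, not part of the benchmark scripts.

# Setting quoted in the pg_duckdb README excerpt above.
FORCE_DUCKDB = "SET duckdb.force_execution TO true;"


def explain_script(query: str) -> str:
    """Return SQL that forces DuckDB execution and then explains the query."""
    body = query.strip().rstrip(";")
    return f"{FORCE_DUCKDB}\nEXPLAIN ANALYZE {body};"


def ran_on_duckdb(explain_output: str) -> bool:
    """True if the plan text contains the DuckDBScan custom scan node."""
    return "Custom Scan (DuckDBScan)" in explain_output


if __name__ == "__main__":
    # Print the script; it could be piped to psql by the caller.
    print(explain_script("SELECT COUNT(*) FROM hits;"))
```

A benchmark runner could apply ran_on_duckdb to each query's plan and flag any query that falls back to native Postgres execution.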

Example output of EXPLAIN ANALYSE:

     EXPLAIN ANALYSE SELECT COUNT(*) FROM hits;
Timing is on.
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0) (actual time=57590.008..57590.089 rows=1 loops=1)
   DuckDB Execution Plan: 
 
 ┌─────────────────────────────────────┐
 │┌───────────────────────────────────┐│
 ││    Query Profiling Information    ││
 │└───────────────────────────────────┘│
 └─────────────────────────────────────┘
 EXPLAIN ANALYZE SELECT count(*) AS count FROM pgduckdb.public.hits
 ┌────────────────────────────────────────────────┐
 │┌──────────────────────────────────────────────┐│
 ││              Total Time: 56.44s              ││
 │└──────────────────────────────────────────────┘│
 └────────────────────────────────────────────────┘
 ┌───────────────────────────┐
 │           QUERY           │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │      EXPLAIN_ANALYZE      │
 │    ────────────────────   │
 │           0 Rows          │
 │          (0.00s)          │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │    UNGROUPED_AGGREGATE    │
 │    ────────────────────   │
 │        Aggregates:        │
 │        count_star()       │
 │                           │
 │           1 Rows          │
 │          (0.01s)          │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         TABLE_SCAN        │
 │    ────────────────────   │
 │        Table: hits        │
 │                           │
 │       99997497 Rows       │
 │          (56.31s)         │
 └───────────────────────────┘
 
 
 Planning Time: 89.561 ms
 Execution Time: 57591.486 ms
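
To see where the time goes in a plan like the one above, the timings can be pulled out of the text and compared. This is a rough sketch that assumes the output layout shown here (a "Total Time" box, a TABLE_SCAN box with row count and timing, and a Postgres "Execution Time" line); the function name parse_plan and the result keys are hypothetical.

```python
import re

# Excerpt of the plan above, reduced to the lines the parser needs.
SAMPLE = """
 ││              Total Time: 56.44s              ││
 │         TABLE_SCAN        │
 │    ────────────────────   │
 │        Table: hits        │
 │                           │
 │       99997497 Rows       │
 │          (56.31s)         │
 └───────────────────────────┘
 Execution Time: 57591.486 ms
"""


def parse_plan(text: str) -> dict:
    """Extract scan and total timings from pg_duckdb EXPLAIN ANALYZE text."""
    # Replace box-drawing characters with spaces, then collapse whitespace.
    flat = " ".join(re.sub(r"[\u2500-\u257f]", " ", text).split())
    total_s = float(re.search(r"Total Time: ([\d.]+)s", flat).group(1))
    exec_ms = float(re.search(r"Execution Time: ([\d.]+) ms", flat).group(1))
    scan = re.search(r"Table: (\w+) (\d+) Rows \(([\d.]+)s\)", flat)
    rows, scan_s = int(scan.group(2)), float(scan.group(3))
    return {
        "table": scan.group(1),
        "rows": rows,
        # Fraction of DuckDB's total time spent in the table scan.
        "scan_share": scan_s / total_s,
        # Scan throughput in rows per second.
        "rows_per_s": rows / scan_s,
        # Time reported by Postgres beyond DuckDB's own total.
        "outside_duckdb_s": exec_ms / 1000 - total_s,
    }


if __name__ == "__main__":
    for key, value in parse_plan(SAMPLE).items():
        print(key, value)
```

For the plan above, the table scan accounts for 56.31 s of DuckDB's 56.44 s total, so reading the Postgres table dominates this query rather than the aggregation.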

@saurabhojha
Author

The latest release of pg_duckdb supports index scans:
duckdb/pg_duckdb#243
I am going to modify the scripts to create indexes and try this out with the latest version of the extension.

@shiv4289

shiv4289 commented Mar 2, 2025

Hi @rschu1ze, this looks like a more practical way to use pg_duckdb than converting Postgres data to Parquet via an external real-time ETL job. Any concerns about merging this PR? It would be very interesting to see results with the data in Postgres and the queries run via the DuckDB engine.

@saurabhojha
Author

I have run some benchmarks on a VM. Now that pg_duckdb supports index-only scans, the benchmark is closer to a real-world scenario (you wouldn't have a PostgreSQL table without indexes).

[Screenshot 2025-03-02 at 3:40:55 PM] [Screenshot 2025-03-02 at 3:41:49 PM]

The queries where pg_duckdb fell behind were the ones involving full-text search (GIN indexes). pg_duckdb isn't able to scan GIN indexes and relies on full table scans instead. (@JelteF)

Overall, with the index scans introduced in pg_duckdb v0.3.1 (https://github.com/duckdb/pg_duckdb/releases/tag/v0.3.1), it's only upwards 🚀 for pg_duckdb directly querying PostgreSQL tables. (This makes sense for ad hoc analytical queries where storing the data in a columnar format like Parquet might not be feasible.)


Successfully merging this pull request may close these issues.

pg_duckdb benchmark
2 participants