Add pg_duckdb benchmarking for existing postgres tables. #311

saurabhojha · 2025-02-25T18:11:01Z

Resolves #306

Motivation:

The current pg_duckdb benchmark relies on a Parquet dataset, which produces good results because it combines columnar storage with vectorized query processing.

One of the advantages of pg_duckdb is that it can query Postgres tables directly using DuckDB execution, without moving the data into a DuckDB table. This is useful for ad hoc analytics queries:

SELECT queries executed by the DuckDB engine can directly read Postgres tables. 
(If you only query Postgres tables you need to run SET duckdb.force_execution TO true, see the IMPORTANT section above for details)

It would be interesting to benchmark this feature of pg_duckdb and see whether some queries can outperform native Postgres execution.

This benchmark creates a table named hits and populates it the same way the postgres benchmark does.

Once the table is populated, the queries are run with pg_duckdb's duckdb.force_execution setting set to true.

With this setting, all queries run on the DuckDB engine (verified using EXPLAIN ANALYSE).

Example EXPLAIN ANALYSE output:
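A minimal sketch of how each benchmark query could be wrapped so that it runs with forced DuckDB execution. The helper name with_duckdb_execution is hypothetical and not part of the PR's scripts; the only thing taken from pg_duckdb is the SET duckdb.force_execution TO true statement quoted above.

```python
def with_duckdb_execution(query: str, explain: bool = False) -> str:
    """Build a psql-ready script that runs `query` with DuckDB execution forced.

    Hypothetical helper for illustration; the PR's actual scripts may differ.
    """
    # Normalize the statement so exactly one trailing semicolon is emitted.
    statement = query.strip().rstrip(";").rstrip()
    if explain:
        # EXPLAIN ANALYSE shows whether the plan uses a DuckDB scan node.
        statement = f"EXPLAIN ANALYSE {statement}"
    return (
        "SET duckdb.force_execution TO true;\n"
        f"{statement};\n"
    )


if __name__ == "__main__":
    print(with_duckdb_execution("SELECT COUNT(*) FROM hits;", explain=True))
```

The generated script can be piped into psql. Because the setting is issued in the same session as the query, it applies to that query.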

     EXPLAIN ANALYSE SELECT COUNT(*) FROM hits;
Timing is on.
 Custom Scan (DuckDBScan)  (cost=0.00..0.00 rows=0 width=0) (actual time=57590.008..57590.089 rows=1 loops=1)
   DuckDB Execution Plan: 
 
 ┌─────────────────────────────────────┐
 │┌───────────────────────────────────┐│
 ││    Query Profiling Information    ││
 │└───────────────────────────────────┘│
 └─────────────────────────────────────┘
 EXPLAIN ANALYZE SELECT count(*) AS count FROM pgduckdb.public.hits
 ┌────────────────────────────────────────────────┐
 │┌──────────────────────────────────────────────┐│
 ││              Total Time: 56.44s              ││
 │└──────────────────────────────────────────────┘│
 └────────────────────────────────────────────────┘
 ┌───────────────────────────┐
 │           QUERY           │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │      EXPLAIN_ANALYZE      │
 │    ────────────────────   │
 │           0 Rows          │
 │          (0.00s)          │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │    UNGROUPED_AGGREGATE    │
 │    ────────────────────   │
 │        Aggregates:        │
 │        count_star()       │
 │                           │
 │           1 Rows          │
 │          (0.01s)          │
 └─────────────┬─────────────┘
 ┌─────────────┴─────────────┐
 │         TABLE_SCAN        │
 │    ────────────────────   │
 │        Table: hits        │
 │                           │
 │       99997497 Rows       │
 │          (56.31s)         │
 └───────────────────────────┘
 
 
 Planning Time: 89.561 ms
 Execution Time: 57591.486 ms
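
The check above can also be automated. The following is a rough sketch, assuming the EXPLAIN ANALYSE text has the same layout as the example output; summarize_explain and its result keys are hypothetical names. It reports whether the plan contains the Custom Scan (DuckDBScan) node and derives the table scan throughput.

```python
import re


def summarize_explain(output: str) -> dict:
    """Extract a few figures from pg_duckdb EXPLAIN ANALYSE text.

    Assumes the layout shown in the example output; missing values are None.
    """
    total = re.search(r"Total Time:\s*([\d.]+)s", output)
    execution = re.search(r"Execution Time:\s*([\d.]+) ms", output)
    # Row count and timing that follow the TABLE_SCAN operator box header.
    scan = re.search(r"TABLE_SCAN.*?(\d+) Rows.*?\(([\d.]+)s\)", output, re.DOTALL)

    scan_rows = int(scan.group(1)) if scan else None
    scan_seconds = float(scan.group(2)) if scan else None
    rows_per_s = scan_rows / scan_seconds if scan and scan_seconds > 0 else None

    return {
        "used_duckdb": "Custom Scan (DuckDBScan)" in output,
        "duckdb_total_s": float(total.group(1)) if total else None,
        "execution_ms": float(execution.group(1)) if execution else None,
        "scan_rows": scan_rows,
        "scan_seconds": scan_seconds,
        "scan_rows_per_s": rows_per_s,
    }
```

For the output above, the table scan accounts for 56.31s of the 56.44s DuckDB total, so the time is spent almost entirely on reading the Postgres table rather than on the aggregation.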

saurabhojha · 2025-03-01T21:44:02Z

The latest release of pg_duckdb supports index scans.
duckdb/pg_duckdb#243
I am going to modify the scripts to create indexes and try this out with the latest version of the extension.

shiv4289 · 2025-03-02T04:40:25Z

Hi @rschu1ze, this looks like a more practical way to use pg_duckdb than converting Postgres data to Parquet via an external real-time ETL job. Any concerns about merging this PR? It would be very interesting to see results with the data in Postgres and the queries run by the DuckDB engine.

saurabhojha · 2025-03-02T10:29:41Z

I have run some benchmarks on a VM. Now that pg_duckdb supports index-only scans, the benchmark is closer to a real-world scenario (you wouldn't have a PostgreSQL table without indexes).

The queries where pg_duckdb fell behind were the ones involving full-text search (GIN indexes). pg_duckdb isn't able to scan GIN indexes and relies on full table scans. (@JelteF)

Overall, with index scans introduced in pg_duckdb (https://github.com/duckdb/pg_duckdb/releases/tag/v0.3.1), it's only upwards 🚀 for pg_duckdb directly querying PostgreSQL tables. (This makes sense for ad hoc analytical queries where storing the data in a columnar format like Parquet might not be feasible.)

Add pg_non_parquet benchmarking

2365726

saurabhojha mentioned this pull request Feb 25, 2025

pg_duckdb benchmark #306

Open

rename folder

7268da3

Add pg_duckdb_postgres_tuned benchmark

0e25518

Use indexed pg duck db

c14046e
