Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimal Format for ReplacingMergeTree Inserts #2058

Open
woodhull opened this issue Sep 8, 2024 · 1 comment
Open

Optimal Format for ReplacingMergeTree Inserts #2058

woodhull opened this issue Sep 8, 2024 · 1 comment

Comments

@woodhull
Copy link

woodhull commented Sep 8, 2024

I'm sharing this to share our use case and see if there might be a way to evolve the PeerDB sync to help.

We're using a custom table definition with PeerDB.

CREATE TABLE nashville_staging.people
(
    `id` UInt64,
    `account_id` UInt32,
    `organization_id` UInt64,
    `first_name` String,
    `last_name` String,
    `properties` String,
    `created_at` DateTime64(6),
    `updated_at` DateTime64(6),
    `assigned_id` UInt32,
    `tsvector` String,
    `last_conversation_id` UInt64,
    `_peerdb_synced_at` DateTime64(9) DEFAULT now(),
    `_peerdb_is_deleted` UInt8,
    `_peerdb_version` UInt64
)
ENGINE = SharedReplacingMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}', _peerdb_version, _peerdb_is_deleted)
PRIMARY KEY account_id
ORDER BY (account_id, id)
SETTINGS index_granularity = 8192

All of our queries filter on account_id, so I wanted to use it as the Primary Key.

The clickhouse PRIMARY KEY must prefix the ORDER BY.

This works well with the sync of new rows and updates (since the account_id for rows in this table does not change), but falls apart for deletes.

The delete row that is inserted by PeerDB contains a 0 for the account_id rather than the actual value, so it does not match the removal criteria for the SharedReplacingMergeTree.

I think this means we're effectively stuck only using id as the primary key in these tables.

I'm not sure if this is possible architecturally but is it possible to pass the full row values, or perhaps a set of specified column values on CDC delete through to clickhouse? This would allow more complex / custom table definitions to be sync'd while still supporting deletes.

Alternatively, this might have all been a bad idea and I should have stuck with an id Primary Key matched with Order By like the peerdb default -- I'm just trying to make sure that the tables are well modeled for our anticipated queries.

This is more of an architecture / design question than the typical Github Issue. Sharing it here in this format in case it's useful to others.

@woodhull
Copy link
Author

woodhull commented Sep 8, 2024

See also: #2008

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant