Merge pull request #2972 from ClickHouse/parts_with_playground
All examples actually run in playground now.
tom-clickhouse authored Jan 3, 2025
2 parents 6bf10f9 + cc2118d commit 9a63107
Showing 2 changed files with 41 additions and 3 deletions.
Binary file modified docs/en/managing-data/core-concepts/images/part.png
44 changes: 41 additions & 3 deletions docs/en/managing-data/core-concepts/parts.md
@@ -11,11 +11,11 @@ keywords: [part]

The data from each table in the ClickHouse [MergeTree engine family](/docs/en/engines/table-engines/mergetree-family) is organized on disk as a collection of immutable `data parts`.

To illustrate this, we use this table (adapted from the [UK property prices dataset](/docs/en/getting-started/example-datasets/uk-price-paid)) tracking the date, town, street, and price for sold properties in the United Kingdom:
To illustrate this, we use [this](https://sql.clickhouse.com/?query=U0hPVyBDUkVBVEUgVEFCTEUgdWsudWtfcHJpY2VfcGFpZF9zaW1wbGU&run_query=true&tab=results) table (adapted from the [UK property prices dataset](/docs/en/getting-started/example-datasets/uk-price-paid)) tracking the date, town, street, and price for sold properties in the United Kingdom:


```
CREATE TABLE uk_price_paid
CREATE TABLE uk.uk_price_paid_simple
(
date Date,
town LowCardinality(String),
@@ -26,6 +26,7 @@ ENGINE = MergeTree
ORDER BY (town, street);
```

You can [query this table](https://sql.clickhouse.com/?query=U0VMRUNUICogRlJPTSB1ay51a19wcmljZV9wYWlkX3NpbXBsZTs&run_query=true&tab=results) in our ClickHouse SQL Playground.

A data part is created whenever a set of rows is inserted into the table. The following diagram sketches this:

@@ -48,7 +49,44 @@ Data parts are self-contained, including all metadata needed to interpret their

To manage the number of parts per table, a background merge job periodically combines smaller parts into larger ones until they reach a [configurable](/docs/en/operations/settings/merge-tree-settings#max-bytes-to-merge-at-max-space-in-pool) compressed size (typically ~150 GB). Merged parts are marked as inactive and deleted after a [configurable](/docs/en/operations/settings/merge-tree-settings#old-parts-lifetime) time interval. Over time, this process creates a hierarchical structure of merged parts, which is why it’s called a MergeTree table:

<img src={require('./images/merges.png').default} alt='PART MERGES' class='image' style={{width: '100%'}} />
<img src={require('./images/merges.png').default} alt='PART MERGES' class='image' style={{width: '50%'}} />
<br/>

To minimize the number of initial parts and the overhead of merges, database clients are [encouraged](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse#data-needs-to-be-batched-for-optimal-performance) to either insert tuples in bulk, e.g. 20,000 rows at once, or to use the [asynchronous insert mode](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse), in which ClickHouse buffers rows from multiple incoming INSERTs into the same table and creates a new part only after the buffer size exceeds a configurable threshold, or a timeout expires.
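The buffering behaviour described above can be illustrated with a toy model. This is only a sketch and not ClickHouse's implementation: the class `InsertBuffer`, its row-count threshold `max_rows`, and the `timeout_s` parameter are hypothetical names chosen for illustration. It shows several small inserts being collected and turned into a single part once the buffer exceeds the threshold or the timeout expires.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InsertBuffer:
    """Toy model: buffer rows from many inserts, flush them as one part."""
    max_rows: int       # hypothetical size threshold, counted in rows
    timeout_s: float    # hypothetical flush timeout in seconds
    rows: List[str] = field(default_factory=list)
    first_insert_at: Optional[float] = None
    parts: List[List[str]] = field(default_factory=list)

    def insert(self, new_rows: List[str], now: float) -> None:
        # Remember when the first row entered an empty buffer.
        if not self.rows:
            self.first_insert_at = now
        self.rows.extend(new_rows)
        # Flush once the buffer size exceeds the threshold.
        if len(self.rows) > self.max_rows:
            self.flush()

    def tick(self, now: float) -> None:
        # Flush if the oldest buffered row has waited for the full timeout.
        if self.rows and now - self.first_insert_at >= self.timeout_s:
            self.flush()

    def flush(self) -> None:
        # Each flush creates exactly one new part from all buffered rows.
        self.parts.append(self.rows)
        self.rows = []
        self.first_insert_at = None


buf = InsertBuffer(max_rows=3, timeout_s=1.0)
buf.insert(["a", "b"], now=0.0)   # buffered, no part yet
buf.insert(["c", "d"], now=0.1)   # 4 rows > 3, flushed as one part
buf.insert(["e"], now=0.2)        # buffered again
buf.tick(now=1.5)                 # timeout expired, flushed as a second part
```

In this model four separate inserts produce only two parts, which is the effect batching is meant to achieve.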

You can [query](https://sql.clickhouse.com/?query=U0VMRUNUIF9wYXJ0CkZST00gdWsudWtfcHJpY2VfcGFpZF9zaW1wbGUKR1JPVVAgQlkgX3BhcnQKT1JERVIgQlkgX3BhcnQgQVNDOw&run_query=true&tab=results) the list of all currently existing active parts of our example table by using the [virtual column](/docs/en/engines/table-engines#table_engines-virtual_columns) `_part`:

```
SELECT _part
FROM uk.uk_price_paid_simple
GROUP BY _part
ORDER BY _part ASC;
   ┌─_part───────┐
1. │ all_0_5_1   │
2. │ all_12_17_1 │
3. │ all_18_23_1 │
4. │ all_6_11_1  │
   └─────────────┘
```
The query above retrieves the names of directories on disk, with each directory representing an active data part of the table. The components of these directory names have specific meanings, which are documented [here](https://github.com/ClickHouse/ClickHouse/blob/f90551824bb90ade2d8a1d8edd7b0a3c0a459617/src/Storages/MergeTree/MergeTreeData.h#L130) for those interested in exploring further.
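As a rough illustration of how such names can be taken apart, the sketch below assumes the layout `<partition ID>_<minimum block number>_<maximum block number>_<merge level>` and ignores any optional trailing suffix. That layout is an assumption not spelled out in the text above, and `PartName` and `parse_part_name` are hypothetical helper names.

```python
from typing import NamedTuple


class PartName(NamedTuple):
    partition_id: str
    min_block: int
    max_block: int
    level: int


def parse_part_name(name: str) -> PartName:
    # Split from the right so the three trailing numeric fields are
    # separated first and the remainder is the partition ID.
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return PartName(partition_id, int(min_block), int(max_block), int(level))


names = ["all_0_5_1", "all_12_17_1", "all_18_23_1", "all_6_11_1"]

# ORDER BY _part sorts the names as strings; sorting by the parsed
# minimum block number instead orders the parts by block range.
by_block = sorted(names, key=lambda n: parse_part_name(n).min_block)
print(by_block)
```

Under this assumption, `all_12_17_1` would be a part in partition `all` covering blocks 12 through 17 at merge level 1.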

Alternatively, ClickHouse tracks information about all parts of all tables in the [system.parts](https://clickhouse.com/docs/en/operations/system-tables/parts) system table. For our example table, the following query [returns](https://sql.clickhouse.com/?query=U0VMRUNUCiAgICBuYW1lLAogICAgbGV2ZWwsCiAgICByb3dzCkZST00gc3lzdGVtLnBhcnRzCldIRVJFIChkYXRhYmFzZSA9ICd1aycpIEFORCAoYHRhYmxlYCA9ICd1a19wcmljZV9wYWlkX3NpbXBsZScpIEFORCBhY3RpdmUKT1JERVIgQlkgbmFtZSBBU0M7&run_query=true&tab=results) the list of all currently active parts, their merge level, and the number of rows stored in each part:
```
SELECT
    name,
    level,
    rows
FROM system.parts
WHERE (database = 'uk') AND (`table` = 'uk_price_paid_simple') AND active
ORDER BY name ASC;
   ┌─name────────┬─level─┬────rows─┐
1. │ all_0_5_1   │     1 │ 6368414 │
2. │ all_12_17_1 │     1 │ 6442494 │
3. │ all_18_23_1 │     1 │ 5977762 │
4. │ all_6_11_1  │     1 │ 6459763 │
   └─────────────┴───────┴─────────┘
```
The merge level is incremented by one with each additional merge on the part. A level of 0 indicates this is a new part that has not been merged yet.
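The level bookkeeping can be sketched as follows. This is a simplified model rather than ClickHouse's merge code: parts are represented as hypothetical `(partition, min_block, max_block, level)` tuples, and the rule that a merged part gets the highest source level plus one is an assumption consistent with the description above.

```python
def merge_parts(parts):
    """Combine (partition, min_block, max_block, level) tuples into one part."""
    partitions = {p[0] for p in parts}
    if len(partitions) != 1:
        raise ValueError("this sketch only merges parts of the same partition")
    return (
        parts[0][0],
        min(p[1] for p in parts),      # merged part starts at the lowest block
        max(p[2] for p in parts),      # and ends at the highest block
        max(p[3] for p in parts) + 1,  # assumed rule: highest source level + 1
    )


# Six freshly inserted parts (level 0), one block each.
fresh = [("all", i, i, 0) for i in range(6)]
merged_once = merge_parts(fresh)

# Merging level-1 parts again yields a level-2 part.
level_one = [("all", 0, 5, 1), ("all", 6, 11, 1)]
merged_twice = merge_parts(level_one)
```

In this model, six unmerged parts covering blocks 0 to 5 become a single level-1 part covering the same block range.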
