
Introduce docs about BlobStorage performance metrics #2509

Open

wants to merge 45 commits into `main`

Changes from 18 commits

Commits (45)
7f95899
Add docs about performance metrics
serbel324 Mar 6, 2024
38e5eae
Intermediate
serbel324 Mar 7, 2024
766dfab
Add information about fine-tuning
serbel324 Mar 7, 2024
6cdfb65
Address comments
serbel324 Mar 12, 2024
7b572f0
Address more cthulhu comments
serbel324 Mar 12, 2024
ef58cc2
address comments
serbel324 May 27, 2024
bc9e240
Add docs about performance metrics
serbel324 Mar 6, 2024
5b83a9a
Intermediate
serbel324 Mar 7, 2024
99f8bc3
Add information about fine-tuning
serbel324 Mar 7, 2024
e97ccf1
Address comments
serbel324 Mar 12, 2024
fdba9b2
Address more cthulhu comments
serbel324 Mar 12, 2024
532c7af
address comments
serbel324 May 27, 2024
f497462
Update performance_metrics.md
serbel324 May 27, 2024
b24a8e7
Update performance_metrics.md
serbel324 May 27, 2024
2913993
Update performance_metrics.md
serbel324 May 27, 2024
505e271
Update performance_metrics.md
serbel324 May 27, 2024
3c970df
Update performance_metrics.md
serbel324 May 31, 2024
0df425e
Add docs about blobstorage performance metrics
serbel324 Jul 8, 2024
81e190b
Rename BlobStorage -> Distributed Storage
serbel324 Jul 8, 2024
41d42b8
Add pages to index
serbel324 Jul 8, 2024
0dde757
Fix paths
serbel324 Jul 8, 2024
9a3e494
Fix typos
serbel324 Jul 10, 2024
09dd67b
Address more comments
serbel324 Jul 16, 2024
579a102
Update distributed-storage-performance.md
serbel324 Jul 16, 2024
7b2e58e
Update ydb/docs/en/core/reference/observability/metrics/distributed-s…
blinkov Jul 17, 2024
171dc65
Address more comments
serbel324 Jul 17, 2024
22b9c2f
Merge branch 'YDBDOCS-615-perforamnce-metrics' of github.com:serbel32…
serbel324 Jul 17, 2024
ad41888
Merge branch 'YDBDOCS-615-perforamnce-metrics' of github.com:serbel32…
serbel324 Sep 23, 2024
caf80d4
Address comments
serbel324 Sep 24, 2024
da983a1
Address comments
serbel324 Dec 16, 2024
5551f0d
Remove old files
serbel324 Dec 16, 2024
cd1377e
Merge branch 'main' into YDBDOCS-615-perforamnce-metrics
serbel324 Dec 16, 2024
1c2a21d
Update ydb/docs/ru/core/reference/observability/metrics/grafana-dashb…
serbel324 Dec 25, 2024
ccded53
Update ydb/docs/en/core/reference/observability/metrics/grafana-dashb…
serbel324 Dec 25, 2024
d615e78
Update ydb/docs/ru/core/reference/observability/metrics/grafana-dashb…
serbel324 Dec 25, 2024
aaeac08
Update ydb/docs/ru/core/reference/observability/metrics/distributed-s…
serbel324 Dec 25, 2024
323755a
Update ydb/docs/ru/core/reference/observability/metrics/distributed-s…
serbel324 Dec 25, 2024
182fef3
Update ydb/docs/en/core/reference/observability/metrics/distributed-s…
serbel324 Dec 25, 2024
1d5e708
Update ydb/docs/en/core/reference/observability/metrics/grafana-dashb…
serbel324 Dec 25, 2024
868332b
Update ydb/docs/en/core/reference/observability/metrics/distributed-s…
serbel324 Dec 25, 2024
aa52d25
Apply suggestions from code review
serbel324 Dec 25, 2024
d47baec
Update ydb/docs/ru/core/reference/observability/metrics/distributed-s…
serbel324 Dec 25, 2024
fca432e
Address comments
serbel324 Dec 25, 2024
fb43707
Fix build
serbel324 Dec 25, 2024
00d0d31
Fix build
serbel324 Dec 26, 2024
@@ -0,0 +1,96 @@
# Distributed Storage performance metrics

Distributed storage has a finite throughput, limited by the resources of the physical devices in the cluster, and can provide low response times only if the load does not exceed this capacity. Performance metrics show the amount of resources available on physical devices and make it possible to assess how much of them is consumed. Tracking performance metric values helps monitor whether the conditions required for low response time guarantees are met: the average load must not exceed the available limit, and there must be no short-term load bursts.

### Request cost model

The request cost is an estimate of the time a physical device spends performing a given operation. It is calculated using a simple model of the physical device, which assumes that the device can handle only one read or write request at a time. Executing an operation takes a certain amount of the device's working time; therefore, the total time spent on requests over any period cannot exceed the duration of that period.

The request cost is calculated using a linear formula:

$$
cost(operation) = A + operation.size() \times B
$$

The physical rationale behind the linear dependency is as follows: coefficient $A$ is the time needed for the physical device to access the data, and coefficient $B$ is the time required to read or write one byte of data.

The coefficients $A$ and $B$ depend on the request type and the device type. These coefficients were measured experimentally for each device type and each request type.
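To make the model concrete, the linear cost formula can be sketched in Python. The coefficient values below are illustrative placeholders, not the experimentally measured constants used by {{ ydb-short-name }}:

```python
# Sketch of the linear request cost model: cost = A + size * B.
# The (A, B) pairs below are hypothetical placeholders in arbitrary
# time units, NOT the measured coefficients used by YDB.
COST_COEFFICIENTS = {
    ("hdd", "read"): (8_000_000, 10),   # high access time, slow transfer
    ("hdd", "write"): (8_000_000, 10),
    ("ssd", "read"): (100_000, 1),
    ("nvme", "read"): (20_000, 1),      # fastest interface
}

def request_cost(device: str, operation: str, size_bytes: int) -> int:
    """Estimated time the device spends on one request, in arbitrary units."""
    a, b = COST_COEFFICIENTS[(device, operation)]
    return a + size_bytes * b

# Access time dominates for small requests; transfer time for large ones.
print(request_cost("ssd", "read", 4096))
```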

In {{ ydb-short-name }}, all physical devices are divided into three types: HDD, SATA SSD (further referred to as SSD), and NVMe SSD (further referred to as NVMe). HDDs are rotating hard drives characterized by high data access time. SSD and NVMe types differ in their interfaces: NVMe provides a higher operation speed.

Operations are divided into three types: reads, writes, and huge-writes. The division of writes into regular and huge-writes is due to the specifics of handling write requests on VDisks.

In addition to user requests, the load on distributed storage is created by background processes of compaction, scrubbing, and defragmentation, as well as internal communication between VDisks. The compaction process can create particularly high loads when there is a substantial flow of small blob writes.

### Available disk time {#diskTimeAvailable}

The PDisk scheduler manages the execution order of requests from its client VDisks. PDisk divides the device's time fairly among its VDisks, ensuring that each of the $N$ VDisks sharing the device is guaranteed $1/N$ seconds of the physical device's working time each second. Based on the number of neighboring VDisks of each VDisk, denoted as $N$, and the configurable parameter `DiskTimeAvailableScale`, the available disk time estimate, referred to as `DiskTimeAvailable`, is calculated by the formula:
Review comment (Member): The `DiskTimeAvailableScale` topic is not explained. It is unclear what it is.

Review comment (Member): Are $n$ and $N$ essentially the same thing in this context? If so, let's use only one of them so as not to confuse the reader.

$$
DiskTimeAvailable = \dfrac{1000000000}{N} \cdot \dfrac{DiskTimeAvailableScale}{1000}
$$
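For illustration, the formula can be evaluated as follows; the function name and signature are assumptions for this sketch:

```python
def disk_time_available(num_vdisks: int,
                        disk_time_available_scale: int = 1000) -> int:
    """DiskTimeAvailable = (1_000_000_000 / N) * (DiskTimeAvailableScale / 1000).

    num_vdisks is N, the number of VDisks sharing the PDisk. With the default
    scale of 1000, each VDisk gets an equal share of 10^9 arbitrary units
    per second of the device's working time.
    """
    return int(1_000_000_000 / num_vdisks * disk_time_available_scale / 1000)

print(disk_time_available(8))        # equal share among 8 VDisks
print(disk_time_available(8, 1100))  # device rated 10% above the baseline
```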

### Load burst detector {#burstDetector}

A burst is a sharp, short-term increase in the load on a VDisk, which can degrade operation response times. Sensor values on cluster nodes are collected at fixed intervals, for example, every 15 seconds, making it impossible to reliably detect short-term events using only the request cost and available disk time metrics. A modified [Token Bucket algorithm](https://en.wikipedia.org/wiki/Token_bucket) is used to address this issue. In this modification, the bucket can hold a negative number of tokens; such a state is called underflow. A separate Token Bucket object is associated with each VDisk. The minimum expected response time at which an increase in load is considered a burst is determined by the configurable parameter `BurstThresholdNs`. The bucket underflows if the calculated time needed to process the requests, in nanoseconds, exceeds the `BurstThresholdNs` value.
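The modified algorithm can be sketched as a token bucket that is allowed to go negative. The class name, refill logic, and time accounting below are assumptions made for this illustration, not {{ ydb-short-name }} internals:

```python
class UnderflowTokenBucket:
    """Token bucket that may hold a negative number of tokens (underflow)."""

    def __init__(self, burst_threshold_ns: int, refill_rate_per_s: float):
        self.capacity = burst_threshold_ns  # plays the role of BurstThresholdNs
        self.rate = refill_rate_per_s       # tokens restored per second
        self.tokens = float(burst_threshold_ns)
        self.red_ms = 0.0                   # total time spent underflowed, ms

    def consume(self, request_cost_ns: float) -> None:
        # Unlike a classic token bucket, the request is never rejected;
        # the balance simply goes negative, signalling a burst.
        self.tokens -= request_cost_ns

    def advance(self, seconds: float) -> None:
        if self.tokens < 0:
            # Time needed to refill back to zero, capped by the elapsed interval.
            self.red_ms += min(seconds, -self.tokens / self.rate) * 1000
        self.tokens = min(self.capacity, self.tokens + seconds * self.rate)
```

A burst whose pending cost exceeds the threshold drives the bucket below zero, and the time it stays there accumulates as "red" milliseconds, mirroring the `BurstDetector_redMs` sensor described below.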

### Performance metrics

Performance metrics are calculated based on the following VDisk sensors:
| Sensor Name | Units | Description |
|-----------------------|-------------------|---------------------------------------------------------------------------------------|
| `DiskTimeAvailable` | arbitrary units | Available disk time. |
| `UserDiskCost` | arbitrary units | Total cost of requests a VDisk receives from the DS Proxy. |
| `InternalDiskCost` | arbitrary units | Total cost of requests received by a VDisk from another VDisk in the group, for example, as part of the replication process. |
| `CompactionDiskCost` | arbitrary units | Total cost of requests the VDisk sends as part of the compaction process. |
| `DefragDiskCost` | arbitrary units | Total cost of requests the VDisk sends as part of the defragmentation process. |
| `ScrubDiskCost` | arbitrary units | Total cost of requests the VDisk sends as part of the scrubbing process. |
| `BurstDetector_redMs` | ms | The duration in milliseconds during which the Token Bucket was in an underflow state. |

`DiskTimeAvailable` and the request cost are estimates of available and consumed bandwidth, respectively, rather than actually measured time; therefore, both quantities are expressed in arbitrary units.

### Conditions for Distributed Storage guarantees {#requirements}

The {{ ydb-short-name }} distributed storage can ensure low response times only under the following conditions:

1. $DiskTimeAvailable \geq UserDiskCost + InternalDiskCost + CompactionDiskCost + DefragDiskCost + ScrubDiskCost$ — the average load does not exceed the maximum allowed.
2. $BurstDetector\_redMs = 0$ — there are no short-term load bursts, which would lead to request queues on handlers.
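As a sketch, checking both conditions over a snapshot of sensor values might look like this; the function and the dict-based input are assumptions, while the sensor names match the table above:

```python
COST_SENSORS = ("UserDiskCost", "InternalDiskCost", "CompactionDiskCost",
                "DefragDiskCost", "ScrubDiskCost")

def guarantees_met(sensors: dict) -> bool:
    """True if both low-response-time conditions hold for one snapshot."""
    total_cost = sum(sensors[name] for name in COST_SENSORS)
    average_ok = sensors["DiskTimeAvailable"] >= total_cost
    no_bursts = sensors["BurstDetector_redMs"] == 0
    return average_ok and no_bursts
```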

### Performance metrics configuration

Since the coefficients for the request cost formula were measured on specific physical devices from development clusters, and the performance of other devices may vary, the metrics may require additional adjustments to be used as a source of guarantees for Distributed Storage. Performance metric parameters can be managed via [dynamic cluster configuration](../../../maintenance/manual/dynamic-config.md) and the Immediate Controls mechanism without restarting {{ ydb-short-name }} processes.
Review comment (Collaborator): It is unclear what "a source of guarantees" means. Maybe this instead?

Suggested change: "...the metrics may require additional adjustments to be used as a reliable source of information about the distributed storage."

| Parameter Name | Description | Default Value |
|---------------------------------------|-----------------------------------------------------------------------------------------------|---------------|
| `disk_time_available_scale_hdd` | [`DiskTimeAvailableScale` parameter](#diskTimeAvailable) for VDisks running on HDD devices. | `1000` |
| `disk_time_available_scale_ssd` | [`DiskTimeAvailableScale` parameter](#diskTimeAvailable) for VDisks running on SSD devices. | `1000` |
| `disk_time_available_scale_nvme`      | [`DiskTimeAvailableScale` parameter](#diskTimeAvailable) for VDisks running on NVMe devices.   | `1000`        |
| `burst_threshold_ns_hdd` | [`BurstThresholdNs` parameter](#burstDetector) for VDisks running on HDD devices. | `200000000` |
| `burst_threshold_ns_ssd` | [`BurstThresholdNs` parameter](#burstDetector) for VDisks running on SSD devices. | `50000000` |
| `burst_threshold_ns_nvme` | [`BurstThresholdNs` parameter](#burstDetector) for VDisks running on NVMe devices. | `32000000` |

#### Configuration examples

If a given {{ ydb-short-name }} cluster uses NVMe devices and delivers performance that is 10% higher than the baseline, add the following section to the `immediate_controls_config` in the dynamic configuration of the cluster:

```yaml
vdisk_controls:
disk_time_available_scale_nvme: 1100
```

If a given {{ ydb-short-name }} cluster uses HDD devices and, under its workload conditions, the maximum tolerable response time is 500 ms, add the following section to the `immediate_controls_config` in the dynamic configuration of the cluster:

```yaml
vdisk_controls:
burst_threshold_ns_hdd: 500000000
```

### How to compare the performance of a cluster with the baseline

To compare the performance of Distributed Storage in a cluster with the baseline, you need to load the distributed storage with requests to the point where the VDisks cannot process the incoming request flow. At this moment, requests start to queue up, and the response time of the VDisks increases sharply. Compute the value $D$ just before the overload:
$$
D = \frac{UserDiskCost + InternalDiskCost + CompactionDiskCost + DefragDiskCost + ScrubDiskCost}{DiskTimeAvailable}
$$
Set the `disk_time_available_scale_<used-device-type>` configuration parameter equal to the calculated value of $D$, multiplied by 1000 and rounded. We assume that the physical devices in the user cluster are comparable in performance to the baseline; hence, by default, the `disk_time_available_scale_<used-device-type>` parameter is set to 1000.
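The calibration step can be sketched as follows. The function name and the dict-based input are assumptions for this illustration; the sensor values are assumed to be taken just before overload:

```python
def calibrated_scale(sensors: dict) -> int:
    """Returns D * 1000, rounded: the value to set for
    disk_time_available_scale_<used-device-type>."""
    consumed = sum(sensors[name] for name in
                   ("UserDiskCost", "InternalDiskCost", "CompactionDiskCost",
                    "DefragDiskCost", "ScrubDiskCost"))
    # D is the ratio of consumed to available disk time at peak load.
    d = consumed / sensors["DiskTimeAvailable"]
    return round(d * 1000)
```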

Such a load can be created, for example, using [Storage LoadActor](../../../contributor/load-actors-storage.md).
Review comment (Collaborator), suggested change: wrap the sentence above in a tip note:

{% note tip %}

To generate the load, use [Storage LoadActor](../../../contributor/load-actors-storage.md).

{% endnote %}

Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,6 @@ items:
- name: Metrics reference
href: index.md
- name: Grafana dashboards
href: grafana-dashboards.md
href: grafana-dashboards.md
- name: Distributed Storage performance metrics
href: distributed-storage-performance.md