
Add walqueue sharding #2665

Merged: mattdurham merged 21 commits into grafana:main on Feb 13, 2025

Conversation

@mattdurham (Collaborator) commented on Feb 9, 2025:

PR Description

This adds dynamic sharding configuration to `prometheus.write.queue`.

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated

@mattdurham mattdurham marked this pull request as ready for review February 9, 2025 21:45
@mattdurham mattdurham requested review from clayton-cornell and a team as code owners February 9, 2025 21:45
CHANGELOG.md (review thread: outdated, resolved)
`allowed_network_error_percent` | `float` | The allowed error rate before scaling down. | `0.50` | no

Parallelism determines when to scale up or down the number of desired connections. This is accomplished by a variety of inputs.
By determining the drift between the incoming and outgoing timestamps that will determine whether to increase or decrease the
A contributor commented:
Not sure if this connects / flows with previous sentence. Is it meant to be a bullet point?

internal/component/prometheus/write/queue/types.go (review thread: outdated, resolved)
internal/component/prometheus/write/queue/types.go (review thread: outdated, resolved)
Comment on lines 133 to 146
Parallelism determines when to scale up or down the number of desired connections. This is accomplished by a variety of inputs:

By determining the drift between the incoming and outgoing timestamps that will determine whether to increase or decrease the
desired connections. This is represented by `drift_scale_up_seconds` and `drift_scale_down_seconds`, if the drift is between these
two values then the value will stay the same.

Network success and failures are recorded and kept in memory, this helps determine
the nature of the drift. For instance if the drift is increasing but the network failures are increasing we should not increase
desired connections since that would only increase load on the endpoint.

Flapping prevention accomplished with `desired_check_interval`, each time a desired connection is calculated it is added to a list, before actually changing the
desired connection the system will choose the highest value in the lookback buffer. Example; for the past 5 minutes desired connections have been: [2,1,1] the check runs
and determines that the desired connections are 1, but will not change the value since the value 2 is still in the lookback. On the next check we have [1,1,1],
now it will change to 1. In general the system is fast to increase and slow to decrease.
A contributor commented:
Suggested change (replacing the quoted lines above):
Parallelism determines when to scale up or down the number of desired connections.
The drift between the incoming and outgoing timestamps determines whether to increase or decrease the desired connections.
The value stays the same if the drift is between `drift_scale_up_seconds` and `drift_scale_down_seconds`.
Network successes and failures are recorded and kept in memory.
This data helps determine the nature of the drift.
For example, if the drift is increasing and the network failures are increasing, the desired connections should not increase because that would increase the load on the endpoint.
The `desired_check_interval` prevents connection flapping.
Each time a desired connection is calculated, the connection is added to a list.
Before changing the desired connection, the system will choose the highest value in the lookback buffer.
For example, for the past 5 minutes, desired connections have been: [2,1,1].
The check determines that the desired connections are 1, and the number of desired connections will not change because the value `2` is still in the lookback buffer.
On the next check, the desired connections are [1,1,1].
Now, it will change to 1. In general, the system is fast to increase and slow to decrease.

This is a first pass at reworking the description here. I don't think I've accurately distilled what you were trying to explain here... so I expect we will have to go over this at least once more.

Questions...

  1. What is the system here? Alloy? The connector?
  2. In the 4th paragraph example, what changes to 1 when the lookback is [1,1,1]?

@mattdurham (Collaborator, PR author) replied:
  1. Yes, but more specifically the component `prometheus.write.queue`, so I will be specific there.

  2. The `desired_connections` value changes to 1.
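
For illustration, here is a minimal Go sketch of the flapping-prevention lookback discussed above, assuming a simple window of recently calculated values. The names are hypothetical and this is not the component's actual implementation:

```go
// Hypothetical sketch of the lookback behaviour described in the docs text;
// not the walqueue implementation.
package main

import "fmt"

// desiredFromLookback picks the connection count to apply from the values
// calculated during the desired_check_interval window. The highest value in
// the window wins, so the component is fast to scale up and slow to scale down.
func desiredFromLookback(lookback []int) int {
	highest := lookback[0]
	for _, v := range lookback[1:] {
		if v > highest {
			highest = v
		}
	}
	return highest
}

func main() {
	// Matches the example in the review discussion.
	fmt.Println(desiredFromLookback([]int{2, 1, 1})) // 2: the older value 2 is still in the window
	fmt.Println(desiredFromLookback([]int{1, 1, 1})) // 1: desired connections now drop to 1
}
```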

@clayton-cornell (Contributor) left a comment:
I'd say the docs for this new block are fine as-is now. I'll do a pass later, though, when I update the Prometheus topics for overall style/consistency... for now it matches the other Prometheus topics :-)
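
As a reference for the attributes described in this block, here is a rough Go sketch of how `drift_scale_up_seconds`, `drift_scale_down_seconds`, and `allowed_network_error_percent` could combine into a single scale-up/hold/scale-down decision. It follows the prose description above (hold rather than scale up when the endpoint is erroring) and is an illustration under those assumptions, not the component's actual logic:

```go
// Hypothetical illustration only; not the prometheus.write.queue implementation.
package main

import "fmt"

type parallelismConfig struct {
	driftScaleUpSeconds        int64   // scale up when drift grows past this
	driftScaleDownSeconds      int64   // scale down when drift falls below this
	allowedNetworkErrorPercent float64 // documented default is 0.50
}

// scaleDecision returns +1, 0, or -1 as the change to desired connections.
// driftSeconds is how far outgoing timestamps lag behind incoming ones;
// errorRate is the recent network failure ratio kept in memory.
func scaleDecision(cfg parallelismConfig, driftSeconds int64, errorRate float64) int {
	switch {
	case driftSeconds > cfg.driftScaleUpSeconds:
		if errorRate > cfg.allowedNetworkErrorPercent {
			// Drift is growing but the endpoint is already failing; adding
			// connections would only add load, so hold steady.
			return 0
		}
		return 1
	case driftSeconds < cfg.driftScaleDownSeconds:
		return -1
	default:
		// Between the two thresholds: keep the current value.
		return 0
	}
}

func main() {
	cfg := parallelismConfig{driftScaleUpSeconds: 60, driftScaleDownSeconds: 30, allowedNetworkErrorPercent: 0.50}
	fmt.Println(scaleDecision(cfg, 90, 0.10)) // 1: falling behind and the endpoint is healthy
	fmt.Println(scaleDecision(cfg, 90, 0.80)) // 0: falling behind, but the endpoint is erroring
	fmt.Println(scaleDecision(cfg, 10, 0.00)) // -1: caught up, scale down
}
```

Note that the attribute table describes `allowed_network_error_percent` as the allowed error rate before scaling down, so the exact interaction between the error rate and the drift thresholds is defined by the component itself; the sketch above only mirrors the behaviour described in the documentation prose.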

@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Feb 13, 2025
@mattdurham mattdurham merged commit 3528540 into grafana:main Feb 13, 2025
29 checks passed
Labels: type/docs (Docs Squad label across all Grafana Labs repos)
Projects: none
4 participants