
Add walqueue sharding #2665

Merged: mattdurham merged 21 commits into grafana:main on Feb 13, 2025

Conversation

@mattdurham (Collaborator) commented on Feb 9, 2025:

PR Description

This adds dynamic sharding configuration to `prometheus.write.queue`.

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated

@mattdurham mattdurham marked this pull request as ready for review February 9, 2025 21:45
@mattdurham mattdurham requested review from clayton-cornell and a team as code owners February 9, 2025 21:45
CHANGELOG.md (review thread: outdated, resolved)
`allowed_network_error_percent` | `float` | The allowed error rate before scaling down. | `0.50` | no

Parallelism determines when to scale up or down the number of desired connections. This is accomplished by a variety of inputs.
By determining the drift between the incoming and outgoing timestamps that will determine whether to increase or decrease the
A contributor commented:
Not sure if this connects / flows with previous sentence. Is it meant to be a bullet point?

internal/component/prometheus/write/queue/types.go (review thread: outdated, resolved)
internal/component/prometheus/write/queue/types.go (review thread: outdated, resolved)
Comment on lines 133 to 146
Parallelism determines when to scale up or down the number of desired connections. This is accomplished by a variety of inputs:

By determining the drift between the incoming and outgoing timestamps that will determine whether to increase or decrease the
desired connections. This is represented by `drift_scale_up_seconds` and `drift_scale_down_seconds`, if the drift is between these
two values then the value will stay the same.

Network success and failures are recorded and kept in memory, this helps determine
the nature of the drift. For instance if the drift is increasing but the network failures are increasing we should not increase
desired connections since that would only increase load on the endpoint.

Flapping prevention accomplished with `desired_check_interval`, each time a desired connection is calculated it is added to a list, before actually changing the
desired connection the system will choose the highest value in the lookback buffer. Example; for the past 5 minutes desired connections have been: [2,1,1] the check runs
and determines that the desired connections are 1, but will not change the value since the value 2 is still in the lookback. On the next check we have [1,1,1],
now it will change to 1. In general the system is fast to increase and slow to decrease.
A contributor commented:
Suggested change (replacing the quoted lines above):
Parallelism determines when to scale up or down the number of desired connections.
The drift between the incoming and outgoing timestamps determines whether to increase or decrease the desired connections.
The value stays the same if the drift is between `drift_scale_up_seconds` and `drift_scale_down_seconds`.
Network successes and failures are recorded and kept in memory.
This data helps determine the nature of the drift.
For example, if the drift is increasing and the network failures are increasing, the desired connections should not increase because that would increase the load on the endpoint.
The `desired_check_interval` prevents connection flapping.
Each time a desired connection is calculated, the connection is added to a list.
Before changing the desired connection, the system will choose the highest value in the lookback buffer.
For example, for the past 5 minutes, desired connections have been: [2,1,1].
The check determines that the desired connections are 1, and the number of desired connections will not change because the value `2` is still in the lookback buffer.
On the next check, the desired connections are [1,1,1].
Now, it will change to 1. In general, the system is fast to increase and slow to decrease.

This is a first pass at reworking the description here. I don't think I've accurately distilled what you were trying to explain here... so I expect we will have to go over this at least once more.

Questions...

  1. What is the system here? Alloy? The connector?
  2. In the 4th paragraph example, what changes to 1 when the lookback is [1,1,1]?

@mattdurham (Collaborator, PR author) replied:
  1. Yes, but more specifically the component `prometheus.write.queue`, so I will be specific there.

  2. The `desired_connections` value changes to 1.
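
For illustration, here is a minimal Go sketch of the flapping-prevention lookback discussed above, assuming a simple window of recently calculated values. The names are hypothetical and this is not the component's actual implementation:

```go
// Hypothetical sketch of the lookback behaviour described in the docs text;
// not the walqueue implementation.
package main

import "fmt"

// desiredFromLookback picks the connection count to apply from the values
// calculated during the desired_check_interval window. The highest value in
// the window wins, so the component is fast to scale up and slow to scale down.
func desiredFromLookback(lookback []int) int {
	highest := lookback[0]
	for _, v := range lookback[1:] {
		if v > highest {
			highest = v
		}
	}
	return highest
}

func main() {
	// Matches the example in the review discussion.
	fmt.Println(desiredFromLookback([]int{2, 1, 1})) // 2: the older value 2 is still in the window
	fmt.Println(desiredFromLookback([]int{1, 1, 1})) // 1: desired connections now drop to 1
}
```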

@clayton-cornell (Contributor) left a comment:
I'd say the docs for this new block are fine as-is now. I'll do a pass later, though, when I update the Prometheus topics for overall style/consistency... for now it matches the other Prometheus topics :-)
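
As a reference for the attributes described in this block, here is a rough Go sketch of how `drift_scale_up_seconds`, `drift_scale_down_seconds`, and `allowed_network_error_percent` could combine into a single scale-up/hold/scale-down decision. It follows the prose description above (hold rather than scale up when the endpoint is erroring) and is an illustration under those assumptions, not the component's actual logic:

```go
// Hypothetical illustration only; not the prometheus.write.queue implementation.
package main

import "fmt"

type parallelismConfig struct {
	driftScaleUpSeconds        int64   // scale up when drift grows past this
	driftScaleDownSeconds      int64   // scale down when drift falls below this
	allowedNetworkErrorPercent float64 // documented default is 0.50
}

// scaleDecision returns +1, 0, or -1 as the change to desired connections.
// driftSeconds is how far outgoing timestamps lag behind incoming ones;
// errorRate is the recent network failure ratio kept in memory.
func scaleDecision(cfg parallelismConfig, driftSeconds int64, errorRate float64) int {
	switch {
	case driftSeconds > cfg.driftScaleUpSeconds:
		if errorRate > cfg.allowedNetworkErrorPercent {
			// Drift is growing but the endpoint is already failing; adding
			// connections would only add load, so hold steady.
			return 0
		}
		return 1
	case driftSeconds < cfg.driftScaleDownSeconds:
		return -1
	default:
		// Between the two thresholds: keep the current value.
		return 0
	}
}

func main() {
	cfg := parallelismConfig{driftScaleUpSeconds: 60, driftScaleDownSeconds: 30, allowedNetworkErrorPercent: 0.50}
	fmt.Println(scaleDecision(cfg, 90, 0.10)) // 1: falling behind and the endpoint is healthy
	fmt.Println(scaleDecision(cfg, 90, 0.80)) // 0: falling behind, but the endpoint is erroring
	fmt.Println(scaleDecision(cfg, 10, 0.00)) // -1: caught up, scale down
}
```

Note that the attribute table describes `allowed_network_error_percent` as the allowed error rate before scaling down, so the exact interaction between the error rate and the drift thresholds is defined by the component itself; the sketch above only mirrors the behaviour described in the documentation prose.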

@clayton-cornell clayton-cornell added the type/docs Docs Squad label across all Grafana Labs repos label Feb 13, 2025
@mattdurham mattdurham merged commit 3528540 into grafana:main Feb 13, 2025
29 checks passed
Labels: type/docs (Docs Squad label across all Grafana Labs repos)
Projects: none
4 participants