Performance optimisation for prometheus output #498
I see you went all out with the processors :)

The first obvious change I would make is moving the

As for the lldp subscription: if it's not going to change much, use an on-change subscription (if the router supports it):

```yaml
lldp:
  paths:
    - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
    - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
  stream-mode: on-change
```

How much of a heavy load are we talking about? I see you enabled the api-server metrics and you have a Prometheus server.

```yaml
api-server:
  address: :7890
  enable-metrics: true
```

Do you have a target definition for

A few more optimisations:

This processor matches ALL your events, and runs the `replace` transform on each of them:

```yaml
rename-metrics:
  event-strings:
    value-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "interfaces/interface/.*/description"
          new: "ifAlias"
```

If you know exactly what you are going to replace, set it in the matching section, not in the transform:

```yaml
rename-metrics:
  event-strings:
    value-names:
      - "interfaces/interface/.*/description"
    transforms:
      - replace:
          apply-on: "name"
          old: "interfaces/interface/.*/description"
          new: "ifAlias"
```

Same for this processor.

```yaml
rename-metrics-arista-ngbb:
  event-strings:
    value-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
          new: "LatencyAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/jitter"
          new: "JitterAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memTotal"
          new: "MemTotalAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memAvailable"
          new: "MemAvailableAristaXBR"
      - replace:
          apply-on: "name"
          old: "/queues/queue"
          new: "_queue"
      - trim-prefix:
          apply-on: "name"
          prefix: "/interfaces"
      - trim-prefix:
          apply-on: "name"
          prefix: "/qos/interfaces"
```

All these transforms are independent from each other.

In this processor, the old tags are well known:

```yaml
rename-labels-interface:
  event-strings:
    tag-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"
```

I would place them in the `tag-names` matching section:

```yaml
rename-labels-interface:
  event-strings:
    tag-names:
      - "source"
      - "interface_name"
      - ".*interface-id"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"
```

There are a couple more processors like this; I think you get the idea. You can save a lot by skipping a few regex evaluations (over 30 routers).
How does the workers setting work?
It defines the number of parallel routines reading gNMI notifications from the target's buffer and converting them into Prometheus metrics. It's supposed to help deal with a high rate of notifications.
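For context, a minimal sketch of where such a worker-count option sits in a prometheus output definition; the output name and listen address are placeholders, and `num-workers` as the option name is an assumption rather than something quoted in this thread.

```yaml
outputs:
  prom:                  # placeholder output name
    type: prometheus
    listen: :9273        # placeholder listen address
    num-workers: 10      # assumed option name: parallel routines turning notifications into metrics
```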
Understood, I bumped it to 10 and I'm seeing some marginal improvements. What's the "recommended" number of workers, or how do I find the optimal number?
There is no recommended number, really. It depends on the pattern (rate and size) of the updates you are getting.
Got it!
I could only use something like
but it looks like this metric disappears for a specific source if gnmic can't connect to it anymore. How do you folks monitor it? This could be used
but it doesn't tell me which target is erroring, though.
Currently, this is your best bet:
Can't that metric default to zero if it's not returned?
No, once the box becomes gNMI unreachable, all those metrics disappear; they don't become 0.
Some sort of temporary workaround here: #419 (comment)
Raised a feature request: #513.
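Until something like that lands, one generic Prometheus-side pattern (not from this thread) is to alert on per-source series that were present recently but have stopped arriving; the metric name below is just a placeholder for any metric every target normally exports.

```yaml
groups:
  - name: gnmic-target-reachability
    rules:
      - alert: GnmiTargetStoppedReporting
        # fires for any "source" that exported the placeholder metric an hour ago but no longer does
        expr: >
          max by (source) (interfaces_interface_state_counters_in_octets offset 1h)
          unless
          max by (source) (interfaces_interface_state_counters_in_octets)
        for: 15m
```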
I'm testing this config file https://gist.github.com/aned/8b68e77791dc3bb9eeda903ce54e1643
After adding ~30 targets, I'm seeing some pretty heavy load on the server.
Are there any obvious improvements I can make in the config?
In this section I'm caching lldp. It doesn't change much, so there's no need to do 30s updates. Would it break things if `sample-interval` is set to something like `1h`, or does it need to be done via cache expiration?
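For reference, a sketch of the change this question is about — raising the subscription's sample interval (the subscription name and paths are taken from the lldp example earlier; whether the output cache expiration also needs adjusting is the open question):

```yaml
subscriptions:
  lldp:
    paths:
      - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
      - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: sample
    sample-interval: 1h   # instead of 30s; the on-change subscription suggested earlier avoids this entirely
```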