Performance optimisation for prometheus output #498

Closed
aned opened this issue Aug 1, 2024 · 11 comments

aned commented Aug 1, 2024

I'm testing this config file https://gist.github.com/aned/8b68e77791dc3bb9eeda903ce54e1643
After adding ~30 targets, I'm seeing some pretty heavy load on the server.
Are there any obvious improvements I can make in the config?

In this section

  lldp:
    paths:
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: sample
    sample-interval: 30s
    heartbeat-interval: 30s
    updates-only: false

I'm caching lldp; it doesn't change much, so there's no need for 30s updates. Would it break anything if sample-interval were set to something like 1h, or does that need to be handled via cache expiration?


karimra commented Aug 1, 2024

I see you went all out with the processors :)

The first obvious change I would make is moving drop-metrics up in the list of processors under the output. If the event messages are going to be dropped, they don't need to travel down the processor pipeline, unless you are using them to enrich other values (I didn't see that in the starlark processors).
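
For illustration, something like this (a sketch only; the output name, listen address and the rest of the processor list stand in for whatever you have in your gist):

outputs:
  prom-output:
    type: prometheus
    listen: ":9804"
    event-processors:
      # drop first: messages removed here never reach the remaining processors
      - drop-metrics
      - rename-metrics
      - rename-labels-interface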

As for the lldp subscription, if it's not going to change much, use an on-change subscription (if the router supports it):

  lldp:
    paths:
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: on-change

How much of a heavy load are we talking about? I see you enabled the api-server metrics and you have a Prometheus server.

api-server:
  address: :7890
  enable-metrics: true

Do you have a target definition for gnmic:7890/metrics? That way we will be able to see how much (and what) is being used.
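
Something along these lines on the Prometheus side would do (a sketch; the job name is a placeholder and I'm assuming Prometheus can reach gnmic at that address):

scrape_configs:
  - job_name: "gnmic"
    # metrics_path defaults to /metrics
    static_configs:
      - targets:
          - "gnmic:7890"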

A few more optimisations:

This processor matches ALL your events and runs the old regex against all of their value names.

  rename-metrics:
    event-strings:
      value-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "interfaces/interface/.*/description"
            new: "ifAlias"

If you know exactly what you are going to replace, set it in the matching section, not in the transform:

rename-metrics:
  event-strings:
    value-names:
      - "interfaces/interface/.*/description"
    transforms:
      - replace:
          apply-on: "name"
          old: "interfaces/interface/.*/description"
          new: "ifAlias"

Same for this processor.

rename-metrics-arista-ngbb:
  event-strings:
    value-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
          new: "LatencyAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/jitter"
          new: "JitterAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memTotal"
          new: "MemTotalAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memAvailable"
          new: "MemAvailableAristaXBR"
      - replace:
          apply-on: "name"
          old: "/queues/queue"
          new: "_queue"
      - trim-prefix:
          apply-on: "name"
          prefix: "/interfaces"
      - trim-prefix:
          apply-on: "name"
          prefix: "/qos/interfaces"

All these transforms are independent of each other.
The transforms in a single event-strings processor are applied to all the event messages in sequence.
So I would create a separate processor for each one.
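
For example, something like this for a couple of the transforms above (the processor names are made up; the point is that each processor only matches the value names it actually rewrites):

rename-packetloss-arista:
  event-strings:
    value-names:
      - ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"

rename-memtotal-arista:
  event-strings:
    value-names:
      - ".*meminfo/memTotal"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*meminfo/memTotal"
          new: "MemTotalAristaXBR"

Each of these would then be listed individually under the output's event-processors.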

In this processor, the old tags are well known

rename-labels-interface:
  event-strings:
    tag-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"

I would place them in the tag-names field, or even create a processor for each one:

rename-labels-interface:
  event-strings:
    tag-names:
      - "source"
      - "interface_name"
      - ".*interface-id"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"

There are a couple more processors like this; I think you get the idea. You can save a lot by skipping a few regex evaluations (over 30 routers).


aned commented Aug 1, 2024

Got it, thanks for the input!
I went from 17 to 32 targets and updated the config as suggested above. The load seems reasonable; I'll do more tweaking, but it's got potential!

[dashboard screenshot]


aned commented Aug 1, 2024

How does num-workers: 5 affect things in the outputs configuration?


karimra commented Aug 1, 2024

How does num-workers: 5 affect things in the outputs configuration?

It defines the number of parallel routines reading gNMI notifications from the target's buffer and converting them into Prometheus metrics. It's supposed to help deal with a high rate of notifications.
Looking at the dashboards you shared, I think you might benefit from more workers. It would reduce the total number of goroutines you have running.
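
For reference, it sits under the output definition, e.g. (a sketch; the output name and listen address are placeholders for what's in your gist):

outputs:
  prom-output:
    type: prometheus
    listen: ":9804"
    # number of parallel routines converting notifications into Prometheus metrics
    num-workers: 5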


aned commented Aug 1, 2024

Understood. I bumped it to 10 and I'm seeing some marginal improvements. What's the "recommended" number of workers, or how do I find the optimal number?


karimra commented Aug 2, 2024

There is no recommended number really. It depends on the pattern (rate and size) of the updates you are getting.
I would aim at lowering the number of goroutines running and keeping it stable over multiple sample intervals.
The optimal number depends on whether you are optimizing for memory or CPU. If you want to reduce memory usage, add more workers so that notifications are not sitting in memory waiting to be processed. If you want to reduce CPU usage, reduce the number of workers, but you will use more memory.
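
Assuming the api-server /metrics endpoint exposes the standard Go runtime collector (stock Prometheus Go clients register it by default), you can track that with something like:

go_goroutines{job=~"$job_name"}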


aned commented Aug 2, 2024

Got it!
In terms of monitoring targets (that can't be subscribed to due to auth issues, potential ACL issues, etc.), from what I can see in the API /metrics endpoint

api-server:
  address: :7890
  enable-metrics: true

I could only use something like

sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{ job=~"$job_name"}[2m])) ==0

but it looks like this metric disappears for a specific source if gnmic can't connect to it anymore. How do you folks monitor that?

This could be used

rate(grpc_client_handled_total{job=~"$job_name"}[2m]) > x 

but it doesn't tell me which target is erroring.


karimra commented Aug 7, 2024

Currently, this is your best bet:

sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{ job=~"$job_name"}[2m])) ==0

Can't that metric default to zero if it's not returned?


aned commented Aug 7, 2024

No, once the box becomes gNMI-unreachable, all those metrics disappear; they don't become 0.
It'd only work if the gNMI connection stays up.


karimra commented Aug 15, 2024

There's a temporary workaround here: #419 (comment)


aned commented Aug 21, 2024

Raised a feature request: #513.

aned closed this as completed Aug 21, 2024