Performance optimisation for prometheus output #498

Closed
aned opened this issue Aug 1, 2024 · 11 comments

aned commented Aug 1, 2024

I'm testing this config file https://gist.github.com/aned/8b68e77791dc3bb9eeda903ce54e1643
After adding ~30 targets, I'm seeing some pretty heavy load on the server.
Are there any obvious improvements I can make in the config?

In this section

  lldp:
    paths:
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: sample
    sample-interval: 30s
    heartbeat-interval: 30s
    updates-only: false

I'm caching lldp; it doesn't change much, so there's no need for 30s updates. Would it break anything if sample-interval were set to something like 1h, or does that need to be handled via cache expiration?


karimra commented Aug 1, 2024

I see you went all out with the processors :)

The first obvious change I would make is moving drop-metrics up in the list of processors under the output. If the event messages are going to be dropped, they don't need to travel down the processor pipeline, unless you are using them to enrich other values (I didn't see that in the starlark processors).
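
For illustration, something like this (a sketch only; the output name, listen address and the rest of the processor list stand in for whatever you have in your gist):

outputs:
  prom-output:
    type: prometheus
    listen: ":9804"
    event-processors:
      # drop first: messages removed here never reach the remaining processors
      - drop-metrics
      - rename-metrics
      - rename-labels-interface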

As for the lldp subscription, if it's not going to change much, use an on-change subscription (if the router supports it):

  lldp:
    paths:
       - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
       - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
    stream-mode: on-change

How much of a heavy load are we talking about? I see you enabled the api-server metrics and you have a Prometheus server.

api-server:
  address: :7890
  enable-metrics: true

Do you have a target definition for gnmic:7890/metrics? That way we will be able to see how much (and what) is being used.
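
Something along these lines on the Prometheus side would do (a sketch; the job name is a placeholder and I'm assuming Prometheus can reach gnmic at that address):

scrape_configs:
  - job_name: "gnmic"
    # metrics_path defaults to /metrics
    static_configs:
      - targets:
          - "gnmic:7890"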

A few more optimisations:

This processor matches ALL your events and runs the old regex against all of their value names.

  rename-metrics:
    event-strings:
      value-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "interfaces/interface/.*/description"
            new: "ifAlias"

If you know exactly what you are going to replace, set it in the matching section, not in the transform:

rename-metrics:
  event-strings:
    value-names:
      - "interfaces/interface/.*/description"
    transforms:
      - replace:
          apply-on: "name"
          old: "interfaces/interface/.*/description"
          new: "ifAlias"

Same for this processor.

rename-metrics-arista-ngbb:
  event-strings:
    value-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
          new: "LatencyAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/jitter"
          new: "JitterAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memTotal"
          new: "MemTotalAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memAvailable"
          new: "MemAvailableAristaXBR"
      - replace:
          apply-on: "name"
          old: "/queues/queue"
          new: "_queue"
      - trim-prefix:
          apply-on: "name"
          prefix: "/interfaces"
      - trim-prefix:
          apply-on: "name"
          prefix: "/qos/interfaces"

All these transforms are independent of each other.
The transforms in a single event-strings processor are applied to all the event messages in sequence.
So I would create a separate processor for each one.
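
For example, something like this for a couple of the transforms above (the processor names are made up; the point is that each processor only matches the value names it actually rewrites):

rename-packetloss-arista:
  event-strings:
    value-names:
      - ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"

rename-memtotal-arista:
  event-strings:
    value-names:
      - ".*meminfo/memTotal"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*meminfo/memTotal"
          new: "MemTotalAristaXBR"

Each of these would then be listed individually under the output's event-processors.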

In this processor, the old tags are well known

rename-labels-interface:
  event-strings:
    tag-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"

I would place them in the tag-names field, or even create a processor for each one:

rename-labels-interface:
  event-strings:
    tag-names:
      - "source"
      - "interface_name"
      - ".*interface-id"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"

There are a couple more processors like this; I think you get the idea. You can save a lot by skipping a few regex evaluations (over 30 routers).


aned commented Aug 1, 2024

Got it, thanks for the input!
I went from 17 to 32 targets and updated the config as suggested above. The load seems reasonable; I'll do more tweaking, but it's got potential!

[dashboard screenshot]


aned commented Aug 1, 2024

How does num-workers: 5 affect things in the outputs configuration?


karimra commented Aug 1, 2024

How does num-workers: 5 affect things in the outputs configuration?

It defines the number of parallel routines reading gNMI notifications from the target's buffer and converting them into Prometheus metrics. It's supposed to help deal with a high rate of notifications.
Looking at the dashboards you shared, I think you might benefit from more workers. It would reduce the total number of goroutines you have running.
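
For reference, it sits under the output definition, e.g. (a sketch; the output name and listen address are placeholders for what's in your gist):

outputs:
  prom-output:
    type: prometheus
    listen: ":9804"
    # number of parallel routines converting notifications into Prometheus metrics
    num-workers: 5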


aned commented Aug 1, 2024

Understood. I bumped it to 10 and I'm seeing some marginal improvements. What's the "recommended" number of workers, or how do I find the optimal number?


karimra commented Aug 2, 2024

There is no recommended number really. It depends on the pattern (rate and size) of the updates you are getting.
I would aim at lowering the number of goroutines running and keeping it stable over multiple sample intervals.
The optimal number depends on whether you are optimizing for memory or CPU. If you want to reduce memory usage, add more workers so that notifications are not sitting in memory waiting to be processed. If you want to reduce CPU usage, reduce the number of workers, but you will use more memory.
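
Assuming the api-server /metrics endpoint exposes the standard Go runtime collector (stock Prometheus Go clients register it by default), you can track that with something like:

go_goroutines{job=~"$job_name"}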


aned commented Aug 2, 2024

Got it!
In terms of monitoring targets (that can't be subscribed to due to auth issues, potential ACL issues, etc.), from what I can see in the API /metrics endpoint

api-server:
  address: :7890
  enable-metrics: true

I could only use something like

sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{ job=~"$job_name"}[2m])) ==0

but it looks like this metric disappears for a specific source if gnmic can't connect to it anymore. How do you folks monitor that?

This could be used

rate(grpc_client_handled_total{job=~"$job_name"}[2m]) > x 

but it doesn't tell me which target is erroring.


karimra commented Aug 7, 2024

Currently, this is your best bet:

sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{ job=~"$job_name"}[2m])) ==0

Can't that metric default to zero if it's not returned?


aned commented Aug 7, 2024

No, once the box becomes gNMI-unreachable, all those metrics disappear; they don't become 0.
It'd only work if the gNMI connection stays up.


karimra commented Aug 15, 2024

There's a temporary workaround here: #419 (comment)


aned commented Aug 21, 2024

Raised a feature request: #513.

aned closed this as completed Aug 21, 2024