
Deploying gNMIc on Kubernetes clustered - losing targets #526

Open
pboers1988 opened this issue Oct 8, 2024 · 2 comments
pboers1988 commented Oct 8, 2024

When deploying gNMIc clustered on Kubernetes I'm running into the following problem: when k8s reschedules a pod to a different node, or triggers a restart on a pod for whatever reason (e.g. resource constraints), the clustering process loses targets and does not recover. To recover, I have to delete the leader lock (in Consul, or the k8s lease), which causes the cluster to rebalance and re-acquire targets. I'm deploying the gNMIc process as a StatefulSet and collecting roughly 22k metrics per second from around 350 nodes, so this deployment is resource intensive.

Do any other users have ideas on how I could deploy gNMIc in a better way?

My pipeline is as follows:

Routers -> gNMIc collector -> Kafka -> gNMIc relay -> Influxdb

The collector configuration:

```yaml
api-server:
  address: :7890
  cache:
    address: redis-master.production.svc.cluster.local:6379
    type: redis
  debug: false
  enable-metrics: true
debug: false
encoding: proto
format: event
gzip: false
password: ${GNMIC_PASSWORD}
skip-verify: true
username: ${GNMIC_USERNAME}
outputs:
  k8s-cluster:
    address: ${KAFKA_BOOSTRAP_SERVER}
    cache:
      address: redis-master.production.svc.cluster.local:6379
      expiration: 60s
      type: redis
    event-processors:
    - group-by-interface-and-source
    group-id: ${KAFKA_GROUP}
    sasl:
      mechanism: ${KAFKA_AUTH_MECH}
      password: ${KAFKA_PASSWORD}
      user: ${KAFKA_USERNAME}
    tls:
      skip-verify: true
    topic: ${KAFKA_TOPIC}
    type: kafka
processors:
  group-by-interface-and-source:
    event-group-by:
      tags:
      - interface_name
      - source
subscriptions:
  components:
    mode: stream
    paths:
    - /components
    stream-mode: target-defined
  network_instances_bgp:
    mode: stream
    paths:
    - /network-instances/network-instance/protocols/protocol/bgp
    - /network-instances/network-instance/interfaces
    - /network-instances/network-instance/state
    stream-mode: target-defined
  port_stats:
    mode: stream
    paths:
    - /interfaces
    stream-mode: target-defined
  system:
    mode: stream
    paths:
    - /system
    stream-mode: target-defined
targets:
  target1t:
    address: target1
    subscriptions:
    - port_stats
    - network_instances_bgp
    - components
    - system
 
  # <snip: remaining targets>
```

The clustering configuration:

```yaml
api: ":7890"
clustering:
  cluster-name: gnmic-collector
  locker:
    type: consul
    address: gnmic-consul-svc:8500
gnmi-server:
  address: ":57400"
  debug: false
  cache:
    type: redis
    address: redis-master.production.svc.cluster.local:6379
```
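For reference, the StatefulSet itself is roughly shaped like the sketch below. This is illustrative only, not the exact manifest: names, namespace, image tag, and replica count are assumptions.

```yaml
# Illustrative sketch only - not the exact manifest. Each replica
# loads the same config shown above; the clustering/locker section
# lets the pods elect a leader that distributes the targets.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gnmic-collector
  namespace: streaming
spec:
  serviceName: gnmic-collector
  replicas: 3
  selector:
    matchLabels:
      app: gnmic-collector
  template:
    metadata:
      labels:
        app: gnmic-collector
    spec:
      containers:
      - name: gnmic
        image: ghcr.io/openconfig/gnmic:latest  # pin a real tag in practice
        args: ["subscribe", "--config", "/app/config.yaml"]
        ports:
        - containerPort: 7890   # API server
        - containerPort: 57400  # gNMI server
        volumeMounts:
        - name: config
          mountPath: /app
      volumes:
      - name: config
        configMap:
          name: gnmic-config
```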
peejaychilds (Contributor) commented:

Sorry, I probably won't be of much help, but I'm always interested in how others are deploying. We currently have a k8s deployment for a proof of concept.

I was thinking of Consul for clustering -- how do you delete the leader lock?

Our pipeline is:
Routers -> gNMIc collector -> Telegraf -> InfluxDB, with Kapacitor doing CQs for rollup (into 5-minute samples for strategic data collection, for items where we want a longer, non-tactical view).

The Telegraf is a sidecar with a health check that dies if the buffer is more than x,000 records - so we get a bit of buffering available if we momentarily lose the connection to Influx, etc. Roughly wired up as in the sketch below.
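A minimal sketch of the sidecar wiring, assuming Telegraf's health output plugin is what serves the check; the port, path, and thresholds here are illustrative, not our actual values:

```yaml
# Sketch of the sidecar's liveness probe (illustrative values).
# Telegraf's health output plugin serves HTTP and flips to an
# unhealthy status when its checks fail (e.g. buffered records
# above a threshold), so Kubernetes restarts the sidecar once the
# buffer stays too large.
livenessProbe:
  httpGet:
    path: /
    port: 8888   # assumed listen port for the health endpoint
  periodSeconds: 30
  failureThreshold: 3
```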

I use a bash script to statically assign devices to StatefulSet pods via hard-coded YAML in the pods. Not particularly flexible, but OK for a proof of concept, and we don't add/remove devices very often.

If a pod restarts, well, it restarts, and we lose telemetry until it comes back.

I have a bunch of different device types and profiles, and the script spreads them over the pods in a deterministic way (conceptually like the sketch below) ... so I know that for a zone/region, devices of type X will be on zone Y's pod #3, etc. So monitoring pod resources and telemetry means we get pretty consistent graphs.
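Conceptually the assignment is just stable hashing. A simplified sketch (not the real script; file names and pod count are made up):

```bash
#!/usr/bin/env bash
# Simplified sketch of the static assignment (not the real script):
# hash each device name to one of N pods so the mapping is stable
# across runs, then emit per-pod target stanzas to paste into the
# hard-coded YAML.
PODS=4
while read -r device; do
  idx=$(( $(cksum <<<"$device" | cut -d' ' -f1) % PODS ))
  printf '  %s:\n    address: %s\n' "$device" "$device" \
    >> "targets-pod-${idx}.yaml"
done < devices.txt
```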

We run an 'A' telemetry stack and a 'B' telemetry stack which are independent but poll all the devices, so if we blow an Influx instance or lose a storage DC we have another for tactical purposes - and we stage upgrades in production one side at a time, etc...

743 devices currently, 8k points/second -- we filter out any stat not in a specific allow list: pre-prod gets everything, prod drops most of the metrics from things like /interfaces unless specifically allow-listed. We are trying to 're-work' from a point where we had JTI telemetry doing 60k/sec for 70 devices, so we don't collect metrics we don't need/use.


pboers1988 commented Oct 9, 2024

We deploy the gNMIc cluster using this chart: https://github.com/workfloworchestrator/gnmic-cluster-chart, following the instructions here: https://gnmic.openconfig.net/user_guide/HA/

As for deleting the leader lock, it's relatively simple; when using k8s you need to delete the lease:

```bash
kubectl delete lease -n streaming gnmic-collector-leader
```

We are running on AKS and were running into kube-apiserver rate-limit issues when the cluster leader was managing the leases. We decided to use the other option, Consul, to store the quorum state. Consul has a web interface that you can browse to; there you can see what is stored in Consul and edit/delete entries.
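From the CLI, the equivalent is deleting the leader key from Consul's KV store. The key path below is an assumption based on what we see in our Consul UI; it can differ per gNMIc version/config, so verify it in the UI before deleting anything:

```bash
# Hypothetical key path - check the Consul UI for the actual key
# your gnmic version writes before deleting anything:
consul kv delete gnmic/gnmic-collector/leader
```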

[Screenshot: Consul web UI showing the stored gNMIc cluster keys]

gNMIc's clustering mode relies on the cluster leader to dispatch targets to the separate workers.
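If you want to see which instance ended up owning which targets after a rebalance, the REST API can help. A rough sketch, assuming the API listens on :7890 as in the config above and that your gNMIc version exposes a /api/v1/targets endpoint (verify against its API docs); jq runs locally:

```bash
# Sketch: list the targets one pod is actively handling.
kubectl -n streaming port-forward pod/gnmic-collector-0 7890:7890 &
sleep 2
curl -s http://localhost:7890/api/v1/targets | jq -r 'keys[]'
kill %1
```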
