
Deploying gNMIc on Kubernetes clustered - losing targets #526

Open
pboers1988 opened this issue Oct 8, 2024 · 2 comments
pboers1988 commented Oct 8, 2024

When deploying gNMIc clustered on Kubernetes I'm running into the following problem: when k8s reschedules a pod to a different node, or triggers a restart on a pod for whatever reason (e.g. resource constraints), the clustering process loses targets and does not recover. To recover, I have to delete the leader lock (in Consul, or the k8s lease), which causes the cluster to rebalance and re-acquire targets. I'm deploying the gNMIc process as a StatefulSet and collecting roughly 22k metrics per second from around 350 nodes, so this deployment is resource intensive.

Do any other users have ideas on how I could deploy gNMIc in a better way?

My pipeline is as follows:

Routers -> gNMIc collector -> Kafka -> gNMIc relay -> Influxdb

The collector configuration:

```yaml
api-server:
  address: :7890
  cache:
    address: redis-master.production.svc.cluster.local:6379
    type: redis
  debug: false
  enable-metrics: true
debug: false
encoding: proto
format: event
gzip: false
password: ${GNMIC_PASSWORD}
skip-verify: true
username: ${GNMIC_USERNAME}
outputs:
  k8s-cluster:
    address: ${KAFKA_BOOSTRAP_SERVER}
    cache:
      address: redis-master.production.svc.cluster.local:6379
      expiration: 60s
      type: redis
    event-processors:
    - group-by-interface-and-source
    group-id: ${KAFKA_GROUP}
    sasl:
      mechanism: ${KAFKA_AUTH_MECH}
      password: ${KAFKA_PASSWORD}
      user: ${KAFKA_USERNAME}
    tls:
      skip-verify: true
    topic: ${KAFKA_TOPIC}
    type: kafka
processors:
  group-by-interface-and-source:
    event-group-by:
      tags:
      - interface_name
      - source
subscriptions:
  components:
    mode: stream
    paths:
    - /components
    stream-mode: target-defined
  network_instances_bgp:
    mode: stream
    paths:
    - /network-instances/network-instance/protocols/protocol/bgp
    - /network-instances/network-instance/interfaces
    - /network-instances/network-instance/state
    stream-mode: target-defined
  port_stats:
    mode: stream
    paths:
    - /interfaces
    stream-mode: target-defined
  system:
    mode: stream
    paths:
    - /system
    stream-mode: target-defined
targets:
  target1t:
    address: target1
    subscriptions:
    - port_stats
    - network_instances_bgp
    - components
    - system
 
  # <snip: remaining targets>
```

The clustering configuration:

```yaml
api: ":7890"
clustering:
  cluster-name: gnmic-collector
  locker:
    type: consul
    address: gnmic-consul-svc:8500
gnmi-server:
  address: ":57400"
  debug: false
  cache:
    type: redis
    address: redis-master.production.svc.cluster.local:6379
```
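For reference, the StatefulSet itself is roughly shaped like the sketch below. This is illustrative only, not the exact manifest: names, namespace, image tag, and replica count are assumptions.

```yaml
# Illustrative sketch only - not the exact manifest. Each replica
# loads the same config shown above; the clustering/locker section
# lets the pods elect a leader that distributes the targets.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gnmic-collector
  namespace: streaming
spec:
  serviceName: gnmic-collector
  replicas: 3
  selector:
    matchLabels:
      app: gnmic-collector
  template:
    metadata:
      labels:
        app: gnmic-collector
    spec:
      containers:
      - name: gnmic
        image: ghcr.io/openconfig/gnmic:latest  # pin a real tag in practice
        args: ["subscribe", "--config", "/app/config.yaml"]
        ports:
        - containerPort: 7890   # API server
        - containerPort: 57400  # gNMI server
        volumeMounts:
        - name: config
          mountPath: /app
      volumes:
      - name: config
        configMap:
          name: gnmic-config
```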
peejaychilds (Contributor) commented:

Sorry, I probably won't be of much help, but I'm always interested in how others are deploying. We currently have a k8s deployment for a proof of concept.

I was thinking of Consul for clustering -- how do you delete the leader lock?

Our pipeline is:
Routers -> gNMIc collector -> Telegraf -> InfluxDB, with Kapacitor doing CQs for rollup (into 5-minute samples for strategic data collection, for items where we want a longer, non-tactical view).

The Telegraf is a sidecar with a health check that dies if the buffer is more than x,000 records - so we get a bit of buffering available if we momentarily lose the connection to Influx, etc. Roughly wired up as in the sketch below.
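A minimal sketch of the sidecar wiring, assuming Telegraf's health output plugin is what serves the check; the port, path, and thresholds here are illustrative, not our actual values:

```yaml
# Sketch of the sidecar's liveness probe (illustrative values).
# Telegraf's health output plugin serves HTTP and flips to an
# unhealthy status when its checks fail (e.g. buffered records
# above a threshold), so Kubernetes restarts the sidecar once the
# buffer stays too large.
livenessProbe:
  httpGet:
    path: /
    port: 8888   # assumed listen port for the health endpoint
  periodSeconds: 30
  failureThreshold: 3
```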

I use a bash script to statically assign devices to StatefulSet pods via hard-coded YAML in the pods. Not particularly flexible, but OK for a proof of concept, and we don't add/remove devices very often.

If a pod restarts, well, it restarts, and we lose telemetry until it comes back.

I have a bunch of different device types and profiles, and the script spreads them over the pods in a deterministic way (conceptually like the sketch below) ... so I know that for a zone/region, devices of type X will be on zone Y's pod #3, etc. So monitoring pod resources and telemetry means we get pretty consistent graphs.
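Conceptually the assignment is just stable hashing. A simplified sketch (not the real script; file names and pod count are made up):

```bash
#!/usr/bin/env bash
# Simplified sketch of the static assignment (not the real script):
# hash each device name to one of N pods so the mapping is stable
# across runs, then emit per-pod target stanzas to paste into the
# hard-coded YAML.
PODS=4
while read -r device; do
  idx=$(( $(cksum <<<"$device" | cut -d' ' -f1) % PODS ))
  printf '  %s:\n    address: %s\n' "$device" "$device" \
    >> "targets-pod-${idx}.yaml"
done < devices.txt
```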

We run an 'A' telemetry stack and a 'B' telemetry stack which are independent but poll all the devices, so if we blow an Influx instance or lose a storage DC we have another for tactical purposes - and we stage upgrades in production one side at a time, etc...

743 devices currently, 8k points/second -- we filter out any stat not in a specific allow list: pre-prod gets everything, prod drops most of the metrics from things like /interfaces unless specifically allow-listed. We are trying to 're-work' from a point where we had JTI telemetry doing 60k/sec for 70 devices, so we don't collect metrics we don't need/use.


pboers1988 commented Oct 9, 2024

We deploy the gNMIc cluster using this chart: https://github.com/workfloworchestrator/gnmic-cluster-chart, following the instructions here: https://gnmic.openconfig.net/user_guide/HA/

As for deleting the leader lock, it's relatively simple; when using k8s you need to delete the lease:

```bash
kubectl delete lease -n streaming gnmic-collector-leader
```

We are running on AKS and were running into kube-apiserver rate-limit issues when the cluster leader was managing the leases. We decided to use the other option, Consul, to store the quorum state. Consul has a web interface that you can browse to; there you can see what is stored in Consul and edit/delete entries.
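From the CLI, the equivalent is deleting the leader key from Consul's KV store. The key path below is an assumption based on what we see in our Consul UI; it can differ per gNMIc version/config, so verify it in the UI before deleting anything:

```bash
# Hypothetical key path - check the Consul UI for the actual key
# your gnmic version writes before deleting anything:
consul kv delete gnmic/gnmic-collector/leader
```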

[Screenshot: Consul web UI showing the stored gNMIc cluster keys]

gNMIc's clustering mode relies on the cluster leader to dispatch targets to the separate workers.
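If you want to see which instance ended up owning which targets after a rebalance, the REST API can help. A rough sketch, assuming the API listens on :7890 as in the config above and that your gNMIc version exposes a /api/v1/targets endpoint (verify against its API docs); jq runs locally:

```bash
# Sketch: list the targets one pod is actively handling.
kubectl -n streaming port-forward pod/gnmic-collector-0 7890:7890 &
sleep 2
curl -s http://localhost:7890/api/v1/targets | jq -r 'keys[]'
kill %1
```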
