Deploying gNMIc on Kubernetes clustered - losing targets #526
Comments
Sorry, I probably won't be of much help, but I'm always interested in how others are deploying. We have a k8s deployment for a proof-of-concept currently. I was thinking of Consul or a cluster -- how do you delete the leader lock?

Our pipeline runs Telegraf as a sidecar with a health check that dies if the buffer grows beyond x,000 records, so we get a bit of buffering available if we momentarily lose the connection to Influx. I use a bash script to statically assign devices to statefulset nodes via hard-coded YAML in the pods. It's not particularly flexible, but it's OK for a proof-of-concept and we don't add/remove devices very often. If a pod restarts, well, it restarts, and we lose telemetry until it comes back. I have a bunch of different device types and profiles, and the script spreads them over the pods in a deterministic way, so I know that for a zone/region, devices of type X will be on zone Y's pod #3, etc. Monitoring pod resources and telemetry therefore gives us pretty consistent graphs.

We run an 'A' telemetry stack and a 'B' telemetry stack which are independent but poll all the devices, so if we blow an Influx instance or lose a storage DC we have another for tactical purposes, and we stage upgrades in production one side at a time. We have 743 devices currently, at about 8k points/second. We filter out any stat not in a specific allow list: pre-prod gets everything, prod drops most of the metrics from things like /interfaces unless specifically allow-listed. We are trying to rework from a point where we had JTI telemetry doing 60k/sec for 70 devices, so we don't collect metrics we don't need or use.
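A minimal sketch of that kind of deterministic assignment, assuming a devices.txt with one device address per line and a fixed number of statefulset replicas (the file names, pod count, and per-target options are hypothetical, not the script described above):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: spread devices deterministically over statefulset pods.
# Assumes devices.txt lists one device address per line; PODS is the number
# of gnmic replicas. Output is one gnmic-style targets file per pod.
set -euo pipefail
PODS=4

for i in $(seq 0 $((PODS - 1))); do
  echo "targets:" > "targets-pod-${i}.yaml"
done

while read -r device; do
  # cksum gives a stable hash, so a given device always lands on the same pod.
  idx=$(( $(printf '%s' "$device" | cksum | cut -d' ' -f1) % PODS ))
  printf '  %s:\n    insecure: true\n' "$device" >> "targets-pod-${idx}.yaml"
done < devices.txt
```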
We deploy the gNMIc cluster using this chart: https://github.com/workfloworchestrator/gnmic-cluster-chart, following the instructions here: https://gnmic.openconfig.net/user_guide/HA/

As for deleting the leader lock, it's relatively simple. When using K8s you need to delete the lease:

```bash
kubectl delete lease -n streaming gnmic-collector-leader
```

We are running on AKS and were running into Kube-API rate-limit issues while the cluster leader was managing the leases, so we decided to use the other option, Consul, to store the state of the quorum. Consul has a web interface that you can browse to, where you can inspect what is stored and edit or delete entries. The clustering mode of gNMIc relies on the cluster leader to dispatch targets to the separate workers.
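For reference, a minimal sketch of a gNMIc clustering section using the Consul locker, in the spirit of the HA guide linked above (the cluster name and Consul address are illustrative assumptions, not the chart's actual values):

```yaml
# Illustrative only: clustering state kept in Consul instead of a k8s lease.
clustering:
  cluster-name: gnmic-collector
  locker:
    type: consul
    address: consul.streaming.svc.cluster.local:8500
```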
When deploying gNMIc clustered on Kubernetes I'm running into the following problem: when K8s reschedules a pod to a different node, or for whatever reason triggers a restart of a pod (e.g. resource constraints), the clustering process loses targets and does not recover. To recover from this I have to delete the leader lock (in Consul, or the k8s lease), which causes the cluster to rebalance and re-acquire the targets. I'm deploying the gNMIc process as a statefulset and collecting roughly 22k metrics per second from around 350 nodes, so this deployment is resource intensive.
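Concretely, the recovery step looks something like this (the namespace, lease name, and Consul key path are placeholders that depend on the deployment; list the keys first to find the actual lock):

```bash
# If the k8s locker is used: delete the leader lease so the cluster
# re-elects a leader and redistributes the targets.
kubectl delete lease -n streaming gnmic-collector-leader

# If the Consul locker is used: list the KV store to find the leader key
# for your cluster-name, then delete it (the exact key path is an assumption).
consul kv get -recurse
consul kv delete <cluster-name>/leader
```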
Are there any other users who have an idea how I could deploy gnmic in a better way?
My pipeline is as follows:
Routers -> gNMIc collector -> Kafka -> gNMIc relay -> InfluxDB
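The relay leg of this pipeline can be expressed in gNMIc with a Kafka input feeding an InfluxDB output; the sketch below is illustrative only, and the broker address, topic, org, bucket, and token are placeholders:

```yaml
# Illustrative gNMIc relay config: consume events from Kafka, write to InfluxDB.
inputs:
  kafka-input:
    type: kafka
    address: kafka.streaming.svc.cluster.local:9092
    topic: telemetry
    format: event
    outputs:
      - influx-output

outputs:
  influx-output:
    type: influxdb
    url: http://influxdb:8086
    org: network
    bucket: telemetry
    token: <influx-token>
```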
The collector configuration:
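A minimal sketch of a clustered collector with a Kafka output, for illustration only (the target addresses, subscription paths, intervals, cluster name, locker, and broker address are assumptions, not the configuration used here):

```yaml
# Illustrative only: a clustered gNMIc collector writing events to Kafka.
api-server:
  address: :7890

clustering:
  cluster-name: gnmic-collector
  locker:
    type: k8s        # or consul, as discussed above
    namespace: streaming

targets:
  router1.example.net:57400:
    insecure: true

subscriptions:
  interfaces:
    paths:
      - /interfaces/interface/state/counters
    sample-interval: 30s

outputs:
  kafka-output:
    type: kafka
    address: kafka.streaming.svc.cluster.local:9092
    topic: telemetry
    format: event
```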