We have an AKS cluster running Kubernetes 1.30.3 with node image version AKSUbuntu-2204gen2containerd-202407.29.0.
Yesterday we had about one hour of downtime on this cluster. By downtime, I mean that all the web applications hosted there were inaccessible.
We also couldn't execute any kubectl commands against the cluster, such as `kubectl top nodes`.
Please help us identify whether this is a bug in an AKS component, an Azure infrastructure issue, or a problem in our own environment. We found no Azure planned maintenance scheduled for that window, and the Azure status page showed no issues either.
We still haven't identified the root cause, and any support, help, or suggestions are highly appreciated.
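For reference, this is a minimal sketch of the reachability checks we would run from outside the cluster in a situation like this (the resource group, cluster name, and API server FQDN below are placeholders, not our real values):

```bash
# Placeholders - substitute your own resource group, cluster name and API server FQDN.
RG="my-resource-group"
CLUSTER="my-aks-cluster"
FQDN="dns-weu-xxxxxx.hcp.westeurope.azmk8s.io"

# Does Azure still report the cluster as provisioned and running?
az aks show --resource-group "$RG" --name "$CLUSTER" \
  --query "{provisioningState:provisioningState, powerState:powerState.code}" -o table

# Can the API server FQDN be resolved and reached at all?
# (Even an unauthorized response proves TCP/TLS connectivity.)
nslookup "$FQDN"
curl -sv --max-time 10 --output /dev/null "https://$FQDN/healthz"

# Does the API server answer an authenticated readiness probe?
kubectl get --raw '/readyz?verbose' --request-timeout=10s
```

During the outage the last two checks timed out for us as well, which is why we suspect the control plane or its network path rather than our workloads.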
In the kube-system logs, I found the following:
coredns:
MountVolume.SetUp failed for volume "kube-api-access-xxxxxx" : failed to fetch token: Post "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/serviceaccounts/coredns/token": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
konnectivity-agent:
0/8 nodes are available: 8 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
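The `node.kubernetes.io/unreachable` taint is added by the node lifecycle controller when kubelet heartbeats stop arriving, so all 8 nodes apparently lost contact with the control plane at the same time. This is a quick sketch of what we check in that situation; the `app=konnectivity-agent` label is what the AKS-managed deployment appears to use, so treat it as an assumption:

```bash
# Which nodes are NotReady and which taints (e.g. node.kubernetes.io/unreachable) do they carry?
kubectl get nodes -o wide
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# State and recent logs of the konnectivity tunnel agents in kube-system.
kubectl -n kube-system get pods -l app=konnectivity-agent -o wide
kubectl -n kube-system logs deploy/konnectivity-agent --tail=50
```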
Subsequent events we spotted:
- Cluster health degraded: Oct 30, 12:18 PM - 1:13 PM
- Load balancer health issue: Oct 30, 12:49 PM - 1:18 PM
- A health issue was also spotted on the node pool that hosts our ingress controller: Oct 30, 12:44 PM - ...
This VMSS was manually restarted and then scaled.
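For completeness, the equivalent az CLI operations would look roughly like this; all names and the node count below are placeholders, not our exact commands:

```bash
# The node pool's VMSS lives in the cluster's node resource group (MC_*).
NODE_RG=$(az aks show --resource-group my-resource-group --name my-aks-cluster \
  --query nodeResourceGroup -o tsv)

# Restart every instance in the scale set ...
az vmss restart --resource-group "$NODE_RG" --name aks-xxxxxx-23859948-vmss

# ... then scale the node pool through AKS rather than on the VMSS directly.
az aks nodepool scale --resource-group my-resource-group --cluster-name my-aks-cluster \
  --name ingresspool --node-count 4
```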
Cluster events captured once the cluster became reachable again:
32m Warning EgressBlocked node/aks-xxxxxx-vmss000001 Required endpoints are unreachable (curl: (28) Failed to connect to login.microsoftonline.com port 443 after 9962 ms: Connection timed out ;curl: (28) Failed to connect to management.azure.com port 443 after 5202 ms: Connection timed out ;curl: (28) Failed to connect to packages.microsoft.com port 443 after 5203 ms: Connection timed out ;curl: (28) Connection timed out after 10000 milliseconds: https://dns-weu-xxxxxxx.hcp.westeurope.azmk8s.io/healthz ), aka.ms/AArpzy5 for more information.
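To double-check the EgressBlocked finding ourselves, the same probes can be re-run from a node debug pod, which shares the host network namespace and therefore uses the node's own egress path. The node name and FQDN are placeholders, and `curlimages/curl` is just one convenient image that ships curl:

```bash
NODE="aks-xxxxxx-vmss000001"
FQDN="dns-weu-xxxxxx.hcp.westeurope.azmk8s.io"

# Probe one of the required endpoints from the node itself; repeat for
# management.azure.com, packages.microsoft.com, etc.
kubectl debug "node/$NODE" -it --image=curlimages/curl -- \
  curl -v --connect-timeout 10 --output /dev/null https://login.microsoftonline.com

# Probe the cluster's own API server health endpoint from the node.
kubectl debug "node/$NODE" -it --image=curlimages/curl -- \
  curl -v --connect-timeout 10 "https://$FQDN/healthz"

# kubectl debug leaves node-debugger pods behind; clean them up afterwards.
kubectl get pods -o name | grep node-debugger | xargs -r kubectl delete
```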
Events from a node hosting the ingress controller that wasn't removed/redeployed by the VMSS restart:
44m Warning ConntrackFull node/aks-xxxxxx-23859948-vmss000003 Conntrack table usage over 90%: 131069 out of 131072
44m Warning ConntrackFull node/aks-xxxxxx-23859948-vmss000003 Conntrack table usage over 90%: 118173 out of 131072
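These warnings show the node sitting right at its nf_conntrack_max of 131072; a full conntrack table drops new connections, which could explain the ingress outage persisting even while the API server was coming back. A small sketch of how to confirm the live counters on the node (node name and image are placeholders; if this turns out to be the real bottleneck, the limit can be raised via AKS custom node configuration, i.e. linuxOSConfig sysctls, as far as we understand):

```bash
NODE="aks-xxxxxx-23859948-vmss000003"

# The node debug pod shares the host network namespace, so these counters are the node's own.
# First value is the current number of tracked connections, second is the hard limit.
kubectl debug "node/$NODE" -it --image=busybox -- \
  sh -c 'cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max'
```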
NPM logs from the same node as above:
I1030 10:17:42.317355 1 trace.go:236] Trace[740101859]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:229 (30-Oct-2024 10:17:12.314) (total time: 30002ms):
Trace[740101859]: ---"Objects listed" error:Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?resourceVersion=657808491": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout 30001ms (10:17:42.316)
Trace[740101859]: [30.00219986s] [30.00219986s] END
E1030 10:17:42.317808 1 reflector.go:147] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:229: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?resourceVersion=657808491": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
W1030 10:17:42.525945 1 reflector.go:535] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:229: failed to list *v1.Pod: Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/pods?resourceVersion=657810464": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout