We have an AKS cluster running Kubernetes 1.30.3 with node image version AKSUbuntu-2204gen2containerd-202407.29.0.
Yesterday we had about one hour of downtime on this cluster. By downtime, I mean that all the web applications hosted there were inaccessible.
We also couldn't execute any kubectl commands against the cluster, such as `kubectl top nodes`.
Please help us identify whether this is a bug in an AKS component, an Azure infrastructure issue, or a problem in our own environment. We found no Azure planned maintenance scheduled for that window, and the Azure status page showed no issues either.
We still haven't identified the root cause, and any support, help, or suggestions are highly appreciated.
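For reference, this is a minimal sketch of the reachability checks we would run from outside the cluster in a situation like this (the resource group, cluster name, and API server FQDN below are placeholders, not our real values):

```bash
# Placeholders - substitute your own resource group, cluster name and API server FQDN.
RG="my-resource-group"
CLUSTER="my-aks-cluster"
FQDN="dns-weu-xxxxxx.hcp.westeurope.azmk8s.io"

# Does Azure still report the cluster as provisioned and running?
az aks show --resource-group "$RG" --name "$CLUSTER" \
  --query "{provisioningState:provisioningState, powerState:powerState.code}" -o table

# Can the API server FQDN be resolved and reached at all?
# (Even an unauthorized response proves TCP/TLS connectivity.)
nslookup "$FQDN"
curl -sv --max-time 10 --output /dev/null "https://$FQDN/healthz"

# Does the API server answer an authenticated readiness probe?
kubectl get --raw '/readyz?verbose' --request-timeout=10s
```

During the outage the last two checks timed out for us as well, which is why we suspect the control plane or its network path rather than our workloads.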
In the kube-system logs, I found the following:
coredns:
MountVolume.SetUp failed for volume "kube-api-access-xxxxxx" : failed to fetch token: Post "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/serviceaccounts/coredns/token": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
konnectivity-agent:
0/8 nodes are available: 8 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
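The `node.kubernetes.io/unreachable` taint is added by the node lifecycle controller when kubelet heartbeats stop arriving, so all 8 nodes apparently lost contact with the control plane at the same time. This is a quick sketch of what we check in that situation; the `app=konnectivity-agent` label is what the AKS-managed deployment appears to use, so treat it as an assumption:

```bash
# Which nodes are NotReady and which taints (e.g. node.kubernetes.io/unreachable) do they carry?
kubectl get nodes -o wide
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# State and recent logs of the konnectivity tunnel agents in kube-system.
kubectl -n kube-system get pods -l app=konnectivity-agent -o wide
kubectl -n kube-system logs deploy/konnectivity-agent --tail=50
```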
Subsequent events we spotted:
- Cluster health degraded: Oct 30, 12:18 PM - 1:13 PM
- Load balancer health issue: Oct 30, 12:49 PM - 1:18 PM
- A health issue was also spotted on the node pool that hosts our ingress controller: Oct 30, 12:44 PM - ...
This VMSS was manually restarted and then scaled.
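For completeness, the equivalent az CLI operations would look roughly like this; all names and the node count below are placeholders, not our exact commands:

```bash
# The node pool's VMSS lives in the cluster's node resource group (MC_*).
NODE_RG=$(az aks show --resource-group my-resource-group --name my-aks-cluster \
  --query nodeResourceGroup -o tsv)

# Restart every instance in the scale set ...
az vmss restart --resource-group "$NODE_RG" --name aks-xxxxxx-23859948-vmss

# ... then scale the node pool through AKS rather than on the VMSS directly.
az aks nodepool scale --resource-group my-resource-group --cluster-name my-aks-cluster \
  --name ingresspool --node-count 4
```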
Cluster events captured once the cluster became reachable again:
32m Warning EgressBlocked node/aks-xxxxxx-vmss000001 Required endpoints are unreachable (curl: (28) Failed to connect to login.microsoftonline.com port 443 after 9962 ms: Connection timed out ;curl: (28) Failed to connect to management.azure.com port 443 after 5202 ms: Connection timed out ;curl: (28) Failed to connect to packages.microsoft.com port 443 after 5203 ms: Connection timed out ;curl: (28) Connection timed out after 10000 milliseconds: https://dns-weu-xxxxxxx.hcp.westeurope.azmk8s.io/healthz ), aka.ms/AArpzy5 for more information.
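To double-check the EgressBlocked finding ourselves, the same probes can be re-run from a node debug pod, which shares the host network namespace and therefore uses the node's own egress path. The node name and FQDN are placeholders, and `curlimages/curl` is just one convenient image that ships curl:

```bash
NODE="aks-xxxxxx-vmss000001"
FQDN="dns-weu-xxxxxx.hcp.westeurope.azmk8s.io"

# Probe one of the required endpoints from the node itself; repeat for
# management.azure.com, packages.microsoft.com, etc.
kubectl debug "node/$NODE" -it --image=curlimages/curl -- \
  curl -v --connect-timeout 10 --output /dev/null https://login.microsoftonline.com

# Probe the cluster's own API server health endpoint from the node.
kubectl debug "node/$NODE" -it --image=curlimages/curl -- \
  curl -v --connect-timeout 10 "https://$FQDN/healthz"

# kubectl debug leaves node-debugger pods behind; clean them up afterwards.
kubectl get pods -o name | grep node-debugger | xargs -r kubectl delete
```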
Events from a node hosting the ingress controller that wasn't removed/redeployed by the VMSS restart:
44m Warning ConntrackFull node/aks-xxxxxx-23859948-vmss000003 Conntrack table usage over 90%: 131069 out of 131072
44m Warning ConntrackFull node/aks-xxxxxx-23859948-vmss000003 Conntrack table usage over 90%: 118173 out of 131072
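These warnings show the node sitting right at its nf_conntrack_max of 131072; a full conntrack table drops new connections, which could explain the ingress outage persisting even while the API server was coming back. A small sketch of how to confirm the live counters on the node (node name and image are placeholders; if this turns out to be the real bottleneck, the limit can be raised via AKS custom node configuration, i.e. linuxOSConfig sysctls, as far as we understand):

```bash
NODE="aks-xxxxxx-23859948-vmss000003"

# The node debug pod shares the host network namespace, so these counters are the node's own.
# First value is the current number of tracked connections, second is the hard limit.
kubectl debug "node/$NODE" -it --image=busybox -- \
  sh -c 'cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max'
```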
NPM logs from the same node as above:
I1030 10:17:42.317355 1 trace.go:236] Trace[740101859]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:229 (30-Oct-2024 10:17:12.314) (total time: 30002ms):
Trace[740101859]: ---"Objects listed" error:Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?resourceVersion=657808491": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout 30001ms (10:17:42.316)
Trace[740101859]: [30.00219986s] [30.00219986s] END
E1030 10:17:42.317808 1 reflector.go:147] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:229: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?resourceVersion=657808491": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
W1030 10:17:42.525945 1 reflector.go:535] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:229: failed to list *v1.Pod: Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/pods?resourceVersion=657810464": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout