
MountVolume.SetUp failed for volume "kube-api-access-xxxxxx" : failed to fetch token: Post "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/serviceaccounts/coredns/token": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout #4619

Open
mykolaichuk opened this issue Oct 31, 2024 · 0 comments


Hello,

We have an AKS cluster running Kubernetes 1.30.3 with node image version AKSUbuntu-2204gen2containerd-202407.29.0.

Yesterday we had about 1 hour of downtime on this cluster. By downtime, I mean that all our web applications hosted there were inaccessible.
We couldn't execute any kubectl commands against the cluster, like kubectl top nodes.

Please help us identify whether this is a bug in an AKS component, an Azure infrastructure issue, or a problem in our environment. We did not see any Azure planned maintenance scheduled for that time, and the Azure status page reported no issues.
We still haven't identified the root cause, so any support, help, or suggestions would be highly appreciated.

In the kube-system logs, I found the following:

coredns:

MountVolume.SetUp failed for volume "kube-api-access-xxxxxx" : failed to fetch token: Post "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/serviceaccounts/coredns/token": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout

konnectivity-agent:

0/8 nodes are available: 8 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

Subsequent events spotted:

  • Cluster health dropped Oct 30, 12:18PM - 1:13PM
  • Load Balancer health issue Oct 30, 12:49PM - 1:18PM
  • We have one node pool where the ingress controller is hosted, and a health issue was spotted there as well, Oct 30, 12:44PM - ...
    This VMSS was manually restarted and then scaled.
  • Cluster events that were captured once it became reachable:
    Cluster event
32m         Warning   EgressBlocked             node/aks-xxxxxx-vmss000001    Required endpoints are unreachable (curl: (28) Failed to connect to login.microsoftonline.com port 443 after 9962 ms: Connection timed out ;curl: (28) Failed to connect to management.azure.com port 443 after 5202 ms: Connection timed out ;curl: (28) Failed to connect to packages.microsoft.com port 443 after 5203 ms: Connection timed out ;curl: (28) Connection timed out after 10000 milliseconds: https://dns-weu-xxxxxxx.hcp.westeurope.azmk8s.io/healthz ), aka.ms/AArpzy5 for more information.
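
For anyone who wants to repeat the reachability check from a node or a debug pod, here is a minimal sketch in Python. It only attempts a TCP connect to each endpoint, which is the step that timed out in the event above. The function name `probe_endpoints` and the constant `REQUIRED_ENDPOINTS` are hypothetical names chosen for this example; the three hosts are the ones listed in the EgressBlocked message, and the API server FQDN is left out because it is redacted in this issue.

```python
import socket
from typing import Dict, Iterable, Tuple

# Hosts copied from the EgressBlocked event. The cluster's API server FQDN is
# redacted in this issue, so it has to be added by whoever runs the check.
REQUIRED_ENDPOINTS = [
    ("login.microsoftonline.com", 443),
    ("management.azure.com", 443),
    ("packages.microsoft.com", 443),
]


def probe_endpoints(
    endpoints: Iterable[Tuple[str, int]], timeout: float = 5.0
) -> Dict[str, str]:
    """Attempt a TCP connect to each (host, port) and record the outcome."""
    results: Dict[str, str] = {}
    for host, port in endpoints:
        key = f"{host}:{port}"
        try:
            # create_connection resolves the name and completes the TCP
            # handshake; it does not perform a TLS handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[key] = "ok"
        except socket.timeout:
            results[key] = "timeout"
        except OSError as exc:
            # Covers DNS failures, connection refused, unreachable network.
            results[key] = f"error: {exc.__class__.__name__}"
    return results
```

Calling `probe_endpoints(REQUIRED_ENDPOINTS)` returns a dictionary keyed by `host:port`. A `timeout` result for every endpoint at once would be consistent with the node losing outbound connectivity in general rather than with a single service being down, but this sketch cannot tell those cases apart on its own.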

Events from a node that wasn't removed/redeployed after VMSS restart, where the ingress controller is hosted:

44m         Warning   ConntrackFull             node/aks-xxxxxx-23859948-vmss000003   Conntrack table usage over 90%!:(MISSING) 131069 out of 131072
44m         Warning   ConntrackFull             node/aks-xxxxxx-23859948-vmss000003   Conntrack table usage over 90%!:(MISSING) 118173 out of 131072
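
To put the two ConntrackFull events in numbers, here is a small Python sketch that extracts the "used out of limit" pair from the event message and computes the fill level. The helper name `conntrack_usage` is hypothetical; the messages are the two events above, copied verbatim.

```python
import re
from typing import Optional, Tuple

# Matches the "<used> out of <limit>" part of a ConntrackFull event message.
USAGE_RE = re.compile(r"(\d+) out of (\d+)")


def conntrack_usage(message: str) -> Optional[Tuple[int, int, float]]:
    """Return (used, limit, percent_used), or None if the pattern is absent."""
    match = USAGE_RE.search(message)
    if match is None:
        return None
    used, limit = int(match.group(1)), int(match.group(2))
    return used, limit, 100.0 * used / limit


EVENTS = [
    "Conntrack table usage over 90%!:(MISSING) 131069 out of 131072",
    "Conntrack table usage over 90%!:(MISSING) 118173 out of 131072",
]

for event in EVENTS:
    parsed = conntrack_usage(event)
    if parsed is not None:
        used, limit, percent = parsed
        print(f"{used}/{limit} entries used ({percent:.3f}%), {limit - used} free")
```

The first event leaves only 3 free entries out of 131072, so the table on this node was effectively full at that point. On a Linux node the live values can usually be read from `/proc/sys/net/netfilter/nf_conntrack_count` and `/proc/sys/net/netfilter/nf_conntrack_max`.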

NPM logs from the node above:

I1030 10:17:42.317355       1 trace.go:236] Trace[740101859]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (30-Oct-2024 10:17:12.314) (total time: 30002ms):
Trace[740101859]: ---"Objects listed" error:Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?resourceVersion=657808491": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout 30001ms (10:17:42.316)
Trace[740101859]: [30.00219986s] [30.00219986s] END
E1030 10:17:42.317808       1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?resourceVersion=657808491": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
W1030 10:17:42.525945       1 reflector.go:535] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Pod: Get "https://dns-weu-xxxxxx.hcp.westeurope.azmk8s.io:443/api/v1/pods?resourceVersion=657810464": dial tcp xxx.xxx.xxx.xxx:443: i/o timeout
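
To line these log entries up against the health timeline, it can help to pull out only the entries that report an API server timeout. Below is a rough Python sketch that parses the standard klog header (severity letter, month and day, time, thread ID, source file and line) and keeps entries whose message contains "i/o timeout". The names `LogEntry` and `timeout_entries` are hypothetical, and the sample lines are abbreviated versions of the NPM logs above, not the full originals.

```python
import re
from typing import Iterable, List, NamedTuple

# klog header layout: Lmmdd hh:mm:ss.uuuuuu threadid file:line] message
KLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<source>[^\]]+)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    severity: str
    month: int
    day: int
    time: str
    source: str
    message: str


def timeout_entries(lines: Iterable[str]) -> List[LogEntry]:
    """Return klog entries whose message mentions an i/o timeout."""
    hits: List[LogEntry] = []
    for line in lines:
        match = KLOG_RE.match(line)
        if match is None:
            continue  # continuation lines such as "Trace[...]" have no header
        if "i/o timeout" in match.group("message"):
            hits.append(
                LogEntry(
                    severity=match.group("severity"),
                    month=int(match.group("month")),
                    day=int(match.group("day")),
                    time=match.group("time"),
                    source=match.group("source"),
                    message=match.group("message"),
                )
            )
    return hits


# Abbreviated from the NPM logs above (messages shortened for readability).
SAMPLE = [
    "E1030 10:17:42.317808       1 reflector.go:147] Failed to watch *v1.Namespace: i/o timeout",
    "W1030 10:17:42.525945       1 reflector.go:535] failed to list *v1.Pod: i/o timeout",
]

for entry in timeout_entries(SAMPLE):
    print(entry.severity, f"{entry.month:02d}-{entry.day:02d}", entry.time, entry.source)
```

The klog header carries no year or time zone, so the timestamps need to be converted to the same zone as the portal timeline before comparing them.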