You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Version: main, latest(f99ecac), and least as far back as 0.8.x
Subsystem: agent
Our setup consists of:
Istio-enabled pods in a Kubernetes cluster
Spire-service and spire-agents deployed
Istio-proxy containers using spire-agent integration: a workload socket set up to its local spire-agent.
Istio-proxy requests on startup, from spire-agent, x509 certs and a set of trust bundles.
In this setup, we've encountered a race condition that results in the istio-proxy being sent an incomplete trust bundle.
Here's the specific sequence that reproduces the problem (which can be reliably reproduced with some node-level iptables hacks to prevent the spire-agent from talking to port 10250):
istio-proxy container makes 2 independent calls to the spire-agent on startup: for an x509 cert and for the trust bundles. The trust bundles that we expect to be returned consist of the local Spire trustBundle plus a set of federated bundles registered in the spire-server.
The kubelet is unresponsive momentarily when these above 2 calls happen, for whatever reason. (this is what triggers the race condition):
The x509 cert generation flat out fails, as expected, because the kubelet can't be reached
spire-agent-0-4pzv2 spire-agent time="2024-11-04T14:33:02Z" level=error msg="Failed to collect all selectors for PID" error="workload attestor \"k8s\" failed: rpc error: code = Internal desc = workloadattestor(k8s): unable to perform request: Get \"https://127.0.0.1:10250/pods\": dial tcp 127.0.0.1:10250: connect: connection timed out" pid=3238263 subsystem_name=workload_attestor
The generation of the trust bundles doesn't fail, however. It includes the local trustBundle, but skips including the federated bundles.
There are 2 places in the where the following behavior is exhibited (here and here):
There's an explicit adding of the local trustDomain (OTK) to the bundle.
Then, a loop through the federated bundles, that only happens if an identity/SVID has been issued.
So what has happened is that due to the kubelet having a temporary issue (which can happen), an SVID hasn't been generated (so there is no "identity" for the workload, trigger the if statement to return false), and incomplete set of trust bundles is sent back to the istio-proxy.
Eventually, the kubelet heals.
Istio-proxy re-requests x509 cert generation, which succeed this time.
The istio-enabled pod starts, but with the incomplete set of trust bundles.
Istio-proxy never re-requests the trust bundles, because it thinks it has the correct response from spire-server.
The text was updated successfully, but these errors were encountered:
In the case of this spot in the code, for example, are there any valid reasons to skip adding the federated trust bundles just because update.HasIdentity() is false? The code in that code block has no dependency on the identity.
Or, could/should this entire composeX509BundlesResponse() function fail out if update.HasIdentity()is false? Either of these behaviors would also have prevented this problem.
Our setup consists of:
In this setup, we've encountered a race condition that results in the istio-proxy being sent an incomplete trust bundle.
Here's the specific sequence that reproduces the problem (which can be reliably reproduced with some node-level iptables hacks to prevent the spire-agent from talking to port 10250):
The text was updated successfully, but these errors were encountered: