Race condition that causes envoy integration to result in an incomplete trust bundle #5638

dansimone · 2024-11-06T18:12:56Z

Version: main, latest (f99ecac), and at least as far back as 0.8.x
Subsystem: agent

Our setup consists of:

- Istio-enabled pods in a Kubernetes cluster
- Spire-service and spire-agents deployed
- Istio-proxy containers using the spire-agent integration: a workload socket connected to the local spire-agent
- On startup, istio-proxy requests x509 certs and a set of trust bundles from spire-agent

In this setup, we've encountered a race condition that results in the istio-proxy being sent an incomplete trust bundle.

Here's the specific sequence that reproduces the problem (which can be reliably reproduced with some node-level iptables hacks to prevent the spire-agent from talking to port 10250):

1. The istio-proxy container makes 2 independent calls to the spire-agent on startup: one for an x509 cert and one for the trust bundles. The trust bundles that we expect to be returned consist of the local Spire trustBundle plus a set of federated bundles registered in the spire-server.
2. The kubelet is momentarily unresponsive when these 2 calls happen, for whatever reason (this is what triggers the race condition).
3. The x509 cert generation flat out fails, as expected, because the kubelet can't be reached:

   spire-agent-0-4pzv2 spire-agent time="2024-11-04T14:33:02Z" level=error msg="Failed to collect all selectors for PID" error="workload attestor \"k8s\" failed: rpc error: code = Internal desc = workloadattestor(k8s): unable to perform request: Get \"https://127.0.0.1:10250/pods\": dial tcp 127.0.0.1:10250: connect: connection timed out" pid=3238263 subsystem_name=workload_attestor

4. The generation of the trust bundles doesn't fail, however. It includes the local trustBundle, but skips the federated bundles. There are 2 places in the code where the following behavior is exhibited (here and here):
   - The local trustDomain (OTK) is explicitly added to the bundle.
   - Then there is a loop through the federated bundles, which only runs if an identity/SVID has been issued.

   So, because the kubelet had a temporary issue (which can happen), no SVID has been generated. There is no "identity" for the workload, so the if condition evaluates to false and an incomplete set of trust bundles is sent back to the istio-proxy.
5. Eventually, the kubelet heals.
6. Istio-proxy re-requests x509 cert generation, which succeeds this time.
7. The istio-enabled pod starts, but with the incomplete set of trust bundles.
8. Istio-proxy never re-requests the trust bundles, because it thinks it has the correct response from spire-server.


dansimone · 2024-11-06T18:17:35Z

In the case of this spot in the code, for example, are there any valid reasons to skip adding the federated trust bundles just because update.HasIdentity() is false? The code in that code block has no dependency on the identity.

Or, could/should this entire composeX509BundlesResponse() function fail out if update.HasIdentity() is false? Either of these behaviors would have prevented this problem.

amartinezfayo · 2024-11-07T20:00:19Z

Thank you @dansimone for opening this issue.
Do you have the allow_unauthenticated_verifiers agent setting set to true?

dansimone · 2024-11-07T20:13:37Z

We have not set this explicitly anywhere, so it looks like that is defaulting to false.

But I do see this, which suggests that allow_unauthenticated_verifiers=false should prevent this situation, at least for that code path.

MarcosDY added the triage/in-progress label (Issue triage is in progress) · Nov 7, 2024
