Race condition that causes envoy integration to result in an incomplete trust bundle #5638

Open
dansimone opened this issue Nov 6, 2024 · 3 comments
Labels
triage/in-progress Issue triage is in progress

Comments

@dansimone

dansimone commented Nov 6, 2024

  • Version: main, latest (f99ecac), and at least as far back as 0.8.x
  • Subsystem: agent

Our setup consists of:

  • Istio-enabled pods in a Kubernetes cluster
  • spire-server and spire-agents deployed
  • Istio-proxy containers using spire-agent integration: a workload socket set up to its local spire-agent.
  • Istio-proxy requests on startup, from spire-agent, x509 certs and a set of trust bundles.

In this setup, we've encountered a race condition that results in the istio-proxy being sent an incomplete trust bundle.

Here's the specific sequence that reproduces the problem (which can be reliably reproduced with some node-level iptables hacks to prevent the spire-agent from talking to port 10250):

  • istio-proxy container makes 2 independent calls to the spire-agent on startup: for an x509 cert and for the trust bundles. The trust bundles that we expect to be returned consist of the local Spire trustBundle plus a set of federated bundles registered in the spire-server.
  • The kubelet is momentarily unresponsive, for whatever reason, when the 2 calls above happen (this is what triggers the race condition):
  • The x509 cert generation flat out fails, as expected, because the kubelet can't be reached
spire-agent-0-4pzv2 spire-agent time="2024-11-04T14:33:02Z" level=error msg="Failed to collect all selectors for PID" error="workload attestor \"k8s\" failed: rpc error: code = Internal desc = workloadattestor(k8s): unable to perform request: Get \"https://127.0.0.1:10250/pods\": dial tcp 127.0.0.1:10250: connect: connection timed out" pid=3238263 subsystem_name=workload_attestor
  • The generation of the trust bundles doesn't fail, however. It includes the local trustBundle, but skips including the federated bundles.
  • There are 2 places in the code where the following behavior is exhibited (here and here):
    • There's an explicit adding of the local trustDomain (OTK) to the bundle.
    • Then, a loop through the federated bundles, that only happens if an identity/SVID has been issued.
  • So what has happened is that, because the kubelet had a temporary issue (which can happen), no SVID was generated. The workload therefore has no "identity", the if statement evaluates to false, and an incomplete set of trust bundles is sent back to the istio-proxy.
  • Eventually, the kubelet heals.
  • Istio-proxy re-requests x509 cert generation, which succeeds this time.
  • The istio-enabled pod starts, but with the incomplete set of trust bundles.
  • Istio-proxy never re-requests the trust bundles, because it thinks it has the correct response from spire-server.
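The sequence above can be illustrated with a minimal Go sketch. It uses simplified stand-in types, not the actual SPIRE code: `workloadUpdate`, `composeBundles`, and the trust domain names are hypothetical and exist only to show how gating the federated bundles on an issued identity produces a partial response.

```go
package main

import "fmt"

// workloadUpdate is a hypothetical, simplified stand-in for the update the
// agent hands to its bundle-composing code. It is not the SPIRE type.
type workloadUpdate struct {
	LocalTrustDomain string            // trust domain of the local SPIRE deployment
	LocalBundle      []byte            // local trust bundle
	FederatedBundles map[string][]byte // federated bundles, keyed by trust domain
	Identities       []string          // SVIDs issued to the workload so far
}

// HasIdentity reports whether at least one SVID has been issued.
func (u workloadUpdate) HasIdentity() bool { return len(u.Identities) > 0 }

// composeBundles mirrors the behavior described in the issue: the local
// bundle is always added, the federated bundles only when an identity exists.
func composeBundles(u workloadUpdate) map[string][]byte {
	bundles := map[string][]byte{u.LocalTrustDomain: u.LocalBundle}
	if u.HasIdentity() {
		for td, b := range u.FederatedBundles {
			bundles[td] = b
		}
	}
	return bundles
}

func main() {
	u := workloadUpdate{
		LocalTrustDomain: "local.example",
		LocalBundle:      []byte("local-ca"),
		FederatedBundles: map[string][]byte{
			"partner-a.example": []byte("partner-a-ca"),
			"partner-b.example": []byte("partner-b-ca"),
		},
	}

	// Kubelet unreachable: attestation fails, so no SVID has been issued yet.
	fmt.Println(len(composeBundles(u))) // prints 1 (local bundle only)

	// Kubelet recovered: an SVID exists now, but the proxy never asks again.
	u.Identities = []string{"spiffe://local.example/ns/default/sa/app"}
	fmt.Println(len(composeBundles(u))) // prints 3
}
```

The first call models the response the istio-proxy keeps for its whole lifetime; the second shows what it would have received had it asked after the kubelet recovered.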
@dansimone
Author

In the case of this spot in the code, for example, are there any valid reasons to skip adding the federated trust bundles just because update.HasIdentity() is false? The code in that code block has no dependency on the identity.

Or, could/should this entire composeX509BundlesResponse() function fail if update.HasIdentity() is false? Either of these behaviors would also have prevented this problem.
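To make the two alternatives concrete, here is a hedged sketch of each, again with hypothetical stand-in types rather than the real SPIRE ones. `composeAlways` and `composeOrFail` are illustrative names, not existing functions, and the error value is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// workloadUpdate is a hypothetical, simplified stand-in, not the SPIRE type.
type workloadUpdate struct {
	LocalTrustDomain string
	LocalBundle      []byte
	FederatedBundles map[string][]byte
	Identities       []string
}

// HasIdentity reports whether at least one SVID has been issued.
func (u workloadUpdate) HasIdentity() bool { return len(u.Identities) > 0 }

// errNoIdentity is an assumed error used only for this illustration.
var errNoIdentity = errors.New("no identity issued")

// composeAlways is the first alternative: include the federated bundles
// regardless of whether an identity has been issued.
func composeAlways(u workloadUpdate) map[string][]byte {
	bundles := map[string][]byte{u.LocalTrustDomain: u.LocalBundle}
	for td, b := range u.FederatedBundles {
		bundles[td] = b
	}
	return bundles
}

// composeOrFail is the second alternative: refuse to answer at all when no
// identity exists, so the caller has to retry instead of caching a partial set.
func composeOrFail(u workloadUpdate) (map[string][]byte, error) {
	if !u.HasIdentity() {
		return nil, errNoIdentity
	}
	return composeAlways(u), nil
}

func main() {
	u := workloadUpdate{
		LocalTrustDomain: "local.example",
		LocalBundle:      []byte("local-ca"),
		FederatedBundles: map[string][]byte{"partner.example": []byte("partner-ca")},
	}

	fmt.Println(len(composeAlways(u))) // prints 2

	_, err := composeOrFail(u)
	fmt.Println(err) // prints "no identity issued"
}
```

Either variant avoids handing the proxy a response that looks complete but is not: the first returns the full set, the second returns an error the proxy can retry on.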

@MarcosDY added the triage/in-progress label Nov 7, 2024
@amartinezfayo
Member

Thank you @dansimone for opening this issue.
Do you have the allow_unauthenticated_verifiers agent setting set to true?

@dansimone
Author

dansimone commented Nov 7, 2024

We have not set this explicitly anywhere, so it looks like that is defaulting to false.

But I do see this which suggests that allow_unauthenticated_verifiers=false should prevent this situation, at least for that code path.
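A rough sketch of the kind of guard being referred to, under the assumption that a bundle request is rejected only when the setting is false and no identity has been issued. `checkBundleRequest` and `errNoIdentity` are hypothetical names for illustration, and whether the envoy code path applies an equivalent check is not established in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoIdentity is an assumed error used only for this illustration.
var errNoIdentity = errors.New("no identity issued")

// checkBundleRequest is a hypothetical guard: it rejects a bundle request
// when unauthenticated verifiers are not allowed and the caller has no SVID.
func checkBundleRequest(hasIdentity, allowUnauthenticatedVerifiers bool) error {
	if !allowUnauthenticatedVerifiers && !hasIdentity {
		return errNoIdentity
	}
	return nil
}

func main() {
	// Enumerate all four combinations of the setting and the identity state.
	for _, allow := range []bool{false, true} {
		for _, hasID := range []bool{false, true} {
			err := checkBundleRequest(hasID, allow)
			fmt.Printf("allow=%t hasIdentity=%t rejected=%t\n", allow, hasID, err != nil)
		}
	}
}
```

Only the combination allow=false, hasIdentity=false is rejected, which is the situation the kubelet outage creates with the default setting.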
