concurrent map iteration and map write panic #243

Open
toastwaffle opened this issue Feb 9, 2024 · 6 comments · May be fixed by #253
Labels: bug, needs:triage

Comments

@toastwaffle
Contributor

What happened?

fatal error: concurrent map iteration and map write

goroutine 169 [running]:
k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).DeepCopyInto(0xc00117dd60, 0xc000ffc4a0)
        k8s.io/[email protected]/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 +0x396
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopyInto(0xc00117dd40, 0xc000ffc480)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:369 +0xc5
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopy(...)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:380
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopyObject(0xc00117dd40?)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:386 +0x45
sigs.k8s.io/controller-runtime/pkg/cache/internal.(*CacheReader).Get(0xc000349900, {0xc000642b40?, 0x1f52e2e?}, {{0x0?, 0x1f4a136?}, {0xc000c05460?, 0x1ae2b79?}}, {0x23fc040, 0xc00140b680}, {0x0, ...})
        sigs.k8s.io/[email protected]/pkg/cache/internal/cache_reader.go:88 +0x145
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).Get(0xc0000109f0, {0x23e5740, 0xc000642b40}, {{0x0?, 0x40bd45?}, {0xc000c05460?, 0x1f0ac20?}}, {0x23fc040?, 0xc00140b680?}, {0x0, ...})
        sigs.k8s.io/[email protected]/pkg/cache/informer_cache.go:88 +0x1f9
sigs.k8s.io/controller-runtime/pkg/client.(*client).Get(0xc000448360, {0x23e5740, 0xc000642b40}, {{0x0?, 0x0?}, {0xc000c05460?, 0xc0004d4d80?}}, {0x23fc040?, 0xc00140b680?}, {0x0, ...})
        sigs.k8s.io/[email protected]/pkg/client/client.go:348 +0x4c4
github.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile(0xc0000cb770, {0x23e5778, 0xc000642900}, {{{0x0, 0x0}, {0xc000c05460, 0x1c}}})
        github.com/crossplane/[email protected]/pkg/reconciler/managed/reconciler.go:741 +0x2f9
github.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile(0xc0002b0d20, {0x23e5778, 0xc000642900}, {{{0x0?, 0x5?}, {0xc000c05460?, 0xc000dd5d48?}}})
        github.com/crossplane/[email protected]/pkg/ratelimiter/reconciler.go:54 +0x16b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x23e8c50?, {0x23e5778?, 0xc000642900?}, {{{0x0?, 0xb?}, {0xc000c05460?, 0x0?}}})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003daaa0, {0x23e56d0, 0xc00013dbd0}, {0x1d4a1c0?, 0xc0004551a0?})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 +0x3ca
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003daaa0, {0x23e56d0, 0xc00013dbd0})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x587

Full thread dump is here. I intend to do some debugging myself, but creating the issue now.

How can we reproduce it?

Absolutely no idea

What environment did it happen in?

  • Crossplane Version: v1.14.2-up.1
  • Provider Version: 0.14.1
  • Kubernetes Version: v1.26.7 (client), v1.26.13-gke.1144000 (server)
  • Kubernetes Distribution: GKE
@toastwaffle added the bug and needs:triage labels on Feb 9, 2024
@toastwaffle
Contributor Author

Okay, finally getting around to debugging this. I'm not sure how accurate the thread dump you get from a concurrent map iteration and map write is though.

k8s.io/[email protected]/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 is copying the Workspace annotations. I think 0xc00117dd40 is the address of the Workspace being copied, but that doesn't appear anywhere else in the thread dump.

It would appear that the controller work queue is only mostly deduplicated. I found some PRs:

I think the only thing we could do to fix this would be to disable the cache for Workspace objects (by adding Workspace to the DisableFor option in the client CacheOptions). I don't think that is particularly harmful (it means more Get calls to the API server), but equally I'm not totally sure it's worthwhile (as it doesn't solve the problem of multiple concurrent reconciles).
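
As a rough illustration only (not the provider's actual setup code), disabling the cache for Workspace objects would look roughly like this when constructing the controller-runtime manager, assuming controller-runtime v0.15 or newer where client.Options exposes Cache.DisableFor:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/upbound/provider-terraform/apis/v1beta1"
)

func main() {
	// Sketch only: scheme registration (the Workspace type must be added to
	// the manager's scheme), controller setup, and mgr.Start are omitted.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Bypass the shared informer cache for Workspace objects so
				// every client.Get goes straight to the API server.
				DisableFor: []client.Object{&v1beta1.Workspace{}},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr
}
```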

It's worth noting that while crossplane-runtime does configure the controller to recover from panics, concurrent map iteration and map write is not a panic; it's a fatal runtime error, which is unrecoverable.
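
A minimal standalone program, unrelated to the provider code, illustrates this: the runtime's map check aborts the whole process, and a deferred recover never intercepts it.

```go
package main

// Minimal sketch, not provider code: the Go runtime reports
// "concurrent map iteration and map write" via an internal throw, not a
// panic, so it cannot be recovered and the entire process terminates.
func main() {
	annotations := map[string]string{"a": "1", "b": "2"}

	go func() {
		defer func() {
			recover() // never helps: a runtime throw is not a panic
		}()
		for {
			annotations["key"] = "value" // concurrent writer
		}
	}()

	for {
		for k, v := range annotations { // concurrent iteration
			_, _ = k, v
		}
	}
}
```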

@toastwaffle linked a pull request on Apr 4, 2024 that will close this issue
@bobh66
Collaborator

bobh66 commented Apr 4, 2024

@toastwaffle Thanks for looking into this! Just to make sure I understand the scenario:

  • For whatever reason there are two entries in the queue for a specific Workspace instance
  • The first entry for the Workspace gets dequeued and starts processing
  • While the reconciler is processing the first one, the second one gets dequeued
  • The crossplane-runtime managed Reconciler calls client.Get to retrieve a copy of the same Workspace that is already being processed
  • If the first reconciliation happens to do a write while the second one is doing a read, the fatal error happens and the pod restarts

Are the read and the write operating on the same shared client cache?

It's unfortunate that it's a fatal error and not a panic - even two panics, one on each goroutine, would be preferable to an unrecoverable error.

@toastwaffle
Contributor Author

I'm suddenly doubting my hypothesis: if it were two concurrent reconciles of the same Workspace, the first must have DeepCopy'd the Workspace as part of the cached Get call, so any writes to the annotations should be on a different map than the one in the cache. That makes me wonder whether something is reading from the same indexer without doing the deep copy, but I don't know why such a thing would be writing to the annotations.
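
To spell that reasoning out: a deep copy gives the caller its own annotations map, so writes through the copy should not alias the map held in the cache. A small illustration using apimachinery's ObjectMeta (not provider code; the annotation key is just an example):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Stand-in for the object held by the informer cache.
	cached := &metav1.ObjectMeta{Annotations: map[string]string{"existing": "value"}}

	// A cached Get returns a deep copy, so the reconciler gets its own map.
	reconciled := cached.DeepCopy()
	reconciled.Annotations["crossplane.io/external-name"] = "example"

	fmt.Println(cached.Annotations)     // map[existing:value] - unchanged
	fmt.Println(reconciled.Annotations) // has both keys
}
```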

My only remaining hypotheses are that something is wrong with the DeepCopy implementation, or that something is wrong with the Golang concurrent iteration/write detection - both are exceedingly unlikely!

I'll think about this some more tomorrow to see if I can come up with anything better...

@toastwaffle
Contributor Author

Okay, I have not been able to come up with any other hypotheses. I'm going to try to run the provider under the race detector, but apparently "memory usage may increase by 5-10x and execution time by 2-20x" 😬
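
(For reference, the race detector is enabled by passing the -race flag to go build, go run, or go test; the overhead figures quoted above are from the Go race detector documentation.)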

@bobh66
Collaborator

bobh66 commented Jun 28, 2024

@toastwaffle Any update on this?

@toastwaffle
Contributor Author

Hi @bobh66, sorry for dropping the ball on this. Unfortunately other priorities took over, and I've since moved to a different team at my company which isn't using Crossplane. I've lost most of my permissions, but I can see from the logs I still have access to that it has happened 4 times in the past 2 weeks - I'll see if I can get somebody on my old team to look into this some more.
