concurrent map iteration and map write panic #243

toastwaffle · 2024-02-09T16:37:23Z

What happened?

fatal error: concurrent map iteration and map write

goroutine 169 [running]:
k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).DeepCopyInto(0xc00117dd60, 0xc000ffc4a0)
        k8s.io/[email protected]/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 +0x396
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopyInto(0xc00117dd40, 0xc000ffc480)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:369 +0xc5
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopy(...)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:380
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopyObject(0xc00117dd40?)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:386 +0x45
sigs.k8s.io/controller-runtime/pkg/cache/internal.(*CacheReader).Get(0xc000349900, {0xc000642b40?, 0x1f52e2e?}, {{0x0?, 0x1f4a136?}, {0xc000c05460?, 0x1ae2b79?}}, {0x23fc040, 0xc00140b680}, {0x0, ...})
        sigs.k8s.io/[email protected]/pkg/cache/internal/cache_reader.go:88 +0x145
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).Get(0xc0000109f0, {0x23e5740, 0xc000642b40}, {{0x0?, 0x40bd45?}, {0xc000c05460?, 0x1f0ac20?}}, {0x23fc040?, 0xc00140b680?}, {0x0, ...})
        sigs.k8s.io/[email protected]/pkg/cache/informer_cache.go:88 +0x1f9
sigs.k8s.io/controller-runtime/pkg/client.(*client).Get(0xc000448360, {0x23e5740, 0xc000642b40}, {{0x0?, 0x0?}, {0xc000c05460?, 0xc0004d4d80?}}, {0x23fc040?, 0xc00140b680?}, {0x0, ...})
        sigs.k8s.io/[email protected]/pkg/client/client.go:348 +0x4c4
github.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile(0xc0000cb770, {0x23e5778, 0xc000642900}, {{{0x0, 0x0}, {0xc000c05460, 0x1c}}})
        github.com/crossplane/[email protected]/pkg/reconciler/managed/reconciler.go:741 +0x2f9
github.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile(0xc0002b0d20, {0x23e5778, 0xc000642900}, {{{0x0?, 0x5?}, {0xc000c05460?, 0xc000dd5d48?}}})
        github.com/crossplane/[email protected]/pkg/ratelimiter/reconciler.go:54 +0x16b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x23e8c50?, {0x23e5778?, 0xc000642900?}, {{{0x0?, 0xb?}, {0xc000c05460?, 0x0?}}})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003daaa0, {0x23e56d0, 0xc00013dbd0}, {0x1d4a1c0?, 0xc0004551a0?})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316 +0x3ca
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003daaa0, {0x23e56d0, 0xc00013dbd0})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x587

Full thread dump is here. I intend to do some debugging myself, but creating the issue now.

How can we reproduce it?

Absolutely no idea

What environment did it happen in?

Crossplane Version: v1.14.2-up.1
Provider Version: 0.14.1
Kubernetes Version: v1.26.7 client v1.26.13-gke.1144000 server
Kubernetes Distribution: GKE

The text was updated successfully, but these errors were encountered:

toastwaffle · 2024-04-04T12:32:39Z

Okay, finally getting around to debugging this. I'm not sure how accurate the thread dump you get from a concurrent map iteration and map write is though.

k8s.io/[email protected]/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 is copying the Workspace annotations. I think 0xc00117dd40 is the address of the Workspace being copied, but that doesn't appear anywhere else in the thread dump.

It would appear that the controller work queue is only mostly deduplicated. I found some PRs:

Add additional Delaying Queue methods which enable finer control over the delaying and de-duplication features kubernetes/kubernetes#112328 is the main one, but it's a more comprehensive change
Dedup workqueue queue and waiting kubernetes/kubernetes#123071 is an attempt to fix the specific issue
Suppress duplicates between FIFO and priority queue of waiting items kubernetes/client-go#1149 also attempted to fix the issue, but is in the wrong place

I think the only thing we could do to fix this would be to disable the cache for workspace objects (by adding Workspace to the DisableFor option in the client CacheOptions. I don't think that is particularly harmful (it means more Get calls to the API server), but equally I'm not totally sure it's worthwhile (as it doesn't solve the problem of multiple concurrent reconciles)

It's worth noting that while crossplane-runtime does configure the controller to recover from panics, concurrent map iteration and map write is not a panic, it's a runtime error which in unrecoverable.

bobh66 · 2024-04-04T15:18:23Z

@toastwaffle Thanks for looking into this! Just to make sure I understand the scenario:

for whatever reason there are two entries in the queue for a specific Workspace instance
The first entry for the Workspace gets dequeued and starts processing
while the reconciler is processing the first one the second one gets dequeued
the crossplane-runtime managed Reconciler calls client.Get to retrieve a copy of the same Workspace that is already being processed
if the first reconciliation happens to do a write while the second one is doing a read, the fatal error happens and the pod restarts

Is the read and write operating on the same shared client cache?

It's unfortunate that it's a fatal error and not a panic - even two panics, one on each goroutine, would be preferable to an unrecoverable error.

toastwaffle · 2024-04-04T16:58:27Z

I'm suddenly doubting my hypothesis - if it were 2 concurrent reconciles of the same Workspace, the first must have DeepCopy'd the Workspace as part of the cached Get call, and so any writes to the annotations should be on a different map than the one in the cache. Makes me wonder if there is something reading from the same indexer without doing the deepcopy, but I don't know why such a thing would be writing to the annotations.

My only remaining hypotheses are that something is wrong with the DeepCopy implementation, or that something is wrong with the Golang concurrent iteration/write detection - both are exceedingly unlikely!

I'll think about this some more tomorrow to see if I can come up with anything better...

toastwaffle · 2024-04-05T13:59:04Z

Okay, I have not been able to come up with any other hypotheses. I'm going to try to run the provider under the race detector, but apparently "memory usage may increase by 5-10x and execution time by 2-20x" 😬

bobh66 · 2024-06-28T22:20:41Z

@toastwaffle Any update on this?

toastwaffle · 2024-07-01T09:01:59Z

HI @bobh66, sorry for dropping the ball on this. Unfortunately other priorities took over, and I've since moved to a different team at my company which isn't using Crossplane. I've lost of my permissions, but I can see from the logs I still have access to that it has happened 4 times in the past 2 weeks - I'll see if I can get somebody on my old team to look into this some more.

toastwaffle added bug Something isn't working needs:triage labels Feb 9, 2024

toastwaffle linked a pull request Apr 4, 2024 that will close this issue

Disable the client cache for Workspace lookups #253

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

concurrent map iteration and map write panic #243

concurrent map iteration and map write panic #243

toastwaffle commented Feb 9, 2024

toastwaffle commented Apr 4, 2024

bobh66 commented Apr 4, 2024 •

edited

Loading

toastwaffle commented Apr 4, 2024

toastwaffle commented Apr 5, 2024

bobh66 commented Jun 28, 2024

toastwaffle commented Jul 1, 2024

concurrent map iteration and map write panic #243

concurrent map iteration and map write panic #243

Comments

toastwaffle commented Feb 9, 2024

What happened?

How can we reproduce it?

What environment did it happen in?

toastwaffle commented Apr 4, 2024

bobh66 commented Apr 4, 2024 • edited Loading

toastwaffle commented Apr 4, 2024

toastwaffle commented Apr 5, 2024

bobh66 commented Jun 28, 2024

toastwaffle commented Jul 1, 2024

bobh66 commented Apr 4, 2024 •

edited

Loading