[BUG] Data race if the provider is set a few times #307

Closed · erka opened this issue Nov 26, 2024 · 17 comments · Fixed by #312

Labels: bug (Something isn't working)

Comments

@erka commented Nov 26, 2024

Observed behavior

I have several tests with a similar setup: the provider is created, then configured, the tests are executed, and openfeature.Shutdown() is called for cleanup. However, when these tests are run with the race detector enabled, they consistently fail.

Expected Behavior

No response

Steps to reproduce

package openfeature_test

import (
	"context"
	"sync"
	"testing"

	of "github.com/open-feature/go-sdk/openfeature"
)

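// TestDataRace reproduces the report: set a provider, shut the API down,
// then set another provider. With the race detector enabled, the test fails.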
func TestDataRace(t *testing.T) {
	of.SetEvaluationContext(of.NewTargetlessEvaluationContext(map[string]any{}))

	err := of.SetProviderAndWait(newRaceProvider())
	if err != nil {
		t.Fatal(err)
	}

	of.Shutdown()

	err = of.SetProviderAndWait(newRaceProvider())
	if err != nil {
		t.Fatal(err)
	}
}

func newRaceProvider() of.FeatureProvider {
	return &raceProvider{
		state: of.NotReadyState,
	}
}

var (
	_ of.FeatureProvider = (*raceProvider)(nil) // ensure implements FeatureProvider
	_ of.StateHandler    = (*raceProvider)(nil) // ensure implements StateHandler
)

type raceProvider struct {
	state of.State
	mu    sync.RWMutex
}

func (p *raceProvider) Metadata() of.Metadata {
	return of.Metadata{
		Name: "racing",
	}
}

func (p *raceProvider) Status() of.State {
	p.mu.RLock()
	defer p.mu.RUnlock()

	return p.state
}

func (p *raceProvider) Init(evalCtx of.EvaluationContext) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = of.ReadyState
	return nil
}

func (p *raceProvider) Shutdown() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = of.NotReadyState
}

func (p *raceProvider) BooleanEvaluation(ctx context.Context, flag string, defaultValue bool, evalCtx of.FlattenedContext) of.BoolResolutionDetail {
	return of.BoolResolutionDetail{}
}

func (p *raceProvider) StringEvaluation(ctx context.Context, flag string, defaultValue string, evalCtx of.FlattenedContext) of.StringResolutionDetail {
	return of.StringResolutionDetail{}
}

func (p *raceProvider) FloatEvaluation(ctx context.Context, flag string, defaultValue float64, evalCtx of.FlattenedContext) of.FloatResolutionDetail {
	return of.FloatResolutionDetail{}
}

func (p *raceProvider) IntEvaluation(ctx context.Context, flag string, defaultValue int64, evalCtx of.FlattenedContext) of.IntResolutionDetail {
	return of.IntResolutionDetail{}
}

func (p *raceProvider) ObjectEvaluation(ctx context.Context, flag string, defaultValue interface{}, evalCtx of.FlattenedContext) of.InterfaceResolutionDetail {
	return of.InterfaceResolutionDetail{}
}

func (p *raceProvider) Hooks() []of.Hook {
	return []of.Hook{}
}

Save the test above to a file and run go test -race ./... to reproduce.

erka added the bug (Something isn't working) and Needs Triage labels on Nov 26, 2024
@beeme1mr (Member)

Hey @erka, could you please try the test provider and see if that resolves your issue?

@erka (Author) commented Nov 26, 2024

@beeme1mr I am trying to implement the provider and write some e2e tests for it. Using another provider doesn't help much. The issue is related to providers that implement the StateHandler interface.

@beeme1mr (Member)

In the next release, state management moves to the provider. This is a non-breaking change but may address this problem. FYI @toddbaert

@toddbaert (Member)

There have been substantial changes to the SDK recently, like @beeme1mr said, which are as yet unreleased.

The changes are significant enough that there's a chance this bug could be resolved by them. I would have liked to see them released already, but there are a few more issues we have to iron out first, namely the one you pointed out here: #296 (review)

@erka (Author) commented Dec 10, 2024

Just to keep this alive: release 1.14.0 doesn't fix the issue. go-sdk uses reflect.DeepEqual() to compare providers internally, and that probably doesn't play well when a provider contains a sync.Mutex.

@toddbaert (Member)

Just to keep this alive: release 1.14.0 doesn't fix the issue. go-sdk uses reflect.DeepEqual() to compare providers internally, and that probably doesn't play well when a provider contains a sync.Mutex.

Thanks for the update @erka! If you have some ideas for a fix, I'm open to that. Otherwise, we have some pressing issues internally, so I probably won't be able to dig into this until next week.

🙏

@blkt commented Dec 13, 2024

Hi @toddbaert, we're having the same issue with our project.
After looking into this, the problem seems to be that this invocation of Shutdown happens while the subsequent SetProviderAndWait is running.
https://github.com/open-feature/go-sdk/blob/main/openfeature/openfeature_api.go#L254-L256

I think that just running v.Shutdown() should solve the problem.

cc @rdimitrov
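
For illustration, here is a minimal sketch of that suggestion, assuming the SDK currently shuts the previous provider down in a goroutine (as in the snippet @toddbaert posts later in this thread); the function names are hypothetical, not the actual openfeature_api.go code:

// Hypothetical sketch only; not the actual SDK code.
package sketch

import of "github.com/open-feature/go-sdk/openfeature"

// shutdownOldAsync roughly mirrors the behaviour being discussed: the old
// provider is shut down in a goroutine, so it can still be running while a
// subsequent SetProviderAndWait compares and registers providers.
func shutdownOldAsync(old of.StateHandler) {
	go func(forShutdown of.StateHandler) {
		forShutdown.Shutdown()
	}(old)
}

// shutdownOldSync is the suggested change: shut the old provider down
// synchronously before the new one is registered.
func shutdownOldSync(old of.StateHandler) {
	old.Shutdown()
}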

@warber (Contributor) commented Dec 14, 2024

I think that just running v.Shutdown() should solve the problem.

@blkt I gave it a quick shot and it seems to fix the problem. At least all tests pass, even with the race detector enabled. Cool.

go-sdk uses reflect.DeepEqual() to compare providers internally, and that probably doesn't play well when a provider contains a sync.Mutex.

@erka yeah, using reflect.DeepEqual() to compare provider implementations is a problem...
Here are just some ideas:

  • Require the feature provider implementation to include a UniqueName() or Id() method that can be used internally to identify providers. Not an ideal solution either.
  • Serialize provider implementations into a format like JSON, which naturally excludes unexported fields and requires omitting problematic field types like sync.Mutex. However, this approach is also far from ideal.
  • Avoid directly using the FeatureProvider implementation within the SDK. Instead, create a providerReference wrapper during registration and assign it a self-generated identifier. The rest of the code would then work with providerReference values, enabling easy comparison by ID (a rough sketch follows this list). This approach seems the most promising; the effort strongly depends on how easy it is to change the code to stop using FeatureProvider implementations directly.
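
A rough sketch of that last option; providerReference, newProviderReference, and equal are illustrative names, not the SDK's API:

// Illustrative sketch of the providerReference idea; not SDK code.
package sketch

import (
	"sync/atomic"

	of "github.com/open-feature/go-sdk/openfeature"
)

var refIDs atomic.Uint64

// providerReference wraps a registered FeatureProvider together with a
// self-generated identifier, so the SDK can compare registrations by ID
// instead of running reflect.DeepEqual over the provider itself.
type providerReference struct {
	id       uint64
	provider of.FeatureProvider
}

func newProviderReference(p of.FeatureProvider) providerReference {
	return providerReference{id: refIDs.Add(1), provider: p}
}

// equal never touches the wrapped provider's fields.
func (r providerReference) equal(other providerReference) bool {
	return r.id == other.id
}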

@erka (Author) commented Dec 16, 2024

@warber We could add a kind field to providerReference and set it to reflect.TypeOf(FeatureProvider).Kind(). If the kind is ptr, we can compare the FeatureProvider values directly; for struct, we fall back to reflect.DeepEqual.
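
A rough sketch of that variant, using the illustrative name kindedReference rather than any actual SDK type:

// Illustrative sketch of kind-aware comparison; not SDK code.
package sketch

import (
	"reflect"

	of "github.com/open-feature/go-sdk/openfeature"
)

type kindedReference struct {
	provider of.FeatureProvider
	kind     reflect.Kind
}

func newKindedReference(p of.FeatureProvider) kindedReference {
	return kindedReference{provider: p, kind: reflect.TypeOf(p).Kind()}
}

// equal compares pointer providers by identity and only falls back to
// reflect.DeepEqual for struct (value) providers.
func (r kindedReference) equal(other kindedReference) bool {
	if r.kind == reflect.Ptr && other.kind == reflect.Ptr {
		return r.provider == other.provider
	}
	return reflect.DeepEqual(r.provider, other.provider)
}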

@toddbaert (Member)

@warber We could add a kind field to providerReference and set it to reflect.TypeOf(FeatureProvider).Kind(). If the kind is ptr, we can compare the FeatureProvider values directly; for struct, we fall back to reflect.DeepEqual.

I'll give this a shot this week unless I see a PR from somebody else first.

@toddbaert (Member) commented Dec 24, 2024

My findings so far:

  • In addition to just doing v.Shutdown() as @blkt suggested here, another simple (but inadequate, IMO) solution is to RLock the api mutex during shutdown:
	go func(forShutdown StateHandler) {
		api.mu.RLock()
		defer api.mu.RUnlock()
		forShutdown.Shutdown()
	}(v)
  • This is probably a bit better than just running shutdown synchronously, but it still means we can't register providers until the old provider has shut down, which is bad: we want shutdown not to block anything, since it exists for side effects such as flushing telemetry. This solution also hides the real issue, IMO...
  • The root problem of this entire issue, as far as I can tell, is that we use DeepEqual, which is a problem specifically because it reads unexported fields (in this case the state of @erka's provider), bypassing the mutex in that provider and causing the race (this is confirmed by the trace in the DATA RACE warning). I think this is a more general issue: we should not do it, or we risk a lot more of this sort of problem. (A minimal illustration follows this list.)
  • I think the only real solution is to find a safe way to compare providers without using DeepEqual at all, so that a provider's unexported fields are never read (perhaps the way @warber suggested in their last point here).
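
A minimal, self-contained illustration of that root cause; this is not the SDK's code, and whether -race actually fires can depend on scheduling, but the two accesses to state share no synchronization:

// Illustration only: reflect.DeepEqual reads unexported fields via
// reflection, bypassing the struct's own mutex. Run with: go test -race
package sketch

import (
	"reflect"
	"sync"
	"testing"
)

type guarded struct {
	mu    sync.RWMutex
	state int
}

func (g *guarded) set(v int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.state = v
}

func TestDeepEqualBypassesMutex(t *testing.T) {
	a, b := &guarded{}, &guarded{}
	done := make(chan struct{})
	go func() {
		a.set(1) // writes a.state (and a.mu's internals) under the lock
		close(done)
	}()
	_ = reflect.DeepEqual(a, b) // reads a.state without taking the lock
	<-done
}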

@toddbaert (Member) commented Dec 31, 2024

  • I think the only real solution is to find a safe way to compare providers without using DeepEqual at all, so that a provider's unexported fields are never read (perhaps the way @warber suggested in their last point here).

I'm going to give this a shot. Somebody stop me if you think I'm barking up the wrong tree.

@toddbaert (Member)

I didn't get to this, but @warber, I've assigned it to you as well. If you don't have time, please let me know and I'll pick it up.

@toddbaert (Member)

OK, @warber pointed out to me that this plan has other problems, namely that if the same instance is set twice, it's still not possible to compare them. 😩

Zooming out a bit, the real pain here is that we can't compare provider references reliably. This could be resolved another way, I think, by modifying our functions to require the provider to be a pointer (and never a struct). I think in most real-world cases this is what we'll see anyway, but sometimes people (especially in testing) use structs. This is why we had to start doing deepCompare in the first place.

If we simply require all args to setProvider and its friends to be pointers, we can do reliable reference comparisons between them. This would be a breaking change, but one that should be easy to work around. What do you folks think?
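
For the sake of discussion, a minimal sketch of what such an enforcement could look like; validateProvider is a hypothetical name, not SDK API:

// Illustrative sketch of a pointer-only requirement; not SDK code.
package sketch

import (
	"fmt"
	"reflect"

	of "github.com/open-feature/go-sdk/openfeature"
)

// validateProvider rejects value (struct) providers so that later comparisons
// can rely on pointer identity instead of reflect.DeepEqual.
func validateProvider(p of.FeatureProvider) error {
	if reflect.ValueOf(p).Kind() != reflect.Ptr {
		return fmt.Errorf("provider must be passed as a pointer, got %T", p)
	}
	return nil
}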

cc @erka @blkt @warber ?

@warber (Contributor) commented Jan 16, 2025

@erka actually provided a similar kind of idea here. So, by storing the kind as a field and comparing it to determine whether we're dealing with a pointer, we ensure it works reliably only with pointers or with structs that support deep comparison.

Ultimately, I'm unsure which approach is better: enforcing the use of pointers as input parameters at runtime or keeping it flexible and documenting the limitation. I'm slightly leaning towards the latter solution. 🤔

@toddbaert @blkt

@blkt commented Jan 16, 2025

From our perspective, this is an issue during tests only. In the application itself, we use SetProvider just once and are done with it, so I wouldn't mind a breaking change very much. Please @rdimitrov and @evankanderson keep me honest here.

My understanding of @erka's approach is that it would break provider implementors and protect normal users from the change. If that's the case, I agree it might be preferable to just switching to a pointer.

That said, my unsolicited opinion is that if provider shutdown is an important step, it should be synchronous regardless of its cost. I don't know whether that's an expensive operation, but I think that any optimization should be done inside the provider shutdown procedure, rather than outside.

@toddbaert (Member)

From our perspective, this is an issue during tests only. In the application itself, we use SetProvider just once and are done with it, so I wouldn't mind a breaking change very much. Please @rdimitrov and @evankanderson keep me honest here.

My understanding of @erka's approach is that it would break provider implementors and protect normal users from the change. If that's the case, I agree it might be preferable to just switching to a pointer.

That said, my unsolicited opinion is that if provider shutdown is an important step, it should be synchronous regardless of its cost. I don't know whether that's an expensive operation, but I think that any optimization should be done inside the provider shutdown procedure, rather than outside.

The issue is that provider shutdown can happen when a provider is no longer in use (for instance, when a new one has been set) and can, as the spec says, involve flushing data such as telemetry, which is often asynchronous. Waiting for a provider that's no longer in use to finish cleaning up just because a new one has been set seems like a problem to me: as a user, I don't really care if the old provider takes a while to shut down; I just want to use my new one and treat any clean-up of the old one as a low-priority side effect.

This is a bit different from when shutdown is triggered at the API level, which blocks. The use case in that situation is more like "my app is shutting down, please don't stop the process until my provider has shut down gracefully".
