[BUG] Data race if the provider is set a few times #307

Closed · erka opened this issue Nov 26, 2024 · 17 comments · Fixed by #312

Labels: bug (Something isn't working)

Comments

@erka commented Nov 26, 2024

Observed behavior

I have several tests with a similar setup: the provider is created, then configured, the tests are executed, and openfeature.Shutdown() is called for cleanup. However, when these tests are run with the race detector enabled, they consistently fail.

Expected Behavior

No response

Steps to reproduce

package openfeature_test

import (
	"context"
	"sync"
	"testing"

	of "github.com/open-feature/go-sdk/openfeature"
)

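// TestDataRace reproduces the report: set a provider, shut the API down,
// then set another provider. With the race detector enabled, the test fails.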
func TestDataRace(t *testing.T) {
	of.SetEvaluationContext(of.NewTargetlessEvaluationContext(map[string]any{}))

	err := of.SetProviderAndWait(newRaceProvider())
	if err != nil {
		t.Fatal(err)
	}

	of.Shutdown()

	err = of.SetProviderAndWait(newRaceProvider())
	if err != nil {
		t.Fatal(err)
	}
}

func newRaceProvider() of.FeatureProvider {
	return &raceProvider{
		state: of.NotReadyState,
	}
}

var (
	_ of.FeatureProvider = (*raceProvider)(nil) // ensure implements FeatureProvider
	_ of.StateHandler    = (*raceProvider)(nil) // ensure implements StateHandler
)

type raceProvider struct {
	state of.State
	mu    sync.RWMutex
}

func (p *raceProvider) Metadata() of.Metadata {
	return of.Metadata{
		Name: "racing",
	}
}

func (p *raceProvider) Status() of.State {
	p.mu.RLock()
	defer p.mu.RUnlock()

	return p.state
}

func (p *raceProvider) Init(evalCtx of.EvaluationContext) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = of.ReadyState
	return nil
}

func (p *raceProvider) Shutdown() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = of.NotReadyState
}

func (p *raceProvider) BooleanEvaluation(ctx context.Context, flag string, defaultValue bool, evalCtx of.FlattenedContext) of.BoolResolutionDetail {
	return of.BoolResolutionDetail{}
}

func (p *raceProvider) StringEvaluation(ctx context.Context, flag string, defaultValue string, evalCtx of.FlattenedContext) of.StringResolutionDetail {
	return of.StringResolutionDetail{}
}

func (p *raceProvider) FloatEvaluation(ctx context.Context, flag string, defaultValue float64, evalCtx of.FlattenedContext) of.FloatResolutionDetail {
	return of.FloatResolutionDetail{}
}

func (p *raceProvider) IntEvaluation(ctx context.Context, flag string, defaultValue int64, evalCtx of.FlattenedContext) of.IntResolutionDetail {
	return of.IntResolutionDetail{}
}

func (p *raceProvider) ObjectEvaluation(ctx context.Context, flag string, defaultValue interface{}, evalCtx of.FlattenedContext) of.InterfaceResolutionDetail {
	return of.InterfaceResolutionDetail{}
}

func (p *raceProvider) Hooks() []of.Hook {
	return []of.Hook{}
}

Save the test above to a file and run go test -race ./... to reproduce.

erka added the bug (Something isn't working) and Needs Triage labels on Nov 26, 2024
@beeme1mr (Member)

Hey @erka, could you please try the test provider and see if that resolves your issue?

@erka (Author) commented Nov 26, 2024

@beeme1mr I am trying to implement the provider and write some e2e tests for it. Using another provider doesn't help much. The issue is related to providers that implement the StateHandler interface.

@beeme1mr (Member)

In the next release, state management moves to the provider. This is a non-breaking change but may address this problem. FYI @toddbaert

@toddbaert (Member)

There have been substantial changes to the SDK recently, like @beeme1mr said, which are as yet unreleased.

The changes are significant enough that there's a chance this bug could be resolved by them. I would have liked to see them released already, but there are a few more issues we have to iron out first, namely the one you pointed out here: #296 (review)

@erka (Author) commented Dec 10, 2024

Just to keep this alive: release 1.14.0 doesn't fix the issue. go-sdk uses reflect.DeepEqual() to compare providers internally, and that probably doesn't play well when a provider contains a sync.Mutex.

@toddbaert (Member)

Just to keep this alive: release 1.14.0 doesn't fix the issue. go-sdk uses reflect.DeepEqual() to compare providers internally, and that probably doesn't play well when a provider contains a sync.Mutex.

Thanks for the update @erka! If you have some ideas for a fix, I'm open to that. Otherwise, we have some pressing issues internally, so I probably won't be able to dig into this until next week.

🙏

@blkt commented Dec 13, 2024

Hi @toddbaert, we're having the same issue with our project.
After looking into this, the problem seems to be that this invocation of Shutdown happens while the subsequent SetProviderAndWait is running.
https://github.com/open-feature/go-sdk/blob/main/openfeature/openfeature_api.go#L254-L256

I think that just running v.Shutdown() should solve the problem.

cc @rdimitrov
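
For illustration, here is a minimal sketch of that suggestion, assuming the SDK currently shuts the previous provider down in a goroutine (as in the snippet @toddbaert posts later in this thread); the function names are hypothetical, not the actual openfeature_api.go code:

// Hypothetical sketch only; not the actual SDK code.
package sketch

import of "github.com/open-feature/go-sdk/openfeature"

// shutdownOldAsync roughly mirrors the behaviour being discussed: the old
// provider is shut down in a goroutine, so it can still be running while a
// subsequent SetProviderAndWait compares and registers providers.
func shutdownOldAsync(old of.StateHandler) {
	go func(forShutdown of.StateHandler) {
		forShutdown.Shutdown()
	}(old)
}

// shutdownOldSync is the suggested change: shut the old provider down
// synchronously before the new one is registered.
func shutdownOldSync(old of.StateHandler) {
	old.Shutdown()
}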

@warber (Contributor) commented Dec 14, 2024

I think that just running v.Shutdown() should solve the problem.

@blkt I gave it a quick shot and it seems to fix the problem. At least all tests pass, even with the race detector enabled. Cool.

go-sdk uses reflect.DeepEqual() to compare providers internally, and that probably doesn't play well when a provider contains a sync.Mutex.

@erka yeah, using reflect.DeepEqual() to compare provider implementations is a problem...
Here are just some ideas:

  • Require the feature provider implementation to include a UniqueName() or Id() method that can be used internally to identify providers. Not an ideal solution either.
  • Serialize provider implementations into a format like JSON, which naturally excludes unexported fields and requires omitting problematic field types like sync.Mutex. However, this approach is also far from ideal.
  • Avoid directly using the FeatureProvider implementation within the SDK. Instead, create a providerReference wrapper during registration and assign it a self-generated identifier. The rest of the code would then work with providerReference values, enabling easy comparison by ID (a rough sketch follows this list). This approach seems the most promising; the effort strongly depends on how easy it is to change the code to stop using FeatureProvider implementations directly.
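
A rough sketch of that last option; providerReference, newProviderReference, and equal are illustrative names, not the SDK's API:

// Illustrative sketch of the providerReference idea; not SDK code.
package sketch

import (
	"sync/atomic"

	of "github.com/open-feature/go-sdk/openfeature"
)

var refIDs atomic.Uint64

// providerReference wraps a registered FeatureProvider together with a
// self-generated identifier, so the SDK can compare registrations by ID
// instead of running reflect.DeepEqual over the provider itself.
type providerReference struct {
	id       uint64
	provider of.FeatureProvider
}

func newProviderReference(p of.FeatureProvider) providerReference {
	return providerReference{id: refIDs.Add(1), provider: p}
}

// equal never touches the wrapped provider's fields.
func (r providerReference) equal(other providerReference) bool {
	return r.id == other.id
}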

@erka (Author) commented Dec 16, 2024

@warber We could add a kind field to providerReference and set it to reflect.TypeOf(FeatureProvider).Kind(). If the kind is ptr, we can compare the FeatureProvider values directly; for struct, we fall back to reflect.DeepEqual.
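
A rough sketch of that variant, using the illustrative name kindedReference rather than any actual SDK type:

// Illustrative sketch of kind-aware comparison; not SDK code.
package sketch

import (
	"reflect"

	of "github.com/open-feature/go-sdk/openfeature"
)

type kindedReference struct {
	provider of.FeatureProvider
	kind     reflect.Kind
}

func newKindedReference(p of.FeatureProvider) kindedReference {
	return kindedReference{provider: p, kind: reflect.TypeOf(p).Kind()}
}

// equal compares pointer providers by identity and only falls back to
// reflect.DeepEqual for struct (value) providers.
func (r kindedReference) equal(other kindedReference) bool {
	if r.kind == reflect.Ptr && other.kind == reflect.Ptr {
		return r.provider == other.provider
	}
	return reflect.DeepEqual(r.provider, other.provider)
}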

@toddbaert (Member)

@warber We could add a kind field to providerReference and set it to reflect.TypeOf(FeatureProvider).Kind(). If the kind is ptr, we can compare the FeatureProvider values directly; for struct, we fall back to reflect.DeepEqual.

I'll give this a shot this week unless I see a PR from somebody else first.

@toddbaert (Member) commented Dec 24, 2024

My findings so far:

  • In addition to just doing v.Shutdown() as @blkt suggested here, another simple (but inadequate, IMO) solution is to RLock the api mutex during shutdown:
	go func(forShutdown StateHandler) {
		api.mu.RLock()
		defer api.mu.RUnlock()
		forShutdown.Shutdown()
	}(v)
  • This is probably a bit better than just running shutdown synchronously, but it still means we can't register providers until the old provider has shut down, which is bad: we want shutdown not to block anything, since it exists for side effects such as flushing telemetry. This solution also hides the real issue, IMO...
  • The root problem of this entire issue, as far as I can tell, is that we use DeepEqual, which is a problem specifically because it reads unexported fields (in this case the state of @erka's provider), bypassing the mutex in that provider and causing the race (this is confirmed by the trace in the DATA RACE warning). I think this is a more general issue: we should not do it, or we risk a lot more of this sort of problem. (A minimal illustration follows this list.)
  • I think the only real solution is to find a safe way to compare providers without using DeepEqual at all, so that a provider's unexported fields are never read (perhaps the way @warber suggested in their last point here).
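
A minimal, self-contained illustration of that root cause; this is not the SDK's code, and whether -race actually fires can depend on scheduling, but the two accesses to state share no synchronization:

// Illustration only: reflect.DeepEqual reads unexported fields via
// reflection, bypassing the struct's own mutex. Run with: go test -race
package sketch

import (
	"reflect"
	"sync"
	"testing"
)

type guarded struct {
	mu    sync.RWMutex
	state int
}

func (g *guarded) set(v int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.state = v
}

func TestDeepEqualBypassesMutex(t *testing.T) {
	a, b := &guarded{}, &guarded{}
	done := make(chan struct{})
	go func() {
		a.set(1) // writes a.state (and a.mu's internals) under the lock
		close(done)
	}()
	_ = reflect.DeepEqual(a, b) // reads a.state without taking the lock
	<-done
}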

@toddbaert (Member) commented Dec 31, 2024

  • I think the only real solution is to find a safe way to compare providers without using DeepEqual at all, so that a provider's unexported fields are never read (perhaps the way @warber suggested in their last point here).

I'm going to give this a shot. Somebody stop me if you think I'm barking up the wrong tree.

@toddbaert (Member)

I didn't get to this, but @warber, I've assigned it to you as well. If you don't have time, please let me know and I'll pick it up.

@toddbaert (Member)

OK, @warber pointed out to me that this plan has other problems, namely that if the same instance is set twice, it's still not possible to compare them. 😩

Zooming out a bit, the real pain here is that we can't compare provider references reliably. This could be resolved another way, I think, by modifying our functions to require the provider to be a pointer (and never a struct). I think in most real-world cases this is what we'll see anyway, but sometimes people (especially in testing) use structs. This is why we had to start doing deepCompare in the first place.

If we simply require all args to setProvider and its friends to be pointers, we can do reliable reference comparisons between them. This would be a breaking change, but one that should be easy to work around. What do you folks think?
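
For the sake of discussion, a minimal sketch of what such an enforcement could look like; validateProvider is a hypothetical name, not SDK API:

// Illustrative sketch of a pointer-only requirement; not SDK code.
package sketch

import (
	"fmt"
	"reflect"

	of "github.com/open-feature/go-sdk/openfeature"
)

// validateProvider rejects value (struct) providers so that later comparisons
// can rely on pointer identity instead of reflect.DeepEqual.
func validateProvider(p of.FeatureProvider) error {
	if reflect.ValueOf(p).Kind() != reflect.Ptr {
		return fmt.Errorf("provider must be passed as a pointer, got %T", p)
	}
	return nil
}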

cc @erka @blkt @warber ?

@warber (Contributor) commented Jan 16, 2025

@erka actually provided a similar kind of idea here. So, by storing the kind as a field and comparing it to determine whether we're dealing with a pointer, we ensure it works reliably only with pointers or with structs that support deep comparison.

Ultimately, I'm unsure which approach is better: enforcing the use of pointers as input parameters at runtime or keeping it flexible and documenting the limitation. I'm slightly leaning towards the latter solution. 🤔

@toddbaert @blkt

@blkt commented Jan 16, 2025

From our perspective, this is an issue during tests only. In the application itself, we use SetProvider just once and are done with it, so I wouldn't mind a breaking change very much. Please @rdimitrov and @evankanderson keep me honest here.

My understanding of @erka's approach is that it would break provider implementors and protect normal users from the change. If that's the case, I agree it might be preferable to just switching to a pointer.

That said, my unsolicited opinion is that if provider shutdown is an important step, it should be synchronous regardless of its cost. I don't know whether that's an expensive operation, but I think that any optimization should be done inside the provider shutdown procedure, rather than outside.

@toddbaert (Member)

From our perspective, this is an issue during tests only. In the application itself, we use SetProvider just once and are done with it, so I wouldn't mind a breaking change very much. Please @rdimitrov and @evankanderson keep me honest here.

My understanding of @erka's approach is that it would break provider implementors and protect normal users from the change. If that's the case, I agree it might be preferable to just switching to a pointer.

That said, my unsolicited opinion is that if provider shutdown is an important step, it should be synchronous regardless of its cost. I don't know whether that's an expensive operation, but I think that any optimization should be done inside the provider shutdown procedure, rather than outside.

The issue is that provider shutdown can happen when a provider is no longer in use (for instance, when a new one has been set) and can, as the spec says, involve flushing data such as telemetry, which is often asynchronous. Waiting for a provider that's no longer in use to finish cleaning up just because a new one has been set seems like a problem to me: as a user, I don't really care if the old provider takes a while to shut down; I just want to use my new one and treat any clean-up of the old one as a low-priority side effect.

This is a bit different from when shutdown is triggered at the API level, which blocks. The use case in that situation is more like "my app is shutting down, please don't stop the process until my provider has shut down gracefully".
