
RSGF8 encoding API changes needed before parallelization can occur #13

Closed
evan-forbes opened this issue Jan 18, 2021 · 7 comments
Labels
bug Something isn't working


@evan-forbes
Member

evan-forbes commented Jan 18, 2021

The infectious library used for RSGF8 encoding uses a map as a cache, which breaks when used by multiple goroutines.

type rsGF8Codec struct {
	infectiousCache map[int]*infectious.FEC
}

func newRSGF8Codec() *rsGF8Codec {
	return &rsGF8Codec{make(map[int]*infectious.FEC)}
}
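For reference, a minimal, self-contained illustration of the failure mode (not taken from rsmt2d, just a generic demonstration that unsynchronized writes to a shared map are a data race in Go):

// Sketch only: several goroutines writing the same map concurrently is a
// data race, and the runtime can abort with "fatal error: concurrent map writes".
package main

import (
	"sync"

	"github.com/vivint/infectious"
)

func main() {
	cache := make(map[int]*infectious.FEC)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(k int) {
			defer wg.Done()
			fec, err := infectious.NewFEC(4, 8)
			if err != nil {
				panic(err)
			}
			cache[k] = fec // unsynchronized write to a shared map
		}(i)
	}
	wg.Wait()
}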

While investigating possible parallelization performance improvements, I solved this issue by adding this function and making a cache per goroutine.

func NewRSGF8Codec() Codec {
	return &rsGF8Codec{make(map[int]*infectious.FEC)}
}

While this works, it is likely not ideal, and I think this should instead be addressed with #12
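For illustration, the per-goroutine cache approach looks roughly like this in a caller (a sketch, not the actual benchmark code; extendRows, the rowChunks input, and the Encode signature are assumptions and need the sync package):

// Sketch only: each goroutine constructs its own codec, so each one gets
// its own infectious FEC cache and nothing is shared between goroutines.
func extendRows(rowChunks [][][]byte) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(rowChunks))
	for _, row := range rowChunks {
		wg.Add(1)
		go func(row [][]byte) {
			defer wg.Done()
			codec := NewRSGF8Codec() // fresh cache per goroutine
			if _, err := codec.Encode(row); err != nil {
				errs <- err
			}
		}(row)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if no goroutine reported an error
}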

There's also the option to simply not use RSGF8 and only use leopard, which is faster and doesn't require changes before being parallelized.

@musalbas
Member

Ideally we should drop RSGF8, but I'm not sure that's an option, as it would require devs to download and compile leopard in order to use the library, which makes deployment harder.

@liamsi
Member

liamsi commented Jan 19, 2021

Yeah, let's keep it around for now. We could either make it goroutine-safe by using a mutex on that map (would that already fix the breakage, @evan-forbes?), or we could accept that it is slower, drop that cache, and call NewFEC each time it is used.
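For reference, the mutex option would look roughly like this (a sketch; the cache key and the NewFEC(k, n) parameters below are assumptions, not necessarily the exact values rsmt2d uses):

// Sketch of a goroutine-safe cache: a sync.Mutex guards the map so
// concurrent lookups/inserts of *infectious.FEC instances don't race.
type rsGF8Codec struct {
	mu              sync.Mutex
	infectiousCache map[int]*infectious.FEC
}

func (c *rsGF8Codec) getFEC(chunks int) (*infectious.FEC, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if fec, ok := c.infectiousCache[chunks]; ok {
		return fec, nil
	}
	// k = original chunks, n = original + parity; exact values are an assumption here.
	fec, err := infectious.NewFEC(chunks, chunks*2)
	if err != nil {
		return nil, err
	}
	c.infectiousCache[chunks] = fec
	return fec, nil
}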

@evan-forbes
Member Author

evan-forbes commented Jan 19, 2021

Man, I am just not having any luck with bugs atm 😕 😅. I can no longer recreate the RSGF8 bug I was experiencing a month ago. All the old parallel code works fine without making any changes to rsmt2d; I even tried using older versions of go. If someone wants to check my work, it's on this branch and can be run with go test -cpu 16 -tags leopard -bench=BenchmarkParallelExtension. I'm going to close this issue; I guess I was just wrong the first time and we don't have to worry about this.

On the bright side, since the same caches could be used, parallelizing the erasure step is even faster for RSGF8.

@liamsi
Member

liamsi commented Jan 19, 2021

I've just tested your branch and it looks fine for me too.

Does this mean that Leopard and the infectious lib are more or less on par when parallelized? Or am I misreading these two:

BenchmarkParallelExtension/RSGF8_16_goroutines_size_32_(extended_=_64)_-16                   817           1485562 ns/op
...
BenchmarkParallelExtension/LeopardFF16_16_goroutines_size_32_(extended_=_64)_-16             729           1462394 ns/op

If I add the -race flag, though, the race detector reports multiple data races that do seem related to accessing the dataSquare fields (row-/colRoots) instead.

@evan-forbes
Member Author

evan-forbes commented Jan 19, 2021

That is what that means, at least on your machine. On my machine, I'm getting:

BenchmarkParallelExtension/RSGF8_16_goroutines_size_32_(extended_=_64)_-16                   476           2357681 ns/op
...
BenchmarkParallelExtension/LeopardFF16_16_goroutines_size_32_(extended_=_64)_-16             848           1327787 ns/op

In the older modified version of rsmt2d, I had added mutexes on ExtendedDataSquare. I'll go check; maybe those were causing the bug somehow. Nope, no difference, but it did get rid of the data races, so that's good!
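For context, the older branch did roughly this (a sketch; the field and method names are approximations, not the merged rsmt2d code):

// Sketch: a mutex on ExtendedDataSquare guarding the shared root slices,
// which is what silenced the -race reports on row-/colRoots.
type ExtendedDataSquare struct {
	mu       sync.Mutex
	rowRoots [][]byte
	colRoots [][]byte
	// other fields elided
}

func (eds *ExtendedDataSquare) setRowRoot(i int, root []byte) {
	eds.mu.Lock()
	defer eds.mu.Unlock()
	eds.rowRoots[i] = root
}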

@evan-forbes
Member Author

evan-forbes commented Apr 15, 2021

We're experiencing this issue again in #282

@evan-forbes evan-forbes reopened this Apr 15, 2021
@evan-forbes evan-forbes self-assigned this Apr 15, 2021
@evan-forbes evan-forbes added the bug Something isn't working label Apr 15, 2021
@liamsi
Member

liamsi commented Apr 15, 2021

Can this be closed now that #46 was merged?
