Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proof of Concept] Polyester.@batch for threading #104

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

efaulhaber
Copy link
Contributor

@efaulhaber efaulhaber commented Jul 5, 2024

Have you considered using Polyester.jl for threading instead of Threads?
I always use the code in this PR when I want to do fair benchmarks of my implementations against CellListMap.jl. I get consistently better performance with this PR over main.

Here is a very cheap benchmark that is running one time step of an n-body simulation. This is the code:
https://github.com/trixi-framework/PointNeighbors.jl/blob/main/benchmarks/n_body.jl#L24-L33

This is on an AMD Ryzen Threadripper 3990X using 128 threads:
grafik
This is a minimum speedup of 1.6x and a speedup of 4.5x for the largest problem size.

Here is a real-life benchmark, running one particle interaction loop of the Weakly Compressible SPH method:
grafik
This is a speedup of 2x for most problem sizes.

@@ -158,8 +158,8 @@ function map_pairwise_serial!(
show_progress::Bool=false
) where {F,N,T}
p = show_progress ? Progress(length(cl.ref), dt=1) : nothing
for i in eachindex(cl.ref)
output = inner_loop!(f, output, i, box, cl)
@batch for i in eachindex(cl.ref)
Copy link
Member

@lmiq lmiq Jul 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This insertion of @batch is in the serial implementation, so I´m unsure if that was what you were willing to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the easiest way to try this out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a few tests I don't see the difference in performance you report. I might have to dig into your example to see.

One thing that might be important (I didn't look closely to your example) is the density of particles. My tests have the 10^5/(100^3) density, which is the atomic density of water, in the range of the applications I'm interested in.

The tradeoffs in other density ranges are different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the density relevant? Because of the cutoff? I used a cutoff of 3 times the spacing of a grid of particles. For more realistic benchmarks, I randomly perturbed the particle positions.

Copy link
Member

@lmiq lmiq Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course, in relation to the cutoff. A typical molecular dynamics simulation has a density of 10ˆ5/100ˆ3 particles, and use a cutoff of something of around 8 to 15, in the same units.

In some analsys of galaxy distributions (not my speciality), people use densities of 10ˆ5/20.274ˆ3 for a cutoff of 5.0. (much denser systems).

These define how many particles end up in each cell, and then what is worth doing in the inner loops that compute the interactions inside each cell or among neighbor cells. One parameter is the side of the cell relative to the cutoff (lcell in CellListMap), for very dense systems a cell side smalller than thte cutoff is beneficial. An in CellListMap y also project the particles in the vector connecting cell centers and then sort the projected distances to avoid even further computing spurious distances.

@lmiq
Copy link
Member

lmiq commented Jul 6, 2024

I´ve experimented with Polyerster a few times and, indeed, if the mapped loop is tight it is faster. However, I have the understanding that Polyester does not compose well if the function is called from within another threaded code. I think it will just fill the threads and compete for the resources. That's why I never stick with it.

Concerning the benchmark: is that using CellListMap? That was not clear to me.

One thing is that the way the inner loop of the map_pairwise CellListMap is implemented, which is very general for type of function being provided by the user, can certainly no be optimal in most specific cases. I have thought previously of exposing the inner loop as part of the interface, so the user could implement at will, possiblly taking advantage of vectorization if the function being mapped allows so. But for my applications that was not a priority, so it stayed as it is.

@efaulhaber
Copy link
Contributor Author

Concerning the benchmark: is that using CellListMap? That was not clear to me.

Yes, foreach_point_neighbor in the code I linked above just forwards to map_pairwise! in this case.

@lmiq
Copy link
Member

lmiq commented Jul 6, 2024

Now that I think, probably what @batch is doing there is adding @inbounds, @fastmath. I don't think that performance difference is associated to the threading per se.

@efaulhaber
Copy link
Contributor Author

But wouldn't we see a similar speedup on a single thread then?
main:

CellListMap.jl
with 10x10x10 = 1000 particles finished in 1.900 ms

CellListMap.jl
with 16x16x16 = 4096 particles finished in 9.661 ms

CellListMap.jl
with 25x25x25 = 15625 particles finished in 41.616 ms

CellListMap.jl
with 40x40x40 = 64000 particles finished in 184.198 ms

CellListMap.jl
with 63x63x63 = 250047 particles finished in 757.062 ms

CellListMap.jl
with 101x101x101 = 1030301 particles finished in 3.271 s

CellListMap.jl
with 160x160x160 = 4096000 particles finished in 13.483 s

Polyester.jl:

CellListMap.jl
with 10x10x10 = 1000 particles finished in 1.894 ms

CellListMap.jl
with 16x16x16 = 4096 particles finished in 9.653 ms

CellListMap.jl
with 25x25x25 = 15625 particles finished in 41.475 ms

CellListMap.jl
with 40x40x40 = 64000 particles finished in 183.926 ms

CellListMap.jl
with 63x63x63 = 250047 particles finished in 754.712 ms

CellListMap.jl
with 101x101x101 = 1030301 particles finished in 3.264 s

CellListMap.jl
with 160x160x160 = 4096000 particles finished in 13.458 s

@lmiq
Copy link
Member

lmiq commented Jul 8, 2024

I don´t see the difference in performance you report here, really. For example, I´m running the examples of this page: https://m3g.github.io/CellListMap.jl/stable/LowLevel/#Histogram-of-distances

The forces example, which is probably the most similar to what you are computing in your tests, results in:

Current stable CellListMap 0.9.4:

julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.576 ms (76 allocations: 13.74 MiB)

and with your PR:

julia> @btime map_pairwise!(
           $((x,y,i,j,d2,forces) -> calc_forces!(x,y,i,j,d2,mass,forces)),
           $forces,$box,$cl
       );
  34.173 ms (82 allocations: 13.75 MiB)

In other examples there might be a slight improvement in perfomance with @batch, but very minor. Maybe this is very dependent on the CPUs used?

Could you test one of these examples in your machine?

@lmiq
Copy link
Member

lmiq commented Jul 13, 2024

Offtopic: can you please send me the link of your talk at JuliaCon about this when it is available? Thanks.

@efaulhaber
Copy link
Contributor Author

https://www.youtube.com/live/V7FWl4YumcA?si=XIKhTj8dfmon0rwk&t=4669

But we only quickly mentioned the neighborhood search.

@efaulhaber
Copy link
Contributor Author

Back to the topic:
I was confused as well, since Polyester.@batch is actually slower in our code for large problems.
So I repeated the benchmark with Threads.@threads and Threads.@threads :static and found this:
grafik
So Threads.@threads :static is just as fast as Polyester.@batch, except for very small problems, exactly as I would expect.
Interestingly, slapping @threads :static to the serial loop is significantly faster than the nbatches approach in main. This is a 2x speedup for 1M particles and a 4.7x speedup for 65M particles in the cheap n-body benchmark.

But as you already noticed, the difference becomes smaller with more neighbors. In this example, I have a search radius of 3x the particle spacing, which gives me ~100 neighbors with ~3 particles per cell. When I increase the search radius to 10x the particle spacing, the speedup for 1M particles goes down from 2x to 1.11x.

@lmiq
Copy link
Member

lmiq commented Jul 15, 2024

Well, I´ll have to investigate what´s going on. The issue is that I don´t see these performance differences at all in the hardware I have, with my test functions. For instance, this is what I get with the polyester version:

julia> using CellListMap                                                                                                                                                                                                                                                    

julia> x, box = CellListMap.xatomic(10^4);                                                                                                                                                                                                                                  

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);                                                                                                                                                                     

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
18.444 ms (87 allocations: 16.14 KiB)
75140.7942005164
 

With current version:

julia> using CellListMap

julia> x, box = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(positions=x, unitcell=box.input_unit_cell.matrix, output=0.0, cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> out += inv(d2)), $sys)
  18.692 ms (87 allocations: 16.14 KiB)
77202.5687995922

@efaulhaber
Copy link
Contributor Author

This is not using the map_pairwise! taking a CellListPair, right? I only changed this definition.

@lmiq
Copy link
Member

lmiq commented Jul 15, 2024

Oh, right... sorry. But in that case I get the opposite result, the current version is faster:

with polyester:

julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  46.463 ms (43 allocations: 7.25 KiB)

julia> Threads.nthreads()
6

with current version:

julia> x, box = CellListMap.xatomic(10^4);

julia> y, _ = CellListMap.xatomic(10^4);

julia> sys = ParticleSystem(xpositions=x, ypositions=y, unitcell=2*box.input_unit_cell.matrix, output=zeros(length(x)), cutoff=12.0);

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  23.328 ms (93 allocations: 16.33 KiB)

So I'll really will have to dig further into your example to understand what's going on.

@efaulhaber
Copy link
Contributor Author

Your benchmark on a Threadripper 3990X:
map_pairwise_parallel!:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  8.930 ms (286 allocations: 54.41 KiB)

map_pairwise_serial! with @batch:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  6.020 ms (53 allocations: 8.59 KiB)

map_pairwise_serial! with Threads.@threads :static:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); return out end), $sys);
  3.438 ms (698 allocations: 132.58 KiB)

@lmiq
Copy link
Member

lmiq commented Jul 16, 2024

Well, definitelly not what I see here. This is another machine (a fairly standard one):

With @batch:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys); 
 63.056 ms (1 allocation: 448 bytes)

with current stable:

julia> @btime map_pairwise($((x,y,i,j,d2,out) -> begin out[i] += inv(d2); out end), $sys);
  21.881 ms (117 allocations: 19.95 KiB)

With:

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD EPYC 7282 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 8 default, 0 interactive, 4 GC (on 16 virtual cores)

(with only @threads :static in the serial loop probably the result is wrong, unless you did something to accumulate the result per thread and reduce later)

@efaulhaber
Copy link
Contributor Author

(with only @threads :static in the serial loop probably the result is wrong, unless you did something to accumulate the result per thread and reduce later)

I skipped the reduce in all versions and only returned 0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants