Performance improvement for large scenes #6014
Replies: 4 comments 4 replies
-
Related: How big should bind groups be? Some programs have one giant bind group. Some programs have huge numbers of tiny bind groups. What's good practice here? Something in between? Where's a good discussion on this?
-
I want to call attention to something here:
There is, ultimately, no way around this process. You need to select the objects visible from each light to determine which ones are shadow casters that belong in the shadow map. There are plenty of strategies and techniques you can use to speed things up, usually with some tradeoff, but this core functionality needs to be there. Some thoughts:
Implementing a high-density world renderer is not what rend3 is currently architected to do, and in general it's not something a "rendering library" can do on its own. It's a huge endeavor with a lot of tradeoffs; IMO an engine like Unity would be just as bad here and require just as many custom solutions. The answer you would most likely get on the Unity forums is "use fewer shadow-casting lights, they are expensive". Really, you are asking for a college student's rendering library to implement a robust, heavy-duty renderer that most game engines do not include out of the box. For free. Note that no part of this solution needs to involve the bind groups or changing how the draw calls are made; it could be helpful to merge draw calls together, assuming you have no alpha-tested materials in the depth pass, but that's a much trickier idea.
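As a concrete illustration of that unavoidable per-light step, here is a minimal sketch of shadow-caster selection using bounding spheres against a point light's range. All types and names (`Object`, `PointLight`, `shadow_casters`) are illustrative stand-ins, not rend3's actual API:

```rust
// Illustrative sketch only; these are not rend3 types.
#[derive(Clone, Copy)]
struct Object {
    center: [f32; 3], // bounding-sphere center, world space
    radius: f32,      // bounding-sphere radius
}

#[derive(Clone, Copy)]
struct PointLight {
    position: [f32; 3],
    range: f32, // beyond this distance the light contributes no shadow
}

fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// Select indices of objects whose bounding sphere intersects the
/// light's range: these are the potential shadow casters for this light.
/// This sweep has to run (in some form) for every shadow-casting light.
fn shadow_casters(objects: &[Object], light: &PointLight) -> Vec<usize> {
    objects
        .iter()
        .enumerate()
        .filter(|(_, o)| {
            let reach = light.range + o.radius;
            dist2(o.center, light.position) <= reach * reach
        })
        .map(|(i, _)| i)
        .collect()
}
```

A real implementation would cull against the light's frustum (or cube-map faces) rather than a plain sphere, but the shape of the loop is the same.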
-
@John-Nagle for some prior art here, and more folks who would be happy to discuss different techniques for better performance, check out the Bevy game engine and their Discord.
-
I know about Bevy. They don't have a fix for WGPU concurrency either. Currently, I can get the frame rate I want when not updating the world from other threads. But when the list of objects to draw is changing, the render thread slows down on lock delays. It's a known WGPU problem. WGPU only supports a single Vulkan queue.
-
I'm looking at some ideas for improving my performance for large scenes. I'd appreciate comments.
What does best practice for shadows atop WGPU look like?
What I have now is Rend3's shadow system. That approach looks at every object in the scene on every frame, throws out the ones outside the viewing frustum, computes a depth key for each remaining object, and sorts on the CPU. The process is repeated for each light. This is slow.
The part that finds the keys is slower than the sort, according to a Tracy profile. There are about three levels of dereferencing for each object, and this seems to cause many cache misses as the pass sweeps over all the objects. So I'm looking at replacing that algorithm with something less brute-force and with better cache locality.
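One common fix for exactly this kind of pointer-chasing is to keep the culling inputs in one flat, contiguous array, so the per-frame sweep touches sequential memory instead of dereferencing three levels per object. A minimal sketch, with illustrative names (`CullEntry`, `visible_sorted`) rather than rend3's real types:

```rust
// Illustrative sketch: flat per-object culling record, kept contiguous.
#[derive(Clone, Copy)]
struct CullEntry {
    object_id: u32,   // handle back into the full object storage
    center: [f32; 3], // bounding-sphere center, world space
    radius: f32,      // bounding-sphere radius
}

/// One linear sweep over the flat array: reject objects farther than
/// `max_dist` from the camera, compute a depth key for the survivors,
/// and sort by that key on the CPU. The sweep is cache-friendly because
/// every field it reads sits in the same contiguous slice.
fn visible_sorted(entries: &[CullEntry], cam: [f32; 3], max_dist: f32) -> Vec<(u32, f32)> {
    let mut out: Vec<(u32, f32)> = entries
        .iter()
        .filter_map(|e| {
            let d = (0..3)
                .map(|i| (e.center[i] - cam[i]) * (e.center[i] - cam[i]))
                .sum::<f32>()
                .sqrt();
            if d - e.radius <= max_dist {
                Some((e.object_id, d)) // depth key: distance to camera
            } else {
                None
            }
        })
        .collect();
    out.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    out
}
```

The tradeoff is that this flat array must be kept in sync whenever objects move or are added and removed, which is extra bookkeeping on the update path.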
Some of my thinking: an intermediate level of grouping between "object" and "everything" is needed, to avoid every-frame passes over every object. Right now, Rend3 is, I think, putting everything into one giant bind group for WGPU. So I'm thinking of adding the concept of "object groups" at the Rend3 level, which would probably be turned into bind groups at the WGPU level. What's best practice on bind group size? One big one? Lots of little ones? Medium-sized ones? Bindless?
At my application level, I have fixed tiles covering 64m^2 each, plus a "None" tile for moving objects. The fixed tiles can change, but once loaded, they don't change much. (This is a metaverse-type system: anything can change, but most things don't.) Between 16 and 96 tiles are visible at any one time, so dividing the objects into that many groups seems reasonable. The tiles provide locality info at a coarse level, and that locality can be helpful further down the stack.
In this way, shadows can be computed separately for tiles near the light and for the other tiles: build a "near" shadow map on every frame, build a "far" shadow map incrementally over many frames, and combine them on each frame. On average, this would compute about six tiles per frame (four near the light, the None tile, and one or two of the far tiles) instead of about 64 on every frame.
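The near/far schedule above could be sketched as a simple round-robin over the far tiles. All names here (`ShadowSchedule`, `tiles_this_frame`) are hypothetical, not rend3 or application API:

```rust
// Illustrative sketch of the near/far tile schedule: near tiles (and the
// "None" tile) get a fresh shadow pass every frame; far tiles are
// refreshed round-robin, a couple per frame, so the far shadow map
// converges over many frames.
struct ShadowSchedule {
    far_cursor: usize, // round-robin position within the far tile list
}

impl ShadowSchedule {
    fn new() -> Self {
        ShadowSchedule { far_cursor: 0 }
    }

    /// Tile indices whose shadow maps should be re-rendered this frame.
    fn tiles_this_frame(
        &mut self,
        near: &[usize], // near tiles + the None tile: refreshed every frame
        far: &[usize],  // far tiles: refreshed incrementally
        far_per_frame: usize,
    ) -> Vec<usize> {
        let mut out = near.to_vec();
        for _ in 0..far_per_frame.min(far.len()) {
            out.push(far[self.far_cursor % far.len()]);
            self.far_cursor += 1;
        }
        out
    }
}
```

With 5 near tiles and 2 far tiles per frame, a 59-tile far list is fully refreshed every 30 frames, which matches the "incrementally over many frames" idea; the open question is how stale a far-tile shadow can get before it's visually objectionable.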
How does this look?