Performance improvement for large scenes #6014
Replies: 4 comments 4 replies
-
Related: How big should bind groups be? Some programs have one giant bind group. Some programs have huge numbers of tiny bind groups. What's good practice here? Something in between? Where's a good discussion on this?
-
I want to call attention to something here:
There is, ultimately, no way around this process. You need to select the objects visible from each light to determine which ones are shadow casters that belong in the shadow map. There are plenty of strategies and techniques you can use to speed things up, usually with some tradeoff, but this core functionality needs to be there. Some thoughts:
Implementing a high-density world renderer is not what rend3 is currently architected to do, and in general it's not something a "rendering library" can do on its own. It's a huge endeavor with a lot of tradeoffs; IMO an engine like Unity would be just as bad here and require just as many custom solutions. The answer you would most likely get on the Unity forums is "use fewer shadow-casting lights, they are expensive". Really, you are asking for a college student's rendering library to implement a robust, heavy-duty renderer that most game engines do not include out of the box. For free. Note that no part of this solution needs to involve the bind groups or changing how the draw calls are made; it could be helpful to merge draw calls together, assuming you have no alpha-tested materials in the depth pass, but that's a much trickier idea.
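As a concrete illustration of that unavoidable per-light step, here is a minimal sketch of shadow-caster selection using bounding spheres against a point light's range. All types and names (`Object`, `PointLight`, `shadow_casters`) are illustrative stand-ins, not rend3's actual API:

```rust
// Illustrative sketch only; these are not rend3 types.
#[derive(Clone, Copy)]
struct Object {
    center: [f32; 3], // bounding-sphere center, world space
    radius: f32,      // bounding-sphere radius
}

#[derive(Clone, Copy)]
struct PointLight {
    position: [f32; 3],
    range: f32, // beyond this distance the light contributes no shadow
}

fn dist2(a: [f32; 3], b: [f32; 3]) -> f32 {
    (0..3).map(|i| (a[i] - b[i]) * (a[i] - b[i])).sum()
}

/// Select indices of objects whose bounding sphere intersects the
/// light's range: these are the potential shadow casters for this light.
/// This sweep has to run (in some form) for every shadow-casting light.
fn shadow_casters(objects: &[Object], light: &PointLight) -> Vec<usize> {
    objects
        .iter()
        .enumerate()
        .filter(|(_, o)| {
            let reach = light.range + o.radius;
            dist2(o.center, light.position) <= reach * reach
        })
        .map(|(i, _)| i)
        .collect()
}
```

A real implementation would cull against the light's frustum (or cube-map faces) rather than a plain sphere, but the shape of the loop is the same.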
-
@John-Nagle for some prior art here, and more folks who would be happy to discuss different techniques for better performance, check out the Bevy game engine and their Discord.
-
I know about Bevy. They don't have a fix for WGPU concurrency either. Currently, I can get the frame rate I want when not updating the world from other threads. But when the list of objects to draw is changing, the render thread slows down on lock delays. It's a known WGPU problem. WGPU only supports a single Vulkan queue.
-
I'm looking at some ideas for improving my performance for large scenes. I'd appreciate comments.
What does best practice for shadows atop WGPU look like?
What I have now is Rend3's shadow system. That approach looks at every object in the scene on every frame, throws out the ones outside the viewing frustum, computes a depth key for each remaining object, and sorts on the CPU. The process is repeated for each light. This is slow.
The part that finds the keys is slower than the sort, according to a Tracy profile. There are about three levels of dereferencing for each object, and this seems to cause many cache misses as the pass sweeps over all the objects. So I'm looking at replacing that algorithm with something less brute-force and with better cache locality.
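One common fix for exactly this kind of pointer-chasing is to keep the culling inputs in one flat, contiguous array, so the per-frame sweep touches sequential memory instead of dereferencing three levels per object. A minimal sketch, with illustrative names (`CullEntry`, `visible_sorted`) rather than rend3's real types:

```rust
// Illustrative sketch: flat per-object culling record, kept contiguous.
#[derive(Clone, Copy)]
struct CullEntry {
    object_id: u32,   // handle back into the full object storage
    center: [f32; 3], // bounding-sphere center, world space
    radius: f32,      // bounding-sphere radius
}

/// One linear sweep over the flat array: reject objects farther than
/// `max_dist` from the camera, compute a depth key for the survivors,
/// and sort by that key on the CPU. The sweep is cache-friendly because
/// every field it reads sits in the same contiguous slice.
fn visible_sorted(entries: &[CullEntry], cam: [f32; 3], max_dist: f32) -> Vec<(u32, f32)> {
    let mut out: Vec<(u32, f32)> = entries
        .iter()
        .filter_map(|e| {
            let d = (0..3)
                .map(|i| (e.center[i] - cam[i]) * (e.center[i] - cam[i]))
                .sum::<f32>()
                .sqrt();
            if d - e.radius <= max_dist {
                Some((e.object_id, d)) // depth key: distance to camera
            } else {
                None
            }
        })
        .collect();
    out.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    out
}
```

The tradeoff is that this flat array must be kept in sync whenever objects move or are added and removed, which is extra bookkeeping on the update path.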
Some of my thinking: an intermediate level of grouping between "object" and "everything" is needed, to avoid every-frame passes over every object. Right now, Rend3 is, I think, putting everything into one giant bind group for WGPU. So I'm thinking of adding the concept of "object groups" at the Rend3 level, which would probably be turned into bind groups at the WGPU level. What's best practice on bind group size? One big one? Lots of little ones? Medium-sized ones? Bindless?
At my application level, I have fixed tiles covering 64m^2 each, plus a "None" tile for moving objects. The fixed tiles can change, but once loaded, they don't change much. (This is a metaverse-type system: anything can change, but most things don't.) Between 16 and 96 tiles are visible at any one time, so dividing the objects into that many groups seems reasonable. The tiles provide locality info at a coarse level, and that locality can be helpful further down the stack.
In this way, shadows can be computed separately for tiles near the light and for the other tiles: build a "near" shadow map on every frame, build a "far" shadow map incrementally over many frames, and combine them on each frame. On average, this would compute about six tiles per frame (four near the light, the None tile, and one or two of the far tiles) instead of about 64 on every frame.
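The near/far schedule above could be sketched as a simple round-robin over the far tiles. All names here (`ShadowSchedule`, `tiles_this_frame`) are hypothetical, not rend3 or application API:

```rust
// Illustrative sketch of the near/far tile schedule: near tiles (and the
// "None" tile) get a fresh shadow pass every frame; far tiles are
// refreshed round-robin, a couple per frame, so the far shadow map
// converges over many frames.
struct ShadowSchedule {
    far_cursor: usize, // round-robin position within the far tile list
}

impl ShadowSchedule {
    fn new() -> Self {
        ShadowSchedule { far_cursor: 0 }
    }

    /// Tile indices whose shadow maps should be re-rendered this frame.
    fn tiles_this_frame(
        &mut self,
        near: &[usize], // near tiles + the None tile: refreshed every frame
        far: &[usize],  // far tiles: refreshed incrementally
        far_per_frame: usize,
    ) -> Vec<usize> {
        let mut out = near.to_vec();
        for _ in 0..far_per_frame.min(far.len()) {
            out.push(far[self.far_cursor % far.len()]);
            self.far_cursor += 1;
        }
        out
    }
}
```

With 5 near tiles and 2 far tiles per frame, a 59-tile far list is fully refreshed every 30 frames, which matches the "incrementally over many frames" idea; the open question is how stale a far-tile shadow can get before it's visually objectionable.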
How does this look?