bdwhst/Vulcanite

Vulkanite

A Vulkan Implementation of Nanite in Unreal Engine 5

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Final Project

@Han Yang (LinkedIn | Personal Website) and @Yian Chen (LinkedIn | Personal Website)

Dependencies

  • metis
  • OpenMesh
  • Vulkan

Build

So far, we have only tested on Windows.

For Windows users, we recommend installing the dependencies with vcpkg.

Once all dependencies are installed through vcpkg, the project should configure and run without further setup.

We recommend building the project with a script file like this:

@echo off
rem Change this to your own vcpkg directory.
set "VCPKG_PATH=C:\vcpkg"

rem If build\ already contains anything, clean it before configuring.
for /F %%i in ('dir /b /a ".\build\*"') do (
    echo Build folder is NOT empty
    goto clean_build
)
goto make_build

:clean_build
echo Cleaning build files
del /q build\*
goto start_build

:make_build
echo Making build directory
if not exist build mkdir build
goto start_build

:start_build
echo Start building
cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE="%VCPKG_PATH%\scripts\buildsystems\vcpkg.cmake"
cd ..

Features Implemented

  • GPU-Driven View Frustum Culling and Occlusion Culling
  • BVH Cluster Culling
  • Nanite Builder
    • DAG Builder
  • Multiple Instances of Multiple Nanite Meshes
  • Visibility Buffer Based rendering
  • Mixed-mode Rasterizer

Overview

This project is deeply inspired by Unreal Engine's Nanite virtual geometry system, but it is developed largely from the ground up. Echoing Richard Feynman's philosophy, "What I cannot create, I do not understand," we aim to demystify the core principles of Nanite and to provide developers with a standalone virtual geometry module, independent of Unreal Engine, that helps them understand and apply these rendering techniques. The following text uses patch and cluster interchangeably, since each triangle cluster is likely to be a patch in space.

Basic Idea

The central question we address is how to handle high-polygon-count triangle geometry, especially when numerous triangles shrink to less than one pixel in size after rasterization. Traditional Level of Detail (LOD) methods switch between different mesh resolutions based on the viewer's distance, leading to visual discontinuities during camera movement. Nanite introduces an innovative solution: rather than rendering separate LOD levels, it blends patches from various LODs. This heuristic mixing ensures a seamless and visually coherent output from the current camera perspective.

Nanite Mesh Building

To avoid excessive triangle overdraw, it's crucial to establish a hierarchical relationship among patches across different LOD levels, ensuring they cover identical areas. Nanite maintains this hierarchy and keeps transitions smooth by simplifying triangles within the same cluster group and then re-grouping and re-clustering the result. We implemented our Nanite build pipeline on top of OpenMesh and METIS.
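METIS partitions graphs, not meshes, so a clustering step first needs the triangle adjacency (dual) graph in the CSR layout that METIS consumes (the `xadj` and `adjncy` arrays). The following is a minimal sketch of that preparation step, not the project's actual code: `DualGraph` and `buildDualGraph` are hypothetical names, the mesh is assumed to be a flat index buffer, and indices are stored as 32-bit integers on the assumption that METIS was built with its default 32-bit `idx_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Triangle adjacency graph in CSR form: the neighbours of triangle t are
// adjncy[xadj[t]] .. adjncy[xadj[t + 1] - 1].
struct DualGraph {
    std::vector<int32_t> xadj;
    std::vector<int32_t> adjncy;
};

// Two triangles are neighbours when they share an (undirected) edge.
DualGraph buildDualGraph(const std::vector<uint32_t>& indices) {
    const size_t triCount = indices.size() / 3;

    // Map every edge to the triangles that use it.
    std::map<std::pair<uint32_t, uint32_t>, std::vector<int32_t>> edgeToTris;
    for (size_t t = 0; t < triCount; ++t) {
        for (int e = 0; e < 3; ++e) {
            uint32_t a = indices[3 * t + e];
            uint32_t b = indices[3 * t + (e + 1) % 3];
            if (a > b) std::swap(a, b);  // same key for both edge directions
            edgeToTris[{a, b}].push_back(static_cast<int32_t>(t));
        }
    }

    // Connect every pair of triangles that meet at an edge.
    std::vector<std::vector<int32_t>> neighbours(triCount);
    for (const auto& entry : edgeToTris) {
        const std::vector<int32_t>& tris = entry.second;
        for (size_t i = 0; i < tris.size(); ++i) {
            for (size_t j = i + 1; j < tris.size(); ++j) {
                neighbours[tris[i]].push_back(tris[j]);
                neighbours[tris[j]].push_back(tris[i]);
            }
        }
    }

    // Flatten to CSR, removing duplicate neighbour entries.
    DualGraph g;
    g.xadj.push_back(0);
    for (auto& n : neighbours) {
        std::sort(n.begin(), n.end());
        n.erase(std::unique(n.begin(), n.end()), n.end());
        g.adjncy.insert(g.adjncy.end(), n.begin(), n.end());
        g.xadj.push_back(static_cast<int32_t>(g.adjncy.size()));
    }
    return g;
}
```

For two triangles sharing one edge, `buildDualGraph({0, 1, 2, 2, 1, 3})` yields `xadj = {0, 1, 2}` and `adjncy = {1, 0}`. These two arrays, together with the desired number of parts, are what a call such as `METIS_PartGraphKway` would take as its graph input.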

In our CPU-based Nanite Mesh building pipeline, we execute a series of steps to construct a Nanite mesh effectively. The process involves the following stages:

  • Triangle Clustering: Initially, we cluster the individual triangles into distinct clusters. This step is crucial for organizing the mesh data into manageable patches.

  • Cluster Grouping: Once the triangles are clustered, we then group these clusters. This grouping is essential for handling the mesh at a higher level and sets the stage for further simplification.

  • Mesh Simplification: For each cluster group, we perform mesh simplification. This process reduces the complexity of the mesh within each group, optimizing it for better performance while maintaining the essential geometric details. This step uses the Quadric Error Metric (QEM) to perform edge collapses on the mesh.

  • Re-clustering and Iteration: After simplification, we re-cluster the triangles in these simplified cluster groups. Following this, we return to step 2, grouping these newly formed clusters. This iterative process continues until the mesh is optimally simplified and organized.
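The error measure behind the simplification step can be sketched as follows. This is the textbook quadric error formulation, not code taken from this project or from OpenMesh: each plane contributes a 4x4 quadric, quadrics are summed per vertex, and the cost of placing a vertex at a position is the sum of squared distances to those planes. The names `Quadric`, `planeQuadric`, `addQuadrics` and `quadricError` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Symmetric 4x4 quadric, stored as a full matrix for clarity.
using Quadric = std::array<std::array<double, 4>, 4>;

// Quadric of the plane a*x + b*y + c*z + d = 0, where (a, b, c) has unit
// length: Q = p * p^T with p = (a, b, c, d).
Quadric planeQuadric(double a, double b, double c, double d) {
    const double p[4] = {a, b, c, d};
    Quadric q{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q[i][j] = p[i] * p[j];
    return q;
}

// Quadrics are accumulated by plain matrix addition.
Quadric addQuadrics(const Quadric& x, const Quadric& y) {
    Quadric q{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q[i][j] = x[i][j] + y[i][j];
    return q;
}

// Error of placing a vertex at (x, y, z): v^T Q v with v = (x, y, z, 1).
// For a sum of plane quadrics this equals the sum of squared plane distances.
double quadricError(const Quadric& q, double x, double y, double z) {
    const double v[4] = {x, y, z, 1.0};
    double error = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            error += v[i] * q[i][j] * v[j];
    return error;
}
```

An edge collapse would then be scored by summing the quadrics of its two endpoints and evaluating `quadricError` at the candidate position; collapses with the lowest error are performed first.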

This pipeline processes and simplifies the mesh on the CPU, preparing it for Nanite-style rendering. Iterating steps 2 to 4 keeps a high level of detail where it is needed while optimizing the overall structure for rendering performance. Here are two pictures of clusters and cluster groups.

Runtime LOD Selection

The Quadric Error Metric (QEM) gives each patch a simplification error, which we project into screen space. This projection is facilitated by enclosing each patch within a bounding sphere. We maintain a hierarchy where each parent's error and bounding radius are always greater than those of its children. This hierarchy ensures a monotonic relationship, allowing us to evaluate each patch (whether parent or child) independently. Using the Stanford dragon as an example, we show how the LOD levels change with camera movement.
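One common way to do this projection is sketched below. It is an assumption about the general technique, not the project's shader code: the error is treated as a world-space length located at the closest point of the bounding sphere and converted to pixels for a perspective camera. `Sphere` and `projectedErrorPixels` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Sphere {
    float x, y, z, radius;
};

// Approximate size in pixels of a world-space error, evaluated at the point
// of the bounding sphere closest to the camera. fovY is the vertical field
// of view in radians; zNear avoids division by zero inside the sphere.
float projectedErrorPixels(const Sphere& bounds, float error,
                           float camX, float camY, float camZ,
                           float fovY, float screenHeight, float zNear) {
    const float dx = bounds.x - camX;
    const float dy = bounds.y - camY;
    const float dz = bounds.z - camZ;
    const float centerDist = std::sqrt(dx * dx + dy * dy + dz * dz);
    const float dist = std::max(centerDist - bounds.radius, zNear);

    // Pixels covered by one world unit at distance 1 from the camera.
    const float pixelsPerUnit = screenHeight / (2.0f * std::tan(0.5f * fovY));
    return error * pixelsPerUnit / dist;
}
```

Because a parent has both a larger error and a larger bounding radius than its children, its projected error under this formula is never smaller than theirs, which is what allows each patch to be tested on its own against a pixel threshold.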

Mixed Rasterization

For large triangles covering numerous pixels, hardware rasterization is highly efficient. Conversely, for smaller triangles — those spanning only a few pixels or slightly more — software rasterization proves to be much faster. In our software rasterization process, we employ a 64-bit texture: the upper 32 bits store depth information, while the lower 32 bits hold visibility buffer data. This approach ensures an effective Z-test during rasterization, optimizing performance for varying triangle sizes.

The image layout in the software rasterization stage is as follows (field widths in bits):

Field   Depth   ClusterId   ObjectId   TriangleId
Bits    32      15          11         6

The lower 32 bits are used in the shading stage to reconstruct per-pixel attribute values. A visualization of the visibility buffer is shown below:
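The packing described above can be sketched on the CPU as follows. Several details here are assumptions rather than facts from the project: the order of the three IDs inside the lower 32 bits, the use of raw IEEE-754 bits for depth, and a "larger depth wins" convention (as with reversed-Z), under which the depth test reduces to taking the maximum of two 64-bit values. On the GPU that maximum would typically be a 64-bit atomic operation. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

struct VisId {
    uint32_t clusterId;   // 15 bits
    uint32_t objectId;    // 11 bits
    uint32_t triangleId;  // 6 bits
};

constexpr uint32_t kClusterBits = 15;
constexpr uint32_t kObjectBits = 11;
constexpr uint32_t kTriangleBits = 6;

// Upper 32 bits: depth. Lower 32 bits: ClusterId | ObjectId | TriangleId.
// Non-negative floats keep their ordering when compared as unsigned ints.
uint64_t packVisibility(float depth, VisId id) {
    uint32_t depthBits = 0;
    std::memcpy(&depthBits, &depth, sizeof(depthBits));
    const uint32_t payload =
        ((id.clusterId & ((1u << kClusterBits) - 1)) << (kObjectBits + kTriangleBits)) |
        ((id.objectId & ((1u << kObjectBits) - 1)) << kTriangleBits) |
        (id.triangleId & ((1u << kTriangleBits) - 1));
    return (static_cast<uint64_t>(depthBits) << 32) | payload;
}

// Recover the IDs from the lower 32 bits for the shading stage.
VisId unpackVisibility(uint64_t texel) {
    const uint32_t payload = static_cast<uint32_t>(texel);
    return VisId{payload >> (kObjectBits + kTriangleBits),
                 (payload >> kTriangleBits) & ((1u << kObjectBits) - 1),
                 payload & ((1u << kTriangleBits) - 1)};
}

// Depth test and visibility write in one step: because depth occupies the
// most significant bits, the larger 64-bit value is the winning fragment.
uint64_t depthTest(uint64_t stored, uint64_t candidate) {
    return std::max(stored, candidate);
}
```

With a 6-bit TriangleId, this layout can address at most 64 triangles per cluster, 32768 clusters and 2048 objects.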

Cluster Culling With BVH

As in its traditional uses, the BVH serves as an acceleration structure in the Nanite pipeline. Here it speeds up cluster culling in the later phases: its spatial hierarchy lets us discard many clusters at once without visiting each one.

Culling with the BVH works as usual: frustum culling and occlusion culling are performed on the bounding boxes of BVH nodes. After these steps, culling continues at the cluster level using parentError and clusterError:

Render: parentError > threshold && clusterError <= threshold
Cull:   parentError <= threshold || clusterError > threshold

Because the BVH lets us traverse from an entire mesh down to individual small clusters, the condition parentError <= threshold can be applied to a BVH node to cull all child clusters beneath it at once. For this, each node must store, in addition to its bounding box, the maximum parentNormalizedError and the largest parentBoundingSphere, which together give the highest parentError.
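The two conditions and the node-level shortcut can be sketched as a recursive traversal. This is a simplified CPU illustration under stated assumptions, not the project's GPU traversal: errors are taken as already projected to screen space, each node carries a precomputed `maxParentError`, and `Cluster`, `BvhNode`, `shouldRender` and `selectClusters` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct Cluster {
    float parentError;   // projected error of the parent (coarser) cluster
    float clusterError;  // projected error of this cluster
};

struct BvhNode {
    float maxParentError = 0.0f;    // highest parentError in this subtree
    std::vector<BvhNode> children;
    std::vector<Cluster> clusters;  // clusters stored directly at this node
};

// Render: parentError > threshold && clusterError <= threshold.
bool shouldRender(const Cluster& c, float threshold) {
    return c.parentError > threshold && c.clusterError <= threshold;
}

// Collect the clusters to render. If even the largest parentError below a
// node is within the threshold, no cluster in that subtree can satisfy
// parentError > threshold, so the whole subtree is skipped.
void selectClusters(const BvhNode& node, float threshold, std::vector<Cluster>& out) {
    if (node.maxParentError <= threshold) return;
    for (const Cluster& c : node.clusters)
        if (shouldRender(c, threshold)) out.push_back(c);
    for (const BvhNode& child : node.children)
        selectClusters(child, threshold, out);
}
```

The frustum and occlusion tests on node bounding boxes are omitted here; in a full traversal they would sit next to the `maxParentError` check.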

This coarse culling pre-culls over 80% of the clusters in the scene, yielding an average performance increase of around 100%.

Nanite Instancing

Instancing is crucial in Nanite for efficiently rendering scenes with over 1 billion triangles, minimizing GPU memory usage by eliminating repeated triangles and vertices. Due to our GPU-driven pipelines, direct modification of instanceCount in vkDrawIndexedIndirect is challenging. Instancing is implemented in the preparation stage at two levels:

  1. Primitive-Level Instancing: Assigning a unique objectId to each cluster removes redundant triangles, allowing for optimal GPU memory usage and rendering scenes with up to hundreds of millions of triangles.

  2. Cluster-Level Instancing: By assigning an objectId from parent BVH nodes, repeated clusters are eliminated. This level of instancing facilitates rendering scenes with over 1 billion triangles while fully utilizing GPU memory.

Rendering Pipeline

Combining the techniques above gives the complete rendering pipeline, which is similar to Unreal Engine's Nanite pipeline.

Performance Analysis

All performance tests use meshes whose fragment shaders apply IBL and PBR.

  • Cluster-level LOD On/Off
    • Tested on: Windows 10, AMD Ryzen 5800 HS with Radeon Graphics CPU @ 3.20GHz 16GB, NVIDIA GeForce RTX3060 Laptop 8GB

  • BVH On/Off
    • Tested on: Windows 10, AMD Ryzen 5800 HS with Radeon Graphics CPU @ 3.20GHz 16GB, NVIDIA GeForce RTX3060 Laptop 8GB

NSight GPU Trace

This trace was produced on an NVIDIA RTX 4080 Laptop GPU.

The graph shows two stages with a high proportion of unused warp slots on active SMs: the fine culling stage and the hardware rasterization stage. Since our code in these two stages has little divergence, we suspected that the main cause of the unused warp slots is latency from global memory access.

Further investigation confirmed this guess: in those shaders, most stalls are caused by memory loads or stores.

Because the memory accesses are random, we cannot use shared memory to reduce global memory traffic. A promising idea is to apply stream compaction to those global arrays so that memory accesses become more coherent.
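As a reference for the idea, stream compaction can be sketched on the CPU as an exclusive prefix sum over visibility flags followed by a scatter. This is a sequential illustration only; a GPU version would run the scan in parallel, and nothing here is taken from the project's code. `compactVisible` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keep only the ids whose flag is non-zero, preserving their order.
// ids and visible must have the same length.
std::vector<uint32_t> compactVisible(const std::vector<uint32_t>& ids,
                                     const std::vector<uint8_t>& visible) {
    // Exclusive prefix sum: offsets[i] is the output slot of element i
    // if it survives.
    std::vector<uint32_t> offsets(ids.size());
    uint32_t running = 0;
    for (size_t i = 0; i < ids.size(); ++i) {
        offsets[i] = running;
        running += visible[i] ? 1u : 0u;
    }

    // Scatter the surviving elements into a dense array.
    std::vector<uint32_t> out(running);
    for (size_t i = 0; i < ids.size(); ++i)
        if (visible[i]) out[offsets[i]] = ids[i];
    return out;
}
```

After compaction, later passes read a dense array of surviving cluster ids, so neighbouring threads touch neighbouring memory instead of skipping over culled entries.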

  • Software Rasterization On/Off

    Please note that this test was run with runtime cluster-level LOD always on, which is why the framerate is high even when all other optimizations are off.

    • Tested on: Windows 11, i9-12900HX @ 2.30GHz 16GB, RTX4080 laptop 12GB

Milestone Slides

  • Milestone1 - GPU-Driven Depth Culling, Clustering & Grouping
  • Milestone2 - DAG, Serialization & Deserialization
  • Milestone3 - Mixed-mode Rasterizer, Multiple Instance, Performance Analysis
  • Final

Third Party Code

Known Bugs

Models may flicker when viewed from far away. This may be due to a synchronization issue, but we have not yet found the cause.

Mesh decimation is not robust enough. For now, it only accepts meshes whose faces are all connected.


The following is for developers

TODOs

  • CPU Side

    • Mesh simplification
      • Lock edge on cluster group boundaries
    • Cluster & Cluster group
      • Learn about METIS (or other available algorithms for triangle clustering)
      • Is forming cluster group exactly the same as forming cluster?
    • Core Algorithm Given cluster & cluster group
      • Mesh simplification (lock boundaries of cluster group)
      • Re-Cluster
      • Recalculate cluster groups
      • Maintain a LOD BVH Tree for each level
    • BVH Builder
    • Nanite Mesh Exporter
      • Mesh LOD
      • Data Compression
    • Parallel building
  • GPU Side

    • Runtime LOD

    • Soft ras

    • Hard ras

      • Mesh shader
    • BVH Traversal

      • Naive Traversal
      • MPMC
    • Customized depth test

    • Tile based deferred materials

    • Shadowmap culling

Links