
Vulkan: Add explicit synchronization on frame boundaries #1290

Merged
merged 19 commits from present_wait into cemu-project:main
Sep 15, 2024

Conversation

goeiecool9999 (Collaborator) commented on Aug 13, 2024

Context

Since #1166 Cemu no longer waits for vkAcquireNextImage fences. The reasoning behind this was that one driver vendor reported that the fence could only be implemented with an operation equivalent to a vkDeviceWaitIdle. However, this shouldn't happen, and good drivers do not have this problem. The removal of the fence wait was poorly motivated and has led to the regression described in issue #1239.

The gist of what caused the regression is that, on some drivers, the Latte thread no longer has any point where it explicitly synchronizes with the GPU (barring occlusion queries and texture readbacks, which not all games use). This is fine when the fps-limiting factor is Cemu's built-in fps limiter. However, if the limiting factor is the GPU or the display, the Latte thread can outpace them, which leads to input lag. This went (mostly) unnoticed because most of the time the GPU and the display are not the limiting factors.

There are two scenarios where the CPU should wait for other things to catch up:

  • the GPU is too slow
  • the display is slower than the fps limiter and the FIFO present mode is used.

I'll talk about how this PR handles both.

Display-limited with FIFO present mode

Because Cemu's FPS limiter runs at 60×1.001 Hz (≈60.06 Hz), it can exactly match any display below that refresh rate with the FIFO present mode. But to do so, the Latte thread has to block somewhere to match the display refresh rate. So what ways of blocking does Vulkan provide?

AcquireNextImageKHR fences

One way to block is to use a fence from vkAcquireNextImageKHR to let the CPU wait for the moment a swapchain image is released by the presentation engine. This is what Cemu used to do. The downside is that Vulkan doesn't specify which image will be acquired. In practice, with FIFO you often acquire the least recently used image, which could be many frames back. In that case, if a driver has a minimum image count of n, Cemu can queue up to n images before blocking. That means there will always be about as many frames of input lag as there are swapchain images (possibly off by one).
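
A minimal sketch of that old approach (handles like acquireFence are placeholders, not Cemu's actual code; device, swapchain and the semaphore are assumed to be created elsewhere):

```cpp
#include <vulkan/vulkan.h>

// Acquire with a fence and immediately block until the presentation
// engine has actually released the image (the old behaviour).
uint32_t imageIndex = 0;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      imageAvailableSemaphore, // for GPU-side synchronization
                      acquireFence,            // signalled once the image is truly free
                      &imageIndex);
// With FIFO and n swapchain images, the acquired image may have been
// presented up to n frames ago, so this wait only starts throttling
// once roughly n frames are already queued.
vkWaitForFences(device, 1, &acquireFence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &acquireFence);
```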

vkAcquireNextImageKHR with infinite timeout

What else? Well, the Vulkan spec says:

If timeout is UINT64_MAX, the timeout period is treated as infinite, and vkAcquireNextImageKHR will block until an image is acquired or an error occurs.

Hmmm. That sounds like we may not have to wait for a fence at all. We could just wait for vkAcquireNextImageKHR to return. Right?
No. In a note[1] the Vulkan spec warns that metering rendering speed to the presentation rate by relying on vkAcquireNextImageKHR to block should not be done.
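
For illustration, the fence-less variant that the note warns about would look like this (same placeholder handles as above):

```cpp
// Rely only on vkAcquireNextImageKHR's own blocking behaviour.
uint32_t imageIndex = 0;
vkAcquireNextImageKHR(device, swapchain,
                      UINT64_MAX,              // "infinite" timeout
                      imageAvailableSemaphore, // GPU-side sync only
                      VK_NULL_HANDLE,          // no fence for the CPU to wait on
                      &imageIndex);
// On some drivers this blocks until vsync frees an image; on others it
// returns as soon as the present request has been queued internally,
// so it cannot be relied on to meter frame pacing.
```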

Asynchronous presentation engines

These two statements seem contradictory. If vkAcquireNextImageKHR must block until an image is acquired, why is waiting for vkAcquireNextImageKHR discouraged?
Some clarity may be found in another note near the beginning of the WSI chapter:

The presentation engine may be synchronous or asynchronous with respect to the application and/or logical device.
Some implementations may use the device’s graphics queue or dedicated presentation hardware to perform presentation.

Speculating based on this, there seem to be two different kinds of drivers: one where the results and blocking duration of vkAcquireNextImageKHR depend on the state of the GPU, and another where the GPU state is entirely ignored. In the former, vkAcquireNextImageKHR would need to block because images have to be presented before they become available again, limiting the thread to the refresh rate.
What would happen in the latter? Well, one user found out for themselves in #1239. On NVIDIA there was simply a little more input lag. On their Adreno driver, input lag kept increasing gradually, "seemingly without upper limits".
Since there are only finitely many swapchain images and a swapchain image cannot be queued twice, an image becoming available for acquisition, as in "vkAcquireNextImageKHR will block until an image is acquired", must not actually be tied to vsync in any way. It just means that the driver has put the image's presentation into an internal queue and allows the CPU to continue queuing more work on the same image.

So by removing the fence wait, the behaviour on some systems was identical to before, while on others there was significant input lag. Let's just add the fence wait back, but move it to SwapBuffer(), which is conventionally the place to block. One driver vendor reporting a slow implementation back in 2022 shouldn't stop us; they acknowledged it was really bad and have likely fixed it by now.
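
A hedged sketch of what that looks like (member names are hypothetical, not Cemu's actual code):

```cpp
// Defer the acquire-fence wait to SwapBuffer(), the conventional
// blocking point for the Latte thread.
void SwapBuffer()
{
    // Block on the fence handed to the previous vkAcquireNextImageKHR
    // call, so the CPU cannot run ahead of the presentation engine.
    vkWaitForFences(m_device, 1, &m_acquireFence, VK_TRUE, UINT64_MAX);
    vkResetFences(m_device, 1, &m_acquireFence);

    // ... submit the remaining work and vkQueuePresentKHR ...

    // Hand the fence in again; the next SwapBuffer() waits on it.
    vkAcquireNextImageKHR(m_device, m_swapchain, UINT64_MAX,
                          m_acquireSemaphore, m_acquireFence, &m_imageIndex);
}
```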

Present Wait

As established above, the core Vulkan API provides no way to limit the number of images you queue for presentation.
Is there any way to ensure low latency, even when there are a lot of swapchain images (besides using different present modes)?
Yes there is: VK_KHR_present_wait[2].
When queuing a present, it allows you to give the image an ID, and later use vkWaitForPresentKHR to, well... wait for that image to be presented.
We can use this to wait for the previous frame to be presented before queuing the current frame for presentation. That keeps the queue as shallow as it can theoretically be, potentially making the input lag even lower than it ever was with the double-buffered VSync option.
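
A sketch of that scheme (it needs both VK_KHR_present_id and VK_KHR_present_wait enabled on the device; member names are illustrative):

```cpp
void PresentWithPresentWait()
{
    // Keep at most one present outstanding: block until the frame we
    // queued last time has actually been presented.
    if (m_lastPresentId != 0)
        vkWaitForPresentKHR(m_device, m_swapchain, m_lastPresentId, UINT64_MAX);

    // Tag this present with an increasing ID via VK_KHR_present_id.
    uint64_t presentIdValue = ++m_lastPresentId;
    VkPresentIdKHR presentId{};
    presentId.sType = VK_STRUCTURE_TYPE_PRESENT_ID_KHR;
    presentId.swapchainCount = 1;
    presentId.pPresentIds = &presentIdValue;

    VkPresentInfoKHR presentInfo{};
    presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    presentInfo.pNext = &presentId;
    presentInfo.swapchainCount = 1;
    presentInfo.pSwapchains = &m_swapchain;
    presentInfo.pImageIndices = &m_imageIndex;
    // ... wait semaphores omitted for brevity ...
    vkQueuePresentKHR(m_presentQueue, &presentInfo);
}
```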

GPU limited

It's quite simple to prevent the CPU from outpacing the GPU: simply record which command buffer ID contains the last command touching a swapchain image and wait for that command buffer in SwapBuffer(). If the GPU is fast enough, the thread never has to wait.
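
A minimal sketch of that bookkeeping, with hypothetical helper and member names:

```cpp
#include <cstdint>
#include <vector>

// One entry per swapchain image: the ID of the last command buffer
// that writes to it.
std::vector<uint64_t> m_lastCmdBufferIdPerImage;

void OnSwapchainImageWritten(uint32_t imageIndex, uint64_t cmdBufferId)
{
    m_lastCmdBufferIdPerImage[imageIndex] = cmdBufferId;
}

void SwapBuffer(uint32_t imageIndex)
{
    // WaitCommandBufferFinished() is assumed to block on the fence of
    // the given submission. If the GPU already finished that command
    // buffer, it returns immediately and the thread never stalls.
    WaitCommandBufferFinished(m_lastCmdBufferIdPerImage[imageIndex]);
    // ... queue the present for imageIndex ...
}
```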

Notes

[1] vkAcquireNextImageKHR note

Applications should not rely on vkAcquireNextImageKHR blocking in order to meter their rendering speed. The implementation may return from this function immediately regardless of how many presentation requests are queued, and regardless of when queued presentation requests will complete relative to the call. Instead, applications can use fence to meter their frame generation work to match the presentation rate.

[2] Present Wait section of Window System Integration chapter

Applications wanting to control the pacing of the application by monitoring when presentation processes have completed to limit the number of outstanding images queued for presentation, need to have a method of being signaled during the presentation process.
Providing a mechanism which allows applications to block, waiting for a specific step of the presentation process to complete allows them to control the amount of outstanding work (and hence the potential lag in responding to user input or changes in the rendering environment).

Fixes #1239

goeiecool9999 changed the title from "Vulkan: use present_wait to explicitly limit CPU run-ahead for FIFO present mode" to "Vulkan: use present_wait to limit CPU run-ahead for FIFO present mode" on Aug 13, 2024
goeiecool9999 marked this pull request as draft on August 13, 2024 09:32
goeiecool9999 marked this pull request as ready for review on August 13, 2024 10:08
goeiecool9999 changed the title to "Vulkan: use present_wait to limit present queue for FIFO present mode" on Aug 13, 2024
goeiecool9999 marked this pull request as ready for review on August 14, 2024 16:00
goeiecool9999 marked this pull request as draft on August 14, 2024 16:08
goeiecool9999 marked this pull request as ready for review on August 17, 2024 01:06
goeiecool9999 marked this pull request as draft on August 17, 2024 08:26
goeiecool9999 marked this pull request as ready for review on September 15, 2024 18:06
goeiecool9999 changed the title to "Vulkan: Add explicit synchronization on frame boundaries" on Sep 15, 2024
goeiecool9999 merged commit a05bdb1 into cemu-project:main on Sep 15, 2024
5 checks passed
goeiecool9999 deleted the present_wait branch on October 24, 2024 00:27