Vulkan: Add explicit synchronization on frame boundaries #1290
Merged
Context
Since #1166 Cemu no longer waits for vkAcquireNextImage fences. The reasoning was that one driver vendor reported that the fence could only be implemented with an operation equivalent to vkDeviceWaitIdle. However, this shouldn't happen, and good drivers do not have this problem. The removal of the fence wait was poorly motivated and has led to the regression described in issue #1239.
The gist of the regression is that, on some drivers, the Latte thread no longer has any point where it explicitly synchronizes with the GPU (barring occlusion queries and texture readbacks, which not all games use). This is fine when the fps-limiting factor is Cemu's built-in fps limiter. However, if the limiting factor is the GPU or the display, the Latte thread can outpace them, which can lead to input lag. This went (mostly) unnoticed because most of the time the GPU and the display are not the limiting factors.
There are two scenarios where the CPU should wait for other work to catch up.
I'll cover how this PR handles both.
Display-limited with FIFO present mode
Because Cemu's FPS limiter runs at 60*1.001 Hz, it can exactly match any display at or below that refresh rate with the FIFO present mode. But to do so, the Latte thread has to block somewhere to match the display refresh rate. So what ways of blocking does Vulkan provide?
AcquireNextImageKHR fences
One way to block is to use a fence from vkAcquireNextImageKHR to make the CPU wait until a swapchain image is released by the presentation engine. This is what Cemu used to do. The downside is that Vulkan doesn't specify which image will be acquired. In practice, with FIFO you often acquire the least recently used image, which could be many frames back. In that case, if a driver has a minimum image count of n, Cemu can queue up to n images before blocking. That means there will always be as many frames of input lag as there are swapchain images (possibly off by one).
vkAcquireNextImageKHR with infinite timeout
What else? Well, the Vulkan spec says:
Hmmm. That sounds like we may not have to wait for a fence at all. We could just wait for vkAcquireNextImageKHR to return, right?
No. In a note[1], the Vulkan spec warns against metering rendering speed to the presentation rate by relying on vkAcquireNextImageKHR to block.
Asynchronous presentation engines
These two statements seem contradictory. If vkAcquireNextImageKHR must block until an image is acquired, why is waiting for vkAcquireNextImageKHR discouraged?
Some clarity may be found in another note near the beginning of the WSI chapter:
Speculating based on this, there seem to be two kinds of drivers: one where the results and blocking duration of vkAcquireNextImageKHR depend on the state of the GPU, and another where the GPU state is ignored entirely. In the former, vkAcquireNextImageKHR has to block because images must be presented before they become available again, limiting the thread to the refresh rate.
What happens in the latter? Well, one user found out for themselves in #1239. On NVIDIA there was simply a little more input lag. On their Adreno driver, input lag kept increasing gradually, "seemingly without upper limits".
Since there are only finitely many swapchain images and a swapchain image cannot be queued twice, "an image becoming available for acquisition" (as in "vkAcquireNextImageKHR will block until an image is acquired") must not actually be tied to vsync in any way. It just means the driver has put the image's presentation into an internal queue and allows the CPU to continue queuing more work on the same image.
So by removing the fence wait, the behaviour stayed identical on some systems while others got significant input lag. So let's add the fence wait back, but move it to SwapBuffer(), which is conventionally the place to block. One driver vendor saying their implementation was slow in 2022 shouldn't stop us; they acknowledged it was really bad and have likely fixed it by now.
Present Wait
Like I said before, the core Vulkan API gives you no way to limit the number of images you queue for presentation.
Is there any way to ensure low latency, even when there are a lot of swapchain images? (besides using different present modes)
Yes there is. VK_KHR_present_wait[2].
When queuing a present operation, it allows you to give an image an ID and later use vkWaitForPresentKHR to, well... wait for that image to be presented.
We can use this to wait for the previous frame to be presented before queuing the current frame for presentation. That keeps the queue as shallow as it can theoretically be, potentially making the input lag even lower than it ever was with the double-buffered VSync option.
GPU limited
It's quite simple to prevent the CPU from outpacing the GPU: keep track of which command buffer ID contains the last command targeting each swapchain image, and wait for it in SwapBuffers. If the GPU is fast enough, the thread never has to wait.
Notes
[1] vkAcquireNextImageKHR note
[2] Present Wait section of Window System Integration chapter
Fixes #1239