FEX: Implements new sampling based stats #4291

Sonicadvance1 · 2025-01-22T03:36:45Z

The ftrace/gpuviz path has an issue with some usability. It can't be put in high-frequency locations like the SIGBUS handler, and it is hard to correlate between stutters in the game with a gpuviz trace after the fact.

With the combination of these two problems I have written this little SHM implementation of statistics that are /virtually/ free on the FEX side and requires the reader to accumulate the data correctly.

Some design decisions

All atomics are weak-memory ordered, requiring the use of memory_barriers and mutexes in appropriate locations
- This improves performance since it's avoid atomic overheads entirely (Which FEX already has issues with)
- This means all stats are per-thread and the accumulation has to happen on the reader side
The stats SHM region can never shrink, which would result in problems.
The stats are in a singly linked-list because the expectation is low thread counts
The stats are in a single allocation region which allows us to expand the stats if the number of threads grow

This is implemented for Linux and WINE, so we can use it everywhere.

This gets us some nice stat graphics like what follows:

We only have four stats currently, but there can definitely be more stats as time progresses.
The Mangohud patches are forthcoming while I clean them up.

Not wired up, just the definitions so it lives in the InternalThreadState. We want this accessible from both FEXCore and the frontends so it needs to live there. Two types of events supported. Scoped cyclecounts and instant increments. This gives us JIT time and Signal handling time, plus events for number of SIGBUS and number of SMC events. All useful statistics for seeing stutter live.

For the four things we care about

Not wired up to anything. Requires the frontends to allocate shared memory in the expected way.

This is fairly straightforward. It creates the shared memory region in /dev/shm/fex-<pid>-stats so that Mangohud can sample it.

This is a little trickier, we actually open the `/dev/shm/fex-<pid>-stats` file directly using Windows APIs that way Mangohud (which is going to be on the Linux side, or potentially even embedded in to Gamescope) can safely pick up the stats. A little quirky plus doesn't support expanding its size since WINE doesn't support NtExtendSection, but that's fine.

bylaws · 2025-01-22T09:54:04Z

I'll make a fixup to replace the wine support with something simpler in a bit

neobrain

Gave this a first look - heavy stuff, but gathering and visualizing these metrics indeed is useful functionality to have.

Will mangohud need to maintain a fixed list of FEX-specific metrics (and update them if we add any) or will it automatically pick up new metrics added in FEX?

Is this designed to support accumulation of metrics across processes as well or is it solely for the process running mangohud?

The current implementation leaves me uncomfortable, since there's only so much faith one can have in reliability when interleaving weakly-ordered atomics, shared memory, and a custom linked list implementation. Well-placed and tested abstractions and more explaining docs would help support that faith.

More generally, can I ask what's the benefit of a MangoHud-specific approach compared to an established nanosecond-resolution live-profiler like Tracy? Obviously MangoHud is great for general stats like FPS and CPU load, but FEX's metrics are very specific and I'm not sure how valuable having a short window of these is. Most metrics could probably even be reported via ftrace despite its overhead (i.e. less rapidly updated metrics, or those that relate to costly SIGBUS/SMC events anyway), which could give us free multi-process awareness without bespoke accumulation logic in external projects.

Source/Tools/LinuxEmulation/LinuxSyscalls/ThreadManager.cpp

Source/Common/Profiler.cpp

Source/Common/Profiler.h

Source/Tools/LinuxEmulation/LinuxSyscalls/ThreadManager.cpp

neobrain · 2025-01-22T14:35:30Z

Source/Common/Profiler.h

+#endif
+
+namespace FEX::Profiler {
+class StatAllocBase {


There's some heavy logic in here, so this should explain what it's doing.

neobrain · 2025-01-22T15:22:22Z

Source/Common/Profiler.h

+  FEXCore::Profiler::ThreadStatsHeader* Head {};
+  FEXCore::Profiler::ThreadStats* Stats;
+  FEXCore::Profiler::ThreadStats* StatTail {};


Why are you preferring a custom-implemented linked list over flat memory? This seems neither more convenient for the implementation nor like it would improve performance.

Sparsity in the allocation buffer is a concern, so this is how I'm keeping the implementation.

Could you elaborate what exactly the problem is?

The reader side would need to walk the entire allocated buffer space to ensure it doesn't miss any slots, instead of the average case of a hundred or so.

What scenario exactly would trigger this? I don't know what your reader side looks like, but it sounds like a bitmask and/or active slot count would easily resolve this.

I'm just going to leave it as a linked list. We can rev the version number and change it to a bitmask in the future if someone is feeling up to writing that code in the future.

If the choice between linked lists and flat memory doesn't seem to be so important, why is the focus on weakly-ordered atomics considered critical?

(I don't agree using a custom data structure here is the way forward since it makes up a good chunk of the complexity in this PR, but on top of that the performance targets don't seem consistent here as-is.)

For the atomics I don't want the writer side to be affected by cacheline sharing problems at all. Semantically these are two independent things. The containing data which is FEXCore::Profiler::ThreadStats and the way to walk the list of data.

I understand they are different things, but it seems arbitrary to want this (with no explicit motivation) while using a data structure with poor performance characteristics in the same module.

Indeed, which is why we leave some of the "poor performance" problems on the reader side which has something like 500ms to sample the data, even though this is going to be hundreds of nanoseconds regardless.

neobrain · 2025-01-22T15:26:39Z

Source/Common/Profiler.cpp

+  }
+
+  // Find a free slot
+  memory_barrier();


Compared to the cost of the profiled events (SIGBUS/SMC), is it really worth dealing with the added complexity here instead of just using stronger atomics? Also, isn't thread creation/destruction in particular rare enough that we could use strong memory ordering on slot management variables while leaving the stats themselves weakly ordered?

False dependency sharing in the hot path is a significant concern, so I'm keeping them as weak atomics.

Where's the hot path? The currently traced events are orders of magnitude more expensive to process than potential cache behavior effects.

This applies even more to thread creation/destruction, which is a rare event on top of that.

Yes, we have some leniency on thread creation and teardown, which is already slow and causes global locks to be taken, but this is the way I have decided to implement it.

Sonicadvance1 · 2025-01-22T21:30:32Z

https://github.com/Sonicadvance1/MangoHud/tree/fex
For the WIP FEX support in mangohud.

Source/Common/Profiler.h

Source/Tools/LinuxEmulation/LinuxSyscalls/ThreadManager.cpp

Source/Windows/WOW64/Module.cpp

Source/Windows/ARM64EC/Module.cpp

Source/Windows/WOW64/Module.cpp

Fallback to the previous path if it doesn't exist.

Sonicadvance1 added 7 commits January 21, 2025 19:04

Windows: Expose support for NtCreateSection and NtMapViewOfSection

5845474

Profiler: Sprinkle the profile stats around

6c66636

For the four things we care about

WinAPI: Implement support for DeleteFile

f6d9d3d

Common: Implement a base profiler implementation

13ebda4

Not wired up to anything. Requires the frontends to allocate shared memory in the expected way.

LinuxSyscalls: Implements support for Linux side profile stats

3bdc69d

This is fairly straightforward. It creates the shared memory region in /dev/shm/fex-<pid>-stats so that Mangohud can sample it.

Sonicadvance1 force-pushed the profile_stats branch from 235e9cd to 46dca8a Compare January 22, 2025 04:45

neobrain reviewed Jan 22, 2025

View reviewed changes

Sonicadvance1 force-pushed the profile_stats branch from 2e5457b to 9bbe6a2 Compare January 22, 2025 20:07

Sonicadvance1 force-pushed the profile_stats branch from 9bbe6a2 to 38dbfe2 Compare January 22, 2025 23:48

bylaws reviewed Jan 23, 2025

View reviewed changes

Source/Common/Profiler.h Show resolved Hide resolved

bylaws reviewed Jan 23, 2025

View reviewed changes

Source/Tools/LinuxEmulation/LinuxSyscalls/ThreadManager.cpp Outdated Show resolved Hide resolved

bylaws reviewed Jan 23, 2025

View reviewed changes

Source/Windows/WOW64/Module.cpp Outdated Show resolved Hide resolved

Source/Windows/ARM64EC/Module.cpp Outdated Show resolved Hide resolved

Source/Windows/WOW64/Module.cpp Outdated Show resolved Hide resolved

Review

832f9c2

Sonicadvance1 force-pushed the profile_stats branch from 38dbfe2 to 832f9c2 Compare January 23, 2025 01:45

Wine: Add support for magic fex+wine shm path

a9b28f9

Fallback to the previous path if it doesn't exist.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FEX: Implements new sampling based stats #4291

FEX: Implements new sampling based stats #4291

Sonicadvance1 commented Jan 22, 2025

bylaws commented Jan 22, 2025

neobrain left a comment •

edited

Loading

neobrain Jan 22, 2025

neobrain Jan 22, 2025

Sonicadvance1 Jan 22, 2025

neobrain Jan 22, 2025

Sonicadvance1 Jan 22, 2025

neobrain Jan 22, 2025

Sonicadvance1 Jan 22, 2025

neobrain Jan 22, 2025 •

edited

Loading

Sonicadvance1 Jan 22, 2025

neobrain Jan 22, 2025

Sonicadvance1 Jan 22, 2025

neobrain Jan 22, 2025

Sonicadvance1 Jan 22, 2025

neobrain Jan 22, 2025 •

edited

Loading

Sonicadvance1 Jan 23, 2025

Sonicadvance1 commented Jan 22, 2025

FEX: Implements new sampling based stats #4291

Are you sure you want to change the base?

FEX: Implements new sampling based stats #4291

Conversation

Sonicadvance1 commented Jan 22, 2025

bylaws commented Jan 22, 2025

neobrain left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

neobrain Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

neobrain Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Sonicadvance1 commented Jan 22, 2025

neobrain left a comment •

edited

Loading

neobrain Jan 22, 2025 •

edited

Loading

neobrain Jan 22, 2025 •

edited

Loading