Maybe some advanced discussion regarding accuracy under WDDM, the impact of CUDA event timers, etc.
CUDA event timers have a resolution of "around 0.5 microseconds", and timing only behaves as intended when the events are recorded in the NULL (default) stream:
> Computes the elapsed time between two events (in milliseconds with a resolution of around 0.5 microseconds).
>
> If either event was last recorded in a non-NULL stream, the resulting time may be greater than expected (even if both used the same stream handle). This happens because the cudaEventRecord() operation takes place asynchronously and there is no guarantee that the measured latency is actually just between the two events. Any number of other different stream operations could execute in between the two measured events, thus altering the timing in a significant way.
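For reference, a minimal sketch of the default-stream event-timing pattern the quote describes (`busyKernel` and the launch configuration are placeholders, not FLAME GPU code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record both events in the NULL (default) stream so the elapsed
    // time brackets only the work launched between them.
    cudaEventRecord(start);
    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // Wait until the stop event has actually completed on the device
    // before asking for the elapsed time.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```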
Under WDDM, due to how WDDM command buffers are batched, cudaEvent-based timing is only meaningful for pure device code, unless an immediate stream/event/device synchronisation is added after recording (see the sketch below). See FLAMEGPU/FLAMEGPU2#451.
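A minimal sketch of that workaround, assuming event-based timing is still wanted under WDDM (`recordAndFlush` is a hypothetical helper, not part of FLAME GPU):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: record an event and immediately synchronise on it,
// so that under WDDM the record is flushed to the device rather than
// sitting in a batched command buffer.
static void recordAndFlush(cudaEvent_t evt, cudaStream_t stream = 0) {
    cudaEventRecord(evt, stream);
    cudaEventSynchronize(evt);
}

// Usage: wrap each cudaEventRecord() of a timing pair in recordAndFlush()
// before calling cudaEventElapsedTime() on the pair.
```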
The current implementation in FLAME GPU uses std::steady_clock timers when the GPU is running under WDDM.
std::steady_clock timers are generally not as good: their resolution is implementation- and hardware-specific, so a known accuracy/precision can't be documented. It might be possible to measure one at runtime, though. Depending on the model, they might not be precise enough to give useful per-step or per-layer timing. std::high_resolution_clock sounds like it should be better, but it is implementation-defined: under MSVC it is just an alias for std::steady_clock, while GCC (libstdc++) aliases it to std::system_clock, which is not suitable for performance timing because it is not monotonic.
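For illustration, a minimal host-side timing sketch of the kind described above (`dummyKernel` and the launch configuration are placeholders; FLAME GPU's actual timer classes may differ):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main() {
    // steady_clock is monotonic, so it is safe for interval timing even if
    // the system clock is adjusted while the simulation is running.
    const auto before = std::chrono::steady_clock::now();

    dummyKernel<<<1, 32>>>();
    cudaDeviceSynchronize();  // include the device work in the measured interval

    const auto after = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(after - before).count();
    printf("Step time: %f ms\n", ms);

    // The clock's nominal tick period can be queried via steady_clock::period,
    // though the effective resolution may be coarser than this value.
    printf("Nominal tick: %g ns\n",
           1e9 * std::chrono::steady_clock::period::num /
                 std::chrono::steady_clock::period::den);
    return 0;
}
```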
The documentation should probably include some coverage of timing.