don't call xpti if there are no subscribers #2230
This avoids the overhead of preparing data for xpti, and the cost of the xpti call itself, if nothing is subscribed to the ur.call xpti call stream.
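The early-exit idea described above can be sketched as follows. This is a minimal, self-contained illustration, not the code from this PR and not the xpti API: `TraceStream`, `has_subscribers`, `notify`, `traced_call`, and the `payloads_built` counter are hypothetical names introduced only for this example.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a trace stream such as "ur.call".
// This is NOT the xpti API; it only models "a stream with subscribers".
struct TraceStream {
    std::vector<std::function<void(const std::string &)>> subscribers;
    std::size_t payloads_built = 0; // traced calls that paid the preparation cost

    bool has_subscribers() const { return !subscribers.empty(); }

    void notify(const std::string &payload) const {
        for (const auto &callback : subscribers) {
            callback(payload);
        }
    }
};

// Hypothetical traced wrapper around an API entry point.
int traced_call(TraceStream &stream, const std::string &name,
                const std::function<int()> &fn) {
    // Fast path: nothing is listening, so skip both the payload
    // preparation and the notification, and just forward the call.
    if (!stream.has_subscribers()) {
        return fn();
    }

    // Slow path: build the trace payloads and notify around the call.
    ++stream.payloads_built;
    stream.notify("begin " + name);
    const int result = fn();
    stream.notify("end " + name + " -> " + std::to_string(result));
    return result;
}
```

With no subscribers registered, `traced_call` forwards straight to `fn` and builds no strings. Once a callback is added to `subscribers`, the same call emits a begin and an end notification.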
Compute Benchmarks level_zero run (with params: --env UR_ENABLE_LAYERS=UR_LAYER_TRACING --compare baseline --compare baseline_traced):

Summary
No diffs to calculate performance change (result is better)

Performance change in benchmark groups
Relative perf in group api (6): cannot calculate
Relative perf in group memory (4): cannot calculate
Relative perf in group miscellaneous (1): cannot calculate
Relative perf in group Velocity-Bench (6): cannot calculate
Relative perf in group Runtime (8): cannot calculate
Relative perf in group MicroBench (14): cannot calculate
Relative perf in group Pattern (10): cannot calculate
Relative perf in group ScalarProduct (6): cannot calculate
Relative perf in group USM (7): cannot calculate
Relative perf in group VectorAddition (3): cannot calculate
Relative perf in group Polybench (3): cannot calculate
Relative perf in group Kmeans (1): cannot calculate
Relative perf in group LinearRegressionCoeff (1): cannot calculate
Relative perf in group MolecularDynamics (1): cannot calculate
Details
Benchmark details - environment, command, output...

api_overhead_benchmark_sycl SubmitKernel out of order
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

api_overhead_benchmark_sycl SubmitKernel in order
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

api_overhead_benchmark_ur SubmitKernel out of order
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

api_overhead_benchmark_ur SubmitKernel in order
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

miscellaneous_benchmark_sycl VectorSum
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256
Output: TestCase,Mean,Median,StdDev,Min,Max,Type

Velocity-Bench Hashtable
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/hashtable/hashtable_sycl --no-verify
Output: hashtable - total time for whole calculation: 0.349192 s

Velocity-Bench Bitcracker
Environment Variables: UR_ENABLE_LAYERS=UR_LAYER_TRACING
Command: /home/test-user/bench_workdir/bitcracker/bitcracker -f /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/test-user/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000
Output: ---------> BitCracker: BitLocker password cracking tool <--------- ==================================
@oneapi-src/unified-runtime-maintain please review
Pulls in a few L0 and sanitizer changes, and an optimization for the tracing layer. oneapi-src/unified-runtime#2230