
resnet50 fp16 No tensorcore was used #19630

Open
hi20240217 opened this issue Jan 8, 2025 · 2 comments
Labels
bug 🐞 (Something isn't working), codegen/nvvm (NVVM code generation compiler backend), support (Request support or ask a question)

Comments

@hi20240217

What happened?

Running ResNet-50 in fp16, no tensor cores are used. Nsight Compute reports zero tensor-pipe activity for the matmul kernel below:

main_graph_async_dispatch_78_matmul_1x1000x2048_f16xf16xf32 (1000, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
Section: Command line profiler metrics
------------------------------------------------------------------------------------ ------------- ------------
Metric Name Metric Unit Metric Value
------------------------------------------------------------------------------------ ------------- ------------
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed % 73.87
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed % 13.92
sm__cycles_elapsed.avg cycle 61517.49
sm__cycles_elapsed.max cycle 61774
sm__cycles_elapsed.min cycle 61365
sm__cycles_elapsed.sum cycle 5044434
sm__cycles_elapsed.avg.per_second cycle/nsecond 1.95
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained (!) n/a
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second (!) n/a
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active % 0
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed % 0
------------------------------------------------------------------------------------ ------------- ------------
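For readers skimming the table: the two `sm__pipe_tensor_cycles_active` rows at 0% are the direct evidence that the tensor pipe never ran during this kernel. A minimal, hypothetical helper to check this programmatically from ncu's text output (the metric name is taken from the report above; the column layout the parser assumes is just the one shown in this table):

```python
# Hypothetical helper: scan Nsight Compute text output for the
# tensor-pipe activity metric; 0% means tensor cores never ran.
NCU_OUTPUT = """\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active % 0
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed % 0
"""

def tensor_pipe_active(ncu_text):
    """Return True if any tensor-pipe metric reports a non-zero percentage."""
    for line in ncu_text.splitlines():
        if "sm__pipe_tensor_cycles_active" in line and "%" in line:
            # Assumes the metric value is the last whitespace-separated field.
            value = float(line.rsplit(None, 1)[-1])
            if value > 0:
                return True
    return False

print(tensor_pipe_active(NCU_OUTPUT))  # → False: the tensor pipe was idle
```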

Steps to reproduce your issue

cat> iree_forward_resnet50.py<<-'EOF'  
import numpy as np
import iree.turbine.aot as aot
import torch
import torchvision.models as models
import iree.runtime as rt
import time

input_tensor = torch.ones((1,3,224,224),dtype=torch.half)
model = models.resnet50(weights=None).half()  # pretrained=False is deprecated in torchvision; weights=None is equivalent
model.eval()

export_output = aot.export(model, input_tensor)
export_output.save_mlir("resnet50.mlir")
compiled_binary = export_output.compile(save_to=None,target_backends="cuda")

config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
    rt.VmModule.copy_buffer(config.vm_instance, compiled_binary.map_memory()),
    config)

# warm up
for i in range(3):
    y = vmm.main(input_tensor)

# benchmark
t0=time.time()
for i in range(1000):
    y = vmm.main(input_tensor)
t1=time.time()
print("{:.2f} FPS".format(1000/(t1-t0)))

EOF
python iree_forward_resnet50.py
ncu --clock-control=none --metrics \
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.min.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.pct_of_peak_sustained_elapsed,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.peak_sustained,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.sum.per_second,\
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off.max.per_second,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,\
sm__cycles_elapsed.avg.per_second,\
sm__cycles_elapsed python iree_forward_resnet50.py
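As a side note on the timing methodology in the script above: `time.time()` over a single 1000-iteration loop yields one averaged number and is sensitive to clock adjustments and outliers. A stdlib-only sketch of a slightly more robust harness, using `time.perf_counter` and the median (the `benchmark` helper here is hypothetical, not part of IREE):

```python
import time
import statistics

def benchmark(fn, warmup=3, iters=100):
    """Time fn() per call and report median latency plus throughput.

    perf_counter is monotonic and higher-resolution than time.time(),
    and the median is less sensitive to outliers than an end-to-end
    average over the whole loop.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    median = statistics.median(samples)
    return median, 1.0 / median  # seconds per call, calls per second

# Trivial stand-in for vmm.main(input_tensor):
latency, fps = benchmark(lambda: sum(range(1000)))
print("median latency: {:.6f} s, {:.2f} FPS".format(latency, fps))
```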

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

@hi20240217 added the bug 🐞 label Jan 8, 2025
@ScottTodd added the codegen/nvvm and support labels Jan 8, 2025
@ScottTodd
Member

Just an idea: have you tried using a specific CUDA target?

https://iree.dev/guides/deployment-configurations/gpu-cuda/#compile-a-program

Canonically a CUDA target (iree-cuda-target) matching the LLVM NVPTX backend of the form sm_<arch_number> is needed to compile towards each GPU architecture. If no architecture is specified then we will default to sm_60.

That --iree-cuda-target option should be set prior to this call:

compiled_binary = export_output.compile(save_to=None,target_backends="cuda")

I think like this? (e.g. for A100, which is sm_80)

export_output.session.set_flags("--iree-cuda-target=sm_80")

@hi20240217
Author

It has no effect:

cat> iree_forward_resnet50.py<<-'EOF'  
import numpy as np
import iree.turbine.aot as aot
import torch
import torchvision.models as models
import iree.runtime as rt
import time

input_tensor = torch.ones((1,3,224,224),dtype=torch.half)
model = models.resnet50(weights=None).half()  # pretrained=False is deprecated in torchvision; weights=None is equivalent
model.eval()

export_output = aot.export(model, input_tensor)
export_output.save_mlir("resnet50.mlir")
export_output.session.set_flags("--iree-cuda-target=rtx3090")
compiled_binary = export_output.compile(save_to=None,target_backends="cuda")

config = rt.Config("cuda://GPU-b915ad16-a0ba-3cc2-faac-2b6397113fa0")
vmm = rt.load_vm_module(
    rt.VmModule.copy_buffer(config.vm_instance, compiled_binary.map_memory()),
    config)

# warm up
for i in range(3):
    y = vmm.main(input_tensor)

# benchmark
t0=time.time()
for i in range(1000):
    y = vmm.main(input_tensor)
t1=time.time()
print("{:.2f} FPS".format(1000/(t1-t0)))

EOF
python iree_forward_resnet50.py
