Describe the bug
A GEMM run through the CUTLASS Python interface (cutlass.Gemm) returns different results from PyTorch for the same float32 inputs. Manual computation shows PyTorch to be the more precise of the two.
The absolute difference can be as large as 0.001, which is too large for a simple GEMM.
Steps/Code to reproduce bug
import cutlass
import torch


def test_gemm():
    A = torch.randn(2, 4, requires_grad=True, device="cuda")
    B = torch.randn(4, 2, requires_grad=True, device="cuda")
    C = torch.zeros(2, 2, requires_grad=True, device="cuda")
    D = torch.zeros(2, 2, requires_grad=True, device="cuda")
    # Independent copies of the inputs for the PyTorch reference
    a_ref = A.detach().clone().requires_grad_(True)
    b_ref = B.detach().clone().requires_grad_(True)
    print(A, B, C)
    # Reference result computed by PyTorch
    D_ref = a_ref @ b_ref
    print(D_ref[0].dtype)
    # Same GEMM through the CUTLASS Python interface; the result is written to D
    plan = cutlass.Gemm(element=torch.float32, layout=cutlass.LayoutType.RowMajor, element_accumulator=torch.float32)
    plan.run(A, B, C, D, print_module=False)
    print(D_ref, D)


test_gemm()
Expected behavior
The two matrix multiplication results should be similar, but the gap is quite large. Sample output:
A: tensor([[-0.4193,  1.0308, -1.5871,  3.1340],
        [ 0.6812,  2.0357, -2.0991, -0.1116]], device='cuda:0',
       requires_grad=True)
B: tensor([[-0.0762,  0.5771],
        [ 0.7318,  0.0826],
        [ 0.7243,  2.0359],
        [-1.6843,  0.6128]], device='cuda:0', requires_grad=True)
C: tensor([[0., 0.],
        [0., 0.]], device='cuda:0', requires_grad=True)
dtype: torch.float32
D_ref (pytorch output): tensor([[-5.6419, -1.4676],
        [ 0.1053, -3.7806]], device='cuda:0', grad_fn=<MmBackward0>)
D (Nvidia-cutlass output): tensor([[-5.6431, -1.4654],
        [ 0.1054, -3.7801]], device='cuda:0', requires_grad=True)
For instance, computing the entry in the first row and first column by hand from the printed inputs gives -5.64184263, which is much closer to the PyTorch output (-5.6419) than to the Nvidia CUTLASS output (-5.6431).
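The manual check can be reproduced without a GPU. The sketch below is only an illustration: it uses the values as printed in the sample output (rounded to four decimals, so differences below roughly 1e-4 are not meaningful), and the helper name matmul is made up for this example.

```python
# Inputs and outputs copied from the sample output (rounded to 4 decimals).
A = [[-0.4193, 1.0308, -1.5871, 3.1340],
     [0.6812, 2.0357, -2.0991, -0.1116]]
B = [[-0.0762, 0.5771],
     [0.7318, 0.0826],
     [0.7243, 2.0359],
     [-1.6843, 0.6128]]
D_ref = [[-5.6419, -1.4676], [0.1053, -3.7806]]      # PyTorch output
D_cutlass = [[-5.6431, -1.4654], [0.1054, -3.7801]]  # CUTLASS output


def matmul(a, b):
    """Plain triple-loop matrix product in Python floats (float64)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


manual = matmul(A, B)

# Absolute error of each library's output against the manual result.
for i in range(2):
    for j in range(2):
        err_torch = abs(manual[i][j] - D_ref[i][j])
        err_cutlass = abs(manual[i][j] - D_cutlass[i][j])
        print(f"({i},{j}) manual={manual[i][j]:.8f} "
              f"|pytorch err|={err_torch:.5f} |cutlass err|={err_cutlass:.5f}")
```

For entry (0, 0) this gives an error of about 0.00006 for PyTorch and about 0.00126 for CUTLASS.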
Environment details
EC2 P4D instance with A100 GPU
PyTorch version: 2.5.1+cu121
nvidia-cutlass version: 3.5.1.0