
[BUG] Precision issue with python cutlass gemm #2014

Closed
MinghaoYan opened this issue Dec 26, 2024 · 3 comments
Labels: ? - Needs Triage · bug

Comments

MinghaoYan commented Dec 26, 2024

Describe the bug
A GEMM run through the Python CUTLASS wrapper returns results that differ from PyTorch's; manual computation shows PyTorch to be the more precise of the two.

The gap can be as large as 0.001, which is too big for a simple GEMM.

Steps/Code to reproduce bug

import cutlass
import torch


def test_gemm():
    # Random FP32 operands on the GPU; C is the bias term and D the output of the GEMM.
    A = torch.randn(2, 4, requires_grad=True, device="cuda")
    B = torch.randn(4, 2, requires_grad=True, device="cuda")
    C = torch.zeros(2, 2, requires_grad=True, device="cuda")
    D = torch.zeros(2, 2, requires_grad=True, device="cuda")

    # Detached copies so the reference matmul sees exactly the same values.
    a_ref = A.detach().clone().requires_grad_(True)
    b_ref = B.detach().clone().requires_grad_(True)

    print("A:", A, "B:", B, "C:", C)

    # Reference result computed by PyTorch.
    D_ref = a_ref @ b_ref
    print("dtype:", D_ref[0].dtype)

    # The same GEMM (D = A @ B + C) through the CUTLASS Python interface.
    plan = cutlass.Gemm(element=torch.float32, layout=cutlass.LayoutType.RowMajor,
                        element_accumulator=torch.float32)
    plan.run(A, B, C, D, print_module=False)

    print("D_ref (pytorch output):", D_ref, "D (Nvidia-cutlass output):", D)


if __name__ == "__main__":
    test_gemm()

Expected behavior
The two matrix multiplication results should be nearly identical, but the gap is quite large. Sample output:

A: tensor([[-0.4193, 1.0308, -1.5871, 3.1340],
[ 0.6812, 2.0357, -2.0991, -0.1116]], device='cuda:0',
requires_grad=True)
B: tensor([[-0.0762, 0.5771],
[ 0.7318, 0.0826],
[ 0.7243, 2.0359],
[-1.6843, 0.6128]], device='cuda:0', requires_grad=True)
C: tensor([[0., 0.],
[0., 0.]], device='cuda:0', requires_grad=True)

dtype: torch.float32

D_ref (pytorch output): tensor([[-5.6419, -1.4676],
[ 0.1053, -3.7806]], device='cuda:0', grad_fn=<MmBackward0>)

D (Nvidia-cutlass output): tensor([[-5.6431, -1.4654],
[ 0.1054, -3.7801]], device='cuda:0', requires_grad=True)

For instance, manual computation gives -5.64184263 for the entry in the first row and first column, which is much closer to the PyTorch output (-5.6419) than to the Nvidia CUTLASS output (-5.6431).
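
For reference, a minimal sketch of that manual check, redoing the dot product from the (four-decimal, rounded) values of A and B printed above; the left-to-right float64 sum lands at roughly -5.64184263, matching PyTorch's -5.6419 rather than CUTLASS's -5.6431:

# First row of A and first column of B, as printed above (rounded to 4 decimals).
row = [-0.4193, 1.0308, -1.5871, 3.1340]
col = [-0.0762, 0.7318, 0.7243, -1.6843]
print(sum(a * b for a, b in zip(row, col)))  # ≈ -5.64184263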

Environment details (please complete the following information):
EC2 P4d instance with an A100 GPU
PyTorch version: 2.5.1+cu121
nvidia-cutlass version: 3.5.1.0

MinghaoYan added the ? - Needs Triage and bug labels on Dec 26, 2024
MinghaoYan changed the title from "[BUG] Presion issue with python cutlass gemm" to "[BUG] Precision issue with python cutlass gemm" on Dec 26, 2024
MinghaoYan (Author)

Apparently, this problem only occurs with float32 but not with float16.
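
A minimal sketch of one plausible explanation (an assumption, not something confirmed in this thread): on an A100, the FP32 kernel picked by CUTLASS may run on TF32 tensor cores, which round each FP32 input to 10 mantissa bits, while PyTorch keeps torch.backends.cuda.matmul.allow_tf32 = False for matmul by default and therefore produces a true FP32 product; FP16 GEMMs take the same FP16 tensor-core path in both libraries, which would explain why float16 shows no gap. Flipping the PyTorch switch reproduces a difference of the same order:

import torch

# Illustrative check only; assumes an Ampere GPU such as the A100.
A = torch.randn(2, 4, device="cuda")
B = torch.randn(4, 2, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = False
d_fp32 = A @ B            # true FP32 GEMM (PyTorch's default)

torch.backends.cuda.matmul.allow_tf32 = True
d_tf32 = A @ B            # TF32 tensor-core GEMM

print((d_fp32 - d_tf32).abs().max())  # typically on the order of 1e-3 for values of this magnitude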

jackkosaian (Contributor)

Can you please check whether the related answer here helps?
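
A sketch of the kind of workaround such an answer typically points to, under two assumptions (the linked answer is not visible here, and the opclass setter is taken from the CUTLASS Python examples): forcing the plan onto a SIMT (non-tensor-core) FP32 kernel avoids TF32 rounding and should bring the CUTLASS result in line with PyTorch's, at some performance cost.

# Hypothetical sketch: select a SIMT FP32 kernel so no TF32 rounding occurs.
# Assumes cutlass.Gemm exposes the `opclass` setter shown in the CUTLASS Python examples.
plan = cutlass.Gemm(element=torch.float32, layout=cutlass.LayoutType.RowMajor,
                    element_accumulator=torch.float32)
plan.opclass = cutlass.OpcodeClass.Simt
plan.run(A, B, C, D, print_module=False)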

MinghaoYan (Author)

This makes sense, thank you!
