
[Bug Report] ttnn.gcd doesn't support int32 #17771

Open
jasondavies opened this issue Feb 9, 2025 · 4 comments
Assignees
Labels
bug Something isn't working community op_cat: eltwise

Comments

@jasondavies
Contributor

Describe the bug

The official documentation states that ttnn.gcd supports only INT32. However, testing makes it clear that INT32 inputs produce incorrect results.

Testing reveals that floating point inputs work as expected.

Note that the restriction of input values to the range [-1024, 1024] doesn't really make sense. Looking at the internal implementation, the comments say "limited precision in bfloat16 format decreases support for input to the range [-1024, 1024]". However, bfloat16 has only an 8-bit significand (7 explicit mantissa bits), so it can represent integers exactly only up to 256. The [-1024, 1024] range instead matches float16, whose 11-bit significand covers integers exactly up to 2048. Maybe the comment was supposed to say "float16"?
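To illustrate the precision point, here is a quick stdlib-only sketch. The `to_bfloat16` helper is hypothetical (it emulates bfloat16 rounding by truncating a float32 bit pattern with round-to-nearest-even); it shows that integers above 256 already stop being exactly representable in bfloat16:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Hypothetical helper: round a Python float to the nearest bfloat16
    value by rounding the float32 bit pattern to 16 bits (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1                       # for round-to-nearest-even
    rounded = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

# Integers are exact in bfloat16 only up to 2**8 = 256:
print(to_bfloat16(256.0))   # 256.0 (exact)
print(to_bfloat16(257.0))   # 256.0 -- already rounds away
print(to_bfloat16(1025.0))  # 1024.0 -- the step size at this magnitude is 8
```

So a GCD loop on bfloat16 values cannot work correctly anywhere near ±1024, which supports the guess that the comment was written for float16.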

In any case, I think the ideal fix would be to add support for int32, and extend the maximum number of iterations to cover the full int32 input range.
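As a host-side sketch of what a fixed-iteration int32 GCD could look like (the iteration count here is my own estimate from the Fibonacci worst case of Euclid's algorithm, not anything in the ttnn implementation):

```python
import math

def gcd_int32(a: int, b: int) -> int:
    # Fixed-iteration Euclidean GCD. The worst case for the number of
    # modulo steps is a pair of consecutive Fibonacci numbers; the largest
    # pair fitting in int32 (F_45 = 1134903170, F_46 = 1836311903) needs
    # about 45 steps, so 47 iterations cover the full int32 range.
    a, b = abs(a), abs(b)
    for _ in range(47):
        if b:
            a, b = b, a % b
    return a

# Matches math.gcd, including the Fibonacci worst case:
assert gcd_int32(12, 18) == 6
assert gcd_int32(1836311903, 1134903170) == 1
```

A fixed trip count (rather than looping until b == 0) is the natural shape for lockstep SIMD-style execution, where every lane must run the same number of iterations.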

To Reproduce

import ttnn
import torch

# Open the first available device and use int32 inputs.
device_id = 0
device = ttnn.open_device(device_id=device_id)
dtype = ttnn.int32

# Reference result on the host with PyTorch.
a_torch = torch.tensor([5, 10, 15])
b_torch = torch.tensor([3, 4, 5])
c_torch = torch.gcd(a_torch, b_torch)
print(c_torch)  # tensor([1, 2, 5])

# Same computation on-device with ttnn.
a: ttnn.Tensor = ttnn.from_torch(a_torch, dtype=dtype, layout=ttnn.TILE_LAYOUT, device=device)
b: ttnn.Tensor = ttnn.from_torch(b_torch, dtype=dtype, layout=ttnn.TILE_LAYOUT, device=device)
c = ttnn.gcd(a, b)
print(c)  # prints all zeros instead of [1, 2, 5]

ttnn.close_device(device)

Expected behavior

Output should match PyTorch.

Please complete the following environment information:

  • OS: Ubuntu 20.04
  • Version: e1a028f

Additional context

2025-02-09 10:23:47.195 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.31.0, IOMMU: disabled
2025-02-09 10:23:47.198 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-09 10:23:47.198 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {1}
2025-02-09 10:23:47.252 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.10.0 (Device 0)
2025-02-09 10:23:47.253 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.10.0 (Device 1)
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
tensor([1, 2, 5])
                  Metal | WARNING  | Circular buffer indices are not contiguous starting at 0. This will hurt dispatch performance. Non-contiguous indices: 2. First unused index: 1. Kernels: reader_unary_interleaved_start_id
                  Metal | WARNING  | Circular buffer indices are not contiguous starting at 0. This will hurt dispatch performance. Non-contiguous indices: 2. First unused index: 1. Kernels: writer_unary_interleaved_start_id, reader_unary_interleaved_start_id, eltwise_sfpu
ttnn.Tensor([    0,     0,     0], shape=Shape([3]), dtype=DataType::INT32, layout=Layout::TILE)
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Always | WARNING  | Attempting to push work to Device 0 which is not initialized. Ignoring...
                 Always | WARNING  | Attempting to push work to Device 0 which is not initialized. Ignoring...
                 Always | WARNING  | Attempting to push work to Device 0 which is not initialized. Ignoring...
                 Device | INFO     | Closing user mode device drivers
@ayerofieiev-tt
Member

@umalesTT , please take a look

FYI @cmaryanTT

@jasondavies
Contributor Author

Perhaps I could have a go at writing an optimised GCD implementation for int32, as a learning exercise?

My first question is how best to benchmark the existing implementation. Can I get a cycle count? Something like CUDA events, where timestamps are recorded on-device and the precise timings are read back afterwards.

@cmaryanTT

@jasondavies - absolutely, try it out! Here's the documentation for our profiling tool:
https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/tools/tracy_profiler.html

jasondavies added a commit to jasondavies/tt-metal that referenced this issue Feb 10, 2025
This is around 50x faster compared to the old/limited floating point implementation.

Fixes tenstorrent#17771.
@vladimirjovanovicTT

> @umalesTT , please take a look
>
> FYI @cmaryanTT

This is probably the wrong tag? @umalesTT is in Forge - training.
