
[Bug Report] ttnn.gcd doesn't support int32 #17771

Open
jasondavies opened this issue Feb 9, 2025 · 4 comments
Assignees
Labels
bug Something isn't working community op_cat: eltwise

Comments

@jasondavies
Contributor

Describe the bug

The official documentation states that ttnn.gcd supports only INT32. However, testing makes it clear that INT32 inputs produce incorrect results.

Testing reveals that floating point inputs work as expected.

Note that the restriction of input values to the range [-1024, 1024] doesn't really make sense. Looking at the internal implementation, the comments say "limited precision in bfloat16 format decreases support for input to the range [-1024, 1024]". However, bfloat16 has only an 8-bit significand (7 explicit mantissa bits), so it can represent integers exactly only up to 256. The [-1024, 1024] range instead matches float16, whose 11-bit significand covers integers exactly up to 2048. Maybe the comment was supposed to say "float16"?
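To illustrate the precision point, here is a quick stdlib-only sketch. The `to_bfloat16` helper is hypothetical (it emulates bfloat16 rounding by truncating a float32 bit pattern with round-to-nearest-even); it shows that integers above 256 already stop being exactly representable in bfloat16:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Hypothetical helper: round a Python float to the nearest bfloat16
    value by rounding the float32 bit pattern to 16 bits (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1                       # for round-to-nearest-even
    rounded = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

# Integers are exact in bfloat16 only up to 2**8 = 256:
print(to_bfloat16(256.0))   # 256.0 (exact)
print(to_bfloat16(257.0))   # 256.0 -- already rounds away
print(to_bfloat16(1025.0))  # 1024.0 -- the step size at this magnitude is 8
```

So a GCD loop on bfloat16 values cannot work correctly anywhere near ±1024, which supports the guess that the comment was written for float16.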

In any case, I think the ideal fix would be to add support for int32, and extend the maximum number of iterations to cover the full int32 input range.
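As a host-side sketch of what a fixed-iteration int32 GCD could look like (the iteration count here is my own estimate from the Fibonacci worst case of Euclid's algorithm, not anything in the ttnn implementation):

```python
import math

def gcd_int32(a: int, b: int) -> int:
    # Fixed-iteration Euclidean GCD. The worst case for the number of
    # modulo steps is a pair of consecutive Fibonacci numbers; the largest
    # pair fitting in int32 (F_45 = 1134903170, F_46 = 1836311903) needs
    # about 45 steps, so 47 iterations cover the full int32 range.
    a, b = abs(a), abs(b)
    for _ in range(47):
        if b:
            a, b = b, a % b
    return a

# Matches math.gcd, including the Fibonacci worst case:
assert gcd_int32(12, 18) == 6
assert gcd_int32(1836311903, 1134903170) == 1
```

A fixed trip count (rather than looping until b == 0) is the natural shape for lockstep SIMD-style execution, where every lane must run the same number of iterations.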

To Reproduce

import ttnn
import torch

# Open the first available device and use int32 inputs.
device_id = 0
device = ttnn.open_device(device_id=device_id)
dtype = ttnn.int32

# Reference result on the host with PyTorch.
a_torch = torch.tensor([5, 10, 15])
b_torch = torch.tensor([3, 4, 5])
c_torch = torch.gcd(a_torch, b_torch)
print(c_torch)  # tensor([1, 2, 5])

# Same computation on-device with ttnn.
a: ttnn.Tensor = ttnn.from_torch(a_torch, dtype=dtype, layout=ttnn.TILE_LAYOUT, device=device)
b: ttnn.Tensor = ttnn.from_torch(b_torch, dtype=dtype, layout=ttnn.TILE_LAYOUT, device=device)
c = ttnn.gcd(a, b)
print(c)  # prints all zeros instead of [1, 2, 5]

ttnn.close_device(device)

Expected behavior

Output should match PyTorch.

Please complete the following environment information:

  • OS: Ubuntu 20.04
  • Version: e1a028f

Additional context

2025-02-09 10:23:47.195 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.31.0, IOMMU: disabled
2025-02-09 10:23:47.198 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-09 10:23:47.198 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {1}
2025-02-09 10:23:47.252 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.10.0 (Device 0)
2025-02-09 10:23:47.253 | INFO     | SiliconDriver   - Software version 6.0.0, Ethernet FW version 6.10.0 (Device 1)
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
tensor([1, 2, 5])
                  Metal | WARNING  | Circular buffer indices are not contiguous starting at 0. This will hurt dispatch performance. Non-contiguous indices: 2. First unused index: 1. Kernels: reader_unary_interleaved_start_id
                  Metal | WARNING  | Circular buffer indices are not contiguous starting at 0. This will hurt dispatch performance. Non-contiguous indices: 2. First unused index: 1. Kernels: writer_unary_interleaved_start_id, reader_unary_interleaved_start_id, eltwise_sfpu
ttnn.Tensor([    0,     0,     0], shape=Shape([3]), dtype=DataType::INT32, layout=Layout::TILE)
                  Metal | INFO     | Closing device 0
                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Always | WARNING  | Attempting to push work to Device 0 which is not initialized. Ignoring...
                 Always | WARNING  | Attempting to push work to Device 0 which is not initialized. Ignoring...
                 Always | WARNING  | Attempting to push work to Device 0 which is not initialized. Ignoring...
                 Device | INFO     | Closing user mode device drivers
@ayerofieiev-tt
Member

@umalesTT , please take a look

FYI @cmaryanTT

@jasondavies
Contributor Author

Perhaps I could have a go at writing an optimised GCD implementation for int32, as a learning exercise?

My first question is how best to benchmark the existing implementation. Can I get a cycle count? Something like CUDA events, where timestamps are recorded on-device and the precise timings are read back afterwards.

@cmaryanTT

@jasondavies - absolutely, try it out! Here's the documentation for our profiling tool:
https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/tools/tracy_profiler.html

jasondavies added a commit to jasondavies/tt-metal that referenced this issue Feb 10, 2025
This is around 50x faster compared to the old/limited floating point implementation.

Fixes tenstorrent#17771.
@vladimirjovanovicTT

> @umalesTT , please take a look
>
> FYI @cmaryanTT

This is probably the wrong tag? @umalesTT is in Forge - training.
