Changed inclusive sum implementation from recursive to iterative

Changed hyperparameter choices to differ between CPU and GPU, resulting
in a 20% performance gain on GPU.

The non-recursive implementation avoids repeated USM allocations,
resulting in performance gains for large arrays.
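
To illustrate the idea only, here is a minimal pure-Python sketch of a blocked inclusive sum written as loops over levels, with all scratch buffers sized and allocated once before any scanning starts. It is not the dpctl kernel code: the function name `inclusive_sum_iterative`, the `block_size` parameter, and the use of Python lists in place of USM allocations are assumptions made for this example.

```python
from itertools import accumulate


def inclusive_sum_iterative(values, block_size=4):
    """Blocked inclusive sum using loops over levels instead of recursion.

    Hypothetical illustration: Python lists stand in for device buffers.
    """
    if block_size < 2:
        raise ValueError("block_size must be at least 2")

    # Plan every level up front: level k+1 holds one total per block of level k.
    sizes = [len(values)]
    while sizes[-1] > block_size:
        sizes.append((sizes[-1] + block_size - 1) // block_size)

    # Allocate all scratch buffers once, before any scanning happens.
    levels = [list(values)] + [[0] * size for size in sizes[1:]]

    # Up-sweep: scan each block locally and record its total one level up.
    for k in range(len(levels) - 1):
        cur, nxt = levels[k], levels[k + 1]
        for b in range(len(nxt)):
            start = b * block_size
            stop = min(start + block_size, len(cur))
            running = 0
            for i in range(start, stop):
                running += cur[i]
                cur[i] = running
            nxt[b] = running

    # The top level fits in a single block, so scan it directly.
    top = levels[-1]
    running = 0
    for i in range(len(top)):
        running += top[i]
        top[i] = running

    # Down-sweep: add the scanned total of all preceding blocks to each block.
    for k in range(len(levels) - 2, -1, -1):
        cur, nxt = levels[k], levels[k + 1]
        for b in range(1, len(nxt)):
            carry = nxt[b - 1]
            start = b * block_size
            stop = min(start + block_size, len(cur))
            for i in range(start, stop):
                cur[i] += carry

    return levels[0]


# Check against the standard-library inclusive sum.
data = list(range(1, 38))
assert inclusive_sum_iterative(data, block_size=3) == list(accumulate(data))
```

A recursive formulation would typically allocate the block-totals buffer inside each call; planning the level sizes first is what lets the allocation happen once.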

Furthermore, corrected the base step kernel to accumulate in outputT rather
than in size_t, which yields additional savings when int32 is used as the
accumulator type.

Using the example from gh-1249, previously on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
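
For context on why this benchmark is sensitive to the inclusive sum: boolean-mask extraction such as `ag[bg]` is commonly built on an inclusive sum of the mask, which gives each selected element its position in the output. The sketch below shows that general technique in pure Python; the helper name `extract_by_mask` is hypothetical and this is not the dpctl implementation.

```python
from itertools import accumulate


def extract_by_mask(values, mask):
    """Select values where mask is true, using an inclusive sum of the mask.

    Hypothetical illustration of the general technique.
    """
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")

    # Inclusive sum of the mask: positions[i] counts selected items in mask[0..i].
    positions = list(accumulate(int(m) for m in mask))
    count = positions[-1] if positions else 0

    # Each selected element lands at (its inclusive count - 1) in the output.
    out = [None] * count
    for i, selected in enumerate(mask):
        if selected:
            out[positions[i] - 1] = values[i]
    return out


result = extract_by_mask([10, 20, 30, 40], [True, False, True, True])
assert result == [10, 30, 40]
```

Because every output position depends on the scan, the cost of the inclusive sum contributes directly to the indexing time measured above.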
oleksandr-pavlyk committed Jan 8, 2024
1 parent 5e0e5eb commit 6abb9fe
Showing 1 changed file with 258 additions and 117 deletions.