Changed inclusive sum implementation from recursive to iterative
Hyperparameter choices now differ between CPU and GPU, resulting in a 20% performance gain on GPU. The non-recursive implementation avoids repeated USM allocations, yielding performance gains for large arrays. Furthermore, the base step kernel was corrected to accumulate in outputT rather than in size_t, which brings additional savings when int32 is used as the accumulator type.

Using the example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
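
For reference, below is a minimal NumPy sketch of the iterative multi-level scan structure described above. It is illustrative only: the actual implementation lives in the SYCL kernels, and the names `blocked_inclusive_scan`, `wg_size`, and `out_dtype` are hypothetical stand-ins for the work-group-sized chunking and the outputT accumulator.

```
import numpy as np


def blocked_inclusive_scan(x, wg_size=256, out_dtype=np.int64):
    # Illustrative multi-level inclusive scan. Levels are processed with
    # an explicit loop instead of recursion, so per-level scratch arrays
    # are created up front rather than re-allocated on every recursive
    # call (the repeated USM allocations mentioned above). Accumulation
    # happens in out_dtype, mirroring the outputT fix.
    out = np.array(x, dtype=out_dtype)
    levels = [out]

    # Down-sweep: scan each wg_size-sized chunk in place, collect the
    # chunk totals as the next (coarser) level, repeat until the level
    # fits in a single chunk.
    while levels[-1].size > wg_size:
        cur = levels[-1]
        n_chunks = (cur.size + wg_size - 1) // wg_size
        totals = np.empty(n_chunks, dtype=out_dtype)
        for c in range(n_chunks):
            chunk = cur[c * wg_size:(c + 1) * wg_size]
            np.cumsum(chunk, out=chunk)
            totals[c] = chunk[-1]
        levels.append(totals)

    np.cumsum(levels[-1], out=levels[-1])

    # Up-sweep: add the scanned totals of each coarser level back into
    # the chunks of the finer level below it.
    for fine, coarse in zip(reversed(levels[:-1]), reversed(levels[1:])):
        for c in range(1, coarse.size):
            fine[c * wg_size:(c + 1) * wg_size] += coarse[c - 1]
    return out


# Quick check against NumPy's cumsum (small wg_size to force two levels):
x = np.random.randint(0, 10, size=1000)
assert np.array_equal(blocked_inclusive_scan(x, wg_size=16), np.cumsum(x))
```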