
Indexing performance #1249

Open
npolina4 opened this issue Jun 14, 2023 · 2 comments
Labels
performance Code performance

Comments

@npolina4
Collaborator

npolina4 commented Jun 14, 2023

```
import dpctl.tensor as dpt
a = dpt.ones((8192, 8192), device='cpu', dtype='f4')
b = dpt.ones((8192, 8192), device='cpu', dtype=bool)
%timeit a[b]
# 211 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numpy
a_np = numpy.ones((8192, 8192), dtype='f4')
b_np = numpy.ones((8192, 8192), dtype=bool)
%timeit a_np[b_np]
# 87.1 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
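For context on what such a masked extraction computes, here is a minimal NumPy sketch (not the dpctl implementation) of the scan-based strategy a parallel backend typically uses: an exclusive prefix sum over the flattened mask assigns each selected element its destination offset in the output. The helper name `mask_select` is illustrative only.

```
import numpy as np

def mask_select(a, mask):
    """Compute a[mask] via an explicit prefix sum over the flattened mask.

    The cumulative sum of the mask gives each True position its
    destination offset in the compacted output array.
    """
    flat_a = a.ravel()
    flat_m = mask.ravel()
    # Exclusive prefix sum: offsets[i] = number of True entries before i.
    offsets = np.cumsum(flat_m) - flat_m
    out = np.empty(int(flat_m.sum()), dtype=a.dtype)
    out[offsets[flat_m]] = flat_a[flat_m]
    return out

a = np.arange(16, dtype=np.float32).reshape(4, 4)
m = a % 2 == 0
assert np.array_equal(mask_select(a, m), a[m])
```

The prefix sum is the dominant cost for large all-True masks like the benchmark above, which is why the scan implementation is the focus of the optimization work below.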
@oleksandr-pavlyk
Collaborator

This should be improved by changes in gh-1300. @npolina4 could you please post timeit results on the same machine you used to obtain reported numbers in the original comment?

@npolina4
Collaborator Author

Results with the changes from #1300:
Size: 8192, 8192
numpy: 105 ms
cpu: 205 ms
gpu: 115 ms

Size: 4096, 4096
numpy: 24.5 ms
cpu: 45~80 ms
gpu: 21.4 ms
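The per-size timings above can be reproduced outside IPython with the standard-library `timeit` module; a small sketch for the NumPy baseline follows (swap the setup for `dpctl.tensor` with `device='cpu'` or `device='gpu'` to get the other rows). The helper name `bench_ms` is illustrative only.

```
import timeit

def bench_ms(stmt, setup, repeat=7, number=1):
    """Best-of-N wall time for `stmt`, in milliseconds."""
    times = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(times) / number * 1e3

setup = (
    "import numpy as np; "
    "a = np.ones((4096, 4096), dtype='f4'); "
    "b = np.ones((4096, 4096), dtype=bool)"
)
print(f"numpy a[b]: {bench_ms('a[b]', setup):.1f} ms")
```

Taking the minimum over repeats gives a lower-noise figure than the mean when other processes share the machine.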

@oleksandr-pavlyk oleksandr-pavlyk added the performance Code performance label Aug 15, 2023
oleksandr-pavlyk added a commit that referenced this issue Dec 15, 2023
Changed hyperparameter choices to be different for CPU and GPU, resulting
in a 20% performance gain on GPU.

The non-recursive implementation avoids repeated USM allocations,
resulting in performance gains for large arrays.

Furthermore, corrected the base step kernel to accumulate in outputT rather
than in size_t, which realizes additional savings when int32 is used as the
accumulator type.

Using the example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
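The "accumulate in outputT rather than in size_t" point above concerns the element type of the scan's running sum: a 32-bit accumulator halves the memory traffic of the prefix-sum pass compared to a platform-sized integer. A hypothetical NumPy analogue of that choice is the `dtype` argument to `cumsum`:

```
import numpy as np

mask = np.ones(10, dtype=bool)

# Default accumulator: a platform-sized integer (analogous to size_t).
offsets_default = np.cumsum(mask)
# Explicit 32-bit accumulator, as in the corrected kernel.
offsets_i32 = np.cumsum(mask, dtype=np.int32)

# Same values, but the narrower accumulator moves less data per element.
assert np.array_equal(offsets_default, offsets_i32)
assert offsets_i32.dtype.itemsize <= offsets_default.dtype.itemsize
```

The narrower type is safe only while the number of selected elements fits in int32, which is why the kernel ties the accumulator to the output type rather than fixing it.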
oleksandr-pavlyk added a commit that referenced this issue Dec 19, 2023
oleksandr-pavlyk added a commit that referenced this issue Jan 8, 2024