Slowdown above some array-size threshold #2019

Open
jzluo opened this issue Sep 13, 2022 · 1 comment

jzluo commented Sep 13, 2022

Sorry, wasn't sure how to title this. Continued from #2016.

Thanks so much for all your work. It is much improved, and for my case of 20 columns it is no longer slower than NumPy. I played around with it a little more and tested an equivalent expression, (X.T * rootW).T (call it XtW), in addition to the original rootW[:, np.newaxis] * X (call it XW). Please see the plot: XtW takes no performance hit with SIMD, whereas XW is still slower than NumPy once the dimension size goes above 20 in my example. However, at some point (>200 cols in my case) they all become much slower for some reason, which I suppose belongs in a different issue.

import numpy as np

#pythran export XW_pythran(float[:,:], float[])
def XW_pythran(X, preds):
    rootW = np.sqrt(preds * (1 - preds))
    XW = rootW[:, np.newaxis] * X
    return XW

#pythran export XtW_pythran(float[:,:], float[])
def XtW_pythran(X, preds):
    rootW = np.sqrt(preds * (1 - preds))
    XW = (X.T * rootW).T
    return XW
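
The benchmark script refers to a pure NumPy baseline, get_XW, that is not shown in the issue. A minimal sketch of what it might look like follows, together with a check that the XW and XtW formulations produce the same values. The body of get_XW, the helper get_XtW, and the small test shapes are assumptions made for illustration, not code from the original post.

```python
import numpy as np


def get_XW(X, preds):
    # Hypothetical pure-NumPy baseline (assumed to mirror XW_pythran).
    rootW = np.sqrt(preds * (1 - preds))
    return rootW[:, np.newaxis] * X


def get_XtW(X, preds):
    # Hypothetical pure-NumPy counterpart of XtW_pythran.
    rootW = np.sqrt(preds * (1 - preds))
    return (X.T * rootW).T


rng = np.random.default_rng(0)
preds = rng.random(1000)
X = rng.random((1000, 20))

xw = get_XW(X, preds)
xtw = get_XtW(X, preds)

# Both expressions scale row i of X by rootW[i], so the results should agree.
print(np.allclose(xw, xtw))
```

Since the two expressions compute the same product for every element, any timing difference between them comes from how the arrays are traversed rather than from the arithmetic.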

# for plot
import numpy as np
import perfplot

np.random.seed(0)
preds = np.random.random(20000)
perfplot.show(
    setup=lambda n: np.random.rand(20000, n),
    kernels=[
        lambda X: get_XW(X, preds),  # pure numpy version of XW_pythran
        lambda X: XW_pythran(X, preds),   # -O3 -march=native
        lambda X: XW_pythran_simd(X, preds),  # -O3 -march=native -DUSE_XSIMD
        lambda X: XtW_pythran(X, preds),
        lambda X: XtW_pythran_simd(X, preds)
    ],
    labels=["np", "pythran_XW", "pythran_simd_XW", "pythran_XtW", "pythran_simd_XtW"],
    n_range=list(range(20, 280, 20)),
    xlabel="n_cols",
    relative_to=0,
)

[Image: Screenshot from 2022-09-13 00-22-30]

Originally posted by @jzluo in #2016 (comment)

serge-sans-paille (Owner) commented

I ran the kernel under perf stat and the performance issue is due to a lot of L1 cache misses. We must be doing something not smart wrt. order of iteration :-/
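
To make the iteration-order remark concrete, here is a small sketch of the memory layout of the operands in XW and XtW at the 200-column size where the slowdown was reported. It only inspects NumPy strides. It does not show the loop order Pythran actually generates, and the array sizes are simply taken from the benchmark above.

```python
import numpy as np

X = np.random.rand(20000, 200)   # C-contiguous float64, as in the benchmark
rootW = np.random.rand(20000)

# Bytes to step along each axis: rows are 200 * 8 bytes apart, columns 8 bytes.
print(X.strides)     # (1600, 8)

# The transpose is a view with swapped strides; no data is moved.
print(X.T.strides)   # (8, 1600)

# Broadcasting rootW[:, np.newaxis] against X repeats each value along a row,
# which shows up as a zero stride on the column axis.
w = np.broadcast_to(rootW[:, np.newaxis], X.shape)
print(w.strides)     # (8, 0)

# Total footprint of X in MiB, for comparison with the machine's cache sizes.
print(X.nbytes / 2**20)
```

A loop nest whose inner loop walks the axis with the 1600-byte stride touches a new cache line on every access, while one that walks the 8-byte axis reuses each line for several elements. Whether this is what happens in the generated code is an assumption that would need to be checked against the perf output.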
