Sorry, I wasn't sure how to title this. Continued from #2016.

Thanks so much for all your work. It is much improved, and for my case of dimension 20 it is no longer slower than NumPy. I played around with it a little more and tested an equivalent `(X.T * rootW).T` (call it `XtW`) in addition to the original `rootW[:, np.newaxis] * X` (call it `XW`). Please see the plot: I find that `XtW` has no performance hit with SIMD, whereas `XW` is actually still slower than NumPy above dimension size 20 in my example. However, at some point (>200 columns in my case) they all become much slower for some reason, which I suppose belongs in a different issue.
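For anyone reproducing this, the two expressions are mathematically equivalent and differ only in how the broadcast is expressed. A minimal NumPy-only sketch (hypothetical sizes, not the original benchmark setup) that checks equivalence and times both:

```python
import numpy as np
import timeit

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 20          # hypothetical sizes for illustration
X = rng.standard_normal((n_rows, n_cols))
rootW = rng.standard_normal(n_rows)

# Original form: broadcast rootW down the rows of X.
XW = rootW[:, np.newaxis] * X
# Transposed form: broadcast along the last axis of X.T, then transpose back.
XtW = (X.T * rootW).T

assert np.allclose(XW, XtW)           # identical results

t_xw = timeit.timeit(lambda: rootW[:, np.newaxis] * X, number=100)
t_xtw = timeit.timeit(lambda: (X.T * rootW).T, number=100)
print(f"XW: {t_xw:.4f}s  XtW: {t_xtw:.4f}s")
```

Note that `XtW` comes back F-contiguous (the final `.T` is just a view), so downstream code that assumes C order may pay the cost elsewhere.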
I ran the kernel under `perf stat`, and the performance issue is due to a lot of L1 cache misses. We must be doing something not smart with respect to the order of iteration :-/
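To illustrate the kind of effect iteration order has (a generic NumPy sketch, not the kernel's actual loop): reducing along rows of a column-major array strides through memory instead of walking it contiguously, which tends to produce far more L1 misses than the row-major equivalent.

```python
import numpy as np
import timeit

a_c = np.ones((4000, 4000))            # C-contiguous (row-major)
a_f = np.asfortranarray(a_c)           # same values, column-major layout

# axis=1 reductions walk each row: contiguous loads for a_c, but a
# 4000-element stride for a_f, which defeats the L1 cache.
t_c = timeit.timeit(lambda: a_c.sum(axis=1), number=5)
t_f = timeit.timeit(lambda: a_f.sum(axis=1), number=5)
print(f"C-order: {t_c:.3f}s  F-order: {t_f:.3f}s")
```

The results are identical either way; only the memory-access pattern, and hence the cache behavior, differs.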
Originally posted by @jzluo in #2016 (comment)