More improvements to Conv2d speed #98

hunse · 2016-05-09T15:53:18Z

Trying to compute multiple filters per group, to reduce the number of times the image has to be loaded. So far it hasn't helped much, though.

Maybe it makes sense that this doesn't help. For global filters (convolution), the limiting factor seems to be FLOPS, not memory access, so reducing image loads wouldn't make a difference.

For local filters, the filters take up much more memory (nf * ni * nj * nc * si * sj) than the image (ni * nj * nc), assuming the filter stride is 1, so reducing image loads won't make a difference there either. Only when the stride is about the size of the kernel width would the image data be on par with the filter data for one workgroup; in that case, computing multiple filters per group might help.
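To put rough numbers on this, here is a minimal sketch of the two element counts. The symbol meanings are my reading of the formulas above, not something stated in the PR: nf is the number of filters, ni and nj the image height and width, nc the number of channels, and si and sj the filter height and width. The output size of `ceil(ni / stride)` per axis is also an assumption, chosen so that stride 1 reproduces the formula above.

```python
import math


def image_size(ni, nj, nc):
    """Number of elements in the input image."""
    return ni * nj * nc


def local_filter_size(nf, ni, nj, nc, si, sj, stride=1):
    """Number of weights for local (unshared) filters.

    Assumption: one filter patch per output position, with
    ceil(ni / stride) x ceil(nj / stride) output positions.
    """
    out_i = math.ceil(ni / stride)
    out_j = math.ceil(nj / stride)
    return nf * out_i * out_j * nc * si * sj


# Hypothetical sizes, for illustration only.
ni, nj, nc, si, sj = 32, 32, 3, 4, 4

# Stride 1: a single filter already holds si * sj times the image data.
print(local_filter_size(1, ni, nj, nc, si, sj, stride=1) / image_size(ni, nj, nc))

# Stride equal to the kernel width: one filter is on par with the image.
print(local_filter_size(1, ni, nj, nc, si, sj, stride=4) / image_size(ni, nj, nc))
```

With these sizes the ratio for one filter drops from 16 at stride 1 to 1 at stride 4, which is the regime where sharing image loads across filters could start to matter.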

This should allow each patch to be used for many filters.

TODO:
- Play around with index order. Should filters be smallest?
- Generalize: take out hardcoded lsize
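The patch-reuse idea can be illustrated with a pure-Python sketch. This is not the OpenCL kernel from this branch; it is a hypothetical reference version (valid mode, stride 1, shared filters, no kernel flip) in which the function name `conv2d_grouped` and the `filters_per_group` parameter are made up for illustration. It only shows how grouping filters cuts the number of patch loads without changing the result.

```python
def conv2d_grouped(image, filters, filters_per_group):
    """Reference convolution that loads each patch once per filter group.

    image:   nested lists, shape [nc][ni][nj]
    filters: nested lists, shape [nf][nc][si][sj]
    Returns (output with shape [nf][oi][oj], number of patch loads).
    """
    nc, ni, nj = len(image), len(image[0]), len(image[0][0])
    nf, si, sj = len(filters), len(filters[0][0]), len(filters[0][0][0])
    oi, oj = ni - si + 1, nj - sj + 1  # valid mode, stride 1 (assumption)

    out = [[[0.0] * oj for _ in range(oi)] for _ in range(nf)]
    patch_loads = 0

    for f0 in range(0, nf, filters_per_group):
        group = range(f0, min(f0 + filters_per_group, nf))
        for i in range(oi):
            for j in range(oj):
                # Load the image patch once for the whole group of filters.
                patch = [[image[c][i + a][j:j + sj] for a in range(si)]
                         for c in range(nc)]
                patch_loads += 1
                # Reuse the loaded patch for every filter in the group.
                for f in group:
                    out[f][i][j] = sum(
                        patch[c][a][b] * filters[f][c][a][b]
                        for c in range(nc)
                        for a in range(si)
                        for b in range(sj)
                    )
    return out, patch_loads


# One-channel 3x3 image and four constant 2x2 filters (illustrative data).
image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filters = [[[[k + 1, k + 1], [k + 1, k + 1]]] for k in range(4)]

out_single, loads_single = conv2d_grouped(image, filters, filters_per_group=1)
out_group, loads_group = conv2d_grouped(image, filters, filters_per_group=4)
print(loads_single, loads_group, out_single == out_group)
```

Fewer patch loads only translate into a speedup when the computation is bound by image memory traffic, which, per the discussion above, does not seem to be the case here.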

hunse added the work in progress label May 9, 2016

hunse force-pushed the conv-speed2 branch from 4c97807 to d57c0ba Compare June 2, 2016 16:59

hunse added 2 commits October 17, 2016 14:28

WIP: Faster Conv2d by having multiple filters per group

a99d4ac

This should allow each patch to be used for many filters.

TODO:
- Play around with index order. Should filters be smallest?
- Generalize: take out hardcoded lsize

WIP: flipped k, but it doesn't help

9c491f4

hunse force-pushed the conv-speed2 branch from d57c0ba to 9c491f4 Compare October 17, 2016 18:28
