Convolution Using StaticKernels.jl
#9
Replies: 2 comments
-
This is an interesting use case. Unfortunately I don't see an easy way to achieve this using only the public interface of StaticKernels.jl. To properly support this and similar use cases, either the interface would need to change or we could leverage the output array axes to infer the loop domain (then you could create an output array with the axes of the full convolution).
-
OK, I tried the following approach, just as a PoC:

```julia
using BenchmarkTools;
using LoopVectorization;
using StaticKernels;
using PaddedViews;

numSamplesA = 1000;
numSamplesB = 3;

vA = rand(numSamplesA);
vB = rand(numSamplesB);
vO = rand(numSamplesA + numSamplesB - 1);

numElementsPad = length(vA) + (2 * (length(vB) - 1));
vParnetIdx = (length(vB) - 1) .+ (1:length(vA)); #<! Indices of `vA` inside the padded parent
vAA = PaddedView(0, vA, (1:numElementsPad,), (vParnetIdx,)); #<! Zero padded view of `vA`
vK  = Kernel{(0:(length(vB) - 1), )}(@inline vW -> vW[0] * vB[3] + vW[1] * vB[2] + vW[2] * vB[1]);

map!(vK, vO, vAA);
```

It gives the correct answer. So I tried:

```julia
vOO = zeros(numSamplesA - numSamplesB + 1);
map!(vK, vOO, vAA);
```

Yet I get a similar run time (I measure it with `@benchmark`).

For comparison:

```julia
using BenchmarkTools;
using LoopVectorization;
using StaticKernels;
using PaddedViews;

function ___Conv1D!( vO :: Vector{T}, vA :: Vector{T}, vB :: Vector{T} ) where {T <: Real}
    # Based on https://discourse.julialang.org/t/97658/15
    # Doesn't require `StrideArraysCore` (uses 1 based indexing instead of 0 based indexing)

    J = length(vA);
    K = length(vB); #<! Assumed to be the kernel
    # Optimized for the case the kernel is `vB` (the shorter vector)
    J < K && return ___Conv1D!(vO, vB, vA);
    I = J + K - 1; #<! Output length

    @turbo for ii in 1:(K - 1) #<! Head: window only partially overlaps `vA`
        sumVal = zero(T);
        for kk in 1:K
            ib0 = (ii >= kk);
            oa = ib0 ? vA[ii - kk + 1] : zero(T);
            sumVal += vB[kk] * oa;
        end
        vO[ii] = sumVal;
    end
    @turbo inline=true for ii in K:(J - 1) #<! Middle: full overlap, branch free
        sumVal = zero(T);
        for kk in 1:K
            sumVal += vB[kk] * vA[ii - kk + 1];
        end
        vO[ii] = sumVal;
    end
    @turbo for ii in J:I #<! Tail: window runs past the end of `vA`
        sumVal = zero(T);
        for kk in 1:K
            ib0 = (ii < J + kk);
            oa = ib0 ? vA[ii - kk + 1] : zero(T);
            sumVal += vB[kk] * oa;
        end
        vO[ii] = sumVal;
    end

    return vO;
end
```
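For reference, the head / middle / tail split is what lets `@turbo` vectorize the hot middle loop without any masking. A single loop equivalent (a sketch; `NaiveConv1D!` is just an illustrative name, not part of the code above) keeps the bounds logic inside the loop instead:

```julia
# Single loop full convolution: the `max` / `min` bounds replace the masked
# loads above, at the cost of a data dependent inner range
function NaiveConv1D!( vO :: Vector{T}, vA :: Vector{T}, vB :: Vector{T} ) where {T <: Real}
    J = length(vA);
    K = length(vB);
    for ii in 1:(J + K - 1)
        sumVal = zero(T);
        for kk in max(1, ii - J + 1):min(ii, K)
            sumVal += vB[kk] * vA[ii - kk + 1];
        end
        vO[ii] = sumVal;
    end
    return vO;
end
```

The data dependent inner range is exactly what the three fixed-range loops avoid, which is why the split version vectorizes so much better.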

```julia
vA = [1, 2, 3, 4, 5];
vB = [4, 5, 6];
vC = [6, 5, 4];
vO = collect(1:(length(vA) + length(vB) - 1));

numSamplesA = 1000;
numSamplesB = 3;

vA = rand(numSamplesA);
vB = rand(numSamplesB);
vO = rand(numSamplesA + numSamplesB - 1);

numElementsPad = length(vA) + (2 * (length(vB) - 1));
vParnetIdx = (length(vB) - 1) .+ (1:length(vA));
vAA = PaddedView(0, vA, (1:numElementsPad,), (vParnetIdx,)); #<! Zero padded view
vK  = Kernel{(0:(length(vB) - 1), )}(@inline vW -> vW[0] * vB[3] + vW[1] * vB[2] + vW[2] * vB[1]);

@benchmark ___Conv1D!($vO, $vA, $vB)
@benchmark map!($vK, $vO, $vAA)
```

On my system the difference is about 180x. I just thought it might be interesting for you.
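For a correctness check (assuming DSP.jl is available; its `conv` returns the full `length(vA) + length(vB) - 1` convolution):

```julia
using DSP;

vRef = conv(vA, vB); #<! Reference full convolution

___Conv1D!(vO, vA, vB);
maximum(abs.(vO .- vRef)) #<! Should be on the order of machine precision

map!(vK, vO, vAA);
maximum(abs.(vO .- vRef)) #<! Same for the `StaticKernels.jl` + `PaddedView` version
```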
-
Given 2 vectors, I would like to perform their convolution using StaticKernels.jl. It would be something like the sketch below.
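Something like this, with the boundary handled by a constant extension (a sketch, assuming `StaticKernels.ExtensionConstant(0)` for the zero boundary):

```julia
using StaticKernels;

vA = [1, 2, 3, 4, 5];
vB = [4, 5, 6];

#<! The window applies `vB` flipped, so this computes a convolution
vK = Kernel{(0:2,)}(@inline vW -> vW[0] * vB[3] + vW[1] * vB[2] + vW[2] * vB[1], StaticKernels.ExtensionConstant(0));
vO = map(vK, vA);
```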
The problem is I can get only 5 (the length of the data `vA`) elements as output and not `length(vA) + length(vB) - 1`.
I was wondering if there is a trick with the boundary extension to get the output of the full convolution?
In the case above I can define `vAA = [0, 0, 1, 2, 3, 4, 5]` and then run the kernel on it to get the required result, yet I wonder if it can be done without explicitly extending the vector.