This design implements a bfloat16-based element-wise multiplication between two vectors, performed in parallel on two cores in a single column. Element-wise multiplication usually ends up being I/O bound due to its low compute intensity: each output element requires only a single multiply but two loads and one store. In a practical ML implementation, it is an example of the type of kernel that is likely best fused onto another, more compute-dense kernel (e.g., a convolution or GEMM).
- `aie2.py`: A Python script that defines the AIE array structural design using MLIR-AIE operations. This generates MLIR that is then compiled using `aiecc.py` to produce design binaries (i.e., an XCLBIN and inst.txt for the NPU in Ryzen™ AI).
- `add.cc`: A C++ implementation of a vectorized vector multiplication operation for AIE cores. The code uses the AIE API, a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics; its documentation can be found here. The source can be found here.
- `test.cpp`: This C++ code is a testbench for the design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After execution, it verifies the multiplication results and optionally outputs trace data.
To compile the design and C++ testbench:

```shell
make
```

To run the design:

```shell
make run
```

To generate a trace file:

```shell
make trace
```