Second major release based on MLIR, now built completely on top of LLVM/MLIR release 13.
Compared to v0.1.0, a number of major additions have been introduced:
Low-latency inference
Thanks to contributions from @csvtuda, SPNC now comes with its own SLP (superword-level parallelism) vectorizer, based on MLIR and specialized for SPNs. Instead of vectorizing across the samples in a batch, the SLP vectorizer tries to vectorize the single evaluation of the SPN for low-latency inference. The SLP vectorizer is active when choosing cpuVectorize=True and batchSize=1 during compilation for the CPU. Evaluation showed that improvements in latency of up to 42x over unvectorized code and up to 7x over the LLVM SLP vectorizer can be achieved.
Note: SLP vectorization is an elaborate process. While the evaluation showed that the SPN-specific SLP vectorizer in SPNC typically compiles faster than the LLVM SLP vectorizer, and in many cases even faster than unvectorized compilation, you may encounter longer compilation times with SLP vectorization for large SPNs. In such cases, deactivate SLP vectorization by setting cpuVectorize=False.
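As an illustration, here is a minimal sketch of selecting the low-latency path from Python. The options cpuVectorize and batchSize are the ones described above; the CPUCompiler entry point, its import path, and the log_likelihood call are assumptions about the Python interface shipped in the wheels and may differ in detail. The tiny example SPN is built with SPFlow.

```python
# Sketch: low-latency inference with the SPN-specific SLP vectorizer.
# cpuVectorize and batchSize are the options described above; the import
# path, CPUCompiler class and log_likelihood method are assumed here.
import numpy as np
from spn.structure.Base import Sum, assign_ids, rebuild_scopes_bottom_up
from spn.structure.leaves.parametric.Parametric import Gaussian
from spnc.cpu import CPUCompiler  # assumed entry point of the spnc wheel

# A tiny example SPN built with SPFlow (serialized via xspn under the hood).
spn = Sum(weights=[0.4, 0.6],
          children=[Gaussian(mean=0.0, stdev=1.0, scope=0),
                    Gaussian(mean=2.0, stdev=1.0, scope=0)])
assign_ids(spn)
rebuild_scopes_bottom_up(spn)

# cpuVectorize=True together with batchSize=1 activates the SLP vectorizer.
compiler = CPUCompiler(cpuVectorize=True, batchSize=1)
sample = np.array([[0.5]], dtype=np.float64)  # a single sample, evaluated once
log_likelihood = compiler.log_likelihood(spn, sample)
```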
Graph Partitioning
To avoid excessive compilation times for very large SPNs, SPNC now supports partitioning the SPN DAG into independent pieces on all targets. The size of the individual partitions (i.e., the number of operations per partition) can be controlled through maxTaskSize. According to a first evaluation, 10,000 is a sensible default for this value, but you can use this knob to trade off compilation time against the performance of the resulting code. Use -1 to disable partitioning altogether.
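For illustration, and assuming the same hypothetical CPUCompiler interface as in the sketch above (how exactly maxTaskSize is passed may differ), tuning the partition size could look like this:

```python
# Sketch: controlling graph partitioning. maxTaskSize, the 10,000 default,
# and -1 are described above; passing them as constructor options is assumed.
default_compiler = CPUCompiler(cpuVectorize=True, maxTaskSize=10000)  # sensible default
smaller_parts = CPUCompiler(cpuVectorize=True, maxTaskSize=5000)      # reduce compilation time
unpartitioned = CPUCompiler(cpuVectorize=True, maxTaskSize=-1)        # disable partitioning
```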
Supported architectures
In this release, SPNC has gained vectorization support for the ARM Neon architecture and now supports AVX, AVX2, AVX-512, and ARM Neon as target architectures for vectorization. The ARM Optimized Routines are used for fast math primitives on ARM Neon.
The GPU support was also improved: SPNC now avoids unnecessary copies between host and GPU when graph partitioning is active. SPNC also supports CUDA unified memory on devices where CPU and GPU share the same physical memory, e.g., the Nvidia Jetson family. Support for unified memory can be enabled through the CMake option CUDA_UNIFIED_MEMORY during build.
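For example, with a standard out-of-source CMake build (the directory layout and any further configuration options depend on your setup):

```shell
# Enable CUDA unified memory support when configuring the SPNC build.
cmake -DCUDA_UNIFIED_MEMORY=ON ..
```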
Other Improvements
A number of minor bugs were also fixed in this release. The internal representation and construction of compilation pipelines have been redesigned, significantly reducing the amount of memory SPNC requires during compilation.
Binaries
The release comes with some pre-built Python wheels to facilitate installation of SPNC:

- xspn-0.2.0-py3-none-any.whl: The serialization library used by the compiler (can also be used standalone, see README). This wheel is platform-agnostic.
- spnc-0.2.0-py3-none-linux_x86_64.whl: The compiler itself. This version only supports inference on CPUs and should be usable on any common Linux platform.
- spnc_gpu-0.2.0-py3-none-linux_x86_64.whl: The compiler itself. This version supports inference on CPUs and CUDA GPUs and should be usable on any Linux platform with CUDA 11.2 and the CUDA driver installed.
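For example, a typical CPU-only setup can be installed directly from the downloaded wheels with pip:

```shell
# Install the platform-agnostic serialization library and the CPU-only compiler.
pip install xspn-0.2.0-py3-none-any.whl
pip install spnc-0.2.0-py3-none-linux_x86_64.whl
```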
For more installation options, see the Installation Manual.