Hardware Acceleration for Neural Networks
Artificial intelligence ("AI") is deployed in various applications, from noise cancellation to image recognition, but AI-based products often come with high hardware and electricity costs, making them impractical for consumer devices and small-scale edge electronics. Inspired by biological brains, deep neural networks ("DNNs") are modeled using mathematical formulae, yet general-purpose processors treat otherwise-parallelizable AI algorithms as step-by-step sequential logic. In contrast, programmable logic devices ("PLDs") can be customized to the specific parameters of a trained DNN, thereby ensuring data-tailored computation and algorithmic parallelism at the register-transfer level. Furthermore, a subgroup of PLDs, field-programmable gate arrays ("FPGAs"), are dynamically reconfigurable. So, to improve AI runtime performance, I designed and open-sourced my hardware compiler: Innervator. Written entirely in VHDL-2008, Innervator takes any DNN's metadata and parameters (e.g., number of layers, neurons per layer, and their weights/biases) and generates its synthesizable FPGA hardware description with the appropriate pipelining and batch processing. Innervator is entirely portable and vendor-independent. As a proof of concept, I used Innervator to implement a sample 8x8-pixel handwritten-digit-recognizing neural network on a low-cost AMD Xilinx Artix-7(TM) FPGA @ 100 MHz. With 3 pipeline stages and 2 batches at about 67% LUT utilization, the network achieved ~7.12 GOP/s, predicting the output in 630 ns while drawing under 0.25 W of power. In comparison, an Intel(R) Core(TM) i7-12700H CPU @ 4.70 GHz would take 40,000-60,000 ns at 45 to 115 W. Ultimately, Innervator's hardware-accelerated approach bridges the inherent mismatch between current AI algorithms and the general-purpose digital hardware they run on.
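To make the core idea concrete, here is a minimal, self-contained VHDL-2008 sketch of a single neuron whose weights are baked in as circuit constants, so that its multiply-accumulates unroll into parallel hardware. It is illustrative only: the entity, generic, and constant names are invented for this sketch and are not taken from Innervator's actual code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity neuron_sketch is
    generic (
        g_NUM_INPUTS : positive := 4;  -- kept small; the demo network's first layer has 64
        g_WORD_WIDTH : positive := 8   -- Q4.4 fixed point: 4 integer + 4 fractional bits
    );
    port (
        clk    : in  std_logic;
        inputs : in  std_logic_vector(g_NUM_INPUTS * g_WORD_WIDTH - 1 downto 0);
        result : out signed(2 * g_WORD_WIDTH - 1 downto 0)
    );
end entity neuron_sketch;

architecture rtl of neuron_sketch is
    type weight_array is array (0 to g_NUM_INPUTS - 1) of signed(g_WORD_WIDTH - 1 downto 0);
    -- Example Q4.4 weights (sized for the default generics); in Innervator, such
    -- constants come from the trained network's parameter files at synthesis time.
    constant c_WEIGHTS : weight_array := (x"14", x"F0", x"08", x"22");
begin
    process (clk)
        -- NOTE: the accumulator width is kept simple here; a real design adds guard bits.
        variable acc : signed(2 * g_WORD_WIDTH - 1 downto 0);
    begin
        if rising_edge(clk) then
            acc := (others => '0');
            for i in 0 to g_NUM_INPUTS - 1 loop
                -- This loop unrolls into g_NUM_INPUTS parallel multipliers (DSP slices),
                -- rather than a sequential loop executed one step at a time by a CPU.
                acc := acc + signed(inputs((i + 1) * g_WORD_WIDTH - 1 downto i * g_WORD_WIDTH))
                             * c_WEIGHTS(i);
            end loop;
            result <= acc;
        end if;
    end process;
end architecture rtl;
```

Innervator's generated hardware is, of course, more elaborate (pipelining, batching, and activation functions), but the principle of turning a trained network's parameters into dedicated, parallel logic is the same.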
Although the Abstract specifically talks about an image-recognizing neural network, I endeavoured to generalize Innervator: in practice, it can implement any number of neurons and layers, in any possible application (e.g., speech recognition), not just imagery. In the `./data` folder, you will find the weight and bias parameters that are used during Innervator's synthesis. Because of the incredibly broken implementation of VHDL's `std.textio` library across most synthesis tools, I was limited to reading only `std_logic_vector`s from files; due to that, weights and biases had to be pre-formatted in a fixed-point representation. (More information is available in `file_parser.vhd`.)
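For illustration, below is a minimal, self-contained VHDL-2008 sketch of reading fixed-point words stored one-per-line as `std_logic_vector` text (for example, 1.25 in Q4.4 format would be stored as the line `00010100`). The package and function names are invented for this sketch; they are not the ones used in Innervator's `file_parser.vhd`.

```vhdl
library ieee;
use ieee.std_logic_1164.all; -- VHDL-2008 provides read() for std_logic_vector
use std.textio.all;

package param_reader_sketch is
    -- 8-bit Q4.4 words: 4 integer bits and 4 fractional bits.
    type word_array is array (natural range <>) of std_logic_vector(7 downto 0);
    impure function read_words(file_name : string; count : natural) return word_array;
end package;

package body param_reader_sketch is
    -- Reads one word per line; intended for use in a constant initializer,
    -- so the parameters become part of the synthesized circuit itself.
    impure function read_words(file_name : string; count : natural) return word_array is
        file     f      : text open read_mode is file_name;
        variable l      : line;
        variable word   : std_logic_vector(7 downto 0);
        variable result : word_array(0 to count - 1);
    begin
        for i in result'range loop
            readline(f, l);
            read(l, word);
            result(i) := word;
        end loop;
        return result;
    end function;
end package body;
```

A constant initialized by such a function, e.g. `constant c_WEIGHTS : word_array(0 to 19) := read_words("weights.txt", 20);` (the file name here is made up), is evaluated at elaboration time, which is how the parameters end up baked into the synthesized hardware.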
The VHDL code itself has been very thoroughly documented; because I was a novice to VHDL, AI, and FPGA design myself, I documented each step as if it were a beginner's tutorial. Also, you may find these overview slides of the Project useful.
Interestingly, even though I was completely new to the world of hardware design, I still found the toolchain (and even VHDL itself) in a very unstable and buggy state; in fact, throughout this project, I found and documented dozens of different bugs, some of which were new and reported to IEEE and Xilinx:
- VHDL Language Inconsistency in Ports
- VHDL Language Enhancement
- Bug in Vivado's `file_open()`
- Bug in Vivado's `read()`
- GitHub VHDL Syntax Highlighter
- Synopsys Synplify p2019's Parser Breaks on VHDL-2019 syntax
To innervate means "to supply something with nerves."
Innervator is, aptly, an implementer of artificial neural networks within Programmable Logic Devices.
Furthermore, these hardware-based neural networks could be named "Innervated Neural Networks," which also appears as INN in INNervator.
- Prior to starting this project, I had no experience or training with artificial intelligence ("AI"), electrical engineering, or hardware design;
- Hardware design is a complex field, requiring an "unlearning" of computer science; and
- Combining the two ideas, AI & hardware, transformed this project into a unique proof of concept.
- Inspired by biological brains, AI neural networks are modeled in mathematical formulae that are inherently concurrent;
- AI applications are widespread but suffer from general-purpose computer processors that execute algorithms in step-by-step sequences; and
- Programmable Logic Devices ("PLDs") allow for digital circuitry to be predesigned for data-tailored and massively parallelized operations.
[TODO: Create a TCL script and makefile to automate this.]
To ensure maximal compatibility, I tested Innervator across both Xilinx Vivado 2024's synthesizer (not its simulator) and Mentor Graphics ModelSim 2016's simulator; the code itself was written in a subset of VHDL-2008, without any other language involved. Additionally, absolutely no vendor-specific libraries were used in Innervator's design; only the official `std` and `IEEE` VHDL packages were utilized.
Because I developed Innervator on a small, entry-level FPGA board (i.e., Digilent Arty A7-35T), I faced many challenges in regard to logic resource usage and timing failures; however, this also ensured that Innervator would become very portable and resource-efficient.
In the `./src/config.vhd` file, you can fine-tune Innervator to your liking; almost everything is customizable through generics, down to the reset's polarity and synchronicity, the fixed-point types' widths, and the neurons' batch-processing size or pipeline stage count.
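As a rough illustration, such a configuration package might look like the following sketch; the constant names below are invented for this example and are not the identifiers actually used in `./src/config.vhd`.

```vhdl
-- Illustrative configuration package; see ./src/config.vhd for the real one.
package config_sketch is
    -- Reset behaviour
    constant c_RESET_ACTIVE_HIGH : boolean := true;  -- reset polarity
    constant c_RESET_SYNCHRONOUS : boolean := true;  -- synchronous vs. asynchronous reset
    -- Fixed-point format (the demo network uses 4 integer and 4 fractional bits)
    constant c_FIXED_INT_BITS  : natural := 4;
    constant c_FIXED_FRAC_BITS : natural := 4;
    -- Neuron micro-architecture
    constant c_BATCH_SIZE      : positive := 2;  -- inputs multiplied per clock (DSPs per neuron)
    constant c_PIPELINE_STAGES : positive := 3;  -- register stages inside each neuron
end package;
```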
For now, I used the four on-board LEDs to "transmit" the network's prediction (i.e., the resulting digit in this case); the same UART interface could later be used to transmit it back to the computer as well.
innervator_demo.mp4
(Note: The "delay" you see between the command prompt and FPGA is primarily due to the UART speed; the actual neural network itself takes ~1000 nanoseconds to process its input.)
(Note: This was an old simulation run; in the current version, the same digit was predicted with 70%+ accuracy.)
Excluding the peripherals (e.g., UART, button debouncer, etc.), and given a network with an input layer and 2 neural layers (64 inputs, 20 hidden neurons, and 10 output neurons), 4 integral and 4 fractional bits for the fixed-point numerals' widths, batch sizes of 1 and 2 (i.e., one or two DSPs per neuron), and 3 pipeline stages, Innervator consumed the following resources:
Resource | Utilization (Batch = 1) | Utilization (Batch = 2) | Total Available |
---|---|---|---|
Logic LUT | 10,233 | 13,949 | 20,800 |
Slice Reg. | 13,954 | 22,145 | 41,600 |
F7 Mux. | 620 | 1,440 | 16,300 |
Slice | 3,775 | 6,115 | 8,150 |
DSP | 30 | 60 | 90 |
Latency (ns) | 1,030 | 639 | N/A |
Timing was also comfortably met: given a 100 MHz clock and without aggressive synthesis optimizations, the Worst Negative Slack (WNS) was 1.252 ns. Lastly, on the same FPGA and with two pipeline stages, throughput was 7.12 giga-operations per second (GOP/s; calculations are in the technical paper), and the total on-chip power draw was 0.189 W.
Digit | FPGA Confidence | CPU Confidence |
---|---|---|
0 | 0.30468800 | 0.10168505 |
1 | 0.57812500 | 0.15610851 |
2 | 0.50781300 | 0.14220775 |
3 | 0.21875000 | 0.19579356 |
4 | 0.00390625 | 0.00119471 |
5 | 0.20703100 | 0.01840737 |
6 | 0.21484400 | 0.00273704 |
7 | 0.13281300 | 0.09511474 |
8 | 0.24218800 | 0.15363488 |
9 | 0.69921900 | 0.71728650 |
Latency (ns) | 630 | 40,000-60,000 |