Example LLVM passes - based on LLVM 9
llvm-tutor is a collection of self-contained reference LLVM passes. It's a tutorial that targets novice and aspiring LLVM developers. Key features:
- Complete - includes CMake build scripts, LIT tests and CI set-up
- Out of source - builds against a binary LLVM installation (no need to build LLVM from sources)
- Modern - based on the latest version of LLVM (and updated with every release)
LLVM implements a very rich, powerful and popular API. However, like many complex technologies, it can be quite daunting and overwhelming to learn and master. The goal of this LLVM tutorial is to showcase that LLVM can in fact be easy and fun to work with. This is demonstrated through a range of self-contained, testable LLVM passes, which are implemented using idiomatic LLVM.
This document explains how to set up your environment, build and run the examples, and go about debugging. It gives a high-level overview of the implemented examples and some background information on writing LLVM passes. The source files, apart from the code itself, contain comments that will guide you through the implementation. All examples are complemented with LIT tests and reference input files.
- HelloWorld
- Development Environment
- Building & Testing
- Overview of the Passes
- Debugging
- About Pass Managers in LLVM
- Credits & References
- License
The HelloWorld pass from HelloWorld.cpp is a self-contained reference example. The corresponding CMakeLists.txt implements the minimum set-up for an out-of-source pass.
For every function defined in the input module, HelloWorld prints its name and the number of arguments that it takes. You can build it like this:
export LLVM_DIR=<installation/dir/of/llvm/9>
mkdir build
cd build
cmake -DLT_LLVM_INSTALL_DIR=$LLVM_DIR <source/dir/llvm/tutor>/HelloWorld/
make
Before you can test it, you need to prepare an input file:
# Generate an LLVM test file
$LLVM_DIR/bin/clang -S -emit-llvm <source/dir/llvm/tutor/>inputs/input_for_hello.c -o input_for_hello.ll
Finally, run HelloWorld with opt:
# Run the pass
$LLVM_DIR/bin/opt -load-pass-plugin libHelloWorld.dylib -passes=hello-world -disable-output input_for_hello.ll
# Expected output
(llvm-tutor) Hello from: foo
(llvm-tutor) number of arguments: 1
(llvm-tutor) Hello from: bar
(llvm-tutor) number of arguments: 2
(llvm-tutor) Hello from: fez
(llvm-tutor) number of arguments: 3
(llvm-tutor) Hello from: main
(llvm-tutor) number of arguments: 2
The HelloWorld pass doesn't modify the input module. The -disable-output flag is used to prevent opt from printing the output bitcode file.
NOTE: On MacOS this only works when building LLVM from sources. More information is available here.
In order to run HelloWorld automatically at -O{0|1|2|3}, you have to enable registration with the optimisation pipelines. This is done via the HELLOWORLD_OPT_PIPELINE_REG CMake variable:
export LLVM_DIR=<installation/dir/of/llvm/9>
mkdir build
cd build
cmake -DLT_LLVM_INSTALL_DIR=$LLVM_DIR -DHELLOWORLD_OPT_PIPELINE_REG=On <source/dir/llvm/tutor>/HelloWorld/
make
HelloWorld will now be run whenever an optimisation level is specified:
$LLVM_DIR/bin/opt -load libHelloWorld.dylib -O1 -disable-output input_for_hello.ll
# Expected output
(llvm-tutor) Hello from: foo
(llvm-tutor) number of arguments: 1
(llvm-tutor) Hello from: bar
(llvm-tutor) number of arguments: 2
(llvm-tutor) Hello from: fez
(llvm-tutor) number of arguments: 3
(llvm-tutor) Hello from: main
(llvm-tutor) number of arguments: 2
This registration is implemented in HelloWorld.cpp. Note that for this to work I used the Legacy Pass Manager (the plugin was specified with -load rather than -load-pass-plugin). Here you can read more about pass managers in LLVM.
This project has been tested on Ubuntu 18.04 and Mac OS X 10.14.4. In order to build llvm-tutor you will need:
- LLVM 9
- C++ compiler that supports C++14
- CMake 3.4.3 or higher
In order to run the passes, you will need:
- clang-9 (to generate input LLVM files)
- opt (to run the passes)
There are additional requirements for tests (these will be satisfied by installing LLVM 9):
- lit (aka llvm-lit, LLVM tool for executing the tests)
- FileCheck (LIT requirement, it's used to check whether tests generate the expected output)
On Darwin you can install LLVM 9 with Homebrew:
brew install llvm@9
This will install all the required header files, libraries and tools in /usr/local/opt/llvm/.
On Ubuntu Bionic, you can install modern LLVM from the official repository:
wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
sudo apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-9 main"
sudo apt-get update
sudo apt-get install -y llvm-9 llvm-9-dev clang-9 llvm-9-tools
This will install all the required header files, libraries and tools in /usr/lib/llvm-9/.
Building from sources can be slow and tricky to debug. It is not necessary, but might be your preferred way of obtaining LLVM 9. The following steps will work on Linux and Mac OS X:
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout release/9.x
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 <llvm-project/root/dir>/llvm/
cmake --build .
For more details read the official documentation.
You can build llvm-tutor (and all the provided passes) as follows:
cd <build/dir>
cmake -DLT_LLVM_INSTALL_DIR=<installation/dir/of/llvm/9> <source/dir/llvm/tutor>
make
The LT_LLVM_INSTALL_DIR variable should be set to the root of either the installation or build directory of LLVM 9. It is used to locate the corresponding LLVMConfig.cmake script that is used to set the include and library paths.
In order to run the tests, you need to install llvm-lit (aka lit). It's not bundled with LLVM 9 packages, but you can install it with pip:
# Install lit - note that this installs lit globally
pip install lit
Running the tests is as simple as:
$ lit <build_dir>/test
Voilà! You should see all tests passing.
- HelloWorld - prints the functions in the input module and prints the number of arguments for each
- InjectFuncCall - instruments the input module by inserting calls to printf
- StaticCallCounter - counts direct function calls at compile-time
- DynamicCallCounter - counts direct function calls at run-time
- MBASub - code transformation for integer sub instructions
- MBAAdd - code transformation for 8-bit integer add instructions
- RIV - finds reachable integer values for each basic block
- DuplicateBB - duplicates basic blocks, requires RIV analysis results
Once you've built this project, you can experiment with every pass separately. All passes work with LLVM files. You can generate one like this:
export LLVM_DIR=<installation/dir/of/llvm/9>
# Textual form
$LLVM_DIR/bin/clang -emit-llvm input.c -S -o out.ll
# Binary/bit-code form
$LLVM_DIR/bin/clang -emit-llvm -c input.c -o out.bc
It doesn't matter whether you choose the binary (with -c) or textual form (with -S), but obviously the latter is more human-friendly. All passes, except for HelloWorld, are described below.
This pass is a HelloWorld example for code instrumentation. For every function defined in the input module, InjectFuncCall will add (inject) the following call to printf:
printf("(llvm-tutor) Hello from: %s\n(llvm-tutor) number of arguments: %d\n", FuncName, FuncNumArgs)
This call is added at the beginning of each function (i.e. before any other instruction). FuncName is the name of the function and FuncNumArgs is the number of arguments that the function takes.
We will use input_for_hello.c to test InjectFuncCall:
export LLVM_DIR=<installation/dir/of/llvm/9>
# Generate an LLVM file to analyze
$LLVM_DIR/bin/clang -emit-llvm -c <source_dir>/inputs/input_for_hello.c -o input_for_hello.bc
# Run the pass through opt
$LLVM_DIR/bin/opt -load <build_dir>/lib/libInjectFuncCall.dylib -legacy-inject-func-call input_for_hello.bc -o instrumented.bin
This generates instrumented.bin, which is the instrumented version of input_for_hello.bc. In order to verify that InjectFuncCall worked as expected, you can either check the output file (and verify that it contains extra calls to printf) or run it:
$LLVM_DIR/bin/lli instrumented.bin
(llvm-tutor) Hello from: main
(llvm-tutor) number of arguments: 2
(llvm-tutor) Hello from: foo
(llvm-tutor) number of arguments: 1
(llvm-tutor) Hello from: bar
(llvm-tutor) number of arguments: 2
(llvm-tutor) Hello from: foo
(llvm-tutor) number of arguments: 1
(llvm-tutor) Hello from: fez
(llvm-tutor) number of arguments: 3
(llvm-tutor) Hello from: bar
(llvm-tutor) number of arguments: 2
(llvm-tutor) Hello from: foo
(llvm-tutor) number of arguments: 1
You might have noticed that InjectFuncCall is somewhat similar to HelloWorld. In both cases the pass visits all functions, prints their names and the number of arguments. The difference between the two passes becomes quite apparent when you compare the output generated for the same input file, e.g. input_for_hello.c. The number of times Hello from is printed is either:
- once per every function call in the case of InjectFuncCall, or
- once per function definition in the case of HelloWorld.
This makes perfect sense and hints at how different the two passes are. Whether to print Hello from is determined at either:
- run-time for InjectFuncCall, or
- compile-time for HelloWorld.
Also, note that in the case of InjectFuncCall we had to first run the pass with opt and then execute the instrumented IR module in order to see the output. For HelloWorld it was sufficient to run the pass with opt.
The StaticCallCounter pass counts the number of compile-time (i.e. visible during the compilation) function calls in the input LLVM module. If a function is called within a loop, that will always be counted as one function call, no matter how many times the loop iterates. Only direct function calls are counted.
We will use input_for_cc.c to test StaticCallCounter:
export LLVM_DIR=<installation/dir/of/llvm/9>
# Generate an LLVM file to analyze
$LLVM_DIR/bin/clang -emit-llvm -c <source_dir>/inputs/input_for_cc.c -o input_for_cc.bc
# Run the pass through opt
$LLVM_DIR/bin/opt -load <build_dir>/lib/libStaticCallCounter.dylib -legacy-static-cc -analyze input_for_cc.bc
You will see the following output:
=================================================
LLVM-TUTOR: static analysis results
=================================================
NAME #N DIRECT CALLS
-------------------------------------------------
bar 2
fez 1
foo 3
static is an LLVM based tool implemented in StaticMain.cpp. It is a command line wrapper that allows you to run StaticCallCounter without the need for opt:
<build_dir>/bin/static input_for_cc.bc
It is an example of a relatively basic static analysis tool. Its implementation demonstrates how basic pass management in LLVM works.
The DynamicCallCounter pass counts the number of run-time (i.e. encountered during the execution) function calls. It does so by inserting call-counting instructions that are executed every time a function is called. Only calls to functions that are defined in the input module are counted. This pass builds on top of ideas presented in InjectFuncCall. You may want to experiment with that example first.
We will use input_for_cc.c to test DynamicCallCounter:
export LLVM_DIR=<installation/dir/of/llvm/9>
# Generate an LLVM file to analyze
$LLVM_DIR/bin/clang -emit-llvm -c <source_dir>/inputs/input_for_cc.c -o input_for_cc.bc
# Instrument the input file
$LLVM_DIR/bin/opt -load <build_dir>/lib/libDynamicCallCounter.dylib -legacy-dynamic-cc input_for_cc.bc -o instrumented_bin
This generates instrumented_bin, which is the instrumented version of input_for_cc.bc. In order to verify that DynamicCallCounter worked as expected, you can either check the output file (and verify that it contains new call-counting instructions) or run it:
# Run the instrumented binary
$LLVM_DIR/bin/lli ./instrumented_bin
You will see the following output:
=================================================
LLVM-TUTOR: dynamic analysis results
=================================================
NAME #N DIRECT CALLS
-------------------------------------------------
foo 13
bar 2
fez 1
main 1
The numbers of function calls reported by DynamicCallCounter and StaticCallCounter are different, but both results are correct. They correspond to run-time and compile-time function calls respectively. Note also that for StaticCallCounter it was sufficient to run the pass through opt to have the summary printed. For DynamicCallCounter we had to run the instrumented binary to see the output. This is similar to what we observed when comparing HelloWorld and InjectFuncCall.
The MBASub and MBAAdd passes implement mixed boolean arithmetic transformations. Similar transformations are often used in code obfuscation (you may also know them from Hacker's Delight) and are a great illustration of what LLVM passes can be used for and how.
The MBASub pass implements this rather basic expression:
a - b == (a + ~b) + 1
Basically, it replaces all instances of integer sub according to the above formula. The corresponding LIT tests verify that both the formula and the implementation are correct.
We will use input_for_mba_sub.c to test MBASub:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S inputs/input_for_mba_sub.c -o input_for_sub.ll
$LLVM_DIR/bin/opt -load <build_dir>/lib/libMBASub.so -legacy-mba-sub input_for_sub.ll -o out.ll
The MBAAdd pass implements a slightly more involved formula that is only valid for 8-bit integers:
a + b == (((a ^ b) + 2 * (a & b)) * 39 + 23) * 151 + 111
Similarly to MBASub, it replaces all instances of integer add according to the above identity, but only for 8-bit integers. The LIT tests verify that both the formula and the implementation are correct.
We will use input_for_mba.c to test MBAAdd:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -O1 -emit-llvm -S inputs/input_for_mba.c -o input_for_mba.ll
$LLVM_DIR/bin/opt -load <build_dir>/lib/libMBAAdd.so -legacy-mba-add input_for_mba.ll -o out.ll
You can also specify the level of obfuscation on a scale of 0.0 to 1.0, with 0 corresponding to no obfuscation and 1 meaning that all add instructions are to be replaced with (((a ^ b) + 2 * (a & b)) * 39 + 23) * 151 + 111, e.g.:
$LLVM_DIR/bin/opt -load <build_dir>/lib/libMBAAdd.so -legacy-mba-add -mba-ratio=0.3 input_for_mba.ll -o out.ll
For each basic block in a module, RIV calculates the reachable integer values (i.e. values that can be used in the particular basic block). There are a few LIT tests that verify that indeed this is correct.
We will use input_for_riv.c to test RIV:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S inputs/input_for_riv.c -o input_for_riv.ll
$LLVM_DIR/bin/opt -load <build_dir>/lib/libRIV.so -riv input_for_riv.ll
Note that this pass, unlike previous examples, will produce information about the IR representation of the original module only. It won't be very useful if trying to understand the original C or C++ input file.
DuplicateBB will duplicate all basic blocks in a module, with the exception of basic blocks for which there are no reachable integer values (identified through the RIV pass). An example of such a basic block is the entry block in a function that:
- takes no arguments and
- is embedded in a module that defines no global values.
Basic blocks are duplicated by inserting an if-then-else construct and cloning all the instructions (with the exception of PHI nodes) into the new blocks.
This pass depends on the RIV pass, hence you need to load it too in order for DuplicateBB to work. We will use input_for_duplicate_bb.c to test it:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S inputs/input_for_duplicate_bb.c -o input_for_duplicate_bb.ll
$LLVM_DIR/bin/opt -load <build_dir>/lib/libRIV.so -load <build_dir>/lib/libDuplicateBB.so -riv input_for_duplicate_bb.ll
Before running a debugger, you may want to analyze the output from LLVM_DEBUG and STATISTIC macros. For example, for MBAAdd:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S -O1 inputs/input_for_mba.c -o input_for_mba.ll
$LLVM_DIR/bin/opt -load-pass-plugin <build_dir>/lib/libMBAAdd.dylib -passes=mba-add input_for_mba.ll -debug-only=mba-add -stats -o out.ll
Note the -debug-only=mba-add and -stats flags in the command line - that's what enables the following output:
%12 = add i8 %1, %0 -> <badref> = add i8 111, %11
%20 = add i8 %12, %2 -> <badref> = add i8 111, %19
%28 = add i8 %20, %3 -> <badref> = add i8 111, %27
===-------------------------------------------------------------------------===
... Statistics Collected ...
===-------------------------------------------------------------------------===
3 mba-add - The # of substituted instructions
As you can see, you get a nice summary from MBAAdd. In many cases this will be sufficient to understand what might be going wrong.
For trickier issues just use a debugger. Below I demonstrate how to debug MBAAdd. More specifically, how to set up a breakpoint on entry to MBAAdd::run. Hopefully that will be sufficient for you to start.
The default debugger on OS X is LLDB. You will normally use it like this:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S -O1 inputs/input_for_mba.c -o input_for_mba.ll
lldb -- $LLVM_DIR/bin/opt -load-pass-plugin <build_dir>/lib/libMBAAdd.dylib -passes=mba-add input_for_mba.ll -o out.ll
(lldb) breakpoint set --name MBAAdd::run
(lldb) process launch
or, equivalently, by using LLDB's aliases:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S -O1 inputs/input_for_mba.c -o input_for_mba.ll
lldb -- $LLVM_DIR/bin/opt -load-pass-plugin <build_dir>/lib/libMBAAdd.dylib -passes=mba-add input_for_mba.ll -o out.ll
(lldb) b MBAAdd::run
(lldb) r
At this point, LLDB should break at the entry to MBAAdd::run.
On most Linux systems, GDB is the most popular debugger. A typical session will look like this:
export LLVM_DIR=<installation/dir/of/llvm/9>
$LLVM_DIR/bin/clang -emit-llvm -S -O1 inputs/input_for_mba.c -o input_for_mba.ll
gdb --args $LLVM_DIR/bin/opt -load-pass-plugin <build_dir>/lib/libMBAAdd.so -passes=mba-add input_for_mba.ll -o out.ll
(gdb) b MBAAdd.cpp:MBAAdd::run
(gdb) r
At this point, GDB should break at the entry to MBAAdd::run.
LLVM is quite a complex project (to put it mildly) and passes lie at its center - this is true for any multi-pass compiler. In order to manage the passes, a compiler needs a pass manager. LLVM currently enjoys not one, but two pass managers. This is important because depending on which pass manager you decide to use, the implementation of your pass (and in particular how you register it) will look slightly different.
As I mentioned earlier, there are two pass managers in LLVM:
- Legacy Pass Manager which currently is the default pass manager
  - It is implemented in the legacy namespace
  - It is very well documented (more specifically, writing and registering a pass within the Legacy PM is very well documented)
- New Pass Manager aka Pass Manager (that's how it's referred to in the code base)
  - I understand that it is soon to become the default pass manager in LLVM
  - The source code is very thoroughly commented, but there is no official documentation. Min-Yih Hsu kindly wrote this great blog series that you can refer to instead.
If you are not sure which pass manager to use, it is probably best to make sure that your passes are compatible with both. Fortunately, once you have an implementation that works with one of them, it's relatively straightforward to extend it so that it works with the other one as well.
MBAAdd implements the interface for both pass managers. This is how you will use it with the legacy pass manager:
$LLVM_DIR/bin/opt -load <build_dir>/lib/libMBAAdd.so -legacy-mba-add input_for_mba.ll -o out.ll
And this is how you run it with the new pass manager:
$LLVM_DIR/bin/opt -load-pass-plugin <build_dir>/lib/libMBAAdd.so -passes=mba-add input_for_mba.ll -o out.ll
There are two differences:
- the way you load your plugin: -load vs -load-pass-plugin
- the way you specify which pass/plugin to run: -legacy-mba-add vs -passes=mba-add
These differences stem from the fact that in the case of Legacy Pass Manager you register a new command line option for opt, whereas New Pass Manager simply requires you to define a pass pipeline (with -passes=).
This is first and foremost a community effort. This project wouldn't be possible without the amazing LLVM online documentation, the plethora of great comments in the source code, and the llvm-dev mailing list. Thank you!
It goes without saying that there are plenty of great presentations on YouTube, blog posts and GitHub projects that cover similar subjects. I've learnt a great deal from them - thank you all for sharing! There's one presentation/tutorial that has been particularly important in my journey as an aspiring LLVM developer and that helped to democratise out-of-source pass development:
- "Building, Testing and Debugging a Simple out-of-tree LLVM Pass" Serge Guelton, Adrien Guinet (slides, video)
Adrien and Serge came up with some great, illustrative and self-contained examples that are great for learning and tutoring LLVM pass development. You'll notice that there are similar transformation and analysis passes available in this project. The implementations available here reflect what I (aka banach-space) found most challenging while studying them.
I also want to thank Min-Yih Hsu for his blog series "Writing LLVM Pass in 2018". It was invaluable in understanding how the new pass manager works and how to use it. Last, but not least, I am very grateful to Nick Sumner (e.g. llvm-demo) and Mike Shah (see Mike's Fosdem 2018 talk) for sharing their knowledge online. I have learnt a great deal from it, thank you! I always look up to those of us brave and bright enough to work in academia - thank you for driving the education and research forward!
Below is a list of LLVM resources available outside the official online documentation that I have found very helpful. Where possible, the items are sorted by date.
- LLVM IR
- Legacy vs New Pass Manager
- Examples in LLVM
- Examples in LLVM source tree in llvm/examples/IRTransforms/. This was recently added in the following commit:
commit 7d0b1d77b3d4d47df477519fd1bf099b3df6f899
Author: Florian Hahn
Date: Tue Nov 12 14:06:12 2019 +0000
[Examples] Add IRTransformations directory to examples.
- LLVM Pass Development
- "Getting Started With LLVM: Basics ", J. Paquette, F. Hahn, LLVM Dev Meeting 2019 video
- "Writing an LLVM Pass: 101", A. Warzyński, LLVM Dev Meeting 2019 video
- "Writing LLVM Pass in 2018", Min-Yih Hsu, blog series
- "Building, Testing and Debugging a Simple out-of-tree LLVM Pass" Serge Guelton, Adrien Guinet, LLVM Dev Meeting 2015 (slides, video)
- LLVM Based Tools Development
The MIT License (MIT)
Copyright (c) 2019 Andrzej Warzyński
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.