rBase is a utility designed to determine the base address of arbitrary binary images. It uses a generic technique which works on all kinds of binary images, including BIOSes, boot loaders, kernels etc., and is agnostic to the operating system of the image. It does, however, require the user to provide the bitness (e.g. 32-bit or 64-bit) and endianness of the image to be examined. It is written in Rust and should run on Linux, macOS and Windows systems (thanks to the use of cargo-dist). Its primary purpose is to allow the user to correctly set the base address of an image when loading it into a disassembler for analysis.
$ cargo run --release -- data/Image-arm32le
Compiling rbase v0.1.9 (/home/jon/git/rbase)
Finished release [optimized] target(s) in 0.78s
Running `target/release/rbase data/Image-arm32le`
ARGS
file: data/Image-arm32le
size: 32-bit
endian: little
max: 1024
min: 10
max strings: 100000
max addresses: 1000000
[00:00:00] [00:00:00] Finding strings (100%) [██████████████████████████████] Found: 73725 strings
[00:00:00] [00:00:00] Indexing strings (100%) [██████████████████████████████]
[00:00:00] [00:00:00] Finding addresses (100%) [██████████████████████████████] Found: 1156144 addresses
[00:00:00] [00:00:00] Indexing addresses (100%) [██████████████████████████████]
[00:00:00] [00:00:00] Collecting candidate base addresses (100%) [██████████████████████████████] Found: 877878 candidate base addresses
Found: 700734 recurring candidate base addresses
1: 0xc0208000: 33991 (3.87%)
2: 0xc0207000: 5808 (0.66%)
3: 0xc0209000: 5726 (0.65%)
4: 0xc020a000: 5635 (0.64%)
5: 0xc0204000: 5529 (0.63%)
6: 0xc0212000: 5514 (0.63%)
7: 0xc0206000: 5478 (0.62%)
8: 0xc020b000: 5463 (0.62%)
9: 0xc0211000: 5435 (0.62%)
10: 0xc020d000: 5383 (0.61%)
Found base: c0208000
Took: 697.464047ms
rBase is highly parallelised and can determine the base address of even large binaries in a few seconds (thanks to the use of rayon and dashmap), while its progress is displayed using indicatif. It is designed around a simple, easy-to-understand algorithm and, thanks to the ready availability of advanced libraries in the form of Cargo crates, is implemented in less than 400 lines of Rust (much of which is boilerplate code). The project makes use of devcontainers in order to provide a readily reproducible build environment, but it should build on any recent version of Rust and doesn't make use of any unstable Rust features.
The algorithm used to determine the base address is built on analysing the addresses (pointers) and strings within the binary. Strings are used within binaries for many purposes, from symbol names to messages displayed to the user. If we can determine the offset of a string within an image, and the value of a pointer which references it, then we can calculate the base address of the image as follows:
pointer = base + offset
Therefore:
base = pointer - offset
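For example (with values chosen purely for illustration), a string found at offset 0x1234 within the image, referenced by a pointer holding the value 0xc0209234, would give:
base = 0xc0209234 - 0x1234 = 0xc0208000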
We can use regular expressions to find strings within our binary with a relatively low error rate. We simply use a regex which looks for a series of consecutive printable or space characters followed by a NUL character. This can be fine-tuned at the command line, but by default rBase will search for strings between 10 and 1024 characters in length. Determining what is a pointer and what is just data is much more error prone, however: we simply have to assume that any aligned word within our image is a potential pointer. Whilst we could search for the patterns used by symbol tables in various image types, this would prevent our solution from being universal. Moreover, even if we could determine which words are indeed pointers, we would have no way of telling which string each should point to.
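As a rough sketch (not rBase's actual code), the string scan might look something like the following, using the regex crate's bytes API, with the configurable minimum and maximum lengths as the repetition bounds:

use regex::bytes::Regex;

// A minimal sketch: find runs of printable ASCII or space characters,
// between `min` and `max` bytes long, terminated by a NUL byte, and
// return the offset of each match within the image.
fn find_string_offsets(data: &[u8], min: usize, max: usize) -> Vec<usize> {
    let pattern = format!(r"[ -~]{{{},{}}}\x00", min, max);
    let re = Regex::new(&pattern).expect("invalid regex");
    re.find_iter(data).map(|m| m.start()).collect()
}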
Each string and pointer we compare will give us a different potential base address for our image. By comparing every combination of string and pointer, we can analyse the frequency at which each base address occurs; the one which is found most frequently is likely to be the actual base address. We can reduce the number of strings and pointers we compare by sampling our data. Indeed, by default we limit the number of strings to 100,000 and the number of pointers to 1,000,000 (both configurable by command line parameter). However, this still gives us a huge problem space:
100,000 strings * 1,000,000 pointers = 100,000,000,000 comparisons
This is too much work for even the most performant processor, even when distributed over multiple threads.
Thankfully, a very simple optimization can massively reduce this problem space: alignment. We assume that our image is going to be aligned on a 4 KiB page boundary (i.e. the low 12 bits of its base address must be zero). We can therefore reason:
IF:
base = pointer - offset
THEN:
(base & 0xFFF) = (pointer & 0xFFF) - (offset & 0xFFF)    (mod 0x1000)
Therefore:
IF:
(base & 0xFFF) = 0
THEN:
(pointer & 0xFFF) = (offset & 0xFFF)
Therefore, we need only compare strings and pointers if their low 12 bits (their page offsets) match. Assuming an even distribution of the low 12 bits of our string offsets and pointers, we can calculate the number of necessary comparisons as follows:
For each possible page offset:
(100,000 strings / 4096) * (1,000,000 pointers / 4096) ~= 5960 comparisons
Thus:
5960 comparisons * 4096 page offsets ~= 24,414,062 comparisons
Note that, given our base address is non-negative, we should ignore any comparisons where pointer < offset.
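Expressed as code, the check above might look like the following sketch (the function name is illustrative, not taken from rBase):

// Only pair a pointer with a string offset when their low 12 bits (page
// offsets) match and the pointer is not smaller than the offset; the
// difference is then a 4 KiB-aligned candidate base address.
fn candidate_base(pointer: u64, offset: u64) -> Option<u64> {
    if pointer >= offset && (pointer & 0xFFF) == (offset & 0xFFF) {
        Some(pointer - offset)
    } else {
        None
    }
}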
The implementation is split into 4 functions:
get_strings_by_page_offset
get_addresses_by_page_offset
get_base_address
main
get_strings_by_page_offset takes the image file as input and splits it into several chunks for parallel processing. If the chunks were simply adjacent to each other, then a string could potentially overlap a chunk boundary. We therefore break the input into overlapping chunks, where the overlap is the size of the largest string to search for minus one.
The function then uses rayon to process each of the chunks in parallel, using a Regex iterator to search for matches. The offsets of these strings are stored in a DashSet (a concurrent set type implemented by the dashmap crate).
Lastly, we build (again in parallel) a DashMap to hold this data. The string offsets are stored in Vecs indexed by their page offset. Note that we use the take_any function of the ParallelIterator to sample our data. Whilst this doesn't give us a random sample, it is very performant and empirical evidence suggests it is sufficient.
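Putting those pieces together, a rough sketch of this function might look as follows (the chunk size, names and details are illustrative; rBase's actual implementation differs):

use dashmap::{DashMap, DashSet};
use rayon::prelude::*;
use regex::bytes::Regex;

// Sketch: scan overlapping chunks of the image for strings in parallel,
// then bucket a sample of the resulting offsets by their low 12 bits.
fn strings_by_page_offset(
    data: &[u8],
    re: &Regex,
    max_len: usize,
    max_strings: usize,
) -> DashMap<u64, Vec<u64>> {
    let chunk = 1 << 20; // arbitrary 1 MiB chunks for the sketch
    let offsets: DashSet<u64> = DashSet::new();

    // Overlap each chunk with the next by `max_len - 1` bytes so a string
    // straddling a chunk boundary is still found; the DashSet removes the
    // duplicates this produces.
    (0..data.len()).step_by(chunk).par_bridge().for_each(|start| {
        let end = (start + chunk + max_len - 1).min(data.len());
        for m in re.find_iter(&data[start..end]) {
            offsets.insert((start + m.start()) as u64);
        }
    });

    // Index a sample of the offsets by their page offset (low 12 bits).
    let index: DashMap<u64, Vec<u64>> = DashMap::new();
    offsets
        .into_iter()
        .par_bridge()
        .take_any(max_strings)
        .for_each(|off| index.entry(off & 0xFFF).or_default().push(off));
    index
}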
Rather than interpreting the image as a byte array, get_addresses_by_page_offset interprets it as an array of words. This array of words is split into chunks by rayon and all non-zero words are collected into a DashSet.
Again, this set is processed (in parallel) into a DashMap. The words are stored in Vecs indexed by their page offset, and again we use take_any to sample our data.
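A rough sketch for the 32-bit, little-endian case might look like this (again illustrative rather than rBase's actual code):

use dashmap::DashMap;
use rayon::prelude::*;

// Sketch: treat every aligned 32-bit little-endian word in the image as a
// potential pointer and bucket a sample of the non-zero values by their
// low 12 bits.
fn addresses_by_page_offset(data: &[u8], max_addresses: usize) -> DashMap<u64, Vec<u64>> {
    let index: DashMap<u64, Vec<u64>> = DashMap::new();
    data.par_chunks_exact(4)
        .map(|w| u32::from_le_bytes([w[0], w[1], w[2], w[3]]) as u64)
        .filter(|&w| w != 0)
        .take_any(max_addresses)
        .for_each(|w| index.entry(w & 0xFFF).or_default().push(w));
    index
}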
get_base_address is responsible for processing the DashMaps built by get_strings_by_page_offset and get_addresses_by_page_offset. It iterates over the keys of the strings index and looks up the corresponding entry in the addresses index. Then, for each combination of string and address from the two lists, it first checks that the address is greater than or equal to the string offset (recall that anything else would indicate a negative base address) and discards the rest. It then subtracts the string offset from the address to find a candidate base address. This base address is inserted into a DashMap which uses the base address as its key and stores a simple counter of occurrences as its value.
We then process this DashMap to filter out any entries which occur only once. This dramatically reduces its size, improving the performance of the next step. Next, we extract the filtered data into a Vec of key/value pairs and sort them by value (the number of occurrences). We then print the frequency of the top 10 candidate base addresses (to give the user an idea of the margin between the most frequent base address and the other candidates) before returning the most frequently found address as our result.
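The counting and ranking might be sketched as follows (illustrative; rBase's actual implementation differs in detail):

use dashmap::DashMap;
use rayon::prelude::*;

// Sketch: for every string/address pair sharing a page offset, count the
// implied base address, drop candidates seen only once, and return the
// most frequent one.
fn most_common_base(
    strings: &DashMap<u64, Vec<u64>>,
    addresses: &DashMap<u64, Vec<u64>>,
) -> Option<u64> {
    let counts: DashMap<u64, u64> = DashMap::new();
    (0u64..0x1000).into_par_iter().for_each(|page_offset| {
        if let (Some(offs), Some(addrs)) = (strings.get(&page_offset), addresses.get(&page_offset)) {
            for &offset in offs.iter() {
                for &addr in addrs.iter() {
                    if addr >= offset {
                        *counts.entry(addr - offset).or_insert(0) += 1;
                    }
                }
            }
        }
    });

    // Keep only recurring candidates, then sort by frequency (descending).
    let mut candidates: Vec<(u64, u64)> = counts.into_iter().filter(|&(_, n)| n > 1).collect();
    candidates.sort_by_key(|&(_, n)| std::cmp::Reverse(n));
    candidates.first().map(|&(base, _)| base)
}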
main is responsible for parsing the arguments passed by the user on the command line, using clap and its derive feature to represent the command line input as a struct. It then uses memmap2 to map our input file before passing its data to the remaining functions for analysis and printing our results.
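A minimal sketch of this might look as follows (the struct fields and defaults are illustrative, not rBase's actual command line interface; it assumes clap with the derive feature enabled, plus memmap2):

use clap::Parser;
use memmap2::Mmap;
use std::fs::File;

#[derive(Parser)]
struct Args {
    /// Path of the image to analyse
    file: std::path::PathBuf,
    /// Minimum string length
    #[arg(long, default_value_t = 10)]
    min: usize,
    /// Maximum string length
    #[arg(long, default_value_t = 1024)]
    max: usize,
}

fn main() -> std::io::Result<()> {
    let args = Args::parse();
    let file = File::open(&args.file)?;
    // Safety: the mapping is only read, and the file is not expected to be
    // modified while we analyse it.
    let data = unsafe { Mmap::map(&file)? };
    println!("mapped {} bytes from {}", data.len(), args.file.display());
    // ... pass &data[..] to the analysis functions described above ...
    Ok(())
}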