Skip to content

Latest commit

 

History

History
657 lines (396 loc) · 15.8 KB

DEMO.md

File metadata and controls

657 lines (396 loc) · 15.8 KB

GrammarGraph Feature Demo

GrammarGraph lets you build a directed graph from a context-free grammar (CFG). A context-free grammar is a mapping from nonterminals to lists of expansion alternatives. For instance, the following grammar represents a CSV file:

import string
from typing import List, Dict

CSV_GRAMMAR: Dict[str, List[str]] = {
    '<start>': ['<csvline>'],
    '<csvline>': ['<items>'],
    '<items>': ['<item>,<items>', '<item>'],
    '<item>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters + string.digits + string.punctuation + ' \t\n')
}

This particular representation of a grammar is based on the Fuzzing Book. Such a grammar can, e.g., be used to create random CSV files (based on Fuzzing Book implementations, this is independent from GrammarGraph):

from fuzzingbook.GrammarCoverageFuzzer import GrammarCoverageFuzzer

fuzzer = GrammarCoverageFuzzer(CSV_GRAMMAR)

for _ in range(10):
    print(fuzzer.fuzz())
r,#k
!Udi,z`C/,?t.
@sYh,nW
	cK
+
yT
x
N}X
|<q',b,j%,^-="[Hw,S
l,
u,Q,7aO

Visualization

Using GrammarGraph, we can visualize this grammar based on GraphViz. To make sure everything fits on the screen, we reduce the number of alternatives for the <letter> nonterminal.

CSV_GRAMMAR["<letter>"] = ['a', 'b', 'c', '1', '2', '3']

from grammar_graph import GrammarGraph

graph = GrammarGraph.from_grammar(CSV_GRAMMAR)
graph.to_dot()

svg

Sub Graphs

We can create and visualize sub graphs:

graph.subgraph("<letters>").to_dot()

svg

We can also check whether a subgraph is a tree structure (and can thus, e.g., be trivially converted to a regular expression), or whether a node is reachable from another one:

graph.subgraph("<letter>").is_tree()
True
graph.subgraph("<letters>").is_tree()
False
letters = graph.get_node("<letters>")
letters.reachable(letters)
True

You can also create a sub grammar from a sub graph.

graph.subgraph("<letters>").to_grammar()
{'<start>': ['<letters>'],
 '<letters>': ['<letter><letters>', '<letter>'],
 '<letter>': ['a', 'b', 'c', '1', '2', '3']}

Search, Filtering, Shortest Paths

GrammarGraph features implementations of breadth-first search, filtering, and shortest path discovery (based on Dijkstra's algorithm with Fibonacci heaps). This can, e.g., be used to embed a subtree into a bigger context. For instance, the shortest path from <items> to <letter> is:

[node.symbol for node in graph.shortest_path(graph.get_node("<items>"), graph.get_node("<letter>"))]
['<items>', '<item>', '<letters>', '<letter>']

Let us assume we have CSV item "abc" from which we want to create a (random) CSV file. We can accomplish this by finding the shortest path from <start> to <item> and follow this path, choosing an appropriate grammar production rule along the way.

from fuzzingbook.Grammars import unreachable_nonterminals, is_nonterminal
from fuzzingbook.Parser import EarleyParser
import copy

item_string = "abc"

# We massage the grammar a little bit to get a tree starting at `<item>`
item_grammar = copy.deepcopy(CSV_GRAMMAR)
item_grammar["<start>"] = ["<item>"]
for unreachable in unreachable_nonterminals(item_grammar):
    del item_grammar[unreachable]

item_tree = next(EarleyParser(item_grammar).parse(item_string))[1][0]
item_tree
('<item>',
 [('<letters>',
   [('<letter>', [('a', [])]),
    ('<letters>',
     [('<letter>', [('b', [])]),
      ('<letters>', [('<letter>', [('c', [])])])])])])

The path we need to follow for creating a complete file embedding item is:

item_node = graph.get_node("<item>")

path = [node.symbol for node in graph.shortest_path(graph.root, item_node)]
path
['<start>', '<csvline>', '<items>', '<item>']

So let's create a derivation tree. In principle, such a tree will be incomplete, namely if an expansion alternative contains nonterminals which we do not have to follow. Such incomplete nodes have None as children and can later on be instantiated, e.g., by a fuzzer. For our simple CSV grammar, however, this is not the case. In the algorithm sketched below we still account for this.

We use a "canonical" grammar representation for simplicity.

from fuzzingbook.Parser import canonical

canonical_grammar = canonical(CSV_GRAMMAR)
canonical_grammar
{'<start>': [['<csvline>']],
 '<csvline>': [['<items>']],
 '<items>': [['<item>', ',', '<items>'], ['<item>']],
 '<item>': [['<letters>']],
 '<letters>': [['<letter>', '<letters>'], ['<letter>']],
 '<letter>': [['a'], ['b'], ['c'], ['1'], ['2'], ['3']]}
assert graph.root.reachable(item_node)


def wrap_in_tree_starting_in(start_nonterminal: str, tree, grammar, graph: GrammarGraph):
    start_node = graph.get_node(start_nonterminal)
    end_node = graph.get_node(tree[0])
    assert start_node.reachable(end_node)

    derivation_path = [n.symbol for n in graph.shortest_non_trivial_path(start_node, end_node)]

    result = (start_nonterminal, [])
    curr_tree = result
    for path_idx in range(len(derivation_path) - 1):
        path_nonterminal = derivation_path[path_idx]
        next_nonterminal = derivation_path[path_idx + 1]
        alternatives_for_path_nonterminal = [a for a in grammar[path_nonterminal]
                                             if next_nonterminal in a]
        shortest_alt_for_path_nonterminal = \
            [a for a in alternatives_for_path_nonterminal
             if not any(a_ for a_ in alternatives_for_path_nonterminal
                        if len(a_) < len(a))][0]
        idx_of_next_nonterminal = shortest_alt_for_path_nonterminal.index(next_nonterminal)
        for alt_idx, alt_symbol in enumerate(shortest_alt_for_path_nonterminal):
            if alt_idx == idx_of_next_nonterminal:
                if path_idx == len(derivation_path) - 2:
                    curr_tree[1].append(tree)
                else:
                    curr_tree[1].append((alt_symbol, []))
            else:
                curr_tree[1].append((alt_symbol, None if is_nonterminal(alt_symbol) else []))

        curr_tree = curr_tree[1][idx_of_next_nonterminal]

    return result


wrapped_tree = wrap_in_tree_starting_in("<start>", item_tree, canonical_grammar, graph)
wrapped_tree
('<start>',
 [('<csvline>',
   [('<items>',
     [('<item>',
       [('<letters>',
         [('<letter>', [('a', [])]),
          ('<letters>',
           [('<letter>', [('b', [])]),
            ('<letters>', [('<letter>', [('c', [])])])])])])])])])
from fuzzingbook.GrammarFuzzer import tree_to_string

tree_to_string(wrapped_tree)
'abc'

Now let's assume that we want to add another item into that tree. We can do so, e.g., by replacing the top-level <items> node with another <items> node expanded once more. The shortest node from <items> to <items>, however, is trivial:

items = graph.get_node("<items>")
[node.symbol for node in graph.shortest_path(items, items)]
['<items>']

The function get_shortest_non_trivial_path returns a nontrivial path which is useful for our purposes:

[node.symbol for node in graph.shortest_non_trivial_path(items, items)]
['<items>', '<items>']

The astute reader may have discovered that this method is also used in the above declaration of wrap_in_tree_starting_in, which is why we can use this routine also for mapping a tree into one starting with the same nonterminal!

items_tree = ('<items>',
              [('<item>',
                [('<letters>',
                  [('<letter>', [('a', [])]),
                   ('<letters>',
                    [('<letter>', [('b', [])]),
                     ('<letters>', [('<letter>', [('c', [])])])])])])])

wrapped_tree = wrap_in_tree_starting_in("<items>", items_tree, canonical_grammar, graph)
wrapped_tree
('<items>',
 [('<item>', None),
  (',', []),
  ('<items>',
   [('<item>',
     [('<letters>',
       [('<letter>', [('a', [])]),
        ('<letters>',
         [('<letter>', [('b', [])]),
          ('<letters>', [('<letter>', [('c', [])])])])])])])])

If we now embed this tree into a one starting in <start>, we have an incomplete derivation tree with a "hole" for another item.

wrapped_tree = wrap_in_tree_starting_in("<start>", wrapped_tree, canonical_grammar, graph)
wrapped_tree
('<start>',
 [('<csvline>',
   [('<items>',
     [('<item>', None),
      (',', []),
      ('<items>',
       [('<item>',
         [('<letters>',
           [('<letter>', [('a', [])]),
            ('<letters>',
             [('<letter>', [('b', [])]),
              ('<letters>', [('<letter>', [('c', [])])])])])])])])])])
tree_to_string(wrapped_tree)
',abc'

Let's complete this tree using the fuzzer:

complete_tree = fuzzer.expand_tree(wrapped_tree)
complete_tree
('<start>',
 [('<csvline>',
   [('<items>',
     [('<item>', [('<letters>', [('<letter>', [('1', [])])])]),
      (',', []),
      ('<items>',
       [('<item>',
         [('<letters>',
           [('<letter>', [('a', [])]),
            ('<letters>',
             [('<letter>', [('b', [])]),
              ('<letters>', [('<letter>', [('c', [])])])])])])])])])])
tree_to_string(complete_tree)
'1,abc'

k-path Coverage

k-paths is a grammar coverage metric proposed by Havrikov et al.. The idea is to extract all paths of length k from the grammar as well as from a derivation tree, and check how many grammar paths are covered by the derivation tree.

We experiment with an arithmetic expressions grammar, which is the initial example from Havrikov et al.

from fuzzingbook.Grammars import srange
expr_grammar = {
    "<start>": ["<add_expr>"],
    "<add_expr>": ["<mult_expr>", "<add_expr> <add_symbol> <mult_expr>"],
    "<add_symbol>": ["+", "-"],
    "<mult_expr>": ["<unary_expr>", "<mult_expr> <mult_symbol> <unary_expr>"],
    "<mult_symbol>": ["*", "/", "%"],
    "<unary_expr>": ["<id>", "<unary_symbol><unary_expr>", "(<add_expr>)", "<dec_digits>"],
    "<unary_symbol>": ["+", "-", "++", "--"],
    "<dec_digits>": ["<dec_digit><maybe_dec_digits>"],
    "<maybe_dec_digits>": ["", "<dec_digits>"],
    "<dec_digit>": srange(string.digits),
    "<id>": ["x", "y", "z"]
}

First, we obtain the 2-paths from the grammar:

graph = GrammarGraph.from_grammar(expr_grammar)
two_paths = graph.k_paths(2)

print("\n".join(map(str, [" ".join(map(str, [node.symbol for node in path])) for path in two_paths])))
<start> <start>-choice-1 <add_expr>
<add_expr> <add_expr>-choice-1 <mult_expr>
<add_expr> <add_expr>-choice-2 <add_expr>
<add_expr> <add_expr>-choice-2  
<add_expr> <add_expr>-choice-2 <add_symbol>
<add_expr> <add_expr>-choice-2  
<add_expr> <add_expr>-choice-2 <mult_expr>
<mult_expr> <mult_expr>-choice-1 <unary_expr>
<mult_expr> <mult_expr>-choice-2 <mult_expr>
<mult_expr> <mult_expr>-choice-2  
<mult_expr> <mult_expr>-choice-2 <mult_symbol>
<mult_expr> <mult_expr>-choice-2  
<mult_expr> <mult_expr>-choice-2 <unary_expr>
<add_symbol> <add_symbol>-choice-1 +
<add_symbol> <add_symbol>-choice-2 -
<unary_expr> <unary_expr>-choice-1 <id>
<unary_expr> <unary_expr>-choice-2 <unary_symbol>
<unary_expr> <unary_expr>-choice-2 <unary_expr>
<unary_expr> <unary_expr>-choice-3 (
<unary_expr> <unary_expr>-choice-3 <add_expr>
<unary_expr> <unary_expr>-choice-3 )
<unary_expr> <unary_expr>-choice-4 <dec_digits>
<mult_symbol> <mult_symbol>-choice-1 *
<mult_symbol> <mult_symbol>-choice-2 /
<mult_symbol> <mult_symbol>-choice-3 %
<id> <id>-choice-1 x
<id> <id>-choice-2 y
<id> <id>-choice-3 z
<unary_symbol> <unary_symbol>-choice-1 +
<unary_symbol> <unary_symbol>-choice-2 -
<unary_symbol> <unary_symbol>-choice-3 ++
<unary_symbol> <unary_symbol>-choice-4 --
<dec_digits> <dec_digits>-choice-1 <dec_digit>
<dec_digits> <dec_digits>-choice-1 <maybe_dec_digits>
<dec_digit> <dec_digit>-choice-1 0
<dec_digit> <dec_digit>-choice-2 1
<dec_digit> <dec_digit>-choice-3 2
<dec_digit> <dec_digit>-choice-4 3
<dec_digit> <dec_digit>-choice-5 4
<dec_digit> <dec_digit>-choice-6 5
<dec_digit> <dec_digit>-choice-7 6
<dec_digit> <dec_digit>-choice-8 7
<dec_digit> <dec_digit>-choice-9 8
<dec_digit> <dec_digit>-choice-10 9
<maybe_dec_digits> <maybe_dec_digits>-choice-1 
<maybe_dec_digits> <maybe_dec_digits>-choice-2 <dec_digits>

Next, we get all 2-paths from the derivation tree of x + 42:

parser = EarleyParser(expr_grammar)
tree = list(parser.parse("x + 42"))[0]
tree_two_paths = graph.k_paths_in_tree(tree, 2)
print("\n".join(map(str, [" ".join(map(str, [node.symbol for node in path])) for path in tree_two_paths])))
<start> <start>-choice-1 <add_expr>
<add_expr> <add_expr>-choice-2 <add_expr>
<add_expr> <add_expr>-choice-1 <mult_expr>
<add_expr> <add_expr>-choice-2  
<add_expr> <add_expr>-choice-2 <add_symbol>
<add_symbol> <add_symbol>-choice-1 +
<add_expr> <add_expr>-choice-2  
<add_expr> <add_expr>-choice-2 <mult_expr>
<mult_expr> <mult_expr>-choice-1 <unary_expr>

This gives rise to the k-path coverate metric:

coverage = len(tree_two_paths) / len(two_paths)
coverage
0.1956521739130435

We can also call the method k_path_coverage directly:

graph.k_path_coverage(tree, 2)
0.1956521739130435

Our input's 3-path coverage is worse, while its 1-path coverage (simple coverage of nonterminals) is better:

graph.k_path_coverage(tree, 3)
0.11764705882352941
graph.k_path_coverage(tree, 1)
0.5454545454545454

Let us fuzz some inputs and check their, say, 3-path coverage:

fuzzer = GrammarCoverageFuzzer(expr_grammar)
for _ in range(20):
    tree = fuzzer.expand_tree(("<start>", None))
    print(f"{(tree_to_string(tree) + ',').ljust(40)} coverage: {round(graph.k_path_coverage(tree, 3)*100, 2)} %")
++(y / z - x % y + y * 8 % x),           coverage: 58.82 %
--++-++937415026373,                     coverage: 14.12 %
6 / -z,                                  coverage: 15.29 %
x / z / (x + z) * ((z % y)),             coverage: 42.35 %
(++--z * y) / 032 + 82 % 4 + x + z / 0 % -z, coverage: 75.29 %
-(x / +z + (x - z) + 7 * (x)),           coverage: 63.53 %
--x % y / ((1 - --21)),                  coverage: 48.24 %
3,                                       coverage: 1.18 %
(9 % z * -31 / 394 + +861) % ++x,        coverage: 61.18 %
x * y * x % 3 * (x + y + x + (x)),       coverage: 51.76 %
x % --(z) - (z / y % z) / 3,             coverage: 54.12 %
0,                                       coverage: 1.18 %
--+99 * 1 % 0,                           coverage: 31.76 %
z / z + 4 / ++(++y * 50 * ((x)) % -(y % y / y) * (x)), coverage: 70.59 %
x + x - x * z * (y - x) - x,             coverage: 47.06 %
++x + 310,                               coverage: 21.18 %
4 % (z % x % --z) / +++--((+(z - x) / 46)) * --1, coverage: 71.76 %
(y + x + y - y / y - x % (x)),           coverage: 41.18 %
z + z / y + y / z - z % ++y,             coverage: 35.29 %
(y * 98) - 32,                           coverage: 36.47 %

We think that representing CFGs as graphs is useful for a number of purposes such as the one we looked into just now. Hopefully it can help you too!