Enhance C Parser to allow named array bounds #754

Xaphiosis · 2024-05-22T02:09:53Z

Problem Statement

When trying to generalise proofs across multiple platforms or configurations, we do OK on the abstract/design side, but encounter a significant issue with ArrayBounds during CRefine: they show up as numbers, then get mangled via simp, which makes it very difficult to write a generic proof.

Let's use this C code as an example:

enum {
    numDomains = 16
};

int domainMap[numDomains];

void updateDom(int i, int k) {
    domainMap[i] = k;
}

In the above, we clearly have an enum for numDomains. Normally when used in an expression, unlike preprocessor macros, an enum will show up wrapped in its name:

term numDomains (* "numDomains" :: "32 signed word" *)
thm numDomains_def (* numDomains ≡ 0x10 *)

This is useful for abstracting parameters, because we can bind a preprocessor macro to an enum and as long as it fits in an int, we can refer to it in proofs. However, we don't get to see the enum in the bounds checking. Here's updateDom_body_def:

test_global_addresses.updateDom_body ≡
TRY
  Guard ArrayBounds ⦃´i <s 0x10⦄
   (Guard ArrayBounds ⦃0 ≤s ´i⦄ (´domainMap :== Arrays.update ´domainMap (unat ´i) ´k))
CATCH SKIP
END

Note the Guard ArrayBounds using 0x10. There's knowledge lost between the expression used to define the array and the calculated number. There's no way to link the bound to some more abstract value/condition. Simplification rules can dig in and generate "interesting" numbers like 2147483632, or discover that i < 1 can be simplified to i = 0 making the proof specific to the number, instead of being generic in numDomains, meaning we now need to remove Word.word_less_1 and hope for the best, etc.

General idea and complication

Obviously the C parser needs to calculate the actual size of the array bounds in order to declare its type. No matter what we do, the above domainMap is going to end up a 32 signed word[16] (for 32-bit ints).

For verification and links to other abstraction levels, we need to get hold of the expression used to calculate the size of the array in some manner.

I'm pretty sure that the C parser generates the bounds check expression from the array type rather than its name, so we run into the issue that conceptually one 'a word[16] is not the same as another 'a word[16] when it comes to bounds calculations (which is plausible given that 16 could be anything really). This means we need to interfere with the ArrayBounds generation and make it somehow related to the specific variable name.

Idea 1: Heavy annotation

This was Raf's first idea in this direction. Add an AUX annotation of some kind that we put in the C code to specify what expression we want to use directly, something like:

/** AUXBOUND: "Kernel_Config.numDomains" */
int domainMap[numDomains];

PROS:

can use expression from higher level abstraction directly
high level of control

CONS:

have to annotate C code
need to add extra AUX annotation to C Parser
C Parser needs to generate proof that our expression is the same as the calculated array size
no idea what this would look like for a 2D array
expression can't use anything that's generated during the C parse, and Kernel_C doesn't see any abstract/design specs either, meaning duplication like with ctcb_size_bits

Idea 2: Automaic named annotation

After discussion with Raf, Gerwin proposed this: have the C parser generate a constant for that array only, and use that constant for bounds checking. E.g. for the above, something like domainMap_array_size ≡ 0x10.
We can then separately create a ..._def' lemma that shows the link to abstract concepts, and unfold with that when we want to be generic.

PROS:

no need to annotate C code or addition of parsing extra annotations to C parser
no need for C parser to generate proof, as the new definition only wraps the constant

CONS:

this will change all bounds checks to need unfolding of these array constants; big intrusion into proof
need to resolve name shadowing: global arrays are OK, but local vars become a problem when they shadow global name or another local name
what to do with 2D arrays? (none of the other proposed solutions know what to do here)

Idea 3: Keep and throw in entire expression

This is like Idea 2 (and might need to be a transitional stage or prototype to idea 2), but instead of generating the name, we keep the original expression, and whenever we see a bounds check, we dump the entire expression in there.
The downside is that instead of 0x10 or whatever, you'll get the whole calculation, even if it's ((1 << 16) - 1) && 0xF or something.

PROS:

easiest proof-of-concept prototype for keeping expressions of any kind with array variable names
no name shadowing problem, since no names generated

CONS:

were we to actually use it in proofs, this would cause problems because term-based calculation will resolve stuff that the simplifier will make a mess of at times, yielding even more bizarre conditions
still no idea how to handle 2D arrays

The text was updated successfully, but these errors were encountered:

lsf37 · 2024-05-22T03:23:00Z

Excellent write-up, thank you!

One variation I thought of when reading through it -- probably not worth it, but I wanted to record it anyway just so we can decide about it consciously.

Variation/Extension to Idea 2:

There is a special case where the parser might not have to make up a new name for the bound: if the dimension was declared using an enum, the bound can be that enum constant. This would require the parser to track what that name was and then use it later instead of referring to the dimension or the name of the array only.

PROS:

this special case occurs in the kernel and might make things slightly nicer when it applies.

CONS:

might need too much tracking in the C parser
might be too much magic in terms of guard being treated differently depending on how the array was declared

Overall, there is a connection to the new value_type command from all this, which of course came long after the C parser, so it is not used there. value_type generates a type, a constant, a _val theorem with the number, and _def theorem with the symbolic expression that it got as input.

Maybe that code could be used in the parser (or adapted for it) to generate the same things for array sizes. This would not only give us a symbolic name for the guard, but also a symbolic name for the array size that we can use in the state relation and other places. The _def theorem could potentially also provide access to the original expression if that is of any value after preprocessing (e.g. it could be the connection to the enum constant if that is used).

The value_type angle can probably be used with most of the ideas we have come up with so far.

Xaphiosis · 2024-05-22T03:47:11Z

With the V/E 2, you still need to resolve the issue of storing an extra expression along with a variable to use for bounds checking in lieu of its type, but you can indeed get away with not generating a new theorem when the entire size expression is a unat of an enum member.
value_type is interesting, but I am not quite seeing what you're thinking... you'd get a definition for a type and a type abbreviation (input) for that type. If we say that the type we're talking about is the specific size of array linked to a variable name, that's very similar to Idea 2 but you get an input abbreviation one could refer to. It won't ever show up in print output...

lsf37 · 2024-05-22T04:30:19Z

With the V/E 2, you still need to resolve the issue of storing an extra expression along with a variable to use for bounds checking in lieu of its type, but you can indeed get away with not generating a new theorem when the entire size expression is a unat of an enum member.

Agreed

value_type is interesting, but I am not quite seeing what you're thinking... you'd get a definition for a type and a type abbreviation (input) for that type.

Correct

If we say that the type we're talking about is the specific size of array linked to a variable name, that's very similar to Idea 2 but you get an input abbreviation one could refer to. It won't ever show up in print output...

If we only do value_type it is very similar to idea 2, and yes, the type abbreviation will not help for output, but it will exist to be used for input. E.g. currently we repeat a value_type definition for some array sizes for things like VCPU registers in StateRelation. If the parser already gives us a name, we wouldn't need a second definition, but could just refer to that name there and be guaranteed that it is the same value as in the C code. We could also refer to the constant name in the state relation to talk about index bounds that are later used in those arrays (we could do this part with just idea 2 already, though, by using the generated array bound name).

Xaphiosis added enhancement C-parser anything about the C/Simpl parser proof tools convenience, automation, productivity tools labels May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance C Parser to allow named array bounds #754

Enhance C Parser to allow named array bounds #754

Xaphiosis commented May 22, 2024

lsf37 commented May 22, 2024

Xaphiosis commented May 22, 2024

lsf37 commented May 22, 2024

Enhance C Parser to allow named array bounds #754

Enhance C Parser to allow named array bounds #754

Comments

Xaphiosis commented May 22, 2024

Problem Statement

General idea and complication

Idea 1: Heavy annotation

Idea 2: Automaic named annotation

Idea 3: Keep and throw in entire expression

lsf37 commented May 22, 2024

Variation/Extension to Idea 2:

Xaphiosis commented May 22, 2024

lsf37 commented May 22, 2024