Skip to content

v0.8.0

Compare
Choose a tag to compare
@elliotchance elliotchance released this 22 Aug 22:35
· 113 commits to main since this release
bf3e689
Completely rebuilt parser (#27)

This completely rebuilds the parser from a trivial top-down to using a
Earley parser based on the correct BNF rules described in the 2016 SQL
standard. Or, at least the rules/partial rules that apply to the
features already implemented.

The parser is now generated from the `grammar.bnf` by a python script
(yes, I know, not pure V anymore - I'll replace that in a future patch).
As a consequence of using the correct grammar some notable changes are:

1. The "FROM" clause is mandatory. This is unfortunate, but all tests have
been updated with dummy tables.

2. A new (rather large) list of non-reserved words has been added. These
in combination with the existing reserved words make up the SQL
"key words" and extend the list of bad entity names we were using in
some places. Some examples of this are; T is no longer allowed as a
table name and A is no longer allowed as a column name.

3. LOG() was removed because it didn't follow the SQL standard (which
takes 2 parameters). This can be implemented in a future patch along
with some other basic functions that are missing.

4. NULL cannot be used as a non-contextual value (ie.
`SELECT NULL IS NULL FROM t1`). That means it cannot be used by itself
in an expression where the database couldn't figure out what type it
could be, like an `INSERT`. So a few tests had to be removed. In the
future this will be possible with a `CAST`:
`SELECT CAST(NULL AS INTEGER) IS NULL FROM t1`.

5. More flexible exact numeric values such as `.3` and `3.`.

There still is a lot of cleanup to make in the lexer area but I'll that
for a future patch.

Back story:

This was an unexpectedly extraordinarily complex refactoring that took
weeks instead of days because originally I built an LR(0) (bottom-up
parser) which I thought was sufficient for SQL grammar, but it turns out
that it's not because there are some ambiguous rules that can be
resolve by picking a conflict path.

So, I had to rebuild the entire parser again as as LALR parser which is
also a bottom-up parser but contains one extra look ahead token (hence
the "LA" in the name) but the SQL grammar is still to ambigous for even
this type of parser.

Finally, I rebuilt the parser again for the third and forth time using
an two different Earley parsers. The first was lark which was nice, but
too difficult to rewrite in V, so I had to scrap that. Finally settled
on a minimal Earley parser I found for python which lacks any
optimization (like precalculating tables) and lacked any error handling
other than "parser failed".

The Earley parser works for any ambiguous grammar because it generates
multiple solutions rather than trying to handle conflict resolution. For
our case, we actually don't care which solution we use since they would
all be syntactically valid.

Fortunately, this ended up making the parser way easier (less code) to
implement and I was able to built error handling. Yay!