Explorations: MySQL Query -> AST Parser #153
…e a few if/else lists with switches
@adamziel Wow, this is an incredible demonstration of what's possible with current AI tools! Intuitively, I would never have guessed that it could do such a great job. I'll add one question that's on my mind: why not apply this to a language that can be compiled to WASM, such as Rust? How would that affect performance and bundle size? Would there be a binding cost (since PHP is WASM as well)? Additionally, would builds for new MySQL releases be created in a similar way, or would the parser require manual maintenance from now on?
Thank you @JanJakes, and great questions!
For Playground, that would be brilliant. For WordPress core, that wouldn't help use SQLite at all since PHP has no WASM support yet. I think we may eventually maintain a semi-automatically-maintained C or Rust implementation to speed things up.
The ANTLR .g4 grammar doesn't seem to be maintained. We'd either need to find an up-to-date and well-maintained MariaDB grammar file and figure out a semi-automated workflow, or, if we can't find one, do manual maintenance.
As I'm working through this parser, there are quite a few places that require:
It can all be done in a week or two, but I now wonder whether converting |
public const BIGINT_SYMBOL = 43;
public const REAL_SYMBOL = 44;
public const DOUBLE_SYMBOL = 45;
public const FLOAT_SYMBOL = 46;
next time tell the AI to run its output through WPCS 😆
public const BACKUP_SYMBOL = 100;
public const BEFORE_SYMBOL = 101;
public const BEGIN_SYMBOL = 102;
public const BETWEEN_SYMBOL = 103;
is 104 skipped in the grammar?
const PipesAsConcat = 1;
const HighNotPrecedence = 2;
const NoBackslashEscapes = 4;
public const ANSI_QUOTES = 8;
this is an interesting bit. it appears like this was arbitrarily chosen as the only escaping mode? given an arbitrary enum value?
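For context on the values in this hunk: 1, 2, 4, and 8 are distinct powers of two, which is the usual shape of a bitmask where several modes can be active at once, rather than of an arbitrary enum. The sketch below illustrates that reading in Python. The `SqlMode` class and the `is_mode_active` helper are hypothetical names for illustration, and whether the generated lexer actually combines its constants this way is an assumption, not something this diff confirms.

```python
from enum import IntFlag


class SqlMode(IntFlag):
    """Hypothetical mirror of the four constants in the hunk above."""
    PIPES_AS_CONCAT = 1
    HIGH_NOT_PRECEDENCE = 2
    NO_BACKSLASH_ESCAPES = 4
    ANSI_QUOTES = 8


def is_mode_active(active: int, mode: SqlMode) -> bool:
    """Return True when the bit for `mode` is set in the `active` mask."""
    return (active & mode) != 0


# Because each value occupies its own bit, modes combine with bitwise OR.
active = SqlMode.ANSI_QUOTES | SqlMode.NO_BACKSLASH_ESCAPES
```

With this layout, `is_mode_active(active, SqlMode.ANSI_QUOTES)` is true while `is_mode_active(active, SqlMode.PIPES_AS_CONCAT)` is false, so no single mode has to be "the only" one.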
Crazy idea: ask it to replace all const values with string literals and see how that impacts the built size.
I'll close this PR in favor of #157.
Note
Read the MySQL parser proposal for the full context on this PR.
Description
This PR explores a full MySQL Lexer and Parser to enable an AST-based MySQL -> SQLite, PostgreSQL, etc. bridge. This would be much more stable, easier to maintain, and easier to expand than the token processing approach we use now.
The proposed MySQLLexer.php and MySQLParser.php have plenty of opportunities for optimization and refactoring, but overall they are fantastic starting points for this work and should save us a few months of explorations. The entire code is ~1MB (or 100kb gzipped), has no dependencies, and can parse 500 complex SELECT queries in ~800ms. I think we can reduce that time 10x or so.

How was it done?
I fed the official MySQLParser.g4 grammar to Google Gemini Pro and asked it to build a PHP parser based on that. The 2 million token input context window made it really viable. Gemini can only output ~8000 tokens at a time, so I had to take each response and feed it back to the model as input. It took some time, so I ran a loop overnight and in the morning I had a decent starting point.
I used the generate_parser.py script included in this PR to generate the code. I built that script with AI Studio, where I initially uploaded the MySQL grammar, tuned the model parameters, and perfected the prompt. After the export I only had to make a few adjustments, such as adding a loop, a local file cache, etc.

Once the parser was ready, I used the lexing grammar and a similar prompt to generate the Lexer. Yes, the Lexer came second. I asked Gemini to make the lexer class plug-and-play with the parser class. However, once the Lexer was done, I realized I had used the wrong Parser grammar and regenerated the entire Parser from scratch, only this time I included the Lexer class with my prompt for reference.
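The continuation loop and local file cache described above can be sketched roughly as follows. This is not the code from generate_parser.py: the `generate` callable stands in for whatever model client is used, and the `<<END>>` completion marker, the function name, and the cache layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def generate_long_output(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical model call: request text -> next chunk
    cache_dir: Path,
    done_marker: str = "<<END>>",    # assumed marker the model is asked to emit when finished
    max_rounds: int = 200,
) -> str:
    """Build a long output from a model that can only emit short chunks."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    output = ""
    for _ in range(max_rounds):
        # Feed everything produced so far back in, so the model continues from there.
        request = prompt + output
        key = hashlib.sha256(request.encode("utf-8")).hexdigest()
        cache_file = cache_dir / f"{key}.txt"
        if cache_file.exists():
            # An interrupted overnight run resumes without paying for the same call twice.
            chunk = cache_file.read_text(encoding="utf-8")
        else:
            chunk = generate(request)
            cache_file.write_text(chunk, encoding="utf-8")
        output += chunk
        if done_marker in output:
            return output.split(done_marker, 1)[0]
        if not chunk:
            break  # the model returned nothing, so further rounds would not help
    raise RuntimeError("model never emitted the completion marker")
```

Keying the cache on the full request text means a restarted run replays the cached chunks in order and only calls the model again once it reaches the point where the previous run stopped.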
The entire project cost ~$520 in Google Cloud charges, $300 of which I covered using free trial credits.
Rationale
As the proposal explains:
Testing instructions
Run PHPUnit tests as follows:
Next steps
try/catch statement, try to use it by default and fall back to the existing approach on failure. Collect logs when users have opted in to that. Once we achieve feature and stability parity, switch to this parser entirely.

cc @dmsnell @aristath @brandonpayton @schlessera
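The fallback strategy outlined under Next steps could look roughly like the sketch below. It is written in Python for illustration (the real integration would use PHP's try/catch), and `translate_with_fallback`, `ast_translate`, `legacy_translate`, and the logger name are hypothetical, not part of this PR.

```python
import logging
from typing import Callable

logger = logging.getLogger("sqlite_integration")  # hypothetical logger name


def translate_with_fallback(
    query: str,
    ast_translate: Callable[[str], str],     # hypothetical: new AST-based path
    legacy_translate: Callable[[str], str],  # hypothetical: existing token-based path
    log_failures: bool = False,              # only log when the user has opted in
) -> str:
    """Try the new parser first and fall back to the existing approach on failure."""
    try:
        return ast_translate(query)
    except Exception as error:
        if log_failures:
            logger.warning("AST parser failed, using legacy path: %s", error)
        return legacy_translate(query)
```

Queries that the new parser handles never touch the legacy path, so the opted-in logs show exactly which inputs still block parity.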