Deeply Nested Concrete Syntax Tree #32

psteinroe · 2023-09-15T10:09:25Z

psteinroe
Sep 15, 2023
Maintainer

The parser right now emits both an abstract and a concrete syntax tree. The abstract syntax tree contains pretty much what is emitted from libpg_query: a list of statements, each with their respective text range.

use cstree::text::TextRange;
use pg_query::NodeEnum;

#[derive(Debug)]
pub struct RawStmt {
    pub stmt: NodeEnum,
    pub range: TextRange,
}

The concrete syntax tree right now is just a flat list of tokens beneath the root node of each statement. This should be good enough for most features of a lsp, but at some point we need a deeply nested one. A deeply nested tree contains not only the tokens, but also the nested nodes.

For example, the statement

CREATE TABLE weather (
        city      varchar(80) references cities(name),
        temp_lo   int, 
        temp_hi   int,
        prcp      real,
        date      date
);

parsed into a deely nested cst is

[email protected]
  [email protected] "CREATE"
  [email protected] " "
  [email protected] "TABLE"
  [email protected] " "
  [email protected]
    [email protected] "weather"
  [email protected] " "
  [email protected] "("
  [email protected] "\n"
  [email protected] "        "
  [email protected]
    [email protected] "city"
    [email protected] "      "
    [email protected] "varchar"
    [email protected] "("
    [email protected] "80"
    [email protected] ")"
    [email protected] " "
    [email protected] "references"
    [email protected] " "
    [email protected] "cities"
    [email protected] "("
    [email protected] "name"
    [email protected] ")"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected]
    [email protected] "temp_lo"
    [email protected] "   "
    [email protected] "int"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected]
    [email protected] "temp_hi"
    [email protected] "   "
    [email protected] "int"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected]
    [email protected] "prcp"
    [email protected] "      "
    [email protected] "real"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected]
    [email protected] "date"
    [email protected] "      "
    [email protected] "date"
    [email protected] "\n"
    [email protected] ")"
  [email protected] ";"

Right now, we only parse it into

[email protected]
  [email protected] "CREATE"
  [email protected] " "
  [email protected] "TABLE"
  [email protected] " "
  [email protected] "weather"
  [email protected] " "
  [email protected] "("
  [email protected] "\n"
  [email protected] "        "
  [email protected] "city"
  [email protected] "      "
  [email protected] "varchar"
  [email protected] "("
  [email protected] "80"
  [email protected] ")"
  [email protected] " "
  [email protected] "references"
  [email protected] " "
  [email protected] "cities"
  [email protected] "("
  [email protected] "name"
  [email protected] ")"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected] "temp_lo"
  [email protected] "   "
  [email protected] "int"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected] "temp_hi"
  [email protected] "   "
  [email protected] "int"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected] "prcp"
  [email protected] "      "
  [email protected] "real"
  [email protected] ","
  [email protected] "\n"
  [email protected] "        "
  [email protected] "date"
  [email protected] "      "
  [email protected] "date"
  [email protected] "\n"
  [email protected] ")"
  [email protected] ";"

libg_query returns the ast, and a list of all tokens with their text span. A cst is a tree structure containing both: the ast nodes, and the tokens. Unfortunately, reverse engineering into a cst is not straightforward because of a few limitations of libg_query:

For some reason, not all ast nodes have a location. For example, a SelectStmt does not have one. This is especially a problem when the SelectStmt Is not the root node, but nested within a CTE.
The ast nodes have neither a length nor an end position.
The top-level nodes() method does not return all nodes in a statement, but just the ones that are not list children of the root node. For example, a CreateTableStmt does not return the individual column definition nodes, just the RangeVar node that contains the name of the table.

I already managed to find a solution for 3 by recursively resolving the children from a proc macro that reads the proto definition from libg_query.

Attempt 1 (on hold)

My first attempt for 1 and 2 is pushed to branch feat/deep-cst. It can be called an “whatever necessary” approach because I tried to merge the tokens and the nodes using

As many control mechanisms as possible to figure out what token(s) should be children of what nodes
- SiblingToken are tokens that always belong together, such as ( and ). The closing token must be applied on the same depth as its opening token
- TokenType defines how a token should be applied. You can find details in the inline docs.
Derive the location of every node that does not have one using regular expression (e.g. find select for a SelectStmt)

While I could not find a conflicting case yet, this already took a significant effort and it feels like a rabbit hole. It feels very fuzzy to decide whether or not a token should be beneath a node solely based on a manually defined TokenType.

Attempt 2 (todo)

Another idea I am currently pondering on is to start from the ast nodes, and try to figure out what tokens belong beneath each node from the ast itself. It is less fuzzy than the previous approach, because we use more information to determine the children of each node. Also, it will require “just” one implementation for every node, as opposed to guessing a type based on experience. On a high level, this is how it could work:

Recursively walk ast of every statement, similar to get_children
For every node, try to find what tokens from the list of tokens for the statement are its direct children based on its properties. For example, a RangeVar node has a relname property of type string. We can search for the string token by the name of the relation and set RangeVar as the parent node of the respective token.
Some nodes will have static keywords tokens as their children, e.g. SelectStmt will have the Select token.
In most cases, there will be just a single token that can be a child of the node. But not always. For example a cte with multiple select statements. The correct Select token will be the one that is the first after the start position of its parent SelectStmt node. We can get at least an estimation of the location of the SelectSmt from
1. the location property on the node itself,
2. if not there, from the location property of its immediate parent, recursively

The result should be some data structure that holds all tokens, and, if found, their parent node. In theory, the only tokens without a parent node should be whitespaces, brackets and other syntactical elements.

From here, we can determine whether or not to close a node before applying a token by validating that the token has the current node as its parent.

psteinroe · 2023-09-22T09:28:14Z

psteinroe
Sep 22, 2023
Maintainer Author

started with attempt 2: #38

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deeply Nested Concrete Syntax Tree #32

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Deeply Nested Concrete Syntax Tree #32

psteinroe Sep 15, 2023 Maintainer

Attempt 1 (on hold)

Attempt 2 (todo)

Replies: 1 comment

psteinroe Sep 22, 2023 Maintainer Author

psteinroe
Sep 15, 2023
Maintainer

psteinroe
Sep 22, 2023
Maintainer Author