Replies: 1 comment
-
started with attempt 2: #38 |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
The parser right now emits both an abstract and a concrete syntax tree. The abstract syntax tree contains pretty much what is emitted from libpg_query: a list of statements, each with their respective text range.
The concrete syntax tree right now is just a flat list of tokens beneath the root node of each statement. This should be good enough for most features of a lsp, but at some point we need a deeply nested one. A deeply nested tree contains not only the tokens, but also the nested nodes.
For example, the statement
parsed into a deely nested cst is
Right now, we only parse it into
libg_query
returns the ast, and a list of all tokens with their text span. A cst is a tree structure containing both: the ast nodes, and the tokens. Unfortunately, reverse engineering into a cst is not straightforward because of a few limitations of libg_query:nodes()
method does not return all nodes in a statement, but just the ones that are not list children of the root node. For example, a CreateTableStmt does not return the individual column definition nodes, just theRangeVar
node that contains the name of the table.I already managed to find a solution for 3 by recursively resolving the children from a proc macro that reads the proto definition from libg_query.
Attempt 1 (on hold)
My first attempt for 1 and 2 is pushed to branch
feat/deep-cst
. It can be called an “whatever necessary” approach because I tried to merge the tokens and the nodes usingSiblingToken
are tokens that always belong together, such as(
and)
. The closing token must be applied on the same depth as its opening tokenTokenType
defines how a token should be applied. You can find details in the inline docs.select
for aSelectStmt
)While I could not find a conflicting case yet, this already took a significant effort and it feels like a rabbit hole. It feels very fuzzy to decide whether or not a token should be beneath a node solely based on a manually defined
TokenType
.Attempt 2 (todo)
Another idea I am currently pondering on is to start from the ast nodes, and try to figure out what tokens belong beneath each node from the ast itself. It is less fuzzy than the previous approach, because we use more information to determine the children of each node. Also, it will require “just” one implementation for every node, as opposed to guessing a type based on experience. On a high level, this is how it could work:
get_children
RangeVar
node has arelname
property of typestring
. We can search for the string token by the name of the relation and setRangeVar
as the parent node of the respective token.SelectStmt
will have theSelect
token.Select
token will be the one that is the first after the start position of its parentSelectStmt
node. We can get at least an estimation of the location of theSelectSmt
fromThe result should be some data structure that holds all tokens, and, if found, their parent node. In theory, the only tokens without a parent node should be whitespaces, brackets and other syntactical elements.
From here, we can determine whether or not to close a node before applying a token by validating that the token has the current node as its parent.
Beta Was this translation helpful? Give feedback.
All reactions