Matching tokens using regular expressions #154
Is this significantly better than writing:

string_literal = '"' @$([^"\\] / "\\" .)* '"'

@hildjj This pattern works, but it's not a regular expression. I'd prefer to use unmodified regular expressions instead of manually rewriting them like this.
If you absolutely need to have regular expressions (which, I think, is still not what you want), you can try to play with an ugly hack, something like this:

{
  const re = /^"([^"\\]|\\.)*"/; // Note: add ^ to the start of your RegExp
  let lastMatch = null;
}

// Check whether the input at the current position starts with the RegExp,
// and advance the position if it does
string_literal = &{
  // offset() is not merged yet; can replace with location().start.offset... #145
  lastMatch = re.exec(input.substr(offset()));
  return lastMatch !== null;
} {
  // Access to the Peggy internals... could break at any moment
  peg$currPos += lastMatch[0].length; // advance past the matched text
  return lastMatch[0];
};
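For anyone wanting to try the hack above, a minimal usage sketch (assuming Node.js and the published peggy package; grammarSource here is a placeholder for the full grammar text shown above):

const peggy = require("peggy");
const parser = peggy.generate(grammarSource); // grammarSource holds the initializer and rule above
// Prints the whole matched literal, surrounding quotes included
console.log(parser.parse('"a \\"quoted\\" word"'));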
Maybe you can explain why you want to use another parsing mechanism (regular expressions) in a PEG parser?
@Mingun I have some grammar files that need to be converted to Peggy grammars, but they all use regular expressions. Regular expressions are supported in many other parser generators (ANTLR, Bison, Nearley, etc.), so I wish Peggy would support them as well. It should be easy to add this feature: we just need to write a parser that converts regular expressions into equivalent Peggy expressions, like the one above.
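For a sense of what such a conversion would produce, here is one hand-converted pair (a sketch written by hand, not the output of any existing tool): the regular expression /[0-9]+(\.[0-9]+)?/ for a numeric token corresponds to the Peggy rule

number = $([0-9]+ ("." [0-9]+)?)

where character classes, repetition, and optional groups map fairly directly, and $ recovers the matched text much like a regex match does.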
Those generators have two stages, a lexer and a parser; they simply have no other way to form a token. PEG parsers do.

In my opinion, this is a different task that should be solved by another library. You even provided a link to one of them in the first message. Another way is to use some hacks with native JS RegExps, as I've shown, and we are going to try to make an official API for that. At the very least, an API that allows advancing the parse position would be useful not only for this task, but also for parsing indentation-based languages.
I've marked this as an "enhancement request", but not assigned it to a release yet. This will have some interaction with parsing per-codepoint (#15). If we do decide to add it at some point, we'll need to emulate sticky matching, or only allow use of the feature if you opt in to only supporting later browsers. (Note: this does not mean I'm sold on this idea yet. I expect other people to want something similar coming from other libraries, and I want to maintain a place for us to discuss it.)
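"Sticky" here refers to the y flag on JavaScript regular expressions, which makes a pattern match only at a given offset; a minimal sketch of what native support provides (variable names are illustrative):

const re = /"([^"\\]|\\.)*"/y; // y (sticky): the match must begin exactly at re.lastIndex
re.lastIndex = currentOffset;  // set to the parser's current position in the input
const m = re.exec(input);      // null unless a string literal starts right at currentOffset

On engines without sticky support, the same effect has to be emulated, for example by anchoring the pattern with ^ and matching against a substring, as in the hack earlier in this thread.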
Alternatively, it would be useful to have some way to use an external lexer, e.g. to pass in an iterable of some object type that includes a lexical class, location info, and a payload. Not only could such a lexer make it possible to use regexes (or other tools) to tokenize the input, but it would also allow the issue of whitespace to be solved during the lexing phase (which can be really tedious and annoying to solve during an all-in-one parsing phase...).

The vague idea is that, outside of Peggy, the user could call some hypothetical lexer (a sketch of one follows after the grammar example below) with an input like this:

" int x = 4;"

The lexer could transform this into a token stream like this:

[
// type is lexical class, data is payload.
// the location type is oversimplified for example purposes
{location: {start: 4, end: 7}, type: "kw_int", data: null},
{location: {start: 8, end: 9}, type: "identifier", data: "x"},
{location: {start: 11, end: 12}, type: "=", data: null},
{location: {start: 13, end: 14}, type: "integer", data: 4},
{location: {start: 14, end: 15}, type: ";", data: null},
]

and that token stream could be given to the parser generated by Peggy. I'm not sure how the grammar would look, but since the current meaning of string literals as terminals in Peggy would not be useful when using this API, I'll just put this straw man here, which reappropriates that syntax for referring to an individual token with that lexical class:

Type
= "kw_int" { return {type: "primitive", name: "int"}; }
/ "kw_float" { return {type: "primitive", name: "float"}; }
Expr
// the value will come from the token's 'data:'
= value:"integer" { return {type: "literal-int", value}; }
Stmt
= ty:Type ident:"identifier" "=" expr:Expr ";" {
return {type: "declaration", ty, ident, expr};
  }
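To make the external-lexer idea above concrete, here is a hypothetical sketch of such a lexer in plain JavaScript. Nothing here is an existing Peggy API; the function name, rule table, and token shapes are invented for illustration and only approximate the straw man above:

function lex(input) {
  // Each entry pairs a lexical class with a sticky regex and a payload extractor.
  const rules = [
    { type: "kw_int",     re: /int\b/y,        data: () => null },
    { type: "identifier", re: /[A-Za-z_]\w*/y, data: (m) => m[0] },
    { type: "integer",    re: /[0-9]+/y,       data: (m) => Number(m[0]) },
    { type: "=",          re: /=/y,            data: () => null },
    { type: ";",          re: /;/y,            data: () => null },
    { type: "ws",         re: /[ \t\r\n]+/y,   data: () => null }, // matched but not emitted
  ];
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const { type, re, data } of rules) {
      re.lastIndex = pos; // sticky: only match exactly at pos
      const m = re.exec(input);
      if (!m) continue;
      if (type !== "ws") {
        tokens.push({ location: { start: pos, end: pos + m[0].length }, type, data: data(m) });
      }
      pos += m[0].length;
      matched = true;
      break;
    }
    if (!matched) throw new Error("Unexpected character at offset " + pos);
  }
  return tokens;
}

// lex("    int x = 4;") yields a token array shaped like the example above,
// which could then be handed to a (hypothetical) token-aware Peggy parser.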
@jarble would it be possible to do what you want by using a semantic predicate? By matching any character string (or maybe up to a certain delimiter, depending on your grammar?), and then adding an extra test that only lets the rule match when the regex test returns true, you might be able to get the effect you want. I don't know if this will be efficient if you need to parse very long strings...
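A rough sketch of that suggestion in Peggy syntax (untested; the rule name, delimiter set, and regular expression are only illustrative):

word
  = text:$[^ \t\r\n]+ &{ return /^[A-Za-z_][A-Za-z0-9_]*$/.test(text); }
    { return text; }

The rule first grabs a run of non-whitespace characters, and the semantic predicate then rejects the match unless the captured text passes the regular expression test.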
Found this issue by looking for exactly this use case, with the exact same motivation (whitespace). Is there a separate issue tracking this, or should I open one? In theory, it shouldn't be that difficult to implement, since nothing conceptually changes about the way Peggy produces nodes. It's mostly about deciding on the API changes and the .pegjs grammar changes themselves.
We were talking about something similar on Discord last week, and this came up: the idea is that you create an array of tokens with a lexer, then trick Peggy into thinking that your array is a "string". You then use a subset of Peggy grammar rules against those tokens. I haven't tried it myself yet, but it's an interesting idea that doesn't require major changes to Peggy. Let's have another issue to track this conversation, because wanting regexes against real string inputs is interesting in a different way.
I tried to define a token as a JavaScript regular expression, but this is a syntax error:
string_literal = /"([^"\\]|\\.)*"/
Is there another way to define tokens using regular expressions, since the syntax above doesn't work?
(If this feature isn't implemented yet, it could probably be done using something like r2p.)