Matching tokens using regular expressions #154
Is this significantly better than writing:

string_literal = '"' @$([^"\\] / "\\" .)* '"'

@hildjj This pattern works, but it's not a regular expression. I'd prefer to use unmodified regular expressions instead of manually rewriting them like this.
If you absolutely need to have regular expressions (which, I think, is still not what you want), you can try to play with an ugly hack, something like this:

{
  const re = /^"([^"\\]|\\.)*"/; // Note: add ^ to the start of your RegExp
  let lastMatch = null;
}

// Check whether the input at the current position starts with the RegExp,
// and advance the position if it does
string_literal = &{
  // offset() is not merged yet; can replace with location().start.offset... #145
  lastMatch = re.exec(input.substr(offset()));
  return lastMatch !== null;
} {
  // Access to the Peggy internals... could break at any moment
  peg$currPos += lastMatch[0].length; // advance past the matched text
  return lastMatch[0];
};
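For anyone wanting to try the hack above, a minimal usage sketch (assuming Node.js and the published peggy package; grammarSource here is a placeholder for the full grammar text shown above):

const peggy = require("peggy");
const parser = peggy.generate(grammarSource); // grammarSource holds the initializer and rule above
// Prints the whole matched literal, surrounding quotes included
console.log(parser.parse('"a \\"quoted\\" word"'));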
Maybe you can explain why you want to use another parsing mechanism (regular expressions) in a PEG parser?
@Mingun I have some grammar files that need to be converted to Peggy grammars, but they all use regular expressions. Regular expressions are supported in many other parser generators (ANTLR, Bison, Nearley, etc.), so I wish Peggy would support them as well. It should be easy to add this feature: we just need to write a parser that converts regular expressions into equivalent Peggy expressions, like the one above.
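For a sense of what such a conversion would produce, here is one hand-converted pair (a sketch written by hand, not the output of any existing tool): the regular expression /[0-9]+(\.[0-9]+)?/ for a numeric token corresponds to the Peggy rule

number = $([0-9]+ ("." [0-9]+)?)

where character classes, repetition, and optional groups map fairly directly, and $ recovers the matched text much like a regex match does.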
Those generators have two stages, a lexer and a parser; they simply have no other way to form a token. PEG parsers do.

In my opinion, this is a different task that should be solved by another library. You even provided a link to one of them in the first message. Another way is to use some hacks with native JS RegExps, as I've shown, and we are going to try to make an official API for that. At the very least, an API that allows advancing the parse position would be useful not only for this task, but also for parsing indentation-based languages.
I've marked this as an "enhancement request", but not assigned it to a release yet. This will have some interaction with parsing per-codepoint (#15). If we do decide to add it at some point, we'll need to emulate sticky matching, or only allow use of the feature if you opt in to only supporting later browsers. (Note: this does not mean I'm sold on this idea yet. I expect other people to want something similar coming from other libraries, and I want to maintain a place for us to discuss it.)
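"Sticky" here refers to the y flag on JavaScript regular expressions, which makes a pattern match only at a given offset; a minimal sketch of what native support provides (variable names are illustrative):

const re = /"([^"\\]|\\.)*"/y; // y (sticky): the match must begin exactly at re.lastIndex
re.lastIndex = currentOffset;  // set to the parser's current position in the input
const m = re.exec(input);      // null unless a string literal starts right at currentOffset

On engines without sticky support, the same effect has to be emulated, for example by anchoring the pattern with ^ and matching against a substring, as in the hack earlier in this thread.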
Alternatively, it would be useful to have some way to use an external lexer, e.g. to pass in an iterable of some object type that includes a lexical class, location info, and a payload. Not only could such a lexer make it possible to use regexes (or other tools) to tokenize the input, but it would also allow the issue of whitespace to be solved during the lexing phase (which can be really tedious and annoying to solve during an all-in-one parsing phase...).

The vague idea is that, outside of Peggy, the user could call some hypothetical lexer (a sketch of one follows after the grammar example below) with an input like this:

" int x = 4;"

The lexer could transform this into a token stream like this:

[
// type is lexical class, data is payload.
// the location type is oversimplified for example purposes
{location: {start: 4, end: 7}, type: "kw_int", data: null},
{location: {start: 8, end: 9}, type: "identifier", data: "x"},
{location: {start: 11, end: 12}, type: "=", data: null},
{location: {start: 13, end: 14}, type: "integer", data: 4},
{location: {start: 14, end: 15}, type: ";", data: null},
]

and that token stream could be given to the parser generated by Peggy. I'm not sure how the grammar would look, but since the current meaning of string literals as terminals in Peggy would not be useful when using this API, I'll just put this straw man here, which reappropriates that syntax for referring to an individual token with that lexical class:

Type
= "kw_int" { return {type: "primitive", name: "int"}; }
/ "kw_float" { return {type: "primitive", name: "float"}; }
Expr
// the value will come from the token's 'data:'
= value:"integer" { return {type: "literal-int", value}; }
Stmt
= ty:Type ident:"identifier" "=" expr:Expr ";" {
return {type: "declaration", ty, ident, expr};
  }
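To make the external-lexer idea above concrete, here is a hypothetical sketch of such a lexer in plain JavaScript. Nothing here is an existing Peggy API; the function name, rule table, and token shapes are invented for illustration and only approximate the straw man above:

function lex(input) {
  // Each entry pairs a lexical class with a sticky regex and a payload extractor.
  const rules = [
    { type: "kw_int",     re: /int\b/y,        data: () => null },
    { type: "identifier", re: /[A-Za-z_]\w*/y, data: (m) => m[0] },
    { type: "integer",    re: /[0-9]+/y,       data: (m) => Number(m[0]) },
    { type: "=",          re: /=/y,            data: () => null },
    { type: ";",          re: /;/y,            data: () => null },
    { type: "ws",         re: /[ \t\r\n]+/y,   data: () => null }, // matched but not emitted
  ];
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const { type, re, data } of rules) {
      re.lastIndex = pos; // sticky: only match exactly at pos
      const m = re.exec(input);
      if (!m) continue;
      if (type !== "ws") {
        tokens.push({ location: { start: pos, end: pos + m[0].length }, type, data: data(m) });
      }
      pos += m[0].length;
      matched = true;
      break;
    }
    if (!matched) throw new Error("Unexpected character at offset " + pos);
  }
  return tokens;
}

// lex("    int x = 4;") yields a token array shaped like the example above,
// which could then be handed to a (hypothetical) token-aware Peggy parser.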
@jarble would it be possible to do what you want by using a semantic predicate? By matching any character string (or maybe up to a certain delimiter, depending on your grammar?), and then adding an extra test that only lets the rule match when the regex test returns true, you might be able to get the effect you want. I don't know if this will be efficient if you need to parse very long strings...
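A rough sketch of that suggestion in Peggy syntax (untested; the rule name, delimiter set, and regular expression are only illustrative):

word
  = text:$[^ \t\r\n]+ &{ return /^[A-Za-z_][A-Za-z0-9_]*$/.test(text); }
    { return text; }

The rule first grabs a run of non-whitespace characters, and the semantic predicate then rejects the match unless the captured text passes the regular expression test.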
Found this issue by looking for exactly this use case, with the exact same motivation (whitespace). Is there a separate issue tracking this, or should I open one? In theory, it shouldn't be that difficult to implement, since nothing conceptually changes about the way Peggy produces nodes. It's mostly about deciding on the API changes and the .pegjs grammar changes themselves.
We were talking about something similar on Discord last week, and this came up: the idea is that you create an array of tokens with a lexer, then trick Peggy into thinking that your array is a "string". You then use a subset of Peggy grammar rules against those tokens. I haven't tried it myself yet, but it's an interesting idea that doesn't require major changes to Peggy. Let's have another issue to track this conversation, because wanting regexes against real string inputs is interesting in a different way.
I tried to define a token as a JavaScript regular expression, but this is a syntax error:
string_literal = /"([^"\\]|\\.)*"/
Is there another way to define tokens using regular expressions, since the syntax above doesn't work?
(If this feature isn't implemented yet, it could probably be done using something like r2p.)