FIX: a sweeping fix for keyword word boundary handling #107

klauer · 2024-12-18T01:36:57Z

Background

Closes #106 (or hopes to)

The example provided in the issue report, which made it into this PR as a test suite example:

PROGRAM Test
VAR CONSTANT
    RetainInt: INT;
END_VAR
VAR RETAIN
    constantInt: INT;
END_VAR
END_PROGRAM

Shows how blark's rather lax whitespace-handling rules can be excessively so. What happened, prior to this PR, was that VAR CONSTANT would see the keyword Retain and add that to the list of access modifiers (i.e., VAR CONSTANT RETAIN and the variable would then be Int).

In truth, I had seen this once before a year or so ago and didn't pursue a real fix.

The fix

The fix is simply to ensure that there are word boundary markers in the regular expressions for each keyword (\b before and after the keyword text).

Doing this in-place would have required changing the transformer source code as the tokens would have been whitelisted (where they were previously skipped over).

I opted for a separate section in the lark grammar that contains only keywords with this word-boundary fix:

_KEYWORD: /\bKEYWORD\b/i

The above, for example, would match KEYWORD or keyword (the i flag makes it case insensitive) - but only if there were word boundaries on either side (whitespace, end of line, beginning of line, etc.)

Per the Python re docs:

\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning or end of the string. This means that r'\bat\b' matches 'at', 'at.', '(at)', and 'as at ay' but not 'attempt' or 'atlas'.

The default word characters in Unicode (str) patterns are Unicode alphanumerics and the underscore, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.

Note Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

I think based on that description and our usage here that we should be OK.

I'm open to suggestions, of course!

Going forward

I think this PR could be a good starting point for fixing this long-term. There are undoubtedly more whitespace-related issues that could use addressing (TwinCAT likely requires a newline in many places where blark doesn't).

After this PR, I think keeping the pattern of the "underscore keyword" section at the end of iec.lark will be a reasonable one to follow.

klauer · 2024-12-18T01:43:36Z

@engineerjoe440 - when you get a chance, would you give this a spin with your codebase?

I'm hopeful this will only be a beneficial change from a parsing standpoint, though I suspect it may slow things down.

(GitHub Actions is having some cloning trouble right now; the tests are passing locally at the moment. I'll re-run it later)

engineerjoe440 · 2024-12-26T18:52:27Z

Wow... finally got around to checking this out. Sorry about that delay!

I ran this branch against the list of libraries I normally test blark with and confirmed that it's behaving effectively the same as the master branch. 😄 I don't see any clear or notable regressions. So I'd call this golden! 🥇

klauer added 2 commits December 17, 2024 17:18

FIX: a sweeping fix for keyword word boundary handling

b4de82e

TST: underscore terminals do not need coverage

21039f6

STY: whitespace

22bb54a

klauer marked this pull request as ready for review December 29, 2024 21:13

klauer merged commit f694b3d into master Jan 25, 2025
38 checks passed

klauer deleted the fix_whitespace_handling branch January 25, 2025 19:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX: a sweeping fix for keyword word boundary handling #107

FIX: a sweeping fix for keyword word boundary handling #107

klauer commented Dec 18, 2024 •

edited

Loading

klauer commented Dec 18, 2024

engineerjoe440 commented Dec 26, 2024

FIX: a sweeping fix for keyword word boundary handling #107

FIX: a sweeping fix for keyword word boundary handling #107

Conversation

klauer commented Dec 18, 2024 • edited Loading

Background

The fix

Going forward

klauer commented Dec 18, 2024

engineerjoe440 commented Dec 26, 2024

klauer commented Dec 18, 2024 •

edited

Loading