Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: a sweeping fix for keyword word boundary handling #107

Merged
merged 3 commits into from
Jan 25, 2025

Conversation

klauer
Copy link
Owner

@klauer klauer commented Dec 18, 2024

Background

Closes #106 (or hopes to)

The example provided in the issue report, which made it into this PR as a test suite example:

PROGRAM Test
VAR CONSTANT
    RetainInt: INT;
END_VAR
VAR RETAIN
    constantInt: INT;
END_VAR
END_PROGRAM

Shows how blark's rather lax whitespace-handling rules can be excessively so. What happened, prior to this PR, was that VAR CONSTANT would see the keyword Retain and add that to the list of access modifiers (i.e., VAR CONSTANT RETAIN and the variable would then be Int).

In truth, I had seen this once before a year or so ago and didn't pursue a real fix.

The fix

The fix is simply to ensure that there are word boundary markers in the regular expressions for each keyword (\b before and after the keyword text).

Doing this in-place would have required changing the transformer source code as the tokens would have been whitelisted (where they were previously skipped over).

I opted for a separate section in the lark grammar that contains only keywords with this word-boundary fix:

_KEYWORD: /\bKEYWORD\b/i

The above, for example, would match KEYWORD or keyword (the i flag makes it case insensitive) - but only if there were word boundaries on either side (whitespace, end of line, beginning of line, etc.)

Per the Python re docs:

\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning or end of the string. This means that r'\bat\b' matches 'at', 'at.', '(at)', and 'as at ay' but not 'attempt' or 'atlas'.

The default word characters in Unicode (str) patterns are Unicode alphanumerics and the underscore, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.

Note Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

I think based on that description and our usage here that we should be OK.

I'm open to suggestions, of course!

Going forward

I think this PR could be a good starting point for fixing this long-term. There are undoubtedly more whitespace-related issues that could use addressing (TwinCAT likely requires a newline in many places where blark doesn't).

After this PR, I think keeping the pattern of the "underscore keyword" section at the end of iec.lark will be a reasonable one to follow.

@klauer
Copy link
Owner Author

klauer commented Dec 18, 2024

@engineerjoe440 - when you get a chance, would you give this a spin with your codebase?

I'm hopeful this will only be a beneficial change from a parsing standpoint, though I suspect it may slow things down.

(GitHub Actions is having some cloning trouble right now; the tests are passing locally at the moment. I'll re-run it later)

@engineerjoe440
Copy link
Collaborator

Wow... finally got around to checking this out. Sorry about that delay!

I ran this branch against the list of libraries I normally test blark with and confirmed that it's behaving effectively the same as the master branch. 😄 I don't see any clear or notable regressions. So I'd call this golden! 🥇

@klauer klauer marked this pull request as ready for review December 29, 2024 21:13
@klauer klauer merged commit f694b3d into master Jan 25, 2025
38 checks passed
@klauer klauer deleted the fix_whitespace_handling branch January 25, 2025 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'var_declarations' with variable containing 'VAR_ATTRIB' in name at start parses incorrectly
2 participants