FIX: a sweeping fix for keyword word boundary handling #107
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Background
Closes #106 (or hopes to)
The example provided in the issue report, which made it into this PR as a test suite example:
Shows how blark's rather lax whitespace-handling rules can be excessively so. What happened, prior to this PR, was that
VAR CONSTANT
would see the keywordRetain
and add that to the list of access modifiers (i.e.,VAR CONSTANT RETAIN
and the variable would then beInt
).In truth, I had seen this once before a year or so ago and didn't pursue a real fix.
The fix
The fix is simply to ensure that there are word boundary markers in the regular expressions for each keyword (
\b
before and after the keyword text).Doing this in-place would have required changing the transformer source code as the tokens would have been whitelisted (where they were previously skipped over).
I opted for a separate section in the lark grammar that contains only keywords with this word-boundary fix:
The above, for example, would match
KEYWORD
orkeyword
(thei
flag makes it case insensitive) - but only if there were word boundaries on either side (whitespace, end of line, beginning of line, etc.)Per the Python re docs:
I think based on that description and our usage here that we should be OK.
I'm open to suggestions, of course!
Going forward
I think this PR could be a good starting point for fixing this long-term. There are undoubtedly more whitespace-related issues that could use addressing (TwinCAT likely requires a newline in many places where blark doesn't).
After this PR, I think keeping the pattern of the "underscore keyword" section at the end of
iec.lark
will be a reasonable one to follow.