Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in repeated capturing group #2

Open
ysard opened this issue Jan 12, 2022 · 2 comments
Open

Error in repeated capturing group #2

ysard opened this issue Jan 12, 2022 · 2 comments

Comments

@ysard
Copy link
Contributor

ysard commented Jan 12, 2022

Hi, I encounter a problem with this not supported regex syntax:

^(A|B){2}-\1$

The parser fails silently until the solver returns None.
I don't know how to tweak the grammar to support this.

The question is:
What should match a back reference to a capture group followed by a quantifier ?
What character should end the string AB- ? A, B or AB ?

Following the advise here: https://regex101.com/

A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated
group to capture all iterations or use a non-capturing group instead if you're not interested in the data.

So the capture should be the content of the last repetition, so in the example, it should be AB-B.

Python implementation is ok with this:

import re
assert re.match(r"^(A|B){2}-\1$", "AB-A") is None
assert re.match(r"^(A|B){2}-\1$", "AB-B").groups() == ('B',)
@blukat29
Copy link
Owner

@ysard Thanks for your interest.

I agree that the capture should be the last character of the repitition.

Currently (A|B){2}-\1 fails because following SMT problem is produced.

('(A|B){2}-\\1', 4, True)
And(p_52 == 0,
    Or(x_0 == 65, x_0 == 66),
    p_52 == 1,
    Or(x_1 == 65, x_1 == 66),
    x_2 == 45,
    Or(And(p_52 == 1, x_3 == x_1),
       And(p_52 == 0, x_3 == x_0)))

Note that p_52 cannot be both 0 and 1. p_52 is the position of the source character for \1.

The problem should be something like

And(Or(x_0 == 65, x_0 == 66),
    p_52 == 1,
    Or(x_1 == 65, x_1 == 66),
    x_2 == 45,
    And(p_52 == 1, x_3 == x_1))

Current implementation collects self.groups and self.backrefs and build the "position conditions" using them. (https://blukat.me/2016/01/regex-crossword-solver/) Perhaps we can restrict the self.groups to the last element of repetition, but then cases like (A|B){3,5}-\1 will be difficult.

@ysard
Copy link
Contributor Author

ysard commented Jan 14, 2022

Thank you for your quick responses;
Indeed the problem is not satisfiable because of contrary clauses. I was afraid it would be difficult to implement without breaking other features.
Anyway, these are niche cases, the solver works reasonably fast on 16x16 matrices.

I'm proposing a pull request that could allow the following behavior.
The logical formula seems to be ok, Tests are ok, but I'm not totally sure about what I'm doing.

Basically if a group has already been encountered (same idx) (because of a logical OR), the previous possible position is overwritten by the current one. If a group has 2 propositions, p_x = 0 disappears from the SMT in favor of p_x = 1.

solve_regex(r"(A|B){2}-\1", 4)
# Solution: AA-A

# Formula not simplified
Or(False,
   And(Or(False,
          And(Or(x_00 == 65, x_00 == 66),
              Or(x_01 == 65, x_01 == 66))),
       Or(False,
          And(x_02 == 45,
              Or(False, And(p_1 == 1, x_03 == x_01))))))
# Formula simplified
And(Or(x_00 == 65, x_00 == 66),
    Or(x_01 == 65, x_01 == 66),
    x_02 == 45,
    p_1 == 1,
    x_03 == x_01)

And :

solve_regex(r"(A|B|C){2,3}-\1", 5)
# Solution: AAA-A

# Formula not simplified
Or(Or(False,
      And(Or(Or(False,
                And(Or(x_00 == 65,
                       Or(x_00 == 66, x_00 == 67)),
                    Or(x_01 == 65,
                       Or(x_01 == 66, x_01 == 67)))),
             False),
          False)),
   And(Or(False,
          Or(False,
             And(Or(False,
                    And(Or(x_00 == 65,
                           Or(x_00 == 66, x_00 == 67)),
                        Or(x_01 == 65,
                           Or(x_01 == 66, x_01 == 67)))),
                 Or(x_02 == 65, Or(x_02 == 66, x_02 == 67))))),
       Or(False,
          And(x_03 == 45,
              Or(False, And(p_2 == 2, x_04 == x_02))))))

# Formula simplified
And(Or(x_00 == 65, x_00 == 66, x_00 == 67),
    Or(x_01 == 65, x_01 == 66, x_01 == 67),
    Or(x_02 == 65, x_02 == 66, x_02 == 67),
    x_03 == 45,
    p_2 == 2,
    x_04 == x_02)

But I encounter another problem (not related to my modifications, but annoying for this functionality):

solve_regex(r"^(ab|defgh)?\1$", 7) # abababa

This shouldn't be accepted:

assert re.match(r"^(ab|defgh)?\1$", "abababa") is None

Because of that, the following test fails:

self.do_test(r"(ab|defgh){1}\1", 7, False)  
# Expected:
# abab
# defghdefgh
# Not expected:
# abdefgh
# defghab
# Nor (the problem is here):
# abababa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants