Fix Operations.reverse() to not add non-deterministic dead states #14212

Merged
merged 2 commits into from
Feb 10, 2025

Conversation

rmuir
Member

@rmuir rmuir commented Feb 6, 2025

Operations.reverse() can create dead states, and worse, dead states that are non-deterministic. That causes Automaton.isDeterministic() to return false, i.e. the result is treated as an NFA. This can lead to unnecessary det() calls, especially as the automaton gets bigger or more complex.
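To make the problem concrete, here is a minimal sketch on a toy automaton model. Everything in it (`ReverseDeadStates`, `Edge`, `Nfa`, and the helper methods) is hypothetical and written for illustration; it is not Lucene's `Automaton` API and not the code changed in this PR. It assumes the textbook reversal construction and shows one way a dead state with conflicting transitions can make a whole-automaton determinism check fail, even though the live part is deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy automaton model for illustration only; this is NOT Lucene's Automaton API. */
final class ReverseDeadStates {
  /** A labeled edge: from --label--> to. */
  record Edge(int from, char label, int to) {}

  /** States are 0..numStates-1, with one initial state and a set of accept states. */
  record Nfa(int numStates, int initial, Set<Integer> accept, List<Edge> edges) {}

  /** Textbook reversal: flip every edge and add a fresh initial state that
   *  copies the flipped edges leaving each old accept state. */
  static Nfa reverse(Nfa a) {
    int fresh = a.numStates();
    List<Edge> out = new ArrayList<>();
    for (Edge e : a.edges()) {
      out.add(new Edge(e.to(), e.label(), e.from()));
      if (a.accept().contains(e.to())) {
        out.add(new Edge(fresh, e.label(), e.from()));
      }
    }
    // The old initial state becomes the accept state; the fresh state also
    // accepts if the original accepted the empty string.
    Set<Integer> accept = new HashSet<>(Set.of(a.initial()));
    if (a.accept().contains(a.initial())) accept.add(fresh);
    return new Nfa(fresh + 1, fresh, accept, out);
  }

  /** Checks ALL states, dead or not: one label must not lead to two different states. */
  static boolean isDeterministic(Nfa a) {
    Map<String, Integer> seen = new HashMap<>();
    for (Edge e : a.edges()) {
      Integer prev = seen.putIfAbsent(e.from() + ":" + e.label(), e.to());
      if (prev != null && prev != e.to()) return false;
    }
    return true;
  }

  /** States reachable from `starts`, following edges forward or backward. */
  static Set<Integer> reachable(Collection<Integer> starts, List<Edge> edges, boolean forward) {
    Set<Integer> seen = new HashSet<>(starts);
    Deque<Integer> todo = new ArrayDeque<>(starts);
    while (!todo.isEmpty()) {
      int s = todo.pop();
      for (Edge e : edges) {
        int src = forward ? e.from() : e.to();
        int dst = forward ? e.to() : e.from();
        if (src == s && seen.add(dst)) todo.push(dst);
      }
    }
    return seen;
  }

  /** Drops every edge that touches a state which is unreachable or cannot reach an accept state. */
  static Nfa removeDeadStates(Nfa a) {
    Set<Integer> live = reachable(List.of(a.initial()), a.edges(), true);
    live.retainAll(reachable(a.accept(), a.edges(), false));
    List<Edge> kept = new ArrayList<>();
    for (Edge e : a.edges()) {
      if (live.contains(e.from()) && live.contains(e.to())) kept.add(e);
    }
    return new Nfa(a.numStates(), a.initial(), a.accept(), kept);
  }

  /** A DFA accepting only "a"; state 2 is a dead sink entered on 'b' from both 0 and 1. */
  static Nfa example() {
    return new Nfa(3, 0, Set.of(1),
        List.of(new Edge(0, 'a', 1), new Edge(0, 'b', 2), new Edge(1, 'b', 2)));
  }

  public static void main(String[] args) {
    Nfa reversed = reverse(example());
    System.out.println(isDeterministic(reversed));                   // false
    System.out.println(isDeterministic(removeDeadStates(reversed))); // true
  }
}
```

In this toy case the reversed sink state has two 'b' transitions to different states, so the check fails until the dead state is trimmed. Trimming here only filters edges and does not renumber states, which keeps the sketch short.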

Operations.reverse() serves multiple use-cases today:

  • Search engine use-cases trying to speed up leading wildcards
  • Testing/academic use-case (Brzozowski minimize)

In the search engine use-case, it is used by both Lucene and Solr.

Lucene uses this method for infinite automata (e.g. leading wildcard) to compute a common suffix. Such queries need to evaluate many candidate terms, so we reverse the automaton as part of computing the common suffix. If the expression has one (e.g. "*foo"), memcmp can then be used to filter out candidates quickly.
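As a rough illustration of the memcmp-style filter, the sketch below rejects candidate terms whose trailing bytes differ from a known common suffix. The class and method names (`SuffixPrefilter`, `endsWith`, `candidates`) are hypothetical, and the suffix is passed in by hand rather than computed from an automaton; this is not Lucene's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class SuffixPrefilter {
  /** Byte-range comparison of the term's tail against the suffix. */
  static boolean endsWith(byte[] term, byte[] suffix) {
    int start = term.length - suffix.length;
    return start >= 0
        && Arrays.equals(term, start, term.length, suffix, 0, suffix.length);
  }

  /** Keeps only terms that end with the common suffix. Survivors would still
   *  need to be run through the full automaton; this step only discards
   *  terms that cannot match. */
  static List<String> candidates(List<String> terms, String commonSuffix) {
    byte[] suffix = commonSuffix.getBytes(StandardCharsets.UTF_8);
    List<String> kept = new ArrayList<>();
    for (String term : terms) {
      if (endsWith(term.getBytes(StandardCharsets.UTF_8), suffix)) kept.add(term);
    }
    return kept;
  }

  static List<String> demo() {
    return candidates(List.of("foo", "barfoo", "food", "fo"), "foo");
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [foo, barfoo]
  }
}
```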

Solr uses this method for a feature where users can opt in to also indexing the reversed form of every term, with a special marker to prevent false positives from the extra reversed terms. At query time, the reversed wildcard queries can be turned into something that looks more like a prefix query: https://github.com/apache/solr/blob/bca4cd630b9cff66ecc0431397a99f5289a6462b/solr/core/src/java/org/apache/solr/parser/SolrQueryParserBase.java#L1291-L1324
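The idea can be sketched with a sorted set standing in for the term dictionary. All names here (`ReversedTermIndex`, `MARKER`, `reversedForm`, `search`) are hypothetical and the marker value is an assumption for the demo; this is not Solr's code, and it works on plain strings rather than automata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical model of reversed-term indexing, for illustration only. */
final class ReversedTermIndex {
  /** Assumed marker: a character that never starts a regular term. */
  static final char MARKER = '\u0001';

  static String reversedForm(String term) {
    return MARKER + new StringBuilder(term).reverse().toString();
  }

  /** Indexes each term together with its marked, reversed form. */
  static SortedSet<String> index(List<String> terms) {
    SortedSet<String> idx = new TreeSet<>();
    for (String t : terms) {
      idx.add(t);
      idx.add(reversedForm(t));
    }
    return idx;
  }

  /** Answers a pattern like "*foo" as a prefix scan over the marked, reversed entries. */
  static List<String> search(SortedSet<String> idx, String pattern) {
    if (!pattern.startsWith("*") || pattern.indexOf('*', 1) >= 0) {
      throw new IllegalArgumentException("only a single leading '*' is supported here");
    }
    String prefix = reversedForm(pattern.substring(1));
    List<String> hits = new ArrayList<>();
    for (String entry : idx.tailSet(prefix)) {
      if (!entry.startsWith(prefix)) break; // left the prefix range
      // Strip the marker and undo the reversal to recover the original term.
      hits.add(new StringBuilder(entry.substring(1)).reverse().toString());
    }
    return hits;
  }

  static List<String> demo() {
    return search(index(List.of("barfoo", "foo", "oof", "food")), "*foo");
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [foo, barfoo]
  }
}
```

Without the marker, the prefix "oof" would also match the ordinary term "oof", which does not end in "foo"; that is the kind of false positive the marker prevents.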

Move Operations.reverse(Automaton, Set) to AutomatonTestUtil, since it is too difficult to improve while also supporting this hook.

Fix Operations.reverse(Automaton) to remove dead states.

@rmuir rmuir requested a review from mikemccand February 7, 2025 18:59
@rmuir rmuir added this to the 10.2.0 milestone Feb 10, 2025
@rmuir rmuir merged commit ad7ff1f into apache:main Feb 10, 2025
6 checks passed
asfgit pushed a commit that referenced this pull request Feb 10, 2025
…4212)
