From cfde14510a290247dac94730150108cf4a574562 Mon Sep 17 00:00:00 2001
From: Alberto Islas
Date: Fri, 10 Jan 2025 18:04:18 -0600
Subject: [PATCH 1/3] fix(docs): Tweaked advanced search documentation to include new search operators

Fixes: #4907
---
 cl/search/templates/includes/no_results.html |  4 +-
 .../templates/help/advanced_search.html      | 38 +++++++++++++------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/cl/search/templates/includes/no_results.html b/cl/search/templates/includes/no_results.html
index f42f7b300a..b1cf9ef088 100644
--- a/cl/search/templates/includes/no_results.html
+++ b/cl/search/templates/includes/no_results.html
@@ -34,7 +34,7 @@

 {% elif error_message == "unbalanced_quotes" %}
 Did you forget to close one or more quotes?
 {% elif error_message == "disallowed_wildcard_pattern" %}
- The query contains a disallowed expensive wildcard pattern.
+ The query contains a disallowed expensive wildcard pattern.
 {% endif %}
 {% else %}
 encountered an error.
@@ -43,7 +43,7 @@

{% if error_message %} {% if suggested_query == "proximity_query" %}

Are you attempting to perform a proximity search?

-

Try using this format: term~ or term~2. For more details, visit our advance search documentation.

+

Try using this format: "lorem term"~50. For more details, visit our advanced search documentation.

{% elif suggested_query == "proximity_filter" %}

Are you attempting to perform a proximity search within a filter?

Proximity queries do not work in filters. Consider using the main search box. For more details, visit our advanced search documentation.

diff --git a/cl/simple_pages/templates/help/advanced_search.html b/cl/simple_pages/templates/help/advanced_search.html
index a1f0c288a5..e30db4d754 100644
--- a/cl/simple_pages/templates/help/advanced_search.html
+++ b/cl/simple_pages/templates/help/advanced_search.html
@@ -23,9 +23,10 @@

Advanced Query Techniques and Operators

If you would like assistance crafting a query, let us know. We can sometimes help.

-

Intersections: AND

-

This connector is used by default, and so is not usually needed. However, some operators, like the negation operator, can change the default operator to OR. Therefore, in more complicated queries it is good practice to explicitly intersect your tokens by using the AND operator between all words (e.g. immigration AND asylum). -

+

Intersections: AND or &

+

This connector is used by default, and so is not usually needed. However, some operators, like the negation operator, can change the default operator to OR. Therefore, in more complicated queries it is good practice to explicitly intersect your tokens by using the AND or & operator between all words e.g:

+

(immigration AND asylum) AND (border OR patrol) or

+

(immigration & asylum) & (border OR patrol)

Unions: OR

Creates an OR comparison between words (e.g. immigration OR asylum).
@@ -40,6 +41,11 @@

Negation/Exclusion: -

This query does "immigration" or "border" but not "border patrol":

immigration border -"border patrol".

+

But not : NOT or %

+

The NOT operator or % serves as an alternative way to exclude terms from your search results. This operator is particularly useful when combined with other boolean operators or grouped queries to refine your search precision:

+

"border patrol" NOT (immigration OR asylum) or

+

"border patrol" % (immigration OR asylum)

+

Phrase and Exact Queries: " "

Creates a phrase search (e.g. "border patrol").

You can also use " " to perform an exact query, which will not apply stemming or match synonyms.

@@ -51,17 +57,27 @@

Grouped Queries and subqueries: ( )

Using parentheses will group parts of a query (e.g. (customs OR "border patrol") AND asylum). Parentheses can be nested as deeply as needed.

-

Wildcards and Fuzzy Search: *, ?, and ~

-

Using an asterisk (*) allows wildcard searches. For example, immigra* finds all words beginning with "immigra". This can also be used at the beginning or middle of words, at both the beginning and the end of a word, or even all three. For example, you can find all words containing two esses side-by-side with the following query: *ss*. You could also find words with two esses separated by other letters with a query such as: *s*s*. This would find cases containing words like "susan" or "assistant". -

+

Wildcards and Fuzzy Search: *, !, ? and ~

+

Using an asterisk (*) allows for wildcard searches. For example, immigra* finds all words that begin with "immigra". Alternatively, you can use an exclamation mark (!) at the beginning of a word for the same purpose. For instance, !immigra matches words that start with "immigra".

-

The question mark character (?) can be used similarly as a single letter wildcard. For example, this would find cases containing the word "immigrant" or "emmigration": ?mmigra* -

+

* can also be used inside words, where it acts as a single-character wildcard. For example, a query like gr*mm*r would match cases containing both "grammar" and "grimmer".

+

The question mark character (?) can be used similarly as a single-character wildcard. Unlike *, it is allowed at the beginning of words. For example, this would find cases containing the word "immigrant" or "emmigration": ?mmigra*.

-

Fuzzy search can be applied using the tilde character (~) after a word. This is an advanced parameter that allows searches for misspellings or different variations of a word's spelling. For example, searching for immigrant~ would find words similar to "immigrant". Values can also be added after the tilde to indicate how similar different spellings must be. The default value, if none is given, is 0.5. Values can range between 0 and 1, with 1 being exact, and 0 being very sloppy. Fuzzy searches tend to broaden the result set, thus lowering precision, but also casting a wider net. -

+

Fuzzy search can be applied using the tilde character (~) after a word. This is an advanced parameter that allows searches for misspellings or variations in a word's spelling. For example, searching for immigrant~ would find words similar to "immigrant." Values can also be added after the tilde to specify the maximum number of changes allowed, where a change refers to the insertion, deletion, substitution of a single character, or transposition of two adjacent characters. The default value, if none is given, is 2. Allowed values are 1 and 2. Fuzzy searches tend to broaden the result set, thus lowering precision, but also casting a wider net.

+ +

Disallowed Expensive Wildcards

+

The following types of wildcard queries are disabled due to performance issues:

+ + * at the beginning of terms +

Queries like *ing are disallowed because they require examining all terms in the index, which is highly resource-intensive.

+ + Multiple endings with * or ! in short terms + +

Queries that match multiple endings are only allowed if the base word has at least three characters. Therefore, queries like a*, bc*, !a, or !bc are disallowed due to performance issues.

+

Performing a query like these will throw an error with the message:

+

The query contains a disallowed expensive wildcard pattern.

-

Proximity: ~

+

Proximity: ~

Using a tilde character (~) after a phrase will ensure that the words in the phrase are within a desired distance of each other. For example "border fence"~50 would find the words border and fence within 50 words of each other.
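
The fuzzy-search paragraph in this patch defines a "change" as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters. A minimal Python sketch of that distance (the optimal string alignment variant) may make the 1 and 2 limits concrete. The names `edit_distance` and `fuzzy_match` are hypothetical and are not taken from the search backend, whose actual matching may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Count single-character changes needed to turn a into b.

    Insertions, deletions, substitutions, and transpositions of two
    adjacent characters each cost 1 (optimal string alignment).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (
                i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


def fuzzy_match(term: str, candidate: str, max_changes: int = 2) -> bool:
    """Hypothetical helper: mirrors term~1 / term~2 (default 2)."""
    if max_changes not in (1, 2):
        raise ValueError("Allowed values are 1 and 2.")
    return edit_distance(term, candidate) <= max_changes
```

Under this sketch, `fuzzy_match("immigrant", "immigrnat", 1)` holds because swapping the adjacent "a" and "n" counts as a single change, while "immigration" is more than two changes away from "immigrant" and would not match.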

From 29712cc8faf8a0b97486a319d39b6161bdf5a7ec Mon Sep 17 00:00:00 2001
From: Alberto Islas
Date: Fri, 10 Jan 2025 18:27:00 -0600
Subject: [PATCH 2/3] fix(search): Fixed failing test_disallowed_wildcard_pattern

---
 cl/search/tests/tests.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cl/search/tests/tests.py b/cl/search/tests/tests.py
index 9c06e7e6e0..1b34ae48ed 100644
--- a/cl/search/tests/tests.py
+++ b/cl/search/tests/tests.py
@@ -1242,9 +1242,14 @@ def test_disallowed_wildcard_pattern(self) -> None:
                     test_case["search_params"],
                 )
                 decoded_content = response.content.decode()
+                tree = html.fromstring(decoded_content)
+                h2_error_element = tree.xpath('//h2[@class="alt"]')[0]
+                h2_text_error = "".join(
+                    h2_error_element.xpath(".//text()")
+                ).strip()
                 self.assertIn(
                     "The query contains a disallowed expensive wildcard pattern",
-                    decoded_content,
+                    h2_text_error,
                     msg=f"Failed on: {test_case['label']}, no disallowed wildcard pattern error.",
                 )
 
From 713f400f340cf466bd7be421e52eb836b1abac9b Mon Sep 17 00:00:00 2001
From: Alberto Islas
Date: Fri, 10 Jan 2025 19:29:46 -0600
Subject: [PATCH 3/3] fix(docs): Applied suggestions in advanced search documentation

---
 cl/search/exception.py                              | 2 +-
 cl/search/templates/includes/no_results.html        | 2 +-
 cl/search/tests/tests.py                            | 6 +++---
 cl/simple_pages/templates/help/advanced_search.html | 8 ++++----
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/cl/search/exception.py b/cl/search/exception.py
index 0d51d152d3..926ae17495 100644
--- a/cl/search/exception.py
+++ b/cl/search/exception.py
@@ -61,4 +61,4 @@ class ElasticBadRequestError(APIException):
 class DisallowedWildcardPattern(SyntaxQueryError):
     """Query contains a disallowed wildcard pattern"""
 
-    message = "The query contains a disallowed expensive wildcard pattern."
+    message = "The query contains a disallowed wildcard pattern."
diff --git a/cl/search/templates/includes/no_results.html b/cl/search/templates/includes/no_results.html
index b1cf9ef088..91ab6c6b95 100644
--- a/cl/search/templates/includes/no_results.html
+++ b/cl/search/templates/includes/no_results.html
@@ -34,7 +34,7 @@

 {% elif error_message == "unbalanced_quotes" %}
 Did you forget to close one or more quotes?
 {% elif error_message == "disallowed_wildcard_pattern" %}
- The query contains a disallowed expensive wildcard pattern.
+ The query contains a disallowed wildcard pattern.
 {% endif %}
 {% else %}
 encountered an error.
diff --git a/cl/search/tests/tests.py b/cl/search/tests/tests.py
index 1b34ae48ed..f2503ee63e 100644
--- a/cl/search/tests/tests.py
+++ b/cl/search/tests/tests.py
@@ -1248,7 +1248,7 @@ def test_disallowed_wildcard_pattern(self) -> None:
                     h2_error_element.xpath(".//text()")
                 ).strip()
                 self.assertIn(
-                    "The query contains a disallowed expensive wildcard pattern",
+                    "The query contains a disallowed wildcard pattern.",
                     h2_text_error,
                     msg=f"Failed on: {test_case['label']}, no disallowed wildcard pattern error.",
                 )
@@ -1261,7 +1261,7 @@ def test_disallowed_wildcard_pattern(self) -> None:
                 self.assertEqual(api_response.status_code, 400)
                 self.assertEqual(
                     api_response.data["detail"],
-                    "The query contains a disallowed expensive wildcard pattern.",
+                    "The query contains a disallowed wildcard pattern.",
                     msg="Failed for V4",
                 )
 
@@ -1273,7 +1273,7 @@ def test_disallowed_wildcard_pattern(self) -> None:
                 self.assertEqual(api_response.status_code, 400)
                 self.assertEqual(
                     api_response.data["detail"],
-                    "The query contains a disallowed expensive wildcard pattern.",
+                    "The query contains a disallowed wildcard pattern.",
                     msg="Failed for V3",
                 )
 
diff --git a/cl/simple_pages/templates/help/advanced_search.html b/cl/simple_pages/templates/help/advanced_search.html
index e30db4d754..e0d9207144 100644
--- a/cl/simple_pages/templates/help/advanced_search.html
+++ b/cl/simple_pages/templates/help/advanced_search.html
@@ -60,12 +60,12 @@

Grouped Queries and subqueries: ( )

Wildcards and Fuzzy Search: *, !, ? and ~

Using an asterisk (*) allows for wildcard searches. For example, immigra* finds all words that begin with "immigra". Alternatively, you can use an exclamation mark (!) at the beginning of a word for the same purpose. For instance, !immigra matches words that start with "immigra".

-

* can also be used inside words, where it acts as a single-character wildcard. For example, a query like gr*mm*r would match cases containing both "grammar" and "grimmer".

-

The question mark character (?) can be used similarly as a single-character wildcard. Unlike *, it is allowed at the beginning of words. For example, this would find cases containing the word "immigrant" or "emmigration": ?mmigra*.

+

* can also be used inside words, where it acts as a single-character wildcard. For example, a query like gr*mm*r would match cases containing both "grammar" and "grimmer".

+

The question mark character (?) can be used similarly as a single-character wildcard. Unlike *, it is allowed at the beginning of words. For example, this would find cases containing the word "immigrant" or "emmigration": ?mmigra*.

Fuzzy search can be applied using the tilde character (~) after a word. This is an advanced parameter that allows searches for misspellings or variations in a word's spelling. For example, searching for immigrant~ would find words similar to "immigrant." Values can also be added after the tilde to specify the maximum number of changes allowed, where a change refers to the insertion, deletion, substitution of a single character, or transposition of two adjacent characters. The default value, if none is given, is 2. Allowed values are 1 and 2. Fuzzy searches tend to broaden the result set, thus lowering precision, but also casting a wider net.

-

Disallowed Expensive Wildcards

+

Disallowed Wildcards

The following types of wildcard queries are disabled due to performance issues:

* at the beginning of terms
@@ -75,7 +75,7 @@

Disallowed Expensive Wildcards

Queries that match multiple endings are only allowed if the base word has at least three characters. Therefore, queries like a*, bc*, !a, or !bc are disallowed due to performance issues.

Performing a query like these will throw an error with the message:

-

The query contains a disallowed expensive wildcard pattern.

+

The query contains a disallowed wildcard pattern.

Proximity: ~

Using a tilde character (~) after a phrase will ensure that the words in the phrase are within a desired distance of each other. For example "border fence"~50 would find the words border and fence within 50 words of each other.
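
The "Disallowed Wildcards" rules documented in this series can be illustrated with a short Python sketch. It is not the project's validation code (which is not part of this diff): the names `is_disallowed_wildcard`, `check_query`, `DISALLOWED_MESSAGE` and `MIN_BASE_LENGTH` are hypothetical, the query is naively split on whitespace, and `ValueError` stands in for the `DisallowedWildcardPattern` exception. Only the two documented rules and the final error message come from the patches.

```python
# Message text as finalized in PATCH 3/3.
DISALLOWED_MESSAGE = "The query contains a disallowed wildcard pattern."

# Documented rule: the base word needs at least three characters.
MIN_BASE_LENGTH = 3


def is_disallowed_wildcard(term: str) -> bool:
    """Return True if a single term breaks one of the documented rules."""
    # Rule 1: "*" at the beginning of a term (e.g. *ing).
    if term.startswith("*"):
        return True
    # Rule 2a: trailing "*" on a short base word (e.g. a*, bc*).
    if term.endswith("*"):
        return len(term.rstrip("*")) < MIN_BASE_LENGTH
    # Rule 2b: leading "!" on a short base word (e.g. !a, !bc).
    if term.startswith("!"):
        return len(term.lstrip("!")) < MIN_BASE_LENGTH
    return False


def check_query(query: str) -> None:
    """Raise ValueError if any whitespace-separated term is disallowed.

    Assumption: splitting on whitespace ignores quoted phrases and
    grouping, which a real query parser would need to handle.
    """
    for term in query.split():
        if is_disallowed_wildcard(term):
            raise ValueError(DISALLOWED_MESSAGE)
```

With this sketch, `immigra*`, `!immigra`, `?mmigra*` and `gr*mm*r` pass, while `*ing`, `a*`, `bc*`, `!a` and `!bc` are rejected, matching the examples given in the help page.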