Search history completions #419

MareStare · 2025-02-12T12:37:14Z

Motivation

Today custom autocompletions in the search input are disabled by default, mainly because they replace the native browser auto-fill. But the native auto-fill itself probably isn't the actual requirement.

Solution

The requirement is to have a functional search history in completions. Something similar to what Google does:

The history should be shown by default when the search input is empty and focused. It should also participate in completions search when the user has typed at least one character.

Ranking

We should rank the search history results by the number of times the search was invoked. The search history results should always take precedence over the default tag completions, since it's very likely the users have a set of "favourite" queries they repeat again and again.
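A minimal TypeScript sketch of this ranking rule. The `HistoryEntry` shape and the `rankCompletions` function are hypothetical names for illustration, not existing code. Breaking ties by most recent use is an assumption that the proposal does not specify.

```typescript
// Hypothetical shape of one stored search history record.
interface HistoryEntry {
  query: string;
  // Number of times this exact query was submitted.
  count: number;
  // Unix timestamp (ms) of the last submission; used only to break ties.
  lastUsedAt: number;
}

// Returns completions with history entries first (most used first),
// followed by tag completions that are not already present in the history.
function rankCompletions(history: HistoryEntry[], tags: string[], limit: number): string[] {
  const rankedHistory = [...history]
    .sort((a, b) => b.count - a.count || b.lastUsedAt - a.lastUsedAt)
    .map(entry => entry.query);

  const seen = new Set(rankedHistory);
  const remainingTags = tags.filter(tag => !seen.has(tag));

  return [...rankedHistory, ...remainingTags].slice(0, limit);
}
```

Here the history always fills the list first, so tag completions only appear in whatever slots are left under `limit`.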

Manual deletion

There should be a button somewhere on the search history completion item to remove it from history. Google example:

This is useful when you want to remove some search history item because it contains typos, or maybe you just want to hide something from somebody 😳.

Matching

Currently, tag completions are formed based on a prefix match. I think that's rather inconvenient even for that purpose, but for the search history it will be even more of a problem, because search queries may contain multiple tags and even complex conditions. I think a fuzzy search like the one fzf implements would be much nicer overall. This may be rather difficult to implement, and thus may be discussed/done separately.

For an easy start we may just use a case-insensitive substring match, i.e. ILIKE %_% logic.
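The simple variant could be sketched on the client side as follows. `matchHistory` is a hypothetical helper name. Returning the whole history for an empty input is an assumption that follows the "show history when the input is empty and focused" behaviour described in the Solution section.

```typescript
// Returns the history queries that contain `input` as a case-insensitive substring.
// An empty (or whitespace-only) input matches every query.
function matchHistory(queries: string[], input: string): string[] {
  const needle = input.trim().toLowerCase();
  if (needle === "") {
    return [...queries];
  }
  return queries.filter(query => query.toLowerCase().includes(needle));
}
```

Unlike a real SQL ILIKE pattern, `includes` treats `%` and `_` in the user's input as literal characters, which is probably what we want here since the input is not a pattern.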

Server-side vs client-side

We may provide server-side search history storage for logged-in users only. The server-side storage will provide a shared history across all user devices. Mainly - desktop + phone or multiple desktops. This is how Google does it and it's very convenient.

Anonymous users will have only the client-side localStorage solution, which will also serve as an additional incentive for them to create an account.

However, logged-in users will also rely on client-side localStorage as a cache for search history until server-side history is loaded.
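The client-side part could look roughly like the sketch below. The `KeyValueStore` interface, the `"search-history"` key, and the JSON array layout are all assumptions for illustration. The interface is a subset of the Web Storage API, so `window.localStorage` can be passed in directly, while tests can pass an in-memory object.

```typescript
// Minimal subset of the Web Storage API; `window.localStorage` satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key, not an existing key in the project.
const HISTORY_KEY = "search-history";

// Reads the stored history, falling back to an empty list on missing or corrupted data.
function loadHistory(store: KeyValueStore): string[] {
  const raw = store.getItem(HISTORY_KEY);
  if (raw === null) {
    return [];
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    // Keep only string items; anything else is treated as corrupted or outdated data.
    return Array.isArray(parsed)
      ? parsed.filter((item): item is string => typeof item === "string")
      : [];
  } catch {
    return [];
  }
}

// Overwrites the stored history with the given list of queries.
function saveHistory(store: KeyValueStore, queries: string[]): void {
  store.setItem(HISTORY_KEY, JSON.stringify(queries));
}
```

In the browser this would be called as `loadHistory(window.localStorage)`.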

Limits

We don't want the search history to grow without bound, for reasons of performance and fair storage usage.

Maximum search query length

What is the maximum search query length today? Is there a limit at all? If not, we should probably set it to a reasonably big number of characters to prevent misuse and DDoS attack vectors. Even if there is no official limit today, there is definitely an implicit system limit imposed by OpenSearch (I suppose this is what's used to implement search queries), beyond which the backend fails with an OpenSearch error about the maximum character count.

As a more flexible solution, we may allow very long queries, but ignore the search queries exceeding some UTF-8 length (e.g. 500 bytes) for the purposes of search history bookkeeping. I'm speculating on the best max length here, since I don't have any statistics about the average query lengths. Do we have any metrics about that? Is there any time-series metrics instrumentation in the backend? I don't see any Prometheus/VictoriaMetrics nodes defined in the docker-compose.yml, so I suppose some of the existing statistics are collected via Postgres/OpenSearch?
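A sketch of that length check in TypeScript. The 500-byte value is just the example number from above, and `shouldRecordInHistory` is a hypothetical helper name. The important detail is that `String.length` counts UTF-16 code units, so the UTF-8 size has to be measured separately, here with the standard `TextEncoder`.

```typescript
// Example limit taken from the text above; the real value is undecided.
const MAX_HISTORY_QUERY_BYTES = 500;

// Length of the query in UTF-8 bytes (not UTF-16 code units, which `String.length` counts).
function utf8ByteLength(query: string): number {
  return new TextEncoder().encode(query).length;
}

// Whether a submitted query should be recorded in the search history.
// Blank queries and queries longer than `maxBytes` are skipped.
function shouldRecordInHistory(query: string, maxBytes: number = MAX_HISTORY_QUERY_BYTES): boolean {
  const trimmed = query.trim();
  return trimmed.length > 0 && utf8ByteLength(trimmed) <= maxBytes;
}
```

The query itself would still be executed as usual; only the history bookkeeping is skipped for oversized queries.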

On an unrelated note, I can help to set up VictoriaMetrics + Grafana (dashboards and proactive alerts). I already have a good template for that from my other project. If you'd be interested in that, please, let me know.

Maximum search items count

We should keep at most the N most actively used search items. The number should be something reasonable to make sure that we both store enough history that all actively used queries always appear in completions, and that rare/one-off/mistake queries are eventually removed from the storage.

This number must correlate with the max search query length as their product defines the maximum storage space we need to allocate per user to store their search history.
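A rough sketch of both points: count-based eviction and the resulting storage bound. All names (`HistoryEntry`, `recordSearch`, `maxHistoryBytes`) are hypothetical, and using recency to break ties between equally used entries is an assumption.

```typescript
// Hypothetical shape of one stored search history record.
interface HistoryEntry {
  query: string;
  count: number;
  lastUsedAt: number;
}

// Records one submission of `query` and keeps only the `maxItems` most used entries.
// Returns a new array; the input array is not modified.
function recordSearch(
  history: HistoryEntry[],
  query: string,
  now: number,
  maxItems: number,
): HistoryEntry[] {
  const exists = history.some(entry => entry.query === query);
  const updated = exists
    ? history.map(entry =>
        entry.query === query ? { ...entry, count: entry.count + 1, lastUsedAt: now } : entry,
      )
    : [...history, { query, count: 1, lastUsedAt: now }];

  // Most used first; among equally used entries the most recent one wins.
  return updated
    .sort((a, b) => b.count - a.count || b.lastUsedAt - a.lastUsedAt)
    .slice(0, maxItems);
}

// Upper bound on the raw query bytes stored per user (ignores counters and JSON overhead).
function maxHistoryBytes(maxItems: number, maxQueryBytes: number): number {
  return maxItems * maxQueryBytes;
}
```

For example, with an assumed N = 1000 and the 500-byte limit mentioned above, the bound is 1000 × 500 = 500,000 bytes, roughly 500 KB per user. One caveat of pure count-based eviction: once every stored entry has been used more than once, a brand-new query is evicted immediately, so a real implementation would likely need some decay or a few slots reserved for recent entries.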

Adoption. Migration to opt-out (again)

The eventual goal is to have custom completions enabled by default for everyone, including for people who objected to the feature for the reason of losing the browser auto-fill. To do that we may keep completions disabled by default for some time, but collect the search history regardless. For example, we can give it a month in this state, collect the search history from users, and then enable completions by default. On that day users will open Derpibooru and see the search history that is most likely already good enough, since it contains their most popular queries from the last month.

Unresolved Questions

Should we make the search history disableable? Will there be anyone who'd want to disable it? If so, why? Privacy concerns when sharing a screen or a Derpi account, maybe? Will they want to disable only the search history and not the entire autocompletions feature?
What's the best server-side API for this? Should we just load the entire history in one request and do client-side filtering? I guess in the worst case the history may be on the order of hundreds of kilobytes, so I'd vote for that instead of debounced on-the-fly completion requests, which may be slower and incur more load.
What's the optimal DB and DB layout for this? If we speculate that the search history should not be bigger than a few hundred kilobytes, then we can store it in an embedded Postgres array field. We may even store it inside the users table, but we need to make sure the search history is not loaded unnecessarily (e.g. via SELECT * somewhere). However, could this decrease the performance of queries that don't load the search history from the users table? I'm not sure how Postgres stores columns (SoA or AoS?), so I need to check the perf. implications of that layout. Alternatively, the search history can be stored in a separate table with a one-to-one relationship to users. WDYT? Or maybe a classic many-to-one will be fine here? I just think that storing the entire history in one column may yield better performance if we are going to return the entire history for client-side filtering anyway, but I don't have enough experience with Postgres yet to say for sure.


liamwhite · 2025-02-12T15:11:29Z

> We may provide server-side search history storage for logged-in users only.

I would prefer to keep it entirely client-side. I do not want the server to store or have any access to search query history.

> What is the maximum search query length today?

It's ultimately limited by the max query string length supported by Bandit. OpenSearch does not receive the query string directly--it is forwarded through a parser in the application to transform the query string into a query tree for sending to the search server, and for post-search evaluation of spoilers.

> I'm speculating on the best max length here, since I don't have any statistics about the average query lengths. Do we have any metrics about that?

There are no metrics, since I have always been opposed to logging user search queries.

MareStare · 2025-02-12T16:38:17Z

> I have always been opposed to logging user search queries.

I see. Anyhow, the metrics I'm talking about, like "query length", don't require logging the query contents. It's just a numeric metric, a histogram or a heatmap, i.e. we can measure the number of queries with length 1-10 characters, 11-20, 21-40, 41-80, 81-150, 151-300, and 301+. These are the buckets of a histogram, each with its own query count.

It'll look something similar to this on a pie chart, for example:

Otherwise, do you have any system metrics about the request latencies, system load (CPU load, memory usage spikes)? Something like this:

These metrics are just simple in-memory counters on the backend side (atomic integers/floats) that are flushed to a time-series DB such as Prometheus or VictoriaMetrics and retained for a specified period of time (e.g. only for the last month, half a year, a year or more, depending on your storage availability).
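To illustrate that such a metric never sees the query text, here is a sketch of the bucketed counter in TypeScript. It is for illustration only, since the real backend is not written in TypeScript; the class name is hypothetical and the bucket bounds are the ones from my list above.

```typescript
// Inclusive upper bounds of the buckets; the last bucket is open-ended.
const BUCKET_UPPER_BOUNDS = [10, 20, 40, 80, 150, 300, Infinity];

class QueryLengthHistogram {
  // counts[i] is the number of observed queries whose length falls into bucket i.
  readonly counts: number[] = BUCKET_UPPER_BOUNDS.map(() => 0);

  // Records only the length of a query; the query text is never passed in or stored.
  observe(queryLength: number): void {
    const index = BUCKET_UPPER_BOUNDS.findIndex(bound => queryLength <= bound);
    // `index` is -1 only for NaN, which is ignored.
    if (index >= 0) {
      this.counts[index] += 1;
    }
  }
}
```

The caller would do something like `histogram.observe(query.length)` on each search request, and only the seven counters would ever be exported to the time-series DB.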

So would you be interested in having that instrumentation (if not already)? I can make a separate issue for that.

Other than using client-side storage, I suppose there is nothing more to change?

MareStare added the enhancement (New feature or request) label on Feb 12, 2025

MareStare linked a pull request on Feb 24, 2025 that will close this issue: Search history completions #423 (Draft, 23 tasks)
