Search history completions #419

Open
MareStare opened this issue Feb 12, 2025 · 2 comments · May be fixed by #423
Labels
enhancement New feature or request


@MareStare
Contributor

MareStare commented Feb 12, 2025

Motivation

Today custom autocompletions in the search input are disabled by default, mainly because they replace the native browser auto-fill. But native auto-fill itself probably isn't the actual requirement.

Solution

The requirement is to have a functional search history in completions. Something similar to what Google does:

The history should be shown by default when the search input is empty and focused. It should also participate in completions search when the user has typed at least one character.

Ranking

We should rank the search history results by the number of times the search was invoked. The search history results should always take precedence over the default tag completions, since it's very likely the users have a set of "favourite" queries they repeat again and again.
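As a rough illustration of the ordering described above, here is a minimal TypeScript sketch. The `HistoryEntry` and `Suggestion` shapes and the `rankSuggestions` function are hypothetical names made up for this example, not part of the existing frontend code.

```typescript
// Hypothetical shapes for illustration only.
interface HistoryEntry {
  query: string;
  useCount: number; // how many times this query was submitted
}

type Suggestion =
  | { kind: "history"; text: string; useCount: number }
  | { kind: "tag"; text: string };

/**
 * History entries always come first, ordered by descending use count.
 * Tag completions follow in the order they were given.
 */
function rankSuggestions(
  history: HistoryEntry[],
  tagCompletions: string[],
  limit: number,
): Suggestion[] {
  const fromHistory = [...history]
    // Array.prototype.sort is stable, so ties keep their input order.
    .sort((a, b) => b.useCount - a.useCount)
    .map((entry): Suggestion => ({
      kind: "history",
      text: entry.query,
      useCount: entry.useCount,
    }));

  const fromTags = tagCompletions.map(
    (tag): Suggestion => ({ kind: "tag", text: tag }),
  );

  return [...fromHistory, ...fromTags].slice(0, limit);
}
```

With a small `limit`, frequently used history entries can push tag completions out entirely, so the limit may need to be chosen with that in mind.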

Manual deletion

There should be a button somewhere on the search history completion item to remove it from history. Google example:

This is useful when you want to remove some search history item because it contains typos, or maybe you just want to hide something from somebody 😳.

Matching

Currently, tag completions are formed based on a prefix match. I think that's rather inconvenient even for that purpose, but for the search history it will be even more of a problem, because search queries may contain multiple tags and even complex conditions. I think a fuzzy search like the one fzf implements would be much nicer overall. This may be rather difficult to implement, and thus may be discussed/done separately.

For an easy start, we may just use a case-insensitive substring match, i.e. the equivalent of ILIKE '%<input>%' logic.
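A case-insensitive substring match over the history could look roughly like the sketch below. The function name `matchHistory` is an assumption for this example. An empty input returns the whole history, matching the "show history when the input is empty and focused" behaviour proposed earlier.

```typescript
/**
 * Returns the history queries that contain `input` as a substring,
 * ignoring case. An empty (or whitespace-only) input matches everything.
 */
function matchHistory(queries: string[], input: string): string[] {
  const needle = input.trim().toLowerCase();
  if (needle === "") {
    return [...queries];
  }
  return queries.filter((query) => query.toLowerCase().includes(needle));
}
```

Unlike a prefix match, this finds a query such as `safe, Twilight Sparkle` when the user types `twi`.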

Server-side vs client-side

We may provide server-side search history storage for logged-in users only. The server-side storage will provide a shared history across all of a user's devices, mainly desktop + phone or multiple desktops. This is how Google does it and it's very convenient.

Anonymous users will have to enjoy the client-side localStorage solution only, and it'll serve as an additional incentive to create an account.

However, logged-in users will also rely on client-side localStorage as a cache for search history until server-side history is loaded.

Limits

We don't want the search history to grow unboundedly, for reasons of performance and fair storage usage.

Maximum search query length

What is the maximum search query length today? Is there a limit at all? If not, we should probably set it to a reasonably big number of characters to prevent misuse and DDoS attack vectors. Even if there is no official limit today, there is definitely an implicit system limit imposed by OpenSearch (I suppose this is what's used to implement search queries), beyond which the backend fails with an OpenSearch error about the maximum character count.

As a more flexible solution, we may allow very long queries, but ignore the search queries exceeding some UTF-8 length (e.g. 500 bytes) for the purposes of search history bookkeeping. I'm speculating on the best max length here, since I don't have any statistics about the average query lengths. Do we have any metrics about that? Is there any time-series metrics instrumentation in the backend? I don't see any Prometheus/VictoriaMetrics nodes defined in the docker-compose.yml, so I suppose some of the existing statistics are collected via Postgres/OpenSearch?
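The "ignore overly long queries for history bookkeeping" rule could be checked on the client roughly as follows. This is only a sketch: the names are made up for the example, and the 500-byte figure is the speculative value from the paragraph above, not a measured one.

```typescript
// Example figure from the discussion above; not based on real statistics.
const MAX_HISTORY_QUERY_BYTES = 500;

/** Length of the string once encoded as UTF-8, in bytes. */
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

/**
 * The query itself is still allowed to run; this only decides whether
 * it is worth remembering in the search history.
 */
function shouldRecordInHistory(query: string): boolean {
  const trimmed = query.trim();
  return (
    trimmed.length > 0 &&
    utf8ByteLength(trimmed) <= MAX_HISTORY_QUERY_BYTES
  );
}
```

Measuring bytes rather than `String.prototype.length` matters for non-ASCII tags, since a single character may take several bytes in UTF-8.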

On an unrelated note, I can help to set up VictoriaMetrics + Grafana (dashboards and proactive alerts). I already have a good template for that from my other project. If you'd be interested in that, please, let me know.

Maximum search items count

We should keep at most the N most actively used search items. The number should be reasonable: large enough that all actively used queries always appear in completions, and small enough that rare/one-off/mistaken queries are eventually removed from storage.

This number must be chosen together with the max search query length, since their product defines the maximum storage space we need to allocate per user to store their search history.
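To make the two limits concrete, here is a hedged sketch of bounded bookkeeping plus the worst-case size arithmetic. All names are hypothetical, the recency tie-break is an assumption that the proposal does not specify, and the value 1000 for N is purely illustrative.

```typescript
// Hypothetical entry shape for illustration only.
interface StoredEntry {
  query: string;
  useCount: number;
  lastUsedAt: number; // timestamp in milliseconds
}

/**
 * Records one use of `query` and keeps at most `maxEntries` entries.
 * Entries with the highest use count survive; among equal counts the
 * more recently used one wins (an assumption made for this sketch).
 */
function recordQuery(
  entries: StoredEntry[],
  query: string,
  now: number,
  maxEntries: number,
): StoredEntry[] {
  const existing = entries.find((entry) => entry.query === query);
  const updated: StoredEntry[] = existing
    ? entries.map((entry) =>
        entry === existing
          ? { ...entry, useCount: entry.useCount + 1, lastUsedAt: now }
          : entry,
      )
    : [...entries, { query, useCount: 1, lastUsedAt: now }];

  return updated
    .sort((a, b) => b.useCount - a.useCount || b.lastUsedAt - a.lastUsedAt)
    .slice(0, maxEntries);
}

/** Upper bound on query text size: entry count times max query size. */
function worstCaseBytes(maxEntries: number, maxQueryBytes: number): number {
  return maxEntries * maxQueryBytes;
}

// With illustrative limits of 1000 entries and 500 bytes per query,
// the query text alone can take at most 500000 bytes.
const illustrativeBudget = worstCaseBytes(1000, 500);
```

One caveat of evicting purely by use count: once the store is full of entries used at least twice, a brand-new query is dropped immediately and can never accumulate uses. A real implementation would probably need some form of aging or a reserved slot for recent queries.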

Adoption. Migration to opt-out (again)

The eventual goal is to have custom completions enabled by default for everyone, including for people who objected to the feature for the reason of losing the browser auto-fill. To do that we may keep completions disabled by default for some time, but collect the search history regardless. For example, we can give it a month in this state, collect the search history from users, and then enable completions by default. On that day users will open Derpibooru and see the search history that is most likely already good enough, since it contains their most popular queries from the last month.

Unresolved Questions

  • Should we make the search history disableable? Will there be anyone who'd want to disable it? If so, why? Privacy concerns when sharing a screen or a Derpi account, maybe? Will they want to disable only the search history and not the entire autocompletions feature?

  • What's the best server-side API for this? Should we just load the entire history in one request and do client-side filtering? I guess in the worst case the history may be on the order of hundreds of kilobytes, so I'd vote for that instead of debounced on-the-fly completion requests, which may be slower and incur more load.

  • What's the optimal DB and DB layout for this? If we speculate that the search history should not be bigger than a few hundred kilobytes, then we can store it in an embedded Postgres array field. We may even store it inside of the users table, but we need to make sure the search history is not loaded unnecessarily (e.g. via SELECT * somewhere). However, could this decrease the performance of queries that don't load the search history from the users table? I'm not sure how Postgres stores columns (SoA or AoS?), so I need to check the perf implications of that layout. Alternatively, the search history can be stored in a separate table with a one-to-one relationship to users. WDYT? Or maybe a classic many-to-one will be fine here? I just think that storing the entire history in one column may yield better performance if we are going to return the entire history for client-side filtering anyway, but I don't have enough experience with Postgres yet to say for sure.

@MareStare MareStare added the enhancement New feature or request label Feb 12, 2025
@liamwhite
Contributor

> We may provide server-side search history storage for logged-in users only.

I would prefer to keep it entirely client-side. I do not want the server to store or have any access to search query history.

> What is the maximum search query length today?

It's ultimately limited by the max query string length supported by Bandit. OpenSearch does not receive the query string directly--it is forwarded through a parser in the application to transform the query string into a query tree for sending to the search server, and for post-search evaluation of spoilers.

> I'm speculating on the best max length here, since I don't have any statistics about the average query lengths. Do we have any metrics about that?

There are no metrics, since I have always been opposed to logging user search queries.

@MareStare
Contributor Author

MareStare commented Feb 12, 2025

> I have always been opposed to logging user search queries.

I see. Anyhow, the metrics I'm talking about, like "query length", don't require logging the query contents. It's just a numeric metric - a histogram or a heatmap, i.e. we can measure the number of queries with length 1-10 characters, 11-20, 21-40, 41-80, 81-150, 151-300, and 301+. These are the buckets of a histogram, with a query count for each of them.

It'll look something similar to this on a pie chart, for example:

[Image: example pie chart of query counts per length bucket]

Otherwise, do you have any system metrics about the request latencies, system load (CPU load, memory usage spikes)? Something like this:

[Image: example dashboard of request latencies and system load]

These metrics are just simple in-memory counters on the backend side (atomic integers/floats) that are flushed to a time-series DB such as Prometheus or VictoriaMetrics and retained for a specified period of time (e.g. only for the last month, half a year, a year or more, depending on your storage availability).
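To show that such a histogram never touches the query text, here is a minimal sketch of the bucket counters using the bucket boundaries listed earlier in this comment. It is written in TypeScript purely for illustration; the real backend would use its own language and metrics library, and the class and method names here are made up. Flushing to a time-series DB is left out.

```typescript
// Inclusive upper bounds of the length buckets; the last one is open-ended.
const BUCKET_UPPER_BOUNDS = [10, 20, 40, 80, 150, 300, Infinity];

class QueryLengthHistogram {
  private counts: number[] = BUCKET_UPPER_BOUNDS.map(() => 0);

  /** Records only the length of a query, never its contents. */
  observe(queryLength: number): void {
    const index = BUCKET_UPPER_BOUNDS.findIndex(
      (bound) => queryLength <= bound,
    );
    if (index === -1) {
      return; // e.g. NaN; nothing sensible to record
    }
    this.counts[index] = (this.counts[index] ?? 0) + 1;
  }

  /** Current per-bucket counts, in bucket order. */
  snapshot(): { upperBound: number; count: number }[] {
    return BUCKET_UPPER_BOUNDS.map((upperBound, i) => ({
      upperBound,
      count: this.counts[i] ?? 0,
    }));
  }
}
```

A caller would invoke `observe(query.length)` once per search request and periodically export `snapshot()`.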

So would you be interested in having that instrumentation (if not already)? I can make a separate issue for that.


Other than using client-side storage, I suppose there is nothing more to change?

@MareStare MareStare linked a pull request Feb 24, 2025 that will close this issue
23 tasks