Improve type search #641

Merged
merged 5 commits into from
Jan 31, 2025

Conversation

wbazant
Collaborator

@wbazant wbazant commented Dec 14, 2024

Closes #635

Different resolution:

  1. show the synonyms when selecting

As Ethan points out, the synonyms can help find the match, but just using them in the background would produce a "matching but not sure why" experience. Meanwhile, they're quite interesting (a bit of plant trivia of the form "okra is also known as Ladies' fingers") and might also help when browsing.

  2. don't reorder search results

I proposed that solution originally, and from searching it seems to be borderline possible, but the library really didn't want me to do it, and it probably has drawbacks. Instead, match from the start only, and use common name + scientific name + all synonyms as possible starts.
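A minimal sketch of the "match from the start only" idea described above, not the code from this PR. The record shape (`commonName`, `scientificName`, `synonyms`) and the function names are assumptions made for illustration.

```javascript
// Hypothetical type record: { commonName, scientificName, synonyms }.
// Collect every name that is allowed to act as a "start" for matching.
function searchStarts(type) {
  return [type.commonName, type.scientificName, ...(type.synonyms ?? [])]
    .filter(Boolean)
    .map((name) => name.toLowerCase())
}

// True if the input is a prefix of any of the type's names.
// An empty input matches nothing here; that choice is arbitrary.
function matchesFromStart(type, input) {
  const query = input.toLowerCase()
  if (query === '') return false
  return searchStarts(type).some((start) => start.startsWith(query))
}

const okra = {
  commonName: 'Okra',
  scientificName: 'Abelmoschus esculentus',
  synonyms: ["Ladies' fingers"],
}

console.log(matchesFromStart(okra, 'ladies'))  // synonym prefix
console.log(matchesFromStart(okra, 'fingers')) // mid-name text, not a prefix
```

Because the result order is never touched, the search library's own ordering stays in place; only the set of accepted prefixes grows.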

@ezwelty
Collaborator

ezwelty commented Jan 26, 2025

@wbazant It's immediately fun to see the synonyms displayed. A reminder of all this data we have but aren't yet using. So thanks for bringing them to the fore!

The only request for change in this PR is fixing the design to handle many synonyms. Can we allow rows to expand?

[Screenshots: 2025-01-26 at 13 51 36, 2025-01-26 at 13 51 12]

I'm willing to accept that prefix matching will lead to the best result for most users/searches, although it may fail in some cases.

  • Search "mulberry" and expect to be able to choose from a list like "black mulberry", "red mulberry", "white mulberry" rather than just get "mulberry"
  • Search by cultivar name "Reinette ..." and find nothing. This could be solved by parsing cultivar names from scientific names (or perhaps returning a list of cultivars from the API) and adding them to the search bucket.
  • Search for anything whose common name starts with "common ....". This could be solved by always including the second part as a synonym (e.g. "common yarrow", "yarrow").

During testing, I realized that pending types cannot be distinguished from approved ones, which could lead to confusion. It isn't so serious, because pending types will get merged, but it might be worth considering flagging them somehow. We decided to include them so that a user can add a new type and then use it for subsequent locations without having to wait for the type to be approved (which could take months).

[Screenshot: 2025-01-26 at 13 51 55]

I also realized that matching fails because synonyms cannot realistically capture all permutations of no-space, space, and hyphenated versions (e.g. little-leaf linden, littleleaf linden, little leaf linden, small-leaved linden, ... lime, ...), which is a challenge in English, French, Portuguese (especially), and probably many others. Would it be crazy or helpful to ignore space and dash for matching?

Finally, while we're on the topic, there is the option of diacritic-insensitive matching for languages that use the Latin script as a base. I have this function in JavaScript for the purpose:
https://github.com/ezwelty/opentrees-harvester/blob/57110ccd51e5078665639ea593f799a7d59f9889/lib/helpers.js#L659

@ezwelty
Collaborator

ezwelty commented Jan 26, 2025

One more little idea: maybe use an interpunct instead of commas to separate the synonyms, for legibility and consistency with other lists?

@wbazant
Collaborator Author

wbazant commented Jan 31, 2025

Thanks for the detailed feedback! I did the following:

> Can we allow rows to expand?

The list was virtualized because it's slow to render it all. I removed react-window and replaced it with a basic infinite list, like on the list page or activity page. Now the rows can have variable heights, and it looks better, thanks!

I've made the tokenizer more elaborate and added some rules:

> Search "mulberry" and expect to be able to choose from a list like "black mulberry", "red mulberry", "white mulberry" rather than just get "mulberry"

I tried to generalize it as follows: if the parent's common name appears in the child's name, but not at the start (where we don't need to add it because it will show up during typing), then add the parent's name to the search reference. BTW I noticed it fails for 'European plum/Prunus domestica' because the parent there is 'Stone fruit', and in general the taxonomy of plums isn't quite right. That's not an issue for a regular user, because Plum/Prunus is the third term and they'll probably go for that entry!
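The parent-name rule above could be sketched roughly like this. The function name and signature are hypothetical; only the rule itself comes from the comment.

```javascript
// Returns extra search terms for a child type, given its parent's common name.
// The parent's name is added only when it occurs inside the child's name
// but NOT at the start (a leading occurrence already matches while typing).
function parentTerms(childCommonName, parentCommonName) {
  const child = childCommonName.toLowerCase()
  const parent = parentCommonName.toLowerCase()
  const index = child.indexOf(parent)
  // index > 0: found, and not as a prefix; -1 or 0: nothing to add
  return index > 0 ? [parentCommonName] : []
}

console.log(parentTerms('Black mulberry', 'Mulberry'))
console.log(parentTerms('European plum', 'Stone fruit'))
```

The second call illustrates the failure mentioned above: 'Stone fruit' never appears in 'European plum', so no extra term is added.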

> Search by cultivar name "Reinette ..."

Added cultivars as search terms
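How the cultivars are obtained isn't stated here. One option raised earlier in the thread is parsing them out of the scientific name; a rough sketch of that, assuming cultivar epithets are written in single quotes (the usual convention for cultivated plant names). The function name and the example name are illustrative only.

```javascript
// Extract cultivar epithets written in single quotes (straight or curly)
// from a scientific name. ASSUMPTION: the data follows this convention.
function cultivarNames(scientificName) {
  const names = []
  const pattern = /['‘]([^'‘’]+)['’]/g
  for (const match of scientificName.matchAll(pattern)) {
    names.push(match[1].trim())
  }
  return names
}

console.log(cultivarNames("Malus domestica 'Granny Smith'"))
console.log(cultivarNames('Prunus domestica'))
```

Each extracted epithet could then be added to the search reference as another possible start.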

> Search for anything whose common name starts with "common ....".

If commonName.toLowerCase().startsWith('common '), make a copy, strip the leading "[Cc]ommon\s+", and add the result to the search reference.
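As a sketch, that rule might look like the following. The function name is made up, and the regex uses the `i` flag rather than `[Cc]` so that it agrees with the case-insensitive `startsWith` check for inputs such as "COMMON yarrow".

```javascript
// Returns an extra search term with a leading "common " removed,
// or an empty array when the name does not start with "common ".
function withoutCommonPrefix(commonName) {
  if (!commonName.toLowerCase().startsWith('common ')) return []
  return [commonName.replace(/^common\s+/i, '')]
}

console.log(withoutCommonPrefix('Common yarrow'))
console.log(withoutCommonPrefix('Yarrow'))
```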

> pending types

Added "(Pending Review)" to the common name. It could be done in a fancier way, but this should be clear enough.

> Would it be crazy or helpful to ignore space and dash for matching?

Good suggestion! I now ignore anything matching [^\w\s], so dashes, apostrophes, etc. I started out ignoring spaces too, until I realised I want the space as a feature: 'elm ' shouldn't match 'elmleaf blackberry'. So we allow word ends in the input.
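A small sketch of that normalization, with hypothetical function names. Note that without the `u` flag, `\w` in JavaScript covers only ASCII letters, digits and underscore, so accented letters would be dropped unless they are folded to ASCII first.

```javascript
// Lowercase and drop everything that is neither a word character nor
// whitespace (dashes, apostrophes, ...). Spaces are kept on purpose.
function normalize(text) {
  return text.toLowerCase().replace(/[^\w\s]/g, '')
}

// Prefix match on the normalized forms of reference and input.
function prefixMatch(reference, input) {
  return normalize(reference).startsWith(normalize(input))
}

console.log(prefixMatch('little-leaf linden', 'littleleaf'))
console.log(prefixMatch('elmleaf blackberry', 'elm'))
console.log(prefixMatch('elmleaf blackberry', 'elm '))
```

Keeping the space means a trailing space in the input acts as an end-of-word marker, which is what lets 'elm ' exclude 'elmleaf blackberry'. The flip side is that 'little leaf linden' and 'littleleaf linden' still don't match each other.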

> diacritic-insensitive matching

Thanks, I did that! I apply toLowerCase and then toAscii to both the input and the reference.
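The `toAscii` helper isn't shown in this thread. A common way to get this effect, given here only as an assumed stand-in, is to decompose the string with `String.prototype.normalize('NFD')` and strip the combining marks.

```javascript
// ASSUMED stand-in for the toAscii helper mentioned above: decompose
// accented letters (NFD) and remove combining diacritical marks.
function toAscii(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

// Apply the same folding to both the user's input and the reference.
function fold(text) {
  return toAscii(text.toLowerCase())
}

console.log(fold('Açaí') === fold('acai'))
console.log(fold('Pêssego'))
```

Letters that have no canonical decomposition, such as 'ø' or 'ß', pass through unchanged with this approach, so a fuller helper would need an explicit replacement table for them.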

> maybe interpunct

Thanks! The interpunct looks better.

@wbazant
Collaborator Author

wbazant commented Jan 31, 2025

I'll merge this in since the feature is now completely gold-plated, but we can keep tweaking it and adding rules as we come up with them!

@wbazant wbazant merged commit 3a81a16 into falling-fruit:main Jan 31, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Type search in location form: prioritise prefix matches
2 participants