Bug: False positives with partial matches #25

Open
HEmile opened this issue Feb 23, 2022 · 16 comments · Fixed by #30 or #31

Comments

@HEmile

HEmile commented Feb 23, 2022

I'm seeing some false positives in my vault. In this case, 'on' is highlighted with the following replacements:
image

Here, 'matter' is highlighted
image

Same with 'zero'
image

I think this started happening in 1.4.1; it wasn't like this in 1.4.0.

(btw... apologies for all the issues I'm making in this repo. I think it will be super useful to me and this plugin will be of huge help to many in the community!)

@Kolooooo

Hi HEmile, I have a question: I downloaded the file from GitHub and added it to the Obsidian plugin library, and when I opened the plugin from Obsidian, it said "Failed to load plugin obsidian-sidekick". I really don't know which file to add to the Obsidian plugin library. Please advise, thank you~

@hadynz
Owner

hadynz commented Feb 23, 2022

Thanks for raising this.

> (btw... apologies for all the issues I'm making in this repo. I think it will be super useful to me and this plugin will be of huge help to many in the community!)

No issues at all. You are doing me a great service testing for me as you are.

I've been heads down focusing on supporting the selection of multiple words, single words, and "stemmed" single words in response to #19. This will change the behaviour that you are seeing again. I'll test the scenarios that you've raised to make sure you don't get the false positive examples you shared.

@HEmile
Author

HEmile commented Feb 24, 2022

I see, in that case it might be intentional... But since I have so many notes with long names (paper titles), this gives me a lot more recommendations than exact matching would (and live preview looks messy because of it).

Ideally, I'd prefer the option to turn this off. Stemming on single-word (or final-word) replacements does sound very useful, though.

@hadynz hadynz linked a pull request Feb 26, 2022 that will close this issue
@hadynz
Owner

hadynz commented Feb 26, 2022

Can you try the latest version that I just released - 1.5.0?

I've made sure that "stop words" are never indexed, so it should resolve the issue that you experienced.
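
As a rough illustration of the idea (not the plugin's actual code, which this thread doesn't show), the sketch below builds a word-to-notes index from note titles and skips stop words while indexing. `STOP_WORDS` and `build_index` are hypothetical names, the stop-word list is a tiny sample, the note titles are made up, and Python is used only for brevity.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "of", "on", "the", "to"}


def build_index(note_titles):
    """Map each non-stop word in a note title to the titles containing it."""
    index = defaultdict(set)
    for title in note_titles:
        for word in re.findall(r"[a-z0-9]+", title.lower()):
            if word in STOP_WORDS:
                continue  # stop words are never indexed
            index[word].add(title)
    return dict(index)


# Made-up note titles for illustration.
index = build_index(["A Survey on Graph Networks", "The Cost of Sampling"])
print(sorted(index))  # 'a', 'on', 'the' and 'of' are absent from the keys
```

Because stop words never enter the index, a lookup for 'on' finds nothing and no highlight is produced for it.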

@Jinnayah

The partial match option is giving me too many matches to be useful in version 1.5.0. Here's an example from a basic note in my vault. The only match that might've had something useful was "principle" (and it didn't, but I can see where it could have). "30" is matching to every entry in my daily notes that happened to be on the 30th of a month.

This might be more useful to me if partial matches were limited. In my vault, for example, if only one or two notes match a partial, that might be useful. If a dozen match, it's just a common word and unlikely to have a meaningful link.

overactive_sidekick
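
One way to read the suggestion above is as a cap on how many notes a partial term may match. Here is a minimal sketch, assuming a precomputed index that maps each term to the set of note titles containing it. `filter_common_terms` and `max_notes` are hypothetical names, and the default of 2 simply mirrors the "one or two notes" wording.

```python
def filter_common_terms(index, max_notes=2):
    """Keep only terms that match at most `max_notes` notes.

    `index` maps a term to the set of note titles that contain it.
    """
    return {
        term: titles
        for term, titles in index.items()
        if len(titles) <= max_notes
    }


# Made-up vault: "30" appears in six daily notes, "principle" in two notes.
index = {
    "30": {f"2022-{month:02d}-30" for month in (1, 3, 4, 5, 6, 7)},
    "principle": {"Note A", "Note B"},
}
kept = filter_common_terms(index)
print(sorted(kept))  # only "principle" survives the cap
```

The cap would presumably need to be user-configurable, since a sensible value depends on the size of the vault.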

@hadynz
Owner

hadynz commented Feb 26, 2022

Thanks for the feedback @Jinnayah.

It sounds like numbers in the text should never be highlighted and should be treated as stop words. That's something I can do.
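
A sketch of what treating numbers as stop words could look like, assuming tokens have already been split out of the text. The function names are illustrative and not taken from the plugin.

```python
def is_numeric_token(token):
    """Return True for tokens made up only of digits, such as "30" or "2022"."""
    return token.isdigit()


def indexable_tokens(tokens, stop_words=frozenset()):
    """Drop pure-number tokens and stop words before indexing."""
    return [
        t for t in tokens
        if not is_numeric_token(t) and t.lower() not in stop_words
    ]


print(indexable_tokens(["30", "principle", "the", "2022"], {"the"}))
```

Note that this only drops tokens that are entirely digits; a mixed token such as "30th" would still be indexed.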

Do you have any suggestions for how we can make partial matches more informed? There is feedback that people want it. But how do you think we can identify and surface a relevant partial match?

Alternatively, do you think that once #3 is implemented this becomes less of an issue, since you can simply build an ignore list for your vault?

@Jinnayah

Being able to exclude certain notes wouldn't help me in this case, but being able to add my own stop words would. This might be the easiest and most flexible for most people.

Another idea might be to have a threshold for matches. For example, here are the matches for 'principle' in the same file. Principle is one word of two for "Purcell Principle" and a journal entry, so there's a good chance those could be a match. It's one word out of 17 on the article about the Copernican Principle and 1 of 14 in the note about W.H.O., so it's less likely to be a match there. A threshold where the partial must match at least X% of the full name would filter out a lot of the false positives. (For my vault, it looks like a threshold around 15% of words would get rid of most false positives while still surfacing the good potential links.)

Principle_Matches
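
The threshold idea can be made concrete with a short sketch. It assumes the score is simply the number of title words covered by the partial match divided by the total number of words in the note title. The function names and the 0.15 default (taken from the 15% figure above) are illustrative, not the plugin's implementation.

```python
def match_ratio(partial, title):
    """Fraction of the title's words that also appear in the partial match."""
    title_words = title.lower().split()
    if not title_words:
        return 0.0
    partial_words = set(partial.lower().split())
    matched = sum(1 for word in title_words if word in partial_words)
    return matched / len(title_words)


def passes_threshold(partial, title, threshold=0.15):
    """Keep a suggestion only if the partial covers enough of the title."""
    return match_ratio(partial, title) >= threshold


# 1 word of 2 (50%) passes; 1 word of 17 (about 6%) does not.
short_title = "Purcell Principle"
long_title = "principle " + " ".join(f"w{i}" for i in range(16))  # 17-word stand-in title
print(passes_threshold("principle", short_title))  # True
print(passes_threshold("principle", long_title))   # False
```

With a 15% threshold, a single matching word is enough for titles of up to six words (1/6 is about 16.7%), while a 14-word title (about 7%) is filtered out.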

BTW, I just noticed that the note itself is being flagged as a possible link and probably shouldn't be. This note is named "Cognitive ease principle", and is coming up as an option for that phrase and for "principle".

@hadynz hadynz linked a pull request Feb 27, 2022 that will close this issue
@hadynz
Owner

hadynz commented Feb 27, 2022

> BTW, I just noticed that the note itself is being flagged as a possible link and probably shouldn't be. This note is named "Cognitive ease principle", and is coming up as an option for that phrase and for "principle".

Good call. This was a bug. Fixed in 1.5.1.

@hadynz
Owner

hadynz commented Feb 27, 2022

> Another idea might be to have a threshold for matches. For example, here are the matches for 'principle' in the same file. Principle is one word of two for "Purcell Principle" and a journal entry, so there's a good chance those could be a match. It's one word out of 17 on the article about the Copernican Principle and 1 of 14 in the note about W.H.O., so it's less likely to be a match there. A threshold where the partial must match at least X% of the full name would filter out a lot of the false positives. (For my vault, it looks like a threshold around 15% of words would get rid of most false positives while still surfacing the good potential links.)

That's not a bad idea. I will implement your suggestion and we can test it to see how useful the change is. I'll let you know when the change is made.

@HEmile
Author

HEmile commented Feb 27, 2022

I would honestly be most happy with an option to disable partial matches. My vault really isn't set up in a way that partial matches make sense, since they are mostly (pretty long) paper titles.

@HEmile
Author

HEmile commented Feb 28, 2022

An example: the different replacements for 'generator'. I prefer to manage this list by explicitly adding aliases, which gives me much more control and far fewer false positives (which are time-consuming to scroll through!)
image

@hadynz
Owner

hadynz commented Feb 28, 2022

Damn. That surely drives the point home.

Are any of those suggestions remotely useful for your use case, by any chance?

I've just come across RAKE, which I'm going to try out quickly to see if it is an even better solution than what I have at the moment.

@HEmile
Author

HEmile commented Feb 28, 2022

> Are any of those suggestions remotely useful for your use case, by any chance?

Not really. I don't think it shows all the recommendations, since the list filled my whole screen. There probably are some relevant recommendations like 'generative models', but I don't see them, probably because the list is ordered alphabetically.

Also, the stemming is rather aggressive: it seems to match 'generalized' for 'generator'.
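
To illustrate how an aggressive stemmer can conflate unrelated words, here is a toy sketch. `crude_stem` is a deliberately crude prefix truncation used as a stand-in, not the algorithm the plugin uses, and the `use_stemming` flag is a hypothetical switch that falls back to exact (case-insensitive) comparison.

```python
def crude_stem(word, length=5):
    """Toy stand-in for an aggressive stemmer: keep only the first few letters."""
    return word.lower()[:length]


def terms_match(a, b, use_stemming=True):
    """Compare two words by stem, or exactly when stemming is turned off."""
    if use_stemming:
        return crude_stem(a) == crude_stem(b)
    return a.lower() == b.lower()


print(terms_match("generator", "generalized"))                      # True: both collapse to "gener"
print(terms_match("generator", "generalized", use_stemming=False))  # False
print(terms_match("Generator", "generator", use_stemming=False))    # True
```

The more letters a stemmer strips, the more distinct words end up sharing a stem, which is where matches like 'generalized' for 'generator' can come from.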

@laurastephsmith

Hi, I'll chime in on this conversation rather than start a new one. I've just installed this for the first time, and the idea and the way you're approaching it are awesome! The first note I threw at it, though, gave matches for:

  • "things"
  • "really"
  • "back"
  • "going"
  • "to"

They're basically what you might call "filler words" rather than potential keywords. My initial hunch is that I only want to match nouns, or combinations of words that contain nouns. But of course that would mean running the whole thing against a dictionary. Hmm... I'm being the unhelpful person here pointing out a problem without being able to properly define the problem, let alone suggest a workable solution! But I'm here because I think this plugin has SO much potential to be incredible! So I offer my train of thought in that spirit ;)

@HEmile
Author

HEmile commented Mar 27, 2022

@laurastephsmith I created a fork with a setting to disable the rather aggressive stemming. You can install it from https://github.com/HEmile/obsidian-sidekick/releases/tag/1.1.0 and hopefully that solves the problem! (It does for me.)

@laurastephsmith

@HEmile oo thanks, I'll give it a go!

5 participants