OPENNLP-1589 - Fix incorrect array check in CachedFeatureGenerator #635

rzo1 · 2024-07-09T14:49:42Z

This (draft) PRs introduces a more aggressive caching strategy in the cached feature generator, which doesn't rely on ==.

However, the eval results are a bit odd for the Conll02 dataset:

Default (Spanish):

opennlp.tools.util.featuregen.CachedFeatureGenerator@2ddc9a9f: hits=100385 misses=52923 hit%0.6547929657943486
opennlp.tools.util.featuregen.CachedFeatureGenerator@76a4ebf2: hits=98677 misses=51533 hit%0.6569269689101924

1589-Caching (Spanish)

opennlp.tools.util.featuregen.CachedFeatureGenerator@2e385cce: hits=102229 misses=51079 hit%0.6668210399979126
opennlp.tools.util.featuregen.CachedFeatureGenerator@6e6f2380: hits=99197 misses=51013 hit%0.6603887890286931

Default (main) (Dutch)

opennlp.tools.util.featuregen.CachedFeatureGenerator@2ddc9a9f: hits=67301 misses=37687 hit%0.6410351659237247
opennlp.tools.util.featuregen.CachedFeatureGenerator@76a4ebf2: hits=123179 misses=68875 hit%0.6413769044123007

1589-Caching (Dutch)

opennlp.tools.util.featuregen.CachedFeatureGenerator@45d84a20: hits=68174 misses=36814 hit%0.6493504019506992
opennlp.tools.util.featuregen.CachedFeatureGenerator@52f27fbd: hits=124618 misses=67436 hit%0.6488695887614941

As you can see, the aggressive mechanism results in better caching.

It doesn't have an impact on Spanish and on any other eval test but the results for conll02 for dutch are odd (see changes in eval f1 scores).

They are sometimes slightly better but at the same time decrease in some scenarios.

I am actually wondering, why we don't see such changes in f-measure for Spanish. Therefore, I am opening this PR, so you can also investigate what is going on here.

@mawiesne is also having a look here, but we would appreciate some additional 👁️ 👁️

- removes deprecated getters in several records - removes a very long time deprecated constructor of TokenizerME - removes deprecated field in DefaultNameContextGenerator and adapts tests accordingly - marks a constructor of ADNameSampleStream as "for removal" to indicate actual removal is more likely in upcoming releases - marks deprecated 'NameFinderEventStream#generateOutcomes(..)' as "for removal" to indicate actual removal is more likely in upcoming releases - clears unused code from NameFinderME - fixes resource leak in TokenNameFinderFactory

… a dual caching strategy; relies on cache evication as implemented by the Cache class of OpenNLP. - Use Arrays.equals(...) - Demonstrate that caching based on identity is broken.

rzo1 · 2024-07-09T16:03:21Z

We found, that the == to equals results in different f-measures.

Guess, we would need to compute baseline f-measure for each test without a cache and compare those results against the f-measure computed with a cache in place.

Don't know how the original values we're computed.

rzo1 · 2024-07-09T16:33:49Z

Ok, update as found by @mawiesne: the f-values from main are the correct ones. Tests will output them, if cache is disabled.

Changing them (as done in my PR) is therefore an invalid operation.

Nevertheless, we need to investigate why == was used in the first place (and not equals), and why this change only affects Dutch.

rzo1 · 2024-07-09T19:14:59Z

I am going to close this PR for now. Given the fact, that the f-measure on main was created as a baseline without using an actual cache, this PR should create the same results with caching.

However, this is not the case, which implies, that the changes are not valid.

However, there are still some open questions to better understand what actually happens here and how we can implement a more efficient way of caching something without changing the f-measure in the eval test.

The following questions remain:

Why did we originally choose to use == instead of Arrays.equals(...) for caching the features.
Why does using Arrays.equals(...) only impact the eval tests for Dutch.

Side-note: Played around with IdentityHashMap to get a similar behaviour as on main. This will make the eval tests pass (as expected) because the cache numbers are identical, but doesn't make sense to impl it like that (because no gain)

mawiesne and others added 5 commits July 4, 2024 11:25

OPENNLP-1589 - Implements a more aggressive caching strategy based on…

9bac8a7

… a dual caching strategy; relies on cache evication as implemented by the Cache class of OpenNLP. - Use Arrays.equals(...) - Demonstrate that caching based on identity is broken.

OPENNLP-1589 - Adds Cache Printing by Martin

6123b8c

Add printing for EvalTest

73c2720

Adjust eval values

360e08d

rzo1 requested review from kinow, atarora, jzonthemtn and mawiesne July 9, 2024 14:49

Merge remote-tracking branch 'origin/main' into OPENNLP-1589

2458718

Restore baseline values

5610a22

rzo1 closed this Jul 9, 2024

rzo1 deleted the OPENNLP-1589 branch July 10, 2024 09:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OPENNLP-1589 - Fix incorrect array check in CachedFeatureGenerator #635

OPENNLP-1589 - Fix incorrect array check in CachedFeatureGenerator #635

rzo1 commented Jul 9, 2024

rzo1 commented Jul 9, 2024

rzo1 commented Jul 9, 2024 •

edited

Loading

rzo1 commented Jul 9, 2024

OPENNLP-1589 - Fix incorrect array check in CachedFeatureGenerator #635

OPENNLP-1589 - Fix incorrect array check in CachedFeatureGenerator #635

Conversation

rzo1 commented Jul 9, 2024

rzo1 commented Jul 9, 2024

rzo1 commented Jul 9, 2024 • edited Loading

rzo1 commented Jul 9, 2024

rzo1 commented Jul 9, 2024 •

edited

Loading