fix count of last n-grams #434

ben-freist · 2023-07-16T12:16:01Z

I think this fixes the problem with the way n-grams are counted that's described in #405.

The problem is that the last n-gram for which adjusted counts were computed had the wrong count.
To check this, I generated a bunch of texts, n-gram orders and pruning thresholds and compared lmplz's discounts with those from this Python script:
compute_discounts.txt
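For readers unfamiliar with the term, here is a minimal sketch of what "adjusted counts" usually means in Kneser-Ney smoothing. It is not the attached script and not lmplz's implementation. The function name `adjusted_counts` is hypothetical, and the sketch ignores sentence-boundary padding and pruning, which lmplz handles in its own way.

```python
from collections import Counter

def adjusted_counts(tokens, order, max_order):
    """Adjusted counts for n-grams of length `order` (illustrative only).

    Assumption: at the highest order the adjusted count is the raw count;
    at lower orders it is the number of distinct words seen immediately
    to the left of the n-gram. Sentence boundaries are ignored here.
    """
    if order == max_order:
        return Counter(
            tuple(tokens[i:i + order])
            for i in range(len(tokens) - order + 1)
        )
    # Collect distinct (order + 1)-grams, then drop the leftmost word:
    # each distinct left extension contributes exactly one count.
    longer = {
        tuple(tokens[i:i + order + 1])
        for i in range(len(tokens) - order)
    }
    return Counter(gram[1:] for gram in longer)

tokens = "a b a b c".split()
bigrams = adjusted_counts(tokens, 2, 2)
unigrams = adjusted_counts(tokens, 1, 2)
```

With this toy input, the bigram `a b` keeps its raw count, while each unigram is counted once per distinct left neighbour, so an off-by-one on a single n-gram can shift the count-of-counts that the discounts depend on.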

With this patch, out of 100 texts, 79 are rejected by both lmplz and the attached Python script, and for the remaining 21 I get the same discounts from both.

Without this patch, 78 texts are rejected by both lmplz and my script, 1 is rejected by my script but not by lmplz, and 2 are rejected by lmplz but not by my script. Of the remaining 19 texts, for which both compute discounts, lmplz and my script agree in 17 cases and differ in 2.
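To make the "rejected" versus "same discounts" distinction concrete, here is a minimal sketch of the standard modified Kneser-Ney discount formula (Chen and Goodman), computed from count-of-counts. It is not the attached script. The name `kn_discounts` is hypothetical, and the rejection rule (a missing count-of-counts, or a discount outside `[0, k]`) is my assumption about what "rejected" means here.

```python
def kn_discounts(n):
    """Modified Kneser-Ney discounts D_1..D_3 for one n-gram order.

    `n` maps an adjusted count k (1..4) to the number of n-grams that
    have that adjusted count. Returns None when the input is rejected
    (assumed rule: some n[k] is zero or a discount leaves [0, k]).
    """
    if any(n.get(k, 0) == 0 for k in (1, 2, 3, 4)):
        return None
    y = n[1] / (n[1] + 2 * n[2])
    discounts = {}
    for k in (1, 2, 3):
        d = k - (k + 1) * y * n[k + 1] / n[k]
        if d < 0 or d > k:
            return None
        discounts[k] = d
    return discounts

ok = kn_discounts({1: 4, 2: 2, 3: 1, 4: 1})
missing = kn_discounts({1: 4, 2: 2, 3: 1})        # no n-grams with count 4
negative = kn_discounts({1: 1, 2: 1, 3: 5, 4: 1})  # D_2 would be negative
```

Because every discount depends on ratios of neighbouring count-of-counts, a single n-gram with a wrong adjusted count can be enough to flip a small text between "rejected" and "accepted", or to change the resulting discounts.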

Should I add my test data here too?

fix count of last n-grams

6710df2
