Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix count of last n-grams #434

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

ben-freist
Copy link

I think this fixes a problem with the way ngrams are counted that's described in
#405.

The problem is that the last ngram for which adjusted counts were computed had the wrong count.
I generated a bunch of texts, ngram orders and pruning thresholds and compared with this python script
compute_discounts.txt

Out of 100 texts, with this patch 79 texts are rejected by both lmplz and the attached python script and for 21 I get the same discounts.

Without this patch 78 texts are rejected by both lmplz and my script, 1 is rejected by my script but not lmplz and 2 are rejected by lmplz but not my script. Among the texts for which discounts are computed, there's agreement between lmplz and my script in 17 cases and for 2 they are different.

Should I add my test data here too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant