Fix counts #12
The refactor branch (at least where we are now) leads to many more substitutions being detected (9030 out of 41397, although this changes a little when filtering again, since there's some stochasticity in the language detection). After much poking around, it turns out:
This third point and point 2 above account for the fact that the 9051 substitutions detected come from 1060 unique clusters, whereas mining with the previous code found substitutions in only 702 unique clusters (there's an overlap of 88 clusters, but the rest are new ones).
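To make the overlap arithmetic explicit, here is a minimal sketch that splits the two cluster counts into new-only, old-only, and combined totals. The function name `overlap_breakdown` is made up for illustration and is not part of the project's code; only the three counts come from the comment above.

```python
def overlap_breakdown(n_new, n_old, n_shared):
    """Split two overlapping sets, given only their sizes, into
    new-only, old-only, and union counts (hypothetical helper)."""
    if n_shared > min(n_new, n_old):
        raise ValueError("overlap cannot exceed the size of either set")
    return {
        "new_only": n_new - n_shared,        # clusters found only by the refactored code
        "old_only": n_old - n_shared,        # clusters found only by the previous code
        "union": n_new + n_old - n_shared,   # clusters found by either version
    }


# Counts quoted in the comment above: 1060 new, 702 old, 88 shared.
breakdown = overlap_breakdown(1060, 702, 88)
print(breakdown)
```

With these counts, 972 of the 1060 clusters are new, and 614 of the 702 clusters found by the previous code no longer yield substitutions.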
Also, with the new language module, filtering gives us 50427 / 71568 clusters, 141324 / 310457 quotes, 2601421 / 7665108 occurrences without page frequencies, and 2658805 / 8155875 occurrences with page frequencies. We decided to count occurrences without page frequencies.
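For comparing these counts across runs, it can help to look at them as fractions. The sketch below assumes each pair reads as (filtered, total); the dictionary keys and the `kept_fraction` helper are illustrative names, not identifiers from the codebase.

```python
# (filtered, total) pairs as quoted in the comment above.
# Key names are illustrative only.
filter_counts = {
    "clusters": (50427, 71568),
    "quotes": (141324, 310457),
    "occurrences_without_page_freq": (2601421, 7665108),
    "occurrences_with_page_freq": (2658805, 8155875),
}


def kept_fraction(filtered, total):
    """Return the share of items remaining after filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return filtered / total


fractions = {name: kept_fraction(*pair) for name, pair in filter_counts.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```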
New change: with the improved substitution filtering, we are getting 8245 out of 40868 candidate substitutions.
Further improving the filtering leads us to get fewer substitutions: 6318 / 40868. These numbers will now appear in a file in the
The number of words coded by frequency has also changed, due to the different filtering of clusters.
Clustering values have also changed because weights are now taken into account.
Finished updating these counts in the paper. This issue should be used for the cover letter too #21.
Changing the language detection module gave us a few more quotes. My first reproduction of the analysis gives
Stored 1188 of 6586 mined substitutions.
(instead of 1051 out of 6172.)