Fix counts #12

Closed
wehlutyk opened this issue Jan 12, 2016 · 8 comments

Comments

@wehlutyk
Owner

Changing the language detection module gave us a few more quotes. My first reproduction of the analysis reports "Stored 1188 of 6586 mined substitutions" (instead of 1051 out of 6172).

@wehlutyk
Owner Author

wehlutyk commented Mar 9, 2016

The refactor branch (at least where we are now) leads to many more substitutions being detected (9030 out of 41397, although this changes a little when filtering again, since there's some stochasticity in the language detection). After much poking around, it turns out:

  • first, a mistake in the previous code meant that only one substitution was counted when there were multiple occurrences in a destination bag/bin. The old code yielded only once per destination bag/bin and ignored the in-bag frequency (see for example here for the old tbg mining model). In accordance with what was actually explained in the first submission of the paper (see the text and figure), we now count one substitution per destination occurrence, instead of one for all the occurrences in a destination bag/bin.
  • second, bags/bins are now aligned to midnight (vs. aligned to cluster start in the previous code), which changes exactly which substitutions are detected (but overall, this shouldn't change the number of substitutions detected that much).
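The two changes above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the function names, the representation of a bin as a list of quote strings, and the one-day bin span are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_bin(destination_bin, quote):
    """Old behaviour (as described above): at most one substitution
    per destination bin, whatever the in-bin frequency of the quote."""
    return 1 if quote in destination_bin else 0


def count_per_occurrence(destination_bin, quote):
    """New behaviour: one substitution per occurrence of the quote
    in the destination bin."""
    return Counter(destination_bin)[quote]


def bin_start(timestamp, cluster_start, span=timedelta(days=1),
              align_midnight=True):
    """Start of the bin that contains `timestamp`.

    Bins are counted either from midnight of the cluster's first day
    (align_midnight=True) or from the cluster start itself.
    The one-day default span is a hypothetical value.
    """
    origin = cluster_start
    if align_midnight:
        origin = cluster_start.replace(hour=0, minute=0, second=0,
                                       microsecond=0)
    index = (timestamp - origin) // span  # whole spans elapsed since origin
    return origin + index * span


# Hypothetical destination bin: the substituted quote appears three times.
destination_bin = ["a b c", "a b d", "a b d", "a b d"]
old_count = count_per_bin(destination_bin, "a b d")
new_count = count_per_occurrence(destination_bin, "a b d")

# The same occurrence falls into different bins under the two alignments.
cluster_start = datetime(2008, 8, 1, 15, 0)
occurrence = datetime(2008, 8, 2, 10, 0)
midnight_bin = bin_start(occurrence, cluster_start)
cluster_bin = bin_start(occurrence, cluster_start, align_midnight=False)
```

With this toy data the old rule counts one substitution and the new rule counts three, and the occurrence lands in a bin starting at midnight of the second day instead of the bin starting at the cluster start, which is why the set of detected substitutions shifts.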

@wehlutyk
Owner Author

  • third, another bug in the previous code led to substitutions in sentences with an apostrophe (') being erroneously discarded. In mine/models.py, the mother quote of a substitution is reconstituted with spaces between all tokens. For instance, he's tokenizes to he + 's; reconstituting with ' '.join(tokens) gives he 's, with an additional space, which in turn lemmatizes to ['he', "'", 'S'] instead of ['he', 'be'], shifting the indices used by code further down.

This third point and point 2 above account for the fact that the 9051 substitutions detected come from 1060 unique clusters, whereas mining with the previous code found substitutions in only 702 unique clusters (there's an overlap of 88 clusters, but the rest are new ones).
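The apostrophe problem in the third point can be reproduced without any tokenizer, starting from an already tokenized quote. The sketch below is illustrative only: `join_clitics` is a hypothetical helper, not the fix actually applied in mine/models.py, and it would wrongly attach an opening single quote used as a quotation mark.

```python
def naive_join(tokens):
    """Reconstitute a quote with a space between all tokens
    (the behaviour described for the previous code)."""
    return " ".join(tokens)


def join_clitics(tokens):
    """Hypothetical alternative: re-attach clitic tokens such as 's, 're
    or n't to the preceding token instead of inserting a space."""
    text = ""
    for token in tokens:
        is_clitic = token.startswith("'") or token == "n't"
        if text and not is_clitic:
            text += " "
        text += token
    return text


tokens = ["he", "'s", "here"]
broken = naive_join(tokens)      # extra space before the clitic
repaired = join_clitics(tokens)  # original surface form
```

Re-tokenizing or lemmatizing `broken` can yield a different number of tokens than the original quote, which is what shifts the indices used further down.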

@wehlutyk wehlutyk added A-writing and removed A-fix labels Apr 18, 2016
@wehlutyk wehlutyk changed the title Fix substitution counts Fix counts Jun 29, 2016
@wehlutyk
Owner Author

wehlutyk commented Jun 29, 2016

Also, with the new language module we get 50427 / 71568 filtered clusters, 141324 / 310457 filtered quotes, 2601421 / 7665108 occurrences (without page frequencies), and 2658805 / 8155875 occurrences with page frequencies. We decide to count occurrences without page frequencies.
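For a quick sense of scale, the counts above can be turned into proportions. The sketch assumes each pair reads as "kept after filtering / total before filtering"; that reading, and the dictionary keys, are assumptions of this example.

```python
# Counts copied from the comment above, read as (kept, total).
counts = {
    "clusters": (50427, 71568),
    "quotes": (141324, 310457),
    "occurrences (without page frequencies)": (2601421, 7665108),
    "occurrences (with page frequencies)": (2658805, 8155875),
}


def kept_fraction(kept, total):
    """Fraction of items that survive filtering."""
    return kept / total


fractions = {name: kept_fraction(*pair) for name, pair in counts.items()}

for name, fraction in fractions.items():
    kept, total = counts[name]
    print(f"{name}: {kept} / {total} = {fraction:.1%}")
```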

This was referenced Jun 30, 2016
@wehlutyk
Owner Author

wehlutyk commented Jul 6, 2016

New change: with the improved substitution filtering, we are getting 8245 out of 40868 candidate substitutions.

@wehlutyk
Owner Author

Further improving the filtering leads us to get fewer substitutions: 6318 / 40868.

These numbers will now appear in a file in the codings/ folder.

@wehlutyk
Owner Author

wehlutyk commented Jul 12, 2016

The number of words coded by frequency has also changed, due to the different filtering of clusters.

@wehlutyk
Owner Author

Clustering values have also changed because weights are now taken into account.

@wehlutyk
Owner Author

wehlutyk commented Jul 13, 2016

Finished updating these counts in the paper. This issue should also be used for the cover letter (#21).
