Fix counts #12

wehlutyk · 2016-01-12T08:54:33Z

Changing the language detection module gave us a few more quotes. My first reproduction of the analysis now reports "Stored 1188 of 6586 mined substitutions" (instead of 1051 out of 6172).

wehlutyk · 2016-03-09T17:01:46Z

The refactor branch (at least where we are now) leads to many more substitutions being detected (9030 out of 41397, although this changes a little when filtering again, since there's some stochasticity in the language detection). After much poking around, it turns out:

first, a mistake in the previous code meant that only one substitution was counted even when the destination quote occurred several times in a destination bag/bin (the code yielded only once per destination bag/bin and ignored the in-bag frequency; see for example here for the old tbg mining model). In accordance with what was actually explained in the first submission of the paper (see the text and figure), we now count one substitution per destination occurrence, instead of one for all the occurrences in a destination bag/bin.
second, bags/bins are now aligned to midnight (vs. aligned to cluster start in the previous code), which changes exactly which substitutions are detected (but overall, this shouldn't change the number of substitutions detected that much).
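
To make the two points above concrete, here is a minimal sketch, not the actual mining code. The function names are hypothetical, a bin is modelled as a plain list of quote strings, and the one-day bin width is an assumption made only for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_once_per_bin(destination_bin, quote):
    """Old behaviour: at most one substitution per destination bin."""
    return 1 if quote in destination_bin else 0


def count_per_occurrence(destination_bin, quote):
    """New behaviour: one substitution per occurrence in the destination bin."""
    return Counter(destination_bin)[quote]


def bin_start(timestamp, cluster_start, align_to_midnight=True):
    """Return the start of the bin containing `timestamp`.

    The one-day bin width is an assumption for this sketch.
    """
    width = timedelta(days=1)
    if align_to_midnight:
        # Bins start at midnight of the day the cluster starts.
        origin = cluster_start.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        # Bins start at the exact time of the cluster's first occurrence.
        origin = cluster_start
    n_bins = (timestamp - origin) // width  # floor division gives an int
    return origin + n_bins * width


destination_bin = ["a b c", "a b d", "a b d", "a b d"]
print(count_once_per_bin(destination_bin, "a b d"))
print(count_per_occurrence(destination_bin, "a b d"))

cluster_start = datetime(2008, 9, 1, 18, 0)
occurrence = datetime(2008, 9, 2, 9, 0)
print(bin_start(occurrence, cluster_start, align_to_midnight=True))
print(bin_start(occurrence, cluster_start, align_to_midnight=False))
```

In this toy example the same occurrence falls into a different bin depending on the alignment, which is why the set of detected substitutions can change even if the total stays roughly the same.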

wehlutyk · 2016-03-14T18:50:27Z

third, another bug in the previous code led to substitutions in sentences with an apostrophe (`'`) being erroneously discarded. In `mine/models.py`, the mother quote of a substitution is reconstituted with spaces between all tokens. For instance, `he's` tokenizes to `he` + `'s`, which when reconstituted with `' '.join(tokens)` gives `he 's` with an additional space; this in turn lemmatizes to `['he', "'", 'S']` instead of `['he', 'be']`, shifting the indices used by code further down.

This third point and point 2 above account for the fact that the 9051 substitutions detected come from 1060 unique clusters, whereas mining with the previous code found substitutions in only 702 unique clusters (there's an overlap of 88 clusters, but the rest are new ones).
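
The extra space described in the third point can be reproduced without any tokenizer or lemmatizer. The sketch below is only an illustration: `join_clitics` is a hypothetical helper, not necessarily the fix used in the refactor branch, and it only handles tokens that start with an apostrophe (so not, for instance, `n't`).

```python
def naive_join(tokens):
    """Reconstitute a quote the way the previous code did."""
    return ' '.join(tokens)


def join_clitics(tokens):
    """Join tokens, attaching apostrophe-initial tokens to the previous one."""
    text = ''
    for token in tokens:
        # Only add a separating space before tokens that are not clitics.
        if text and not token.startswith("'"):
            text += ' '
        text += token
    return text


tokens = ['he', "'s", 'here']
print(repr(naive_join(tokens)))
print(repr(join_clitics(tokens)))
```

The first call yields the string with the spurious space (`he 's here`), while the second gives back `he's here`.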

wehlutyk · 2016-06-29T15:03:57Z

Also, with the new language module we get 50427 / 71568 filtered clusters, 141324 / 310457 filtered quotes, 2601421 / 7665108 occurrences (without page frequencies), and 2658805 / 8155875 occurrences with page frequencies. We decide to count occurrences without page frequencies.
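
For a quick sense of scale, the kept fractions can be computed directly from the numbers above. This sketch assumes each pair reads as "kept after filtering / total before filtering"; the dictionary keys are made up for readability.

```python
# Counts copied from the comment above, as (kept, total) pairs.
counts = {
    "clusters": (50427, 71568),
    "quotes": (141324, 310457),
    "occurrences (without page frequencies)": (2601421, 7665108),
    "occurrences (with page frequencies)": (2658805, 8155875),
}


def kept_fraction(kept, total):
    """Fraction of items that survive filtering."""
    return kept / total


fractions = {name: kept_fraction(*pair) for name, pair in counts.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```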

wehlutyk · 2016-07-06T09:16:45Z

New change: with the improved substitution filtering, we are getting 8245 out of 40868 candidate substitutions.

wehlutyk · 2016-07-12T09:09:32Z

Further improving the filtering leads us to get fewer substitutions: 6318 / 40868.

These numbers will now appear in a file in the codings/ folder.

wehlutyk · 2016-07-12T11:51:49Z

The number of words coded by frequency has also changed, due to the different filtering of clusters.

wehlutyk · 2016-07-12T12:02:11Z

Clustering values have also changed because weights are now taken into account.

wehlutyk · 2016-07-13T16:37:08Z

Finished updating these counts in the paper. This issue should also be used for the cover letter (#21).

wehlutyk added A-fix D-easy S-ready labels Jan 12, 2016

wehlutyk added A-writing and removed A-fix labels Apr 18, 2016

wehlutyk changed the title ~~Fix substitution counts~~ Fix counts Jun 29, 2016

This was referenced Jun 30, 2016: Misc. details #20 (closed), Cover letter #21 (open)

wehlutyk closed this as completed Jul 13, 2016
