Questions about keyword methodology #39

milanterlunen · 2017-02-20T13:22:08Z

@JonathanReeve I wanted to make sure I understand the methodology correctly. Is this a fair summary?

To identify the words that are disproportionately present in quoted vs. unquoted parts of the novel, we first split the novel's text into two parts: all the words that had been quoted at least once and all the words that hadn't. For the quoted corpus, we then multiplied each word's count by the number of times it had appeared in any quotation [did we do this? if not I think it's crucial that we do!]. Finally, we normalized the word frequencies to a rate per 100,000 words, to allow for comparison of different-sized corpora.
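To make the summary above concrete, here is a minimal sketch of the split, weight, and normalize steps. It is not the project's actual code: the input format (a list of tokens plus a parallel list saying how many quotations cover each token) and the names `split_and_weight` and `per_100k` are assumptions made for illustration.

```python
from collections import Counter


def split_and_weight(tokens, quote_counts):
    """Split tokens into a weighted quoted corpus and an unquoted corpus.

    tokens[i] is a word from the novel; quote_counts[i] is the number of
    quotations covering that word (hypothetical input format).
    """
    quoted, unquoted = Counter(), Counter()
    for word, n in zip(tokens, quote_counts):
        if n > 0:
            # Weight the word by how many times it was quoted.
            quoted[word] += n
        else:
            unquoted[word] += 1
    return quoted, unquoted


def per_100k(counts):
    """Convert raw counts to a rate per 100,000 words of that corpus."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 100_000 / total for word, count in counts.items()}
```

For example, `split_and_weight(["the", "rain", "the", "sun"], [2, 1, 0, 0])` puts "the" in the quoted corpus with weight 2 and once more in the unquoted corpus, so the two rates can then be compared even though the corpora differ in size.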

When the frequency per 100,000 words was similar for both corpora, we deemed the quotation to be proportionate to the whole. When the frequencies diverged most strongly, we considered this "disproportionate" and thus possible evidence of selection rather than chance. Here we offer a list of the top 25 words most disproportionately present in quoted vs. non-quoted text.
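The ranking step could look something like the sketch below. The summary doesn't say how "diverged most strongly" is measured, so this assumes a plain difference of the two per-100,000 rates (quoted minus unquoted); a ratio or log-ratio would be an equally plausible reading and would produce a different list. The function name `most_disproportionate` is hypothetical.

```python
def most_disproportionate(quoted_freq, unquoted_freq, n=25):
    """Return the n words whose quoted rate most exceeds their unquoted rate.

    Both arguments map word -> frequency per 100,000 words.
    Assumption: divergence is measured as a simple difference of rates.
    """
    words = set(quoted_freq) | set(unquoted_freq)
    diffs = {
        w: quoted_freq.get(w, 0.0) - unquoted_freq.get(w, 0.0) for w in words
    }
    # Sort by largest difference first; break ties alphabetically.
    ranked = sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]
```

One consequence of using a raw difference is that very common words can dominate the list, since even a small relative change in a frequent word is a large absolute change.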

And just a question that occurs to me as I'm writing this... When we're comparing these two corpora (unquoted vs. quoted-weighted), does it make more sense to compare their frequencies to each other or to the original full text of Middlemarch? Right now I'm struggling to get my head around the different implications of each approach, but I know they are different!
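One way to see the difference between the two baselines is to compute both for a single word. The sketch below deliberately uses unweighted counts, which is a simplification of the approach described above; in that case the full text is just the quoted and unquoted parts combined, so its rate always falls between the other two and the quoted-vs-full gap is a shrunken version of the quoted-vs-unquoted gap. With quotation-count weighting that relationship does not automatically carry over. The names `rate_per_100k` and `compare_baselines` are hypothetical.

```python
def rate_per_100k(count, total):
    """Rate of a word per 100,000 words of a corpus."""
    return count * 100_000 / total


def compare_baselines(word_quoted, total_quoted, word_unquoted, total_unquoted):
    """Compare a word's quoted rate against two baselines (unweighted counts).

    Assumption: the full text is exactly the quoted part plus the unquoted part.
    """
    q = rate_per_100k(word_quoted, total_quoted)
    u = rate_per_100k(word_unquoted, total_unquoted)
    full = rate_per_100k(
        word_quoted + word_unquoted, total_quoted + total_unquoted
    )
    return {"quoted_vs_unquoted": q - u, "quoted_vs_full": q - full}
```

For instance, a word appearing 30 times in 10,000 quoted words and 90 times in 90,000 unquoted words has rates of 300 and 100 per 100,000, and a full-text rate of 120. The gap is 200 against the unquoted text but only 180 against the full text, because the full text already contains the quoted passages.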

milanterlunen added the question label Feb 20, 2017

milanterlunen added the 2019-papers label May 20, 2019
