Questions about keyword methodology #39

Open
milanterlunen opened this issue Feb 20, 2017 · 0 comments

Comments

@milanterlunen
Collaborator

@JonathanReeve I wanted to make sure I understand the methodology correctly. Is this a fair summary?

To calculate which words are disproportionately present in the quoted vs. unquoted parts of the novel, we first split the novel's text into two corpora: all the words that had been quoted at least once and all the words that hadn't. For the quoted corpus, we then weighted each word by the number of times it had appeared in any quotation [did we do this? if not I think it's crucial that we do!]. Finally, we normalized the word frequencies to occurrences per 100,000 words, to allow comparison of different-sized corpora.
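
To make the weighting and normalization steps concrete, here is a minimal sketch of how I picture them. The toy `tokens` list and the `per_100k` helper are hypothetical and only there for illustration; I'm assuming each token comes paired with the number of times it appeared in a quotation, which may not match how the project's data is actually stored.

```python
from collections import Counter

def per_100k(counts):
    """Scale raw counts to occurrences per 100,000 words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n * 100_000 / total for word, n in counts.items()}

# Hypothetical toy input: each token is paired with the number of
# times it appeared in a quotation (0 = never quoted).
tokens = [("the", 0), ("light", 3), ("the", 1), ("dim", 0), ("light", 2)]

unquoted = Counter()
quoted_weighted = Counter()
for word, times_quoted in tokens:
    if times_quoted == 0:
        unquoted[word] += 1
    else:
        # Weighting step: a token quoted three times counts three times.
        quoted_weighted[word] += times_quoted

unquoted_freq = per_100k(unquoted)
quoted_freq = per_100k(quoted_weighted)
```

Without the weighting line, a passage quoted once and a passage quoted twenty times would contribute equally to the quoted corpus, which is why I think that step matters.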

When the frequency per 100,000 words was similar in both corpora, we deemed quotation of that word to be proportionate to the whole. When the frequencies diverged most strongly, we considered the word "disproportionately" quoted and thus possible evidence of selection rather than chance. Here we offer a list of the 25 words most disproportionately present in quoted vs. non-quoted text.
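
As a sketch of the ranking step: the summary doesn't say how "diverged" is measured, so the code below assumes a plain difference in frequency per 100,000 words (a ratio or log-ratio would be another option and could give a different list). The `top_divergent` function and the example frequencies are hypothetical.

```python
def top_divergent(quoted_freq, unquoted_freq, n=25):
    """Return the n words with the largest quoted-minus-unquoted gap.

    Assumption: divergence is a plain difference of per-100,000
    frequencies. Words missing from one corpus count as 0 there.
    """
    words = set(quoted_freq) | set(unquoted_freq)
    diffs = {
        w: quoted_freq.get(w, 0.0) - unquoted_freq.get(w, 0.0)
        for w in words
    }
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical per-100,000 frequencies, not real results.
quoted_freq = {"light": 900.0, "the": 5000.0, "dim": 10.0}
unquoted_freq = {"light": 300.0, "the": 5100.0, "dim": 40.0}

top = top_divergent(quoted_freq, unquoted_freq, n=2)
```

Sorting in descending order surfaces words over-represented in quotations; reversing the sort would surface the words critics tend to skip.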

And just a question that occurs to me as I'm writing this... When we're comparing these two corpora (unquoted vs. quoted-weighted), does it make more sense to compare their frequencies to each other or to the original full text of Middlemarch? Right now I'm struggling to get my head around the different implications of each approach, but I know they are different!
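
One way to probe the two baselines is a toy calculation. The sketch below assumes unweighted counts and plain differences in relative frequency, with made-up numbers. Under those assumptions the full text is just the quoted and unquoted corpora pooled, so the quoted-vs-full gap works out to the quoted-vs-unquoted gap shrunk by the unquoted share of the text. Once the quoted corpus is weighted, or ratios are used instead of differences, that relationship no longer has to hold.

```python
from collections import Counter

def rel_freq(counts):
    """Relative frequency of each word within one corpus."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# Hypothetical unweighted counts, not real data.
quoted = Counter({"light": 6, "the": 4})
unquoted = Counter({"light": 2, "the": 28})
full = quoted + unquoted  # the full text is the two parts pooled

# Share of all tokens that fall inside quotations (10 of 40 here).
quoted_share = sum(quoted.values()) / sum(full.values())

fq, fu, ff = rel_freq(quoted), rel_freq(unquoted), rel_freq(full)

# For each word: (gap vs. unquoted text, gap vs. full text).
gaps = {w: (fq[w] - fu[w], fq[w] - ff[w]) for w in full}
```

In this toy case both baselines rank the words the same way; the gaps against the full text are simply smaller, because the full text already contains the quoted words.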
