As part of the efforts of the Fabula-NET project at the Center for Humanities Computing, Århus University, we present a dataset of quality judgments on 9,000 19th- and 20th-century English-language literary novels by 3,166 predominantly Anglophone authors.
The data includes annotations of expert opinions and crowd-based resources to allow comparative analyses between different literary quality evaluations, as well as several textual metrics chosen for their connection with literary reception. A large part of the corpus is subject to copyright (see the available pre-1924 works here). We release quality and reception measures together with stylometric and sentiment data for each of the 9,000 novels to promote future research and comparison. Read the paper presenting this resource.
- 9,000 titles
- Author, title & year
- Various textual metrics
- Various reception metrics
For an overview of all included data, see the corpus documentation.
Available formats: .xlsx, .json
BOOK_ID | TITLE | AUTH_FIRST | AUTH_LAST | PUBL_DATE | ... | AVG_RATING | SCIFI_AWARDS | PULITZER | TRANSLATIONS | ... | PERPLEXITY | MEAN_SENT | READABILITY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6913 | A Clash of Kings | George R. R. | Martin | 1999 | ... | 4.41 | 1 | 0 | 38 | ... | 79.97 | -0.002 | 92.73 |
20636 | Dune | Frank | Herbert | 1965 | ... | 4.25 | 1 | 0 | 398 | ... | 72.74 | -0.007 | 85.18 |
22741 | Beloved | Toni | Morrison | 1987 | ... | 3.92 | 0 | 1 | 68 | ... | 68.78 | 0.030 | 91.71 |
5778 | Misery | Stephen | King | 1987 | ... | 4.20 | 0 | 0 | 74 | ... | 68.09 | -0.032 | 82.54 |
86 | The Portrait of a Lady | Henry | James | 1881 | ... | 3.78 | 0 | 0 | 53 | ... | 80.35 | 0.150 | 71.65 |
Above: Example of titles and corresponding values for selected metrics
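As a minimal sketch of working with the .json release, the snippet below parses a small inline sample built from three of the example rows above and filters it on one column. The one-record-per-title layout is an assumption and is not confirmed here; the column names are taken from the example table, and the helper names `load_records` and `titles_where` are hypothetical. Check the corpus documentation for the actual file structure.

```python
import json

# Inline sample mirroring three rows of the example table above.
# ASSUMPTION: the .json release is a list with one record per title.
SAMPLE_JSON = """
[
  {"BOOK_ID": 6913, "TITLE": "A Clash of Kings", "AUTH_LAST": "Martin",
   "PUBL_DATE": 1999, "AVG_RATING": 4.41, "PULITZER": 0},
  {"BOOK_ID": 22741, "TITLE": "Beloved", "AUTH_LAST": "Morrison",
   "PUBL_DATE": 1987, "AVG_RATING": 3.92, "PULITZER": 1},
  {"BOOK_ID": 86, "TITLE": "The Portrait of a Lady", "AUTH_LAST": "James",
   "PUBL_DATE": 1881, "AVG_RATING": 3.78, "PULITZER": 0}
]
"""


def load_records(text):
    """Parse a JSON string into a list of per-title dictionaries."""
    return json.loads(text)


def titles_where(records, column, value):
    """Return the titles of all records whose `column` equals `value`."""
    return [r["TITLE"] for r in records if r.get(column) == value]


records = load_records(SAMPLE_JSON)
pulitzer_titles = titles_where(records, "PULITZER", 1)
print(pulitzer_titles)
```

To use the real file instead, the inline string could be replaced with the contents of the downloaded .json file, provided its layout matches the assumption above.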
The corpus of texts from which we constructed our dataset was assembled by Hoyt Long and Richard Jean So at the Textual Optics Lab. It encompasses 9088 novels published in the United States between 1880 and 2000. Titles were selected according to the number of libraries holding each of them, as recorded in the WorldCat catalogue, favoring works with more library holdings.
Titles | Authors | Titles per author |
---|---|---|
9088 | 3166 | 2.88 |
Above: Number of titles/authors in the corpus
Below: Mean & SD of some of the included features
Metric | Word Count | Sentence Length | Word Length | Type/Token Ratio | Compressibility | Bigram Entropy | Word Entropy | Flesch Ease | Dale Chall New | Mean Sentiment | Std Sentiment | End Sentiment | Beginning Sentiment | Hurst Exponent | Approximate Entropy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mean (µ) | 118584.71 | 86.56 | 3.67 | 0.69 | 2.92 | 14.63 | 9.69 | 82.70 | 5.10 | 0.03 | 0.35 | 0.03 | 0.04 | 0.61 | 1.75 |
St. dev. (±) | 64746.05 | 29.44 | 0.18 | 0.02 | 0.14 | 0.55 | 0.30 | 6.48 | 0.33 | 0.04 | 0.04 | 0.07 | 0.05 | 0.04 | 0.15 |
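A summary of this kind can be sketched for any single feature with Python's standard library. The example below uses only the five READABILITY values from the example rows shown earlier, so its result will not match the corpus-wide figures. Whether the table reports a sample or a population standard deviation is not stated; the sketch assumes the sample version (`statistics.stdev`), and `summarize` is a hypothetical helper name.

```python
import statistics

# READABILITY values copied from the five example rows shown earlier.
readability = [92.73, 85.18, 91.71, 82.54, 71.65]


def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    # ASSUMPTION: sample SD; use statistics.pstdev for the population SD.
    return statistics.mean(values), statistics.stdev(values)


mean, sd = summarize(readability)
print(f"mean={mean:.2f}, sd={sd:.2f}")
```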
Beyond textual features, we present various "quality proxies", that is, ways of estimating valuation in literary culture, such as whether or not a title is included in Bestseller or Canon lists. We also include what we call "continuous" proxies, which assign a score to each title, for example Goodreads ratings or the number of translations (see the corpus documentation).
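One way to compare two continuous proxies is a rank correlation, which asks whether titles that score high on one proxy also tend to score high on the other. The sketch below computes Spearman's rho for AVG_RATING and TRANSLATIONS using only the five example rows shown earlier; this is an illustration of the idea, not the analysis used by the authors, and five titles are far too few to support any conclusion. The helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Values copied from the example table (same row order in both lists).
avg_rating = [4.41, 4.25, 3.92, 4.20, 3.78]
translations = [38, 398, 68, 74, 53]

rho = spearman(avg_rating, translations)
print(f"Spearman rho = {rho:.3f}")
```

Ranks are used rather than raw values because counts such as translations can be heavily skewed by a few outliers (Dune's 398 in the sample, for instance).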
Because titles were selected by library holdings, the corpus contains much high-quality fiction by authors who have received prestigious distinctions, such as the Nobel Prize (e.g., Toni Morrison) or the National Book Award (e.g., Don DeLillo). Yet library holdings appear to indicate both high distinction and mass popularity, reflecting library users' demand and preferences. The corpus therefore also contains widely popular novels from mainstream literature (e.g., Agatha Christie) and notable works across the broad spectrum of so-called "genre literature", from Mystery to Science Fiction (e.g., Tolkien and Philip K. Dick). An examination of the relation between various proxies in this corpus is forthcoming.
📄 Paper | The Chicago resource paper. |
✏️ Documentation | Detailed description of measures and proxies included in the dataset. |
🗂️ Previous works | Publications that have previously used the Chicago Corpus. |
🔬 Textual Optics Lab | The Chicago Corpus at the Textual Optics Lab, University of Chicago. |
📚 Citation | BibTeX citation. |
🔥 EmotionArcs | Emotion Arcs of the Chicago Corpus (a linked dataset). |
🔬 CHC | Center for Humanities Computing, hosting the Fabula-NET project. |