The Chicago Corpus

As part of the Fabula-NET project at the Center for Humanities Computing, Aarhus University, we present a dataset of quality judgments on roughly 9,000 19th- and 20th-century English-language literary novels by 3,166 predominantly Anglophone authors.

The data include annotations of expert opinions and crowd-based resources, allowing comparative analyses of different literary quality evaluations, as well as several textual metrics chosen for their connection with literary reception. A large part of the corpus is subject to copyright (see the available pre-1924 works here). We release quality and reception measures together with stylometric and sentiment data for each of the roughly 9,000 novels to promote future research and comparison. Read the paper presenting this resource.

⚡ Data included

9,000 titles
Author, title & year
Various textual metrics
Various reception metrics

For an overview of all included data, see the corpus documentation.

Available formats: .xlsx, .json

🔍 Example

BOOK_ID	TITLE	AUTH_FIRST	AUTH_LAST	PUBL_DATE	...	AVG_RATING	SCIFI_AWARDS	PULITZER	TRANSLATIONS	...	PERPLEXITY	MEAN_SENT	READABILITY
6913	A Clash of Kings	George R. R.	Martin	1999	...	4.41	1	0	38	...	79.97	-0.002	92.73
20636	Dune	Frank	Herbert	1965	...	4.25	1	0	398	...	72.74	-0.007	85.18
22741	Beloved	Toni	Morrison	1987	...	3.92	0	1	68	...	68.78	0.030	91.71
5778	Misery	Stephen	King	1987	...	4.20	0	0	74	...	68.09	-0.032	82.54
86	The Portrait of a Lady	Henry	James	1881	...	3.78	0	0	53	...	80.35	0.150	71.65

Above: Example of titles and corresponding values for selected metrics
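As a rough illustration of how rows like these might be handled, the sketch below parses tab-separated text into typed records and runs two simple queries. It uses only the five rows and the visible columns from the example above (the elided "..." columns are left out). The helper `parse_tsv` and the type sets are hypothetical names chosen for this sketch; the layout of the released .xlsx and .json files is not described here, so loading them may look different.

```python
import csv
import io

# Visible columns from the example table; the elided ("...") columns are omitted.
HEADER = ["BOOK_ID", "TITLE", "AUTH_FIRST", "AUTH_LAST", "PUBL_DATE",
          "AVG_RATING", "SCIFI_AWARDS", "PULITZER", "TRANSLATIONS",
          "PERPLEXITY", "MEAN_SENT", "READABILITY"]

# Rows copied from the example table.
ROWS = [
    ["6913", "A Clash of Kings", "George R. R.", "Martin", "1999",
     "4.41", "1", "0", "38", "79.97", "-0.002", "92.73"],
    ["20636", "Dune", "Frank", "Herbert", "1965",
     "4.25", "1", "0", "398", "72.74", "-0.007", "85.18"],
    ["22741", "Beloved", "Toni", "Morrison", "1987",
     "3.92", "0", "1", "68", "68.78", "0.030", "91.71"],
    ["5778", "Misery", "Stephen", "King", "1987",
     "4.20", "0", "0", "74", "68.09", "-0.032", "82.54"],
    ["86", "The Portrait of a Lady", "Henry", "James", "1881",
     "3.78", "0", "0", "53", "80.35", "0.150", "71.65"],
]

# Assumed column types for this sketch (not taken from the documentation).
INTEGER = {"BOOK_ID", "PUBL_DATE", "SCIFI_AWARDS", "PULITZER", "TRANSLATIONS"}
FLOAT = {"AVG_RATING", "PERPLEXITY", "MEAN_SENT", "READABILITY"}


def parse_tsv(text):
    """Parse tab-separated text into a list of dicts with typed values."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record = {}
        for key, value in row.items():
            if key in INTEGER:
                record[key] = int(value)
            elif key in FLOAT:
                record[key] = float(value)
            else:
                record[key] = value
        records.append(record)
    return records


tsv_text = "\n".join("\t".join(r) for r in [HEADER] + ROWS)
books = parse_tsv(tsv_text)

# Titles flagged as Pulitzer winners, and all titles sorted by average rating.
pulitzer_titles = [b["TITLE"] for b in books if b["PULITZER"] == 1]
by_rating = sorted(books, key=lambda b: b["AVG_RATING"], reverse=True)

print(pulitzer_titles)
print([b["TITLE"] for b in by_rating])
```

Converting values to numbers at parse time keeps later filtering and sorting free of string comparisons.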

📈 Corpus statistics

The corpus of texts from which we constructed our dataset was assembled by Hoyt Long and Richard Jean So at the Textual Optics Lab. It encompasses 9,088 novels published in the United States between 1880 and 2000, selected according to the number of libraries holding each title in the WorldCat catalogue, favoring works with more library holdings.

Titles	Authors	Titles per author
9088	3166	2.88

Above: Number of titles/authors in the corpus

Below: Mean & SD of some of the included features

Metric	Wordcount	Sentence Length	Wordlength	Type/Token Ratio	Compressibility	Bigram Entropy	Word Entropy	Flesch Ease	Dale Chall New	Mean Sentiment	Std Sentiment	End Sentiment	Beginning Sentiment	Hurst Exponent	Approximate Entropy
Mean (µ)	118584.71	86.56	3.67	0.69	2.92	14.63	9.69	82.70	5.10	0.03	0.35	0.03	0.04	0.61	1.75
St. dev. (±)	64746.05	29.44	0.18	0.02	0.14	0.55	0.30	6.48	0.33	0.04	0.04	0.07	0.05	0.04	0.15
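One possible use of these corpus-level figures is to place a single title relative to the corpus as a z-score, z = (value - mean) / SD. The sketch below does this for The Portrait of a Lady. It assumes, without confirmation from the documentation, that the READABILITY column of the example table corresponds to "Flesch Ease" and that MEAN_SENT corresponds to "Mean Sentiment"; `CORPUS_STATS` and `z_score` are names invented for this sketch.

```python
# Corpus-level (mean, standard deviation) pairs copied from the table above.
CORPUS_STATS = {
    "Flesch Ease": (82.70, 6.48),
    "Mean Sentiment": (0.03, 0.04),
}


def z_score(value, mean, sd):
    """Number of standard deviations by which value departs from the mean."""
    return (value - mean) / sd


# Values for "The Portrait of a Lady" from the example table.
# ASSUMPTION: READABILITY maps to "Flesch Ease" and MEAN_SENT to
# "Mean Sentiment"; check the corpus documentation before relying on this.
portrait = {"Flesch Ease": 71.65, "Mean Sentiment": 0.150}

z_scores = {
    metric: z_score(portrait[metric], *CORPUS_STATS[metric])
    for metric in portrait
}

for metric, z in z_scores.items():
    print(f"{metric}: z = {z:.2f}")
```

Under these assumptions, the novel sits about 1.7 standard deviations below the corpus mean for readability and about 3 standard deviations above it for mean sentiment.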

🏆 "Quality", "reader appreciation" or "popularity" metrics

Beyond textual features, we present various "quality proxies", that is, ways of estimating valuation in literary culture, such as whether a title is included in bestseller or canon lists. We also include what we call "continuous" proxies, that is, scores per title, such as Goodreads ratings or the number of translations (see the corpus documentation).
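To make the distinction between list-based and continuous proxies concrete, the sketch below compares a continuous proxy (AVG_RATING) across a yes/no proxy derived from the award columns of the example table. It is only an illustration on five rows, not a finding about the corpus. The derived `ANY_AWARD` flag and the helper `mean_by_flag` are hypothetical, and treating any non-zero value of SCIFI_AWARDS or PULITZER as "awarded" is an assumption of this sketch.

```python
from statistics import mean

# Selected columns for the five titles in the example table.
books = [
    {"TITLE": "A Clash of Kings", "AVG_RATING": 4.41, "SCIFI_AWARDS": 1, "PULITZER": 0},
    {"TITLE": "Dune", "AVG_RATING": 4.25, "SCIFI_AWARDS": 1, "PULITZER": 0},
    {"TITLE": "Beloved", "AVG_RATING": 3.92, "SCIFI_AWARDS": 0, "PULITZER": 1},
    {"TITLE": "Misery", "AVG_RATING": 4.20, "SCIFI_AWARDS": 0, "PULITZER": 0},
    {"TITLE": "The Portrait of a Lady", "AVG_RATING": 3.78, "SCIFI_AWARDS": 0, "PULITZER": 0},
]

# ASSUMPTION: any non-zero value in either award column counts as "awarded".
for book in books:
    book["ANY_AWARD"] = int(bool(book["SCIFI_AWARDS"] or book["PULITZER"]))


def mean_by_flag(records, flag, score):
    """Mean of a continuous proxy, split by a yes/no proxy.

    Returns a dict keyed by True/False; a group with no records maps to None.
    """
    groups = {True: [], False: []}
    for record in records:
        groups[bool(record[flag])].append(record[score])
    return {key: (mean(values) if values else None)
            for key, values in groups.items()}


rating_by_award = mean_by_flag(books, "ANY_AWARD", "AVG_RATING")
print(rating_by_award)
```

The same helper can be pointed at any pair of a yes/no column and a numeric column, for example PULITZER against TRANSLATIONS, once the full dataset is loaded.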

Because of the library-holdings selection criterion, the corpus comprises much high-quality fiction from authors who have received prestigious distinctions, such as the Nobel Prize (e.g., Toni Morrison) or the National Book Award (e.g., Don DeLillo). Yet library holdings appear to indicate both high distinction and mass popularity, reflecting library users' demand and preferences. The corpus therefore also comprises widely popular novels from mainstream literature (e.g., Agatha Christie) and notable works across the broad spectrum of so-called "genre literature", from Mystery to Science Fiction (e.g., Tolkien, Philip K. Dick). An examination of the relation between various proxies in this corpus is forthcoming.

📖 Documentation


📄 Paper	The Chicago resource paper.
✏️ Documentation	Detailed description of measures and proxies included in the dataset.
🗂️ Previous works	Publications that have previously used the Chicago Corpus.
🔬 Textual Optics Lab	The Chicago Corpus at the Textual Optics Lab, University of Chicago.
📚 Citation	Bibtex citation.
🔥 EmotionArcs	Emotion Arcs of the Chicago Corpus (a linked dataset).
🔬 CHC	Center for Humanities Computing, hosting the Fabula-NET project.
