
History domain training data sets


TABLE OF CONTENTS

(alphabetical)

  • British History Online (BHOL)

  • Common Corpus

  • Common Crawl

  • Early English Books Online (EEBO-ProQuest)

  • Europeana APIs

  • Evans-TCP

  • HathiTrust

  • Historic England APIs

  • Huygens Institute

  • Internet Archive APIs

  • Library of Congress

  • National Archives (TNA)

  • Text Creation Partnership (TCP I and TCP II)

  • Vrije Universiteit Amsterdam


British History Online (BHOL)

British History Online is a collection of nearly 1,300 volumes of primary and secondary content relating to British and Irish history, and histories of empire and the British world. BHO also provides access to 40,000 images and 10,000 tiles of historic maps of the British Isles.


Common Corpus

Common Corpus is the largest open, permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. It was launched on November 15, 2024.

Common Corpus differs from existing open datasets in that it is:

  • Truly Open: contains only data that is permissively licensed
  • Multilingual: mostly representing English and French data, but contains data for XX languages
  • Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers
  • Extensively Curated: spelling and formatting have been corrected in the digitized texts, harmful and toxic content has been removed, and material with low educational value has also been removed.

About Common Corpus

Common Corpus is made of five carefully curated collections:

  • OpenCulture: our largest collection at 926,541,096,243 tokens, featuring public domain books, newspapers, and Wikisource content. We've developed innovative tools like OCROnos-Vintage to correct historical digitization errors, while implementing advanced toxicity filtering to ensure content meets modern ethical standards.
  • OpenGovernment: 387,965,738,992 tokens of financial and legal documents, including Finance Commons (from sources like SEC and WTO) and Legal Commons (including Europarl and Caselaw Access Project), providing enterprise-grade training data from regulatory bodies and administrative sources.
  • OpenSource: 334,658,896,533 tokens of high-quality open-source code from GitHub, filtered using ArmoRM to ensure only the top 80% of submissions by quality rating are included.
  • OpenScience: 221,798,136,564 tokens of academic content from OpenAlex and other open science repositories, processed using vision-language models to preserve crucial document structure and formatting.
  • OpenWeb: 132,075,315,715 tokens from Wikipedia (official releases from the Wikimedia Foundation on Hugging Face), YouTube Commons, and other websites available under permissive licenses, such as Stack Exchange.
Collection | Domain | Sources
OpenGovernment | legal and administrative | Finance Commons (e.g. SEC, WTO) and Legal Commons (e.g. Europarl, Caselaw Access Project)
OpenCulture | cultural heritage | public domain books and newspapers, Wikisource
OpenScience | academic | OpenAlex, French theses
OpenWeb | web text | YouTube Commons, Stack Exchange
OpenSource | code | GitHub

Dataset Structure

Data Fields
  • identifier: unique text identifier
  • text: post-processed text
  • char_count: number of UTF-8 characters in text
  • file_name: original file path, organized by collection
  • set_id: set id (1-10)
  • subset_id: subset id (1-100)

How to Use
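As a rough sketch of how the dataset might be used, assuming it is published on Hugging Face under the identifier PleIAs/common_corpus (an assumption worth verifying on the dataset card), it can be streamed with the datasets library and filtered on the fields listed under Data Fields:

```python
# Minimal sketch: stream Common Corpus and keep only long documents.
# Assumes the dataset id "PleIAs/common_corpus" and the fields listed
# below (identifier, text, char_count); verify both on the dataset card.
from datasets import load_dataset

ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Lazily filter on a documented field rather than loading 2T tokens.
long_docs = (row for row in ds if row["char_count"] > 10_000)

for _, row in zip(range(3), long_docs):
    print(row["identifier"], row["char_count"])
    print(row["text"][:200], "...")
```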

Considerations for Using the Data

All data in Common Corpus are permissively licensed and may be used for both commercial and non-commercial purposes.

The dataset is multilingual. The language of each text is included in the metadata, so the data can be filtered by language. Additionally, some of the texts are historical. The year each text was written is included in the metadata, so it is possible to construct a dataset with a custom date cutoff if desired.

Discussion of Bias

Some of the dataset sources contain biased and toxic content, such as stereotypes about certain minoritized groups. We have removed texts which had high toxicity scores according to our toxicity classifier, Celadon, or which contain offensive terms and slurs. See our preprint for more details.

Personal and Sensitive Information

We have attempted to remove personally identifiable information (PII). We primarily use Microsoft Presidio, but make additional modifications to account for language- and country-specific considerations, such as European phone number formats.


Common Crawl

Source: Wikipedia article on Common Crawl

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2] Common Crawl's web archive consists of petabytes of data collected since 2008.[3] It completes crawls generally every month.[4]

Common Crawl was founded by Gil Elbaz.[5] Advisors to the non-profit include Peter Norvig and Joi Ito.[6] The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.

The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the common crawl dataset to work around copyright law in other legal jurisdictions.[7]

English is the primary language for 46% of documents in the March 2023 version of the Common Crawl dataset. The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.[8]
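For those who want to sample the archive directly, the sketch below queries the public CDX index for a URL and fetches the matching WARC record by byte range; the crawl label CC-MAIN-2023-14 and the use of the third-party warcio library are assumptions to adjust as needed:

```python
# Rough sketch of pulling one archived page from Common Crawl:
# query the CDX index for a crawl, then fetch the byte range of the
# WARC record and parse it with warcio. The crawl label below is an
# assumption; check https://index.commoncrawl.org/ for current crawls.
import json
from io import BytesIO

import requests
from warcio.archiveiterator import ArchiveIterator

index = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"
resp = requests.get(index, params={"url": "example.com", "output": "json"})
record = json.loads(resp.text.splitlines()[0])

# Fetch only the bytes of this record from the underlying WARC file.
start = int(record["offset"])
end = start + int(record["length"]) - 1
warc = requests.get(
    "https://data.commoncrawl.org/" + record["filename"],
    headers={"Range": f"bytes={start}-{end}"},
)

for rec in ArchiveIterator(BytesIO(warc.content)):
    if rec.rec_type == "response":
        print(rec.rec_headers.get_header("WARC-Target-URI"))
        print(rec.content_stream().read()[:200])
```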

History

Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.[9]

The organization began releasing metadata files and the text output of the crawlers alongside .arc files in July 2012.[10] Common Crawl's archives had only included .arc files previously.[10]

In December 2012, blekko donated to Common Crawl search engine metadata blekko had gathered from crawls it conducted from February to October 2012.[11] The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO."[11]

In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler.[12] Common Crawl switched from using .arc files to .warc files with its November 2013 crawl.[13]

A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020.[14]



Early English Books Online (EEBO-ProQuest)

From the first book published in English through the age of Spenser and Shakespeare, Early English Books Online (EEBO) contains over 146,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) and Wing's Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement. Libraries possessing this collection find they are able to fulfill the most exhaustive research requirements of graduate scholars in many subject areas, including English literature, history, philosophy, linguistics, theology, music, fine arts, education, mathematics, and science.

For more information about Early English Books Online, navigate to the Content Page.

Early English Books Online resides on the ProQuest Platform. For additional information about basic and advanced functionality or administrative capabilities, visit the ProQuest Platform LibGuide.

Content releases:

  • EEBO Wing 147: content release, November 20, 2023
  • EEBO Wing 146: content release, December 12, 2022
  • EEBO: new content release in 2021
  • EEBO: 2020 release details and title samples

Resources: (1) API


Europeana APIs

Europeana provides access to diverse historical datasets, including digitized manuscripts, books, photographs, and artwork, suitable for training Historical Large Language Models (HLLMs). Utilizing these resources, HLLMs can achieve a deeper understanding of historical language, cultural contexts, and societal norms, leading to improved accuracy in generating historical text, translating historical documents, and analyzing historical trends. Europeana offers APIs for programmatic access to its collections, including the Search API and the Record API.
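A minimal sketch of a Search API call (an API key is required, and the base URL should be checked against the current Europeana API documentation):

```python
# Minimal sketch of a Europeana Search API call. The wskey value is a
# placeholder for a registered API key, and the base URL is an
# assumption to verify against the current Europeana documentation.
import requests

resp = requests.get(
    "https://api.europeana.eu/record/v2/search.json",
    params={"wskey": "YOUR_API_KEY", "query": "VOC 1650", "rows": 5},
)
for item in resp.json().get("items", []):
    print(item.get("title"), item.get("dataProvider"))
```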


Evans-TCP

Evans Early American Imprints (Evans) TCP: Evans-TCP, a partnership among the TCP, the Readex division of NewsBank, and the American Antiquarian Society (AAS), created 5,000 accurately keyed and fully searchable SGML/XML text editions from among the 40,000 titles available in the online Evans Early American Imprints collection (series I).

Readex and AAS undertook a project to digitize the entire Evans collection, which now includes every item previously issued in microform, plus a series of supplements drawn from both the AAS collection and that of the Library Company of Philadelphia — some 1,200 additional works located, catalogued, and digitized since completion of the earlier effort (a total of more than 36,000 works and 2,400,000 images). The comprehensive online version of “Evans” uses OCR technology to support searching the full text of the corpus. As with the ECCO materials, therefore (much of it contemporary with the Evans materials), the task of Evans-TCP was not to produce the first ever searchable text, but to produce more accurate, more reliably searchable text of a subset of the entire body of pre-1800 American imprints.

The TCP sought initially to convert 6,000 (and ended up converting 5,000) of the most frequently studied books from the Evans bibliography. Selection of items to transcribe was left entirely to the AAS, drawing on its already significant knowledge of these materials and its contacts with relevant specialists and scholars. It is our belief that by relying on the expertise (as well as on the catalogue records) of the AAS, Evans-TCP was able to provide a diverse text collection comprising the best possible core of significant titles. With the support of more than 90 institutions, Evans-TCP keyed and encoded 4,977 early American texts, which are available online to the public at large through either interactive search or bulk download.

The American Antiquarian Society (AAS), a learned society and research library founded in 1812, is the third oldest historical organization in the United States and the first to be national rather than regional in its purpose and the scope of its collections. It preserves the largest single collection of printed source material relating to the history, literature, and culture of the first 250 years of what is now the United States, and holds copies of nearly two-thirds of all books, pamphlets, and broadsides known to have been printed in this country between 1640 and 1821. In partnership with Readex (a division of NewsBank), AAS produced what has been called one of the most important microform collections ever, a reproduction of the contents of titles listed in Charles Evans’ bibliography of American imprints through 1800.


HathiTrust

HathiTrust Digital Library provides a sizeable repository of digitized material suitable for training Historical Large Language Models (HLLMs). Its extensive collection, comprising books, journals, and other textual resources, offers diverse linguistic and stylistic examples across various historical periods. HathiTrust offers programmatic access to its data through APIs, including the Bibliographic API and the Data API. These APIs facilitate efficient data retrieval for research and development purposes.

The HathiTrust Data API can be used to retrieve content (page images, OCR, and in some cases whole volume packages), as well as metadata for HathiTrust volumes. There are two methods of accessing the Data API: via a Web client, requiring authentication (users who are not members of a HathiTrust partner institution must sign up for a University of Michigan "Friend" Account), and programmatically using an access key that can be obtained at http://babel.hathitrust.org/cgi/kgs/request. Complete documentation of the API is available below. Version 2 is the most recent version.
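The Bibliographic API, by contrast, needs no authentication; a minimal sketch, using an arbitrary OCLC number purely as a placeholder:

```python
# Minimal sketch: look up HathiTrust volume metadata through the
# Bibliographic API, which requires no authentication. The OCLC number
# is an illustrative placeholder only.
import requests

oclc = "424023"  # placeholder identifier
resp = requests.get(
    f"https://catalog.hathitrust.org/api/volumes/brief/oclc/{oclc}.json"
)
data = resp.json()
for rec_id, rec in data.get("records", {}).items():
    print(rec_id, rec.get("titles"))
for item in data.get("items", []):
    print(item.get("htid"), item.get("rightsCode"))
```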

Data API Documentation: Version 2, Revised 26 May, 2015 - Download PDF

Data API related applications - Key Generation Service (for programmatic use) and Web client - Download PDF


Historic England APIs

The API catalogue contains the following 6 Historic England (HE) APIs:

Conservation Areas

Greater London Archaeological Priority Areas (APAs)

Historic England Aerial Investigation Mapping data

Historic England Heritage at Risk Register 2021

Historic England Heritage at Risk Register 2022

National Heritage List for England (NHLE)

Historic England Open Data Hub


Huygens Institute

Official letters of the United East India Company (VOC), 1610-1795, compiled by W.Ph. Coolhaas, J. van Goor and J.E. Schooneveld-Oosterling (2021). Available in digital form as images, PDF, and OCR text. The web application was developed for the Huygens ING by Gerbrandy Software in collaboration with E.C.M. Huysman. Digitization of the documents: Vector Eyes. Coordination: E.C.M. Huysman. Post-processing: L. Dokter, M. Gutierrez Rojas, S. Koopmans and F. Gutierrez Rojas. The copyright of the texts rests with the Huygens ING. If you are interested in a PDF file of the entire book, please contact [email protected]. Bear in mind that these are very large files and that you may consult them for your own use only.

Resources: (1) API


International Image Interoperability Framework (IIIF)

IIIF is a way to standardize the delivery of images and audio/visual files from servers to different environments on the Web where they can then be viewed and interacted with in many ways.

Modern Web browsers understand how to display formats like .jpg and .mp4 at defined sizes, but cannot do much else. The IIIF specifications align with general Web standards that define how all browsers work to enable richer functionality beyond viewing an image or audio/visual files. For images, that means enabling deep zoom, comparison, structure (i.e., for an object such as a book, structure = page order) and annotation. For audio/visual materials, that means being able to deliver complex structures (such as several reels of film that make up a single movie) along with things like captions, transcriptions/translations, annotations, and more.

IIIF makes these objects work in a consistent way. That enables portability across viewers, the ability to connect and unite materials across institutional boundaries, and more.

There are two main components to IIIF: delivering digital objects to sites and viewing them:

Delivering objects: The Image API defines how image servers deliver image pixels to a viewer. It allows the image to be sent as a full-sized image or as a smaller size, a zoomed portion, a rotated view, or a black-and-white version. All of these settings are designated by changing portions of the URL for an Image API resource. To try it out for yourself, head over to the Image API Playground and adjust some of the Image API parameters to see how they work with real images.

The Image API can be implemented on its own (most commonly to enable fast, deep zoom of very high-resolution files like TIFF and JPEG 2000), or alongside the Presentation API for additional viewing capabilities.

Viewing objects: The Presentation API attaches basic metadata and structure to digital objects, defining how they appear in viewers. It does this via the Manifest, a JSON file which bundles up all the different elements of an IIIF object (such as a single image, or a series of images) with basic metadata (like title, description, and rights information) and structural information (such as page order). (See the glossary below for more definitions of common IIIF terms.)
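As a concrete sketch of both APIs, the snippet below builds an Image API URL from the standard {region}/{size}/{rotation}/{quality}.{format} template and reads a few fields from a Presentation API 2.x manifest; the server URLs and identifiers are hypothetical placeholders:

```python
# Hypothetical sketch of the two IIIF APIs. The server base URLs and
# identifiers are placeholders; the URL template and manifest fields
# follow the IIIF Image API 2.x and Presentation API 2.x specifications.
import requests

# Image API: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
image_base = "https://iiif.example.org/image"   # placeholder server
identifier = "page-001"                          # placeholder identifier
thumb_url = f"{image_base}/{identifier}/full/!400,400/0/default.jpg"
print("Thumbnail URL:", thumb_url)

# Presentation API: a Manifest bundles metadata and page structure.
manifest_url = "https://iiif.example.org/manifest/book-1.json"  # placeholder
manifest = requests.get(manifest_url).json()
print("Title:", manifest.get("label"))
for seq in manifest.get("sequences", []):
    for canvas in seq.get("canvases", []):
        print(canvas.get("label"), canvas.get("@id"))
```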

There are many IIIF viewers. Some are general purpose tools while others specialize in particular kinds of content or functionality. IIIF-compatible viewers generally allow users to pan, zoom, rotate, and resize image objects, and play audio/visual files. Some allow annotation with text, audio, location, and more. Others allow comparison of objects from a single collection side-by-side (or even objects from multiple collections if the object’s Manifest is made available to users).

Resources: (1) IIIF API specification (2) How IIIF works


Internet Archive APIs

The Internet Archive is a non-profit digital library with a vast collection of digitized materials, including websites, books, audio recordings, and software. This makes it a valuable source of historical domain data for fine-tuning language models. By training on this data, language models can learn the nuances of language use across different time periods and contexts, improving their ability to understand and generate historically accurate text.

The Internet Archive offers a variety of APIs to access and interact with its vast collection of digital materials. Here are some of the key APIs it provides:

Metadata API: This allows you to fetch and update metadata for items in the archive. You can get information like titles, descriptions, creators, and dates.  

Files API: This API lets you access the actual files stored in the archive, such as books, movies, audio recordings, and software.

Wayback Machine API: This API allows you to interact with the Wayback Machine, enabling you to check whether a URL is archived and access snapshots of web pages from the past.

Item API: This allows you to create items, upload files, and manage metadata.

Tasks API: This API provides information about running, pending, and completed tasks.

Reviews API: This API allows registered users to store and retrieve reviews of items.

Relationships API: This API allows you to create relationships between items in the archive.

You can find more information about these APIs and their documentation on the Internet Archive Developer Portal: https://archive.org/developers/
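Two of these endpoints, the item Metadata API and the Wayback Machine availability check, are public and need no key; a minimal sketch, with a placeholder item identifier:

```python
# Minimal sketch of two public Internet Archive endpoints: the item
# Metadata API and the Wayback Machine availability check. The item
# identifier "example" is a placeholder.
import requests

# Metadata API: https://archive.org/metadata/{identifier}
meta = requests.get("https://archive.org/metadata/example").json()
print(meta.get("metadata", {}).get("title"))
for f in meta.get("files", [])[:5]:
    print(f.get("name"), f.get("format"))

# Wayback availability: is a URL archived, and when?
wb = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "example.com", "timestamp": "20080101"},
).json()
print(wb.get("archived_snapshots", {}).get("closest", {}).get("url"))
```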

Resources: (1) Documentation for public APIs at Internet Archive (2) Downloading in bulk using wget (3) Metadata read and write APIs (4) Internet Archive Developer Portal


Library of Congress

The Library of Congress makes three different loc.gov APIs available to the public:

JSON/YAML for loc.gov: The loc.gov API provides structured data about Library of Congress collections. The API was originally designed to power the loc.gov website, but in addition to providing HTML for the website it can provide a wealth of information in JSON format.

Sitemaps: A sitemap provides information on the relationships between the pages, videos, images and other resources on a website. Sitemaps are primarily used to inform search engines about the pages that are available for crawling. A sitemap is expressed as an XML file listing URLs and their associated metadata. Conventionally, sitemaps are not described as APIs, but it is convenient to discuss them in relation to other LC APIs since they are also used for automated interactions, especially by web crawlers.
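A rough sketch of the JSON route: appending fo=json to a loc.gov search URL returns structured results instead of HTML (the query below is purely illustrative):

```python
# Rough sketch of the loc.gov JSON API: the fo=json parameter asks
# loc.gov to return structured data rather than HTML. The search query
# is just an illustration.
import requests

resp = requests.get(
    "https://www.loc.gov/search/",
    params={"q": "east india company", "fo": "json"},
)
for result in resp.json().get("results", [])[:5]:
    print(result.get("title"), result.get("date"))
```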

Microservices: A microservice is a limited-purpose computer system written to carry out a specific role and using a lightweight API. The three microservices described on this page fall into three categories: Text Services, Image Services, and Streaming Services.

Text Services provides an API for accessing full-text OCR, word coordinates and context snippets on loc.gov.

Image Services provides an IIIF-compliant API for accessing and manipulating images from the Library of Congress.

Streaming Services provides an audio and video (A/V) delivery API for the Library of Congress.

Additional APIs

In addition to the loc.gov API, there are other APIs and machine-readable data services maintained by the Library separately from the loc.gov API. A number of them are detailed under Additional APIs and Services.

Congress.gov API: The Congress.gov Application Programming Interface (API) provides a method for Congress and the public to view, retrieve, and re-use machine-readable data from collections available on Congress.gov.

NLS BARD API: An API for the BARD service of the National Library Service for the Blind and Print Disabled.

Chronicling America API: An API for Chronicling America

PPOC - JSON HTTP API: An API for the Prints and Photographs Online Catalog

Linked Data Service: A Linked Data Service for Authority and Bibliographic Metadata

Search/Retrieval via URL: Search/Retrieval via URL (SRU) servers, including the Z39.50 gateway to the Library catalog.

Additional Resources: For more guidance on using the Library's web applications, APIs, and other data services to explore digital collections, see the Guide to Digital Scholarship at the Library of Congress.


National Archives (TNA)

Discovery holds more than 32 million descriptions of records held by The National Archives and more than 2,500 archives and institutions across the United Kingdom, as well as a smaller number of archives around the world.

The information in Discovery is made up of record descriptions provided by or derived from the catalogues of the different archives. Although some of The National Archives’ records have been digitised and can be read online, Discovery can’t search the words within them – only their description and title.

The TNA API allows developers to query the search engine and database within the Discovery service application programmatically, and returns results in XML or JSON for further processing. The service is offered as a beta, with some functionality still to be developed.

Terms of use

The Discovery application programming interface (API) is designed to maximise access to the information held in the Discovery service catalogue.

You are welcome to use the information and the images for personal use, educational purposes or commercial use under the Open Government Licence.

Do not make an unreasonable number of API calls or use the API in a way which significantly compromises the experience of its other users. As a guideline, you should make no more than 3,000 API calls per day, at a rate of no more than one request per second.
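As a rough illustration of staying within that guideline, the sketch below runs a search query once per second; the endpoint path and the sps.searchQuery parameter are assumptions based on the public API help pages and should be verified in the Discovery API sandbox:

```python
# Illustrative sketch of querying the Discovery API while respecting
# the guideline of at most one request per second. The endpoint path
# and the sps.searchQuery parameter are assumptions; verify them in
# the Discovery API sandbox before relying on them.
import time

import requests

BASE = "https://discovery.nationalarchives.gov.uk/API"
queries = ["East India Company", "Admiralty prize papers"]

for q in queries:
    resp = requests.get(
        f"{BASE}/search/records",
        params={"sps.searchQuery": q},
        headers={"Accept": "application/json"},
    )
    data = resp.json()
    # Inspect the top-level shape; take exact field names from the
    # API help page rather than from this sketch.
    keys = list(data) if isinstance(data, dict) else [type(data).__name__]
    print(q, resp.status_code, keys[:5])
    time.sleep(1)  # stay within the one-request-per-second guideline
```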

Understanding the Discovery service catalogue:

Note that the following information relates to the content, metadata and structure of The National Archives’ catalogue dataset; other catalogues and datasets within Discovery may be different.

A little understanding of the structure of the Discovery service catalogue will help with the API methods below. The catalogue dataset is organised hierarchically to reflect the origin and structure of the records. There are seven levels in the catalogue, ranging from 'department' at the top of the tree to pieces and, occasionally, items at the bottom:

Department – a government department, agency or body that creates the records

Division – administrative section of a department, when appropriate

Series – the main grouping of records with a common function or subject

Sub-series – smaller groupings of records with a common function or subject

Sub sub-series – smaller groupings of records with a common function or subject

Piece – not a single piece of paper: a piece can be a box, volume or file

Item – part of a piece: can be a bundle, a single document, a letter, and so on

Every level of description in the hierarchy is described within a catalogue entry according to the international standard ISAD (G). The dataset follows its rules for multi-level catalogues (specificity, relevancy, hierarchy position and non-repetition of information at different levels).

Resources: (1) Discovery (2) API help page


Text Creation Partnership (TCP I and TCP II)

The Text Creation Partnership was conceived in 1999 between the University of Michigan Library, Bodleian Libraries at the University of Oxford, ProQuest, and the Council on Library and Information Resources as an innovative way for libraries around the world to:

  • pool their resources in order to create full-text resources few could afford individually
  • create texts to a common standard suitable for search, display, navigation, and reuse
  • collaborate with commercial providers, rather than constantly bargaining and competing with them

As of today, the project has produced approximately 73,000 accurate, searchable, full-text transcriptions of early print books, which were previously only available as static page images.

These full texts can be found in the following digital collections:

  • Early English Books Online–TCP
  • Eighteenth-Century Collections Online–TCP
  • Evans Early American Imprints–TCP

Learn more about the texts; about using the content; about the history of the partnership; or consult our FAQ.


Vrije Universiteit Amsterdam

VOC GM NER corpus. Sophie Arnoult (Creator); Lodewijk Petram, Piek Vossen, Dirk Roorda, Jesse de Does (Contributors). Vrije Universiteit Amsterdam; Language, Network Institute.

Description

Corpus and training data for Named-Entity Recognition from the VOC General Letters. The corpus consists of a selection of letters from the Generale Missiven, a subset of the Overgebleven Brieven en Papieren corpus of the United East India Company (VOC). These letters were reports sent by governors-general and administrators of the VOC to the board, from locations where the VOC was active (Indonesia and other parts of Asia, as well as South Africa). The letters for the current corpus were edited and digitized by the Huygens Institute of Netherlands History between 1960 and 2007 as part of the Rijks Geschiedkundige Publicatiën (RGP) series. In this edition, letters were transcribed in part, while other parts were summarized.

The data in the current package consist of a selection of these letters, spread over time, where the original text and modern additions (notes and passage summaries) are extracted into separate documents to allow for training on either the historical text or the modern additions. The entities identified in the data are: persons, locations, organisations (mainly the VOC itself) and ships. These are completed with forms derived from location or religion names.

The 'corpus' folder contains files for the historical text and modern notes of each letter, in CoNLL 2002 format, taking paragraphs or separate notes as units for segmentation. The 'datasplit_all_standard' folder contains training, validation and test data for the 'standard' NER experiment on all the data referred to in the companion publication, splitting sequences longer than 256 subtokens.

For more information, see Arnoult et al., 2021. Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts. In Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Code, intermediary data and more information on the collection process can be found in the cltl/voc-missives package on Zenodo and GitHub.
Date made available: 2022.
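The CoNLL 2002 files in the 'corpus' folder are plain text, with one token and its tag per line and blank lines separating segments; a minimal, generic reader (the file path is a placeholder):

```python
# Generic reader for CoNLL 2002-style files: one "token tag" pair per
# line, with blank lines separating sentences/segments. The example
# file path at the bottom is a placeholder.
from typing import List, Tuple

def read_conll2002(path: str) -> List[List[Tuple[str, str]]]:
    sentences, current = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split()
            current.append((cols[0], cols[-1]))  # token, NER tag
    if current:
        sentences.append(current)
    return sentences

# sents = read_conll2002("corpus/letter_0001_text.conll")  # placeholder path
# print(sents[0][:10])
```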
