History domain training data sets (alphabetical)
- British History Online (BHOL)
- Common Corpus
- Common Crawl
- Early English Books Online (EEBO-ProQuest)
- Europeana APIs
- Evans-TCP
- HathiTrust
- Historic England APIs
- Huygens Institute
- Internet Archive APIs
- Library of Congress
- National Archives (TNA)
- Text Creation Partnership (TCP I and TCP II)
- Vrije Universiteit Amsterdam
British History Online is a collection of nearly 1300 volumes of primary and secondary content relating to British and Irish history, and histories of empire and the British world. BHO also provides access to 40,000 images and 10,000 tiles of historic maps of the British Isles.
Common Corpus is the largest open, permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. It launched on November 15, 2024.
Common Corpus differs from existing open datasets in that it is:
- Truly Open: contains only data that is permissively licensed
- Multilingual: mostly representing English and French data, but contains data for XX languages
- Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers
- Extensively Curated: spelling and formatting have been corrected in digitized texts, harmful and toxic content has been removed, and content of low educational value has also been removed.
Common Corpus is made of five carefully curated collections:
- OpenCulture: our largest collection at 926,541,096,243 tokens, featuring public domain books, newspapers, and Wikisource content. We've developed innovative tools like OCRonos-Vintage to correct historical digitization errors, while implementing advanced toxicity filtering to ensure content meets modern ethical standards.
- OpenGovernment: 387,965,738,992 tokens of financial and legal documents, including Finance Commons (from sources like SEC and WTO) and Legal Commons (including Europarl and Caselaw Access Project), providing enterprise-grade training data from regulatory bodies and administrative sources.
- OpenSource: 334,658,896,533 tokens of high-quality open-source code from GitHub, filtered using ArmoRM so that only the top 80% of submissions by quality rating are included.
- OpenScience: 221,798,136,564 tokens of academic content from OpenAlex and other open science repositories, processed using vision-language models to preserve crucial document structure and formatting.
- OpenWeb: 132,075,315,715 tokens from Wikipedia (official releases from the Wikimedia Foundation on Hugging Face), YouTube Commons, and other websites available under permissive licenses, such as Stack Exchange.
| Collection | Domain | Sources |
|---|---|---|
| OpenGovernment | legal and administrative | Finance Commons (e.g. SEC, WTO) and Legal Commons (e.g. Europarl, Caselaw Access Project) |
| OpenCulture | cultural heritage | public domain books and newspapers, Wikisource |
| OpenScience | academic | OpenAlex, French theses |
| OpenWeb | web text | YouTube Commons, Stack Exchange |
| OpenSource | code | GitHub |
Data Fields
- identifier: unique text identifier
- text: post-processed text
- char_count: number of UTF-8 characters in text
- file_name: original file path, organized by collection
- set_id: set id (1-10)
- subset_id: subset id (1-100)
All data in Common Corpus are permissibly licensed and may be used for both commercial and non-commercial purposes.
The dataset is multilingual. The language of each text is included in the metadata, so the data can be filtered by language. Additionally, some of the texts are historical. The year each text was written is included in the metadata, so it is possible to construct a dataset with a custom date cutoff if desired.
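A minimal sketch of such filtering, assuming the dataset is published on Hugging Face under an identifier like PleIAs/common_corpus and that the metadata exposes language and date fields (check the dataset card for the exact schema):

```python
# Sketch: stream Common Corpus from Hugging Face and filter by language and date.
# The dataset identifier and the "language"/"date" field names are assumptions;
# verify them against the dataset card before use.
from datasets import load_dataset

stream = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

def keep(example):
    # Keep English texts written before 1900 (custom date cutoff).
    # "language" and "date" are assumed metadata field names.
    return example.get("language") == "English" and int(example.get("date") or 0) < 1900

historical_english = stream.filter(keep)

for doc in historical_english.take(5):
    print(doc["identifier"], doc["char_count"])
```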
Some of the dataset sources contain biased and toxic content, such as stereotypes about certain minoritized groups. We have removed texts which had high toxicity scores according to our toxicity classifier, Celadon, or which contain offensive terms and slurs. See our preprint for more details.
We have attempted to remove personally identifiable information (PII). We primarily use Microsoft Presidio, but make additional modifications to account for language- and country-specific considerations, such as European phone number formats.
Link to Wikipedia article: Common Crawl
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2] Common Crawl's web archive consists of petabytes of data collected since 2008.[3] It completes crawls generally every month.[4]
Common Crawl was founded by Gil Elbaz.[5] Advisors to the non-profit include Peter Norvig and Joi Ito.[6] The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the common crawl dataset to work around copyright law in other legal jurisdictions.[7]
English is the primary language for 46% of documents in the March 2023 version of the Common Crawl dataset. The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.[8]
Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.[9]
The organization began releasing metadata files and the text output of the crawlers alongside .arc files in July 2012.[10] Common Crawl's archives had only included .arc files previously.[10]
In December 2012, blekko donated to Common Crawl search engine metadata blekko had gathered from crawls it conducted from February to October 2012.[11] The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO."[11]
In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler.[12] Common Crawl switched from using .arc files to .warc files with its November 2013 crawl.[13]
A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020.[14]
- Common Crawl in California, United States
- Common Crawl GitHub Repository with the crawler, libraries and example code
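For orientation, a minimal sketch of looking up archived captures in a Common Crawl index via its CDX-style API; the crawl label used is illustrative and should be replaced with a current crawl listed at https://index.commoncrawl.org/:

```python
# Sketch: query a Common Crawl index for captures of a URL.
# The crawl label (CC-MAIN-2023-14) is an example only.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-14-index"
resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"}, timeout=30)
resp.raise_for_status()

# The response is newline-delimited JSON, one record per capture.
for line in resp.text.splitlines()[:5]:
    record = json.loads(line)
    print(record["timestamp"], record["url"], record["filename"])
```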
From the first book published in English through the age of Spenser and Shakespeare, Early English Books Online (EEBO) contains over 146,000 titles listed in Pollard & Redgrave's Short-Title Catalogue (1475-1640) and Wing's Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement. Libraries possessing this collection find they are able to fulfill the most exhaustive research requirements of graduate scholars in many subject areas, including English literature, history, philosophy, linguistics, theology, music, fine arts, education, mathematics, and science.
For more information about Early English Books Online, navigate to the Content Page.
Early English Books Online resides on the ProQuest Platform. For additional information about basic and advanced functionality or administrative capabilities, visit the ProQuest Platform LibGuide.
- EEBO Wing 147 - content release November 20, 2023
- EEBO Wing 146 - content release December 12, 2022
- EEBO new content release in 2021
- EEBO 2020 release details and title samples
Resources: (1) API
Europeana provides access to diverse historical datasets, including digitized manuscripts, books, photographs, and artwork, suitable for training Historical Large Language Models (HLLMs). Utilizing these resources, HLLMs can achieve a deeper understanding of historical language, cultural contexts, and societal norms, leading to improved accuracy in generating historical text, translating historical documents, and analyzing historical trends. Europeana offers APIs, including the Search and Record APIs, for programmatic access to its collections.
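A minimal sketch of querying the Europeana Search API, assuming a registered API key (the wskey parameter); the query and response fields shown are illustrative and should be checked against the Search API documentation:

```python
# Sketch: search Europeana collections for a term via the Search API.
import requests

API = "https://api.europeana.eu/record/v2/search.json"
params = {
    "wskey": "YOUR_API_KEY",   # obtain a free key from Europeana's API registration page
    "query": "Vereenigde Oostindische Compagnie",
    "rows": 5,
}
resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()

for item in resp.json().get("items", []):
    print(item.get("title"), item.get("dataProvider"))
```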
Evans Early American Imprints (Evans) TCP. Evans-TCP, a partnership among the TCP, the Readex division of NewsBank, and the American Antiquarian Society (AAS), created 5,000 accurately keyed and fully searchable SGML/XML text editions from among the 40,000 titles available in the online Evans Early American Imprints collection (series I).
Readex and AAS undertook a project to digitize the entire Evans collection, which now includes every item previously issued in microform, plus a series of supplements drawn from both the AAS collection and that of the Library Company of Philadelphia — some 1,200 additional works located, catalogued, and digitized since completion of the earlier effort (a total of more than 36,000 works and 2,400,000 images). The comprehensive online version of “Evans” uses OCR technology to support searching the full text of the corpus. As with the ECCO materials, therefore (much of it contemporary with the Evans materials), the task of Evans-TCP was not to produce the first ever searchable text, but to produce more accurate, more reliably searchable text of a subset of the entire body of pre-1800 American imprints.
The TCP sought initially to convert 6,000 (and ended up converting 5,000) of the most frequently studied books from the Evans bibliography. Selection of items to transcribe was left entirely to the AAS, drawing on its already significant knowledge of these materials and its contacts with relevant specialists and scholars. It is our belief that by relying on the expertise (as well as on the catalogue records) of the AAS, Evans-TCP was able to provide a diverse text collection comprising the best possible core of significant titles. With the support of more than 90 institutions, Evans-TCP keyed and encoded 4,977 early American texts, which are available online to the public at large through either interactive search or bulk download.
The American Antiquarian Society (AAS), a learned society and research library founded in 1812, is the third oldest historical organization in the United States and the first to be national rather than regional in its purpose and the scope of its collections. It preserves the largest single collection of printed source material relating to the history, literature, and culture of the first 250 years of what is now the United States, and holds copies of nearly two-thirds of all books, pamphlets, and broadsides known to have been printed in this country between 1640 and 1821. In partnership with Readex (a division of NewsBank), AAS produced what has been called one of the most important microform collections ever, a reproduction of the contents of titles listed in Charles Evans’ bibliography of American imprints through 1800.
HathiTrust Digital Library provides a sizeable repository of digitized material suitable for training Historical Large Language Models (HLLMs). Its extensive collection, comprising books, journals, and other textual resources, offers diverse linguistic and stylistic examples across various historical periods. HathiTrust offers programmatic access to its data through APIs, including the Bibliographic API and the Data API. These APIs facilitate efficient data retrieval for research and development purposes.
The HathiTrust Data API can be used to retrieve content (page images, OCR, and in some cases whole volume packages), as well as metadata for HathiTrust volumes. There are two methods of accessing the Data API: via a Web client, requiring authentication (users who are not members of a HathiTrust partner institution must sign up for a University of Michigan "Friend" Account), and programmatically using an access key that can be obtained at http://babel.hathitrust.org/cgi/kgs/request. Complete documentation of the API is available below. Version 2 is the most recent version.
Data API Documentation: Version 2, Revised 26 May, 2015 - Download PDF
Data API related applications - Key Generation Service (for programmatic use) and Web client - Download PDF
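Alongside the Data API, a minimal sketch of the Bibliographic API, which requires no authentication; the OCLC number and response field names below are illustrative:

```python
# Sketch: look up HathiTrust volumes for an OCLC record number via the Bibliographic API.
import requests

oclc = "424023"  # example OCLC record number
url = f"https://catalog.hathitrust.org/api/volumes/brief/oclc/{oclc}.json"
resp = requests.get(url, timeout=30)
resp.raise_for_status()
data = resp.json()

# Each item entry describes a HathiTrust volume (htid) and its rights status.
for item in data.get("items", []):
    print(item.get("htid"), item.get("rightsCode"))
```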
The API catalogue contains the following 6 Historic England (HE) APIs:
- Greater London Archaeological Priority Areas (APAs)
- Historic England Aerial Investigation Mapping data
- Historic England Heritage at Risk Register 2021
- Historic England Heritage at Risk Register 2022
- National Heritage List for England (NHLE)
- Historic England Open Data Hub
Official letters of the United East India Company (VOC), 1610-1795, compiled by W.Ph. Coolhaas, J. van Goor and J.E. Schooneveld-Oosterling (2021). Available in digital form: image, PDF, and OCR. The web application was developed for the Huygens ING by Gerbrandy Software in collaboration with E.C.M. Huysman. Digitization of the documents: Vector Eyes. Coordination: E.C.M. Huysman. Post-processing: L. Dokter, M. Gutierrez Rojas, S. Koopmans and F. Gutierrez Rojas. The copyright of the texts rests with the Huygens ING. If you are interested in a PDF file of the entire book, please contact [email protected]. Bear in mind that these are very large files and that you may only consult them for your own use.
Resources: (1) API
IIIF is a way to standardize the delivery of images and audio/visual files from servers to different environments on the Web where they can then be viewed and interacted with in many ways.
Modern Web browsers understand how to display formats like .jpg and .mp4 at defined sizes, but cannot do much else. The IIIF specifications align with general Web standards that define how all browsers work to enable richer functionality beyond viewing an image or audio/visual files. For images, that means enabling deep zoom, comparison, structure (i.e., for an object such as a book, structure = page order) and annotation. For audio/visual materials, that means being able to deliver complex structures (such as several reels of film that make up a single movie) along with things like captions, transcriptions/translations, annotations, and more.
IIIF makes these objects work in a consistent way. That enables portability across viewers, the ability to connect and unite materials across institutional boundaries, and more.
There are two main components to IIIF: delivering digital objects to sites and viewing them:
Delivering objects: The Image API defines how image servers deliver image pixels to a viewer. It allows the image to be sent as a full-sized image or as a smaller size, a zoomed portion, a rotated view, or as a black and white version. All of these settings are designated by changing portions of the URL for an Image API resource. To try it out for yourself, head over to the Image API Playground and adjust some of the Image API parameters to see how they work with real images (a short URL-building sketch follows below).
The Image API can be implemented on its own (most commonly to enable fast, deep zoom of very high resolution files like TIF and JP2000), or alongside the Presentation API for additional viewing capabilities.
Viewing objects: The Presentation API attaches basic metadata and structure to digital objects, defining how they appear in viewers. It does this via the Manifest, a JSON file which bundles up all the different elements of an IIIF object (such as a single image, or a series of images) with basic metadata (like title, description, and rights information) and structural information (such as page order). (See the glossary below for more definitions of common IIIF terms.)
There are many IIIF viewers. Some are general purpose tools while others specialize in particular kinds of content or functionality. IIIF-compatible viewers generally allow users to pan, zoom, rotate, and resize image objects, and play audio/visual files. Some allow annotation with text, audio, location, and more. Others allow comparison of objects from a single collection side-by-side (or even objects from multiple collections if the object’s Manifest is made available to users).
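A minimal sketch of the Image API URL pattern described above; the base URL is a hypothetical image service, and the parameter values simply illustrate the region, size, rotation and quality options:

```python
# Sketch: build IIIF Image API URLs by varying the region/size/rotation/quality/format
# segments. BASE stands in for any IIIF Image API endpoint.
BASE = "https://example.org/iiif/my-image-id"  # hypothetical image service

def iiif_url(region="full", size="max", rotation="0", quality="default", fmt="jpg"):
    # Image API URL pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
    return f"{BASE}/{region}/{size}/{rotation}/{quality}.{fmt}"

print(iiif_url())                               # full image at maximum size
print(iiif_url(size="!512,512"))                # scaled to fit within 512x512
print(iiif_url(region="100,100,800,600"))       # zoomed crop of a region
print(iiif_url(rotation="90", quality="gray"))  # rotated, greyscale rendering
```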
Resources: (1) IIIF API specification (2) How IIIF works
The Internet Archive is a non-profit digital library with a vast collection of digitized materials, including websites, books, audio recordings, and software. This makes it a valuable source of historical domain data for fine-tuning language models. By training on this data, language models can learn the nuances of language use across different time periods and contexts, improving their ability to understand and generate historically accurate text.
The Internet Archive offers a variety of APIs to access and interact with their vast collection of digital materials. Here are some of the key APIs they provide:
- Metadata API: This allows you to fetch and update metadata for items in the archive. You can get information like titles, descriptions, creators, and dates.
- Files API: This API lets you access the actual files stored in the archive, such as books, movies, audio recordings, and software.
- Wayback Machine API: This API allows you to interact with the Wayback Machine, enabling you to check if a URL is archived and access snapshots of web pages from the past.
- Item API: This allows you to create items, upload files, and manage metadata.
- Tasks API: This API provides information about running, pending, and completed tasks.
- Reviews API: This API allows registered users to store and retrieve reviews of items.
- Relationships API: This API allows you to create relationships between items in the archive.
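A minimal sketch using the Metadata API described above; the item identifier is a placeholder for any public archive.org identifier:

```python
# Sketch: read an item's metadata and file list via the Internet Archive Metadata API.
import requests

identifier = "some-item-identifier"  # placeholder for a real archive.org item identifier
resp = requests.get(f"https://archive.org/metadata/{identifier}", timeout=30)
resp.raise_for_status()
data = resp.json()

print(data.get("metadata", {}).get("title"))
for f in data.get("files", [])[:10]:
    print(f.get("name"), f.get("format"))
```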
You can find more information about these APIs and their documentation on the Internet Archive Developer Portal: https://archive.org/developers/
Resources: (1) Documentation for public APIs at Internet Archive (2) Downloading in bulk using wget (3) Metadata read and write APIs (4) Internet Archive Developer Portal
The Library of Congress makes three different loc.gov APIs available to the public:
JSON/YAML for loc.gov: The loc.gov API provides structured data about Library of Congress collections. The API was originally designed to power the loc.gov website, but in addition to providing HTML for the website it can provide a wealth of information in JSON format.
Sitemaps: A sitemap provides information on the relationships between the pages, videos, images and other resources on a website. Sitemaps are primarily used to inform search engines about the pages that are available for crawling. A sitemap is expressed as an XML file listing URLs and their associated metadata. Conventionally, sitemaps are not described as APIs, but it is convenient to discuss them alongside the other LC APIs since they are also used for automated interactions, especially by web crawlers.
Microservices: A microservice is a limited-purpose computer system written to carry out a specific role and using a lightweight API. The three microservices described on this page fall into three categories: Text Services, Image Services, and Streaming Services.
Text Services provides an API for accessing full-text OCR, word coordinates and context snippets on loc.gov. Image Services provides an IIIF-compliant API for accessing and manipulating images from the Library of Congress. Streaming Services provides an audio and video (A/V) delivery API for the Library of Congress.
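A minimal sketch of the loc.gov JSON API: adding fo=json to a loc.gov request returns structured data. The query and response field names below are illustrative; consult the loc.gov API documentation for the full parameter list:

```python
# Sketch: request loc.gov search results as JSON using the fo=json parameter.
import requests

params = {"q": "east india company", "fo": "json", "c": 5}  # "c" limits results per page
resp = requests.get("https://www.loc.gov/search/", params=params, timeout=30)
resp.raise_for_status()

for result in resp.json().get("results", []):
    print(result.get("date"), result.get("title"))
```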
Additional APIs
In addition to the loc.gov API, there are other APIs and machine-readable data services maintained by the Library separately from the loc.gov API. A number of them are detailed under Additional APIs and Services.
- Congress.gov API: The Congress.gov Application Programming Interface (API) provides a method for Congress and the public to view, retrieve, and re-use machine-readable data from collections available on Congress.gov.
- NLS BARD API: An API for the BARD service of the National Library Service for the Blind and Print Disabled.
- Chronicling America API: An API for Chronicling America.
- PPOC - JSON HTTP API: An API for the Prints and Photographs Online Catalog.
- Linked Data Service: A Linked Data Service for Authority and Bibliographic Metadata.
- Search/Retrieval via URL: Search/Retrieval via URL (SRU) servers, including the Z39.50 gateway to the Library catalog.
Additional Resources: For more guidance on using the Library's web applications, APIs, and other data services to explore digital collections, see the Guide to Digital Scholarship at the Library of Congress.
Discovery holds more than 32 million descriptions of records held by The National Archives and more than 2,500 archives and institutions across the United Kingdom, as well as a smaller number of archives around the world.
The information in Discovery is made up of record descriptions provided by or derived from the catalogues of the different archives. Although some of The National Archives’ records have been digitised and can be read online, Discovery can’t search the words within them – only their description and title.
The TNA API allows developers to query the search engine and database within the Discovery service application programmatically, and returns results in XML or JSON for further processing. The service is offered as a beta with some functionality still to be developed.
Terms of use
The Discovery application programming interface (API) is designed to maximise access to the information held in the Discovery service catalogue.
You are welcome to use the information and the images for personal use, educational purposes or commercial use under the Open Government Licence.
Do not make an unreasonable number of API calls or use the API in a way which significantly compromises the experience of its other users. As a guideline, you should make no more than 3,000 API calls per day, at a rate of no more than one request per second.
Understanding the Discovery service catalogue:
Note that the following information relates to the content, metadata and structure of The National Archives’ catalogue dataset; other catalogues and datasets within Discovery may be different.
A little understanding of the structure of the Discovery service catalogue will help with the API methods below. The catalogue dataset is organised hierarchically to reflect the origin and structure of the records. There are seven levels in the catalogue, ranging from 'department' at the top of the tree to pieces and, occasionally, items at the bottom:
- Department – a government department, agency or body that creates the records
- Division – administrative section of a department, when appropriate
- Series – the main grouping of records with a common function or subject
- Sub-series – smaller groupings of records with a common function or subject
- Sub sub-series – smaller groupings of records with a common function or subject
- Piece – not a single piece of paper: a piece can be a box, volume or file
- Item – part of a piece: can be a bundle, a single document, a letter, and so on
Every level of description in the hierarchy is described within a catalogue entry according to the international standard ISAD (G). The dataset follows its rules for multi-level catalogues (specificity, relevancy, hierarchy position and non-repetition of information at different levels).
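A minimal sketch of querying the Discovery API while respecting the one-request-per-second guideline above; the endpoint and parameter names are assumptions based on the API sandbox and should be checked against the API help pages:

```python
# Sketch: search Discovery and stay within the published usage guideline.
# The /search/records endpoint and the sps.searchQuery parameter are assumptions;
# verify them in the Discovery API sandbox and help pages.
import time
import requests

BASE = "https://discovery.nationalarchives.gov.uk/API"

def search_records(query, pause=1.0):
    resp = requests.get(
        f"{BASE}/search/records",
        params={"sps.searchQuery": query},   # assumed parameter name
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    time.sleep(pause)  # respect the one-request-per-second guideline
    return resp.json()

results = search_records("High Court of Admiralty depositions")
print(sorted(results.keys()))  # inspect the response structure returned by the API
```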
Resources: (1) Discovery (2) API help page
The Text Creation Partnership was conceived in 1999 between the University of Michigan Library, Bodleian Libraries at the University of Oxford, ProQuest, and the Council on Library and Information Resources as an innovative way for libraries around the world to:
- pool their resources in order to create full-text resources few could afford individually
- create texts to a common standard suitable for search, display, navigation, and reuse
- collaborate with commercial providers, rather than constantly bargaining and competing with them
As of today, the project has produced approximately 73,000 accurate, searchable, full-text transcriptions of early print books, which were previously only available as static page images.
These full texts can be found in the following digital collections:
- Early English Books Online–TCP
- Eighteenth-Century Collections Online–TCP
- Evans Early American Imprints–TCP
Learn more about the texts; about using the content; about the history of the partnership; or consult our FAQ.
VOC GM NER corpus. Sophie Arnoult (Creator), Lodewijk Petram (Contributor), Piek Vossen (Contributor), Dirk Roorda (Contributor), Jesse de Does (Contributor). Vrije Universiteit Amsterdam; Language, Network Institute.
The MarineLives project was founded in 2012. It is a volunteer-led collaboration dedicated to the transcription, enrichment and publication of English High Court of Admiralty depositions.