Early Modern English Language Models [EMELM]
Alphabetical listing with links
C18th
A chatbot designed to respond in the style of Early American written texts. It has limited ability to handle multi-turn conversations and some formatting issues in its answers, because it was trained on data that retained line breaks.
A fine-tuned version of Mistral-Hermes 2, trained on synthetic question-answer pairs to replicate Early American prose. The training dataset consists of curated paragraphs from the Evans-TCP corpus; a quantized version of Mistral-Nemo-Instruct was used to generate questions for which these paragraphs serve as appropriate answers. Fine-tuning was conducted with the Axolotl framework on this custom dataset.
The idea came from Mark L. Thompson at the University of Groningen, with design and implementation by Michiel van der Ree, also at the University of Groningen.
Resources: (1) GitHub repository (2) Hugging Face model card
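The dataset-construction step described above can be sketched roughly as follows: an instruct model proposes, for each curated Evans-TCP paragraph, a question that the paragraph answers, and the pair is stored in an instruction-tuning format that Axolotl can ingest. The checkpoint name, prompt wording and row format below are illustrative assumptions, not taken from the project's repository.

```python
# Rough sketch of synthetic QA-pair construction for fine-tuning.
# Checkpoint, prompt and row format are assumptions, not the project's code.
from transformers import pipeline

# The entry mentions a quantized Mistral-Nemo-Instruct; the exact checkpoint is
# not specified, so a stock instruct model stands in here for illustration.
generator = pipeline("text-generation", model="mistralai/Mistral-Nemo-Instruct-2407")

def make_pair(paragraph: str) -> dict:
    prompt = (
        "Write one question to which the following passage would be a good answer.\n\n"
        f"Passage:\n{paragraph}\n\nQuestion:"
    )
    out = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    question = out[len(prompt):].strip()
    return {"instruction": question, "output": paragraph}  # one instruction-style row for Axolotl

paragraphs = ["<curated paragraph from the Evans-TCP corpus>"]
dataset = [make_pair(p) for p in paragraphs]
```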
MacBERTh and GysBERT are language models (more specifically, BERT models) pre-trained on historical textual material (date range: 1450-1950). MacBERTh is trained on English and GysBERT is trained on Dutch.
Pretrained on ca. 3.9B tokens drawn from EEBO, ECCO, COHA, CLMET3.1, EVANS, and the Hansard Corpus.
MacBERTh has been utilized in studies to assess its performance on Early Modern English data. For instance, the paper "How BERT Speaks Shakespearean English? Evaluating Historical Bias in Contextual Language Models" (2024) examines MacBERTh's capabilities in understanding Shakespearean English.
Resources:
(1) Hugging Face MacBERTh model card and model (2) Manjavacas, Enrique & Lauren Fonteyn. 2022. Adapting vs. Pre-training Language Models for Historical Languages. Journal of Data Mining & Digital Humanities jdmdh:9152. https://doi.org/10.46298/jdmdh.9152 (3) Cuscito, Miriam, Alfio Ferrara & Martin Ruskov. 2024. How BERT Speaks Shakespearean English? Evaluating Historical Bias in Contextual Language Models. https://doi.org/10.48550/arXiv.2402.05034
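Since MacBERTh is a standard BERT-style masked language model, it can be queried directly with the Hugging Face transformers library. A minimal sketch, assuming the Hub ID is emanjavacas/MacBERTh as given on the model card:

```python
# Minimal sketch: masked-token prediction with MacBERTh via transformers.
# The Hub ID "emanjavacas/MacBERTh" is an assumption taken from the model card.
from transformers import pipeline

fill = pipeline("fill-mask", model="emanjavacas/MacBERTh")

# MacBERTh is a BERT model, so the standard [MASK] token applies.
for pred in fill("Thou shalt not [MASK] thy neighbour."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```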
C17th
A fine-tune of Mistral-Hermes 2 on 11,000 early modern texts in English, French and Latin, mostly drawn from EEBO and Gallica. It can be used in conversation mode, answering in a historical language and style and using historical, dated references. The training dataset for MonadGPT consists of 10,797 rows and is 10.3 MB in size (auto-converted Parquet files).
Developed by French digital humanities researcher Pierre-Carl Langlais, aka Alexander Doria. Langlais is co-founder of the French private company Pleias, which focuses on open-source large language models.
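A hedged sketch of a single-turn conversation with MonadGPT via transformers. The Hub ID Pclanglais/MonadGPT and the ChatML prompt layout are assumptions carried over from the Hugging Face model card (the base model, OpenHermes 2, uses ChatML); neither is stated on this page.

```python
# Hedged sketch of conversation mode with MonadGPT. Hub ID and ChatML layout
# are assumptions drawn from the model card, not from this page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Pclanglais/MonadGPT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "<|im_start|>system\nYou are MonadGPT, a chatbot answering as a scholar "
    "of the seventeenth century would.<|im_end|>\n"
    "<|im_start|>user\nWhat causes the plague?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```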
C19th-mid-C20th
OCRonos-Vintage is a small specialized model for OCR correction of cultural heritage archives, pre-trained from scratch with llm.c on a dataset of cultural heritage archives from the Library of Congress, Internet Archive and Hathi Trust totalling 18 billion tokens. It can run on a CPU as well as a GPU. Tokenization is currently done with the GPT-2 tokenizer. Roughly 65% of the content was published between 1880 and 1920; the training data has a hard cut-off date of December 29th, 1955, with the vast majority published before 1940 and none before 1800.
Resources: (1) Try out model with CPU instance (2) Try out model with GPU instance (3) Configuration json file (4) Chart: distribution of pre-training tokens by year of publication (5) PleIAs bad data toolbox
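A CPU-only sketch of running OCRonos-Vintage for OCR correction with transformers. The Hub ID PleIAs/OCRonos-Vintage and the "### Text ### / ### Correction ###" prompt markers are assumptions drawn from the Hugging Face model card rather than from this page, and the noisy input line is made up for illustration.

```python
# CPU-only sketch of OCR correction with OCRonos-Vintage. Hub ID and prompt
# markers are assumptions drawn from the model card, not from this page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/OCRonos-Vintage"
tokenizer = AutoTokenizer.from_pretrained(model_id)     # GPT-2 tokenizer, per the entry above
model = AutoModelForCausalLM.from_pretrained(model_id)  # small enough to run on a CPU

noisy = "Tbe sbip arrlved in Bostou harbonr on the 3rd of Jnne."  # made-up OCR noise
prompt = f"### Text ###\n{noisy}\n\n### Correction ###\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```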
Arnett, Catherine, Eliot Jones, Ivan P. Yamshchikov & Pierre-Carl Langlais. 2024. Toxicity of the Commons: Curating Open-Source Pre-Training Data. arXiv:2410.22587.
Arnoult, Sophie, L. Petram & P. Vossen. 2021. Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts. Proceedings of LaTeCH-CLfL 2021. https://aclanthology.org/2021.latechclfl-1.3/
Arnoult, S. (Creator), Petram, L., Vossen, P., Roorda, D. & de Does, J. (Contributors). 2022. VOC GM NER corpus VU. DOI: 10.48338/vu01-hi67kl. https://publication.yoda.vu.nl/full/VU01/HI67KL.html
Manjavacas, Enrique & Lauren Fonteyn. 2022. Adapting vs. Pre-training Language Models for Historical Languages. Journal of Data Mining & Digital Humanities jdmdh:9152. https://doi.org/10.46298/jdmdh.9152
Manjavacas, Enrique & Lauren Fonteyn. 2022. Non-Parametric Word Sense Disambiguation for Historical Languages. Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (NLP4DH), 123-134. Association for Computational Linguistics. https://aclanthology.org/2022.nlp4dh-1.16
Rastas, Iiro, Yann Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen & Filip Ginter. 2022. Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model. Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, 68-77.
The MarineLives project was founded in 2012. It is a volunteer-led collaboration dedicated to the transcription, enrichment and publication of English High Court of Admiralty depositions.