MarineLives and machine transcription
What we do
MarineLives is a volunteer-led collaboration for the transcription and enrichment of English High Court of Admiralty (HCA) records from the C16th and C17th. The records provide a rich and underutilised source of social, material and economic history.
We focus on the record series HCA 13/. This series consists of books of depositions given by witnesses in the Court and recorded in writing by notaries public.
MarineLives has imaged most deposition books for the period 1570 to 1685 (HCA 13/20 to HCA 13/79) from the original physical documents, which are held by The National Archives (TNA) at Kew, England.
These depositions are well structured:
Date
Short form name of legal case
Front matter consisting of the name of the deponent, their place of residence, their occupation, and their age
Responses in numbered sequential order to a written allegation (or written libel)
Responses in numbered sequential order to a written interrogatory [many, but not all depositions]
Signoff (signature, initials, or mark)
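The structure listed above maps naturally onto a simple record type. The following is a minimal sketch in Python, not the project's actual data model: the class name `Deposition`, its field names, and the placeholder values in the example are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Deposition:
    """Hypothetical record mirroring the deposition structure listed above."""
    date: str                      # date of the deposition, as transcribed
    case_name: str                 # short form name of the legal case
    deponent_name: str             # front matter: name of the deponent
    residence: str                 # front matter: place of residence
    occupation: str                # front matter: occupation
    age: Optional[int]             # front matter: age, if legible
    # Numbered responses to the written allegation (or libel).
    allegation_responses: Dict[int, str] = field(default_factory=dict)
    # Numbered responses to a written interrogatory; empty when absent.
    interrogatory_responses: Dict[int, str] = field(default_factory=dict)
    signoff: str = ""              # "signature", "initials", or "mark"

    def has_interrogatory(self) -> bool:
        """True when the deposition includes interrogatory responses."""
        return bool(self.interrogatory_responses)


# Placeholder values only; these are not taken from any real deposition.
example = Deposition(
    date="PLACEHOLDER DATE",
    case_name="PLACEHOLDER CASE",
    deponent_name="PLACEHOLDER NAME",
    residence="PLACEHOLDER PLACE",
    occupation="PLACEHOLDER OCCUPATION",
    age=None,
    allegation_responses={1: "PLACEHOLDER RESPONSE"},
    signoff="mark",
)
```

Keeping interrogatory responses as an empty mapping by default reflects the point that many, but not all, depositions include them.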
Our data
MarineLives maintains a GitHub repository named Addaci/HCA for volumes of HCA 13 depositions.
Individual volumes are available for inspection and download from this repository.
We apply semi-diplomatic editorial standards to our transcriptions.
Our early work (2012 to 2021) was entirely hand transcribed, proofed and edited by trained MarineLives volunteers.
Since 2022 we have used machine transcription (HTR) to produce a rough cut, and have hand proofed and edited these rough cuts to hand-corrected groundtruth.
Since 2024, we have been experimenting with machine correction of machine-transcribed rough-cut (raw-HTR) text using rule-based Python scripts and context-sensitive machine learning techniques. After exploring OpenAI's GPT-4o, Google's Gemini Advanced 1.5, and Hugging Face's Zephyr 7B β (a fine-tuned version of mistralai/Mistral-7B-v0.1), we have chosen to work with Anthropic's Claude 3.5 Sonnet large language model, which shows high performance in text correction and analytical legal summarization.
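To give a flavour of what a rule-based clean-up pass over raw-HTR text can look like, here is a minimal Python sketch. It is not the project's actual script: the function name `apply_rules` and the three rules are hypothetical, and the use of `-` or `¬` as end-of-line word-break markers is an assumption about the input.

```python
import re

# Hypothetical rules for illustration only; not the project's actual rule set.
# Each rule is a (compiled pattern, replacement) pair, applied in order.
RULES = [
    # Collapse runs of spaces or tabs into a single space.
    (re.compile(r"[ \t]+"), " "),
    # Rejoin words split across a line break (assumed markers: "-" or "¬").
    (re.compile(r"(\w)[-¬]\n(\w)"), r"\1\2"),
    # Remove stray spaces before punctuation.
    (re.compile(r" +([,.;:])"), r"\1"),
]


def apply_rules(text, rules=RULES):
    """Apply each substitution rule in sequence and return the cleaned text."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


raw = "the said  shipp was la-\nden with sugar ."
cleaned = apply_rules(raw)
```

Rules of this kind are deterministic and easy to audit, which makes them a natural first pass before any context-sensitive correction by a language model.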
Machine transcription
We machine transcribe images of Admiralty Court depositions in Transkribus, which provides machine transcription and hosting services.
You can browse images and transcriptions from a sample volume of depositions (HCA 13/58) on our public Transkribus-hosted site. The transcriptions are largely raw-HTR, with some groundtruth. The groundtruth was created earlier as part of the data set we used to train multiple bespoke transcription models.
Our current machine transcription model is named HCA Secretary Hand 4.404 Pylaia. This model was trained on hand-corrected groundtruth of over 400,000 words (95% English; 5% Latin; first half of the C17th).
NotebookLM Edition
Thirty-three partial and complete volumes of HCA depositions are now available to historians as single volume downloads from our HCA GitHub repository or as an omnibus edition of all thirty-three volumes in a NotebookLM. Please contact Colin Greenstreet of MarineLives if you are a research historian and would like access to the NotebookLM edition. We are actively seeking beta testers.
The MarineLives project was founded in 2012. It is a volunteer-led collaboration dedicated to the transcription, enrichment and publication of English High Court of Admiralty depositions.