
MarineLives and machine transcription

Colin Greenstreet edited this page Nov 16, 2024 · 7 revisions

What we do

MarineLives is a volunteer-led collaboration for the transcription and enrichment of English High Court of Admiralty (HCA) records from the C16th and C17th. The records provide a rich and underutilised source of social, material and economic history.

We focus on the record series HCA 13/. This series consists of books of depositions given by witnesses in the Court and recorded in writing by notaries public.

MarineLives has imaged most deposition books for the period 1570 to 1685 (HCA 13/20 to HCA 13/79) from the original physical documents, which are held by The National Archives (TNA) at Kew, England.

These depositions are well structured:

Date

Short form name of legal case

Front matter consisting of the name of the deponent, their place of residence, their occupation, and their age

Responses in numbered sequential order to a written allegation (or written libel)

Responses in numbered sequential order to a written interrogatory [many, but not all depositions]

Signoff (signature, initials, or mark)
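
The structure listed above maps naturally onto a small record type. The following is a minimal sketch only, assuming Python dataclasses; the class names, field names, and the sample values are hypothetical and invented for illustration, and are not taken from the MarineLives data or repository.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Deponent:
    """Front matter of a deposition (hypothetical field names)."""
    name: str
    residence: str
    occupation: str
    age: Optional[int] = None  # None when the age is missing or illegible


@dataclass
class Deposition:
    """One deposition, following the six parts listed above."""
    date: str        # date as written in the record
    case_name: str   # short form name of the legal case
    deponent: Deponent
    # Responses keyed by their number in the allegation (or libel)
    allegation_responses: Dict[int, str] = field(default_factory=dict)
    # Responses to interrogatories; empty for depositions that have none
    interrogatory_responses: Dict[int, str] = field(default_factory=dict)
    signoff: str = ""  # "signature", "initials", or "mark"

    def has_interrogatories(self) -> bool:
        return bool(self.interrogatory_responses)


# Invented sample values, for illustration only
sample = Deposition(
    date="[date]",
    case_name="[short case name]",
    deponent=Deponent("[name]", "[place]", "mariner", 40),
    allegation_responses={1: "[response to article 1]"},
    signoff="mark",
)
```

Keeping interrogatory responses as an empty mapping by default reflects the point above that many, but not all, depositions include them.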


Our data

MarineLives maintains a GitHub repository named Addaci/HCA for volumes of HCA 13 depositions.

Individual volumes are available for inspection and download from this repository.

We apply semi-diplomatic editorial standards to our transcriptions.

Our early work (2012 to 2021) was entirely hand transcribed, proofed and edited by trained MarineLives volunteers.

Since 2022 we have used machine transcription (HTR) to produce a rough cut, and have hand proofed and edited these rough cuts to hand-corrected groundtruth.

Since 2024, we have been experimenting with machine correction of machine-transcribed rough cut (raw-HTR) text, using rules-based Python scripts and context-sensitive machine learning techniques. After exploring OpenAI's GPT-4o, Google's Gemini Advanced 1.5, and Hugging Face's Zephyr 7B β (a fine-tuned version of mistralai/Mistral-7B-v0.1), we have chosen to work with Anthropic's Claude Sonnet 3.5 large language model, which shows high performance in text correction and analytical legal summarization.
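
As a rough illustration of what a rules-based correction pass over raw-HTR text can look like, here is a minimal sketch using Python's standard `re` module. The rule list and the function name `correct_raw_htr` are assumptions made for this example; they are not the project's actual scripts or rules, and the misreadings shown ("tbe", "aud") are hypothetical.

```python
import re

# Hypothetical rules as (compiled pattern, replacement) pairs.
# These are illustrative only, not the MarineLives rule set.
RULES = [
    (re.compile(r"[ \t]+"), " "),       # collapse runs of spaces and tabs
    (re.compile(r"\btbe\b"), "the"),    # assumed misreading of "the"
    (re.compile(r"\baud\b"), "and"),    # assumed misreading of "and"
]


def correct_raw_htr(text: str) -> str:
    """Apply each rule in order and trim leading/trailing whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


print(correct_raw_htr("To tbe  first article he saith aud deposeth"))
```

Rules of this kind are context-free: each one fires wherever its pattern matches. Corrections that depend on the surrounding sentence are the part the text above assigns to context-sensitive techniques such as a large language model.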


Machine transcription

We machine transcribe images of Admiralty Court depositions in Transkribus, which provides machine transcription and hosting services.

You can browse images and transcriptions from a sample volume of depositions (HCA 13/58) on our public Transkribus-hosted site. The transcriptions are largely raw-HTR, with some groundtruth. The groundtruth was created earlier as part of the data set we used to train multiple bespoke transcription models.

Our current machine transcription model is named HCA Secretary Hand 4.404 Pylaia. This model was trained on over 400,000 words of hand-corrected groundtruth (95% English; 5% Latin; first half of the C17th).


NotebookLM Edition

Thirty-three partial and complete volumes of HCA depositions are now available to historians as single volume downloads from our HCA GitHub repository or as an omnibus edition of all thirty-three volumes in a NotebookLM. Please contact Colin Greenstreet of MarineLives if you are a research historian and would like access to the NotebookLM edition. We are actively seeking beta testers.
