Glossary
84 words you need to know
Agent: An AI system that can interact with its environment, make decisions, and take actions to achieve a specific goal. Agents are often more autonomous and capable of complex decision-making than simple assistants.
Analytical ontological summarization: A type of text summarization that goes beyond simply condensing information. It involves identifying and extracting key concepts and relationships from the text, often using ontologies or knowledge graphs to structure the information. This type of summarization is particularly valuable for historical research as it can help to reveal underlying patterns and connections in the data. See the collaboratory's example of an analytical ontological summarization prompt.
Anthropic models: Anthropic is an AI research company focused on developing safe and beneficial AI systems. Its current frontier model is Claude 3.5 Sonnet, which last received a major update in October 2024.
API (Application Programming Interface): A set of rules and specifications that allows different software systems to communicate with each other. APIs are often used to access data or functionality from external services.
Attention mechanism: A technique used in neural networks that allows the model to focus on specific parts of the input data that are most relevant to the task at hand. Attention mechanisms are particularly useful in natural language processing tasks, helping models understand the relationships between words and phrases in a sentence.
Aurelius-Archives: A tailored GPT (Generative Pre-trained Transformer) designed by Colin Greenstreet specifically for working with archives. It includes features such as the complete Gazetteer of British Place Names, which allows users to geolocate archival metadata.
Batch processing: Processing multiple data items (like pages of text) together as a group, as opposed to one by one. This is often more efficient for computational tasks.
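A minimal sketch of the idea in Python; the batch size and page texts are illustrative:

```python
def batched(items, batch_size=10):
    """Yield successive fixed-size batches from a list of items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 25 illustrative pages of text, processed ten at a time rather than one by one.
pages = [f"page {n} text..." for n in range(1, 26)]
for batch in batched(pages):
    print(f"processing {len(batch)} pages together")
```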
BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., 2019), a widely used pre-trained language model. Related models include two Dutch monolingual models, BERTje and RobBERT, and the multilingual model mBERT.
Chain-of-thought prompt: A type of prompt that encourages a large language model to explain its reasoning process step-by-step. This can help to improve the transparency and accuracy of the model's output.
Classification: Assigning data items to predefined categories. In machine learning, classification algorithms are trained on labeled data to predict the category of new, unseen data.
Claude 3.5 Sonnet: A large language model (LLM) developed by Anthropic, known for its high performance in text correction, summarization, and entity extraction. It is used in the collaboratory for tasks such as cleaning up raw machine transcriptions and performing analytical ontological summarization.
Collaboratory: A group of people working together on a project, often remotely and using online tools to collaborate. The MarineLives AI & History Collaboratory is a group of graduate history students and faculty working together to explore and apply LLMs in historical research.
Commons: A term used to refer to shared resources, such as knowledge, data, or code, that are openly accessible and often governed by licenses that encourage reuse and collaboration.
Controlled vocabulary: A predefined set of terms used to describe and categorize information, ensuring consistency and facilitating retrieval. Controlled vocabularies are common in libraries and archives, where they underpin structured metadata, which is important for organizing and searching digital historical collections.
Convolutional Neural Network (CNN): A type of neural network commonly used for image and video processing. CNNs excel at recognizing patterns and features in visual data.
CSV: CSV (Comma-Separated Values) is a simple text-based file format where data is organized in a tabular structure. Each line in the file represents a row in the table, and values within a row are separated by commas. This straightforward format makes CSV files easily readable by both humans and computers, widely used for data storage and exchange.
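A minimal sketch of the format in Python, using only the standard library; the column names are illustrative:

```python
import csv, io

# CSV is plain text: one row per line, values separated by commas.
# The columns here (deponent, date, place) are illustrative.
data = "deponent,date,place\nJohn Smith,1655-03-12,Wapping\n"
for row in csv.DictReader(io.StringIO(data)):
    print(row["deponent"], row["place"])  # -> John Smith Wapping
```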
Deposition: A sworn statement made by a witness, often in writing, used as evidence in legal proceedings. In the context of the collaboratory, depositions from the English High Court of Admiralty are being used as source material for analysis with LLMs.
Digital object identifiers (DOIs): Unique, persistent identifiers used to locate digital objects, such as journal articles, datasets, or images. DOIs help ensure that digital objects remain accessible even if their location on the internet changes.
EAHistoriChat: EAHistoriChat (Early American) is a chatbot designed to respond in the style of Early American written texts. This project draws significant inspiration from Pierre-Carl Langlais' Monad-GPT, which pioneered the use of 17th-century writing styles in AI responses. Similarly, EAHistoriChat is a fine-tuned version of Mistral-Hermes 2, trained on synthetic question-answer pairs to replicate Early American prose.
Embedding: A mathematical representation of a data item, such as a word or sentence, as a vector in a high-dimensional space. Embeddings capture semantic relationships between words, allowing machines to understand the meaning of text.
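A minimal sketch in Python, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (common choices, not ones specified in this glossary):

```python
from sentence_transformers import SentenceTransformer

# Encode two sentences as vectors; semantically similar sentences
# end up close together in the embedding space.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["a mariner of Wapping", "a sailor from east London"])
print(vectors.shape)  # (2, 384): two sentences, 384 dimensions each
```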
Entity extraction: The process of identifying and classifying named entities (people, organizations, locations, dates, etc.) in text. This is a key step in many natural language processing tasks, as it allows for the structured extraction of information from unstructured text.
Fine-tuning: Adapting a pre-trained large language model (LLM) to a specific task or domain by training it on additional data. Fine-tuning can significantly improve the performance of an LLM for a particular application.
FineWeb: A large-scale dataset (15 trillion tokens, 44 TB on disk) for LLM pretraining. FineWeb is derived from 96 CommonCrawl snapshots and produces better-performing LLMs than other open pretraining datasets.
Gemini 1.5: A large language model (LLM) developed by Google, known for its advanced capabilities in various language tasks.
Generative Pre-trained Transformer (GPT): GPT refers to a family of large language models developed by OpenAI, known for their ability to generate human-quality text in response to prompts. Note that "GPT" is a general term for the model architecture, while specific models like GPT-3 and GPT-4 have different capabilities and release dates.
GitHub: A platform for hosting and collaborating on software development projects. It is used in the collaboratory to share code, data, and documentation.
Google Colab: A cloud-based platform for running Python code, particularly popular for machine learning tasks. It provides a free and accessible environment for experimenting with and developing machine learning models.
Google NotebookLM: A retrieval-based research tool developed by Google that lets users upload their own documents and then query, summarize, and synthesize them with a large language model. It is used in the collaboratory to synthesize collective learnings and as a research tool.
Groundtruth: In machine learning, the correct or validated data used to train and evaluate models. In the context of the collaboratory, hand-corrected transcriptions of historical documents serve as groundtruth data.
Handwritten Text Recognition (HTR): The use of computer algorithms to transcribe handwritten text into digital form. HTR is an essential step in making handwritten historical documents accessible for analysis with digital tools.
Historical Research Use Cases: Specific ways in which LLMs and other digital tools can be applied to address historical research questions. The collaboratory focuses on developing and sharing practical use cases for LLMs in history.
Hugging Face: A platform for hosting and sharing machine learning models, datasets, and tools. It is used in the collaboratory to access and experiment with various LLMs.
IIIF manifest: A standardized description of a digital image, providing information about its structure, content, and metadata. IIIF manifests are used to facilitate the online presentation and interoperability of digital images, particularly important for cultural heritage institutions.
Knowledge Graph: A structured representation of knowledge, consisting of entities (people, places, things) and the relationships between them. Knowledge graphs can be used to enhance search, provide context, and support reasoning about data.
Large Language Model (LLM): A deep learning algorithm trained on vast amounts of text data, enabling it to understand and generate human-like language. LLMs like Claude 3.5 Sonnet and GPT-4 are central to the collaboratory's focus on using AI for historical research.
Linked Open Data: Data that is published in a structured, machine-readable format, with links between different datasets to create a web of interconnected information. LOD is a key concept in the semantic web, enabling data from different sources to be easily combined and analyzed.
Llama Models: A family of open-source large language models (LLMs) developed by Meta.
Low-Rank Adaptation (LoRA): A technique for fine-tuning large language models more efficiently. It freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, reducing the number of trainable parameters and memory footprint while achieving performance comparable to full fine-tuning.
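A minimal sketch in Python, assuming the Hugging Face transformers and peft libraries; the base model and hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a frozen base model, then attach small trainable LoRA matrices.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=16,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```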
Machine learning (ML): A type of artificial intelligence that allows computers to learn from data without being explicitly programmed. ML is the foundation for LLMs and other AI tools used in the collaboratory.
Machine transcription: Using computer software to automatically transcribe audio or handwritten text into digital text. Machine transcription can save time and effort compared to manual transcription, but often requires human correction and review.
MarineLives: A volunteer-led digital history project focused on transcribing and enriching records from the English High Court of Admiralty. The MarineLives AI and History Collaboratory is an extension of this project, exploring the use of large language models for historical research.
Mask: In natural language processing, a special token used to replace or hide a word or phrase in a sentence. Masking is commonly used in language modeling tasks, where the model is trained to predict the masked word based on the surrounding context.
Metadata: Data about data. In the context of the collaboratory, metadata refers to information about historical documents, such as author, date, location, and keywords, which can be used to organize and search the documents.
Mistral Models: A family of large language models developed by Mistral AI. MarineLives has experimented with Zephyr 7B β, a fine-tuned version of a Mistral model, for text correction and analysis.
Multi-agent/multi-player historical simulations: Interactive simulations that involve multiple actors (either human or AI) interacting within a historical setting. Greenstreet envisions using these simulations in graduate teaching to provide immersive learning experiences.
Named Entity Recognition (NER): The task of identifying and classifying named entities in text. You can try a Named Entity Recognition model in a Hugging Face instance: an open-source fine-tuned BERT model, bert-base-NER, trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).
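A minimal sketch in Python, assuming the Hugging Face transformers library and the dslim/bert-base-NER checkpoint (assumed here to be the model described above):

```python
from transformers import pipeline

# Build an NER pipeline and group subword pieces into whole entities.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for entity in ner("Samuel Pepys worked at the Navy Board in London."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
    # e.g. PER Samuel Pepys, ORG Navy Board, LOC London
```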
Neural network: A computational model inspired by the structure of the human brain, consisting of interconnected nodes (neurons) organized in layers. Neural networks are used in machine learning to learn patterns and relationships from data. Large Language Models are a type of neural network.
Online Public Access Catalogue (OPAC): An online database that allows users to search a library's catalog and find information about its holdings. OPACs are a standard feature of modern libraries, providing access to bibliographic information and often linking to digital resources.
Ontology: A formal representation of knowledge within a specific domain, defining concepts and relationships between them. Ontologies are used to structure data and facilitate machine understanding of information.
OpenAI Models: This refers to the various AI models developed by OpenAI, a leading artificial intelligence research company. OpenAI is known for creating a range of models, including the GPT series, DALL-E (image generation), and Codex (code generation).
Open Data Commons (ODC): Open Data Commons provides a suite of licenses and legal tools that facilitate the open sharing and use of data. These licenses help data creators clearly define the conditions under which their data can be used, reused, and redistributed, promoting transparency and encouraging wider access to valuable datasets.
PageRank algorithm: A link analysis algorithm used by Google Search to rank websites in its search engine results. It works by evaluating the number and quality of links to a page to determine an estimate of how important the website is. PageRank treats links from other websites as "votes" of confidence: the more links a page receives, the more important it is considered. Not all links are created equal, however; links from high-quality, authoritative websites carry more weight than links from low-quality or irrelevant sites. While still a factor in Google's ranking algorithm, PageRank is now just one of many signals used to determine search result rankings, and it has been refined over the years to combat link manipulation and improve accuracy.
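A minimal sketch of the core idea in Python, using a toy three-page link graph and a standard damping factor:

```python
# Toy link graph: each page lists the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {page: 1 / len(links) for page in links}

damping = 0.85
for _ in range(50):  # iterate until the scores stabilise
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)  # split score over links
        for target in outgoing:
            new_rank[target] += share                 # each link is a "vote"
    rank = new_rank

print(rank)  # C, linked to by both A and B, scores highest
```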
Parquet: Parquet is a columnar storage file format specifically designed for efficient data processing and analytics. Unlike traditional row-based formats, Parquet stores data by column, allowing for faster retrieval of specific data subsets and improved compression. This columnar organization, coupled with various encoding and compression techniques, leads to significant performance gains, particularly in analytical queries that often involve accessing only a subset of columns. Furthermore, Parquet maintains schema information within the file itself, ensuring data integrity and compatibility across different systems.
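A minimal sketch in Python, assuming pandas with a Parquet engine such as pyarrow installed; the columns are illustrative:

```python
import pandas as pd

# Write a small table to Parquet, then read back a single column.
df = pd.DataFrame({"deponent": ["John Smith", "Peter Browne"], "year": [1655, 1657]})
df.to_parquet("depositions.parquet")  # columnar, compressed storage

# Columnar layout means only the requested column is read from disk.
subset = pd.read_parquet("depositions.parquet", columns=["year"])
print(subset)
```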
Part-of-speech tagging (POS): The process of assigning grammatical tags (e.g., noun, verb, adjective) to each word in a text to identify its syntactic role. Example: "The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN."
Persistent uniform resource locators (PURLs): A type of web address that provides a permanent and stable link to a digital resource, even if the resource's location changes. PURLs help ensure long-term access to digital materials.
Pinecone: A platform for building and managing vectorbases. In the collaboratory, Pinecone is used to create a specialized vectorbase for historical data.
Pipeline: A sequence of steps or processes in a workflow. In the context of the collaboratory, a pipeline might involve machine transcription, cleanup, summarization, metadata creation, and LOD creation.
Prompt engineering: The art of crafting effective prompts, the instructions or queries given to a large language model, to elicit desired responses and guide the model's output.
Prompt list: A structured list of prompts designed to guide students' exploration of a topic or research area. Greenstreet suggests that these might become more common in academic settings, supplementing or even replacing traditional reading lists.
Prompting: Providing an input (a prompt) to a large language model (LLM) to guide its output. See also Prompt engineering.
Python: A popular programming language widely used in data science and machine learning. Some collaboratory activities may involve working with Python code, but the focus is on low-coding or no-coding approaches.
Question answering: A task where a machine learning model is asked a question and must provide an answer based on its knowledge or understanding of a given text or dataset.
Raw HTR text: The initial output of Handwritten Text Recognition software, which often contains errors and inconsistencies. This raw text needs to be cleaned and corrected before it can be used for analysis.
Recurrent neural network (RNN): A type of neural network designed to process sequential data, such as text or time series. RNNs have memory, allowing them to retain information from previous inputs and use it to understand context. RNNs were historically significant in natural language processing, and some language models still incorporate recurrent components.
Retrieval Augmented Generation (RAG): A technique that combines the information retrieval capabilities of search engines with the generative capabilities of LLMs. It allows LLMs to access and process information from external sources, making their responses more comprehensive and accurate.
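A minimal sketch of the pattern in Python; retrieve() and generate() are hypothetical stand-ins for a vectorbase query and an LLM API call:

```python
def retrieve(question, top_k=3):
    # Hypothetical: embed the question and return the top_k most
    # semantically similar passages from a vectorbase such as Pinecone.
    return ["passage about Barbary pirate captures...",
            "passage about ransom negotiations..."]

def generate(prompt):
    # Hypothetical: send the prompt to an LLM and return its answer.
    return "an answer grounded in the retrieved passages"

question = "How were captives of Barbary pirates ransomed?"
context = "\n".join(retrieve(question))
answer = generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```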
Semantic search: A type of search that goes beyond keyword matching, taking into account the meaning and context of words to retrieve more relevant results. Semantic search is particularly valuable for historical research as it can help to uncover connections and patterns that might be missed with traditional keyword searches.
Semantic web: An extension of the World Wide Web that aims to make data machine-readable and interconnected, allowing machines to understand and process information. The semantic web uses technologies like Linked Open Data (LOD) to create a web of knowledge. The concept of the semantic web aligns with the collaboratory's goals of using structured data and knowledge representation to enhance historical research.
Splitting: Dividing data into smaller chunks or segments for processing. This can be necessary for tasks like machine transcription or analysis of large documents.
Summarization: Condensing a longer text into a shorter version that captures the key points. LLMs are increasingly used for summarization tasks, helping researchers quickly digest large amounts of text.
SQL (Structured Query Language): A standard language used to manage and query relational databases. SQL is a powerful tool for data analysis and retrieval. The sources list "SQL" as a possible future topic for the collaboratory. This indicates an interest in exploring how SQL can be used to work with structured historical data.
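A minimal sketch in Python using the built-in sqlite3 module; the depositions table and its columns are illustrative:

```python
import sqlite3

# Build a tiny in-memory relational table of depositions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE depositions (deponent TEXT, year INTEGER, place TEXT)")
con.execute("INSERT INTO depositions VALUES ('John Smith', 1655, 'Wapping')")

# An SQL query selects and filters rows declaratively.
for row in con.execute(
    "SELECT deponent, place FROM depositions WHERE year BETWEEN 1650 AND 1660"
):
    print(row)  # ('John Smith', 'Wapping')
```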
Synthetic data: Data generated artificially, often by a model, rather than collected from real-world sources. EAHistoriChat, for example, was fine-tuned on synthetic question-answer pairs.
System prompt: A prompt that sets the overall behavior and guidelines for a large language model (LLM). System prompts are used to define the LLM's "persona" or role and can influence its style, tone, and biases.
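A minimal sketch of the chat-message format used by many LLM APIs; the wording of both messages is illustrative:

```python
# The "system" message sets persona and ground rules; the "user"
# message carries the actual request. Both texts are illustrative.
messages = [
    {
        "role": "system",
        "content": "You are an archivist specialising in seventeenth-century "
                   "English High Court of Admiralty records. Answer concisely.",
    },
    {"role": "user", "content": "Summarise the deposition of John Smith."},
]
# This list would then be passed to a chat-completion API call.
```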
Text correction: This involves identifying and rectifying errors in text, including spelling mistakes, grammatical inconsistencies, and OCR inaccuracies.
Text extraction: This refers to the process of identifying and isolating specific information from a larger body of text. Named entity extraction focuses on identifying and classifying named entities (people, organizations, locations, etc.) within text, enabling more structured analysis and retrieval of information, while summarization condenses larger texts into shorter versions that retain the essential information, which is valuable for quickly grasping the key points of lengthy historical documents.
Text translation: This is the process of converting text from one language to another. In the context of modernizing Early Modern English texts, it is possible to use machine text translation to convert Latin text within documents into English for contemporary readers.
Token classification: Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks.
Tokenization: This refers to the process of breaking down a text into smaller units called tokens. These tokens can be words, characters, or subwords, depending on the application. Tokenization is a fundamental step in many natural language processing tasks, including machine translation, text summarization, and sentiment analysis. It allows AI models to process and understand the structure and meaning of text by breaking it down into manageable units.
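A minimal sketch in Python, assuming the Hugging Face transformers library and the bert-base-cased checkpoint:

```python
from transformers import AutoTokenizer

# Subword tokenization: rare or archaic spellings are split into pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("The shippe carried merchandize to Legorne."))
# Archaic words are likely split into subword pieces such as ['ship', '##pe'].
```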
Transfer learning: A machine learning technique where a model developed for one task is reused as a starting point for a model on a second task. This leverages previously learned knowledge to improve performance and speed up training on the new task, much like how humans apply knowledge from one domain to another.
Transkribus: A platform for machine transcription of handwritten documents, used by MarineLives to transcribe historical depositions.
Vectorbase: A database that stores and manages vector embeddings, which are mathematical representations of words, sentences, or other data items. Vectorbases are used for tasks such as semantic search, similarity comparisons, and recommendation systems. Pinecone and Google NotebookLM are examples of vectorbase platforms.
Vectors: Mathematical representations of data items as points in a multi-dimensional space. Vectors are used in machine learning to represent and compare data based on similarity. Sentence and paragraph embeddings are types of vectors used to represent text in the collaboratory's Pinecone vectorbase. The collaboratory has sessions dedicated to "Vectorbases" and "Semantic similarity."
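A minimal sketch in Python with numpy; the three-dimensional vectors are toy stand-ins for real embeddings, which typically have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

mariner = np.array([0.9, 0.1, 0.3])
sailor = np.array([0.8, 0.2, 0.35])
anchor = np.array([0.1, 0.9, 0.1])

print(cosine_similarity(mariner, sailor))  # high: similar meanings
print(cosine_similarity(mariner, anchor))  # lower: different meanings
```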
Vision-Language-Action models (VLA): Models pre-trained on large-scale robot data that integrate visual perception, language understanding, and action-based decision making to guide robots in various tasks.
Weights and biases: Parameters within a neural network that determine how strongly different inputs influence the output. Weights and biases are adjusted during the training process to optimize the model's performance. While the sources don't explicitly discuss weights and biases, they are fundamental components of neural networks and LLMs.
Zephyr 7B β: A fine-tuned version of the Mistral-7B-v0.1 large language model, explored by MarineLives for text correction and analysis.
The MarineLives project was founded in 2012. It is a volunteer-led collaboration dedicated to the transcription, enrichment, and publication of English High Court of Admiralty depositions.