Prompt engineering ‐ learnings from our first session
[Place Holder]
Feedback from Gavin Beinart-Smollan
Thank you for a great first session and for sharing all of these resources with us - they’re extremely helpful. I was inspired after the first session to create a series of prompts on Anthropic workbench to help me correct my raw Yiddish HTR. They need work and I’d love to discuss further with you. I had hoped to come to your Zoom office hours last week but family Thanksgiving obligations got in the way! I will try to come next week.
Feedback from Maurice Brenner
My main interest is in using LLMs as a qualitative tool for cultural and social research and I have been playing with prompts for questions such as summarise, analyse, simplify. For example, I provided extracts from the diary of an early eighteenth-century artisan that I have used extensively and asked the LLMs to:
· List key information about the diarist
· Identify main themes and preoccupations
· Note relationships mentioned
· Extract and quote significant events or observations
· Analyse the tone and emotional state of the diarist
· Identify period-specific terms or concepts
· Consider the socio-economic context
Given the brevity of the extract compared to the full diary, the summary outputs were surprisingly effective and extensive, highlighting the main themes of the diary.
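For anyone who wants to try a similar experiment programmatically rather than in the workbench, a minimal sketch using the Anthropic Python SDK follows. The model name, the diary extract, and the exact wording of the numbered instructions are placeholders and approximations, not Maurice's actual prompt.

```python
import anthropic

# Placeholder: paste the diary extract you want to analyse here.
DIARY_EXTRACT = "..."

# The numbered instructions approximate the task list above.
PROMPT = f"""You are assisting a historian working on the diary of an early
eighteenth-century artisan. For the extract below:
1. List key information about the diarist.
2. Identify main themes and preoccupations.
3. Note relationships mentioned.
4. Extract and quote significant events or observations.
5. Analyse the tone and emotional state of the diarist.
6. Identify period-specific terms or concepts.
7. Consider the socio-economic context.

Extract:
{DIARY_EXTRACT}"""

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=1500,
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.content[0].text)
```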
However, I am nervous about the status of such outputs as rigorous academic overviews. There are significant differences between such queries and the other uses of LLMs we have so far discussed, such as transcribe, translate, extract, organise, clean. I tentatively suggest that those uses are valid because of three factors:
· Linearity: There is a linear relationship between input and output, with a direct equivalence between, say, a Latin text and its English translation or references to Barbary pirates and their tabulation by the model
· Stability: Minor changes in the prompt generally seem to cause predictable changes in the output. Retrying the prompt in the same or a different LLM might alter the structure, but the essential information remains stable. [What about translate? Do repeat iterations in the same or in another LLM produce substantively different outputs?]
· Verifiability: Because the process is linear, any individual output datum can be manually checked against the input.
None of these factors seems to hold true for the kinds of qualitative question I have been trying out:
· Non-linearity: It is difficult to understand what choices the model has made in reaching a particular conclusion, which examples it has considered important and, particularly, which others it has set aside.
· Instability: Minor changes in the wording of the prompt or repeated iterations can produce substantively different outputs.
· Non-verifiability: Because the responses are summative, it is difficult to verify (as opposed to exemplify) any particular conclusion reached by the model.
For these reasons I would not be confident in using the outputs in any academic setting. [It also feels vaguely dishonest – akin to plagiarism.]
Possibly my error is in treating the LLM as if it is a historian instead of as a tool to aid the historian, leading me to ask the wrong kind of question. Perhaps I should be drawing my parallels not from the questions of cultural and social research but from its methods, and so ask the model, for example, not for summaries but for the information that would help me to summarise, e.g. to list, with quotes, the occasions when the diarist expresses religious doubt or worries about his obligation to pay a debt.
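One possible shape for such a building-block prompt is sketched below. The requirement to quote verbatim and to give a locator for each item is an assumption added here for illustration, so that every returned datum can be checked against the source; it is not Maurice's actual wording.

```python
# A "building block" request: verifiable extraction rather than a summary.
EXTRACTION_PROMPT = """From the diary extract below, list every occasion on which the diarist
expresses religious doubt or worries about his obligation to pay a debt.

For each occasion give:
- a verbatim quotation of the relevant passage
- the date or entry heading under which it appears, if any
- one sentence of neutral context, without interpretation

If no such occasion appears in the extract, say so explicitly rather than inferring one.

Extract:
{extract}"""

print(EXTRACTION_PROMPT.format(extract="[paste diary extract here]"))
```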
I would be interested in any comments or thoughts colleagues may have.
Comment by Colin Greenstreet on feedback from Maurice Brenner
My first thought in response to Maurice's question about whether he should be asking for building blocks to summaries, rather than summaries, is that he/we should do both. This is consistent with Chain of Thought prompting methodology, in which you break a task into component tasks. These component tasks can be sequential, with the first task or task set being required for a second task or task set, which themselves may be required for a third task or task set. These tasks can also go from the granular to a higher level of abstraction, again in two, three, or even more sequential steps. Studies have shown empirically that large language models produce more accurate outputs (i.e. outputs that better meet human pre-set criteria) when such a Chain of Thought approach is taken to prompt design, particularly when (1) the text (or other form of data) being analyzed is lengthy and complex, and (2) the level of abstraction desired is high.
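A minimal sketch of what such a stepwise approach could look like in code follows, with a granular extraction pass feeding a second, more abstract pass. The model name and both prompt wordings are assumptions for illustration only.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model name


def ask(prompt: str) -> str:
    """Send a single prompt and return the model's text reply."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


diary_extract = "..."  # placeholder source text

# Step 1: granular, verifiable extraction (the building blocks).
building_blocks = ask(
    "List, with verbatim quotations, the events, relationships and expressions of "
    "emotion in this diary extract:\n\n" + diary_extract
)

# Step 2: a higher level of abstraction, constrained to the step-1 output.
overview = ask(
    "Using ONLY the extracted material below, write a short thematic overview of the "
    "diarist's main preoccupations. Do not introduce material that is not quoted.\n\n"
    + building_blocks
)

print(overview)
```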
There is an interesting November 18 2024 pre-print by Cauã Ferreira Barros et al., Large Language Model for Qualitative Research - Systematic Mapping Study, which performs an LLM-enabled systematic review of the use of large language models in qualitative research in a number of fields. A related document published in Zenodo provides supporting research materials for the systematic review. A total of 354 studies were retrieved, distributed as follows: 20 from the ACM Digital Library, 30 from IEEE Xplore, 78 from Web of Science, 32 from SBC Open Lib, 193 from Scopus, and 1 from arXiv. At the end of this process, 7 studies remained, for which comprehensive data extraction was conducted. The studies exploring the use of LLMs as a tool to support qualitative data analysis are concentrated in various fields, including healthcare [14][17], education [13][15][19], cultural studies [18], and analysis of technological applications.
Feedback from Thiago Krause
Rieke asked for two pieces of research I mentioned in the chat. I'm copying her here, but if you want to add to the Wiki... They are not peer-reviewed, though: one is a pre-print (https://arxiv.org/pdf/2402.14531) and the other was actually just a post: https://x.com/RobLynch99/status/1734278713762549970
Feedback from Brad Scott
After the meeting I spent some time with a small data set of the descriptions of 254 plant collections. My sense from working with the material has been that there is some sort of order to the collections, but I hadn't fully explored it. This time I used Anthropic to assess what groupings and trends it could discern (if any); this is much more qualitative than my previous experiment, and it was useful to see how the refinements to the prompt affected the results -- none of them perfect, and all containing something horribly wrong -- but it did broadly accord with my embodied sense of the collection, and it also highlighted some clusters that I hadn't appreciated before. So, it has been a handy tool for supporting my thinking about tendencies in collection management over 50 years.
I am also keen to work with a highly structured XML rendering of a single volume of dried plants. I want to see if Anthropic can be used as an inference engine to help join together descriptions of plants as provided in lists, with the actual labelled specimens (which do not use the same names as in the list).
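One way such an inference step might be framed as a prompt is sketched below; the evidence criteria, the confidence scale, and the placeholders for the XML and the specimen labels are assumptions for illustration, not Brad's actual data or wording.

```python
# Sketch of a list-to-specimen matching prompt.
MATCHING_PROMPT = """Below are (1) entries from a list of dried plants, taken from an XML
rendering of a single volume, and (2) the labels attached to the specimens themselves.
The two do not use the same names.

For each list entry, propose the most likely matching specimen label, giving:
- the list entry and the proposed label
- the evidence for the match (shared epithets, synonyms, position in the volume)
- a confidence rating of high, medium or low
- "no plausible match" where none can be defended

List entries:
{list_entries}

Specimen labels:
{specimen_labels}"""

print(MATCHING_PROMPT.format(
    list_entries="[XML extract here]",
    specimen_labels="[specimen labels here]",
))
```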
Feedback from Mark L. Thompson
Thiago mentioned that the scholar Mark Humphries has already found some ways to automate some of these approaches, so perhaps there's something to be learned from what he's doing, too: https://generativehistory.substack.com/ & https://github.com/mhumphries2323/Transcription_Pearl He has even built an executable file (.exe) that can run on Windows.