
Prompt engineering ‐ learnings from our first session


DISCUSSION DURING SESSION

[Place Holder]


FEEDBACK POST SESSION


Feedback from Gavin Beinart-Smollan

Thank you for a great first session and for sharing all of these resources with us - they’re extremely helpful. I was inspired after the first session to create a series of prompts in the Anthropic Workbench to help me correct my raw Yiddish HTR. They need work and I’d love to discuss them further with you. I had hoped to come to your Zoom office hours last week but family Thanksgiving obligations got in the way! I will try to come next week.


Feedback from Maurice Brenner

My main interest is in using LLMs as a qualitative tool for cultural and social research and I have been playing with prompts for questions such as summarise, analyse, simplify. For example, I provided extracts from the diary of an early eighteenth-century artisan that I have used extensively and asked the LLMs to:

· List key information about the diarist

· Identify main themes and preoccupations

· Note relationships mentioned

· Extract and quote significant events or observations

· Analyse the tone and emotional state of the diarist

· Identify period-specific terms or concepts

· Consider the socio-economic context

Given the brevity of the extract compared to the full diary, the summary outputs were surprisingly effective and extensive, highlighting the main themes of the diary.
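
For anyone who wants to run this kind of task programmatically rather than in the Workbench or a chat window, the sketch below shows one possible way to send such a prompt. It is a minimal illustration only, assuming the Anthropic Python SDK (the anthropic package); the model name, the diary file, and the wording of the task list are placeholders to adapt, not a record of the prompts Maurice actually used.

```python
# Minimal sketch: sending a diary extract plus a fixed task list to a Claude model.
# Assumes the `anthropic` Python package is installed and ANTHROPIC_API_KEY is set;
# the model name, file name, and task wording are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

diary_extract = Path("diary_extract.txt").read_text(encoding="utf-8")  # placeholder file

tasks = """For the diary extract below, please:
1. List key information about the diarist
2. Identify main themes and preoccupations
3. Note relationships mentioned
4. Extract and quote significant events or observations
5. Analyse the tone and emotional state of the diarist
6. Identify period-specific terms or concepts
7. Consider the socio-economic context"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder; use whichever model you have access to
    max_tokens=2000,
    messages=[{"role": "user", "content": f"{tasks}\n\n---\n\n{diary_extract}"}],
)

print(response.content[0].text)
```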

However, I am nervous about the status of such outputs as rigorous academic overviews. There are significant differences between such queries and the other uses of LLMs we have so far discussed, such as transcribe, translate, extract, organise, clean. I tentatively suggest that those uses are valid because of three factors:

· Linearity: There is a linear relationship between input and output, with a direct equivalence between, say, a Latin text and its English translation or references to Barbary pirates and their tabulation by the model

· Stability: Minor changes in the prompt generally seem to cause predictable changes in the output. Retrying the prompt in the same or a different LLM might alter the structure, but the essential information remains stable. [What about translate? Do repeat iterations in the same or in another LLM produce substantively different outputs? A quick way to check this empirically is sketched after this list.]

· Verifiability: Because the process is linear, any individual output datum can be manually checked against the input.
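
On the bracketed stability question above, one quick empirical check is to re-run an identical prompt several times (or across models) and compare the outputs. The sketch below shows one crude way to do this, again assuming the Anthropic Python SDK; the translation prompt and source file are placeholders, and the difflib similarity ratio is only a rough, character-level proxy for whether two outputs are "substantively different".

```python
# Crude stability check: send the same prompt several times and compare the outputs.
# Assumes the `anthropic` package and ANTHROPIC_API_KEY; the prompt and source file
# are placeholders, and difflib's ratio is only a rough proxy for semantic similarity.
import difflib
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

source_text = Path("latin_passage.txt").read_text(encoding="utf-8")  # placeholder file
prompt = f"Translate the following Latin passage into English:\n\n{source_text}"

runs = []
for _ in range(3):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model name
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    runs.append(response.content[0].text)

# Pairwise character-level similarity (1.0 = identical). Low scores flag pairs of
# runs that are worth reading side by side.
for i in range(len(runs)):
    for j in range(i + 1, len(runs)):
        ratio = difflib.SequenceMatcher(None, runs[i], runs[j]).ratio()
        print(f"run {i} vs run {j}: similarity {ratio:.2f}")
```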

None of these factors seem to hold true for the kinds of qualitative question I have been trying out:

· Non-linearity: It is difficult to understand what choices the model has made in reaching a particular conclusion, which examples it has considered important and, particularly, which others it has set aside.

· Instability: Minor changes in the wording of the prompt, or repeated iterations, can produce substantively different outputs.

· Non-verifiability: Because the responses are summative, it is difficult to verify (as opposed to exemplify) any particular conclusion reached by the model.

For these reasons I would not be confident in using the outputs in any academic setting. [It also feels vaguely dishonest – akin to plagiarism.]

Possibly my error is in treating the LLM as if it is a historian instead of as a tool to aid the historian, leading me to ask the wrong kind of question. Perhaps I should be drawing my parallels not from the questions of cultural and social research but from its methods, and so ask the model, for example, not for summaries, but for the information that would help me to summarise, eg to list with quotes the occasions when the diarist expresses religious doubt or worries about his obligation to pay a debt.

I would be interested in any comments or thoughts colleagues may have.

Comment by Colin Greenstreet on feedback from Maurice Brenner

My first thought in response to Maurice's question about whether he should be asking for the building blocks of summaries, rather than for summaries themselves, is that he/we should do both. This is consistent with Chain of Thought prompting methodology, in which you break a task into component tasks. These component tasks can be sequential, with the first task or task set being required for a second task or task set, which themselves may be required for a third task or task set. These tasks can also go from the granular to a higher level of abstraction, again in two, three, or even more sequential steps. Studies have shown empirically that large language models produce more accurate outputs (that is, outputs which better meet pre-set human criteria) when such a Chain of Thought approach is taken to prompt design, particularly when (1) the text (or other form of data) being analyzed is lengthy and complex, and (2) the level of abstraction desired is high.
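
To make this concrete, here is a minimal sketch of a two-step "building blocks, then summary" prompt chain along the lines Maurice suggests at the end of his feedback. It assumes the Anthropic Python SDK as in the earlier sketches; the model name, theme, and diary file are placeholders, and the prompt wording is illustrative rather than tested.

```python
# Minimal two-step prompt chain: first gather evidence (verbatim quotations), then
# summarise only from that evidence. Assumes the `anthropic` package and
# ANTHROPIC_API_KEY; the model name, theme, and diary file are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # placeholder model name

diary_extract = Path("diary_extract.txt").read_text(encoding="utf-8")  # placeholder file
theme = "expressions of religious doubt"  # placeholder theme


def ask(prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


# Step 1: granular, verifiable building blocks (quotations that can be checked
# against the source by hand).
quotes = ask(
    f"From the diary extract below, list every passage containing {theme}. "
    "Quote each passage verbatim and note where in the extract it occurs. "
    "If there are none, say so.\n\n---\n\n" + diary_extract
)

# Step 2: a higher level of abstraction, constrained to the evidence from step 1.
summary = ask(
    f"Using ONLY the quoted passages below, write a short summary of how the diarist "
    f"treats {theme}. Refer to the quotations you rely on and do not add outside "
    "information.\n\n" + quotes
)

print(quotes)
print("\n---\n")
print(summary)
```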

There is an interesting November 18 2024 pre-print by Cauã Ferreira Barros et al, Large Language Model for Qualitative Research - Systematic Mapping Study, which performs an LLM-enabled systematic review of the use of large language models in qualitative research across a number of fields. A related document published in Zenodo provides supporting research materials for the systematic review. A total of 354 studies were retrieved, distributed as follows: 20 from the ACM Digital Library, 30 from IEEE Xplore, 78 from Web of Science, 32 from SBC Open Lib, 193 from Scopus, and 1 from arXiv. At the end of the screening process, 7 studies remained, for which comprehensive data extraction was conducted. The studies exploring the use of LLMs as a tool to support qualitative data analysis are spread across various fields, including healthcare [14][17], education [13][15][19], cultural studies [18], and the analysis of technological applications.


Feedback from Thiago Krause

Rieke asked for two pieces of research I mentioned in the chat. I'm copying her here, but if you want to add to the Wiki... They are not peer-reviewed, though: one is a pre-print (https://arxiv.org/pdf/2402.14531) and the other was actually just a post: https://x.com/RobLynch99/status/1734278713762549970


Feedback from Brad Scott

After the meeting I spent some time with a small data set of the descriptions of 254 plant collections. My sense from working with the material has been that there is some sort of order to the collections, but I hadn't fully explored it. This time I used Anthropic to assess what groupings and trends it could discern (if any); this is much more qualitative than my previous experiment, and it was useful to see how the refinements to the prompt affected the results -- none of them perfect, and all containing something horribly wrong -- but it did broadly accord with my embodied sense of the collection, and it also highlighted some clusters that I hadn't appreciated before. So, it has been a handy tool for supporting my thinking about tendencies in collection management over 50 years.

I am also keen to work with a highly structured XML rendering of a single volume of dried plants. I want to see if Anthropic can be used as an inference engine to help join together descriptions of plants as provided in lists, with the actual labelled specimens (which do not use the same names as in the list).
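
For that joining experiment, one possible starting point is sketched below: the model is asked to propose matches between list descriptions and specimen labels and to show its reasoning, so that every proposed join can be checked by hand. Again this assumes the Anthropic Python SDK; the file names, their contents, and the prompt wording are placeholders rather than a description of Brad's actual XML.

```python
# Minimal sketch: asking a Claude model to propose matches between plant descriptions
# from a list and the labels on the dried specimens themselves. Assumes the `anthropic`
# package and ANTHROPIC_API_KEY; the input files and their format are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

list_descriptions = Path("volume_list.txt").read_text(encoding="utf-8")    # placeholder
specimen_labels = Path("specimen_labels.txt").read_text(encoding="utf-8")  # placeholder

prompt = (
    "Below are (A) plant descriptions taken from a list and (B) the labels attached "
    "to dried specimens in the same volume. The two do not use the same names. "
    "For each description in (A), propose the most likely matching label in (B), "
    "explain your reasoning, and say 'no confident match' where appropriate. "
    "Return the result as a table: description | proposed label | reasoning.\n\n"
    f"(A) LIST DESCRIPTIONS:\n{list_descriptions}\n\n"
    f"(B) SPECIMEN LABELS:\n{specimen_labels}"
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=4000,
    messages=[{"role": "user", "content": prompt}],
)

print(response.content[0].text)
```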


Feedback from Mark L. Thompson

Thiago mentioned that the scholar Mark Humphries has already found ways to automate some of these approaches, so perhaps there's something to be learned from what he's doing, too: https://generativehistory.substack.com/ and https://github.com/mhumphries2323/Transcription_Pearl. He has even built an executable file (.exe) that runs on Windows.
