APIs and batch processing ‐ learnings from second collaboratory session
Agreed action points:

- Colin Greenstreet to propose some potential benchmarks to assess the quality of large language model output in response to summarization prompts applied to qualitative, text-based material. To be posted to our wiki, and also for discussion with Maurice Brenner as he continues to explore the pros and cons of LLM summarization in the context of cultural-historical research, and as he examines how to assess the reliability and reproducibility of LLM summarization compared with human summarization by expert historians.
- Colin Greenstreet to check with all collaboratory participants that they have editing access to the latest version of the session two Colab notebook, which should be accessed through the emailed link sent Tuesday afternoon to collaboratory members (rather than through the GitHub-based copy). Colin to check specifically with Rieke that she now has access, since she was unable to get access yesterday.
Re fine-tuning a model: that sounds like a great idea. I'd be very happy to work on this with you. I have only around 100 pages of letters. They are all collected in Transkribus, so I guess I'd need to run the recognition again so we have the raw baseline output plus my corrected versions, right?
There is a way that we might get a lot more data for such a model: I was one of the volunteers who trained the baseline Yiddish handwriting model on Transkribus. I could contact the organizers and see if they'd be willing to share all of that data. And I could reach out to other Yiddish Transkribus users to see if they'd like to participate. This would definitely take more time, though.
I have worked with Gemini 1.5 Advanced to come up with an initial response to Maurice Brenner's concerns about the reliability and reproducibility of the output from his summarization prompts.
**Here is the prompt I used:**
I am the convenor of an ai-and-history-collaboratory. Its members consist largely of doctoral students, post-doctoral fellows, and faculty members. I taught a segment on "summarization" a week ago, and one of our collaboratory members (Maurice Brenner) has subsequently been testing summarization prompts with his own material. He has concerns about the reliability and reproducibility of the output from his summarization prompts. I would like to propose some methodological steps for him to (1) establish appropriate benchmarks for the type of text he wants to process, (2) define a process for humans to grade LLM summarization results against these benchmarks, and (3) define a process for an LLM to grade the summarization results against these benchmarks.
I then pasted in Maurice's comments from "Prompt engineering ‐ learnings from our first session".
**Here is the Gemini response, which I have edited and improved:**
Define the Ideal Summary: Start by having multiple human experts (including Maurice) independently summarize the chosen diary extract. Discuss and reconcile differences to create a "gold standard" summary or a set of acceptable summaries. This process clarifies the key information, themes, and analytical points that a good summary should contain.
Criteria Checklist: Develop a clear rubric with specific criteria for evaluating summaries. This might include:
- Accuracy: Are the facts and details presented correctly?
- Completeness: Does the summary capture the most important information?
- Conciseness: Is the summary succinct and to the point?
- Coherence: Is the summary well-organized and easy to understand?
- Insightfulness: Does the summary offer any valuable interpretations or analysis?
- Neutrality: Does the summary avoid unwarranted assumptions or biases?
Blind Evaluation: Provide human evaluators with a set of summaries generated by different LLMs (and potentially humans) without revealing the source. This prevents bias in the evaluation.
Comparative Scoring: Have evaluators score each summary against the rubric criteria. Encourage them to provide detailed feedback and justifications for their scores.
Inter-rater Reliability: Calculate inter-rater reliability (e.g., using Cohen's kappa) to measure the agreement between different evaluators. This helps to assess the consistency and objectivity of the evaluation process.
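To make the inter-rater reliability step concrete, here is a minimal Python sketch (assuming scikit-learn is installed); the scores and the 1-5 rubric scale are illustrative placeholders, not real evaluation data.

```python
# Minimal sketch of an inter-rater reliability check, assuming two human
# evaluators have scored the same set of summaries on a 1-5 rubric scale.
# Requires: pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (one value per summary) from two evaluators
evaluator_a = [5, 4, 3, 4, 2, 5, 3, 4]
evaluator_b = [5, 3, 3, 4, 2, 4, 3, 5]

# Quadratic weighting treats a 4-vs-5 disagreement as milder than a 2-vs-5
# disagreement, which suits ordinal rubric scales
kappa = cohen_kappa_score(evaluator_a, evaluator_b, weights="quadratic")
print(f"Cohen's kappa (quadratic weights): {kappa:.2f}")
```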
Fine-tuning for Evaluation: Fine-tune an LLM specifically for the task of evaluating summaries of historical texts. Use the "gold standard" summaries and human-generated evaluations from the previous step as training data. This will help the LLM learn the criteria for a good summary in this specific context.
Feature Engineering: Extract relevant features from the summaries that the LLM can use for evaluation. These might include (a short sketch combining several of these features follows below):
- Text Overlap: Similarity measures (e.g., cosine similarity) between the LLM-generated summary and the "gold standard."
- Sentiment Analysis: Compare the sentiment expressed in the LLM summary with that of the original text and the "gold standard."
- Keyword Extraction: Check if the LLM summary includes essential keywords and concepts identified in the benchmark.
- Readability Scores: Assess the readability of the LLM summary using metrics like Flesch-Kincaid Grade Level.
LLM Scoring: Have the fine-tuned LLM assign scores to new summaries based on the learned criteria and extracted features.
Comparison and Analysis: Compare the LLM's evaluations with the human evaluations. Analyze discrepancies and identify areas where the LLM's evaluation model can be improved.
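As a rough illustration of the feature-engineering step, the sketch below computes three of the suggested features for a single LLM summary against a gold standard. It assumes the scikit-learn and textstat packages; the texts and keyword list are placeholders, not material from Maurice's diaries.

```python
# Rough sketch of extracting evaluation features for one summary
# Requires: pip install scikit-learn textstat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import textstat

# Placeholder texts and keywords for illustration only
gold_summary = "The diarist records the storm, the loss of cargo, and the crew's fear."
llm_summary = "The diary entry describes a storm at sea, lost cargo, and a frightened crew."
benchmark_keywords = {"storm", "cargo", "crew"}

# Text overlap: TF-IDF cosine similarity between gold standard and LLM summary
vectors = TfidfVectorizer().fit_transform([gold_summary, llm_summary])
overlap = cosine_similarity(vectors[0], vectors[1])[0, 0]

# Keyword coverage: share of benchmark keywords present in the LLM summary
covered = {kw for kw in benchmark_keywords if kw in llm_summary.lower()}
keyword_coverage = len(covered) / len(benchmark_keywords)

# Readability: Flesch-Kincaid grade level of the LLM summary
grade_level = textstat.flesch_kincaid_grade(llm_summary)

print(f"TF-IDF cosine overlap: {overlap:.2f}")
print(f"Keyword coverage: {keyword_coverage:.0%}")
print(f"Flesch-Kincaid grade: {grade_level:.1f}")
```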
Non-linearity: While LLMs are complex, techniques like attention visualization can help provide some insights into which parts of the input text the model focused on when generating the summary.
Instability: By systematically varying prompts and tracking changes in output, Maurice can start to understand how sensitive the LLM is to different phrasings. Techniques like prompt engineering can help create more robust and reliable prompts.
Non-verifiability: The benchmarking and evaluation processes described above will help establish a framework for verifying the accuracy and completeness of LLM-generated summaries.
Iterative Refinement: Encourage Maurice to experiment with different prompts, LLMs, and evaluation metrics. This iterative process will help him refine his approach and identify best practices for his specific research needs.
Transparency: In any academic work using LLMs, it's crucial to be transparent about the methodology, including the limitations of the technology.
Human-in-the-Loop: Emphasize that LLMs should be seen as tools to assist historians, not replace them. Human expertise is still essential for interpreting and contextualizing the information generated by LLMs.
**I asked a further question about metrics**
What metrics are used to assess the summarization capabilities of frontier LLMs and medium-sized models? Can you provide summarization metrics for Claude 3.5 Sonnet (latest version), GPT-4o, o1-preview, and Gemini 1.5 Advanced?
**Gemini's answer:**
#### Traditional N-gram Metrics:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
This family of metrics compares an LLM's summary to a human-written reference summary. It focuses on the overlap of words or n-grams (sequences of words). Different ROUGE variants (ROUGE-1, ROUGE-L, etc.) emphasize different aspects of the summary.
Strengths: Widely used, easy to calculate.
Weaknesses: Can be overly sensitive to word choice and may not capture meaning or overall coherence.
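As a hedged example of how ROUGE can be computed in practice, here is a short sketch using Google's rouge-score package; the reference and candidate texts are placeholders.

```python
# Minimal ROUGE sketch. Requires: pip install rouge-score
from rouge_score import rouge_scorer

# Placeholder texts for illustration only
reference = "The deponent describes the capture of his ship by Barbary pirates."
candidate = "The witness recounts how Barbary pirates captured his ship."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f}, recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```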
BLEU (Bilingual Evaluation Understudy):
Originally designed for machine translation, BLEU measures the precision of n-grams in the LLM summary compared to a reference summary.
Strengths: Good for capturing factual accuracy.
Weaknesses: Less sensitive to recall (whether the summary includes all the important information) and may not be ideal for evaluating more abstractive summaries.
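A comparable sketch for sentence-level BLEU, using NLTK; the texts are again placeholders, and smoothing is applied because short texts otherwise often score zero.

```python
# Minimal sentence-level BLEU sketch. Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Placeholder texts for illustration only
reference = "The deponent describes the capture of his ship by Barbary pirates.".split()
candidate = "The witness recounts how Barbary pirates captured his ship.".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap
smoothing = SmoothingFunction().method1
bleu = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print(f"BLEU: {bleu:.3f}")
```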
METEOR (Metric for Evaluation of Translation with Explicit ORdering):
This metric considers synonyms and paraphrases, making it more sensitive to meaning than BLEU or ROUGE.
Strengths: Better at capturing semantic similarity.
Weaknesses: Requires more complex linguistic resources.
#### LLM-based Metrics:
BERTScore:
Uses contextual embeddings from BERT to compare the semantic similarity between the LLM summary and a reference summary.
Strengths: Captures meaning better than traditional metrics.
Weaknesses: Computationally more expensive.
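A minimal BERTScore sketch using the bert-score package; note that the first call downloads a transformer model, so this is heavier to run than the n-gram metrics. The texts are placeholders.

```python
# Minimal BERTScore sketch. Requires: pip install bert-score
from bert_score import score

candidates = ["The witness recounts how Barbary pirates captured his ship."]
references = ["The deponent describes the capture of his ship by Barbary pirates."]

# Returns precision, recall, and F1 tensors (one value per candidate)
precision, recall, f1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {f1.mean().item():.3f}")
```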
BARTScore:
Similar to BERTScore but uses BART, a model specifically designed for text generation.
Strengths: Potentially even better at capturing meaning and fluency.
Other Emerging Metrics: There's ongoing research into metrics that evaluate aspects like factuality, coherence, and consistency.
Metrics for Specific LLMs:
Unfortunately, it's difficult to give precise summarization metrics for Claude 3.5, GPT-4, and Gemini 1.5 Advanced without specific benchmark datasets and evaluation setups. Here's why:
Closed Weights: The exact architectures and training data of these models are not fully public, making it hard to do direct comparisons.
Rapid Evolution: LLMs are constantly being updated, so performance can change quickly.
Task Specificity: Summarization performance can vary depending on the type of text (news articles, scientific papers, etc.) and the desired summary length.
What you can do:
Look for published evaluations: Check for research papers or technical reports that evaluate these models on standardized summarization datasets (e.g., CNN/Daily Mail, XSum).
Run your own evaluations: If you have access to these models, you can use the metrics described above to evaluate their summarization capabilities on your own data. Be sure to use a clear evaluation setup and compare the models fairly.
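For anyone who wants to try this, here is a rough sketch of a do-it-yourself evaluation loop that generates a summary via an API and scores it against a human gold standard. It assumes the openai and rouge-score packages and an API key in the OPENAI_API_KEY environment variable; the model name, system prompt, and placeholder texts are illustrative choices, not a recommended setup.

```python
# Rough sketch: generate a summary via an API, then score it against a gold standard
# Requires: pip install openai rouge-score
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()  # reads OPENAI_API_KEY from the environment
source_text = "..."   # the document you want summarized (placeholder)
gold_summary = "..."  # your human gold-standard summary (placeholder)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in whichever model you are evaluating
    messages=[
        {"role": "system", "content": "Summarize the following historical text in three sentences."},
        {"role": "user", "content": source_text},
    ],
)
llm_summary = response.choices[0].message.content

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(gold_summary, llm_summary)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```

The same loop could be repeated across several models and prompts, with the resulting scores fed into the human and LLM grading processes described above.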
Key takeaway:
While specific numbers are hard to pin down, it's generally accepted that:
Larger models (like GPT-4 and Gemini 1.5 Advanced) tend to perform better at summarization than smaller models. They have more capacity to understand complex language and generate coherent summaries.
Specialized models (like Claude 3.5, which is tuned for dialogue) may have particular strengths in certain summarization tasks.
By the way, inspired by Colin's example, I used Claude, Gemini, and Colab to build a text-scraping tool that downloads all of Benjamin Franklin's authored writings from Founders Online using its API and then converts them from individual JSON files into a well-formatted single CSV file. It's something others have done for me before (so I basically already had the data, just not in that specific form), but I was pretty proud that I was able to get it to work and even make some modifications to suit what I wanted to do. And now I'm wondering what else I might be able to do with that data ... without having to ask someone else to do it for me! However small it may be, it feels like quite an achievement, all things considered.
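For anyone wanting to try something similar, here is a generic sketch of the JSON-to-CSV consolidation step. The folder name and the JSON field names ("title", "dateFromDate", "text") are assumptions for illustration and would need to be adjusted to the actual structure returned by the Founders Online API.

```python
# Generic sketch: consolidate a folder of per-document JSON files into one CSV
# Field names below are assumptions, not the actual Founders Online schema
import csv
import json
from pathlib import Path

input_dir = Path("franklin_json")          # folder of downloaded JSON files (assumed)
output_csv = Path("franklin_writings.csv")

with output_csv.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "title", "date", "text"])
    for json_path in sorted(input_dir.glob("*.json")):
        record = json.loads(json_path.read_text(encoding="utf-8"))
        writer.writerow([
            json_path.name,
            record.get("title", ""),
            record.get("dateFromDate", ""),
            record.get("text", ""),
        ])
```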