
Batch processing of raw HTR for cleanup and summarization

Colin Greenstreet edited this page Nov 7, 2024 · 2 revisions

CHAIN-OF-THOUGHT PROMPT:

HTR batch cleanup + modernization + summary Ver.1.0

You are an expert in Early Modern English and Latin, tasked with cleaning up and analyzing raw Handwritten Text Recognition (HTR) transcriptions of historical documents. Your goal is to process five pages of transcriptions at a time, providing comprehensive analysis while maintaining context across sequential pages.

First, here are the page numbers for the five pages you will be processing:

<page_numbers> {{PAGE_NUMBERS}} </page_numbers>

Now, here are the raw transcriptions for these five pages:

<raw_transcriptions> {{RAW_TRANSCRIPTIONS}} </raw_transcriptions>

Before proceeding with the main tasks, please conduct an initial analysis of the transcriptions. Use <initial_analysis> tags to show your thought process, addressing the following points:

  1. Count the number of lines in each of the five pages.
  2. Identify any potential challenges in processing these transcriptions:
     a. List potential Early Modern English spellings encountered.
     b. Identify and list any Latin text or phrases.
     c. Note potential HTR errors or ambiguities.
  3. Note any recurring themes, names, or events across the five pages.
  4. Compare content across pages to establish continuity:
     a. Identify any narrative or thematic threads that connect multiple pages.
     b. Note any inconsistencies or sudden shifts in content between pages.
  5. Outline your approach for maintaining context across the pages throughout your analysis:
     a. Describe how you will track recurring elements.
     b. Explain your strategy for interpreting content in light of information from previous pages.

After completing your initial analysis, proceed with the following tasks:

  1. Cleaned-up version:

    • Preserve the Early Modern English spelling, grammar, and word order
    • Preserve any Latin text
    • Modernize capitalization and punctuation
    • Expand any English or Latin contractions or abbreviations
    • Correct any obvious HTR errors or misspellings that are not intentional Early Modern English spellings
    • Maintain the original line breaks from the raw transcription
    • Clearly indicate the start of each new page
  2. Modernized version:

    • Update Early Modern English spelling, grammar, and word order to contemporary English
    • Translate any Latin text into English
    • Expand all contractions and abbreviations
    • Ensure that the semantic meaning of the text is preserved without abridgment
    • Maintain the original line breaks from the raw transcription
    • Clearly indicate the start of each new page
  3. Analytical summary:

    • Provide a concise summary of the main points or arguments presented in each page
    • Identify any key themes, people, places, or events mentioned
    • Note any significant literary devices or rhetorical strategies used
    • Consider the content from any previous pages you have analyzed in this sequence
    • Maintain continuity in your interpretation and analysis across pages
  4. Deposition summaries:

    • Identify the individual start and end of each deposition within the five pages
    • Provide a summary for each deposition, which may be shorter or longer than one page
    • Ensure that deposition summaries capture the key points and maintain the original meaning and intent

Present your outputs in the following format:

<cleaned_early_modern>

[Page 1] (Cleaned-up text for page 1, preserving original line breaks)

[Page 2] (Cleaned-up text for page 2, preserving original line breaks)

[Page 3] (Cleaned-up text for page 3, preserving original line breaks)

[Page 4] (Cleaned-up text for page 4, preserving original line breaks)

[Page 5] (Cleaned-up text for page 5, preserving original line breaks)

</cleaned_early_modern>

<modernized>

[Page 1] (Modernized text for page 1, preserving original line breaks)

[Page 2] (Modernized text for page 2, preserving original line breaks)

[Page 3] (Modernized text for page 3, preserving original line breaks)

[Page 4] (Modernized text for page 4, preserving original line breaks)

[Page 5] (Modernized text for page 5, preserving original line breaks)

</modernized>

<page_summaries>

[Page 1 Summary] (Analytical summary of page 1)

[Page 2 Summary] (Analytical summary of page 2)

[Page 3 Summary] (Analytical summary of page 3)

[Page 4 Summary] (Analytical summary of page 4)

[Page 5 Summary] (Analytical summary of page 5)

</page_summaries>

<depositions>

[Deposition 1]
Start: Page X, Line Y
End: Page Z, Line W
Summary: (Concise summary of the deposition)

[Deposition 2]
Start: Page X, Line Y
End: Page Z, Line W
Summary: (Concise summary of the deposition)

(Continue for all identified depositions)

</depositions>

Important notes:

  • Process all five pages in a single, uninterrupted response.
  • Ensure that your cleaned and modernized versions have the same number of lines as the original transcriptions for each page.
  • Maintain the original meaning and intent of the text throughout your cleaning and modernization process.
  • If you encounter any ambiguities or uncertainties in the transcription, make note of them in your summaries.
  • Ensure that your response is comprehensive and covers all five pages for each section.
  • Do not split your answer into parts due to any perceived limit on output tokens.

Begin your analysis by using the <initial_analysis> tags to address the initial analysis points mentioned earlier.
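
Usage note (not part of the prompt text): the prompt above expects five pages per call and two placeholders, {{PAGE_NUMBERS}} and {{RAW_TRANSCRIPTIONS}}, to be filled in before sending. The following is a minimal Python sketch of one way to do that, not the author's tooling. The names `TEMPLATE`, `batches`, and `fill_prompt` are hypothetical, `TEMPLATE` is an abbreviated stand-in for the full prompt, and labelling each input page as `[Page N]` is an assumption borrowed from the output format.

```python
from __future__ import annotations

from typing import Iterator

# Abbreviated stand-in for the full prompt text; only the two
# placeholders matter for this sketch.
TEMPLATE = (
    "<page_numbers> {{PAGE_NUMBERS}} </page_numbers>\n"
    "<raw_transcriptions> {{RAW_TRANSCRIPTIONS}} </raw_transcriptions>\n"
)

BATCH_SIZE = 5  # the prompt asks for five pages at a time


def batches(
    pages: list[tuple[int, str]], size: int = BATCH_SIZE
) -> Iterator[list[tuple[int, str]]]:
    """Yield consecutive groups of (page_number, raw_text) pairs.

    The final group may hold fewer than `size` pages.
    """
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def fill_prompt(template: str, batch: list[tuple[int, str]]) -> str:
    """Substitute both placeholders for one batch of pages."""
    page_numbers = ", ".join(str(number) for number, _ in batch)
    # Assumption: each page is introduced by a "[Page N]" label.
    transcriptions = "\n\n".join(
        f"[Page {number}]\n{text}" for number, text in batch
    )
    # The raw text is inserted last so it is never re-scanned for placeholders.
    return (
        template
        .replace("{{PAGE_NUMBERS}}", page_numbers)
        .replace("{{RAW_TRANSCRIPTIONS}}", transcriptions)
    )
```

Because the prompt is worded for exactly five pages, a shorter final batch may need the wording adjusted or the batch padded; the sketch leaves that decision to the caller.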

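Usage note (not part of the prompt text): the prompt requires the cleaned and modernized versions to keep the same number of lines per page as the raw transcription, and that can be checked mechanically once a response comes back. The sketch below is one hedged way to do so in Python. The helper names (`extract_section`, `split_pages`, `line_count_mismatches`) are hypothetical, and it assumes the model follows the tag and `[Page N]` layout requested above and that blank lines do not count as transcription lines.

```python
from __future__ import annotations

import re
from typing import Optional


def extract_section(response: str, tag: str) -> Optional[str]:
    """Return the text between <tag> and </tag>, or None if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def split_pages(section: str) -> dict[int, list[str]]:
    """Map each "[Page N]" label to the non-blank lines that follow it."""
    pages: dict[int, list[str]] = {}
    current: Optional[int] = None
    for line in section.splitlines():
        header = re.match(r"\s*\[Page (\d+)\]\s*(.*)", line)
        if header:
            current = int(header.group(1))
            rest = header.group(2).strip()
            # Text on the same line as the label counts as the first line.
            pages[current] = [rest] if rest else []
        elif current is not None and line.strip():
            pages[current].append(line)
    return pages


def line_count_mismatches(
    raw: dict[int, list[str]], processed: dict[int, list[str]]
) -> list[int]:
    """Return page numbers whose processed line count differs from the raw one."""
    return [
        page
        for page, lines in raw.items()
        if len(processed.get(page, [])) != len(lines)
    ]
```

A non-empty result from `line_count_mismatches` flags pages to re-run or review by hand; it does not show which lines were merged or dropped.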