GPT OCR followup (#148)

* clean GPT OCR in ingest.py
* split interior whitespace
* normalize more interior whitespace
* pipeline-style
* try being much more lax about is_negative
* 2/426 false positives
* this is pretty good
* images.ndjson diff script
* split up (1), (2), (3) etc on single one; 180 matches, 2 false positives
* remove F. S. LINCOLN stamps
* strip NEW YORK PUBLIC LIBRARY stamp (108)
* Drop "President Borough of Manhattan" (862!)
* update keep-ids with new cleaner
* add another manual list
* run cleaner on old site OCR, too
* squash excessive newlines
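
Several of the steps listed in the commit message (stripping library stamps, normalizing interior whitespace, squashing excessive newlines, and chaining them "pipeline-style") can be illustrated with a minimal sketch. This is not the code from `ingest.py`: the function names, the regular expressions, and the order of the steps are all assumptions made for illustration, and only the stamp phrases themselves come from the commit message.

```python
import re

# Hypothetical stamp patterns. Only the phrases are taken from the commit
# message; the exact regexes are assumptions.
STAMP_PATTERNS = [
    re.compile(r"^[ \t]*F\.[ \t]*S\.[ \t]*LINCOLN\b.*$", re.MULTILINE),
    re.compile(r"^[ \t]*NEW YORK PUBLIC LIBRARY[ \t]*$", re.MULTILINE),
]


def strip_stamps(text):
    """Blank out any line that consists of a known stamp."""
    for pattern in STAMP_PATTERNS:
        text = pattern.sub("", text)
    return text


def normalize_interior_whitespace(text):
    """Collapse runs of spaces/tabs inside each line and trim line ends."""
    lines = text.split("\n")
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in lines)


def squash_newlines(text):
    """Reduce three or more consecutive newlines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def clean_ocr(text):
    """Run each cleanup step in order, pipeline-style."""
    for step in (strip_stamps, normalize_interior_whitespace, squash_newlines):
        text = step(text)
    return text


raw = "Fifth  Avenue   at 42nd Street.\n\n\n\nNEW YORK PUBLIC LIBRARY\n\n\n1936"
print(repr(clean_ocr(raw)))
```

Running the newline squash last means that the blank lines left behind by removed stamps are collapsed along with any that were already in the transcription.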
danvk · Oct 27, 2024 · 5159091
1 parent 28bae2b
Showing 13 changed files with 32,898 additions and 63,970 deletions.
diff --git a/data/gpt-ocr.json b/data/gpt-ocr.json
diff --git a/data/images.ndjson b/data/images.ndjson
diff --git a/data/originals/README.md b/data/originals/README.md
@@ -17,3 +17,4 @@ Files related to the 2024 change from Ocropus → GPT-based OCR:
 - `ocr-heavy-editor.ids.txt`: backing IDs of items that were reviewed by heavy users of the OldNYC OCR Corrector (heavy = 100+ edits).
 - `ocr-big-movers.txt`: Dan's manual review of ~500 items with a large edit distance between the existing on-site OCR and GPT.
 - `ocr-spelld25.txt`: Dan's manual review of ~80 items where GPT has more misspelled words and the length of the transcriptions varied by 25+ characters.
+- `ocr-followup.txt`: More manual choices from follow-up review.