GPT OCR followup (#148)
* clean GPT OCR in ingest.py

* split interior whitespace

* normalize more interior whitespace

* pipeline-style

* try being much more lax about is_negative

* 2/426 false positives

* this is pretty good

* images.ndjson diff script

* split up (1), (2), (3) etc on single one; 180 matches, 2 false positives

* remove F. S. LINCOLN stamps

* strip NEW YORK PUBLIC LIBRARY stamp (108)

* Drop "President Borough of Manhattan" (862!)

* update keep-ids with new cleaner

* add another manual list

* run cleaner on old site OCR, too

* squash excessive newlines
danvk authored Oct 27, 2024
1 parent 28bae2b commit 5159091
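
The cleaning steps named in the commit message (normalizing interior whitespace, removing stamp text, squashing excessive newlines) can be sketched roughly as below. This is not the code from `ingest.py`, whose diff is not shown here; the function name `clean_ocr` and the `STAMP_LINES` patterns are hypothetical and chosen only to mirror the phrases in the message.

```python
import re

# Hypothetical patterns for stamp text; the real cleaner may match differently.
STAMP_LINES = [
    re.compile(r"F\.\s*S\.\s*LINCOLN.*"),
    re.compile(r"NEW YORK PUBLIC LIBRARY"),
]


def clean_ocr(text):
    """Illustrative OCR cleanup: whitespace, stamp lines, blank-line runs."""
    cleaned = []
    for line in text.split("\n"):
        # Collapse runs of interior whitespace to one space and trim the ends.
        line = re.sub(r"\s+", " ", line).strip()
        # Drop lines that consist only of a known stamp.
        if any(p.fullmatch(line) for p in STAMP_LINES):
            continue
        cleaned.append(line)
    # Squash three or more consecutive newlines down to one blank line.
    squashed = re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned))
    return squashed.strip()
```

A line-by-line pass keeps the stamp check simple, since each pattern only has to match a whole, already-normalized line.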
Showing 13 changed files with 32,898 additions and 63,970 deletions.
88,827 changes: 29,609 additions & 59,218 deletions data/gpt-ocr.json

Large diffs are not rendered by default.

6,182 changes: 3,091 additions & 3,091 deletions data/images.ndjson

Large diffs are not rendered by default.
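
Since the `images.ndjson` diff is not rendered, a record-level comparison is one way to inspect what changed. The sketch below is not the diff script mentioned in the commit message; it assumes each line is a JSON object with an `id` field, which is a guess about the file's schema, and the function names are hypothetical.

```python
import json


def load_ndjson(path):
    """Read an NDJSON file into a dict keyed by each record's (assumed) "id"."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rec = json.loads(line)
                records[rec["id"]] = rec
    return records


def diff_records(old, new):
    """Return {id: {field: (old_value, new_value)}} for ids present in both."""
    changes = {}
    for rec_id in old.keys() & new.keys():
        fields = {
            key: (old[rec_id].get(key), new[rec_id].get(key))
            for key in old[rec_id].keys() | new[rec_id].keys()
            if old[rec_id].get(key) != new[rec_id].get(key)
        }
        if fields:
            changes[rec_id] = fields
    return changes
```

Calling `diff_records(load_ndjson(before_path), load_ndjson(after_path))` on two versions of the file would then list only the fields that differ per record.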

1 change: 1 addition & 0 deletions data/originals/README.md
@@ -17,3 +17,4 @@ Files related to the 2024 change from Ocropus → GPT-based OCR:
- `ocr-heavy-editor.ids.txt`: backing IDs of items that were reviewed by heavy users of the OldNYC OCR Corrector (heavy = 100+ edits).
- `ocr-big-movers.txt`: Dan's manual review of ~500 items with a large edit distance between the existing on-site OCR and GPT.
- `ocr-spelld25.txt`: Dan's manual review of ~80 items where GPT has more misspelled words and the length of the transcriptions varied by 25+ characters.
+ - `ocr-followup.txt`: More manual choices from follow-up review.