Commit
* clean GPT OCR in ingest.py
* split interior whitespace
* normalize more interior whitespace
* pipeline-style
* try being much more lax about is_negative
* 2/426 false positives
* this is pretty good
* images.ndjson diff script
* split up (1), (2), (3) etc on single one; 180 matches, 2 false positives
* remove F. S. LINCOLN stamps
* strip NEW YORK PUBLIC LIBRARY stamp (108)
* Drop "President Borough of Manhattan" (862!)
* update keep-ids with new cleaner
* add another manual list
* run cleaner on old site OCR, too
* squash excessive newlines
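
The cleaning steps listed above (stamp removal, interior whitespace normalization, newline squashing, applied "pipeline-style") could look roughly like the sketch below. This is an illustration only, not the code in ingest.py: the function names, the `STAMP_PATTERNS` list, and the exact regular expressions are all assumptions, and only the stamp phrases themselves come from the commit message.

```python
import re

# Hypothetical patterns; only the phrases are taken from the commit message.
STAMP_PATTERNS = [
    re.compile(r"F\.\s*S\.\s*LINCOLN"),
    re.compile(r"NEW\s+YORK\s+PUBLIC\s+LIBRARY"),
    re.compile(r"President\s+Borough\s+of\s+Manhattan"),
]


def strip_stamps(text):
    """Remove known stamp phrases wherever they occur."""
    for pattern in STAMP_PATTERNS:
        text = pattern.sub("", text)
    return text


def normalize_interior_whitespace(text):
    """Collapse runs of spaces and tabs inside each line and trim line ends."""
    lines = text.split("\n")
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in lines)


def squash_newlines(text):
    """Reduce three or more consecutive newlines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def clean_ocr(text):
    """Apply each cleaning step in order (assumed ordering)."""
    for step in (strip_stamps, normalize_interior_whitespace, squash_newlines):
        text = step(text)
    return text


if __name__ == "__main__":
    raw = (
        "Broadway   and\t42nd Street\n\n\n\n"
        "NEW YORK PUBLIC LIBRARY\n\n\n"
        "F. S. LINCOLN\nlooking north"
    )
    print(clean_ocr(raw))
```

Running stamp removal first means the blank lines the stamps leave behind are handled by the later newline-squashing step. Whether the real cleaner orders its steps this way is not stated in the commit message.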