Reorganizing repo #144

Merged Oct 21, 2024 (53 commits)
53 commits
57342e3
start migration
danvk Oct 17, 2024
4ad25a3
only absolute imports
danvk Oct 17, 2024
eaa6c6b
update path
danvk Oct 20, 2024
5b748f3
run ingestion script
danvk Oct 20, 2024
3956aba
start filling in site, geocode directories
danvk Oct 20, 2024
12716a8
continue filling in site, geocode directories
danvk Oct 20, 2024
baabcc0
update paths
danvk Oct 20, 2024
79b0ee9
drop old generate_static_site
danvk Oct 20, 2024
d73aa12
show stderr/stdout for test debugging
danvk Oct 20, 2024
1296a4a
more PYTHONPATH
danvk Oct 20, 2024
4f719b6
move cluster_locations and data files
danvk Oct 20, 2024
900b95f
move extract_sizes; no more top-level .py files!
danvk Oct 20, 2024
3b1a158
move extract sizes, sizes.txt files
danvk Oct 20, 2024
a521bed
popular photos
danvk Oct 20, 2024
161fc11
update instructions
danvk Oct 20, 2024
74bd910
move nyc-lat-lons-nyc.js; delete viewer dir
danvk Oct 20, 2024
ce363ad
start clearing out nyc dir
danvk Oct 20, 2024
492e1f0
move and rename grid/gold.py
danvk Oct 20, 2024
137a877
clear out grid dir
danvk Oct 20, 2024
e8669cb
clear out crawl dir
danvk Oct 20, 2024
33cf185
apparently __init__.py is no longer a thing
danvk Oct 20, 2024
738b681
analysis folder; not sure about rotations yet
danvk Oct 20, 2024
e4a5891
move coders
danvk Oct 20, 2024
64a3a44
add types for photo extraction
danvk Oct 20, 2024
5cd4d92
move to oldnyc/crop
danvk Oct 20, 2024
a67393a
two more from nyc dir
danvk Oct 20, 2024
d880a05
no more code in nyc dir
danvk Oct 20, 2024
6c64a30
patch e2e
danvk Oct 20, 2024
2c29cb3
apparently one does not name a file types.py
danvk Oct 20, 2024
a470c28
rv test changes
danvk Oct 20, 2024
e3b9b34
update some paths
danvk Oct 20, 2024
4b2bb07
__main__ guard
danvk Oct 20, 2024
86593fa
move rotations.json
danvk Oct 20, 2024
bf5cce7
start feedback tools
danvk Oct 20, 2024
e8123f1
move url fetcher
danvk Oct 20, 2024
e1ac957
move/rename ocr.json
danvk Oct 21, 2024
e757a24
move ocrbacks
danvk Oct 21, 2024
047249c
start populating oldnyc/orc
danvk Oct 21, 2024
edb07b5
move more OCR code; generate_gpt_review is redundant with eval_and_re…
danvk Oct 21, 2024
9384b16
more OCR move
danvk Oct 21, 2024
50722af
update paths
danvk Oct 21, 2024
bc6b08b
ocr dir cleaned out
danvk Oct 21, 2024
f07c5bb
update paths
danvk Oct 21, 2024
0d9c098
drop ancient NOTES file
danvk Oct 21, 2024
1ba8631
move GPT OCR output into data/
danvk Oct 21, 2024
54cdb90
match previous GPT output
danvk Oct 21, 2024
d8448b3
move geogpt; all py files in oldnyc/!
danvk Oct 21, 2024
d9ac962
silence the vulture
danvk Oct 21, 2024
88b2cb8
clear out feedback dir
danvk Oct 21, 2024
f66c602
clear out rotations, nyc dirs
danvk Oct 21, 2024
406d13f
update instructions
danvk Oct 21, 2024
a4fc4e2
update instructions
danvk Oct 21, 2024
e8ce15e
document intention behind directory structure
danvk Oct 21, 2024
18 changes: 9 additions & 9 deletions .github/workflows/e2etest.yml
@@ -13,7 +13,8 @@ jobs:
tar -xzf geocache.tgz
- name: Run geocoder
run: |
-poetry run ./generate-geocodes.py --ids_filter test/random200-ids.txt --images_ndjson data/images.ndjson --output_format id-location.txt --geocode > test/random200-geocoded.txt 2> test/random200.logs.txt
+PYTHONPATH=. poetry run oldnyc/geocode/geocode.py --ids_filter test/random200-ids.txt --images_ndjson data/images.ndjson --output_format id-location.txt --geocode > >(tee test/random200-geocoded.txt) 2> >(tee test/random200.logs.txt >&2)
+# See https://stackoverflow.com/a/692407/388951 for the stdout/stderr redirection
- name: Check for diffs
run: |
git diff --exit-code test/
@@ -25,11 +26,11 @@ jobs:
- uses: ./.github/actions/setup
- name: Run cropper
run: |
-poetry run ocr/crop_morphology.py --beta 2 --overwrite test/721675b.jpg
-poetry run ocr/crop_morphology.py --beta 2 --overwrite --border_only --output_pattern '%s.border.jpg' test/721675b.jpg
+poetry run oldnyc/crop/crop_to_text.py --beta 2 --overwrite test/721675b.jpg
+poetry run oldnyc/crop/crop_to_text.py --beta 2 --overwrite --border_only --output_pattern '%s.border.jpg' test/721675b.jpg
- name: Run photo detector
run: |
-poetry run nyc/find_pictures.py test/*f.jpg > test/detected-photos.ndjson
+poetry run oldnyc/crop/find_pictures.py test/*f.jpg > test/detected-photos.ndjson
- name: Check for diffs
run: |
git diff --exit-code test/
@@ -45,17 +46,16 @@ jobs:
git clone https://github.com/oldnyc/oldnyc.github.io.git
- name: Run ingestion
run: |
-PYTHONPATH=. poetry run data/ingest.py
+PYTHONPATH=. poetry run oldnyc/ingest/ingest.py
- name: Check for diffs
run: |
git diff --exit-code data/
- name: Generate static site
run: |
export PYTHONPATH=.
-poetry run ./nyc/crops-to-json.py nyc/crops.txt > /tmp/crops.json
-poetry run ./nyc/records_to_photos.py data/images.ndjson /tmp/crops.json data/photos.ndjson
-echo '{"fixes": {}}' > ocr/feedback/fixes.json
-poetry run ./generate_static_site.py --leave-timestamps-unchanged
+poetry run oldnyc/crop/records_to_photos.py data/images.ndjson data/crops.ndjson data/photos.ndjson
+echo '{"fixes": {}}' > data/feedback/fixes.json
+poetry run oldnyc/site/generate_static_site.py --leave-timestamps-unchanged
- name: Check for diffs
run: |
cd ../oldnyc.github.io
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yml
@@ -8,7 +8,7 @@ jobs:
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/setup
-- run: poetry run pytest
+- run: PYTHONPATH=. poetry run pytest
- run: poetry run ruff check
- run: poetry run ruff format --check
- run: poetry run pyright
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -11,5 +11,8 @@
"source.organizeImports.ruff": "always",
},
// "editor.wordBasedSuggestions": "off"
},
+"markdownlint.config": {
+"code-block-style": false
+}
}
6 changes: 2 additions & 4 deletions README.md
@@ -14,12 +14,10 @@ To get going on development:
```bash
git clone git://github.com/danvk/oldnyc.git
cd oldnyc
-virtualenv env
-source env/bin/activate
-pip install -r requirements.txt
+poetry install
```

-See [nyc/howto.md](nyc/howto.md) for more details on how to perform specific tasks.
+See [howto.md](howto.md) for more details on how to perform specific tasks.

If you're interested in building your own "Old" site using this code, check out [this great writeup][3] on Old Ravenna.

Empty file removed analysis/__init__.py
Empty file.
19 changes: 0 additions & 19 deletions analysis/rotations/NOTES

This file was deleted.

Empty file removed coders/__init__.py
Empty file.
14 changes: 9 additions & 5 deletions data/README.md
@@ -1,19 +1,20 @@
-# ETL (Extract, Transform, Load)
+# OldNYC Data

-Goal is to pull together disparate data sources into a single `images.ndjson` file.
+All data comes together into a single `images.ndjson` file (see oldnyc/ingest).

-Format should be similar to the one used in [OldTO].
+Input data (sources of truth) live in `data/originals`. Files in the top-level
+`data` directory are derived from those and other sources.

Inputs:

-- `nyc/milstein.csv`: the vintage 2013 CSV file from the NYPL that started it all
+- `data/originals/milstein.csv`: the vintage 2013 CSV file from the NYPL that started it all
- Contains image ID (typically starts with "7" and ends with "f")
- Contains title, alt_title
- Contains dates
- Contains creator
- Contains source (corresponds to subcollection)
- Contains Address and Full Address (unclear provenance)
-- `Milstein_data_for_DV.csv`: the 2024 update to the CSV
+- `data/originals/Milstein_data_for_DV.csv`: the 2024 update to the CSV
- Contains image ID (with a capital "F" this time)
- Contains title (which may have changed since 2013)
- Contains two UUIDs, which can be used to construct a Digital Collections (DC) URL
@@ -22,3 +23,6 @@ Inputs:
- `data.json`: contains OCR text from 2015 (Ocropy) plus manual fixes
- `gpt-text.json`: contains OCR text from 2024 via OpenAI

+TODO:
+
+- Document provenance for all files.
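
The README above says all data ends up in a single `images.ndjson` file. As a rough illustration of that format (one JSON object per line), here is a minimal reader sketch. The `read_ndjson` helper and the `id` field in the sample are assumptions for illustration only; the actual record schema is defined by `oldnyc/ingest` and is not shown in this diff.

```python
import json
from typing import Iterable, Iterator


def read_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of NDJSON input."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Hypothetical sample records; real records carry more fields.
sample = ['{"id": "700001f"}', "", '{"id": "700002f"}']
records = list(read_ndjson(sample))
```

An open file handle can be passed in the same way as the list above, since iterating a text file yields its lines.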
Empty file removed data/__init__.py
Empty file.
File renamed without changes.
26 changes: 12 additions & 14 deletions feedback/README.md → data/feedback/README.md
@@ -2,8 +2,8 @@

OldNYC incorporates user feedback in a variety of ways, most notably:

-* Detection of rotated images
-* OCR correction
+- Detection of rotated images
+- OCR correction

This document describes how to pull in new user feedback and push the
changes to the site.
@@ -15,19 +15,18 @@ side-by-side on the file system.

Usage:

-curl "https://brilliant-heat-1088.firebaseio.com/.json?print=pretty&auth=..." -o feedback/user-feedback.json
-cp feedback/user-feedback.json feedback/user-feedback.$(date +%Y-%m-%dT%H:%M:%S).json
+curl "https://brilliant-heat-1088.firebaseio.com/.json?print=pretty&auth=..." -o data/feedback/user-feedback.json
+cp data/feedback/user-feedback.json data/feedback/user-feedback.$(date +%Y-%m-%dT%H:%M:%S).json

This will update `feedback/user-feedback.json`.
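
The `cp` command in Step 1 keeps a timestamped copy of each download. For cases where that snapshot is taken from a script, a minimal Python sketch of the same idea might look like this. The `snapshot` helper is hypothetical and not part of this repo; it only mirrors the `date +%Y-%m-%dT%H:%M:%S` naming used above.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot(path: Path, now: Optional[datetime] = None) -> Path:
    """Copy `path` to a sibling file named <stem>.<timestamp><suffix>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%dT%H:%M:%S")  # same format as the shell command
    dest = path.with_name(f"{path.stem}.{stamp}{path.suffix}")
    shutil.copy(path, dest)
    return dest
```

Calling `snapshot(Path("data/feedback/user-feedback.json"))` would be the equivalent of the `cp` line, assuming the download has already been written to that path.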

## Step 2: Update rotations

Run:

-cd analysis/rotations
-./extract_rotations.py
+poetry run oldnyc/feedback/extract_rotations.py

-This will update `analysis/rotations/corrections.json`
+This will update `data/feedback/corrections.json`

Commit all the rotations and you'll be able to review them via `git webdiff` when you run `generate_rotated_images.py` in the oldnyc.github.io repo.

@@ -37,7 +36,7 @@ To review the changes before committing, use this [localturk] template:

```bash
(echo 'photo_id,rotation'; git diff rotations.json | grep '^\+' | grep -v 'last_date' | sed 1d | sed 's/\+ *"//' | sed 's/,//' | sed 's/": /,/') > /tmp/new-rotations.txt
-localturk template.html /tmp/new-rotations.txt checked-rotations.txt
+localturk oldnyc/feedback/rotation-review.html /tmp/new-rotations.txt checked-rotations.txt
```
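
The `grep`/`sed` pipeline above turns the added lines of `git diff rotations.json` into a `photo_id,rotation` CSV. A rough Python sketch of the same transformation is below. It assumes, based only on the `sed` expressions, that each entry in `rotations.json` sits on its own line as `"photo_id": degrees`; the `new_rotations_csv` helper is hypothetical and not part of this repo.

```python
import csv
import io
import re

# Matches an added diff line such as:  +    "700123f": 90,
ADDED_ENTRY = re.compile(r'^\+\s*"([^"]+)":\s*(-?\d+),?\s*$')


def new_rotations_csv(diff_text: str) -> str:
    """Return CSV text listing the rotations added in a unified diff."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["photo_id", "rotation"])
    for line in diff_text.splitlines():
        if "last_date" in line:  # mirrors `grep -v 'last_date'`
            continue
        match = ADDED_ENTRY.match(line)
        if match:  # the `+++ b/...` header and context lines never match
            writer.writerow([match.group(1), match.group(2)])
    return out.getvalue()
```

The output of `git diff rotations.json` could be captured with `subprocess.run(..., capture_output=True, text=True)` and passed in as `diff_text`.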

[localturk]: https://github.com/danvk/localturk
@@ -46,24 +45,23 @@ localturk template.html /tmp/new-rotations.txt checked-rotations.txt

Run:

-cd ocr/feedback
-./extract_user_ocr.py
-./ocr_corrector.py
+poetry run oldnyc/feedback/extract_user_ocr.py
+poetry run oldnyc/feedback/ocr_corrector.py

-This will update `ocr/feedback/{corrections,fixes}.json`.
+This will update `data/feedback/{corrections,fixes}.json`.
`corrections.json` is an exhaustive list of new OCR corrections, while
`fixes.json` includes just one corrected version of the text for each
image.

-To manually review updates, open review/index.html in a browser.
+To manually review updates, open `data/feedback/review/index.html` in a browser.

To reject some changes, re-run `ocr_corrector.py` as it suggests.

## Step 4: Update the static site

Run:

-./generate_static_site.py
+poetry run oldnyc/site/generate_static_site.py
cd ../oldnyc.github.io
git diff
