Lebovits/issu1015 cleanup new pipeline #1037

nlebovits · 2024-12-05T05:19:23Z

Major cleanup of new ETL pipeline:

- updates pip dependencies
- lints and formats with ruff
- adds docstrings and typing
- dynamically sets the number of workers based on the local machine when running things in parallel
- changes the priority_level calculation to use z-scores instead of percentiles
- connects the pipeline to the new postgres db with postgis + timescale
- sets up monthly partitioning and six-month compression policies
- incorporates pg_stat reporting to slack channel to monitor table sizes
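
To illustrate the switch from percentiles to z-scores for `priority_level`, here is a minimal sketch. It is not the pipeline's actual code: the function name, the sample values, and the ±0.5 cutoffs are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def priority_from_z_scores(values, low_cut=-0.5, high_cut=0.5):
    """Label each value Low/Medium/High by its z-score (hypothetical cutoffs)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values identical: nothing to rank, so label everything Medium.
        return ["Medium"] * len(values)
    labels = []
    for value in values:
        z = (value - mu) / sigma
        if z >= high_cut:
            labels.append("High")
        elif z <= low_cut:
            labels.append("Low")
        else:
            labels.append("Medium")
    return labels


# Made-up metric values with one clear outlier.
sample = [10, 12, 11, 13, 40]
labels = priority_from_z_scores(sample)
```

Unlike percentile bins, which always assign a fixed share of rows to each level, z-score cutoffs depend on how far a value sits from the mean, so an outlier like the last value above stands apart from the rest.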

A couple of outstanding items that need to be done:

- Consolidate the process of writing to postgres (there's some redundancy in how/where it's done right now--could probably be a single class)
- Re-incorporate data-diff and diff reporting to Slack
- Reconnect writing tiles to GCS; write an additional unclustered tile, if possible (coordinate with @HeyZoos)
  - [ ] Make sure to notify the FE team that the new tiles will have a new name
  - [ ] If possible, make sure that this and other relevant data are written to subdirs in GCS, per that old ticket from Nico
- Set up a cron job for monthly backup dumps to GCS from the VM
- Smoothly replace the old ETL pipeline with the new one in the main repo, making sure not to delete the old database/disturb the old processes.

vercel · 2024-12-05T05:19:28Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
vacant-lots-proj	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Dec 11, 2024 2:49am

nlebovits · 2024-12-05T05:24:39Z

hey @rmartinsen if you have the bandwidth, it'd be a huge help to get a PR review from you, specifically looking at featurelayer.py and main.py to see if there's room for improvement/concision.

rmartinsen

Overall there are lots of improvements here. Adding ruff, docstrings and config are all nice changes to make to the code. I left some comments about structuring the code. Adding some helper functions and consolidating error handling could really improve readability and maintainability going forward.

I don't have enough context to really speak on the logic, but none of the comments I left are dealbreakers, so take them or leave them as you see fit.
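
As one possible reading of "consolidating error handling", a small decorator could log failures in one place instead of repeating try/except blocks in each step. This is only a sketch: `log_failures`, the `"etl"` logger name, and `load_example` are hypothetical and do not come from the PR.

```python
import functools
import logging

logger = logging.getLogger("etl")  # hypothetical logger name


def log_failures(step_name):
    """Decorator: log any exception raised by a pipeline step, then re-raise it."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception records the traceback along with the message.
                logger.exception("Step %r failed", step_name)
                raise

        return wrapper

    return decorator


@log_failures("load_example")
def load_example(rows):
    """Hypothetical pipeline step used only to demonstrate the decorator."""
    if not rows:
        raise ValueError("no rows to load")
    return len(rows)
```

Each step keeps its own logic, and the logging policy lives in a single helper that can be changed without touching the steps.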

data/src/Pipfile

rmartinsen · 2024-12-06T13:24:27Z

data/src/main.py

+        END IF;
+        END $$;
+        """)
+    )


What's the situation where a table exists without having this column? Ideally, table creation would be isolated from data processing. We could consider adding the column in opa_properties, using an "add_create_date" flag if there are cases where we don't want to create the date.
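
A rough sketch of the flag idea, keeping schema setup separate from data processing. The function name, the `create_date` column name, and the `TIMESTAMPTZ` type are assumptions for illustration; only the "add_create_date" flag name comes from the comment above.

```python
def build_table_setup_sql(table_name, add_create_date=True):
    """Return DDL statements to run once, before any data processing starts."""
    if not table_name.isidentifier():
        # Guard against interpolating anything other than a plain identifier.
        raise ValueError(f"unsafe table name: {table_name!r}")
    statements = []
    if add_create_date:
        # IF EXISTS / IF NOT EXISTS make the statement safe to re-run.
        statements.append(
            f"ALTER TABLE IF EXISTS {table_name} "
            "ADD COLUMN IF NOT EXISTS create_date TIMESTAMPTZ;"
        )
    return statements


setup_sql = build_table_setup_sql("opa_properties")
```

The returned statements would be executed in a dedicated setup step, so the processing code never needs to check whether the column exists.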

data/src/main.py

data/src/new_etl/classes/featurelayer.py

rmartinsen · 2024-12-07T13:28:24Z

> hey @rmartinsen if you have the bandwidth, it'd be a huge help to get a PR review from you, specifically looking at featurelayer.py and main.py to see if there's room for improvement/concision.

For sure! I left some comments. None of them are that major, but might help structure the code a little better. Please let me know if any of it doesn't make sense. I'm happy to clarify. I also want to emphasize that the code is perfectly fine without these changes so feel free to push back or ignore them.

nlebovits · 2024-12-09T03:23:32Z

Made a bunch of progress in these latest commits--I've modularized the data loading and database connections that were previously in featurelayer.py and have the dataset posting to GCP again. I need to get data-diff properly reporting to Slack, but then I think we'll be pretty much ready to get this running on the VM.

nlebovits · 2024-12-09T03:52:15Z

Q to consider before finalizing: how will tsdb handle a new column added to an existing table? Will it gracefully add the column and impute NAs in previous timestamps of the table? Or will it recreate the entire table?

nlebovits added 12 commits November 26, 2024 20:58

update deps

021dd0e

lint, format

0666f1f

add docstrings, typing; set num workers dynamically; minor tweaks to data creation

b51385f

switch priority level calc to use z-score instead of percentile

f36f1d1

remove some logging

457099b

add draft database connection --no-verify

f1d1c58

commit working draft of hypertable creation (still needs to be cleaned up + reorg'd

87538b4

successfully posting to postgres w tsdb extension

f858b84

get constituent tables set up as hypertables

11231c6

add month partitioning and compression policies to hypertables

b8268c7

add slack reporter for hypertable sizes

0b6f407

ruff

6496cf4
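
For context on the hypertable commits above, the setup they describe (monthly partitioning plus a six-month compression policy) might look roughly like the following in TimescaleDB. This is a sketch, not the code from these commits: the helper name, the `opa_properties` table, and the `create_date` time column are assumptions, and the SQL should be checked against the installed TimescaleDB version.

```python
def hypertable_setup_sql(table, time_column="create_date"):
    """Build TimescaleDB statements for ~monthly chunks and 6-month compression."""
    if not (table.isidentifier() and time_column.isidentifier()):
        # Only plain identifiers are interpolated into the SQL text.
        raise ValueError("table and time_column must be plain identifiers")
    return [
        # Convert the table to a hypertable partitioned on the time column.
        f"SELECT create_hypertable('{table}', '{time_column}', "
        "chunk_time_interval => INTERVAL '1 month', "
        "if_not_exists => TRUE, migrate_data => TRUE);",
        # Compression must be enabled before a policy can be attached.
        f"ALTER TABLE {table} SET (timescaledb.compress);",
        # Compress chunks once they are older than six months.
        f"SELECT add_compression_policy('{table}', INTERVAL '6 months', "
        "if_not_exists => TRUE);",
    ]


statements = hypertable_setup_sql("opa_properties")
```

Returning the statements as strings keeps the sketch independent of any particular database driver; in practice they would be executed through the pipeline's existing connection.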

github-actions bot added backend frontend labels Dec 5, 2024

rmartinsen reviewed Dec 7, 2024

nlebovits added 3 commits December 8, 2024 18:08

relocate dev deps in pipfile; reinstall

dddca54

fix issue with incorrect comment

54d78e1

remove outdated parquet write to GCS

7c7fce2

vercel bot deployed to Preview December 8, 2024 23:12 View deployment

nlebovits added 4 commits December 8, 2024 22:19

restore post to GCP; data diff not yet working correctly

48bf627

modularize large parts of featurelayer; add slack reporter for data QC

b113fde

create new draft of diff report for new timestamp approach

8df4b56

modularize database, data loaders components of featurelayer.py

dd998ed

vercel bot deployed to Preview December 9, 2024 03:22 View deployment

set up data diffing

36eb609

vercel bot deployed to Preview December 10, 2024 14:40 View deployment

update pip deps

f566695

vercel bot deployed to Preview December 10, 2024 14:42 View deployment

get diff report working

5b436b7

vercel bot deployed to Preview December 11, 2024 02:22 View deployment

nlebovits added 2 commits December 10, 2024 21:43

clean up logging; add try-except block if main.py fails

432cbb5

track data_diff.py class

0521199

vercel bot deployed to Preview December 11, 2024 02:45 View deployment

nlebovits added 2 commits December 10, 2024 21:47

re-add geoalchemy2 to pipfile

928da95

remove duplicate, outdated diff report file

6e82443

vercel bot deployed to Preview December 11, 2024 02:49 View deployment

nlebovits marked this pull request as ready for review December 11, 2024 02:51

nlebovits merged commit 56fc5fb into staging Dec 11, 2024
8 of 9 checks passed
