Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lebovits/issu1015 cleanup new pipeline #1037

Merged
merged 26 commits into from
Dec 11, 2024

Conversation

nlebovits
Copy link
Collaborator

@nlebovits nlebovits commented Dec 5, 2024

Major cleanup of new ETL pipeline:

  • updates pip dependencies
  • lints and formats with ruff
  • adds docstrings and typing
  • dynamically sets the number of workers based on the local machine when running things in parallel
  • changes the priority_level calculation to use z-scores instead of percentiles
  • connects the pipeline to the new postgres db with postgis + timescale
  • sets up monthly partitioning and six-month compression policies
  • incorporates pg_stat reporting to slack channel to monitor table sizes

A couple of outstanding items that need to be done:

  • Consolidate the process of writing to postgres (there' some redundancy in how/where it's done right now--could probably be a single class)
  • Re-incorporate data-diff and diff reporting to Slack
  • Reconnect writing tiles to GCS; write an additional unclustered tile, if possible (coordinate with @HeyZoos)
    - [ ] Make sure to notify the FE team that the new tiles will have a new name
    - [ ] If possible, make sure that this and other relevant data are written to subdirs in GCS, per that old ticket from Nico
  • Set up a cron job for monthly backup dumps to GCS from the VM
  • Smoothly replace the old ETL pipeline with the new one in the main repo, making sure not to delete the old database/disturb the old processes.

Copy link

vercel bot commented Dec 5, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
vacant-lots-proj ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 11, 2024 2:49am

@nlebovits
Copy link
Collaborator Author

hey @rmartinsen if you have the bandwidth, it'd be a huge help to get a PR review from you, specifically looking at featurelayer.py and main.py to see if there's room for improvement/concision.

Copy link
Contributor

@rmartinsen rmartinsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall there are lots of improvements here. Adding ruff, docstrings and config are all nice changes to make to the code. I left some comments about structuring the code. Adding some helper functions and consolidating error handling could really improve readability and maintainability going forward.

I don't have enough context to really speak on the logic, but none of the comments I left as dealbreakers, so take them or leave them as you see appropriate.

data/src/Pipfile Show resolved Hide resolved
data/src/main.py Outdated
END IF;
END $$;
""")
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the situation where a table exists without having this column? Ideally table creation would be isolated from data processing. We could consider adding the column in opa_properties, using a "add_create_date" flag if there cases where we don't want to create the date.

data/src/main.py Outdated Show resolved Hide resolved
data/src/main.py Outdated Show resolved Hide resolved
data/src/main.py Outdated Show resolved Hide resolved
data/src/new_etl/classes/featurelayer.py Show resolved Hide resolved
data/src/new_etl/classes/featurelayer.py Outdated Show resolved Hide resolved
data/src/new_etl/classes/featurelayer.py Outdated Show resolved Hide resolved
data/src/new_etl/classes/featurelayer.py Show resolved Hide resolved
@rmartinsen
Copy link
Contributor

rmartinsen commented Dec 7, 2024

hey @rmartinsen if you have the bandwidth, it'd be a huge help to get a PR review from you, specifically looking at featurelayer.py and main.py to see if there's room for improvement/concision.

For sure! I left some comments. None of them are that major, but might help structure the code a little better. Please let me know if any of it doesn't make sense. I'm happy to clarify. I also want to emphasize that the code is perfectly fine without these changes so feel free to push back or ignore them.

@nlebovits
Copy link
Collaborator Author

Made a bunch of progress in these latest commits--I've modularized the data loading and database connections that were previously in featurelayer.py and have the dataset posting to GCP again. I need to get data-diff properly reporting to Slack, but then I think we'll be pretty much ready to get this running on the VM.

@nlebovits
Copy link
Collaborator Author

Q to consider before finalizing: how will tsdb handle a new column added to an existing table? Will it gracefully add the column and impute NAs in previous timestamps of the table? Or will it recreate the entire table?

@nlebovits nlebovits marked this pull request as ready for review December 11, 2024 02:51
@nlebovits nlebovits merged commit 56fc5fb into staging Dec 11, 2024
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants