Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Connector Registration and Dynamic Dispatch #1375

Merged
merged 8 commits into from
Jan 30, 2025

Conversation

alexrichey
Copy link
Contributor

@alexrichey alexrichey commented Jan 6, 2025

Implements dynamic (ie non-static / hardcoded) dispatcher for our connectors, at the lifecycle level. The idea is that if you define an destination type as, say, ginger_ftp as below
image
then the distribution code should delegate the push to a connector registered as ginger_ftp.

Impetus

This approach would enable others to register custom connectors. E.g. if another agency needed a snowflake connector, they could just register one. (example in the test case). We can implement something similar for ingest and package.

Changes

  1. refactor lifecycle.distribute to be a generic distributor (ie there should be nothing specific to Socrata or BYTES). The distributor should delegate push/pull requests to whatever connector is registered by the dispatcher.
  2. Move anything socrata specific, or anything that references multiple stages, into lifecycle.scripts.

Other Notes

  1. I tried to start pulling code into the init files for lifecycle modules to implement a nice API. In general, I think it would be nice to expose an API per lifecycle module that looks something like [lifecycle stage].[method_name]. For example, distribute.to_dataset_destination feels really nice. However, there's some work to be done here, as some of the methods I added have a stutter, e.g. package.validate_package(...). It would be nice to just call package.validate(...) BUT there's also a package.validate module so we'd have a clash. There are a few ways around this, e.g. renaming the file to package/_validate.py. Open to thoughts!
  2. I went a little heavy on Unpack for kwargs, but I'm not convinced it's actually all that helpful here. I might be nicer to just use a class.
  3. dcpy/test/models/test_connectors.py is mostly for my sake to understand what variance/covariance mean for typing in the context of protocols vs regular functions. I'm happy to just delete it before merging.

TODOS

  1. We probably need to add extra generics to the ConnectorDispatcher for result types.
  2. lifecycle.scripts.package_and_distribute has a little more logic about distribution than it ought to. I started moving this into lifecycle.distribute, but it got a little messy.

Minor Changes

  • Sometimes we called it distribute and other times distribution. Standardize on distribute.
  • Change the working file directory from .[stage] -> .lifecycle/[stage] for package.
  • I'm trying to adhere to our general emerging pattern of only passing around metadata.yml paths in typer functions, and just passing around metadata objects in the rest of our code.

Successful Job showing what a failure looks like:
https://github.com/NYCPlanning/data-engineering/actions/runs/13018821848

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch from d4f5a7e to b92c85b Compare January 6, 2025 18:53
@fvankrieken
Copy link
Contributor

I like this a lot. I had taken a slightly different approach in #1282 and while there are things I like about my approach, as written it still tethers too closely to ingest. There are things I like about keeping it more object oriented (going for a method where we parse a Dataset from ingest/product yml and then call ds.push() or something like that, but it requires a weird dance between pydantic and our actual code that felt clunky and while we lose some type safety this way, I think I still prefer it.

@fvankrieken
Copy link
Contributor

fvankrieken commented Jan 7, 2025

So then this also aligns with where ingest should try to move towards - mainly, getting all the "template"/"config" pydantic models into dcpy.models.ingest and out of connectors, since as written they're quite specific to the ingest yml declaration, and then that seems like something that we could get to play nicely with this sort of approach.

@sf-dcp
Copy link
Contributor

sf-dcp commented Jan 7, 2025

I like this too. Is it possible to define required/optional args at a connector level rather than unpacking kwargs and receiving an error when pushing/pulling? Or would it conflict with the underlying ABC class? Thinking of having some level of type safety

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch from b92c85b to 027e905 Compare January 9, 2025 23:16
Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 85 lines in your changes missing coverage. Please review.

Project coverage is 70.50%. Comparing base (013f304) to head (305f3f8).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
dcpy/lifecycle/scripts/package_and_distribute.py 0.00% 41 Missing ⚠️
dcpy/lifecycle/distribute/__init__.py 50.00% 8 Missing ⚠️
dcpy/lifecycle/distribute/connectors.py 65.21% 8 Missing ⚠️
dcpy/models/connectors/__init__.py 73.91% 5 Missing and 1 partial ⚠️
dcpy/lifecycle/package/validate.py 61.53% 5 Missing ⚠️
dcpy/models/lifecycle/validate.py 0.00% 4 Missing ⚠️
dcpy/lifecycle/distribute/_cli.py 0.00% 3 Missing ⚠️
dcpy/lifecycle/package/assemble.py 50.00% 3 Missing ⚠️
dcpy/models/lifecycle/distribute.py 88.88% 3 Missing ⚠️
dcpy/connectors/ftp.py 60.00% 2 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1375      +/-   ##
==========================================
- Coverage   70.74%   70.50%   -0.25%     
==========================================
  Files         119      124       +5     
  Lines        6041     6093      +52     
  Branches      706      702       -4     
==========================================
+ Hits         4274     4296      +22     
- Misses       1617     1650      +33     
+ Partials      150      147       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch 2 times, most recently from be2cfcb to 24e1a06 Compare January 10, 2025 15:41
@alexrichey
Copy link
Contributor Author

@fvankrieken @sf-dcp I've changed things to find a nice potential middle-ground where we can type-safety, by just binding/configuring connectors at the lifecycle level. The idea being that for ingest/package/distribute we have a pretty fixed set of args that we'd pass to these dispatchers, so it makes sense to establish interfaces/protocols at the lifecycle level. That leaves us free to keep using sensible arguments in the connectors themselves.

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch from 24e1a06 to 2a05958 Compare January 10, 2025 15:50
@alexrichey alexrichey changed the title WIP / POC of dynamic dispatch for connectors Add Connector Registration and Dynamic Dispatch Jan 10, 2025
@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch 4 times, most recently from e402142 to d86b864 Compare January 15, 2025 16:45
@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch 10 times, most recently from 4099b97 to 72c281f Compare January 27, 2025 21:54
@alexrichey
Copy link
Contributor Author

@damonmcc FYI, my big refactor of the package_and_distribute file is functional - it needs cleanup and some fixes to testing etc. But it's worked so far with manual testing. Here's an example output from the logs:
image

I also need to hook this up to GHA. But... as of now, there's not a single reference to bytes or socrata in lifecycle.package, and the only references to Socrata in lifecycle.distribute are in the README (which we have to rewrite) and in the registration of the socrata connector. So we're pretty thoroughly generalized 🎉

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch 8 times, most recently from d880a61 to f7fe1da Compare January 28, 2025 21:00
@@ -87,10 +87,10 @@ def get_datasets_by_id(self) -> dict[str, DatasetMetadata]:
def query_destinations(
self,
*,
datasets: set[str] | None = None,
datasets: frozenset[str] | None = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

immutability helps the ol' typechecker.

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch from f7fe1da to 30206f8 Compare January 28, 2025 22:41
@alexrichey
Copy link
Contributor Author

alexrichey commented Jan 28, 2025

@damonmcc I think I'm pretty much done here. Feel free to go ahead and implement that FTP connector. Or we can just review this and merge. Another real connector might give us a better sense of what we actually need though.

@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch from 30206f8 to fb38653 Compare January 28, 2025 22:50
@damonmcc
Copy link
Member

damonmcc commented Jan 30, 2025

@damonmcc I think I'm pretty much done here. Feel free to go ahead and implement that FTP connector. Or we can just review this and merge. Another real connector might give us a better sense of what we actually need though.

a lotta changes here so let's get it merged and I'll add FTP in a follow-up. I'll review and let you know if a walkthrough would be helpful

shall I rebase and skip fix the failing FTP test? at the moment, it fails because it expects a specific error message. maybe that's OK for now, could make it pass or skip it

@damonmcc damonmcc force-pushed the ar-connectors-dynamic-dispatch branch from fb38653 to 7c41cb6 Compare January 30, 2025 16:40
@alexrichey alexrichey force-pushed the ar-connectors-dynamic-dispatch branch from 7c41cb6 to 23bc075 Compare January 30, 2025 16:52
@damonmcc damonmcc force-pushed the ar-connectors-dynamic-dispatch branch from 23bc075 to 305f3f8 Compare January 30, 2025 17:18
@alexrichey alexrichey marked this pull request as ready for review January 30, 2025 17:19
@alexrichey
Copy link
Contributor Author

@damonmcc cool, marked as ready for review. Happy to walk through in person, if anything is murky. It's definitely a shotgun surgery PR.

@@ -43,9 +52,17 @@ jobs:
image: nycplanning/build-base:latest
env:
PUBLISHING_BUCKET: edm-publishing
RECIPES_BUCKET: edm-recipes
PRODUCT_METADATA_REPO_PATH: product-metadata
_TYPER_STANDARD_TRACEBACK: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this disables the use of rich for error messages. are "pretty" tracebacks in github action logs hard to read when distributing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the context variables from typer are humongous in this case, so it's a real pain to track back in the logs. Honestly, I really dislike having the full context all the time. It might be nice to tie this to the to log-level. (ie enable rich for debug level logs, but not for info)

Copy link
Member

@damonmcc damonmcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! separate from this, would love to set our test coverage to ignore CLI code and then revisit coverage of stuff like package_and_distribute()

@alexrichey alexrichey merged commit 61822a1 into main Jan 30, 2025
20 of 22 checks passed
@alexrichey alexrichey deleted the ar-connectors-dynamic-dispatch branch January 30, 2025 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants