Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove State Capability #11

Closed
pnadolny13 opened this issue Feb 2, 2022 · 4 comments
Closed

Remove State Capability #11

pnadolny13 opened this issue Feb 2, 2022 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@pnadolny13
Copy link
Collaborator

Currently state is a listed capability but we dont bookmark files. Try to implement state tracking for files in an OS agnostic way.

Or just remove the state capability from the README and MeltanoHub and note that it doesnt manage state.

@pnadolny13 pnadolny13 self-assigned this Feb 2, 2022
@pnadolny13 pnadolny13 added the bug Something isn't working label Feb 2, 2022
@aaronsteers
Copy link
Contributor

@pnadolny13 - Good call. I'd say as a short term fix, we probably can go ahead and remove the capability.

In regards to how we might implement state tracking, I think we have two approaches which are very different from each other:

  1. We could remember file names (and possibly their hashes) and rather than a greater than or equal comparison we'd just have a have we seen it comparison. This option works best when files are date stamped, size limited, and "table growth" actually looks like new files appended over time.
  2. We could treat CSVs as tables and expect a user to provide an incremental key in the catalog.json. We probably need to read the full file anyway, but we'd only emit records if the greater than or equal condition passes. This option works best for files that are going to be constantly edited or appended to.

@pnadolny13
Copy link
Collaborator Author

@aaronsteers I agree. I created #12 to remove the capability for now.

I see value in both options but I think its more common for people to use this tap for testing out pipelines vs running large workloads so reading the whole file isnt that big of a deal. I'd vote for options 2 by default but potentially supporting both options as a configurable switch - idk if thats possible or if it would cause problems though.

@aaronsteers
Copy link
Contributor

@pnadolny13 - Yes, agreed. I think the SDK "rails" have an apply_catalog() method which would apply "replication_key" if provided by a user in the catalog. That said, it's untested as of now and might need some adjustment.

There might even by built-in capability to exclude values based on that key's >= comparison.

If you have cycles (and interest) to try this out, I'd be interested to know if it works out of box - or if it needs more work.

@pnadolny13 pnadolny13 changed the title Add State Tracking or Remove Capability Remove State Capability Feb 3, 2022
@pnadolny13
Copy link
Collaborator Author

Closing because I split the implementation to a separate issue. #13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants