From dd6d918945c7f1af5749a1c3b31ca0cb7d202145 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Mon, 25 Oct 2021 17:50:26 -0500 Subject: [PATCH 1/7] First commit for state splitting SIP --- .../SIPXX - State Artifacts Composability.md | 88 +++++++++++++++++++ 1 file changed, 88 insertions(+) create mode 100644 proposals/draft/SIPXX - State Artifacts Composability.md diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md new file mode 100644 index 0000000..b6b0a99 --- /dev/null +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -0,0 +1,88 @@ +# SIP x - Composable state artifacts + +## Status + +| | | +| ------ | ------ | +| State | Draft | +| Issue Link | [#6](https://github.com/MeltanoLabs/Singer-Working-Group/issues/6) | +| Discussion Thread(s) | (optional link) | +| Created | 2021-10-25 | + +----------------------- + +## Proposal + +### TL;DR Overview + +Singer State artifacts can be split between its constituent streams and combined again. + +### What specific change do you propose to make? + +According to the [Singer specification](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md#state-message): + +> The semantics of a STATE value are not part of the specification, and should be determined independently by each Tap. + +This SIP proposes to standardize the structure of state message `value` objects in Singer taps, to allow splitting and combining state artifacts from multiple streams. + +The proposed `value` structure looks like this: + +```json +{ + "bookmarks": { + "stream_1": {}, + "stream_2": {} + } +} +``` + +That is, there MUST be one key for every stream in the source and the object value MUST contain all necessary state information to sync the stream in question. For taps with streams that need access to a _global_ state, the same duplicate value MUST be stored per-stream. + +## Motivation + +### What problem does it solve? + +By making state (de)composable, some interesting use cases open for orchestrators. For example, syncing a Singer Tap with independent stream bookmarks, is [embarrassingly parallelizable](https://en.wikipedia.org/wiki/Embarrassingly_parallel). It should be possible to split the workload of such a tap among its constituent streams to fully utilize compute resources and reduce syncing time. + +### Why is it needed? + +By standardizing on the structure of state values, more performant and scalable use cases open for Singer taps and targets, without sacrificing inter-operability. + +## Other Considerations + +### Are there any downsides to this change? + +- Every stream would need to track any global state independently. + +### Is the change backwards compatible? + +No. Changing the state schema of a tap would break incremental replication for current users. + +### Which users are affected by the change? + +Users of existing taps that use differently structured state artifacts. For example, [tap-stripe](https://github.com/singer-io/tap-stripe). + +### How are users affected by the change? (e.g. DB upgrade required?) + +Users of non-compliant taps would need to manually massage the schema into the right format or use a one-off script. + +### Prototype Implementations + +The SDK currently enforces this state schema. + +### Future Plans + +NA + +### Excluded Alternatives + +NA + +### Acknowledgements + +- [@dmosorast](https://github.com/dmosorast) +- [@aaronsteers](https://github.com/aaronsteers) + +## What defines this SIP as "done"? + +The working group has agreed upon the standard for state values. From e1d940718f18b5fba415f39b3e87238214c87d08 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Wed, 26 Jan 2022 19:18:40 -0600 Subject: [PATCH 2/7] add JSON schema and examples --- .../SIPXX - State Artifacts Composability.md | 174 +++++++++++++++++- 1 file changed, 168 insertions(+), 6 deletions(-) diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md index b6b0a99..c2ecbc8 100644 --- a/proposals/draft/SIPXX - State Artifacts Composability.md +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -25,18 +25,136 @@ According to the [Singer specification](https://github.com/singer-io/getting-sta This SIP proposes to standardize the structure of state message `value` objects in Singer taps, to allow splitting and combining state artifacts from multiple streams. -The proposed `value` structure looks like this: +The proposed `value` structure if a **flat** JSON object with the following schema: + +```json +{ + "$schema": "http://json-schema.org/draft-06/schema#", + "title": "Singer State", + "description": "State payload format for Singer taps", + "type": "object", + "properties": { + "bookmarks": { + "title": "Stream Bookmarks", + "type": "object", + "additionalProperties": false, + "patternProperties": { + "^\\w+(\\/\\w+\\=[A-Za-z0-9_-]+)*$": { + "type": "object", + "additionalProperties": true, + "properties": { + "replication_key": { + "type": "string" + }, + "replication_key_value": { + "oneOf": [ + { + "type": "string" + }, + { + "type": "integer" + } + ] + } + } + } + } + }, + "additionalProperties": false + } +} +``` + +That is, there MUST be one key for every stream or stream partition in the source, and the object value MUST contain all necessary state information to sync the stream or partition in question. For taps with streams that need access to a _global_ state, the same duplicate value MUST be stored per-stream AND per-partition. + +Notes: + +- The keys for partitioned stream bookmarks need to indicate the partition keys (e.g. `catalogs/shop_id=143`). Thus, order of the keys must be deterministic. +- A flat object like this is the most straightforward to merge. + +In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is sync from an API endpoint with. The bookmark value of catalogs might be different for different shops, so it's necessary to store one `catalogs` bookmark for each `shop_id`. + +#### Example valid state objects ```json { "bookmarks": { - "stream_1": {}, - "stream_2": {} + "shops": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z", + "some_global_state": 123 + }, + "catalogs/shop_id=143": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + }, + "products/shop_id=143/catalog_id=2022-01": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + } } } ``` -That is, there MUST be one key for every stream in the source and the object value MUST contain all necessary state information to sync the stream in question. For taps with streams that need access to a _global_ state, the same duplicate value MUST be stored per-stream. +#### Example invalid state objects + +Global state defined at the top level: + +```json +{ + "bookmarks": { + "shops": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + }, + "catalogs/shop_id=143": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + }, + "products/shop_id=143/catalog_id=2022-01": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + } + }, + "global_state_can_not_live_here": 123 +} +``` + +The SDK implementation becomes invalid under the current proposal: + +```json +{ + "bookmarks": { + "shops": { + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + }, + "catalog": { + "partitions": [ + { + "context": { + "shop_id": 143 + }, + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + } + ] + }, + "products": { + "partitions": [ + { + "context": { + "shop_id": 143, + "catalog_id": "2022-01" + }, + "replication_key": "updated_at", + "replication_key_value": "2021-02-01T00:00:00Z" + } + ] + } + } +} +```` ## Motivation @@ -68,7 +186,7 @@ Users of non-compliant taps would need to manually massage the schema into the r ### Prototype Implementations -The SDK currently enforces this state schema. +NA ### Future Plans @@ -76,7 +194,51 @@ NA ### Excluded Alternatives -NA +#### Meltano SDK (up to 2022-02) + +The SDK currently uses a state schema where partitioned streams contain an array of different _context_ objects: + +```json +{ + "bookmarks": { + "products": { + "partitions": [ + { + "context": { + "shop_id": 143 + }, + "replication_key": "updated_at", + "replication_key_value": "2022-01-22T06:49:41.005000Z" + }, + { + "context": { + "shop_id": 158 + }, + "replication_key": "updated_at", + "replication_key_value": "2022-01-24T12:53:08.000000Z" + } + ] + } + } +} +``` + +#### Other alternatives + +Some taps use a structure like: + +```json +{ + "bookmarks": { + "products": { + "updated_at(shod_id:143)": "2022-01-22T06:49:41.005000Z", + "updated_at(shod_id:158)": "2022-01-24T12:53:08.000000Z" + } + } +} +``` + +This is close to the current proposal, but adds an unnecessary level of _nested-ness_. ### Acknowledgements From 931a587e55eb9c3886dc52a1724c907412dec213 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Wed, 26 Jan 2022 19:22:20 -0600 Subject: [PATCH 3/7] fix typo --- proposals/draft/SIPXX - State Artifacts Composability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md index c2ecbc8..f840ec1 100644 --- a/proposals/draft/SIPXX - State Artifacts Composability.md +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -72,7 +72,7 @@ Notes: - The keys for partitioned stream bookmarks need to indicate the partition keys (e.g. `catalogs/shop_id=143`). Thus, order of the keys must be deterministic. - A flat object like this is the most straightforward to merge. -In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is sync from an API endpoint with. The bookmark value of catalogs might be different for different shops, so it's necessary to store one `catalogs` bookmark for each `shop_id`. +In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API endpoint with. The bookmark value of catalogs might be different for different shops, so it's necessary to store one `catalogs` bookmark for each `shop_id`. #### Example valid state objects From dfe8a820c992ba6333d1420b898c25351513e071 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Wed, 26 Jan 2022 19:23:33 -0600 Subject: [PATCH 4/7] add details to example --- proposals/draft/SIPXX - State Artifacts Composability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md index f840ec1..d62434e 100644 --- a/proposals/draft/SIPXX - State Artifacts Composability.md +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -72,7 +72,7 @@ Notes: - The keys for partitioned stream bookmarks need to indicate the partition keys (e.g. `catalogs/shop_id=143`). Thus, order of the keys must be deterministic. - A flat object like this is the most straightforward to merge. -In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API endpoint with. The bookmark value of catalogs might be different for different shops, so it's necessary to store one `catalogs` bookmark for each `shop_id`. +In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of catalogs might be different for different shops, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. #### Example valid state objects From 524a197eecd4f67caea4f58e011559ef7a57d4cb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Wed, 26 Jan 2022 19:25:12 -0600 Subject: [PATCH 5/7] better syntax --- proposals/draft/SIPXX - State Artifacts Composability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md index d62434e..eee53c8 100644 --- a/proposals/draft/SIPXX - State Artifacts Composability.md +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -72,7 +72,7 @@ Notes: - The keys for partitioned stream bookmarks need to indicate the partition keys (e.g. `catalogs/shop_id=143`). Thus, order of the keys must be deterministic. - A flat object like this is the most straightforward to merge. -In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of catalogs might be different for different shops, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. +In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of a catalog might be change from one to another, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. #### Example valid state objects From 17db88457ae593e6c9bbfc0774c561170e93631f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Wed, 26 Jan 2022 19:25:48 -0600 Subject: [PATCH 6/7] better wording --- proposals/draft/SIPXX - State Artifacts Composability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md index eee53c8..5dad7c6 100644 --- a/proposals/draft/SIPXX - State Artifacts Composability.md +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -72,7 +72,7 @@ Notes: - The keys for partitioned stream bookmarks need to indicate the partition keys (e.g. `catalogs/shop_id=143`). Thus, order of the keys must be deterministic. - A flat object like this is the most straightforward to merge. -In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of a catalog might be change from one to another, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. +In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of a catalog might change from one to another, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. #### Example valid state objects From 74bdc82eb9c0ee14fa175512d86c952cb199e3ed Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Edgar=20Rami=CC=81rez=20Mondrago=CC=81n?= Date: Wed, 26 Jan 2022 19:26:20 -0600 Subject: [PATCH 7/7] add missing word --- proposals/draft/SIPXX - State Artifacts Composability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/draft/SIPXX - State Artifacts Composability.md b/proposals/draft/SIPXX - State Artifacts Composability.md index 5dad7c6..690c92e 100644 --- a/proposals/draft/SIPXX - State Artifacts Composability.md +++ b/proposals/draft/SIPXX - State Artifacts Composability.md @@ -72,7 +72,7 @@ Notes: - The keys for partitioned stream bookmarks need to indicate the partition keys (e.g. `catalogs/shop_id=143`). Thus, order of the keys must be deterministic. - A flat object like this is the most straightforward to merge. -In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of a catalog might change from one to another, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. +In the following examples, the stream hierarchy `shops` > `catalogs` > `products` is synced from an API. The bookmark value of a catalog might change from one shop to another, so it's necessary to store one `catalogs` bookmark for each `shop_id`. Similarly, `products` require a `shop_id` and `catalog_id`. #### Example valid state objects