Migrating a Running State Chart to a New Version of the State Machine #1338

x-aaron-moore · 2020-07-28T00:01:32Z

We are investigating xstate for use primarily in the back-end in a cloud environment. Our use-cases are somewhat like business process workflows, so we are also looking at a workflow engine called Zeebe (as it is specifically focused on running BPMN 2.0 workflows in internet scale service environments). However, having worked with State Charts in designing reactive behavior... at this point I find State Charts to be more appealing. They are both simpler and more flexible. Furthermore, the apparent lack of a solution to this same question in the Zeebe forum gives me pause.

My question: does anyone in the community have experience running long-lived xstate machines whose implementation changes over time? For example:

New states are added
New transitions or events are added
Part of the machine is found to be faulty and is replaced with a better way of modeling the same interaction

The question is, if we have many state machines in flight at any given time, how can we evolve the state machines themselves over time and pick up these changes, without having to wait for all the old state machines that were launched based on old implementations to simply run to completion?

Based on some prior discussion with this community (thanks for your input!) I have begun thinking of this in terms of Event Sourcing. Essentially, if instead of persisting machine state, I persist the sequence of events with their payloads that led to the current state, then the "current state" of the machine becomes kind of ephemeral. It can always be reconstructed by replaying events against the current implementation of the state machine. (Though this may require custom runner logic to skip over actions and invokes when rehydrating a machine). Things that result from this:

You can change the state machine implementation at any time, so long as you retain the ability to process all of the same events. Perhaps even more important: the machine's inbound and outbound events (any events other than internal ones) form its public API. The schema of event payloads may not change quickly, but the machine's states and transitions can be thought of as implementation details that can change at any time. Event payloads seem a more suitable candidate for an API boundary than states and transitions.
You can conceivably evolve the event schema over time via event sourcing methodologies described in detail elsewhere. If interested, I recommend this talk: https://youtu.be/GzrZworHpIk
You get other event sourcing benefits such as an audit log of meaningful events that describe how you arrived at whatever the current state is. From the point of view of our use cases, this audit log would be extremely valuable for monitoring and debugging.
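To make the replay idea concrete, here is a minimal sketch in plain TypeScript rather than XState. The names `MachineDef`, `transition`, `rehydrate`, and the `awaiting_shipment` state are made up for illustration. It assumes the machine can be reduced to a pure transition table, so replaying the log never triggers actions or invoked services.

```typescript
// Hypothetical types for illustration; none of these names come from XState.
type OrderEvent = { type: string; payload?: Record<string, unknown> };

interface MachineDef {
  initial: string;
  // state -> event type -> next state
  transitions: Record<string, Record<string, string>>;
}

// Pure transition function: it only computes the next state, so replaying
// old events cannot repeat side effects. Unknown events leave the state as-is.
function transition(machine: MachineDef, state: string, event: OrderEvent): string {
  return machine.transitions[state]?.[event.type] ?? state;
}

// Rebuild the "current state" from the persisted event log.
function rehydrate(machine: MachineDef, log: OrderEvent[]): string {
  return log.reduce((state, event) => transition(machine, state, event), machine.initial);
}

// Assumed example machine (an order workflow).
const orderMachine: MachineDef = {
  initial: "start",
  transitions: {
    start: { create: "awaiting_payment" },
    awaiting_payment: { pay: "awaiting_shipment" },
    awaiting_shipment: { ship: "shipped" },
  },
};

// The event log is the only thing that is persisted.
const eventLog: OrderEvent[] = [
  { type: "create" },
  { type: "pay", payload: { amount: 42 } },
];

const currentState = rehydrate(orderMachine, eventLog);
console.log(currentState); // "awaiting_shipment"
```

Swapping in a new `MachineDef` and calling `rehydrate` on the same log is what "picking up the new implementation" would amount to in this model.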

Does anyone have relevant experience to share regarding either evolving the implementation of long-running machines, or using xstate with Event Sourcing inspired machine runners? Or does anyone have alternative solutions to these same problems? It would be very encouraging to hear about prior exploration in this area.

Thanks!

davidkpiano · 2020-07-28T03:28:46Z

Maintainer

This isn't really a solution, just my thoughts on this (not so easy) problem:

> I have begun thinking of this in terms of Event Sourcing. Essentially, if instead of persisting machine state, I persist the sequence of events with their payloads that led to the current state, then the "current state" of the machine becomes kind of ephemeral

I think this is part of the key to solving this, but there's more to it than that. The idea of "versioning" state machines is an interesting problem, and I found some prior art/exploration:
Versioned FSM (Finite-State Machine) with Postgresql

We can use the example state machines for discussion purposes:

[Figure: state machine diagrams, Version 1 and Version 2]

Let's assume that the following events happened:

create
pay
machine changes to version 2
ship

If we were to "replay" the first two events on Version 2, then the pay event would be swallowed because we would be stuck in the awaiting_approval state. So what should happen?
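A small sketch of that replay in plain TypeScript. The two transition tables are my reconstruction of Version 1 and Version 2 from the state and event names used in this thread. The `awaiting_shipment` state and the `cancel` edge on `awaiting_payment` are assumptions, and `replay` is a hypothetical helper, not an XState API.

```typescript
type Transitions = Record<string, Record<string, string>>;

// Assumed shape of Version 1: create -> pay -> ship.
const v1: Transitions = {
  start: { create: "awaiting_payment" },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};

// Assumed shape of Version 2: an approval step is inserted after create.
const v2: Transitions = {
  start: { create: "awaiting_approval" },
  awaiting_approval: { approve: "awaiting_payment", cancel: "canceled" },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};

// Replay a list of event names and record which ones no transition accepted.
function replay(machine: Transitions, events: string[]): { state: string; swallowed: string[] } {
  let state = "start";
  const swallowed: string[] = [];
  for (const event of events) {
    const next = machine[state]?.[event];
    if (next === undefined) {
      swallowed.push(event); // no transition for this event in this state
    } else {
      state = next;
    }
  }
  return { state, swallowed };
}

const history = ["create", "pay"];
const onV1 = replay(v1, history);
const onV2 = replay(v2, history);

console.log(onV1); // { state: "awaiting_shipment", swallowed: [] }
console.log(onV2); // { state: "awaiting_approval", swallowed: ["pay"] }
```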

Here are my thoughts, which might be wrong...

If we consider that the states represent some process where prerequisite data is collected in order to successfully complete a transaction, we can determine whether a previous state should be allowed to be "grandfathered" in. For example, if a precondition of awaiting_payment is that we need an approval_code, then we can see that the extended state from version 1:

{ 
  payment_details: { ... },
  shipment_details: { ... }
}

is different than what is expected if we were to traverse a path to the shipped state in version 2:

{
- approval_code: ...,
  payment_details: { ... },
  shipment_details: { ... }
}

So we know that we can't just "jump" to the shipped state in version 2 with the data available in version 1. So what happens now?

We should replay the first two events on version 2, but have a way to defer events that aren't accepted in a state, such as the awaiting_approval state. So when we get to that state, here's where we're at:

| State | Event | Deferred |
| --- | --- | --- |
| start | | |
| awaiting_approval | create | |
| awaiting_approval | pay | pay |
| awaiting_approval | ship | pay, ship |

Some sort of action should be elevated to the "client-side" of this, saying something like "an approval code is required". In XState, this can be handled with a wildcard transition (not sure if BPMN has the same notion as wildcard events/transitions):

on: {
  approve: {
    target: 'awaiting_payment',
    actions: 'recall' // recall all deferred events
  },
  '*': { actions: ['defer', 'notifyApprovalRequired'] }
}

So once the approve event finally occurs, the workflow becomes "unstuck" and the deferred events are recalled. I wrote about the defer/recall pattern here: #1305 (comment) and I'll probably end up making these available as helper functions.
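For illustration, here is a rough sketch of defer/recall as a tiny runner in plain TypeScript. It does not use the `defer` and `recall` actions named above (those are not built into XState as far as this thread says). The `Runner` shape and the `send` function are hypothetical, and the Version 2 table is the same assumed reconstruction used for discussion. As a simplification, it defers any event the current state does not accept and recalls after every successful transition, rather than only in `awaiting_approval`.

```typescript
type Transitions = Record<string, Record<string, string>>;

// Assumed Version 2 transition table.
const v2: Transitions = {
  start: { create: "awaiting_approval" },
  awaiting_approval: { approve: "awaiting_payment", cancel: "canceled" },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};

interface Runner {
  state: string;
  deferred: string[]; // events waiting for a state that accepts them
  notices: string[];  // what would be surfaced to the "client side"
}

function send(machine: Transitions, runner: Runner, event: string): void {
  const next = machine[runner.state]?.[event];
  if (next === undefined) {
    // Plays the role of the wildcard transition: defer and notify.
    runner.deferred.push(event);
    if (runner.state === "awaiting_approval") {
      runner.notices.push("an approval code is required");
    }
    return;
  }
  runner.state = next;
  // Recall: take everything deferred so far and re-send it in order.
  const pending = runner.deferred.splice(0);
  for (const e of pending) send(machine, runner, e);
}

const order: Runner = { state: "start", deferred: [], notices: [] };
for (const e of ["create", "pay", "ship"]) send(v2, order, e);

// Snapshot before approval: stuck, with pay and ship deferred.
const beforeApproval = { state: order.state, deferred: order.deferred.slice() };
console.log(beforeApproval); // { state: "awaiting_approval", deferred: ["pay", "ship"] }

send(v2, order, "approve");
console.log(order.state); // "shipped"
```

One thing this sketch does not handle: events sent to a final state stay deferred forever, so a real runner would need a policy for those.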

Another thought: in theory, we can compare both versions and generate a diff between them, since they're just directed graphs:

DELETE "create" edge on "start" node
ADD "awaiting_approval" node
ADD "create" edge on "start" node to "awaiting_approval"
ADD "approve" edge on "awaiting_approval" node to "awaiting_payment" node
ADD "cancel" edge on "awaiting_approval" node to "canceled" node

If there are no DELETE operations and no "overriding" edges on the same state node for the same event, then we can call two versions compatible with each other. Otherwise, we can go through each of the DELETE operations and examine which paths would be affected (in this case, all paths that start with the create event) and surface to the developer how migration should be handled for those cases.

This can be done automatically with some tooling. I'll have to give it some more thought.
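As a rough idea of what such tooling might look like, here is a sketch that diffs two transition tables and applies the "no DELETE" rule. Everything here is hypothetical (`diffMachines`, `isCompatible`, the `Edge` type), and the two tables are an assumed reconstruction of Version 1 and Version 2. An edge whose target changed shows up as a DELETE plus an ADD, so "overriding" edges are covered by the same rule.

```typescript
type Transitions = Record<string, Record<string, string>>;
interface Edge { source: string; event: string; target: string }

// Assumed reconstruction of the two versions discussed above.
const v1: Transitions = {
  start: { create: "awaiting_payment" },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};
const v2: Transitions = {
  start: { create: "awaiting_approval" },
  awaiting_approval: { approve: "awaiting_payment", cancel: "canceled" },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};

function edgesOf(m: Transitions): Edge[] {
  const edges: Edge[] = [];
  for (const source of Object.keys(m)) {
    for (const event of Object.keys(m[source])) {
      edges.push({ source, event, target: m[source][event] });
    }
  }
  return edges;
}

// Every state that appears as a source or a target, without duplicates.
function nodesOf(m: Transitions): string[] {
  const names: string[] = [];
  for (const e of edgesOf(m)) {
    if (names.indexOf(e.source) === -1) names.push(e.source);
    if (names.indexOf(e.target) === -1) names.push(e.target);
  }
  return names;
}

function hasEdge(m: Transitions, e: Edge): boolean {
  return m[e.source] !== undefined && m[e.source][e.event] === e.target;
}

function diffMachines(oldM: Transitions, newM: Transitions): string[] {
  const ops: string[] = [];
  const oldNodes = nodesOf(oldM);
  const newNodes = nodesOf(newM);
  for (const e of edgesOf(oldM)) {
    if (!hasEdge(newM, e)) ops.push(`DELETE "${e.event}" edge on "${e.source}" node`);
  }
  for (const n of oldNodes) {
    if (newNodes.indexOf(n) === -1) ops.push(`DELETE "${n}" node`);
  }
  for (const n of newNodes) {
    if (oldNodes.indexOf(n) === -1) ops.push(`ADD "${n}" node`);
  }
  for (const e of edgesOf(newM)) {
    if (!hasEdge(oldM, e)) ops.push(`ADD "${e.event}" edge on "${e.source}" node to "${e.target}" node`);
  }
  return ops;
}

// Compatible means the diff contains no DELETE operations.
function isCompatible(ops: string[]): boolean {
  return ops.every((op) => op.indexOf("DELETE") !== 0);
}

const ops = diffMachines(v1, v2);
console.log(ops.join("\n"));
console.log(isCompatible(ops)); // false, because the "create" edge was replaced
```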

x-aaron-moore
Jul 28, 2020
Author

Very interesting example! Thanks for pointing out these issues; better to confront them now than in the middle of a migration with many actively running machines!

I haven't gotten a chance yet to look through the links you posted, will take a look. Figured I'd give a first stab at the problem with just my own common sense as best as I can muster it.

Problem Statement: You want to introduce an approval phase in your ordering workflow.

How does this impact existing machines? From a product perspective it seems like you could take a couple different strategies:

Consider the in-flight machines to be grandfathered in and considered "pre-approved".
Require all machines that have not reached a final state to go back and get approval before proceeding.

I drew these diagrams in Lucid Chart to try to figure out what I would do in this scenario:

Having this intermediate machine with a deprecated transition would essentially implement product strategy 1. Off-hand that seems like the approach I would imagine product people taking. It is simpler and consistent with the existing behavior.
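Since the diagrams may not come through, here is one possible reading of the intermediate machine as a plain TypeScript transition table. This is a guess at the shape, not a transcription of the diagram: the deprecated edge, the `awaiting_shipment` state, and the `replay` helper are all assumptions for illustration.

```typescript
type Transitions = Record<string, Record<string, string>>;

// Hypothetical intermediate machine between Version 1 and Version 2.
const intermediate: Transitions = {
  start: { create: "awaiting_approval" },
  awaiting_approval: {
    approve: "awaiting_payment",
    cancel: "canceled",
    // DEPRECATED: exists only so that legacy streams, where "pay" directly
    // follows "create", are treated as pre-approved.
    pay: "awaiting_shipment",
  },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};

// Replay event names from the start state, ignoring events with no transition.
function replay(machine: Transitions, events: string[]): string {
  let state = "start";
  for (const event of events) {
    const next = machine[state]?.[event];
    if (next !== undefined) state = next;
  }
  return state;
}

const legacyOrder = replay(intermediate, ["create", "pay", "ship"]);
const newOrder = replay(intermediate, ["create", "approve", "pay", "ship"]);

console.log(legacyOrder); // "shipped"
console.log(newOrder);    // "shipped"
```

As written, nothing stops a brand-new order from also taking the deprecated edge, so in practice it would probably need a guard on something like the machine version recorded with the event stream.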

However, one could imagine a scenario where the skipping of the approval step is actually a really bad bug and we don't want to allow any more orders to proceed without approval.

This is conceptually similar to the idea of deferred events. One thing that came out of this exercise is that it reinforces my conviction that treating the event stream as the source of truth is a sane approach. You cannot go back and say that the order of events wasn't originally ["created", "payment received"]; one way or another you have to live with that reality and compensate for it.

In both instances, my instinct was to introduce an "in between" state machine for the breaking change between version 1 and version 2. This would operate until all machines that still have the legacy event order have run to completion.

In the latter case, this was much more complicated: it involved creating a superstate that captures history, so that a machine can be routed back to the approval step and then resume where it left off. Furthermore, this implies an upgrade batch job that sends the "request approval" event to all in-progress machines that have not completed. Event processing might need to be queued while the batch job completes. Obviously this is all much more complicated; however, the appeal to me is that it is explicit.

However, if state machines run for a very long time, then the further problem of having to build on top of these intermediate machines that can never be cleaned up would arise.

Also, this migration approach raises the question of "when can you clean up the intermediate states and transitions?" You would have to have some way of identifying event streams that are associated with an old machine implementation. Perhaps a concept of machine version that would be captured with the event stream would make this relatively simple.

The compatibility of two machine versions with respect to a given event stream is an interesting concept. If it could be asserted automatically, or even manually with a reasonable number of unit tests, it would help a lot in making changes with confidence.
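Here is a sketch of how that assertion might look, combined with the idea of storing a machine version alongside each event stream. All names (`EventStream`, `checkCompatibility`, `CURRENT_VERSION`) and the Version 2 table are assumptions for illustration. A stream counts as compatible here only if every recorded event is accepted when replayed on the new machine.

```typescript
type Transitions = Record<string, Record<string, string>>;

interface EventStream {
  id: string;
  machineVersion: number; // version that was live when the stream was started
  events: string[];
}

interface ReplayReport {
  finalState: string;
  rejected: string[]; // events no transition accepted during replay
  compatible: boolean;
}

// Assumed Version 2 transition table.
const v2: Transitions = {
  start: { create: "awaiting_approval" },
  awaiting_approval: { approve: "awaiting_payment", cancel: "canceled" },
  awaiting_payment: { pay: "awaiting_shipment", cancel: "canceled" },
  awaiting_shipment: { ship: "shipped" },
};

function checkCompatibility(machine: Transitions, initial: string, stream: EventStream): ReplayReport {
  let state = initial;
  const rejected: string[] = [];
  for (const event of stream.events) {
    const next = machine[state]?.[event];
    if (next === undefined) rejected.push(event);
    else state = next;
  }
  return { finalState: state, rejected, compatible: rejected.length === 0 };
}

const CURRENT_VERSION = 2;

const streams: EventStream[] = [
  { id: "order-1", machineVersion: 1, events: ["create", "pay"] },
  { id: "order-2", machineVersion: 1, events: ["create"] },
  { id: "order-3", machineVersion: 2, events: ["create", "approve", "pay"] },
];

// Legacy streams that do not replay cleanly on the current version.
const needsMigration = streams
  .filter((s) => s.machineVersion < CURRENT_VERSION)
  .filter((s) => !checkCompatibility(v2, "start", s).compatible)
  .map((s) => s.id);

console.log(needsMigration); // ["order-1"]
```

Once `needsMigration` is empty for a given old version, that would be one signal that the intermediate states and transitions for it can be cleaned up.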

(Incidentally, will have to look more at this defer / recall pattern. I ran into a situation that sounds very similar in a machine design that is still in progress.)
