Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migration to v1.3.0 can cause missed events when contract listeners are listening from latest #1531

Closed
EnriqueL8 opened this issue Jun 20, 2024 · 14 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation priority
Milestone

Comments

@EnriqueL8
Copy link
Contributor

As part of the v1.3.0 release, we rearchitected the way event streams from the transaction managers are configured against the namespaces in FireFly. FireFly core will now create one event stream per namespace and as such we wrote code to migrate the existing contract listeners on the previous event stream to the corresponding namespace event stream.

Contract listeners have an option of where to listen from where the choices are accepted:

  • oldest which defaults to the first block
  • a specific block number
  • latest which will be start listen from the creation of the listener.

As such when the listeners get recreated as part of the migration, the ones set to latest will now listen from a new latest and in the time where the chain has advance and FireFly is starting up some event could get lost! This can get worse with another bug where if FireFly cannot communicate with the Transaction Manager to create the listener it will give up and never listen to those events!

A way to solve this would be:

  • Retrieve the previous contract listeners created against the default eventstream
  • Find the checkpoint value - i.e the block at which this listener has caught up to!
  • When creating the new contract listener against the namespace listener set the option to listen from the checkpoint!

This was we do not lose events.

Looking at the code, this change is tricker than it sounds because the original change simply at startup creates an event stream and apply the listeners created for the namespace. There is no notion of "migration" so this code will need to introduce some code to get the existing listeners on the "global" event stream and hopefully be able to using some identifier correlate them to the new ones that need to be created.

@EnriqueL8 EnriqueL8 added bug Something isn't working priority labels Jun 20, 2024
@peterbroadhurst
Copy link
Contributor

peterbroadhurst commented Jun 21, 2024

Concrete proposal - which has other benefits:

  • For FFTM based connectors (EVMConnect) we do the work to ensure FireFly Core and query the latest checkpoint for a listener
  • We update the migration code to ask the Go blockchain plugin if it can get the checkpoint for any migration of listeners, and do one of these - in order:
    • Use the checkpoint if it can be obtained (EVMConnect only)
      • see below - unclear this is optimization is feasible, and does not add anything over the next item
    • Look up the most recent event dispatched by the connector, and if one is found then require the blockchain plugin to decode the protocolId back into a fromBlock (all connectors - EthConnect/Fabric)
    • If no event, then fall back to the original approach

@peterbroadhurst
Copy link
Contributor

The listener endpoint on FFTM already has the checkpoint information embedded, so it doesn't look like changes would be needed to the FFTM or EVMConnect codebases for Core to be able to query the checkpoint information:

https://github.com/hyperledger/firefly-transaction-manager/blob/1ac00b5eb37664d73abebd8822ae8439cef3f4bc/pkg/fftm/stream_management.go#L376-L385

@peterbroadhurst
Copy link
Contributor

This code confirms existence of custom listeners. Per the comment it's not specific to the migration scenario:

// Validate all our listeners exist on startup - consistent with the multi-party manager.
// This means if EVMConnect is cleared, simply a restart of the namespace on core will
// cause recreation of all the listeners (noting that listeners that were specified to start
// from latest, will start from the new latest rather than replaying from the block they
// started from before they were deleted).
return cm, cm.verifyListeners(ctx)

This calls down to ethereum specific code, which gets the subscription in the context of the e.streamID that is designated to the namespace:

esID := e.streamID[namespace]
sub, err := e.streams.getSubscription(ctx, subID, okNotFound)
if err != nil || sub == nil || sub.Stream != esID {
return false, nil, core.ContractListenerStatusUnknown, err
}

So from this I believe that the code for migration is rather leaning on simply the recreation code that was introduced (during the 1.2->1.3 development) for a different reason - the recreation of listeners in EVMConnect if they had been deleted.

Need to dig into what happens to the old listeners that were at a global level 👁️

@peterbroadhurst
Copy link
Contributor

Reading through #1388 and the code, it seems like the original event streams are just left and ignored. However, they are not referred to in any way currently. So implementing a read of their checkpoint would be complicated.

So I'm going to investigate locating/decoding/passing the last-event information.

@peterbroadhurst
Copy link
Contributor

For MultiParty networks (where used) the FireFly batch pin contract follows the same pattern, and there's even less code change as the namespace was included in the listener name already in 1.2:

streamID, ok := e.streamID[namespace.Name]
if !ok {
return "", i18n.NewError(ctx, coremsgs.MsgInternalServerError, "eventstream ID not found")
}
sub, err := e.streams.ensureFireFlySubscription(ctx, namespace.Name, version, ethLocation.Address, contract.FirstEvent, streamID, batchPinEventABI)

@peterbroadhurst
Copy link
Contributor

The thing that distinguishes multi-party BatchPin events from all other events in the blockchainevents table, is the lack of a listener_id.

@peterbroadhurst
Copy link
Contributor

For Tokens, there is a strong leaning to using either 0 or a specific block number in the code. When a user doesn't specify a block number, it goes to 0 in the token connector before being passed to EVMConnect. There are many cases we see specific block numbers being used to designate the contract installation block for a token. However, currently I don't see any likelihood that token connectors would have been coerced with a blockNumber: "latest".

So I am not going to work on addressing that, as it would mean a change to the /activatepool flow (which I believe gets now called on every re-connect of a namespace to a token connector, against every pool) to look up the last event and provide it as extra information to handle only the latest special case.

I will look to discuss this a little further with @awrichar

@peterbroadhurst
Copy link
Contributor

@EnriqueL8 @awrichar - I have submitted PR #1534

Open conversations I'd like your view on this week:

  • This only addresses the newest/latest case, it does not address the inefficiency of replaying all transactions since block 0 (or another specific block)
    • Should we extend the change to always use this logic if the lastProtocolID block is greater-than the registration block?
  • Token connectors:
    • Question outlined in previous comment

@EnriqueL8
Copy link
Contributor Author

EnriqueL8 commented Jun 25, 2024

Thanks for all the thinking!

  • I'm happy with not addressing this for the token connectors based on the above points.
  • It would be create to solve the inefficiency as catching up from large chain as part of the upgrade can take a long time. Wouldn't it almost always be the case that the lastProtocolID block is greater-than the registration block?

@peterbroadhurst
Copy link
Contributor

So just thinking about the implications, it will simply be that the listener in EVMConnect/fabconnect will show a fromBlock that isn't the only you specified on the listener itself.

I think that's justifiable for the benefit you get - should roll up to the release notes for 1.3.1 for sure, and maybe we need a tweak to the reference section that talks about the event bus, to cover this.

Happy to take both those items on.

@peterbroadhurst
Copy link
Contributor

I've updated the PR to handle the 0 and explicit block cases for Fabric and Ethereum

@EnriqueL8 EnriqueL8 added the documentation Improvements or additions to documentation label Jun 27, 2024
@EnriqueL8
Copy link
Contributor Author

Just added the documentation tag to make sure we document this new behaviour and add it to the release notes

@EnriqueL8
Copy link
Contributor Author

So just thinking about the implications, it will simply be that the listener in EVMConnect/fabconnect will show a fromBlock that isn't the only you specified on the listener itself.

Agreed, we should just document that as part of upgrade your listeners will get recreated and as part of this it will take the new "latest" block so it does not loose events. We query the checkpoint information from the connector when retrieving a listener so users will see it straight away

@EnriqueL8 EnriqueL8 added this to the 1.3.1 milestone Jul 3, 2024
@EnriqueL8
Copy link
Contributor Author

This has been merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation priority
Projects
None yet
Development

No branches or pull requests

2 participants