In microservice systems, a business transaction often has to span multiple microservices. Such a transaction is called a distributed transaction and is commonly implemented using the saga pattern.
A distributed transaction consists of multiple operations and requires a transaction manager to coordinate them. The transaction manager is a state machine that is responsible for executing operations and for changing state based on each operation's result.
Example use case
As an example use case, I will use a simplified bank account system with a business transaction that transfers funds from one account to another.
Transferring funds from account #1 to account #2 is a transaction that consists of an operation that removes a defined amount from account #1 and an operation that adds those funds to account #2.
In an ideal world where nothing fails, we would not need a distributed transaction to implement a fund transfer; we would just chain the operation that removes funds from account #1 with the operation that adds funds to account #2.
We do not live in an ideal world, so we need to identify what could fail and design the operation flow of our distributed transaction accordingly.
We need to distinguish whether an operation failure is a “final” failure or just a temporary one.
A final failure is one where repeating the operation will not help, so the transaction fails with a failure reason. A temporary failure is one where the operation fails for a transient reason (for example, connectivity issues) and will be repeated. For temporary failures, a timeout could also be implemented to trigger a final failure (for example, if there is no connectivity for N minutes, the transaction needs to fail).
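The distinction between final and temporary failures can be sketched in plain Scala as a small result type. This is only an illustration: the names OperationResult, InsufficientFunds and classify are hypothetical, and treating IOException and TimeoutException as temporary is an assumption that a real service would tune to its own error model.

```scala
import java.io.IOException
import java.util.concurrent.TimeoutException
import scala.util.{Failure, Success, Try}

object FailureKinds {
  // Possible outcomes of a single operation (illustrative names).
  sealed trait OperationResult
  case object Succeeded extends OperationResult
  final case class TemporaryFailure(reason: String) extends OperationResult
  final case class FinalFailure(reason: String) extends OperationResult

  // Hypothetical business error: repeating the operation will not help.
  final case class InsufficientFunds(accountId: String)
      extends Exception(s"insufficient funds on account $accountId")

  // Assumption: I/O and timeout errors are transient and worth repeating;
  // every other error is treated as final.
  def classify(attempt: Try[Unit]): OperationResult = attempt match {
    case Success(_) => Succeeded
    case Failure(e @ (_: IOException | _: TimeoutException)) =>
      TemporaryFailure(e.getMessage)
    case Failure(e) => FinalFailure(e.getMessage)
  }
}
```

With this classification, a connection error would be repeated, while InsufficientFunds would fail the transaction immediately.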
In the fund transfer example, we can identify these final failures:
- removing funds from account #1 fails because, for example, there are insufficient funds on the account
- adding funds to account #2 fails because, for example, the account is closed
If the remove fails, the transaction fails. If the add fails, the remove has to be rolled back (the funds have to be returned to the account). Removing funds could also be implemented using reservation/completion/delete operations.
The fund transfer transaction has these end statuses:
- SourceRemoveFailed – removing from account #1 failed
- SourceRemoveRollbackFailed – add to account #2 failed and rollback of remove from account #1 failed
- SourceRemoveRollbacked – add to account #2 failed and rollback of remove from account #1 was successful
- TransferSuccessful – transaction was successful
This would be the flow of the fund transfer transaction manager:
Transaction manager state machine states:
- SourceRemoveStarted – transaction start (initial transaction manager state); execution of the source (account #1) remove funds operation is in progress
- SourceRemoveFailed – if the remove funds operation fails, the transaction is finished with a fail status indicating that the source account (account #1) remove funds operation failed
- DestinationAddStarted – if the remove funds operation was successful, the transaction manager changes state and execution of the destination (account #2) add funds operation is in progress
- SourceRemoveRollbackStarted – if the add funds operation fails, the transaction manager changes state and execution of the source remove rollback is in progress (adding the funds back)
- SourceRemoveRollbackFailed – if the source remove rollback fails, the transaction is finished with a failed, incomplete status. In real-life use cases this situation should be handled depending on the use case (for example, a notification for manual action)
- SourceRemoveRollbacked – if the source remove rollback operation was successful, the transaction is finished with a fail status
- TransferSuccessful – if the destination add funds operation was successful, the transaction is finished with a success status
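The states above can be sketched as a small transition function in plain Scala. The state names come from the list; the Outcome type and the next and run helpers are hypothetical names used for illustration, and the sketch assumes that TransferSuccessful is reached when the destination add succeeds. Only final failures are modelled, because temporary failures are repeated and do not change the state.

```scala
object TransferStateMachine {
  sealed trait State
  case object SourceRemoveStarted extends State
  case object SourceRemoveFailed extends State
  case object DestinationAddStarted extends State
  case object SourceRemoveRollbackStarted extends State
  case object SourceRemoveRollbackFailed extends State
  case object SourceRemoveRollbacked extends State
  case object TransferSuccessful extends State

  // Result of the operation that is currently in progress.
  // OperationFailed stands for a final failure only.
  sealed trait Outcome
  case object OperationSucceeded extends Outcome
  case object OperationFailed extends Outcome

  def next(state: State, outcome: Outcome): State = (state, outcome) match {
    case (SourceRemoveStarted, OperationSucceeded)         => DestinationAddStarted
    case (SourceRemoveStarted, OperationFailed)            => SourceRemoveFailed
    case (DestinationAddStarted, OperationSucceeded)       => TransferSuccessful
    case (DestinationAddStarted, OperationFailed)          => SourceRemoveRollbackStarted
    case (SourceRemoveRollbackStarted, OperationSucceeded) => SourceRemoveRollbacked
    case (SourceRemoveRollbackStarted, OperationFailed)    => SourceRemoveRollbackFailed
    // End states ignore any further outcome.
    case (endState, _)                                     => endState
  }

  // Replays a sequence of operation outcomes from the initial state.
  def run(outcomes: List[Outcome]): State =
    outcomes.foldLeft(SourceRemoveStarted: State)(next)
}
```

For example, a successful remove followed by a failed add and a successful rollback ends in SourceRemoveRollbacked.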
Transaction manager should also control, depending on the use case, how multiple transfer requests are handled. These are some options:
- only one transfer per source account – if one transfer is active, others are rejected
- parallel transfers per source account
- queuing of transfer requests per source account – if one transfer is active, other transfer requests are not rejected but are stored in a FIFO queue
There is no dedicated or out-of-the-box feature for implementing distributed transactions in Lagom, but there is a natural and straightforward way to implement them using Lagom's existing features.
To implement a distributed transaction in Lagom, we need to implement:
- Transaction manager
- Operation execution
The transaction manager is a state machine that triggers operation execution, so it is logical to implement it as a Persistent Entity. The Persistent Entity state can be used as the distributed transaction state.
Operation execution, which means calling other services, cannot be done in a Persistent Entity because only command evaluation and event persistence are allowed there. So, to implement operation execution, we can use a ReadSide processor.
- Depending on the state, the Persistent Entity persists an event that triggers an operation
- ReadSide processor collects the event
- ReadSide processor executes the operation on the defined resource
- ReadSide processor, depending on the operation result:
  - success – a success command is sent to the Persistent Entity
  - temporary fail – an exception is thrown, the ReadSide processor is restarted and the event is re-handled (3a)
  - final fail – a fail command is sent to the Persistent Entity
- Persistent Entity evaluates the command and, depending on the state, persists a new event
In case of a temporary fail, a timeout can be implemented by comparing the time elapsed since the event's persist timestamp with a configured timeout.
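The decision that the ReadSide processor makes after executing an operation, including the timeout check, could look roughly like the sketch below. It is plain Scala without any Lagom dependency; OperationRequested, ConfirmOperation, FailOperation and handle are hypothetical names, and the event is assumed to carry its persist timestamp. Throwing an exception stands in for letting the processor fail so that the event is handled again.

```scala
import java.time.{Duration, Instant}

object OperationExecutionSketch {
  // Event persisted by the entity to trigger an operation (hypothetical shape).
  final case class OperationRequested(transferId: String, persistedAt: Instant)

  sealed trait OperationResult
  case object Succeeded extends OperationResult
  final case class TemporaryFailure(reason: String) extends OperationResult
  final case class FinalFailure(reason: String) extends OperationResult

  // Commands sent back to the entity (hypothetical names).
  sealed trait Command
  final case class ConfirmOperation(transferId: String) extends Command
  final case class FailOperation(transferId: String, reason: String) extends Command

  // Maps an operation result to the command for the entity. A temporary
  // failure is rethrown so the event gets re-handled, unless the configured
  // timeout has elapsed since the event was persisted.
  def handle(event: OperationRequested,
             result: OperationResult,
             now: Instant,
             timeout: Duration): Command =
    result match {
      case Succeeded            => ConfirmOperation(event.transferId)
      case FinalFailure(reason) => FailOperation(event.transferId, reason)
      case TemporaryFailure(reason) =>
        val waited = Duration.between(event.persistedAt, now)
        if (waited.compareTo(timeout) >= 0)
          FailOperation(event.transferId, s"timeout: $reason")
        else
          throw new IllegalStateException(s"temporary failure, will retry: $reason")
    }
}
```

Passing the current time in as a parameter keeps the timeout logic easy to test without waiting for real time to pass.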
Following Lagom's official recommendation in “Sizing individual microservices”, the distributed transaction Persistent Entity should be implemented in a separate service.
The resource (service) on which an operation is performed just needs to expose service calls for the defined operations and is independent of the distributed transaction service.
In the example use case with fund transfer functionality, we would implement:
- Account Service – responsible for account operations (service calls – createAccount, closeAccount, addFunds, removeFunds, removeFundsRollback)
- Fund Transfer Service – Fund transfer distributed transaction manager implementation (service call – transfer)
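To make the Account Service operations more concrete, here is a minimal in-memory sketch in plain Scala. The operation names come from the list above; everything else (the Account class, the Either[String, Unit] results, the balance helper and the in-memory map) is an assumption for illustration. In a real Lagom service these would be service calls on a service descriptor, backed by persistent state.

```scala
object AccountServiceSketch {
  final case class Account(balance: BigDecimal, closed: Boolean = false)

  // Left carries the reason of a final failure.
  trait AccountService {
    def createAccount(id: String): Either[String, Unit]
    def closeAccount(id: String): Either[String, Unit]
    def addFunds(id: String, amount: BigDecimal): Either[String, Unit]
    def removeFunds(id: String, amount: BigDecimal): Either[String, Unit]
    def removeFundsRollback(id: String, amount: BigDecimal): Either[String, Unit]
  }

  final class InMemoryAccountService extends AccountService {
    private var accounts = Map.empty[String, Account]

    // Looks up the account, applies the change and stores the result.
    private def update(id: String)(f: Account => Either[String, Account]): Either[String, Unit] =
      accounts.get(id).toRight(s"account $id not found").flatMap(f).map { updated =>
        accounts = accounts.updated(id, updated)
      }

    def createAccount(id: String): Either[String, Unit] =
      if (accounts.contains(id)) Left(s"account $id already exists")
      else { accounts = accounts.updated(id, Account(BigDecimal(0))); Right(()) }

    def closeAccount(id: String): Either[String, Unit] =
      update(id)(a => Right(a.copy(closed = true)))

    def addFunds(id: String, amount: BigDecimal): Either[String, Unit] =
      update(id) { a =>
        if (a.closed) Left(s"account $id is closed")
        else Right(a.copy(balance = a.balance + amount))
      }

    def removeFunds(id: String, amount: BigDecimal): Either[String, Unit] =
      update(id) { a =>
        if (a.balance < amount) Left(s"insufficient funds on account $id")
        else Right(a.copy(balance = a.balance - amount))
      }

    // Returns previously removed funds to the source account.
    def removeFundsRollback(id: String, amount: BigDecimal): Either[String, Unit] =
      update(id)(a => Right(a.copy(balance = a.balance + amount)))

    def balance(id: String): Option[BigDecimal] = accounts.get(id).map(_.balance)
  }
}
```

Note that the Account Service knows nothing about transfers: it only exposes the individual operations, which keeps it independent of the Fund Transfer Service.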
Implementations of the options for handling multiple transfer requests:
- only one transfer per source account – the Persistent Entity instance id uses the source account id and, depending on the state, accepts or rejects transfer start requests
- parallel transfers per source account – the Persistent Entity instance id uses the transfer id, so the entity is not aware of other transfers and parallel transfers per source account are allowed
- queuing of transfer requests per source account – the Persistent Entity instance id uses the source account id and, depending on the state, accepts the transfer request or persists it on the write side as pending. When the active transfer is finished, the state is evaluated in the command handler for pending transfers and the next one is triggered
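The queuing option can be sketched as the state of an entity keyed by the source account id. This is plain Scala rather than a Lagom Persistent Entity; Transfer, State, request and finishActive are hypothetical names, and the sketch assumes that pending transfers are triggered in arrival order.

```scala
import scala.collection.immutable.Queue

object TransferQueueSketch {
  final case class Transfer(id: String, destination: String, amount: BigDecimal)

  // State of one entity instance, i.e. one source account.
  final case class State(active: Option[Transfer], pending: Queue[Transfer])

  val empty: State = State(None, Queue.empty)

  // A new request starts immediately if nothing is active,
  // otherwise it is stored as pending instead of being rejected.
  def request(state: State, transfer: Transfer): State = state.active match {
    case None    => state.copy(active = Some(transfer))
    case Some(_) => state.copy(pending = state.pending.enqueue(transfer))
  }

  // When the active transfer reaches an end status, the next pending
  // transfer (if any) becomes the active one.
  def finishActive(state: State): State = state.pending.dequeueOption match {
    case Some((next, rest)) => State(Some(next), rest)
    case None               => State(None, Queue.empty)
  }
}
```

For the "only one transfer per source account" option, the same state could be used with request rejecting the command whenever active is defined.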
You can find the example Lagom Scala project on my GitHub.