Another very broad and very typical system problem is fault tolerance. A fault can cause an error which eventually can cause a failure. For example a fault can be to not check zero division which gives the error division by zero.
Faults can be classified into:
- Transient faults which occur once and then disappear.
- Intermittent faults appear and vanish without any apparent reason.
- Permanent faults continue to exist until the failed components are repaired.
There are different types of failure models:
- Omission failures which can occur in processes
- Timing failures, which only occur in synchronous systems, happen when one of the time limits defined for the system is violated (for example, bounds on request's latency).
- Byzantine failures can occur in processes, where intended processing steps may be omitted or additional steps may be added. The "key characteristic" of byzantine failures is that weird results are produced.
Fault tolerance refers to the ability of a system or network to handle failures without causing a complete disruption. One of the main techniques used to achieve fault tolerance is redundancy. Redundancy can be implemented in different ways to mask failures:
- Information redundancy: instead of sending only the packet, we send the packet and also some additional information that may help in recovering a byzantine failure
- Time redundancy: TCP if does not receive an ack, tries to re-send the message later
- Physical redundancy: be redundant in physical layer, so use multiple channels
Redundant process groups are groups of processes which can collectively handle the work that should be done by an individual process. In this way, even if some processes fail, the remaining healthy processes can continue to work. Using process groups can be challenging to manage the membership of distributed groups. Multicast join/leave announcements are necessary but the problem is that if a process crashes, it will not notify others, which can complicate maintaining the group's integrity: if multiple processes crash or leave, the groups structure may need to be rebuilt.
In general, if processes fail silently, then
The FloodSet algorithm is used for fault tolerance in distributed systems. Each process begins with a variable called W
, which is initialized with its own start value.
In each round, each process sends its W
to all other processes and adds the received sets to its own W
.
This process is repeated for k + 1
rounds, where k
is the maximum number of faults the system can tolerate (k
-fault-tolerance): so for example to be 5-fault-tolerant, we need 6 steps to reach an agreement.
Each process may stop in the middle of the send operation.
After k + 1
rounds a decision is made based on the size of W
:
- If the cardinality of
W
is 1, no problem at all - If the cardinality of
W
is greater than 1, it must be use a previously decided common function/criteria on all processes to make the decision on which value to take.
A more efficient implementation consists to broadcast W
only if the process learns of a new value.
In case of byzantine failure, things get more complex: the processes can both stop or exhibit any arbitrary behavior like sending arbitrary message or performing arbitrary state transitions. This problem has been described by Lamport in 1982 in terms of armies and generals.
During the initial round, all processes exchange their respective values. However among them there is a traitor which sends unusual values. Starting from the second round onwards, each process adopts the value that is most commonly shared among them.
Lamport (1982) showed that if there are
Fischer, Lynch, and Paterson proved that in general, for an asynchronous system, even a single failure is enough for not being able to reach a consensus: "Impossibility of Distributed Consensus with One Faulty Process" (FLP Theorem). Basically the FLP theorem says that every protocol that may come to your mind must be a synchronous protocol: so it must be a protocol that somehow fixes the bound in the sending of messages, fixes the bound in the processing speed of the chat-optic processes, and fixes the bound in the maximum delay jitter between the values cross.
Here the are the various alternatives for reliable group communication when processes are reliable but links are not.
- ACKs, where each recipient sends an acknowledgement after receiving the message. If an acknowledgement is not received, the sender resends the message. However, this can lead to an "ack implosion" if all recipients send acknowledgements simultaneously.
- NACKs, where ach recipient sends a NACK after a random delay indicating which packet was missed. This optimistic approach is more scalable as it prevents multiple NACKs from being sent simultaneously and reduces the likelihood of overwhelming the sender.
- Hierarchical Feedback Control is an evolution of using just ACKs and NACKS. The receivers are grouped together, with each group having a coordinator. These groups form a tree structure with the sender at the root. Coordinators manage acknowledgments and retransmissions within their groups and communicate with their parent coordinator if necessary. The challenge lies in maintaining the hierarchy, especially in the presence of dynamic group membership.
Let's define the atomic multicast problem: a message is delivered to all non-faulty members of a group or to none, maintaining the same order of messages at all receivers.
Close synchrony cannot be achieve in the presence of failures: it says that multicast messages can be considered as instantaneous events and processes which receive the messages see events in the same order.
Virtual synchrony is the weaker model which replace the close synchrony:
- Only messages sent by correct processes are processed by all correct processes.
- Crashed processes are removed from the group and must rejoin later, reconciling their state.
- Messages sent by failing processes are either processed by all correct members or by none.
- The guarantees on the receiving order are only for relevant messages
Virtual synchrony model takeaways:
- distinguishes between receiving a message and delivering it; messages are buffered by the communication layer delivered when certain conditions are met.
- The concept of group view is crucial in this model: it's the set of processes that are considered part of the group at the time of sending a message.
- Changes to the group view can be seen as another form of multicast messages (they must be consistent)
- All multicast must take place between view changes
Virtual synchrony is complex. On top of the previous explored protocol (or analogue protocols) we could retain the virtual synchrony property applying different orderings for multicast messages can be identified:
- Unordered multicasts
- FIFO-ordered multicasts: FIFO messaging ensures that messages from a single sender are delivered in the order sent. However, total FIFO ordering across all senders is not guaranteed without additional mechanisms.
- Causally-ordered multicasts
- Totally-ordered multicasts
We are discussing fault tolerance and recovery techniques. There are two types of recovery:
- Backward recovery involves going back to a previous state if a state resulting from a crash is undesirable.
- Forward recovery involves trying to correct errors without going back to a previous state.
To enable backward recovery, we need to save previous states that can be retrieved later. Checkpointing involves periodically saving the state of each process without needing to synchronize with other processes (as synchronization is not practical in distributed systems). When recovering using checkpoints, it is important to ensure that transitioning to different states at different times still makes sense for the entire application: this is achieved finding what we call a "consistent cut" or recovery line. The main challenge is to identify the most optimal recovery line, which comprises consistent checkpoints from each process, thereby representing an overall consistent cut: going back to different states at different times but still making sense for the application as a whole.
To implement and discover valid checkpoints, we need to obtain an approximation that allows us to reconstruct the best set of checkpoints for a good recovery line.
Before dive in the 2 seen algorithms, let's keep in mind that with
- Rollback-dependency graph:
- this algorithm works starting from the "final" / "crashed" state view of the overall system
- each dependency between the interval of checkpoints is translate it into a dependency between the ending checkpoint of the starting interval and the ending checkpoint of the arrival interval (ending to ending).It's like shifting towards the "right" all messages between processes, aligning them to the ending checkpoints of each interval.
- from the final situation, you "mark" all the checkpoints reachable from the crashed process following the dependency relations you "wrote".
- Checkpoint-dependency graph:
- First of all the dependencies are added from the starting state of the interval to the final state of the other interval (where the arrow arrives).
- Now the iterative part:
- You select the set with the checkpoints nearest the crash: this is called "hypothesis set":
- if there are no dependencies inside the hypothesis set a recovery line is found.
- otherwise the arrival checkpoint of the dependency which is inside the hypothesis set is discarded. Then new iteration of the algorithm is done, selecting the new set: which is made from the previous checkpoints a part the one discarded.
- You select the set with the checkpoints nearest the crash: this is called "hypothesis set":
Those are 2 centralized algorithms that should bring to the same result and are performed by a single central coordinator after a distributed collection of checkpoints from the processes. How works the distributed collections? We will see it in Synchronization chapter when we talk about Chandy-Lamport Distributed snapshot algorithm.