Fault tolerance

Another very broad and very typical system problem is fault tolerance. A fault can cause an error, which can eventually cause a failure. For example, a fault can be a missing check for division by zero, which produces a division-by-zero error; if unhandled, that error leads to a failure.

Faults can be classified into:

  • Transient faults which occur once and then disappear.
  • Intermittent faults appear and vanish without any apparent reason.
  • Permanent faults continue to exist until the failed components are repaired.

There are different types of failure models:

  • Omission failures, which occur when a process (or channel) fails to perform an action it is supposed to perform, for example failing to send or receive a message.
  • Timing failures, which only occur in synchronous systems, happen when one of the time limits defined for the system is violated (for example, bounds on a request's latency).
  • Byzantine failures can occur in processes, where intended processing steps may be omitted or additional steps may be added. The key characteristic of Byzantine failures is that arbitrary, seemingly "weird" results are produced.

Fault tolerance refers to the ability of a system or network to handle failures without causing a complete disruption. One of the main techniques used to achieve fault tolerance is redundancy. Redundancy can be implemented in different ways to mask failures:

  • Information redundancy: extra information is sent along with the data (e.g., error-detecting or error-correcting codes), so that corrupted or suspicious data can be detected and possibly recovered.
  • Time redundancy: the same action is repeated when needed; for example, TCP retransmits a segment if it does not receive an acknowledgement in time (see the sketch after this list).
  • Physical redundancy: extra physical resources are used, such as multiple channels or replicated hardware components.
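
As an illustration of time redundancy, here is a minimal sketch in Python. It assumes a UDP socket and a receiver that replies with the bytes b"ACK"; the function name, parameters, and the ACK convention are illustrative assumptions, not a real protocol.

```python
import socket

def send_with_retry(sock, dest, payload, retries=3, timeout=1.0):
    """Time redundancy: resend the same datagram until an ACK arrives.

    `dest` is a (host, port) pair; the receiver is assumed to reply with
    b"ACK" (an illustrative convention made up for this sketch).
    """
    sock.settimeout(timeout)
    for attempt in range(retries):
        sock.sendto(payload, dest)
        try:
            reply, _ = sock.recvfrom(1024)
            if reply == b"ACK":
                return True          # acknowledged: stop retrying
        except socket.timeout:
            continue                 # no ACK in time: try again (time redundancy)
    return False                     # give up after `retries` attempts
```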

Protection against process failures

Redundant process groups are groups of processes that can collectively handle the work that would otherwise be done by a single process; in this way, even if some processes fail, the remaining healthy ones can continue the work. The challenge with process groups is managing the membership of a distributed group: join/leave announcements can be multicast, but a process that crashes will not notify the others, which complicates maintaining the group's integrity; if multiple processes crash or leave, the group structure may need to be rebuilt.

In general, if processes fail silently, then $k + 1$ processes are enough to make the system $k$-fault-tolerant. In case of Byzantine failures matters become worse: $2k + 1$ processes are required to achieve $k$-fault tolerance, so that a voting mechanism still works. With Byzantine failures you need the other servers (a majority of correct ones) to outvote the "weird" results produced by the faulty ones.
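
As a minimal illustration of why $2k + 1$ replicas suffice for voting, here is a hedged sketch (the function name and example values are made up): with at most $k$ Byzantine replicas, the $k + 1$ correct ones always form a majority.

```python
from collections import Counter

def vote(replica_results):
    """Return the majority result among replica answers, or None if there is no majority.

    With 2k + 1 replicas and at most k Byzantine ones, the k + 1 correct
    replicas always form a majority, so their (identical) answer wins.
    """
    counts = Counter(replica_results)
    value, count = counts.most_common(1)[0]
    return value if count > len(replica_results) // 2 else None

# Example: k = 1, so 2k + 1 = 3 replicas; one of them is faulty.
print(vote([42, 42, 13]))   # -> 42: the correct majority masks the faulty reply
```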

FloodSet algorithm

The FloodSet algorithm is used for fault tolerance in distributed systems. Each process starts with a set W, initialized with its own start value. In each round, every process sends its W to all other processes and merges the received sets into its own W. This is repeated for k + 1 rounds, where k is the maximum number of faults the system can tolerate (k-fault tolerance): for example, to be 5-fault-tolerant, 6 rounds are needed to reach an agreement.

A faulty process may crash at any time, even in the middle of a send operation, so only some of its messages for that round may be delivered.

After k + 1 rounds a decision is made based on the size of W:

  • If the cardinality of W is 1, that value is chosen and there is no problem at all.
  • If the cardinality of W is greater than 1, a previously agreed-upon function/criterion (e.g., taking the minimum) is applied by all processes to decide which value to take.

A more efficient implementation consists in broadcasting W only when the process learns of a new value.
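
Below is a hedged simulation of FloodSet for crash failures. It does not model a crash in the middle of a send (a crashed process simply stops sending from its crash round on), and the names and the min-based decision function are illustrative assumptions.

```python
def floodset(start_values, k, crashed_at=None):
    """Simulate FloodSet for crash failures over k + 1 synchronous rounds.

    start_values: dict process_id -> initial value.
    k:            maximum number of crash failures to tolerate.
    crashed_at:   optional dict process_id -> round in which it crashes
                  (a crashed process sends nothing from that round on).
    Returns the decision of each process still alive at the end.
    """
    crashed_at = crashed_at or {}
    W = {p: {v} for p, v in start_values.items()}   # each process starts with its own value

    for rnd in range(1, k + 2):                     # k + 1 rounds
        alive = [p for p in W if crashed_at.get(p, k + 2) > rnd]
        # every alive process broadcasts its current W ...
        broadcast = {p: set(W[p]) for p in alive}
        # ... and every alive process merges what it receives
        for p in alive:
            for q in alive:
                W[p] |= broadcast[q]

    # decision rule: a single value -> take it; otherwise a pre-agreed function (min here)
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else min(W[p]))
            for p in W if crashed_at.get(p, k + 2) > k + 1}

# Example: 4 processes, 1-fault tolerance (2 rounds); process "d" crashes in round 1.
print(floodset({"a": 3, "b": 7, "c": 3, "d": 9}, k=1, crashed_at={"d": 1}))
```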

Lamport's algorithm intuition

In case of Byzantine failures, things get more complex: processes can not only stop, but also exhibit arbitrary behavior, such as sending arbitrary messages or performing arbitrary state transitions. This problem was described by Lamport in 1982 in terms of armies and generals.

During the initial round, all processes exchange their respective values. However, among them there may be a traitor, which can send different (arbitrary) values to different processes. From the second round onwards, each process adopts the value that is most commonly reported.

Lamport (1982) showed that if there are $m$ traitors, $2 m+1$ loyal generals are needed for an agreement to be reached, for a total of $3m+1$ generals.
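
The following toy example (values and names made up) illustrates the intuition for $m = 1$, i.e. $3m + 1 = 4$ generals: the loyal generals exchange values, relay what they received, and take majorities. It is only a sketch of the intuition, not Lamport's full recursive algorithm, and for brevity the traitor's second-round relays are omitted (with 3 loyal relays out of 4 the majority is unchanged).

```python
from collections import Counter

def majority(values):
    """Pre-agreed, tie-free choice: most common value, lowest wins ties."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

# m = 1 traitor, so 3m + 1 = 4 generals are needed.
# Round 1: every general broadcasts its value; the traitor "d" lies differently to each peer.
round1 = {
    "a": {"a": "attack", "b": "attack", "c": "attack", "d": "retreat"},
    "b": {"a": "attack", "b": "attack", "c": "attack", "d": "attack"},
    "c": {"a": "attack", "b": "attack", "c": "attack", "d": "retreat"},
}
# Round 2: every general relays the vector it received; for each sender the
# loyal generals keep the majority of what was reported about that sender.
decisions = {}
for g in round1:                                   # loyal generals only
    per_sender = {}
    for sender in ["a", "b", "c", "d"]:
        reports = [round1[relay][sender] for relay in round1]  # what each relay says `sender` sent
        per_sender[sender] = majority(reports)
    decisions[g] = majority(per_sender.values())   # final decision over the reconciled vector
print(decisions)   # all loyal generals agree: {'a': 'attack', 'b': 'attack', 'c': 'attack'}
```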

Fischer, Lynch, and Paterson proved that in an asynchronous system even a single faulty process is enough to make consensus unreachable in general: "Impossibility of Distributed Consensus with One Faulty Process" (FLP theorem). In practice, the FLP theorem says that any protocol that does reach consensus must be a (partially) synchronous protocol: one that somehow fixes a bound on message transmission delay, a bound on the relative processing speed of the processes, and a bound on the drift between their clocks.

Reliable group communication

Here are the main alternatives for reliable group communication when processes are reliable but links are not.

Basic approaches to reliable group communication

  • ACKs, where each recipient sends an acknowledgement after receiving the message. If an acknowledgement is not received, the sender resends the message. However, this can lead to an "ack implosion" if all recipients send acknowledgements simultaneously.
  • NACKs, where each recipient sends a NACK after a random delay, indicating which packet was missed (a sketch follows this list). This optimistic approach is more scalable, as the random delay prevents multiple NACKs from being sent simultaneously and reduces the likelihood of overwhelming the sender.
  • Hierarchical Feedback Control is an evolution of using just ACKs and NACKS. The receivers are grouped together, with each group having a coordinator. These groups form a tree structure with the sender at the root. Coordinators manage acknowledgments and retransmissions within their groups and communicate with their parent coordinator if necessary. The challenge lies in maintaining the hierarchy, especially in the presence of dynamic group membership.
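
Here is a minimal sketch of the receiver side of the NACK approach (class and method names are made up for the example); the random-delay suppression is the scalability trick mentioned above.

```python
import random

class NackReceiver:
    """Receiver side of a NACK-based reliable multicast (illustrative sketch).

    Packets carry consecutive sequence numbers; a gap means a loss. Instead of
    ACKing every packet, the receiver multicasts a NACK for the missing one
    after a random delay, and suppresses it if it overhears the same NACK
    from another receiver first (avoiding a feedback implosion).
    """

    def __init__(self):
        self.expected = 0          # next sequence number we expect
        self.pending_nacks = {}    # seq -> randomized delay before sending the NACK

    def on_packet(self, seq):
        for missing in range(self.expected, seq):          # gap detected: schedule NACKs
            self.pending_nacks.setdefault(missing, random.uniform(0.0, 0.5))
        self.expected = max(self.expected, seq + 1)

    def on_nack_overheard(self, seq):
        self.pending_nacks.pop(seq, None)                   # someone else already asked

    def nacks_due(self, elapsed):
        return [seq for seq, delay in self.pending_nacks.items() if delay <= elapsed]
```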

Communication in faulty processes

Let's define the atomic multicast problem: a message is delivered either to all non-faulty members of a group or to none, and the same order of messages is maintained at all receivers.

Close synchrony states that multicast messages can be considered instantaneous events and that all processes receiving them see those events in the same order; however, close synchrony cannot be achieved in the presence of failures.

Virtual synchrony is the weaker model that replaces close synchrony:

  • Only messages sent by correct processes are processed by all correct processes.
  • Crashed processes are removed from the group and must rejoin later, reconciling their state.
  • Messages sent by failing processes are either processed by all correct members or by none.
  • The guarantees on the receiving order are only for relevant messages

Virtual synchrony model takeaways:

  • It distinguishes between receiving a message and delivering it: messages are buffered by the communication layer and delivered to the application only when certain conditions are met (see the sketch after this list).
  • The concept of group view is crucial in this model: it's the set of processes that are considered part of the group at the time of sending a message.
  • Changes to the group view can be seen as another form of multicast messages (they must be consistent)
  • All multicasts must take place between view changes
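
A toy sketch of the receive/deliver distinction (illustrative, not a complete virtual-synchrony implementation): the layer stamps each message with the view in which it was sent and delivers it only while that same view is installed, so all delivered multicasts fall between the same pair of view changes.

```python
class VirtualSynchronyLayer:
    """Toy communication layer separating *receiving* from *delivering*."""

    def __init__(self, initial_view):
        self.view_id = 0
        self.view = set(initial_view)   # current group view
        self.buffer = []                # received but not yet delivered
        self.delivered = []

    def receive(self, sender, view_id, payload):
        self.buffer.append((sender, view_id, payload))
        self._try_deliver()

    def install_view(self, new_view):
        # a view change is itself agreed upon like a multicast; pending messages
        # from the old view are flushed before the new view takes effect
        self._try_deliver()
        self.view_id += 1
        self.view = set(new_view)

    def _try_deliver(self):
        still_buffered = []
        for sender, view_id, payload in self.buffer:
            if view_id == self.view_id and sender in self.view:
                self.delivered.append(payload)      # deliver to the application
            elif view_id > self.view_id:
                still_buffered.append((sender, view_id, payload))  # for a future view
            # messages stamped with an older view are dropped in this toy model
        self.buffer = still_buffered
```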

Message ordering

Virtual synchrony is complex. On top of the previously explored protocols (or analogous ones), we can retain the virtual synchrony property while applying different orderings to multicast messages:

  • Unordered multicasts
  • FIFO-ordered multicasts: FIFO ordering ensures that messages from a single sender are delivered in the order they were sent; ordering across different senders is not guaranteed without additional mechanisms (see the sketch after this list).
  • Causally-ordered multicasts
  • Totally-ordered multicasts
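
A small sketch of FIFO-ordered delivery using per-sender sequence numbers (names are illustrative): out-of-order messages are held back until the gap from that sender is filled.

```python
from collections import defaultdict

class FifoDelivery:
    """FIFO-ordered delivery: messages from the same sender are delivered in the
    order they were sent, using per-sender sequence numbers.
    (Sketch; no ordering is enforced across different senders.)
    """

    def __init__(self):
        self.next_seq = defaultdict(int)       # next expected seq per sender
        self.held_back = defaultdict(dict)     # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, message):
        self.held_back[sender][seq] = message
        # deliver as long as the next expected message from this sender is available
        while self.next_seq[sender] in self.held_back[sender]:
            msg = self.held_back[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, msg))
            self.next_seq[sender] += 1

f = FifoDelivery()
f.receive("p1", 1, "b")     # arrives out of order: held back
f.receive("p1", 0, "a")     # now both 0 and 1 can be delivered, in order
print(f.delivered)          # [('p1', 'a'), ('p1', 'b')]
```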

Checkpoints

We are discussing fault tolerance and recovery techniques. There are two types of recovery:

  • Backward recovery involves going back to a previous state if a state resulting from a crash is undesirable.
  • Forward recovery involves trying to correct errors without going back to a previous state.

To enable backward recovery, we need to save previous states that can be retrieved later. Checkpointing consists in periodically saving the state of each process, without synchronizing with the other processes (such synchronization is not practical in distributed systems). When recovering using checkpoints, it is important to ensure that rolling different processes back to states taken at different times still makes sense for the application as a whole: such a set of checkpoints is called a consistent cut, or recovery line. The main challenge is to identify the best (most recent) recovery line, i.e., the combination of one checkpoint per process that forms an overall consistent cut.
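
A hedged sketch of the consistency condition behind a recovery line: a candidate cut (one checkpoint per process) must be rejected if it contains an orphan message, i.e. a receive inside the cut whose corresponding send is outside it. The encoding of checkpoints and messages below is an assumption made for the example.

```python
def is_consistent_cut(cut, messages):
    """Check whether a candidate recovery line is a consistent cut.

    cut:      dict process -> number of checkpoints kept on that process
              (the process rolls back to the last kept checkpoint).
    messages: list of (sender, send_idx, receiver, recv_idx), where each index
              is the number of checkpoints taken on that process before the event.
    The cut is inconsistent if some message is received inside the cut but sent
    outside it (an orphan message).
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_inside = recv_idx < cut[receiver]
        sent_inside = send_idx < cut[sender]
        if received_inside and not sent_inside:
            return False      # orphan message: the receive has no matching send
    return True

# Example: P1 sent a message after its 1st checkpoint (send_idx = 1) and
# P2 received it before its 2nd checkpoint (recv_idx = 1).
msgs = [("P1", 1, "P2", 1)]
print(is_consistent_cut({"P1": 1, "P2": 2}, msgs))   # False: orphan message
print(is_consistent_cut({"P1": 2, "P2": 2}, msgs))   # True
```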

Independent checkpointing

To discover a valid recovery line, we need a way to reconstruct, from the checkpoints taken independently by each process, the best set of checkpoints forming a consistent cut.

Before diving into the two algorithms, keep in mind that $I_{i,x}$ denotes the $x$-th interval between two checkpoints on process $i$. The two algorithms are:

  • Rollback-dependency graph:
    • this algorithm starts from the "final"/"crashed" state view of the overall system;
    • each dependency between checkpoint intervals is translated into a dependency between the ending checkpoint of the sending interval and the ending checkpoint of the receiving interval ("ending to ending"): it is like shifting all inter-process messages towards the "right", aligning them with the ending checkpoints of each interval;
    • starting from the final situation, you "mark" all the checkpoints reachable from the crashed process by following the dependency relations just drawn; the recovery line is then made of the latest unmarked checkpoint of each process (a sketch of this marking is given after the list).
  • Checkpoint-dependency graph:
    • First of all, dependencies are added from the starting checkpoint of the sending interval to the ending checkpoint of the receiving interval (where the arrow arrives).
    • Now the iterative part:
      • You select the set of checkpoints nearest to the crash; this is called the "hypothesis set":
        • if there are no dependencies inside the hypothesis set, a recovery line has been found;
        • otherwise, the arrival checkpoint of a dependency lying inside the hypothesis set is discarded. A new iteration of the algorithm is then run on the new hypothesis set, which is made of the previous checkpoints apart from the discarded one (replaced by the preceding checkpoint of that process).
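
Here is a sketch of the marking step of the rollback-dependency-graph approach (the data layout and names are assumptions made for the example): mark everything reachable from the crashed processes' final states, then take, for each process, the latest unmarked checkpoint.

```python
def recovery_line(checkpoints, edges, crashed_states):
    """Rollback-dependency-graph style marking (illustrative sketch).

    checkpoints:    dict process -> ordered list of checkpoint ids.
    edges:          dict node -> list of nodes it points to (the rollback
                    dependencies, already shifted "ending to ending").
    crashed_states: the volatile final states of the crashed processes,
                    used as the roots of the reachability marking.
    Returns, for each process, the latest checkpoint that was NOT marked.
    """
    marked, stack = set(), list(crashed_states)
    while stack:                                   # mark everything reachable
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)
        stack.extend(edges.get(node, []))

    line = {}
    for proc, ckpts in checkpoints.items():
        surviving = [c for c in ckpts if c not in marked]
        line[proc] = surviving[-1] if surviving else None   # None: roll back to the start
    return line

# Example: P2 crashed; rolling back P2's volatile state also invalidates P2.2 and P1.2.
ckpts = {"P1": ["P1.1", "P1.2"], "P2": ["P2.1", "P2.2"]}
deps  = {"P2.final": ["P2.2"], "P2.2": ["P1.2"]}
print(recovery_line(ckpts, deps, ["P2.final"]))   # {'P1': 'P1.1', 'P2': 'P2.1'}
```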

These are two centralized algorithms that should lead to the same result; they are run by a single central coordinator after a distributed collection of checkpoints from the processes. How does the distributed collection work? We will see it in the Synchronization chapter, when we discuss the Chandy-Lamport distributed snapshot algorithm.