Week 07a
▪ Error: part of the system state that may lead to a failure (i.e., it differs from its
intended value)
▪ Fault: the cause of an error (results from design errors, manufacturing faults,
deterioration, or external disturbance)
▪ Recursive:
– Failure may be initiated by a mechanical fault
– Manufacturing fault leads to disk failure
– Disk failure is a fault that leads to database failure
– Database failure is a fault that leads to email service failure
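The recursive chain above can be sketched in a few lines. This is only an illustration: the `CHAIN` list and the `propagate` helper are hypothetical names, and the sketch assumes every failure at one level always becomes a fault at the next level.

```python
# Hypothetical dependency chain taken from the example above.
CHAIN = ["disk", "database", "email service"]

def propagate(initial_fault: str, chain: list[str]) -> list[str]:
    """Return the cause -> effect steps when a fault hits the first component."""
    steps = []
    cause = initial_fault
    for component in chain:
        failure = f"{component} failure"
        steps.append(f"{cause} -> {failure}")
        cause = failure  # this failure is the fault seen by the next level
    return steps

for step in propagate("manufacturing fault", CHAIN):
    print(step)
```

Each printed step shows one level's failure acting as the fault of the level above it.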
Partial Failure:
▪ One or more (but not all) components in a distributed system fail
▪ Some components affected
▪ Other components completely unaffected
▪ Considered as a fault for the whole system
If the receiver does not receive one or more of the messages sent by the
transmitter, an omission failure occurs. In wireless networks, typical causes are
collisions at the MAC layer or the receiving node moving out of range.
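As a rough sketch, omissions can be spotted by comparing what was sent with what arrived. The code assumes, purely for illustration, that the transmitter numbers its messages 0..n-1 and that an observer knows how many were sent. The function name `missing_sequence_numbers` is hypothetical.

```python
def missing_sequence_numbers(sent_count: int, received: list[int]) -> list[int]:
    """Sequence numbers in 0..sent_count-1 that never reached the receiver."""
    seen = set(received)
    return [seq for seq in range(sent_count) if seq not in seen]

# Five messages were sent, but messages 2 and 4 were lost
# (e.g., to a collision or to the receiver moving out of range).
lost = missing_sequence_numbers(5, [0, 1, 3])
print("omission failure" if lost else "no omission", lost)
```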
▪ Transient Failure
A transient failure can disturb the state of processes in an arbitrary way. The
agent inducing this problem may be only momentarily active, but it can have a
lasting effect on the global state. Examples include a power surge, a mechanical
shock, or a lightning strike.
Note that many failures, such as crash, omission, transient, and Byzantine
failures, can be caused by software bugs. For example, a poorly designed loop that
does not terminate can mimic a crash failure in the sender process. An
inadequate policy in the router software can cause packets to be dropped and
trigger an omission failure.
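The point about a non-terminating loop mimicking a crash can be illustrated from the receiver's side. The sketch assumes a hypothetical sender that emits one message per tick until something goes wrong at `stop_time`. The behaviour labels and the `messages_sent` helper are invented for illustration.

```python
def messages_sent(behaviour: str, stop_time: int, horizon: int) -> list[int]:
    """Ticks at which the hypothetical sender manages to send a message."""
    sent = []
    for tick in range(horizon):
        if tick >= stop_time and behaviour in ("crash", "stuck_in_loop"):
            # Either the process has halted, or it is spinning in a loop
            # that never reaches the send statement again.
            continue
        sent.append(tick)
    return sent

crashed = messages_sent("crash", stop_time=3, horizon=10)
looping = messages_sent("stuck_in_loop", stop_time=3, horizon=10)
print(crashed == looping)  # the receiver cannot tell the two apart
```

From the outside both senders simply go silent, which is why the software bug is indistinguishable from a crash failure in this model.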
▪ Security Failure
Viruses and other malicious software may lead to unexpected behaviour that
manifests itself as a system fault.
– In November 1988, much of the long-distance service along the East
Coast of the USA was disrupted when a construction crew accidentally
severed a major fibre optic cable in New Jersey; as a result,
3,500,000 call attempts were blocked.
Note:
▪ Safety properties specify that “something bad never happens”
– Doing nothing easily fulfils a safety property as this will never lead to a “bad”
situation
▪ Liveness properties assert that: “something good” will eventually happen [Lamport]
▪ Given a set of fault actions F, the fault span Q corresponds to the largest set of
configurations that the system can support.
▪ Masking tolerance preserves both the safety and the liveness properties of the
original system.
▪ Consider a user watching a movie: the server crashes, but the system
automatically restores the service by switching to a standby proxy server.
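The movie example can be sketched as a simple failover loop. This is a toy model under stated assumptions: the names `ServerCrashed`, `fetch_chunk`, and `stream` are hypothetical, the primary is assumed to crash at a known chunk index, and the standby is assumed never to fail.

```python
class ServerCrashed(Exception):
    """Raised when the (simulated) server does not respond."""

def fetch_chunk(server_up: bool, server: str, chunk: int) -> str:
    if not server_up:
        raise ServerCrashed(server)
    return f"chunk {chunk} from {server}"

def stream(chunks: int, primary_crashes_at: int) -> list[str]:
    """Deliver every chunk, switching to the standby when the primary crashes."""
    delivered = []
    active = "primary"
    for chunk in range(chunks):
        up = not (active == "primary" and chunk >= primary_crashes_at)
        try:
            delivered.append(fetch_chunk(up, active, chunk))
        except ServerCrashed:
            active = "standby"  # fail over, then retry the same chunk
            delivered.append(fetch_chunk(True, active, chunk))
    return delivered

print(stream(chunks=4, primary_crashes_at=2))
```

The viewer still receives every chunk in order, so the fault is masked: nothing bad is delivered (safety) and playback keeps progressing (liveness).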
As an example,
▪ At a four-way traffic crossing, if the lights are green in both directions then a
collision is possible. However, if the lights are red in both directions, at worst
traffic will stall, but there will be no catastrophic side effect.
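The traffic-light example can be written as a safety check over all light configurations. This is a minimal sketch: the direction names `ns` and `ew` and the helpers `is_safe` and `makes_progress` are hypothetical, and `makes_progress` is only a one-state stand-in for a real liveness property, which is a statement about whole executions.

```python
from itertools import product

COLOURS = ("red", "green")

def is_safe(ns: str, ew: str) -> bool:
    """Safety: the two crossing directions are never green together."""
    return not (ns == "green" and ew == "green")

def makes_progress(ns: str, ew: str) -> bool:
    """Stand-in for liveness: at least one direction is allowed to move."""
    return ns == "green" or ew == "green"

unsafe = [cfg for cfg in product(COLOURS, repeat=2) if not is_safe(*cfg)]
print("unsafe configurations:", unsafe)
print("all red is safe:", is_safe("red", "red"))
print("all red makes progress:", makes_progress("red", "red"))
```

The all-red configuration passes the safety check but makes no progress, which matches the remark that doing nothing trivially satisfies a safety property.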
Some examples
– Validity. If every (non-faulty) process begins with the same initial value
v, their final decision must be v.
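A minimal sketch of the validity condition as a check on one finished run. It assumes the lists contain only the non-faulty processes, and `satisfies_validity` is a hypothetical helper name.

```python
def satisfies_validity(initial_values: list[int], decisions: list[int]) -> bool:
    """Check validity for the non-faulty processes of one run."""
    if len(set(initial_values)) == 1:
        v = initial_values[0]
        return all(d == v for d in decisions)
    # Validity, as stated above, places no constraint when the inputs differ.
    return True

print(satisfies_validity([1, 1, 1], [1, 1, 1]))  # all start with 1, all decide 1
print(satisfies_validity([1, 1, 1], [0, 0, 0]))  # all start with 1, but decide 0
```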
▪ On the other hand, a state from which only one decision value can be reached
is called a univalent state. Univalent states can be either 0-valent or 1-valent.
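The valency of a state can be illustrated on a small, hand-made execution graph. Everything here is an assumption for illustration: the graph in `SUCCESSORS`, the state names, and the helpers `reachable_decisions` and `valency` are hypothetical, and the graph is assumed to be acyclic.

```python
# Hypothetical execution graph: state -> states reachable in one step.
SUCCESSORS = {"s0": ["s1", "s2"], "s1": ["d0"], "s2": ["d1"]}
# Terminal states and the value decided in each of them.
DECISION = {"d0": 0, "d1": 1}

def reachable_decisions(state: str) -> set[int]:
    """Collect every decision value reachable from `state`."""
    if state in DECISION:
        return {DECISION[state]}
    values: set[int] = set()
    for nxt in SUCCESSORS.get(state, []):
        values |= reachable_decisions(nxt)
    return values

def valency(state: str) -> str:
    values = reachable_decisions(state)
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    if values == {0, 1}:
        return "bivalent"
    raise ValueError(f"no decision reachable from {state}")

for s in ("s0", "s1", "s2"):
    print(s, valency(s))
```

Here `s1` and `s2` are univalent because only one decision remains reachable from each, while `s0` can still lead to either value.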