Possible Types of Failure

Possible types of failure in a distributed system include: 1. Site failures, which can cause loss of volatile, non-volatile, or stable storage. 2. Communication failures, such as lost messages or network partitions that divide the network into disconnected subnetworks. The slides also describe the communication structures (centralized, hierarchical, linear, distributed) that can be used between sites.

Uploaded by

Amandeep Singh
Copyright
© Attribution Non-Commercial (BY-NC)

Possible types of failure

1. Site failures
2. Communication failures
   1. Lost messages
   2. Network partitions
1. Site failures are failures that can occur at an individual
site. They are classified as follows:
a. Failure without loss of information: all the information
stored in memory is available for recovery, e.g. a division
by zero error.
b. Failure with loss of volatile storage: the content of
main memory is lost, but the information recorded on disk is
not affected, e.g. a system crash.
c. Failure with loss of non-volatile storage: the content of
disk storage is also lost, e.g. a head crash.
d. Failure with loss of stable storage: some information stored
in stable storage is lost because of several simultaneous
failures of the third type.
• Failures can also occur in the communication between sites.
• When a message is sent from site X to site Y, we require the
following behaviour from the communication network:
1. X receives a positive acknowledgement after a delay that is less
than some maximum delay DMAX.
2. The message is delivered at Y in proper sequence with respect to
other X-Y messages.
3. The message is correct.

• If site X has not received an acknowledgement after a delay of DMAX,
then either the message or the acknowledgement has been lost, and X
cannot tell which.
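The timeout behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Link` class, its `lose_message` and `lose_ack` flags, and the value of `D_MAX` are hypothetical names introduced for illustration, not part of any real network library.

```python
from dataclasses import dataclass, field
from typing import List, Optional

D_MAX = 1.0  # hypothetical maximum acknowledgement delay, in seconds


@dataclass
class Link:
    """Hypothetical model of the network path between site X and site Y."""
    lose_message: bool = False   # the message never reaches Y
    lose_ack: bool = False       # Y receives the message, but its ack is lost
    round_trip: float = 0.1      # delay until the ack reaches X
    delivered: List[str] = field(default_factory=list)  # what Y received

    def send(self, message: str) -> Optional[float]:
        """Return the delay after which X gets the ack, or None if it never does."""
        if self.lose_message:
            return None
        self.delivered.append(message)
        if self.lose_ack:
            return None
        return self.round_trip


def send_from_x(link: Link, message: str, d_max: float = D_MAX) -> str:
    """What site X observes: an ack within d_max, or a timeout."""
    ack_delay = link.send(message)
    if ack_delay is not None and ack_delay < d_max:
        return "acknowledged"
    return "timeout"


ok, lost_msg, lost_ack = Link(), Link(lose_message=True), Link(lose_ack=True)
print(send_from_x(ok, "m"), ok.delivered)
print(send_from_x(lost_msg, "m"), lost_msg.delivered)
print(send_from_x(lost_ack, "m"), lost_ack.delivered)
```

In the last two cases X observes the same timeout, although Y received the message in only one of them. X cannot distinguish the two situations from the timeout alone.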
• There can also be a network partition:
the network is partitioned into two or more
completely disconnected subnetworks, one
including X and one including Y. All the operational
sites that belong to the same subnetwork can
communicate with each other; however, they
cannot communicate with sites that belong to
a different subnetwork until the partition is
repaired.
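A partition can be pictured as the connected components of the graph of sites and working links. The sketch below assumes a hypothetical representation (site names as strings, working links as pairs) and is meant only to illustrate the definition.

```python
from typing import Dict, Iterable, List, Set, Tuple


def subnetworks(sites: Iterable[str],
                working_links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group sites into subnetworks that are connected by working links."""
    adjacency: Dict[str, Set[str]] = {site: set() for site in sites}
    for a, b in working_links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        group: Set[str] = set()
        stack = [start]
        while stack:  # depth-first walk over working links
            site = stack.pop()
            if site in group:
                continue
            group.add(site)
            stack.extend(adjacency[site] - group)
        seen |= group
        groups.append(group)
    return groups


def can_communicate(a: str, b: str, groups: List[Set[str]]) -> bool:
    """Two sites can communicate only if they are in the same subnetwork."""
    return any(a in group and b in group for group in groups)


# Hypothetical example: the link between the two halves has failed.
groups = subnetworks(["x", "y", "z", "w"], [("x", "z"), ("y", "w")])
print(groups)
print(can_communicate("x", "z", groups), can_communicate("x", "y", groups))
```

Here x and z remain in one subnetwork and y and w in another, so x can still reach z but not y until the failed link is repaired.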
Different communication structures
for two-phase commit (2PC)
• 1. Centralized communication structure:
• Communication is always performed
between the coordinator DTM-agent and
the participants, but never between
participants directly.
2. Hierarchical communication structure:
The coordinator is the DTM-agent at the root of the
tree. Communication between the coordinator
and the participants is performed not by direct
broadcast, but by propagating the messages up
and down the tree. Each DTM-agent that is an
internal node of the communication tree collects
messages from its sons and broadcasts messages
to them.
3. Linear communication structure:
In a linear protocol an ordering of the sites is
defined, so that each site except the first
and the last one has a predecessor and a
successor. Instead of broadcasting a
message from the coordinator to all other
participants, the message is passed from
each participant to its successor.
• 4. Distributed communication structure:
• It requires that each DTM-agent communicate with
every other participant. The number of messages
needed by a distributed protocol is
much greater than the number of messages
required by a centralized or hierarchical
structure. These protocols are suitable for
networks in which messages are cheap, such as local networks.
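To make the four structures concrete, the sketch below lists who sends to whom in a single round in which information is spread among the sites. It is an illustration under stated assumptions, not a full commit protocol: the function names, the site names, and the example tree are hypothetical, and only one one-way round is counted.

```python
from typing import Dict, List, Sequence, Tuple

Message = Tuple[str, str]  # (sender, receiver)


def centralized(coordinator: str, participants: Sequence[str]) -> List[Message]:
    """The coordinator talks to every participant directly."""
    return [(coordinator, p) for p in participants]


def hierarchical(tree: Dict[str, List[str]], root: str) -> List[Message]:
    """Each node forwards the message to its sons, down the tree."""
    messages: List[Message] = []
    stack = [root]
    while stack:
        node = stack.pop()
        for son in tree.get(node, []):
            messages.append((node, son))
            stack.append(son)
    return messages


def linear(ordered_sites: Sequence[str]) -> List[Message]:
    """Each site passes the message to its successor in the ordering."""
    return list(zip(ordered_sites, ordered_sites[1:]))


def distributed(sites: Sequence[str]) -> List[Message]:
    """Every site sends to every other site."""
    return [(a, b) for a in sites for b in sites if a != b]


sites = ["c", "p1", "p2", "p3"]  # "c" plays the coordinator
print(len(centralized("c", sites[1:])))                          # 3
print(len(hierarchical({"c": ["p1", "p2"], "p1": ["p3"]}, "c")))  # 3
print(len(linear(sites)))                                        # 3
print(len(distributed(sites)))                                   # 12
```

With four sites, the first three structures need 3 messages for the round, while the distributed structure needs 12. The gap grows quadratically with the number of sites, which is why the distributed structure fits networks where messages are cheap.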
Checkpointing reduces the
overhead of log-based recovery
• When a failure with loss of volatile storage occurs,
a recovery procedure reads the log file and
performs the following operations:
1. Determine all non-committed transactions that have
to be undone, i.e. those that have a begin_transaction
record in the log file without a commit or
abort record.
2. Determine all transactions that need to be redone.
In principle these are all transactions that have a commit
record in the log file. To distinguish transactions that need
to be redone from those that do not, checkpoints
are used.
3. Undo the transactions determined at step 1
and redo the transactions determined at
step 2.
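The first two steps, performed without checkpoints, amount to one scan over the whole log. The sketch below assumes a hypothetical log format, a list of `(record_type, transaction_id)` pairs; a real log would also carry the old and new values needed to undo and redo.

```python
from typing import Iterable, Set, Tuple

LogRecord = Tuple[str, str]  # (record type, transaction id), hypothetical format


def recovery_sets(log: Iterable[LogRecord]) -> Tuple[Set[str], Set[str]]:
    """Scan the whole log and return (undo set, redo set)."""
    undo: Set[str] = set()
    redo: Set[str] = set()
    for kind, txn in log:
        if kind == "begin_transaction":
            undo.add(txn)       # not yet known to have committed
        elif kind == "commit":
            undo.discard(txn)
            redo.add(txn)       # committed: must be redone
        elif kind == "abort":
            undo.discard(txn)   # has an abort record: not in the undo set
    return undo, redo


log = [
    ("begin_transaction", "T1"),
    ("begin_transaction", "T2"),
    ("commit", "T1"),
    ("begin_transaction", "T3"),
    ("abort", "T3"),
]
undo_set, redo_set = recovery_sets(log)
print(sorted(undo_set), sorted(redo_set))  # ['T2'] ['T1']
```

Every committed transaction since the start of the log ends up in the redo set, which is the overhead that checkpoints are meant to reduce.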
• Checkpoints are operations that are
performed periodically in order to simplify the
first two steps of the recovery
procedure. Performing a checkpoint
requires the following operations:
• Writing to stable storage all log records
and all database updates that are still in
volatile storage.
• Writing to stable storage a checkpoint
record, which indicates the transactions
that are active at the time the
checkpoint is done.
• Steps 1 and 2 of the recovery procedure
are now replaced by the following:
• Find and read the last checkpoint record.
• Put all transactions written in the checkpoint
record into the undo set, which contains
the transactions to be undone.
• Read the log file starting from the checkpoint
record until its end.
• If a begin_transaction record is found, the transaction
is put into the undo set; if a commit record is found, it is put
into the redo set.
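The checkpoint-based procedure can be sketched in the same style. The log format is hypothetical: `(record_type, value)` pairs, where a checkpoint record carries a tuple of the transactions active at checkpoint time. Two details are assumptions beyond what the slides state: a committed transaction is moved out of the undo set when it enters the redo set, and an aborted transaction is dropped from the undo set.

```python
from typing import List, Set, Tuple, Union

LogRecord = Tuple[str, Union[str, Tuple[str, ...]]]  # hypothetical format


def recovery_sets_from_checkpoint(log: List[LogRecord]) -> Tuple[Set[str], Set[str]]:
    """Return (undo set, redo set) by reading only from the last checkpoint."""
    # Find the position of the last checkpoint record, if any.
    last = max((i for i, (kind, _) in enumerate(log) if kind == "checkpoint"),
               default=None)
    if last is None:
        undo: Set[str] = set()
        start = 0
    else:
        undo = set(log[last][1])  # transactions active at the checkpoint
        start = last + 1

    redo: Set[str] = set()
    for kind, txn in log[start:]:
        if kind == "begin_transaction":
            undo.add(txn)
        elif kind == "commit":
            undo.discard(txn)   # assumption: move from undo set to redo set
            redo.add(txn)
        elif kind == "abort":
            undo.discard(txn)   # assumption: an aborted transaction leaves the undo set
    return undo, redo


log = [
    ("begin_transaction", "T1"),
    ("commit", "T1"),
    ("begin_transaction", "T2"),
    ("begin_transaction", "T3"),
    ("checkpoint", ("T2", "T3")),
    ("commit", "T2"),
    ("begin_transaction", "T4"),
]
undo_set, redo_set = recovery_sets_from_checkpoint(log)
print(sorted(undo_set), sorted(redo_set))  # ['T3', 'T4'] ['T2']
```

T1 committed before the checkpoint, so its updates were already written to stable storage and it appears in neither set. Only the part of the log after the checkpoint is read.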
Difference between availability and
reliability
• One of the advantages of a distributed
database is increased reliability and
availability.
• Reliability is defined as the probability that
the system is running (not down) at a
certain time point.
• Availability is the probability that the system
is continuously available during a time
interval.
• Increased reliability and availability give the system
a graceful degradation property. When the
data and DBMS software are distributed over
several sites, one site may fail while
the other sites continue to operate. Only the
data and software of the failed site cannot be
accessed. Further improvement is
achieved by judiciously replicating data
and software at more than one site.
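The benefit of replication can be illustrated with a short calculation. The sketch assumes that sites fail independently and that each site is up with the same probability; both are simplifying assumptions, and the function name and the figure 0.9 are hypothetical.

```python
def accessible_probability(site_up_probability: float, replicas: int) -> float:
    """Probability that at least one of the replica sites is up.

    Assumes independent site failures with the same up-probability per site.
    """
    if not 0.0 <= site_up_probability <= 1.0:
        raise ValueError("site_up_probability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The item is inaccessible only if every site holding a copy is down.
    return 1.0 - (1.0 - site_up_probability) ** replicas


for copies in (1, 2, 3):
    print(copies, accessible_probability(0.9, copies))
```

Under these assumptions, a data item stored at one site that is up 90% of the time is accessible with probability 0.9, while three copies raise this to about 0.999.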
