DC Techmax Everything You Need To Prepare For Distributed Computing Subject in The Final
Syllabus
Layered Protocols, Interprocess Communication (IPC) : MPI, Remote Procedure Call (RPC), Remote Object Invocation, Remote Method Invocation (RMI), Message Oriented Communication, Stream Oriented Communication, Group Communication.
In a distributed system, processes communicate with each other by sending messages, so it is necessary to study the ways in which processes exchange data and information.
(Figure: the OSI reference model - Application, Presentation, Session, Transport, Network, Data link and Physical layers; each layer talks to its peer through a layer protocol, and adjacent layers are separated by an interface.)
- In this model a total of seven layers are present. Each layer provides services to the layer above it. An interface is present between each pair of adjacent layers; the interface defines the set of operations through which the services are provided to the users.
- The application running on the sender machine is considered to be at the application layer. It builds the message and hands it over to the application layer on its machine. The application layer adds its header at the front of this message and passes it to the presentation layer, which adds its own header to the message. The message is then passed down to the session layer, which in turn passes it to the transport layer after adding its header, and so on.
- Data Link Layer (DLL) : This layer deals with issues such as:
o Framing : To detect the start and end of a received frame.
o Flow Control : This mechanism controls the data transfer rate of the sender to avoid overloading the receiver. The DLL implements protocols that carry feedback from the receiver to the sender about slowing down the data transfer rate.
o Error Control : The DLL deals with detection and correction of errors in received data by using mechanisms such as checksum calculation, CRC etc.
- Network Layer : This layer mainly deals with routing. Messages travel through many routers to reach the destination machine, and these routers calculate the shortest path to forward a received message towards its destination. The Internet Protocol (IP) is most widely used today and it is connectionless; packets are routed along different paths independently of each other. Connection-oriented communication is also used, for example the virtual channel in ATM networks.
- Transport Layer : This layer is very important for constructing network applications. It offers the services that are not implemented at the interface of the network layer. This layer breaks messages coming from the upper layer into pieces suitable for transmission, assigns them sequence numbers, and handles congestion control, fragmentation of packets etc. It deals with end-to-end process delivery. Reliable transport service can be offered or implemented on top of connection-oriented or connectionless network services.
- At this layer, the Transmission Control Protocol (TCP) works with IP in most network communication today. TCP is connection-oriented, while the User Datagram Protocol (UDP) is the connectionless protocol at this layer.
- Session and presentation layers : The session layer deals with dialog control; it keeps track of which parties are currently talking and provides checkpointing to recover from a crash, so that work can resume from the last checkpoint instead of from the beginning. The presentation layer deals with the syntax and meaning of the transmitted information; it helps machines with different data representations to communicate with each other.
- Application Layer : Initially, the OSI application layer was intended for applications such as electronic mail, file transfer and terminal emulation. Today, all running applications are considered to be in the application layer, and application-specific protocols such as HTTP as well as general-purpose protocols such as FTP are placed in this layer.
- Synchronous Communication : In this type of communication, the client application waits after sending a request until the reply to that request comes back from the server application. Examples are Remote Procedure Call (RPC) and Remote Method Invocation (RMI).
- Sockets are considered insufficient to deal with communication among high-performance multicomputers. Highly efficient applications cannot be implemented with sockets if communication is to incur minimal cost, because sockets are implemented with a general-purpose protocol stack such as TCP/IP, which does not suit high-performance multicomputers. Sockets also support only simple primitives such as send and receive.
- Sockets are not suitable for the proprietary protocols developed for high-performance interconnection networks, for example those used in clusters of workstations (COWs) and massively parallel processors (MPPs).
- Therefore such interconnection networks use proprietary communication libraries which provide high-level and efficient communication primitives. The Message Passing Interface (MPI) is a specification for users and programmers of message-passing libraries. MPI was basically developed to support parallel applications and is adapted to transient communication, where sender and receiver should be active during communication.
- MPI assumes that process crashes and network failures are fatal and do not require recovery action. MPI also assumes that a known group of processes establishes communication among its members. To identify a process group a group ID is used, and for a process a process ID is used; the pair (groupID, processID) uniquely identifies the sender and the receiver of a message. This pair of identifiers replaces the use of transport-level addresses.
Step 1 : The client calls the client stub. As the client stub is on the client machine, this call is a local procedure call.
Step 2 : The client stub packs the parameters into a message and issues a system call to send the message. This process is called marshalling.
Step 3 : The kernel sends the message from the client machine to the server machine.
Step 4 : The kernel on the server machine hands the incoming packet to the server stub.
Step 5 : The server stub calls the server procedure.
(Figure: the client and client stub run on the client CPU and the server and server stub on the server CPU; the two operating systems exchange the request and reply messages over the network.)
Fig. 2.3.1 : Steps in RPC
The server then hands over the result to the server stub, which packs it into a message. The kernel then sends the message to the client machine, where the kernel at the client hands it over to the client stub. The client stub unpacks the message and hands over the result to the client.
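The path through the stubs can be sketched in a few lines of Python. The helper names below (server_loop, client_stub) are invented for illustration and ignore framing and error handling; they simply show parameters being marshalled into a message, shipped through the kernel's socket interface, and unmarshalled on the other side.

```python
import pickle, socket

def add(a, b):                                    # the actual server procedure
    return a + b

def server_loop(port=9000):
    srv = socket.socket()
    srv.bind(("", port)); srv.listen(1)
    conn, _ = srv.accept()
    name, args = pickle.loads(conn.recv(4096))    # server stub: unmarshal the request
    result = {"add": add}[name](*args)            # call the requested procedure
    conn.sendall(pickle.dumps(result))            # marshal the result and send it back
    conn.close(); srv.close()

def client_stub(name, *args, port=9000):
    s = socket.create_connection(("localhost", port))
    s.sendall(pickle.dumps((name, args)))         # client stub: marshal the parameters
    result = pickle.loads(s.recv(4096))           # unmarshal the reply
    s.close()
    return result
```

With server_loop running in another process, client_stub("add", 2, 3) returns 5; the caller never sees the message exchange, which is exactly what the stubs are meant to hide.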
- Asynchronous RPC allows the client to continue its work after calling a procedure at the server; the client immediately carries on with its useful work after sending the request, without waiting for the reply. The server sends an acknowledgement to the client the moment the request is received and then immediately calls the requested procedure. After receiving the acknowledgement, the client continues its work without further blocking; the client therefore waits only until the acknowledgement of receipt of the request arrives from the server.
- In one-way RPC, the client immediately continues its work without even waiting for an acknowledgement from the server for receipt of the sent request.
- The DCE RPC system can fragment and reassemble messages, handle transport in both directions, and convert data types between client and server if the machines on which they run have different architectures and hence different data representations.
- Clients and servers can be independent of one another. Implementations of client and server in different languages are supported, and the OS and hardware platform can differ between the client and server applications. The RPC system hides all these dissimilarities.
- The DCE RPC system contains several components such as libraries, languages, daemons, utility programs etc. Interface definitions are specified in the Interface Definition Language (IDL). IDL files contain type definitions, constant declarations, function prototypes, and information to pack the parameters and unpack the results. Interface definitions contain the syntax of a call, not its semantics.
- Each IDL file contains a globally unique identifier for the specified interface. The client sends this identifier to the server in the first RPC message; the server then checks its correctness for binding purposes, otherwise it reports an error.
- First, the uuidgen program is called to generate a prototype IDL file containing a unique interface identifier that has never been generated before by this program. This identifier is a 128-bit binary number represented as an ASCII hexadecimal string; it encodes the location and time of creation to guarantee uniqueness.
- The next step is to edit this IDL file and write the remote procedure names and parameters in it. The IDL file is then compiled using the IDL compiler. After compilation, three files are generated: a header file, a client stub and a server stub.
- The header file contains type definitions, constant declarations, function prototypes and the unique identifier. This file is included (#include) in the client and server code. The client stub holds the procedures which the client will call on the server.
- These procedures collect and marshal the parameters and convert them into an outgoing message; the stub then calls the runtime system to send this message. The client stub is also responsible for unmarshaling the reply from the server and delivering it to the client application. The server stub on the server machine contains the procedures which are called by the runtime system there when a message from the client side arrives; these in turn call the actual server procedures.
- Finally, the server and client code is written. The client code and client stub are compiled into object files which are linked with the runtime library to produce the executable binary for the client. Similarly, on the server machine, the server code and server stub are compiled into object files which are linked with the runtime library to produce the executable binary for the server.
- The client should be able to call the server and the server should accept the client's call. For this purpose, registration of the server is necessary: the client first locates the server machine and then the server process on that machine.
- Port numbers are used by the OS on the server machine to differentiate incoming messages for different processes. The DCE daemon maintains a table of (server, port number) pairs. The server first asks the OS for a port number and registers this endpoint with the DCE daemon. The server also registers with the directory service, providing it the network address of the server machine and the name of the server. Binding a client to a server is carried out as shown in Fig. 2.3.2.
(Figure: the server registers its service with the directory service on the directory machine and its port number (endpoint) with the DCE daemon on the server machine; the client looks up the server in the directory, asks the DCE daemon for the port number, and then performs the RPC.)
Fig. 2.3.2 : Binding a client to a server in DCE
The client passes the name of the server to the directory server, which returns the network address of the server machine. The client then contacts the DCE daemon on that machine to get the port number of the server running there. Once the client knows both the network address and the port number, the RPC takes place.
(Figure: the client invokes a method on the proxy in its own address space; the proxy communicates over the network with the skeleton and the object state on the server.)
Fig. 2.4.1 : Organization of a remote object with proxy at the client side
- We can keep an object's interface on one machine and the object itself on another machine. This organization is referred to as a distributed object. The implementation of the object's interface is called a proxy; it is similar to a client stub in RPC and resides in the client's address space.
- The proxy marshals the client's method invocation into a message and unmarshals the reply messages which contain the result of the method invocation. The proxy then returns this result to the client. Similar to the server stub in RPC, a skeleton is present on the server machine.
- The skeleton unmarshals incoming method invocation requests into proper method invocations at the object's interface in the server application. It also marshals the reply messages and forwards them to the client-side proxy. The object remains on a single machine and only its interfaces are made available on other machines; such an object is also called a remote object.
- Language-level objects are called compile-time objects. These objects are supported by object-oriented languages such as Java and C++. An object is an instance of a class, and compile-time objects make it easy to build distributed applications in a distributed system.
- In Java, objects are defined by means of a class and the interfaces that the class implements. After compiling the interfaces we get a server stub and a client stub, and the code generated after compiling the class definition permits instantiating the Java objects; hence Java objects can be invoked from a remote machine. The disadvantage of compile-time objects is their dependency on a particular programming language.
- Runtime objects are independent of the programming language and are explicitly constructed during run time. Many distributed object-based systems use this approach and allow construction of applications from objects written in multiple languages. The implementation of runtime objects is basically left open.
- An object adapter is used as a wrapper around an implementation so that it appears to be an object and its methods can be invoked from a remote machine. Here, objects are defined in terms of the interfaces they implement. The implementation of an interface is then registered with the object adapter, which makes the interface available for remote invocation.
Persistent and Transient Objects
- A persistent object continues to exist even after the server exits. The server currently managing a persistent object stores the object's state on secondary storage before it exits; if a newly started server needs this object, it reads the object's state from secondary storage.
- In contrast, a transient object exists only as long as the server that manages it exists. Once the server exits, the transient object also ceases to exist.
Object References
- A client invokes the methods of a remote object by binding to it with an object reference, which must contain sufficient information to permit this binding. An object reference includes the network address of the machine on which the object is placed, plus the port number (endpoint) of the server assigned by its local operating system.
- The limitation of this approach is that when the server crashes, it will be assigned a different port number by the OS after restarting, and all existing object references become invalid. This problem can be solved by running a local daemon per machine that keeps track of endpoint assignments in an endpoint table. In this case a server ID encoded in the object reference is used as an index into the endpoint table; while binding a client to an object, the daemon process is asked for the server's current port number. The server should always register with the local daemon.
- The problem with encoding the network address in an object reference is that if the server moves to another machine, then again all the object references become invalid. This can be solved by keeping a location server which tracks the machine on which the object server is currently running; in this case the object reference contains the address of the location server and a system-wide identifier for the server.
- An object reference may also include more information, such as identification of the protocol used to bind with the object and the protocols supported by the object server, for example TCP or UDP. An object reference may further contain an implementation handle which refers to a complete implementation of the proxy that the client can dynamically load when binding to the object.
Parameter Passing
- Objects can be accessed from remote machines. Object references can be used as parameters to a method invocation, and these references are passed by value; as a result, object references can be copied from one machine to another. If a process holds an object reference, it can bind to the object whenever required.
- It is not efficient to use only distributed or remote objects; some objects, such as integers and Booleans, are small. Therefore it is important to distinguish references to local objects from references to remote objects. For a remote method invocation, if an object is a remote object its reference is passed as the parameter; the reference is copied and passed as a value parameter. If the object is in the same address space as the client, the entire object is passed along with the method invocation, that is, the object is passed by value.
- As an example, suppose client code is running on Machine 1 holding a reference to local object O1, server code is running on Machine 2, and remote object O2 resides on Machine 3. Suppose the client on Machine 1 calls the server with object O1 as a parameter; the client also holds a reference to remote object O2 on Machine 3 and uses O2 as another parameter of the same call. In this invocation, a copy of object O1 and a copy of the reference to O2 are passed as parameters to the server on Machine 2. Hence, local objects are passed by value and remote objects are passed by reference.
- In Java, a method can be declared as synchronized. If two processes try to call a synchronized method at the same time, only one process is permitted to proceed and the other is blocked. In this manner, access to the object's internal data is entirely serialized.
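As a rough analogue of this behaviour (sketched in Python rather than Java, purely for illustration), a synchronized method corresponds to guarding the object's state with a per-object lock:

```python
import threading

class Counter:
    def __init__(self):
        self._lock = threading.Lock()   # plays the role of the object's monitor
        self._value = 0

    def increment(self):                # the "synchronized" method
        with self._lock:                # only one caller proceeds; others block here
            self._value += 1
            return self._value
```

Two threads calling increment() concurrently are serialized just as two callers of a synchronized Java method would be; the question discussed next is where this blocking should happen when the callers sit on different machines.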
- This blocking of a process on a synchronized method can happen at the client side or at the server side. At the client side, the client is blocked in the client-side stub which implements the object interface. As clients on other machines are blocked at their own client side, synchronization among them is required, which is complex to implement.
- Server-side blocking of the client process is also possible, but if the client crashes while the server is executing its invocation, a protocol is needed to handle the situation. In Java RMI, blocking on a remote object is therefore restricted to the proxies.
- In Java, any primitive or object type can be passed as a parameter to an RMI if it can be marshaled; such objects are said to be serializable. Most object types are serializable, except platform-dependent objects. In RMI, local objects are passed by value and remote objects are passed by reference.
- In Java RMI, an object reference includes the network address, the port number of the server and an object identifier. A server class and a client class are the two classes used to build a remote object. The server class contains the implementation of the remote object that runs on the server: a description of the object state and the implementation of the methods that work on this state. The skeleton is generated from the interface specification of the object.
- The client class contains the implementation of the client-side code, i.e. the proxy, and is also generated from the interface specification of the object. The proxy builds the method invocation into a message which it sends to the server, and converts the reply from the server into the result of the method invocation. The proxy stores the network address of the server machine, the port number and the object identifier.
- In Java, proxies are serializable, so a proxy can be passed to a remote process inside a message. That remote process can then use the proxy for method invocation on the remote object; the proxy thus acts as a reference to the remote object, is treated as a local object and can be passed as a parameter in an RMI.
- As the size of a marshaled proxy can be large, an implementation handle is generated and used to download the classes needed to construct the proxy; the implementation handle thus replaces marshaled code as part of the remote object reference. Since a Java virtual machine is present on every machine, the marshaled proxy can be executed after unmarshaling it on the remote machine.
In Remote Procedure Call (RPC) and Remote Method Invocation (RMI), communication is inherently synchronous: the client blocks till the reply from the server arrives. For communication between client and server, both the sending and receiving sides should be executing; the messaging system assumes that the applications on both sides are running while communication is going on. A message-queuing system, in contrast, permits processes to communicate even though the receiving side is not running at the time the communication is initiated.
- A buffer is available at each host and at each communication server. Consider an email system with such a configuration of hosts and communication servers. Here, a user agent program runs on a host, using which users can read, compose, send or receive messages. The user agent at a host submits a message to its host for transmission; the host then forwards the message to its local mail server. This mail server stores the received message in an output buffer and looks up the transport-level address of the destination mail server. It then establishes a connection with the destination mail server and forwards the message, removing it from the output buffer.
- At the destination mail server, the message is put into an input buffer in order to deliver it to the designated receiver. The user agent at the receiving host can check for incoming messages on a regular basis by using the services of the interface available there. In this case, messages are buffered at the communication servers (the mail servers).
- A message sent by a source host is buffered by a communication server as long as it has not been successfully delivered to the next communication server. Hence, after submitting a message for transmission, the sending application need not continue execution, as the message is stored by the communication server. It is also not necessary for the receiving application to be executing when the message was submitted. This form of communication is called persistent communication; email is an example of persistent communication.
- In transient communication, the communication system (the communication server in the above discussion) stores the message only while both the sender and receiver applications are executing; otherwise the message is discarded. In the example discussed above, if a communication server is not able to deliver the message to the next communication server or to the receiver, the message is discarded. Typically, a store-and-forward router plays the role of such a communication server in transient communication.
(Figure: Persistent asynchronous communication - A sends a message and continues; the message is stored while B is not running; B starts later and receives the message.)
Fig. 2.5.1 : Persistent asynchronous communication
- Persistent Synchronous Communication : In this communication the sender is blocked until its message is stored in the buffer at the receiver's side. The receiving application need not be running at the time the message is stored. In Fig. 2.5.2, process A sends a message and is blocked till the message is stored in the buffer at B's location; process B starts running some time later and receives the message.
(Figure: A sends a message and waits until it is accepted; the message is stored at B's location for later delivery; B, which was not running, starts later and receives the message.)
Fig. 2.5.2 : Persistent synchronous communication
- Transient Asynchronous Communication : In this type of communication, the sender immediately continues its execution after sending the message. The message may remain in a local buffer at the sender's host, after which the communication system routes it to the destination host. If the receiver process is not running at the time the message arrives, the transmission fails. Examples are transport-level datagram services such as UDP and one-way RPC. In Fig. 2.5.3, process A sends a message and continues its execution; at the receiving host, B receives the message only if it is executing when the message arrives.
(Figure: A sends a message and continues; the message can be delivered only if B is running; B receives the message.)
Fig. 2.5.3 : Transient asynchronous communication
- Receipt-Based Transient Synchronous Communication : In this type of communication, the sender is blocked until the message is stored in the local buffer of the receiving host. The sender continues its execution after it receives an acknowledgement of receipt of the message. It is shown in Fig. 2.5.4.
(Figure: A sends a request and waits until the request is received at B's host; B is running but doing something else and processes the request later; A continues as soon as receipt is acknowledged.)
Fig. 2.5.4 : Receipt-based transient synchronous communication
- Delivery-Based Transient Synchronous Communication : In this type of communication, the sender is blocked until the message is delivered to the target process for processing. It is shown in Fig. 2.5.5. Asynchronous RPC is an example of this type of communication.
(Figure: A sends a request and waits until it is accepted; B is running but doing something else and accepts the request when it starts processing it; A then continues.)
Fig. 2.5.5 : Delivery-based transient synchronous communication
- Response-Based Transient Synchronous Communication : In this type of communication, the sender is blocked until the response is received from the receiver process. It is shown in Fig. 2.5.6. RPC and RMI stick to this type of communication.
(Figure: A sends a request and waits for the reply; B is running but doing something else, processes the request and sends the reply, after which A continues.)
Fig. 2.5.6 : Response-based transient synchronous communication
- In a distributed system, all of these communication types are needed, as per the requirements of the applications to be developed.
2.5.3 Message-Oriented Transient Communication
- Many applications and distributed systems are built directly on top of the simple message-oriented model offered by the transport level, for example the transport-level socket interface.
- The transport layer offers programmers the use of all its messaging protocols through a simple set of primitives. This is possible due to the focus on standardizing the interfaces; standard interfaces also make it easy to port an application to a different computer. The socket interface was introduced in Berkeley UNIX.
- A socket is a communication endpoint to which an application writes data to be sent to applications running on other machines in the network; the application also reads incoming data from this communication endpoint. Following are the socket primitives for TCP.
- Socket : This primitive is executed by the server to create a new communication endpoint for a particular transport protocol. The local operating system internally reserves resources for incoming and outgoing messages for the specified protocol. The primitive is also executed by the client to create a socket at the client side, but there, binding a local address to the socket is not required, as the operating system dynamically allocates a port when the connection is set up.
- Bind : This primitive is executed by the server to bind the IP address of the machine and a port number (possibly a well-known port) to the created socket. The binding informs the OS that the server will receive messages only on the specified IP address and port number.
- Listen : The server calls this primitive only in connection-oriented communication. The call permits the operating system to reserve sufficient buffers for the maximum number of connections that the caller is willing to accept. It is a nonblocking call.
- Accept : This primitive is executed by the server, and the call blocks the caller until a connection request arrives. The local operating system then creates a new socket with the same properties as the original one and returns it to the caller, which permits the server to fork off a process to handle the actual communication through this new connection. The server then goes back and waits for new connection requests on the original socket.
- Connect : This primitive is executed by the client, which supplies the transport-level address to which it wants to send a connection request. The client is blocked till the connection is set up successfully.
- Send : This primitive is executed by both client and server to send data over the established connection.
- Receive : This primitive is executed by both client and server to receive data over the established connection.
- Both client and server exchange information through these send and receive (write and read) primitives, which carry out the actual sending and receiving of data over the connection.
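A minimal Python sketch of how these primitives line up on the two sides (the port number and payload below are arbitrary, and error handling is omitted):

```python
import socket

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
    s.bind(("", 5000))                                       # bind to an address and port
    s.listen(5)                                              # listen (nonblocking)
    conn, addr = s.accept()                                  # accept: blocks until a request arrives
    data = conn.recv(1024)                                   # receive
    conn.sendall(b"ack: " + data)                            # send
    conn.close(); s.close()

def client():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket (no bind needed at the client)
    s.connect(("localhost", 5000))                           # connect: blocks until the connection is set up
    s.sendall(b"hello")                                      # send
    print(s.recv(1024))                                      # receive
    s.close()
```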
- The Message Passing Interface (MPI), discussed in section 2.2.2, uses messaging primitives to support transient communication. Some of its primitives are listed below.
- MPI_ssend : This primitive is executed by the sender to send a message; the sender then waits until receipt of the message has started at the receiver. This case corresponds to Fig. 2.5.5.
- MPI_sendrecv : This primitive supports the synchronous communication of Fig. 2.5.6; the sender is blocked until the reply from the receiver arrives.
- MPI_isend : This primitive avoids copying the message from the user's buffer to an MPI runtime buffer. The sender passes a pointer to the message, after which the MPI runtime system handles the communication.
- MPI_issend : The sender passes a pointer to the MPI runtime system; once the message has been processed by the MPI runtime system, the sender is guaranteed that the receiver has accepted the message for processing.
- MPI_recv : This primitive is called to receive a message; it blocks the caller until the message arrives.
- MPI_irecv : Same as above, but here the receiver only indicates that it is ready to accept a message; it later checks whether the message has arrived or not. This supports asynchronous communication.
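A small sketch using the mpi4py binding (assuming it is installed; the message contents are invented) of the blocking and non-blocking receive styles described above:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD            # the (group, process) identification is (communicator, rank)
rank = comm.Get_rank()

if rank == 0:
    comm.send({"temp": 21.5}, dest=1, tag=7)   # transient send to the process with rank 1
else:
    req = comm.irecv(source=0, tag=7)          # like MPI_irecv: only indicate readiness
    done, msg = req.test()                     # poll whether the message has arrived
    if not done:
        msg = req.wait()                       # or block for it, like MPI_recv
    print("rank", rank, "received", msg)
```

Run with two processes, e.g. `mpirun -n 2 python mpi_sketch.py`.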
2.5.4 Message-Oriented Persistent Communication
- Message-oriented middleware (MOM), i.e. message-queuing systems, supports persistent asynchronous communication.
- In a message-queuing system, communication can proceed with the sender running but the receiver passive, with the sender passive but the receiver running, or with both sender and receiver passive.
- Both sender and receiver address a destination queue by a system-wide unique name; this keeps the queuing system transparent to the communicating applications. If the message size is large, the underlying system fragments and reassembles the message, again transparently to the applications. The primitives offered provide a simple interface to applications; they are the following.
o Put : The sender calls this primitive to pass a message to the system in order to put it into the designated queue. This call is non-blocking.
o Get : It is a blocking call by which a process having the appropriate rights removes a message from the queue. If the queue is empty, the process blocks.
o Poll : It is a nonblocking call; the process executing it polls for the expected message. If the queue is empty or the expected message is not found, the process simply continues.
o Notify : With this primitive, an installed handler is called at the receiver to check the queue when a message arrives.
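A toy sketch of these four primitives on top of Python's standard queue module (this only mimics the interface; a real message-queuing system keeps the queues on separate machines and makes them persistent):

```python
import queue

class MessageQueue:
    def __init__(self):
        self._q = queue.Queue()
        self._handler = None

    def put(self, msg):                 # non-blocking append to the queue
        self._q.put_nowait(msg)
        if self._handler:               # notify: invoke the installed handler
            self._handler(self)

    def get(self):                      # blocking removal
        return self._q.get()

    def poll(self):                     # non-blocking check
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None                 # queue empty: the caller just continues

    def notify(self, handler):          # install a handler called on message arrival
        self._handler = handler
```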
- In a message-queuing system, the source queue is present either on the local machine of the sender or on the same LAN, and messages can be read only from a local queue. A message put into a queue for transmission contains the specification of a destination queue; the message-queuing system provides queues to the senders and receivers of the messages.
- The message-queuing system keeps a mapping of queue names to network locations. A queue manager manages queues and interacts with the applications which send and receive messages. Special queue managers work as routers or relays, which forward messages to other queue managers. Relays are useful because in several message-queuing systems a dynamic queue-to-location mapping is not available.
- Without relays, each queue manager would need a copy of the complete queue-to-location mapping, which for a large message-queuing system leads to a network-management problem. With relays, only the routers need to be updated when queues are added or removed.
- Relays thus help in the construction of scalable message-queuing systems. Fig. 2.5.7 shows the relationship between queue-level addressing and network-level addressing: the queue-level address is looked up to obtain the corresponding transport-level address.
Message Brokers
- In more advanced conversion settings, some information loss is expected, for example in conversion between X.400 and Internet email messages. A message broker maintains a database of rules to convert a message from one format into another; these rules are inserted into the database manually.
Example : IBM's WebSphere Message-Queuing System
- MQSeries is IBM's product which is now known as WebSphere MQ. Fig. 2.5.8 shows the general organization of IBM's message-queuing system. A queue manager puts messages into its send queues and forwards messages to other queue managers; it also picks up incoming messages and puts them into the appropriate incoming queue, from which the receiving client extracts them.
(Figure: the sending and receiving client programs use the MQ interface to talk to their local queue manager synchronously (RPC); queue managers hold send queues and routing tables and exchange messages asynchronously with remote queue managers over the enterprise network.)
Fig. 2.5.8 : General organization of IBM's message-queuing system
- The maximum default size of a message is 4 MB, but it can be increased up to 100 MB. The normal queue size is 2 GB, but it can be larger depending on the underlying operating system. The connection between two queue managers is through a message channel, which is based on the idea of a transport-level connection. A message channel is a unidirectional, reliable connection between two queue managers, which act as the sender and receiver of messages; through this connection, messages are forwarded from one queue manager to the other.
Message Transfer
messag e by sending queue manager to receiving queue
4 A message should carry a destination address to send
of the receiving queue manage r. The second part is the
manager. The first part of the address consists of the name
is to be appended.
name of the destination queue of queue manager to which the message
- The route is specified by the name of a send queue in the message, which also determines to which queue manager the message is to be forwarded. Each queue manager (QM) maintains a routing table with entries of the form (destQM, sendQ), where destQM is the name of a destination queue manager and sendQ is the name of the local send queue into which messages for that destination queue manager are to be put.
- A message travels through many queue managers before reaching its destination. An intermediate queue manager extracts the name of the destination QM from the header of the received message and searches its routing table for the name of the send queue to append the message to.
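The routing step of an intermediate queue manager can be sketched as a simple table lookup; the queue-manager names and queues below are made up for illustration:

```python
# routing table of one queue manager: destination QM -> local send queue
routing_table = {"QMA": "sq_to_A", "QMC": "sq_to_C", "QMD": "sq_to_C"}  # QMD is reached via QMC

send_queues = {"sq_to_A": [], "sq_to_C": []}

def forward(message):
    dest_qm = message["header"]["dest_qm"]        # extract the destination QM from the header
    send_queue = routing_table[dest_qm]           # look up the local send queue for it
    send_queues[send_queue].append(message)       # append; the message channel delivers it later

forward({"header": {"dest_qm": "QMD", "dest_queue": "orders"}, "body": b"..."})
print(send_queues["sq_to_C"])                     # the message now waits in the send queue towards QMC
```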
- Each queue manager has a system-wide unique identifier, i.e. a unique name. If a queue manager is replaced, applications sending messages to it may face problems; this problem is solved by using a local alias for queue manager names.
In many forms of communication, applications expect incoming information to be received within precisely defined time limits, for example audio and video streams in multimedia communication. Many protocols are designed to take care of this issue; a distributed system should offer the exchange of such time-dependent information between senders and receivers.
The representation of information in storage or on a presentation medium (such as a monitor) differs: text is stored in ASCII or Unicode format, images can be in GIF or JPEG format, and the same holds for audio information. Media are categorized as continuous and discrete representation media. In continuous media, the temporal relationship between different data items matters for inferring their actual meaning; in discrete media, the timing relationship between different data items does not matter.
Data Stream
- A sequence of data units is called a data stream. The notion applies to both continuous and discrete media, and transmission modes differ in the timing constraints they place on such streams.
- In asynchronous transmission mode, data items are transferred one after another without further constraints on their transmission times. In synchronous transmission mode, a maximum end-to-end delay is defined for every unit in a data stream; a data unit may be received at any time within this maximum delay without causing harm at the receiver side, and it does not matter that data units may even travel faster than the maximum tolerated delay allows.
- In isochronous transmission mode, data units must be transferred on time, with both a maximum and a minimum end-to-end delay. This mode is important in distributed multimedia systems to represent audio and video.
- A simple stream contains only a single sequence of data units, whereas a complex stream consists of many such simple streams (substreams). The substreams of a complex stream need synchronization between them. An example of a complex stream is the transfer of a movie: the stream comprises one video stream and two streams carrying the movie sound in stereo, and another substream may carry subtitles to be displayed for the deaf.
- A stream is a virtual connection between a source (sender) and a sink (receiver). The source and sink can be a process or a device. For example, a process on one machine may read data byte by byte from a hard disk and send it across the network to another process on another machine; in turn, this receiving process may deliver the received bytes to a local device.
- In multiparty communication, a data stream is multicast to many sinks. To adjust to the quality-of-service requirements of the different receivers, a stream can be configured with filters.
- It is necessary to preserve the temporal relationships in a stream. For a continuous data stream, timeliness, volume and reliability determine the quality of service, and there are many ways to state QoS requirements.
- A flow specification contains precise requirements for QoS such as bandwidth, transmission rate, delay etc. The token bucket algorithm generates tokens at a constant rate; a bucket (buffer) holds these tokens, which represent the number of bytes the application may send across the network. If the bucket is full, newly generated tokens are dropped. The application needs to remove tokens from the bucket in order to pass data units to the network. The application itself may remain unaware of its exact requirements.
(Figure: the application produces an irregular stream of data units; one token is added to the bucket every ΔT, and data units pass to the network by consuming tokens.)
Fig. 2.6.1: Token Bucket Algorithm
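A simple, single-threaded sketch of the token bucket check; rate is the token generation rate in bytes per second and capacity the bucket size in bytes, and both values below are illustrative:

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        # add the tokens generated since the last call; overflow tokens are dropped
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:        # enough tokens: consume them and pass the data unit
            self.tokens -= nbytes
            return True
        return False                     # otherwise the data unit has to wait

bucket = TokenBucket(rate=1000, capacity=4000)    # 1000 bytes/sec, bursts of up to 4000 bytes
print(bucket.allow(2000), bucket.allow(3000))     # True, then False (only ~2000 tokens remain)
```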
- Following are the characteristics of the input and the service requirements specified in a flow specification.
- Characteristics of the Input :
o Maximum data unit size (bytes) : It defines the maximum size of data units in bytes.
o Token bucket size (bytes) : It permits burstiness, as the application is allowed to pass a complete bucket to the network in a single operation.
o Token bucket rate (bytes/sec) : The rate at which tokens are generated.
o Maximum data rate (bytes/sec) : This is to limit the rate of transmission to the specified maximum.
- Service Requirements :
o Loss sensitivity (bytes) and loss interval (microseconds) : Both collectively define the maximum acceptable loss rate.
o Burst loss sensitivity (data units) : The number of consecutive data units that may be lost.
o Minimum delay noticed (microseconds) : Defines how long the network may delay delivery of data before the receiver notices it.
o Maximum delay variation (microseconds) : The maximum tolerated jitter.
Setting up a Stream
- There is no single best model to specify QoS parameters and to describe the resources in a network (bandwidth, processing capacity, buffers), nor is there a single best model to translate QoS parameters into resource usage. QoS requirements are dependent on the services that the network offers.
resources at router for conti nuous
- Resource reSerVation protocol (RSVP)is transport-level protocol which reserves
streams. The sender handover the data stream requirements in terms of bandwidth, delay, jitter etc to RSVP process
on the same machine which then stores it locally. |
RSVP is receiverinitiated protocol for QoS requirements. Here, receiver sends reservation request along the path to
the sender. Sender in RSVP set up path with receiver and gives flow specification that contains QoS requirements such
as bandwidth,delay,jitter etc to each intermediate node,
At sender side, this reservation request is given by RSVP process to admission control module to check sufficient .
resources available or not. Same request is the passed to policy control module to check whether receiver has
permission to makereservation. Resources are then reservedif these to tests are passed. Fig. 2.6.1 shows organization
of RSVP for resource reservation in distributed system. | |
(Figure: an RSVP-enabled host - the application passes its data stream to the local OS and its QoS requirements to the RSVP process, which consults the policy control and admission control modules and exchanges reservation requests and setup information with other RSVP hosts over the local network and internetwork.)
Fig. 2.6.2 : RSVP Protocol
- In the case of a complex stream it is necessary to maintain the temporal relations between the different substreams. Consider an example of synchronization between a discrete data stream and a continuous data stream: a web server stores a slide-show presentation consisting of slides and an audio stream. The slides come from the server to the client as a discrete data stream, and the client should play the audio that belongs to the slide currently shown; the audio stream thus has to be synchronized with the slide presentation.
Synchronization Mechanisms
For actual synchronization we have to consider the following two issues:
o mechanisms to synchronize two streams, and
o the distribution of these mechanisms in a network environment.
- At the lowest level, synchronization is carried out on the data units of simple streams. In this approach the application itself has to implement the synchronization, which is hardly possible as it has only low-level facilities available. A better alternative is to offer the application an interface which permits it to control streams and devices in a simple way.
- In a complex stream, the different substreams need to be synchronized with each other.
- Transport-level solutions for multicast communication have been implemented; the issue with these solutions is the effort needed to set up the paths used to disseminate the information. An alternative is application-level multicasting techniques: since peer-to-peer solutions are usually deployed at the application layer, it became easier to set up communication paths there.
- Multicast messages offer a good solution for building distributed systems with the following characteristics:
o Fault tolerance based on replicated services : A replicated service is offered by a group of servers. Client requests are multicast to all the servers of the group and each server performs the same operation on the request; if some member of the group fails, the client is still served by the other active members (servers).
o Finding discovery servers in spontaneous networking : Servers and clients can use multicast messages to find discovery services, to register their interfaces with them, or to look up the interfaces of other services in the distributed system.
o Better performance through replicated data : Replication improves performance. If, for example, a replica of the data is placed on the client machine, then a change to a data item is multicast to all the processes managing the replicas.
o Event notification : A multicast to a group is used to notify its members when some event occurs.
- In application-level multicasting, nodes are organized into an overlay network, which is then used to disseminate information to its members. Group membership information is not available to the network routers, so routing messages within the overlay is less efficient than network-level routing.
- In an overlay network, nodes can be organized into a tree, so that there is a unique path between every pair of nodes. Nodes can also be organized into a mesh network, in which there are multiple paths between every pair of nodes; this offers robustness in case some node fails.
- One scheme is built on top of Pastry, a DHT (distributed hash table) based peer-to-peer system. In order to start a multicast session, a node generates a multicast identifier (mid), which is a randomly chosen 160-bit key. It then looks up SUCC(mid), the node responsible for that key, and promotes it to become the root of the multicast tree that will be used to send data to interested nodes.
- If a node X wants to join the tree, it executes the operation LOOKUP(mid). The lookup message, carrying the request to join the multicast group mid, is routed towards the node SUCC(mid). While traveling towards the root, this join message passes many nodes. Suppose it reaches a node Y: if Y sees this join request for mid for the first time, it becomes a forwarder for that group, node X becomes a child of Y, and Y continues forwarding the join request towards the root.
- If the next node on the route to the root, say Z, is also not yet a forwarder, it becomes one, records Y as its child and continues to forward the join request. Alternatively, if Y (or Z) is already a forwarder for mid, it also records the previous sender as its child (i.e., X or Y, respectively), but there is no need to send the join request towards the root anymore, as Y (or Z) is already a member of the multicast tree.
- In this way, a multicast tree across the overlay network is created with two types of nodes: pure forwarders that act only as helpers, and nodes that are forwarders as well but have explicitly requested to join the tree.
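The join handling can be sketched with a toy overlay in which each node already knows its next hop towards the root (the node names and the next_hop map below simply stand in for Pastry's LOOKUP(mid) routing):

```python
# toy overlay: each node's next hop towards the root of the multicast tree
next_hop = {"X": "Y", "Y": "Z", "Z": "root", "W": "Y"}
children = {}            # node -> set of children it forwards group traffic to

def join(node):
    """Route a join request from `node` towards the root, creating forwarders on the way."""
    prev, current = node, next_hop[node]
    while True:
        first_time = current not in children
        children.setdefault(current, set()).add(prev)    # record the previous hop as a child
        if not first_time or current == "root":
            break                                        # already a forwarder (or the root): stop
        prev, current = current, next_hop[current]       # keep forwarding the join upwards

join("X")        # Y and Z become forwarders on the way to the root
join("W")        # stops at Y, which is already a forwarder; W just becomes Y's child
print(children)  # {'Y': {'X', 'W'}, 'Z': {'Y'}, 'root': {'Z'}}
```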
Overlay Construction
- It is not easy to build an efficient tree of nodes in an overlay, and the actual performance depends entirely on the routing of messages through the overlay. There are three metrics to measure the quality of an application-level multicast tree: link stress, stretch, and tree cost. Link stress is defined per link and counts how frequently a packet crosses the same physical link.
- For k = 3, the fraction s of nodes that remain susceptible (i.e., that miss the update) is less than 0.02. Scalability is the main advantage of epidemic algorithms.
Removing Data
- It is hard for epidemic algorithms to spread the deletion of a data item, because deletion destroys the information related to the deleted item. If a data item is simply removed from a node, the node will eventually receive old copies of the item and treat them as new.
- The solution is to create a death certificate with a timestamp. Death certificates can be removed after the deletion has propagated to all nodes within a known finite time, which is the maximum elapsed propagation time.
- A few nodes (say, node X) still maintain the death certificate, for the benefit of nodes which the earlier death certificate has not yet reached. If node X holds a death certificate for data item x and an update for the same data item x arrives at node X, then node X will again spread the death certificate of data item x.
Applications of Epidemic Protocols
Review Questions
Syllabus
Clock Synchronization, Logical Clocks, Election Algorithms, Mutual Exclusion, and Distributed Mutual Exclusion - Classification of Mutual Exclusion Algorithms, Requirements of Mutual Exclusion Algorithms, and Performance Measure. Non-Token based Algorithms : Lamport Algorithm, Ricart-Agrawala's Algorithm, Maekawa's Algorithm. Token-based Algorithms : Suzuki-Kasami's Broadcast Algorithm, Singhal's Heuristic Algorithm, Raymond's Tree-based Algorithm, Comparative Performance Analysis.
Introduction
- Apart from communication between processes in a distributed system, cooperation and synchronization between processes are also important. Cooperation is supported through naming, which permits processes to share resources. As an example of synchronization, multiple processes in a distributed system should agree on the ordering of the events that have occurred.
- Processes should also cooperate with each other when accessing shared resources, so that simultaneous access to these resources is avoided. As processes run on different machines in a distributed system, synchronization is not as easy to implement as in a uniprocessor or multiprocessor system.
- In a centralized system, if process P asks the kernel for the time, it will get it, and if process Q asks for the time a little later, it will naturally get a value greater than or equal to the value process P received. In a distributed system, such agreement on time is important and it is necessary to achieve it explicitly.
- Practically, the clocks of different machines in a distributed system differ in their time values; therefore clock synchronization is required. Consider the example of the UNIX make program: an editor is running on machine A and a compiler is running on machine B, and a large UNIX program is a collection of many source files. If only a few files are modified there is no need to recompile the whole program; only the modified files require recompilation.
- After changing source files, the programmer starts the make program, which checks the creation timestamps of the source and object files. For example, xyz.c has timestamp 3126 and xyz.o has timestamp 3125; since the source file xyz.c has the greater timestamp, make concludes that xyz.c has been changed after xyz.o was built and that recompilation is required. If abc.c has timestamp 3128 and abc.o has 3129, then no compilation is needed.
- Now suppose there is no global agreement on time, and suppose abc.o has timestamp 3129 while abc.c is modified just afterwards but is assigned a slightly smaller timestamp because the clock on its machine runs behind. In that case make will not recompile abc.c, and the resulting program will contain a mixture of old and new files.
3.1.1 Physical Clocks
- Every machine has a timer, which is a precisely machined quartz crystal. The quartz crystal oscillates when kept under tension. A counter and a holding register are associated with the crystal; the counter value is decremented by one after each oscillation of the crystal. When the counter reaches 0, an interrupt is generated and the counter is reloaded from the holding register.
- Each interrupt is called a clock tick, and the timer can be programmed to generate, say, 60 clock ticks per second. A battery backed-up CMOS RAM keeps the time and date. After booting, the interrupt service procedure adds 1 to the current time with each clock tick. Practically, the crystals in different machines do not run at exactly the same frequency; the difference between two clocks' time values is called clock skew.
- In real-time systems the actual clock time is important, so multiple physical clocks have to be considered. How to synchronize these clocks with real-world clocks and how to synchronize the clocks with each other are two important issues that need to be addressed.
- Mechanical clocks were invented in the 17th century, and since then time has been measured astronomically. The moment the sun reaches its highest apparent point in the sky is called the transit of the sun, and a solar day is the interval between two consecutive transits of the sun. Each day has 24 hours, each hour 60 minutes and each minute 60 seconds, so a day has 24 x 60 x 60 = 86,400 seconds; a solar second is 1/86,400th of a solar day.
- In the 1940s it was discovered that the earth is slowing down, so the period of the earth's rotation varies. As a result, astronomers took a large number of days and averaged them before dividing by 86,400; the result is called the mean solar second.
- In 1948, physicists started measuring time with the cesium-133 clock. One mean solar second corresponds to the time needed for the cesium-133 atom to make 9,192,631,770 transitions. Nowadays, about 50 laboratories worldwide keep cesium clocks and periodically report their time to the BIH in Paris, which averages all the reported times into International Atomic Time (TAI). Now 86,400 TAI seconds is about 3 msec less than a mean solar day; the BIH solves this problem by introducing a leap second whenever the difference between TAI and solar time grows to 800 msec. The corrected time is called Universal Coordinated Time (UTC).
Peery aco
- National institute of standard time (NIST) run shortwave broadcastradio station that broadcast short pulse at start of
each UTC secon d with accuracy of + 10 msec. In practice, this accuracy is no better than +10msec. Several earth
satellites also provide UTC service.
a ee 4
~ GPS was launched in 1978. It is a satellite-based distributed system which has been used mainly for military
applications. Now days it is used in many civilian applications such as traffic navigation and in GPS phones.
- If one machine has a WWV receiver and thus UTC time, then all the other machines' clocks should be synchronized to this UTC time; otherwise, all the machines' clocks should at least agree with each other. Consider a timer that interrupts X times a second, and let the value of the clock on machine p be Y. When the UTC time is t, the value of the clock on machine p is Yp(t). Ideally Yp(t) = t for all p and all t, which means dY/dt should be 1.
- In the real world a timer does not interrupt exactly X times a second. If X = 60, the timer should generate 216,000 ticks per hour; practically, a machine gets this value in the range 215,998 to 216,002 ticks per hour. For some constant a, if 1 - a <= dY/dt <= 1 + a then the timer is working within its specification. The constant a is specified by the manufacturer of the timer and is called the maximum drift rate.
Cristian's Algorithm
(Figure: the client sends a request to the time server and receives a reply carrying the server's current UTC time CUTC.)
Fig. 3.1.1 : Getting the current time from a time server
- When the client receives the response from the time server at time Tr, it sets its clock to the time CUTC contained in the reply. This algorithm has two problems, one major and one minor. The major problem occurs when the client's clock is fast: the received CUTC will then be smaller than the client's current time C.
- The minor problem is that it takes some time for the reply to travel from the time server to the client. This propagation time can be estimated as (Tr - Ts)/2, where Ts is the time at which the client sent the request and Tr the time at which the reply arrived; the client adds this estimate to the CUTC value returned by the server when setting its clock. If the time elapsed between the server receiving the request and sending the reply is the interrupt (request) processing time I, then the total message propagation time is (Tr - Ts - I) and the one-way propagation time is half of that. In this algorithm the time server is passive: it only replies to queries from clients.
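A sketch of the client-side computation (get_server_time below is a stand-in for the request to the time server, not a real API; here it simply pretends the server is 2.5 s ahead):

```python
import time

def get_server_time():
    """Stand-in for querying the time server; returns its CUTC value."""
    return time.time() + 2.5                           # pretend the server's clock is 2.5 s ahead

def cristian_sync(interrupt_time=0.0):
    t_send = time.monotonic()                          # Ts: moment the request is sent
    c_utc = get_server_time()                          # reply carries the server's UTC time
    t_recv = time.monotonic()                          # Tr: moment the reply is received
    one_way = (t_recv - t_send - interrupt_time) / 2   # estimated one-way propagation time
    return c_utc + one_way                             # the value the client sets its clock to

print(cristian_sync())
```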
The Berkeley Algorithm
- In Berkeley UNIX the time daemon (server) is active: it polls every machine, asking for the current time there. It then calculates the average of the received answers and tells the other machines either to advance their clocks to the new time or to slow their clocks down until the specified time is reached.
- This algorithm is used in systems where no WWV receiver is available for UTC time; the daemon's time is periodically set manually by an operator.
(Figure: (a) the time daemon polls the other machines for their clock values; (b) it then tells each machine, including itself, how to adjust its clock.)
Fig. 3.1.2 : The Berkeley algorithm
- As shown in Fig. 3.1.2(a), initially the time daemon has time value 4:00, client 1 has clock value 3:50 and client 2 has clock value 4:20. The time daemon asks the other machines (and itself) for their current time. Client 1 replies that it is lagging by 10 with respect to the server time of 4:00, client 2 replies that it is leading by 20, and the server replies 0 for itself.
- Meanwhile the server's time advances to 4:05, so the server sends clients 1 and 2 messages about how to adjust their clocks: it tells client 1 to advance its clock by 15 and client 2 to decrement its time value by 15, while the server itself adds 5 to its own clock. This is shown in Fig. 3.1.2(b).
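The daemon's computation can be sketched as follows (the clock values below, expressed simply in minutes, are illustrative and not the exact ones of Fig. 3.1.2):

```python
def berkeley_adjustments(daemon_time, reported_times):
    """Return the correction each machine (daemon included) must apply to its clock."""
    offsets = {m: t - daemon_time for m, t in reported_times.items()}   # answers to the poll
    average = sum(offsets.values()) / len(offsets)                      # average of the answers
    return {m: average - off for m, off in offsets.items()}             # per-machine correction

# clock readings in minutes: daemon 180, client 1 is 10 behind, client 2 is 25 ahead
print(berkeley_adjustments(180, {"daemon": 180, "c1": 170, "c2": 205}))
# -> {'daemon': 5.0, 'c1': 15.0, 'c2': -20.0}: everyone converges on the average clock
```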
Averaging Algorithms
- Cristian's algorithm and the Berkeley algorithm are both centralized algorithms, and centralized algorithms have certain disadvantages. An averaging algorithm is a decentralized algorithm. One class of these algorithms works by dividing time into fixed-length resynchronization intervals: let T0 be an agreed-upon moment in the past; the k-th interval then starts at time T0 + kS and runs until T0 + (k+1)S, where S is a system parameter.
- At the start of each interval, every machine broadcasts its clock value. These broadcasts happen at slightly different moments because the clock speeds of the machines differ. After a machine finishes its broadcast, it starts a local timer in order to collect the broadcasts from the other machines during some time interval I. When all the broadcasts have been collected, one of the following algorithms is used at every machine:
o Average the time values collected from all machines.
o Another version is to discard the n highest and n lowest time values and take the average of the rest.
o Another variation is to add the propagation time of the message from the source machine to each received value; the propagation time can be calculated from the known topology of the network.
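A sketch of the second variant, which drops the n highest and n lowest values before averaging (the collection of the broadcasts themselves is not shown; the clock values are invented):

```python
def averaged_time(broadcast_values, n=1):
    """Average the clock values received in one interval, discarding n outliers at each end."""
    ordered = sorted(broadcast_values)
    kept = ordered[n:-n] if n else ordered
    return sum(kept) / len(kept)

# clock values (in seconds) collected from five machines in one resynchronization interval
print(averaged_time([1000.2, 1000.0, 999.9, 1000.1, 1012.7], n=1))   # 999.9 and 1012.7 are ignored
```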
- In wireless networks, the algorithm needs to be optimized with the energy consumption of the nodes in mind. It is not easy to deploy a time server as in a traditional distributed system, so the design of clock synchronization algorithms requires different insights.
- Reference Broadcast Synchronization (RBS) is a clock synchronization protocol. It focuses on internal synchronization of clocks, just like the Berkeley algorithm, and does not consider synchronization with respect to UTC time. In this protocol only the receivers synchronize, keeping the sender out of the loop.
- In a sensor network, once a message leaves the network interface of the sender it takes an essentially constant time to reach the destinations (multi-hop routing is not assumed). The propagation time is therefore measured from the point the message leaves the sender's network interface, and RBS considers only the delivery time at the receivers. The sender broadcasts a reference message, say m. When a receiver x receives this message, it records the receiving time of m, say T_{x,m}, using its local clock.
- Two nodes x and y exchange their recorded receiving times of m in order to calculate their mutual offset. If M is the total number of reference messages sent, the offset is

  Offset[x, y] = \frac{1}{M} \sum_{k=1}^{M} (T_{x,k} - T_{y,k})
- As the clocks drift apart from each other, calculating the average alone will not work; standard linear regression is used to compute the offset as a function of time.
- In Cristian's algorithm, when the time server sends the reply message with the UTC time, the problem is to find the actual propagation time, since the propagation time of the reply affects the reported UTC time. By the time the reply reaches the client, the reported time is already outdated by the message delay. A good estimate of this delay can be calculated as shown in Fig. 3.1.3.
Fig. 3.1.3
- A sends a message to B at time T1 by its local clock. B records its receiving time T2 and then sends a reply containing its own timestamp T3 along with the recorded timestamp T2. Finally, A records the time T4 at which the response arrives. Assuming the propagation delay is the same in both directions, T2 - T1 ≈ T4 - T3. A's estimate of its offset relative to B is then

  \theta = T_3 + \frac{(T_2 - T_1) + (T_4 - T_3)}{2} - T_4 = \frac{(T_2 - T_1) + (T_3 - T_4)}{2}
- The value of θ will be less than zero if A's clock is fast, in which case A would have to set its clock backward. Doing so causes practical problems: an object file compiled just after the clock change could carry a timestamp smaller than that of a source file modified just before the change. The solution is to introduce the change gradually: if each timer interrupt normally adds 5 msec to the time, add only 4 msec per interrupt until the time is corrected. Similarly, the time can be advanced gradually by adding 6 msec per interrupt instead of jumping forward all at once.
- In the case of NTP, the protocol is set up pair-wise between servers; in other words, B will also query A for its current time. Along with the offset calculation above, an estimate of the delay is computed as

  \delta = \frac{(T_2 - T_1) + (T_4 - T_3)}{2}

- Eight pairs of (θ, δ) values are buffered. The smallest value of δ is the best estimate of the delay between the two servers, and the associated value of θ is then taken as the most reliable estimate of the offset.
- Some clocks are more accurate than others, say B's clock. Servers are divided into strata. If A contacts B, A will only correct its time provided its own stratum level is higher than that of B; after synchronization, A's stratum level becomes one higher than that of B. Because of the symmetry of NTP, if A's stratum level was lower than that of B, then B will adjust itself to A. A stratum-1 server is one equipped with a WWV receiver or a cesium clock. Many features of NTP are related to the identification and masking of errors, as well as security attacks.
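A minimal sketch, not from the text, of the offset/delay computation and the "keep eight samples, trust the one with the smallest delay" rule described above.

def ntp_offset_delay(t1, t2, t3, t4):
    """A sends at t1, B receives at t2, B replies at t3, A receives at t4."""
    theta = ((t2 - t1) + (t3 - t4)) / 2   # estimated offset of B relative to A
    delta = ((t2 - t1) + (t4 - t3)) / 2   # estimated one-way delay
    return theta, delta

def best_offset(samples):
    """samples: list of (theta, delta) pairs, e.g. the last eight exchanges.
    The offset associated with the smallest delay is taken as most reliable."""
    return min(samples, key=lambda pair: pair[1])[0]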
3.2 Logical Clocks
- In many applications, all machines should agree on the same time, even though this time need not agree with UTC or real time. In such cases internal consistency of the clocks is essential: many algorithms work on the basis of internal consistency of clocks rather than real time. For these algorithms, the clocks are said to be logical clocks.
- Lamport showed that if two processes are not interacting, then the lack of synchronization of their clocks causes no problem. He also pointed out that processes should agree on the order in which events occur, and not on exactly what time it is.
- If x → y and y → z then x → z, that is, the happens-before relation is transitive. If two processes do not exchange messages directly or indirectly, then neither x → y nor y → x holds for events x and y occurring in these processes; such events are said to be concurrent.
- Let T(x) be the time value assigned to event x. If x → y then T(x) < T(y). If x and y are events in the same process and x occurs before y, then x → y and T(x) < T(y). If x is the event of sending a message by one process and y is the event of receiving the same message by another process, then x → y and all processes must agree on the values T(x) and T(y), with T(x) < T(y). Clock time T is assumed to move forward and must never decrease; time values can only be corrected by adding a positive value.
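A minimal sketch, assuming the rules above, of a Lamport clock; the class and method names are illustrative, not from the text.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, move past both the local value and the message timestamp,
        so that T(send) < T(receive) always holds."""
        self.time = max(self.time, msg_time) + 1
        return self.time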
- In Fig. 3.2.1(a), three processes A, B and C are shown, running on different machines. Each clock runs at its own speed: when the clock has ticked 5 times in process A, it has ticked 7 times in process B and 9 times in process C. The rates differ because of the different crystals in the timers.
Fig. 3.2.1
- These processes communicate by exchanging messages, and each message carries its sending time. Assume all processes receive messages in the order they were sent and that no messages are lost.
- Each process sends an acknowledgement (ACK) for every received message. As per Lamport's algorithm, the timestamp of the ACK will be greater than the timestamp of the received message. Lamport's clocks ensure that no two messages carry the same timestamp, and the timestamps also give a global (total) ordering of the events.
- In Lamport's clocks, if x → y then T(x) < T(y). However, T(x) < T(y) by itself tells us nothing about the relationship between events x and y: Lamport clocks do not capture causality.
- Suppose users, and hence processes, join a discussion group in which processes post articles and post reactions to these articles. Postings of articles and reactions are multicast to all members of the group.
- Consider a totally-ordered multicasting scheme in which reactions must be delivered after their corresponding postings: the receipt of a reaction should be causally preceded by the receipt of the article it refers to. Within the group, the receipt of a reaction to an article must always follow the receipt of that article; for independent articles or reactions, the order of delivery does not matter.
- The causal relationships between messages are captured through vector clocks. Event x is said to causally precede event y if VT(x) < VT(y), where VT(x) is the vector clock assigned to event x and VT(y) the vector clock assigned to event y.
o VT_i[i] maintains the number of events that have taken place at process P_i.
o If VT_i[j] = k then P_i is aware of k events that have occurred at process P_j.
- Each time an event occurs at process P_i, VT_i[i] is incremented by 1, which ensures the first property. Whenever process P_i sends a message m to another process, it sends its current vector as timestamp vt along with it (piggybacking), which ensures the second property stated above.
- Because of this, the receiver knows how many events occurred at the other processes before message m was sent by P_i, and on which events m may causally depend.
- When process P_j receives message m, it sets each entry of its own vector to the maximum of VT_j[k] and vt[k]. The vector at P_j then shows the number of messages it must have received to have seen at least the same messages that preceded the sending of m. After this, VT_j[j] is incremented by 1.
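A minimal sketch of a vector clock implementing the update rules above; whether the send itself also counts as a local event varies between formulations, and the names used here are illustrative assumptions.

class VectorClock:
    def __init__(self, n, i):
        self.i = i                 # index of this process
        self.vt = [0] * n          # vt[j] counts events known from process j

    def local_event(self):
        self.vt[self.i] += 1

    def send(self):
        """Piggyback a copy of the current vector on the outgoing message."""
        self.local_event()
        return list(self.vt)

    def receive(self, msg_vt):
        """Merge: component-wise maximum, then count the receive event."""
        self.vt = [max(a, b) for a, b in zip(self.vt, msg_vt)]
        self.vt[self.i] += 1

def causally_precedes(vx, vy):
    """VT(x) < VT(y): every component <= and the vectors are not equal."""
    return all(a <= b for a, b in zip(vx, vy)) and vx != vy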
- To fulfill the processing needs of a distributed system, many distributed algorithms are designed in which one process acts as coordinator and plays some special role. This can be any process, but if it fails, some approach is required to assign this role to another process. In that case an election is required to elect the new coordinator.
- In a group of processes, if all processes are identical, the only way to assign this responsibility is on the basis of some criterion. This criterion can be an identifier, which is some number assigned to each process; for example, it could be the network address of the machine on which the process runs, under the assumption that only one process is running per machine.
- The election algorithm locates the process with the highest number in order to elect it as coordinator. Every process is aware of the process numbers of the other processes, but processes do not know which processes are currently up and which ones are currently crashed. The election algorithm ensures that all processes agree on the newly elected coordinator.
Fig. 3.3.1
- As shown in Fig. 3.3.2, processes 15 and 16 respond with OK messages. Process 17 has already crashed and hence does not reply with an OK message. Now the job of process 14 is over.
<"
+.
Ww TechKnowledge
Publications
- Now processes 15 and 16 each hold an election by sending an Election message to their higher-numbered processes, as shown in Fig. 3.3.3(a). As process 17 is crashed, only process 16 replies to 15 with an OK message, as shown in Fig. 3.3.3(b).
Fig. 3.3.3
- Finally, process 16 wins the election and sends a Coordinator message to all the processes to inform them that it is now the new coordinator, as shown in Fig. 3.3.4. In this algorithm, if two processes detect simultaneously that the coordinator has crashed, both will initiate an election. Every higher-numbered process will then receive two Election messages; it simply ignores the second one and the election carries on as usual.
Fig. 3.3.4
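A minimal sketch, not from the text, of the bully algorithm's decision rule with the message exchanges abstracted away: a process that hears no OK from any higher-numbered live process wins, otherwise the highest live process becomes coordinator.

def bully_election(initiator, alive):
    """initiator : id of the process that noticed the coordinator crash
    alive     : set of ids of processes that are currently up"""
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator      # nobody higher answered OK: the initiator wins
    return max(higher)        # otherwise the highest live process wins and
                              # broadcasts the Coordinator message

# e.g. process 4 starts the election while process 7 has crashed -> 6 wins
print(bully_election(4, alive={0, 1, 2, 3, 4, 5, 6}))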
- The ring algorithm does not use a token. The processes are physically or logically ordered in a ring, so each process has a successor and knows who its successor is. When any process notices that the coordinator has crashed, it builds an Election message containing its own process number and sends it to its successor. If the successor is down, the message is sent to the next process along the ring; in this way the sender locates the next running process even if it finds several crashed processes in sequence. Each receiver of the Election message appends its own number to the message and forwards it to its successor in the same manner.
- In this way, the message eventually returns to the process that started the election. This incoming message contains the process numbers of all the processes that received it, including the initiator's own number.
Fig. 3.3.5
- In Fig. 3.3.5, process 5 notices the crash of the coordinator, which was initially process 7. It sends an Election message containing its own number 5 to process 6. As process 7 is crashed, process 6 appends its number and forwards the message to its new successor, process 0. In this way the message is received by all the processes in the ring.
- Eventually the message arrives back at process 5, which had initiated the election. The highest number in the list is 6, so the message is circulated around the ring once more to inform all the processes that process 6 is now the new coordinator.
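A minimal sketch of the ring election described above, with message passing abstracted away; the function and parameter names are illustrative assumptions.

def ring_election(ring, initiator, crashed):
    """The Election message travels around the ring, skipping crashed
    processes and collecting live process ids; when it returns to the
    initiator, the highest collected id becomes the coordinator.
    ring     : list of process ids in ring order
    initiator: id of the process that starts the election
    crashed  : set of ids that do not respond"""
    n = len(ring)
    start = ring.index(initiator)
    members = []
    for step in range(n):
        p = ring[(start + step) % n]
        if p not in crashed:
            members.append(p)     # each live process appends its own id
    return max(members)

# Example matching Fig. 3.3.5: ring 0..7, old coordinator 7 crashed,
# process 5 starts the election -> process 6 becomes the new coordinator
print(ring_election(list(range(8)), 5, crashed={7}))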
- Let any node (say node A, the source) initiate an election and send an Election message to its immediate neighbours, i.e. the nodes within its range. When a node receives this Election message for the first time, it chooses the sender of the message as its parent. The receiver then sends the Election message on to its own immediate neighbours, except to its chosen parent. When a node receives an Election message from a node other than its parent, it merely sends an acknowledgement.
- The build-tree phase is shown in Fig. 3.3.7. Node P broadcasts the Election message to Q and Y, and node Q broadcasts it to R and V. In this scenario Y is slow in sending the message to V, so V receives the message from Q before it receives it from Y. Node V broadcasts the message to T and W, and node R broadcasts it to S and T; T receives the broadcast from V before the one from R. Node W forwards the message to T and X, and T forwards it to node U, which gets the message from T before it gets it from S. This process continues until all the nodes have been traversed.
Fig. 3.3.7
- In this method of forwarding the message, each process records the node from which it received the message and the nodes to which it forwarded it. The receiving node of the message prepares a reply of the form [node, capacity] and sends it to the node from which the message was received. Fig. 3.3.8 shows how the best node is reported back to the source.
- The first reply messages are generated at the last (leaf) nodes and travel back along the same path. If a node receives multiple reply messages, it selects the one with the highest capacity and forwards that message to the next node along the return path.
- As shown in Fig. 3.3.8, the reply message [U,4] is initiated from U to T and [X,5] from X to W. Node T forwards the message [U,4] as it is to V. At about the same time, W receives the message [X,5]; as its own capacity is greater than that of X, it replaces the message with [W,8] and forwards it to V.
- As stated previously, node V selects the higher-capacity message [W,8] and forwards it to node Q. In the meantime S initiates the reply message [S,2] to node R, which compares its own capacity and forwards the new message [R,3] to node Q. Node Q now receives two messages, [R,3] from R and [W,8] from V, and sends the best value [W,8] to the source node P. At the same time P also receives the message [Y,4] from Y. It then selects the higher of these values, [W,8], and reports W as the best node. Finally, W is elected as the new coordinator.
Fig. 3.3.8
- When a process finishes execution in the critical section, the token is passed on. These classes of algorithms differ on the basis of the way in which nodes/processes search for the token.
- Avoidance of Starvation : If some nodes are repeatedly executing the critical section while some other node is waiting indefinitely to execute its critical section, that situation should be avoided. In finite time, every requesting node should get the opportunity to execute in the critical section (CS).
- Fairness : As there is no physical global clock, requests at any node should be executed either in the order of arrival or in the order in which they were issued; this ordering is based on logical clocks.
- Fault Tolerance : If any failure occurs in the course of carrying out the work, the mutual exclusion algorithm should reorganize itself so that the designated job can still be carried out.
3.4.4 Performance Measure of Mutual Exclusion Algorithms
Fig. 3.4.1 : Response time (includes the execution time in the CS)
3.4.5 Performance in Low and High Load Conditions
- In the low load condition of the system, there is rarely more than one request for mutual exclusion at the same time. In the high load condition, there is almost always a pending request for mutual exclusion at each node (site), and a node rarely remains idle. In the best case, the performance metrics achieve their best possible values.
- In non-token based mutual exclusion algorithms, a particular node communicates with a set of nodes to decide who should enter the critical section next. Requests to enter the CS are ordered on the basis of timestamps.
- Simultaneous requests are also handled by using the timestamp of the request. Lamport's logical clock is used, and a request for the CS with a smaller timestamp value gets priority over requests with larger timestamp values.
Executing the CS
- Site Si can enter the CS if the following two conditions are satisfied.
o C1 : Site Si has received a message with timestamp larger than (tsi, i) from all the other sites.
o C2 : Si's request is at the top of request_queue_i.
1. If the receiver is currently not in the CS and also does not want to enter it, it replies with an OK message to the sender.
2. If the receiver is already executing in the CS, it does not reply; it puts the message in its queue.
3. If the receiver wants to enter the CS but has not yet done so, it compares the timestamp of the incoming request message with the timestamp of the request message it has already sent to the other processes. The lowest one wins: the receiver sends an OK message to the sender if the incoming request has the lower timestamp; if its own request has the lower timestamp, it puts the incoming request in its queue and does not reply.
- When the requesting process has received OK replies from all the other processes, it enters the CS. If a process is in the CS when a request arrives, it sends the OK message to the requesting process only after it has come out of the CS.
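A minimal sketch of the three receive-side rules listed above (the Ricart-Agrawala style decision); the state names and the (timestamp, pid) tuple ordering are illustrative assumptions.

def on_request(my_state, my_request_ts, incoming_ts, queue):
    """Decide how a process reacts to an incoming CS request.
    my_state      : 'RELEASED', 'HELD' or 'WANTED'
    my_request_ts : (timestamp, pid) of our own pending request, if any
    incoming_ts   : (timestamp, pid) of the incoming request
    queue         : list of deferred requests
    Returns True if an OK reply should be sent now."""
    if my_state == 'RELEASED':
        return True                   # rule 1: neither using nor wanting the CS
    if my_state == 'HELD':
        queue.append(incoming_ts)     # rule 2: defer while in the CS
        return False
    # rule 3: we also want the CS; the lower (timestamp, pid) wins
    if incoming_ts < my_request_ts:
        return True
    queue.append(incoming_ts)
    return False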
- A process that wants to enter a CS sends a REQUEST message to the coordinator stating the name of the CS.
- If the CS is free, the coordinator sends an OK message to the requesting process, granting it permission to enter the stated CS. If some other process is already executing in the stated CS, the coordinator puts the request message in its queue.
- When the process currently executing in the CS exits, it sends a RELEASE message to the coordinator. As a result, the coordinator knows that the CS is now free and takes the next request from the queue, granting it permission to enter the CS. If the requesting process is still blocked, it unblocks and enters the CS.
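A minimal sketch, not from the text, of the coordinator's bookkeeping in this centralized scheme; grant messages are represented simply by return values.

from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = None          # process currently in the CS, if any
        self.queue = deque()        # deferred REQUESTs

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return 'OK'             # CS is free: grant immediately
        self.queue.append(pid)      # otherwise queue the request, no reply yet
        return None

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder          # this process (if any) now receives OK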
Fig. 3.5.2
- As shown in Fig. 3.5.2, C is the coordinator and currently no process is executing in the CS. Process Q sends its request to the coordinator. The main drawback of this centralized scheme is that the coordinator is a single point of failure: the whole scheme stops working if it crashes.
- In Maekawa's voting algorithm, to enter the CS a process has to obtain permission only from a subset of the processes, as long as the subsets used by any two processes overlap. A process need not obtain permission from all the processes in order to enter the CS.
- Processes vote for one another in order to enter the CS; if a process collects enough votes, it can enter the CS. Conflicts are resolved by the processes common to both subsets: the processes in the intersection of two voting sets guarantee the safety property that only one process enters the CS, by voting for a single candidate.
- For processes 1 to N, a voting set Vi is associated with each process Pi. The set Vi is a subset of the processes P1, P2, P3, ..., PN. For all processes i and j between 1 and N, these sets satisfy the following four properties.
- In order to get access to the CS, process Pi sends a REQUEST message to the K-1 other members of Vi. Process Pi can enter the CS if it receives a reply from the K-1 processes of the set. Let process Pj belong to the set Vi.
- Process Pj sends a reply message to process Pi immediately after receiving the REQUEST message from it, unless either its state is HELD or it has already replied (sent a VOTE message) since it last received a RELEASE message.
- When a process receives a RELEASE message, it takes the outstanding REQUEST message from the front of its non-empty queue and sends a reply (VOTE) for it. To exit the CS, Pi sends RELEASE messages to all K-1 members of Vi.
- This algorithm guarantees mutual exclusion: the processes common to Vi and Vj would have to cast votes for both Pi and Pj if both simultaneously wanted to enter the CS, but the algorithm permits each process to cast at most one vote between consecutive receipts of RELEASE messages. Granting both requests at once is therefore impossible, which ensures mutual exclusion.
- This algorithm can lead to deadlock. Let V1 = {P1, P2}, V2 = {P2, P3} and V3 = {P3, P1}. If all three processes P1, P2 and P3 simultaneously want to enter the CS, then P1 may reply to P2 while holding P3's request, P2 may reply to P3 while holding P1's request, and P3 may reply to P1 while holding P2's request. In this situation each process receives only one of the two replies it needs, so none can proceed further.
- In this algorithm, if a process crashes and it is not a member of a voting set, its failure does not affect the other processes. The handling of a RELEASE message is summarized below.
o If the queue of deferred requests is not empty, remove the message at its front (head); suppose it is from Pk. Send the reply (vote) to Pk and keep VOTED = true.
o Otherwise set VOTED = false.
3.6 Token Based Algorithms
- Token based algorithms make use of a token, and this unique token is shared among all the processes. Only the process possessing the token can enter the CS. The different token based algorithms differ in the way processes carry out the search for the token.
- These algorithms use sequence numbers instead of timestamps. Each request for the token carries a sequence number, and these sequence numbers advance independently for each process. Whenever a process issues a request for the token, its sequence number is incremented. This helps in differentiating old requests from current requests for the token. As only the process holding the token can enter the CS, these algorithms ensure mutual exclusion.
3.6.1 Suzuki-Kasami's Broadcast Algorithm
- In this algorithm, if a process wants to enter the CS and does not hold the token, it broadcasts a REQUEST message to all the processes. After receiving the REQUEST message, the process holding the token sends it to the requesting process.
- If the process holding the token is already in the CS when it receives the REQUEST message, it sends the token only after it has exited the CS. A process holding the token can repeatedly enter the CS as long as it keeps the token. Two issues are important to consider in this algorithm:
o distinguishing outdated REQUEST messages from current REQUEST messages, and
o determining which processes have outstanding requests for the CS.
- Let REQUEST(Pj, n) be the REQUEST message of process Pj, where n (n = 1, 2, 3, ...) is a sequence number indicating that Pj is requesting its n-th CS execution. Each process Pi maintains an array of integers RNi[1...N], where RNi[j] records the largest sequence number received so far from Pj. If process Pi receives REQUEST(Pj, n), the message is outdated if RNi[j] > n. When process Pi receives REQUEST(Pj, n), it sets RNi[j] = max(RNi[j], n).
- The token carries a queue of requesting processes and an array of integers SN[1...N], where SN[j] indicates the sequence number of the request most recently executed by process Pj. Process Pi sets SN[i] = RNi[i] whenever it finishes its execution of the CS, showing that its request with sequence number RNi[i] has been executed.
- At process Pi, if RNi[j] = SN[j] + 1 then process Pj is currently requesting the token. The process that has just executed the CS checks this condition for all j to find all the processes that are requesting the token, and appends their IDs to the token's requesting queue if they are not already there. The process then sends the token to the process at the head of the requesting queue.
- This algorithm is simple and efficient. It requires 0 or N messages per CS entry, and the synchronization delay is 0 or T: if the process holding the token wants to enter the CS again, no messages and no delay are required.
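The following is a minimal sketch, assuming the description above, of the per-process state in Suzuki-Kasami. Message passing is abstracted: broadcast_request and send_token are caller-supplied callbacks, which are assumptions and not part of the text.

class SuzukiKasamiNode:
    def __init__(self, pid, n):
        self.pid = pid
        self.RN = [0] * n              # highest request number seen per process
        self.token = None              # {'queue': [...], 'SN': [0]*n} when held

    def request_cs(self, broadcast_request):
        # issue a new request only if we do not already hold the token
        if self.token is None:
            self.RN[self.pid] += 1
            broadcast_request(self.pid, self.RN[self.pid])

    def on_request(self, j, n, send_token, in_cs=False):
        # record the highest sequence number seen from Pj; older n is outdated
        self.RN[j] = max(self.RN[j], n)
        # pass the idle token if Pj's request is current (RN[j] == SN[j] + 1)
        if self.token is not None and not in_cs and self.RN[j] == self.token['SN'][j] + 1:
            tok, self.token = self.token, None
            send_token(j, tok)

    def release_cs(self, send_token):
        # leaving the CS: mark our own request as served
        tok = self.token
        tok['SN'][self.pid] = self.RN[self.pid]
        # append every process with an outstanding request to the token queue
        for j in range(len(self.RN)):
            if j not in tok['queue'] and self.RN[j] == tok['SN'][j] + 1:
                tok['queue'].append(j)
        if tok['queue']:
            nxt = tok['queue'].pop(0)
            self.token = None
            send_token(nxt, tok)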
Algorithm
Request for CS
(i) If process Pi wants to enter the CS and does not have the token, it increments its sequence number RNi[i] and sends a REQUEST(Pi, sn) message to all the processes, where sn is the updated value of RNi[i].
(ii) After receiving this request message, process Pj sets RNj[i] to max(RNj[i], sn). If Pj has the token, it sends it to Pi if RNj[i] = SN[i] + 1.
Executing the CS
Exiting the CS
o NONE : None of the above.
- Initially the arrays are set as follows. For every process Pi (i = 1 to N):
o Set PVi[j] = NONE for j = N down to i.
o Set PVi[j] = REQ for j = i-1 down to 1.
o Set PNi[j] = 0 for j = 1 to N.
- Initially process P1 is in the HOLD state; hence PV1[1] = HOLD. For the token, TPV[j] = NONE and TPN[j] = 0 for j = 1 to N.
Algorithm
Request for CS
1. If process Pi wants to enter the CS and does not have the token, it takes the following action.
o Sets PVi[i] = REQ and sends a REQUEST(Pi, sn) message to the other processes indicated by its state vector PVi, where sn is its updated sequence number PNi[i].
2. When process Pj receives REQUEST(Pi, sn), it ignores the request if sn is outdated; otherwise it (Pj) sets PNj[i] = sn and carries out the following activities as per its current state.
o PVj[j] = NONE : Set PVj[i] = REQ.
o PVj[j] = REQ : If PVj[i] ≠ REQ then set PVj[i] = REQ and send a REQUEST(Pj, PNj[j]) message to Pi (else do nothing).
Executing the CS
3. Process Pi executes the CS after receiving the token. Prior to entering the CS it sets PVi[i] = EXE.
Exiting the CS
4. After finishing its execution of the CS, process Pi updates its local arrays PNi and PVi and the token arrays TPN and TPV, and passes the token to a requesting process if there is one.
Example
- Process 1 : PV1[1] = HOLD, PV1[2] = NONE, PV1[3] = NONE. PN1[1], PN1[2], PN1[3] are 0.
- Process 2 : PV2[1] = REQ, PV2[2] = NONE, PV2[3] = NONE. All PNs are 0.
- P1 receives the REQUEST and accepts it, since PN1[2] is smaller than the sequence number in the request.
- As PV1[1] is HOLD : PV1[2] = REQ, TPV[2] = REQ, TPN[2] = 1, PV1[1] = NONE.
- P1 sends the token to process P2.
- P2 receives the token and sets PV2[2] = EXE. After exiting the CS, PV2[2] = TPV[2] = NONE.
- P2 updates PN, PV, TPN, TPV. Since nobody is requesting, PV2[2] = HOLD.
- Suppose P3 issues a REQUEST now. It will be sent to both P1 and P2. Only P2 responds, since only PV2[2] is HOLD (PV1[1] is NONE at present).
- In this algorithm, the processes are organized in a directed tree in which the edges point towards the root, i.e. the process currently holding the token. Each process maintains a local variable HOLDER (H) that points to an immediate neighbour on the path towards the root; at the root itself, HOLDER points to the root. The process currently holding the token has the privilege and acts as the root.
- By chasing the HOLDER (H) variables, every process has a path towards the process holding the token. Every process also maintains a FIFO request queue which stores the requests sent by neighbouring processes for which it has not yet forwarded the token. The steps of the algorithm are given below.
- Consider two processes Pi and Pj where Pj is a neighbour of Pi. If the HOLDER variable Hi = j, it means Pi knows that the root can be reached through Pj; the undirected edge between Pi and Pj is thus treated as a directed edge from Pi to Pj. Hence the H variables trace the path towards the process holding the token.
- A forwarded REQUEST message is noted by using the variable "asked" in order to prevent resending the same request.
Fig. 3.6.1 : Process P4 sends a REQUEST message to P2, which forwards it to process P1
Fig. 3.6.2 : Process P2 receives the token after P1 grants it the privilege
- Initially the token was with process P1 (H1 = "Self"). When P1 receives the REQUEST message and is not executing in its CS, it sends a PRIVILEGE message to its neighbour P2, which was the sender of the REQUEST, and resets its holder variable to H1 = 2. Process P2 in turn forwards the PRIVILEGE message to process P4, since it had requested the token on behalf of P4, and updates its holder variable to H2 = 4.
- RQi is the request queue of Pi: it holds the neighbours (and possibly Pi itself) that have requested the token but have not yet been sent the token. The maximum size of RQi is therefore the number of immediate neighbours of Pi plus one.
Algorithm
1. Request for CS
(If the token is already held by the process, there is no need to send a REQUEST message. A process may need the token for itself or for a neighbour. If the variable Asked is false, it has not already sent a request message.)
If process Pi does not hold the token, RQi is not empty and Asked == F :
  Send a REQUEST message to the parent (the process given by Hi); set Asked = T.
2. If Pi holds the token, is not executing the CS, RQi is not empty and the ID at the head of RQi is not "self" (i.e. Pi had forwarded the request on behalf of a neighbour) :
  Send the PRIVILEGE message (token) to the requesting process at the head of RQi; update Hi to that process (change parent); dequeue RQi.
  If RQi is still not empty, send a REQUEST message to the new parent (Hi) and set Asked = T; else set Asked = F.
4. Execution of CS
Process Pi can enter the CS if it has the token and its own ID is at the head of RQi.
5. Process Pi exiting the CS
If RQi is not empty :
  Dequeue RQi. If the dequeued element is the ID of a neighbour Pk, send the token to Pk and set Hi = k, Asked = F.
  If RQi is still not empty, send a REQUEST message to the new parent process and set Asked = T.
- Under light load the algorithm exchanges only O(log N) messages per CS entry, while under heavy load it exchanges approximately 4 messages per CS entry.
3.6.4 Token Ring Algorithm
- In this algorithm, the processes are logically arranged in a ring. In a bus-based network there is no natural criterion for ordering the processes in the ring; the IP addresses of the machines, or some other criterion, can be used. Each process only needs to know the next process in line after itself.
- Initially process P0 holds the token. The token circulates around the ring: it is passed from process k to process k+1 in point-to-point messages. After acquiring the token from its neighbour, a process checks whether it wants to enter the CS. If yes, it enters the CS; after it finishes its execution of the CS and exits, it passes the token to its successor along the ring. The same token cannot be used to enter the CS again without going around the ring once more.
- If no process is interested in entering the CS, the token simply circulates around the ring at high speed. As only a single process holds the token at any instant, only one process can be in the CS at a time, and there is no starvation. The problems in this algorithm are loss of the token, after which the token needs to be regenerated, and dead processes: if the process to which the token is passed is dead, it must be detected and removed from the ring.
- The number of messages required per entry/exit is unbounded (the token may circulate arbitrarily long before any process wants to enter), and the delay before entry ranges from 0 to n-1 messages.
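A minimal sketch of one hop of the token ring scheme described above; the data structures are illustrative assumptions, not from the text.

def token_ring_step(ring, token_holder, wants_cs):
    """ring        : list of process ids in ring order
    token_holder: id of the process currently holding the token
    wants_cs    : set of ids that want to enter the CS
    Returns (process_that_entered_the_cs_or_None, next_token_holder)."""
    entered = token_holder if token_holder in wants_cs else None
    nxt = ring[(ring.index(token_holder) + 1) % len(ring)]
    return entered, nxt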
3.7 Comparative Performance Analysis of Algorithms
- Three parameters are used to compare the performance of the algorithms: response time, number of messages required and synchronization delay.
3.7.1 Response Time
- In the low load condition of the system, the response time for many algorithms is the round-trip time to acquire the token (2T) plus the time required to execute the CS (E). In Raymond's tree-based algorithm, the average distance between the requesting process and the process holding the token is (log N)/2; the average round-trip delay is therefore T log N.
- Different algorithms vary in their response time in the high load condition of the system; the response time changes as the load on the system increases.
Synchronization Delay
- This is the delay due to the exchange of messages needed after a process exits the CS and before the next site enters the CS. In most of the algorithms, when a process exits the CS it directly sends a REPLY or token message to the next process to enter the CS; as a result, the synchronization delay required is T.
- In Maekawa's algorithm, the process exiting the CS unlocks an arbiter process by sending a RELEASE message; the arbiter process then sends a GRANT message to the next process to enter the CS. Hence the synchronization delay required is 2T.
- In Raymond's algorithm, the two processes that consecutively execute the CS can be at any positions in the tree, and the token passes serially along the edges of the tree to the process that enters the CS next. If the tree has N nodes, the average distance between two nodes is (log N)/2, so the synchronization delay required is T(log N)/2.
Review Questions
Q. 1 Why is clock synchronization required in a distributed system? Explain.
Q. 2 Explain physical clocks in detail.
Q. 3 Write a short note on the Global Positioning System (GPS).
Q. 9 What are logical clocks? Explain Lamport's logical clock with an example.
Module 4
Syllabus
Desirable Features of Global Scheduling Algorithm, Task Assignment Approach, Load Balancing Approach, Load Sharing Approach, Introduction to Process Management, Process Migration, Threads, Virtualization, Clients, Servers, Code Migration.
4.1 Introduction
- In a distributed system, resources are distributed over many machines. If the local node of a process does not have the needed resources, the process can be migrated to another node where these resources are available.
- In this chapter, the resource considered is the processor of the system; each processor forms a node of the distributed system.
- In a distributed system, collecting the state information of all the nodes involves considerable overhead. Further, due to communication delays, this information may already be out of date by the time it is used.
- In the task assignment approach, if there are x tasks and y processors, each task must be assigned to a processor in such a way that the chosen cost function is optimized.
4.4 Load-Balancing Approach
- Load balancing algorithms use this approach. It assumes that equal load should be distributed among all nodes to ensure better utilization of resources. These algorithms transfer load from heavily loaded nodes to lightly loaded nodes in a transparent way in order to balance the load in the system. This is done to achieve good performance with respect to the parameters used to measure system performance.
- From the user's point of view, the performance metric usually considered is response time. From the resource point of view, the performance metric usually considered is total system throughput. The throughput metric involves treating all users in the system fairly while still making progress. As a result, load balancing algorithms focus on maximizing the system throughput.
4.4.1 Classification of Load Balancing Algorithms
- Basically, load balancing algorithms are classified into two types: static and dynamic algorithms.
- Static algorithms are further classified as deterministic and probabilistic.
- Dynamic algorithms are classified as centralized and distributed algorithms.
- Distributed algorithms are further classified as cooperative and non-cooperative.
- Static algorithms do not consider the current state of the system, whereas dynamic algorithms do. Hence, dynamic algorithms can avoid those system states that degrade system performance. However, dynamic algorithms need to collect information about the current state of the system and react to it; as a result, dynamic algorithms are more complex than static algorithms.
- Deterministic algorithms consider the properties of the nodes and the characteristics of the processes to be assigned to them in order to produce an optimized assignment, as in the task assignment approach. A probabilistic algorithm uses information such as the network topology, the number of nodes and the processing capability of each node; such information helps to improve system performance.
- In a centralized dynamic scheduling algorithm, the scheduling responsibility is carried by a single node. In a distributed dynamic scheduling algorithm, various nodes are involved in making the scheduling decisions, i.e. the assignment of processes to processors. In the centralized approach, system state information is collected at a single node, the centralized server node, which takes process assignment decisions based on this information. All other nodes periodically send their state information to the centralized server node. This is an efficient approach for taking scheduling decisions, as the single node knows the load on every node and the number of processes needing service. A single point of failure is the limitation of this approach; moreover, message traffic concentrates towards the single node, which may become a bottleneck.
- In a distributed dynamic scheduling algorithm, various nodes are involved in making the scheduling decisions, which avoids the bottleneck of collecting state information at a single node. A local controller on each node runs concurrently with the others; they take decisions in coordination with each other based on a system-wide objective function instead of a local one.
- In cooperative distributed scheduling algorithms, the distributed entities make scheduling decisions by cooperating with each other, whereas in non-cooperative distributed scheduling algorithms the entities do not cooperate with each other while making their decisions.
4.4.2 Issues in Designing Load-Balancing Algorithms
1. Policy to estimate the workload of a node, called the load estimation policy.
2. Policy to decide whether a process should be executed locally or transferred, called the process transfer policy.
3. Policy for selecting the node to which a selected process should be transferred, called the location policy.
4. Policy for how to exchange system load information among the nodes.
5. Policy to assign priority of execution to local and remote processes on a particular node.
6. Migration limiting policy to determine the number of times a process can migrate from one node to another.
- If a process is processed at its originating node, it is a local process. If a process is processed on a node that is not its originating node, it is a remote process. If a process arrives from another node and is admitted for processing, it becomes a local process on the current node; otherwise it is sent to another node across the network and is then considered a remote process at the destination node.
- A load-balancing algorithm should first estimate the workload of a particular node based on some measurable parameters. These parameters include time-dependent and node-dependent factors such as the total number of processes present on the node at the time of load estimation, the resource demands of these processes, the instruction mixes of these processes, and the architecture and speed of the processor.
- As the current state of the load is what matters, estimation of the load should be carried out efficiently. The total number of processes present on a node is an inappropriate measure for estimating load, because the actual load can vary greatly depending on the remaining service time of those processes; and measuring the remaining service time of all processes is itself a problem.
- Moreover, both measures, the total number of processes and the total remaining service time of all processes, are unsuitable for measuring load in a modern distributed system, where even an idle node may host several processes such as daemons and window managers that wake up periodically and then sleep again.
- Therefore, CPU utilization is considered the best measure for estimating the load of a node. CPU utilization is the number of CPU cycles actually executed per unit of real time. A timer is set up to observe the CPU state from time to time.
Process Transfer Policies
- In order to balance the load in the system, processes are transferred from heavily loaded nodes to lightly loaded nodes. There should therefore be some policy to decide whether a node is heavily or lightly loaded. Most load-balancing algorithms use a threshold policy.
- At each node a threshold value is used. A node accepts a new process to run locally if its workload is below the threshold value; otherwise the process is transferred to another, lightly loaded node. The following policies are used to decide the threshold value.
o Static policy : Each node has a predefined threshold value as per its processing capability. It remains fixed and does not vary with the changing workload at the local or remote nodes. No exchange of state information is needed in this policy.
o Dynamic policy : In this policy, the threshold value for each node is the product of the average workload of all the nodes and a predefined constant for that node. The predefined constant Cn for node n depends on the processing capability of node n relative to the processing capability of the other nodes. Hence, exchange of state information between nodes is needed in this policy.
TechKuowiedge
Publications
Overloaded
Thresht
Underloaded
single-Threshold policy
(a)
4 ¥
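A minimal sketch of a double-threshold (high-low) transfer decision in the spirit of the policies above; the threshold values 2 and 4 and the function name are illustrative assumptions, not from the text.

def transfer_decision(load, low=2, high=4):
    """Classify a node's load region and decide whether a newly arrived
    process should run locally or be transferred elsewhere."""
    if load < low:
        region = 'underloaded'
    elif load <= high:
        region = 'normal'
    else:
        region = 'overloaded'
    run_locally = region != 'overloaded'
    return region, run_locally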
- The following policies are used to exchange state information.
- Periodic Broadcast : In this policy, each node broadcasts its state information after every t units of time. It generates heavy network traffic, and unnecessary messages are transferred even when the state has not changed. It has poor scalability.
- Broadcast When State Changes : In this method, a node broadcasts its state information only when its state changes. A node's state changes if it transfers processes to other nodes or receives processes from other nodes. A further improvement is not to report small changes in state, since nodes take part in load balancing only when they are underloaded or overloaded; in the refined method a node broadcasts its state information only when it moves from the normal region to the underloaded or overloaded region.
- On-Demand Exchange : In this method, a node broadcasts a state information request when its state moves from the normal region to either the underloaded or the overloaded region. The receivers of this request reply with their current state to the requesting node. This method works with the double-threshold policy. In order to further reduce the number of messages, an improvement is to avoid replies from all the nodes: if the requesting node is underloaded, only overloaded nodes can cooperate with it, and if the requesting node is overloaded, only underloaded nodes can cooperate with it in the load-balancing process. In this improved policy the requesting node's state is included in the state information request message, so that only the nodes that can cooperate with it send a reply and the other nodes do not reply at all.
- Exchange by Polling : This method does not use broadcast, which limits scaling. Instead, the node that needs cooperation from other nodes for balancing the load searches for a suitable partner by polling the other nodes one by one. Exchange of state information thus takes place between the polling node and the polled node. The polling process stops either when an appropriate node is located or when a predefined poll limit is reached.
- The main issue in scheduling local and remote processes at a particular node in the load-balancing process is deciding the priority assignment rules. The following rules are used to assign priorities to the processes at a particular node.
o Selfish : In this rule, local processes get higher priority than remote processes.
o Altruistic (Unselfish) : In this rule, remote processes are given higher priority than local processes.
- In the load-balancing approach, exchange of state information among nodes is required, which involves considerable overhead. Since resources are distributed, the load cannot in fact be equally distributed among all nodes; what matters is proper utilization of these resources. In a large distributed system the number of processes at a node is always fluctuating, so temporary unbalancing of the load among nodes always exists.
- Therefore, it is sufficient to keep all the nodes busy: it should be ensured that no node remains idle while other nodes have several processes to execute. This is called dynamic load sharing.
- Load sharing algorithms aim to ensure that no node is idle while other nodes are overloaded. The issues in load sharing algorithms are similar to the issues in load-balancing algorithms, but it is simpler to decide the policies in load-sharing algorithms than in load-balancing algorithms.
- Load-sharing algorithms only ensure that no node is idle. Hence, the simplest policy to estimate load is the total number of processes on a node.
- However, as several processes permanently reside even on an idle node in a modern distributed system, CPU utilization is again considered as the measure to estimate the load on a node.
- Load-sharing algorithms are mainly concerned with two states of a node: busy or idle. They use a single-threshold policy with a threshold value of 1: a node accepts a process only if it has no process, and it transfers a process as soon as it has more than two processes. A node that has just become idle may thus be unable to accept a new process immediately, and this policy wastes the processing power of the idle node.
- The solution to this problem is to transfer a process to a node that is expected to become idle soon. Therefore, some load-sharing algorithms use a single-threshold policy with a threshold value of 2. If CPU utilization is used as the measure of load, then a double-threshold policy should be used.
Location Policies
- Sender-Initiated Policy : A good candidate node to accept processes is one whose load would not exceed the threshold value after accepting the processes for execution. In the case of broadcast, the sender node knows immediately about the availability of a suitable node to accept processes when it receives replies from all the receivers. In random probing, the probing continues until an appropriate receiver is found or the number of probes reaches the static probe limit. If no suitable node is found, the process is executed on its originating node. The probing method is more scalable than broadcast.
- Receiver-Initiated Policy : In this location policy, the receiver decides from where to get a process: a lightly loaded node searches for heavily loaded nodes in order to take load from them. When the load of a node goes below the threshold value, it either broadcasts or randomly probes the other nodes to find a heavily loaded node that can transfer one or more of its processes.
- A good candidate node to take processes from is one whose load would not drop below the threshold value after sending the processes for execution elsewhere. In the case of broadcast, the requesting node knows immediately about the availability of a suitable node when it receives replies from all the receivers. In random probing, the probing continues until a node from which processes can be obtained is found or the number of probes reaches the static probe limit.
- In the latter case, the node waits for a fixed timeout period before trying again to initiate a transfer. The probing method is more scalable than broadcast.
- Preemptive process migration is costlier than non-preemptive migration, as the execution state needs to be transferred along with the process. In non-preemptive process migration, the process is transferred before it starts executing on its current node. Receiver-initiated process transfer is mostly preemptive, while sender-initiated process transfer can be either preemptive or non-preemptive.
- In the load-sharing approach, state information exchange is carried out only when a node changes its state, because a node needs to know the state of the other nodes only when it becomes overloaded or underloaded. In this case the following policies are used.
- Broadcast When State Changes : In this policy, a node broadcasts a state information request when its state changes. In the sender-initiated policy this broadcast is carried out when the node becomes overloaded; in the receiver-initiated policy it is carried out when the node becomes underloaded.
- Poll When State Changes : Polling is a more appropriate method than broadcast in large networks. In this policy, upon a change in state, a node polls the other nodes one by one and exchanges its changed state information with them. The exchange continues until an appropriate node for sharing load is found or the number of probes reaches the poll limit. In the sender-initiated policy the polling is carried out by the overloaded node, and in the receiver-initiated policy by the underloaded node.
- Process management includes the different policies and mechanisms used to share the processors among all the processes in the system.
- Robustness : Failure of any node other than the one on which a process is running should not affect the accessibility and execution of that process.
Process Migration Mechanisms
- Forwarding all the messages meant for the process to its destination node.
- Handling communication between cooperating processes placed on different nodes due to process migration.
A. Total Freezing
- In this method, the process is not allowed to continue its execution once the decision to transfer it to the destination node is taken: the execution of the process is stopped while its address space is being transferred. This method is simple and easy to implement.
- The limitation of this method is that if the process being transferred remains suspended for a long period during migration, timeouts may occur; also, the user may observe delays if the process is interactive.
Pretransferring
- In this method, the process is allowed to keep running on its source node while its address space is being transferred to the destination node. The pages modified at the source node during the transfer of the address space are then retransferred to the destination, and this retransfer of modified pages is repeated until the number of modified pages becomes relatively small. The remaining modified pages are transferred after the process is finally frozen for transferring its state information.
- The first pretransfer operation, the transfer of the whole address space, takes the longest time and therefore also allows the most pages to be modified in the meantime. The second transfer is shorter, since only the pages modified during the first transfer are transferred, and these modifications are fewer than during the first transfer. Subsequently, fewer and fewer pages are transferred in each round, until the number of remaining pages converges to zero or very few.
- These remaining pages are then transferred from the source node to the destination node when the process is frozen. At the source node, the pretransfer operation runs at a higher priority than other programs. In this method the frozen time is reduced, which minimizes the interference of migration with the process's interactions with other processes and users. However, as the same pages can be transferred repeatedly due to modifications, the total time spent in migration can increase.
Transfer on Reference
- In this method, the entire address space of the process is left on its source node, and only the parts needed for execution are requested during execution, as the relocated process runs at the destination node: a page is transferred from the source node when it is referenced at the destination node. This is just like a demand-driven copy-on-reference approach.
- This method is more dependent on the source node and imposes a continuing burden on it. Moreover, if the source node fails or reboots, the process execution will fail.
t
4
<29
= afl
*
es
ie feamertc |
tt
sa
Cae
<ith
oth
-
Message-Forwarding Mechanism
- In this mechanism, messages meant for the migrated process are forwarded along the chain of nodes that the process has visited; the forwarding information at a node is updated when the process migrates again.
- The drawback of this method is that if any node in the chain fails, the process can no longer be located. It also has poor efficiency, since a message may have to traverse several nodes before it reaches the process.
- Two conversions are needed for each processor: one to convert data from the CPU format to the external format and the other in the reverse direction (external format to CPU format).
- In this method, the translation of floating-point numbers, which consist of an exponent, a mantissa and a sign, needs to be handled carefully. As an example of handling the exponent, suppose processor A uses 8 bits, processor B uses 16 bits, and the external data representation, designed for processor A, offers 12 bits. A process can then be migrated from A to B, which involves converting 8 bits to 12 bits and 12 bits to 16 bits for processor B. A process whose exponent data requires more than 12 bits cannot be migrated from processor B to A. Therefore, the external data representation should be designed with at least as many bits as the longest exponent of any processor in the system.
- Migration from processor B to processor A is also not possible if the exponent is greater than 8 bits even though it is less than or equal to 12 bits: the external data representation has a sufficient number of bits, but processor A does not.
- Suppose processor A uses 32 bits, B uses 64 bits and the external data representation uses 48 bits for the mantissa. Handling the mantissa is then similar to handling the exponent, except that transferring a process from B to A results in computation with only half the precision, which may not be acceptable when the accuracy of the result is important. A second problem is the loss of precision caused by multiple migrations between a set of processors. The solution is to design the external data representation with the longest mantissa of any processor. The external data representation should also take care of signed infinity and signed zero.
4.8 Threads
- In an operating system supporting a threads facility, the basic unit of CPU utilization is the thread. A thread is a single sequential stream of execution within a process. Threads improve the performance of an application. Threads are also called lightweight processes, as they possess some of the properties of processes. Each thread belongs to exactly one process.
- In an operating system that supports multithreading, a process can consist of many threads. These threads run in parallel, improving the application performance. Each such thread has its own CPU state and stack, but all of them share the address space of the process and its environment.
- Threads can share common data, so they do not need to use interprocess communication. Like processes, threads have states such as ready, executing and blocked. A priority can be assigned to a thread just as for a process, and the highest priority thread is scheduled first.
- Each thread has its own Thread Control Block (TCB). Like a process context switch, a context switch occurs for a thread and the register contents are saved in the TCB. As threads share the same address space and resources, synchronization is also required for the various activities of the threads.
Sr. No. | Process | Thread
4. | New process termination takes more time as compared to new thread termination. | New thread termination takes less time as compared to new process termination.
5. | Each process executes the same code but has its own memory and file resources. | All threads of a process can share the same set of open files and child processes.
6. | In a process-based server implementation, if one process is blocked, no other server process can execute until the first process is unblocked. | In a multithreaded server implementation, if one thread is blocked and waiting, a second thread in the same process can execute.
7. | Multiple redundant processes use more resources than a multithreaded process. | Multiple threads of a process use fewer resources than multiple redundant processes.
8. | A context switch flushes the MMU (TLB) registers, as the address space of the process changes. | There is no need to flush the TLB, as the address space remains the same after a context switch because the threads belong to the same process.
Thread Creation
- Threads can be created statically, in which case the number of threads remains fixed for the lifetime of the process, or dynamically. In dynamic creation of threads, the process is initially started with a single thread and new threads are created as needed during the execution of the process. A thread may destroy itself after finishing its job by using an exit system call.
- In the static approach, the number of threads is decided while writing or compiling the program, and a fixed stack is allocated to each thread. In the dynamic approach, the stack size is specified through a parameter to the thread creation system call.
Thread Termination
- A thread may destroy itself after finishing its job by using an exit system call. A kill command is used to terminate a thread from outside by specifying its thread identifier as a parameter. In many cases threads are never killed until the process terminates.
Thread Synchronization
- The portion of the code where a thread may be accessing some shared variable is referred to as the critical region (CR). It is necessary to prevent multiple threads from accessing the same data at the same time. For this purpose, a mutex variable is used. A thread that wants to execute in the CR performs a lock operation on the corresponding mutex variable. In a single atomic operation, the state of the mutex changes from unlocked to locked.
- If the mutex variable is already in the locked state, then the thread is blocked and waits in a queue of waiting threads on that mutex variable; otherwise the thread carries out another job but continuously retries to lock the mutex variable. In a multiprocessor system, where threads run in parallel, two threads may carry out the lock operation on the same mutex variable; in this case, one thread waits and the other wins. To exit the CR, a thread carries out an unlock operation on the mutex variable.
- Condition variables are used for more general synchronization. Operations wait() and signal() are provided for a condition variable. When a thread carries out a wait operation on a condition variable, the associated mutex variable is unlocked and the thread is blocked until a signal operation is carried out by another thread on that condition variable. When a thread carries out a signal operation on the condition variable, the mutex variable is locked and the thread blocked on the condition variable starts execution.
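The lock/unlock and wait/signal behaviour described above can be sketched with Python's threading primitives; the shared buffer and helper names below are illustrative assumptions, not part of the text.

```python
import threading

mutex = threading.Lock()                  # mutex protecting the critical region
item_ready = threading.Condition(mutex)   # condition variable tied to the mutex
shared_items = []                         # shared data accessed only inside the CR

def producer():
    with mutex:                           # lock: unlocked -> locked atomically
        shared_items.append("data")
        item_ready.notify()               # signal: wake one thread waiting on the condition

def consumer():
    with mutex:
        while not shared_items:           # wait releases the mutex and blocks the thread
            item_ready.wait()             # re-acquires the mutex before returning
        print("consumed", shared_items.pop())

threading.Thread(target=consumer).start()
threading.Thread(target=producer).start()
```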
- In a user-level implementation, the kernel is unaware of the threads. In this case, the thread package is put entirely in user space. The Java language supports a threading package.
- The user can implement a multithreaded application in the Java language. The kernel treats this application as a single-threaded application. In a user-level implementation, all of the work of thread management is done by the thread package. Thread management includes creation and termination of threads, message and data passing between the threads, scheduling threads for execution, thread synchronization, and saving and restoring the thread context after a context switch.
- Creating and destroying a thread requires less time and is a cheap operation. Its cost is that of allocating memory to set up a thread stack and deallocating the memory while destroying the thread.
- Both operations require little time. Since threads belong to the same process, there is no need to flush the TLB as the address space remains the same after a context switch. User-level threads require extremely low overhead and can achieve high computational performance. Fig. 4.8.1 shows the user-level threads.
Fig. 4.8.1 : User-level threads running on top of the kernel area
Advantages of user-level threads
- No flushing of the TLB or CPU accounting is needed; only the values of the CPU registers need to be stored and reloaded again.
- User-level threads are platform independent. They can execute on any operating system.
- Scheduling can be done as per the needs of the application.
- Thread management is done at the user level by the thread library, so the kernel's burden is taken by the threading package and the kernel's time is saved for other activities.
Disadvantages of user-level threads
- If one thread is blocked on I/O, the entire process gets blocked.
- For applications where, after blocking of one thread, other threads are required to run in parallel, user-level threads are of no use.
Advantages of kernel-level threads
- The kernel can schedule another thread of the same process when one thread is blocked; blocking of one thread does not block the entire process.
- The kernel can simultaneously schedule multiple threads from the same process on multiple processors.
Disadvantages of kernel-level threads
- A context switch requires kernel intervention.
- Kernel threads generally require more time to create and manage than user threads.
- From an availability and security point of view, if one fails, the others will not be affected.
Fig. 4.9.2 : Virtualizing system X on top of system Y (programs use interface X, which is implemented on top of Y)
- With advancement in technology, hardware and low-level system software change reasonably fast compared to the applications and middleware which are at a higher level. Hence, it is not possible for legacy software to keep pace with the platforms it relies on. With virtualization, legacy interfaces can be ported to the new platforms, so that the latter can be opened for large classes of existing programs.
- Large distributed system environments are heterogeneous collections of machines which are required to run different applications efficiently while managing resources easily. In this case, virtualization plays a major role.
- Each application can run on its own virtual machine, perhaps with the related libraries, which, in turn, run on a common platform. In this way, the diversity of platforms and machines can be minimized and a high degree of portability and flexibility can be achieved.
4.9.2 Architectures of Virtual Machines
- Four types of interfaces are offered by any computer system. These are : an interface between the hardware and software, consisting of machine instructions that can be invoked by any program; an interface between the hardware and software, consisting of machine instructions that can be invoked only by the OS; an interface consisting of the system calls offered by the operating system; and an interface consisting of library calls, generally forming an application programming interface (API).
4.10 Clients
- The X Window System (commonly referred to simply as X) is used to control bit-mapped terminals, which include a monitor, keyboard, and a pointing device such as a mouse. It is just like the component of an operating system that controls the terminal. The X kernel comprises all the terminal-specific device drivers and is usually highly hardware dependent. The X kernel offers a relatively low-level interface for controlling the screen, and also for capturing events from the keyboard and mouse.
- This interface is made available to applications as a library called Xlib. The X kernel and the X applications may reside on the same or different machines. In particular, X offers the X protocol, which is an application-level communication protocol by which an instance of Xlib can exchange data and events with the X kernel.
- As an example, Xlib can send a request to the X kernel to create or kill a window, to set colors, to define the type of cursor to display, etc. The X kernel will react to local events such as keyboard and mouse input by sending event packets back to Xlib. Several applications can communicate simultaneously with the X kernel. The window manager application has special rights: it can dictate the "look and feel" of the display as it appears to the user.
4.10.2 Client-Side Software for Distribution Transparency
- In client-server computing, some part of the processing and data level is executed at the client side as well. A special class is formed by embedded client software, such as for automatic teller machines, cash registers, barcode readers, TV set-top boxes, etc. In these cases, the user interface is a relatively small part of the client software, in contrast to the local processing and communication facilities.
- Client software also includes components for achieving distribution transparency. A client should not be aware that it is communicating with remote processes. On the contrary, distribution is often less transparent to servers, for reasons of performance and correctness.
- Access transparency is achieved by the client stub, since the client-side stub offers the same interface as is available at the server. The stub hides the possible differences in machine architectures, as well as the actual communication.
- Location, migration, and relocation transparency are handled in a different manner. A client request is always forwarded to one of the server replicas. Client-side software can transparently collect all responses and pass a single response to the client application. Failure transparency is also handled at the client: the middleware carries out masking of communication failures and can repetitively attempt to connect to the same server or to another server.
4.11 Servers
- A server takes a request sent by a client, processes it, and sends a response back to the client. An iterative server itself handles the request and sends the response back to the client. A concurrent server passes an incoming request to a thread and immediately waits for another request. A multithreaded server is an example of a concurrent server.
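As a rough illustration of a concurrent server (the port number and echo behaviour below are arbitrary assumptions, not from the text), each incoming connection is handed to a new thread so that the main loop can immediately wait for the next request:

```python
import socket
import threading

def handle_client(conn):
    # Worker thread: serve one client while the main loop keeps accepting others.
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)          # simple echo stands in for real request processing

def concurrent_server(port=9000):   # the endpoint (port) at which the server listens
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", port))
        srv.listen()
        while True:
            conn, _addr = srv.accept()
            threading.Thread(target=handle_client, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    concurrent_server()
```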
- A server listens for client requests at an endpoint. A client sends its request to this endpoint, also called a port. These endpoints are globally assigned for well-known services. For example, servers that handle Internet FTP requests always listen to TCP port 21. Some services use a dynamically assigned endpoint; a time-of-day server may use an endpoint that is dynamically assigned to it by the local OS.
- In that case, a client has to search for the endpoint. For this purpose, a special daemon process is maintained on each server machine. The daemon keeps track of the current endpoint of each service implemented by a co-located server. The daemon itself listens to a well-known endpoint. A client will first contact the daemon, request the endpoint, and then contact the specific server.
- Another issue is how to interrupt the server while an operation is going on. In this case, the user suddenly exits the client application, immediately restarts it, and pretends nothing happened. The server will eventually tear down the old connection, thinking the client has most likely crashed. Another technique to handle communication interrupts is to design the client and server such that it is possible to send out-of-band data, which is data that is to be processed by the server before any other data from that client.
- One solution is to let the server listen to a separate control endpoint to which the client sends out-of-band data, while at the same time listening (with a lower priority) to the endpoint through which the normal data passes. Another solution is to send out-of-band data across the same connection through which the client is sending the original request.
- Another important issue is whether the server should be stateless or stateful. A stateless server does not keep information on the state of its clients, and can change its own state without having to inform any client. Consider the example of a file server. Clients access remote files from the server. If the server keeps track of each file being accessed by each client, then the service is called stateful. If the server simply provides requested blocks of data to the client and does not keep track of how the client makes use of them, then the service is called stateless.
- A server cluster offering multiple services can have different machines running different application servers. Therefore, the switch should be able to distinguish services; otherwise it will not be possible to forward requests to the proper machines.
Fig. : A server cluster. Client requests arrive at a logical switch (possibly multiple), which dispatches them to the application/compute servers; these in turn use a distributed file/database system.
- Using a single switch (access point) to forward client requests to multiple servers may make the switch a single point of failure: if it fails, the whole cluster may become unavailable. As a result, several access points can be used, of which the addresses are made publicly available.
- A distributed server is a dynamically varying collection of machines, with also possibly varying access points, which appears to the outside world as a single powerful machine. With such a distributed server, the clients benefit from a robust, high-performing, stable server. The available networking services, particularly mobility support for IPv6 (MIPv6), are used. In MIPv6, a mobile node normally resides in its home network with its home address (HoA), which is stable.
- When the mobile node changes network, it receives a care-of address in the foreign network. This care-of address is reported to the node's home agent (a router attached to the home network), which forwards traffic to the mobile node. All applications communicate with the mobile node using the home address; they never see the care-of address.
- This concept can be used to provide a stable address for a distributed server. In this case, a single unique contact address is initially assigned to the server cluster. The contact address will be the server's life-time address, to be used in all communication with the outside world. At any time, one node in the distributed server will work as an access point using that contact address, but this role can easily be taken over by another node.
- What happens is that the access point records its own address as the care-of address at the home agent associated with the distributed server. At that point, all traffic will be directed to the access point, which will then take care of distributing requests among the currently participating nodes. If the access point fails, a simple fail-over mechanism comes into place by which another access point reports a new care-of address.
- The home agent and the access point may become bottlenecks, as the whole traffic would flow through these two machines. This situation can be avoided by using the MIPv6 feature known as route optimization.
- A simple solution is to let the operating system take care of that by creating a separate process to execute the migrated code. Migration by cloning is a simple way to improve distribution transparency, in which the process creates child processes.
- Following are the alternatives for code migration, summarized in Fig. 4.12.1.
Fig. 4.12.1 : Alternatives for code migration. Both weak mobility and strong mobility can be sender-initiated or receiver-initiated, and in each case the migrated code can be executed in the target process or in a separate process.
Resource-to-machine binding : Unattached | Fastened | Fixed
By value : Copy (or Move, Global Ref) | Global Ref (or Copy) | Global Ref
By type : Rebind (or Move, Copy) | Rebind (or Global Ref, Copy) | Rebind (or Global Ref)
- Code migration between heterogeneous platforms is made easier with the use of scripting languages and portable, virtual-machine based languages.
- Recently, instead of migrating only processes, migration of entire computing environments is preferred. It is possible to decouple a part from the underlying system and actually migrate it to another machine, provided proper compartmentalization is carried out. In this way, such migration supports strong mobility. In this type of migration, many complexities that arise in binding with different types of resources may be solved.
Review Questions
Q. 3
Q. 4   Explain and classify different load balancing algorithms.
Q. 5   Explain different issues in designing load-balancing algorithms.
Q. 6   Explain the policy to estimate the load of a node in the load balancing approach.
Q. 7   Explain different process transfer policies in the load balancing approach.
Q. 8   Explain different approaches to locate the node in the load balancing approach.
Q. 9   Explain different state information exchange policies in the load balancing approach.
Q. 10  What are the different priority assignment techniques in the load balancing approach? Explain.
Q. 11  Explain issues in designing load-sharing algorithms.
Q. 12  Explain different approaches to locate the node in the load sharing approach.
Q. 13  Explain different state information exchange policies in the load sharing approach.
Q. 14  Explain the types of process migration.
Q. 15  Explain the desirable features of a good process migration mechanism.
Q. 16  Explain the mechanism for freezing and restarting the process in process migration.
Q. 17  Explain the address space transfer mechanisms in process migration.
Syllabus
Introduction to replication and consistency, Data-Centric and Client-Centric Consistency Models, Replica Management. Fault Tolerance : Introduction, Process resilience, Reliable client-server and group communication, Recovery.
5.1.1 Reasons for Replication
- Reliability and performance are the two main reasons to replicate data. If one replica of the data crashes, then another replica can be available for the required operation. If a single write operation fails in one replica of a file out of three replicas, then we can consider the value returned by the other replicas. Hence, we can protect our data.
- Replication also improves performance. Scaling of the distributed system can be carried out in numbers or in geographic area. If there is an increasing number of processes accessing the data from the same server, then it is better to replicate the server and divide the work load. In this way, performance can be improved.
- If we want to scale the system with respect to the size of a geographical area, then replication is also needed. If a copy of the data is placed in the proximity of the process which accesses it, then the access time decreases. As a result, the performance as perceived by that process improves.
- Replication and caching are mainly used to improve performance. They are also widely applied as scaling techniques. Scalability issues usually come into view in the form of performance problems. If copies of data are placed close to the processes accessing them, then it definitely improves access time and performance as well.
- To keep all replicas up to date, update propagation consumes more network bandwidth. Suppose process P accesses a local copy of the data n times per second, and the same copy gets updated m times per second. If n << m, the access-to-update ratio is very low. In this case, many updated versions of the data copy will never be accessed by process P, and the network communication to update these versions is useless. This calls for another approach to updating the local copy, or for not keeping the data copy locally at all.
- Keeping multiple copies consistent is subject to serious scalability problems. If one process reads the data copy, then it should always get the updated version of the data. Hence, changes done in one copy should be immediately propagated to all copies (replicas) before performing any subsequent operation. Such tight consistency is offered by what is also called synchronous replication.
- Performing an update as a single atomic operation on all replicas is difficult, as these replicas are widely distributed across a large-scale network. All replicas should agree on when precisely an update is to be carried out locally. Replicas may need to agree on a global ordering of operations using Lamport timestamps, or this order may be assigned by a coordinator. Global synchronization requires a lot of communication time, particularly if replicas are spread across a wide-area network, and is therefore costly.
- The strict assumption about consistency discussed above can be relaxed, based on the access and update patterns of the replicated data. It can be agreed that replicas may not be the same everywhere. This assumption also depends on the purpose for which the data are used.
Fig. 5.2.1 : Physically distributed data store. Processes P1 to P4 perform their operations on a local copy of the data store.
- It is true that there is no single best solution for replicating the data on different machines in a network. Efficient replication can be possible if we consider some relaxation in consistency. The tolerance for this relaxation also depends on the application. Applications can specify their tolerance of inconsistencies in three ways. These are :
o Variation in numerical values between replicas.
o Variation in staleness between replicas.
o Variation with respect to the ordering of update operations.
- If data have numerical semantics for the applications, then they use numerical deviation. For example, if replicated records contain stock market price data, then the application may specify that two copies should not deviate by more than Rs 10. This would be an absolute numerical deviation. On the other hand, a relative numerical deviation could be specified so that two copies deviate by not more than some specific value. In both cases of numerical deviations, if a stock goes up without violating the specified numerical deviation, the replicas would still be considered to be mutually consistent.
- A staleness deviation takes into consideration the last updated time of a replica. Some applications can tolerate that the data supplied by a replica is old data, provided it is not too aged. For example, weather reports typically stay reasonably accurate, as such parameter values do not change suddenly. In such cases, a main server may take the decision to propagate updates to the replicas only once every few hours, periodically.
- At last, there are classes of applications in which the ordering of update operations is permitted to be different at the different replicas, provided the differences stay bounded. These updates can be viewed as being applied tentatively to a local copy, until global agreement is reached from all replicas. As a result, some updates may need to be rolled back and applied in a different order before becoming permanent. Naturally, ordering deviations are harder to grasp than the other two consistency metrics.
- In parallel and distributed computing, multiple processes share resources and may access them simultaneously. The semantics of concurrent access to these resources, when the resources are replicated, led to the use of consistency models.
Sequential Consistency
- The time axis is drawn horizontally and the interpretation of read and write operations performed by a process on a data item is given below. Initially, each data item is considered to be NIL.
o Wi(x)a : Process Pi performs a write operation on data item x with value a.
o Ri(x)b : Process Pi performs a read operation on data item x which returns value b.
" As shownin following Fig. 5.2.2 (a), process P, perform write operation on data item x and modifiesits valueto a.This
write operation is performed by process P, on data store which is local to it. We have assumed that data storeis
replicated on multiple machines. This operation then propagated to other copies of the data store. Process P,later
reads x from its local copy of the data store and see value a. This is perfectly correct for strictly consistent data store.
As shownin Fig. 5.2.2 (b), Process Plater reads x from its local copy ofthe datastore and see value NIL. After some
P,. Itis
time, Process P, see the value as a. This means it takes some time to propagate the updates from processP, to
Fig. 5.2.3
- The following Fig. 5.2.4 violates sequential consistency. Process P3 sees that data item x is first changed to b and later to a. On the other hand, process P4 will conclude that the final value is b.
P1 :  W(x)a
P2 :  W(x)b
P3 :         R(x)b   R(x)a
P4 :         R(x)a   R(x)b
Fig. 5.2.4
- Consider three processes and three integer variables a, b, and c stored in a shared, sequentially consistent data store. An assignment is considered as a write operation and each read statement reads two variables simultaneously. All statements execute indivisibly. The processes are shown below.
Process P1 : a = 1;  read(b, c);
Process P2 : b = 1;  read(a, c);
Process P3 : c = 1;  read(a, b);
Fig. 5.2.5
- The three processes have 6 statements in total, and hence 720 (6!) possible execution sequences. If a sequence begins with statement a = 1, then there are 120 (5!) such sequences. The sequences in which statement read(a, c) appears before statement b = 1, and the sequences in which statement read(a, b) appears before statement c = 1, are invalid. Hence, only 30 of the execution sequences which start with a = 1 are valid. Similarly, 30 sequences starting with b = 1 and another 30 starting with c = 1 are valid. In total, 90 execution sequences are valid out of 720. These can produce different program results, all of which are acceptable under sequential consistency. The counting can be verified with the short sketch below.
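A minimal Python sketch (helper names are illustrative assumptions) that enumerates all orderings of the six statements and keeps only those respecting program order within each process; it reports 720 total orderings and 90 legal interleavings, matching the count above.

```python
from itertools import permutations

# One write followed by one read per process; program order must be preserved.
P1 = ("a=1", "read(b,c)")
P2 = ("b=1", "read(a,c)")
P3 = ("c=1", "read(a,b)")
statements = P1 + P2 + P3

def respects_program_order(seq):
    # Legal interleaving: each process's write appears before its read.
    return all(seq.index(w) < seq.index(r) for (w, r) in (P1, P2, P3))

orderings = list(permutations(statements))
valid = [s for s in orderings if respects_program_order(s)]
print(len(orderings), "total orderings")      # 720
print(len(valid), "legal interleavings")      # 90
```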
Causal Consistency
- The causal consistency model makes a distinction between events that are causally related and those that are not. If an event x is caused by, or occurred because of, a previous event y, then every process should first see y and then x. A causally consistent data store follows the following condition :
- All processes must see the causally related writes in the same order. Writes which are not causally related (concurrent writes) may be seen in a different order on different machines.
Fig. 5.2.6
- In Fig. 5.2.7 (a), W(x)a by process P1 and W(x)b by P2 are causally related writes, but the reading processes see these writes in different orders. Hence, it is a violation of a causally-consistent data store. In Fig. 5.2.7 (b), W(x)a by process P1 and W(x)b by P2 are concurrent writes; this is due to the removal of the read operation. Hence, the figure shows a correct sequence of events in a causally consistent data store.
Fig. 5.2.7
FIFO Consistency
- FIFO consistency is the next step in relaxing consistency, obtained by dropping the requirement that all processes see causally related writes in the same order. It is also called PRAM consistency, and it is easy to implement. It says that :
- All processes must see the writes done by a single process in the order in which they were issued. Writes from different processes may be seen in a different order by different processes.
Grouping Operations
- So far, consistency has been defined at the level of elementary read and write operations, but this granularity does not match the granularity offered to applications. Different programs running concurrently use shared data, and synchronization mechanisms and transactions are used to control the concurrency between these programs. Therefore, program-level read and write operations are grouped together and bracketed with the pair of operations ENTER_CS and EXIT_CS.
- Consider the distributed data store which we have assumed. If a process has successfully executed ENTER_CS, it is guaranteed that its local copy of the data store is up to date. Now the process can safely execute a series of read and write operations inside the CS on that store, and then exit the CS by calling EXIT_CS.
- A series of read and write operations within a program is thus performed on the data, and this data is protected against simultaneous accesses that would lead to seeing something different from the result of executing the series as a whole. It is necessary to have precise semantics for the operations ENTER_CS and EXIT_CS, and synchronization variables are used for this purpose.
Release Consistency
- In the release consistency model, an acquire operation is used to tell the data store that a critical section (CS) is about to be entered, and a release operation is used to tell that a CS has just been exited. These operations are used to protect shared data items. The shared data that are kept consistent are said to be protected. Release consistency guarantees that, when a process carries out an acquire, the store will make sure that all the local copies of the protected data are brought up to date so as to be consistent with the remote copies, if need be.
- After a release is done, the modified protected data are propagated to the other local copies of the store. Carrying out an acquire does not guarantee that locally done updates will be propagated immediately to other copies, and carrying out a release does not ensure that updates are fetched from other copies. The following Fig. 5.2.9 shows valid events for release consistency.
Fig. 5.2.9
- A distributed data store offers release consistency if it obeys the following rules :
o Before a read or write operation on shared data is carried out, all previous acquires done by the process must have completed successfully.
o Before a release is permitted to be carried out, all previous reads and writes carried out by the process must have been completed.
o Accesses to synchronization variables are FIFO consistent.
Entry Consistency
- The shared synchronization variables are used in this model. When a process enters the CS, it should acquire the related synchronization variables, and when it exits the CS, it should release these variables. The current owner of a synchronization variable is the process which last acquired it.
- The owner may enter and exit CSs repeatedly without sending any messages on the network. A process not currently holding a synchronization variable but needing to acquire it has to send a message to the current owner, asking for ownership and the current values of the data coupled with that synchronization variable. The following rules need to be followed.
1. An acquire access of a synchronization variable is not permitted to carry out with respect to a process until all
updates to the guarded shared data have been carried out with respect to that process. |
2. Before an exclusive mode access to a synchronization variable by a process is permitted to carry out with respect
to that process, no other process may keep the synchronization variable, not even in nonexclusive mode.
3. After an exclusive mode access to a synchronization variable has been carried out, any other process's next nonexclusive mode access to that synchronization variable may not be carried out until it has been carried out with respect to that variable's owner.
First condition says that at an acquire, all remote updates to the guarded data must be able to be seen. The second
condition says that before modification to a shared data item, a process must enter a CS in exclusive mode to confirm
that no other process is attempting to update the shared data simultaneously. The third condition says that if a
process wants to enter a CS in nonexclusive mode,it must first ensure with the owner of the synchronization variable.
guarding the CS to obtain the latest copies of the guarded shared data.
- A valid event sequence for entry consistency is shown in Fig. 5.2.10. A lock is associated with each data item. Process P1 does an acquire on x and updates it; later it does an acquire on y. Process P2 does an acquire for data item x but not for y. Hence, process P2 will read value a for data item x, but it may get NIL for data item y. A process that first does an acquire on y will read its value only after y is released by process P1. Correctly associating the data with synchronization variables is one of the programming problems with this model.
- Consistency is concerned with the read and write operations performed by processes on a set of data items. A consistency model explains what can be expected with respect to that set when numerous processes concurrently work on that data.
- Coherence models, on the other hand, consider a single data item. In this case, the assumption is that a data item is replicated at several places; it is said to be coherent when the different copies stick to the policies defined by its associated coherence model.
- In Section 5.2, we considered consistency in the presence of simultaneous updates on a distributed data store. In many situations, data does not get updated simultaneously, or if it does, the situation can easily be resolved, and processes are mostly only reading the data. These data stores obey the eventual consistency model, which is a comparatively weak form of consistency. With client-centric consistency models, many such inconsistencies can be hidden in a comparatively cheap way.
5.3.1 Eventual Consistency
- Practically, in many situations, concurrency exists only in some limited form. If processes mostly read the data and hardly any process performs updates on it, then it is not necessary to propagate updates to all processes immediately.
- In the World Wide Web (WWW), web pages can be updated only by a single authority, i.e., the owner of the page, so there are no write-write conflicts to resolve. Conversely, to get better efficiency, browsers and Web proxies are frequently configured to hold an accessed page in a local cache and to return that page on the next request.
- As a result, such a cache may return an old page, i.e., updates that have taken place do not immediately reach the cached replica. Many users find this inconsistency acceptable.
- Eventual consistency states that updates are propagated gradually to all replicas. Eventually consistent data stores work in a correct manner, provided clients always access the same replica.
- But problems occur when diverse replicas are accessed over a short period of time. Consider a mobile client that accesses one copy of a distributed database and performs operations on it. The client then disconnects from the current copy and later connects to another copy of the database. If the updates are not propagated from the previous copy to this other replica, then he/she may notice inconsistent behavior.
- This is a typical example of the eventual consistency problem. Client-centric consistency models resolve such problems by offering assurance, for a single client, about the consistency of that client's accesses to the data store. No assurances are given about simultaneous accesses to the data store by different clients.
- Consider a data store which is physically distributed on many machines and where network connectivity is unreliable. Each process has its local copy of the entire data store; when a process accesses the data store, it carries out the operation on the local or nearest copy available. Updates are then eventually propagated to the other copies. The following four models are suggested for client-centric consistency :
o Monotonic Reads
o Monotonic Writes
o Read Your Writes
o Writes Follow Reads
- The following notations are used to describe the above models.
o xi[t] : the version of data item x at local copy Li at time t. This version is the result of a series of write operations carried out since initialization, denoted WS(xi[t]).
o WS(xi[t]) : the series of write operations on data item x that occurred at local copy Li at time t.
o WS(xi[t1]; xj[t2]) : denotes that the operations in the series WS(xi[t1]) have also been carried out later at local copy Lj at time t2.
5.3.2 Monotonic Reads
- A data store provides monotonic-read consistency if the following condition holds : if a process reads the value of a data item x, any successive read operation on x by that process will always return that same value or a more recent value.
- This ensures that in a later read operation, the process will always get (read) the latest or the same value (which was returned in the last read operation) and not an older one. Consider the example of a distributed email database. In this case, the mailboxes of users are distributed and replicated on multiple machines. Mail received at any location is propagated in a lazy manner (on demand) to the other copies of the mailbox.
- Only that data gets forwarded which is needed to maintain consistency. Suppose a user reads the mailbox in Mumbai and later flies to Delhi. Monotonic-read consistency guarantees that the messages in the mailbox that were read at Mumbai will also be in the mailbox at Delhi.
- Fig. 5.3.1(a) shows a monotonic-read consistent data store. L1 and L2 are local copies at different machines. Process P carries out a read operation (R(x1)) on x at L1. The value returned to process P is the result of the write operations WS(x1) carried out at L1. Later, process P carries out a read operation (R(x2)) on x at L2. Before P performs the read operation on L2, the writes at L1 are propagated to L2, which is shown in the diagram by WS(x1; x2). Hence, WS(x1) is part of WS(x2). This shows that monotonic-read consistency is guaranteed.
- Fig. 5.3.1(b) shows that monotonic-read consistency is not guaranteed in this case, since the writes at L1 have not been propagated to L2 before the later read takes place.
- In this model, it is important to propagate the write operations in the correct order to all copies. Monotonic-read consistency is guaranteed by the data store only if the condition stated above holds.
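A rough sketch of how a client-side stub might enforce monotonic reads (the replica/write-set bookkeeping below is an illustrative assumption, not the book's protocol): the client remembers the set of writes its last read depended on, and a replica may serve a new read only after it has applied all of those writes.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.applied_writes = set()   # identifiers of writes this replica has applied
        self.value = None

    def apply(self, write_id, value):
        self.applied_writes.add(write_id)
        self.value = value

class MonotonicReadClient:
    def __init__(self):
        self.read_set = set()         # writes observed by this client's previous reads

    def read(self, replica):
        # Monotonic reads: the replica must already contain every write the
        # client has seen before; otherwise those writes must be pulled in first.
        missing = self.read_set - replica.applied_writes
        if missing:
            raise RuntimeError(f"replica {replica.name} must first apply {missing}")
        self.read_set |= replica.applied_writes
        return replica.value

L1, L2 = Replica("L1"), Replica("L2")
client = MonotonicReadClient()
L1.apply("w1", "mail-1")
print(client.read(L1))         # read at L1 succeeds
try:
    print(client.read(L2))     # WS(x1) has not been propagated to L2 yet
except RuntimeError as e:
    print("monotonic-read violation avoided:", e)
```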
5.3.5 Writes Follow Reads
- In this model, the updates seen by the last read operation get propagated before later writes. Writes-follow-reads consistency is guaranteed by a data store if the following condition holds :
- A write operation by a process on a data item x, following a previous read operation on x by the same process, is guaranteed to take place on the same or a more recent value of x than the value that was read.
Fig. 5.3.4
5.4 Replica Management
- Replica management is an important issue in a distributed system. The key issues are to decide where to place replicas, by whom, and the time at which placement should be carried out.
- Another issue is the selection of approaches to keep the replicas consistent. The placement problem includes both placing servers and placing content. Placing a server involves finding the best location, and placing content involves deciding which is the best server for placement of the content.
- For optimized placement, the k best locations out of n (k < n) need to be selected. Following are some of the approaches to find the best locations.
- In the first approach, consider the distance between clients and locations, measured in terms of latency or bandwidth. In this solution, one server at a time is selected such that the average distance between that server and its clients is minimal, given that k servers have already been placed (so n - k locations are left).
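A toy sketch of this greedy selection (the distance matrix and helper names are illustrative assumptions): in each round it picks the candidate location that minimizes the average distance from every client to its nearest already-chosen server.

```python
def greedy_placement(dist, k):
    """dist[c][l] = latency from client c to candidate location l; choose k locations."""
    n_clients, n_locs = len(dist), len(dist[0])
    chosen = []
    for _ in range(k):
        best_loc, best_cost = None, float("inf")
        for loc in range(n_locs):
            if loc in chosen:
                continue
            # Average distance of every client to its nearest server if 'loc' is added.
            cost = sum(min(dist[c][l] for l in chosen + [loc])
                       for c in range(n_clients)) / n_clients
            if cost < best_cost:
                best_loc, best_cost = loc, cost
        chosen.append(best_loc)
    return chosen

# Example: 3 clients, 4 candidate locations, choose k = 2 replica servers.
latency = [[10, 40, 35, 80],
           [50, 15, 30, 70],
           [60, 45, 20, 10]]
print(greedy_placement(latency, 2))
```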
- In the second approach, instead of considering the position of clients, take the topology of the Internet formed by autonomous systems (ASs) in which all the nodes run the same routing protocol. Initially take the largest AS and place a server on the router having the largest number of network interfaces. Repeat the algorithm with the second largest AS, and so on. Both these algorithms are expensive in terms of computation; they take O(N^2) time, where N is the number of locations to be checked.
- In the third approach, a region to place the replicas is identified quickly, in less time. This identified region comprises a collection of nodes accessing the same content, but for which the inter-node latency is low. The identified region containing the largest number of nodes compared to the other regions is chosen, and one of its nodes is selected to play the role of replica server.
5.4.2 Content Replication and Placement
- Replicas are classified into the following three types :
1. Permanent replicas
2. Server-initiated replicas
3. Client-initiated replicas
- Suppose a file F is stored at a server S. A count of the incoming requests for F is registered at S, together with the clients from which the requests come; requests from a client Ci are associated with the server closest to Ci.
- A deletion threshold del(S, F) is maintained. It says that, if the count of requests for file F at server S drops below this threshold del(S, F), then the file should be removed from server S. Removing the file again reduces the number of replicas of the same file, which can again increase the load on the other servers. Deletions are therefore carried out in a manner that ensures at least one copy of the file survives somewhere.
- In the same way, file F is replicated if the number of incoming requests for it exceeds the replication threshold rep(S, F). If the count of requests for file F is in between these two thresholds, then the file is permitted to migrate to a server in the proximity of the clients requesting it.
- Every server periodically re-assesses the placement of the files it stores. If the number of access requests for F drops below the threshold del(S, F) at server S, it deletes F, provided it is not the last copy. Likewise, replication is carried out only if the total number of access requests for F at S exceeds the replication threshold rep(S, F).
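The three-way decision can be summarized in a small sketch (the threshold values and helper names are illustrative assumptions):

```python
def placement_decision(request_count, del_threshold, rep_threshold, is_last_copy):
    """Server-initiated replication: decide what to do with file F at server S."""
    if request_count < del_threshold and not is_last_copy:
        return "delete replica"              # too few requests: drop this copy
    if request_count > rep_threshold:
        return "replicate to other servers"  # very popular: create extra copies
    if del_threshold <= request_count <= rep_threshold:
        return "may migrate closer to requesting clients"
    return "keep (last remaining copy)"

print(placement_decision(request_count=3,   del_threshold=10, rep_threshold=100, is_last_copy=False))
print(placement_decision(request_count=500, del_threshold=10, rep_threshold=100, is_last_copy=False))
print(placement_decision(request_count=40,  del_threshold=10, rep_threshold=100, is_last_copy=False))
```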
- These replicas are initiated by the client and are also called client caches. Caches are normally managed by clients. In many cases, the cached data may become inconsistent with the data store. Client caches improve the access time to data, as accesses are carried out from the local cache. This local cache can be maintained on the client's machine or on a separate machine in the same LAN as the client. A cache hit occurs if the data is accessed from the cache.
- A cache holds the data for a short period. This is either to prevent extremely stale data from being used or to make room for other requested data from the data store. Caches can be shared between many clients to improve cache hits. The assumption is that data requested by one client may be useful to another nearby client. This assumption may be correct for some types of data stores.
- A cache is usually placed either on the same machine as its client or on a machine shared by clients on the same LAN. In some other cases, system administrators may place a shared cache between a number of departments or organizations, or for a whole country. Another approach is to place servers acting as caches at specific locations in a wide area network. The client then locates the nearest server and requests it to keep copies of the data the client was previously accessing from somewhere else.
- In an invalidation protocol, if updates occur in one copy, then the other copies are informed about these updates. This informs the other copies that the data they hold are no longer valid. It may specify which part of the data store has been updated, so that only the changed part of a copy is actually invalidated.
- In this case, only a notification is propagated. Before an operation is carried out on an invalidated copy, the copy needs to be updated first, depending on the particular consistency model that is to be supported. The network bandwidth needed to propagate notifications is small. Invalidation protocols work best in situations where many update operations are carried out compared to read operations (i.e., the read-to-write ratio is relatively small).
- As a balance between push-based and pull-based approaches, a hybrid form of update propagation based on leases is used. A lease is a time interval during which the server pushes updates to a client. In other words, it is a promise by the server that it will push updates to the client for a specific time. After expiry of the lease, the client is forced to poll the server for updates and pull in the modified data if required. In another approach, a client requests a new lease for pushed updates when the earlier lease expires.
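A condensed sketch of the lease idea (class and field names are illustrative assumptions, not from the text): while the lease is valid the client relies on pushed updates; once it expires, the client pulls the latest data or renews the lease.

```python
import time

class Server:
    def __init__(self):
        self.value = "v1"
    def pull_latest(self):
        return self.value            # client explicitly polls after lease expiry

class Lease:
    def __init__(self, duration_s):
        self.expires_at = time.time() + duration_s   # server's promise to push until then
    def valid(self):
        return time.time() < self.expires_at

class Client:
    def __init__(self, server, lease_duration_s=0.1):
        self.server = server
        self.cached = server.pull_latest()
        self.lease = Lease(lease_duration_s)

    def receive_push(self, value):
        self.cached = value          # server pushes updates while the lease is valid

    def read(self):
        if self.lease.valid():
            return self.cached       # rely on pushed updates
        self.lease = Lease(0.1)      # renew the lease (or simply poll once)
        self.cached = self.server.pull_latest()
        return self.cached

srv = Server()
cli = Client(srv)
print(cli.read())        # served from cache while the lease is valid
time.sleep(0.2)          # let the lease expire
srv.value = "v2"
print(cli.read())        # lease expired: client pulls the latest value
```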
- The use of unicasting or multicasting depends on whether updates need to be pushed or pulled. If a server that is part of the data store pushes updates to n other servers using unicasting, then it sends n separate messages; with multicasting, the underlying network takes care of delivering the message to the n servers.
- Suppose all replicas are placed in a LAN where hardware broadcasting is available. In that case broadcasting or multicasting is cheap, while unicasting the update is more expensive and less efficient. Multicasting can be combined with a push-based approach, where a single server sends a message to the multicast group of the other n servers.
5.5 Introduction to Fault Tolerance
- In a distributed system, partial failure may occur. In a partial failure, the failure of one component of the system may affect the operation of some other components, while yet other components remain unaffected. The distributed system should be designed to recover from such partial failures automatically, without seriously affecting the performance.
- A distributed system should tolerate faults. A fault tolerant system is strongly related to a dependable system. Dependability covers the following requirements for a distributed system, along with other important requirements.
o Availability : It refers to the probability that the system is operating correctly. If the system is working at any given instant of time, then it is a highly available system.
o Reliability : It refers to the system working continuously without failure or interruption during a relatively long period of time. Although a system may be continuously available, we cannot necessarily say that it is reliable.
o Safety : Safety refers to the situation that when a system temporarily fails to function correctly, nothing catastrophic takes place.
o Maintainability : It refers to how easily a failed system can be repaired without any trouble. A system should be easily maintainable.
- Send-omission failures refer to the loss of messages between the sending process and its outgoing buffer. Receive-omission failures refer to the loss of messages between the incoming buffer and the receiving process.
Arbitrary Failures
- In this type of failure, a process or communication channel behaves arbitrarily. The responding process may return wrong values, or it may set a wrong value in a data item. A process may arbitrarily omit an intended processing step or carry out an unintended processing step.
- Arbitrary failures also occur with respect to communication channels. Examples of these failures are : the message content may change, the same message may be delivered repeatedly, or non-existent messages may be delivered. These failures occur rarely and can be recognized by the communication software.
Timing Failures
- In a synchronous distributed system, limits are set on the process execution time, the message delivery time and the clock drift rate. Hence, timing failures are applicable to such systems.
- In a timing failure, a clock failure affects a process when its local clock drifts from perfect time or exceeds the bound on its drift rate.
- A performance failure affects a process if it exceeds the defined bounds on the interval between two steps.
- A performance failure also affects a communication channel if the transmission of a message takes longer than the defined bound.
Masking Failures
- A distributed system is a collection of many components, and components are themselves constructed from collections of other components. Reliable services can be constructed from components which exhibit failures.
- For example, suppose data is replicated on several servers. In this case, if one server fails, then the other servers would provide the service. A service masks a failure either by hiding it or by transforming it into a more acceptable type of failure.
- The use of redundancy is the main technique for masking faults. Redundancy is categorized as information redundancy, time redundancy, and physical redundancy. With information redundancy, extra bits are added for recovery. For example, a Hamming code added at the sender side to the transmitted data allows recovery from noise.
- Time redundancy is particularly useful in the case of transient or intermittent faults. In this, an action is performed again and again if needed, with no harm. For example, an aborted transaction can be redone with no harm.
- With physical redundancy, extra hardware or software components are added to the system to tolerate the loss or malfunctioning of some components.
5.6 Process Resilience
- In a distributed system, failures of processes may happen. In this case, fault tolerance is achieved by replicating processes into groups.
5.6.1 Design Issues
- If identical processes are organized in a group, then a message sent to this group gets delivered to all the group members. Every process in the group receives this message. If one of the members fails, the others can take over this process's job.
- These process groups may be managed dynamically. It is possible to create new groups and destroy old groups. During system operation, a process can join a group or leave it, and a process can be a member of multiple groups simultaneously. As a result, mechanisms are required for group management and group membership.
sses to deal with sets of proces ses as a single abstraction. Hence a pro
Groups permit proce from
r locations, which may va ry
8
ers or number o f servers an d thei
Sm
with out knowin g to whic h serv
a group of servers
one call to the next. . The
decisions are made collectively
In flat grou p, all the processesare at equal level and all
~ Twotypes of groupsexist. ma king is complex as all
are
ion is t hat the re is no sing le point offailure. But decision
advantage of this organizat
ad.
ing incu r delay and overhe
involved in this process. Vot group, a request
s aS COO rdinator andall
the others are workers. In this
e pro ces s act
Insimple hierarchical gro
up, on ropriate
-
over to coordinator an d
then it decides which workeris app
nt is firs t get s han ded
from worker or e xternal clie or crashes the
lure is the problem. If coo rdinat
it the re. In this group sing le point of fai
forwards
to carry it out, and
whole group will affect.
| |
Group Membership
- A group server is responsible for creating and destroying groups. It also allows a process to join or leave a group. The group server maintains a complete database of all the groups and their exact membership. In this approach, a single point of failure is again the problem: if the group server crashes, then group management will be affected.
- Another approach is to manage group membership in a distributed manner. If reliable multicasting is available, an external process can send a message to all group members declaring its wish to join the group. To leave a group, a member sends a goodbye message to every other member.
- As soon as a process joins the group, it should receive all the messages sent to that group. After leaving a group, the process should not receive any message from that group, and the other group members should not receive messages from it. There should be some protocol to deal with the situation where so many machines fail that the group can no longer function at all.
5.6.2 Failure Masking and Replication
- A group of identical processes permits us to mask one or more faulty processes in that group. A group can be formed by organizing identical processes; hence, a single vulnerable process can be replaced by this group to tolerate the fault. Replication of the processes can be carried out either with primary-based protocols or through replicated-write protocols.
- Replicated-write protocols are used in the form of active replication, in addition to using quorum-based protocols. These solutions consider a flat group. The main advantage is that such groups have no single point of failure, but this comes at the cost of distributed coordination, which incurs delay and overhead.
- Consider replicated-write systems. The amount of replication required is an important issue to consider. A k-fault-tolerant system survives faults in k components and still meets its specifications. If processes fail silently, then having k + 1 of them is sufficient to give k fault tolerance: if k of them simply stop, then the reply from the remaining one can be used. In the case of Byzantine failures, where processes continue to run when faulty and send out erroneous or random replies, a minimum of 2k + 1 processes is required to attain k fault tolerance.
- Fault tolerance can be achieved by replicating processes in groups. In many cases, it is required that a process group reaches some agreement: for example, electing a coordinator, deciding whether or not to commit a transaction, dividing up tasks among workers, etc. Reaching such agreement is uncomplicated provided communication and processes are all perfect; otherwise problems may arise.
- The general aim of distributed agreement algorithms is to have all the correctly working processes reach an agreement on some issue, and to establish that consensus within a finite number of steps. The following cases should be considered.
o Synchronous versus asynchronous systems : A system is synchronous if and only if the processes are known to operate in a lock-step mode.
o Communication delay is bounded or not : Bounded delay means every message is delivered within a globally known and predetermined maximum time.
o Message delivery is ordered or not : Messages from the same sender are delivered in the same order in which they were sent.
- In the following example, we assume that processes are synchronous, messages are unicast while preserving ordering, and communication delay is bounded. Consider n processes, with n = 4 and k = 1, where k is the number of faulty processes. The goal is that each process should build a vector of length n.
- Step 1 : Each non-faulty process i sends vi to the other processes, using reliable multicasting. Consider Fig. 5.6.1. In this case, process 1 sends 1 and process 2 sends 2, while process 3 lies to everyone, sending x, y, and z respectively. Process 4 sends 4. Process 3 is the faulty process, and processes 1, 2 and 4 are non-faulty processes.
Fig. 5.6.1 : Process 3 is the faulty process
- Step 2 : The results collected by each process from the other processes, in the form of vectors, are shown below.
o Process 1 : (1, 2, x, 4)
o Process 2 : (1, 2, y, 4)
o Process 3 : (1, 2, 3, 4)
o Process 4 : (1, 2, z, 4)
- Step 3 : Now every process passes its vector to every other process. The vectors that each process receives from the other processes are shown below. Since process 3 is faulty, it lies and sends 12 newly invented values, a to l.
o Process 1 receives : (1, 2, y, 4) from process 2, (a, b, c, d) from process 3, and (1, 2, z, 4) from process 4.
o Process 2 receives : (1, 2, x, 4) from process 1, (e, f, g, h) from process 3, and (1, 2, z, 4) from process 4.
o Process 4 receives : (1, 2, x, 4) from process 1, (1, 2, y, 4) from process 2, and (i, j, k, l) from process 3.
- Step 4 : Each process examines the i-th element of the vectors it has received. If any value has a majority, that value is put into the result vector; if no value has a majority, the corresponding element is marked UNKNOWN. Processes 1, 2 and 4 all arrive at the same result vector (1, 2, UNKNOWN, 4), so the non-faulty processes reach agreement despite the faulty process.
- In general, if there are k faulty processes in the system, then agreement can be achieved only if 2k + 1 correctly functioning (non-faulty) processes are present, for a total of 3k + 1 processes.
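The majority step can be illustrated with a small sketch (vector values follow the example above; helper names are assumptions):

```python
from collections import Counter

def agree(received_vectors):
    """Element-wise majority vote over the vectors a process has collected."""
    result = []
    for elements in zip(*received_vectors):
        value, count = Counter(elements).most_common(1)[0]
        result.append(value if count > len(received_vectors) // 2 else "UNKNOWN")
    return result

# Vectors received by process 1 from processes 2, 3 (faulty, lying) and 4.
vectors_at_p1 = [(1, 2, "y", 4), ("a", "b", "c", "d"), (1, 2, "z", 4)]
print(agree(vectors_at_p1))   # [1, 2, 'UNKNOWN', 4]
```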
- For proper masking of failures, it is necessary to detect them. Faulty members need to be detected by the non-faulty processes; the detection of member failure is the main focus here. Two approaches are available for detecting a process failure: either a process actively asks other processes, by sending messages, whether they are still active, or it passively waits until messages arrive from the other processes.
- The latter approach is useful only when it can be certain that there is sufficient communication between processes. In fact, actively pinging processes is generally followed. The timeout approach suggested for knowing whether a process has failed or not suffers from two major problems. First, declaring a process failed because it does not respond to ping messages may be wrong, as this may be due to an unreliable network. Second, there is no practical implementation that can correctly determine the failure of a process purely on the basis of timeouts.
- Gossiping between processes can also be used to detect a faulty process. In gossiping, a process tells other processes about its active state. In another solution, neighbors probe each other for their active state instead of relying on a single node to take the decision. It is also important to distinguish failures of a process from failures of the underlying network.
- This problem can be solved by taking feedback from other nodes or processes. In fact, when observing a timeout on a ping message, a node requests other neighbors to check whether they can reach the presumed failing node. If the node is still active, that information can be forwarded to other interested nodes.
- While considering faulty processes, fault tolerance in a distributed system should also consider failures of the communication channels. Communication failures are equally important in fault tolerance. Communication channels may also suffer from crash, omission, timing, and arbitrary failures.
- Reliable communication channels are required for exchanging messages. Practically, while designing reliable communication channels, the main focus is given to masking crash and omission failures; arbitrary failures may occur as well.
- In Remote Procedure Calls (RPC), communication is established between the client and server. In RPC, a client-side process calls a procedure implemented on a remote machine (the server). The primary goal of RPC is to hide communication while calling the remote procedure, so that remote procedure calls look just like local procedure calls.
- If any errors occur, then masking the differences between remote and local procedure calls is not easy. In an RPC system, the following failures may occur.
o The client cannot locate the server.
o The request message from the client to the server is lost.
o The server crashes after receiving a request.
o The reply message from the server to the client is lost.
o The client crashes after sending a request.
- One solution for the failure in which the client cannot locate the server is to use exceptions or signals that are invoked upon such a failure. However, it is not possible to use this approach in systems or languages that do not support exceptions or signals.
Lost Request Messages
- To handle this problem, a timer is started when the client sends the request; if the timer expires before the reply arrives, retransmission of the request is carried out. The question is how the server handles the original request and the retransmission. Consider the case when the client sends a request to the server: the request arrives at the server, the server processes the request, and the reply is sent to the client.
In this case client retransmits request message when reply doesnotarrive before timer expires.It concludes that
request is lost and then retransmits it. But, in this case client actually remains unaware about what is happened
actually. Idempotent operations can be repeated. These operations produce same result although repeated any
numberof times. Requesting first 512 bytes offile is idempotent.It will only overwrite at client side.
Some request messages are non-idempotent. Suppose, request to serveris to transfer amount from one account to
other account. In this case, each retransmission of the request will be processed by server. Transfer of amount from
one account to other account will be carried out many times. The solution to this problem is, try to configureall
requests in idempotentway.This is not possible as many requests are inherently non-idempotent.
In another solution, the client assigns a sequence number to each request. The server can then keep track of the most recent request from each client and can differentiate between the original (first) request and a retransmission; it can then refuse to process any request a second time. The client may also put an additional bit in the message header to mark retransmissions, as a retransmitted message requires more care to handle.
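A minimal sketch of the server-side bookkeeping just described (per-client sequence numbers so that retransmitted, non-idempotent requests are not executed twice). The class and method names are illustrative, not part of any RPC library.

    class DedupServer:
        def __init__(self):
            self.last_seq = {}      # highest sequence number processed per client
            self.last_reply = {}    # cached reply for that request

        def handle(self, client_id, seq_no, request):
            # A retransmission carries the same sequence number as the original,
            # so the cached reply is returned instead of re-executing the request.
            if self.last_seq.get(client_id) == seq_no:
                return self.last_reply[client_id]
            reply = self.execute(request)            # non-idempotent work happens only once
            self.last_seq[client_id] = seq_no
            self.last_reply[client_id] = reply
            return reply

        def execute(self, request):
            return "done: " + request

    server = DedupServer()
    print(server.handle("c1", 1, "transfer 100"))    # executed
    print(server.handle("c1", 1, "transfer 100"))    # retransmission: cached reply returned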
Client Crashes
In this case, the client sends a request to the server and crashes before the reply arrives. The computation for this request is still going on at the server. Such a computation, for which no one is waiting, is called an orphan. This computation at the server side (orphan) wastes CPU cycles, locks files and holds resources.
Suppose the client reboots and sends the same request; if the reply from the orphan arrives immediately afterwards, there will be confusion. The orphan problem therefore needs a solution in RPC operation. Four possible solutions are suggested below.
1. Orphan Extermination : Client maintains log of message on hard disk mentioning whatit is about to do. Aftera
reboot,log is verified by client and the orphan is explicitly killed off. This solution needs to write every RPC record
on disk. In addition, these orphans may also do RPC, hence, creating grandorphans or further descendents that
are difficult or practically not possible to locate.
2. Reincarnation : In this solution, there is no need to write a log on disk. Time is divided into sequentially numbered epochs. After its reboot, the client broadcasts a message to all machines announcing the start of a new epoch. Receivers of this broadcast then kill all remote computations running on behalf of this client. This solution suffers from network delays, so some orphans may stay alive. Fortunately, when they report back, their replies will include an out-of-date epoch number, so they can easily be detected.
3. Gentle Reincarnation : When an epoch broadcast arrives, each machine checks whether any remote computations are running locally and, if so, tries its best to locate their owners. Computations are killed only if their owners cannot be located.
4. Expiration : Each RPC is given a standard amount of time T to do the job. If it cannot finish in this time, it must explicitly ask for another quantum. After a crash, the client only needs to wait for time T before rebooting, so that all orphans are surely gone.
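A sketch of the reincarnation idea: the rebooted client broadcasts a new epoch number, and every machine kills computations started on behalf of that client in an older epoch. All names here (Orphan, kill_orphans_for) are hypothetical, and the broadcast is simulated as a direct call.

    class Orphan:
        def __init__(self, owner, epoch):
            self.owner, self.epoch = owner, epoch

    running = [Orphan("clientA", 4), Orphan("clientB", 7), Orphan("clientA", 5)]

    def kill_orphans_for(owner, new_epoch):
        # Keep only computations that do not belong to the rebooted client,
        # or that already belong to the new epoch; older ones are orphans.
        global running
        running = [c for c in running
                   if c.owner != owner or c.epoch >= new_epoch]

    kill_orphans_for("clientA", new_epoch=6)              # clientA rebooted into epoch 6
    print([(c.owner, c.epoch) for c in running])          # clientA's epoch-4 and epoch-5 work is gone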
- Reliable multicast services guarantee that messages are delivered to all members of a process group, besides satisfying various ordering constraints. It is simple to handle the case when agreement already exists on who is a member of the group and who is not. Especially, if it is assumed that processes do not fail and do not join or leave the group during communication, then reliable multicasting simply means that every message should be delivered to each current group member. In the simple case, there is a requirement of receiving the messages by all group members in the same order, and this is easy to implement if the number of receivers is small. Sometimes, members may receive messages in any order.
- Consider the case where the underlying communication system offers only unreliable multicasting. In this case, some messages may be lost or may not be delivered to all the receivers (group members). Suppose messages are received in the order they are sent. A sequence number is assigned by the sender to each message it multicasts, and the sender maintains a buffer to store already sent messages until an acknowledgment (ACK) for each message is received. It is then easy for a receiver to detect that it is missing a message. For a missed message, the sender receives a negative acknowledgment (NACK) from the receiver and then retransmits the message. In another solution, the sender retransmits a message when it has not received all acknowledgments within a certain time. Piggybacking can be used to minimize the number of messages.
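A minimal sketch of the sender-side bookkeeping described above: each message gets a sequence number, is kept in a history buffer until acknowledged by all members, and is retransmitted when a negative acknowledgment arrives. The class and the send callback are illustrative only.

    class MulticastSender:
        def __init__(self, group):
            self.group = group
            self.next_seq = 0
            self.history = {}                       # seq -> message, kept until fully ACKed
            self.acked = {}                         # seq -> set of members that ACKed

        def multicast(self, message, send):
            seq = self.next_seq
            self.next_seq += 1
            self.history[seq] = message
            self.acked[seq] = set()
            for member in self.group:               # unreliable multicast: some sends may be lost
                send(member, seq, message)
            return seq

        def on_ack(self, member, seq):
            self.acked[seq].add(member)
            if self.acked[seq] == set(self.group):  # everyone has it: buffer entry can be freed
                del self.history[seq]
                del self.acked[seq]

        def on_nack(self, member, seq, send):
            # A receiver noticed a gap in sequence numbers and asked for a retransmission.
            send(member, seq, self.history[seq])

    sender = MulticastSender(["P1", "P2", "P3"])
    log = lambda member, seq, msg: print("send", member, seq, msg)
    sender.multicast("m0", log)
    sender.on_nack("P2", 0, log)                    # P2 missed m0 and sent a NACK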
It is necessary to reduce numberof feedback messages while offering the scalable solution to reliable multicasting. A
more accepted model used in case of several wide-area applications is feedback suppression. This approach underlies
the Scalable Reliable Multicasting (SRM) protocol. In SRM, receiver only sends NACK for missed message. For
successfully delivered messages, receiver does not report acknowledgements (ACKs) to sender. Only NACKs are
returned as feedback. Whenevera receiver notices that it missed a message, it multicasts its feedback to the rest of
the group.
In this way, all the members of the group know that message m is missed by this receiver. If k number of receivers has
already missed this message m, then each of k members has to send NACKto sender so that m can be retransmitted.
On the other hand, suppose retransmissions are always multicast to the entire group, only a single request for
retransmission can be sent to sender.
For this reason, a receiver R within group who has missed the message m sends the request for retransmission after
some random timehas elapsed. If, in the meantime, another request for retransmission for m reaches R, R will hold
back its own feedback (NACK), knowing that m will be retransmitted soon. In this manner, only a single feedback
message (NACK) will reach sender S, which in turn next retransmits m.
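The random back-off used for feedback suppression can be sketched like this. It is a simulation with illustrative names and timings, not the actual SRM protocol implementation.

    import random

    class Receiver:
        def __init__(self, name):
            self.name = name
            self.nack_due = None                      # time at which we plan to multicast our NACK

        def detect_loss(self, now):
            # Schedule the NACK after a random delay instead of sending it at once.
            self.nack_due = now + random.uniform(0.0, 1.0)

        def overheard_nack(self):
            # Another receiver already asked for the retransmission: suppress our own NACK.
            self.nack_due = None

    receivers = [Receiver("R%d" % i) for i in range(5)]
    for r in receivers:
        r.detect_loss(now=0.0)

    first = min(receivers, key=lambda r: r.nack_due)  # the shortest timer fires first
    for r in receivers:
        if r is not first:
            r.overheard_nack()                        # its multicast NACK suppresses the others

    print("only", first.name, "sends a NACK to the sender")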
_ Feedback suppressionhas been usedas the fundamental approach for a numberofcollaborative Internet applications.
Although, this mechanism has shownto scale reasonably well, it also introduces a numberof serious problems.It
requires accurate scheduling of feedback messagesat each receiver so that single request for retransmission will be
returned to the sender. It may happen that, manyreceivers will still return their feedback simultaneously. Setting
timers for that reason in a group of processesthatis spread across a wide-area networkis difficult.
Other problem in this mechanismis that, a retransmitted messagealso gets delivered to the group members who have
already successfully received this message. Unnecessarily, these processes have to again process this message. A
solution to this problem is to let these processes which have missed the message m, join another multicast groupto
receive message m. This solution requires efficient group managementwhich is practically not possible in wide area
network, A better approachis therefore to let receivers that tend to miss the same messages team up andsharethe -
same multicast channel for feedback messages and retransmissions.
Fig. 5.8.1 : (a), (b) Hierarchical feedback control (Root, S - Sender, C - Coordinator)
Now we consider reliable multicasting in the presence of process failures. It is required to guarantee that a message is delivered either to all processes or to none. It is also necessary to guarantee that all messages are delivered in the same order to all processes. This is called the atomic multicast problem.
Virtual Synchrony
Weassumethateach distributed system has communication layer within which messages are sent and received. On
each node, messagesare first buffered in communication layer and then delivered to the application which runsin
higher layer. Group view is view on the set of processes in the group which sender had when message m was
multicast. Every process in group should have same view about delivery of message m sent by sender. All should agree
- that, message should be delivered to each memberin group view and to no other member.
Suppose message m is multicast by sender with group view G and while multicast is going on, new processjoins the
group or leaves the group. Because of this group view changes due to change in group membership. This change in
group membership message c (joining or leaving group) is multicast to all the members of the group. Now wehave
two multicast messages in transit: m and c. In this case, guarantee is needed abouteither m is delivered to all
processes in group view G beforeeach one of them delivered messagec, or m is not deliveredat all.
When the group membership change is the result of the crash of the sender of m, then delivery of m is permitted to fail. In this case, either all members of G should learn about the new group membership or none should. This guarantees that a message multicast to group view G is delivered to each nonfaulty process in G. If the sender crashes during the multicast, then the message may either be delivered to all remaining processes or ignored by each of them. Reliable multicast with this property is said to be virtually synchronous.
Consider processes P1, P2, P3, and P4 in group view G = {P1, P2, P3, P4}. After some messages have been multicast, P3 crashes, so the group view becomes G = {P1, P2, P4}. Before the crash, P3 succeeded in multicasting a message to P2 and P4 but not to P1. Virtual synchrony guarantees that this message will not be delivered at all; this establishes the situation that the message had never been sent before the crash of P3. Communication then proceeds between the remaining members after removing P3 from the group. After P3 recovers, it can join the group again, provided its state has been brought up to date.
Message Ordering
The multicasts are classified with the following four orderings :
1. Unordered multicasts
2. FIFO-ordered multicasts
3. Causally-ordered multicasts
4. Totally-ordered multicasts
An unordered multicast which is reliable is a virtually synchronous multicast. In this multicast, no guarantees are given regarding the order in which received messages are delivered to different processes. Consider the example in which reliable multicasting is offered by a library with send and receive primitives. The receive operation blocks the calling process until a message is delivered to it. Consider a group with three communicating processes P1, P2, and P3. Following is the ordering of events at each process.

    Process P1      Process P2      Process P3
    sends m1        receives m1     receives m2
    sends m2        receives m2     receives m1
Process P1 multicasts two messages m1 and m2 to the group members. Assume the group view does not change during the multicast. Process P1 multicasts first m1 and then m2. Suppose the communication layer at P2 receives them in the same order and delivers first m1 and then m2. In contrast, suppose the communication layer at P3 receives first m2 and then m1. As there is no ordering constraint, messages may be delivered in the order they were received.
In case of reliable FIFO-ordered multicasts, messages from the same process are delivered by the communication layer in the same order as they have been sent. Consider communication between four processes. In FIFO ordering, as shown below, this rule is followed by all processes in the group. Message m1 will always be delivered before m2; in the same way, m3 will always be delivered before m4. If the communication layer at a particular process receives m2 first and then m1, m2 should not be delivered until it has received and delivered m1. On the other hand, messages received from different processes may be delivered by the communication layer in the order they have been received, without any further constraint across senders.
    Process P1      Process P2      Process P3      Process P4
    sends m1        receives m1     sends m3        receives m3
    sends m2        receives m3     sends m4        receives m1
                    receives m2                     receives m2
                    receives m4                     receives m4
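FIFO ordering is commonly implemented with a per-sender sequence number and a hold-back queue at the receiver, as in the following sketch. The names are illustrative and do not refer to a specific library.

    class FifoReceiver:
        def __init__(self):
            self.expected = {}        # next sequence number expected from each sender
            self.held_back = {}       # (sender, seq) -> message that arrived too early

        def receive(self, sender, seq, message, deliver):
            exp = self.expected.get(sender, 0)
            if seq != exp:
                self.held_back[(sender, seq)] = message   # e.g. m2 arrived before m1: hold it back
                return
            deliver(sender, message)
            self.expected[sender] = exp + 1
            # A delivery may unblock messages that were held back.
            nxt = (sender, self.expected[sender])
            while nxt in self.held_back:
                deliver(sender, self.held_back.pop(nxt))
                self.expected[sender] += 1
                nxt = (sender, self.expected[sender])

    r = FifoReceiver()
    out = lambda s, m: print("deliver", m, "from", s)
    r.receive("P1", 1, "m2", out)     # arrives first but is held back
    r.receive("P1", 0, "m1", out)     # now m1 and then m2 are delivered, preserving FIFO order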
In reliable causally ordered multicasts, potentially causally related messages are delivered by the communication layer considering the causality between them. If message m1 causally precedes message m2, then at the receiver side the communication layer will deliver m1 first and then m2. These messages can be from the same process or from different processes.
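One standard way to enforce causal delivery, not spelled out in the text above, is to use vector clocks: a message is delivered only when all messages that causally precede it have already been delivered. A small sketch with illustrative names, assuming a fixed group of three processes.

    N = 3                                             # number of processes in the group

    class CausalReceiver:
        def __init__(self, pid):
            self.pid = pid
            self.vc = [0] * N                         # messages delivered so far, per sender
            self.queue = []                           # (sender, timestamp, message) held back

        def deliverable(self, sender, ts):
            # It is the next message from that sender, and nothing it depends on is missing.
            return (ts[sender] == self.vc[sender] + 1 and
                    all(ts[k] <= self.vc[k] for k in range(N) if k != sender))

        def receive(self, sender, ts, msg):
            self.queue.append((sender, ts, msg))
            progress = True
            while progress:
                progress = False
                for item in list(self.queue):
                    s, t, m = item
                    if self.deliverable(s, t):
                        print("P%d delivers %s" % (self.pid, m))
                        self.vc[s] += 1
                        self.queue.remove(item)
                        progress = True

    r = CausalReceiver(pid=2)
    r.receive(1, [1, 1, 0], "m2")    # depends on m1 from process 0, so it is held back
    r.receive(0, [1, 0, 0], "m1")    # m1 is delivered, after which m2 becomes deliverable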
Total-order multicast imposes an additional constraint on the order of delivery of messages. It says that message delivery may be unordered, FIFO or causally ordered, but messages should be delivered to all processes in the group in the same order. Virtually synchronous reliable multicasting which offers totally-ordered delivery of messages is called atomic multicasting.
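The text does not prescribe a mechanism, but one common way to obtain total ordering is to route every multicast through a sequencer process that stamps messages with a global position; every receiver then delivers in stamped order. The following is a sketch under that assumption, with hypothetical names.

    class Sequencer:
        def __init__(self):
            self.next_order = 0

        def assign(self, message):
            order = self.next_order
            self.next_order += 1
            return order, message                     # (global position, payload)

    class TotalOrderReceiver:
        def __init__(self, name):
            self.name = name
            self.next_expected = 0
            self.pending = {}

        def receive(self, order, message):
            self.pending[order] = message             # messages may arrive out of order
            while self.next_expected in self.pending:
                print(self.name, "delivers", self.pending.pop(self.next_expected))
                self.next_expected += 1

    seq = Sequencer()
    a, b = TotalOrderReceiver("P1"), TotalOrderReceiver("P2")
    stamped = [seq.assign(m) for m in ("m1", "m2")]
    for receiver in (a, b):
        for order, msg in reversed(stamped):          # even when received out of order...
            receiver.receive(order, msg)              # ...both processes deliver m1 before m2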
5.9 Recovery
5.9.1. Introduction
It is necessary to recover from thefailure of processes. Process should always recover to the correct state. In error
recovery,it is necessary to replace erroneous state to error-free state. In backward recovery, system is brought from
current erroneous state to previous state. For this purpose, a system state is recorded and restored after some
interval. Whencurrentstate of the system is recorded, a checkpointis said to be
made.
In forward recovery, an attempt is made to bring system from current erroneous state to new current state
from
whichit can carry onits execution. Forward recovery is possible if type current error occulted
is known.
In distributed system, backward recovery techniques are widely applied as mechanism to recover from
failures.It is
generally applicable to all systems and processes and can be integrated in middleware as general-pu
rposeservice. The
disadvantageofthis technique is that,it degrades performanceas it is costly to restore
previous state.
As backward recovery is general technique applicable to all systems and applications, guarante
e cannot be given about
occurrenceoffailure after recovery from the same one. Hence, applications
support is needed for recovery which
cannotgive full-fledged failure transparency. Moreover, rolling back to state
such as moneyis already transferred to
other accountis practically impossible.
5.9.2 Stable Storage
- The information needed to recover to a previous state must be stored safely so that it survives process crashes, site failures and also storage media failures. In distributed systems, stable storage is therefore important for recovery.
- Storage comes in three categories: RAM, which is volatile; disk, which survives CPU failures but can be lost if the disk heads fail; and finally stable storage, which is designed to survive all types of failures except major calamities.
- Stable storage is implemented with pair of ordinary disks. Each block on second
drive is an exact copy of the
corresponding block onfirst drive. Update ofblock first takes placein first
drive and once updated,it is verified. Then
same block on second drive is done.
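The pair-of-disks scheme just described can be sketched as follows. The two "drives" are modelled as dictionaries and a checksum stands in for the verification step; everything here is illustrative, not a real disk driver.

    import zlib

    drive1, drive2 = {}, {}          # block number -> (data, checksum)

    def checksum(data):
        return zlib.crc32(data)

    def stable_write(block_no, data):
        # Write and verify on the first drive, only then update the second,
        # so at every moment at least one drive holds a good copy of the block.
        drive1[block_no] = (data, checksum(data))
        assert drive1[block_no][1] == checksum(data)   # verification of drive 1
        drive2[block_no] = (data, checksum(data))

    def recover(block_no):
        # After a crash, compare the copies; a bad or missing copy on one
        # drive is repaired from the other.
        d1, d2 = drive1.get(block_no), drive2.get(block_no)
        good = d1 if d1 and d1[1] == checksum(d1[0]) else d2
        drive1[block_no] = drive2[block_no] = good
        return good[0]

    stable_write(7, b"account balance = 500")
    del drive2[7]                     # simulate a crash between the two disk writes
    print(recover(7))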
5.9.3 Checkpointing
- In backward error recovery, it is required to save the system state regularly on stable storage. It is also needed to record a consistent global state, called a distributed snapshot. In a distributed snapshot, if one process has recorded the receipt of a message, then there should also be a process that has recorded the sending of that message.
- The recovery line signifies the most recent consistent collection of checkpoints.
Fig. 5.9.1 : A recovery line (initial state, checkpoints, recovery line, failure, inconsistent collection of checkpoints)
Fig. 5.9.2 : The domino effect (initial state and checkpoints of P1 and P2)
- After a crash of process P2, it is required to restore its state to the most recently recorded checkpoint. As a result, process P1 will also need to be rolled back. This requires P1 to roll back to a previous state. However, the next state to which P1 is rolled back does not show the sending of message m, although P2 has recorded the receipt of it. As a result, P2 again needs to be rolled back further. This cascading of rollbacks, possibly all the way back to the initial state, is the domino effect.
Coordinated Checkpointing
- In coordinated checkpointing, all processes synchronize to write their state to local stable storage in a cooperative manner. As a result, the saved state is automatically globally consistent, and the domino effect is avoided.
- A two-phase blocking protocol is used to coordinate checkpointing. A coordinator first multicasts a CHECKPOINT_REQUEST message to all processes. A process receiving this message takes a local checkpoint, queues any subsequent messages handed to it by the application it is running, and acknowledges to the coordinator that it has taken a checkpoint. When the coordinator has received an acknowledgment from all processes, it multicasts a CHECKPOINT_DONE message so that the blocked processes can continue.
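A sketch of the two-phase coordination described above; message passing is simulated with direct method calls, and the message names follow the text.

    class Process:
        def __init__(self, name):
            self.name = name
            self.blocked = False
            self.queued = []                      # application messages held while blocked

        def on_checkpoint_request(self):
            self.blocked = True                   # stop sending application messages
            print(self.name, "takes local checkpoint")
            return "ACK"

        def on_checkpoint_done(self):
            self.blocked = False                  # the saved global state is now consistent
            for m in self.queued:
                print(self.name, "sends queued", m)
            self.queued.clear()

    class Coordinator:
        def __init__(self, processes):
            self.processes = processes

        def checkpoint(self):
            # Phase 1: multicast CHECKPOINT_REQUEST and collect acknowledgments.
            acks = [p.on_checkpoint_request() for p in self.processes]
            # Phase 2: once every process has acknowledged, multicast CHECKPOINT_DONE.
            if all(a == "ACK" for a in acks):
                for p in self.processes:
                    p.on_checkpoint_done()

    Coordinator([Process("P1"), Process("P2")]).checkpoint()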
(Figure : message m2 is never replayed, so neither will m3)
However, the state after the recovery is inconsistent with the state before it: the recovering process keeps a message that was sent before the crash, but whose receipt and delivery do not happen when replaying what had taken place before the crash.
The other way of carrying out recovery is basically to start over again. It may be much cheaper to optimize for
recovery, then it is aiming for systemsthat are free from failures for a long time. This approachis called as recovery-
oriented computing. One approach is simply reboot the system.
For clever reboot only a part of the system, it is essential to localize the fault correctly. Just then, rebooting simply
involves deleting all instances of the identified components together with the threads operating on them, and (often)
to just restart the associated requests. Practically, rebooting as a recovery technique requires few or no dependency
between system components.
In another approach to recovery-oriented computing, checkpointing and recovery techniques are applied, but execution is carried out in a changed environment. The basic idea is that, if programs are allocated more buffer space, the order of message delivery is changed, and so on, then many failures can be avoided.
Review Questions
Q.1  What is replication? Write the advantages of replication.
Q.2  Explain replication as a scaling technique.
Q.3  What is continuous consistency? Explain.
Q.4  Explain sequential consistency model with example.
Q.5  Explain causal consistency model with example.
Q.6  Explain FIFO consistency model with example.
Q.14  Explain read your writes and writes follow reads client centric consistency.
Q.19  Explain different failures that can occur in RPC and their solutions.
Q.25  What is stable storage? How does it play a role in recovery of a distributed system?
oOo
Distributed File Systems and Name Services
6.1 Introduction
- A file system describes how files are structured, named, accessed, used, protected and implemented. Files are used for permanent storage of information on secondary storage. The file system also provides sharing of information.
- In a distributed system, files are available on several computers. Computers in a distributed system can share these physically dispersed files by using a distributed file system. A service offers a particular function to clients and is a software entity running on one or more machines; a server runs the service software on a single machine. A client process invokes the service through some set of defined operations called the client interface.
- The file system offers file services to clients. The set of primitive file operations are create a file, delete a file, read from a file, and write to a file; the client interface is formed with the set of these operations. A file server controls the local secondary-storage devices, such as disks, on which files are stored. These files are accessed from these devices as per requests of clients.
- There can be different implementations of a distributed file system (DFS). A server may run on dedicated machines, or in another implementation both client and server may run on the same machine. A DFS can be part of a distributed operating system or a network operating system, or it can be a software layer managing communication between the file system and its clients, appearing to the client as a conventional centralized file system. The servers and storage devices dispersed on different machines in the network should be invisible to clients; the DFS should fulfil the request of a client by arranging the required files or data.
- As data transfer is involved in the operation of a DFS, its performance is measured by the amount of time required to service a client request. The storage space managed by a DFS includes different and remotely located small storage spaces.
- DFS supports the following :
o Remote information sharing : Any node in the system can transparently access a file irrespective of its location.
o User mobility : DFS allows the user to work on different nodes at different times without physically relocating the secondary storage devices.
o Availability : DFS keeps multiple copies of a file on different nodes. Failure of any node or copy does not affect the operation.
© Diskless workstations : DFS provides transparentfile accessing capability; hence, economical diskless workstations
can be used.
Transparency
Structure transparency: DFS uses multiple file servers. Each file server is user or kernel process to control secondary
storage devices of that node on whichit runs. Client should be unaware about location and numberoffile servers and
storages devices. DFS shouldtreatthe client just like single conventional file system offered by centralized time sharing
OS.
Access Transparency: Accessing the local and remote file should be carried out in similar manner. DFS should
automatically locate accessedfile and support to transporting the datatoclient.
Naming Transparency: name offile should remain same when it is moved from one node to other. Its name should be
location independent.
Replication Transparency: Existence of multiple replicated copies and their location should remain hidden from
clients.
User mobility
DFS should permit the user to work on different nodes at different times. Performance should notaffect if user works
on nodeotherthan his node. User9s homedirectory should be automatically made available when user logs in on new
node.
Performance
DFS must give performance sameascentralized file system. User should not feel the need to place file explicitly to
improve performance.
Scalability
The DFS should also support growth in the number of nodes and users in the network. Such growth should not lead to service loss or performance degradation of the system.
High Availability
DFS should continue to function if a partial failure occurs in one or more components. Such a failure can be a communication link failure, a node failure, or a secondary storage failure; some degradation in performance due to such failures may occur.
Security
DFS should be secured so that it can offer privacy to users' data. It should implement security mechanisms to protect file data from unauthorized access. Secure access rights should also be supported.
- File attributes are extra information associated with a file by the operating system.
- The list of attributes is not the same for all systems and varies from one operating system to another. No existing system supports all of these attributes, but each one is present in some system.
6.3.2 Mutable and Immutable Files
4 Mutable Files : Most of the OS uses this model. In this file, each update operation onfile update overwritesits old
as single stored sequence that is changed by each
content and new content is produced. Hence,file is represented
update operation.
4 Immutable Files : In this file model, each update operation creates new version of the file. Changes are made in new
version and old version is retained. Hence, more storage space is required.
- A DFS may use one of the following models to service a client's request to access a file.
- Remote Service Model : In this model, the client forwards the request to the server to access a remote file. The naming scheme locates the server, and the actual data transfer between client and server is achieved through the remote-service mechanism. In this mechanism, the request for access is forwarded to the server, which performs the access and returns the result to the client. This is similar to disk access in a conventional file system. Hence, data packing and communication overhead are significant in this model.
- Data-Caching Model : Caching can be used to improve the performance of the remote-service mechanism. In a conventional file system, caching is used to reduce disk I/O; the main goal behind caching in the remote-service mechanism is to reduce both network traffic and disk I/O. If data is not available locally, it is copied from the server machine to the client machine and cached there. This cached data is then used by the client to process its requests locally. This model offers better performance and scalability.
All the future repeated accesses to this recently cached data can be carried outlocally. This will reduce additional
networktraffic. Least Recently Used (LRU) algorithm can be usedfor replacing the cached data. Master copyoffile is
available on serverandits part is scattered on manyclient machines. If copy of the file at client side modifies thenits
master copy at server should be updated accordinglycalled as cache-consistency problem.
DFS caching is network virtual memory which works similar to demand-paged virtual memory having remote server as
backing store. In DFS, data cached at client side can be disk blocks or it can be entire file. Actually more data are
cached byclient than actually needed so that most of the operations are carried outlocally.
The unit of data transfer is fraction of data that is transferred betweenclient and server dueto single read or write
operation. In data-caching modelof accessing remotefiles, following four models are used to transfer the data.
One-time transfer of the whole file, converted to a form compatible with the client's file system, is required. The drawback of this form is that it requires sufficient space at the client, as the entire file is cached. Amoeba, CFS and the Andrew File System use this transfer model.
Cache Location
- Suppose the original location of the file is the server's disk. The following three locations can be used to cache the data.
Client disk : It involves disk access cost at client machine on cache hit. It eliminates network access. In case of crash,
data remainsin client disk and hence, no need to access again from server for recovery. Thereis no loss of data as disk
is permanent storage. Hence,it offers reliability. Disk also has large storage capacity compared to main memory,
resulting in higher hit ratio. Most of the DFS usesfile level transfer for which caching in disk is better solution as disk
has large storage spaceforfile. The drawbackis that, this policy does not work for diskless workstations.
Client9s main memory, It works for diskless workstations and avoids network access cost and disk access cost.It
contributes scalability and reliability as access request is served locally on cache hit.It is not preferable compared to
client disk cache if increased reliability and large cache size is required.
Cachelocation can either be main memory or disk. If cache is kept in main memory then modifications done on cached
data will lost due crash. If caches are kept in disk then they are reliable. No need to fetch the data during recovery as
data resides on disk. Following are the advantages of main memory caches.
oO It permits for diskless workstations
o Data access takes less time from main memory comparedto access from disk.
os Performance speedup is achieved with larger and inexpensive memory which technology demandstoday.
are kept in main
o To speedup I/O server caches are kept in main memory.If both server caches and user caches
memory then single caching mechanism can be used for both.
- In the write-through policy, every write access has to wait until the information is sent to the server. The advantages of data caching are therefore only for read accesses, as the remote-service method is effectively used for all write accesses.
Delayed-Write Policy
- In this policy, there is a delay in writing modifications to the master copy on the server: modifications are written first to the cache and then to the master copy at a later time. The first advantage of this policy is that write accesses complete in less time, as writes are made to the cache.
- Second, data may be overwritten before it is written back to the master copy, so only the last update needs to be written.
- The limitation of this policy is that if the client crashes, unwritten data is lost; hence it is less reliable.
- There are variations of the delayed-write policy. One choice is to flush a block as soon as it is about to be ejected from the client's cache. This alternative can lead to good performance, but some blocks can exist in the client's cache for a long time before they are written back to the server.
- A compromise between this choice and the write-through policy is to scan the cache at regular intervals and flush blocks that have been modified since the most recent scan. One more variation on delayed write is to write data back to the server when the file is closed. This write-on-close policy is used in the Andrew File System (AFS).
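The delayed-write policy can be sketched as a client cache that batches modified blocks and flushes them at intervals or on close, so that only the last update of a block reaches the server. The names and the 30-second interval below are illustrative assumptions.

    import time

    class WriteBackCache:
        def __init__(self, server, interval=30.0):
            self.server = server                  # stands in for the remote file server
            self.dirty = {}                       # block -> latest data; repeated writes coalesce
            self.interval = interval
            self.last_flush = time.time()

        def write(self, block, data):
            self.dirty[block] = data              # completes locally, no network wait

        def maybe_flush(self):
            # Scan-at-regular-intervals variant of the policy.
            if time.time() - self.last_flush >= self.interval:
                self.flush()

        def flush(self):
            for block, data in self.dirty.items():
                self.server[block] = data         # only the last value of each block is sent
            self.dirty.clear()
            self.last_flush = time.time()

        def close(self):
            self.flush()                          # write-on-close, as in AFS

    server = {}
    cache = WriteBackCache(server)
    cache.write(0, b"v1")
    cache.write(0, b"v2")                         # overwrites v1 before it ever reaches the server
    cache.close()
    print(server)                                 # {0: b'v2'}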
A client machine should always use cached data that is consistent with the master copy at the server. If the client determines that its cached copy is out of date, then it should fetch and cache the up-to-date copy of the data.
Following two approaches are used : client-initiated validation and server-initiated validation.
- A cached copy is associated with a client, whereas a replica is associated with a server; a replica is more persistent and more widely known than a cached copy.
Advantages of Replication
o Increased Availability : If the primary copy is not available due to a failure, access can still proceed using a replica, so replication also offers reliability against permanent failures of a single copy.
o Reduced Network Traffic : If a replica is available close to the client, requests can be served locally, which reduces network traffic.
- Replication Transparency : DFS should provide the same client interface for replicated and non-replicated files. Following are the two important issues related to replication transparency : naming of replicas and replication control.
Naming of Replicas
4 As immutable objects are easily supported by kernel, single identifier can be assignedto all the replicas of immutable
object. Asall copies are identical and immutable, kernel can use any copy. In case of mutable objects,all copies may
not be consistent at particular instance of time.
- If single identifier is assigned toall replicas of mutable object then kernel cannot decide which replica is most up-to-
date. Therefore consistency control and management for mutable objects should be carried out outside the kernel,
~ Naming system should map a user supplied identifier to the appropriate replica of mutable object. In caseif all the
replicas are consistent then mapping must provide locationof the replicas and their distance from client node.
Replication Control
- Replication control is transparent from user and handled automatically. Replication can be carried out system
automatically or can be carried out manually.
4 In explicit replication, users control the process of replication. Created process specifies server on which file should be
placed. If needed then additional copies are created as per request fromusers. Users also haveflexibility to delete one
or more replica. In implicit replication, entire process of replication is controlled by system automatically. Users
Me
remain unawareof this process. Server to place thefile is selected by system automatically. System also creates and
deletes replicas as per replication policy.
As replicas of file exist on multiple machines, it is necessary to keepall the copies consistent. If update takes place on
one copy, it must be propagated to all other copies. Following approachesare used :
Read-Only Replication
In this approach, only immutablefiles are replicated as they are used only in read only mode.Thesefile gets updated
after longer period of time.
Read-Any-Write-All Protocol
This protocol allows replication of mutable files. In this approach, read operation is performed on any copy of the file
but write operation is performed by writing to all copies of file. Before performing update to any copy,all copies are
locked, then they are updated, and finally locks are released to complete the write.
Available-Copies Protocol
4 In read-any-write-all protocol, if server with replicated copy is down at the time of write operation then write cannot
be performed. Available-copies protocol allows this operation. In this approach, read operation is performed by
reading any available copyofthefile but write operation is performed by writing to all available copies offile.
The assumptionin this protocol is that, down server when recovers, it brings its state up-to-date from other server's
copies. This protocol provides high availability but does not prevent inconsistencies in failure of communication links.
Quorum-Based Protocol
- This protocol is applicable to the network partition problem, where the replicas of a file are partitioned into two or more active groups. All the protocols discussed above have some restrictions.
Following are the definitions of read and write quorum.
(i)  Read quorum : To read the file, if a minimum of r copies of a replicated file F have to be consulted out of the n replicated copies of F, then the set of r copies is called a read quorum.
(ii) Write quorum : To carry out a write operation on the file, if a minimum of w copies of replicated file F have to be written out of the n replicated copies of F, then the set of w copies is called a write quorum.
Restriction in this protocol is that sum of r and w should be greater than n (r+w>n). It ensures that, between any pair
of read-quorum and write-quorum,there is at least one copy common whichis up-to-date. This protocol needs to
identify current updated copyto in order to update other copiesof the file. This problem is resolved with assigning the
version number to copy when it gets updated. Highest version number copy in quorum is current updated copy. New
Version number to be assigned is one more than the current version number.
~ In read operation, only highest version numbercopyis selected
to read from read quorum. For write operation, only
highest version number copyis selected from write quorum to carry. out write operation. Before
performing write, the
Version number is incremente d by one. Once update is carried out, the new update and new version number is written
_ toall the replicas. . f. an ong AY IN # at: ig
Consider n = 8, r = 4, and w = 5. In this example, the condition r + w > n is satisfied. If a write operation is carried out on write quorum {3, 4, 5, 6, 8}, then these copies are the updated versions with the new version number. If a read operation is then performed on a read quorum that includes copies 1, 7 and 3, then copy 3 is the common copy, since every read quorum must contain at least one copy of the previous write quorum. Copy 3 therefore has the largest version number in the read quorum, and the read is carried out from it.
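A sketch of the quorum rule with version numbers, using the n = 8, r = 4, w = 5 configuration from the example. The replica contents are simulated in memory, the helper names are illustrative, and the fourth member of the read quorum (copy 2) is chosen arbitrarily here.

    n, r, w = 8, 4, 5                                   # r + w > n guarantees overlap
    replicas = {i: {"version": 0, "data": None} for i in range(1, n + 1)}

    def quorum_write(write_quorum, data):
        assert len(write_quorum) >= w
        # Find the current version among the contacted copies and go one higher.
        new_version = max(replicas[i]["version"] for i in write_quorum) + 1
        for i in write_quorum:
            replicas[i] = {"version": new_version, "data": data}

    def quorum_read(read_quorum):
        assert len(read_quorum) >= r
        # The copy with the highest version number is guaranteed to be current,
        # because every read quorum overlaps every write quorum in at least one copy.
        best = max(read_quorum, key=lambda i: replicas[i]["version"])
        return replicas[best]["data"]

    quorum_write({3, 4, 5, 6, 8}, "balance=500")
    print(quorum_read({1, 2, 7, 3}))                     # copy 3 carries the update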
Consensus with Weighted Voting : In this protocol, different replicas are assigned different numbers of votes, considering performance and reliability.
o If replica X is accessed more frequently, then more votes are assigned to it. The size of a quorum depends on the replicas selected for it. In this protocol, to ensure a non-null intersection of read and write quorums, the condition is that the votes collected by a read quorum plus the votes collected by a write quorum must exceed the total number of votes assigned to all replicas.
DFS forms basis for many distributed application and sharing of data is fundamental toit. DFS supports to share data
by multiple processes over long period of time while offering security and reliability.
- Sun Microsystems' Network File System (NFS) and the Andrew File System (AFS) are examples of distributed file systems.
- NFS was developed by Sun Microsystems and is used on Linux to join the file systems of different computers into one logical whole. Version 3 of NFS was introduced in 1994; NFSv4 was introduced in 2000 and offers a number of enhancements over version 3.
NFS Architecture
The local UNIX file system interface is replaced by an interface to the virtual file system (VFS). The operations carried out on the VFS interface are either passed to the local file system or to the NFS client component. The NFS client then performs a remote procedure call (RPC) to access files at the remote server; this means the NFS client implements file operations as RPCs to the remote server. VFS hides the differences between the different file systems.
The NFS server on the server side handles incoming requests. The RPC server stub unmarshals these requests and the NFS server converts them to regular VFS file operations, which are afterwards passed to the VFS layer. VFS invokes the local file system where the actual files reside. In this way, NFS is independent of local file systems, provided the local file system is compliant with the file system model of NFS.
for reading the data from file, read operation is used. Client specifies offset and numberof bytes to read. Writing data
tofile is carried out with write operation. The position and numberof bytes to write is specified by client.
-
for communication, NFS protocol is placed on top of RPC layer. This is due to independence of NFS on OS, network
for the
| architecture and transport protocols. Open network computing remote procedure call (ONC RPC) is used
everal RPC requests in one request in order to reduce number of messagesto
communication. NFS supports to group >
be exchanged.
This is called a compound procedure; it does not have any transactional semantics. All the operations in a compound procedure are executed in the order of their requests. Conflicts cannot be avoided in case the same operations are invoked by other clients simultaneously.
In NFS, a stateless server was implemented initially, which cannot support locking a file for operations; therefore, a separate lock manager was used to handle this situation. In later versions, a stateful approach is implemented to support working across wide-area networks so that clients can use caches effectively.
The NFS naming model offers transparent access to a remote file system at the server for its requesting clients. NFS allows a client to mount part of a file system exported by a server; the exported directory can be maintained in the client's local name space. The drawback of this approach is that it does not by itself allow sharing of files.
Remote clients access the exported directories. Every NFS server exports one or more of its directories, along with their subdirectories, so in fact entire directory trees are usually exported as a unit and then accessed by remote clients.
In version 4 of NFS, it can be variable up to 128 bytes. File handle is stored and used by client for most of the
operations, and hence, avoids look up for file which improves performance.
After deleting file, server cannot reuse the same file handle asit can be locally stored by client. This may lead to wrong
file access by using the same file handle byclient.
Iterative look up with not permitting look up operation to cross a mount point leads to the problem in getting initial
file handle. To access the file in remote file system, client must provide file handle of directory where look up would
take place. This also requires providing name ofthe file or directory that is to be resolved.
NFS version 4 solves this problem by offering a separate putrootfh operation that tells the server to resolve all file names relative to the root file handle of the file system it manages. In NFS, on-demand mounting of a remote file system is handled by an automounter, which runs as a separate process on the client machine.
File Attributes
Following are some of the general recommended file attributes :
o ACL : Access control list associated with the file.
o FILEHANDLE : Server-provided file handle of this file.
o FILEID : File-system-unique identifier for this file.
o FS_LOCATIONS : Locations in the network where this file system may be found.
Synchronization
- Lock is a nonblocking operation used to request a read or write lock on a consecutive range of bytes in a file. In case of a conflicting lock, the lock cannot be granted and the client has to poll the server at a later time. Once the conflicting lock has been removed, the server grants the next lock to the client at the top of the FIFO list of requests maintained at the server side. Lockt is used to check whether any conflicting lock exists, and removing a lock is done using Locku. If the client does not renew the lease on an acquired lock, the server will automatically remove it.
- In NFS version 4, cache consistency is handled in an implementation-dependent manner. The client has a memory cache to hold data read from the server; in addition, a disk cache may also exist on the client machine. The client caches file data, attributes, directories and file handles. Several clients on the same machine may share the cache. If modifications are made to cached data, they must be flushed back to the server when the client closes the file.
~ Clients can also cache attribute values which can be different at different clients. Modifications to attribute value
tories.
approach is used for file handles and direc
hould be eim
snou immediately forwarded to server. Same
NES on 4 offers minimum support for file replication. Only replication of whole file system is possible.
- version
~ RPC m mechanism in NFS does not ensure guarantee regarding reliability.It lacks in detecting the duplicate messages, In
byclient. This Problem is solved by means of
case of loss of server replay, server will process retransmitted request
. aol emen ted by server. Each client request carries trans
dupplicate-request cache Imp
action identifier (XID) that is cached
t server. After processing the request, server also caches reply. e
est arrivesa
by server when requ
After timerat client expires before reply comes back then client retransmits same request with same XID. Three cases
occur. If server has not yet completed original request,it ignores retransmitted request.In other case, server may get
at which reply sent
retransmitted request after reply sentto client.If arrival time of retransmitted request and time
are nearly equal then server ignores retransmitted request.If reply is really did get lost then cached reply is sent to
client as reply to retransmitted request.
In case of locks, if client is granted a lock and it crashes.In this case, server issues lease on ever lock. If not renewed by
client then server removeslock freeing the resources held by lock. In case of server failure, a grace period is provided
to server after it recovers. In this grace period, clients can reclaim same lock that was previously granted to them.
As a security measure, older NFS used Diffi-Hellman key exchange to establish a session, NFS version 4 uses
authentication protocol Kerberos. RPCSEC_GSS secured framework is also supported foe setting up secured channels.
| Authorization in NFS is analogues to secure RPC.ACL file attribute is used to support access control.
UNIX programs can transparently access the remote shared files. Like NFS this transparency is offered by Andrewfile
system (AFS) to UNIX programs. Normal UNIX primitives are used by UNIX programs to access AFS files without
carrying out any modifications or recompilation. AFS is compatible with NFS. File system in server is NFS based. Hence,
File handles are use for referenceof file. Remote accessto file is provided via NFS.
client
Scalability is major design goal of AFS. It supports to large number of active users. The whole file is cached at
node. Following are two design characteristics.
oO Whole-file serving : The entire content of directory and file are transferred to client machine by AFSservers.
fe) Whole-file caching : This transferred file by server is then stored on the local disk which is permanent storage.
Cache contains several recently used files. Open request are carried out locally.
Implementation
called as Vice and Venus. Vice
4 AFS implementation contains two software components which are UNIX processes
process runs on server machine as user-level UNIX process.
At client side, local and sharedfiles are available.
Venus process runs on client machine as user-level UNIX process.
cachedby clients in disk cache.
Local files are handled as normal UNIX files. Sharedfiles are stored on server and
- The management of the cache is carried out by the Venus process; one of the file systems on the client's local disk is used as the cache, storing copies of files from the shared space.
- Venus removes LRU (least recently used) files so that new files accessed from the server get space; the size of the disk cache at a client is large. Vice implements a flat file service, while the Venus processes at the client side implement the hierarchic directory structure needed by users. Each file in the shared file space is assigned a 96-bit file identifier and is identified by this identifier; it is the job of the Venus process to translate path names to file identifiers.
- Names are used to refer to entities such as web pages, files, computers and services in a distributed system.
- An address is also a name. The name of an entity should be location independent, i.e., independent of its address.
- An identifier is also a name, one which is used to identify an entity. An identifier should refer to at most one entity, each entity should be referred to by at most one identifier, and an identifier should always refer to the same entity.
- In many computer systems, addresses and identifiers are represented in the form of bit strings. A human-friendly name such as a URL is represented as a string of characters. Many names are specific to some service.
- Originally, name services were quite simple, as they were designed for a single administrative domain. Considering a large distributed system with an interconnection of networks, a larger name-mapping service is required; a global name service is needed in such a case.
Name Spaces
Names in distributed system are orgarized in name space. Name spacecan be represented as labeled diagraph having
This leaf node stores information
two types of nodes. Leaf node represents named entity and it has no outgoing edges.
it is
about entity such as address. ft also stores state of the entity, for example,in file system it contains complete file
maintains directory table in
representing. Directory node in name space has number of outgoing edges. Directory node
which outgoing edge represented as pair (edge label, node identifier),
Naming graph contains root node. The path in naming graph contains sequence of labels. For example, path N:<label 1,
label2,......, label n> contains N asfirst node in graph. Such sequenceis called as path name. If first name in path name
is the root of naming graph then it is called as absolute path. Otherwiseit is called as relative path.
DNS namesare called as domain names. They are strings just like absolute UNIX file names. DNS name space has
hierarchic structure. A domain name comprises of one or more strings called as labels. They are separated by delimiter
<= (dot). There is no delimiter at beginning and end of domain name.Prefix of nameis initial section of name. For
example dcs and dcs.qmw are both prefixes of dcs.qmw.ac.uk. DNS servers do not recognize relative name. All the
namesreferred to the global root.
In general, alias is similar to UNIX like symbolic link which allows substituting the convenient name in place of
complicated one. DNS permit aliases in which one domain nameis defined to stand for another. Aliases provide the
transparency. Aliases are generally used to specify machines name on which FTP server or webserverruns.Alias is
updated in DNS database suppose webserveris moved to another machine.
- Name spaces of different systems can be merged, provided the corresponding authorities agree on a common scheme. The entire UNIX file systems of two different machines can be merged by creating a higher-level root (a super root) and then mounting each machine's file system in this super root. Merging name spaces by creating a higher-level root, however, raises the problem of backward compatibility, since existing names become relative to the new root.
- The DCE (Distributed Computing Environment) name space permits heterogeneous name spaces to be embedded in it. The DCE name space contains junctions similar to mount points in UNIX and NFS; these junctions allow the mounting of heterogeneous name spaces.
- File system mounting allows users to import files from a remote server and share them. The Spring naming service supports creating name spaces dynamically; it also supports sharing individual naming contexts selectively.
Fig. 6.8.1 : Client iteratively contacts name servers 1 to 4 in order to resolve a name
Initially client presents name to local name server.If it has the name,it returns it immediately, Otherwiseit will
suggest another server which can help. Now resolution proceeds at new server.If needed, further navigationis carried
out until the nameis located or discovered to be unbound. In multicast navigation client multicast name to be resolved
and needed object type to groupofservers. Server storing this namedattribute only replies to the request.
In recursive name resolution, name server coordinates the resolution of name and returns result back to user agent.It
is further classified as non-recursive and recursive server controlled navigation. In non-recursive server controlled
navigation, client can choose any name server.
This server then either multicasts the request to its peers or communicates iteratively with them. In recursive server-controlled navigation, the client presents the name to a single server, which resolves it completely. If this server does not hold the name, it contacts a peer that holds a larger prefix of the name, which in turn attempts to resolve it; this is repeated until the name is resolved. Fig. 6.8.2 shows non-recursive and recursive server-controlled navigation.
Fig. 6.8.2 : (a) Non-recursive and (b) recursive server-controlled navigation
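Iterative navigation can be sketched as a simple client loop: each server either answers or refers the client to another server that holds a longer prefix of the name. Everything below (the table contents, the example IP address and the helper names) is illustrative only.

    # Each "server" either knows the answer or refers the client onward.
    servers = {
        "root":  {"refer": {"uk": "ns-uk"}},
        "ns-uk": {"refer": {"ac.uk": "ns-ac"}},
        "ns-ac": {"answer": {"dcs.qmw.ac.uk": "192.0.2.17"}},
    }

    def resolve(name, start="root"):
        current = start
        while True:
            server = servers[current]
            if name in server.get("answer", {}):
                return server["answer"][name]          # name located
            # Follow a referral for a suffix of the name; the next server
            # is responsible for a longer portion of the name space.
            nxt = None
            for suffix, target in server.get("refer", {}).items():
                if name.endswith(suffix):
                    nxt = target
            if nxt is None:
                return None                            # name is unbound
            current = nxt                              # the client itself contacts the next server

    print(resolve("dcs.qmw.ac.uk"))                    # 192.0.2.17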
The access to name server running in one administrative domain by client running in another administrative domain
should be prohibited. In ONS and other name services, client name resolution software and server maintains cache of
results of previous name resolution. On a clients request to resolve name,first client name resolution software enquire.
cache.If recent result of previous nameresolution is found then return it to client. Otherwise request is forwarded to
server. That server in turn mayreturn result cached from other server.
Caching improves performance and availability. It saves communication cost and improves response time. High level
nameservers such as root servers are eliminated due to caching.
DNS naming database is used across the internet. The objects named by DNS are computers. For these computersIP
addresses are stored as attributes. In DNS domain names simply are called as domains. Internet DNS has bound
millions of names. The look ups against these namesare carried out from all around the world.
Mainly DNSis used for naming across the Internet. DNS name spaceis partitioned organizationally and as per
geography aswell. In name highest level domain is mentionedat right. Following are the generic domains.
Oo _ com : Commercial organizations
Oo edu : Educational institutions and universities
o gov : US Governmentalagencies
- Each server also records the domain names and attribute data of its sub-domains. For example, it could contain attribute data for the names in a college (college.ac.in) and less data about a department (department.college.ac.in). Each zone contains the following data :
o Attribute data for the names in its domains, and less the data about delegated sub-domains.
o Names and addresses of at least two servers in the zone which contain trustworthy and reliable data for the zone, maintained for the organization.
o Names of the name servers that hold data for the delegated sub-domains, and data which gives the IP addresses of these servers.
o Parameters that govern replication and caching, related to zone management and required in the future name-resolution process.
- Any server can cache data from other servers so that it can answer queries quickly. Each entry in a zone contains a time-to-live value so that data cached by clients or by other, non-authoritative servers remains useful. If a client sends a query after the time-to-live period has expired, the non-authoritative server contacts the authoritative server. This minimizes network traffic and offers flexibility to system administrators.
- DNS can be used to store arbitrary attributes. The type of a query specifies what is required : it can be an IP address, a name server, or other information.
4 ADNS clientis resolver whichis implementedin library software.It is simplest request-reply protocol. Both iterative
and recursive resolutionis supported by DNSandclientside software specifies the type of resolution to be carried out.
4 Nameservers store the zone datainfiles in the form of resource records. Following are some examples of resource
recordsfor Internet database.
    Record type   Associated entity   Value / meaning
    MX            Domain              List of <preference, host> pairs; refers to the mail servers that handle mail addressed to this node
    NS            Zone                Name of a name server that implements the represented zone
    HINFO         Host                Machine architecture and operating system; holds information on the host this node represents
Berkeley Internet Name Domain (BIND) is an implementation of DNS for machines having UNIS OS running on them.
Client programs link in library software as the resolver. DNS name servers run named daemon. BIND permits three
servers which are primary, secondary and caching-only servers. The named program implements just one of these as
per configurationfile contents.
Typically, organization has one primary, one or more secondary thatoffer service for name serving on different LANs
at the site. In addition to this, caching-only servers are run by individual machines to minimize networktraffic and
speed up the response time.
Directory services are attribute based naming systems.Directory service stores binding between namesand attributes
and that look up entries matches with attribute based specifications called as directory service. Some of the examples
are LDAP, X.500 and Microsoft9s Active Directory Services. Directory service returns attributes of any objects.
Attributes are more powerful compared to names. For example, Programsusually can be written to select objects by
their attributes and not names.
networking.
A discovery service is a directory service. This discovery service registers services provided in spontaneous
In spontaneous networking devices gets connected at any momentof time without any warning. There is no any
administrative preparationcarried out for these devices when they connect in network.It is required to support set of
clients and services to be registered transparently. There should not be any humanintervention for the same.
automatically. It also
To support this, discovery service offers interface for registering and de-registering these services
hotel should be able to connect his
offers interface for the clients to look up these services. For example, customerin
laptopto printer automatically without configuring it manually.
- Global Name Service (GNS) was designed at the DEC Systems Research Center. It offers facilities for resource location, authentication and mail addressing. Following are the design goals of GNS :
o To handle an arbitrary number of names and to serve an arbitrary number of organizations.
o A long lifetime.
o High availability.
o Fault isolation : a local failure does not affect the entire system.
o Tolerance of mistrust.
hence, it considered the
- These goals indicate that any numb er of computers, users Can be added in system and
may also change. The
a | structure changes then the structure of name space
support for scalability. If organization
of individuals, organization etc.
service should also accommodate the ch anges in the names
esthat, changes
and naming databa se, caching is used in GNS.It assum
- Considering the large si ze of distributed system r
n of updates is adopted. Client can detect and recove
r | ent ly an d hence slow propaga tio
will occu
in database nfr equ
- GNS accommodates growth and restructuring by inserting a new root above the existing root directories. Whenever a client uses a relative name referring to a working directory, the user agent deals with it in the same way: it sends the request to a GNS server, which resolves the name relative to that working directory.
4 GNS maintainstable of well-knowndirectories in which it list all the directories which are used as working roots. This
changes
table is held in currentreal root directory of the naming database. Whenever real root of naming data base
due to adding the new root,all GNS servers are informed aboutlocation of new root.In GNSrestructuring of database
can be doneif any organizational changes occurs.
(Figure : a GNS directory tree with directory identifiers, and a value tree for the entry Rajesh holding attributes such as password)
- X.500 Directory Service is used to satisfy descriptive queries that look up the names and attributes of other users or system resources. The use of such a service is quite diverse. For example, the enquiries can be white pages queries to obtain a user's email address, or yellow pages queries to obtain the names and telephone numbers of garages.
- Individuals or organizations can use the directory service to provide information about themselves and the resources they wish to offer for use in the network. It is possible for users to search the directory for information with only partial knowledge of its name, structure and contents. The ITU and ISO standards organizations have defined the X.500 Directory Service as a network service. It is used for access to hardware and software services and devices. The X.500 servers maintain the data in a tree structure with named nodes, just like other name servers. Each node of the tree in X.500 stores a wide range of attributes. The entries can be searched by any combination of attributes.
- The X.500 name tree is called the Directory Information Tree (DIT). The entire directory structure along with the data associated with the nodes is called the Directory Information Base (DIB).
[Figure : X.500 service architecture — Directory User Agents (DUAs) accessing multiple Directory Service Agents (DSAs)]
- The directory can be accessed by using two access requests : read and search.
o read : An absolute or relative name of an entry is given along with a list of attributes to be read. The DSA then navigates the DIT to locate the entry. If the part of the tree that includes the entry is not available on this DSA server, then it passes the request to other DSA servers. The retrieved attributes are then returned to the client.
o search : This access request is attribute-based; a base name and a filter expression are supplied as arguments. The base name indicates the node in the DIT from which the search is to begin, and the filter expression is evaluated for every node below the base node. The search criterion is specified by the filter. The search command then returns the names of all entries for which the filter evaluates to TRUE.
- The DSA interface offers operations to add, delete and modify entries. Access control is also provided for both query and update operations. As a result, access to some parts of the DIT can be restricted to a user or a group of users.
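The read and search requests can be imitated on a toy DIT as follows (the node layout and the filter representation are simplified assumptions; real X.500/LDAP encodes names and filters quite differently):

from typing import Callable, Dict, List, Optional

class DITNode:
    # one node of a toy Directory Information Tree
    def __init__(self, name: str, attributes: Dict[str, str]) -> None:
        self.name = name
        self.attributes = attributes
        self.children: List["DITNode"] = []

def read(root: DITNode, path: List[str]) -> Optional[Dict[str, str]]:
    # read: navigate the DIT along the given name and return the entry's attributes
    node: Optional[DITNode] = root
    for component in path:
        node = next((c for c in node.children if c.name == component), None)
        if node is None:
            return None
    return node.attributes

def search(base: DITNode, fltr: Callable[[Dict[str, str]], bool]) -> List[str]:
    # search: evaluate the filter for every node below the base node
    hits: List[str] = []
    for child in base.children:
        if fltr(child.attributes):
            hits.append(child.name)
        hits.extend(search(child, fltr))
    return hits

org = DITNode("O=SomeOrg", {})
org.children.append(DITNode("CN=Rajesh", {"mail": "rajesh@example.org", "dept": "CS"}))
print(read(org, ["CN=Rajesh"]))                       # attributes of a single entry
print(search(org, lambda a: a.get("dept") == "CS"))   # names of all matching entries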
scalability, reliability, performance and openness. Like any web search engine, the Google search engine returns an ordered list of the most relevant results that match the given query by searching the content of the Web. The search engine contains a set of services for crawling the Web and for indexing and ranking the searched pages.
- The crawler locates and retrieves the contents of the Web and passes the contents onto the indexing subsystem. This is done by recursively reading a given web page and harvesting all the links from it.
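A toy, single-threaded illustration of this crawling step (the real crawler is, of course, massively parallel and far more careful): fetch a page, harvest its links and queue them so that the process repeats.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkHarvester(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.links: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, limit: int = 10) -> set:
    to_visit, seen = [seed], set()
    while to_visit and len(seen) < limit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                                  # unreachable pages are skipped
        harvester = LinkHarvester()
        harvester.feed(page)
        # harvested links are queued, so the crawl proceeds recursively
        to_visit.extend(urljoin(url, link) for link in harvester.links)
    return seen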
- Google Calendar : It is a Web-based calendar having all data hosted on Google servers.
- Google Wave : It is a collaboration tool integrating email, instant messaging, wikis and social networks.
- Google News : It is an automated news aggregator site.
Fig. 6.12.1 : The Overall System Architecture of Google
Fig. 6.12.2 : Google Infrastructure (data and coordination services such as GFS, Chubby and BigTable; communication paradigms such as publish-subscribe)
- Data and coordination services : GFS offers a distributed file system optimized for the particular requirements of Google applications and services, providing access to very large volumes of data together with structured and semi-structured abstractions for the storage of data. Chubby supports coordination services and the ability to store small volumes of data. BigTable offers a distributed database giving access to semi-structured data.
- Distributed computation services : MapReduce supports a means for carrying out parallel and distributed computation over the physical infrastructure, over potentially very large datasets.
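The style of computation MapReduce supports can be shown with the classic word-count example; this pure-Python, sequential sketch only mimics the map/shuffle/reduce structure and is not Google's implementation.

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def map_phase(document: str) -> List[Tuple[str, int]]:
    # map: emit an intermediate (key, value) pair for every word
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # group intermediate values by key before the reduce phase
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # reduce: combine all intermediate values for one key
    return key, sum(values)

documents = ["the crawler reads the web", "the indexer ranks the web"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["the"])    # 4

In the real infrastructure, the map calls run in parallel over pieces of a very large dataset and the reduce calls run in parallel per key, which is what allows the computation to scale over the physical infrastructure.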
- Sawzall provides a higher-level language for the execution of such distributed computations. Communication is supported by the communication paradigms shown in Fig. 6.12.2, including publish-subscribe.
- The GFS mainly aims at the demanding and rapidly growing needs of Google's search engine and the Google web applications. Following are the requirements for GFS :
- GFS provides a conventional file system interface offering a hierarchical namespace, with individual files identified by pathnames. The following file operations are supported; the main GFS operations are very similar to those of the flat file service.
- The parameters for GFS read and write operations specify a starting offset within the file. The API also offers snapshot and record append operations. The snapshot operation offers an efficient mechanism to make a copy of a particular file or directory tree structure. The record append operation supports the common access pattern whereby multiple clients carry out concurrent appends to a given file.
GFS Architecture
- Fig. 6.12.3 shows the overall GFS architecture. In GFS, files are stored in fixed-size chunks, where each chunk is 64 megabytes in size. This chosen size is very large compared to other file systems, and it offers highly efficient sequential reads and appends of large amounts of data.
[Fig. 6.12.3 : Overall GFS architecture (client library, control flow)]
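The effect of the 64 MB chunk size can be seen in how a client turns a byte offset into a chunk request; the helper below is an illustrative sketch, not the actual GFS client library API.

CHUNK_SIZE = 64 * 1024 * 1024        # fixed-size 64 MB chunks

def locate(byte_offset: int) -> tuple:
    # translate a file byte offset into (chunk index, offset within that chunk);
    # the client then asks the master only for the location of that one chunk
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(locate(0))                      # (0, 0)
print(locate(200 * 1024 * 1024))      # (3, 8388608) -- 200 MB falls in the 4th chunk

Because a sequential read of even a gigabyte touches only sixteen chunks, the master is consulted rarely; this is one reason the large chunk size makes sequential reads and appends so efficient.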
Q. 1 Explain in short the services provided by a distributed file system.
Q. 2 Explain the desirable features of a good distributed file system.