A Perspective On Distributed Computer Systems
(Invited Paper)
Abstract - Distributed computer systems have been the subject of a vast amount of research. Many prototype distributed computer systems have been built at university, industrial, commercial, and government research laboratories, and production systems of all sizes and types have proliferated. It is impossible to survey all distributed computing system research. Instead, this paper identifies six fundamental distributed computer system research issues, points out open research problems in these areas, and describes how these six issues and solutions to problems associated with them transect the communications subnet, the distributed operating system, and the distributed database areas. It is intended that this perspective on distributed computer system research serve as a form of survey, but more importantly to illustrate and encourage a better integration and exchange of ideas from various subareas of distributed computer system research.

Index Terms - Communications subnet, computer networks, distributed computer systems, distributed databases, distributed operating systems, distributed processing, system software.

Manuscript received May 7, 1984; revised July 14, 1984.
The author is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003.

I. INTRODUCTION

A distributed computer system (DCS) is a collection of processor-memory pairs connected by a communications subnet and logically integrated in varying degrees by a distributed operating system and/or distributed database system. The communications subnet may be a widely geographically dispersed collection of communication processors or a local area network. The widespread use of distributed computer systems is due to the price-performance revolution in microelectronics, the development of cost effective and efficient communication subnets (which is itself due to the merging of data communications and computer communications), the development of resource sharing software, and the increased user demands for communication, economical sharing of resources, and productivity.

A DCS potentially provides significant advantages, including good performance, good reliability, good resource sharing, and extensibility [31], [36], [56]. Potential performance enhancement is due to multiple processors and an efficient subnet, as well as avoiding contention and bottlenecks that exist in uniprocessors and multiprocessors. Potential reliability improvements are due to the data and control redundancy possible, the geographical distribution of the system, and the ability for mutual inspection of hosts and communication processors. With the proper subnet and distributed operating system, it is possible to share hardware and software resources in a cost effective manner, increasing productivity and lowering costs.

Possibly the most important potential advantage of a DCS is extensibility. Extensibility is the ability to easily adapt to both short and long term changes without significant disruption of the system. Short term changes include varying workloads and subnet traffic, and host or subnet failures or additions. Long term changes are associated with major modifications to the requirements of the system.

In trying to achieve the advantages of DCS's, the scope of research has been very broad. In spite of this, there is a relatively small number of fundamental issues dominating the field. Solutions to these fundamental issues have not yet been consolidated in a comprehensive way, thereby thwarting the full potential of DCS's. After a brief overview of DCS research (Section II), this paper provides a perspective on six fundamental DCS issues (the object model, access control, distributed control, reliability, heterogeneity, and efficiency), identifies problems associated with these issues, shows how these issues interrelate, and describes how they are addressed in different subareas of DCS research (Section III). It is intended that this perspective on DCS research serve as a form of survey, but more importantly to illustrate and encourage a better integration and exchange of ideas from various subareas of DCS research. To keep the scope of this paper reasonable, two fundamental issues, research in the theory and specification of distributed systems, and the need for a distributed systems methodology, are not specifically discussed. A theory of distributed systems is needed to better understand theoretical limitations and complexity. Specification languages must be extended to better treat parallelism, reliability, the distributed nature of the system being specified, and the correctness of the system. A methodology for the design, construction, and maintenance of large complex distributed systems is necessary. This methodology must address the specific problems of DCS's such as distribution and parallelism. Finally, Section IV contains summary remarks.

II. DISTRIBUTED COMPUTER SYSTEMS

DCS research encompasses many areas, including: the communication subnet, local area networks, distributed operating systems, distributed databases, concurrent and distributed programming languages, specification languages for concurrent systems, theory of parallel algorithms, parallel architectures and interconnection structures, fault tolerant and ultrareliable systems, distributed real-time systems, cooperative problem solving techniques of artificial intelligence, distributed debugging, distributed simulation, and distributed applications [23], [47], [89]. There are also the
associated efforts of modeling and analysis, testbed development, measurement and evaluation, and prototype implementations. An extensive survey and bibliography would require hundreds of pages. In this section we concentrate on briefly categorizing and identifying actual distributed computer systems, rather than all related research. At the end of this paper there is a list of textbooks, survey articles, and references to provide further information on these research areas. For a more extensive survey of DCS research issues see [126].

Strictly speaking, a DCS is the sum total of the physical network and all the controlling software. Therefore, each category discussed below (local area networks, wide area networks, network operating systems, distributed operating systems, distributed file servers, distributed real-time systems, and distributed databases) actually refers to a particular aspect of a DCS and not an entire DCS.

Local Area Networks: According to Stallings [118], "a local network is a communications network that provides interconnection of a variety of data communicating devices within a small area." A small area generally refers to a single building or possibly a span of several buildings. A network with a radius of 20 km would lie on the border between a local network and a long haul network. Local networks are sometimes classified into three types: a local area network (LAN), a high speed local network (HSLN) which is typically found in a large computer center, or a digital switch/computerized branch exchange (CBX). Typical LAN's are the ring, the baseband bus, and the broadband bus. HSLN's and CBX's are not discussed in this paper.

Common protocols for accessing ring LAN's are the Newhall (token) protocol [38], [39], the IEEE 802 token ring protocol [118], the Pierce protocol (the slotted ring) [91], or the delay insertion protocol [74]. Prime and Apollo [15] manufacture token rings; the Cambridge ring [87] and Spider [72] are examples of slotted rings; and DDLCN [74] is a delay insertion ring.

A baseband network refers to transmission of signals without modulation, where the entire spectrum of the medium (cable) is consumed by the signal. Baseband LAN's are typically accessed via a carrier sense multiple access/collision detect (CSMA/CD) protocol commonly referred to as Ethernet [84]. DEC, Intel, and Xerox support a product which runs the Ethernet protocol.

Broadband LAN's use frequency division multiplexing and divide the spectrum of the medium (twisted pair, coaxial cable, or fiber optics) into channels, each of which carries analog signals or modulated digital data. For example, some channels may be used for point-to-point data communication, other channels can utilize a contention protocol such as Ethernet, other channels may be assigned for video traffic, and yet other channels may be dedicated to voice traffic. Wang manufactures a broadband LAN.

Wide Area Networks: A wide area network (WAN) is a geographically dispersed collection of hosts and communication processors where the distances involved are large. Another common name for these networks is long haul networks. WAN's include ARPANET [78], Cyclades [93], TYMNET, and Telenet. Commercial vendors and several standards organizations have developed computer network architectures. A network architecture describes the functionality of a computer network in a structured and layered manner. The ISO reference model [134] is typical of these architectures and it contains seven layers: the physical, data link, network, transport, session, presentation, and application layers. Some examples of commercial network architectures which are used in a wide variety of applications include: Burroughs' Network Architecture, DEC's DECNET, Honeywell's Distributed Systems Architecture, IBM's Systems Network Architecture, Siemens' TRANSDATA, and Univac's Distributed Communication Architecture [80]. Strictly speaking, network architectures exist on both LAN's and WAN's, but they are listed here because they were first conceived within the WAN context.

Network Operating Systems: Consider the situation where each of the hosts of a computer network has a local operating system that is independent of the network. The sum total of all the operating system software added to each host in order to communicate and share resources is called a network operating system (NOS). The added software often includes modifications to the local operating system. NOS's are characterized by being built on top of existing operating systems, and they attempt to hide the differences between the underlying systems. The most famous example of such a computer network is ARPANET and it contains several NOS's, e.g., RSEXEC and NSW [71]. RSEXEC includes a uniform command language interpreter and a networkwide execution environment for user programs. It is intended as an experimental vehicle for exploring NOS issues. NSW is a NOS that supports software development by providing uniform access to a set of software tools. A systemwide file system is a major element of NSW. XNOS [61] is another example of a NOS.

Distributed Operating Systems: Consider an integrated computer network where there is one native operating system for all the distributed hosts. This is called a distributed operating system (DOS). A DOS is designed with the network requirements in mind from its inception and it tries to manage the resources of the network in a global fashion. Therefore, retrofitting a DOS to existing operating systems and other software is not a problem in DOS's. Since DOS's are used to satisfy a wide variety of requirements, their various implementations are quite different. Table I lists a number of DOS's. Note that the boundary between NOS's and DOS's is not clearly distinguishable.

Distributed File Servers: A file system is an integral part of a NOS, a DOS, and a distributed database system. Hence, there has been a great deal of research in this area. In most distributed systems the file system is considered a server which fields requests from users as well as from the rest of the operating system. File servers support the illusion that there is a single logical file system when, in fact, there may be many different file systems. Depending on the level of sophistication implemented, the file server may support replication, movement of files, and reliable updates to files in addition to the common file system commands. Typical file
1104 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-33, NO. 12, DECEMBER 1984
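As an aside on the contention-based access method named in Section II (CSMA/CD, as used by Ethernet), the following is a minimal Python sketch of how a station might defer, transmit, and back off after a collision. The paper itself only names the protocol; the truncated binary exponential backoff, the cap of 10 on the exponent, and the limit of 16 attempts are assumptions taken from common Ethernet practice. The names `backoff_slots`, `send_frame`, `carrier_sensed`, and `collision_detected` are hypothetical and exist only for illustration.

```python
import random

# Assumptions drawn from common Ethernet practice, not from this paper.
MAX_ATTEMPTS = 16    # give up on the frame after this many tries
BACKOFF_LIMIT = 10   # the backoff exponent stops growing here


def backoff_slots(collisions, rng=random):
    """Slot times to wait after `collisions` successive collisions.

    Truncated binary exponential backoff: pick uniformly from
    0 .. 2**min(collisions, BACKOFF_LIMIT) - 1.
    """
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    return rng.randrange(2 ** min(collisions, BACKOFF_LIMIT))


def send_frame(carrier_sensed, collision_detected, rng=random):
    """Try to send one frame; return (success, attempts, slots_waited).

    `carrier_sensed()` and `collision_detected()` are hypothetical hooks
    standing in for the hardware: the first reports whether the medium
    is busy, the second whether the current transmission collided.
    """
    waited = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        while carrier_sensed():          # carrier sense: defer while busy
            waited += 1
        if not collision_detected():     # transmit while listening
            return True, attempt, waited
        if attempt < MAX_ATTEMPTS:       # collision: back off, then retry
            waited += backoff_slots(attempt, rng)
    return False, MAX_ATTEMPTS, waited
```

Under light load such a scheme adds almost no delay, because a station transmits as soon as the medium is idle. Under heavy load repeated collisions waste channel time, which is the motivation for the reservation and limited contention protocols.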
col, require stations to reserve future time slots on the communication medium to completely avoid conflict. Another class of access protocols, called limited contention protocols, allows contention for serial use of the bus under light loads to reduce delays, but essentially becomes a collision free protocol under heavy loads, thereby increasing channel efficiency. See [134] for a full description of collision free and limited contention protocols.

We believe that most ideas from the subnet research area can be better exploited in other parts of DCS's. For example, one idea which has already been used is a "virtual" token scheme [70] for concurrency control in a distributed database system. The token circulates on a virtual ring and carries with it a sequencer that delivers sequential and unique integer values called tickets. This scheme ensures serial access to data items by allowing access to an object only if the requestor holds the proper ticket. Another potential area for sharing ideas is in distributed time critical applications, where such applications could make use of reservation and limited contention protocols in controlling access to objects. This approach would better guarantee access within the time constraints.

In addition to resources which must be shared serially, DCS's contain resources which can be shared simultaneously. Resources that can be shared simultaneously pose no difficulty if accessed individually. However, it is sometimes necessary to access a group of resources (objects) at the same time. This gives rise to transactions, which are used extensively in the database area. A transaction is an abstraction which allows programmers to group a sequence of actions into a logical unit. If executed atomically (either the entire transaction is executed or none of it is executed), a transaction transforms a current consistent state of the resources into a new consistent state. The virtues of transactions are well documented [51], [66] and are important enough that transactions have also appeared in distributed programming languages [73], and in distributed operating systems [77], [116], [131].

Multiple transactions may have access problems among themselves. Protocols for resolving data access conflicts between transactions are called concurrency control protocols [21], [69], [137]. Three major classes of concurrency control protocols are: locking, timestamp ordering [100], and validation (also called the optimistic approach [63]).

Locking is a well known technique that is already used at all levels of a system. Timestamp ordering is an interesting approach where all accesses to data are timestamped and then some common rule is followed by all transactions in such a way as to ensure serial access [137]. This technique should be useful at all levels of a system, especially if timestamps are already being generated for other reasons such as to detect lost messages or failed hosts. A variation of the timestamp ordering approach is the multiversion approach. This approach is interesting because it integrates concurrency control with recovery. In this scheme, each change to the database results in a new version. Concurrency control is enforced on a version basis and old versions serve as checkpoints. Handling multiple versions efficiently is accomplished by differential files [107]. Validation is a technique which permits unrestricted access to data items but then checks for potential conflicts at the commit point. The commit point is the time at which a transaction is sure that it will complete. This approach is useful when few conflicts are expected.

The correctness of a concurrency control protocol is usually based on the concept of serializability [22]. Serializability means that the effect of a set of executed transactions (permitted to run in parallel) must be the same as some serial execution of that set of transactions. In many cases and at all levels the strict condition of serializability is not required. Relaxing this requirement can usually result in access techniques that are more efficient in the sense of allowing more parallelism and faster execution times. An example of this is the scheduling function. The scheduling algorithm is running in parallel on many hosts and must make decisions quickly. The scheduler would rather make quick decisions using somewhat inconsistent data than have to lock all the distributed information that it might use in arriving at a solution. A research problem is determining the value that the completeness and accuracy of information contributes to the quality of the result (in this case response time or throughput).

The concurrency control protocols as well as the previously mentioned access control techniques of the subnet are implemented across the entire spectrum from centralized to distributed control. Hence, access control is very closely related to the discussion in the next section on distributed control.

Further, some access control may be done statically (e.g., scope rules in programming languages), but most access control must occur dynamically during system operation. For example, the data items to be accessed by a process may not be known prior to execution, and therefore, access to these data items must be controlled dynamically. How dynamic access control is done affects the performance and fairness of the protocol. Normal access must sometimes be bypassed for external interrupts, for alarms, or for failures. Careful normal access control coupled with techniques for failure situations signifies that access control is also highly related to reliability. A particular access control technique must resolve deadlock and livelock, two other issues related to performance and reliability [48], [82].

Finally, we point out that objects and access to them are intimately related and of equal importance. Access control may be implemented by a distributed control protocol and has important effects on efficiency and reliability. It is a complicated problem to choose the right access control technique for a given set and type of objects. Ideas for solutions to the access control problem from the subnet, NOS, DOS, and database areas should be interchanged to obtain better performance and reliability.

C. Distributed Control

Depending on the application and requirements, distributed control can take on many forms. Although research has been active for all types of distributed control, the
STANKOVIC: DISTRIBUTED COMPUTER SYSTEMS 1107
majority of the work is based on extensions to the centralized state-space model and can be more accurately described as decomposition techniques [135], [136], rather than distributed control. In such work, large scale problems are partitioned into smaller problems, each smaller problem being solved, for example, by mathematical programming techniques, and the separate solutions being combined via interaction variables. In many cases the interaction variables are considered negligible and in others they are limited in the sense that they model very limited cooperation. See [55], [67], [105] for excellent surveys of these types of distributed control (decomposition). Note that in many of these cases it is an irrelevant detail that individual subproblems are solved in parallel on different computers, and reliability is often not an issue. Further, many of the techniques require extensive computing power, making them more suitable for application programs than for system functions.

For DCS's, another form of distributed control is also important; that is, distributed control where decomposition is not possible to any large degree and where cooperation among the distributed controllers is the most important aspect. The DCS environment also results in a set of additional very demanding requirements. For example, some functions implemented in this environment are dynamic, distributed, adaptive, asynchronous, and operate in noisy, error prone, and uncertain (stochastic) situations. Another important aspect of the environment is that significant delays in the movement of data are commonplace and must be accounted for. It is important to note that the stochastic nature of the system being controlled affects two distinct aspects of the functions: an individual controller's view of the system is an estimate, and future random forces can affect the system independently of the control decision. Scheduling, startup of multiple distributed processes of a system or application program, and processes which test the reasonableness of distributed data are examples of this form of distributed control. We will now consider distributed control of this more demanding type across three levels: the subnet, the DOS, and the distributed database levels.

Functions in the subnet such as access control (discussed above), routing [44], [106], and congestion control are good candidates for distributed control. Routing is the decision process which determines the path a message follows in passing from its source to its destination. Some routing schemes are completely fixed; others contain fixed alternate paths where the alternative is chosen only on failures. These nonadaptive schemes are too limited and do not constitute distributed control. Adaptive routing schemes modify routes based on changing traffic patterns. Adaptive routing schemes may be centralized, where a routing control center calculates good paths and then distributes these paths to the individual hosts of the network in some periodic or aperiodic fashion. This is not distributed control either.

Routing algorithms which exhibit distributed control typically contain n copies of the algorithm (one at each communication processor). Information is exchanged among communication processors periodically or asynchronously as a result of some noticeable change in traffic. The information exchanged varies depending on the metric used by the algorithm (e.g., the metric might be number of hops, some estimate of delay to the destination, or buffer lengths). Each copy of the routing algorithm uses the exchanged (out-of-date) information in making routing decisions. Such algorithms have the potential for good performance and reliability because the distributed control can operate in the presence of failures and quickly adapt to changing traffic patterns. On the other hand, several new problems arise in such algorithms. If the algorithm is not careful, then phenomena known as ping-ponging (message looping) and poor reaction to "bad news" might occur [134]. These problems were recognized in the old ARPANET distributed routing algorithm and various solutions now exist [79]. Rather than continuing the discussion of such algorithms, we simply note that there is a large degree of similarity between routing messages in the subnet and scheduling processes on hosts of the DCS. We consider scheduling in Section III-F and the reader is encouraged to note the similarities.

Another type of distributed routing algorithm is based on n spanning trees being maintained in the network (one at each host). Each spanning tree identifies all n destinations in the DCS [83]. Each spanning tree is largely independent of the other trees, so this is not a highly cooperative type of distributed control. Such an approach does have a number of advantages, such as guaranteeing that there will be no looping of messages.

When too many messages are in the subnet, performance degrades. Such a situation is called congestion. In fact, depending on the subnet protocols it may happen that at high traffic, performance collapses completely, and almost no packets are delivered. Solutions include preallocation of buffers and performing message discarding only when there is congestion. This is usually done by choke packets [134]. A particularly interesting distributed control algorithm for congestion control is called isarithmic congestion control [134]. In this scheme a set of permits circulates around the subnet and a set of permits is fixed at each host. Whenever a communication processor wants to transmit a message it must first acquire a permit, either one assigned to that site (and not being used) or a circulating permit. When a destination communication processor removes a message from the subnet it regenerates the permit. Stationary permits are considered free upon message acknowledgment. This scheme limits the number of messages in the subnet to some maximum, given by the number of permits in the system. Can such a scheme be used for other aspects of DCS's such as scheduling processes on processors, or improving transaction response time in a distributed database system by avoiding thrashing situations?

Research on various topics in distributed databases [22], [62] has been extensive. One of the main research issues is concurrency control. Various algorithms for concurrency control have appeared, including some based on distributed control. One such algorithm is found in the Sirius-Delta database system developed at INRIA [70]. In this system integrity of the database is maintained by distributed controllers in the presence of concurrent users. The distributed controllers must somehow cooperate to achieve a systemwide objective of good performance subject to the data integrity constraint. In Sirius-Delta the cooperation is achieved by the combined principles of atomic actions and unique timestamps. It can then be proven that the multiple distributed controllers can operate concurrently and still meet the data integrity constraint. Bernstein and Goodman [22] have also shown that another class of algorithms based on Two Phase Locking and atomic actions can also achieve this same cooperation. However, for other functions of distributed operating systems such as scheduling, there is no requirement for the data integrity constraint. In fact, as stated before, if such a constraint were required for scheduling, then in general, performance of the system would suffer dramatically. Removing the data integrity constraint from the scheduling algorithm will improve its performance, but the control problem becomes much more difficult.

In the operating system arena, functions such as scheduling, deadlock detection, access control, and file servers are candidates for being implemented via distributed control. Consider an individual operating system function to be implemented by n distributed replicated entities (controllers). For reliability we require that there is no master controller; in other words each of the entities is considered equal (democratic) at all times. Furthermore, one of the most demanding requirements is that most operating system functions must run in real-time with minimum overhead (time sensitive). This requirement eliminates many potential solutions based on mathematical or dynamic programming. Central to the development of distributed control functions is the notion of what constitutes optimal control. However, such a notion for dynamic, democratic, decentralized, replicated, adaptive, stochastic, and time-sensitive functions has not yet been well formulated. In fact, this is such a demanding set of requirements that there are no mathematical techniques that are directly applicable.

To better illustrate the complexities involved, we discuss the complexities in terms of the mathematical discipline of team theory [54]. At a high level of abstraction team theory appears to be a good candidate as a mathematical discipline on which to base distributed control algorithms because it contains all the essential ingredients of the distributed control problem. The main ingredients are as follows.

1) The presence of different but correlated information for each decision maker about some underlying uncertainty; and
2) the need for coordinated actions on the part of all decision makers in order to realize the systemwide payoff.

More formally, a team-theoretic decision problem contains five components [54].

1) A vector of random variables $\theta = [\theta_1, \theta_2, \ldots, \theta_m]$ with given distribution $p(\theta)$. The random vector represents all the uncertainties that have a bearing on the problem and is called the states of nature.
2) A set of observations $z = [z_1, z_2, \ldots, z_n]$ which are given functions of $\theta$, i.e., $z_i = \eta_i(\theta_1, \theta_2, \ldots, \theta_m)$, $i = 1, 2, \ldots, n$. In general $z_i$ is a vector and is the observation available to the $i$th decision maker.
3) A set of decision variables (or controls) $u = [u_1, u_2, \ldots, u_n]$, one per decision maker. The variables $\theta$, $z$, $u$ are assumed to take values in appropriate spaces $\Theta$, $Z$, $U$, respectively.
4) The strategy (decision rule) of the $i$th decision maker is a function $u_i = \gamma_i(z_i)$ where $\gamma_i$ is chosen from some admissible class of functions $\Gamma_i$.
5) The loss (payoff) criterion

$$\mathrm{LOSS} = L(u_1, u_2, \ldots, u_n, \theta_1, \theta_2, \ldots, \theta_m).$$

Finally, the team decision problem is

Find $\gamma_i \in \Gamma_i$ in order to

$$\min J = E[L(u = \gamma(\eta(\theta)), \theta)].$$

Note that this model is the simplest form of a team decision and it yields a complicated optimization problem that, in general, could not be solved in real time on a DCS. The above model is completely static (one team decision is made) and the controls (actions) $u_i$ of each decision maker do not depend on the actions of the other decision makers. Cooperation is only occurring statically through the systemwide objective function and dynamically via changes in the states of nature. For most DCS's, component 2 must be modified to

$$z_i = \eta_i(\theta, u).$$

In other words, the observations of a controller depend on both the states of nature and the decisions of the other decision makers. This team problem is unsolved today. In DCS functions, the problem is even more difficult than this unsolved Team Theory problem because there are additional problems, e.g., there are inherent delays in the system causing inaccuracies and eliminating the possibility of immediate response to actions, and decisions must be made quickly. Further, team theory does not directly deal with stability, an issue that is fundamental to distributed control.

To have any hope of solving the distributed control problem we need to either relax our optimization requirement or impose more structure on the problem or both. In effect, the distributed database problem has imposed additional structure by requiring data integrity. In general, imposing additional structure includes: 1) requiring that each controller act sequentially and also requiring the controller to know the action and the result of any action of all previous controllers; 2) various n-step delay approaches [54]; 3) periodic coordination [67]; or 4) using a centralized coordinator. Even with such simplifications, the specification of additional structure does not guarantee that the resulting optimization problem is solvable in practice. Even with this additional structure the optimization problem can be too complex (and costly) to run in real time for functions like scheduling and routing. Therefore, it is necessary to develop heuristics that can be run in real time, and that can effectively coordinate distributed controllers. Even in heuristics, often the delayed effects of the interactions are not considered. Furthermore, both iterative solutions and keeping entire histories are not practical for most DCS functions. For the scheduling problem there is the added concern that it is difficult if not impossible to know the direct systemwide effect of a particular action taken by a
controller. For example, assume that controller i takes action "a" and assume that the net effect of all the actions of all the controllers improves the system. It cannot be assumed that action "a" was a good action where, in fact, it may have been a bad action dominated by the good actions of other controllers. This is an aspect of what is referred to as the "assignment of credit problem."
In summary, there are many forms of distributed control. Deciding which type is appropriate for each function in a DCS is difficult. Deciding how distributed control algorithms of different types will interact with each other under one system is even more complex. It is our hypothesis that the advantages of designing the proper algorithms in the right combinations will be improved performance, reliability, and extensibility, the major potential benefits of DCS's. Consequently, distributed control is a crucial issue.
D. Reliability
Reliability is a fundamental issue for any system, but DCS's are particularly well suited to the implementation of reliability techniques. Reliability is one fundamental issue where solutions have already been widely used across areas. However, we believe that recent reliability techniques in the database area should be better utilized in the operating system and subnet levels. We begin the discussion on reliability with a few definitions.
A fault is a mechanical or algorithmic defect which may generate an error. A fault may be permanent, transient, or intermittent. An error is an item of information which when processed by the normal algorithms of the system will produce a failure. A failure is an event at which a system violates its specifications. Reliability can then be defined as the degree of tolerance against errors and faults. Increased reliability comes from fault avoidance and fault tolerance. Fault avoidance results from conservative design practices such as high reliability components and conservative design. Fault tolerance employs error detection and redundancy to deal with faults and errors. Most of what we discuss in this section relates to the fault tolerance aspect of reliability.
Reliability is a multidimensional activity that must simultaneously address some or all of the following: fault confinement, fault detection, fault masking, retries, fault diagnosis, reconfiguration, recovery, restart, repair, and reintegration [109]. Rather than simply discussing each of these in turn, we briefly discuss how these various issues are treated at the subnet, operating system, programming language, and database levels.
Frames on a subnet typically contain error detection codes such as the CRC. Conservative design may appear as topology design with "n" paths to each destination. Data link protocols use handshaking techniques with positive feedback in the form of ACK and NAK messages and where the NAK messages may contain reasons for the failure. Timers are used in the subnet to react to lost messages, lost control tokens, or network partitionings. Most protocols will employ retries to quickly overcome transient errors. Some subnets create an abstraction called a virtual circuit that guarantees the reliable transmission of messages. Flow control protocols attempt to avoid congestion, lost messages due to buffer overruns, and possible deadlock due to too much traffic and not enough buffer space. Routing algorithms contain multiple routes to each destination or a method of generating a new route given that failures occur. Alarms and other high priority messages are used to identify dangerous situations needing immediate attention. Many of the particular algorithms or protocols used are distributed to avoid single points of failure. Other typical reliability techniques used in any number of protocols include backup components, voting, consistency and range checks, and special testing procedures.
All the same techniques used in the subnet can also be used in the DOS [20]. Reliable DOS's should also support replicated files [13], [45], exception handlers, testing procedures run from remote hosts, and avoid single points of failure by a combination of replication, backup facilities, and distributed control. Distributed control could be used for file servers, name servers, scheduling algorithms, and other executive control functions. Process structure, how environment information is kept, the homogeneity of various hosts, and the scheduling algorithm may allow for relocatability of processes. Interprocess communication (IPC) might be supported as a reliable remote procedure call [88], [108]. Reliable IPC would enforce "at least once" or "exactly once" semantics depending on the type of IPC being invoked. Other DOS reliability issues relate to invoking processes that are not active, or attempting to communicate with terminated processes, or the situation in which a process remains active but is not used.
ARGUS [73], a distributed programming language, has explicitly incorporated reliability concerns into the programming language. It does this by supporting the idea of an atomic object, transactions, nested actions, reliable remote procedure calls, stable variables, guardians (which are modules that survive node failures and synchronize concurrent access to data), exception handlers, periodic and background testing procedures, and recovery of a committed update given the present update does not complete. A distributed program written in ARGUS may potentially experience deadlock. Currently, deadlocks are broken by timing out and aborting actions.
Distributed databases make use of many reliability features such as stable storage, transactions, nested transactions [86], commit and recovery protocols [112], nonblocking commit protocols [34], [110], termination protocols [111], checkpointing, replication, primary/backups, logs/audit trails, differential files [107], and timeouts to detect failures. Operating system support is required to make these mechanisms more efficient [50], [131].
From the above list of database reliability features let us consider termination and recovery protocols. A termination protocol is used in conjunction with nonblocking commit protocols and is invoked at failure detection time to guarantee transaction atomicity. It attempts to terminate (commit or abort) all affected transactions at all participating hosts without waiting for recovery. This is an extremely important feature when it is necessary to allow as much continued
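As an illustration of the subnet-level techniques described in this section (error detection codes, ACK/NAK feedback, and retries to overcome transient errors), the following is a minimal Python sketch. Everything in it is an assumption made for illustration: the frame layout (payload followed by a 4-byte CRC-32), the `flaky_channel` model that flips a single bit, and the retry limit are hypothetical and do not correspond to any particular data link protocol.

```python
import random
import struct
import zlib


def make_frame(payload: bytes) -> bytes:
    """Build a frame: payload followed by its CRC-32 (4 bytes, big-endian)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, else None (the receiver would NAK)."""
    if len(frame) < 4:
        return None
    payload = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    return payload if zlib.crc32(payload) == crc else None


def flaky_channel(frame: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Hypothetical channel: with probability error_rate, flip one bit of the frame."""
    if rng.random() < error_rate:
        corrupted = bytearray(frame)
        corrupted[rng.randrange(len(frame))] ^= 0x01
        return bytes(corrupted)
    return frame


def send_with_retries(payload: bytes, rng: random.Random,
                      error_rate: float = 0.3, max_retries: int = 5):
    """Retransmit until the receiver accepts the frame or retries run out.

    Returns (delivered_payload_or_None, number_of_attempts).
    """
    for attempt in range(1, max_retries + 1):
        received = check_frame(flaky_channel(make_frame(payload), rng, error_rate))
        if received is not None:  # receiver would answer with an ACK
            return received, attempt
        # receiver would answer with a NAK; loop around and retransmit
    return None, max_retries


if __name__ == "__main__":
    delivered, attempts = send_with_retries(b"status report", random.Random(42))
    print(delivered, attempts)
```

A real protocol would also need timers for frames or acknowledgments that are lost outright, which this sketch does not model.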
execution as possible (availability) in spite of the failure. A host which has failed must then execute a recovery protocol before it resumes communication with other hosts. The major functions of the recovery protocol are to restart the system processes, and to reestablish consistent transaction states for all transactions affected by the failure, if this has not been already accomplished by the termination protocol. This illustrates the close interaction that exists between the various protocols where decisions made in one protocol make subsequent protocols easier or more difficult. It is obvious that a termination (cleanup) and recovery protocol are required at all levels in the system. Hence, specific algorithms (or ideas from them) used at the database level might be applicable to other levels as well and vice versa. For example, termination and recovery protocols may themselves be distributed to enhance reliability. However, the distributed termination protocols typically require n(n - 1) messages during a round of communication (and several rounds may be necessary) where n is the number of participating entities. This is too costly for slow networks, but it may be acceptable on fast local networks or within a single host. The benefits would be greater reliability and better availability. Note that as was true for concurrency protocols (Section III-B), further improvements in efficiency and availability might be possible if these termination and recovery protocols relax the serializability requirement.
One aspect of reliability not stressed enough in DCS research is the need for robust solutions, i.e., the solutions must explicitly assume an unreliable network, tolerate host failures, network partitionings, and lost, duplicate, out of order, or noisy data [27]. Robust algorithms must sometimes make decisions after reaching only approximate agreement or by using statistical properties of the system (assumed known or dynamically calculated). A related question is at what level should the robust algorithms, and reliability in general, be supported? Most systems attempt to have the subnet ensure reliable, error free data transmission between processes. However, according to the end-to-end argument [104], such functions placed at the lower levels of the system are often redundant and unnecessary. The rationale for this argument is that since the application has to take into account errors introduced not only by the subnet, many of the error detection and recovery functions can be correctly and completely provided only at the application level.
The relationship of reliability to the other issues discussed in this paper is very strong. For example, object oriented systems confine errors to a large degree, define a consistent system state to support rollback and restart, and limit propagation of rollback activities. Since objects can represent unreliable resources (such as processors and disks), and since higher level objects can be built using lower level objects, the goal of reliable system design is to create "reliable" objects out of unreliable objects. For example, a stable storage can be created out of several disk objects and the proper logic. Then a physical processor, a checkpointing capability, a stable storage, and logic can be used to create a stable processor. One can proceed in this fashion to create a very reliable system. The main drawback is potential loss of execution time efficiency. For many systems it is just too costly to incorporate an extensive number of reliability mechanisms. Reliability is also enhanced by proper access control and judicious use of distributed control, two other fundamental issues discussed in this paper. The major challenge is to integrate solutions to all these issues in a cost effective manner and produce an extremely reliable system.
E. Heterogeneity
Incompatibility problems arise in heterogeneous DCS's in a number of ways [14], [17], [71] and at all levels. First, incompatibility is due to the different internal formatting schemes that exist in a collection of different communication and host processors. Second, incompatibility also arises from the differences in communication protocols and topology when networks are connected to other networks via gateways. Third, major incompatibilities arise due to different operating systems, file servers, and database systems that might exist on a (set of) network(s).
The easiest solution to this general problem for a single DCS is to avoid the issue by using a homogeneous collection of machines and software. If this is not practical, then some form of translation is necessary. Some earlier systems left this translation to the user. This is no longer acceptable. Translation done by the DCS system can be done at the receiver host or at the source host. If it is done at the receiver host then the data traverse the network in their original form. The data usually are supplemented with extra information to guide the translation. The problem with this approach is that at every host there must be a translator to convert each format in the system to the format used on the receiving host. When there exist "n" different formats, this requires the support of (n - 1) translators at each host. Performing the translation at the source host before transmitting the data is subject to all the same problems.
There are two better solutions, each applicable under different situations: an intermediate translator, or an intermediate standard data format.
An intermediate translator accepts data from the source and produces the acceptable format for the destination. This is usually used when the number of different types of necessary conversions is small. For example, a gateway linking two different networks acts as an intermediate translator.
For a given conversion problem, if the number of different types to be dealt with grows large, then a single intermediate translator becomes unmanageable. In this case, an intermediate standard data format (interface) is declared, hosts convert to the standard, data are moved in the format of the standard, and another conversion is performed at the destination. By choosing the standard to be the most common format in the system, the number of conversions can be reduced.
At a high level of abstraction the heterogeneity problem and the necessary translations are well understood. However, implementing the translators in a cost effective way has not been achieved in general. Complicated issues are precision loss, format incompatibilities (e.g., minus zero value in sign magnitude and 1's complement cannot be represented in 2's
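The translator counts in the heterogeneity discussion can be made concrete with a small calculation. The sketch below is only illustrative and rests on stated assumptions: direct translation needs one translator per ordered pair of distinct formats (which gives the (n - 1) translators per host mentioned in the text when each host uses one format), while an intermediate standard format needs one converter to the standard and one from it per format. The function names are hypothetical.

```python
def direct_translators_per_host(n: int) -> int:
    """Translators a host needs to accept each of the other n - 1 formats."""
    return max(n - 1, 0)


def direct_translators_total(n: int) -> int:
    """Assumption: one translator per ordered pair of distinct formats."""
    return n * (n - 1) if n > 0 else 0


def standard_format_converters_total(n: int, standard_is_native: bool = False) -> int:
    """Converters needed with an intermediate standard data format.

    Assumption: every format needs a to-standard and a from-standard converter.
    If the standard is itself one of the n formats in use, that format needs none.
    """
    formats_needing_conversion = n - 1 if (standard_is_native and n > 0) else n
    return 2 * formats_needing_conversion


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n,
              direct_translators_per_host(n),
              direct_translators_total(n),
              standard_format_converters_total(n),
              standard_format_converters_total(n, standard_is_native=True))
```

Under these assumptions the direct approach grows quadratically with the number of formats while the standard-format approach grows linearly, which is consistent with the text's remark that a single intermediate translator becomes unmanageable as the number of types grows.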
complement), data type incompatibilities (e.g., mapping of an upper/lower case terminal to an upper case only terminal is a loss of information), efficiency concerns, the number and locations of the translators, and what constitutes a good intermediate data format for a given incompatibility problem.
As DCS's become more integrated one can expect that both programs and complicated forms of data might be moved to heterogeneous hosts. How will a program run on this host given that the host has different word lengths, different machine code, and different operating system primitives? How will database relations stored as part of a CODASYL database be converted to a relational model and its associated storage scheme? Moving a data structure object requires knowledge about the semantics of the structure (e.g., that some of the fields are pointers and these have to be updated upon a move). How should this information be imparted to the translators, what are the limitations if any, and what are the benefits and costs of having this kind of flexibility? In general, the problem of providing translation for movement of data and programs between heterogeneous hosts and networks has not been solved. The main problem is ensuring that such programs and data are interpreted correctly at the destination host. In fact, the more difficult problems in this area have been largely ignored.
It is inevitable that incompatibilities will exist in DCS's because it is quite natural to extend such systems by interconnecting networks, by adding new hosts and communication processors, and by increasing functionality with new software. Further, the main function of NOS's and file servers is precisely to present a uniform logical interface (view) to the end user, from a collection of different environments. Depending on the degree of incompatibility, the cost of the translations can be high, thereby limiting their application in DCS's which have severe real-time constraints. While real-time systems should also be as extensible as possible, they will probably have to rely on a few translators, on good object-based design and on extensible distributed control algorithms rather than on being able to incrementally add more and more translators. For all DCS's, a method is needed to limit the number of incompatibilities during the lifetime of the system, while allowing significant and easy extensibility.
The problems associated with heterogeneity are currently considered less important than the other problems considered in this paper because they can be handled on a problem by problem basis. However, as DCS's become more sophisticated and approach achieving their full potential, then we believe the heterogeneity issue will become increasingly important and a problem by problem solution may not work.
F. Efficiency
Distributed computer systems are meant to be efficient in a multitude of ways. Resources (files, compilers, debuggers, and other software products) developed at one host can be shared by users on other hosts limiting duplicate efforts. Expensive hardware resources can also be shared minimizing costs. Communication facilities such as electronic mail and file transfer protocols also improve efficiency by enabling better and faster transfer of information. The multiplicity of processing elements might also be exploited to improve response time and throughput of user processes. While efficiency concerns exist at every level in the system, they must also be treated as an integrated "system" level issue. For example, a good design, the proper tradeoffs between levels, and the paring down of over-ambitious features usually improves efficiency. In this section, however, we concentrate on discussing efficiency as it relates to the execution time of processes.
Once the system is operational, improving response time and throughput of user processes is largely the responsibility of scheduling and resource management algorithms [28], [30], [35], [95], [124], [125], [127]-[129]. The scheduling algorithm is intimately related to the resource allocator because a process will not be scheduled for the CPU if it is waiting for a resource. If a DCS is to exploit the multiplicity of processors and resources in the network it must contain more than "n" independent local schedulers. The local schedulers must interact and cooperate and the degree to which this occurs can vary widely. We suggest that a good scheduling algorithm for a DCS will be a heuristic that acts like an "expert system." This expert system's task is to effectively utilize the resources of the entire distributed system given a complex and dynamically changing environment. We hope to illustrate this in the following discussion.
In the remainder of this section when we refer to the scheduling algorithm we are referring to the part of the scheduler (the expert system) that is responsible for choosing the host of execution for a process. We assume that there is another part of the scheduler which assigns the local CPU to the highest priority ready process.
We divide the characteristics of a DCS which influence response time and throughput into two types: 1) system characteristics, and 2) scheduling algorithm characteristics. System characteristics include: the number, type, and speed of processors, the allocation of data and programs [29], whether data and programs can be moved, the amount and location of replicated data and programs, how data are partitioned, partitioned functionality in the form of dedicated processors, any special purpose hardware, characteristics of the communication subnet, and special problems of distribution such as no central clock and the inherent delays in the system. A good scheduling algorithm would take the system characteristics into account. Scheduling algorithm characteristics include: the type and amount of state information used, how and when that information is transmitted, how that information is used (degree and type of cooperation between distributed scheduling entities), when the algorithm is invoked, adaptability of the algorithm, and the stability of the algorithm.
By the type of state information, it is meant whether the algorithm uses queue lengths, CPU utilization, amount of free memory, estimated average response time, etc., in making its scheduling decision. The type of information also refers to whether the information is local or networkwide information. For example, a scheduling algorithm on host 1 could use queue lengths of all the hosts in the network in making its decision. The amount of state information refers
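As a small illustration of a scheduler that uses networkwide state information of one type (queue lengths) to choose the host of execution, consider the sketch below. The policy shown (move a process only when some remote queue is shorter than the local one by at least a margin) and all names are assumptions made for this example; it is not an algorithm taken from the literature cited in this paper.

```python
from typing import Dict


def choose_host(local_host: str, queue_lengths: Dict[str, int], margin: int = 1) -> str:
    """Pick the host of execution for a new process.

    queue_lengths maps each host name to its last known queue length and must
    contain local_host. With only the local entry present, this degenerates to
    purely local information and the process always stays where it is.
    """
    local_len = queue_lengths[local_host]
    # Shortest queue wins; ties prefer the local host, then alphabetical order.
    best_host = min(
        queue_lengths,
        key=lambda h: (queue_lengths[h], h != local_host, h),
    )
    if best_host != local_host and local_len - queue_lengths[best_host] >= margin:
        return best_host
    return local_host


if __name__ == "__main__":
    print(choose_host("host1", {"host1": 5, "host2": 2, "host3": 7}))
```

The `margin` parameter is a crude stand-in for the cost of moving a process: a larger margin keeps more work local.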
to the number of different types of information used by the scheduler.
Information used by a scheduler can be transmitted periodically or asynchronously. If asynchronously, it may be sent only when requested (as in bidding), it may be piggybacked on other messages between hosts, or it may be sent only when conditions change by some amount. The information may be broadcast to all hosts, sent to neighbors only, or to some specific set of hosts.
The information is used to estimate the loads on other hosts of the network in order to make an informed global scheduling decision. However, the data received are out of date and even the ordering of events might not be known [64]. It is necessary to manipulate the data in some way to obtain better estimates. Several examples are: very old data can be discarded; given that state information is timestamped a linear estimation of the state extrapolated to the current time might be feasible; conditional probabilities on the accuracy of the state information might be calculated in parallel with the scheduler by some monitor nodes and applied to the received state information; the estimates can be some function of the age of the state information; or some form of (iterative) message interchange might be feasible. A message interchange is subject to long delays before the scheduling decision is made, and if mutual agreement among scheduling entities is necessary even in the presence of failures, then the interchange is also prone to the Byzantine Generals Problem [33], [65]. The Byzantine Generals Problem is a particularly disruptive type of problem where no assumptions can be made about the type of failure of a process involved in message exchanges. For example, the failed process can send messages when it is not supposed to, can make conflicting claims to other processes, or act dead for a while and then revive. Even though it is hard to believe that a process would act like this on purpose, in practice, systems fail in unexpected ways giving rise to these kinds of behavior. Protecting against this type of failure is a conservative approach to reliable systems design.
Before a process is actually moved the cost of moving it must be accounted for in determining the estimated benefit of the move. This cost is different if the process has not yet begun execution than if it is already in progress. In both cases, the resources required must also be considered. If a process is in execution, then environment information (e.g., the process control blocks) probably should be moved with the process. It is expected that in many cases the decision will be not to move the process.
Schedulers invoked too often will produce excessive overhead. If they are not invoked often enough they will not be able to react fast enough to changing conditions. There will be undue startup delay for processes. There must be some ideal invocation schedule which is a function of the load.
In a complicated DCS environment it can be expected that the scheduler will have to be quite adaptive [12], [24], [123]. A scheduler might make minor adjustments in weighing the importance of various factors as the network state changes in an attempt to track a slowly changing environment. Major changes might require major adjustments. For example, under very light loads there does not seem to be much justification for networkwide scheduling, so the algorithm might be turned off, except the part that can recognize a change in the load. At moderate loads, the full blown scheduling algorithm might be employed. This might include individual hosts refusing all requests for information and refusing to accept any process because it is too busy. Under heavy loads on all hosts it again seems unnecessary to use networkwide scheduling. A bidding scheme might use both source and server directed bidding [113], [125]. An overloaded host asks for bids and is the source of work for some other host in the network. Similarly, a lightly loaded host may make a reverse bid, i.e., it asks the rest of the network for some work. The two types of bidding might coexist. Schedulers could be designed in a multilevel fashion with decisions being made at different rates, e.g., local decisions and state information updates occur frequently, but more global exchanges of decisions and state information might proceed at a slower rate because of the inherent cost of these global actions.
Stability [25] refers to the situation where processes are moved among hosts in the network in an incremental and orderly fashion. It is not acceptable for N - 1 hosts to flood a lightly loaded host in such a manner that the previously lightly loaded host must now reassign some of the work moved to it. (Some form of hysteresis is required.) Scheduling algorithms can employ implicit or explicit stability mechanisms. An implicit mechanism exists when the algorithm is tuned so that the relative importance of the various factors used, the relative timings of the scheduling algorithm, the passing of state information, the characteristics of the processes in the system, the adaptability of the algorithm, etc., are all integrated in the right proportion to provide a stable system. Implicit treatment of stability can be dangerous, but requires less overhead than explicit mechanisms. Explicit mechanisms refer to specific logic and information that is used to better guarantee stability. The overheads of explicit mechanisms can be very high and as one tries to lower the overheads, stability becomes jeopardized. It is not clear which technique is better.
An important part of efficiency is adequate measurement techniques. Many of the issues raised above require measurement methods. It might be necessary to measure the delay in the subnet, or between two hosts, or the utilization of a host, or the probabilities of certain conditions in the distributed system. The cost of the measurement must be weighed against the benefits it produces.
A classic efficiency question in any system is: what should be supported by the kernel, or more generally by the operating system, and what should be left to the user? The trend in DCS is to support objects, primitive IPC mechanisms, and processes in the kernel [115]. Some researchers advocate supporting the concept of a transaction in the kernel. This argument will never be settled conclusively since it is a function of the requirements, type of processes running, etc. This is the classical Vertical Migration question [119].
Of course, many other efficiency questions remain at all levels including those briefly discussed throughout the previous sections of this paper. These include: the efficiency of the object model, the end-to-end argument [104], locking granularity, performance of remote operations, improvements due to distributed control, the cost effectiveness of various reliability mechanisms, and efficiently dealing with heterogeneity. Efficiency is, therefore, not a separate issue
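Two of the ideas from the scheduling discussion, extrapolating timestamped state information to the current time (while discarding very old data) and using hysteresis as an explicit stability mechanism, can be sketched as follows. This is a minimal sketch under stated assumptions: the report format, the staleness cutoff `max_age`, and the `low`/`high` thresholds are hypothetical choices made for illustration and are not taken from any of the cited schedulers.

```python
from typing import List, Optional, Tuple

Report = Tuple[float, float]  # (timestamp, observed load), hypothetical format


def estimate_load(reports: List[Report], now: float, max_age: float) -> Optional[float]:
    """Estimate a remote host's current load from timestamped reports.

    Reports older than max_age are discarded. With two or more fresh reports,
    the two most recent are extrapolated linearly to time `now`; with one,
    its load is returned unchanged; with none, the result is None.
    """
    fresh = sorted(r for r in reports if now - r[0] <= max_age)
    if not fresh:
        return None
    if len(fresh) == 1:
        return fresh[-1][1]
    (t0, load0), (t1, load1) = fresh[-2], fresh[-1]
    if t1 == t0:
        return load1
    slope = (load1 - load0) / (t1 - t0)
    return max(0.0, load1 + slope * (now - t1))  # a load cannot be negative


class HysteresisPolicy:
    """Decide whether a host accepts work moved from other hosts.

    The host starts accepting only when its load drops below `low` and stops
    only when its load rises above `high`, so small fluctuations between the
    two thresholds do not flip the decision back and forth.
    """

    def __init__(self, low: float, high: float) -> None:
        if low >= high:
            raise ValueError("low must be smaller than high")
        self.low = low
        self.high = high
        self.accepting = False

    def update(self, load: float) -> bool:
        if self.accepting and load > self.high:
            self.accepting = False
        elif not self.accepting and load < self.low:
            self.accepting = True
        return self.accepting
```

The gap between the two thresholds trades responsiveness for stability: a wider gap makes it less likely that a previously lightly loaded host is flooded and must immediately reassign work, at the price of reacting more slowly to real changes in load.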
but must be addressed for each issue in order to result in an efficient, reliable, and extensible DCS. A difficult question to answer is exactly what is acceptable performance given that multiple decisions are being made at all levels and that these decisions are being made in the presence of missing and inaccurate information.
IV. SUMMARY
While it is true that DCS's have proliferated, it is also true that there remain many unsolved problems relating to the issues of the object model, access control, distributed control, reliability, heterogeneity, and efficiency as well as their interactions. These fundamental issues have been recognized for some time, but solutions to problems associated with these issues have not produced totally satisfactory systems. We will not achieve the full potential advantages of DCS's until better experimental evidence is obtained and until current and new solutions are better integrated in a systemwide, flexible, and cost effective manner. Major breakthroughs will be required to achieve this potential. It is our opinion that such breakthroughs will be largely based on distributed decision making that will necessarily use heuristics similar to those found in "expert" systems. The heuristics, though, will have to directly address the problems of distribution, such as long delays, the assignment of credit, missing and out-of-date information, the use of statistical information, and failure events. Further, the heuristics will have to deal with such complexity very efficiently, and this eliminates many classical solutions. These complications are not typically found in expert systems to date, so it is not possible to simply "borrow" the solution. To achieve the objectives of DCS's, it is important to study distributed resource management paradigms that view resources at an integrated "system" level which includes the hardware, the communication subnet, the operating system, the database system, the programming language, and other software resources. To this end, this paper has tried to take a system viewpoint in presenting six fundamental issues. A DCS of the future will be a form of an extensible, adaptable, physically distributed, but logically integrated, expert system.
In summary, this paper has presented a brief overview of distributed computer systems, and then discussed some of the problems and solutions for each of six fundamental distributed systems issues. The paper has described (by means of examples from the communication subnet, distributed operating system, and distributed database areas) the interactions among these issues, and the need for better integration of solutions. Important issues that this paper has not covered due to lack of space are the need for a theory and specification language for distributed systems, as well as the need for a distributed systems methodology.
REFERENCES
[1] G. R. Andrews and F. Schneider, "Concepts and notations for concurrent programming," ACM Comput. Surveys, vol. 15, no. 1, Mar. 1983.
[4] M. Schwartz and T. E. Stern, "Routing techniques used in computer communication networks," IEEE Trans. Commun., vol. COM-28, Apr. 1980.
[5] D. Walden and A. McKensie, "The evolution of host to host protocol technology," IEEE Computer, vol. 12, Sept. 1979.
[6] D. W. Davies and D. L. A. Barber, Communication Networks For Computers. New York: Wiley, 1973.
[7] D. W. Davies, D. L. A. Barber, W. L. Price, and C. M. Solomonides, Computer Networks and Their Protocols. New York: Wiley, 1979.
[8] W. R. Franta and I. Chlamtac, Local Networks. Lexington, MA: Lexington Books, 1981.
[9] J. Martin, Computer Networks and Distributed Processing. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[10] J. McNamara, Technical Aspects of Data Communication. Maynard, MA: Digital, 1977.
[11] C. Weitzman, Distributed Micro/Minicomputer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1980.
[12] A. K. Agrawala, S. K. Tripathi, and G. Ricart, "Adaptive routing using a virtual waiting time technique," IEEE Trans. Software Eng., vol. SE-8, Jan. 1982.
[13] P. Alsberg and J. Day, "A principle for resilient sharing of distributed resources," in Proc. 2nd Int. Conf. Software Eng., 1976.
[14] B. Anderson et al., "Data reconfiguration service," Bolt Beranek and Newman, Tech. Rep., May 1971.
[15] Apollo Domain Architecture, Apollo Computer, Inc., Feb. 1981.
[16] J. M. Ayache, J. P. Courtiat, and M. Diaz, "REBUS, A fault tolerant distributed system for industrial control," IEEE Trans. Comput., vol. C-31, July 1982.
[17] M. Bach, N. Coguen, and M. Kaplan, "The ADAPT system: A generalized approach towards data conversion," in Proc. 5th Int. Conf. Very Large Data Bases, Rio de Janeiro, Brazil, Oct. 1979.
[18] J. E. Ball, J. Feldman, J. Low, R. Rashid, and P. Rovner, "RIG, Rochester's intelligent gateway: System overview," IEEE Trans. Software Eng., vol. SE-2, no. 4, Dec. 1980.
[19] D. K. Barclay, E. R. Byrne, and F. K. Ng, "A real-time database management system for No. 5 ESS," Bell Syst. Tech. J., vol. 61, no. 9, Nov. 1982.
[20] J. N. Bartlett, "A non-stop operating system," in Proc. 11th Hawaii Int. Conf. Syst. Sci., Jan. 1978.
[21] P. A. Bernstein, D. W. Shipman, and J. B. Rothnie, Jr., "Concurrency control in a system for distributed databases (SDD-1)," ACM Trans. Database Syst., vol. 5, no. 1, pp. 18-25, Mar. 1980.
[22] P. Bernstein and N. Goodman, "Concurrency control in distributed database systems," ACM Comput. Surveys, vol. 13, no. 2, June 1981.
[23] A. Birrell, R. Levin, R. Needham, and M. Schroeder, "Grapevine: An exercise in distributed computing," Commun. ACM, vol. 25, pp. 260-274, Apr. 1982.
[24] S. H. Bokhari, "Dual processor scheduling with dynamic reassignment," IEEE Trans. Software Eng., vol. SE-5, no. 4, July 1979.
[25] R. M. Bryant and R. A. Finkel, "A stable distributed scheduling algorithm," in Proc. 2nd Int. Conf. Distrib. Comput. Syst., Apr. 1981.
[26] L. Casey and N. Shelness, "A domain structure for distributed computer system," in Proc. 6th ACM Symp. Oper. Syst. Princ., Nov. 1977, pp. 101-108.
[27] T. C. K. Chou and J. A. Abraham, "Load redistribution under failure in distributed systems," IEEE Trans. Comput., vol. C-32, pp. 799-808, Sept. 1983.
[28] T. C. K. Chou and J. A. Abraham, "Load balancing in distributed systems," IEEE Trans. Software Eng., vol. SE-8, no. 4, July 1982.
[29] W. W. Chu, "Optimal file allocation in a multiple computing system," IEEE Trans. Comput., vol. C-18, pp. 885-889, Oct. 1969.
[30] W. W. Chu, L. J. Holoway, M. Lan, and K. Efe, "Task allocation in distributed data processing," IEEE Computer, vol. 13, pp. 57-69, Nov. 1980.
[31] D. W. Davies, E. Holler, E. D. Jensen, S. R. Kimbleton, B. W. Lampson, G. LeLann, K. J. Thurber, and R. W. Watson, Distributed Systems - Architecture and Implementation, Vol. 105, Lecture Notes in Computer Science. Berlin: Springer-Verlag, 1981.
[32] J. Dion, "The Cambridge file server," ACM Oper. Syst. Rev., Oct. 1980.
[33] P. Dolev, "The Byzantine generals strike again," J. Algorith., vol. 3, no. 1, 1982.
[34] C. Dwork and D. Skeen, "The inherent cost of nonblocking commitment," Dep. Comput. Sci., Cornell Univ., Ithaca, NY, Tech. Rep., May 1983.
[35] K. Efe, "Heuristic models of task assignment scheduling in distributed
[2] P. Green, "An introduction to network architectures and protocols," systems," IEEE Computer, vol. 15, June 1982.
IEEE Trans. Commun., vol. COM-28,-Apr. 1980. [36] P. Enslow, "What is a distributed data processing system," IEEE
[3] L. Kleinrock and M. Gerla, "Flow control: A comparative survey," Computer, vol. 11, Jan. 1978.
IEEE Trans. Commun., vol. COM-28, Apr. 1980. [37] P. Enslow and T. Saponas, "Distributed and decentralized control in a
fully distributed processing system," Tech. Rep. GIT-ICS-81/82, Sept. 1980.
[38] D. J. Farber et al., "The distributed computer system," in Proc. 7th Annu. IEEE Comput. Soc. Int. Conf., Feb. 1973.
[39] W. D. Farmer and E. E. Newhall, "An experimental distributed switching system to handle bursty computer traffic," in Proc. ACM Symp. Probl. Opt. Data Commun. Syst., 1969.
[40] R. A. Floyd and C. S. Ellis, "The ROE file system," in Proc. 3rd Symp. Reliability Distrib. Software Database Syst., Oct. 1983.
[41] H. C. Forsdick, R. E. Schantz, and R. H. Thomas, "Operating systems for computer networks," IEEE Computer, vol. 11, Jan. 1978.
[42] A. G. Fraser, "Spider - An experimental data communications system," Bell Labs., Tech. Rep., 1975.
[43] M. Fridrich and W. Older, "The FELIX file server," in Proc. 8th Symp. Oper. Syst. Princ. (SIGOPS), Dec. 1981, pp. 37-44.
[44] R. Gallager, "A minimum delay routing algorithm using distributed computation," IEEE Trans. Commun., vol. COM-25, Jan. 1977.
[45] J. Garcia-Molina, "Reliability issues for fully replicated distributed databases," IEEE Computer, vol. 16, pp. 34-42, Sept. 1982.
[46] D. Gifford, "Weighted voting for replicated data," in Proc. 7th Symp. Oper. Syst. Princ., Dec. 1979, pp. 150-159.
[47] D. Gifford, "Violet: An experimental decentralized system," Oper. Syst. Rev., vol. 13, no. 5, Dec. 1979.
[48] V. D. Gligor and S. H. Shattuck, "On deadlock detection in distributed systems," IEEE Trans. Software Eng., vol. SE-6, no. 5, pp. 435-440, Sept. 1980.
[49] J. N. Gray, R. A. Lorie, and G. R. Putzolu, "Granularity of locks in a shared database," in Proc. Int. Conf. Very Large Database, Sept. 1975, pp. 428-451.
[50] J. N. Gray, "Notes on data base operating systems," in Operating Systems: An Advanced Course. Berlin: Springer-Verlag, 1979.
[51] J. N. Gray, "The transaction concept: Virtue and limitations," in Proc. Int. Conf. Very Large Database, Sept. 1981, pp. 144-154.
[52] M. Guillemont, "The chorus distributed operating system: Design and implementation," in Proc. Int. Symp. Local Comput. Networks, Florence, Italy, Apr. 1982.
[53] J. Hamilton, "Functional specification of the WEB kernel," DEC RD Group, Maynard, MA, Nov. 1978.
[54] Y.-C. Ho, "Team decision theory and information structures," Proc. IEEE, vol. 68, June 1980.
[55] R. A. Jarvis, "Optimization strategies in adaptive control: A selective survey," IEEE Trans. Syst., Man, Cybern., vol. SMC-5, Jan. 1975.
[56] E. D. Jensen, "The Honeywell experimental distributed processor - An overview of its objective, philosophy and architectural facilities," IEEE Computer, vol. 11, Jan. 1978.
[57] E. D. Jensen and N. Pleszkoch, "ArchOs: A physically dispersed operating system," Distrib. Processing Tech. Comm. Newsletter, Summer 1984.
[58] A. K. Jones, "The object model: A conceptual tool for structuring software," in Lecture Notes in Computer Science, Vol. 60. Berlin: Springer-Verlag, 1978.
[59] A. K. Jones, R. J. Chansler, I. Durham, K. Schwans, and S. R. Vegdahl, "StarOS, A multiprocessor operating system for the support of task forces," in Proc. 7th Symp. Oper. Syst. Princ., Dec. 1979.
[60] K. C. Kahn et al., "iMax: A multiprocessor operating system for an object-based computer," in Proc. 8th Symp. Oper. Syst. Princ., Dec. 1981, pp. 14-16.
[61] S. R. Kimbleton, H. M. Wood, and M. L. Fitzgerald, "Network operating systems - An implementation approach," in Proc. AFIPS Conf., 1978.
[62] W. Kohler, "A survey of techniques for synchronization and recovery in decentralized computer systems," ACM Comput. Surveys, vol. 13, no. 2, June 1981.
[63] H. T. Kung and J. T. Robinson, "On optimistic methods for concurrency control," ACM Trans. Database Syst., vol. 6, no. 2, June 1981.
[64] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Commun. ACM, July 1978.
[65] L. Lamport, R. Shostak, and M. Pease, "The Byzantine generals problem," ACM Trans. Programming Lang. Syst., vol. 4, no. 3, July 1982.
[66] B. Lampson, "Atomic transactions," in Lecture Notes in Computer Science, Vol. 105, B. W. Lampson, M. Paul, and H. J. Siegert, Eds. Berlin: Springer-Verlag, 1980, pp. 365-370.
[67] R. E. Larsen, Tutorial: Distributed Control, IEEE Catalog No. EHO 153-7, New York: IEEE Press, 1979.
[68] E. Lazowska, H. Levy, G. Almes, M. Fischer, R. Fowler, and S. Vestal, "The Architecture of the Eden System," in Proc. 8th Annu. Symp. Oper. Syst. Princ., Dec. 1981.
[69] G. LeLann, "Algorithms for distributed data-sharing systems which use tickets," in Proc. 3rd Berkeley Workshop Distrib. Databases Comput. Networks, 1978.
[70] G. LeLann, "A distributed system for real-time transaction processing," IEEE Computer, vol. 14, Feb. 1981.
[71] P. H. Levine, "Facilitating interprocess communication in a heterogeneous network environment," Masters thesis, Massachusetts Inst. Technol., Cambridge, MA, June 1977.
[72] B. Lindsay, "Object naming and catalogue management for a distributed database manager," IBM Res. Rep. RJ2914, Aug. 1980.
[73] B. Liskov and R. Scheifler, "Guardians and actions: Linguistic support for robust, distributed programs," in Proc. 9th Symp. Princ. Programming Lang., Jan. 1982, pp. 7-19.
[74] M. T. Liu, D. Tsay, C. Chou, and C. Li, "Design of the distributed double-loop computer network (DDLCN)," J. Digital Syst., vol. V, no. 12, 1981.
[75] G. W. R. Luderer et al., "A distributed UNIX system based on a virtual circuit switch," in Proc. 8th Symp. Oper. Syst. Princ., Dec. 1981.
[76] J. R. McGraw and G. R. Andrews, "Access control in parallel programs," IEEE Trans. Software Eng., vol. SE-5, Jan. 1979.
[77] M. S. McKendry, J. E. Allchin, and W. C. Thibault, "Architecture for global operating system," in Proc. IEEE INFOCOM, Apr. 1983.
[78] J. M. McQuillan and D. C. Walden, "The ARPA network design decisions," Comput. Networks, vol. 1, Aug. 1977.
[79] J. M. McQuillan, I. Richer, and E. C. Rosen, "The new routing algorithm for the ARPANET," IEEE Trans. Commun., vol. COM-28, May 1980.
[80] A. Meijer and P. Peeters, Computer Network Architectures. Rockville, MD: Computer Science Press, 1982.
[81] P. M. Melliar-Smith and R. L. Schwartz, "Formal specification and mechanical verification of SIFT," IEEE Trans. Comput., vol. C-31, July 1982.
[82] D. A. Menasce and R. R. Muntz, "Locking and deadlock detection in distributed data bases," IEEE Trans. Software Eng., vol. SE-5, no. 3, May 1979.
[83] P. M. Merlin and A. Segall, "A failsafe distributed routing protocol," IEEE Trans. Commun., vol. COM-27, Sept. 1979.
[84] R. M. Metcalfe and D. Boggs, "Ethernet: Distributed packet switching for local computer networks," Commun. ACM, vol. 19, July 1976.
[85] J. G. Mitchell and J. Dion, "A comparison of two network-based file servers," Commun. ACM, vol. 25, pp. 233-245, Apr. 1982.
[86] J. E. B. Moss, "Nested transactions and reliable distributed computing," in Proc. 2nd Symp. Reliability Distrib. Software Database Syst., July 1982.
[87] R. M. Needham and A. J. Herbert, The Cambridge Distributed Computing System. London: Addison-Wesley, 1982.
[88] B. J. Nelson, "Remote procedure call," Xerox Corp., Tech. Rep. CSL-81-9, May 1981.
[89] D. Oppen and Y. K. Dalal, "The clearinghouse: A decentralized agent for locating named objects in a distributed environment," Xerox Corp., Office Products Div. Rep. OPD-T8103, Oct. 1981.
[90] J. Ousterhout, D. Scelza, and P. Sindhu, "Medusa: An experiment in distributed operating system structure," Commun. ACM, vol. 23, Feb. 1980.
[91] J. Pierce, "How far can data loops go," IEEE Trans. Commun., vol. COM-20, June 1972.
[92] G. Popek et al., "LOCUS, A network transparent, high reliability distributed system," in Proc. 8th Symp. Oper. Syst. Princ., Dec. 1981, pp. 14-16.
[93] L. Pouzin, "Presentation and major design aspects of the Cyclades computer network," in Proc. 3rd ACM Data Commun. Conf., Nov. 1973.
[94] M. L. Powell and B. P. Miller, "Process migration in DEMOS/MP," in Proc. 9th Symp. Oper. Syst. Princ., Oct. 1983.
[95] K. Ramamritham and J. A. Stankovic, "Dynamic task scheduling in distributed hard real-time systems," IEEE Software, vol. 1, no. 3, July 1984.
[96] B. Randell, "Recursively structured distributed computing systems," in Proc. 3rd Symp. Reliability Distrib. Software Database Syst., Oct. 1983.
[97] R. Rashid, "An inter-process communication facility for UNIX," Carnegie-Mellon Univ., Pittsburgh, PA, Tech. Rep., June 1980.
[98] R. F. Rashid and G. G. Robertson, "Accent: A communication oriented network operating system kernel," in Proc. 8th Symp. Oper. Syst. Princ., Dec. 1981.
[99] D. R. Ries and M. R. Stonebraker, "Locking granularity revisited," ACM Trans. Database Syst., pp. 210-227, June 1979.
[100] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis, "System level concurrency control for distributed database systems," ACM Trans. Database Syst., vol. 3, no. 2, June 1978.
[101] J. B. Rothnie, Jr., P. A. Bernstein, S. Fox, N. Goodman, M. Hammer, T. A. Landers, C. Reeve, D. W. Shipman, and E. Wong, "Introduction to a system for distributed databases (SDD-1)," ACM Trans. Database Syst., vol. 5, no. 1, pp. 1-17, Mar. 1980.
[102] L. A. Rowe and K. P. Birman, "A local network based on the UNIX operating system," IEEE Trans. Software Eng., vol. SE-8, no. 2, Mar. 1982.
[103] J. H. Saltzer, "Naming and binding of objects," in Operating Systems: An Advanced Course. Berlin: Springer-Verlag, 1978.
[104] J. H. Saltzer, D. P. Reed, and D. D. Clark, "End-to-end arguments in system design," in Proc. 2nd Int. Conf. Distrib. Comput. Syst., Apr. 1981.
[105] N. Sandell, P. Varaiya, M. Athans, and M. Safonov, "Survey of decentralized control methods for large scale systems," IEEE Trans. Auto. Cont., vol. AC-23, no. 2, Apr. 1978.
[106] A. Segall, "The modelling of adaptive routing in data-communication networks," IEEE Trans. Commun., vol. COM-25, no. 1, pp. 85-95, Jan. 1977.
[107] D. G. Severance and G. M. Lohman, "Differential files: Their application to the maintenance of large databases," ACM Trans. Database Syst., vol. 1, no. 3, Sept. 1976.
[108] S. K. Shrivastava and F. Panzieri, "The design of a reliable remote procedure call mechanism," IEEE Trans. Comput., vol. C-31, July 1982.
[109] D. Siewiorek and R. Swarz, The Theory and Practice of Reliable System Design. Bedford, MA: Digital, 1982.
[110] D. Skeen, "Nonblocking commit protocols," in Proc. ACM SIGMOD, 1981.
[111] D. Skeen, "A decentralized termination protocol," in Proc. 1st IEEE Symp. Reliability Distrib. Software Database Syst., 1981.
[112] D. Skeen and M. Stonebraker, "A formal model of crash recovery in a distributed system," IEEE Trans. Software Eng., vol. SE-9, no. 3, May 1983.
[113] G. R. Smith, "The contract net protocol: High level communication and control in a distributed problem solver," IEEE Trans. Comput., vol. C-29, Dec. 1980.
[114] M. H. Solomon and R. A. Finkel, "The Roscoe distributed operating system," in Proc. 7th Symp. Oper. Syst. Princ., Mar. 1979.
[115] A. Z. Spector, "Performing remote operations efficiently on a local computer network," Commun. ACM, vol. 25, pp. 246-259, Apr. 1982.
[116] A. Z. Spector and P. M. Schwarz, "Transactions: A construct for reliable distributed computing," ACM Oper. Syst. Rev., vol. 17, no. 2, Apr. 1983.
[117] S. K. Srivastava, "On the treatment of orphans in a distributed system," in Proc. 3rd Symp. Reliability Distrib. Syst., Oct. 1983.
[118] W. Stallings, Local Networks. New York: Macmillan, 1984.
[119] J. A. Stankovic, "The types and interactions of vertical migrations of functions in a multi-level interpretive system," IEEE Trans. Comput., vol. C-30, July 1981.
[120] J. A. Stankovic, "Improving system structure and its affect on vertical migration," Microprocessing and Microprogramming, vol. 8, no. 3,4,5, Dec. 1981.
[121] J. A. Stankovic, "ADCOS - An adaptive, system-wide, decentralized controlled operating system," Univ. Massachusetts, Amherst, MA, Tech. Rep. ECE-CS-81-2, 1981.
[122] J. A. Stankovic, "Software communication mechanisms: Procedure calls versus messages," IEEE Computer, vol. 15, Apr. 1982.
[123] J. A. Stankovic, "Simulations of three adaptive decentralized controlled, job scheduling algorithms," Comput. Networks, vol. 8, no. 3, pp. 199-217, June 1984.
[124] J. A. Stankovic, "Bayesian decision theory and its application to decentralized control of job scheduling," IEEE Trans. Comput., vol. C-34, Jan. 1985, to be published.
[125] J. A. Stankovic and I. S. Sidhu, "An adaptive bidding algorithm for processes, clusters and distributed groups," in Proc. 4th Int. Conf. Distrib. Comput., May 1984.
[126] J. A. Stankovic, K. Ramamritham, and W. Kohler, "Current research and critical issues in distributed system software," Dep. Elec. Comput. Eng., Univ. Massachusetts, Amherst, MA, Tech. Rep., 1984.
[127] H. S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms," IEEE Trans. Software Eng., vol. SE-3, Jan. 1977.
[128] H. S. Stone, "Critical load factors in distributed computer systems," IEEE Trans. Software Eng., vol. SE-4, May 1978.
[129] H. S. Stone and S. H. Bokhari, "Control of distributed processes," IEEE Computer, vol. 11, pp. 97-106, July 1978.
[130] M. Stonebraker and E. Neuhold, "A distributed database version of INGRES," in Proc. 1977 Berkeley Workshop Distrib. Data Management Comput. Networks, pp. 19-36.
[131] M. Stonebraker, "Operating system support for database management," Commun. ACM, vol. 24, pp. 412-418, July 1981.
[132] H. Sturgis, J. Mitchell, and J. Israel, "Issues in the design and use of distributed file system," ACM Oper. Syst. Rev., July 1980.
[133] D. Swinehart, G. McDaniel, and G. Boggs, "WFS: A simple shared file system for a distributed environment," in Proc. 7th Symp. Oper. Syst. Princ., Dec. 1979.
[134] A. S. Tanenbaum, Computer Networks. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[135] R. R. Tenny and N. R. Sandell, Jr., "Structures for distributed decision-making," IEEE Trans. Syst., Man, Cybern., vol. SMC-11, pp. 517-527, Aug. 1981.
[136] R. R. Tenny and N. R. Sandell, Jr., "Strategies for distributed decision-making," IEEE Trans. Syst., Man, Cybern., vol. SMC-11, pp. 527-538, Aug. 1981.
[137] R. H. Thomas, "A majority consensus approach on concurrency control for multiple copy databases," ACM Trans. Database Syst., vol. 4, no. 2, pp. 180-209, June 1979.
[138] D. Tsay and M. Liu, "MIKE: A network operating system for the distributed double-loop computer network," IEEE Trans. Software Eng., vol. SE-9, no. 2, Mar. 1983.
[139] A. van Dam and J. Michel, "Experience and distributed processing on a host/satellite graphics system," in Proc. SIGGRAPH, July 1976.
[140] B. G. Walker, G. Popek, R. English, C. Kline, and G. Theil, "The LOCUS distributed operating system," in Proc. 9th Symp. Oper. Syst. Princ., Oct. 1983.
[141] S. Ward, "TRIX a network oriented operating system," in Proc. COMPCON, 1980.
[142] M. V. Wilkes and R. M. Needham, The Cambridge CAP Computer and its Operating System. Amsterdam, The Netherlands: Elsevier North-Holland, 1979.
[143] L. Wittie and A. M. Van Tilborg, "MICROS, A distributed operating system for micronet, A reconfigurable network computer," IEEE Trans. Comput., vol. C-29, Dec. 1980.
[144] W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack, "HYDRA: The kernel of a multiprocessor operating system," Commun. ACM, vol. 17, June 1974.
John A. Stankovic (S'77-M'79) received the Sc.B. degree in electrical engineering in 1970, and the Sc.M. and Ph.D. degrees in computer science in 1976 and 1979, respectively, all from Brown University, Providence, RI.
He is now an Associate Professor in the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA. He has been active in distributed systems research since 1976. His current research includes various approaches to process scheduling on loosely coupled networks and recovery protocols for distributed databases. He has been involved in CARAT, a distributed systems testbed project at the University of Massachusetts.
Prof. Stankovic was coeditor of the January 1978 Special Issue of IEEE Computer on Distributed Processing. He now serves as the Vice Chairman of the IEEE Technical Committee on Distributed Operating Systems. In this capacity he has been responsible for serving as the Editor of two Special Issues of the Technical Committee's Newsletter. He received the 1983 Outstanding Junior Faculty Award for the School of Engineering, University of Massachusetts. He is a member of ACM and Sigma Xi.