
TANDEM COMPUTERS

Transaction Monitoring
in ENCOMPASS
Andrea Borr
Technical Report 81.2
June 1981
PN87601
Transaction Monitoring
in ENCOMPASS:
Reliable Distributed Transaction Processing
Andrea Borr
Tandem Computers Incorporated
19333 Vallco Parkway, Cupertino, CA 95014
June 1981
ABSTRACT: A transaction is an atomic update which takes a data base from a consistent state to another consistent state. The Transaction Monitoring Facility (TMF) is a component of the ENCOMPASS distributed data management system, which runs on the Tandem computer system. TMF provides continuous, fault-tolerant transaction processing in a decentralized, distributed environment. Recovery from failures is transparent to user programs and does not require system halt or restart. Recovery from a failure which directly affects active transactions, such as the failure of a participating processor or the loss of communications between participating network nodes, is accomplished by means of the backout and restart of affected transactions. The implementation utilizes distributed audit trails of data base activity and a decentralized transaction concurrency control mechanism.
Copyright by IEEE. Originally appeared in Proceedings of the Seventh International Conference on Very Large Databases, Sept. 1981, IEEE Press. Republished by Tandem Computers Incorporated with the kind permission of IEEE.
CONTENTS
INTRODUCTION: Architecturally-derived data integrity vs. data base consistency
ARCHITECTURAL OVERVIEW
  Hardware Architecture
  The Tandem Operating System
  The Tandem Network
DATA MANAGEMENT SYSTEM OVERVIEW
  ENCOMPASS
  Data Base Management
  Terminal Management
  Transaction Flow and Application Control
  Transaction Management
TMF DESIGN OVERVIEW
  Concurrency Control
  Audit Trails
  Transaction State Change
  Distributed Transaction Processing
  Distributed Commit Protocol
ROLLFORWARD
A DISTRIBUTED DATA BASE APPLICATION
CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
INTRODUCTION
The Tandem NonStop system architecture - hardware and software - is designed to provide failure-tolerance, expandability, and distributed data processing in an online transaction processing environment. The architectural overview which follows shows how such features as continuous availability, tolerance of single-module failures, fail-safe structural integrity of files, and I/O device fault tolerance derive from the design. The extension of the operating system to support a network of Tandem nodes is discussed. The network extension is reliable, highly available, and provides geographic independence. These features provide the foundation upon which to build a reliable distributed data management system; however, reliable distributed transaction processing requires that another layer of failure protection be provided for the data base. Logical data base consistency must be guaranteed despite processor failure, application process failure, network partition, transaction deadlock, or application-requested transaction abort. The means used by ENCOMPASS to provide these features are examined.
ARCHITECTURAL OVERVIEW
Hardware Architecture
The Tandem system is based on multiple, independent processors. Figure 1 illustrates the architecture of a typical three-processor system. The hardware structure consists of from 2 to 16 processor modules, each with its own power supply, up to two megabytes of memory, and I/O channel, interconnected by dual high-speed (13.5 megabytes/sec) interprocessor buses. Each I/O controller is redundantly powered and connected to two I/O channels. Disc drives may be connected to two I/O controllers, and discs themselves may be duplicated, or "mirrored", to provide data base access despite disc failures. At least two paths connect any two components in the system. Thus, hardware redundancy is arranged so that the failure of a single module does not disable any other module or disable any inter-module communication. Normally, all components are active in processing the workload. However, when a component fails, the remaining system components automatically take over the workload of the failed component [4].
[Figure 1 shows three processor modules, each containing a CPU, main memory, an I/O channel, and dynabus control, interconnected by dual interprocessor buses; dual-ported disc controllers attached to mirrored disc drives; and a line controller connecting to other network nodes.]
Figure 1. The Tandem NonStop Hardware Architecture
The Tandem Operating System
System resources are managed by a message-based operating system which decentralizes information and control. The operating system resides in each component processor. The relationship among the processors of a system is characterized by symmetry and the absence of any master-slave hierarchy.
The operating system provides the software abstractions, processes and messages, necessary for decentralized control of the distributed components. All communication between processes is via messages. The Message System makes the physical distribution of hardware components transparent to processes. User processes access the Message System through the File System. The Message System and the File System effectively transform the multiple-computer structure into a unified multiprocessor at the user level.
System-wide access to I/O devices is provided by the mechanism of the I/O "process-pair". An I/O process-pair consists of two cooperating processes which run in the two processors physically connected to a particular I/O device. One of these processes, designated the "primary", controls the I/O device, handling all requests to perform I/O on the device. The other process, designated the "backup", functions as a stand-by in case of failure of the primary path to the device. The primary process sends the backup process "checkpoints", via the Message System, which ensure that the backup process has all the information that it would need in the event of failure to assume control of the device and carry through to completion any operation initiated by the primary. During normal operation, the backup is passive, acting only to receive the primary's checkpoints. In the event of an I/O channel error or a failure of the primary's processor, the backup process takes control of the device and becomes the primary.
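The checkpointing discipline just described can be summarized in a short sketch. The following Python fragment is illustrative only - the class and function names are assumptions for exposition, not Tandem interfaces - and models the rule that the primary checkpoints enough state to its backup before acting, so that the backup can carry any begun operation to completion.

    class BackupProcess:
        """Passive stand-by: absorbs checkpoints; finishes work on takeover."""
        def __init__(self):
            self.pending = {}  # operation id -> enough state to complete it

        def receive_checkpoint(self, op_id, op_state):
            self.pending[op_id] = op_state

        def take_over(self, complete_operation):
            # On failure of the primary, carry each begun operation through
            # to completion.
            for op_id, op_state in self.pending.items():
                complete_operation(op_id, op_state)

    class PrimaryProcess:
        """Controls the device; checkpoints BEFORE performing each operation."""
        def __init__(self, backup, perform_io):
            self.backup = backup
            self.perform_io = perform_io  # stand-in for the actual device I/O

        def do_io(self, op_id, request):
            self.backup.receive_checkpoint(op_id, request)  # checkpoint first
            self.perform_io(request)                        # then act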
The process-pair is a general mechanism utilized within the operating system to make system resources and services available to all processes in a fault-tolerant manner. An example of the use of the process-pair mechanism for a system process other than an I/O process is the "operator" process-pair, which is responsible for formatting and printing error messages on the system console. The primary and backup of the operator process run in two processors of the system. In the event of a failure of the primary's processor, the backup is able to continue offering this service.
The continuous service provided by the process-pair is the essence of the feature termed NonStop. The concept of the process-pair extends to application processes as well as to system processes. User-callable operating system routines are provided for creating a backup process and sending checkpoints to it. As shown by Bartlett [2], a process-pair application can provide continuous fault-tolerant processing despite module failure.
The design of the Tandem operating system is described in more detail in [1] and [2].
The Tandem Network
The message-based structure of the Tandem operating system allows it to exert decentralized control over a local network of processors. Since it already addresses some of the problems of controlling a distributed computing system, the operating system has a natural extension to support a data communications network of Tandem nodes, each node containing up to 16 processors. (Henceforth, the terms "system" and "node" will be used interchangeably, and the term "network" will be used to refer to a collection of Tandem nodes connected by data communications links.)
The extension of the operating system to the network operating system involves the generalization of message destinations to include processes in other nodes. This extension of the Message System beyond the boundaries of a single system allows a process anywhere in the network to send or receive a message from any other process in the network.
Features of the Tandem network include the following:
1. fault-tolerant nodes for high availability and data integrity;
2. user-level transparency of access to geographically distributed system resources;
3. decentralized control, characterized by the absence of a network master;
4. dynamic best-path message routing, including automatic re-routing in the event of a communications line failure;
5. automatic packet forwarding via an end-to-end protocol which assures that data transmissions are reliably received.
The design of the Tandem network system is described in more detail in [5].
DATA MANAGEMENT SYSTEM OVERVIEW
ENCOMPASS
The ENCOMPASS distributed data management system performs the functions required for the development and operational control of on-line application systems. The basic functions provided by ENCOMPASS components include: (1) data base management; (2) terminal management; (3) transaction flow and application control; and (4) transaction management.
Data Base Management
The data base management component of ENCOMPASS provides a data definition language, a data dictionary, a relational data base manager, and a high-level non-procedural relational query/report language.
Among the features provided by the ENCOMPASS data base manager are the following:
1. three types of structured file organizations: key-sequenced, relative, and entry-sequenced;
2. multi-key access to records with automatic maintenance of the indices during file update;
3. data and index compression;
4. partitioning of files-by key value range-across multiple disc volumes (possibly on multiple nodes);
5. security controls by function, user class, network node, application program, and specified terminal;
6. a cache buffering scheme designed to keep the most recently referenced blocks of data in main memory.
The ENCOMPASS data base manager distributes data across multiple processors and discs, providing multiple points of control. Implemented as an I/O process-pair per disc volume, designated the DISCPROCESS, it protects the structural integrity of individual files through active checkpointing of process state and data, and recovery in the case of processor, I/O channel, or disc drive failure. The DISCPROCESS controls all access to a logical disc volume, which in the case of a mirrored device pair includes two physical disc drives and all primary and backup access paths (I/O channels and I/O controllers).
The DISCPROCESS protects the integrity of the files resident on its volume by maintaining control information and data in two processors - the processors in which the primary and backup DISCPROCESSes reside. If a failure occurs which prevents the primary DISCPROCESS from completing an operation it has started, the backup DISCPROCESS automatically takes over and completes the operation.
Two granularities of locking are provided for concurrency control: file and record. Record level locking operates on the primary key of an individual logical data record. (There is no locking at the block or index level.) Locks on existing records are obtained at read time by explicit application program request. All locks are exclusive mode.
Each DISCPROCESS maintains the locking control information for those records and files resident on its volume only. Thus, concurrency control for ENCOMPASS is decentralized and is effectively distributed among the DISCPROCESSes; no central lock manager exists. Deadlock detection is by timeout, the interval being specified as part of the lock request.
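As a rough illustration of this decentralized scheme, the Python sketch below models one DISCPROCESS's lock table: exclusive locks keyed by primary key, with a timed-out acquire treated as a possible deadlock. The class and method names are assumptions for illustration, not ENCOMPASS interfaces.

    import threading

    class VolumeLockTable:
        """One instance per disc volume; no central lock manager exists."""
        def __init__(self):
            self._guard = threading.Lock()
            self._locks = {}  # primary key -> exclusive threading.Lock

        def lock_record(self, primary_key, timeout_secs):
            with self._guard:
                lock = self._locks.setdefault(primary_key, threading.Lock())
            # Deadlock detection is by timeout: a timed-out request is
            # reported to the application as an error, which may then
            # restart the transaction.
            if not lock.acquire(timeout=timeout_secs):
                raise TimeoutError(
                    f"lock timeout (possible deadlock): {primary_key!r}")

        def unlock_record(self, primary_key):
            self._locks[primary_key].release()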
Terminal Management
The terminal management component of ENCOMPASS, known as the Terminal Control Process (TCP), provides screen formatting, data validation, screen sequencing, and data mapping. A TCP controls up to 32 terminals and supports a variety of terminal types and communication lines. The application interface to each terminal is defined by the user in a high-level language known as Screen COBOL (a COBOL-like language with extensions for screen handling). The user's Screen COBOL program is interpreted by the TCP to perform screen sequencing, data mapping, and field validation for a single terminal. A TCP supervises the interleaved execution of Screen COBOL programs, each associated with one of the terminals under control of the TCP. Multiple TCP's can be run to provide better distribution of available resources or to support large numbers of terminals.
TCP's are configured as process-pairs. As a result of the fault tolerance thus provided, the terminal user has continuous access to the executing Screen COBOL program despite module failure, including processor failure.
Transaction Flow and Application Control
ENCOMPASS applications have user-defined transactions that originate at terminals and access data bases. In addition to providing a Screen COBOL program defining the screen formats and controls for a terminal, the ENCOMPASS user provides a set of application program modules, known as application "server" programs, which access and update data base files. Screen COBOL programs and application server programs communicate by exchanging request and reply messages.
Communication between a Screen COBOL program and a server is initiated by the Screen COBOL "SEND" verb. Under control of the SEND verb, the TCP, using the File System, passes a transaction request message to a server. The server performs the application function against the data base. The structure of an application server program is simple and single-threaded: (1) read the transaction request message; (2) perform the data base function requested; (3) reply. A server must be "context free" in the sense that it retains no memory from the servicing of one request to the next. Application servers are written to be independent of terminal and communications considerations. Servers may be written in available commercial languages such as COBOL.
ENCOMPASS application control provides monitoring of applications which are spread across a single system or network. It provides for the dynamic creation and deletion of application server processes to ensure good response time and utilization of resources as the workload on the system changes. Figure 2 shows a typical ENCOMPASS configuration.
[Figure 2 shows TCPs, application servers, and DISCPROCESSes distributed across CPU 0, CPU 1, and CPU 2.]
Figure 2. A Typical ENCOMPASS Configuration
Transaction Management
The traDsaction IDaDagement component ef ENCOMPASS is known as the Tr:mp.ction Monitoring
Facility (TMF). TMF implements the conc:ept of a traDsaction as defined in the following.
A traDsaction is a (possibly logical update which takes the data base from a consistent
state to another consistent state. A consistent data base is one which satisfies an application-
dependent set of assertions concerning the relationships between files. reeortb. fields. and secon-
dary indices. the truth of which is required for the data base to effectively model the application's
world of reference.
In order for data base consistency to be maintained, a transaction must have the property of atomicity: either all of its effects persist or none of its effects persist. This in turn requires that a transaction which aborts for any reason, by user request or due to a failure, be backed out so that none of its effects persist. On the other hand, a transaction which completes reaches a point at which it abdicates the right to back out. At this point, the transaction is said to commit: all of its effects will persist regardless of subsequent failures.
Prior to the existence of TMF in ENCOMPASS, data base consistency had to be preserved by application fault-tolerance. The application had to be coded as a process-pair, which by careful design of checkpoint logic could recover correctly from single-module failures. (The run-time systems of COBOL and FORTRAN automated process-pair coordination and error-recovery to some extent.) Since process-pairs always carried forward to completion processing interrupted by failure, transactions always committed. The reversal of a transaction had to be coded by the user, as there was no automated transaction backout. Furthermore, there was no protection for the data base in the event of multiple-module failure. If a disc's primary and backup processors failed simultaneously, data on the disc could be left in an unrecoverable or inconsistent state.
The introduction of transaction backout by TMF makes it unnecessary to code the application as a process-pair. Without TMF, the application process must maintain correct state in a backup process in case the primary fails. With TMF, the state of progress of an incomplete transaction is immaterial, since failure will cause the transaction to be automatically backed out, restoring data base consistency. TMF further provides voluntary transaction backout, making it unnecessary for the user to code transaction reversal. Data base protection in the event of multiple-module failure is provided by the ROLLFORWARD facility described in a later section.
The ENCOMPASS user's interface to TMF is through use of the Screen COBOL verbs:

BEGIN-TRANSACTION
END-TRANSACTION
ABORT-TRANSACTION
RESTART-TRANSACTION.
BEGIN-TRANSACTION is used to mark the beginning of a sequence of operations which should be treated as a single transaction. For the Screen COBOL program, BEGIN-TRANSACTION marks the beginning of a series of one or more SEND's of transaction request messages to application server processes. The network location of the application server process and, in fact, of the data base itself, is transparent to the Screen COBOL program. The server, the data base, or part of the data base (in any combination) may reside on remote network nodes. For example, a server may transparently access data base files residing on any network node. A transaction may do work at multiple nodes and involve multiple server requests.
Execution of BEGIN-TRANSACTION causes a unique transaction identifier, or "transid", to be generated. The transid consists of a sequence number, qualified by the number of the processor in which BEGIN-TRANSACTION was called, qualified by the number of the network node which originated the transaction, designated the "home" node for the transaction. The Screen COBOL special register TRANSACTIONID is set to contain the new transid, and the terminal is said to enter "transaction mode".
BEGIN-TRANSACTION marks a restart point in case of failure while the terminal is in transaction mode. If the transaction fails for any reason except an explicit ABORT-TRANSACTION by the Screen COBOL program (and the number of restarts has not exceeded a configurable "transaction restart limit"), the terminal's execution is restarted at BEGIN-TRANSACTION after TMF backs out any data base updates that have been performed for the current transid. A new transid is obtained for the new attempt at executing the logical transaction.
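The structure of the transid can be pictured with a small sketch. The Python below is illustrative; the field names are assumptions, not the actual TMF encoding.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Transid:
        home_node: int  # network node where BEGIN-TRANSACTION was executed
        cpu: int        # processor in which BEGIN-TRANSACTION was called
        sequence: int   # sequence number generated in that processor

    # A transid is unique network-wide; e.g. the 1028th transaction begun
    # in processor 4 of node 7:
    t = Transid(home_node=7, cpu=4, sequence=1028)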
The types of failures which would result in the automatic abort, backout, and restart of a transaction by the system include: (1) failure of the primary TCP's processor (i.e. primary process of the process-pair); (2) failure of an application server's processor while that server was working on the transaction; (3) complete loss of communication with a network node which participated in the transaction. On the other hand, recovery from the failure of a component such as a primary DISCPROCESS' processor, an individual network communication line, a power supply, I/O controller, or disc drive is handled automatically by the operating system transparently to transaction processing.
The effect of a processor or other single module failure, which would necessitate crash restart and data base recovery on a conventional system, is limited to the on-line backout of those transactions in process on the failed module. Transactions uninvolved in the failure continue processing. Because the TCP checkpoints data extracted by the Screen COBOL program from input screen(s) to its backup, in many cases the restart of a logical transaction may not require re-entering the input screen(s).
All SEND's executed by a Screen COBOL program whose terminal is in transaction mode have the terminal's current transid automatically appended to the interprocess message by the File System. When the application server reads the transaction request message, the terminal's current transid becomes the "current process transid" for the application process. When the application process then executes a statement requiring disc I/O and/or record or file locking, the File System automatically appends the application process' current transid to the request message which is sent to the DISCPROCESS.
When all the SEND's required for the transaction's execution are complete, the Screen COBOL program indicates that the transaction should be committed by executing the END-TRANSACTION verb. At the completion of the execution of this verb, the transaction's data base updates become permanent and will not under any circumstances be backed out. The Screen COBOL program's END-TRANSACTION request can, however, be rejected because the transaction has been aborted by the system due to one of the causes of automatic abort, e.g. network partition. If so, the Screen COBOL program may be restarted at the BEGIN-TRANSACTION point.
If the Screen COBOL program, or any of the servers to which it has done a SEND, detects a need to abort and back out a transaction-without automatic restart by the TCP-the Screen COBOL program executes the ABORT-TRANSACTION verb.
Finally, RESTART-TRANSACTION is used to indicate that the current attempt to execute the transaction has failed due to a transient problem and so should be backed out and restarted. For example, a server may request a data base record lock with a timeout interval specified; then, in case the timeout occurs, it would recover from a possible deadlock by replying to the SEND with an error result indicating that the Screen COBOL program should call RESTART-TRANSACTION.
TMF DESIGN OVERVIEW
Concurrency Control
Gray defines a transaction that sees a consistent data base state as one that (a) does not overwrite dirty data of other transactions; (b) does not commit any writes until the end of transaction; (c) does not read dirty data from other transactions; and (d) prior to its completion, does not permit any data it reads to be dirtied by other transactions. If all transactions observe these protocols, then transaction backout produces a consistent state [3].
TMF enforces clauses (a), (b), and (c) as follows. It ensures that all records updated or deleted by a transaction have been previously locked by that transaction. (A lock on an existing record is acquired at record read time by explicit application request.) TMF automatically generates locks on all new records inserted by a transaction and on the primary key values of all records deleted by a transaction. Clause (d) insures that the reads performed by a transaction are repeatable [3]. The observance of clause (d) is recommended to writers of TMF transactions, but for system performance reasons is not enforced, as enforcement would require the generation of a lock for each record read by a transaction.
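A minimal sketch of this enforcement in Python, assuming an in-memory dictionary stands in for the data base and a set records which primary keys the transaction holds locked (all names are illustrative): an update is refused unless the record is already locked by the transaction, and a before-image is captured so the update can later be backed out.

    def update_record(db, audit_trail, locks_held, transid, key, new_value):
        # Clause (a): records updated or deleted must already be locked by
        # this transaction; the update is refused otherwise.
        if key not in locks_held:
            raise PermissionError(f"record {key!r} not locked by {transid}")
        # Capture before- and after-images so the update remains reversible
        # until commit (supporting clause (b)).
        audit_trail.append({"transid": transid, "key": key,
                            "before": db.get(key), "after": new_value})
        db[key] = new_value

    def back_out(db, audit_trail, transid):
        # Undo an aborting transaction by restoring before-images, newest
        # first.
        for rec in reversed([r for r in audit_trail
                             if r["transid"] == transid]):
            db[rec["key"]] = rec["before"]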
Audit Trails
TMF maintains distributed audit trails of logical data base record updates on mirrored disc volumes. An audit trail is a numbered sequence of disc files whose volume of residence is configurable and whose creation and purging is managed by TMF. The locations of audit trails for disc volumes containing data base files designated by the user as "audited" are independently configurable. Each DISCPROCESS which manages a disc volume configured as "audited" (i.e. capable of containing audited data base files) automatically provides "before-images" and "after-images" of data base updates by application processes to an AUDITPROCESS (of which several, each a process-pair, are configurable), which writes to an audit trail. All audited discs on a given controller share an AUDITPROCESS and an audit trail. Multiple controllers may be configured to use the same or different AUDITPROCESSes and audit trails. Auditing of data base updates is totally transparent to application programs. For transactions that span data bases on multiple nodes of a network, all audit images for records residing on a particular node are contained in audit trails at that node. This enables transaction backout at a node to occur without the need for communication with other nodes. Transaction backout is performed by the BACKOUTPROCESS (a process-pair), using the transaction's before-images recorded in the audit trails.
The implementation of the DISCPROCESS as a process-pair, residing in two processors, eliminates the necessity for the protocol, termed "Write Ahead Log" by Gray [3], which requires before-images to be write-forced to the audit trail prior to performing any update of the data base on disc. The Write Ahead Log protocol enables restart after system crash using conventional recovery techniques. In the NonStop approach to handling failure, checkpoint is the functional equivalent of Write Ahead Log. By checkpointing the audit records generated by an update request to its backup prior to performing the update, the primary DISCPROCESS assures the feasibility of transaction backout even in the event of the failure of the primary's processor. As in Write Ahead Log, however, all audit records generated by a transaction are write-forced to disc as part of the two-phase transaction commit protocol.
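The ordering discipline in this paragraph - checkpoint the audit records to the backup before touching the data base, and force audit to disc only at commit phase one - can be sketched as follows. This is a simplified Python model with in-memory stand-ins; the class and field names are assumptions, not the DISCPROCESS interface.

    class DiscProcessSketch:
        def __init__(self):
            self.backup_memory = {}  # stands in for the backup's state
            self.audit_buffer = []   # audit records not yet forced to disc
            self.audit_on_disc = []  # stands in for the audit trail file
            self.data_base = {}

        def apply_update(self, transid, key, new_value):
            rec = {"transid": transid, "key": key,
                   "before": self.data_base.get(key), "after": new_value}
            # Checkpoint to the backup BEFORE updating the data base: the
            # functional equivalent of Write Ahead Log.
            self.backup_memory[(transid, key)] = rec
            self.audit_buffer.append(rec)
            self.data_base[key] = new_value

        def commit_phase_one(self, transid):
            # As in Write Ahead Log, the transaction's audit records are
            # write-forced to disc during phase one of commit.
            mine = [r for r in self.audit_buffer if r["transid"] == transid]
            self.audit_on_disc.extend(mine)
            self.audit_buffer = [r for r in self.audit_buffer
                                 if r["transid"] != transid]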
In addition to the data base audit trails, TMF maintains, at each node, a "Monitor Audit Trail" which contains a history of transaction completion statuses: commits and aborts. A transaction commits at the time its commit record is written to the Monitor Audit Trail.
Transaction State Change
A transaction goes through several state changes during the commit or abort protocol. All transaction state changes are broadcast, via the interprocessor bus, to all processors within a single node. This is done regardless of which processors actually participated in the transaction. In contrast, in the network case only those nodes which participated in the transaction are notified of transaction state change. The decision to broadcast transaction state change information to all processors within a single system was taken because of the speed and reliability of the interprocessor bus as a communication medium.
A summary of the possible states of a transaction follows:
1. Active. A transaction has this state after BEGIN-TRANSACTION has been called but before commit or abort has been requested. BEGIN-TRANSACTION broadcasts the transid in "active" state to all processors in the system. The possible states which can follow are "ending" or "aborting".
2. Ending. A transaction has this state after END-TRANSACTION has been called but before the transaction commit record has been written to the Monitor Audit Trail. During "ending" state, all the transaction's audit records are written to the audit trails. This constitutes "phase one" of transaction commit. The possible states which can follow are "ended" or "aborting".
3. Ended. A transaction has this state after the transaction commit record has been written to the Monitor Audit Trail. Once the transaction has entered the "ended" state, the Screen COBOL END-TRANSACTION verb completes, and participating DISCPROCESSes are notified to release the committed transaction's locks. This constitutes "phase two" of transaction commit. Once the "ended" state has completed, the transid leaves the system.
4. Aborting. A transaction has this state after the decision to back out the transaction has been taken, but before any of its locks are released. While the transaction is in "aborting" state, all of its audit records are written to the audit trails and the transaction's data base updates are backed out by the BACKOUTPROCESS. "Aborting" and "ending" are parallel states. The only possible following state is "aborted".
5. Aborted. A transaction has this state after the transaction has been backed out. Once the transaction has entered "aborted" state, participating DISCPROCESSes are notified to release the backed out transaction's locks. "Aborted" and "ended" are parallel states. Once the "aborted" state has completed, the transid leaves the system.
The state transitions of a transaction are illustrated in Figure 3; a small code sketch follows the figure.
[Figure 3 shows the transaction state graph, with transitions labeled BEGIN, END, PHASE-ONE, PHASE-TWO, FAILURE, and BACKOUT.]
Figure 3. State Transitions of a Transaction
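The legal transitions among the five states can be written down as a small table. A Python sketch follows; the state names are taken from the text, while the representation itself is illustrative.

    # Legal state transitions of a transaction, per the summary above.
    NEXT_STATES = {
        "active":   {"ending", "aborting"},  # END-TRANSACTION, or failure
        "ending":   {"ended", "aborting"},   # phase one may still fail
        "aborting": {"aborted"},             # backout runs to completion
        "ended":    set(),                   # transid leaves the system
        "aborted":  set(),                   # transid leaves the system
    }

    def change_state(current, requested):
        if requested not in NEXT_STATES[current]:
            raise ValueError(f"illegal transition: {current} -> {requested}")
        return requested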
For transactions which stay within a node, TMF uses an abbreviated two-phase commit protocol. Its purpose is to ensure that all audit records generated by a transaction are written to disc prior to unlocking the transaction's locks, thus making the transaction's output visible to concurrent transactions. This guarantees that a transaction, once committed, can always be recovered using ROLLFORWARD, a TMF utility which can be used to apply after-images from the audit trails to a previously archived copy of the data base. ROLLFORWARD is discussed in a later section.
Distributed Transaction Processing
Transactions can originate on any node of the network and can be transmitted by any path to any number of other network nodes through SEND's from a TCP to remote servers or I/O requests from servers to remote discs. TMF ensures data base consistency both within a single node and across nodes by treating all data base updates performed by a transaction as a group, identified network-wide by the transid. All of the updates in the group are made permanent by the execution of the Screen COBOL END-TRANSACTION verb. If a failure occurs, the transaction's updates are backed out on all participating nodes.
The strategy, used in the single-node case (up to 16 processors), of broadcasting transaction state change information to all processors, regardless of their participation in the transaction, is clearly too expensive to use in the network case. Furthermore, the information is likely to be useless to most of the network. Therefore, only nodes participating in the transaction receive state change information.
Coordination of distributed transactions is one of the functions of the "Transaction Monitor Process" (TMP), a process-pair which is configured for each network node that participates in the distributed data base. Whenever a transid is transmitted by the File System to a remote node (as a result of a SEND to a remote server or an I/O request to a remote audited data base file), the TMP on the sending node determines whether the destination node has received a previous transmission of the requesting transid from the sending node. If not, the TMP on the sending node notifies the TMP on the destination node to broadcast the transid in "active" state to all processors on its node. This "remote transaction begin" occurs prior to any transmission of the transid by the File System to a server or DISCPROCESS on the destination node.
Distributed Commit Protocol
In a distributed data base environment, where a single transaction may result in the update of files on multiple nodes, loss of communication between nodes may cause data base inconsistency. TMF uses a more elaborate two-phase commit protocol for distributed transactions than that used for single-node transactions, due to the unreliability of the communication medium and the loose coupling of the nodes. The distributed commit protocol allows any participating node to unilaterally abort a transaction. The purpose of phase one is twofold. First, it serves to ensure that all audit records generated by a transaction are written to the audit trails on all participating nodes prior to allowing the unlocking of any of the transaction's locks. Secondly, it guarantees that the decision to commit or abort a transaction is uniform across all nodes, even in the event of loss of communications between participating nodes or the catastrophic failure of a node.
In the distributed case, transaction state change is accomplished by TMP-to-TMP messages sent over the network. Each participating node sends transaction state change messages to the TMP on all nodes for which it was the direct source of transid transmission. On receipt of a transaction state change message, the TMP broadcasts the state change to all processors within its node.
Certain network TMP message types are termed "critical response". For critical response messages, the destination TMP must be accessible at the time the message is initiated, and it must reply with an affirmative response in order for the transaction state change to proceed. Examples of critical response messages to remote TMP's are the network message requesting remote transaction begin and that requesting phase one of commit (i.e. transaction state change to "ending").
Other network TMP message types are termed "safe-delivery". For safe-delivery messages, the destination TMP need not be accessible at the time the message is initiated, and the reply serves only to acknowledge receipt of the message rather than to signify concurrence. The sending of safe-delivery messages-whenever transmission becomes possible-is guaranteed, but their delivery is not time-critical to the transaction's state change. Examples of safe-delivery messages to remote TMP's are the network message requesting transaction state change to "ended" (phase two of commit) and that requesting transaction state change to "aborting" (requesting transaction backout). Following transaction backout, state change to "aborted" on each participating node is accomplished under local control, without need for further network communication.
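The two message disciplines can be contrasted in a short sketch. The Python below is illustrative only - `send` is a stand-in for the network round-trip, and the function names are assumptions: a critical-response message must be answered affirmatively now, while a safe-delivery message is merely queued for guaranteed eventual delivery.

    def send_critical_response(send, dest_tmp, msg):
        """The destination TMP must be reachable and must concur, or the
        state change (e.g. remote begin, phase one of commit) fails."""
        try:
            reply = send(dest_tmp, msg)
        except ConnectionError:
            raise RuntimeError(f"{dest_tmp} inaccessible: state change fails")
        if reply != "yes":
            raise RuntimeError(f"{dest_tmp} refused: transaction must abort")

    safe_delivery_queue = []  # retried whenever transmission becomes possible

    def send_safe_delivery(dest_tmp, msg):
        """Delivery (e.g. phase two, or a backout request) is guaranteed
        eventually but is not time-critical; just queue the message."""
        safe_delivery_queue.append((dest_tmp, msg))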
The critical response message requesting phase one of commit is initially transmitted over the network by the TMP on the transaction's home node in response to the Screen COBOL program's END-TRANSACTION call. For phase one to complete successfully, each node to which the home node directly transmitted the transid must be accessible and must reply affirmatively after having written the transaction's audit records to disc on its node and after having assured, transitively, that all nodes to which the transid was further transmitted have done likewise. The existence of a node participating in the transaction which is either inaccessible at phase-one time, or which responds negatively to the phase-one request because it has previously decided to abort the transaction (e.g. due to a prior loss of communications) will cause the commit attempt to fail. The TMP on the transaction's home node will transmit transaction backout messages in this case. The successful completion of phase one, on the other hand, will cause the home node TMP to transmit phase-two messages, causing the release of the transaction's locks throughout the network and the completion of the END-TRANSACTION call.
For example, suppose a TCP on node 1 SENDs to a server on node 2, which in turn updates a record via a DISCPROCESS on node 3. The TMP on node 1 remembers that it transmitted the transaction to node 2, but does not know that node 2 transmitted it to node 3. The TMP on node 2 remembers that it transmitted the transaction to node 3. When END-TRANSACTION is called on node 1, "ending" state is broadcast to all processors on node 1. The TMP on node 1 receives this broadcast and sends a network message to the TMP on node 2. The latter broadcasts to all processors on node 2 and in addition sends a network message to the TMP on node 3, which broadcasts to all processors on its node. This causes the transaction to go into "ending" state on all processors of all participating nodes. This is the first phase of the commit protocol. The second phase is similar.
Once phase one has completed successfully, the inability to communicate with all participating nodes during phase two (lock release) does not impede the completion of the Screen COBOL program's END-TRANSACTION call on the home node. It merely means that records locked by the committed transaction on inaccessible remote nodes will remain locked until such time as communication is restored.
Until a non-home node has replied affirmatively to the phase-one message, it can unilaterally abort the transaction, and then force network consensus to abort by replying negatively to the phase-one message when it is received. Once a non-home node has replied affirmatively to the phase-one message, however, it must hold the transaction's locks until notification of the transaction's final disposition (i.e., to state "ended" or to state "aborting") is received (possibly indirectly) from the transaction's home node. If communication is lost at this point, the transaction's locks on the inaccessible node will be held until communication is restored. The manual override for this situation requires the following steps: (1) use of a TMF utility on the home node to determine the transaction's disposition; (2) a telephone conversation (for example) between operators on the home node and on the inaccessible non-home node; and, finally, (3) use of the TMF utility on the non-home node to force the disposition of the transaction.
ROLLFORWARD
Conventional transaction management systems must be optimized for quick restart in the event of total node failure. NonStop systems allow optimization of normal processing at the expense of restart time. For example, audit records need not be written to disc prior to updating the data base. On the other hand, however rare the occurrence of total node failure (e.g. simultaneous failures of two processors hosting a process-pair), TMF must have a provision for recovering the data base.
TMF's approach to recovery from total node failure is based on occasional archived copies of audited data base files, plus an archive of all audit trails written since the data base files were archived. These copies can be created during normal transaction processing. TMF reconstructs any files open at the time of a total node failure by using the after-images from the audit trail to reapply the updates of committed transactions. ROLLFORWARD negotiates with other nodes of the network about transactions which were in "ending" state at the time of the node failure.
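In outline, the recovery step amounts to replaying committed after-images over an archived copy. A minimal Python sketch, assuming the audit records carry transid, key, and after-image, and that the set of committed transids is read from the Monitor Audit Trail (names are illustrative):

    def rollforward(archive_copy, audit_records, committed_transids):
        """Rebuild a data base image by reapplying, in audit-trail order,
        the after-images of transactions known (from the Monitor Audit
        Trail) to have committed; uncommitted updates are skipped."""
        db = dict(archive_copy)
        for rec in audit_records:  # in the order they were written
            if rec["transid"] in committed_transids:
                db[rec["key"]] = rec["after"]
        return db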
A DISTRIBUTED DATA BASE APPLICATION
Tandem's Manufacturing Division uses ENCOMPASS to implement a reliable distributed data base to coordinate its four manufacturing facilities in Cupertino (California), Santa Clara (California), Reston (Virginia), and Neufahrn (West Germany). Manufacturing is one application of Tandem's 50-node corporate network. (The network grows at a rate of about two nodes per month; the manufacturing application adds approximately two nodes yearly.) The network for the manufacturing system is shown in Figure 4.
[Figure 4 shows the manufacturing network nodes, each holding global and local data.]
Figure 4. Manufacturing Network
Each node has a copy of the "global" files: Item Master File, Bill of Materials File, and the Purchase Order Header File. In addition, each node has a set of "local" files: Stock File, Work-in-Progress File, Transaction History File, and the Purchase Order Detail File.
The global files are replicated at all nodes for reasons of performance and availability. Most transactions access and update only local files. Transactions which reference global files access only a few records and occur infrequently. Transactions which update global files can originate at any node.
The designers of the system were faced with two conflicting goals: (1) the maintenance of consistency among the global file copies; and (2) node autonomy: the ability for a node to carry on its processing, including the update of global data, despite network partition or the unavailability of other nodes.
If consistency were the only goal, straightforward application of TMF to the problem would have been the solution. All global files would be TMF-audited. All reads of a record in a global file would be directed to the local copy. Updates of global files would be applied to all copies, within the scope of a single TMF transaction. Unfortunately, this simple approach fails to address the goal of node autonomy, since no node can run a global update transaction at a time when any other node is unavailable.
The actual design compromises the goal of replica consistency somewhat for the sake of node autonomy. As in the above design, all files are TMF-audited and reads are always directed to the local record copy. For the purpose of update, however, each global file record is assigned a master node, the name of which is stored in each record instance. The update of a global record can occur only if its master node is available. An update request is sent to a server on the record's master node. The server executes a TMF transaction which updates the master copy of the record and queues "deferred" update requests for the non-master copies of the record in a "suspense file" at the record's master node. A dedicated process, called the "suspense monitor", scans the suspense file looking for work to do. When it finds a deferred update record for a node which is currently accessible, the suspense monitor executes a TMF transaction which sends the update to a server at the non-master node and deletes the suspense file entry.
It is important that the deferred updates for non-master copies of records at a node occur in suspense file order. If a node becomes disconnected from the network, deferred updates for it accumulate in the suspense files of other nodes, and the disconnected node's suspense file accumulates updates for other nodes. When the network is reconnected and all accumulated updates are applied, global file copies converge to a consistent state.
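A compact sketch of the master-node scheme described above, in Python. Everything here is illustrative: the data structures stand in for the audited files, and each function body stands for one TMF transaction.

    def update_global_record(master_db, suspense_file, key, value,
                             replica_nodes):
        """Runs at the record's master node as one TMF transaction: update
        the master copy and queue a deferred update per non-master copy."""
        master_db[key] = value
        for node in replica_nodes:
            suspense_file.append({"node": node, "key": key, "value": value})

    def suspense_monitor_pass(suspense_file, is_accessible, apply_at):
        """One scan of the suspense monitor: ship each entry whose
        destination is accessible (one TMF transaction each), preserving
        per-node suspense-file order."""
        remaining = []
        for entry in suspense_file:
            if is_accessible(entry["node"]):
                apply_at(entry["node"], entry["key"], entry["value"])
            else:
                remaining.append(entry)  # kept, in order, for a later pass
        suspense_file[:] = remaining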
CONCLUSIONS
The design of TMF is seen to rely heavily on the concept of NonStop, Tandem's unique approach to fault tolerance which provides continuous operation despite single module failures. Unlike conventional data base recovery techniques, which are oriented to repairing the data base after a system halt and restart, TMF maintains data base consistency through failures via the on-line backout of those individual transactions affected by the failure.
Since a Tandem node is in itself a local network of up to 16 processors, the techniques used to implement such features as transaction concurrency control and transaction commit coordination within a single node are of necessity oriented to the distributed environment. Consequently, their extension to a network is a relatively straightforward extrapolation of the single-node implementation. The differences between the handling of the single-node case and the network case are largely accounted for by the significant disparity in the speed and reliability of the communication media.
ACKNOWLEDGEMENTS
I would like to acknowledge the contributions of Glenn Peterson, Keith Hospers, Bill Earl, Gary Kelley, Keith Stobie, Jerry Held, Dennis McEvoy, and Glenn Linderman in the design, implementation, and testing of TMF. The design of Tandem's manufacturing data base is due to Jim Gray and Vince Sian. Thanks are furthermore due to Jim Gray for editorial suggestions whose implementation improved the presentation of this material.
REFERENCES
1. Bartlett, J. F., A 'NonStop' Operating System. Eleventh Hawaii International Conference on System Sciences, 1978.
2. Bartlett, J. F., A NonStop Kernel. Proceedings of Eighth Symposium on Operating Systems Principles, ACM, 1981 (also Tandem TR 81.4).
3. Gray, J. N., Notes on Data Base Operating Systems. IBM Research Report RJ 2188, February 1978.
4. Katzman, J. A., A Fault-Tolerant Computing System. Eleventh Hawaii International Conference on System Sciences, 1978.
5. Katzman, J. A., and Taylor, R. H., GUARDIAN/EXPAND, a NonStop Network (submitted for publication).
6. Smith, L., Designer's Overview of Transaction Processing (submitted for publication).
ENCOMPASS, Tandem, and NonStop are trademarks of Tandem Computers Incorporated.
Distributed by
TANDEM COMPUTERS
Corporate Information Center
19333 Vallco Parkway MS3-07
Cupertino, CA 95014-2599
