A State of The Art Review of Distributed Database Technology

This document provides a state of the art review of distributed database technology. It discusses the architectural issues including servers, networks, data models, and schemas. It also covers functional issues such as data location and function distribution. While no standards currently exist, some guidelines have been set forth. True implementations of general purpose distributed database management systems are only now emerging in the marketplace.


A STATE OF THE ART REVIEW

of
DISTRIBUTED DATABASE TECHNOLOGY

Contract Number F30602- 89 -C-0082


Data & Analysis Center for Software

October 5, 1992

Prepared for:

Rome Laboratory
RL/C3CB
Griffiss AFB, NY 13441-5700

Prepared by:

Kaman Sciences Corporation


258 Genesee Street
Utica, New York 13502- 4627
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

2. Report Date: October 5, 1992
4. Title and Subtitle: A State of the Art Review of Distributed Database Technology
5. Funding Numbers: F30602-89-C-0082
6. Author(s): Carol Wawrzusin
7. Performing Organization Name(s) and Address(es): Kaman Sciences Corporation, 258 Genesee Street, Utica, NY 13502
9. Sponsoring/Monitoring Agency Name(s) and Address(es):
   Sponsoring Org.: Defense Technical Info. Ctr., DTIC/AI, Cameron Station, Alexandria, VA 22304
   Monitoring Org.: Rome Laboratory, RL/C3C, Griffiss AFB, NY 13441
10. Sponsoring/Monitoring Agency Report Number: N/A
11. Supplementary Notes: Available from: Data & Analysis Center for Software, P.O. Box 120, Utica, NY 13503
12a. Distribution/Availability Statement: Approved for public release. Distribution Unlimited
13. Abstract
A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. A Distributed Database Management System (DDBMS) is a software system that permits the management of distributed data, making the distribution transparent to the user. This report reviews the issues that arise with such systems, surveys current commercially available DDBMSes, and summarizes the state of the art. Although no standards yet exist within this new technology, some guidelines have been provided by C. J. Date and E. F. Codd. True implementations of general purpose DDBMSes are only now beginning to emerge in the marketplace. Their implementations with respect to issues of distributed database technology differ

14. Subject Terms: Distributed Databases, Database Management Systems, Relational Databases, Object-Oriented Databases
15. Number of Pages: 40
17. Security Classification of Report: Unclassified
18. Security Classification of This Page: Unclassified
19. Security Classification of Abstract: Unclassified

1. INTRODUCTION
1.1 Evolution of Distributed Databases
1.2 Reasons for Distributed Databases

2. THE STATE OF THE ART IN DISTRIBUTED DATABASE TECHNOLOGY
2.1 The Architectural Issues of Distributed Databases
2.1.1 The Server/Host
2.1.2 The Network
2.1.3 The Data Model
2.1.4 The Schema
2.2 Functional Issues of Distributed Databases
2.2.1 Data Location and Function Distribution
2.2.1.1 Fragmentation
2.2.1.2 Replication
2.2.2 Distributed Query Processing
2.2.3 Transaction Management
2.2.3.1 Concurrency Control
2.2.3.1.1 Locking
2.2.3.1.2 Timestamping
2.2.3.1.3 Multiple Protocol Methodology
2.2.3.1.4 Deadlock Management
2.2.3.2 Reliability
2.3 Current Technology Implementing Distributed Databases
2.3.1 Data Models and Schema Integration
2.3.2 Data Distribution
2.3.2.1 Degree of Distribution
2.3.2.2 Location Transparency
2.3.3 Distributed Query Processing
2.3.4 Distributed Transaction Management
2.3.4.1 Concurrency Control Protocols
2.3.4.2 Recovery Protocols
2.3.4.3 Termination Protocols
2.3.4.4 Reliability
2.4 Implementation Strategies and Considerations
2.4.1 Degree of Site Autonomy
2.4.2 Lack of Standards
2.4.2.1 ANSI Standard SQL-2
2.4.2.2 Remote Data Access
2.4.2.3 ISO/OSI
2.4.3 Distributed Database Tools
2.4.4 Planning for the Future
2.5 Summary of the State of the Art
2.5.1 No Full Implementation
2.5.2 Market Pull versus Technology Push
2.6 Related Research Issues
2.6.1 Distributed Database Operating Systems
2.6.2 Distributed Multidatabase Operating Systems
2.6.3 Object Oriented Distributed Databases
2.6.4 Distributed Knowledge Bases

APPENDIX A: GLOSSARY
APPENDIX B: VENDORS IN DISTRIBUTED DATABASE TECHNOLOGY

1. INTRODUCTION

1.1 Evolution of Distributed Databases

During the past twenty years, the practice of organizing repositories of data under a central point of control became commonplace. In an effort to overcome the unmanageable situation of applications generating and maintaining autonomous files of data, corporations organized their data into "centralized" databases, free of duplications and inconsistencies, managed by a database management system (DBMS) under the control of a central database administrator. Guarded by the MIS minions, this practice gave management a secure hold over corporate data, and at the same time gave the company's diverse internal organizations the illusion of being in control of their own data. For the corporate decision makers, gathering data was a "simple" process of querying the databases residing on the corporate mainframe.

In recent years the physical makeup of corporations, private entities, and government activities has changed from a centralized to a distributed structure. The rise of giant conglomerates is an example of this change. Within one of these composite organizations, even the manufacturing, engineering, and business units of one component can be geographically dispersed. The distributed physical architecture of these organizations demands that the information architecture also be distributed, embracing the concepts of open systems, distributed computing, and hardware and software independence. A natural consequence of this distribution is the requirement to distribute and manage the corporate data. Evolving from this need, the confluence of database technology and network technology has now produced one of the newest members of software engineering technology, called distributed database technology.

A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) is a software system that permits the management of distributed data, making the distribution transparent to the user. A distributed database is more reliable and more responsive than a centrally located and controlled database; data can be entered where it is generated, data at different sites can be shared, and data can be replicated, giving users the option of accessing copies of the data in the event of a site or network failure. Outgrowing data storage resources or computing power doesn't necessitate moving up to the next expensive mainframe; distributed database technology allows affordable, incremental hardware growth.

As with all new technology, the definition of a distributed database was unclear until time and use brought clarification. During the early eighties, vendors selling "distributed" databases, users who felt a need to implement such a beast, and theorists who wrote articles dealing with the topic for technical publications all had their own ideas about what constituted a "distributed" database.

In the Summer 1987 publication of InfoDB, C.J. Date proposed twelve rules that apply to a distributed database. Like E.F. Codd's famous rules for the relational model, Date's have become a bible for distributed database technology. Summarized, the rules are as follows:

o Site Autonomy - Each site maintains local privacy and control of its own data; users that commonly share data can have it located at the site where they work

o No Central Site - The operation of the database does not depend on any single site; each site in the network runs local applications independently of the other sites, or globally on data at remote sites; no single DBMS is more necessary than any other

o Continuous Operation - The distributed database should never require downtime; planned activity should not require a shutdown

o Transparency - The location of the data does not need to be known to applications or users

o Fragmentation Independence - The division of a table into fragments should be transparent to the applications/users

o Replication Independence - Replication of data should be unknown to the user and updates to replicated data are performed transparently to the user

o Distributed Query Processing - The performance of a query should be independent of the site at which it is submitted; interleaved transactions updating multiple sites should be capable of serialization and, in the event of failure, should leave the database in a consistent state

o Distributed Transaction Management - The distributed system should be able to support atomic transactions

o Hardware Independence - The database should be able to integrate data from a wide variety of systems

o Operating System Independence - The database should be able to run on different operating systems

o Network Independence - The database must be able to operate using any communications protocol

o DBMS Independence - Databases must be able to communicate with those of other vendors

More recently, in 1990, Codd specified four minimal conditions to be satisfied by a distributed database. [38] These conditions are the following:

o The database consists of data dispersed at two or more sites

o The sites are linked by a communications network

o At any site X, the users and programs can treat the totality of the data as if it were a single global database residing at X

o All of the data residing at any site X and participating in the global database can be treated by the users at site X in exactly the same way as if it were a local database isolated from the rest of the network

Codd uses these four conditions to distinguish products that support true distributed database management from those supporting only distributed processing.

1.2 Reasons for Distributed Databases

Correctly implemented, distributed databases are more reliable, provide faster data access, reduce communications load, and allow for the incremental upward scaling of hardware. Among the many motivations for developing a distributed database, these are the most frequently encountered:

o a distributed organizational structure demands distributed data

o a need to generate global applications based on pre-existing databases

o a requirement to reduce communications costs

o increased performance or reliability demands

A DDBMS is homogeneous if the same DBMS occurs at each site regardless of the hardware and operating system.[6] Generally, when the motivation is to integrate pre-existing databases, the "bottom-up" design solution involves heterogeneous databases - those belonging to several vendors, probably not based on the same data model. In other situations, a "top-down" design can be used which takes best advantage of the functionality of a distributed database. In either case, the database designer will need to know what technology is available to implement a distributed database system. Most DBMS vendors currently offer a "distributed" version of their product, but because of the lack of standards, these offerings vary in the level of support given to the various aspects of distributed database technology. Also, depending on the particular market served by a vendor, some aspects of distributed database technology are emphasized while others are minimized or non-existent.

In order to select the right DDBMS or to develop an optimum distributed design, the database system designer must understand the relative merits of each feature and be able to make tradeoffs to effectively match implemented features to the specific data needs to be supported.

The objective of this state of the art review is to review those unique features of distributed databases that distinguish them from centralized databases and to examine currently available implementations of these features.

2. THE STATE OF THE ART IN DISTRIBUTED DATABASE TECHNOLOGY

As in centralized databases, regardless of the underlying data model, the fundamental issue of distributed databases is transparency. In a centralized database, transparency refers only to data independence; in a distributed database, transparency refers to the data and to the network. According to Date's first rule, the distributed database should appear to the user as one, unified database. To accomplish this, not only the location of the data, but the very existence of the network must be transparent to the user.

The flipside of the transparency issue is the issue of local autonomy. Each site participating in a distributed database must be wholly independent; its operating system, administration, resident databases and associated catalogs must be totally autonomous.

Factors that come into play when considering transparency and local autonomy include architectural issues, such as the underlying data model, the schema, and the site and network hardware, and the functional issues, such as how and where data is located and how the system synchronizes updates among the participating sites.

2.1 The Architectural Issues of Distributed Databases

The architecture of a distributed database includes the physical components of hosts/servers and the network connecting them, the data models used to implement the component databases, and the schema used to integrate the various independent databases. The structure of a distributed database system is shown in Figure 1.

2.1.1 The Server/Host

The architecture of a distributed database permits a very large database to be supported on a collection of host equipment of varying capacities and performance levels. Each participating site in a network is a general-purpose computer that executes both local application programs and distributed database management functions. These computers range in size from personal computers to powerful workstations and parallel computers.

One of the strong points of distributed database technology is the ability to incrementally add host power to a database structure. However, the design of the allocation of data to the hosts must consider the performance characteristics of each host in order to ensure the most efficient operation of the distributed database system. A frequently used portion of the database should not be allocated to a host with inferior performance characteristics, to a host connected to an unreliable power supply, or to a host on a poorly performing or overloaded Local Area Network (LAN). Taking advantage of the performance improvements offered by parallel architectures, recent trends in distributed database technology are toward assigning database functions to dedicated data servers where the servers are parallel processors. These machines are capable of hosting very large (many gigabyte) databases, enhancing performance by concurrent execution of parallel, complex queries, and significantly reducing the I/O bottleneck via parallel disk accessing.

2.1.2 The Network

The communication network connecting the sites cooperating in a distributed database is most frequently one of these three basic types:

o high bandwidth, low delay "Ethernet"-like local area networks

o lower bandwidth, higher delay, longer range packet-switched networks, like Arpanet

o lower bandwidth, lower delay point-to-point leased circuits

FIGURE 1: The Structure of a Distributed Database [6]
[Diagram not reproduced. Labels recoverable from the figure: Site 1, Site 2, and a further site, each with End Users, a Local DBMS, a Local Database, and a Local Network DBMS Interface; and a Global DDBMS.]

Radio and/or satellite broadcast networks are also being employed as distributed database technology gains popularity.

Not part of the technology of distributed databases per se, but certainly within the purview of the database implementor, are the compression/decompression algorithms employed throughout the configuration of the database. This includes an analysis of any bridge, router, and gateway hardware and software that may be part of the total system. As the number and types of hosts (and workstations) on a network, and the number of local area networks (LANs) comprising a system increase, the ability of the bridges, routers, and gateways to handle the traffic efficiently is seriously affected.

2.1.3 The Data Model

Because distributed database technology is treated as an extension of centralized database technology, considerable discussion and disagreement has continued concerning the appropriate data model to be used for distributed databases. A data model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. It can be one of three types: object-based logical model, record-based logical model, and physical data model.[10]

Object-based logical models provide flexible structuring capabilities and allow the explicit specification of data constraints. The entity-relationship (E-R) model is an example of an object-based logical model. It models the real world as a collection of objects, called entities, and the relations between them. Shown in Figure 2 is an E-R model of customers, accounts, and the relationship of "customer account".

FIGURE 2: E-R Model of Customer Account

Record-based logical models describe data at the conceptual and view levels, specifying both the overall logical structure and the implementation, but not the data constraints. The relational model, a record-based logical model, represents data and the relationships between data as a collection of tables. Figure 3 shows the customer account as a relational model.

FIGURE 3: Relational Model of Customer Account

The network model, another record-based logical model, represents data by a collection of records and represents relationships among data by links. Figure 4 is a network model of the customer account.
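
The difference between the two record-based models can be sketched in a few lines of Python. This is only an illustration, not something taken from the report: the customer and account values are those that survive from Figure 4, while the pairing of customers with accounts, the reading of the second account column as a single number, and all class and function names are assumptions made for the example.

```python
# Relational model: data AND relationships are held as tables (lists of tuples).
customer = [
    ("Lowery", "Maple", "Queens"),
    ("Shiver", "North", "Bronx"),
    ("Hodges", "Sidehill", "Brooklyn"),
]
account = [(900, 55), (556, 100000), (647, 105366), (801, 10533)]
# Assumed relationship table (customer name, account number); illustrative only.
customer_account = [
    ("Lowery", 900),
    ("Shiver", 556),
    ("Shiver", 647),
    ("Hodges", 647),
    ("Hodges", 801),
]


def accounts_relational(name):
    """Find a customer's accounts by matching values across tables."""
    second_column = dict(account)
    return sorted(
        (num, second_column[num]) for cust, num in customer_account if cust == name
    )


# Network model: data are records, relationships are links between records.
class AccountRecord:
    def __init__(self, number, value):
        self.number, self.value = number, value


class CustomerRecord:
    def __init__(self, name, street, city):
        self.name, self.street, self.city = name, street, city
        self.links = []  # links to AccountRecord objects


def build_network():
    """Build linked records from the same data used by the tables above."""
    accounts = {num: AccountRecord(num, val) for num, val in account}
    customers = {row[0]: CustomerRecord(*row) for row in customer}
    for cust, num in customer_account:
        customers[cust].links.append(accounts[num])
    return customers


def accounts_network(customers, name):
    """Find a customer's accounts by following links from the customer record."""
    return sorted((a.number, a.value) for a in customers[name].links)
```

Both lookups return the same answer; the relational version gets there by comparing values, while the network version follows stored links, so a shared account is one record reachable from two customers.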

Physical data models, which describe data at the lowest level, are not very popular and are not considered appropriate in the context of distributed databases.

There has been research done into the "universal" model, which takes all the relations in a regular relational database and glues them together by means of one operator (natural join) to form a single relation of very high degree that contains all the information in the database.[40] The universal relational model aims at achieving complete access-path independence in relational databases by relieving the user of the need for logical navigation among relations. Access paths are embedded in attribute names, hiding all information about the logical structure of the database from the user. Although relational databases removed the need for physical navigation, access paths among relations must still be specified. The motivation behind the universal relational model is to fully realize Codd's goal to free users from the need to specify access paths.
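
FIGURE 4: Network Model of Customer Account
[Diagram not reproduced. Customer records recoverable from the figure: (Lowery, Maple, Queens); (Shiver, North, Bronx); (Hodges, Sidehill, Brooklyn). Account records: (900, 55); (556, 100 000); (647, 105 366); (801, 10 533). The links connecting customer records to account records are not recoverable.]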

Among the current leading contenders for use with distributed databases are the record-based relational and network models and the object-oriented model. However, as it has with centralized databases, the relational data model has become the de facto standard for DDBMS. E.F. Codd, founder of the relational model, holds that distributed database technology is only feasible when based on a relational model. As characterized by Codd, the relational model contains simple data structures, provides a solid foundation for data consistency, and allows set-oriented manipulations of relations. These three powerful features have propelled the relational model to the forefront of the technology.

The superiority of the relational model for use in distributed databases is disputed by recent work done at the German National Research Center for Computer Science, which espouses the use of an object-oriented database approach to distributed database management.[4] A discussion of this effort is found in section 2.6.3.

It should be noted that it is possible to build a distributed database system without a single "global" data model. Providing a high degree of site autonomy by not enforcing a global data model or schema, the Sybase DDBMS product supports distributed operations via application programming or database-oriented remote procedure calls (RPCs) between Structured Query Language (SQL) Servers.[23] When multiple data models exist within a distributed database system, the system must provide for mapping from structures of one DBMS to another and for translating the commands of one DBMS's data manipulation language to their equivalents in the data manipulation language of the other DBMS(s).[6] For example, Ingres' distributed product, Ingres/STAR, provides these functions via gateway products (restrictions apply to the location of the global data dictionary). Also providing transparent join and view of multiple databases, the Informix-STAR product includes an extended synonyms feature permitting users to employ synonyms as pointers when tables are moved between sites, thus freeing them from the need to specify which computer to access.

2.1.4 The Schema

The schema describes a database as it is stored; it describes physical characteristics such as format, storage location, and access paths, and defines the logical structure of the database. In a centralized database, the schema is the global view of the database in terms of which all user views, called subschemas, are defined. In a distributed database, schema integration refers to the way users logically view the distributed data. Whether the DDBMS is homogeneous or heterogeneous, there are two kinds of schemas - the global schema and the local schema. The global schema defines all of the data in the system; the local schema defines the data at the local sites.

When the distributed database involves heterogeneous databases, two general approaches exist to mapping the component distributed database schemas:

o integrate all the local schemas into one global schema and derive all user views from the global schema

o integrate various portions of the local schemas into multiple federated schemas

In the heterogeneous case, the designer may also be faced with a schema translation problem when different data models are involved or different naming conventions are used. Data mapping may also be required if data types or data values need to be converted for conformity (for instance, temperatures stored as Fahrenheit and Celsius).
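
A minimal sketch of such a mapping follows, assuming two hypothetical sites that name the temperature field differently and store it in different units. The site names, field names, and the choice of Celsius for the global schema are all assumptions made for the example.

```python
def fahrenheit_to_celsius(value):
    """Convert a Fahrenheit reading to Celsius."""
    return (value - 32.0) * 5.0 / 9.0


# Hypothetical local schemas: each site uses its own field name and unit.
LOCAL_SCHEMAS = {
    "site_a": {"field": "temp_f", "convert": fahrenheit_to_celsius},
    "site_b": {"field": "temperature_c", "convert": float},
}


def to_global(site, row):
    """Translate one local row into the assumed global schema (Celsius)."""
    mapping = LOCAL_SCHEMAS[site]
    celsius = mapping["convert"](row[mapping["field"]])
    return {"station": row["station"], "temp_c": round(celsius, 2)}


# Rows from both sites become comparable once mapped.
global_rows = [
    to_global("site_a", {"station": "S1", "temp_f": 212}),
    to_global("site_b", {"station": "S2", "temperature_c": 21}),
]
```

Keeping the per-site rules in one table means a naming change at a site touches only that site's entry, which is one way to localize the translation problem described above.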

Whatever schema approach is chosen, users should be able to refer to and create tables by name without needing to know where in the system the table is physically located or having to be concerned about naming conflicts. The ability of the database to ensure unique system names is provided through a catalog called the data dictionary. Information about sites and storage structures, database sizes and other statistics, access privileges, fragmentation and replication of tables, and system naming conventions is kept in a global data dictionary which is itself a distributed database.

The conceptual problems associated with the schema are embodied in the data dictionary. If it is kept at a single site, there then exists a single point of failure. If replicated at every node, then every change in its information requires a change at every site. Some DDBMSs employ an approach in which each site maintains its own local catalog which the system searches for each reference to a table. This method saves in the maintenance effort but generates overhead network traffic.
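
The trade-off of the local-catalog approach can be made concrete with a small sketch. It assumes, purely for illustration, that the system checks the requesting site first and then asks the other sites one at a time; the `Site` class, the `locate` function, and the site and table names are hypothetical and do not describe any particular product.

```python
class Site:
    """A participating site with its own local catalog of table names."""

    def __init__(self, name, tables):
        self.name = name
        self.catalog = set(tables)


def locate(table, home, sites):
    """Return (site name, remote lookups) for a table referenced at `home`.

    The home catalog is checked first at no network cost; every other
    catalog consulted counts as one unit of overhead network traffic.
    """
    if table in home.catalog:
        return home.name, 0
    remote_lookups = 0
    for site in sites:
        if site is home:
            continue
        remote_lookups += 1
        if table in site.catalog:
            return site.name, remote_lookups
    raise LookupError(f"table {table!r} not found at any site")


# Hypothetical three-site configuration.
utica = Site("utica", ["customer"])
rome = Site("rome", ["account"])
albany = Site("albany", ["branch"])
sites = [utica, rome, albany]
```

Here `locate("branch", utica, sites)` needs two remote lookups, whereas no catalog ever has to be updated when a table is created at another site; that is the maintenance saving bought with extra traffic.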

2.2 Functional Issues of Distributed Databases

Transparency refers to the separation of the higher-level semantics of a system from its lower level implementation issues. It is the fundamental characteristic of a distributed database, with the degree of transparency being directly related to the degree of distribution. In order to achieve a high degree of transparency, the system must automatically record and maintain information about the location of the data in the database, the status of transactions, failure of any site or communication link in the network, and must support commit and recovery protocols for ensuring transaction atomicity, isolation, and durability. These concerns can be divided into four major issues pertaining to functionality: data location and function distribution, transaction management, and query processing.

2.2.1 Data Location and Function Distribution

General distributed processing allocates parts of an application to different machines based on where the parts are required. The user is very much aware of the distribution; the user must physically move between machines to perform different application functions. Within the context of a distributed database, the functional distribution implies that the data is distributed across database servers based on where the data is required. The fact that the data is distributed is transparent to the user.

Location transparency, also termed distribution independence, hides the physical distribution of the data from the user. Supporting location transparency is the single most important function of a distributed database system. Programs must continue to operate regardless of the distribution configuration of the data. Codd points out that only prototypes and products based on the relational model have been able to demonstrate the capability of supporting distribution independence.[38]

Data distribution is the single function under the control of the applications system designer or database
administrator (the remaining issues are generally an integral part of the DDBMS). The two key issues of
designing a distributed database are its data fragmentation and allocation. The purpose of fragmenting, or
breaking up, the tables of a database and allocating them to one or more sites in the network is to increase the
performance and/or reliability of applications using the database. The problems associated with the allocation
of the fragments are similar to those encountered in allocating files to nodes on a computer network. Although
much of the research into file allocation can be applied to fragment allocation, there currently are no
automated tools or allocation algorithms available to aid in evaluating alternative allocation designs.

2.2.1.1 Fragmentation

The ability to fragment a relation, dividing it into subrelations and allocating the subrelations to a subset of
participating sites, is the distinguishing feature of distributed database technology. Since applications generally
view subsets of a relation, subsets are natural units of distribution. Fragmentation permits this finer granularity
in the unit of data distribution.

Fragment groups are collections that include the primary fragment and those fragments resulting from derived
fragmenting. Derived fragmentation refers to relations that become partitioned due to a primary fragmentation
performed elsewhere. For example, if a relation dealing with department numbers in an organization is
fragmented by department location, then another relation dealing with employee data may become fragmented
since the employee's department number would most likely appear in such a relationship.

Fragmenting a relation can be performed horizontally, vertically, or in combination. Horizontal fragmentation
partitions a relation along its rows; for example, if a column in a relation contains site identification data, it
makes sense to store records associated with a site at that site, giving local users local access to the data.

Table 1 shows the global relationship Projects containing software project data. Within Projects are the project
number, the title of the software project, its approximate dollar value, and the performing location.

Table Projects

TABLE 1: Global relationship Projects

Two examples of possible horizontal partitionings of the global relationship Projects are shown below. In Tables
2(a), (b), and (c) the relationship is partitioned along the location column. The three resulting fragments would
ideally be allocated to hosts at their respective site locations. This scheme places geographically related data
closest to the location most likely to be accessing the data, and reduces the processing required when joins are
executed against the Projects data at each site. Of course, additional communications costs would be incurred if,
for instance, the New Jersey site required access to the Alabama data.

Table Projects (New Jersey)

TABLE 2(a): Horizontal fragmenting of Projects by Location, New Jersey fragment

Table Projects (Alabama)

TABLE 2(b): Horizontal fragmenting of Projects by Location, Alabama fragment

Table Projects (Illinois)

Number | Title | $Value | Location

TABLE 2(c): Horizontal fragmenting of Projects by Location, Illinois fragment


The fragmentation by dollar value shown in Tables 2(d) and (e) might be of value in situations where different
organizations, each with its own computing resources, must deal with data based on a dollar threshold. For
example, if the purchasing function were divided such that one organization dealt with orders exceeding
$200,000 while another dealt with orders of lesser value, this fragmentation might be suitable. Locality of
reference is the relevant criterion for the design of fragments.[16]

Table Projects ($Value > 200,000)

TABLE 2(d): Horizontal fragmenting of Projects by $Value, > $200,000

Table Projects ($Value <= 200,000)

TABLE 2(e): Horizontal fragmenting of Projects by $Value, <= $200,000

Using SQL, the horizontal fragmentation of Projects based on location would be defined as follows:

New-Jersey-Projects =
    select * from Projects where Location = 'New Jersey'

Alabama-Projects =
    select * from Projects where Location = 'Alabama'

Illinois-Projects =
    select * from Projects where Location = 'Illinois'

The horizontal fragmentation of Projects based on dollar value would be defined as follows:

Projects-Over =
    select * from Projects where $Value > 200,000

Projects-Not-Over =
    select * from Projects where $Value <= 200,000

Vertical fragmentation partitions a relation into smaller relations with the goal of minimizing the execution time
of user applications on the fragments. Joins performed on the smaller relations will require much less processing
time. The concept of vertical partitioning, developed within the context of centralized databases for the same
reason, is useful in distributed databases where each fragment may contain data with common geographical
properties. Table 3 shows a vertical partitioning of Projects. Note that the primary key of Projects, project
number, appears as the primary key of each vertical fragment.

Table Projects (Title and Location)

Table Projects ($Value)

TABLE 3: Vertical fragmentation of Projects


Using SQL, the definition for the vertical fragmentation of global relationship Projects as shown in Table 3 is as
follows:

Projects-Title-Location =
    select Number, Title, Location from Projects
Projects-$Value =
    select Number, $Value from Projects

Keeping in mind that the performance of query execution will be affected by the extent to which a database is
fragmented, the database designer must determine the correct level of fragmentation while maintaining the
following properties:

1. Completeness - If a global relation is decomposed into fragments, each data item that can be found in
the global relationship can also be found in one or more of the fragments. This property ensures against
loss of data.

2. Reconstruction - It must always be possible to reconstruct a global relation by joining the fragments
together. Horizontal fragments can be recombined by using the SQL UNION operator. In vertical
partitioning this is generally accomplished by including the key of the global relationship in each
fragment, guaranteeing the reconstruction through a join relationship. This property ensures that
constraints defined on the data in the form of dependencies are preserved.

3. Disjointedness - If a global relation is horizontally decomposed into fragments, any individual data item
can be found in only one of the fragments. Since the primary key attributes of a relation are typically
repeated in each of its vertical fragments, disjointedness is defined only on the nonprimary key attributes
of a vertical fragmentation.
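
The three properties above can be checked mechanically. The following is a minimal sketch, assuming an
in-memory SQLite database and hypothetical sample rows (the data of Table 1 is not reproduced here); the
table names use underscores and the column name Value stands in for $Value so that the statements are
valid SQL. It is an illustration of the properties, not a description of any DDBMS product.

```python
import sqlite3

# Hypothetical sample rows: (Number, Title, Value, Location).
ROWS = [
    (1, "Compiler", 150000, "New Jersey"),
    (2, "Debugger", 250000, "Alabama"),
    (3, "Editor", 90000, "Illinois"),
    (4, "Linker", 400000, "New Jersey"),
]
SCHEMA = "(Number integer primary key, Title text, Value integer, Location text)"
HORIZONTAL = ["New_Jersey_Projects", "Alabama_Projects", "Illinois_Projects"]


def build_database():
    """Create Projects, its horizontal fragments by location, and two vertical fragments."""
    db = sqlite3.connect(":memory:")
    db.execute(f"create table Projects {SCHEMA}")
    db.executemany("insert into Projects values (?, ?, ?, ?)", ROWS)
    for name in HORIZONTAL:
        site = name.replace("_Projects", "").replace("_", " ")
        db.execute(f"create table {name} {SCHEMA}")
        db.execute(f"insert into {name} select * from Projects where Location = ?", (site,))
    db.execute("create table Projects_Title_Location "
               "(Number integer primary key, Title text, Location text)")
    db.execute("insert into Projects_Title_Location select Number, Title, Location from Projects")
    db.execute("create table Projects_Value (Number integer primary key, Value integer)")
    db.execute("insert into Projects_Value select Number, Value from Projects")
    return db


def check_horizontal(db):
    """Return (union of fragments equals Projects, fragments are pairwise disjoint)."""
    original = set(db.execute("select * from Projects"))
    fragments = [set(db.execute(f"select * from {name}")) for name in HORIZONTAL]
    union = set().union(*fragments)
    disjoint = sum(len(f) for f in fragments) == len(union)
    return union == original, disjoint


def check_vertical(db):
    """Return True if joining the vertical fragments on the key rebuilds Projects."""
    original = set(db.execute("select * from Projects"))
    joined = set(db.execute(
        "select a.Number, a.Title, b.Value, a.Location "
        "from Projects_Title_Location a join Projects_Value b on a.Number = b.Number"))
    return joined == original
```

Calling check_horizontal(build_database()) reports completeness/reconstruction and disjointedness for the
location-based fragments; check_vertical does the same for reconstruction through the key join.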

In a distributed database system, a query written against a fragmented database would look exactly like a query
written against a centralized database. However, since no current DDBMS product supports fragmentation, the
user must know how the database is fragmented to be able to construct correct queries.

Table 4 shows the relationship Staff occurring in the database of software engineering projects. Staff contains
the employee identification number, name, position, and number of current project assignment.

Table Staff

TABLE 4: Global relationship Staff

To find out where programmers are currently working, the SQL query against either a distributed or centralized
database would be:

select Title, Location from Staff, Projects
    where Projects.Number = Staff.Number
    and Position = 'Programmer'

A true distributed database system would automatically expand this query to some equivalent of the following:

select Title, Location from Staff, New-Jersey-Projects
    where New-Jersey-Projects.Number = Staff.Number
    and Position = 'Programmer'
if not found
    select Title, Location from Staff, Alabama-Projects
        where Alabama-Projects.Number = Staff.Number
        and Position = 'Programmer'
if not found
    select Title, Location from Staff, Illinois-Projects
        where Illinois-Projects.Number = Staff.Number
        and Position = 'Programmer'
else print "No programmers assigned"

With the current state-of-the-art in distributed databases, the user must be aware of the fragmentation and
must provide for the expansion.
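
A minimal sketch of such an expansion, written by hand as a user or custom software layer would have to
do it, is given here. It assumes an in-memory SQLite database, hypothetical sample rows, and underscore
table names; instead of the sequential "if not found" form, it rewrites the global query as a union of
one query per horizontal fragment, which is one possible equivalent rather than the form any particular
product generates.

```python
import sqlite3

FRAGMENTS = ["New_Jersey_Projects", "Alabama_Projects", "Illinois_Projects"]


def expand_query(fragments):
    """Rewrite the global Staff/Projects join as one select per fragment, combined by union."""
    per_fragment = [
        f"select Title, Location from Staff, {frag} "
        f"where {frag}.Number = Staff.Number and Position = 'Programmer'"
        for frag in fragments
    ]
    return " union all ".join(per_fragment)


def demo():
    """Load hypothetical data and run the expanded query against the fragments."""
    db = sqlite3.connect(":memory:")
    db.execute("create table Staff (Id integer, Name text, Position text, Number integer)")
    db.executemany("insert into Staff values (?, ?, ?, ?)", [
        (10, "Ada", "Programmer", 1),
        (11, "Bob", "Analyst", 2),
        (12, "Cy", "Programmer", 3),
    ])
    rows_by_fragment = {
        "New_Jersey_Projects": [(1, "Compiler", "New Jersey")],
        "Alabama_Projects": [(2, "Debugger", "Alabama")],
        "Illinois_Projects": [(3, "Editor", "Illinois")],
    }
    for frag, rows in rows_by_fragment.items():
        db.execute(f"create table {frag} (Number integer, Title text, Location text)")
        db.executemany(f"insert into {frag} values (?, ?, ?)", rows)
    return sorted(db.execute(expand_query(FRAGMENTS)))
```

With this sample data, demo() returns the title and location of the two projects that have a programmer
assigned, one from the New Jersey fragment and one from the Illinois fragment.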

2.2.1.2 Replication

Once fragmentation has been completed, the individual fragments must be allocated to various sites on the
network. A big decision for the distributed database designer is whether any or all of the fragments should be
maintained at more than one site. If no data is replicated, the system is referred to as partitioned; if all the data
exists at every site, the system is termed fully replicated; if only some fragments exist at multiple sites the system is
called partially replicated.

An optimal allocation of fragments must address both the costs associated with storing multiple copies and the
performance of the resultant system. Storing costs must include the physical storage costs, the querying costs,
and the updating costs, where updating involves concurrency control mechanisms and integrity enforcements
across multiple copies of data.

Transaction oriented database applications demand a high level of reliability and availability. With multiple
copies, the probability that some copy will be available even when system failures occur is high. For read only
queries accessing the same data, multiple copies provide an opportunity for parallel execution. While a system
with no data replication eliminates the complexities related to update synchronization, reliability and
performance requirements may dictate either full replication of data at each site or varying degrees of partial
replication. Although partial replication introduces the additional cost of remote accesses, the cost is low when
compared with the costs associated with write operations in a fully replicated situation.[37] Data placement is
essentially a trade-off between update costs and the benefits of increased reliability and performance.
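
The availability side of this trade-off can be illustrated with a small calculation. The sketch below
assumes that sites fail independently and that each site is up with the same probability; both are
simplifying assumptions introduced here for illustration, not figures from the cited experiments.

```python
def availability(site_up_probability: float, copies: int) -> float:
    """Probability that at least one of `copies` replicas is reachable.

    Assumes independent site failures with a common per-site availability.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - site_up_probability) ** copies


def copies_to_update(copies: int) -> int:
    """Every write must eventually reach every copy, so write work grows with the copy count."""
    return copies


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, round(availability(0.9, k), 5), copies_to_update(k))
```

Under these assumptions each added copy removes a smaller slice of unavailability than the one before,
while the number of copies that every update must reach keeps growing linearly.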

The complexity of synchronization procedures and the level of communications required are dependent on the
number of copies of the data maintained in the system. Using their robust and adaptable distributed database
system, RAID, Purdue University is currently carrying out experiments to obtain measurements that provide
empirical evaluation of algorithms used in distributed database systems. [37] The RAID experiments have
examined replicated copy control during site failure and recovery to determine how fast database consistency
can be restored and what the associated costs are. At system configuration time, RAID provides a threshold value
which specifies the minimum number of copies of each database object to be maintained in the system.
Availability was measured by the number of aborted transactions due to site failures. The experiments indicate
that thresholds up to three improve availability, while those above three yield substantially smaller
improvements, and full replication produces the poorest performance results.

2.2.2 Distributed Query Processing

The query processor in a centralized DBMS transforms high-level queries into equivalent lower-level queries
which implement the execution strategy, focusing on optimization of performance primarily by reducing disk
accesses.

Distributed query processing must deal with the analysis, optimization, and execution of queries referencing
distributed data. Query optimization and execution in a distributed database environment involves global and
local optimization plans and the selection of access paths. Choices concerning the best site to process data and
how data should be moved between sites make the task of distributed query processing significantly more
complex than the centralized version.

A distributed query optimizer decomposes a query into a sequence of serial and parallel operations, groups the
operations that can be performed at the same site, and stages the transmission of results between sites to
eventually yield the desired result. The dynamic nature of a DDBMS adds to the complexity of the optimizer,
since each site must also carry on its own local execution load, while the network is subjected to varying traffic
patterns and bottlenecks. Optimizing distributed queries involves consideration of the following:

o speed differences in communication links

o speeds and loads of local processors

o nature of operations at sites

o possible parallelism in query execution

o replicated/fragmented data possibilities

Significant research has occurred in the area of distributed query processing. The results of this research can be
observed in the variety of implementations currently found in commercial and research systems. The research
emphasis has been on finding methods that minimize the costs associated with intersite communication. In most
cases, optimization is broken into two separate problems: selection of a global execution strategy, based on
intersite communication, and selection of each local execution strategy, based on centralized query processing
algorithms. [31]

Just as the ordering of joins is important in centralized databases, it is more important in distributed databases
because of the existence of fragments where their joining may significantly increase communication costs. Some
optimizing algorithms exploit the existence of replicated fragments at run time in order to minimize
communication costs by using the semijoin operation to reduce the amount of data that must be moved between
sites. However, because the semijoin also increases the overhead associated with control messages, some recent
systems no longer rely on it except in cases where it significantly reduces the amount of data that must be
moved. [31]
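
The effect of a semijoin on data movement can be sketched as follows. The example assumes Staff is held
at one site and Projects at another, uses hypothetical sample rows, and counts "items shipped" as a crude
stand-in for communication cost (a key value is counted the same as a full row, which understates the
saving). It ignores the control-message overhead mentioned above.

```python
def semijoin_plan(staff, projects):
    """Join Staff (site A) with Projects (site B) using a semijoin.

    Returns the join result and the number of items sent over the network.
    """
    programmers = [s for s in staff if s["Position"] == "Programmer"]
    # Step 1: site A sends only the distinct join-column values to site B.
    keys = {s["Number"] for s in programmers}
    # Step 2: site B returns only the Projects rows that match those values.
    reduced = [p for p in projects if p["Number"] in keys]
    # Step 3: site A finishes the join locally against the reduced rows.
    result = [
        (p["Title"], p["Location"])
        for s in programmers
        for p in reduced
        if p["Number"] == s["Number"]
    ]
    return result, len(keys) + len(reduced)


def ship_everything_plan(projects):
    """Baseline: send every Projects row to site A before joining."""
    return len(projects)


# Hypothetical sample data.
STAFF = [
    {"Name": "Ada", "Position": "Programmer", "Number": 2},
    {"Name": "Bob", "Position": "Analyst", "Number": 1},
    {"Name": "Cy", "Position": "Manager", "Number": 3},
]
PROJECTS = [
    {"Number": n, "Title": t, "Location": loc}
    for n, t, loc in [
        (1, "Compiler", "New Jersey"),
        (2, "Debugger", "Alabama"),
        (3, "Editor", "Illinois"),
        (4, "Linker", "New Jersey"),
        (5, "Loader", "Alabama"),
    ]
]
```

With this data the semijoin moves one key and one row, where the baseline moves all five Projects rows;
when most rows match, the extra round trip buys little, which is consistent with the caution above.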

Successful distributed query processing often depends on the availability of database statistics. Where limited
bandwidth is a determining factor, the selection of the ordering of operations is a critical operation. In order to
make this selection, the optimizer must have at its disposal statistics on the database fragments with which it will
dynamically estimate the cardinalities of results of relational operations. These statistics are maintained in the
data dictionary which, depending on the particular DDBMS implementation, may be centralized or distributed.
The designer of a distributed database who anticipates heavy distributed querying activity should be especially
concerned about the distributed capabilities of the DDBMS's data dictionary.

2.2.3 Transaction Management

Managing transactions in a distributed database environment requires dealing with concurrency control, system
reliability, and the efficiency of the system as a whole. The execution of transactions must be done in a way that
preserves the characteristics of transactions, minimizes the cost, and maximizes system availability. The
transaction manager must provide the system with resiliency. Despite component failures, the system must be
able to continue operations and ensure that database consistency is not violated.

2.2.3.1 Concurrency Control

Concurrency control is the most difficult of the problems faced by distributed databases when data redundancy
is permitted. Generally, the techniques in use today to maintain data consistency while minimizing the overhead
of propagating control information to all nodes in the network are extensions of one or both of the same
techniques used in centralized databases - locks and timestamps. Likewise, when locking is used as the method
of synchronization, deadlock of the DDBMS can result. Well established methods of deadlock prevention,
avoidance, and detection can be applied to distributed database systems.

2.2.3.1.1 Locking

Locking, the simplest form of concurrency control to implement, is the method most used in centralized DBMS
products. Those portions of a database involved in a read or write operation are "locked", made unavailable for
any other operations. Differences in DBMS products can be found in the granularity of the locks; products may
"lock" at the data item, the record, the page, the table or file, etc.

When used in a distributed database environment, the locking method results in long delays while the locking
protocol is propagated to all the affected nodes, the transaction is accomplished, and the acknowledgements are
again propagated. For an "n" node network, straightforward locking involves 5n internode messages to
accomplish one transaction as follows: n lock messages, n lock grant messages, n update messages, n update
acknowledgements, and n release lock messages. Several variations of locking, including the popular "two phase
commit", reduce the number of messages to 4n, 3n, and even 1.5n by using concepts such as majority locking,
where only a majority of the nodes are required for a commit rather than unanimous approval, and piggybacking
update messages on top of lock requests, but all of these techniques prove to be unsatisfactory in situations
involving large numbers of sites and high transaction volumes.[40]
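
The message counts above are simple to tabulate. The sketch below only restates the arithmetic from the
text; the function names are introduced here for illustration, and the reduced factors (4, 3, and 1.5)
are listed without attributing them to specific locking variants, since the text does not do so.

```python
# The five message types exchanged with each of the n nodes under straightforward locking.
PHASES = ["lock", "lock grant", "update", "update acknowledgement", "release lock"]

# Messages per node: 5 for straightforward locking, then the reduced figures quoted in the text.
FACTORS = [5, 4, 3, 1.5]


def straightforward_messages(n: int) -> dict:
    """Per-phase message counts for one transaction on an n-node network."""
    return {phase: n for phase in PHASES}


def total_messages(n: int, factor: float = 5) -> float:
    """Total internode messages for one transaction at the given per-node factor."""
    return factor * n


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, [total_messages(n, f) for f in FACTORS])
```

Even at the lowest factor the total still grows linearly with the number of sites, which illustrates
why these variations remain costly for large networks with high transaction volumes.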

Another variation of locking, the primary-site concept, involves funnelling all updates for given partitions of
the database through a primary site. When requests for data conform to well-defined patterns, for instance, by
geographical location, the primary site technique is effective; however, when requests can span multiple primary
sites this technique can result in global database locking.[40]

2.2.3.1.2 Timestamping

Timestamp-based concurrency control algorithms establish a serialization ordering of transactions by
assigning to each a unique identifier, usually a composite stamp containing a site identifier and a monotonically
increasing counter value. The transactions are executed according to the assigned order.

To compensate for the real-life situation of operations arriving at nodes out of sequence, each data item is
assigned two timestamps. The read timestamp indicates the largest timestamp of transactions to have read the
data item; the write timestamp indicates the largest timestamp to have updated the data item. The transaction
manager compares the value of a data item's timestamps to those of the incoming transactions to determine if it
should apply the transaction.
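
A minimal sketch of this comparison is shown here. The text does not spell out the acceptance rules, so
the code assumes the textbook "basic timestamp ordering" rules: a read is rejected if a later-stamped
transaction has already written the item, and a write is rejected if a later-stamped transaction has
already read or written it. Timestamps are modeled as (counter, site identifier) pairs, an assumed
encoding of the composite stamp described above.

```python
from dataclasses import dataclass

# (counter, site id); smaller than any timestamp actually issued, since counters start at 1.
ZERO = (0, 0)


@dataclass
class DataItem:
    value: object = None
    read_ts: tuple = ZERO    # largest timestamp that has read the item
    write_ts: tuple = ZERO   # largest timestamp that has written the item


def read(item: DataItem, ts: tuple):
    """Return (accepted, value) for a read by the transaction stamped `ts`."""
    if ts < item.write_ts:
        # A later transaction already overwrote the item: this read arrived too late.
        return False, None
    item.read_ts = max(item.read_ts, ts)
    return True, item.value


def write(item: DataItem, ts: tuple, value) -> bool:
    """Apply a write by the transaction stamped `ts`, or reject it as out of order."""
    if ts < item.read_ts or ts < item.write_ts:
        return False
    item.value = value
    item.write_ts = ts
    return True
```

A rejected operation would normally cause its transaction to be aborted and restarted with a new
timestamp; that restart logic is omitted from the sketch.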

A hybrid class of locking-based algorithms also uses timestamping to improve efficiency and the level of
concurrency. These algorithms are not currently implemented in any commercial or research distributed
DBMS. [31]

2.2.3.1.3 Multiple Protocol Methodology

Another concurrency control method under development today employs several different synchronization
techniques depending on the transaction being executed. At system design time, after an analysis of the ways in
which transactions can interfere with each other, several synchronization protocols are established which vary in
cost according to the level of control provided. Transactions are identified as belonging to a class depending on
the level of concurrency control required to maintain consistency. At run-time, the system does a table
look-up to determine which protocol to employ; if the transaction belongs to several classes, the system chooses
the most efficient; if it doesn't belong to any, the system imposes the strongest protocol defined. This technique,
implemented in IBM's experimental R* distributed database system, is reported as providing the fastest, lowest
cost method of concurrency control at this time.[40]

2.2.3.1.4 Deadlock Management

When locking based algorithms are used to provide distributed concurrency control in a system containing
redundant data, system deadlock - a circular waiting situation - can occur. Most DDBMS products strive to
prevent deadlocks by using timeouts as a detection mechanism. The timeout method causes a transaction to
abort after waiting for a resource for a given time interval. Determining an appropriate value for the interval is
difficult in a distributed environment because of the unpredictable load on the network and site hosts. A longer
timeout value introduces unnecessary delay, while shorter intervals cause unnecessary aborts. A phenomenon
associated with short values is the cascading effect caused when an overloaded system causes aborts which
generate more aborts, increasing the load.

2.2.3.2 Reliability

Within a distributed database environment, the database recovery manager must deal with four types of failures:
transaction failures, media failures, site failures, and communication failures.

Transaction, media, and site failures are common to both centralized and distributed DBMSs. Transaction
failures, usually caused by an error in the data or by the existence or potential for deadlock, are handled by
aborting the transaction and restoring the database to its state prior to the transaction. Media failures, which
result in levels of data loss ranging from complete loss of the stable database and/or the database log to loss of
recent transactions, are most often repaired by either a full restore from an archive copy or a restore
accomplished by redoing and undoing transactions stored in the database log.

Unique to distributed databases, communication failures generally are related to messages that either contain
errors, are delivered out of sequence, or are lost. The lower three layers of the International Standards
Organization's Open Systems Interconnect (ISO/OSI) architecture are expected to handle the first two types of
message related errors. Lost messages, typically the result of communication line or site failures, must be
handled by the DDBMS. In the event of communication line failures the network may become divided, known as
partitioning, and each partition may continue operation. Maintaining the consistency of a distributed database
across a partitioned network, especially if replication of data exists across the partitions, is a monumental task
for the distributed transaction manager.

Protocols employed in reliability techniques include the commit, terminate, and recover protocols. Commit and
recover protocols exist in centralized DBMSs, but their implementation differs in DDBMSs. Maintaining the
atomicity of transactions across multiple sites implies that if a transaction fails at one site, it must be aborted at
all other sites. Termination protocols, unique to distributed databases, complement recovery protocols; while
recovery deals with re-establishing a consistent database across multiple sites, termination deals with
terminating active transactions when a failure has occurred at one or more sites. The commit protocol must
ensure that the effect of a transaction across the entire database is all or nothing.
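
The all-or-nothing requirement can be sketched with the decision rule at the heart of a two phase
commit. This is a deliberately simplified illustration: the function name and the vote encoding are
assumptions made here, and the logging, timeouts, and termination and recovery handling that a real
protocol needs are left out.

```python
def commit_decision(votes: dict) -> dict:
    """Decide the outcome of one transaction from the participating sites' votes.

    `votes` maps a site name to True (ready to commit), False (must abort),
    or None (no reply received, for example because of a lost message).
    Returns the single outcome that every site is told to apply.
    """
    # Commit only when every participating site has explicitly voted to commit.
    all_ready = bool(votes) and all(vote is True for vote in votes.values())
    decision = "commit" if all_ready else "abort"
    # The same decision goes to every site, so the transaction is never half applied.
    return {site: decision for site in votes}
```

A single negative or missing vote is enough to abort the transaction at all sites, which is the
behavior the atomicity requirement above calls for.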

2.3 Current Technology Implementing Distributed Databases

When contemplating entry into a new technology domain, most systems designers survey the current
implemented state-of-the-art. The majority of today's implemented distributed databases are of the
heterogeneous variety, having been developed as a response to the problem of integrating databases scattered
throughout organizations. Among the most documented cases, all having been underway for several years, are
General Motors' DATAPLEX, Amoco's Amoco Distributed Database System (ADDS), Xerox's MULTIBASE,
and the Integrated Manufacturing Data Administration System (IMDAS) developed to support the National
Bureau of Standards Automated Manufacturing Research Facility. Each of these systems is a special purpose,
one of a kind system, customized to include those features of distributed database technology which meet the
needs of the organization.

By and large, "bottom-up" distributed database implementations have been accomplished through sizable
in-house projects involving years of effort. For example, IMDAS, developed to support a prototype computer
integrated manufacturing environment at the National Institute of Standards and Technology, represents
15-20 staff years of effort. In addition to the substantial problems related to handling distributed data, these
programs must deal with a multitude of heterogeneity issues in areas such as the following:

o communications protocols

o database management systems

o data models

o data representations

o data manipulation languages

o transaction management protocols

Within these custom systems can be found sophisticated solutions to the problems of distributed databases.
However, these solutions are tailored to the needs of the particular implementation.

The major focus of research and development today is to develop general purpose distributed database
management systems that will solve a wide range of data management problems. True implementations of
general purpose distributed database technology are only now beginning to emerge in the marketplace. These
products are generally homogeneous solutions with limited support for the heterogeneous environment via
proprietary gateway products. Their implementations with respect to the issues of distributed database
technology differ markedly.

2.3.1 Data Models and Schema Integration

The most desirable distributed database design would be based on a single data model, preferably one in which
each site implemented the same database management system. In this situation, all the problems of disparate
schema integration and query language translation disappear. Because we hardly ever get to deal with the
"perfect" situation, several commercial distributed database products now support multiple data models and
network protocols via gateway products. Although all these products use the relational data model for the native
system, gateway products provide them with the ability to incorporate older hierarchical and network databases
as nodes. Factors to be considered when beginning the design of a distributed database include the following:

o whether or not the system enforces a global data model

o the methodology employed to generate the global schema

o the location(s) of the global dictionary

2.3.2 Data Distribution

The advantages to be offered by data distribution must be fulfilled by the DDBMS. Offsetting the promises of
improved performance, reliability, and availability are the complexities related to update synchronization,
distribution of control, security and the general lack of experience dealing with distributed databases.
Therefore, when the development of or migration to distributed databases is contemplated, the degree of
distribution and level of location transparency supported by a DDBMS product are factors for serious
consideration.

2.3.2.1 Degree of Distribution

In the design of a distributed database, it may be decided that the organization's structure, geographical
dispersion, or other data requirements may necessitate or lend itself to the use of fragmentation and/or
replication of relations. Although there are plans for it in every vendor's future, no distributed database product
currently supports transparent horizontal and/or vertical fragmentation. If the use of fragmentation is a
requirement, custom software must be written to support the level of transparency required.

On the other hand, most distributed database products currently support replicated data for query purposes;
with the exception of two products (see section 2.3.4), however, these same products only support single site
update within a single transaction.

2.3.2.2 Location Transparency

Within the context of a DDBMS, location transparency boils down to naming transparency - providing unique
names for each object in the database. Implementations of this function range from requiring the user to provide
unique names to having the system embed site location names within the name of each database object.
Embedding locations in the object names can make it unwieldy when the user is required to specify the full name,
as in IBM's experimental R* system. The embedding practice causes other problems when objects are moved
across machines for performance optimization. Some systems elect to embed the "birth" site name in an object's
name, providing referencing functions within the system's data dictionary that resolve the current location of the
object. Other systems provide an aliasing capability for long names. However implemented, the best solution is
for the system to provide unique internal names for database objects and to translate the user names to these
transparently.
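
The birth-site and aliasing ideas can be combined in a small sketch. The "object@birth_site" name
format, the dictionary contents, and the function name are all hypothetical, chosen only to illustrate
the translation from a user name to an internal name and then to a current location; they do not
reproduce the naming syntax of R* or of any other product.

```python
# Hypothetical user aliases mapped to system-wide internal names of the form object@birth_site.
ALIASES = {"projects": "Projects@NewJersey"}

# Hypothetical data dictionary entries for objects that have moved away from their birth site.
CURRENT_LOCATION = {"Projects@NewJersey": "Illinois"}


def resolve(user_name: str):
    """Translate a user-visible name into (internal name, site currently holding the object)."""
    internal = ALIASES.get(user_name, user_name)
    if "@" not in internal:
        raise ValueError(f"unknown object name: {user_name!r}")
    birth_site = internal.split("@", 1)[1]
    # An object that has never moved still lives at its birth site.
    return internal, CURRENT_LOCATION.get(internal, birth_site)
```

Because the internal name never changes, moving an object only requires updating the dictionary entry;
programs that use the alias are unaffected.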

2.3.3 Distributed Query Processing

Distributed query capability can be found in just about every distributed database product, with significant
differences occurring in the query manager. Some current distributed DBMS products send queries to each
database, and then compile the results into one response rather than hand off the query to a distributed query
manager. Some products contain sophisticated cost optimizers.

The location of the data dictionary is a significant factor in query optimization. Some systems fully replicate the
dictionary at each site to expedite query processing; others maintain a centralized version of the dictionary with
the emphasis on expediting updates. Depending on the specific application to be implemented, the designer
must consider the dictionary's location(s). Using a product that supports a centralized dictionary has serious
limitations for an application with hefty distributed query requirements.

Most query optimizers are tied to data transmission costs; a cost-based optimizer reviews all possible semijoins,
determines the time and communications burden for each, and chooses the least cost alternative. The query
method that minimizes network traffic is generally considered the most cost effective. One commercial product,
Informix-STAR, has a verbose feature that informs the user of the costs involved for each SQL statement;
however, it only reveals the costs for the chosen alternative, not for all the possibilities. The Ingres/STAR
product boasts the industry's only "intelligent" optimizer; it relies on database sampling statistics and heuristics
to arrive at an optimal query processing strategy.
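
The least-cost selection described above can be sketched as follows. This is only an illustration under an
assumed linear cost model (a fixed charge per message plus a charge per byte); the function names, the plan
names, and the message sizes are hypothetical and do not describe the optimizer of any product named here.

```python
def transfer_cost(bytes_shipped, fixed_cost=1.0, cost_per_byte=0.001):
    """Assumed cost model: per-message setup charge plus a per-byte charge."""
    return fixed_cost + cost_per_byte * bytes_shipped


def choose_plan(plans):
    """plans maps a plan name to the list of message sizes (in bytes) it ships.

    Returns the cheapest plan name together with the cost of every candidate.
    """
    costs = {name: sum(transfer_cost(size) for size in messages)
             for name, messages in plans.items()}
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical candidates for joining two relations stored at different sites.
plans = {
    "ship_whole_relation": [1_000_000],   # one large transfer
    "semijoin": [20_000, 150_000],        # join column out, matching rows back
}
best, costs = choose_plan(plans)
```

Returning the cost of every candidate, rather than only the chosen one, is a deliberate choice in this sketch:
it gives the designer the full comparison that a report limited to the chosen alternative does not.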

Some databases take advantage of the parallelism offered in distributed databases by concurrently executing
sub-queries at remote sites, and then bringing the data together in some optimal manner for final processing.
Other products require that the processing be performed at the dataserver nearest the user; if the designer's
network contains dataservers with significant performance differences, this is a serious concern.

2.3.4 Distributed Transaction Management

Distributed transaction management deals with the problem of concurrency - synchronizing transactions that
update redundantly stored data. Transaction management protocols handle the commit/abort decision at each
site in the distributed database. Fully implemented, these protocols require transaction logging, recovery,
commit, and deadlock detection/prevention features.

The capability to read and update data located at multiple sites within a single transaction, preserving the
properties of atomicity, isolation, and durability [15], may or may not be provided by a distributed database
management system. How the database handles distributed concurrency control and commit protocols (without
incurring excessive overhead costs while propagating control information), and the ability of the system to
continue operation despite a component failure (ensuring that database consistency is not violated), determines
the extent to which distributed transaction management is supported.

With the exception of Sybase and Ingres/STAR, most current distributed database management systems, even
the customized implementations, restrict distributed update to a single site within a single transaction. The
Sybase and Ingres products both support distributed updates that span multiple locations with a two-phase
commit protocol; however, only Sybase supports multisite updates within one transaction with guaranteed
recoverability. Both the research models and the commercial products list multisite-update in their future
plans. Until then, if you require update at multiple locations you must either use the Sybase product, or develop
custom software to fill the gap.

These two options are, in reality, closer than the reader might suspect. Sybase accomplishes multisite update by
providing its users with a library of database functions to be used in developing distributed database
applications. By incorporating the update, prepare, and commit functions, the applications developer directs his
own distributed transaction management.

2.3.4.1 Concurrency Control Protocols

The majority of commercially available general purpose distributed database management systems do not
currently implement multisite update; updates are limited to either the local host or another single host in the
network.

The most well-known and widely used method for implementing concurrency control is two-phased locking in
which transactions wanting to read data obtain a shared lock on the data item, and transactions wanting to write
the data item obtain an exclusive lock. The granularity of the lock has been the subject of discussion and dispute;
generally, locking occurs at the record, or tuple, level, with a few systems locking at the file, or "relation" level.
Database systems that handle disk accesses themselves, rather than using I/O provided by the operating system,
may lock at the "page" level.

The deadlock situation, in which two transactions each have a locked data item and are waiting for the other to
release the lock, is generally handled via deadlock detection mechanisms.
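
The shared/exclusive rule described above can be sketched with a small lock table. This is a simplified
illustration, not the lock manager of any product: the LockManager class and its method names are hypothetical,
a refused request simply returns False instead of queueing the transaction, and the "item" key stands for
whatever granularity (record, page, or relation) the system locks at.

```python
class LockManager:
    """Hypothetical lock table: item -> (mode, set of holding transactions)."""

    def __init__(self):
        self.locks = {}

    def acquire(self, txn, item, mode):
        """mode is 'S' (shared) or 'X' (exclusive). True if granted, False if txn must wait."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if holders == {txn}:
            # A sole holder may re-request the lock or upgrade it to exclusive.
            new_mode = "X" if "X" in (mode, held_mode) else "S"
            self.locks[item] = (new_mode, holders)
            return True
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        return False          # any combination involving 'X' conflicts

    def release_all(self, txn):
        """Second (shrinking) phase: drop every lock at commit or abort."""
        for item in list(self.locks):
            mode, holders = self.locks[item]
            holders.discard(txn)
            if not holders:
                del self.locks[item]


manager = LockManager()
read_1 = manager.acquire("T1", "row-7", "S")
read_2 = manager.acquire("T2", "row-7", "S")
write_2 = manager.acquire("T2", "row-7", "X")   # refused while T1 still reads
manager.release_all("T1")
write_2_retry = manager.acquire("T2", "row-7", "X")
```

Releasing locks only through release_all keeps each transaction two-phased in this sketch: it never acquires a
lock after it has released one.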

2.3.4.2 Recovery Protocols

The most well-known and widely used method for implementing a recovery protocol is the two-phase commit.
During the first phase, the participating sites indicate the ability and willingness to commit; during the second
phase, if all participants have answered affirmatively, the transaction is globally committed. If even one
participant responds negatively, or fails to respond, the transaction is aborted at every site.
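
The decision rule of the two phases can be sketched as below. This is a minimal illustration, assuming each
participant is represented by a hypothetical "prepare" callable that returns its vote and raises TimeoutError
when the site fails to respond; logging and the delivery of the final decision to each site are omitted.

```python
def two_phase_commit(participants):
    """participants maps a site name to a callable returning True (ready) or False.

    Phase 1 collects the votes; phase 2 commits only if every vote was affirmative.
    A site that does not answer (TimeoutError) is treated as a negative vote.
    """
    votes = {}
    for site, prepare in participants.items():
        try:
            votes[site] = bool(prepare())
        except TimeoutError:
            votes[site] = False
    decision = "commit" if all(votes.values()) else "abort"
    return decision, votes


def ready():
    return True


def silent():
    raise TimeoutError("no reply from site")


all_ready, _ = two_phase_commit({"site_a": ready, "site_b": ready})
one_silent, votes = two_phase_commit({"site_a": ready, "site_b": silent})
```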

The successful implementation of the two-phased commit depends on a transaction logging function at each
site during which log records containing information for undoing and redoing transactions are written to
redundant, non-volatile storage. The two-phased commit is tolerant of failures as long as there is no loss of log
information. Protocols exist that deal with those situations where sites fail during the ready-commit
sequencing. One of the problems associated with the two-phased commit occurs when a communications
failure or a failure of the site initiating a transaction occurs resulting in a partitioned network (see section
2.3.4.3). Some sites may be blocked while waiting for the commit/abort command. During this time system
availability is affected by the held resources related to the blocked transaction. The practice of eliminating the
"ready" phase by having sites transmit a "ready" immediately after executing the transaction exacerbates the
blocking problem when a network or initiating site failure occurs.

Variations in the two-phased commit have been designed in an attempt to solve the blocking problem. The
"presumed commit/abort" variation assumes a transaction is committed/aborted if no information about it is
contained in the log. The "spooling" variation stores messages for a downed site at a predefined "spooling site".
When the site recovers, it applies the spooled messages. Another variation directs recovering sites to look for
lost information at other sites in the network.

2.3.4.3 Termination Protocols

Sites participating in a distributed database must have a consistent view of the network. If, because of a
communications failure, the network becomes partitioned, sites in each partition will have a different "view",
since all the sites in the other partition will appear to be down.

Addressing how sites deal with this type of communication failure, termination protocols handle the abortion of
executing transactions. These protocols, which use the timeout mechanism, vary depending on the stage of the
transaction, the kinds of communication permitted within the DDBMS, and whether it is the initiator of a
transaction or a participant that has failed.

Considering the initiator, if a failure (timeout) occurs while waiting for the participants to respond with a
commit/abort decision, then the transaction can be globally aborted. If a failure occurs while waiting for a
commit or abort acknowledgement, the initiator can only continue to wait. For a participating site, if it has
received an initial update message but never receives the prepare to commit or abort, it can abort the
transaction. However, if a participant has voted to commit a transaction, but never receives a commit message
from the initiator, it will be blocked from any further activity unless the system allows it to communicate with
another participating site.
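
The four timeout cases in the preceding paragraph can be summarized as a small decision table. The role and
state labels below are hypothetical names chosen for this sketch; they only restate the cases given in the text.

```python
# (role, state when the timeout fires) -> action permitted by the termination protocol
TIMEOUT_ACTIONS = {
    ("initiator", "waiting_for_votes"): "abort",     # no decision sent yet: abort globally
    ("initiator", "waiting_for_acks"): "wait",       # decision already sent: keep waiting
    ("participant", "updated"): "abort",             # update seen, no prepare received
    ("participant", "voted_commit"): "blocked",      # must learn the outcome from another site
}


def on_timeout(role, state):
    """Return the action for a site whose timer expired in the given state."""
    return TIMEOUT_ACTIONS[(role, state)]


blocked_case = on_timeout("participant", "voted_commit")
```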

Blocking and non-blocking termination algorithms have been developed that deal with variations that may
arise when sites are allowed to "discuss" their transaction states.

2.3.4.4 Reliability

Two related aspects of the reliability of distributed database systems are correctness and availability. These two
factors are inversely related; imposing more of one results in less of the other. The trade-off needs to be
evaluated by the designer of the system at hand.

For non-redundant data, availability depends strictly on the occurrence of site or network failures; there is no
way to increase the reliability of the system. Increasing the availability of the system is a major goal when
introducing redundant data into a distributed database system.

2.4 Implementation Strategies and Considerations

2.4.1 Degree of Site Autonomy

Despite Date's rule concerning site autonomy, reality may dictate varying degrees of autonomy. Given that each
site maintains control of its own data, there may be compelling reasons for the existence, at some central site, of
any of the following:

1. A global catalog responsible for maintaining information about fragmentation and allocation of data; an
   alternative where there is high probability of frequent catalog updates coupled with infrequent
   distributed queries

2. A central scheduler, or coordinating process, responsible for synchronizing access to the global database

3. A central deadlock detector to which local sites periodically report information relating to transactions
   waiting for resources; a simple detection mechanism, this may be a viable choice if the network has the
   capacity to carry the extra communications load and if the issue of failures related to the time it takes to
   transmit deadlock data to the central site has been considered
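
The central deadlock detector in item 3 can be sketched as a cycle search over the merged "wait-for" reports.
This is an illustrative sketch only: the report format (a list of waiting/holding transaction pairs per site) and
the function name are assumptions, and the timing problems noted in item 3 are not addressed.

```python
def find_deadlock(reports):
    """reports maps a site name to a list of (waiting_txn, holding_txn) pairs.

    Merges the local reports into one global wait-for graph and returns the
    transactions forming a cycle, or None when no deadlock exists.
    """
    graph = {}
    for edges in reports.values():
        for waiter, holder in edges:
            graph.setdefault(waiter, set()).add(holder)

    path, finished = [], set()

    def visit(txn):
        if txn in path:                       # came back to a transaction on the path
            return path[path.index(txn):]
        if txn in finished:
            return None
        path.append(txn)
        for holder in sorted(graph.get(txn, ())):
            cycle = visit(holder)
            if cycle:
                return cycle
        path.pop()
        finished.add(txn)
        return None

    for txn in sorted(graph):
        cycle = visit(txn)
        if cycle:
            return cycle
    return None


# Neither site sees a cycle locally; the deadlock appears only in the merged graph.
deadlock = find_deadlock({"site_a": [("T1", "T2")], "site_b": [("T2", "T1")]})
no_deadlock = find_deadlock({"site_a": [("T1", "T2")], "site_b": [("T3", "T2")]})
```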

2.4.2 Lack of Standards

As in any field of engineering, a system's architecture defines its structure. Within the field of computer systems
we try to establish some reference architecture that we term a "standard". Software developers may deviate
from this reference, and in the past they have, but deviating in today's market is risky business.

Standards rely on proven and mature technology. The rapid innovation rate in this field makes standards
obsolete before they can be established. Since the relational data model and some variant of SQL have been
adopted by today's commercial DBMS products, these have been "standardized". Especially for heterogeneous
distributed databases, standards for both language and remote access are essential. If the two-phase commit
and two-phase locking protocols were standardized, implementations would be straightforward.

2.4.2.1 ANSI Standard SQL-2

All of the available distributed database products support some version of IBM's Structured Query Language
(SQL). Although the American National Standards Institute (ANSI) established an SQL standard, SQL-86, in
1986, each product's version of SQL is different.

The SQL-86 standard has been vigorously attacked by E.F. Codd in at least three publications. The first two
occurred in a two part article, "Fatal Flaws in SQL", appearing in the August and September, 1988 editions of
Datamation. Codd reiterated and elaborated his complaints in his recent publication The Relational Model for
Database Management: Version 2.[38] Three flaws, described by Codd as having "grave consequences" are
these:

o SQL permits duplicate rows in relations

o It supports an inadequately defined kind of nesting of a query within a query

o It does not adequately support three-valued (or four-valued) logic

Since increasing numbers of businesses and government institutions are becoming dependent on relational
DBMSs for the success of their operations, Codd believes these flaws must be repaired. He recommends that
database users avoid duplicate rows within relations at all times, avoid nested versions of SQL statements
whenever a non-nested version is possible, and take extra care when manipulating relations that have columns
that may contain missing values.

The ANSI X3H2 Database Standards Committee is currently battling over the newly emerging standard called
SQL-2. Embroiled in the battle, but not on the committee, are Codd and Date. Leading problems to be solved
and/or negotiated are the following:

o Should duplicate rows be supported

o Should NULLs be supported

o Should primary keys be supported

o Datetime functions complicated by institution of Universal Time Coordinate (UTC) replacing
  Greenwich mean time

o Complexity added by updateable views

o Security issues associated with GRANT and REVOKE; REVOKE has been added and GRANT
  now permits circular references

The latest information indicates that the committee may be close to agreement with International Standards
Organization's (ISO) working draft.[8] There is still a long way to go before the next SQL standard is available.

2.4.2.2 Remote Data Access

Remote access protocols support common communication mechanisms between local and remote processes.
Two alternative approaches have been proposed: message passing and remote procedure calls.[31] Message
passing consists of two primitives, send and receive, which, depending on their implementation, may provide
reliable or unreliable communications. Remote procedure calls are a restricted form of message passing
equivalent to blocking send and receive.
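
The equivalence stated above, a remote procedure call as a send followed by a blocking receive, can be sketched
with two in-process queues standing in for the network. Everything here is an assumption made for illustration:
the queues, the thread playing the remote process, and the function names are hypothetical and do not represent
any vendor protocol or the ISO remote access work.

```python
import queue
import threading


def server(requests, replies):
    """Hypothetical remote process: receive one request, then send one reply."""
    procedure, args = requests.get()      # receive
    replies.put(procedure(*args))         # send


def remote_call(requests, replies, procedure, *args):
    """A remote procedure call built from the two primitives: send, then block on receive."""
    requests.put((procedure, args))       # send
    return replies.get(timeout=5)         # blocking receive


requests, replies = queue.Queue(), queue.Queue()
worker = threading.Thread(target=server, args=(requests, replies))
worker.start()
result = remote_call(requests, replies, max, 3, 7)
worker.join()
```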

DBMS vendors have established proprietary interprocess communications protocols, but if heterogeneous
distributed databases are to flourish standards must be established and followed. In 1985 an ISO working group
was formed to work on Remote Access Standards.

The standardization necessary to interconnect heterogeneous hardware and support the transfer of data
between them is provided by the Open Systems Interconnect (OSI) protocol family of the ISO. Can all the
system functions of a distributed database management system be performed adequately at the applications
layer? There have been suggestions in the research community that this may be the correct approach.

2.4.3 Distributed Database Tools

There is an acute need for automated tools to support distributed databases. Tools are required for each of the
following:

o Design and maintenance of table & fragment locations

o Measuring and monitoring system performance

o Global security administration

Other than the performance tool associated with the Ingres/STAR query optimizer (discussed earlier), no
automated tools are provided with currently available distributed DBMS products. Developers of the
one-of-a-kind systems also express a desire for automated design and measurement tools.

2.4.4 Planning for the Future

Implementing a homogeneous distributed database is a sizable effort; implementing a heterogeneous
distributed database is a monumental task. Even with the newest commercially available database products, not
all features of distributed database technology are available, with most products emphasizing those features that
complement their particular market niche.

Toward building long-lasting database applications, and planning for upward migration, the database designer
is urged to consider design strategies that insulate applications from changes that would otherwise be required
by future releases of your underlying DBMS product that increase its distributed functionality. If your DDBMS
does not support multisite update within a single transaction, provide custom software that makes it appear to
the application that it is supported. If your DDBMS does not support table replication, supplement it with custom
software that copies a remote table to the user's site transparently to the application. Follow Codd's advice
regarding avoiding capabilities existing in SQL today that may not be there in the future. When the time comes
that the missing feature is provided by the DDBMS, or standardization eliminates one that is there now, the
custom software is removed and all applications may take advantage of the increased functionality without
modification.
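
One way to picture this insulation strategy for table replication is a thin layer between the application and the
DDBMS. The sketch below is hypothetical throughout: the ReplicaShim class, its fetch_remote callable, and the
native_replication switch are invented for illustration, and keeping the local copy current after remote updates
is deliberately left out.

```python
class ReplicaShim:
    """Hypothetical layer hiding whether the DDBMS replicates tables itself."""

    def __init__(self, fetch_remote, native_replication=False):
        self.fetch_remote = fetch_remote          # callable: table name -> rows
        self.native_replication = native_replication
        self.local_copies = {}

    def read(self, table):
        if self.native_replication:
            # The DDBMS now provides the feature: pass the request straight through.
            return self.fetch_remote(table)
        if table not in self.local_copies:
            # Custom software: copy the remote table to the user's site once.
            self.local_copies[table] = list(self.fetch_remote(table))
        return self.local_copies[table]


calls = []


def fetch_remote(table):
    calls.append(table)
    return [("row", table)]


shim = ReplicaShim(fetch_remote)
first = shim.read("parts")
second = shim.read("parts")   # served from the local copy, no second remote fetch
```

The application calls read() in both cases, so switching native_replication on when a later release supplies the
feature requires no application change in this sketch.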

2.5 Summary of the State of the Art

Distributed database technology's chief advantage is the ability to access data faster and cheaper than the
alternative centralized database approach. In order to make full use of this advantage, data must be able to be
located transparently throughout the system, updates to the data must be synchronized, and queries of the data
must be optimized to reduce not only the local disk accesses, but also the communications costs. Where the
system involves heterogeneous databases, the system must be able to cope with various SQL dialects and remote
procedure call protocols.

2.5.1 No Full Implementation

No current, commercially available distributed database product fully implements the concepts of distributed
database technology. None address the problem of table fragmentation. However, potential users who carefully
analyze their particular data requirements probably can find general purpose DDBMS software that will meet
those requirements, although it may need to be augmented with custom software or additional hardware to
compensate for deficiencies.

2.5.2 Market Pull versus Technology Push

The situation within the distributed database research and development community is currently one directed by
market pull rather than technology push. The technology is going to advance based primarily on the needs of the
users, rather than on any radical breakthroughs accomplished in the research labs.

Date's twelve rules spell out the requirements for implementing true distributed databases, and until those rules
can be satisfied by a general purpose distributed database product, applications will not be able to take
advantage of the full functionality offered by this technology.

2.6 Related Research Issues

Researchers in some technology domains that have traditionally studied and developed in isolation now find
their technologies overlapping. Today, they are either joining ranks or are being forced into cooperation in order
to produce solutions that meet the growing demand of user communities. The following topics are all technology
areas that are being impacted by developments being made in distributed database technology.

2.6.1 Distributed Database Operating Systems

Distributed database systems run as user applications on top of a host operating system. Although the topic of
distributed database operating systems has not been fully researched, there has been some discussion to the
effect that the performance and functionality of DBMSs can be improved by modifying and enhancing the
operating system to satisfy the additional requirements of DBMSs, particularly their transaction support, buffer
management, and concurrency control requirements.

Enhancements and improvements have been implemented in special purpose "database operating systems"
found in database machines, but not within the context of general purpose operating systems, although some
research operating systems designs now include some of the required functionality. Areas where operating
system change is contemplated are in the provision of the following:

o Fragmentation and replication transparency

o Network transparency

o User authentication and authorization control

o Full transaction management

o Special buffer and memory management

2.6.2 Distributed Multidatabase Operating Systems

Coordinating existing autonomous heterogeneous databases without attempting to integrate them with a
unified schema is the subject of multidatabase or federated systems. These systems exhibit a high degree of
autonomy and do not lend themselves to integration. The global schema in a multidatabase represents each local
database separately. The user is presented a common data manipulation language with which he/she identifies
the database to be used. Queries against a multidatabase generally are directed at only one of its components.

Some examples of autonomous databases that are federated for the purpose of information retrieval are
dial-up information services such as CompuServe (TM) and The Source (TM). Dial-up services frequently
guide the user through a sequence of queries to arrive at the required information.

The techniques developed for distributed databases will not suffice for multidatabases. Currently, the emphasis
in this technology area is focusing on a common language to be used in managing information retrieval. The
major commercial systems are involved and, through ISO and SQL Access standards, they will be able to
cooperate in processing multidatabase queries.[25]

2.6.3 Object Oriented Distributed Databases

Object-oriented (OO) database systems are fast gaining support from the design communities. Where
relational databases meet the demands of business applications typified by very large amounts of
well-structured information, limited types and structures, and transactions that last for short lengths of time,
the OO data model supports entities that are objects with functional characteristics and supports the
requirement for dealing with long-lived transactions. The mathematical simplicity of relational database
cannot support complex data types or programming language control structures. The OO data model provides a
natural way to map real-world objects and their relationships directly to computer representations, meeting the
data modeling requirements of applications such as computer aided design (CAD), computer aided
manufacturing (CAM), computer aided software engineering (CASE), hypermedia and expert systems.

M.P. Papazoglou and L. Marinos[4] refute the position that the relational model is the model best suited for
supporting distributed database applications. Concentrating on distributed heterogeneous information
systems, they point out that the relational model does not adequately support the complex structures required
and has limited semantic expression capabilities. The data model must "facilitate the communication between
the users of diverse and incompatible information systems and assist ... with the uniform representation and
integration of heterogeneous data from one site to another." [4] Defined in terms of an object-oriented data
model which encapsulates the behavioral properties of the database objects, a distributed object-oriented
database management system (doodms) maps the data and processing components of the entire system into a
unique systemwide object space.

The doodms, as envisaged by Papazoglou and Marinos, consists of a layered umbrella over each autonomous
DBMS. The umbrella is comprised of (1) a system language component, providing a system-wide query
language in addition to each site's own query language, (2) metadata data modules, mapping local conceptual
schemas into the distributed conceptual schema, and (3) the global transaction module which provides for
distributed query decomposition and execution, concurrency control, and recovery.

There are no implementations, commercial or research, of a doodms, but its flexible, modular approach and its
conformance to modern software engineering principles indicate that it will be forthcoming.

2.6.4 Distributed Knowledge Bases

Knowledge bases are relational databases extended with logic - the capability of deducing new information
from existing information. Most of the technology required to implement distributed databases can be used to
implement distributed knowledge bases; the consistency of the knowledge base and its query processing
capabilities (especially recursive querying) being two of the major issues.

Current trends toward the development of knowledge bases, removing the "intelligence" of artificial
intelligence and expert systems from applications and placing it where it can be shared by many applications, are
spurring database researchers to expand and extend their efforts in distributed database technology to meet the
growing needs of knowledge bases.

3. SUMMARY

The data processing requirements of today's decentralized corporations together with advances in both
database and network technologies have led to the emergence of distributed database technologies. Although no
standards yet exist within this new technology, some guidelines have been provided by C.J. Date and E.F. Codd,
both developers of relational database technologies.

A distributed database is a collection of multiple, logically interrelated databases distributed over a computer
network; a distributed database management system is a software system that permits the management of
distributed data making the distribution transparent to the user. Distributed databases are more reliable and
more responsive than centrally located and controlled databases; data can be entered where it is generated, data
at different sites can be shared, and data can be replicated giving users the option of accessing copies of the data
in the event of a site or network failure.

The fundamental issue of distributed databases is transparency, which, in a distributed database, refers to both
the data and the network. However, achieving transparency must not infringe on the autonomous nature of each
participating site.

Although several data models have been proposed for use with distributed databases, only the relational model
has been implemented in current commercial products. The relational model contains simple data structures,
provides a solid foundation for data consistency, and allows set-oriented manipulations of relations.

Distributed databases can be either homogeneous, where all participating local databases are based on the same
data model (and are from the same vendor), or heterogeneous, involving databases belonging to several vendors
probably not based on the same data model. Homogeneous distributed databases generally use a single, global
schema, while heterogeneous distributed databases may opt for either a single, integrated schema or a
federation of the local schemas.

To obtain the high degree of reliability and availability offered by distributed database technology, the relational
tables must be fragmented and/or replicated across multiple sites. Fragmentation involves partitioning a
relation either horizontally or vertically and allocating the partitioned relations to sites where the data is most
often required. This practice is most useful in situations where fragments contain data with common
geographical properties. Replication is a trade-off exercise between update costs and the benefits of increased
reliability and performance. Since many factors contribute to an optimal fragmentation/replication design, no
tools or algorithms have been developed to assist the designer in this task.
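
The two kinds of fragmentation can be illustrated on a small in-memory relation. The sketch assumes rows
represented as Python dictionaries; the function names, the sample employee relation, and the choice of "region"
as the allocation attribute are hypothetical.

```python
def horizontal_fragments(rows, site_of):
    """Split a relation by rows: each row is allocated to the site chosen by site_of."""
    fragments = {}
    for row in rows:
        fragments.setdefault(site_of(row), []).append(row)
    return fragments


def vertical_fragment(rows, key, columns):
    """Split a relation by columns; the key is kept so fragments can be rejoined."""
    return [{name: row[name] for name in (key, *columns)} for row in rows]


employees = [
    {"id": 1, "name": "Ada", "region": "east", "salary": 50},
    {"id": 2, "name": "Bob", "region": "west", "salary": 60},
    {"id": 3, "name": "Cy", "region": "east", "salary": 55},
]

# Horizontal: rows with a common geographical property are stored together.
by_region = horizontal_fragments(employees, lambda row: row["region"])

# Vertical: the payroll columns are separated from the rest of the relation.
payroll = vertical_fragment(employees, "id", ["salary"])
```

The original relation is recovered by taking the union of the horizontal fragments, or by joining the vertical
fragments on the key.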

Distributed query capability can be found in just about every distributed database product, with significant
differences occurring in the performance capabilities of the query manager. Distributed query processing
must deal with the analysis, optimization, and execution of queries referencing distributed data. The dynamic
nature of a DDBMS increases the complexity of optimizing, since each site must also carry on its own local
execution load, while the network is subjected to varying traffic patterns and bottlenecks. While past research
has led to the development of query optimizers for centralized databases, these optimizers are designed with the
goal of minimizing response time. Now they are being extended for distributed databases, with the objective of
optimizing both response time and communication cost.

Managing transactions in a distributed database environment requires dealing with concurrency control, system
reliability, and the efficiency of the system as a whole. Today's commercial DDBMS products use extensions of
one or both of the same techniques used in centralized databases - locks and timestamps. Differences in DBMS
products can be found in both the granularity of the locks and in the particular implementation of the
popular two-phase commit protocol. Most current implementations are unsatisfactory in situations involving
large numbers of sites and high transaction volumes, and most restrict distributed update to a single site within a
single transaction.

Despite the requirements for transparency and site autonomy, the lack of universally accepted standards and
differences in the implementation of the data dictionary in various versions of distributed database systems have
produced significant variations in the degree to which the requirements have been met. In order to select the
right DDBMS or to develop an optimum distributed design, the database system designer must understand the
relative merits of each feature and be able to make trade-offs to effectively match implemented features to the
specific data needs to be supported.

The development of distributed database technology is stimulating the development of new applications that
require support for distributed data. Advanced office automation systems, computer aided design systems, and
knowledge based systems are three that profit from the ability to share data across a network of computers.

BIBLIOGRAPHY

1. "A Benchmark for Performance Evaluation of a Distributed File System", Anna Hac, The Journal of
   Systems and Software, Vol. 9 No. 4, May 1989, pp. 273-285.

2. "A Homogeneous Relational Model and Query Languages for Temporal Databases", Shashi K. Gadia,
   ACM Transactions on Database Systems, Vol. 13 No. 4, December 1988, pp. 418-448.

3. "A Parallel Pipelined Relational Query Processor", Won Kim, Daniel Gajski and David J. Kuck, ACM
   Transactions on Database Systems, Vol. 9 No. 2, June 1984, pp. 214-242.

4. "An Object-Oriented Approach to Distributed Data Management", M.P. Papazoglou and L. Marinos,
   The Journal of Systems and Software, Vol. 11 No. 2, February 1990, pp. 95-109.

5. "Application Design for Distributed DB2", Rob Goldring, Database Programming & Design, Vol. 33
   No. 9, September 1990, pp. 31-36.

6. "Architecture of Distributed Data Base Systems", Sudha Ram and Clark L. Chastain, The Journal of
   Systems and Software, Vol. 10 No. 2, September 1989, pp. 77-95.

7. "Cracks in the ANSI Wall", Joe Celko, Database Programming & Design, Vol. 2 No. 6, June 1989, pp.
   66-69.

8. Data Management on Distributed Databases, Benjamin W. Wah, UMI Research Press, Ann Arbor,
   Michigan, 1981.

9. Database System Concepts, Henry F. Korth and Abraham Silberschatz, McGraw-Hill Book Company,
   New York, N.Y., 1986.

10. "DATAPLEX: An Access to Heterogeneous Distributed Databases", Chin-Wan Chung,
    Communications of the ACM, Vol. 33 No. 1, January 1990, pp. 70-80.

11. "DB2 v. 2.2: A Few More Bells & Whistles", Craig S. Mullins, Database Programming & Design, Vol. 3
    No. 6, June 1990, pp. 59-61.

12. Distributed Databases, Edited by I.W. Draffan and F. Poole, Cambridge University Press, London,
    England, 1980.

13. "Distributed Databases", Herb Edelstein, DBMS, Vol. 3 No. 10, September 1990, pp. 36-48.

14. "Distributed Database for SAA", R. Rzinsh, IBM Systems Journal, Vnl. 27, No. 3, 1988, pp. 362-369.
Distributed Database Management Systems, Oliil H. Bray, D.C. Heath aiid Compaily, Lexington,
Massacllusetts, 1982.

Distributed Databases Principles and Systems, Stefailo Ceri aird Giuseppe Pelagatti, McGraw Hill
Book Conipany, New York, N.Y. 1984.

"Distributed File Systenls: Co~iceptsand Examples", Eliezer L v y a~iclAbrallalil Silberscliatz, ACM


Computing Surveys, Vol. 22 No. 4, Dece~x~ber1990, pp. 321 -374.

"Divide a~lclConquer Your Database", Deiulis Livingston, Systems Integration, Vol. 24 No. 5, May 1991,
pp. 43 -45.

"Does Clieilt-Selver Equal Distributed Database?", Beth Gold-Benlstziu, Database Programming &
Design, Septe~xlber1990, pp.52-62

"Dynamic File Migration 111 Distributed Computer Systems", Bezalel Gavisll and Olivia R. Liu Sheng,
Communications of the ACM, Vol33. No.2, Februaly 1990, pp. 177- 189.

"Enerald Bay's Quiet Returii", La11 Bariles, DBMS, Vol. 3 No. 11, October 1990, pp. 50-57.

"Experiences wit11 the Alxoeba Distributed Operating Systea~",Andrew S. Tiliie~~bau~ll, Robbert V ~ I I


Renesse, Hails van Staveren, Gregory J. Sharp, Sape J. Mullencler, Jack Jailsell aiid Guido van Rossum,
Communications of the ACM, Vol. 33 No. 12, Deceil~ber1990, pp. 46-63.

"Heterogeneous Distributed Database Systellls for Procluctioli Use",Gomer Tliomas,et.al., ACM


Computing Surveys, Vol. 22 No. 3, Septe~~lber
1990, pp. 237-266.

"Heterogenous Processing: A 4GL Case Study", Micliael M. David, Database Programming & Design,
Vol. 4 No. 3, Marc11 1991, pp.27-34.

Intelligent Databases, Kalxva~lParsaye, Mark Cliiglell, Setrag Klioshafian, a i d Harry Woilg, Jolul
Wiley & Sons, Inc, New York, N.Y. 1989.

"Maintauling Availability i11 Partitioned Replicated Databases", Aix El Abbadi ailcl Sail1 Toueg, ACM
Tkansactions on Database Systems, Vol. 14 No. 2, Julie 1989, pp. 265-290.

"Multiple-Quay Optunization", Tllilos K. Sellis, ACM Ikansactions on Database Systems, Vol. 13 No.
I, Marc11 1988, pp. 23-52.

"Object-Oriented Databases: Design and Iiiiplaneiitatioil", JO~III


V Josepll, ct al, Proceedings of the
IEEE, Jailuary 1991, pp. 42- 64.

"On tlie Foundations of tlie Universal Relatioil Model", David Maier, Jeffrey D. Ull~liailand Mosllc Y.
Vardi, ACM Ikansactions on Database Systems, Vol. 9 No. 2, Julie 1984, pp. 283-308.
30. of Long-Living Transactions", I<. Brahnadathan and K. V S. Ramarao, The
" 0 1 1 the Maliagel~~e~lt

Journal of Systems and Software, Vol. 11 No. I, Jailuaiy 1990, pp. 45-52.

Principles of Distributed Database Systems, M. Ta~nerOzsu and Patrick Valduiiez, Prelitice Hall, Inc,
Ellglewood Cliffs, New Jersey 1991.

"Relational Database Design Using an Object-Oriented Methodology", Micllael R. Blalla, Willialli J.


Prel~lerlallialld Jaliies E. Run~baugh,Communications of the ACM, Vol. 31 No. 4, April 1988, pp.
414-427.

Relational Database Technology, Suacl Alagic, Springer-Verlag, New York, N.Y., 1986.

"SQL2 AII Eniergulg Standard", Ji111 Melton, Database Programming & Design, Vol. 3 No. 11,
Novel~~ber
1990, pp. 25- 32.

"Strategic Database Planning", G. Lawrellce Sanders, Database Programming & Design, Vol. 3 No. 11,
Novelliber 1990, pp. 52- 56.

"Tile Case for Object-Orienteci Databases", Tllol~lasAtwoord, IEEE Spectrum, February 1991, pp.
44 -47.

"The Raid Distributed Database Syste~n",Bllarat Bl~argavaand Jollll Riedl, IEEE lkansactions on
Software Engineering, Vol. 15 No. 6, June 1989, pp. 726-736.

The Relational Model for Database ManagementNersion 2, E.E Cocld, Acidison-Wesley Publisl~blg
Compally, Reading, Massacllusetts, 1990.

"Tile Trouble wit11Two-Phase Conlmit", George D. Tillniann, Database Programming & Design, Vol. 3
NO. 9, Septeliiber 1990, pp. 64-70.

"Tra~aactionProcessillg Monitors", Philip A. Ber~ntein,Communications of the ACM, Vol. 33 No. 11,


Nove~llber1990, pp. 75- 86.

Tutorial: Distributed Database Management, Philip A Berlntein, Jalues B. Rothoie, David W. Shipnlan,
IEEE Publisllulg Services, New York, N.Y., 1978.

"Update a11d Retrieval 111 a Relatio~~al Database Tlirougl~a Universal Sclielila Interface", Volkert
Brosda ailcl Gottfr-ied Vossen, ACM lkansactions on Database Systems, Vol. 13 No. 4, Deceli~ber1988,
pp. 449 - 485.

"Wily Clloose Distributed Database?", Micliael Krasowski, Database Programming & Design, Vol. 4
No. 3, Marc11 1991, pp. 46-53.
APPENDIX A: GLOSSARY

atomicity Either all or none of a transaction's operations are performed.

bridge Network hardware that serves to restrict packets to a local segment of a network.

broadcast network A network in which all sites receive all the messages sent by another site; a mechanism
(typically a prefix containing an identification of the destination site) allows each site to recognize those
messages directed to it.

catalog A repository of information about a database including, in distributed databases, the description of the
fragmentation and allocation of data and the mapping to local names.

concurrency control Ensures transaction atomicity in the presence of concurrent execution of transactions.

data distribution Refers to the partitioning, fragmentation, replication, and allocation of data among the sites
participating in a distributed database.

data dictionary The major database module that contains database metadata; it includes, at a minimum, schema
and mapping definitions.

data manipulation language A language that enables users to access or manipulate data organized by a data
model. A procedural language requires the user to specify what data is needed and how to get it; a nonprocedural
language requires the user to specify what data is needed without specifying how to get it.

data mapping Data types or data values are converted for conformity with each other.

data model A collection of conceptual tools for describing data, data relationships, data semantics, and
consistency constraints.

deadlock A circular waiting situation which arises when two or more transactions obtain exclusive locks on one
or more data resources and are waiting for a resource held by another waiting transaction.

deadlock avoidance Methods that either employ concurrency control techniques that never result in deadlocks
or require schedulers to detect potential deadlock situations in advance and ensure that they will not occur.

deadlock detection The detection of a state of deadlock and the preemption and abortion of one (or more)
transaction(s) until processing may continue.
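
As an illustration only: one common way to detect the circular wait described above is to search a wait-for graph for a cycle. The sketch below assumes the scheduler can supply such a graph; the names find_deadlock and wait_for are hypothetical and do not come from any product discussed in this review.

```python
def find_deadlock(wait_for):
    """Return the transactions forming a cycle in the wait-for graph, or None.

    wait_for maps each transaction to the set of transactions it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for other in wait_for.get(txn, ()):
            state = color.get(other, WHITE)
            if state == GREY:
                # 'other' is already on the current path: a circular wait.
                return path[path.index(other):]
            if state == WHITE:
                cycle = visit(other)
                if cycle:
                    return cycle
        color[txn] = BLACK
        path.pop()
        return None

    for txn in list(wait_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None
```

A scheduler built along these lines would pick one transaction from the returned cycle as the victim to abort; how the victim is chosen is a separate policy decision not shown here.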

deadlock prevention Method that guarantees deadlock cannot occur; all data items required for a transaction
are predeclared and must be accessible before the transaction is initiated.

distributed database A database system that provides access to data located at multiple sites in a network.

distributed processing Based on a collection of programs that are distributed among sites in a network, permits a
program at any site to invoke a program at another site in the network as if it were a locally resident subprogram.

distributed query processor A distributed database system module that, given a query, determines an execution
strategy that minimizes a system cost function which includes I/O, CPU, and communication costs.

dump An image of a previous state of a database, usually stored on offline storage.

durability Once a transaction is committed, the results of its operations will never be lost, independent of
subsequent failures.

Ethernet An example of a packet-switched network in which packets may vary in size from 64 bytes to 1,518
bytes and which operates at 10 megabits per second.

fragment groups Collections that include the primary fragment and those fragments resulting from derived
fragmentation.

fragmentation Dividing a global relation into subrelations and allocating the subrelations to sites participating
in the global database.

fragmentation, horizontal Data items in different local databases may be identified as logically belonging to the
same table in the global database; a relation partitioned along its rows.

fragmentation, vertical Data items in different local databases may be identified as logically representing the
same row in the global database but containing different attributes for the row; a relation partitioned into
smaller relations.
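
The two kinds of fragmentation can be sketched on a small in-memory relation. Everything in this example is hypothetical: the EMPLOYEE relation, its attribute names, and the helper functions are invented for illustration. It also assumes that every vertical fragment repeats the key attribute so that the global rows can be rebuilt by a join, which the definitions above do not state explicitly.

```python
# A hypothetical global relation EMPLOYEE(emp_id, name, site, salary).
employees = [
    {"emp_id": 1, "name": "Ada", "site": "Utica", "salary": 50000},
    {"emp_id": 2, "name": "Bob", "site": "Rome", "salary": 42000},
    {"emp_id": 3, "name": "Cy", "site": "Utica", "salary": 61000},
]

def horizontal_fragment(relation, predicate):
    """Keep whole rows that satisfy the predicate (partition along rows)."""
    return [dict(row) for row in relation if predicate(row)]

def vertical_fragment(relation, key, attributes):
    """Keep the key plus the chosen attributes of every row."""
    return [{name: row[name] for name in (key, *attributes)} for row in relation]

def join_on_key(left, right, key):
    """Rebuild rows from two vertical fragments that share the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

# Horizontal fragments: one per site.
utica = horizontal_fragment(employees, lambda r: r["site"] == "Utica")
rome = horizontal_fragment(employees, lambda r: r["site"] == "Rome")

# Vertical fragments: personal data and payroll data, both carrying emp_id.
personal = vertical_fragment(employees, "emp_id", ["name", "site"])
payroll = vertical_fragment(employees, "emp_id", ["salary"])

# The global relation is recovered by union (horizontal) or join (vertical).
by_union = sorted(utica + rome, key=lambda r: r["emp_id"])
by_join = join_on_key(personal, payroll, "emp_id")
```

In this sketch both by_union and by_join contain the same rows as the original employees list, which is the property a fragmentation scheme needs if the global relation is to remain recoverable.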

gateway Network software that permits the movement of information between networks of differing
communications protocols.

homogeneous database Refers to a distributed database in which each physical component runs under either the
same database management system or, at least, the same data model.

heterogeneous database Refers to a distributed database in which not all physical components run under the
same database management system; some literature refers to a distributed database as being heterogeneous if
the local nodes have different types of computers and operating systems, even if all local databases are based on
the same data model and even the same database management system.

ISO International Standards Organization.

isolation An incomplete transaction cannot reveal its results to other transactions before its commitment.

local autonomy Refers to the amount of control exercised by local database administrators within a distributed
database environment; local administrators with total control over that part of a distributed database at their
sites are said to be autonomous.

local recovery manager Module of a distributed database management system (one exists at a local site)
responsible for implementing local procedures by which the local database can be recovered to a consistent state
following a failure.

long-lived transaction Data that lives after the processes that created them terminate.

object-based logical model A data model used in describing data at the conceptual and view levels. The
conceptual level describes what data is stored in the database and what relationships exist among the data, and
the view level restricts the conceptual level to part of the database. Object-based logical models allow the
explicit specification of data constraints.

packet-switched network A network in which messages are broken up into packets and each packet is
transmitted individually. The packets may travel independently and may take different routes.

point-to-point network A network in which sites are connected by communications channels, typically
telephone lines. A leased circuit utilizes point-to-point technology along leased communications lines.

record-based logical model A data model used in describing data at the conceptual and view levels, where the
conceptual level describes what data is stored in the database and what relationships exist among the data and
the view level restricts the conceptual level to part of the database. Record-based logical models can be used to
specify the overall logical structure of the database, but do not provide for specifying data constraints.

replication Data items in different local databases may be identified as copies of each other.

router Network hardware that picks the optimal route to send traffic over a network.

scheduler Module of a distributed database management system responsible for the implementation of a
specific concurrency control algorithm for synchronizing access to the database.

schema Describes the database as it is stored; describes the physical format, storage locations, and access paths
and defines the logical structure.

schema, global Describes all the data in a distributed database.

schema, local Describes the data at the local sites in a distributed database.

serializability If several transactions are executed concurrently, the result must be the same as if they were
executed serially in some order.
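
A hedged sketch of one standard sufficient test, conflict serializability: build a precedence graph from conflicting operations and check that it has no cycle. This is stricter than the definition above (some schedules fail the test yet still produce a serial result), and the function name and the (transaction, action, item) tuple format are assumptions made for this example.

```python
def is_conflict_serializable(schedule):
    """schedule is a list of (transaction, action, item); action is 'r' or 'w'."""
    # Edge T1 -> T2 when an operation of T1 precedes a conflicting operation
    # of T2 (same item, different transactions, at least one write).
    edges = {}
    for i, (t1, a1, x1) in enumerate(schedule):
        edges.setdefault(t1, set())
        for t2, a2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (a1, a2):
                edges[t1].add(t2)

    # Kahn's algorithm: the graph is acyclic exactly when every node can be
    # removed in topological order.
    indegree = {t: 0 for t in edges}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    ready = [t for t, degree in indegree.items() if degree == 0]
    removed = 0
    while ready:
        t = ready.pop()
        removed += 1
        for nxt in edges[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed == len(edges)
```

When the function returns True, any topological order of the precedence graph is an equivalent serial order of the transactions.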

timestamp Uniquely identifies a transaction; for two transactions A and B, if A occurred before B then the
timestamp of A is less than the timestamp of B.

transaction A sequence of operations which either are performed in entirety or are not performed at all; an
atomic unit of execution.

transaction manager Module of a distributed database management system responsible for coordinating the
execution of the database operations on behalf of an application.

transparency Also called data independence, refers to the independence of application programs from the
physical or logical organization of the data.

transparency, distribution Refers to the independence of application programs from the physical location of the
data in a distributed database.

transparency, fragmentation Refers to the lack of awareness by application programs of the existence of
fragmented relations in a distributed database.

transparency, replication Refers to the lack of awareness by application programs of the existence of replicated
data in a distributed database.

two-phase commit A protocol that requires for each transaction a first phase during which an abort/commit
decision is made by each participant and a second phase during which the decision is implemented.
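
The two phases can be sketched in a few lines. This is a simplified model rather than any vendor's implementation: it assumes a single coordinator, reliable messaging, and no site failures, and it omits the logging and timeout handling a real protocol needs. The Participant class and two_phase_commit function are hypothetical names.

```python
class Participant:
    """A site taking part in a distributed transaction (illustrative only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: the participant makes its own abort/commit decision (vote).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return the global decision, 'commit' or 'abort'."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"

    # Phase 2: every participant implements the same global decision.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single negative vote in the first phase forces every participant to abort in the second, which is how the protocol keeps the transaction atomic across sites.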

two-phase locking A locking protocol that requires for each transaction a first phase during which new locks are
acquired and a second phase during which locks are only released.
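
A minimal sketch of the two-phase rule for a single transaction follows. It assumes nothing about how conflicting lock requests between transactions are resolved, which a real lock manager must also handle; the class and exception names are hypothetical.

```python
class TwoPhaseLockingError(Exception):
    """Raised when a transaction asks for a lock after it has released one."""


class Transaction:
    def __init__(self, name):
        self.name = name
        self.held = set()        # items currently locked by this transaction
        self.shrinking = False   # becomes True at the first release

    def lock(self, item):
        # First (growing) phase only: no new locks once any lock is released.
        if self.shrinking:
            raise TwoPhaseLockingError(
                f"{self.name} cannot lock {item!r} after releasing a lock"
            )
        self.held.add(item)

    def unlock(self, item):
        # The first release moves the transaction into the second phase.
        self.held.remove(item)
        self.shrinking = True
```

For example, a transaction that locks x and y and then unlocks x would raise TwoPhaseLockingError on a later attempt to lock z.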
APPENDIX B: VENDORS IN DISTRIBUTED DATABASE TECHNOLOGY

ASK Computer Systems Inc., Ingres Products Division, 1080 Marina Village Parkway, Alameda,
California 94501, (415) 769-1400

Computer Associates International Inc., 711 Stewart Avenue, Garden City, New York, 11530,
(516) 227-3300

Digital Equipment Corporation, Database Systems Group, 55 Northeastern Blvd., Nashua, New
Hampshire 03062, (603) 884-2423

Gupta Technologies Inc., 1040 Marsh Road, Menlo Park, California 94025, (415) 321-9500

IBM Corporation, Old Orchard Road, Armonk, New York 10504, (914) 765-1900

Informix Software Inc., 4100 Bohannon Drive, Menlo Park, California 94025, (415) 926-6300

Interbase Software Corporation, 209 Burlington Road, Bedford, Massachusetts 01730, (800) 245-7367

Microsoft Corporation, 1 Microsoft Way, Redmond, Washington 98052, (206) 882-8080

Ontologic Inc., Billerica, Massachusetts

Oracle Corporation, 500 Oracle Parkway, Redwood Shores, California 94065, (415) 506-7000

PeerLogic Inc., 333 DeHaro Street, San Francisco, California, 94107, (415) 626-4545

Progress Software Corporation, 5 Oak Park, Bedford, Massachusetts, 01730, (617) 275-4500

Ratliff Software Production Inc., 2155 Verdugo Blvd., Suite 20, Montrose, California 91020,
(818) 546-3850

Revelation Technologies, 2 Park Avenue, New York, New York, 10016, (617) 275-4500

Saros Corporation, 10900 N.E. 8th Street, Bellevue, Washington 98004, (206) 646-1066

Sybase Inc., 6475 Christie Avenue, Emeryville, California 94608, (415) 596-3500

Tandem Computers Inc., 19333 Vallco Parkway, Cupertino, California 95014, (408) 965-7542

WordTech Systems Inc., 21 Altarinda Road, Orinda, California 94563, (415) 254-0900

XDB Systems, 14700 Sweitzer Lane, Laurel, Maryland 20707, (301) 317-6800
