Data Cleaning: Problems and Current Approaches
Abstract
We classify data quality problems that are addressed by data cleaning and provide an overview of the main
solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and
should be addressed together with schema-related data transformations. In data warehouses, data cleaning is
a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
1 Introduction
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and
inconsistencies from data in order to improve the quality of data. Data quality problems are present in single
data collections, such as files and databases, e.g., due to misspellings during data entry, missing information
or other invalid data. When multiple data sources need to be integrated, e.g., in data warehouses, federated
database systems or global web-based information systems, the need for data cleaning increases
significantly. This is because the sources often contain redundant data in different representations. In order to
provide access to accurate and consistent data, consolidation of different data representations and elimination
of duplicate information become necessary.
Figure 1. Steps of building a data warehouse: the ETL process
[Figure not reproduced. Components: operational sources, data staging area, data warehouse. Steps: schema
extraction and translation; schema matching and integration; schema implementation; instance extraction and
transformation; instance matching and integration; filtering, aggregation. Supporting services: scheduling,
logging, monitoring, recovery, backup. Metadata: instance characteristics (real metadata); translation rules;
mappings between source and target schema; filtering and aggregation rules. Legend: metadata flow, data flow.]
Data warehouses [6][16] require and provide extensive support for data cleaning. They load and
continuously refresh huge amounts of data from a variety of sources so the probability that some of the
sources contain dirty data is high. Furthermore, data warehouses are used for decision making, so that the
correctness of their data is vital to avoid wrong conclusions. For instance, duplicated or missing information
will produce incorrect or misleading statistics (garbage in, garbage out). Due to the wide range of possible
data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest problems
in data warehousing. During the so-called ETL process (extraction, transformation, loading), illustrated in
Fig. 1, further data transformations deal with schema/data translation and integration, and with filtering and
aggregating data to be stored in the warehouse. As indicated in Fig. 1, all data cleaning is typically
performed in a separate data staging area before loading the transformed data into the warehouse. A large
number of tools of varying functionality is available to support these tasks, but often a significant portion of
the cleaning and transformation work has to be done manually or by low-level programs that are difficult to
write and maintain.
(Author note: This work was performed while on leave at Microsoft Research, Redmond, WA.)
Federated database systems and web-based information systems face data transformation steps similar to
those of data warehouses. In particular, there is typically a wrapper per data source for extraction and a
mediator for integration [32][31]. So far, these systems provide only limited support for data cleaning,
focusing instead on data transformations for schema translation and schema integration. Data is not
preintegrated as for data warehouses but needs to be extracted from multiple sources, transformed and
combined during query runtime. The corresponding communication and processing delays can be significant,
making it difficult to achieve acceptable response times. The effort needed for data cleaning during
extraction and integration will further increase response times but is mandatory to achieve useful query
results.
A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all
major errors and inconsistencies both in individual data sources and when integrating multiple sources. The
approach should be supported by tools to limit manual inspection and programming effort and be extensible
to easily cover additional sources. Furthermore, data cleaning should not be performed in isolation but
together with schema-related data transformations based on comprehensive metadata. Mapping functions for
data cleaning and other data transformations should be specified in a declarative way and be reusable for
other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure
should be supported to execute all data transformation steps for multiple sources and large data sets in a
reliable and efficient way.
While a huge body of research deals with schema translation and schema integration, data cleaning has
received only little attention in the research community. A number of authors focused on the problem of
duplicate identification and elimination, e.g., [11][12][15][19][22][23]. Some research groups concentrate on
general problems that are relevant to, but not limited to, data cleaning, such as special data mining
approaches [30][29], and data transformations based on schema matching [1][21]. More recently, several
research efforts propose and investigate a more comprehensive and uniform treatment of data cleaning
covering several transformation phases, specific operators and their implementation [11][19][25].
In this paper we provide an overview of the problems to be addressed by data cleaning and their solution. In
the next section we present a classification of the problems. Section 3 discusses the main cleaning
approaches used in available tools and the research literature. Section 4 gives an overview of commercial
tools for data cleaning, including ETL tools. Section 5 is the conclusion.
Figure 2. Classification of data quality problems in data sources
[Figure not reproduced. The classification distinguishes single-source and multi-source problems, each at
schema level and instance level:
Single-source, schema level (lack of integrity constraints, poor schema design): uniqueness, referential integrity
Single-source, instance level (data entry errors): misspellings, redundancy/duplicates, contradictory values
Multi-source, schema level (heterogeneous data models and schema designs): naming conflicts, structural conflicts
Multi-source, instance level (overlapping, contradicting and inconsistent data): inconsistent aggregating, inconsistent timing]
Dirty Data | Reasons/Remarks
bdate=30.13.70 | values outside of domain range
age=22, bdate=12.02.70 | age = (current date - birth date) should hold
emp1=(name=John Smith, SSN=123456); emp2=(name=Peter Miller, SSN=123456) | uniqueness for SSN (social security number) violated
emp=(name=John Smith, deptno=127) | referenced department (127) not defined
Table 1. Examples for single-source problems at schema level (violated integrity constraints)
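The violations of Table 1 can be detected mechanically once the constraints are written down. The following Python fragment is a minimal sketch under our own assumptions: the record layout (dictionaries with the keys bdate, age, ssn and deptno), the DD.MM.YY date format and the function names are hypothetical and only serve illustration.

```python
from datetime import date, datetime


def parse_bdate(text):
    """Parse a DD.MM.YY string; return None if it lies outside the date domain."""
    try:
        return datetime.strptime(text, "%d.%m.%y").date()
    except ValueError:
        return None


def age_on(bdate, today):
    """Age in full years on the given reference date."""
    before_birthday = (today.month, today.day) < (bdate.month, bdate.day)
    return today.year - bdate.year - before_birthday


def check_employees(employees, departments, today):
    """Return (record index, problem) pairs for the constraint types of Table 1."""
    problems = []
    first_seen_ssn = {}
    for i, emp in enumerate(employees):
        # Attribute scope: value must lie in the domain of valid dates.
        bdate = parse_bdate(emp["bdate"])
        if bdate is None:
            problems.append((i, "illegal bdate"))
        # Record scope: age and bdate must agree.
        elif emp.get("age") is not None and emp["age"] != age_on(bdate, today):
            problems.append((i, "age/bdate mismatch"))
        # Record type scope: SSN must be unique.
        if emp["ssn"] in first_seen_ssn:
            problems.append((i, "duplicate SSN"))
        else:
            first_seen_ssn[emp["ssn"]] = i
        # Source scope: referenced department must be defined.
        if emp["deptno"] not in departments:
            problems.append((i, "undefined department"))
    return problems
```

Such checks only find violations of constraints that are known; instance-level problems such as those in Table 2 generally need additional analysis.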
For both schema- and instance-level problems we can differentiate different problem scopes: attribute (field),
record, record type and source; examples for the various cases are shown in Tables 1 and 2. Note that
uniqueness constraints specified at the schema level do not prevent duplicated instances, e.g., if information
on the same real-world entity is entered twice with different attribute values (see example in Table 2).
Scope | Problem | Dirty Data | Reasons/Remarks
Attribute | Missing values | phone=9999999999 | unavailable values during data entry (dummy values or null)
Attribute | Misspellings | city=Liipzig | usually typos, phonetic errors
Attribute | Cryptic values, abbreviations | experience=B; occupation=DB Prog. |
Attribute | Embedded values | name=J. Smith 12.02.70 New York | multiple values entered in one attribute (e.g. in a free-form field)
Attribute | Misfielded values | city=Germany |
Record | Violated attribute dependencies | city=Redmond, zip=77777 | city and zip code should correspond
Record type | Word transpositions | name1=J. Smith, name2=Miller P. | usually in a free-form field
Record type | Duplicated records | emp1=(name=John Smith, ...); emp2=(name=J. Smith, ...) | same employee represented twice due to some data entry errors
Record type | Contradicting records | emp1=(name=John Smith, bdate=12.02.70); emp2=(name=John Smith, bdate=12.12.70) | the same real-world entity is described by different values
Source | Wrong references | emp=(name=John Smith, deptno=17) | referenced department (17) is defined but wrong
Table 2. Examples for single-source problems at instance level
Given that cleaning data sources is an expensive process, preventing dirty data from being entered is
obviously an important step to reduce the cleaning problem. This requires an appropriate design of the
database schema and integrity constraints as well as of data entry applications. Also, the discovery of data
cleaning rules during warehouse design can suggest improvements to the constraints enforced by existing
schemas.
Customer (source 1)
CID | Name | Street | City | Sex
11 | Kristen Smith | 2 Hurley Pl | South Fork, MN 48503 | 0
24 | Christian Smith | Hurley St 2 | S Fork MN | 1

Client (source 2)
Cno | LastName | FirstName | Gender | Address | Phone/Fax
24 | Smith | Christoph | M | 23 Harley St, Chicago IL, 60633-2394 | 333-222-6542 / 333-222-6599
493 | Smith | Kris L. | F | 2 Hurley Place, South Fork MN, 48503-5998 | 444-555-6666

Customers (integrated target with cleaned data)
No | LName | FName | Gender | Street | City | State | ZIP | Phone | Fax | CID | Cno
1 | Smith | Kristen L. | F | 2 Hurley Place | South Fork | MN | 48503-5998 | 444-555-6666 | | 11 | 493
2 | Smith | Christian | M | 2 Hurley Place | South Fork | MN | 48503-5998 | | | 24 |
3 | Smith | Christoph | M | 23 Harley Street | Chicago | IL | 60633-2394 | 333-222-6542 | 333-222-6599 | | 24

Figure 3. Examples of multi-source problems at schema and instance level
The two sources in the example of Fig. 3 are both in relational format but exhibit schema and data conflicts.
At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno, Sex/Gender) and
structural conflicts (different representations for names and addresses). At the instance level, we note that
there are different gender representations (0/1 vs. F/M) and presumably a duplicate record (Kristen
Smith). The latter observation also reveals that while Cid/Cno are both source-specific identifiers, their
contents are not comparable between the sources; different numbers (11/493) may refer to the same person
while different persons can have the same number (24). Solving these problems requires both schema
integration and data cleaning; the third table shows a possible solution. Note that the schema conflicts should
be resolved first to allow data cleaning, in particular detection of duplicates based on a uniform
representation of names and addresses, and matching of the Gender/Sex values.
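To illustrate why the schema conflicts are resolved first, the following Python fragment sketches how records of both sources could be mapped to a uniform representation before any duplicate detection. It is only an illustration: the function names are hypothetical, the name splitting is deliberately naive, and the mapping of the Sex codes 0/1 to F/M is our assumption.

```python
# Assumed value mapping between the two gender representations (0/1 vs. F/M).
SEX_TO_GENDER = {"0": "F", "1": "M"}


def from_customer(rec):
    """Map a record of source 1 (Customer) to the uniform target attributes."""
    # Naive split: everything before the last blank is taken as first name.
    first, _, last = rec["Name"].rpartition(" ")
    return {
        "LName": last,
        "FName": first,
        "Gender": SEX_TO_GENDER.get(rec["Sex"]),
        "CID": rec["CID"],  # source-specific identifier kept for lineage
        "Cno": None,
    }


def from_client(rec):
    """Map a record of source 2 (Client) to the uniform target attributes."""
    return {
        "LName": rec["LastName"],
        "FName": rec["FirstName"],
        "Gender": rec["Gender"],
        "CID": None,
        "Cno": rec["Cno"],  # source-specific identifier kept for lineage
    }
```

After this step both sources use the same attribute names and gender codes, so that matching rules can compare LName, FName and Gender directly; the identifiers CID and Cno are kept separate because their values are not comparable between the sources.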
Data analysis: In order to detect which kinds of errors and inconsistencies are to be removed, a detailed
data analysis is required. In addition to a manual inspection of the data or data samples, analysis
programs should be used to gain metadata about the data properties and detect data quality problems.
Definition of transformation workflow and mapping rules: Depending on the number of data sources,
their degree of heterogeneity and the dirtiness of the data, a large number of data transformation and
cleaning steps may have to be executed. Sometimes, a schema translation is used to map sources to a
common data model; for data warehouses, typically a relational representation is used. Early data
cleaning steps can correct single-source instance problems and prepare the data for integration. Later
steps deal with schema/data integration and cleaning multi-source instance problems, e.g., duplicates.
For data warehousing, the control and data flow for these transformation and cleaning steps should be
specified within a workflow that defines the ETL process (Fig. 1).
The schema-related data transformations as well as the cleaning steps should be specified by a
declarative query and mapping language as far as possible, to enable automatic generation of the
transformation code. In addition, it should be possible to invoke user-written cleaning code and special-
purpose tools during a data transformation workflow. The transformation steps may request user
feedback on data instances for which they have no built-in cleaning logic.
Verification: The correctness and effectiveness of a transformation workflow and the transformation
definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve the
definitions if necessary. Multiple iterations of the analysis, design and verification steps may be needed,
e.g., since some errors only become apparent after applying some transformations.
Transformation: Execution of the transformation steps either by running the ETL workflow for loading
and refreshing a data warehouse or while answering queries on multiple sources.
Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also replace
the dirty data in the original sources in order to give legacy applications the improved data too and to
avoid redoing the cleaning work for future data extractions. For data warehousing, the cleaned data is
available from the data staging area (Fig. 1).
The transformation process obviously requires a large amount of metadata, such as schemas, instance-level
data characteristics, transformation mappings, workflow definitions, etc. For consistency, flexibility and ease
of reuse, this metadata should be maintained in a DBMS-based repository [4]. To support data quality,
detailed information about the transformation process is to be recorded, both in the repository and in the
transformed instances, in particular information about the completeness and freshness of source data and
lineage information about the origin of transformed objects and the changes applied to them. For instance, in
Fig. 3, the derived table Customers contains the attributes CID and Cno, allowing one to trace back the
source records.
In the following we describe in more detail possible approaches for data analysis (conflict detection),
transformation definition and conflict resolution. For approaches to schema translation and schema
integration, we refer to the literature as these problems have extensively been studied and described
[2][24][26]. Name conflicts are typically resolved by renaming; structural conflicts require a partial
restructuring and merging of the input schemas.
Problems | Metadata | Examples/Heuristics
Illegal values | cardinality | e.g., cardinality (gender) > 2 indicates problem
Illegal values | max, min | max, min should not be outside of permissible range
Illegal values | variance, deviation | variance, deviation of statistical values should not be higher than threshold
Misspellings | attribute values | sorting on values often brings misspelled values next to correct values
Missing values | null values | percentage/number of null values
Missing values | attribute values + default values | presence of default value may indicate real value is missing
Varying value representations | attribute values | comparing attribute value set of a column of one table against that of a column of another table
Duplicates | cardinality + uniqueness | attribute cardinality = # rows should hold
Duplicates | attribute values | sorting values by number of occurrences; more than 1 occurrence indicates duplicates
Table 3. Examples for the use of reengineered metadata to address data quality problems
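Most of the metadata listed in Table 3 can be computed in a single scan of a column. The following Python fragment is a minimal sketch of such a profiling step; the function name, the returned keys and the convention of representing null values by None are our assumptions, and the column values are assumed to be mutually comparable.

```python
from collections import Counter


def profile_column(values, default=None):
    """Compute simple reengineered metadata for one attribute (column)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        # Occurrences of a known default value may hide missing real values.
        "defaults": counts.get(default, 0) if default is not None else 0,
        "cardinality": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        # Values occurring more than once; relevant for attributes expected to be unique.
        "repeated": sorted(v for v, n in counts.items() if n > 1),
    }
```

The heuristics of Table 3 then become simple comparisons on the result, e.g., a cardinality above 2 for a gender attribute, or a cardinality that differs from the number of rows for an attribute expected to be unique.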
Data mining helps discover specific data patterns in large data sets, e.g., relationships holding between
several attributes. This is the focus of so-called descriptive data mining models including clustering,
summarization, association discovery and sequence discovery [10]. As shown in [28], integrity constraints
among attributes such as functional dependencies or application-specific business rules can be derived,
which can be used to complete missing values, correct illegal values and identify duplicate records across
data sources. For example, an association rule with high confidence can hint at data quality problems in
instances violating this rule. So a confidence of 99% for rule total = quantity * unit price indicates that 1% of
the records do not comply and may require closer examination.
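The use of such a rule for detecting suspicious records can be sketched as follows. The Python fragment is only an illustration under our own assumptions: the rule is given in advance rather than mined, confidence is simplified to the share of records satisfying the rule, and the attribute names total, quantity and unit_price are hypothetical.

```python
def rule_confidence(records, rule):
    """Return the share of records satisfying the rule and the violating records."""
    violations = [r for r in records if not rule(r)]
    if not records:
        return 1.0, violations
    return 1 - len(violations) / len(records), violations


def total_rule(record, tolerance=1e-9):
    """Business rule: total = quantity * unit price (up to rounding tolerance)."""
    expected = record["quantity"] * record["unit_price"]
    return abs(record["total"] - expected) <= tolerance
```

The violating records returned alongside the confidence are the candidates for closer examination; whether the rule or the data is wrong still has to be decided case by case.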
Figure 4. Example of transformation step definition
Fig. 4 shows a transformation step specified in SQL:99. The example refers to Fig. 3 and covers part of the
necessary data transformations to be applied to the first source. The transformation defines a view on which
further mappings can be performed. The transformation performs a schema restructuring with additional
attributes in the view obtained by splitting the name and address attributes of the source. The required data
extractions are achieved by UDFs (shown in boldface). The UDF implementations can contain cleaning
logic, e.g., to remove misspellings in city names or provide missing zip codes.
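The kind of logic hidden in such extraction functions can be sketched outside of SQL. The following Python fragment is not the UDF code of the example but a hypothetical illustration; the function names, the regular expression and the assumed value layout (city, two-letter state code, optional zip code) are our assumptions.

```python
import re

# Assumed layout: "<city>[,] <two-letter state>[ <zip>]", e.g. "South Fork, MN 48503".
_CITY_RE = re.compile(
    r"^\s*(?P<city>.*?)[,\s]+(?P<state>[A-Z]{2})"
    r"(?:[,\s]+(?P<zip>\d{5}(?:-\d{4})?))?\s*$"
)


def split_name(name):
    """Split a free-form name into (first name, last name) at the last blank."""
    first, _, last = name.strip().rpartition(" ")
    return first, last


def split_city(value):
    """Split a combined city attribute into city, state and zip code."""
    match = _CITY_RE.match(value)
    if match is None:
        # Layout not recognized: keep the value and leave the rest undefined.
        return {"city": value.strip(), "state": None, "zip": None}
    return match.groupdict()
```

Cleaning logic such as correcting misspelled city names or supplying a missing zip code would be added inside these functions, typically by consulting a reference table.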
UDFs may still imply a substantial implementation effort and do not support all necessary schema
transformations. In particular, simple and frequently needed functions such as attribute splitting or merging
are not generically supported but often need to be re-implemented in application-specific variations (see
specific extract functions in Fig. 4). More complex schema restructurings (e.g., folding and unfolding of
attributes) are not supported at all. To generically support schema-related transformations, language
extensions such as the SchemaSQL proposal are required [18]. Data cleaning at the instance level can also
benefit from special language extensions such as a Match operator supporting approximate joins (see
below). System support for such powerful operators can greatly simplify the programming effort for data
transformations and improve performance. Some current research efforts on data cleaning are investigating
the usefulness and implementation of such query language extensions [11][25].
depends on application characteristics. For instance, different attributes in a matching rule may contribute
different weight to the overall degree of similarity. For string components (e.g., customer name, company
name) exact matching and fuzzy approaches based on wildcards, character frequency, edit distance,
keyboard distance and phonetic similarity (soundex) are useful [11][15][19]. More complex string matching
approaches also considering abbreviations are presented in [23]. A general approach for matching both string
and text data is the use of common information retrieval metrics. WHIRL is a promising
representative of this category using the cosine distance in the vector-space model for determining the degree
of similarity between text elements [7].
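A weighted matching rule based on edit distance can be sketched as follows. The Python fragment is a minimal illustration, not the method of any cited system; the normalization of the edit distance by the longer string length, the case folding and the function names are our assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Similarity in [0, 1]; 1 means equal strings (ignoring case)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def record_similarity(rec1, rec2, weights):
    """Weighted average of the field similarities named in weights."""
    total_weight = sum(weights.values())
    weighted = sum(w * string_similarity(rec1[f], rec2[f]) for f, w in weights.items())
    return weighted / total_weight
```

For example, weights such as {"name": 2, "city": 1} would let the name contribute twice as much as the city to the overall degree of similarity; suitable weights and thresholds depend on the application.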
Determining matching instances with such an approach is typically a very expensive operation for large data
sets. Calculating the similarity value for any two records implies evaluation of the matching rule on the
cartesian product of the inputs. Furthermore sorting on the similarity value is needed to determine matching
records covering duplicate information. All records for which the similarity value exceeds a threshold can be
considered as matches, or as match candidates to be confirmed or rejected by the user. In [15] a multi-pass
approach is proposed for instance matching to reduce the overhead. It is based on matching records
independently on different attributes and combining the different match results. Assuming a single input file,
each match pass sorts the records on a specific attribute and only tests nearby records within a certain
window on whether they satisfy a predetermined matching rule. This reduces significantly the number of
match rule evaluations compared to the cartesian product approach. The total set of matches is obtained by
the union of the matching pairs of each pass and their transitive closure.
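The multi-pass idea can be sketched in a few lines. The Python fragment below is our simplified illustration and not the implementation of [15]; the function names are hypothetical, the window of size w compares each record with its w-1 successors in sort order, and the transitive closure is computed with a union-find structure.

```python
def sorted_neighborhood_pass(records, key, window, is_match):
    """One pass: sort on key, test only records within the sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_clusters(records, keys, window, is_match):
    """Union the matching pairs of all passes and form their transitive closure."""
    parent = list(range(len(records)))

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for key in keys:
        for i, j in sorted_neighborhood_pass(records, key, window, is_match):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    # Only clusters with more than one record describe potential duplicates.
    return sorted(c for c in clusters.values() if len(c) > 1)
```

With n records and window size w, each pass evaluates at most n * (w - 1) record pairs instead of the n * (n - 1) / 2 pairs of the cartesian product approach. Records that never fall into a common window can still end up in one cluster through the transitive closure.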
4 Tool support
A large variety of tools is available on the market to support data transformation and data cleaning tasks, in
particular for data warehousing.¹ Some tools concentrate on a specific domain, such as cleaning name and
address data, or a specific cleaning phase, such as data analysis or duplicate elimination. Due to their
restricted domain, specialized tools typically perform very well but must be complemented by other tools to
address the broad spectrum of transformation and cleaning problems. Other tools, e.g., ETL tools, provide
comprehensive transformation and workflow capabilities to cover a large part of the data transformation and
cleaning process. A general problem of ETL tools is their limited interoperability due to proprietary
application programming interfaces (API) and proprietary metadata formats making it difficult to combine
the functionality of several tools [8].
We first discuss tools for data analysis and data reengineering which process instance data to identify data
errors and inconsistencies, and to derive corresponding cleaning transformations. We then present
specialized cleaning tools and ETL tools, respectively.
¹ For comprehensive vendor and tool listings, see commercial websites, e.g., Data Warehouse Information Center
(www.dwinfocenter.org), Data Management Review (www.dmreview.com), Data Warehousing Institute (www.dwinstitute.com)
(e.g., merge, split). INTEGRITY identifies and consolidates records using a statistical matching technique.
Automated weighting factors are used to compute scores for ranking matches based on which the user can
select the real duplicates.
or wildcard matching and soundex. However, user-defined field matching functions as well as functions for
correlating field similarities can be programmed and added to the internal transformation library.
5 Conclusions
We provided a classification of data quality problems in data sources differentiating between single- and
multi-source and between schema- and instance-level problems. We further outlined the major steps for data
transformation and data cleaning and emphasized the need to cover schema- and instance-related data
transformations in an integrated way. Furthermore, we provided an overview of commercial data cleaning
tools. While the state-of-the-art in these tools is quite advanced, they do typically cover only part of the
problem and still require substantial manual effort or self-programming. Furthermore, their interoperability is
limited (proprietary APIs and metadata representations).
So far only a little research has appeared on data cleaning, although the large number of tools indicates both
the importance and difficulty of the cleaning problem. We see several topics deserving further research. First
of all, more work is needed on the design and implementation of the best language approach for supporting
both schema and data transformations. For instance, operators such as Match, Merge or Mapping
Composition have either been studied at the instance (data) or schema (metadata) level but may be built on
similar implementation techniques. Data cleaning is not only needed for data warehousing but also for query
processing on heterogeneous data sources, e.g., in web-based information systems. This environment poses
much more restrictive performance constraints for data cleaning that need to be considered in the design of
suitable approaches. Furthermore, data cleaning for semi-structured data, e.g., based on XML, is likely to be
of great importance given the reduced structural constraints and the rapidly increasing amount of XML data.
Acknowledgments
We would like to thank Phil Bernstein, Helena Galhardas and Sunita Sarawagi for helpful comments.
References
[1] Abiteboul, S.; Clue, S.; Milo, T.; Mogilevsky, P.; Simeon, J.: Tools for Data Translation and Integration. In [26]:3-8, 1999.
[2] Batini, C.; Lenzerini, M.; Navathe, S.B.: A Comparative Analysis of Methodologies for Database Schema Integration. In Computing Surveys 18(4):323-364, 1986.
[3] Bernstein, P.A.; Bergstraesser, T.: Metadata Support for Data Transformation Using Microsoft Repository. In [26]:9-14, 1999.
[4] Bernstein, P.A.; Dayal, U.: An Overview of Repository Technology. Proc. 20th VLDB, 1994.
[5] Bouzeghoub, M.; Fabret, F.; Galhardas, H.; Pereira, J.; Simon, E.; Matulovic, M.: Data Warehouse Refreshment. In [16]:47-67.
[6] Chaudhuri, S.; Dayal, U.: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record 26(1), 1997.
[7] Cohen, W.: Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity. Proc. ACM SIGMOD Conf. on Data Management, 1998.
[8] Do, H.H.; Rahm, E.: On Metadata Interoperability in Data Warehouses. Techn. Report, Dept. of Computer Science, Univ. of Leipzig. http://dol.unileipzig.de/pub/200013.
[9] Doan, A.H.; Domingos, P.; Levy, A.Y.: Learning Source Description for Data Integration. Proc. 3rd Intl. Workshop The Web and Databases (WebDB), 2000.
[10] Fayyad, U.: Mining Database: Towards Algorithms for Knowledge Discovery. IEEE Techn. Bulletin Data Engineering 21(1), 1998.
[11] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E.: Declaratively cleaning your data using AJAX. In Journees Bases de Donnees, Oct. 2000. http://caravel.inria.fr/~galharda/BDA.ps.
[12] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E.: AJAX: An Extensible Data Cleaning Tool. Proc. ACM SIGMOD Conf., p. 590, 2000.
[13] Haas, L.M.; Miller, R.J.; Niswonger, B.; Tork Roth, M.; Schwarz, P.M.; Wimmers, E.L.: Transforming Heterogeneous Data with Database Middleware: Beyond Integration. In [26]:31-36, 1999.
[14] Hellerstein, J.M.; Stonebraker, M.; Caccia, R.: Independent, Open Enterprise Data Integration. In [26]:43-49, 1999.
[15] Hernandez, M.A.; Stolfo, S.J.: Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery 2(1):9-37, 1998.
[16] Jarke, M.; Lenzerini, M.; Vassiliou, Y.; Vassiliadis, P.: Fundamentals of Data Warehouses. Springer, 2000.
[17] Kashyap, V.; Sheth, A.P.: Semantic and Schematic Similarities between Database Objects: A Context-Based Approach. VLDB Journal 5(4):276-304, 1996.