
Intelligent Knowledge Discovery

The document discusses intelligent knowledge discovery and data mining. It describes the basic role of the GOAL project, which aims to develop a framework for integrating geographic information systems and data warehousing. The knowledge discovery part of the project involves developing a knowledge discovery in databases (KDD) package. The KDD process involves several steps including data preprocessing, data mining, and output generation. Common data mining tasks that could be included in the package are discovering association rules, decision trees, clusters, and dependencies.

Uploaded by

mca_rafi
Copyright
© Attribution Non-Commercial (BY-NC)

Intelligent Knowledge Discovery

Ján Paralič
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic
[email protected]

Eva Andrássyová
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic
[email protected]

Abstract

The main aim of this paper is to describe the basic ideas behind the international Copernicus research project titled GOAL - Geographic Information On-Line Analysis (GIS - Data Warehouse Integration), focusing mainly on its knowledge discovery part. For the knowledge extraction phase, a KDD (Knowledge Discovery in Databases) package is to be developed within this project. The basic idea is to provide, as its core, a number of different algorithms and their combinations. Algorithms able to discover association rules, decision trees and clusters are planned to be integrated. We also plan to provide this package as a stand-alone application.

I. INTRODUCTION

The integration and combination of GIS (Geographic Information System) data into and with OLAP systems poses a number of problems that have not yet been satisfactorily solved: getting the data into the OLAP system, representing the data for analysis, and extracting knowledge while respecting security restrictions. Current approaches do not address these special problems, which result from the targeted application arena of GIS and OLAP systems.

The goal of the GOAL project is to develop a generic framework, both recognized by the research community and applicable in real-world applications, which solves the general issues of GIS and DWH (data warehouse) interoperability, including DWH feeding, knowledge extraction, interpretation, and security concepts. The feasibility of this framework will be tested on two very different real-world applications from the GIS domain, using environmental sensor data and cultural data, allowing a real-world evaluation of the framework.

For the knowledge extraction phase, a KDD (Knowledge Discovery in Databases) package is to be developed within this project. The basic idea is to provide, as its core, a number of different algorithms and their combinations. Algorithms able to discover association rules, decision trees and clusters are planned to be integrated. We also plan to provide this package as a stand-alone application.

II. KDD PROCESS

When studying the literature on data mining we encountered terms such as data mining, knowledge discovery in databases, and the abbreviation KDD. Various sources explain these terms in different ways. In our opinion, the most sophisticated definition is the one given in [4] (Fayyad et al.), where the authors state that knowledge discovery in databases is an interactive and iterative process with several steps. This means that at any stage the user should be able to make changes (for instance, to choose a different task or technique) and repeat the following steps to achieve better results. Data mining is a part of this process.

In most sources, the term Data Mining (DM) is often used to name the whole field of knowledge discovery. This confusing use of the terms KDD and DM has historical reasons and is also due to the fact that most of the work is focused on refinement and applicability experiments of ML and AI algorithms for the data mining step. Pre-processing is often included in this step as a part of the mining algorithm.

Within the KDD process the following steps can be recognised (according to [1]). The first two steps of the KDD process are related to goal identification, namely task discovery and data discovery. The following step includes all data pre-processing; let us call it data cleaning. The core of the KDD process is the data mining phase, which includes model development and data analysis. Finally, suitable output generation is necessary for the user. In the following, the steps of the KDD process are described in more detail.

Within task discovery one has to state the problem or goal, which often only seems to be clear. Further investigation is recommended, such as getting acquainted with the customer's organisation by spending some time on site and sifting through the raw data (to understand its form, content, organisational role and sources). Only then can the real goal of the discovery be found.

Data discovery is complementary to the step of task discovery. In this step one has to decide whether the quality of the data is satisfactory for the goal (what the data does or does not cover).

Data cleaning is often necessary, though it may happen that something removed by cleaning is an indicator of some interesting domain phenomenon (outlier or key data point?). The analyst's background knowledge is crucial in data cleaning performed by comparing multiple sources. Another way is to clean the data by editing procedures before it is loaded into the database. Recently, the data for KDD have been coming from data warehouses, which contain data that have already been cleaned.
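The caution about cleaning can be made concrete with a minimal sketch that flags suspicious values instead of deleting them, so that the analyst can still inspect a possible key data point. The interquartile-range rule, the function name flag_outliers, the threshold k and the sensor readings below are illustrative assumptions; they are not part of the KDD package described in this paper.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Pair each value with an outlier flag instead of removing it.

    Assumption: the interquartile-range (IQR) rule is used as the heuristic.
    A value is flagged when it lies more than k * IQR outside the quartiles.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(v, v < low or v > high) for v in values]


# Made-up sensor readings; the last one is either an error or a key data point.
readings = [10, 11, 12, 11, 10, 12, 11, 95]
flagged = flag_outliers(readings)

# Nothing is deleted: the analyst decides what to do with the flagged rows.
suspicious = [v for v, is_outlier in flagged if is_outlier]
print(suspicious)
```

Keeping the flag next to the value leaves the decision with the analyst, whose background knowledge the text identifies as crucial at this step.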
Model development is an important phase of KDD that must precede the actual analysis of the data. Interaction with the data leads analysts to the formation of hypotheses (often based on experience and background knowledge). The sub-processes of model development are: data segmentation (unsupervised learning techniques, for example clustering); model selection (choosing the best type of model after exploring several different types); parameter selection (choosing the parameters of the selected model).

Data analysis is, in general, an attempt to understand why certain groups of entities behave the way they do; it is a search for laws or rules of such behaviour. The parts where such groups are already identified should be analysed first. The sub-processes of data analysis are: model specification - some formalism is used to denote the specific model; model fitting - the specific parameters are determined when necessary; evaluation - the model is evaluated against the data; model refinement - the model is refined in iterations according to the evaluation results. Model development and data analysis are complementary, which often leads to oscillation between these two steps.

Output generation - output can take various forms. The simplest form is a report with the analysis results. Other, more complicated forms are graphs, or in some cases it is desirable to obtain descriptions of actions that can be taken directly as outputs. Alternatively, the output may be a monitor that triggers an alarm or an action under certain conditions. Output requirements might determine the task of the designed KDD application.

III. KDD TASKS

The first step in the KDD process is task discovery. There are many possible tasks. In this section some of the most important discovery tasks are listed and briefly described. (A more detailed description can be found in [5].)

Fig. 1. Knowledge discovery tasks partitioning. [Figure not reproduced; its labels are: prediction, classification, regression, association rules, clustering, dependency modeling, causation modeling, database dependencies, SQO rules, deviation detection, summarisation, highlighting, description.]

Discovery of SQO rules. Semantic Query Optimisation (SQO) rules perform a syntactical transformation of the incoming query to produce a more efficient query by adding or removing conjuncts. A characteristic of SQO rules is that the query processing time (derived from the access method and indexing scheme of the database management system) is taken into account as the cost of an attribute.

Discovery of database dependencies. In this case the term refers to relationships among attributes of relations. Database dependencies are used in the design and maintenance of a DBMS.

Discovery of association rules. An association rule is a relationship of the form X => Y, where X and Y are sets of items (conjuncts of attribute values) and X ∩ Y = ∅. Each rule is assigned a support and a confidence factor.

Dependence modelling. Discovery of dependencies among attributes in the form of if-then rules: "if (antecedent is true) then (consequent is true)". The antecedent is usually a conjunction of attribute values and the consequent is a single value. The main difference between dependence modelling and database dependencies is that rules for dependence modelling do not have to be exact.

Deviation detection. This task focuses on the discovery of significant deviations between the actual contents of a data subset and its expected contents. In general we can distinguish two types of deviations: temporal - significant changes along the time dimension; group - unexpected differences between two subsets of data. The significance of a deviation is a subjective measure, strongly dependent on the user.

Clustering. This is a classification scheme in which the classes are unknown. Tuples with similar attribute values are clustered into the same class. The problem in this task is to determine how to measure the quality of the produced clusters. After clustering is done it is possible to apply a classification or summarisation algorithm to the "invented" classes.

Causation modelling. Discovery of relationships of cause and effect among attributes. The rules are similar to those of dependence modelling, but causal rules indicate that the antecedent causes the consequent and that this relationship is not due to another observed variable.

Classification. A task where each tuple belongs to a class, which is one of a pre-defined set of classes. The class of a tuple is indicated by the value of a user-defined class attribute. A classification algorithm aims to find some relationship between the predicting attributes and each class (each value of the class attribute).

Regression. A task similar to classification, except that the predicted value is continuous. Traditional methods are statistical (such as linear regression); however, there are a number of symbolic methods in which modified classification methods are involved (for instance, a decision tree with a linear model as a leaf node).

Summarisation. This is a kind of summary describing some properties shared by most of the tuples belonging to the same class. Discovered summaries can be expressed as a characteristic rule, which may be interpreted as: "if (the tuple belongs to the class indicated in the antecedent) then (the tuple has all properties mentioned in the consequent)". Unlike classification rules, such rules do not discriminate between classes.

IV. KDD PACKAGE

As a result of the analysis of the KDD process and the KDD tasks described in the previous two sections, we decided to propose the following structure for the KDD package we have been developing (see Fig. 2). The KDD package will have a modular structure, in which the common parts of the system can be used by each of the specialized DM modules covering one or more possible KDD tasks. There are just two common parts.

1) DB access. This is a module for accessing database sources (which can be a DBF file, an SQL database or possibly a data warehouse).

2) Visual component for data manipulation. This module enables a user to visualize, browse, modify and transform data from a database (a data warehouse etc.). All possible operations on data are defined by plug-ins as well. This makes it possible to add a new transformation operation on data (sampling, checking, etc.) at any time very easily.

For each KDD task a different data mining algorithm, as well as a different type of output generation, is suitable. Therefore each new data mining algorithm, as well as each output generation module, can be implemented separately and added to our KDD package in the form of a plug-in. This does not necessarily mean that each data mining algorithm must have its own output generation module.

Usually, adding the processing of a new KDD task to the KDD package will mean the following: implementing (or just re-using an existing implementation of) a data mining algorithm in the form of a plug-in; if necessary, implementing new transformation or other pre-processing functions in the form of plug-ins; if necessary, implementing a new output generation module in the form of a plug-in.

Based on some very first experiments with real data from the GOAL project pilot applications, we plan to implement at least the following four KDD tasks. The KDD tasks are referred to with respect to our partitioning depicted in Fig. 1.

Fig. 2. Proposed structure of the KDD package. [Figure not reproduced; it shows DB access to the databases, the visual component for data manipulation, data preprocessing modules, data mining modules (association rules, classification, clustering, others) and output generation modules.]
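A minimal sketch of how such a plug-in structure could look is given below. The class KDDPackage, its methods register and run, and the three toy plug-ins are hypothetical names chosen for illustration; the paper does not specify the programming interface of the package, so this should be read as one possible arrangement only.

```python
from collections import Counter


class KDDPackage:
    """Hypothetical plug-in registry; not the actual interface of the GOAL package."""

    KINDS = ("transform", "miner", "output")

    def __init__(self):
        # One name -> function table per plug-in kind.
        self.plugins = {kind: {} for kind in self.KINDS}

    def register(self, kind, name, func):
        if kind not in self.KINDS:
            raise ValueError(f"unknown plug-in kind: {kind}")
        self.plugins[kind][name] = func

    def run(self, rows, transforms, miner, output):
        for name in transforms:  # pre-processing plug-ins, applied in order
            rows = self.plugins["transform"][name](rows)
        model = self.plugins["miner"][miner](rows)  # data mining plug-in
        return self.plugins["output"][output](model)  # output generation plug-in


package = KDDPackage()
package.register("transform", "drop_missing",
                 lambda rows: [r for r in rows if None not in r.values()])
package.register("miner", "class_counts",
                 lambda rows: Counter(r["cls"] for r in rows))
package.register("output", "report",
                 lambda model: "; ".join(f"{k}: {v}" for k, v in sorted(model.items())))

# Made-up rows standing in for data delivered by the DB access module.
rows = [
    {"x": 1, "cls": "a"},
    {"x": None, "cls": "b"},
    {"x": 3, "cls": "a"},
    {"x": 4, "cls": "b"},
]
report = package.run(rows, ["drop_missing"], "class_counts", "report")
print(report)
```

Because the mining and output plug-ins are registered independently, several mining algorithms can share one output module, which matches the remark that an algorithm need not bring its own output generation.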

A. Prediction - classification

Here we are going to implement two different data mining algorithms with two possible forms of output generation. The first is the CN2 system by Clark and Niblett (see [3]), a symbolic data mining tool designed to efficiently induce simple and comprehensible rules in domains where noisy data may be present. The input of CN2 usually consists of a file describing the attributes and their types and a file containing the examples. The attributes can be of two types: discrete (a finite set of values) and ordered (integers or floats). CN2 outputs an ordered or unordered list of rules (decision lists) of the form 'IF <complex> THEN PREDICT <class>', where <complex> is a conjunct of attribute tests. These decision lists are probabilistic rules, i.e. the condition covers examples of a single class, but possibly also a few examples of other classes.

The other algorithm produces the well-known decision trees. It is C4.5, chosen in order to be able to handle numerical attributes (which its predecessor ID3 cannot).
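How an ordered list of rules of the form 'IF <complex> THEN PREDICT <class>' is applied to a new example can be sketched as follows. This is not CN2 itself: the induction of the rules is omitted, and the rules, attribute names, operator table and default class are invented for illustration.

```python
# Supported attribute tests; an assumption made for this sketch.
OPS = {
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<=": lambda a, b: a <= b,
}

# Each rule is (complex, class); a complex is a conjunction of attribute tests.
RULES = [
    ([("outlook", "==", "sunny"), ("humidity", ">", 75)], "dont_play"),
    ([("outlook", "==", "rain"), ("windy", "==", True)], "dont_play"),
]
DEFAULT_CLASS = "play"  # predicted when no rule fires


def covers(complex_, example):
    """True if every attribute test of the complex holds for the example."""
    return all(OPS[op](example[attr], value) for attr, op, value in complex_)


def predict(example, rules=RULES, default=DEFAULT_CLASS):
    """Ordered decision list: the first rule whose complex covers the example wins."""
    for complex_, cls in rules:
        if covers(complex_, example):
            return cls
    return default


hot_day = {"outlook": "sunny", "humidity": 80, "windy": False}
mild_day = {"outlook": "overcast", "humidity": 60, "windy": False}
print(predict(hot_day), predict(mild_day))
```

In an unordered rule list all matching rules would have to be collected and their class distributions combined, which is why the two output forms are distinguished in the text.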

B. Highlighting/Prediction - association rules

Here we are going to implement a more general approach, which is in fact usable for other kinds of KDD tasks as well. It was described by Mannila in [6]. A fairly large class of data mining tasks can be described as the search for interesting and frequently occurring patterns in the data. That is, we are given a class P of patterns or sentences that describe properties of the data, and we can specify whether a pattern p ∈ P occurs frequently enough and is otherwise interesting. The generic data mining task is then to find the set

PI(d, P) = {p ∈ P | p occurs sufficiently often in database d and p is interesting}.

For association rules, the pattern class is the set of all rules of the form X => B, and a rule is interesting if its confidence is sufficiently high. For finding episodes, the patterns are the episodes and there need not be any interestingness criterion.

C. Description/Prediction - clustering

For clustering we would like to offer the AutoClass system. Other interesting approaches (e.g. based on neural networks) may be added in the future. AutoClass is an automatic classification program for extracting useful information from databases [2]. It is an approach to unsupervised classification based upon the classical mixture model, supplemented by a Bayesian method for determining the optimal classes. We emphasize that no current unsupervised classification system can produce optimal results on its own. It is the interaction between domain experts and the machine searching over the model space that generates new knowledge. Both bring unique information and abilities to the database analysis task, and each enhances the other's effectiveness.

V. SUMMARY

This paper presents a strategy for the implementation of a KDD package which will serve as a special module within the GOAL project but can also be used as an open stand-alone application for knowledge discovery.

The description of the KDD process and its particular tasks has shown that data mining is versatile and, though very important, just one part of the whole KDD process. Therefore, in order to adopt existing data mining algorithms for use with real data sources (it does not matter whether they are in a database or a data warehouse), three objectives are crucial:

1. Fast connection to existing data sources (in the case of the GOAL project preferably data warehouses).
2. Flexible and rich data selection and transformation methods must be provided in a form which is easy to use and easy to understand for the user.
3. The system must be open for easy integration of new data mining algorithms and, if necessary, even new forms of output generation.

The design of the KDD package has been sketched on the basis of these objectives. Its modular structure makes it possible to achieve all three of them. Based on some very first experiments with real data from the GOAL project pilot applications, we plan to implement at least the following three KDD tasks: 1) prediction - classification, 2) highlighting/prediction - association rules and 3) description/prediction - clustering.

VI. ACKNOWLEDGMENT

This work has been supported by the European Commission within the INCO Copernicus Programme under contract No. 977091 and by the Ministry of Education of the Slovak Republic, VEGA grant No. 1/5032/98 - Integration of Tools for Intelligent Technologies.

VII. REFERENCES

[1] R.J. Brachman and T. Anand, "The Process of Knowledge Discovery in Databases," Advances in Knowledge Discovery & Data Mining, AAAI/MIT Press, Cambridge, Massachusetts, 1996, pp. 37-57.
[2] P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results," in Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy Uthurusamy, Eds., AAAI Press, 1996.
[3] P. Clark and T. Niblett, "The CN2 Induction Algorithm," Machine Learning Journal, vol. 3, no. 4, pp. 261-283, Kluwer, Netherlands, 1989.
[4] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD Process for Extracting Useful Knowledge from Volumes of Data," Communications of the ACM, vol. 39, no. 11, Nov. 1996, pp. 27-34.
[5] A.A. Freitas, Generic, Set-Oriented Primitives to Support Data-Parallel Knowledge Discovery in Relational Database Systems. Ph.D. Thesis, University of Essex, UK, July 1997.
[6] H. Mannila, "Methods and Problems in Data Mining," in the Proceedings of International Conference on Database Theory, Delphi, Jan. 1997, Springer-Verlag.
