DWM
Define Data warehouse. How is it different from a database?
The term Data Warehouse was defined by Bill Inmon in 1990, in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:
Subject Oriented - Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated - Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant - All data in the data warehouse is identified with a particular time period.
Non-volatile - Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
b) Define the term Data cleaning with example.
Data cleaning is also known as scrubbing. The data cleaning process detects and removes errors and inconsistencies and improves the quality of the data. Data quality problems arise due to misspellings during data entry, missing values, or any other invalid data.
For example, if you conduct a survey and ask people for their phone numbers, people may enter their numbers in different formats.
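The phone number example can be sketched in a few lines of Python. This is only an illustration: the function name normalize_phone, the sample responses, and the assumption that a valid number has exactly 10 digits (with any country code dropped) are all hypothetical and not part of the answer above.

```python
import re


def normalize_phone(raw, digits_expected=10):
    """Strip every non-digit and keep the last `digits_expected` digits.

    Returns None when the input has too few digits to be a valid number
    (assumption: a valid number has exactly `digits_expected` digits).
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) < digits_expected:
        return None
    return digits[-digits_expected:]


# Hypothetical survey responses entered in different formats.
responses = ["(555) 123-4567", "555.123.4567", "+1 555 123 4567", "12345"]
cleaned = [normalize_phone(r) for r in responses]
```

After cleaning, the first three responses share one format and the last one is flagged as invalid (None) instead of being silently kept.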
c) List different Data cube computation methods.
Data Cube Computation Methods
o Multi-Way Array Aggregation.
o BUC.
o Star-Cubing.
o High-Dimensional OLAP.
D) Define the term Data mining.
Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis. Data mining techniques and tools enable enterprises to predict future trends and make more-informed business decisions.
E) State Application of cluster analysis.
Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing. Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer groups based on the purchasing patterns.
F) List Application of OLAP
Accounting, forecasting, budgeting, cost, and profitability analysis and consolidation.
Human resources, skill consolidation, labor scheduling, and optimization.
Distribution, scheduling, and optimization.
Marketing, churn, and market-based analysis.
G) Define OLAP Data cube.
An OLAP cube is a multi-dimensional array of data.[1] Online analytical processing (OLAP)[2] is a computer-based technique of analyzing data to look for insights. The term cube here refers to a multi-dimensional dataset, which is also sometimes called a hypercube if the number of dimensions is greater than 3.
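As a rough illustration of a cube as a multi-dimensional array, the sketch below stores one number per (time, item, location) cell and aggregates over dimensions. The dimension names, the sample figures, and the helper names roll_up and slice_cube are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sales facts: (time, item, location) -> units sold.
DIMENSIONS = ("time", "item", "location")
cube = {
    ("Q1", "bread", "Pune"): 10,
    ("Q1", "milk", "Pune"): 4,
    ("Q1", "bread", "Mumbai"): 7,
    ("Q2", "bread", "Pune"): 6,
    ("Q2", "milk", "Mumbai"): 9,
}


def roll_up(cube, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for cell, units in cube.items():
        totals[tuple(cell[i] for i in positions)] += units
    return dict(totals)


def slice_cube(cube, dimension, value):
    """Keep only the cells where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {cell: units for cell, units in cube.items() if cell[i] == value}


by_item = roll_up(cube, ["item"])     # totals per item
q1 = slice_cube(cube, "time", "Q1")   # the Q1 slice of the cube
grand_total = roll_up(cube, [])       # a single cell keyed by ()
```

With three dimensions this is an ordinary cube; adding a fourth entry to DIMENSIONS (and to every cell key) would give the hypercube case mentioned above.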
Q.2) Attempt any THREE of the following.
a) Explain three tier architecture of data warehousing.
1. Bottom Tier (Data Sources and Data Storage)
It is a warehouse database server, which is generally an RDBMS. Using application program interfaces (called gateways), data is extracted from operational and external sources. Gateways like ODBC (Open Database Connectivity), OLE-DB (Object Linking and Embedding, Database), and JDBC (Java Database Connectivity) are supported by the underlying DBMS.
2. Middle Tier (OLAP Engine)
B) Define the term 1) OLAP 2) ROLAP 3) MOLAP 4) HOLAP
MOLAP - This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
ROLAP - This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
HOLAP - HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
OLAP - OLAP (for online analytical processing) is software for performing multidimensional analysis at high speeds on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store.
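The ROLAP remark that slicing and dicing amount to adding a "WHERE" clause can be tried out with Python's built-in sqlite3 module. The sales table, its columns, and its rows are invented for this sketch; a real ROLAP server would generate such SQL against the warehouse's own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, region TEXT, item TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (2023, "East", "bread", 120.0),
        (2023, "West", "bread", 80.0),
        (2023, "East", "milk", 50.0),
        (2024, "East", "bread", 130.0),
    ],
)

# No restriction: total amount per region over all years and items.
all_regions = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Slice: fix one dimension (year) with a WHERE clause.
slice_2023 = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE year = ? "
    "GROUP BY region ORDER BY region",
    (2023,),
).fetchall()

# Dice: restrict two dimensions at once.
dice_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE year = ? AND item = ?",
    (2023, "bread"),
).fetchone()[0]
```

The only difference between the three queries is the WHERE clause, which is the point made in the ROLAP definition.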
C) Describe any four Challenges of Data mining.
1. Security and Social Challenges
Decision-making strategies rely on collecting and sharing data, so they require considerable security. Private information about individuals and other sensitive information is gathered to build customer profiles and to understand user behaviour. Illegal access to this information and its confidential nature become a significant issue.
2. Noisy and Incomplete Data
Data mining is the process of obtaining information from huge volumes of data. Real-world data is noisy, incomplete, and heterogeneous. Data in huge amounts will regularly be unreliable or inaccurate. These issues could be caused by human mistakes or by errors in the instruments that measure the data.
3. Mining dependent on Level of Abstraction
The data mining process should be interactive, because this lets users focus the search for patterns, presenting and refining data mining requests based on the results returned.
4. Integration of Background Knowledge
Background knowledge may be used to express the discovered patterns and to guide the exploration process.
5. Distributed Data
Real-world data is normally stored on various platforms in distributed computing environments. It may be on the internet, on individual systems, or even in databases. It is practically hard to bring all the data into a centralized data repository, mainly because of technical and organizational reasons.
Q.3) Attempt any THREE of the following.
a) Compare OLAP and OLTP Systems.
OLTP | OLAP
It is an online transactional system and manages database modification. | It is an online data retrieving and data analysis system.
Insert, Update, Delete information from the database. | Extract data for analyzing that helps in decision making.
OLTP and its transactions are the original source of data. | Different OLTP databases become the source of data for OLAP.
OLTP has short transactions. | OLAP has long transactions.
The processing time of a transaction is comparatively less in OLTP. | The processing time of a transaction is comparatively more in OLAP.
Tables in OLTP database are normalized (3NF). | Tables in OLAP database are not normalized.
OLTP database must maintain data integrity constraint. | OLAP database does not get frequently modified. Hence, data integrity is not affected.
b) Explain Data Cleaning Process.
Data cleaning is a process by which inaccurate, poorly formatted, or otherwise messy data is organized and corrected. For example, if you conduct a survey and ask people for their phone numbers, people may enter their numbers in different formats. Before it can be used, those phone numbers need to be standardized so that they're all formatted the same. Data can be messy like this for lots of reasons. Addresses can be formatted inconsistently; records can get duplicated and need to be identified and reconciled; some records may use different terms, like “Closed won” and “Closed Won” to represent what should be the same values; null values need to be handled correctly; and so on.
An example: How data can be cleaned
Data can be cleaned in a number of ways. Sometimes, it's done manually in SQL queries, in Python scripts, or in Excel. Sometimes, people use tools like Trifacta that are designed to programmatically clean data. And sometimes, it's incorporated into ETL processes that clean data as they extract and load it into a warehouse.
Opinion: Data cleaning, data prep, and data modeling are all slightly different
Data cleaning often gets conflated with two other related terms: data prep, and data modeling. We think of these words as meaning three different, albeit overlapping, things.
C) Explain Market basket analysis.
Market basket analysis is a modelling technique which is also called affinity analysis; it helps in identifying which items are likely to be purchased together.
The market-basket problem assumes we have some large number of items, e.g., "bread", "milk", etc. Customers buy the subset of items as per their need, and the marketer gets the information about which things customers have taken together. So the marketers use this information to decide where to position the items.
For Example: If someone buys a packet of milk, they also tend to buy bread at the same time. Milk => Bread
Market basket analysis algorithms are straightforward; difficulties arise mainly in dealing with large amounts of transactional data, where applying the algorithm may give rise to a large number of rules which may be trivial in nature.
Market basket analysis is used in deciding the location of items inside a store. E.g., if a customer buys a packet of bread he is more likely to buy a packet of butter too; keeping the bread and butter next to each other in a store would result in customers getting tempted to buy one item with the other.
Applications of Market Basket Analysis
Credit card transactions done by a customer may be analysed.
Phone calling patterns may be analysed.
Fraudulent medical insurance claims can be identified.
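A rule such as Milk => Bread is usually judged by how often the items occur together. The sketch below uses the standard support and confidence measures, which the answer above does not name, on a small set of invented baskets; the function names are hypothetical.

```python
# Hypothetical baskets; each set is one customer's purchase.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]


def count(itemset, transactions):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions)


def support(itemset, transactions):
    """Fraction of all baskets that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


milk_bread_support = support({"milk", "bread"}, transactions)      # 3 of 5 baskets
milk_to_bread = confidence({"milk"}, {"bread"}, transactions)      # 3 of 4 milk baskets
```

A retailer would typically keep only rules whose support and confidence exceed chosen thresholds, which is one way to discard the trivial rules mentioned above.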
D) Explain Bitmap index in OLAP.
It allows quick searching in data cubes.
The bitmap index is an alternative representation of the record ID (RID) list.
Each value of an attribute is represented by a distinct bit vector.
If the attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index.
If the attribute value is present in the row then it is represented by 1 in the corresponding row of the bitmap index, and all other bits for that row are set to 0.
ADVANTAGES:
Bitmap indexing is advantageous compared to hash and tree indices.
It is useful for low-cardinality domains because comparison, join, and aggregation operations are then reduced to bit arithmetic, which substantially reduces the processing time.
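A minimal sketch of the idea, using Python integers as bit vectors (bit i stands for row i). The table contents and the function names build_bitmap_index and matching_rids are made up for illustration; production systems use compressed bitmaps rather than plain integers.

```python
def build_bitmap_index(rows, attribute):
    """Map each distinct value of `attribute` to an int whose bit i is 1
    exactly when row i holds that value."""
    index = {}
    for rid, row in enumerate(rows):
        value = row[attribute]
        index[value] = index.get(value, 0) | (1 << rid)
    return index


def matching_rids(bitmap):
    """Turn a bit vector back into the list of record IDs whose bit is set."""
    rids = []
    rid = 0
    while bitmap:
        if bitmap & 1:
            rids.append(rid)
        bitmap >>= 1
        rid += 1
    return rids


# Hypothetical base table with two low-cardinality attributes.
rows = [
    {"item": "bread", "city": "Pune"},
    {"item": "milk", "city": "Mumbai"},
    {"item": "bread", "city": "Mumbai"},
    {"item": "butter", "city": "Pune"},
]
item_index = build_bitmap_index(rows, "item")
city_index = build_bitmap_index(rows, "city")

# A two-condition selection becomes a single bitwise AND.
bread_in_mumbai = matching_rids(item_index["bread"] & city_index["Mumbai"])
# A disjunction becomes a bitwise OR.
bread_or_butter = matching_rids(item_index["bread"] | item_index["butter"])
```

The queries never scan the rows themselves; they only combine bit vectors, which is the "reduced to bit arithmetic" advantage described above.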
Q.4) Attempt any THREE of the following.
a) Differentiate between operational database system and data warehouse.
c) Draw star schema of a data warehouse for sales considering Fact table Sales and dimensional tables as Time, Item, Branch and Location.
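One possible relational layout for such a star schema is sketched below with Python's built-in sqlite3 module. Only the table roles (a Sales fact table with Time, Item, Branch and Location dimensions) come from the question; every column name and the two measures are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member, descriptive attributes only.
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

    -- Fact table at the centre of the star: one foreign key per dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );
    """
)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
fact_foreign_keys = conn.execute("PRAGMA foreign_key_list(sales_fact)").fetchall()
```

In a drawn diagram the fact table sits in the middle and each of the four foreign keys becomes one arm of the star.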
b) Describe the need of data preprocessing.
1. Real world data are generally
Incomplete: The data is said to be incomplete when certain attributes or attribute values are missing or only aggregate data is available.
Noisy: When the data contains errors or some outliers it is considered to be noisy data.
Inconsistent: When the data contains differences in codes or names it is inconsistent data.
2. Major Tasks in data pre-processing
Data cleaning: This process consists of filling in missing values, smoothing noisy data, identifying and removing any outliers present, and resolving inconsistencies.
Data Integration: This refers to integrating data from multiple sources like databases, data cubes, or files.
Data transformation: Normalization and aggregation.
Data reduction: In data reduction the amount of data is reduced but the same analytical results are produced.
Data discretization: Part of data reduction, replacing numerical attributes with nominal ones.
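Three of the tasks listed above (filling missing values, normalization, and discretization) can be sketched in plain Python. The choice of mean imputation, min-max scaling, the cut points, and the sample ages are all assumptions made for this example, not methods prescribed by the answer.

```python
def fill_missing(values):
    """Data cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max(values):
    """Data transformation: rescale to the range [0, 1].

    Assumes the values are not all equal (otherwise the range is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, cut_points, labels):
    """Data discretization: replace each number with a nominal label.

    `labels` must have one more entry than `cut_points`.
    """
    result = []
    for v in values:
        for cut, label in zip(cut_points, labels):
            if v < cut:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


ages = [20, None, 40, 60]          # hypothetical attribute with a missing value
filled = fill_missing(ages)
scaled = min_max(filled)
groups = discretize(filled, cut_points=[30, 50], labels=["young", "middle", "senior"])
```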
C) Describe features of OLAP.
Need for Online Analytical Processing
OLAP provides fast, steady, and proficient access to the various views of information.
The complex queries can be processed.
It's easy to analyze information by processing complex queries on multidimensional views of data.
Data warehouse is generally used to analyse the information where huge amount of historical data is stored.
Information in data warehouse is related to more than one dimension like sales, market trends, buying patterns, supplier, etc.
D) Describe Extraction, Transformation and Loading in data warehousing.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and then finally, loads it into the Data Warehouse system.
Extraction: The first step of the ETL process is extraction. In this step, data from various source systems, which can be in various formats like relational databases, NoSQL, XML, and flat files, is extracted into the staging area. It is important to extract the data from various source systems and store it into the staging area first and not directly into the data warehouse because the extracted data is in various formats and can be corrupted also.
Transformation: The second step of the ETL process is transformation. In this step, a set of rules or functions are applied on the extracted data to convert it into a single standard format. It may involve following processes/tasks:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and America into USA, etc.
Joining – joining multiple attributes into one.
Loading: The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse very frequently and sometimes it is done after longer but regular intervals. The rate and period of loading solely depends on the requirements and varies from system to system.
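The three steps can be sketched end to end with Python's standard csv and sqlite3 modules. The source data, the customer table, the default age of 0, and the function names are hypothetical; an in-memory SQLite database stands in for the warehouse and a Python list stands in for the staging area.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source with inconsistent country names and a missing age.
SOURCE_CSV = """name,country,age
Asha,U.S.A,34
Ben,United States,
Chen,America,29
"""
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
DEFAULT_AGE = 0  # assumed default for missing values


def extract(text):
    """Extraction: read the source into a staging list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: map country variants to USA and fill missing ages."""
    cleaned = []
    for row in rows:
        country = COUNTRY_MAP.get(row["country"], row["country"])
        age = int(row["age"]) if row["age"] else DEFAULT_AGE
        cleaned.append((row["name"], country, age))
    return cleaned


def load(rows, conn):
    """Loading: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer (name TEXT, country TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


staging = extract(SOURCE_CSV)
warehouse = sqlite3.connect(":memory:")
load(transform(staging), warehouse)
loaded = warehouse.execute(
    "SELECT name, country, age FROM customer ORDER BY name"
).fetchall()
```

Keeping the raw rows in staging before transforming them mirrors the reason given above: the warehouse only ever receives data that is already in one standard format.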
Q.5) Attempt any TWO of the following. (12 Marks)
a) Explain multidimensional Data model. How is it used in data warehouse?
A multidimensional model views data in the form of a data-cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. The dimensions are the perspectives or entities concerning which an organization keeps records.
Use
- The Multi-Dimensional Data Model is a method which is used for ordering data in the database along with good arrangement and assembling of the contents in the database.
- The Multi-Dimensional Data Model allows customers to interrogate analytical questions associated with market or business trends, unlike relational databases which allow customers to access data in the form of queries. They allow users to rapidly receive answers to the requests which they made by creating and examining the data comparatively fast.
b) Explain top down and bottom up design approach of data warehouse.
Top-Down Design Model:
In the top-down model, an overview of the system is formulated without going into detail for any part of it. Each part of it is then refined into more details, defining it in yet more details until the entire specification is detailed enough to validate the model.
Advantages:
- Breaking problems into parts helps us to identify what needs to be done.
- At each step of refinement, new parts will become less complex and therefore easier to solve. Parts of the solution may turn out to be reusable.
- Breaking problems into parts allows more than one person to solve the problem.
Bottom-Up Design Model:
In this design, individual parts of the system are specified in detail. The parts are linked to form larger components, which are in turn linked until a complete system is formed. Object-oriented languages such as C++ or Java use a bottom-up approach where each object is identified first.
Advantage:
- Make decisions about reusable low-level utilities, then decide how they will be put together to create a high-level construct.
- The contrast between Top-down design and bottom-up design.
c) List clustering Methods explain any two.
* Basic Clustering Method
- A good clustering method will produce high quality clusters with:
° High intra-class similarity
° Low inter-class similarity
- Major clustering methods can be classified into the following categories:
1. Partitioning methods: In Partitioning based approach, various partitions are created and then they are evaluated based on certain criteria.
2. Hierarchical methods: The set of data objects are decomposed hierarchically using certain criteria.
Q.6) Attempt any TWO of the following. (12 Marks)
a) Explain Data preprocessing technique in data mining.
b) Explain Apriori algorithms for frequent itemset using candidate generation.
* Frequent Itemsets
- An itemset X is frequent if X's support is no less than a minimum support threshold.
- A frequent itemset is a set of items that appears in at least a pre-specified number of transactions. Frequent itemsets are typically used to generate association rules.
- Consider a dataset S; the frequent itemsets in S are those itemsets that appear in at least a fraction s of the baskets, where s is a chosen constant with a value such as 0.01 or 1%.
- To find frequent itemsets one can use the monotonicity principle or a-priori trick, which is given as: If a set of items, say S, is frequent then all its subsets are also frequent.
- The procedure to find frequent itemsets:
- A levelwise search may be conducted to find the frequent 1-itemsets (sets of size 1), then proceed to find the frequent 2-itemsets, and so on.
- Next search for all maximal frequent itemsets.
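The level-wise procedure above can be sketched as follows. This is a simplified illustration, not an optimized implementation: the function name apriori, the sample baskets, and the use of an absolute count (rather than a fraction) as the minimum support are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset contained in at least
    `min_support` transactions (an absolute count in this sketch)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for basket in transactions:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        previous = list(current)
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune with the monotonicity principle: every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count the surviving candidates against the transactions.
        counts = {c: sum(c <= basket for basket in transactions) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=3)

# Maximal frequent itemsets: frequent sets with no frequent proper superset.
maximal = [s for s in frequent if not any(s < other for other in frequent)]
```

In this data the candidate {milk, bread, butter} is pruned without being counted, because its subset {milk, butter} is not frequent.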
c) Explain steps involved in KDD process with diagram.
1. Developing an understanding of
- The application domain
- The relevant prior knowledge
- The goals of the end-user.
2. Creating a target data set
- Selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
3. Data cleaning and pre-processing
- Noise or outliers are removed.
- Essential information is collected for modelling or accounting for noise.
- Missing data fields are handled by using appropriate strategies.
- Time sequence information and changes are maintained.
4. Data reduction and projection
- Based on the goal of the task, useful features are found to represent the data.
- The number of variables may be effectively reduced using methods like dimensionality reduction or transformation. Invariant representations for the data may also be found out.
5. Choosing the data mining task
- Selecting the appropriate Data mining tasks like classification, clustering, regression based on the goal of the KDD process.
6. Choosing the data mining algorithm(s)
- Pattern search is done using the appropriate Data Mining method(s).
- A decision is taken on which models and parameters may be appropriate.
- Considering the overall criteria of the KDD process, a match for the particular data mining method is done.
7. Data mining
- Using a representational form or other representations like classification rules or trees, regression, and clustering for searching patterns of interest.
8. Interpreting mined patterns
9. Consolidating discovered knowledge