KDD Process
Evolution of Database
Technology
❑1950s: First computers, use of computers for census.
❑1960s: Data collection, database creation (hierarchical
and network models)
❑1970s: Relational data model, relational DBMS
implementation
❑1980s: Ubiquitous RDBMS, advanced data models
(extended-relational, OO, deductive, etc.) and
application-oriented DBMS (spatial, scientific,
engineering, etc.)
❑1990s: Data mining and data warehousing, massive
media digitization, multimedia databases and Web
technology
Data Rich but
Information Poor
• Databases are too big: "Terrorbytes".
• What should we do?
• We are not trying to find the needle in the
haystack, because DBMSs know how to do that.
[Figure labels: Data, Knowledge]
Introduction: Outline
• What kind of information are we collecting?
• What are Data Mining and Knowledge
Discovery?
• What kind of data can be mined?
• What can be discovered?
• Is all that is discovered interesting and
useful?
• Are there application examples?
Data Collected (Source)
• Business transactions
• Scientific data
• Medical and personal data
• Surveillance video and pictures
• Satellite sensing
• Games
Data Collected (Format)
• Digital Media
• CAD and Software Engineering
• Virtual Worlds
• Text Reports and Memos
• The World Wide Web
Knowledge Discovery
Process of non-trivial
extraction of implicit,
previously unknown
and potentially useful
information from large
collections of data.
Many Steps in KDD
Process
❑Gather the data together.
❑Cleanse the data and fit it together.
❑Select the necessary data.
❑Crunch and squeeze the data to extract the
essence of it.
❑Evaluate the output and use it.
So What is Data Mining?
• In theory, Data Mining is a STEP in the
knowledge discovery process. It is the
extraction of implicit information from a
large data set.
• In practice, data mining and knowledge
discovery are becoming synonymous.
So What is Data Mining?
• There are other, roughly equivalent terms: KDD,
knowledge extraction, discovery of
regularities, pattern discovery, data
archeology, data dredging, business
intelligence, information harvesting.
• Shouldn’t it be called knowledge mining
instead of data mining?
Data Mining: A KDD
Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline. Databases → Data Integration → Data Cleaning → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]
❑Data integration
Multiple data sources, often heterogeneous, may be
combined in a common source.
❑Data cleaning
Noisy data and irrelevant data are removed from the
collections.
❑Data selection
Data relevant to the analysis is decided on and
retrieved from the data collection.
❑Data transformation
Selected data is transformed into forms appropriate
for the mining procedure.
❑Data mining
Clever techniques are applied to extract potentially
useful patterns.
❑Pattern evaluation
Strictly interesting patterns representing knowledge
are identified based on given measures.
❑Knowledge representation
Uses visualization techniques to help users
understand and interpret the data mining results.
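The stages listed above can be sketched as a chain of small functions. This is a minimal illustration only: the two in-memory sources, the field names (`id`, `age`, `dept`) and every helper name are hypothetical, and simple value counting stands in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical toy sources; real sources would be databases or files.
SOURCE_A = [{"id": 1, "age": 23, "dept": "Science"},
            {"id": 2, "age": None, "dept": "Arts"}]
SOURCE_B = [{"id": 3, "age": 31, "dept": "Science"},
            {"id": 4, "age": 27, "dept": "Science"}]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: bucket ages into decades for mining."""
    return [{**r, "age": f"{r['age'] // 10 * 10}s"} for r in records]


def mine(records):
    """Data mining (stand-in): count each (field, value) pair."""
    return Counter((k, v) for r in records for k, v in r.items())


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a given measure."""
    return {p: n for p, n in patterns.items() if n >= min_count}


data = transform(select(clean(integrate(SOURCE_A, SOURCE_B)), ["age", "dept"]))
knowledge = evaluate(mine(data), min_count=2)
print(knowledge)
```

Each function consumes the output of the previous one, which also makes it easy to loop back and rerun an earlier stage with different choices.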
KDD is an Iterative Process
What Kind of Data Can
be Mined?
❑Flat Files
❑Heterogeneous and legacy databases
❑Relational databases
And other DBs: object-oriented and
object-relational databases
❑Transactional databases
Transaction(TID, Timestamp, {item1, item2, ...})
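A transactional record of the form above could be represented as follows. The class name, field names and sample items are illustrative assumptions, not the schema of any particular DBMS.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    tid: int                 # transaction identifier (TID)
    timestamp: datetime      # when the transaction occurred
    items: frozenset         # the set of items in the transaction


# Hypothetical sample data.
transactions = [
    Transaction(1, datetime(2024, 1, 5, 9, 30), frozenset({"bread", "milk"})),
    Transaction(2, datetime(2024, 1, 5, 9, 45), frozenset({"bread", "butter"})),
    Transaction(3, datetime(2024, 1, 6, 14, 0), frozenset({"milk"})),
]

# Which transactions contain bread?
bread_tids = [t.tid for t in transactions if "bread" in t.items]
print(bread_tids)
```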
What Kind of Data Can be
Mined?
❑Data warehouses
❑Multimedia databases
❑Spatial databases
❑Time-series databases
❑Text documents
❑The World Wide Web
Components of WWW:
1. Content of the web
2. Structure of the web
3. Usage of the web
What can be discovered?
• What can be discovered depends on the data
mining task employed.
• Descriptive DM tasks
Describe general properties of the data
• Predictive DM tasks
Make predictions by inference on available data
Data Mining Functionality
❑Characterization
Summarization of general features of objects in a
target class (concept description)
Ex: Characterize graduate students in Science
❑Discrimination
Comparison of general features of objects
between a target class and a contrasting class
(concept description)
Ex: Compare students in Science and students in
Arts
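As a rough sketch of both tasks: characterization summarizes one target class, and discrimination compares that summary with a contrasting class. The student records and the age and GPA attributes below are invented for illustration.

```python
from statistics import mean

# Hypothetical student records: (faculty, age, gpa).
students = [
    ("Science", 24, 3.6),
    ("Science", 26, 3.2),
    ("Science", 25, 3.4),
    ("Arts", 22, 3.5),
    ("Arts", 23, 3.1),
]


def characterize(records, target):
    """Summarize general features of the target class."""
    rows = [r for r in records if r[0] == target]
    return {
        "count": len(rows),
        "mean_age": mean(r[1] for r in rows),
        "mean_gpa": round(mean(r[2] for r in rows), 2),
    }


def discriminate(records, target, contrast):
    """Compare the target class against a contrasting class."""
    t, c = characterize(records, target), characterize(records, contrast)
    return {key: round(t[key] - c[key], 2) for key in t}


print(characterize(students, "Science"))
print(discriminate(students, "Science", "Arts"))
```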
Data Mining Functionality
❑Association
Studies the frequency of items occurring together
in transactional databases
Ex: buys(x, bread) → buys(x, milk)
❑Prediction
Predicts some unknown or missing attribute
values based on other information
Ex: Forecast the sale value for next week based
on available data.
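For the association example, the strength of a rule such as buys(x, bread) → buys(x, milk) is commonly quantified by support and confidence. The slides do not name these measures, so treat this as one standard way to score the rule; the baskets are made up.

```python
# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(f"bread -> milk: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items, and three of the four bread baskets also contain milk.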
Data Mining Functionality
❑Classification
Organizes data in given classes based on attribute
values (supervised classification)
Ex: classify students based on final results
❑Clustering
Organizes data in classes based on attribute values
(unsupervised classification)
Ex: group crime locations to find distribution patterns.
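The difference between the two tasks can be sketched on one-dimensional data: classification learns from given class labels, while clustering must find the groups itself. Nearest-centroid classification and k-means are techniques chosen here for brevity, not ones named in the slides, and the marks are invented.

```python
from statistics import mean

# Hypothetical final marks with known class labels (supervised setting).
labeled = [(35, "fail"), (42, "fail"), (48, "fail"),
           (61, "pass"), (75, "pass"), (88, "pass")]


def train_centroids(examples):
    """Classification: learn one centroid per given class label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def classify(value, centroids):
    """Assign the label whose centroid is closest to value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def kmeans_1d(values, centers, rounds=10):
    """Clustering: group unlabeled values around k moving centers."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


centroids = train_centroids(labeled)
print(classify(70, centroids))

# Same values with the labels thrown away (unsupervised setting).
unlabeled = [value for value, _ in labeled]
print(kmeans_1d(unlabeled, centers=[0, 100]))
```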
Data Mining Functionality
❑Outlier analysis
Identifies and explains exceptions (surprises)
❑Time-series analysis
Analyzes trends and deviations, regression,
sequential patterns, similar sequences
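A minimal sketch of both ideas on a single series: a z-score test flags exceptions, and a moving average exposes the trend. Both techniques are assumptions chosen for simplicity (the slides do not prescribe a method), and the sales figures are invented.

```python
from statistics import mean, pstdev

# Hypothetical weekly sales figures; the spike at index 5 is deliberate.
sales = [100, 102, 98, 101, 99, 160, 103, 100]


def zscore_outliers(series, threshold=2.0):
    """Return (index, value) pairs far from the mean in std-dev units."""
    mu, sigma = mean(series), pstdev(series)
    return [(i, x) for i, x in enumerate(series)
            if sigma > 0 and abs(x - mu) / sigma > threshold]


def moving_average(series, window=3):
    """Smooth the series to expose its trend."""
    return [mean(series[i:i + window])
            for i in range(len(series) - window + 1)]


print(zscore_outliers(sales))
print(moving_average(sales))
```

Flagging the exception is only the first half of the task; explaining it still requires looking at what happened in that week.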
Is all that is Discovered
Interesting?
• A data mining operation may generate
thousands of patterns; not all of them are
interesting.
• Data mining results are sometimes so large
that we may need to mine them too
(meta-mining?)
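One simple way to cut thousands of patterns down to the interesting ones is to threshold them on objective measures. The sketch below assumes association rules scored by support and confidence; the rule list, the scores and the threshold values are all made up for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: str
    consequent: str
    support: float      # fraction of transactions containing both sides
    confidence: float   # fraction of antecedent transactions with consequent


# Hypothetical mined rules with made-up measures.
mined = [
    Rule("bread", "milk", 0.60, 0.75),
    Rule("eggs", "milk", 0.40, 1.00),
    Rule("butter", "eggs", 0.01, 0.90),   # too rare to act on
    Rule("milk", "butter", 0.20, 0.25),   # too unreliable
]


def interesting(rules, min_support=0.1, min_confidence=0.7):
    """Keep rules that pass both thresholds, strongest first."""
    kept = [r for r in rules
            if r.support >= min_support and r.confidence >= min_confidence]
    return sorted(kept, key=lambda r: r.confidence, reverse=True)


for rule in interesting(mined):
    print(f"{rule.antecedent} -> {rule.consequent} "
          f"(support={rule.support:.2f}, confidence={rule.confidence:.2f})")
```

Thresholds like these capture only objective interestingness; whether a surviving rule is useful or novel to a particular user is a separate question.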
Potential and/or Successful
Applications
• Medicine
– Associations between pathologies and symptoms
– DNA
• Sports
– IBM Advanced Scout analyzed NBA game
statistics (shots, blocks, fouls, assists) to gain
a competitive edge.
• Astronomy
– JPL and Palomar Observatory discovered 22
quasars with the help of data mining
– Identifying volcanoes on Jupiter
Potential and/or Successful
Applications
• Surveillance cameras
– Use of stereo cameras and outlier analysis to
detect suspicious activities or individuals
• Web surfing and mining
– IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preferences and behavior
(e-commerce)
– Adaptive web sites / improving Web site
organization, etc.
Warning: Data Mining Should
Not be Used Blindly!