
KDD Process

This document provides an introduction to knowledge discovery in databases (KDD). It discusses the evolution of database technology from the 1950s to present day. It explains that while databases are growing exponentially in size, the information within them remains difficult to access. Data mining is presented as a way to discover useful knowledge and patterns within large datasets. The document outlines the KDD process and defines data mining as a core step within knowledge discovery. It provides examples of the types of data that can be mined and knowledge that can be discovered, such as associations, predictions, classifications and clusters.

Uploaded by

AYUSH AGRAWAL

Introduction to KDD

Evolution of Database
Technology
❑1950s: First computers, use of computers for census.
❑1960s: Data collection, database creation (hierarchical
and network models)
❑1970s: Relational data model, relational DBMS
implementation
❑1980s: Ubiquitous RDBMS, advanced data models
(extended-relational, OO, deductive, etc.) and
application-oriented DBMS (spatial, scientific,
engineering, etc)
❑1990s: Data mining and data warehousing, massive
media digitization, multimedia databases and Web
technology
Data Rich but
Information Poor

Databases are too big.

Data Mining can help discover knowledge

Terrorbytes
What should we do?
• We are not trying to find the needle in the haystack, because DBMSs know how to do that.
• We are merely trying to understand the consequences of the presence of the needle, if it exists.
What Led Us To This?
• Necessity is the Mother of Invention
– Technology is available to help us collect data.
• Barcode, scanners, satellites, cameras, etc.
– Technology is available to help us store data.
• Databases, data warehouses, a variety of repositories
– We are starving for knowledge (competitive edge, research,
etc).
• We are swamped by data that continually pour
on us.
– We do not know what to do with this data.
– We need to interpret this data in search for new knowledge.
What is our Need?
Extract interesting knowledge
(rules, regularities, patterns,
constraints) from data in large
collections.

Knowledge

Data
Introduction: Outline
• What kind of information are we collecting?
• What are Data Mining and Knowledge
Discovery?
• What kind of data can be mined?
• What can be discovered?
• Is all that is discovered interesting and
useful?
• Are there application examples?
Data Collected
(SOURCE)

• Business transactions
• Scientific data
• Medical and personal
data
• Surveillance video and
pictures
• Satellite sensing
• Games
Data Collected (Format)

• Digital Media
• CAD and Software
Engineering
• Virtual Worlds
• Text Reports and
Memos
• The World Wide Web
Knowledge Discovery

Process of non-trivial
extraction of implicit,
previously unknown
and potentially useful
information from large
collections of data.
Many Steps in KDD
Process
❑Gather the data together.
❑Cleanse the data and fit it together.
❑Select the necessary data.
❑Crunch and squeeze the data to extract the
essence of it.
❑Evaluate the output and use it.
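The steps above can be sketched as a small chain of functions. This is only an illustrative sketch: the record fields, the two sources, the thresholds, and the choice of "average spend per customer" as the extracted pattern are all assumptions made for the example, not part of the original slides.

```python
from statistics import mean

# Hypothetical raw records from two sources (names and fields are illustrative).
store_a = [{"customer": "c1", "amount": 20.0}, {"customer": "c2", "amount": None}]
store_b = [{"customer": "c1", "amount": 35.0}, {"customer": "c3", "amount": 5.0}]

def gather(*sources):
    """Gather: combine records from several sources into one collection."""
    return [record for source in sources for record in source]

def cleanse(records):
    """Cleanse: drop records with missing values."""
    return [r for r in records if r["amount"] is not None]

def select(records, min_amount):
    """Select: keep only the records relevant to the analysis."""
    return [r for r in records if r["amount"] >= min_amount]

def mine(records):
    """Crunch: reduce the data to a pattern, here the average spend per customer."""
    by_customer = {}
    for r in records:
        by_customer.setdefault(r["customer"], []).append(r["amount"])
    return {customer: mean(amounts) for customer, amounts in by_customer.items()}

def evaluate(patterns, threshold):
    """Evaluate: keep only the patterns deemed interesting."""
    return {customer: avg for customer, avg in patterns.items() if avg >= threshold}

patterns = mine(select(cleanse(gather(store_a, store_b)), min_amount=10.0))
interesting = evaluate(patterns, threshold=25.0)
print(interesting)
```

In practice the steps are revisited repeatedly, for example by returning to selection after the evaluation shows the patterns are not useful.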
So What is Data Mining?
• In theory, Data Mining is a STEP in the
knowledge discovery process. It is the
extraction of implicit information from a
large data set.
• In practice, data mining and knowledge
discovery are becoming synonymous.
So What is Data Mining?
• There are other equivalent terms: KDD,
knowledge extraction, discovery of
regularities, patterns discovery, data
archeology, data dredging, business
intelligence, information harvesting
• Shouldn’t it be called knowledge mining
instead of data mining?
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.

[Figure: the KDD process: Databases → Data Integration → Data Cleaning → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Pattern Evaluation]

❑Data integration: Multiple data sources, often heterogeneous, may be combined in a common source.
❑Data cleaning: Noisy data and irrelevant data are removed from the collection.
❑Data selection: Data relevant to the analysis is decided on and retrieved from the data collection.
❑Data transformation: Selected data is transformed into forms appropriate for the mining procedure.
❑Data mining: Clever techniques are applied to extract potentially useful patterns.
❑Pattern evaluation: Strictly interesting patterns representing knowledge are identified based on given measures.
❑Knowledge representation: Uses visualization techniques to help users understand and interpret the data mining results.
KDD is an Iterative Process
What Kind of Data Can
be Mined?
❑Flat Files
❑Heterogeneous and legacy databases
❑Relational databases
And other databases: object-oriented and object-relational databases
❑Transactional databases
Transaction(TID, Timestamp, {item1, item2, ...})
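The transaction schema above could be represented as follows. This is a minimal sketch: the class name mirrors the slide, but the field types, the sample data, and the `containing` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Transaction:
    tid: int              # transaction identifier (TID)
    timestamp: datetime   # when the transaction took place
    items: frozenset      # the set of items in the transaction

# Hypothetical sample data.
transactions = [
    Transaction(1, datetime(2024, 1, 5, 9, 30), frozenset({"bread", "milk"})),
    Transaction(2, datetime(2024, 1, 5, 9, 45), frozenset({"bread", "butter"})),
    Transaction(3, datetime(2024, 1, 6, 14, 0), frozenset({"milk"})),
]

def containing(item, txns):
    """Return the TIDs of all transactions that include the given item."""
    return [t.tid for t in txns if item in t.items]

bread_tids = containing("bread", transactions)
print(bread_tids)
```

Storing the items as a set makes membership tests and subset tests cheap, which is what most mining tasks on transactional data need.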
What Kind of Data Can
be Mined?
Data warehouses
What Kind of Data Can be
Mined?
Multimedia Databases
What Kind of Data Can be
Mined?
Spatial Databases
What Kind of Data Can be Mined?
Time-Series Databases
What Kind of Data Can be Mined?

Text Documents
What Kind of Data Can be Mined?

World Wide Web

Components of WWW:
1. Content of the web
2. Structure of the web
3. Usage of the web
What can be discovered?
• What can be discovered depends on the data
mining task employed.

• Descriptive DM tasks: describe general properties of the data
• Predictive DM tasks: make inferences from the available data
Data Mining Functionality
❑Characterization
Summarization of general features of objects in a
target class (concept description)
Ex: Characterize graduate students in Science
❑Discrimination
Comparison of general features of objects
between a target class and a contrasting class
(concept description)
Ex: Compare students in Science and students in
Arts
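Characterization and discrimination can both be sketched with simple summaries. The student records, the attributes (`gpa`, `age`), and the function names below are hypothetical; real concept description usually involves more elaborate generalization than plain averages.

```python
from statistics import mean

# Hypothetical student records.
students = [
    {"faculty": "Science", "level": "graduate", "gpa": 3.6, "age": 26},
    {"faculty": "Science", "level": "graduate", "gpa": 3.8, "age": 28},
    {"faculty": "Science", "level": "undergraduate", "gpa": 3.0, "age": 20},
    {"faculty": "Arts", "level": "graduate", "gpa": 3.5, "age": 30},
    {"faculty": "Arts", "level": "undergraduate", "gpa": 3.1, "age": 22},
]

def characterize(records):
    """Characterization: summarize general features of a target class."""
    return {
        "count": len(records),
        "mean_gpa": round(mean(r["gpa"] for r in records), 2),
        "mean_age": round(mean(r["age"] for r in records), 1),
    }

def discriminate(target, contrast):
    """Discrimination: pair each summary feature of the target class
    with the same feature of the contrasting class."""
    t, c = characterize(target), characterize(contrast)
    return {feature: (t[feature], c[feature]) for feature in t}

science_grads = [s for s in students
                 if s["faculty"] == "Science" and s["level"] == "graduate"]
science = [s for s in students if s["faculty"] == "Science"]
arts = [s for s in students if s["faculty"] == "Arts"]

profile = characterize(science_grads)       # graduate students in Science
comparison = discriminate(science, arts)    # Science versus Arts
print(profile)
print(comparison)
```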
Data Mining Functionality
❑Association
Studies the frequency of items occurring together
in transactional databases
Ex: buys(x, bread) → buys(x, milk)

❑Prediction
Predicts some unknown or missing attribute
values based on other information
Ex: Forecast the sale value for next week based
on available data.
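For the association example buys(x, bread) → buys(x, milk), the strength of a rule is commonly measured by its support and confidence. The sketch below computes both on a handful of made-up baskets; the data and function names are assumptions for illustration only.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

Here three of the five baskets contain both bread and milk, and three of the four baskets with bread also contain milk.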
Data Mining Functionality
❑Classification
Organizes data in given classes based on attribute
values (supervised classification)
Ex: classify students based on final results

❑Clustering
Organizes data in classes based on attribute values
(unsupervised classification)
Ex: group crime locations to find distribution patterns.
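Grouping locations without predefined classes can be sketched with k-means, which is one common clustering technique and is chosen here as an assumption; the slides do not name a specific algorithm. The coordinates and the fixed initial centroids are made up so that the run is deterministic.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            # Keep the old centroid if no point was assigned to it.
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical crime locations as (x, y) coordinates.
locations = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
             (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

centroids, clusters = k_means(locations, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Unlike classification, no class labels are supplied: the two groups emerge from the distances between the points alone.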
Data Mining Functionality

❑Outlier analysis
Identifies and explains exceptions (surprises)

❑Time-series analysis:
Analyzes trends and deviations, regression,
sequential pattern, similar sequences
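A simple way to flag exceptions is to measure how far each value lies from the mean in units of standard deviation (a z-score). This is a minimal sketch of one possible approach; the threshold of 2.0 and the sample data are assumptions, and z-scores are only reliable when the data are roughly bell-shaped.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical number of calls per day for one phone line.
daily_calls = [12, 15, 11, 14, 13, 12, 95, 14]
outliers = z_score_outliers(daily_calls)
print(outliers)
```

Identifying the exception is only the first half of outlier analysis; explaining it still requires looking at the surrounding data.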
Is all that is Discovered
Interesting?
• A data mining operation may generate
thousands of patterns; not all of them are
interesting.
• Data mining results are sometimes so large
that we may need to mine them too (Meta-Mining?)

How to measure? INTERESTINGNESS


Interestingness
• Objective interestingness measures:
– Based on statistics and structure of patterns, e.g.,
support, confidence, etc.

• Subjective interestingness measures:


– Based on user’s beliefs in data, e.g.,
unexpectedness, novelty
Interestingness Measures:

A pattern is interesting if it is:


• Easily understood by humans
• Valid on new or test data with some degree
of certainty
• Potentially useful
• Novel, or validates some hypothesis that a
user seeks to confirm
Potential and/or Successful
Applications

• Business data analysis and decision support


– Marketing focalization
• Recognizing specific market segments that
respond to particular characteristics
• Return on mailing campaign (target marketing)
– Customer profiling
• Segmentation of customers for marketing strategies
and/or product offerings
• Customer behavior understanding
• Customer retention and loyalty
Potential and/or Successful
Applications
• Business data analysis and decision support
– Market analysis and management
• Provide summary information for decision-making
• Market basket analysis, cross selling, market
segmentation
– Risk analysis and management
• What-if analysis
• Forecasting
• Pricing analysis, competitive analysis
• Time-series analysis (ex. Stock market)
Potential and/or Successful
Applications
• Fraud detection
• Detection of telephone fraud. Telephone call model:
destination of the call, duration, time of the day or week.
Analyze patterns that deviate from an expected norm.
• Detecting automotive and health insurance fraud
• Detection of credit-card fraud
• Detecting suspicious money transactions (money
laundering)
Potential and/or Successful
Applications
• Text Mining
– Message filtering (e-mail, newsgroups, etc)
– Newspaper article analysis

• Medicine
– Association pathology – symptoms
– DNA
Potential and/or Successful
Applications

• Sports
– IBM Advanced Scout analyzed NBA game
statistics (shots, blocks, fouls, assists) to gain
competitive edge.
• Astronomy
– JPL and Palomar Observatory discovered 22
quasars with the help of data mining
– Identifying volcanoes on Jupiter
Potential and/or Successful
Applications

• Surveillance cameras
– Use of stereo cameras and outlier analysis to
detect suspicious activities or individuals
• Web surfing and mining
– IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preference and behavior pages
(e-commerce)
– Adaptive web sites / improving Web site
organization, etc.
Warning: Data Mining Should
Not be Used Blindly!

• Data mining finds regularities in history, but
history is not the same as the future
• Association does not dictate trend or causality
– Example of a false conclusion: "Drinking diet drinks leads to obesity"
DBMS Transaction Subsystem
[Figure: components of the DBMS transaction subsystem: transaction manager, scheduler, buffer manager, recovery manager, access manager, file manager, systems manager]
• The transaction manager coordinates transactions on behalf of application programs.
• It communicates with the scheduler.
• The scheduler handles concurrency control.
• Its objective is to maximize concurrency without allowing simultaneous transactions to interfere with one another.
• The recovery manager ensures that the database is restored to the right state before a failure occurred.
• The buffer manager is responsible for the transfer of data between disk storage and main memory.
Thank You
