Unit-4 Introduction To Data Mining
Data Mining Models and Tasks
BASIC TASKS
Classification
[Figure: data on previous customers (age, salary, profession, location, customer type) is fed to a classifier, which produces decision rules such as Salary > 5 L and Prof. = Exec; the rules label a new applicant's data as good or bad.]
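A minimal sketch of how decision rules like those in the figure might be applied to new applicants. The figure does not say how the two rules are combined, so joining them with OR is an assumption, as are the function name, the field names, and the sample applicants. A real classifier would learn such rules from the previous customers' data rather than have them written by hand.

```python
def classify_applicant(salary_lakh, profession):
    """Label an applicant 'good' or 'bad' using two hand-written rules.

    Assumption: the rules are combined with OR; the source figure
    does not specify this.
    """
    if salary_lakh > 5 or profession == "Exec":
        return "good"
    return "bad"


# Hypothetical new applicants (made-up values for illustration).
applicants = [
    {"salary_lakh": 7.0, "profession": "Clerk"},
    {"salary_lakh": 3.0, "profession": "Exec"},
    {"salary_lakh": 2.5, "profession": "Clerk"},
]

labels = [classify_applicant(a["salary_lakh"], a["profession"]) for a in applicants]
print(labels)
```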
DATA MINING TASKS (contd.)
Regression: used to predict a value for an individual on the basis of
information gained from a previous sample of similar individuals.
Example:
A person wants to save for the future. The prediction is based on
his current savings and several past values: he fits a linear
regression formula to them and uses it to predict his future savings.
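The savings example can be sketched with an ordinary least-squares line fitted to past values. This is only an illustration: the function name `fit_line` and the savings figures are made up, and the slides do not say which regression formula is meant.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past savings per year (made-up values).
years = [1, 2, 3, 4, 5]
savings = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(years, savings)
predicted_year_6 = slope * 6 + intercept  # extrapolate one year ahead
print(slope, intercept, predicted_year_6)
```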
Clustering: a data mining technique used to place data elements
into related groups without advance knowledge of the group
definitions.
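As one possible illustration, the sketch below groups one-dimensional values with a basic k-means loop. k-means is chosen here only as a familiar example; the slides do not name a specific clustering algorithm, and the function name, starting centers, and data are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around centers; no group labels are given in advance."""
    groups = []
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up data with two visible groups; the starting centers are arbitrary.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, groups)
```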
Pattern mining is a data mining method that involves
finding existing patterns in data. In this context, patterns
often mean association rules. The original motivation for
searching for association rules came from the desire to analyze
supermarket transaction data, that is, to examine customer
behavior in terms of the products purchased.
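Association rules are usually scored by support and confidence. The sketch below computes both for one candidate rule over a handful of made-up supermarket baskets; the transactions, item names, and helper names are assumptions for illustration only.

```python
# Hypothetical supermarket baskets (made-up data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"toothpaste", "toothbrush"},
    {"butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


# Candidate rule: bread -> butter
rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

A rule is typically kept only if both scores clear thresholds chosen by the analyst.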
Summarization maps data into subsets with associated
simple descriptions (characterization or generalization).
Example: GATE score
Data Mining Application: Marketing
Sales Analysis
• associations between product sales:
bread and butter
Toothpaste and toothbrush
Customer Profiling
• data mining can tell you what types of customers
buy what products
Identifying Customer Requirements
• identify the best products for different customers
• use prediction to find what factors will attract
new customers
Data Mining Application: Fraud Detection
• Association Rule Mining can detect a group of people who
stage accidents to collect on insurance
Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
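The three kinds of dirty data above can be detected with simple checks. The sketch below is illustrative only: the record layout and the function name are hypothetical, the birthday 03/07/1997 is read as 3 July 1997 (the slide does not say which field is the day), and the reference date is made up.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: attribute value missing or blank.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation missing")
    # Noisy: value that cannot be correct.
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: stated age disagrees with the age implied by the birthday.
    born = record["birthday"]
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    age_from_birthday = today.year - born.year - (0 if had_birthday else 1)
    if age_from_birthday != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems


# Record built from the slide's examples; the reference date is an assumption.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2024, 1, 1))
print(problems)
```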
Data Mining: Concepts and Techniques
Why Is Data Dirty?
The KDD process
"KDD is the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in
data."
Steps
The process operates on the following basic steps:
(i) identifying the goal from the user's point of view (based on
the relevant knowledge about the domain),
(ii) creating a target data set,
(iii) data preprocessing,
(iv) data reduction and projection,
(v) matching the goals of the KDD process to a particular data mining method,
(vi) exploratory analysis,
(vii) data mining,
(viii) interpreting mined patterns,
(ix) acting on the discovered knowledge.
These steps can be divided into three tasks:
the preprocessing of data (steps i - vi),
the mining of data (step vii), and
the postprocessing of data (steps viii - ix).
Fig. : The KDD Process
KDD Process Ex: Web Log
Selection:
Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
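The web log steps above can be sketched end to end as follows. Everything specific here is an assumption for illustration: the log layout (user, timestamp in seconds, URL, HTTP status), the 30-minute session timeout, the choice of consecutive page pairs as the pattern, and the sample entries.

```python
from collections import Counter

# Selection: hypothetical log entries (user, timestamp in seconds, url, status).
log = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u2", 10, "/home", 200),
    ("u2", 20, "/missing", 404),
    ("u2", 50, "/products", 200),
    ("u1", 5000, "/home", 200),
    ("u1", 5020, "/cart", 200),
]

# Preprocessing: remove error logs (status 400 and above).
clean = [(user, ts, url) for user, ts, url, status in log if status < 400]


def sessionize(entries, timeout=1800):
    """Transformation: sort by user and time, then split into sessions."""
    sessions = []
    prev_user, prev_ts = None, None
    for user, ts, url in sorted(entries):
        # Start a new session on a new user or after a long gap.
        if user != prev_user or ts - prev_ts > timeout:
            sessions.append([])
        sessions[-1].append(url)
        prev_user, prev_ts = user, ts
    return sessions


def count_sequences(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(clean)
counts = count_sequences(sessions)

# Interpretation/evaluation: display the most frequently accessed sequences.
for pair, n in counts.most_common(2):
    print(pair, n)
```

Frequent pairs like these are the kind of result a cache-prediction or personalization feature could build on.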
KDD Issues
Human Interaction
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application