
Unit-4 Introduction To Data Mining

Data mining is an information extraction activity that aims to discover hidden facts contained within large databases. Some basic data mining tasks include classification, regression, clustering, pattern mining, summarization, and link analysis. Data preprocessing is an important step in the KDD process and involves cleaning data by filling in missing values, smoothing noisy data, identifying outliers, and resolving inconsistencies.

Uploaded by

Shaheen Mondal

Unit-4

Introduction to Data Mining


Data Mining is an information extraction activity whose goal is to discover hidden facts contained in large databases.

Data Mining Models and Tasks
BASIC TASKS

• Classification: a data mining technique that assigns each data item to one of a set of predefined groups (classes).

• For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”. Popular classification techniques include decision trees and neural networks.

Classification

• Given old data about customers and payments, predict a new applicant’s loan eligibility.

[Figure: data on previous customers (age, salary, profession, location, customer type) is used to build a classifier; the classifier yields decision rules such as "Salary > 5 L" and "Prof. = Exec", which label a new applicant’s data as good or bad.]

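The loan example can be sketched in Python as a pair of hand-written decision rules. This is only an illustration: a real classifier would learn its rules from the previous customers' data, and the `Applicant` type, the field names, and the way the two rules combine (either one is enough for "good") are assumptions, since the slide shows only the two conditions.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    # Hypothetical record layout; only some of the slide's attributes are used.
    age: int
    salary_lakh: float  # annual salary in lakhs ("L" on the slide)
    profession: str


def classify(applicant: Applicant) -> str:
    """Return "good" or "bad" using the two rules shown on the slide.

    Assumption: an applicant is "good" if either rule fires.
    """
    if applicant.salary_lakh > 5:
        return "good"
    if applicant.profession == "Exec":
        return "good"
    return "bad"


new_applicant = Applicant(age=30, salary_lakh=6.5, profession="Clerk")
print(classify(new_applicant))
```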
DATA MINING TASKS (contd.)
• Regression: used to predict a numeric value for an individual on the basis of information gained from a previous sample of similar individuals.

Example:
• A person wants to estimate his future savings from his current savings and several past values. He uses a linear regression formula to make the prediction.

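The savings example can be sketched as a least-squares line fitted to past values. The figures below are made up for illustration, and `fit_line` is a hypothetical helper name, not something taken from the slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up savings (in thousands) for years 1 to 5.
years = [1, 2, 3, 4, 5]
savings = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(years, savings)
predicted = slope * 6 + intercept  # predicted savings for year 6
print(predicted)
```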
DATA MINING TASKS (contd.)
• Clustering: a data mining technique used to place data elements into related groups without advance knowledge of the group definitions.

• Example: A department store chain creates special catalogues targeted to various types of customer groups based on attributes such as income, location, etc.

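One way to form such customer groups is k-means clustering, sketched here on a single attribute (income). The slides do not name a clustering algorithm, so k-means is an assumption, as are the income values and the starting centroids.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then re-average the groups."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up customer incomes (in thousands).
incomes = [20, 22, 25, 80, 85, 90]
centroids, clusters = kmeans_1d(incomes, centroids=[20, 90])
print(clusters)
```

No group labels are supplied in advance; the two groups emerge from the data, which is what distinguishes clustering from classification.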
DATA MINING TASKS (contd.)
• Pattern mining is a data mining method that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products.

• For example, the association rule “cold drink ⇒ potato chips (80%)” states that four out of five customers who bought a cold drink also bought potato chips.

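The 80% figure in the rule above is its confidence: the share of transactions containing the left-hand side that also contain the right-hand side. A minimal sketch, using a made-up transaction list chosen so that the rule comes out at 80%:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up baskets: 5 contain a cold drink, 4 of those also contain chips.
transactions = [
    {"cold drink", "potato chips"},
    {"cold drink", "potato chips", "bread"},
    {"cold drink", "potato chips"},
    {"cold drink", "potato chips", "butter"},
    {"cold drink"},
    {"bread", "butter"},
]

conf = confidence(transactions, {"cold drink"}, {"potato chips"})
print(f"cold drink => potato chips ({conf:.0%})")
```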
DATA MINING TASKS (contd.)
• Summarization maps data into subsets with associated simple descriptions (characterization or generalization).
  • Ex: GATE score

• Link Analysis uncovers relationships among data.
  • Association Rules
  • Sequential Analysis determines sequential patterns.

Data Mining Application: Marketing
• Sales Analysis
  • associations between product sales:
    • bread and butter
    • toothpaste and toothbrush

• Customer Profiling
  • data mining can tell you what types of customers buy what products
• Identifying Customer Requirements
  • identify the best products for different customers
  • use prediction to find what factors will attract new customers
Data Mining Application: Fraud Detection
• Association rule mining can detect a group of people who stage accidents to collect on insurance

• A data mining application can be used to detect suspicious money transactions

• Data mining can be used to help commercial lending decisions and to prevent fraud

Data Preprocessing

Why Data Preprocessing?
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”
  • noisy: containing errors or outliers
    • e.g., Salary=“-10”
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., was rating “1, 2, 3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records
13 Data Mining: Concepts and Techniques
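The three kinds of dirty data can be detected with simple checks, sketched below on a record built from the slide's own examples. The field names and the `find_problems` helper are hypothetical. The reference date is taken from the date printed in the slide footers, and only the year of the birthday is used because the slide does not say whether "03/07" is day/month or month/day.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: the attribute is absent or blank.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is missing")
    # Noisy: a salary cannot be negative.
    if record.get("salary", 0) < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: stated age should agree with the birth year.
    birth_year = int(record["birthday"].split("/")[-1])
    # Age computed from the year alone can be off by one, so allow that slack.
    if abs((today.year - birth_year) - record["age"]) > 1:
        problems.append("inconsistent: age does not match birthday")
    return problems


record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/1997"}
problems = find_problems(record, today=date(2015, 8, 10))
for p in problems:
    print(p)
```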
Why Is Data Dirty?

• Incomplete data may come from
  • “Not applicable” data value when collected
  • Different considerations between the time when the data was collected and when it is analyzed
  • Human/hardware/software problems
• Noisy data (incorrect values) may come from
  • Faulty data collection instruments
  • Human or computer error at data entry
  • Errors in data transmission
• Inconsistent data may come from
  • Different data sources
  • Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
14 Data Mining: Concepts and Techniques August 10, 2015
Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
  • Quality decisions must be based on quality data
    • e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  • Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse



Multi-Dimensional Measure of Data Quality
• Properties of a well-accepted multidimensional view:
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility



Major Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Obtains reduced representation in volume but produces the same or similar analytical results
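Two of the tasks above, filling in missing values (cleaning) and normalization (transformation), can be sketched in a few lines. Mean imputation and min-max scaling are one common choice each, not the only ones, and the slides do not prescribe them; the salary values are made up.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


salaries = [30.0, None, 50.0, 70.0]          # one missing value
filled = fill_missing_with_mean(salaries)    # cleaning step
normalized = min_max_normalize(filled)       # transformation step
print(filled)
print(normalized)
```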
Forms of Data Preprocessing



KDD Process

The KDD process
"KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".

Steps: The process operates on the following basic steps:
• (i) identifying the goal from the user's point of view (based on the relevant knowledge about the domain),
• (ii) creating a target data set,
• (iii) data preprocessing,
• (iv) data reduction and projection,
• (v) matching the goals of the KDD process to a particular data mining method,
• (vi) exploratory analysis,
• (vii) data mining,
• (viii) interpreting mined patterns,
• (ix) acting on the discovered knowledge.

• These steps can be divided into three tasks:
  • the preprocessing of data (steps i - vi),
  • the mining of data (step vii), and
  • the postprocessing of data (steps viii - ix).

• The domain knowledge helps the process to focus on the research content.

Fig.: The KDD Process

KDD Process Ex: Web Log
• Selection:
  • Select log data (dates and locations) to use
• Preprocessing:
  • Remove identifying URLs
  • Remove error logs
• Transformation:
  • Sessionize (sort and group)
• Data Mining:
  • Identify and count patterns
  • Construct data structure
• Interpretation/Evaluation:
  • Identify and display frequently accessed sequences.
• Potential User Applications:
  • Cache prediction
  • Personalization

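The transformation and data mining steps of the web log example can be sketched as follows. The log layout `(user, timestamp_seconds, url)`, the 30-minute session gap, and the choice to count consecutive page pairs as the "patterns" are all assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter, defaultdict


def sessionize(log, gap=1800):
    """Sort each user's hits by time and split them into sessions on long gaps."""
    by_user = defaultdict(list)
    for user, timestamp, url in log:
        by_user[user].append((timestamp, url))
    sessions = []
    for hits in by_user.values():
        hits.sort()
        current = [hits[0][1]]
        for (prev_t, _), (t, url) in zip(hits, hits[1:]):
            if t - prev_t > gap:  # long pause: close the session, start a new one
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Count how often one page is followed directly by another."""
    counts = Counter()
    for s in sessions:
        counts.update(zip(s, s[1:]))
    return counts


# Made-up, already cleaned log entries.
log = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 120, "/cart"),
    ("u2", 30, "/home"), ("u2", 90, "/products"),
    ("u1", 10000, "/home"),
]
sessions = sessionize(log)
pairs = count_pairs(sessions)
print(pairs.most_common(1))
```

Frequently accessed sequences found this way are the kind of result that cache prediction or personalization could then act on.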
KDD Issues
• Human Interaction
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality

KDD Issues (contd.)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application