Data Mining
By:
Nithin Bhaktha H
4NM07CS073
NMAMIT, Nitte
What is Data Mining?
Data mining:
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
Data Mining
[Figure: everyday data sources surrounding a central "Data" node, including airline reservations, telephone calls, bank transactions, credit card charges and tax returns.]
Multidisciplinary
[Figure: Data Mining at the intersection of Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases and KDD.]
Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990. He
defined it in the following way: "A warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection of data in support
of management's decision making process". He defined the terms in
the sentence as follows:
Subject-oriented:
Data that gives information about a particular subject instead of
about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time
period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data is
never removed. This enables management to gain a consistent
picture of the business.
Wednesday, December 08, 202 Data Mining 6
What can DM do?
Data mining: the core of the knowledge discovery process.
[Figure: the KDD process, from raw data to evaluated patterns]
Databases --> Data Cleaning, Data Integration --> Data Warehouse --> Data Selection, Data Preprocessing --> Task-relevant Data --> Data Mining --> Pattern Evaluation
Steps of a KDD Process
[Figure: layered pyramid, listed top to bottom; the potential to support business decisions increases toward the top]
Making Decisions (End User)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Mining Techniques
Classification
Clustering
Mining Associations
Sequential Pattern Discovery
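[Figure: classification. Only rows 7 to 10 of the training set survive; the columns appear to be record ID, a yes/no attribute, marital status, income and class label.]
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes
Training Set --> Learn Classifier --> Model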
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
▪ Use the data for a similar product introduced before.
▪ We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
▪ Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
▪ Type of business, where they stay, how much they earn, etc.
▪ Use this information as input attributes to learn a classifier model.
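The approach above can be sketched in a few lines. The slides do not say which classifier is used, so this sketch assumes a very simple one-rule (1R) learner: it picks the single attribute whose values best predict the {buy, don't buy} class. The records, attribute names and the helper names `learn_one_rule` and `predict` are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records from an earlier, similar product launch.
# Each record is (input attributes, class attribute).
past_customers = [
    ({"business": "retail", "city": "urban", "income": "high"}, "buy"),
    ({"business": "retail", "city": "rural", "income": "low"}, "dont_buy"),
    ({"business": "it", "city": "urban", "income": "high"}, "buy"),
    ({"business": "it", "city": "urban", "income": "low"}, "dont_buy"),
    ({"business": "farming", "city": "rural", "income": "high"}, "buy"),
    ({"business": "farming", "city": "rural", "income": "low"}, "dont_buy"),
]

def learn_one_rule(records):
    """Return (attribute, {value: predicted class}) with the fewest training errors."""
    best_attr, best_rule, best_errors = None, None, len(records) + 1
    for attr in records[0][0]:
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for attrs, label in records:
            counts[attrs[attr]][label] += 1
        # Predict the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[attrs[attr]] for attrs, label in records)
        if errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    return best_attr, best_rule

def predict(model, attrs, default="dont_buy"):
    """Classify a new customer; unseen attribute values fall back to `default`."""
    attr, rule = model
    return rule.get(attrs[attr], default)

model = learn_one_rule(past_customers)

# Mail only the prospects the model expects to buy.
prospects = [
    {"business": "it", "city": "rural", "income": "high"},
    {"business": "retail", "city": "urban", "income": "low"},
]
mailing_list = [p for p in prospects if predict(model, p) == "buy"]
print(model[0], len(mailing_list))
```

On this toy data the learner selects `income`, so only the first prospect is mailed. A real campaign would use a stronger model and hold out part of the data to estimate accuracy.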
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
▪ Use credit card transactions and information about the account
holder as attributes.
▪ When the customer buys, what the customer buys, how often the
customer pays on time, etc.
▪ Label past transactions as fraud or fair transactions. This forms the
class attribute.
▪ Learn a model for the class of the transactions.
▪ Use this model to detect fraud by observing credit card transactions
on an account.
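A minimal sketch of these steps follows. The slides do not name a model, so this assumes a naive Bayes classifier over categorical attributes with add-one smoothing. The labelled history, the attribute names and the functions `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled history: (transaction attributes, class attribute).
history = [
    ({"time": "night", "item": "electronics", "pays_on_time": "no"}, "fraud"),
    ({"time": "night", "item": "jewellery", "pays_on_time": "no"}, "fraud"),
    ({"time": "day", "item": "groceries", "pays_on_time": "yes"}, "fair"),
    ({"time": "day", "item": "electronics", "pays_on_time": "yes"}, "fair"),
    ({"time": "night", "item": "groceries", "pays_on_time": "yes"}, "fair"),
    ({"time": "day", "item": "groceries", "pays_on_time": "no"}, "fair"),
]

def train(records):
    """Count classes and per-class attribute values from labelled transactions."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for attrs, label in records:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def classify(model, attrs):
    """Return the class with the highest naive Bayes log-score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for attr, value in attrs.items():
            # Add-one (Laplace) smoothing so unseen values do not zero the score.
            numerator = value_counts[(label, attr)][value] + 1
            denominator = n + len(values_seen[attr])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(history)
suspicious = {"time": "night", "item": "jewellery", "pays_on_time": "no"}
routine = {"time": "day", "item": "groceries", "pays_on_time": "yes"}
print(classify(model, suspicious), classify(model, routine))
```

Real fraud data is heavily imbalanced (fraud is rare), so in practice the class prior and the decision threshold would need more care than this sketch gives them.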
[Figure: clustering]
Intracluster distances are minimized.
Intercluster distances are maximized.
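One common way to pursue this goal is k-means, shown here as a rough sketch. The slides do not name a clustering algorithm, so k-means is an assumption, as are the sample points, the starting centroids and the function name `kmeans`.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute centroids; an empty cluster keeps its old one.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

Each update moves a centroid toward the middle of its points, which shrinks intracluster distances. With well-separated data such as this, the two groups end up far apart.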
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
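The two rules can be checked against the five transactions using support and confidence, the usual measures of rule strength. The slides do not define these measures, so their use here is an assumption; the helper names `count`, `support` and `confidence` are hypothetical.

```python
# The five transactions from the table, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return count(lhs | rhs) / count(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs, rhs),
          "confidence", round(confidence(lhs, rhs), 2))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk (support 0.6, confidence 0.75). {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions containing both Diaper and Milk (support 0.4, confidence about 0.67).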
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis
Thank You