
2 Data Mining

Uploaded by

ygayathri2003
DATA MINING

Coping with Information

• Computerization of daily life produces data
  • Point-of-sale, Internet shopping (and browsing), credit cards, banks . . .
  • Info on credit cards, purchase patterns, payment history, sites visited . . .
  • Travel: one trip by one person generates info on destination, airline preferences, seat selection, hotel, rental car, name, address, restaurant choices . . .
• Data cannot be processed or even inspected manually
• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

Data Overload

• Vast quantities of data are collected and stored out of fear that important info will be missed
• Data volume grows so fast that old data is never analyzed
• Only a small portion of the data collected is analyzed (estimate: 5%)
• Database systems do not support queries like
  • “Who is likely to buy product X?”
  • “List all reports of problems similar to this one”
  • “Flag all fraudulent transactions”
• But these may be the most important questions!

Why mine data?

• There is often information “hidden” in the data that is not readily evident
• “More often, data mining yields unexpected nuggets of information that open the company’s eyes to new markets, new ways of reaching customers and new ways of doing business”
• Human analysts may take a very long time to discover useful information

What Is Data Mining?

• Data mining (knowledge discovery in databases):
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names:
  • Data mining: a misnomer?
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data Mining: A KDD Process

• Data mining: the core of the knowledge discovery process.

[Figure: the KDD process, bottom to top: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection & transformation -> Data Mining -> patterns -> Evaluation & presentation]

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structure data, graphs, social networks and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia databases
  • Text databases
  • The World-Wide Web

Data Mining: Concepts and Techniques February 11, 2025


Data Mining Functions
(What kind of patterns can be mined)

Concept/class Descriptions
Mining frequent patterns, Associations & correlation
Classification & Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
Concept/class description:

• Data can be associated with classes or concepts.
• E.g., in an electronics store:
  • Classes of items: computers and printers
  • Concepts of customers: big spenders and budget spenders

Mining frequent patterns, Associations & Correlation:

• Frequent patterns are patterns that occur frequently in data.
• There are many kinds of frequent patterns, including itemsets, subsequences and substructures.
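
As a rough illustration of frequent itemsets, the sketch below counts how often each item and each pair of items appears in a small set of transactions and keeps those that meet a minimum support. The transactions, the function name `frequent_itemsets` and the 50% threshold are assumptions made up for this example; the counting is plain brute-force enumeration, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (not taken from the slides).
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "software"},
    {"printer", "paper"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    that occur in at least a min_support fraction of the transactions."""
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            # Enumerate every subset of this size within the basket.
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    total = len(transactions)
    return {
        items: count / total
        for items, count in counts.items()
        if count / total >= min_support
    }


result = frequent_itemsets(transactions, min_support=0.5)
for items, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(items), support)
```

Enumerating all subsets becomes expensive as baskets grow, which is why practical miners prune candidates instead of counting everything.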


Classification & prediction

• Given: a collection of records (training set)
• Task:
  • Find a model for the class attribute as a function of the other attributes
  • Use the model to predict the class of previously unseen records
• Goal:
  • The model should accurately predict the class of previously unseen records (test set)
Process (1): Model Construction

Training Data -> Classification Algorithms -> Classifier (Model)

Training data:

NAME    RANK            YEARS  TENURED
ravi    Assistant Prof  3      no
suresh  Assistant Prof  7      yes
raghav  Professor       2      yes
rohan   Associate Prof  7      yes
david   Assistant Prof  6      no
shiva   Associate Prof  3      no

Classifier (model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Process (2): Using the Model in Prediction

Testing Data -> Classifier
Unseen Data -> Classifier

Testing data:

NAME     RANK            YEARS  TENURED
mellisa  Assistant Prof  2      no
ritu     Associate Prof  7      no
priya    Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Sriram, Professor, 4). Tenured?
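
The two process slides can be traced in a few lines of code. This minimal sketch hard-codes the rule from the model-construction slide (rank is professor OR years > 6) rather than learning it, then measures how often it agrees with the training and testing tables and applies it to the unseen record. The function name `predict_tenured` and the tuple layout are assumptions for illustration.

```python
# Records transcribed from the slides: (name, rank, years, tenured).
training_set = [
    ("ravi", "Assistant Prof", 3, "no"),
    ("suresh", "Assistant Prof", 7, "yes"),
    ("raghav", "Professor", 2, "yes"),
    ("rohan", "Associate Prof", 7, "yes"),
    ("david", "Assistant Prof", 6, "no"),
    ("shiva", "Associate Prof", 3, "no"),
]
test_set = [
    ("mellisa", "Assistant Prof", 2, "no"),
    ("ritu", "Associate Prof", 7, "no"),
    ("priya", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def predict_tenured(rank, years):
    """The slide's model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def accuracy(records):
    """Fraction of records whose predicted label matches the recorded label."""
    hits = sum(predict_tenured(rank, years) == label
               for _, rank, years, label in records)
    return hits / len(records)


train_accuracy = accuracy(training_set)
test_accuracy = accuracy(test_set)
sriram = predict_tenured("Professor", 4)  # the unseen record (Sriram, Professor, 4)
print(train_accuracy, test_accuracy, sriram)
```

The rule fits every training record but misclassifies ritu (Associate Prof, 7 years, not tenured) in the testing data, which is why accuracy is judged on records the model has not seen.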
age(x, “youth”) AND income(x, “high”) -> class(x, “A”)
age(x, “youth”) AND income(x, “low”) -> class(x, “B”)
age(x, “middle-aged”) -> class(x, “C”)
age(x, “senior”) -> class(x, “C”)

Fig: IF-THEN rules


Age?
  youth -> Income?
    high -> Class A
    low -> Class B
  middle-aged, senior -> Class C

Fig: decision tree
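
The rules and the tree describe the same model, which a nested conditional makes explicit. The sketch below is one possible encoding; the function name `classify` and the exact string labels are assumptions for illustration.

```python
def classify(age, income=None):
    """Walk the decision tree: test age first, then income for youths only."""
    if age == "youth":
        if income == "high":
            return "A"
        if income == "low":
            return "B"
        raise ValueError(f"income must be 'high' or 'low' for youths, got {income!r}")
    if age in ("middle-aged", "senior"):
        # Income is not consulted on this branch of the tree.
        return "C"
    raise ValueError(f"unknown age group: {age!r}")


print(classify("youth", "high"))   # follows the first IF-THEN rule
print(classify("senior"))          # follows the last IF-THEN rule
```

Each root-to-leaf path in the tree corresponds to one IF-THEN rule, so the two figures can be converted into each other.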
Clustering

“The art of finding groups in data”

• Given:
  • A set of data points
  • Each data point has a set of attributes
  • A distance/similarity measure between data points, e.g., Euclidean distance, cosine distance, etc.
• Task:
  • Partition the data points into separate groups (clusters)
• Goal:
  • Data points that belong to the same cluster are similar to one another
• Much more difficult than classification since the classes are not known in advance (no training)
• Technique: unsupervised learning

The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
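
To make the grouping principle concrete, here is a minimal sketch using k-means with Euclidean distance. The slides do not name a specific clustering algorithm, so k-means is a choice made for this example; the sample points and the starting centroids are also assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its cluster, until nothing changes."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids


# Hypothetical 2-D data points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```

No class labels are supplied anywhere; the groups emerge only from the distance measure, which is what makes this unsupervised.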
Outlier analysis

• A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
• Some data mining methods discard outliers as noise or exceptions.
• However, outlier analysis is useful in some applications, such as fraud detection.
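
One simple way to flag values that do not follow the general behavior of the data is a z-score test, sketched below. The slides do not prescribe a method, so the z-score approach, the threshold of 2.0 and the transaction amounts are all assumptions for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = find_outliers(amounts)
print(flagged)
```

In a fraud-detection setting the flagged transactions would be reviewed rather than discarded as noise.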
Evolution analysis

• It describes and models regularities or trends for objects whose behavior changes over time.

Example:
A college's results over the last several years would give an idea of the quality of the graduates it produces.
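
A trend of this kind can be summarized by fitting a straight line to the yearly figures. The sketch below uses ordinary least squares; the years, the pass percentages and the function name `linear_trend` are hypothetical and serve only to show the calculation.

```python
def linear_trend(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical pass percentages of a college over five years.
years = [2019, 2020, 2021, 2022, 2023]
pass_rate = [70.0, 72.0, 71.0, 75.0, 77.0]
slope, intercept = linear_trend(years, pass_rate)
print(f"Pass rate changes by about {slope:.1f} points per year")
```

A positive slope suggests improving results over the period; a single line is only a coarse summary and ignores year-to-year fluctuations.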
