01 Intro to Data Mining

The document discusses the importance and process of data mining, emphasizing its role in extracting valuable knowledge from vast amounts of data generated daily across various sectors. It outlines the steps involved in data mining, including problem definition, data preparation, exploration, modeling, evaluation, and deployment. Additionally, it highlights various applications of data mining, such as fraud detection, healthcare improvement, and market analysis.

Uploaded by William D2
Copyright © All Rights Reserved

Adapted from Dr. Saed Sayad's Lectures


▪ Why Data Mining?
▪ What is Data Mining?
▪ Data Mining Applications
▪ Data Mining Tasks
▪ Data Mining Steps
The Explosive Growth of Data: from terabytes to petabytes
Vast amounts of data are collected daily: the commercial viewpoint
➢ Web data
➢ e-commerce
➢ purchases at department/grocery stores
➢ bank/credit card transactions
Vast amounts of data are collected daily: the scientific viewpoint
➢ remote sensors on a satellite
➢ telescopes scanning the skies
➢ microarrays generating gene expression data
➢ scientific simulations generating terabytes of data
▪ Information retrieval (Databases) is simply not enough anymore for decision making

▪ We are drowning in data, but starving for knowledge!

▪ Data mining: automated analysis of massive data sets


▪ Also known as Knowledge Discovery in Databases (KDD)

▪ It is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
▪ A multi-disciplinary field that combines Statistics, AI & Machine Learning, and Database & Data Warehousing
Data mining is the process of discovering interesting
patterns and knowledge from large amounts of data.
The data sources can include databases, data warehouses,
the Web, other information repositories, or data that are
streamed into the system dynamically.
▪ Improving health care and reducing costs
▪ Predicting the impact of climate change
▪ Reducing hunger and poverty by increasing agricultural production
▪ Finding alternative/green energy sources
Prediction Methods
✓ Use some variables to predict unknown or future values
of other variables.

Description Methods
✓ Find human-interpretable patterns that describe the
data.
Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Find a model for the class attribute as a function of the values of other attributes

Tid  Employed  Level of Education  # years at present address  Credit Worthy (Class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness
Employed
  No  --> No
  Yes --> Education
    Graduate --> Number of years
      > 3 yr --> Yes
      < 3 yr --> No
    { High school, Undergrad } --> Number of years
      > 7 yrs --> Yes
      < 7 yrs --> No
Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set --> Learn Classifier --> Model, which is then applied to the Test Set
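As a rough illustration of applying a classification model, the sketch below hard-codes the decision tree from the credit-worthiness slide as a Python function and checks it against the four training records shown. Nothing is learned here; the tree is simply transcribed. The function name and the tuple layout are hypothetical, and because the slide only gives "> 3 yr / < 3 yr" and "> 7 yrs / < 7 yrs", treating a value exactly on the threshold as "No" is an assumption.

```python
# Records copied from the slide: (employed, education, years_at_address, credit_worthy)
TRAINING_SET = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

# Unlabeled records from the slide's test set: (employed, education, years_at_address)
TEST_SET = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]


def predict_credit_worthy(employed, education, years):
    """Walk the slide's decision tree and return "Yes" or "No"."""
    if employed == "No":
        return "No"
    # Graduates are split at 3 years, High School / Undergrad at 7 years.
    threshold = 3 if education == "Graduate" else 7
    # Assumption: a value exactly equal to the threshold is classified "No",
    # since the slide does not say which branch it belongs to.
    return "Yes" if years > threshold else "No"


# The transcribed tree should reproduce every training label.
training_predictions = [predict_credit_worthy(e, ed, y) for e, ed, y, _ in TRAINING_SET]
test_predictions = [predict_credit_worthy(e, ed, y) for e, ed, y in TEST_SET]

for record, label in zip(TEST_SET, test_predictions):
    print(record, "-->", label)
```

Note that the first test record sits exactly on the 7-year boundary, so its predicted label depends entirely on the threshold assumption above.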
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
GOAL:
Predict fraudulent cases in credit card transactions

APPROACH:
▪ Use credit card transactions and the information on the account holder as attributes.
▪ When does a customer buy, what do they buy, how often do they pay on time, etc.

▪ Label past transactions as fraud or fair transactions. This forms the class attribute.
▪ Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

▪ Examples:
✓ Predicting sales amounts of a new product based on advertising expenditure.
✓ Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
✓ Time series prediction of stock market indices.
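To make the linear case concrete, here is a minimal sketch of ordinary least squares with a single predictor, in the spirit of the advertising example. The numbers are made up for illustration and do not come from the slides; `fit_line`, `ad_spend`, and `sales` are hypothetical names.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales, in arbitrary units.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)

# Predict sales for an expenditure level not in the data.
predicted_sales = intercept + slope * 5.0
print(slope, intercept, predicted_sales)
```

A nonlinear model of dependency would replace the straight line with another function family, but the idea of fitting parameters to past data and then predicting a continuous value stays the same.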
▪ Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
▪ Intra-cluster distances are minimized
▪ Inter-cluster distances are maximized
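The two distance notions can be checked numerically. The sketch below takes a hypothetical grouping of 2-D points (not from the slides) and compares the mean distance between points in the same cluster with the mean distance between points in different clusters. It only evaluates a given grouping; it is not a clustering algorithm.

```python
from itertools import combinations
from math import dist

# Hypothetical clusters of 2-D points, chosen so the groups are well separated.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)],
}


def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance over all pairs of points sharing a cluster."""
    distances = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(distances) / len(distances)


def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance over all pairs of points in different clusters."""
    distances = [
        dist(p, q)
        for (_, points_a), (_, points_b) in combinations(clusters.items(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(distances) / len(distances)


intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

A good grouping in the sense of the slide is one where the first number is small relative to the second.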
▪ Given a set of records, each of which contains some number of items from a given collection
▪ Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
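One common way to judge such rules is by their support and confidence. The slides do not name these measures, so using them here is an assumption; the transactions are the five from the table. A minimal sketch, with hypothetical function names:

```python
# The five transactions from the slide's table.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count_containing(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    return count_containing(lhs | rhs) / len(TRANSACTIONS)


def confidence(lhs, rhs):
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    return count_containing(lhs | rhs) / count_containing(lhs)


rules = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For example, Milk appears in four transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} holds in three of the four cases where it applies.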
▪ Market-basket analysis
▪ Rules are used for sales promotion, shelf management, and inventory management

▪ Telecommunication alarm diagnosis
▪ Rules are used to find combinations of alarms that occur together frequently in the same time period

▪ Medical Informatics
▪ Rules are used to find combinations of patient symptoms and test results associated with certain diseases
▪ Detect significant deviations from normal behavior
▪ Applications:
▪ Credit card fraud detection
▪ Network intrusion detection
▪ Identifying anomalous behavior from sensor networks for monitoring and surveillance
▪ Detecting changes in the global forest cover
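One simple way to flag a "significant deviation" is a z-score test: measure how many standard deviations each value lies from the mean and flag those beyond a cutoff. This is only one of many possible techniques and is not taken from the slides; the sensor readings and the cutoff of 2.0 are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical sensor readings; the last value is an obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 50]


def find_anomalies(values, cutoff=2.0):
    """Return the values whose absolute z-score exceeds `cutoff`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > cutoff]


anomalies = find_anomalies(readings)
print(anomalies)
```

Because the outlier itself inflates the mean and standard deviation, this approach can miss anomalies in small samples; more robust statistics (such as the median) are a common alternative.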
1 • Problem Definition
2 • Data Preparation
3 • Data Exploration
4 • Modeling
5 • Evaluation
6 • Deployment
▪ Understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition with a preliminary plan designed to achieve the objectives.

▪ A successful data mining project starts from a well-defined question or need.
▪ Data Preparation involves data, datasets, databases, and ETL (Extraction, Transformation & Loading)
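As a small illustration of the ETL idea, the sketch below extracts a few rows of the taxpayer table shown earlier in the deck from CSV text, transforms the string fields into typed values, and loads them into an in-memory SQLite database. The CSV layout, the table name `taxpayers`, and the function names are assumptions made for this example.

```python
import csv
import io
import sqlite3

# A few rows from the deck's taxpayer table, written as CSV (layout is assumed).
RAW_CSV = """Tid,Refund,Marital Status,Taxable Income,Cheat
1,Yes,Single,125K,No
2,No,Married,100K,No
5,No,Divorced,95K,Yes
"""


def extract(text):
    """Extraction: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert Yes/No to booleans and '125K' to 125000."""
    return [
        (
            int(row["Tid"]),
            row["Refund"] == "Yes",
            row["Marital Status"],
            int(row["Taxable Income"].rstrip("K")) * 1000,
            row["Cheat"] == "Yes",
        )
        for row in rows
    ]


def load(records):
    """Loading: insert the cleaned records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxpayers ("
        "tid INTEGER PRIMARY KEY, refund INTEGER, marital_status TEXT, "
        "taxable_income INTEGER, cheat INTEGER)"
    )
    conn.executemany("INSERT INTO taxpayers VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()
    return conn


conn = load(transform(extract(RAW_CSV)))
for row in conn.execute("SELECT * FROM taxpayers ORDER BY tid"):
    print(row)
```

In a real project each stage would typically read from and write to separate systems; keeping them as three small functions mirrors that separation.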
