
Data Mining Overview: by Dr. Sunil D. Lakdawala

This document provides an overview of data mining, including definitions, applications, techniques, and the data mining methodology and process. It defines data mining as the process of discovering meaningful patterns and rules from large amounts of data. The main applications discussed are marketing, customer relationship management, and process improvement. The key techniques covered are supervised learning (classification and prediction), unsupervised learning (clustering, association rules), and dimension reduction. It also outlines the 11 step data mining methodology to avoid discovering patterns that are not true or useful.

Uploaded by

priyankarora
Copyright
© Attribution Non-Commercial (BY-NC)

Data Mining Overview

By
Dr. Sunil D. Lakdawala
Content
Case Data Mining – Supervised Learning
Case Data Mining – Unsupervised Learning
Definition
Applications
Techniques
Supervised Learning
Unsupervised Learning

Data Mining - Overview 2


Content
DM as a Business Process
DM Methodology
References



Definition
• Advanced methods for exploring and modeling
relationships in large amounts of data (SAS)
• “Process of discovering meaningful new
correlations, patterns and trends by sifting through
large amounts of data stored in repositories, using
pattern recognition technologies as well as
statistical and mathematical techniques.” (Gartner
Group)



Definition (Cont)
• Process of exploration and analysis, by automatic
or semi automatic means, of large quantities of
data in order to discover meaningful patterns and
rules
Since the middle of the 1900s, corporate data has
increased by a factor of 100,000 due to automated
operations, creating enormous opportunities to
improve business decision making



Applications
• Data Mining is useful when there is a large amount
of data and something worth learning (i.e. the
resulting knowledge is worth more money than it
costs to discover)
• Research (e.g. large pharma companies)
• Process Improvement
• Marketing
• Customer Relationship Management (CRM)



Applications (Cont)
• CRM (cont)
– Presenting single image of organization
– Keeping single image of customer
– Knowing Likes and dislikes of customers
– Anticipating their needs and exploiting them
proactively
– Recognizing their displeasure and doing something
before it is too late
Popular Applications
(source – kdnuggets)



Techniques
Supervised Learning (Directed Knowledge
Discovery)
• Classification (e.g. assigning customers to
predefined segment. Discrete classes)
• Estimation / Regression (e.g. Value of real estate.
Continuous)
• Prediction: Classification or Estimation for future
(Which customer will close account in 6 month)
• Time Series Analysis



Techniques (Cont)
Unsupervised Learning (Undirected Knowledge
Discovery)
• Association Rules (Affinity Grouping): Which
things go together
• Sequence Discovery: Association Rules based on
time
• Clustering: Segmenting diverse group into number
of similar group / cluster
• Dimension reduction
• Summarization / Characterization / Generalization
Overview of Techniques - 1
Classification
• Logistic Regression: Predicts probability of success;
gives subset selection of variables
• Classification Tree: Gives a decision tree with rules of
classification
• Neural Network: Is very opaque but gives a higher level
of accuracy in many situations
• k-Nearest Neighbor: Groups cases into neighbors and
assigns a class based on the majority of cases in a
neighborhood
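
A minimal sketch of the k-nearest-neighbor idea for classification: find the k training cases closest to a new case and take a majority vote over their classes. The customer data, the segment labels and the function name knn_classify are hypothetical, and Euclidean distance is an assumed choice of distance measure.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the majority class among the k training cases nearest to query.

    train: list of (features, label) pairs, features being equal-length tuples.
    """
    # Sort training cases by Euclidean distance to the query and keep k of them.
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical cases: (age, monthly spend) -> predefined customer segment.
customers = [
    ((25, 200), "budget"),
    ((30, 250), "budget"),
    ((28, 220), "budget"),
    ((45, 900), "premium"),
    ((50, 950), "premium"),
    ((48, 1000), "premium"),
]

print(knn_classify(customers, (47, 920)))
```

In practice the predictors would usually be put on a common scale first, since a variable with a large range (here, spend) otherwise dominates the distance.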
Illustrative Applications - Classification

• Target Marketing
• Attrition Prediction/Churn Analysis
• Fraud Detection
• Credit Scoring
Predicting for every case which class it belongs to or
probability of success based on its predictor
variables data



Overview of Techniques - 2
Prediction
• Multiple Linear Regression: Gives predicted values based
on a regression model
• Regression Tree: Gives a decision tree with rules of
prediction
• k-Nearest Neighbors: Groups cases into neighbors and
assigns a value based on the average of cases in a
neighborhood
• Neural Network
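
To illustrate prediction of a continuous value, here is a sketch of least-squares regression reduced to a single predictor (the slide names multiple linear regression; one predictor is a simplification chosen to keep the arithmetic visible). The data and the names fit_line and predict are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    slope, intercept = model
    return intercept + slope * x


# Hypothetical data: floor area vs. market value of a property.
areas = [50, 60, 80, 100]
values = [100, 120, 160, 200]

model = fit_line(areas, values)
print(predict(model, 70))
```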
Illustrative Applications - Prediction

• Forecasting sales
• Predicting price fluctuations
• Predicting profitability of business units
• Predicting market value of assets
• Predicting yield or consumption of critical
inputs
Predicting for every case a value based on its
predictor variables data
Overview of Techniques - 3
Clustering and Dimension Reduction
• k-Means Clustering: For a given number of clusters (the
k value), develops clusters based on minimum distance
between the cluster centers and the cases in the cluster.
• Hierarchical Clustering: Builds clusters through
successive steps by grouping the least dissimilar cases,
finally creating a single cluster. The user can choose
the number of clusters corresponding to a distance
measure.
• Principal Components: Creates new variables, called
Principal Components, that are uncorrelated and that
explain the majority of variability in the original data.
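
A rough sketch of k-means for a given k: assign every case to its nearest center, move each center to the mean of its cases, and repeat until the centers stop moving. Using the first k points as starting centers is an assumption made here for reproducibility (real tools usually pick starting centers more carefully); the points and function names are hypothetical.

```python
import math


def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, iterations=20):
    # Assumed initialization: the first k points serve as starting centers.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments will not change any more
        centers = new_centers
    return centers, assign(points, centers)


points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
centers, clusters = k_means(points, 2)
print(clusters)
```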
Dimension Reduction
• When there are many dimensions
(predictors), say 20, 30 or 50
• Or when several predictors are correlated
• Develop new variables that:
– Explain the major portion of variability in data,
and
– Are uncorrelated
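
As a small illustration of how new, uncorrelated variables can capture most of the variability, the sketch below computes principal components for just two predictors, where the eigenvalues of the 2x2 covariance matrix have a closed form. The function name and the data are hypothetical; with 20 to 50 predictors a linear algebra library would be used instead.

```python
import math


def principal_components_2d(xs, ys):
    """Return (eigenvalues, direction of PC1, share of variance explained by PC1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # Unit vector along the first principal component.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    return (lam1, lam2), pc1, lam1 / (lam1 + lam2)


# Two perfectly correlated predictors: one component carries all the variability.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
eigenvalues, pc1, share = principal_components_2d(xs, ys)
print(eigenvalues, pc1, share)
```

Here the second predictor adds no information beyond the first, so a single new variable along pc1 could replace both.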



Illustrative Applications - Clustering

• Market segmentation
• Product grouping based on customer preferences
• Grouping of business units based on performance
parameters
• Grouping channel partners based on performance
parameters
Grouping of homogeneous cases based on
predefined variables data
Overview of Techniques - 4
Market Basket Analysis / Affinity
• Association Rules: Gives prediction of combinations
of events that will occur together, based on past
occurrences



Illustrative Applications –
Market Basket
• Cross selling
• Product placement in a store
• Forecasting sales

Predicting events that occur together as antecedents and
consequents, with a certain level of confidence and
support (number of events)
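
Support and confidence for a rule "antecedent → consequent" can be sketched as follows. Support is taken here as the share of baskets containing all items of the rule, and confidence as the share of baskets with the antecedent that also contain the consequent. The baskets and function names are hypothetical.

```python
def count(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    # Assumes the antecedent occurs at least once (otherwise division by zero).
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule is normally kept only if both numbers clear user-chosen minimum thresholds.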



DM as Business Process
Identifying the business problem (and how business
benefits will be measured)
• Planning direct marketing campaign - new Product
• Understanding customer attrition
Mining Data to transform data into Actionable
Information
• Who are more likely to buy product
• Which customers are likely to leave. Are they
worth keeping?
DM as Business Process (Cont)
Acting on the information
• Contacting more likely customers
• Offering special services to valuable customers
likely to leave
Measuring the results
• Actual Business benefits achieved as defined
earlier



DM Methodology
Why Methodology?
• Avoid learning that is not true
• Avoid learning that is true but not useful



Learning that is not true
• Incorrect Data
• Data may not be relevant (business situation has changed)
• Summarization of data may have destroyed important
information (Fig 3.1 pg 47)
• With a small volume of data, patterns can emerge by
chance (e.g. when India does well in cricket, the Sensex goes up)
• Model set may not reflect the relevant population (e.g. an
“Issue of Credit” model built only on persons who were given
credit; a poll conducted on the Web)
Learning that is true but not useful
• Learning that is already known: People in areas with no
cell coverage do not buy cell phones
• Learning that cannot be used: Product sales are related to
weather (can you change the weather?). Bad credit history
may be predictive of more insurance claims, but regulators
may prohibit usage of such information



DM Methodology – 11 Steps
Step 1: Translate business problem into DM problem
• State in specific terms (i.e. instead of “Gaining insight into
customer behavior”, “Identify customers who are unlikely to
renew their subscription”)
• Determine type of problem (Classification, Clustering, etc.)
• Decide how results will be used
– Contact high risk / high value customers and try to lure
them with an offer
– Forecast customer population in future months



DM Methodology – 11 Steps (cont)
Step 2: Select appropriate Data
• Input variables
– Which one?
  Ignore input columns with only one value
  Ignore input columns with a unique value for each row
  (e.g. customer name)
  Choose only one column out of two having high
  correlation (e.g. Age_Difference and Age_Ratio)
– What should it contain: Examples of all possible outcomes
– Availability
  Ideally from the DW (if present), but may need to be
  supplemented
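
The three column-screening rules for input variables (single value, unique value per row, highly correlated pairs) could be applied roughly as below. The 0.95 correlation threshold, the restriction of the "unique per row" rule to non-numeric columns (a continuous value such as income is often unique per row without being an identifier), and all names and data are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def screen_columns(rows, corr_threshold=0.95):
    """Return (kept, dropped); dropped maps column name -> reason."""
    columns = list(rows[0])
    data = {col: [row[col] for row in rows] for col in columns}
    dropped = {}
    for col, values in data.items():
        distinct = len(set(values))
        if distinct == 1:
            dropped[col] = "only one value"
        elif distinct == len(rows) and not is_numeric(values):
            dropped[col] = "unique value per row"
    # Among the remaining numeric columns, keep the first of each correlated pair.
    numeric = [c for c in columns if c not in dropped and is_numeric(data[c])]
    for i, first in enumerate(numeric):
        if first in dropped:
            continue
        for second in numeric[i + 1:]:
            if second not in dropped and abs(pearson(data[first], data[second])) > corr_threshold:
                dropped[second] = "highly correlated with " + first
    return [c for c in columns if c not in dropped], dropped


rows = [
    {"country": "IN", "customer_name": "A", "age_difference": 0, "age_ratio": 1.00, "income": 30},
    {"country": "IN", "customer_name": "B", "age_difference": 2, "age_ratio": 1.05, "income": 80},
    {"country": "IN", "customer_name": "C", "age_difference": 4, "age_ratio": 1.10, "income": 20},
    {"country": "IN", "customer_name": "D", "age_difference": 6, "age_ratio": 1.15, "income": 90},
    {"country": "IN", "customer_name": "E", "age_difference": 8, "age_ratio": 1.20, "income": 40},
]

kept, dropped = screen_columns(rows)
print(kept)
print(dropped)
```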



DM Methodology – 11 Steps (cont)
Step 2: Select appropriate Data (Cont)
• Input variables (Cont)
– How Many?
  Do not eliminate at this stage
  Needs to be done later on
– How Much Data?
  The more the merrier
  Needs to be optimized w.r.t. the cost involved in processing, etc.
  (Rule: If doubling the size does not improve the result much,
  stop)
– How much history?
  Consider seasonality. Data that is too old may not be
  relevant. Typically 2 – 3 years for CRM
DM Methodology – 11 Steps (cont)
Step 3: Get to know Data
• Data Type
• Descriptive statistics
• Validation (Why were so many customers born in 1911?
Are they really that old?)



Data Type
• Columns: Categorical Vs Continuous
– Categorical: Takes discrete values (# of
children, Marital Status)
– Continuous: Takes continuous values (Income)
• Unordered vs Ordered Columns
– Unordered: (Marital Status, Sex)
– Ordered: Rank (e.g. “Low”, “High”)
– Ordered: Interval (e.g. Temperature)
– Ordered: True Numeric (e.g. Sales in Rs.,
Weight)
Descriptive Statistics
• We can get a general idea about the way data are
distributed

Alcohol
Mean                        13.00
Standard Error               0.06
Median                      13.05
Mode                        13.05
Standard Deviation           0.81
Sample Variance              0.65
Kurtosis                    -0.85
Skewness                    -0.05
Range                        3.80
Minimum                     11.03
Maximum                     14.83
Sum                       2314.11
Count                         178
Largest(1)                  14.83
Smallest(1)                 11.03
Confidence Level (95.0%)     0.12
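
Most of the figures in such a summary can be reproduced with Python's standard statistics module. The sketch below uses a small hypothetical sample (not the wine data), and approximates the 95% confidence half-width with a normal quantile; spreadsheet tools typically use a t-based value, which is slightly larger for small samples.

```python
import math
import statistics


def describe(values):
    """Summary statistics for a list of numbers (sample formulas throughout)."""
    n = len(values)
    sd = statistics.stdev(values)                 # sample standard deviation
    std_error = sd / math.sqrt(n)                 # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.975)    # about 1.96 (normal approximation)
    return {
        "mean": statistics.mean(values),
        "standard_error": std_error,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "standard_deviation": sd,
        "sample_variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
        "confidence_level_95": z * std_error,
    }


sample = [11.0, 12.5, 13.0, 13.0, 14.5]
summary = describe(sample)
for name, value in summary.items():
    print(name, value)
```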
Data Visualization
• We can study data distribution using a Histogram

[Figure: three histograms of alcohol content, each showing
Frequency and Cumulative % per bin (x-axis: Bin - Alcohol
Content): All Types of Wines, Type A Wines, Type B Wines]


Data Visualization
• Visual presentation of data (e.g. Graphs like bar
chart, X-Y Plot of two variables, Scatter Chart
etc.)
• Correlation between variables



Validation
Incorrect Values:
Reasons
– Transcription error
– Laziness (forced entry for birth day → many
were born on November 11, 1911!!)
– Programming error (value of previous field gets
entered in this field)
– Old code and new code coexist!
– Collected wrongly (Time zone not considered)
– Stored incorrectly (Numeric instead of
character type)
“My data must be clean because no human being
has touched it manually” (one CEO)
Result: 50% of the data was wrong, because no human
being had touched the system clocks on the computers!



DM Methodology – 11 Steps (cont)
Step 4: Create a model set
• Sampling
– Proportionate (Including multiple time frames)
– Over sampling
• Partitioning
– Training
– Validation
– Test
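
Partitioning a model set into training, validation and test sets can be sketched as a seeded shuffle followed by slicing. The 60/20/20 proportions, the seed and the function name are assumptions, not values given in the slides.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into (training, validation, test) lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]   # whatever remains
    return training, validation_set, test


rows = list(range(100))
training, validation_set, test = partition(rows)
print(len(training), len(validation_set), len(test))
```

The training set is used to build the model, the validation set to tune or choose between models, and the test set only for the final estimate of performance.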



DM Methodology – 11 Steps (cont)
Step 5: Fix Problems with Data
• Correct Error
• Missing Values
• Outliers



Missing Data
Reasons
– “Missing Data” might be important
information (e.g. not providing TN → “do not
bother me by calling”). Keep a flag



Missing Data
Reasons (Cont)
– Nature of problem (e.g. new customers do not
have 12 months of history data): build a separate
model for those
– Sources not providing data (e.g. external
vendor not able to provide certain data): replace
by other derived value / build a separate model
– Data was never collected



Missing Data
What to do?
– Do Nothing
– Filter rows (introduces bias)
– Ignore column
– Predict New Value
– Build separate model
– Modify operational systems to collect data



Missing Data Correction
• Delete record
Problems
– Too many rows thrown out
– Bias introduced (All persons not wanting to state
“Salary” out)
• Replace values with:
– Mode
– Mean (Local / Global)
– Median
– User specified value
Will replacement create problems?
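
Replacing missing values with the mean, median or mode, while keeping a flag that records which values were missing, might look like the sketch below. Representing a missing value as None, and the names impute and salaries, are assumptions for illustration.

```python
import statistics

STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def impute(values, strategy="median"):
    """Fill None entries with a summary of the present values.

    Returns (filled_values, was_missing), where was_missing is a list of flags
    so that the fact that a value was absent is not lost.
    """
    present = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](present)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


salaries = [30, None, 50, 40, None, 1000]
print(impute(salaries, "median"))
print(impute(salaries, "mean"))
```

The two calls give very different fill values (45 vs. 280) because of the single large salary, which is one way replacement can create problems.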
Outliers
• Outliers are cases that contain unusually high or low
data values in a variable.
• Such records unduly influence the model.
• If they are not a natural occurrence they should be
removed
• Treatment depends upon the algorithm chosen
(Decision tree – no problem. Clustering – Define
separate cluster. Some cases – remove / replace
with Max / Min)
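
One possible reading of "replace with Max / Min" is to clip extreme values to upper and lower limits. The sketch below derives those limits from the interquartile range (1.5 × IQR beyond the quartiles), which is an assumed convention rather than something the slides prescribe; the readings are hypothetical.

```python
import statistics


def clip_outliers(values, k=1.5):
    """Replace values beyond the IQR fences with the nearest fence."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr          # assumed Min / Max limits
    return [min(max(v, low), high) for v in values]


readings = [10, 12, 11, 13, 12, 11, 95]
clipped = clip_outliers(readings)
print(clipped)
```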



DM Methodology – 11 Steps (cont)
Step 6: Transform Data
• Normalization
• Transforming
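
Two common forms of normalization, sketched under the assumption that the column is numeric and not constant: min-max scaling to the range 0 to 1, and z-scores (subtract the mean, divide by the sample standard deviation). The function names are illustrative.

```python
import statistics


def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


incomes = [10, 20, 30]
print(min_max(incomes))
print(z_score(incomes))
```

Scaling of this kind matters most for distance-based techniques, where a variable with a large range would otherwise dominate.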



Transformation
• Derived Variables
• Create derived variables that represent
something in the real world (e.g. Passenger *
Miles)



Transformation
Extracting Information from a column /
Transformation
• 26 Jan and 15 Aug → Holiday
• Date: Holiday / Working Day
• Date: Festive Season / Normal Season
• Time: Peak Hour / Off-peak Hour
• Telephone Number: Landline / Mobile
• Address: Single House / Multi-unit dwelling
• Categorize continuous data (e.g. Income)
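
A few of these column transformations can be sketched with the standard datetime types. The holiday list uses the two dates named above; the peak-hour windows, the income cutoffs and the band labels are hypothetical values chosen only for illustration.

```python
from datetime import date, time

HOLIDAYS = {(1, 26), (8, 15)}   # (month, day): 26 Jan and 15 Aug


def day_type(d):
    """Date -> Holiday / Working Day (weekends are not considered here)."""
    return "Holiday" if (d.month, d.day) in HOLIDAYS else "Working Day"


def hour_type(t, peak_windows=((8, 11), (17, 20))):
    """Time -> Peak Hour / Off-peak Hour, using assumed peak windows."""
    in_peak = any(start <= t.hour < end for start, end in peak_windows)
    return "Peak Hour" if in_peak else "Off-peak Hour"


def income_band(income, cutoffs=(20000, 50000)):
    """Categorize a continuous income into Low / Medium / High."""
    if income < cutoffs[0]:
        return "Low"
    if income < cutoffs[1]:
        return "Medium"
    return "High"


print(day_type(date(2024, 1, 26)), hour_type(time(9, 30)), income_band(30000))
```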
DM Methodology – 11 Steps (cont)
Step 7: Build Model
• Choose one or more techniques
Step 8: Assess Models
Some Errors are more serious than others
• Confusion Matrix
• Lift
• RMS
• Ratio of intra-cluster to inter-cluster distance
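
Three of these assessment measures can be sketched in a few lines: a confusion matrix as counts of (actual, predicted) pairs, RMS error for continuous predictions, and lift as the response rate among the top-scored cases divided by the overall response rate. The scores, the 0.5 cutoff and the 20% depth are hypothetical.

```python
import math
from collections import Counter


def confusion_matrix(actual, predicted):
    """Counts keyed by (actual class, predicted class)."""
    return Counter(zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def lift(actual, scores, fraction=0.2):
    """Response rate in the top-scored fraction relative to the overall rate.

    actual holds 0/1 outcomes; scores are model scores (higher = more likely).
    """
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top = ranked[:max(1, int(len(ranked) * fraction))]
    top_rate = sum(outcome for _, outcome in top) / len(top)
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate


actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.5, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(lift(actual, scores))
print(rmse([3, 5], [1, 5]))
```

Because some errors are more serious than others, the off-diagonal cells of the matrix would usually be weighted by their business cost rather than simply counted.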



DM Methodology – 11 Steps (cont)
Step 9: Deploy Model
• Choose one or more techniques
Step 10: Assess Results
Example:
What was the cost of the direct marketing campaign
(including DM cost)?
What were the benefits?
Step 11: Begin Again
• Things change over time
• Better way of handling



DM and KDD
KDD (Knowledge Discovery in Databases) and DM are
used interchangeably.
Some prefer to differentiate. KDD consists of:
• Selection: Sourcing Data
• Preprocessing: Correcting erroneous data, handling
missing data
• Transformation: Transforming data to more usable
formats
• Data Mining: Applying various algorithms
• Presentation / Interpretation / Evaluation of data
SEMMA Methodology (SAS)
• Sample from data sets, Partition into Training,
Validation and Test datasets
• Explore data set statistically and graphically
• Modify: Transform variables, Impute missing
values
• Model: fit predictive models e.g. regression, tree,
collaborative filtering
• Assess: Compare models



Miscellaneous
Data Mining Issues
• Human Interaction
• Over fitting
• Outliers
• Interpretation of Results
• Visualization of Results
• Large Datasets (some algorithms do not scale. Use
Sampling or Parallel processing)
Miscellaneous (Cont)
Data Mining Issues (Cont)
• High Dimensionality
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy data
• Changing Data
• Integration of KDD in traditional DBMS systems
• Applications
Miscellaneous (Cont)
Future
• Data Mining Query Lang (DMQL) based on SQL
• DMQL should bring out
– Generalized Relation: Obtained by Generalizing
data from input data
– Characteristic Rule: Condition satisfied by
almost all records in target class
– Discriminant Rule: Condition satisfied by target
class but not by other classes
– Classify Rule: Used to classify data
References
1. Michael Berry, Gordon Linoff “Mastering Data
Mining”, Wiley Publications (Ch 1, 3, 5, 6, 7)
2. Michael Berry, Gordon Linoff “Data Mining
Techniques”, Wiley Publications, (Ch 7 –
Overview of Data Mining Techniques)
3. Margaret Dunham, “Data Mining – Introductory
and Advanced Topics”, Pearson Edition (Ch
1,2,3)

