Chapter 02 Overview
Supervised Learning
Goal: Predict a single “target” or “outcome” variable.
Predictive Modeling Example
Who will cheat on their taxes? Use what we know to predict what we don’t.

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat on Taxes is the target class):

TID  Refund  Marital Status  Taxable Income  Cheat on Taxes
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

New records to predict:

Refund  Marital Status  Taxable Income  Cheat on Taxes
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Married         80K             ?
Supervised: Prediction
Goal: Predict a numerical target (outcome) variable.
Examples: sales, revenue, performance.
Each row is a case (customer, tax return, applicant).
Each column is a variable.
Supervised Prediction Example
X Y
0 1
1 3
2 5
3 7
4 9
5 11
6 13
7 15
8 17
9 19
10 21
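The table follows Y = 2X + 1 exactly, so a least-squares line fit recovers slope 2 and intercept 1. A quick check in Python (the slides use JMP; this is just an illustrative sketch):

```python
import numpy as np

# X/Y pairs from the example table above
x = np.arange(11)          # 0, 1, ..., 10
y = 2 * x + 1              # 1, 3, 5, ..., 21

# Least-squares line fit: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)    # ~2.0, ~1.0
```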
Unsupervised Learning
Unsupervised: Association Rules

Goal: Produce rules that define “what goes with what.”
Example: “If X was purchased, Y was also purchased.”
Rows are transactions.
Used in recommender systems – “Our records show you bought X, you may also like Y.”
Also called “affinity analysis.”
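At its core, affinity analysis counts co-occurrences. A minimal sketch in Python, on a hypothetical market-basket dataset (invented for illustration, not from the slides), computes the support and confidence of a rule “if X was purchased, Y was also purchased”:

```python
# Hypothetical transactions: each row is one basket
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def rule_stats(antecedent, consequent, transactions):
    """Support and confidence for the rule: antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n                         # P(X and Y)
    confidence = both / ante if ante else 0.0  # P(Y | X)
    return support, confidence

print(rule_stats("bread", "milk", transactions))   # → (0.6, 0.75)
```

Real association-rule miners (e.g., Apriori) search all frequent item sets; this sketch only scores one candidate rule.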
Unsupervised: Data Reduction

Distillation of complex/large data into simpler/smaller data:
Reducing the number of variables/columns (e.g., principal components).
Reducing the number of records/rows (e.g., clustering).
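Column reduction via principal components can be sketched with nothing more than an SVD of the mean-centered data; the two-variable dataset below is synthetic, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated variables -> most variance lies along one direction
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# Principal components via SVD of the mean-centered data matrix
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance captured by the first component
explained = s[0] ** 2 / (s ** 2).sum()
print(explained)   # close to 1.0 for these data
```

Because the first component captures nearly all the variance here, the two columns could be replaced by a single score with little information loss.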
Unsupervised: Data Visualization

Graphs and plots of data: histograms, boxplots, bar charts, scatterplots.
Especially useful to examine relationships between pairs of variables.
The Process of Data Mining
Steps in Data Mining

1. Define/understand the purpose of the project
2. Obtain data (may involve random sampling)
3. Explore, clean, and preprocess the data
4. Reduce the data dimensionality
5. Specify the data mining task (classification, clustering, ...)
6. Partition the data (if supervised)
7. Choose the techniques (regression, CART, neural networks, etc.)
8. Perform the task
9. Interpret the results and assess the model – compare models and select the best model
10. Deploy the best model
Data Mining Effort

[Bar chart: percentage of effort (0–100) for each phase – Business Objective Determination, Data Preparation, Data Modeling, and Analysis of Results and Knowledge Assimilation]
SEMMA Approach to Data Mining

Sample
You gather the data by creating one or more tables. The sample should be large enough to contain the significant information, but small enough to process.

Explore
You search for anticipated relationships, unanticipated trends, and anomalies to gain understanding and ideas.

Modify
You change the data by creating, selecting, and transforming the variables to focus the model selection process.

Model
You model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

Assess
You evaluate competing predictive models by building charts to evaluate the usefulness and reliability of the findings from the data mining process.
Not a one-time initiative:
Models need to be maintained over time.
Adapt to changes in the business environment.
Leverage new data as it becomes available.
Obtaining Data: Sampling
Pre-processing Data
Types of Variables

Variable types determine the pre-processing needed and the algorithms that can be used.
Main distinction: categorical vs. numeric.
Modeling types are used in JMP to indicate the types of variables:
Numeric: continuous or integer – Continuous in JMP
Categorical, ordered (low, medium, high) – Ordinal in JMP
Categorical, unordered (male, female) – Nominal in JMP
Types of Variables
The icons next to the variable names (under Columns) indicate the modeling type.
Variable Handling

Numeric: Most algorithms in JMP can handle numeric data; it may occasionally need to be “binned” into categories.
Categorical: JMP algorithms can generally handle categorical data. Some stat packages require conversion to binary dummy or indicator variables (number of dummies = number of categories – 1). JMP automatically codes categorical variables behind the scenes if needed.
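The dummy-coding rule (k categories → k – 1 indicator columns) can be sketched by hand; the helper name and the choice of the first sorted category as the baseline are my own, not JMP's:

```python
def dummy_code(values, categories=None):
    """One indicator column per category except a baseline: k - 1 dummies."""
    if categories is None:
        categories = sorted(set(values))
    baseline, *rest = categories          # first category becomes the baseline
    return [[1 if v == c else 0 for c in rest] for v in values]

marital = ["Single", "Married", "Divorced", "Married"]
# Baseline = "Divorced"; columns = (Married, Single)
print(dummy_code(marital))   # → [[0, 1], [1, 0], [0, 0], [1, 0]]
```

The baseline category is encoded as all zeros, which is why only k – 1 columns are needed.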
Detecting Outliers

An outlier is an observation that is “extreme,” being distant from the rest of the data (the definition of “distant” is deliberately vague).
Outliers can have disproportionate influence on models (a problem if the outlier is spurious).
An important step in data pre-processing is detecting outliers.
Once detected, domain knowledge is required to determine whether it is an error or truly extreme.
Detecting Outliers

In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called “anomaly detection.”

In this example:
Tail Quantile = 0.1 = 10%
Mid interval = 1 – 2 × Tail Quantile = 0.8 = 80%
Interquantile Range = 100
Q = number of interquantile ranges to accept before calling something an outlier = 3
Outliers are anything < –300 or > 400.
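The quantile-range rule above is easy to reproduce outside JMP. A sketch (the function name and sample data are hypothetical; the bounds follow the slide's recipe of tail quantiles ± Q interquantile ranges):

```python
import numpy as np

def quantile_range_outliers(x, tail=0.10, q=3):
    """Flag values beyond q interquantile ranges outside the tail quantiles."""
    x = np.asarray(x, dtype=float)
    low, high = np.quantile(x, [tail, 1 - tail])
    iqr = high - low                       # interquantile range
    lower, upper = low - q * iqr, high + q * iqr
    return (x < lower) | (x > upper), (lower, upper)

# Mostly regular values, with two extreme points appended
x = list(range(0, 101, 10)) + [-500, 900]
flags, bounds = quantile_range_outliers(x)
print(flags.sum(), bounds)   # the two extreme points are flagged
```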
Detecting Outliers
[JMP screenshot: the column has 28 missing values]
Normalizing (Standardizing) Data

Used in some techniques when variables with the largest scales would dominate and skew results.
Puts all variables on the same scale.
Normalizing function: subtract the mean and divide by the standard deviation.
Generally applied behind the scenes in JMP if needed, or a built-in option is provided in model dialogs.
To standardize manually, select a variable in the data table and select New Formula Column > Transform > Standardize.
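The normalizing function itself is one line of arithmetic. A sketch, assuming the sample (n – 1) standard deviation:

```python
import numpy as np

def standardize(x):
    """z-score: subtract the mean, divide by the (sample) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

z = standardize([50, 60, 70, 80, 90])
print(z.mean(), z.std(ddof=1))   # ~0.0, ~1.0
```

After standardizing, every variable has mean 0 and standard deviation 1, so no variable dominates purely because of its scale.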
The Problem of Overfitting

Statistical models can produce highly complex explanations of relationships between variables.
The “fit” to the sample may be excellent.
But when used with new data, models of great complexity do not perform so well.
100% Fit – Not Useful for New Data

[Scatterplot: Revenue (0–1600) vs. Expenditure (200–1000), illustrating a model that fits the sample data perfectly]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
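The "too many parameters" cause is easy to demonstrate: a polynomial with as many coefficients as data points fits the sample perfectly yet extrapolates poorly. A synthetic sketch (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(6, dtype=float)
y = x + rng.normal(scale=0.5, size=6)   # roughly linear data, plus noise

# A degree-5 polynomial through 6 points fits the sample perfectly (100% fit)
overfit = np.polyfit(x, y, deg=5)
train_err = np.abs(np.polyval(overfit, x) - y).max()   # ~0

# A simple line does not fit the sample exactly, but is far more stable
line = np.polyfit(x, y, deg=1)
print(train_err)
print(np.polyval(overfit, 8.0), np.polyval(line, 8.0))  # wild vs. sensible extrapolation
```

The degree-5 model has memorized the noise; its perfect in-sample fit says nothing about how it behaves on new data.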
Partitioning the Data

Problem: How well will our model perform with new data?
Solution: Partition the data. The model is fit to a training partition, then applied to a test partition to give an unbiased estimate of its performance on new data.
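A minimal sketch of random partitioning into training, validation, and test sets (the 60/20/20 fractions and function name are illustrative choices, not prescribed by the slides):

```python
import random

def partition(n_rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Randomly split row indices into training / validation / test sets."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)            # fixed seed -> reproducible split
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train_idx, valid_idx, test_idx = partition(100)
print(len(train_idx), len(valid_idx), len(test_idx))   # → 60 20 20
```

Every row lands in exactly one partition, so performance measured on the held-out rows is not inflated by the fitting process.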
Example – Linear Regression
West Roxbury Housing Data
First 10 records
Example – Linear Regression
West Roxbury Housing Data
Partitioning the Data
(Using the Make Validation Column Utility)
Fitting a Multiple Linear Regression
Using Fit Model
Saving Values to the Data Table

Click the red triangle in the Fit Least Squares analysis window to select additional options:
Save Prediction Formula
Save Residuals
Predictions for First 10 Observations

Pred Formula TOTAL VALUE is the value predicted by our regression model.
Residual TOTAL VALUE is the difference between the actual and predicted values.
Highlighted rows are in the Validation Set.
Summary of Errors and Cross-Validation Statistics
RMSE (and RASE)

Error = actual – predicted
RMSE = root-mean-squared error = square root of the average squared error
(RMSE is reported as RASE in Fit Model and some other JMP platforms.)
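As a quick check of the definition, a hand-rolled RMSE on made-up numbers:

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error: sqrt of the average squared error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Errors are 0, 0, -2, so RMSE = sqrt(4/3)
print(rmse([1, 2, 3], [1, 2, 5]))   # ≈ 1.155
```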
Summary
Data mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization).
Before algorithms can be applied, data must be characterized and pre-processed.
To evaluate performance and to avoid overfitting, data partitioning is used.
Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database.