MBA 643 - Final Exam Notes

This document discusses various classification and predictive modeling techniques such as logistic regression, decision trees, clustering, and probability distributions. It provides examples of accuracy metrics like sensitivity and specificity. Key concepts covered include overfitting, hierarchical vs k-means clustering, and the binomial, normal and other probability distributions.

Uploaded by Ganesh


Chapter 9A – Predictive Data Mining

Classification Accuracy:

The predicted class is shown in the columns (vertical); the actual class is shown in the rows (horizontal).


                    Predicted Class
Actual Class        1              0
     1          n11 = 146      n10 = 89
     0          n01 = 5244     n00 = 7479

Number of false positives = n01

False positive (read down the Predicted-1 column): misclassifying an actual Class 0 observation as a Class 1 observation.

Class 0 error rate = n01 / (n01 + n00) = 5244 / (5244 + 7479) = 41.2%

Number of false negatives = n10

False negative (read down the Predicted-0 column): misclassifying an actual Class 1 observation as a Class 0 observation.

Class 1 error rate = n10 / (n11 + n10) = 89 / (146 + 89) = 37.9%

Accuracy of the model: one minus the overall error rate.

Overall error rate: an aggregate measure of misclassification, (n10 + n01) / (n11 + n10 + n01 + n00).
Decile-wise lift chart: another way to view how much better a classifier is at identifying Class 1 observations than random classification.

The point (10, 5) on the blue curve means that if the 10 observations with the largest estimated probabilities of being in Class 1 were selected from the data table, 5 of these observations correspond to actual Class 1 members.

The point (10, 2.2) on the red curve means that if 10 observations were randomly selected, only (11/50) x 10 = 2.2 of these observations would be Class 1 members.

The better the classifier is at identifying responders, the larger the vertical gap between points on the red and blue curves.
Sensitivity: the ability to correctly predict Class 1 (positive) observations.

Sensitivity = 1 − Class 1 error rate = n11 / (n11 + n10)

Specificity: the ability to correctly predict Class 0 (negative) observations.

Specificity = 1 − Class 0 error rate = n00 / (n01 + n00)

Precision: the proportion of observations predicted to be Class 1 by a classifier that actually are in Class 1:

Precision = n11 / (n11 + n01)

The F1 Score combines precision and sensitivity into a single measure:

F1 Score = 2·n11 / (2·n11 + n01 + n10)
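The metrics above can be checked numerically. A minimal Python sketch (not part of the course materials) using the confusion-matrix counts from the table:

```python
# Confusion-matrix counts from the notes: n11, n10 (actual Class 1), n01, n00 (actual Class 0).
n11, n10, n01, n00 = 146, 89, 5244, 7479

class1_error = n10 / (n11 + n10)                 # false-negative rate: 89/235
class0_error = n01 / (n01 + n00)                 # false-positive rate: 5244/12723
sensitivity = n11 / (n11 + n10)                  # 1 - Class 1 error rate
specificity = n00 / (n01 + n00)                  # 1 - Class 0 error rate
precision = n11 / (n11 + n01)
f1 = 2 * n11 / (2 * n11 + n01 + n10)
overall_error = (n10 + n01) / (n11 + n10 + n01 + n00)
accuracy = 1 - overall_error

print(round(class1_error, 3), round(class0_error, 3))  # 0.379 0.412
```

The printed values match the two error rates worked out above.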
Logistic Regression:

The logistic regression model:  ln( p̂ / (1 − p̂) ) = b0 + b1X1 + b2X2 + … + bqXq

The logistic function:  p̂ = 1 / (1 + e^−(b0 + b1X1 + … + bqXq))
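As an illustration, the logistic function and the 0.5 cutoff can be sketched in Python; the coefficients b0 and b1 here are hypothetical, not fitted values from the course example:

```python
import math

def logistic_prob(x, b0=-2.0, b1=0.5):
    """p-hat = 1 / (1 + e^-(b0 + b1*x)) for a single predictor (hypothetical b0, b1)."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

p_hat = logistic_prob(6.0)          # b0 + b1*x = 1.0, so p_hat ~ 0.731
label = 1 if p_hat > 0.5 else 0     # classification rule with a 0.5 cutoff -> Class 1
```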

Best Subsets extraction method to identify the best two logistic regression models:
The best subset is chosen as follows:
1. First, sort the best-subsets output in decreasing order of Mallow's Cp value.
2. Look for the lowest positive Mallow's Cp value.
3. Mallow's Cp should be greater than or equal to #Coefficients (the number of independent variables).
Note: When Cp is less than or equal to p, it suggests the model is unbiased.

Determine the best logistic regression model. Justify. What is the classification rule?
The best logistic regression model is the one with the maximum sensitivity.
Classification rule: if the estimated probability p̂ > 0.5, the observation is classified as Class 1; otherwise, it is classified as Class 0.

Most significant predictor to identify churners? Justify: from the output, pick the predictor with the lowest p-value (most significant).

Interpreting the decile lift: our classifier is XX.XX times better at identifying Class 1 observations than random selection.

To achieve a sensitivity of at least 0.70, how much Class 0 error can be tolerated?
Check the test data.
PPR: Predicted Positive Rate (Support)
TPR: True Positive Rate (Sensitivity, Recall) = 1 − Class 1 error rate
FPR: False Positive Rate = 1 − Specificity = Class 0 error rate

Chapter 9B – Predictive Data Mining


K-nearest neighbours k-NN

CART: a series of questions that successively narrows observations into smaller and smaller groups of decreasing impurity.
Overfitting: a fully grown tree based on the training data completely overfits the data.
How to avoid overfitting, two ways:
1. Stop the tree growth: control the number of splits, the minimum number of records in a terminal node, and the minimum reduction in impurity.
2. Prune a full-grown tree: remove the weakest branches.

Chapter 4:
Table 4.2 shows that Cluster 3 is the smallest, most homogeneous cluster, whereas Cluster 2 is the largest,
most heterogeneous cluster.

In Table 4.3, we compare the average distances between clusters to the average distance within clusters
in Table 4.2.

Hierarchical Clustering: very sensitive to outliers (suitable for small data sets, <500 observations).

k-Means Clustering: generally not appropriate for binary or ordinal data because the average is not meaningful (better suited to larger data sets).
Support(apple) = 4/8

Confidence: how likely item Y is purchased when item X is purchased = the proportion of transactions containing item X in which item Y also appears.

Lift: how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is.

Lift{Apple → Beer} = Confidence{Apple → Beer} / (Support{Beer} / #transactions)
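A small Python sketch of these three measures on toy basket data (the transactions are hypothetical, chosen so that Support(apple) = 4/8 as in the note above):

```python
# Toy market-basket data (hypothetical transactions).
transactions = [
    {"apple", "beer"}, {"apple", "beer"}, {"apple", "beer"}, {"apple"},
    {"beer"}, {"milk"}, {"milk"}, {"bread"},
]
n = len(transactions)

n_apple = sum("apple" in t for t in transactions)
n_both = sum({"apple", "beer"} <= t for t in transactions)

support_apple = n_apple / n                  # 4/8 = 0.5
confidence = n_both / n_apple                # P(beer | apple) = 3/4
support_beer = sum("beer" in t for t in transactions) / n
lift = confidence / support_beer             # 0.75 / 0.5 = 1.5 -> positive association
```

A lift above 1 means buying apple makes buying beer more likely than its overall popularity would suggest.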

Text mining:
Chapter 5: Probability: Introduction to modeling uncertainty
Probability rules:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

If A and B are two mutually exclusive events: P(A ∩ B) = 0

P(A|B) = the probability of event A given that event B has already occurred:

P(A|B) = P(A ∩ B) / P(B)        P(B|A) = P(A ∩ B) / P(A)

Multiplication law: P(A ∩ B) = P(B)·P(A|B) = P(A)·P(B|A)

If A and B are independent events: P(A|B) = P(A), P(B|A) = P(B), and P(A ∩ B) = P(A)·P(B)
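A quick numeric check of these rules (the probabilities are hypothetical, picked so that A and B happen to be independent):

```python
# Hypothetical probabilities: P(A), P(B), and P(A and B).
p_a, p_b, p_a_and_b = 0.4, 0.5, 0.2

p_a_or_b = p_a + p_b - p_a_and_b        # addition law: 0.7
p_a_given_b = p_a_and_b / p_b           # conditional probability: 0.4
p_b_given_a = p_a_and_b / p_a           # 0.5

# Here P(A|B) = P(A) and P(B|A) = P(B), so A and B are independent
# and the multiplication law reduces to P(A and B) = P(A) * P(B).
assert abs(p_a_and_b - p_a * p_b) < 1e-9
```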

Discrete Probability Distributions:

Expected value (mean): E(X) = µ = Σ x·f(x) = Σ x·P(X = x)

Variance: σ² = Σ (x − µ)²·f(x)

Standard deviation: σ
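These two formulas can be sketched directly in Python for a hypothetical discrete distribution (the x values and probabilities below are made up for illustration):

```python
# Hypothetical discrete distribution as {x: f(x)}; probabilities sum to 1.
dist = {0: 0.2, 1: 0.5, 2: 0.3}

mean = sum(x * p for x, p in dist.items())               # E(X) = sum of x*f(x) = 1.1
var = sum((x - mean) ** 2 * p for x, p in dist.items())  # sum of (x-mu)^2 * f(x) = 0.49
std = var ** 0.5                                         # sigma = 0.7
```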

Binomial Probability Distributions:

f(x) = P(X = x | n, π) = C(n, x)·π^x·(1 − π)^(n−x), where C(n, x) = n! / (x!·(n − x)!)   Note: 0! = 1

E(X) = nπ        Var(X) = nπ(1 − π)

X = the random variable, the number of successes
π = probability of a success in one trial
n = the number of trials
f(x) = probability of x successes in n trials

Excel: =BINOM.DIST(x, n, π, FALSE) = P(X = x | n, π)   Ex: exactly 4: =BINOM.DIST(4, 20, 0.2, FALSE)

=BINOM.DIST(x, n, π, TRUE) = P(X ≤ x | n, π)   Ex: 3 or fewer: =BINOM.DIST(3, 20, 0.2, TRUE)
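The two Excel calls can be reproduced from the binomial formula directly; a stdlib-only Python sketch using the same n = 20, π = 0.2 example:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x | n, p): the same quantity as BINOM.DIST(x, n, p, FALSE)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x | n, p): the same quantity as BINOM.DIST(x, n, p, TRUE)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

exactly_four = binom_pmf(4, 20, 0.2)      # ~0.218
three_or_fewer = binom_cdf(3, 20, 0.2)    # ~0.411
```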

Continuous Probability Distributions (uniform):

f(x) = 1 / (b − a) for a ≤ x ≤ b, 0 elsewhere

P(X ≤ x) = (x − a) / (b − a) for a ≤ x ≤ b

E(X) = (a + b) / 2        Var(X) = (b − a)² / 12
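A short Python sketch of the uniform formulas, with hypothetical endpoints a = 10 and b = 30:

```python
# Uniform distribution on [a, b]; endpoints are hypothetical.
a, b = 10.0, 30.0

density = 1 / (b - a)            # f(x) = 0.05 for a <= x <= b
p_le_20 = (20 - a) / (b - a)     # P(X <= 20) = 0.5
mean = (a + b) / 2               # E(X) = 20.0
var = (b - a) ** 2 / 12          # Var(X) = 33.33...
```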

Normal Probability Distribution:  Z = (X − µ) / σ

The standardized normal distribution Z has a mean of 0 and a standard deviation of 1.

To find the probability that flight time exceeds 40,000:

P(X > 40,000) = P(Z > (40,000 − 36,500) / 5,000)
= 1 − P(X < 40,000) = 1 − NORM.DIST(40000, 36500, 5000, TRUE)

=NORM.DIST(X, µ, σ, TRUE)

TRUE returns the cumulative probability P(X ≤ x); FALSE returns the height of the density at exactly x.

When a question gives a percentage instead of a value, use NORM.INV:

=NORM.INV(0.1, 36500, 5000) returns the value below which 10% of the distribution falls (mean = 36,500, std. dev. = 5,000)
=NORM.INV(percentage, mean, standard deviation)

Probability of greater than 30,000 but less than 40,000 hours:

=NORM.DIST(40000, 36500, 5000, TRUE) − NORM.DIST(30000, 36500, 5000, TRUE)
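The three NORM.DIST / NORM.INV calculations above map directly onto Python's statistics.NormalDist; a sketch using the flight-time numbers from the notes:

```python
from statistics import NormalDist

flight = NormalDist(mu=36500, sigma=5000)

# 1 - NORM.DIST(40000, 36500, 5000, TRUE): P(X > 40,000) ~ 0.242
p_more_40000 = 1 - flight.cdf(40000)

# NORM.DIST(40000, ...) - NORM.DIST(30000, ...): P(30,000 < X < 40,000) ~ 0.661
p_30_to_40 = flight.cdf(40000) - flight.cdf(30000)

# NORM.INV(0.1, 36500, 5000): the value with 10% of flight times below it, ~30,092
tenth_percentile = flight.inv_cdf(0.10)
```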

CHAPTER – 6A: Statistical inference

The sampling distribution of X̄ is the probability distribution of all possible values of the sample mean X̄.

Expected value of X̄: E(X̄) = µ        Standard error: σ_X̄ = σ / √n

Note: when the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased.

What is the value exceeded by X% of sample means?

Say, for example, X = 60. The value exceeded by 60% of sample means is the value with the remaining 40% below it, so we look up a cumulative probability of 0.40.

Sampling Distribution of the Sample Proportion P:

Expected value of P (mean): E(P) = π        σ_P = √( π(1 − π) / n )

Z = (P − π) / √( π(1 − π) / n ), where P = the sample proportion and π = the given population proportion
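The z statistic for a sample proportion can be sketched with hypothetical numbers (π, n, and the observed p below are made up for illustration):

```python
from math import sqrt

# Hypothetical values: population proportion pi, sample size n, observed sample proportion p.
pi, n, p = 0.40, 100, 0.46

sigma_p = sqrt(pi * (1 - pi) / n)   # standard error of P ~ 0.049
z = (p - pi) / sigma_p              # ~1.22
```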

Interval Estimation: point estimate ± margin of error

For the population mean: X̄ ± margin of error
For the population proportion: P ± margin of error

Confidence interval for µ when σ is known:

X̄ ± Z_α/2 · σ / √n

Assumptions:
- Population standard deviation σ is known
- Population is normally distributed
- If the population is not normal, use a large sample (n > 30)

Excel – margin of error: =CONFIDENCE.NORM(α, σ, n)
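A Python sketch of this σ-known interval, with hypothetical sample values (x̄ = 100, σ = 12, n = 36):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample: known-sigma case.
xbar, sigma, n, alpha = 100.0, 12.0, 36, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)       # ~1.96 for 95% confidence
margin = z * sigma / sqrt(n)                  # same quantity as CONFIDENCE.NORM(alpha, sigma, n)
lower, upper = xbar - margin, xbar + margin   # ~96.08 to ~103.92
```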

Confidence Intervals for µ of a Normal Population when σ Unknown:

X̄ ± t_(n−1),α/2 · S / √n

For a given 90% confidence interval, 0.9 is referred to as the confidence coefficient.
Level of significance α = 1 − confidence coefficient
Interval Estimation for a proportion:

σ_p = √( π(1 − π) / n ), estimated by √( p(1 − p) / n )

p ± Z_α/2 · √( p(1 − p) / n )

Example: in a sample of 70, find the count above the threshold (say the count is 31).
P = 31/70 ≈ 0.44

0.44 ± 1.96·√( 0.44(1 − 0.44) / 70 )

0.324 < π < 0.556

Check: np = 70(0.44) = 31 and n(1 − p) = 39, so the normal approximation is appropriate.
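A Python sketch reproducing this interval (note the calculation rounds p to 0.44 first, as the notes do, which is why the endpoints come out 0.324 and 0.556):

```python
from math import sqrt

n, count = 70, 31
p = round(count / n, 2)        # 0.44, rounding first as in the notes
z = 1.96                       # z for a 95% confidence interval

margin = z * sqrt(p * (1 - p) / n)
lower, upper = p - margin, p + margin
print(round(lower, 3), round(upper, 3))   # 0.324 0.556
```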
Sample size for a CI for µ:
Let E = the desired margin of error. Then

E = Z_α/2 · σ / √n, and solving for n:  n = (Z_α/2)² σ² / E²

For a proportion: E = Z_α/2 · √( P*(1 − P*) / n ), giving  n = (Z_α/2)² P*(1 − P*) / E²
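Both sample-size formulas can be sketched in Python; the planning values (σ = 15, E = 5, P* = 0.5, E = 0.03) are hypothetical:

```python
from math import ceil

z = 1.96            # z_{alpha/2} for 95% confidence

# Mean: hypothetical planning values sigma = 15, desired margin E = 5.
n_mean = ceil(z**2 * 15**2 / 5**2)                 # 35 (always round n up)

# Proportion: conservative planning value P* = 0.5, desired margin E = 0.03.
n_prop = ceil(z**2 * 0.5 * (1 - 0.5) / 0.03**2)    # 1068
```

Rounding up (ceil) guarantees the achieved margin of error is no larger than E.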

Chapter 6B
Test statistics:
1. Understand H0 and H1. H0 always contains the equality sign.
2. If "less than" or "greater than" appears in the statement, treat that as H1.

Level of significance (α): probability of making a Type I error.

Power of the test (1 − β): probability of correctly rejecting a false H0; β is the probability of a Type II error.

8 steps of the classical approach:

1. Identify the hypotheses to be tested (look for the sign)
2. Identify the level of significance α
3. Identify the test statistic
4. Determine the rejection region (Z or T critical; take the sign from step 1)
5. Calculate Zcalculated or Tcalculated
6. Decision (compare Zcalc with Zcrit)
7. Conclusion
8. P-value
Hypothesis Testing: σ Known
p-Value Approach to Testing:
• The p-value is also called the observed level
of significance
• It is the smallest value of α for which H0 can
be rejected
o If p-value < α , reject H0
o If p-value ≥ α , do not reject H0

To calculate the p-value:
=NORM.S.DIST(z, TRUE) → H1 <
=1 − NORM.S.DIST(z, TRUE) → H1 >

To calculate z from a probability:
=NORM.S.INV(probability)
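The z-based p-value rules can be sketched in Python with NormalDist (the calculated z statistic of −1.85 and α = 0.05 below are hypothetical):

```python
from statistics import NormalDist

z = -1.85                                  # hypothetical calculated z statistic
alpha = 0.05

p_lower = NormalDist().cdf(z)              # NORM.S.DIST(z, TRUE): for H1 "<", ~0.032
p_upper = 1 - NormalDist().cdf(z)          # for H1 ">"
p_two_tailed = 2 * min(p_lower, p_upper)   # for H1 "not equal"

z_crit = NormalDist().inv_cdf(alpha)       # NORM.S.INV(0.05) ~ -1.645
reject_h0 = p_lower < alpha                # p-value rule: 0.032 < 0.05, so reject H0
```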

Hypothesis Testing: σ Unknown
p-Value Approach to Testing:
• To calculate the p-value:
• =T.DIST(t, n−1, TRUE) → H1 <
• =1 − T.DIST(t, n−1, TRUE) → H1 >
• =2 × MIN(T.DIST(t, n−1, TRUE), 1 − T.DIST(t, n−1, TRUE)) → H1 ≠

• To calculate T from a probability:

• =T.INV.2T(α, degrees of freedom) → for H1 ≠
• =T.INV(1 − α, degrees of freedom) → H1 > (T.INV returns the left-tail value, so use 1 − α for an upper-tail test)

If p-value < 0.01, we have overwhelming evidence that 𝐻1 is true

If 0.01 < p-value < 0.05, we have strong evidence to conclude that 𝐻1 is true

If 0.05 < p-value < 0.10, we have weak evidence to conclude that 𝐻1 is true

If p-value > 0.10, we have insufficient evidence to conclude that 𝐻1 is true


