
Data mining

Evaluating classification methods

• Predictive accuracy

• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandability of and insight provided by the model
• Compactness of the model: size of the tree, or the number of rules.

Evaluation methods
• Holdout set: The available data set D is divided into two disjoint
subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: training set should not be used in testing and the test
set should not be used in learning.
– Unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (the examples in the
original data set D are all labeled with classes.)
• This method is mainly used when the data set D is large.
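
As a minimal sketch (not from the original slides; scikit-learn and a labeled feature matrix X with labels y are assumed), a holdout evaluation could look like this:

    # Holdout evaluation sketch: split D into disjoint Dtrain and Dtest.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    def holdout_accuracy(X, y, test_size=0.3, seed=0):
        # The test set is never used for learning the model.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)
        model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
        # Accuracy on the unseen holdout set is the unbiased estimate.
        return accuracy_score(y_test, model.predict(X_test))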



Evaluation methods (cont…)
• n-fold cross-validation: a resampling technique used to evaluate the
performance of a machine learning model. It assesses how well a model
generalizes to an independent data set by using multiple training and
testing splits of the data. The available data is partitioned into n equal-size
disjoint subsets.

• Use each subset in turn as the test set and combine the remaining n-1 subsets as the
training set to learn a classifier.
• The procedure is run n times, which gives n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validations are commonly used.
• This method is used when the available data is not large.
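
As a rough sketch (scikit-learn assumed, not part of the slides), n-fold cross-validation can be run as follows:

    # n-fold cross-validation sketch: each fold serves once as the test set
    # while the remaining n-1 folds form the training set.
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def cv_accuracy(X, y, n_folds=10):
        scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=n_folds)
        # The final estimate is the average of the n accuracies.
        return scores.mean()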
Precision and recall measures
• Used in information retrieval and text classification.
• We use a confusion matrix to introduce them.



true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.
false positives (FP): We predicted yes, but they don't actually have the
disease. (Also known as a "Type I error.")
false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
Precision and recall measures (cont…)

p = TP / (TP + FP)        r = TP / (TP + FN)

• Precision p is the number of correctly classified positive examples divided by
the total number of examples that are classified as positive.
• Recall r is the number of correctly classified positive examples divided by
the total number of actual positive examples in the test set.
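
A minimal sketch of the two formulas in code (plain Python, not from the slides):

    # Precision and recall from the raw confusion-matrix counts.
    def precision(tp, fp):
        return tp / (tp + fp)   # correct positives among all examples classified as positive

    def recall(tp, fn):
        return tp / (tp + fn)   # correct positives among all actual positives in the test set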
An example

• This confusion matrix gives


– precision p = 100% and
– recall r = 1%
because we only classified one positive example correctly
and no negative examples wrongly.
• Note: precision and recall only measure classification
on the positive class.



F1-value (also called F1-score)
• It is hard to compare two classifiers using two measures. The F1-score combines
precision and recall into one measure:

    F1 = 2pr / (p + r)

• It is the harmonic mean of p and r; the harmonic mean of two numbers tends to be
closer to the smaller of the two.
• For the F1-value to be large, both p and r must be large.
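
A small illustrative sketch (not from the slides) of the harmonic-mean combination:

    # F1-value: harmonic mean of precision p and recall r.
    def f1_score(p, r):
        return 2 * p * r / (p + r)   # large only when both p and r are large

    # Example: p = 1.0 (100%) and r = 0.01 (1%) from the earlier slide give
    # f1_score(1.0, 0.01) ~= 0.0198, reflecting the poor recall despite perfect precision.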



Receiver operating characteristic curve

• It is commonly called the ROC curve.


• It is a plot of the true positive rate (TPR) against the
false positive rate (FPR).
• True positive rate: TPR = TP / (TP + FN)

• False positive rate: FPR = FP / (FP + TN)
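
As an illustrative sketch (scikit-learn assumed, not part of the slides), the ROC curve can be computed from the classifier's scores for the positive class:

    # ROC curve sketch: TPR vs. FPR over all score thresholds.
    from sklearn.metrics import roc_curve, auc

    def roc_points(y_true, y_scores):
        fpr, tpr, thresholds = roc_curve(y_true, y_scores)
        return fpr, tpr, auc(fpr, tpr)   # the area under the curve summarizes the plot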


Sensitivity and Specificity
• In statistics, there are two other evaluation
measures:
– Sensitivity: same as TPR = TP / (TP + FN)
– Specificity: also called the True Negative Rate (TNR) = TN / (TN + FP)

• Then we have: Sensitivity = TPR and Specificity = 1 - FPR
Confusion Matrix:

A confusion matrix is a technique for summarizing the performance of a classification algorithm.
Confusion Matrix and Evaluation Parameters
• Example confusion matrix (n = 165):
    Predicted Yes, Actual Yes: TP = 100      Predicted Yes, Actual No: FP = 10
    Predicted No,  Actual Yes: FN = 5        Predicted No,  Actual No: TN = 50

Accuracy: Overall, how often is the classifier correct?


Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) ≈ 0.91

Misclassification Rate: Overall, how often is it wrong?


(FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy
also known as "Error Rate"
True Positive Rate/Recall: When it's actually yes, how often does it
predict yes?
TP/actual yes = 100/105 = 0.95
also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
Specificity: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
F1-score:
F1 = (2 × Recall × Precision) / (Recall + Precision) = (2 × 0.95 × 0.91) / (0.95 + 0.91) ≈ 0.93
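
All of the parameters above can be recomputed from the four counts; a sketch in plain Python using this example's values:

    # Evaluation parameters from the example counts TP=100, FN=5, FP=10, TN=50.
    tp, fn, fp, tn = 100, 5, 10, 50
    total = tp + fn + fp + tn                    # 165

    accuracy    = (tp + tn) / total              # ~0.91
    error_rate  = (fp + fn) / total              # ~0.09
    recall      = tp / (tp + fn)                 # ~0.95 (TPR / sensitivity)
    fpr         = fp / (fp + tn)                 # ~0.17
    specificity = tn / (fp + tn)                 # ~0.83 (1 - FPR)
    precision   = tp / (tp + fp)                 # ~0.91
    prevalence  = (tp + fn) / total              # ~0.64
    f1          = 2 * recall * precision / (recall + precision)   # ~0.93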
What is the Naive Bayes algorithm?
• It is used for classification, especially in text mining
• Works well on large data sets
• Based on probability (Bayes' theorem)
• Naive Bayes can be extremely fast relative to other classification algorithms
What is the Naive Bayes algorithm?

Step 1: Convert the data set into a frequency table


Step 2: Create a likelihood table by computing the probabilities, e.g., P(Outlook=Overcast) = 0.29
and P(Play=Yes) = 0.64.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the prediction.
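
A rough sketch of Steps 1 and 2 on a generic categorical data set (pandas assumed; df, feature, and target are placeholder names, not from the slides):

    # Frequency table (Step 1) and likelihood table (Step 2) for one feature.
    import pandas as pd

    def likelihood_table(df, feature, target):
        freq = pd.crosstab(df[feature], df[target])   # Step 1: counts of each feature value per class
        return freq / freq.sum(axis=0)                # Step 2: P(feature value | class)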
Example
• Example: Play Tennis

Today's outlook is sunny, the temperature is cool, the humidity is high, and the wind is
strong. Using Naive Bayes, what is the probability that she will play tennis today?
Example
• Learning Phase
Outlook      Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny        2/9       3/5            Hot          2/9       2/5
Overcast     4/9       0/5            Mild         4/9       2/5
Rain         3/9       2/5            Cool         3/9       1/5

Humidity     Play=Yes  Play=No        Wind         Play=Yes  Play=No
High         3/9       4/5            Strong       3/9       3/5
Normal       6/9       1/5            Weak         6/9       2/5

P(Play=Yes) = 9/14                    P(Play=No) = 5/14
Example

P(Outlook=Sunny|Play=Yes) = 2/9          P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9       P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9          P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9            P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14

P(Yes|x') ≈ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x')  ≈ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

P(X) = P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High) * P(Wind=Strong)
P(X) = (5/14) * (4/14) * (7/14) * (6/14) = 0.02186

Then, dividing the results by this value:


P(Play=Yes | X) = 0.0053/0.02186 = 0.2424
P(Play=No | X) = 0.0206/0.02186 = 0.9421

Since 0.9421 is greater than 0.2424, the predicted class is 'No': she will not play tennis today.
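
A small sketch reproducing this calculation in plain Python, using the probabilities from the learning-phase tables:

    # Class scores (numerators) for x' = (Sunny, Cool, High, Strong).
    p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~0.0053
    p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~0.0206

    # Normalize by P(X) as on the slide (independence-based estimate of the evidence).
    p_x = (5/14) * (4/14) * (7/14) * (6/14)          # ~0.0219
    print(p_yes / p_x, p_no / p_x)                   # ~0.24 vs ~0.94 -> predict Play = No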
Thank you
