Data Mining Final
• Predictive accuracy
• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandability of, and insight provided by, the model
• Compactness of the model: size of the tree, or the number of rules.
Evaluation methods
• Holdout set: The available data set D is divided into two disjoint
subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: the training set should not be used in testing, and the test
set should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (The examples in the
original data set D are all labeled with classes.)
• This method is mainly used when the data set D is large.
• n-fold cross-validation: the available data is partitioned into n equal-size
disjoint subsets.
• Use each subset as the test set and combine the remaining n-1 subsets as the
training set to learn a classifier.
• The procedure is run n times, which gives n accuracies.
• The final estimated accuracy of learning is the average of the n accuracies.
• 10-fold and 5-fold cross-validations are commonly used.
• This method is used when the available data is not large.
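The two evaluation methods above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data set `D` is made up, and `train_majority` is a hypothetical toy learner (it always predicts the most common training label) standing in for whatever classifier is being evaluated.

```python
import random
from collections import Counter

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle D and split it into disjoint D_train and D_test."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def n_fold_indices(size, n):
    """Partition indices 0..size-1 into n disjoint folds of near-equal size."""
    return [list(range(i, size, n)) for i in range(n)]

def train_majority(train):
    """Toy learner (assumption): always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def accuracy(model, test):
    """Fraction of test examples whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in test) / len(test)

def cross_validate(data, n, learn=train_majority, seed=0):
    """Run n-fold cross-validation; return (average accuracy, per-fold accuracies)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    accuracies = []
    for fold in n_fold_indices(len(rows), n):
        held_out = set(fold)
        test = [rows[i] for i in fold]                      # one subset as the test set
        train = [row for i, row in enumerate(rows) if i not in held_out]  # the other n-1
        accuracies.append(accuracy(learn(train), test))
    return sum(accuracies) / n, accuracies

# Made-up labeled data: (feature, class) pairs.
D = [(i, "yes" if i % 3 else "no") for i in range(30)]

d_train, d_test = holdout_split(D)
holdout_acc = accuracy(train_majority(d_train), d_test)

cv_acc, fold_accs = cross_validate(D, n=5)
```

In the holdout case the model is learned once and tested once; in the cross-validation case every example is used for testing exactly once, which is why it suits smaller data sets.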
CS583, Bing Liu, UIC 4
Precision and recall measures
• Used in information retrieval and text classification.
• We use a confusion matrix to introduce them.
$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}$
• The F1-value is the harmonic mean of p and r: $F_1 = \frac{2pr}{p + r}$.
• The harmonic mean of two numbers tends to be closer to the smaller of the
two.
• For the F1-value to be large, both p and r must be large.
• Then we have
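As a small sketch of the precision, recall and F1 measures, the function below computes all three from confusion-matrix counts: TP (true positives), FP (false positives) and FN (false negatives). The function name and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is one common convention, assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision p, recall r and F1 from confusion-matrix counts."""
    p = tp / (tp + fp) if tp + fp else 0.0          # p = TP / (TP + FP)
    r = tp / (tp + fn) if tp + fn else 0.0          # r = TP / (TP + FN)
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean of p and r
    return p, r, f1

# Hypothetical counts: only 1 of 100 positive examples is found,
# and nothing is wrongly classified as positive.
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=99)
```

With these counts precision is perfect but recall is very low, and F1 ends up close to the recall rather than to the arithmetic mean of the two, which is the behavior of the harmonic mean described above.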
Confusion Matrix:
Example
P(Play=Yes) = 9/14                      P(Play=No) = 5/14
P(Outlook=Sunny|Play=Yes) = 2/9         P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9      P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9         P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9           P(Wind=Strong|Play=No) = 3/5
P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ≈ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
P(X) = P(Outlook=Sunny) * P(Temperature=Cool) * P(Humidity=High) * P(Wind=Strong)
P(X) = (5/14) * (4/14) * (7/14) * (6/14)
P(X) = 0.02186
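Dividing each score by P(X) gives P(Yes|x’) ≈ 0.0053 / 0.02186 ≈ 0.24 and
P(No|x’) ≈ 0.0206 / 0.02186 ≈ 0.94. Since 0.94 is greater than 0.24, the answer
is ‘no’: we cannot play a game of tennis today.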
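The arithmetic of this example can be checked with a short script. This is a sketch under the numbers given above; the dictionary names `prior` and `likelihood` are illustrative choices, and exact fractions are used so that rounding only happens when the results are displayed.

```python
from fractions import Fraction as F
from math import prod

# Class priors from the example.
prior = {"Yes": F(9, 14), "No": F(5, 14)}

# Conditional probabilities for x' = (Sunny, Cool, High, Strong), per class.
likelihood = {
    "Yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "No": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

# Naive Bayes score: product of the conditionals times the class prior.
score = {c: prod(likelihood[c]) * prior[c] for c in prior}

# P(X) as computed in the example: product of the attribute-value frequencies.
evidence = F(5, 14) * F(4, 14) * F(7, 14) * F(6, 14)

# The predicted class is the one with the larger score.
prediction = max(score, key=score.get)

for c in score:
    print(c, round(float(score[c]), 4))
print("P(X) =", round(float(evidence), 4), "prediction:", prediction)
```

Both scores are divided by the same P(X), so the division does not change which class has the larger value; the prediction can be read off the unnormalized scores directly.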
Example:
Thank you