01 - Introduction
Machine Learning
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)
[Diagram: a motivating example of past data — items described by yes/no features with known prices ($100,000, $140,000, $190,000, $250,000, $400,000); the Data and the Output are fed to a learning algorithm, which produces the Program.]
[Diagram: Traditional CS — Data + Program → Output. Machine Learning — Data + Output → Program (training); the learned Program is then applied to new Data to produce Output (testing).]
What is Machine Learning?
Formally: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. (Tom Mitchell, 1997)
A worked example with a small confusion matrix:

                  Gold positive   Gold negative
System positive   tp = 2          fp = 4
System negative   fn = 1          tn = 9

• Accuracy = Correct decisions / All decisions = (tp + tn) / (tp + tn + fp + fn) = (2 + 9) / (2 + 9 + 4 + 1) = 11/16
• Precision = True positives / All positives = tp / (tp + fp) = 2 / (2 + 4) = 2/6
• Recall (sensitivity) = True positives / All true events = tp / (tp + fn) = 2 / (2 + 1) = 2/3
• Specificity = True negatives / All false events = tn / (fp + tn) = 9 / (4 + 9) = 9/13
Gold standards
In spam detection, for example, for each item (email document):
• we need to know whether our system called it spam or not
• we also need to know whether the email actually is spam, i.e. the human-defined label for each document
• We will refer to these human labels as the gold labels.
Gold Labels, Annotators and Agreement
• Multiple annotators
Utterance   Ann1   Ann2   Raw agreement   Rand   Agreement (A1, Rand)   Agreement (A2, Rand)
S1          +      +      1               -      0                      0
S2          -      -      1               +      0                      0
S3          +      -      0               +      1                      0
S4          -      +      0               -      1                      0
…           …      …      …               …      …                      …
• Raw agreement
• Chance agreement
• Cohen’s Kappa and Krippendorff's alpha
• Agreement over a random subset with an expert
annotator
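As a quick illustration (not from the slides), here is a minimal Python sketch that computes raw agreement and Cohen's kappa for the four toy utterances S1–S4 in the table above; the label lists are just the Ann1 and Ann2 columns.

```python
# A minimal sketch of raw agreement and Cohen's kappa for the toy table above.
from collections import Counter

ann1 = ["+", "-", "+", "-"]   # Ann1's labels for S1..S4
ann2 = ["+", "-", "-", "+"]   # Ann2's labels for S1..S4

# Raw (observed) agreement: fraction of items both annotators label identically
p_o = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)

# Chance agreement: probability the two annotators agree by picking labels independently
c1, c2, n = Counter(ann1), Counter(ann2), len(ann1)
p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(ann1) | set(ann2))

# Cohen's kappa corrects observed agreement for agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

print(p_o, p_e, kappa)   # 0.5 0.5 0.0 for this tiny example
```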
Accuracy, Precision, Recall
Accuracy = My correct answers / All questions = (tp + tn) / (tp + tn + fp + fn)
(What fraction of the time am I correct in my classification?)

Precision = True positives / My positives = tp / (tp + fp)
(How much should you trust me when I say that something tests positive? Or: what fraction of my positives are true positives?)

Recall = Sensitivity = True positives / Real positives = tp / (tp + fn)
(How much of the reality has been covered by my positive output? Or: what fraction of the true positives is captured by my positives? E.g. how many sick people are correctly identified as having the condition?)

Specificity = True negatives / Real negatives = tn / (tn + fp)
(How much of the reality has been covered by my negative output? Or: what fraction of the true negatives is captured by my negatives? E.g. how many identified healthy people do not have the condition?)
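The formulas above translate directly into code. A minimal sketch, reusing the tp = 2, fp = 4, fn = 1, tn = 9 counts from the earlier worked example:

```python
# A minimal sketch of the four basic metrics from binary confusion counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):          # aka sensitivity
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

tp, fp, fn, tn = 2, 4, 1, 9   # counts from the worked example earlier in the slides
print(accuracy(tp, tn, fp, fn))   # 11/16 = 0.6875
print(precision(tp, fp))          # 2/6  ≈ 0.333
print(recall(tp, fn))             # 2/3  ≈ 0.667
print(specificity(tn, fp))        # 9/13 ≈ 0.692
```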
Precision and recall
Precision: Fraction of selected items that are correct
Recall: Fraction of correct items that are selected
False positives are Type-I errors; false negatives are Type-II errors.
More Related Measures
tp / (tp + fn) = "Sensitivity", aka "True Positive Rate"
tn / (fp + tn) = "Specificity", aka "True Negative Rate"
tp / (tp + fp) = "Positive Predictive Value", aka Precision
tn / (fn + tn) = "Negative Predictive Value"
1 - Specificity = "False Positive Rate" = fp / (fp + tn), aka "False Acceptance Rate"
1 - Sensitivity = "False Negative Rate" = fn / (tp + fn), aka "False Rejection Rate"
True Positive Rate / False Positive Rate = "Positive Likelihood Ratio"
False Negative Rate / True Negative Rate = "Negative Likelihood Ratio"
Probability / (1 - Probability) = "Odds", often expressed as X:Y
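A short sketch of these derived rates, reusing the same illustrative confusion counts (the variable names simply abbreviate the terms above):

```python
# Derived rates and likelihood ratios from one set of confusion counts.
tp, fp, fn, tn = 2, 4, 1, 9

tpr = tp / (tp + fn)      # sensitivity / true positive rate / recall
tnr = tn / (fp + tn)      # specificity / true negative rate
ppv = tp / (tp + fp)      # positive predictive value / precision
npv = tn / (fn + tn)      # negative predictive value
fpr = 1 - tnr             # false positive rate = fp / (fp + tn)
fnr = 1 - tpr             # false negative rate = fn / (tp + fn)

positive_lr = tpr / fpr   # positive likelihood ratio
negative_lr = fnr / tnr   # negative likelihood ratio
print(positive_lr, negative_lr)
```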
Precision, Recall, Accuracy
Imagine you’re the CEO of the Delicious Pie Company and you
need to know what people are saying about your pies on social
media
You build a system that detects tweets concerning Delicious Pie
• the positive class is tweets about Delicious Pie
• the negative class is all other tweets.
Imagine that we looked at a million tweets
• only 100 of them are discussing their love (or hatred) for our pie
• the other 999,900 are tweets about something completely unrelated
• Imagine a simple classifier that stupidly classified every tweet as “not
about pie”
• This classifier would have 999,900 true negatives and only 100 false negatives, for an accuracy of 999,900/1,000,000 or 99.99%!
Accuracy is not a good metric when the goal is to discover
something that is rare, or at least not completely balanced in
frequency
A very common situation in the world.
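A tiny sketch of the trap described above, using the pie-tweet numbers: the do-nothing classifier scores 99.99% accuracy while its recall is 0.

```python
# A classifier that labels every tweet "not about pie": impressive accuracy, useless recall.
tp, fp = 0, 0          # it never says "about pie"
fn = 100               # all 100 pie tweets are missed
tn = 999_900           # everything else is correctly ignored

accuracy = (tp + tn) / (tp + tn + fp + fn)        # 0.9999
recall = tp / (tp + fn)                           # 0.0 -- this exposes the problem
precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined/zero: it makes no positive calls
print(accuracy, precision, recall)
```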
Precision, Recall, Accuracy
You are shown a set of 21 coins: 10 gold and 11 copper.
Your task is to accept all gold coins and reject all copper ones.
You accept 7 coins as being gold (these are your positives):
• 5 of these are actually gold (true positives, tp)
• 2 of these are copper (false positives, fp)
• You falsely rejected 5 gold ones (false negatives, fn)
• You correctly rejected 9 copper ones (true negatives, tn)

                   Actual Gold   Actual Copper
Predicted Gold     5 (tp)        2 (fp)
Predicted Copper   5 (fn)        9 (tn)
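For comparison, the coin example can be checked with scikit-learn (an assumed dependency, not mentioned in the slides) by reconstructing the decisions as label lists:

```python
# Rebuild the coin decisions as label lists and let the library do the counting.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1 = gold, 0 = copper; 5 tp, 2 fp, 5 fn, 9 tn as in the table above
y_true = [1]*5 + [0]*2 + [1]*5 + [0]*9
y_pred = [1]*5 + [1]*2 + [0]*5 + [0]*9

print(accuracy_score(y_true, y_pred))    # 14/21 ≈ 0.667
print(precision_score(y_true, y_pred))   # 5/7  ≈ 0.714
print(recall_score(y_true, y_pred))      # 5/10 = 0.5
```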
Geometric Mean
GM = (a1 · a2 · a3 · … · an)^(1/n)
For 2 values: GM = √(a1 · a2)

Harmonic Mean
HM = n / (1/a1 + 1/a2 + 1/a3 + … + 1/an)
For 2 values: HM = 2 / (1/a1 + 1/a2) = 2·a1·a2 / (a1 + a2)
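A small sketch of the three means in Python (the values are illustrative). Note how the harmonic mean is pulled toward the smaller of the two values, which is what makes it attractive for combining precision and recall later on.

```python
# Arithmetic, geometric, and harmonic means of a list of positive values.
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

p, r = 0.5, 0.9
print(arithmetic_mean([p, r]))   # 0.70
print(geometric_mean([p, r]))    # ≈ 0.671
print(harmonic_mean([p, r]))     # ≈ 0.643 -- pulled toward the smaller value
```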
Arithmetic Mean
Ref: http://economistatlarge.com/finance/applied-finance/differences-arithmetic-geometric-harmonic-means
Two coffeeshops are each rated by two reviewers who use different scales (one out of 5, the other out of 100). If we naively take the arithmetic mean of the raw ratings for each coffeeshop:
• Coffeeshop A = (4.5 + 68) ÷ 2 = 36.25
• Coffeeshop B = (3 + 75) ÷ 2 = 39
If we first rescale the 5-point ratings to the 100-point scale (multiply by 20):
Coffeeshop A
• 4.5 × 20 = 90
• (90 + 68) ÷ 2 = 79
Coffeeshop B
• 3 × 20 = 60
• (60 + 75) ÷ 2 = 67.5
F-Measure
F = 2 / (1/P + 1/R) = 2PR / (P + R)
(the harmonic mean of precision P and recall R)

F-β-Measure
We can choose to favor precision or recall by using an interpolation weight α:
F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α
• The balanced F1 measure has β = 1 (that is, α = ½), as shown above.
• To give more weight to precision, we pick β in the interval 0 < β < 1 [notice that β² multiplies P in the denominator].
• To give more weight to recall, we pick β in the interval 1 < β < +∞.
• β → 0 considers only precision; β → +∞ considers only recall.
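A minimal sketch of the F-measure formulas, reusing the precision 2/6 and recall 2/3 from the earlier worked example; `beta` is the weighting parameter β.

```python
# F-beta measure; beta=1 gives the balanced F1 = 2PR/(P+R).
def f_measure(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = 2/6, 2/3   # precision and recall from the earlier worked example
print(f_measure(p, r))             # balanced F1 ≈ 0.444
print(f_measure(p, r, beta=0.5))   # beta < 1: weights precision more
print(f_measure(p, r, beta=2.0))   # beta > 1: weights recall more
```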
More Than Two Classes: Sets of binary classifiers
One-of or multinomial classification
• Classes are mutually exclusive: each instance is in exactly one class
For each class c∈C
• Build a classifier γc to distinguish c from all other classes c’∈C.
Given test instance d,
• Evaluate it for membership in each class using each γc
• d belongs to the one class with maximum score
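A minimal sketch of this one-vs-rest scheme, assuming scikit-learn and a toy random dataset (the data, labels, and choice of logistic regression are illustrative, not from the slides):

```python
# One-of classification via a set of binary classifiers: one gamma_c per class,
# predict the class whose classifier gives the maximum score.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
X = np.random.randn(60, 5)                                   # toy feature vectors
y = np.random.choice(["urgent", "normal", "spam"], size=60)  # toy labels

binary_clfs = {}
for c in sorted(set(y)):
    clf = LogisticRegression()
    clf.fit(X, (y == c).astype(int))   # distinguish c from all other classes
    binary_clfs[c] = clf

def predict(x):
    # evaluate membership in each class, then take the class with the maximum score
    scores = {c: clf.decision_function(x.reshape(1, -1))[0]
              for c, clf in binary_clfs.items()}
    return max(scores, key=scores.get)

print(predict(X[0]))
```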
Evaluation
3-way one-of email categorization decision (urgent, normal,
spam)
Per class evaluation measures
Recall for class i: fraction of instances in class i classified correctly
  = c_ii / Σ_j c_ij
Precision for class i: fraction of instances assigned class i that are actually about class i
  = c_ii / Σ_j c_ji
Accuracy (1 − error rate): fraction of instances classified correctly
  = Σ_i c_ii / Σ_i Σ_j c_ij
(where c_ij is the number of instances of class i that the system assigned to class j)
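A sketch of these per-class measures on an illustrative 3×3 confusion matrix for the urgent/normal/spam task (the counts below are made up for the example, not taken from the slides):

```python
# Per-class precision/recall and overall accuracy from a confusion matrix c,
# where c[i][j] counts instances of true class i assigned to class j.
import numpy as np

classes = ["urgent", "normal", "spam"]
c = np.array([[  8,   5,   3],
              [ 10,  60,  50],
              [  1,  10, 200]])

for i, name in enumerate(classes):
    recall_i = c[i, i] / c[i, :].sum()       # c_ii / sum_j c_ij
    precision_i = c[i, i] / c[:, i].sum()    # c_ii / sum_j c_ji
    print(name, precision_i, recall_i)

accuracy = np.trace(c) / c.sum()             # sum_i c_ii / sum_ij c_ij
print(accuracy)
```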
Evaluation
A micro-average is dominated by the more frequent
class (in this case spam)
• as the counts are pooled
The macro-average better reflects the statistics of the smaller classes
• and is more appropriate when performance on all the classes is equally important.
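A short sketch contrasting the two averages, assuming per-class (tp, fp) counts pooled from one-vs-rest tables; the counts are illustrative only:

```python
# Micro- vs. macro-averaged precision from per-class (tp, fp) counts.
per_class = {            # class -> (tp, fp); illustrative numbers
    "urgent": (8, 11),
    "normal": (60, 55),
    "spam":   (200, 33),
}

# Macro-average: compute precision per class, then average the per-class values
macro_p = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Micro-average: pool the counts first, then compute a single precision
tp_sum = sum(tp for tp, _ in per_class.values())
fp_sum = sum(fp for _, fp in per_class.values())
micro_p = tp_sum / (tp_sum + fp_sum)

print(macro_p, micro_p)  # the micro-average is pulled toward the largest class (spam)
```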
Test sets and cross-validation
We use:
• the training set to train the model,
• the development test set (also called a devset) to
perhaps tune some parameters and decide what the best
model is
• Run the best model on the unseen test set to report its performance (precision, recall, F-measure, accuracy, error rate)
The use of a devset avoids overfitting to the test set.
Test sets and cross-validation
But having a fixed training set, devset, and test set creates
another problem:
• in order to save lots of data for training, the test set (or devset)
might not be large enough to be representative
• It would be better if we could somehow use all our data both
for training and test.
We do this by cross-validation:
• For example, randomly choose a training and test set division
of our data, train our classifier, and then compute the error rate
on the test set
• Then repeat with a different randomly selected training set and
test set.
• We do this sampling process 10 times and average these 10
runs to get an average error rate.
• This is called 10-fold cross-validation
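A minimal sketch of 10-fold cross-validation, assuming scikit-learn and a toy synthetic dataset (the model and data are placeholders, not the slides' example):

```python
# 10-fold cross-validation: train on 9 folds, test on the held-out fold,
# and average the 10 per-fold error rates.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # toy features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int) # toy labels

error_rates = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    error_rates.append(1 - model.score(X[test_idx], y[test_idx]))

print(np.mean(error_rates))   # cross-validated error estimate
```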
The next few slides are from:
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU
One may be tempted to use the entire training data to select
the “optimal” classifier, then estimate the error rate
This naïve approach has two fundamental problems
• The final model will normally overfit the training data: it will not
be able to generalize to new data
• The problem of overfitting is more pronounced with models that have a
large number of parameters
• The error rate estimate will be overly optimistic (lower than the
true error rate)
• In fact, it is not uncommon to achieve 100% correct
classification on training data
• How to make the best use of your (limited) data for
  – training,
  – model selection, and
  – performance estimation?
Cross-validation
A problem with cross-validation is:
• Because all the data is used for testing, we need the whole corpus to be blind, i.e. we can't examine any of the data to suggest possible features
• But looking at the corpus is often important for designing
the system
It is common to create a fixed training set and test set,
then do 10-fold cross-validation inside the training
set, but compute error rate the normal way in the
test set
Cross-validation with fixed test data
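A sketch of this setup, assuming scikit-learn: a fixed held-out test set, 10-fold cross-validation inside the training set for model selection (here, picking a regularization strength C as an illustrative choice), and a single final error rate on the test set.

```python
# Fixed train/test split; cross-validation only inside the training set.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))            # toy features
y = (X[:, 0] > 0).astype(int)            # toy labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection by 10-fold CV on the training data only
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    acc = cross_val_score(LogisticRegression(C=C), X_train, y_train,
                          cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean()
    if acc > best_acc:
        best_C, best_acc = C, acc

# Final error rate computed the normal way, once, on the untouched test set
final = LogisticRegression(C=best_C).fit(X_train, y_train)
print(1 - final.score(X_test, y_test))
```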
For more details please visit
http://aghaaliraza.com
Thank you!