
Machine Learning

CS 535/EE 514 Machine Learning

Agha Ali Raza


Sources
- Cornell Machine Learning (CS 4780), Fall 2018, Kilian Weinberger, http://www.cs.cornell.edu/courses/cs4780/2018fa/
- MIT 6.034 Artificial Intelligence, Fall 2010, Patrick Winston, http://ocw.mit.edu/6-034F10
- Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.
- Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2), 215–219. https://doi.org/10.1162/neco.1994.6.2.215
Traditional Computer Science
Tasks like:
- Play an audio/video file
- Display a text file on screen
- Perform a mathematical operation on two numbers
- Sort an array of numbers using Insertion Sort
- Search for a string in a text file

Data + Program → Output
Machine Learning
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)

Example tasks: Tumor? (y/n); Price?; What was said?; Summarize text.

(Figure: past housing data feeding the ML pipeline; each house has yes/no features and a price, e.g. $100,000, $140,000, $400,000, $250,000, $190,000. The Data plus the Output are used to produce the Program.)
Traditional CS: Data + Program → Output
Machine Learning: Data + Output → Program

Machine Learning vs. Traditional CS:
- Training: Data + Output → Program
- Testing: Data + Program → Output
What is Machine Learning?
Formally: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell, 1997)

Informally: Algorithms that improve on some task with experience.
A Brief History of
Machine Learning
The Turing Test (1950)
- Alan Turing creates the “Turing Test” to determine if a computer has real intelligence
- To pass the test, a computer must be able to fool a human into believing it is also human.

(Figure: Player C, the interrogator, tries to determine which of players A and B is the computer and which is the human. The interrogator is limited to using the responses to written questions to make the determination.)
Samuel’s Checker Player (1952)
- The algorithm learned from experience
- Essentially Shannon’s minimax algorithm with alpha-beta pruning
- The program improved over time as it remembered every position it had already seen, along with the terminal value of the reward function
- Chinook, an improved version developed in 1989, beat the world champion in 1994
- Checkers is now a solved problem

(Photo: Marion Franklin Tinsley, checkers world champion)
Perceptron (1957) – Frank Rosenblatt
- Provable convergence
properties
- Stack together to create an
artificial neural network
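
Below is a minimal sketch of the perceptron learning rule (my own illustrative code, not from the slides): on every misclassified point, nudge the separating hyperplane toward that point.

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """X: (n, d) feature matrix; y: labels in {-1, +1}. Illustrative sketch."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi                   # move the hyperplane toward xi
                b += yi
                errors += 1
        if errors == 0:                        # convergence: guaranteed if the data
            break                              # is linearly separable
    return w, b
```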
AI Winter (1974–1980 and 1987–1993)
https://medium.com/@jayeshbahire/the-xor-problem-in-neural-networks-50006411840b

- Minsky and Papert (1969) showed that a single-layer perceptron can never learn the XOR function
- A limitation of this architecture is that it can only separate data points with a single line
- XOR inputs are not linearly separable
- Note: the solution is to expand beyond the single layer by adding a hidden layer to form a multilayer perceptron (MLP), as sketched below.
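
To make the fix concrete, here is a hand-wired two-layer network that computes XOR with step activations. The weights are chosen by hand for illustration (hidden unit 1 computes OR, hidden unit 2 computes AND); nothing here is learned.

```python
import numpy as np

step = lambda z: int(z > 0)  # Heaviside step activation

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    h_or  = step(x @ np.array([1, 1]) - 0.5)  # hidden unit 1: x1 OR x2
    h_and = step(x @ np.array([1, 1]) - 1.5)  # hidden unit 2: x1 AND x2
    return step(h_or - 2 * h_and - 0.5)       # output: OR but not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_mlp(a, b))      # prints the XOR truth table
```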
AI Winter (1974–1980 and 1987–1993)
https://medium.com/@jayeshbahire/the-xor-problem-in-neural-networks-50006411840b

- The expectation bubble burst and funding for AI collapsed for several decades
- Rebirth as “Machine Learning”, though the original term was coined in 1959 by Arthur Samuel
- The rebirth was initially mostly a name game to get funding, but it eventually led to profound differences
AI and ML – The differences
- ML is bottom-up, AI is top-down
- AI tries to imitate the human (the problems and the methods); ML starts from the machine, its capabilities and limitations
- ML is more practical, with smaller goals than AI
- ML is based on Statistics and Optimization, unlike AI, which was based on Logic
- Certainty in the real world is a rare luxury. Uncertainty is the basis of ML, and it is quantified using probability and statistics
TD-Gammon (1994)
- Gerry Tesauro (IBM) developed a neural network able to teach itself to play backgammon by playing against itself and learning from the results (and it beat the world champion)
- Starting from random initial weights (and hence a random initial strategy), TD-Gammon achieves a strong level of play
- Initially it only checks whether a move is legal and then makes it, so early play is random
- When one of the self-play copies wins, its moves are reinforced positively: reinforcement learning
- When a set of handcrafted features is added to the network's input representation, the result is a staggering level of performance
- The program discovered a new opening move
Deep Blue (1997)
- IBM’s Deep Blue wins against Garry Kasparov in chess.
- Match 1: Philadelphia, 1996, won by Kasparov.
- Match 2: New York City, 1997, won by Deep Blue.
- The first defeat of a reigning world chess champion by a computer
under tournament conditions.
- Deep Blue employed parallel alpha-beta search
- Good Old-Fashioned Artificial Intelligence rather than deep learning, which would come a decade later.
- A brute force approach
Watson wins at Jeopardy (2011)
- Watson is a question-answering computer
system capable of answering questions
posed in natural language
- Developed in IBM's DeepQA project
- Initially developed to answer questions on the quiz show Jeopardy!
- In 2011, Watson competed on Jeopardy!
against champions Brad Rutter and Ken
Jennings, winning the first place prize of $1
million.
And now…
Machine Learning everywhere… machines customizing to your unique patterns
- SPAM filters
- Autocomplete
- Web search
- Machine Translation
- Speech to text, text to speech
- Self-driving cars
- Machine Learning in other domains
- Biology, Chemistry, Health, Economics, Development,…
Types of Machine Learning
- Supervised Learning
- Given labelled data predict the labels of unseen data
- SPAM vs. HAM, Speech Recognition, predict house
prices,…
- Unsupervised Learning
- Given unlabeled data, discover similar patterns and
structures
- Automatically cluster news articles by topics, books by
authors, speech by speakers,…
- Reinforcement Learning
- Learn from delayed feedback
- A robot learns to fly, walk, play backgammon,…
Performance measures:
Precision, Recall, Specificity
and the F measure
The Search Problem
Search
Data:   34   2  29   8   0  -15  45  66  100   8  55   2
Index:   0   1   2   3   4    5   6   7    8   9  10  11

• Searching for integers
• Linear search
  o First occurrence (search for 45)
  o All occurrences (search for 8)
  o How many iterations?
• What if the array is sorted?

Data:  -15   0   2   2   8   8  29  34  45  55  66  100
Index:   0   1   2   3   4   5   6   7   8   9  10   11

  o Can we do a better job at searching?
  o How many iterations?
  o How do you look up words in a dictionary? (See the sketch below.)
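
As a sketch of the two strategies (plain Python, using the example arrays above):

```python
def linear_search(data, target):
    """First occurrence, or -1: one comparison per element, O(n)."""
    for i, value in enumerate(data):
        if value == target:
            return i
    return -1

def binary_search(sorted_data, target):
    """Requires sorted input: halve the range each step, O(log n)."""
    lo, hi = 0, len(sorted_data) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_data[mid] == target:
            return mid
        if sorted_data[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(linear_search([34, 2, 29, 8, 0, -15, 45, 66, 100, 8, 55, 2], 45))  # 6
print(binary_search([-15, 0, 2, 2, 8, 8, 29, 34, 45, 55, 66, 100], 45))  # 8
```

This mirrors dictionary lookup: you open near the middle and discard half the book at each step.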
Given a character string, search a sub-string
• “Now, here, you see, it takes all the running you can do, to keep
in the same place. If you want to get somewhere else, you must
run at least twice as fast as that!”
(Lewis Carroll, Through the Looking-Glass)
• Search for “the”
• So far, so good! Right?
• “The big fat brother of the other child decided to bathe in the
sunshine with them.”
• Search for “the”
• “The big fat brother of the other child decided to bathe in the
sunshine with them.”
• Wait… What???
• Check for spaces? End/start of sentence? Capitals vs. lowercase? Substrings? (See the naive search sketch below.)
• Regular expressions to the rescue!
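
Here is a sketch of the naive sliding-window substring search (compare the pattern against each same-length segment of the text); it also exposes the problem above, since the first match lands inside another word:

```python
def find_substring(text, sub):
    """Return the start index of the first occurrence of sub, or -1."""
    n, m = len(text), len(sub)
    for i in range(n - m + 1):
        if text[i:i + m] == sub:  # compare sub to the current segment
            return i
    return -1                     # reached the end without a match

s = "The big fat brother of the other child decided to bathe in the sunshine with them."
print(find_substring(s, "the"))   # 15 -- inside "brother"!
```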
Regular Expressions
Tools: regex101.com (with Python), regexpal.com, the Sublime editor
Operations:
• Disjunction: [] e.g. /[tT]his/, /[A-Z]/
• Kleene star and plus: * + e.g. /a*/, /a+/, /[tT]*/
• Any single character: . e.g. /beg.n/
• Optional character: ? e.g. /colou?r/
• Disjunction: | e.g. /gupp(y|ies)/, /(t|T)his/
• Anchors: ^, $ e.g. /^[Ss]tart/, /[eE]nd$/
• Parentheses: () e.g. /(this|that) one/
• Negation (inside a character class): ^ e.g. /[^A-Za-z]/
• Escape: \ e.g. /period\./
• Counter: {} e.g. /a{2,5}/ i.e. 2 to 5 occurrences of a; /a{3}/ i.e. exactly 3 a's; /a{2,}/ i.e. at least 2 occurrences of a (compare /ba*!/, which matches b!, ba!, baa!, baaa!, …)
Generalizations:
• Any capital: A-Z; any small: a-z; any digit: 0-9
• \d: any digit, \D: any non-digit, \w: any alphanumeric/underscore, \W: any non-alphanumeric, \s: whitespace [ \r\t\n\f], \S: [^\s]
Memory:
• /the (.*)er they (.*), the \1er we \2/ e.g. matches "the faster they ran, the faster we ran"
Substitution:
• s/colour/color/ i.e. replace colour with color (see the Python sketch below)
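
A few of these operations run with Python's re module (regex101.com shows the same matches); the test strings here are my own:

```python
import re

text = "This is the colour (or color) of baaaa! sounds."
print(re.findall(r"[tT]his", text))          # disjunction: ['This']
print(re.findall(r"colou?r", text))          # optional character: ['colour', 'color']
print(re.findall(r"ba{2,5}!", text))         # counter: ['baaaa!']
print(re.search(r"^[tT]his", text).group())  # anchor: 'This'
print(re.sub(r"colour", "color", text))      # substitution: s/colour/color/
m = re.search(r"the (.*)er they (.*), the \1er we \2",
              "the faster they ran, the faster we ran")
print(m.groups())                            # memory/backreferences: ('fast', 'ran')
```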
Binary Classifiers
• Think of a search as a deterministic binary classifier
• It returns positives and leaves behind negatives
• Its decisions can be true or false
• Searching for “the” in “The big fat brother of the other child decided to bathe in the sunshine with them.” gives:

            True   False
Positive      2      4
Negative      9      1

• Accuracy = Correct decisions / All decisions = (tp + tn) / (tp + tn + fp + fn) = (2 + 9) / (2 + 9 + 4 + 1) = 11/16
• Precision = True positives / All positives = tp / (tp + fp) = 2 / (2 + 4) = 2/6
• Recall (sensitivity) = True positives / All true events = tp / (tp + fn) = 2 / (2 + 1) = 2/3
• Specificity = True negatives / All false events = tn / (fp + tn) = 9 / (4 + 9) = 9/13
Gold standards
In spam detection (for example), for each item (email document):
• we need to know whether our system called it spam or not
• we also need to know whether the email actually is spam, i.e. the human-defined label for each document
• We will refer to these human labels as the gold labels.
Gold Labels, Annotators and Agreement
• Multiple annotators

Utterance  Ann1  Ann2  Raw agreement  Rand  Agr(A1, Rand)  Agr(A2, Rand)
S1          +     +         1          -          0              0
S2          -     -         1          +          0              0
S3          +     -         0          +          1              0
S4          -     +         0          -          1              0
…           …     …         …          …          …              …
(Column sums: sum of agreements; sum of chance agreements for each annotator)

• Raw agreement
• Chance agreement
• Cohen’s Kappa and Krippendorff’s alpha (sketched below)
• Agreement over a random subset with an expert annotator
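
A sketch of raw agreement and Cohen's kappa for two annotators (the helper is written for this table; the S1–S4 labels come from above):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    n = len(ann1)
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n     # raw (observed) agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    p_e = sum((c1[label] / n) * (c2[label] / n)           # agreement expected by chance,
              for label in set(c1) | set(c2))             # from each annotator's marginals
    return (p_o - p_e) / (1 - p_e)

# S1..S4 from the table: raw agreement is 0.5, but kappa corrects it to 0.0
print(cohens_kappa(["+", "-", "+", "-"], ["+", "-", "-", "+"]))  # 0.0
```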
Accuracy, Precision, Recall

Accuracy = My Correct Answers / All Questions = (tp + tn) / (tp + tn + fp + fn)
(What fraction of the time am I correct in my classification?)

Precision = True Positives / My Positives = tp / (tp + fp)
(How much should you trust me when I say that something tests positive? Or: what fraction of my positives are true positives?)

Recall = Sensitivity = True Positives / Real Positives = tp / (tp + fn)
(How much of the reality has been covered by my positive output? Or: what fraction of the real positives is captured by my positives? E.g. how many sick people are correctly identified as having the condition?)

Specificity = True Negatives / Real Negatives = tn / (tn + fp)
(How much of the reality has been covered by my negative output? Or: what fraction of the real negatives is captured by my negatives? E.g. how many healthy people are correctly identified as not having the condition?)
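
These four definitions reduce to a few lines of code; plugging in the counts from the search example earlier (tp=2, fp=4, fn=1, tn=9) reproduces the values computed there:

```python
def metrics(tp, fp, fn, tn):
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),   # aka sensitivity
        "specificity": tn / (tn + fp),
    }

print(metrics(tp=2, fp=4, fn=1, tn=9))
# accuracy 11/16 = 0.6875, precision 2/6 ≈ 0.33, recall 2/3 ≈ 0.67, specificity 9/13 ≈ 0.69
```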
Precision and recall
Precision: fraction of selected items that are correct
Recall: fraction of correct items that are selected

Type-I errors: false positives (hurt precision)
Type-II errors: false negatives (hurt recall)
More Related Measures
tp / (tp + fn) = "Sensitivity", aka "True Positive Rate"
tn / (fp + tn) = "Specificity", aka "True Negative Rate"
tp / (tp + fp) = "Positive Predictive Value", aka Precision
tn / (fn + tn) = "Negative Predictive Value"
1 − Specificity = "False Positive Rate" = fp / (fp + tn), aka "False Acceptance Rate"
1 − Sensitivity = "False Negative Rate" = fn / (tp + fn), aka "False Rejection Rate"
True Positive Rate / False Positive Rate = "Positive Likelihood Ratio"
False Negative Rate / True Negative Rate = "Negative Likelihood Ratio"
Probability / (1 − Probability) = "Odds", often expressed as X:Y
Precision, Recall, Accuracy
Imagine you’re the CEO of the Delicious Pie Company and you
need to know what people are saying about your pies on social
media
You build a system that detects tweets concerning Delicious Pie
• the positive class is tweets about Delicious Pie
• the negative class is all other tweets.
Imagine that we looked at a million tweets
• only 100 of them are discussing their love (or hatred) for our pie
• the other 999,900 are tweets about something completely unrelated
• Imagine a simple classifier that stupidly classified every tweet as “not about pie”
• This classifier would have 999,900 true negatives and only 100 false negatives, for an accuracy of 999,900/1,000,000 or 99.99%!
Accuracy is not a good metric when the goal is to discover
something that is rare, or at least not completely balanced in
frequency
A very common situation in the world.
Precision, Recall, Accuracy
You are shown a set of 21 coins: 10 gold and 11 copper.
Your task to accept all gold coins and reject all copper ones
You accept 7 coins as being gold (these are your positives)
• 5 of these are actually gold (these are your true positives, tp)
• 2 of these are copper (these are your false positives, fp)
• You falsely rejected 5 gold ones (false negatives, fn)
• You correctly rejected 9 copper ones (true negatives, tn)
Actual Gold Actual Copper
Predicted Gold 5 2
Predicted Copper 5 9

• Your precision is tp/(all of your positives) = tp/(tp+fp) = 5/7
• Your recall is tp/(number of actual gold coins) = tp/(tp+fn) = 5/10
• Your specificity is tn/(number of copper coins) = tn/(fp+tn) = 9/11
• Your accuracy is correct answers/all attempts = (tp+tn)/(tp+tn+fp+fn) = (5+9)/(5+9+2+5) = 14/21
Precision, Recall, Accuracy
Realistic extremes:

You accept only one coin, and it is gold:
• Your precision is very high (1/1) but recall is very low (1/10)
             Ac Gold  Ac Copper
  Pr Gold       1        0
  Pr Copper     9       11

You return all 21 coins:
• Your recall is very high (10/10) but precision is very low (10/21)
             Ac Gold  Ac Copper
  Pr Gold      10       11
  Pr Copper     0        0

Only one of the 21 coins is gold, and you reject everything:
• Your accuracy is very high (20/21 ≈ 0.95) but precision and recall are 0
             Ac Gold  Ac Copper
  Pr Gold       0        0
  Pr Copper     1       20
A combined measure of Precision and Recall
• It is useful to have a single number to describe
performance. Should be high when both P and R are high.
• The mean of precision and recall?
• But what kind of a mean should we use?
• Simple Arithmetic Mean is problematic: e.g.
• P = 0.0, R = 1.0, AM = 0.5
• P = 0.1, R = 0.9, AM = 0.5
• Requirements:
• We need a weighted mean as we may care more about P or R
• We need a conservative (deliberately lower) estimate of mean
• If P and R are far apart we need the mean to tend to the lower value
• In order to do well, a classifier must do well on both P and R so that
it cannot beat the system by being either too reluctant or too
promiscuous
What is mean (or average)?
• The central tendency is a single number that represents
the most common value for a list of numbers.
• It is the value that has the highest probability from the
probability distribution that describes all possible values
that a random variable may have.
• Many ways to calculate it
• Mode: the most common value in the data distribution
• Median: the middle value if all values in the data sample
were ordered
• Mean: the average value – three common types
• The mean differs from the median and the mode in that it is computed from the values of all the data points, rather than from their order (median) or their frequency (mode).
What is mean (or average)?
• Different ways to calculate the mean based on the
type of data.
• Three common types:
• Arithmetic mean
• Geometric mean
• Harmonic mean
aka, the Pythagorean means
Types of Means

Arithmetic Mean
AM = (a1 + a2 + a3 + … + an) / n
For 2 values: AM = (a1 + a2) / 2

Geometric Mean
GM = (a1 × a2 × a3 × … × an)^(1/n)
For 2 values: GM = √(a1 × a2)

Harmonic Mean
HM = n / (1/a1 + 1/a2 + 1/a3 + … + 1/an)
For 2 values: HM = 2 / (1/a1 + 1/a2) = 2·a1·a2 / (a1 + a2)
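
The three means in plain Python (math.prod needs Python 3.8+; the geometric mean assumes strictly positive values):

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))      # nth root of the product

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)    # reciprocal of the AM of reciprocals

xs = list(range(1, 10))  # 1..9, the first row of the table in the Harmonic Mean slide
print(arithmetic_mean(xs), geometric_mean(xs), harmonic_mean(xs))
# 5.0, ~4.15, ~3.18; AM >= GM >= HM always holds for positive values
```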
Arithmetic Mean
Ref: http://economistatlarge.com/finance/applied-finance/differences-arithmetic-geometric-harmonic-means

• The most common average, and the simplest to compute
• The AM of 2, 3 and 4 is 3
• Suitable when:
• the data is not skewed (no extreme outliers)
• the individual data points are not dependent on each other
• the numbers are relatively evenly distributed, e.g.
o follow a normal distribution
o when you are rolling a fair die, the expected value is the
mean of all the numbers on it
• Easily distorted if the sample of observations contains outliers
or for data that has a non-Gaussian distribution (e.g. multiple
peaks – a multi-modal probability distribution).
• The AM is more meaningful when a variable has a Gaussian
or Gaussian-like data distribution.
Geometric Mean
Ref: https://towardsdatascience.com/on-average-youre-using-the-wrong-average-geometric-harmonic-means-in-data-analysis-2a703e21ea0

• The geometric mean multiplies rather than sums the values, then takes the nth root rather than dividing by n
• It essentially says: if every number in our dataset was the
same number, what would that number have to be in order to
have the same multiplicative product as our actual dataset?
• This makes it well-suited for describing multiplicative
relationships, such as rates and ratios, even if those ratios are
on different scales (i.e. do not have the same denominator)
Geometric Mean
Ref: https://towardsdatascience.com/on-average-youre-using-the-wrong-average-geometric-harmonic-means-in-data-analysis-2a703e21ea0

• The geometric mean should be used when the data points are inter-related
• It’s appropriate for numbers that are distributed along a
logarithmic scale - that is, when you’re as likely to find a
number twice the size as a number half the size.
• The geometric mean does not accept negative or zero values, i.e. all values must be positive.
• A fancy feature of the geometric mean is that you can often
average across numbers on completely different scales.
Geometric Mean
• Compare ratings for two coffeeshops using two different sources.
• The problem is that source 1 uses a 5-star scale and source 2 uses a
100-point scale:
Coffeeshop A
• Source 1 rating: 4.5
• Source 2 rating: 68
Coffeeshop B
• Source 1 rating: 3
• Source 2 rating: 75

If we naively take the arithmetic mean of raw ratings for each coffeeshop:
Coffeeshop A = (4.5 + 68) ÷2 = 36.25
Coffeeshop B = (3 + 75) ÷2 = 39

We’d conclude that Coffeeshop B was the winner.


Geometric Mean
• The right way to do this is:
• We need to normalize our values onto the same scale before averaging
them with the arithmetic mean, to get an accurate result.
• So we multiply the source 1 ratings by 20 to bring them from a 5-star
scale to the 100-point scale of source 2:

Coffeeshop A
• 4.5 * 20 = 90
• (90 + 68) ÷2 = 79
Coffeeshop B
• 3 * 20 = 60
• (60 + 75) ÷2 = 67.5

So we find that Coffeeshop A is the true winner, contrary to the naive


application of arithmetic mean above.
Geometric Mean
• For this particular problem, the geometric mean allows us to reach the same conclusion:
Coffeeshop A = square root of (4.5 * 68) = 17.5
Coffeeshop B = square root of (3 * 75) = 15
• The arithmetic mean is dominated by numbers on the larger scale,
which makes us think Coffeeshop B is the higher rated shop. This is
because the arithmetic mean expects an additive relationship between
numbers and does not account for scales and proportions. Hence the
need to bring numbers onto the same scale before applying the
arithmetic mean.
• The geometric mean, on the other hand, can handle varying proportions with ease, due to its multiplicative nature. This is a tremendously useful property, but we no longer have any interpretable scale at all. The geometric mean is effectively unitless in such situations.
Harmonic Mean
Ref: http://economistatlarge.com/finance/applied-finance/differences-arithmetic-geometric-harmonic-means

• The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of the reciprocals
• Best used in situations where extreme outliers exist in the population
x0 x1 x2 x3 x4 x5 x6 x7 x8 AM GM HM
1 2 3 4 5 6 7 8 9 5.00 4.15 3.18
2 4 8 16 32 64 128 256 512 113.56 32.00 9.02
5 5 5 5 5 5 5 5 5 5.00 5.00 5.00
5 5 5 5 5 5 5 5 10 5.56 5.40 5.29
5 5 5 5 5 5 5 5 100 15.56 6.97 5.59
5 5 5 5 5 5 5 5 1000 115.56 9.01 5.62
5 5 5 5 5 5 5 5 10000 1115.56 11.63 5.62
5 5 5 5 5 5 5 5 100000 11115.56 15.03 5.62
5 5 5 5 5 5 5 100000 100000 22226.11 45.16 6.43
5 5 5 5 5 100000 100000 100000 100000 44447.22 407.89 9.00
5 100000 100000 100000 100000 100000 100000 100000 100000 88889.44 33274.21 44.98
100000 100000 100000 100000 100000 100000 100000 100000 100000 100000.0 100000.0 100000.0
A combined measure
Use harmonic mean instead of arithmetic mean as
we are taking the average of ratios
Harmonic mean punishes extreme values more
strictly:
• Consider a trivial classifier (always returns class A)
• A very large number of data elements of class B, and a single element of class A:
  o Precision: ≈ 0.0
  o Recall: 1.0
  o Arithmetic mean ≈ 0.5 (50%), despite this being nearly the worst possible outcome!
  o The harmonic mean is nearly 0.
  o To have a high F-score, you need both high precision and high recall
F-1-MEASURE
A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

F = 2 / (1/P + 1/R) = 2PR / (P + R)
F-β-MEASURE
We can choose to favor precision or recall by using an interpolation weight α:

F = 1 / (α/P + (1 − α)/R) = (β² + 1)PR / (β²P + R),  where β² = (1 − α)/α

• The balanced F1 measure has β = 1 (that is, α = ½), as shown above.
• To give more weight to Precision, we pick a β value in the interval 0 < β < 1 (notice that β² multiplies P in the denominator).
• To give more weight to Recall, we pick a β value in the interval 1 < β < +∞.
• β → 0 considers only precision; β → +∞ only recall.
F-MEASURE
The β parameter differentially weights the importance of recall and precision
• based on the needs of an application
• values of β > 1 favor recall, while
• values of β < 1 favor precision
When β = 1, precision and recall are equally balanced.
This is the most frequently used metric, called Fβ=1 or just F1:

F1 = 2PR / (P + R)
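
As a sketch, Fβ computed from precision and recall (the helper name is my own):

```python
def f_beta(p, r, beta=1.0):
    """F-beta: the weighted harmonic mean of precision p and recall r."""
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r) if (p + r) > 0 else 0.0

print(f_beta(0.5, 0.5))           # 0.5: balanced inputs
print(f_beta(0.1, 0.9))           # 0.18: the low value drags F1 down (AM would say 0.5)
print(f_beta(0.1, 0.9, beta=2.0)) # ~0.35: beta > 1 weights recall more heavily
```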
Intuition
Ref: https://stats.stackexchange.com/questions/49226/how-to-interpret-f-measure-values, http://www.electronics-tutorials.ws/resistor/res_4.html

The formula for the F-measure (F1, with β = 1) is the same as the formula for the equivalent resistance of two resistors placed in parallel in physics (apart from the factor of 2).

This analogy defines the F-measure as the equivalent resistance formed by sensitivity and precision placed in parallel.
The net resistance is reduced as soon as either one loses resistance.
More than two classes
Lots of classification tasks in language processing
have more than two classes:
• Sentiment analysis (positive, negative, neutral),
• Part-of-speech tagging (|POS tags|)
• Word sense disambiguation (|word senses|)
• Emotion detection (|emotions|)
More Than Two Classes: Sets of binary classifiers
Any-of or multi-label classification
• An instance can belong to 0, 1, or more than one class.
For each class c∈C
• Build a binary classifier γc to distinguish c from all other classes
c’∈C (c vs not c)
Given test instance d,
• Evaluate it for membership in each class using each γc
• d belongs to any class for which γc returns true

More Than Two Classes: Sets of binary classifiers
One-of or multinomial classification
• Classes are mutually exclusive: each instance in exactly one class
For each class c∈C
• Build a classifier γc to distinguish c from all other classes c’∈C.
Given test instance d,
• Evaluate it for membership in each class using each γc
• d belongs to the one class with maximum score

Evaluation
3-way one-of email categorization decision (urgent, normal,
spam)
Per class evaluation measures
(Let c_ij be the number of instances of true class i that were assigned class j.)

Recall for class i: the fraction of instances in class i classified correctly:
  Recall_i = c_ii / Σ_j c_ij

Precision for class i: the fraction of instances assigned class i that are actually about class i:
  Precision_i = c_ii / Σ_j c_ji

Accuracy (1 − error rate): the fraction of all instances classified correctly:
  Accuracy = Σ_i c_ii / Σ_i Σ_j c_ij
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macro-averaging: compute performance for each class, then average.
• Micro-averaging: collect decisions for all classes, compute the contingency table, evaluate.
Evaluation
A micro-average is dominated by the more frequent
class (in this case spam)
• as the counts are pooled
The macro-average better reflects the statistics of the
smaller classes,
• is more appropriate when performance on all the classes
is equally important.
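
A small numeric sketch of the difference (the per-class counts are invented for illustration):

```python
classes = {
    "spam":   {"tp": 200, "fp": 10},  # frequent class, high precision
    "urgent": {"tp": 1,   "fp": 9},   # rare class, low precision
}

# Macro: average the per-class precisions, so every class counts equally
per_class = [c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()]
macro_p = sum(per_class) / len(per_class)

# Micro: pool the counts first, so the frequent class dominates
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro_p = tp / (tp + fp)

print(f"macro precision {macro_p:.2f}, micro precision {micro_p:.2f}")
# macro 0.53 vs micro 0.91: the pooled (micro) figure is dominated by "spam"
```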
Test sets and cross-validation
We use:
• the training set to train the model,
• the development test set (also called a devset) to
perhaps tune some parameters and decide what the best
model is
• Run the best model on unseen test set to report its
performance (precision, recall, F-measure, accuracy,
error rate)
The use of a devset avoids overfitting to the test set
Test sets and cross-validation
But having a fixed training set, devset, and test set creates
another problem:
• in order to save lots of data for training, the test set (or devset)
might not be large enough to be representative
• It would be better if we could somehow use all our data both
for training and test.
We do this by cross-validation:
• For example, randomly choose a training and test set division
of our data, train our classifier, and then compute the error rate
on the test set
• Then repeat with a different randomly selected training set and
test set.
• We do this sampling process 10 times and average these 10
runs to get an average error rate.
• This is called 10-fold cross-validation
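
A sketch of 10-fold cross-validation with scikit-learn (assuming it is available; any classifier and dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Train on 9 folds, test on the held-out fold, repeat 10 times, then average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())  # average accuracy across the 10 runs
```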
The next few slides are from:
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU
One may be tempted to use the entire training data to select
the “optimal” classifier, then estimate the error rate
This naïve approach has two fundamental problems
• The final model will normally overfit the training data: it will not
be able to generalize to new data
• The problem of overfitting is more pronounced with models that have a
large number of parameters
• The error rate estimate will be overly optimistic (lower than the
true error rate)
• In fact, it is not uncommon to achieve 100% correct
classification on training data
• How to make the best use of your (limited) data for:
  – Training
  – Model selection
  – Performance estimation
Cross-validation
A problem with cross-validation is:
• Because all the data is used for testing, we need the whole corpus to be blind, i.e. we can't examine any of the data to suggest possible features
• But looking at the corpus is often important for designing the system
It is common to create a fixed training set and test set,
then do 10-fold cross-validation inside the training
set, but compute error rate the normal way in the
test set
Cross-validation with fixed test data
For more details please visit
http://aghaaliraza.com

Thank you!
