Unit 3
Why “Learn”?
• Machine learning is programming computers to
optimize a performance criterion using example data
or past experience.
• There is no need to “learn” to calculate payroll.
• Learning is used when:
– Human expertise does not exist (navigating on Mars)
– Humans are unable to explain their expertise (speech recognition)
– The solution changes over time (routing on a computer network)
– The solution needs to be adapted to particular cases (user biometrics)
What We Talk About When
We Talk About “Learning”
• Learning general models from data consisting of particular
examples
• Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
• Example in retail: Customer transactions to consumer
behavior:
People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)
• Build a model that is a good and useful
approximation to the data.
Data Mining/KDD
• Definition: “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
• Applications:
• Retail: Market basket analysis, Customer relationship
management (CRM)
• Finance: Credit scoring, fraud detection
• Manufacturing: Optimization, troubleshooting
• Medicine: Medical diagnosis
• Telecommunications: Quality of service optimization
• Bioinformatics: Motifs, alignment
• Web mining: Search engines
• ...
What is Machine Learning?
• It is very hard to write programs that solve problems like
recognizing a face.
– We don’t know what program to write because we don’t
know how our brain does it.
– Even if we had a good idea about how to do it, the
program might be horrendously complicated.
• Instead of writing a program by hand, we collect lots of
examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples
and produces a program that does the job.
– The program produced by the learning algorithm may
look very different from a typical hand-written program.
It may contain millions of numbers.
– If we do it right, the program works for new cases as
well as the ones we trained it on.
A classic example of a task that requires machine
learning: it is very hard to say what makes a handwritten digit a “2”.
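As a rough illustration of this learn-from-examples workflow (not part of the original slides), the sketch below trains a classifier on scikit-learn’s bundled handwritten-digit images; the dataset, model choice, and split are assumptions made purely for illustration.

```python
# A minimal sketch of "learning a program from examples", assuming scikit-learn is available.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled examples: each input is an 8x8 image of a digit, each output is its label (0-9).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The learning algorithm produces a "program" (here, a matrix of learned numbers).
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# If we do it right, the learned program also works on new cases it was not trained on.
print("accuracy on unseen examples:", model.score(X_test, y_test))
```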
Some examples of tasks that are best
solved by using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear
power plant or unusual sound in your car engine.
• Prediction:
– Future stock prices or currency exchange rates
Some web-based examples of machine learning
[Figure: Machine Learning at the intersection of related fields: AI, decision theory, game theory, control theory, information theory, biological evolution, probability & statistics, philosophy, optimization, data mining, statistical mechanics, psychology, computational complexity theory, and neurophysiology]
Example 1: Credit Risk Analysis
• Typical customer: bank.
• Database:
– Current clients’ data, including:
– basic profile (income, house ownership,
delinquent accounts, etc.)
– Basic classification.
• Methodology:
– Consider “typical words” for each category.
– Classify using a “distance” measure.
Example 3: Robot control
• Goal: Control a robot in an unknown
environment.
• Needs both
– to explore (new places and actions)
– to use acquired knowledge to gain
benefits.
• The learning task “controls” what it
observes!
Example 4: Medical Application
• Goal: Monitor multiple physiological
parameters.
History of Machine Learning
• 1960’s and 70’s: Models of human learning
– High-level symbolic descriptions of knowledge, e.g., logical expressions
or graphs/networks, e.g., (Karpinski & Michalski, 1966) (Simon & Lea,
1974).
– Winston’s (1975) structural learning system learned logic-based
structural descriptions from examples.
• Today’s status:
– First-generation algorithms:
– Neural nets, decision trees, etc.
• Future:
– Smart remote controls, phones, cars
– Data and communication networks,
software
Types of models
• Supervised learning
– Given access to classified (labeled) data (see the sketch after this list)
• Unsupervised learning
– Given access to data, but no classification
– Important for data reduction
• Control learning
– Selects actions and observes
consequences.
– Maximizes long-term cumulative return.
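A minimal sketch of the first two model types, assuming scikit-learn and its iris data purely for illustration (control/reinforcement learning is not sketched here):

```python
# A brief sketch contrasting supervised and unsupervised learning with scikit-learn
# (library choice and dataset are illustrative assumptions, not from the slides).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: we are given classified (labeled) data and fit a predictor.
clf = DecisionTreeClassifier().fit(X, y)
print("predicted class of first sample:", clf.predict(X[:1]))

# Unsupervised learning: only the data, no classification; here we group it into clusters,
# which can also serve as a form of data reduction (each point summarized by its cluster).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments of first five samples:", km.labels_[:5])
```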
Learning: Complete Information
• A probability distribution D1 over class S (“smiley”)
and a probability distribution D2 over class H.
• The two classes are equally likely.
• Compute the probability of “smiley”
given a point (x,y).
• Use Bayes’ formula.
• Let p be that probability.
Task: assign a class label to a
point at location (x,y)
P(S | (x,y)) = P((x,y) | S) P(S) / P((x,y))
             = P((x,y) | S) P(S) / [ P((x,y) | S) P(S) + P((x,y) | H) P(H) ]
• Decide between S and H by comparing
P(S | (x,y)) to P(H | (x,y)).
• Clearly, one needs to know all these
probabilities.
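A minimal sketch of this complete-information setting; the two class-conditional densities are taken to be known 2-D Gaussians, which is an assumption for illustration only:

```python
# Complete-information classification via Bayes' formula, assuming the densities D1 and D2
# are known 2-D Gaussians and the two classes are equally likely (as on the slide).
from scipy.stats import multivariate_normal

D1 = multivariate_normal(mean=[0.0, 0.0], cov=1.0)   # density of class S ("smiley")
D2 = multivariate_normal(mean=[2.0, 2.0], cov=1.0)   # density of class H
P_S, P_H = 0.5, 0.5                                   # equally likely classes

def posterior_S(point):
    """Bayes' formula: P(S | (x,y)) = P((x,y)|S) P(S) / [P((x,y)|S) P(S) + P((x,y)|H) P(H)]."""
    num = D1.pdf(point) * P_S
    return num / (num + D2.pdf(point) * P_H)

p = posterior_S([0.5, 0.3])
print("P(S | (x,y)) =", p, "-> predict", "S" if p > 0.5 else "H")
```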
Predictions and Loss Model
• How do we determine the optimality of
a prediction?
• We define a loss for every prediction
• Try to minimize the loss
– Predict a Boolean value.
– For each error we lose 1 (no error, no loss).
– Compare the probability p to 1/2.
– Predict deterministically with the higher
value.
– Optimal prediction (for zero-one loss); see the short derivation below.
• Cannot recover probabilities!
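The slide states the rule without justification; the short derivation below (added here, with p = P(S | (x,y)) as above) spells out why thresholding at 1/2 is optimal for zero-one loss and why the probabilities cannot be recovered from the optimal predictor.

```latex
% Added derivation: expected zero-one loss of each deterministic prediction.
\[
E[\mathrm{loss} \mid \mathrm{predict}\ S] = 1 - p,
\qquad
E[\mathrm{loss} \mid \mathrm{predict}\ H] = p .
\]
% Hence the optimal rule predicts S iff p > 1/2, with expected loss \min(p, 1-p).
% Every p > 1/2 yields the same prediction (S), so the optimal predictor reveals only
% which side of 1/2 the probability lies on, not the probability itself.
```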
Bayes Estimator
Weak Learning:
Assume that for any distribution D, there is some
predicate h ∈ H that predicts better than 1/2 + ε.
Methodology:
Change the distribution to target “hard” examples (see the boosting-style sketch below).
[Figure: training examples labeled + and −, with an unlabeled point “?”]
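The slides name only the methodology; the sketch below is an AdaBoost-style illustration (the specific algorithm and the decision-stump weak learner are assumptions) of repeatedly reweighting the distribution toward the hard examples:

```python
# AdaBoost-style sketch of "change the distribution to target hard examples".
# The weak learner (a depth-1 decision tree) and the update rule are assumptions; the slides
# only require some predicate in H that beats 1/2 + eps under every distribution D.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                      # relabel to {-1, +1}
D = np.full(len(X), 1.0 / len(X))  # initial distribution over the examples

for t in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    err = D[pred != y].sum()                     # weighted error under the current distribution
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    D *= np.exp(-alpha * y * pred)               # up-weight the "hard" (misclassified) examples
    D /= D.sum()
```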
Separating Hyperplane
• Perceptron: output = sign( Σ_i x_i w_i )
• Find weights w_1, ..., w_n
• Limited representation
[Figure: perceptron unit with inputs x_1, ..., x_n, weights w_1, ..., w_n, a summation node, and a sign output]
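A minimal sketch of training the perceptron described above; the classical mistake-driven update rule and the toy data are assumptions for illustration:

```python
# Minimal perceptron sketch (numpy only).
import numpy as np

def perceptron(X, y, epochs=20):
    """Find w1..wn (and a bias) so that sign(sum_i x_i * w_i + b) matches labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (x @ w + b) <= 0:   # misclassified: nudge the hyperplane toward the example
                w += label * x
                b += label
    return w, b

# Usage on a linearly separable toy set (AND-like data with {-1, +1} labels).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron(X, y)
print("predictions:", np.sign(X @ w + b))
```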
Neural Networks
• Sigmoidal gates: a = Σ_i x_i w_i and output = 1/(1 + e^(−a))
[Figure: network of sigmoidal units over inputs x_1, ..., x_n]
Decision Trees
• Limited representation
• Highly interpretable
[Figure: decision tree splitting on x_1 > 5, then x_6 > 2, with leaves +1 and −1]
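A tiny sketch of the sigmoidal gate defined above (the example inputs and weights are arbitrary):

```python
# A single sigmoidal gate: a = sum_i x_i * w_i, output = 1 / (1 + e^(-a)).
import numpy as np

def sigmoid_gate(x, w):
    a = np.dot(x, w)
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.1])
print(sigmoid_gate(x, w))   # a value in (0, 1), unlike the hard sign output of the perceptron
```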
Support Vector Machine
[Figure: + and − points separated by a hyperplane; data mapped from n dimensions to m dimensions]
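A minimal SVM sketch, assuming scikit-learn and an RBF kernel as the (implicit) map to a higher-dimensional space; both choices are illustrative assumptions:

```python
# Minimal support vector machine sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
# The RBF kernel implicitly maps the n-dimensional inputs into a higher-dimensional space
# in which a separating hyperplane is sought.
clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```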
Reinforcement Learning
Unsupervised learning: Clustering
Basic Concepts in Probability
• For a single
hypothesis h:
– Given an observed
error
– Bound the true error
• Markov inequality (for a nonnegative random variable x and any λ > 0):
Pr[ x ≥ λ ] ≤ E[x] / λ
Basic Concepts in Probability
• Chebyshev inequality:
Pr[ |x − E[x]| ≥ λ ] ≤ Var(x) / λ²
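A quick numerical check of both bounds (the exponential distribution and the threshold are arbitrary illustrative choices):

```python
# Empirically compare tail probabilities with the Markov and Chebyshev bounds.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative samples, E[x] = 1, Var(x) = 1
lam = 3.0

print("Pr[x >= lam]          =", (x >= lam).mean(), "<= Markov bound", x.mean() / lam)
print("Pr[|x - E[x]| >= lam] =", (np.abs(x - x.mean()) >= lam).mean(),
      "<= Chebyshev bound", x.var() / lam**2)
```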
Basic Concepts in Probability
• Chernoff Inequality
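The slide gives no formula; one common form, the additive Chernoff/Hoeffding bound for i.i.d. Bernoulli variables, is stated below for reference:

```latex
% One common form of the Chernoff (Hoeffding) bound, added since the slide shows no formula.
% For i.i.d. Bernoulli(p) variables x_1, ..., x_m and any eps > 0:
\[
\Pr\!\left[\, \frac{1}{m}\sum_{i=1}^{m} x_i - p \;\geq\; \varepsilon \,\right]
\;\leq\; e^{-2\varepsilon^{2} m} .
\]
```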