49 Machine Learning
Topics Covered
• Introduction to ML
• K-Nearest Neighbor
• Decision Tree
• Ensemble Methods
• Random Forest
• Artificial Neural Networks
• Support Vector Machines
• K-Means Clustering
• Reinforcement Learning
Supervised Learning: Tree-based Methods
Road Map
Basic concepts
K-nearest neighbor
Decision tree induction
Ensemble methods: Bagging and Boosting
Summary
We start a little light….
Machine Learning: An example application
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:
• age
• marital status
• annual salary
• outstanding debts
• credit rating
• etc.
Problem: to decide whether an application should be approved, i.e., to classify applications into two categories, approved and not approved.
Another Example
Computer vision is hard
Machine learning and feature representations
[Diagram: raw pixel values (pixel 1, pixel 2) are fed to a learning algorithm; in this raw input space, motorbikes and "non"-motorbikes are hard to separate.]
How is computer perception done?
[Examples: object detection from low-level image features; speaker identification from low-level audio features; helicopter control from low-level state features mapped to actions.]
Machine learning and our focus
Like human learning from past experiences.
A computer does not have “experiences”.
A computer system learns from data, which
represent some “past experiences” of an
application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, high-risk or low-risk.
The task is commonly called supervised learning, classification, or inductive learning.
The data and the goal
Data: A set of data records (also called examples, instances or cases) described by
k attributes: A1, A2, …, Ak, and
a class: each example is labelled with a pre-defined class.
Goal: To learn a classification model from the
data that can be used to predict the classes
of new (future, or test) cases/instances.
An example: data (loan application)
Approved or not
An example: the learning task
Learn a classification model from the data
Use the model to classify future loan applications
into
Yes (approved) and
No (not approved)
What is the class for the following case/instance?
Supervised vs. unsupervised Learning
Supervised learning: classification is seen as
supervised learning from examples.
Supervision: The data (observations, measurements, etc.) are labeled with pre-defined classes. It is as if a "teacher" gave the classes (supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown
Given a set of data, the task is to establish the
existence of classes or clusters in the data
Supervised learning process: two steps
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy
What do we mean by learning?
Given
a data set D,
a task T, and
a performance measure M,
a computer system is said to learn from D to
perform the task T if after learning the
system’s performance on T improves as
measured by M.
In other words, the learned model helps the
system to perform T better as compared to
no learning.
An example
Data: Loan application data
Task: Predict whether a loan should be
approved or not.
Performance measure: accuracy.
Fundamental assumption of learning
Assumption: The distribution of training
examples is identical to the distribution of test
examples (including future unseen examples).
k-Nearest Neighbor Classification (kNN)
kNN does not build a model from the training data.
To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d.
Count the number n of training instances in P that belong to class cj.
Estimate Pr(cj|d) as n/k.
No training is needed. Classification time is linear in the training set size for each test case.
kNN Algorithm
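The algorithm itself is an image on this slide; as a stand-in, here is a minimal kNN sketch in Python (Euclidean distance and majority voting are the usual choices; the function names are ours):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=6):
    """Classify one test instance x by majority vote among its k nearest neighbors."""
    # Euclidean distance from x to every training instance
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training instances (the k-neighborhood P)
    nearest = np.argsort(dists)[:k]
    # Estimate Pr(cj | x) as n/k and return the majority class
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```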
Example: k = 6 (6NN)
[Figure: a test document is assigned the majority class among its 6 nearest neighbors, drawn from the classes Government, Science and Arts.]
Discussions
kNN can deal with complex and arbitrary
decision boundaries.
Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and in many cases as accurate as more elaborate methods.
kNN is slow at classification time.
kNN does not produce an understandable model.
Road Map
Basic concepts
K-nearest neighbor
Decision tree induction
Ensemble methods: Bagging and Boosting
Summary
Introduction
Decision tree learning is one of the most
widely used techniques for classification.
Its classification accuracy is competitive with
other methods, and
it is very efficient.
The classification model is a tree, called a decision tree.
C4.5 by Ross Quinlan is perhaps the best known system and the code is freely available on the internet.
The loan data (reproduced)
Approved or not
A decision tree from the loan data
Decision nodes and leaf nodes (classes)
Use the decision tree
[The tree classifies this test case as No (not approved).]
Is the decision tree unique?
No. Here is a simpler tree. We prefer smaller and more accurate trees.
From a decision tree to a set of rules
A decision tree can be converted to a set of rules.
Each path from the root to a leaf is a rule.
Algorithm for decision tree learning
Basic algorithm (a greedy divide-and-conquer algorithm):
Assume attributes are categorical for now (continuous attributes can be handled too).
The tree is constructed in a top-down recursive manner.
At the start, all the training examples are at the root.
Examples are partitioned recursively based on selected attributes.
Attributes are selected on the basis of an impurity function (e.g., information gain).
Conditions for stopping partitioning:
All examples for a given node belong to the same class.
There are no remaining attributes for further partitioning: the majority class becomes the leaf.
There are no examples left.
Decision tree learning algorithm
Choose an attribute to partition data
The key to building a decision tree is which attribute to choose in order to branch.
The objective is to reduce impurity or
uncertainty in data as much as possible.
A subset of data is pure if all instances belong to
the same class.
The heuristic in C4.5 is to choose the attribute
with the maximum Information Gain or Gain
Ratio based on information theory.
The loan data (reproduced)
Approved or not
Two possible roots, which is better?
Information theory
Information theory provides a mathematical
basis for measuring the information content.
To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
If one already has a good guess about the answer,
then the actual answer is less informative.
If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).
Information theory (cont …)
For a fair (honest) coin, you have no information, and you are willing to pay more (say in terms of $) for advance information: the less you know, the more valuable the information.
Information theory uses this same intuition,
but instead of measuring the value for
information in dollars, it measures information
contents in bits.
One bit of information is enough to answer a
yes/no question about which one has no
idea, such as the flip of a fair coin
Information theory: Entropy measure
The entropy formula:
entropy(D) = -\sum_{j=1}^{|C|} \Pr(c_j) \log_2 \Pr(c_j),  where  \sum_{j=1}^{|C|} \Pr(c_j) = 1
Entropy measure: let us get a feeling
[Figure: entropy values for several example class distributions; the purer the data, the smaller the entropy.]
Information gain
Partitioning D on attribute A into disjoint subsets D_1, …, D_v gives the weighted entropy
entropy_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, entropy(D_j)
and the information gain is gain(D, A) = entropy(D) - entropy_A(D).
Information gain (cont …)
An example
entropy(D) = -(6/15) log2(6/15) - (9/15) log2(9/15) = 0.971
entropy_Own_house(D) = (6/15) × entropy(D1) + (9/15) × entropy(D2)
                     = (6/15) × 0 + (9/15) × 0.918
                     = 0.551
entropy_Age(D) = (5/15) × entropy(D1) + (5/15) × entropy(D2) + (5/15) × entropy(D3)
               = (5/15) × 0.971 + (5/15) × 0.971 + (5/15) × 0.722
               = 0.888

Age    | Yes | No | entropy(Di)
young  |  2  |  3 | 0.971
middle |  3  |  2 | 0.971
old    |  4  |  1 | 0.722
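A quick check of the arithmetic above in Python (the counts come from the loan data; the helper name is ours):

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Whole data set: 6 "No" and 9 "Yes" examples
print(entropy([6, 9]))                     # 0.971

# Partition by Age: young (2 Yes, 3 No), middle (3, 2), old (4, 1)
parts = [[2, 3], [3, 2], [4, 1]]
weighted = sum(sum(p) / 15 * entropy(p) for p in parts)
print(weighted)                            # 0.888
print(entropy([6, 9]) - weighted)          # gain(D, Age) ≈ 0.083
```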
We build the final tree
Handling continuous attributes
Handle continuous attribute by splitting into
two intervals (can be more) at each node.
How to find the best threshold to divide?
Use information gain or gain ratio again.
Sort all the values of a continuous attribute in increasing order {v1, v2, …, vr}.
A possible threshold lies between each pair of adjacent values vi and vi+1. Try all possible thresholds and find the one that maximizes the gain (or gain ratio).
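A hedged sketch of the threshold search (midpoints between adjacent sorted values are one common choice; all names are ours):

```python
import math
from collections import Counter

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a threshold between each pair of adjacent sorted values
    and keep the one that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    best_gain, best_t = -1.0, None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                              # no gap between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint threshold
        left, right = ys[:i + 1], ys[i + 1:]      # values <= t vs. values > t
        g = H(ys) - len(left)/len(ys)*H(left) - len(right)/len(ys)*H(right)
        if g > best_gain:
            best_gain, best_t = g, t
    return best_t, best_gain
```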
An example in a continuous space
Avoid overfitting in classification
Overfitting: A tree may overfit the training data
Good accuracy on training data but poor on test data
Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
Two approaches to avoid overfitting
Pre-pruning: Halt tree construction early
Difficult to decide because we do not know what may
happen subsequently if we keep growing the tree.
Post-pruning: Remove branches or sub-trees from a "fully grown" tree.
This method is commonly used. C4.5 uses a statistical method to estimate the error at each node for pruning.
A validation set may be used for pruning as well.
An example
[Figure: a decision boundary that is likely to overfit the data.]
Other issues in decision tree learning
Road Map
Basic concepts
K-nearest neighbor
Decision tree induction
Ensemble methods: Bagging and Boosting
Summary
Ensemble
[Diagram: from one data set, train model 1, model 2, …, model k, then combine the multiple models into one!]
http://ews.uiuc.edu/~jinggao3/sdm10ensemble.htm
Motivations
• Motivations of ensemble methods
– Ensemble model improves accuracy and
robustness over single model methods
– Applications:
• distributed computing
• privacy-preserving applications
• large-scale data with reusable models
• multiple sources of data
– Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach)
Relationship with Related Studies
• Multi-task learning
– Learn multiple tasks simultaneously
– Ensemble methods: use multiple models to learn
one task
• Data integration
– Integrate raw data
– Ensemble methods: integrate information at the
model level
Relationship with Related Studies (2)
• Meta learning
– Learn on meta-data (include base model output)
– Ensemble methods: besides learning a joint model based on the model outputs, we can also combine the outputs by consensus
• Non-redundant clustering
– Give multiple non-redundant clustering
solutions to users
– Ensemble methods: give one solution to users
which represents the consensus among all the
base models
Why Ensemble Works?
• Intuition
– combining diverse, independent opinions in human
decision-making as a protective mechanism (e.g. stock
portfolio)
• Uncorrelated error reduction
– Suppose we have 5 completely independent classifiers for majority voting
– If accuracy is 70% for each, the majority vote accuracy is
10(0.7³)(0.3²) + 5(0.7⁴)(0.3) + (0.7⁵) = 83.7%
– With 101 such classifiers, the majority vote accuracy is 99.9% (see the sketch below)
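These numbers can be reproduced with an exact binomial tail (a sketch; the function name is ours):

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right class."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ≈ 0.837
print(majority_vote_accuracy(101, 0.7))  # ≈ 0.999…
```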
Summary
Semi-supervised learning (semi-supervised learning, collective inference) → ensemble counterparts: multi-view learning, consensus maximization
Unsupervised learning (k-means, spectral clustering, …) → ensemble counterpart: clustering ensemble
Ensemble of Classifiers—Learn to Combine
[Diagram: labeled training data is used to train classifier 1, classifier 2, …, classifier k; an ensemble model is then learned to combine their outputs and produce final predictions on unlabeled test data.]
Ensemble of Classifiers—Consensus
[Diagram: classifier 1, classifier 2, …, classifier k are trained on labeled data; their predictions on unlabeled test data are combined by majority voting to give the final predictions.]
• Problem
– Given a data set D = {x1, x2, …, xn} and their corresponding labels L = {l1, l2, …, ln}
Bias and Variance
• Ensemble methods
– Combine learners to reduce variance
from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.
Generating Base Classifiers
Random Forest Classifier
[Diagram, over several slides: the training data has N examples and M features.
1. Create bootstrap samples from the training data.
2. Construct a decision tree from each bootstrap sample.
3. At each node, choose the split feature among only m < M randomly chosen features.
4. To classify a new example, take the majority vote over all trees.]
Thus,…..
Random forest
• Available package:
• http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
• To read more:
• http://www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
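A minimal usage sketch with scikit-learn (the data set is synthetic and the hyperparameters illustrative; `max_features` plays the role of m < M above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 100 trees, each grown on a bootstrap sample; at each node only
# sqrt(M) randomly chosen features are considered for the split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))  # majority vote over the 100 trees
```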
Boosting
Boosting (1)
• Principles
– Boost a set of weak learners to a strong learner
– Make records currently misclassified more important
• Example
– Record 4 is hard to classify
– Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds
*[FrSc97]
from P. Tan et al. Introduction to Data Mining.
Boosting (2)
• AdaBoost
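The AdaBoost pseudo-code is an image in the original deck; as a hedged sketch of the reweighting principle just described, using decision stumps as weak learners (all names ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # start with uniform record weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:               # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        # misclassified records get larger weights for the next round
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the weak learners."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```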
In a nutshell….
Supervised Learning: Artificial Neural Networks
What are connectionist neural networks?
• Connectionism refers to a computer modeling approach to
computation that is loosely based upon the architecture of the brain.
The 2012 Breakthrough….
Deep Learning: Acknowledging Some Resources used here…
Handwriting Digit Recognition
[Diagram: a 16 × 16 image (256 pixels; ink → 1, no ink → 0) is flattened into inputs x1 … x256; the network outputs y1 … y10, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0"). Each output dimension represents the confidence of a digit, so the machine reads the image as "2".]
Example Application
• Handwriting Digit Recognition: the machine is a function f: R²⁵⁶ → R¹⁰ that maps the inputs x1 … x256 to the outputs y1 … y10 ("2").
In deep learning, the function f is represented by a neural network.
Element of Neural Network
Neuron  f: Rᴷ → R
z = a₁w₁ + a₂w₂ + ⋯ + a_K w_K + b
a = σ(z)
where a₁ … a_K are the inputs, w₁ … w_K the weights, b the bias, and σ the activation function.
Neural Network
[Diagram: input layer x1 … xN, hidden layers 1 … L of neurons, output layer y1 … yM.]
Sigmoid activation function:  σ(z) = 1 / (1 + e^(−z))
Example of Neural Network
[Figure: a three-layer sigmoid network with fixed weights. Input (1, −1) produces activations 0.98, 0.12 after layer 1; 0.86, 0.11 after layer 2; and 0.62, 0.83 at the output. Input (0, 0) produces 0.73, 0.5; then 0.72, 0.12; then 0.51, 0.85.]
f: R² → R²,   f(1, −1) = (0.62, 0.83),   f(0, 0) = (0.51, 0.85)
Different parameters define different functions.
Matrix Operation
σ( [1 −2; −1 1] · [1, −1]ᵀ + [1, 0]ᵀ ) = σ( [4, −2]ᵀ ) = [0.98, 0.12]ᵀ
Neural Network
Layer by layer:  a¹ = σ(W¹x + b¹),  a² = σ(W²a¹ + b²),  …,  y = σ(Wᴸaᴸ⁻¹ + bᴸ)
Neural Network
As one composed function:
y = f(x) = σ( Wᴸ ⋯ σ( W² σ( W¹x + b¹ ) + b² ) ⋯ + bᴸ )
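A minimal numpy sketch of this forward pass (the layer sizes and random weights are placeholders to show the shapes):

```python
import numpy as np

def sigma(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute y = sigma(W^L ... sigma(W^1 x + b^1) ... + b^L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigma(W @ a + b)
    return a

# A 256-input, 10-output network with two hidden layers of 30 units
rng = np.random.default_rng(0)
sizes = [256, 30, 30, 10]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.random(256), Ws, bs).shape)  # (10,)
```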
Softmax
• Softmax layer as the output layer
Ordinary layer: yᵢ = σ(zᵢ); in general, the output of the network can be any value.
Softmax layer: yᵢ = e^(zᵢ) / Σ_{j=1}^{3} e^(zⱼ)
Example: z = (3, 1, −3) → (e³, e¹, e⁻³) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
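The same computation in code, with the standard max-subtraction trick for numerical stability (the trick is ours, not on the slide):

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    e = np.exp(z - np.max(z))   # shift by the max to avoid overflow
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ≈ [0.88, 0.12, 0.00]
```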
How to set network parameters
θ = {W¹, b¹, W², b², ⋯, Wᴸ, bᴸ}
[Diagram: the 16 × 16 = 256 pixel inputs (ink → 1, no ink → 0) feed the network, whose output layer is a softmax over y1 … y10.]
Set the network parameters θ such that:
Input "1" → y1 has the maximum value.
Input "2" → y2 has the maximum value.
How to let the neural network achieve this?
Training Data
• Preparing training data: images and their labels (e.g. "1").
[Diagram: for an image of "1", the network outputs y1 = 0.2, y2 = 0.3, …, y10 = 0.5, while the target is y1 = 1 and all other yi = 0; the discrepancy defines the cost L(θ).]
The cost can be the Euclidean distance or the cross entropy of the network output and the target.
Total Cost
For all R training examples, the total cost is
C(θ) = Σ_{r=1}^{R} Lʳ(θ)
which measures how bad the network parameters θ are on this task.
Find the network parameters θ* that minimize this value.
Gradient Descent
Assume there are only two parameters w1 and w2 in the network: θ = {w1, w2}.
[Figure: the error surface C(θ) over the parameter space; following −∇C(θ) moves downhill, and progress stalls where ∇C(θ) ≈ 0 or = 0.]
In the physical world ……
• Momentum: a rolling ball does not stop immediately where the gradient = 0.
Improving Simple Gradient Descent: Momentum
Don't just change weights according to the current data point; re-use changes from earlier iterations.
Let Δw(t) = the weight change at time t, and let −η ∂E/∂w be the change we would make with regular gradient descent. Instead we use
Δw(t+1) = −η ∂E/∂w + α Δw(t)
w(t+1) = w(t) + Δw(t+1)
where α is the momentum parameter. Momentum damps oscillations. A hack? Well, maybe.
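The update rule in code (a sketch; the example objective and step sizes are ours):

```python
import numpy as np

def sgd_momentum(grad, w0, eta=0.01, alpha=0.9, steps=200):
    """Gradient descent with momentum:
    dw(t+1) = -eta * grad(w) + alpha * dw(t);  w <- w + dw."""
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw   # momentum re-uses the last change
        w = w + dw
    return w

# Example: minimize E(w) = w1^2 + 10*w2^2 (gradient is [2*w1, 20*w2])
print(sgd_momentum(lambda w: np.array([2*w[0], 20*w[1]]), [3.0, 2.0]))
```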
Mini-batch
Randomly initialize θ⁰.
Pick the 1st batch: C = L¹ + L³¹ + ⋯ ;  θ¹ ← θ⁰ − η∇C(θ⁰)
Pick the 2nd batch: C = L² + L¹⁶ + ⋯ ;  θ² ← θ¹ − η∇C(θ¹)
…
C is different each time we update the parameters!
[Figure: the original gradient descent trajectory is smooth; the mini-batch trajectory is noisier ("unstable") but each step is far cheaper.]
Repeat until all mini-batches have been picked: that completes one epoch.
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
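One epoch of mini-batch updates as a sketch (the data, gradient function and batch size are placeholders):

```python
import numpy as np

def one_epoch(theta, X, Y, grad_loss, eta=0.1, batch_size=32, rng=None):
    """One pass over the data in shuffled mini-batches:
    theta <- theta - eta * gradient of the batch cost."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # C for this update is the loss summed over the current batch only
        theta = theta - eta * grad_loss(theta, X[idx], Y[idx])
    return theta
```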
Linear Perceptrons
They are multivariate linear models:
Out(x) = wᵀx
with the sum-of-squares error
E = Σ_k ( y_k − Out(x_k) )² = Σ_k ( y_k − wᵀx_k )²
Gradient descent: w_j ← w_j − η ∂E/∂w_j
So what's ∂E/∂w_j ?
Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptron Training Rule
E = Σ_{k=1}^{R} ( y_k − wᵀx_k )²
Gradient descent tells us we should update w thusly if we wish to minimize E:
w_j ← w_j − η ∂E/∂w_j
So what's ∂E/∂w_j ?
∂E/∂w_j = Σ_{k=1}^{R} ∂/∂w_j ( y_k − wᵀx_k )²
        = −2 Σ_{k=1}^{R} δ_k ∂/∂w_j ( wᵀx_k )        …where… δ_k = y_k − wᵀx_k
        = −2 Σ_{k=1}^{R} δ_k ∂/∂w_j Σ_{i=1}^{m} w_i x_{ki}
        = −2 Σ_{k=1}^{R} δ_k x_{kj}
Linear Perceptron Training Rule
E = Σ_{k=1}^{R} ( y_k − wᵀx_k )²
Gradient descent tells us we should update w thusly if we wish to minimize E:
w_j ← w_j − η ∂E/∂w_j   …where…   ∂E/∂w_j = −2 Σ_{k=1}^{R} δ_k x_{kj}
So:  w_j ← w_j + 2η Σ_{k=1}^{R} δ_k x_{kj}
We frequently neglect the 2 (meaning we halve the learning rate).
The "Batch" perceptron algorithm
1) Randomly initialize weights w1, w2, …, wm
2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).
3) for i = 1 to R:  δ_i ← y_i − wᵀx_i
4) for j = 1 to m:  w_j ← w_j + η Σ_{i=1}^{R} δ_i x_{ij}
This update rule is known by MANY NAMES.
If data is voluminous and arrives fast, use the online version:
observe (x, y)
δ ← y − wᵀx
w_j ← w_j + η δ x_j
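A hedged numpy sketch of both variants (the learning rate, epoch count and function names are ours):

```python
import numpy as np

def batch_perceptron(X, y, eta=0.01, epochs=100):
    """Batch rule: w_j <- w_j + eta * sum_i delta_i * x_ij."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append 1's (bias term)
    w = np.random.default_rng(0).normal(size=X.shape[1])
    for _ in range(epochs):
        delta = y - X @ w                      # delta_i = y_i - w.x_i
        w += eta * X.T @ delta                 # update using all records at once
    return w

def online_update(w, x, y, eta=0.01):
    """Online rule: observe (x, y), then w_j <- w_j + eta * delta * x_j."""
    return w + eta * (y - w @ x) * x
```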
A 1-HIDDEN LAYER NET
NINPUTS = 2, NHIDDEN = 3
Hidden units: v_j = g( Σ_{k=1}^{N_INS} w_{jk} x_k ),  j = 1, 2, 3
Output: Out = g( Σ_{k=1}^{N_HID} W_k v_k )
OTHER NEURAL NETS
[Figure: a network with 2 hidden layers plus a constant term (an input fixed at 1).]
"JUMP" CONNECTIONS
Out = g( Σ_{k=1}^{N_INS} w_{0k} x_k + Σ_{k=1}^{N_HID} W_k v_k )
Copyright © 2001, 2003, Andrew W. Moore
Backprop
• Very powerful - can learn any function, given enough hidden units!
• Have the same problems of Generalization vs.
Memorization. With too many units, we will tend to
memorize the input and not generalize well. Some schemes
exist to “prune” the neural network.
• Networks require extensive training, many parameters to
fiddle with. Can be extremely slow to train. May also fall
into local minima.
• Inherently parallel algorithm, ideal for multiprocessor
hardware.
• Despite the cons, a very powerful algorithm that has seen
widespread successful deployment.
Types of Neurons
Linear Neuron
Logistic Neuron
Linear Regression Neural Networks
• What happens when we arrange linear
neurons in a multilayer network?
Linear Regression Neural Networks
• Nothing special happens.
– The product of two linear transformations is itself a linear
transformation.
Neural Networks
• We want to introduce non-linearities to the network.
– Non-linearities allow a network to identify complex regions
in space
Linear Separability
• 1-layer cannot handle XOR
• More layers can handle more complicated spaces – but
require more parameters
• Each node splits the feature space with a hyperplane
• If the second layer computes an AND, a 2-layer network can represent any convex hull.
Feed-Forward Networks
• Predictions are fed forward through the network to classify
[Animation over several slides: activations propagate layer by layer from the inputs to the output classification.]
Error Backpropagation
• We will do gradient descent on the whole network.
• Training will proceed from the last layer to the first.
• Introduce variables over the neural network, distinguishing the input and output of each node.
• Empirical risk function: training takes the gradient of the last component and iterates backwards.
• Optimize the last layer weights wkl, then the last hidden weights wjk, and repeat for all previous layers.
• Now that we have well defined gradients for each parameter, update using gradient descent.
Error Back-propagation
• Error backprop unravels the multivariate chain rule and solves the gradient for each partial component separately.
• The target values for each layer come from the next layer.
• This feeds the errors back along the network.
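As a concrete, hedged illustration of backprop on the 1-hidden-layer net from earlier, under squared error with sigmoid units (all names are ours):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.1):
    """One gradient-descent step on E = 0.5*(y - out)^2 for a
    1-hidden-layer sigmoid net: out = sigma(W2 . sigma(W1 x))."""
    # forward pass, keeping each node's output
    v = sigma(W1 @ x)                            # hidden outputs
    out = sigma(W2 @ v)                          # network output (scalar)
    # backward pass: last layer first, then feed the error back
    d_out = (out - y) * out * (1 - out)          # dE/dz at the output
    d_hid = (W2 * d_out) * v * (1 - v)           # dE/dz at each hidden node
    W2 -= eta * d_out * v                        # update last-layer weights
    W1 -= eta * np.outer(d_hid, x)               # update hidden weights
    return W1, W2, out
```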
Problems with Neural Networks
• Interpretation of Hidden Layers
• Overfitting
Interpretation of Hidden Layers
• What are the hidden layers doing?!
• Feature Extraction
• The non-linearities in the feature extraction can make
interpretation of the hidden layers very difficult.
• This leads to Neural Networks being treated as black boxes.
Universality Theorem
Any continuous function f: Rᴺ → Rᴹ can be realized by a network with one hidden layer (given enough hidden units).
Shallow vs. Deep: Fat + Short v.s. Thin + Tall
[Figure: a shallow, wide network and a deep, narrow network over the same inputs x1, x2, …, xN.]
Word Error Rate (%) by network shape:
Deep (Layers × Size): 1 × 2k → 24.2; 2 × 2k → 20.4; 3 × 2k → 18.4; 4 × 2k → 17.8; 5 × 2k → 17.2; 7 × 2k → 17.1
Shallow (1 layer): 1 × 3772 → 22.5; 1 × 4634 → 22.6; 1 × 16k → 22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why Deep?
• Deep → Modularization
[Diagram: training four classifiers directly on images (Classifier 1: girls with long hair; Classifier 2: boys with long hair; Classifier 3: girls with short hair; Classifier 4: boys with short hair) runs into trouble because some classes, e.g. boys with long hair, have only a few examples, so that classifier is weak.]
• Instead, first train basic classifiers for the attributes: Boy or Girl? Long or short hair? Each attribute classifier has plenty of examples on both sides.
• These attribute classifiers are then shared as modules by the four final classifiers, each of which becomes fine even when trained with little data.
Why Deep?
Deep learning also works on small data sets like TIMIT.
[Diagram: the modularization is automatically learned from data; the first layer learns the most basic classifiers, the second layer uses them as modules to build more complex classifiers, and so on.]
Hand-crafted kernel function vs. learnable kernel
SVM: apply a hand-crafted kernel function, then a simple classifier.
Deep Learning: the hidden layers learn a feature map φ(x) (a learnable kernel), followed by a simple classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
Recipe for Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Support Vector Machines, Clustering, and more…
Introduction
Support vector machines were invented by V. Vapnik and his co-workers in the 1970s in Russia and became known to the West in 1992.
SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative.
Kernel functions are used for nonlinear separation.
SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high dimensional data.
It is perhaps the best classifier for text classification.
Linear Classifiers
f(x, w, b) = sign(w · x + b)
[Figure: data points labeled +1 and −1 separated by a line; w · x + b > 0 on the +1 side and w · x + b < 0 on the −1 side.]
Any of these separating lines would be fine … but which is best?
[Figure: a poorly chosen line leaves a point misclassified into the +1 class.]
Classifier Margin
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very very well.
Support vectors are those datapoints that the margin pushes up against.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Example of Bad Decision Boundaries
[Figure: two separating lines that pass too close to the points of Class 1 or Class 2.]
Good Decision Boundary: Margin Should Be Large
The decision boundary should be as far away from the data of both classes as possible.
– We should maximize the margin, m = 2 / √(w · w)
[Figure: support vectors are the datapoints that the margin pushes up against; the maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM).]
The Optimization Problem
Let {x1, …, xn} be our data set and let yi ∈ {1, −1} be the class label of xi.
The decision boundary should classify all points correctly: yi (wᵀxi + b) ≥ 1 for all i.
A constrained optimization problem: minimize ||w||² = wᵀw subject to those constraints.
Lagrangian of the Original Problem: introduce multipliers αi ≥ 0.
The Dual Optimization Problem
[Figure: after solving the dual, most αi are zero (e.g. α3 = α4 = α9 = 0); only the support vectors get αi > 0 (e.g. α1 = 0.8, α6 = 1.4).]
Non-linearly Separable Problems
We allow an "error"/slack variable ξi in classification; it is based on the output of the discriminant function wᵀx + b.
ξi approximates the number of misclassified samples.
[Figure: points of Class 1 and Class 2 on the wrong side of their margin, each with a slack variable ξi.]
The Optimization Problem
w is also recovered as before from the support vectors.
The only difference from the linearly separable case is that there is an upper bound C on the αi.
Once again, a QP solver can be used to find the αi efficiently!
Extension to Non-linear SVMs (Kernel Machines)
Non-linear SVMs: Feature Space
General idea: the original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
Φ: x → φ(x),  e.g.  x = (x1, x2)  ↦  φ(x) = (x1², x2², √2·x1x2)
[Figure: points that are not linearly separable in the input space become linearly separable in the feature space with axes x1², x2², √2·x1x2.]
Kernel Trick
maximize   Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj (xi · xj)
subject to   C ≥ αi ≥ 0,   Σ_{i=1}^{N} αi yi = 0
Since the data appear only as dot products, we need not do the mapping explicitly:
K(xi, xj) = φ(xi) · φ(xj)
(*) Kernel function – a function that can be applied to pairs of input data to evaluate dot products in some corresponding feature space
Example Transformation
Original mapping vs. with kernel function: the dot product in feature space is computed directly from the original inputs.
Examples of Kernel Functions
[The kernel formulas on this slide were images lost in extraction; common examples include polynomial and Gaussian (RBF) kernels.]
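A short scikit-learn sketch contrasting two common kernels (the data set and parameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("poly", "rbf"):
    # C is the error penalty (the upper bound on the alphas in the dual)
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, clf.score(X, y), "support vectors per class:", clf.n_support_)
```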
Choosing the Kernel Function
Weaknesses
– Training (and Testing) is quite slow compared to ANN
• Because of Constrained Quadratic Programming
– Essentially a binary classifier
• However, there are some tricks to evade this.
– Very sensitive to noise
• A few off data points can completely throw off the algorithm
– Biggest Drawback: The choice of Kernel function.
• There is no “set-in-stone” theory for choosing a kernel function for
any given problem (still in research...)
• Once a kernel function is chosen, there is only ONE modifiable
parameter, the error penalty C.
Summary
Strengths
– Training is relatively easy
• We don’t have to deal with local minimum like in ANN
• SVM solution is always global and unique (check “Burges” paper
for proof and justification).
– Unlike ANN, doesn’t suffer from “curse of dimensionality”.
• How? Why? We have infinite dimensions?!
• Maximum Margin Constraint: DOT-PRODUCTS!
– Less prone to overfitting
– Simple, easy to understand geometric interpretation.
• No large networks to mess around with.
Applications of SVMs
Bioinformatics
Machine Vision
Text Categorization
Ranking (e.g., Google searches)
Handwritten Character Recognition
Time series analysis
Lots of very successful applications!!!
Reference
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.]
Clustering
• However,
– Similarity is hard to define, and …. “We know it when we see it”
– The real meaning of similarity is a philosophical question. We will
take a more pragmatic approach.
PC: http://www.funnyphotos.net.au/dog-owners/
Vehicle Example
[Scatter plot: top speed (km/h, 100–300) on the x-axis vs. weight (kg, 500–3500) on the y-axis; the lorries form a distinct cluster of heavy vehicles.]
Terminology
[The same plot annotated: the plane is the object or feature space; each point is a data point with a label; each axis is a feature; a group of nearby points is a cluster.]
Clustering is an ill-defined problem!!
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Types of Clusters: Center-Based
• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
8 contiguous clusters
Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
2 Overlapping Circles
Types of Clusters: Objective Function
Hard membership of data point k in cluster i:
u_ik = 1 if ||u_k − c_i||² ≤ ||u_k − c_j||² for all j, and 0 otherwise
(each point belongs to the cluster whose centre is nearest).
c-partition:
∪_{i=1}^{c} C_i = U            (all clusters together fill the whole universe U)
C_i ∩ C_j = ∅ for all i ≠ j    (clusters do not overlap)
∅ ⊂ C_i ⊂ U for all i          (a cluster C is never empty and is smaller than the whole universe U)
2 ≤ c ≤ K                      (there must be at least 2 clusters in a c-partition and at most as many as the number of data points K)
Objective Functions: Hard and Fuzzy C-Means
Hard c-means (k-means):
J = Σ_{i=1}^{c} J_i = Σ_{i=1}^{c} Σ_{k: u_k ∈ C_i} ||u_k − c_i||²
Fuzzy c-means:
J_FCM = Σ_{i=1}^{c} Σ_{k=1}^{n} μ_ik^m ||u_k − c_i||²
K-Means Type Clustering Algorithms
SSE = Σ_{j=1}^{K} Σ_{X_i ∈ C_j} ||X_i − c_j||²
– X_i is a data point in cluster C_j and c_j is the representative point (most often the mean) for cluster C_j.
– Given two clusterings, we can choose the one with the smallest error.
– One easy way to reduce SSE is to increase k, the number of clusters.
• A good clustering with smaller k can have a higher SSE than a poor clustering with higher k.
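A minimal numpy sketch of Lloyd's algorithm, the standard k-means iteration (initialization by random sampling is our choice):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean update to reduce SSE."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centre as the mean of its points (keep old centre if empty)
        centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    sse = ((X - centres[labels]) ** 2).sum()
    return labels, centres, sse
```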
Fuzzy C-Means: Fuzzy membership matrix M
μ_ik = 1 / Σ_{j=1}^{c} ( d_ik / d_jk )^(2/(q−1))
where μ_ik is point k's membership of cluster i, d_ik is the distance from point k to the current cluster centre i, and q is the fuzziness exponent.
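The membership update in numpy (a sketch; q > 1 assumed, and the small distance floor is our guard against division by zero):

```python
import numpy as np

def fcm_memberships(X, centres, q=2.0):
    """mu[i, k] = 1 / sum_j (d_ik / d_jk)^(2/(q-1))."""
    d = np.linalg.norm(centres[:, None, :] - X[None, :, :], axis=2)  # (c, n)
    d = np.maximum(d, 1e-12)            # avoid division by zero at a centre
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (q - 1.0))     # (c, c, n)
    return 1.0 / ratio.sum(axis=1)      # memberships; columns sum to 1
```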
[Figure: a sample data set ("Original Points") and six k-means iterations converging to a good clustering.]
Importance of Choosing Initial Centroids …
[Figure: with a different random initialization, five k-means iterations converge to a different, poorer clustering.]
Problems with Selecting Initial Points
Example of an extremely
simple one-dimensional dataset
In addition to two global minima, we have local minima at approximately (0, 10), (5, 10), (10, 0), and (10, 5). For these local minima one prototype covers one of the data clusters, while the other prototype is misled to the noise cluster.
A Few Open-ended Research Problems
• Detecting clustering tendency before doing any clustering operation: hypothesis testing?
[Figure: two scatter plots over Age (20–70) showing candidate clusters C, D and C′, D′.]
Based on "Reinforcement Learning – An Introduction" by Richard Sutton and Andrew Barto
Learning from Experience Plays a Role in …
Artificial Intelligence
Neuroscience
Artificial Neural Networks
What is Reinforcement Learning?
Supervised Learning
[Diagram: inputs → supervised learning system → outputs.]
Reinforcement Learning
[Diagram: inputs → RL system → outputs ("actions").]
The Big Picture
Your action influences the state of the world which determines its reward
Key Features of RL
The Markov Property
Markov Decision Processes
transition probabilities: P^a_ss' = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
reward expectations: R^a_ss' = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
for all s, s' ∈ S, a ∈ A(s).
An Example Finite MDP
Recycling Robot
Recycling Robot MDP
S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
R^search = expected no. of cans while searching
R^wait = expected no. of cans while waiting
R^search > R^wait
Transition Table
Complete Agent
Temporally situated
Continual learning and planning
Objective is to affect the environment
Environment is stochastic and uncertain
[Diagram: the agent takes an action; the environment returns a new state and a reward.]
Elements of RL
Policy: what to do
Reward: what is good
Value: what is good because it predicts reward
Model of environment: what follows what
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (no labels provided).
• RL deals with agents that must sense & act upon their environment.
This combines classical AI and machine learning techniques.
It is the most comprehensive problem setting.
• Examples:
• A robot cleaning my room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
• and so on
Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• How much time do you need to explore uncharted territory before you
exploit what you have learned?
Value Function
The features φ_k are fixed measurements of the state (e.g. the number of stones on the board). We only learn the parameters θ.
• Update rule (start in state s, take action a, observe reward r and end up in state s'):
θ_k ← θ_k + α ( r + γ max_{a'} Q̂(s', a') − Q̂(s, a) ) φ_k(s)
The term in parentheses is the change in Q.
Q Learning Example
Suppose we have 5 rooms in a building connected by doors as shown in the
figure below. We'll number each room 0 through 4. The outside of the
building can be thought of as one big room (5). Notice that doors 1 and 4
lead into the building from room 5 (outside).
We can represent the rooms on a graph, each room as a node, and
each door as a link.
For this example, we'd like to put an agent in any room, and from that room, go outside the
building (this will be our target room). In other words, the goal room is number 5.
To set this room as a goal, we'll associate a reward value to each door (i.e. link between
nodes). The doors that lead immediately to the goal have an instant reward of 100.
Other doors not directly connected to the target room have zero reward.
Because doors are two-way ( 0 leads to 4, and 4 leads back to 0 ), two arrows are assigned to
each room. Each arrow contains an instant reward value, as shown below:
Of course, Room 5 loops back to itself with a reward of 100, and all other direct
connections to the goal room carry a reward of 100. In Q-learning, the goal is to
reach the state with the highest reward, so that if the agent arrives at the goal, it
will remain there forever. This type of goal is called an "absorbing goal".
Imagine our agent as a dumb virtual robot that can learn through experience. The
agent can pass from one room to another but has no knowledge of the
environment, and doesn't know which sequence of doors leads to the outside.
Suppose we want to model some kind of simple evacuation of an agent from any
room in the building. Now suppose we have an agent in Room 2 and we want the
agent to learn to reach outside the house (5).
The terminology in Q-Learning includes the terms "state" and "action".
We'll call each room, including outside, a "state", and the agent's movement from
one room to another will be an "action". In our diagram, a "state" is depicted as a
node, while "action" is represented by the arrows.
Suppose the agent is in state 2. From state 2, it can go to state 3 because state 2 is
connected to 3. From state 2, however, the agent cannot directly go to state 1 because
there is no direct door connecting room 1 and 2 (thus, no arrows). From state 3, it can go
either to state 1 or 4 or back to 2 (look at all the arrows about state 3). If the agent is in
state 4, then the three possible actions are to go to state 0, 5 or 3. If the agent is in state 1,
it can go either to state 5 or 3. From state 0, it can only go back to state 4.
We can put the state diagram and the instant reward values into the following reward
table, "matrix R".
The Gamma parameter has a range of 0 to 1 (0 <= Gamma < 1). If Gamma is closer to zero, the agent will tend to consider only immediate rewards. If Gamma is closer to one, the agent will consider future rewards with greater weight, willing to delay the reward.
To use the matrix Q, the agent simply traces the sequence of states from the initial state to the goal state. The algorithm finds the actions with the highest reward values recorded in matrix Q for the current state.
The Q-Learning algorithm goes as follows:
1. Set the gamma parameter, and environment rewards in matrix R.
2. Initialize matrix Q to zero.
3. For each episode:
Select a random initial state.
Do while the goal state hasn't been reached:
Select one among all possible actions for the current state.
Using this possible action, consider going to the next state.
Get maximum Q value for this next state based on all possible actions.
Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Set the next state as the current state.
End Do
End For
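A compact numpy sketch of this loop for the six-room example (the reward matrix encodes the doors described above; using −1 for "no door" is our encoding choice):

```python
import numpy as np

# Reward matrix R for the 6 rooms (state 5 = outside is the goal);
# -1 encodes "no door", 100 rewards any move into the goal.
R = np.array([[-1, -1, -1, -1,  0, -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1, -1],
              [-1,  0,  0, -1,  0, -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

gamma, goal = 0.8, 5
Q = np.zeros_like(R, dtype=float)
rng = np.random.default_rng(0)

for episode in range(500):
    s = rng.integers(6)                             # random initial state
    while s != goal:
        a = rng.choice(np.flatnonzero(R[s] >= 0))   # a random possible action
        Q[s, a] = R[s, a] + gamma * Q[a].max()      # Q-learning update
        s = a                                       # next state becomes current
print((Q / Q.max() * 100).round())                  # normalized Q matrix
```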
To understand how the Q-learning algorithm works, we'll go through a few episodes step
by step.
We'll start by setting the value of the learning parameter Gamma = 0.8, and the initial
state as Room 1.
Initialize matrix Q as a zero matrix:
Look at the second row (state 1) of matrix R. There are two possible actions
for the current state 1: go to state 3, or go to state 5. By random selection, we
select to go to 5 as our action.
Now let's imagine what would happen if our agent were in state 5. Look at
the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions:
go to state 1, 4 or 5.
Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4), and Q(5, 5) are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5).
The next state, 5, now becomes the current state. Because 5 is the goal state, we've
finished one episode. Our agent's brain now contains an updated matrix Q as:
The next state, 1, now becomes the current state. We repeat the inner
loop of the Q learning algorithm because state 1 is not the goal state
So, starting the new loop with the current state 1, there are two possible
actions: go to state 3, or go to state 5. By lucky draw, our action selected
is 5.
Now, imagine we're in state 5. There are three possible actions: go to state 1, 4 or 5. We compute the Q value using the maximum value of these possible actions.
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
The entries of matrix Q used here, Q(5, 1), Q(5, 4), and Q(5, 5), are all zero. The result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5). This result does not change the Q matrix.
Because 5 is the goal state, we finish this episode. Our agent's brain now contains the updated matrix Q as:
If our agent learns more through further episodes, it will finally reach convergence
values in matrix Q like:
This matrix Q can then be normalized (i.e., converted to percentages) by dividing all non-zero entries by the highest number (500 in this case):
Once the matrix Q gets close enough to a state of convergence, we know our agent has
learned the most optimal paths to the goal state. Tracing the best sequences of states is as
simple as following the links with the highest values at each state.
For example, from initial State 2, the agent can use the matrix Q as a guide:
From State 2 the maximum Q value suggests the action to go to state 3.
From State 3 the maximum Q values suggest two alternatives: go to state 1 or 4. Suppose we arbitrarily choose to go to 1.
From State 1 the maximum Q value suggests the action to go to state 5.
Thus the sequence is 2 - 3 - 1 - 5.
The n-Armed Bandit Problem
E[ r_t | a_t ] = Q*(a_t)
These are the unknown action values; the distribution of r_t depends only on a_t.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.
The "sample average" estimate after k_a plays of action a:
Q_t(a) = ( r_1 + r_2 + ⋯ + r_{k_a} ) / k_a
and lim_{k_a → ∞} Q_t(a) = Q*(a).
e-Greedy Action Selection
e-Greedy:
a_t = { a_t* (the greedy action) with probability 1 − e; a random action with probability e }
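A sketch of e-greedy selection on a 10-armed bandit with sample-average estimates (all numbers illustrative):

```python
import numpy as np

def eps_greedy(Q, eps, rng):
    """Pick argmax Q with probability 1 - eps, else a random action."""
    if rng.random() < eps:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit: the greedy action a_t*

rng = np.random.default_rng(0)
q_true = rng.normal(size=10)               # unknown action values Q*(a)
Q, N = np.zeros(10), np.zeros(10)
for t in range(1000):                      # 1000 plays
    a = eps_greedy(Q, eps=0.1, rng=rng)
    r = q_true[a] + rng.normal()           # noisy reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]              # incremental sample average
print("best arm:", np.argmax(q_true), "most-chosen arm:", np.argmax(N))
```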