
Machine Learning

Topic Covered
• Introduction to ML
• K-Nearest Neighbor
• Decision Tree
• Ensemble Methods
• Random Forest
• Artificial Neural Networks
• Support Vector Machines
• K-Means Clustering
• Reinforcement Learning
Supervised Learning: Tree-based Methods
Road Map
 Basic concepts
 K-nearest neighbor
 Decision tree induction
 Ensemble methods: Bagging and Boosting
 Summary

2
We start a little light….

3
Machine Learning: An example application

 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
 A decision is needed: whether to put a new patient
in an intensive-care unit.
 Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
 Problem: to predict high-risk patients and
discriminate them from low-risk patients.

4
Another application
 A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
 age
 Marital status
 annual salary
 outstanding debts
 credit rating
 etc.
 Problem: to decide whether an application should be approved, i.e., to classify applications into two
categories, approved and not approved.

5
Another Example

6
Computer vision is hard
Machine learning and feature representations
[Figure: raw pixel values (pixel 1, pixel 2) of images are fed to a learning algorithm; in this raw input space, motorbikes and "non"-motorbikes are hard to separate.]
How is computer perception done?
[Figure: typical pipelines: Image → low-level vision features → recognition / object detection; Audio → low-level audio features → speaker identification / audio classification; Helicopter → low-level state features → control action.]
Machine learning and our focus
 Like human learning from past experiences.
 A computer does not have “experiences”.
 A computer system learns from data, which
represent some “past experiences” of an
application domain.
 Our focus: learn a target function that can be used
to predict the values of a discrete class attribute,
e.g., approved or not approved, high-risk or low-risk.
 The task is commonly called: Supervised learning,
classification, or inductive learning.
10
The data and the goal
 Data: A set of data records (also called
examples, instances or cases) described by
 k attributes: A1, A2, … Ak.
 a class: Each example is labelled with a pre-
defined class.
 Goal: To learn a classification model from the
data that can be used to predict the classes
of new (future, or test) cases/instances.

11
An example: data (loan application)
Approved or not

12
An example: the learning task
 Learn a classification model from the data
 Use the model to classify future loan applications
into
 Yes (approved) and
 No (not approved)
 What is the class for the following case/instance?

13
Supervised vs. unsupervised Learning
 Supervised learning: classification is seen as
supervised learning from examples.
 Supervision: The data (observations,
measurements, etc.) are labeled with pre-defined
classes. It is as if a "teacher" gives the classes
(supervision).
 Test data are classified into these classes too.
 Unsupervised learning (clustering)
 Class labels of the data are unknown
 Given a set of data, the task is to establish the
existence of classes or clusters in the data

14
Supervised learning process: two steps
Learning (training): Learn a model using the training data
Testing: Test the model using unseen test data to assess the model accuracy

Accuracy = (Number of correct classifications) / (Total number of test cases)

15
What do we mean by learning?
 Given
 a data set D,
 a task T, and
 a performance measure M,
a computer system is said to learn from D to
perform the task T if after learning the
system’s performance on T improves as
measured by M.
 In other words, the learned model helps the
system to perform T better as compared to
no learning.
16
An example
 Data: Loan application data
 Task: Predict whether a loan should be
approved or not.
 Performance measure: accuracy.

 No learning: classify all future applications (test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
 We can do better than 60% with learning.

17
Fundamental assumption of learning
Assumption: The distribution of training
examples is identical to the distribution of test
examples (including future unseen examples).

 In practice, this assumption is often violated to a certain degree.
 Strong violations will clearly result in poor
classification accuracy.
 To achieve good accuracy on the test data,
training examples must be sufficiently
representative of the test data.
18
Road Map
 Basic concepts
 K-nearest neighbor
 Decision tree induction
 Ensemble methods: Bagging and Boosting
 Summary

21
k-Nearest Neighbor Classification (kNN)
 kNN does not build model from the training
data.
 To classify a test instance d, define k-
neighborhood P as k nearest neighbors of d
 Count number n of training instances in P that
belong to class cj
 Estimate Pr(cj|d) as n/k
 No training is needed. Classification time is
linear in training set size for each test case.

22
kNN Algorithm

 k is usually chosen empirically via a validation set or cross-validation by trying a range of k values.
 The distance function is crucial, but depends on the application.

23
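As a concrete illustration of the algorithm above, here is a minimal kNN sketch in Python (NumPy only); the Euclidean distance, the toy data, and the simple majority vote are assumptions made for illustration, not prescribed choices.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=6):
    # Distance from the test instance d to every training instance
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # The k-neighborhood P: indices of the k nearest training instances
    nearest = np.argsort(dists)[:k]
    # Estimate Pr(c_j | d) as n/k and return the majority class
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy usage: three 2-D points per class
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))  # expected: "B"
```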
Example: k = 6 (6NN)
[Figure: a test document and its 6 nearest neighbors, drawn from the classes Government, Science, and Arts.]
24
Discussions
 kNN can deal with complex and arbitrary
decision boundaries.
 Despite its simplicity, researchers have shown
that the classification accuracy of kNN can be
quite strong and in many cases as accurate as
other, more elaborate methods.
 kNN is slow at the classification time
 kNN does not produce an understandable
model

25
Road Map
 Basic concepts
 K-nearest neighbor
 Decision tree induction
 Ensemble methods: Bagging and Boosting
 Summary

26
Introduction
 Decision tree learning is one of the most
widely used techniques for classification.
 Its classification accuracy is competitive with
other methods, and
 it is very efficient.
 The classification model is a tree, called
decision tree.
 C4.5 by Ross Quinlan is perhaps the best-known
system, and the code is freely available on the
internet.
27
The loan data (reproduced)
Approved or not

28
A decision tree from the loan data
Decision nodes and leaf nodes (classes)

29
Use the decision tree

No

30
Is the decision tree unique?
No. Here is a simpler tree. We want a tree that is both small and accurate.

Smaller trees are easier to understand and often perform better.

Finding the best tree is NP-hard.

All current tree-building algorithms are heuristic algorithms.

31
From a decision tree to a set of rules
A decision tree can be converted to a
set of rules
Each path from the root to a leaf is a
rule.

32
Algorithm for decision tree learning
 Basic algorithm (a greedy divide-and-conquer algorithm)
 Assume attributes are categorical now (continuous attributes
can be handled too)
 Tree is constructed in a top-down recursive manner
 At start, all the training examples are at the root
 Examples are partitioned recursively based on selected
attributes
 Attributes are selected on the basis of an impurity function (e.g.,
information gain)
 Conditions for stopping partitioning
 All examples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority class is the leaf
 There are no examples left

33
Decision tree learning algorithm

34
Choose an attribute to partition data
 The key to building a decision tree - which
attribute to choose in order to branch.
 The objective is to reduce impurity or
uncertainty in data as much as possible.
 A subset of data is pure if all instances belong to
the same class.
 The heuristic in C4.5 is to choose the attribute
with the maximum Information Gain or Gain
Ratio based on information theory.

35
The loan data (reproduced)
Approved or not

36
Two possible roots, which is better?

Fig. (B) seems to be better.

37
Information theory
 Information theory provides a mathematical
basis for measuring the information content.
 To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
 If one already has a good guess about the answer,
then the actual answer is less informative.
 If one already knows that the coin is rigged so that it
will come up heads with probability 0.99, then a
message (advance information) about the actual
outcome of a flip is worth less than it would be for an
honest coin (50-50).

38
Information theory (cont …)
 For a fair (honest) coin, you have no
information, and you are willing to pay more
(say, in terms of $) for advance information -
the less you know, the more valuable the
information.
 Information theory uses this same intuition,
but instead of measuring the value of
information in dollars, it measures information
content in bits.
 One bit of information is enough to answer a
yes/no question about which one has no
idea, such as the flip of a fair coin

39
Information theory: Entropy measure
 The entropy formula:

   entropy(D) = − Σ_{j=1..|C|} Pr(c_j) · log₂ Pr(c_j),   where Σ_{j=1..|C|} Pr(c_j) = 1

 Pr(c_j) is the probability of class c_j in data set D


 We use entropy as a measure of impurity or
disorder of data set D. (Or, a measure of
information in a tree)

40
Entropy measure: let us get a feeling

 As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
41
Information gain
 Given a set of examples D, we first compute its entropy: entropy(D).

 If we make attribute A_i, with v values, the root of the current tree, this will partition D into v subsets D_1, D_2, …, D_v. The expected entropy if A_i is used as the current root:

   entropy_{A_i}(D) = Σ_{j=1..v} ( |D_j| / |D| ) · entropy(D_j)

42
Information gain (cont …)

 Information gained by selecting attribute A_i to branch or to partition the data is:

   gain(D, A_i) = entropy(D) − entropy_{A_i}(D)

 We choose the attribute with the highest gain to branch/split the current tree.

43
An example
   entropy(D) = − (6/15) · log₂(6/15) − (9/15) · log₂(9/15) = 0.971

   entropy_Own_house(D) = (6/15) · entropy(D_1) + (9/15) · entropy(D_2)
                        = (6/15) · 0 + (9/15) · 0.918
                        = 0.551

   entropy_Age(D) = (5/15) · entropy(D_1) + (5/15) · entropy(D_2) + (5/15) · entropy(D_3)
                  = (5/15) · 0.971 + (5/15) · 0.971 + (5/15) · 0.722
                  = 0.888

   Age      Yes   No   entropy(D_i)
   young     2     3      0.971
   middle    3     2      0.971
   old       4     1      0.722

 Own_house is the best choice for the root.

44
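A small Python sketch of these calculations may help; the class counts below reproduce the loan-data example above, while the helper names (entropy, expected_entropy) are just illustrative.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given raw class counts, in bits."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_entropy(partitions):
    """Weighted entropy after splitting D into the given class-count partitions."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

D = [9, 6]                                   # 9 Yes, 6 No in the whole data set
own_house = [[6, 0], [3, 6]]                 # split by Own_house: true / false
age = [[2, 3], [3, 2], [4, 1]]               # split by Age: young / middle / old

base = entropy(D)                            # 0.971
print(base - expected_entropy(own_house))    # gain(D, Own_house) ≈ 0.420
print(base - expected_entropy(age))          # gain(D, Age) ≈ 0.083
```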
We build the final tree

 We can use the information gain ratio to evaluate the impurity as well.

45
Handling continuous attributes
 Handle continuous attribute by splitting into
two intervals (can be more) at each node.
 How to find the best threshold to divide?
 Use information gain or gain ratio again
 Sort all the values of a continuous attribute in
increasing order {v1, v2, …, vr}.
 One possible threshold lies between each pair of adjacent
values vi and vi+1. Try all possible thresholds and
find the one that maximizes the gain (or gain
ratio).

46
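A minimal sketch of this threshold search, assuming a single attribute, a discrete class, and information gain as the split criterion; using the midpoint between adjacent values is one common convention, not the only one.

```python
import math

def entropy_of(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent sorted values; keep the best gain."""
    pairs = sorted(zip(values, labels))
    base = entropy_of(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        if not left or not right:            # skip degenerate splits (duplicate values)
            continue
        split = (len(left) * entropy_of(left) + len(right) * entropy_of(right)) / len(pairs)
        if base - split > best[1]:
            best = (t, base - split)
    return best

print(best_threshold([1.0, 2.0, 4.0, 5.0, 6.0], ["+", "+", "-", "-", "-"]))  # (3.0, ~0.971)
```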
An example in a continuous space

47
Avoid overfitting in classification
 Overfitting: A tree may overfit the training data
 Good accuracy on training data but poor on test data
 Symptoms: tree too deep and too many branches,
some may reflect anomalies due to noise or outliers
 Two approaches to avoid overfitting
 Pre-pruning: Halt tree construction early
 Difficult to decide because we do not know what may
happen subsequently if we keep growing the tree.
 Post-pruning: Remove branches or sub-trees from a
“fully grown” tree.
 This method is commonly used. C4.5 uses a statistical
method to estimate the errors at each node for pruning.
 A validation set may be used for pruning as well.

48
An example: likely to overfit the data

49
Other issues in decision tree learning

 From tree to rules, and rule pruning


 Handling of missing values
 Handling skewed distributions
 Handling attributes and classes with different
costs.
 Attribute construction
 Etc.

50
Road Map
 Basic concepts
 K-nearest neighbor
 Decision tree induction
 Ensemble methods: Bagging and Boosting
 Summary

51
Ensemble

[Figure: the data is fed to model 1, model 2, …, model k, and their outputs are combined into a single ensemble model.]

 Combine multiple models into one!

 Applications: classification, clustering, collaborative filtering, anomaly detection……

52 http://ews.uiuc.edu/~jinggao3/sdm10ensemble.htm
Motivations
• Motivations of ensemble methods
– Ensemble model improves accuracy and
robustness over single model methods
– Applications:
• distributed computing
• privacy-preserving applications
• large-scale data with reusable models
• multiple sources of data
– Efficiency: a complex problem can be
decomposed into multiple sub-problems that are
easier to understand and solve (divide-and-
conquer approach)

53
Relationship with Related Studies

• Multi-task learning
– Learn multiple tasks simultaneously
– Ensemble methods: use multiple models to learn
one task

• Data integration
– Integrate raw data
– Ensemble methods: integrate information at the
model level

54
Relationship with Related Studies (2)

• Meta learning
– Learn on meta-data (including base model output)
– Ensemble methods: besides learning a joint model
based on model output, we can also combine
the output by consensus
• Non-redundant clustering
– Give multiple non-redundant clustering
solutions to users
– Ensemble methods: give one solution to users
which represents the consensus among all the
base models

55
Why Ensemble Works?
• Intuition
– combining diverse, independent opinions in human
decision-making as a protective mechanism (e.g. stock
portfolio)
• Uncorrelated error reduction
– Suppose we have 5 completely independent classifiers for majority voting
– If accuracy is 70% for each:
• P(majority correct) = 10·(0.7³)(0.3²) + 5·(0.7⁴)(0.3) + (0.7⁵) ≈ 0.837
• 83.7% majority vote accuracy
– With 101 such classifiers:
• 99.9% majority vote accuracy

from T. Holloway, Introduction to Ensemble Learning, 2007.


56
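The figures above can be checked with a short script; this sketch sums the binomial probability that a strict majority of n independent classifiers, each with accuracy p, are correct.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a strict majority of n independent classifiers is correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ≈ 0.837
print(majority_vote_accuracy(101, 0.7))  # ≈ 0.999
```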
Why Ensemble Works? (2)

[Figure: six models, each fitting part of some unknown distribution; no single model captures it, but together the ensemble gives the global picture!]

57
Why Ensemble Works?

• Overcome limitations of single hypothesis


– The target function may not be implementable with
individual classifiers, but may be approximated by
model averaging

Decision Tree Model Averaging

58
Summary

– Supervised Learning: single models (SVM, Logistic Regression, …), combine by learning (Boosting, rule ensemble, Bayesian model averaging, …), combine by consensus (Bagging, random forest, random decision tree, …)
– Semi-supervised Learning: single models (Semi-supervised Learning, Collective Inference), combine by learning (Multi-view Learning), combine by consensus (Consensus Maximization)
– Unsupervised Learning: single models (K-means, Spectral Clustering, …), combine by consensus (Clustering Ensemble)

59
Ensemble of Classifiers—Learn to Combine

[Figure: classifiers 1…k are trained on labeled data; an ensemble model learns how to combine their outputs from the labeled data and produces final predictions on unlabeled test data.]

 Learn the combination from labeled data.

 Algorithms: boosting, stacked generalization, rule ensemble, Bayesian model averaging……

60
Ensemble of Classifiers—Consensus

[Figure: classifiers 1…k are trained on labeled data; their predictions on unlabeled test data are combined by majority voting to produce the final predictions.]

 Algorithms: bagging, random forest, random decision tree, model averaging of probabilities……
61
Supervised Ensemble Methods

• Problem
– Given a data set D={x1,x2,…,xn} and their
corresponding labels L={l1,l2,…,ln}

– An ensemble approach computes:


• A set of classifiers {f1,f2,…,fk}, each of
which maps data to a class label: fj(x)=l
• A combination of classifiers f* which
minimizes generalization error: f*(x)=
w1f1(x)+ w2f2(x)+…+ wkfk(x)

62
Bias and Variance

• Ensemble methods
– Combine learners to reduce variance

from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007. 63
Generating Base Classifiers

• Sampling training examples


– Train k classifiers on k subsets drawn from the
training set
• Using different learning models
– Use all the training examples, but apply different
learning algorithms
• Sampling features
– Train k classifiers on k subsets of features drawn
from the feature space
• Learning “randomly”
– Introduce randomness into learning procedures

64
Random forest classifier

• Random forest classifier, an extension to bagging which


uses de-correlated trees.
Random Forest Classifier
[Figure sequence: training data with N examples and M features]
 Create bootstrap samples from the training data.
 Construct a decision tree from each bootstrap sample.
 At each node, when choosing the split feature, choose only among m < M randomly selected features.
 To classify a new example, push it down every tree and take the majority vote.
Thus,…..
Random forest

• Available package:
• http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

• To read more:
• http://www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
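For completeness, here is a hedged scikit-learn sketch of the procedure just described; the dataset, the number of trees, and max_features="sqrt" are illustrative defaults, not values prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is grown on a bootstrap sample; at every node only a random
# subset of m < M features (here sqrt(M)) is considered for the split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```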
Boosting

77
Boosting (1)

• Principles
– Boost a set of weak learners to a strong learner
– Make records currently misclassified more important

• Example
– Record 4 is hard to classify
– Its weight is increased, therefore it is more
likely to be chosen again in subsequent rounds

*[FrSc97]
from P. Tan et al. Introduction to Data Mining.

78
Boosting (2)

• AdaBoost

– Initially, set uniform weights on all the records


– At each round
• Create a bootstrap sample based on the weights
• Train a classifier on the sample and apply it on the
original training set
• Records that are wrongly classified will have their
weights increased
• Records that are classified correctly will have their
weights decreased
• If the error rate is higher than 50%, start over

– Final prediction is weighted average of all the


classifiers with weight representing the training
accuracy
79
Boosting (3)

• Determine the weight

– For classifier i, its error is:

   ε_i = ( Σ_{j=1..N} w_j · δ(C_i(x_j) ≠ y_j) ) / ( Σ_{j=1..N} w_j )

– The classifier's importance is represented as:

   α_i = (1/2) · ln( (1 − ε_i) / ε_i )

– The weight of each record is updated as:

   w_j^(i+1) = w_j^(i) · exp( −α_i · y_j · C_i(x_j) ) / Z^(i)

– Final combination:

   C*(x) = argmax_y Σ_{i=1..K} α_i · δ( C_i(x) = y )
80
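The update equations above translate almost line for line into code; this is a minimal AdaBoost sketch using decision stumps from scikit-learn, assuming labels in {-1, +1}, and it omits the "start over when the error exceeds 50%" step for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=10):
    """y must be in {-1, +1}. Returns a list of (alpha_i, stump_i) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # initially uniform weights on all records
    ensemble = []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 0.5 or err == 0:          # degenerate round: stop early in this sketch
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)   # raise weights of misclassified records
        w = w / w.sum()                     # normalize (the Z term)
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * stump.predict(X) for alpha, stump in ensemble)
    return np.sign(score)
```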
In a nutshell….

81
Supervised Learning: Artificial
Neural Networks
What are connectionist neural networks?
• Connectionism refers to a computer modeling approach to
computation that is loosely based upon the architecture of the brain.

• Many different models, but all include:


– Multiple, individual “nodes” or “units” that operate at the same time (in
parallel)
– A network that connects the nodes together
– Information is stored in a distributed fashion among the links that connect
the nodes
– Learning can occur with gradual changes in connection strength
Developments in Neural Learning Systems

3
The 2012 Breakthrough….

 A Krizhevsky, I Sutskever, and GE Hinton, “Imagenet


classification with deep convolutional neural
networks”, Advances in neural information processing
systems (NIPS 2012), 1097-1105
Has 18,000+ citations by now!!

Reduced the error rate on imagenet under LSVRC to 16%

Later evolved into Alexnet.

4
Deep Learning: Acknowledging Some Resources used here…

 “Neural Networks and Deep Learning”


 written by Michael Nielsen
 http://neuralnetworksanddeeplearning.com/
 “Deep Learning”
 Written by Ian J. Goodfellow, Yoshua
Bengio, and Aaron Courville
 http://www.iro.umontreal.ca/~bengioy/dlbook/
Deep Learning Tutorial. Hung-yi Lee, NTU.
Introduction to Neural
Learning

 What people already knew since the 1950s
Example Application
• Handwriting Digit Recognition

Machine “2”
Handwriting Digit Recognition
 Input: a 16 x 16 = 256-pixel image, x_1 … x_256 (ink → 1, no ink → 0).
 Output: y_1 … y_10, where each dimension represents the confidence of a digit, e.g. y_1 = 0.1 (is 1), y_2 = 0.7 (is 2), …, y_10 = 0.2 (is 0). Here the image is recognized as "2".
Example Application
• Handwriting Digit Recognition

x1 y1
x2
y2
…… Machine “2”

……
x256 y10
f: R^256 → R^10
In deep learning, the function f is represented by a neural network.
Element of Neural Network
Neuron f: R^K → R

   z = a_1·w_1 + a_2·w_2 + ⋯ + a_K·w_K + b
   a = σ(z)

where a_1 … a_K are the inputs, w_1 … w_K the weights, b the bias, and σ the activation function.
Neural Network
neuron
Input Layer 1 Layer 2 Layer L Output
x1 …… y1
x2 …… y2

……
……

……

……

……
xN …… yM
Input Output
Layer Hidden Layers Layer

Deep means many hidden layers


Example of Neural Network
[Figure: a small network with fixed weights; e.g. input (1, −1) gives first-layer values σ(4) = 0.98 and σ(−2) = 0.12.]
Sigmoid Function:

   σ(z) = 1 / (1 + e^(−z))
Example of Neural Network
[Figure: the same 2-input, 2-output network with fixed weights and sigmoid activations, evaluated on two inputs:]

   f([1, −1]) = [0.62, 0.83]        f([0, 0]) = [0.51, 0.85]

Different parameters define different functions.
Matrix Operation
The first layer of the example network, written as a matrix operation σ(W x + b):

   σ( [ 1  −2 ] [  1 ]  +  [ 1 ] )  =  σ( [  4 ] )  =  [ 0.98 ]
     ( [ −1  1 ] [ −1 ]     [ 0 ] )       ( [ −2 ] )     [ 0.12 ]
Neural Network
x1 …… y1
x2 W1 W2 ……
WL y2
b1 b2 bL

……
……

……

……

……
xN x a1 ……
a2 y yM

𝜎 W1 x + b1
𝜎 W2 a1 + b2
𝜎 WL aL-1 + bL
Neural Network
x1 …… y1
x2 W1 W2 ……
WL y2
b1 b2 bL

……
……

……

……

……
xN x a1 ……
a2 y yM

 Using parallel computing techniques to speed up the matrix operations:

   y = f(x) = σ( W^L ⋯ σ( W^2 σ( W^1 x + b^1 ) + b^2 ) ⋯ + b^L )
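The nested expression above is easy to mirror in NumPy; this sketch uses a sigmoid activation and random weights purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)"""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Toy network: 2 inputs -> 3 hidden -> 2 outputs, random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(2)]
print(forward(np.array([1.0, -1.0]), weights, biases))
```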
Softmax
• Softmax layer as the output layer

Ordinary layer: y_i = σ(z_i). In general, the output of the network can be any value, which may not be easy to interpret.

Softmax layer: the outputs behave like probabilities (1 > y_i > 0 and Σ_i y_i = 1):

   y_i = e^(z_i) / Σ_{j=1..3} e^(z_j)

Example: z = (3, 1, −3) → e^z ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0).
How to set network parameters
θ = {W¹, b¹, W², b², ⋯, W^L, b^L}

[Figure: the 16 x 16 = 256 pixel inputs x_1 … x_256 (ink → 1, no ink → 0) feed the network; the output layer is a softmax over y_1 … y_10.]

How to let the neural network achieve this: set the network parameters θ such that
 for an input image of "1", y_1 has the maximum value;
 for an input image of "2", y_2 has the maximum value.
Training Data
• Preparing training data: images and their labels

“5” “0” “4” “1”

“9” “2” “1” “3”

Using the training data to find


the network parameters.
Cost
Given a set of network parameters θ, each example has a cost value.

[Figure: for a training image of "1", the network output (y_1 = 0.2, y_2 = 0.3, …, y_10 = 0.5) is compared against the target (1, 0, …, 0), giving a cost L(θ).]

The cost can be the Euclidean distance or the cross entropy of the network output and the target.
Total Cost
For all training data, the total cost is:

   C(θ) = Σ_{r=1..R} L_r(θ)

This measures how bad the network parameters θ are on this task. Find the network parameters θ* that minimize this value.
Gradient Descent
Assume there are only two parameters w_1 and w_2 in the network: θ = {w_1, w_2}.
Error Surface: the colors represent the value of C over (w_1, w_2).
 Randomly pick a starting point θ⁰.
 Compute the negative gradient at θ⁰: −∇C(θ⁰), where ∇C(θ⁰) = [ ∂C(θ⁰)/∂w_1 , ∂C(θ⁰)/∂w_2 ]ᵀ.
 Move by −η∇C(θ⁰), where η is the learning rate.
 Repeat; eventually we would reach a minimum θ*.
Local Minima
• Gradient descent never guarantee global minima
Different initial
point 𝜃 0

𝐶 Reach different minima,


so different results
Who is Afraid of Non-Convex
Loss Functions?
𝑤1 𝑤2 http://videolectures.net/eml07
_lecun_wia/
Besides local minima ……
 Very slow at plateaus (∇C(θ) ≈ 0)
 Stuck at saddle points (∇C(θ) = 0)
 Stuck at local minima (∇C(θ) = 0)
[Figure: cost over parameter space showing a plateau, a saddle point, and a local minimum.]
In physical world ……
• Momentum

How about put this phenomenon


in gradient descent?
Still not guarantee reaching
Momentum global minima, but give some
hope ……
cost
Movement =
Negative of Gradient + Momentum
Negative of Gradient
Momentum
Real Movement

Gradient = 0
Improving Simple Gradient Descent
Momentum
Don’t just change weights according to the current data point.
Re-use changes from earlier iterations.
Let ∆w(t) = weight change at time t.
Let −η ∂E/∂w be the change we would make with regular gradient descent.
Instead we use:

   ∆w(t+1) = −η ∂E/∂w + β ∆w(t)
   w(t+1) = w(t) + ∆w(t+1)

where β is the momentum parameter. Momentum damps oscillations.
A hack? Well, maybe.
Mini-batch
 Randomly initialize θ⁰
 Pick the 1st mini-batch (e.g. x¹, x³¹, …): C = L¹ + L³¹ + ⋯, then θ¹ ← θ⁰ − η∇C(θ⁰)
 Pick the 2nd mini-batch (e.g. x², x¹⁶, …): C = L² + L¹⁶ + ⋯, then θ² ← θ¹ − η∇C(θ¹)
 C is different each time we update the parameters!
 Until all mini-batches have been picked: that is one epoch. Then repeat the above process.

Mini-batch
Original Gradient Descent vs. With Mini-batch: with mini-batches the total cost C on all training data is unstable from update to update, but training is faster and often better!
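A compact sketch of mini-batch gradient descent on a linear model; the squared-error loss, batch size, and learning rate are illustrative choices, and the gradient is written analytically here rather than obtained by backpropagation.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=16, epochs=20, seed=0):
    """Fit y ~ X @ theta by mini-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))             # shuffle examples each epoch
        for start in range(0, len(y), batch_size):  # one update per mini-batch
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= lr * grad                      # theta <- theta - eta * grad C(theta)
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(200)
print(minibatch_sgd(X, y))   # should be close to [1.0, -2.0, 0.5]
```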


Backpropagation
• A network can have millions of parameters.
• Backpropagation is the way to compute the gradients efficiently (not today)
• Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%2
0backprop.ecm.mp4/index.html
• Many toolkits can compute the gradients automatically

Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lec
ture/Theano%20DNN.ecm.mp4/index.html
Linear Perceptrons
They are multivariate linear models:

Out(x) = wTx

And “training” consists of minimizing sum-of-squared residuals by gradient descent.

 Outx   y 
2
 k k
k

 w 
 2
 x k  yk
k

QUESTION: Derive the perceptron training rule.


Linear Perceptron Training Rule

   E = Σ_{k=1..R} ( y_k − wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

   w_j ← w_j − η ∂E/∂w_j

So what's ∂E/∂w_j ?
Copyright © 2001, 2003, Andrew W. Moore
Linear Perceptron Training Rule

   ∂E/∂w_j = ∂/∂w_j Σ_{k=1..R} ( y_k − wᵀx_k )²
           = Σ_{k=1..R} 2 ( y_k − wᵀx_k ) · ∂/∂w_j ( y_k − wᵀx_k )
           = −2 Σ_{k=1..R} δ_k · ∂/∂w_j ( wᵀx_k )          …where…  δ_k = y_k − wᵀx_k
           = −2 Σ_{k=1..R} δ_k · ∂/∂w_j Σ_{i=1..m} w_i x_{ki}
           = −2 Σ_{k=1..R} δ_k x_{kj}
Linear Perceptron Training Rule

   E = Σ_{k=1..R} ( y_k − wᵀx_k )²,   ∂E/∂w_j = −2 Σ_{k=1..R} δ_k x_{kj}

Gradient descent tells us we should update w thusly if we wish to minimize E:

   w_j ← w_j − η ∂E/∂w_j,   i.e.   w_j ← w_j + 2η Σ_{k=1..R} δ_k x_{kj}

We frequently neglect the 2 (meaning we halve the learning rate).
The “Batch” perceptron algorithm
1) Randomly initialize weights w_1, w_2, …, w_m

2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).

3) for i = 1 to R:   δ_i := y_i − wᵀx_i

4) for j = 1 to m:   w_j ← w_j + η Σ_{i=1..R} δ_i x_{ij}

5) if Σ_i δ_i² stops improving then stop. Else loop back to 3.

   The incremental form   w_j ← w_j + η δ_i x_{ij}   is A RULE KNOWN BY MANY NAMES.
If data is voluminous and arrives fast

Input-output pairs (x,y) come streaming in very quickly. THEN


Don’t bother remembering old ones.
Just keep using new ones.

observe (x, y)
   δ := y − wᵀx
   for all j:  w_j ← w_j + η δ x_j
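Here is a small sketch of the batch rule above (the online variant only differs in updating after each example); the learning rate, number of epochs, and toy data are assumptions made for illustration.

```python
import numpy as np

def train_linear_perceptron(X, y, lr=0.05, epochs=500):
    """Batch delta-rule training of Out(x) = w^T x on squared error."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append 1's so the fit need not pass through the origin
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        delta = y - X @ w                      # delta_i = y_i - w^T x_i
        w += lr * X.T @ delta                  # w_j += eta * sum_i delta_i * x_ij
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])             # y = 2x + 1
print(train_linear_perceptron(X, y))           # ≈ [2.0, 1.0]
```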
A 1-HIDDEN LAYER NET
N_INPUTS = 2,  N_HIDDEN = 3

   v_j = g( Σ_{k=1..N_INS} w_{jk} x_k ),   j = 1, 2, 3

   Out = g( Σ_{k=1..N_HID} W_k v_k )
OTHER NEURAL NETS

 2-Hidden layers + Constant Term (a constant input of 1 alongside x_1, x_2, x_3)

 “JUMP” CONNECTIONS (inputs connected directly to the output):

   Out = g( Σ_{k=1..N_INS} w_{0k} x_k + Σ_{k=1..N_HID} W_k v_k )
Copyright © 2001, 2003, Andrew W. Moore
Backprop
• Very powerful - can learn any function, given enough
hidden units! With enough hidden units, we can generate
any function.
• Have the same problems of Generalization vs.
Memorization. With too many units, we will tend to
memorize the input and not generalize well. Some schemes
exist to “prune” the neural network.
• Networks require extensive training, many parameters to
fiddle with. Can be extremely slow to train. May also fall
into local minima.
• Inherently parallel algorithm, ideal for multiprocessor
hardware.
• Despite the cons, a very powerful algorithm that has seen
widespread successful deployment.
Types of Neurons

Linear Neuron

Logistic Neuron

Potentially more. Require a convex


loss function for gradient descent training.
Perceptron
45
Multilayer Networks
• Cascade Neurons together
• The output from one layer is the input to the next
• Each Layer has its own sets of weights

46
Linear Regression Neural Networks
• What happens when we arrange linear
neurons in a multilayer network?

47
Linear Regression Neural Networks
• Nothing special happens.
– The product of two linear transformations is itself a linear
transformation.

48
Neural Networks
• We want to introduce non-linearities to the network.
– Non-linearities allow a network to identify complex regions
in space

49
Linear Separability
• 1-layer cannot handle XOR
• More layers can handle more complicated spaces – but
require more parameters
• Each node splits the feature space with a hyperplane
• If the second layer is AND a 2-layer network can
represent any convex hull.

50
Feed-Forward Networks
• Predictions are fed forward through the
network to classify

51
Error Backpropagation
• We will do gradient descent on the whole
network.
• Training will proceed from the last layer to the
first.

57
Error Backpropagation
• Introduce variables over the neural network

58
Error Backpropagation
• Introduce variables over the neural network
– Distinguish the input and output of each node

59
Error Backpropagation

60
Error Backpropagation
Training: Take the gradient of the last component and iterate backwards

61
Error Backpropagation
Empirical Risk Function

62
Error Backpropagation
Optimize last layer weights wkl

Calculus chain rule

63
Error Backpropagation
Optimize last hidden weights wjk

68
Error Backpropagation
Optimize last hidden weights wjk

Multivariate chain rule

69
Error Backpropagation
Repeat for all previous layers

73
Error Backpropagation
Now that we have well defined gradients for each parameter, update using Gradient Descent

74
Error Back-propagation
• Error backprop unravels the multivariate chain rule and
solves the gradient for each partial component separately.
• The target values for each layer come from the next layer.
• This feeds the errors back along the network.

75
Problems with Neural Networks
• Interpretation of Hidden Layers
• Overfitting

76
Interpretation of Hidden Layers
• What are the hidden layers doing?!
• Feature Extraction
• The non-linearities in the feature extraction can make
interpretation of the hidden layers very difficult.
• This leads to Neural Networks being treated as black boxes.

77
Universality Theorem
Any continuous function f

   f : R^N → R^M

can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

“Deep” neural network vs. “Fat” neural network?


Fat + Short v.s. Thin + Tall
The same number of
parameters

…… Which one is better?

x1 x2 …… xN x1 x2 …… xN

Shallow Deep
Fat + Short v.s. Thin + Tall
   Deep                              Shallow
   Layer X Size   Word Error Rate(%) Layer X Size   Word Error Rate(%)
   1 X 2k         24.2
   2 X 2k         20.4
   3 X 2k         18.4
   4 X 2k         17.8
   5 X 2k         17.2               1 X 3772       22.5
   7 X 2k         17.1               1 X 4634       22.6
                                     1 X 16k        22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-
Dependent Deep Neural Networks." Interspeech. 2011.
Why Deep?
• Deep → Modularization
[Figure: four classifiers trained directly on images. Classifier 1: girls with long hair (many examples). Classifier 2: boys with long hair (only a few examples, so this classifier is weak). Classifier 3: girls with short hair. Classifier 4: boys with short hair.]
Why Deep? Each basic classifier can have sufficient training examples.
• Deep → Modularization
[Figure: instead, first train two basic classifiers on the image: "Boy or Girl?" (long-hair and short-hair women vs. long-hair and short-hair men) and "Long or short hair?" (long-hair women and men vs. short-hair women and men). The classifiers for the four attributes are then built on top of these basic classifiers.]
Why Deep? The attribute classifiers can be trained with little data.
• Deep → Modularization
[Figure: the basic classifiers ("Boy or Girl?", "Long or short hair?") are shared as modules by the following classifiers: girls with long hair, boys with long hair (fine even with little data), girls with short hair, boys with short hair.]
Why Deep? Deep learning also works on small data sets like TIMIT.
• Deep → Modularization → Less training data?
[Figure: x_1 … x_N feed successive layers; the most basic classifiers sit in the 1st layer, which is used as a module to build the 2nd-layer classifiers, which in turn serve as modules for later layers.] The modularization is automatically learned from data.
Hand-crafted kernel function vs. learnable kernel
[Figure: SVM = a hand-crafted kernel function φ(x) followed by a simple classifier. Deep learning = a learnable feature map φ(x) (the hidden layers from x_1 … x_N) followed by a simple classifier (the output layer y_1 … y_M).]
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf
Recipe for Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-
powerpoint-slide/
Recipe for Learning

Don’t forget overfitting!

 Modify the network and use better optimization strategies, while preventing overfitting.

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-
powerpoint-slide/
Support Vector Machines,
Clustering, and more…
Introduction
 Support vector machines were invented by V.
Vapnik and his co-workers in 1970s in Russia and
became known to the West in 1992.
 SVMs are linear classifiers that find a hyperplane to
separate two classes of data, positive and negative.
 Kernel functions are used for nonlinear separation.
 SVM not only has a rigorous theoretical foundation,
but also performs classification more accurately than
most other methods in applications, especially for
high dimensional data.
 It is perhaps the best classifier for text classification.

2
a
Linear Classifiers
x f y
f(x,w,b) = sign(w x + b)
denotes +1 w x + b>0
denotes -1

How would you


classify this data?

w x + b<0
a
Linear Classifiers
x f y
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

Any of these
would be fine..

..but which is
best?
a
Linear Classifiers
x f y
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1

How would you


classify this data?

Misclassified
to +1 class
a
Classifier Margin
x f y
f(x,w,b) = sign(w x + b)
denotes +1
denotes -1 Define the margin
of a linear
classifier as the
width that the
boundary could be
increased by
before hitting a
datapoint.
Maximum Margin
f(x,w,b) = sign(w·x + b)
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very, very well.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM: Linear SVM). Support vectors are those datapoints that the margin pushes up against.
Example of Bad Decision Boundaries

Class 2 Class 2

Class 1 Class 1
Good Decision Boundary: Margin Should Be Large

The decision boundary should be as far away from the data of both classes as possible.
– We should maximize the margin m, where

   m = 2 / sqrt(w·w)
Support vectors
datapoints that the
margin
pushes up against

Class 2
The maximum margin linear
classifier is the linear classifier
with the maximum margin.
m This is the simplest kind of
Class 1 SVM (Called an Linear SVM)
The Optimization Problem

Let {x1, ..., xn} be our data set and let yi  {1,-1} be the class label of xi
The decision boundary should classify all points correctly 
A constrained optimization problem

||w||2 = wTw
Lagrangian of Original Problem

The Lagrangian is Lagrangian multipliers

– Note that ||w||2 = wTw


Setting the gradient of w.r.t. w and b to zero, we have

ai0
The Dual Optimization Problem

We can transform the problem to its dual Dot product of X

a’s  New variables


(Lagrangian multipliers)
This is a convex quadratic programming (QP) problem
– Global maximum of ai can always be found
well established tools for solving this optimization problem (e.g.
cplex)
Note:
A Geometrical Interpretation
Class 2
Support vectors
a10=0 a’s with values
a8=0.6
different from zero
(they hold up the
a7=0 separating plane)!
a2=0
a5=0

a1=0.8
a4=0
a6=1.4

a9=0
a3=0
Class 1
Non-linearly Separable Problems
We allow “error/slack variable” xi in classification; it is based on the
output of the discriminant function wTx+b
xi approximates the number of misclassified samples

New objective function:

Class 2

C : tradeoff parameter between


error and margin;
chosen by the user;
large C means a higher
penalty to errors

Class 1
The Optimization Problem

The dual of the problem is

w is also recovered as
The only difference with the linear separable case is that there is an upper
bound C on ai
Once again, a QP solver can be used to find ai efficiently!!!
Extension to Non-linear SVMs
(Kernel Machines)
Non-linear SVMs: Feature Space
General idea: the original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable, e.g.:

   x = (x_1, x_2)   →   Φ: x → φ(x) = (x_1², x_2², √2·x_1x_2)

If data are mapped into a space of sufficiently high dimension, then they will in general be linearly separable;
N data points are in general separable in a space of N-1 dimensions or more!!!
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Transformation to Feature Space

Possible problem of the transformation


– High computation burden due to high-dimensionality and hard to get a good
estimate
SVM solves these two issues simultaneously
– “Kernel tricks” for efficient computation
– Minimize ||w||2 can lead to a “good” classifier

( )
( ) ( )
( ) ( ) ( )
(.) ( )
( ) ( )
( ) ( )
( ) ( )
( ) ( ) ( )
( )
( )
Input space Feature space
Kernel Trick 

Recall: note that the data only appears as dot products.

   maximize    Σ_{i=1..N} α_i − (1/2) Σ_{i,j=1..N} α_i α_j y_i y_j (x_i·x_j)
   subject to  C ≥ α_i ≥ 0,   Σ_{i=1..N} α_i y_i = 0

Since the data is only represented as dot products, we need not do the mapping explicitly.

Introduce a Kernel Function (*) K such that:

   K(x_i, x_j) = φ(x_i)·φ(x_j)

(*) Kernel function: a function that can be applied to pairs of input data to evaluate dot products in some corresponding feature space.
Example Transformation

Consider the following transformation

Define the kernel function K (x,y) as

The inner product (.)(.) can be computed by K without going through


the map (.) explicitly!!!
Modification Due to Kernel Function
Change all inner products to kernel functions
For training,

Original

With kernel
function
Examples of Kernel Functions

Polynomial kernel with degree d

Radial basis function kernel with width 

– Closely related to radial basis function neural networks


Sigmoid with parameter  and 

– It does not satisfy the Mercer condition on all  and 


Research on different kernel functions in different applications is very active
Example

Suppose we have 5 1-D data points
– x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y1=1, y2=1, y3=-1, y4=-1, y5=1
We use the polynomial kernel of degree 2
– K(x,y) = (xy+1)²
– C is set to 100
We first find a_i (i=1, …, 5) by
Example

By using a QP solver, we get


a1=0, a2=2.5, a3=0, a4=7.333, a5=4.833
– Verify (at home) that the constraints are indeed satisfied
– The support vectors are {x2=2, x4=5, x5=6}
The discriminant function is

b is recovered by solving f(2)=1, f(5)=-1, or f(6)=1, as x2, x4, x5 lie on the margin boundaries f(x) = ±1, and all give b=9
Example

Value of discriminant function

class 1 class 2 class 1

1 2 4 5 6
Choosing the Kernel Function

Probably the most tricky part of using SVM.


The kernel function is important because it creates the kernel matrix,
which summarizes all the data
Many principles have been proposed (diffusion kernel, Fisher kernel,
string kernel, …)
There is even research to estimate the kernel matrix from available
information

In practice, a low degree polynomial kernel or RBF kernel with a


reasonable width is a good initial try
Note that SVM with RBF kernel is closely related to RBF neural
networks, with the centers of the radial basis functions automatically
chosen for SVM
Software

A list of SVM implementation can be found at


http://www.kernel-machines.org/software.html
Some implementation (such as LIBSVM) can handle
multi-class classification
SVMLight is among one of the earliest
implementation of SVM
Several Matlab toolboxes for SVM are also available
Recap of Steps in SVM

Prepare data matrix {(xi,yi)}


Select a Kernel function
Select the error parameter C
“Train” the system (to find all αi)
New data can be classified using αi and Support
Vectors
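Those steps map directly onto scikit-learn's LIBSVM wrapper; the kernel, C, and gamma below are illustrative first tries in the spirit of the "low degree polynomial or RBF kernel" advice above, not tuned values.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# 1) Prepare the data matrix {(x_i, y_i)}
X, y = datasets.load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2) Select a kernel function, 3) select the error parameter C
clf = svm.SVC(kernel="rbf", C=10.0, gamma="scale")

# 4) "Train" the system (finds the alpha_i and the support vectors)
clf.fit(X_tr, y_tr)
print("support vectors per class:", clf.n_support_)

# 5) New data is classified using the alpha_i and support vectors
print("test accuracy:", clf.score(X_te, y_te))
```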
Summary

Weaknesses
– Training (and Testing) is quite slow compared to ANN
• Because of Constrained Quadratic Programming
– Essentially a binary classifier
• However, there are some tricks to evade this.
– Very sensitive to noise
• A few off data points can completely throw off the algorithm
– Biggest Drawback: The choice of Kernel function.
• There is no “set-in-stone” theory for choosing a kernel function for
any given problem (still in research...)
• Once a kernel function is chosen, there is only ONE modifiable
parameter, the error penalty C.
Summary

Strengths
– Training is relatively easy
• We don’t have to deal with local minimum like in ANN
• SVM solution is always global and unique (check “Burges” paper
for proof and justification).
– Unlike ANN, doesn’t suffer from “curse of dimensionality”.
• How? Why? We have infinite dimensions?!
• Maximum Margin Constraint: DOT-PRODUCTS!
– Less prone to overfitting
– Simple, easy to understand geometric interpretation.
• No large networks to mess around with.
Applications of SVMs

 Bioinformatics
 Machine Vision
 Text Categorization
 Ranking (e.g., Google searches)
 Handwritten Character Recognition
 Time series analysis
Lots of very successful applications!!!
Reference

 Support Vector Machine Classification of


Microarray Gene Expression Data, Michael P. S.
Brown William Noble Grundy, David Lin, Nello
Cristianini, Charles Sugnet, Manuel Ares, Jr., David
Haussler
 www.cs.utexas.edu/users/mooney/cs391L/svm.ppt
 Text categorization with Support Vector
Machines:
learning with many relevant features
T. Joachims, ECML - 98
Hand-crafted
kernel function

SVM
Apply simple
classifier
Source of image: http://www.gipsa-lab.grenoble-
inp.fr/transfert/seminaire/455_Kadri2013Gipsa-
Deep Learning lab.pdf
simple
Learnable kernel
classifier

x1 … y1

x2 … y2




xN … yM

What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.
Clustering
• However,
– Similarity is hard to define, and …. “We know it when we see it”
– The real meaning of similarity is a philosophical question. We will
take a more pragmatic approach.

PC: http://www.funnyphotos.net.au/dog-owners/
Vehicle Example

Vehicle   Top speed (km/h)   Colour   Air resistance   Weight (kg)
V1 220 red 0.30 1300
V2 230 black 0.32 1400
V3 260 red 0.29 1500
V4 140 gray 0.35 800
V5 155 blue 0.33 950
V6 130 white 0.40 600
V7 100 black 0.50 3000
V8 105 red 0.60 2500
V9 110 gray 0.55 3500
Vehicle Clusters

[Figure: scatter plot of weight (kg, 500–3500) vs. top speed (km/h, 100–300) showing three clusters: lorries, sports cars, and medium market cars.]
Terminology
Object or data point, feature space, label, cluster, feature.
[Figure: the same weight vs. top speed plot, annotated with these terms: each object/data point is a feature vector, each axis is a feature, each group (lorries, sports cars, medium market cars) is a cluster, and a cluster's name is a label.]
Clustering – is an ill-defined problem!!

How many clusters? Six Clusters

Two Clusters Four Clusters


Types of Clustering
• A clustering procedure outputs a set of
clusters
• Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Types of Clusters: Center-Based

l Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster

4 center-based clusters
Types of Clusters: Contiguity-Based

l Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters
Types of Clusters: Density-Based

l Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters
Types of Clusters: Conceptual Clusters

l Shared Property or Conceptual Clusters


– Finds clusters that share some common property or represent
a particular concept.
.

2 Overlapping Circles
Types of Clusters: Objective Function

l Clusters Defined by an Objective Function


– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using
the given objective function. (No. of feasible partitions could be
very large and the problem is NP Hard)

– Can have global or local objectives.


 Hierarchical clustering algorithms typically have local objectives
 Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the
data to a parameterized model.
 Parameters for the model are determined from the data.
 Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.
Types of Clusters: Objective Function …

l Map the clustering problem to a different domain


and solve a related problem in that domain
– Proximity matrix defines a weighted graph, where the
nodes are the points being clustered, and the
weighted edges represent the proximities between
points

– Clustering is equivalent to breaking the graph into


connected components, one for each cluster.

– Want to minimize the edge weight between clusters


and maximize the edge weight within clusters
Membership matrix M
For hard clustering, the membership of data point k in cluster i is:

   μ_ik = 1  if  ‖u_k − c_i‖² ≤ ‖u_k − c_j‖²  for every other cluster centre j
   μ_ik = 0  otherwise

i.e. each data point belongs to the cluster whose centre is closest in distance.
c-partition
   ∪_{i=1..c} C_i = U              (all clusters together fill the whole universe U)
   C_i ∩ C_j = Ø  for all i ≠ j    (clusters do not overlap)
   Ø ⊂ C_i ⊂ U  for all i          (a cluster is never empty and is smaller than the whole universe U)
   2 ≤ c ≤ K                       (there must be at least 2 clusters in a c-partition, and at most as many as the number of data points K)
Objective Functions: Hard and Fuzzy C Means

Minimise the total sum of all distances:

   J_k-means = Σ_{i=1..c} J_i = Σ_{i=1..c} ( Σ_{k, u_k ∈ C_i} ‖u_k − c_i‖² )

   J_FCM = Σ_{i=1..c} J_i = Σ_{i=1..c} Σ_{k=1..n} μ_ik^m · ‖u_k − c_i‖²
K-Means Type Clustering Algorithms

• Given k, the k-means algorithm consists


of four steps:
– Select initial representative points (means
or centroids) at random.
– Assign each object to the cluster with the
nearest centroid.
– Compute each centroid as the mean of the
objects assigned to it.
– Repeat previous 2 steps until no change.
53
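The four steps above fit in a few lines of NumPy; this sketch assumes Euclidean distance and stops when the centroids no longer change.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]    # select initial representatives at random
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                       # assign each object to the nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):           # repeat until no change
            break
        centroids = new_centroids                           # recompute each centroid as a mean
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```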
The Objective Function of K-Means
• Most common measure is Sum of Squared Error (SSE)
or summed Intra Cluster Spread (ICS)
– For each point, the error is the distance to the nearest cluster
center
– To get SSE, we square these errors and sum them:

   SSE = ICS(c_1, c_2, …, c_k) = Σ_{j=1..k} Σ_{X_i ∈ C_j} d²(X_i, c_j)
– Xi is a data point in cluster Cj and cj is the representative point
(most often the mean) for cluster Cj .
– Given two clusters, we can choose the one with the smallest
error
– One easy way to reduce SSE is to increase k, the number of
clusters
• A good clustering with smaller k can have a higher SSE than a poor
clustering with higher k
Fuzzy C Means: Fuzzy membership matrix M
Point k's membership of cluster i (q is the fuzziness exponent):

   μ_ik = 1 / Σ_{j=1..c} ( d_ik / d_jk )^(2/(q−1))

where d_ik = ‖u_k − c_i‖ is the distance from point k to the current cluster centre i, and the d_jk are the distances from point k to the other cluster centres j.
Fuzzy membership matrix M

   μ_ik = 1 / [ (d_ik/d_1k)^(2/(q−1)) + (d_ik/d_2k)^(2/(q−1)) + ⋯ + (d_ik/d_ck)^(2/(q−1)) ]

        = ( 1 / d_ik^(2/(q−1)) ) / ( 1/d_1k^(2/(q−1)) + 1/d_2k^(2/(q−1)) + ⋯ + 1/d_ck^(2/(q−1)) )

i.e. the gravitation to cluster i relative to the total gravitation.
Electrical Analogy
U = R·I, so I = U/R. For c resistors in parallel, 1/R = 1/R_1 + 1/R_2 + ⋯ + 1/R_c.
The fraction of the total current flowing through branch i is

   i_i / I = (1/R_i) / ( 1/R_1 + 1/R_2 + ⋯ + 1/R_c )

which has the same form as μ_ik.
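The membership formula can be checked numerically with a short function; the distances and the fuzziness exponent q = 2 below are made-up values for illustration.

```python
import numpy as np

def fuzzy_memberships(distances, q=2.0):
    """Memberships of one point in each cluster, given its distances d_1k .. d_ck."""
    d = np.asarray(distances, dtype=float)
    inv = d ** (-2.0 / (q - 1.0))          # "gravitation" to each cluster
    return inv / inv.sum()                 # relative to the total gravitation

print(fuzzy_memberships([1.0, 2.0, 4.0]))  # closest centre gets the largest membership; sums to 1
```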
Two different K-means Clustering
[Figure: the same set of original points clustered by two different runs of K-means; one run finds the optimal clustering, the other a sub-optimal clustering.]


Importance of Choosing Initial Centroids
[Figure: with a good choice of initial centroids, six iterations of K-means converge to the correct three clusters.]
Importance of Choosing Initial Centroids …
[Figure: with a poor choice of initial centroids, five iterations of K-means converge to a sub-optimal clustering.]
Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting


one centroid from each cluster is small.
– The chance is relatively small when K is large
– If the clusters are all the same size, n, then the probability is K!·n^K / (Kn)^K = K!/K^K
– For example, if K = 10, then probability = 10!/10^10 = 0.00036


– Sometimes the initial centroids will readjust themselves in
‘right’ way, and sometimes they don’t
Solutions to Initial Centroids
Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to
determine initial centroids
• Select more than k initial centroids and
then select among these initial centroids
– Select most widely separated
• Postprocessing
• Bisecting K-means
– Not as susceptible to initialization issues
Limitations of K-means
• K-means has problems when clusters are
of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data


contains outliers.
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular
and linearly non-separable Shapes

Original Points K-means (2 Clusters)


Landscape for Clustering Problem
The k-means objective function:

   ICS(C_1, C_2, …, C_k) = Σ_{j=1..k} Σ_{Z_i ∈ C_j} d²(Z_i, V_j)

[Figure: an extremely simple one-dimensional dataset, and a plot of the fitness function (reciprocal of ICS) of the hard c-means algorithm for this dataset.]
Landscape for Clustering Problem (Contd.)
[Figure: the previous data with some noise points added, and the corresponding fitness (reciprocal of ICS) plot of the hard c-means algorithm.]

In addition to two global minima, we have local minima at approximately (0, 10), (5, 10), (10, 0), (10, 5). For these local minima one prototype covers one of the data clusters, while the other prototype is misled to the noise cluster.
A Few Open-ended Research Problems
• Detecting Clustering Tendency before doing
any clustering operation: Hypothesis testing?

A random patch of data points Clustering with K means when initialized


With no cluster With c = 3
Simultaneous Feature Selection
with Clustering
• Feature selection (i.e., attribute subset selection):
– Eliminate noisy features
– Discover the most important features
• Advantages
– Reducing dimensionality
– Improving learning efficiency
– Increasing predictive accuracy
– Reducing complexity of learned results

A must read for high-dimensional clustering:


One feature may be more discriminative than the other
[Figure: only feature x1 can identify the two clusters; feature x2 cannot.]
One feature may be more discriminative than the other (Contd.)
[Figure: salary (x10000) vs. age scatter plot; there is no cluster in the age subspace because there is no dense unit in that subspace, but there are two clusters (around C, C' and D, D') in the salary subspace.]
Open Problems
1) Can we use a proper representation of DE
Vectors to make the clustering fully automatic?

2) Is it possible to select most important features of the


dataset to reduce the computational overhead?

3) Can we choose a better way to evaluate the quality


of the clusters using human user experience?
Sparse k Means Clustering

Daniela M. Witten and Robert Tibshirani, A Framework for Feature Selection in


Clustering, Journal of the American Statistical Association, Vol. 105, No. 490
(June 2010), pp. 713-726
Towards Multi-context Clustering – can we
improve with evolutionary approaches?

Moumita Saha and Pabitra Mitra, “VLGAAC: Variable
Length Genetic Algorithm Based Alternative Clustering,”
ICONIP (2) 2014: 194–202.
Towards Multi-context Clustering – can we
improve with evolutionary approaches?

Can we cluster a face database using two contexts and present both
partitions at one go?

• Cluster the faces by whose face it is – Tom, Dick, Harry…

• Cluster the faces by the indicated emotion – sad, angry, happy…
Reinforcement Learning

Based on “Reinforcement Learning – An Introduction”
by Richard Sutton and Andrew Barto
Learning from Experience Plays a Role in …

Artificial Intelligence, Psychology, Control Theory and Operations Research,
Neuroscience, and Artificial Neural Networks all feed into
Reinforcement Learning (RL).
What is Reinforcement Learning?

Reinforcement learning is an area of machine
learning inspired by behaviorist psychology,
concerned with how software agents ought to
take actions in an environment so as to maximize
some notion of cumulative reward.

• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an
  external environment
• Learning what to do (how to map situations to
  actions) so as to maximize a numerical reward signal
Supervised Learning

Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Error = (target output – actual output)
Reinforcement Learning

Training info = evaluations (“rewards” / “penalties”)
Inputs → RL System → Outputs (“actions”)
Objective: get as much reward as possible
The Big Picture

Your action influences the state of the world which determines its reward
Key Features of RL

In machine learning, the environment is typically
formulated as a Markov decision process (MDP), as
many reinforcement learning algorithms for this
context utilize dynamic programming techniques.

• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward (sacrifice short-term
  gains for greater long-term gains)
• The need to explore and exploit
• Considers the whole problem of a goal-directed
  agent interacting with an uncertain environment
The Markov Property

By “the state” at step t, we mean whatever information is
available to the agent at step t about its environment.
The state can include immediate “sensations,” highly
processed sensations, and structures built up over time
from sequences of sensations.
Ideally, a state should summarize past sensations so as to
retain all “essential” information, i.e., it should have the
Markov Property:

    Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0 }
      = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }

for all s′, r, and histories s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0.
Markov Decision Processes

If a reinforcement learning task has the Markov
Property, it is basically a Markov Decision Process
(MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
• the state and action sets
• the one-step “dynamics” defined by the transition probabilities:

    P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }            for all s, s′ ∈ S, a ∈ A(s)

• the expected rewards:

    R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }    for all s, s′ ∈ S, a ∈ A(s)
An Example Finite MDP: Recycling Robot

At each step, the robot has to decide whether it should (1)
actively search for a can, (2) wait for someone to bring it
a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out
of power while searching, it has to be rescued (which is
bad).
Decisions are made on the basis of the current energy level: high or
low.
Reward = number of cans collected
Recycling Robot MDP

    S = {high, low}
    A(high) = {search, wait}
    A(low)  = {search, wait, recharge}

    R^search = expected no. of cans while searching
    R^wait   = expected no. of cans while waiting
    R^search > R^wait
Transition Table

The one-step dynamics of the recycling robot (for each state-action pair, the
possible next states, their transition probabilities, and expected rewards) are
sketched in code below.
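Since the original table figure is not reproduced here, the sketch below writes the
one-step dynamics as a Python dictionary. The states and actions follow the previous
slide; the probabilities alpha and beta, the rescue penalty of -3, and the numeric
reward values follow the standard Sutton & Barto formulation and should be read as
assumptions rather than values given in this document:

    # dynamics[(state, action)] = list of (next_state, probability, expected_reward)
    alpha, beta = 0.8, 0.4        # assumed P(battery stays high / stays low) while searching
    R_search, R_wait = 2.0, 1.0   # assumed expected cans, with R_search > R_wait

    dynamics = {
        ("high", "search"):   [("high", alpha, R_search), ("low", 1 - alpha, R_search)],
        ("high", "wait"):     [("high", 1.0, R_wait)],
        ("low",  "search"):   [("low", beta, R_search), ("high", 1 - beta, -3.0)],  # rescued
        ("low",  "wait"):     [("low", 1.0, R_wait)],
        ("low",  "recharge"): [("high", 1.0, 0.0)],
    }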
Complete Agent

• Temporally situated
• Continual learning and planning
• Objective is to affect the environment
• Environment is stochastic and uncertain

The agent sends actions to the environment; the environment returns the new
state and a reward to the agent.
Elements of RL

• Policy: what to do
• Reward: what is good
• Value: what is good because it predicts reward
• Model of the environment: what follows what
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (no labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).

• RL deals with agents that must sense & act upon their environment.
  This combines classical AI and machine learning techniques.
  It is the most comprehensive problem setting.

• Examples:
• A robot cleaning my room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
• and so on
Complications
• The outcome of your actions may be uncertain.
• You may not be able to perfectly sense the state of the world.
• The reward may be stochastic.
• Reward is delayed (e.g., finding food in a maze).
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it.
• How much time do you need to explore uncharted territory before you
  exploit what you have learned?
Value Function

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

    V*(s) ← max_a [ r(s, a) + γ V*(δ(s, a)) ]

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
            = r(s, a) + γ max_{a′} Q(δ(s, a), a′)

    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′),    where s′ = s_{t+1}
Q-Learning

    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′),    where s′ = s_{t+1}

    Q(s, a) = E[ r(s, a) ] + γ Σ_{s′} P(s′ | s, a) max_{a′} Q(s′, a′)
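In code, the tabular backup above is essentially a one-liner. The sketch assumes Q is
a NumPy array indexed as Q[state, action] and mirrors the deterministic-world update
on this slide (the function name is mine):

    import numpy as np

    def q_update(Q, s, a, r, s_next, gamma=0.9):
        """Deterministic-world Q-learning backup:
        Q(s, a) <- r + gamma * max_a' Q(s_next, a')."""
        Q[s, a] = r + gamma * np.max(Q[s_next])
        return Q

For stochastic environments one would normally blend this target with the old
estimate using a step size, as in the function-approximation update on the next slide.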
More on Function Approximation

• For instance, a linear function with K features:

      Q(s, a) ≈ f_θ(s, a) = Σ_{k=1}^{K} θ^a_k Φ_k(s)

  – The features Φ_k are fixed measurements of the state (e.g., # stones on the board).
  – We only learn the parameters θ.

• Update rule (start in state s, take action a, observe reward r, and end up in state s′):

      θ^a_k ← θ^a_k + α ( r + γ max_{a′} Q̂(s′, a′) − Q̂(s, a) ) Φ_k(s)

  where the term in parentheses is the change in Q.
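The update rule translates almost literally into NumPy. Everything below (the feature
function phi, the array shapes, the step size) is an illustrative assumption:

    import numpy as np

    def linear_q(theta, phi, s, a):
        """Q(s, a) ~ sum_k theta[a, k] * Phi_k(s)."""
        return theta[a] @ phi(s)

    def fa_update(theta, phi, s, a, r, s_next, alpha=0.05, gamma=0.9):
        """theta[a, k] <- theta[a, k] + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) * Phi_k(s)."""
        q_next = max(linear_q(theta, phi, s_next, b) for b in range(theta.shape[0]))
        td_error = r + gamma * q_next - linear_q(theta, phi, s, a)   # the "change in Q"
        theta[a] += alpha * td_error * phi(s)
        return theta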
Q Learning Example
Suppose we have 5 rooms in a building connected by doors as shown in the
figure below. We'll number each room 0 through 4. The outside of the
building can be thought of as one big room (5). Notice that doors 1 and 4
lead into the building from room 5 (outside).

We can represent the rooms on a graph, each room as a node, and
each door as a link.
For this example, we'd like to put an agent in any room, and from that room, go outside the
building (this will be our target room). In other words, the goal room is number 5.
To set this room as a goal, we'll associate a reward value to each door (i.e. link between
nodes). The doors that lead immediately to the goal have an instant reward of 100.
 Other doors not directly connected to the target room have zero reward.
Because doors are two-way ( 0 leads to 4, and 4 leads back to 0 ), two arrows are assigned to
each room. Each arrow contains an instant reward value, as shown below:
Of course, Room 5 loops back to itself with a reward of 100, and all other direct
connections to the goal room carry a reward of 100. In Q-learning, the goal is to
reach the state with the highest reward, so that if the agent arrives at the goal, it
will remain there forever. This type of goal is called an "absorbing goal".
Imagine our agent as a dumb virtual robot that can learn through experience. The
agent can pass from one room to another but has no knowledge of the
environment, and doesn't know which sequence of doors lead to the outside.
Suppose we want to model some kind of simple evacuation of an agent from any
room in the building. Now suppose we have an agent in Room 2 and we want the
agent to learn to reach outside the house (5).
The terminology in Q-Learning includes the terms "state" and "action".
We'll call each room, including outside, a "state", and the agent's movement from
one room to another will be an "action". In our diagram, a "state" is depicted as a
node, while "action" is represented by the arrows.
Suppose the agent is in state 2. From state 2, it can go to state 3 because state 2 is
connected to 3. From state 2, however, the agent cannot directly go to state 1 because
there is no direct door connecting room 1 and 2 (thus, no arrows). From state 3, it can go
either to state 1 or 4 or back to 2 (look at all the arrows about state 3). If the agent is in
state 4, then the three possible actions are to go to state 0, 5 or 3. If the agent is in state 1,
it can go either to state 5 or 3. From state 0, it can only go back to state 4.

We can put the state diagram and the instant reward values into the following reward
table, "matrix R".

The -1's in the table represent
null values (i.e., where there
isn't a link between nodes). For
example, State 0 cannot go to
State 1.
Now we'll add a similar matrix, "Q", to the brain of our agent, representing
the memory of what the agent has learned through experience. The rows of
matrix Q represent the current state of the agent, and the columns
represent the possible actions leading to the next state (the links between
the nodes).
The agent starts out knowing nothing, the matrix Q is initialized to zero. In
this example, for the simplicity of explanation, we assume the number of
states is known (to be six). If we didn't know how many states were
involved, the matrix Q could start out with only one element. It is a simple
task to add more columns and rows in matrix Q if a new state is found.
The transition rule of Q learning is a very simple formula:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

According to this formula, the value assigned to a specific element of matrix Q is
equal to the sum of the corresponding value in matrix R and the product of the
learning parameter Gamma with the maximum value of Q over all possible actions in
the next state.
Our virtual agent will learn through experience, without a teacher (i.e., without
explicit supervision). The agent will explore from state to state
until it reaches the goal. We'll call each exploration an episode. Each
episode consists of the agent moving from the initial state to the goal
state. Each time the agent arrives at the goal state, the program goes to the
next episode.

Each episode is equivalent to one training session. In each training session,
the agent explores the environment (represented by matrix R) and receives the
reward (if any) until it reaches the goal state. The purpose of the training is to
enhance the 'brain' of our agent, represented by matrix Q. More training
results in a more optimized matrix Q. In this case, once the matrix Q has been
enhanced, instead of exploring around and going back and forth between the same
rooms, the agent will find the fastest route to the goal state.

The Gamma parameter has a range of 0 to 1 (0 <= Gamma < 1). If Gamma is
closer to zero, the agent will tend to consider only immediate rewards. If
Gamma is closer to one, the agent will consider future rewards with greater
weight, being willing to delay the reward.

To use the matrix Q, the agent simply traces the sequence of states from the initial
state to the goal state. The algorithm finds the actions with the highest reward values
recorded in matrix Q for the current state.
The Q-Learning algorithm goes as follows:

1. Set the gamma parameter and the environment rewards in matrix R.

2. Initialize matrix Q to zero.

3. For each episode:
     Select a random initial state.
     Do while the goal state hasn't been reached:
       Select one among all possible actions for the current state.
       Using this possible action, consider going to the next state.
       Get the maximum Q value for this next state based on all possible actions.
       Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
       Set the next state as the current state.
     End do
   End for
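The whole rooms example fits in a few lines of Python. The reward matrix R below is
reconstructed from the door/reward description above; the helper names, the purely
random exploration, and the number of episodes are my choices rather than part of
the original text:

    import numpy as np

    # Reward matrix R: rows = current state (rooms 0-5), columns = next state.
    # -1 marks "no door", 0 a door away from the goal, 100 a door into the goal (room 5).
    R = np.array([
        [-1, -1, -1, -1,  0, -1],   # room 0: door to 4
        [-1, -1, -1,  0, -1, 100],  # room 1: doors to 3 and 5
        [-1, -1, -1,  0, -1, -1],   # room 2: door to 3
        [-1,  0,  0, -1,  0, -1],   # room 3: doors to 1, 2, 4
        [ 0, -1, -1,  0, -1, 100],  # room 4: doors to 0, 3, 5
        [-1,  0, -1, -1,  0, 100],  # room 5: doors to 1, 4 and the self-loop
    ])

    gamma, goal = 0.8, 5
    Q = np.zeros_like(R, dtype=float)
    rng = np.random.default_rng(0)

    for episode in range(1000):
        state = int(rng.integers(6))                 # random initial state
        while state != goal:
            actions = np.flatnonzero(R[state] >= 0)  # doors available from this room
            action = int(rng.choice(actions))        # pick one at random
            Q[state, action] = R[state, action] + gamma * Q[action].max()
            state = action                           # the chosen door leads to that room
        # goal reached: episode finished

    print(np.round(100 * Q / Q.max()))               # normalized Q matrix

With Gamma = 0.8 the largest unnormalized entry settles at 500 (the fixed point of
q = 100 + 0.8q at the goal's self-loop), which matches the value used for
normalization later in this example.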
To understand how the Q-learning algorithm works, we'll go through a few episodes step
by step.
We'll start by setting the value of the learning parameter Gamma = 0.8, and the initial
state as Room 1.
Initialize matrix Q as a zero matrix:
Look at the second row (state 1) of matrix R. There are two possible actions
for the current state 1: go to state 3, or go to state 5. By random selection, we
select to go to 5 as our action.
Now let's imagine what would happen if our agent were in state 5. Look at
the sixth row of the reward matrix R (i.e. state 5). It has 3 possible actions:
go to state 1, 4 or 5.

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]


Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4), and Q(5, 5) are all zero. The
result of this computation for Q(1, 5) is 100 because of the instant reward from R(1, 5).
The next state, 5, now becomes the current state. Because 5 is the goal state, we've
finished one episode. Our agent's brain now contains an updated matrix Q in which
the entry Q(1, 5) is 100 and every other entry is still 0.
For the next episode, we start with a randomly chosen initial state. This time, we have
state 3 as our initial state.
Look at the fourth row of matrix R; it has 3 possible actions: go to state 1, 2 or 4. By
random selection, we select to go to state 1 as our action.
Now we imagine that we are in state 1. Look at the second row of reward matrix R (i.e.
state 1). It has 2 possible actions: go to state 3 or state 5. Then, we compute the Q
value:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
We use the updated matrix Q from the last episode. Q(1, 3) = 0 and Q(1, 5) =
100. The result of the computation is Q(3, 1) = 80 because the reward is
zero. The matrix Q becomes:

The next state, 1, now becomes the current state. We repeat the inner
loop of the Q learning algorithm because state 1 is not the goal state

So, starting the new loop with the current state 1, there are two possible
actions: go to state 3, or go to state 5. By lucky draw, our action selected
is 5.
Now, imagining we're in state 5, there are three possible
actions: go to state 1, 4 or 5. We compute the Q value
using the maximum value of these possible actions.

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100

The entries Q(5, 1), Q(5, 4), and Q(5, 5) of matrix Q are still zero. The result of this
computation for Q(1, 5) is 100 because of the instant reward from R(1, 5). This result
does not change the Q matrix.
Because 5 is the goal state, we finish this episode. Our agent's brain now contains the
updated matrix Q as:

If our agent learns more through further episodes, it will finally reach convergence
values in matrix Q like:
This matrix Q can then be normalized (i.e., converted to a percentage) by
dividing all non-zero entries by the highest number (500 in this case):

Once the matrix Q gets close enough to a state of convergence, we know our agent has
learned the optimal paths to the goal state. Tracing the best sequence of states is as
simple as following the links with the highest values at each state.
For example, from initial State 2, the agent can use the matrix Q as a guide:
From State 2 the maximum Q value suggests the action to go to state 3.
From State 3 the maximum Q values suggest two alternatives: go to state 1 or 4. Suppose we
arbitrarily choose to go to 1.
From State 1 the maximum Q value suggests the action to go to state 5.
Thus the sequence is 2 - 3 - 1 - 5.
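Tracing that greedy sequence programmatically is straightforward. The sketch below
assumes Q is the matrix learned in the earlier Python example (the function name is
mine); ties, such as the one at state 3, are broken by argmax in favor of the
lower-numbered state:

    def greedy_path(Q, start, goal):
        """Follow the action with the largest Q value from each state until the goal."""
        path = [start]
        while path[-1] != goal:
            path.append(int(Q[path[-1]].argmax()))
        return path

    print(greedy_path(Q, start=2, goal=5))   # e.g. [2, 3, 1, 5]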
The n-Armed Bandit Problem

Choose repeatedly from one of n actions; each
choice is called a play.
After each play a_t, you get a reward r_t, where

    E[ r_t | a_t ] = Q*(a_t)

These are the unknown action values.
The distribution of r_t depends only on a_t.
The objective is to maximize the reward in the long term,
e.g., over 1000 plays.

To solve the n-armed bandit problem,
you must explore a variety of actions
and then exploit the best of them.
The Exploration/Exploitation Dilemma

Suppose you form action-value estimates

    Q_t(a) ≈ Q*(a)

The greedy action at time t is

    a_t* = argmax_a Q_t(a)

    a_t = a_t*   →  exploitation
    a_t ≠ a_t*   →  exploration

You can’t exploit all the time; you can’t explore all the time.
You can never stop exploring, but you should always
reduce exploring.
Action-Value Methods

Methods that adapt action-value estimates and
nothing else. For example, suppose that by the t-th play, action a
had been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a};
then the “sample average” estimate is

    Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a

and

    lim_{k_a → ∞} Q_t(a) = Q*(a)
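The sample average can be maintained incrementally, without storing every past
reward. This is a minimal sketch (class and attribute names are mine):

    class SampleAverage:
        """Q_t(a) = (r_1 + ... + r_{k_a}) / k_a, updated incrementally."""
        def __init__(self, n_actions):
            self.Q = [0.0] * n_actions   # current action-value estimates
            self.k = [0] * n_actions     # number of times each action was chosen

        def update(self, a, r):
            self.k[a] += 1
            self.Q[a] += (r - self.Q[a]) / self.k[a]   # incremental mean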
e-Greedy Action Selection

Greedy action selection:

    a_t = a_t* = argmax_a Q_t(a)

e-Greedy:

    a_t = a_t*               with probability 1 − e
    a_t = a random action    with probability e

… the simplest way to try to balance exploration and exploitation.
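Putting the sample-average estimates and e-greedy selection together on a toy
n-armed bandit; the Gaussian reward model and all constants below are assumptions
made only for illustration:

    import random

    def eps_greedy_bandit(true_values, n_plays=1000, eps=0.1, seed=0):
        """Play an n-armed bandit using sample-average estimates and e-greedy selection."""
        rng = random.Random(seed)
        n = len(true_values)
        Q, k, total = [0.0] * n, [0] * n, 0.0
        for _ in range(n_plays):
            if rng.random() < eps:
                a = rng.randrange(n)                    # explore: random action
            else:
                a = max(range(n), key=lambda i: Q[i])   # exploit: greedy action
            r = rng.gauss(true_values[a], 1.0)          # reward with mean Q*(a)
            k[a] += 1
            Q[a] += (r - Q[a]) / k[a]                   # sample-average update
            total += r
        return Q, total

    Q_est, total_reward = eps_greedy_bandit([0.2, 0.5, 1.0, 0.1])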
