Machine Learning
• What is learning?
Definitions
• Webster
– To gain knowledge or understanding of, or skill in, by study, instruction, or experience; to memorize; to
acquire knowledge, skill, or a behavioral tendency; to discover or obtain knowledge of for the first
time
• Simon
– Any process by which a system improves its performance
• So far we have programmed knowledge into the agent (expert rules, probabilities, search
space representations), but an autonomous agent should acquire this knowledge on its own.
• Machine Learning will make this possible
A General Model of Learning Agents
• Learning Element
– Adds knowledge, makes improvement to system
• Performance Element
– Performs task, selects external actions
• Critic
– Monitors results of performance, provides feedback to learning element
• Problem Generator
– Actively suggests experiments, generates examples to test
• Performance Standard
– Method / standard of measuring performance
The Learning Problem
• Learning = Improving with experience at some task
– Improve over task T
– With respect to performance measure P
– Based on experience E
• Example: Learn to play checkers (Chinook)
– T: Play checkers
– P: % of games won in world tournament
– E: opportunity to play against self
• Example: Learn to Diagnose Patients
– T: Diagnose patients
– P: Percent of patients correctly diagnosed
– E: Pre-diagnosed medical histories of patients
Categories of Learning
• Learning by being told
• Learning by examples / Supervised learning
– Example: Syskill and Webert perform web page rating
• Learning by discovery / Unsupervised learning
• Learning by experimentation / Reinforcement learning
Learning From Examples
• Learn general concepts or categories from examples
• Learn a task (drive a vehicle, win a game of backgammon)
• Examples of objects or tasks are gathered and stored in a database
• Each example is described by a set of attributes or features
• Each example used for training is classified with its correct label (chair vs. not chair, horse vs. not horse, 1 vs. 2 vs. 3, good move vs. bad move, etc.)
• The machine learning program learns a general concept description from these specific examples
• The ML program then applies the learned concept to classify examples, or perform tasks, it has never seen before (see the sketch below)
Learning From Examples
• First algorithm: naïve Bayes classifier
• D is training data
– Each data point is described by attributes a1..an
• Learn mapping from data point to a class value
– Class values v1..vj
• We are searching through the space of
possible concepts
– Functions that map data to class value
Supervised Learning Algorithm – Naïve Bayes
P(h | D) = P(D | h) P(h) / P(D)
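A minimal sketch of a naive Bayes classifier built from this rule, assuming discrete attributes and ignoring probability smoothing; all function and variable names are illustrative.

    from collections import Counter, defaultdict

    def train_naive_bayes(data):
        # data: list of (attribute_tuple, class_value) pairs
        class_counts = Counter(v for _, v in data)
        value_counts = defaultdict(Counter)      # (attr_index, class) -> value counts
        for attrs, v in data:
            for i, a in enumerate(attrs):
                value_counts[(i, v)][a] += 1
        return class_counts, value_counts, len(data)

    def classify(attrs, class_counts, value_counts, n):
        # Choose the class v maximizing P(v) * product over i of P(a_i | v);
        # P(D) is the same for every class, so it can be dropped.
        def score(v):
            s = class_counts[v] / n
            for i, a in enumerate(attrs):
                s *= value_counts[(i, v)][a] / class_counts[v]
            return s
        return max(class_counts, key=score)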
Prediction Problems
• Software that customizes itself to the user
Inductive Learning Hypothesis
• Any hypothesis found to approximate the target function
well over a sufficiently large set of training examples will
also approximate the target function well over other
unobserved examples
Inductive Bias
• There can be a number of hypotheses consistent with
training data
• Each learning algorithm has an inductive bias that imposes a
preference on the space of all possible hypotheses
Decision Trees
• A decision tree takes a description of an object or situation as
input, and outputs a yes/no "decision".
• It can also be used to output a greater variety of answers.
• Here is a decision tree for the concept PlayTennis
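A minimal sketch of the standard PlayTennis tree (Mitchell's example) written as nested conditionals; the attribute values are the usual ones and are assumed here, since the slide's figure is not reproduced.

    def play_tennis(outlook, humidity, wind):
        # Root test: Outlook
        if outlook == "Sunny":
            return humidity == "Normal"   # High humidity -> No
        if outlook == "Overcast":
            return True                   # Always play
        if outlook == "Rain":
            return wind == "Weak"         # Strong wind -> No
        return False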
Decision Tree Representation
The information content of a node with p positive and n negative examples is
I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
One solution to overfitting: prune the decision tree
How Do We Prune a Decision Tree?
• Delete a decision node
• This causes the entire subtree rooted at that node to be removed
• Replace the node with a leaf, and assign the leaf the majority-vote class
• Reduced error pruning: remove nodes as long as performance improves on a validation set (sketched below)
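A minimal sketch of reduced-error pruning under an assumed tree representation (a leaf is a class label; an internal node is a dict holding the attribute tested, its branches, and the majority label at that node); examples are dicts of attribute values plus a "label" key. All names are illustrative.

    def classify(node, example):
        while isinstance(node, dict):
            node = node["branches"].get(example[node["attr"]], node["majority"])
        return node

    def accuracy(root, validation):
        return sum(classify(root, ex) == ex["label"] for ex in validation) / len(validation)

    def reduced_error_prune(root, validation):
        # Bottom-up: tentatively replace each subtree with its majority-class leaf
        # and keep the replacement only if validation accuracy does not drop.
        def visit(node):
            if not isinstance(node, dict):
                return
            for value, child in list(node["branches"].items()):
                visit(child)
                if isinstance(child, dict):
                    before = accuracy(root, validation)
                    node["branches"][value] = child["majority"]   # prune tentatively
                    if accuracy(root, validation) < before:
                        node["branches"][value] = child           # undo the prune
        visit(root)
        return root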
Measure Performance of a Learning Algorithm
[Figure: a learned decision tree with root node Outlook; branches sunny, overcast, and windy lead to Yes/No leaves]
Performance Measures
• Percentage correctly classified, averaged over folds
• Confusion matrix
                    Predicted Negative   Predicted Positive
Actual Negative            TN                   FP
Actual Positive            FN                   TP
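A minimal sketch of computing the confusion matrix and the percentage correctly classified from binary predictions (0 = negative, 1 = positive); the names are illustrative.

    def confusion_matrix(actual, predicted):
        tn = fp = fn = tp = 0
        for a, p in zip(actual, predicted):
            if a == 0 and p == 0:
                tn += 1
            elif a == 0 and p == 1:
                fp += 1
            elif a == 1 and p == 0:
                fn += 1
            else:
                tp += 1
        return tn, fp, fn, tp

    def percent_correct(actual, predicted):
        tn, fp, fn, tp = confusion_matrix(actual, predicted)
        return 100.0 * (tn + tp) / (tn + fp + fn + tp)

    # In k-fold cross-validation this percentage would be averaged over the folds.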
y = 1 if Σ(i=1..n) w_i x_i > Threshold, 0 otherwise
• Transfer function
• Learning rate η; weight update: w_new = w_old + η (y_d − y) x_i
• Threshold update function
• Number of epochs
Learn Logical AND of x1 and x2
• Fire (output 1) when Σ_i w_i x_i > Threshold
• Training examples (x1, x2, y_d): (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)
• Initially let w1 = 0, w2 = 0, T = 0, eta = 1
• Epoch 1, 2, ... until no errors remain in a full epoch (a sketch of the loop follows below)
CONVERGENCE!
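A minimal sketch of this training loop; the threshold is assumed to be adjusted by the analogous rule T_new = T_old − eta (y_d − y), which the slide does not spell out.

    # Perceptron learning of logical AND.
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = [0.0, 0.0]          # w1, w2
    T = 0.0                 # threshold
    eta = 1.0               # learning rate

    for epoch in range(20):                  # cap on the number of epochs
        errors = 0
        for x, y_d in data:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else 0
            if y != y_d:
                errors += 1
                for i in range(len(w)):
                    w[i] += eta * (y_d - y) * x[i]   # w_new = w_old + eta (y_d - y) x_i
                T -= eta * (y_d - y)                 # assumed threshold update
        if errors == 0:                      # a full epoch with no errors: CONVERGENCE!
            break

    print(w, T)   # this run converges to w = [2, 1], T = 2

The learned weights correspond to the separating line 2x1 + x2 = 2 shown in the figure on the next slide.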
The AND Function
• Notice that the classes can be separated by a line (hyperplane)
[Figure: two scatter plots of Class 1 (+) and Class 2 (−) points in the (x1, x2) plane; in the AND example the learned line 2x1 + x2 = 2 separates the two classes]
Examples
• Perceptron Example
• Perceptron Example
Linearly Separable
• If the classes can be separated by a hyperplane,
then they are linearly separable.
• Linearly Separable ⇒ Learnable by a Perceptron
• Here is the XOR space: No line can separate these data points
into two classes – need two lines
[Figure: the four XOR points; (0,1) and (1,0) are positive, (0,0) and (1,1) are negative]
How Can We Learn These Functions?
Since the gradient specifies direction of steepest increase of error, the training rule for gradient
descent is to update each weight by the derivative of the error with respect to each weight, or
∂E/∂W_j = Err × ∂Err/∂W_j = −Err × g′(in) × a_j
where g′(in) is the derivative of the transfer function and a_j is the activation value at source node j.
We want to eliminate the error when we adjust the weights, so we multiply the formula by −1.
We want to constrain the adjustment, so we multiply the formula again by the learning rate η,
giving the update rule W_j ← W_j + η × Err × g′(in) × a_j.
Same general idea as before. If error is positive, then network output is too small
so weights are increased for positive inputs and decreased for negative inputs.
The opposite happens when the error is negative.
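A minimal sketch of this update for a single output unit, assuming a sigmoid transfer function g (the slides do not fix a particular g); the names are illustrative.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def gradient_descent_step(W, a, y, eta):
        # W_j <- W_j + eta * Err * g'(in) * a_j
        in_value = sum(wj * aj for wj, aj in zip(W, a))   # weighted sum of inputs
        out = sigmoid(in_value)                           # g(in)
        err = y - out                                     # Err
        g_prime = out * (1.0 - out)                       # g'(in) for the sigmoid
        return [wj + eta * err * g_prime * aj for wj, aj in zip(W, a)]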
Hidden-to-output Weights
• ... and the data in the new space is separable, as shown on the right.
• We can search for such mappings that leave maximal margins between the classes.
Reinforcement Learning
• Learn action selection for probabilistic applications
– Robot learning to dock on battery charger
– Learning to choose actions to optimize factory output
– Learning to play Backgammon
• Note several problem characteristics:
– Delayed reward
– Opportunity for active exploration
– Possibility that state only partially observable
– Possible need to learn multiple tasks with same
sensors/effectors
Reinforcement Learning
• Learning an optimal strategy for maximizing future reward
• Agent has little prior knowledge and no immediate
feedback
• Credit assignment to actions is difficult when reward arrives only in the future
• Two basic agent designs
– Agent learns utility function U(s) on states
• Used to select actions maximizing expected utility
• Requires a model of action outcomes (T(s,a,s'))
– Agent learns action-value function
• Gives expected utility Q(s,a) of action a in state s
• Q-learning learns Q(s,a) directly
• Requires no model of action outcomes, but cannot look ahead
Passive Learning in a Known Environment
• Here are utilities for our navigation problem with γ = 1 and R(s) = −0.04 for nonterminal states
Calculate Utility Values
• When selecting an action, the agent chooses the action that maximizes
the Expected Utility of the resulting state (see the sketch after the grid below)
[Figure: 4×3 grid world showing the learned utility value of each state, with terminal states +1 and −1]
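A minimal sketch of that selection rule, assuming a transition model T(s, a, s') and learned utilities U(s) are available; the names are illustrative.

    def expected_utility(state, action, T, U):
        # EU(a | s) = sum over s' of T(s, a, s') * U(s')
        return sum(p * U[s_next] for s_next, p in T(state, action).items())

    def best_action(state, actions, T, U):
        # Choose the action whose resulting state has the highest expected utility.
        return max(actions, key=lambda a: expected_utility(state, a, T, U))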
Learning an Action-Value Function: Q-Learning
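This final heading introduces Q-learning; a minimal sketch of the standard tabular update Q(s,a) ← Q(s,a) + α [r + γ max over a' of Q(s',a') − Q(s,a)], with illustrative names.

    from collections import defaultdict
    import random

    Q = defaultdict(float)   # Q[(state, action)] defaults to 0

    def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # One Q-learning update after observing the transition (s, a, r, s').
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def choose_action(s, actions, epsilon=0.1):
        # Epsilon-greedy exploration: mostly exploit, occasionally explore.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])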