Week 9 - Probabilistic Dynamic Programming
IE301
Operations Research II
Probabilistic Dynamic Programming
• In deterministic dynamic programming, a
specification of the current state and current
decision was enough to tell us with certainty the new
state and the costs/rewards during the current stage.
• In probabilistic dynamic programming, by contrast,
the state at the next stage is not completely
determined by the state and policy decision at the
current stage. Instead, there is a probability
distribution for what the next state will be.
Basic Structure of Probabilistic
Dynamic Programming
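[Figure: one stage of the recursion. From state s_n at stage n, decision x_n
leads with probability p_i to state i at stage n+1 (i = 1, 2, ..., S). Branch i
carries the contribution C_i from stage n and ends at a node with value
f_{n+1}(i). The decision node itself has value f_n(s_n, x_n).]
$f_n(s_n) = \min_{x_n} f_n(s_n, x_n)$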
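Reading the diagram as an expectation over its S branches, and assuming the objective is the expected sum of stage contributions, the value of choosing x_n in state s_n could be written as follows. This is a sketch inferred from the figure labels (p_i, C_i, f_{n+1}(i)), not a formula quoted from the slide.

```latex
f_n(s_n, x_n) = \sum_{i=1}^{S} p_i \left[ C_i + f_{n+1}(i) \right]
```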
Example: Uncertain Demand
• For the price of $1/gallon, the Safeco
Supermarket chain has purchased 6 gallons of
milk from a local dairy.
• Each gallon of milk is sold in the chain’s three
stores for $2/gallon.
• The dairy must buy back at 50¢/gallon any milk
that is left at the end of the day.
• Unfortunately for Safeco, demand for each of the
chain’s three stores is uncertain.
• Safeco wants to allocate the 6 gallons of milk to
the three stores so as to maximize the expected
net daily profit earned from milk.
• Use dynamic programming to determine how
Safeco should allocate the 6 gallons of milk
among the three stores.
Store   Daily demand (gallons)   Probability
1       1                        0.6
1       2                        0
1       3                        0.4
2       1                        0.5
2       2                        0.1
2       3                        0.4
3       1                        0.4
3       2                        0.3
3       3                        0.3
• With the exception of the fact that the
demand is uncertain, this problem is very
similar to the resource allocation problem.
– Stage: Each store
– State: The amount to be allocated to the
remaining stores
– Decision: Amount allocated to the current store
• Observe that since Safeco’s daily purchase
costs are always $6, we may concentrate our
attention on the problem of allocating the
milk to maximize daily revenue earned from
the 6 gallons.
• Define
– rt(gt) = expected revenue earned from gt
gallons assigned to store t
– ft(x) = maximum expected revenue earned
from x gallons assigned to stores t, t+1,…,3
• For t = 1, 2, we may write
$f_t(x) = \max_{g_t} \{ r_t(g_t) + f_{t+1}(x - g_t) \}$
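The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own computation: it assumes the boundary condition f_3(x) = r_3(x) (the last store receives all remaining milk) and lets g_t range over 0, 1, ..., x. The names PRICE, BUYBACK, DEMAND, r, and f are chosen here for illustration only.

```python
from functools import lru_cache

PRICE = 2.0      # selling price per gallon (from the example)
BUYBACK = 0.5    # buy-back price per unsold gallon (from the example)
STORES = 3

# DEMAND[t][d] = probability that store t's daily demand is d gallons
DEMAND = {
    1: {1: 0.6, 2: 0.0, 3: 0.4},
    2: {1: 0.5, 2: 0.1, 3: 0.4},
    3: {1: 0.4, 2: 0.3, 3: 0.3},
}


def r(t, g):
    """Expected revenue r_t(g) from assigning g gallons to store t."""
    return sum(
        p * (PRICE * min(g, d) + BUYBACK * max(g - d, 0))
        for d, p in DEMAND[t].items()
    )


@lru_cache(maxsize=None)
def f(t, x):
    """Maximum expected revenue f_t(x) from x gallons for stores t, ..., 3."""
    if t == STORES:
        # Assumed boundary condition: the last store gets all remaining milk.
        return r(t, x)
    return max(r(t, g) + f(t + 1, x - g) for g in range(x + 1))


print(round(f(1, 6), 2))                         # best expected revenue
print(round(r(1, 1) + r(2, 3) + r(3, 2), 2))     # revenue of allocation (1, 3, 2)
```

Under these assumptions both printed values agree at 9.75, so the allocation (1, 3, 2) attains the optimum; subtracting the $6 purchase cost gives an expected net profit of $3.75.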
Stage 2:
Stage 1:
Optimal Solution:
Allocate 1 gallon to store 1; then, from f2(5), allocate 3
gallons to store 2 and the remaining 2 gallons to store 3.
Example: A Probabilistic Inventory Model
• We modify the inventory model discussed
previously to allow for uncertain demand.
• This will illustrate the difficulties involved in
solving a probabilistic dynamic program (PDP) for
which the state during the next period is uncertain.
A Probabilistic Inventory Model: f3(i)
A Probabilistic Inventory Model: f2(i)
A Probabilistic Inventory Model: f1(i)
Optimal Policy:
At stage 1, produce 3; later production decisions
depend on the realized demand.
Further Examples of Probabilistic
Dynamic Programming Formulations
• Many probabilistic dynamic programming problems
can be solved using recursions of the following form
(for max problems):
$f_t(i) = \max_a \left\{ (\text{expected reward in stage } t \mid i, a) + \sum_j p(j \mid i, a, t) \, f_{t+1}(j) \right\}$
• ft(i): max expected reward that can be earned during
stages t, t+1,…, given the state at the beginning of
stage t is i
• p(j|i,a,t): probability that the next stage’s state will be
j, given that current state is i (at stage t) and action a
is chosen
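A generic backward pass for a recursion of this form might look like the sketch below. The function solve_pdp and its arguments (actions, reward, trans, terminal) are hypothetical names introduced here, and the two-state example at the bottom is made up purely to exercise the code.

```python
def solve_pdp(T, states, actions, reward, trans, terminal):
    """Backward recursion for
    f_t(i) = max_a [ reward(t, i, a) + sum_j p(j | i, a, t) * f_{t+1}(j) ].

    trans(t, i, a) returns a dict {j: p(j | i, a, t)};
    terminal(i) gives the value f_{T+1}(i) after the last stage.
    """
    f = {T + 1: {i: terminal(i) for i in states}}
    policy = {}
    for t in range(T, 0, -1):
        f[t], policy[t] = {}, {}
        for i in states:
            best_a, best_v = None, float("-inf")
            for a in actions(t, i):
                v = reward(t, i, a) + sum(
                    p * f[t + 1][j] for j, p in trans(t, i, a).items()
                )
                if v > best_v:
                    best_a, best_v = a, v
            f[t][i], policy[t][i] = best_v, best_a
    return f, policy


# Made-up two-stage example: "stay" earns 1 and keeps the state;
# "jump" earns 0 and lands in state 0 or 1 with probability 0.5 each.
f, policy = solve_pdp(
    T=2,
    states=[0, 1],
    actions=lambda t, i: ["stay", "jump"],
    reward=lambda t, i, a: 1.0 if a == "stay" else 0.0,
    trans=lambda t, i, a: {i: 1.0} if a == "stay" else {0: 0.5, 1: 0.5},
    terminal=lambda i: 10.0 * i,
)
print(f[1], policy[1])
```

Storing the transition law as a dictionary keeps the expectation a single sum; for a min problem the comparison would be reversed.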
Another Example
• When Sally Mutton arrives at the bank,
30 minutes remain in her lunch break.
• If Sally makes it to the head of the line
and enters service before the end of her
lunch break, she earns reward r.
• However, Sally does not enjoy waiting in
lines, so to reflect her dislike for waiting
in line, she incurs a cost of c for each
minute she waits.
• During a minute in which n people are ahead of
Sally, there is a probability p(x|n) that x people
will complete their transactions.
• Suppose that when Sally arrives, 20 people are
ahead of her in line.
• Use dynamic programming to determine a
strategy for Sally that will maximize her
expected net revenue (reward-waiting costs).
Solution
• When Sally arrives at the bank, she must decide
whether to join the line or give up and leave.
• At any later time, she may also decide to leave if
it is unlikely that she will be served by the end
of her lunch break.
• We can work backward to solve the problem.
• ft(n): the maximum expected net reward that
Sally can receive from time t to the end of her
lunch break if at time t, n people are ahead of
her.
• Let t=0 be the current time and t=30 be the
end of the problem.
• Since t=29 is the beginning of the last minute
of the problem, we write
$f_{29}(n) = \max \begin{cases} 0 & \text{(Leave)} \\ r \, p(n \mid n) - c & \text{(Stay)} \end{cases}$
• To determine Sally’s optimal waiting policy, we
work backward until f0(20) is computed.
• Sally stays until the optimal action is “leave” or
she begins to be served.
• Problems in which the decision maker can
terminate the problem by choosing a
particular action are known as stopping rule
problems; they often have a special structure
that simplifies the determination of optimal
policies.
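The backward pass for Sally's problem can be sketched as below. Two things here are assumptions rather than part of the example: the recursion for t < 29 is taken to extend the t = 29 case as f_t(n) = max{0, r p(n|n) - c + sum over x < n of p(x|n) f_{t+1}(n - x)} with f_30(n) = 0, and the distribution p(x|n), which the example leaves unspecified, is replaced by a hypothetical one in which exactly one customer finishes each minute. The values r = 50 and c = 1 are likewise illustrative.

```python
def sally_values(r, c, p, horizon=30, max_n=20):
    """Return f[t][n], the maximum expected net reward from time t onward
    when n people are ahead of Sally (n = 1, ..., max_n).

    p(x, n) is the probability that x of the n people ahead finish
    during one minute. Leaving is always worth 0.
    """
    # f[horizon][n] = 0: nothing can be earned once the break is over.
    f = [[0.0] * (max_n + 1) for _ in range(horizon + 1)]
    for t in range(horizon - 1, -1, -1):
        for n in range(1, max_n + 1):
            stay = r * p(n, n) - c + sum(
                p(x, n) * f[t + 1][n - x] for x in range(n)
            )
            f[t][n] = max(0.0, stay)
    return f


def one_per_minute(x, n):
    """Hypothetical service law: exactly one person finishes each minute."""
    return 1.0 if x == 1 else 0.0


f = sally_values(r=50.0, c=1.0, p=one_per_minute)
print(f[0][20])   # value of joining the line on arrival
print(f[15][20])  # 20 people ahead with only 15 minutes left
```

With this deterministic service law Sally waits exactly 20 minutes, so f[0][20] is 50 - 20 = 30, while f[15][20] is 0 because she cannot reach service in time and leaving is optimal.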
Example: Determining Reject Allowances
• The HIT-AND-MISS MANUFACTURING COMPANY has
received an order to supply one item of a particular
type. However, the customer has specified such
stringent quality requirements that the manufacturer
may have to produce more than one item to obtain
an item that is acceptable. The number of extra
items produced in a production run is called the
reject allowance.
• The manufacturer estimates that each item of this
type that is produced will be acceptable with
probability 0.5 and defective (without possibility for
rework) with probability 0.5.
• Marginal production costs for this product are
estimated to be $100 per item (even if defective),
and excess items are worthless. In addition, a setup
cost of $300 must be incurred whenever the
production process is set up for this product.
• The manufacturer has time to make no more than
three production runs. If an acceptable item has not
been obtained by the end of the third production
run, the cost to the manufacturer in lost sales
income and penalty costs will be $1,600.
• The objective is to determine the policy
regarding the lot size (1 + reject allowance) for
the required production run(s) that minimizes
total expected cost for the manufacturer.
• Stage:
– Each production run (n = 1, 2, 3)
• Decision:
– How much to produce (lot size) in each stage (xn)
• State:
– Number of acceptable items still needed (1 or 0)
at beginning of stage n (sn)
• Notation:
– fn(sn, xn): total expected cost for stages n, …,
3 if system starts in state sn at stage n and
lot size is xn
– fn*(sn): the minimum expected cost for
stages n, …, 3 if system starts in state sn
• $f_n^*(s_n) = \min_{x_n} f_n(s_n, x_n)$
• Note that fn*(0) = 0.
• Assume that the numbers in the following
calculations are in hundreds of dollars.
• Contribution to cost (actually production cost)
from stage n is K(xn)+xn
– K(xn) = 3 if xn>0; 0, otherwise.
• For sn=1,
• Hence, the recursive relationship is as follows:
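A recursion consistent with the problem data can be written as follows, offered as a sketch under stated assumptions. With lot size x_n, every item is defective with probability (1/2)^{x_n}, in which case one acceptable item is still needed at the next stage. The terminal value f_4^*(1) = 16 is the $1,600 penalty expressed in hundreds of dollars.

```latex
\begin{aligned}
f_n(1, x_n) &= K(x_n) + x_n + \left(\tfrac{1}{2}\right)^{x_n} f_{n+1}^*(1), \qquad n = 1, 2, 3, \\
f_n^*(1) &= \min_{x_n = 0, 1, \ldots} f_n(1, x_n), \qquad f_4^*(1) = 16 .
\end{aligned}
```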
• For n=3,
• For n=2,
• For n=1,
• Optimal Solution:
– Produce two items on the first production run
– If none is acceptable, then produce either two or
three items on the second production run
– If none is acceptable, then produce either three or
four items on the third production run.
– Total expected cost is $675.
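The stated solution can be checked with a short backward recursion. The sketch below assumes that, for state 1, the cost of lot size x is K(x) + x plus (1/2)^x times the optimal cost of the next stage, with the $1,600 penalty as the value after the third run. All amounts are in hundreds of dollars; MAX_LOT is a hypothetical search cap, and the function names are illustrative.

```python
P_DEFECT = 0.5   # probability that a single item is defective
SETUP = 3.0      # setup cost K(x) when x > 0
UNIT = 1.0       # marginal production cost per item
PENALTY = 16.0   # cost if no acceptable item after the last run
RUNS = 3
MAX_LOT = 10     # hypothetical upper bound on the lot sizes searched


def stage_cost(x):
    """Production cost K(x) + x for a lot of size x."""
    return (SETUP if x > 0 else 0.0) + UNIT * x


def solve_reject_allowance():
    """Return (f_star, best): f_star[n] is the minimum expected cost from
    run n onward when one item is still needed, best[n] the optimal lot sizes."""
    f_star, best = {}, {}
    f_next = PENALTY
    for n in range(RUNS, 0, -1):
        costs = {
            x: stage_cost(x) + P_DEFECT ** x * f_next
            for x in range(MAX_LOT + 1)
        }
        f_star[n] = min(costs.values())
        best[n] = [x for x, c in costs.items() if abs(c - f_star[n]) < 1e-9]
        f_next = f_star[n]
    return f_star, best


f_star, best = solve_reject_allowance()
for n in (1, 2, 3):
    print(n, f_star[n], best[n])
```

Under these assumptions the recursion gives f_star[1] = 6.75, that is $675, with lot sizes 2 on the first run, 2 or 3 on the second, and 3 or 4 on the third, matching the policy listed above.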