Case-Based Reinforcement Learning For Dynamic Inventory Control in A Multi-Agent Supply-Chain System
C. Jiang, Z. Sheng
Expert Systems with Applications 36 (2009) 6520–6526
Keywords: Inventory control; Reinforcement learning; Supply-chain management; Multi-agent simulation

Abstract

Reinforcement learning (RL) has appealed to many researchers in recent years because of its generality. It is an approach to machine intelligence that learns to achieve a given goal through trial-and-error interaction with its environment. This paper proposes a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multi-agent supply-chain system. Traditional time-triggered and event-triggered ordering policies remain popular because they are easy to implement, but in a dynamic environment their results may become inaccurate, causing excess inventory (cost) or shortages. Under nonstationary customer demand, the S value of the (T, S) and (Q, S) inventory review methods is learned with the proposed algorithm so as to satisfy a target service level. Multi-agent simulations of a simplified two-echelon supply chain, in which the proposed algorithm is implemented, are run repeatedly. The results show the effectiveness of CRL for both review methods. We also consider a framework for a general learning method based on the proposed one, which may be helpful in all aspects of supply-chain management (SCM). Hence, it is suggested that well-designed "connections" should be built between CRL, multi-agent systems (MAS) and SCM.
© 2008 Elsevier Ltd. All rights reserved.
(i) Each retailer has a fixed group of customers whose demand is nonstationary.
(ii) Each customer is free to choose one retailer in each period, i.e., in a competitive market.

Fig. 2. (T, S) inventory replenishment mechanism.
... values of the reorder point (S) are learned based on past experience. The resulting service levels of previously suggested S values are treated as rewards for the actions taken before. The (State, Action, Reward) records provide a reference when a similar state is met again.

Fig. 3. (Q, S) inventory replenishment mechanism.

The details of the elements are as follows:

Case[CaseSize] = {Case[0], Case[1], ..., Case[CaseSize - 1]},

where Case[i] = {inventory level (int), Dmean (int)}.

Case[CaseSize] (an array of vectors) records the different states met. Each state includes two elements: the current S value and the estimated current mean of customer demand.
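To make the data organization concrete, the following is a minimal Java sketch of the case base implied by the description above and by the overview in Fig. 4. The class and field names (CaseRecord, SlbaEntry, ActionOutcome, etc.) and the container types are our assumptions for illustration, not the authors' implementation.

    import java.util.ArrayList;
    import java.util.List;

    // One recorded action and the service level observed after applying it (SLaa).
    class ActionOutcome {
        int deltaS;        // amount by which the S value was increased or decreased
        double slAfter;    // resulting service level
    }

    // One "service level before action" entry (SLba) with the actions tried from it.
    class SlbaEntry {
        double slBefore;
        List<ActionOutcome> outcomes = new ArrayList<>();
    }

    // One case: the state met (current S value and estimated demand mean) plus its history.
    class CaseRecord {
        int inventoryLevel;   // current S value
        int dMean;            // estimated current mean of customer demand
        List<SlbaEntry> history = new ArrayList<>();
    }

    // The case base: Case[0], Case[1], ..., Case[CaseSize - 1].
    class CaseBase {
        final List<CaseRecord> cases = new ArrayList<>();
    }

Each state thus carries its own history of "service level before action / action / service level after action" records, which is what the three-level searching procedure below traverses.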
1. Customer Class: ID (int), customer identity, from 0 to Ncustomer - 1. Demand (int), nonstationary. k (int), price sensitivity parameter. L (int), quality sensitivity parameter. Index[] (array of double), rank of motivation values. Rank[] (array of all retailer objects), rank of retailers. Influence_out[] (array of double), influence out to friend agent(s) with respect to retailers. Influence_in[] (array of double), received influence from friend agent(s). Follow (double), follower tendency. FriendID[] (array of int), IDs of friend agent(s). a, b (static double), motivation function parameters. Mean (static int), CV (static double), Deviation (static int), T (static int), extent (static double), parameters of nonstationary demand. Range (static int), number of friend agent(s). SuperID (Vector), records the ID of the seller each time. BuyHistory (Vector), records IDs of retailers in each defined period.

Fig. 4. Overview of data structure.
2. Retailer Class: ID (int), retailer identity. price (double), price of products provided. quality (double), quality of products provided. cost (double), profit (double), bank (double), reserved. Stock_in (Vector), Stock_out (Vector), stock flow records. Money_in (Vector), Money_out (Vector), money flow records, ...

Fig. 8. (a) Initial S > D, (b) Initial S < D, (c) Target service level = 95%.
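The two agent classes listed above can be rendered roughly as the following Java skeletons. Field names and types follow the listing; all method bodies, and the part of the Retailer class that is cut off above, are omitted, so this is only an illustrative sketch rather than the authors' code.

    import java.util.Vector;

    // Customer agent: fields as listed above; behaviour omitted.
    class Customer {
        int id;                        // customer identity, 0 .. Ncustomer - 1
        int demand;                    // nonstationary demand
        int k;                         // price sensitivity parameter
        int L;                         // quality sensitivity parameter
        double[] index;                // rank of motivation values
        Retailer[] rank;               // rank of retailers
        double[] influenceOut;         // influence sent to friend agent(s), per retailer
        double[] influenceIn;          // influence received from friend agent(s)
        double follow;                 // follower tendency
        int[] friendId;                // IDs of friend agent(s)
        static double a, b;            // motivation function parameters
        static int mean, deviation, T; // parameters of nonstationary demand
        static double cv, extent;      // coefficient of variation and demand-change extent
        static int range;              // number of friend agent(s)
        Vector<Integer> superId = new Vector<>();     // ID of the seller in each period
        Vector<Integer> buyHistory = new Vector<>();  // IDs of retailers in each defined period
    }

    // Retailer agent: only the fields listed above; the remainder is cut off in the text.
    class Retailer {
        int id;                        // retailer identity
        double price, quality;         // price and quality of products provided
        double cost, profit, bank;     // reserved
        Vector<Integer> stockIn = new Vector<>(), stockOut = new Vector<>();  // stock flow records
        Vector<Double> moneyIn = new Vector<>(), moneyOut = new Vector<>();   // money flow records
    }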
... state. Memorize the location in the case records using CaseMarker.

Step 1.2: Update CaseMarker.

Step 2: Level 2 searching. Check Case[CaseMarker]. If it is a new case, add the current service level to SLba_n[] and update ActionMarker and UpdateMarker to indicate that a new record has been added; go to step 2.1. Else, search SLba_n[] for a similar one with the criteria in step 1, setting the grid to 0.2%. If no similar service level is found, go to step 2.1. Otherwise, update ActionMarker and go to step 3.

Step 2.1: Add the current service level to the end of the SLba record. Compare the current service level to the target service level. If it is within the range [TargetSL - 0.2%, TargetSL + 0.2%], no change is made and 0 is added to the action records. Else, calculate InvenIncrease or InvenDecrease based on the estimated mean of customer demand and the difference between the current service level and the target service level. Add this amount to the action records and update the S value.

Step 3: Level 3 searching. Search SLaa_n[ActionMarker] for the service level closest to the target service level. Denote this record as SLC. If SLC is within the range [TargetSL - 0.2%, TargetSL + 0.2%], go to step 3.1. Else, go to step 3.2.

Step 3.1: Get the corresponding action in Action_n[ActionMarker] and update the S value and UpdateMarker. In step 1 of the next period, replace SLC with the resulting service level. This guarantees that if the action results in a service level out of the target range, it is no longer a qualified action.

Step 3.2: Calculate InvenIncrease or InvenDecrease based on the estimated mean of demand and the difference between SLC and the target service level. Add this amount to the action records and update UpdateMarker to indicate that a new SLaa_nn[i] will be added. Thus, the old SLC will not be replaced.

Step 4: Repeat the previous steps when the inventory replenishment condition is met again.
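As an illustration of how steps 2 and 3 could look in code, here is a minimal, self-contained Java sketch. The parallel-list layout mirrors Fig. 4, the method and variable names are ours, and the adjust() rule is only a guessed stand-in for the InvenIncrease/InvenDecrease computation, whose exact formula is not given in the text.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of levels 2 and 3 for one matched case (level 1, matching the state itself,
    // is assumed to have been done already). The parallel lists mirror Fig. 4:
    //   slBefore.get(i)        ~ SLba_n[i]
    //   actions.get(i).get(j)  ~ the j-th action tried from that service level
    //   slAfter.get(i).get(j)  ~ the service level observed after that action (SLaa)
    class CrlSearch {
        static final double GRID = 0.002;  // 0.2% band used in steps 2 and 3

        static int suggestDeltaS(List<Double> slBefore,
                                 List<List<Integer>> actions,
                                 List<List<Double>> slAfter,
                                 double currentSL, double targetSL, int demandMean) {
            // Level 2: find a previously recorded, similar "service level before action".
            int marker = -1;
            for (int i = 0; i < slBefore.size() && marker < 0; i++) {
                if (Math.abs(slBefore.get(i) - currentSL) <= GRID) marker = i;
            }
            if (marker < 0) {                          // step 2.1: nothing similar recorded yet
                slBefore.add(currentSL);
                actions.add(new ArrayList<>());
                slAfter.add(new ArrayList<>());
                if (Math.abs(currentSL - targetSL) <= GRID) return 0;  // already on target
                return adjust(currentSL, targetSL, demandMean);
            }
            // Level 3: among recorded outcomes, find the one closest to the target (SLC).
            List<Integer> acts = actions.get(marker);
            List<Double> outs = slAfter.get(marker);
            int best = -1;
            for (int j = 0; j < outs.size(); j++) {
                if (best < 0 || Math.abs(outs.get(j) - targetSL)
                              < Math.abs(outs.get(best) - targetSL)) best = j;
            }
            if (best >= 0 && Math.abs(outs.get(best) - targetSL) <= GRID) {
                return acts.get(best);                 // step 3.1: reuse the qualified action
            }
            double base = (best >= 0) ? outs.get(best) : currentSL;
            return adjust(base, targetSL, demandMean); // step 3.2: compute a new adjustment
        }

        // The paper derives InvenIncrease/InvenDecrease from the demand mean and the
        // service-level gap without giving the formula; this proportional rule is a guess.
        static int adjust(double sl, double targetSL, int demandMean) {
            return (int) Math.round((targetSL - sl) * demandMean);
        }
    }

In use, the returned value would be added to the current S value at each replenishment point; the level-1 matching of states (by inventory level and demand mean, with grid1 and grid2) and the bookkeeping of CaseMarker and UpdateMarker are left out for brevity.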
Table 2
Parameters added for condition 2

Parameter         Scale        Distribution
L                 0-100        Random distribution
K                 100-0        Random distribution
Follow_i          0-100        Normal distribution, μ = 50 and σ = 15
Influence_out_i   -30 to 30    {-30, -30 + g, -30 + 2g, ..., 30}, where g = 60/Nretailer
Range             2            Constant
P_i               80-90        Random distribution
Q_i               80-90        Random distribution
a                 a > 1        Depends on testing
b                 0 < b < 1    Depends on testing

Fig. 9. (a) M1, (b) M2, (c) M3, (d) M4 under condition 1 for (T, S).

4. Simulation results and analysis

The supply-chain model considered in the simulation consists of 10 retailers and 80 customers. Under condition one in Section 2,
each retailer has a fixed group of eight customers. Their demand follows a normal distribution N(μ, σ²), but its mean is changed by two parameters, the interval T and the extent, which follow uniform distributions. That is, at every interval T, μ = μ + extent. Two types of demand are considered, defined as:

TE 1: T = Uniform(50, 80), extent = Uniform(-1, 1).
TE 2: T = Uniform(15, 30), extent = Uniform(-2, 2).

The standard deviation of customer demand (σ) is computed by multiplying μ by the coefficient of variation (CV), and the initial μ is set to 20. The target service level of every retailer is set to 90%. The lead time is set to 1 day for (T, S) and 4 days for (Q, S). Each simulation is run for 1000 review periods. For the same experimental condition, 20 simulations are repeated with different seeds, and their average service level is taken as the actual service level. The modes used in the simulation under condition 1 are shown in Table 1.
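For completeness, the demand process just described can be sketched as a small Java generator. The constructor arguments, the example CV value in the comment, and the resetting of the change interval after every jump are illustrative assumptions, not a specification from the paper.

    import java.util.Random;

    // Nonstationary customer demand: normally distributed with mean mu and standard
    // deviation mu * CV, where mu jumps by a random "extent" every T periods and the
    // interval T is redrawn from a uniform range (TE 1 or TE 2 in the text).
    class NonstationaryDemand {
        final Random rng = new Random();
        double mu = 20.0;                 // initial mean of demand
        final double cv;                  // coefficient of variation
        final int tLow, tHigh;            // range of the change interval T
        final double extLow, extHigh;     // range of the change amount "extent"
        int periodsUntilChange;

        NonstationaryDemand(double cv, int tLow, int tHigh, double extLow, double extHigh) {
            this.cv = cv; this.tLow = tLow; this.tHigh = tHigh;
            this.extLow = extLow; this.extHigh = extHigh;
            this.periodsUntilChange = uniformInt(tLow, tHigh);
        }

        int nextDemand() {
            if (--periodsUntilChange <= 0) {          // every interval T: mu = mu + extent
                mu += extLow + rng.nextDouble() * (extHigh - extLow);
                periodsUntilChange = uniformInt(tLow, tHigh);
            }
            double sigma = mu * cv;                   // standard deviation = mu * CV
            return (int) Math.max(0, Math.round(mu + sigma * rng.nextGaussian()));
        }

        int uniformInt(int low, int high) { return low + rng.nextInt(high - low + 1); }
    }

    // Example (assumed CV = 0.2), TE 1: T ~ Uniform(50, 80), extent ~ Uniform(-1, 1):
    // NonstationaryDemand d = new NonstationaryDemand(0.2, 50, 80, -1, 1);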
The values of grid1 and grid2 in Fig. 6 are initially set to 20 and
4 and are then changed as the mean value and standard deviation
change.
To show independence from the initial values, the first 300 review periods are shown for three settings: initial supply > demand, initial supply < demand, and a target service level temporarily set to 95% (see Fig. 8).
Fig. 9 shows the simulation results over time for the four modes of the (T, S) system under condition 1. It can be seen that as the nonstationarity of customer demand becomes more severe, the deviation of the average service level increases. However, the average service level stays very close to the target service level.
The simulation parameters in Table 2 are added for the simulation under condition 2. Fig. 10 shows the results under condition 2 for (T, S). The customer demand in this situation is much more nonstationary, because one retailer may lose most of its customers after increasing ...

Fig. 10. Results under condition 2 for (T, S).

Fig. 11. (a) M1, (b) M2, (c) M3, (d) M4 under condition 1 for (Q, S).
Acknowledgements