
49th IEEE Conference on Decision and Control, December 15-17, 2010, Hilton Atlanta Hotel, Atlanta, GA, USA

Risk-constrained Markov Decision Processes


Vivek Borkar & Rahul Jain

Abstract— We propose a new constrained Markov decision process framework with risk-type constraints. The risk metric we use is Conditional Value-at-Risk (CVaR), which is gaining popularity in finance. It is a conditional expectation, but the conditioning is defined in terms of the level of the tail probability. We propose an iterative offline algorithm to find the risk-constrained optimal control policy. A stochastic-approximation-inspired learning variant is also sketched.

Index Terms— Constrained Markov decision processes; Risk measures; Stochastic approximations.

I. INTRODUCTION

The theory of single-stage stochastic programming is fairly well developed [6]. Equally well developed is the theory of (unconstrained) stochastic dynamic programming for sequential optimization [4]. Marrying the two theories to develop a theory of sequential stochastic optimization with general constraints has proved to be challenging [31]. Extending stochastic programming methods to multi-stage problems has remained challenging [32], while attempts to extend dynamic programming approaches to optimization with constraints have been successful only when the constraints have a special form [7], [2].

In the theory of constrained Markov decision processes developed so far [2], both the objective function and the constraints have the same form, usually an expectation of a sum, discounted or averaged. This allows for the introduction of an occupation measure. The sequential optimization problem can then be formulated as a convex program. When the objective and constraint functions have different forms, this technique does not work. Moreover, the occupation-measure technique cannot be generalized to probabilistic and general conditional expectation constraints (of the form we will shortly see). Algorithms for stochastic programs with probabilistic/chance or conditional expectation constraints have been developed [25]. However, as with other stochastic programs, it is very difficult to extend these to a multi-stage sequential optimization setting.

Our interest is in solving a sequential stochastic optimization problem with conditional expectation constraints. Our motivation comes from financial risk management. Thus, the constraint is on a risk measure (namely, Conditional Value-at-Risk) which has a conditional expectation form, but is a bit unusual in that the conditioning
Vivek Borkar is with the School of Technology and Computer Science, Tata Institute of Fundamental Research (TIFR) Mumbai. Email: [email protected]. Rahul Jain is with the EE & ISE Departments, University of Southern California, Los Angeles. Email: [email protected]. His research is supported by a James H. Zumberge Faculty Research and Innovation award, the NSF grant IIS-0917410 and the NSF CAREER award CNS0954116.

event is determined by a constraint on a probability (i.e., Value-at-Risk).

Value-at-Risk (VaR) is a popular risk metric in finance. For a real-valued random variable $Y$ on some probability space $(\Omega, \mathcal{F}, P)$, the VaR at level $\alpha$, $\nu = \nu_\alpha(Y)$, is defined to be $\arg\sup_\nu \{ P(Y > \nu) \geq \alpha \}$. Unfortunately, it has many shortcomings, including the fact that it is not subadditive, i.e., the VaR of a portfolio may be greater than the sum of the VaRs of the portfolio constituents. Thus, another measure called Conditional VaR (CVaR) was introduced by Artzner et al. [3], which is defined as $E[Y \mid Y > \nu]$ and depends on the VaR $\nu$ at level $\alpha$. It has been shown that CVaR is a coherent risk measure, i.e., it is convex, monotone, positively homogeneous and translation equivariant, and it is amenable to standard methods of stochastic programming for its optimization [26]. However, often the risk must be measured and optimized not just for the current portfolio, but over time in a sequential manner. Motivated by this, we introduce a constrained Markov decision model where the constraint is on a CVaR-type conditional expectation.

Consider a Markov decision process (MDP) defined on a state space $X$, a control space $U$, with a reward function $r(x,u)$ and a cost function $c(x,u)$ where $x \in X$ and $u \in U$, a transition kernel $P_u(dx'|x)$, and a finite horizon $T$. Let $Y(T+1)$ denote $\sum_{t=0}^{T} c(X_t, u_t)$. Then the CVaR at level $\alpha$ is given by $\rho_\alpha(Y(T+1)) = E[Y(T+1) \mid Y(T+1) > \nu_\alpha(Y(T+1))]$. We would like to maximize the expected reward over a finite time horizon subject to an upper bound on this conditional expectation, the CVaR:
$$\max_u \Big\{ E\Big[\sum_{t=0}^{T} r(X_t, u_t)\Big] : \rho_\alpha(Y(T+1)) \leq C_\alpha \Big\}.$$

This is useful when decision-making may involve multiple objectives, say a reward and a cost. The goal is to maximize the expected total reward over a certain time horizon while making sure that the conditional expectation of the total cost, given that the total cost exceeds some given level, remains bounded. An example of this is the re-insurance business, where re-insurance companies want to collect premiums (the rewards) by providing re-insurance coverage while ensuring that in the case of rare (probability less than $\alpha$) but catastrophic events (e.g., natural calamities such as devastating hurricanes or floods), the expected payouts (the costs) remain bounded.

Standard methods for solving constrained MDPs, such as the occupation-measure technique and the Lagrangian method [2], cannot deal with conditional expectation constraints of CVaR type. In this paper, we give an offline multiple time-scale iterative algorithm to solve this problem. We prove its convergence under certain assumptions. We then propose an online stochastic-approximation-based learning



algorithm.

Literature overview of CMDPs and stochastic programming for risk analysis: Constrained Markov decision process (CMDP) models were first introduced by Derman and Klein [14] in the 1960s. It had been noticed that such models were not generally amenable to solution by dynamic programming. Thus, linear programming based solutions using an occupation-measure approach (which had already been developed for dynamic programming formulations [13]) were proposed. This was extended by Kallenberg and others [18] to the discounted cost, total cost, and average cost criteria for MDPs with a unichain structure. Borkar [7], Altman and others [16], [2] further generalized this approach to average cost with a general multi-chain ergodic structure. A second method, based on a Lagrangian approach, was developed in [5] for MDPs with a single constraint, and extended to multiple constraints in [1]. A third method, based on a linear program mixing stationary deterministic policies, was developed in [1], [27], [15]. CMDP models with different discount factors for different constraints tend to be much more difficult, and usually optimal stationary policies do not exist, though it has been shown that the optimal policies are eventually stationary [16]. Sample-path formulations of MDPs were introduced in [28] and shown to satisfy Bellman's principle of optimality [17]. Alternative solution approaches based on stochastic approximations have also been proposed in [22]. However, almost all of this literature [2] is focused on expectation constraints.

In many applications, other kinds of constraints also appear, such as probabilistic constraints and, more generally, stochastic dominance constraints [12]. For example, it is said that Wall Street wants a lot of potential profit on the upside, and not much risk of losses on the downside, i.e., it wants to maximize profits subject to bounds on the risk of losses exceeding a certain amount. One popular measure of risk is the Value-at-Risk (VaR) metric, which is the smallest loss level that is exceeded with probability at most some given level $\alpha$ (typically 5%). However, the VaR metric has many undesirable properties, including lack of subadditivity (a portfolio of two assets may have a greater VaR than the sum of the individual VaRs), non-convexity and non-smoothness. Thus, a related convex measure, Conditional Value-at-Risk (CVaR), was introduced in [3]. This measure is convex, monotone, translation equivariant and positively homogeneous (and is the only such measure with these properties among popular risk measures, such as Markowitz's mean-variance (MV) risk measure [21], etc.) [29]. Optimization methods for computing single-stage CVaR have been proposed [26], [12], [29]. However, their extensions to multi-stage sequential problems suffer from computational difficulties similar to those of multi-stage stochastic programs [6]. Furthermore, all of this literature focuses on computing optimal strategies that minimize the VaR or CVaR. Our interest instead is in multi-objective sequential optimization problems, which can be formulated as a constrained Markov decision process where

one of the constraints is a CVaR-type risk constraint. We propose an offline iterative algorithm, as well as a stochastic-approximation-based online learning algorithm, to solve the problem.

The paper is organized as follows. In Section II, we introduce the problem formulation. Section III then presents preliminary results, while Section IV presents an offline iterative quasi-gradient, stochastic-approximation-inspired algorithm. In Section V, we present an online learning algorithm. These results are all for a finite horizon. Section VI discusses further work.

II. PROBLEM FORMULATION

Consider a compact metric state space $X$, a compact metric control space $U$, a continuous reward function $r(x,u)$, a continuous cost function $c(x,u)$ where $x \in X$ and $u \in U$, a controlled transition kernel $P(dx'|x,u)$ continuous in $(x,u)$, and a finite horizon $T$. Time is discrete and starts at 0. We will denote a policy by $u = u^T = (u_1, \ldots, u_T)$, where $u_t$ is the control applied at time $t$ according to this policy. We will denote by $P_u$ the probability measure on the finite-horizon process $X^T = (X_0, \ldots, X_T)$ under control policy $u$. Only noisy observations of the cost are available. Thus, given a zero-mean i.i.d. noise process $\{\xi_t\}$ with strictly positive density $\varphi$, we define the cumulative cost process $\{Y_t\}$ as

$$Y_0 = 0, \qquad Y_{t+1} = Y_t + c(X_t, u_t) + \xi_{t+1}. \tag{1}$$

We define the Value-at-Risk (VaR) function $\nu = \nu_\alpha(Y)$ for a random variable $Y$ as

$$\nu_\alpha(Y) = \arg\sup_\nu \{ P(Y > \nu) \geq \alpha \},$$

with $\alpha \in (0,1)$ typically close to 0, such as 0.1, 0.05, or 0.01. The Conditional Value-at-Risk (CVaR) function is defined as

$$\rho_\alpha(Y) = E[Y \mid Y > \nu_\alpha(Y)].$$

Then, our objective is to maximize the expected total reward over the finite horizon subject to the CVaR of the terminal cost being bounded by some constant $C_\alpha$:

$$\text{rMDP}: \qquad \max_u \ E\Big[\sum_{t=0}^{T} r(X_t, u_t)\Big] \tag{2}$$
$$\text{s.t.} \qquad \rho_\alpha(Y_{T+1}) \leq C_\alpha. \tag{3}$$
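To make the quantities in (1)-(3) concrete, here is a minimal illustrative sketch (not part of the original development) that simulates the noisy cumulative cost of a fixed, arbitrary policy on a small made-up finite MDP and estimates the VaR and CVaR of $Y_{T+1}$ by Monte Carlo. All model parameters (transition probabilities, costs, `noise_std`, and the bound `C_alpha`) are hypothetical; the snippet only checks whether the fixed policy satisfies constraint (3), it does not solve the rMDP.

```python
# Illustrative sketch: Monte Carlo estimate of VaR/CVaR of the noisy
# cumulative cost Y_{T+1} from (1), for a fixed policy on a toy finite MDP.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 3, 2, 10
alpha = 0.05                                      # tail level for VaR/CVaR
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, u] = next-state distribution
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))          # c(x, u), made up
policy = rng.integers(n_actions, size=n_states)                   # a fixed stationary policy (illustration only)
noise_std = 0.1                                   # std of the zero-mean noise xi_t in (1)

def simulate_Y(n_paths=20_000):
    """Simulate the noisy cumulative cost Y_{T+1} under the fixed policy."""
    Y = np.zeros(n_paths)
    x = np.zeros(n_paths, dtype=int)              # every path starts in state 0
    for t in range(T + 1):                        # t = 0, ..., T
        u = policy[x]
        Y += cost[x, u] + noise_std * rng.standard_normal(n_paths)
        # sample next states x' ~ P(.|x, u)
        x = np.array([rng.choice(n_states, p=P[xi, ui]) for xi, ui in zip(x, u)])
    return Y

Y = simulate_Y()
var_alpha = np.quantile(Y, 1.0 - alpha)           # empirical VaR: P(Y > nu) ~ alpha
cvar_alpha = Y[Y > var_alpha].mean()              # empirical CVaR: E[Y | Y > VaR]
print(f"VaR_{alpha}: {var_alpha:.3f}, CVaR_{alpha}: {cvar_alpha:.3f}")
print("constraint (3) satisfied for hypothetical C_alpha = 5.0?", cvar_alpha <= 5.0)
```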

As discussed earlier, such a formulation is useful when the decision-maker wants to maximize the expected total reward (the first objective) over a finite horizon, while making sure that the conditional expectation of the total cost (the second objective), given that it lies in the $\alpha$-probability tail, does not exceed a deterministic bound $C_\alpha$.

Proposition 1: If $\{\xi_t\}$ is an i.i.d. noise process with strictly positive density $\varphi$, then a solution $u^*$ of rMDP exists.

Proof: Existence of an optimal solution follows by standard compactness-continuity arguments, as follows. $E[\sum_{t=0}^{T} r(X_t, u_t)]$ is clearly a continuous functional of the law $\mathcal{L}$ of $(X_t, Y_t, u_t)$, $t \geq 0$. Since $Y_{T+1}$ has a strictly


positive density by our hypotheses on $\{\xi_t\}$, its distribution function $F(\cdot)$ is continuous and strictly increasing. Thus $\nu = F^{-1}(1-\alpha)$. Now

$$\begin{aligned} \rho_\alpha(Y_{T+1}) &= E[Y_{T+1} \mid Y_{T+1} > \nu_\alpha(Y_{T+1})] \\ &= \frac{E[Y_{T+1}\, I\{Y_{T+1} > F^{-1}(1-\alpha)\}]}{P(Y_{T+1} > F^{-1}(1-\alpha))} \\ &= \frac{1}{\alpha}\, E[Y_{T+1}\, I\{Y_{T+1} > F^{-1}(1-\alpha)\}]. \end{aligned}$$

In view of the foregoing, this is a continuous functional of $\mathcal{L}$. Since $x_0$ is fixed, $(x,u) \mapsto P(dx'|x,u)$ is continuous, and $X$ and $U$ are compact, it follows that the set of attainable laws $\mathcal{L}$ is compact. Thus the constraint set is also compact, and hence the objective attains its maximum on it.
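As a quick illustrative aside (not from the paper), the identity $\rho_\alpha(Y) = \frac{1}{\alpha} E[Y\, I\{Y > F^{-1}(1-\alpha)\}]$ used above is easy to check by simulation for any continuous distribution; the distribution below is an arbitrary stand-in.

```python
# Numerical check (illustrative only) of the identity used in the proof:
# for continuous Y,  E[Y | Y > F^{-1}(1-alpha)] = (1/alpha) E[Y 1{Y > F^{-1}(1-alpha)}].
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.05
Y = rng.normal(loc=10.0, scale=2.0, size=1_000_000)   # any continuous distribution works
nu = np.quantile(Y, 1.0 - alpha)                      # empirical F^{-1}(1 - alpha)

lhs = Y[Y > nu].mean()                                # conditional expectation E[Y | Y > nu]
rhs = (Y * (Y > nu)).mean() / alpha                   # (1/alpha) * E[Y 1{Y > nu}]
print(f"E[Y | Y > nu] = {lhs:.4f},  (1/alpha) E[Y 1{{Y > nu}}] = {rhs:.4f}")
```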

III. PRELIMINARIES

Consider a controlled Markov chain $Z_t = (X_t, Y_t, v_t)$, $t \geq 0$, with control process $u_t$, $t \geq 0$, where $v_t = u_{t-1}$. The combined evolution of this three-component controlled Markov chain is determined by the transition kernel $p(dx'|X_t, u_t)$, the evolution equation (1) for $Y_t$, and the equation $v_t = u_{t-1}$. We will assume $z_0 = (x_0, 0, v_0)$ to be deterministic. The combined transition kernel will be denoted by $p(dz'|z,u)$. We shall assume the cost function to be separable, i.e., $c(x,u) = c_1(x) + c_2(u)$. We will set $u_{-1} = u_T = \bar{u}$, where $\bar{u}$ is an additional element added to $U$ with $c_2(\bar{u}) = 0$. This does not alter the definition of $Y_{T+1}$, the terminal cost.

For a given $\nu$, define the state-value-at-risk function as

$$V_t^\nu(z) := P(Y_{T+1} > \nu \mid Z_t = z). \tag{4}$$

Then, we can write a backward recursion for $V_t^\nu(z)$ as

$$V_t^\nu(z) = \int p(dz'|z, v)\, V_{t+1}^\nu(z').$$

Denote $c(z_t) = c(x_t, v_t) = c_1(x_t) + c_2(v_t)$. Then, similarly, we can express the CVaR in terms of the state-value-at-risk function.

Lemma 1: If $V_0^\nu(z_0) > 0$, then

$$\rho_\alpha(Y_{T+1}) = \frac{1}{V_0^\nu(z_0)}\, E\Big[\sum_{t=0}^{T} c(Z_t)\, V_t^\nu(Z_t)\Big]. \tag{5}$$

Proof: By definition, we have

$$E[Y_{T+1} \mid Y_{T+1} > \nu] = \sum_{t=0}^{T} \sum_{z_t} P(Z_t = z_t \mid Y_{T+1} > \nu)\, c(x_t, v_t),$$

where we have used the separability of the cost function. Then,

$$\begin{aligned} \rho_\alpha(Y_{T+1}) &= E[Y_{T+1} \mid Y_{T+1} > \nu] \\ &= \sum_{t=0}^{T} \sum_{z_t} c(z_t)\, P(Z_t = z_t \mid Y_{T+1} > \nu) \\ &= \sum_{t=0}^{T} \sum_{z_t} c(z_t)\, P(Y_{T+1} > \nu \mid Z_t = z_t)\, \frac{P(Z_t = z_t)}{P(Y_{T+1} > \nu \mid Z_0 = z_0)} \\ &= \frac{1}{V_0^\nu(z_0)}\, E\Big[\sum_{t=0}^{T} c(Z_t)\, V_t^\nu(Z_t)\Big]. \end{aligned}$$

Note that $V_0^\nu(z_0) = P(Y_{T+1} > \nu \mid Z_0 = z_0) = \alpha$. Thus, the constraint in rMDP becomes

$$\frac{1}{\alpha}\, E\Big[\sum_{t=0}^{T} c(z_t)\, V_t^\nu(z_t)\Big] \leq C_\alpha.$$
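As an aside (not from the paper), the identity in Lemma 1 can be checked numerically on a toy example. The sketch below assumes a small finite chain with integer, state-only costs (so $c_2 \equiv 0$), a fixed stationary policy, and no observation noise, and takes the augmented state to be $Z_t = (X_t, Y_t)$; all numbers are made up. It computes the forward marginals of $Z_t$, the state-value-at-risk functions $V_t^\nu$ via the backward recursion (4), and compares the right-hand side of (5) with the conditional expectation computed directly from the terminal-cost distribution.

```python
# Sketch (simplifying assumptions: finite chain, integer state-only costs,
# fixed policy, no noise) checking Lemma 1: the CVaR-type conditional
# expectation of the terminal cost should equal
# (1/V_0^nu(z_0)) * E[ sum_t c(Z_t) V_t^nu(Z_t) ].
import numpy as np

rng = np.random.default_rng(1)
nS, T, nu = 3, 5, 7                          # states, horizon, threshold nu (hypothetical)
P = rng.dirichlet(np.ones(nS), size=nS)      # P[x, x'] under a fixed policy
c = np.array([1, 2, 3])                      # integer cost c(x) (separable, c2 = 0)
ymax = int(c.max()) * (T + 1)                # largest possible cumulative cost

# Forward marginals mu_t(x, y) = P(Z_t = (x, y)), with z_0 = (0, 0).
mu = np.zeros((T + 2, nS, ymax + 1))
mu[0, 0, 0] = 1.0
for t in range(T + 1):
    for x in range(nS):
        for y in range(ymax + 1 - c[x]):
            mu[t + 1, :, y + c[x]] += mu[t, x, y] * P[x]

# Backward recursion (4): V_t(x, y) = sum_x' P[x, x'] V_{t+1}(x', y + c(x)),
# with terminal condition V_{T+1}(x, y) = 1{y > nu}.
V = np.zeros((T + 2, nS, ymax + 1))
V[T + 1] = (np.arange(ymax + 1) > nu)[None, :]
for t in range(T, -1, -1):
    for x in range(nS):
        for y in range(ymax + 1 - c[x]):
            V[t, x, y] = P[x] @ V[t + 1, :, y + c[x]]

# Right-hand side of (5).
rhs = sum((mu[t] * c[:, None] * V[t]).sum() for t in range(T + 1)) / V[0, 0, 0]

# Direct E[Y_{T+1} | Y_{T+1} > nu] from the terminal-cost distribution.
pY = mu[T + 1].sum(axis=0)                   # P(Y_{T+1} = y)
ys = np.arange(ymax + 1)
tail = ys > nu
cvar = (ys[tail] * pY[tail]).sum() / pY[tail].sum()

print(f"Lemma 1 RHS: {rhs:.6f}   direct E[Y|Y>nu]: {cvar:.6f}")   # the two should agree
```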


IV. AN OFFLINE ITERATIVE ALGORITHM FOR THE rMDP PROBLEM

We now present a multiple time-scale iterative algorithm, iRMDP, to solve the rMDP problem; it is displayed as Algorithm 1 below. Let $\{\gamma_n\}$, $\{\eta_m\}$ be strictly positive stepsizes satisfying $\gamma_n, \eta_m \to 0$ and $\sum_n \gamma_n = \sum_m \eta_m = \infty$. We specify only the non-obvious initial conditions; other variables and functions are assumed to be initialized at the beginning of the algorithm.

Algorithm 1 iRMDP
For m = 1, 2, ..., till convergence:
  For n = 1, 2, ..., till convergence:
    For t = T, ..., 0:
      1) $J_t^{n,m}(x,y) = \max_u \Big[ r(x,u) + \int\!\!\int p(dx'|x,u)\, J_{t+1}^{n,m}(x', y + c(x,u) + s)\, \varphi(s)\, ds \Big]$,
         with $J_{T+1}^{n,m}(x,y) = \lambda_m \big( C_\alpha - y\, I(y > \nu_n^m)/\alpha \big)$
      2) $u_t^{n,m}(z) \in \arg\max_u \Big[ r(x,u) + \int\!\!\int p(dx'|x,u)\, J_{t+1}^{n,m}(x', y + c(x,u) + s)\, \varphi(s)\, ds \Big]$
      3) $V_t^{n,m}(z) = \int p(dz'|z, u_t^{n,m}(z))\, V_{t+1}^{n,m}(z')$,
         with $V_{T+1}^{n,m}(z) = I(y > \nu_n^m)$
      4) $Q_t^{m}(z) = c(z)\, V_t^{n,m}(z)/\alpha + \int p(dz'|z, u_t^{n,m}(z))\, Q_{t+1}^{m}(z')$,
         with $Q_{T+1}^{m}(z) = y\, I(y > \nu_n^m)/\alpha$
    End t
    $\nu_{n+1}^m = \nu_n^m - \gamma_n \big( \alpha - V_0^{n,m}(z_0) \big)$
  End n
  $\lambda_{m+1} = \big( \lambda_m - \eta_m ( C_\alpha - Q_0^m(z_0) ) \big)^+$
End m

Remarks. (1) The innermost loop over $t$ computes $V_0^{n,m}(x_0) = P(Y_{T+1} > \nu)$ and $J_0^{n,m}(x_0, 0) = \max_u \big\{ E[\sum_{t=0}^{T} r(X_t, u_t)] + \lambda (C_\alpha - E[Y_{T+1} \mid Y_{T+1} > \nu]) \big\}$ over all non-anticipative controls, for fixed $\nu = \nu_n^m$ and $\lambda = \lambda_m$. It also computes $Q_0^m(x_0, 0) = E[Y_{T+1} \mid Y_{T+1} > \nu]$ for fixed $\lambda = \lambda_m$, though it should be possible to do this less often by moving it outside the $t$ and $n$ loops. (2) The middle loop over $n$ adjusts $\nu$ until $V_0^{n,m}(z_0) = P(Y_{T+1} > \nu) = \alpha$. (3) The outer loop over $m$ adjusts $\lambda$ until $E[Y_{T+1} \mid Y_{T+1} > \nu] \leq C_\alpha$. (4) The $\nu$ and $\lambda$ iterations can also be done concurrently if we use $\eta_n = o(\gamma_n)$. The convergence analysis below then still holds using a two time-scale argument [10].

Convergence Analysis: All iterations in the innermost loop involve finitely many steps, so they do not need any convergence analysis. In practice, the iterations for $\{\nu_n^m, n \geq 0\}$ and $\{\lambda_m, m \geq 0\}$ will also be stopped after finitely many steps according to some stopping rule, but a convergence analysis is required nevertheless to justify such a procedure. Unfortunately, it does not seem possible to establish convergence of $\{\nu_n\}$ in general. (Note that we have suppressed the superscript $m$ for notational ease, taking advantage of the fact that it represents a slower time scale and its effect therefore is quasi-static; see [10], Section 6.1.) We shall consider a special case, viz., when the function $h(\nu) := P(Y_{T+1} > \nu)$ is Lipschitz continuous. The difficulty is that, in addition to the explicit dependence of this probability on $\nu$, there is also a hidden dependence via the underlying control policy, which is harder to decipher. Nevertheless, we assume this condition.

Lemma 2: Suppose the function $h(\nu) = P(Y_{T+1} > \nu)$ is Lipschitz continuous. Then, for fixed $m$, as $n \to \infty$, $V_0^{n,m}(z_0) \to \alpha$ for all $z_0$.

Proof: Observe that with Lipschitz continuity, the fact that $\gamma_n \to 0$ implies that the iterates will have the same asymptotic behavior as that of the o.d.e.

$$\dot{\nu}(t) = h(\nu(t)) - \alpha. \tag{6}$$

For $\nu \ll 0$, $h(\nu) > \alpha$, and for $\nu \gg 0$, $h(\nu) < \alpha$. Thus the trajectories of (6) remain bounded, and for a scalar o.d.e. this implies convergence to an equilibrium, i.e., a point where $h(\nu) = \alpha$. In view of the aforementioned properties of $h$ and its continuity, at least one such point exists. In fact, all such points that correspond to downward crossings of the level $\alpha$ by $h$ will be stable equilibria and the rest unstable, and the smallest and the largest equilibria will necessarily be stable.

In general, however, continuity of $h$ cannot be guaranteed. Hence, the iterates track not the o.d.e. (6), but the differential inclusion $\dot{\nu}(t) \in H(\nu(t)) - \alpha$, where

$$H(\nu) := \bigcap_{\epsilon > 0} \mathrm{cl}\big( \{ P(Y_{T+1} > \nu') : |\nu' - \nu| < \epsilon \} \big).$$

(See, e.g., [10], Chapter 4.) This, in fact, is one of the solution concepts for o.d.e.s with a discontinuous right-hand side, due to Krasovskii [19]. In some cases, this can yield useful information.

Convergence analysis for $\{\lambda_m\}$ is easier.

Lemma 3: Let $\{\eta_m\}$ be strictly positive step-sizes such that $\eta_m \to 0$ and $\sum_m \eta_m = \infty$. Then, as $m \to \infty$, we obtain $\lambda_m \to \lambda^*$ and $Q_0^m \to Q_0^*$ such that $\lambda^* (C_\alpha - Q_0^*(z_0)) = 0$ and $C_\alpha - Q_0^*(z_0) \geq 0$ for all $z_0$.

Proof: Let

$$G(\lambda) := \max \Big\{ E\Big[\sum_{t=0}^{T} r(X_t, u_t)\Big] + \lambda \big( C_\alpha - \rho_\alpha[Y_{T+1}] \big) \Big\},$$

where the maximum is over all admissible laws $\mathcal{L}$. Suppose the maximum is attained by $\mathcal{L}^*$, the law of $(X_t, Y_t, u_t^*)$, $t \geq 0$. $G$ is clearly convex and, by the Milgrom-Segal envelope theorem [23], $C_\alpha - \rho_\alpha[Y_{T+1}]$ is a valid subgradient thereof. Thus, the $\lambda$-iteration is simply an instance of classical subgradient descent, which is known to converge to a global minimum of $G$, in this case the desired Lagrange multiplier.
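To illustrate the two slow time-scale updates of Algorithm 1 in isolation (a sketch, not the full method), the snippet below replaces the inner dynamic-programming loop with a fixed, made-up simulator of $Y_{T+1}$, so the policy never adapts. The $\nu$-iteration then tracks the level-$\alpha$ tail quantile, i.e., a root of $h(\nu) - \alpha$, and the $\lambda$-iteration performs the projected update $\lambda \leftarrow (\lambda - \eta(C_\alpha - Q_0))^+$; the step sizes, the simulator distribution, and the bound $C_\alpha$ are all hypothetical.

```python
# Minimal sketch of the nu- and lambda-updates of Algorithm 1 (iRMDP), with
# the inner DP loop replaced by a fixed simulator of Y_{T+1}.
import numpy as np

rng = np.random.default_rng(2)
alpha, C_alpha = 0.05, 3.0                       # tail level and constraint bound (hypothetical)
sample_Y = lambda n: rng.gamma(2.0, 1.0, n)      # stand-in for simulating Y_{T+1} under the current policy

nu, lam = 0.0, 0.0
for m in range(1, 201):                          # outer (lambda) loop
    eta = 1.0 / m                                # eta_m -> 0, sum eta_m = infinity
    for n in range(1, 101):                      # middle (nu) loop
        gamma = 1.0 / n                          # gamma_n -> 0, sum gamma_n = infinity
        Y = sample_Y(1000)
        V0 = np.mean(Y > nu)                     # estimate of V_0^nu(z_0) = P(Y > nu)
        nu = nu - gamma * (alpha - V0)           # nu-update: drive P(Y > nu) toward alpha
    Y = sample_Y(20_000)
    Q0 = Y[Y > nu].mean()                        # estimate of E[Y | Y > nu] at the current nu
    lam = max(lam - eta * (C_alpha - Q0), 0.0)   # projected lambda-update

print(f"nu ~ {nu:.3f}  (target (1-alpha)-quantile ~ {np.quantile(sample_Y(10**6), 1 - alpha):.3f})")
print(f"lambda ~ {lam:.3f}  (grows while the CVaR exceeds C_alpha = {C_alpha})")
```

Because the simulator is fixed here, $\lambda$ simply keeps growing while the constraint is violated; in the actual algorithm the growing multiplier feeds back into the inner Bellman recursion and alters the policy.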


V. AN ONLINE LEARNING ALGORITHM FOR FINITE rMDPs

We now consider the state space $X$ and control space $U$ to be finite, which makes $Y$ finite as well. We present an online learning algorithm for this setting that finds the optimal control given a sequence of samples. A sample $k$ here is $(X^{k,T}, Y^{k,T}, u^{k,T})$, where $X^{k,T} = (X_0^k, \ldots, X_T^k)$ is the entire state trajectory over the horizon $T$, $Y^{k,T}$ is the entire cost trajectory, and $u^{k,T}$ is the control sequence that generates it. Besides being an online algorithm, the algorithm below differs from the iterative algorithm presented earlier in that it operates on multiple time-scales: along with $\gamma_k = o(a_k)$ and $\eta_k = o(\gamma_k)$, we need $\sum_k a_k^2$, $\sum_k \gamma_k^2$, $\sum_k \eta_k^2$ all finite, while $\sum_k a_k = \sum_k \gamma_k = \sum_k \eta_k = \infty$.

Algorithm 2 oRMDP
For k = 1, 2, ...
  For t = T, T-1, ..., 0:
    1) $J_t^k(x,y,u) = J_t^{k-1}(x,y,u) + a_k\, I\{X_t^k = x, Y_t^k = y, u_t^k = u\} \big[ r(x,u) + \max_{u'} J_{t+1}^k(X_{t+1}^k, Y_{t+1}^k, u') - J_t^{k-1}(x,y,u) \big]$,
       with $J_{T+1}^k(x,y) = I(Y_{T+1}^k = y)\, \lambda_k \big( C_\alpha - y\, I(y > \nu_k)/\alpha \big)$
    2) $u_t^k := v_{t+1}^k = \arg\max_{u} J_t^k(X_t^k, Y_t^k, u)$
    3) $V_t^k(z) = V_t^{k-1}(z) + a_k\, I\{Z_t^k = z\} \big[ V_{t+1}^k(Z_{t+1}^k) - V_t^{k-1}(z) \big]$,
       with $V_{T+1}^k(z) = I(Y_{T+1}^k = y,\, y > \nu_k)$
    4) $Q_t^k(z) = Q_t^{k-1}(z) + a_k\, I\{Z_t^k = z\} \big[ c(z)\, V_t^k(z)/\alpha + Q_{t+1}^k(Z_{t+1}^k) - Q_t^{k-1}(z) \big]$,
       with $Q_{T+1}^k(z) = y\, V_{T+1}^k(z)/\alpha$
  End t
  $\nu_k = \nu_{k-1} - \gamma_k \big( \alpha - V_0^k(z_0) \big)$
  $\lambda_k = \big( \lambda_{k-1} - \eta_k ( C_\alpha - Q_0^k(z_0) ) \big)^+$
End k

Convergence: The first equation is the Q-learning version of the deterministic DP recursion. In order that all $(z,u)$ pairs are sampled often enough, in practice one will use the above choice of $u_t^k$ with high probability $1-\epsilon$, and a uniformly random $u$ with probability $\epsilon$. The convergence analysis for the $\nu$ and $\lambda$ iterations for the offline scheme continues to apply to the online scheme as well. As for the Q-learning scheme, let $J^k$ denote $J_t^k(x,y,u)$ written as a vector as all of $t, x, y, u$ vary over their respective domains. Again, treat $\nu_k, \lambda_k$ as quasi-static by invoking the two time-scale argument. Now the iteration for $J^k$ is a special case of the Q-learning iterations analyzed in [30], and its almost sure convergence to the correct Q-value $J^*$ follows. Note, however, that our exploratory randomization of the action choice will yield a near-optimal rather than optimal control.

VI. DISCUSSION AND FURTHER WORK

In this paper, we have introduced a new class of constrained Markov decision processes. The constraints are on the conditional expectation of the terminal value of a total cost functional. The motivation comes from finance, in particular insurance, wherein an insurance company wants to maximize its revenue from premiums subject to a constraint on the conditional expectation of the claims it might have to pay. Our interest is in risk from catastrophic events such as floods, hurricanes, and market crashes, which have small probability but result in large claims when they happen. Thus, the conditioning in the expectation is on the tail probability (i.e., on events that have the largest claims but together have less than, say, 5% total probability mass).

The problem as formulated is of tremendous interest in finance and risk management. It is, however, not amenable to solution techniques available either in stochastic programming or in the theory of constrained Markov decision processes. We thus give an iterative/stochastic-approximation-based algorithm to solve it. We are currently in the process of acquiring relevant insurance data, and will test the algorithm on such data in future work. In future work, we will also extend the methodology to the infinite-horizon case, both discounted and average-cost. Our proof of convergence of the offline algorithm also needs an additional assumption. In future work, we will also try to derive a proof that does not need such an assumption. The assumption of separability of the cost function is also crucial; in the future, we shall also consider general cost structures. We hope our most immediate contribution is to re-stimulate research in the community on the further development of the theory of constrained MDPs, whereby more general constraints than the current framework allows can be handled.

ACKNOWLEDGEMENTS

The problem was introduced to the second author by Pu Huang and Dharmashankar Subramanian of the Risk Group in the Mathematical Sciences division of the IBM Watson Research Center. The second author is very grateful to them, and to Alan J. King and Jayant Kalagnanam, also of IBM Watson Research, and to Erim Kardes of USC for many helpful discussions.

REFERENCES

[1] E. Altman and F. Spieksma, "The linear program approach in Markov decision problems revisited," ZOR - Methods and Models in Operations Research, 42(2):169-188, 1995.
[2] E. Altman, Constrained Markov Decision Processes, Chapman & Hall, 1999.
[3] P. Artzner, F. Delbaen, J.-M. Eber and D. Heath, "Thinking coherently," Risk, 10:68-71, 1997.
[4] D. Bertsekas, Dynamic Programming and Optimal Control, third edition, Athena Scientific, 2005.
[5] F. J. Beutler and K. W. Ross, "Optimal policies for controlled Markov chains with a constraint," J. Math. Anal. Appl., 112:236-252, 1985.
[6] J. Birge and F. Louveaux, Introduction to Stochastic Programming, Springer-Verlag, 1997.
[7] V. Borkar, "A convex analytic approach to Markov decision processes," Probab. Theory Relat. Fields, 78:583-602, 1988.
[8] V. S. Borkar, "Stochastic approximation with two time scales," Systems & Control Letters, 29:291-294, 1997.
[9] V. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, 54:207-213, 2005.
[10] V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press, 2009.
[11] D. Dentcheva and A. Ruszczynski, "Portfolio optimization with stochastic dominance constraints," Journal of Banking and Finance, 30:433-451, 2006.
[12] D. Dentcheva and A. Ruszczynski, "Stochastic dynamic optimization with discounted stochastic dominance constraints," SIAM Journal on Control and Optimization, 7(5):2540-2556, 2008.
[13] C. Derman, Finite State Markovian Decision Processes, Academic Press, New York and London, 1970.
[14] C. Derman and M. Klein, "Some remarks on finite horizon Markovian decision models," Operations Research, 13:272-278, 1965.
[15] E. A. Feinberg, "Constrained semi-Markov decision processes with average rewards," ZOR - Methods and Models in Operations Research, 39:257-288, 1995.
[16] E. Feinberg and A. Shwartz, "Constrained discounted dynamic programming," Mathematics of Operations Research, 21:922-945, 1996.
[17] M. Haviv, "On constrained Markov decision processes," Operations Research Letters, 19(1):25-28, 1995.
[18] A. Hordijk and L. C. M. Kallenberg, "Linear programming and Markov decision chains," Management Science, 25:353-362, 1979.
[19] N. N. Krasovskii, Stability of Motion, Stanford University Press, 1963.
[20] J. Palmquist, S. Uryasev and P. Krokhmal, "Portfolio optimization with conditional value-at-risk objective and constraints," Journal of Risk, 4(2):11-27, 2002.
[21] H. M. Markowitz, "Portfolio selection," Journal of Finance, 7(1):77-91, 1952.
[22] D.-J. Ma, A. M. Makowski and A. Shwartz, "Stochastic approximations for finite state Markov chains," Stochastic Processes and Their Applications, 35:27-45, 1988.


[23] P. Milgrom and I. Segal, "Envelope theorems for arbitrary choice sets," Econometrica, 70:583-601, 2002.
[24] G. Pflug, "Some remarks on the value-at-risk and the conditional value-at-risk," in: S. Uryasev (Ed.), Probabilistic Constrained Optimization: Methodology and Applications, Kluwer Academic Publishers, 2000.
[25] A. Prekopa, Stochastic Programming, Kluwer, 1995.
[26] R. T. Rockafellar and S. Uryasev, "Optimization of conditional value-at-risk," Journal of Risk, 2:21-41, 2000.
[27] K. W. Ross, "Randomized and past-dependent policies for Markov decision processes with multiple constraints," Operations Research, 37:474-477, 1989.
[28] K. Ross and R. Varadarajan, "Multichain Markov decision processes with a sample path constraint: A decomposition approach," Mathematics of Operations Research, 16:195-207, 1991.
[29] A. Ruszczynski and A. Shapiro, "Optimization of risk measures," preprint, 2007.
[30] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Machine Learning, 16(3):185-202, 1994.
[31] P. Varaiya and R. J.-B. Wets, "Stochastic dynamic optimization approaches and computation," in Mathematical Programming: Recent Developments and Applications, M. Iri and K. Tanabe (eds.), pp. 309-332, Kluwer Academic Publishers, 1989.
[32] R. J.-B. Wets, "Challenges in stochastic programming," working paper, IIASA, Austria, 1994.

