
Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization

Cheng Tang∗‡  Zhishuai Liu†‡  Pan Xu§

arXiv:2411.18612v1 [cs.LG] 27 Nov 2024

Abstract
The Distributionally Robust Markov Decision Process (DRMDP) is a popular framework for
addressing dynamics shift in reinforcement learning by learning policies robust to the worst-case
transition dynamics within a constrained set. However, solving its dual optimization oracle poses
significant challenges, limiting theoretical analysis and computational efficiency. The recently
proposed Robust Regularized Markov Decision Process (RRMDP) replaces the uncertainty set
constraint with a regularization term on the value function, offering improved scalability and
theoretical insights. Yet, existing RRMDP methods rely on unstructured regularization, which often leads to overly conservative policies because unrealistic transitions are taken into account. To address
these issues, we propose a novel framework, the d-rectangular linear robust regularized Markov
decision process (d-RRMDP), which introduces a linear latent structure into both transition
kernels and regularization. For the offline RL setting, where an agent learns robust policies from
a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust
Regularized Pessimistic Value Iteration (R2PVI), employing linear function approximation and
f -divergence based regularization terms on transition kernels. We provide instance-dependent
upper bounds on the suboptimality gap of R2PVI policies, showing these bounds depend on
how well the dataset covers state-action spaces visited by the optimal robust policy under
robustly admissible transitions. This term is further shown to be fundamental to d-RRMDPs via
information-theoretic lower bounds. Finally, numerical experiments validate that R2PVI learns
robust policies and is computationally more efficient than methods for constrained DRMDPs.

1 Introduction
Offline reinforcement learning (Offline RL) (Levine et al., 2020) enables policy learning without
direct interaction with the environment, necessitating robust policies that remain effective under
environment shifts (Garcıa and Fernández, 2015; Packer et al., 2018; Zhang et al., 2020; Wang et al.,
2024b). A widely adopted framework for learning such policies is the distributionally robust Markov
decision process (DRMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005), which models dynamics
changes as an uncertainty set around a nominal transition kernel. In this setup, the agent seeks to
perform well even in the worst-case environment within the uncertainty set. The most common
design of these uncertainty sets is the (s, a)-rectangularity (Zhang et al., 2020; Blanchet et al., 2024;
Shi and Chi, 2024), which models uncertainty independently for each state-action pair. Although

∗Tsinghua University; email: [email protected]
†Duke University; email: [email protected]
‡Equal contribution
§Duke University; email: [email protected]

mathematically elegant, the (s, a)-rectangularity can include unrealistic dynamics, often resulting in
overly conservative policies. To address this issue, Goyal and Grand-Clement (2023) introduce the
r-rectangular set, which parameterizes transition kernels using latent factors. When focusing on linear
function classes, this concept has been extended to d-rectangular linear DRMDPs (d-DRMDPs),
where latent structures in transition kernels and uncertainty sets are linearly parameterized by
feature mappings. Building on d-DRMDPs, recent works (Ma et al., 2022; Wang et al., 2024a; Liu
and Xu, 2024b) propose computationally efficient algorithms with provable guarantees that leverage
function approximation for robust policy learning and extrapolation.
However, the d-DRMDP framework has several limitations that remain unaddressed in existing
works, which we summarize as follows. Theoretical Gaps: Current understanding of d-DRMDPs
is largely restricted to uncertainty sets defined by the Total Variation (TV) divergence (Liu and
Xu, 2024a,b). For uncertainty sets defined by the Kullback-Leibler (KL) divergence, prior works
(Ma et al., 2022; Blanchet et al., 2024) rely on additional regularity assumptions regarding the KL
dual variable, which is hard to validate in practice. Moreover, χ2-divergence-based uncertainty sets have been shown to be effective in certain empirical RL applications (Lee et al., 2021) and have also been analyzed under the (s, a)-rectangularity (Shi et al., 2024), yet there are no corresponding theoretical results or efficient algorithms for d-DRMDPs. Practical Challenges: Existing algorithms (Ma et al.,
2022; Liu and Xu, 2024b; Wang et al., 2024a) depend on a dual optimization oracle (see Remark
4.2 in Liu and Xu (2024a)) to estimate the robust value function. The computational complexity
of these methods is proportional to the feature dimension d and the planning horizon H. While
heuristic methods like the Nelder-Mead algorithm (Nelder and Mead, 1965) can approximate the
oracle, they become computationally expensive for high-dimensional features (large d) and extended
planning horizons (large H), which are common in real-world applications. These limitations raise
an important question:
Can we design provably and computationally efficient algorithms for robust RL using general
f-divergence¹ for modeling uncertainty in structured transition dynamics?
In this work, we provide a positive answer to this question. Inspired by the robust regularized
MDP (RRMDP) framework under the (s, a)-rectangularity condition (Yang et al., 2023; Zhang et al.,
2020; Panaganti et al., 2024), where the uncertainty set constraint in DRMDP is replaced by a
regularization penalty term measuring the divergence between the nominal and perturbed dynamics,
we propose the d-rectangular linear RRMDP (d-RRMDP) framework. Specifically, d-RRMDP
replaces the d-rectangular uncertainty set in d-DRMDPs with a carefully designed penalty term
that preserves the linear structure. The motivation is twofold: (1) it has been shown by Yang
et al. (2023) that the robust policy under the RRMDP is equivalent to that under the DRMDP for
unstructured uncertainty as long as the regularizer is properly chosen; (2) removing the uncertainty
set constraint simplifies the dual problem for certain divergences (Zhang et al., 2023), potentially
improving computational efficiency and facilitating theoretical analysis. In this paper, we show that
the above two advantages hold for our proposed d-RRMDP. We summarize our contributions as
follows:
• We establish that key dynamic programming principles, including the robust Bellman equation
and the existence of deterministic optimal robust policies, hold under the d-RRMDP framework.
Additionally, we derive dual formulations of robust Q-functions for TV, KL, and χ2 divergences,
highlighting their linear structures.
¹The general f-divergence includes widely studied divergences such as the TV, KL, and χ2 divergences.

• We propose a computationally efficient meta-algorithm, Regularized Robust Pessimistic Value
Iteration (R2PVI), for offline d-RRMDPs with general f -divergence regularization. For TV,
KL, and χ2 divergences, we provide instance-dependent upper bounds on the suboptimality
gap of policies learned by R2PVI. These bounds take the general form:
$$\beta \cdot \sup_{P \in \mathcal{U}^{\lambda}(P^0)} \mathbb{E}^{\pi^{\star}, P}\bigg[\sum_{h=1}^{H}\sum_{i=1}^{d} \big\|\phi_i(s, a)\mathbf{1}_i\big\|_{\Lambda_h^{-1}} \,\bigg|\, s_1 = s\bigg],$$

where d is the feature dimension, H is the horizon length, ϕ(s, a) is the feature mapping,
λ is the regularization parameter, and β is a problem-dependent parameter whose specific
form depends on the choice of the divergence (see Section 5.1 for details). The set U λ (P 0 ) is
derived from our theoretical analysis, though it does not represent an uncertainty set in the
conventional DRMDP framework. We further construct an information-theoretic lower bound,
demonstrating that this instance-dependent uncertainty function is intrinsic to the problem.
• We conduct experiments in simulated environments, including a linear MDP setting (Liu
and Xu, 2024a) and the American Put Option environment (Ma et al., 2022). Our findings
show that: 1. With appropriately chosen regularization parameters, the d-RRMDP framework yields robust policies equivalent to those of the d-DRMDP. 2. R2PVI significantly improves computational efficiency compared to algorithms for d-DRMDPs. 3. R2PVI with TV and KL divergences
achieves computational efficiency comparable to algorithms designed for standard linear MDPs.
Notations. In this paper, we denote ∆(S) as the probability distribution over the state space S.
For any H ∈ N, [H] represents the set {1, 2, 3, · · · , H}. For a vector v ∈ Rd , we denote vi as the i-th
element. For any function V : S → [0, H], we denote Vmin = mins∈S V (s) and Vmax = maxs∈S V (s).
For any distribution µ ∈ ∆(S), we denote Vars∼µ V (s) as the variance of the random variable
V (s) under µ. For any two probability measures P and Q such that P is absolutely continuous with respect to Q, the f-divergence is defined as $D_f(P\|Q) = \int_{\mathcal{S}} f\big(P(s)/Q(s)\big)Q(s)\,ds$, where f is a convex function on $\mathbb{R}$, differentiable on $\mathbb{R}^+$, and satisfying $f(1) = 0$ and $f(t) = +\infty$ for all $t < 0$. The Total Variation (TV) divergence, Kullback-Leibler (KL) divergence, and chi-square (χ2) divergence between P and Q correspond to $f(x) = |x-1|/2$, $f(x) = x\log x$, and $f(x) = (x-1)^2$, respectively. Given a scalar α, we denote $[V(s)]_\alpha = \min\{V(s), \alpha\}$. We denote I as the identity matrix and $\mathbf{1}_i \in \mathbb{R}^d$ as the one-hot vector whose i-th element equals one.
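Since these three f-divergences are used throughout the paper, the following minimal Python sketch (our own illustration; the function names are hypothetical) computes them for discrete distributions directly from their generator functions f.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_s f(p(s)/q(s)) q(s) for discrete P, Q with P absolutely continuous w.r.t. Q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = q > 0
    assert np.all(p[~mask] == 0), "P is not absolutely continuous w.r.t. Q"
    return np.sum(f(p[mask] / q[mask]) * q[mask])

# Generator functions of the three divergences used in the paper.
f_tv  = lambda x: np.abs(x - 1) / 2                   # Total Variation
f_kl  = lambda x: x * np.log(np.maximum(x, 1e-12))    # Kullback-Leibler (0 log 0 := 0)
f_chi = lambda x: (x - 1) ** 2                        # Chi-square

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(f_divergence(p, q, f_tv), f_divergence(p, q, f_kl), f_divergence(p, q, f_chi))
```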

2 Related Work
Distributionally Robust MDPs. The seminal works of Iyengar (2005); Nilim and El Ghaoui
(2005) proposed the framework of DRMDP. There are several lines of works studying DRMDP under
different settings. Zhang et al. (2020); Panaganti et al. (2022); Shi and Chi (2024) studied the offline
DRMDP assuming access to an offline dataset, and provided sample complexity bounds under the
coverage assumption on the offline dataset. Liu and Xu (2024a); Liu et al. (2024); Lu et al. (2024)
studied the online DRMDP where an agent learns robust policies by actively interacting with the
nominal environment. Blanchet et al. (2024); Panaganti et al. (2022) studied the DRMDP with
general function approximation, they focused on the offline setting with the (s, a)-rectangularity
assumption. Ma et al. (2022); Liu and Xu (2024b); Wang et al. (2024a) studied the offline d-DRMDP,
they proposed provably and computationally efficient algorithms and provided sample complexity
bounds under different kinds of coverage assumptions on the offline dataset.

RRMDPs. The works of Yang et al. (2023); Zhang et al. (2023) proposed the RRMDP, which can be regarded as a generalization of the DRMDP obtained by substituting the uncertainty set constraint in the DRMDP with a regularization term defined through the divergence between the perturbed dynamics and the nominal dynamics. In particular, Yang et al. (2023) studied the tabular RRMDP and proposed a model-free algorithm assuming access to a simulator. Zhang et al. (2023) studied the offline RRMDP, established connections between RRMDPs and risk-sensitive MDPs, and derived the policy gradient principle. Moreover, they studied general function approximation and proposed a computationally efficient algorithm, RFZI, for the RRMDP with a KL-divergence based regularization term. Zhang et al. (2023) first showed that the dual of the robust value function has a closed-form expression under the KL-divergence. Panaganti et al. (2024) studied the offline RRMDP with regularization terms defined by general f-divergences, considered general function approximation, and provided sample complexity results. They further proposed a hybrid algorithm, for the RRMDP with a TV-divergence regularization term, that learns robust policies from both historical data and online sampling. These works mostly focus on (s, a)-rectangular regularization, which is different from ours.

3 Problem formulation
Markov decision process (MDP). We first introduce the concept of MDPs, which forms the basis of our setting. Specifically, we denote by MDP(S, A, H, P 0 , r) a finite-horizon MDP, where S is the state space, A is the action space, H is the horizon length, $P^0 = \{P_h^0\}_{h=1}^{H}$ are the nominal transition kernels, and $r(s,a) \in [0,1]$ is the deterministic reward function, assumed to be known in advance. For any policy π, the value function and Q-function at time step h are defined as
$$V_h^{\pi}(s) = \mathbb{E}^{P^0}\bigg[\sum_{t=h}^{H} r_t(s_t, a_t) \,\bigg|\, s_h = s, \pi\bigg], \qquad Q_h^{\pi}(s,a) = \mathbb{E}^{P^0}\bigg[\sum_{t=h}^{H} r_t(s_t, a_t) \,\bigg|\, s_h = s, a_h = a, \pi\bigg].$$

The robust regularized MDP (RRMDP). The RRMDP framework with a finite horizon is defined by RRMDP(S, A, H, P 0 , r, λ, D, F), where λ is the regularizer, D is the probability divergence metric, and F is the feasible set of all perturbed transition kernels. For any policy π, the robust regularized value function and Q-function are defined as
$$V_h^{\pi,\lambda}(s) = \inf_{P \in \mathcal{F}} \mathbb{E}^{P}\bigg[\sum_{t=h}^{H} \Big(r_t(s_t, a_t) + \lambda D\big(P_t(\cdot|s_t, a_t)\,\|\,P_t^0(\cdot|s_t, a_t)\big)\Big) \,\bigg|\, s_h = s, \pi\bigg], \quad (3.1)$$
$$Q_h^{\pi,\lambda}(s,a) = \inf_{P \in \mathcal{F}} \mathbb{E}^{P}\bigg[\sum_{t=h}^{H} \Big(r_t(s_t, a_t) + \lambda D\big(P_t(\cdot|s_t, a_t)\,\|\,P_t^0(\cdot|s_t, a_t)\big)\Big) \,\bigg|\, s_h = s, a_h = a, \pi\bigg]. \quad (3.2)$$

The RRMDP framework has been referred to by different names in the literature, including the
penalized robust MDP (Yang et al., 2023), the soft robust MDP (Zhang et al., 2023), and the
robust ϕ-regularized MDP (RRMDP) (Panaganti et al., 2024). For consistency, we adopt the
term RRMDP in this work. In RRMDPs, the transition kernel class F typically encompasses all
possible kernels. However, for environments with large state-action spaces, F may be overly broad,
including transitions that are unrealistic or irrelevant. To address this, we introduce latent structures
on transition kernels and design regularization penalties that capture dynamics changes in the

latent structure, which is similar to r-rectangular MDPs (Goyal and Grand-Clement, 2023) and
d-rectangular linear DRMDPs (Ma et al., 2022).

The d-rectangular linear RRMDP (d-RRMDP). In this paper, we propose the novel d-RRMDP, which admits a linear structure in the feasible set and the reward function. Specifically, a d-RRMDP is an RRMDP where the nominal environment P 0 is a special case of a linear MDP with a simplex feature space (Jin et al., 2020, Example 2.2), and the feasible set F consists of kernels defined through the linear structure of the nominal transition kernel. Specifically, we make the following assumption:

Assumption 3.1 (Jin et al. (2020)). Given a known state-action feature mapping $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ satisfying $\sum_{i=1}^{d} \phi_i(s,a) = 1$ and $\phi_i(s,a) \geq 0$, we assume the reward function $\{r_h\}_{h=1}^{H}$ and the nominal transition kernels $\{P_h^0\}_{h=1}^{H}$ admit linear structures. Specifically,
$$r_h(s,a) = \langle \phi(s,a), \theta_h \rangle, \qquad P_h^0(\cdot|s,a) = \langle \phi(s,a), \mu_h^0(\cdot) \rangle, \qquad \text{for all } (h,s,a) \in [H] \times \mathcal{S} \times \mathcal{A},$$
where $\{\theta_h\}_{h=1}^{H}$ are known vectors with bounded norm $\|\theta_h\|_2 \leq d$ and $\{\mu_h^0\}_{h=1}^{H}$ are unknown vectors of probability measures over $\mathcal{S}$, i.e., $\mu_h^0 = (\mu_{h,1}^0, \mu_{h,2}^0, \cdots, \mu_{h,d}^0)$ with $\mu_{h,i}^0 \in \Delta(\mathcal{S})$ for all $i \in [d]$.

With Assumption 3.1, since we restrict attention to the linearly structured feasible set F, the robust regularized value function and Q-function are defined as
$$V_h^{\pi,\lambda}(s) = \inf_{\mu_t \in \Delta(\mathcal{S})^d,\, P_t = \langle \phi, \mu_t\rangle} \mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H} \Big(r_t(s_t,a_t) + \lambda \big\langle \phi(s_t,a_t), D(\mu_t\|\mu_t^0)\big\rangle\Big) \,\bigg|\, s_h = s, \pi\bigg], \quad (3.3)$$
$$Q_h^{\pi,\lambda}(s,a) = \inf_{\mu_t \in \Delta(\mathcal{S})^d,\, P_t = \langle \phi, \mu_t\rangle} \mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H} \Big(r_t(s_t,a_t) + \lambda \big\langle \phi(s_t,a_t), D(\mu_t\|\mu_t^0)\big\rangle\Big) \,\bigg|\, s_h = s, a_h = a, \pi\bigg],$$
where $D(\mu\|\mu^0) = \big(D(\mu_1\|\mu_1^0), D(\mu_2\|\mu_2^0), \cdots, D(\mu_d\|\mu_d^0)\big)^{\top}$ is the concatenated vector of the coordinate-wise divergences $D(\mu_i\|\mu_i^0)$. In other words, we only consider perturbed kernels in the linear feasible set
$$\mathcal{F}^{L} = \Big\{P = \{P_h\}_{h=1}^{H} \,\Big|\, P_h(\cdot|s,a) = \langle \phi(s,a), \mu_h(\cdot)\rangle,\ \mu_h = (\mu_{h,1}, \mu_{h,2}, \ldots, \mu_{h,d}),\ \mu_{h,i} \in \Delta(\mathcal{S}),\ \forall i \in [d]\Big\}.$$

The optimal robust regularized value function and Q-function are defined as
$$V_h^{\star,\lambda}(s) = \sup_{\pi} V_h^{\pi,\lambda}(s), \qquad Q_h^{\star,\lambda}(s,a) = \sup_{\pi} Q_h^{\pi,\lambda}(s,a). \quad (3.4)$$
Based on (3.4), the optimal robust policy is defined as the policy that achieves the optimal robust regularized value function at time step 1, i.e., $\pi^{\star,\lambda} = \arg\max_{\pi} V_1^{\pi,\lambda}(s)$ for all $s \in \mathcal{S}$.

Dynamic Programming Principles for d-RRMDP. For completeness, we first show that the dynamic programming principles (Sutton, 2018) hold for the d-rectangular linear RRMDP.

Proposition 3.2 (Robust Regularized Bellman Equation). Under the d-rectangular linear RRMDP, for any policy π and any (h, s, a) ∈ [H] × S × A, we have
$$Q_h^{\pi,\lambda}(s,a) = r_h(s,a) + \inf_{\mu_h \in \Delta(\mathcal{S})^d,\, P_h = \langle \phi, \mu_h\rangle}\Big\{\mathbb{E}_{s' \sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big] + \lambda \big\langle \phi(s,a), D(\mu_h\|\mu_h^0)\big\rangle\Big\}, \quad (3.5)$$
$$V_h^{\pi,\lambda}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\big[Q_h^{\pi,\lambda}(s,a)\big]. \quad (3.6)$$

Next, we show that the optimal robust policy is deterministic and stationary.

Proposition 3.3. Under the setting of the d-rectangular linear RRMDP, there exists a deterministic and stationary policy π⋆ such that for any (h, s, a) ∈ [H] × S × A,
$$V_h^{\pi^{\star},\lambda}(s) = V_h^{\star,\lambda}(s), \qquad Q_h^{\pi^{\star},\lambda}(s,a) = Q_h^{\star,\lambda}(s,a). \quad (3.7)$$

With these results, we can restrict the policy class Π to deterministic and stationary policies. This leads to the robust regularized Bellman optimality equation:
$$Q_h^{\star,\lambda}(s,a) = r_h(s,a) + \inf_{\mu_h \in \Delta(\mathcal{S})^d,\, P_h = \langle \phi, \mu_h\rangle}\Big\{\mathbb{E}_{s' \sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star,\lambda}(s')\big] + \lambda \big\langle \phi(s,a), D(\mu_h\|\mu_h^0)\big\rangle\Big\},$$
$$V_h^{\star,\lambda}(s) = \max_{a \in \mathcal{A}} Q_h^{\star,\lambda}(s,a). \quad (3.8)$$

Offline Dataset and Learning Goal. Assume that we have an offline dataset D consisting of K i.i.d. trajectories collected from the nominal environment by a behavior policy $\pi^b$. Specifically, for the τ-th trajectory $\{(s_h^\tau, a_h^\tau, r_h^\tau)\}_{h=1}^{H}$, we have $a_h^\tau \sim \pi_h^b(\cdot|s_h^\tau)$, $r_h^\tau = r_h(s_h^\tau, a_h^\tau)$, and $s_{h+1}^\tau \sim P_h^0(\cdot|s_h^\tau, a_h^\tau)$ for any $h \in [H]$. The goal of the offline RRMDP is to learn the optimal robust policy π⋆ from the offline dataset D. The suboptimality gap between any policy π̂ and the optimal robust policy π⋆ is
$$\mathrm{SubOpt}(\hat{\pi}, s_1, \lambda) := V_1^{\star,\lambda}(s_1) - V_1^{\hat{\pi},\lambda}(s_1). \quad (3.9)$$
The goal of the agent is to learn a π̂ that minimizes the suboptimality gap SubOpt(π̂, s, λ) for all s ∈ S.

4 Robust Regularized Pessimistic Value Iteration


In this section, we first develop a meta-algorithm for the offline d-rectangular linear RRMDP with D being a general f-divergence. Then, to instantiate the meta-algorithm, we provide exact dual-form solutions when the f-divergence D is the Total Variation (TV) divergence, the Kullback-Leibler (KL) divergence, or the chi-square (χ2) divergence, respectively.

4.1 Robust Regularized Pessimistic Value Iteration (R2PVI)


We first show that under the d-rectangular linear RRMDP, $Q_h^{\pi,\lambda}(s,a)$ admits a linear representation.

Proposition 4.1. Under Assumption 3.1, for any (π, s, a, h) ∈ Π × S × A × [H], we have
$$Q_h^{\pi,\lambda}(s,a) = \big\langle \phi(s,a), \theta_h + w_h^{\pi,\lambda}\big\rangle, \quad (4.1)$$
where $w_h^{\pi,\lambda} = \big(w_{h,1}^{\pi,\lambda}, w_{h,2}^{\pi,\lambda}, \cdots, w_{h,d}^{\pi,\lambda}\big)^{\top} \in \mathbb{R}^d$ and $w_{h,i}^{\pi,\lambda} = \inf_{\mu_{h,i} \in \Delta(\mathcal{S})}\big\{\mathbb{E}_{\mu_{h,i}}\big[V_{h+1}^{\pi,\lambda}(s)\big] + \lambda D(\mu_{h,i}\|\mu_{h,i}^0)\big\}$.
The linear representation of the robust Q-function enables linear function approximation. It then remains to obtain an estimate, $\hat{w}_h^{\lambda}$, of the parameter $w_h^{\lambda}$. For now, we focus on the algorithm design and assume access to an oracle that returns $\hat{w}_h^{\lambda}$; we instantiate the estimation procedure for different probability divergences D in the following sections. Leveraging the robust regularized Bellman equation (Proposition 3.2) and the pessimism principle (Jin et al., 2021; Liu and Xu, 2024b), which is well developed to account for the distribution shift of the offline dataset, we propose the meta-algorithm in Algorithm 1.

Algorithm 1 Robust Regularized Pessimistic Value Iteration (R2PVI)
Require: Dataset D, regularizer λ > 0
1: Initialize V̂_{H+1}^λ(·) = 0
2: for step h = H, · · · , 1 do
3:    Compute Λ_h ← Σ_{τ=1}^{K} ϕ(s_h^τ, a_h^τ)ϕ(s_h^τ, a_h^τ)^⊤ + γI
4:    Obtain the parameter estimate ŵ_h^λ: TV-divergence, use (4.3); KL-divergence, use (4.6); χ2-divergence, use (4.10)    ◁ Duality Estimation
5:    Construct the pessimism penalty Γ_h(·, ·)    ◁ Pessimism
6:    Estimate Q̂_h^λ(·, ·) ← min{⟨ϕ(·, ·), θ_h + ŵ_h^λ⟩ − Γ_h(·, ·), H − h + 1}^+
7:    Set π̂_h(·|·) ← argmax_{π_h} ⟨Q̂_h^λ(·, ·), π_h(·|·)⟩_A and V̂_h^λ(·) ← ⟨Q̂_h^λ(·, ·), π̂_h(·|·)⟩_A
8: end for

Algorithm 1 presents a meta-algorithm for the d-rectangular linear RRMDP with any probability
divergence metric D. The algorithm follows a value iteration framework, integrating the pessimism
principle to iteratively estimate the robust regularized Q-function. Based on the robust regularized
Bellman optimality equation (3.8), the estimated robust policy is derived as the greedy policy with
respect to the estimated robust regularized Q-function. In subsequent sections, we instantiate D as
commonly studied f -divergences, including the TV-, KL-, and χ2 -divergences. We also detail the
parameter estimation procedure in Line 4 and the construction of the pessimism penalty in Line 5 of
Algorithm 1.
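To make the structure of Algorithm 1 concrete, the following Python sketch mirrors its backward loop on a finite state-action space. It is our own minimal illustration under stated assumptions (the data layout, `phi`, `theta`, and the `estimate_w` oracle are placeholders), not the authors' reference implementation.

```python
import numpy as np

def r2pvi(data, phi, theta, H, d, gamma, beta, estimate_w, states, actions):
    """Backward pessimistic value iteration, a sketch of Algorithm 1 (finite S and A).

    data[h]  : list of transitions (s, a, s_next) collected at step h (h = 1..H);
    phi(s,a) : d-dimensional simplex feature; theta[h]: known reward parameter;
    estimate_w(transitions, phi, V_next, Lambda_inv): Line-4 oracle for the chosen
    divergence (e.g. the TV/KL/chi^2 sketches below, extra args bound via functools.partial)."""
    V_next = lambda s: 0.0                                            # V-hat_{H+1} = 0
    policy = {}
    for h in range(H, 0, -1):
        Phi = np.array([phi(s, a) for (s, a, _) in data[h]])          # K x d design matrix
        Lambda_inv = np.linalg.inv(Phi.T @ Phi + gamma * np.eye(d))   # Lambda_h^{-1}   (Line 3)
        w_hat = estimate_w(data[h], phi, V_next, Lambda_inv)          # duality estimation (Line 4)

        def Q_hat(s, a, h=h, w_hat=w_hat, Lambda_inv=Lambda_inv):
            x = phi(s, a)
            # Gamma_h(s,a) = beta * sum_i |phi_i(s,a)| sqrt([Lambda_h^{-1}]_{ii})   (Line 5)
            Gamma = beta * sum(abs(x[i]) * np.sqrt(Lambda_inv[i, i]) for i in range(d))
            return max(min(x @ (theta[h] + w_hat) - Gamma, H - h + 1), 0.0)        # Line 6

        policy[h] = {s: max(actions, key=lambda a: Q_hat(s, a)) for s in states}   # greedy policy (Line 7)
        V_next = lambda s, h=h, Q_hat=Q_hat: Q_hat(s, policy[h][s])                # V-hat_h
    return policy
```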

4.2 R2PVI with the TV-Divergence


In this section, we show how to get the estimation in Line 4 of R2PVI for TV divergence based
regularization. We first present a result on the duality of the TV-divergence.

Proposition 4.2. Given any probability measure µ0 ∈ ∆(S) and value function V : S → [0, H], if the divergence D is chosen as the TV-divergence, the dual formulation of the original regularized optimization problem is
$$\inf_{\mu \in \Delta(\mathcal{S})}\Big\{\mathbb{E}_{s \sim \mu}\big[V(s)\big] + \lambda D_{\mathrm{TV}}(\mu\|\mu^0)\Big\} = \mathbb{E}_{s \sim \mu^0}\Big[\big[V(s)\big]_{\min_{s'}\{V(s')\} + \lambda}\Big]. \quad (4.2)$$

Remark 4.3. We compare the duality of the regularized problem in Proposition 4.2 with the duality of the constrained problem with the TV-divergence in DRMDPs (Shi and Chi, 2024):
$$\inf_{P \in \mathcal{U}^{\rho}(P^0)} \mathbb{E}^{P}\big[V(s)\big] = \max_{\alpha \in [V_{\min}, V_{\max}]}\Big\{\mathbb{E}^{P^0}\big[[V(s)]_\alpha\big] - \rho\big(\alpha - \min_{s'}[V(s')]_\alpha\big)\Big\}.$$
We highlight that the former has a closed form, while the latter must be solved through optimization; both involve a truncation of the value function. We will see later that this distinction makes R2PVI much more computationally efficient than algorithms designed for DRMDPs.
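As a quick numerical sanity check of Proposition 4.2 (our own illustration, not part of the paper), the snippet below compares the closed-form right-hand side of (4.2) against a brute-force minimization over a two-state simplex.

```python
import numpy as np

V = np.array([0.0, 10.0])          # value function on two states
mu0 = np.array([0.5, 0.5])         # nominal distribution
lam = 2.0

# Closed form (4.2): E_{mu0}[ [V(s)]_{min V + lambda} ]
closed = np.sum(mu0 * np.minimum(V, V.min() + lam))

# Brute force over mu = (p, 1 - p); here D_TV(mu || mu0) = |p - 0.5|
grid = np.linspace(0.0, 1.0, 100001)
objective = grid * V[0] + (1 - grid) * V[1] + lam * np.abs(grid - mu0[0])
print(closed, objective.min())     # both approximately 1.0
```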

Next, we present the parameter estimation procedure. Given an estimated robust value function $\hat{V}_{h+1}^{\lambda}$, we denote $\alpha_{h+1} = \min_{s'} \hat{V}_{h+1}^{\lambda}(s') + \lambda$. According to the linear representation of the Q-function in Proposition 4.1, the duality for the TV-divergence in Proposition 4.2, and the linear structure of the nominal kernel in Assumption 3.1, we estimate the parameter $w_h^{\lambda}$ as
$$\hat{w}_h^{\lambda} = \arg\min_{w \in \mathbb{R}^d} \sum_{\tau=1}^{K} \Big(\big[\hat{V}_{h+1}^{\lambda}(s_{h+1}^{\tau})\big]_{\alpha_{h+1}} - \phi(s_h^{\tau}, a_h^{\tau})^{\top} w\Big)^2 + \gamma \|w\|_2^2, \quad (4.3)$$
where γ is a stabilization parameter that ensures numerical stability and prevents the matrix Λ_h from becoming ill-conditioned or singular.
Remark 4.4. Thanks to the closed-form expression of the TV duality in Proposition 4.2, R2PVI does not need the dual oracle required by the DRPVI algorithm proposed for the d-rectangular linear DRMDP (Liu and Xu, 2024b, see their equation (4.4) and Algorithm 1 for more details). DRPVI needs to solve the dual oracle separately for each dimension in each iteration, which is not necessary in our algorithm.
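For concreteness, here is a minimal sketch (our own, with hypothetical argument names) of the TV-divergence estimation step (4.3); because of the closed-form ridge solution, no per-dimension dual optimization is needed.

```python
import numpy as np

def estimate_w_tv(transitions, phi, V_next, Lambda_inv, lam, states):
    """Sketch of the TV-divergence estimate (4.3).

    transitions: list of (s, a, s_next) from the offline data at step h;
    V_next: the estimated robust value function V-hat_{h+1};
    Lambda_inv = (Phi^T Phi + gamma I)^{-1}."""
    alpha = min(V_next(s) for s in states) + lam          # alpha_{h+1} = min_s V-hat_{h+1}(s) + lambda
    Phi = np.array([phi(s, a) for (s, a, _) in transitions])
    y = np.array([min(V_next(s_next), alpha) for (_, _, s_next) in transitions])
    # Closed-form ridge-regression solution: w-hat = Lambda_h^{-1} Phi^T y
    return Lambda_inv @ (Phi.T @ y)
```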

4.3 R2PVI with the KL-Divergence


Similar to the TV-divergence, we next show how to obtain the estimate in Line 4 of Algorithm 1 for KL-divergence based regularization. We first present a result on the duality of the KL-divergence.

Proposition 4.5 (Zhang et al., 2023, Example 1). Given any probability measure µ0 ∈ ∆(S) and value function V : S → [0, H], if the probability divergence D is chosen as the KL-divergence, the dual formulation of the original regularized optimization problem is
$$\inf_{\mu \in \Delta(\mathcal{S})}\Big\{\mathbb{E}_{s \sim \mu}\big[V(s)\big] + \lambda D_{\mathrm{KL}}(\mu\|\mu^0)\Big\} = -\lambda \log \mathbb{E}_{s \sim \mu^0}\Big[e^{-V(s)/\lambda}\Big]. \quad (4.4)$$

The KL duality also has a closed form, first observed in Zhang et al. (2023). We will see that this property not only leads to a more computationally efficient algorithm but also simplifies the theoretical analysis. Next, we present the parameter estimation procedure. According to the linear representation of the Q-function in Proposition 4.1, the duality for the KL-divergence in Proposition 4.5, and the linear structure of the nominal kernel in Assumption 3.1, we estimate the parameter $w_h^{\lambda}$ by a two-step procedure. Given an estimated robust value function $\hat{V}_{h+1}^{\lambda}$, we first solve the following ridge regression to obtain an estimate of $\mathbb{E}_{s \sim \mu^0}\big[e^{-\hat{V}_{h+1}^{\lambda}(s)/\lambda}\big]$:
$$\hat{w}_h' = \arg\min_{w \in \mathbb{R}^d} \sum_{\tau=1}^{K} \Big(e^{-\hat{V}_{h+1}^{\lambda}(s_{h+1}^{\tau})/\lambda} - \phi(s_h^{\tau}, a_h^{\tau})^{\top} w\Big)^2 + \gamma \|w\|_2^2. \quad (4.5)$$
We then take a log-transformation to obtain an estimate of $w_h^{\lambda}$:
$$\hat{w}_h^{\lambda} = -\lambda \log\big(\max\{\hat{w}_h', e^{-H/\lambda}\}\big), \quad (4.6)$$
where the max operator is applied element-wise to ensure that the ridge-regression estimate is well-defined under the log-transformation, and $e^{-H/\lambda}$ is a lower bound on $\mathbb{E}_{s \sim \mu^0}\big[e^{-\hat{V}_{h+1}^{\lambda}(s)/\lambda}\big]$.
Remark 4.6. Ma et al. (2022); Blanchet et al. (2024) need to handle a hard dual oracle under the KL-divergence, whereas our algorithm has a closed-form solution. Their works require sophisticated techniques to guarantee that the estimated parameter is well-defined for the log-transformation, which our algorithm does not need; this reduces the computational cost and eases the theoretical analysis (see Section 5 and Section 6 for detailed comparisons).
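Analogously, a minimal sketch (our own, with hypothetical argument names) of the two-step KL estimation (4.5)-(4.6) could look as follows.

```python
import numpy as np

def estimate_w_kl(transitions, phi, V_next, Lambda_inv, lam, H):
    """Sketch of the KL-divergence estimate (4.5)-(4.6)."""
    Phi = np.array([phi(s, a) for (s, a, _) in transitions])
    y = np.exp(-np.array([V_next(s_next) for (_, _, s_next) in transitions]) / lam)
    w_prime = Lambda_inv @ (Phi.T @ y)          # ridge estimate of E_{mu0}[exp(-V-hat/lambda)]  (4.5)
    # Element-wise clipping keeps the log well defined; exp(-H/lambda) lower-bounds the target  (4.6)
    return -lam * np.log(np.maximum(w_prime, np.exp(-H / lam)))
```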

4.4 R2PVI with the χ2 -Divergence
It remains to show how to obtain the estimate in Line 4 of Algorithm 1 for χ2-divergence based regularization. We first present a result on the duality of the χ2-divergence.

Proposition 4.7. Given any probability measure µ0 ∈ ∆(S) and value function V : S → [0, H], if the probability divergence D is chosen as the χ2-divergence, the dual formulation of the original regularized optimization problem is
$$\inf_{\mu \in \Delta(\mathcal{S})}\Big\{\mathbb{E}_{s \sim \mu}\big[V(s)\big] + \lambda D_{\chi^2}(\mu\|\mu^0)\Big\} = \sup_{\alpha \in [V_{\min}, V_{\max}]}\Big\{\mathbb{E}_{s \sim \mu^0}\big[[V(s)]_\alpha\big] - \frac{1}{4\lambda}\mathrm{Var}_{s \sim \mu^0}\big[[V(s)]_\alpha\big]\Big\}. \quad (4.7)$$

Next, we present the parameter estimation procedure. According to the linear representation of the Q-function in Proposition 4.1, the duality for the χ2-divergence in Proposition 4.7, and the linear structure of the nominal kernel in Assumption 3.1, we estimate the parameter $w_h^{\lambda}$ as follows. To estimate the variance of the value function in (4.7), we propose a new method motivated by the variance estimation in Liu and Xu (2024b). Specifically, given an estimated robust value function $\hat{V}_{h+1}^{\lambda}$ and dual variable α, we first estimate $\mathbb{E}_{s \sim \mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha\big]$ and $\mathbb{E}_{s \sim \mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha^2\big]$ as
$$\hat{\mathbb{E}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha\big] = \bigg[\arg\min_{w \in \mathbb{R}^d} \sum_{\tau=1}^{K}\Big(\big[\hat{V}_{h+1}^{\lambda}(s_{h+1}^{\tau})\big]_\alpha - \phi(s_h^{\tau}, a_h^{\tau})^{\top} w\Big)^2 + \gamma\|w\|_2^2\bigg]_{[0,H]}^{i}, \quad (4.8)$$
$$\hat{\mathbb{E}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha^2\big] = \bigg[\arg\min_{w \in \mathbb{R}^d} \sum_{\tau=1}^{K}\Big(\big[\hat{V}_{h+1}^{\lambda}(s_{h+1}^{\tau})\big]_\alpha^2 - \phi(s_h^{\tau}, a_h^{\tau})^{\top} w\Big)^2 + \gamma\|w\|_2^2\bigg]_{[0,H^2]}^{i}, \quad (4.9)$$
where the superscript i denotes the i-th element of a vector and the subscript denotes truncation to the indicated interval. Then we construct the estimator $\hat{w}_h^{\lambda}$ as
$$\hat{w}_{h,i}^{\lambda} = \sup_{\alpha \in [(\hat{V}_{h+1}^{\lambda})_{\min}, (\hat{V}_{h+1}^{\lambda})_{\max}]}\Big\{\hat{\mathbb{E}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha\big] - \frac{1}{4\lambda}\widehat{\mathrm{Var}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha\big]\Big\}$$
$$= \max_{\alpha \in [(\hat{V}_{h+1}^{\lambda})_{\min}, (\hat{V}_{h+1}^{\lambda})_{\max}]}\Big\{\hat{\mathbb{E}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha\big] + \frac{1}{4\lambda}\Big(\hat{\mathbb{E}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha\big]\Big)^2 - \frac{1}{4\lambda}\hat{\mathbb{E}}^{\mu_{h,i}^0}\big[[\hat{V}_{h+1}^{\lambda}(s)]_\alpha^2\big]\Big\}. \quad (4.10)$$

We note that this parameter estimation procedure differs from those for the TV- and KL-divergences because the χ2 duality does not admit a closed-form expression. Specifically, it estimates the parameter $w_h^{\lambda}$ element-wise, and for each dimension it solves an optimization problem over an estimated dual formulation. This estimation procedure shares a similar spirit with that of the d-DRMDP with TV-divergence (Liu and Xu, 2024a,b).
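A minimal sketch (our own) of the element-wise χ2 estimation (4.8)-(4.10) follows; the supremum over the dual variable α is approximated here by a grid search, which is one possible implementation choice rather than the paper's prescription.

```python
import numpy as np

def estimate_w_chi2(transitions, phi, V_next, Lambda_inv, lam, states, H, n_grid=50):
    """Sketch of the chi^2-divergence estimate (4.8)-(4.10), element-wise sup over alpha."""
    Phi = np.array([phi(s, a) for (s, a, _) in transitions])
    V = np.array([V_next(s_next) for (_, _, s_next) in transitions])
    v_all = np.array([V_next(s) for s in states])
    d = Phi.shape[1]
    w_hat = np.full(d, -np.inf)
    for alpha in np.linspace(v_all.min(), v_all.max(), n_grid):      # grid over [V-hat_min, V-hat_max]
        y = np.minimum(V, alpha)                                     # truncated value [V-hat]_alpha
        m1 = np.clip(Lambda_inv @ (Phi.T @ y), 0.0, H)               # estimate of E[ [V]_alpha ]   (4.8)
        m2 = np.clip(Lambda_inv @ (Phi.T @ y ** 2), 0.0, H ** 2)     # estimate of E[ [V]_alpha^2 ] (4.9)
        cand = m1 + m1 ** 2 / (4 * lam) - m2 / (4 * lam)             # dual objective in (4.10)
        w_hat = np.maximum(w_hat, cand)                              # element-wise sup over alpha
    return w_hat
```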

5 Suboptimality Analysis
In this section, we establish theoretical guarantees for the algorithms proposed in Section 4. First,
we derive instance-dependent upper bounds on the suboptimality gap of the policies learned by
the instantiated algorithms. Next, under a partial coverage assumption, we present an instance-
independent upper bound for the suboptimality gap and compare it with results from existing
works. Finally, we provide a matching information-theoretic lower bound to highlight the intrinsic
characteristics of offline d-RRMDPs.

5.1 Instance-Dependent Upper Bound
Theorem 5.1. Under Assumption 3.1, for any δ ∈ (0, 1), if we set γ = 1 and $\Gamma_h(s,a) = \beta\sum_{i=1}^{d}\|\phi_i(s,a)\mathbf{1}_i\|_{\Lambda_h^{-1}}$ in Algorithm 1, with

• (TV) $\beta = 16Hd\sqrt{\xi_{\mathrm{TV}}}$, where $\xi_{\mathrm{TV}} = 2\log(1024Hd^{1/2}K^2/\delta)$;

• (KL) $\beta = 16d\lambda e^{H/\lambda}\big(H/\lambda + \sqrt{\xi_{\mathrm{KL}}}\big)$, where $\xi_{\mathrm{KL}} = \log(1024d\lambda^2K^3H/\delta)$;

• (χ2) $\beta = 8dH^2(1+1/\lambda)\sqrt{\xi_{\chi^2}}$, where $\xi_{\chi^2} = \log\big(192K^5H^6d^3(1+H/(2\lambda))^3/\delta\big)$,

then with probability at least 1 − δ, for all s ∈ S, the suboptimality of Algorithm 1 satisfies
$$\mathrm{SubOpt}(\hat{\pi}, s, \lambda) \leq 2\beta\sup_{P\in\mathcal{U}^{\lambda}(P^0)}\mathbb{E}^{\pi^{\star},P}\bigg[\sum_{h=1}^{H}\sum_{i=1}^{d}\big\|\phi_i(s_h,a_h)\mathbf{1}_i\big\|_{\Lambda_h^{-1}}\,\bigg|\,s_1=s\bigg],$$
where $\mathcal{U}^{\lambda}(P^0)$ is the robustly admissible set defined as
$$\mathcal{U}^{\lambda}(P^0) = \bigotimes_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}\mathcal{U}_h^{\lambda}(s,a;\mu_h^0), \quad (5.1)$$
with $\mathcal{U}_h^{\lambda}(s,a;\mu_h^0) = \big\{\sum_{i=1}^{d}\phi_i(s,a)\mu_{h,i}(\cdot) : D(\mu_{h,i}\|\mu_{h,i}^0)\leq\max_{s\in\mathcal{S}}V_{h+1}^{\star,\lambda}(s)/\lambda,\ \forall i\in[d]\big\}$.
Remark 5.2. Theorem 5.1 provides an instance-dependent upper bound on the suboptimality gap,
closely resembling the bounds established for algorithms tailored to the d-DRMDP (Liu and Xu,
2024b; Wang et al., 2024a). Notably, the set U λ (P 0 ) in Theorem 5.1 represents a subset of the
feasible set FL in the d-RRMDP. While the RRMDP framework does not impose explicit uncertainty
set constraints, this term naturally arises from our theoretical analysis (see Lemma C.1 for details).
Specifically, we show that only distributions within U λ (P 0 ) are relevant when considering the infimum
in the robust regularized value and Q-functions (3.3). Intuitively, the regularization term in (3.3)
should not exceed the change in expected cumulative rewards, which is bounded above by the optimal
robust regularized value function. Similar terms are also found in Definition 1 of Zhang et al. (2023)
and Assumption 1 of Panaganti et al. (2022).

5.2 Instance-Independent Upper Bound


Next, we derive instance-independent upper bounds on the suboptimality gap, building on the
results in Theorem 5.1. To achieve this, we adapt the robust partial coverage assumption on the
offline dataset, originally proposed for d-DRMDP (Assumption A.2 of Blanchet et al. (2024)), to the
d-RRMDP setting. This adaptation is straightforward and involves replacing the uncertainty set in
the d-DRMDP framework with the robustly admissible set defined in (5.1).
Assumption 5.3 (Robust Regularized Partial Coverage). For the offline dataset D, we assume that
there exists some constant c† > 0, such that ∀(h, s, P ) ∈ [H] × S × U λ (P 0 ), we have

Λh ⪰ γI + K · c† · Eπ ,P (ϕi (s, a)1i )(ϕi (s, a)1i )⊤ |s1 = s .
 

Intuitively, Assumption 5.3 assumes that the offline dataset has good coverage on the state-action
space visited by the optimal robust policy π ⋆ under any transition kernel in the robustly admissible
set U λ (P 0 ). With Assumption 5.3 and Theorem 5.1, we can provide an instance-independent bound as follows.

Corollary 5.4. Under the same assumptions and conditions as in Theorem 5.1, if we further assume that Assumption 5.3 holds for the offline dataset D, then for any δ ∈ (0, 1), with probability at least 1 − δ, we have

• (TV) $\mathrm{SubOpt}(\hat{\pi}, s, \lambda) \leq 16H^2d^2\sqrt{\xi_{\mathrm{TV}}}/\sqrt{c^{\dagger}K}$;

• (KL) $\mathrm{SubOpt}(\hat{\pi}, s, \lambda) \leq 16\lambda e^{H/\lambda}d^2H\big(H/\lambda+\sqrt{\xi_{\mathrm{KL}}}\big)/\sqrt{c^{\dagger}K}$;

• (χ2) $\mathrm{SubOpt}(\hat{\pi}, s, \lambda) \leq 8d^2H^3(1+1/\lambda)\sqrt{\xi_{\chi^2}}/\sqrt{c^{\dagger}K}$.

We compare the suboptimality upper bound of Algorithm 1 with those of algorithms proposed in previous works for the offline d-DRMDP in Table 1. For the case with the TV-divergence, the suboptimality bound of R2PVI matches that of P2MPO in terms of d and H. DRPVI and DROP admit tighter suboptimality bounds, simply because their bounds are derived using more advanced techniques and a stronger assumption on the offline dataset. We remark that our analysis can be tailored to adopt the same techniques and assumption, and thus obtain tighter bounds.
For the case with KL-divergence, existing theoretical results (Ma et al., 2022; Blanchet et al.,
2024) rely on an additional regularity assumption regarding the KL dual variable, stating that
the optimal dual variable for the KL duality admits a positive lower bound β under any feasible
transition kernel (see Assumption F.1 in Blanchet et al. (2024)). However, this assumption presents
the following drawbacks. First, it is challenging to verify the assumption’s validity in practice; second,
even if such a lower bound holds, there is no straightforward method to determine the magnitude
of the lower bound. It can be seen from Table 1 that the suboptimality bound of R2PVI matches
that of DRVI-L (Ma et al., 2022) in terms of d and H. However, our result depends on λ, which is the regularization parameter and can be chosen arbitrarily, while the result of DRVI-L depends on β, which can be extremely small so that $\sqrt{\lambda}e^{H/\lambda} \ll \sqrt{\beta}e^{H/\beta}$. Moreover, Zhang et al. (2023) studied the RRMDP with a KL-divergence regularization term in their Theorem 5, and their suboptimality bound also depends on the term $\sqrt{\lambda}e^{H/\lambda}$.² Further, comparing the bounds of
P2MPO (Blanchet et al., 2024) and R2PVI, we can qualitatively conclude that the regularization
parameter λ in d-RRMDP plays a role analogous to 1/ρ in d-DRMDP. This relation aligns with the
intuition that a smaller λ in d-RRMDP or a larger ρ in d-DRMDP can induce a more robust policy.
For the case with the χ2-divergence, our bound is the first result in the literature. Compared with the
TV divergence, the sample complexity with χ2 divergence is higher due to the difficulty in solving the
dual oracle. This observation aligns with the findings of Shi et al. (2024). While existing RRMDP
works have focused on the (s, a)-rectangular structured regularization, our work fills the theoretical
gaps in RRMDP by introducing the d-rectangular structured regularization, a contribution that may
be of independent interest.

5.3 Information-Theoretic Lower Bound


We highlight that in Theorem 5.1, the suboptimality bounds for the TV-, KL-, and χ2-divergence cases share the same term $\sup_{P\in\mathcal{U}^{\lambda}(P^0)}\sum_{h=1}^{H}\mathbb{E}^{\pi^{\star},P}\big[\sum_{i=1}^{d}\|\phi_i(s_h,a_h)\mathbf{1}_i\|_{\Lambda_h^{-1}}\,\big|\,s_1=s\big]$. In this section, we establish information-theoretic lower bounds to show that this term is intrinsic to the d-RRMDP problem.
establish information theoretic lower bounds to show that this term is intrinsic in d-RRMDP problem.
The construction of the information-theoretic lower bound relies on a novel family of hard instances.
We illustrate one such instance in Figure 1. Both the nominal and target environments satisfy
²Zhang et al. (2023) studied the infinite-horizon RRMDP with a discount factor γ; we replace the effective horizon length 1/(1 − γ) with the horizon length H in the finite-horizon setting.

Table 1: Comparison of the suboptimality gap between our work and previous works. The ⋆ symbol denotes results that require an additional assumption (Assumption 4.4 of Ma et al. (2022) and Assumption F.1 of Blanchet et al. (2024)) on the KL dual variable, an assumption not required by our R2PVI algorithm. The parameter ρ represents the uncertainty level in DRMDP, while λ represents the regularization parameter in RRMDP. The Coverage column indicates the assumption used to derive the suboptimality gap: the robust partial coverage assumption refers to Assumption 6.2 of Blanchet et al. (2024), and the regularized partial coverage assumption refers to Assumption 5.3.

| Algorithm | Setting | Divergence | Coverage | Suboptimality Gap |
|---|---|---|---|---|
| DRPVI (Liu and Xu, 2024b) | d-DRMDP | TV | full | Õ(dH²K^{-1/2}) |
| DROP (Wang et al., 2024a) | d-DRMDP | TV | robust partial | Õ(d^{3/2}H²K^{-1/2}) |
| P2MPO (TV) (Blanchet et al., 2024) | d-DRMDP | TV | robust partial | Õ(d²H²K^{-1/2}) |
| R2PVI-TV (ours) | d-RRMDP | TV | regularized partial | Õ(d²H²K^{-1/2}) |
| DRVI-L (Ma et al., 2022) | d-DRMDP | KL | robust partial | Õ(√β e^{H/β} d²H^{3/2}K^{-1/2})⋆ |
| P2MPO (KL) (Blanchet et al., 2024) | d-DRMDP | KL | robust partial | Õ(e^{H/β} d²H²ρ^{-1}K^{-1/2})⋆ |
| R2PVI-KL (ours) | d-RRMDP | KL | regularized partial | Õ(√λ e^{H/λ} d²H^{3/2}K^{-1/2}) |
| R2PVI-χ2 (ours) | d-RRMDP | χ2 | regularized partial | Õ(d²H³(1 + λ^{-1})K^{-1/2}) |

Assumption 3.1. The environment consists of two states, s1 and s2 . In the nominal environment
Figure 1(a), s1 represents the good state with a positive reward. For any transition originating from
s1 , there is a 1 − ϵ probability of transitioning to itself and an ϵ probability of transitioning to s2 ,
regardless of the action taken, where ϵ is a parameter to be determined. The state s2 is the fail state
with zero reward and can only transition to itself. The worst-case target environment Figure 1(b) is
obtained by perturbing the transition probabilities in the nominal environment. The perturbation
magnitude ∆λh (ϵ, D) depends on the stage h, regularizer λ, divergence metric D, and parameter ϵ.
We would like to highlight the difference between our hard instances and the hard instances
developed in Liu and Xu (2024b). We find that the instances developed in Liu and Xu (2024b) only allow perturbations measured in the TV-divergence. The reason is that in their nominal environment, both s1 and s2 are absorbing states, and thus $P_h^0(\cdot|s,a)$ only has support on s, which could be either s1 or s2. In this case, any perturbation to $P_h^0(\cdot|s,a)$ would violate the absolute continuity condition in the definition of the KL-divergence and the χ2-divergence (and, strictly speaking, the TV-divergence as well). In comparison, we inject a small error ϵ into the nominal kernel such that $P_h^0(\cdot|s_1,a)$ has full support {s1, s2} when the transition starts from s1. Hence, we can safely perturb $P_h^0(\cdot|s_1,a)$ without violating the absolute continuity condition. Additionally, while Liu and Xu (2024b) construct perturbations only in the first time step, we admit perturbations in every time step h to make our instance more general. For details on the construction of
hard instances and the proof of Theorem 5.5, we refer readers to Appendix D.
In order to give a formal presentation of the information-theoretical lower bound, we define M
as a class of d-RRMDPs and SubOpt(M, π̂, s, ρ) as the suboptimality gap specific to one d-RRMDP

Figure 1: The nominal environment and the worst-case environment. Panels: (a) the source MDP environment, with transition probabilities 1 − ϵ (s1 → s1), ϵ (s1 → s2), and 1 (s2 → s2); (b) the target MDP environment, with transition probabilities $1-\epsilon-\Delta_h^{\lambda}(\epsilon, D)$ (s1 → s1), $\epsilon+\Delta_h^{\lambda}(\epsilon, D)$ (s1 → s2), and 1 (s2 → s2). The value on each arrow represents the transition probability. The MDP has two states, s1 and s2, and H steps. In the nominal environment on the left, s1 is the good state whose transition is determined by an error term ϵ, and s2 is a fail state with reward 0 that only transitions to itself. The worst-case environment on the right is obtained by perturbing the transition probability at each step of the nominal environment. The magnitude of the perturbation $\Delta_h^{\lambda}(\epsilon, D)$ at each stage h depends on the divergence metric D, the regularizer λ, and the parameter ϵ.

instance M ∈ M. Hence, we state the information-theoretic lower bound in the following theorem:

Theorem 5.5. Given a regularizer λ, dimension d, horizon length H, and sample size $K > \max\{\tilde{O}(d^6), \tilde{O}(d^3H^2/\lambda^2)\}$, there exists a class of d-rectangular linear RRMDPs M and an offline dataset D of size K such that for all s ∈ S and any divergence D among $D_{\mathrm{TV}}$, $D_{\mathrm{KL}}$, and $D_{\chi^2}$, with probability at least 1 − δ, we have
$$\inf_{\hat{\pi}}\sup_{M\in\mathcal{M}}\mathrm{SubOpt}(M,\hat{\pi},s,\lambda,D) \geq c\cdot\sup_{P\in\mathcal{U}^{\lambda}(P^0)}\mathbb{E}^{\pi^{\star},P}\bigg[\sum_{h=1}^{H}\sum_{i=1}^{d}\big\|\phi_i(s_h,a_h)\mathbf{1}_i\big\|_{\Lambda_h^{-1}}\,\bigg|\,s_1=s\bigg],$$
where c is a universal constant.


Theorem 5.5 is a universal information-theoretic lower bound for d-RRMDPs with all three divergences studied in Section 5. It shows that the instance-dependent term is intrinsic to offline d-RRMDPs, and that Algorithm 1 is near-optimal up to a factor β, whose definition varies with the divergence metric D as shown in Theorem 5.1. The proof outline of Theorem 5.5 is inspired by that of Theorem 6.1 in Liu and Xu (2024b), except that here we need a careful treatment when bounding the robust regularized value function by duality under different choices of f-divergence, especially since the error ϵ injected at each stage accumulates over the H steps (see Lemma D.1 and its proof for more details).

6 Experiment
In this section, we conduct numerical experiments to explore (1) the robustness of R2PVI when
facing adversarial dynamics, (2) the role of regularizer λ in determining the robustness of R2PVI,
and (3) the computational cost of R2PVI. We evaluate our algorithm in two off-dynamics experiments that have been used in the literature (Ma et al., 2022; Liu and Xu, 2024a). All experiments are
conducted on a machine with an 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz processor,
featuring 8 logical CPUs, 4 physical cores, and 2 threads per core.

Baselines. We compare our algorithms with three types of baseline frameworks: (1) non-robust
pessimism-based algorithm: PEVI (Jin et al., 2021), (2) d-DRMDP based algorithm under TV
divergence: DRPVI (Liu and Xu, 2024b), (3) d-DRMDP based algorithm under KL divergence:
DRVI-L (Ma et al., 2022). We do not implement P2MPO and DROP, mentioned in Table 1, in our experiments, due to the lack of a code base and numerical experiments in those works.

6.1 Simulated Linear MDP


We borrow the simulated linear MDP constructed in Liu and Xu (2024a) and adapt it to the offline
setting. We set the behavior policy π b such that it chooses actions uniformly at random. The sample
size of the offline dataset is set to 100. For completeness, we present more details on the experiment
setup and results in Appendix A.

Figure 2: Simulated results for the linear MDP. Panels: (a) λ = 0.1, (b) R2PVI, (c) q = 0.9; the y-axis is the average reward. In Figure 2(a) and Figure 2(b), the x-axis refers to the perturbation in the testing environment. In Figure 2(c), the x-axis represents different robustness levels ρ and regularization penalties λ, respectively.

In Figure 2(a), we compare R2PVI with its non-robust counterpart PEVI (Jin et al., 2021). We conclude that PEVI outperforms R2PVI when the perturbation of the
environment is small, but underperforms when the environment encounters a significant shift, which
verifies the robustness of R2PVI. The regularizer λ controls the extent of robustness of R2PVI
by determining the magnitude of the penalty as shown in Proposition 3.2. From Figure 2(b), we
can conclude that a smaller λ leads to a more robust policy. To illustrate the relation between the
d-rectangular linear RRMDP and the d-rectangular DRMDP, we fix a target environment, and then
test R2PVI with different λ and DRPVI (the algorithm designed for d-rectangular linear DRMDP
in Liu and Xu (2024b)) with different ρ. We find from Figure 2(c) that the ranges of the average
reward are approximately the same for the two algorithms, though the behaviors w.r.t. λ and ρ are
opposite. Thus, we verify Theorem 3.1 of Yang et al. (2023), which states that for each given RRMDP there exists a DRMDP whose value functions are exactly the same, and that the regularizer λ plays a role in the RRMDP similar to the inverse robustness parameter 1/ρ in the DRMDP.

6.2 Simulated American Put Option


In this section, we test our algorithm in a simulated American Put Option environment (Tamar et al.,
2014; Zhou et al., 2021) that does not belong to the d-rectangular linear RRMDP. This environment
is a finite horizon MDP with H = 20, and is controlled by a hyperparameter p0 , which is set to
be 0.5 in the nominal environment. We collect offline data from the nominal environment by a

uniformly random behavior policy.

Figure 3: Simulation results for the simulated American put option task. Panels: (a) execution time w.r.t. N, (b) λ = 1, 3, 5 and ρ = 0.025, 0.1, 0.2, (c) execution time w.r.t. d. Figure 3(a) shows the computation time of R2PVI with respect to the sample size N. Figure 3(b) shows the robustness of the algorithms under different extents of perturbation in p0. Figure 3(c) shows the computation time of the algorithms with respect to the feature dimension d.

An agent uses the collected offline dataset to learn a policy which
decides at each state whether or not to exercise the option. To implement our algorithm, we use a manually designed feature mapping of dimension d. For more details on the experiment setup, we refer readers to Section 6.2 of Liu and Xu (2024a).
All experiment results are shown in Figure 3. In particular, from Figure 3(a) and Figure 3(c), we can conclude that the computational cost of R2PVI is as low as that of its non-robust counterpart PEVI (Jin et al., 2021) and improves on that of DRPVI (Liu and Xu, 2024b) and DRVI-L (Ma et al., 2022), which are designed for the d-rectangular linear DRMDP. This is due to the closed-form duality of the TV and KL divergences under the d-rectangular linear RRMDP. From Figure 3(b), we conclude that, on the one hand, R2PVI is robust to environment perturbations; on the other hand, for each value of the regularizer λ in R2PVI, there exists a value of the uncertainty level ρ for DRPVI such that they achieve almost the same performance. This verifies the equivalence between DRMDP and RRMDP.

7 Conclusion
In this work, we propose a novel d-rectangular linear Robust Regularized Markov Decision Process
(d-RRMDP) framework to overcome the theoretical limitations and computational challenges inherent
in the d-rectangular DRMDP framework. We design a provable and practical algorithm R2PVI
that learns robust policies under the offline d-RRMDP setting across three divergence metrics:
Total Variation, Kullback-Leibler, and χ2 divergence. Our theoretical results demonstrate the
superiority of the d-RRMDP framework compared to the d-DRMDP framework, primarily through
the simplification of the duality oracle. Extensive experiments validate the robustness and high
computational efficiency of R2PVI when compared to d-DRMDP-based algorithms. It remains an
intriguing open question to improve the current upper and lower bounds to study the fundamental
hardness of the d-RRMDP.

A Additional Details on Experiments


In this section, we provide details on the experiment setup.

A.1 Simulated Linear MDP
Construction of the simulated linear MDP. We leverage the simulated linear MDP instance
proposed by Liu and Xu (2024a). The state space is S = {x1 , · · · , x5 } and the action space is
A = {−1, 1}4 ⊂ R4 . At each episode, the initial state is always x1 . From x1 , the next state can be
x2 , x4 , x5 with probability defined on the arrows. Both x4 and x5 are absorbing states. x4 is the fail
state with 0 reward and x5 is the goal state with reward 1. The hyperparameter ξ ∈ R4 is designed
to determine the reward functions and transition probabilities and δ is the parameter defined to
determine the environment. We perturb the transition probability at the initial stage to construct
the source environment. The extent of the perturbation is controlled by the hyperparameter q ∈ (0, 1).
For more details on the simulated linear DRMDP, we refer readers to the Supplementary A.1 of Liu
and Xu (2024a).

Hyper-parameters. The hyper-parameters in our setting are shown in Table 2. The horizon is 3; β, γ, and δ are set the same in all tasks; and ∥ξ∥1 is set to 0.3, 0.2, and 0.1 in Figure 2 to illustrate the versatility of our algorithms.

Table 2: Hyper-parameters.

| Hyper-parameter | Value |
|---|---|
| H (horizon) | 3 |
| β (pessimism parameter) | 1 |
| γ | 0.1 |
| δ | 0.3 |
| ∥ξ∥1 | 0.3, 0.2, 0.1 |

A.2 Simulated American Put Option


Construction of the simulated American put option. In each episode, there are H = 20 stages, and at each stage h the dynamics evolve according to the Bernoulli distribution
$$s_{h+1} = \begin{cases} 1.02\, s_h, & \text{w.p. } p_0, \\ 0.98\, s_h, & \text{w.p. } 1 - p_0, \end{cases} \quad (A.1)$$
where $p_0 \in (0,1)$ is the probability of the price going up. At each step, the agent either exercises the option ($a_h = 0$) or holds it ($a_h = 1$). If the agent exercises the option ($a_h = 0$), it obtains reward $r_h = \max\{0, 100 - s_h\}$ and the episode ends. If the agent holds the option ($a_h = 1$), the state continues to evolve according to (A.1) and no reward is received. To implement our algorithms, we use the following feature mapping:
$$\phi(s_h, a) = \begin{cases} [\varphi_1(s_h), \cdots, \varphi_d(s_h), 0] & \text{if } a = 1, \\ [0, \cdots, 0, \max\{0, 100 - s_h\}] & \text{if } a = 0, \end{cases}$$
where $\varphi_i(s) = \max\{0, 1 - |s - s_i|/\Delta\}$, $\{s_i\}_{i=1}^{d}$ are anchor states with $s_1 = 80$, $s_{i+1} - s_i = \Delta$, and $\Delta = 60/d$. For more details on the simulated American put option environment, we refer readers to Appendix C of Ma et al. (2022).
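For concreteness, here is a minimal sketch (our own, under the conventions of the feature mapping above, where the payoff coordinate corresponds to a = 0) of the price dynamics (A.1) and the feature mapping.

```python
import numpy as np

def step(s, p0, rng):
    """One step of the price dynamics (A.1): price goes up w.p. p0, down otherwise."""
    return 1.02 * s if rng.random() < p0 else 0.98 * s

def feature(s, a, d):
    """Feature mapping with d anchor states s_i = 80 + (i-1)*Delta, Delta = 60/d, plus a payoff coordinate."""
    delta = 60.0 / d
    anchors = 80.0 + delta * np.arange(d)
    varphi = np.maximum(0.0, 1.0 - np.abs(s - anchors) / delta)      # triangular kernels
    if a == 1:                                                       # hold the option
        return np.concatenate([varphi, [0.0]])
    return np.concatenate([np.zeros(d), [max(0.0, 100.0 - s)]])     # a = 0: payoff feature

rng = np.random.default_rng(0)
s = 100.0
print(step(s, 0.5, rng), feature(s, 0, d=30)[-1])   # one price transition; payoff max(0, 100 - s) = 0
```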

Offline dataset and hyper-parameters. We set p0 = 0.5 in the nominal environment, from which trajectories are collected by a fixed behavior policy that chooses a_h = 0. We set β = 0.1 and γ = 1 in all tasks. For the time-efficiency comparison in Figure 3(a) and Figure 3(c), we measured the time it took the agent to train once and repeated this 5 times to report the average.

B Proof of Properties of d-RRMDPs


In this section, we provide the proofs of results in Sections 3 and 4, namely, the robust regularized
Bellman equation, the existence of the optimal robust policy, and the linear representation of the
robust regularized Q-function under the d-rectangular linear RRMDP.

B.1 Proof of Proposition 3.2


Proof. We prove a stronger statement by induction from the last stage H. Specifically, besides the equations in Proposition 3.2, we further claim that there exist transition kernels $\{\hat{\mu}_t\}_{t=1}^{H}$ with $\hat{P}_t = \langle\phi,\hat{\mu}_t\rangle$ such that for any $(h,s)\in[H]\times\mathcal{S}$,
$$V_h^{\pi,\lambda}(s) = \mathbb{E}^{\{\hat{P}_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\hat{\mu}_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,\pi\bigg]. \quad (B.1)$$
As no transition kernel is involved at the last stage, the base case holds trivially. Suppose the conclusion holds for stage h+1, that is, there exist $\hat{P}_t$, $t=h+1,h+2,\cdots,H$, such that
$$V_{h+1}^{\pi,\lambda}(s) = \mathbb{E}^{\{\hat{P}_t\}_{t=h+1}^{H}}\bigg[\sum_{t=h+1}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\hat{\mu}_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_{h+1}=s,\pi\bigg].$$
For stage h, recalling the definition of $Q_h^{\pi,\lambda}$, we have
$$Q_h^{\pi,\lambda}(s,a) = \inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,a_h=a,\pi\bigg]$$
$$= r_h(s,a)+\inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\bigg\{\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle+\int_{\mathcal{S}}P_h(ds'|s,a)\,\mathbb{E}^{\{P_t\}_{t=h+1}^{H}}\bigg[\sum_{t=h+1}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_{h+1}=s',\pi\bigg]\bigg\}$$
$$\leq r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\bigg\{\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle+\int_{\mathcal{S}}P_h(ds'|s,a)\,\mathbb{E}^{\{\hat{P}_t\}_{t=h+1}^{H}}\bigg[\sum_{t=h+1}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\hat{\mu}_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_{h+1}=s',\pi\bigg]\bigg\}$$
$$= r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle+\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big]\Big\}, \quad (B.2)$$
where (B.2) follows from the inductive hypothesis on $V_{h+1}^{\pi,\lambda}$. On the other hand, we can lower bound $Q_h^{\pi,\lambda}(s,a)$ as
$$Q_h^{\pi,\lambda}(s,a) = r_h(s,a)+\inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\bigg\{\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle+\int_{\mathcal{S}}P_h(ds'|s,a)\,\mathbb{E}^{\{P_t\}_{t=h+1}^{H}}\bigg[\sum_{t=h+1}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_{h+1}=s',\pi\bigg]\bigg\}$$
$$\geq r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\bigg\{\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle+\int_{\mathcal{S}}P_h(ds'|s,a)\inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\mathbb{E}^{\{P_t\}_{t=h+1}^{H}}\bigg[\sum_{t=h+1}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_{h+1}=s',\pi\bigg]\bigg\} \quad (B.3)$$
$$= r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle+\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big]\Big\}, \quad (B.4)$$
where (B.3) follows from Fatou's lemma and (B.4) follows from the definition of $V_{h+1}^{\pi,\lambda}$. Combining the two inequalities above, we conclude (3.5).

Next we focus on the proof of (B.1), for which we establish the existence of the transition kernels $\{\hat{P}_t\}_{t=h}^{H}$. By the fact that
$$Q_h^{\pi,\lambda}(s,a) = r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\},$$
we notice that the infimum above is constrained by the divergence D. Therefore, by Lagrangian duality and the closedness of $\Delta(\mathcal{S})$, there exist $\hat{\mu}_h\in\Delta(\mathcal{S})^d$ and $\hat{P}_h=\langle\phi,\hat{\mu}_h\rangle$ such that
$$Q_h^{\pi,\lambda}(s,a) = r_h(s,a)+\mathbb{E}_{s'\sim\hat{P}_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\hat{\mu}_h\|\mu_h^0)\big\rangle. \quad (B.5)$$
It now remains to prove (3.6). By the definition of $V_h^{\pi,\lambda}(s)$, we have
$$V_h^{\pi,\lambda}(s) = \inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,\pi\bigg] \quad (B.6)$$
$$= \inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\sum_{a\in\mathcal{A}}\pi(a|s)\,\mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,a_h=a,\pi\bigg]$$
$$\leq \sum_{a\in\mathcal{A}}\pi(a|s)\,\mathbb{E}^{\{\hat{P}_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\hat{\mu}_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,a_h=a,\pi\bigg] = \sum_{a\in\mathcal{A}}\pi(a|s)\,Q_h^{\pi,\lambda}(s,a), \quad (B.7)$$
where (B.7) follows from (B.5) and the inductive hypothesis. On the other hand, by the definition of $Q_h^{\pi,\lambda}(s,a)$, we have
$$\sum_{a\in\mathcal{A}}\pi(a|s)\,Q_h^{\pi,\lambda}(s,a) = \sum_{a\in\mathcal{A}}\pi(a|s)\inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,a_h=a,\pi\bigg]$$
$$\leq \inf_{\mu_t\in\Delta(\mathcal{S})^d,\,P_t=\langle\phi,\mu_t\rangle}\sum_{a\in\mathcal{A}}\pi(a|s)\,\mathbb{E}^{\{P_t\}_{t=h}^{H}}\bigg[\sum_{t=h}^{H}\Big(r_t(s_t,a_t)+\lambda\big\langle\phi(s_t,a_t),D(\mu_t\|\mu_t^0)\big\rangle\Big)\,\bigg|\,s_h=s,a_h=a,\pi\bigg] = V_h^{\pi,\lambda}(s), \quad (B.8)$$
where (B.8) follows from the definition of $V_h^{\pi,\lambda}(s)$. Combining (B.7) and (B.8), we obtain
$$V_h^{\pi,\lambda}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\big[Q_h^{\pi,\lambda}(s,a)\big].$$
This proves (3.6) for stage h. Therefore, by an induction argument, we finish the proof of Proposition 3.2.

B.2 Proof of Proposition 3.3


Proof. We define the optimal stationary policy $\pi^{\star} = \{\pi_h^{\star}\}_{h=1}^{H}$ as: for all $(h,s)\in[H]\times\mathcal{S}$,
$$\pi_h^{\star}(s) = \arg\max_{a\in\mathcal{A}}\Big[r_h(s,a) + \inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big].$$
It remains to show that the robust regularized value functions $V_h^{\pi^{\star},\lambda}$ and $Q_h^{\pi^{\star},\lambda}$ induced by the policy $\pi^{\star}$ are optimal, i.e., for all $(h,s)\in[H]\times\mathcal{S}$,
$$V_h^{\pi^{\star},\lambda}(s) = V_h^{\star,\lambda}(s),\qquad Q_h^{\pi^{\star},\lambda}(s,a) = Q_h^{\star,\lambda}(s,a).$$
By (3.2), we only need to prove the first equation above; the optimality of the Q-function then holds trivially. We prove this statement by induction from H to 1. For stage H, the conclusion holds since
$$V_H^{\star,\lambda}(s) = \sup_{\pi}V_H^{\pi,\lambda}(s) = \sup_{\pi}\inf_{\mu_H\in\Delta(\mathcal{S})^d,\,P_H=\langle\phi,\mu_H\rangle}\mathbb{E}^{P_H}\Big[r_H(s_H,a_H)+\lambda\big\langle\phi(s_H,a_H),D(\mu_H\|\mu_H^0)\big\rangle\,\Big|\,s_H=s,\pi\Big]$$
$$= \sup_{\pi}\Big[r_H\big(s,\pi_H(s)\big)+\inf_{\mu_H\in\Delta(\mathcal{S})^d,\,P_H=\langle\phi,\mu_H\rangle}\lambda\big\langle\phi\big(s,\pi_H(s)\big),D(\mu_H\|\mu_H^0)\big\rangle\Big] = V_H^{\pi^{\star},\lambda}(s).$$
Now assume the conclusion holds at stage h+1, i.e., for all $s\in\mathcal{S}$,
$$V_{h+1}^{\pi^{\star},\lambda}(s) = V_{h+1}^{\star,\lambda}(s).$$
For stage h, by (3.2) we have
$$V_h^{\pi^{\star},\lambda}(s) = \mathbb{E}_{a\sim\pi_h^{\star}(\cdot|s)}\big[Q_h^{\pi^{\star},\lambda}(s,a)\big]$$
$$= \mathbb{E}_{a\sim\pi_h^{\star}(\cdot|s)}\Big[r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi^{\star},\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big]$$
$$= \mathbb{E}_{a\sim\pi_h^{\star}(\cdot|s)}\Big[r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big] \quad (B.9)$$
$$= \max_{a\in\mathcal{A}}\Big[r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big], \quad (B.10)$$
where (B.9) holds by the inductive hypothesis and (B.10) holds by the definition of $\pi^{\star}$. On the other hand, recalling the definition of $V_h^{\star,\lambda}(s)$, for any $s\in\mathcal{S}$, by (3.2) we have
$$V_h^{\star,\lambda}(s) = \sup_{\pi}V_h^{\pi,\lambda}(s) = \sup_{\pi}\mathbb{E}_{a\sim\pi_h(\cdot|s)}\big[Q_h^{\pi,\lambda}(s,a)\big]$$
$$= \sup_{\pi}\mathbb{E}_{a\sim\pi_h(\cdot|s)}\Big[r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big]$$
$$\leq \sup_{\pi}\mathbb{E}_{a\sim\pi_h(\cdot|s)}\Big[r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big] \quad (B.11)$$
$$= \max_{a\in\mathcal{A}}\Big[r_h(s,a)+\inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\star,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}\Big] = V_h^{\pi^{\star},\lambda}(s), \quad (B.12)$$
where (B.11) holds by the fact that $V_{h+1}^{\pi,\lambda}(s)\leq V_{h+1}^{\star,\lambda}(s)$ for all $s\in\mathcal{S}$, and (B.12) holds by (B.10). In turn, we trivially have $V_h^{\star,\lambda}(s)\geq V_h^{\pi^{\star},\lambda}(s)$ by the optimality of the value function. Hence, $V_h^{\pi^{\star},\lambda}(s) = V_h^{\star,\lambda}(s)$ for all $s\in\mathcal{S}$. By the induction argument, we conclude the proof.

B.3 Proof of Proposition 4.1


Proof. By Proposition 3.2, we have
$$Q_h^{\pi,\lambda}(s,a) = r_h(s,a) + \inf_{\mu_h\in\Delta(\mathcal{S})^d,\,P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V_{h+1}^{\pi,\lambda}(s')\big]+\lambda\big\langle\phi(s,a),D(\mu_h\|\mu_h^0)\big\rangle\Big\}$$
$$= \big\langle\phi(s,a),\theta_h\big\rangle + \inf_{\mu_h\in\Delta(\mathcal{S})^d}\Big\{\big\langle\phi(s,a),\mathbb{E}_{s'\sim\mu_h}\big[V_{h+1}^{\pi,\lambda}(s')\big]\big\rangle+\lambda\sum_{i=1}^{d}\phi_i(s,a)D(\mu_{h,i}\|\mu_{h,i}^0)\Big\}$$
$$= \big\langle\phi(s,a),\theta_h\big\rangle+\big\langle\phi(s,a),w_h^{\pi,\lambda}\big\rangle = \big\langle\phi(s,a),\theta_h+w_h^{\pi,\lambda}\big\rangle,$$
where the second equality uses Assumption 3.1, and the third holds because the infimum decouples across coordinates since $\phi_i(s,a)\geq 0$. Hence we conclude the proof.

B.4 Proof of Proposition 4.2


Proof. The optimization problem can be formalized as:
X
inf Es∼µ V (s) + λDTV (µ∥µ0 ) subject to µ(s) = 1, µ(s) ≥ 0.
µ
s

Denote y(s) = µ(s) − µ0 (s), the objective function can be rewritten as:
X X
Es∼µ V (s) + λDTV (µ∥µ0 ) = µ(s)V (s) + λ/2 |µ(s) − µ0 (s)|
s s

20
X X
= V (s)(y(s) + µ0 (s)) + λ/2 |y(s)|
s
X X
= Es∼µ0 V (s) + V (s)y(s) + λ/2 |y(s)|.
s

Recall the constraints $\sum_s y(s) = 0$ and $y(s)\ge -\mu^0(s)$. By Lagrange duality, we establish the Lagrangian function:

$$L = \min_y\max_{\mu\ge 0,\, r\in\mathbb{R}}\Big\{\sum_s\big[y(s)\big(V(s)-\mu(s)-r\big) + \tfrac{\lambda}{2}|y(s)|\big] - \sum_s\mu(s)\mu^0(s)\Big\}.$$

In order to achieve minimax optimality, for any $s$, the term $y(s)(V(s)-\mu(s)-r) + \frac{\lambda}{2}|y(s)|$ should admit a bounded lower bound with respect to $y(s)$, which requires that $\mu(s)$ and $r$ satisfy the following condition:

$$\forall s\in\mathcal{S},\ |V(s)-\mu(s)-r| \le \lambda/2 \ \Rightarrow\ \max_{s\in\mathcal{S}}\{V(s)-\mu(s)\} - \min_{s\in\mathcal{S}}\{V(s)-\mu(s)\} \le \lambda.$$

Under the constraint above, denoting $g(s) := V(s) - \mu(s)$, we have

$$L = \max_{\mu\ge 0,\, r\in\mathbb{R}}\min_y\Big\{\sum_s\big[y(s)\big(V(s)-\mu(s)-r\big) + \tfrac{\lambda}{2}|y(s)|\big] - \sum_s\mu(s)\mu^0(s)\Big\}$$
$$= \max_{\max_{s\in\mathcal{S}}(V(s)-\mu(s)) - \min_{s\in\mathcal{S}}(V(s)-\mu(s))\le\lambda}\ -\sum_s\mu(s)\mu^0(s) = \max_{\max_s g(s)-\min_s g(s)\le\lambda,\ g(s)\le V(s)}\Big\{\sum_s g(s)\mu^0(s)\Big\} - \mathbb{E}_{s\sim\mu^0}[V(s)].$$

Thus we have

$$\inf_\mu\Big\{\mathbb{E}_{s\sim\mu}[V(s)] + \lambda D_{\mathrm{TV}}(\mu\|\mu^0)\Big\} = \mathbb{E}_{s\sim\mu^0}[V(s)] + \max_{\max_s g(s)-\min_s g(s)\le\lambda,\ g(s)\le V(s)}\Big\{\sum_s g(s)\mu^0(s)\Big\} - \mathbb{E}_{s\sim\mu^0}[V(s)]$$
$$= \mathbb{E}_{s\sim\mu^0}\big[[V(s)]_{V_{\min}+\lambda}\big], \qquad (B.13)$$

where (B.13) holds by directly solving the max problem. Hence we conclude the proof.
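For illustration only, the closed form (B.13) can be checked numerically against the original constrained problem on a small randomly generated instance. The script below is a minimal sketch (not part of the analysis) that solves the primal as a linear program and compares it with $\mathbb{E}_{s\sim\mu^0}\big[\min\{V(s), V_{\min}+\lambda\}\big]$; the state-space size, $\lambda$, and the random $V$ and $\mu^0$ are arbitrary choices.

```python
# Numerical sanity check of the TV dual form (B.13); illustrative sketch only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, lam = 6, 0.7                      # small state space and regularizer (arbitrary)
V = rng.uniform(0.0, 3.0, size=S)    # value function V : S -> [0, H]
mu0 = rng.dirichlet(np.ones(S))      # nominal distribution mu^0

# Primal LP: min_mu  sum_s mu(s)V(s) + (lam/2) sum_s |mu(s) - mu0(s)|,
# with variables x = (mu, t) and t(s) >= |mu(s) - mu0(s)|.
c = np.concatenate([V, (lam / 2) * np.ones(S)])
A_ub = np.block([[np.eye(S), -np.eye(S)],      #  mu - t <=  mu0
                 [-np.eye(S), -np.eye(S)]])    # -mu - t <= -mu0
b_ub = np.concatenate([mu0, -mu0])
A_eq = np.concatenate([np.ones(S), np.zeros(S)])[None, :]   # sum_s mu(s) = 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (2 * S))

# Closed form from (B.13): E_{s ~ mu0}[ min{ V(s), min_s V(s) + lam } ]
closed_form = mu0 @ np.minimum(V, V.min() + lam)
print(res.fun, closed_form)          # should agree up to solver tolerance
```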

B.5 Proof of Proposition 4.7


Proof. Similar to the proof of Proposition 4.2, define $y(s) = \mu(s) - \mu^0(s)$. By Lagrange duality, we have:

$$L = \sum_s V(s)\big(y(s)+\mu^0(s)\big) + \lambda\sum_s\frac{y(s)^2}{\mu^0(s)} - \sum_s\big(\mu^0(s)+y(s)\big)\mu(s) - r\sum_s y(s)$$
$$= \sum_s\Big[\frac{\lambda y(s)^2}{\mu^0(s)} + y(s)\big(V(s)-\mu(s)-r\big)\Big] + \sum_s V(s)\mu^0(s) - \sum_s\mu^0(s)\mu(s).$$

Noticing that $L$ is a quadratic function of $y(s)$, after fixing $\mu(s)$ and $r$ and minimizing with respect to $y(s)$, we have

$$L = -\frac{1}{4\lambda}\sum_s\mu^0(s)\big(V(s)-\mu(s)-r\big)^2 + \sum_s V(s)\mu^0(s) - \sum_s\mu^0(s)\mu(s)$$
$$= -\frac{1}{4\lambda}\Big[\mathbb{E}_{s\sim\mu^0}\big[(V-\mu)^2\big] - \big(\mathbb{E}_{s\sim\mu^0}[V-\mu]\big)^2\Big] + \mathbb{E}_{s\sim\mu^0}[V-\mu] \qquad (B.14)$$
$$= \mathbb{E}_{s\sim\mu^0}[V-\mu] - \frac{1}{4\lambda}\mathrm{Var}_{s\sim\mu^0}[V-\mu]$$
$$= \sup_{\alpha\in[V_{\min},V_{\max}]}\ \mathbb{E}_{s\sim\mu^0}\big[V(s)\big]_\alpha - \frac{1}{4\lambda}\mathrm{Var}_{s\sim\mu^0}\big[V(s)\big]_\alpha, \qquad (B.15)$$

where (B.14) comes from maximizing over $r$, and (B.15) comes from maximizing over $\mu(s)$, $s\in\mathcal{S}$, together with the observation that $\mu(s)$ equals $0$ or $V(s)$ for all $s\in\mathcal{S}$ when achieving the maximum. Hence, we conclude the proof.
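As with the TV case, the identity (B.15) can be checked numerically on a small example. The sketch below is illustrative only and reads $[V]_\alpha$ as clipping $V$ at $\alpha$ from above, with the mean and variance of the clipped values taken under $\mu^0$; the instance size, $\lambda$, and the $\alpha$-grid are arbitrary assumptions.

```python
# Numerical sanity check of the chi-square dual form (B.15); illustrative sketch only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
S, lam = 5, 0.3
V = rng.uniform(0.0, 2.0, size=S)
mu0 = rng.dirichlet(np.ones(S))

def primal(mu):
    # E_mu[V] + lam * chi^2(mu || mu0)
    return mu @ V + lam * np.sum((mu - mu0) ** 2 / mu0)

cons = ({"type": "eq", "fun": lambda mu: mu.sum() - 1.0},)
res = minimize(primal, mu0, bounds=[(0.0, 1.0)] * S, constraints=cons, method="SLSQP")

# Dual: sup_alpha E_{mu0}[[V]_alpha] - Var_{mu0}([V]_alpha) / (4*lam)
vals = []
for a in np.linspace(V.min(), V.max(), 2001):
    clipped = np.minimum(V, a)
    m = mu0 @ clipped
    vals.append(m - (mu0 @ (clipped - m) ** 2) / (4 * lam))
print(res.fun, max(vals))   # the two values should be close
```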

C Proof of the Upper Bounds of Suboptimality


In this section, we prove Theorem 5.1 and Corollary 5.4. For simplicity, we denote $\phi_h^\tau = \phi(s_h^\tau, a_h^\tau)$. According to the robust regularized Bellman equation in Proposition 3.2, we first define the robust regularized Bellman operator: for any $(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}$ and any function $V:\mathcal{S}\to[0,H]$,

$$\mathbb{T}_h^\lambda V(s,a) := r_h(s,a) + \inf_{\mu_h\in\Delta(\mathcal{S})^d,\, P_h=\langle\phi,\mu_h\rangle}\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[V(s')\big] + \lambda\langle\phi(s,a), D(\mu_h\|\mu_h^0)\rangle. \qquad (C.1)$$

We have $Q_h^{\pi,\lambda}(s,a) = \mathbb{T}_h^\lambda V_{h+1}^{\pi,\lambda}(s,a)$.

C.1 Proof of Theorem 5.1


We start by bounding the suboptimality gap by the estimation uncertainty in the following lemma.

Lemma C.1. If the following inequality holds for any $(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}$:
$$\big|\mathbb{T}_h^\lambda\hat{V}_{h+1}^\lambda(s,a) - \langle\phi(s,a), \hat{w}_h^\lambda\rangle\big| \le \Gamma_h(s,a),$$
then we have
$$\mathrm{SubOpt}(\hat\pi, s, \lambda) \le 2\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\Gamma_h(s_h,a_h)\,\Big|\, s_1 = s\Big].$$

C.1.1 Proof of Theorem 5.1 - Case with the TV Divergence


For completeness, we present R2PVI specific to the TV distance in Algorithm 2, which gives a
closed form solution of (4.3). Now we present the upper bound of weights as follows.

Lemma C.2 (Bound of weights - TV). For any $h\in[H]$, we have
$$\|w_h^\lambda\|_2 \le H\sqrt{d}, \qquad \|\hat{w}_h^\lambda\|_2 \le H\sqrt{\frac{Kd}{\gamma}}.$$

Proof of Theorem 5.1 - TV. The R2PVI algorithm with TV divergence is presented in Algorithm 2. We derive an upper bound on the estimation uncertainty $\Gamma_h(s,a)$ to prove the theorem.
Algorithm 2 Robust Regularized Pessimistic Value Iteration under TV distance (R2PVI-TV)
Require: Dataset D, regularizer λ > 0, γ > 0 and parameter β
1: init V̂_{H+1}^λ(·) = 0
2: for episode h = H, · · · , 1 do
3:   Λ_h ← Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ)(ϕ(s_h^τ, a_h^τ))^⊤ + γI
4:   α_{h+1} ← min_{s∈S}{V̂_{h+1}^λ(s)} + λ
5:   ŵ_h^λ ← Λ_h^{-1}(Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ)[V̂_{h+1}^λ(s_{h+1}^τ)]_{α_{h+1}})    ◁ Estimated by (4.3)
6:   Γ_h(·, ·) ← β Σ_{i=1}^d ∥ϕ_i(·, ·)1_i∥_{Λ_h^{-1}}
7:   Q̂_h^λ(·, ·) ← min{ϕ(·, ·)^⊤(θ_h + ŵ_h^λ) − Γ_h(·, ·), H − h + 1}^+
8:   π̂_h(·|·) ← argmax_{π_h}⟨Q̂_h^λ(·, ·), π_h(·|·)⟩_A and V̂_h^λ(·) ← ⟨Q̂_h^λ(·, ·), π̂_h(·|·)⟩_A
9: end for
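For readers who prefer runnable pseudocode, the following is a minimal Python sketch of the backward pass of Algorithm 2 on a toy problem with one-hot features (so $\phi_i(s,a)\mathbf{1}_i$ has a single nonzero coordinate). The horizon, dataset size, reward parameters, the fixed choice of $\beta$, and the action-independent data-generating kernel are all illustrative assumptions and do not reproduce the paper's experimental setup.

```python
# Minimal sketch of R2PVI-TV (Algorithm 2) with one-hot features; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
H, K, nS, nA, lam, gamma, beta = 4, 500, 5, 3, 1.0, 1.0, 0.1
d = nS                                          # one-hot features: phi(s, a) = e_s
theta = rng.uniform(0.0, 1.0 / H, size=(H, d))  # reward parameters, r_h = <phi, theta_h>
P0 = rng.dirichlet(np.ones(nS), size=nS)        # nominal kernel (state-only, for data)

# Offline trajectories collected in the nominal environment with uniform actions.
data, s = [], rng.integers(nS, size=K)
for h in range(H):
    a = rng.integers(nA, size=K)
    s_next = np.array([rng.choice(nS, p=P0[si]) for si in s])
    data.append((s.copy(), a, s_next))
    s = s_next

V_next = np.zeros(nS)                           # V_hat_{H+1} = 0
for h in reversed(range(H)):
    s_h, a_h, s_n = data[h]
    Phi = np.eye(d)[s_h]                        # K x d feature matrix
    Lam = Phi.T @ Phi + gamma * np.eye(d)       # line 3
    Lam_inv = np.linalg.inv(Lam)
    alpha = V_next.min() + lam                  # line 4
    w_hat = Lam_inv @ (Phi.T @ np.minimum(V_next[s_n], alpha))   # line 5
    Gamma = beta * np.sqrt(np.diag(Lam_inv))    # line 6, simplified by one-hot features
    # lines 7-8: the toy features ignore the action, so the greedy max is trivial
    V_next = np.clip(theta[h] + w_hat - Gamma, 0.0, H - h)
print("pessimistic robust value estimate V_hat_1:", V_next)
```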

We first decompose the difference between the regularized robust Bellman operator $\mathbb{T}_h^\lambda$ and its empirical counterpart as

$$\mathbb{T}_h^\lambda\hat{V}_{h+1}^\lambda(s,a) - \langle\phi(s,a), \hat{w}_h^\lambda\rangle \qquad (C.2)$$
$$= \sum_{i=1}^d\phi_i(s,a)\big(w_{h,i}^\lambda - \hat{w}_{h,i}^\lambda\big) = \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\big(w_h^\lambda - \hat{w}_h^\lambda\big)$$
$$= \gamma\sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\Lambda_h^{-1}w_h^\lambda + \sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top\Lambda_h^{-1}\sum_{\tau=1}^K\phi(s_h^\tau,a_h^\tau)\eta_h^\tau\big([\hat{V}_{h+1}^\lambda(s)]_{\alpha_{h+1}}\big) \qquad (C.3)$$
$$\le \gamma\sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\Lambda_h^{-1}}\underbrace{\|w_h^\lambda\|_{\Lambda_h^{-1}}}_{(i)} + \sum_{i=1}^d\|\phi_i(s,a)\mathbf{1}_i\|_{\Lambda_h^{-1}}\underbrace{\Big\|\sum_{\tau=1}^K\phi(s_h^\tau,a_h^\tau)\eta_h^\tau\big([\hat{V}_{h+1}^\lambda(s)]_{\alpha_{h+1}}\big)\Big\|_{\Lambda_h^{-1}}}_{(ii)}, \qquad (C.4)$$

where (C.3) comes from the definition of ŵhλ , while (C.4) follows by the Cauchy-Schwartz inequality.
By Lemma C.2 and the facts that $\hat{V}_{h+1}^\lambda(s)\le H$ and $\gamma = 1$, we have

$$(i) = \|w_h^\lambda\|_{\Lambda_h^{-1}} \le \|\Lambda_h^{-1}\|^{1/2}\,\|w_h^\lambda\|_2 \le H\sqrt{d},$$

where the last inequality comes from the fact that $\|\Lambda_h^{-1}\|\le\gamma^{-1}$. Now it remains to bound term (ii). Since $\hat{V}_{h+1}^\lambda$ depends on the data, it is difficult to bound (ii) directly by a concentration inequality. Instead, we focus on the function class $\mathcal{V}_h(R_0, B_0, \gamma)$:

Vh (R0 , B0 , γ) = {Vh (x; θ, β, Λ) : S → [0, H], ∥θ∥2 ≤ R0 , β ∈ [0, B0 ], γmin (Λh ) ≥ γ},

where $V_h(x;\theta,\beta,\Lambda) = \max_{a\in\mathcal{A}}\big[\phi(s,a)^\top\theta - \beta\sum_{i=1}^d\|\phi_i(s,a)\|_{\Lambda_h^{-1}}\big]_{[0,H-h+1]}$. By Lemma C.2 and the definition of $\hat{V}_{h+1}^\lambda$, when we set $R_0 = H\sqrt{Kd/\gamma}$ and $B_0 = \beta = 16Hd\sqrt{\xi_{\mathrm{TV}}}$, it holds that $\hat{V}_{h+1}^\lambda\in\mathcal{V}_{h+1}(R_0, B_0, \gamma)$. Next we construct an $\epsilon$-cover of $\mathcal{V}_{h+1}(R_0, B_0, \gamma)$, so that the term (ii)
can be upper bounded. Let Nh (ϵ; R0 , B0 , γ) be the minimum ϵ-cover of Vh (R0 , B0 , λ) with respect
to the supreme norm, Nh ([0, H]) be the minimum ϵ-cover of [0, H] respectively. In other words, for
any function V ∈ Vh (R0 , B0 , γ), αh+1 ∈ [0, H], there exists a function V ′ ∈ Vh (R0 , B0 , γ) and a real
number αϵ ∈ [0, H] such that:

sup |V (s) − V ′ (s)| ≤ ϵ, |αϵ − αh+1 | ≤ ϵ.


s∈S

By Cauchy-Schwartz inequality and the fact that ∥a + b∥2Λ−1 ≤ 2∥a∥2Λ−1 + 2∥b∥2Λ−1 and the definition
h h h
of the term (ii), we have
K K
X 2 X 2
(ii)2 ≤ 2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
] αϵ ) +2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ λ
]αh+1 − [V̂h+1 ] αϵ )
Λ−1
h Λ−1
h
τ =1 τ =1
K K
X

2 X

2 2ϵ2 K 2
≤4 ϕ(sτh , aτh )ηhτ ([Vh+1 ] αϵ ) +4 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]αϵ − [Vh+1 ] αϵ ) + ,
Λ−1
h Λ−1
h γ
τ =1 τ =1
(C.5)

where (C.5) follows by the fact that


K K
X 2 X 2ϵ2 K 2
2 λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 λ
]αh+1 − [V̂h+1 ]αϵ ) ≤ 2ϵ2 |ϕτh Λ−1 τ⊤
h ϕh | ≤ .
Λ−1
h γ
τ =1 τ =1,τ ′ =1

Meanwhile, by the fact that |[V̂h+1


λ ]
αϵ − [Vh+1 ]αϵ | ≤ |V̂h+1 − Vh+1 |, we have
′ λ ′

K
X 2

4 λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]αϵ − [Vh+1 ] αϵ ) (C.6)
Λ−1
h
τ =1
K

X
≤4 |ϕτh Λ−1 τ τ λ ′
h ϕh | max |ηh ([V̂h+1 ]αϵ − [Vh+1 ]αϵ )|
2

τ =1,τ ′ =1
K

X
≤ 4ϵ 2
|ϕτh Λ−1 τ
h ϕh |
τ =1,τ ′ =1
4ϵ K 2
2
≤ . (C.7)
γ
By applying the (C.7) into (C.5), we have
K
X

2 6ϵ2 K 2
2
(ii) ≤ 4 sup ϕ(sτh , aτh )ηhτ ([Vh+1 ] αϵ ) + . (C.8)
V ′ ∈Nh (ϵ;R0 ,B0 ,γ),αϵ ∈Nh ([0,H]) Λ−1
h γ
τ =1

By Lemma F.3, applying a union bound over Nh (ϵ; R0 , B0 , γ) and Nh ([0, H]), with probability at
least 1 − δ/2H, we have
K
X

2 6ϵ2 K 2
4 sup ϕ(sτh , aτh )ηhτ ([Vh+1 ] αϵ ) +
V ′ ∈Nh (ϵ;R0 ,B0 ,γ),αϵ ∈Nh ([0,H]) Λ−1
h γ
τ =1

 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  6ϵ2 K 2
≤ 4H 2 2 log + d log(1 + K/γ) + . (C.9)
δ γ
Applying Lemma F.1, we have
log |Nh (ϵ; R0 , B0 , λ)| ≤ d log(1 + 4R0 /ϵ) + d2 log(1 + 8d1/2 B02 /γϵ2 )
= d log(1 + 4K 3/2 d−1/2 ) + d2 log(1 + 8d−3/2 B02 K 2 H −2 )
≤ 2d2 log(1 + 8d−3/2 B02 K 2 H −2 ). (C.10)
Similarly, by Lemma F.2, we have
|Nh ([0, H])| ≤ 3H/ϵ.
Combining (C.10) with (C.8) and (C.9), by setting ϵ = dH/K, we have
 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  6ϵ2 K 2
(ii)2 ≤ 4H 2 2 log + d log(1 + K/γ) +
δ γ
≤ 4H 2 (4d2 log(1 + 8d−3/2 B02 K 2 H −2 ) + log(3K/d) + d log(1 + K) + 2 log 2H/δ) + 6d2 H 2
≤ 16H 2 d2 (log(1 + 8d−3/2 B02 K 2 H −2 ) + log(1 + K)/d + 3/8 + log H/δ)
≤ 32H 2 d2 log 8d−3/2 B02 K 2 H −1 /δ
= 32H 2 d2 log 1024Hd1/2 K 2 ξTV /δ
= 32H 2 d2 (log 1024Hd1/2 K 2 /δ + log ξTV )
β2
≤ 64H 2 d2 ξTV := .
4
Recall the upper bound in (C.4), we have with probability at least 1 − δ,
λ
|Thλ V̂h+1 (s, a) − ⟨ϕ(s, a), ŵhλ ⟩|
d
X d
X K
X
≤γ ∥ϕi (s, a)1i ∥Λ−1 ∥whλ ∥Λ−1 + ∥ϕi (s, a)1i ∥Λ−1 λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 (s)]αϵ )
h h h Λ−1
h
i=1 i=1 τ =1
d
X √
≤ ∥ϕi (s, a)1i ∥Λ−1 (H d + β/2)
h
i=1
Xd
≤β ∥ϕi (s, a)1i ∥Λ−1 , (C.11)
h
i=1

where (C.11) follows from the fact that $2H\sqrt{d}\le\beta$. Hence, the condition of Lemma C.1 is satisfied, and we can upper bound the suboptimality gap as:

$$\mathrm{SubOpt}(\hat\pi,s,\lambda) \le 2\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\Gamma_h(s_h,a_h)\,\Big|\,s_1=s\Big] = 2\beta\cdot\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\sum_{i=1}^d\|\phi_i(s_h,a_h)\mathbf{1}_i\|_{\Lambda_h^{-1}}\,\Big|\,s_1=s\Big].$$

This concludes the proof.

C.1.2 Proof of Theorem 5.1 - Case with KL Divergence

Algorithm 3 Robust Regularized Pessimistic Value Iteration under KL distance (R2PVI-KL)
Require: Dataset D, regularizer λ > 0, γ > 0 and parameter β
1: init V̂_{H+1}^λ(·) = 0
2: for episode h = H, · · · , 1 do
3:   Λ_h ← Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ)(ϕ(s_h^τ, a_h^τ))^⊤ + γI
4:   ŵ'_h ← Λ_h^{-1} Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ) e^{−V̂_{h+1}^λ(s_{h+1}^τ)/λ}    ◁ Estimated by (4.5)
5:   ŵ_h^λ ← −λ log max{ŵ'_h, e^{−H/λ}}
6:   Γ_h(·, ·) ← β Σ_{i=1}^d ∥ϕ_i(·, ·)1_i∥_{Λ_h^{-1}}
7:   Q̂_h^λ(·, ·) ← min{ϕ(·, ·)^⊤(θ_h + ŵ_h^λ) − Γ_h(·, ·), H − h + 1}^+
8:   π̂_h(·|·) ← argmax_{π_h}⟨Q̂_h^λ(·, ·), π_h(·|·)⟩_A and V̂_h^λ(·) ← ⟨Q̂_h^λ(·, ·), π̂_h(·|·)⟩_A
9: end for
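The only step that changes relative to Algorithm 2 is the weight estimate in lines 4-5. A short Python sketch of that step is given below; the feature matrix and next-stage value targets are random placeholders, and only the two update lines mirror the pseudocode above.

```python
# Sketch of lines 4-5 of Algorithm 3 (KL case); data shapes are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
K, d, H, lam, gamma = 200, 4, 5, 1.0, 1.0
Phi = rng.dirichlet(np.ones(d), size=K)        # K x d features phi(s_h^tau, a_h^tau)
V_next = rng.uniform(0.0, H, size=K)           # placeholder V_hat_{h+1}^lambda(s_{h+1}^tau)

Lam = Phi.T @ Phi + gamma * np.eye(d)
w_prime = np.linalg.solve(Lam, Phi.T @ np.exp(-V_next / lam))     # line 4
w_kl = -lam * np.log(np.maximum(w_prime, np.exp(-H / lam)))       # line 5
print(w_kl)   # enters Q_hat_h via phi(s,a)^T (theta_h + w_kl) - Gamma_h(s,a)
```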

For completeness, we present the R2PVI algorithm specific to the KL distance in Algorithm 3,
which gives closed form solution of (4.6). Our proof relies on the following lemmas on bounding the
regression parameter and ϵ-covering number of the robust value function class.

Lemma C.3 (Bound of weights - KL). For any $h\in[H]$,
$$\|w_h^\lambda\|_2 \le \sqrt{d}, \qquad \|\hat{w}_h'\|_2 \le \sqrt{\frac{Kd}{\gamma}}.$$

Lemma C.4 (Bound of covering number - KL). For any $h\in[H]$, let $\mathcal{V}_h$ denote a class of functions mapping from $\mathcal{S}$ to $\mathbb{R}$ with the following form:
$$V_h(x;\theta,\beta,\Lambda_h) = \max_{a\in\mathcal{A}}\Big\{\phi(s,a)^\top\theta - \lambda\log\Big(1+\beta\sum_{i=1}^d\|\phi_i(\cdot,\cdot)\mathbf{1}_i\|_{\Lambda_h^{-1}}\Big)\Big\}_{[0,H-h+1]},$$
where the parameters $(\theta,\beta,\Lambda_h)$ satisfy $\|\theta\|_2\le L$, $\beta\in[0,B]$, $\gamma_{\min}(\Lambda_h)\ge\gamma$. Let $\mathcal{N}_h(\epsilon)$ be the $\epsilon$-covering number of $\mathcal{V}_h$ with respect to the distance $\mathrm{dist}(V_1,V_2)=\sup_x|V_1(x)-V_2(x)|$. Then
$$\log|\mathcal{N}_h(\epsilon)| \le d\log(1+4L/\epsilon) + d^2\log\big(1+8\lambda^2 d^{1/2}B^2/(\gamma\epsilon^2)\big).$$

Proof. The R2PVI with KL-divergence is presented in Algorithm 3. Similar to the proof of TV
divergence, we decompose the estimation uncertainty between Thλ and T̂hλ as:

|Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩| = ϕ(s, a)⊤ θh − λ log whλ − θh + λ log max{ŵh′ , e−H/λ }


= ϕ(s, a)⊤ λ log max{ŵh′ , e−H/λ } − λ log whλ




d ′ , e−H/λ }
X max{ŵh,i
=λ ϕi (s, a) log λ
i=1
wh,i
d ′ , e−H/λ }
X max{ŵh,i
≤λ ϕi (s, a) log λ
i=1
wh,i

d
X

ϕi (s, a) log 1 + eH/λ | max{ŵh,i , e−H/λ } − wh,i
λ
(C.12)

≤λ |
i=1
Xd

ϕi (s, a) log 1 + eH/λ |ŵh,i λ
(C.13)

≤λ − wh,i |
i=1
d
X d
X 

≤ λ log ϕi (s, a) + e H/λ
ϕi (s, a)|ŵh,i λ
− wh,i | (C.14)
i=1 i=1
 d
X 
= λ log 1 + eH/λ ϕi (s, a)1⊤
i |ŵ ′
h − w λ
|
h ,
i=1

where (C.12) and (C.13) comes from the fact that:


h V̂hλ (s′ ) i  |A − B| 
λ
wh,i = Es′ ∼µ0 e− λ ≥ Es′ ∼µ0 [e−H/λ ] = e−H/λ , | log A − log B| = log 1 + ,
h,i h,i min{A, B}

and (C.14) comes from Jensen's inequality applied to the function $\log(x)$. Therefore, our next goal is to bound the term $\sum_{i=1}^d\phi_i(s,a)\mathbf{1}_i^\top|\hat{w}_h' - w_h^\lambda|$. Specifically, we have
′ λ

d
X
ϕi (s, a)1⊤ ′ λ
i |ŵh − wh |
i=1
d K λ (s
V̂h+1
X X h+1 )
−1
= ϕi (s, a)1⊤ λ
i wh − Λh (ϕτh )e− λ

i=1 τ =1
d K K K λ (s
V̂h+1
X X X X h+1 )
= ϕi (s, a)1⊤
i whλ − Λ−1
h ϕτh (ϕτh )⊤ whλ + Λ−1
h ϕτh (ϕτh )⊤ whλ − Λ−1
h (ϕτh )e− λ

i=1 τ =1 τ =1 τ =1
d K K K λ (s
V̂h+1 h+1 )
X  X X X 
= ϕi (s, a)1⊤
i whλ − Λ−1
h ϕτh (ϕτh )⊤ whλ + Λ−1
h ϕτh (ϕτh )⊤ whλ − Λ−1
h (ϕτh )e− λ .
i=1 τ =1 τ =1 τ =1
| {z } | {z }
(i) (ii)

Next, we upper bound term (i) and (ii), respectively. For the first term, we have:
d
X d
X
−1
ϕi (s, a)1⊤
i · (i) = ϕi (s, a)1⊤ λ λ
i (|wh − Λh (Λh − γI)wh |)
i=1 i=1
Xd
−1
=γ ϕi (s, a)1⊤ λ
i Λh |wh |
i=1
Xd
≤γ ∥ϕi (s, a)1i ∥Λ−1 ∥whλ ∥Λ−1 (C.15)
h h
i=1
√ Xd
≤γ d ∥ϕi (s, a)1i ∥Λ−1 , (C.16)
h
i=1

where (C.15) follows from the Cauchy-Schwartz inequality, (C.16) follows from the fact that:

∥whλ ∥Λ−1 ≤ ∥Λ−1 1/2


p
h ∥ ∥whλ ∥2 ≤ d/γ,
h

where the last inequality follows from Lemma C.3 and the fact that ∥Λ−1 h ∥ ≤ γ . Now it remains
−1

to bound the term (ii), by the definition of ηhτ (f ) = Es′ ∼P 0 (·|sτ ,aτ ) [f (s′ )] − f (sτh+1 ), the term (ii) can
h h h
be rewritten as:
d K λ (s
V̂h+1 h+1 )
X X h i
−1
(ii) = ϕi (s, a)1⊤
i Λh ϕτh (ϕτh )⊤ whλ − e− λ

i=1 τ =1
Xd XK  λ (s)
V̂h+1 
= ϕi (s, a)1⊤
i Λ−1
h ϕτh ηhτ e − λ

i=1 τ =1
Xd K
X λ (s) 
 V̂h+1
≤ ∥ϕi (s, a)1i ∥Λ−1 ϕτh ηhτ e− λ .
h Λ−1
h
i=1 τ =1
| {z }
(iii)

For the rest of the proof, it remains to bound term (iii). Since $\hat{V}_{h+1}^\lambda$ depends on the offline dataset, it is difficult to bound this term directly with a concentration inequality due to the dependence issue; instead, we establish a uniform concentration bound for term (iii), i.e., we aim to upper bound the following term:
K
X V
sup ϕτh ηhτ (e− λ ) .
V ∈Vh+1 (R,B,γ) Λ−1
h
τ =1

Here for all h ∈ [H], the function class is defined as:

Vh (R, B, γ) = {Vh (x; θ, β, Λh ) : ∥θ∥2 ≤ R, β ∈ [0, B], γmin (Λh ) ≥ γ},


Pd
where Vh (x; θ, β, Λh ) = maxa∈A {ϕ(s, a)⊤ θ − λ log(1 + β i=1 ∥ϕi (·, ·)1i ∥Λ−1 )}[0,H−h+1] . In order
h
to ensure V̂h+1
λ ∈ Vh+1 (R0 , B0 , λ), we need to bound θ̂h = θh − λ log max{ŵh′ , e−H/λ }. Following
the fact that:

∥θ̂h ∥2 ≤ ∥θh ∥2 + λ∥ log max{ŵh′ , e−H/λ }∥2 .

By Lemma C.3, e−H/λ ≤ max{ŵh,i ′ , e−H/λ } ≤ max{∥ŵ ′ ∥, e−H/λ } ≤ max{ Kd/λ, e−H/λ }, there-
p
h
fore the term can be bounded as:
√ √
r
 Kd 
∥θ̂h ∥2 ≤ d + λ d max log , H/λ
λ
√ √
≤ H d + d Kλ

≤ 2Hd Kλ. (C.17)

Hence, we can choose $R_0 = 2Hd\sqrt{K\lambda}$ and $B_0 = \beta = 16d\lambda e^{H/\lambda}\sqrt{H/\lambda + \xi_{\mathrm{KL}}}$; then for all $h\in[H]$ we have $\hat{V}_{h+1}^\lambda\in\mathcal{V}_{h+1}(R_0, B_0, \lambda)$. Next we construct an $\epsilon$-cover of $\mathcal{V}_{h+1}(R_0, B_0, \gamma)$ so that term (iii) can be upper bounded. For all $\epsilon\in(0,\lambda)$ and $h\in[H]$, let $\mathcal{N}_h(\epsilon; R, B, \lambda) := \mathcal{N}_h(\epsilon)$ denote the minimal $\epsilon$-cover of $\mathcal{V}_h(R,B,\lambda)$ with respect to the supremum norm. In other words, for any function $\hat{V}^\lambda\in\mathcal{V}_h(R,B,\lambda)$, there exists a function $V'\in\mathcal{N}_{h+1}(\epsilon)$ such that
$$\sup_{x\in\mathcal{S}}\big|\hat{V}_{h+1}^\lambda(x) - V'_{h+1}(x)\big| \le \epsilon.$$

Hence, given V̂h+1 λ ,V ′


h+1 satisfying the inequality above, recall the definition of ηh = ηh (f ) =
τ τ

Es′ ∼P 0 (·|sτ ,aτ ) [f (s′ )] − f (sτh+1 ), we have:


h h h

 λ (s)
V̂h+1   ′
Vh+1 (s) 
− −
ηhτ e λ − ηhτ e λ

λ (s)
V̂h+1 ′
Vh+1 (s) i λ (s
V̂h+1 ′
h+1 ) Vh+1 (sh+1 )
h
≤ Es∼P 0 (·|sτ ,aτ ) e− λ − e− λ − e− λ + e− λ
h h h
λ (s)
h V̂h+1 ′
Vh+1 (s) i λ (s
V̂h+1 ′
h+1 ) Vh+1 (sh+1 )
≤ Es∼P 0 (·|sτ ,aτ ) e− λ − e− λ + e− λ − e− λ
h h h

≤ 2ϵ/λ + 2ϵ/λ = 4ϵ/λ, (C.18)

where (C.18) follows from the fact that for any s ∈ S,


λ (s)
V̂h+1 ′
Vh+1 (s) λ (s)−V ′
|V̂h+1 h+1 (s)| ϵ
e− λ − e− λ ≤e λ − 1 ≤ e λ − 1 ≤ 2ϵ/λ,

where the last inequality is held by the fact that ϵ ∈ (0, λ). By the Cauchy-Schwartz inequality, for any
two vectors a, b ∈ Rd and positive definite matrix Λ ∈ Rd×d , it holds that ∥a + b∥2Λ ≤ 2∥a∥2Λ + 2∥b∥2Λ ,
hence for all h ∈ [H], we have:
K ′ K ′ λ (s) i
X  Vh+1 (s)  2 X h  Vh+1 (s)   V̂h+1 2
|(iii)|2 ≤ 2 ϕτh ηhτ e− λ +2 ϕτh ηhτ e− λ − ηhτ e− λ
Λ−1
h Λ−1
h
τ =1 τ =1
K ′ K
X  Vh+1 (s)  2 X ′
≤2 ϕτh ηhτ e − λ + 32ϵ /λ 2 2
|ϕτh Λ−1 τ
h ϕh |
Λ−1
h
τ =1 τ,τ ′ =1
K
X  V (s)  2 32ϵ2 K 2
≤2 sup ϕτh ηhτ e− λ + . (C.19)
V ∈Nh+1 (ϵ) Λ−1
h λ2 γ
τ =1
V (s)
We set f (s) = e− λ , by applying Lemma F.3, for any fixed h ∈ [H], δ ∈ (0, 1), we have:
K
 X  V (s)  2  H|Nh+1 (ϵ)|  K 
P sup ϕτh ηhτ e− λ −1
≥ 4 2 log + d log 1 + ≤ δ/H. (C.20)
V ∈Nh+1 (ϵ) Λh δ γ
τ =1

Hence, combining (C.20) with (C.19) and let γ = 1, then for all h ∈ [H], it holds that
K λ (s) 
X  V̂h+1 2  H|Nh+1 (ϵ)| 4ϵ2 K 2 
ϕτh ηhτ e− λ ≤ 8 2 log + d log(1 + K) + , (C.21)
Λ−1
h δ λ2
τ =1

with probability at least 1 − δ. By Lemma C.4, recall L = R0 = 2Hd Kλ in this setting, we have

log(|Nh+1 (ϵ)|) ≤ d log(1 + 4R0 /ϵ) + d2 log(1 + 8λ2 d1/2 B 2 /ϵ2 ). (C.22)

We then set ϵ = dλ/K ∈ (0, λ) and define β ′ = β/λeH/λ = 16d (H/λ + ξKL ) for brevity, then
p

(C.22) can be bounded as:


2H
log(|Nh+1 (ϵ)|) ≤ d log(1 + 4R0 K/dλ) + d2 log(1 + 8λ2 K 2 d−3/2 e λ β ′2 )
= d log(1 + 4R0 K/dλ) + d2 log(e−2H/λ + 8λ2 K 2 d−3/2 β ′2 ) + 2d2 H/λ
≤ 2d2 log(8λ2 K 2 d−3/2 β ′2 ) + 2d2 H/λ.

Therefore, by combining the result with the inequality (C.21), we can get
K λ (s) 
X  V̂h+1 2
ϕτh ηhτ e− λ (C.23)
Λ−1
h
τ =1
 H 
≤ 8 2 log + 4d2 H/λ + 4d2 log(8λ2 K 2 d−3/2 β ′2 ) + 4d2 + d log(1 + K)
δ
≤ 8(4d H/λ + 4d2 log(8λ2 K 3 Hd−3/2 β ′2 /δ))
2
(C.24)
2 2 2 3 −3/2 2 ′2
= 8(4d H/λ + 4d log(8λ K Hd /δ) + 4d log(β ))
≤ β ′2 /4, (C.25)

where (C.24) follows by the fact that 2 log Hδ + 4d2 + d log(1 + K) ≤ 4d2 log( HK δ ), and (C.25) is held
due to the fact that
 1024dλ2 K 3 H 
β ′2 /4 = 64d2 H/λ + log
δ
 1024d7/2 λ2 K 3 H 
= 8 8d2 H/λ + 4d2 log(8λ2 K 3 Hd−3/2 /δ) + 4d2 log + 4d2 log(128)
  δ
−3/2 ′2
2 2 2 3
≥ 8 4d H/λ + 4d log(8λ K Hd 2
/δ) + 4d log(β ) , (C.26)

where (C.26) holds by


  1024dλ2 K 3 H 
log(β ′2 ) = log 256d2 H/λ + log
δ
 1024dλ 2K 3H 
≤ log(256d2 ) + H/λ + log
δ
1024d7/2 λ2 K 3 H
≤ log(128) + log + H/λ.
δ
By the bound on (i), (ii), (iii), for all h ∈ H and (s, a) ∈ S × A, with probability at least 1 − δ, it
holds that
 √ d
X 

|Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩| ≤ λ log 1 + e H/λ
( d + β /2) ∥ϕi (s, a)1i ∥Λ−1
h
i=1
 d
X 
≤ λ log 1 + eH/λ β ′ ∥ϕi (s, a)1i ∥Λ−1 (C.27)
h
i=1
d
X
≤β ∥ϕi (s, a)1i ∥Λ−1 , (C.28)
h
i=1


where (C.27) follows from the fact that $\beta'\ge 2\sqrt{d}$, and (C.28) follows from the fact that $\log(1+x)\le x$ holds for any positive $x$. Thus, by Lemma C.1, we can upper bound the suboptimality gap as:

$$\mathrm{SubOpt}(\hat\pi,s,\lambda) \le 2\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\Gamma_h(s_h,a_h)\,\Big|\,s_1=s\Big] = 2\beta\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\sum_{i=1}^d\|\phi_i(s_h,a_h)\mathbf{1}_i\|_{\Lambda_h^{-1}}\,\Big|\,s_1=s\Big].$$

Therefore, we conclude the proof.

C.1.3 Proof of Theorem 5.1 - Case with χ2 Divergence

Algorithm 4 Robust Regularized Pessimistic Value Iteration under χ2 distance (R2PVI-χ2)
Require: Dataset D, regularizer λ > 0, γ > 0 and parameter β
1: init V̂_{H+1}^λ(·) = 0
2: for episode h = H, · · · , 1 do
3:   Λ_h ← Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ)(ϕ(s_h^τ, a_h^τ))^⊤ + γI
4:   Ê^{µ^0_{h,i}}[V̂_{h+1}^λ(s)]_α ← [Λ_h^{-1}(Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ)[V̂_{h+1}^λ(s_{h+1}^τ)]_α)]_{[0,H]}      ◁ Estimated by (4.8)
5:   Ê^{µ^0_{h,i}}[V̂_{h+1}^λ(s)]^2_α ← [Λ_h^{-1}(Σ_{τ=1}^K ϕ(s_h^τ, a_h^τ)[V̂_{h+1}^λ(s_{h+1}^τ)]^2_α)]_{[0,H^2]}   ◁ Estimated by (4.9)
6:   Estimate ŵ_{h,i}^λ according to (4.10)
7:   Γ_h(·, ·) ← β Σ_{i=1}^d ∥ϕ_i(·, ·)∥_{Λ_h^{-1}}
8:   Q̂_h^λ(·, ·) ← min{ϕ(·, ·)^⊤(θ_h + ŵ_h^λ) − Γ_h(·, ·), H − h + 1}^+
9:   π̂_h(·|·) ← argmax_{π_h}⟨Q̂_h^λ(·, ·), π_h(·|·)⟩_A and V̂_h^λ(·) ← ⟨Q̂_h^λ(·, ·), π̂_h(·|·)⟩_A
10: end for
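A short Python sketch of the estimation step in lines 4-6 of Algorithm 4 is given below: ridge estimates of the clipped first and second moments of $\hat{V}_{h+1}^\lambda$ are combined through the dual objective of Proposition 4.7 and maximized over a grid of $\alpha$ values. The grid search and all problem sizes are illustrative assumptions; the paper's estimator itself is defined by (4.8)-(4.10).

```python
# Sketch of lines 4-6 of Algorithm 4 (chi-square case); illustrative only.
import numpy as np

rng = np.random.default_rng(0)
K, d, H, lam, gamma = 200, 4, 5, 1.0, 1.0
Phi = rng.dirichlet(np.ones(d), size=K)        # K x d features
V_next = rng.uniform(0.0, H, size=K)           # placeholder V_hat_{h+1}^lambda(s_{h+1}^tau)

Lam = Phi.T @ Phi + gamma * np.eye(d)
ridge = np.linalg.solve(Lam, Phi.T)            # shared d x K factor Lambda^{-1} Phi^T

w_chi2 = np.zeros(d)
for i in range(d):
    best = -np.inf
    for alpha in np.linspace(0.0, H, 101):     # grid search over alpha in [0, H]
        v = np.minimum(V_next, alpha)
        m1 = np.clip(ridge[i] @ v, 0.0, H)            # line 4: clipped first moment
        m2 = np.clip(ridge[i] @ v ** 2, 0.0, H ** 2)  # line 5: clipped second moment
        best = max(best, m1 + m1 ** 2 / (4 * lam) - m2 / (4 * lam))  # dual objective, cf. (4.10)
    w_chi2[i] = best
print(w_chi2)
```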

For completeness, we present the R2PVI algorithm specific to the χ2 distance in Algorithm 4,
which gives closed form solution of (4.8) and (4.9). Before the proof, we first present the bound on
weights under χ2 -divergence:
Lemma C.5 (Bound of weights - χ²). For any $h\in[H]$,
$$\|\hat{w}_h^\lambda\|_2 \le \sqrt{d}\Big(H + \frac{H^2}{2\lambda}\Big).$$

Proof of Theorem 5.1 - χ2 . The R2PVI with χ2 -divergence is presented in Algorithm 4. By the
definition of Thλ , T̂hλ , we have

Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩
= ϕ(s, a)⊤ (θh + whλ − θh − ŵhλ )
d
X
λ ′
= ϕi (s, a)(wh,i − ŵh,i )
i=1
d
X h n
λ 1 λ 1 o
= ϕi (s, a) sup Es∼µ0 [V̂h+1 (s)]α + (Es∼µ0 [V̂h+1 (s)]α )2 − λ
Es∼µ0 [V̂h+1 (s)]2α
α∈[0,H]
h,i 4λ h,i 4λ h,i
i=1

n 0 1 µ0h,i λ 1 µ0h,i λ oi
− sup ʵh,i [V̂h+1
λ
(s)]α + (Ê [V̂h+1 (s)]α )2 − Ê [V̂h+1 (s)]2α . (C.29)
α∈[0,H] 4λ 4λ

To continue, for any i ∈ [d], we denote


n
λ 1 λ 1 o
αi = argmax Es∼µ0 [V̂h+1 (s)]α + (Es∼µ0 [V̂h+1 (s)]α )2 − λ
Es∼µ0 [V̂h+1 (s)]2α .
α∈[0,H]
h,i 4λ h,i 4λ h,i

Hence, (C.29) can be further upper bounded as


λ
Thλ V̂h+1 (s, a) − ⟨ϕ(s, a), ŵhλ ⟩
d
X 0  1 0

λ
(s)]αi − ʵh,i [V̂h+1
λ λ
(s)]αi + ʵh,i [V̂h+1
λ

≤ ϕi (s, a) Es∼µ0 [V̂h+1 (s)]αi Es∼µ0 [V̂h+1 (s)]αi + 1
h,i 4λ h,i

|i=1 {z }
(i)
d
X 0
λ
(s)]2αi − ʵh,i [V̂h+1
λ
(s)]2αi . (C.30)

− ϕi (s, a) Es∼µ0 [V̂h+1
h,i

|i=1 {z }
(ii)

Next, we bound (i) and (ii), respectively.

Bounding term (i). We define

h K ii
0 X
Ẽµh,i [V̂h+1
λ
(s)]α = argmin λ
([V̂h+1 (sτh+1 )]α − ϕ(sτh , aτh )⊤ w)2 + γ∥w∥22 .
w∈Rd τ =1

0 0
Considering the gap between the ʵh,i [V̂h+1
λ (s)] µ
αi and Ẽ h,i [V̂h+1 (s)]αi due to the definition that
λ
0 0
ʵh,i [V̂h+1
λ (s)] µ
αi = [Ẽ h,i [V̂h+1 (s)]αi ][0,H] , we eliminate the clip operator at first. We rewrite (i) as
λ

follows:
d
X 0  1 0

(i) = λ
(s)]αi − ʵh,i [V̂h+1
λ λ
(s)]αi + ʵh,i [V̂h+1
λ

ϕi (s, a) Es∼µ0 [V̂h+1 (s)]αi Es∼µ0 [V̂h+1 (s)]αi + 1
h,i 4λ h,i
i=1
d
X 0
= λ
ϕi (s, a)(Es∼µ0 [V̂h+1 (s)]αi − Ẽµh,i [V̂h+1
λ
(s)]αi )
h,i
i=1
 1  Eµ0h,i [V̂ λ (s)]α − ʵ0h,i [V̂ λ (s)]α
µ0h,i
Es∼µ0 [V̂h+1 (s)]αi + Ê [V̂h+1 (s)]αi + 1 µ0 h+1
λ λ h+1
i i

× µ0 .
4λ h,i λ λ
E h,i [V̂h+1 (s)]αi − Ẽ h,i [V̂h+1 (s)]αi
| {z }
:=Ci

We claim that $|C_i| \le 1 + H/(2\lambda)$ for all $i\in[d]$. We prove the claim by discussing the value of $\tilde{\mathbb{E}}^{\mu^0_{h,i}}[\hat{V}_{h+1}^\lambda(s)]_{\alpha_i}$ in the following three cases:
αi in the following three cases:

0
Case I. Ẽµh,i [V̂h+1
λ (s)]
αi ≤ 0. By the fact that Es∼µ0 [V̂h+1 (s)]αi ≤ H, we have:
λ
h,i

λ (s)]
Es∼µ0 [V̂h+1
 1  αi
λ h,i
|Ci | = E 0 [V̂ (s)]αi + 1 ≤ 1 + H/4λ,
4λ s∼µh,i h+1 λ
0
− Ẽµh,i [V̂h+1
λ (s)]
E s∼µ0h,i [V̂h+1 (s)]αi αi

where the equality holds by 1 λ


4λ Es∼µ0h,i [V̂h+1 (s)]αi + 1 ≤ 1 + H/4λ. Hence the claim holds by Case I.

0
Case II. 0 ≤ Ẽµh,i [V̂h+1
λ (s)]
αi ≤ H. The claim holds trivially, as we have:

1 0
λ
(s)]αi + Ẽµh,i [V̂h+1
λ

|Ci | = Es∼µ0 [V̂h+1 (s)]αi + 1 ≤ 1 + H/2λ.
4λ h,i

Hence, we conclude the claim.

0
Case III. Ẽµh,i [V̂h+1
λ (s)]
αi > H. Notice that

λ (s)] − H
Es∼µ0 [V̂h+1
 1  αi
λ h,i

|Ci | = Es∼µ0 [V̂h+1 (s)]αi + H + 1 0
4λ h,i
E λ
s∼µ0h,i [V̂h+1 (s)]αi − Ẽµh,i [V̂h+1
λ (s)]
αi
λ (s)]
H − Es∼µ0 [V̂h+1
 1  αi
λ h,i

= Es∼µ0 [V̂h+1 (s)]αi + H + 1 µ0
4λ h,i λ (s)] − E
Ẽ h,i [V̂h+1 λ
αi s∼µ0 [V̂h+1 (s)]αi h,i

≤ H/2λ + 1, (C.31)
0
where (C.31) holds by the fact that Ẽµh,i [V̂h+1
λ (s)]
αi > H.
With the upper bound for Ci , we can upper bound (i) as
d
X 0
|(i)| = λ
ϕi (s, a)(Es∼µ0 [V̂h+1 (s)]αi − Ẽµh,i [V̂h+1
λ
(s)]αi )Ci
h,i
i=1
d d K
µ0h
X X X
= γ ϕi (s, a)1i Λ−1 λ
h E [V̂h (s)]αi Ci + ϕi (s, a)1i Λ−1
h
λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]αi )Ci
i=1 i=1 τ =1
d
X  XK 
≤ (1 + H/2λ) ∥ϕi (s, a)1i ∥Λ−1 γH + ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
] αi ) . (C.32)
h Λ−1
h
i=1 τ =1

Bounding term (ii). Similar to bounding (i), we can deduce that:


d
X 0
|(ii)| = λ
ϕi (s, a)(Es∼µ0 [V̂h+1 (s)]2αi − ʵh,i [V̂h+1
λ
(s)]2αi )
h,i
i=1
d
X d
X K
X
≤ γH 2 ∥ϕi (s, a)1i ∥Λ−1 + ∥ϕi (s, a)1i ∥Λ−1 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2
]αi ) , (C.33)
h h Λ−1
h
i=1 i=1 τ =1

where (C.33) follows by the Cauchy Schwartz inequality and the fact that Es∼µ0 [V̂hλ (s)]2αi ≤ H 2 , ∀i ∈
h,i
[d]. Hence combining (C.30), (C.32) and (C.33), we have

Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩
 X d d
X K
X 
≤ (1 + H/2λ) γH ∥ϕi (s, a)1i ∥Λ−1 + ∥ϕi (s, a)1i ∥Λ−1 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]α′i )
h h Λ−1
h
i=1 i=1 τ =1
d
X d
X K
X
+ γH 2 ∥ϕi (s, a)1i ∥Λ−1 + ∥ϕi (s, a)1i ∥Λ−1 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2
]α′ ) .
h h i Λ−1
h
i=1 i=1 τ =1

On the other hand, we can similarly deduce that there exists αi′ s.t.

⟨ϕ(s, a), ŵhλ ⟩ − Thλ V̂h+1


λ
(s, a)
 Xd d
X K
X 
≤ (1 + H/2λ) γH ∥ϕi (s, a)1i ∥Λ−1 + ∥ϕi (s, a)1i ∥Λ−1 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]α′i )
h h Λ−1
h
i=1 i=1 τ =1
d
X d
X K
X
+ γH 2 ∥ϕi (s, a)1i ∥Λ−1 + ∥ϕi (s, a)1i ∥Λ−1 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2
]α′ ) .
h h i Λ−1
h
i=1 i=1 τ =1

Then for all i ∈ [d], there exists α̂i ∈ {αi , αi′ }, such that
λ
|Thλ V̂h+1 (s, a) − ⟨ϕ(s, a), ŵhλ ⟩|
 Xd  d
X
≤ (1 + H/2λ) γH ∥ϕi (s, a)1i ∥Λ−1 + γH 2 ∥ϕi (s, a)1i ∥Λ−1
h h
i=1 i=1
d
X  K
X K
X 
+ ∥ϕi (s, a)1i ∥Λ−1 (1 + H/2λ) λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]α̂i ) + λ 2
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]α̂i ) .
h Λ−1
h Λ−1
h
i=1 τ =1 τ =1

Now it remains to bound the terms


K
X K
X
ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]α̂i ) and λ 2
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]α̂i ) .
Λ−1
h Λ−1
h
| τ =1 {z } | τ =1 {z }
(iii) (iv)

Similar to the proof in KL divergence, we aim to find a union function class Vh+1 (R0 , B0 , λ), which
holds uniformly that V̂h+1
λ ∈V
h+1 (R0 , B0 , λ), here for all h ∈ [H], the function class is defined as:

Vh (R0 , B0 , λ) = {Vh (x; θ, β, Λ) : S → [0, H], ∥θ∥2 ≤ R0 , β ∈ [0, B0 ], γmin (Λh ) ≥ γ},

where Vh (x; θ, β, Λ) = maxa∈A [ϕ(s, a)⊤ θ − β di=1 ∥ϕi (s, a)∥Λ−1 ][0,H−h+1] . By Lemma C.5, when
P
√ p h
we set R0 = d(H + H 2 /2λ), B0 = β = 8dH 2 (1 + 1/λ) ξχ2 , it suffices to show that V̂h+1 λ ∈
Vh+1 (R0 , B0 , γ). Next we aim to find a union cover of the Vh+1 (R0 , B0 , γ), hence the term (iii) and
(iv) can be upper bounded. Let Nh (ϵ; R0 , B0 , γ) be the minimum ϵ-cover of Vh (R, B, λ) with respect
to the supreme norm, Nh ([0, H]) be the minimum ϵ-cover of [0, H] respectively. In other words, for

any function V ∈ Vh (R, B, λ), α ∈ [0, H], there exists a function V ′ ∈ Vh (R, B, λ) and a real number
αϵ ∈ [0, H] such that:

sup |V (s) − V ′ (s)| ≤ ϵ, |α − αϵ | ≤ ϵ.


s∈S

Recall the definition of (iii) and (iv). By Cauchy-Schwartz inequality and the fact that ∥a + b∥2Λ−1 ≤
h
2∥a∥2Λ−1 + 2∥b∥2Λ−1 , we have
h h

K K
X 2 X 2
(iii)2 ≤ 2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]αϵ ) +2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ λ
]α̂ − [V̂h+1 ]αϵ )
Λ−1
h Λ−1
h
τ =1 τ =1
K K
X

2 X

2 2ϵ2 K 2
≤4 ϕ(sτh , aτh )ηhτ ([Vh+1 ]αϵ ) +4 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]αϵ − [Vh+1 ]αϵ ) + ,
Λ−1
h Λ−1
h γ
τ =1 τ =1
(C.34)

where (C.34) follows by the fact that


K K
X 2 X ′ 2ϵ2 K 2
2 λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]α̂ − λ
[V̂h+1 ]αϵ ) ≤ 2ϵ 2
|ϕτh Λ−1 τ
h ϕh | ≤ .
Λ−1
h γ
τ =1 τ =1,τ ′ =1

Meanwhile, by the fact that |[V̂h+1


λ ]
αϵ − [Vh+1 ]αϵ | ≤ |V̂h+1 − Vh+1 |, we have
′ λ ′

K
X 2

4 λ
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]αϵ − [Vh+1 ] αϵ )
Λ−1
h
τ =1
K

X
≤4 |ϕτh Λ−1 τ τ λ ′
h ϕh | max |ηh ([V̂h+1 ]αϵ − [Vh+1 ]αϵ )|
2

τ =1,τ ′ =1
K

X
≤ 4ϵ2 |ϕτh Λ−1 τ
h ϕh |
τ =1,τ ′ =1
4ϵ K 2
2
≤ .
γ
By applying the above two inequalities and the union bound into (C.34), we have
K
2
X

2 6ϵ2 K 2
(iii) ≤ 4 sup ϕ(sτh , aτh )ηhτ ([Vh+1 ]αϵ ) + .
V ′ ∈Nh (ϵ;R0 ,B0 ,γ),αϵ ∈Nh ([0,H]) Λ−1
h γ
τ =1

By Lemma F.3, applying a union bound over Nh (ϵ; R0 , B0 , γ) and Nh ([0, H]), with probability at
least 1 − δ/2H, we have
K
X 6ϵ2 K 2

2
4 sup ϕ(sτh , aτh )ηhτ ([Vh+1 ] αϵ ) +
V ′ ∈Nh (ϵ;R0 ,B0 ,γ),αϵ ∈Nh ([0,H]) τ =1 γ Λ−1
h

 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  6ϵ2 K 2


≤ 4H 2 2 log + d log(1 + K/γ) + .
δ γ

Similarly, by the fact that ∥a + b∥2Λ−1 ≤ 2∥a∥2Λ−1 + 2∥b∥2Λ−1 , noticing (iv) has the almost same form
h h h
as (iii), we have
K K
X 2 X 2
(iv)2 ≤ 2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2
]αϵ ) +2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2 λ 2
]α̂ − [V̂h+1 ]αϵ )
Λ−1
h Λ−1
h
τ =1 τ =1
K
X

2 24H 2 ϵ2 K 2
≤4 ϕ(sτh , aτh )ηhτ ([Vh+1 ]2αϵ ) + , (C.35)
Λ−1
h γ
τ =1

where (C.35) follows by the fact that


K K
X 2 X ′ 8H 2 ϵ2 K 2
2 ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2 λ 2
]α̂ − [V̂h+1 ] αϵ ) ≤ 8H 2 ϵ2 |ϕτh Λ−1 τ
h ϕh | ≤ ,
Λ−1
h γ
τ =1 τ =1,τ ′ =1

and
K
X

4∥ λ 2
ϕ(sτh , aτh )ηhτ ([V̂h+1 ]αϵ − [Vh+1 ]2αϵ )∥2Λ−1
h
τ =1
K

X
≤4 |ϕτh Λ−1 τ τ λ ′
h ϕh | max |ηh ([V̂h+1 ]αϵ − [Vh+1 ]αϵ )|
2

τ =1,τ ′ =1
K

X
≤ 16H ϵ 2 2
|ϕτh Λ−1 τ
h ϕh |
τ =1,τ ′ =1
16H 2 ϵ2 K 2
≤ .
γ

We apply the union bound and Lemma F.3, with probability at least 1 − δ/2H
 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  24H 2 ϵ2 K 2
(iv)2 ≤ 4H 4 2 log + d log(1 + K/γ) + .
δ γ
Therefore, with probability at least 1 − δ,

|Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩| (C.36)
d
X d
X
≤ γH(1 + H/2λ) ∥ϕi (s, a)1i ∥Λ−1 + γH 2 ∥ϕi (s, a)1i ∥Λ−1
h h
i=1 i=1
d
X  K
X K
X 
+ ∥ϕi (s, a)1i ∥Λ−1 (1 + H/2λ) ϕ(sτh , aτh )ηhτ ([V̂h+1
λ
]α̂i ) + ϕ(sτh , aτh )ηhτ ([V̂h+1
λ 2
]α̂i )
h Λ−1
h Λ−1
h
i=1 τ =1 τ =1
d
"
X
≤ ∥ϕi (s, a)1i ∥Λ−1 γH(1 + H + H/2λ)
h
i=1
s
 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  6ϵ2 K 2
+ (1 + H/2λ) 4H 2 2 log + d log(1 + K/γ) +
δ γ

s #
 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  24H 2 ϵ2 K 2
+ 4H 4 2 log + d log(1 + K/γ) + . (C.37)
δ γ

By the fact that R0 = d(H + H 2 /2λ), Lemma F.1 and Lemma F.2, we can upper bound the term
|Nh (ϵ; R0 , B0 , γ)| and |Nh ([0, H])| as follows:

log |Nh (ϵ; R0 , B0 , γ)| ≤ d log(1 + 4R0 /ϵ) + d2 log(1 + 8d1/2 B 2 /γϵ2 ), log |Nh ([0, H])| ≤ log(3H/ϵ).

We set ϵ = 1
K,γ = 1, with the upper bound above, we have
 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  6ϵ2 K 2
4H 2 2 log + d log(1 + K/γ) +
δ γ
 2
6H K 
≤ 4H 2 2 log + d log(1 + K) + d log(1 + d1/2 (1 + H/2λ)K) + d2 log(1 + d1/2 B 2 K 2 ) + 3/2
δ
 6H 2K 
≤ 4H 2 2d log + 2d log(K) + d log(d1/2 (1 + H/2λ)K) + 2d2 log d1/2 B 2 K 2
δ
 6H 2K 
≤ 8H 2 d2 log + log(K) + log(d1/2 (1 + H/2λ)K) + log 8d1/2 B 2 K 2
δ
= 8H 2 d2 (log 48K 5 H 2 B 2 d(1 + H/2λ)/δ).

Similarly, we can upper bound the third term in (C.37) as follows:


 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])|  24H 2 ϵ2 K 2
4H 4 2 log + d log(1 + K/γ) +
δ γ
 2H|Nh (ϵ; R0 , B0 , γ)∥Nh ([0, H])| 24ϵ2 K 2 
= H 2 4H 2 (2 log + d log(1 + K/γ)) +
δ γ
4 2 5 2 2
≤ 8H d (log 48K H B d(1 + H/2λ)/δ).

Hence, we apply this bound into the (C.37), we have


λ
|⟨ϕ(s, a), ŵhλ ⟩ − Thλ V̂h+1 (s, a)|
d
X
≤ ∥ϕ(s, a)1i ∥Λ−1 (H(1 + H + H/2λ)
h
i=1
p
+ (1 + H/2λ + H) 8H 2 d2 (log 48K 5 H 2 B 2 d(1 + H/2λ)/δ))
d
X
(C.38)
p
≤ ∥ϕ(s, a)1i ∥Λ−1 2(H/λ + H) 8H 2 d2 (log 48K 5 H 2 B 2 d(1 + H/2λ)/δ)
h
i=1
Xd q
≤ ∥ϕ(s, a)1i ∥Λ−1 2(H/λ + H)Hd 8(log 192K 5 H 6 d3 (1 + H/2λ)3 /δ) + log ξχ2 )
h
i=1
Xd q
= ∥ϕ(s, a)1i ∥Λ−1 2(H/λ + H)Hd 8(ξχ2 + log ξχ2 )
h
i=1
Xd
≤β ∥ϕ(s, a)1i ∥Λ−1 , (C.39)
h
i=1

where (C.38) comes from the fact that $1 + 1/\lambda \le 1 + H/(2\lambda)$, and (C.39) comes from the fact that $\log\xi_{\chi^2}\le\xi_{\chi^2}$. Hence, the condition of Lemma C.1 is satisfied, and we can upper bound the suboptimality gap as:

$$\mathrm{SubOpt}(\hat\pi,s,\lambda) \le 2\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\Gamma_h(s_h,a_h)\,\Big|\,s_1=s\Big] = 2\beta\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\sum_{i=1}^d\|\phi_i(s_h,a_h)\mathbf{1}_i\|_{\Lambda_h^{-1}}\,\Big|\,s_1=s\Big].$$

This concludes the proof.

C.2 Proof of Corollary 5.4


Proof. The proof follows the argument in (F.15) and (F.16) of Blanchet et al. (2024). Specifically, we denote

$$\Lambda_{h,i}^P = \mathbb{E}^{\pi^\star,P}\big[(\phi_i(s_h,a_h)\mathbf{1}_i)(\phi_i(s_h,a_h)\mathbf{1}_i)^\top\,\big|\,s_1 = s\big], \quad \forall (h,i,P)\in[H]\times[d]\times\mathcal{U}^\lambda(P^0). \qquad (C.40)$$

By Assumption 5.3, setting γ = 1, we have


H d
X 
π ∗ ,P
X
sup E ∥ϕi (sh , ah )1i ∥Λ−1 s1 = s
h
P ∈U λ (P 0 ) h=1 i=1
H X
d hq i
∗ ,P
X
Eπ Tr (ϕi (sh , ah )1i )(ϕi (sh , ah )1i )⊤ Λ−1

= sup h s1 = s
P ∈U λ (P 0 ) h=1 i=1
H X
X d q
Tr Eπ∗ ,P (ϕi (sh , ah )1i )(ϕi (sh , ah )1i )⊤ |s1 = s Λ−1 (C.41)
  
≤ sup h
P ∈U λ (P 0 ) h=1 i=1
H X
d q
X −1 
≤ sup Tr ΛPh,i · I + K · c† · ΛPh,i (C.42)
P ∈U λ (P 0 ) h=1 i=1
s
H X
d
X (Eπ∗ ,P [ϕi (sh , ah )|s1 = s])2
= sup
P ∈U λ (P 0 ) h=1 i=1 1 + c† · K · (Eπ∗ ,P [ϕi (sh , ah )|s1 = s])2
H X
d r
X 1
≤ sup (C.43)
P ∈U λ (P 0 ) h=1 i=1 c† ·K
dH
=√ ,
c† K
where (C.41) is due to the Jensen’s inequality, (C.42) holds by the definition in (C.40) and Assump-
tion 5.3, (C.43) holds by the fact that the only nonzero element of ΛPh,i is the i-th diagonal element.
Thus, by Theorem 5.1, with probability at least $1-\delta$, for any $s\in\mathcal{S}$ the suboptimality can be upper bounded as:

$$\mathrm{SubOpt}(\hat\pi,s,\lambda) \le \beta\sup_{P\in\mathcal{U}^\lambda(P^0)}\mathbb{E}^{\pi^*,P}\Big[\sum_{h=1}^H\sum_{i=1}^d\|\phi_i(s_h,a_h)\mathbf{1}_i\|_{\Lambda_h^{-1}}\,\Big|\,s_1=s\Big] \le \frac{\beta dH}{\sqrt{c^\dagger K}},$$

where

$$\beta = \begin{cases} 16Hd\sqrt{\xi_{\mathrm{TV}}}, & \text{if } D \text{ is TV};\\ 16d\lambda e^{H/\lambda}\sqrt{H/\lambda + \xi_{\mathrm{KL}}}, & \text{if } D \text{ is KL};\\ 8dH^2(1+1/\lambda)\sqrt{\xi_{\chi^2}}, & \text{if } D \text{ is } \chi^2. \end{cases}$$

Hence, we conclude the proof.

D Proof of the Information-Theoretic Lower Bound


In this section, we prove the information-theoretic lower bound. We first introduce the construction of hard instances in Appendix D.1, and then prove Theorem 5.5 in Appendix D.2.

D.1 Construction of Hard Instances


We design a family of d-rectangular linear RRMDPs parameterized by a Boolean vector $\xi = \{\xi_h\}_{h\in[H]}$, where $\xi_h\in\{-1,1\}^d$. For a given $\xi$ and regularizer $\lambda$, the corresponding d-rectangular linear RRMDP $M_\xi^\lambda$ is constructed as follows. The state space is $\mathcal{S} = \{s_1, s_2\}$ and the action space is $\mathcal{A} = \{0,1\}^d$. The initial state distribution $\mu_0$ is defined as

$$\mu_0(s_1) = \frac{d+1}{d+2} \quad \text{and} \quad \mu_0(s_2) = \frac{1}{d+2}.$$
The feature mapping $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d+2}$ is defined as

$$\phi(s_1,a)^\top = \Big(\frac{a_1}{d}, \frac{a_2}{d}, \cdots, \frac{a_d}{d},\ 1-\sum_{i=1}^d\frac{a_i}{d},\ 0\Big), \qquad \phi(s_2,a)^\top = (0, 0, \cdots, 0, 0, 1),$$

which satisfies $\phi_i(s,a)\ge 0$ and $\sum_{i=1}^{d+2}\phi_i(s,a) = 1$. The nominal distributions $\{\mu_h^0\}_{h\in[H]}$ are defined as

$$\mu_h^0 = \big(\underbrace{(1-\epsilon)\delta_{s_1} + \epsilon\delta_{s_2},\ \cdots,\ (1-\epsilon)\delta_{s_1} + \epsilon\delta_{s_2}}_{d+1},\ \delta_{s_2}\big)^\top, \quad \forall h\in[H],$$

where ϵ is an error term injected into the nominal model, which is to be determined later. Thus, the
transition is homogeneous and does not depend on action but only on state. The reward parameters
{θh }h∈[H] are defined as
$$\theta_h^\top = \delta\cdot\Big(\frac{\xi_{h1}+1}{2}, \frac{\xi_{h2}+1}{2}, \cdots, \frac{\xi_{hd}+1}{2}, \frac{1}{2}, 0\Big), \quad \forall h\in[H],$$

where $\delta$ is a parameter to control the differences among instances, which is to be determined later. The reward $r_h$ is generated from the normal distribution $r_h\sim\mathcal{N}(r_h(s_h,a_h), 1)$, where $r_h(s,a) = \phi(s,a)^\top\theta_h$. Note that

$$r_h(s_1,a) = \phi(s_1,a)^\top\theta_h = \frac{\delta}{2d}\big(\langle\xi_h, a\rangle + d\big) \ge 0 \quad \text{and} \quad r_h(s_2,a) = \phi(s_2,a)^\top\theta_h = 0, \quad \forall a\in\mathcal{A}.$$

Thus, the worst case transition kernel should have the highest possible transition probability to
s2 , and the optimal robust policy should lead to a transition probability to s2 as small as possible.
Therefore the optimal action at step h is
$$a_h^* = \Big(\frac{1+\xi_{h1}}{2}, \frac{1+\xi_{h2}}{2}, \cdots, \frac{1+\xi_{hd}}{2}\Big).$$
We illustrate the designed d-rectangular linear RRMDP Mξλ in Figure 1(a) and Figure 1(b).
Finally, the offline dataset is collected by the following procedure: the behavior policy $\pi^b = \{\pi_h^b\}_{h\in[H]}$ is defined as

$$\pi_h^b \sim \mathrm{Unif}\big(\{e_1, \cdots, e_d, \mathbf{0}\}\big), \quad \forall h\in[H],$$

where $\{e_i\}_{i\in[d]}$ are the canonical basis vectors in $\mathbb{R}^d$. The initial state is generated according to $\mu_0$, and then the behavior policy interacts with the nominal environment for $K$ episodes to collect the offline dataset $\mathcal{D}$.
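For concreteness, the short Python sketch below instantiates one stage of this hard-instance family for a small $d$ and checks the stated simplex property of the feature map, the reward formula at $s_1$, and the nominal next-state distribution. The values of $d$, $\xi_h$, $\delta$, and $\epsilon$ are arbitrary illustrative choices (the construction leaves $\delta$ and $\epsilon$ to be tuned in the proof).

```python
# Illustrative instantiation of the Appendix D.1 hard instance for one stage.
import numpy as np

d = 3
xi_h = np.array([1, -1, 1])                  # one stage of xi, in {-1, 1}^d
delta, eps = 0.1, 0.05                       # instance parameters (left free in the construction)

def phi(state, a):
    """Feature map in R^{d+2}; state is 's1' or 's2', a is a binary vector in {0,1}^d."""
    if state == "s1":
        return np.concatenate([a / d, [1.0 - a.sum() / d, 0.0]])
    return np.concatenate([np.zeros(d), [0.0, 1.0]])

theta_h = np.concatenate([delta * (xi_h + 1) / 2, [delta / 2, 0.0]])   # reward parameters
mu0 = np.vstack([np.tile([1 - eps, eps], (d + 1, 1)), [[0.0, 1.0]]])   # factors mu^0_{h,i} over (s1, s2)

for a_int in range(2 ** d):                  # enumerate all actions
    a = np.array([(a_int >> i) & 1 for i in range(d)], dtype=float)
    f = phi("s1", a)
    assert f.min() >= 0 and abs(f.sum() - 1.0) < 1e-9                     # features on the simplex
    assert abs(f @ theta_h - delta * (xi_h @ a + d) / (2 * d)) < 1e-9     # mean reward at s1

a_star = (1 + xi_h) / 2                      # optimal action for this stage
print("optimal action:", a_star, "mean reward:", phi("s1", a_star) @ theta_h)
print("nominal P^0(. | s1, a*):", phi("s1", a_star) @ mu0)
```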

D.2 Proof of Theorem 5.5


With this family of hard instances, we are ready to prove the information-theoretic lower bound.
For any ξ ∈ {−1, 1}dH , let Qξ denote the distribution of dataset D collected from the instance
Mξ . Denote the family of parameters as Ω = {−1, 1}dH and the family of hard instances as
M = {Mξ : ξ ∈ Ω}. Before the proof, we introduce the following lemma bounding the robust value
function.

Lemma D.1. Under the constructed hard instances in Appendix D.1, let $\delta = d^{3/2}/\sqrt{2K}$ and $K > d^3H^2/(2\lambda^2)$. For any $h\in[H]$, we have

$$0 \le \frac{\delta}{2d}\sum_{j=h}^H(1-\epsilon)^{j-h}\Big(d + \sum_{i=1}^d\xi_{ji}\,\mathbb{E}^\pi[a_{ji}]\Big) - V_h^{\pi,\lambda}(s_1) \le f_h^\lambda(\epsilon), \qquad (D.1)$$

where $f_h^\lambda(\epsilon)$ is an error term defined as:

$$f_h^\lambda(\epsilon) = \begin{cases} 0, & \text{if } D \text{ is TV};\\ (H-h)\lambda\epsilon(e-1), & \text{if } D \text{ is KL};\\ (H-h)\lambda\epsilon(1-\epsilon)/4, & \text{if } D \text{ is } \chi^2. \end{cases}$$

Furthermore, if we set $\epsilon$ as

$$\epsilon = \begin{cases} 1 - 2^{-1/H}, & \text{if } D \text{ is TV};\\ \min\{1-2^{-1/H},\ d^{3/2}/(64\lambda\sqrt{2K})\}, & \text{if } D \text{ is KL};\\ \min\{1-2^{-1/H},\ d^{3/2}/(8\lambda\sqrt{2K})\}, & \text{if } D \text{ is } \chi^2, \end{cases} \qquad (D.2)$$

then we have $f_h^\lambda(\epsilon) \le d^{3/2}H/(32\sqrt{2K})$.
Proof of Theorem 5.5. Invoking Lemma D.1, we have
H d 1 + ξ
∗ ,λ δ XX ji

V1π (s1 ) − V1π,λ (s1 ) ≥ (1 − ϵ)j−1 − ξji Eπ aji − f1λ (ϵ)
2d 2
j=1 i=1

H d
δ XX
= (1 − ϵ)j−1 (1 − ξji Eπ (2aji − 1)) − f1λ (ϵ)
4d
j=1 i=1
H d
δ XX
= (1 − ϵ)j−1 (ξji − Eπ (2aji − 1))ξji − f1λ (ϵ)
4d
j=1 i=1
H d
δ XX
= (1 − ϵ)j−1 |ξji − Eπ (2aji − 1)| − f1λ (ϵ) (D.3)
4d
j=1 i=1
H d
δ XX
≥ (1 − ϵ)H−1 |ξji − Eπ (2aji − 1)| − f1λ (ϵ), (D.4)
4d
j=1 i=1

where (D.3) follows from the fact that ξji ∈ {−1, 1}. To continue,
H d H d
δ XX δ XX
|ξji − Eπ (2aji − 1)| ≥ |ξji − Eπ (2aji − 1)|1{ξhi ̸= sign(E(2aji − 1))}|
4d 4d
j=1 i=1 j=1 i=1
H d
δ XX
≥ 1{ξhi ̸= sign(E(2aji − 1))}|
4d
j=1 i=1
δ
= DH (ξ, ξ π ), (D.5)
4d
where DH (·, ·) is the Hamming distance. Then applying the Assouad’s method (Tsybakov, 2009,
Lemma 2.12), we have
dH
inf sup Eξ [DH (ξ, ξ π )] ≥ min inf [Qξ (ψ(D) ̸= ξ) + Qξπ (ψ(D) ̸= ξ π )]
π ξ∈Ω 2 DH (ξ,ξ π )=1 ϕ

dH  1 1/2 
≥ 1− max DKL (Qξ ∥Qξπ ) , (D.6)
2 2 DH (ξ,ξπ )=1
where DKL represents the KL divergence. Next we bound DKL (Qξ ∥Qξπ ), according to the definition
of Qξ (D), we have
K Y
Y H
Qξ (D) = πhb (aτh |sτh )Ph0 (sτh+1 |sτh , aτh )Rh (rhτ |sτh , aτh ),
k=1 τ =1

where Rh (rhτ |sτh , aτh ) refers to the density function of N (rh (sτh , aτh ), 1) at rhτ . By the fact that the
difference between the two distributions Qξ (D) and Qξπ (D) lie only in the reward distribution
corresponding to the index where ξ and ξ π differ, we have
K
d+2
X   d + 1   d − 1  K δ2 1
DKL (Qξ (D)∥Qξπ (D)) = DKL N δ, 1 ∥N δ, 1 = 2
≤ , (D.7)
2d 2d d+2d 2
τ =1

where the last inequality follows from the definition of δ. By the fact that δ = d3/2 / 2K, we have
δH(1 − ϵ)H−1  1 1/2 
inf sup subopt(M, π̂, s, λ) ≥ 1− max DKL (Qξ ∥Qξπ ) − fhλ (ϵ) (D.8)
π̂ M ∈M 4 2 DH (ξ,ξπ )=1

δH(1 − ϵ)H−1
≥ − fhλ (ϵ) (D.9)
8
d3/2 H(1 − ϵ)H−1
= √ − fhλ (ϵ)
8 2K
d3/2 H d3/2 H
≥ √ − √ (D.10)
16 2K 32 2K
H d
1 X π∗ ,P h X i
≥ √ E ∥ϕi (s, a)1i ∥Λ−1 |s1 = s , (D.11)
128 2 h=1 i=1
h

where (D.8) holds by applying the inequality (D.4), (D.5) and (D.6) in order, (D.9) holds by (D.7),
(D.10) holds by the definition of $\epsilon$ in (D.2), and (D.11) holds by Lemma F.4. Hence, it suffices to take $c = 1/(128\sqrt{2})$. This concludes the proof.

E Proof of Technical Lemmas


E.1 Proof of Lemma C.1
Proof. We first decompose $\mathrm{SubOpt}(\hat\pi, s, \lambda)$ as follows:

$$\mathrm{SubOpt}(\hat\pi,s,\lambda) = V_1^{\star,\lambda}(s) - V_1^{\hat\pi,\lambda}(s) = \underbrace{V_1^{\star,\lambda}(s) - \hat{V}_1^\lambda(s)}_{(i)} + \underbrace{\hat{V}_1^\lambda(s) - V_1^{\hat\pi,\lambda}(s)}_{(ii)},$$

where $\hat{V}_1^\lambda(s)$ is computed by the algorithm. We next bound terms (i) and (ii), respectively. For term (i), the error comes from the estimation error of the value function and the Q-function; therefore, by (3.2) and the definition of $\hat{Q}_h^\lambda(s,a)$ in the meta-algorithm, for any $h\in[H]$ we can decompose the error as:
⋆ ,λ ⋆ ,λ
Vhπ (s) − V̂hλ (s) = Qπh (s, πh⋆ (s)) − Q̂λh (s, π̂h (s))
⋆ ,λ
≤ Qπh (s, πh⋆ (s)) − Q̂λh (s, π ⋆ (s)) (E.1)
π ⋆ ,λ
= Thλ Vh+1 (s, πh⋆ (s)) − Thλ V̂h+1
λ
(s, πh⋆ (s)) + Thλ V̂h+1
λ
(s, πh⋆ (s)) − Q̂λh (s, π ⋆ (s))
π ,λ ⋆
= Thλ Vh+1 (s, πh⋆ (s)) − Thλ V̂h+1
λ
(s, πh⋆ (s)) + δhλ (s, πh⋆ (s)), (E.2)

where (E.1) comes from the fact that π̂h is the greedy policy with respect to Q̂λh (s, a), the regularized
robust Bellman update error δhλ is defined as:

δhλ (s, a) := Thλ V̂h+1


λ
(s, a) − Q̂λh (s, a), ∀(s, a) ∈ S × A, (E.3)

which aims to eliminate the clip operator in the definition of $\hat{Q}_h^\lambda(s,a)$. Denote the worst-case transition kernel w.r.t. the regularized Bellman operator as $\hat{P} = \{\hat{P}_h\}_{h\in[H]}$, where $\hat{P}_h$ is defined as:

$$\hat{P}_h(\cdot|s,a) = \mathop{\mathrm{argmin}}_{\mu_h\in\Delta(\mathcal{S})^d,\, P_h=\langle\phi,\mu_h\rangle}\Big\{\mathbb{E}_{s'\sim P_h(\cdot|s,a)}\big[\hat{V}_{h+1}^\lambda(s')\big] + \lambda\langle\phi(s,a), D(\mu_h\|\mu_h^0)\rangle\Big\}$$
$$= \sum_{i=1}^d\phi_i(s,a)\mathop{\mathrm{argmin}}_{\mu_{h,i}\in\Delta(\mathcal{S})}\Big\{\mathbb{E}_{s'\sim\mu_{h,i}}\big[\hat{V}_{h+1}^\lambda(s')\big] + \lambda D(\mu_{h,i}\|\mu_{h,i}^0)\Big\} = \sum_{i=1}^d\phi_i(s,a)\hat\mu_{h,i}(\cdot),$$

where the µ̂h,i is defined as µ̂h,i = argminµh,i ∈∆(S) Es′ ∼µh,i [V̂h+1 (s′ )] + λD(µh,i ∥µ0h,i ) . Hence the
 

difference between the regularized Bellman operator and the empirical regularized Bellman operator
can be bounded as

π ,λ
Thλ Vh+1 (s, πh⋆ (s)) − Thλ V̂h+1
λ
(s, πh⋆ (s))
 π⋆ ,λ ′ 
= rh (s, πh⋆ (s)) + (s ) + λ⟨ϕ(s, πh⋆ (s)), D(µh ||µ0h )⟩
 
inf Es′ ∼Ph (·|s,πh⋆ (s)) Vh+1
µh ∈∆(S)d ,Ph =⟨ϕ,µh ⟩

− rh (s, πh⋆ (s)) − (s′ ) + λ⟨ϕ(s, πh⋆ (s)), D(µh ||µ0h )⟩


  λ  
inf Es′ ∼Ph (·|s,πh⋆ (s)) V̂h+1
µh ∈∆(S)d ,Ph =⟨ϕ,µh ⟩
⋆,λ ′
λ
≤ Es′ ∼P̂h (·|s,π⋆ (s)) [V̂h+1 (s′ )] − Es′ ∼P̂h (·|s,π⋆ (s)) [Vh+1 (s )]
h h
⋆,λ ′
λ
= Es′ ∼P̂h (·|s,π⋆ (s)) [V̂h+1 (s′ ) − Vh+1 (s )]. (E.4)
h

Combining inequality (E.2) and (E.4), we have for any h ∈ [H]


⋆ ,λ
Vhπ (s) − V̂hλ (s) ≤ Es′ ∼P̂h (·|s,π⋆ (s)) [V̂h+1
λ ⋆,λ ′
(s′ ) − Vh+1 (s )] + δhλ (s, πh⋆ (s)). (E.5)
h

Recursively applying (E.5), we have


H
⋆ ⋆ ,P̂
V1π ,λ (s)
X
(i) = V̂1λ (s) Eπ
 λ 
− ≤ δh (sh , ah )∥s1 = s .
h=1

Next we bound term (ii), similar to term (i), by (C.1), the error can be decomposed to

V̂hλ (s) − Vhπ̂,λ (s) = Q̂λh (s, π̂h (s)) − Qπ̂,λ


h (s, π̂h (s))
π̂,λ
λ
= Thλ V̂h+1 (s, π̂h (s)) − δhλ (s, π̂h (s)) − Thλ Vh+1 (s, π̂h (s)). (E.6)

Denote P π̂ = {Phπ̂ }h∈[H] where Phπ̂ is defined as: ∀(s, a) ∈ S × A,

Phπ̂ (·|s, a) = (s′ ) + λ⟨ϕ(s, a), D(µh ||µ0h )⟩ .


  π̂  
argmin Es′ ∼Ph (·|s,a) V̂h+1
µh ∈∆(S)d ,Ph =⟨ϕ,µh ⟩

Hence similar to the bound in (E.4), the difference between the regularized Bellman operator and
the empirical regularized Bellman operator can be bounded as
π̂,λ π̂,λ ′
Thλ V̂h+1
λ
(s, π̂h (s)) − Thλ Vh+1 λ
(s, π̂h (s)) ≤ Es′ ∼P π̂ (·|s,π̂h (s)) [V̂h+1 (s′ ) − Vh+1 (s )]. (E.7)
h

Combining inequality (E.6), (E.7), we have for any h ∈ [H]

V̂hλ (s) − Vhπ̂,λ (s) ≤ Es′ ∼P π̂ (·|s,π̂h (s)) [V̂h+1


λ π̂,λ ′
(s′ ) − Vh+1 (s )] − δhλ (s, π̂h (s)). (E.8)
h

Recursively applying (E.8), we obtain the "pessimism" of the estimated value function:

$$\hat{V}_1^\lambda(s) - V_1^{\hat\pi,\lambda}(s) \le \mathbb{E}^{\hat\pi, P^{\hat\pi}}\Big[-\sum_{h=1}^H\delta_h^\lambda(s_h,a_h)\,\Big|\,s_1 = s\Big].$$
h=1

Therefore combining the two bounds above, we have


H H
⋆ ,P̂ π̂ 
X  X
SubOpt(π̂, s, λ) = (i) + (ii) ≤ Eπ δhλ (sh , ah )|s1 = s + Eπ̂,P − δhλ (sh , ah )∥s1 = s .
 

h=1 h=1
(E.9)

Hence, it remains to estimate the range of the regularized Bellman update error $\delta_h^\lambda(s,a)$. Recalling the definition in (E.3), we claim that

$$0 \le \delta_h^\lambda(s,a) \le 2\Gamma_h(s,a) \qquad (E.10)$$

holds for all $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$. For the LHS of (E.10), first notice that if $\langle\phi(s,a),\hat{w}_h^\lambda\rangle\,-$
Γh (s, a) ≤ 0, the inequality holds trivially as Q̂λh (s, a) = 0. Next we consider the case where
⟨ϕ(s, a), ŵhλ ⟩ − Γh (s, a) ≥ 0. By the definition of Q̂λh (s, a) and the assumption in the lemma, we have

δhλ (s, a) = Thλ V̂h+1


λ
(s, a) − Q̂λh (s, a)
= Thλ V̂h+1
λ
(s, a) − min ⟨ϕ(s, a), ŵhλ ⟩ − Γh (s, a), H − h + 1


≥ Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩ + Γh (s, a)
≥ 0.

On the other hand, by the assumption in the lemma, we have

⟨ϕ(s, a), ŵhλ ⟩ − Γh (s, a) ≤ Thλ V̂h+1


λ
(s, a) ≤ H − h + 1.

Hence, we can upper bound δhλ (s, a) as

δhλ (s, a) = Thλ V̂h+1


λ
(s, a) − Q̂λh (s, a)
= Thλ V̂h+1
λ
(s, a) − max ⟨ϕ(s, a), ŵhλ ⟩ − Γh (s, a), 0


≤ Thλ V̂h+1
λ
(s, a) − ⟨ϕ(s, a), ŵhλ ⟩ + Γh (s, a)
≤ 2Γh (s, a).

This concludes the claim. Now it remains to bound the empirical transition kernel P̂ . Noticing the
fact that ∀h ∈ [H], (s, a) ∈ S × A,

λD(µ̂h,i ∥µ0h,i ) ≤ Es′ ∼µ̂h,i [V̂h+1


λ
(s′ )] + λD(µ̂h,i ∥µ0h,i )
Es′ ∼µh,i [V̂h+1 (s′ )] + λD(µh,i ∥µ0h,i )
 
= inf
µh,i ∈∆(S)
⋆,λ ′
(s )] + λD(µh,i ∥µ0h,i ) (E.11)
 
≤ inf Es′ ∼µh,i [Vh+1
µh,i ∈∆(S)
⋆,λ ′
≤ Es′ ∼µ0 [Vh+1 (s )]
h,i

⋆,λ
≤ max Vh+1 (s),
s∈S

where (E.11) comes from the pessimism of value function, i.e V̂h+1 λ (s) ≤ V ⋆,λ (s), ∀h ∈ [H]. Hence,
h
the empirical transition kernel P̂h (·|s, a) is contained in the set U λ (P 0 ) defined in Equation (5.1).
Hence, by (E.9) and (E.10), we have
H H
⋆ ,P̂ π̂ 
X  X
SubOpt(π̂, s, λ) ≤ Eπ δhλ (sh , ah )|s1 = s + Eπ̂,P − δhλ (sh , ah )∥s1 = s
 

h=1 h=1
H
⋆ ,P̂
X

 
≤2 Γh (sh , ah )|s1 = s
h=1
H
⋆ ,P
X

 
≤2 sup Γh (sh , ah )|s1 = s .
P ∈U λ (P 0 ) h=1

This concludes the proof.

E.2 Proof of Lemma C.2


Proof. For all h ∈ [H], from the definition of wh ,

∥whλ ∥2 = ∥Es∼µ0 [V̂h+1
λ
(s)]αh+1 ∥2 ≤ H d,
h

where the inequality follows from the fact that V̂h+1λ ≤ H, for all h ∈ [H]. Meanwhile, by the
definition of ŵh in Algorithm 2, and the triangle inequality,
λ

K
X
∥ŵhλ ∥2 = Λ−1
h ϕ(sτh , aτh )[V̂h+1
λ
(s)]αh+1
2
τ =1
K
X
≤H ∥Λ−1 τ τ
h ϕ(sh , ah )∥2
τ =1
K q
−1/2 −1/2
X
=H ϕ(sτh , aτh )⊤ Λh Λ−1
h Λh ϕ(sτh , aτh )
τ =1
K q
H X
≤√ ϕ(sτh , aτh )⊤ Λ−1 τ τ
h ϕ(sh , ah ) (E.12)
γ
τ =1

v
uK
H Ku X
≤ √ t ϕ(sτh , aτh )⊤ Λ−1 τ τ
h ϕ(sh , ah ) (E.13)
γ
τ =1
√ q
H K
= √ Tr(Λ−1h (Λh − γI))
γ

H Kp
≤ √ Tr(I)
γ
s
Kd
=H ,
γ

where (E.12) follows from the fact that ∥Λ−1
h ∥ ≤ γ , (E.13) follows from the Cauchy-Schwartz
−1

inequality. Then we conclude the proof.

E.3 Proof of Lemma C.3


Proof. By definition, we have
h

λ (s)
V̂h+1 i Z

λ (s)
V̂h+1
Z √
whλ ∥2 = Es∼µ0 e λ = e λ µ0h (s) ds ≤ ∥µ0h (s)∥2 ds ≤ d,
h 2 2
S S

this concludes the proof of whλ . For ŵhλ ,


K λ (s
V̂h+1
X h+1 )
∥ŵhλ ∥2 = Λ−1
h ϕ(sτh , aτh )e− λ
2
τ =1
K
X
≤ ∥Λ−1 τ τ
h ϕ(sh , ah )∥2
τ =1
K q
−1/2 −1/2
X
= ϕ(sτh , aτh )⊤ Λh Λ−1
h Λh ϕ(sτh , aτh )
τ =1
K q
1 X
≤√ ϕ(sτh , aτh )⊤ Λ−1 τ τ
h ϕ(sh , ah ) (E.14)
γ
τ =1
√ uK
v
K uX
≤ √ t ϕ(sτh , aτh )⊤ Λ−1 τ τ
h ϕ(sh , ah ) (E.15)
γ
τ =1
√ q
K
= √ Tr(Λ−1 h (Λh − γI))
γ
√ s
Kp Kd
≤ √ Tr(I) = ,
γ γ

where (E.14) follows from the fact that ∥Λ−1


h ∥ ≤ γ , (E.15) follows from the Cauchy-Schwartz
−1

inequality. Then we conclude the proof.

E.4 Proof of Lemma C.4


Proof. Denote A = β 2 Λ−1
h , then we have ∥θ∥2 ≤ L, ∥A∥2 ≤ B γ . For any two functions V1 , V2 ∈ V
2 −1

with parameters (θ1 , A1 ), (θ2 , A2 ), since both {·}[0,H−h+1] and maxa are contraction maps,

dist(V1 , V2 ) (E.16)
  d
X   d
X 
≤ sup ϕ(s, a)⊤ (θ1 − θ2 ) − λ log 1 + ∥ϕi (s, a)1i ∥A1 − log 1 + ∥ϕi (s, a)1i ∥A2
s,a
i=1 i=1
Pd
∥ϕi (s, a)1i ∥A1
1+
≤ sup ϕ⊤ (θ1 − θ2 ) − λ log Pdi=1
ϕ∈Rd ,∥ϕ∥≤1 1 + i=1 ∥ϕi (s, a)1i ∥A2

Pd
⊤ 1+ ∥ϕi (s, a)1i ∥A1
≤ sup |ϕ (θ1 − θ2 )| + λ sup log Pi=1
d
, (E.17)
ϕ∈Rd :∥ϕ∥≤1 ϕ∈Rd :∥ϕ∥≤1 1+ i=1 ∥ϕi (s, a)1i ∥A2

we notice the fact that: for any x > 0, y > 0,


1+x x − y 
log = log + 1 ≤ log(|x − y| + 1) ≤ |x − y|.
1+y 1+y
Therefore, (E.17) can be bounded as:
d
X d
X
(E.17) ≤ sup |ϕ⊤ (θ1 − θ2 )| + λ sup ∥ϕi (s, a)1i ∥A1 − ∥ϕi (s, a)1i ∥A2
ϕ∈Rd :∥ϕ∥≤1 ϕ∈Rd :∥ϕ∥≤1 i=1 i=1
Xd q Xd q
= sup |ϕ⊤ (θ1 − θ2 )| + λ sup ⊤
ϕi 1i A1 ϕi 1i − ϕi 1⊤i A 2 ϕ i 1i
ϕ∈Rd :∥ϕ∥≤1 ϕ∈Rd :∥ϕ∥≤1 i=1 i=1
d q
X
≤ ∥θ1 − θ2 ∥2 + λ sup ϕ i 1⊤
i (A1 − A2 )ϕi 1i (E.18)
ϕ∈Rd :∥ϕ∥≤1 i=1
p d
X
≤ ∥θ1 − θ2 ∥2 + λ ∥A1 − A2 ∥ sup ∥ϕi 1i ∥
ϕ∈Rd :∥ϕ∥≤1 i=1

(E.19)
p
≤ ∥θ1 − θ2 ∥2 + λ ∥A1 − A2 ∥F ,
√ √
where the (E.18) follows from the triangular inequality and the fact | x − y| ≤ |x − y|, and
p

∥ · ∥F denotes the Frobenius norm. We next define that Cθ is an ϵ/2-cover of {θ ∈ Rd |∥θ∥2 ≤ L},
and the CA is an ϵ2 /4λ2 -cover of {A ∈ Rd×d |∥A∥F ≤ d1/2 B 2 γ −1 }. By Lemma F.5, we have that:
2
|Cθ | ≤ (1 + 4L/ϵ)d , |CA | ≤ (1 + 8λ2 d1/2 B 2 /γϵ2 )d .

By (E.19), for any V1 ∈ V, there exists θ2 ∈ Cθ and A2 ∈ CA s.t V2 parametrized by (θ2 , A2 )


satisfying dist(V1 , V2 ) ≤ ϵ. Therefore, we have the following:

log |N (ϵ)| ≤ log |Cθ | + log |CA | ≤ d log(1 + 4L/ϵ) + d2 log(1 + 8λ2 d1/2 B 2 /γϵ2 ).

Hence we conclude the proof.

E.5 Proof of Lemma C.5


Proof. By definition, we have that

∥ŵhλ ∥2
h n 0 1 µ0h,i λ 1 µ0h,i λ oi
= max ʵh,i [V̂h+1
λ
(s)]α + (Ê [V̂h+1 (s)]α )2 − Ê [V̂h+1 (s)]2α
λ )
α∈[(V̂h+1 λ
min ,(V̂h+1 )max ]
4λ 4λ i∈[d] 2
h H2 i
≤ H+ (E.20)
2λ i∈[d] 2
√  H2 
= d H+ ,

0 0
where (E.20) follows by the fact that ʵh,i [V̂h+1
λ (s)] ∈ [0, H], ʵh,i [V̂ λ (s)]2 ∈ [0, H 2 ].
α h+1 α

E.6 Proof of Lemma D.1
Proof. We first prove the left-hand inequality of the lemma by induction from the last stage $H$. From the definitions of $V_H^{\pi,\lambda}$ and $\theta_h$, we have

d
δ  
VHπ,λ (s1 )
X

= rH (s1 , πH (s1 )) = ϕ(s1 , π(s1 )) θh = d+ ξHi Eπ aHi .
2d
i=1

This is the base case. Now suppose the conclusion holds for stage h + 1, that is to say,
H d
π,λ δ X j−h−1
 X 
Vh+1 (s1 ) ≤ (1 − ϵ) d+ ξji Eπ aji .
2d
j=h+1 i=1

Recall the regularized robust bellman equation in Proposition 3.2 and the regularized duality of the
three divergences, we have

Qπ,λ
 π,λ ′ 
(s ) + λ⟨ϕ(s, a), D(µh ||µ0h )⟩
 
h (s1 , a) = rh (s1 , a) + inf Es′ ∼Ph (·|s,a) Vh+1
µh ∈∆(S)d+2 ,Ph =⟨ϕ,µh ⟩
π,λ ′
≤ rh (s1 , a) + Es′ ∼P 0 (·|s1 ,a) [Vh+1 (s )] (E.21)
h
π,λ
= rh (s1 , a) + (1 − ϵ)Vh+1 (s1 ). (E.22)

Then with regularized robust bellman equation in Proposition 3.2 and the inductive hypothesis, we
have

Vhπ,λ (s1 ) = Qπ,λ


h (s1 , π(s1 ))
π,λ
≤ rh (s1 , πh (s1 )) + (1 − ϵ)Vh+1 (s1 ) (E.23)
d H d
δ  X  δ X  X 
= d+ ξhi Eπ ahi + (1 − ϵ)j−h d + ξji Eπ aji
2d 2d
i=1 j=h+1 i=1
H d
δ X  X 
= (1 − ϵ)j−h d + ξji Eπ aji .
2d
j=h i=1

Hence, by the induction argument, we conclude the proof of the left-hand inequality. Furthermore, for any h ∈ [H],
we can upper bound Vhπ,λ (s) as

H d
δ X  X 
Vhπ,λ (s) ≤ (1 − ϵ)j−h d + ξji Eπ aji ≤ δ(H − h) ≤ λ(H − h)/H ≤ λ, (E.24)
2d
j=h i=1

where the third inequality holds by the definition of $\delta$. For the remaining right-hand inequality of (D.1), we prove it by discussing the TV, KL, and $\chi^2$ cases respectively.

Case I - TV. The case for TV holds trivially as by Proposition 4.2, we have

Qπ,λ
 π,λ ′ 
(s ) + λ⟨ϕ(s, a), D(µh ||µ0h )⟩
 
h (s1 , a) = rh (s1 , a) + inf Es′ ∼Ph (·|s,a) Vh+1
µh ∈∆(S)d+2 ,Ph =⟨ϕ,µh ⟩

π,λ ′
= rh (s1 , a) + ⟨ϕ(s1 , a), Es′ ∼µ0 [Vh+1 (s )]min π,λ ′ ⟩ (E.25)
h s′ (Vh+1 (s ))+λ
π,λ ′
= rh (s1 , a) + Es′ ∼P 0 (·|s,a) [Vh+1 (s )]min π,λ ′
h s′ (Vh+1 (s ))+λ
π,λ
= rh (s1 , a) + (1 − ϵ)Vh+1 (s1 ), (E.26)

where (E.26) holds by (E.24). Hence, the inequality in (E.23) holds for equality. This concludes the
proof for TV-divergence.

Case II - KL. We prove by induction. The case holds trivially in last stage H. Suppose
H d
π,λ δ X j−h−1
 X 
Vh+1 (s1 ) ≥ (1 − ϵ) d+ ξji Eπ aji − (H − h)λϵ(e − 1).
2d
j=h+1 i=1

Recall the duality form of Proposition 4.5, the Q-function at stage h can be upper bounded as:

Qπ,λ
  π,λ ′  0

h (s1 , a) = rh (s1 , a) + inf E s ′ ∼P (·|s,a) V
h h+1 (s ) + λ⟨ϕ(s, a), D(µh ||µh )⟩
µh ∈∆(S)d+2 ,Ph =⟨ϕ,µh ⟩
π,λ ′
= rh (s1 , a) + ⟨ϕ(s1 , a), −λ log Es′ ∼µ0 e−Vh+1 (s )/λ ⟩
h
π,λ
−Vh+1 (s1 )/λ

= rh (s1 , a) − λ log ϵ + (1 − ϵ)e
π,λ
π,λ
(s1 ) − λ log ϵeVh+1 (s1 )/λ + (1 − ϵ)

= rh (s1 , a) + Vh+1
π,λ
π,λ
(s1 ) − λϵ eVh+1 (s1 )/λ − 1 (E.27)

≥ rh (s1 , a) + Vh+1
π,λ
≥ rh (s1 , a) + Vh+1 (s1 ) − λϵ(e − 1), (E.28)

where (E.27) follows by the fact that log(1 + x) ≤ x, ∀x > 0, (E.28) follows by (E.24). Therefore, by
the inductive hypothesis, we have

Vhπ,λ (s1 ) = Qπ,λ


h (s1 , π(s1 ))
π,λ
≥ rh (s1 , πh (s1 )) + (1 − ϵ)Vh+1 (s1 ) − λϵ(e − 1)
H d
δ X  X 
= (1 − ϵ)j−h d + ξji Eπ aji − (H − h)λϵ(e − 1).
2d
j=h i=1

This finishes the KL setting.

Case III - χ2 . Similar to the case in TV, KL, by the duality of χ2 in Proposition 4.7, we have

Qπ,λ
 π,λ ′ 
(s ) + λ⟨ϕ(s, a), D(µh ||µ0h )⟩
 
h (s1 , a) = rh (s1 , a) + inf Es′ ∼Ph (·|s,a) Vh+1
µh ∈∆(S)d+2 ,Ph =⟨ϕ,µh ⟩
n
π,λ ϵ π,λ
o
= rh (s1 , a) + (1 − ϵ) sup [Vh+1 (s1 )]α − [Vh+1 (s1 )]2α
α∈[Vmin ,Vmax ] 4λ
h
π,λ ϵ i
≥ rh (s1 , a) + (1 − ϵ) Vh+1 (s1 ) − [V π,λ (s1 )]2
4λ h+1
π,λ ϵλ(1 − ϵ)
≥ rh (s1 , a) + (1 − ϵ)Vh+1 (s1 ) − , (E.29)
4

where (E.29) follows by (E.24). Hence, similar to Case II, by induction, we have
H d
δ X  X  ϵλ(1 − ϵ)
Vhπ,λ (s1 ) ≥ (1 − ϵ)j−h d + ξji Eπ aji − (H − h) .
2d 4
j=h i=1

This finishes the χ2 setting. This concludes the proof.

F Auxiliary Lemmas
Lemma F.1 (Lemma D.3 of Liu and Xu (2024a)). For any h ∈ [H], let Vh denote a class of functions
mapping from S to R with the following form:
n d
X o
Vh (x; θ, β, Λh ) = max ϕ(s, a)⊤ θ − β ∥ϕi (·, ·)1i ∥Λ−1 ,
a∈A h [0,H−h+1]
i=1

the parameters (θ, β, Λh ) satisfy ∥θ∥2 ≤ L, β ∈ [0, B], γmin (Λh ) ≥ γ. Let Nh (ϵ) be the ϵ-covering
number of V with respect to the distance dist(V1 , V2 ) = supx |V1 (x) − V2 (x)|. Then

log Nh (ϵ) ≤ d log(1 + 4L/ϵ) + d2 log(1 + 8d1/2 B 2 /γϵ2 ).

Lemma F.2 (Corollary 4.2.11 of Vershynin (2018)). Denote the ϵ-covering number of the closed
interval [a, b] for some real number b > a with respect to the distance metric d(α1 , α2 ) = |α1 − α2 |
as Nϵ ([a, b]), then we have Nϵ ([a, b]) ≤ 3(b − a)/ϵ.

Lemma F.3 (Lemma B.2 of Jin et al. (2021)). Let $f:\mathcal{S}\to[0,R-1]$ be any fixed function. For any $\delta\in(0,1)$, we have
$$\mathbb{P}\Big(\Big\|\sum_{\tau=1}^K\phi(s_h^\tau,a_h^\tau)\eta_h^\tau(f)\Big\|_{\Lambda_h^{-1}}^2 \ge R^2\big(2\log(1/\delta) + d\log(1+K/\gamma)\big)\Big) \le \delta,$$
where $\eta_h^\tau(f) = \mathbb{E}_{s'\sim P_h^0(\cdot|s_h^\tau,a_h^\tau)}[f(s')] - f(s_{h+1}^\tau)$.


h h h

Lemma F.4 (Lemma F.3 of Liu and Xu (2024b)). If $K \ge O(d^6)$ and the feature map is defined as in Appendix D.1, then with probability at least $1-\delta$, we have for any transition $P$,
$$\mathbb{E}^{\pi^\star,P}\Big[\sum_{h=1}^H\sum_{i=1}^d\|\phi_i(s_h,a_h)\mathbf{1}_i\|_{\Lambda_h^{-1}}\,\Big|\,s_1 = s_1\Big] \le \frac{4d^{3/2}H}{\sqrt{K}}.$$

Lemma F.5 (Lemma 5.2 of Vershynin (2010)). For any ϵ > 0, the ϵ -covering number of the
Euclidean ball in Rd with radius R > 0 is upper bounded by (1 + 2R/ϵ)d .

References
Blanchet, J., Lu, M., Zhang, T. and Zhong, H. (2024). Double pessimism is provably efficient
for distributionally robust offline reinforcement learning: Generic algorithm and robust partial
coverage. Advances in Neural Information Processing Systems 36. 1, 2, 3, 8, 10, 11, 12, 38

Garcıa, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning.
Journal of Machine Learning Research 16 1437–1480. 1

Goyal, V. and Grand-Clement, J. (2023). Robust markov decision processes: Beyond rectangu-
larity. Mathematics of Operations Research 48 203–226. 2, 5

Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research 30


257–280. 1, 3

Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2020). Provably efficient reinforcement learning
with linear function approximation. In Conference on learning theory. PMLR. 5

Jin, Y., Yang, Z. and Wang, Z. (2021). Is pessimism provably efficient for offline rl? In
International Conference on Machine Learning. PMLR. 6, 14, 15, 50

Lee, J., Jeon, W., Lee, B.-J., Pineau, J. and Kim, K.-E. (2021). Optidice: Offline policy
optimization via stationary distribution correction estimation. 2

Levine, S., Kumar, A., Tucker, G. and Fu, J. (2020). Offline reinforcement learning: Tutorial,
review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 . 1

Liu, Z., Wang, W. and Xu, P. (2024). Upper and lower bounds for distributionally robust
off-dynamics reinforcement learning. arXiv preprint arXiv:2409.20521 . 3

Liu, Z. and Xu, P. (2024a). Distributionally robust off-dynamics reinforcement learning: Provable
efficiency with linear function approximation. In International Conference on Artificial Intelligence
and Statistics. PMLR. 2, 3, 9, 13, 14, 15, 16, 50

Liu, Z. and Xu, P. (2024b). Minimax optimal and computationally efficient algorithms for
distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621 . 2, 3, 6, 8,
9, 10, 12, 13, 14, 15, 50

Lu, M., Zhong, H., Zhang, T. and Blanchet, J. (2024). Distributionally robust reinforcement
learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv
preprint arXiv:2404.03578 . 3

Ma, X., Liang, Z., Blanchet, J., Liu, M., Xia, L., Zhang, J., Zhao, Q. and Zhou, Z. (2022).
Distributionally robust offline reinforcement learning with linear function approximation. arXiv
preprint arXiv:2209.06620 . 2, 3, 5, 8, 11, 12, 13, 14, 15, 16

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The computer
journal 7 308–313. 2

Nilim, A. and El Ghaoui, L. (2005). Robust control of markov decision processes with uncertain
transition matrices. Operations Research 53 780–798. 1, 3

Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V. and Song, D. (2018). Assessing
generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282 . 1

Panaganti, K., Wierman, A. and Mazumdar, E. (2024). Model-free robust φ-divergence


reinforcement learning using both offline and online data. arXiv preprint arXiv:2405.05468 . 2, 4

Panaganti, K., Xu, Z., Kalathil, D. and Ghavamzadeh, M. (2022). Robust reinforcement
learning using offline data. Advances in neural information processing systems 35 32211–32224. 3,
10

Shi, L. and Chi, Y. (2024). Distributionally robust model-based offline reinforcement learning with
near-optimal sample complexity. Journal of Machine Learning Research 25 1–91. 1, 3, 7

Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M. and Chi, Y. (2024). The curious price of
distributional robustness in reinforcement learning with a generative model. Advances in Neural
Information Processing Systems 36. 2, 11

Sutton, R. S. (2018). Reinforcement learning: An introduction. A Bradford Book . 5

Tamar, A., Mannor, S. and Xu, H. (2014). Scaling up robust mdps using function approximation.
In International conference on machine learning. PMLR. 14

Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York. 41

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv


preprint arXiv:1011.3027 . 50

Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data


science, vol. 47. Cambridge university press. 50

Wang, H., Shi, L. and Chi, Y. (2024a). Sample complexity of offline distributionally robust linear
markov decision processes. arXiv preprint arXiv:2403.12946 . 2, 3, 10, 12

Wang, R., Yang, Y., Liu, Z., Zhou, D. and Xu, P. (2024b). Return augmented decision
transformer for off-dynamics reinforcement learning. arXiv preprint arXiv:2410.23450 . 1

Yang, W., Wang, H., Kozuno, T., Jordan, S. M. and Zhang, Z. (2023). Robust markov
decision processes without model estimation. arXiv preprint arXiv:2302.01248 . 2, 4, 14

Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D. and Hsieh, C.-J. (2020). Robust
deep reinforcement learning against adversarial perturbations on state observations. Advances in
Neural Information Processing Systems 33 21024–21037. 1, 2, 3

Zhang, R., Hu, Y. and Li, N. (2023). Regularized robust mdps and risk-sensitive mdps: Equivalence,
policy gradient, and sample complexity. arXiv preprint arXiv:2306.11626 . 2, 4, 8, 10, 11

Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J. and Glynn, P. (2021). Finite-sample
regret bound for distributionally robust offline tabular reinforcement learning. In International
Conference on Artificial Intelligence and Statistics. PMLR. 14

