
Learning in Budgeted Auctions with Spacing Objectives

Giannis Fikioris∗ (Cornell University, [email protected])
Robert Kleinberg (Cornell University, [email protected])
Yoav Kolumbus (Cornell University, [email protected])
Raunak Kumar (Cornell University and Microsoft, [email protected])
Yishay Mansour† (Tel Aviv University and Google Research, [email protected])
Éva Tardos‡ (Cornell University, [email protected])

arXiv:2411.04843v1 [cs.GT] 7 Nov 2024

Abstract

In many repeated auction settings, participants care not only about how frequently they win but
also about how their winnings are distributed over time. This problem arises in various practical do-
mains where avoiding congested demand is crucial, such as online retail sales and compute services,
as well as in advertising campaigns that require sustained visibility over time. We introduce a simple
model of this phenomenon, modeling it as a budgeted auction where the value of a win is a concave
function of the time since the last win. This implies that for a given number of wins, even spacing
over time is optimal. We also extend our model and results to the case when not all wins result in
“conversions” (realization of actual gains), and the probability of conversion depends on a context. The
goal is then to maximize and evenly space conversions rather than just wins.
We study the optimal policies for this setting in second-price auctions and offer learning algorithms
for the bidders that achieve low regret against the optimal bidding policy in a Bayesian online setting.
Our main result is a computationally efficient online learning algorithm that achieves Õ(√𝑇) regret.
We achieve this by showing that an infinite-horizon Markov decision process (MDP) with the budget
constraint in expectation is essentially equivalent to our problem, even when limiting that MDP to
a very small number of states. The algorithm achieves low regret by learning a bidding policy that
chooses bids as a function of the context and the state of the system, which will be the time elapsed
since the last win (or conversion). We show that state-independent strategies incur linear regret even
without uncertainty of conversions. We complement this by showing that there are state-independent
strategies that, while still having linear regret, achieve a (1 − 1/𝑒) approximation to the optimal reward
as 𝑇 → ∞.

“The only reason for time is so that everything doesn’t happen at once.”
— Albert Einstein

∗ Supported in part by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate
(NDSEG) Fellowship, the Onassis Foundation – Scholarship ID: F ZS 068-1/2022-2023, and ONR MURI grant N000142412742.
† Partially supported by funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research
and innovation program (grant agreement No. 882396), by the Israel Science Foundation, the Yandex Initiative for Machine
Learning at Tel Aviv University and a grant from the Tel Aviv University Center for AI and Data Science (TAD).
‡ Partially supported by AFOSR grant FA9550-23-1-0410, AFOSR grant FA9550-231-0068, and ONR MURI grant N000142412742.
1 Introduction
Auctions are a cornerstone of economic theory, illustrating one of the earliest structured forms of market
interaction. Today, auctions play a central and ever-increasing role in the digital landscape, spanning
domains such as online advertising [28, 38, 57, 83], retail markets [29, 81], and blockchain fee markets
[14, 43, 65, 73], and are studied extensively. A prominent theme in online auction applications and the
ensuing theoretical studies is that a bidder (e.g., a seller on an online marketplace like Amazon or an
advertiser bidding for ad positions) participates in the auction multiple times. Hence, a bidder needs to
consider how to utilize their budget efficiently and how to learn from experience over time. Typically, the
bidder engages in these markets by using some learning algorithm to set the bid in each auction period
[2, 3, 10, 13, 15, 18, 32, 34, 35, 42, 45, 52, 53, 64, 68, 70, 84].
The literature on repeated auctions generally assumes that in each step of a sequence of 𝑇 auctions, a bidder
has some value for the item being auctioned (for example, an online ad space). The bidder then derives
utility from winning the item. The total utility for the bidder is then simply the sum of their utilities from
all winning events in the repeated auction. Work in this area rests on the assumption that bidders are
completely indifferent to how their winning events are distributed over time—an assumption that may not
hold true in many contexts.
To see the flaw in this model, consider a retail seller marketing a product on an online ad auction platform.
The seller has a monthly budget for advertising on the platform and bids for ad impressions, which convert
(at some conversion rate) into orders. However, the seller faces a limited supply capacity for the product,
determined by the current inventory and the rate of incoming shipments. Suppose now that ad impressions
and resulting orders are concentrated in the first week of the month. If the orders exceed the supply
capacity, the seller will either need to reject some orders due to insufficient stock or incur additional costs
for expedited shipments of new supplies. In either case, the seller fully pays for the impressions won in the
auction but receives reduced value from these impressions due to the congested demand. In the standard
additive model, only the total value matters, and timing is irrelevant—but for our seller, the way sales are
distributed over time does matter.
Other examples where temporally congested winning events in the auction may lead to diminishing re-
turns due to limited supply capacity include computing services, logistics services, lodging, restaurants,
and more. A yet different scenario where the standard model fails is advertising campaigns—such as brand-
ing campaigns—that seek sustained visibility rather than clustered ad impressions within a single part of
the campaign period [25, 33, 49, 79].
In this paper, we explore the setting of repeated auctions where bidders care not only about how much they
win but also when they win. Specifically, we focus on scenarios in which bidders prefer that their winning
times in the series of auctions—whose total number is naturally limited by budget constraints—are not
clustered together but are relatively evenly spaced over time. We propose a simple model to describe
such temporal preferences, which allows us to analyze the optimal strategies for the bidders once they
know the distribution of prices and contexts, as well as the algorithms they could use in a more natural
scenario requiring them to learn to bid optimally online. Next, we describe our model and continue with
an illustrative example and an overview of our results and techniques. We postpone a more extended
discussion of related work to Section 2.
Model overview. An informal summary of our model is the following. (For the formal model, see Section
3.) We consider a bidder participating in a sequence of 𝑇 second-price auctions with budget 𝐵, where the
distribution of top bids of the competitors is in [0, 1] and unknown to the bidder. Our bidder is interested,
on the one hand, in maximizing the number of winning events in the series of auctions and, on the other
hand, in spacing those wins to avoid congestion. The combination of these two goals is captured by an increasing concave reward function 𝑟 (·), such that the utility for the bidder in the entire sequence of 𝑇
auctions is the sum of 𝑟 (·) applied to the lengths of intervals between winning events. Intuitively, this
captures the two goals stated above: evenly spacing a given number of wins or adding additional wins
increases utility.
We extend our model and results to a setting where auctions are not identical, but each auction has a
context and not every winning event leads to a successful conversion event (realization of actual gains).
Conversion rates and prices depend on the context and may be correlated. In this extended setting, the goal
is to achieve many conversions and have them spaced relatively evenly. We note that this generalization
reduces to the classical auction setting with stochastic values when the spacing of wins is not part of the
objective (e.g., 𝑟 (ℓ) = 1 for all ℓ ≥ 1) and the probability of conversion serves as the “value” of winning.
Warm-up example. To gain some intuition about our setting, consider the following simple scenario:
the distribution of prices is uniform in [0, 1], in this example it is known to the bidder, and all wins lead to
a conversion. The bidder has a budget 𝐵 ≤ 𝑇/4, and the reward function is 𝑟(ℓ) = √ℓ. We are interested
in the long-term utility, where 𝑇 ≫ 1, and 𝜌 = 𝐵/𝑇 is constant.
How should our bidder act in this auction? To illustrate the challenge of optimizing our bidder’s objective,
we begin by reviewing two simple strategies and the performance they can achieve.
Fixed winning intervals: One possible approach is to enforce equal spacing: win with probability 1 once
every fixed number of steps so as to fully use the budget in expectation. That is, our bidder bids 1 every
𝑇/2𝐵 steps and pays an average price of 1/2 per win. Denote by 𝑘 the number of wins. In expectation,1
𝑘 = 2𝐵(1 − 𝑜(1)), and the utility for our bidder is approximately $\sqrt{T/2B} \cdot 2B = \sqrt{2\rho}\, T$.
Fixed bid distribution: An alternative simple strategy would be to use a fixed bid distribution that maximizes
the number of wins. To achieve this, our bidder would like to make the lowest possible payments that still
use the full budget in expectation. This, in fact, can be obtained2 by consistently using a fixed bid level
𝑏. The probability of winning is then 𝑏, and the expected payment per win is 𝑏/2. As in our previous
example, the budget usage with a fixed bid 𝑏 and 𝑘 wins follows a uniform-sum distribution (see footnote
1), which is concentrated around 𝑏/2 per win for large 𝑘. Thus, the lowest bid that fully uses the budget
is $\frac{b}{2} \cdot bT = B \Rightarrow b = \sqrt{2B/T} = \sqrt{2\rho}$. The maximum expected number of wins is therefore $\sqrt{2\rho}\, T$. Note that
for 𝐵 < 𝑇/2, as in our case, the bidder wins more frequently than in the fixed-interval strategy.
If all intervals were of equal length, the utility would be $(2\rho)^{1/4}\, T$, which is an upper bound on the maximum
utility (i.e., maximum expected number of wins with perfect spacing). However, computing the resulting
utility for our bidder is somewhat more complicated, due to the randomness in interval lengths between
winning events. The full calculation is deferred to Appendix A. Our calculations in that example show that
a static bidding policy with a constant bid gives higher utility than the “fixed winning interval” strategy.
In addition, it achieves a constant-factor approximation to the optimal utility. Theorem B.1 shows that this
result holds more broadly: for any concave reward function and any price distribution, there exists a static
bidding policy that achieves at least (1 − 1/𝑒) of the optimal policy's value as 𝑇 → ∞ (see Appendix B.1 for
more details). However, fixed bidding strategies are not optimal (see Appendix B.2), and one can do better
by combining the two ideas: bidding to maximize the probability of winning, with a bid that also depends
on the time since the last conversion.
1 The utility when there are $k$ wins is $u = k\sqrt{T/2B} + \sqrt{T - (T/2B)k}$, and so $\mathbb{E}[k]\sqrt{T/2B} \le \mathbb{E}[u] \le 2B\sqrt{T/2B}$. The total
cost $C_k$ after $k$ wins follows a uniform-sum (Irwin–Hall) distribution. The expectation of $k$ is thus given by $\mathbb{E}[k] = \sum_{k=B}^{2B} k \cdot \Pr[C_k \le B] = \sum_{k=B}^{2B} \frac{k}{k!} \sum_{j=0}^{B} (-1)^j \binom{k}{j} (B-j)^k$. For large $T$, with $B$ being a constant fraction of $T$, $k$ is concentrated, and so
$\sqrt{T/2B}\,\big(2B - O(\sqrt{B})\big) \le \mathbb{E}[u] \le \sqrt{T/2B} \cdot 2B$ with high probability.
2 This is since the price distribution in our example has full support and no point masses. In general, an optimal bid distribution
can be a randomization between two bids.

Next, we provide a brief summary of our results, followed by an overview of our main challenges, tech-
niques, and proof outlines. Related work is discussed in Section 2, and the formal presentation of our
model appears in Section 3. The subsequent sections include the full formal analysis.

Our Results

Considering repeated auctions where the bidder cares about spacing wins relatively evenly raises impor-
tant new questions. In this paper, we offer a general model for this problem using second-price auctions
where the value of winning is a concave function of the time since the last win, and we develop an effective
learning algorithm to address it.

We offer an efficient online algorithm for the bidder that achieves after 𝑇 time steps at most Õ(√𝑇) regret
with high probability against the optimal bidding policy (Corollary 5.2). Optimal bidding in this problem
should naturally depend on the context, as well as on the time since last conversion, and so one would
assume that the learner would have to learn |X| ·𝑇 different bids, one for each context in the context space
X, and depending on the time since last conversion. The sample complexity of learning 𝑑 parameters
typically scales linearly in 𝑑 (which could be as large as |X| · 𝑇 in our setting), making it surprising that
one can achieve regret that is completely independent of the context space X (e.g., its dimension), and
depends on time only as Õ(√𝑇)—the same rate as is achievable when the bidder cares only about the
number of wins or conversions, and not about their spacing [10].
An interesting feature of our learning algorithm is that we do not need to discretize the bidding space,
avoiding issues related to discretization errors.
We also show that state-independent strategies (independent of the time since the last win) incur linear
regret (see Appendix B.2). On the positive side, we show that such policies can achieve a (1 − 1/𝑒) approximation to the optimal reward as 𝑇 → ∞ (Theorem B.1).

Our Techniques

We formally define our problem in Section 3. In Section 4, we start by focusing on the simpler version of the
problem with no contexts, that is, when all wins lead to conversion. We first notice that an extremely large
dynamic program can compute the optimal bidding policy under mild assumptions: assume that there are
𝐾 possible prices. The dynamic program aims to select the best bid for each time3 𝑡 ∈ [𝑇 ], depending on
the remaining budget 𝐵𝑡 . However, such a dynamic program is not amenable to learning policies online
without knowing all distributions ahead of time.
An equivalent small MDP. Our first main result is Theorem 4.4, offering a stronger, but much more
compact benchmark, even in the general case with contexts: an infinite-horizon Markov Decision Process
(MDP) with only 𝑚 = O (log𝑇 ) states, with state ℓ ∈ [𝑚] corresponding to the time elapsed since the last
conversion, capped at 𝑚. See Figure 1 for a visual representation of the MDP. We show that the average
reward of this much smaller MDP is almost equal to the original problem with minimal error.
Proving the equivalence of the small MDP and the full problem. To see how we can define an
approximately equivalent infinite-horizon MDP with a small state space, it is best first to consider a larger,
𝑇 -state, infinite-horizon MDP whose states are recording the time since the last win. Here, we only require
that the per-step budget is observed on average, as the time horizon approaches infinity (note that the time
horizon is not 𝑇 , but infinite, in this setting) and in expectation over the prices and contexts. See Figure
1 with 𝑚 = 𝑇 , where 𝑊ℓ is the winning probability of a particular bidding strategy in state ℓ. Given the
3 We denote [𝑛] = {1, 2, . . . , 𝑛} for 𝑛 ∈ N.

[Figure: an 𝑚-state chain with states 1, 2, . . . , 𝑚; from state ℓ the bidder wins with probability 𝑊ℓ and moves back to state 1, and otherwise moves to state min{ℓ + 1, 𝑚} (state 𝑚 has a self-loop with probability 1 − 𝑊𝑚 ).]

Figure 1: The MDP of the infinite-horizon setting when the bidder wins with probability 𝑊ℓ when ℓ ≤ 𝑚
rounds have elapsed since her last win and with probability 𝑊𝑚 when more than 𝑚 have elapsed.

relaxation of the budget constraint, we show that the optimal reward of the constrained MDP is at least
the value of the optimal strategy in the original auction.
The main result of Section 4 is that the value of the 𝑇 -state MDP can be closely approximated by a small MDP
with only O (log𝑇 ) states (Theorem 4.4). This approximation, in turn, has value at least as large as the
value of the true optimal solution obtained by the dynamic program above (Lemma 4.1 and Theorem 4.2).
To compress the MDP to have only O (log𝑇 ) states, we assume that the average per-step budget 𝜌 is a
constant independent of 𝑇 . Under this assumption, an optimal policy is expected to win an auction on
average every constant number of steps. Specifically, Lemma 4.8 shows that after a constant number of
losses, the probability of the optimal policy winning an auction is at least a constant. This implies that
there is a win every O (log𝑇 ) rounds with high probability.
The main idea behind showing that O (log𝑇 ) states suffice to approximate the 𝑇 -state MDP is to focus on
winning probabilities in each state rather than bidding policies. This simplifies the analysis and allows
us to prove the result in both contextual and non-contextual settings. We show that any optimal policy
(for any reward function 𝑟 (·) and distribution of contexts and prices) wins more frequently as the time
since the last win increases (𝑊ℓ+1 ≥ 𝑊ℓ in Figure 1). This, together with the constant per-step budget,
implies that after a constant number of losses, the probability of winning in each state is at least constant.
We prove this monotonicity by considering the Lagrangian relaxation of the budget-constrained problem
(subtracting 𝜆 times the budget spent from the reward with a parameter 𝜆 ≥ 0). With this relaxation of the
budget constraint, we use properties of optimal MDP policies to show that for any 𝜆, the optimal winning
probabilities are increasing in the time since the last win (Lemma 4.6). The claim for the optimal solution
of the constrained problem then follows as it can be written as a linear program; therefore, its optimal
solution is also optimal for the Lagrangian problem using the value of the variable associated with the
budget constraint from the optimal solution of the dual problem.
Outline of the learning algorithm. Section 5 offers our online algorithm for bidding: Follow the 𝑘-
delayed Optimal Response Strategy (FKORS). (See Algorithm 1.) The idea is to use a form of episodic
learning over subsequences of the (single) 𝑇 -length sequence to learn the best bidding in each state of
the small MDP. We divide time into small epochs with stochastic length, with epochs ending either when
conversions occur or after having no conversions for 𝑘 steps for a parameter 𝑘 ∈ N. At the start of
each epoch, we use data from previous rounds to estimate the optimal bidding strategy in the small MDP.
An interesting feature of this learning algorithm is its use of epochs with stochastic lengths. Almost all
epochs end with a conversion, with the state of the MDP going back to state ℓ = 1. We end epochs when a
conversion does not occur for 𝑘 steps to keep the epoch length bounded deterministically. If epochs were
not stochastic, most of them would not end immediately after a conversion, which would cause additional
error in reward. It is important to use such variable-length epochs to avoid accumulating this error. First,
short epochs help the learning converge faster. Second, we show that an epoch ending without a conversion
will only occur with very low probability, leading to only a minor impact on the overall reward. In the
remark after Corollary 5.2 stating the final regret bound in Section 5, we show that using fixed-length
epochs would lead to Ω(𝑇 2/3 ) regret bounds.

We use a delayed updating strategy: only updating our bidding strategy at the start of each epoch. Thus,
we use a fixed strategy throughout each epoch, allowing us to bound the expected error in reward. Note
that the epochs are internal to the algorithm’s design and, unlike in episodic learning, there is only a single
run of the sequence of auctions. Breaking the algorithm’s run into epochs aims to learn the optimal policy
in the small infinite-horizon MDP. We then use the result that the value of this MDP is an upper bound on
the optimal reward of our original problem, with a minimal error, to prove our main result (Theorem 5.1
and Corollary 5.2) that the algorithm achieves Õ(√𝑇) regret compared to the optimal policy.
An important source of error in the algorithm comes both from the error in reward and depletion of
the budget before the final round, due to suboptimal bidding in each epoch. Because of the bound on
the sampling error we prove in Theorem 6.1, we know that the policy of each epoch is close to optimal in
terms of average reward. However, the reward of an epoch is the reward of the policy between conversions.
Another important property of that policy is the time between wins with conversions. These two quantities
might differ greatly from the ones of the optimal policy since it might not be unique. The bulk of our
technical effort involves proving that the reward of an epoch of (stochastic) length 𝐿 is about 𝐿 · OPT,
where OPT is the average per-time reward of the optimal policy (Lemma 5.3). A similar result proves
that the payment of the epoch is about 𝐿𝜌 (Lemma 5.4). We prove these results without comparing the
length, reward, and payment between wins with conversions of our policy with the optimal one. Breaking
the learning into epochs allows us to use concentration inequalities to prove the high probability bound
claimed. Note that we cannot do this on quantities summing over rounds of the algorithm, as these are very
dependent. However, epochs are (conditionally) independent, which allows us to rely on concentration
bounds over epochs. This hints at why epochs of fixed length 𝑘 would not work: due to changing policies
between epochs that end without a win, we get constant error per epoch, leading to Ω(𝑇/𝑘) error overall.
(Note that the error due to changing policies between epochs in our algorithm is zero, as long as the epoch
does not end early.) However, concentration bounds on 𝑇/𝑘 epochs of length 𝑘 would lead to Ω(𝑘√(𝑇/𝑘))
error. Optimizing over 𝑘, we get Ω(𝑇^{2/3}) regret when 𝑘 = Θ(𝑇^{1/3}).
An additional appealing feature of this algorithm is that we do not discretize the bidding space; instead,
we can compute the optimal bidding policy (for the samples of prices and contexts seen so far) as ex-
plained below. Importantly, this allows us to have no dependence on the size of the action space and avoid
discretization errors altogether.
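The epoch structure described above can be summarized by the following sketch (an outline only; `observe_context`, `run_round`, and `update_policy` are placeholder interfaces introduced here for illustration, not the paper's):

```python
def learn_with_stochastic_epochs(T, k, observe_context, run_round, update_policy):
    """Outline of the epoch-based learning loop: an epoch ends at the first
    conversion, or after k rounds without one, and the bidding policy is
    recomputed only at epoch boundaries from all data observed so far."""
    data = []
    policy = update_policy(data)          # maps (state, context) -> bid
    t = 0
    while t < T:
        state = 1                         # rounds elapsed since the last conversion
        for _ in range(k):                # each epoch lasts at most k rounds
            if t >= T:
                break
            x = observe_context()
            price, converted = run_round(policy(state, x))
            data.append((x, price))
            t += 1
            if converted:
                break                     # typical case: the epoch ends with a conversion
            state += 1                    # otherwise move to the next state of the small MDP
        policy = update_policy(data)      # delayed update at the epoch boundary
```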
Structure of the optimal solution to the small MDP. By showing the (near) equivalence of the small
MDP that has 𝑚 = O (log𝑇 ) states to the larger problem, it is now more promising to learn the optimal
policy. However, we still have to learn a function that maps contexts to bids in each of the 𝑚 states.
We show that optimal bidding has a simple structure, with only one parameter per state that our online
algorithm needs to learn. The key observation is to think of the main variable of the MDP as the desired
probability of winning a conversion 𝑊ℓ in state ℓ ∈ [𝑚] (rather than thinking of the bids directly). To make
the bidding policy optimal, the bidder wants to use a bidding function from context to bids that achieves
the desired winning probability in expectation over contexts with minimal expected cost.
The main observation here is that the optimal bidding policy can be expressed as a simple function of
the conversion rate, regardless of other parts of the context, i.e., the potential information about the dis-
tribution of prices associated with that context. Depending on the state of the system (time since last
conversion), the optimal policy at that state when the conversion rate is 𝑐 is to bid min(𝑐/𝜇, 1) for some 𝜇,
assuming for simplicity that the price distribution has no atoms (see Lemma 6.2 for the extension to the
case with atoms). To see why this is true, we again use the idea of a Lagrangian relaxation, now for the
problem of minimizing the cost while achieving a desired conversion probability. Consider the equivalent
optimization problem of maximizing the probability of winning a conversion for a given desired spending
level 𝜌′ at a state ℓ. Taking the Lagrangian relaxation by subtracting the expected cost with a multiplier 𝜇, we get a simple optimization problem whose optimal solution is the bid min(𝑐/𝜇, 1).
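As an illustration of this bid rule (with names chosen here, not taken from the paper), the winning-a-conversion probability and the expected payment of the policy 𝑏(𝑐) = min(𝑐/𝜇, 1) can be estimated from observed (conversion rate, price) samples as follows:

```python
def estimate_W_P(samples, mu):
    """Empirical W(mu) and P(mu) for the bid rule b(c) = min(c/mu, 1),
    given samples of pairs (c, p): c is the conversion rate carried by
    the context and p is the highest competing bid (the price)."""
    wins = payment = 0.0
    for c, p in samples:
        bid = 1.0 if mu <= 0 else min(c / mu, 1.0)
        if bid >= p:
            wins += c            # a win converts with probability c
            payment += p         # second price: pay the competing bid p
    n = len(samples)
    return wins / n, payment / n
```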
Computing the optimal strategy in the small MDP. One step of our online algorithm is to compute
the optimal policy for the small MDP using the empirical distribution of prices and contexts observed so
far. This is discussed in Section 6.2. We show that the optimal bidding policy of our MDP can be written
as a linear program. In addition to its use in the online algorithm with context, this linear program also
offers a computational advantage. For example, when there are no contexts, the dynamic programming
solution mentioned at the start would run in at least O(𝑇³𝐾) time when there are 𝐾 possible prices of the
form 𝑖/𝐾 for 𝑖 ∈ [𝐾]. Our linear program with 𝐾 possible prices is only of size O (𝐾 log𝑇 ), significantly
improving the running time.
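A minimal sketch of such a linear program for the no-context case, written over occupancy measures with a finite set of candidate bids (the encoding and names below are illustrative, not the paper's implementation; it assumes SciPy's linprog):

```python
import numpy as np
from scipy.optimize import linprog

def solve_small_mdp_lp(prices, r, m, rho):
    """Occupancy-measure LP for the m-state benchmark without contexts.
    prices: possible prices in [0, 1]; r(l): reward for a win l rounds
    after the previous one; rho: budget per round.  Variable z[l, i] is
    the long-run fraction of rounds spent in state l+1 using bid i."""
    prices = np.asarray(prices, dtype=float)
    bids = np.unique(np.concatenate(([0.0, 1.0], prices)))        # candidate bid levels
    w = np.array([(prices <= b).mean() for b in bids])            # win probability of each bid
    e = np.array([(prices * (prices <= b)).mean() for b in bids]) # expected payment of each bid
    K = len(bids)
    n = m * K
    idx = lambda l: slice(l * K, (l + 1) * K)

    c = np.zeros(n)                            # linprog minimizes, so negate the reward
    for l in range(m):
        c[idx(l)] = -r(l + 1) * w

    A_eq, b_eq = [np.ones(n)], [1.0]           # occupancies sum to 1
    for l in range(1, m):                      # flow balance of states 2, ..., m
        row = np.zeros(n)
        row[idx(l)] = 1.0                      # outflow of state l+1
        row[idx(l - 1)] -= 1.0 - w             # inflow: losses in state l
        if l == m - 1:
            row[idx(l)] -= 1.0 - w             # state m also keeps its own losses
        A_eq.append(row)
        b_eq.append(0.0)

    res = linprog(c, A_ub=np.tile(e, m)[None, :], b_ub=[rho],     # spend at most rho per round
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
    return -res.fun, res.x.reshape(m, K)
```

With 𝐾 candidate bid levels the program has 𝑚𝐾 variables and 𝑚 equality constraints plus one budget constraint, consistent with the O(𝐾 log𝑇) size mentioned above when 𝑚 = O(log𝑇).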

Accounting for sampling errors. The main result in Section 6 is Theorem 6.1, showing a Õ(𝑚³/√𝑡) error
bound on the optimal solution of OPT^inf_m when using only 𝑡 samples for computing it. This follows from
the following two results. In Section 6.4, we compute the expected error in 𝑊(𝜇) and 𝑃(𝜇), the expected
winning probability and cost, respectively, when bidding min(1, 𝑐/𝜇) using only 𝑡 samples. Section 6.3
shows the expected error of computing OPT^inf_m due to using approximate distributions only.

State-Independent policies. Our learning algorithm aims to learn optimal state-dependent policies. In
Appendix B.2, we show that such policies are necessary for attaining low regret, even in the special case
without contexts. In particular, we show that state-independent strategies incur linear regret. We note
that this also implies linear regret for standard budget pacing strategies, such as [11], as such strategies
would converge to a state-independent static bid.
We complement this by showing (Theorem B.1) that there are state-independent strategies that, while still
having linear regret, achieve a (1 − 1/𝑒) approximation to the optimal reward as 𝑇 → ∞.

2 Further Related Work


While there is a vast body of work on online learning and on algorithmic aspects in auctions (see [24, 77]
and [54, 60, 63, 72], respectively, for broad introductions), and in particular, budgeted auctions have been
extensively explored (e.g., [4, 8, 11, 12, 17, 36, 37, 39, 40, 45, 58]), to our knowledge, the problem of online
learning with a temporal spacing objective has not been previously studied.
No-regret learning in auctions. This theme has attracted broad interest, with research exploring topics
such as social welfare and the price of anarchy when bidders are regret-minimizers [4, 12, 16, 21, 32, 45, 74,
82]; the learning of auction parameters, such as reserve prices [23, 61, 62, 75]; the dynamics of no-regret
algorithms in auction settings [15, 32, 34, 42, 51, 52]; the impact of learning algorithms on user incentives [5,
41, 53]; optimal strategies for the auctioneer when the bidders are learning [19, 20, 76]; and the estimation
of user preferences and bid prediction under the assumption that bidders have low regret [47, 64, 66, 67, 69]. Our
work enhances this literature by introducing a new aspect not previously studied: the bidder’s preference
for spacing wins evenly over time.
Budget pacing. The work most closely related to our setting is on the concept of “budget pacing,” where
the strategies of bidders in a budgeted auction are limited to setting a scalar “bid-shading factor” that
adjusts their bid in each auction by scaling their value. Two primary lines of research in this area examine
the equilibria in games with such strategy spaces for the bidders [27, 30, 31] and algorithms that learn the
shading factor online and their dynamics [2, 9, 11, 12, 46, 55, 59]. The main concern in our setting, however,
is not only to spend the budget at the right average rate but also to consider the spacing objective under
this constraint, and we show that state-independent strategies which ignore the time elapsed since the
last win are insufficient for this purpose and lead to high regret. For example, one could try to use the budget pacing algorithms of [11, 12] in our problem, using conversion probability as the value of winning
the auction. However, these algorithms will converge to a fixed shading factor, i.e., a state-independent
bid distribution, and hence will incur linear regret.
Online learning of Markov Decision Processes (MDPs). There is a rich literature on online learning
of MDPs (see, e.g., [1]). There are several reasons why our approach cannot directly apply existing algorithms
from that literature. First, a naïve MDP model of our setting would have 𝑇 states, making the regret
bounds linear. Second, in contrast to the episodic learning model that predominates in research on learning
of MDPs, we assume the learner participates in a single episode (a continuing task in reinforcement learning
terminology) and must simulate episodic learning by breaking the timeline into epochs of stochastic length,
as described earlier. Third, while we have a continuous action space (i.e., a bid interval), our regret bounds
have no dependence on the size of the action space, in contrast to the MDP learning literature that assumes
a discrete action space. Using discretization would result in a dependence on the number of actions which
is either square-root (in the bandit setting) or logarithmic (in the full information setting). Fourth, due
to the bidder’s budget constraint, our setting is modeled by a constrained MDP. While there are some
recently-discovered online learning algorithms for constrained MDPs in the infinite-horizon [26]4 and
episodic learning [80] settings, those algorithms also share some of the drawbacks discussed above.
Bandits with Knapsacks (BwK). Another related thread of work, following [7], is on the BwK model,
which is a generalization of the classic multi-armed bandits model that incorporates resource consumption
constraints [22, 44, 48, 56, 77, 78]. These constraints require algorithms to consider not only the reward
of each arm but also their long-term effect on the resource budgets. However, an important distinction
between the literature on BwK and our work is that the reward of an arm in BwK only depends on the
current round. In particular, as in the case of the budget-pacing algorithm discussed above, in a sequential
auction setting, the reward only depends on the current win (or conversion) and not on the time elapsed
since the last one.

3 Model
We consider a bidder facing an unknown price distribution in a sequence of 𝑇 second-price auctions. The
bidder has a budget 𝐵, and we denote the average budget per step by 𝜌 = 𝐵/𝑇 . We are interested in the
long-term behavior and outcomes in the series of auctions and think about 𝜌 as constant as 𝑇 → ∞. In
each round 𝑡 ∈ [𝑇 ], a price 𝑝𝑡 ∈ [0, 1] is sampled from an unknown distribution P. The price 𝑝𝑡 can be
interpreted as the highest bid among the pool of other competitors in the auction. The bidder chooses a
bid 𝑏𝑡 ∈ [0, 1]; if 𝑏𝑡 ≥ 𝑝𝑡 , the bidder wins the auction and pays a price 𝑝𝑡 from her budget. If 𝑏𝑡 < 𝑝𝑡 , the
bidder loses the auction, pays nothing, and observes 𝑝𝑡 . We discuss the utility model for the bidder next.
We denote by 𝑊 (𝑏) and 𝑃 (𝑏) the probability of winning the auction and the expected payment, respec-
tively, when using bid 𝑏. Formally, 𝑊 (𝑏) = P𝑝∼P [𝑏 ≥ 𝑝] and 𝑃 (𝑏) = E𝑝∼P [𝑝 1 [𝑏 ≥ 𝑝]]. In addition, for
a distribution of bids b ∈ Δ( [0, 1]), we define 𝑊 (b) = E𝑏∼b [𝑊 (𝑏)]; we define 𝑃 (b) analogously.
Reward model. The bidder is interested both in winning many auctions and spacing the winning events
over time. Two basic properties we require are that (i) adding any set of winning events to any sequence
𝑡 1, · · · , 𝑡𝑘 of winning times can only improve utility, and (ii) for any fixed number 𝑘 of winning events,
4 This work uses one long trajectory of length 𝑇 of a small MDP, but either (i) assumes that all policies are ergodic, which does
not hold in our setting, or (ii) has O (𝑇 2/3 ) regret when the MDP is weakly communicating, or (iii) is not efficient. Both (ii) and
(iii) require knowing some parameters of the optimal policy and also require a discrete action set. Therefore, even after reducing
our problem to its small MDP version and discretizing the action space, we would need to know some properties of the optimal
policy to use their algorithm in our setting.

the sequences in which those events are most evenly spaced lead to highest utility. While there is no
single way to represent such preferences, a natural and tractable model that achieves these properties is
maximizing the sum of some concave function of the lengths of intervals between winning times. This
leads to the following reward model.
The bidder has a reward per winning event, which depends on the number of rounds since the last winning
event. Formally, let 𝑡 ′ < 𝑡 be the last round before 𝑡 in which a winning event occurred and ℓ𝑡 = 𝑡 − 𝑡 ′ be
the number of elapsed rounds since this last winning event. (If there was no such event before 𝑡, define
𝑡 ′ = 0 and ℓ𝑡 = 𝑡.) Then, in round 𝑡, if the bidder wins the auction (i.e., if 𝑏𝑡 ≥ 𝑝𝑡 ), she receives a reward
𝑟 (ℓ𝑡 ), where 𝑟 : N ∪ {0} → R+ is an increasing and concave sequence with bounded differences. That is,
for every ℓ ∈ N ∪ {0}, it holds that
0 ≤ 𝑟 (ℓ + 2) − 𝑟 (ℓ + 1) ≤ 𝑟 (ℓ + 1) − 𝑟 (ℓ) ≤ 1,

where we assume 𝑟(0) = 0. A simple example of such a sequence is 𝑟(ℓ) = √ℓ (as in the warm-up example
in Section 1). For any 𝑚 ∈ N, it will be useful to define 𝑟𝑚(ℓ) = min{𝑟(ℓ), 𝑟(𝑚)} = 𝑟(min{ℓ, 𝑚}).
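For concreteness (a numerical illustration added here, not taken from the paper): with 𝑟(ℓ) = √ℓ, 𝑇 = 10, and wins in rounds 3, 5, and 9, the bidder collects 𝑟(3) + 𝑟(2) + 𝑟(4) = √3 + √2 + 2 ≈ 5.15, whereas the same three wins packed into rounds 8, 9, and 10 would give only 𝑟(8) + 𝑟(1) + 𝑟(1) = 2√2 + 2 ≈ 4.83; equally many wins are worth more when they are spread out.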
Extension to contextual settings. In the first part of the paper, we focus on the model defined above
for simplicity of presentation. However, all our results extend to an important contextual generalization
of our model, which is defined as follows. In the contextual auction setting, in every round 𝑡, a context 𝑥𝑡
is sampled from some distribution over an arbitrary space X. The bidder observes 𝑥𝑡 before bidding, and
the context can naturally also affect the price. While we continue to assume that prices in each iteration
are independent, we allow the price and the context to be arbitrarily correlated.
Each context 𝑥𝑡 includes a conversion probability 𝑐𝑡 ∈ [0, 1], which is the probability that winning an
auction in round 𝑡 (i.e., 𝑏𝑡 ≥ 𝑝𝑡 ) results in a conversion event, i.e., a reward. Specifically, in round 𝑡 where
the last successful conversion was ℓ𝑡 rounds ago, the bidder will collect a reward 𝑟 (ℓ𝑡 ) if 𝑏𝑡 ≥ 𝑝𝑡 and a
conversion occurs, which, conditional on 𝑏𝑡 ≥ 𝑝𝑡 , happens with probability 𝑐𝑡 ; in this case ℓ𝑡 +1 = 1. Only
if these two conditions are satisfied does the bidder get any reward; otherwise, she gets zero reward in
round 𝑡 and in the next round, the state changes to ℓ𝑡 +1 = ℓ𝑡 + 1.
In the contextual model, the context affects both the probability of winning a conversion and the expected
payment of a bid. For this case, we will use a bidding function b that maps a context 𝑥 to a randomized
bid and define 𝑊 (b) and 𝑃 (b) analogously to denote the expected probability of winning a conversion
and the expected price for a bidding function b. For example, 𝑊 (b) = P𝑥,𝑝,𝑏∼b(𝑥 ) [𝑏 ≥ 𝑝]. We denote
𝑐¯ = 𝑊 (1) = E [𝑐] the expected conversion rate, i.e., the probability that the bidder gets positive reward if
she bids one, and assume that 𝑐¯ is a constant with respect to 𝑇 .

4 State-Space Reduction for Near-Optimal Planning


Before we address the problem of learning to bid online in the next sections, in this section, we ask the
following question. Suppose that the bidder knew the distribution of prices and contexts exactly. What is the
maximum reward she could aim for? And what policy would achieve this?
The main result of this section is introducing an infinite-horizon optimization problem with only O (log𝑇 )
parameters, whose solution can be used as a nearly optimal approximation for the budgeted auction prob-
lem with the spacing objective.
First, in Section 4.1, we define the optimal solution one would use knowing the distribution of prices:
the optimal algorithm that respects the budget constraint and takes into account all the information of
past rounds. Noticing that this solution is hard to compute online, we focus on a different solution. In
Section 4.2, we introduce an infinite-horizon optimization problem over 𝑚 states, each state representing the time since the last win (such that 𝑚, 𝑚 + 1, . . . rounds since the last win all yield the same reward).
Theorem 4.2 shows that when 𝑚 = 𝑇 (the number of states equals the number of rounds in the original
problem), the reward in the infinite-horizon setting is at least that of the best algorithm in the original
problem. While this may initially appear more complex, learning the optimal solution is simplified due
to its stationary nature: with no time horizon, the only relevant variable is the current state the bidder
is in among the 𝑚 possible states. In Section 4.4, we focus on simplifying this solution even further. As
mentioned above, with 𝑚 = 𝑇 different states, we would have to learn a different bidding strategy for
each state. Theorem 4.4 shows that we can, instead, consider an exponentially smaller MDP—with only
𝑚 = O (log𝑇 ) states—and this results in minimal error. This new and nearly optimal solution is thus much
easier to learn online, which will be our goal in Section 5. We start the presentation of these results in the
simpler setting without contexts and conversion rates. In Subsection 4.3, after establishing the basics, we
switch back to the more general contextual setting.

4.1 Optimal Offline Algorithm

For the first two subsections, assume there are no contexts, and the conversion rate is always 1 for sim-
plicity of presentation. When the bidder knows the distribution of prices, she can maximize her expected
reward while satisfying the budget constraint without needing to learn this distribution. In particular, in
every round, she could calculate her optimal bid as a function of her current “state” (𝑡, 𝐵𝑡 , ℓ𝑡 ), where 𝑡 is
the current round, 𝐵𝑡 is her remaining budget in that round, and ℓ𝑡 is how many rounds ago was her last
winning event. We note that in every state where 𝐵𝑡 ≥ 0, the bidder can bid zero so that her budget is not
depleted. While the optimal strategy in this state space is complicated, it is possible to compute the optimal
policy using a very large dynamic program under mild assumptions. As long as the number of prices is
finite, the possible states are also finite. For example, if for some 𝐾 ∈ N every price was of the form $p_t = \frac{i}{K}$
for some 𝑖 = 0, 1, . . . , 𝐾, then the number of possible states would be at most $KT^3$. This means that we can
calculate the optimal solution using dynamic programming. We denote the value of this optimal solution
by 𝑇 · OPTALG . That is, OPTALG is the average-per-round value of the optimal solution. This benchmark is
referred to in the literature as the best algorithm (e.g., in Bandits with Knapsacks; see [77, Chapter 10] or
[78]).
While the above solution is “tractable,” its state space is very big, and, most importantly, it is unclear how
to design an online learning algorithm that approximates the desired value, since the same state is never
seen twice. For this reason, we will focus on a different benchmark that is much simpler to define and,
as we will see below, is also a stronger comparator. Then, its compressed version, which we develop in
Section 4.4, will be learnable.

4.2 Infinite-Horizon Benchmark

To develop our benchmark, we consider a setting with an infinite number of rounds. In order to differen-
tiate the notation from our finite-horizon setting, we use here 𝐻 to denote the time horizon and consider
the limit 𝐻 → ∞. Note that in the following analysis, both notations will be used when comparing the two
settings: 𝑇 will refer to the time horizon of our original setting (as described above), and 𝐻 will be used for
the infinite-horizon setting. Similar to the finite-horizon setting, the bidder tries to maximize her reward
while adhering to a budget constraint. Since the time horizon is infinite, we need to define the notions of
reward and budget for the bidder, which we explain next.
Budget. In the infinite-horizon setting, we require the budget condition to hold only in expectation and
only in the limit. That is, we require the infinite-horizon time average of the expected spending (where the expectation is over the prices) to be less than 𝜌. For example, if 𝑏ℎ is the bid in round ℎ ∈ [𝐻 ], then
we want the bidder’s expected average-per-round payment to be less than 𝜌 as 𝐻 → ∞, i.e.,
$$\lim_{H\to\infty} \frac{1}{H}\sum_{h=1}^{H} \mathbb{E}\big[P(b_h)\big] \le \rho, \tag{1}$$

where, recall that 𝑃 (𝑏) is the expected payment of bid 𝑏 when the price is sampled from distribution P.
We note that the above limit might not exist for some bid sequences; we will deal with this later (see the
remark after Equation (3)).
Reward. Since in the original finite-horizon setting, we had that 𝑟 (ℓ𝑡 ) ≤ 𝑟 (𝑇 ), it is convenient to artificially
make a similar restriction here with a parameter 𝑚. Specifically, when the last winning event before round
ℎ was ℓℎ rounds ago, we limit the bidder’s reward to be 𝑟𝑚 (ℓℎ ) (recall 𝑟𝑚 (ℓ) = 𝑟 (min{𝑚, ℓ })). This reward
structure matches the one in Section 4.1 when 𝑚 = 𝑇 . Hence, for a sequence of bids 𝑏 1, 𝑏 2, . . . the expected
average-per-round reward is
$$\lim_{H\to\infty} \frac{1}{H}\sum_{h=1}^{H} \mathbb{E}\big[r_m(\ell_h)\,W(b_h)\big]. \tag{2}$$

Therefore, in this section, we aim to maximize the quantity in Equation (2) while satisfying the constraint
in Equation (1). We model this as an infinite-horizon average-reward constrained Markov decision process
(MDP), where the state is the time since the last win and with a constraint for the average spending. In
addition, because for ℓℎ ≥ 𝑚, the reward is the same as in the case ℓℎ = 𝑚, the optimal bid can be the
same for all ℓℎ ≥ 𝑚. In other words, we consider an infinite-horizon average-reward MDP with a budget
constraint over 𝑚 states and the bid range [0, 1] as the action set. Since there are only 𝑚 states and no
time horizon, we consider the stationary policies that are distributions of bids for each of the 𝑚 states. The
following optimization problem describes the optimal value for any 𝑚 ∈ N:
$$\mathrm{OPT}^{\inf}_m = \sup_{\mathbf{b}_1,\dots,\mathbf{b}_m \in \Delta([0,1])}\ \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\big[r_m(\ell_h)\,W(\mathbf{b}_{\ell_h})\big], \qquad \text{s.t.}\ \ \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\big[P(\mathbf{b}_{\ell_h})\big] \le \rho. \tag{3}$$

A visual representation of the transitions of the MDP in Equation (3) is shown in Figure 1. We remark
that the limits now exist, according to [71, Theorem 8.1.1], since the state space is finite and the policy is
stationary.
Next, we prove that the optimal value of the infinite-horizon problem is at least the value of the best
algorithm in the finite-horizon problem, when 𝑚 = 𝑇 . This will allow us to limit our attention to learning
the optimal solution to this problem instead of the optimal algorithm of Section 4.1.

Lemma 4.1. For every budget per round 𝜌, reward function 𝑟 (·), distribution of prices P, and finite horizon
𝑇 , it holds that OPT𝑇inf ≥ OPTALG .

The intuition behind the proof is to consider a thought experiment where we simulate runs of the optimal
algorithm corresponding to OPTALG in the infinite-horizon setting. Specifically, we partition the rounds into
blocks of length 𝑇 , and in each block, we run a new instance of this algorithm. The expected reward in
each block is at least 𝑇 · OPTALG , and the payment is at most 𝑇 𝜌. This makes the time-average expected
reward at least OPTALG and the time-average payment at most 𝜌. See Appendix C.1 for the full proof.

It will be simpler to work with a change of variables from bids to winning probabilities. We used 𝑊 (𝑏)
as the probability of winning the auction with a bid 𝑏. Instead, we will consider a winning probability
𝑊 ∈ [0, 1], and aim to bid so as to make the winning probability 𝑊. For example, bidding 1 with probability
𝑊 and 0 with probability 1 − 𝑊 wins with probability5 𝑊. On the other hand, multiple bid distributions
might win with probability 𝑊 ∈ [0, 1]; in that case, we consider that the bidder uses the one with the lowest
expected spending, which we define as 𝑃 (𝑊 ). We note a slight subtlety here: for some price distributions,
there might not exist a bid distribution that achieves the minimum, but there exists one that gets arbitrarily
close. We discuss this in Section 6.1.
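For intuition about 𝑃(𝑊) (an illustration added here): if the price distribution is atomless with CDF 𝐹, the cheapest way to win with probability 𝑊 is to bid the quantile $b = F^{-1}(W)$, so that $P(W) = \mathbb{E}\big[p\,\mathbb{1}[p \le F^{-1}(W)]\big]$; for the Uniform[0, 1] prices of the warm-up example this gives 𝑃(𝑊) = 𝑊²/2, an increasing and convex function of 𝑊.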
Using this new notation, optimization problem (3) becomes
$$\mathrm{OPT}^{\inf}_m = \sup_{\vec W \in [0,1]^m}\ \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\big[r_m(\ell_h)\,W_{\ell_h}\big], \qquad \text{s.t.}\ \ \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\big[P(W_{\ell_h})\big] \le \rho. \tag{4}$$

4.3 Adding Contexts

In this section, we extend the infinite-horizon constrained MDP defined in the previous section to the
problem with contexts and conversion rates. We focus on the final formulation (4), where the winning
probabilities were the variables. The important variables now are the probabilities of winning a conversion.
So, we redefine 𝑊 as the probability of winning a conversion (rather than the probability of only winning
the auction). Furthermore, the actual probability of winning a conversion also depends on the context and
conversion rate. Bidding 1 would now result in a conversion probability of only c̄ (recall that c̄ = E[𝑐] is the
expected conversion rate per round). Using this, we need 𝑊 ∈ [0, c̄] (instead of the full range [0, 1]). Since the
optimal bid that achieves probability 𝑊 of winning a conversion will depend on the context, we need to
have a bidding function b mapping contexts X into bids such that 𝑊(b) = 𝑊. As before, we define the
payment for winning with probability 𝑊 ∈ [0, c̄] as
$$P(W) \;=\; \inf_{\substack{\mathbf{b}:\,\mathcal{X}\to\Delta([0,1]) \\ W(\mathbf{b}) = W}} P(\mathbf{b}).$$

With this notation, the infinite-horizon constrained MDP we want to work with becomes the following
$$\mathrm{OPT}^{\inf}_m = \sup_{\vec W \in [0,\bar c]^m}\ \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\big[r_m(\ell_h)\,W_{\ell_h}\big], \qquad \text{s.t.}\ \ \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\big[P(W_{\ell_h})\big] \le \rho. \tag{5}$$

Analogous to Lemma 4.1 we have the following theorem, whose proof is a direct extension of Lemma 4.1
by considering the above modified definitions of 𝑊 , 𝑃 (𝑊 ), and the bidding functions b.

Theorem 4.2. For every finite-horizon setting (any context/price distribution, budget per round 𝜌, reward
function 𝑟 (·), and time horizon 𝑇 ) it holds that OPT𝑇inf ≥ OPTALG .
5 There is some subtlety here: if the price is zero with positive probability, then bidding zero does not result in winning with
zero probability. We can solve this issue by assuming that the bidder also has the option to “skip” an auction, even if this would
be avoided in any optimal solution.

To know the long-term average reward or cost, we need to consider the stationary distribution induced
by a bidding policy. Consider the stationary distribution defined by a vector of winning probabilities
$\vec W \in [0, \bar c]^m$. Specifically, we denote with $\pi_\ell(\vec W)$ the probability mass of state $\ell \in [m]$ in the stationary
distribution defined by $\vec W$. We now prove the following.
Lemma 4.3. Let $\mathrm{OPT}^{\inf}_m$ be as defined in Equation (5) for any $m \in \mathbb{N}$. Then it holds that
$$\mathrm{OPT}^{\inf}_m = \sup_{\vec W\in[0,\bar c]^m}\ \sum_{\ell=1}^{m} r(\ell)\,W_\ell\,\pi_\ell(\vec W) \qquad \text{s.t.}\quad \sum_{\ell=1}^{m} P(W_\ell)\,\pi_\ell(\vec W) \le \rho, \tag{6}$$
where
$$\pi_1(\vec W) = \sum_{\ell=1}^{m} W_\ell\,\pi_\ell(\vec W), \qquad \pi_\ell(\vec W) = (1 - W_{\ell-1})\,\pi_{\ell-1}(\vec W) \quad \forall \ell = 2, 3, \dots, m-1,$$
$$\pi_m(\vec W) = \sum_{\ell=m-1}^{m} (1 - W_\ell)\,\pi_\ell(\vec W), \qquad \sum_{\ell=1}^{m} \pi_\ell(\vec W) = 1.$$

We note that the equations of optimization problem (6) uniquely identify the stationary distribution unless
𝑊𝑚 = 0. We can avoid this issue by assuming that 𝑊𝑚 is at least some infinitesimally small positive con-
stant. In addition, Lemma 4.6 will show that for any optimal solution, 𝑊ℓ is non-decreasing in ℓ, implying
that if 𝑊𝑚 = 0 then the bidder never wins any rounds, leading to zero reward.
The proof of the lemma shows that for any stationary policy, the time-average reward converges to the
objective of optimization problem (6), evaluated at the stationary distribution of that policy. Proving the same for the
average payment proves the lemma.

Proof. Fix some $\vec W$. All we have to prove is that for the solution of the equations of $\{\pi_\ell(\vec W)\}_{\ell\in[m]}$ in (6) it holds that
$$\lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}\big[r_m(\ell_h)W_{\ell_h}\big] = \sum_{\ell=1}^{m} r(\ell)\,W_\ell\,\pi_\ell(\vec W) \quad\text{and}\quad \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}\big[P(W_{\ell_h})\big] = \sum_{\ell=1}^{m} P(W_\ell)\,\pi_\ell(\vec W),$$
where the expectation is taken over $\ell_h$. Let $A$ be the (right-stochastic) transition matrix on the state space $[m]$ as defined by $\vec W$. Let $\vec f$ and $\vec q$ be the vectors with $m$ elements that have $r(\ell)W_\ell$ and $P(W_\ell)$, respectively, in the $\ell$-th entry, and $e_1$ be the unit vector with 1 in the first coordinate. Then we have
$$\lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}\big[r_m(\ell_h)W_{\ell_h}\big] = \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H} e_1^\top A^{h-1}\vec f = e_1^\top \bar A\,\vec f,$$
$$\lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}\big[P(W_{\ell_h})\big] = \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H} e_1^\top A^{h-1}\vec q = e_1^\top \bar A\,\vec q,$$
where $\bar A = \lim_{H\to\infty}\frac{1}{H}\sum_{h\in[H]} A^{h-1}$ and the last equality on each line above follows from [71] (see the discussion at the start of Section 8.2.1 there), which uses the fact that $\vec f$ and $\vec q$ are bounded and the state space is finite, meaning that $\bar A$ is stochastic. This proves the lemma, since $e_1^\top \bar A$ is the vector $\{\pi_\ell(\vec W)\}_\ell$, as described in (6) (also see Figure 1). ■
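A small numerical sketch of these stationary-distribution formulas (illustrative code assuming NumPy; not part of the paper), which evaluates the objective and the budget constraint of (6) for a candidate $\vec W$:

```python
import numpy as np

def stationary(W):
    """Stationary distribution of the m-state chain in Figure 1: from state l,
    win (probability W[l]) -> state 1, lose -> state min(l+1, m), so
    pi_{l+1} = (1 - W_l) pi_l for l = 1, ..., m-1, with a correction for the
    self-loop of state m."""
    m = len(W)
    pi = np.zeros(m)
    pi[0] = 1.0                               # unnormalized mass of state 1
    for l in range(1, m):
        pi[l] = (1.0 - W[l - 1]) * pi[l - 1]
    pi[m - 1] /= max(W[m - 1], 1e-12)         # account for the self-loop of state m
    return pi / pi.sum()

def average_reward_and_payment(W, r, P):
    """Evaluate sum_l r(l) W_l pi_l and sum_l P(W_l) pi_l for the policy W."""
    pi = stationary(W)
    reward = sum(r(l + 1) * W[l] * pi[l] for l in range(len(W)))
    payment = sum(P(W[l]) * pi[l] for l in range(len(W)))
    return reward, payment
```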

4.4 Reducing the Number of States in the Infinite-Horizon Problem (6) to O (log𝑇 )

As shown in Theorem 4.2 we can use the infinite-horizon benchmark, OPT𝑇inf , to approximate the expected
reward of the best algorithm OPTALG . While OPT𝑇inf is easier to describe than OPTALG , the number of states in
the MDP is 𝑇 . This means that, during the 𝑇 steps of running our learning algorithm, some states will be
visited only at most once, which is not enough to learn. Furthermore, because regret bounds for learning
MDPs usually depend polynomially on the number of states, the 𝑇 -state MDP will not allow us to get our
promised Õ(√𝑇) regret bound. For example, standard regret bounds in this setting are O(√(𝑆𝑇)) [6] where
𝑆 is the number of states; in our setting, if 𝑆 = 𝑇 , this would imply linear regret.
We solve this issue by showing that using OPT^inf_m instead of OPT^inf_T in fact leads to minimal error even
when 𝑚 ≈ log𝑇. Specifically, the final result for this section is the following.

Theorem 4.4. Fix a constant $C > 0$ and an integer $m \ge \frac{2C}{\bar c\rho}\log T$. Then, for any integer $M \ge m$, it holds that
$\mathrm{OPT}^{\inf}_m \ge \mathrm{OPT}^{\inf}_M - O\big(\frac{1}{\bar c\rho}\, T^{-C}\big)$. In addition, this can be achieved even if we constrain $W_m = \bar c$ (i.e., bid 1 in
state $m$).

The high-level idea of the proof is that with an average budget 𝜌, the bidder can win an auction on average
every 𝜌⁻¹ steps and, with expected conversion probability c̄, get a conversion every (c̄𝜌)⁻¹ rounds. We
will prove that in the optimal policy, the probability that no conversion occurred for a sufficiently long
time is so low that dropping this part of the MDP leads to minimal error. The hard part of turning this idea
into a proof is to argue that in any optimal strategy, the winning probabilities 𝑊ℓ are monotone increasing
in ℓ (intuitively, the longer the time since the last conversion, the more eager the bidder is to win now).
Without this property, the following solution could win on average every 𝜌⁻¹ steps (consider c̄ = 1 for
simplicity): $W_1 = 1 - \frac{1-\rho}{\rho}\,O(T^{-1/2})$, $W_2 = \dots = W_{\sqrt{T}-1} = 0$, and $W_{\sqrt{T}} = 1$. Dropping the latter part of such
a solution would result in a significant loss in value.


We will prove the monotonicity of the optimal winning probabilities 𝑊ℓ and the above theorem by exam-
ining the Lagrangian of the problem. In the Lagrangian version of the problem, we add the spending into
the objective function with a Lagrange multiplier 𝜆 ≥ 0. The value of winning a conversion in state ℓ with
probability 𝑊 thus becomes:
Lℓ (𝑊ℓ , 𝜆) = 𝑟𝑚 (ℓ)𝑊ℓ − 𝜆𝑃 (𝑊ℓ ).
Traditionally, one would write the Lagrangian objective as 𝑟𝑚 (ℓ)𝑊ℓ +𝜆(𝜌 −𝑃 (𝑊ℓ )), as (the infinite-horizon
average of) this expression for any 𝜆 ≥ 0 constitutes an upper bound on the value of the optimal solution
to the constrained problem (5). Our formulation of the Lagrangian objective above omits the 𝜆𝜌 term as
we focus on maximizing L by picking 𝑊 , and 𝜆𝜌 is a constant in this respect.
Fix a 𝜆. Now we focus on the MDP whose reward in state ℓ ∈ [𝑚] is Lℓ (𝑊ℓ , 𝜆) and the action space in
that state is 𝑊ℓ ∈ [0, c̄]. Without constraints, we can use standard machinery from unconstrained MDP
optimization. Following the notation of [71, Section 5], let 𝑔ℓ∗ (𝜆) and ℎ ℓ∗ (𝜆) be the gain and bias of the
optimal policy 𝑊1∗ (𝜆), . . . ,𝑊𝑚∗ (𝜆), which are defined as follows. First, the gain 𝑔ℓ∗ (𝜆) is the time-average
expected reward of the optimal policy when starting at state ℓ. Formally,
$$g^*_\ell(\lambda) = \lim_{H\to\infty}\frac{1}{H}\sum_{h=1}^{H}\mathbb{E}_{\ell_h}\Big[\mathcal{L}_{\ell_h}\big(W^*_{\ell_h}(\lambda),\lambda\big)\,\Big|\,\ell_1=\ell\Big].$$

[71, Theorem 8.3.2] shows that for weakly communicating MDPs (MDPs where every state is reachable
from any other state) there is always an optimal policy with constant gain, i.e., 𝑔ℓ∗ (𝜆) = 𝑔ℓ∗′ (𝜆) for ℓ ≠ ℓ ′ .
Therefore, we denote $g^*_\ell(\lambda) = g^*(\lambda)$. The bias $h^*_\ell(\lambda)$ at state $\ell$ is defined as the difference in total expected
reward if starting at state $\ell$ instead of getting $g^*(\lambda)$ every round. Formally,
$$h^*_\ell(\lambda) = \lim_{H\to\infty}\mathbb{E}_{\ell_h}\Bigg[\sum_{h=1}^{H}\Big(\mathcal{L}_{\ell_h}\big(W^*_{\ell_h}(\lambda),\lambda\big) - g^*(\lambda)\Big)\,\Bigg|\,\ell_1=\ell\Bigg].$$

We note that the above limit does not necessarily exist. In that case, we can substitute the limit above with
the Cesaro limit6 to solve this issue and make the bias function well-defined.
Using the gain and the bias functions, we use the Bellman optimality condition for infinite horizon average
reward MDPs (see [71, Section 8.4.1]). Specifically, in our setting, this condition implies that for all ℓ ∈ [𝑚]
$$h^*_\ell(\lambda) + g^*(\lambda) = \max_{W}\Big[\mathcal{L}_\ell(W,\lambda) + W h_1(\lambda) + (1-W)\,h_{\min\{m,\ell+1\}}(\lambda)\Big] = \max_{W}\Big[W r(\ell) - \lambda P(W) + W h_1(\lambda) + (1-W)\,h_{\min\{m,\ell+1\}}(\lambda)\Big].$$

We use two simplifying definitions. First, we define ℎ𝑚+1 (𝜆) = ℎ𝑚 (𝜆); this simplifies the minimum in the
above notation. Second, because the above equation is invariant to adding a constant to ℎ ∗· (𝜆), we can
assume w.l.o.g. that ℎ 1 (𝜆) = 0. Using this notation, we get that for any optimal solution 𝑊1∗ (𝜆), . . . ,𝑊𝑚∗ (𝜆)
it holds that
$$h^*_\ell(\lambda) + g^*(\lambda) = W^*_\ell(\lambda)\,r(\ell) - \lambda P\big(W^*_\ell(\lambda)\big) + \big(1 - W^*_\ell(\lambda)\big)\,h^*_{\ell+1}(\lambda). \tag{7}$$
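For a fixed 𝜆, the Bellman condition above can be solved numerically by relative value iteration; the following sketch (illustrative code that discretizes the action range purely for exposition, something the paper's approach avoids) returns the gain and a maximizing winning probability per state:

```python
import numpy as np

def lagrangian_value_iteration(r, P, lam, m, c_bar, grid=201, iters=2000):
    """Relative value iteration for the unconstrained average-reward MDP with
    per-state reward L_l(W, lam) = r(l) W - lam P(W), action W in [0, c_bar],
    and transitions win -> state 1, lose -> state min(l+1, m).  The bias is
    normalized so that h_1 = 0, as in the text."""
    actions = np.linspace(0.0, c_bar, grid)
    h = np.zeros(m)
    W_opt = np.zeros(m)
    g = 0.0
    for _ in range(iters):
        h_new = np.empty(m)
        for l in range(m):                       # 0-indexed: state l means l+1 rounds since a win
            nxt = min(l + 1, m - 1)
            vals = [r(l + 1) * W - lam * P(W) + W * h[0] + (1.0 - W) * h[nxt]
                    for W in actions]
            k = int(np.argmax(vals))
            h_new[l], W_opt[l] = vals[k], actions[k]
        g = h_new[0]                             # with h_1 = 0, the equation at state 1 gives the gain
        h = h_new - h_new[0]
    return g, W_opt
```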

Proving that $W^*_\ell(\lambda)$ is non-decreasing in $\ell$. Here we prove that for every optimal solution and any $\lambda \ge 0$,
it holds that $W^*_\ell(\lambda) \le W^*_{\ell+1}(\lambda)$ for every $\ell \in [m-1]$. The first step is a simple lemma for the bias function
$h^*_\ell(\lambda)$. The lemma bounds from above the difference $h^*_{\ell+1}(\lambda) - h^*_\ell(\lambda)$, i.e., starting in state $\ell$ instead of $\ell+1$
is not much worse.
Lemma 4.5. For every $\ell \in [m]$ it holds that
$$h^*_{\ell+1}(\lambda) \le h^*_\ell(\lambda) + r\big(\min\{\ell+1, m\}\big) - r(\ell). \tag{8}$$

One idea for proving the above lemma is the following. Starting at state ℓ, one strategy is to pretend that
we started at state ℓ + 1 and followed that strategy. That is, pretend that we started one state ahead. The
discrepancy in reward between these two scenarios is 𝑟 (ℓ ′ + 1) − 𝑟 (ℓ ′ ), where ℓ ′ is the state where we won
for the first time. Because of concavity, the discrepancy is at most $r(\ell+1) - r(\ell)$, proving the same bound
for $h^*_{\ell+1}(\lambda) - h^*_\ell(\lambda)$, as needed. The formal proof we present in Appendix C.3 is more technical and proves
the lemma by induction using the optimality equation (7).
Using the above lemma, we prove that any optimal solution for any Lagrange multiplier 𝜆 must be non-
decreasing. The additional assumption we need for this result is that the reward function 𝑟 (·) must satisfy
strict concavity. We note that we can artificially enforce this in any reward function with negligible error
by perturbing it by O (1/poly(𝑇 )) terms. The proof uses Lemma 4.5, the Bellman optimality conditions,
and the strict concavity of the reward function. See the details in Appendix C.
Lemma 4.6. Fix any optimal solution $\vec W^*(\lambda)$ and $\ell \in [m-1]$. If $r(\ell+1) - r(\ell) > r(\ell+2) - r(\ell+1)$, then
$W^*_\ell(\lambda) \le W^*_{\ell+1}(\lambda)$.

The above lemma achieves the goal of this section: it proves that any optimal solution of the constrained
problem has weakly increasing winning probabilities: 𝑊1∗ ≤ 𝑊2∗ ≤ . . . ≤ 𝑊𝑚∗ . This is a corollary of the
fact that the optimization problem (6) can be written as a linear program (using the occupancy measure),
making 𝑊® ∗ optimal for the Lagrangian problem for some 𝜆. We include this proof in Appendix C.4 for
completeness.
6 https://en.wikipedia.org/wiki/Cesaro_summation

Corollary 4.7. Assume $r(\cdot)$ is strictly concave. Then any optimal solution of the optimization problem (5) is weakly increasing, i.e., $W^*_\ell \le W^*_{\ell+1}$ for all $\ell \in [m-1]$.

Proving Theorem 4.4. The key result to prove Theorem 4.4 is that the optimal solution visits state $\ell$ very infrequently for large $\ell$. Specifically, we show that for large enough $\ell$, the probability of winning a conversion in state $\ell$ is at least $\frac{\bar c\rho}{2}$; this implies that state $\ell+1$ is visited at most a $\big(1 - \frac{\bar c\rho}{2}\big)$ fraction of the times that state $\ell$ is visited. This is captured in the following lemma.

Lemma 4.8. For any optimal solution of the constrained problem, $\vec W^*$, if $\ell \ge \frac{2}{\bar c\rho}$ then $W^*_\ell \ge \frac{\bar c\rho}{2}$.

We prove the lemma by first noticing that the average payment of the optimal solution must be exactly $\rho$ (unless the bidder can always afford to pay the price $p_t$ in expectation). We then prove that unless the inequality in Lemma 4.8 is true, $W^*_\ell$ must be small for all $\ell \le \frac{2}{\bar c\rho}$ (by Corollary 4.7). We end up proving that this implies that the bidder is paying less than $\rho$ on average, contradicting our earlier claim. We defer the details of the proof to Appendix C.4.
We proceed to sketch the proof of Theorem 4.4, with the detailed proof in Appendix C.5. First, let $R_m(\vec W)$ and $C(\vec W)$ be the expected average reward (when there are $m$ states) and payment of a vector $\vec W$. The proof consists of two steps. First, we define a vector $\vec W'$ identical to an optimal vector $\vec W^*$ for states $\ell < m$; for states $\ell \ge m$ we set $W'_\ell = \bar c$. We then examine the average-time reward of $\vec W'$: in the original MDP with $M$ states, it holds that $R_M(\vec W') \ge R_M(\vec W^*) = \mathrm{OPT}^{\inf}_M$, but this is not necessarily the case when there are $m$ states; Claim C.9 proves that $R_M(\vec W^*) - R_m(\vec W')$ is very small. Afterward, we examine the spending of $\vec W'$, which can be more than the spending of $\vec W^*$; Claim C.10 proves that $C(\vec W') - \rho$ is very small.

The second step of the proof involves finding a feasible solution with reward close to that of $\vec W'$. We define a solution $\vec W''$ with very small spending. Taking a suitable convex combination of $\vec W'$ and $\vec W''$ yields the theorem.

5 Online Learning Algorithm


This section presents our algorithm that achieves low regret against $\mathrm{OPT}^{\inf}_m$. We prove regret bounds for $\mathrm{OPT}^{\inf}_m$ for any $m \le T$. Specifically, Theorem 5.1 proves a $\tilde O(\mathrm{poly}(m)\sqrt T)$ regret bound. Then, setting $m = \Theta(\log T)$ as in Theorem 4.4 will provide $\tilde O(\sqrt T)$ regret against our original benchmark, the best algorithm that knows the distributions. The polynomial dependence on $m$ in the above regret bound shows why we proved in Section 4.4 that $\mathrm{OPT}^{\inf}_m \approx \mathrm{OPT}^{\inf}_T$ for $m = \Theta(\log T)$.
The algorithm works by partitioning the 𝑇 rounds into epochs of stochastic length. Based on past data,
the bidder calculates the optimal bidding policy of past data at the beginning of each epoch (in Section 6

we show that $t$ samples result in $O(m^3/\sqrt t)$ accuracy by solving a linear program). During each epoch, we
bid using this empirically optimal bidding without updating it. Each epoch can end in one of two ways:
when the bidder wins a conversion or if 𝑘 rounds have passed without a conversion. 𝑘 is a parameter that
ensures that the length of each epoch remains bounded with probability 1. To ensure that an epoch will
end with winning a conversion with a high probability, we choose a high enough 𝑘 and also constrain the
empirical bidding to bid 1 in state ℓ = 𝑚 (Theorem 4.4 shows that this results in negligible error if 𝑚 is
large enough).
To keep epochs similar, we consider one more modification. Specifically, at the start of each epoch, we
bid as the bidder was in state ℓ = 1 of the MDP. This ensures that each epoch starts ‘anew,’ even if there
was no win with conversion in the previous one. This leads to a mismatch between the time since the last
conversion. However, the real allocated reward can only be bigger than the reward of our ‘fake’ start. This
is because if we win when the fake state is ℓ, then the bidder receives at least 𝑟𝑚 (ℓ + 𝑘), which cannot be
less than 𝑟𝑚 (ℓ) by monotonicity. The full algorithm is in Algorithm 1.

ALGORITHM 1: Follow the $k$-delayed Optimal Response Strategy (FKORS)
Input: Average budget $\rho$, reward function $r:\mathbb N\to\mathbb R_{>0}$, parameters $m, k\in\mathbb N$
Bid 0 for the first $k$ rounds and observe prices and conversion rates $\{(p_t, x_t)\}_{t\in[k]}$
Set the length and end of epoch 0: $L_0 = k$ and $T_0 = k$
for epoch $i = 1, 2, \dots$ do
    Calculate the optimal vector of bidding policies from contexts to bids $\vec b^{T_{i-1}}$ on the empirical dataset $\{(p_t, x_t)\}_{t\in[T_{i-1}]}$ using linear program (10), constrained so that $b^{T_{i-1}}_m(\cdot) = 1$
    Restart the MDP: set $\tilde\ell_{T_{i-1}+1} = 1$
    for rounds $t = T_{i-1}+1, T_{i-1}+2, \dots, T_{i-1}+k$ do
        Observe context $x_t$
        If the remaining budget is at least 1, bid using $b^{T_{i-1}}_{\tilde\ell_t}(x_t)$, otherwise bid 0
        Observe price $p_t$
        if a conversion happens then
            Receive reward $r(\tilde\ell_t)$
            Set the end of this epoch $T_i = t$ and its duration $L_i = T_i - T_{i-1}$
            Go to the next epoch
        else
            Set $\tilde\ell_{t+1} = \min\{m, \tilde\ell_t + 1\}$
        end
    if no conversion happened in epoch $i$ then
        Set the end of this epoch $T_i = T_{i-1} + k$ and its duration $L_i = k$
    end
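For intuition, here is a minimal Python sketch (ours, not part of the paper) of the epoch structure of Algorithm 1. The names `solve_lp` and `sample_round` are hypothetical placeholders for the linear program (10) and the environment, and the conversion rate is used directly in place of a general context.

import numpy as np

rng = np.random.default_rng(0)

def fkors(T, rho, r, m, k, solve_lp, sample_round):
    """Simplified sketch of Algorithm 1 (FKORS).

    `solve_lp(prices, convs, m)` is assumed to return a bidding policy: a function
    mapping (state l, conversion rate c) -> bid, constrained to bid 1 in state m.
    `sample_round()` draws one (price, conversion_rate) pair.
    """
    budget = rho * T
    prices, convs = [], []
    for _ in range(k):                                     # warm-up: bid 0, only observe
        p, c = sample_round()
        prices.append(p); convs.append(c)
    t, total_reward = k, 0.0
    while t < T:
        policy = solve_lp(np.array(prices), np.array(convs), m)  # recomputed once per epoch
        state = 1                                          # restart the fake MDP state
        for _ in range(k):                                 # an epoch lasts at most k rounds
            if t >= T:
                break
            p, c = sample_round()
            prices.append(p); convs.append(c)
            bid = policy(state, c) if budget >= 1 else 0.0
            t += 1
            if bid >= p:                                   # win the (second-price) auction
                budget -= p
                if rng.random() < c:                       # the win converts
                    total_reward += r(state)
                    break                                  # a conversion ends the epoch
            state = min(m, state + 1)
    return total_reward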

The rest of the section focuses on proving the regret bound of Algorithm 1. We make the simplifying assumption that we know $\bar c$. This is needed to set $m$ and $k$ to be at least a suitable function of $\frac{1}{\bar c}$. The assumption can be replaced by any lower bound on $\bar c$; as long as this lower bound is within a constant factor of $\bar c$, our regret bound only becomes larger by that multiplicative constant. Such a loose lower bound for $\bar c$ can be calculated by sampling the conversion rate a few times before running our algorithm. We do not include this for simplicity.
In the following theorem, we prove a parametric high-probability regret bound for Algorithm 1, assuming $k \ge m + \frac{1}{\bar c}\log T$. One part of the regret bound is a $\tilde O(k\sqrt T)$ term. The other part depends on the error due to sampling, i.e., calculating the bidding of epoch $i$ with an empirical distribution. To present this error, we first overload our previous notation and define $R(\vec b)$ and $C(\vec b)$ as the expected average reward and payment of using a vector of bidding policies $\vec b$ in the infinite-horizon setting. One part of this error is the sub-optimality of this bidding: $\varepsilon_R(t) = \big(\mathrm{OPT}^{\inf}_m - R(\vec b^t)\big)^+$, where $\vec b^t$ is the optimal bidding using the empirical distribution of the first $t$ rounds. The other part of this error is the potential over-payment of this bidding: $\varepsilon_C(t) = \big(C(\vec b^t) - \rho\big)^+$, where $\vec b^t$ is defined as before.
Theorem 5.1. Fix any $m \in \mathbb N$ and let $k \ge m + \frac{1}{\bar c}\log T$. Let $N$ be the number of epochs. Then for all $\delta > 0$, with probability at least $1-\delta$, Algorithm 1 achieves regret against $\mathrm{OPT}^{\inf}_m$ that is at most
$$\sum_{j=1}^{N} L_j\big(\varepsilon_R(T_{j-1}) + \varepsilon_C(T_{j-1})\big) + O\!\left(k\sqrt{T\log\frac{T}{\delta}}\right)$$
where $\varepsilon_R(T_{j-1})$ and $\varepsilon_C(T_{j-1})$ are error terms of the bidding of epoch $j$, $\vec b^{T_{j-1}}$: $\varepsilon_R(T_{j-1}) = \big(\mathrm{OPT}^{\inf}_m - R(\vec b^{T_{j-1}})\big)^+$ is the reward sub-optimality gap and $\varepsilon_C(T_{j-1}) = \big(C(\vec b^{T_{j-1}}) - \rho\big)^+$ is the expected average payment above $\rho$.

Using Theorem 6.1, which bounds $\varepsilon_R(t)$ and $\varepsilon_C(t)$, and assuming $m$ is large enough, we get the following unconditional regret bound against $\mathrm{OPT}^{\inf}_T$.

Corollary 5.2. Let $m = \lceil \frac{2}{\bar c\rho}\log T\rceil$ and $k = \lceil m + \frac{1}{\bar c}\log T\rceil$. Then for all $\delta > 0$, with probability at least $1-\delta$, Algorithm 1 achieves total reward at least
$$T\cdot\mathrm{OPT}^{\inf}_T - O\!\left(\frac{1}{\bar c^3\rho^4}\log^3 T\sqrt{T\log\frac{T}{\delta}}\right)$$

To prove Theorem 5.1, we face many challenges. Many standard tools that analyze sequential decisions do
not work in our setting since, depending on past rounds, the bidder is in a different state of the MDP. This
problem becomes even greater when the bidding policy is updated, and the underlying Markov Chain
changes. This is why we do not change the bidding policy during an epoch, fixing the Markov Chain
over its duration. While this is not entirely new in the Online Learning MDP literature, using epochs of
unknown stochastic length is.
Remark. We briefly go over why we need epochs of stochastic length, even under a very simplistic error analysis. Assume we had epochs of fixed length $k$, i.e., we commit to a bidding strategy for $k$ consecutive rounds. In that case, in every epoch, even if the bidding is close to optimal, we get $k\cdot\mathrm{OPT}^{\inf}_m - \Omega(1)$ value in expectation; the "$-\Omega(1)$" error is because of the change of policies between epochs. This makes the total reward over all epochs $\big(k\cdot\mathrm{OPT}^{\inf}_m - \Omega(1)\big)\frac{T}{k} = T\cdot\mathrm{OPT}^{\inf}_m - \Omega(\frac{T}{k})$. However, this creates the following issue: since we need high-probability bounds to deal with the event of running out of budget early, we have to use concentration inequalities. These concentration inequalities cannot be used on individual rounds as they are dependent, so we use them across epochs. However, this will lead to $\Omega\big(k\sqrt{T/k}\big)$ error across all of the $\frac{T}{k}$ epochs, making the total error $\Omega\big(\frac{T}{k} + \sqrt{Tk}\big)$. Picking $k = \Theta(T^{1/3})$ yields the optimal error of $\Omega(T^{2/3})$, which is much higher than what we offer in Corollary 5.2.

Theorem 5.1 follows from a series of lemmas. First, we prove Lemma 5.3, a lower bound on the realized reward of the algorithm in every round, assuming it does not run out of budget. Specifically, we prove that by any such round $\tau$, the realized reward is at least $\tau\,\mathrm{OPT}^{\inf}_m - \sum_j L_j\varepsilon_R(T_{j-1}) - O(k\sqrt T)$ with high probability. This lemma proves Theorem 5.1, assuming that the budget is not depleted by some round $\tau$ close enough to $T$. This is implied by Lemma 5.4, where we prove an upper bound on the realized spending of the algorithm by any round. Specifically, Lemma 5.4 proves that the total spending by any round $\tau$ is at most $\tau\rho + \sum_j L_j\varepsilon_C(T_{j-1}) + O(k\sqrt\tau)$ with high probability. Combining these two lemmas, we get that by round $\tau = T - \sum_j L_j\varepsilon_C(T_{j-1}) - O(k\sqrt T)$ the budget is not depleted, and therefore the total reward is at least $\tau\,\mathrm{OPT}^{\inf}_m - \sum_j L_j\varepsilon_R(T_{j-1}) - O(k\sqrt T)$, with high probability. The detailed proof of Theorem 5.1 is in Appendix D.3, after proving the aforementioned lemmas.
We now present the lower bound on the realized reward. Specifically, we show that the total reward of the first $\tau$ rounds is at least $\tau\,\mathrm{OPT}^{\inf}_m$ with some error, for any $\tau$. The first part of the error comes from the sub-optimality of the bidding in each epoch $i$, $\varepsilon_R(T_{i-1})$. The second part of the error comes from concentration.

Lemma 5.3. Let $w_t \in \{0,1\}$ indicate whether the bidder got a conversion in round $t$. Assume $k \ge m + \frac{1}{\bar c}\log T$. Fix a round $\tau \le T - k$ and let $I_\tau$ be the epoch of round $\tau$. Then the total realized reward up to round $\tau$ is at least
$$\tau\,\mathrm{OPT}^{\inf}_m - \sum_{j=1}^{I_\tau} L_j\varepsilon_R(T_{j-1}) - O\!\left(k\sqrt{T\log\frac{T}{\delta}}\right)$$
with probability at least $1-\delta$, for any $\delta > 0$.

The proof of the lemma is quite involved in order to obtain the high-probability guarantee. The proof requires bounding quantities for which we do not have useful information. For example, the expected reward of an epoch might not correlate with the expected reward between wins with conversions of an optimal solution. First, we prove that the realized reward across epochs is close to its expectation. Then, we use the fact that the total expected reward of epoch $i$ equals the expected average-time reward of $\vec b^{T_{i-1}}$ (which is close to $\mathrm{OPT}^{\inf}_m$) times the expected return time to state 1 of $\vec b^{T_{i-1}}$. This step makes $\mathrm{OPT}^{\inf}_m$ appear. However, now we have to handle the expected return time to state 1 of $\vec b^{T_{i-1}}$. We show that the sum of these expected lengths is close to their realized values, which sum up to $\tau$. Throughout the proof, we take advantage of the fact that an epoch ends early with high probability. This makes quantities like an epoch's expected length very close to the return time to state 1 in the infinite-horizon setting. We defer the full proof to Appendix D.1.
We now present the upper bound on the realized payment. We show that by any round 𝜏, the total payment
of the algorithm is at most 𝜏𝜌 plus some error. Similar to Lemma 5.3, this error comes from the overpayment
due to suboptimal bidding in each epoch 𝑖, 𝜀𝐶 (𝑇𝑖 −1 ), and some error due to concentration.

Lemma 5.4. Assume that $k \ge m + \frac{1}{\bar c}\log T$. Fix a round $\tau$ and let $I_\tau$ be its epoch. For any $\delta > 0$, with probability at least $1-\delta$, the total payment of the algorithm until round $\tau$ is at most
$$\tau\rho + \sum_{j=1}^{I_\tau} L_j\varepsilon_C(T_{j-1}) + O\!\left(k\sqrt{T\log\frac{T}{\delta}}\right)$$

The proof follows similar steps to the proof of Lemma 5.3. First, we prove that with high probability the total payment of the epochs is close to their expected spending between wins with conversions in the infinite-horizon setting. However, this is not directly useful. We show that, for each epoch, this quantity equals the expected average spending of the bidding of that epoch, times the return time to state 1. Similar to Lemma 5.3, the expected average spending is close to $\rho$, but the return time is not directly useful. We show that these return times are close to the realized lengths of the epochs, which sum up to $\tau$, the round we examine in Lemma 5.4. The full proof can be found in Appendix D.2.

6 Estimating OPTinf from Samples


In this section, we examine the calculation of the optimal solution of $\mathrm{OPT}^{\inf}_m$, the benchmark we use in Algorithm 1. Specifically, we prove the following theorem, showing that using $t$ price/conversion rate samples, we can use a linear program of size $O(tm)$ to calculate the optimal solution of $\mathrm{OPT}^{\inf}_m$ with only $\tilde O(m^3/\sqrt t)$ error.

Theorem 6.1. Fix $m\in\mathbb N$ such that $m \ge \frac{2}{\bar c\rho}\log T$. Using price/conversion rate samples $(p_1,c_1),\dots,(p_t,c_t)$, we can construct an empirical vector of mappings from contexts to bids $\vec b^t$ with a linear program of size $O(tm)$, constrained to bid 1 in state $m$. For this vector, for any $\delta>0$, with probability at least $1-\delta$, the expected average payment and reward satisfy
$$C(\vec b^t) \le \rho + O\!\left(m^3\sqrt{\frac1t\log\frac1\delta}\right) \;\Longrightarrow\; \varepsilon_C(t) \le O\!\left(m^3\sqrt{\frac1t\log\frac1\delta}\right)$$
$$R_m(\vec b^t) \ge \mathrm{OPT}^{\inf}_M - O\!\left(\frac{m^3}{\rho}\sqrt{\frac1t\log\frac1\delta}\right) \;\Longrightarrow\; \varepsilon_R(t) \le O\!\left(\frac{m^3}{\rho}\sqrt{\frac1t\log\frac1\delta}\right)$$

where 𝜀𝐶 (𝑡) and 𝜀𝑅 (𝑡) are as in Theorem 5.1.

To prove this theorem, we will provide the following lemmas. First, in Section 6.1, Lemma 6.2 proves a
crucial result: the optimal bidding does not fully consider the context. In particular, any optimal solution
uses bids of the form $\min\big(1, \frac{c}{\mu}\big)$ for some $\mu \ge 0$ ($\mu = 0$ corresponds to bidding 1). This is the key simplifying
step to Theorem 6.1. First, it simplifies the calculation of the optimal solution since we only look for linear
mappings from conversion rates to bids, allowing us to compute the optimal solutions with the linear
program (10) in Section 6.2. Second, it simplifies the functions we want to learn: the probability of winning
with a conversion 𝑊 (·) and the expected payment 𝑃 (·) become monotone mappings from the parameter
𝜇 to [0, 1]. Utilizing this, Lemma 6.4 in Section 6.4 proves a uniform convergence bound on 𝑊 (·) and 𝑃 (·).
The full proof can be found in Appendix E.4, after proving the above lemmas.

6.1 Simplicity of Optimal Bidding

This section examines the problem of maximizing the probability for a single conversion subject to an
expected budget constraint. Specifically, we assume that the bidder has per-round budget 𝜌 ′ ∈ (0, 1]
(note this can depend on the state ℓ and does not need to be 𝜌) and bids to maximize her probability
of a conversion. She picks a function 𝑏 (·) that maps a context and a conversion rate to a (potentially
randomized) bid. This results in the following optimization problem:
$$\begin{aligned}
\sup_{b:\,\mathcal X\to\Delta([0,1])}\;&\mathbb E_{x,p}\Big[c\,\mathbb 1\big[b(x)\ge p\big]\Big]\\
\text{s.t.}\;&\mathbb E_{x,p}\Big[p\,\mathbb 1\big[b(x)\ge p\big]\Big]\le \rho'
\end{aligned} \tag{9}$$

Analyzing the structure of the above optimization problem will give insight into how the bidder should
bid in a single state ℓ of the MDP of OPTinf . We proceed to show that the optimal solution takes a very
simple form.

Lemma 6.2. The optimal bidding function of (9) does not need to depend on the context $x$, only on the conversion rate $c$. In addition, the optimal bidding takes the following form:
• If the price distribution for a given conversion rate has no atoms, the optimal solution is $b_\mu(c) = \min\big(1, \frac{c}{\mu}\big)$ for some $\mu \ge 0$; note $\mu = 0$ implies bidding 1 for any context.
• If the price distribution for a given conversion rate has finite support, the optimal solution is supported on at most two bidding policies, $b_{\mu_1}(c)$ and $b_{\mu_2}(c)$, for some $\mu_1, \mu_2 \ge 0$.
• In all other cases, the maximum might not be attained, but a solution supported on $b_{\mu_1}(c)$ and $b_{\mu_2}(c)$ can get arbitrarily close.

The proof starts by defining the Lagrangian of (9): $\mathbb E_{x,p}\big[(c - \mu p)\,\mathbb 1[b(x) \ge p]\big] + \mu\rho'$ for some multiplier $\mu \ge 0$. We next observe that the optimal solution to this is to bid $b(c) = \frac{c}{\mu}$. The rest of the proof involves showing that the optimal solution for the Lagrangian can be translated into an optimal one for the constrained problem. We defer the full proof to Appendix E.1.

6.2 Linear Program for $\mathrm{OPT}^{\inf}_m$

In this section, we re-formulate the optimization problem (6) into a linear program. We do so only when the
distribution of the prices and contexts has finite support. Specifically, we consider that there are at most 𝑛
price/conversion rate pairs in the support: {(𝑝𝑖 , 𝑐𝑖 )}𝑖 ∈ [𝑛] . As we showed in Lemma 6.2, the optimal solution
of maximizing the probability of winning with a conversion subject to an expected budget constraint is bidding $\frac{c}{\mu}$ for some $\mu \ge 0$. For such a $\mu$, we overload our previous notation and define the probability of getting a conversion using $\mu$ as
$$W(\mu) = \mathbb E_{p,c}\Big[c\,\mathbb 1\big[c \ge \mu p\big]\Big] = \sum_{i=1}^{n} c_i\,\mathbb 1\big[c_i \ge \mu p_i\big]\,\mathbb P_{p,c}\big[(p,c) = (p_i, c_i)\big]$$
and similarly we define the expected payment $P(\mu)$. This naturally defines $\mu_i = \frac{c_i}{p_i}$ for every $(p_i, c_i)$ in the support, since any other $\mu$ has the same effect as some $\mu_i$. The only exception is bidding 0, which might not be covered by any $\mu_i$; thus we define $\mu_0 = \infty$, which corresponds to bidding $\frac{c}{\mu_0} = 0$. We will use $\{\mu_i\}_{i\in[n]\cup\{0\}}$ as the "actions" of each state.
We use $\{q_{\ell i}\}_{\ell\in[m],\, i\in[n]\cup\{0\}}$ as our decision variables. For a state $\ell$ and action $i$, $q_{\ell i}$ stands for the occupancy measure of that state/action pair, i.e., the probability of being in state $\ell$ and using $\mu_i$. This makes the stationary probability of state $\ell$ equal to $\pi_\ell = \sum_i q_{\ell i}$. Substituting this in (6) we get
$$\begin{aligned}
\mathrm{OPT}^{\inf}_m = \max_{q_{\ell i}\ge 0}\quad & \sum_{\ell=1}^{m} r(\ell)\sum_{i=0}^{n} W(\mu_i)\,q_{\ell i}\\
\text{such that}\quad & \sum_{\ell=1}^{m}\sum_{i=0}^{n} P(\mu_i)\,q_{\ell i} \le \rho\\
\text{where}\quad & \sum_{i=0}^{n} q_{1 i} = \sum_{\ell=1}^{m}\sum_{i=0}^{n} W(\mu_i)\,q_{\ell i}\\
& \sum_{i=0}^{n} q_{\ell i} = \sum_{i=0}^{n}\big(1 - W(\mu_i)\big)\,q_{\ell-1,\, i} \qquad \forall \ell = 2, 3, \dots, m-1\\
& \sum_{i=0}^{n} q_{m i} = \sum_{\ell=m-1}^{m}\sum_{i=0}^{n}\big(1 - W(\mu_i)\big)\,q_{\ell i}\\
& \sum_{\ell=1}^{m}\sum_{i=0}^{n} q_{\ell i} = 1
\end{aligned} \tag{10}$$

We notice that the above is linear in the variables $q_{\ell i}$. In addition, there are $(n+1)m$ variables and $m+2$ constraints. We also note that the constraint of bidding 1 in state $m$ can be encoded by adding an action that corresponds to such a bid, $\mu_{n+1} = 0$, and adding the constraint $\sum_{i=0}^{n} q_{m i} = 0$. This proves that we can execute Algorithm 1 in polynomial time, since in every round $t$ the empirical distribution has support of size at most $t$.
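As an illustration, the following sketch builds and solves an empirical version of linear program (10) with scipy's `linprog`. It is a minimal rendering under our own assumptions: all sampled prices are strictly positive, the extra "bid 1 in state $m$" constraint is omitted for brevity, and the helper name `empirical_lp` is ours, not the paper's.

import numpy as np
from scipy.optimize import linprog

def empirical_lp(prices, convs, m, rho, r):
    """Sketch of linear program (10) on an empirical price/conversion-rate sample.

    Decision variables q[l, i] are occupancy measures over states l = 1..m and
    actions mu_i = c_i / p_i (plus mu_0 = infinity, i.e. bidding 0).  Returns the
    action set (the multipliers mu) and the optimal occupancy measure q.
    """
    n = len(prices)
    mus_fin = convs / prices                                   # mu_i = c_i / p_i (prices > 0 assumed)
    win_fin = convs[None, :] >= mus_fin[:, None] * prices[None, :]
    win = np.vstack([np.zeros((1, n), dtype=bool), win_fin])   # action 0: mu_0 = inf, never wins
    mus = np.concatenate(([np.inf], mus_fin))
    W = (win * convs[None, :]).mean(axis=1)                    # empirical W(mu_i)
    P = (win * prices[None, :]).mean(axis=1)                   # empirical P(mu_i)
    A = n + 1                                                  # number of actions
    nvar = m * A                                               # variable index: l * A + i
    c_obj = np.zeros(nvar)
    for l in range(m):
        c_obj[l * A:(l + 1) * A] = -r(l + 1) * W               # minimize the negative reward
    A_ub, b_ub = [np.tile(P, m)], [rho]                        # budget constraint
    A_eq, b_eq = [], []
    row = np.zeros(nvar); row[0:A] = 1.0
    for l in range(m):
        row[l * A:(l + 1) * A] -= W                            # flow into state 1 = conversion prob.
    A_eq.append(row.copy()); b_eq.append(0.0)
    for l in range(2, m):                                      # flow balance for states 2..m-1
        row = np.zeros(nvar)
        row[(l - 1) * A:l * A] = 1.0
        row[(l - 2) * A:(l - 1) * A] -= (1.0 - W)
        A_eq.append(row); b_eq.append(0.0)
    row = np.zeros(nvar)                                       # flow balance for state m
    row[(m - 1) * A:m * A] = W
    row[(m - 2) * A:(m - 1) * A] -= (1.0 - W)
    A_eq.append(row); b_eq.append(0.0)
    A_eq.append(np.ones(nvar)); b_eq.append(1.0)               # occupancy measures sum to 1
    res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None), method="highs")
    return mus, res.x.reshape(m, A)

# Toy usage: 200 samples, m = 5 states, sqrt reward, per-round budget 0.1.
rng = np.random.default_rng(0)
p, c = rng.uniform(0.01, 1.0, 200), rng.uniform(size=200)
mus, q = empirical_lp(p, c, m=5, rho=0.1, r=lambda l: l ** 0.5)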

6.3 Approximating $\mathrm{OPT}^{\inf}_m$ with an Approximate Distribution

In this section, we examine finding the optimal solution for the infinite-horizon problem using an approximate distribution over prices and conversion rates. Specifically, we model this as having functions $W'(\mu)$ and $P'(\mu)$ that are close to the real ones, $W(\mu)$ and $P(\mu)$. We then prove that any vector $\vec\mu$ has expected average reward and payment that are similar under both pairs of functions.

Lemma 6.3. Let $W(\mu)$ and $P(\mu)$ be the probability of winning and the expected payment of bid $\min\{1, \frac{c}{\mu}\}$. Assume that $W'(\mu), P'(\mu) \in [0,1]$ are such that $|W(\mu) - W'(\mu)| \le \varepsilon$ and $|P(\mu) - P'(\mu)| \le \varepsilon$ for all $\mu \ge 0$. Fix a $\vec\mu$ such that $\mu_m = 0$ (i.e., it bids 1 at state $m$). Assume that $m \ge \frac{1}{\bar c}$. Then we have
$$|R'(\vec\mu) - R(\vec\mu)| \le 36\, m^2\varepsilon \qquad\text{and}\qquad |C'(\vec\mu) - C(\vec\mu)| \le 39\, m^3\varepsilon$$

The full proof can be found in Appendix E.2.

6.4 Approximating 𝑊 (𝜇) and 𝑃 (𝜇) with Samples

This section presents the bounded sample error needed for Lemma 6.3. Specifically, we show the following lemma: with $n$ samples, the error on $P(\cdot)$ and $W(\cdot)$ is $\tilde O(1/\sqrt n)$ with high probability.

Lemma 6.4. Let $W(\mu)$ and $P(\mu)$ be the probability of winning a conversion and the expected payment when bidding $\min\{1, \frac{c}{\mu}\}$. Let $W_n(\mu)$ and $P_n(\mu)$ be the empirical estimates of these two functions using $n$ samples $\{(p_i, c_i)\}_{i\in[n]}$, i.e.,
$$W_n(\mu) = \frac1n\sum_{i=1}^{n} c_i\,\mathbb 1\big[c_i \ge \mu p_i\big] \qquad\text{and}\qquad P_n(\mu) = \frac1n\sum_{i=1}^{n} p_i\,\mathbb 1\big[c_i \ge \mu p_i\big]$$
Then, for all $\delta \in (0,1)$, with probability at least $1-\delta$ it holds that for all $\mu \ge 0$
$$|W_n(\mu) - W(\mu)| \le O\!\left(\sqrt{\frac1n\log\frac2\delta}\right) \qquad\text{and}\qquad |P_n(\mu) - P(\mu)| \le O\!\left(\sqrt{\frac1n\log\frac2\delta}\right)$$
The proof follows from a standard result: the expected maximum error, $\mathbb E\big[\sup_\mu |W(\mu) - W_n(\mu)|\big]$, is $O(1/\sqrt n)$. Using McDiarmid's inequality, the lemma follows by converting that bound to a high-probability one. The full proof can be found in Appendix E.3.
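For concreteness, here is a short numpy sketch of the empirical estimates $W_n(\mu)$ and $P_n(\mu)$ from Lemma 6.4; the helper name and the toy uniform distributions are our own illustrative choices.

import numpy as np

def empirical_estimates(prices, convs, mus):
    """Empirical W_n(mu) and P_n(mu) for an array of mu values.

    Bidding min{1, c/mu} wins the auction on sample (p_i, c_i) exactly when
    c_i >= mu * p_i; W_n additionally weights each win by its conversion rate c_i.
    """
    win = convs[None, :] >= np.asarray(mus)[:, None] * prices[None, :]
    W_n = (win * convs[None, :]).mean(axis=1)
    P_n = (win * prices[None, :]).mean(axis=1)
    return W_n, P_n

# Toy check: n i.i.d. samples with uniform prices and conversion rates.
rng = np.random.default_rng(1)
n = 10_000
prices, convs = rng.uniform(size=n), rng.uniform(size=n)
W_n, P_n = empirical_estimates(prices, convs, np.linspace(0.0, 5.0, 6))
print(np.round(W_n, 3), np.round(P_n, 3))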

References
[1] Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. Reinforcement learning: Theory and algorithms. CS
Dept., UW Seattle, Seattle, WA, USA, Tech. Rep 32 (2019), 96.
[2] Aggarwal, G., Badanidiyuru, A., Balseiro, S. R., Bhawalkar, K., Deng, Y., Feng, Z., Goel, G., Liaw, C., Lu,
H., Mahdian, M., et al. Auto-bidding and auctions in online advertising: A survey. ACM SIGecom Exchanges
22, 1 (2024), 159–183.
[3] Aggarwal, G., Fikioris, G., and Zhao, M. No-regret algorithms in non-truthful auctions with budget and ROI
constraints. CoRR abs/2404.09832 (2024).
[4] Aggarwal, G., Fikioris, G., and Zhao, M. No-regret algorithms in non-truthful auctions with budget and ROI
constraints. CoRR abs/2404.09832 (2024).
[5] Alimohammadi, Y., Mehta, A., and Perlroth, A. Incentive compatibility in the auto-bidding world. arXiv
preprint arXiv:2301.13414 (2023).
[6] Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings
of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017
(2017), D. Precup and Y. W. Teh, Eds., vol. 70 of Proceedings of Machine Learning Research, PMLR, pp. 263–272.
[7] Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. In 54th Annual IEEE Symposium
on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA (2013), IEEE Computer
Society, pp. 207–216.
[8] Balseiro, S. R., Besbes, O., and Weintraub, G. Y. Repeated auctions with budgets in ad exchanges: Approxi-
mations and design. Management Science 61, 4 (2015), 864–884.
[9] Balseiro, S. R., Bhawalkar, K., Feng, Z., Lu, H., Mirrokni, V., Sivan, B., and Wang, D. A field guide for
pacing budget and ros constraints. arXiv preprint arXiv:2302.08530 (2023).
[10] Balseiro, S. R., and Brown, D. B. Learning in repeated auctions with budgets: Regret minimization and
equilibrium. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (2019),
Curran Associates Inc., pp. 1623–1633.

[11] Balseiro, S. R., and Gur, Y. Learning in repeated auctions with budgets: Regret minimization and equilibrium.
In Proceedings of the 2017 ACM Conference on Economics and Computation, EC ’17, Cambridge, MA, USA, June
26-30, 2017 (2017), C. Daskalakis, M. Babaioff, and H. Moulin, Eds., ACM, p. 609.
[12] Balseiro, S. R., and Gur, Y. Learning in repeated auctions with budgets: Regret minimization and equilibrium.
Manag. Sci. 65, 9 (2019), 3952–3968.
[13] Banchio, M., and Skrzypacz, A. Artificial intelligence and auction design. In Proceedings of the 23rd ACM
Conference on Economics and Computation (2022), pp. 30–31.
[14] Basu, S., Easley, D., O’Hara, M., and Sirer, E. G. Stablefees: A predictable fee market for cryptocurrencies.
Management Science 69, 11 (2023), 6508–6524.
[15] Bichler, M., Lunowa, S. B., Oberlechner, M., Pieroth, F. R., and Wohlmuth, B. On the convergence of
learning algorithms in bayesian auction games. arXiv preprint arXiv:2311.15398 (2023).
[16] Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. Regret minimization and the price of total anarchy. In
Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (2008), pp. 373–382.
[17] Borgs, C., Chayes, J., Immorlica, N., Mahdian, M., and Saberi, A. Multi-unit auctions with budget-
constrained bidders. In Proceedings of the 6th ACM Conference on Electronic Commerce (2005), pp. 44–51.
[18] Borgs, C., Chayes, J., Immorlica, N., Mahdian, M., and Saberi, A. Dynamics for the gsp auction: Stability
and efficiency. ACM Transactions on Economics and Computation (TEAC) 5, 2 (2007), 1–29.
[19] Braverman, M., Mao, J., Schneider, J., and Weinberg, M. Selling to a no-regret buyer. In Proceedings of the
2018 ACM Conference on Economics and Computation (2018), pp. 523–538.
[20] Cai, L., Weinberg, S. M., Wildenhain, E., and Zhang, S. Selling to multiple no-regret buyers. In International
Conference on Web and Internet Economics (2023), Springer, pp. 113–129.
[21] Caragiannis, I., Kaklamanis, C., Kanellopoulos, P., Kyropoulou, M., Lucier, B., Leme, R. P., and Tardos,
E. Bounding the inefficiency of outcomes in generalized second price auctions. Journal of Economic Theory 156
(2015), 343–388.
[22] Castiglioni, M., Celli, A., and Kroer, C. Online learning with knapsacks: the best of both worlds. In
International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (2022),
K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, Eds., vol. 162 of Proceedings of Machine
Learning Research, PMLR, pp. 2767–2783.
[23] Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Regret minimization for reserve prices in second-price
auctions. IEEE Transactions on Information Theory 61, 1 (2014), 549–564.
[24] Cesa-Bianchi, N., and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
[25] Chang, C.-T., Chu, X.-Y. M., and Tsai, I.-T. How cause marketing campaign factors affect attitudes and pur-
chase intention: Choosing the right mix of product and cause types with time duration. Journal of Advertising
Research 61, 1 (2021), 58–77.
[26] Chen, L., Jain, R., and Luo, H. Learning infinite-horizon average-reward markov decision process with con-
straints. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA
(2022), K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, Eds., vol. 162 of Proceedings of
Machine Learning Research, PMLR, pp. 3246–3270.
[27] Chen, X., Kroer, C., and Kumar, R. The complexity of pacing for second-price auctions. In EC ’21: The 22nd
ACM Conference on Economics and Computation, Budapest, Hungary, July 18-23, 2021 (2021), P. Biró, S. Chawla,
and F. Echenique, Eds., ACM, p. 318.
[28] Choi, H., Mela, C. F., Balseiro, S. R., and Leary, A. Online display advertising markets: A literature review
and future directions. Information Systems Research 31, 2 (2020), 556–575.
[29] Chu, L. Y., Nazerzadeh, H., and Zhang, H. Position ranking and auctions for online marketplaces. Manage-
ment Science 66, 8 (2020), 3617–3634.

[30] Conitzer, V., Kroer, C., Panigrahi, D., Schrijvers, O., Sodomka, E., Moses, N. E. S., and Wilkens, C. Pacing
equilibrium in first-price auction markets. In Proceedings of the 2019 ACM Conference on Economics and Compu-
tation, EC 2019, Phoenix, AZ, USA, June 24-28, 2019 (2019), A. R. Karlin, N. Immorlica, and R. Johari, Eds., ACM,
p. 587.
[31] Conitzer, V., Kroer, C., Sodomka, E., and Moses, N. E. S. Multiplicative pacing equilibria in auction mar-
kets. In Web and Internet Economics - 14th International Conference, WINE 2018, Oxford, UK, December 15-17,
2018, Proceedings (2018), G. Christodoulou and T. Harks, Eds., vol. 11316 of Lecture Notes in Computer Science,
Springer, p. 443.
[32] Daskalakis, C., and Syrgkanis, V. Learning in auctions: Regret is hard, envy is easy. In IEEE Annual Sympo-
sium on Foundations of Computer Science, FOCS (2016), pp. 219–228.
[33] Dekimpe, M. G., and Hanssens, D. M. Sustained spending and persistent response: A new look at long-term
marketing profitability. Journal of Marketing Research 36, 4 (1999), 397–412.
[34] Deng, X., Hu, X., Lin, T., and Zheng, W. Nash convergence of mean-based learning algorithms in first price
auctions. In Proceedings of the ACM Web Conference 2022 (2022), pp. 141–150.
[35] Deng, X., Xiao, L., and Zhang, J. Auto-bidding in repeated first-price auctions with budgets. In Proceedings
of the Web Conference 2021 (2021), pp. 274–285.
[36] Dobzinski, S., Lavi, R., and Nisan, N. Multi-unit auctions with budget limits. Games and Economic Behavior
74, 2 (2012), 486–503.
[37] Dobzinski, S., and Leme, R. P. Efficiency guarantees in auctions with budgets. In International Colloquium on
Automata, Languages, and Programming (2014), Springer, pp. 392–404.
[38] Edelman, B., Ostrovsky, M., and Schwarz, M. Internet advertising and the generalized second-price auction:
Selling billions of dollars worth of keywords. American economic review 97, 1 (2007), 242–259.
[39] Feldman, M., Fiat, A., Leonardi, S., and Sankowski, P. Revenue maximizing envy-free multi-unit auctions
with budgets. In Proceedings of the 13th ACM conference on electronic commerce (2012), pp. 532–549.
[40] Feldman, M., Lucier, B., and Syrgkanis, V. Limits of efficiency in sequential auctions. In International
Conference on Web and Internet Economics (2013), Springer, pp. 160–173.
[41] Feng, Y., Lucier, B., and Slivkins, A. Strategic budget selection in a competitive autobidding world. In
Proceedings of the 56th Annual ACM Symposium on Theory of Computing (2024), pp. 213–224.
[42] Feng, Z., Lu, P., and Tang, Z. G. The convergence of no-regret learning in auction mechanisms. In Proceedings
of the ACM on Economics and Computation (2020), vol. 1, ACM, pp. 104–124.
[43] Ferreira, M. V., Moroz, D. J., Parkes, D. C., and Stern, M. Dynamic posted-price mechanisms for the
blockchain transaction-fee market. In Proceedings of the 3rd ACM Conference on Advances in Financial Tech-
nologies (2021), pp. 86–99.
[44] Fikioris, G., and Tardos, É. Approximately stationary bandits with knapsacks. In The Thirty Sixth Annual
Conference on Learning Theory, COLT 2023, 12-15 July 2023, Bangalore, India (2023), G. Neu and L. Rosasco, Eds.,
vol. 195 of Proceedings of Machine Learning Research, PMLR, pp. 3758–3782.
[45] Fikioris, G., and Tardos, É. Liquid welfare guarantees for no-regret learning in sequential budgeted auctions.
In Proceedings of the 24th ACM Conference on Economics and Computation (2023), pp. 678–698.
[46] Gaitonde, J., Li, Y., Light, B., Lucier, B., and Slivkins, A. Budget pacing in repeated auctions: Regret and
efficiency without convergence. In 14th Innovations in Theoretical Computer Science Conference, ITCS 2023,
January 10-13, 2023, MIT, Cambridge, Massachusetts, USA (2023), Y. T. Kalai, Ed., vol. 251 of LIPIcs, Schloss
Dagstuhl - Leibniz-Zentrum für Informatik, pp. 52:1–52:1.
[47] Gentry, M. L., Hubbard, T. P., Nekipelov, D., Paarsch, H. J., et al. Structural Econometrics of Auctions: A
Review. now publishers, 2018.

[48] Immorlica, N., Sankararaman, K. A., Schapire, R. E., and Slivkins, A. Adversarial bandits with knapsacks.
In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA,
November 9-12, 2019 (2019), D. Zuckerman, Ed., IEEE Computer Society, pp. 202–219.
[49] Keller, K. L., and Lehmann, D. R. Brands and branding: Research findings and future priorities. Marketing
science 25, 6 (2006), 740–759.
[50] Kleinberg, R., Leme, R. P., Schneider, J., and Teng, Y. U-calibration: Forecasting for an unknown agent. CoRR
abs/2307.00168 (2023).
[51] Kolumbus, Y., Halpern, J., and Tardos, É. Paying to do better: Games with payments between learning
agents. arXiv preprint arXiv:2405.20880 (2024).
[52] Kolumbus, Y., and Nisan, N. Auctions between regret-minimizing agents. In ACM Web Conference, WebConf
(2022), pp. 100–111.
[53] Kolumbus, Y., and Nisan, N. How and why to manipulate your own agent: On the incentives of users of
learning agents. In Annual Conference on Neural Information Processing Systems, NeurIPS (2022).
[54] Krishna, V. Auction theory. Academic press, 2009.
[55] Kumar, B., Morgenstern, J., and Schrijvers, O. Optimal spend rate estimation and pacing for ad campaigns
with budgets. arXiv preprint arXiv:2202.05881 (2022).
[56] Kumar, R., and Kleinberg, R. Non-monotonic resource utilization in the bandits with knapsacks problem. In
Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Sys-
tems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (2022), S. Koyejo, S. Mohamed,
A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds.
[57] Lewis, R. A., and Reiley, D. H. Online ads and offline sales: measuring the effect of retail advertising via a
controlled experiment on yahoo! Quantitative Marketing and Economics 12 (2014), 235–266.
[58] Lucier, B., Pattathil, S., Slivkins, A., and Zhang, M. Autobidders with budget and roi constraints: Ef-
ficiency, regret, and pacing dynamics. In The Thirty Seventh Annual Conference on Learning Theory (2024),
PMLR, pp. 3642–3643.
[59] Lucier, B., Pattathil, S., Slivkins, A., and Zhang, M. Autobidders with budget and ROI constraints: Effi-
ciency, regret, and pacing dynamics. In The Thirty Seventh Annual Conference on Learning Theory, June 30 -
July 3, 2023, Edmonton, Canada (2024), S. Agrawal and A. Roth, Eds., vol. 247 of Proceedings of Machine Learning
Research, PMLR, pp. 3642–3643.
[60] Milgrom, P. Auction research evolving: Theorems and market designs. American Economic Review 111, 5
(2021), 1383–1405.
[61] Mohri, M., and Munoz, A. Optimal regret minimization in posted-price auctions with strategic buyers. In
Advances in Neural Information Processing Systems (2014), pp. 1871–1879.
[62] Morgenstern, J., and Roughgarden, T. Learning simple auctions. In Conference on Learning Theory (2016),
PMLR, pp. 1298–1318.
[63] Nedelec, T., Calauzènes, C., Karoui, N. E., and Perchet, V. Learning in repeated auctions. Foundations and
Trends® in Machine Learning 15, 3 (2022), 176–334.
[64] Nekipelov, D., Syrgkanis, V., and Tardos, E. Econometrics for learning agents. In Proceedings of the sixteenth
acm conference on economics and computation (2015), pp. 1–18.
[65] Nisan, N. Serial monopoly on blockchains. arXiv preprint arXiv:2311.12731 (2023).
[66] Nisan, N., and Noti, G. An experimental evaluation of regret-based econometrics. In Proceedings of the 26th
International Conference on World Wide Web (2017), pp. 73–81.
[67] Nisan, N., and Noti, G. A "quantal regret" method for structural econometrics in repeated games. In Proceed-
ings of the 2017 ACM Conference on Economics and Computation (2017).

[68] Noti, G., and Syrgkanis, V. Bid prediction in repeated auctions with learning. In Proceedings of the Web
Conference 2021 (2021), pp. 3953–3964.
[69] Noti, G., and Syrgkanis, V. Bid prediction in repeated auctions with learning. In ACM Web Conference,
WebConf (2021), pp. 3953–3964.
[70] Perlroth, A., and Mehta, A. Auctions without commitment in the auto-bidding world. In Proceedings of the
ACM Web Conference 2023 (2023), pp. 3478–3488.
[71] Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Proba-
bility and Statistics. Wiley, 1994.
[72] Roughgarden, T. Algorithmic game theory. Communications of the ACM 53, 7 (2010), 78–86.
[73] Roughgarden, T. Transaction fee mechanism design. Journal of the ACM 71, 4 (2024), 1–25.
[74] Roughgarden, T., Syrgkanis, V., and Tardos, E. The price of anarchy in auctions. Journal of Artificial
Intelligence Research 59 (2017), 59–101.
[75] Roughgarden, T., and Wang, J. R. Minimizing regret with multiple reserves. ACM Transactions on Economics
and Computation (TEAC) 7, 3 (2019).
[76] Rubinstein, A., and Zhao, J. Strategizing against no-regret learners in first-price auctions. arXiv preprint
arXiv:2402.08637 (2024).
[77] Slivkins, A. Introduction to multi-armed bandits. CoRR abs/1904.07272 (2019).
[78] Slivkins, A., Sankararaman, K. A., and Foster, D. J. Contextual bandits with packing and covering con-
straints: A modular lagrangian approach via regression. In The Thirty Sixth Annual Conference on Learning
Theory, COLT 2023, 12-15 July 2023, Bangalore, India (2023), G. Neu and L. Rosasco, Eds., vol. 195 of Proceedings
of Machine Learning Research, PMLR, pp. 4633–4656.
[79] Stevenson, R. T., and Vavreck, L. Does campaign length matter? testing for cross-national effects. British
Journal of Political Science 30, 2 (2000), 217–235.
[80] Stradi, F. E., Germano, J., Genalti, G., Castiglioni, M., Marchesi, A., and Gatti, N. Online learning
in cmdps: Handling stochastic and adversarial constraints. In Forty-first International Conference on Machine
Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 (2024), OpenReview.net.
[81] Sun, H., Fan, M., and Tan, Y. An empirical analysis of seller advertising strategies in an online marketplace.
Information Systems Research 31, 1 (2020), 37–56.
[82] Sundararajan, M., and Talgam-Cohen, I. Prediction and welfare in ad auctions. Theory of Computing Systems
59 (2016), 664–682.
[83] Varian, H. R. Online ad auctions. American Economic Review 99, 2 (2009), 430–434.
[84] Weed, J., Perchet, V., and Rigollet, P. Online learning in repeated auctions. In Conference on Learning Theory
(2016), PMLR, pp. 1562–1583.

Appendix

A Warm-up Example: Further Details

Figure 2: Numerical evaluation of the competitive ratio of using a fixed bid distribution, shown as a function
of the budget per step 𝜌 in the example of uniform prices and square-root objective.

Calculation of the utility with a fixed bid. An upper bound on our bidder's utility is the scenario where we have the same number of wins and the winning times are equally spaced in intervals of $T/\sqrt{2BT}$, which gives an upper bound on the utility of $(2\rho)^{1/4}\,T$.
While perfect spacing is not attainable, we would like to see how far our bidder's utility is from this value. Suppose there are $k$ winning events, and the intervals between wins (including times zero and $T$ as interval ends) are $\ell_1, \dots, \ell_{k+1}$. The utility for our bidder is $u = \sum_{i=1}^{k+1}\sqrt{\ell_i}$. By Wald's equality, the expected utility equals the product of expectations:
$$\mathbb E[u] = \mathbb E\left[\sum_{i=1}^{k+1}\sqrt{\ell_i}\right] = \mathbb E[k+1]\,\mathbb E[\sqrt{\ell_i}].$$

The number of wins follows a binomial distribution with expectation $\mathbb E[k] = bT$ (to simplify, we henceforth omit one interval and look at $k$ rather than $k+1$). Except for the first and last intervals, the interval lengths are geometrically distributed; the expectation is approximately (up to an $o(1)$ error as $T\to\infty$; see below)
$$\mathbb E[\sqrt{\ell_i}] = \sum_{\ell=1}^{\infty}\sqrt{\ell}\; b(1-b)^{\ell-1} = \frac{b}{1-b}\,\mathrm{Li}_{-\frac12}(1-b),$$
where $\mathrm{Li}_s(x)$ is the polylogarithm function, defined as $\mathrm{Li}_s(x) = \sum_{n=1}^{\infty}\frac{x^n}{n^s}$. Thus, with $b = \sqrt{2\rho}$, the long-term expected utility is
$$\mathbb E[u] = \frac{\sqrt{2\rho}}{1-\sqrt{2\rho}}\cdot\sqrt{2\rho}\,T\cdot\mathrm{Li}_{-\frac12}\big(1-\sqrt{2\rho}\big) = \frac{2\rho\, T}{1-\sqrt{2\rho}}\cdot\mathrm{Li}_{-\frac12}\big(1-\sqrt{2\rho}\big).$$
The ratio between this expected utility and the optimal benchmark,
$$\frac{\mathbb E[u]}{(2\rho)^{1/4}\,T} = \frac{(2\rho)^{3/4}}{1-\sqrt{2\rho}}\cdot\mathrm{Li}_{-\frac12}\big(1-\sqrt{2\rho}\big),$$
is an increasing function of $\rho$ that is strictly less than one for all $\rho < 1/2$. By numerical evaluation (see Figure 2), in our case where $\rho < 1/4$, the fixed bidding strategy obtains an approximation ratio between $\approx 0.886$ and $\approx 0.973$ of the optimal utility.
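The numerical evaluation behind Figure 2 can be reproduced with a few lines of Python; the truncated series for $\mathrm{Li}_{-1/2}$ below is an approximation we introduce for illustration (any polylogarithm routine would do). For small $\rho$ the computed ratio should approach $\approx 0.886$ and near $\rho = 1/4$ it should approach $\approx 0.973$, matching the range reported above.

import numpy as np

def polylog_neg_half(x, terms=100_000):
    """Truncated series for Li_{-1/2}(x) = sum_{n>=1} sqrt(n) x^n, valid for 0 <= x < 1."""
    n = np.arange(1, terms + 1)
    return np.sum(np.sqrt(n) * x ** n)

def fixed_bid_ratio(rho):
    """Ratio E[u] / ((2 rho)^{1/4} T) of the fixed-bid utility to the benchmark."""
    b = np.sqrt(2 * rho)
    return (2 * rho) ** 0.75 / (1 - b) * polylog_neg_half(1 - b)

for rho in [0.05, 0.1, 0.2, 0.249]:
    print(rho, round(fixed_bid_ratio(rho), 4))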
Error due to infinite horizon summation. The error term arises from the fact that with a finite time horizon $T$, the interval lengths $\ell$ cannot exceed $T$. However, the error from not truncating the sum vanishes as $T$ becomes large: the expectation of $\sqrt{\ell}$ is
$$\mathbb E[\sqrt{\ell_i}] = z\left(\sum_{\ell=1}^{\infty}\sqrt{\ell}\; b(1-b)^{\ell-1} - \sum_{\ell=T+1}^{\infty}\sqrt{\ell}\; b(1-b)^{\ell-1}\right),$$
where $z = \frac{1}{1-(1-b)^T} = 1-o(1)$ is the normalization constant of the truncated geometric distribution. The last term is bounded by
$$\sum_{\ell=T+1}^{\infty}\sqrt{\ell}\; b(1-b)^{\ell-1} < \frac{b}{1-b}\sum_{\ell=T+1}^{\infty}\ell\,(1-b)^{\ell} = \frac{b}{1-b}\cdot\frac{(1-b)^{T+1}(bT+1)}{b^2} = \frac{1}{b}\,(1-b)^T(bT+1)\;\xrightarrow{\;T\to\infty\;}\;0.$$
B State-Independent Strategies

B.1 Utility Guarantee for State Independent Bidding

In our example in the introduction (details in Appendix A), we saw that the static bidding policy of using
a constant bid attained a constant factor approximation to the utility of an optimal policy. The following
theorem shows that the example is, in fact, an instance of a more general result.

Theorem B.1. There exists a state-independent static policy that depends only on the context and achieves a $\big(1-\frac1e\big)$ fraction of the value of the optimal policy as $T\to\infty$.

To prove this theorem, we need the following key Lemma.

Lemma B.2 (A Reverse Jensen Inequality for Geometric Random Variables). If $r:\mathbb N\to[0,\infty)$ is a non-decreasing concave function, $X$ is a geometric random variable supported on the positive integers, and $Y$ is an integer-valued random variable satisfying $\mathbb E[Y] = \mathbb E[X]$, then
$$\mathbb E[r(X)] \ge \left(1-\tfrac1e\right)\mathbb E[r(Y)]. \tag{11}$$

The lemma can be thought of as a reverse version of Jensen's inequality, and it applies generally to geometric random variables. In our context, this result implies that for any concave reward function and any distribution of prices, there exists a static bidding policy that achieves at least $1-\frac1e-o(1)$ times the value of the optimal policy.

Proof of Lemma B.2. Consider the sequence of non-decreasing concave functions $r_1, r_2, \dots$ defined by

𝑟𝑚 (𝑛) = min{𝑚, 𝑛}.

Also, let 𝑟 ∞ (𝑛) = 𝑛. The proof of the lemma begins by showing that 𝑟 is equal to a non-negative weighted
sum of functions in the set ℛ = {𝑟 1, 𝑟 2, . . . , } ∪ {𝑟 ∞ }. The proof concludes by showing that each of the
functions in ℛ satisfies inequality (11).

Let $w_1 = r(1)$ and, for each integer $n > 1$, let $w_n = r(n) - r(n-1)$. Since $r$ is concave and non-decreasing, the sequence $w_1, w_2, \dots$ is non-negative and non-increasing, hence it converges to its greatest lower bound
$$v_\infty = \inf\{w_n : n > 0\} = \lim_{n\to\infty} w_n \ge 0.$$
For each positive integer $n$ let $v_n = w_n - w_{n+1} \ge 0$. We have $w_n = v_\infty + \sum_{m\ge n} v_m$ and
$$r(n) = \sum_{k=1}^{n} w_k = \sum_{k=1}^{n}\left(v_\infty + \sum_{m\ge k} v_m\right) = v_\infty\cdot n + \sum_{m=1}^{\infty} v_m\cdot\min\{m,n\} = \sum_{m\in\mathbb N\cup\{\infty\}} v_m\, r_m(n). \tag{12}$$

Since the coefficients 𝑣𝑚 (1 ≤ 𝑚 ≤ ∞) are all non-negative, Equation (12) shows that 𝑟 is equal to a
non-negative sum of functions in ℛ, as claimed. Now, by linearity of expectation,
 ∑︁  ∑︁
 
E[𝑟 (𝑋 )] = E  𝑣𝑚 𝑟𝑚 (𝑋 )  = 𝑣𝑚 E[𝑟𝑚 (𝑋 )]
𝑚∈N∪{∞}  𝑚∈N∪{∞}
 
 ∑︁  ∑︁
 
E[𝑟 (𝑌 )] = E  𝑣𝑚 𝑟𝑚 (𝑌 )  = 𝑣𝑚 E[𝑟𝑚 (𝑌 )].
𝑚∈N∪{∞}  𝑚∈N∪{∞}
 
If we can show that
$$\forall m\in\mathbb N\cup\{\infty\}\qquad \mathbb E[r_m(X)] \ge \left(1-\tfrac1e\right)\mathbb E[r_m(Y)] \tag{13}$$
then the lemma follows by taking a weighted sum of inequalities of the form (13), with the $m$-th inequality weighted by $v_m$. Let $\mu = \mathbb E[X] = \mathbb E[Y]$. By Jensen's inequality, we have $r_m(\mu) \ge \mathbb E[r_m(Y)]$, so inequality (13) will follow if we can prove
$$\forall m\in\mathbb N\cup\{\infty\}\qquad \mathbb E[r_m(X)] \ge \left(1-\tfrac1e\right) r_m(\mu). \tag{14}$$

When $m = \infty$, the function $r_m$ is the identity function, so $\mathbb E[r_m(X)] = r_m(\mathbb E[X])$, establishing the $m=\infty$ case of (14). When $m < \infty$, we can calculate $\mathbb E[r_m(X)]$ by setting $p = 1 - \frac1\mu$ and observing that for all $n \ge 1$ we have $\mathbb P[X \ge n] = p^{n-1}$. Hence,
$$\mathbb E[r_m(X)] = \mathbb E[\min\{m, X\}] = \sum_{n=1}^{m}\mathbb P\big[\min\{m,X\} \ge n\big] = 1 + p + \cdots + p^{m-1} = \frac{1-p^m}{1-p} = (1-p^m)\,\mu.$$

The function $(1-p^x)\mu$ is an increasing concave function of $x$, so it satisfies
$$(1-p^x)\,\mu \ge \begin{cases}(1-p^\mu)\,x & \text{if } 0\le x\le\mu\\ (1-p^\mu)\,\mu & \text{if } x > \mu.\end{cases} \tag{15}$$

Recalling that $p = 1-\frac1\mu < e^{-1/\mu}$, we see that $p^\mu < e^{-1}$, so $1-p^\mu > 1-\frac1e$. Now, when $0 \le m \le \mu$,
$$\mathbb E[r_m(X)] = (1-p^m)\,\mu \ge (1-p^\mu)\,m > \left(1-\tfrac1e\right)m = \left(1-\tfrac1e\right)r_m(\mu).$$
When $\mu < m < \infty$,
$$\mathbb E[r_m(X)] = (1-p^m)\,\mu \ge (1-p^\mu)\,\mu > \left(1-\tfrac1e\right)\mu = \left(1-\tfrac1e\right)r_m(\mu).$$

We have established inequality (14) for all 𝑚 ∈ N ∪ {∞}, which completes the proof. ■

We now proceed to prove Theorem B.1.

Proof of Theorem B.1. We now apply the reverse Jensen inequality to show that there is always a static (potentially randomized) bidding policy whose expected value is at least $1-\frac1e$ times the value of the optimal dynamic bidding policy.
Consider the optimal dynamic policy for time horizon 𝑇 . Let 𝑤 𝑦𝑡 denote the probability that this policy
wins at time 𝑡 and that the preceding win is at time 𝑡 − 𝑦. The overall probability of winning at time 𝑡 will
be denoted by $w_t = \sum_{y=1}^{t} w_{yt}$, and we will denote the policy's expected number of wins by $w_{\mathrm{tot}} = \sum_{t=1}^{T} w_t$. Let $Y$ be a random variable supported on the positive integers, whose distribution is given by
$$\mathbb P\big[Y = y\big] = \frac{\sum_{t=1}^{T} w_{yt}}{w_{\mathrm{tot}}}.$$
The expected value of running the optimal policy is
$$\sum_{t=1}^{T}\sum_{y=1}^{t} w_{yt}\, r(y) = \mathbb E[r(Y)]\cdot w_{\mathrm{tot}}. \tag{16}$$

A useful observation about the expected value of 𝑌 is the following. Let 𝜏 denote the final time that the
policy wins in the time interval [1,𝑇 ], or 𝜏 = 0 for a sample path on which the policy never wins. We have
$$\begin{aligned}
\mathbb E[\tau] &= \sum_{s=1}^{T}\mathbb P\big[\tau\ge s\big]\\
&= \sum_{s=1}^{T}\sum_{t=s}^{T}\mathbb P\big[\text{first win in }[s,t]\text{ occurs at }t\big]\\
&= \sum_{s=1}^{T}\sum_{t=s}^{T}\sum_{y=t-s+1}^{t} w_{yt}\\
&= \sum_{t=1}^{T}\sum_{y=1}^{t}\sum_{s=t-y+1}^{t} w_{yt}\\
&= \sum_{y=1}^{T}\sum_{t=y}^{T} y\, w_{yt}\\
&= \mathbb E[Y]\cdot w_{\mathrm{tot}}.
\end{aligned}$$

Now consider a static bidding policy defined as follows. Let $\bar w = w_{\mathrm{tot}}/T$ and find the distribution over bids $b$ that minimizes $\mathbb E[P(b)]$ subject to the constraint $\mathbb E[W(b)] = \bar w$. If $b'$ is the empirical bid distribution of the optimal policy, then $\mathbb E[W(b')] = \bar w$, while $\mathbb E[P(b')] \le \rho$ because the policy obeys the budget constraint. Hence, by the definition of $b$, we must have $\mathbb E[P(b)] \le \rho$. If one runs the static policy with bid distribution $b$, the expected number of wins⁷ is $w_{\mathrm{tot}}$ and the spacing between wins is a geometric random variable, $X$. If $\tau'$ denotes the final time that the static bidding policy wins in the interval $[1,T]$ then, as above, $\mathbb E[\tau'] = \mathbb E[X]\cdot w_{\mathrm{tot}}$, and since both $\tau$ and $\tau'$ should be $T - o(T)$, this implies $\mathbb E[X] = (1\pm o(1))\cdot\mathbb E[Y]$. Applying Lemma B.2, we have $\mathbb E[r(X)] \ge \big(1-\frac1e - o(1)\big)\,\mathbb E[r(Y)]$. Finally, since the expected values of the static policy and the optimal policy are given, respectively, by $\mathbb E[r(X)]\cdot w_{\mathrm{tot}}$ and $\mathbb E[r(Y)]\cdot w_{\mathrm{tot}}$, we conclude that the static policy obtains at least $1-\frac1e - o(1)$ times the value of the optimal policy. ■
⁷ Actually, this overestimates the expected number of wins, but only slightly. The issue is that the static policy obeys the budget constraint in expectation, but it may run out of budget before $T$ and, for this reason, the probability of winning in the final $o(T)$ rounds will be strictly less than $\bar w$.

B.2 State-Independent Strategies Incur Linear Regret

We show an example demonstrating that a state-independent bidding strategy cannot attain a perfect approximation ratio of 1, and so must have linear regret. We show that this can happen even in the simpler case without contexts and when the conversion rate is always 1. Consider a uniform price distribution $p_t\sim U[0,1]$ and the concave reward function $r(\ell) = \min\{\ell, 2\}$. The optimal state-independent policy is (similar to what we saw in our warm-up example) to bid $\sqrt{2\rho}$ every round, since that depletes the budget in expectation. This leads to winning with probability $\sqrt{2\rho}$ every round.

Let $L$ be the geometric random variable that denotes the time between wins with conversions when bidding $\sqrt{2\rho}$ every round. Specifically, we have $\mathbb P[L = \ell] = \sqrt{2\rho}\,(1-\sqrt{2\rho})^{\ell-1}$ for every $\ell\in\mathbb N$. The expected time-average reward of our state-independent policy is the ratio of the expected reward of a win divided by the expected time between wins (see Definition C.1 for a formal statement of this result):
$$\frac{\mathbb E[r(L)]}{\mathbb E[L]} = \frac{1\cdot\sqrt{2\rho} + 2\cdot\big(1-\sqrt{2\rho}\big)}{1/\sqrt{2\rho}} = \big(2-\sqrt{2\rho}\big)\sqrt{2\rho}. \tag{17}$$

Since the reward function has only two possible values, $r(1)$ and $r(2)$, for every value of $\rho$ we can find the optimal pair of bids to use for $\ell = 1$ and $\ell \ge 2$, and programmatically search for the value of $\rho$ that maximizes the utility gap between our fixed bidding strategy and the optimal strategy. This turns out to be $\rho = \sqrt3 - \frac32 \approx 0.23$, in which case the optimal state-dependent policy obtains $\approx 1.11$ times the utility of the state-independent policy. We present the details next.
√ √
Fix $\rho = \sqrt3 - \frac32$ and consider the strategy that bids $b_1 = 2-\sqrt3 \approx 0.27$ when $\ell = 1$ and $b_2 = 1$ when $\ell \ge 2$. Now consider the random variable $L'$ that denotes the time between wins of this strategy. Specifically, we have $\mathbb P[L' = 1] = b_1 = 2-\sqrt3$ and $\mathbb P[L' = 2] = 1-b_1 = \sqrt3-1$. First, note that $\mathbb E[L'] = \sqrt3$.

Now, we show that this policy does not run out of budget in expectation (this results in only $o(T)$ error with high probability when always adhering to the budget constraint, which does not affect the approximation ratio bound we want to show). To calculate the expected time-average spending of this policy, we first calculate the expected payment of a conversion:
$$\frac{b_1}{2}\cdot b_1 + \frac{b_2}{2}\cdot(1-b_1) = 3 - \frac32\sqrt3.$$
The expected time-average payment of this policy is the ratio of the expected payment between conversions and the expected time between conversions (again see Definition C.1), which is $\frac{3-\frac32\sqrt3}{\sqrt3} = \sqrt3-\frac32 = \rho$. The expected time-average reward of this policy is (similar to (17)):
$$\frac{\mathbb E[r(L')]}{\mathbb E[L']} = \frac{1\cdot b_1 + 2\cdot(1-b_1)}{\sqrt3} = 1.$$
This reward is $\frac{1}{3 + 2\sqrt{2\sqrt3-3} - 2\sqrt3} \approx 1.11$ times higher than the utility of the state-independent strategy, as promised.
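The arithmetic in this example is easy to double-check numerically; the following sketch (ours, for verification only) reproduces the budget feasibility, the unit reward, and the $\approx 1.11$ gap.

import numpy as np

rho = np.sqrt(3) - 1.5                        # the budget per round used in the example
b = np.sqrt(2 * rho)                          # fixed bid of the state-independent policy
state_independent = (2 - b) * b               # its expected time-average reward, Eq. (17)

b1, b2 = 2 - np.sqrt(3), 1.0                  # state-dependent bids for l = 1 and l >= 2
avg_payment = (b1 / 2 * b1 + b2 / 2 * (1 - b1)) / np.sqrt(3)   # expected payment / E[L']
avg_reward = (1 * b1 + 2 * (1 - b1)) / np.sqrt(3)              # expected reward / E[L']

print(round(avg_payment - rho, 10))           # ~0: the policy spends exactly rho per round
print(round(avg_reward, 10))                  # = 1
print(round(avg_reward / state_independent, 3))   # ~1.113: the utility gap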

C Deferred Proofs of Section 4


In this section, we complete the proofs of the results in Section 4.

C.1 Proof of Lemma 4.1

We first restate the lemma.

Lemma 4.1. For every budget per round 𝜌, reward function 𝑟 (·), distribution of prices P, and finite horizon
𝑇 , it holds that OPT𝑇inf ≥ OPTALG .

Proof. Fix the strategy of the optimal algorithm that achieves $\mathrm{OPT}^{\mathrm{ALG}}$; this depends on (i) the current round $t\in[T]$, (ii) the total payment by that round, and (iii) the time since the last win, $\ell_t$. Now consider the infinite-
horizon setting, where we follow a strategy inspired by the optimal algorithm in the finite-horizon setting.
Fix a round ℎ and let 𝑛 ∈ N ∪ {0} and 𝑡 ∈ [𝑇 ] such that ℎ = 𝑇 𝑛 + 𝑡. In that round ℎ, we consider the bid that
would have been used in OPTALG if that algorithm was started in round 𝑇 𝑛 + 1. This means that in round
ℎ = 𝑇𝑛 + 𝑡, we consider that the algorithm’s current round is 𝑡 and its total payment is the total payment
in rounds 𝑇𝑛 + 1,𝑇 𝑛 + 2, . . . ,𝑇 𝑛 + 𝑡 − 1. What is less obvious to define is the number of rounds since the
last winning event. If a winning event had happened in some round 𝑇 𝑛 + 𝑡 ′ (where 𝑡 ′ ≥ 1), we consider
ℓ = 𝑡 − 𝑡 ′ . However, if no winning event has happened in rounds 𝑇 𝑛 + 1, . . . ,𝑇 𝑛 + 𝑡 − 1, we consider ℓ = 𝑡.
Note that the actual number of rounds since the last winning event can only be bigger.
We now calculate the reward and spending levels of the above bidding strategy in the infinite-horizon
setting. By every round ℎ = 𝑇 𝑛 + 𝑡, the expected reward is at least 𝑇 𝑛 · OPTALG , since in rounds [𝑇 𝑛] we
have collected at least this much reward in expectation; note that here we use the fact that the actual time since the last winning event is at least the one considered by each run of $\mathrm{OPT}^{\mathrm{ALG}}$. Given that $Tn \ge h-T$, the time-averaged expected reward as $h\to\infty$ is at least $\frac{h-T}{h}\,\mathrm{OPT}^{\mathrm{ALG}} \to \mathrm{OPT}^{\mathrm{ALG}}$.

Now we verify that we satisfy the budget constraint. By round $h = Tn+t$, with probability 1 the total spending is at most $T(n+1)\rho$, since in each interval of $T$ rounds $\mathrm{OPT}^{\mathrm{ALG}}$ spends at most $T\rho$. Since $T(n+1) \le h+T$, we have that the average payment as $h\to\infty$ is at most $\frac{h+T}{h}\rho \to \rho$. This completes the proof. ■

C.2 Useful Equalities about (5)

We now make some very useful definitions. We will use these throughout the proofs of our results.

Definition C.1. For any $m\in\mathbb N$ and $\vec W\in[0,1]^m$, we define the following quantities. For ease of notation, we define $W_\ell = W_m$ for $\ell > m$. The stated equalities are proven next.
• $L(\vec W)$ is the expected time between wins with conversions, i.e., the expected return time to state 1. Formally,
$$L(\vec W) = \sum_{\ell=1}^{\infty} \ell\, W_\ell \prod_{i=1}^{\ell-1}(1-W_i) = \sum_{\ell=1}^{\infty}\prod_{i=1}^{\ell-1}(1-W_i)$$
• $R^{\mathrm{conv}}_m(\vec W)$ is the expected reward of a single win with a conversion, starting from state 1. Formally,
$$R^{\mathrm{conv}}_m(\vec W) = \sum_{\ell=1}^{\infty} r_m(\ell)\,W_\ell\prod_{i=1}^{\ell-1}(1-W_i) = \sum_{\ell=1}^{m}\big(r_m(\ell)-r_m(\ell-1)\big)\prod_{i=1}^{\ell-1}(1-W_i) = \sum_{\ell=1}^{m-1} r(\ell)\,W_\ell\prod_{i=1}^{\ell-1}(1-W_i) + r(m)\prod_{i=1}^{m-1}(1-W_i)$$
When $m$ is clear from the context, we omit the $m$ subscript.
• $R_m(\vec W)$ is the expected average reward, i.e., the optimization objective of Optimization Problem (5). Formally,
$$R_m(\vec W) = \sum_{\ell=1}^{m} r(\ell)\,W_\ell\,\pi_\ell(\vec W) = \frac{R^{\mathrm{conv}}_m(\vec W)}{L(\vec W)}$$
When $m$ is clear from the context, we omit the $m$ subscript.
• $C^{\mathrm{conv}}(\vec W)$ is the expected payment until the first win with a conversion, starting from state 1. Formally,
$$C^{\mathrm{conv}}(\vec W) = \sum_{\ell=1}^{\infty} P(W_\ell)\prod_{i=1}^{\ell-1}(1-W_i) = \sum_{\ell=1}^{m-1} P(W_\ell)\prod_{i=1}^{\ell-1}(1-W_i) + \frac{P(W_m)}{W_m}\prod_{i=1}^{m-1}(1-W_i)$$
• $C(\vec W)$ is the expected average payment, i.e., the term in the inequality constraint of Optimization Problem (5). Formally,
$$C(\vec W) = \sum_{\ell=1}^{m} P(W_\ell)\,\pi_\ell(\vec W) = \frac{C^{\mathrm{conv}}(\vec W)}{L(\vec W)}$$
• $\mathrm{Reach}_\ell(\vec W)$ is the probability of not getting a win with a conversion for at least $\ell-1$ consecutive rounds starting from state 1, i.e., the probability of reaching state $\ell$. Formally, for any $\ell\in\mathbb N$,
$$\mathrm{Reach}_\ell(\vec W) = \prod_{i=1}^{\ell-1}\big(1-W_i\big)$$
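The closed forms above are straightforward to evaluate numerically. The sketch below is our own illustrative helper (with a second-price/uniform-price payment function as an assumption) that computes $L(\vec W)$, $R(\vec W)$, and $C(\vec W)$ for a given vector $\vec W$ with $W_m > 0$.

import numpy as np

def mdp_quantities(W, r, P):
    """Quantities from Definition C.1 for a winning-probability vector W.

    W[l-1] is the probability of a win with conversion in state l (W[m-1] must be > 0),
    r(l) is the reward of a conversion in state l, and P(w) is the expected payment of
    a single round whose winning probability is w.
    """
    m = len(W)
    reach = np.concatenate(([1.0], np.cumprod(1.0 - W[:-1])))   # Reach_l = prod_{i<l} (1 - W_i)
    L = reach[:-1].sum() + reach[-1] / W[-1]                    # expected return time to state 1
    rs = np.array([r(l) for l in range(1, m + 1)])
    R_conv = (rs[:-1] * W[:-1] * reach[:-1]).sum() + rs[-1] * reach[-1]
    Ps = np.array([P(w) for w in W])
    C_conv = (Ps[:-1] * reach[:-1]).sum() + Ps[-1] / W[-1] * reach[-1]
    return R_conv / L, C_conv / L, L                            # (R(W), C(W), L(W))

# Example with uniform prices in [0,1] (second price => P(w) = w^2 / 2) and sqrt reward.
W = np.array([0.1, 0.3, 0.6, 1.0])
R, C, L = mdp_quantities(W, r=lambda l: np.sqrt(l), P=lambda w: w ** 2 / 2)
print(round(R, 4), round(C, 4), round(L, 4))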

For the rest of this subsection we prove the above equalities. We fix an 𝑚 ∈ N and 𝑊® , and often drop the
(𝑊® ) notation from all the quantities. Recall that we define 𝑊ℓ = 𝑊𝑚 for ℓ > 𝑚.

Claim C.2 (Return time to 1). It holds that
$$L(\vec W) = \sum_{\ell=1}^{\infty} \ell\, W_{\min\{\ell,m\}}\,\mathrm{Reach}_\ell(\vec W) = \sum_{\ell=1}^{m-1} \ell\, W_\ell\,\mathrm{Reach}_\ell(\vec W) + \left(m-1+\frac{1}{W_m}\right)\mathrm{Reach}_m(\vec W) = \sum_{\ell=1}^{m-1} \mathrm{Reach}_\ell(\vec W) + \frac{1}{W_m}\,\mathrm{Reach}_m(\vec W)$$

Proof. The first equality holds by the definition of $L$. The second equality follows from the first one by noticing that
$$\sum_{\ell=m}^{\infty} \ell\, W_m\,\mathrm{Reach}_\ell = W_m\,\mathrm{Reach}_m\sum_{\ell=m}^{\infty} \ell\,(1-W_m)^{\ell-m} = W_m\,\mathrm{Reach}_m\,\frac{1-W_m+mW_m}{W_m^2}.$$
The third equality follows from the second one by noticing that for every $\ell\in[2,m]$
$$(\ell-1)\,\mathrm{Reach}_\ell = (\ell-1)(1-W_{\ell-1})\,\mathrm{Reach}_{\ell-1} = \mathrm{Reach}_{\ell-1} + (\ell-2)\,\mathrm{Reach}_{\ell-1} - (\ell-1)W_{\ell-1}\,\mathrm{Reach}_{\ell-1},$$
which, if applied recursively, proves that for every $\ell\in[2,m]$
$$(\ell-1)\,\mathrm{Reach}_\ell = \sum_{i=1}^{\ell-1}\mathrm{Reach}_i - \sum_{i=1}^{\ell-1} i\,W_i\,\mathrm{Reach}_i.$$
Applying the above for $\ell = m$ to the second equality of Claim C.2, we get the third equality. ■

Claim C.3 (Stationary distribution). It holds that
$$\ell < m:\quad \pi_\ell(\vec W) = \frac{1}{L(\vec W)}\,\mathrm{Reach}_\ell(\vec W), \qquad\qquad \pi_m(\vec W) = \frac{1}{L(\vec W)}\sum_{i=m}^{\infty}\mathrm{Reach}_i(\vec W) = \frac{1}{L(\vec W)}\,\frac{1}{W_m}\,\mathrm{Reach}_m(\vec W)$$

Proof. The second equality for $\pi_m$ holds because for $i \ge m$, $\mathrm{Reach}_i = \mathrm{Reach}_m(1-W_m)^{i-m}$.

We need to prove that the above satisfies $m$ out of the $m+1$ equalities of (6) (since that system is over-defined). It holds that $\sum_\ell \pi_\ell = 1$, by the third equality of Claim C.2. Since $\mathrm{Reach}_\ell = \mathrm{Reach}_{\ell-1}(1-W_{\ell-1})$, the above satisfies $\pi_\ell = (1-W_{\ell-1})\pi_{\ell-1}$ for $2 \le \ell \le m-1$, as needed in (6). The fact that $\mathrm{Reach}_m = \mathrm{Reach}_{m-1}(1-W_{m-1})$ proves that
$$\pi_m = (1-W_{m-1})\pi_{m-1} + (1-W_m)\pi_m \iff W_m\pi_m = (1-W_{m-1})\pi_{m-1}$$

Claim C.4. It holds that
$$R(\vec W) = \frac{1}{L}\left(\sum_{\ell=1}^{m-1} r(\ell)\,W_\ell\,\mathrm{Reach}_\ell(\vec W) + r(m)\,\mathrm{Reach}_m(\vec W)\right) = \sum_{\ell=1}^{m-1}\big(r(\ell)-r(\ell-1)\big)\pi_\ell(\vec W) + \big(r(m)-r(m-1)\big)W_m\,\pi_m(\vec W)$$
$$C(\vec W) = \frac{1}{L}\left(\sum_{\ell=1}^{m-1} P(W_\ell)\,\mathrm{Reach}_\ell(\vec W) + \frac{P(W_m)}{W_m}\,\mathrm{Reach}_m(\vec W)\right)$$
Proof. The first equality of each quantity follows from the definitions of $R$ and $C$ and Claim C.3. For the second equality of $R$ we have that
$$R = \sum_{\ell=1}^{m} r(\ell)\,W_\ell\,\pi_\ell = \sum_{\ell=1}^{m-2} r(\ell)\,(\pi_\ell - \pi_{\ell+1}) + r(m-1)\,(\pi_{m-1} - W_m\pi_m) + r(m)\,W_m\pi_m = \sum_{\ell=1}^{m-1}\big(r(\ell)-r(\ell-1)\big)\pi_\ell + \big(r(m)-r(m-1)\big)W_m\pi_m,$$
where the second equality uses $\pi_{\ell+1} = (1-W_\ell)\pi_\ell$ for $\ell \le m-2$ and $\pi_m = (1-W_{m-1})\pi_{m-1} + (1-W_m)\pi_m$, and the third uses $r(0) = 0$. ■

Corollary C.5. The equalities about 𝑅 conv and 𝐶 conv in Definition C.1 follow by combining their definition
with Claims C.3 and C.4.

C.3 Proof of Lemma 4.5

We first restate the lemma.

Lemma 4.5. For every $\ell\in[m]$ it holds that
$$h^*_{\ell+1}(\lambda) \le h^*_\ell(\lambda) + r\big(\min\{\ell+1,m\}\big) - r(\ell). \tag{8}$$

Proof. We fix a 𝜆 and drop (𝜆) from all notation. We will prove the lemma using induction on the state ℓ,
starting from state 𝑚.

For $\ell = m$, the claim is trivial because $h^*_{m+1} = h^*_m$: (8) becomes $r(m) \ge r(m)$.
Now fix $\ell\in[m-1]$. Using the Bellman optimality condition (7) we have
$$\begin{aligned}
h^*_{\ell+1} + g^* &= \max_W\Big[W r(\ell+1) - \lambda P(W) + (1-W)h^*_{\ell+2}\Big]\\
&\le \max_W\Big[W r(\ell+1) - \lambda P(W) + (1-W)\big(h^*_{\ell+1} + r(\ell+2) - r(\ell+1)\big)\Big] && \text{(inductive hypothesis, (8))}\\
&\le \max_W\Big[W r(\ell+1) - \lambda P(W) + (1-W)\big(h^*_{\ell+1} + r(\ell+1) - r(\ell)\big)\Big] && \big(r(\ell+2)-r(\ell+1)\le r(\ell+1)-r(\ell)\big)\\
&= \max_W\Big[W r(\ell) - \lambda P(W) + (1-W)h^*_{\ell+1}\Big] + r(\ell+1) - r(\ell)\\
&= h^*_\ell + g^* + r(\ell+1) - r(\ell), && \text{(using (7))}
\end{aligned}$$
which proves the lemma. ■

C.4 Proving that the Optimal Constrained Winning Probabilities are Increasing

First, we complete the proof of Lemma 4.6.

Lemma 4.6. Fix any optimal solution $\vec W^*(\lambda)$ and $\ell \in [m-1]$. If $r(\ell+1) - r(\ell) > r(\ell+2) - r(\ell+1)$, then $W^*_\ell(\lambda) \le W^*_{\ell+1}(\lambda)$.

Proof. We fix a $\lambda$ and drop $(\lambda)$ from all notation. Fix $\ell\in[m-1]$. By the Bellman optimality, we have that
$$W^*_\ell\, r(\ell) - \lambda P(W^*_\ell) + (1-W^*_\ell)\,h^*_{\ell+1} \;\ge\; W^*_{\ell+1}\, r(\ell) - \lambda P(W^*_{\ell+1}) + (1-W^*_{\ell+1})\,h^*_{\ell+1}$$
and that
$$W^*_{\ell+1}\, r(\ell+1) - \lambda P(W^*_{\ell+1}) + (1-W^*_{\ell+1})\,h^*_{\ell+2} \;\ge\; W^*_\ell\, r(\ell+1) - \lambda P(W^*_\ell) + (1-W^*_\ell)\,h^*_{\ell+2}.$$
Adding the above two inequalities and rearranging, we get
$$\Big(h^*_{\ell+1} - h^*_{\ell+2} + r(\ell+1) - r(\ell)\Big)\Big(W^*_{\ell+1} - W^*_\ell\Big) \ge 0.$$
We proceed to prove that the term in the left parentheses is strictly positive, which does imply $W^*_{\ell+1} \ge W^*_\ell$. Using Lemma 4.5 we get
$$h^*_{\ell+1} - h^*_{\ell+2} + r(\ell+1) - r(\ell) \ge r(\ell+1) - r(\ell+2) + r(\ell+1) - r(\ell) > 0,$$
where the strict inequality holds by strict concavity of $r$: $r(\ell+1) - r(\ell) > r(\ell+2) - r(\ell+1)$. ■

34
Given that any optimal solution 𝑊® ∗ (𝜆) is non-decreasing for any 𝜆, we want to prove that the optimal
solution of the constrained problem, 𝑊® ∗ , is also non-decreasing. We get this as a corollary of Lemma 4.6,
since the Optimization problem (5) can be re-written as a Linear Program, making the 𝑊® ∗ optimal for the
Lagrangian problem of some multiplier 𝜆. We re-state this result and prove it for completeness.

Corollary 4.7. Assume 𝑟 (·) is strictly concave. Then any optimal solution of the optimization problem (5) is
∗ for all ℓ ∈ [𝑚 − 1].
weakly increasing, i.e., 𝑊ℓ∗ ≤ 𝑊ℓ+1

The proof will follow from the following lemmas, which we present next. Recall the notation of Defi-
nition C.1: 𝑅(𝑊® ) and 𝐶 (𝑊® ) are the expected average-time reward and payment, respectively, under the
vector of winning probabilities 𝑊® . This means that for any 𝑊® , the objective of 𝑊® for the Lagrange problem
with multiplier 𝜆 is 𝑅(𝑊® ) − 𝜆𝐶 (𝑊® ).
We start with a simple and intuitive lemma: the average payment of solution 𝑊® ∗ (𝜆) cannot increase as 𝜆
gets larger.

Lemma C.6. The function 𝐶 (𝑊® ∗ (𝜆)) is non-increasing in 𝜆.

Proof. Fix 𝜆 ≥ 0 and 𝜀 > 0. We will prove that 𝐶 (𝑊® ∗ (𝜆)) ≤ 𝐶 (𝑊® ∗ (𝜆 + 𝜀)). First we use the fact that 𝑊® ∗ (𝜆)
is optimal for the Lagrangian problem with multiplier 𝜆. Specifically, we use that its objective value is
larger than the one of 𝑊® ∗ (𝜆 + 𝜀):

𝑅(𝑊® ∗ (𝜆)) − 𝜆𝐶 (𝑊® ∗ (𝜆)) ≥ 𝑅(𝑊® ∗ (𝜆 + 𝜀)) − 𝜆𝐶 (𝑊® ∗ (𝜆 + 𝜀))

Similarly, we use the optimality of 𝑊® ∗ (𝜆 + 𝜀) when the Lagrange multiplier is 𝜆 + 𝜀:

𝑅(𝑊® ∗ (𝜆 + 𝜀)) − (𝜆 + 𝜀)𝐶 (𝑊® ∗ (𝜆 + 𝜀)) ≥ 𝑅(𝑊® ∗ (𝜆)) − (𝜆 + 𝜀)𝐶 (𝑊® ∗ (𝜆))

Adding the above two inequalities, we get

𝜀𝐶 (𝑊® ∗ (𝜆)) ≥ 𝜀𝐶 (𝑊® ∗ (𝜆 + 𝜀))

which proves the lemma since 𝜀 > 0. ■

Given the above lemma we can define 𝜆0 such that:


   
∀𝜆 > 𝜆0 : 𝐶 𝑊® ∗ (𝜆) ≤ 𝜌 and ∀𝜆 < 𝜆0 : 𝐶 𝑊® ∗ (𝜆) ≥ 𝜌

Given the above we have the following simple observation.

Lemma C.7. If 𝐶 (𝑊®𝜆∗0 ) = 𝜌 then 𝑊® ∗ is an optimal solution for the Lagrangian problem with 𝜆 = 𝜆0 .

Proof. This is easy to prove since if 𝑅(𝑊® ∗ ) < 𝑅(𝑊®𝜆∗0 ), 𝑊® ∗ wouldn’t be optimal for the constrained problem.

 
Lemma C.8. Assume that 𝐶 𝑊® ∗ (𝜆0 ) ≠ 𝜌, implying that
   
∀𝜆 > 𝜆0 : 𝐶 𝑊® ∗ (𝜆) < 𝜌 and ∀𝜆 < 𝜆0 : 𝐶 𝑊® (𝜆) > 𝜌

Then the 𝑊® ∗ is an optimal solution for the Lagrangian problem with 𝜆 = 𝜆0 .

35
Proof. Fix 𝜀 > 0 such that 𝜆0 − 𝜀 ≥ 0. Consider the vectors 𝑊® ∗ (𝜆0 − 𝜀) and 𝑊® ∗ (𝜆0 + 𝜀). We create the vector
𝑊® ′ by mixing these two solutions. Specifically, let

𝜌 − 𝐶 (𝑊® ∗ (𝜆0 + 𝜀))


𝛼=
𝐶 (𝑊® ∗ (𝜆0 − 𝜀)) − 𝐶 (𝑊® ∗ (𝜆0 + 𝜀))

Note that 𝛼 ∈ (0, 1) and is well defined since 𝐶 (𝑊® ∗ (𝜆0 − 𝜀)) > 𝜌 > 𝐶 (𝑊® ∗ (𝜆0 + 𝜀)). We define 𝑊 ′ by
mixing the occupancy measures of 𝑊® ∗ (𝜆0 − 𝜀) and 𝑊® ∗ (𝜆0 + 𝜀). The occupancy measure of a vector 𝑊® is a
probability distribution over [𝑚] × [0, 1]. Specifically, if 𝑞 is the occupancy measure of 𝑊® then

𝑞(ℓ, 𝑤) = 𝜋ℓ (𝑊® )𝛿 (𝑤 − 𝑊ℓ )
∫1
where 𝛿 (𝑥) is the Dirac delta function. In this case we can write 𝑅(𝑊® ) = 𝑚
Í
ℓ=1 𝑟 (ℓ) 0
𝑤𝑞(ℓ, 𝑤)𝑑𝑤 and
∫1
𝐶 (𝑊® ) = ℓ=1 𝑃 (𝑤)𝑞(ℓ, 𝑤)𝑑𝑤. This makes 𝑅(·) and 𝐶 (·) linear in the occupancy measure.
Í𝑚
0
We know define 𝑊 ′ by defining its occupancy measure using the occupancy measures of the other two
solutions: 𝛼 times the one of 𝑊® ∗ (𝜆0 − 𝜀) and 1 − 𝛼 times the one of 𝑊® ∗ (𝜆0 + 𝜀) (we note that this could be
suboptimal in the sense that a different vector achieves lower payment). Since the functions 𝑅 and 𝐶 are
linear in the occupancy measures, we get that

𝐶 (𝑊® ′ ) = 𝛼𝐶 (𝑊® ∗ (𝜆0 − 𝜀)) + (1 − 𝛼)𝐶 (𝑊® ∗ (𝜆0 + 𝜀)) = 𝜌

For the reward 𝑅(𝑊® ′ ) we have

𝑅(𝑊® ′ ) = 𝛼𝑅(𝑊® ∗ (𝜆0 − 𝜀)) + (1 − 𝛼)𝑅(𝑊® ∗ (𝜆0 + 𝜀))


 
≥ 𝛼 (𝜆0 − 𝜀)𝐶 (𝑊® ∗ (𝜆0 − 𝜀)) + 𝑅(𝑊® ∗ (𝜆0 )) − (𝜆0 − 𝜀)𝐶 (𝑊®𝜆∗0 )
  ®∗ ® ∗ (𝜆0 +𝜀 ) optimal

𝑊 (𝜆0 −𝜀 ),𝑊
+ (1 − 𝛼) (𝜆0 + 𝜀)𝐶 (𝑊® ∗ (𝜆0 + 𝜀)) + 𝑅(𝑊® ∗ (𝜆0 )) − (𝜆0 + 𝜀)𝐶 (𝑊®𝜆∗0 ) ® ∗ (𝜆0 )
compared to 𝑊

= 𝑅(𝑊® ∗ (𝜆0 )) + 𝜆0 𝜌 − 𝐶 (𝑊® ∗ (𝜆0 )) − 𝑂 (𝜀)




where in the inequality we use the fact that for every 𝜆, 𝑊 ∗ (𝜆) is the optimal solution for the Lagrangian
problem with multiplier 𝜆. The above implies that

𝑅(𝑊® ′ ) − 𝜆0𝐶 (𝑊® ′ ) ≥ 𝑅(𝑊® ∗ (𝜆0 )) − 𝜆0𝐶 (𝑊® ∗ (𝜆0 )) − 𝑂 (𝜀)

The above implies that as we take 𝜀 → 0, the solution 𝑊® ′ becomes optimal for the Lagrangian problem with
𝜆0 . Since 𝐶 (𝑊® ′ ) = 𝐶 (𝑊® ∗ ) = 𝜌 (𝑊® ′ is feasible for the constrained problem) we have that 𝑅(𝑊® ∗ ) ≥ 𝑅(𝑊® ′ ).
This implies that as 𝜀 → 0, the vector 𝑊® ∗ also becomes optimal for 𝜆0 . This proves what we want. ■

Finally, the proof of Corollary 4.7 follows easily.

Proof of Corollary 4.7. The proof follows by Lemmas C.7 and C.8 and Lemma 4.6. ■

C.5 Proof of Theorem 4.4

We start by proving Lemma 4.8.

Lemma 4.8. For any optimal solution of the constrained problem, 𝑊® ∗ , if ℓ ≥ then 𝑊ℓ∗ ≥ 2 .
2 ¯
𝑐𝜌
¯
𝑐𝜌

36
We now present the formal proof of Lemma 4.8.

Proof of Lemma 4.8. We assume the average optimal payment satisfies 𝐶 (𝑊® ∗ ) = 𝜌. Otherwise, if 𝐶 (𝑊® ∗ ) <
𝜌, the optimal solution bids 1 every round (implying 𝑊ℓ∗ = 𝑐¯ that implies the lemma) or by increasing
the payment, we could get a higher reward. We now note that for any 𝑊 ∈ [0, 𝑐], ¯ 𝑃 (𝑊 ) ≤ 𝑊𝑐¯ , since
a conversion probability of at least 𝑊 can always be achieved by bidding 1 with probability 𝑊𝑐¯ . Using
𝐶 (𝑊® ∗ ) = 𝜌 and 𝑃 (𝑊 ) ≤ 𝑊𝑐¯ we first make the following simple observation:
∑︁ 1 ∑︁ ∗ 1
𝜌 = 𝐶 (𝑊® ∗ ) = 𝑃 (𝑊ℓ∗ )𝜋ℓ (𝑊® ∗ ) ≤ 𝑊ℓ 𝜋ℓ (𝑊® ∗ ) = 𝜋 1 (𝑊® ∗ ) (18)
𝑐¯ 𝑐¯
ℓ ∈ [𝑚] ℓ ∈ [𝑚]

where in the equality we use the condition of 𝜋1 (𝑊® ∗ ) in Optimization Problem 5.


Let ℓ ′ = ⌈ 𝑐𝜌
2 ∗
¯ ⌉. We will prove that 𝑊ℓ ′ ≥ 𝑐𝜌/2,
¯ which using Corollary 4.7 would prove the desired lemma.
∗ ′
Towards a contradiction, assume that 𝑊 (ℓ ) < 𝑐𝜌/2 ¯ which implies 𝑊 ∗ (ℓ) < 𝑐𝜌/2
¯ for ℓ ≤ ℓ ′ . We have
that
∑︁
𝐶 (𝑊® ∗ ) = 𝑃 (𝑊ℓ∗ )𝜋ℓ (𝑊® ∗ )
ℓ ∈ [𝑚]
∑︁  𝑐𝜌 ¯
 ∑︁
𝜋ℓ (𝑊® ∗ ) + 𝑃 (𝑊ℓ∗ )𝜋ℓ (𝑊® ∗ )

≤ 𝑊1∗ ≤𝑊2∗ ≤...≤𝑊ℓ∗′ <
¯
𝑐𝜌
𝑃 2
ℓ ≤ℓ ′
2 ℓ>ℓ ′
1 ∑︁ 𝑐𝜌 ¯ 1 ∑︁
𝜋ℓ (𝑊® ∗ ) + 𝜋ℓ (𝑊® ∗ )

≤ 𝑃 (𝑊 ) ≤ 𝑊 1
𝑐¯ ≤ 𝑐¯
𝑐¯ ℓ ≤ℓ ′ 2 𝑐¯ ℓ>ℓ ′
  !
1 ¯ ∑︁
𝑐𝜌 ∗
𝜋ℓ (𝑊® )

= 1− 1− Í

® ∗ )=1
𝜋ℓ (𝑊
𝑐¯ 2 ℓ ≤ℓ ′

𝜋ℓ (𝑊® ∗ ):
Í
We proceed to lower bound ℓ ≤ℓ ′

∑︁ ∑︁ ℓ −1
Ö
𝜋ℓ (𝑊® ∗ ) = 𝜋1 (𝑊® ∗ ) (1 − 𝑊𝑖∗ )

using (5)
ℓ ≤ℓ ′ ℓ ≤ℓ ′ 𝑖=1
ℓ −1  
∑︁ Ö ¯
𝑐𝜌 
¯
> 𝑐𝜌 1− 𝑊𝑖∗ <
¯
𝑐𝜌
2
® ∗ ) ≥𝑐𝜌
and (18): 𝜋1 (𝑊 ¯
ℓ ≤ℓ ′ 𝑖=1
2
∑︁   ℓ −1  ℓ′ !
¯
𝑐𝜌 ¯
𝑐𝜌
= 𝑐𝜌
¯ 1− =2 1− 1−
ℓ ≤ℓ ′
2 2
 2 !
 𝑐𝜌  
¯ ¯
𝑐𝜌 1 
≥ 2 1− 1− ≥ 2 1− ℓ ′ ≥ 𝑐𝜌
¯
2
2 𝑒

Plugging this back in the previous inequality we get


   
∗ 1 ¯
𝑐𝜌 1
𝜌 = 𝐶 (𝑊® ) ≤ 1 − 2 1 − 1− ⇐⇒ 𝑒 + 𝑐𝜌
¯ ≤2
𝑐¯ 2 𝑒

where the last inequality is a contradiction. This completes the proof of the lemma. ■

Now we can proceed to prove Theorem 4.4.

37
Theorem 4.4. Fix a constant 𝐶 > 0 and an integer 𝑚 ≥ 𝑐𝜌 ¯ log𝑇 . Then, for any integer 𝑀 ≥ 𝑚, it holds that
2𝐶

inf inf
OPT𝑚 ≥ OPT𝑀 − O ( 𝑐𝜌 1 −𝐶
¯ 𝑇 ). In addition, this can be achieved even if we constrain 𝑊𝑚 = 𝑐¯ (i.e., bid 1 in
state 𝑚).

Proof. Fix 𝑚 = ⌈ 𝑐𝜌 2𝐶
¯ log𝑇 ⌉. We prove the theorem for this 𝑚, which would imply the theorem for larger
values as well. The proof will revolve around a slight modification of the optimal vector 𝑊® ∗ . Specifically,
we consider the strategy 𝑊® ′ such that 𝑊ℓ′ = 𝑊ℓ∗ for ℓ < 𝑚 and 𝑊ℓ′ = 𝑐¯ for ℓ ≥ 𝑚. We have to resolve two
problems that 𝑊® ′ has. First, while its expected time-average reward is greater than the one of 𝑊® ∗ when
the reward in state ℓ is 𝑟 𝑀 (·) (i.e., 𝑅𝑀 (𝑊® ′ ) ≥ 𝑅𝑀 (𝑊® ∗ ) = OPT𝑀
inf ), we cannot directly claim something

about 𝑅𝑚 (𝑊® ′ ), its average reward when the per-round reward function is 𝑟𝑚 (·). Second, the expected
time-average spending of 𝑊® ′ might be larger than 𝜌, since it is as aggressive as possible in states ℓ ≥ 𝑚.
We solve each of these problems by proving two claims.
Claim C.9. The expected time-average reward of 𝑊® ′ with reward 𝑟𝑚 (·) is 𝑅𝑚 (𝑊® ′ ) ≥ OPT𝑀
inf − 4 𝑇 −𝐶 .
¯
𝑐𝜌

The claim above takes care of the first problem. The infeasibility due to the payment is solved by the
following claim.
 
Claim C.10. The expected time-average payment of 𝑊® ′ is at most 𝜌 1 + 𝑐𝜌
16 −𝐶
¯ 𝑇 .

These two claims show that 𝑊® ′ is only a little suboptimal compared to OPT𝑇inf and only a little infeasible.
We will combine 𝑊® ′ with another solution that has very small spending to get the desired solution. We
prove both claims after completing the proof of Theorem 4.4.
Recall that 𝐶 (𝑊® ) is the expected average payment of solution 𝑊® and 𝐶 conv (𝑊® ) is the expected payment
between wins with conversions. Also for any 𝑊® that has 𝑚 coordinates we define 𝑊ℓ = 𝑊𝑚 for ℓ > 𝑚.
We now define a strategy 𝑊® ′′ that we will mix with 𝑊® ′ to get a solution that is slightly suboptimal but
feasible. Specifically, we consider 𝑊ℓ′′ = 0 for ℓ < 𝑚 and 𝑊ℓ′′ = 𝑐¯ for ℓ ≥ 𝑚. The expected average payment
of 𝑊® ′′ can be found by calculating
𝑚−1 ∞
∑︁ ∑︁ (1 − 𝑐)¯𝑚
𝐿(𝑊® ′′ ) = 1+ ¯ ℓ −1 = 𝑚 +
(1 − 𝑐)
ℓ=1 ℓ=𝑚
𝑐¯

and

∑︁ (1 − 𝑐)¯𝑚
𝐶 conv
(𝑊® ′′ ) = 𝑃 (¯ ¯ ℓ −1 ≤
𝑐) (1 − 𝑐)
ℓ=𝑚
𝑐¯

making 𝐶 (𝑊® ′′ ) 1 1 𝜌
≤ 1+𝑚𝑐¯ (1− 𝑐¯ ) −𝑚 ≤ 1+𝑚𝑐¯ ≤ 2𝐶 log𝑇 . We now define our feasible solution whose reward will
be a lower bound for OPT𝑚inf . This solution’s occupancy measure is defined by taking a convex combination

of the occupancy measures of 𝑊® ′ and 𝑊® ′′ ; this makes its expected average reward and payment the same
convex combination of the rewards and payments of 𝑊® ′ and 𝑊® ′′ . Specifically, we consider 1 − 𝛼 times the
occupancy measure of 𝑊® ′ and 𝛼 times the occupancy measure of 𝑊® ′′ where

32𝐶𝑇 −𝐶 log𝑇
 
1 −𝐶
𝛼= ≤O 𝑇
¯ log𝑇 + 32𝐶𝑇 −𝐶 log𝑇 − 𝑐𝜌
2𝐶𝑐𝜌 ¯ ¯
𝑐𝜌
This results in expected average payment:
 
® ′ ® ′′ 16 −𝐶 𝜌
(1 − 𝛼)𝐶 (𝑊 ) + 𝛼𝐶 (𝑊 ) ≤ (1 − 𝛼)𝜌 1 + 𝑇 +𝛼 =𝜌
¯
𝑐𝜌 2𝐶 log𝑇

38
making this solution feasible. Its expected reward is
   
® ′ ® ′′ inf 4 −𝐶 inf 1 −𝐶
(1 − 𝛼)𝑅𝑚 (𝑊 ) + 𝛼𝑅𝑚 (𝑊 ) ≥ (1 − 𝛼) OPT𝑀 − 𝑇 ≥ OPT𝑀 − O 𝑇
¯
𝑐𝜌 ¯
𝑐𝜌

where in the first inequality we used Claim C.9. This completes the theorem’s proof. ■

We now prove the two claims.

Proof of Claim C.9. Recall that 𝑅𝑚 (𝑊® ) and 𝑅𝑚 ® ) are the expected time-average reward of a solution
conv (𝑊
®
𝑊 and its expected reward between wins with conversions when the reward on state ℓ is 𝑟𝑚 (ℓ).
®)
conv (𝑊
Using that 𝑅𝑚 (𝑊® ) =
𝑅𝑚
®
(recall Definition C.1), we have
𝐿 (𝑊 )

𝑀
∑︁ ℓ −1
conv ® ∗ conv ® ′ Ö
𝑅𝑀 (𝑊 ) − 𝑅𝑚 (𝑊 ) = 𝑟 𝑀 (ℓ) − 𝑟 𝑀 (ℓ − 1) (1 − 𝑊𝑖∗ )
ℓ=1 𝑖=1
𝑚
∑︁ ℓ −1
Ö
(1 − 𝑊𝑖∗ )

− 𝑟𝑚 (ℓ) − 𝑟𝑚 (ℓ − 1)
ℓ=1 𝑖=1
𝑀
∑︁ −1
ℓÖ  
ℓ<𝑚 =⇒
(1 − 𝑊𝑖∗ )

= 𝑟 𝑀 (ℓ) − 𝑟 𝑀 (ℓ − 1) 𝑊ℓ∗ =𝑊ℓ′ ,𝑟 𝑀 (ℓ )=𝑟𝑚 (ℓ )
ℓ=𝑚 𝑖=1

∑︁ ℓ −1
Ö  
𝑟 𝑀 (ℓ ) −𝑟 𝑀 (ℓ −1) ≤1
≤ (1 − 𝑊𝑖∗ ) 1−𝑊𝑖∗ ≤1
2
ℓ=𝑚 𝑖=⌈
¯
𝑐𝜌 ⌉
∞ Ö
ℓ −1  
∑︁ ¯
𝑐𝜌 
≤ 1− Lemma 4.8
2
2
ℓ=𝑚 𝑖=⌈
¯
𝑐𝜌 ⌉
∞   ¯2 −1  ¯2 −1
¯ ℓ+ 𝑐𝜌 ¯ 𝑚+ 𝑐𝜌

∑︁ 𝑐𝜌 1 𝑐𝜌
≤ 1− = 𝑐𝜌
¯ 1−
ℓ=𝑚
2 2
2
  2𝐶
¯ log𝑇 −1
2 ¯ 𝑐𝜌
𝑐𝜌 
≤ 1− 2𝐶
¯ log𝑇
𝑚≥ 𝑐𝜌
¯
𝑐𝜌 2
  2𝐶
¯ log𝑇
4 ¯ 𝑐𝜌
𝑐𝜌 
≤ 1− ¯ ≤1
𝑐𝜌
¯
𝑐𝜌 2
4 1 4
≤ = 𝑇 −𝐶 (19)
¯ exp(𝐶 log𝑇 ) 𝑐𝜌
𝑐𝜌 ¯

It is not hard to observe that 𝐿(𝑊® ′ ) ≤ 𝐿(𝑊® ∗ ), since 𝑊ℓ∗ ≤ 𝑊ℓ′ for all ℓ. Using the above, we have

® ′)
conv (𝑊
𝑅𝑚 𝑅𝑀 ® ∗ ) − 4 𝑇 −𝐶
conv (𝑊
′ ¯
𝑐𝜌 4
®
𝑅𝑚 (𝑊 ) = ≥ ≥ 𝑅𝑀 (𝑊® ∗ ) − 𝑇 −𝐶
®
𝐿(𝑊 ) ′ ®
𝐿(𝑊 ) ∗ ¯
𝑐𝜌

where the last inequality holds because 𝐿(𝑊® ∗ ) ≥ 1. This proves the claim. ■

We now prove Claim C.10.

39
®)
𝐶 conv (𝑊
Proof of Claim C.10. Using 𝐶 (𝑊® ) = ®)
, we upper bound 𝐶 (𝑊® ′ ) using the fact that it is close to 𝐶 (𝑊® ∗ ).
𝐿 (𝑊
We assume that 𝐶 (𝑊® ∗ ) = 𝜌 since otherwise, because of optimality of 𝑊® ∗ , it must hold that 𝑊® ∗ bids 1 every
round, implying 𝑊ℓ∗ = 𝑐¯ for all ℓ; this would imply 𝑊® ′ = 𝑊® ∗ making the claim trivial. First, we compare
𝐿(𝑊® ∗ ) and 𝐿(𝑊® ′ ), using the equations of Definition C.1
∞ Ö ℓ −1 ℓ −1
!
∑︁ Ö  
∗ ′ ∗ ′ ℓ<𝑚 =⇒
𝐿(𝑊® ) − 𝐿(𝑊® ) = (1 − 𝑊𝑖 ) − (1 − 𝑊𝑖 ) ∗
𝑊 =𝑊 ′
ℓ ℓ
ℓ=𝑚 𝑖=1 𝑖=1
∞ Ö
∑︁ ℓ −1
≤ (1 − 𝑊𝑖∗ )
2
ℓ=𝑚 𝑖=⌈
¯
𝑐𝜌 ⌉

4 −𝐶 4
≤ 𝑇 ≤ 𝑇 −𝐶 𝐿(𝑊® ′ ) (20)
¯
𝑐𝜌 ¯
𝑐𝜌
where in the second to last inequality we used the same calculation as in (19) and in the last inequality we
used the fact that 𝐿(𝑊® ′ ) ≥ 1.
Now, using Definition C.1, for the average payment per epoch of 𝑊® ′ we have

∑︁ ℓ −1
Ö  
ℓ<𝑚 =⇒
𝐶 conv (𝑊® ′ ) − 𝐶 conv (𝑊® ∗ ) ≤ 𝑃 (𝑊ℓ′ ) (1 − 𝑊𝑖′ ) 𝑊ℓ∗ =𝑊ℓ′
ℓ=𝑚 𝑖=1

∑︁ −1
ℓÖ
≤ 𝑃 (¯
𝑐) (1 − 𝑊𝑖′ )
2
ℓ=𝑚 𝑖=⌈
¯
𝑐𝜌 ⌉
∞  2
¯ −1
 ℓ+ 𝑐𝜌
∑︁ ¯
𝑐𝜌 
≤ 𝑃 (¯
𝑐) 1− Lemma 4.8
ℓ=𝑚
2
4 −𝐶 4
≤ 𝑃 (¯
𝑐) 𝑇 ≤ 𝑇 −𝐶 𝐶 conv (𝑊® ∗ ) (21)
¯
𝑐𝜌 𝜌
where in the second to last inequality, we used the same analysis as in (19) and in the final inequality we
used that for any vector 𝑊® it holds that
∞ ℓ −1 ∞ ℓ −1
∑︁ Ö 𝑃 (¯𝑐) ∑︁ Ö 𝑃 (¯𝑐)
𝐶 conv
(𝑊® ) = 𝑃 (𝑊ℓ ) (1 − 𝑊𝑖 ) ≤ 𝑊ℓ (1 − 𝑊𝑖 ) =
ℓ=1 𝑖=1
𝑐¯ ℓ=1 𝑖=1
𝑐¯

where the inequality holds because 𝑃 (𝑊 ) ≤ 𝑊𝑐¯ 𝑃 (¯


𝑐): the bidder can win with probability 𝑊 by bidding 1
with probability 𝑊𝑐¯ and 0 otherwise and bidding 1 results in paying 𝑃 (¯
𝑐) in expectation, i.e. the expected
price.
Combining (20) and (21) we get
4 −𝐶
𝐶 conv (𝑊® ′ ) 𝐶 conv (𝑊® ∗ ) 1 + 𝜌 𝑇
 2  
® ′ ® ∗ 4 −𝐶 16 −𝐶
𝐶 (𝑊 ) = ≤ ≤ 𝐶 (𝑊 ) 1 + 𝑇 ≤ 𝜌 1+ 𝑇
𝐿(𝑊® ′ ) 𝐿(𝑊® ∗ ) 1+ 4 𝑇 −𝐶
1 ¯
𝑐𝜌 ¯
𝑐𝜌
¯
𝑐𝜌

where in the last inequality we used 𝐶 (𝑊® ∗ ) ≤ 𝜌 and 2 −𝐶


¯ 𝑇
𝑐𝜌 ≤ 1. This proves the claim. ■

D Deferred Proofs of Section 5


In this section, we complete the proofs of the results in Section 5.

40
D.1 Lower Bound on the Realized Reward

Before proving the main lemma of this part, Lemma 5.3, we prove a simple lemma on the expected reward
of a single epoch. Fix an epoch 𝑖. We want to compare the reward of epoch 𝑖 with 𝑅 conv ( b®𝑇𝑖 −1 ), the expected
reward between wins with conversions of using b®𝑇𝑖 −1 in the infinite horizon setting (we overload the nota-
tion of Definition C.1). These would be the same in expectation if we let 𝑘 = ∞. However, because epoch 𝑖
might end early, there is some reward loss. This loss depends on the probability of an epoch ending early.
This probability becomes small because of bidding 1 in state 𝑚, b𝑇𝑚𝑖 −1 = 1, and by setting 𝑘 − 𝑚 ≈ log𝑇 .
Lemma D.1. Let H𝑡 be the history up to round 𝑡 (prices, contexts, conversions). Fix an epoch 𝑖 that starts at
least 𝑘 rounds before 𝑇 : 𝑇𝑖 −1 ≤ 𝑇 − 𝑘. Assume 𝑘 ≥ 𝑚 + 𝑐1¯ 𝐶 log𝑇 for some 𝐶 ≥ 1. Let REW𝑖 be the reward of
epoch 𝑖 if the budget is not violated. Then,

E REW𝑖 H𝑇𝑖 −1 ≥ 𝑅 conv ( b®𝑇𝑖 −1 ) − 𝑚𝑇 −𝐶


 

Since b𝑇𝑚𝑖 −1 = 1 we know that after not getting a win with impression 𝑚 times, the probability that an epoch
ends is 𝑐¯ per round. By the lower bound on 𝑘, the probability of not winning with an impression is less
than 𝑇 −𝐶 , which would lead to losing at most 𝑚 reward.

Proof. We first lower bound the probability of epoch 𝑖 ending without a conversion. We use the fact that,
𝑖 = 1 when it has been 𝑚 rounds or more since the last conversation, it holds that 𝑊 (1) = 𝑐:
by bidding b𝑚 ¯
𝑘
Ö 𝑘 𝑘
 Ö  Ö 1
1 − 𝑊 (b𝑖ℓ ) ≤ 1 − 𝑊 (b𝑖ℓ ) = (1 − 𝑐) ¯ 𝑘 −𝑚+1 ≤ (1 − 𝑐)
¯ = (1 − 𝑐) ¯ 𝑐¯ 𝐶 log𝑇 ≤ 𝑇 −𝐶
ℓ=1 ℓ=𝑚 ℓ=𝑚

where in the second to last inequality we used 𝑘 ≥ 𝑚 + 𝑐1¯ 𝐶 log𝑇 . This proves that
h i 
E REW𝑖 H𝑇𝑖 −1 ≥ E 𝑅 conv ( b®𝑇𝑖 −1 ) H𝑇𝑖 −1 1 − 𝑇 −𝐶
 

The above proves the lemma since 𝑅 conv ( b®𝑇𝑖 −1 ) ≤ 𝑚 and 𝑅 conv ( b®𝑇𝑖 −1 ) does not depend on the randomness
of rounds after 𝑇𝑖 −1 . ■

We now prove the main lemma of this part, which we restate here for completeness.
Lemma 5.3. Let 𝑤𝑡 ∈ {0, 1} indicate whether the bidder got a conversion in round 𝑡. Assume 𝑘 ≥ 𝑚 + 𝑐1¯ log𝑇 .
Fix round 𝜏 ≤ 𝑇 − 𝑘 and let 𝐼𝜏 be the epoch of round 𝜏. Then the total realized reward up to round 𝜏 is at least
𝐼𝜏 √︂ !
inf
∑︁ 𝑇
𝜏OPT𝑚 − 𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −1 ) − O 𝑘 𝑇 log
𝑗=1
𝛿

with probability at least 1 − 𝛿 for any 𝛿 > 0.

Proof of Lemma 5.3. Let 𝑤𝑡 ∈ {0, 1} indicate whether the bidder won a conversion in round 𝑡. We first
notice that
𝜏
∑︁ 𝐼𝜏
∑︁
𝑤𝑡 𝑟 ( ℓ𝑡 ) ≥
˜ REW 𝑗 − 𝑚 (22)
𝑡 =1 𝑗=1

where the first inequality holds because by round 𝜏, the algorithm has been allocated the rewards of the
first 𝐼𝜏 − 1 epochs but may not have been allocated the reward of epoch 𝐼𝜏 , which is at most 𝑟 (𝑚) ≤ 𝑚.

41
Í𝐼𝜏 Í𝐼𝜏  
We shoe next that 𝑗=1 REW 𝑗 ≈ 𝑗=1 E REW 𝑗 H𝑇𝑗 −1 . For 𝑖 = 0, 1, . . . , we define the sequence

𝑖
∑︁ 𝑖
∑︁  
𝑍𝑖 = REW 𝑗 − E REW 𝑗 H𝑇𝑗 −1
𝑗=1 𝑗=1
 
We notice that 𝑍𝑖 is a martingale (since E 𝑍𝑖 − 𝑍𝑖 −1 H𝑇𝑖 −1 = 0) that has differences |𝑍𝑖 −𝑍𝑖 −1 | ≤ 𝑚. Using
Azuma’s inequality, we get that for all 𝛿 > 0 with probability at least 1 − 𝛿
𝑖 𝑖 √︂ !
∑︁ ∑︁   1
REW 𝑗 ≥ E REW 𝑗 H𝑇𝑗 −1 − O 𝑚 𝑖 log
𝑗=1 𝑗=1
𝛿

Using the union bound over all epochs 𝑖 (note there are at most 𝑇 epochs), we get the above inequality for
all 𝑖, implying that all 𝛿 > 0 with probability at least 1 − 𝛿
𝐼𝜏 𝐼𝜏 √︂ !
∑︁ ∑︁   𝑇
REW 𝑗 ≥ E REW 𝑗 H𝑇𝑗 −1 − O 𝑚 𝑇 log
𝑗=1 𝑗=1
𝛿

We now use Lemma D.1 and get


𝐼𝜏 𝐼𝜏 √︂ !
∑︁ ∑︁ 𝑇
REW 𝑗 ≥ 𝑅 conv ( b®𝑇𝑗 −1 ) − O 𝑚 𝑇 log (23)
𝑗=1 𝑗=1
𝛿

where we include the 𝑗 𝑚𝑇 −1 ≤ 𝑚 term in the O (·) term. We proceed to analyze the expected reward
Í

between wins with conversions 𝑅 conv ( b®𝑇𝑗 −1 ) = 𝑅( b®𝑇𝑗 −1 )𝐿( b®𝑇𝑗 −1 ) (recall Definition C.1, by overloading the
notation: 𝑅( b) ® is the time average reward of bidding b® and 𝐿( b) ® is the expected time between conversions):
 +  + 
𝑅 conv ( b®𝑇𝑗 −1 ) = 𝑅( b®𝑇𝑗 −1 )𝐿( b®𝑇𝑗 −1 ) ≥ OPT𝑚
inf
− 𝜀𝑅 (𝑇 𝑗 −1 ) 𝐿( b®𝑇𝑗 −1 ) ≥ OPT𝑚
inf

− 𝜀𝑅 (𝑇 𝑗 −1 ) E 𝐿 𝑗 H𝑇𝑗 −𝑖 (24)

where the first inequality holds by the sub-optimality assumption on b®𝑇𝑖 −1 and the second holds because
epoch 𝑗 may be stopped early, so its expected time until a conversion can only be bigger than the expected
length of the epoch. Relying on the fact that 𝜀𝑅 (𝑇 𝑗 −1 ) is a deterministic function of H𝑇𝑗 −𝑖 , we now define
another martingale:
𝑖 
∑︁ + 𝑖  +
inf
  ∑︁ inf
𝑌𝑖 = OPT𝑚 − 𝜀𝑅 (𝑇 𝑗 −1 ) E 𝐿 𝑗 H𝑇𝑗 −𝑖 − OPT𝑚 − 𝜀𝑅 (𝑇 𝑗 −1 ) 𝐿 𝑗
𝑗=1 𝑗=1

which has differences bounded by 𝑘 since 𝐿 𝑗 ≤ 𝑘 and OPT𝑚 inf − 𝜀 (𝑇


𝑅 𝑗 −1 ) ≤ 1. Using Azuma’s inequality, we
get that with probability at least 1 − 𝛿 for all 𝛿 > 0 it holds
𝑖  𝑖  √︂ !
∑︁ +  ∑︁ + 1
inf inf

OPT𝑚 − 𝜀𝑅 (𝑇 𝑗 −1 ) E 𝐿 𝑗 H𝑇𝑗 −𝑖 ≥ OPT𝑚 − 𝜀𝑅 (𝑇 𝑗 −1 ) 𝐿 𝑗 − O 𝑘 𝑖 log
𝑗=1 𝑗=1
𝛿
𝑖 √︂ !
inf
∑︁ 1 
≥ OPT𝑚 (𝑇𝑖 − 𝑘) − 𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −1 ) − O 𝑘 𝑖 log
Í𝑖
𝑗 =1 𝐿 𝑗 =𝑇𝑖 −𝑘
𝑗=1
𝛿

42
Using the union bound on the above for all epochs 𝑖 (there are at most 𝑇 ) and plugging it into (24) and
then (23) we get
𝐼𝜏 𝐼𝜏 √︂ !
∑︁
inf
∑︁ 𝑇
REW 𝑗 ≥ OPT𝑚 (𝑇𝐼𝜏 − 𝑘) − 𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −1 ) − O (𝑚 + 𝑘) 𝑇 log
𝑗=1 𝑗=1
𝛿

inf ≤ 𝑘,
The proof is finalized by plugging the above into (22) and using the following inequalities: 𝑘OPT𝑚
𝑇𝐼𝜏 ≥ 𝜏, 𝑚 + 𝑘 ≤ 2𝑘. ■

D.2 Upper Bound on the Realized Payment

Before proving the main lemma of this part, Lemma 5.4, we prove a simple lemma on the expected length
of a single epoch. This lemma proves that the expected length of an epoch 𝑖 is not too much smaller than
the expected time between wins with conversions in the infinite horizon setting, 𝐿( b®𝑇𝑖 −1 ). While this looks
simple, we have to look into the exact difference between the two quantities. The length of an epoch is
bounded almost surely but the time between wins with conversions is not.

Lemma D.2. Fix an epoch 𝑖 and assume that 𝑘 ≥ 𝑚 + 𝑐1¯ 𝐶 log𝑇 and 𝐶 ≥ 1. Then, it holds that
 1
𝐿( b®𝑇𝑖 −1 ) ≤ E 𝐿𝑖 H𝑇𝑖 −1 + 𝑇 −𝐶

𝑐¯

Proof. The only difference between 𝐿𝑖 and 𝐿( b®𝑇𝑖 −1 ) is that in 𝐿𝑖 we might stop early if there is no conversion
after 𝑘 rounds. This means that the difference of their expectations is
∞ ℓ −1  ∞
 ∑︁ ℓ −1  
 ∑︁ Ö Ö
®
 𝑗 𝑗 𝑗
𝑇 𝑗 −1
𝐿( b ) − E 𝐿 𝑗 H𝑇𝑗 −1 = ℓ𝑊 (bℓ ) 1 − 𝑊 (bℓ ′ ) − min{ℓ, 𝑘 }𝑊 (bℓ ) 1 − 𝑊 (bℓ𝑗 ′ )
ℓ=1 ℓ ′ =1 ℓ=1 ℓ ′ =1

∑︁ ℓ −1
Ö  
= (ℓ − 𝑘)𝑊 (bℓ𝑗 ) 1 − 𝑊 (bℓ𝑗 ′ )
ℓ=𝑘+1 ℓ ′ =1

∑︁ ℓ −1
Ö  
b𝑚
𝑖 =1
≤ (ℓ − 𝑘)𝑐¯ (1 − 𝑐)
¯ 𝑘 ≥𝑚
ℓ=𝑘+1 ℓ ′ =𝑚

∑︁ ¯ 𝑘 −𝑚+1
(1 − 𝑐) 1
= ¯ ℓ −𝑚 =
(ℓ − 𝑘)𝑐¯(1 − 𝑐) ≤ 𝑇 −𝐶
𝑐¯ 𝑐¯
ℓ=𝑘+1

where in the last inequality we used 𝑘 ≥ 𝑚 + 𝑐1¯ 𝐶 log𝑇 . ■

We now restate Lemma 5.4 for completeness and then prove it.

Lemma 5.4. Assume that 𝑘 ≥ 𝑚 + 𝑐1¯ log𝑇 . Fix a round 𝜏 and let 𝐼𝜏 be its epoch. For any 𝛿 > 0, with
probability at least 1 − 𝛿 the total payment of the algorithm until round 𝜏 is at most
𝐼𝜏 √︂ !
∑︁ 𝑇
𝜏𝜌 + 𝐿 𝑗 𝜀𝐶 (𝑇 𝑗 −1 ) + O 𝑘 𝑇 log
𝑗=1
𝛿

Proof. Let PAY𝑖 denote the total payment during epoch 𝑖. It is not hard to see that the total payment up to
Í𝐼𝜏
round 𝜏 is at most 𝑖=1 PAY𝑖 since epoch 𝐼𝜏 ends at round 𝜏 or later.

43
For 𝑖 = 0, 1, . . . we define the following martingale with respect to the history H𝑇𝑗 −1 up to each epoch 𝑗 − 1.
𝑖
∑︁ 𝑖
∑︁  
𝑍𝑖 = PAY 𝑗 − E PAY 𝑗 H𝑇𝑗 −1
𝑗=1 𝑗=1

which has bounded differences |𝑍𝑖 − 𝑍𝑖 −1 | ≤ 𝑘, since the maximum payment at any epoch can be at most
𝑘. Using Azuma’s inequality we get that for all 𝛿 > 0 with probability at least 1 − 𝛿
𝑖 𝑖 √︂ !
∑︁ ∑︁   1
PAY 𝑗 ≤ E PAY 𝑗 H𝑇𝑗 −1 + O 𝑘 𝑖 log
𝑗=1 𝑗=1
𝛿

We now notice that E PAY 𝑗 H𝑇𝑗 −1 ≤ 𝐶 conv ( b®𝑇𝑗 −1 ): the expected payment of epoch 𝑗 is less than the
 

expected payment between wins with conversions in the infinite horizon setting, since an epoch might
stop earlier. Using the union bound over all epochs 𝑖 (there are at most 𝑇 of them) we get the above
inequality for all 𝑖 and therefore also 𝑖 = 𝐼𝜏 : with probability at least 1 − 𝛿 it holds
𝐼𝜏 𝐼𝜏 √︂ !
∑︁ ∑︁ 𝑇
PAY 𝑗 ≤ 𝐶 conv ( b®𝑇𝑗 −1 ) + O 𝑘 𝑇 log (25)
𝑗=1 𝑗=1
𝛿

We now examine 𝐶 conv ( b®𝑇𝑗 −1 ). Using Definition C.1,


 2
𝐶 conv ( b®𝑇𝑗 −1 ) = 𝐶 ( b®𝑇𝑗 −1 )𝐿( b®𝑇𝑗 −1 ) ≤ 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) 𝐿( b®𝑇𝑗 −1 ) ≤ 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) E 𝐿𝑖 H𝑇𝑖 −1 + 𝑇 −1
  
𝑐¯
where in the first inequality we used that the payment of b®𝑇𝑖 −1 is 𝜀𝐶 (𝑇 𝑗 −𝑖 ) approximate and that it is feasible
for the empirical distribution; for the second equality we used Lemma D.2 and that 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) ≤ 2.
Plugging into (25) we get
𝐼𝜏 𝐼𝜏 √︂ ! 𝐼
𝜏
∑︁ ∑︁    𝑇 ∑︁ 2 −1
PAY 𝑗 ≤ 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) E 𝐿 𝑗 H𝑇𝑗 −1 + O 𝑘 𝑇 log + 𝑇
𝑗=1 𝑗=1
𝛿 𝑗=1
𝑐¯
𝐼𝜏 √︂ !
∑︁    𝑇
≤ 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) E 𝐿 𝑗 H𝑇𝑗 −1 + O 𝑘 𝑇 log (26)
𝑗=1
𝛿

where in the last inequality we move the last sum term into the O (·) term by using 𝐼𝜏 ≤ 𝑇 , and that 𝑘 ≥ 𝑐1¯ .
The rest of the proof is similar to the one of Lemma 5.3: we define a martingale
𝑖
∑︁ 𝑖
∑︁
   
𝑌𝑖 = 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) E 𝐿 𝑗 H𝑇𝑗 −1 − 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) 𝐿 𝑗
𝑗=1 𝑗=1

which has bounded differences |𝑌𝑖 − 𝑌𝑖 −1 | ≤ 2𝑘 since 1 ≤ 𝐿 𝑗 ≤ 𝑘 and 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) ≤ 2. Using Azuma’s


inequality we get that with probability at least 1 − 𝛿
𝑖 𝑖 √︂ !
∑︁    ∑︁  1
𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) E 𝐿 𝑗 H𝑇𝑗 −1 ≤ 𝜌 + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) 𝐿 𝑗 + O 𝑘 𝑇 log
𝑗=1 𝑗=1
𝛿
𝑖 √︂ !
∑︁ 1 
= (𝑇𝑖 − 𝑘)𝜌 + 𝐿 𝑗 𝜀𝐶 (𝑇 𝑗 −𝑖 ) + O 𝑘 𝑇 log
Í𝑖
𝑗 =1 𝐿 𝑗 =𝑇𝑖 −𝑘
𝑗=1
𝛿

44
Getting the above inequality for 𝑖 = 𝐼𝜏 (by using the union bound) and combining it with Eq. (26) we get
𝐼𝜏 𝐼𝜏 √︂ !
∑︁ ∑︁ 𝑇
PAY 𝑗 ≤ (𝑇𝐼𝜏 − 𝑘)𝜌 + 𝐿 𝑗 𝜀𝐶 (𝑇 𝑗 −𝑖 ) + O 𝑘 𝑇 log
𝑗=1 𝑗=1
𝛿

By using 𝜏 ≥ 𝑇𝐼𝜏 − 𝑘, we get the lemma. ■

D.3 Deferred Proofs of Theorem 5.1 and Corollary 5.2



Using Lemmas 5.3 and 5.4 it is not hard to prove Theorem 5.1. By picking 𝜏 = 𝑇 − Õ ( 𝑇 ) so that the
algorithm does not run out of budget in round 𝜏 with high probability, we get the promised high reward
of Lemma 5.3.
Theorem 5.1. Fix any 𝑚 ∈ N and let 𝑘 ≥ 𝑚 + 𝑐1¯ log𝑇 . Let 𝑁 be the number of epochs. Then for all 𝛿 > 0
with probability at least 1 − 𝛿 Algorithm 1 achieves regret against OPT𝑚 inf that is at most

𝑁 √︂ !
∑︁  𝑇
𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −𝑖 ) + 𝜀𝐶 (𝑇 𝑗 −𝑖 ) + O 𝑘 𝑇 log
𝑗=1
𝛿

where 𝜀𝑅 (𝑇 𝑗 −𝑖 ) and 𝜀𝐶 (𝑇 𝑗 −𝑖 ) are error terms of the bidding of epoch 𝑗, b®𝑇𝑗 −1 : 𝜀𝑅 (𝑇 𝑗 −𝑖 ) = OPT𝑚 ®𝑇𝑗 −1 ) +
inf − 𝑅( b

+
is the reward sub-optimality gap and 𝜀𝐶 (𝑇 𝑗 −𝑖 ) = 𝐶 ( b®𝑇𝑗 −1 ) − 𝜌 is the expected average payment above 𝜌.

Proof of Theorem 5.1. Fix a 𝛿 > 0. Using the union bound, assume that Lemmas 5.3 and 5.4 hold for all
𝜏 ∈ [𝑇 ] with probability at least 1 − 𝛿. Fix a round 𝜏 such that
𝐼𝜏 √︂ !
∑︁ 𝑇
𝑇 −𝜏 ≥ 𝐿 𝑗 𝜀𝐶 (𝑇 𝑗 −1 ) + O 𝑘 𝑇 log (27)
𝑗=1
𝛿

Í
Lemma 5.4 implies that the algorithm has run out of budget by that round 𝜏. This means that 𝜏𝑡=1 𝑤𝑡 𝑟 ( ℓ˜𝑡 )
is a lower bound for the algorithm’s reward by that round (lower bound because the actual lengths between
wins with conversions can only be bigger than ℓ˜𝑡 ). Lemma 5.3 now implies that the total reward by round
𝜏 is at least
𝐼𝜏 √︂ !
inf
∑︁ 𝑇
𝜏OPT𝑚 − 𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −1 ) − O (𝑚 + 𝑘) 𝑇 log
𝑗=1
𝛿
𝐼𝜏 √︂ !
inf
∑︁  𝑇
≥ 𝑇 OPT𝑚 − 𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −1 ) + 𝜀𝐶 (𝑇 𝑗 −1 ) − O (𝑚 + 𝑘) 𝑇 log
𝑗=1
𝛿

inf ≤ 1. This proves the theorem


where in the inequality we used (27) and OPT𝑚 ■

Finally, we prove Corollary 5.2. The proof is quite simple using Theorem 6.1, but we include it for com-
pleteness.
Corollary 5.2. Let 𝑚 = ⌈ 𝑐𝜌¯ log𝑇 ⌉ and 𝑘 = ⌈𝑚 + 𝑐¯ log𝑇 ⌉. Then for all 𝛿 > 0 with probability at least 1 − 𝛿,
2 1

Algorithm 1 achieves total reward at least


√︂ !
inf 1 3 𝑇
𝑇 · OPT𝑇 − O 3 4 log 𝑇 𝑇 log
𝑐¯ 𝜌 𝛿

45
Proof of Corollary 5.2. Using Theorem 6.1 and the union bound, we get that with probability at least 1 − 𝛿
for all rounds 𝑡 it holds that √︂ !
𝑚3 1 𝑇
𝜀𝑅 (𝑡) + 𝜀𝐶 (𝑡) ≤ O √ log
𝜌 𝑡 𝛿
The above implies that for any round 𝜏
𝐼𝜏 √︂ 𝐼𝜏
!
∑︁  𝑚3 𝑇 ∑︁ 𝐿𝑗
𝐿 𝑗 𝜀𝑅 (𝑇 𝑗 −1 ) + 𝜀𝐶 (𝑇 𝑗 −1 ) ≤ O log √︁
𝑗=1
𝜌 𝛿 𝑗=1 𝑇 𝑗 −1


√𝐿 𝑗
Í𝐼𝜏
The corollary follows by proving that 𝑗=1 = O ( 𝑇 ), which makes the above term dominate the
𝑇 𝑗 −1
other error term in Theorem 5.1.
𝐼𝜏 𝐼𝜏 𝑇∑︁
𝑗 −1 𝐼𝜏 
∑︁ 𝐿𝑗 ∑︁ 1 ∑︁ √︁ √︁   √ √ 
≤ √ ≤2 𝑇 𝑗 −1 − 1 − 𝑇 𝑗 −1 − 𝐿 𝑗
Í𝑏 1
√︁ 𝑡 =𝑎 √𝑡 ≤2 𝑏 −1− 𝑎−1
𝑗=1 𝑇 𝑗 −1 𝑗=1 𝑡 =𝑇 𝑗 −1 −𝐿 𝑗 +1 𝑡 𝑗=1
𝐼
∑︁ 𝜏  √︃ √︃ 

≤2 𝑇𝑗 − 𝑘 − 1 − 𝑇𝑗 − 𝑘 − 𝐿 𝑗 𝑇 𝑗 −1 ≥𝑇 𝑗 −𝑘
𝑗=1
𝐼𝜏 √︃

∑︁ √︃ 
√︁ 
=2 𝑇 𝑗 − 𝑘 − 𝑇 𝑗 −1 − 𝑘 = 2 𝑇𝐼𝜏 ≤ 2 𝑇 𝑇 𝑗 =𝑇 𝑗 −1 +𝐿 𝑗
𝑗=1

E Deferred Proofs of Section 6


In this section, we present the deferred proofs of Section 6.

E.1 Deferred Proof of Lemma 6.2

We first restate the lemma.

Lemma 6.2. The bidding function of (9) does not need to depend on the context 𝑥, only the conversion rate
𝑐. In addition, the optimal bidding takes the following form:
• If the price distribution for a given conversion rate has no atoms, the optimal solution is 𝑏 𝜇 (𝑐) = min(1, 𝑐𝜇 )
for some 𝜇 ≥ 0; note 𝜇 = 0 implies bidding 1 for any context.
• If the price distribution for a given conversion rate has finite support, the optimal solution is supported on
at most two bidding policies, 𝑏 𝜇1 (𝑐) and 𝑏 𝜇2 (𝑐), for some 𝜇1, 𝜇2 ≥ 0.
• In all other cases, the maximum might not be obtained, but a solution supported on 𝑏 𝜇1 (𝑐) and 𝑏 𝜇2 (𝑐) can
get arbitrarily close.

Proof. We first write the Lagrangian of (9) for some Lagrange multiplier 𝜇 ≥ 0.
h i
sup E 𝑐 − 𝜇𝑝 1 𝑏 (𝑥) ≥ 𝑝 + 𝜇𝜌 ′
 
𝑥,𝑝
b

where recall the conversion rate 𝑐 is part of the context 𝑥.

46
The above is maximized when 𝑏 (𝑥) = min( 𝑐𝜇 , 1) independent of other parts of the context, since the objec-
tive when 𝑏 (𝑥) = 𝑐𝜇 becomes E 𝑐 − 𝜇𝑝) + + 𝜇𝜌 ′ , which is the supremum value of the above optimization
 

problem. When 𝑐𝜇 > 1, then winning the auction for sure by bidding 1 maximizes the value. Because the
only part of the context that matters is now the conversion rate, we used 𝑏 (𝑐) instead of 𝑏 (𝑥) and focus
on the distribution over the pair (𝑝, 𝑐) instead of (𝑝, 𝑥).
We overload the previous notation of the expected payment of a bid 𝑏, 𝑃 (𝑏) and define the expected pay-
ment of using bid 𝑐/𝜇 as   
𝑃 (𝜇) = E 𝑝 1 𝑐 ≥ 𝜇𝑝 .
𝑥,𝑝,𝑐

The above function is non-decreasing in 𝜇 and lower semi-continuous. Let 𝜇 ∗ be the greatest non-negative
number such that 𝑃 (𝜇 ∗ ) ≥ 𝜌 ′ . If no such 𝜇 ∗ exists, always bidding 1 (i.e., using 𝜇 = 0) is the optimal solution

to (9). If 𝑃 (𝜇 ∗ ) = 𝜌 ′ then 𝑏 𝜇 (𝑐) = 𝜇𝑐∗ is an optimal solution for (9). We note that if the distribution on (𝑝, 𝑐)
has no atoms, then 𝑃 (𝜇) is continuous, implying that a 𝜇 ∗ with 𝑃 (𝜇 ∗ ) = 𝜌 ′ always exists. For the rest of
the proof, we focus on the case when the distribution on (𝑝, 𝑐) has atoms and assume that 𝑃 (𝜇 ∗ ) > 𝜌 ′ .
Fix 𝜀 > 0. Note that 𝑃 (𝜇 ∗ + 𝜀) < 𝜌 ′ . We notice that bidding 𝜇 ∗𝑐+𝜀 in the Lagrangian problem with multiplier
𝜇 ∗ is near optimal when 𝜀 is close to 0:
h i h i h i
E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ 𝑝 (𝜇 ∗ + 𝜀) = E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ 𝑝𝜇 ∗ − E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ∈ [𝑝𝜇 ∗, 𝑝 (𝜇 ∗ + 𝜀))
     
h i
≥ E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ 𝑝𝜇 ∗ − 𝜀 P 𝑐 ∈ (𝑝𝜇 ∗, 𝑝 (𝜇 ∗ + 𝜀))
   
(28)

Let b∗ be the distribution that bids 𝜇𝑐∗ with probability 𝑞 and 𝜇 ∗𝑐+𝜀 with probability 1 − 𝑞. Because it holds
𝑃 (𝜇 ∗ + 𝜀) < 𝜌 ′ < 𝑃 (𝜇 ∗ ), we pick 𝑞 such that (1 − 𝑞)𝑃 (𝜇 ∗ − 𝜀) + 𝑞𝑃 (𝜇 ∗ ) = 𝜌 ′ . This makes makes
h i
E 𝑐 − 𝜇 ∗ 𝑝 1 b∗ ≥ 𝑝
 
h i h i
= 𝑞 E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ 𝜇 ∗𝑝 + (1 − 𝑞) E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ (𝜇 ∗ + 𝜀)𝑝
   
h i
≥ E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ 𝑝𝜇 ∗ − 𝜀 P 𝑐 ∈ (𝑝𝜇 ∗, 𝑝 (𝜇 ∗ + 𝜀))
    
using (28) (29)

We proceed to prove that b∗ achieves conversion probability that approximates the value of (9) as 𝜀 → 0.
Fix any bidding distribution b that is feasible for the constrained problem, i.e., E [𝑝 1 [b ≥ 𝑝]] ≤ 𝜌 ′ . Given
that 𝜇𝑐∗ is optimal for the Lagrangian problem with multiplier 𝜇 ∗ , its objective value is better than b:
h i h i
E 𝑐 − 𝜇 ∗𝑝 1 𝑐 ≥ 𝜇 ∗𝑝 ≥ E 𝑐 − 𝜇 ∗𝑝 1 b ≥ 𝑝
   

= E 𝑐 1 b ≥ 𝑝 − 𝜇∗ E 𝑝 1 b ≥ 𝑝
     

≥ E 𝑐 1 b ≥ 𝑝 − 𝜇∗𝜌 ′
  

Combining the above with (29) and the fact that E [𝑝 1 [b∗ ≥ 𝑝]] = 𝜌 ′ we get that

E 𝑐 1 b∗ ≥ 𝑝 ≥ E 𝑐 1 b ≥ 𝑝 − 𝜀 P 𝑐 ∈ (𝑝𝜇 ∗, 𝑝 (𝜇 ∗ + 𝜀))
       

This proves the third bullet of the lemma since as 𝜀 → 0, the probability of conversion by b∗ becomes almost
the one of b for any feasible b. To prove the second bullet, we prove that b∗ is the optimal solution when the
support is finite. We do so by taking 𝜀 to be small enough (put positive) so that P [𝑐 ∈ (𝑝𝜇 ∗, 𝑝 (𝜇 ∗ + 𝜀))] = 0.
This is possible since otherwise, a (𝑝, 𝑐) value in the support needs to satisfy 𝑐 − 𝑝𝜇 ∗ ∈ (0, 𝜀) for all 𝜀 > 0,
which is impossible. ■

47
E.2 Deferred Proof of Lemma 6.3

We first restate the lemma.

Lemma 6.3. Let 𝑊 (𝜇) and 𝑃 (𝜇) be the probability of winning and the expected payment of bid min{1, 𝑐𝜇 }.
Assume that 𝑊 ′ (𝜇), 𝑃 ′ (𝜇) ∈ [0, 1] are such that |𝑊 (𝜇) − 𝑊 ′ (𝜇)| ≤ 𝜀 and |𝑃 (𝜇) − 𝑃 ′ (𝜇)| ≤ 𝜀 for all 𝜇 ≥ 0.
Fix a 𝝁® such that 𝝁𝑚 = 0 (i.e., bids 1 at state 𝑚). Assume that 𝑚 ≥ 𝑐1¯ . Then we have

|𝑅 ′ ( 𝝁® ) − 𝑅( 𝝁® )| ≤ 36𝑚 2𝜀 and |𝐶 ′ ( 𝝁® ) − 𝐶 ( 𝝁® )| ≤ 39𝑚 3𝜀

Proof of Lemma 6.3. We will prove the lemma assuming that 𝜀 < 36𝑚 1
2 ; otherwise, the lemma is trivially
true since 𝑅( 𝝁® ), 𝑅 ′ ( 𝝁® ),𝐶 ( 𝝁® ), 𝐶 ′ ( 𝝁® ) ∈ [0, 1]. We first show the following claim. We use the notation ±𝑥 to
𝑥±𝑦 𝑥 −𝑦 𝑥+𝑦
denote the interval [−𝑥, 𝑥] and define all operations on it, e.g., 1±𝑧 = [ 1+𝑧 , 1−𝑧 ] as long as 𝑥, 𝑦 ≥ 0 and
0 ≤ 𝑧 < 1.
Claim E.1. Fix an integer 𝑚. Let 𝜀 ∈ (0, 1/𝑚) and 𝑞 1, 𝑞 2, . . . , 𝑞𝑚 ∈ [0, 1]. Then for all ℓ ∈ [𝑚] it holds

Ö ℓ
Ö ℓ
 Ö
(𝑞𝑖 ± 𝜀) ⊆ 𝑞𝑖 ± ℓ𝜀 + ℓ 2𝜀 2 ⊆ 𝑞𝑖 ± 2ℓ𝜀
𝑖=1 𝑖=1 𝑖=1

Proof. We use induction on ℓ. The claim is trivially true for ℓ = 1. Then for any ℓ + 1 we have
ℓ+1 ℓ
! ℓ+1
Ö Ö Ö 
2 2
(𝑞𝑖 ± 𝜀) ⊆ (𝑞 ℓ+1 ± 𝜀) 𝑞𝑖 ± (ℓ𝜀 + ℓ 𝜀 ) ⊆ 𝑞𝑖 ± ℓ𝜀 + ℓ 2𝜀 2 + 𝜀 + ℓ𝜀 2 + ℓ 2𝜀 3
𝑖=1 𝑖=1 𝑖=1
ℓ+1
Ö 
⊆ 𝑞𝑖 ± (ℓ + 1)𝜀 + (ℓ + 1) 2𝜀 2
𝑖=1

where in the last part we used that ℓ 2𝜀 2 + ℓ𝜀 2 + ℓ 2𝜀 3 ≤ ℓ 2𝜀 2 + ℓ𝜀 2 + ℓ𝜀 2 ≤ (ℓ + 1) 2𝜀 2 . This completes the


proof. ■

We now show that the time between conversions of 𝝁® is similar for both 𝑊 and 𝑊 ′ . Using Definition C.1:
𝑚−1 ℓ −1 𝑚−1

∑︁ Ö
′ 1 Ö
1 − 𝑊 ′ (𝝁 𝑖 )
 
𝐿 ( 𝝁® ) = 1 − 𝑊 (𝝁 𝑖 ) + ′
ℓ=1 𝑖=1
𝑊 (𝝁𝑚 ) 𝑖=1
𝑚−1 ℓ −1
! 𝑚−1
!
∑︁ Ö  1 Ö   
Claim E.1
⊆ 1 − 𝑊 (𝝁 𝑖 ) ± 2𝑚𝜀 + 1 − 𝑊 (𝝁 𝑖 ) ± 2𝑚𝜀 𝑊 (𝝁 ℓ ) ⊆𝑊 ′ (𝝁 ℓ )±𝜀
ℓ=1 𝑖=1
𝑊 (𝝁𝑚 ) ± 𝜀 𝑖=1
𝑚−1 ℓ −1   𝑚−1 !
1 2  
𝑊 (𝝁𝑚 )=¯
∑︁ Ö  Ö  𝑐
⊆ 1 − 𝑊 (𝝁 𝑖 ) ± 2𝑚 2𝜀 + ± 𝜀 1 − 𝑊 (𝝁 𝑖 ) ± 2𝑚𝜀 𝜀 ≤ 𝑐2¯
ℓ=1 𝑖=1
𝑊 (𝝁𝑚 ) 𝑐¯2 𝑖=1
 
® ± 2𝑚 2𝜀 + 2𝑚 𝜀 + 2 𝜀 + 4𝑚 𝜀 2 ⊆ 𝐿( b) ® ± 8𝑚 2𝜀

⊆ 𝐿( b) 1
𝜀 ≤ 2𝑚 ≤ 𝑐2¯
𝑐¯ 𝑐¯2 𝑐¯2

48
Using Appendix C.2, for any ℓ < 𝑚 we have
Îℓ −1 ′

′ 𝑖=1 1 − 𝑊 (𝝁 𝑖 )
𝜋ℓ ( 𝝁® ) =
𝐿 ′ ( 𝝁® )
Îℓ −1 ′

𝑖=1 1 − 𝑊 (𝝁 𝑖 )

𝐿( 𝝁® ) ± 8𝑚 2𝜀
Îℓ −1 ′

𝑖=1 1 − 𝑊 (𝝁 𝑖 ) 
⊆ ± 16𝑚 2𝜀 𝐿 ( 𝝁® ) ≥1≥16𝑚 2 𝜀
𝐿( 𝝁® )
Îℓ −1 
𝑖=1 1 − 𝑊 (𝝁 𝑖 ) ± 2𝑚𝜀 
⊆ ± 16𝑚 2𝜀 Claim E.1
𝐿( 𝝁® )
Îℓ −1 
𝑖=1 1 − 𝑊 (𝝁 𝑖 ) 
⊆ ± 18𝑚 2𝜀 𝐿 ( 𝝁® ) ≥1,𝑚≥1 (30)
𝐿( 𝝁® )

The same calculation as above give us



Î𝑚−1  Î𝑚−1 
′ ′ 𝑖=1 1 − 𝑊 (𝝁 𝑖 ) 𝑖=1 1 − 𝑊 (𝝁 𝑖 )
𝑊 (𝑏𝑚 )𝜋𝑚 ( 𝝁® ) = ⊆ ± 18𝑚 2𝜀
𝐿 ′ ( 𝝁® ) 𝐿( 𝝁® )
= 𝑊 (b𝑚 )𝜋𝑚 ( 𝝁® ) ± 18𝑚 2𝜀

Now we examine the average-time reward of 𝝁®


𝑚−1
∑︁
𝑅 ′ ( 𝝁® ) = 𝑟 (ℓ) − 𝑟 (ℓ − 1) 𝜋ℓ′ ( 𝝁® ) + (𝑟 (𝑚) − 𝑟 (𝑚 − 1))𝜋𝑚

( 𝝁® )𝑊 ′ (𝑏𝑚 )

ℓ=1
𝑚−1
∑︁    
⊆ 𝑟 (ℓ) − 𝑟 (ℓ − 1) 𝜋𝑙 ( 𝝁® ) ± 18𝑚 2𝜀 + 𝑟 (𝑚) − 𝑟 (𝑚 − 1) 𝜋𝑚 ( 𝝁® )𝑊 (b𝑚 ) ± 18𝑚 2𝜀
ℓ=1
𝑚−1
!
∑︁  
⊆ 𝑟 (ℓ) − 𝑟 (ℓ − 1) 𝜋𝑙 ( 𝝁® ) ± 18𝑚 2𝜀 + 𝑟 (𝑚) − 𝑟 (𝑚 − 1) 𝜋𝑚 ( 𝝁® )𝑊 (b𝑚 ) ± 18𝑚 2𝜀
ℓ=1
⊆ 𝑅( 𝝁® ) ± 36𝑚 2𝜀
Í
where in the second step we used that 𝑚−1
ℓ=1 (𝑟 (ℓ) − 𝑟 (ℓ − 1))𝜋 ℓ ( 𝝁
® ) ≤ maxℓ (𝑟 (ℓ) − 𝑟 (ℓ − 1)) ≤ 1 which
Í
follows from ℓ 𝜋ℓ ≤ 1. We now analyze the expected average payment
𝑚
∑︁ 𝑚
∑︁ 𝑚
∑︁
′ ′
𝐶 ( 𝝁® ) = 𝑃 ( 𝝁® )𝜋ℓ′ ( 𝝁® ) ⊆ (𝑃 ( 𝝁® ) ± 𝜀)𝜋ℓ′ ( 𝝁® ) ⊆ 𝑃 ( 𝝁® )𝜋ℓ′ ( 𝝁® ) ± 𝜀
ℓ=1 ℓ=1 ℓ=1

While we know a bound for 𝜋ℓ′ ( 𝝁® ) for ℓ < 𝑚, we have not proven one for 𝜋𝑚
′ (𝝁
® ). We have
′ (𝝁
𝜋𝑚 ® )𝑊 ′ (𝝁𝑚 ) 𝜋𝑚′ (𝝁
® )𝑊 ′ (𝝁𝑚 )

𝜋𝑚 ( 𝝁® ) = ⊆
𝑊 ′ (𝝁𝑚 ) 𝑊 (𝝁𝑚 ) ± 𝜀
′ ′
𝜋𝑚 ( 𝝁® )𝑊 (𝝁𝑚 ) 𝜋𝑚 ( 𝝁® )𝑊 (𝝁𝑚 )
 
2 2 𝑚2
⊆ ± 2𝜀 ⊆ ± 2 + 18 𝜀
𝑊 (𝝁𝑚 ) 𝑐¯ 𝑊 (𝝁𝑚 ) 𝑐¯ 𝑐¯
𝑚2
⊆𝜋𝑚 ( 𝝁® ) ± 20 𝜀
𝑐¯
′ (𝝁
where in the second ⊆ relation we used that 𝜀 ≤ 𝑐2¯ , in the next one that 𝜋𝑚 ® )𝑊 ′ (𝝁𝑚 ) ⊆ 𝜋𝑚 ( 𝝁® )𝑊 (𝝁𝑚 ) ±
1
18𝑚 2𝜀 and that 𝑊 (𝝁𝑚 ) = 𝑐.
¯ In the final one, we use that 𝑚 ≥ 𝑐¯ .

49
Using the bound for all 𝜋ℓ′ we have that
𝑚
∑︁
𝐶 ′ ( 𝝁® ) ⊆ 𝑃 ( 𝝁® )𝜋ℓ′ ( 𝝁® ) ± 𝜀
ℓ=1
𝑚−1  
∑︁
2  𝑚2
⊆ 𝑃 ( 𝝁® ) 𝜋ℓ ( 𝝁® ) ± 18𝑚 𝜀 + 𝑃 (𝝁𝑚 ) 𝜋𝑚 ( 𝝁® ) ± 20 𝜀 ± 𝜀
ℓ=1
𝑐¯
𝑚  
∑︁ 𝑚2 
⊆ 𝑃 ( 𝝁® )𝜋ℓ ( 𝝁® ) ± 18𝑚 3 + 20 + 1 𝜀 ⊆ 𝐶 ( 𝝁® ) ± 39𝑚 3𝜀 𝑃 (𝜇 ) ≤1
ℓ=1
𝑐¯

This completes the proof. ■

E.3 Deferred Proof Lemma 6.4

We first restate the lemma.


Lemma 6.4. Let 𝑊 (𝜇) and 𝑃 (𝜇) be the probability of winning a conversion and the expected payment when
bidding min{1, 𝑐𝜇 }. Let 𝑊𝑛 (𝜇) and 𝑃𝑛 (𝜇) be the empirical estimates of these two functions using 𝑛 samples
{(𝑝𝑖 , 𝑐𝑖 )}𝑖 ∈ [𝑛] , i.e.,
𝑛 𝑛
1 ∑︁ 1 ∑︁
and
   
𝑊𝑛 (𝜇) = 𝑐𝑖 1 𝑐𝑖 ≥ 𝜇𝑝𝑖 𝑃𝑛 (𝜇) = 𝑝𝑖 1 𝑐𝑖 ≥ 𝜇𝑝𝑖
𝑛 𝑖=1 𝑛 𝑖=1

Then, for all 𝛿 ∈ (0, 1) with probability at least 1 − 𝛿 it holds that for all 𝜇 ≥ 0
√︂ ! √︂ !
1 2 1 2
|𝑊𝑛 (𝜇) − 𝑊 (𝜇)| ≤ O log and |𝑃𝑛 (𝜇) − 𝑃 (𝜇)| ≤ O log
𝑛 𝛿 𝑛 𝛿

Proof of Lemma 6.4. We first prove the claim about the 𝑊 (·) function. First, for 𝑖 ∈ [𝑛] we define 𝑋𝑖 =
(𝑝𝑖 , 𝑐𝑖 ). Define the error of 𝑊𝑛 from 𝑊 :
 
𝑓 (𝑋 1, . . . , 𝑋𝑛 ) = sup |𝑊𝑛 (𝜇) − 𝑊 (𝜇)| = ∥𝑊𝑛 − 𝑊 ∥ ∞ = 𝑊𝑛 − E 𝑊𝑛
𝜇 ≥0 ∞

where the expectation in the right most part is taken over the pairs (𝑝𝑖 , 𝑐𝑖 ). The claim we want to make now
is that 𝑓 (𝑋 1, . . . , 𝑋𝑛 ) is small with high probability. We first bound its expectation, using [50, Theorem 23].
Because for each 𝑖, the function 𝑐𝑖 1 [𝑐𝑖 ≥ 𝜇𝑝𝑖 ] is non-increasing in 𝜇 and takes values in [0, 1] we get8 that

E [𝑓 (𝑋 1, . . . , 𝑋𝑛 )] ≤ O ( 1/ 𝑛). In addition, we notice that we can use McDiarmid’s inequality on 𝑓 since 𝑓
has the bounded differences property: for any 𝑥 1, . . . , 𝑥𝑛 = (𝑝 1, 𝑐 1 ), . . . , (𝑝𝑛 , 𝑐𝑛 ), any 𝑖, and any 𝑥𝑖′ = (𝑝𝑖′, 𝑐𝑖′ )
it holds

𝑓 (𝑥 1, . . . , 𝑥𝑖 , . . . , 𝑥𝑛 ) − 𝑓 (𝑥 1, . . . , 𝑥𝑖′, . . . , 𝑥𝑛 )

1 ∑︁   1   1 ∑︁  
= sup 𝑐 𝑗 1 𝑐 𝑗 ≥ 𝜇𝑝 𝑗 − 𝑊 (𝜇) − sup 𝑐𝑖 1 𝑐𝑖 ≥ 𝜇𝑝𝑖 + 𝑐 𝑗 1 𝑐 𝑗 ≥ 𝜇𝑝 𝑗 − 𝑊 (𝜇)
𝜇 ≥0 𝑛 𝜇 ≥0 𝑛 𝑛
𝑗 ∈ [𝑛] 𝑗 ∈ [𝑛]\{𝑖 }

1  1
≤ sup 𝑐𝑖 1 𝑐𝑖 ≥ 𝜇𝑝𝑖 − 𝑐𝑖′ 1 𝑐𝑖′ ≥ 𝜇𝑝𝑖′ ≤ 1
  
𝜇 ≥0 𝑛 𝑛
8 [50, Theorem 23] requires that the functions are non-decreasing which we can get by taking the sum over over 𝑖 of 1 −
𝑐𝑖 1 [𝑐𝑖 ≥ 𝜇𝑝𝑖 ].

50
Using McDiarmid’s inequality we get that for all 𝛿 > 0, with probability at least 1 − 𝛿 it holds
√︂   √︂
  𝑛 1 1 1 1
𝑓 (𝑋 1, . . . , 𝑋𝑛 ) ≤ E 𝑓 (𝑋 1, . . . , 𝑋𝑛 ) + log ≤ O √ + log
2 𝛿 𝑛 2𝑛 𝛿

where the second inequality holds by the bound on E [𝑓 (𝑋 1, . . . , 𝑋𝑛 )]. This proves the bound for 𝑊𝑛 (·).
We can prove a similar bound for the approximation 𝑃𝑛 (·) using the exact same steps. Using the union
bound on these two bounds yields the lemma. ■

E.4 Deferred Proof of Theorem 6.1

We now finally prove Theorem 6.1, which follows as a combination of Lemmas 6.3 and 6.4 and Linear
Program 10. We first restate the theorem.

Theorem 6.1. Fix 𝑚 ∈ N such that 𝑚 ≥ log𝑇 . Using price/conversion rate samples (𝑝 1, 𝑐 1 ), . . . , (𝑝𝑡 , 𝑐𝑡 ),
¯
2
𝑐𝜌
we can construct an empirical vector of mappings from contexts to bids b®𝑡 with a linear program of size O (𝑡𝑚)
constrained to bid 1 in state 𝑚. For this vector, for any 𝛿 > 0, with probability at least 1 − 𝛿, the expected
average payment and reward are
√︂ ! √︂ !
1 1 1 1
𝐶 ( b®𝑡 ) ≤ 𝜌 + O 𝑚 3 log =⇒ 𝜀𝐶 (𝑡) ≤ O 𝑚 3 log
𝑡 𝛿 𝑡 𝛿
√︂ ! √︂ !
𝑚 3 1 1 𝑚 3 1 1
𝑅𝑚 ( b®𝑡 ) ≥ OPT𝑀 inf
−O log =⇒ 𝜀𝑅 (𝑡) ≤ O log
𝜌 𝑡 𝛿 𝜌 𝑡 𝛿

where 𝜀𝐶 (𝑡) and 𝜀𝑅 (𝑡) are as in Theorem 5.1.

Proof of Theorem 6.1. We use the empirical estimates of Lemma 6.4. Let 𝑅 ′ (·) and 𝐶 ′ (·) denote the resulting
expected average reward and payment of these distributions. To keep consistent with the notation in the
statement of Theorem 6.1 we use vectors of mappings from contexts to bids as the proposed solution, b, ®
instead of vectors of multipliers, 𝝁® (which was the notation in Lemmas 6.3 and 6.4). We can calculate the
optimal solution using the Linear Program of (10), since the empirical distributions have finite support.
Let b®𝑡 be the resulting distribution.
Assume that the error bounds of Lemma 6.4 hold, which happens with probability at least 1 − 𝛿. This
means that √︂ ! √︂ !
1 1 1 1
𝐶 ( b® ) ≤ 𝐶 ( b® ) + O 𝑚
𝑡 ′ 𝑡 3
log ≤𝜌+O 𝑚 3
log
𝑡 𝛿 𝑡 𝛿

where the first inequality holds by Lemmas 6.3 and 6.4 and the second hold by the fact that b®𝑡 is feasible
for the cost 𝐶 ′ (·).
The calculation for the reward is a bit more complicated.
 Fix any b® such that 𝐶 ( b)
® ≤ 𝜌, implying also
√︃
® ≤ 𝜌 + 𝜀 where 𝜀 = O 𝑚 3 1 log 1 (similar to the calculation above). This means that b® is
that 𝐶 ′ ( b) 𝑡 𝛿

not necessarily feasible when 𝑅 ′ (·) and 𝐶 ′ (·) are the reward and payment. For this reason, we define b®′
as the combination of b and the bidding that always bids 0. Specifically, we define b®′ by taking a convex
combination of the occupancy measures or b® and the always bidding 0 solution. We take 1 − 𝜌𝜀 times the

51
occupancy measure of b® and 𝜀
𝜌 the occupancy measure of always bidding 0. This results in the following
average payment of b®′ :    
′® ′ 𝜀 ′ ® 𝜀
𝐶 ( b ) = 1 − 𝐶 ( b) ≤ 1 − (𝜌 + 𝜀) ≤ 𝜌
𝜌 𝜌

The above makes b®′ feasible. Because b®𝑡 is constrained to bid 1 at state 𝑚, we cannot directly compare it
to b®′ . For this reason we need Theorem 4.4. This makes
√︂ ! √︂ !  
1 1 1 1 𝑚 2
Theorem 4.4 and 𝑚≥ 𝑐𝜌 log𝑇
𝑅( b® ) ≥ 𝑅 ( b® ) − O 𝑚
𝑡 ′ 𝑡 2
log ≥ 𝑅 ( b® ) − O 𝑚
′ ′ 2
log − ¯
𝑡 𝛿 𝑡 𝛿 𝑇 𝐶 ′ ( b®′ ) ≤𝜌
√︂ !
1 1  
≥ 𝑅( b® ) − O 𝑚
′ 2 2
log Lemmas 6.3 and 6.4, 𝑚 𝑚
𝑇 ≤ 𝑡

𝑡 𝛿
  √︂ ! √︂ !
𝜀 1 1 𝑚 3 1 1
® − O 𝑚2 ® −O

≥ 1 − 𝑅( b) log ≥ 𝑅( b) log definition of b®′
𝜌 𝑡 𝛿 𝜌 𝑡 𝛿
 √︃ 
1 1
where the last inequality holds by definition of 𝜀 = O 𝑚 𝑡 log 𝛿 . The above 𝑅(·) functions referred to
3

when the per-round reward function was 𝑟𝑚 . We get the result for the case when the per-round reward
function is 𝑟 𝑀 my using Theorem 4.4 once more. ■

52

You might also like