SQR: Balancing Speed, Quality and Risk in Online Experiments
ABSTRACT

Controlled experimentation, also called A/B testing, is widely adopted to accelerate product innovations in the online world. However, how fast we innovate can be limited by how we run experiments. Most experiments go through a "ramp up" process where we gradually increase the traffic to the new treatment to 100%. We have seen huge inefficiency and risk in how experiments are ramped, and it is getting in the way of innovation. This can go both ways: we ramp too slowly and much time and resource is wasted; or we ramp too fast and suboptimal decisions are made. In this paper, we build a ramping framework that can effectively balance among Speed, Quality and Risk (SQR). We start out by identifying the most common mistakes experimenters make, and then introduce the four SQR principles corresponding to the four ramp phases of an experiment. To truly scale SQR to all experiments, we develop a statistical algorithm that is embedded into the process of running every experiment to automatically recommend ramp decisions. Finally, to complete the whole picture, we briefly cover the auto-ramp engineering infrastructure that collects the inputs and executes the recommendations in a timely and reliable fashion.

CCS CONCEPTS

• Mathematics of computing → Probabilistic inference problems • Computing methodologies → Causal reasoning and diagnostics

KEYWORDS

A/B testing, experimentation, ramp, controlled experiment, causal inference, speed, risk, quality

1 INTRODUCTION

There is no doubt that experimentation, or A/B testing, has become a driving force of innovation in the online world. It is not just the established players who have bought into the value of experimentation, as shared in several past papers from Microsoft, Google and LinkedIn [1,2,3]. Startups and smaller websites have also invested in building out their experimentation programs as a necessity of growth [4].

One primary reason that companies have relied on A/B testing is that it accelerates product innovation. It is as if we have a "crystal ball" that can tell us how our users will react and how the business metrics will change if a new feature is rolled out. We can learn quickly, and ultimately, build better products faster. However, how fast we innovate can also be limited by how we run experiments. This becomes more apparent the more experiments we run. At companies like LinkedIn, where experimentation is truly embraced as a default step in every product change, this issue is even more magnified.

At LinkedIn, we run over 5,000 experiments a year. Every experiment goes through a "ramp up" process. A new feature usually starts out by ramping to a small percentage of users, waits to see if metrics are good, and then repeats by ramping up to a higher percentage, until finally it reaches 100%. Such a "ramp up" process is standard practice across the industry to control the unknown risks associated with any new feature launch. However, many times the way we ramp slows us down. This can go both ways: we ramp too slowly and much time and resource is wasted; or we ramp too fast and suboptimal decisions are made. On average, an experiment at LinkedIn takes four ramps to reach 100%, where each ramp takes about six days – that's almost a month from start to finish! In addition, experimenters tend to treat each of the incremental ramps the same. For example, on average, we spend 6 days waiting on a 5% ramp, and 6.5 days on a 50% ramp. So having 4 ramps for one experiment is almost equivalent to running 4 separate experiments sequentially!

The message behind these numbers is loud and clear: while we can democratize experimentation with a fully self-served platform [3], we need principles to guide us on how we should ramp. Running slower doesn't mean we are safer. Taking longer to finish an experiment doesn't mean we are more cautious regarding negative user impact. It is important to point out that while there is a need for "speed", the principles should not be driven by "ramping as quickly as possible", but by "ramping with the right balance of Speed, Quality and Risk". The core of our paper is to answer the following question: how can we go fast while controlling risk and improving decision quality?
At first, the answer may seem to lie with "power analysis." Power analysis is widely used to calculate the minimum sample size required to detect a given effect size [5]. In the world of online A/B testing, where samples trickle into experiments continuously, this usually translates to deciding on both the percentage of traffic ramped to the treatment (ramp percentage) and the duration of the experiment. However, power analysis fails to serve our needs. (1) It can only be performed for one metric at a time. With hundreds of metrics we monitor closely for each experiment [3], a summary from power analyses of individual metrics becomes uninterpretable. (2) What is considered "enough power" can be different depending on the stage of the ramp process. In general, we can tolerate a lower power (i.e. higher Type II error) during earlier ramps. We will discuss this more formally in Section 4. (3) Power analysis alone is not actionable. For most experimenters, power is a hard concept to digest, and it is even harder to take action upon. If my experiment does not have enough power, what should I do?

After several attempts at making our ramp process more scientific and efficient, including an introduction and deprecation of power analysis on our experimentation platform, we have arrived at a solution that has been hugely successful at LinkedIn, called SQR, for Speed, Quality and Risk. The SQR framework divides the whole ramp process into four phases, each with a primary goal to achieve. The first phase is mainly for risk mitigation, so the SQR framework focuses on trading off between speed and risk. The second phase is for making precise measurement, so the focus is on trading off between speed and quality. The last two phases are optional and are used to address additional operational concerns (third phase) and long-term impact (fourth phase). The first two phases are common to all experiments and are usually the only two phases an experiment needs. Most of our effort in this paper is dedicated to addressing how to ramp through these two phases.

In most cases, each step in ramping an experiment is done by engineers and product managers. The principles that SQR offers need to be as simple and straightforward as possible to be accessible to all experimenters. While we can gain a lot of efficiency simply by following the principles manually, we need to embed the principles as part of the ramp process in an automatic fashion to fully scale to all experiments. This requires a whole suite of solutions, including both a statistical algorithm that can recommend ramp decisions and an engineering infrastructure that can collect input and execute on the recommendations reliably. We will discuss the algorithm in depth in Section 4, as it is probably most interesting to the KDD community, while briefly going over the auto-ramp infrastructure in Section 5 to complete the whole picture.

Here is a summary of our contributions from this paper:
• As far as we know, we are the first to conceptualize the need of balancing among speed, quality and risk in online experiments, and also the first to study it extensively.
• We offer simple and straightforward principles that any practitioner can follow to effectively balance speed, quality and risk for their experiments.
• We develop rigorous statistical procedures that algorithmically recommend next steps in the experiment ramping process to achieve SQR.
• We share the engineering infrastructure that allows us to automatically ramp experiments in a reliable and timely fashion.

The paper is organized as follows. Section 2 starts with a review of the existing literature in both areas of A/B testing and sequential experiments. Section 3 develops the SQR ramping framework, describing common mistakes that people make and offering easy-to-follow principles to guide experimenters. Section 4 focuses on the Ramp Recommender algorithm that sets the foundation of automating the ramping process. Section 5 briefly goes over the auto-ramp infrastructure. Section 6 concludes with future work.

2 LITERATURE REVIEW

In this section we review the evolution of controlled experiment theories and applications, especially in the field of sequential testing. The foundation of experimentation theory was first introduced by Sir Ronald A. Fisher in the 1920s with a focus on agricultural activities [6]. Since then, this subject has been studied by many researchers in papers and textbooks [7,8]. Controlled experimentation has gained popularity beyond the original agricultural and manufacturing industries, and has been widely adopted across companies in both commercial and tech sectors. In particular, past papers have discussed experimentation at scale at Microsoft Bing, Google, Facebook and LinkedIn [1,2,9,3], sharing success stories, best practices and pitfalls.

Despite the benefits of running an experiment, the associated cost and logistical constraints cannot be overlooked. Being able to stop an experiment early and iterate faster has been a focus for many researchers. Abraham Wald first formulated and studied the sequential testing problem [10]. In the past seventy years, researchers have been drawn to this field and contributed to its theoretical foundation [11,12]. Sequential testing methodology has been widely adopted in clinical trials [13]. Recently, it has gained popularity in educational/psychological testing [14]. Although many companies adopt online A/B testing, as far as we know, among large scale online A/B testing platforms, Optimizely is the first to apply sequential testing methodology in the experiment evaluation process [15]. However, the application there focuses on sequential monitoring of a single-stage A/B experiment with no practical considerations on balancing speed, quality and risk. In this paper, we utilize sequential testing and tailor it to fit the goals of the different ramping stages.

In this paper, we leverage extensively the basic terminology and knowledge of the field of A/B testing, such as hypothesis testing, t-test, power, etc. Readers who are not familiar with these topics are encouraged to read the survey and practical guide by Kohavi et al. [5].

3 SQR FRAMEWORK

As we discussed in Section 1, an online experiment usually goes through a "ramp up" process, where it starts with a small percent of traffic for the new treatment and then gradually increases the percentage to 100%. On a fully self-served platform, experimenters are free to choose whichever incremental ramps they want, any fractions between 0% and 100%. As a matter of fact, people chose over 300 unique ramp sequences for their experiments at LinkedIn in 2015, with an average of four ramps per experiment. We may expect some diversity in the ramp process due to the diversity of our experiments, but 300 is an astonishingly big number. People are also spending a lot of time at each ramp. Figure 1 plots the distribution of time spent on ramps at a given ramp percentage.
While 1% ramps tend to go a little faster, all the other common incremental ramps tend to take a similar amount of time to finish (about 6 days on average). All these numbers are indications that leaving people to decide how they ramp can be extremely inefficient and that it is getting in the way of innovation.

Figure 1: Distribution of time spent on ramps at a given ramp percentage.

[…] same effect size, we are more likely to get statistically significant results. However, the tradeoff between running longer vs. ramping up is rarely considered.
Mistake #3: We have enough users at the 10% ramp. Let's ramp to 100%.

It is hard for people to believe that at the scale of LinkedIn's traffic, when we have millions of users visiting our site every day, we still need more users for our experiments. The reason is two-fold. First, many of our experiments are effective for only a small subset of our users. As a matter of fact, 60% of our experiments are triggered for less than 10% of active users. So the real sample size is much smaller than millions. Second, many of our metrics are extremely volatile, especially revenue related metrics. These metrics need a higher volume for the normality assumption to be plausible according to the Central Limit Theorem [17]. Their variance is usually too high for conclusions to be drawn at a lower ramp. Because of these reasons, there is always a need for better resolution with more users, so that experiments with real impact can be identified (even if the impact is unintended). We have seen similar discussions from Google [2] and Microsoft [18] with their scale of traffic too.

[…] what if something goes wrong? That's why we usually start at a small ramp, with the goal to contain impact and mitigate potential risk.

In certain situations, we also need intermediate ramps between MPR and 100%. For example, it is often the case that for operational reasons we need to make sure the new services or endpoints are able to handle the increasing traffic load. In these cases, we need to stop at extra ramps (e.g. 75%) to make sure the service metrics are stable before fully ramping to 100%. Another common example is to learn. While learning should be part of every ramp, we sometimes conduct a long-term holdout ramp primarily for learning purposes. This is a ramp where only a small fraction of users are kept out of the new experience (e.g. 5%) for a long time, usually over a month. The goal is to learn whether the impact measured during MPR is sustainable in the long run.
[…] notice some unexpected impact on metrics. At that time, the risk is reassessed to be higher according to the data and we may not be able to ramp further. Another important factor to include in risk assessment is trigger rate. By and large, trigger rate refers to the percentage of traffic volume that is affected by the experiment. As we mentioned earlier when discussing Mistake #3, not all experiments impact all users. For example, some experiments are only triggered if a user is on an older mobile app version. For such experiments with low trigger rate, the total impact tends to be smaller, and hence the risk should be lower as well.

Principle #2: Spend enough "waiting" time at MPR.

Because MPR is the ramp dedicated to measuring the impact of the experiment, it is crucial that we not only have the most precision but also capture any other time-dependent factors. For example, an experiment that runs for only one day will have results that are biased towards heavy users. Another example is the burn-in effect. Many of our new features that involve drastic UI changes have impact (either positive or negative) that dies down over time. Because there is usually little gain in precision after one week and because we want to capture at least one full week, we advise all our teams to keep their experiment at MPR for at least a week, and longer if a burn-in effect is present.

Principle #3: Conduct quick post-MPR ramps if there are operational concerns.

By the time an experiment is past the MPR phase, there should be no additional concerns regarding end user impact. In most cases, operational concerns should also be resolved in earlier ramps. There are some cases where we worry about increasing traffic load to some engineering infrastructure, which warrants incremental ramps before going to 100%. These ramps should only take a day or less, usually covering the peak traffic period, with close monitoring.

Principle #4: Conduct optional long-term holdout ramps only if the learning objectives are clear.

We have seen an increasing popularity of long-term holdout ramps. In most cases, the experiment is left running for a long time, and eventually terminated without any learning. The decision to do a long-term holdout should be made with a clear objective. In general, we should only do a long-term holdout if we have a strong belief that users may interact with the new feature differently over time. It is not a default last step in the SQR ramping model, nor is it a panacea for any questions left over from an earlier ramp.

We have identified a few scenarios where a long-term holdout can be helpful. The most common scenario is when an experiment has a long-lasting burn-in effect. There are some known experiment areas where burn-in exists by design, such as the People You May Know recommendation [3]. Another example is when a feature takes time for users to discover. Even though the short-term observed impact is zero, we believe there is benefit in the long term. The last example is when the experiment shows big impact during MPR on topline business metrics. Because such impact is to be baked into financial forecasting, we need to know whether the effect sustains in the long run.

4 RAMP RECOMMENDER

To execute the SQR principles, we rely on an algorithm-driven approach to automate the ramping process as much as possible. One core piece of the automation is the underlying algorithm that provides the statistical foundation. This section is dedicated to introducing the ramp recommendation algorithm. The engineering infrastructure that executes the recommendation will be covered in Section 5.

The SQR framework suggests that we should ramp as quickly as possible to the Maximum Power Ramp, and spend enough time to measure experiment impact at MPR. Correspondingly, the Ramp Recommender performs two tasks depending on the ramp phase: (1) Guide the ramps towards MPR. (2) Give the signal to ramp up from MPR. We will cover each in Sections 4.1 and 4.2 respectively. The last two phases of the ramp process, post-MPR and long-term holdout, are both optional with relatively simple ramp criteria, and hence are omitted from the discussions here.

4.1 Ramping towards MPR

This is the phase where we need to effectively trade off between speed and risk concerns. We would like to ramp as quickly as possible to MPR, but we need to make sure the risk is tolerable. We formulate the problem first by quantifying risk and then by developing a rigorous statistical procedure to test the risk. Intuitively, at every time point $t$, we utilize all the information we have up to that point, including both prior knowledge and information learnt from data, to answer the question whether the risk is small enough to ramp up, and if so, ramp to what percentage of traffic. We start by assuming there is only one metric of interest, and then extend to cases with multiple metrics. Without loss of generality, we use "day" as the default time unit.

4.1.1 Risk and Tolerable Risk. We start with quantifying risk. Because we ultimately need to translate "risk" into a binary decision of ramping up or not, our definition includes both risk and risk tolerance. Intuitively, if the risk is below the risk tolerance, we can ramp up. Otherwise, we cannot. We define the risk of ramping to traffic percentage $q$ as
$$R(q) = |\delta| \cdot g(r) \cdot h(q)$$
where
$$\delta = \frac{\text{treatment mean} - \text{control mean}}{\text{control mean}}$$
captures the relative impact on the triggered population,
$$g(r) = \begin{cases} r, & r \ge r_0 \\ r_0, & r < r_0 \end{cases}$$
is the trigger rate $r$ truncated at $r_0$, and
$$h(q) = \begin{cases} q, & q \ge q_0 \\ q_0, & q < q_0 \end{cases}$$
is the ramp percent $q$ truncated at $q_0$. Naturally, the risk is higher for a higher trigger rate experiment that has a bigger impact and is ramped to more users. The reason that we choose a truncated version of both trigger rate and ramp percent is because we consider a really bad experiment (large $\delta$) too risky for our users even if it is only impacting a small set of users. Other reasonable, monotonically increasing functions can be used depending on the business consideration. Also note that we do not restrict risk to negative metric impact only. We have seen many real examples where a positive move on a metric turns out to be bad [20].

We say that the risk of ramping to $q$ is tolerable if it is below the risk tolerance $\tau$, i.e.
$$R(q) \le \tau,$$
where $\tau$ is set based on business requirements and is different for different metrics. It is usually decided by metric owners through answering "As an organization, we do not want any experiment to hurt the overall health of metric X by more than $\tau$ for a day."
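To make the definition concrete, here is a minimal Python sketch of the risk computation and tolerance check described above; the truncation points, the tolerance and the example inputs are illustrative values we picked for exposition, not the ones used in production.

```python
def risk(delta, trigger_rate, ramp_pct, r0=0.05, q0=0.05):
    """R(q) = |delta| * g(r) * h(q), the risk of ramping to `ramp_pct`.

    delta:        relative impact on the triggered population,
                  (treatment mean - control mean) / control mean
    trigger_rate: fraction of traffic the experiment actually affects (r)
    ramp_pct:     candidate ramp percentage q, as a fraction in (0, 1]
    r0, q0:       truncation points (illustrative values)
    """
    g = max(trigger_rate, r0)   # g(r): trigger rate truncated at r0
    h = max(ramp_pct, q0)       # h(q): ramp percent truncated at q0
    return abs(delta) * g * h


def tolerable(delta, trigger_rate, ramp_pct, tau):
    """True if ramping to `ramp_pct` keeps the risk within the tolerance tau."""
    return risk(delta, trigger_rate, ramp_pct) <= tau


# Example: a 2% metric drop, 30% trigger rate, candidate 25% ramp, and a
# (hypothetical) tolerance of 0.5% on the overall daily health of the metric.
print(tolerable(delta=-0.02, trigger_rate=0.30, ramp_pct=0.25, tau=0.005))
# risk = 0.02 * 0.30 * 0.25 = 0.0015 <= 0.005, so this prints True
```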
4.1.2 Hypothesis Testing. With risk defined, we are now ready to formulate the problem in terms of hypothesis testing. Let $Q = \{q_1, q_2, q_3, q_4, \dots\}$ be the set of possible ramps. For practical reasons, we want to restrict the cardinality of $Q$ to a reasonable number and use only representative ramp percentages. For example, at LinkedIn we use $Q = \{1\%, 5\%, 10\%, 25\%, 50\%\}$.

The first ramp is determined based on the initial risk assessment by the experimenter. Naturally, the higher the risk is, the smaller the first ramp. With data from the initial ramp, we can then test to see whether risk is low enough for us to ramp further to the next ramp $q$. We can formulate this in the following hypothesis test. For any potential next ramp $q \in Q$, we have:
$$H_0^q: R(q) \le \tau, \qquad H_1^q: R(q) > \tau \qquad (1)$$
Notice that the risk function is monotonically increasing in the ramp percent $q$. Therefore, for any $q_1 \le q_2$, if $H_0^{q_2}$ is accepted then $H_0^{q_1}$ is also accepted. In other words, if we can safely ramp to $q_2$, we can safely ramp to a smaller ramp $q_1$ as well. The Ramp Recommender takes a greedy approach that picks the maximum ramp among all feasible ramps. After the experiment is ramped to $q$, the hypothesis testing process repeats until reaching MPR.

Following the first SQR principle, we would like to ramp quickly and efficiently during the pre-MPR phase. Therefore, instead of using a fixed duration for each ramp, we want to evaluate the possibility of ramping up continuously as more data become available. With the large sample sizes that come with online experiments, the t-test or z-test [5] can be the choice to determine which region $R(q)$ falls into. However, performing the above hypothesis testing every day can easily inflate the Type I/II error rates due to multiple testing [17]. For example, if we continuously test for risk, and ramp up as soon as the risk is lower than the threshold, we have a higher chance of ramping up a risky feature. Sequential testing, however, allows continuous monitoring of the risk $R(q)$ while restricting Type I/II errors. We introduce sequential hypothesis testing techniques in the section below.

4.1.3 Sequential Testing. We use the Generalized Sequential Probability Ratio Test (GSPRT) [21]. At time $t$, the test statistic for $H_k^q$ is constructed as follows:
$$L_t(H_k^q) = \frac{\sup_{H_k^q} \pi_k f_k^t(\boldsymbol{X}_t)}{\sum_{j=0}^{1} \sup_{H_j^q} \pi_j f_j^t(\boldsymbol{X}_t)}, \qquad k = 0, 1 \qquad (2)$$
where $f_k^t$ is the likelihood function, $\boldsymbol{X}_t = (X_1^t, X_2^t, \dots)$ is the user-level metric value up to time $t$, and $\pi_k$ is the prior probability for hypothesis $H_k^q$. Note that it is important to consider prior risk assessment. In most cases, experimenters have an expectation whether their experiments may be impacting certain metrics or not. For example, many of our infrastructure experiments are not expected to move metrics at all, and hence the priors for such experiments should have $\pi_0 \gg \pi_1$.

Following GSPRT, the hypothesis $H_k^q$ is accepted if
$$L_t(H_k^q) > \frac{1}{1 + A_k}$$
with $A_k$ chosen to control the errors of accepting $H_k^q$ incorrectly. Note that since the posterior probabilities satisfy $L_t(H_0^q) + L_t(H_1^q) = 1$, we can choose $0 < A_k < 1$ to ensure that at most one hypothesis $H_k^q$, $k = 0, 1$, is accepted [21].

Figure 4 below demonstrates the three regions that the test statistic $L_t(H_0^q)$ can fall in: the acceptance region, the waiting region and the rejection region. Note that we can construct an equivalent set of regions based on $L_t(H_1^q)$ with thresholds $A_0/(1 + A_0)$ and $1/(1 + A_1)$. If $L_t(H_0^q)$ falls into the rejection region, the experiment is considered too risky to be ramped to $q$. If $L_t(H_0^q)$ falls into the acceptance region, it is considered safe to ramp to $q$. If $L_t(H_0^q)$ falls in between, we do not have enough evidence to support either of the hypotheses, and hence we keep the experiment running at the current ramp to collect more data.

Figure 4: Three regions the test statistic can fall into.

The explicit form of $f_k^t$ is unknown and different for different metrics. When the sample sizes are large, based on the multivariate Central Limit Theorem, the likelihood function of $\Delta$, the relative difference of the sample means, approaches normal [17]. The test statistic in Equation (2) becomes
$$L_t(H_k^q) = \frac{\sup_{H_k^q} \pi_k \exp\left(-\frac{(\Delta - \delta)^2}{s^2}\right)}{\sum_{j=0}^{1} \sup_{H_j^q} \pi_j \exp\left(-\frac{(\Delta - \delta)^2}{s^2}\right)}$$
where $s^2$ is the variance of $\Delta$ that can be estimated from the data and $\delta$ is the population parameter defined earlier that we are testing in the hypotheses in Equation (1). We omit the time parameter $t$ in some notations here to keep them easier to read.

To see how we can choose $A_k$ to control the errors of accepting $H_k^q$ incorrectly, let $\alpha_0$ be the probability that $H_0^q$ is accepted if $H_1^q$ is true. Similarly, $\alpha_1$ is the probability of accepting $H_1^q$ if $H_0^q$ is true. Note that $\alpha_0$ is equivalent to the "usual" Type II error, while $\alpha_1$ corresponds to the "usual" Type I error. Intuitively, assuming $H_1^q$ is true, it is less likely to accept $H_0^q$ incorrectly with a smaller $A_0$ (and thus a bigger $1/(1 + A_0)$). It has been shown that the errors $\alpha_k$ can be bounded by the choices of $A_k$, i.e. $\alpha_k \le A_k$ [21].

To guide organizations on what bounds to use for such "Type I and II" errors, it is helpful to think about these errors as a tradeoff between speed and risk. When a Type I error is made, the algorithm says we shouldn't ramp up when we should, so we are going too slowly. When a Type II error is made, we ramp up when the risk is actually higher than the threshold we set, so we are taking too much risk. At LinkedIn, we have infrastructure in place to identify bad experiments quickly, so we are comfortable with a higher Type II error and a lower Type I error in the pre-MPR phase: $A_0 = 0.2$, $A_1 = 0.01$.

Putting it together, the algorithm follows the steps below (see the sketch after this list). On any day $t$,
1) If $L_t(H_1^q) > 1/(1 + A_1)$ for every possible $q \in Q$, we accept $H_1^q$. The metric impact is deemed to be severely worse than the tolerable threshold. We cannot ramp the treatment further.
2) If for some $q \in Q$, $L_t(H_0^q) > 1/(1 + A_0)$, we can accept $H_0^q$. Ramp $q$ is deemed risk tolerable. As we discussed before, if multiple $q$'s are risk tolerable, we take the greedy approach and ramp to the largest acceptable $q$.
3) Otherwise, there is not enough evidence supporting any of the hypotheses. Continue running the experiment on day $t + 1$ and evaluate $L_{t+1}$.
4) If by day $t = 7$, none of the hypotheses can be accepted, we assume no effect is detected, and recommend ramping.
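The sketch below (Python, illustrative only) mirrors this daily decision loop for a single metric: it computes the normal-approximation posteriors $L_t(H_0^q)$ and $L_t(H_1^q)$ and applies steps 1)-4). Taking the supremum of the likelihood at the boundary of each hypothesis region, the default priors, and ramping to the largest candidate in step 4) are simplifying assumptions on our part, not details specified above.

```python
import math


def gsprt_posteriors(delta_hat, s2, delta_max, pi0=0.5, pi1=0.5):
    """Normal-approximation posteriors L_t(H0^q) and L_t(H1^q) for one metric.

    delta_hat: observed relative difference Delta between treatment and control
    s2:        estimated variance of Delta
    delta_max: largest |delta| still consistent with H0, i.e. tau / (g(r) * h(q))
    pi0, pi1:  prior probabilities for H0 and H1 (illustrative defaults)
    """
    # Take the sup of the normal likelihood at the point of each hypothesis
    # region closest to the observed Delta (our simplification).
    gap0 = max(abs(delta_hat) - delta_max, 0.0)  # distance from Delta to the H0 region
    gap1 = max(delta_max - abs(delta_hat), 0.0)  # distance from Delta to the H1 region
    l0 = pi0 * math.exp(-gap0 ** 2 / s2)
    l1 = pi1 * math.exp(-gap1 ** 2 / s2)
    return l0 / (l0 + l1), l1 / (l0 + l1)


def daily_decision(delta_hat, s2, tau, trigger_rate, ramps, day,
                   A0=0.2, A1=0.01, pi0=0.5, pi1=0.5, r0=0.05, q0=0.05):
    """One day of the pre-MPR loop for a single metric, following steps 1)-4)."""
    posteriors = {}
    for q in ramps:                               # e.g. ramps = [0.05, 0.10, 0.25, 0.50]
        delta_max = tau / (max(trigger_rate, r0) * max(q, q0))
        posteriors[q] = gsprt_posteriors(delta_hat, s2, delta_max, pi0, pi1)
    # Step 1: H1 accepted for every candidate ramp -> too risky to ramp further.
    if all(L1 > 1.0 / (1.0 + A1) for _, L1 in posteriors.values()):
        return ("stop", None)
    # Step 2: greedy choice -- the largest q whose H0 is accepted.
    feasible = [q for q, (L0, _) in posteriors.items() if L0 > 1.0 / (1.0 + A0)]
    if feasible:
        return ("ramp", max(feasible))
    # Step 4: still undecided after a week -> assume no effect and recommend ramping
    # (ramping to the largest candidate here is our assumption about the target).
    if day >= 7:
        return ("ramp", max(ramps))
    # Step 3: otherwise keep collecting data at the current ramp.
    return ("wait", None)
```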
4.1.4 Multiple Metrics. We have so far assumed there is only one metric of interest. The reality is that for most experiments, there are over one hundred metrics that are closely watched. There are company-wide metrics that are the most important for all experiments. In addition to that, each experiment has its own success metrics. These metrics tend to capture more direct impact from the experiment and are likely to be local to a particular product area. As we mentioned in Section 4.1.3, a different prior risk $\pi_k$ can be used for each metric depending on how likely we expect it to be impacted by the experiment. From our experience, simplifying the prior risk input to only two or three categories helps make it actionable for everyone.

We can follow the testing procedure in Section 4.1.3 for every important metric, but we still need to combine the results and come up with a single ramp decision. Naturally, if one metric is truly impacted beyond our tolerance, we should not ramp up. On the other hand, we should ramp up only if we are confident that risk is small across all the important metrics. However, if we accept $H_1^q$ as long as it is accepted for one metric, we inflate the chance of accepting $H_1^q$ incorrectly (Type I error) due to multiple testing; similarly, if we accept $H_0^q$ only if it is accepted for all metrics, we are too conservative (Type II error).

We solve the former by leveraging the work on false discovery rate (FDR). As one of the most popular FDR-control procedures, Benjamini–Hochberg [22] controls FDR by comparing p-values from individual hypothesis tests with adjusted thresholds (even under dependency [23]). We can adopt a similar procedure on $L_t(H_1^q)$. Suppose $L_t^{(1)}(H_1^q), \dots, L_t^{(M)}(H_1^q)$ are the $L_t(H_1^q)$ values from each metric sorted in descending order, where $M$ is the total number of metrics. Instead of comparing against a fixed threshold $1/(1 + A_1)$ in Step 1) above, we use
$$L_t^{(m)}(H_1^q) > \frac{1}{1 + \frac{m A_1}{M}}$$
We accept $H_1^q$ when the comparison is true for at least one metric $m = 1, \dots, M$.

On the other hand, to make sure we are not inflating false negatives, we ramp to $q$ when the following two conditions are met: (1) $H_1^q$ is not accepted according to the procedure above, and (2) $H_0^q$ is accepted for the majority of metrics. Here we define the majority using the threshold equivalent to $\alpha_0$, i.e. 80%.
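This combination rule can be summarized in a few lines of Python. It is a sketch under our reading of the procedure: returning "wait" when neither condition is met, and interpreting "majority" as at least 80% of all monitored metrics, are assumptions on our part.

```python
def combine_metrics(L0_values, L1_values, A0=0.2, A1=0.01, majority=0.8):
    """Combine per-metric posteriors into one decision for a candidate ramp q.

    L0_values, L1_values: lists of L_t(H0^q) and L_t(H1^q), one entry per metric.
    Returns "stop" if H1^q is accepted, "ramp" if q looks risk tolerable for a
    majority of metrics, and "wait" otherwise.
    """
    M = len(L1_values)
    # BH-style check: compare the m-th largest L_t(H1^q) against the relaxed
    # threshold 1 / (1 + m * A1 / M); one exceedance is enough to accept H1^q.
    for m, l1 in enumerate(sorted(L1_values, reverse=True), start=1):
        if l1 > 1.0 / (1.0 + m * A1 / M):
            return "stop"
    # Require H0^q to be accepted for at least 80% of the metrics (matching alpha_0).
    accepted = sum(l0 > 1.0 / (1.0 + A0) for l0 in L0_values)
    if accepted >= majority * M:
        return "ramp"
    return "wait"
```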
4.2 Ramping at MPR

Once we have ramped to the Maximum Power Ramp, the criteria to ramp further are quite different from the earlier ramps. The pre-MPR ramps are primarily for risk mitigation, so the SQR ramp framework mainly focuses on trading off speed and risk. On the other hand, since MPR is primarily for measurement, the tradeoff at MPR is mainly speed vs. decision quality. There are three criteria that the Ramp Recommender considers at MPR, discussed in the following sections.

4.2.1 MPR Duration. As we mentioned in Section 3.2, it is important that we spend one week at MPR before making decisions. This is to make sure we capture a representative set of users and use cases throughout a full week cycle, and to take advantage of the reduction in variance as more data trickle in over time. The Ramp Recommender only starts to kick in after the experiment has been at MPR for one week.

4.2.2 Metric Impact. Clearly, if some metrics are significantly negatively impacted, further ramping beyond MPR is not recommended. Again, the challenge here is to control false positives. As mentioned in Section 4.1.4, there are usually over one hundred metrics that are closely monitored for every experiment. However, not every metric is created equal, especially when it comes to the final ramp decision. For a handful of metrics that are most important for the experiment, to keep the recommendation transparent and interpretable, we simply use the same statistical significance definition as shown on our experiment dashboard. If any of these metrics are significantly down (p-value < 0.05), we would like the experimenters to take a closer look instead of recommending further ramp. For the majority of the other metrics, we use the false discovery rate to control the number of false alarms. Here we use a false discovery rate of 0.1. We use the following steps to determine if any metric impact is significantly negative (see the sketch below).
1) Let $p_m$ be the p-value for metric $m$ ($m = 1, \dots, M$). We first rank the $p_m$'s in increasing order: $p_{(1)}, p_{(2)}, \dots, p_{(M)}$.
2) Find the largest $l$ such that $p_{(l)} \le l \cdot \frac{0.1}{M}$.
3) If such an $l$ exists, and there exists $j$, $j \le l$, such that the impact for metric $j$ is negative, we have identified at least one metric with significant negative impact. We cannot ramp to 100%. Otherwise, we can ramp to 100%.
Note that if there are 100 metrics, the Ramp Recommender will not recommend ramping to 100% if there are any metrics that are negatively impacted with p-value less than 0.001 (0.1/100).
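A minimal sketch of this check follows. The handling of the handful of key metrics as a separate input and the boolean "safe to ramp to 100%" return convention are our own framing of the steps above, not the platform's actual interface.

```python
def safe_to_fully_ramp(p_values, impacts, key_metrics=None, fdr=0.1):
    """Metric-impact check at MPR: can we recommend ramping to 100%? (sketch)

    p_values: p-values for the broadly monitored metrics
    impacts:  signed effect estimates aligned with p_values (negative = metric down)
    key_metrics: optional list of (p_value, impact) pairs for the handful of
        key metrics, checked against the plain p < 0.05 dashboard definition
    fdr: target false discovery rate for the remaining metrics
    """
    # Key metrics: any statistically significant negative movement blocks the ramp.
    for p, impact in (key_metrics or []):
        if p < 0.05 and impact < 0:
            return False
    # Steps 1)-2): Benjamini-Hochberg over the remaining metrics.
    M = len(p_values)
    order = sorted(range(M), key=lambda i: p_values[i])   # indices by increasing p
    largest_l = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / M:
            largest_l = rank
    # Step 3): if any of the first `largest_l` ranked metrics moved negatively,
    # we cannot ramp to 100%.
    return not any(impacts[order[k]] < 0 for k in range(largest_l))
```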
4.2.3 Alarming Insights. If an alarming insight is detected during the ramp, we need to take extra caution when making ramp decisions. These insights include burn-in effects, inconsistent results across ramps, heterogeneous treatment effects, etc. Such insights are automatically computed and can be leveraged by the Ramp Recommender to make better, and more informed, recommendations.

4.3 Evaluation

In this section, we want to evaluate how the Ramp Recommender algorithm performs by replaying it on historical experiments. We primarily focus on evaluating the pre-MPR phase, as we cannot easily quantify decision quality on MPR ramps based on historical data. As we mentioned earlier, during the pre-MPR phase, the key is to trade off effectively between speed and risk. There are two aspects we evaluate:
• Consistency. How consistent is the recommendation over time? Ideally, if the algorithm recommends ramping up with data collected by time $t$, the same recommendation should hold with data collected by time $t + 1$.
• Speed. With the ramp recommendation, how much time would we save? Ideally, we should save on both the number of ramps and the total duration before reaching MPR.

We collected 484 experiments run in the past year, which had one ramp at 50% that lasted for at least a week. These experiments followed various different ramp sequences historically. Therefore, to replay the ramp recommendation, we take the data from the 50% ramps, and simulate the results for any pre-MPR ramp $q \in \{1\%, 5\%, 10\%, 25\%\}$.

Table 1 below compares the recommendation at the 5% ramp after Day-1 vs. after Day-7, where all seven days' data are available to make recommendations.

Table 1: Replayed recommendations show consistency.

                 Day-7 Fail    Day-7 Ramp up
Day-1 Fail           8%              1%
Day-1 Wait           2%             31%
Day-1 Ramp up        0%             58%

Note that this replay is for us to evaluate whether the same recommendation holds over time. In reality, we would not have observed Day-7 recommendations for experiments that are ramped up after Day-1. In addition, while there is an option to "wait" for more data on Day-1, the algorithm defaults to ramp up for "undecided" cases after Day-7. As we can see from the table, 58% of these 484 experiments are recommended to ramp up after one day, and all these recommendations would hold even if we had run the experiment for an entire week. Results from other pre-MPR ramp percentages are similar and hence are omitted here.

We have also replayed the entire pre-MPR process for each one of the 484 experiments. Given that we do not have prior risk assessments, we choose the 1% ramp as the initial ramp to be on the safe side, but use the low risk priors to ramp up from there on. Guided by our algorithm, experiments can either be ramped up to MPR or be terminated halfway. As you can see from Table 2, 71% of the experiments are recommended to ramp to MPR while the remaining 29% are deemed too risky to ramp further. These are likely experiments that intended to move metrics. Most experiments are flagged down at the 1% ramp and the 25% ramp. For those experiments that are recommended to ramp to MPR, we have also compared the duration from the replay with the actual duration of the experiments. Under SQR, it takes 2.9 ramps on average (including the initial 1% ramp) or 12 days in median to reach MPR. This implies about 50% time saved compared with the actual duration.

Table 2: Replayed recommendations for entire pre-MPR phase.

                     Flagged during Pre-MPR           Reached MPR
Last Ramp%           1%      5%      10%     25%      50%
% of Experiments     9%      3%      4%      13%      71%

5 AUTO-RAMP

We have so far discussed the principles and algorithms that determine whether an experiment should ramp up, and if so, to what percent of traffic. In this section, we briefly cover the engineering infrastructure that takes the recommendation and automatically ramps up the experiment accordingly. As the backbone for auto ramping, the system infrastructure is designed to achieve the following goals.
• Reliable. Every ramp decision could potentially impact thousands if not millions of users; the auto-ramp system needs to be able to fail over and retry when needed, and all failures and progress need to be closely monitored and communicated to stakeholders.
• Timely. The ramp action itself is time sensitive as it correlates with the product feature ramping timeline. The data required to make ramp recommendations also need time to accumulate. This requires a robust scheduling system to deliver execution in a timely manner.

The auto-ramp infrastructure is a multi-component system, including an easy-to-use user interface (UI) for various ramping configurations and user inputs, a middle layer application that interacts with the UI and manages metadata, a highly concurrent execution engine that executes ramp recommendations, a distributed and fail-safe scheduling system handling time-sensitive tasks, and monitoring and alerting modules as the ramp progresses.
Setup. To balance between flexibility and simplicity, the auto-ramp system asks for only critical inputs from users, including risk level assessments and completion/failure criteria. An auto-ramp is considered completed if it reaches the maximum ramp percentage the experimenter selects, which is not always 100%. For example, some experimenters prefer to exit auto-ramp mode after the experiment reaches MPR. An auto-ramp is considered failed if it passes its preset due date. This can happen when the Ramp Recommender does not have enough information to confidently make the recommendation. These configurations, as part of the persisted ramp metadata, are also used for the initial ramping plan recommendation.
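To make the setup step concrete, below is a hypothetical example of what such an auto-ramp configuration might contain; the field names and values are illustrative inventions of ours, not the platform's actual schema.

```python
# Hypothetical auto-ramp setup; all field names and values are illustrative only.
auto_ramp_config = {
    "experiment_key": "example.feed.new-ranking-model",  # made-up experiment name
    "risk_level": "low",                # experimenter's prior risk assessment
    "max_ramp_pct": 0.50,               # completion criterion: exit auto-ramp at MPR
    "due_date": "2016-09-30",           # failure criterion: give up if not done by then
    "check_frequency_hours": 24,        # how often the execution engine re-evaluates
    "notify": ["experiment-owner@example.com"],  # who hears about progress and failures
}
```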
Approval. After the auto-ramp is set up for an experiment, it is then sent to SREs (Site Reliability Engineers) and other key stakeholders for approval, as part of the regular experiment activation process. Once the activators review the setup, validate the risk level and the recommended initial plan, and approve the request, the auto-ramp process is kicked off. Auto-ramp follows the "design - publish - execution" paradigm. Upon approval, its state freezes and becomes read-only for future executions.

Execution. Execution is triggered based on frequency, time range and time zone. According to the frequency configured, the execution engine periodically checks if the auto-ramp is overdue or completed. Otherwise, it queries the Ramp Recommender based on the configuration and executes the recommendation.

6 SUMMARY AND FUTURE WORK

In this paper, we discussed how we can effectively ramp an experiment by balancing among speed, quality and risk. We first established the SQR framework, offering four principles that practitioners can easily follow. Following these principles, we also developed statistical algorithms that recommend ramp decisions and engineering infrastructure that automatically ramps up the experiment accordingly.

One interesting and related problem we have not studied in depth is deciding when a long-term holdout is beneficial and how to conduct it effectively. Google has shared some long-term experiments they conducted to quantify user-learning effects in the context of Ads [24]. We need a generic solution that can become part of the experimentation process for any experiment. Another area we can improve upon in our current Ramp Recommender algorithm is how to decide that it is safe to ramp up during the pre-MPR phase in the case of multiple metrics. The ramp criteria we have proposed compromise between risk and speed, but we do not have theoretical guarantees that they control $\alpha_0$ as we do in the single metric case. From our literature search, this does not seem to be an extensively studied area.

REFERENCES
[1] Kohavi, Ron, Deng, Alex, Frasca, Brian, Walker, Toby, Xu, Ya, and Pohlmann, Nils. Online Controlled Experiments at Large Scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Chicago 2013), 1168-1176.
[2] Tang, Diane, Agarwal, Ashish, O'Brien, Deirdre, and Meyer, Mike. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. In Proceedings of the 16th Conference on Knowledge Discovery and Data Mining (Washington, DC 2010), 17-26.
[3] Xu, Ya, Chen, Nanyu, Fernandez, Addrian, Sinno, Omar, and Bhasin, Anmol. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Sydney 2015), 2227-2236.
[4] Ries, Eric. The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business, 2011.
[5] Kohavi, Ron, Longbotham, Roger, Sommerfield, Dan, and Henne, Randal M. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18, 1 (Feb 2009), 140-181.
[6] Box, Joan F. R. A. Fisher and the Design of Experiments, 1922-1926. The American Statistician, 34, 1 (1980), 1-7.
[7] Tamhane, Ajit C. Statistical Analysis of Designed Experiments: Theory and Applications. John Wiley & Sons, Inc., 2009.
[8] Rubin, Donald B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 5 (October 1974), 688-701.
[9] Bakshy, Eytan, Eckles, Dean, and Bernstein, Michael S. Designing and Deploying Online Field Experiments. In Proceedings of the 23rd International Conference on World Wide Web (Seoul 2014), 283-292.
[10] Wald, Abraham. Sequential Tests of Statistical Hypotheses. Annals of Mathematical Statistics, 16, 2 (June 1945), 117-186.
[11] Johnson, N. L. Sequential analysis: a survey. Journal of the Royal Statistical Society. Series A (General), 124, 3 (1961), 372-411.
[12] Lai, T. L. Sequential analysis. John Wiley & Sons, Ltd., 2001.
[13] Bartroff, Jay, Lai, Tze Leung, and Shih, Mei-Chiung. Sequential Experimentation in Clinical Trials: Design and Analysis. Springer Science & Business Media, 2012.
[14] Chang, Yuan-chin Ivan. Application of sequential probability ratio test to computerized criterion-referenced testing. Sequential Analysis, 23, 1 (2004), 45-61.
[15] Johari, Ramesh, Pekelis, Leo, and Walsh, David J. Always valid inference: Bringing sequential analysis to A/B testing. arXiv:1512.04922 (Dec. 2015).
[16] Siroker, Dan and Koomen, Pete. A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. Wiley Publishing, 2013.
[17] Lehmann, Erich L. and Romano, Joseph P. Testing Statistical Hypotheses. Springer, 2008.
[18] Deng, Alex, Xu, Ya, Kohavi, Ron, and Walker, Toby. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (Rome).
[19] Kohavi, Ron, Deng, Alex, Longbotham, Roger, and Xu, Ya. Seven Rules of Thumb for Web Site Experimenters. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York 2014).
[20] Kohavi, Ron, Deng, Alex, Frasca, Brian, Longbotham, Roger, Walker, Toby, and Xu, Ya. Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Beijing 2012).
[21] Novikov, Andrey. Optimal sequential multiple hypothesis tests. arXiv e-prints (Nov. 2008).
[22] Benjamini, Yoav and Hochberg, Yosef. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57, 1 (1995), 289-300.
[23] Benjamini, Yoav and Yekutieli, Daniel. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29, 4 (2001), 1165-1188.
[24] Hohnhold, Henning, O'Brien, Deirdre, and Tang, Diane. Focus on the Long-Term: It's Better for Users and Business. In Proceedings of the 21st Conference on Knowledge Discovery and Data Mining (Sydney 2015), 1849-1858.