A/B Testing Cheat Sheet
This comprehensive guide serves as a quick reference for various concepts, steps, and techniques in A/B
tests. With this guide, you will be equipped with the knowledge and tools necessary to answer interview
questions related to A/B testing.
Table of Contents
The Basics of A/B Tests
Selecting Metrics for Experimentation (video)
Selecting Randomization Units (video)
General Considerations
Different Choices of Randomization Units
Randomization Unit vs. Unit of Analysis
Choosing a Target Population
Computing Sample Size (video)
Determine Test Duration
Analyzing Results (video)
Sanity Checks
Hypothesis Tests (video)
Statistical and Practical Significance
Common Problems and Pitfalls (video)
Alternatives to A/B tests
Goal metrics (success metrics)
A single or a very small set of metrics that capture the ultimate success you are striving towards
Stable: it should not be necessary to update goal metrics every time you launch a new feature
Driver metrics
Reflect hypotheses on the drivers of success and indicate we are moving in the right direction to move the goal metrics
Actionable
Resistant to gaming
User funnel
Minimize to 5 key metrics (success and driver metrics) as a rough rule of thumb. When dealing
with a lot of metrics, OEC (Overall Evaluation Criterion), a combination of multiple key metrics,
can be used. Devising an OEC makes the tradeoffs explicit and makes the exact definition of
success clear. The OEC can be the weighted sum of normalized metrics (each normalized to a
predefined range, say 0-1).
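To make the OEC idea concrete, here is a minimal Python sketch of a weighted sum of min-max-normalized metrics; the metric names, ranges, and weights are made-up illustrations rather than prescribed values.

```python
# Sketch: combine several key metrics into a single OEC score.
# Metric names, normalization ranges, and weights are illustrative only.

def normalize(value, lo, hi):
    """Min-max normalize a metric to the 0-1 range (clipped)."""
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

# metric -> (observed value, expected min, expected max, weight)
metrics = {
    "conversion_rate": (0.043, 0.0, 0.10, 0.5),
    "revenue_per_user": (1.80, 0.0, 5.0, 0.3),
    "sessions_per_user": (2.40, 0.0, 6.0, 0.2),
}

oec = sum(w * normalize(v, lo, hi) for v, lo, hi, w in metrics.values())
print(f"OEC = {oec:.3f}")  # a single number summarizing success
```

Writing the OEC down this way forces the tradeoffs (the weights) to be explicit and reviewable.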
Organizational Guardrails
Ensure we move towards success with the right balance and without violating important constraints.
E.g., website/app performance: latency (wait times for pages to load), error logs (number of error messages), client crashes (number of crashes per user); business goals: revenue.
Trust-related guardrails
E.g., the Sample Ratio Mismatch (SRM) check, and verifying that the cache hit ratio is the same between Control and Treatment.
1. Visibility of the change
For changes visible to users, we should use a user ID or a cookie as the randomization unit.
For changes invisible to users, e.g., a change in latency, it depends on what we want to measure. A user ID or a cookie is still a good option if we want to see what happens over time.
2. Variability
If the randomization unit is the same as the unit of analysis, the empirically computed variability is
similar to the analytically computed variability.
If the randomization unit is coarser than the unit of analysis, e.g., the randomization unit is the user and we wish to analyze the click-through rate (the unit of analysis is a page view), the variability of the metric will be much higher. This is because the independence assumption is invalid: we are splitting groups of correlated units between variants, which increases the variability (see the simulation sketch after this list).
3. Ethical considerations
May face security and confidentiality issues when using identifiable randomization units.
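To make the variability point from the list above concrete, here is a small simulation sketch with synthetic data and assumed parameters: users are randomized, the click-through rate is analyzed per page view, and the correlation of page views within a user makes the true variability of the metric noticeably larger than the naive page-view-level estimate.

```python
# Simulation sketch: randomize by user, analyze click-through rate per page view.
# Within-user correlation makes the naive page-view-level SE too small.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_sims = 2000, 500

def simulate_ctr():
    # Each user has their own base CTR, so page views within a user are correlated.
    user_ctr = rng.beta(1, 9, size=n_users)        # user-level CTRs, mean ~0.10
    views = rng.poisson(50, size=n_users) + 1      # page views per user
    clicks = rng.binomial(views, user_ctr)
    return clicks.sum() / views.sum(), views.sum()

# Distribution of the page-view-level CTR across many user-randomized samples
estimates = np.array([simulate_ctr()[0] for _ in range(n_sims)])

# Naive SE that (wrongly) treats every page view as independent
p_hat, total_views = simulate_ctr()
naive_se = np.sqrt(p_hat * (1 - p_hat) / total_views)

print(f"naive page-view-level SE      : {naive_se:.4f}")
print(f"actual SD of the CTR estimate : {estimates.std(ddof=1):.4f}")  # noticeably larger
```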
User ID-Based: Every user ID (e.g., a signed-in account) is a randomization unit.
Pros: It allows for long-term measurements, such as user retention and users' learning effect.
Cons: Identifiable, so security and confidentiality must be handled carefully.
Cookie-Based: Every browser cookie is a randomization unit.
Pros: Anonymous.
Cons: Cookies can be cleared or differ across devices and browsers, so long-term tracking and a consistent experience are harder to guarantee.
Session-Based (or Page View-Based): Every user session is a randomization unit. A session starts
when a user logs in and ends when a user logs out or after 30 min of inactivity.
Pros: Finer level of granularity creates more units, and the test will have more power to detect
smaller changes.
Cons: May lead to inconsistent user experience, so it’s appropriate when changes are not visible
to the user
IP-Based: Every IP address is a randomization unit; devices are assigned IP addresses by the networks they connect to.
Pros: May be the only option for certain experiments, e.g., testing latency using one hosting service versus another.
Cons: A user's IP changes when they change locations, creating an inconsistent experience, and many users may share the same IP address. Therefore, not recommended unless it's the only option.
The general recommendation is to make the randomization unit the same as (or coarser than) the unit of analysis.
e.g., the randomization unit is the user and we analyze the click-through rate (the unit of analysis
is a page view).
The caveat is that in this case, we need to pay attention to the variability of the unit of analysis as
explained earlier.
It does not work if the randomization unit is finer than the unit of analysis.
e.g., the randomization unit is a page view and we analyze user-level metrics.
This is because the user’s experience is likely to include a mix of variants (i.e., some in Control
and some in Treatment), and computing user-level metrics will not be meaningful.
Consider geographic region, platform (mobile vs tablet vs laptop), device type, user demographics
(age, gender, country, etc), usage or engagement level (analyze the user journey), etc.
Be careful if you select users based on usage and your treatment affects usage. This violates
the stable unit treatment value assumption.
α = Type I error rate = the probability of rejecting the null hypothesis when it is actually true = incorrectly rejecting the null hypothesis = a false positive.
Significance level: the probability that we reject H0 even when the treatment has no effect, i.e., the probability of committing a Type I error (α).
β = Type II error rate = the probability of failing to reject the null hypothesis when the alternative hypothesis is true = incorrectly accepting the null hypothesis = a false negative.
Statistical power (1 − β): the probability that we reject H0 when the treatment indeed has an effect. This measures how sensitive the experiment is. If power is too low, we can't detect true effects; if we demand unrealistically high power (e.g., 0.99), the required sample size may be so large that the experiment never finishes.
Variances:
Because the samples are independent, Var(Δ) = Var(Ȳt) + Var(Ȳc), where Δ = Ȳt − Ȳc is the difference between the Treatment average and the Control average. Variances are often estimated either from historical data or from A/A tests.
Test Duration = Sample Size / (Randomization Units per Day)
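As a worked sketch of this arithmetic, the snippet below computes the per-group sample size with the standard two-sample normal-approximation formula and then the test duration; the baseline rate, minimum detectable effect, and daily traffic are assumed values for illustration.

```python
# Sketch: sample size per group and test duration from alpha, power, variance, and MDE.
# Baseline rate, MDE, and daily traffic are illustrative assumptions.
from scipy.stats import norm

alpha, power = 0.05, 0.80
p_baseline = 0.10      # baseline conversion rate (assumed)
mde = 0.01             # minimum detectable absolute difference (assumed)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
var = p_baseline * (1 - p_baseline)   # per-unit variance of a Bernoulli metric

# Sample size per group for a two-sample comparison
n_per_group = (z_alpha + z_beta) ** 2 * 2 * var / mde ** 2

units_per_day = 5000   # eligible randomization units per day (assumed)
duration_days = 2 * n_per_group / units_per_day

print(f"n per group ~= {n_per_group:,.0f}")
print(f"test duration ~= {duration_days:.1f} days")
```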
Ramp-up plan:
Mitigate risk (0-5% of traffic): Start with team members, company employees, loyal users, etc., since bugs or other issues are possible and these users tend to be more forgiving.
Long-term holdout (optional): Be aware of opportunity costs and ethics, because held-out users won't enjoy new features for a while.
Sample Ratio Mismatch (SRM). For the study population, we want 50% in the treatment and 50% in the control. If our study population was 1,000 with 800 in the treatment and 200 in the control, there is a sample ratio mismatch, and the assignment mechanism should be investigated before trusting any results.
When the sample size is big enough, by the central limit theorem (CLT), the sampling distribution of the difference in sample means (Ȳt − Ȳc) should be normally distributed.
Normality test
Organization-related guardrail metrics are used to ensure that the organization's performance stays within the standards we expect.
Website/App performance
Business goals
Engagement: e.g., time spent per user, daily active users (DAU), and page views per user.
Z-test or T-test: Both tests can be used to compare proportions or group means and test for
significant differences between them.
Note: the "standard deviation" used here is the standard deviation of the sampling distribution of the proportion, i.e., the standard error (SE). The SE should be used in computations instead of the sample SD.
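A minimal sketch of such a two-proportion z-test, using the pooled standard error as described in the note above; the conversion counts are made up for illustration.

```python
# Sketch: two-proportion z-test on conversion counts (illustrative numbers).
import numpy as np
from scipy.stats import norm

conv_c, n_c = 520, 5000   # control: conversions, users (assumed)
conv_t, n_t = 585, 5000   # treatment: conversions, users (assumed)

p_c, p_t = conv_c / n_c, conv_t / n_t
p_pool = (conv_c + conv_t) / (n_c + n_t)

# Standard error of the difference under H0 (the SE, not the sample SD)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```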
Chi-Squared Test
Example: checking for SRM. Used as a goodness-of-fit test (analogous to the "fairness of dice" example on the Wikipedia page for the chi-squared test), it tests whether the treatment/control assignment mechanism is fair (the split should be 50/50).
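A minimal sketch of the SRM check as a chi-squared goodness-of-fit test; the observed counts and the alert threshold are illustrative choices.

```python
# Sketch: Sample Ratio Mismatch check with a chi-squared goodness-of-fit test.
# Observed counts are illustrative; the design calls for a 50/50 split.
from scipy.stats import chisquare

observed = [50_210, 49_100]              # treatment, control (assumed counts)
total = sum(observed)
expected = [total * 0.5, total * 0.5]    # expected under a fair 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:   # a strict threshold is often used for SRM alerts
    print(f"Possible SRM (p = {p_value:.2e}); investigate before analyzing results.")
else:
    print(f"No SRM detected (p = {p_value:.3f}).")
```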
These failures should be a priority concern before moving on to analyzing the data. Is this a one-time issue, or will it persist or worsen over time? These are supposed to be invariant metrics; we do not want them to differ between groups.
Z-test or t-test
P-value
Definition: If H0 is true, what's the probability of seeing an outcome (e.g., a t-statistic) at least
this extreme?
How to use: If the p-value is below your threshold of significance (typically 0.05), then you can
reject the null hypothesis.
Assumptions
Normality: When the sample size is big enough, by the central limit theorem (CLT), the
sampling distribution of the difference in the means between the two groups should be
normally distributed.
If the sample isn't large enough for the sampling distribution to be normal, consider a non-parametric test or bootstrapping instead.
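One common fallback in that situation (not prescribed by this cheat sheet) is to bootstrap the difference in means; here is a minimal sketch with synthetic skewed data.

```python
# Sketch: bootstrap confidence interval for the difference in means
# when the normal approximation is questionable (synthetic skewed data).
import numpy as np

rng = np.random.default_rng(1)
control = rng.exponential(scale=1.0, size=80)    # small, skewed samples (assumed)
treatment = rng.exponential(scale=1.2, size=80)

boot_diffs = [
    rng.choice(treatment, size=treatment.size, replace=True).mean()
    - rng.choice(control, size=control.size, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the difference in means: [{lo:.3f}, {hi:.3f}]")
```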
1. Statistically and practically significant: The result is both statistically (p < .05 and 95% CI does not
contain 0) and practically significant, so we should obviously launch it. → Launch!
2. Not practically significant:
Scenario 1: The change is neither statistically significant (the 95% CI contains 0) nor practically significant (the CI lies well within the practical significance boundaries), so it is not worth launching. → The change does not do much. Either decide to iterate or abandon this idea.
Scenario 2: Statistically significant (95% CI doesn’t contain 0) but not practically significant → if
implementing a new algorithm is costly, then it’s probably not worth launching; if the cost is low, then it
doesn’t hurt to launch.
3. Not enough power to draw a conclusion:
Scenario 1: The 95% CI contains 0, and the CI extends beyond the practical significance boundaries. → There is not enough power to draw a strong conclusion and we do not have enough data to make any launch decision. Run a follow-up test with more units, providing greater statistical power.
Scenario 2: Likely practically significant. Even though our best guess (i.e., point estimate) is larger
than the practical significance boundary, it’s also possible that there is no impact at all. → Repeat this
test but with greater power to gain more precision in the result.
Both scenarios suggest our experiment may be underpowered, so we should probably run new
experiments with more units if time and resources allow.
1. Multiple success metrics (multiple hypotheses): If the significance level (false-positive probability) is 5% for each metric, then with N metrics, Pr(at least one metric is a false positive) = 1 − (1 − 0.05)^N, which is much greater than 5%.
Group metrics into expected to change, not sure, and not expected to change (a correction sketch follows below).
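A small sketch of this arithmetic and of one simple remedy, a Bonferroni-style correction; the number of metrics and the p-values are illustrative.

```python
# Sketch: family-wise false-positive rate with N metrics, plus a Bonferroni correction.
alpha, n_metrics = 0.05, 10
fwer = 1 - (1 - alpha) ** n_metrics
print(f"P(at least one false positive) = {fwer:.2f}")   # ~0.40 for N = 10

# Bonferroni: compare each p-value to alpha / N (p-values are illustrative)
p_values = [0.003, 0.020, 0.048, 0.300, 0.700]
threshold = alpha / len(p_values)
significant = [p for p in p_values if p < threshold]
print(f"corrected threshold = {threshold:.3f}; significant p-values: {significant}")
```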
2. Post-experiment result segmentation: Segmenting results after the fact squeezes multiple hypotheses into one experiment, which also raises the chance of false-positive results. The overall result can even contradict the segmented results (Simpson's paradox).
Causes
P-hacking: stopping the experiment earlier than the designed duration as soon as the observed p-value drops below the threshold.
The experiment ran as designed but there are not enough randomization units.
High variance
Solutions
If the experiment is still running, we should run the experiment until enough units are
collected.
Clean the data to reduce variance: remove outliers (e.g., by capping; see the sketch after this list) or apply a log transformation (don't log-transform revenue!)
Use trigger analysis, i.e., only include impacted units (e.g., the conversion rate may be 0.5% when you include users from the top of the funnel but 50% right before the change). The caveat is that when generalizing to all users, the true effect could be anywhere between 0 and the observed effect.
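A small sketch of the capping (winsorizing) idea mentioned above as a variance-reduction step; the synthetic data and the 99th-percentile cap are arbitrary illustrative choices.

```python
# Sketch: cap extreme values at a high percentile to reduce metric variance.
import numpy as np

rng = np.random.default_rng(2)
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)   # heavy-tailed metric

cap = np.percentile(revenue, 99)
capped = np.minimum(revenue, cap)

print(f"variance before capping: {revenue.var():.1f}")
print(f"variance after capping : {capped.var():.1f}")   # substantially smaller
```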
Causes
Seasonality
Market change
Solutions
Long-term monitoring
Network Effects
Use isolation methods to ensure little or no spillover between the control and treatment units.
Cluster-based randomization
Randomize based on groups of people who are more likely to interact with fellow group
members, rather than outsiders
Geo-based randomization
Time-based randomization
Select a random time and place all users in either control or treatment groups for a short
period of time.
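Returning to cluster-based randomization, here is a minimal sketch of deterministic cluster-level assignment: every member of a cluster gets the same variant, so interactions stay mostly within one arm. The cluster IDs, salt, and hashing scheme are illustrative, not a prescribed implementation.

```python
# Sketch: cluster-based randomization, i.e., assign whole clusters, not individuals.
# Cluster IDs and the hashing scheme are illustrative.
import hashlib

def assign_variant(cluster_id: str, salt: str = "expt_42") -> str:
    """Deterministically map a cluster (e.g., a social circle or a geo) to a variant."""
    digest = hashlib.md5(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

for cid in ["cluster_001", "cluster_002", "cluster_003"]:
    print(cid, "->", assign_variant(cid))
```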
Qualitative Analysis
Conduct user experience research: Great for generating hypotheses. Terrible for scaling.
Focus groups: A bit more scalable but users may fall into groupthink
Human evaluation: Having human raters rate results or label data is useful for debugging, but they
may differ from actual users.
Quantitative Analysis
Conduct retrospective analysis by analyzing users’ activity logs: Use historical data to understand
baselines, metric distributions, form hypotheses, etc.
Causal inference: interrupted time series (the same group goes through control and treatment over time), interleaved experiments (results from two rankers are de-duplicated and mixed together), regression. These methods require making many assumptions, and incorrect assumptions can lead to a lack of validity.
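As a toy illustration of one of these methods, here is a minimal interrupted time series sketch: a simple trend-plus-level-shift regression on synthetic daily data. It is a deliberate simplification of the technique, not a production analysis.

```python
# Sketch: toy interrupted time series, fitting a pre/post level shift on a daily metric.
# The data are synthetic and the model (trend + level shift) is a simplification.
import numpy as np

rng = np.random.default_rng(3)
days = np.arange(60)
post = (days >= 30).astype(float)   # intervention happens on day 30
metric = 100 + 0.5 * days + 8.0 * post + rng.normal(0, 3, size=days.size)

# Design matrix: intercept, time trend, post-intervention indicator
X = np.column_stack([np.ones_like(days, dtype=float), days, post])
coef, *_ = np.linalg.lstsq(X, metric, rcond=None)

print(f"estimated level shift after the intervention: {coef[2]:.1f}")  # true value is 8
```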