A/B Testing
To test the causal effect of a treatment T on an outcome Y, estimated as P(Y | T, X)
Table of Contents
• Aim
• Objectives
• Self Assessments
• Activities
• Did You Know
• Summary
• Terminal Questions
Aim
Objectives
A/B testing
A/B testing, also known as split testing, is a method used to compare two versions of a webpage,
app, email, or other digital assets to determine which one performs better in terms of a specific
metric, such as conversion rate, click-through rate, or user engagement.
How A/B Testing Works:
1. Create Variations: You create two versions of the element you want to test. Version A is the
control, or original version, and Version B is the variation with some change (e.g., different color
buttons, headlines, images, etc.).
2. Split Traffic: The traffic to your site or app is randomly split between the two versions. Half of the
users see Version A, and the other half see Version B.
3. Measure Results: You track the performance of each version based on the metric you’re
interested in. For example, if you’re testing a landing page, you might measure how many users
sign up for a newsletter.
4. Analyze Data: After running the test for a sufficient period, you analyze the data to see which
version performed better. Statistical significance is often calculated to ensure the results are not
due to chance.
5. Implement Changes: If Version B outperforms Version A, you might implement the changes
from Version B across your site or app. If there’s no significant difference, you might try testing
other variations.
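As a rough illustration of steps 2 and 3, here is a minimal Python sketch that randomly splits hypothetical users between two versions and measures a conversion metric for each; the user counts and conversion rates below are made-up numbers, not data from a real test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Step 2: randomly split 10,000 hypothetical users between control (A) and variant (B)
users = pd.DataFrame({"user_id": np.arange(10_000)})
users["variant"] = rng.choice(["A", "B"], size=len(users))

# Simulated outcomes; in a real test, 'converted' would come from tracking, not simulation
assumed_rates = {"A": 0.10, "B": 0.12}            # illustrative conversion rates
users["converted"] = rng.random(len(users)) < users["variant"].map(assumed_rates)

# Step 3: measure the metric for each version
summary = users.groupby("variant")["converted"].agg(["sum", "count", "mean"])
print(summary)   # 'mean' is the observed conversion rate for A and B
```

Steps 4 and 5 (analysis and the launch decision) then operate on the observed rates per variant, as covered in the sections that follow.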
Applications of A/B Testing:
1. Website Optimization: Testing different layouts, call-to-action buttons, images, and content
to increase user engagement or conversions.
2. Email Campaigns: Testing subject lines, content, or send times to improve open rates and
click-through rates.
3. Product Features: Testing new features or changes in an app to determine how users
respond.
4. Advertising: Testing different ad creatives, headlines, and targeting options to improve ad
performance.
Goal of the feature/idea (the treatment):
1. What do you hope to achieve with this feature incorporation/update?
2. Why this specific feature update for that goal, and not any other feature?
3. Has this been experimented with before? Are other product lines also following suit?
4. Is the hypothesis supported by previous experiment data, industry insights, reports, or other
evidence?
5. Is this feature for a specific user group or for all user groups?
The test metric - the metric the feature is expected to impact, and the data available.
1. List the candidate metrics and finalize the metric we want to test for significance.
• Metrics - In this case study we can focus on the primary impacts of the feature:
more user engagement, daily active users, etc.
• User Engagement can be defined as the % of active users who have engaged with
Facebook in some way (likes, comments, saves, reactions).
• Daily Active Users - the # of unique users who have logged on to Facebook each day.
We expect this metric to increase with this new feature.
2. Come up with north star metrics, supporting metrics (if applicable) and guard rail metrics.
• North Star Metric - the most crucial parameter to focus on, e.g. % of users with
engagement
• Supporting Metric - Daily Active Users
• Guard Rail Metric (the metric that should not degrade in pursuit of a new feature)
- % of media content (assuming media content provides more value, we don't want this
% to decrease because of this feature). Or, because this feature takes up so much
space, are we seeing a lower number of posts on average that people interact with?
(A sketch of how these metrics could be computed from an events log follows this list.)
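The sketch below shows one way these metrics might be computed, assuming a hypothetical events log with user_id, date, and action columns; the schema and the set of actions counted as "engagement" are assumptions for illustration, not a real Facebook data model.

```python
import pandas as pd

# Hypothetical events log: one row per user action per day (schema is an assumption)
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "date":    pd.to_datetime(["2024-01-01"] * 4 + ["2024-01-02"] * 2),
    "action":  ["like", "view", "comment", "view", "save", "reaction"],
})

ENGAGEMENT_ACTIONS = {"like", "comment", "save", "reaction"}   # assumed definition of "engaged"

# Supporting metric: daily active users (unique users with any activity that day)
daily = events.groupby("date").agg(dau=("user_id", "nunique"))

# North star metric: % of active users who engaged (liked, commented, saved, reacted)
engaged = (
    events[events["action"].isin(ENGAGEMENT_ACTIONS)]
    .groupby("date")["user_id"]
    .nunique()
    .rename("engaged_users")
)
daily = daily.join(engaged).fillna(0)
daily["pct_engaged"] = daily["engaged_users"] / daily["dau"]
print(daily)
```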
Steps - A/B testing
1. Set up hypothesis (State the null hypothesis & alternate hypothesis) - What would be the
null and alternative hypothesis in the case?
a. Null Hypothesis (H0) - There is no significant difference in user
engagement between the treatment and control groups.
b. Alternative Hypothesis (H1) - There is a significant difference in user
engagement between the treatment and control groups.
2. Choice of test - Since we are comparing two proportions (engaged vs. not engaged) across
two groups, we can use the two-proportion Z-test.
a. A Z-test (or T-test) is a statistical test used to determine whether there is a significant
difference between the means (or proportions) of two groups, or between a sample
mean and a population mean.
3. Choosing experiment control & treatment subjects
a. Who is the experiment being run on?
b. Are we targeting all users on the platform, or should we pick a segment of users for
whom we feel this test is particularly well suited?
4. Sample Size Calculation
a. Baseline metrics
i. Let’s say that before this feature launch, the user engagement is around 45%
b. Minimum detectable effect - what change is considered meaningful enough for you to take an action
i. Let’s say that the business stakeholders are hoping for a 1% increase in user engagement in the
treatment group
c. Significance level (usually 5%, corresponding to a 95% confidence level).
d. Power (usually 80%): a function of the significance level, sample size, and the size of the effect
being detected. Power close to 100% means the test is good at detecting a false null hypothesis.
Increasing the significance level increases the power of the test.
e. With the above numbers, we would need roughly 40K users in each group to design this
experiment in a statistically significant manner (see the sketch after these steps).
5. Experiment Duration - based on the estimated sample size and the approximate traffic:
● Divide the total sample size by the daily traffic.
i. We need a total sample size of 80K (40K in each group) based on the above
calculation.
ii. Assume FB gets a traffic of 5K users every day.
iii. Experiment duration = 80K / 5K = 16 days.
6. Significance testing - once we have reached the required sample size, run the significance
test on the north star metric.
7. Continue monitoring supporting metrics and guard rail metrics.
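The sample-size, duration, and significance-testing steps above can be sketched in a few lines of Python with statsmodels. The baseline, minimum detectable effect, daily traffic, and the engaged-user counts fed into the z-test are the illustrative numbers from the steps, not real data.

```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

# --- Step 4: sample size ------------------------------------------------------
baseline = 0.45      # current user engagement
mde      = 0.01      # minimum detectable effect: 45% -> 46%
alpha    = 0.05      # 5% significance level (95% confidence)
power    = 0.80

effect_size = proportion_effectsize(baseline + mde, baseline)      # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0, alternative="two-sided"
)
print(f"Sample size per group: {n_per_group:,.0f}")                # roughly 39K, i.e. ~40K

# --- Step 5: experiment duration ----------------------------------------------
daily_traffic = 5_000                                              # assumed users per day
duration = math.ceil(2 * n_per_group / daily_traffic)
print(f"Experiment duration: {duration} days")                     # ~16 days

# --- Step 6: significance test on the north star metric ------------------------
# Made-up engaged-user counts once each group has reached ~40K users
engaged = [18_000, 18_600]          # control: 45.0%, treatment: 46.5%
n_obs   = [40_000, 40_000]
z_stat, p_value = proportions_ztest(count=engaged, nobs=n_obs, alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the engagement difference is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")
```

The power calculation returns roughly 39K users per group, which is where the ~40K figure in step 4 comes from, and the z-test in step 6 compares the engaged-user proportions once both groups have reached that size.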
Testing Pitfalls - A/B testing
Experimental Design Bias
1. Novelty / Primacy Effect
a. Primacy Effect - When changes happen, some people who are used to how things work
may feel reluctant to change.
● Some users in the treatment group are reluctant to try out the new feature because they
were used to the older status UI, so they post status updates on FB less often.
● So user engagement for the first 2 weeks is low: Week 1 = 45% and Week 2 = 48%.
● But as these reluctant users see more users engaging with the colored status button,
they slowly start using this feature more.
● So from Week 3 onwards, the user engagement stabilizes at 62%.
● It's important not to take the first 2 weeks of low user engagement, caused by the
primacy effect, into consideration when comparing with control.
➢ Here it would have shown that there is no significant difference between the two
groups in the first 2 weeks, even though we subsequently see that this feature
actually gets the users more engaged.
b. Novelty Effect - These users resonate with the new change and use it more frequently.
● Some users in the treatment group get excited about the new feature.
● The excited users use this feature and engage more in the first two weeks, after
which the excitement dies down.
● So user engagement for the first 2 weeks is high: Week 1 = 65% and Week 2 = 68%.
● But from Week 3 onwards, the user engagement stabilizes at 52%.
● It's important not to take the first 2 weeks of high user engagement, caused by the
novelty effect, into consideration when comparing with control.
c. Neither of these is a long-term effect, so it's important that results are not biased by
them. Treatment results may be exaggerated or undermined initially due to these effects.
d. Solutions:
● Run the experiment for longer than required, if possible, to observe any
novelty or primacy effect.
● Conduct the test only on first-time users.
● Compare first-time users with experienced users in the treatment group to get an
estimated impact of the primacy/novelty effect (see the sketch after this list).
2. Group Interference - Interference between variants happens a lot. It's important to select
your sample in such a way that this interaction doesn't bias the results.
a. E.g., the treatment group is seeing a positive effect because of this new FB status
feature.
b. This effect can spill over to the control group (users who do not see the new feature but
make new posts after seeing a friend in the treatment group who is affected by it). This
is called a network effect.
c. In this case, the measured difference underestimates the treatment effect.
d. In reality, the difference may actually be more than 1%, but due to the network effect,
Actual Effect > Measured Effect.
e. Hence the test gives an incorrect result that this new feature did not significantly impact
the north star metric.
3. Outcome Bias
Look out for other design or system issues that could cause the measured treatment effect to
be under- or overestimated relative to the actual effect.
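A rough sketch of the first-time vs. experienced user comparison suggested above for detecting novelty/primacy effects; the weekly engagement rates mirror the primacy-effect example, and the table layout and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical weekly engagement inside the treatment group, split by user tenure
treatment = pd.DataFrame({
    "week":       [1, 1, 2, 2, 3, 3],
    "segment":    ["first_time", "experienced"] * 3,
    "engagement": [0.61, 0.45, 0.62, 0.48, 0.62, 0.62],   # made-up rates
})

pivot = treatment.pivot(index="week", columns="segment", values="engagement")
pivot["gap"] = pivot["first_time"] - pivot["experienced"]
print(pivot)
# A large gap in weeks 1-2 that closes by week 3 suggests the early treatment numbers
# are distorted by a primacy effect among experienced users.
```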
Recommendations - based on experiment results “Launch or not?”
Link results to the goal and business impact
1. Example: What does 1% lift in engagement rate translate to revenue?
● If the 1% lift increases ad revenue by $20M, it might be worth it; however, if it only
increases revenue by $50K, it might not be (based on an estimation of the effort involved).
2. Is it worth it to launch the product given all the costs?
3. While the perfect scenario is that the increase in the success metrics is significant and we
don't see any degradation in the guardrail metrics, give recommendations on what to do in
case of conflicting results.
4. Example: There's an increase in % user engagement among active users, but daily active
users have decreased.
5. Translate this to impact to users and business -
a. Is the increased engagement among existing users bringing increased revenue to
balance out the loss of some daily active users?
b. For example:
● Let's say the daily active users were 5K earlier but now it has come down to 3K.
● However the user engagement has increased from 45% to 65%
● If the increase in user engagement has led to a revenue increase despite the loss
of daily active users, this feature might be worth considering.
● It's also worth thinking through a strategy to retain daily active users as a next step.
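A back-of-envelope sketch of question 5a above, using the example numbers. Treating revenue as scaling with the number of engaged users is a deliberate simplification for illustration; real ad revenue depends on impressions, time spent, and other factors.

```python
# Numbers from the example above
dau_before, engagement_before = 5_000, 0.45
dau_after,  engagement_after  = 3_000, 0.65

engaged_before = dau_before * engagement_before   # 2,250 engaged users/day
engaged_after  = dau_after * engagement_after     # 1,950 engaged users/day

# If ad revenue scales with engaged users, revenue per engaged user would need to
# rise by at least this factor for the launch to break even on revenue:
breakeven_uplift = engaged_before / engaged_after
print(f"Engaged users: {engaged_before:,.0f} -> {engaged_after:,.0f}")
print(f"Revenue per engaged user must rise by ~{(breakeven_uplift - 1) * 100:.0f}% to break even")
```

With these example numbers, engaged users drop from 2,250 to 1,950 per day, so revenue per engaged user would need to rise by roughly 15% for the launch to break even under this simplified model.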
Consider the short-term and long-term impact of the launch -
1. Sometimes a short-term impression increase can conflict with the brand image or company’s
mission in the long run.
2. One reasonable suggestion could be that even with the decrease in daily active users, the
launch of the colored background status could potentially bring more engaged users to the
platform, and in the long term the benefits may outweigh the drawbacks.
Activities
Outcomes:
Terminal Questions