DDDM_Lecture3_ExperimentBasics_Dec11

The document discusses the fundamentals of experimental design in data-driven decision making, focusing on A/B testing and best practices for conducting rigorous experiments. It covers hypothesis development, randomization strategies, sample size determination, and statistical analysis methods, emphasizing the importance of proper execution to avoid misleading conclusions. Key statistical concepts such as hypothesis testing, Type I and II errors, and various test statistics are also outlined to ensure effective evaluation of experimental results.


Data Driven Decision Making in Business

Lecture 3 Experimental Design – Fundamentals

Shuyang Yang
Dec 11th
Agenda
I. Best Practices for Conducting Rigorous A/B Tests

II. Statistics in Experiments:

• Hypothesis testing
• Test Statistics
• MDE (Minimum Detectable Effect)



0. Introduction: AB Testing / RCT/ Experiments

The concept is simple:


1. Randomly split the sample into two (or more) groups
   • A (Control) vs. B (Treatment)
2. Collect data and calculate metrics of interest
   • Run statistical tests
3. Analyze and decide which group to go with



0. Introduction: AB Testing / RCT/ Experiments
Why experiments?
• Causal interpretation: randomized assignment provides the ground truth for evaluating the impact of a treatment/intervention/strategy/feature ---- Golden Rule
• Iterative improvement: small, incremental tests compound over time to create significant improvement ---- culture of experimentation
• Measuring real-world impact: evaluate changes directly with real users/customers in real-world conditions

"It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment[s], it's wrong." -- Richard Feynman



I. Best Practices for Conducting Rigorous A/B Tests

• The advantages of A/B testing only hold if the test is conducted rigorously and adheres to best practices.

• A poorly designed or executed A/B test can lead to misleading conclusions, wasted resources, or even harmful changes to the business…

Common pitfalls:
• Insufficient sample size
• Incorrect randomization
• Poor monitoring during the test
• Ignoring statistical significance





I. Best Practices: Develop Hypothesis

Identify the Problem
• Identify specific phenomena or problems from historical data or experience
• Example: too many push ads disturb non-active customers and reduce retention

Form a Hypothesis
• Formalize the problem into a measurable/quantifiable hypothesis
• Subjects – Who: low-engagement customers
• Treatment – What: who receive 10+ push messages (compared to customers who receive fewer than 10 messages)
• Quantifiable outcome – How: have 10% lower 1-month retention


I. Best Practices: Develop Hypothesis
A good hypothesis should be:
• Plausible Mechanism: there should be a logical or theoretical explanation for how the cause leads to the effect.
• Measurable: clearly defines the variables involved and the expected outcome, and includes metrics or criteria that can be measured.





I. Best Practices: Set up Experiment
• Target Population
• Experiment Unit
• Randomization Strategy
• Define Success Metrics
• Traffic Requirement



I. Best Practices: Set up Experiment
1. Target Population:
The entire group of individuals, users, or entities that the test aims to represent and from which the experimental sample is drawn. It defines the scope of the A/B test and ensures that the results are generalizable to the group of interest.
• Inclusivity: represent all potential individuals affected by the change being tested

Full population vs. Specific subpopulation
• All dimensions: demographics/states | By dimension: female; age 30+; low-engaged
• W/o trigger behavior: all registered users | W/ trigger behavior: customers who visited a particular page
• No exclusion: all registered users | Exclusion: blocklist



I. Best Practices: Set up Experiment
Experiment Unit:
An experimental unit in an A/B test is the smallest entity or subject to which a treatment (or variation) is applied and for which an outcome is measured.

Granularity: at what level the randomization occurs (see the sketch below)
• Individual level
• Individual-session level
• Group/cluster level: e.g., geographic regions
• Time-based units

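In practice, assignment at a chosen granularity is often implemented by deterministically hashing the randomization unit's ID. This is a minimal sketch, not the lecture's code; the salt name and the 50/50 split are made-up assumptions.

```python
# Sketch: deterministic bucketing by hashing the randomization unit's ID.
# The salt ("exp3_push_ads") and 50/50 split are illustrative assumptions.
import hashlib

def assign(unit_id: str, salt: str = "exp3_push_ads", treatment_share: float = 0.5) -> str:
    """Map a unit (user, session, cluster, ...) to 'treatment' or 'control'."""
    bucket = int(hashlib.md5(f"{salt}:{unit_id}".encode()).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign("user_42"))             # user-level unit
print(assign("user_42:session_7"))   # individual-session-level unit
```

The same unit always lands in the same group, which keeps the experience consistent across visits; changing the salt re-randomizes for a new experiment.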




I. Best Practices: Set up Experiment
Randomization Strategy:
Randomization strategies determine how participants are assigned to experimental groups (control and variation). Proper randomization ensures that the groups are comparable, reducing bias and increasing the reliability of causal inferences.

• Simple randomization -- works with large sample sizes, which yield more homogeneous distributions
• Stratified randomization -- participants are grouped into strata (subgroups) based on specific characteristics (e.g., age, location), and randomization occurs within each subgroup (see the sketch below)
• Cluster randomization -- entire clusters are randomly assigned (Han's school example)
• Adaptive randomization -- a dynamic approach that adjusts allocation over time to improve efficiency
  • Multi-armed bandit algorithms

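As a concrete illustration of stratified randomization, here is a minimal sketch (not the lecture's code); the strata and field names are made-up assumptions.

```python
# Sketch: 50/50 assignment within each stratum (e.g., age bucket x city).
import random
from collections import defaultdict

def stratified_assign(units, stratum_of, seed=42):
    """Shuffle and split units 50/50 inside every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[stratum_of(u)].append(u)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)                  # randomize order within the stratum
        half = len(members) // 2
        for u in members[:half]:
            assignment[u["id"]] = "control"
        for u in members[half:]:
            assignment[u["id"]] = "treatment"
    return assignment

# Usage: stratify on a (decade-of-age, city) key
users = [{"id": i, "age": 20 + i % 40, "city": "HK" if i % 2 else "SZ"} for i in range(1_000)]
groups = stratified_assign(users, stratum_of=lambda u: (u["age"] // 10, u["city"]))
```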




I. Best Practices: Set up Experiment
Define Success Metrics:

Type of Metric | Definition | Bing Example
Primary Metrics | Directly driven by the test; the main outcome | Clickthrough rate
North Star Metric | The most important metric, measuring long-term success | Searches per user; ad income
Guardrail Metrics | Metrics that ensure the test doesn't negatively impact critical areas of the business | Loading time
Secondary Metrics | Indirect outcomes; process metrics to uncover mechanism | Mouse hover duration


I. Best Practices: Set up Experiment
Traffic Requirement – determine sample size
Required inputs:
• Baseline value: the current value of the metric you are measuring; e.g. 10% CTR
• (+) Minimum Detectable Effect (MDE): the smallest effect size you want to detect
  • Previous test results
  • Based on ROI, the impact has to be XX% to cover the cost
• (-) Significance level (𝛼) – Type I error (false positive)
• (+) Statistical power (1−𝛽) – the probability of detecting a true effect when it exists

Output: sample size N (a minimal calculation sketch follows)

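For intuition, here is a minimal sketch of one standard two-proportion sample-size approximation, assuming a two-sided test with equal allocation. Calculators such as the one linked on the next slide may use slightly different formulas, so outputs can differ a little.

```python
# Sketch: approximate N per group for a two-proportion test.
import math
from scipy.stats import norm

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """N per group needed to detect an absolute lift `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)      # critical value for Type I error
    z_beta = norm.ppf(power)               # critical value for Type II error
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# e.g. 10% baseline CTR, detect an absolute +1pp lift
print(sample_size_per_group(0.10, 0.01))   # about 14,750 users per group
```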


I. Best Practices: Set up Experiment
Traffic Requirement – determine sample size

https://www.evanmiller.org/ab-testing/sample-size.html
I. Best Practices: Set up Experiment
Traffic Requirement – Establish Test Duration

• Based on the required sample size (see the sketch below)
• Account for seasonal and temporal effects
  • Run for at least one full cycle (e.g., a full week)
  • Avoid major anomalies
• Account for metric calculation windows
  • e.g., a 7-day attrition rate needs at least 7 days of follow-up

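A toy sketch of turning the sample-size requirement into a duration; the daily-traffic figure is a made-up assumption and your numbers will differ.

```python
# Sketch: convert required sample size into a test duration, rounded to full weeks.
import math

required_per_group = 14_749        # from the sample-size step above
num_groups = 2
daily_eligible_users = 4_000       # assumption: replace with your actual traffic

days_needed = math.ceil(required_per_group * num_groups / daily_eligible_users)
duration_days = max(7, math.ceil(days_needed / 7) * 7)   # at least one full weekly cycle
print(duration_days)               # 8 days of traffic needed -> run for 14 days
```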


I. Best Practices: Run an Experiment
Monitor the process

• Traffic allocation (see the sample-ratio check below):
  • Ensure that participants are correctly allocated to control and variation groups
  • No significant discrepancies in sample size
• Early indicators of results:
  • Monitoring trends can help identify potential issues
• Data integrity audits:
  • Detect and resolve data issues before analysis

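One common way to monitor traffic allocation is a sample-ratio-mismatch (SRM) check: a chi-square goodness-of-fit test of observed group sizes against the intended split. A sketch with made-up counts and an assumed 50/50 design:

```python
# Sketch: sample-ratio-mismatch (SRM) check against an intended 50/50 split.
from scipy.stats import chisquare

control_n, treatment_n = 50_600, 49_400          # observed unit counts (made up)
total = control_n + treatment_n
stat, p_value = chisquare([control_n, treatment_n], f_exp=[total / 2, total / 2])

print(f"SRM check p-value: {p_value:.5f}")
if p_value < 0.001:                              # a strict threshold is typical for SRM
    print("Possible sample ratio mismatch - investigate before trusting results")
```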


I. Best Practices: Analyze and Decide
Analysis

• Analyze results for all metrics of interest
• Check statistical significance (perform statistical tests)
  • Consider statistical power; ensure sufficient power to detect the MDE
• Assess "real-world" significance
  • Is the impact big enough to drive the North Star metric?
• Examine segment-level results
  • Identify variations across different user segments ---- more on this later



I. Best Practices: Analyze and Decide
Make a Decision

• Roll out the treatment to all units
  • Monitor post-implementation metrics
  • Confirm the impact in the full population
• Maintain the status quo
  • Deep dive on causes
  • Explore alternative hypotheses
• Conduct a follow-up test
  • Refine the test design
  • Refine the hypothesis



II. Statistics in AB Testing
Hypothesis Testing – Key Definitions

Hypothesis testing is conducted around two key hypotheses:

• Null hypothesis (H₀)
  • Assumes no difference between the control and variation
  • Example: "The underlying design does not increase click-through rate (CTR)."
• Alternative hypothesis (H₁)
  • Assumes there is a difference
  • Example: "The underlying design increases click-through rate."



II. Statistics in AB Testing
Hypothesis Testing – Steps

1. Define H₀ and H₁
2. Determine a test statistic to use and choose a significance level (𝛼)
3. Calculate the test statistic based on sample data and obtain the p-value
   • P-value: the probability of obtaining a test statistic at least as extreme as the one observed, under H₀
4. Compare the p-value to 𝛼
   • Reject H₀ if p < 𝛼



II. Statistics in AB Testing
Hypothesis Testing – Type I and Type II Errors
• Type I error:
  • False positive (H₀ is rejected when it is true)
  • Denoted 𝛼
  • e.g. no change in CTR -> but conclude that there is a change
• Type II error:
  • False negative (fail to reject H₀ when H₁ is true), denoted 𝛽
  • e.g. CTR is increased -> but conclude that there is no impact
• Statistical power: 1 − 𝛽



II. Statistics in AB Testing
Hypothesis Testing – Type I and Type II Errors

• Statistical power: 1 − 𝛽
  • The test's ability to detect a true effect when it exists (the probability of avoiding a Type II error)
  • Commonly desired: 80%
  • Ensures a test is sensitive enough to detect a meaningful effect



II. Statistics in AB Testing
Hypothesis Testing – Balancing Type I and Type II

• Set an appropriate significance level (𝛼)
  • The lower the significance level (𝛼), the more likely a Type II error occurs
  • Common standard: set 𝛼 = 0.05, 𝛽 = 0.2
• Increase statistical power:
  • Reduce variability in the data – more precise estimates
  • Increase sample size
  • Maximize effect size
• Context matters!
  • In high-stakes situations, a Type I error is more critical (e.g., clinical trials)





II. Statistics in AB Testing
Test Statistics – T-test (most common)
• Difference-in-means test: compares the means of two groups
  • Used when the sample is small or the population standard deviation is unknown
• Assumptions:
  o Normality: the sample mean follows a normal distribution (CLT)
  o Independent observations: within and between groups
  o Equal variance: the variances of the two groups are equal
    o If not, use Welch's t-test
• Applications: comparing metrics that are averages (see the sketch below)
  o Average duration on site; average order value
• Limitations:
  o Sensitive to outliers

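A minimal sketch of the difference-in-means test with SciPy, using Welch's variant (equal_var=False) since equal variances are rarely guaranteed; the session durations are synthetic.

```python
# Sketch: Welch's t-test on made-up "average duration on site" data (seconds).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=180, scale=60, size=5_000)
treatment = rng.normal(loc=185, scale=60, size=5_000)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # reject H0 if p < 0.05
```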


II. Statistics in AB Testing
Test Statistics – Proportion Test

• Difference-in-proportions test: compares proportions between groups (Bernoulli outcomes)
• Assumptions: a special case of the t-test
• Applications: comparing metrics that are binary outcomes (see the sketch below)
  o Clickthrough rate; conversion rate
• Limitations:
  o Sensitive to rare events (very low probabilities)

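A minimal sketch of the difference-in-proportions test using statsmodels; the click and user counts are made up.

```python
# Sketch: two-proportion z-test on made-up CTR data.
from statsmodels.stats.proportion import proportions_ztest

clicks = [1_150, 1_000]        # successes in treatment, control
users = [10_000, 10_000]       # users exposed in each group

z_stat, p_value = proportions_ztest(count=clicks, nobs=users)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")   # reject H0 if p < alpha
```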


II. Statistics in AB Testing
Test Statistics – Chi-square Test

• Tests the association or independence between categorical variables; handles multiple categories
• Assumptions: flexible (large-sample requirement)
• Applications: comparing metrics of binary/categorical outcomes (see the sketch below)
  o Test whether click-through or conversion rates differ between groups
  o Test the balance of traffic allocation (sample sizes of the control and treatment groups)
• Limitations:
  o Only tests for significance, not the size of the effect

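A minimal sketch using SciPy's chi-square test of independence on a 2x2 contingency table (clicked vs. not clicked, by group); the counts are made up.

```python
# Sketch: chi-square independence test on a 2x2 click table.
from scipy.stats import chi2_contingency

#              clicked   not clicked
table = [[1_150,  8_850],    # treatment
         [1_000,  9_000]]    # control

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# Note: this only tests significance; report the effect size (e.g. lift in CTR) separately.
```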


II. Statistics in AB Testing
Test Statistics – Test for Derived Metric

• Use case:
  • The analysis unit is different from the experiment unit
    • e.g. average clicks per content item
  • Observations at the analysis level are NOT independent – the naive variance is incorrect
• Ratio metrics measure the ratio of two metrics
  • e.g. page-level CTR = # clicks / # page visits
    = (# clicks / # users) / (# page visits / # users)
    = (user-level) / (user-level) ------ a ratio metric with i.i.d. (numerator, denominator) pairs across users
• Delta method: Taylor expansion (1st order)

$$\mathrm{CTR} = \frac{\sum_i X_i}{\sum_i Y_i} = \frac{\tfrac{1}{n}\sum_i X_i}{\tfrac{1}{n}\sum_i Y_i} = \frac{\bar{X}}{\bar{Y}} = f(\bar{X}, \bar{Y})$$



II. Statistics in AB Testing
Test Statistics – Test for Derived Metric

• Using delta-method, the variance becomes:
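For $f(\bar{X}, \bar{Y}) = \bar{X}/\bar{Y}$, the standard first-order delta-method approximation is

$$\mathrm{Var}\!\left(\frac{\bar{X}}{\bar{Y}}\right) \approx \frac{\mathrm{Var}(\bar{X})}{\mu_Y^2} - \frac{2\,\mu_X\,\mathrm{Cov}(\bar{X}, \bar{Y})}{\mu_Y^3} + \frac{\mu_X^2\,\mathrm{Var}(\bar{Y})}{\mu_Y^4}$$

with the means and (co)variances estimated from the experiment-unit-level (e.g., user-level) data.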

• More use cases:
  • Average messages sent in each group – randomized at the user level
  • Revenue per rider – randomized at the device level (shared bikes)
  • Session CTR – randomized at the user level (one user has multiple sequential sessions)

A sketch of the corresponding test follows.

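This is a sketch of a delta-method z-test for a user-level ratio metric such as page-level CTR; the per-user clicks and page views are synthetic, and the helper name is made up.

```python
# Sketch: delta-method variance and z-test for a ratio metric (sum clicks / sum views).
import numpy as np
from scipy.stats import norm

def ratio_mean_and_var(clicks, views):
    """Delta-method mean and variance of sum(clicks)/sum(views) across users."""
    n = len(clicks)
    mx, my = clicks.mean(), views.mean()
    vx, vy = clicks.var(ddof=1) / n, views.var(ddof=1) / n     # Var of the means
    cov = np.cov(clicks, views, ddof=1)[0, 1] / n              # Cov of the means
    ratio = mx / my
    var = vx / my**2 - 2 * mx * cov / my**3 + mx**2 * vy / my**4
    return ratio, var

rng = np.random.default_rng(1)
views_c = rng.poisson(20, 5_000); clicks_c = rng.binomial(views_c, 0.100)
views_t = rng.poisson(20, 5_000); clicks_t = rng.binomial(views_t, 0.105)

r_c, v_c = ratio_mean_and_var(clicks_c, views_c)
r_t, v_t = ratio_mean_and_var(clicks_t, views_t)
z = (r_t - r_c) / (v_t + v_c) ** 0.5
print(f"diff = {r_t - r_c:.4f}, z = {z:.2f}, p = {2 * norm.sf(abs(z)):.4f}")
```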


II. Statistics in AB Testing
MDE Revisit

• Minimum detectable effect: the smallest effect size that a statistical test can reliably detect given the sample size, significance level (α), and statistical power (1−β)
• Represents the smallest change in a metric (e.g., conversion rate or revenue) that you want to detect with a specified level of confidence -> determined prior to the test based on the business hypothesis
• Larger MDE -> smaller sample size required


Assuming n₁ = k · n₂, under H₁ the power can be calculated as:

$$1 - \beta = \Pr\left( \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}} > Z_{1-\alpha/2} \;\middle|\; p_1, p_2 \right) = 0.8$$

(a numerical sketch follows)

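A small numerical sketch of this power calculation, plus a crude search for the MDE reachable at a given per-group sample size. The rates and sample sizes are assumed values, and the code uses the same normal approximation as the formula above.

```python
# Sketch: power of a two-proportion test and the MDE reachable at a given n.
from scipy.stats import norm

def power(p1, p2, n1, n2, alpha=0.05):
    """Pr(reject H0 | true rates p1, p2) under the normal approximation."""
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - (p2 - p1) / se)    # upper-tail probability

def mde_for_power(p1, n, target=0.80, alpha=0.05):
    """Smallest absolute lift over p1 detectable with `target` power at n per group."""
    lift = 0.0001
    while power(p1, p1 + lift, n, n, alpha) < target:
        lift += 0.0001
    return lift

print(power(0.10, 0.11, 14_749, 14_749))   # ~0.80 with ~14.7k users per group
print(mde_for_power(0.10, 5_000))          # larger MDE (~1.7pp) needed at a smaller n
```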




III. Reading:
Kohavi, Ron, Diane Tang, and Ya Xu. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, 2020.

