A/B Testing

Master controlled experimentation: randomization, sample size calculation, statistical vs practical significance, and common pitfalls in A/B testing.


The Scientific Method for Business

A/B testing is controlled experimentation applied to real-world decisions. Should you make the button blue or green? Use this headline or that one? Charge $9.99 or $10.99?

Instead of guessing, you test. Half your users see version A, half see version B. Then you measure which performs better.

This is how modern tech companies make decisions: Google, Amazon, and Netflix each run thousands of A/B tests every year.

The A/B Testing Framework

A randomized controlled experiment comparing two versions (A and B) to determine which performs better on a metric you care about.

Key components:

  • Control (A): Current version (baseline)
  • Treatment (B): New version (challenger)
  • Randomization: Users randomly assigned to A or B
  • Metric: Quantifiable outcome (clicks, sales, signups)

Classic A/B Test

Hypothesis: Changing button color from blue to green increases click-through rate.

Control (A): Blue button
Treatment (B): Green button
Metric: Click-through rate (CTR)

Execution:

  • Randomly show 50% of users blue, 50% green
  • Measure: A CTR = 12%, B CTR = 14%
  • Test: Is 14% vs 12% statistically significant?

Result: If significant, roll out green button to everyone.

Randomization is Critical

Why randomize? To ensure groups are comparable except for the treatment.

Without randomization:

  • Maybe power users get version B → unfair comparison
  • Maybe mobile users get A, desktop get B → confounding
  • Maybe early morning users get A → time effects

Randomization eliminates systematic differences. Any difference in outcomes is due to the treatment (or chance).

This is why randomized controlled trials (RCTs) are the gold standard in medicine and science.

Never assign treatments based on user characteristics! "Show version B to high-value customers" creates selection bias. Randomization is the only way to ensure causal inference.
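In practice, random assignment is often implemented by hashing a stable user ID, so each user always sees the same variant while the population splits evenly. A minimal sketch of this common pattern (function and experiment names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "button-color") -> str:
    """Deterministically assign a user to A or B by hashing their ID.

    Hashing (experiment, user_id) gives a stable, effectively random
    50/50 split that ignores user characteristics entirely.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The split should come out close to 50/50 over many users:
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)  # roughly 5,000 per group
```

Because assignment depends only on the hash, a returning user never flips between variants mid-experiment.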

Choosing Sample Size

How many users do you need? This depends on:

1. Baseline rate (p₁)
Current conversion rate, CTR, etc.

2. Minimum detectable effect (MDE)
Smallest improvement worth detecting (e.g., +2 percentage points)

3. Significance level (α)
Usually 0.05 (5% false positive rate)

4. Power (1-β)
Usually 0.80 (80% chance of detecting real effect)

Sample Size Calculation

Current CTR: 10% (p₁ = 0.10)
Want to detect: +2% improvement (p₂ = 0.12)
α = 0.05, power = 0.80

Using a sample size formula (or online calculator):
Need ~3,800 users per group (7,600 total)

If you only have 500 users, you might not detect even a large improvement — the test is underpowered.

Rule of thumb: Detecting small effects requires huge samples. A 0.1 percentage point improvement might need over a million users per group. Always calculate sample size before running the test!
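The calculation above can be sketched with the standard normal-approximation sample size formula for comparing two proportions (a textbook formula; the function name is illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# The worked example: 10% baseline, want to detect 12%
print(sample_size_per_group(0.10, 0.12))  # ~3,841 per group
```

Shrinking the minimum detectable effect to 0.1 percentage points (0.10 → 0.101) pushes the requirement past a million users per group, which is why the rule of thumb above holds.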

Statistical vs Practical Significance

Statistical significance: The difference is unlikely due to chance (p < 0.05).

Practical significance: The difference matters in the real world.

These aren't the same!

Significant but Meaningless

You test a new checkout flow on an enormous sample of users.

Result:

  • Control: 10.00% conversion
  • Treatment: 10.01% conversion
  • p-value = 0.03 (statistically significant!)

But: a 0.01 percentage point improvement → 1 extra sale per 10,000 users → maybe $10/day in extra revenue.

Cost of implementing: Redesigning checkout, engineering time → $50,000.

Conclusion: Statistically significant but not worth doing. With huge samples, tiny differences become "significant" even when they don't matter.

Always consider both: Is it statistically significant? And does the effect size matter? A 0.01 percentage point improvement might not be worth the engineering effort.

Common Mistakes

1. Peeking at results early: Checking results repeatedly and stopping as soon as p < 0.05 inflates the false positive rate. Decide sample size upfront and stick to it.

2. Running too many tests simultaneously: If you test 20 things, one will be "significant" by chance (the multiple testing problem). Use corrections like Bonferroni.

3. Ignoring seasonality: Running a test over a holiday or special event can confound results. Control for time effects or randomize timing.

4. Not checking for sample ratio mismatch: If you expected a 50/50 split but got 48/52, something is wrong with the randomization.

5. Confusing correlation with causation: Without randomization, you're just observing correlations. Only randomized experiments establish causation.

6. Stopping early because "it's not significant": Let the test run to the planned sample size. Stopping early when losing is as bad as stopping early when winning.
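The sample ratio mismatch check (mistake 4) can be automated: under a true 50/50 split, group sizes follow a binomial distribution, so a large deviation is a red flag. A minimal sketch using the normal approximation (function name is illustrative):

```python
from math import erfc, sqrt

def srm_pvalue(n_a: int, n_b: int) -> float:
    """Two-sided p-value for a sample ratio mismatch, assuming a 50/50 split.

    Under H0, n_a ~ Binomial(N, 0.5); we use the normal approximation.
    """
    n = n_a + n_b
    z = (n_a - n / 2) / sqrt(n / 4)   # standardized deviation from an even split
    return erfc(abs(z) / sqrt(2))     # two-sided normal tail probability

# A 48/52 split over 100,000 users is wildly unlikely under true 50/50:
print(srm_pvalue(48_000, 52_000))  # far below 0.001 -> investigate the assignment code
```

A tiny p-value here means the randomization itself is broken, so any treatment comparison from that data is suspect.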

Beyond Simple A/B: Variants

A/B/C/.../Z testing: Test multiple variants simultaneously. Requires larger samples (power decreases with more groups).

Multivariate testing: Test combinations. Example: Test 3 headlines × 2 images = 6 combinations. Much more complex analysis.

Sequential testing: Update results as data arrives with proper corrections (e.g., always-valid p-values).

Personalization: Instead of A vs B for everyone, use machine learning to assign best variant per user based on features.
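With many variants, the multiple testing problem from the pitfalls above bites quickly. A short calculation shows how fast the family-wise false positive rate grows across independent comparisons, and how the Bonferroni correction compensates (standard formulas; function names are illustrative):

```python
def family_wise_error_rate(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(alpha: float, k: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / k

print(family_wise_error_rate(0.05, 20))  # ~0.64: a false "win" is more likely than not
print(bonferroni_threshold(0.05, 20))    # 0.0025 required per comparison
```

The steeper per-comparison threshold is one reason multi-variant tests need substantially larger samples than a simple A/B test.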

The Analysis

Most A/B tests analyze proportions (conversion rates):

Hypothesis test: H₀: p₁ = p₂ (no difference)
H₁: p₁ ≠ p₂ (two-sided) or p₂ > p₁ (one-sided)

Test statistic: Two-proportion Z-test

z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}

Where p̂ is the pooled proportion.

Confidence interval for the difference:

(\hat{p}_2 - \hat{p}_1) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}

Test Results

Control: 240/2000 = 12% conversion (n=2000)
Treatment: 300/2000 = 15% conversion (n=2000)

Difference: 3 percentage points
95% CI: [0.8%, 5.2%]
p-value = 0.01

Interpretation: The treatment increased conversion by 3 percentage points, and we're 95% confident the true improvement is between 0.8 and 5.2 percentage points. This is statistically significant and likely worth rolling out.
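The numbers in this example can be reproduced directly from the z-test and confidence-interval formulas above (expect small differences from the rounded figures quoted; the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(x1: int, n1: int, x2: int, n2: int, alpha: float = 0.05):
    """Two-sided two-proportion z-test plus a CI for the difference p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                         # pooled proportion (test)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled SE (CI)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p2 - p1
    return z, p_value, (diff - z_crit * se_ci, diff + z_crit * se_ci)

# Control: 240/2000 conversions, Treatment: 300/2000 conversions
z, p, (lo, hi) = two_proportion_test(240, 2000, 300, 2000)
print(f"z = {z:.2f}, p = {p:.4f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# z ≈ 2.78, p ≈ 0.006, CI ≈ [0.009, 0.051]
```

Note the pooled proportion is used for the test statistic (it assumes H₀: p₁ = p₂) while the unpooled standard error is used for the confidence interval, matching the two formulas above.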
