What is a p-value in A/B testing?

A p-value is the probability of seeing a difference at least as large as the one observed if the two variants were actually identical. A small p-value (typically below 0.05) means the result is unlikely to be due to chance alone.

How do I calculate the sample size for an A/B test?

Sample size depends on your baseline rate, the minimum detectable effect you care about, the significance level (alpha) and the desired statistical power. This calculator computes it for you and estimates test duration from your traffic.

What is a confidence interval?

A confidence interval is a range of plausible values for the true effect. A 95% confidence interval means that if the experiment were repeated many times, about 95% of such intervals would contain the true value.

Free A/B Test Calculator: Sample Size & Significance

🚀 Getting started

New to testing? Start here. Learn the vocabulary, see the whole process end to end, then validate your platform before you trust a single result.

📖Plain-English glossary 🧭The full A/B test process 🅰️Start here: the A/A test

📚 Case studies

Five real, end-to-end stories — from business problem to final decision — each using a different metric and showing how to size a test efficiently with a holdout.

📚Worked examples (5 stories)

🧠 Core concepts

The four ideas every result rests on: significance, the two ways to be wrong, how big a sample you need, and the confidence interval around your answer.

📊How statistical significance works ⚖️Type I & Type II errors 📏What widens a confidence interval 📐Sample size vs MDE 🔋How power grows with sample size

✂️ Variance reduction

Techniques that get you to the same answer with less traffic — and the right test for metrics built as a ratio. Learn these once you're comfortable with the core concepts above.

✂️CUPED variance reduction ➗The delta method (ratio metrics)

🎯 Choosing your test

Which test fits your metric — and what each one actually means, in both statistical and business terms. Get this right before you analyse anything.

🧪Z-test vs t-test: which and why 📋Tests at a glance ⚙️How each test works 🎯Matching the test to the situation

📈 Distributions & theory

Why your data looks the way it does, and why averages behave so predictably at scale. The intuition that makes every test make sense.

🗂️Catalog of distribution shapes 🧭A field guide to distributions 📐Anatomy: mean, median, mode & SD 🎲Worked example: two dice 🔔The Central Limit Theorem 🧩The theory that ties it together 🔍When to check for normality 📘Case studies: checking distributions

🔬 Advanced & pitfalls

The traps that quietly invalidate results — unequal splits, peeking at results early, and testing too many things at once.

⚖️Unequal traffic splits (e.g. 90/10) 👀Why peeking inflates errors 🎰Multiple testing & error inflation

Plain-English glossary: the terms, with examples

Every A/B testing term you'll meet, explained without jargon — each with a concrete example. Start here if the vocabulary feels intimidating.

🎯Null hypothesis (H₀)

The starting assumption that there is no real difference between your variants — any gap you see is just chance. A test tries to gather enough evidence to reject it.

Example: "The new checkout button converts exactly the same as the old one." You run the test hoping to disprove this.

🎲P-value

The probability of seeing a difference at least as big as the one you observed if the variants were actually identical. A small p-value means "this gap would be surprising if nothing were really going on" — so you start to believe something is.

Example: p = 0.03 means: if the button truly made no difference, you'd see a gap this large only 3% of the time by luck. That's rare enough to call it a real effect. It does not mean "97% chance the button is better" — a common trap.

🚦Significance level (α)

The bar you set in advance for how small the p-value must be to call a result "significant." It is also your accepted false-positive rate. Usually 0.05.

Example: α = 0.05 means you accept a 1-in-20 chance of being fooled by noise into declaring a winner that isn't real.

🆚P-value vs α — what's the difference?

They are easy to confuse but they are opposite kinds of thing. α is a threshold you choose before the test — fixed, set by you, the same every time. The p-value is a number the data produces after the test — it changes with every experiment. You compare one to the other: if p < α, the result is significant.

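Think of an exam: α is the pass mark you set in advance (say 60%). The p-value is the score a student actually got (say 73%). The pass mark is your rule; the score is the evidence. Saying "p-value and α are the same" is like saying the pass mark and the grade are the same thing. One caveat: the analogy covers the roles, not the direction. A student passes by scoring above the mark, whereas a result is significant when p falls below α.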

✅Statistical significance

A result is "statistically significant" when its p-value falls below your α. It means the difference is unlikely to be pure chance — not that it is large or important.

Example: A 0.1% lift can be significant with millions of users, yet be too small to matter for the business. Significant ≠ meaningful.

📏Confidence interval (CI)

A range of plausible values for the true effect, instead of a single number. A 95% CI means: if you repeated the experiment many times, about 95% of the intervals you'd build would contain the true value. Wider interval = more uncertainty.

Example: A lift of "+8% with 95% CI [+2%, +14%]" says the real effect is probably between +2% and +14%. Because the whole range is above zero, you can be confident it's a genuine improvement. If the CI were [−3%, +19%], it crosses zero — you couldn't rule out no effect (or even harm).

💪Statistical power

The chance your test will detect a real effect if one truly exists. Low power means you might miss a genuine winner. The standard target is 80%.

Example: 80% power means: if the new button really is better, you have an 80% chance of your test catching it — and a 20% chance of missing it because your sample was too small.

📐Minimum detectable effect (MDE)

The smallest improvement worth catching. You decide it before the test — it drives how many users you need. Smaller MDE = much larger sample.

Example: "I only care if the button lifts conversion by at least 5%." Detecting a 5% lift needs far fewer users than detecting a 1% lift.

👥Sample size

How many users you need per variant to reliably detect your MDE at your chosen α and power. Too few and the test is underpowered; the Plan tab computes this for you.

Example: To spot a 5% relative lift on a 3% baseline (3.00% → 3.15%) at α = 0.05 and 80% power, you need roughly 208,000 users per arm. A 10% relative lift needs only about 53,000. How long that takes depends on your traffic.

⚠️Type I & Type II errors

The two ways a test can be wrong: a false positive (shipping a change that does nothing) and a false negative (missing a real win). Explained in full, with the α/β trade-off chart, in Type I & Type II errors under Core concepts.

🔋β (beta)

The chance of missing a real effect — a false negative. It's the flip side of power: power = 1 − β. If you aim for 80% power, you're accepting β = 20%.

Example: β = 0.20 means: if the variant truly is better, there's a 20% chance your test fails to notice — usually because the sample was too small.

📉Standard error (SE)

How much your measured result would bounce around if you reran the experiment with a fresh sample. It shrinks as your sample grows — specifically as 1/√n, so quadrupling the users halves the wobble.

Example: Measure a 4.0% conversion rate with a small SE and you can trust the 4.0%. A large SE means the true rate could easily be 3% or 5% — the same number, far less certainty.

🌪️Coefficient of variation (CV)

The spread of a metric relative to its average (SD ÷ mean). A unit-free "noisiness" score: high CV means the metric jumps around a lot for its size, so you'll need more data to see a signal.

Example: Revenue per user is famously high-CV — a few big spenders dwarf everyone else. That's exactly why ARPU tests need huge samples (or CUPED, below).

🚨Sample Ratio Mismatch (SRM)

When the actual traffic split differs from the one you intended — e.g. you asked for 50/50 but got 52/48. It's a red flag for a bug in assignment, and it invalidates the whole test, even if the p-value looks great. The Analyse tab checks this for you automatically.

Example: A redirect was quietly dropping users before they reached the variant, so control filled faster. Until that's fixed, no result can be trusted — you fix the cause and rerun, never "adjust" the numbers.

⚡CUPED

A variance-reduction technique that subtracts each user's predictable, pre-experiment behaviour from their result — leaving a cleaner signal, so you reach a decision with less data. It changes nothing about your estimate of the lift, only its precision.

Example: A user who spent heavily last month will likely spend heavily this month regardless of your change. Subtracting that known tendency can cut the required sample by a third or more. Full walkthrough in the CUPED deep-dive below.

The full A/B test process — step by step

This is the path from a vague idea to a decision you can defend.

Start here: the A/A test

Before you trust any A/B result, prove your splitting platform is fair. An A/A test sends two identical experiences to two randomly split groups — and expects to find no difference.

❓

What it is. You run an experiment where A and B are exactly the same. Since nothing differs, any "winner" your tool reports is pure noise. It is the experiment equivalent of weighing an empty scale to check it reads zero.

🎯

Why juniors should start here. Most splitting bugs are invisible until you look: a broken randomiser, users landing in both groups, bots skewing one side, or logging that double-counts. An A/A test surfaces all of these before a real test misleads you.

📊

How to read it. Expect a non-significant result (p > 0.05) roughly 95% of the time. A single significant A/A result is not alarming — at α = 0.05, about 1 in 20 will trip by chance. Run several. If significants come up far more often than ~5%, your platform has a problem.

⚖️

Check the split ratio (SRM). If you asked for a 50/50 split, the group sizes should be very close. A large imbalance — a Sample Ratio Mismatch — means assignment is broken, and no result from that platform can be trusted until it is fixed.

✅ Practical starting checklist

1. Run an A/A test on your platform.
2. Confirm the split ratio matches what you configured.
3. Confirm the p-value is non-significant across a few runs.
4. Only then start real A/B tests.

You can paste the two A/A group numbers into the Analyse tab and confirm "no clear winner", which is exactly what you want to see.
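The "about 1 in 20 will trip by chance" expectation can be checked with a quick simulation. The sketch below is illustrative only: the function name, the 5% conversion rate, the 10,000 users per arm and the 2,000 simulated runs are all assumptions, not values from this page. It draws both arms from the same true rate and counts how often a pooled two-proportion Z-test reports p < α.

```python
import numpy as np
from scipy import stats

def aa_false_positive_rate(n_sims=2000, n_per_arm=10_000, rate=0.05,
                           alpha=0.05, seed=0):
    """Share of simulated A/A tests (identical arms) that come out 'significant'."""
    rng = np.random.default_rng(seed)
    # Both arms draw from the SAME true rate, so the null hypothesis is true by design.
    conv_a = rng.binomial(n_per_arm, rate, size=n_sims)
    conv_b = rng.binomial(n_per_arm, rate, size=n_sims)
    p_a, p_b = conv_a / n_per_arm, conv_b / n_per_arm
    # Pooled two-proportion Z-test, computed for every simulated run at once.
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (p_b - p_a) / se
    p_values = 2 * stats.norm.sf(np.abs(z))
    return float(np.mean(p_values < alpha))

fp_rate = aa_false_positive_rate()
print(f"share of A/A tests flagged significant: {fp_rate:.1%}")
```

On a healthy setup the printed share should sit near α. A real platform whose A/A runs trip far more often than that is the warning sign described above.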

Case studies: A/B testing from problem to decision

Five real-world stories, each using a different metric. Every one follows the same arc — problem, plan, run, analyse, decide — and shows how to size a test efficiently with a holdout instead of exposing everyone.

💡 The holdout principle (used in every case)

Test on the minimum sample that gives 80% power — extra users past that add almost no certainty. Keep the rest as a holdout on the proven version, then roll the winner out to everyone afterwards. The full benefit comes at rollout, not during the test.

📧 A better email subject line · Conversion · Z-test

The problem

A SaaS team sends a weekly newsletter to 200,000 subscribers. Open rate is 22%. Marketing thinks a curiosity-driven subject line ("You're missing 3 features") will beat the current one. They want evidence before rolling it out to the whole list.

Hypothesis

Because past sends show curiosity-gap subject lines earned higher opens in our own data, we believe a curiosity-driven line will raise the open rate from 22% to at least 23.1% (+5% relative). We'll know we're right if the Z-test clears p < 0.05 with the lift above that threshold.

Plan & holdout

Smallest lift worth shipping: +5% relative (22% → 23.1%). The calculator says ~22,700 per group. So they test on 25,000 per group (50,000 total) and hold out the other 150,000 on the old line — no point exposing everyone to an unproven idea.

Power rises steeply then flattens — past the knee, more users mostly waste exposure.
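The ~22,700 per group in the plan above is consistent with the standard two-proportion sample-size formula. The sketch below assumes a two-sided test at α = 0.05 and 80% power; the function name is hypothetical and the calculator's exact internals are not shown on this page, so treat it as an approximation of what the Plan tab does.

```python
import math
from scipy.stats import norm

def sample_size_two_proportions(p_base, rel_lift, alpha=0.05, power=0.80):
    """Users per arm needed to detect a relative lift on a conversion rate."""
    p_var = p_base * (1 + rel_lift)            # rate we hope the variant reaches
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided significance threshold
    z_beta = norm.ppf(power)                   # quantile for the desired power
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    n = (z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2
    return math.ceil(n)

# Email case study: 22% open rate, +5% relative lift (22% -> 23.1%).
n_email = sample_size_two_proportions(0.22, 0.05)
print(f"users per group: {n_email:,}")
```

Rounding the result up to 25,000 per group, as the team did, leaves a small safety margin without exposing much more of the list.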

Run & collect

A single send goes out Tuesday; opens counted over 48 hours. Control 5,503/25,000 = 22.0%. Variant 5,795/25,000 = 23.2%.

Analyse

Two-proportion Z-test: z ≈ 3.1, p ≈ 0.002. The 95% intervals for the two open rates don't overlap, and the split was a clean 50/50 (no SRM concern).
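The z and p values above can be reproduced from the raw counts. This is a minimal sketch of a pooled two-proportion Z-test (two-sided); the function and variable names are illustrative assumptions.

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 both arms share one rate, so pool the conversions to estimate it.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Counts from the email send: control 5,503/25,000, variant 5,795/25,000.
z_email, p_email = two_proportion_ztest(5503, 25_000, 5795, 25_000)
print(f"z = {z_email:.2f}, p = {p_email:.4f}")
```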

✅

Decision: roll out to all 200,000. The new line lifts opens ~5% relative — about 2,300 extra opens every week from then on. The holdout meant the full list was never risked on an untested idea.

Takeaway: a yes/no outcome (opened or not) is a conversion metric → Z-test. Size to the minimum, hold out the rest, capture the benefit at rollout.

🔔 When the test says "no" · Conversion · Z-test

The problem

A growth team adds emoji and urgency to a re-engagement push ("⏰ Don't lose your streak!"), expecting more dormant users to return. Re-engagement rate is ~4%.

Hypothesis

Because support tickets and survey replies suggest users forget the app rather than dislike it, we believe an urgency-and-emoji push will lift the return rate from 4% by at least +0.5pp. We'll also watch a guardrail — mute/unsubscribe rate — so a "win" on returns doesn't quietly cost us long-term reach.

Plan & holdout

Even with millions of dormant users, they test on a holdout-sized cohort — 50,000 per group — not the whole base. No reason to blast everyone with an unproven message.

Run & collect

Control 4.0% vs variant 4.1%. The difference is tiny.

Analyse

Z-test: p = 0.42. The 95% interval for the lift runs from −0.14pp to +0.34pp — it straddles zero, so no effect can be claimed.

The confidence interval crosses zero — the data can't rule out "no effect."
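The interval quoted above follows from the unpooled standard error of the difference. In the sketch below, the conversion counts (2,000 and 2,050 out of 50,000) are an assumption obtained by taking the reported 4.0% and 4.1% as exact; the function name is hypothetical.

```python
import math
from scipy.stats import norm

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation CI for (variant rate - control rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled SE: each arm contributes its own binomial variance.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

ci_low, ci_high = diff_confidence_interval(2000, 50_000, 2050, 50_000)
print(f"95% CI for the lift: {ci_low * 100:+.2f}pp to {ci_high * 100:+.2f}pp")
```

Because the lower bound is negative and the upper bound positive, zero sits inside the interval, which is the "can't rule out no effect" reading.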

🛑

Decision: don't ship. No detectable lift — and a guardrail check showed the emoji slightly raised mute rates. "No change" saved them from shipping clutter and annoying users.

Takeaway: a non-significant result is not a failed experiment. It's evidence that stops you shipping noise — and the holdout means you learned it cheaply.

🛒 A checkout redesign, and a split that lied · Conversion · Z-test

The problem

An e-commerce team rebuilds the checkout to cut drop-off. Current purchase-completion rate is 30%. They want to detect at least a +8% relative lift before shipping a change this big.

Hypothesis

Because funnel analytics show 1 in 3 users abandon at a cluttered payment step, we believe a streamlined checkout with a progress bar will raise completion from 30% by at least +8% relative (to ≈32.4%). Success = a significant positive Z-test on a clean 50/50 split — the split health matters as much as the p-value here.

Plan & holdout

The calculator says ~5,850 per group. With heavy daily traffic they expose just 10% of sessions to the test and keep 90% on the current checkout — a small, safe exposure for a risky change. They'll ramp up only if it's healthy.

⚠ The catch: SRM

First read: 6,200 in control vs 5,800 in variant. The Analyse tab's Sample Ratio Mismatch check flags it: p ≈ 0.0003, far below the 0.001 alarm. A redirect bug was dropping some variant users after assignment, before they were counted. Any result here would be untrustworthy. They fix the bug and rerun.

The arms drifted from the intended 50/50 — a signal that assignment, not the variant, is the story.
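An SRM check is usually a chi-square goodness-of-fit test of the observed group sizes against the intended split. The sketch below assumes that approach and a 0.001 alarm threshold as mentioned above; the helper name is hypothetical and the Analyse tab's exact implementation is not shown here.

```python
from scipy.stats import chisquare

def srm_check(n_control, n_variant, expected_split=(0.5, 0.5), alarm=0.001):
    """Chi-square test of observed arm sizes against the intended split."""
    total = n_control + n_variant
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare([n_control, n_variant], f_exp=expected)
    return stat, p_value, p_value < alarm

srm_stat, srm_p, srm_flag = srm_check(6200, 5800)
print(f"chi2 = {srm_stat:.2f}, p = {srm_p:.5f}, SRM alarm: {srm_flag}")
```

For the 6,200 vs 5,800 split the p-value lands well below the alarm threshold, so the check fires before anyone looks at the conversion numbers.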

Analyse (after fix)

Clean rerun: control 30.0% vs variant 32.5%, a healthy 50/50 split, z ≈ 4.2, p < 0.001.

✅

Decision: ramp up, then ship. A real +2.5pp lift in completed purchases. They increase exposure gradually, confirm the gain holds, and roll out.

Takeaway: always check the split before the result. A Sample Ratio Mismatch means something is broken upstream — the lift is an illusion until it's fixed.

💸 A pricing page that lifts revenue per visitor · Continuous · Welch's t

The problem

A team reorders the plans on a pricing page, hoping to raise average revenue per visitor (currently ~$18, but very spread out — many $0 visitors, a few large purchases).

Hypothesis

Because session recordings show visitors anchor on the first plan they see, we believe leading with the mid-tier plan will raise average revenue per visitor from $18 by at least +$1. Because revenue is a continuous, highly-skewed amount, success is a significant Welch's t-test — not a Z-test — on the per-visitor means.

Plan — and why not a Z-test

Revenue per visitor is a continuous value, not yes/no, so this needs Welch's t-test, not a Z-test — and the spread (standard deviation), not just the average, drives the sample size. With high variance the calculator asks for ~23,700 per group. Traffic is only ~1,500/day, so the lesson flips: rather than a holdout, they must run ~5 weeks just to reach the minimum.

Revenue is highly skewed with a long tail — that spread is exactly why the sample must be large.

Run & collect

After 5 weeks: ~43,000 visitors per arm. Control mean $18.00, variant mean $19.15 (both with large standard deviations ~$42).

Analyse

Welch's t-test: t ≈ 4.0, p < 0.001. The +$1.15 per visitor is small relative to the spread, but with 43,000 per arm it's clearly distinguishable from noise.
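When only summary statistics are at hand, Welch's t-test can be run directly from means, standard deviations and group sizes. The sketch below plugs in the rounded figures from this case study (43,000 per arm, SD of $42 in both arms), which are approximations, so the output is a sanity check rather than an exact replay of the analysis.

```python
from scipy.stats import ttest_ind_from_stats

# Variant first, control second, so a positive t means the variant is higher.
result = ttest_ind_from_stats(
    mean1=19.15, std1=42.0, nobs1=43_000,   # variant (rounded values)
    mean2=18.00, std2=42.0, nobs2=43_000,   # control (rounded values)
    equal_var=False,                        # Welch: no equal-variance assumption
)
t_rev, p_rev = result.statistic, result.pvalue
print(f"t = {t_rev:.2f}, p = {p_rev:.5f}")
```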

✅

Decision: ship it. +$1.15 per visitor across all traffic is meaningful revenue. The long runtime was the price of a noisy metric — worth it for a permanent gain.

Takeaway: a per-user dollar amount is continuous → Welch's t-test. High variance means big samples; when traffic is scarce, time replaces the holdout.

🚀 An onboarding nudge that changes behaviour · Ratio · Delta method

The problem

A product team adds a guided checklist to onboarding, hoping new users take more key actions per session. The metric is actions per session — a ratio, where each user contributes many sessions.

Hypothesis

Because activation data shows users who complete 3+ actions in week one retain far better, we believe a guided checklist will raise actions per session from ~1.0 by a meaningful margin. Because the metric is a ratio (many sessions per user), success must be judged with the delta method, which gets the variance right — a naive test would over-claim.

Plan — why the delta method

You randomise by user but measure per session, so a user's sessions are correlated. A naive t-test on session-level ratios would understate the variance and cry "significant" too easily. The right tool is the delta method, which estimates the correct variance and feeds a Z-test. They test on a safe 20% slice first (a risky onboarding change), holding the rest on the current flow.

Run & collect

8,000 users per arm. Control averages ~1.01 actions/session; variant ~1.10 — but the per-user session counts vary widely, so the variance must be handled carefully.

Each dot is a user's ratio. The variant cluster sits higher, but per-user spread is what the delta method accounts for.

Analyse

Delta method + Z-test: z ≈ 6.9, p < 0.001. The lift in actions per session is real, and the variance was estimated correctly — a naive test would have over-stated the certainty.

✅

Decision: ramp and ship. The checklist genuinely changes behaviour. They expand from 20% to full rollout after confirming the lift holds.

Takeaway: ratio metrics (actions/session, revenue/order) need the delta method to get the variance right — otherwise you'll ship false positives.

How statistical significance works

The null hypothesis assumes no real effect. If your test statistic lands in the shaded rejection region (|Z| > z_{α/2}, about 1.96 for a two-sided test at α = 0.05), you reject it. α = 0.05 shown.

Type I & Type II errors

H₀ (blue) and H₁ (orange) distributions overlap. Red tail = α (false positive). Grey area = β (missed effect). Power = 1 − β.

Type I · false positive: Declaring a winner that isn't real. You ship a change that does nothing. Its rate is α, the significance level you set (usually 5%).

Type II · false negative: Missing a real winner. A change that truly works fails your test. Its rate is β; power = 1 − β is your chance of catching it (aim for 80%+).

Example: Shipping a button that actually does nothing = Type I. Killing a button that actually worked = Type II.

What makes a confidence interval wider or narrower?

A CI's width is its uncertainty. Three things drive it — and sample size is the one you control most directly.

Confidence interval narrows as sample size grows

95% CI for a 30% conversion rate. The interval shrinks in proportion to 1/√n, so to halve the width you must quadruple the users.

95% margin of error (half-width) ±1.96 × √( p(1−p) / n ) = ±1.96 × √( 0.3·0.7 / 1000 ) = ±2.84 pp

Users per group (n): 1,000

👥

Sample size (n) — biggest lever. More data means less uncertainty. The width shrinks with 1/√n, so going from 1,000 to 4,000 users halves it, but from 1,000 to 2,000 only cuts it by ~30%. Diminishing returns.

📊

Variability (spread) of the data. Noisier metrics give wider intervals. For a proportion the spread is fixed by the rate itself (largest near 50%); for revenue or time, a few extreme values widen the CI — which is why variance-reduction tricks like CUPED help.

🎚️

Confidence level you choose. Demanding more confidence widens the interval: a 99% CI is wider than a 95% CI, which is wider than 90%. Higher confidence buys certainty at the cost of precision.
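The three levers above can be seen in a few lines. This sketch uses the same normal-approximation margin of error as the formula shown earlier for a 30% conversion rate; the function name is an illustrative assumption.

```python
import math
from scipy.stats import norm

def margin_of_error(rate, n, confidence=0.95):
    """Half-width of the normal-approximation CI for a proportion."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return z * math.sqrt(rate * (1 - rate) / n)

# Lever 1: sample size (width shrinks with 1/sqrt(n)).
for n in (1_000, 2_000, 4_000):
    print(f"n = {n:>5,}: ±{margin_of_error(0.30, n) * 100:.2f} pp")

moe_1k = margin_of_error(0.30, 1_000)
moe_4k = margin_of_error(0.30, 4_000)

# Lever 2: variability (a proportion is noisiest near 50%).
moe_at_half = margin_of_error(0.50, 1_000)

# Lever 3: confidence level (more confidence, wider interval).
moe_99 = margin_of_error(0.30, 1_000, confidence=0.99)
```

Quadrupling n from 1,000 to 4,000 halves the margin, while moving to a 50% rate or a 99% confidence level widens it.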

💡 Why this matters in practice

A wide CI that crosses zero is the data telling you "I can't rule out no effect yet." The usual fix is more users — but because of the 1/√n rule, detecting ever-smaller effects gets expensive fast. That trade-off is exactly what the Plan tab quantifies when it sizes your test.

Sample size vs Minimum Detectable Effect

Detecting smaller effects requires disproportionately more users. The required sample grows with 1/MDE² (quadratically, not exponentially), so halving the MDE roughly quadruples the required sample size (α = 0.05, power = 80%).

How power grows with sample size

For a fixed 5% relative MDE and α = 0.05, power rises with n. The dashed lines mark the standard 80% and 90% thresholds.
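A power curve like this one can be approximated with the normal formula for a two-proportion test. The chart's baseline rate is not stated here, so the sketch below assumes the 22% baseline from the email case study; the function name and the chosen sample sizes are also illustrative.

```python
import math
from scipy.stats import norm

def power_two_proportions(p_base, rel_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion Z-test."""
    p_var = p_base * (1 + rel_lift)
    se = math.sqrt((p_base * (1 - p_base) + p_var * (1 - p_var)) / n_per_arm)
    z_alpha = norm.ppf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return norm.cdf(abs(p_var - p_base) / se - z_alpha)

sizes = (5_000, 10_000, 22_700, 30_300, 50_000)
powers = {n: power_two_proportions(0.22, 0.05, n) for n in sizes}
for n, pw in powers.items():
    print(f"n = {n:>6,} per arm -> power = {pw:.0%}")
```

Under these assumptions power crosses 80% near 22,700 per arm and 90% near 30,300, then flattens: the last few points of power are the most expensive.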

CUPED: how variance reduction shrinks the sample you need

CUPED removes pre-experiment noise. The required sample size falls in proportion to the variance removed: a 30% variance reduction means roughly 30% fewer users for the same power.

Deep dive: CUPED, step by step

1 · The idea

The intuition first. Imagine measuring whether a new gym program adds muscle. People start at wildly different weights, so the after-numbers are all over the place — the noise from "where each person started" drowns out the effect of your program. But you know each person's starting weight. If you measure the change from their own baseline instead of the raw after-number, that starting-point noise vanishes and the program's real effect pops out. CUPED is exactly this trick for product metrics: subtract the part of each user's behaviour you could have predicted before the test, and what's left is a much cleaner read on your change.

CUPED (Controlled-experiment Using Pre-Experiment Data) is the single highest-leverage trick in modern experimentation. The idea: a user who spent a lot last month will probably spend a lot this month regardless of your change. That predictable part is noise you can subtract.

The one-line version for stakeholders: "CUPED doesn't change the answer; it gets us to the same answer with less traffic, by cancelling out noise we already knew about."

2 · When do you actually run CUPED?

The most common confusion: CUPED needs data from both before and during the experiment. The covariate X is collected before the test, but the adjustment itself is only computed after the test ends, because the formula needs each user's during-test outcome Y, which doesn't exist until the test is over. You decide on CUPED up front, but you run the math at analysis time.

Before the test: planning & setup

Decide you'll use CUPED while designing the test. It affects your sample-size plan: expecting ~30% variance reduction lets you plan for fewer users or a shorter run.
Choose the covariate X. The standard, most powerful choice is your target metric measured in the period just before the test (e.g. each user's revenue in the 1–4 weeks pre-launch). It must be a per-user value you can join later by a stable user ID.
Fix the pre-period window and freeze it. 1–4 weeks is typical: too short and the covariate is noisy, too long and old behaviour stops predicting current behaviour.
Decide the new-user rule now. Users with no history can't be adjusted. Choose in advance: exclude them, impute X = 0, or analyse returning vs. new users separately. Never decide this after seeing results.
Capture X for every enrolled user as they enter, keyed to their user ID. This is the one step that must happen at enrolment.

During the test: just run it

Run the experiment normally. CUPED changes nothing about execution — same randomisation, exposure, and metric collection.
Don't touch X. The covariate is locked from the pre-period; you're now collecting Y (the during-test outcome). X stays untouched by the treatment — guaranteed, because it was measured before exposure.

After the test: compute & analyse

Validate the experiment first. Run the SRM / split-health check. CUPED reduces variance, it does not fix a broken experiment.
Join X to Y by user ID. You now have, per user, their pre-period X and during-test Y.
Compute θ on the pooled data (both arms together, not per-arm): θ = Cov(X, Y) / Var(X). Pooling matters — per-arm θ can leak the treatment effect into the adjustment.
Adjust every user: Ỹ = Y − θ·(X − X̄), where X̄ is the population mean of X.
Run your normal test on Ỹ and read the p-value exactly as usual.

One-line timing: collect X before, collect Y during, compute θ and adjust after. CUPED is a planning decision made up front but a calculation done at analysis time.

3 · A worked case, step by step

Forget formulas for a moment. Here are 8 users from a revenue test — four in control (A), four in the variant (B). For each we have X = spend in the weeks before the test, and Y = spend during it. Small enough to follow every number by hand.

First, try it the normal way — and watch it fail

Average Y in control = $24.00, in variant = $29.25 — looks like a big +21.9% lift! But run a t-test and p = 0.74, nowhere near significant. Why? U2 ($55) and U6 ($60) are naturally big spenders bouncing the averages around so hard the test can't tell the gain from luck. The signal is real but the noise drowns it.

Now apply CUPED

Find θ. θ = Cov(X,Y) / Var(X) across all 8 users = 337.4 / 296.0 = 1.14. (Every extra $1 spent before predicts ~$1.14 during — consistent spenders.)

Find X̄. The average of X over all 8 users = $22.375.

Adjust each user: Ỹ = Y − θ·(X − X̄). For big spender U2: X = $50, so subtract 1.14 × (50 − 22.375) = $31.50, giving Ỹ = 55 − 31.50 = $23.50. Do this for all 8.

Side-by-side: original Y vs. CUPED-adjusted Ỹ

Watch the wild original column get pulled into a tight band — while the group averages barely move:

User	Original Y ($)	CUPED Ỹ ($)
U1 (A)	12	26.11
U2 (A)	55	23.50
U3 (A)	19	21.71
U4 (A)	10	26.39
Mean A	24.00	24.43
U5 (B)	18	29.83
U6 (B)	60	30.79
U7 (B)	27	27.43
U8 (B)	12	27.25
Mean B	29.25	28.82

What changed?

The effect stayed roughly the same. Mean A and Mean B barely moved (24.00→24.43, 29.25→28.82), so the gap between groups, the thing you care about, shrank only slightly ($5.25 → $4.39; with just 8 users some movement is expected). CUPED did not invent or inflate the result.
The noise collapsed. The original values swing $10–$60 (SD ≈ $21); the CUPED values sit in a tight $21–$31 band (SD ≈ $2). Big spenders U2 and U6 are pulled back to the pack, because CUPED knew they'd spend big anyway.
So the verdict flips. Nearly the same difference in means, but now measured against tiny noise instead of huge noise. The result that was invisible becomes obvious.

Step 4 — re-run the same test, see the verdict flip

Test on…	Lift	t	p	Verdict at α = 0.05
Original Y	+21.9%	0.35	0.74	Not significant ✗
CUPED Ỹ	+18.0%	3.09	0.023	Significant ✓
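The whole worked case can be replayed in a few lines. The pre-period X values are not listed in the tables above, so the X array below is an assumption: it was back-derived from the Y and Ỹ columns using θ = 1.14 and X̄ = 22.375 (it reproduces U2's stated X = $50). The test is Welch's t-test, matching the script later in this section.

```python
import numpy as np
from scipy import stats

# Users U1..U8; the first four are control (A), the last four are variant (B).
# X is back-derived from the published Y and CUPED columns (an assumption).
x = np.array([10, 50, 20, 8, 12, 48, 22, 9], dtype=float)    # pre-period spend
y = np.array([12, 55, 19, 10, 18, 60, 27, 12], dtype=float)  # in-test spend
is_variant = np.array([False] * 4 + [True] * 4)

# theta is estimated on the pooled data (both arms together).
theta = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
y_cuped = y - theta * (x - x.mean())

t_raw, p_raw = stats.ttest_ind(y[is_variant], y[~is_variant], equal_var=False)
t_cup, p_cup = stats.ttest_ind(y_cuped[is_variant], y_cuped[~is_variant],
                               equal_var=False)

print(f"theta = {theta:.2f}, X-bar = {x.mean():.3f}")
print(f"raw   : t = {t_raw:.2f}, p = {p_raw:.2f}")
print(f"cuped : t = {t_cup:.2f}, p = {p_cup:.3f}")
```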

The lesson: the raw test said "no difference" — but that was the noise talking, not the truth. The variant really was better; a few big spenders hid it. If your test comes back "not significant" on a noisy metric like revenue, don't conclude "no effect" — first ask whether you have pre-period data to CUPED-adjust with. You may have a winner hiding under the noise.

(This 8-user set is deliberately tiny so you can check the arithmetic; its correlation is unrealistically clean. Real covariates correlate around 0.3–0.6; the realistic example below uses ρ = 0.6.)

4 · The same idea at realistic scale

📘 Case study

StreamFlix tests a new checkout flow

StreamFlix wants to know if a one-click checkout lifts average revenue per user (ARPU). ARPU is brutally noisy — a handful of annual-plan buyers dominate, so the standard deviation is huge relative to the mean.

The problem without CUPED: Baseline ARPU is $12.00 with an SD of $48.00 (CV = 4.0). To detect a 3% lift at 80% power, they'd need ≈ 168,000 users per arm — about 6 weeks of traffic.

Apply CUPED: Each user's spend in the 4 weeks before the test (X) correlates with their spend during the test (Y) at ρ = 0.6. CUPED subtracts that predictable part: Y_cuped = Y − θ·(X − X̄).

The result: Variance drops by ρ² = 0.36, a 36% reduction. The effective SD falls from $48 to $38.4. Required sample size drops to ≈ 107,500 per arm, so they reach a decision about 2 weeks sooner, with no change to the estimated lift.

Takeaway: CUPED didn't change the answer — it got them to the same answer faster. The noisier your metric and the stronger the pre-period correlation, the bigger the win.

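Because required sample size scales linearly with variance, a 50% variance reduction means you reach the same power in half the time. Alternatively, in the same time you can detect an effect about 29% smaller, since the detectable effect shrinks with the square root of the variance (√0.5 ≈ 0.71). That is why the technique is so widely used on large experimentation platforms.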
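The StreamFlix arithmetic follows directly from the ρ² rule. The sketch below takes the 168,000 baseline requirement as given from the case study rather than recomputing it; the function name and dictionary keys are illustrative assumptions.

```python
import math

def cuped_effect(n_raw, sd_raw, rho):
    """How a covariate with correlation rho changes SD and required sample size."""
    kept = 1 - rho ** 2                      # share of variance left after CUPED
    return {
        "variance_reduction": rho ** 2,
        "sd_cuped": sd_raw * math.sqrt(kept),
        "n_cuped": n_raw * kept,             # sample size scales linearly with variance
    }

streamflix = cuped_effect(n_raw=168_000, sd_raw=48.0, rho=0.6)
print(f"variance reduction: {streamflix['variance_reduction']:.0%}")
print(f"effective SD: ${streamflix['sd_cuped']:.1f}")
print(f"users per arm: {streamflix['n_cuped']:,.0f}")
```

A weak covariate barely helps: at ρ = 0.3 the same function returns only a 9% variance reduction.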

5 · Limitations & pitfalls

You need persistent user IDs. CUPED links each user's "before" to their "after," so it only works if you can identify the same person across both periods. Anonymous-session-only tracking blocks CUPED until you fix identity.

It only helps when X correlates with Y. The variance reduction is exactly ρ². A weak covariate gives almost nothing — but CUPED is safe even then: with no correlation it leaves your estimate essentially unchanged rather than worse.

Needs pre-experiment history. New users, new markets, or first-time events have no "before" to subtract. Decide the rule in advance — exclude or impute — never after seeing results.

The covariate must be pre-treatment. Using anything measured during the experiment as X can absorb part of the treatment effect and bias your lift toward zero. Pre-period only.

It's still a t-test underneath. CUPED doesn't escape the t-test's assumptions: you need normal-ish data or a large enough sample for the CLT. Easily met at thousands of users, risky on tiny tests.

It reduces variance, not bias. SRM, contamination, or selection bias all survive the transform. Validate the experiment first, then apply CUPED.

6 · Python script

Copy-paste runnable (needs only numpy + scipy). Swap the simulated arrays for your own per-user X (pre-period) and Y (in-experiment).

# CUPED: variance reduction for an A/B test. Needs numpy + scipy.
import numpy as np
from scipy import stats

def analyse(Yc, Yv, Xc, Xv):
    # 1) estimate theta on the POOLED pre/post data (both arms together)
    Xall, Yall = np.concatenate([Xc, Xv]), np.concatenate([Yc, Yv])
    theta = np.cov(Yall, Xall, ddof=1)[0, 1] / Xall.var(ddof=1)
    # 2) transform each arm with the same theta and global mean of X
    Xbar = Xall.mean()
    Yc_c = Yc - theta * (Xc - Xbar)
    Yv_c = Yv - theta * (Xv - Xbar)
    # 3) Welch t-test, raw vs CUPED
    t_raw, p_raw = stats.ttest_ind(Yv,   Yc,   equal_var=False)
    t_cup, p_cup = stats.ttest_ind(Yv_c, Yc_c, equal_var=False)
    rho = np.corrcoef(Xall, Yall)[0, 1]
    var_red = 1 - np.concatenate([Yc_c, Yv_c]).var(ddof=1) / Yall.var(ddof=1)
    print(f"rho = {rho:.3f}  (var. reduction rho^2 = {rho**2:.1%})")
    print(f"empirical var. drop  = {var_red:.1%}")
    print(f"raw   : lift={(Yv.mean()-Yc.mean())/Yc.mean():+.2%}  t={t_raw:.2f}  p={p_raw:.4f}")
    print(f"cuped : lift={(Yv_c.mean()-Yc_c.mean())/Yc_c.mean():+.2%}  t={t_cup:.2f}  p={p_cup:.4f}")

# --- demo data: replace with your own per-user arrays ---
rng = np.random.default_rng(7)
N = 20_000
latent = rng.lognormal(2.2, 0.55, 2*N)
X = latent + rng.normal(0, 5.5, 2*N)          # pre-period spend
Y = latent + rng.normal(0, 5.5, 2*N)          # in-experiment spend
Y[N:] *= 1.03                                  # inject a real +3% lift in the variant
analyse(Y[:N], Y[N:], X[:N], X[N:])

Deep dive: the delta method, step by step

The intuition first. Suppose you want to know the average "actions per session." You randomise people into control and variant — but each person has many sessions. Here's the trap: one chatty user might have 40 sessions, another just 1. If you pile all sessions into one bucket and treat them as independent, you're pretending you have far more independent data points than you really do. Sessions from the same person rise and fall together, so they don't each count as fresh evidence. The delta method is the fix: it works out the true uncertainty of a ratio (a top number divided by a bottom number, where both wobble) and hands that honest uncertainty to a normal Z-test.

What kind of metric needs it? Any "ratio metric" where the denominator isn't fixed: actions per session, revenue per order, click-through rate per session, items per basket. The giveaway is that you randomise by one unit (the user) but count by another (sessions, orders, clicks).

1️⃣

Spot the ratio — Your metric is R = X / Y, where X is the numerator total (e.g. total actions) and Y the denominator total (e.g. total sessions). Both are random — they change from sample to sample — which is exactly what a naive test ignores.

2️⃣

Why the naive test lies — Treating each session as an independent row makes the standard error 2–10× too small. A too-small SE makes any difference look hugely significant — so you ship false positives. The error is invisible: the p-value just looks great.

3️⃣

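Estimate the variance of the ratio — The delta method uses a first-order Taylor approximation to turn the messy variance of "a random top over a random bottom" into something computable: Var(R) ≈ (1/Ȳ²)·[Var(X̄) − 2R·Cov(X̄,Ȳ) + R²·Var(Ȳ)]. Here X̄ and Ȳ are the per-user means of the numerator and denominator (so R = X̄/Ȳ, the same ratio as total over total), and each variance or covariance is the per-user value divided by the number of users n. In plain terms: it accounts for how the numerator and denominator move together, which is the piece the naive test throws away.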

4️⃣

Feed it to a Z-test — That corrected variance gives an honest standard error. From there it's an ordinary two-sample Z-test on the two groups' ratios. The delta method is not a separate test — it's the variance-estimation step that makes the Z-test trustworthy for ratio metrics.

A common shortcut — and when it's enough. For many product metrics you can sidestep the formula: aggregate to one value per user first (each user's own actions ÷ their own sessions), then run a plain Welch's t-test on those per-user numbers. That also respects the user-level independence. The full delta method earns its keep when you can't cleanly aggregate per user, or when the denominator can be zero for some users.
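The four steps above can be sketched in a few lines. This is a minimal illustration, not the calculator's own code: the function names and the simulated A/A data are assumptions, and the inputs are taken to be one numerator total and one denominator total per randomised user.

```python
import numpy as np
from statistics import NormalDist

def ratio_and_delta_var(num, den):
    """Return R = mean(num) / mean(den) and its delta-method variance.

    num, den: one entry per randomised user (e.g. that user's total
    actions and total sessions).
    """
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    n = len(num)
    r = num.mean() / den.mean()
    # Variances and covariance of the per-user MEANS, hence the / n.
    var_x = num.var(ddof=1) / n
    var_y = den.var(ddof=1) / n
    cov_xy = np.cov(num, den, ddof=1)[0, 1] / n
    var_r = (var_x - 2 * r * cov_xy + r**2 * var_y) / den.mean() ** 2
    return r, var_r

def delta_z_test(num_c, den_c, num_v, den_v):
    """Two-sample Z-test on the ratios of two independent arms."""
    r_c, v_c = ratio_and_delta_var(num_c, den_c)
    r_v, v_v = ratio_and_delta_var(num_v, den_v)
    z = float((r_v - r_c) / np.sqrt(v_c + v_v))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return r_c, r_v, z, p

# --- simulated A/A data (hypothetical): replace with your own per-user totals ---
rng = np.random.default_rng(11)
n_users = 5_000
sessions_c = rng.poisson(3, n_users) + 1      # every user has at least 1 session
sessions_v = rng.poisson(3, n_users) + 1
actions_c = rng.poisson(2 * sessions_c)       # about 2 actions per session
actions_v = rng.poisson(2 * sessions_v)

r_c, r_v, z, p = delta_z_test(actions_c, sessions_c, actions_v, sessions_v)
print(f"control R={r_c:.3f}  variant R={r_v:.3f}  z={z:.2f}  p={p:.4f}")
```

Because the inputs are per-user totals, users with zero sessions can stay in the data: they simply add zero to both sums, which is the case the per-user shortcut cannot handle.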

📘 Case study

A nudge that lifts "actions per session"

A team ships an onboarding nudge and measures actions per session. They randomise by user, but users have very different session counts — a handful of power users have dozens of sessions each.

The trap: A naive test on 50,000 sessions reports z ≈ 11, p < 0.0001 — looks like a slam dunk. But those sessions come from far fewer users, and one power user's good day counts dozens of times.

Apply the delta method: Estimating the ratio's variance correctly inflates the standard error to its honest size. The z statistic falls from ~11 to ≈ 6.9 — still significant, but no longer absurdly so.

The result: p < 0.001 with the correct variance. The lift is real and they ship — but now the confidence interval is honest, so the projected impact they promise leadership isn't inflated.

Takeaway: The delta method rarely flips "significant" to "not significant" — its job is to stop you over-stating certainty. On ratio metrics, a naive test's confidence is a mirage.

Unequal traffic splits: when 50/50 is too risky

Sometimes you can't send half your users to an untested change — it may be expensive, risky, or operationally heavy. You can send a smaller slice (say 10%) to the variant. That safety has a real statistical cost.

The trade-off in one sentence: a 50/50 split is the most statistically efficient — it reaches significance with the fewest total users — but an uneven split (e.g. 90/10) exposes far fewer people to a risky variant, at the cost of needing more total traffic and more time.

Total users needed vs how uneven the split is

Baseline 5%, detecting a +10% relative lift at 95% confidence, 80% power. Example: with 10% of traffic going to the variant, the test needs 179,760 total users, about 2.9× the 50/50 total.

🛡️

Why you'd do it. A pricing change, a new checkout flow, or anything with revenue or support risk. A 10% exposure caps your downside: if the variant is bad, only 10% of users feel it.

🔋

Effect on power. Power is driven by the smaller arm. Shrinking the variant to 10% starves it of data, so to keep 80% power you must enlarge the whole experiment — the control arm balloons to compensate. Net effect: more total users than a balanced test.

⏱️

Effect on duration. The variant arm fills slowly because it only gets a thin slice of daily traffic. With the same total traffic, a 90/10 test can take roughly 2–3× longer than 50/50 to reach the needed numbers — the small arm is the bottleneck.

📐

Rule of thumb. 50/50 is optimal for speed and sample efficiency. Don't go more extreme than you need: 80/20 costs ~60% more total users than 50/50, while 90/10 costs nearly 3×. Use the mildest imbalance that keeps the risk acceptable.

💡 Practical guidance

Start risky or expensive changes on a small slice (5–10%) to confirm nothing breaks, then — if it's safe — ramp toward 50/50 to gather evidence efficiently. If you must keep a small exposure for the whole test, plan for the longer runtime up front and check that your traffic can realistically reach the variant's required sample within your timeframe.
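The cost of an uneven split can be approximated with the usual normal-approximation sample-size formula. The sketch below is an assumption about how such a figure might be computed (unpooled variances, two-sided test); the function name is hypothetical and the result may differ slightly from the calculator's own rounding.

```python
from statistics import NormalDist

def total_sample_size(p_base, rel_lift, variant_share, alpha=0.05, power=0.80):
    """Total users for a two-sided two-proportion test with an uneven split.

    Unpooled normal approximation; variant_share is the fraction of
    traffic sent to the variant (0.5 = balanced).
    """
    p_var = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_term = (p_base * (1 - p_base) / (1 - variant_share)
                + p_var * (1 - p_var) / variant_share)
    return (z_alpha + z_beta) ** 2 * var_term / (p_var - p_base) ** 2

balanced = total_sample_size(0.05, 0.10, 0.50)
for share in (0.50, 0.20, 0.10):
    total = total_sample_size(0.05, 0.10, share)
    print(f"variant share {share:.0%}: total ≈ {total:,.0f} users "
          f"({total / balanced:.2f}x the 50/50 total)")
```

The small arm's variance term is divided by its small share, which is why the total grows quickly as the split becomes more extreme.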

Why peeking inflates your false-positive rate

Each extra look at the data gives random noise another chance to push the p-value below 0.05, so the true false-positive rate climbs above the nominal 5% as the number of peeks grows.
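A small simulation makes the inflation concrete. This is a sketch under simplifying assumptions: A/A tests only (no true effect), known variance, equally sized batches of data between looks, and a fixed two-sided threshold of |z| > 1.96 at every look. Under those assumptions the z statistic after j batches is the sum of j independent standard-normal batch scores divided by √j.

```python
import numpy as np

def peeking_false_positive_rate(n_peeks, n_sims=20_000, z_crit=1.96, seed=0):
    """Share of A/A tests that look 'significant' at ANY of n_peeks looks."""
    rng = np.random.default_rng(seed)
    batches = rng.standard_normal((n_sims, n_peeks))      # one z-score per batch
    looks = np.arange(1, n_peeks + 1)
    z_at_look = np.cumsum(batches, axis=1) / np.sqrt(looks)
    return float(np.mean(np.any(np.abs(z_at_look) > z_crit, axis=1)))

rates = {k: peeking_false_positive_rate(k) for k in (1, 5, 20)}
for k, rate in rates.items():
    print(f"{k:>2} look(s): false-positive rate ≈ {rate:.1%}")
```

With a single look the rate sits near the nominal 5%; adding looks pushes it up even though nothing about the data has changed.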

A catalog of distribution shapes

Real metrics rarely look like a textbook bell curve. Here are the nine shapes you will meet most often: smooth curves for continuous metrics, bars for discrete counts.

A field guide to distributions

🔔

Normal (symmetric). The classic bell curve — values cluster around the mean and tail off evenly on both sides. Heights, measurement errors, and sample averages look like this. Most textbook tests assume it.

📈

Right-skewed (positive skew). A long tail to the right. Revenue, order value, time-on-page, and lifetime value almost always look like this — most users spend a little, a few spend a lot. The mean is dragged above the median.

📉

Left-skewed (negative skew). A long tail to the left. Rarer in product data, but appears in things like exam scores near a ceiling or days-until-churn for a loyal base — most values are high, a few are very low.

🐫

Bimodal. Two peaks, usually meaning two hidden sub-populations mixed together — e.g. free vs paid users, or mobile vs desktop. Often a signal you should segment before analysing.

🟰

Uniform. Every value is equally likely — a flat line. A single fair die is the textbook case. Rare in real metrics, but the building block for simulations and randomisation.

🎲

Triangular (sum of dice). The shape in the question above: roll two dice and sum them, and the totals form a triangle peaking at 7. It comes from adding two uniform variables — and hints at the deeper rule that summing things pushes you toward a bell.

⏳

Exponential. A steep drop from a peak at zero. Models waiting times — time between purchases, time to first action, session gaps. Memoryless: the past doesn't change the expected wait.

🔢

Poisson (counts). For counts of rare events in a fixed window — orders per day, errors per session, support tickets per hour. Discrete, right-leaning, and defined by a single rate parameter.

🎯

Heavy-tailed. Mostly normal-looking but with rare extreme outliers far from the centre. A single whale purchase or a bot session can sit way out in the tail and distort the mean and variance.

Anatomy of a distribution: mean, median, mode & the spread

Three "centres" describe a distribution — the mode (the peak), the median (the middle value), and the mean (the average). Where they sit relative to each other tells you the shape at a glance. The spread is measured in standard deviations (SD).

🔔Normal — the 68–95–99.7 rule

In a perfectly symmetric normal distribution, mean = median = mode — all three sit dead centre. The spread follows the empirical rule: about 68% of values fall within ±1 SD of the mean, 95% within ±2 SD, and 99.7% within ±3 SD. Only ~0.3% lie beyond ±3 SD, which is why a 3-sigma event is considered rare.

📈Right-skewed — the mean is pulled toward the tail

With a long right tail, the order is mode < median < mean. The peak (mode) sits on the left where most values cluster; the few large values in the tail drag the mean rightward, past the median. Revenue and order value behave this way — which is why the average revenue per user is almost always higher than what a typical user spends. The spread is asymmetric: values stretch much further above the mean than below it.

📉Left-skewed — the mirror image

A long left tail flips the order to mean < median < mode. The peak sits on the right where most values bunch up, and a few small values pull the mean leftward. Think of exam scores where most students do well but a handful fail badly, or the age at which a loyal cohort finally churns. Here the typical value (mode) is higher than the average.

🪙Binomial — counting successes

The binomial distribution counts how many successes occur in a fixed number of yes/no trials — e.g. how many of 20 visitors convert. It is discrete (bars, not a curve) with mean = n × p (trials × probability). When p = 0.5 it is symmetric like a normal; when p is small it leans right. The key fact for A/B testing: as n grows, the binomial is well-approximated by a normal distribution — which is exactly why the Z-test works on conversion rates.

Worked example: why two dice make a triangle

A single die is uniform — each face 1–6 is equally likely. But sum two dice and the result is no longer flat: it peaks at 7 and falls away symmetrically toward 2 and 12. This is a triangular distribution.

The reason is simple counting. There is only one way to roll a 2 (1+1) and one way to roll a 12 (6+6), but six different ways to roll a 7 (1+6, 2+5, 3+4, 4+3, 5+2, 6+1). More combinations means higher probability, so the middle towers over the edges.

🎯 The deeper lesson

Adding just two uniform dice already bends a flat distribution toward a peak. Add more dice and it gets smoother and more bell-shaped. This is the Central Limit Theorem in miniature — sums and averages of random things tend toward a normal distribution, whatever the original shape.
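The counting argument can be checked by enumerating all 36 outcomes. A minimal sketch using only the standard library:

```python
from collections import Counter
from itertools import product

# Count how many of the 36 equally likely (die 1, die 2) pairs give each total
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
for total in range(2, 13):
    print(f"{total:>2}: {'#' * ways[total]}  ({ways[total]}/36)")
```

The printed bars form the triangle: one way each for 2 and 12, six ways for 7.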

Why skew usually doesn't break your test: the Central Limit Theorem

Your test cares about the distribution of the average, not the raw data. Even from a very skewed population, the average of a sample becomes bell-shaped as the sample grows. Drag the sample size and watch it happen.

Sample size (n): 1

The theory that ties it together

🎯

Central Limit Theorem (CLT). The headline result of statistics: if you take many independent samples and average each one, those averages follow an approximately normal distribution once each sample is large enough, whatever shape the original data had (as long as its variance is finite). This is why tests built on the normal curve work on messy, skewed real-world metrics.

📊

Law of Large Numbers. As your sample grows, its average converges to the true population average. The CLT describes the shape of the estimate's uncertainty; the Law of Large Numbers says that uncertainty shrinks toward zero with more data. Together they justify why bigger experiments are more trustworthy.

📐

Standard error and √n. The spread of the sample mean is the standard deviation divided by √n. Quadrupling your sample only halves your uncertainty — which is exactly why detecting tiny effects needs disproportionately large samples.

🧱

Distributions build on each other. Sum two uniforms → triangular. Sum many of almost anything → normal. Count rare events → Poisson. Time between those events → exponential. The shapes aren't arbitrary; they arise from how the underlying randomness is generated.
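The CLT and the √n rule can both be seen in a short simulation. The sketch below assumes an Exponential(1) population (right-skewed, SD = 1, skewness = 2) purely for illustration; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 20_000                      # number of simulated samples per n
results = {}
for n in (1, 4, 16, 64):
    # Average n draws from a skewed population, repeated `reps` times
    means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    sd = means.std(ddof=1)
    skew = np.mean((means - means.mean()) ** 3) / sd**3
    results[n] = (sd, skew)
    print(f"n={n:>2}  SD of mean={sd:.3f}  1/sqrt(n)={1 / np.sqrt(n):.3f}  skew={skew:.2f}")
```

Each fourfold increase in n roughly halves the spread of the sample mean, and the skewness of the averages shrinks toward zero as they become more bell-shaped.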

When should you actually check for normality?

The most common misconception in A/B testing is that your raw data must be normally distributed. It usually need not be. The t-test and Z-test assume the sampling distribution of the mean is roughly normal, and thanks to the Central Limit Theorem that holds once your sample is large enough, even when the underlying data is heavily skewed.

✅ The practical rule

With more than ~30–50 users per group for a fairly tame metric, or a few hundred for a heavily skewed one like revenue, you do not need to check normality at all. The CLT has you covered.

✔️

You can skip the check when: your sample is large (hundreds+ per arm) and your metric is a mean or proportion. This covers the large majority of online experiments.

⚠️

You should check (or switch tests) when: your sample is small (under ~100–200 per arm), your metric is extremely skewed or dominated by a few outliers, or your data is clearly bimodal. In those cases the CLT may not have kicked in yet.

🔍

How to check: plot a histogram and look at the shape; compare the mean and median (far apart = skewed); or use a Q–Q plot. Formal tests like Shapiro–Wilk exist but are over-sensitive on large samples, so the eyeball test is usually more useful.

🔧

What to do if it's not normal and n is small: use the Mann–Whitney U test (compares ranks, no normality assumption), apply a log transform to tame right-skew, or cap/winsorize extreme outliers. Collecting more data is the simplest fix of all.
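The checks and fallbacks above can be tried together on a small, skewed sample. This is a sketch on simulated data: the lognormal parameters and the 80-per-arm size are assumptions chosen only to mimic a small pilot with heavy right-skew.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical small pilot: 80 accounts per arm, heavily right-skewed revenue
control = rng.lognormal(mean=6.0, sigma=1.2, size=80)
variant = rng.lognormal(mean=6.3, sigma=1.2, size=80)

# Quick shape check: a mean far above the median signals right-skew
for name, arm in (("control", control), ("variant", variant)):
    print(f"{name}: mean={arm.mean():,.0f}  median={np.median(arm):,.0f}")

p_welch = stats.ttest_ind(variant, control, equal_var=False).pvalue
p_log = stats.ttest_ind(np.log(variant), np.log(control), equal_var=False).pvalue
p_mwu = stats.mannwhitneyu(variant, control, alternative="two-sided").pvalue
print(f"Welch on raw values : p={p_welch:.4f}")
print(f"Welch on log values : p={p_log:.4f}")
print(f"Mann-Whitney U      : p={p_mwu:.4f}")
```

Note that the three tests answer slightly different questions: the raw Welch test compares means, the log version compares means of log values, and Mann-Whitney compares ranks. Pick one before looking at the results.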

Case studies: should you check the distribution?

✅ Case 1 · Safe to skip

Checkout conversion rate, 40,000 users per arm

A team tests a new checkout button. The metric is binary — converted or not — and each arm has 40,000 users. Someone asks whether they need to check normality first.

The data: A proportion (≈3% convert). The raw data is just 0s and 1s — about as far from a bell curve as possible.

The decision: No check needed. With tens of thousands of users, the sampling distribution of the proportion is essentially normal by the CLT. The Z-test is valid.

Lesson: For proportions at scale, normality of the raw data is irrelevant — only the sample size matters, and here it is huge.

⚠️ Case 2 · You must check

Revenue per user, 80 users in a B2B pilot

A B2B company pilots a pricing change with only 80 customers per arm. The metric is revenue per account — and two enterprise deals dwarf everyone else.

The data: Extremely right-skewed with heavy outliers, and the sample is small (n = 80). The CLT has not reliably kicked in for such a skewed metric at this size.

The decision: Check the shape. A histogram shows the skew; mean and median are far apart. A plain t-test here is fragile — one more whale could flip the result.

Lesson: Use Mann–Whitney U, or a log transform, or report the median. Small n plus heavy skew is the one combination where you genuinely must look.

🐫 Case 3 · The hidden trap

Session duration that looked fine but wasn't

A team analysed average session duration with 5,000 users per arm — comfortably large — and trusted the t-test. The result was significant, but the rollout didn't move the needle.

What they missed: The distribution was bimodal — a spike of bounces near 0 seconds and a second hump of engaged users. The "average" sat in an empty valley between the two groups, describing nobody.

The fix: Segment first. Splitting bounced vs engaged sessions revealed the change only affected one group. The headline average had hidden it.

Lesson: A large sample protects the test's validity, but not your interpretation. Always eyeball the shape — bimodality is a signal to segment, not to average.

Matching the test to the situation

Beyond normality, the distribution shape often tells you which test or metric to reach for. A few common product situations:

💰

Revenue / order value (right-skewed): Report the mean for business impact, but pair it with the median and consider CUPED to cut variance. With large n the t-test is fine; with small n switch to Mann–Whitney.

⏱️

Time-to-event / latency (exponential): Averages are dominated by the tail. Prefer percentiles (p50, p95) and consider a log transform before testing means.

🔢

Counts per user (Poisson-like): Orders per user, errors per session. The mean is meaningful, but variance grows with the mean — Welch's t (unequal variances) is a safe default.

🐫

Mixed populations (bimodal): Segment before you test. A single average across two distinct groups is usually misleading, however large the sample.

✅

Binary outcomes (proportions): Conversion, click, signup. Use the Z-test; normality of raw data never matters, only that you have enough events (a rule of thumb: at least ~10 conversions in each arm).

Z-test vs t-test: which to use, and why

The single most common question — and the most consequential. The right test depends on one thing: what kind of number your metric produces per user.

The deciding question: is each user a yes/no, or a number? That answer picks your test. Everything else is detail.

✅Z-test — for conversion / proportions

Use it when: each user is a binary outcome (converted or not, clicked or not, signed up or not).

In statistics: It compares two proportions. For a binary variable the spread is fixed by the rate itself (variance = p(1−p)), so once you know the conversion rate you already know the variability — no extra information needed. With thousands of users the proportion is essentially normal, so a Z (normal) test is exact enough.

In business: Answers "did a meaningfully larger share of users take the action?" Think checkout conversion, signup rate, email click-through. You only need two numbers per group — how many were shown it, and how many converted.

📊t-test (Welch's) — for continuous metrics

Use it when: each user produces a number that can vary freely (revenue, session length, items in cart, days retained).

In statistics: It compares two means. Unlike a proportion, a continuous metric's spread is not determined by its average — two groups can share a mean yet differ wildly in variance — so the test must estimate the spread from the data. Welch's version doesn't assume the two groups have equal variance, which makes it the safe default. (The "t" accounts for the extra uncertainty of estimating that spread; at large n it converges to the Z-test.)

In business: Answers "did the average value per user move?" Think ARPU, average order value, time on site. You need the mean, the standard deviation, and the sample size for each group — the spread matters as much as the average.

🔑 The one-line reason they differ

For a proportion, the rate fixes the variance, so summary counts are enough (Z-test). For a continuous value, the average does not determine the spread, so you must measure the spread separately, and the t-test is built to handle that extra uncertainty.

🔀

Ratio metrics (AOV, CTR per session): these have a varying denominator. Compute one value per user, then treat it as a continuous metric and use Welch's t.

📉

Skewed data with small n (<200): when the Central Limit Theorem hasn't smoothed things out yet, switch to the Mann–Whitney U test, which compares ranks and assumes no particular shape.

🎰

3+ variants: run an omnibus test first (χ² for conversion, ANOVA for continuous), then corrected pairwise comparisons — see "Multiple testing" below.

Tests at a glance

Metric type	Test	When to use
Conversion / proportion	Z-test	Binary outcome per user
Continuous per-user metric	Welch's t-test	One numeric value per user; handles unequal variances
Ratio metric (AOV, CTR…)	Aggregate → Welch's t	Compute one value per user, treat as average metric
Skewed / small n (<200)	Mann–Whitney U	When CLT hasn't kicked in; tests stochastic dominance
3+ variants on conversion	χ² + Bonferroni	Run χ² first; pairwise Z with α/k if significant

How each test works

Z-test for conversion rates

What it does: Tests whether two proportions (p₁, p₂) are statistically different.

Formula: Z = (p₂ − p₁) / √(p̂(1−p̂)(1/n₁ + 1/n₂)) where p̂ is the pooled proportion.

Assumptions: n·p ≥ 5 and n·(1−p) ≥ 5 for both groups. Violated with very low conversion rates and small samples.

Examples: Signup rate, purchase rate, click-through rate, open rate, retention, trial-to-paid.
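The formula above translates directly into code. A minimal sketch using only the standard library; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_1, n_1, conv_2, n_2):
    """Pooled two-sided Z-test; group 1 is control, group 2 is variant."""
    p_1, p_2 = conv_1 / n_1, conv_2 / n_2
    p_pool = (conv_1 + conv_2) / (n_1 + n_2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_2 - p_1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Rule-of-thumb check: n*p >= 5 and n*(1-p) >= 5 in both groups,
    # i.e. at least 5 conversions and 5 non-conversions per group
    ok = all(min(c, n - c) >= 5 for c, n in ((conv_1, n_1), (conv_2, n_2)))
    return z, p_value, ok

# Hypothetical counts: 3.0% vs 3.3% conversion, 40,000 users per arm
z, p_value, assumptions_ok = two_proportion_ztest(1_200, 40_000, 1_320, 40_000)
print(f"z={z:.2f}  p={p_value:.4f}  assumptions ok: {assumptions_ok}")
```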

Welch's t-test (continuous metric)

What it does: Tests whether two group means differ, without assuming equal variances. Always safer than Student's t.

Formula: t = (x̄₂ − x̄₁) / √(s₁²/n₁ + s₂²/n₂) with Welch–Satterthwaite degrees of freedom.

Assumptions: Independence. For small n, approximate normality. For large n, CLT covers you.

Examples: ARPU, average session length, items in cart, 14-day spend.
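Welch's test needs only the mean, standard deviation, and sample size of each group. The sketch below computes it from those summaries and compares the result with `scipy.stats.ttest_ind(..., equal_var=False)` on the same simulated data; the helper name and the gamma-distributed demo data are assumptions.

```python
import numpy as np
from scipy import stats

def welch_from_summary(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Welch's t-test from per-group mean, SD and n (group 2 minus group 1)."""
    a, b = sd_1**2 / n_1, sd_2**2 / n_2
    t = (mean_2 - mean_1) / np.sqrt(a + b)
    # Welch-Satterthwaite degrees of freedom
    df = (a + b) ** 2 / (a**2 / (n_1 - 1) + b**2 / (n_2 - 1))
    p_value = 2 * stats.t.sf(abs(t), df)
    return t, df, p_value

# Hypothetical per-user spend; the variant has a higher mean and a wider spread
rng = np.random.default_rng(2)
control = rng.gamma(shape=2.0, scale=10.0, size=3_000)
variant = rng.gamma(shape=2.0, scale=10.6, size=2_500)

t, df, p_value = welch_from_summary(
    control.mean(), control.std(ddof=1), len(control),
    variant.mean(), variant.std(ddof=1), len(variant),
)
reference = stats.ttest_ind(variant, control, equal_var=False)
print(f"summary stats: t={t:.3f}  df={df:.0f}  p={p_value:.4f}")
print(f"scipy        : t={reference.statistic:.3f}  p={reference.pvalue:.4f}")
```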

Ratio metrics — aggregation approach

The problem: Metrics like AOV or CTR per session have a varying denominator. Running a t-test on raw transactions treats each order as independent — but orders from the same user are correlated. This makes the standard error 2–10× too small, inflating false positives to 10–30%.

The safe approach: Aggregate to one value per user before testing.

AOV per user = user's total revenue ÷ user's total orders
CTR per session = user's total clicks ÷ user's total sessions

Then run Welch's t on those per-user values. Users with zero denominator (zero orders) should be excluded or handled explicitly.
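The aggregation step can be sketched with plain numpy. The helper name and the tiny order log are hypothetical, and dropping users with a zero denominator is just one of the handling rules mentioned above.

```python
import numpy as np

def per_user_ratio(user_ids, numerator, denominator):
    """Collapse row-level data to one ratio per user (zero denominators dropped)."""
    users, idx = np.unique(user_ids, return_inverse=True)
    num = np.bincount(idx, weights=numerator, minlength=len(users))
    den = np.bincount(idx, weights=denominator, minlength=len(users))
    keep = den > 0                      # exclusion rule; decide this before the test
    return users[keep], num[keep] / den[keep]

# Hypothetical order log: one row per order
user_ids = np.array([1, 1, 2, 3, 3, 3])
revenue = np.array([10.0, 20.0, 5.0, 1.0, 2.0, 3.0])
orders = np.ones_like(revenue)          # each row counts as one order

users, aov = per_user_ratio(user_ids, revenue, orders)
print(dict(zip(users.tolist(), aov.tolist())))
```

Run this once per arm, then pass the two arrays of per-user values to Welch's t-test.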

Mann–Whitney U (non-parametric)

When to use: Heavily skewed distributions or n < 200 per arm, where outliers dominate and CLT hasn't fully applied.

What it tests: Stochastic dominance — whether a random value from B tends to exceed one from A. Does not test mean equality.

Examples: B2B contract values, enterprise deal sizes, early-stage metrics with thin data.

Multi-variant tests (3+ variants)

For binary outcomes: run χ² across the full contingency table first. If significant, follow with pairwise Z-tests using Bonferroni correction (α/k per pair, where k = number of pairs).

For continuous metrics with 3+ arms: one-way ANOVA, then pairwise Welch's t with Bonferroni or Holm–Bonferroni.

Why not just run all pairs at α = 0.05? With 3 variants you have 3 pairs — the family-wise error rate reaches ~14%. Bonferroni keeps it at 5%.
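The two-stage procedure for binary outcomes can be sketched as follows. The arm names and counts are hypothetical; the omnibus step uses `scipy.stats.chi2_contingency`, and the pairwise step uses the pooled Z-test with a Bonferroni threshold of α/k.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist
from scipy.stats import chi2_contingency

# Hypothetical counts: (conversions, users) per arm
arms = {"control": (500, 10_000), "B": (560, 10_000), "C": (610, 10_000)}
alpha = 0.05

# Stage 1: omnibus chi-square on the arms x (converted, not converted) table
table = [[conv, n - conv] for conv, n in arms.values()]
chi2, p_omnibus, dof, _ = chi2_contingency(table)
print(f"omnibus: chi2={chi2:.2f}  dof={dof}  p={p_omnibus:.4f}")

# Stage 2: pairwise pooled Z-tests at alpha/k, only if the omnibus test rejects
pairs = list(combinations(arms, 2))
threshold = alpha / len(pairs)
pairwise = {}
if p_omnibus < alpha:
    for a, b in pairs:
        (xa, na), (xb, nb) = arms[a], arms[b]
        pool = (xa + xb) / (na + nb)
        z = (xb / nb - xa / na) / sqrt(pool * (1 - pool) * (1 / na + 1 / nb))
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        pairwise[(a, b)] = p
        print(f"{a} vs {b}: p={p:.4f}  significant at {threshold:.4f}: {p < threshold}")
```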

Limitations & how to overcome them

High variance → low power

High-variance metrics (ARPU, LTV) need enormous samples to detect meaningful lifts. Three ways to fix this:

🎯

CUPED — Regress out a pre-experiment covariate (e.g. prior-period revenue). Typical variance reduction: 30–50%. Formula: Y_cuped = Y − θ·(X − E[X]), where θ = Cov(Y,X)/Var(X). Same expected value, much less noise.

📊

Stratified randomisation — Split users into strata (country, device tier) before assignment. Removes between-strata variance from the error term.

✂️

Winsorisation — Cap values at the 99th percentile to stop whale users dominating. Apply the same cap to both variants before analysis.
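Winsorisation is the simplest of the three to sketch. The version below assumes the cap is computed once on the pooled data so that both arms share exactly the same threshold; the function name and the lognormal demo data are hypothetical.

```python
import numpy as np

def winsorise_together(control, variant, pct=99.0):
    """Cap both arms at one shared percentile computed on the pooled data."""
    cap = np.percentile(np.concatenate([control, variant]), pct)
    return np.minimum(control, cap), np.minimum(variant, cap), cap

rng = np.random.default_rng(9)
control = rng.lognormal(3.0, 1.0, 10_000)   # hypothetical revenue per user
variant = rng.lognormal(3.0, 1.0, 10_000)
control_w, variant_w, cap = winsorise_together(control, variant)
print(f"cap={cap:.1f}  SD before={control.std(ddof=1):.1f}  after={control_w.std(ddof=1):.1f}")
```

Capping trades a little bias (the capped mean understates true revenue) for a large drop in variance, so fix the percentile before the experiment starts.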

Novelty & primacy effects

Users respond differently to something new (novelty) or resist change (primacy). Both distort early estimates.

📅

Run for at least 2 full weeks — This captures the full weekly cycle, since weekday and weekend behaviour differ for most products.

👥

Segment by user tenure — Novelty effects are strongest for returning users who notice the change; new users have no baseline to compare.

📉

Check if the effect decays — If week 1 shows +10% and week 3 shows +2%, the long-run impact is probably much closer to 2% than to 10%. Don't call a test on day 3.

Sample Ratio Mismatch (SRM)

An SRM occurs when the actual split differs from the intended split. This indicates a bug and invalidates results even if p < 0.05.

🔍

Detect it — Run a χ² test on observed variant counts vs. expected split. If p < 0.01, stop and investigate.

🐛

Common causes — Bot filtering on one variant only; redirect assignment losing users; CDN caching; client-side events firing conditionally.

🚫

Never adjust statistically — Fix the root cause and rerun from scratch.
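The detection step is a χ² goodness-of-fit test on the arm sizes. A minimal sketch using `scipy.stats.chisquare`; the helper name and the counts are hypothetical, and the 0.01 threshold is the one suggested above.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, intended_shares, threshold=0.01):
    """Chi-square goodness-of-fit of observed arm sizes against the intended split."""
    total = sum(observed_counts)
    expected = [share * total for share in intended_shares]
    stat, p_value = chisquare(observed_counts, f_exp=expected)
    return stat, p_value, p_value < threshold

# Hypothetical counts for an intended 50/50 split
stat, p_value, srm = srm_check([50_000, 51_200], [0.5, 0.5])
print(f"chi2={stat:.2f}  p={p_value:.5f}  SRM flagged: {srm}")
```

A gap of 1,200 users looks small next to 100,000, but at this scale it is far outside what random assignment would produce.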

Peeking — calling results too early

Stopping as soon as p < 0.05 inflates your actual error rate. With daily peeking over 4 weeks, it can reach 25%+.

📋

Pre-register your MDE and plan — Do this before launch, and commit to running until you hit the planned sample size.

🔄

Sequential testing — Methods like mSPRT let you peek continuously with controlled error rates. Worth it for high-velocity teams.

Network effects & interference between users

Standard A/B testing assumes user independence. This breaks in social products, marketplaces, and referral loops.

🗂️

Cluster randomisation — Assign clusters of connected users (region, social graph) rather than individuals. Less power, less interference.

🚧

Holdout groups — Keep a portion of users out of the experiment entirely to measure aggregate spillover.

⏱️

Switchback experiments — Alternate treatment and control in time windows. Common in ride-hailing and logistics.

Multiple metrics — what do you optimise for?

Testing many metrics at once inflates the chance at least one looks significant by chance.

🎯

Pre-register one primary metric — the decision hinges on it. Secondary metrics are directional signals only.

🛡️

Guardrail metrics — For guardrails such as revenue, latency, and crash rate, check without Bonferroni correction. You want maximum sensitivity to regressions.

📐

Benjamini–Hochberg FDR — If testing many metrics, BH correction is less conservative than Bonferroni while still disciplined.

The experiment lifecycle

🎯

1. Hypothesis — State the change, the metric it should move, and the direction. "Bigger CTA button → higher signup rate."

📐

2. Power analysis — Decide the MDE that matters for the business, then compute the sample size and duration before launch. (That's what the Plan tab is for.)

🚀

3. Run — Launch, check for SRM in the first hours, then leave it alone for at least one full business cycle (usually 1–2 weeks). No peeking.

📊

4. Analyse — At the planned sample size, run the right test for your metric type, check guardrails, and look at key segments. (That's the Analyse tab.)

✅

5. Decide & document — Ship, iterate, or abandon. Write down the result either way — negative results stop your team from repeating dead ideas.

How α, β, and power interact

Lever	α	β	Power	Mechanism
Increase n	—	↓	↑	SE shrinks like 1/√n
CUPED / stratification	—	↓	↑	Same as more data, achieved statistically
Raise α (0.05 → 0.10)	↑	↓	↑	Lower rejection threshold
Lower α (0.05 → 0.01)	↓	↑	↓	Higher rejection threshold
One-sided test	—	↓	↑	All α budget on one tail
Pre-register hypothesis	↓	—	—	Prevents post-hoc shopping
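Several rows of the table can be reproduced with an approximate power calculation for a two-proportion Z-test. This is a sketch under stated assumptions: equal arm sizes, unpooled normal approximation, and a hypothetical scenario of 5% baseline versus 5.5% variant with 31,231 users per arm (roughly the size that approximation gives for 80% power).

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_1, p_2, n_per_arm, alpha=0.05, two_sided=True):
    """Approximate power of a Z-test for p_2 > p_1 with equal arm sizes."""
    se = sqrt(p_1 * (1 - p_1) / n_per_arm + p_2 * (1 - p_2) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf((p_2 - p_1) / se - z_crit)

base = dict(p_1=0.05, p_2=0.055, n_per_arm=31_231)
baseline = power_two_proportions(**base)
double_n = power_two_proportions(**{**base, "n_per_arm": 62_462})
alpha_10 = power_two_proportions(**base, alpha=0.10)
alpha_01 = power_two_proportions(**base, alpha=0.01)
one_sided = power_two_proportions(**base, two_sided=False)

print(f"baseline             : {baseline:.1%}")
print(f"double n             : {double_n:.1%}")
print(f"alpha 0.10           : {alpha_10:.1%}")
print(f"alpha 0.01           : {alpha_01:.1%}")
print(f"one-sided, alpha 0.05: {one_sided:.1%}")
```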

All calculations run locally in your browser — no data leaves your device.

Multiple testing: why more comparisons inflate false positives

The classic α = 0.05 threshold protects a single comparison. The moment you test several hypotheses at once, the chance of at least one false positive climbs fast — and you must correct for it.

The problem. If you fix α = 0.05 and test one hypothesis, your chance of a false positive is 5%. But test several independent hypotheses and the probability of at least one false positive — the family-wise error rate (FWER) — is 1 − (1 − α)^m. With 3 comparisons it is already ~14%; with 10 it is ~40%. More variants, more metrics, or repeated peeking all multiply your chances of being fooled.

False-positive risk vs number of comparisons

Chart: uncorrected family-wise error rate against the number of comparisons m, with Bonferroni holding it at 5%.

Example at m = 3: FWER = 1 − (1 − α)^m = 1 − (1 − 0.05)³ = 14.3%
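The family-wise error rate formula is easy to tabulate. A minimal sketch, assuming independent comparisons as the formula does:

```python
def fwer(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 3, 10, 20):
    print(f"m={m:>2}: uncorrected {fwer(0.05, m):.1%}   Bonferroni {fwer(0.05 / m, m):.1%}")
```

The second column applies the per-test threshold α/m and shows the family-wise rate staying at or just under 5%.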

➗

Bonferroni correction. Divide your target α by the number of comparisons m (e.g. 0.05 / 10 = 0.005); only p-values below that count as significant. Dead simple and strictly controls FWER — but over-conservative, so it costs you power and can miss real effects.

📶

Holm–Bonferroni (step-down). Sort the p-values smallest to largest and test them in order against progressively looser thresholds (α/m, then α/(m−1), …), stopping at the first one that fails; that p-value and all larger ones are not significant. Uniformly more powerful than plain Bonferroni while giving the same FWER guarantee. A sensible default.

📊

Benjamini–Hochberg (FDR control). Instead of guarding against even one false positive, it controls the expected fraction of false positives among your significant results (the False Discovery Rate). Much more powerful when you screen many metrics, at the cost of allowing a controlled share of errors. The standard choice for large dashboards and exploratory analyses.
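The three corrections can be compared side by side on the same p-values. This is a minimal numpy sketch, not the calculator's implementation; the function names and the example p-values are hypothetical.

```python
import numpy as np

def holm(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: returns a boolean 'reject' array."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(np.argsort(p)):       # smallest p first
        if p[i] > alpha / (m - rank):              # alpha/m, alpha/(m-1), ...
            break                                  # stop at the first failure
        reject[i] = True
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank that passes
        reject[order[: k + 1]] = True
    return reject

p_values = [0.001, 0.012, 0.014, 0.030, 0.200]     # hypothetical metric p-values
bonferroni = (np.array(p_values) <= 0.05 / len(p_values)).tolist()
print("Bonferroni:", bonferroni)
print("Holm      :", holm(p_values).tolist())
print("BH        :", benjamini_hochberg(p_values).tolist())
```

On these inputs the ordering of strictness is visible directly: Bonferroni rejects the fewest hypotheses, Holm more, and Benjamini-Hochberg the most.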

🧪 Run the omnibus test first

With 3+ variants, don't jump straight to pairwise comparisons. First run a single omnibus test — χ² for conversion, one-way ANOVA for continuous metrics — that asks one question: "is any variant different from the others?" Only if that comes back significant do you proceed to corrected pairwise tests to find which ones. This two-stage gate (the protected procedure) controls the error rate and stops you from fishing through pairs that the data never justified. This calculator follows exactly that order: χ² first, then Holm-corrected pairwise comparisons vs control.

✅ Which to use

A few pre-planned comparisons where any false positive is costly → Holm–Bonferroni (the recommended default). Screening many metrics where you can tolerate a small false-positive share for more discoveries → Benjamini–Hochberg. Plain Bonferroni is fine when m is small and you want the simplest defensible rule. Guardrail metrics that protect against regressions are usually left uncorrected — there you want maximum sensitivity.

A/B Testing Calculator

📖What this means