Enter your daily eligible traffic to estimate experiment duration.
Pick your metric, drop in your numbers, get a confident decision — with the right statistical test for what you're actually measuring.
1. Run an A/A test on your platform.
2. Confirm the split ratio matches what you configured.
3. Confirm the p-value is non-significant across a few runs.
4. Only then start real A/B tests.

You can paste the two A/A group numbers into the Analyse tab and confirm "no clear winner", which is exactly what you want to see.
Test on the minimum sample that gives 80% power — extra users past that add almost no certainty. Keep the rest as a holdout on the proven version, then roll the winner out to everyone afterwards. The full benefit comes at rollout, not during the test.
A SaaS team sends a weekly newsletter to 200,000 subscribers. Open rate is 22%. Marketing thinks a curiosity-driven subject line ("You're missing 3 features") will beat the current one. They want evidence before rolling it out to the whole list.
Because past sends show curiosity-gap subject lines earned higher opens in our own data, we believe a curiosity-driven line will raise the open rate from 22% to at least 23.1% (+5% relative). We'll know we're right if the Z-test clears p < 0.05 with the lift above that threshold.
Smallest lift worth shipping: +5% relative (22% → 23.1%). The calculator says ~22,700 per group. So they test on 25,000 per group (50,000 total) and hold out the other 150,000 on the old line — no point exposing everyone to an unproven idea.
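The sizing step above can be sketched in a few lines. The calculator's exact formula is not stated here, so this is an assumption: it uses the standard normal-approximation formula with unpooled variances, two-sided α = 0.05 and 80% power, which happens to land on the same figure. The function name `sample_size_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_base, rel_lift, alpha=0.05, power=0.80):
    """Users needed per group to detect a relative lift in a conversion rate.

    Assumption: two-sided test, normal approximation, unpooled variances.
    """
    p_new = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_new - p_base) ** 2
    return ceil(n)


# Newsletter case: 22% open rate, +5% relative lift.
print(sample_size_per_group(0.22, 0.05))
```

For the newsletter inputs this gives roughly 22,700 per group, in line with the figure quoted above.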
A single send goes out Tuesday; opens counted over 48 hours. Control 5,503/25,000 = 22.0%. Variant 5,795/25,000 = 23.2%.
Two-proportion Z-test: z ≈ 3.1, p ≈ 0.002. The 95% intervals for the two open rates don't overlap, and the split was a clean 50/50 (no SRM concern).
A growth team adds emoji and urgency to a re-engagement push ("⏰ Don't lose your streak!"), expecting more dormant users to return. Re-engagement rate is ~4%.
Because support tickets and survey replies suggest users forget the app rather than dislike it, we believe an urgency-and-emoji push will lift the return rate from 4% by at least +0.5pp. We'll also watch a guardrail — mute/unsubscribe rate — so a "win" on returns doesn't quietly cost us long-term reach.
Even with millions of dormant users, they test on a holdout-sized cohort — 50,000 per group — not the whole base. No reason to blast everyone with an unproven message.
Control 4.0% vs variant 4.1%. The difference is tiny.
Z-test: p = 0.42. The 95% interval for the lift runs from −0.14pp to +0.34pp — it straddles zero, so no effect can be claimed.
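The interval quoted above can be reproduced with a short sketch. It assumes a plain Wald interval for the difference of two proportions (unpooled standard error); the tool may use a different interval method, and `diff_ci` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(p_control, p_variant, n_control, n_variant, level=0.95):
    """Wald confidence interval for (variant rate - control rate)."""
    diff = p_variant - p_control
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_variant * (1 - p_variant) / n_variant)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


low, high = diff_ci(0.040, 0.041, 50_000, 50_000)
print(f"{low * 100:+.2f}pp to {high * 100:+.2f}pp")
```

Because the lower bound is negative and the upper bound positive, the interval contains zero, which is the "no effect can be claimed" reading.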
An e-commerce team rebuilds the checkout to cut drop-off. Current purchase-completion rate is 30%. They want to detect at least a +8% relative lift before shipping a change this big.
Because funnel analytics show 1 in 3 users abandon at a cluttered payment step, we believe a streamlined checkout with a progress bar will raise completion from 30% by at least +8% relative (to ≈32.4%). Success = a significant positive Z-test on a clean 50/50 split — the split health matters as much as the p-value here.
The calculator says ~5,850 per group. With heavy daily traffic they expose just 10% of sessions to the test and keep 90% on the current checkout — a small, safe exposure for a risky change. They'll ramp up only if it's healthy.
First read: 6,200 in control vs 5,800 in variant. The Analyse tab's Sample Ratio Mismatch check flags it: against the intended 50/50 split, χ² ≈ 13.3 and p ≈ 0.0003, far below the 0.001 alarm. A redirect bug was dropping some users before assignment. Any result here would be untrustworthy. They fix the bug and rerun.
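A Sample Ratio Mismatch check of this kind is commonly a chi-square goodness-of-fit test on the two group counts. The sketch below assumes that approach (the Analyse tab's internals are not described here) and uses the fact that a chi-square with 1 degree of freedom has survival function erfc(√(χ²/2)). The name `srm_check` is hypothetical.

```python
from math import erfc, sqrt


def srm_check(n_control, n_variant, expected_share_control=0.5):
    """Chi-square goodness-of-fit test of observed counts vs the intended split."""
    total = n_control + n_variant
    expected_control = total * expected_share_control
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    p_value = erfc(sqrt(chi2 / 2))  # exact for 1 degree of freedom
    return chi2, p_value


chi2, p_value = srm_check(6_200, 5_800)
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}, alarm = {p_value < 0.001}")
```

The 0.001 threshold is the alarm level mentioned in the text; it is deliberately stricter than 0.05 because the check runs on every experiment.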
Clean rerun: control 30.0% vs variant 32.5%, a healthy 50/50 split, z ≈ 4.2, p < 0.001.
A team reorders the plans on a pricing page, hoping to raise average revenue per visitor (currently ~$18, but very spread out — many $0 visitors, a few large purchases).
Because session recordings show visitors anchor on the first plan they see, we believe leading with the mid-tier plan will raise average revenue per visitor from $18 by at least +$1. Because revenue is a continuous, highly-skewed amount, success is a significant Welch's t-test — not a Z-test — on the per-visitor means.
Revenue per visitor is a continuous value, not yes/no, so this needs Welch's t-test, not a Z-test — and the spread (standard deviation), not just the average, drives the sample size. With high variance the calculator asks for ~23,700 per group. Traffic is only ~1,500/day, so the lesson flips: rather than a holdout, they must run ~5 weeks just to reach the minimum.
After 5 weeks: ~43,000 visitors per arm. Control mean $18.00, variant mean $19.15 (both with large standard deviations ~$42).
Welch's t-test: t ≈ 4.0, p < 0.001. The +$1.15 per visitor is small relative to the spread, but with 43,000 per arm it's clearly distinguishable from noise.
A product team adds a guided checklist to onboarding, hoping new users take more key actions per session. The metric is actions per session — a ratio, where each user contributes many sessions.
Because activation data shows users who complete 3+ actions in week one retain far better, we believe a guided checklist will raise actions per session from ~1.0 by a meaningful margin. Because the metric is a ratio (many sessions per user), success must be judged with the delta method, which gets the variance right — a naive test would over-claim.
You randomise by user but measure per session, so a user's sessions are correlated. A naive t-test on session-level ratios would understate the variance and cry "significant" too easily. The right tool is the delta method, which estimates the correct variance and feeds a Z-test. They test on a safe 20% slice first (a risky onboarding change), holding the rest on the current flow.
8,000 users per arm. Control averages ~1.01 actions/session; variant ~1.10 — but the per-user session counts vary widely, so the variance must be handled carefully.
Delta method + Z-test: z ≈ 6.9, p < 0.001. The lift in actions per session is real, and the variance was estimated correctly — a naive test would have over-stated the certainty.
A wide CI that crosses zero is the data telling you "I can't rule out no effect yet." The usual fix is more users — but because of the 1/√n rule, detecting ever-smaller effects gets expensive fast. That trade-off is exactly what the Plan tab quantifies when it sizes your test.
The intuition first. Imagine measuring whether a new gym program adds muscle. People start at wildly different weights, so the after-numbers are all over the place — the noise from "where each person started" drowns out the effect of your program. But you know each person's starting weight. If you measure the change from their own baseline instead of the raw after-number, that starting-point noise vanishes and the program's real effect pops out. CUPED is exactly this trick for product metrics: subtract the part of each user's behaviour you could have predicted before the test, and what's left is a much cleaner read on your change.
CUPED (Controlled-experiment Using Pre-Experiment Data) is the single highest-leverage trick in modern experimentation. The idea: a user who spent a lot last month will probably spend a lot this month regardless of your change. That predictable part is noise you can subtract.
The one-line version for stakeholders: "CUPED doesn't change the answer; it gets us to the same answer with less traffic, by cancelling out noise we already knew about."
The most common confusion is about timing: CUPED needs data from both before and during the experiment. The covariate X is collected before the test, but the adjustment itself is only computed after the test ends, because the formula needs each user's during-test outcome Y, which doesn't exist until the test is over. You decide on CUPED up front, but you run the math at analysis time.
One-line timing: collect X before, collect Y during, compute θ and adjust after.
Forget formulas for a moment. Here are 8 users from a revenue test — four in control (A), four in the variant (B). For each we have X = spend in the weeks before the test, and Y = spend during it. Small enough to follow every number by hand.
Average Y in control = $24.00, in variant = $29.25 — looks like a big +21.9% lift! But run a t-test and p = 0.74, nowhere near significant. Why? U2 ($55) and U6 ($60) are naturally big spenders bouncing the averages around so hard the test can't tell the gain from luck. The signal is real but the noise drowns it.
Watch the wild original column get pulled into a tight band — while the group averages barely move:
| User | Original Y ($) | CUPED Ỹ ($) |
|---|---|---|
| U1 (A) | 12 | 26.11 |
| U2 (A) | 55 | 23.50 |
| U3 (A) | 19 | 21.71 |
| U4 (A) | 10 | 26.39 |
| Mean A | 24.00 | 24.43 |
| U5 (B) | 18 | 29.83 |
| U6 (B) | 60 | 30.79 |
| U7 (B) | 27 | 27.43 |
| U8 (B) | 12 | 27.25 |
| Mean B | 29.25 | 28.82 |
| Test on… | Lift | t | p | Verdict at α = 0.05 |
|---|---|---|---|---|
| Original Y | +21.9% | 0.35 | 0.74 | Not significant ✗ |
| CUPED Ỹ | +18.0% | 3.09 | 0.023 | Significant ✓ |
The lesson: the raw test said "no difference" — but that was the noise talking, not the truth. The variant really was better; a few big spenders hid it. If your test comes back "not significant" on a noisy metric like revenue, don't conclude "no effect" — first ask whether you have pre-period data to CUPED-adjust with. You may have a winner hiding under the noise.
(This 8-user set is deliberately tiny so you can check the arithmetic — its correlation is unrealistically clean. Real covariates correlate around 0.3–0.6; the realistic example below uses ρ = 0.55.)
StreamFlix wants to know if a one-click checkout lifts average revenue per user (ARPU). ARPU is brutally noisy — a handful of annual-plan buyers dominate, so the standard deviation is huge relative to the mean.
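Because required sample size scales linearly with variance, a 50% variance reduction means you reach significance in half the time, or detect an effect about 29% smaller (1/√2 ≈ 0.71× the size) in the same time. Halving the detectable effect would take a 75% variance reduction, since the detectable effect scales with the standard deviation, not the variance. That is why most large experimentation platforms use it.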
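As a rough planning aid, the linear scaling can be written down directly: CUPED keeps a fraction 1 − ρ² of the original variance, so the required sample shrinks by the same factor. This is a sketch under the usual assumption that ρ is the correlation between the pre-period covariate and the in-experiment metric; `cuped_sample_size` is a hypothetical helper and the 20,000 starting figure is illustrative only.

```python
def cuped_sample_size(n_raw, rho):
    """Approximate users per group needed after CUPED.

    n_raw: users per group required without CUPED.
    rho:   correlation between pre-period covariate X and outcome Y.
    """
    variance_kept = 1 - rho ** 2  # share of the original variance that remains
    return n_raw * variance_kept


# Illustrative: a test that would need 20,000 per group, with rho = 0.55.
print(round(cuped_sample_size(20_000, 0.55)))
```

With ρ = 0.55 about 30% of the variance is removed, so the same test needs roughly 70% of the original traffic.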
Copy-paste runnable (needs only numpy + scipy). Swap the simulated arrays for your own per-user X (pre-period) and Y (in-experiment).
```python
# CUPED: variance reduction for an A/B test. Needs numpy + scipy.
import numpy as np
from scipy import stats


def analyse(Yc, Yv, Xc, Xv):
    # 1) estimate theta on the POOLED pre/post data (both arms together)
    Xall, Yall = np.concatenate([Xc, Xv]), np.concatenate([Yc, Yv])
    theta = np.cov(Yall, Xall, ddof=1)[0, 1] / Xall.var(ddof=1)

    # 2) transform each arm with the same theta and global mean of X
    Xbar = Xall.mean()
    Yc_c = Yc - theta * (Xc - Xbar)
    Yv_c = Yv - theta * (Xv - Xbar)

    # 3) Welch t-test, raw vs CUPED
    t_raw, p_raw = stats.ttest_ind(Yv, Yc, equal_var=False)
    t_cup, p_cup = stats.ttest_ind(Yv_c, Yc_c, equal_var=False)

    rho = np.corrcoef(Xall, Yall)[0, 1]
    var_red = 1 - np.concatenate([Yc_c, Yv_c]).var(ddof=1) / Yall.var(ddof=1)
    print(f"rho = {rho:.3f} (var. reduction rho^2 = {rho**2:.1%})")
    print(f"empirical var. drop = {var_red:.1%}")
    print(f"raw   : lift={(Yv.mean()-Yc.mean())/Yc.mean():+.2%} t={t_raw:.2f} p={p_raw:.4f}")
    print(f"cuped : lift={(Yv_c.mean()-Yc_c.mean())/Yc_c.mean():+.2%} t={t_cup:.2f} p={p_cup:.4f}")


# --- demo data: replace with your own per-user arrays ---
rng = np.random.default_rng(7)
N = 20_000
latent = rng.lognormal(2.2, 0.55, 2*N)
X = latent + rng.normal(0, 5.5, 2*N)  # pre-period spend
Y = latent + rng.normal(0, 5.5, 2*N)  # in-experiment spend
Y[N:] *= 1.03                         # inject a real +3% lift in the variant
analyse(Y[:N], Y[N:], X[:N], X[N:])
```
The intuition first. Suppose you want to know the average "actions per session." You randomise people into control and variant — but each person has many sessions. Here's the trap: one chatty user might have 40 sessions, another just 1. If you pile all sessions into one bucket and treat them as independent, you're pretending you have far more independent data points than you really do. Sessions from the same person rise and fall together, so they don't each count as fresh evidence. The delta method is the fix: it works out the true uncertainty of a ratio (a top number divided by a bottom number, where both wobble) and hands that honest uncertainty to a normal Z-test.
What kind of metric needs it? Any "ratio metric" where the denominator isn't fixed: actions per session, revenue per order, click-through rate per session, items per basket. The giveaway is that you randomise by one unit (the user) but count by another (sessions, orders, clicks).
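To make the mechanics concrete, here is a minimal sketch of the delta-method standard error for a ratio of per-user totals (for example actions over sessions), followed by a Z statistic comparing two arms. It assumes users are independent and uses the first-order variance approximation; the function names and the tiny demo arrays are hypothetical, not data from the examples here.

```python
from math import sqrt
from statistics import mean


def ratio_with_delta_se(numerators, denominators):
    """Ratio of totals and its delta-method standard error.

    numerators[i] and denominators[i] are one user's totals,
    e.g. that user's actions and that user's sessions.
    """
    n = len(numerators)
    mean_y, mean_x = mean(numerators), mean(denominators)
    var_y = sum((y - mean_y) ** 2 for y in numerators) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in denominators) / (n - 1)
    cov_xy = sum((y - mean_y) * (x - mean_x)
                 for y, x in zip(numerators, denominators)) / (n - 1)
    ratio = mean_y / mean_x
    # First-order (delta method) variance of mean_y / mean_x.
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * mean_x ** 2)
    return ratio, sqrt(var_ratio)


def ratio_z_statistic(control, variant):
    """Z statistic for the difference of two ratio metrics (variant minus control)."""
    r_c, se_c = ratio_with_delta_se(*control)
    r_v, se_v = ratio_with_delta_se(*variant)
    return (r_v - r_c) / sqrt(se_c ** 2 + se_v ** 2)


# Hypothetical per-user totals: (actions, sessions) for each arm.
control = ([3, 40, 2, 5, 1, 9], [3, 38, 2, 6, 1, 8])
variant = ([4, 45, 3, 7, 2, 10], [3, 39, 2, 6, 2, 8])
print(ratio_z_statistic(control, variant))
```

When every user has the same denominator, the formula collapses to the ordinary standard error of a mean, which is a useful sanity check.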
A team ships an onboarding nudge and measures actions per session. They randomise by user, but users have very different session counts — a handful of power users have dozens of sessions each.
The trade-off in one sentence: a 50/50 split is the most statistically efficient — it reaches significance with the fewest total users — but an uneven split (e.g. 90/10) exposes far fewer people to a risky variant, at the cost of needing more total traffic and more time.
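The cost of an uneven split can be quantified with a one-line formula. Assuming equal outcome variance in both arms, the variance of the difference is proportional to 1 / (f · (1 − f)), where f is the share of traffic sent to the variant, so total traffic relative to 50/50 is 0.25 / (f · (1 − f)). The helper name is hypothetical.

```python
def traffic_multiplier(variant_share):
    """Total traffic needed relative to a 50/50 split, for the same power.

    Assumes both arms have the same outcome variance.
    """
    return 0.25 / (variant_share * (1 - variant_share))


for share in (0.5, 0.2, 0.1, 0.05):
    print(f"{share:.0%} to variant: {traffic_multiplier(share):.2f}x total traffic")
```

Under this assumption a 90/10 split needs a little under three times the total traffic of a 50/50 split.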
Start risky or expensive changes on a small slice (5–10%) to confirm nothing breaks, then — if it's safe — ramp toward 50/50 to gather evidence efficiently. If you must keep a small exposure for the whole test, plan for the longer runtime up front and check that your traffic can realistically reach the variant's required sample within your timeframe.
In a perfectly symmetric normal distribution, mean = median = mode — all three sit dead centre. The spread follows the empirical rule: about 68% of values fall within ±1 SD of the mean, 95% within ±2 SD, and 99.7% within ±3 SD. Only ~0.3% lie beyond ±3 SD, which is why a 3-sigma event is considered rare.
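The empirical rule can be checked directly from the standard normal CDF. This small sketch uses only the Python standard library; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist


def share_within(k_sd):
    """Share of a normal distribution lying within +/- k_sd standard deviations."""
    std_normal = NormalDist()
    return std_normal.cdf(k_sd) - std_normal.cdf(-k_sd)


for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.1%}")
```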
With a long right tail, the order is mode < median < mean. The peak (mode) sits on the left where most values cluster; the few large values in the tail drag the mean rightward, past the median. Revenue and order value behave this way — which is why the average revenue per user is almost always higher than what a typical user spends. The spread is asymmetric: values stretch much further above the mean than below it.
A long left tail flips the order to mean < median < mode. The peak sits on the right where most values bunch up, and a few small values pull the mean leftward. Think of exam scores where most students do well but a handful fail badly, or the age at which a loyal cohort finally churns. Here the typical value (mode) is higher than the average.
The binomial distribution counts how many successes occur in a fixed number of independent yes/no trials, e.g. how many of 20 visitors convert. It is discrete (bars, not a curve) with mean = n × p (trials × probability). When p = 0.5 it is symmetric like a normal; when p is small it is right-skewed, with the bars bunched near zero and a tail stretching to the right. The key fact for A/B testing: as n grows, the binomial is well-approximated by a normal distribution, which is exactly why the Z-test works on conversion rates.
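One way to see the approximation improve with n is to compare the exact binomial CDF against a normal CDF with the same mean and variance. The sketch below does this with a continuity correction; the function name and the choice of n = 20 versus n = 500 at p = 0.1 are illustrative assumptions. It keeps n modest because the exact binomial coefficients become too large to convert to floats for very large n.

```python
from math import comb, sqrt
from statistics import NormalDist


def max_cdf_gap(n, p):
    """Largest gap between the exact binomial CDF and its normal approximation."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    cumulative, worst = 0.0, 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p ** k * (1 - p) ** (n - k)
        # k + 0.5 is the usual continuity correction for a discrete count.
        worst = max(worst, abs(cumulative - approx.cdf(k + 0.5)))
    return worst


print(f"n = 20:  max gap {max_cdf_gap(20, 0.1):.4f}")
print(f"n = 500: max gap {max_cdf_gap(500, 0.1):.4f}")
```

The gap shrinks as n grows, which is the practical content of "well-approximated by a normal".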
A single die is uniform — each face 1–6 is equally likely. But sum two dice and the result is no longer flat: it peaks at 7 and falls away symmetrically toward 2 and 12. This is a triangular distribution.
The reason is simple counting. There is only one way to roll a 2 (1+1) and one way to roll a 12 (6+6), but six different ways to roll a 7 (1+6, 2+5, 3+4, 4+3, 5+2, 6+1). More combinations means higher probability, so the middle towers over the edges.
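The counting argument can be verified by enumerating all 36 ordered outcomes of two dice. This is a minimal sketch using the standard library only.

```python
from collections import Counter
from itertools import product

# Count how many ordered (die1, die2) pairs produce each total.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(f"{total:>2}: {'#' * ways[total]} ({ways[total]}/36)")
```

The printed bars form the triangle: one way at each edge, rising by one per step to six ways at 7.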
Adding just two uniform dice already bends a flat distribution toward a peak. Add more dice and it gets smoother and more bell-shaped. This is the Central Limit Theorem in miniature — sums and averages of random things tend toward a normal distribution, whatever the original shape.
The most common misconception in A/B testing is that your raw data must be normally distributed. It usually doesn't have to be. The t-test and Z-test assume the sampling distribution of the mean is roughly normal, and thanks to the Central Limit Theorem, that happens automatically once your sample is large enough, no matter how skewed the underlying data is.
With more than ~30–50 users per group for a fairly tame metric, or a few hundred for a heavily skewed one like revenue, you do not need to check normality at all. The CLT has you covered.
A team tests a new checkout button. The metric is binary — converted or not — and each arm has 40,000 users. Someone asks whether they need to check normality first.
A B2B company pilots a pricing change with only 80 customers per arm. The metric is revenue per account — and two enterprise deals dwarf everyone else.
A team analysed average session duration with 5,000 users per arm — comfortably large — and trusted the t-test. The result was significant, but the rollout didn't move the needle.
Beyond normality, the distribution shape often tells you which test or metric to reach for. A few common product situations:
The deciding question: is each user a yes/no, or a number? That answer picks your test. Everything else is detail.
For a proportion, the rate fixes the variance — so summary counts are enough (Z-test). For a continuous value, the average and the spread are independent — so you must measure the spread separately, and the t-test is built to handle that uncertainty.
| Metric type | Test | When to use |
|---|---|---|
| Conversion / proportion | Z-test | Binary outcome per user |
| Continuous per-user metric | Welch's t-test | One numeric value per user; handles unequal variances |
| Ratio metric (AOV, CTR…) | Aggregate → Welch's t | Compute one value per user, treat as average metric |
| Skewed / small n (<200) | Mann–Whitney U | When CLT hasn't kicked in; tests stochastic dominance |
| 3+ variants on conversion | χ² + Bonferroni | Run χ² first; pairwise Z with α/k if significant |
What it does: Tests whether two proportions (p₁, p₂) are statistically different.
Formula: Z = (p₂ − p₁) / √(p̂(1−p̂)(1/n₁ + 1/n₂)) where p̂ is the pooled proportion.
Assumptions: n·p ≥ 5 and n·(1−p) ≥ 5 for both groups. Violated with very low conversion rates and small samples.
Examples: Signup rate, purchase rate, click-through rate, open rate, retention, trial-to-paid.
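The pooled formula above translates into a few lines of code. This sketch assumes a two-sided test and takes raw success counts; `two_proportion_z_test` is a hypothetical name, and the usage line plugs in the newsletter counts (5,503 and 5,795 opens out of 25,000 each).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion Z-test. Returns (z, two-sided p-value).

    x1, n1: successes and users in group 1 (control).
    x2, n2: successes and users in group 2 (variant).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(5_503, 25_000, 5_795, 25_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```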
What it does: Tests whether two group means differ, without assuming equal variances. Always safer than Student's t.
Formula: t = (x̄₂ − x̄₁) / √(s₁²/n₁ + s₂²/n₂) with Welch–Satterthwaite degrees of freedom.
Assumptions: Independence. For small n, approximate normality. For large n, CLT covers you.
Examples: ARPU, average session length, items in cart, 14-day spend.
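When only summary statistics are available, the Welch statistic and its degrees of freedom can be computed by hand. The sketch below follows the formula above; the p-value line uses a normal approximation, which is an assumption that is only reasonable when the degrees of freedom are large. The usage plugs in the pricing-page figures ($18.00 vs $19.15, SD of about $42, 43,000 per arm).

```python
from math import sqrt
from statistics import NormalDist


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # squared standard errors per group
    t = (mean2 - mean1) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_from_summary(18.00, 42, 43_000, 19.15, 42, 43_000)
# Assumption: with df this large the t distribution is close to normal.
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
print(f"t = {t:.2f}, df = {df:.0f}, p (normal approx.) = {p_approx:.5f}")
```

For small samples, use a proper t distribution instead of the normal shortcut, for example via `scipy.stats.ttest_ind(..., equal_var=False)` on the raw per-user values.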
The problem: Metrics like AOV or CTR per session have a varying denominator. Running a t-test on raw transactions treats each order as independent — but orders from the same user are correlated. This makes the standard error 2–10× too small, inflating false positives to 10–30%.
The safe approach: Aggregate to one value per user before testing.
Then run Welch's t on those per-user values. Users with zero denominator (zero orders) should be excluded or handled explicitly.
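The aggregation step might look like the following sketch, which collapses order-level rows into one average order value per user. The input shape, a list of `(user_id, order_value)` pairs, and the function name are assumptions for illustration. Users with no orders never appear in the input, so they are excluded by construction.

```python
from collections import defaultdict


def per_user_aov(orders):
    """Collapse order rows into one average order value per user.

    orders: iterable of (user_id, order_value) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, value in orders:
        totals[user_id] += value
        counts[user_id] += 1
    return {user: totals[user] / counts[user] for user in totals}


# Hypothetical order log: u3 placed three orders but still counts once.
orders = [("u1", 20.0), ("u1", 40.0), ("u2", 15.0),
          ("u3", 100.0), ("u3", 50.0), ("u3", 30.0)]
print(per_user_aov(orders))
```

The resulting per-user values, one list per arm, are what go into Welch's t-test.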
When to use: Heavily skewed distributions or n < 200 per arm, where outliers dominate and CLT hasn't fully applied.
What it tests: Stochastic dominance — whether a random value from B tends to exceed one from A. Does not test mean equality.
Examples: B2B contract values, enterprise deal sizes, early-stage metrics with thin data.
For binary outcomes: run χ² across the full contingency table first. If significant, follow with pairwise Z-tests using Bonferroni correction (α/k per pair, where k = number of pairs).
For continuous metrics with 3+ arms: one-way ANOVA, then pairwise Welch's t with Bonferroni or Holm–Bonferroni.
Why not just run all pairs at α = 0.05? With 3 variants you have 3 pairs — the family-wise error rate reaches ~14%. Bonferroni keeps it at 5%.
High-variance metrics (ARPU, LTV) need enormous samples to detect meaningful lifts. Three ways to fix this:
Y_cuped = Y − θ·(X − E[X]), where θ = Cov(Y,X)/Var(X). Same expected value, much less noise.
Users respond differently to something new (novelty) or resist change (primacy). Both distort early estimates.
An SRM occurs when the actual split differs from the intended split. This indicates a bug and invalidates results even if p < 0.05.
Stopping as soon as p < 0.05 inflates your actual error rate. With daily peeking over 4 weeks, it can reach 25%+.
Standard A/B testing assumes user independence. This breaks in social products, marketplaces, and referral loops.
Testing many metrics at once inflates the chance at least one looks significant by chance.
| Lever | α | β | Power | Mechanism |
|---|---|---|---|---|
| Increase n | — | ↓ | ↑ | SE shrinks like 1/√n |
| CUPED / stratification | — | ↓ | ↑ | Same as more data, achieved statistically |
| Raise α (0.05 → 0.10) | ↑ | ↓ | ↑ | Lower rejection threshold |
| Lower α (0.05 → 0.01) | ↓ | ↑ | ↓ | Higher rejection threshold |
| One-sided test | — | ↓ | ↑ | All α budget on one tail |
| Pre-register hypothesis | ↓ | — | — | Prevents post-hoc shopping |
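The direction of each lever in the table can be checked numerically. This sketch computes approximate power for a two-proportion test using the normal approximation and ignoring the negligible chance of rejecting in the wrong direction; the function name is hypothetical and the inputs reuse the newsletter scenario (22% → 23.1%, about 22,655 per group).

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05, one_sided=False):
    """Approximate power to detect p1 vs p2 with equal group sizes."""
    z = NormalDist()
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    critical = z.inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    return z.cdf(abs(p2 - p1) / se - critical)


base = power_two_proportions(0.22, 0.231, 22_655)
print(f"baseline          : {base:.2f}")
print(f"double n          : {power_two_proportions(0.22, 0.231, 45_310):.2f}")
print(f"alpha = 0.10      : {power_two_proportions(0.22, 0.231, 22_655, alpha=0.10):.2f}")
print(f"alpha = 0.01      : {power_two_proportions(0.22, 0.231, 22_655, alpha=0.01):.2f}")
print(f"one-sided         : {power_two_proportions(0.22, 0.231, 22_655, one_sided=True):.2f}")
```

Each line moves power in the direction the table shows: more users, a looser α, or a one-sided test raise it, while a stricter α lowers it.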
All calculations run locally in your browser — no data leaves your device.
The problem. If you fix α = 0.05 and test one hypothesis, your chance of a false positive is 5%. But test m independent hypotheses and the probability of at least one false positive, the family-wise error rate (FWER), is 1 − (1 − α)^m. With 3 comparisons it is already ~14%; with 10 it is ~40%. More variants, more metrics, or repeated peeking all multiply your chances of being fooled.
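The family-wise error rate for m independent tests at level α is 1 minus the chance that none of them fires, that is 1 − (1 − α) raised to the power m. A short sketch (the function name is hypothetical) shows how quickly it grows:

```python
def family_wise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 3, 5, 10, 20):
    print(f"{m:>2} tests: {family_wise_error_rate(m):.1%}")
```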
With 3+ variants, don't jump straight to pairwise comparisons. First run a single omnibus test — χ² for conversion, one-way ANOVA for continuous metrics — that asks one question: "is any variant different from the others?" Only if that comes back significant do you proceed to corrected pairwise tests to find which ones. This two-stage gate (the protected procedure) controls the error rate and stops you from fishing through pairs that the data never justified. This calculator follows exactly that order: χ² first, then Holm-corrected pairwise comparisons vs control.
A few pre-planned comparisons where any false positive is costly → Holm–Bonferroni (the recommended default). Screening many metrics where you can tolerate a small false-positive share for more discoveries → Benjamini–Hochberg. Plain Bonferroni is fine when m is small and you want the simplest defensible rule. Guardrail metrics that protect against regressions are usually left uncorrected — there you want maximum sensitivity.
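The difference between the two recommended corrections is easiest to see on a small set of p-values. The sketch below implements the textbook Holm–Bonferroni step-down and Benjamini–Hochberg step-up procedures; it is not taken from the calculator's source, and the function names and example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: step down from the smallest p-value, stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # this and all larger p-values are not rejected
        reject[i] = True
    return reject


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg: reject every p-value up to the largest passing rank."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


p_values = [0.01, 0.04, 0.03]
print("Holm:", holm_reject(p_values))
print("BH  :", benjamini_hochberg_reject(p_values))
```

On these three p-values Holm rejects only the smallest, while Benjamini–Hochberg rejects all three. That is the trade described above: Holm guards against any false positive, and Benjamini–Hochberg accepts a small false-positive share in exchange for more discoveries.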