A/B Test Sample Size Calculator

Compute users per variant for a statistically valid A/B test. Inputs: baseline rate, minimum detectable effect, power, significance. Free, no signup.

⚠ Disclaimer: Estimates for planning purposes only. Industry benchmarks drift over time and your specific circumstances may differ materially. Verify against your own data and consult an accountant or business adviser for material decisions.

Before running an A/B test, compute how many users per variant you need to detect a given lift with confidence. The math: a two-proportion z-test sized for your chosen significance (α) and statistical power (1−β). Standard defaults: α = 0.05 (95% confidence), power = 80%.

📅 Research current as of 23 May 2026 · Sources: two-proportion z-test sample size, n = (zα/2 × √(2p̄q̄) + zβ × √(p₁q₁ + p₂q₂))² ÷ (p₁ − p₂)². Standard frequentist A/B sizing. References: Kohavi/Tang/Xu, Casella/Berger.

How to Use the Sample Size Calculator

Enter baseline conversion rate

From your current analytics for the metric you're testing. E.g. current checkout conversion is 3%. Use the most recent stable period (last 2-4 weeks).

Pick a realistic minimum detectable effect (MDE)

Relative lift you'd consider a win. 10% relative lift is typical for content / UX changes; 20% for major flow restructures; 5% for very polished funnels where small improvements compound. Don't pick "any lift" — that requires impractically large samples.

Use defaults for α and power unless you know better

α = 0.05 and power = 80% are the industry-standard defaults. Use stricter settings (α = 0.01, power = 90% or 95%) for high-stakes tests such as a homepage redesign or pricing change; a relaxed α = 0.10 is rarely justified.

Read users per variant + days needed

If the days needed come to more than 28, your MDE is too small for your traffic. Either raise the MDE (accepting that you can only detect a bigger lift), send more traffic to the test, or pick a higher-traffic metric.
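
The last step can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name `days_needed` and the even traffic split across variants are assumptions made for the example.

```python
import math

def days_needed(users_per_variant: int, daily_users: int, variants: int = 2) -> int:
    """Days to reach the target sample, assuming traffic is split evenly
    across all variants (hypothetical helper, not the tool's own code)."""
    if daily_users <= 0:
        raise ValueError("daily_users must be positive")
    total_users = users_per_variant * variants
    return math.ceil(total_users / daily_users)

days = days_needed(users_per_variant=50_000, daily_users=2_000)
print(days)  # 50
# Rule of thumb from the step above: more than 28 days means the MDE is too small.
print("MDE too small for this traffic" if days > 28 else "feasible")
```

If only part of your traffic enters the test, pass that smaller figure as `daily_users`.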


A/B Test Sample Sizing — The Half-Done Discipline

Why Most A/B Tests Are Underpowered

The dirty secret of CRO is that the typical A/B test reported as "won" by product teams was statistically underpowered: too few users and too short a duration, so the share of declared wins that are really false positives is far higher than the nominal 5% suggests. Ronny Kohavi (former Amazon, Microsoft, Airbnb experimentation leader) estimates 70-80% of self-reported A/B test wins fail to replicate. The root cause is almost always sample size: teams pick a baseline metric, run a test for two weeks, see a 12% lift, declare victory, and roll it out, without ever computing what sample size would be required to detect a 12% lift at 80% power.

The numbers are larger than most teams expect: at a 3% baseline conversion rate, detecting a 10% relative lift (3% → 3.3%) at 95% confidence and 80% power requires roughly 50,000 users per variant. A site with 2,000 daily users split evenly across two variants sends 1,000 users a day to each, so it needs about 50 days to get there. Teams that don't compute this in advance routinely run 14-day tests on 30,000 users, see noise that looks like a win, and ship it. The replication problem follows directly.

The Sample Size Formula

Two-proportion z-test sample size: n per arm = (zα/2 × √(2p̄q̄) + zβ × √(p₁q₁ + p₂q₂))² ÷ (p₁ − p₂)², where p₁ is the baseline rate, p₂ is the target rate, p̄ = (p₁ + p₂) / 2 is their average, q = 1 − p for each rate (so q̄ = 1 − p̄), zα/2 is the critical value for your chosen significance (1.96 for α = 0.05), and zβ is the critical value for power (0.842 for 80% power). The formula has two terms: the first ensures you don't falsely call a true null result a win (Type I error / α); the second ensures you can detect a real effect when it exists (Type II error / β). Both matter. Most teams worry about α and ignore β, which is why so many tests are underpowered.
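
As a rough sketch, the formula above translates directly into Python using only the standard library. The function name `sample_size_per_variant` and its parameter names are illustrative assumptions; p̄ is taken as the average of the two rates, and the test is assumed to be two-sided.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Users per arm for a two-sided two-proportion z-test (illustrative sketch)."""
    p1 = baseline                          # control conversion rate
    p2 = baseline * (1 + relative_mde)     # target rate after the relative lift
    p_bar = (p1 + p2) / 2                  # average rate (assumed definition of p-bar)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n = sample_size_per_variant(baseline=0.03, relative_mde=0.10)
print(n)  # a little over 53,000, in line with the "roughly 50,000" quoted earlier
```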

The non-linear punch line: sample size scales as 1 / (effect size)². Detecting a 5% relative lift takes roughly 4× the sample of detecting a 10% lift. Detecting a 2% lift takes roughly 25× the sample. This is why mature CRO programs pick large MDEs (15-25%) for testing: small lifts are theoretically possible but require sample sizes most products can't deliver. Better to do fewer, larger-effect tests than many small-effect tests that can't reach significance.
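
The scaling can be checked numerically. The sketch below re-implements the same two-proportion formula in a hypothetical helper `n_per_arm` (the name and the 3% baseline are assumptions for illustration) and compares sample sizes across MDEs.

```python
from math import sqrt
from statistics import NormalDist

def n_per_arm(p1, relative_mde, alpha=0.05, power=0.80):
    """Unrounded users per arm from the two-proportion z-test formula."""
    p2 = p1 * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    top = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return top / (p2 - p1) ** 2

reference = n_per_arm(0.03, 0.10)  # 10% relative lift as the reference point
for mde in (0.10, 0.05, 0.02):
    ratio = n_per_arm(0.03, mde) / reference
    print(f"MDE {mde:.0%}: {ratio:.1f}x the sample of a 10% MDE")
```

The ratios land close to, but not exactly on, 4 and 25, because the variance terms shift slightly as the target rate changes.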

"At a 3% baseline conversion rate, detecting a 10% relative lift at standard statistical thresholds requires 50,000 users per variant. Two-week tests on lower volumes produce noise dressed up as wins. The replicate rate of self-reported A/B test winners is around 20-30%."

What Power Actually Means

Statistical power = the probability that your test will correctly detect a real effect of the size you specified, given the sample size you ran. 80% power means: if the true effect is exactly the MDE you specified, you'd correctly call it significant 80% of the time and miss it (Type II error / β = 20%) 20% of the time. Higher power = more sensitive test = larger sample. The industry-standard 80% power is a tradeoff — high enough to catch most real effects, low enough to be practical. Increase to 90% for high-stakes tests (homepage, pricing); 95% only when missing a real effect is genuinely costly. Below 80%, you're running a test that will frequently fail to detect real wins.
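
To make the definition concrete, the sample-size formula can be inverted to give the power a test has at a given sample size. This is a sketch under the same two-proportion z-test assumptions; `achieved_power` is a hypothetical name, and the negligible probability of a significant result in the wrong direction is ignored.

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test at n users per arm."""
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd_null = sqrt(2 * p_bar * (1 - p_bar))         # spread assumed under "no effect"
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))    # spread under the true effect
    z = (abs(p2 - p1) * sqrt(n_per_variant) - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z)

# 3% -> 3.3% with a fully sized test versus a 30,000-user test (15,000 per arm)
print(round(achieved_power(0.03, 0.033, 53_211), 2))  # about 0.80
print(round(achieved_power(0.03, 0.033, 15_000), 2))  # roughly a third
```

In this example the smaller test would miss a real 10% lift about two times out of three.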

10 Facts About A/B Test Statistics

01

Sample size scales as 1 / (effect size)². Detecting a 5% lift takes 4× the users of detecting a 10% lift.

02

α = 0.05 (5% false-positive rate) is industry default. Stricter for high-stakes tests.

03

Power = 80% is industry default. Tests below 80% power frequently miss real wins.

04

Ronny Kohavi (Amazon/Microsoft/Airbnb experimentation lead) — estimates 70-80% of self-reported A/B wins fail to replicate.

05

The two-proportion z-test is the standard fixed-horizon test for binary conversion metrics and the basis of most classic sample-size calculators.

06

"Peeking" at A/B test results before reaching sample size inflates false-positive rate from 5% to 25-30%.

07

Sequential testing (Optimizely Stats Engine, which is built on the mSPRT; Evan Miller's simple sequential test) allows early stopping without inflating Type I error.

08

Bayesian A/B testing (Lyst, Convoy, others) returns "probability variant beats control" — different interpretation, similar sample needs.

09

14-day minimum runtime regardless of sample reached — captures weekly seasonality.

10

Multiple testing correction (Bonferroni, Holm-Bonferroni) needed when testing many metrics or variants.
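
The peeking effect in fact 06 can be illustrated with a small simulation. This is a simplified sketch, not a model of any particular tool: it assumes no true difference between variants, equal daily traffic, and approximates the cumulative z-statistic as a scaled random walk. The function name and defaults are made up for the example.

```python
import random
from math import sqrt

def peeking_false_positive_rate(looks=20, simulations=5000, z_crit=1.96, seed=0):
    """Share of A/A tests declared 'significant' at any of `looks` daily peeks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for day in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one day's standardized difference
            z = running_sum / sqrt(day)         # z-statistic on all data so far
            if abs(z) > z_crit:                 # stop at the first "significant" peek
                false_positives += 1
                break
    return false_positives / simulations

print(peeking_false_positive_rate(looks=1))   # near the nominal 5%
print(peeking_false_positive_rate(looks=20))  # several times the nominal rate
```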

Frequently Asked Questions

  • 10-15% relative lift is typical for UX / copy changes. 15-25% for layout / flow restructures. 5-10% only for highly polished funnels with massive traffic where small lifts compound. Setting MDE too small (< 5%) usually means impossibly large sample sizes; setting it too high (50%+) means you're only catching huge wins and missing many real moderate ones. Start at 10% and adjust based on your traffic.
  • 0.05 (95% confidence) for most tests. Use 0.01 (99% confidence) for high-stakes irreversible changes: homepage redesign, checkout flow restructure, pricing change. At 80% power the stricter threshold increases the required sample size by roughly 50% but reduces false-positive risk. Use 0.10 (90% confidence) only for low-stakes pilot tests where shipping a bad variant doesn't matter.
  • A/B testing is a statistical detection problem and signal-vs-noise depends on baseline rate. Low baseline rates (e.g. 2% checkout) need much more sample to detect small lifts because the absolute change is tiny relative to baseline noise. Higher baselines (e.g. 30% click-through) need less sample for the same relative lift. The 1/(effect size)² scaling makes small-effect detection genuinely expensive.
  • Peeking = checking A/B test results before the planned sample size is reached, and stopping the test as soon as you see significance. This destroys the false-positive math — true false-positive rate jumps from 5% to 25-30% with daily peeking. Either commit to the planned sample size and don't peek, or use sequential-testing methods (Optimizely's Stats Engine, Bayesian frameworks) that explicitly allow early stopping.
  • Run for at least 14 days, even if the sample size is reached earlier. 14 days captures weekly seasonality (weekday vs weekend behaviour). Shorter tests can be biased by single-week phenomena (a viral moment, a TV ad, a major sporting event). High-traffic sites often run 14-21 days; low-traffic sites may need 28-56 days even for moderate MDEs.
  • A/B/n testing (control + multiple variants) needs Bonferroni correction or similar to control family-wise error rate. With 4 variants compared to control, divide α by 4 (so α/4 = 0.0125 per comparison). At 80% power this increases the per-variant sample size by roughly 40%. Alternative: use a multi-armed bandit (Thompson sampling, Bayesian optimisation), which doesn't suffer from the multiple-testing penalty but is harder to operationalise.
  • Bayesian methods (used by Lyst, Convoy, VWO Bayesian mode) return "probability variant beats control" instead of a p-value. The interpretation is more intuitive ("85% probability variant is better") but the underlying sample-size needs are similar to frequentist methods. Main advantages: graceful handling of small sample sizes (no awkward "not significant" verdicts), natural fit with sequential / early-stopping decisions. Disadvantage: requires picking a prior (subjective).
  • This tool uses the same underlying two-proportion z-test formula as Evan Miller's calculator, the most widely cited reference, used by Optimizely and many CRO blogs. Our tool produces the same result for the same inputs. We add: explicit "days needed" given your daily traffic, multi-tier alpha/power dropdowns, and sample sizes for the most common SaaS / e-com baseline rates.
  • Three possibilities: (1) the variant has zero or near-zero effect: call it a tie and don't ship; (2) the variant has a real but smaller-than-MDE effect: keep running, or accept that confirming a smaller effect requires a bigger sample; (3) there's a real moderate effect but your power was too low. Check the power your test actually had: at the sample size you reached, what was the probability of detecting the MDE you planned for? (Power computed from the observed difference only restates the p-value, so use the planned MDE.) If it was under 60%, the test was underpowered and "no significance" doesn't mean "no effect"; it means "not enough data." Plan more carefully next time.
  • Audience size matters a lot. Singapore + Hong Kong tech products often have small audiences (50K-500K MAU) where the standard 50,000-users-per-variant requirement means tests run 4-8 weeks. The pragmatic adaptation: pick larger MDEs (20-30%), test fewer, bigger changes rather than many small ones, and lean Bayesian for early-stopping flexibility. Indonesia + Philippines products typically have much higher traffic (5M-50M MAU), so sample size is rarely the bottleneck; runtime is, since weekly seasonality still applies.
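
The cost of a stricter α, whether chosen directly or imposed by a Bonferroni correction, can be estimated with the same two-proportion formula. The sketch below is illustrative only: `n_per_arm` and `bonferroni_alpha` are hypothetical helper names, and the 3% baseline with a 10% relative MDE is an assumed example.

```python
from math import sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Unrounded users per arm for a two-sided two-proportion z-test."""
    p_bar = (p1 + p2) / 2
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    top = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return top / (p2 - p1) ** 2

def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha that keeps the family-wise error rate at `alpha`."""
    return alpha / comparisons

base = n_per_arm(0.03, 0.033)                      # single comparison, alpha = 0.05
strict = n_per_arm(0.03, 0.033, alpha=0.01)        # high-stakes threshold
corrected = n_per_arm(0.03, 0.033, alpha=bonferroni_alpha(0.05, 4))  # 4 variants vs control
print(f"alpha = 0.01:    {strict / base:.2f}x the baseline sample per arm")
print(f"alpha = 0.05/4:  {corrected / base:.2f}x the baseline sample per arm")
```

Holm-Bonferroni or other less conservative corrections would give smaller inflation factors; they are not modelled here.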
