A/B Test Sample Size Calculator
Compute users per variant for a statistically valid A/B test. Inputs: baseline rate, minimum detectable effect, power, significance. Free, no signup.
Before running an A/B test, compute how many users per variant you need to detect a given lift with confidence. The math: a two-proportion z-test sized for your chosen significance (α) and statistical power (1−β). Standard defaults: α = 0.05 (95% confidence), power = 80%.
How to Use the Sample Size Calculator
Enter baseline conversion rate
From your current analytics for the metric you're testing. E.g. current checkout conversion is 3%. Use the most recent stable period (last 2-4 weeks).
Pick a realistic minimum detectable effect (MDE)
Relative lift you'd consider a win. 10% relative lift is typical for content / UX changes; 20% for major flow restructures; 5% for very polished funnels where small improvements compound. Don't pick "any lift" — that requires impractically large samples.
Use defaults for α and power unless you know better
α = 0.05 and power = 80% are the industry-standard defaults. For high-stakes tests (homepage redesign, pricing change), tighten to α = 0.01 and raise power to 90-95%. A relaxed α = 0.10 is rarely justified.
Read users per variant + days needed
If the days-needed is over 28, your MDE is too small for your traffic. Either raise the MDE (accept that you can only detect a bigger lift), increase traffic to the test, or pick a higher-traffic metric.
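The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function names, the even traffic split, and the example inputs (3% baseline, 2,000 eligible users per day) are assumptions, and the 14-day floor is borrowed from the weekly-seasonality guidance elsewhere on this page.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Two-proportion z-test sample size per variant (two-sided)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    root = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(root ** 2 / (p2 - p1) ** 2)


def days_needed(n_per_variant, daily_users, variants=2, min_days=14):
    """Days to fill every variant, assuming an even traffic split.

    min_days is a floor so the test always spans full weekly cycles.
    """
    per_variant_per_day = daily_users / variants
    return max(min_days, ceil(n_per_variant / per_variant_per_day))


# Hypothetical inputs: 3% baseline, 2,000 eligible users per day.
for mde in (0.10, 0.15, 0.20):
    n = users_per_variant(0.03, mde)
    days = days_needed(n, daily_users=2000)
    verdict = "ok" if days <= 28 else "too long: raise the MDE or add traffic"
    print(f"MDE {mde:.0%}: {n:,} users per variant, {days} days ({verdict})")
```

Looping over a few candidate MDEs makes the trade-off in the last step visible: the smallest MDE whose runtime stays at or under 28 days is the smallest lift your traffic can realistically detect.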
A/B Test Sample Sizing — The Half-Done Discipline
Why Most A/B Tests Are Underpowered
The dirty secret of CRO is that the typical A/B test reported as "won" by product teams was statistically underpowered — too few users, too short a duration, false-positive rate dramatically higher than the nominal 5%. Ronny Kohavi (former Amazon, Microsoft, Airbnb experimentation leader) estimates 70-80% of self-reported A/B test wins fail to replicate. The root cause is almost always sample size: teams pick a baseline metric, run a test for two weeks, see a 12% lift, declare victory, and roll it out — without ever computing what sample size would be required to detect a 12% lift at 80% power.
The math is forgiving up to a point: at a 3% baseline conversion rate, detecting a 10% relative lift (3% → 3.3%) at 95% confidence and 80% power requires roughly 50,000 users per variant. A site with 2,000 daily users split evenly between control and variant collects 1,000 users per variant per day, so it needs about 50 days to get there. Teams that don't compute this in advance routinely run 14-day tests on 30,000 users, see noise that looks like a win, and ship it. The replicate-rate problem follows directly.
The Sample Size Formula
Two-proportion z-test sample size: n per arm = (zα/2 × √(2p̄q̄) + zβ × √(p₁q₁ + p₂q₂))² ÷ (p₁ − p₂)², where p₁ is the baseline rate, p₂ is the target rate, p̄ = (p₁ + p₂) / 2 is the average of the two rates, q = 1 − p for each rate (so q̄ = 1 − p̄), zα/2 is the two-sided critical value for your chosen significance (1.96 for α = 0.05), and zβ is the normal quantile for your chosen power (0.842 for 80% power). The numerator has two terms: the first controls how often a true null result is falsely called a win (Type I error, α); the second ensures you can detect a real effect when it exists (Type II error, β). Both matter. Most teams worry about α and ignore β, which is why so many tests are underpowered.
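As a sketch, the formula translates almost line for line into Python. The function name and argument names are illustrative assumptions; the critical values come from the standard library's `statistics.NormalDist` rather than being hard-coded, and equal allocation between the two arms is assumed.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Users per arm for a two-sided two-proportion z-test."""
    p1 = baseline                          # baseline rate
    p2 = baseline * (1 + relative_lift)    # target rate
    p_bar = (p1 + p2) / 2                  # average rate (equal allocation)
    q_bar = 1 - p_bar
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for power = 0.80
    numerator = (z_alpha * sqrt(2 * p_bar * q_bar)
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# 3% baseline, 10% relative lift (3% -> 3.3%), default alpha and power.
print(sample_size_per_arm(0.03, 0.10))
```

For the 3% baseline and 10% relative lift used throughout this page, the result comes out at about 53,000 users per arm, in line with the "roughly 50,000" rule of thumb.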
The non-linear punch line: sample size scales as 1 / (effect size)². Detecting a 5% relative lift takes 4× the sample of detecting a 10% lift. Detecting a 2% lift takes 25× the sample. This is why mature CRO programs pick large MDEs (15-25%) for testing — small lifts are theoretically possible but require sample sizes most products can't deliver. Better to do fewer, larger-effect tests than many small-effect tests that can't reach significance.
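The inverse-square scaling can be read off the formula directly. The short derivation below is a sketch: it introduces δ = p₂ − p₁ for the absolute difference and r for the relative lift (neither symbol appears in the formula above) and assumes the two rates are close enough that the two square-root terms are nearly equal.

```latex
\begin{align*}
p_1 q_1 + p_2 q_2 &= 2\bar{p}\bar{q} - \tfrac{\delta^2}{2} \;\approx\; 2\bar{p}\bar{q}
  && \text{(since } \delta = p_2 - p_1 \text{ is small)} \\
n &\approx \frac{2\,(z_{\alpha/2} + z_\beta)^2\,\bar{p}\bar{q}}{\delta^2},
  \qquad \delta = r\,p_1 \\
\frac{n(r/2)}{n(r)} &\approx 4, \qquad \frac{n(r/5)}{n(r)} \approx 25
\end{align*}
```

With α = 0.05 and 80% power, 2(1.96 + 0.842)² ≈ 15.7, so n ≈ 15.7 × p̄q̄ ÷ δ². For a 3% baseline and a 10% relative lift (δ = 0.003, p̄q̄ ≈ 0.0305) that gives about 53,000 users per arm. The ratios are approximate because p̄q̄ shifts slightly as the target rate changes.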
"At a 3% baseline conversion rate, detecting a 10% relative lift at standard statistical thresholds requires 50,000 users per variant. Two-week tests on lower volumes produce noise dressed up as wins. The replicate rate of self-reported A/B test winners is around 20-30%."
What Power Actually Means
Statistical power = the probability that your test will correctly detect a real effect of the size you specified, given the sample size you ran. 80% power means: if the true effect is exactly the MDE you specified, you'd correctly call it significant 80% of the time and miss it (Type II error / β = 20%) 20% of the time. Higher power = more sensitive test = larger sample. The industry-standard 80% power is a tradeoff — high enough to catch most real effects, low enough to be practical. Increase to 90% for high-stakes tests (homepage, pricing); 95% only when missing a real effect is genuinely costly. Below 80%, you're running a test that will frequently fail to detect real wins.
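Solving the sample size formula for power instead of n shows how much power a given test actually has. The sketch below uses the same normal approximation as the two-proportion z-test; the function name is hypothetical, and it ignores the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline, relative_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd_null = sqrt(2 * p_bar * (1 - p_bar))            # spread if there is no effect
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))       # spread if the lift is real
    z = (abs(p2 - p1) * sqrt(n_per_arm) - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z)


# 3% baseline, 10% relative lift, at two different sample sizes per arm.
for n in (15_000, 53_000):
    print(f"{n:,} per arm: power = {achieved_power(0.03, 0.10, n):.0%}")
```

At 15,000 users per arm (a 14-day test on 30,000 users, assuming an even split between two variants), power for a 10% relative lift on a 3% baseline is only around one in three, so a real lift of that size would be missed most of the time.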
10 Facts About A/B Test Statistics
Sample size scales as 1 / (effect size)². Detecting a 5% lift takes 4× the users of detecting a 10% lift.
α = 0.05 (5% false-positive rate) is industry default. Stricter for high-stakes tests.
Power = 80% is industry default. Tests below 80% power frequently miss real wins.
Ronny Kohavi (former Amazon, Microsoft, Airbnb experimentation leader) estimates 70-80% of self-reported A/B wins fail to replicate.
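Two-proportion z-test is the standard fixed-horizon test for binary conversion metrics and underlies most sample size calculators. Platforms such as Optimizely and VWO also offer sequential or Bayesian engines.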
"Peeking" at A/B test results before reaching sample size inflates false-positive rate from 5% to 25-30%.
Sequential testing (Optimizely's Stats Engine, which builds on the mSPRT, or Evan Miller's simple sequential test) allows early stopping without inflating Type I error.
Bayesian A/B testing (Lyst, Convoy, others) returns "probability variant beats control" — different interpretation, similar sample needs.
14-day minimum runtime regardless of sample reached — captures weekly seasonality.
Multiple testing correction (Bonferroni, Holm-Bonferroni) needed when testing many metrics or variants.
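To make the last fact concrete, here is a minimal sketch of how a Bonferroni correction feeds into the sample size. The helper names and the example inputs (3% baseline, 10% relative lift, 4 comparisons against control) are assumptions for illustration; Holm-Bonferroni is not shown.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Two-proportion z-test sample size per variant (two-sided)."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    root = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(root ** 2 / (p2 - p1) ** 2)


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha that keeps the family-wise error rate at alpha."""
    return alpha / comparisons


base = users_per_variant(0.03, 0.10)
adjusted = users_per_variant(0.03, 0.10, alpha=bonferroni_alpha(0.05, 4))
print(f"alpha = 0.05:   {base:,} per variant")
print(f"alpha = 0.0125: {adjusted:,} per variant ({adjusted / base:.2f}x)")
```

Under these assumptions the corrected test needs roughly 1.4 times the users per variant, and the total cost also grows because every extra variant needs its own full sample.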
Frequently Asked Questions
- 10-15% relative lift is typical for UX / copy changes. 15-25% for layout / flow restructures. 5-10% only for highly polished funnels with massive traffic where small lifts compound. Setting MDE too small (< 5%) usually means impossibly large sample sizes; setting it too high (50%+) means you're only catching huge wins and missing many real moderate ones. Start at 10% and adjust based on your traffic.
- Use α = 0.05 (95% confidence) for most tests. Use 0.01 (99% confidence) for high-stakes irreversible changes: homepage redesign, checkout flow restructure, pricing change. At 80% power the stricter threshold raises the required sample size by roughly 50%, in exchange for a lower false-positive risk. Use 0.10 (90% confidence) only for low-stakes pilot tests where shipping a bad variant doesn't matter.
- A/B testing is a statistical detection problem and signal-vs-noise depends on baseline rate. Low baseline rates (e.g. 2% checkout) need much more sample to detect small lifts because the absolute change is tiny relative to baseline noise. Higher baselines (e.g. 30% click-through) need less sample for the same relative lift. The 1/(effect size)² scaling makes small-effect detection genuinely expensive.
- Peeking = checking A/B test results before the planned sample size is reached, and stopping the test as soon as you see significance. This destroys the false-positive math — true false-positive rate jumps from 5% to 25-30% with daily peeking. Either commit to the planned sample size and don't peek, or use sequential-testing methods (Optimizely's Stats Engine, Bayesian frameworks) that explicitly allow early stopping.
- Run every test for at least 14 days, even if the sample size is reached earlier. 14 days captures weekly seasonality (weekday vs weekend behaviour). Shorter tests can be biased by single-week phenomena (a viral moment, a TV ad, a major sporting event). High-traffic sites often run 14-21 days; low-traffic sites may need 28-56 days even for moderate MDEs.
- A/B/n testing (control + multiple variants) needs Bonferroni correction or similar to control family-wise error rate. With 4 variants compared to control, divide α by 4 (so α/4 = 0.0125 per comparison). At 80% power this raises the per-variant sample size by roughly 40%. Alternative: use a multi-armed bandit approach (Thompson sampling, Bayesian optimisation), which doesn't suffer from the multiple-testing penalty but is harder to operationalise.
- Bayesian methods (used by Lyst, Convoy, VWO Bayesian mode) return "probability variant beats control" instead of a p-value. The interpretation is more intuitive ("85% probability variant is better") but the underlying sample-size needs are similar to frequentist methods. Main advantages: graceful handling of small sample sizes (no awkward "not significant" verdicts), natural fit with sequential / early-stopping decisions. Disadvantage: requires picking a prior (subjective).
- This calculator uses the same underlying two-proportion z-test formula as Evan Miller's calculator, the most widely cited reference, used by Optimizely and many CRO blogs. Our tool produces the same result for the same inputs. We add: explicit "days needed" given your daily traffic, multi-tier alpha/power dropdowns, and a sample size for the most common SaaS / e-com baseline rates.
- When a test finishes without reaching significance, there are three possibilities: (1) the variant has zero or near-zero effect: call it a tie, don't ship; (2) the variant has a real but smaller-than-MDE effect: keep running, or accept that confirming the smaller effect requires a bigger sample; (3) there's a real moderate effect but your power is too low. Check the power your test actually had to detect your planned MDE at the sample size you reached. If it was under 60%, the test was underpowered and "no significance" doesn't mean "no effect"; it means "not enough data." Plan more carefully next time.
- Market size matters a lot. Singapore + Hong Kong tech products often have small audiences (50K-500K MAU) where the standard 50,000-users-per-variant requirement means tests run 4-8 weeks. The pragmatic adaptation: pick larger MDEs (20-30%), test fewer, bigger changes rather than many small ones, and lean Bayesian for early-stopping flexibility. Indonesia + Philippines products typically have much higher traffic (5M-50M MAU), so sample size is rarely the bottleneck. Runtime is, since weekly seasonality still applies.
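The peeking problem described in the answers above is easy to reproduce with a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so every "win" is a false positive) and compares checking once at the end with checking every day and stopping at the first significant result. All parameters (400 runs, 14 days, 100 users per arm per day, a 20% conversion rate, the fixed seed) are arbitrary assumptions chosen to keep the run fast.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variance yet, nothing to test
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a - conv_b) / n / se > z_crit


def simulate_aa_tests(runs=400, days=14, users_per_day=100,
                      rate=0.2, alpha=0.05, seed=1):
    """Return (false-positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        any_look = final_look = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            final_look = z_significant(conv_a, conv_b, n, z_crit)
            any_look = any_look or final_look  # a peeker stops at the first hit
        peeking_hits += any_look
        fixed_hits += final_look
    return peeking_hits / runs, fixed_hits / runs


peeking, fixed = simulate_aa_tests()
print(f"false positives with daily peeking: {peeking:.1%}")
print(f"false positives with one final look: {fixed:.1%}")
```

Because the final look is one of the daily looks, the peeking rate can never be lower than the fixed-horizon rate. In runs like this it typically lands well above the nominal 5%, while the single final look stays close to it.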