A/B Test Significance Calculator

By RECATOOLS Editorial · Updated 23 May 2026

About this tool

p-value + 95% CI + ship/no-ship verdict for finished A/B tests.

Enter control + variant conversions and users. Get z-score, p-value, 95% CI of lift, and statistical-significance verdict. Free, no signup.

RT-FIN-138 · Finance & Money

⚠ Disclaimer: Estimates for planning purposes only. Industry benchmarks drift over time and your specific circumstances may differ materially. Verify against your own data and consult an accountant or business adviser for material decisions.

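After your A/B test has finished, enter the results here. The tool computes the z-score, p-value, and 95% confidence interval of the difference, and gives you a ship/don't-ship verdict at your chosen significance level. It uses a two-tailed two-proportion z-test, the classical fixed-sample test for comparing two conversion rates. Commercial platforms such as Optimizely and VWO may report different numbers because they add sequential or Bayesian methods on top, and Google Optimize was retired in 2023.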
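A minimal sketch of the calculation described above, using only Python's standard library. The function and parameter names are illustrative and not taken from the tool, the sample counts are hypothetical, and `statistics.NormalDist` stands in for whatever normal-CDF approximation the tool itself uses.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(users_a, conv_a, users_b, conv_b, alpha=0.05, ci_level=0.95):
    """Two-tailed two-proportion z-test for a finished A/B test.

    Pooled SE is used for the z-score and p-value; non-pooled SE is used
    for the confidence interval of the absolute lift (B minus A).
    """
    if users_a <= 0 or users_b <= 0:
        raise ValueError("each arm needs at least one user")
    if not (0 <= conv_a <= users_a and 0 <= conv_b <= users_b):
        raise ValueError("conversions must be between 0 and users")

    rate_a = conv_a / users_a
    rate_b = conv_b / users_b
    lift = rate_b - rate_a

    # Pooled SE: assumes both arms share one true rate (the null hypothesis).
    pooled = (conv_a + conv_b) / (users_a + users_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = lift / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Non-pooled SE: no equal-rate assumption, used for the interval.
    se_unpooled = sqrt(rate_a * (1 - rate_a) / users_a + rate_b * (1 - rate_b) / users_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - ci_level) / 2)
    ci = (lift - z_crit * se_unpooled, lift + z_crit * se_unpooled)

    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "absolute_lift": lift,
        "relative_lift": lift / rate_a if rate_a > 0 else float("nan"),
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "significant": p_value < alpha,
    }


# Hypothetical finished test: 500/10,000 in control, 580/10,000 in variant.
result = two_proportion_z_test(10_000, 500, 10_000, 580)
print(result)
```

Keeping `alpha` (the decision threshold) separate from `ci_level` (the interval width) mirrors the page, which always reports a 95% interval while letting you choose α.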

Control (A)

Users in control (users)

Conversions in control (events)

Variant (B)

Users in variant (users)

Conversions in variant (events)

Significance threshold (α), two-tailed

📅 Research current as of 23 May 2026 · Sources: Two-proportion z-test (pooled SE for hypothesis, non-pooled SE for CI). Two-tailed p-value via standard normal CDF (Abramowitz–Stegun approximation).


Results shown: p-value (two-tailed), z, 95% CI of lift, control rate, variant rate, absolute lift, relative lift.

How to Use the Significance Calculator

Enter unique users + conversions per variant

Users = unique visitors exposed to the variant. Conversions = number of users who completed the goal action. Both must be from the same time window and segmentation.

Pick significance level α

0.05 (95% confidence) is the standard choice. Use a stricter level such as 0.01 for high-stakes tests. Don't pick α after seeing results; that's p-hacking.

Read the verdict bar

Green = variant won at your chosen threshold. Yellow = not significant; either need more data or there's no effect. Red = variant lost; don't ship.

Sanity-check with the confidence interval

The 95% CI of the absolute lift gives you the range of plausible true lifts. If the CI crosses zero, the lift could realistically be zero or negative. A wide CI means more data would help.
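The verdict and sanity-check steps above can be sketched as two small functions. This is an illustration under assumptions: the function names are hypothetical, and whether the tool treats p exactly equal to α as significant is not stated on the page.

```python
def verdict(p_value, absolute_lift, alpha=0.05):
    """Map a two-tailed p-value and the sign of the lift to a verdict colour."""
    if p_value >= alpha:
        # Not significant: either more data is needed or there is no effect.
        return "yellow"
    # Significant: the sign of the lift decides between a win and a loss.
    return "green" if absolute_lift > 0 else "red"


def ci_crosses_zero(ci_low, ci_high):
    """True when zero is a plausible value for the true lift."""
    return ci_low <= 0 <= ci_high


print(verdict(0.012, 0.008), ci_crosses_zero(0.0017, 0.0143))
print(verdict(0.200, 0.008), ci_crosses_zero(-0.004, 0.020))
```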

Interpreting A/B Test Results — Beyond the p-Value

What the p-Value Actually Means

The p-value is the probability of observing your test result (or a more extreme one) assuming the variant has zero effect. A p-value of 0.03 means: if the variant truly does nothing, we'd see this much (or more) difference in our test data only 3% of the time. Below the chosen α (typically 0.05), we reject the null hypothesis of zero effect and call the result "statistically significant." The p-value does NOT tell you the probability the variant actually beats control — that interpretation is wrong but extremely common, even in published academic work.
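One way to make this definition concrete is to simulate many experiments in which the variant truly does nothing and count how often the gap is at least as large as the one observed. The sketch below does that for a hypothetical test (the counts, seed, and function names are assumptions) and compares the simulated share with the z-test formula. The two will not match exactly, because the z-test is a normal approximation to discrete counts.

```python
import random
from math import sqrt
from statistics import NormalDist


def formula_p_value(n, conv_a, conv_b):
    """Two-tailed pooled z-test p-value for two equal-sized arms of n users."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_p_value(n, conv_a, conv_b, sims=3000, seed=7):
    """Share of no-effect experiments whose gap is at least as large as observed."""
    rng = random.Random(seed)
    pooled = (conv_a + conv_b) / (2 * n)  # shared rate under the null
    observed_gap = abs(conv_b - conv_a)
    extreme = 0
    for _ in range(sims):
        a = sum(rng.random() < pooled for _ in range(n))
        b = sum(rng.random() < pooled for _ in range(n))
        if abs(b - a) >= observed_gap:
            extreme += 1
    return extreme / sims


# Hypothetical finished test: 50/500 (10%) in control vs 70/500 (14%) in variant.
p_formula = formula_p_value(500, 50, 70)
p_sim = simulated_p_value(500, 50, 70)
print(f"formula: {p_formula:.3f}  simulated: {p_sim:.3f}")
```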

What you really care about is "is the variant better?" Bayesian methods answer that directly ("85% probability variant > control"). The frequentist p-value is the closest off-the-shelf proxy, but always interpret it alongside (1) effect size: a 0.1% relative lift with p = 0.04 is statistically significant but usually practically meaningless; (2) confidence interval: if it only barely excludes zero, the true effect could be tiny; (3) sample size: small-N tests are noisy regardless of p-value.
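For comparison, here is a hedged sketch of how a Bayesian "probability variant beats control" figure can be estimated. It assumes independent uniform Beta(1, 1) priors on each arm's conversion rate and uses Monte Carlo draws from the resulting Beta posteriors. The prior, the function name, and the sample counts are assumptions for illustration; commercial tools may use different priors and methods.

```python
import random


def prob_variant_beats_control(users_a, conv_a, users_b, conv_b, draws=20_000, seed=11):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + users_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + users_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical test: 100/1,000 in control vs 130/1,000 in variant.
prob = prob_variant_beats_control(1_000, 100, 1_000, 130)
print(f"P(variant > control) is about {prob:.3f}")
```

Unlike the p-value, this number is a direct statement about the variant, but it depends on the chosen prior.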

The Confidence Interval Is The More Useful Number

A 95% confidence interval of [+0.1 pp, +2.1 pp] for absolute lift tells you the plausible range of true lift much more clearly than a p-value. It carries effect size, uncertainty, and statistical significance in a single interval: if the 95% CI doesn't include zero, the result is significant at α = 0.05 (borderline cases can differ slightly, because this tool uses pooled SE for the test and non-pooled SE for the interval). Modern Bayesian-style reporting often shows credible intervals instead of p-values for this reason. Even if you don't switch to Bayesian, training yourself to look at the CI first and the p-value second produces better A/B decisions.

Wide CI = more data would help. Narrow CI not crossing zero = confident win or loss. Narrow CI crossing zero = confident no-effect result. Wide CI crossing zero = test is inconclusive and underpowered. The CI also lets you decide whether the lift is large enough to matter operationally: a CI of [+0.01 pp, +0.5 pp] is significant, but the lift may be too small to justify the engineering work to ship.
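The reading rules above can be sketched as a small classifier. The page does not define "wide" or "narrow", so this sketch assumes a hypothetical `min_lift` (the smallest absolute lift worth shipping, in percentage points) and calls an interval narrow when its half-width is below that value. Treat the thresholds and labels as illustrative.

```python
def read_interval(ci_low, ci_high, min_lift):
    """Classify a CI for absolute lift (all values in percentage points)."""
    crosses_zero = ci_low <= 0 <= ci_high
    narrow = (ci_high - ci_low) / 2 < min_lift  # assumed definition of "narrow"

    if ci_low > 0 and ci_high < min_lift:
        return "significant but too small to matter"
    if crosses_zero:
        return "confident no-effect" if narrow else "inconclusive, underpowered"
    direction = "win" if ci_low > 0 else "loss"
    if narrow:
        return f"confident {direction}"
    return f"{direction}, but more data would help"


print(read_interval(0.1, 2.1, min_lift=0.5))
print(read_interval(0.01, 0.5, min_lift=1.0))
print(read_interval(-0.2, 0.2, min_lift=0.5))
print(read_interval(-2.0, 3.0, min_lift=0.5))
```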

"A p-value of 0.04 does not mean the variant has a 96% chance of being better. It means: if the variant did nothing, you'd see this much improvement in your data only 4% of the time. The interpretations look similar — they are not the same."

Common Failure Modes

Peeking: checking results daily and stopping the test the moment p drops below 0.05. With frequent looks this can inflate the real false-positive rate from 5% to 25-30%. Either commit to the planned sample size (computed via the sister sample-size calculator, RT-FIN-137) or use sequential-testing methods.

Multiple comparisons: running 10 separate metrics through significance testing at α = 0.05 gives roughly a 40% chance (1 − 0.95^10, assuming independent metrics) that at least one comes up significant by pure chance. Apply a Bonferroni correction (use α / k for k tests) or pre-register your primary metric.

Cherry-picked segments: a test that lost overall but won in the "iOS / California / Tuesday" segment is almost certainly cherry-picked. Always validate segment wins on hold-out data.
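The multiple-comparisons arithmetic can be checked with a short sketch. It assumes the k tests are independent and that every null hypothesis is true, which is the textbook simplification; the function names are illustrative.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent true-null tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / k


k = 10
uncorrected = family_wise_error_rate(0.05, k)
corrected = family_wise_error_rate(bonferroni_alpha(0.05, k), k)
print(f"uncorrected: {uncorrected:.3f}  with Bonferroni: {corrected:.3f}")
```

Bonferroni is conservative when metrics are correlated, which is one reason pre-registering a single primary metric is often the simpler option.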

The Discipline of Pre-Registration

Pre-registration is the practice of writing down what you'll test, which metric is primary, what sample size you'll run, and what decision rule you'll apply BEFORE the test launches. It's the single highest-impact change a CRO program can make for its replication rate. Pre-registered tests have a ~60% replication rate vs ~20-30% for tests analysed post hoc. A simple template works: "We hypothesise variant X will lift primary metric Y by Z%. We will run N users per arm. At p < α, we ship; otherwise we don't." Stick to this even when results look exciting in unexpected metrics; those are hypotheses for the next test, not conclusions from this one.

10 Facts About A/B Test Statistics

p-value = probability of the observed result or a more extreme one, assuming the variant has zero effect. NOT "probability variant is better."

The 0.05 threshold traces to R.A. Fisher (1925); not a fundamental statistical truth, just convention.

Two-proportion z-test is the standard for binary conversion metrics in commercial A/B tools.

Confidence interval is usually more useful than p-value: shows effect size + uncertainty + significance.

Statistical significance ≠ practical significance. A tiny lift can be significant in large samples.

Peeking can inflate the Type I error rate from 5% to 25-30% when results are checked repeatedly. Either commit to N or use sequential testing.

Bonferroni correction: divide α by the number of tests when running multiple comparisons.

Run tests for at least 14 days regardless of sample size (a common rule of thumb), so that two full weekly cycles are captured.

Bayesian A/B returns "probability variant beats control" — more intuitive interpretation; similar sample needs.

Replication crisis: 50-70% of self-reported A/B wins fail to replicate, per Microsoft + Airbnb experimentation teams.
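The peeking fact in the list above can be illustrated with a simulation of A/A tests, where both arms share the same true rate and every "significant" result is a false positive. The number of looks, batch size, baseline rate, and seed below are assumptions chosen to keep the run short, and the exact inflation depends on how often you look.

```python
import random
from math import sqrt

Z_CRIT = 1.959964  # two-tailed critical value for alpha = 0.05


def z_score(n, conv_a, conv_b):
    """Pooled two-proportion z-score for two equal-sized arms of n users."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def false_positive_rates(looks=20, batch=100, rate=0.10, sims=1000, seed=3):
    """Return (rate if you stop at the first significant look, rate at the final look only)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        crossed = False
        significant = False
        for look in range(1, looks + 1):
            # Each look adds one batch of users per arm at the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            significant = abs(z_score(look * batch, conv_a, conv_b)) > Z_CRIT
            crossed = crossed or significant
        peeking_hits += crossed      # would have stopped early at some look
        fixed_hits += significant    # result at the planned final sample size
    return peeking_hits / sims, fixed_hits / sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.2f}  fixed sample: {fixed_rate:.2f}")
```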

Frequently Asked Questions

A result with p < 0.05 means that, if the variant had zero true effect, the probability of seeing the observed difference (or a larger one) is less than 5%. It does NOT mean "95% chance the variant is better"; that is a Bayesian-style statement that a frequentist p-value cannot support. Combine the p-value with the confidence interval and effect size for a complete picture.
A p-value above 0.05 is not significant at α = 0.05. Options: (1) collect more data through a pre-planned extension or a new test, since a real effect will push p below 0.05 with more sample, but extending the test only until p dips under the threshold is a form of peeking; (2) accept "not significant" as a verdict: the variant doesn't have a strong enough effect at your chosen threshold; (3) reframe: if the test was high-stakes, you wouldn't have used α = 0.05 anyway. Avoid the temptation to relax α post hoc; that's p-hacking and destroys the math.
Pooled SE assumes both variants have the same true rate (i.e. tests the null hypothesis of zero effect) and is used for the p-value calculation. Non-pooled SE doesn't make that assumption and is used for the confidence interval of the actual difference. The two SEs are usually similar; this tool uses pooled for the hypothesis test and non-pooled for the CI — standard practice.
The CI shows the plausible range of true lift in human-readable form. "Lift is between +0.5 pp and +2.0 pp" is much more useful than "p = 0.03". The CI also tells you whether significance is fragile (CI barely excludes zero) or robust (CI clearly on one side). Modern experimentation platforms increasingly lead with CI; the p-value is becoming a secondary number.
Use a two-tailed test almost always in commercial A/B testing. A one-tailed test assumes you only care whether the variant wins, not whether it LOSES. That's rarely true in practice: a losing variant is important to detect because you don't want to ship it. This tool uses two-tailed tests throughout, which is also the common convention in commercial A/B reporting.
A wide confidence interval means your test doesn't have enough power to estimate the true effect precisely. Either the sample size is too small for the baseline rate, or the true effect is too small to detect cleanly. Use the sample-size calculator (RT-FIN-137) to compute how many more users you'd need. If the wide CI clearly excludes zero, the variant probably wins; if it crosses zero, you can't tell.
The tool asks for raw counts because the rate alone (e.g. "3.5%") doesn't determine statistical significance; sample size matters too. 3.5% from 100 users is much noisier than 3.5% from 100,000 users. The two-proportion z-test uses both raw counts to weight the noise correctly. Always pull the raw numerator + denominator from analytics rather than the pre-computed rate.
For straightforward two-proportion z-tests with no sequential testing, the results should match those of commercial tools, because the underlying math is the same. Optimizely's Stats Engine adds sequential-testing capabilities (early stopping without inflating Type I error), which this tool doesn't replicate. For ad-hoc post-test analysis ("did this test we already ran win?"), this calculator is fine. For real-time experimentation with peeking, use Optimizely / VWO / Statsig with their built-in sequential math.
This tool handles binary metrics only (converted yes/no). For continuous metrics like revenue per user, time-on-site, or basket size, you need a two-sample t-test or Mann-Whitney U-test instead. Most commercial platforms (Optimizely, VWO, Statsig) handle both. For a quick sanity check on continuous metrics, an unpaired t-test in Python (scipy.stats.ttest_ind) or R is the standard tool.
US clients increasingly expect Bayesian or sequential testing for high-frequency product experimentation, especially in mature CRO programs at Meta, Airbnb, Stripe, Shopify. Classical p-value frameworks are still fine for SMB and most product teams, but read the room: if the client uses Stats Engine / Statsig / Eppo, they are likely to expect sequential or Bayesian framing in your write-ups. Calibrate communication: stakeholders in the US tech world increasingly read "95% probability of beating control" as natural; the same audience finds "p = 0.04" technical and unintuitive.
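Returning to the earlier answer on raw counts versus rates: the sketch below shows how the same 3.5% conversion rate carries very different uncertainty at 100 and 100,000 users. It uses the normal-approximation margin of error for a single proportion; the function name is illustrative.

```python
from math import sqrt


def margin_of_error(rate, users, z_crit=1.96):
    """Approximate 95% margin of error for one observed conversion rate."""
    return z_crit * sqrt(rate * (1 - rate) / users)


for users in (100, 100_000):
    moe_pp = margin_of_error(0.035, users) * 100  # convert to percentage points
    print(f"{users:>7} users: 3.5% plus or minus {moe_pp:.2f} pp")
```

At very small samples the normal approximation itself becomes unreliable, which is a further reason to work from raw counts.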
