T-Test Calculator

By RECATOOLS Editorial Updated 11 Jun 2026

About this tool

One-sample / Welch / paired

Compute one-sample, independent two-sample (Welch's), or paired t-tests. Outputs t-statistic, degrees of freedom, two-tailed p-value, and significance verdict.

RT-CNV-090 · Converters & Units

How to use the T-Test Calculator

Pick the right test type

Independent two-sample: compares the means of two different groups (e.g., treatment vs control). This is the default, and it uses Welch's formula, which doesn't require equal variances. One-sample: compares your sample mean to a known or hypothesised population mean (quality control, comparison with an industry benchmark). Paired: the same subjects measured twice (pre/post, before/after), with matched observations.

Paste your data

Values separated by commas, spaces, or newlines. The tool ignores any non-numeric content. For paired t-test: ensure equal n in both groups + correct pairing order. For two-sample: groups can have different n.
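The input rule above can be sketched in a few lines of Python. This is only an illustration: the name parse_values and the choice to drop whole tokens that are not valid numbers are assumptions, and the calculator's own parsing may treat mixed tokens differently.

```python
import re

# Optionally signed integer or decimal, with an optional exponent (e.g. 1e-3).
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_values(text: str) -> list[float]:
    """Split on commas and whitespace; keep only tokens that are valid numbers."""
    tokens = re.split(r"[,\s]+", text.strip())
    return [float(tok) for tok in tokens if _NUMBER.fullmatch(tok)]


# Hypothetical pasted input mixing commas, spaces, a newline, and a stray label.
group_a = parse_values("12.5, 11.8 13.1\n12.9, n/a, 12.2")
print(group_a)  # [12.5, 11.8, 13.1, 12.9, 12.2]
```

Because non-numeric tokens are dropped silently, it is worth checking the resulting n against what you expected, especially in paired mode where a dropped value shifts the pairing.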

Set significance threshold (α)

α = 0.05 is the conventional default (5% false-positive rate). α = 0.01 for stricter tests (medical, regulatory). α = 0.10 for exploratory + small-sample work. α = 0.001 for very strict (multiple-comparison-adjusted). The choice affects the verdict, not the underlying math.

Read the verdict

If p < α: reject H₀; the difference is statistically significant. If p ≥ α: fail to reject H₀; the evidence is insufficient. Important: "not significant" ≠ "no difference"; it means your data couldn't demonstrate that a difference exists. With small samples + moderate effects, you'll often fail to reject H₀ even when a real effect exists.

T-tests — the most-used statistical test in business + research

The t-test is the workhorse of statistical significance testing. Developed by William Sealy Gosset (publishing under the pseudonym "Student") at Guinness Brewery in 1908 to compare small samples of beer ingredients, the t-test became a foundation for hypothesis testing in research, business, and quality control. Today, A/B tests, clinical trials, and quality-improvement programmes routinely use some variant of it. The math is simple: compute the difference between sample means (or between a sample mean and a reference value), divide by the standard error (a measure of how much that difference would vary from sample to sample), and compare the result against a t-distribution to get a p-value. This tool implements three common variants: one-sample, independent two-sample (Welch's, the robust modern default), and paired.
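As a rough illustration of those three steps (difference, standard error, t-distribution lookup), here is a minimal Python sketch of a one-sample test using only the standard library. The function names, the Simpson's-rule integration used for the p-value, and the battery-life numbers are all assumptions made for this example; the calculator's own JavaScript is not shown on this page and may work differently.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t: float, df: float, steps: int = 20_000) -> float:
    """P(|T| >= |t|), via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_area)


def one_sample_t(sample: list[float], mu0: float) -> tuple[float, int, float]:
    """Return (t, df, two-tailed p) for H0: population mean == mu0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    t = (mean(sample) - mu0) / standard_error
    df = n - 1
    return t, df, two_tailed_p(t, df)


# Hypothetical battery runtimes in minutes, tested against a benchmark of 200.
runtimes = [212, 225, 218, 231, 209, 222, 227, 216]
t, df, p = one_sample_t(runtimes, mu0=200)
print(f"t = {t:.2f}, df = {df}, p = {p:.5f}")
```

For these made-up runtimes the sample mean is 220, t is about 7.45 on 7 degrees of freedom, and the p-value falls well below 0.001. In practice a library routine such as scipy.stats.ttest_1samp is the safer choice; the numerical integration here only keeps the sketch dependency-free.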

Which t-test variant to use

One-sample t-test: when you have ONE sample + want to compare its mean to a known or hypothesised value. Example: "Our new battery model lasts an average of 220 minutes; is that different from the industry standard of 200 minutes?" This is a one-sample test with μ₀ = 200.

Independent two-sample (Welch's): when you have TWO independent samples + want to compare their means. Example: "Group A (drug) lost 8kg avg; Group B (placebo) lost 3kg avg; is the difference real?" This is a two-sample test. Welch's formula doesn't assume equal variances (more robust than the classical Student's t-test) and is the modern default.

Paired t-test: when the SAME subjects are measured twice + you're analysing the difference. Example: "Patients' blood pressure before + after treatment." This is a paired test. Subjects act as their own controls, which removes between-subject variation and often reduces variance substantially compared to a two-sample test.
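To make the difference between the paired and independent variants concrete, the sketch below computes both statistics on the same data. The function names and the blood-pressure readings are hypothetical, and only the t-statistic and degrees of freedom are computed; the Welch degrees of freedom use the Welch-Satterthwaite approximation.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's two-sample t-statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def paired_t(before: list[float], after: list[float]) -> tuple[float, int]:
    """Paired t-statistic: a one-sample test on the per-subject differences."""
    if len(before) != len(after):
        raise ValueError("paired data needs the same number of observations")
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / math.sqrt(variance(diffs) / n)
    return t, n - 1


# Hypothetical systolic blood pressure for six patients, before and after treatment.
before = [150, 142, 138, 160, 155, 147]
after = [144, 139, 131, 153, 150, 141]

print("paired:", paired_t(before, after))
print("welch: ", welch_t(after, before))
```

On these made-up readings the paired statistic is about -9.2 on 5 degrees of freedom, while treating the same numbers as independent groups gives about -1.2 on roughly 10 degrees of freedom. The mean drop is identical in both cases; the paired test simply removes the large patient-to-patient spread from the denominator.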


The pitfalls of t-test interpretation

Five common misinterpretations of t-test results:

(1) "Not significant" ≠ "no effect": failing to reject H₀ doesn't prove there's no difference. It just means your data couldn't demonstrate one. With small samples + moderate effects, you'll often miss real effects (Type II error).

(2) "Significant" ≠ "important": with very large samples (n > 10,000), even tiny effects become statistically significant. Always look at effect size (e.g., Cohen's d) alongside the p-value. A "highly significant" 0.1% difference might be operationally meaningless.

(3) The p-value isn't the probability of being wrong: p = 0.04 means "if H₀ were true, there's a 4% chance we'd see data this extreme or more." It does NOT mean "there's a 4% chance H₀ is true."

(4) Multiple testing inflates false positives: if you run 20 independent t-tests at α = 0.05 and no real effects exist, you should expect about 1 "significant" result by pure chance. Use Bonferroni correction or false discovery rate (FDR) adjustment when running many tests.

(5) The t-test assumes approximate normality: heavy skew or extreme outliers can invalidate results, especially in small samples. Use Mann-Whitney U (two-sample) or Wilcoxon signed-rank (paired) as non-parametric alternatives for non-normal data.
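Pitfall (4) is easy to quantify. The sketch below, with hypothetical function names and made-up p-values, shows the chance of at least one false positive across independent tests, the Bonferroni threshold, and the Benjamini-Hochberg procedure as one common way to control the FDR. It assumes the tests are independent, which real experiments may not satisfy.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive in m independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject H0 only where p is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def benjamini_hochberg_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # largest rank whose p-value clears its threshold
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


print(round(familywise_error_rate(0.05, 20), 3))  # 0.642

p_values = [0.001, 0.012, 0.025, 0.039, 0.5]  # hypothetical results of five tests
print(bonferroni_reject(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg_reject(p_values))  # [True, True, True, True, False]
```

So 20 tests at α = 0.05 carry roughly a 64% chance of at least one spurious "win". Bonferroni is the stricter of the two corrections; Benjamini-Hochberg gives up some protection in exchange for more power when many tests are run.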

The ASEAN business research context

A/B testing + customer research methodology has matured rapidly across ASEAN markets, especially in larger tech firms (Grab, Shopee, Sea, GoTo, Lazada). Most run thousands of t-tests + chi-square tests across product experiments monthly via internal tools. However, statistical literacy in marketing + product teams varies: many practitioners still cite "we saw 5% lift" without significance testing. The same observed difference can be highly significant (n=10,000) or pure noise (n=100). This calculator gives you proper math without requiring SPSS or R. For A/B testing: always report sample size + effect size alongside the p-value. For business decisions, practical significance (a difference large enough to matter) is what counts; statistical significance is a necessary but not sufficient condition. The major ASEAN platforms now bake significance testing into their experimentation tools (e.g., Sea's internal A/B framework, Grab's experimentation platform).

10 Things to Know About T-Tests

"Student" = William Sealy Gosset, publishing pseudonym at Guinness Brewery in 1908. Real name kept secret per Guinness policy.

Three main variants: one-sample (vs known mean), two-sample (between groups), paired (matched observations).

Welch's t-test (this tool's two-sample default) doesn't assume equal variances, which makes it more robust than the classical Student's test.

α = 0.05 (5% false-positive rate) is the conventional significance threshold. 0.01 for stricter; 0.10 exploratory.

"Not significant" ≠ "no effect." With small samples + moderate effects, you often miss real differences.

"Significant" ≠ "important." With n > 10,000, tiny effects become statistically significant but may be operationally meaningless.

p-value: probability of seeing data this extreme IF H₀ is true. NOT the probability H₀ is true.

Always report effect size (Cohen's d) alongside p-value. Effect size measures practical magnitude; p-value only tests "is it noise?"

Multiple testing inflates false positives. Use Bonferroni correction or FDR when running many tests.

Non-parametric alternatives: Mann-Whitney U (two-sample), Wilcoxon signed-rank (paired) — for heavily skewed data.

Frequently Asked Questions

When should I use a paired t-test instead of a two-sample t-test?

Paired t-test: when each observation in Group A has a meaningful pairing with one in Group B. Examples: same patient before/after treatment; same customer's spending in two different months; twin studies. Paired tests are more powerful (smaller p-values for the same observed effect) because subject-level variation cancels out. Two-sample t-test: when groups are independent, such as different customers in test vs control, separate cohorts, or separate experiments. Using a two-sample test when the data is actually paired wastes statistical power.

What is the difference between Student's and Welch's t-test?

The classical Student's two-sample t-test assumes the two groups have equal variances. Welch's t-test (1947) doesn't: it uses separate variance estimates and adjusts the degrees of freedom (Welch-Satterthwaite). Why Welch's is the modern default: real-world data rarely has identical variances, and assuming equal variances when they differ can inflate Type I error rates, particularly when the group sizes are also unequal. Welch's is virtually as powerful as Student's when variances ARE equal, and significantly more robust when they're not. This tool uses Welch's for the two-sample mode.

How large a sample do I need?

It depends on effect size + desired power. For 80% power at α = 0.05 (two-tailed): large effect (d = 0.8): ~26 per group; medium effect (d = 0.5): ~64 per group; small effect (d = 0.2): ~393 per group. For A/B testing conversion rates, sample sizes are usually MUCH larger (often 5K-50K per group) because effects are small. Use the Sample Size Calculator (RT-CNV-093) for precise pre-test sizing. Common rule of thumb: aim for at least n = 30 per group so that the Central Limit Theorem makes the sample mean approximately normal.
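The per-group figures above can be approximated with the standard normal-approximation formula n ≈ 2 × ((z for 1 − α/2 + z for power) / d)². The sketch below is a simplification under that assumption, with a hypothetical function name; it is not the method used by the Sample Size Calculator, and exact t-based calculations give slightly larger numbers for big effects.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.8, 0.5, 0.2):
    print(d, n_per_group(d))
# 0.8 25
# 0.5 63
# 0.2 393
```

The approximation lands one below the t-based values for d = 0.8 and d = 0.5 (25 vs ~26, 63 vs ~64) and matches for d = 0.2, which is why dedicated power software is preferable when samples are small.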

What if my data isn't normally distributed?

Three options. (1) Use large n: the Central Limit Theorem makes the sample mean approximately normal regardless of the underlying distribution, and n > 30 per group is the usual rule of thumb (heavily skewed data may need more). The t-test generally works well with non-normal data at large n. (2) Transform the data: log transform for right-skewed data (income, prices, time-to-event); square root for count data; reciprocal for extreme right skew. (3) Use non-parametric alternatives: the Mann-Whitney U test (two-sample) or Wilcoxon signed-rank test (paired) don't assume normality. They're slightly less powerful when data IS normal but much more robust to skew + outliers.

Does this tool report one-tailed or two-tailed p-values?

This tool reports two-tailed p-values, the modern default. A two-tailed test looks for any difference (positive or negative). A one-tailed test looks for a difference in one specific direction only (e.g., "treatment improves outcomes"). One-tailed tests are more powerful BUT require pre-registering the direction before seeing data; otherwise you're effectively cheating. Most academic + business contexts use two-tailed as the safe + transparent default. To convert: one-tailed p-value = two-tailed p-value ÷ 2 (when the observed direction matches the hypothesised direction).

How do I report t-test results?

Standard APA format: "t(df) = X.XX, p = .XXX, d = X.XX". Example: "Treatment group (n = 30, M = 12.4, SD = 2.1) scored significantly higher than control (n = 30, M = 9.8, SD = 1.9), t(58) = 5.03, p < .001, d = 1.30." Always include: (1) means + SDs for each group; (2) t-statistic + degrees of freedom; (3) p-value (exact if ≥ .001, "< .001" otherwise); (4) effect size (Cohen's d); (5) what the direction means in real terms. This format is standard in journals that follow APA style + increasingly expected in business research reports.

What is Cohen's d and why does it matter?

d = (M₁ − M₂) / SD_pooled. It measures the standardized difference between two means. Interpretation: d = 0.2 small effect; d = 0.5 medium; d = 0.8 large. d = 1.0 means the two groups differ by 1 standard deviation. Why effect size matters: the p-value tells you whether the effect is distinguishable from noise; d tells you whether it's "big" (worth caring about). A statistically significant p = 0.01 with d = 0.05 is real but tiny, and likely not worth acting on. A non-significant p = 0.15 with d = 0.8 suggests a large effect that didn't reach significance because n was too small. Both p and d matter.
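The formula above translates directly into code. In this sketch the function names are hypothetical, SD_pooled is taken to be the usual degrees-of-freedom-weighted pooled standard deviation, and the "negligible" label for |d| below 0.2 is an assumption added for completeness.

```python
import math
from statistics import mean, variance


def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def effect_label(d: float) -> str:
    """Conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # label assumed for this sketch


# Hypothetical scores for two small groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, effect_label(d))  # 0.5 medium
```

The benchmarks are conventions rather than rules; what counts as a meaningful d depends on the metric and the cost of acting on it.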

Can I use a t-test for A/B test conversion rates?

Possible but suboptimal. The t-test is designed for continuous metrics (e.g., revenue per user, time on page). For binary outcomes (converted yes/no), use a z-test for proportions or a chi-square test of independence (RT-CNV-091). Both are more appropriate. Common A/B test use cases: revenue per user → t-test; conversion rate → chi-square; time-to-purchase → survival analysis or t-test on transformed data. The right test depends on the metric type, not just "it's A/B."
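For the binary-outcome case, a two-proportion z-test can be sketched as follows. The function name and the conversion counts are hypothetical, and the pooled-proportion standard error used here is one common formulation; it relies on the normal approximation, so it suits large samples.

```python
import math
from statistics import NormalDist


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-tailed p) for H0: both groups share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # conversion rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B test: 5.0% vs 5.6% conversion with 10,000 users per arm.
z, p = two_proportion_z(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts z is about -1.89 and p about 0.058, so a 0.6-point lift on 10,000 users per arm narrowly misses significance at α = 0.05.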

Is my data sent to a server?

No. All calculations run in your browser via JavaScript. Open DevTools → Network and confirm zero outbound requests. Data stays on your device. Safe for confidential research + business data.

What other tools pair well with this one?

Pair with: Chi-Square Test (RT-CNV-091) for categorical data; ANOVA (RT-CNV-092) for 3+ group comparisons; Sample Size Calculator (RT-CNV-093) for pre-test sizing; Standard Deviation (RT-CNV-081) for descriptive stats; Confidence Interval (RT-CNV-083) for effect estimation; Linear Regression (RT-CNV-084) for relationship modelling. External: R, Python (scipy.stats), SPSS, JASP (free academic alternative), GraphPad Prism (medical research standard).
