Linear Regression & Correlation Calculator


Paste X,Y pairs to get the linear regression equation (slope and intercept), the Pearson correlation coefficient r, R² (variance explained), the standard error, and a scatter plot with the fitted line.



How to use the Linear Regression Calculator

Paste X,Y data pairs

One pair per line, separated by comma, tab, semicolon, or space. The parser is flexible: "1, 2.5", "1 2.5", and "1; 2.5" all work, and so does pasting two columns from Excel or Google Sheets (one row per line, tab-separated). Non-numeric lines are silently skipped. You need at least 2 pairs with different x values for a valid regression; 10+ for a reliable correlation; 30+ for statistical inference.
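The parsing rules above could be sketched in Python roughly as follows. This is an illustration only, not the tool's actual JavaScript parser; the helper name parse_pairs and the sample text are assumptions made for the example.

```python
import math
import re

# Assumed separator rule: comma, semicolon, tab, or space (one or more).
_SEPARATORS = re.compile(r"[,\t; ]+")

def parse_pairs(text):
    """Return a list of (x, y) floats, silently skipping lines that do not parse."""
    pairs = []
    for line in text.splitlines():
        fields = [f for f in _SEPARATORS.split(line.strip()) if f]
        if len(fields) != 2:
            continue  # blank line or wrong number of columns
        try:
            x, y = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # non-numeric content such as a header row
        if math.isfinite(x) and math.isfinite(y):
            pairs.append((x, y))
    return pairs

# Hypothetical pasted input mixing all four separators plus two junk lines.
sample = "x, y\n1, 2.5\n2 4.1\n3; 6.2\n4\t7.9\nnot a number\n"
pairs = parse_pairs(sample)
print(pairs)
```

One consequence of treating the comma as a separator is that decimal commas ("1,5") would be read as two fields, so this sketch assumes decimal points.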

Read the regression equation

The output equation y = mx + b describes the best-fit line through your data. m (slope): how much y changes per unit increase in x. Positive slope = positive relationship; negative = inverse. b (intercept): the predicted y when x = 0. The equation lets you predict y for any new x value by plugging it in. Always inspect the scatter plot to verify the linear fit is reasonable — non-linear relationships need non-linear models.
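As a minimal sketch of where m and b come from, the closed-form least-squares formulas can be written in a few lines of Python. The function name fit_line and the data are invented for illustration; the points were chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(pairs):
    """Return (slope, intercept) of the least-squares line y = m*x + b."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least 2 pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points lying exactly on y = 2x + 1.
data = [(1, 3), (2, 5), (3, 7), (4, 9)]
m, b = fit_line(data)
print(f"y = {m:.2f}x + {b:.2f}")       # y = 2.00x + 1.00
print("prediction at x = 10:", m * 10 + b)
```

Predictions for x values far outside the range of the pasted data rest entirely on the assumption that the straight-line pattern continues there.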

Interpret Pearson r and R²

Pearson r: correlation coefficient from -1 to +1. +1 = perfect positive linear relationship, -1 = perfect negative, 0 = no linear relationship. As a rule of thumb, |r| > 0.7 is "strong"; 0.4-0.7 is "moderate"; 0.2-0.4 is "weak". R²: the square of r, representing the fraction of variance in y explained by x. R² = 0.81 means 81% of y's variability is explained by x; the remaining 19% is unexplained scatter. R² is bounded [0, 1]; higher = better fit.
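A short Python sketch of the calculation, using the rule-of-thumb bands quoted above. The data is made up, and the "negligible" label for |r| below 0.2 is an assumption added here, since the text does not name that band.

```python
import math

def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when x or y is constant")
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Map |r| to the rule-of-thumb labels from the text."""
    size = abs(r)
    if size > 0.7:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "negligible"  # assumed label; not named in the text

# Invented illustration data.
data = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
r = pearson_r(data)
print(f"r = {r:.2f} ({strength(r)}), R² = {r * r:.2f}")  # r = 0.80 (strong), R² = 0.64
```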

Visualise the scatter plot

The chart on the right shows your data points (purple dots) and the fitted regression line (red). Look for: outliers (points far from the line) that might be dominating the fit; non-linear patterns (curves, parabolas) where a straight line is the wrong model; heteroscedasticity (scatter widening as x increases) that violates regression assumptions. If the plot shows obvious issues, the regression numbers might be misleading — consider transforming the data (log, square root) or fitting a different model.
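To illustrate the log transform mentioned above, here is a hedged Python sketch that fits a power law y = a·x^b by running ordinary linear regression on log(x) and log(y). The data is invented to follow y = 3·x² exactly, and the helper fit_line is defined only for this example.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for paired lists xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data following y = 3 * x^2 exactly (a curve, not a line).
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Logs require strictly positive values on both axes.
log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]

# On the log-log scale the slope is the exponent b and the intercept is log(a).
exponent, log_scale = fit_line(log_xs, log_ys)
scale = math.exp(log_scale)
print(f"y is approximately {scale:.3f} * x^{exponent:.3f}")
```

With noisy real data the recovered exponent and scale would only be estimates, and the fit minimises error on the log scale rather than on the original scale.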


Linear regression — the simplest predictive model that still works

Linear regression is one of the simplest and most widely used predictive models in statistics. The basic idea: given pairs of (x, y) data, find the straight line that best fits through them. "Best fit" is defined mathematically as minimising the sum of squared residuals, the vertical distances between each data point and the line. This "least squares" formulation was first published by Legendre in 1805; Carl Friedrich Gauss, who published his own account in 1809, claimed to have used it since 1795, sparking a famous priority dispute. It produces a unique optimal line (as long as the x values are not all identical) with closed-form formulas for slope and intercept. Despite being over 200 years old, it remains the workhorse of forecasting, A/B testing analysis, scientific research, and machine learning baselines. When someone says "let me run a quick regression," they almost always mean simple linear regression.
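For readers who want to see where the closed-form formulas come from, the derivation can be sketched as follows. The symbols S, m, b, x̄ and ȳ follow the notation used elsewhere on this page; the index i running over the n data points is the only addition.

```latex
% Sum of squared residuals for a candidate line y = m x + b
S(m, b) = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2

% Setting both partial derivatives to zero
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0
\quad \Longrightarrow \quad b = \bar{y} - m \bar{x}

\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0

% Substituting b = \bar{y} - m \bar{x} into the second condition
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

The denominator is zero only when every x value is the same, which is the one case where no unique line exists.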

The relationship between r and R²

Pearson's correlation coefficient r measures the strength and direction of a linear relationship on a scale from -1 to +1. In simple (single-predictor) regression, R² (the coefficient of determination) equals r²: the same information squared. The interpretations differ: r is about direction and strength (the sign tells you up vs down, the magnitude tells you fit); R² is about variance explained (it answers "what fraction of y's variability is predictable from x?"). Examples: r = 0.9 means strong positive correlation; R² = 0.81 means x explains 81% of y's variance. r = -0.7 sits at the boundary between moderate and strong negative correlation; R² = 0.49 means x explains 49%. r = 0.3 means weak positive correlation; R² = 0.09 means only 9% explained, which is often too little to be worth modeling. The threshold for "useful" depends on field. As rough rules of thumb: physics often wants R² > 0.95; social sciences accept R² > 0.30; financial prediction settles for R² > 0.10.
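The identity R² = r² for a single predictor can be checked numerically. The sketch below computes R² from its variance-explained definition, 1 − SS_res / SS_tot, and compares it with the square of Pearson r. The data and variable names are invented for the example.

```python
import math

# Made-up illustration data.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares line and Pearson r.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)

# R² from its definition: 1 - (unexplained variation / total variation).
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = syy
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.3f}, r*r = {r * r:.3f}, R² = {r_squared:.3f}")
```

Both routes give 0.64 here; with more than one predictor only the second definition carries over.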

Linear regression is 200+ years old, embarrassingly simple, and still the right tool 80% of the time. The other 20% needs more, but the linear baseline tells you what "more" needs to beat.

The "correlation does not imply causation" warning

The single most-violated rule in applied statistics: high r and R² values do NOT prove that x causes y. Two variables can be correlated because: (1) x truly causes y (the simplest case), (2) y causes x (reverse causation), (3) a third variable Z causes both x and y (confounding), or (4) the correlation is purely coincidental (spurious correlation). The famous example: ice cream sales correlate with drowning deaths. Cause? Neither — both are caused by hot weather (confounding). Tylervigen.com catalogues hundreds of strong correlations with no plausible causation (US cheese consumption vs people who died tangled in bedsheets, r ≈ 0.95). Real causal inference requires: randomised controlled experiments, instrumental variables, regression discontinuity designs, or causal-graphical reasoning (Judea Pearl's framework). Linear regression alone never proves causation — only describes association.
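Confounding can be illustrated with a small simulation. In this sketch temperature drives both ice cream sales and drownings, and neither affects the other. Every number is invented (the coefficients, noise levels, sample size and seed are all assumptions), so it demonstrates the mechanism rather than any real dataset.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def residuals(xs, ys):
    """What is left of ys after subtracting the least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

rng = random.Random(42)  # fixed seed so the run is repeatable
temperature = [rng.uniform(15, 35) for _ in range(2000)]
# Both outcomes depend on temperature only, plus independent noise.
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 1.5) for t in temperature]

r_raw = pearson_r(ice_cream, drownings)
# Partial correlation: correlate the parts NOT explained by temperature.
r_partial = pearson_r(residuals(temperature, ice_cream),
                      residuals(temperature, drownings))

print(f"raw correlation:             {r_raw:.2f}")
print(f"after removing temperature:  {r_partial:.2f}")
```

The raw correlation comes out strong while the partial correlation sits near zero, which is the signature of a confounder. This check only works when the confounder has been measured.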

The ASEAN data-analysis angle

Linear regression is a workhorse of quantitative work across ASEAN. Common applications: property pricing (size + location → price models for PropertyGuru, 99.co, iProperty, Lamudi, Carousell Property). Demand forecasting (historical sales → predicted volume for Lazada, Shopee, Tokopedia, Grab, usually augmented with seasonality + trend). Credit scoring (income + history → default probability for Singapore banks, Indonesian fintechs, Malaysian Islamic banks). Marketing attribution (ad spend across channels → revenue lift for Grab Ads, Shopee Ads, GoTo Group). Healthcare (biomarker → diagnosis correlations in clinical research at NUS / NUHS / MGH). The math in this tool covers single-predictor cases; multiple regression (multiple x's predicting y) requires matrix algebra and is typically run in R / Python / Stata / SPSS. Python with scikit-learn is a common choice for production regression; this tool is for quick exploratory analysis and education.

10 Things to Know About Linear Regression

01

Linear regression finds the line that minimises the sum of squared residuals — vertical distances from data points to the line, squared and summed.

02

Legendre first published the least-squares method in 1805; Carl Friedrich Gauss claimed to have used it since 1795, sparking a famous priority dispute.

03

Pearson r ranges from -1 (perfect negative) to +1 (perfect positive). Zero means no linear relationship — but a non-linear one might still exist.

04

R² = r² when there's only one predictor. R² represents the fraction of y's variance explained by x. Bounded [0, 1]; higher = better fit.

05

Correlation does NOT imply causation. The most-violated rule in applied statistics — strong correlations can come from confounding, reverse causation, or coincidence.

06

Anscombe's Quartet (1973) shows four datasets with nearly identical means, variances, correlations, and regression lines, but completely different visual patterns. Always plot your data.

07

The "best fit" line always passes through the mean point (x̄, ȳ). The intercept b = ȳ − m × x̄ guarantees this property.

08

Heteroscedasticity (when the variance of residuals isn't constant across x values) violates a regression assumption and makes the usual standard errors unreliable. Diagnose with a residual plot.

09

Multiple regression with k predictors uses matrix algebra (β = (X'X)⁻¹X'y): the same idea generalised. This tool handles the single-predictor case only.

10

In machine learning, linear regression is the simplest baseline model. Any production model should beat it; if it doesn't, the problem might be unsolvable or the data inadequate.
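Item 06 above can be verified directly. The sketch below runs the same least-squares formulas on the four Anscombe datasets, with values as commonly reproduced from Anscombe's 1973 paper; the helper name summarise is invented for the example.

```python
import math

def summarise(xs, ys):
    """Return (slope, intercept, r) for paired lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarise(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept, r) in results.items():
    print(f"{name:>3}: y = {intercept:.2f} + {slope:.2f}x, r = {r:.3f}")
```

All four rows print essentially the same line and correlation, which is why the numbers alone cannot distinguish a clean linear trend from a curve or a single influential point.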

Frequently Asked Questions

  • Least squares is the method for finding the "best" line through data points. For each data point, compute the residual: the vertical distance from the point to the candidate line. Square each residual (to ensure positive values + emphasise larger errors). Sum the squared residuals. The "best" line is the one that minimises this sum. The math gives unique closed-form solutions for slope and intercept: slope = sum((x-mean_x)(y-mean_y)) / sum((x-mean_x)²); intercept = mean_y - slope × mean_x. Squaring residuals (rather than taking absolute values) makes the math tractable and penalises large errors more heavily, which is also why the fit is sensitive to outliers.

  • r is the Pearson correlation coefficient — measures strength + direction of linear relationship on a -1 to +1 scale. R² is r squared — measures variance explained on a 0 to 1 scale. r = 0.9 means strong positive correlation; the same dataset has R² = 0.81 meaning x explains 81% of y's variability. r tells you "how related"; R² tells you "how much explained". In multiple regression with more than one predictor, R² remains meaningful but the relationship to r doesn't hold directly (R² no longer equals r²).

  • What counts as a "good" R² depends entirely on field + use case. Physics / engineering: R² > 0.95 is the norm, since phenomena are usually highly deterministic. Chemistry / biology: R² > 0.85 for cleanly-controlled experiments. Social sciences / psychology: R² > 0.30 is publishable, because human behaviour has high inherent variability. Finance / economics: R² > 0.10 is meaningful, because markets are extremely noisy. Machine learning baselines: R² > 0.50 for tabular data is solid. These are rough rules of thumb; the "good enough" threshold is whatever your domain accepts. Beware of high R² in time-series data: shared trends and auto-correlation can inflate it artificially.

  • No, correlation does not imply causation. This is the single most-violated rule in applied statistics. Strong correlations between X and Y can arise from: (1) X causes Y. (2) Y causes X (reverse causation). (3) Some third variable Z causes both X and Y (confounding). (4) Pure coincidence (spurious correlation; see tylervigen.com for hundreds of examples). Demonstrating causation requires: randomised controlled experiments, natural experiments + instrumental variables, regression discontinuity designs, or formal causal-graphical reasoning (Pearl). Linear regression alone never proves causation. State results as "X is associated with Y" not "X causes Y" unless you have a properly designed causal study.

  • Linear regression assumes the relationship is straight-line. If it's actually curved (quadratic, exponential, logarithmic), linear regression will underfit. Visually inspect the scatter plot; if it shows a curve, transform the data first. Power relationships (y = ax^b): log-log transform both axes, then run linear regression, and the slope is b. Exponential growth (y = ae^(bx)): log transform y only, and the slope is b. Saturation curves: use logistic or asymptotic models. Polynomial relationships: add x² and x³ as additional predictors (multiple regression). For complex non-linear patterns, use spline regression or non-parametric methods (GAMs, decision trees, neural nets).

  • Anscombe's Quartet is a set of four datasets constructed by statistician Francis Anscombe (1973) to demonstrate why you should always plot your data. All four have nearly identical means, variances, correlations (r = 0.816), regression lines (y = 3 + 0.5x), and R² (0.67). But the visual patterns are completely different: one is a normal-looking linear relationship; one is perfectly linear except for one outlier; one shows a curve; one has all x values constant except one outlier. The summary statistics hide the structure; only the scatter plots reveal it. Anscombe's lesson: descriptive statistics ≠ understanding. Always plot your data before drawing conclusions.

  • You need a minimum of 2 data points (for any line). For meaningful correlation: 10+. For stable correlation estimates: 30+. For statistical inference (significance testing, confidence intervals): 30+ is the rule-of-thumb. For high-confidence inference: 100+. Correlations from very small samples are notoriously unstable: an r = 0.7 from n = 5 might become r = 0.2 with n = 50. Don't draw strong conclusions from regressions with n < 30. For ML/forecasting use cases, more data is essentially always better, though diminishing returns kick in above ~1000 for stable estimates.

  • Least-squares regression is very sensitive to outliers because residuals are squared — one extreme point dominates the fit. Diagnose by visually inspecting the scatter plot + the residual plot. Mitigation options: (1) Investigate outliers — are they real or data entry errors? Fix obvious errors. (2) Use robust regression (M-estimators, Theil-Sen, RANSAC) that down-weights outliers automatically. (3) Transform the data to reduce skew (log-transform often helps). (4) Cap/winsorize extreme values. The simplest approach is also the most honest: report results both WITH and WITHOUT the outliers and discuss the difference.

  • No, your data is not uploaded anywhere. All calculations + the scatter plot rendering run entirely in your browser via JavaScript + Canvas. There's no server roundtrip: open DevTools → Network and confirm zero outbound requests. Your data stays on your device. Safe for proprietary research data, A/B test correlations, clinical trial endpoints, or any sensitive analysis that shouldn't leave your machine.

  • This tool handles single-predictor regression only. Multiple regression with k > 1 predictors requires matrix algebra: β = (X'X)⁻¹X'y, where X is the design matrix and β is the vector of coefficients. The interpretation generalises: each coefficient represents the effect of that predictor holding others constant. Tools for multiple regression: R (lm function — gold standard, free), Python (statsmodels, scikit-learn LinearRegression), Excel (Data Analysis ToolPak — limited), SAS / SPSS / Stata (commercial). For quick multiple regression with a few predictors, Excel's LINEST function works for <15 variables. For production work, Python or R is essential.
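As a rough illustration of the matrix formula in the last answer, the sketch below builds X'X and X'y in plain Python and solves the resulting linear system by Gaussian elimination, which gives the same β as (X'X)⁻¹X'y without forming the inverse. The function names and data are invented for the example; for real work the libraries named above are the better choice, since they generally rely on more numerically stable methods.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_multiple(rows, ys):
    """Return [intercept, b1, ..., bk] for predictor rows and responses ys."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_multiple(rows, ys)
print([round(coef, 6) for coef in beta])
```

Because the data was generated without noise, the recovered coefficients match the generating values of 1, 2 and 3 up to floating-point error.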
