```mermaid
flowchart LR
    C[Correlation r] --> R[Regression Y = a + bX]
    R --> R2[R² = explained variance]
    C -. does not imply .- CAU[Causation]
    CAU -.- E[Experiments<br/>IV, DID, RDD]
    style C fill:#E3F2FD,stroke:#1565C0
    style R fill:#FFF3E0,stroke:#EF6C00
    style CAU fill:#FCE4EC,stroke:#AD1457
```
73 Correlation and Regression Analysis
73.1 What is Correlation?
Correlation is a statistical measure of the strength and direction of the linear relationship between two variables. Regression goes a step further: it models the relationship and lets you predict one variable from another.
| Feature | Correlation | Regression |
|---|---|---|
| Purpose | Measure association | Predict / model dependence |
| Symmetry | Symmetric — Cor(X,Y) = Cor(Y,X) | Asymmetric — Y on X ≠ X on Y |
| Direction | None (just measures strength) | Dependent vs independent variable |
| Output | Coefficient r ∈ [−1, +1] | Equation Y = a + bX |
| Causation | No causal claim | Often interpreted causally (with care) |
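The symmetry row is easy to verify numerically. Below is a minimal sketch (with made-up data) showing that the correlation coefficient is the same whichever way round you compute it, while the two regression slopes are not:

```python
# A minimal sketch (hypothetical data) contrasting the symmetry of
# correlation with the asymmetry of regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Correlation is symmetric: Cor(X, Y) == Cor(Y, X)
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
print(r_xy, r_yx)               # identical values

# Regression is not: the slope of Y on X differs from the slope of X on Y
b_yx = np.polyfit(x, y, 1)[0]   # slope of Y = a + bX
b_xy = np.polyfit(y, x, 1)[0]   # slope of X = a' + b'Y
print(b_yx, b_xy)               # different values
```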
73.2 Karl Pearson’s Correlation Coefficient
Karl Pearson’s coefficient of linear correlation (1896) is the most widely used measure:
\[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\]
| Property | What it says |
|---|---|
| Range | −1 ≤ r ≤ +1 |
| Sign | + = positive linear; − = negative linear |
| Magnitude | Closer to ±1 = stronger linear relationship |
| Symmetric | r(X,Y) = r(Y,X) |
| Unit-less | Independent of units |
| Linear only | Captures linear association — not curvilinear |
| Outlier-sensitive | One extreme point can change r dramatically |
| \|r\| | Strength |
|---|---|
| 0.0 to 0.3 | Weak |
| 0.3 to 0.7 | Moderate |
| 0.7 to 1.0 | Strong |
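A quick way to internalise the formula is to compute r directly from the definition and compare it with a library routine. A minimal sketch, using made-up data:

```python
# Compute Pearson's r from the definitional formula and check it
# against scipy.stats.pearsonr (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.2, 4.9, 7.1, 8.8])

dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4))   # the two values agree
```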
73.3 Other Correlation Coefficients
| Coefficient | Use |
|---|---|
| Spearman’s ρ | Ordinal / ranked data; robust to outliers |
| Kendall’s τ | Ordinal data; concordant vs discordant pairs |
| Point biserial | One continuous, one dichotomous |
| Phi (φ) | Two dichotomous variables |
| Tetrachoric | Two dichotomised continuous variables |
Spearman’s formula:
\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\]
where \(d_i\) is the difference in ranks for the ith pair.
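A sketch (made-up, tie-free data) computing ρ from the rank differences and checking it against scipy.stats.spearmanr; note that the shortcut formula above is exact only when there are no tied ranks:

```python
# Spearman's rho from the d_i formula versus scipy.stats.spearmanr.
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([12, 25, 28, 45, 44, 70])

n = len(x)
d = stats.rankdata(x) - stats.rankdata(y)          # rank differences d_i
rho_manual = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

rho_scipy, _ = stats.spearmanr(x, y)
print(round(rho_manual, 4), round(rho_scipy, 4))   # identical when there are no ties
```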
73.4 Coefficient of Determination — R²
The coefficient of determination is the proportion of variance in Y that is explained by X:
\[R^2 = r^2\]
For a regression line, \(R^2\) ∈ [0, 1]. An \(R^2 = 0.65\) means 65 per cent of the variance in Y is explained by the regression. The remaining 35 per cent is the unexplained variance.
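A sketch with made-up numbers confirming the identity for simple regression: fit the line, compute R² as explained over total variance, and compare it with r²:

```python
# For simple linear regression, R-squared equals the square of Pearson's r.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.8, 4.5, 4.1, 5.9, 7.0])

r, _ = stats.pearsonr(x, y)

b, a = np.polyfit(x, y, 1)            # slope b and intercept a of Y on X
y_hat = a + b * x
ss_res = ((y - y_hat)**2).sum()       # unexplained (residual) variation
ss_tot = ((y - y.mean())**2).sum()    # total variation in Y
r_squared = 1 - ss_res / ss_tot

print(round(r**2, 4), round(r_squared, 4))   # the two values agree
```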
73.5 Simple Linear Regression
The simple linear regression equation:
\[Y = a + bX + \epsilon\]
where \(a\) is the intercept, \(b\) the slope, and \(\epsilon\) the error term. By the least-squares method:
\[b = \frac{n\sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}\]
\[a = \bar{y} - b\bar{x}\]
| Line | Used for | Slope |
|---|---|---|
| Y on X | Predict Y given X | \(b_{YX}\) |
| X on Y | Predict X given Y | \(b_{XY}\) |
The product of the two slopes equals \(r^2\):
\[b_{YX} \cdot b_{XY} = r^2\]
The two regression lines coincide only when \(r = \pm 1\) (perfect correlation).
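Both the least-squares formulas and the slope identity can be checked in a few lines (made-up data):

```python
# Apply the raw-sum least-squares formulas for a and b, then verify
# that the product of the two regression slopes equals r^2.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.9])
n = len(x)

# Slope and intercept of Y on X from the formulas above
b_yx = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a_yx = y.mean() - b_yx * x.mean()

# Slope of X on Y (swap the roles of the two variables)
b_xy = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (y**2).sum() - y.sum()**2)

r, _ = stats.pearsonr(x, y)
print(round(b_yx * b_xy, 4), round(r**2, 4))   # the identity b_YX * b_XY = r^2
```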
73.6 Multiple Linear Regression
When more than one independent variable is involved:
\[Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \epsilon\]
The coefficients are the partial slopes — the effect of each X on Y, holding the others constant. Adjusted R² corrects for the addition of variables that may not improve fit.
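A sketch of a two-predictor fit with statsmodels (simulated data), reporting the partial slopes, R² and adjusted R²:

```python
# Multiple linear regression with statsmodels on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)   # true model

X = sm.add_constant(np.column_stack([x1, x2]))   # intercept + X1 + X2
model = sm.OLS(y, X).fit()

print(model.params)                              # estimates of a, b1, b2
print(model.rsquared, model.rsquared_adj)        # R^2 and adjusted R^2
```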
73.7 Assumptions of Linear Regression
| Assumption | What it requires | Violation / diagnostic |
|---|---|---|
| Linearity | Linear relationship between X and Y | Curvilinear pattern in residuals |
| Independence of errors | Errors uncorrelated | Autocorrelation (Durbin-Watson) |
| Homoscedasticity | Constant error variance | Heteroscedasticity (Breusch-Pagan, White) |
| Normality of residuals | Errors normally distributed | Use Q-Q plot |
| No multicollinearity | Independent X variables not too correlated | Variance Inflation Factor (VIF > 10) |
| Mean of errors = 0 | E(ε) = 0 | Ensured by including an intercept |
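A sketch (simulated data) running the standard diagnostics named in the table with statsmodels:

```python
# Durbin-Watson, Breusch-Pagan and variance inflation factors for an OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)    # mildly correlated with x1
y = 2.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("Durbin-Watson:", durbin_watson(fit.resid))        # close to 2 = no autocorrelation
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pval)                 # small p suggests heteroscedasticity
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIF for X1, X2:", vifs)                           # VIF > 10 flags multicollinearity
```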
73.8 Spurious Correlation, Causation and the Limits of r
Correlation does not imply causation. Two variables can be correlated due to:
| Source | Example |
|---|---|
| Common cause (confounder) | Ice-cream sales and drowning deaths both rise in summer |
| Coincidence | A stock-market index and an unrelated time series that happen to trend together |
| Reverse causation | Stress and test scores: poor scores may cause stress rather than the reverse |
The textbook tools for causal inference go beyond regression — randomised experiments, instrumental variables, regression discontinuity, difference-in-differences, propensity-score matching.
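A sketch simulating the confounder case: X and Y (loosely, ice-cream sales and drownings) look strongly correlated, but the association vanishes once the common cause Z (summer temperature) is controlled for:

```python
# Spurious correlation induced by a common cause Z (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)                   # confounder, e.g. temperature
x = 2.0 * z + rng.normal(size=n)         # "ice-cream sales"
y = 1.5 * z + rng.normal(size=n)         # "drowning deaths"

r_raw, _ = stats.pearsonr(x, y)

# Partial out Z: regress each variable on Z and correlate the residuals
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial, _ = stats.pearsonr(x_resid, y_resid)

print(round(r_raw, 3), round(r_partial, 3))   # strong raw r, near zero after controlling for Z
```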
73.9 Practice Questions
1. Pearson's correlation coefficient r ranges between:
2. Spearman's rho is most appropriate when the data is:
3. If a regression of Y on X yields R² = 0.65, this means:
4. In simple regression, the product of the two regression slopes (\(b_{YX} \times b_{XY}\)) equals:
5. Multicollinearity in regression is detected primarily through:
6. A high correlation between X and Y means:
7. Heteroscedasticity is the violation of which assumption of CLRM?
8. The classical product-moment correlation coefficient is associated with:
- Correlation = strength + direction of linear relation. Regression = predicts one variable from another.
- Pearson’s r ∈ [−1, +1]. Symmetric, unit-less, linear-only, outlier-sensitive.
- Other coefficients: Spearman’s ρ (ranks), Kendall’s τ, Point-biserial, Phi, Tetrachoric.
- R² = r² = proportion of variance explained.
- Y = a + bX + ε. Two regression lines: Y on X and X on Y. bYX × bXY = r².
- Multiple regression — partial slopes. Adjusted R² corrects for added variables.
- CLRM assumptions: linearity, independence (Durbin-Watson), homoscedasticity (Breusch-Pagan), normality (Q-Q), no multicollinearity (VIF), zero-mean errors.
- Correlation ≠ causation. Causal-inference tools: experiments, IV, DID, RDD, PSM.