```mermaid
flowchart LR
    C[Correlation r] --> R[Regression Y = a + bX]
    R --> R2[R² = explained variance]
    C -. does not imply .- CAU[Causation]
    CAU -.- E[Experiments<br/>IV, DID, RDD]
    style C fill:#E3F2FD,stroke:#1565C0
    style R fill:#FFF3E0,stroke:#EF6C00
    style CAU fill:#FCE4EC,stroke:#AD1457
```
73 Correlation and Regression Analysis
73.1 What is Correlation?
Correlation is a statistical measure of the strength and direction of the linear relationship between two variables. Regression goes a step further: it models the relationship and lets you predict one variable from another.
| Feature | Correlation | Regression |
|---|---|---|
| Purpose | Measure association | Predict / model dependence |
| Symmetry | Symmetric — Cor(X,Y) = Cor(Y,X) | Asymmetric — Y on X ≠ X on Y |
| Direction | None (just measures strength) | Dependent vs independent variable |
| Output | Coefficient r ∈ [−1, +1] | Equation Y = a + bX |
| Causation | No causal claim | Often interpreted causally (with care) |
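The symmetry row is easy to verify numerically. Below is a minimal sketch (with made-up data) showing that the correlation coefficient is the same whichever way round you compute it, while the two regression slopes are not:

```python
# A minimal sketch (hypothetical data) contrasting the symmetry of
# correlation with the asymmetry of regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Correlation is symmetric: Cor(X, Y) == Cor(Y, X)
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
print(r_xy, r_yx)               # identical values

# Regression is not: the slope of Y on X differs from the slope of X on Y
b_yx = np.polyfit(x, y, 1)[0]   # slope of Y = a + bX
b_xy = np.polyfit(y, x, 1)[0]   # slope of X = a' + b'Y
print(b_yx, b_xy)               # different values
```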
73.2 Karl Pearson’s Correlation Coefficient
Karl Pearson’s coefficient of linear correlation (1896) is the most widely used measure:
\[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\]
| Property | What it says |
|---|---|
| Range | −1 ≤ r ≤ +1 |
| Sign | + = positive linear; − = negative linear |
| Magnitude | Closer to ±1 = stronger linear relationship |
| Symmetric | r(X,Y) = r(Y,X) |
| Unit-less | Independent of units |
| Linear only | Captures linear association — not curvilinear |
| Outlier-sensitive | One extreme point can change r dramatically |
| \|r\| | Strength |
|---|---|
| 0.0 to 0.3 | Weak |
| 0.3 to 0.7 | Moderate |
| 0.7 to 1.0 | Strong |
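A quick way to internalise the formula is to compute r directly from the definition and compare it with a library routine. A minimal sketch, using made-up data:

```python
# Compute Pearson's r from the definitional formula and check it
# against scipy.stats.pearsonr (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.2, 4.9, 7.1, 8.8])

dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4))   # the two values agree
```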
73.3 Other Correlation Coefficients
| Coefficient | Use |
|---|---|
| Spearman’s ρ | Ordinal / ranked data; robust to outliers |
| Kendall’s τ | Ordinal data; concordant vs discordant pairs |
| Point biserial | One continuous, one dichotomous |
| Phi (φ) | Two dichotomous variables |
| Tetrachoric | Two dichotomised continuous variables |
Spearman’s formula:
\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\]
where \(d_i\) is the difference in ranks for the ith pair.
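A sketch (made-up, tie-free data) computing ρ from the rank differences and checking it against scipy.stats.spearmanr; note that the shortcut formula above is exact only when there are no tied ranks:

```python
# Spearman's rho from the d_i formula versus scipy.stats.spearmanr.
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([12, 25, 28, 45, 44, 70])

n = len(x)
d = stats.rankdata(x) - stats.rankdata(y)          # rank differences d_i
rho_manual = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

rho_scipy, _ = stats.spearmanr(x, y)
print(round(rho_manual, 4), round(rho_scipy, 4))   # identical when there are no ties
```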
73.4 Coefficient of Determination — R²
The coefficient of determination is the proportion of variance in Y that is explained by X:
\[R^2 = r^2\]
For a regression line, \(R^2\) ∈ [0, 1]. An \(R^2 = 0.65\) means 65 per cent of the variance in Y is explained by the regression. The remaining 35 per cent is the unexplained variance.
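A sketch with made-up numbers confirming the identity for simple regression: fit the line, compute R² as explained over total variance, and compare it with r²:

```python
# For simple linear regression, R-squared equals the square of Pearson's r.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 2.8, 4.5, 4.1, 5.9, 7.0])

r, _ = stats.pearsonr(x, y)

b, a = np.polyfit(x, y, 1)            # slope b and intercept a of Y on X
y_hat = a + b * x
ss_res = ((y - y_hat)**2).sum()       # unexplained (residual) variation
ss_tot = ((y - y.mean())**2).sum()    # total variation in Y
r_squared = 1 - ss_res / ss_tot

print(round(r**2, 4), round(r_squared, 4))   # the two values agree
```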
73.5 Simple Linear Regression
The simple linear regression equation:
\[Y = a + bX + \epsilon\]
where \(a\) is the intercept, \(b\) the slope, and \(\epsilon\) the error term. By the least-squares method:
\[b = \frac{n\sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}\]
\[a = \bar{y} - b\bar{x}\]
| Line | Used for | Slope |
|---|---|---|
| Y on X | Predict Y given X | \(b_{YX}\) |
| X on Y | Predict X given Y | \(b_{XY}\) |
The product of the two slopes equals \(r^2\):
\[b_{YX} \cdot b_{XY} = r^2\]
The two regression lines coincide only when \(r = \pm 1\) (perfect correlation).
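Both the least-squares formulas and the slope identity can be checked in a few lines (made-up data):

```python
# Apply the raw-sum least-squares formulas for a and b, then verify
# that the product of the two regression slopes equals r^2.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.9])
n = len(x)

# Slope and intercept of Y on X from the formulas above
b_yx = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a_yx = y.mean() - b_yx * x.mean()

# Slope of X on Y (swap the roles of the two variables)
b_xy = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (y**2).sum() - y.sum()**2)

r, _ = stats.pearsonr(x, y)
print(round(b_yx * b_xy, 4), round(r**2, 4))   # the identity b_YX * b_XY = r^2
```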
73.6 Multiple Linear Regression
When more than one independent variable is involved:
\[Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \epsilon\]
The coefficients are the partial slopes — the effect of each X on Y, holding the others constant. Adjusted R² corrects for the addition of variables that may not improve fit.
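A sketch of a two-predictor fit with statsmodels (simulated data), reporting the partial slopes, R² and adjusted R²:

```python
# Multiple linear regression with statsmodels on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)   # true model

X = sm.add_constant(np.column_stack([x1, x2]))   # intercept + X1 + X2
model = sm.OLS(y, X).fit()

print(model.params)                              # estimates of a, b1, b2
print(model.rsquared, model.rsquared_adj)        # R^2 and adjusted R^2
```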
73.7 Assumptions of Linear Regression
| Assumption | What it requires | Violation / diagnostic |
|---|---|---|
| Linearity | Linear relationship between X and Y | Curvilinear pattern in residuals |
| Independence of errors | Errors uncorrelated | Autocorrelation (Durbin-Watson) |
| Homoscedasticity | Constant error variance | Heteroscedasticity (Breusch-Pagan, White) |
| Normality of residuals | Errors normally distributed | Use Q-Q plot |
| No multicollinearity | Independent X variables not too correlated | Variance Inflation Factor (VIF > 10) |
| Mean of errors = 0 | E(ε) = 0 | Ensured by including an intercept |
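A sketch (simulated data) running the standard diagnostics named in the table with statsmodels:

```python
# Durbin-Watson, Breusch-Pagan and variance inflation factors for an OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)    # mildly correlated with x1
y = 2.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print("Durbin-Watson:", durbin_watson(fit.resid))        # close to 2 = no autocorrelation
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pval)                 # small p suggests heteroscedasticity
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIF for X1, X2:", vifs)                           # VIF > 10 flags multicollinearity
```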
73.8 Spurious Correlation, Causation and the Limits of r
Correlation does not imply causation. Two variables can be correlated due to:
| Source | Example |
|---|---|
| Common cause (confounder) | Ice-cream sales and drowning deaths both rise in summer |
| Coincidence | A stock-market index and an unrelated time series that happen to trend together |
| Reverse causation | Stress and test scores: poor scores may cause stress rather than the reverse |
The textbook tools for causal inference go beyond regression — randomised experiments, instrumental variables, regression discontinuity, difference-in-differences, propensity-score matching.
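A sketch simulating the confounder case: X and Y (loosely, ice-cream sales and drownings) look strongly correlated, but the association vanishes once the common cause Z (summer temperature) is controlled for:

```python
# Spurious correlation induced by a common cause Z (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)                   # confounder, e.g. temperature
x = 2.0 * z + rng.normal(size=n)         # "ice-cream sales"
y = 1.5 * z + rng.normal(size=n)         # "drowning deaths"

r_raw, _ = stats.pearsonr(x, y)

# Partial out Z: regress each variable on Z and correlate the residuals
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial, _ = stats.pearsonr(x_resid, y_resid)

print(round(r_raw, 3), round(r_partial, 3))   # strong raw r, near zero after controlling for Z
```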
73.9 Practice Questions
1. Pearson's correlation coefficient r ranges between:
2. Spearman's rho is most appropriate when the data is:
3. If a regression of Y on X yields R² = 0.65, this means:
4. In simple regression, the product of the two regression slopes (\(b_{YX} \times b_{XY}\)) equals:
5. Multicollinearity in regression is detected primarily through:
6. A high correlation between X and Y means:
7. Heteroscedasticity is the violation of which assumption of CLRM?
8. The classical product-moment correlation coefficient is associated with:
- Correlation = strength + direction of linear relation. Regression = predicts one variable from another.
- Pearson’s r ∈ [−1, +1]. Symmetric, unit-less, linear-only, outlier-sensitive.
- Other coefficients: Spearman’s ρ (ranks), Kendall’s τ, Point-biserial, Phi, Tetrachoric.
- R² = r² = proportion of variance explained.
- Y = a + bX + ε. Two regression lines: Y on X and X on Y. bYX × bXY = r².
- Multiple regression — partial slopes. Adjusted R² corrects for added variables.
- CLRM assumptions: linearity, independence (Durbin-Watson), homoscedasticity (Breusch-Pagan), normality (Q-Q), no multicollinearity (VIF), zero-mean errors.
- Correlation ≠ causation. Causal-inference tools: experiments, IV, DID, RDD, PSM.