73 Correlation and Regression Analysis

73.1 What is Correlation?

Correlation is a statistical measure of the strength and direction of the linear relationship between two variables. Regression goes a step further: it models the relationship and lets you predict one variable from another.

Tip: Correlation vs Regression

| Feature   | Correlation                    | Regression                             |
|-----------|--------------------------------|----------------------------------------|
| Purpose   | Measure association            | Predict / model dependence             |
| Symmetry  | Symmetric: Cor(X,Y) = Cor(Y,X) | Asymmetric: Y on X ≠ X on Y            |
| Direction | None (just measures strength)  | Dependent vs independent variable      |
| Output    | Coefficient r ∈ [−1, +1]       | Equation Y = a + bX                    |
| Causation | No causal claim                | Often interpreted causally (with care) |

73.2 Karl Pearson’s Correlation Coefficient

Karl Pearson's product-moment correlation coefficient (1896) is the most widely used measure of linear correlation:

\[r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}\]

Tip: Properties of Pearson's r

| Property          | What it says                                 |
|-------------------|----------------------------------------------|
| Range             | −1 ≤ r ≤ +1                                  |
| Sign              | + = positive linear; − = negative linear     |
| Magnitude         | Closer to ±1 = stronger linear relationship  |
| Symmetric         | r(X,Y) = r(Y,X)                              |
| Unit-less         | Independent of units                         |
| Linear only       | Captures linear association, not curvilinear |
| Outlier-sensitive | One extreme point can change r dramatically  |

Tip: Common Interpretation Bands

| \|r\|      | Strength |
|------------|----------|
| 0.0 to 0.3 | Weak     |
| 0.3 to 0.7 | Moderate |
| 0.7 to 1.0 | Strong   |
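
As a quick numerical check, here is a minimal sketch of the formula in Python (assuming numpy is available; the data values are invented for illustration), compared against numpy's built-in coefficient:

```python
import numpy as np

# Invented illustration data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r from the definition: centred cross-products over
# the square root of the product of centred sums of squares
dx, dy = x - x.mean(), y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(r)                        # about 0.999 for this near-linear data
print(np.corrcoef(x, y)[0, 1])  # should agree with the manual value
```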

73.3 Other Correlation Coefficients

Tip: Other Correlation Measures

| Coefficient    | Use                                          |
|----------------|----------------------------------------------|
| Spearman's ρ   | Ordinal / ranked data; robust to outliers    |
| Kendall's τ    | Ordinal data; concordant vs discordant pairs |
| Point-biserial | One continuous, one dichotomous              |
| Phi (φ)        | Two dichotomous variables                    |
| Tetrachoric    | Two dichotomised continuous variables        |

Spearman’s formula:

\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\]

where \(d_i\) is the difference in ranks for the \(i\)th pair and \(n\) is the number of pairs.
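
A small sketch of the formula in Python (scipy assumed for the cross-check; the judges' scores are invented), converting raw scores to ranks and applying the \(d_i^2\) formula:

```python
import numpy as np
from scipy.stats import spearmanr

# Invented scores given by two judges to the same six items
judge_a = np.array([86, 79, 92, 65, 70, 88])
judge_b = np.array([80, 82, 95, 60, 72, 78])

# Convert scores to ranks (1 = lowest; the direction cancels out)
rank_a = judge_a.argsort().argsort() + 1
rank_b = judge_b.argsort().argsort() + 1

# Spearman's formula (valid when there are no tied ranks)
d = rank_a - rank_b
n = len(d)
rho = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))

print(rho)                                      # about 0.771
print(spearmanr(judge_a, judge_b).correlation)  # should match
```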

73.4 Coefficient of Determination — R²

The coefficient of determination is the proportion of variance in Y explained by X. For simple linear regression it is the square of Pearson's r:

\[R^2 = r^2\]

For a fitted regression line, \(R^2 \in [0, 1]\). An \(R^2 = 0.65\) means 65 per cent of the variance in Y is explained by the regression; the remaining 35 per cent is unexplained variance.
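
A short check in Python (numpy only; toy data) that \(R^2\) computed from the fitted line's residuals equals the square of Pearson's r:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.7, 12.1])

# R² from the residuals of a least-squares line
b, a = np.polyfit(x, y, 1)             # slope, intercept
ss_res = ((y - (a + b * x))**2).sum()  # unexplained variation
ss_tot = ((y - y.mean())**2).sum()     # total variation
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(r_squared, r**2)  # the two values agree
```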

73.5 Simple Linear Regression

The simple linear regression equation:

\[Y = a + bX + \epsilon\]

where \(a\) is the intercept, \(b\) the slope, and \(\epsilon\) the error term. By the least-squares method:

\[b = \frac{n\sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}\]

\[a = \bar{y} - b\bar{x}\]
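
The summation formulas translate directly into code. A minimal sketch (numpy assumed; data invented), checked against numpy's least-squares fit:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 6.8, 9.1, 10.9])
n = len(x)

# Least-squares slope and intercept from the summation formulas
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a = y.mean() - b * x.mean()

print(b, a)
print(np.polyfit(x, y, 1))  # returns [slope, intercept]; should match
```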

Tip: Two Regression Lines

| Line   | Used for          | Slope      |
|--------|-------------------|------------|
| Y on X | Predict Y given X | \(b_{YX}\) |
| X on Y | Predict X given Y | \(b_{XY}\) |

The product of the two slopes equals \(r^2\):

\[b_{YX} \cdot b_{XY} = r^2\]

The two regression lines coincide only when \(r = \pm 1\) (perfect correlation).
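
The identity is easy to verify numerically (same toy-data style, numpy assumed):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 6.8, 9.1, 10.9])

b_yx = np.polyfit(x, y, 1)[0]  # slope of the Y-on-X line
b_xy = np.polyfit(y, x, 1)[0]  # slope of the X-on-Y line
r = np.corrcoef(x, y)[0, 1]

print(b_yx * b_xy, r**2)  # identical up to rounding
```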

73.6 Multiple Linear Regression

When more than one independent variable is involved:

\[Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \epsilon\]

The coefficients are the partial slopes — the effect of each X on Y, holding the others constant. Adjusted R² corrects for the addition of variables that may not improve fit.
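
A compact sketch of a two-predictor fit using plain numpy least squares, with adjusted R² computed from its usual definition (all data invented for illustration):

```python
import numpy as np

# Invented data: Y driven by two predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y  = np.array([3.2, 3.9, 7.1, 7.8, 11.2, 11.8, 15.1, 15.6])

# Design matrix with an intercept column; lstsq returns a, b1, b2
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef  # b1, b2 are the partial slopes

ss_res = ((y - X @ coef)**2).sum()
ss_tot = ((y - y.mean())**2).sum()
n, k = len(y), 2  # observations, predictors
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(b1, b2, r2, adj_r2)
```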

73.7 Assumptions of Linear Regression

Tip: Six Assumptions of Classical Linear Regression (CLRM)

| Assumption             | What it requires                      | Violation / diagnostic                    |
|------------------------|---------------------------------------|-------------------------------------------|
| Linearity              | Linear relationship between X and Y   | Curvilinear pattern in residuals          |
| Independence of errors | Errors uncorrelated                   | Autocorrelation (Durbin-Watson test)      |
| Homoscedasticity       | Constant error variance               | Heteroscedasticity (Breusch-Pagan, White) |
| Normality of residuals | Errors normally distributed           | Check with a Q-Q plot                     |
| No multicollinearity   | X variables not too highly correlated | Variance Inflation Factor (VIF > 10)      |
| Mean of errors = 0     | E(ε) = 0                              | Ensured by including an intercept         |
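
Two of the tabled diagnostics are simple enough to compute by hand. A sketch (numpy only; the data are simulated and the helper functions are illustrative, not from any library):

```python
import numpy as np

def durbin_watson(resid):
    # DW near 2 suggests no first-order autocorrelation;
    # values below 2 suggest positive, above 2 negative
    return (np.diff(resid)**2).sum() / (resid**2).sum()

def vif(X, j):
    # VIF of column j: regress X[:, j] on the other columns, then 1/(1 - R²)
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - (resid**2).sum() / ((X[:, j] - X[:, j].mean())**2).sum()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
print(durbin_watson(rng.normal(size=50)))  # independent errors: about 2

x1 = rng.normal(size=50)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=50)  # deliberately collinear
print(vif(np.column_stack([x1, x2]), 0))        # well above 1: collinearity
```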

73.8 Spurious Correlation, Causation and the Limits of r

Correlation does not imply causation. Two variables can be correlated due to:

Tip: Three Sources of Spurious Correlation

| Source                    | Example                                                 |
|---------------------------|---------------------------------------------------------|
| Common cause (confounder) | Ice-cream sales and drowning deaths both rise in summer |
| Coincidence               | The stock market and any unrelated time series          |
| Reverse causation         | Test scores and stress (each may drive the other)       |

The textbook tools for causal inference go beyond regression — randomised experiments, instrumental variables, regression discontinuity, difference-in-differences, propensity-score matching.

```mermaid
flowchart LR
  C[Correlation r] --> R[Regression Y = a + bX]
  R --> R2[R² = explained variance]
  C -. does not imply .- CAU[Causation]
  CAU -.- E[Experiments<br/>IV, DID, RDD]
  style C fill:#E3F2FD,stroke:#1565C0
  style R fill:#FFF3E0,stroke:#EF6C00
  style CAU fill:#FCE4EC,stroke:#AD1457
```

73.9 Practice Questions

Q 01 (r Range, Easy)

Pearson's correlation coefficient r ranges between:

  • A. 0 and 1
  • B. −1 and +1
  • C. 0 and 100
  • D. −∞ and +∞

Correct Option: B
r ∈ [−1, +1]. Sign indicates direction; magnitude the strength.
Q 02 (Spearman, Medium)

Spearman's rho is most appropriate when the data is:

  • A. Continuous and normally distributed
  • B. Ordinal / ranked or has outliers
  • C. Binary
  • D. Categorical

Correct Option: B
Spearman's ρ uses ranks — ideal for ordinal data and robust to outliers.
Q 03 (R², Medium)

If a regression of Y on X yields R² = 0.65, this means:

  • A. 65% of variance in Y is explained by X
  • B. 65% of Y values are correct
  • C. 65% probability of a Type I error
  • D. 65% confidence level

Correct Option: A
R² = proportion of variance in Y explained by the regression model.
Q 04 (Slope Product, Medium)

In simple regression, the product of the two regression slopes (bYX × bXY) equals:

  • A. r
  • B. r²
  • C. 1 − r
  • D. 2r

Correct Option: B
bYX × bXY = r². The two regression lines coincide only when r = ±1.
Q 05 (Multicollinearity, Medium)

Multicollinearity in regression is detected primarily through:

  • A. Durbin-Watson statistic
  • B. Variance Inflation Factor (VIF)
  • C. Q-Q plot
  • D. F-test

Correct Option: B
VIF > 10 typically signals multicollinearity. Durbin-Watson is for autocorrelation; Q-Q plot for normality of residuals.
Q 06 (Causation, Easy)

A high correlation between X and Y means:

  • A. X causes Y
  • B. X and Y are linearly related; causation requires further evidence
  • C. Y causes X
  • D. No relationship

Correct Option: B
Correlation does not imply causation. Possible explanations: common cause, coincidence, reverse causation.
Q 07 (Heteroscedasticity, Medium)

Heteroscedasticity is the violation of which assumption of CLRM?

  • A. Linearity
  • B. Constant error variance (homoscedasticity)
  • C. Normality of residuals
  • D. Independence of errors

Correct Option: B
Heteroscedasticity = error variance varies — violates the homoscedasticity assumption. Tests: Breusch-Pagan, White.
Q 08 (Pearson, Easy)

The classical product-moment correlation coefficient is associated with:

  • A. R.A. Fisher
  • B. Karl Pearson
  • C. Charles Spearman
  • D. William Gosset

Correct Option: B
Karl Pearson's product-moment correlation coefficient (1896). Spearman developed the rank counterpart.
Important: Quick recall
  • Correlation = strength + direction of linear relation. Regression = predicts one variable from another.
  • Pearson’s r ∈ [−1, +1]. Symmetric, unit-less, linear-only, outlier-sensitive.
  • Other coefficients: Spearman’s ρ (ranks), Kendall’s τ, Point-biserial, Phi, Tetrachoric.
  • R² = r² = proportion of variance explained.
  • Y = a + bX + ε. Two regression lines: Y on X and X on Y. bYX × bXY = r².
  • Multiple regression — partial slopes. Adjusted R² corrects for added variables.
  • CLRM assumptions: linearity, independence (Durbin-Watson), homoscedasticity (Breusch-Pagan), normality (Q-Q), no multicollinearity (VIF), zero-mean errors.
  • Correlation ≠ causation. Causal-inference tools: experiments, IV, DID, RDD, PSM.