74 Correlation and Regression Analysis

74.1 Correlation — Concept

Correlation = the statistical measure of the strength and direction of the linear relationship between two variables. Sir Francis Galton (1888) introduced “co-relation”; Karl Pearson (1896) gave the formal coefficient.

74.2 Pearson Correlation Coefficient (r)

\[r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}}\]

Tip: Pearson r interpretation
  • −1 ≤ r ≤ +1.
  • r = +1 → perfect positive linear.
  • r = −1 → perfect negative linear.
  • r = 0 → no linear relationship.
  • Rule of thumb (cut-offs vary by field): |r| > 0.7 strong · 0.4–0.7 moderate · 0.1–0.4 weak.
  • r² (Coefficient of Determination) = % of variance in Y explained by X.
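
As a rough illustration of the formula in 74.2, the deviation-score computation can be sketched in plain Python. The function name `pearson_r` and the data (`hours`, `score`) are made up for this example and do not come from any library or study.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the deviation-score formula in 74.2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)
print(round(r, 4))      # 0.7746 (strong positive by the rule of thumb)
print(round(r * r, 4))  # 0.6, i.e. about 60% of the variance in score
```

Here Σ(X−X̄)(Y−Ȳ) = 6, Σ(X−X̄)² = 10 and Σ(Y−Ȳ)² = 6, so r = 6/√60 ≈ 0.775 and r² = 0.6.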

74.3 Other Correlation Measures

Tip: Correlation types
  • Spearman’s Rank Correlation (ρ) — Charles Spearman (1904); ordinal data.
  • Kendall’s Tau (τ).
  • Point-Biserial — one continuous + one dichotomous.
  • Phi (Φ) / Cramer’s V — two categorical.
  • Partial Correlation — controlling for other variables.
  • Multiple Correlation (R) — Y with multiple Xs.
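
One way to see how Spearman's ρ differs from Pearson r: ρ is Pearson r computed on the ranks of the data, with tied values sharing their average rank. The sketch below assumes that definition; the helper names (`average_ranks`, `spearman_rho`) and the data are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho as Pearson r on the ranked data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical monotonic but non-linear data
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 100]

print(round(spearman_rho(x, y), 4))  # 1.0: the ordering agrees perfectly
print(round(pearson_r(x, y), 4))     # below 1: the relationship is not linear
```

Because ρ only uses ordering, it suits ordinal data and is less affected by the outlying value 100 than Pearson r.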

74.4 Correlation vs Causation

Tip: Correlation does not imply causation
  • Spurious correlation — both variables driven by a third (lurking) variable.
  • Reverse causation — direction unclear.
  • Confounding variables.
  • Hill’s Criteria for Causation (Bradford Hill 1965) — strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy.
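
A small simulation can make the lurking-variable idea concrete. In the sketch below, X and Y are each generated from a common variable Z plus independent noise, so neither causes the other. The setup (standard normal Z and noise, seed 42, n = 5000) is an assumption chosen purely for illustration; the partial correlation uses the standard first-order formula.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of X and Y after controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]   # lurking variable
x = [zi + rng.gauss(0, 1) for zi in z]    # X depends on Z only
y = [zi + rng.gauss(0, 1) for zi in z]    # Y depends on Z only

r_xy = pearson_r(x, y)
r_xy_given_z = partial_r(r_xy, pearson_r(x, z), pearson_r(y, z))

print(round(r_xy, 2))          # clearly positive (0.5 in theory for this setup)
print(round(r_xy_given_z, 2))  # close to zero once Z is controlled for
```

The raw correlation looks meaningful, yet it vanishes when Z is held fixed, which is the pattern a spurious correlation produces.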

74.5 Regression Analysis

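Regression = modelling the relationship between a dependent variable Y and one or more independent variables X. Galton (1886) introduced the idea of “regression toward the mean” (his own phrase was “regression towards mediocrity”). Karl Pearson formalised the method. The Gauss-Markov theorem (named after Gauss and A.A. Markov) gives the theoretical foundation: under the classical assumptions, OLS is the best linear unbiased estimator (BLUE).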

74.6 Simple Linear Regression

\[Y = \alpha + \beta X + \epsilon\]

Tip: OLS estimators
  • Slope (β) = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)² = r × (σ_Y / σ_X).
  • Intercept (α) = Ȳ − β·X̄.
  • R² = % variance explained.
  • SEE (Standard Error of Estimate).
  • Method: Ordinary Least Squares (OLS) — Gauss-Legendre.
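
The estimators above can be checked by hand on a tiny data set. The following is a minimal sketch, assuming the usual n − 2 denominator for the Standard Error of Estimate in simple regression; the function name `fit_simple_ols` and the data are hypothetical.

```python
import math

def fit_simple_ols(xs, ys):
    """Return (alpha, beta, r2, see) for Y = alpha + beta*X + error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)          # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)     # total variation
    r2 = 1 - sse / sst
    see = math.sqrt(sse / (n - 2))   # two parameters estimated
    return alpha, beta, r2, see

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

alpha, beta, r2, see = fit_simple_ols(hours, score)
print(round(alpha, 4), round(beta, 4), round(r2, 4), round(see, 4))
# 2.2 0.6 0.6 0.8944
```

With Σ(X−X̄)(Y−Ȳ) = 6 and Σ(X−X̄)² = 10 the slope is 0.6, and the intercept is 4 − 0.6 × 3 = 2.2. The same slope follows from r × (σ_Y / σ_X) = 0.7746 × √(6/10) ≈ 0.6.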

74.7 Multiple Regression

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k + \epsilon\]

Tip: Multiple-regression concepts
  • Adjusted R² — penalises for additional variables.
  • F-test for overall significance.
  • t-test for individual coefficients.
  • Multicollinearity: predictors are correlated with each other; a VIF (Variance Inflation Factor) above 10 is a common warning threshold.
  • Heteroskedasticity — non-constant variance.
  • Autocorrelation: Durbin-Watson statistic (close to 2 = no autocorrelation).
  • Normality of residuals.
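
Three of these diagnostics reduce to short formulas once a model has been fitted. The sketch below assumes the textbook definitions: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), VIF_j = 1/(1 − R_j²), where R_j² comes from regressing X_j on the other predictors, and Durbin-Watson = Σ(e_t − e_{t−1})² / Σe_t². The input numbers are invented for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for n observations and k predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def vif(r2_j):
    """VIF for predictor j, given R² from regressing X_j on the other predictors."""
    return 1 / (1 - r2_j)

def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

print(round(adjusted_r2(0.64, n=30, k=3), 4))  # 0.5985, lower than the raw 0.64
print(round(vif(0.9), 2))                      # 10.0, the usual warning threshold
print(durbin_watson([1, -1, 1, -1]))           # 3.0, alternating signs
```

The Durbin-Watson statistic lies between 0 and 4: values well below 2 point to positive autocorrelation, values well above 2 (as in the alternating example) to negative autocorrelation.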

74.8 Assumptions of OLS — LINE

Tip: OLS assumptions (LINE)
  • L: Linearity of relationship.
  • I: Independence of errors.
  • N: Normality of errors.
  • E: Equal variance (Homoskedasticity).
  • Plus: No multicollinearity, exogeneity.

74.9 Variants of Regression

Tip: Regression variants
  • Linear — straight line.
  • Polynomial — Y = β₀ + β₁X + β₂X² + …
  • Logistic — binary Y; Sigmoid function.
  • Probit — alternative to logit.
  • Multinomial Logistic — multi-class Y.
  • Ordinal Logistic — ranked Y.
  • Poisson — count Y.
  • Negative Binomial — overdispersed counts.
  • Cox Proportional Hazards — survival.
  • Tobit — censored data.
  • Ridge / Lasso / ElasticNet — regularised.
  • Quantile Regression.
  • Hierarchical / Multilevel.
  • GAM (Generalised Additive Models).
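
To give a feel for what "regularised" means, the single-predictor case has closed forms. The sketch below assumes centred X and Y, no penalty on the intercept, and the objectives written in the comments; other texts scale the penalty differently, so the λ values are not comparable across conventions. All names and data are illustrative.

```python
import math

def centred_sums(xs, ys):
    """Return (Sxy, Sxx) computed on mean-centred data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy, sxx

def ridge_slope(sxy, sxx, lam):
    # Minimises 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2  (L2 penalty)
    return sxy / (sxx + lam)

def lasso_slope(sxy, sxx, lam):
    # Minimises 0.5 * sum((y - b*x)^2) + lam * |b|  (L1 penalty, soft-thresholding)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx

sxy, sxx = centred_sums([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # Sxy = 6, Sxx = 10

print(ridge_slope(sxy, sxx, 0))  # 0.6, identical to OLS when lam = 0
print(ridge_slope(sxy, sxx, 5))  # 0.4, shrunk toward zero but never exactly zero
print(lasso_slope(sxy, sxx, 2))  # 0.4
print(lasso_slope(sxy, sxx, 6))  # 0.0, the coefficient is dropped entirely
```

The contrast in the last two lines is the practical difference: Ridge shrinks coefficients smoothly, while Lasso can set them to exactly zero and so performs variable selection.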

74.10 Logistic Regression

\[\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X\]

Used when Y is binary (Yes/No, Default/Not, Buy/Not). Output is probability.
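
The log-odds equation can be inverted to obtain the probability: p = 1 / (1 + e^−(β₀ + β₁X)), the sigmoid function. A minimal sketch follows; the coefficients β₀ = −4 and β₁ = 0.8 are hypothetical values picked for round numbers, not estimates from any data set.

```python
import math

def sigmoid(z):
    """Map log-odds z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Map a probability to log-odds; the inverse of sigmoid."""
    return math.log(p / (1 - p))

# Hypothetical fitted coefficients
beta0, beta1 = -4.0, 0.8

def predict_proba(x):
    return sigmoid(beta0 + beta1 * x)

print(predict_proba(5))           # 0.5, because the log-odds are zero at X = 5
print(round(predict_proba(8), 3)) # higher X pushes the probability toward 1
print(round(math.exp(beta1), 3))  # 2.226, the odds ratio per unit increase in X
```

A one-unit rise in X adds β₁ to the log-odds, so it multiplies the odds by e^β₁; the effect on the probability itself depends on where on the curve the observation sits.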

74.11 Time-Series Regression and Forecasting

Tip: Time-series methods
  • AR (Autoregressive).
  • MA (Moving Average).
  • ARMA / ARIMA — Box-Jenkins (1970).
  • SARIMA — seasonal.
  • VAR (Vector Autoregression).
  • GARCH — volatility modelling.
  • Holt-Winters exponential smoothing.
  • Prophet (Facebook 2017).
  • LSTM / Transformers — deep learning.
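
The simplest of these, AR(1), is an ordinary regression of the series on its own previous value: y_t = c + φ·y_{t−1} + ε_t. The sketch below estimates c and φ by OLS on lagged pairs. The series is a noise-free toy example generated from c = 2, φ = 0.5, so the fit recovers those values; real series contain noise and call for a dedicated time-series library.

```python
def fit_ar1(series):
    """Estimate (c, phi) in y_t = c + phi * y_{t-1} + e_t by OLS on lagged pairs."""
    lagged = series[:-1]    # y_{t-1}
    current = series[1:]    # y_t
    n = len(lagged)
    mean_x, mean_y = sum(lagged) / n, sum(current) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, current))
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi

# Hypothetical noise-free series generated from y_t = 2 + 0.5 * y_{t-1}
series = [10.0]
for _ in range(7):
    series.append(2 + 0.5 * series[-1])

c, phi = fit_ar1(series)
forecast = c + phi * series[-1]   # one-step-ahead forecast

print(round(c, 4), round(phi, 4))  # 2.0 0.5
print(round(forecast, 4))          # 4.0234
```

With |φ| < 1 the series is stationary and forecasts decay toward the long-run mean c / (1 − φ), which is 4 in this example.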

74.12 Modern Machine Learning

Regression extends into ML:

Tip: ML regression approaches
  • Decision Trees / Random Forests / Gradient Boosting (XGBoost, LightGBM, CatBoost).
  • Neural Networks.
  • SVR (Support Vector Regression).
  • k-NN Regression.
  • Bayesian Regression.
  • Gaussian Processes.
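
Unlike OLS, most of these methods do not assume a fixed functional form. k-NN regression is the easiest to write out: predict Y as the average of the k training points nearest to the query. This is a bare-bones sketch for one predictor with an unweighted mean; the function name and data are invented, and a real project would normally use a library implementation.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict Y at `query` as the mean Y of the k nearest training points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training pairs by distance to the query and keep the k closest
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in neighbours) / k

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print(round(knn_predict(hours, score, 2.1, k=3), 4))  # 3.6667 (mean of 4, 5, 2)
print(knn_predict(hours, score, 2.1, k=1))            # 4.0 (nearest point is X = 2)
```

Small k follows the data closely and can overfit; large k smooths the prediction toward the overall mean.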

74.13 Indian Applications

Tip: Indian applications
  • Credit scoring — CIBIL, Experian.
  • Demand forecasting — Reliance, HUL.
  • Election prediction — CSDS, C-Voter.
  • Macro forecasting — RBI, NCAER.
  • Insurance pricing — LIC, GIC.
  • Healthcare — AIIMS, ICMR studies.

74.14 Practice Questions

Q 01 · Pearson · Easy

Pearson r ranges from:

  • A. 0 to 1
  • B. −1 to +1
  • C. −2 to +2
  • D. 0 to 100

Correct Option: B
−1 ≤ r ≤ +1.

Q 02 · Galton · Medium

"Regression toward the mean" was coined by:

  • A. Francis Galton
  • B. Karl Pearson
  • C. Fisher
  • D. Gauss

Correct Option: A
Francis Galton (1886).

Q 03 · Spearman · Medium

Spearman's correlation (1904) is suited for:

  • A. Interval data
  • B. Ratio data
  • C. Ordinal / ranked data
  • D. Nominal data

Correct Option: C
Rank-based.

Q 04 · Medium

r² represents:

  • A. % of variance explained
  • B. Slope
  • C. Mean
  • D. SD

Correct Option: A
Coefficient of Determination.

Q 05 · OLS · Medium

OLS estimates coefficients by minimising:

  • A. Sum of squared residuals
  • B. Sum of absolute residuals
  • C. Maximum likelihood
  • D. Correlation

Correct Option: A
Least squares.

Q 06 · VIF · Hard

A VIF > 10 indicates:

  • A. Heteroskedasticity
  • B. Multicollinearity
  • C. Autocorrelation
  • D. Normality issue

Correct Option: B
Variance Inflation Factor > 10 → severe multicollinearity.

Q 07 · Durbin-Watson · Hard

Durbin-Watson statistic tests for:

  • A. Multicollinearity
  • B. Autocorrelation
  • C. Normality
  • D. Heteroskedasticity

Correct Option: B
Value close to 2 → no autocorrelation.

Q 08 · Logistic · Medium

Logistic regression is used when Y is:

  • A. Continuous
  • B. Binary
  • C. Count
  • D. Time-to-event

Correct Option: B
Binary classification.

Q 09 · Cox · Hard

Cox Proportional Hazards regression is for:

  • A. Survival / time-to-event
  • B. Counts
  • C. Binary
  • D. Continuous

Correct Option: A
Survival analysis.

Q 10 · Bradford Hill · Hard

Hill's Criteria for causation (1965) were given by:

  • A. Bradford Hill
  • B. Galton
  • C. Pearson
  • D. Spearman

Correct Option: A
Sir Austin Bradford Hill (1965).

Q 11 · Ridge / Lasso · Hard

Ridge and Lasso are:

  • A. Regularised regressions
  • B. Time-series models
  • C. Categorical tests
  • D. Sampling methods

Correct Option: A
L2 (Ridge) and L1 (Lasso) penalty.

Q 12 · LINE · Medium

OLS assumption "E" stands for:

  • A. Equal variance
  • B. Endogeneity
  • C. Error term
  • D. Exogeneity

Correct Option: A
L=Linearity · I=Independence · N=Normality · E=Equal variance (Homoskedasticity).

Q 13 · Adjusted R² · Medium

Adjusted R² differs from R² by:

  • A. Penalising for additional predictors
  • B. Multiplying by sample size
  • C. Adding intercept
  • D. Same as R²

Correct Option: A
Penalises for extra predictors.

Q 14 · Poisson · Hard

Poisson regression is used when Y is:

  • A. Counts
  • B. Binary
  • C. Ordinal
  • D. Continuous

Correct Option: A
Count data.

Q 15 · Match · Hard

Match:

  (i) Correlation coefficient    (a) Galton
  (ii) Regression                (b) Spearman
  (iii) Rank correlation         (c) Pearson
  (iv) Bootstrap                 (d) Efron

  • A. (i)-(c), (ii)-(a), (iii)-(b), (iv)-(d)
  • B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
  • C. (i)-(b), (ii)-(c), (iii)-(d), (iv)-(a)
  • D. (i)-(d), (ii)-(c), (iii)-(a), (iv)-(b)

Correct Option: A
Correlation — Pearson; Regression — Galton; Rank — Spearman; Bootstrap — Efron.

74.14.1 Advanced Format Questions

AR 1 · Assertion-Reason · Hard

A: r = 0 means no relationship.
R: Pearson r captures only linear relationships.

  • A. Both true; R explains A
  • B. Both true; R does not explain A
  • C. A true, R false
  • D. A false, R true

Correct Option: D
A is false; r = 0 means no linear relationship, but non-linear may exist.

S 1 · Statement-based · Medium

Which of the following are OLS assumptions under LINE? (i) Linearity. (ii) Independence. (iii) Normality. (iv) Equal variance.

  • A. All four
  • B. (i) and (ii) only
  • C. (iii) and (iv) only
  • D. (iv) only

Correct Option: A

N 1 · Numerical · Medium

If r = 0.8, what % of variance in Y is explained by X?

  • A. 64 %
  • B. 80 %
  • C. 40 %
  • D. 100 %

Correct Option: A
r² = 0.64 = 64%.

N 2 · Numerical · Hard

Regression: Y = 10 + 2X. When X = 5, predicted Y is:

  • A. 20
  • B. 15
  • C. 10
  • D. 25

Correct Option: A
10 + 2(5) = 20.

74.15 Quick Recall

Important: Quick recall
  • Correlation — Galton “co-relation” (1888); Karl Pearson (1896).
  • Pearson r ∈ [−1, +1]; r² = % variance.
  • Spearman ρ (1904) — ranked data; Kendall τ.
  • Partial vs Multiple correlation.
  • Correlation ≠ Causation; Hill’s Criteria (1965): strength · consistency · specificity · temporality · gradient · plausibility · coherence · experiment · analogy.
  • Regression — Galton (1886) “regression to mean”; OLS via Gauss-Legendre.
  • Simple linear: Y = α + βX + ε.
  • Multiple: Y = β₀ + Σβᵢ Xᵢ + ε.
  • Adjusted R²; F-test (overall); t-test (individual β).
  • Diagnostics: Multicollinearity (VIF > 10) · Heteroskedasticity · Autocorrelation (Durbin-Watson ≈ 2) · Normality of residuals.
  • OLS Assumptions (LINE): Linearity · Independence · Normality · Equal variance.
  • Variants: Polynomial · Logistic · Probit · Multinomial · Ordinal · Poisson · Negative Binomial · Cox PH · Tobit · Ridge · Lasso · ElasticNet · Quantile · Hierarchical · GAM.
  • Time series: AR · MA · ARMA / ARIMA (Box-Jenkins 1970) · SARIMA · VAR · GARCH · Holt-Winters · Prophet · LSTM.
  • ML regression: Trees · RF · XGBoost · LightGBM · NN · SVR · k-NN · Bayesian · GP.
  • India applications: credit scoring · demand forecasting · macro · insurance · elections · healthcare.