74 Correlation and Regression Analysis
74.1 Correlation — Concept
Correlation = the statistical measure of the strength and direction of the linear relationship between two variables. Sir Francis Galton (1888) introduced “co-relation”; Karl Pearson (1896) gave the formal coefficient.
74.2 Pearson Correlation Coefficient (r)
\[r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}}\]
- −1 ≤ r ≤ +1.
- r = +1 → perfect positive linear.
- r = −1 → perfect negative linear.
- r = 0 → no linear relationship.
- Rule of thumb (cut-offs vary by field): |r| > 0.7 strong · 0.4-0.7 moderate · 0.1-0.4 weak.
- r² (Coefficient of Determination) = % of variance in Y explained by X.
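The formula above can be sketched directly in Python using only the standard library. The data (`hours`, `marks`) are made-up illustrative values, not taken from the text, and the function name `pearson_r` is an assumption for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the deviation-product formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for X and for Y
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: study hours and test marks for five students
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

r = pearson_r(hours, marks)
r_squared = r ** 2  # share of variance in marks explained by hours
print(f"r = {r:.4f}, r^2 = {r_squared:.2f}")
```

For this toy data the cross-product sum is 6 and the two sums of squares are 10 and 6, so r = 6 / √60 ≈ 0.77 and r² = 0.6.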
74.3 Other Correlation Measures
- Spearman’s Rank Correlation (ρ) — Charles Spearman (1904); ordinal data.
- Kendall’s Tau (τ).
- Point-Biserial — one continuous + one dichotomous.
- Phi (φ) / Cramér’s V — two categorical variables (φ for 2×2 tables, Cramér’s V for larger tables).
- Partial Correlation — controlling for other variables.
- Multiple Correlation (R) — Y with multiple Xs.
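As one illustration from this list, Spearman's ρ can be computed as Pearson's r applied to ranks. The sketch below assumes average ranks for ties; the helper names (`average_ranks`, `spearman_rho`) and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from sums of deviations (assumes neither input is constant)."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: monotone but strongly non-linear (one extreme value)
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 1000]

print("Pearson r   :", round(pearson_r(x, y), 3))
print("Spearman rho:", round(spearman_rho(x, y), 3))
```

Because y always increases with x, the ranks agree perfectly and ρ = 1, while Pearson's r is pulled below 1 by the non-linear jump.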
74.4 Correlation vs Causation
- Spurious correlation — both variables driven by a third (lurking) variable.
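- Reverse causation — Y may drive X rather than X driving Y, so the direction of effect is unclear.
- Confounding variables (factors related to both X and Y that distort the apparent association).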
- Hill’s Criteria for Causation (Bradford Hill 1965) — strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy.
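A small simulation can make the lurking-variable idea concrete. In this sketch, which uses made-up simulated data, a hypothetical variable `z` drives both `x` and `y`; the two are strongly correlated even though neither affects the other, and the first-order partial correlation controlling for `z` is close to zero. The noise levels and sample size are arbitrary assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson r from sums of deviations."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# z is the lurking variable; x and y each depend on z plus independent noise
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)
r_xy_given_z = partial_r(r_xy, pearson_r(x, z), pearson_r(y, z))

print(f"r(x, y)     = {r_xy:.3f}")          # large, though x does not cause y
print(f"r(x, y | z) = {r_xy_given_z:.3f}")  # near zero once z is controlled for
```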
74.5 Regression Analysis
Regression = modelling the relationship between a dependent variable Y and one or more independent variables X. Galton (1886) coined “regression toward the mean”, and Karl Pearson formalised the method. The Gauss-Markov theorem gives the theoretical foundation: under the classical assumptions, OLS is the best linear unbiased estimator (BLUE).
74.6 Simple Linear Regression
\[Y = \alpha + \beta X + \epsilon\]
- Slope (β) = Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)² = r × (σ_Y / σ_X).
- Intercept (α) = Ȳ − β·X̄.
- R² = % variance explained.
- SEE (Standard Error of Estimate) = √[Σ(Y − Ŷ)² / (n − 2)], the typical size of a residual.
- Method: Ordinary Least Squares (OLS) — Gauss-Legendre.
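The slope, intercept, R² and SEE formulas above can be sketched as a single function. The data are made-up illustrative values and the name `fit_simple_ols` is an assumption for this example.

```python
import math

def fit_simple_ols(xs, ys):
    """Fit Y = alpha + beta*X by ordinary least squares.

    Returns (alpha, beta, r2, see).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need equal-length samples with at least 3 points")
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                  # slope
    alpha = y_bar - beta * x_bar      # intercept
    # Residuals: observed Y minus fitted Y
    residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)       # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)    # total variation
    r2 = 1 - sse / sst
    see = math.sqrt(sse / (n - 2))    # n - 2 degrees of freedom
    return alpha, beta, r2, see

# Hypothetical data
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

alpha, beta, r2, see = fit_simple_ols(hours, marks)
print(f"Y = {alpha:.2f} + {beta:.2f} X, R^2 = {r2:.2f}, SEE = {see:.3f}")
print("Predicted Y at X = 6:", round(alpha + beta * 6, 2))
```

Here Σ(X−X̄)(Y−Ȳ) = 6 and Σ(X−X̄)² = 10, so β = 0.6 and α = 4 − 0.6 × 3 = 2.2. The same slope follows from r × (σ_Y / σ_X) = 0.7746 × 0.7746 ≈ 0.6.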
74.7 Multiple Regression
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon\]
- Adjusted R² — penalises for additional variables.
- F-test for overall significance.
- t-test for individual coefficients.
- Multicollinearity — predictors correlated; VIF (Variance Inflation Factor) > 10 is a common rule-of-thumb warning level.
- Heteroskedasticity — non-constant variance.
- Autocorrelation — Durbin-Watson statistic ranges from 0 to 4 (close to 2 = no first-order autocorrelation).
- Normality of residuals.
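Three of the diagnostics above reduce to short formulas: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), VIF_j = 1/(1 − R_j²), and Durbin-Watson = Σ(e_t − e_{t−1})² / Σe_t². The sketch below assumes the inputs (R², residuals) have already been obtained from a fitted model; all numbers are made up for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept not counted)."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def vif(r2_j):
    """VIF for predictor j, where r2_j is the R^2 from regressing
    X_j on all the other predictors."""
    if r2_j >= 1:
        raise ValueError("perfect multicollinearity: VIF is infinite")
    return 1 / (1 - r2_j)

def durbin_watson(residuals):
    """Durbin-Watson statistic from residuals in time order."""
    den = sum(e ** 2 for e in residuals)
    if den == 0:
        raise ValueError("all residuals are zero")
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    return num / den

# Hypothetical inputs
print(adjusted_r2(0.80, n=30, k=4))     # below the raw R^2 of 0.80
print(vif(0.90))                        # about 10, the usual warning level
print(durbin_watson([1, -1, 1, -1]))    # above 2: residuals alternate in sign
print(durbin_watson([1, 1, 1, 1]))      # 0: residuals move together
```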
74.8 Assumptions of OLS — LINE
- L — Linearity of relationship.
- I — Independence of errors.
- N — Normality of errors.
- E — Equal variance (Homoskedasticity).
- Plus: no perfect multicollinearity; exogeneity (errors uncorrelated with the predictors).
74.9 Variants of Regression
- Linear — straight line.
- Polynomial — Y = β₀ + β₁X + β₂X² + …
- Logistic — binary Y; Sigmoid function.
- Probit — alternative to logit; models p with the standard normal CDF instead of the sigmoid.
- Multinomial Logistic — multi-class Y.
- Ordinal Logistic — ranked Y.
- Poisson — count Y.
- Negative Binomial — overdispersed counts.
- Cox Proportional Hazards — survival.
- Tobit — censored data.
- Ridge / Lasso / ElasticNet — regularised.
- Quantile Regression.
- Hierarchical / Multilevel.
- GAM (Generalised Additive Models).
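To illustrate what "regularised" means for one entry in this list, the sketch below shows ridge shrinkage in the simplest possible setting: a single centred predictor with an unpenalised intercept, where the slope becomes Σ(X−X̄)(Y−Ȳ) / (Σ(X−X̄)² + λ). This is a simplified special case chosen for illustration, not a general ridge solver; the data and the name `ridge_slope` are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Ridge slope for one centred predictor; lam = 0 gives the OLS slope."""
    if lam < 0:
        raise ValueError("penalty must be non-negative")
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # The penalty lam is added to the denominator, pulling the slope toward 0
    return sxy / (sxx + lam)

# Hypothetical data
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

for lam in (0, 10, 90):
    print(f"lambda = {lam:>2}: slope = {ridge_slope(hours, marks, lam):.2f}")
```

With Σ(X−X̄)(Y−Ȳ) = 6 and Σ(X−X̄)² = 10, the slope falls from 0.60 at λ = 0 to 0.30 at λ = 10 and 0.06 at λ = 90. Lasso uses an absolute-value penalty instead and can set coefficients exactly to zero.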
74.10 Logistic Regression
\[\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X\]
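Used when Y is binary (Yes/No, Default/Not, Buy/Not). The model is linear in the log-odds; solving for p gives p = 1 / (1 + e^−(β₀ + β₁X)), so the output is a probability between 0 and 1.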
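The link between log-odds and probability can be sketched in a few lines. The coefficients `beta0 = -4` and `beta1 = 0.8` are hypothetical values chosen for illustration, not estimates from any data set.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) to log-odds; the inverse of sigmoid."""
    return math.log(p / (1 - p))

def predict_probability(x, beta0, beta1):
    """P(Y = 1 | X = x) under ln(p / (1 - p)) = beta0 + beta1 * x."""
    return sigmoid(beta0 + beta1 * x)

# Hypothetical coefficients
beta0, beta1 = -4.0, 0.8

for x in (0, 5, 10):
    print(f"X = {x:>2}: p = {predict_probability(x, beta0, beta1):.3f}")

# Each one-unit rise in X multiplies the odds p / (1 - p) by exp(beta1)
odds_ratio = math.exp(beta1)
print(f"odds ratio per unit of X = {odds_ratio:.3f}")
```

At X = 5 the log-odds are −4 + 0.8 × 5 = 0, so p = 0.5. Coefficients are usually interpreted through the odds ratio e^β₁ rather than as direct changes in probability.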
74.11 Time-Series Regression and Forecasting
- AR (Autoregressive).
- MA (Moving Average).
- ARMA / ARIMA — Box-Jenkins (1970).
- SARIMA — seasonal.
- VAR (Vector Autoregression).
- GARCH — volatility modelling.
- Holt-Winters exponential smoothing.
- Prophet (Facebook 2017).
- LSTM / Transformers — deep learning.
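The simplest model in this list, AR(1), can be sketched as a regression of the series on its own first lag. The example simulates y_t = φ·y_{t−1} + e_t with an assumed φ = 0.6 and standard normal noise, then recovers φ by least squares without an intercept. The function names and settings are hypothetical.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate n points of y_t = phi * y_{t-1} + e_t with N(0, 1) noise."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0, 1))
    return y

def fit_ar1(y):
    """Least-squares estimate of phi: regress y_t on y_{t-1}, no intercept."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

series = simulate_ar1(phi=0.6, n=5000)
phi_hat = fit_ar1(series)

# One-step-ahead forecast: the estimated coefficient times the last value
forecast = phi_hat * series[-1]
print(f"estimated phi = {phi_hat:.3f}")
print(f"one-step forecast = {forecast:.3f}")
```

The estimate should land close to 0.6, with the gap shrinking as the series gets longer.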
74.12 Modern Machine Learning
Regression extends into ML:
- Decision Trees / Random Forests / Gradient Boosting (XGBoost, LightGBM, CatBoost).
- Neural Networks.
- SVR (Support Vector Regression).
- k-NN Regression.
- Bayesian Regression.
- Gaussian Processes.
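Among these methods, k-NN regression is simple enough to sketch without a library: the prediction is the average Y of the k training points closest to the query. The version below assumes a single numeric feature and absolute distance; the data and the name `knn_predict` are hypothetical.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict Y at `query` as the mean Y of the k nearest training points."""
    if len(train_x) != len(train_y):
        raise ValueError("train_x and train_y must have the same length")
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training pairs by distance from the query and keep the closest k
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical training data following a curved (non-linear) pattern
train_x = [1, 2, 3, 4, 5, 6]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]

print(knn_predict(train_x, train_y, query=3.4, k=1))  # nearest point is X = 3
print(knn_predict(train_x, train_y, query=3.4, k=2))  # averages X = 3 and X = 4
```

Unlike OLS, no functional form is assumed, so the method can follow curved patterns, at the cost of needing the training data at prediction time.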
74.13 Indian Applications
- Credit scoring — CIBIL, Experian.
- Demand forecasting — Reliance, HUL.
- Election prediction — CSDS, C-Voter.
- Macro forecasting — RBI, NCAER.
- Insurance pricing — LIC, GIC.
- Healthcare — AIIMS, ICMR studies.
74.14 Practice Questions
Pearson r ranges from:
"Regression toward the mean" was coined by:
Spearman's correlation (1904) is suited for:
r² represents:
OLS estimates coefficients by minimising:
A VIF > 10 indicates:
Durbin-Watson statistic tests for:
Logistic regression is used when Y is:
Cox Proportional Hazards regression is for:
Hill's Criteria for causation (1965) were given by:
Ridge and Lasso are:
OLS assumption "E" stands for:
Adjusted R² differs from R² by:
Poisson regression is used when Y is:
Match:
| (i) | Correlation | (a) | Galton |
| (ii) | Regression | (b) | Spearman |
| (iii) | Rank correlation | (c) | Pearson |
| (iv) | Bootstrap | (d) | Efron |
74.14.1 Advanced Format Questions
Assertion (A): r = 0 means no relationship.
Reason (R): Pearson r captures only linear relationships.
OLS assumptions (LINE): (i) Linearity. (ii) Independence. (iii) Normality. (iv) Equal variance.
If r = 0.8, what % of variance in Y is explained by X?
Regression: Y = 10 + 2X. When X = 5, predicted Y is:
74.15 Quick Recall
- Correlation — Galton “co-relation” (1888); Karl Pearson (1896).
- Pearson r ∈ [−1, +1]; r² = % variance.
- Spearman ρ (1904) — ranked data; Kendall τ.
- Partial vs Multiple correlation.
- Correlation ≠ Causation; Hill’s Criteria (1965): strength · consistency · specificity · temporality · gradient · plausibility · coherence · experiment · analogy.
- Regression — Galton (1886) “regression to mean”; OLS via Gauss-Legendre.
- Simple linear: Y = α + βX + ε.
- Multiple: Y = β₀ + Σβᵢ Xᵢ + ε.
- Adjusted R²; F-test (overall); t-test (individual β).
- Diagnostics: Multicollinearity (VIF > 10) · Heteroskedasticity · Autocorrelation (Durbin-Watson ≈ 2) · Normality of residuals.
- OLS Assumptions (LINE): Linearity · Independence · Normality · Equal variance.
- Variants: Polynomial · Logistic · Probit · Multinomial · Ordinal · Poisson · Negative Binomial · Cox PH · Tobit · Ridge · Lasso · ElasticNet · Quantile · Hierarchical · GAM.
- Time series: AR · MA · ARMA / ARIMA (Box-Jenkins 1970) · SARIMA · VAR · GARCH · Holt-Winters · Prophet · LSTM.
- ML regression: Trees · RF · XGBoost · LightGBM · NN · SVR · k-NN · Bayesian · GP.
- India applications: credit scoring · demand forecasting · macro · insurance · elections · healthcare.