88 Data Warehousing, Mining and Knowledge Management
88.1 Data Warehouse — Concept
Data Warehouse (DW) = “a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process” — W.H. Inmon (1990). Inmon is the “Father of Data Warehousing”; Ralph Kimball is the proponent of the dimensional / star-schema approach.
88.2 Inmon vs Kimball
| Dimension | Inmon | Kimball |
|---|---|---|
| Approach | Top-down | Bottom-up |
| Data model | 3NF / Normalised | Dimensional / Star |
| Build order | EDW first, then data marts | Data marts first, integrated later |
| Data marts | Dependent | Conformed |
| Use | Enterprise integration | Fast analytics |
88.3 Data Warehouse Architecture
- Source systems — OLTP DBs, files.
- ETL / ELT (Extract, Transform, Load).
- Data Warehouse / Lake / Lakehouse.
- OLAP cubes.
- Data Marts — subject-specific.
- Metadata repository.
- BI tools — Tableau, Power BI.
88.4 Schemas
- Star schema — central fact + dimension tables.
- Snowflake schema — normalised dimensions.
- Galaxy / Fact-constellation — multiple facts.
- Slowly Changing Dimensions (SCD) — Type 1, 2, 3, 6.
88.5 OLTP vs OLAP
| Dimension | OLTP | OLAP |
|---|---|---|
| Purpose | Daily operations | Analysis |
| Workload | Many small txns | Few large queries |
| Schema | Normalised | Star/Snowflake |
| Data age | Current | Historical |
| Users | Operational staff | Analysts, executives |
88.6 Data Mining — Concept
Data Mining = the process of discovering patterns, correlations and useful information from large datasets using statistical, machine-learning and database techniques. Han & Kamber’s textbook is the standard reference. The acronym KDD (Knowledge Discovery in Databases) is from Usama Fayyad (1996).
88.7 KDD Process
- Data Selection.
- Pre-processing (clean, integrate).
- Transformation — feature engineering.
- Data Mining — algorithms.
- Interpretation / Evaluation.
88.8 CRISP-DM
Cross-Industry Standard Process for Data Mining (1999) — six phases: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment.
88.9 Data Mining Tasks
- Classification — supervised; predict class.
- Regression — predict numeric.
- Clustering — unsupervised grouping.
- Association Rule Mining — Apriori, FP-Growth (market-basket).
- Sequential pattern mining — time-ordered.
- Anomaly detection — fraud, outliers.
- Text / Web / Image / Graph mining.
- Process mining — discover business processes from logs.
88.10 Market-Basket and Apriori
- Support = P(A ∩ B) = count(A∩B)/N.
- Confidence = P(B|A) = support(A∩B) / support(A).
- Lift = confidence / P(B).
- Apriori algorithm — Agrawal & Srikant (1994).
- FP-Growth — Han 2000.
- Diapers and Beer — famous anecdote.
88.11 Knowledge Management — Concept
Knowledge Management (KM) = the systematic process of capturing, developing, sharing and effectively using organisational knowledge for competitive advantage. Ikujiro Nonaka & Hirotaka Takeuchi wrote The Knowledge-Creating Company (1995). Karl-Erik Sveiby — intellectual capital. Thomas Davenport & Laurence Prusak — Working Knowledge (1998). Peter Drucker — coined “knowledge worker” (1959).
88.12 Tacit vs Explicit Knowledge
- Tacit — personal, hard to codify; experience, intuition. Michael Polanyi (1966) — “We can know more than we can tell.”
- Explicit — documented, codifiable.
88.13 Nonaka’s SECI Model
| Mode | From → To | Example |
|---|---|---|
| Socialisation | Tacit → Tacit | Apprenticeship |
| Externalisation | Tacit → Explicit | Manuals, writing |
| Combination | Explicit → Explicit | Reports, training |
| Internalisation | Explicit → Tacit | Learning by doing |
Plus Ba — shared context for knowledge creation (Nonaka, Konno 1998).
88.14 DIKW Pyramid
Data → Information → Knowledge → Wisdom — Russell Ackoff (1989).
88.15 KM Strategies — Hansen-Nohria-Tierney (1999)
- Codification — store in databases (Accenture, Andersen).
- Personalisation — person-to-person (McKinsey, BCG).
88.16 KM Technologies
- Document Management Systems (DMS).
- Content Management Systems (CMS).
- Intranets / Portals.
- Knowledge bases / Wikis.
- Communities of Practice (CoP) — Wenger 1998.
- Expert systems.
- Social collaboration — Slack, Teams.
- Semantic search.
- AI-based KM — Gen-AI chatbots over corporate knowledge (RAG).
88.17 Indian Context
- TCS, Infosys, Wipro — pioneers of large-scale KM.
- Infosys KM programme (KShop) was an early case study.
- CII Centre for Excellence in KM.
- ISRO — knowledge re-use across missions.
- IITs, IIMs — academic KM research (NMIMS, IIM-Bangalore).
- Data Mining Indian firms: Mu Sigma, Fractal, Tiger Analytics, Tredence.
- Government: PRAGATI dashboard (PM monitoring), DBT-Aadhaar linkage.
88.18 Modern Trends
- Cloud data warehouse / Lakehouse — Snowflake, Databricks, BigQuery.
- Real-time analytics — Kafka + Spark.
- AutoML — Google, DataRobot.
- MLOps — production ML pipelines.
- DataOps — agile data delivery.
- Knowledge graphs — Google, Microsoft.
- Generative-AI-powered KM — LLM + RAG over enterprise data.
- Vector databases — Pinecone, Weaviate, Milvus.
- Semantic layer.
- Data mesh (Dehghani 2019) — decentralised.
- Federated KM.
- Data contracts.
- Privacy-preserving analytics — differential privacy.
88.19 Practice Questions
"Father of Data Warehousing" is:
View solution
Dimensional / star-schema approach is by:
View solution
SECI Model is by:
View solution
CRISP-DM has:
View solution
Apriori algorithm (1994) is by:
View solution
DIKW pyramid order is:
View solution
"We can know more than we can tell" is by:
View solution
Star schema has central:
View solution
"Knowledge worker" was coined by:
View solution
McKinsey & BCG use which KM strategy?
View solution
Communities of Practice is by:
View solution
In association rules, Lift = 1 means:
View solution
"Data Mesh" architecture is by:
View solution
OLAP optimised for:
View solution
Match SECI:
| (i) | Socialisation | (a) | Tacit→Explicit |
| (ii) | Externalisation | (b) | Tacit→Tacit |
| (iii) | Combination | (c) | Explicit→Tacit |
| (iv) | Internalisation | (d) | Explicit→Explicit |
View solution
88.19.1 Advanced Format Questions
A: Inmon used 3NF and top-down approach.
R: Kimball used dimensional star schema and bottom-up.
View solution
SECI Model (Nonaka): (i) Socialisation. (ii) Externalisation. (iii) Combination. (iv) Internalisation.
View solution
Mining tasks: (i) Classification. (ii) Clustering. (iii) Association. (iv) Anomaly detection.
View solution
88.20 Quick Recall
- DW (Inmon 1990) — subject-oriented, integrated, time-variant, non-volatile.
- Inmon (top-down, 3NF) vs Kimball (bottom-up, dimensional/star).
- Schemas: Star · Snowflake · Galaxy · SCD types.
- OLTP vs OLAP — operations vs analysis.
- Architecture: Source → ETL/ELT → DW/Lake → OLAP → Marts → Metadata → BI.
- Data mining = KDD — Fayyad (1996) 5 steps; CRISP-DM 6 phases.
- Mining tasks: Classification · Regression · Clustering · Association rules (Apriori — Agrawal-Srikant 1994; FP-Growth — Han 2000) · Sequential · Anomaly · Text/Web/Image/Graph · Process mining.
- Market basket: Support · Confidence · Lift; Diapers-Beer.
- KM: Davenport-Prusak · Nonaka-Takeuchi 1995 · Sveiby · Drucker (knowledge worker 1959).
- Tacit vs Explicit (Polanyi 1966).
- SECI: Socialisation (T→T) · Externalisation (T→Ex) · Combination (Ex→Ex) · Internalisation (Ex→T); + Ba.
- DIKW — Data → Information → Knowledge → Wisdom (Ackoff 1989).
- KM strategies (Hansen-Nohria-Tierney 1999): Codification vs Personalisation.
- Tools: DMS · CMS · portals · wikis · CoP (Wenger 1998) · ES · Slack · semantic search · LLM+RAG.
- India: TCS · Infosys KShop · CII KM · ISRO · IIM-B · Mu Sigma · Fractal.
- Modern: Cloud DW (Snowflake, Databricks, BigQuery) · Lakehouse · Kafka+Spark · AutoML · MLOps · DataOps · Knowledge graphs · Vector DBs · Semantic layer · Data Mesh (Dehghani 2019) · differential privacy.