87  Data Warehousing, Mining and Knowledge Management

When operational systems generate transactions and analytical systems generate insight, the data warehouse is the bridge that lets a single, time-stamped, subject-oriented store feed reporting, OLAP and data mining. Bill Inmon’s Building the Data Warehouse (1992) and Ralph Kimball’s The Data Warehouse Toolkit (1996) define the two canonical schools of warehouse design. Layered above the warehouse, data mining — formalised by Fayyad, Piatetsky-Shapiro and Smyth in the KDD process (1996) — extracts patterns; the broader discipline of knowledge management, conceptualised by Nonaka and Takeuchi (1995), turns those patterns and human know-how into a strategic asset.

TipWorking definitions
Term Working definition
Data warehouse A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process — Bill Inmon.
Data mining The non-trivial extraction of implicit, previously unknown and potentially useful information from data — Frawley, Piatetsky-Shapiro & Matheus, 1992.
Knowledge management The process of capturing, distributing and effectively using knowledge — Davenport & Prusak, 1998.

87.1 Inmon vs Kimball

Dimension Inmon (top-down) Kimball (bottom-up)
Architecture Enterprise data warehouse first, data marts later Data marts first, integrated via conformed dimensions
Modelling Third-Normal-Form (3NF) at warehouse Dimensional / star schema
Time to first value Slow (large up-front design) Fast (mart by mart)
Best fit Large, regulated organisations Agile BI environments

Both schools share Inmon’s four-fold definition — subject-oriented, integrated, time-variant, non-volatile.

87.2 Star and Snowflake Schemas

Kimball’s dimensional model is best examined in two schema types:

flowchart TD
A[Star schema]
A --> A1[Fact: Sales]
A1 --> A2[Dim: Date]
A1 --> A3[Dim: Product]
A1 --> A4[Dim: Store]
A1 --> A5[Dim: Customer]
B[Snowflake schema]
B --> B1[Fact: Sales]
B1 --> B2[Dim: Product]
B2 --> B2a[Dim: Category]
B2a --> B2b[Dim: Department]

In a star schema all dimension tables are denormalised and connect directly to the central fact table; in a snowflake schema dimension tables are normalised and chained. A fact constellation (galaxy) shares dimensions across multiple fact tables.

87.3 ETL Pipeline and OLAP

A warehouse is loaded by an Extract-Transform-Load (ETL) process; modern cloud designs prefer ELT, where transformation happens inside the warehouse engine. Reporting on top is via Online Analytical Processing (OLAP), contrasted with Online Transaction Processing (OLTP).

TipOLTP vs OLAP
Feature OLTP OLAP
Purpose Day-to-day operations Historical analysis
Schema 3NF, normalised Dimensional, denormalised
Workload Many short writes Few long reads
Users Front-line employees Analysts, executives
Example Banking core, POS Sales cube, financial cube

OLAP supports five operations on a cube: roll-up (aggregation up a hierarchy), drill-down (the reverse), slice (fix one dimension), dice (fix multiple dimensions), and pivot (rotate axes). Cube technology comes in three flavours: MOLAP (multidimensional), ROLAP (relational) and HOLAP (hybrid).

87.4 Data Mining: The KDD Process

Fayyad’s KDD process organises mining into five steps:

flowchart LR
A[Selection] --> B[Pre-processing]
B --> C[Transformation]
C --> D[Data mining]
D --> E[Interpretation / Evaluation]

The CRISP-DM model (1999), the dominant industry methodology, has six iterative phases — Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.

87.5 Mining Tasks and Algorithms

TipMining tasks
Task Goal Classical algorithm
Classification Assign records to known classes Decision tree (C4.5), Naive Bayes, SVM, Random Forest
Regression Predict a numeric value Linear / multiple regression, gradient boosting
Clustering Group similar records without labels k-means, DBSCAN, hierarchical
Association rule mining Find co-occurrence patterns Apriori (Agrawal 1994), FP-Growth
Anomaly / outlier detection Spot unusual records Isolation Forest, LOF
Sequential pattern mining Order-preserving rules GSP, PrefixSpan

The market-basket “if diapers, then beer” rule, with its support, confidence and lift metrics, remains the most-asked example in association mining.

87.6 Knowledge Management

Knowledge has two faces. Explicit knowledge is articulable — manuals, patents, formulas. Tacit knowledge is the harder-to-codify intuition embedded in skill and experience (Polanyi, 1966). Nonaka and Takeuchi’s SECI model explains the spiral by which the two interconvert:

TipSECI model
Mode Direction Example
Socialisation Tacit → Tacit Apprenticeship, observation
Externalisation Tacit → Explicit Writing a manual, articulation
Combination Explicit → Explicit Synthesising documents into reports
Internalisation Explicit → Tacit Learning by doing from a manual

Communities of Practice (CoP) (Wenger 1998), after-action reviews (US Army), lessons learned repositories, expertise locators (“yellow pages”) and knowledge audits are the practical instruments. The KM technology stack typically includes content-management systems, collaboration tools, enterprise search, and increasingly LLM-based retrieval-augmented generation (RAG).

87.7 Business Intelligence and Analytics Maturity

Tom Davenport’s Competing on Analytics (2007) and Gartner’s analytics maturity model arrange capabilities along a value curve:

TipAnalytics maturity
Stage Question answered
Descriptive What happened?
Diagnostic Why did it happen?
Predictive What will happen?
Prescriptive What should we do?

87.8 Practice Questions

Q 01 Inmon definition Easy

Which of the following is NOT one of Bill Inmon’s four characteristics of a data warehouse?

  • A. Subject-oriented
  • B. Integrated
  • C. Time-variant
  • D. Volatile
View solution
Correct Option: D
Inmon’s four characteristics are subject-oriented, integrated, time-variant and non-volatile. Once data enters the warehouse, it is not over-written.

Q 02 Schema Medium

A schema with a central fact table connected directly to denormalised dimension tables is called:

  • A. Star schema
  • B. Snowflake schema
  • C. Fact constellation
  • D. Third Normal Form
View solution
Correct Option: A
In a star schema, dimensions are denormalised and link straight to the central fact table; the snowflake further normalises dimensions.

Q 03 OLAP operations Medium

Moving from monthly sales to quarterly sales in an OLAP cube is an example of:

  • A. Drill-down
  • B. Roll-up
  • C. Slice
  • D. Pivot
View solution
Correct Option: B
Aggregating up a hierarchy (month → quarter) is roll-up; the reverse is drill-down.

Q 04 Mining algorithms Medium

The Apriori algorithm is used primarily for:

  • A. Classification
  • B. Clustering
  • C. Association rule mining
  • D. Outlier detection
View solution
Correct Option: C
Agrawal & Srikant’s Apriori (1994) finds frequent itemsets and generates association rules with support, confidence and lift.

Q 05 SECI model Medium

In Nonaka and Takeuchi’s SECI model, the conversion from tacit to explicit knowledge is termed:

  • A. Socialisation
  • B. Externalisation
  • C. Combination
  • D. Internalisation
View solution
Correct Option: B
Externalisation articulates tacit knowledge into explicit form (manuals, models, written procedures).

Q 06 CRISP-DM Medium

The standard CRISP-DM methodology has how many iterative phases?

  • A. 4
  • B. 5
  • C. 6
  • D. 7
View solution
Correct Option: C
CRISP-DM has six phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.

Q 07 Analytics maturity Medium

“What should we do?” corresponds to which level of analytics maturity?

  • A. Descriptive
  • B. Diagnostic
  • C. Predictive
  • D. Prescriptive
View solution
Correct Option: D
Prescriptive analytics recommends actions, often via optimisation or simulation. The other levels answer what / why / what-will-happen respectively.

Q 08 Match the following Hard

Match the contributor with the concept:

(P) Bill Inmon (1) KDD process
(Q) Ralph Kimball (2) Top-down EDW
(R) Fayyad et al. (3) Star schema / data marts
(S) Nonaka & Takeuchi (4) SECI model
  • A. P-2, Q-3, R-1, S-4
  • B. P-3, Q-2, R-1, S-4
  • C. P-2, Q-1, R-3, S-4
  • D. P-4, Q-3, R-2, S-1
View solution
Correct Option: A
Inmon — top-down enterprise data warehouse; Kimball — star schema / dimensional marts; Fayyad — KDD process; Nonaka & Takeuchi — SECI model.
ImportantQuick recall
  • Warehouse = subject-oriented, integrated, time-variant, non-volatile (Inmon).
  • Inmon (top-down 3NF EDW) vs Kimball (bottom-up dimensional star schema).
  • OLAP operations: roll-up, drill-down, slice, dice, pivot.
  • KDD (Fayyad 1996) = 5 steps; CRISP-DM = 6 phases.
  • SECI: Socialisation, Externalisation, Combination, Internalisation (Nonaka-Takeuchi 1995).