flowchart TD A[Star schema] A --> A1[Fact: Sales] A1 --> A2[Dim: Date] A1 --> A3[Dim: Product] A1 --> A4[Dim: Store] A1 --> A5[Dim: Customer] B[Snowflake schema] B --> B1[Fact: Sales] B1 --> B2[Dim: Product] B2 --> B2a[Dim: Category] B2a --> B2b[Dim: Department]
87 Data Warehousing, Mining and Knowledge Management
When operational systems generate transactions and analytical systems generate insight, the data warehouse is the bridge that lets a single, time-stamped, subject-oriented store feed reporting, OLAP and data mining. Bill Inmon’s Building the Data Warehouse (1992) and Ralph Kimball’s The Data Warehouse Toolkit (1996) define the two canonical schools of warehouse design. Layered above the warehouse, data mining — formalised by Fayyad, Piatetsky-Shapiro and Smyth in the KDD process (1996) — extracts patterns; the broader discipline of knowledge management, conceptualised by Nonaka and Takeuchi (1995), turns those patterns and human know-how into a strategic asset.
| Term | Working definition |
|---|---|
| Data warehouse | A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process — Bill Inmon. |
| Data mining | The non-trivial extraction of implicit, previously unknown and potentially useful information from data — Frawley, Piatetsky-Shapiro & Matheus, 1992. |
| Knowledge management | The process of capturing, distributing and effectively using knowledge — Davenport & Prusak, 1998. |
87.1 Inmon vs Kimball
| Dimension | Inmon (top-down) | Kimball (bottom-up) |
|---|---|---|
| Architecture | Enterprise data warehouse first, data marts later | Data marts first, integrated via conformed dimensions |
| Modelling | Third-Normal-Form (3NF) at warehouse | Dimensional / star schema |
| Time to first value | Slow (large up-front design) | Fast (mart by mart) |
| Best fit | Large, regulated organisations | Agile BI environments |
Both schools share Inmon’s four-fold definition — subject-oriented, integrated, time-variant, non-volatile.
87.2 Star and Snowflake Schemas
Kimball’s dimensional model is best examined in two schema types:
In a star schema all dimension tables are denormalised and connect directly to the central fact table; in a snowflake schema dimension tables are normalised and chained. A fact constellation (galaxy) shares dimensions across multiple fact tables.
87.3 ETL Pipeline and OLAP
A warehouse is loaded by an Extract-Transform-Load (ETL) process; modern cloud designs prefer ELT, where transformation happens inside the warehouse engine. Reporting on top is via Online Analytical Processing (OLAP), contrasted with Online Transaction Processing (OLTP).
| Feature | OLTP | OLAP |
|---|---|---|
| Purpose | Day-to-day operations | Historical analysis |
| Schema | 3NF, normalised | Dimensional, denormalised |
| Workload | Many short writes | Few long reads |
| Users | Front-line employees | Analysts, executives |
| Example | Banking core, POS | Sales cube, financial cube |
OLAP supports five operations on a cube: roll-up (aggregation up a hierarchy), drill-down (the reverse), slice (fix one dimension), dice (fix multiple dimensions), and pivot (rotate axes). Cube technology comes in three flavours: MOLAP (multidimensional), ROLAP (relational) and HOLAP (hybrid).
87.4 Data Mining: The KDD Process
Fayyad’s KDD process organises mining into five steps:
flowchart LR A[Selection] --> B[Pre-processing] B --> C[Transformation] C --> D[Data mining] D --> E[Interpretation / Evaluation]
The CRISP-DM model (1999), the dominant industry methodology, has six iterative phases — Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.
87.5 Mining Tasks and Algorithms
| Task | Goal | Classical algorithm |
|---|---|---|
| Classification | Assign records to known classes | Decision tree (C4.5), Naive Bayes, SVM, Random Forest |
| Regression | Predict a numeric value | Linear / multiple regression, gradient boosting |
| Clustering | Group similar records without labels | k-means, DBSCAN, hierarchical |
| Association rule mining | Find co-occurrence patterns | Apriori (Agrawal 1994), FP-Growth |
| Anomaly / outlier detection | Spot unusual records | Isolation Forest, LOF |
| Sequential pattern mining | Order-preserving rules | GSP, PrefixSpan |
The market-basket “if diapers, then beer” rule, with its support, confidence and lift metrics, remains the most-asked example in association mining.
87.6 Knowledge Management
Knowledge has two faces. Explicit knowledge is articulable — manuals, patents, formulas. Tacit knowledge is the harder-to-codify intuition embedded in skill and experience (Polanyi, 1966). Nonaka and Takeuchi’s SECI model explains the spiral by which the two interconvert:
| Mode | Direction | Example |
|---|---|---|
| Socialisation | Tacit → Tacit | Apprenticeship, observation |
| Externalisation | Tacit → Explicit | Writing a manual, articulation |
| Combination | Explicit → Explicit | Synthesising documents into reports |
| Internalisation | Explicit → Tacit | Learning by doing from a manual |
Communities of Practice (CoP) (Wenger 1998), after-action reviews (US Army), lessons learned repositories, expertise locators (“yellow pages”) and knowledge audits are the practical instruments. The KM technology stack typically includes content-management systems, collaboration tools, enterprise search, and increasingly LLM-based retrieval-augmented generation (RAG).
87.7 Business Intelligence and Analytics Maturity
Tom Davenport’s Competing on Analytics (2007) and Gartner’s analytics maturity model arrange capabilities along a value curve:
| Stage | Question answered |
|---|---|
| Descriptive | What happened? |
| Diagnostic | Why did it happen? |
| Predictive | What will happen? |
| Prescriptive | What should we do? |
87.8 Practice Questions
Q 01 Inmon definition Easy
Which of the following is NOT one of Bill Inmon’s four characteristics of a data warehouse?
View solution
Q 02 Schema Medium
A schema with a central fact table connected directly to denormalised dimension tables is called:
View solution
Q 03 OLAP operations Medium
Moving from monthly sales to quarterly sales in an OLAP cube is an example of:
View solution
Q 04 Mining algorithms Medium
The Apriori algorithm is used primarily for:
View solution
Q 05 SECI model Medium
In Nonaka and Takeuchi’s SECI model, the conversion from tacit to explicit knowledge is termed:
View solution
Q 06 CRISP-DM Medium
The standard CRISP-DM methodology has how many iterative phases?
View solution
Q 07 Analytics maturity Medium
“What should we do?” corresponds to which level of analytics maturity?
View solution
Q 08 Match the following Hard
Match the contributor with the concept:
| (P) Bill Inmon | (1) KDD process |
| (Q) Ralph Kimball | (2) Top-down EDW |
| (R) Fayyad et al. | (3) Star schema / data marts |
| (S) Nonaka & Takeuchi | (4) SECI model |
View solution
- Warehouse = subject-oriented, integrated, time-variant, non-volatile (Inmon).
- Inmon (top-down 3NF EDW) vs Kimball (bottom-up dimensional star schema).
- OLAP operations: roll-up, drill-down, slice, dice, pivot.
- KDD (Fayyad 1996) = 5 steps; CRISP-DM = 6 phases.
- SECI: Socialisation, Externalisation, Combination, Internalisation (Nonaka-Takeuchi 1995).