86  Artificial Intelligence and Big Data

Artificial intelligence (AI) and big data are intertwined: AI is the family of algorithms that learn patterns and make predictions, while big data is the raw material on which the most powerful learning algorithms are trained. The intellectual roots of AI run from the Turing Test (Alan Turing, 1950) and the Dartmouth Conference of 1956, where John McCarthy coined the term, through the AI winters of the 1970s and 1980s, to the deep-learning renaissance after AlexNet (2012) and the foundation-model era inaugurated by the transformer architecture (Vaswani et al., "Attention Is All You Need", 2017). The standard reference is Russell and Norvig, Artificial Intelligence: A Modern Approach; for big data, the conceptual benchmark is Doug Laney's 3-V framework (2001).

Tip: Working definitions

| Term | Working definition |
| --- | --- |
| Artificial intelligence | The science of designing computational agents that perceive, reason and act so as to achieve goals (Russell & Norvig) |
| Machine learning | A subfield of AI in which algorithms learn parameters from data rather than being explicitly programmed (Tom Mitchell, 1997) |
| Big data | Data assets whose volume, velocity and variety exceed the capability of conventional data-management tools (Doug Laney, 2001) |

86.1 Types of AI

Tip: A taxonomy of AI

| Layer | Subset | What it does |
| --- | --- | --- |
| AI (broad) | Symbolic / rule-based, expert systems, logic | Encodes human knowledge as rules |
| Machine learning | Supervised, unsupervised, reinforcement, semi-supervised | Learns patterns from data |
| Deep learning | Neural networks with many layers (CNN, RNN, transformer) | Learns representations from raw data |
| Generative AI | LLMs (GPT, Claude, Gemini), diffusion models | Produces new content (text, image, code) |

A second cut, by capability:

Tip: AI by capability

| Type | Description |
| --- | --- |
| Narrow / Weak AI | Specialised system; today's reality (ChatGPT, AlphaFold) |
| General AI (AGI) | Human-level competence across all cognitive domains; aspirational |
| Super AI | Beyond human-level intelligence; speculative |

86.2 Machine Learning Paradigms

The three classical paradigms differ in what the algorithm is told:

```mermaid
flowchart LR
    A[Machine learning] --> B[Supervised learning]
    A --> C[Unsupervised learning]
    A --> D[Reinforcement learning]
    B --> B1[Regression]
    B --> B2[Classification]
    C --> C1[Clustering]
    C --> C2[Dimensionality reduction]
    D --> D1[Policy learning via reward]
```

Supervised learning uses labelled data — the algorithm sees input-output pairs and learns the mapping (linear / logistic regression, decision tree, random forest, SVM, neural network). Unsupervised learning finds structure in unlabelled data (k-means, DBSCAN, hierarchical clustering, PCA). Reinforcement learning trains an agent through rewards and penalties in an environment (Q-learning, policy gradient; DeepMind’s AlphaGo).
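The supervised case can be made concrete with a deliberately tiny sketch: a nearest-centroid "spam scorer" in plain Python, where training averages a single spam-likelihood feature per class and prediction picks the nearer class centroid. The feature values and function names here are invented for illustration; a real system would use a library such as scikit-learn.

```python
# Toy supervised learning: labelled (feature, label) pairs in,
# a learned mapping out. "Training" averages the feature per class;
# "prediction" picks the class whose centroid is closest.

def train(pairs):
    sums, counts = {}, {}
    for x, y in pairs:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}  # one centroid per class

def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Labelled training data: a single "spamminess" feature per email.
data = [(0.9, "spam"), (0.8, "spam"), (0.1, "ham"), (0.2, "ham")]
model = train(data)
print(predict(model, 0.85))  # spam
print(predict(model, 0.15))  # ham
```

An unsupervised method would receive the same feature values without the labels and have to discover the two clusters itself.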

86.3 Deep Learning Architectures

| Architecture | Best suited for |
| --- | --- |
| Convolutional Neural Network (CNN) | Image, video, spatial data |
| Recurrent Neural Network / LSTM / GRU | Time series, speech |
| Transformer | Sequence-to-sequence, language, vision |
| Generative Adversarial Network (GAN) | Image generation, deepfakes |
| Diffusion model | Text-to-image (Stable Diffusion, Imagen) |
| Graph Neural Network | Network / relational data |

The transformer architecture (2017) underpins today’s Large Language Models (LLMs) — GPT, Claude, Gemini, Llama — and the broader category of foundation models whose pre-training on vast unlabelled corpora is followed by task-specific fine-tuning or in-context prompting.
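The computational core of the transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V: each query scores every key, the scores become weights via softmax, and the output is the corresponding weighted average of the values. A minimal pure-Python sketch with toy matrices (no batching, no multiple heads, no learned projections):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to each key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)           # attention weights, sum to 1
        # output row = convex combination of the value rows
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs: the query matches the
# first key more strongly, so the output leans toward the first value.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [0.0]]
out = attention(Q, K, V)
```

Real transformers apply this in parallel across many heads and stack it with feed-forward layers, but every layer reduces to this same weighted-average operation.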

86.4 Big Data: The V’s

Doug Laney’s original 3-V definition has expanded to 5–7 V’s in industry use:

Tip: The V’s of big data

| V | Meaning |
| --- | --- |
| Volume | Petabytes / exabytes of data |
| Velocity | Speed of generation and processing (streaming) |
| Variety | Structured + semi-structured + unstructured |
| Veracity | Trustworthiness; data quality |
| Value | Business worth extracted from data |
| Variability | Inconsistency / change of meaning over time |
| Visualisation | Communicability through dashboards and charts |

86.5 Big Data Architecture

A typical pipeline ingests, stores, processes, analyses and serves data:

```mermaid
flowchart LR
    A[Sources: IoT, web, mobile, ERP] --> B[Ingestion: Kafka, Flume]
    B --> C[Storage: HDFS, S3, NoSQL]
    C --> D[Processing: Spark, Flink, MapReduce]
    D --> E[Analytics: ML, OLAP, BI]
    E --> F[Visualisation: Tableau, PowerBI]
```

Foundational technologies include the Hadoop ecosystem (HDFS for distributed storage, MapReduce for batch processing, YARN for resource management), Apache Spark (in-memory cluster computing), NoSQL databases in four families — key-value (Redis, DynamoDB), document (MongoDB), column-family (Cassandra, HBase), and graph (Neo4j) — and streaming engines such as Kafka and Flink.
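The MapReduce model itself fits in a few lines: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A single-process word-count sketch (Hadoop and Spark distribute exactly these phases across a cluster; the documents here are made up):

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit (word, 1) for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all values by key across mappers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ai", "ai ai"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 1, 'ai': 3}
```

In a real cluster, map tasks run on the nodes that already hold the data blocks (data locality), and the shuffle moves intermediate pairs over the network; that shuffle is usually the expensive step.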

86.6 AI / Big Data Use-Cases in Business

Tip: Sector applications

| Function | Application |
| --- | --- |
| Marketing | Customer segmentation, propensity modelling, recommender systems |
| Operations | Predictive maintenance, demand forecasting, route optimisation |
| Finance | Credit scoring, fraud detection, algorithmic trading |
| HR | Resume screening, attrition prediction, sentiment analysis |
| Strategy | Competitive intelligence, scenario simulation, business war gaming |

86.7 Ethics, Bias and Governance

AI systems can amplify bias present in their training data: Amazon's experimental recruiting tool learned to down-rank résumés associated with women, and the COMPAS recidivism-risk tool used in US courts was found to produce racially skewed error rates. Frameworks for responsible AI (fairness, accountability, transparency, explainability) are formalised in:

  • OECD AI Principles, 2019 (the first inter-governmental standard).
  • EU Artificial Intelligence Act, 2024 (a four-tier risk-based law: unacceptable / high / limited / minimal risk).
  • NITI Aayog's National Strategy for Artificial Intelligence, 2018, and India's Digital Personal Data Protection (DPDP) Act, 2023 (a broader Digital India Act remains at the proposal stage).
  • NIST AI Risk Management Framework, 2023.

The standard tests for AI fairness are demographic parity, equalised odds, equal opportunity, and counterfactual fairness; explainability tools include LIME and SHAP.
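Demographic parity, the simplest of these tests, asks whether the positive-prediction rate is the same across protected groups. A toy check in plain Python (the predictions and group labels below are invented for illustration):

```python
def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rate between groups.

    preds:  0/1 model predictions, one per example
    groups: protected-group membership, one label per example
    A gap of 0 means perfect demographic parity.
    """
    rates = {}
    for g in set(groups):
        group_preds = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(group_preds) / len(group_preds)
    return max(rates.values()) - min(rates.values())

# Group A receives positive predictions at 3/4, group B at 1/4.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_gap(preds, groups)
print(gap)  # 0.5
```

Equalised odds and equal opportunity refine this by conditioning on the true label (comparing error rates rather than raw prediction rates), which is why a model can satisfy one criterion while violating another.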

86.8 Practice Questions

Q 01 AI history Easy

The term “artificial intelligence” was coined at which 1956 event?

  • A. Turing’s 1950 paper
  • B. Dartmouth Conference
  • C. Royal Society Lecture
  • D. Bell Labs Symposium
Correct Option: B
John McCarthy coined the term “artificial intelligence” at the Dartmouth Summer Research Project on AI in 1956, organised with Marvin Minsky, Nathaniel Rochester, and Claude Shannon.

Q 02 3 V’s Easy

Doug Laney’s original 3-V framework for big data (2001) refers to:

  • A. Volume, Variety, Velocity
  • B. Volume, Veracity, Value
  • C. Variety, Velocity, Veracity
  • D. Volume, Validity, Value
Correct Option: A
Laney’s META Group note (2001) defined big data along Volume, Velocity, and Variety. Veracity, Value and others were added later.

Q 03 ML paradigms Medium

A spam filter trained on emails labelled as “spam” or “not spam” is an example of:

  • A. Supervised learning
  • B. Unsupervised learning
  • C. Reinforcement learning
  • D. Self-supervised learning
Correct Option: A
Labelled training data with binary outcomes makes spam filtering a supervised binary-classification task.

Q 04 Deep learning Medium

The transformer architecture, the basis of modern LLMs, was introduced in which 2017 paper?

  • A. “ImageNet Classification with Deep CNNs”
  • B. “Attention Is All You Need”
  • C. “Generative Adversarial Networks”
  • D. “Deep Residual Learning”
Correct Option: B
Vaswani et al., “Attention Is All You Need” (2017), introduced the transformer architecture that powers GPT, Claude, BERT, and modern LLMs.

Q 05 Hadoop Easy

HDFS in the Hadoop ecosystem is a:

  • A. Distributed processing framework
  • B. Distributed file system
  • C. NoSQL database
  • D. Streaming engine
Correct Option: B
Hadoop Distributed File System stores very large files across commodity nodes; MapReduce / Spark process the data sitting on HDFS.

Q 06 Capability Medium

ChatGPT or any current LLM-based assistant is best described as:

  • A. Narrow AI
  • B. Artificial General Intelligence
  • C. Super AI
  • D. Symbolic AI
Correct Option: A
Despite their breadth, current LLMs are narrow AI — capable in language and reasoning tasks but not exhibiting the full general competence required of AGI.

Q 07 EU AI Act Hard

The EU AI Act adopted in 2024 follows which structural approach?

  • A. Sector-based
  • B. Risk-based, four-tier
  • C. Self-regulation
  • D. Outcome-based
Correct Option: B
The EU AI Act classifies AI systems into four risk tiers — unacceptable, high, limited, minimal — with proportionate obligations.

Q 08 Match the following Hard

Match the technology with its category:

(P) CNN (1) Streaming engine
(Q) MongoDB (2) Image recognition
(R) Kafka (3) Document NoSQL
(S) SHAP (4) Explainability tool
  • A. P-2, Q-3, R-1, S-4
  • B. P-2, Q-1, R-3, S-4
  • C. P-3, Q-2, R-1, S-4
  • D. P-1, Q-3, R-2, S-4
Correct Option: A
CNN — image recognition; MongoDB — document NoSQL; Kafka — streaming engine; SHAP — explainability for ML.
Important: Quick recall
  • “AI” coined at the Dartmouth Conference, 1956; transformer paper “Attention Is All You Need” (2017) underpins LLMs.
  • Three ML paradigms: supervised, unsupervised, reinforcement; deep learning is a subset of ML.
  • Big data 3-V (Laney 2001): Volume, Velocity, Variety; later 5–7-V extensions add Veracity, Value.
  • Hadoop = HDFS + MapReduce + YARN; Spark is in-memory; NoSQL is non-relational (key-value, document, column, graph).
  • Governance: EU AI Act 2024 (4 tiers), OECD Principles 2019, NIST AI RMF 2023, NITI Aayog NSAI 2018, India DPDP 2023.