86  Artificial Intelligence and Big Data

Artificial intelligence (AI) and big data are intertwined: AI is the family of algorithms that learn patterns and make predictions, while big data is the raw material on which the most powerful learning algorithms are trained. The intellectual roots of AI run from the Turing Test (Alan Turing, 1950) and the Dartmouth Conference of 1956, where John McCarthy coined the term, through the AI winters of the 1970s and 1980s, to the deep-learning renaissance after AlexNet (2012) and the foundation-model era inaugurated by the transformer architecture (Vaswani et al., "Attention Is All You Need", 2017). The standard reference is Russell and Norvig, Artificial Intelligence: A Modern Approach; for big data, the conceptual benchmark is Doug Laney's 3-V framework (2001).

Tip: Working definitions

| Term | Working definition |
| --- | --- |
| Artificial intelligence | The science of designing computational agents that perceive, reason and act so as to achieve goals (Russell & Norvig) |
| Machine learning | A subfield of AI in which algorithms learn parameters from data rather than being explicitly programmed (Tom Mitchell, 1997) |
| Big data | Data assets whose volume, velocity and variety exceed the capability of conventional data-management tools (Doug Laney, 2001) |

86.1 Types of AI

Tip: A taxonomy of AI

| Layer | Subset | What it does |
| --- | --- | --- |
| AI (broad) | Symbolic / rule-based, expert systems, logic | Encodes human knowledge as rules |
| Machine learning | Supervised, unsupervised, reinforcement, semi-supervised | Learns patterns from data |
| Deep learning | Neural networks with many layers (CNN, RNN, transformer) | Learns representations from raw data |
| Generative AI | LLMs (GPT, Claude, Gemini), diffusion models | Produces new content (text, image, code) |

A second cut, by capability:

Tip: AI by capability

| Type | Description |
| --- | --- |
| Narrow / Weak AI | Specialised system; today's reality (ChatGPT, AlphaFold) |
| General AI (AGI) | Human-level competence across all cognitive domains; aspirational |
| Super AI | Beyond human-level intelligence; speculative |

86.2 Machine Learning Paradigms

The three classical paradigms differ in what the algorithm is told:

```mermaid
flowchart LR
    A[Machine learning] --> B[Supervised learning]
    A --> C[Unsupervised learning]
    A --> D[Reinforcement learning]
    B --> B1[Regression]
    B --> B2[Classification]
    C --> C1[Clustering]
    C --> C2[Dimensionality reduction]
    D --> D1[Policy learning via reward]
```

Supervised learning uses labelled data — the algorithm sees input-output pairs and learns the mapping (linear / logistic regression, decision tree, random forest, SVM, neural network). Unsupervised learning finds structure in unlabelled data (k-means, DBSCAN, hierarchical clustering, PCA). Reinforcement learning trains an agent through rewards and penalties in an environment (Q-learning, policy gradient; DeepMind’s AlphaGo).
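The supervised case can be made concrete with a deliberately tiny sketch: a nearest-centroid "spam scorer" in plain Python, where training averages a single spam-likelihood feature per class and prediction picks the nearer class centroid. The feature values and function names here are invented for illustration; a real system would use a library such as scikit-learn.

```python
# Toy supervised learning: labelled (feature, label) pairs in,
# a learned mapping out. "Training" averages the feature per class;
# "prediction" picks the class whose centroid is closest.

def train(pairs):
    sums, counts = {}, {}
    for x, y in pairs:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}  # one centroid per class

def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Labelled training data: a single "spamminess" feature per email.
data = [(0.9, "spam"), (0.8, "spam"), (0.1, "ham"), (0.2, "ham")]
model = train(data)
print(predict(model, 0.85))  # spam
print(predict(model, 0.15))  # ham
```

An unsupervised method would receive the same feature values without the labels and have to discover the two clusters itself.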

86.3 Deep Learning Architectures

| Architecture | Best suited for |
| --- | --- |
| Convolutional Neural Network (CNN) | Image, video, spatial data |
| Recurrent Neural Network / LSTM / GRU | Time series, speech |
| Transformer | Sequence-to-sequence, language, vision |
| Generative Adversarial Network (GAN) | Image generation, deepfakes |
| Diffusion model | Text-to-image (Stable Diffusion, Imagen) |
| Graph Neural Network | Network / relational data |

The transformer architecture (2017) underpins today’s Large Language Models (LLMs) — GPT, Claude, Gemini, Llama — and the broader category of foundation models whose pre-training on vast unlabelled corpora is followed by task-specific fine-tuning or in-context prompting.
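The computational core of the transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V: each query scores every key, the scores become weights via softmax, and the output is the corresponding weighted average of the values. A minimal pure-Python sketch with toy matrices (no batching, no multiple heads, no learned projections):

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to each key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)           # attention weights, sum to 1
        # output row = convex combination of the value rows
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs: the query matches the
# first key more strongly, so the output leans toward the first value.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [0.0]]
out = attention(Q, K, V)
```

Real transformers apply this in parallel across many heads and stack it with feed-forward layers, but every layer reduces to this same weighted-average operation.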

86.4 Big Data: The V’s

Doug Laney’s original 3-V definition has expanded to 5–7 V’s in industry use:

Tip: The V’s of big data

| V | Meaning |
| --- | --- |
| Volume | Petabytes / exabytes of data |
| Velocity | Speed of generation and processing (streaming) |
| Variety | Structured + semi-structured + unstructured |
| Veracity | Trustworthiness; data quality |
| Value | Business worth extracted from data |
| Variability | Inconsistency / change of meaning over time |
| Visualisation | Communicability through dashboards and charts |

86.5 Big Data Architecture

A typical pipeline ingests, stores, processes, analyses and serves data:

```mermaid
flowchart LR
    A[Sources: IoT, web, mobile, ERP] --> B[Ingestion: Kafka, Flume]
    B --> C[Storage: HDFS, S3, NoSQL]
    C --> D[Processing: Spark, Flink, MapReduce]
    D --> E[Analytics: ML, OLAP, BI]
    E --> F[Visualisation: Tableau, PowerBI]
```

Foundational technologies include the Hadoop ecosystem (HDFS for distributed storage, MapReduce for batch processing, YARN for resource management), Apache Spark (in-memory cluster computing), NoSQL databases in four families — key-value (Redis, DynamoDB), document (MongoDB), column-family (Cassandra, HBase), and graph (Neo4j) — and streaming engines such as Kafka and Flink.
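The MapReduce model itself fits in a few lines: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A single-process word-count sketch (Hadoop and Spark distribute exactly these phases across a cluster; the documents here are made up):

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit (word, 1) for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all values by key across mappers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ai", "ai ai"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 1, 'ai': 3}
```

In a real cluster, map tasks run on the nodes that already hold the data blocks (data locality), and the shuffle moves intermediate pairs over the network; that shuffle is usually the expensive step.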

86.6 AI / Big Data Use-Cases in Business

Tip: Sector applications

| Function | Application |
| --- | --- |
| Marketing | Customer segmentation, propensity modelling, recommender systems |
| Operations | Predictive maintenance, demand forecasting, route optimisation |
| Finance | Credit scoring, fraud detection, algorithmic trading |
| HR | Resume screening, attrition prediction, sentiment analysis |
| Strategy | Competitive intelligence, scenario simulation, business war gaming |

86.7 Ethics, Bias and Governance

AI systems can amplify bias present in their training data: Amazon's experimental recruiting tool learned to down-rank résumés associated with women, and the COMPAS recidivism-risk tool used in US courts was found to produce racially skewed error rates. Frameworks for responsible AI (fairness, accountability, transparency, explainability) are formalised in:

  • OECD AI Principles, 2019 (the first inter-governmental standard).
  • EU Artificial Intelligence Act, 2024 (a four-tier risk-based law: unacceptable / high / limited / minimal risk).
  • NITI Aayog's National Strategy for Artificial Intelligence, 2018, and India's Digital Personal Data Protection (DPDP) Act, 2023 (a broader Digital India Act remains at the proposal stage).
  • NIST AI Risk Management Framework, 2023.

The standard tests for AI fairness are demographic parity, equalised odds, equal opportunity, and counterfactual fairness; explainability tools include LIME and SHAP.
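Demographic parity, the simplest of these tests, asks whether the positive-prediction rate is the same across protected groups. A toy check in plain Python (the predictions and group labels below are invented for illustration):

```python
def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rate between groups.

    preds:  0/1 model predictions, one per example
    groups: protected-group membership, one label per example
    A gap of 0 means perfect demographic parity.
    """
    rates = {}
    for g in set(groups):
        group_preds = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(group_preds) / len(group_preds)
    return max(rates.values()) - min(rates.values())

# Group A receives positive predictions at 3/4, group B at 1/4.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_gap(preds, groups)
print(gap)  # 0.5
```

Equalised odds and equal opportunity refine this by conditioning on the true label (comparing error rates rather than raw prediction rates), which is why a model can satisfy one criterion while violating another.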

86.8 Practice Questions

Q 01 AI history Easy

The term “artificial intelligence” was coined at which 1956 event?

  • A. Turing’s 1950 paper
  • B. Dartmouth Conference
  • C. Royal Society Lecture
  • D. Bell Labs Symposium
Correct Option: B
John McCarthy coined the term “artificial intelligence” at the Dartmouth Summer Research Project on AI in 1956, organised with Marvin Minsky, Nathaniel Rochester, and Claude Shannon.

Q 02 3 V’s Easy

Doug Laney’s original 3-V framework for big data (2001) refers to:

  • A. Volume, Variety, Velocity
  • B. Volume, Veracity, Value
  • C. Variety, Velocity, Veracity
  • D. Volume, Validity, Value
Correct Option: A
Laney’s META Group note (2001) defined big data along Volume, Velocity, and Variety. Veracity, Value and others were added later.

Q 03 ML paradigms Medium

A spam filter trained on emails labelled as “spam” or “not spam” is an example of:

  • A. Supervised learning
  • B. Unsupervised learning
  • C. Reinforcement learning
  • D. Self-supervised learning
Correct Option: A
Labelled training data with binary outcomes makes spam filtering a supervised binary-classification task.

Q 04 Deep learning Medium

The transformer architecture, the basis of modern LLMs, was introduced in which 2017 paper?

  • A. “ImageNet Classification with Deep CNNs”
  • B. “Attention Is All You Need”
  • C. “Generative Adversarial Networks”
  • D. “Deep Residual Learning”
Correct Option: B
Vaswani et al., “Attention Is All You Need” (2017), introduced the transformer architecture that powers GPT, Claude, BERT, and modern LLMs.

Q 05 Hadoop Easy

HDFS in the Hadoop ecosystem is a:

  • A. Distributed processing framework
  • B. Distributed file system
  • C. NoSQL database
  • D. Streaming engine
Correct Option: B
Hadoop Distributed File System stores very large files across commodity nodes; MapReduce / Spark process the data sitting on HDFS.

Q 06 Capability Medium

ChatGPT or any current LLM-based assistant is best described as:

  • A. Narrow AI
  • B. Artificial General Intelligence
  • C. Super AI
  • D. Symbolic AI
Correct Option: A
Despite their breadth, current LLMs are narrow AI — capable in language and reasoning tasks but not exhibiting the full general competence required of AGI.

Q 07 EU AI Act Hard

The EU AI Act adopted in 2024 follows which structural approach?

  • A. Sector-based
  • B. Risk-based, four-tier
  • C. Self-regulation
  • D. Outcome-based
Correct Option: B
The EU AI Act classifies AI systems into four risk tiers — unacceptable, high, limited, minimal — with proportionate obligations.

Q 08 Match the following Hard

Match the technology with its category:

(P) CNN (1) Streaming engine
(Q) MongoDB (2) Image recognition
(R) Kafka (3) Document NoSQL
(S) SHAP (4) Explainability tool
  • A. P-2, Q-3, R-1, S-4
  • B. P-2, Q-1, R-3, S-4
  • C. P-3, Q-2, R-1, S-4
  • D. P-1, Q-3, R-2, S-4
Correct Option: A
CNN — image recognition; MongoDB — document NoSQL; Kafka — streaming engine; SHAP — explainability for ML.
Important: Quick recall
  • “AI” coined at the Dartmouth Conference, 1956; transformer paper “Attention Is All You Need” (2017) underpins LLMs.
  • Three ML paradigms: supervised, unsupervised, reinforcement; deep learning is a subset of ML.
  • Big data 3-V (Laney 2001): Volume, Velocity, Variety; later 5–7-V extensions add Veracity, Value.
  • Hadoop = HDFS + MapReduce + YARN; Spark is in-memory; NoSQL is non-relational (key-value, document, column, graph).
  • Governance: EU AI Act 2024 (4 tiers), OECD Principles 2019, NIST AI RMF 2023, NITI Aayog NSAI 2018, India DPDP 2023.