Artificial Intelligence, Machine Learning, and
Deep Learning
IIT ROPAR Minor In AI
21 March, 2025
Contents
1 Introduction to AI, Machine Learning, and Deep Learning
1.1 AI: Mimicking Human Intelligence
1.2 Machine Learning: Learning from Data without Explicit Coding
1.3 Deep Learning: Inspired by Human Brain, Uses Neural Networks
2 Phases of AI: Rule-based, Predictive, Generative, Agentic
2.1 Rule-based AI (1950s-1990s)
2.2 Predictive AI (1990s-2010s)
2.3 Generative AI (2010s-Present)
2.4 Agentic AI (Emerging)
3 Deep Learning Basics
3.1 Inspiration from Human Brain Neurons
3.2 Perceptrons and Multi-layer Neural Networks
3.3 Convolutional Neural Networks (CNN) for Image Processing
3.4 Recurrent Neural Networks (RNN) for Sequential Data
4 Transformers and Attention Mechanism
4.1 Google’s “Attention is All You Need” Paper (2017)
4.2 Self-attention and Parallel Processing Capabilities
5 Large Language Models (LLMs)
5.1 Training Process: Pre-training, Post-training, Reinforcement Learning
5.1.1 Pre-training
5.1.2 Post-training (Fine-tuning)
5.1.3 Reinforcement Learning from Human Feedback (RLHF)
5.2 Applications and Limitations
5.2.1 Applications
5.2.2 Limitations
6 Prompt Engineering
6.1 Types: Zero-shot, Few-shot, Chain of Thought
6.1.1 Zero-shot Learning
6.1.2 Few-shot Learning
6.1.3 Chain of Thought (CoT)
6.2 Components: Instruction, Context, Input Data, Output Indicator
7 Future Developments
7.1 Agentic AI and Autonomous Agents
7.2 Debates on AI Capabilities and Potential Risks
7.2.1 Alignment and Safety
7.2.2 Scaling Laws and Emergent Abilities
8 Evaluation of LLMs
8.1 Code-based, Human Evaluation, LLM as Judge
8.1.1 Code-based Evaluation
8.1.2 Human Evaluation
8.1.3 LLM as Judge
8.2 Concepts like Distillation and Mixture of Experts
8.2.1 Knowledge Distillation
8.2.2 Mixture of Experts (MoE)
1 Introduction to AI, Machine Learning, and Deep Learning
1.1 AI: Mimicking Human Intelligence
Artificial Intelligence (AI) refers to systems designed to perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation.
Historical Context
The term “Artificial Intelligence” was coined by John McCarthy in 1956
at the Dartmouth Conference, which is considered the founding event
of AI as a field. Early AI systems were predominantly rule-based and
focused on symbolic reasoning.
Case Study: IBM’s Deep Blue
The 1997 chess match between IBM’s Deep Blue and world champion Garry Kasparov represented an early milestone in AI. Deep Blue used a combination of brute-force computation and sophisticated evaluation functions to defeat Kasparov, demonstrating how machines could outperform humans in specific domains through approaches different from human cognition.
AI System = {Task Performance, Learning Capability, Adaptability, Reasoning Mechanisms} (1)
1.2 Machine Learning: Learning from Data without Explicit Coding
Machine Learning (ML) is a subset of AI that focuses on building systems
that can learn from and make decisions based on data, without being explicitly
programmed for specific tasks.
f : X → Y (2)
Where X represents input data and Y represents output predictions. The
function f is learned from training data rather than being explicitly defined.
Key ML Paradigms:
• Supervised Learning: Training on labeled data
• Unsupervised Learning: Finding patterns in unlabeled data
• Reinforcement Learning: Learning through interaction with an environment
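As a minimal sketch of the supervised paradigm, a model learns the mapping f : X → Y from labeled examples. The toy data and the choice of scikit-learn below are illustrative assumptions, not part of the original text:

from sklearn.linear_model import LogisticRegression

# Toy labeled data (hypothetical): features are [hours studied, hours slept],
# labels are 1 = passed, 0 = failed.
X = [[2, 9], [1, 5], [5, 7], [8, 8], [3, 4], [9, 6]]
Y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X, Y)                   # f : X -> Y is learned from the data

print(model.predict([[6, 7]]))    # apply the learned f to an unseen input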
Case Study: Netflix Recommendation System
Netflix employs machine learning algorithms to analyze user viewing history,
ratings, and preferences to recommend content. This system processes billions
of data points to learn patterns that predict which shows a user might enjoy,
demonstrating how ML can create personalized experiences at scale. The recommendation system combines collaborative filtering (comparing user behavior
with similar users) and content-based methods (analyzing show attributes).
1.3 Deep Learning: Inspired by Human Brain, Uses Neural Networks
Deep Learning is a subset of machine learning that uses neural networks with multiple layers (hence “deep”) to progressively extract higher-level features from raw input.
y = σ(w · x + b) (3)
Where σ is an activation function, w represents weights, x is the input, and
b is a bias term.
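Equation (3) maps directly to a few lines of NumPy; the weight, input, and bias values below are arbitrary illustrations:

import numpy as np

def sigmoid(z):
    # a common choice for the activation function sigma
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3, 0.8])    # weights (illustrative)
x = np.array([1.0, 2.0, 0.5])     # input
b = 0.1                           # bias term

y = sigmoid(np.dot(w, x) + b)     # y = sigma(w . x + b), Equation (3)
print(y)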
Case Study: AlphaFold
DeepMind’s AlphaFold represents a breakthrough application of deep learning in protein structure prediction. Prior to AlphaFold, determining protein
structures was an enormously time-consuming laboratory process. AlphaFold
uses deep neural networks trained on known protein structures to predict the
three-dimensional structure of proteins from their amino acid sequences with
unprecedented accuracy, revolutionizing molecular biology and drug discovery.
2 Phases of AI: Rule-based, Predictive, Generative, Agentic
2.1 Rule-based AI (1950s-1990s)
Rule-based AI systems operate using explicitly programmed rules in the form
of if-then statements.
Listing 1: Example of Rule-Based AI in Prolog
% Facts
parent(john, mary).
parent(john, tom).
parent(mary, ann).

% Rules
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
Case Study: MYCIN
MYCIN was an early expert system developed at Stanford University in
the 1970s to diagnose infectious blood diseases and recommend antibiotics. It
contained approximately 600 rules that encoded the knowledge of infectious
disease experts. When tested, MYCIN performed at a level comparable to
specialists, demonstrating how explicit rules could capture expert knowledge.
2.2 Predictive AI (1990s-2010s)
Predictive AI uses statistical methods and machine learning to make predictions
based on patterns in data.
Case Study: Credit Scoring Models
Financial institutions use predictive AI to assess credit risk. These systems
analyze factors such as payment history, debt levels, and income to predict
the likelihood of loan repayment. Modern credit scoring systems use ensemble
methods combining multiple models (decision trees, logistic regression, etc.)
to improve prediction accuracy. These models have transformed lending by
enabling more objective, data-driven decisions.
2.3 Generative AI (2010s-Present)
Generative AI creates new content (text, images, audio, etc.) that resembles
human-created content.
P(x_t | x_{<t}) = softmax(W · h_t + b) (4)
Where x_t is the next token to be generated, x_{<t} represents the previous tokens, and h_t is the hidden state.
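A toy NumPy rendering of Equation (4): project the hidden state to vocabulary logits, apply softmax, and sample the next token. Dimensions and values are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5, 4

W = rng.normal(size=(vocab_size, hidden_dim))   # output projection
b = np.zeros(vocab_size)
h_t = rng.normal(size=hidden_dim)               # hidden state summarizing x_<t

logits = W @ h_t + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax gives P(x_t | x_<t)

next_token = rng.choice(vocab_size, p=probs)    # sample the next token
print(probs, next_token)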
Case Study: DALL-E
OpenAI’s DALL-E demonstrates the capabilities of generative AI in visual
domains. Given a text prompt like “an astronaut riding a horse in a photorealistic style,” DALL-E can generate original images that integrate these concepts.
This demonstrates how generative models can combine concepts in creative ways
never explicitly shown during training, exhibiting a form of artificial creativity.
2.4 Agentic AI (Emerging)
Agentic AI systems can operate autonomously, make decisions, and take actions
to achieve specified goals.
Agentic AI Framework
1. Perception: Understanding the environment
2. Planning: Determining action sequences
3. Execution: Implementing planned actions
4. Learning: Improving from experiences
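The four stages above can be sketched as a control loop. Everything in this skeleton (function names, the stopping convention) is a hypothetical illustration, not any real framework’s API:

def run_agent(goal, perceive, plan, act, max_steps=10):
    """Hypothetical perceive-plan-act-learn loop for an agentic system.

    The caller supplies perceive() -> observation, plan(goal, obs, memory)
    -> action or None, and act(action) -> result; only the control flow
    is fixed here.
    """
    memory = []                           # 4. Learning: accumulate experience
    for _ in range(max_steps):
        obs = perceive()                  # 1. Perception
        action = plan(goal, obs, memory)  # 2. Planning
        if action is None:                # planner signals the goal is reached
            break
        result = act(action)              # 3. Execution
        memory.append((action, result))
    return memory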
Case Study: AutoGPT
AutoGPT represents an early example of agentic AI application. It combines
large language models with the ability to use tools (web search, file operations,
etc.) and maintain a memory of past actions. Given a high-level objective
like “research the market for electric vehicles and write a report,” AutoGPT
can break this down into sub-tasks, execute them sequentially, and produce the
desired output with minimal human intervention, demonstrating autonomous
goal-directed behavior.
3 Deep Learning Basics
3.1 Inspiration from Human Brain Neurons
Artificial neural networks draw inspiration from the structure and function of
biological neurons in the human brain.
Biological Neuron → Artificial Neuron (5)
Dendrites → Input Weights (6)
Cell Body → Summation & Activation (7)
Axon → Output (8)
While artificial neurons are vast simplifications of biological neurons, they
capture the essential computational elements: receiving weighted inputs, integrating them, and producing an output if the integrated signal exceeds a
threshold.
3.2 Perceptrons and Multi-layer Neural Networks
The perceptron is the fundamental building block of neural networks.
z = Σ_{i=1}^{n} w_i x_i + b (9)
a = σ(z) (10)
Where w_i are weights, x_i are inputs, b is the bias, and σ is an activation function.
Multi-layer networks stack these units to create more complex architectures:
z^{[1]} = W^{[1]} x + b^{[1]} (11)
a^{[1]} = σ(z^{[1]}) (12)
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} (13)
a^{[2]} = σ(z^{[2]}) (14)
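Equations (11)-(14) as a NumPy forward pass, with layer sizes chosen arbitrarily for illustration:

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))     # activation function

rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # input vector

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1 parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2 parameters

z1 = W1 @ x + b1      # Equation (11)
a1 = sigma(z1)        # Equation (12)
z2 = W2 @ a1 + b2     # Equation (13)
a2 = sigma(z2)        # Equation (14)
print(a2)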
Case Study: XOR Problem
The XOR problem (exclusive OR) illustrates why multi-layer networks are
necessary. A single perceptron cannot solve the XOR problem because it’s
not linearly separable. However, a neural network with at least one hidden
layer can learn this function. This simple example demonstrates how adding
layers enables networks to represent increasingly complex functions and decision
boundaries.
3.3 Convolutional Neural Networks (CNN) for Image Processing
CNNs apply convolutional operations to extract spatial features from images.
(f ∗ g)(x, y) = Σ_m Σ_n f(m, n) g(x − m, y − n) (15)
Key Components:
• Convolutional layers: Extract features using learnable filters
• Pooling layers: Reduce dimensionality while preserving important information
• Fully connected layers: Final classification based on extracted features
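A naive NumPy version of the convolution in Equation (15), restricted to the valid output region (deep learning libraries typically implement the closely related cross-correlation; the kernel flip below follows the equation as written):

import numpy as np

def conv2d(f, g):
    """Naive 2D convolution of image f with kernel g (valid region only)."""
    kh, kw = g.shape
    g_flipped = g[::-1, ::-1]           # the kernel flip implied by g(x-m, y-n)
    out = np.zeros((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(f[y:y + kh, x:x + kw] * g_flipped)
    return out

image = np.arange(16.0).reshape(4, 4)   # toy "image"
kernel = np.array([[1.0, -1.0]])        # toy horizontal-difference filter
print(conv2d(image, kernel))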
Case Study: ResNet
Residual Networks (ResNet) addressed the problem of training very deep
CNNs by introducing skip connections that allow gradients to flow more easily
through the network. This innovation enabled the creation of networks with
over 100 layers that could be effectively trained. ResNet dramatically improved
image classification performance on the ImageNet dataset and became a foundational architecture for many computer vision applications.
3.4 Recurrent Neural Networks (RNN) for Sequential Data
RNNs process sequential data by maintaining a hidden state that captures information from previous timesteps.
h_t = σ(W_{xh} x_t + W_{hh} h_{t−1} + b_h) (16)
y_t = σ(W_{hy} h_t + b_y) (17)
LSTM (Long Short-Term Memory) networks address the vanishing gradient problem in traditional RNNs:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f) (18)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i) (19)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C) (20)
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t (21)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o) (22)
h_t = o_t ∗ tanh(C_t) (23)
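Equations (16)-(17) as a minimal NumPy recurrence over a toy sequence; shapes, values, and the choice of tanh for σ are illustrative:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2

W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hy = rng.normal(size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

h = np.zeros(hidden_dim)                     # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):  # a 5-step input sequence
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # Equation (16)
    y_t = np.tanh(W_hy @ h + b_y)                # Equation (17)
print(y_t)                                   # output at the final timestep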
Case Study: Neural Machine Translation
Google’s Neural Machine Translation (GNMT) system demonstrated the
power of RNNs in sequence-to-sequence learning. Prior to the transformer
architecture, GNMT used bidirectional LSTMs with attention mechanisms to
translate between languages. The system showed significant improvements over
phrase-based statistical methods, especially for grammatically complex language
pairs like English-Japanese, by capturing long-range dependencies and context.
4 Transformers and Attention Mechanism
4.1 Google’s “Attention is All You Need” Paper (2017)
The landmark paper by Vaswani et al. introduced the transformer architecture,
which revolutionized natural language processing by eliminating recurrence and
convolutions in favor of attention mechanisms.
Transformer Architecture
Key innovations:
• Self-attention mechanism
• Positional encoding
• Multi-head attention
• Feed-forward networks in each layer
4.2 Self-attention and Parallel Processing Capabilities
Self-attention computes relationships between all positions in a sequence:
Attention(Q, K, V) = softmax(QK^T / √d_k) V (24)
Where Q (queries), K (keys), and V (values) are derived from the input
sequence.
Multi-head attention computes attention multiple times in parallel:
MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O (25)
Where each head performs attention with different linear projections.
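Equation (24) in a few lines of NumPy (a single head with toy shapes; real implementations add masking and learned projections for Q, K, and V):

import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (24)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)   # (4, 8): one output per sequence position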
Case Study: BERT
Google’s Bidirectional Encoder Representations from Transformers (BERT)
demonstrated the power of transformer architectures for language understanding. BERT is pre-trained using masked language modeling and next sentence prediction objectives on a large corpus of text. When fine-tuned on specific tasks, BERT achieved state-of-the-art results on a wide range of natural language understanding benchmarks. Its bidirectional attention mechanism allows
it to consider context from both directions, improving performance on tasks like
question answering and sentiment analysis.
Parallel Processing Advantage:
Unlike RNNs, which process sequences element by element, transformers
process entire sequences in parallel:
RNN Complexity = O(n) sequential operations (26)
Transformer Complexity = O(1) sequential operations (27)
This parallelization enables efficient training on modern GPU hardware, allowing for much larger models.
5 Large Language Models (LLMs)
5.1 Training Process: Pre-training, Post-training, Reinforcement Learning
5.1.1 Pre-training
During pre-training, models learn general language patterns from vast amounts
of text.
Pre-training Scale
Modern LLMs are trained on:
• Hundreds of billions to trillions of tokens
• Diverse sources: books, websites, code, research papers
• Months of computation on thousands of GPUs
5.1.2 Post-training (Fine-tuning)
After pre-training, models are adapted for specific capabilities:
• Supervised Fine-tuning (SFT): Using human-created demonstrations
• Instruction Tuning: Teaching models to follow user instructions
5.1.3 Reinforcement Learning from Human Feedback (RLHF)
RLHF aligns model outputs with human preferences:
L_RLHF = E_{x∼D} [ r_ϕ(x, y) − β log( p_θ(y|x) / p_ref(y|x) ) ] (28)
Where r_ϕ is a learned reward model based on human preferences, and the second term is a KL-divergence penalty to prevent excessive deviation from the reference model.
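The quantity inside the expectation in Equation (28) is easy to compute per sample; the reward and log-probabilities below are made-up illustrative numbers:

def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus the KL-style penalty, per Equation (28)."""
    # beta * log(p_theta(y|x) / p_ref(y|x)) = beta * (log-prob difference)
    return reward - beta * (logp_policy - logp_ref)

# Hypothetical values: reward-model score for a response, and the response's
# log-probability under the current policy and the frozen reference model.
print(rlhf_objective(reward=0.8, logp_policy=-12.3, logp_ref=-12.9))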
Case Study: ChatGPT
OpenAI’s ChatGPT illustrates the full LLM training pipeline. Starting with
a GPT architecture pre-trained on a diverse text corpus, it underwent instruction tuning to follow user directions and RLHF to align with human preferences.
This process transformed a general text prediction model into an assistant that
could respond helpfully to user queries, follow instructions, and generate more
useful, safe, and truthful responses. Its capabilities and limitations demonstrate
both the potential and challenges of current LLM technology.
5.2 Applications and Limitations
5.2.1 Applications
• Content Generation: Writing, summarization, translation
• Code Assistance: Generating, explaining, and debugging code
• Conversational AI: Customer service, digital assistants
• Information Extraction: Analyzing documents, reports
5.2.2 Limitations
Hallucination: LLMs can generate plausible-sounding but factually incorrect
information.
Example of Hallucination
When asked about obscure topics, LLMs may confidently generate fictional information, such as inventing non-existent research papers or creating false historical events.
Knowledge Cutoff: LLMs cannot know about events after their training
data ends.
Knowledge Access = { Comprehensive for t < t_cutoff; None for t > t_cutoff } (29)
Context Length: LLMs have a finite window of text they can process at
once.
Maximum Context = n tokens (30)
Where n has increased from about 2,048 in early models to 128,000+ in
recent architectures.
Case Study: Mitigating Limitations in Claude
Anthropic’s Claude demonstrates approaches to addressing LLM limitations.
To reduce hallucinations, Claude was trained using constitutional AI methods
that encourage the model to express uncertainty rather than confabulate when
asked about topics outside its knowledge base. To overcome context limitations,
Claude implements techniques for efficient context compression and retrieval, allowing it to process longer documents while maintaining coherent understanding.
6 Prompt Engineering
6.1 Types: Zero-shot, Few-shot, Chain of Thought
6.1.1 Zero-shot Learning
The model performs tasks without specific examples:
Listing 2: Zero-shot Prompt
Classify the following text as either positive or negative:

"The service at this restaurant was terrible and the food was cold."
6.1.2 Few-shot Learning
Providing examples helps the model understand the desired pattern:
Listing 3: Few-shot Prompt
Classify reviews as positive or negative:

Review: "Amazing food and excellent service!"
Sentiment: Positive

Review: "Waited an hour and the food was bland."
Sentiment: Negative

Review: "The ambiance was nice but overpriced for what you get."
Sentiment:
6.1.3 Chain of Thought (CoT)
Encouraging step-by-step reasoning improves performance on complex tasks:
Listing 4: Chain of Thought Prompt
Question: If a store has 10 apples, sells 3 to customer A and 4 to customer B, and then buys 5 more, how many apples does it have now?

Let's think through this step by step:
1. The store starts with 10 apples.
2. It sells 3 apples to customer A, leaving 10 - 3 = 7 apples.
3. It sells 4 apples to customer B, leaving 7 - 4 = 3 apples.
4. It buys 5 more apples, giving it 3 + 5 = 8 apples total.

Therefore, the store has 8 apples now.
Case Study: GSM8K Math Problems
Research on the GSM8K benchmark (grade school math problems) demonstrates the dramatic improvement in performance achieved through chain-of-thought prompting. Without CoT, even large language models struggle with
multi-step reasoning problems. With CoT prompting, performance improved by
20-40 percentage points across various model sizes, highlighting how the right
prompting strategy can unlock capabilities already present in the model.
6.2 Components: Instruction, Context, Input Data, Output Indicator
Effective prompts typically include:
Prompt Components
1. Instruction: Clear directions about the task
2. Context: Background information or constraints
3. Input Data: The specific content to process
4. Output Indicator: Format or style specifications
Example of a Structured Prompt:
Listing 5: Structured Prompt Components
# INSTRUCTION
Summarize the following medical research abstract in simple terms that a patient can understand.

# CONTEXT
This is for a patient education website. The audience has no medical background.

# INPUT DATA
[Research abstract text here]

# OUTPUT INDICATOR
Your summary should be 3-5 short paragraphs. Include a one-sentence "Key Takeaway" at the end.
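The four components lend themselves to a small template helper; this builder and its argument names are hypothetical conveniences, not any established API:

def build_prompt(instruction, context, input_data, output_indicator):
    """Assemble a structured prompt from its four components."""
    return "\n\n".join([
        f"# INSTRUCTION\n{instruction}",
        f"# CONTEXT\n{context}",
        f"# INPUT DATA\n{input_data}",
        f"# OUTPUT INDICATOR\n{output_indicator}",
    ])

print(build_prompt(
    instruction="Summarize the abstract in simple terms.",
    context="Patient education website; audience has no medical background.",
    input_data="[Research abstract text here]",
    output_indicator="3-5 short paragraphs plus a one-sentence key takeaway.",
))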
Case Study: Legal Document Analysis
Law firms use structured prompts to extract specific information from contracts. By providing clear instructions (e.g., “Identify all payment terms and obligations”), relevant context (e.g., “This is for a procurement contract review”), specific input data (the contract text), and output indicators (e.g., “Format as a table with clause references”), they achieve consistent, structured
outputs that can be directly incorporated into legal workflows, demonstrating
how well-crafted prompts can turn LLMs into specialized information extraction
tools.
7 Future Developments
7.1 Agentic AI and Autonomous Agents
Agentic AI systems combine LLMs with:
• Planning: Breaking down complex goals into subtasks
• Memory: Maintaining information across interactions
• Tool Use: Leveraging external capabilities (APIs, databases, etc.)
• Self-Improvement: Learning from successes and failures
Agent Architecture = {Perception Module, Memory System, Planning Engine, Action Execution, Learning Mechanism} (31)
Case Study: BabyAGI
BabyAGI demonstrates simple but powerful agentic capabilities. Given a
high-level task like “Research investment opportunities in renewable energy,”
it autonomously creates subtasks, executes them in a reasonable order, utilizes
tools like web search and document analysis, and compiles findings into a coherent output. While limited compared to human researchers, its ability to work
autonomously toward complex goals illustrates the direction of agent-based AI
systems.
7.2 Debates on AI Capabilities and Potential Risks
7.2.1 Alignment and Safety
As AI systems become more capable, ensuring they act in accordance with
human values becomes increasingly important:
Alignment Gap = AI Capability − Alignment Level (32)
7.2.2 Scaling Laws and Emergent Abilities
Research suggests that capabilities may emerge non-linearly as models scale:
Performance ≈ C · (Compute)^α · (Data)^β · (Parameters)^γ (33)
Case Study: Frontier Model Research
Research by organizations like Anthropic on frontier models has revealed
surprising emergent capabilities. As models scaled beyond certain thresholds,
they suddenly demonstrated abilities not observed in smaller versions, such as
multi-step reasoning, code generation, and creative problem-solving. These discontinuous improvements suggest that further scaling may unlock additional capabilities that are difficult to predict in advance, highlighting both the potential and uncertainty in continued AI advancement.
8 Evaluation of LLMs
8.1 Code-based, Human Evaluation, LLM as Judge
8.1.1 Code-based Evaluation
Automated metrics provide objective but limited assessment:
• BLEU, ROUGE: Lexical overlap with reference texts
• Perplexity: Probability assigned to correct tokens
• Task-specific Metrics: Accuracy, F1 score, etc.
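As a concrete example of an automated metric, perplexity follows directly from the per-token log-probabilities the model assigns to the correct tokens (the values below are made up):

import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability of the correct tokens)."""
    return float(np.exp(-np.mean(token_log_probs)))

# Hypothetical log-probabilities for the five tokens of a reference text
log_probs = np.log([0.40, 0.25, 0.60, 0.10, 0.33])
print(perplexity(log_probs))   # lower is better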
8.1.2 Human Evaluation
Human judgments capture nuanced quality aspects:
Human Evaluation = {Helpfulness, Accuracy, Safety, Quality, Bias} (34)
8.1.3 LLM as Judge
Using stronger models to evaluate outputs:
Listing 6: LLM-as-Judge Prompt Template
Rate the quality of the following response to the given query:

Query: [User query]
Response: [Model response]

Score from 1-10 on:
- Relevance to query
- Factual accuracy
- Completeness
- Clarity
- Helpfulness

Provide justification for each score.
Case Study: MMLU Benchmark
The Massive Multitask Language Understanding (MMLU) benchmark evaluates models across 57 subjects ranging from elementary mathematics to professional medicine. This comprehensive evaluation reveals both strengths and weaknesses in model capabilities across different domains of knowledge. Recent models achieve human expert-level performance in some categories while still struggling in others, providing a nuanced picture of progress and the remaining challenges in language model development.
8.2 Concepts like Distillation and Mixture of Experts
8.2.1 Knowledge Distillation
Transferring knowledge from larger to smaller models:
L_distill = α L_task + (1 − α) L_KD (35)
Where L_KD measures the divergence between the student and teacher model outputs.
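A sketch of Equation (35), using cross-entropy on the true label as L_task and the KL divergence between teacher and student output distributions as L_KD; α and the logits are illustrative choices:

import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, alpha=0.5):
    """L_distill = alpha * L_task + (1 - alpha) * L_KD, Equation (35)."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    l_task = -np.log(p_s[true_label])        # cross-entropy on the true label
    l_kd = np.sum(p_t * np.log(p_t / p_s))   # KL(teacher || student)
    return alpha * l_task + (1 - alpha) * l_kd

print(distillation_loss(np.array([2.0, 0.5, -1.0]),
                        np.array([1.5, 1.0, -0.5]), true_label=0))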
8.2.2 Mixture of Experts (MoE)
Combining specialized sub-networks:
y = Σ_{i=1}^{n} g(x, i) · f_i(x) (36)
Where g(x, i) is a gating function determining how much expert f_i contributes to the output.
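Equation (36) with a softmax gate over toy scalar "experts"; in a real MoE layer each expert is a learned sub-network and the gate keeps only the top-k experts (the functions and gating weights here are stand-ins):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy "experts": stand-ins for learned sub-networks f_i.
experts = [lambda x: 2 * x, lambda x: x ** 2, lambda x: x - 1]

rng = np.random.default_rng(0)
w_gate = rng.normal(size=len(experts))    # gating parameters (illustrative)

def moe(x):
    g = softmax(w_gate * x)               # g(x, i): input-dependent gate weights
    # A sparse MoE (e.g., Switch Transformer) would zero all but the top-k g[i].
    return sum(g[i] * f(x) for i, f in enumerate(experts))   # Equation (36)

print(moe(1.5))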
Case Study: Google’s Switch Transformer
Google’s Switch Transformer demonstrated the efficiency gains possible with
MoE architectures. By using a sparse mixture of experts approach where only
a subset of experts process each input token, the model achieved performance
comparable to dense models with significantly less computation during inference.
This approach enables larger effective model sizes while maintaining reasonable
training and deployment costs, potentially offering a more efficient scaling path
than simply increasing dense model parameters.
Benefits of MoE:
• Computational efficiency through sparse activation
• Specialization of different components for different subtasks
• Capacity scaling without proportional computation increase
Parameters in MoE ≫ Parameters used per forward pass (37)
Effective Capacity ≈ Experts × Parameters per Expert (38)
Computation ≈ Active Experts × Parameters per Expert (39)