
MODULE 01

Q1. Write a short note on Intelligent Agent.

Ans: An intelligent agent (IA) is a system that perceives its environment through sensors and acts upon it
using actuators to achieve specific objectives. These agents can be human, robotic, or software-based
and are designed to operate autonomously while making intelligent decisions based on their
observations.

Categories of Agents

Intelligent agents can be classified into three main categories:

1. Human Agents
○ Humans are natural intelligent agents.
○ They perceive the environment through senses (eyes, ears, skin) and act through muscles,
speech, and movement.
○ Example: A driver who senses road conditions and reacts accordingly.
2. Robotic Agents
○ These agents are physical machines that interact with the real world.
○ They use sensors (cameras, infrared, gyroscopes) to perceive their environment and
actuators (motors, arms, wheels) to perform actions.
○ Example: A self-driving car uses sensors like LiDAR and cameras to navigate roads and
avoid obstacles.
3. Software Agents
○ These agents exist in a digital environment and interact with software and databases.
○ They use algorithms and programming logic to make intelligent decisions.
○ Example: A chatbot answering customer queries on a website.

Types of Intelligent Agents

Intelligent agents are further categorized based on their decision-making mechanisms:

1. Simple Reflex Agents


○ Act only based on the current percept without considering past experiences.
○ Follow condition-action rules (if-then statements).
○ Example: A thermostat that turns on heating when the temperature drops below a set
value (see the sketch after this list).
2. Model-Based Reflex Agents
○ Maintain an internal model of the world to handle partial observability (when they do not
have complete information).
○ Use the model to infer missing details and take better actions.
○ Example: A vacuum cleaner that remembers obstacles and adjusts its path accordingly.
3. Goal-Based Agents
○ Make decisions based on predefined goals rather than just reacting to percepts.
○ Evaluate different possible actions to achieve the goal.
○ Example: A GPS navigation system that finds the best route to a destination.
4. Utility-Based Agents
○ Similar to goal-based agents but also consider the degree of success in achieving the goal.
○ Optimize actions to maximize overall utility (performance, efficiency, user satisfaction).
○ Example: A recommendation system suggesting movies based on user preferences.
5. Learning Agents
○ Improve their performance over time by learning from experience.
○ Use techniques like machine learning to refine their decision-making process.
○ Example: AI in games that adapts to a player's behavior and adjusts difficulty accordingly.
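
As a concrete illustration of the condition-action rules used by a simple reflex agent, here is a minimal Python sketch of the thermostat example. The 20-degree set point and the function name are illustrative assumptions, not part of the original text:

# Minimal simple reflex agent: acts only on the current percept.
# The set point (20 degrees) and names are illustrative assumptions.
def thermostat_agent(current_temperature: float, set_point: float = 20.0) -> str:
    """Condition-action rule: if temperature is below the set point, heat."""
    if current_temperature < set_point:   # condition
        return "turn heating ON"          # action
    return "turn heating OFF"

# Example percepts -> actions
for temp in [15.0, 22.5]:
    print(temp, "->", thermostat_agent(temp))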

PEAS Representation

PEAS (Performance measure, Environment, Actuators, Sensors) is a framework used to describe an
agent’s functionality.

Example: Self-Driving Car

● Performance Measure:
○ Safety, speed, fuel efficiency, passenger comfort, adherence to traffic rules.
● Environment:
○ Roads, traffic signals, pedestrians, weather conditions, other vehicles.
● Actuators:
○ Steering wheel, accelerator, brake, indicators, wipers.
● Sensors:
○ Cameras, GPS, LiDAR, radar, speed sensors, proximity sensors.

Using PEAS helps in designing intelligent agents by clearly defining their capabilities and limitations.

Q2. Write a short note on First Order Logic.

Ans: Propositional Logic and First-Order Logic in AI

Logic plays a crucial role in Artificial Intelligence (AI) by enabling machines to represent and reason
about knowledge. The two main types of logic used in AI are Propositional Logic (PL) and First-Order
Logic (FOL).

1. Propositional Logic (PL)

Propositional Logic, also known as Boolean Logic, deals with propositions (statements that are either
true or false). It uses logical connectives like:

● AND ( ∧ ) – Both conditions must be true.


● OR ( ∨ ) – At least one condition must be true.
● NOT ( ¬ ) – Negates a statement.
● IMPLIES ( → ) – If one statement is true, another follows.
Example:

● P: "It is raining."
● Q: "The ground is wet."
● A rule: P → Q ("If it is raining, then the ground is wet").

PL is limited because it cannot handle relationships or quantify statements like "All humans are mortal."

2. First-Order Logic (FOL)

First-Order Logic (FOL), also called Predicate Logic, extends PL by introducing:

● Objects (e.g., Alice, Bob)


● Relations (e.g., Loves(Alice, Bob))
● Quantifiers:
○ Universal ( ∀ ) – "For all" (e.g., ∀x (Human(x) → Mortal(x)))
○ Existential ( ∃ ) – "There exists" (e.g., ∃x Loves(x, Alice))

Example:

● Statement: "All humans are mortal."


● FOL Representation: ∀x (Human(x) → Mortal(x))
○ Meaning: For every x, if x is a Human, then x is Mortal.
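
To make the quantified formula concrete, here is a small sketch that checks ∀x (Human(x) → Mortal(x)) over a finite domain in Python. The domain and the predicate tables are made-up assumptions for illustration:

# Finite-domain check of the FOL sentence  ∀x (Human(x) → Mortal(x)).
# The domain and the predicate extensions below are illustrative assumptions.
domain = {"Socrates", "Plato", "Fido"}
human = {"Socrates", "Plato"}            # Human(x) holds for these objects
mortal = {"Socrates", "Plato", "Fido"}   # Mortal(x) holds for these objects

# Implication P -> Q is equivalent to (not P) or Q.
all_humans_mortal = all((x not in human) or (x in mortal) for x in domain)
print(all_humans_mortal)  # True: every human in the domain is mortal

# Existential example:  ∃x Mortal(x)
print(any(x in mortal for x in domain))  # True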

Q3. Write the characteristics of problems?

Ans: In AI & ML, solving problems efficiently requires proper problem formulation, representation, and
search strategies.

1. Definition of a Problem in AI & ML

A problem is a computational task that needs to be solved using algorithms and models. It involves a
well-defined initial state, a goal state, and a set of possible actions to transition between states.

Example:

● In chess, the initial state is the board setup, the goal state is checkmating the opponent, and the
actions are valid chess moves.
● A classification problem aims to correctly label data points based on input features.

2. Problem Space

The problem space refers to all possible states and actions that can be taken to reach the goal from the
initial state. It includes:

● Initial State – The starting condition of the problem.


● Goal State – The desired solution.
● Operators – Actions that transition between states.
● Path Cost – A measure of efficiency (e.g., shortest path, lowest error).

Example:

● In a maze-solving AI, the problem space consists of all possible paths from the start to the exit.
● In ML, the problem space consists of all possible models and parameters that can be used to
minimize loss.

3. Characteristics of a Problem in AI & ML

1. Fully vs. Partially Observable


○ Fully Observable: All necessary information is available (e.g., chess).
○ Partially Observable: Some information is missing (e.g., poker).

2. Deterministic vs. Stochastic


○ Deterministic: Future states are predictable (e.g., tic-tac-toe).
○ Stochastic: Outcomes have randomness (e.g., stock market prediction).

3. Discrete vs. Continuous


○ Discrete: Finite number of states (e.g., Sudoku).
○ Continuous: Infinite possible states (e.g., robot motion planning).

4. Single-Agent vs. Multi-Agent


○ Single-Agent: One decision-making entity (e.g., pathfinding in GPS).
○ Multi-Agent: Multiple interacting entities (e.g., autonomous vehicles).

5. Static vs. Dynamic


○ Static: Environment doesn’t change while solving (e.g., crossword puzzle).
○ Dynamic: Environment changes over time (e.g., self-driving cars).

6. Episodic vs. Sequential


○ Episodic: Independent actions (e.g., image recognition).
○ Sequential: Current actions affect future outcomes (e.g., reinforcement learning).

Q4. Explain the types of Reasoning In artificial intelligence in detail?


Ans: Reasoning is a fundamental aspect of AI that enables machines to process information and draw
conclusions. There are six main types of reasoning used in AI: Deductive, Inductive, Abductive,
Common Sense, Monotonic, and Non-Monotonic.

1. Deductive Reasoning

● Definition: Deductive reasoning derives specific conclusions from general rules or premises. If
the premises are true, the conclusion must also be true.
● Approach: Top-down (General → Specific).
● Example:
○ Premises:
■ "All humans are mortal."
■ "Socrates is a human."

○ Conclusion: "Socrates is mortal."

● AI Applications:
○ Rule-based expert systems
○ Automated theorem proving
○ Formal logic programming

2. Inductive Reasoning

● Definition: Inductive reasoning infers general rules from specific observations. The conclusion is
probable but not always true.
● Approach: Bottom-up (Specific → General).
● Example:
○ "The sun has risen in the east every day so far."
○ "Therefore, the sun will rise in the east tomorrow."
● AI Applications:
○ Machine Learning (ML) (models learn from data patterns).
○ Data mining and predictive analytics.

3. Abductive Reasoning

● Definition: Abductive reasoning starts with an observation and seeks the most likely
explanation. It works with incomplete information.
● Approach: Best Hypothesis (Inference to the best explanation).
● Example:
○ "The grass is wet."
○ "It probably rained last night."
● AI Applications:
○ Medical Diagnosis Systems (If a patient has symptoms X, they might have disease Y).
○ Fault detection in engineering and IT systems.

4. Common Sense Reasoning

● Definition: This reasoning is based on general world knowledge and experience to make logical
conclusions.
● Approach: Human-like reasoning, intuitive knowledge.
● Example:
○ "If a glass falls on the floor, it will likely break."
● AI Applications:
○ Natural Language Processing (NLP)
○ Robotics (AI assistants interacting with humans)

5. Monotonic Reasoning

● Definition: In monotonic reasoning, once a fact is established, it cannot be changed, even if new
knowledge is added.
● Approach: Knowledge remains fixed.
● Example:
○ "All birds can fly."
○ "A sparrow is a bird."
○ Conclusion: "A sparrow can fly."
○ (But this does not allow exceptions like "Penguins cannot fly.")
● AI Applications:
○ Traditional rule-based expert systems.

6. Non-Monotonic Reasoning

● Definition: In non-monotonic reasoning, conclusions can change when new knowledge is
introduced.
● Approach: Knowledge can be revised.
● Example:
○ "All birds can fly."
○ "Penguins are birds."
○ Revised Conclusion: "Penguins cannot fly."
● AI Applications:
○ Dynamic AI systems, self-learning assistants, robotics.

Q5. Discuss simple and model-based reflex agents.

Ans: A Model-Based Reflex Agent is an intelligent agent that uses an internal model of the
environment to handle partial observability (when the agent does not have complete information). It
maintains an internal state that keeps track of past experiences to make better decisions.

1. Key Components of a Model-Based Reflex Agent

a) Model

● The model represents knowledge about how the world works.


● It helps the agent predict the consequences of its actions.
● Example: In a self-driving car, the model includes rules like "If a traffic light turns red, cars must
stop."

b) Internal State
● The internal state stores past perceptions to infer missing information.
● It helps the agent make decisions even when some sensor data is unavailable.
● Example: In a maze-solving robot, if a wall is detected earlier but no longer visible, the internal
state remembers its presence.

c) Decision-Making

● The agent uses the model and internal state to select appropriate actions.
● Condition-Action Rules guide its responses to different situations.
● Example: If a robot vacuum detects a dirty area, it stores this information and revisits it later.

2. Working of a Model-Based Reflex Agent

1. Sense the environment (via sensors).


2. Update the internal state using the model.
3. Use condition-action rules to decide the best action.
4. Perform the action using actuators.
5. Repeat the process continuously.

3. Example: Self-Driving Car

A self-driving car is a real-world example of a model-based reflex agent.

● Model: Traffic rules, road structure, weather conditions.


● Internal State: Previous locations, speed of nearby vehicles, detected obstacles.
● Decision-Making: If a pedestrian was detected earlier but is now hidden, the car slows down
based on the internal state.

4. Merits of Model-Based Reflex Agents

● Handles Partial Observability: Can work even when all information is not available.
● More Efficient: Uses memory and past experiences for better decision-making.
● Can Adapt to Changes: Adjusts actions based on new observations.

5. Demerits of Model-Based Reflex Agents

● Complex Implementation: Requires a well-defined model and memory management.


● Higher Computational Cost: Maintaining an internal state increases processing requirements.
● Not Always Perfect: If the model is incorrect or incomplete, the agent may make wrong
decisions.
MODULE 02

Q1. Explain A* Algorithm with a suitable example.

Ans: The A* algorithm is a widely used graph/tree search algorithm in AI for finding the shortest path
between a start node and a goal node. It is commonly used in robotics, game development, and
navigation systems.

1. Key Features of the A* Algorithm

● Combines Best-First Search & Dijkstra’s Algorithm – It balances exploration and optimization.
● Uses Heuristic Function – Helps in making informed decisions.
● Ensures Optimality – If the heuristic is admissible (i.e., never overestimates), A* always finds the
optimal solution.

2. Cost Functions in A*

A* uses two cost functions:

1. g(n) → Path cost from the start node to the current node (n).
2. h(n) → Heuristic function estimating the cost from n to the goal.
3. f(n) → Total cost function:
f(n) = g(n) + h(n)

g(n) ensures optimality, while h(n) provides efficiency.

3. Example: Finding Shortest Path in a Graph

Let’s consider a graph with nodes (A, B, C, D, E, and G) where we want to find the shortest path from A
to G.

Graph Representation with Costs

        A
       /   \
    (1)     (4)
     B ----- C
     |  \      \
   (2)  (5)    (1)
     D -- E --- G
     |    |
    (7)  (3)

Step-by-Step Execution

1. Start at Node A
○ g(A) = 0, h(A) = heuristic value (assumed).
○ f(A) = g(A) + h(A).

2. Expand A → B & C
○ Calculate f(B) and f(C).
○ Choose the node with the lowest f(n).

3. Expand B → D & E, Expand C → G


○ Update g(n) and f(n) values.
○ Select the lowest-cost path.

4. Continue until the goal (G) is reached


○ The algorithm ensures the shortest possible path.

Final Path:

A → C → G (Shortest path found with minimum cost).
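
Below is a compact, runnable sketch of A* in Python. The edge costs and heuristic values are illustrative assumptions chosen so that A → C → G comes out cheapest; they are not fully specified in the original diagram:

import heapq

def a_star(graph, h, start, goal):
    """A* search: f(n) = g(n) + h(n). Returns (path, cost)."""
    open_list = [(h[start], 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g
        for neighbor, step_cost in graph[node]:
            new_g = g + step_cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                heapq.heappush(open_list,
                               (new_g + h[neighbor], new_g, neighbor, path + [neighbor]))
    return None, float("inf")

# Illustrative costs and admissible heuristics (assumed, not from the original figure).
graph = {"A": [("B", 1), ("C", 4)], "B": [("D", 2), ("E", 5)],
         "C": [("G", 1)], "D": [("E", 7)], "E": [("G", 3)], "G": []}
h = {"A": 4, "B": 5, "C": 1, "D": 6, "E": 3, "G": 0}
print(a_star(graph, h, "A", "G"))  # (['A', 'C', 'G'], 5)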

4. Advantages of A*

● Guaranteed Optimality – Finds the shortest path if heuristics are admissible.


● Efficient – Faster than uninformed search methods like BFS & DFS.
● Widely Used – Used in GPS, AI gaming, and robotics.
● It is complete and optimal.
● It generally performs better than other search techniques and can be used to solve very complex problems.

5. Limitations of A*

● Memory Intensive – Stores all generated nodes in memory.


● Depends on Heuristic Quality – Poor heuristics can lead to inefficient paths.
● This algorithm is complete only if the branching factor is finite and every action has a fixed cost.
● The execution speed of A* search is highly dependent on the accuracy of the heuristic function used
to compute h(n).

6. Applications of A*

● Navigation Systems (Google Maps, GPS) – Finding the fastest route.


● Game AI (Pathfinding in Games) – Enemy movement in strategy games.
● Robotics – Obstacle avoidance and autonomous path planning.

Q2. Discuss Alpha-Beta search algorithm with a suitable example.

Ans: The Alpha-Beta Pruning Algorithm is an optimization technique for the Minimax algorithm used in
decision-making and game-playing AI (e.g., Chess, Tic-Tac-Toe). It reduces the number of nodes
evaluated in a game tree, making Minimax more efficient.

1. Key Concepts of Alpha-Beta Pruning

a) Minimax Algorithm Recap

● Minimax is used in two-player games where one player (MAX) tries to maximize the score, while
the other (MIN) tries to minimize it.
● The algorithm builds a game tree and evaluates possible future moves.

b) Alpha (α) and Beta (β) Values

● Alpha (α): Best (highest) value that the MAX player can guarantee.
● Beta (β): Best (lowest) value that the MIN player can guarantee.
● Pruning occurs when:
○ β ≤ α, meaning further exploration is unnecessary.

2. Example of Alpha-Beta Pruning

Consider a game tree where MAX and MIN take turns.

            A (MAX)
           /       \
      B (MIN)     C (MIN)
       /  \        /  \
      3    5      2    9

Step-by-Step Execution

1. Start at Node A (MAX’s turn).


2. Move to Node B (MIN’s turn).
○ Evaluate child nodes: 3 and 5.
○ Since MIN wants the smallest value, B = 3.
3. Move to Node C (MIN’s turn).
○ First child (2) is evaluated.
○ Second child (9) is not evaluated because 2 is already less than 3 (α-cutoff).
○ Pruning occurs at C.
4. Final Decision:
○ MAX chooses the maximum value: MAX(A) = 3.
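
A minimal sketch of Minimax with alpha-beta pruning over this example tree. The nested-list tree encoding is an assumption made for illustration:

import math

def alphabeta(node, is_max, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning; leaves are plain numbers."""
    if isinstance(node, (int, float)):   # leaf: static evaluation
        return node
    if is_max:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:            # β ≤ α: prune remaining children
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if beta <= alpha:                # prune (the 9 under C is skipped)
            break
    return value

tree = [[3, 5], [2, 9]]       # A(MAX) -> B(MIN)=[3,5], C(MIN)=[2,9]
print(alphabeta(tree, True))  # 3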

3. Advantages of Alpha-Beta Pruning

● Reduces Computation: Prunes unnecessary branches, making Minimax faster.


● Same Optimal Move: Finds the same result as Minimax but with fewer calculations.
● Works in Any Order: Can be applied to any move order but is most effective when better moves
are evaluated first.

4. Limitations of Alpha-Beta Pruning

● Depends on Move Order: Works best when the best moves are explored first.
● Not Useful for Non-Adversarial Problems: Only applies to game-playing AI.

5. Applications of Alpha-Beta Pruning

● Chess, Tic-Tac-Toe, Checkers AI – Speeds up decision-making.


● AI Game Bots – Used in competitive AI agents

Q3. Explain Heuristic Search Techniques in detail?

Ans: Heuristic search techniques are informed search algorithms that use a heuristic function to find
the most optimal solution efficiently. These techniques are widely used in pathfinding, game AI, and
problem-solving where an optimal solution needs to be found quickly.

1. Best-First Search (Informed Search)

Definition

● Best-First Search is an informed search algorithm that uses a heuristic function h(n) to
select the most promising node at each step.
● It expands the node with the lowest heuristic cost first.

Working of Best-First Search

1. Start from the initial node.


2. Use a priority queue (sorted by heuristic values h(n)) to select the best node.
3. Expand the node with the smallest heuristic value.
4. Repeat until the goal node is reached.

Example

Consider a graph where we need to find the shortest path from A to G.

        A
       /   \
    (4)     (2)
     B ----- C
     |  \      \
   (3)  (5)    (6)
     D -- E --- G

● Heuristic Values (h(n)) estimate the remaining cost to G.


● The algorithm chooses C first (h = 2) instead of B (h = 4).
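
A short sketch of greedy best-first search using a priority queue keyed on h(n) alone. The graph and heuristic values are illustrative assumptions:

import heapq

def best_first_search(graph, h, start, goal):
    """Greedy best-first: always expand the node with the smallest h(n)."""
    frontier = [(h[start], start, [start])]
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor in graph[node]:
            heapq.heappush(frontier, (h[neighbor], neighbor, path + [neighbor]))
    return None

# Illustrative graph and heuristics (assumed values).
graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["G"], "D": [], "E": ["G"], "G": []}
h = {"A": 5, "B": 4, "C": 2, "D": 3, "E": 5, "G": 0}
print(best_first_search(graph, h, "A", "G"))  # ['A', 'C', 'G'] — C (h=2) expands before B (h=4)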

Advantages

● Faster than uninformed search algorithms (like BFS & DFS).


● Uses heuristic information to improve efficiency.

Limitations

● May not always find the optimal solution (greedy approach).


● Performance depends on the quality of the heuristic function.

2. A* Search Algorithm

Definition

● A* Search is an optimal and complete heuristic search algorithm.


● It uses both the cost to reach a node (g(n)) and the heuristic estimate to the goal (h(n)).
● The total cost function:
f(n) = g(n) + h(n)

Working of A*

1. Start from the initial node.


2. Compute f(n) = g(n) + h(n) for each node.
3. Select the node with the lowest f(n) value.
4. Expand it and update the costs of neighboring nodes.
5. Continue until the goal is reached.

Example

Consider finding the shortest path from S to G.

         S
        /  \
       A    B
     (3)    (2)
     / \    |  \
    C   D   E   G
  (1) (4)  (5)  (1)

● g(n) represents the actual cost to reach a node.


● h(n) estimates the remaining cost.
● A* chooses the lowest f(n) node to expand first.

Advantages

● Guarantees the shortest path (optimal).


● More efficient than Best-First Search.

Limitations

● Can be memory-intensive.
● Performance depends on heuristic quality.

3. Pathfinding – A* and AO*

A* Pathfinding

● A* search is the most widely used algorithm for finding the shortest path in games, navigation, and
robotics.
● It ensures the optimal path by balancing g(n) and h(n).

AO* Search Algorithm

● AO* (AND-OR) Search is used for problems with multiple goals or decisions.


● It represents problems as an AND-OR graph, where:
○ AND nodes: Require all child nodes to be solved.
○ OR nodes: Require only one child node to be solved.
AO* Working

1. Starts from the initial state and expands nodes.


2. Uses heuristics to choose the best path.
3. Explores AND-OR structures, handling multiple solutions.
4. Backtracks and updates costs when new paths are found.

Example

Used in decision trees, expert systems, and automated planning.

4. Applications of Heuristic Search Techniques

● GPS & Navigation Systems – A* is used to find the best route.


● Game AI & Pathfinding – A* helps in character movement.
● Robotics – Path planning in autonomous robots.
● Decision-Making AI – AO* is used in expert systems.

Q4. Explain the types of Hill Climbing in heuristic search?

Ans: Hill Climbing is a heuristic search algorithm used for optimization problems where the goal is to
find the best possible solution by making incremental improvements. It continuously moves towards a
state with a higher heuristic value until it reaches a peak (local or global optimum).

1. Types of Hill Climbing

a) Simple Hill Climbing

● Definition: Evaluates only one neighbor at a time and moves to the first neighbor with a better
heuristic value.
● Process:
1. Start at an initial state.
2. Evaluate a single neighboring state.
3. If it is better, move to it; otherwise, stop.
● Advantages:
1. Simple and easy to implement.
2. Uses minimal memory.
● Limitations:
1. Can get stuck in local maxima.
2. Might not find the best solution due to a limited search.

b) Steepest-Ascent Hill Climbing

● Definition: Evaluates all possible neighboring states and chooses the best one with the highest
heuristic value.
● Process:
1. Start at an initial state.
2. Examine all neighbors.
3. Move to the best neighbor (highest heuristic value).
4. Repeat until no better neighbor exists.
● Advantages:
1. Finds a better solution than Simple Hill Climbing.
2. Less likely to get stuck in plateaus.
● Limitations:
1. More computationally expensive.

2. Can still get stuck in local maxima.

c) Stochastic Hill Climbing

● Definition: Instead of choosing the best neighbor, it randomly selects a neighbor and decides
whether to move based on probability.
● Process:
○ Start at an initial state.
○ Randomly choose a neighboring state.
○ Move to it if it is better or sometimes even if it is worse (to escape local maxima).
● Advantages:
1. Can escape local maxima.
2. Useful for large and complex search spaces.
● Limitations:
1. Less predictable as it relies on randomness.
2. May take longer to converge to an optimal solution.

2. Challenges in Hill Climbing

a) Local Maxima

● A peak where all neighbors are worse but it is not the best solution (global maximum).
● Solution: Use random restarts or simulated annealing.

b) Plateaus

● A flat region where all states have equal heuristic values.


● Solution: Use random jumps to explore new areas.

c) Ridges

● A narrow path of increasing heuristic values where steepest ascent fails.


● Solution: Use bidirectional search or more advanced algorithms.

3. Applications of Hill Climbing

1. Route Optimization – Finding the shortest path in GPS systems.


2. Genetic Algorithms – Optimizing AI models.
3. Robotics – Motion planning and object recognition.
4. Game AI – AI decision-making in games.

Q5. Explain AO* Algorithm with an example.

Ans: The AO* (AND-OR) Search Algorithm is a heuristic search technique used for solving problems with
multiple goals and dependencies. It is particularly useful in AND-OR graphs, where solutions require
either all or some of the sub-goals to be solved.

1. Key Concepts of AO*

● AND-OR Graph: Represents problems where nodes can have AND (multiple conditions must be
met) and OR (at least one condition must be met) relationships.
● Heuristic Search: Uses h(n) (heuristic function) to estimate the best path.
● Backtracking: Updates path costs dynamically based on heuristic updates.
● Efficient Path Selection: Instead of expanding all nodes, AO* focuses only on the most promising
paths.

2. Working of AO*

Steps of AO* Algorithm

1. Start from the initial node.


2. Expand nodes and classify them as AND or OR.
3. Compute cost values using:
○ AND nodes → Sum of all child costs.
○ OR nodes → Minimum cost among children.
4. Choose the path with the lowest cost.
5. If a better solution appears, update previous paths (backtracking).
6. Repeat until the goal state is reached.
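
Step 3's cost rule can be sketched recursively in Python. The node encoding and the example graph below are illustrative assumptions; a full AO* implementation also interleaves expansion, heuristics, and backtracking:

def and_or_cost(node):
    """Cost of an AND-OR node: AND sums its children, OR takes the minimum."""
    kind, children = node["type"], node.get("children", [])
    if kind == "leaf":
        return node["cost"]
    child_costs = [and_or_cost(c) for c in children]
    return sum(child_costs) if kind == "AND" else min(child_costs)

# Illustrative graph: an OR root choosing between an AND subgoal and a single leaf.
root = {"type": "OR", "children": [
    {"type": "AND", "children": [{"type": "leaf", "cost": 2},
                                 {"type": "leaf", "cost": 3}]},  # costs 2 + 3 = 5
    {"type": "leaf", "cost": 6},
]}
print(and_or_cost(root))  # 5 — the AND branch (2 + 3) beats the single leaf (6)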

3. Example of AO*

Problem Statement

A robot needs to reach the goal (G) while overcoming multiple obstacles. The decision paths include
multiple sub-goals, some of which must be solved together (AND) and some that can be solved
separately (OR).

AND-OR Graph Representation

             Start
            /     \
        A (OR)   B (OR)
         /  \     /   \
    C (AND)  D (AND)   E
      /  \      |      |
    G1    G2   G3     G4

Explanation

● A and B are OR nodes (either can be chosen).


● C and D are AND nodes (both sub-goals must be solved together).
● G1, G2, G3, G4 are goal states.
● AO* chooses the path with the minimum cost dynamically.

Optimal Path Selection

1. Evaluate paths A → C → (G1, G2) and B → D → G3.


2. Choose the path with the lowest total cost.
3. If a better path is found later, backtrack and update the cost.

4. Advantages of AO*

● Handles Complex Decision Trees – Useful in hierarchical problem-solving.


● Efficient Pruning – Expands only necessary nodes.
● Backtracking Support – Updates solutions dynamically.

5. Limitations of AO*

● Depends on Good Heuristics – Poor heuristics may lead to inefficiency.


● Complex Implementation – More challenging than A*.

6. Applications of AO*

● AI Planning & Robotics – Decision-making in dynamic environments.


● Expert Systems – Rule-based AI for problem-solving.
● Game AI – Multi-path decision-making in strategy games.
MODULE 03

Q1. Explain Adaline neural network with an example.


Ans. Adaline (Adaptive Linear Neuron) is a type of artificial neural network developed by Bernard
Widrow and Marcian Hoff in the 1960s. It is a single-layer neural network that uses a linear activation
function for training but applies a threshold function for classification.
Adaline is similar to the Perceptron, but the key difference lies in how it updates its weights during
training. Instead of using the perceptron's binary step function for weight updates, Adaline minimizes
the Mean Squared Error (MSE) using Gradient Descent.

Working of Adaline

1. Weighted Sum Calculation
Each input x_i is multiplied by its corresponding weight w_i, and a bias term b is added:
y = Σ w_i x_i + b

2. Activation Function
Unlike perceptrons that use a step function, Adaline applies a linear activation function (identity
function):
y_activation = y

3. Weight Update using the LMS (Least Mean Squares) Rule

○ Compute the error: e = y_desired − y_activation

○ Adjust the weights using Gradient Descent:
w_i = w_i + η · e · x_i
where η is the learning rate.

Training Adaline

1. Initialize weights: w1 = 0.1, w2 = −0.2, b = 0.05

2. Compute the weighted sum for each input.

3. Update the weights using the LMS rule:

○ Calculate the error: e = y_desired − y_activation

○ Adjust the weights: w_i = w_i + η · e · x_i

4. Repeat for multiple epochs until the error reduces.
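
A minimal NumPy sketch of the Adaline/LMS training loop described above. The AND-style dataset with ±1 targets, the learning rate, and the epoch count are illustrative assumptions:

import numpy as np

# Illustrative data: 2-input AND gate with targets -1/+1 (assumed example).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)

w = np.array([0.1, -0.2])   # initial weights (as in the steps above)
b = 0.05                    # initial bias
eta = 0.1                   # learning rate η

for epoch in range(50):
    for xi, target in zip(X, y):
        activation = np.dot(w, xi) + b   # linear activation: y = Σ w_i x_i + b
        error = target - activation      # e = y_desired − y_activation
        w += eta * error * xi            # LMS / gradient-descent update
        b += eta * error

# Threshold the trained linear output for classification.
predictions = np.where(X @ w + b >= 0, 1, -1)
print(predictions)  # expected after training: [-1 -1 -1  1]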

Advantages of Adaline
1. Uses Mean Squared Error (MSE), leading to a smoother optimization process.
2. Can be extended to Multi-Layer Networks (basis of modern neural networks).
3. More stable training compared to perceptron.

Q2. Explain Activation Function.

Ans. An Activation Function is a mathematical function used in neural networks to decide whether a
neuron should be activated or not. It takes an input, processes it, and outputs a value that determines
the strength of the neuron’s signal in the next layer.

Why is an Activation Function Important?

1. Introduces Non-Linearity:
○ Without activation functions, neural networks would behave like linear regression
models, unable to learn complex patterns.

2. Controls Output Range:


○ Some activation functions (like sigmoid and tanh) limit outputs between specific values,
making training stable.

3. Helps in Backpropagation:
○ It influences gradient flow during training, affecting how the network updates weights.

Types of Activation Functions

1. Linear Activation Function

● f(x) = x
● The output is the same as the input.
● Limitation: Cannot learn complex patterns.

2. Non-Linear Activation Functions

a) Sigmoid Function

● Formula: f(x) = 1 / (1 + e^(−x))
● Output range: (0,1)
● Used in probability-based problems.
● Limitation: Can cause vanishing gradient problems in deep networks.

b) Tanh (Hyperbolic Tangent) Function

● Formula: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
● Output range: (-1,1)
● Zero-centered, making it more efficient than sigmoid.

c) ReLU (Rectified Linear Unit)

● Formula: f(x) = max(0, x)
● Output: 0 for negative inputs, same as input for positive values.
● Advantages: Faster computation and avoids vanishing gradients.
● Limitation: Can cause dead neurons (neurons output 0 forever).

d) Leaky ReLU

● Formula: f(x) = max(0.01x, x)
● Allows small negative values, preventing dead neurons.

e) Softmax Function (Used in Classification)

● Converts output into probabilities (sums to 1).


● Commonly used in the output layer of multi-class classification.
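
A short NumPy sketch of the activation functions listed above, which can be checked against the formulas (a minimal illustration, not a library API):

import numpy as np

def sigmoid(x):      # range (0, 1) — probability-like outputs
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):         # range (-1, 1) — zero-centered
    return np.tanh(x)

def relu(x):         # 0 for negatives, identity for positives
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):   # small slope keeps negative inputs "alive"
    return np.maximum(slope * x, x)

def softmax(x):      # outputs sum to 1 — multi-class probabilities
    e = np.exp(x - np.max(x))    # shift for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), softmax(x), sep="\n")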

Q3. What is gradient descent and how does it work?

Ans. Gradient Descent is an optimization algorithm used to find the minimum of a function. It is widely
used in machine learning and deep learning for optimizing model parameters by minimizing the loss
function.

How Gradient Descent Works?


Gradient Descent works by iteratively updating parameters (weights) in the direction of the negative
gradient of the loss function. This helps the algorithm move towards the optimal (minimum) value.

Mathematical Explanation
● Let’s say we have a function f(x) that we want to minimize.

● The gradient (derivative) of the function, denoted ∇f(x), gives the direction of
the steepest ascent.

● To minimize f(x), we move in the opposite direction of the gradient.

The update rule for gradient descent is:

θ:=θ−α⋅∇J(θ)
Where:

● θ= Model parameters (weights, biases)

● J(θ) = Loss function (cost function)

● ∇J(θ)= Gradient of the loss function

● α= Learning rate (step size)
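
A minimal sketch of the update rule θ := θ − α·∇J(θ) applied to a simple quadratic loss J(θ) = (θ − 3)², whose minimum is at θ = 3. The loss function and learning rate are illustrative assumptions:

# Gradient descent on J(θ) = (θ - 3)^2, so ∇J(θ) = 2(θ - 3); minimum at θ = 3.
theta = 0.0     # initial parameter
alpha = 0.1     # learning rate α

for step in range(100):
    grad = 2 * (theta - 3)         # ∇J(θ)
    theta = theta - alpha * grad   # θ := θ − α · ∇J(θ)

print(round(theta, 4))  # ≈ 3.0 — converged to the minimum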

Types of Gradient Descent


Gradient Descent can be categorized based on how frequently updates are performed.

1. Batch Gradient Descent (BGD)

● Uses the entire dataset to compute the gradient and update weights.

● Pros: More stable convergence.

● Cons: Computationally expensive for large datasets.

2. Stochastic Gradient Descent (SGD)

● Updates the weights after each training example.

● Pros: Faster updates, works well for large datasets.

● Cons: Noisy updates may lead to fluctuations.

3. Mini-Batch Gradient Descent

● Uses a small batch (subset) of data to update weights.

● Pros: Balances efficiency and stability.

● Cons: Still requires careful tuning of batch size.

Applications of Gradient Descent

Linear and Logistic Regression – Optimization of weights.


Neural Networks – Training deep learning models.
Computer Vision & NLP – Used in CNNs, RNNs, and transformers.

Q4. Discuss Perceptron algorithm with a neat flowchart.


Ans. The Perceptron is one of the simplest types of artificial neural networks.
It’s a binary classifier — meaning it can decide whether an input belongs to one class or another.

It works by:

● Taking multiple inputs,


● Multiplying them by weights,
● Summing them,
● Applying an activation function (usually a step function) to decide the output.

Working Steps of the Perceptron Algorithm

Step 1: Initialize the Weights and Bias

● Set weights (w_i) and bias (b) to small random numbers or zeros.

Step 2: For Each Training Sample

● Calculate the weighted sum:
z = w1x1 + w2x2 + … + wnxn + b

● Apply the activation function (step function):
Output = 1 if z ≥ 0, otherwise 0

Step 3: Update the Weights and Bias

● If the prediction is wrong, update using:
w_i := w_i + Δw_i
where
Δw_i = α (y_true − y_predicted) x_i
and
b := b + α (y_true − y_predicted)

○ α = learning rate (a small positive value like 0.01)

○ y_true = actual label

○ y_predicted = output from the activation function

Step 4: Repeat

● Repeat the process for a number of epochs or until the model classifies all training examples
correctly.
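
A compact NumPy sketch of these steps on a toy linearly separable dataset. The OR-gate data and the hyperparameters are illustrative assumptions:

import numpy as np

# Illustrative dataset: OR gate (assumed example; linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])

w = np.zeros(2)   # Step 1: initialize weights and bias
b = 0.0
alpha = 0.1       # learning rate

for epoch in range(10):                          # Step 4: repeat for several epochs
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b                    # Step 2: weighted sum
        prediction = 1 if z >= 0 else 0          # step activation
        update = alpha * (target - prediction)   # Step 3: update only when wrong
        w += update * xi
        b += update

print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])  # [0, 1, 1, 1]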

Flowchart for Perceptron Algorithm


Start
↓
Initialize weights and bias
↓
For each training sample:
↓
Compute weighted sum (z)
↓
Apply activation function
↓
Compare predicted output with the actual label
↓
If wrong, update weights and bias
↓
Repeat until all samples are classified correctly (or the epoch limit is reached)
↓
Stop
MODULE 04

Q1. Explain the K-nearest neighbor algorithm with an example.

Ans. K-Nearest Neighbor (KNN) is a supervised learning algorithm used for both classification and
regression problems.

The reason for having K-Nearest neighbor is as follows:

Imagine there are two categories, say category X and category Y, and a new data point x1 is
introduced. Under which category should it be placed? This decision is made by the K-Nearest
Neighbor classifier.

● Steps for applying K-Nearest Neighbor are:


○ Step 1: K is selected first.
○ Step 2: The distance with the neighbors are found using Euclidean distance formula.
○ Step 3: Based on the Euclidean distance formula nearest neighbors are chosen.
○ Step 4: After choosing the nearest neighbor, count the number of each category and
choose the one with maximum value.
○ Step 5: The new data point is now classified.

It is also called a lazy learner classification algorithm. K is a constant value defined by the user. The
distance can be computed using either of the formulas mentioned below:

● Euclidean Distance Formula


d = √[(x2 − x1)² + (y2 − y1)²]
where,
o (x1,y1) are the coordinates of one point.
o (x2,y2) are the coordinates of the other point.
o d is the distance between (x1,y1) and (x2,y2).

● Manhattan Distance
Mdist = |x2 − x1| + |y2 − y1|
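
A minimal sketch of the five KNN steps above in plain Python. The sample points and k = 3 are illustrative assumptions:

from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of ((x, y), label). Steps 1-5 of KNN with Euclidean distance."""
    # Step 2: distance to every training point
    dists = [(math.dist(point, query), label) for point, label in train]
    # Step 3: pick the k nearest neighbors
    nearest = sorted(dists)[:k]
    # Steps 4-5: majority vote among their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: two categories X and Y (assumed coordinates).
train = [((1, 1), "X"), ((1, 2), "X"), ((2, 1), "X"),
         ((6, 6), "Y"), ((6, 7), "Y"), ((7, 6), "Y")]
print(knn_classify(train, (2, 2)))  # 'X' — the closest neighbors are all X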
● Advantages of K-Nearest Neighbor
1. Easy to Implement: Simple steps and straightforward logic.
2. Easily Updatable: New data can be added anytime without retraining.
3. Simple to Use: Just share the dataset; KNN finds the best match without complex
procedures.

● Disadvantages of K-nearest Neighbor


1. Slow with Large Datasets: Distance calculation becomes costly as dataset size increases.
2. Scalability Issues: More data means slower predictions.
3. Needs Homogeneous Data: Works better with uniform, scaled data.

● KNN application
• Medicine
• Online shopping
• Data mining
• Agriculture etc.

Q2. Explain Decision Tree classifier. How does information gain help in determining the best attribute
to split on?

Ans. A Decision Tree is a supervised learning algorithm used for classification and regression tasks.

● It can be used for both classification and regression

● In the decision tree the split of regions is applied which makes it useful to use and make
decisions
● Decision Tree is non-parametric, meaning it grows by analyzing data without predefined
parameters.
● The tree expands gradually, adding branches and leaves as it learns from the data.
● It is one of the oldest and most popular machine learning algorithms.
● Decision Trees are robust and can handle missing and noisy data effectively.
● Tree building starts with the root node and proceeds level by level by adding child nodes.
● This building method is called binary recursive splitting.
● The first main step is to classify the data into subsets.
● After classification, the Decision Tree algorithm is applied to fully build the tree.
● Decision Tree terminologies
○ Parent node: The root node is the parent node.
○ Child nodes: The successor nodes are the child nodes.
○ Root node: It is the first node which is located at the top of the decision tree, it
represents the whole data set.
○ Leaf node: These are end nodes of the decision tree, after reaching till leaf node further
segregation is not possible.
○ Splitting: It is a mechanism of dividing the root or the decision node according to the
condition given.
○ Subtree or branch: the portion of the tree produced by a split.
○ Pruning: Removing a particular branch from the tree is known as pruning.
● Working of Decision tree
○ It starts with the root node then it will go through the classification that has been applied
in the dataset.
○ After going through the classification it will go ahead with comparison between the data.
○ Based on the data it will move further if it is coming out to be of one form then
accordingly one node will be created otherwise it will move on with another node.
○ Like this comparison will continue again and again till it goes or reaches the end part.
○ End nodes are nothing but leaf nodes after which no other branches will be created.
Decision tree example

● Consider a Decision Tree that aims to decide whether to buy a laptop or not.
● Initially, all data like laptop price and configuration is considered.
● The root node is created based on price range (e.g., ₹40,000 to ₹80,000).
● If the price condition is satisfied (Yes), it moves to the next node; if not (No), a declined node is
created.
● The next node checks the OS version; if it’s the latest, it moves to the buying node, otherwise
declines.
● The tree proceeds step-by-step, making the decision process simple and accurate.
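
On the question's second part: information gain measures the reduction in entropy produced by a split, and the attribute with the highest information gain is chosen as the split attribute. Gain(S, A) = Entropy(S) − Σ (|S_v| / |S|) · Entropy(S_v), where Entropy(S) = −Σ p_i log2(p_i). A minimal sketch of this computation; the toy "buy"/"skip" labels are assumptions:

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -Σ p_i · log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Gain = Entropy(parent) - weighted sum of the child entropies."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Assumed toy example: 'buy' / 'skip' labels split by some attribute.
parent = ["buy", "buy", "buy", "skip", "skip", "skip"]
split_by_price = [["buy", "buy", "buy"], ["skip", "skip", "skip"]]   # pure split
split_random = [["buy", "skip", "buy"], ["skip", "buy", "skip"]]     # mixed split

print(information_gain(parent, split_by_price))           # 1.0 — best possible
print(round(information_gain(parent, split_random), 3))   # ≈ 0.082 — poor split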

Q3. Explain Naive Bayes classifier?

Ans. It is a classification technique which uses Bayes theorem. It is a machine learning model.

● It is mostly preferred for data that is very large in nature. It uses a probabilistic
approach to find the solution.
● The term “naive” is used here because the model treats the features as independent of each other.
● For example, consider classifying a tomato with Naive Bayes. The features might be that it is round
in shape, red in color, small in size, etc. All of these features together produce the end result, yet
each is treated as independent of the others.
● This algorithm is very popular, mainly because it is very simple to code and understand.

Conditional Probability

○ Conditional probability P(A|B) is the probability of event A occurring given that event B has
already occurred.
○ For example, when rolling a fair die (6 faces), the probability of any one face is 1/6 ≈ 0.166.
○ Conditional probabilities for other scenarios can be calculated in the same way.

The Bayes rule

It works with known and unknown values, using given evidence to find solutions.

● Prior: Probability before evidence.


● Posterior: Updated probability after evidence.
● Likelihood: Probability assuming the belief is true.
● Marginal: Overall probability of evidence.
Named after Thomas Bayes.

Naive Bayes classification

Bayes’ theorem: P(A|B) = [P(B|A) · P(A)] / P(B)

Where:

P(A|B): probability of A occurring when B occurs (posterior).
P(A): probability of A occurring (prior).
P(B): probability of B occurring (evidence).
P(B|A): probability of B occurring when A occurs (likelihood).

In a figure with squares and circles, circles are more numerous.


Using Naive Bayes, a new shape is classified under the major category (circle) based on prior probability
— meaning decisions are made using past data.

Formulas:

● Prior (circle) = Number of circles / Total objects


● Prior (square) = Number of squares / Total objects

Example:

● Total objects = 16 (10 circles, 6 squares)


● Prior (circle) = 10/16
● Prior (square) = 6/16
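
A tiny sketch of the prior computation above and of classifying a new shape by the larger prior, as the example describes (only the counts from the text are used):

from fractions import Fraction

# Counts from the example: 16 objects, 10 circles, 6 squares.
counts = {"circle": 10, "square": 6}
total = sum(counts.values())

priors = {shape: Fraction(n, total) for shape, n in counts.items()}
print(priors)  # {'circle': Fraction(5, 8), 'square': Fraction(3, 8)}

# Classification by prior alone: pick the category with the highest probability.
print(max(priors, key=priors.get))  # 'circle' — the new shape joins the majority class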

Advantages of Naive Bayes Classification

• It is very fast at making predictions and finding the solution.

• For text classification, Naive Bayes has higher success rates in comparison with other algorithms.

• Performance on categorical inputs is better in comparison with other input types.
MODULE 05
Q1. Explain Logistic Regression in detail with suitable examples.

Logistic Regression is a supervised machine learning algorithm used for predicting the probability of a
categorical dependent variable. It is mainly used for classification problems, where the output is in
discrete categories like 0 or 1, Yes or No, True or False.

Nature of Output: Unlike linear regression, which predicts continuous values, logistic regression predicts
probabilistic values between 0 and 1. These probabilities are then mapped to class labels using a threshold
(usually 0.5).

Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.

Logistic Regression can be used to classify observations using different types of data and can easily
determine the most effective variables used for the classification. The logistic (sigmoid) function is
described below:

The sigmoid function is a mathematical function used to map the predicted values to probabilities.

It maps any real value into another value within a range of 0 and 1.

The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic function.

Assumptions for Logistic Regression:


The dependent variable must be categorical in nature.
The independent variable should not have multi-collinearity.

Logistic Regression Equation:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn

The left side is the log-odds; applying the sigmoid to the right side gives the predicted probability.

Types of Logistic Regression:

1. Binomial Logistic Regression:


Used when the dependent variable has only two categories, such as 0 or 1, Pass or Fail, Yes or No.

2. Multinomial Logistic Regression:


Used when the dependent variable has three or more unordered categories, like Cat, Dog, Sheep.

3. Ordinal Logistic Regression:


Used when the dependent variable has three or more ordered categories, such as Low, Medium, High.

Applications with Examples:

Medical Diagnosis: Predict whether a tumor is malignant (1) or benign (0).


Banking: Classify whether a customer will default on a loan or not.
Marketing: Predict if a user will click on an ad (Click = 1, No Click = 0).
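
A minimal sketch of a logistic-regression prediction using the sigmoid and a 0.5 threshold. The coefficients and the input are illustrative assumptions, as if the model were already fitted:

import math

def predict_probability(x, weights, bias):
    """Sigmoid of the linear combination: P(y=1 | x) in (0, 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Assumed, already-fitted coefficients (illustrative only).
weights, bias = [0.8, -0.4], -0.2

x = [2.0, 1.0]                      # one observation
p = predict_probability(x, weights, bias)
label = 1 if p >= 0.5 else 0        # map the probability to a class with threshold 0.5
print(round(p, 3), label)           # 0.731 1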

Q2. Explain Random forest algorithm in detail with steps.

Random Forest is a supervised machine learning algorithm used for both classification and regression
tasks. It works by creating multiple decision trees from random subsets of the data and combines their
results — majority vote for classification and average for regression.

Working Principle:
Random Forest is based on the ensemble method, which combines multiple models to improve
performance.

Types of Ensemble Methods:

1. Bagging: Builds multiple models using different subsets of data (with replacement) and combines
them (e.g., Random Forest).

2. Boosting: Builds models sequentially, where each model improves on the errors of the previous one
(e.g., AdaBoost, XGBoost).

Bagging (Bootstrap Aggregation) is an ensemble technique used in Random Forest. It involves selecting
random samples with replacement (bootstrap samples) from the original dataset. Each sample is used to
train a separate model independently. The final prediction is made by combining the outputs of all
models using majority voting (for classification) or averaging (for regression).

Steps involved in random forest algorithm:


Step 1: In Random forest n number of random records are taken from the data set having k number of
records.

Step 2: Individual decision trees are constructed for each sample.

Step 3: Each decision tree will generate an output.

Step 4: The final output is decided by Majority Voting (for classification) or Averaging (for regression).

Example

Consider a fruit basket as the dataset. Now n samples are taken from the fruit basket, and an
individual decision tree is constructed for each sample. Each decision tree generates an output. The
final output is decided by majority voting: if the majority of the decision trees output apple rather
than banana, the final output is taken as apple.
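
A short sketch using scikit-learn's RandomForestClassifier, assuming scikit-learn is available; the toy data is an assumption:

from sklearn.ensemble import RandomForestClassifier

# Toy dataset (assumed): two numeric features, binary labels.
X = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]
y = [0, 1, 0, 0, 1, 1]

# 100 trees, each trained on a bootstrap sample of the data (bagging).
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Majority vote across the trees produces the final class.
print(clf.predict([[2, 3]]))        # e.g. [1]
print(clf.predict_proba([[2, 3]]))  # fraction of trees voting for each class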

Importance

Each tree uses different subsets of features, making every tree unique.
Reduces feature space by not using all variables in each tree.
Trees are built independently, allowing efficient use of CPU resources.
About one-third of the data (the “out-of-bag” samples) is unused during training and can be used for testing.
Final results are more consistent due to aggregation (voting or averaging).

Q3. Explain Expectation-Maximization algorithm with an example.

The Expectation-Maximization (EM) algorithm is a powerful method used for estimating the values of
latent variables—variables that are not directly observed but inferred from other observable data. EM is
particularly useful when the general form of the underlying probability distribution of these variables is
known.
This algorithm plays a key role in unsupervised learning, especially in clustering techniques such as
Gaussian Mixture Models (GMMs).

EM was formally introduced in a 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. It is
widely used to find maximum likelihood estimates of parameters in statistical models when data is
incomplete or has missing values.

Algorithm:

1. Initialization: Given a set of incomplete data, start with an initial guess for the parameters.

2. Expectation Step (E-step): Using the available observed data, estimate or "guess" the missing values
(latent variables).

3. Maximization Step (M-step): Once the missing data is estimated, use this complete data to update the
model parameters.

4. Iteration: Repeat the E-step and M-step until the parameters converge or the algorithm reaches a
stopping criterion.
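
A compact sketch of EM for a 1-D mixture of two Gaussians with known, equal variances. The data, initial guesses, and the fixed variance are illustrative assumptions:

import math

def em_two_gaussians(data, mu1, mu2, sigma=1.0, iterations=20):
    """EM for a 1-D two-component Gaussian mixture with fixed, equal variance."""
    def pdf(x, mu):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    for _ in range(iterations):
        # E-step: responsibility of component 1 for each point (latent-variable guess)
        r = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in data]
        # M-step: re-estimate the means using the soft assignments
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

# Illustrative data drawn around 0 and around 5 (assumed values).
data = [-0.5, 0.1, 0.3, 4.8, 5.1, 5.4]
print(em_two_gaussians(data, mu1=1.0, mu2=4.0))  # means converge near (0.0, 5.1)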

Usage of the EM Algorithm:

1. Filling Missing Data: It can be used to estimate missing values in a dataset.

2. Unsupervised Clustering: Serves as a foundation for clustering algorithms.

3. Hidden Markov Models (HMMs): Useful for estimating parameters in HMMs.

4. Latent Variable Discovery: Helps in identifying values of latent (unobserved) variables.

Advantages of the EM Algorithm:

1. Likelihood Increase: Ensures that the likelihood improves with every iteration.

2. Ease of Implementation: Both the E-step and M-step are relatively simple for many problems.
3. Closed-Form Solutions: Often, solutions to the M-step exist in a closed form, making the process more
efficient.

Disadvantages of the EM Algorithm:

1. Slow Convergence: The algorithm can be slow to converge, requiring many iterations.

2. Local Optima: It may converge to local optima rather than the global optimum.

3. Complex Probability Requirements: It needs both forward and backward probabilities (whereas some
numerical optimizations only require forward probability).

Q4. Bayesian Belief Networks

A Bayesian Belief Network (BBN) is a probabilistic graphical model used to represent variables and their
conditional dependencies through a directed acyclic graph (DAG). It is also referred to as a Bayes
Network, belief network, decision network, or Bayesian model.

Key Features

Probabilistic Nature: BBNs are built upon probability distributions and leverage probability theory for
tasks like prediction and anomaly detection.

Applications: Since real-world scenarios often involve uncertainty, Bayesian networks are useful in
various fields, such as: Prediction, Anomaly Detection, Diagnostics, Automated Insight, Reasoning, Time
Series Prediction, Decision Making Under Uncertainty

Bayesian networks model complex relationships between events, making them essential for handling
uncertain data and providing insights in dynamic systems.

Bayesian Network can be used for building models from data and experts’ opinions, and it consists of
two parts:

1. Directed Acyclic Graph

2. Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an Influence Diagram. A Bayesian network graph is made up of
nodes and arcs (directed links), where:
Nodes: Each node in a Bayesian network represents a random variable, which can be either continuous
or discrete.

Arcs/Edges: Directed arrows (or arcs) between nodes represent causal relationships or conditional
probabilities. These arcs indicate that one node directly influences the other. If no directed arrow exists
between two nodes, it means those nodes are independent of each other.

For example, in a Bayesian network with nodes A, B, C, and D:


A is the parent of B if there is a directed arrow from A to B.
C is independent of A if no direct link exists between them.

Q5. Explain Hierarchical Clustering in Detail

Hierarchical clustering is an alternative to partitioned clustering because it doesn't require specifying the
number of clusters in advance. It creates a tree-like structure known as a dendrogram by recursively
merging or splitting clusters.

Clusters are formed by progressively combining similar data points into larger clusters.

To determine the number of clusters, you can cut the tree at the appropriate level.

The most common hierarchical clustering method is Agglomerative Hierarchical Clustering, where each
data point starts as its own cluster and clusters are merged based on similarity.

The hierarchical clustering technique has two approaches:


1. Agglomerative Hierarchical Clustering (Bottom-Up Approach)

Agglomerative clustering starts with each data point as its own cluster and progressively merges them
based on similarity until one cluster remains.

Steps:

1. Initialize Clusters: Each data point is a separate cluster.

2. Compute Distance: Calculate the distance between all pairs of clusters.

3. Merge Closest Clusters: Find and merge the two closest clusters.

4. Update Distance: Recalculate the distance matrix based on the new clusters.

5. Repeat: Continue merging until all points are in one cluster.

6. Create Dendrogram: Visualize the cluster hierarchy and cut the tree to get the desired number of
clusters.
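
A short sketch of agglomerative clustering with SciPy, assuming scipy and numpy are available; the toy points and the requested number of clusters are illustrative assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two obvious groups (assumed data).
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]])

# Steps 1-5: linkage() repeatedly merges the closest clusters until one remains.
Z = linkage(X, method="single")   # single-link distance between clusters

# Step 6: "cut the dendrogram" — here, ask for 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2] — one label per point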

2. Divisive Hierarchical Clustering (Top-Down Approach)

Divisive clustering starts with the whole dataset as one cluster and recursively splits it into smaller
clusters until each data point is its own cluster.

Steps:

1. Initialize: Start with one cluster containing all data points.

2. Determine Best Split: Identify how to split the cluster into two.

3. Split: Divide the cluster into two sub-clusters.

4. Repeat: Continue splitting until the desired number of clusters is achieved.

5. Create Dendrogram: Visualize the splits and cut the tree to get the final clusters.

Key Differences:

· Agglomerative: Merges individual clusters.

· Divisive: Splits one large cluster into smaller ones.


MODULE 06

Q1) Explain support vector machine in detail.

Ans. Algorithm for Support vector machine –

SVM (Support Vector Machine) is a supervised learning method used for classification and sometimes
regression. Its main goal is to find the best line or boundary (called a hyperplane) that separates
different classes of data.

● If there are 2 features, the hyperplane is a line.


● If there are 3 features, it becomes a 2D plane.
● With more than 3 features, we can't visualize it, but the concept is the same.

SVM works best for classification problems.

Consider two independent variables, x1, and x2, as well as one dependent variable, either a blue or a
red circle.

Many lines can separate these classes (with two features, the hyperplane is a line). But how do we choose the best one?

SVM picks the line that gives the maximum margin, meaning it leaves the biggest gap between the two classes.
This helps the model make better predictions on new data.

Choosing the most appropriate hyper-plane:

The best hyperplane is the one that creates the biggest gap, or margin, between the two classes. This helps
separate the data clearly.
So, we choose the hyperplane that has the biggest distance from the closest points on both sides. This is called
the maximum-margin or hard margin hyperplane. In the diagram, that would be line L2.

Let's take a look at a scenario like the one below.

There’s one blue ball inside the red area, which is an outlier. But that’s okay! SVM can handle outliers
and still find the best hyperplane with the largest margin. Outliers don’t affect SVM much.
For this kind of data, SVM still finds the best margin but allows some points to cross it. This is called a soft margin.
SVM adds a penalty for each point that breaks the margin. A common penalty is called hinge loss—the more a
point crosses the margin, the bigger the loss.

So far, we've talked about linearly separable data (data that can be split by a straight line). But what if the data
can’t be separated by a straight line? Let's see how SVM handles that next!

If the data can’t be separated by a straight line, SVM uses something called a kernel to solve the problem. The
kernel transforms the data into a new space using a new variable y(like distance from the origin). This makes it
possible to separate the data with a line in the new space.

In this case, we create a new variable y based on the distance from the origin. This is done using a kernel.

A kernel is a special function in SVM that transforms data into a higher-dimensional space. This helps
turn a non-separable problem into a separable one. Simply put, the kernel reshapes the data so SVM
can find the best way to separate it.
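
A minimal scikit-learn sketch of a soft-margin SVM with an RBF kernel, assuming scikit-learn is available; the circular toy data is an assumption:

import numpy as np
from sklearn.svm import SVC

# Toy non-linearly-separable data (assumed): inner circle = class 0, outer ring = class 1.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 40)
inner = np.c_[0.5 * np.cos(angles[:20]), 0.5 * np.sin(angles[:20])]
outer = np.c_[2.0 * np.cos(angles[20:]), 2.0 * np.sin(angles[20:])]
X = np.vstack([inner, outer])
y = np.array([0] * 20 + [1] * 20)

# The RBF kernel maps the data to a space where a separating hyperplane exists;
# C controls the soft margin (the penalty for points that cross it).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.1, 0.1], [2.0, 0.0]]))  # e.g. [0 1]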

SVM has the following advantages:

● SVM works well with high-dimensional data.


● It uses only some training points (called support vectors) to make decisions, which saves memory.
● You can use different kernel functions, even custom ones, to fit the data better.
Q2) Explain Bagging.

Ans. Bagging Classifier is an ensemble method that builds multiple models using random subsets of the
training data (with replacement). Each model is trained separately, and their predictions are combined
(e.g. by voting) to make the final prediction. This helps reduce overfitting and makes the model more
stable.

Let N be the size of the original training set. Each base model in bagging is trained on a different
random sample of size N drawn from it. Some data points may repeat, and some may be left out.
Bagging reduces overfitting by averaging or voting, which lowers variance but can slightly increase
bias—though overall performance improves.

How bagging works on the training dataset:

Bagging picks random samples from the training data with replacement, so some data may appear more
than once, while others might be skipped.

Original dataset : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Resampled training set 1: 2, 3, 3, 5, 6, 1, 8, 10, 9, 1

Resampled training set 2: 1, 1, 5, 6, 3, 8, 9, 10, 2, 7

Resampled training set 3: 1, 5, 8, 9, 2, 10, 9, 7, 5, 4
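
A tiny sketch reproducing this resampling with Python's random module. The seed is an assumption for repeatability, so the printed sets will differ from the hand-written ones above:

import random

random.seed(42)  # assumed seed, for repeatability
original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Bootstrap: sample len(original) items WITH replacement for each base model.
for i in range(3):
    resampled = random.choices(original, k=len(original))
    print(f"Resampled training set {i + 1}:", resampled)

# The final bagging prediction combines the base models, e.g. by majority vote:
votes = ["apple", "apple", "banana"]      # assumed base-model outputs
print(max(set(votes), key=votes.count))   # 'apple'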


Q3) Explain the steps for implementing the AdaBoost algorithm in detail.

Ans. AdaBoost stands for Adaptive Boosting. It is a machine learning algorithm that helps us make
better predictions by combining many simple models (called weak learners) into one strong model.

You can think of it like asking many people for advice. Each person may not be perfect on their own, but
if you ask the right people and give more attention to the best ones, you get a better decision in the
end. That’s what AdaBoost does with models.

The goal is to focus more on the mistakes made by earlier models and try to correct them in the next
ones. In the end, it combines all the models to make a powerful prediction system.

Steps to Implement AdaBoost :

1. Start with Equal Weights

● At the beginning, all the data points are considered equally important.
● Suppose you have 100 examples in your dataset. Each one gets a weight of 1/100.
● These weights help the model understand how much to pay attention to each example.

2. Train the First Weak Learner

● Use a simple model like a small decision tree (called a stump).


● Train it on the data using the current weights.
● It uses the weights to decide which data points matter more.

3. Check the Errors

● See which points the model got wrong.


● Calculate the total error: how much weight is on the wrong predictions.

Error = Sum of weights of misclassified points

4. Calculate Model’s Importance (Alpha)

● If the model is good (low error), it gets a high "vote" or weight (alpha).
● If it’s bad (high error), it gets a small vote.

The formula is:

alpha = 0.5 * log((1 - error) / error)

5. Update the Weights

● Increase weights for the data points that were predicted wrong (so the next model focuses more
on them).
● Decrease weights for the ones predicted correctly.
● This helps the next model improve on the mistakes.

New weight = old weight × exp(+alpha) for misclassified points, and old weight × exp(−alpha) for correctly classified points (the weights are then normalized to sum to 1).


6. Repeat Steps 2–5

● Train another weak model on the updated weights.


● Keep repeating for a set number of rounds or until you get good accuracy.

7. Final Prediction

● To make a final decision, combine all weak models using their alpha (importance).
● Each model gives a vote, and stronger models (with higher alpha) have more say.

Final prediction = sign(sum of (alpha × model_prediction))

Example:

If Model 1 says "Yes" with alpha = 0.8, and Model 2 says "No" with alpha = 0.3, then "Yes" wins because
it has more weight.
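The whole loop above is available off the shelf. The following is a hedged sketch assuming scikit-learn is installed; by default, AdaBoostClassifier uses one-level decision trees (stumps) as its weak learners.

```python
# An AdaBoost sketch with scikit-learn (assumed installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 rounds of steps 2-6: each round reweights the data so the next stump
# focuses on earlier mistakes; the final output is the alpha-weighted vote.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```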

Advantages of AdaBoost Algorithm:

· Fast, simple and easy to program.

· Robust to overfitting.

· Can be extended to learning problems beyond binary classification, and works with text as well as numeric data.

Drawbacks:

· Sensitive to noisy data and outliers.

· Weak classifiers that are too complex can lead to overfitting.

AdaBoost picks the training focus for each new model based on the mistakes of the previous one. It also
decides how much weight to give each model's answer. By combining many weak models, it creates a
strong one and was the first successful boosting method for solving yes/no (binary) problems.

Q4) Describe the key principles behind ensemble learning. Differentiate between bagging and
boosting techniques.

Ans.

Ø Key Principles of Ensemble Learning:

● Ensemble learning means combining multiple models to make better predictions.


● Instead of using just one model, we use a group (or ensemble) of models.
● The goal is to reduce errors and improve accuracy.
● It works because many weak models together can make a strong model.
1) Bagging – It creates different sets of training data by picking random samples with replacement
(some data points can repeat). Then, it trains separate models on each set. The final result is decided
by majority voting (the answer most models agree on).

Example: Random Forest uses bagging with decision trees.

2) Boosting – It combines many weak models (that are not very accurate) to make one strong model. It builds models one after another, and each new model tries to fix the mistakes made by the previous one. The goal is to keep improving the accuracy step by step.

Bagging :

Bagging (Bootstrap Aggregation) is used in Random Forest. It takes random samples from the original
data with replacement (called bootstrap). Each model is trained independently on these samples. Then,
all models’ results are combined using majority voting, which is called aggregation.

We take random samples with replacement from the original data (called bootstrap samples). Then, we train separate models on each sample. Each model gives a result; if most models predict "Happy", the final result is "Happy" based on majority voting.
Steps in Random Forest :

Step 1: Pick random samples from the original data (some records may repeat).
Step 2: Build a separate decision tree for each sample.
Step 3: Each tree makes its own prediction.

Step 4: For classification, the final result is the one that most trees agree on (majority voting).
For regression, the final result is the average of all tree predictions.
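As a rough illustration of these four steps (assuming scikit-learn; the iris dataset is used only as an example):

```python
# A minimal Random Forest sketch mirroring the four steps above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on its own bootstrap sample (steps 1-2); the
# forest's prediction is the majority vote of the trees (steps 3-4).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))
```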

Feature | Bagging | Boosting
Main Idea | Builds many models independently | Builds models one after another
Focus | Reduces variance | Reduces bias and variance
Data Sampling | Uses random samples (with replacement) | Uses full data, but changes weights on mistakes
Learning Style | All models learn at the same time | Each model learns from previous errors
Final Result | Uses majority voting or averaging | Uses weighted voting (more accurate models have more say)
Overfitting | Less likely | Can happen if not tuned properly
Speed | Can be faster (parallel training) | Can be slower (step-by-step)
Example | Random Forest | AdaBoost, XGBoost

Q5) Explain stacking architecture.

Ans: Stacking (or Stacked Generalization) is an ensemble method that combines the predictions of
multiple models to make a better final prediction.

It works like a team of models, where a special model (called the meta-model) learns how to best
combine the outputs from the other models.

Stacking Architecture:

Step 1: Train Base Models

● First, you train several different models (like Decision Tree, SVM, KNN, etc.) on the same
training data.
● These are called base models or level-0 models.
● Each base model gives its own prediction.

Step 2: Collect Predictions


● Next, take the predictions made by all the base models.
● These predictions are used as input features for another model.

Step 3: Train Meta-Model

● A new model (called the meta-model or level-1 model) is trained on the predictions from the
base models.
● The meta-model learns which base model to trust more for the final prediction.

Step 4: Make Final Prediction

● When testing new data, each base model gives a prediction.


● The meta-model uses those predictions to make the final output.

Example :

Let’s say you’re asking three friends (base models) for movie suggestions.

● One likes action, one likes comedy, and one likes drama.
● You notice whose suggestions you usually like best.
Now you ask a fourth friend (meta-model) to pick a movie, but they choose based on what the
other three said, and who usually makes better choices.

Why stacking?

Stacked models are often used to win machine learning competitions because they give better
accuracy than single models.
By using different types of models in the first layer (like decision trees, SVM, etc.), we can capture
different patterns in the data.
Combining their predictions helps to make more accurate results.
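A minimal sketch of this two-level architecture, assuming scikit-learn is available; the choice of base models and meta-model below is illustrative.

```python
# Stacking: three level-0 base models plus a level-1 meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

base_models = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# The meta-model is trained on the base models' out-of-fold predictions,
# so it learns which base model to trust more (steps 1-3 above).
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X, y)
print("Stacking accuracy:", stack.score(X, y))
```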
MODULE 07

Q1 What is Multidimensional Scaling (MDS), and what is its primary purpose? How does MDS differ from Principal Component Analysis?

Multidimensional Scaling (MDS) is a dimensionality reduction technique. It's mainly used to turn high-
dimensional data (data with many features) into a low-dimensional representation — usually 2D or 3D
— so that we can visualize it more easily.

But what makes MDS unique is what kind of information it uses.


Instead of using the raw feature values of the data points (like PCA does), MDS uses a matrix of
dissimilarities — a table that shows how different each pair of items is.

For example, if you have a list of cities and the distances between each pair of cities, MDS can turn that
distance matrix into a map-like plot where the cities appear in positions that reflect those distances.

The primary goal of MDS is to:

· Take complex, high-dimensional or relational data

· And map it into a low-dimensional space (usually 2D or 3D)

· While preserving the pairwise distances (or similarities) between points as much as possible

This is super helpful when you want to:

● Visualize relationships between items (e.g., which items are similar or different)
● Detect clusters, patterns, or outliers
● Explore data that doesn't have clear features but does have pairwise relationships (like
similarities or preferences)

MDS is often used in fields like psychology, marketing, and social sciences, where people might rate items by similarity, and you want to understand the structure behind their judgments.
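As a small illustration (assuming scikit-learn; the distance matrix below is invented purely for the example):

```python
# An MDS sketch that recovers 2D coordinates from a distance matrix.
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise distances between four cities (symmetric matrix
# with a zero diagonal) -- these numbers are assumptions, not real data.
D = np.array([
    [0, 3, 5, 7],
    [3, 0, 4, 6],
    [5, 4, 0, 2],
    [7, 6, 2, 0],
])

# dissimilarity="precomputed" tells MDS the input is already a distance
# matrix rather than raw feature values.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)  # one (x, y) position per city
print(coords)
```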

Here is the difference between MDS and PCA:

Feature | MDS | PCA
What it uses | Dissimilarity (or distance) between data points | Actual feature values of data
Goal | Preserve the pairwise distances between items | Find directions (principal components) of maximum variance
Focus | How different or similar the items are | How much the features vary
Type of input | Distance matrix or similarity matrix | Full data matrix with features
Visualization output | Low-dimensional map that reflects item relationships | Low-dimensional axes that capture data variation

Q2 Compare Feature Extraction and Feature Selection techniques. Explain how dimensionality can be reduced using Principal Component Analysis?

Both feature extraction and feature selection are dimensionality reduction techniques used to simplify
datasets by reducing the number of features (variables) while trying to keep as much useful information
as possible. However, they work in different ways:

Feature Selection:

● What it does: Selects a subset of the original features that are most relevant to the task (e.g.,
prediction, classification).
● How: Removes irrelevant or redundant features based on certain criteria (like correlation,
information gain, mutual information, etc.).
● Result: Keeps the original meaning of features; just fewer are used.
● Example: From a dataset with 100 features, selecting the top 10 most important ones based on
their relevance to the target.

Feature Extraction:

● What it does: Creates new features by combining or transforming the original ones.
● How: Uses techniques (like PCA) to transform high-dimensional data into a new lower-
dimensional space.
● Result: New features may not have a direct interpretation, but they capture important patterns
or structure in the data.
● Example: From 100 original features, creating 10 new ones that summarize the data but in a
transformed way.

How PCA Reduces Dimensionality (Feature Extraction Technique)

Principal Component Analysis (PCA) is a feature extraction method used to reduce dimensionality by
transforming the original data into a new set of variables called principal components.
Here's how PCA works step-by-step:

1. Standardize the data: Make sure all features have the same scale (especially if they are in
different units).
2. Compute the covariance matrix: This shows how features vary with respect to each other.
3. Find eigenvalues and eigenvectors: These help identify the directions (principal components)
where the data varies the most.
4. Select top components: Choose the top k components (based on the highest eigenvalues) that
capture the most variance in the data.
5. Transform the data: Project the original data onto the selected components to get a new, lower-
dimensional representation.

Example:

Suppose you have data with 100 features. PCA might tell you that 95% of the variance (information) in
the data can be captured using just 10 principal components. So, you reduce the data from 100 to 10
dimensions while still retaining most of its structure and meaning.
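A brief sketch of these five steps, assuming scikit-learn; passing a float to n_components asks PCA to keep just enough components to explain that fraction of the variance.

```python
# PCA as a feature extraction technique: standardize, then project.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # step 1: standardize

pca = PCA(n_components=0.95)             # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)  # steps 2-5 happen internally

print("Original dimensions:", X.shape[1])
print("Reduced dimensions:", X_reduced.shape[1])
print("Variance explained:", pca.explained_variance_ratio_)
```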

Q3 Discuss Dimensionality Reduction in detail.

Dimensionality reduction is the process of reducing the number of input variables or features in a
dataset. In simple terms, it’s about taking high-dimensional data (data with many features or columns)
and simplifying it without losing too much important information.

High-dimensional data can be difficult to analyze, visualize, and model. It can lead to problems like:

● Overfitting in machine learning models


● Increased computation time
● Difficulty in visualization
● Noise and redundancy in the data

So, dimensionality reduction helps us:

● Improve model performance


● Make data easier to understand and visualize
● Clean the data by removing irrelevant or redundant features

Why is Dimensionality Reduction Needed?

1. High-dimensional data is hard to visualize (e.g., you can’t plot data with 50+ features)
2. Too many features can cause overfitting (the model learns noise instead of patterns)
3. Redundant or irrelevant features add unnecessary complexity
4. Speed and efficiency improve with fewer features
5. Helps in noise reduction and better generalization
This problem is also known as the curse of dimensionality – as the number of features increases, the
data becomes sparse and harder to analyze effectively.

Types of Dimensionality Reduction Techniques

There are two main categories of techniques:

1. Feature Selection

● Selects a subset of the original features


● Keeps only the most relevant features
● Does not change the features, just removes the less useful ones
● Examples (a short sketch follows this list):
○ Filter Methods (e.g., correlation, chi-square test)
○ Wrapper Methods (e.g., recursive feature elimination)
○ Embedded Methods (e.g., LASSO)
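As a small illustration of a filter method (assuming scikit-learn), SelectKBest scores every original feature with the chi-square test and keeps only the top k, without transforming them.

```python
# A filter-method sketch: keep the k most relevant original features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 original features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)    # the 2 most relevant remain

print("Kept feature indices:", selector.get_support(indices=True))
print("New shape:", X_selected.shape)
```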

2. Feature Extraction

● Transforms the data from a high-dimensional space to a lower-dimensional space


● Creates new features from the original ones
● These new features may not be directly interpretable
● Examples:
○ Principal Component Analysis (PCA)
○ Linear Discriminant Analysis (LDA)
○ t-SNE (t-distributed Stochastic Neighbor Embedding)
○ Autoencoders (Neural networks-based)

Principal Component Analysis (PCA) – A Key Dimensionality Reduction Technique

PCA is one of the most popular feature extraction techniques. Here’s how it works:

1. It finds new axes (principal components) that capture the maximum variance in the data.
2. The first principal component captures the most variation; the second one captures the next
most, and so on.
3. You can select the top k components (based on variance explained) and project your data onto
them.
4. This reduces the dimensionality while still retaining most of the important information.

For example:
You have 100 features. PCA tells you that just 10 components explain 95% of the data’s variation. You
can then reduce your dataset to 10 dimensions.

Benefits of Dimensionality Reduction

● Simplifies models and makes them faster


● Reduces storage and memory usage
● Helps in data visualization (e.g., converting data to 2D or 3D for plotting)
● Improves accuracy by removing noise and irrelevant data
● Helps to uncover hidden patterns in the data

Challenges in Dimensionality Reduction

● Risk of losing important information if not done carefully


● Some techniques (like PCA) create features that are hard to interpret
● You may need to tune parameters (e.g., number of components to keep)
● May not work well if nonlinear relationships are present (though techniques like t-SNE handle
that)

Q4 Explain Bayesian belief network and clustering approaches

Bayesian Belief Network (BBN)

A Bayesian Belief Network (BBN), also known as a Bayesian Network, is a probabilistic graphical model
that represents a set of variables and their conditional dependencies using a directed acyclic graph
(DAG).

It combines graph theory and probability theory to model uncertainty in complex systems.

Key Components:

1. Nodes → Represent random variables (e.g., weather, disease, symptoms)


2. Edges (arrows) → Represent conditional dependencies (one variable influences another)
3. Conditional Probability Tables (CPTs) → Each node has a table that shows the probability of the
node given its parent(s)

How it works:

● Each node's value depends on its parents.


● You can predict unknown values based on known information using Bayes’ Theorem.
● It allows reasoning under uncertainty (e.g., what is the probability of someone having a disease if
they show certain symptoms?).

Example:

Imagine you're trying to diagnose a disease:

● Node A: "Has flu"


● Node B: "Has fever"
● Node C: "Has sore throat"

The graph might show that flu causes both fever and sore throat. Given a person has fever and sore
throat, the network can estimate the probability that the person has the flu.
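This calculation can be done directly with Bayes' Theorem. The sketch below uses made-up probabilities chosen purely for illustration, and assumes fever and sore throat are conditionally independent given flu (which is exactly what the graph structure encodes).

```python
# A plain-Python sketch of the flu example; all probabilities are assumed.
p_flu = 0.1                              # P(Flu)
p_fever = {True: 0.9, False: 0.2}        # P(Fever | Flu)
p_throat = {True: 0.8, False: 0.1}       # P(SoreThroat | Flu)

def joint(flu):
    """P(Flu=flu, Fever=yes, SoreThroat=yes) under the assumed tables."""
    prior = p_flu if flu else 1 - p_flu
    return prior * p_fever[flu] * p_throat[flu]

# Bayes' Theorem: posterior = joint probability / total evidence.
evidence = joint(True) + joint(False)    # 0.072 + 0.018 = 0.09
print("P(Flu | Fever, SoreThroat) =", joint(True) / evidence)  # 0.8
```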

Applications:
● Medical diagnosis
● Decision support systems
● Risk assessment
● Spam detection
● Weather prediction

Clustering Approaches in Machine Learning

Clustering is an unsupervised learning method used to group similar data points into clusters. There
are several popular approaches to clustering, each using a different strategy to group the data. Let’s
look at the four main types:

1. Distribution-Based Clustering

● In this method, data is assumed to come from a specific statistical distribution (often
Gaussian).
● Each cluster corresponds to a probability distribution, and data points are grouped based on
which distribution they are most likely to belong to.
● This method is useful when data naturally follows certain patterns or distributions.
● Example: Gaussian Mixture Models (GMM).

2. Density-Based Clustering

● Clusters are formed based on dense regions of data points.


● Data points in high-density areas are grouped together, while points in low-density areas (like
noise or outliers) are left out.
● It’s good for finding arbitrarily shaped clusters and handling outliers.
● Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
3. Fuzzy-Based Clustering (Soft Clustering)

● In this approach, data points can belong to more than one cluster.
● Instead of assigning a point to just one cluster, it assigns membership scores (probabilities) to
multiple clusters.
● Useful when cluster boundaries are not clearly defined and may overlap.
● Example: Fuzzy C-Means Algorithm.

4. Centroid-Based Clustering

● Clusters are formed around a central point (called a centroid).


● The algorithm iteratively adjusts the position of centroids to minimize the distance between
the data points and their nearest centroid.
● It works well with spherical-shaped clusters but is sensitive to outliers.
● Example: K-Means Clustering.
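The sketch below contrasts three of these approaches on the same toy data, assuming scikit-learn is available; the parameter values are illustrative.

```python
# GMM is distribution-based, DBSCAN density-based, K-Means centroid-based.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # -1 marks noise
gmm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

print("K-Means labels:", set(kmeans))
print("DBSCAN labels (incl. noise):", set(dbscan))
print("GMM labels:", set(gmm))
```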

Q5 What is a soft margin hyperplane? What are the advantages of K-Nearest Neighbour?

In many real-world scenarios, data is not linearly separable, meaning it cannot be perfectly divided by a
straight line or linear boundary. Support Vector Machines (SVM) address this issue through the concept
of a Soft Margin Hyperplane.

A soft margin allows the SVM to tolerate a certain number of misclassifications while still trying to find a
hyperplane that separates the data as best as possible. The idea is to maximize the margin between
classes while minimizing classification errors.
Mathematically, this is achieved by introducing slack variables (ξᵢ) that measure how much a data point
violates the margin. A penalty is added to the objective function for each violation, and a parameter C
controls the trade-off:

● Small C → Focuses more on maximizing margin (allows more errors).


● Large C → Focuses more on minimizing errors (tighter margin).

This soft margin approach helps prevent overfitting, especially when the data is noisy or overlapping,
and leads to better generalization on unseen data.

Formula Summary for the Optimization Problem:

1. Objective Function: minimize (1/2)||w||² + C × Σ ξᵢ (the sum runs over all training points)

2. Constraints: yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for every training point i
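A small sketch of the C trade-off, assuming scikit-learn; the dataset and the two C values below are chosen only for illustration.

```python
# Comparing a small-C (soft) and large-C (near-hard) linear SVM.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# flip_y=0.1 deliberately mislabels 10% of points, i.e. noisy data.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           flip_y=0.1, random_state=0)

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # small C: wider margin
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # large C: fewer violations

# A wider, more tolerant margin typically keeps more support vectors.
print("Support vectors with C=0.01:", len(soft.support_))
print("Support vectors with C=100 :", len(hard.support_))
```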

Advantages of K-Nearest Neighbour (KNN)

1. No Training Phase Needed


○ KNN works instantly without any prior training phase, making it fast for deployment.
2. Simple to Implement
○ The algorithm is easy to code and understand.
3. Easy to Update with New Data
○ Since there is no model to retrain, adding new data is straightforward.
4. Adapts to Changing Data
○ It adjusts instantly to changes in the dataset as it always works with current data.
5. Supports Multiple Distance Metrics
○ Works with Euclidean, Manhattan, Minkowski distances, giving it flexibility for different
kinds of data.
6. Solves Both Classification and Regression Problems
○ KNN is a versatile algorithm that can be applied to both types of tasks.
7. No Assumptions About Data
○ It’s a non-parametric method, so it doesn’t assume any underlying distribution.
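A minimal sketch, assuming scikit-learn, showing points 1, 5, and 7 from the list above in action:

```python
# KNN: "fitting" just stores the data; the metric is configurable.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)  # no training phase beyond storing the points

# A new flower is classified by majority vote among its 3 nearest
# neighbours in the stored data.
print("Predicted class:", knn.predict([[5.1, 3.5, 1.4, 0.2]]))
```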
