AI_&_ML tw shivam
Ans: An intelligent agent (IA) is a system that perceives its environment through sensors and acts upon it
using actuators to achieve specific objectives. These agents can be human, robotic, or software-based
and are designed to operate autonomously while making intelligent decisions based on their
observations.
Categories of Agents
1. Human Agents
○ Humans are natural intelligent agents.
○ They perceive the environment through senses (eyes, ears, skin) and act through muscles,
speech, and movement.
○ Example: A driver who senses road conditions and reacts accordingly.
2. Robotic Agents
○ These agents are physical machines that interact with the real world.
○ They use sensors (cameras, infrared, gyroscopes) to perceive their environment and
actuators (motors, arms, wheels) to perform actions.
○ Example: A self-driving car uses sensors like LiDAR and cameras to navigate roads and
avoid obstacles.
3. Software Agents
○ These agents exist in a digital environment and interact with software and databases.
○ They use algorithms and programming logic to make intelligent decisions.
○ Example: A chatbot answering customer queries on a website.
PEAS Representation
● Performance Measure:
○ Safety, speed, fuel efficiency, passenger comfort, adherence to traffic rules.
● Environment:
○ Roads, traffic signals, pedestrians, weather conditions, other vehicles.
● Actuators:
○ Steering wheel, accelerator, brake, indicators, wipers.
● Sensors:
○ Cameras, GPS, LiDAR, radar, speed sensors, proximity sensors.
Using PEAS helps in designing intelligent agents by clearly defining their capabilities and limitations.
Logic plays a crucial role in Artificial Intelligence (AI) by enabling machines to represent and reason
about knowledge. The two main types of logic used in AI are Propositional Logic (PL) and First-Order
Logic (FOL).
Propositional Logic, also known as Boolean Logic, deals with propositions (statements that are either
true or false). It combines them using logical connectives such as NOT (¬), AND (∧), OR (∨), and
implication (→). For example:
● P: "It is raining."
● Q: "The ground is wet."
● A rule: P → Q ("If it is raining, then the ground is wet").
PL is limited because it cannot handle relationships or quantify statements like "All humans are mortal."
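The rule P → Q can be checked mechanically by listing every truth assignment. The sketch below is only an illustration; the helper name `implies` is an assumption, not something from these notes.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: P -> Q is false only when P is true and Q is false."""
    return (not p) or q

# Enumerate every truth assignment for P ("It is raining") and Q ("The ground is wet").
truth_table = [(p, q, implies(p, q)) for p, q in product([True, False], repeat=2)]

for p, q, result in truth_table:
    print(f"P={p!s:<5} Q={q!s:<5} P->Q={result}")
```

Only the row with P true and Q false makes the rule false, which matches the reading "if it is raining, then the ground is wet".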
Example:
Ans: In AI & ML, solving problems efficiently requires proper problem formulation, representation, and
search strategies.
A problem is a computational task that needs to be solved using algorithms and models. It involves a
well-defined initial state, a goal state, and a set of possible actions to transition between states.
Example:
● In chess, the initial state is the board setup, the goal state is checkmating the opponent, and the
actions are valid chess moves.
● A classification problem aims to correctly label data points based on input features.
2. Problem Space
The problem space refers to all possible states and actions that can be taken to reach the goal from the
initial state. It includes:
Example:
● In a maze-solving AI, the problem space consists of all possible paths from the start to the exit.
● In ML, the problem space consists of all possible models and parameters that can be used to
minimize loss.
1. Deductive Reasoning
● Definition: Deductive reasoning derives specific conclusions from general rules or premises. If
the premises are true, the conclusion must also be true.
● Approach: Top-down (General → Specific).
● Example:
○ Premises:
■ "All humans are mortal."
■ "Socrates is a human."
● AI Applications:
○ Rule-based expert systems
○ Automated theorem proving
○ Formal logic programming
2. Inductive Reasoning
● Definition: Inductive reasoning infers general rules from specific observations. The conclusion is
probable but not always true.
● Approach: Bottom-up (Specific → General).
● Example:
○ "The sun has risen in the east every day so far."
○ "Therefore, the sun will rise in the east tomorrow."
● AI Applications:
○ Machine Learning (ML) (models learn from data patterns).
○ Data mining and predictive analytics.
3. Abductive Reasoning
● Definition: Abductive reasoning starts with an observation and seeks the most likely
explanation. It works with incomplete information.
● Approach: Best Hypothesis (Inference to the best explanation).
● Example:
○ "The grass is wet."
○ "It probably rained last night."
● AI Applications:
○ Medical Diagnosis Systems (If a patient has symptoms X, they might have disease Y).
○ Fault detection in engineering and IT systems.
4. Common Sense Reasoning
● Definition: This reasoning draws logical conclusions from general world knowledge and
experience.
● Approach: Human-like reasoning, intuitive knowledge.
● Example:
○ "If a glass falls on the floor, it will likely break."
● AI Applications:
○ Natural Language Processing (NLP)
○ Robotics (AI assistants interacting with humans)
5. Monotonic Reasoning
● Definition: In monotonic reasoning, once a conclusion is derived, it remains valid even if new
knowledge is added.
● Approach: Knowledge remains fixed.
● Example:
○ "All birds can fly."
○ "A sparrow is a bird."
○ Conclusion: "A sparrow can fly."
○ (But this does not allow exceptions like "Penguins cannot fly.")
● AI Applications:
○ Traditional rule-based expert systems.
6. Non-Monotonic Reasoning
Ans: A Model-Based Reflex Agent is an intelligent agent that uses an internal model of the
environment to handle partial observability (when the agent does not have complete information). It
maintains an internal state that keeps track of past experiences to make better decisions.
a) Model
b) Internal State
● The internal state stores past perceptions to infer missing information.
● It helps the agent make decisions even when some sensor data is unavailable.
● Example: In a maze-solving robot, if a wall is detected earlier but no longer visible, the internal
state remembers its presence.
c) Decision-Making
● The agent uses the model and internal state to select appropriate actions.
● Condition-Action Rules guide its responses to different situations.
● Example: If a robot vacuum detects a dirty area, it stores this information and revisits it later.
● Handles Partial Observability: Can work even when all information is not available.
● More Efficient: Uses memory and past experiences for better decision-making.
● Can Adapt to Changes: Adjusts actions based on new observations.
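The interplay of model, internal state, and condition-action rules can be sketched for the robot vacuum example. This is a minimal illustration under assumptions: the two-cell world ("A" and "B"), the class name, and the action names are all hypothetical.

```python
class ModelBasedVacuumAgent:
    """Hypothetical two-cell vacuum world with cells 'A' and 'B'."""

    def __init__(self):
        # Internal state: last known status of each cell (None = never observed).
        self.known_status = {"A": None, "B": None}

    def act(self, location, is_dirty):
        # Update the internal state from the current percept.
        self.known_status[location] = "dirty" if is_dirty else "clean"

        # Condition-action rules that consult the internal state.
        if is_dirty:
            # Model of the world: sucking leaves the current cell clean.
            self.known_status[location] = "clean"
            return "Suck"
        other = "B" if location == "A" else "A"
        if self.known_status[other] != "clean":
            # The other cell is dirty or unknown, so go and check it.
            return f"MoveTo{other}"
        return "NoOp"


agent = ModelBasedVacuumAgent()
actions = [agent.act("A", True), agent.act("A", False), agent.act("B", False)]
print(actions)
```

The agent only sees its current cell, yet it can stop once its stored state says both cells are clean, which is how the internal state compensates for partial observability.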
Ans: The A* algorithm is a widely used graph/tree search algorithm in AI for finding the shortest path
between a start node and a goal node. It is commonly used in robotics, game development, and
navigation systems.
● Combines Best-First Search & Dijkstra’s Algorithm – It balances exploration and optimization.
● Uses Heuristic Function – Helps in making informed decisions.
● Ensures Optimality – If the heuristic is admissible (i.e., never overestimates), A* always finds the
optimal solution.
2. Cost Functions in A*
1. g(n) → Path cost from the start node to the current node (n).
2. h(n) → Heuristic function estimating the cost from n to the goal.
3. f(n) → Total cost function:
$f(n) = g(n) + h(n)$
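A minimal sketch of how g(n), h(n), and f(n) drive the search is given here. The edge costs and heuristic values are assumptions chosen for illustration (they are not the weights of the figure in these notes), and the heuristic was picked so that it never overestimates.

```python
import heapq

def a_star(graph, heuristic, start, goal):
    """Return (path, cost) of the cheapest path, or (None, inf) if none exists."""
    # Each frontier entry is (f, g, node, path); heapq pops the lowest f first.
    frontier = [(heuristic[start], 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was already found
        for neighbor, step_cost in graph.get(node, {}).items():
            new_g = g + step_cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                new_f = new_g + heuristic[neighbor]  # f(n) = g(n) + h(n)
                heapq.heappush(frontier, (new_f, new_g, neighbor, path + [neighbor]))
    return None, float("inf")

# Hypothetical directed graph (edge costs) and admissible heuristic values.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 7, "E": 5},
    "C": {"G": 7},
    "D": {"E": 1},
    "E": {"G": 3},
}
heuristic = {"A": 7, "B": 6, "C": 5, "D": 4, "E": 3, "G": 0}

path, cost = a_star(graph, heuristic, "A", "G")
print(path, cost)
```

With these assumed values the search returns the path A, B, E, G with total cost 9.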
Let’s consider a graph with nodes (A, B, C, D, E, and G) where we want to find the shortest path from A
to G.
/ \
(1) (4)
B ---- C
| \ \
D -- E -- G
| |
(7) (3)
Step-by-Step Execution
1. Start at Node A
○ g(A) = 0, h(A) = heuristic value (assumed).
○ f(A) = g(A) + h(A).
2. Expand A → B & C
○ Calculate f(B) and f(C).
○ Choose the node with the lowest f(n).
Final Path:
4. Advantages of A*
5. Limitations of A*
6. Applications of A*
Ans: The Alpha-Beta Pruning Algorithm is an optimization technique for the Minimax algorithm used in
decision-making and game-playing AI (e.g., Chess, Tic-Tac-Toe). It reduces the number of nodes
evaluated in a game tree, making Minimax more efficient.
● Minimax is used in two-player games where one player (MAX) tries to maximize the score, while
the other (MIN) tries to minimize it.
● The algorithm builds a game tree and evaluates possible future moves.
● Alpha (α): Best (highest) value that the MAX player can guarantee.
● Beta (β): Best (lowest) value that the MIN player can guarantee.
● Pruning occurs when:
○ β ≤ α, meaning further exploration is unnecessary.
(MAX)
/ \
(MIN) (MIN)
B C
/\ / \
3 5 2 9
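The pruning rule can be traced on the small tree above (a MAX root, MIN node B with leaves 3 and 5, and MIN node C with leaves 2 and 9). The nested-list encoding of the tree and the `visited` list are assumptions made for this sketch.

```python
import math

def alphabeta(node, is_max, alpha=-math.inf, beta=math.inf, visited=None):
    """Return the minimax value of `node`, skipping branches that cannot matter."""
    if visited is None:
        visited = []
    if isinstance(node, (int, float)):  # leaf: record it and return its value
        visited.append(node)
        return node
    best = -math.inf if is_max else math.inf
    for child in node:
        value = alphabeta(child, not is_max, alpha, beta, visited)
        if is_max:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if beta <= alpha:  # pruning condition: remaining children are irrelevant
            break
    return best

tree = [[3, 5], [2, 9]]  # root (MAX) -> B (MIN: 3, 5), C (MIN: 2, 9)
visited = []
root_value = alphabeta(tree, True, visited=visited)
print(root_value, visited)
```

After B returns 3, alpha at the root is 3. At C the first leaf gives beta = 2, so β ≤ α holds and the leaf 9 is never evaluated.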
Step-by-Step Execution
● Depends on Move Order: Works best when the best moves are explored first.
● Not Useful for Non-Adversarial Problems: Only applies to game-playing AI.
Ans: Heuristic search techniques are informed search algorithms that use a heuristic function to guide
the search toward a good solution efficiently. These techniques are widely used in pathfinding, game AI, and
problem-solving where a high-quality solution needs to be found quickly.
Definition
● Best-First Search (BFS) is an informed search algorithm that uses a heuristic function (h(n)) to
select the most promising node at each step.
● It expands the node with the lowest heuristic cost first.
Example
/ \
(4) (2)
B ---- C
| \ \
D -- E -- G
Advantages
Limitations
2. A* Search Algorithm
Definition
Working of A*
Example
/\
A B
(3) (2)
/\ |\
C D E G
Advantages
Limitations
● Can be memory-intensive.
● Performance depends on heuristic quality.
A* Pathfinding
● A* search is one of the most widely used algorithms for finding the shortest path in games, navigation, and
robotics.
● It ensures the optimal path by balancing g(n) and h(n).
Example
Ans: Hill Climbing is a heuristic search algorithm used for optimization problems where the goal is to
find the best possible solution by making incremental improvements. It continuously moves towards a
state with a higher heuristic value until it reaches a peak (local or global optimum).
● Definition: Evaluates only one neighbor at a time and moves to the first neighbor with a better
heuristic value.
● Process:
1. Start at an initial state.
2. Evaluate a single neighboring state.
3. If it is better, move to it; otherwise, stop.
● Advantages:
1. Simple and easy to implement.
2. Uses minimal memory.
● Limitations:
1. Can get stuck in local maxima.
2. Might not find the best solution due to a limited search.
● Definition: Evaluates all possible neighboring states and chooses the best one with the highest
heuristic value.
● Process:
1. Start at an initial state.
2. Examine all neighbors.
3. Move to the best neighbor (highest heuristic value).
4. Repeat until no better neighbor exists.
● Advantages:
1. Finds a better solution than Simple Hill Climbing.
2. Less likely to get stuck in plateaus.
● Limitations:
1. More computationally expensive.
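The steepest-ascent process above can be sketched in a few lines. The objective function and the neighbor rule (integer steps of ±1) are assumptions chosen only to keep the example small.

```python
def steepest_ascent_hill_climb(start, neighbors, score):
    """Move to the best neighbor until no neighbor improves on the current state."""
    current = start
    while True:
        best = max(neighbors(current), key=score)  # examine all neighbors
        if score(best) <= score(current):
            return current  # no better neighbor exists: a (local) peak
        current = best

# Hypothetical problem: maximize -(x - 3)^2 over the integers.
def score(x):
    return -(x - 3) ** 2

def neighbors(x):
    return [x - 1, x + 1]

result = steepest_ascent_hill_climb(0, neighbors, score)
print(result)
```

This toy objective has a single peak, so the climb reaches the global maximum at x = 3. On a function with several peaks the same loop would stop at whichever peak is nearest the start.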
● Definition: Instead of choosing the best neighbor, it randomly selects a neighbor and decides
whether to move based on probability.
● Process:
○ Start at an initial state.
○ Randomly choose a neighboring state.
○ Move to it if it is better or sometimes even if it is worse (to escape local maxima).
● Advantages:
1. Can escape local maxima.
2. Useful for large and complex search spaces.
● Limitations:
1. Less predictable as it relies on randomness.
2. May take longer to converge to an optimal solution.
a) Local Maxima
● A peak where all neighbors are worse but it is not the best solution (global maximum).
● Solution: Use random restarts or simulated annealing.
b) Plateaus
c) Ridges
Ans: The AO* (AND-OR) Search Algorithm is a heuristic search technique used for solving problems with
multiple goals and dependencies. It is particularly useful in AND-OR graphs, where solutions require
either all or some of the sub-goals to be solved.
● AND-OR Graph: Represents problems where nodes can have AND (multiple conditions must be
met) and OR (at least one condition must be met) relationships.
● Heuristic Search: Uses h(n) (heuristic function) to estimate the best path.
● Backtracking: Updates path costs dynamically based on heuristic updates.
● Efficient Path Selection: Instead of expanding all nodes, AO* focuses only on the most promising
paths.
2. Working of AO*
3. Example of AO*
Problem Statement
A robot needs to reach the goal (G) while overcoming multiple obstacles. The decision paths include
multiple sub-goals, some of which must be solved together (AND) and some that can be solved
separately (OR).
Start
/ \
A(OR) B(OR)
/ | | \
C(AND) D(AND) E
/ \ | |
G1 G2 G3 G4
Explanation
4. Advantages of AO*
5. Limitations of AO*
6. Applications of AO*
Working of Adaline
Training Adaline
Advantages of Adaline
1. Uses Mean Squared Error (MSE), leading to a smoother optimization process.
2. Can be extended to Multi-Layer Networks (basis of modern neural networks).
3. More stable training compared to perceptron.
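The MSE-based training mentioned above can be sketched as follows. This is a minimal illustration, assuming batch gradient descent on a linear activation. The dataset (logical AND with targets -1/+1), the learning rate, and the epoch count are all assumptions.

```python
def train_adaline(samples, targets, lr=0.1, epochs=500):
    """Fit weights and bias by gradient descent on the mean squared error."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        # The error is measured on the raw linear output, before any thresholding.
        errors = [
            t - (sum(w * x for w, x in zip(weights, xs)) + bias)
            for xs, t in zip(samples, targets)
        ]
        for j in range(n_features):
            weights[j] += lr * sum(e * xs[j] for e, xs in zip(errors, samples)) / n
        bias += lr * sum(errors) / n
    return weights, bias

def adaline_predict(weights, bias, xs):
    """Threshold the linear output to obtain a class label of +1 or -1."""
    net = sum(w * x for w, x in zip(weights, xs)) + bias
    return 1 if net >= 0 else -1

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [-1, -1, -1, 1]  # logical AND with bipolar labels
weights, bias = train_adaline(samples, targets)
predictions = [adaline_predict(weights, bias, xs) for xs in samples]
print(weights, bias, predictions)
```

Unlike the perceptron, the update uses the continuous error rather than the thresholded prediction, which is why the weight changes shrink smoothly as training proceeds.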
Ans. An Activation Function is a mathematical function used in neural networks to decide whether a
neuron should be activated or not. It takes an input, processes it, and outputs a value that determines
the strength of the neuron’s signal in the next layer.
1. Introduces Non-Linearity:
○ Without activation functions, neural networks would behave like linear regression
models, unable to learn complex patterns.
3. Helps in Backpropagation:
○ It influences gradient flow during training, affecting how the network updates weights.
● $f(x) = x$
● The output is the same as the input.
● Limitation: Cannot learn complex patterns.
a) Sigmoid Function
● Formula:
$f(x) = \frac{1}{1 + e^{-x}}$
● Output range: (0,1)
● Used in probability-based problems.
● Limitation: Can cause vanishing gradient problems in deep networks.
b) Tanh Function
● Formula:
$f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
● Output range: (-1,1)
● Zero-centered, making it more efficient than sigmoid.
c) ReLU Function
● Formula:
$f(x) = \max(0, x)$
● Output: 0 for negative inputs, same as input for positive values.
● Advantages: Faster computation and reduces the vanishing gradient problem for positive inputs.
● Limitation: Can cause dead neurons (neurons output 0 forever).
d) Leaky ReLU
● Formula:
$f(x) = \max(0.01x, x)$
● Allows small negative values, preventing dead neurons.
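The activation functions listed in this answer can be written directly from their formulas. The function names below are assumptions, and the slope 0.01 for Leaky ReLU is the value given in the formula above.

```python
import math

def linear(x):
    return x  # output equals input

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # output in (0, 1)

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))  # output in (-1, 1)

def relu(x):
    return max(0.0, x)  # zero for negative inputs

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)  # small negative output instead of zero

for name, fn in [("linear", linear), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Comparing the printed values for -2.0 shows the difference between ReLU, which returns 0, and Leaky ReLU, which returns a small negative value.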
Ans. Gradient Descent is an optimization algorithm used to find the minimum of a function. It is widely
used in machine learning and deep learning for optimizing model parameters by minimizing the loss
function.
Mathematical Explanation
● Let’s say we have a function $f(x)$ that we want to minimize.
● The gradient (derivative) of the function, denoted as $\nabla f(x)$, gives the direction of
the steepest ascent.
θ:=θ−α⋅∇J(θ)
Where:
● Uses the entire dataset to compute the gradient and update weights.
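The update rule θ := θ − α⋅∇J(θ), computed over the entire dataset at each step, can be sketched as follows. The data, the single-parameter model y = θ·x, and the learning rate are assumptions for illustration.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, steps=50):
    """Minimize J(theta) = mean((theta * x - y)^2) using the full dataset each step."""
    theta = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to theta.
        gradient = sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / n
        theta -= learning_rate * gradient  # step against the gradient
    return theta

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
theta = batch_gradient_descent(xs, ys)
print(theta)
```

Because the data were generated with a slope of 2, the estimate converges to a value very close to 2.0. A learning rate that is too large would make the updates overshoot and diverge instead.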
It works by:
● Set weights ($w_i$) and bias $b$ to small random numbers or zeros.
Step 4: Repeat
● Repeat the process for a number of epochs or until the model classifies all training examples
correctly.
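The initialize-and-repeat procedure can be sketched with the standard perceptron learning rule (each weight moves by learning rate × error × input). The AND dataset, the step threshold at zero, and the parameter values are assumptions made for this example.

```python
def train_perceptron(samples, targets, learning_rate=1.0, max_epochs=100):
    """Train until every example is classified correctly or max_epochs is reached."""
    weights = [0.0] * len(samples[0])  # start from zeros
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xs, target in zip(samples, targets):
            net = sum(w * x for w, x in zip(weights, xs)) + bias
            prediction = 1 if net > 0 else 0  # step activation
            error = target - prediction
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x for w, x in zip(weights, xs)]
                bias += learning_rate * error
        if mistakes == 0:
            break  # all training examples classified correctly
    return weights, bias

def perceptron_predict(weights, bias, xs):
    return 1 if sum(w * x for w, x in zip(weights, xs)) + bias > 0 else 0

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]  # logical AND
weights, bias = train_perceptron(samples, targets)
predictions = [perceptron_predict(weights, bias, xs) for xs in samples]
print(weights, bias, predictions)
```

The loop terminates early here because AND is linearly separable. On data that is not linearly separable it would simply run until `max_epochs`.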
Ans. K-Nearest Neighbor (KNN) is a supervised learning algorithm used for both classification and
regression problems.
Imagine there are two categories, say category X and category Y, and a new data point x1 is
introduced. Which category should it be assigned to? This decision is made by the K-Nearest
Neighbor classifier.
It is also called a lazy learner classification algorithm. K is a constant value defined by the user. The
distance between points can be computed with any of the formulas below:
● Manhattan Distance
Mdist = |x2-x1| + |y2-y1|
● Advantages of K-Nearest Neighbor
1. Easy to Implement: Simple steps and straightforward logic.
2. Easily Updatable: New data can be added anytime without retraining.
3. Simple to Use: Just share the dataset; KNN finds the best match without complex
procedures.
● KNN application
• Medicine
• Online shopping
• Data mining
• Agriculture etc.
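The classification step of K-Nearest Neighbor with the Manhattan distance described above can be sketched as follows. The training points, the labels, and K = 3 are assumptions for illustration.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: manhattan(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (point, category).
train = [
    ((1, 1), "X"), ((1, 2), "X"), ((2, 1), "X"),
    ((6, 6), "Y"), ((7, 6), "Y"), ((6, 7), "Y"),
]
label_near_x = knn_classify(train, (2, 2))
label_near_y = knn_classify(train, (6, 5))
print(label_near_x, label_near_y)
```

There is no training phase: all work happens at query time, which is why KNN is called a lazy learner.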
Q2. Explain Decision Tree classifier. How does information gain help in determining the best attribute
to split on?
Ans. A Decision Tree is a supervised learning algorithm used for classification and regression tasks.
● A decision tree repeatedly splits the data into regions, which makes its decisions easy to follow
and interpret.
● Decision Tree is non-parametric, meaning it grows by analyzing data without predefined
parameters.
● The tree expands gradually, adding branches and leaves as it learns from the data.
● It is one of the oldest and most popular machine learning algorithms.
● Decision Trees are robust and can handle missing and noisy data effectively.
● Tree building starts with the root node and proceeds level by level by adding child nodes.
● This building method is called binary recursive splitting.
● The first main step is to classify the data into subsets.
● After classification, the Decision Tree algorithm is applied to fully build the tree.
● Decision Tree terminologies
○ Parent node: A node that is split into sub-nodes; the root node is the topmost parent node.
○ Child nodes: The successor nodes of a parent node.
○ Root node: The first node, located at the top of the decision tree; it
represents the whole data set.
○ Leaf node: An end node of the decision tree; once a leaf node is reached, no further
segregation is possible.
○ Splitting: The mechanism of dividing the root or a decision node according to a
given condition.
○ Sub-tree or branch: A section of the tree produced by a split.
○ Pruning: Removing a particular branch from the tree.
● Working of Decision tree
○ It starts at the root node and tests the attribute used to split the dataset at that node.
○ The record's attribute value is compared with the split condition.
○ Depending on the outcome of the comparison, it follows the matching branch to the next
node.
○ This comparison is repeated at each node until the end of the tree is reached.
○ End nodes are leaf nodes, after which no further branches are created.
Decision tree example
● The image is a good example of a Decision Tree aiming to decide whether to buy a laptop or not.
● Initially, all data like laptop price and configuration is considered.
● The root node is created based on price range (e.g., ₹40,000 to ₹80,000).
● If the price condition is satisfied (Yes), it moves to the next node; if not (No), a declined node is
created.
● The next node checks the OS version; if it’s the latest, it moves to the buying node, otherwise
declines.
● The tree proceeds step-by-step, making the decision process simple and accurate.
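The question also asks how information gain selects the attribute to split on. A minimal sketch, assuming the usual entropy-based definition (gain = parent entropy minus the size-weighted entropy of the subsets), is shown here. The buy/decline labels and both candidate splits are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets after a split."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical laptop data: 4 "yes" (buy) and 4 "no" (decline) decisions.
parent = ["yes"] * 4 + ["no"] * 4

# Candidate split 1 (e.g. on price range) leaves both subsets mixed.
gain_mixed = information_gain(parent, [["yes", "yes", "yes", "no"],
                                       ["no", "no", "no", "yes"]])
# Candidate split 2 (another hypothetical attribute) separates the classes perfectly.
gain_pure = information_gain(parent, [["yes"] * 4, ["no"] * 4])
print(round(gain_mixed, 4), round(gain_pure, 4))
```

The attribute with the higher gain is preferred for the split, because it leaves the child nodes purer.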
Ans. Naive Bayes is a machine learning classification technique based on Bayes' theorem.
● It is often preferred for very large datasets. It uses a probabilistic
approach to find the solution.
● The term “naive” is used because the model considers many features and assumes that these
features are independent of each other.
● For example, consider classifying a tomato with Naive Bayes. The features might be that it is
round in shape, red in color, and small in size. Although all of these together determine
the end result, each is treated as independent of the others.
● This algorithm is very popular, mainly because it is simple to code and understand.
Conditional Probability
○ The probability of an event is the number of favorable outcomes divided by the total number of
possible outcomes.
○ For example, when rolling a die (6 faces), the probability of any one face is 1/6 ≈ 0.167.
○ Conditional probability, P(A|B), is the probability of event A given that event B has occurred:
P(A|B) = P(A ∩ B) / P(B).
It works with known and unknown values, using given evidence to find solutions.
Where:
Formulas:
Example:
Logistic Regression is a supervised machine learning algorithm used for predicting the probability of a
categorical dependent variable. It is mainly used for classification problems, where the output is in
discrete categories like 0 or 1, Yes or No, True or False.
Nature of Output: Unlike linear regression, which predicts continuous values, logistic regression predicts
probabilistic values between 0 and 1. These probabilities are then mapped to class labels using a threshold
(usually 0.5).
Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification. The below image is showing the logistic
function:
The sigmoid function is a mathematical function used to map the predicted values to probabilities.
It maps any real value into another value within a range of 0 and 1.
The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic function.
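The mapping from a real-valued score to a probability and then to a class label can be sketched as follows. The weights, bias, and feature values are hypothetical, and the 0.5 threshold is the one mentioned above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability, class label) for one observation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

weights, bias = [1.5, -0.5], -1.0  # hypothetical, as if already learned
p_hi, label_hi = predict([2.0, 1.0], weights, bias)  # score z = 1.5
p_lo, label_lo = predict([0.0, 2.0], weights, bias)  # score z = -2.0
print(round(p_hi, 3), label_hi, round(p_lo, 3), label_lo)
```

A positive score gives a probability above 0.5 and therefore class 1; a negative score gives class 0.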
Random Forest is a supervised machine learning algorithm used for both classification and regression
tasks. It works by creating multiple decision trees from random subsets of the data and combines their
results — majority vote for classification and average for regression.
Working Principle:
Random Forest is based on the ensemble method, which combines multiple models to improve
performance.
1. Bagging: Builds multiple models using different subsets of data (with replacement) and combines
them (e.g., Random Forest).
2. Boosting: Builds models sequentially, where each model improves on the errors of the previous one
(e.g., AdaBoost, XGBoost).
Bagging (Bootstrap Aggregation) is an ensemble technique used in Random Forest. It involves selecting
random samples with replacement (bootstrap samples) from the original dataset. Each sample is used to
train a separate model independently. The final prediction is made by combining the outputs of all
models using majority voting (for classification) or averaging (for regression).
Step 4: Final output is considered based on Majority Voting or Averaging for Classification and
regression respectively
Example
Consider the fruit basket as the data as shown in the figure below. Now n number of samples are taken
from the fruit basket and an individual decision tree is constructed for each sample. Each decision tree
will generate an output as shown in the figure. The final output is considered based on majority voting.
In the below figure you can see that the majority decision tree gives output as an apple when compared
to a banana, so the final output is taken as an apple
Importance
Each tree uses different subsets of features, making every tree unique.
Reduces feature space by not using all variables in each tree.
Trees are built independently, allowing efficient use of CPU resources.
About one-third of the data (the out-of-bag samples, roughly 37% for a full-size bootstrap sample) is unused when training each tree and can be used for testing.
Final results are more consistent due to aggregation (voting or averaging).
The Expectation-Maximization (EM) algorithm is a powerful method used for estimating the values of
latent variables—variables that are not directly observed but inferred from other observable data. EM is
particularly useful when the general form of the underlying probability distribution of these variables is
known.
This algorithm plays a key role in unsupervised learning, especially in clustering techniques such as
Gaussian Mixture Models (GMMs).
EM was formally introduced in a 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. It is
widely used to find maximum likelihood estimates of parameters in statistical models when data is
incomplete or has missing values.
Algorithm:
1. Initialization: Given a set of incomplete data, start with an initial guess for the parameters.
2. Expectation Step (E-step): Using the available observed data, estimate or "guess" the missing values
(latent variables).
3. Maximization Step (M-step): Once the missing data is estimated, use this complete data to update the
model parameters.
4. Iteration: Repeat the E-step and M-step until the parameters converge or the algorithm reaches a
stopping criterion.
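The four steps above can be sketched for a one-dimensional mixture of two Gaussians. This is only an illustration under assumptions: the data, the initial guesses, and the fixed iteration count (used here in place of a convergence test) are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns updated (means, variances, weights)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each point (the latent variable).
        resp = []
        for x in data:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate the parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing variance
            weights[j] = nj / len(data)
    return means, variances, weights

data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(m, 3) for m in means], [round(w, 3) for w in weights])
```

The estimated means settle near the centres of the two groups of points. A different initial guess could lead to a different local optimum, as noted in the limitations below.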
1. Likelihood Increase: Ensures that the likelihood improves with every iteration.
2. Ease of Implementation: Both the E-step and M-step are relatively simple for many problems.
3. Closed-Form Solutions: Often, solutions to the M-step exist in a closed form, making the process more
efficient.
1. Slow Convergence: The algorithm can be slow to converge, requiring many iterations.
2. Local Optima: It may converge to local optima rather than the global optimum.
3. Complex Probability Requirements: It needs both forward and backward probabilities (whereas some
numerical optimizations only require forward probability).
A Bayesian Belief Network (BBN) is a probabilistic graphical model used to represent variables and their
conditional dependencies through a directed acyclic graph (DAG). It is also referred to as a Bayes
Network, belief network, decision network, or Bayesian model.
Key Features
Probabilistic Nature: BBNs are built upon probability distributions and leverage probability theory for
tasks like prediction and anomaly detection.
Applications: Since real-world scenarios often involve uncertainty, Bayesian networks are useful in
various fields, such as: Prediction, Anomaly Detection, Diagnostics, Automated Insight, Reasoning, Time
Series Prediction, Decision Making Under Uncertainty
Bayesian networks model complex relationships between events, making them essential for handling
uncertain data and providing insights in dynamic systems.
Bayesian Network can be used for building models from data and experts’ opinions, and it consists of
two parts:
The generalized form of a Bayesian network that represents and solves decision problems under uncertain
knowledge is known as an influence diagram. A Bayesian network graph is made up of nodes and arcs
(directed links), where:
Nodes: Each node in a Bayesian network represents a random variable, which can be either continuous
or discrete.
Arcs/Edges: Directed arrows (or arcs) between nodes represent causal relationships or conditional
dependencies. These arcs indicate that one node directly influences the other. If no directed arrow exists
between two nodes, neither node directly influences the other.
Hierarchical clustering is an alternative to partitioned clustering because it doesn't require specifying the
number of clusters in advance. It creates a tree-like structure known as a dendrogram by recursively
merging or splitting clusters.
Clusters are formed by progressively combining similar data points into larger clusters.
To determine the number of clusters, you can cut the tree at the appropriate level.
The most common hierarchical clustering method is Agglomerative Hierarchical Clustering, where each
data point starts as its own cluster and clusters are merged based on similarity.
Agglomerative clustering starts with each data point as its own cluster and progressively merges them
based on similarity until one cluster remains.
Steps:
3. Merge Closest Clusters: Find and merge the two closest clusters.
4. Update Distance: Recalculate the distance matrix based on the new clusters.
6. Create Dendrogram: Visualize the cluster hierarchy and cut the tree to get the desired number of
clusters.
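The merge loop of agglomerative clustering can be sketched on one-dimensional data. Single linkage (distance between the closest members of two clusters) is an assumption here, since the notes do not fix a linkage rule, and the data points are hypothetical.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # each point starts as its own cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest clusters
        del clusters[j]  # distances are recomputed on the next pass
    return clusters

points = [1.0, 1.5, 5.0, 5.2, 9.0]
three = agglomerative(points, 3)
two = agglomerative(points, 2)
print(three, two)
```

Stopping at a chosen number of clusters plays the same role as cutting the dendrogram at a given level.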
Divisive clustering starts with the whole dataset as one cluster and recursively splits it into smaller
clusters until each data point is its own cluster.
Steps:
2. Determine Best Split: Identify how to split the cluster into two.
5. Create Dendrogram: Visualize the splits and cut the tree to get the final clusters.
Key Differences:
SVM (Support Vector Machine) is a supervised learning method used for classification and sometimes
regression. Its main goal is to find the best line or boundary (called a hyperplane) that separates
different classes of data.
Consider two independent variables, x1, and x2, as well as one dependent variable, either a blue or a
red circle.
two features, the hyperplane is a line). But how do we choose the best one?
SVM picks the line that gives the maximum margin, meaning it leaves the biggest gap between the two classes.
This helps the model make better predictions on new data.
The best hyperplane is the one that creates the biggest gap, or margin, between the two classes. This helps
separate the data clearly.
So, we choose the hyperplane that has the biggest distance from the closest points on both sides. This is called
the maximum-margin or hard margin hyperplane. In the diagram, that would be line L2.
There’s one blue ball inside the red area, which is an outlier. But that’s okay! SVM can handle outliers
and still find the best hyperplane with the largest margin. Outliers don’t affect SVM much.
For this kind of data, SVM still finds the best margin but allows some points to cross it. This is called a soft margin.
SVM adds a penalty for each point that breaks the margin. A common penalty is called hinge loss—the more a
point crosses the margin, the bigger the loss.
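The hinge loss penalty can be written in one line. The sketch assumes labels of +1 or -1 and a margin of 1 (the standard form), and the example scores are hypothetical.

```python
def hinge_loss(label, score):
    """Hinge loss for a label in {+1, -1} and a raw decision score."""
    return max(0.0, 1.0 - label * score)

outside_margin = hinge_loss(+1, 2.0)   # correct side, beyond the margin
inside_margin = hinge_loss(+1, 0.5)    # correct side, but inside the margin
wrong_side = hinge_loss(-1, 0.5)       # misclassified point
print(outside_margin, inside_margin, wrong_side)
```

The loss is zero for points safely outside the margin and grows linearly the further a point crosses it.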
So far, we've talked about linearly separable data (data that can be split by a straight line). But what if the data
can’t be separated by a straight line? Let's see how SVM handles that next!
If the data can’t be separated by a straight line, SVM uses something called a kernel to solve the problem. The
kernel transforms the data into a new space using a new variable y (like distance from the origin). This makes it
possible to separate the data with a line in the new space.
In this case, we create a new variable y based on the distance from the origin. This is done using a kernel.
A kernel is a special function in SVM that transforms data into a higher-dimensional space. This helps turn a non-
separable problem into a separable one. Simply put, the kernel reshapes the data so SVM can find the best way to
separate it.
Ans. Bagging Classifier is an ensemble method that builds multiple models using random subsets of the
training data (with replacement). Each model is trained separately, and their predictions are combined
(e.g. by voting) to make the final prediction. This helps reduce overfitting and makes the model more
stable.
Each base model in bagging is trained on a different random set of the data. Some data points may
repeat, and some may be left out. Bagging reduces overfitting by averaging or voting, which lowers
variance but can slightly increase bias—though overall performance improves.
Bagging picks random samples from the training data with replacement, so some data may appear more
than once, while others might be skipped.
Original dataset : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
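Drawing one bootstrap sample from this dataset can be sketched as follows. The seed is arbitrary and only makes the run repeatable; the helper name is an assumption.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

data = list(range(1, 11))  # original dataset: 1 .. 10
rng = random.Random(42)    # arbitrary seed for repeatability
sample = bootstrap_sample(data, rng)
out_of_bag = [x for x in data if x not in sample]  # points never drawn
print(sample, out_of_bag)
```

The sample has the same size as the original data, so any repeated value means some other value was left out. The left-out values form the out-of-bag set for the model trained on this sample.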
Ans. AdaBoost stands for Adaptive Boosting. It is a machine learning algorithm that helps us make
better predictions by combining many simple models (called weak learners) into one strong model.
You can think of it like asking many people for advice. Each person may not be perfect on their own, but
if you ask the right people and give more attention to the best ones, you get a better decision in the
end. That’s what AdaBoost does with models.
The goal is to focus more on the mistakes made by earlier models and try to correct them in the next
ones. In the end, it combines all the models to make a powerful prediction system.
● At the beginning, all the data points are considered equally important.
● Suppose you have 100 examples in your dataset. Each one gets a weight of 1/100.
● These weights help the model understand how much to pay attention to each example.
● If the model is good (low error), it gets a high "vote" or weight (alpha).
● If it’s bad (high error), it gets a small vote.
● Increase weights for the data points that were predicted wrong (so the next model focuses more
on them).
● Decrease weights for the ones predicted correctly.
● This helps the next model improve on the mistakes.
7. Final Prediction
● To make a final decision, combine all weak models using their alpha (importance).
● Each model gives a vote, and stronger models (with higher alpha) have more say.
Ø Example:
If Model 1 says "Yes" with alpha = 0.8, and Model 2 says "No" with alpha = 0.3, then "Yes" wins because
it has more weight.
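The weighted vote in this example can be sketched as follows. The alpha formula shown, 0.5 · ln((1 − error) / error), is the one used in standard AdaBoost and is added here as an assumption, since the notes do not state it.

```python
import math

def model_alpha(error_rate):
    """Vote weight of a weak learner: larger when its weighted error is lower."""
    return 0.5 * math.log((1 - error_rate) / error_rate)

def weighted_vote(predictions):
    """predictions: list of (label, alpha); return the label with the largest total alpha."""
    totals = {}
    for label, alpha in predictions:
        totals[label] = totals.get(label, 0.0) + alpha
    return max(totals, key=totals.get)

final = weighted_vote([("Yes", 0.8), ("No", 0.3)])
print(final, round(model_alpha(0.1), 3), round(model_alpha(0.4), 3))
```

A learner with 10% error receives a larger alpha than one with 40% error, and a learner no better than chance (50% error) receives a vote of zero.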
· Robust to overfitting.
· Can be extended to learning problems beyond binary classification, and can be used with text or numeric
data.
Drawbacks:
AdaBoost picks the training focus for each new model based on the mistakes of the previous one. It also
decides how much weight to give each model's answer. By combining many weak models, it creates a
strong one and was the first successful boosting method for solving yes/no (binary) problems.
Q4) Describe the key principles behind ensemble learning. Differentiate between bagging and
boosting techniques.
Ans.
2) Boosting – It combines many weak models (that are not very accurate) to make one strong
model. It builds models one after another, and each new model tries to fix the mistakes made by the
previous one. The goal is to keep improving the accuracy step by step.
Bagging:
Bagging (Bootstrap Aggregation) is used in Random Forest. It takes random samples from the original
data with replacement (called bootstrap). Each model is trained independently on these samples. Then,
all models’ results are combined using majority voting, which is called aggregation.
We take random samples with replacement from the original data (called bootstrap samples). Then, we
train separate models on each sample. Each model gives a result. If most models say "Happy" (like a
majority of happy emojis), then the final result is "Happy" based on majority voting.
Steps in Random Forest:
Step 1: Pick random samples from the original data (some records may repeat).
Step 2: Build a separate decision tree for each sample.
Step 3: Each tree makes its own prediction.
Step 4: For classification, the final result is the one that most trees agree on (majority voting).
For regression, the final result is the average of all tree predictions.
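Steps 3 and 4 (combining the trees' outputs) can be sketched as below. The tree predictions are hypothetical values typed in by hand; no trees are actually trained here.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_predictions):
    """Majority vote over the labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Average of the numeric predictions of the individual trees."""
    return mean(tree_predictions)

print(aggregate_classification(["Happy", "Sad", "Happy"]))  # prints Happy
print(aggregate_regression([2.0, 3.0, 4.0]))                # prints 3.0
```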
Aspect           Bagging                                   Boosting
Data Sampling    Uses random samples (with replacement)    Uses full data, but changes weights on mistakes
Learning Style   All models learn at the same time         Each model learns from previous errors
Ans: Stacking (or Stacked Generalization) is an ensemble method that combines the predictions of
multiple models to make a better final prediction.
It works like a team of models, where a special model (called the meta-model) learns how to best
combine the outputs from the other models.
Stacking Architecture:
● First, you train several different models (like Decision Tree, SVM, KNN, etc.) on the same
training data.
● These are called base models or level-0 models.
● Each base model gives its own prediction.
● A new model (called the meta-model or level-1 model) is trained on the predictions from the
base models.
● The meta-model learns which base model to trust more for the final prediction.
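A toy sketch of this two-level architecture. Everything here is an assumption for illustration: the three base models are hand-written rules standing in for trained models, and the meta-model is a simple lookup table rather than a real learner.

```python
from collections import Counter, defaultdict

# Level-0 (base) models: hypothetical rules standing in for trained
# models such as a decision tree, SVM or KNN.
def base_a(x): return 1 if x[0] > 0.5 else 0
def base_b(x): return 1 if x[1] > 0.5 else 0
def base_c(x): return 1 if x[0] + x[1] > 1.0 else 0

BASE_MODELS = [base_a, base_b, base_c]

def base_outputs(x):
    """Predictions of all base models for one input, as a tuple."""
    return tuple(model(x) for model in BASE_MODELS)

def fit_meta_model(X, y):
    """Level-1 model: for each pattern of base predictions, remember the
    true label seen most often in the training data."""
    seen = defaultdict(Counter)
    for x, label in zip(X, y):
        seen[base_outputs(x)][label] += 1
    return {pattern: counts.most_common(1)[0][0]
            for pattern, counts in seen.items()}

def predict(meta, x):
    pattern = base_outputs(x)
    if pattern in meta:
        return meta[pattern]
    # Unseen pattern: fall back to a plain majority vote of the base models.
    return Counter(pattern).most_common(1)[0][0]

# Hypothetical training data where the true label follows the second feature.
X = [(0.9, 0.2), (0.2, 0.9), (0.8, 0.8), (0.1, 0.1), (0.7, 0.4), (0.3, 0.6)]
y = [0, 1, 1, 0, 0, 1]
meta = fit_meta_model(X, y)

print(base_outputs((0.95, 0.3)), "->", predict(meta, (0.95, 0.3)))
```

For the input (0.95, 0.3), two of the three base models say 1, but the meta-model answers 0 because it has learned that the second base model is the one to trust on this data. In practice the meta-model is usually trained on predictions made for held-out data, so that it does not simply memorize the base models' training fit.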
Example:
Let’s say you’re asking three friends (base models) for movie suggestions.
● One likes action, one likes comedy, and one likes drama.
● You notice whose suggestions you usually like best.
Now you ask a fourth friend (meta-model) to pick a movie, but they choose based on what the
other three said, and who usually makes better choices.
Why stacking?
Stacked models are often used to win machine learning competitions because they often give better
accuracy than any single model.
By using different types of models in the first layer (like decision trees, SVM, etc.), we can capture
different patterns in the data.
Combining their predictions helps to make more accurate results.
MODULE 07
Q1 What is Multidimensional Scaling (MDS), and what is its primary purpose? How does MDS differ
from Principal Component Analysis?
Multidimensional Scaling (MDS) is a dimensionality reduction technique. It's mainly used to turn high-
dimensional data (data with many features) into a low-dimensional representation — usually 2D or 3D
— so that we can visualize it more easily.
For example, if you have a list of cities and the distances between each pair of cities, MDS can turn that
distance matrix into a map-like plot where the cities appear in positions that reflect those distances.
Its primary purpose is to reduce the number of dimensions while preserving the pairwise distances (or
similarities) between points as much as possible. It is used to:
● Visualize relationships between items (e.g., which items are similar or different)
● Detect clusters, patterns, or outliers
● Explore data that doesn't have clear features but does have pairwise relationships (like
similarities or preferences)
MDS is often used in fields like psychology, marketing, and social sciences, where people might rate
items by similarity, and you want to understand the structure behind their judgments.
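The sketch below does not run MDS itself. It only computes "raw stress", a common measure of how badly a candidate 2D layout distorts the given distances, which is the kind of quantity an MDS algorithm tries to make small. The city distances and both layouts are made-up numbers.

```python
import math

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    return [[math.dist(p, q) for q in points] for p in points]

def raw_stress(target, layout):
    """Sum of squared gaps between target distances and layout distances."""
    d = pairwise_distances(layout)
    n = len(layout)
    return sum((target[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Hypothetical distances between three cities A, B, C (arbitrary units).
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]

good_layout = [(0, 0), (3, 0), (0, 4)]  # a 3-4-5 right triangle
bad_layout = [(0, 0), (1, 0), (2, 0)]   # all cities squeezed onto a line

print(raw_stress(target, good_layout))  # no distortion
print(raw_stress(target, bad_layout))   # large distortion
```

The first layout reproduces every distance, so its stress is zero; the second layout distorts all three distances and gets a much larger value.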
Aspect          MDS                                     PCA
Type of input   Distance matrix or similarity matrix    Full data matrix with features
Q2 Compare Feature Extraction and Feature Selection techniques. Explain how dimensionality can be
reduced using Principal Components Analysis.
Both feature extraction and feature selection are dimensionality reduction techniques used to simplify
datasets by reducing the number of features (variables) while trying to keep as much useful information
as possible. However, they work in different ways:
Feature Selection:
● What it does: Selects a subset of the original features that are most relevant to the task (e.g.,
prediction, classification).
● How: Removes irrelevant or redundant features based on certain criteria (like correlation,
information gain, mutual information, etc.).
● Result: Keeps the original meaning of features; just fewer are used.
● Example: From a dataset with 100 features, selecting the top 10 most important ones based on
their relevance to the target.
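A minimal sketch of feature selection by correlation with the target, one of the criteria listed above. The dataset, feature names, and function names are hypothetical, and the sketch assumes no feature is constant (a constant feature would cause a division by zero).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_top_k(features, target, k):
    """Names of the k features with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy dataset: three features and one numeric target.
features = {
    "hours_studied":  [1, 2, 3, 4, 5],
    "shoe_size":      [7, 9, 6, 8, 7],
    "classes_missed": [5, 4, 4, 2, 1],
}
target = [50, 55, 65, 70, 80]

print(select_top_k(features, target, k=2))
```

The selected features keep their original names and meaning; the weakly related one (shoe_size) is simply dropped.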
Feature Extraction:
● What it does: Creates new features by combining or transforming the original ones.
● How: Uses techniques (like PCA) to transform high-dimensional data into a new lower-
dimensional space.
● Result: New features may not have a direct interpretation, but they capture important patterns
or structure in the data.
● Example: From 100 original features, creating 10 new ones that summarize the data but in a
transformed way.
Principal Component Analysis (PCA) is a feature extraction method used to reduce dimensionality by
transforming the original data into a new set of variables called principal components.
Here's how PCA works step-by-step:
1. Standardize the data: Make sure all features have the same scale (especially if they are in
different units).
2. Compute the covariance matrix: This shows how features vary with respect to each other.
3. Find eigenvalues and eigenvectors: These help identify the directions (principal components)
where the data varies the most.
4. Select top components: Choose the top k components (based on the highest eigenvalues) that
capture the most variance in the data.
5. Transform the data: Project the original data onto the selected components to get a new, lower-
dimensional representation.
Example:
Suppose you have data with 100 features. PCA might tell you that 95% of the variance (information) in
the data can be captured using just 10 principal components. So, you reduce the data from 100 to 10
dimensions while still retaining most of its structure and meaning.
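The five steps can be sketched for the smallest possible case, reducing 2 features to 1, where the eigenvalues of the 2x2 covariance matrix have a closed form. The data and names are hypothetical, and the sketch assumes the two features are correlated (non-zero covariance).

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Rescale a feature to zero mean and unit variance."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

def pca_2d_to_1d(xs, ys):
    # Step 1: standardize each feature.
    xs, ys = standardize(xs), standardize(ys)
    n = len(xs)
    # Step 2: covariance matrix [[a, b], [b, c]] of the standardized data.
    a = sum(x * x for x in xs) / n
    c = sum(y * y for y in ys) / n
    b = sum(x * y for x, y in zip(xs, ys)) / n
    # Step 3: eigenvalues of a symmetric 2x2 matrix (closed form).
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap
    # Eigenvector for the largest eigenvalue (assumes b != 0), normalized.
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Steps 4-5: keep only the top component and project the data onto it.
    projected = [x * vx + y * vy for x, y in zip(xs, ys)]
    explained = lam1 / (lam1 + lam2)  # share of variance kept
    return projected, explained

# Hypothetical data: two strongly correlated features.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]
projected, explained = pca_2d_to_1d(heights_cm, weights_kg)
print([round(p, 2) for p in projected], round(explained, 3))
```

Because the two features move together, a single component keeps almost all of the variance. For real datasets with many features, a numerical library would be used for the eigen-decomposition instead of this closed form.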
Dimensionality reduction is the process of reducing the number of input variables or features in a
dataset. In simple terms, it’s about taking high-dimensional data (data with many features or columns)
and simplifying it without losing too much important information.
High-dimensional data can be difficult to analyze, visualize, and model. Reducing the number of
features is useful because:
1. High-dimensional data is hard to visualize (e.g., you can’t plot data with 50+ features)
2. Too many features can cause overfitting (the model learns noise instead of patterns)
3. Redundant or irrelevant features add unnecessary complexity
4. Speed and efficiency improve with fewer features
5. Fewer features help with noise reduction and better generalization
The difficulty of working with many features is also known as the curse of dimensionality – as the
number of features increases, the data becomes sparse and harder to analyze effectively.
Dimensionality can be reduced in two main ways:
1. Feature Selection
2. Feature Extraction
PCA is one of the most popular feature extraction techniques. Here’s how it works:
1. It finds new axes (principal components) that capture the maximum variance in the data.
2. The first principal component captures the most variation; the second one captures the next
most, and so on.
3. You can select the top k components (based on variance explained) and project your data onto
them.
4. This reduces the dimensionality while still retaining most of the important information.
For example:
You have 100 features. PCA tells you that just 10 components explain 95% of the data’s variation. You
can then reduce your dataset to 10 dimensions.
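Choosing k from the variance explained can be sketched as follows. The eigenvalues are made-up numbers for a dataset with 6 features, and the function name is an assumption.

```python
def components_needed(eigenvalues, target=0.95):
    """Smallest k whose top-k eigenvalues explain at least `target`
    of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)

# Hypothetical eigenvalues of a covariance matrix with 6 features.
eigenvalues = [5.0, 3.0, 1.6, 0.2, 0.15, 0.05]
print(components_needed(eigenvalues))  # prints 3
```

Here the first three components already explain about 96% of the variance, so the remaining three can be dropped.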
A Bayesian Belief Network (BBN), also known as a Bayesian Network, is a probabilistic graphical model
that represents a set of variables and their conditional dependencies using a directed acyclic graph
(DAG).
It combines graph theory and probability theory to model uncertainty in complex systems.
Key Components:
How it works:
Example:
The graph might show that flu causes both fever and sore throat. Given a person has fever and sore
throat, the network can estimate the probability that the person has the flu.
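A minimal sketch of this flu example, where flu is the parent of both fever and sore throat. All probabilities are made-up illustrative numbers, not medical data, and the names are assumptions.

```python
# All probabilities below are made-up illustrative numbers.
P_FLU = 0.10
P_FEVER = {True: 0.90, False: 0.20}        # P(fever | flu), keyed by flu
P_SORE_THROAT = {True: 0.70, False: 0.10}  # P(sore throat | flu), keyed by flu

def posterior_flu(fever, sore_throat):
    """P(flu | fever, sore_throat), found by enumerating both values of flu."""
    def joint(flu):
        # Joint probability factorizes along the graph:
        # P(flu) * P(fever | flu) * P(sore throat | flu)
        prior = P_FLU if flu else 1 - P_FLU
        p_fever = P_FEVER[flu] if fever else 1 - P_FEVER[flu]
        p_sore = P_SORE_THROAT[flu] if sore_throat else 1 - P_SORE_THROAT[flu]
        return prior * p_fever * p_sore
    return joint(True) / (joint(True) + joint(False))

print(round(posterior_flu(fever=True, sore_throat=True), 3))
print(round(posterior_flu(fever=False, sore_throat=False), 3))
```

With these numbers, seeing both symptoms raises the probability of flu well above its prior of 0.10, while seeing neither symptom lowers it.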
Applications:
● Medical diagnosis
● Decision support systems
● Risk assessment
● Spam detection
● Weather prediction
Clustering is an unsupervised learning method used to group similar data points into clusters. There
are several popular approaches to clustering, each using a different strategy to group the data. Let’s
look at the four main types:
1. Distribution-Based Clustering
● In this method, data is assumed to come from a specific statistical distribution (often
Gaussian).
● Each cluster corresponds to a probability distribution, and data points are grouped based on
which distribution they are most likely to belong to.
● This method is useful when data naturally follows certain patterns or distributions.
● Example: Gaussian Mixture Models (GMM).
2. Density-Based Clustering
3. Fuzzy Clustering
● In this approach, data points can belong to more than one cluster.
● Instead of assigning a point to just one cluster, it assigns membership scores (probabilities) to
multiple clusters.
● Useful when cluster boundaries are not clearly defined and may overlap.
● Example: Fuzzy C-Means Algorithm.
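The membership scores can be sketched with the standard Fuzzy C-Means membership formula for a 1-D point. This shows only the membership step, not the full algorithm (which also updates the centers repeatedly); the cluster centers are hypothetical.

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of a 1-D point in each cluster.

    m > 1 is the fuzziness parameter; m = 2 is a common choice.
    """
    distances = [abs(point - c) for c in centers]
    if 0.0 in distances:  # the point sits exactly on a center
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** power for d_k in distances)
            for d_i in distances]

centers = [0.0, 10.0]  # hypothetical cluster centers
print(fuzzy_memberships(2.0, centers))  # mostly the first cluster
print(fuzzy_memberships(5.0, centers))  # split evenly between both
```

The scores for a point always add up to 1, and a point halfway between two centers belongs equally to both.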
4. Centroid-Based Clustering
Q5 What is a soft margin hyperplane? What are the advantages of K-Nearest Neighbour?
In many real-world scenarios, data is not linearly separable, meaning it cannot be perfectly divided by a
straight line or linear boundary. Support Vector Machines (SVM) address this issue through the concept
of a Soft Margin Hyperplane.
A soft margin allows the SVM to tolerate a certain number of misclassifications while still trying to find a
hyperplane that separates the data as best as possible. The idea is to maximize the margin between
classes while minimizing classification errors.
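Mathematically, this is achieved by introducing slack variables (ξᵢ) that measure how much a data point
violates the margin. A penalty is added to the objective function for each violation, and a parameter C
controls the trade-off between a wide margin and few violations: a large C penalizes violations
heavily, while a small C tolerates more of them in exchange for a wider margin.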
This soft margin approach helps prevent overfitting, especially when the data is noisy or overlapping,
and leads to better generalization on unseen data.
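In the standard soft-margin formulation, the SVM minimizes ½‖w‖² + C·Σξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0. The sketch below evaluates this objective for a fixed hyperplane; it does not train an SVM. The symbols w and b (the hyperplane's weights and bias), the data points, and the function name are assumptions for illustration.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for the hyperplane w.x + b = 0.

    slack_i   = max(0, 1 - y_i * (w.x_i + b)), with labels y_i in {+1, -1}
    objective = 0.5 * ||w||^2 + C * sum(slack_i)
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        slacks.append(max(0.0, 1 - y_i * score))  # 0 when the margin is respected
    objective = 0.5 * sum(w_j ** 2 for w_j in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical 2-D points; the last one lies on the wrong side of the boundary.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

for C in (0.1, 10.0):
    objective, slacks = soft_margin_objective(w, b, X, y, C)
    print(C, slacks, objective)
```

Only the misclassified point gets a non-zero slack. With a small C its violation barely changes the objective; with a large C the same violation dominates it.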
1. Objective Function:
2. Constraints: