Advanced Machine Learning CIE
| MD Riyan Nazeer | CSM-3A | 160921748036 |
• Define the learning rate, which determines the step size during weight updates.
2. Forward Pass:
• Input a set of training data into the network to obtain predictions.
• Propagate the input through the network layer by layer, calculating the weighted sum and
applying the activation function at each node.
3. Error Calculation:
• Compare the predicted output with the actual target output to calculate the error.
• Commonly used error metrics include Mean Squared Error (MSE) for regression and Cross-
Entropy Loss for classification.
4. Backward Pass (Backpropagation Proper):
• Propagate the error backward through the network to compute the gradients of the error with
respect to the weights.
• Calculate the gradient of the error with respect to the output of each node in the network.
5. Weight Update:
• Update the weights and biases using the calculated gradients and the learning rate.
• This step involves adjusting the weights in the direction that minimizes the error.
6. Repeat:
• Repeat the process for a specified number of epochs or until convergence.
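To make these steps concrete, here is a minimal NumPy sketch of the loop (forward pass, error calculation, backward pass, weight update, repeat); the one-hidden-layer architecture, sigmoid activations, learning rate, and toy XOR data are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np

# Toy XOR data (assumed for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5                                             # learning rate (step size)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    # Forward pass: weighted sums + activations, layer by layer
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Error calculation (mean squared error)
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: gradients of the error w.r.t. the weights
    # (constant factors of the MSE gradient are folded into the learning rate)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # error signal at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)        # error propagated back to the hidden layer

    # Weight update: step in the direction that reduces the error
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(axis=0, keepdims=True)

print(round(loss, 4), y_hat.round(3))
```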
b) Convergence in a neural network refers to the process where the network's performance or
loss function stabilizes and stops improving significantly with further training. This indicates
that the network has learned as much as it can from the given dataset, and adjustments to
the weights become minimal.
In practice, neural networks often converge to a local minimum of the loss function.
Advanced techniques like adaptive learning rates and batch normalization may be required
to achieve global convergence.
Local minima in a neural network are points on the error surface where the loss function has a lower value than at all nearby points, but an even lower value (the global minimum) exists somewhere else in the error space. This means the network has found a relatively good solution, but not the best possible one. Local minima can lead to suboptimal performance.
To overcome this, techniques such as proper initialization, adjusting the learning rate,
regularization, designing better network architectures, and using variations of Stochastic
Gradient Descent (SGD) can be employed to help the network escape local minima and find
the global minimum for better performance.
5 a) Explain the basic structure and working of an artificial neural network (ANN) and 4m
its applications in various fields
b) What is the gradient descent method? How is it used in the backpropagation 3m
algorithm?
A a) An Artificial Neural Network (ANN) is a computational model inspired by the way biological
neural networks function in the human brain. ANNs consist of interconnected nodes, known
as neurons or artificial neurons, organized into layers, and are fundamental components of
machine learning and artificial intelligence.
Structure of ANN:
1. Input Layer: This layer receives the input data (features) supplied to the network,
which can arrive in several different formats.
2. Hidden Layer: Positioned between the input and output layers, the hidden layer
performs all the calculations to find hidden features and patterns.
3. Output Layer: After processing through the hidden layer, the final output is conveyed
through this layer.
The ANN takes the inputs and computes their weighted sum together with a bias (the transfer
function). This weighted total is passed as input to an activation function, which determines
whether a node should fire or not. Only the outputs of nodes that fire are passed on toward the
output layer.
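For illustration, a minimal sketch of a single node's computation (weighted sum plus bias, then an activation) is shown below; the sigmoid activation and the example input/weight values are assumptions, not part of the original answer.

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = np.dot(w, x) + b             # weighted sum (the transfer function)
    return 1.0 / (1.0 + np.exp(-z))  # activation decides how strongly the node "fires"

# Example values assumed purely for illustration
x = np.array([0.5, 0.2, 0.8])   # inputs from the previous layer
w = np.array([0.4, -0.6, 0.9])  # connection weights
print(neuron_output(x, w, b=0.1))
```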
Applications of ANN:
ANNs are well-suited for tasks such as:
• Pattern Recognition
• Classification
• Regression
• Decision-Making
These tasks arise in many fields, such as image and speech recognition, medical diagnosis, and
financial forecasting, all of which leverage the ability of ANNs to model complex relationships
and make accurate predictions.
b) The gradient descent (GD) method is a widely used optimization algorithm in machine
learning and deep learning that minimizes the cost function of a neural network model
during training. It iteratively adjusts the weights or parameters of the model in the direction
of the negative gradient of the cost function until the minimum of the cost function is
reached. The goal of gradient descent is to find the optimal parameters (weights and biases)
that minimize the cost function.
In the backpropagation algorithm, gradient descent is used to optimize the weights and
biases of a neural network based on the error between the predicted output and the actual
output. The steps involve:
1. Compute Gradients: Calculate the gradient of the cost function with respect to each
parameter using backpropagation.
2. Update Parameters: Adjust the parameters in the opposite direction of the gradients to
reduce the cost, using the update rule:
new_parameter = old_parameter − learning_rate × gradient
3. Repeat: Repeat the process for a specified number of epochs or until convergence is
achieved, where the network's performance stabilizes.
This iterative process ensures that the network's weights and biases are optimized to minimize
the error and improve the model's accuracy.
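The update rule above can be illustrated with a short sketch that minimizes an assumed one-parameter quadratic cost; the cost function, starting point, and learning rate are chosen only for demonstration.

```python
# Minimal gradient descent sketch on an assumed quadratic cost J(w) = (w - 3)^2,
# whose gradient is dJ/dw = 2 * (w - 3); the minimum lies at w = 3.
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0               # initial parameter
learning_rate = 0.1
for step in range(100):
    w = w - learning_rate * gradient(w)   # new_parameter = old_parameter - lr * gradient

print(round(w, 4))    # converges close to 3.0
```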
6 a) Explain Perceptron & Multilayer Perceptron with Diagram? 4m
b) Describe the problem of overfitting in neural networks. What techniques can be 3m
employed to prevent or mitigate overfitting
A a) Perceptron:
A perceptron is the simplest form of an artificial neural network (ANN) and was introduced
by Frank Rosenblatt in 1957. It is a single-layer neural network that functions as a binary
classifier, taking multiple binary inputs and producing a single binary output. The perceptron
learns a linear decision boundary to separate two classes.
The model of an artificial neuron in a perceptron involves receiving inputs from other
neurons, weighted by the importance of each input (weights). These inputs are summed up,
and if the resulting value exceeds a certain threshold, the neuron fires. This process can be
represented as:
output = 1 if (w1 x1 + w2 x2 + … + wn xn) ≥ threshold, otherwise output = 0
Training a perceptron involves adjusting the weights using the perceptron learning rule
derived from gradient descent to minimize the error between the predicted and target
outputs. This is suitable for linearly separable data. For non-linearly separable data, more
complex models like multilayer perceptrons (MLPs) are required.
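A minimal sketch of the perceptron learning rule on an assumed linearly separable dataset (the AND function) is given below; the learning rate and epoch count are illustrative choices, not part of the original answer.

```python
import numpy as np

# Toy linearly separable data: the AND function (assumed for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)       # weights
b = 0.0               # bias (acts as a learned threshold)
lr = 0.1              # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0   # fire if the weighted sum exceeds 0
        error = target - pred
        w += lr * error * xi                        # perceptron learning rule
        b += lr * error

print(w, b, [1 if np.dot(w, xi) + b > 0 else 0 for xi in X])
```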
Multilayer Perceptron (MLP):
A multilayer perceptron extends the perceptron with one or more hidden layers of neurons that
use non-linear activation functions, allowing it to learn decision boundaries for data that is not
linearly separable.
Training an MLP involves the backpropagation algorithm, which calculates the gradient of the
cost function with respect to each weight by propagating the error backward through the
network. This allows the adjustment of weights to minimize the cost function iteratively,
enhancing the model's accuracy and capability to solve complex problems.
b) Overfitting in Neural Networks: Overfitting occurs when a neural network model learns the
training data too well, capturing noise and details specific to the training data rather than
generalizing to unseen or new data. This leads to poor performance on test or validation
data, as the model becomes too complex and fails to predict new data accurately.
Common techniques to prevent or mitigate overfitting include simplifying the network
architecture, L1/L2 regularization, dropout, early stopping based on validation performance,
data augmentation, and collecting more training data.
By employing these techniques, the risk of overfitting can be minimized, ensuring that the
neural network generalizes well to new data.
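As a rough illustration of two of these techniques (L2 regularization and early stopping), the scikit-learn sketch below trains a small MLP on an assumed synthetic dataset; the layer size and penalty strength are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data (assumed purely for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# alpha adds an L2 penalty on the weights; early_stopping monitors a held-out
# validation fraction and stops training when it no longer improves.
model = MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-3,
                      early_stopping=True, validation_fraction=0.1,
                      max_iter=500, random_state=0)
model.fit(X_train, y_train)

# A large gap between these two scores is a symptom of overfitting.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```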
Unit-2
SAQ
1 What is Bayesian learning? 2m
A • Bayesian learning is a machine learning approach that uses probability theory to update beliefs
about hypotheses in the light of observed data; a common example is classifying text documents
based on the occurrence of certain words in different classes.
• In that text-classification setting, it involves calculating the probability of words occurring in
specific classes and learning the class prior probabilities.
• Bayesian learning combines prior knowledge with observed data to determine the
probability of a hypothesis being correct.
• It allows for incremental updates to the probability of a hypothesis based on each observed
training example and can classify new instances by combining the predictions of multiple
hypotheses weighted by their probabilities.
2 Define Bayes' theorem. 2m
A Bayes' Theorem, named after the Reverend Thomas Bayes, is a fundamental concept in
probability theory that allows us to reverse the conditional probability relationship between two
events. It is given by:
P(A|B) = [P(B|A) × P(A)] / P(B)
Here,
• P(A|B) is the probability of event A occurring given that event B has occurred.
• P(B|A) is the probability of event B occurring given that event A has occurred.
• P(A) is the prior probability of event A, the probability of A occurring without considering B.
• P(B) is the probability of event B occurring, known as the marginal probability of B.
Bayes' Theorem is particularly useful for updating our beliefs about the probability of an event
based on new evidence. It has crucial applications in fields such as medical diagnosis,
hypothesis testing, spam filtering, and genetic testing.
Bayes' Theorem, which underpins this algorithm, allows for the computation of the posterior
probability P(A|B) by combining the prior probability P(A), the likelihood P(B|A), and the
marginal probability P(B). The formula is:
P(A|B) = [P(B|A) × P(A)] / P(B)
Naive Bayes classifiers are widely used in spam filtration, sentiment analysis, and article
classification due to their simplicity, speed, and effectiveness in making quick predictions.
LAQ
4 a) Explain Bayesian learning and its importance in machine learning. How does it
differ from frequentist learning approaches? 4m
b) Describe Bayes' theorem in detail. Provide a real-world example of how it can be
applied to update beliefs based on new evidence. 3m
A a) Bayesian learning in machine learning harnesses Bayes' theorem to make predictions and
update beliefs based on new evidence. Unlike frequentist methods, Bayesian learning
incorporates prior knowledge into the learning process, enabling a more informed approach
to decision-making.
The core principle of Bayesian learning lies in Bayes' theorem, which calculates the posterior
probability by combining the prior belief with the likelihood of observed data. This allows for
continuous learning and updating of probabilities as new information becomes available.
Key Elements:
1. Prior Probability: Represents initial beliefs about a hypothesis before encountering data.
2. Likelihood: Reflects how well observed data aligns with the hypothesis.
3. Posterior Probability: Updated belief based on observed data, obtained through Bayes'
theorem.
Importance:
1. Incorporates Prior Knowledge: Bayesian learning integrates existing knowledge or
domain expertise into the model, aiding decision-making with limited data.
2. Uncertainty Quantification: Provides probabilities for predictions, offering insights into
the confidence of outcomes.
3. Continuous Learning: Allows models to adapt and improve performance as new data
becomes available.
b) Bayes' Theorem, named after the Reverend Thomas Bayes, is a fundamental concept in
probability theory that allows us to reverse the conditional probability relationship between
two events. It is given by:
P(A|B) = [P(B|A) × P(A)] / P(B)
Here,
• P(A|B) is the probability of event A occurring given that event B has occurred.
• P(B|A) is the probability of event B occurring given that event A has occurred.
• P(A) is the prior probability of event A, the probability of A occurring without considering B.
• P(B) is the probability of event B occurring, known as the marginal probability of B.
Bayes' Theorem is particularly useful for updating our beliefs about the probability of an
event based on new evidence. It has crucial applications in fields such as medical diagnosis,
hypothesis testing, spam filtering, and genetic testing.
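Since the question asks for a real-world example, here is a hypothetical medical-screening calculation; the prevalence and test accuracy figures are assumed purely for the arithmetic.

```python
# Hypothetical screening test (all numbers assumed for illustration):
# 1% of people have the disease; the test detects it 95% of the time,
# but also gives a false positive 5% of the time.
p_disease = 0.01                 # prior P(A)
p_pos_given_disease = 0.95       # likelihood P(B|A)
p_pos_given_healthy = 0.05       # false-positive rate

# Marginal probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B): belief updated after seeing the positive result
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161, i.e. roughly a 16% chance
```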
5 a) Discuss the Naive Bayes learning algorithm. How does it work, and what are its
assumptions? Provide an example of its application. 4m
b) Describe the Expectation-Maximization (EM) algorithm. How does it work, and in 3m
what types of problems is it typically used?
A a) The Naive Bayes learning algorithm is a supervised machine learning algorithm based on
Bayes' Theorem. It is particularly useful for classification tasks, especially in text
classification.
How it Works:
1. Convert the dataset into frequency tables: Calculate the frequency of each feature for
each class.
2. Generate the likelihood table: Find the probabilities of features given the class.
3. Apply Bayes' Theorem: Use the theorem to calculate the posterior probability for each
class given the features.
Assumptions:
1. Feature Independence: The algorithm assumes that all features are conditionally
independent given the class label.
2. Normal Distribution: For continuous features, it assumes they follow a normal
distribution within each class.
3. Multinomial Distribution: For discrete features, it assumes they follow a multinomial
distribution within each class.
4. Equal Importance: All features contribute equally to the prediction of the class label.
5. No Missing Data: The data should not contain any missing values.
The Naive Bayes learning algorithm is widely used across various domains due to its
simplicity and efficiency.
Here are some key applications:
1. Spam Filtering:
• Naive Bayes is extensively used in email filtering to classify emails as spam or not
spam. It analyzes the frequency of words and phrases in the email body and
subject to determine the likelihood of the email being spam.
2. Sentiment Analysis:
• It is used in text analysis to determine the sentiment of a piece of text, such as
classifying movie reviews, product reviews, or social media posts as positive,
negative, or neutral.
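A tiny scikit-learn sketch of the spam-filtering use case is shown below; the example messages and labels are made up for illustration, and CountVectorizer/MultinomialNB stand in for the frequency and likelihood tables described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus (assumed for illustration only)
messages = ["win a free prize now", "lowest price guaranteed win big",
            "meeting agenda for tomorrow", "lunch with the project team"]
labels = ["spam", "spam", "ham", "ham"]

# Steps 1-2: per-class word-frequency and likelihood tables are built from these counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Step 3: apply Bayes' Theorem with the naive independence assumption
clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["free prize meeting"])))
```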
b) The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum-
likelihood estimates of model parameters when the data is incomplete or contains latent
(hidden) variables.
How it Works:
1. Initialization: Start with a set of incomplete data and initial parameter estimates.
2. Expectation Step (E-step): Estimate the missing data values using the observed data.
3. Maximization Step (M-step): Use the complete data from the E-step to update the
parameter values.
4. Iteration: Repeat the E-step and M-step until convergence is achieved.
Typical Usage:
1. Filling Missing Data: The EM algorithm can estimate and fill in missing data within a
dataset.
2. Unsupervised Learning of Clusters: It serves as the foundation for clustering algorithms.
3. Estimating Parameters of Hidden Markov Models (HMM): The algorithm is used to
determine the parameters of HMMs.
4. Discovering Latent Variables: EM helps uncover the values of latent variables in
datasets.
Advantages:
1. The likelihood is guaranteed to increase with each iteration.
2. The E-step and M-step are generally easy to implement.
3. Solutions to the M-step often exist in closed form.
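To illustrate the E-step/M-step loop, here is a minimal NumPy sketch that fits a two-component one-dimensional Gaussian mixture; the synthetic data, initial guesses, and iteration count are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (assumed purely for illustration)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initialization: rough starting guesses for the mixture parameters
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for iteration in range(50):
    # E-step: responsibility of each component for each point (soft assignment)
    dens = np.stack([weights[k] * gaussian(data, means[k], stds[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)

    # M-step: re-estimate the parameters from the soft assignments
    Nk = resp.sum(axis=1)
    means = (resp * data).sum(axis=1) / Nk
    stds = np.sqrt((resp * (data - means[:, None]) ** 2).sum(axis=1) / Nk)
    weights = Nk / len(data)

print(means.round(2), stds.round(2), weights.round(2))
```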
6 a) How is probability theory used in Bayesian learning? 3m
b) Describe the Expectation-Maximization (EM) algorithm. How does it work, and in 4m
what types of problems is it typically used?
A a) Probability theory plays a crucial role in Bayesian learning by providing the mathematical
framework for updating the probability of a hypothesis based on new evidence. This is
primarily done using Bayes' Theorem, which allows for reversing the conditional probability
relationship between two events:
P(A|B) = [P(B|A) × P(A)] / P(B)
Here,
• P(A∣B) is the probability of event A occurring given that event B has occurred.
• P(B∣A) is the probability of event B occurring given that event A has occurred.
• P(A) is the prior probability of event A.
• P(B) is the probability of event B occurring.
In Bayesian learning, probability theory is used to update our beliefs about the probability of an
event based on new evidence or information. This involves calculating the posterior probability
P(A∣B) using the prior probability P(A), the likelihood P(B|A), and the marginal probability P(B).
b) Same as 5B
Unit-3
SAQ
1 What is a decision tree in machine learning? 2m
A A decision tree in machine learning is a graphical representation used to model decisions and
their possible consequences. Each internal node represents a "test" on an input feature, each
branch represents the outcome of the test, and each leaf node holds a class label or probability
distribution.
The tree is built in a top-down manner by iteratively splitting the data based on the feature that
provides the most significant information gain or reduction in impurity. Decision trees help in
understanding complex relationships and making predictions based on input features, and they
are valued for their visualization, ability to handle non-linear relationships, and interpretability.
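As a brief illustration of this top-down, information-gain-driven splitting, the scikit-learn sketch below builds a small tree; the Iris dataset and depth limit are assumed purely for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris dataset used here purely as an illustrative example
iris = load_iris()

# criterion="entropy" selects splits by information gain, growing the tree top-down
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node is a test on a feature; each leaf holds a class label
print(export_text(tree, feature_names=list(iris.feature_names)))
```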
2 What is a decision tree in machine learning? 2m
A Same
3 What is a decision tree in machine learning? 2m
A Same
LAQ
4 a) Discuss the problem of overfitting in decision trees. What causes overfitting, and 3m
how can it be detected?
b) Discuss the concept of entropy in decision trees. How is it calculated, and why is it 4m
important for splitting nodes?
A a) Overfitting in decision trees occurs when the model becomes too complex, with many
nodes and deep branches, causing it to learn the training data too well, including its noise
and random fluctuations. This results in poor performance on new or unseen data, as the
model fails to generalize well beyond the training set.
Causes of Overfitting:
• Noisy Data: Data that contains errors, outliers, and random fluctuations.
• Complex Models: Overly deep or complex trees fit these irregularities rather than the underlying pattern.
Detection of Overfitting:
• Validation Set: Evaluating the model's performance on a separate validation set.
• Cross-Validation: Using cross-validation to compare performance across different
subsets of data.
• Performance Drop: A significant drop in performance on validation or test datasets
compared to the training set indicates overfitting.
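A small sketch of the validation/cross-validation checks listed above is given below; the dataset and the unconstrained tree are assumptions used only to show the train-versus-held-out performance gap.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # example dataset (assumed)

# An unconstrained tree can memorize the training data, noise included
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(X, y)

train_acc = deep_tree.score(X, y)                       # accuracy on the data it was fit on
cv_acc = cross_val_score(deep_tree, X, y, cv=5).mean()  # accuracy on held-out folds

# A large gap between the two is the "performance drop" that signals overfitting
print(f"training accuracy: {train_acc:.3f}, cross-validated accuracy: {cv_acc:.3f}")
```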
b) Entropy in Decision Trees
Concept of Entropy:
• Entropy measures the impurity or disorder within a dataset.
• It quantifies the diversity of class labels within a subset of data, with high entropy
indicating a high level of diversity and low entropy indicating homogeneity.
Calculation of Entropy:
For a dataset segment S with C class labels, entropy is calculated using the formula:
Entropy(S) = − Σ pᵢ log2(pᵢ), summed over the C class labels, where pᵢ is the proportion of
examples in S that belong to class i.
Special case for two class labels (e.g., "yes" and "no"):
Entropy(S) = − p log2(p) − (1 − p) log2(1 − p)
where p is the proportion of examples with the "yes" label.
Information Gain:
• Evaluates how much splitting a node based on a particular attribute reduces entropy.
• Higher information gain implies more effective splitting, leading to clearer separation of
classes.
Guidance in Choosing:
Entropy:
• Helps in understanding the diversity within data subsets.
• Lower entropy indicates more homogeneous subsets, which are desirable for decision
tree splits.
Information Gain:
• Guides in selecting attributes that result in the greatest reduction in entropy.
• Attributes with higher information gain are preferred for splitting nodes as they lead to
more significant improvements in class purity.
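The formulas above can be checked with a short sketch that computes the two-class entropy and the information gain of one candidate split; the example counts (9 "yes" and 5 "no" in the parent node) are assumed for illustration.

```python
import math

def entropy(p):
    """Two-class entropy for a segment where p is the proportion of 'yes' examples."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Assumed example: a parent node with 9 'yes' and 5 'no' examples...
parent = entropy(9 / 14)

# ...split by an attribute into two children: (6 yes, 2 no) and (3 yes, 3 no)
child_1 = entropy(6 / 8)
child_2 = entropy(3 / 6)

# Information gain = parent entropy - weighted average of child entropies
info_gain = parent - (8 / 14) * child_1 - (6 / 14) * child_2
print(round(parent, 3), round(info_gain, 3))
```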
b) The ID3 algorithm, developed by Ross Quinlan in 1975, is a decision tree algorithm used for
classification tasks. It assumes that there are only two class labels, denoted as "positive"
and "negative." The algorithm relies on the concept of information gain to select the most
useful attribute for classification at each step.
To manage complexity:
1. Feature selection:
• Reducing the number of features by selecting only relevant ones can decrease
complexity.
• Techniques like information gain or Gini index help identify and remove irrelevant
or redundant features.
2. Sampling techniques:
• Create a representative subset of data for large datasets, reducing computational
burden while maintaining accuracy.
3. Parallelization:
• Utilize multiple cores or distributed computing to parallelize the process, reducing
overall computational time.
4. Early stopping and pruning:
• Implement criteria to stop tree growth early or use pruning techniques to simplify
the tree, reducing complexity.
5. Efficient implementation and optimization:
• Use efficient data structures, algorithms, and hardware optimization to improve
performance.
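As a rough illustration of point 4, the scikit-learn sketch below contrasts early stopping (depth and leaf-size limits) with cost-complexity pruning (ccp_alpha); the dataset and parameter values are assumptions chosen only for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # example dataset (assumed)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Early stopping: limit depth and leaf size so the tree never grows too complex
stopped = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)

# Pruning: grow the tree, then cut back weak branches via cost-complexity pruning
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, model in [("early-stopped", stopped), ("pruned", pruned)]:
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(model.score(X_te, y_te), 3))
```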