
Artificial Intelligence and Machine Learning (U21PC602CS)
B.E (CSE) V-Semester (CIE-2) Solutions

Author:
B. Venkataramana
Assistant Professor
1. Define Policy and Reward.
A policy is a strategy used by an agent to determine the next action based on the
current state. It can be represented as a function π that maps states to probabilities
of selecting each possible action.

π(a|s) = P (At = a | St = s)
Where:
• π(a|s) is the probability that action a is taken when the state is s.
• At is the action taken at time step t.
• St is the state at time step t.
A reward is a scalar feedback signal given to an agent to indicate how well it is
performing at a particular task. The reward function R maps each state or state-action
pair to a real number.

R : S × A → ℝ
Where:
• S is the set of all possible states.
• A is the set of all possible actions.
• R(s, a) is the reward received after taking action a in state s.
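To make these definitions concrete, here is a minimal Python sketch of a tabular policy and reward function for a hypothetical two-state, two-action environment (all state names, action names, and numbers below are illustrative, not part of the question):

```python
import random

# Hypothetical toy environment: two states, two actions.
states = ["s1", "s2"]
actions = ["left", "right"]

# Policy π(a|s): for each state, a probability distribution over actions.
policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

# Reward function R(s, a): maps each state-action pair to a real number.
reward = {
    ("s1", "left"): 0.0,  ("s1", "right"): 1.0,
    ("s2", "left"): -1.0, ("s2", "right"): 5.0,
}

def sample_action(state):
    """Draw an action a with probability π(a | state)."""
    dist = policy[state]
    return random.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

a = sample_action("s1")
print(a, reward[("s1", a)])
```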
2. List the activation functions commonly used in neural networks.

Sigmoid (Logistic) Activation Function: The Sigmoid function is defined as:

σ(x) = 1 / (1 + e^(−x))

Hyperbolic Tangent (Tanh) Activation Function: The Tanh function is defined as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Rectified Linear Unit (ReLU) Activation Function: The ReLU function is defined as:

ReLU(x) = max(0, x)

Softmax Activation Function: The Softmax function is defined for the i-th output in a multi-class classification scenario as:

Softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
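For reference, these four functions can be sketched directly in NumPy (a minimal illustration, assuming NumPy is available):

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(−x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
    return np.tanh(x)

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

def softmax(x):
    # Softmax(x_i) = e^(x_i) / Σ_j e^(x_j); shift by max(x) for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), tanh(x), relu(x), softmax(x))
```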

3. Assign the point (5, 2) to a cluster, where the centroids of the clusters are
A: (4, 6), B: (1, 5), C: (7, 2).

Given:

• Point P: (5, 2)
• Centroids:
  – A: (4, 6)
  – B: (1, 5)
  – C: (7, 2)

The Euclidean distance d between two points (x1, y1) and (x2, y2) is given by:

d = √((x2 − x1)² + (y2 − y1)²)

Calculate the distances from point P to each centroid:

d(P, A) = √((4 − 5)² + (6 − 2)²) = √((−1)² + 4²) = √(1 + 16) = √17 ≈ 4.12

d(P, B) = √((1 − 5)² + (5 − 2)²) = √((−4)² + 3²) = √(16 + 9) = √25 = 5

d(P, C) = √((7 − 5)² + (2 − 2)²) = √(2² + 0²) = √4 = 2

Since d(P, C) is the smallest, point P is assigned to cluster C.
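The same assignment can be verified with a few lines of Python (a small sketch; the helper name euclidean is my own):

```python
import math

point = (5, 2)
centroids = {"A": (4, 6), "B": (1, 5), "C": (7, 2)}

def euclidean(p, q):
    # d = sqrt((x2 − x1)^2 + (y2 − y1)^2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

distances = {name: euclidean(point, c) for name, c in centroids.items()}
print(distances)                          # A ≈ 4.12, B = 5.0, C = 2.0
print(min(distances, key=distances.get))  # 'C' -> assign P to cluster C
```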

2(a). Build a simple linear regression for the following data:

X: 2, 5, 3, 4
Y: 7, 14, 8, 9

Given data:

X : {2, 5, 3, 4}
Y : {7, 14, 8, 9}

First, we calculate the necessary summations:


ΣX = 2 + 5 + 3 + 4 = 14
ΣY = 7 + 14 + 8 + 9 = 38
ΣXY = (2 · 7) + (5 · 14) + (3 · 8) + (4 · 9) = 14 + 70 + 24 + 36 = 144
ΣX² = 2² + 5² + 3² + 4² = 4 + 25 + 9 + 16 = 54

The number of data points n is 4.

Now, we calculate the slope (β1):

β1 = (n ΣXY − ΣX ΣY) / (n ΣX² − (ΣX)²) = (4(144) − (14)(38)) / (4(54) − (14)²) = (576 − 532) / (216 − 196) = 44 / 20 = 2.2

Next, we calculate the intercept (β0):

β0 = (ΣY − β1 ΣX) / n = (38 − 2.2 · 14) / 4 = (38 − 30.8) / 4 = 7.2 / 4 = 1.8
Therefore, the simple linear regression equation is:

Y = 1.8 + 2.2X
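A short Python sketch that reproduces this hand calculation using the same summation formulas:

```python
# Least-squares fit using the summation formulas above.
X = [2, 5, 3, 4]
Y = [7, 14, 8, 9]
n = len(X)

sum_x = sum(X)                             # 14
sum_y = sum(Y)                             # 38
sum_xy = sum(x * y for x, y in zip(X, Y))  # 144
sum_x2 = sum(x * x for x in X)             # 54

beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # 2.2
beta0 = (sum_y - beta1 * sum_x) / n                               # 1.8

print(f"Y = {beta0:.1f} + {beta1:.1f}X")   # Y = 1.8 + 2.2X
```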

2(b). Illustrate Decision Tree Induction for Classification


Given example dataset:
X1 X2 Y
0 0 0
0 1 0
1 0 1
1 1 1
Step 1: Calculate Entropy of Target Variable Y
The entropy H(Y ) is calculated as:

H(Y) = − Σᵢ pᵢ log₂(pᵢ)   (sum over the k classes)

where pᵢ is the probability of class i.
For the given dataset:

H(Y) = −((2/4) log₂(2/4) + (2/4) log₂(2/4)) = 1

Step 2: Calculate Information Gain for Each Feature


Feature X1:

X1   Y   Count
0    0   2
1    1   2

H(Y | X1 = 0) = −((2/2) log₂(2/2) + 0) = 0
H(Y | X1 = 1) = −(0 + (2/2) log₂(2/2)) = 0
H(Y | X1) = (2/4) · 0 + (2/4) · 0 = 0
Gain(Y, X1) = H(Y) − H(Y | X1) = 1 − 0 = 1

Feature X2:

X2   Y   Count
0    0   1
0    1   1
1    0   1
1    1   1

H(Y | X2 = 0) = −((1/2) log₂(1/2) + (1/2) log₂(1/2)) = 1
H(Y | X2 = 1) = −((1/2) log₂(1/2) + (1/2) log₂(1/2)) = 1
H(Y | X2) = (2/4) · 1 + (2/4) · 1 = 1
Gain(Y, X2) = H(Y) − H(Y | X2) = 1 − 1 = 0
Step 3: Select Feature with Highest Information Gain
Feature X1 has the highest information gain of 1. Therefore, X1 is selected as the
root node.

Step 4: Split Dataset Based on X1


X1 = 0 subset:        X1 = 1 subset:
X1  X2  Y             X1  X2  Y
0   0   0             1   0   1
0   1   0             1   1   1

Since the subsets are pure (all Y values are the same), the decision tree is complete.

Decision Tree:

        X1
      /    \
  X1 = 0   X1 = 1
     |        |
   Y = 0    Y = 1
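The entropy and information-gain values above can be checked with a small Python sketch (the helper functions entropy and information_gain are my own, not a standard library API):

```python
import math
from collections import Counter

# Dataset from the question: rows of (X1, X2, Y).
data = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]

def entropy(labels):
    """H = −Σ p_i log2(p_i) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """Gain(Y, X) = H(Y) − Σ_v (|S_v| / |S|) · H(Y | X = v)."""
    labels = [r[-1] for r in rows]
    total = entropy(labels)
    remainder = 0.0
    for v in set(r[feature_index] for r in rows):
        subset = [r[-1] for r in rows if r[feature_index] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return total - remainder

print(information_gain(data, 0))  # gain for X1 -> 1.0 (chosen as root)
print(information_gain(data, 1))  # gain for X2 -> 0.0
```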

3(a). Discuss the various metrics used in Regression and Classification
Metrics for regression models serve to quantify the performance and accuracy of
predictions for continuous numerical outcomes. Mean Absolute Error (MAE) calculates
the average of absolute differences between predicted and actual values, offering insight
into the typical magnitude of errors. Mean Squared Error (MSE) computes the average
of squared differences, emphasizing larger errors more than MAE. Root Mean Squared
Error (RMSE), the square root of MSE, provides an interpretable measure in the same
unit as the target variable. R-squared (R²) gauges the proportion of variance in the
dependent variable explained by the independent variables, with values ranging from 0
(none of the variance explained) to 1 (all of the variance explained). These metrics
collectively aid in assessing regression model fit and predictive accuracy, guiding model
selection and refinement based on specific problem requirements.

Table 1: Regression Metrics

Metric                              Formula
Mean Absolute Error (MAE)           MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
Mean Squared Error (MSE)            MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Root Mean Squared Error (RMSE)      RMSE = √MSE
R-squared (R²)                      R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

A confusion matrix is a table that is often used to describe the performance of a
classification model on a set of test data for which the true values are known. It helps
to visualize the performance of an algorithm by presenting the actual and predicted
classifications.
True Positives (TP): The cases in which the model correctly predicts the positive
class.

Table 2: Confusion Matrix

Actual      Predicted Positive    Predicted Negative
Positive    TP                    FN
Negative    FP                    TN

False Positives (FP): The cases in which the model incorrectly predicts the positive
class.
True Negatives (TN): The cases in which the model correctly predicts the negative
class.
False Negatives (FN): The cases in which the model incorrectly predicts the
negative class.

Table 3: Classifier Performance Metrics

Metric                  Formula
Accuracy                (TP + TN) / (TP + TN + FP + FN)
Error Rate              (FP + FN) / (TP + TN + FP + FN)
Precision               TP / (TP + FP)
Recall (Sensitivity)    TP / (TP + FN)
Specificity             TN / (TN + FP)
F1 Score                2 × (Precision × Recall) / (Precision + Recall)
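Both tables translate directly into code. The sketch below exercises the formulas on made-up predictions and made-up confusion-matrix counts, purely for illustration:

```python
import math

# Regression metrics (Table 1) on made-up values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
n = len(y_true)

mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)
y_bar = sum(y_true) / n
r2 = 1 - sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / sum((t - y_bar) ** 2 for t in y_true)

# Classification metrics (Table 3) from made-up confusion-matrix counts.
TP, FN, FP, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)
error_rate = (FP + FN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # sensitivity
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

print(mae, mse, rmse, r2)
print(accuracy, error_rate, precision, recall, specificity, f1)
```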

3(b). Explain the Perceptron neuron model.


The Perceptron is a simple model of a neuron used for binary classification. It
consists of:
• Inputs (x1 , x2 , . . . , xn ): Features or attributes of the data.
• Weights (w1 , w2 , . . . , wn ): Each input is associated with a weight.
• Bias (b): A constant value added to the weighted sum of inputs.
• Activation Function: Determines if the neuron should be activated. For a
Perceptron, this is a step function.
The output y of a Perceptron is computed as:

z = w1 x1 + w2 x2 + · · · + wn xn + b
y = 1 if z ≥ 0
y = 0 if z < 0

Training

Training involves adjusting the weights and bias based on errors:
• Initialize weights and bias randomly.
• Update weights and bias using the following rules if there is a misclassification:

wj ← wj + η(yi − ŷi )xij

b ← b + η(yi − ŷi )
where η is the learning rate.
• Repeat for a fixed number of epochs or until convergence.
Applications
• Binary Classification
• Linearly Separable Problems
Limitations
• Cannot solve non-linearly separable problems (e.g., XOR problem).
• Limited to single-layer models.
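A minimal NumPy sketch of this training rule, fitted on the (linearly separable) logical AND function; the data, learning rate, and epoch count are my own illustrative choices, not part of the question:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron learning rule: w <- w + η(y − ŷ)x, b <- b + η(y − ŷ)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0   # step activation
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b

# Linearly separable example: logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
print(w, b)
print([1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])  # expect [0, 0, 0, 1] once converged
```

Because the XOR labels are not linearly separable, the same loop would never converge on them, which is exactly the limitation noted above.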
4(a). Use single-link agglomerative clustering to group the given data
with the following distance matrix and show a dendrogram

Given the distance matrix:

A B C D
A 0 1 4 5
B 1 0 2 6
C 4 2 0 3
D 5 6 3 0
Steps and Dendrogram
1. Initial Clusters: {A}, {B}, {C}, {D}
2. First Merge:
• The closest pair is {A, B} with distance 1.
• Merge A and B into a new cluster {A, B}.
3. Update Distance Matrix:

{A, B} C D
{A, B} 0 2 5
C 2 0 3
D 5 3 0

4. Second Merge:
• In the updated matrix, the closest pair is {A, B} and C, with single-link distance 2.
• Merge them into a new cluster {A, B, C}.

5. Update Distance Matrix:

{A, B, C} D
{A, B, C} 0 3
D 3 0

(d({A, B, C}, D) = min(5, 6, 3) = 3 under single linkage.)

6. Final Merge:
• The remaining clusters {A, B, C} and D are merged with distance 3, giving {A, B, C, D}.

Dendrogram

A and B join at height 1, C joins {A, B} at height 2, and D joins {A, B, C} at height 3.
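The merge order can be cross-checked with SciPy's hierarchical clustering routines (a small sketch, assuming SciPy is installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Square distance matrix from the question; rows/columns are A, B, C, D.
D = np.array([
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z)
# Each row is [cluster_i, cluster_j, merge_distance, new_cluster_size];
# the merge distances come out as 1, 2, 3, matching the steps above.
```

scipy.cluster.hierarchy.dendrogram(Z) can then be used to draw the dendrogram itself.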

4(b). Explain Expectation-Maximization (EM) Clustering

Expectation-Maximization (EM) Clustering is a probabilistic technique used for
clustering data. It models the data as a mixture of several probability distributions,
each representing a cluster. The most common model used is the Gaussian Mixture
Model (GMM).
• Mixture Model: Assumes data is generated from a mixture of probability
distributions. For GMM, each cluster is modeled as a Gaussian distribution.
• Latent Variables: Hidden variables representing cluster membership of each data
point.
• Parameters to Estimate:
– Mean (µk ): Center of the k-th cluster.
– Covariance Matrix (Σk ): Describes the shape and orientation of the k-th
cluster.
– Mixing Coefficients (πk ): Prior probability of each cluster.
The EM Algorithm
The EM algorithm involves two main steps:

1. Expectation Step (E-Step): Compute the probability that each data point
belongs to each cluster, given the current parameters:

γik = πk N(xi | µk, Σk) / Σⱼ πj N(xi | µj, Σj)

where the sum in the denominator runs over all K clusters and N(xi | µk, Σk) is the
Gaussian probability density function for cluster k.


2. Maximization Step (M-Step): Update the parameters to maximize the likelihood
of the data given the responsibilities:

µk = (Σᵢ γik xi) / (Σᵢ γik)

Σk = (Σᵢ γik (xi − µk)(xi − µk)ᵀ) / (Σᵢ γik)

πk = (Σᵢ γik) / N

where the sums run over all N data points (i = 1, . . . , N).
Convergence
The EM algorithm iterates between the E-step and M-step until convergence, which
is typically determined by the change in log-likelihood being below a threshold.
Advantages and Limitations
• Advantages:
– Flexibility in modeling different shapes and sizes of clusters.
– Provides a probabilistic assignment of data points.
• Limitations:
– Sensitive to initial parameter estimates.
– Assumes clusters are Gaussian distributed.
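To tie the two steps together, here is a compact, illustrative NumPy/SciPy implementation of the EM loop for a GMM. It is a teaching sketch on made-up data, not production code; the initialization and fixed iteration count are deliberately simplistic:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Simplified EM for a Gaussian Mixture Model, following the E/M steps above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Crude initialization: K random data points as means, identity covariances,
    # uniform mixing coefficients.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    sigma = np.array([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] ∝ pi_k * N(x_i | mu_k, sigma_k)
        gamma = np.zeros((N, K))
        for k in range(K):
            gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate mu_k, sigma_k, pi_k from the responsibilities
        Nk = gamma.sum(axis=0)
        for k in range(K):
            mu[k] = gamma[:, k] @ X / Nk[k]
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return mu, sigma, pi, gamma

# Made-up toy data: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([5, 5], 0.5, size=(50, 2))])

mu, sigma, pi, gamma = em_gmm(X, K=2)
print(mu)   # estimated cluster means µk
print(pi)   # estimated mixing coefficients πk
```

In practice the loop would also monitor the log-likelihood and stop once its change falls below a threshold, as described under Convergence above.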
