
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,

BHOPAL

INFORMATION TECHNOLOGY

Machine Learning
IT-613

ASSIGNMENT

Submitted to: Dr. Dharmendra Dangi
Submitted by: Atharva Nagore (21U03013)

Assignment-1
1. Machine Learning (ML) is a process that enables systems to learn patterns from data and make
predictions or decisions without explicit programming. The basic components of this process
are:

a. Data Collection: ML begins with gathering relevant data. This data is typically divided into
training, validation, and testing sets.

b. Data Preprocessing: Data often requires cleaning and preparation, which involves:
• Handling missing data
• Normalizing/standardizing data
• Removing outliers or irrelevant features

c. Feature Selection/Engineering: Relevant features (or inputs) that have the most influence
on the output are identified. Feature engineering might also involve creating new
features.

d. Model Selection: Choose a machine learning algorithm that fits the problem. Models vary with the type of learning task (e.g., regression, classification, clustering).

e. Training: The selected model is trained on the training dataset, adjusting its parameters
(e.g., weights in neural networks) to minimize the error.

f. Evaluation: The model’s performance is evaluated on a validation set (or test set) using
various metrics like accuracy, precision, recall, and more.

g. Tuning and Optimization: The model's hyperparameters (e.g., learning rate, regularization parameters) are fine-tuned to optimize performance.

h. Testing and Deployment: After training and optimization, the model is tested on a hold-
out test dataset and deployed for real-world usage.
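To make steps (a)–(h) concrete, the sketch below wires them together in Python with scikit-learn. The dataset (scikit-learn's built-in breast-cancer data), the logistic-regression model, and the hyperparameter grid are illustrative assumptions, not part of the assignment.

```python
# Minimal end-to-end ML pipeline sketch (illustrative; assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# a. Data collection: load a labeled dataset and split it into train/test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# b-d. Preprocessing + model selection: standardize features, then a logistic regression classifier.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# e + g. Training and hyperparameter tuning: grid-search the regularization strength C.
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# f + h. Evaluation on the held-out test set before deployment.
y_pred = grid.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
```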

2. Types of Learning in Machine Learning:

a. Supervised Learning: In supervised learning, the model is trained on labeled data, meaning both the input and the output are known. The model learns to map inputs to the correct outputs.

Example: A spam email classifier is trained on labeled emails (spam or not spam) to predict
future emails as spam or not.

Common Algorithms: Linear regression, Decision trees, Support Vector Machines (SVM).

b. Unsupervised Learning: In this case, the model is given data without explicit labels. The
goal is to identify hidden structures or patterns in the data.

Example: Clustering similar customer profiles for market segmentation based on purchase
behavior without predefined categories.

Common Algorithms: K-Means, Hierarchical Clustering, Principal Component Analysis (PCA).

c. Semi-Supervised Learning: Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data. The model uses the labeled data to help interpret the structure of the unlabeled data.

Example: A few labeled tumor images with many unlabeled ones in medical image
analysis.

d. Reinforcement Learning: In reinforcement learning, the model learns by interacting with an environment, receiving rewards or penalties based on the actions it takes.

Example: Training an AI to play a game like chess, where the AI learns through wins and
losses.

Common Algorithms: Q-Learning, Deep Q Networks (DQN).

3. PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms data into a new coordinate system where the axes represent directions of maximum variance. It reduces the number of variables in the dataset while retaining as much information as possible. Below is a step-by-step outline:

• Standardization: The data is standardized so that each feature has a mean of 0 and a
standard deviation of 1.

• Covariance Matrix: Calculate the covariance matrix to understand the relationships between variables.

• Eigenvectors and Eigenvalues: Compute eigenvectors (directions of principal components) and eigenvalues (magnitude of variance along each eigenvector).

• Projection: Project the original data onto the eigenvectors to obtain the principal
components.


4. Given the data:

$X = \begin{bmatrix} 2 & 3 & 7 \\ 11 & 14 & 26 \end{bmatrix}$

We'll compute the principal components using the following steps:

a. Mean Centering: First, compute the mean of each variable (x and y), then subtract the
mean from the data to center it.

Mean of $x$ ($\text{mean}_x$): $\frac{2 + 3 + 7}{3} = 4$

Mean of $y$ ($\text{mean}_y$): $\frac{11 + 14 + 26}{3} = 17$

Centered Data:

$X_{\text{centered}} = \begin{bmatrix} 2-4 & 3-4 & 7-4 \\ 11-17 & 14-17 & 26-17 \end{bmatrix} = \begin{bmatrix} -2 & -1 & 3 \\ -6 & -3 & 9 \end{bmatrix}$

b. Covariance Matrix: Compute the covariance matrix of the centered data.

$\text{Cov}(X_{\text{centered}}) = \frac{1}{n-1}\begin{bmatrix} (-2)^2 + (-1)^2 + 3^2 & (-2)(-6) + (-1)(-3) + (3)(9) \\ (-6)(-2) + (-3)(-1) + (9)(3) & (-6)^2 + (-3)^2 + 9^2 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} 14 & 42 \\ 42 & 126 \end{bmatrix} = \begin{bmatrix} 7 & 21 \\ 21 & 63 \end{bmatrix}$

c. Eigenvectors and Eigenvalues: Solve for the eigenvalues and eigenvectors of the covariance matrix (these represent the principal components), then project the centered data onto the leading eigenvector to obtain the first principal component.
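The steps above can be checked numerically; the following NumPy sketch (assuming the observations are the columns of $X$, as written) reproduces the centering, covariance, and projection steps:

```python
import numpy as np

# Data from question 4: rows are the variables (x, y), columns are the 3 observations.
X = np.array([[2.0, 3.0, 7.0],
              [11.0, 14.0, 26.0]])

# a. Mean-center each variable.
X_centered = X - X.mean(axis=1, keepdims=True)   # [[-2, -1, 3], [-6, -3, 9]]

# b. Covariance matrix of the centered data (divide by n - 1 = 2).
cov = X_centered @ X_centered.T / (X.shape[1] - 1)
print("covariance matrix:\n", cov)               # [[ 7. 21.] [21. 63.]]

# c. Eigen-decomposition: eigenvectors are the principal directions.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print("eigenvalues:", eigvals)                   # approximately [70, 0]

# d. Project the centered data onto the first eigenvector -> first principal component scores.
pc1 = eigvecs[:, 0] @ X_centered
print("first principal component scores:", pc1)
```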

5. Cross-validation is a technique to evaluate the generalization ability of a machine learning model. It ensures that the model is not overfitting or underfitting by testing it on different subsets of data.


• K-Fold Cross-Validation: The dataset is divided into K subsets (folds). The model is
trained on K-1 folds and tested on the remaining one. This process is repeated K times,
with each fold being used as the test set once. The performance is averaged across all
K trials.

• Stratified K-Fold: In stratified K-fold, the division is done so that each fold maintains
the proportion of each class, useful for classification problems with imbalanced data.

• Leave-One-Out Cross-Validation (LOOCV): In this variant, each instance in the dataset is used once as a test set, and the model is trained on the remaining data. This process is repeated for each instance.

• Time-Series Cross-Validation: Used for time-series data, this method avoids random
shuffling of data. Instead, the model is trained on earlier data and tested on later data
to reflect the temporal structure.
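A minimal scikit-learn sketch of these variants, using an assumed dataset (Iris) and model (a decision tree) purely for illustration:

```python
# Sketch of k-fold, stratified, and leave-one-out cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Plain k-fold: 5 folds, each used once as the test set.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified k-fold: each fold keeps the class proportions (important for imbalanced data).
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out: one instance per test set, repeated for every instance.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# For time-series data, sklearn.model_selection.TimeSeriesSplit trains on earlier folds
# and tests on later ones instead of shuffling.

print("k-fold mean accuracy    :", kfold_scores.mean())
print("stratified mean accuracy:", strat_scores.mean())
print("LOOCV mean accuracy     :", loo_scores.mean())
```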


Assignment 2

1. Types of Regression

Regression analysis is a statistical technique used to model the relationship between dependent and
independent variables. The different types of regression are:

• Linear Regression: It models the relationship between two variables by fitting a linear
equation to the observed data.

• Multiple Regression: Extends linear regression by using multiple independent variables to predict the dependent variable.

• Polynomial Regression: Used when the data is better modeled by a polynomial equation, i.e.,
a curve.

• Ridge Regression: A type of regularization used when data suffers from multicollinearity
(independent variables are highly correlated).

• Lasso Regression: Another regularization technique, but here, less important coefficients are
set to zero, effectively performing feature selection.

• Logistic Regression: Used for binary classification problems.

• ElasticNet Regression: Combines both ridge and lasso regression penalties for better
generalization.
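As a brief illustration of the difference between ordinary linear, ridge, and lasso regression, the sketch below fits all three to synthetic data; the data and regularization strengths are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 100 samples, 5 features, but only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

for name, model in [("linear", LinearRegression()),
                    ("ridge ", Ridge(alpha=1.0)),   # shrinks coefficients toward zero
                    ("lasso ", Lasso(alpha=0.1))]:  # can set weak coefficients exactly to zero
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```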

2. Yes, decision trees can be used for regression tasks. In regression trees, the target variable is
continuous (as opposed to categorical in classification trees). The model predicts the value of the
dependent variable by learning simple decision rules inferred from the data.

The algorithm splits the data into different regions by minimizing the "variance" (sum of squared
differences between actual and predicted values) at each node. The leaf nodes represent the
predicted output (which is typically the average of the target values in that node).

3. Information Gain and Entropy

• Entropy: A measure of impurity or randomness in the data. In decision trees, it measures how
mixed the classes are in a dataset. It’s calculated as:
$\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

where $p_i$ is the proportion of class $i$ in the dataset.

• Information Gain: The reduction in entropy or uncertainty after a dataset is split on an attribute. It is used to decide which attribute to split on at each step of the tree. The formula is:


$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$

where $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.

• Building a Decision Tree: The decision tree starts by calculating the entropy of the target
variable and then calculates the information gain for each feature. The feature with the
highest information gain is selected for the first split. This process is repeated recursively until
the tree is built.

Example: Assume a dataset where we want to predict whether a student will pass or fail based
on their hours of study. If we split the data on the 'hours of study' attribute, we calculate the
entropy before and after the split. The split that reduces entropy the most is selected.
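A small Python sketch of these two formulas, applied to a hypothetical pass/fail dataset split on a made-up "studied at least 4 hours" attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy(S) minus the weighted entropy of the subsets induced by the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical data: pass/fail labels and whether the student studied at least 4 hours.
labels  = ["pass", "pass", "pass", "fail", "fail", "pass", "fail", "fail"]
studied = ["yes",  "yes",  "yes",  "no",   "no",   "yes",  "no",   "yes"]

print("Entropy(S)      :", round(entropy(labels), 3))            # 1.0 (4 pass, 4 fail)
print("Information gain:", round(information_gain(labels, studied), 3))
```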

4. Some of the common issues in Decision Tree Learning are:

• Overfitting: Decision trees tend to overfit, especially when they grow too deep. This means
the model performs well on training data but poorly on unseen data. This can be overcome
by:

o Pruning: Trimming branches that have little significance to improve the model's
generalization.

o Setting a maximum depth: Restricting the tree depth to prevent over-complex models.

• Bias towards attributes with more levels: Attributes with many unique values (e.g., IDs) tend
to have higher information gain, even if they don’t truly provide significant insight. This can
be overcome by using Gain Ratio.

• Imbalanced datasets: Decision trees might perform poorly when one class dominates. This
can be addressed by resampling or adjusting the class weights.

5. Given the data (GPA $x_1$, months worked $x_2$, and annual salary $y$ for 10 graduates):

GPA (x1):             2.8    3.4    3.2    3.8    3.2    3.4    4.0    2.6    3.2    3.8
Months worked (x2):    48     24     24     24     48     36     24     48     36     12
Annual salary (y):  20000  24500  23000  25000  20000  22500  27500  19000  24000  28500

First, sum all the values for GPA ($x_1$), months worked ($x_2$), and annual salary ($y$):

$\sum y = 20000 + 24500 + 23000 + 25000 + 20000 + 22500 + 27500 + 19000 + 24000 + 28500 = 234000$

$\sum x_1 = 2.8 + 3.4 + 3.2 + 3.8 + 3.2 + 3.4 + 4.0 + 2.6 + 3.2 + 3.8 = 33.4$

$\sum x_2 = 48 + 24 + 24 + 24 + 48 + 36 + 24 + 48 + 36 + 12 = 324$

Next, calculate the sums of squares and cross products:

$\sum x_1^2 = 2.8^2 + 3.4^2 + 3.2^2 + 3.8^2 + 3.2^2 + 3.4^2 + 4.0^2 + 2.6^2 + 3.2^2 + 3.8^2 = 113.32$

$\sum x_2^2 = 48^2 + 24^2 + 24^2 + 24^2 + 48^2 + 36^2 + 24^2 + 48^2 + 36^2 + 12^2 = 11952$

$\sum x_1 x_2 = 2.8 \cdot 48 + 3.4 \cdot 24 + 3.2 \cdot 24 + 3.8 \cdot 24 + 3.2 \cdot 48 + 3.4 \cdot 36 + 4.0 \cdot 24 + 2.6 \cdot 48 + 3.2 \cdot 36 + 3.8 \cdot 12 = 1041.6$

$\sum x_1 y = 2.8 \cdot 20000 + 3.4 \cdot 24500 + 3.2 \cdot 23000 + 3.8 \cdot 25000 + 3.2 \cdot 20000 + 3.4 \cdot 22500 + 4.0 \cdot 27500 + 2.6 \cdot 19000 + 3.2 \cdot 24000 + 3.8 \cdot 28500 = 792900$

$\sum x_2 y = 48 \cdot 20000 + 24 \cdot 24500 + 24 \cdot 23000 + 24 \cdot 25000 + 48 \cdot 20000 + 36 \cdot 22500 + 24 \cdot 27500 + 48 \cdot 19000 + 36 \cdot 24000 + 12 \cdot 28500 = 7248000$
With $n = 10$, the sample means are $\bar{x}_1 = 3.34$, $\bar{x}_2 = 32.4$, and $\bar{y} = 23400$, giving the centered (deviation) sums:

$S_{11} = \sum x_1^2 - n\bar{x}_1^2 = 1.764, \quad S_{22} = \sum x_2^2 - n\bar{x}_2^2 = 1454.4, \quad S_{12} = \sum x_1 x_2 - n\bar{x}_1\bar{x}_2 = -40.56$

$S_{1y} = \sum x_1 y - n\bar{x}_1\bar{y} = 11340, \quad S_{2y} = \sum x_2 y - n\bar{x}_2\bar{y} = -333600$

The regression coefficients $\beta_1$ (for GPA) and $\beta_2$ (for months worked) are found by solving the normal equations, which in deviation form give:

$\beta_1 = \frac{S_{22} S_{1y} - S_{12} S_{2y}}{S_{11} S_{22} - S_{12}^2}, \qquad \beta_2 = \frac{S_{11} S_{2y} - S_{12} S_{1y}}{S_{11} S_{22} - S_{12}^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \beta_2 \bar{x}_2$

Substituting the values and solving:

$\beta_1 \approx 3218.09, \qquad \beta_2 \approx -139.63, \qquad \beta_0 \approx 17175.53$

So, the final regression equation is:

$\text{Annual Salary} = 17175.53 + 3218.09 \cdot \text{GPA} - 139.63 \cdot \text{Months Worked}$
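The coefficients can be verified numerically with NumPy's least-squares solver; this is only a checking sketch, with the data values taken from the table above.

```python
import numpy as np

gpa    = np.array([2.8, 3.4, 3.2, 3.8, 3.2, 3.4, 4.0, 2.6, 3.2, 3.8])
months = np.array([48, 24, 24, 24, 48, 36, 24, 48, 36, 12], dtype=float)
salary = np.array([20000, 24500, 23000, 25000, 20000, 22500, 27500, 19000, 24000, 28500], dtype=float)

# Design matrix with an intercept column: salary ~ b0 + b1*GPA + b2*months.
A = np.column_stack([np.ones_like(gpa), gpa, months])
(b0, b1, b2), *_ = np.linalg.lstsq(A, salary, rcond=None)

print(f"intercept = {b0:.2f}, beta1 (GPA) = {b1:.2f}, beta2 (months) = {b2:.2f}")
# Expected: intercept ~ 17175.53, beta1 ~ 3218.09, beta2 ~ -139.63
```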


Assignment 3
1. Mathematical Formulation of the SVM Problem

The objective of Support Vector Machine (SVM) is to find a hyperplane that maximally
separates two classes in the feature space. The formulation is as follows:

Hard Margin SVM (Linearly Separable Case):

Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ are the feature vectors and $y_i \in \{-1, 1\}$ are the class labels, the goal is to find a hyperplane $w \cdot x + b = 0$ such that

$y_i (w \cdot x_i + b) \ge 1 \quad \forall i$

This ensures that all data points are correctly classified.

The optimization problem is to maximize the margin (the distance between the two classes), which is equivalent to minimizing $\frac{1}{2}\|w\|^2$, subject to the above constraints:

$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i$

Soft Margin SVM (Linearly Non-separable Case):

For non-linearly separable data, we introduce slack variables $\xi_i \ge 0$ for each data point, and the optimization problem becomes:

$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i$

Where 𝐶 is a penalty parameter controlling the trade-off between maximizing the margin and
minimizing the classification error.

Solving the SVM Problem:


• Convert the primal problem to its dual form using Lagrangian multipliers.
• Solve the dual problem, which is a quadratic programming (QP) problem, to find the
Lagrange multipliers.
• Use the Lagrange multipliers to compute the optimal hyperplane parameters 𝑤 and
𝑏.
• In case of non-linear data, use kernel functions (e.g., RBF, polynomial) to map the data
into a higher-dimensional space where it is linearly separable.
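A short scikit-learn sketch of the soft-margin SVM, showing the role of the penalty parameter C and of a kernel; the dataset (make_moons) and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A non-linearly separable toy dataset.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear kernel: solves the soft-margin problem in the original feature space.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X_train, y_train)

# RBF kernel: implicitly maps the data to a higher-dimensional space where it is separable.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy   :", rbf_svm.score(X_test, y_test))
```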

2. A base learner (or weak learner) is an individual model in ensemble learning that, on its own,
may not perform well (i.e., may have accuracy only slightly better than random guessing), but
when combined with other base learners, it can form a strong learner that performs
significantly better.

Selecting Base Learners:


• Accuracy: Base learners should be weak learners with some predictive power but not
overfitting the data.
• Diversity: In ensemble methods like bagging or boosting, base learners should be
diverse (e.g., different models, different subsets of data).
• Scalability: Base learners should be fast and computationally efficient, as they are
trained multiple times in ensemble learning.

Examples of base learners include decision trees, k-NN, naive Bayes classifiers, etc.
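As one possible illustration, the sketch below bags shallow decision trees (weak base learners) with scikit-learn and compares them against a single tree; the dataset and depth limit are assumptions for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

base = DecisionTreeClassifier(max_depth=2, random_state=0)          # a weak learner
ensemble = BaggingClassifier(base, n_estimators=100, random_state=0)  # 100 diverse copies on bootstrap samples

print("single weak tree:", cross_val_score(base, X, y, cv=5).mean())
print("bagged ensemble :", cross_val_score(ensemble, X, y, cv=5).mean())
```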

3. The measure of dissimilarity in clustering quantifies how different two data points are from
each other. It is used to group similar data points together while separating dissimilar ones
into different clusters.

Examples of Measures of Dissimilarity:

• Euclidean Distance: $d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$

• Manhattan Distance (L1 Norm): $d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$

• Cosine Distance: one minus the cosine of the angle between the two vectors: $d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|}$

• Jaccard Distance: Measures dissimilarity between two sets as $d(x_i, x_j) = 1 - \frac{|x_i \cap x_j|}{|x_i \cup x_j|}$.
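These measures can be computed directly with SciPy; the two vectors below are made up, and for the Jaccard distance they are treated as sets of non-zero positions.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 1.0, 1.0])
b = np.array([0.0, 1.0, 1.0, 1.0])

print("Euclidean :", distance.euclidean(a, b))          # sqrt of the sum of squared differences
print("Manhattan :", distance.cityblock(a, b))          # sum of absolute differences
print("Cosine    :", distance.cosine(a, b))             # 1 - cos(angle between the vectors)
print("Jaccard   :", distance.jaccard(a != 0, b != 0))  # 1 - |intersection| / |union| on boolean sets
```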

4. The K-means algorithm clusters data points by iteratively assigning them to one of 𝑘 clusters
based on their distance from the centroids of the clusters.

Algorithm Steps:
• Initialize: Select 𝑘 initial centroids randomly or using some heuristic.
• Assignment: Assign each point to the nearest centroid based on a distance metric
(usually Euclidean distance).
• Update: Recalculate the centroids as the mean of all points assigned to each cluster.
• Repeat: Continue assigning points and updating centroids until convergence (i.e., the
centroids do not change significantly).

Example:
We are given the points (1, 0, 1), (1, 1, 0), (0, 0, 1), and (1, 1, 1), with $k = 2$.
Step 1: Initialize centroids: Let's choose two points as initial centroids:
$C_1 = (1, 0, 1)$
$C_2 = (1, 1, 0)$


Step 2: Assign points to the nearest centroid: Calculate Euclidean distances between points
and centroids:
• Point (1, 0, 1): closer to 𝐶1
• Point (1, 1, 0): closer to 𝐶2
• Point (0, 0, 1): closer to 𝐶1
• Point (1, 1, 1): equidistant from 𝐶1 and 𝐶2 (distance 1 from each); assign it to 𝐶1

Clusters after first iteration:

• Cluster 1: {(1, 0, 1), (0, 0, 1), (1, 1, 1)}
• Cluster 2: {(1, 1, 0)}

Step 3: Update centroids:

• New centroid 𝐶1 = mean of {(1, 0, 1), (0, 0, 1), (1, 1, 1)} = (0.67, 0.33, 1)
• New centroid 𝐶2 = mean of {(1, 1, 0)} = (1, 1, 0)

Step 4: Repeat until convergence (since cluster assignments don't change, we stop here).

Final clusters:

• Cluster 1: {(1, 0, 1), (0, 0, 1), (1, 1, 1)}

• Cluster 2: {(1, 1, 0)}
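A small NumPy sketch of the same example, iterating the assignment and update steps until the centroids stop changing:

```python
import numpy as np

points = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
centroids = np.array([[1, 0, 1], [1, 1, 0]], dtype=float)   # initial centroids C1, C2

for _ in range(10):                                          # iterate until assignments stabilize
    # Assignment step: index of the nearest centroid for every point (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):                # convergence check
        break
    centroids = new_centroids

print("labels   :", labels)     # cluster index for each point: [0, 1, 0, 0]
print("centroids:", centroids)  # approx. [[0.67, 0.33, 1.0], [1.0, 1.0, 0.0]]
```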

5. Agglomerative hierarchical clustering builds a hierarchy of clusters in a bottom-up approach:

Algorithm Steps:
• Start: Each data point starts in its own cluster.
• Merge: At each step, the two clusters with the smallest dissimilarity are merged into
a single cluster.
• Repeat: Continue merging until all points are in a single cluster.
• Dissimilarity (linkage) measures:
o Single Linkage: distance between the closest pair of points in the two clusters.
o Complete Linkage: distance between the farthest pair of points in the two clusters.
o Average Linkage: average distance between all pairs of points from the two clusters.

The result is typically visualized using a dendrogram, which shows the hierarchical
relationships between clusters.
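A brief SciPy sketch of agglomerative clustering (reusing the four points from question 4 as assumed input); the linkage method can be switched between "single", "complete", and "average".

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)

# 'single' = closest-pair distance; 'complete' and 'average' linkage work the same way.
Z = linkage(points, method="single", metric="euclidean")
print(Z)   # each row records the two clusters merged and the distance at which they merged

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the hierarchy into 2 clusters
print("cluster labels:", labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram with matplotlib.
```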


Assignment 4
1. An Artificial Neural Network (ANN) consists of different layers, each serving a unique function.
These layers are:

• Input Layer: Receives the initial input data (features) for processing. Each node in the
input layer represents a feature in the data.
• Hidden Layer(s): These layers perform computations and extract features from the
input data. They process inputs using weights, biases, and activation functions. There
can be multiple hidden layers, and networks with more hidden layers are called "deep
networks."
• Output Layer: Produces the final output, typically representing predictions,
classifications, or regression results. The number of nodes depends on the number of
output classes or values.

2. An activation function determines whether a neuron should be activated or not based on its
weighted input. It introduces non-linearity to the model, allowing it to learn complex patterns.
Without activation functions, a neural network would essentially be a linear regression model.

Examples of Activation Functions:


• Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$. Outputs a value between 0 and 1.
• Tanh (Hyperbolic Tangent): $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$. Outputs values between -1 and 1.
• ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$. Outputs 0 for negative inputs and the value itself for positive inputs.
• Leaky ReLU: A variant of ReLU that allows a small slope for negative inputs, avoiding the "dead neuron" problem.
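A short NumPy sketch of these activation functions, evaluated on a few assumed sample inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1)

def tanh(x):
    return np.tanh(x)                       # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)               # 0 for negative inputs, x otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)    # small slope for negative inputs

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # illustrative inputs
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("ReLU", relu), ("leaky ReLU", leaky_relu)]:
    print(f"{name:10s}:", np.round(f(z), 3))
```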

3. To implement the XOR logic function, we can use a two-layer perceptron network, because XOR is not linearly separable. Below is a breakdown of the perceptron setup (a concrete weight assignment is sketched after the Boolean form):
• Input Layer: Two inputs, A and B.
• Hidden Layer: Two neurons:
o Neuron 1 performs the OR operation: $N_1 = A + B$.
o Neuron 2 performs the NAND operation: $N_2 = \lnot(A \cdot B)$.
• Output Layer: One neuron performs the AND operation on the outputs of the hidden neurons, producing the XOR output.

So, the XOR function would look like:

$XOR(A, B) = (A + B) \cdot \lnot(A \cdot B) = (A \cdot \lnot B) + (\lnot A \cdot B)$
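A minimal sketch of this network with hand-chosen weights and threshold units (the specific weights and biases are assumptions; any values realizing OR, NAND, and AND would work):

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return (z > 0).astype(int)

def perceptron(x, w, b):
    return step(x @ w + b)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: one OR unit and one NAND unit.
h_or   = perceptron(inputs, np.array([1, 1]),  -0.5)   # fires when A + B > 0.5
h_nand = perceptron(inputs, np.array([-1, -1]), 1.5)   # fires unless both inputs are 1

# Output layer: AND of the two hidden outputs gives XOR.
hidden = np.column_stack([h_or, h_nand])
xor_out = perceptron(hidden, np.array([1, 1]), -1.5)

print("A B | XOR")
for (a, b), y in zip(inputs, xor_out):
    print(f"{a} {b} |  {y}")    # expected column: 0, 1, 1, 0
```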


4. Deep Learning is a subfield of machine learning that involves neural networks with multiple
layers (often called deep neural networks). It excels at learning representations from raw data
(images, text, etc.) without requiring feature engineering. Deep learning has enabled
breakthroughs in many complex tasks that were previously infeasible.

Applications of Deep Learning:


• Image Classification (e.g., object recognition in images, face recognition).
• Speech Recognition (e.g., Siri, Google Assistant).
• Natural Language Processing (NLP) (e.g., language translation, text generation).
• Autonomous Vehicles (e.g., self-driving car navigation).
• Healthcare (e.g., diagnosing diseases using medical imaging).

5. Given a neuron with input values $x_0 = 3.5$, $x_1 = 2.9$, and $x_2 = 1.2$, and weights $w_0 = 0.89$, $w_1 = -2.07$, and $w_2 = 0.08$, we first calculate the weighted sum $z$ (assuming the bias $b = 0.5$):

$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + b = (0.89 \cdot 3.5) + (-2.07 \cdot 2.9) + (0.08 \cdot 1.2) + 0.5 = 3.115 - 6.003 + 0.096 + 0.5 = -2.292$

Now, compute the output for each activation function:


i. Threshold Function: A threshold function outputs 1 if $z > 0$, otherwise 0.

$y = \text{Threshold}(z) = 0 \quad (\text{since } z = -2.292 < 0)$

ii. Sigmoid Function: The sigmoid function is $\sigma(z) = \frac{1}{1 + e^{-z}}$.

$y = \frac{1}{1 + e^{2.292}} \approx \frac{1}{1 + 9.89} \approx 0.092$

iii. Hyperbolic Tangent (Tanh) Function: The tanh function is $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$.

$y = \tanh(-2.292) \approx -0.980$
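A short NumPy sketch that reproduces these three outputs:

```python
import numpy as np

x = np.array([3.5, 2.9, 1.2])       # inputs x0, x1, x2
w = np.array([0.89, -2.07, 0.08])   # weights w0, w1, w2
b = 0.5                             # assumed bias

z = w @ x + b                                  # weighted sum: -2.292
threshold_out = 1 if z > 0 else 0              # threshold / step output: 0
sigmoid_out = 1.0 / (1.0 + np.exp(-z))         # approx 0.092
tanh_out = np.tanh(z)                          # approx -0.980

print(f"z = {z:.3f}, threshold = {threshold_out}, sigmoid = {sigmoid_out:.3f}, tanh = {tanh_out:.3f}")
```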

