ML Assignment
BHOPAL
INFORMATION TECHNOLOGY
Machine Learning (IT-613)
Atharva Nagore 21U03013
Assignment 1
1. Machine Learning (ML) is a process that enables systems to learn patterns from data and make
predictions or decisions without explicit programming. The basic components of this process
are:
a. Data Collection: ML begins with gathering relevant data. This data is typically divided into
training, validation, and testing sets.
b. Data Preprocessing: Data often requires cleaning and preparation, which involves:
• Handling missing data
• Normalizing/standardizing data
• Removing outliers or irrelevant features
c. Feature Selection/Engineering: Relevant features (or inputs) that have the most influence
on the output are identified. Feature engineering might also involve creating new
features.
d. Model Selection: Choose a machine learning algorithm that fits the problem. Models vary based on the type of learning (e.g., regression, classification, or clustering).
e. Training: The selected model is trained on the training dataset, adjusting its parameters
(e.g., weights in neural networks) to minimize the error.
f. Evaluation: The model’s performance is evaluated on a validation set (or test set) using
various metrics like accuracy, precision, recall, and more.
g. Testing and Deployment: After training and optimization, the model is tested on a hold-out test dataset and deployed for real-world usage.
Example: A spam email classifier is trained on labeled emails (spam or not spam) to predict whether future emails are spam.
Common Algorithms: Linear regression, Decision trees, Support Vector Machines (SVM).
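For illustration, here is a minimal sketch of such a classifier in scikit-learn (the toy emails and labels are made up for this example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = spam, 0 = not spam (made up for illustration)
emails = ["win a free prize now", "meeting at 10am tomorrow",
          "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(emails, labels)                    # training step
print(clf.predict(["free prize inside"]))  # [1] -> predicted spam
```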
2. Types of machine learning:
a. Supervised Learning: The model is trained on labeled data, learning a mapping from inputs to known outputs (as in the spam classifier above).
b. Unsupervised Learning: In this case, the model is given data without explicit labels. The goal is to identify hidden structures or patterns in the data.
Example: Clustering similar customer profiles for market segmentation based on purchase
behavior without predefined categories.
c. Semi-Supervised Learning: The model learns from a small amount of labeled data combined with a large amount of unlabeled data.
Example: A few labeled tumor images with many unlabeled ones in medical image analysis.
d. Reinforcement Learning: The model learns by interacting with an environment, receiving rewards or penalties for its actions.
Example: Training an AI to play a game like chess, where the AI learns through wins and losses.
3. Principal Component Analysis (PCA) reduces dimensionality by projecting the data onto the directions of maximum variance. The main steps are:
• Standardization: The data is standardized so that each feature has a mean of 0 and a standard deviation of 1.
• Covariance and Eigendecomposition: Compute the covariance matrix of the standardized data, then its eigenvalues and eigenvectors.
• Projection: Project the original data onto the eigenvectors to obtain the principal components.
Worked example with X = (2, 3, 7) and Y = (11, 14, 26):
a. Mean Centering: First, compute the mean of each variable (x and y), then subtract the mean from the data to center it.
Mean of X: $\bar{x} = \frac{2 + 3 + 7}{3} = 4$
Mean of Y: $\bar{y} = \frac{11 + 14 + 26}{3} = 17$
Centered Data:
$$X_{\text{centered}} = \begin{bmatrix} 2-4 & 3-4 & 7-4 \\ 11-17 & 14-17 & 26-17 \end{bmatrix} = \begin{bmatrix} -2 & -1 & 3 \\ -6 & -3 & 9 \end{bmatrix}$$
b. Covariance Matrix: Compute the covariance of the centered data:
$$\mathrm{Cov} = \frac{1}{n-1} \begin{bmatrix} (-2)^2 + (-1)^2 + 3^2 & (-2)(-6) + (-1)(-3) + (3)(9) \\ (-6)(-2) + (-3)(-1) + (9)(3) & (-6)^2 + (-3)^2 + 9^2 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 14 & 42 \\ 42 & 126 \end{bmatrix} = \begin{bmatrix} 7 & 21 \\ 21 & 63 \end{bmatrix}$$
c. Eigenvectors and Eigenvalues: Solve for the eigenvalues and eigenvectors of the
covariance matrix (these represent the principal components). You will then project the
data onto the eigenvectors to obtain the first principal component.
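These steps can be checked numerically with a short numpy sketch (variable names are illustrative):

```python
import numpy as np

# Each row is a variable (X, Y); each column is an observation
data = np.array([[2.0, 3.0, 7.0],
                 [11.0, 14.0, 26.0]])

centered = data - data.mean(axis=1, keepdims=True)     # a. mean centering
cov = centered @ centered.T / (centered.shape[1] - 1)  # b. covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)                 # c. eigendecomposition
pc1 = eigvecs[:, np.argmax(eigvals)]                   # direction of largest variance
scores = pc1 @ centered                                # projection onto the first PC

print(cov)      # [[ 7. 21.], [21. 63.]]
print(eigvals)  # one eigenvalue is ~0: here Y is an exact linear function of X
print(scores)
```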
4. Cross-validation assesses how well a model generalizes by training and testing it on different splits of the data. Common schemes include:
• K-Fold Cross-Validation: The dataset is divided into K subsets (folds). The model is trained on K-1 folds and tested on the remaining one. This process is repeated K times, with each fold being used as the test set once. The performance is averaged across all K trials.
• Stratified K-Fold: In stratified K-fold, the division is done so that each fold maintains
the proportion of each class, useful for classification problems with imbalanced data.
• Time-Series Cross-Validation: Used for time-series data, this method avoids random
shuffling of data. Instead, the model is trained on earlier data and tested on later data
to reflect the temporal structure.
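A brief scikit-learn sketch of these three schemes (the estimator and data below are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)

X = np.random.rand(100, 4)             # placeholder features
y = np.random.randint(0, 2, size=100)  # placeholder binary labels
model = LogisticRegression()

# K-Fold: train on K-1 folds, test on the remaining one, average the scores
print(cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True)).mean())

# Stratified K-Fold: each fold preserves the class proportions
print(cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5)).mean())

# Time-series split: always trains on earlier samples, tests on later ones
print(cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)).mean())
```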
Assignment 2
1. Types of Regression
Regression analysis is a statistical technique used to model the relationship between dependent and
independent variables. The different types of regression are:
• Linear Regression: It models the relationship between two variables by fitting a linear
equation to the observed data.
• Polynomial Regression: Used when the data is better modeled by a polynomial equation, i.e.,
a curve.
• Ridge Regression: A type of regularization used when data suffers from multicollinearity
(independent variables are highly correlated).
• Lasso Regression: Another regularization technique, but here, less important coefficients are
set to zero, effectively performing feature selection.
• ElasticNet Regression: Combines both ridge and lasso regression penalties for better
generalization.
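A compact scikit-learn sketch of these variants (synthetic data, illustrative hyperparameters):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(50, 3)  # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(50)

models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "ridge": Ridge(alpha=1.0),  # L2 penalty shrinks correlated coefficients
    "lasso": Lasso(alpha=0.1),  # L1 penalty can zero out coefficients
    "elasticnet": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mix of L1 and L2
}
for name, m in models.items():
    print(name, m.fit(X, y).score(X, y))
```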
2. Yes, decision trees can be used for regression tasks. In regression trees, the target variable is
continuous (as opposed to categorical in classification trees). The model predicts the value of the
dependent variable by learning simple decision rules inferred from the data.
The algorithm splits the data into different regions by minimizing the "variance" (sum of squared
differences between actual and predicted values) at each node. The leaf nodes represent the
predicted output (which is typically the average of the target values in that node).
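As a sketch, a regression tree in scikit-learn on a toy one-dimensional problem:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: y = x^2 plus noise
X = np.linspace(0, 4, 40).reshape(-1, 1)
y = X.ravel() ** 2 + 0.3 * np.random.randn(40)

# Splits minimize squared error (variance) within the child nodes;
# each leaf predicts the mean target value of the samples it contains.
tree = DecisionTreeRegressor(max_depth=3, criterion="squared_error")
tree.fit(X, y)
print(tree.predict([[1.5], [3.5]]))
```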
• Entropy: A measure of impurity or randomness in the data. In decision trees, it measures how mixed the classes are in a dataset. It’s calculated as:
$$\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
• Information Gain: The reduction in entropy achieved by splitting the set $S$ on attribute $A$:
$$\text{InformationGain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v)$$
• Building a Decision Tree: The decision tree starts by calculating the entropy of the target
variable and then calculates the information gain for each feature. The feature with the
highest information gain is selected for the first split. This process is repeated recursively until
the tree is built.
Example: Assume a dataset where we want to predict whether a student will pass or fail based
on their hours of study. If we split the data on the 'hours of study' attribute, we calculate the
entropy before and after the split. The split that reduces entropy the most is selected.
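A small numpy sketch of this calculation (the pass/fail labels, study hours, and split point below are made up for illustration):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical data: 1 = pass, 0 = fail; candidate split at hours >= 3
y = np.array([0, 0, 1, 0, 1, 1, 1, 1])
hours = np.array([1, 2, 2, 1, 4, 5, 3, 6])

parent = entropy(y)
left, right = y[hours < 3], y[hours >= 3]
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
print("information gain:", parent - weighted)  # ~0.55 bits
```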
• Overfitting: Decision trees tend to overfit, especially when they grow too deep. This means
the model performs well on training data but poorly on unseen data. This can be overcome
by:
o Pruning: Trimming branches that have little significance to improve the model's
generalization.
• Bias towards attributes with more levels: Attributes with many unique values (e.g., IDs) tend
to have higher information gain, even if they don’t truly provide significant insight. This can
be overcome by using Gain Ratio.
• Imbalanced datasets: Decision trees might perform poorly when one class dominates. This
can be addressed by resampling or adjusting the class weights.
First, sum all the values for GPA ($x_1$), months worked ($x_2$), and annual salary ($y$):
$$\sum x_1 = 2.8 + 3.4 + 3.2 + 3.8 + 3.2 + 3.4 + 4.0 + 2.6 + 3.2 + 3.8 = 33.4$$
$$\sum x_2 = 48 + 24 + 24 + 24 + 48 + 36 + 24 + 48 + 36 + 12 = 324$$
$$\sum x_1^2 = 2.8^2 + 3.4^2 + 3.2^2 + 3.8^2 + 3.2^2 + 3.4^2 + 4.0^2 + 2.6^2 + 3.2^2 + 3.8^2 = 113.32$$
Substituting these values (together with the corresponding sums involving $y$) into the normal equations and solving yields:
$$\beta_1 \approx 3218.09, \qquad \beta_2 \approx -139.63$$
So the final regression equation has the form:
$$\hat{y} = \beta_0 + 3218.09\, x_1 - 139.63\, x_2$$
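A numpy sketch of the same fit. The GPA and months-worked columns are taken from the sums above, but the salary column is not reproduced in this document, so y below is only a placeholder:

```python
import numpy as np

x1 = np.array([2.8, 3.4, 3.2, 3.8, 3.2, 3.4, 4.0, 2.6, 3.2, 3.8])  # GPA
x2 = np.array([48., 24, 24, 24, 48, 36, 24, 48, 36, 12])           # months worked
y = 20000 + 40000 * np.random.rand(10)  # placeholder salaries (not the real data)

# Design matrix with an intercept column; least squares solves the same
# normal equations used in the hand calculation above.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [beta0, beta1, beta2]
```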
Assignment 3
1. Mathematical Formulation of the SVM Problem
The objective of Support Vector Machine (SVM) is to find a hyperplane that maximally
separates two classes in the feature space. The formulation is as follows:
Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ are the feature vectors and $y_i \in \{-1, 1\}$ are the class labels, the goal is to find a hyperplane $w \cdot x + b = 0$ such that:
$$y_i (w \cdot x_i + b) \ge 1 \quad \forall i$$
This ensures that all data points are correctly classified, each with a margin of at least 1.
The optimization problem is to maximize the margin (the distance between the two classes), which is equivalent to minimizing $\frac{1}{2}\|w\|^2$ subject to the above constraints:
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 \;\; \forall i$$
For non-linearly separable data, we introduce slack variables $\xi_i \ge 0$ for each data point, and the optimization problem becomes:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \;\; \forall i$$
where $C$ is a penalty parameter controlling the trade-off between maximizing the margin and minimizing the classification error.
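As a sketch, the soft-margin problem can be solved with scikit-learn's SVC; C below is the same penalty parameter as in the formulation:

```python
import numpy as np
from sklearn.svm import SVC

# Two separable point clouds as a toy dataset
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)  # w and b of the separating hyperplane
print(clf.support_vectors_)       # the points that define the margin
```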
2. A base learner (or weak learner) is an individual model in ensemble learning that, on its own,
may not perform well (i.e., may have accuracy only slightly better than random guessing), but
when combined with other base learners, it can form a strong learner that performs
significantly better.
Desirable properties of base learners include:
• Accuracy: Base learners should be weak learners with some predictive power, without overfitting the data.
• Diversity: In ensemble methods like bagging or boosting, base learners should be
diverse (e.g., different models, different subsets of data).
• Scalability: Base learners should be fast and computationally efficient, as they are
trained multiple times in ensemble learning.
Examples of base learners include decision trees, k-NN, naive Bayes classifiers, etc.
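A short scikit-learn sketch using decision stumps as base learners (the `estimator` keyword assumes a recent scikit-learn version; older versions call it `base_estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Depth-1 trees ("stumps") are classic weak base learners
stump = DecisionTreeClassifier(max_depth=1)

bagging = BaggingClassifier(estimator=stump, n_estimators=50).fit(X, y)
boosting = AdaBoostClassifier(estimator=stump, n_estimators=50).fit(X, y)
print(bagging.score(X, y), boosting.score(X, y))
```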
3. The measure of dissimilarity in clustering quantifies how different two data points are from
each other. It is used to group similar data points together while separating dissimilar ones
into different clusters.
• Euclidean Distance: $d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$
• Manhattan Distance (L1 Norm): $d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$
• Cosine Distance: One minus the cosine similarity (the cosine of the angle between two vectors):
$$d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$
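These three measures are straightforward to implement; a minimal numpy sketch:

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
# 1.414..., 2.0, 0.5
```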
4. The K-means algorithm clusters data points by iteratively assigning them to one of 𝑘 clusters
based on their distance from the centroids of the clusters.
Algorithm Steps:
• Initialize: Select 𝑘 initial centroids randomly or using some heuristic.
• Assignment: Assign each point to the nearest centroid based on a distance metric
(usually Euclidean distance).
• Update: Recalculate the centroids as the mean of all points assigned to each cluster.
• Repeat: Continue assigning points and updating centroids until convergence (i.e., the
centroids do not change significantly).
Example:
We are given the points (1, 0, 1), (1, 1, 0), (0, 0, 1), and (1, 1, 1), with $k = 2$.
Step 1: Initialize centroids: choose two of the points as the initial centroids:
$C_1 = (1, 0, 1)$ and $C_2 = (1, 1, 0)$
Step 2: Assign points to the nearest centroid: Calculate the Euclidean distance between each point and the centroids:
• Point (1, 0, 1): distance 0 to $C_1$, so it belongs to $C_1$
• Point (1, 1, 0): distance 0 to $C_2$, so it belongs to $C_2$
• Point (0, 0, 1): distance 1 to $C_1$ versus $\sqrt{3}$ to $C_2$, so closer to $C_1$
• Point (1, 1, 1): distance 1 to both centroids; the tie is broken in favor of $C_1$
Step 3: Update centroids:
• New centroid $C_1$ = mean of {(1, 0, 1), (0, 0, 1), (1, 1, 1)} = (0.67, 0.33, 1)
• New centroid $C_2$ = mean of {(1, 1, 0)} = (1, 1, 0)
Step 4: Repeat until convergence (re-assigning points with the new centroids leaves the clusters unchanged, so we stop here).
Final clusters: Cluster 1 = {(1, 0, 1), (0, 0, 1), (1, 1, 1)} and Cluster 2 = {(1, 1, 0)}.
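The worked example can be reproduced with scikit-learn by fixing the same initial centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
init = np.array([[1, 0, 1], [1, 1, 0]], dtype=float)  # same C1, C2 as above

km = KMeans(n_clusters=2, init=init, n_init=1).fit(points)
print(km.labels_)           # [0 1 0 0] -> three points with C1, one with C2
print(km.cluster_centers_)  # [[0.67 0.33 1.], [1. 1. 0.]]
```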
5. Hierarchical (agglomerative) clustering builds a hierarchy of clusters by repeatedly merging the closest pair of clusters.
Algorithm Steps:
• Start: Each data point starts in its own cluster.
• Merge: At each step, the two clusters with the smallest dissimilarity are merged into
a single cluster.
• Repeat: Continue merging until all points are in a single cluster.
Dissimilarity measures between clusters (linkage criteria):
• Single Linkage: Distance between the closest pair of points in two clusters.
• Complete Linkage: Distance between the farthest pair of points in two clusters.
• Average Linkage: Average distance between all pairs of points from the two clusters.
The result is typically visualized using a dendrogram, which shows the hierarchical
relationships between clusters.
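A brief scipy sketch that builds and plots such a dendrogram for the four points from the K-means example:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)

# method can be 'single', 'complete', or 'average', matching the
# linkage criteria described above
Z = linkage(points, method="single")
dendrogram(Z)
plt.show()
```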
Assignment 4
1. An Artificial Neural Network (ANN) consists of different layers, each serving a unique function.
These layers are:
• Input Layer: Receives the initial input data (features) for processing. Each node in the
input layer represents a feature in the data.
• Hidden Layer(s): These layers perform computations and extract features from the
input data. They process inputs using weights, biases, and activation functions. There
can be multiple hidden layers, and networks with more hidden layers are called "deep
networks."
• Output Layer: Produces the final output, typically representing predictions,
classifications, or regression results. The number of nodes depends on the number of
output classes or values.
2. An activation function determines whether a neuron should be activated or not based on its
weighted input. It introduces non-linearity to the model, allowing it to learn complex patterns.
Without activation functions, a neural network would essentially be a linear regression model.
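A tiny numpy forward pass illustrating the three layer types and the role of activation functions (the sizes and random weights are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer: 4 units -> 1 output

x = np.array([0.5, -1.2, 3.0])  # input layer: one feature vector
h = np.tanh(W1 @ x + b1)        # hidden layer with non-linear activation
y = sigmoid(W2 @ h + b2)        # output layer, squashed to (0, 1)
print(y)
```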
3. To implement the XOR logic function, we can use a two-layer perceptron network because
XOR is not linearly separable. Below is a breakdown of the perceptron setup:
• Input Layer: Two inputs, A and B.
• Hidden Layer: Two neurons:
o Neuron 1 performs an AND operation: $N_1 = A \cdot B$.
o Neuron 2 performs an OR operation: $N_2 = A + B - (A \cdot B)$.
• Output Layer: One neuron combines the hidden outputs as $N_2$ AND NOT $N_1$, i.e., XOR $=$ (A OR B) AND NOT (A AND B).
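This construction can be verified with threshold perceptrons; a minimal numpy sketch (weights and biases chosen by hand):

```python
import numpy as np

def perceptron(x, w, b):
    # Threshold unit: fires 1 if w.x + b > 0
    return int(np.dot(w, x) + b > 0)

def xor(a, b):
    n1 = perceptron([a, b], w=[1, 1], b=-1.5)       # AND neuron
    n2 = perceptron([a, b], w=[1, 1], b=-0.5)       # OR neuron
    return perceptron([n1, n2], w=[-2, 1], b=-0.5)  # N2 AND NOT N1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))  # prints 0, 1, 1, 0
```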
4. Deep Learning is a subfield of machine learning that involves neural networks with multiple
layers (often called deep neural networks). It excels at learning representations from raw data
(images, text, etc.) without requiring feature engineering. Deep learning has enabled
breakthroughs in many complex tasks that were previously infeasible.
5. Given a neuron with input values $x_0 = 3.5$, $x_1 = 2.9$, and $x_2 = 1.2$, and weights $w_0 = 0.89$, $w_1 = -2.07$, and $w_2 = 0.08$, we first calculate the weighted sum $z$ (assuming the bias $b = 0.5$):
$$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + b = (0.89)(3.5) + (-2.07)(2.9) + (0.08)(1.2) + 0.5 = 3.115 - 6.003 + 0.096 + 0.5 = -2.292$$
ii. Sigmoid Function: The sigmoid function is $\sigma(z) = \frac{1}{1 + e^{-z}}$.
$$y = \sigma(-2.292) = \frac{1}{1 + e^{2.292}} \approx \frac{1}{1 + 9.89} \approx 0.092$$
iii. Hyperbolic Tangent (Tanh) Function: The tanh function is $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$.
$$y = \tanh(-2.292) \approx -0.980$$
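These values can be double-checked with numpy:

```python
import numpy as np

x = np.array([3.5, 2.9, 1.2])
w = np.array([0.89, -2.07, 0.08])
b = 0.5

z = w @ x + b                   # weighted sum: -2.292
print(z, 1 / (1 + np.exp(-z)))  # sigmoid output ~0.0918
print(np.tanh(z))               # tanh output ~-0.9798
```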