MACHINE LEARNING LAB
(CIE-421P)
Faculty Name: Prof. (Dr.) Sachin Gupta
Student Name: Ansh Kaushik
Roll No. : 13414802721
Semester : 7th
Group : CSE-FSD-I
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
VISION
To attain global excellence through education, innovation, research, and work ethics with the commitment to
serve humanity.
MISSION
M1. To promote diversification by adopting advancements in science, technology, management, and allied disciplines through continuous learning.
M2. To foster moral values in students and equip them for developing sustainable solutions to serve both
national and global needs in society and industry.
M3. To digitize educational resources and processes for enhanced teaching and effective learning.
Department of Computer Science and Engineering
VISION
To establish a center of excellence promoting Information Technology related education and research for
preparing technocrats and entrepreneurs with ethical values.
MISSION
M1: To excel in the field by imparting quality education and skills for software development and
applications.
M2: To establish a conducive environment that promotes intellectual growth and research.
M3: To facilitate students to acquire entrepreneurial skills for innovation and product development.
M4: To encourage students to participate in competitive events and industry interaction with a focus on
continuous learning.
1. CO-PO Mapping
Course Objectives:
1. To understand the need for machine learning.
2. To learn about regression and feature selection.
3. To understand classification algorithms.
4. To learn clustering algorithms.
Course Outcomes (CO)
CO 1: To formulate machine learning problems.
CO 2: To learn about regression and feature selection techniques.
CO 3: To apply machine learning techniques such as classification to practical applications.
CO 4: To apply clustering algorithms.
Course Outcomes (CO) to Programme Outcomes (PO) mapping (scale: 1 = Low, 2 = Medium, 3 = High)
PO01 PO02 PO03 PO04 PO05 PO06 PO07 PO08 PO09 PO10 PO11 PO12
CO 1 3 3 3 3 3 2 2 - - - - 2
CO 2 3 3 3 3 3 2 2 - - - - 2
CO 3 3 3 3 3 3 2 2 - - - - 2
CO 4 3 3 3 3 3 2 2 - - - - 2
Rubrics Evaluation
MACHINE LEARNING LAB
PRACTICAL RECORD
PRACTICAL DETAILS
1. Introduction to JUPYTER IDE and its libraries Pandas and NumPy
2. Program to demonstrate Simple Linear Regression
3. Program to demonstrate Logistic Regression
4. Program to demonstrate Decision Tree – ID3 Algorithm
5. Program to demonstrate k-Nearest Neighbor flowers classification
6. Program to demonstrate Naïve Bayes Classifier
7. Program to demonstrate PCA and LDA on Iris dataset
8. Program to demonstrate DBSCAN clustering algorithm
9. Program to demonstrate K-Medoid clustering algorithm
10. Program to demonstrate K-Means Clustering Algorithm on Handwritten Dataset

Additional exercises:
8. Explain the difference between the two approaches (PCA and LDA)
Experiment – 1
AIM: Introduction to JUPYTER IDE and its libraries Pandas and NumPy
THEORY:
1. Jupyter Notebook:
Jupyter Notebook is an interactive computing environment that allows for the creation and sharing of
documents containing live code, equations, visualizations, and narrative text.
It supports various programming languages, with Python being the most commonly used.
Jupyter Notebooks are composed of cells, which can contain code, Markdown text, equations, or raw text. This allows for a combination of code execution, data visualization, and documentation within a single document.
The ability to execute code interactively and visualize results immediately makes Jupyter Notebooks
an ideal tool for data exploration, analysis, and sharing of research findings.
2. Pandas:
Pandas is a Python library built on top of NumPy that provides high-performance, easy-to-use data
structures and data analysis tools.
The primary data structures in Pandas are Series (one-dimensional labeled arrays) and DataFrame
(two-dimensional labeled data structures with columns of potentially different types).
Pandas excels at handling structured data, such as tabular data, time series, and heterogeneously-typed
data.
It offers functionalities for data manipulation, including indexing, slicing, merging, reshaping, and
pivoting, as well as data cleaning, transformation, and analysis.
Pandas' integration with other libraries like Matplotlib and Seaborn makes it convenient for data
visualization and exploration.
3. NumPy:
NumPy (Numerical Python) is a fundamental package for numerical computing in Python.
Its core feature is the ndarray, a multi-dimensional array object that provides efficient storage and
manipulation of homogeneous data.
NumPy arrays support various mathematical operations and broadcasting, enabling vectorized
computations and efficient handling of large datasets.
It provides a wide range of mathematical functions for linear algebra, Fourier analysis, random
number generation, and more.
NumPy arrays are the building blocks for many other libraries in the Python scientific ecosystem,
including Pandas, Matplotlib, and SciPy.
CODE:
import pandas as pd
import numpy as np

# Creating a NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Creating a DataFrame from the NumPy array (column names are assumed; the filter below uses column 'A')
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df, "\n")

# Filtering rows based on a condition
print(df[df['A'] > 3], "\n")
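To complement the DataFrame operations above, a small illustrative NumPy snippet (not part of the original record) showing vectorized arithmetic and broadcasting on the same array:

# Vectorized arithmetic: the operation is applied element-wise, with no explicit loop
print(data * 2, "\n")

# Broadcasting: the 1-D array [10, 20, 30] is stretched across every row of `data`
print(data + np.array([10, 20, 30]), "\n")

# Aggregation along an axis: column-wise means
print(data.mean(axis=0), "\n")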
OUTPUT:
Experiment – 2
AIM: Program to demonstrate Simple Linear Regression
THEORY:
Simple Linear Regression is a linear regression model with a single explanatory variable (independent variable) to
predict the value of a dependent variable. The relationship between the two variables is assumed to be linear, i.e., a
straight line.
The equation of a simple linear regression model is given by:

y = β0 + β1·x + ε

where β0 is the intercept, β1 is the slope (the coefficient of the explanatory variable), and ε represents the error term (the difference between the observed and predicted values).
CODE:
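Only the parameter-printing lines of this program were captured; a minimal sketch of the steps that would precede them, assuming scikit-learn's LinearRegression and illustrative data rather than the original dataset:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: a roughly linear relationship with a little noise
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9, 12.2])

# Fit the simple linear regression model y = beta0 + beta1 * x
model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_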
# Displaying the model parameters
print("Intercept (beta0):", intercept)
print("Slope (beta1):", slope)
OUTPUT:
Experiment – 3
AIM: Program to demonstrate Logistic Regression
THEORY:
Logistic Regression is a supervised learning algorithm used for binary classification tasks. It models the probability
that a given input belongs to a particular class. The logistic function (sigmoid function) is used to map the output of
the linear combination of input features to a probability score between 0 and 1.
The logistic (sigmoid) function is defined as:

σ(z) = 1 / (1 + e^(−z))

where z is the linear combination of the input features and σ(z) is the resulting probability score between 0 and 1.
CODE:
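The data-loading and model-fitting part of the program was not captured; a plausible preamble for the plotting code below, assuming the Iris dataset with its first two features and a Setosa vs. non-Setosa target:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load Iris and keep two features so the decision boundary can be drawn in 2-D
iris = load_iris()
X = iris.data[:, :2]                 # sepal length, sepal width
y = (iris.target == 0).astype(int)   # 1 = Setosa, 0 = non-Setosa

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Mesh grid over the feature space, used to draw the decision regions
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))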
# Plotting the decision boundary and the data points
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Logistic Regression for Iris Dataset (Setosa vs. Non-Setosa)')
plt.show()
OUTPUT:
Experiment – 4
AIM: Program to demonstrate Decision Tree – ID3 Algorithm
THEORY:
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used for building classification
models. It recursively selects the best attribute to split the data based on Information Gain, a measure of the reduction
in entropy (or increase in purity) achieved by splitting the data on a particular attribute.
The steps of the ID3 algorithm are as follows:
1. Select the best attribute: Iterate over all attributes and calculate the Information Gain for each attribute. Choose the attribute with the highest Information Gain as the best attribute to split the data (a small worked sketch of this calculation is given after this list).
2. Split the data: Split the data into subsets based on the values of the chosen attribute.
3. Repeat recursively: Recursively apply steps 1 and 2 to each subset until one of the following
conditions is met:
All instances in the subset belong to the same class.
There are no more attributes left to split.
Stopping criteria (e.g., maximum depth or minimum number of instances) are met.
4. Build the decision tree: Build the decision tree by assigning each attribute as a node and recursively
adding child nodes based on the splits.
5. Prune the tree (optional): Pruning is the process of removing sections of the tree that are not
statistically significant to reduce overfitting and improve generalization.
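As noted in step 1, the attribute-selection step can be made concrete with a short, self-contained entropy and Information Gain calculation (the labels and attribute values below are illustrative, not part of the original program):

import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over each value v of attribute A
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Illustrative example: 9 'Yes' / 5 'No' labels split by a two-valued attribute
labels = np.array(['Yes'] * 9 + ['No'] * 5)
attribute = np.array(['A'] * 8 + ['B'] * 6)
print("Entropy of the full set:", entropy(labels))
print("Information Gain of the split:", information_gain(labels, attribute))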
CODE:
import numpy as np
# If there are no more attributes left, return a leaf node with the majority class
if len(attribute_names) == 0:
    return {'class': np.bincount(y).argmax()}
# Example usage
X = np.array([
['Sunny', 'Hot', 'High', 'Weak'],
['Sunny', 'Hot', 'High', 'Strong'],
['Overcast', 'Hot', 'High', 'Weak'],
['Rain', 'Mild', 'High', 'Weak'],
['Rain', 'Cool', 'Normal', 'Weak'],
['Rain', 'Cool', 'Normal', 'Strong'],
['Overcast', 'Cool', 'Normal', 'Strong'],
['Sunny', 'Mild', 'High', 'Weak'],
['Sunny', 'Cool', 'Normal', 'Weak'],
['Rain', 'Mild', 'Normal', 'Weak'],
['Sunny', 'Mild', 'Normal', 'Strong'],
['Overcast', 'Mild', 'High', 'Strong'],
['Overcast', 'Hot', 'Normal', 'Weak'],
['Rain', 'Mild', 'High', 'Strong']
])
y = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'])
# Example usage (the two definitions below are assumed; they were not captured in this copy)
attribute_names = ['Outlook', 'Temperature', 'Humidity', 'Wind']   # hypothetical names for the four columns
_, y_int = np.unique(y, return_inverse=True)                       # integer-encode the 'No'/'Yes' labels
tree = id3(X, y_int, attribute_names)
print(tree)
def print_tree(node, depth=0):
    if 'attribute' in node:
        print(' ' * depth + node['attribute'] + ' ?')
        for value, subtree in node['subsets'].items():
            print(' ' * (depth + 1) + value + ' :')
            print_tree(subtree, depth + 2)
    else:
        print(' ' * (depth + 1) + 'Class:', node['class'])
        print("\n")
# Example usage
print_tree(tree)
OUTPUT:
Experiment – 5
AIM: Program to demonstrate k-Nearest Neighbor flowers classification
THEORY:
The k-NN algorithm is a simple and intuitive algorithm used for classification and regression tasks. It works by finding the k closest data points in the training set to a new data point that needs to be classified or predicted.
For classification tasks, the algorithm assigns the new data point to the class that is most common among its k nearest neighbors. For regression tasks, it predicts the target value as the mean or median of the target values of the k nearest neighbors.
The algorithm requires:
1. A distance metric (e.g., Euclidean distance) to calculate the distances between data points.
2. The value of k, which is the number of nearest neighbors to consider.
3. A method for assigning class labels (classification) or predicting target values (regression) based on the k nearest neighbors.
The choice of k and the distance metric can significantly impact the performance of the algorithm. k-NN is a non-
parametric, lazy learning algorithm that does not build a model during training but instead stores the entire training
data and performs computations when a new data point needs to be classified or predicted.
CODE:
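The program itself was not captured in this copy of the record; a minimal sketch of a k-Nearest Neighbor flower classification, assuming scikit-learn's KNeighborsClassifier on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# k = 3 nearest neighbors with the default Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict the species of the held-out flowers and report the accuracy
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))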
OUTPUT:
Experiment – 6
AIM: Program to demonstrate Naïve Bayes Classifier
THEORY:
The Naïve Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem. It assumes that
features are conditionally independent given the class label, which is why it's called "naïve". Despite this simplifying
assumption, Naïve Bayes classifiers often perform well in practice, especially for text classification tasks.
Bayes' theorem states:

P(A|B) = P(B|A) · P(A) / P(B)

where:
P(A|B) is the probability of event A occurring given that event B has occurred (the posterior probability)
P(B|A) is the probability of event B occurring given that event A has occurred (the likelihood)
P(A) is the probability of event A occurring (the prior probability)
P(B) is the probability of event B occurring (the evidence)
CODE:
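The program was not captured here; a minimal sketch assuming scikit-learn's GaussianNB on the Iris dataset (an illustrative choice, not necessarily the original one):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the data and split it into training and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Gaussian Naïve Bayes: features are treated as conditionally independent, Gaussian per class
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))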
OUTPUT:
Experiment – 7
AIM: Program to demonstrate PCA and LDA on Iris dataset
THEORY:
Principal Component Analysis (PCA): PCA is an unsupervised dimensionality reduction technique that is used to
transform a high-dimensional dataset into a lower-dimensional space while retaining as much of the original variance
as possible. It does this by finding the directions (principal components) that maximize the variance in the data.
PCA is useful for data visualization, noise reduction, and feature extraction.
Linear Discriminant Analysis (LDA): LDA is a supervised dimensionality reduction technique that is used for classification problems. It finds the directions (linear discriminants) that maximize the separation between different classes while minimizing the variance within each class.
LDA is useful for classification tasks, where the goal is to find the directions that best separate the classes.
CODE:
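The lines that load and standardize the data were not captured; a plausible preamble for the code below, assuming the Iris dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris dataset and standardize the features
iris = load_iris()
X, y = iris.data, iris.target
X_std = StandardScaler().fit_transform(X)

plt.figure(figsize=(12, 5))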
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_std, y)
plt.subplot(1, 2, 1)
for target, label in zip(range(3), ['Setosa', 'Versicolor', 'Virginica']):
    plt.scatter(X_pca[y == target, 0], X_pca[y == target, 1], label=label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris dataset')
plt.legend()
plt.subplot(1, 2, 2)
for target, label in zip(range(3), ['Setosa', 'Versicolor', 'Virginica']):
    plt.scatter(X_lda[y == target, 0], X_lda[y == target, 1], label=label)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title('LDA on Iris dataset')
plt.legend()
plt.tight_layout()
plt.show()
OUTPUT:
Experiment – 8
AIM: Program to demonstrate DBSCAN clustering algorithm
THEORY:
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is a density-based clustering
algorithm that can discover clusters of arbitrary shape and identify noise or outlier data points in a dataset. It groups
together data points that are closely packed (high density), while marking outliers as noise.
The algorithm has two parameters:
1. eps (epsilon): The maximum distance between two points for them to be considered neighbors.
2. min_samples: The minimum number of points required to form a dense region or cluster.
The algorithm proceeds as follows:
1. For each point p in the dataset, find the number of points that are within a distance eps from p. These points are called the neighbors of p.
2. If the number of neighbors of p is greater than or equal to min_samples, then p is considered a core point,
and a new cluster is formed with p and its neighbors.
3. If the number of neighbors of p is less than min_samples, then p is considered a border point if it is a
neighbor of a core point, or a noise point otherwise.
4. All points that are reachable from a core point through a chain of core points and border points are added to
the same cluster.
5. The process continues until all points in the dataset are either assigned to a cluster or marked as noise.
CODE:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data (illustrative parameters; the original values were not captured)
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)

# Standardize the features
X = StandardScaler().fit_transform(X)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Plot result
import matplotlib.pyplot as plt
%matplotlib inline
class_member_mask = (labels == k)
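The line above is a fragment of the plotting loop, which was only partly captured; the full loop, following the usual DBSCAN plotting pattern (an assumption, not necessarily the original code), would read:

# Loop over the cluster labels (-1 marks noise) and plot core and border points
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]   # noise points are drawn in black
    class_member_mask = (labels == k)

    # Core points of this cluster: larger markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=10)

    # Border points of this cluster: smaller markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=5)

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()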
OUTPUT:
Experiment – 9
AIM: Program to demonstrate K-Medoid clustering algorithm
THEORY:
The K-Medoids clustering algorithm is a partitioning technique similar to the K-Means algorithm but is more robust to
outliers and noise in the data. Instead of using the mean of the data points as the cluster centroid, K-Medoids uses
actual data points as the cluster centers, called medoids. This makes K-Medoids more robust to outliers since it
minimizes the sum of pairwise dissimilarities instead of minimizing the squared Euclidean distances.
The most commonly used algorithm for K-Medoids is Partitioning Around Medoids (PAM), which is available as the KMedoids estimator in the scikit-learn-extra package (sklearn_extra.cluster.KMedoids).
CODE:
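Only the clustering call survived extraction; a plausible preamble, assuming synthetic blob data and the KMedoids implementation from the scikit-learn-extra package:

from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids

# Illustrative data: four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)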
# Applying K-Medoids
kmedoids = KMedoids(n_clusters=4, random_state=42)
y_kmedoids = kmedoids.fit_predict(X)
OUTPUT:
Experiment – 10
AIM: Program to demonstrate K-Means Clustering Algorithm on Handwritten Dataset
THEORY:
K-Means clustering is an unsupervised machine learning algorithm used to partition a dataset into K distinct clusters.
The algorithm aims to find the centroids (means) of the clusters and assign each data point to the cluster with the
nearest centroid.
The objective function of K-Means is to minimize the sum of squared distances between each data point and its assigned centroid, also known as the inertia or within-cluster sum of squares (WCSS).
The number of clusters, K, is a hyperparameter that needs to be specified beforehand. There are various techniques to determine the optimal value of K, such as the elbow method, silhouette analysis, or the gap statistic.
CODE:
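The data-loading part of the program was not captured; a plausible preamble, assuming scikit-learn's handwritten digits dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load the 8x8 handwritten digits and scale the pixel features
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target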
# Apply K-Means
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)
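The plotting code between the clustering call and the tight_layout()/show() calls below was not captured; one plausible visualization (rendering each cluster centre as an 8x8 image is an assumption, and the centres live in the scaled feature space):

# Show each of the 10 cluster centres as an 8x8 image
fig, axes = plt.subplots(2, 5, figsize=(8, 3))
for ax, center in zip(axes.ravel(), kmeans.cluster_centers_):
    ax.imshow(center.reshape(8, 8), cmap='gray')
    ax.axis('off')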
plt.tight_layout()
plt.show()
OUTPUT:
Experiment – 1
AIM: Introduction to Google Colab
Google Colab is a free cloud-based platform provided by Google that allows you to write and execute
Python code in a Jupyter notebook-like environment. It's a great tool for experimenting with machine
learning models, data analysis, and collaborative coding. Here's a simple example to get you started with
Google Colab:
1. Accessing Google Colab: Go to Google Colab and sign in with your Google account. You will be redirected to the Colab dashboard, where you can create new notebooks or open existing ones.
2. Creating a New Notebook: Click on "New Notebook" to create a new notebook. You can also upload an existing notebook or open a notebook from Google Drive or GitHub.
3. Running Code: In a code cell, you can write and execute Python code. For example, you can print "Hello, World!" by typing print("Hello, World!") in a cell and pressing Shift+Enter to execute it. Colab provides access to a wide range of libraries and frameworks, including NumPy, pandas, TensorFlow, and PyTorch.
4. Using Markdown Cells: You can add text, headings, and formatted content using Markdown cells. Simply change the cell type to "Markdown" and start typing your Markdown content.
5. Saving and Sharing Notebooks: Colab automatically saves your notebook to Google Drive. You can also download it as a .ipynb file or save it to GitHub. You can share your notebook with others by clicking the "Share" button in the top right corner, either with specific people or with anyone who has the link.
CODE:
import numpy as np
import matplotlib.pyplot as plt
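Only the imports survived here; given that NumPy and Matplotlib are imported, a simple illustrative plot (an assumption about what the original demo showed) could be:

# Plot a sine curve as a quick check that the Colab environment works
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('A simple plot in Google Colab')
plt.show()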
OUTPUT:
Experiment – 2
AIM: A program for linear regression model using scikit-learn but no machine learning
CODE:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
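The data-generation and fitting steps were not captured; a plausible middle section, using illustrative column-vector data (which matches the coef_[0][0] and intercept_[0] indexing below):

# Illustrative data shaped as column vectors, so coef_ is 2-D and intercept_ is 1-D
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([[2], [4], [5], [4], [6]])

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)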
# Get the slope (coefficient) and intercept of the linear regression line
slope = model.coef_[0][0]
intercept = model.intercept_[0]
OUTPUT:
Experiment – 3
AIM: A program for linear regression without using scikit-learn or machine learning libraries.
CODE:
import numpy as np
import matplotlib.pyplot as plt
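The data definition was not captured; illustrative one-dimensional arrays (an assumption) for the OLS formulas below to operate on:

# Illustrative data: a roughly linear relationship
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.0, 4.1, 6.0, 8.2, 9.9, 12.1, 14.0, 16.3, 18.1, 19.8])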
# Calculate the slope (coefficient) and intercept of the linear regression line using ordinary least squares (OLS)
X_mean = np.mean(X)
y_mean = np.mean(y)
numerator = np.sum((X - X_mean) * (y - y_mean))
denominator = np.sum((X - X_mean) ** 2)
slope = numerator / denominator
intercept = y_mean - slope * X_mean
OUTPUT:
Experiment – 4
AIM: Create a decision tree without using the sklearn library, using only the concept of entropy to choose which node will be the root and the subsequent decision parameters based on information gain. Show all calculations.
CODE:
import numpy as np
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature        # Feature index used for the split
        self.threshold = threshold    # Threshold value for binary splitting
        self.left = left              # Left child node
        self.right = right            # Right child node
        self.value = value            # Value (class) for leaf nodes

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return -np.sum(probabilities * np.log2(probabilities + 1e-10))
right_subtree = build_tree(X[right_mask], y[right_mask], max_depth - 1 if max_depth else None)
return Node(feature=feature, threshold=threshold, left=left_subtree, right=right_subtree)
# Example dataset
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1],
[1, 1]])
y = np.array([0, 0, 1, 1, 0])
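The call that builds the tree and the predict helper used below were not captured; a minimal, compatible sketch (assuming splits of the form x[feature] <= threshold, consistent with the Node fields above, and an illustrative depth limit):

# Build the tree from the example data (build_tree is the partially shown function above)
tree = build_tree(X, y, 3)

def predict_one(node, x):
    # Walk down the tree until a leaf node (one that stores a class value) is reached
    if node.value is not None:
        return node.value
    if x[node.feature] <= node.threshold:
        return predict_one(node.left, x)
    return predict_one(node.right, x)

def predict(tree, X):
    return np.array([predict_one(tree, x) for x in X])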
# Predictions
print("Predictions:", predict(tree, X))
OUTPUT:
Experiment – 5
AIM: Show how decision trees can be used for both classification and regression using a program with a sample dataset. Also visualize the decision trees obtained.
CODE:
Classification Example:
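The imports and fitting steps for the classification part were not captured; a plausible preamble for the plot_tree call below, assuming the Iris dataset restricted to its last two features (matching feature_names[2:]):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Use petal length and petal width (features 2 and 3) of the Iris dataset
iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

# Fit a small classification tree
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X, y)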
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names[2:],
class_names=iris.target_names)
plt.show()
Regression Example:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
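The data and fitting steps of the regression example were not captured; an illustrative completion (noisy sine-curve data is an assumption):

# Illustrative regression data: a noisy sine curve
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Fit a small regression tree
reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X, y)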
plt.figure(figsize=(12, 8))
plot_tree(reg, filled=True)
plt.show()
Experiment – 6
AIM: Write your own implementation of k-NN with 4 different distance metrics and check it on the Iris dataset.
CODE:
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
class KNN:
    def __init__(self, k=3, metric='euclidean'):
        self.k = k
        self.metric = metric
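The fit method, the distance computation for the four metrics, and the start of predict were not captured; a compatible sketch (the metric names 'euclidean', 'manhattan', 'chebyshev' and 'minkowski' are assumptions):

    def fit(self, X, y):
        # Lazy learner: simply store the training data
        self.X_train = X
        self.y_train = y

    def _distance(self, a, b):
        # The four supported distance metrics
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((a - b) ** 2))
        if self.metric == 'manhattan':
            return np.sum(np.abs(a - b))
        if self.metric == 'chebyshev':
            return np.max(np.abs(a - b))
        if self.metric == 'minkowski':   # order-3 Minkowski distance
            return np.sum(np.abs(a - b) ** 3) ** (1 / 3)
        raise ValueError('Unknown metric: ' + self.metric)

    def predict(self, X):
        predictions = []
        for x in X:
            distances = []
            for i in range(len(self.X_train)):
                distance = self._distance(x, self.X_train[i])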
                distances.append((distance, self.y_train[i]))
            distances.sort(key=lambda x: x[0])
            neighbors = distances[:self.k]
            labels = [neighbor[1] for neighbor in neighbors]
            prediction = Counter(labels).most_common(1)[0][0]
            predictions.append(prediction)
        return predictions
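The evaluation code was not captured; given the imports at the top (load_iris, train_test_split, accuracy_score), a plausible ending that checks each metric on the Iris dataset:

# Evaluate the custom k-NN with each distance metric on the Iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

for metric in ['euclidean', 'manhattan', 'chebyshev', 'minkowski']:
    knn = KNN(k=3, metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(metric, 'accuracy:', accuracy_score(y_test, y_pred))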
OUTPUT:
Experiment – 7
AIM: Apply your own implementation of k-NN to any instance of the Pima Indians Diabetes Database from Kaggle or GitHub, and submit notebooks for all 3 exercises.
CODE:
import csv
import math
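Only the imports survived in this copy; a from-scratch sketch in the same spirit is given below. The file name 'diabetes.csv', the presence of a header row, and the class label being in the last column are assumptions about the downloaded dataset.

# Load the Pima Indians Diabetes data from a local CSV file
def load_csv(filename):
    dataset = []
    with open(filename) as f:
        reader = csv.reader(f)
        next(reader)                      # skip the header row
        for row in reader:
            dataset.append([float(v) for v in row])
    return dataset

def euclidean(a, b):
    # Distance over the feature columns only (the last column is the class label)
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(len(a) - 1)))

def knn_predict(train, test_row, k=5):
    # Majority vote among the k nearest training rows
    neighbors = sorted(train, key=lambda row: euclidean(row, test_row))[:k]
    labels = [row[-1] for row in neighbors]
    return max(set(labels), key=labels.count)

dataset = load_csv('diabetes.csv')        # hypothetical local path
split = int(0.7 * len(dataset))
train, test = dataset[:split], dataset[split:]

correct = sum(1 for row in test if knn_predict(train, row) == row[-1])
print('Accuracy:', correct / len(test))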
OUTPUT:
Experiment – 8
AIM: Explain the difference between the two approaches (PCA and LDA).
THEORY:
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are both dimensionality reduction techniques used in machine learning, but they differ in their underlying principles and the way they approach the problem.
Principal Component Analysis (PCA): PCA is an unsupervised technique that aims to find the directions (principal components) that maximize the variance in the data. It does this by creating new uncorrelated features (principal components) that are linear combinations of the original features. These new features are ordered by the amount of variance they capture in the data.
1. Unsupervised: PCA is an unsupervised technique, meaning it does not consider the target variable
or class labels during the transformation.
2. Maximize Variance: PCA seeks to find the directions that capture the maximum variance in the
data, regardless of the class labels.
3. Dimensionality Reduction: PCA can be used for dimensionality reduction by selecting the top principal components that capture most of the variance in the data while discarding the less important components.
4. Feature Extraction: PCA creates new features (principal components) that are linear combinations
of the original features.
Linear Discriminant Analysis (LDA): LDA, on the other hand, is a supervised technique that aims to find
the directions (linear discriminants) that maximize the separation between classes or categories in the data. It
does this by creating new features (linear discriminants) that are linear combinations of the original features,
but unlike PCA, it takes into account the class labels during the transformation.
1. Supervised: LDA is a supervised technique that considers the target variable or class labels during
the transformation.
2. Maximize Class Separability: LDA seeks to find the directions that maximize the separability
between classes or categories in the data.
3. Dimensionality Reduction: LDA can also be used for dimensionality reduction by selecting the top
linear discriminants that capture most of the class separability.
4. Feature Extraction: LDA creates new features (linear discriminants) that are linear combinations of
the original features, optimized for class separation.
The main difference between PCA and LDA lies in their objectives. PCA aims to maximize the variance in the data, regardless of the class labels, while LDA aims to maximize the separability between classes or categories.
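The difference in supervision is also visible directly in the scikit-learn API: PCA is fitted on the features alone, while LDA additionally requires the class labels. A small illustrative sketch:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised: PCA looks only at X and maximizes variance
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA also uses the labels y and maximizes class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)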
Experiment – 9
AIM: Use the Colab examples given in the classroom group for applying PCA and LDA to different datasets, and share an EXPLANATION of how you interpret the results.
CODE:
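The imports and data loading were not captured; a plausible preamble, assuming the Iris dataset (any labelled dataset would work the same way):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris dataset
X, y = load_iris(return_X_y=True)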
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2) # Specify number of components to keep
X_pca = pca.fit_transform(X_scaled)
# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
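The first subplot (the PCA projection) was not captured; a matching sketch:

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.title('PCA')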
plt.subplot(1, 2, 2)
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis')
plt.title('LDA')
plt.show()
OUTPUT:
Experiment – 10
AIM: Compare K-means clustering and DBSCAN for IRIS based on performance metrics. Which approach
is better? Why?
CODE:
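The imports and data loading were not captured; a plausible preamble, assuming the Iris dataset and the metrics referenced in the print statements below:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score, completeness_score

# Load the Iris dataset (y holds the true species, used only for the completeness score)
X, y = load_iris(return_X_y=True)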
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# K-means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
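The metric computations themselves were also missing; assuming the imports shown in the sketch above, they could be computed as follows (for DBSCAN, the silhouette and Davies-Bouldin scores are only defined when more than one label is present):

# Internal metrics (computed from the data and the cluster labels only)
silhouette_kmeans = silhouette_score(X_scaled, kmeans_labels)
davies_bouldin_kmeans = davies_bouldin_score(X_scaled, kmeans_labels)
silhouette_dbscan = silhouette_score(X_scaled, dbscan_labels)
davies_bouldin_dbscan = davies_bouldin_score(X_scaled, dbscan_labels)

# External metric: completeness compares the cluster labels with the true Iris species
completeness_kmeans = completeness_score(y, kmeans_labels)
completeness_dbscan = completeness_score(y, dbscan_labels)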
print("K-means clustering:")
print("Silhouette score:", silhouette_kmeans)
print("Davies-Bouldin index:", davies_bouldin_kmeans)
print("Completeness score:", completeness_kmeans)
print("\nDBSCAN clustering:")
print("Silhouette score:", silhouette_dbscan)
print("Davies-Bouldin index:", davies_bouldin_dbscan)
print("Completeness score:", completeness_dbscan)
OUTPUT:
Experiment – 11
AIM: Apply SVM to the Iris dataset from sklearn for classification
CODE:
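The imports, data loading and train/test split were not captured; a plausible preamble:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load Iris and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)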
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
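The model-fitting and prediction steps were also missing; one plausible completion (an RBF-kernel SVC with default parameters is an assumption):

# Train an SVM classifier on the scaled features and evaluate it
svm = SVC(kernel='rbf', random_state=42)
svm.fit(X_train_scaled, y_train)
y_pred = svm.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))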
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
OUTPUT: