Machine learning lab manual
Ex. No    Date    Name of the Experiment    Marks    Page No    Staff Signature
Ex No: 1 LINEAR REGRESSION
Date:
Aim:
To implement linear regression on a real dataset, experiment with different
features in building the model, and tune the model's hyperparameters.
Algorithm:
1. Load and preprocess the dataset.
2. Select the features and the target variable from the dataset.
10. Evaluate the model's performance using appropriate evaluation metrics, such as
mean squared error and R-squared.
11. Optionally, analyze the importance of different features in the model.
12. Optionally, visualize the predicted values against the actual values for further analysis.
13. Repeat steps 4-12 with different feature combinations and hyperparameters to
experiment and improve the model's performance.
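The workflow described above can be sketched as follows. This is a minimal sketch rather than the manual's original program. It assumes scikit-learn's bundled diabetes dataset as the real dataset, and the feature subsets in `feature_sets` are illustrative choices. It tunes `fit_intercept` and `positive`, two of the hyperparameters that `LinearRegression` exposes.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled diabetes dataset stands in for "a real dataset".
data = load_diabetes(as_frame=True).frame
target = "target"

# Assumption: these feature subsets are illustrative choices only.
feature_sets = {
    "bmi_only": ["bmi"],
    "bmi_bp_s5": ["bmi", "bp", "s5"],
    "all_features": [c for c in data.columns if c != target],
}

results = {}
for name, features in feature_sets.items():
    X = data[features].values
    y = data[target].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # Scale the features, then fit ordinary least squares.
    pipeline = make_pipeline(StandardScaler(), LinearRegression())
    # Search over the chosen hyperparameters with 5-fold cross-validation.
    grid = GridSearchCV(
        pipeline,
        param_grid={
            "linearregression__fit_intercept": [True, False],
            "linearregression__positive": [True, False],
        },
        scoring="neg_mean_squared_error",
        cv=5,
    )
    grid.fit(X_train, y_train)
    y_pred = grid.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
        "best_params": grid.best_params_,
    }

for name, res in results.items():
    print(f"{name}: MSE={res['mse']:.3f}, R2={res['r2']:.3f}, best={res['best_params']}")
```

Comparing the entries of `results` shows how the choice of features changes the error on the held-out test set.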
Program:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
X = data[features].values
y = data[target].values
Output:
Model Evaluation:
Mean Squared Error (MSE): 22.598
R-squared (R2) Score: 0.725
Best Model Hyperparameters:
Fit Intercept: True
Normalize: False
Result:
Thus, linear regression was implemented on a real dataset, experiments were run with different
features, and the model's hyperparameters were tuned successfully.
Ex No: 2 BINARY CLASSIFICATION MODEL
Date:
Aim:
To implement a binary classification model on the given dataset.
Algorithm:
1. Load and preprocess the dataset.
2. Select the features and the target variable from the dataset.
4. Build a binary classification model (e.g., logistic regression, decision tree, random forest, etc.).
6. Make predictions on the test set using the default classification threshold (usually 0.5).
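The role of the classification threshold can be illustrated with a short sketch. It is not the manual's original program. It assumes scikit-learn's bundled breast cancer dataset as the given dataset and logistic regression as the model, and the value 0.7 is a hypothetical alternative threshold. The helper `report` is introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast cancer dataset stands in for the given dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predicted probability of the positive class (label 1) for each test sample.
y_prob = model.predict_proba(X_test)[:, 1]


def report(threshold):
    """Return classification metrics for a given decision threshold."""
    y_hat = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_test, y_hat),
        "precision": precision_score(y_test, y_hat),
        "recall": recall_score(y_test, y_hat),
        "f1": f1_score(y_test, y_hat),
    }


default_metrics = report(0.5)  # the usual default threshold
custom_metrics = report(0.7)   # hypothetical stricter threshold

# ROC AUC does not depend on a threshold, so it uses the probabilities directly.
roc_auc = roc_auc_score(y_test, y_prob)

print("Threshold 0.5:", default_metrics)
print("Threshold 0.7:", custom_metrics)
print("ROC AUC:", roc_auc)
```

Raising the threshold can only reduce the number of samples predicted positive, so recall never increases, while precision usually (but not always) improves.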
Program:
import pandas as pd
import numpy as np
X = data[features].values
y = data[target].values
# Step 5: Make predictions
y_pred = model.predict(X_test)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)
accuracy_threshold = accuracy_score(y_test, y_pred_threshold)
precision_threshold = precision_score(y_test, y_pred_threshold)
recall_threshold = recall_score(y_test, y_pred_threshold)
f1_threshold = f1_score(y_test, y_pred_threshold)
roc_auc_threshold = roc_auc_score(y_test, y_pred_threshold)
print("Precision:", precision_threshold)
print("Recall:", recall_threshold)
print("F1-Score:", f1_threshold)
print("ROC AUC:", roc_auc_threshold)
Output:
Classification Metrics:
Accuracy: 0.85
Precision: 0.82
Recall: 0.78
F1 Score: 0.80
ROC AUC Score: 0.83
Result:
Thus, a binary classification model was implemented and executed successfully.
Ex No: 3 KNN CLASSIFIER ALGORITHM
Date:
Aim:
To implement the KNN classifier algorithm using the California Housing dataset.
Algorithm:
Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Step 3: Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
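A training/validation split like this one can be used to choose the number of neighbours. The sketch below is a hedged illustration, not the manual's original program. It uses scikit-learn's bundled wine dataset as a stand-in, because the California Housing data must be downloaded and its continuous target would first need to be binned into classes. The candidate values of `k` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine dataset is a stand-in for the manual's dataset.
X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# KNN is distance-based, so the features are standardized first.
scores = {}
for k in (1, 3, 5, 7, 9):  # hypothetical candidate values of k
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    scores[k] = accuracy_score(y_val, knn.predict(X_val))

# Refit with the k that scored best on the validation set.
best_k = max(scores, key=scores.get)
best_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
best_knn.fit(X_train, y_train)
y_pred = best_knn.predict(X_val)

print("Best k:", best_k)
print("Accuracy:", accuracy_score(y_val, y_pred))
print("Precision:", precision_score(y_val, y_pred, average="macro"))
print("Recall:", recall_score(y_val, y_pred, average="macro"))
print("F1-score:", f1_score(y_val, y_pred, average="macro"))
```

Because `best_k` is selected on the validation set, the validation accuracy it reports is optimistic. A separate test set would give an unbiased estimate.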
Output:
Accuracy: 0.76
Result:
Thus, the KNN classifier algorithm using the California Housing dataset was implemented and
executed successfully.
Ex No: 4 TRAINING SET AND VALIDATION SET
Date:
Aim:
To analyze and compare the training set and validation set of the given dataset.
Algorithm:
Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Step 2: Split the data into training, validation, and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Further split the training set into a smaller training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full,
test_size=0.25, random_state=42)
# Step 4: Make predictions on the smaller training set, validation set, and test set
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)
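The comparison of training and validation performance can be sketched end to end as follows. This is a minimal sketch under stated assumptions: scikit-learn's bundled breast cancer dataset stands in for the given dataset, and the gap between training and validation accuracy is stored in a variable named `delta`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast cancer dataset stands in for the given dataset.
X, y = load_breast_cancer(return_X_y=True)

# First split: 80% train-full, 20% test.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: 25% of train-full becomes the validation set,
# giving roughly 60% / 20% / 20% of the data overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
test_acc = accuracy_score(y_test, model.predict(X_test))

# A large positive gap suggests the model is overfitting the training set.
delta = train_acc - val_acc

print("Training Accuracy:", round(train_acc, 2))
print("Validation Accuracy:", round(val_acc, 2))
print("Delta:", round(delta, 2))
print("Test Accuracy:", round(test_acc, 2))
```

The test set is touched only once, at the end, so its accuracy remains an honest estimate of performance on unseen data.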
Output:
Training Accuracy: 0.85
Validation Accuracy: 0.80
Delta: 0.05
Test Accuracy: 0.82
Result:
Thus, the analysis and comparison of the training set and validation set was executed
successfully.
Ex No: 5 K-MEANS ALGORITHM
Date:
Aim:
To implement the k-means algorithm on the given dataset.
Algorithm:
1. Initialization:
Randomly initialize k centroids, each represented by a d-dimensional vector.
centroids <- Randomly select k data points from X.
2. Assignment Step:
For each data point x in X, calculate the distance to each centroid.
Assign x to the cluster whose centroid is closest (using Euclidean distance, for example).
Create a list clusters of length n that stores the cluster assignment of each data point.
3. Update Step:
For each cluster i from 1 to k:
Find all data points belonging to cluster i.
Calculate the mean of the feature vectors of these data points.
Update the i-th centroid to be the mean.
4. Convergence Check:
Check if the new centroids are significantly different from the previous centroids.
If the centroids have not changed significantly or a maximum number of
iterations is reached, terminate the algorithm.
Otherwise, go back to the Assignment Step.
5. Output:
Return the final centroids and clusters.
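The five steps above can be sketched in NumPy as follows. This is a minimal sketch, not a definitive implementation. The function name `k_means`, its parameters `max_iters`, `tol`, and `seed`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def k_means(X, k, max_iters=100, tol=1e-4, seed=0):
    """Plain k-means; returns (centroids, clusters, iterations run)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for iteration in range(1, max_iters + 1):
        # 2. Assignment: Euclidean distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        clusters = distances.argmin(axis=1)
        # 3. Update: each centroid moves to the mean of its assigned points.
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[clusters == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        # 4. Convergence check: stop when the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # 5. Output: final centroids and the cluster index of every point.
    return centroids, clusters, iteration


# Hypothetical toy data: three well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])

centroids, clusters, n_iter = k_means(X, k=3)
print("Converged after", n_iter, "iterations.")
print("Final centroids:")
print(centroids)
```

Random initialization means the result can depend on `seed`. Running the function with several seeds and keeping the solution with the lowest within-cluster distance is a common safeguard.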
Program:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Step 3: Initialize centroids
k = 3  # Number of clusters
centroids = X[np.random.choice(range(len(X)), size=k, replace=False)]
# Visualize the clusters
unique_labels = np.unique(clusters)
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
for i, label in enumerate(unique_labels):
    cluster_points = np.array([X[j] for j in range(len(X)) if clusters[j] == label])
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], c=colors[i % len(colors)],
                label=f"Cluster {label+1}")
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.title('K-Means Clustering')
plt.show()
Output:
Initial centroids:
[[-1.2, 0.5],
[0.8, -0.3],
[2.2, 1.5]]
Cluster assignments:
[1, 1, 2, 2, 0, 0]
Updated centroids:
[[-0.8, 0.2],
[0.4, -0.15],
[2.0, 1.2]]
Converged after 2 iterations.
Final centroids:
[[-0.8, 0.2],
[0.4, -0.15],
[2.0, 1.2]]
Result:
Thus, the k-means algorithm was implemented and executed successfully.
Ex No: 6 NAÏVE BAYES CLASSIFIER
Date:
Aim:
To implement the Naïve Bayes Classifier on the given dataset.
Algorithm:
1. Initialization:
Split the dataset X and class labels y into training and test sets (optional).
2. Compute class probabilities:
Calculate the prior probability of each class label based on the training set:
P(y = c) = Count of data points with class label c / Total number of data points.
3. Compute feature probabilities:
For each feature j and each class label c, calculate the likelihood of each
feature value given the class:
Calculate the conditional probability P(x_j = v | y = c) using a suitable
probability distribution (e.g., Gaussian, multinomial) based on the type
of feature.
Estimate the parameters of the probability distribution (e.g., mean and
variance for Gaussian).
4. Classify new data points:
Given a new data point x_new, calculate the posterior probability P(y = c |
x_new) for each class c:
For each class c, calculate the product of the conditional probabilities
P(x_j = v | y = c) for each feature j and value v in x_new.
Multiply the result by the prior probability P(y = c).
Normalize the probabilities by dividing by the sum of probabilities for all classes.
Assign x_new to the class with the highest posterior probability.
5. Output:
Return the trained Naïve Bayes classifier model.
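The steps above can be sketched for continuous features with a Gaussian likelihood. This is a minimal sketch, not a definitive implementation. The class name `SimpleGaussianNB`, the variance-smoothing constant, and the toy data are assumptions made for illustration. The product of conditional probabilities is computed as a sum of logarithms, which is mathematically equivalent but avoids numerical underflow.

```python
import numpy as np


class SimpleGaussianNB:
    """Minimal Gaussian Naive Bayes for continuous features."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Step 2: prior P(y = c) = count of class c / total count.
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # Step 3: per-class mean and variance of every feature.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Assumption: a small constant is added so the variance is never zero.
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict_proba(self, X):
        # Step 4: log P(y = c) + sum over j of log P(x_j | y = c).
        diff = X[:, None, :] - self.means_[None, :, :]
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * self.vars_)[None, :, :]
            + diff ** 2 / self.vars_[None, :, :],
            axis=2,
        )
        log_joint = np.log(self.priors_)[None, :] + log_likelihood
        # Normalize over classes so the posteriors sum to 1 for each point.
        log_joint -= log_joint.max(axis=1, keepdims=True)
        joint = np.exp(log_joint)
        return joint / joint.sum(axis=1, keepdims=True)

    def predict(self, X):
        # Assign each point to the class with the highest posterior.
        return self.classes_[self.predict_proba(X).argmax(axis=1)]


# Hypothetical toy data: two Gaussian classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

nb = SimpleGaussianNB().fit(X, y)
posteriors = nb.predict_proba(X)
accuracy = np.mean(nb.predict(X) == y)
print("Accuracy:", accuracy)
```

For real work, scikit-learn's `GaussianNB` offers the same model with a tested implementation. The sketch is meant only to make the individual steps visible.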
Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
Output:
Training Naïve Bayes Classifier...
Accuracy: 80%
Result:
Thus the implementation for the Naïve Bayes Classifier was executed successfully.
Ex No: 7 MINI PROJECT
Date:
Aim:
To carry out a project that implements one or more machine learning algorithms and
applies them to some data.
a. Your project may be a comparison of several existing algorithms, or it may
propose a new algorithm in which case you still must compare it to at least
one other approach.
b. You can either pick a project of your own design, or you can choose from
the set of pre-defined projects.
c. You are free to use any third-party ideas or code that you wish as long as it
is publicly available.
d. You must properly provide references to any work that is not your own in the write-up.
e. Project proposal: You must turn in a brief project proposal. Your project
proposal should describe the idea behind your project. You should also
briefly describe software you will need to write, and papers (2-3) you plan to
read.
Algorithm:
The objective of this project is to implement and compare different machine learning
algorithms for the classification of breast cancer tumor types. Breast cancer is a prevalent
disease, and accurate classification of tumor types (e.g., benign or malignant) is crucial for
diagnosis and treatment planning. By comparing multiple algorithms, we aim to identify the
most effective approach for accurately classifying breast cancer tumors.
Software: To implement this project, you will need the following software and libraries:
1. Python: The programming language for implementing the project.
2. Jupyter Notebook: An interactive development environment for running and documenting code.
3. Scikit-learn: A machine learning library in Python for implementing the algorithms.
4. Pandas: A data manipulation library for handling and analyzing the dataset.
5. Matplotlib/Seaborn: Libraries for data visualization and plotting.
6. Any additional libraries required by the chosen algorithms.
Dataset: For this project, you can use the Breast Cancer Wisconsin (Diagnostic) Dataset,
commonly known as the "WDBC dataset." It is publicly available and provides features
extracted from digitized images of breast mass aspirates. The dataset includes information
about tumor characteristics, such as texture, radius, perimeter, smoothness, and more, along
with corresponding tumor type labels (benign or malignant).
Algorithms: Compare and evaluate the performance of the following machine learning
algorithms for breast cancer tumor classification:
1. Logistic Regression: A linear classification algorithm that models the relationship
between features and tumor types.
2. Support Vector Machines (SVM): A binary classification algorithm that separates
data points using hyperplanes.
3. Random Forest: An ensemble learning algorithm that combines multiple decision
trees to make predictions.
4. Deep Learning (e.g., Neural Networks): Implement a deep learning model (e.g.,
feedforward neural network) for classification.
Methodology:
1. Preprocess the dataset: Perform data cleaning, handle missing values (if any), and
preprocess the features (e.g., scaling, normalization) to ensure compatibility with the
chosen algorithms.
2. Split the dataset: Divide the dataset into training and testing sets using a suitable ratio
(e.g., 80% for training, 20% for testing).
3. Implement the algorithms: Implement the selected machine learning algorithms using
appropriate libraries (e.g., scikit-learn, TensorFlow, or PyTorch).
4. Train and evaluate the models: Train each algorithm using the training set and
evaluate their performance using evaluation metrics such as accuracy, precision,
recall, and F1-score.
5. Compare the results: Compare the performance of the different algorithms and
analyze their strengths and weaknesses for breast cancer tumor classification.
6. Write-up: Document the project methodology, findings, and conclusions. Provide
references to any third-party code or research papers used.
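The methodology above can be sketched as a single comparison loop. This is a hedged sketch under stated assumptions, not a reference solution. It uses the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn, and scikit-learn's `MLPClassifier` stands in for the deep learning model in place of TensorFlow or PyTorch. All hyperparameters shown are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# In scikit-learn's copy of the data, label 0 = malignant and label 1 = benign,
# so precision and recall below are reported for the benign class.
X, y = load_breast_cancer(return_X_y=True)

# Step 2: 80% training, 20% testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 1 and 3: scaling is bundled with the models that are sensitive to it.
# Assumption: hyperparameters are illustrative, not tuned.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=42),
    ),
}

# Steps 4 and 5: train each model and collect the same metrics for all of them.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, metrics in results.items():
    print("Model:", name)
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.2f}")
```

A single train/test split gives a noisy ranking on a dataset this small. Cross-validation would make the comparison between algorithms more reliable.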
Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
# Step 5: Make predictions
y_pred = knn.predict(X_test)
# Step 6: Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Model: Random Forest
Accuracy: 0.93
Precision: 0.91
Recall: 0.94
F1-score: 0.92
Model: Neural Network
Accuracy: 0.96
Precision: 0.95
Recall: 0.97
F1-score: 0.96
Result:
Thus the implementation for the mini project was executed successfully.