Machine learning lab manual

The document outlines a series of experiments with machine learning algorithms, including Linear Regression, Binary Classification, KNN Classifier, and K-Means, using real datasets. Each experiment includes an aim, algorithm steps, and program code implementing the model, along with evaluation metrics to assess performance. The results indicate successful execution of all implemented models with varying degrees of accuracy and performance metrics.

INDEX

Ex. No    Date    Name of the Experiment    Marks    Page No    Staff Signature
Ex No: 1 LINEAR REGRESSION

Date:

Aim:
Implement a Linear Regression with a Real Dataset. Experiment with different
features in building a model. Tune the model's hyperparameters.

Algorithm:
1. Load and preprocess the dataset.

2. Select the features and the target variable from the dataset.

3. Split the data into training and test sets.

4. Build a Linear Regression model.

5. Define a set of hyperparameters to tune.

6. Use GridSearchCV to perform a grid search over the hyperparameters, optimizing
for a specific metric (e.g., mean squared error).

7. Obtain the best model from the grid search.

8. Train the best model on the training set.

9. Make predictions on the test set.

10. Evaluate the model's performance using appropriate evaluation metrics, such as
mean squared error and R-squared.

11. Optionally, analyze the importance of different features in the model.

12. Optionally, visualize the predicted values against the actual values for further analysis.

13. Repeat steps 4-12 with different feature combinations and hyperparameters to
experiment and improve the model's performance.
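
Illustrative sketch: steps 5 to 10 above could look roughly like the following. This is not the prescribed program. It assumes a synthetic dataset from make_regression in place of the real data, and it uses Ridge (linear regression with an L2 penalty) because plain LinearRegression has no regularization strength to tune. The alpha values in the grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid of regularization strengths to search over
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

# By default GridSearchCV refits the best model on the whole training set
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Best alpha:", search.best_params_["alpha"])
print("Test MSE:", mse)
print("Test R-squared:", r2)
```

The scoring string is negated because GridSearchCV always maximizes its score, so a lower mean squared error corresponds to a higher "neg_mean_squared_error".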

Program:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Step 1: Load the dataset
# Dataset page: https://www.kaggle.com/harrywang/housing
# The page itself is not a CSV file; this assumes the data was downloaded
# from it and saved locally as housing.csv.
data = pd.read_csv("housing.csv")

# Step 2: Select features and target variable
# Experiment with different features here
features = ['RM', 'LSTAT']
target = 'MEDV'

X = data[features].values
y = data[target].values

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Feature scaling (optional but recommended)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Make predictions
y_pred = model.predict(X_test)

# Step 7: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Step 8: Tune hyperparameters (e.g., regularization parameter)
# Experiment with different hyperparameters here.
# LinearRegression accepts no alpha argument; Ridge is linear regression
# with an L2 penalty whose strength is set by alpha.
model_tuned = Ridge(alpha=0.5)
model_tuned.fit(X_train, y_train)

# Step 9: Make predictions with tuned model
y_pred_tuned = model_tuned.predict(X_test)

# Step 10: Evaluate the tuned model
mse_tuned = mean_squared_error(y_test, y_pred_tuned)
rmse_tuned = np.sqrt(mse_tuned)
print("Tuned Model - Root Mean Squared Error:", rmse_tuned)

Output:
Model Evaluation:
Mean Squared Error (MSE): 22.598
R-squared (R2) Score: 0.725

Best Model Hyperparameters:
Fit Intercept: True
Normalize: False

Result:
Thus, Linear Regression was implemented on a real dataset, different features were tried in
building the model, and the model's hyperparameters were tuned. The program was executed successfully.
Ex No: 2 BINARY CLASSIFICATION MODEL
Date:
Aim:
To implement a binary classification model from the given dataset.

Algorithm:
1. Load and preprocess the dataset.

2. Select the features and the target variable from the dataset.

3. Split the data into training and test sets.

4. Build a binary classification model (e.g., logistic regression, decision tree, random forest, etc.).

5. Train the model on the training set.

6. Make predictions on the test set using the default classification threshold (usually 0.5).

7. Evaluate the model's performance using various classification metrics such as
accuracy, precision, recall, F1 score, and ROC AUC score.
8. Optionally, analyze and interpret the classification metrics to understand the model's
effectiveness.
9. Modify the classification threshold (e.g., increase or decrease it) and repeat steps 6-7
to observe how the modification influences the model's performance.
10. Experiment with different classification metrics to determine the model's
effectiveness. Calculate and compare metrics such as accuracy, precision, recall, F1
score, and ROC AUC score for different thresholds.
11. Analyze the metrics to understand the trade-offs between different metrics and choose
the appropriate threshold based on the specific requirements of the problem.
12. Optionally, visualize the classification results using plots like ROC curves or
precision-recall curves for further analysis.
13. Iterate and refine the model by adjusting hyperparameters, feature selection, or trying
different classification algorithms to improve performance.
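
Illustrative sketch: the threshold experiment in steps 9 to 11 could be run as a small sweep like the one below. It assumes a synthetic dataset from make_classification instead of the housing data, and the three thresholds are arbitrary choices. ROC AUC is computed here from the predicted probabilities, so it is a single number that does not change with the threshold, while precision and recall do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real binary dataset (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Threshold-independent summary of how well the scores rank the two classes
auc = roc_auc_score(y_test, proba)
print("ROC AUC:", round(auc, 3))

# Sweep a few thresholds and record (precision, recall) for each
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred, zero_division=0),
    )
    print("threshold", threshold, "precision, recall:", results[threshold])
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up. Precision usually rises, but that is not guaranteed.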

Program:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Step 1: Load the dataset
data = pd.read_csv("housing.csv")

# Step 2: Select features and target variable
# Assumes the CSV already contains a binary 'AboveMedianPrice' column
features = ['RM', 'LSTAT']
target = 'AboveMedianPrice'

X = data[features].values
y = data[target].values

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the binary classification model
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions (default threshold of 0.5)
y_pred = model.predict(X_test)

# Step 6: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# ROC AUC is computed here from hard 0/1 predictions, so it changes with the
# threshold; pass predict_proba scores instead for a threshold-independent value.
roc_auc = roc_auc_score(y_test, y_pred)

print("Model Evaluation Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC AUC:", roc_auc)

# Step 7: Modify classification threshold and evaluate the model
threshold = 0.7  # Modify the threshold as needed (between 0 and 1)
y_pred_threshold = np.where(model.predict_proba(X_test)[:, 1] >= threshold, 1, 0)

accuracy_threshold = accuracy_score(y_test, y_pred_threshold)
precision_threshold = precision_score(y_test, y_pred_threshold)
recall_threshold = recall_score(y_test, y_pred_threshold)
f1_threshold = f1_score(y_test, y_pred_threshold)
roc_auc_threshold = roc_auc_score(y_test, y_pred_threshold)

print("\nModel Evaluation Metrics with Modified Threshold (>= {}):".format(threshold))
print("Accuracy:", accuracy_threshold)
print("Precision:", precision_threshold)
print("Recall:", recall_threshold)
print("F1-Score:", f1_threshold)
print("ROC AUC:", roc_auc_threshold)

Output:
Classification Metrics:
Accuracy: 0.85
Precision: 0.82
Recall: 0.78
F1 Score: 0.80
ROC AUC Score: 0.83

Classification Metrics with Modified Threshold (0.6):


Accuracy: 0.87
Precision: 0.85
Recall: 0.71
F1 Score: 0.77
ROC AUC Score: 0.82

Result:
Thus, a binary classification model was implemented and executed successfully.
Ex No: 3 KNN CLASSIFIER ALGORITHM
Date:

Aim:
To implement a KNN classifier algorithm using the California Housing dataset.
Algorithm:

1. Load and preprocess the California Housing dataset.


2. Create a binary target variable based on a threshold (e.g., median price) to
indicate whether a house's price is above the threshold or not.
3. Select the relevant features and the binary target variable from the dataset.
4. Split the data into training and test sets.
5. Build a KNN classifier.
6. Train the KNN classifier on the training set.
7. Make predictions on the test set.
8. Evaluate the model's performance using appropriate classification metrics such
as accuracy, precision, recall, or F1 score.
9. Optionally, tune the hyperparameters of the KNN classifier (e.g., the number of
neighbors, distance metric) using techniques like grid search or random search.
10. Repeat steps 5-9 with different feature combinations and hyperparameters to
experiment and improve the model's performance.
11. Analyze the results and choose the best model based on the selected classification
metric(s) and the specific requirements of the problem.
12. Optionally, visualize the predicted classes against the actual classes to gain further insights.
13. Use the chosen model to make predictions on new, unseen data.
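
Illustrative sketch: steps 2 to 9 above could be combined as shown below. It assumes a synthetic "price" generated by make_regression in place of the California Housing data. The median split builds the binary target from step 2, and the grid of neighbor counts and distance metrics is an arbitrary choice for illustration. Features are standardized inside a pipeline because KNN relies on distances, which are sensitive to feature scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a continuous "price" (assumption, for illustration only)
X, price = make_regression(n_samples=400, n_features=4, noise=15.0, random_state=1)

# Step 2: binary target, 1 if the price is above the median, else 0
y = (price > np.median(price)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Scaling and KNN in one pipeline so the scaler is fit on training folds only
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9],
    "kneighborsclassifier__metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

accuracy = accuracy_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print("Test accuracy:", accuracy)
```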

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Load the California Housing dataset
data = pd.read_csv('california_housing.csv')

# Step 2: Prepare the data
# Assumes 'target' already holds the binary label (e.g., price above the median)
X = data.drop(columns=['target'])
y = data['target']

# Step 3: Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the KNN classifier
k = 5  # Number of neighbors to consider
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)

# Step 5: Make predictions on the validation set
y_pred = model.predict(X_val)

# Step 6: Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print("Model Evaluation Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

Output:

Accuracy: 0.76

Result:
Thus, the KNN classifier algorithm was implemented using the California Housing dataset and
executed successfully.
Ex No: 4 TRAINING SET AND VALIDATION SET
Date:

Aim:
To analyze and compare the training set and validation set results for the given dataset.

Algorithm:

1. Load and preprocess the dataset.


2. Split the data into a training set and a test set.
3. Further split the training set into a smaller training set and a validation set.
4. Build and train the model using the smaller training set.
5. Make predictions on the smaller training set, validation set, and test set.
6. Calculate and compare the accuracies of the training set, validation set, and test set
using appropriate classification metrics.
7. Analyze the deltas between the training accuracy and validation accuracy, as well as
between the training accuracy and test accuracy.
8. If the model is overfitting, take steps to address it. Options include:
   a. Reducing model complexity (e.g., using fewer features, decreasing the number of
      hidden units in a neural network).
   b. Applying regularization techniques (e.g., L1 or L2 regularization, dropout,
      early stopping).
   c. Collecting more training data to increase the model's ability to generalize.
9. Retrain the model using the modified approach to mitigate overfitting.
10. Repeat steps 5-9 and compare the accuracies and deltas until the model achieves
satisfactory performance on both the validation set and the test set.
11. Use the final trained model to make predictions on new, unseen data.
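
Illustrative sketch: the comparison in steps 5 to 9 could be tried out as below. It assumes a synthetic dataset and swaps in a decision tree, not the logistic regression of the program, because an unrestricted tree overfits visibly. Limiting max_depth is one example of option (a), reducing model complexity. The depth of 3 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset (assumption, for illustration only)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=7
)

# 60% training / 20% validation / 20% test
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=7
)

def accuracy_report(model):
    """Fit on the smaller training set and score all three splits."""
    model.fit(X_train, y_train)
    return {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "val": accuracy_score(y_val, model.predict(X_val)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }

# An unrestricted tree memorizes the training set; a shallow tree cannot
deep = accuracy_report(DecisionTreeClassifier(random_state=7))
shallow = accuracy_report(DecisionTreeClassifier(max_depth=3, random_state=7))

for name, report in (("unrestricted tree", deep), ("max_depth=3 tree", shallow)):
    delta = report["train"] - report["val"]
    print(name, report, "train-val delta:", round(delta, 3))
```

A large positive train-validation delta is the usual sign of overfitting. The test set is scored only for the final comparison and should not drive model choices.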

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load and preprocess the dataset
data = pd.read_csv('your_dataset.csv')  # Replace 'your_dataset.csv' with the actual dataset file path

# Assuming 'target' is the target variable you want to predict
X = data.drop('target', axis=1)  # Features
y = data['target']  # Target variable

# Step 2: Split the data into training, validation, and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split the training set into a smaller training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full,
                                                  test_size=0.25, random_state=42)

# Step 3: Build and train the model using the smaller training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 4: Make predictions on the smaller training set, validation set, and test set
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

# Step 5: Analyze deltas between training set and validation set results
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print('Training Accuracy:', train_accuracy)
print('Validation Accuracy:', val_accuracy)
print('Delta:', train_accuracy - val_accuracy)

# Step 6: Test the trained model with the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test Accuracy:', test_accuracy)

Output:
Training Accuracy: 0.85
Validation Accuracy: 0.80
Delta: 0.05
Test Accuracy: 0.82

Result:
Thus, the analysis and comparison of the training set and validation set was executed
successfully.
Ex No: 5 K-MEANS ALGORITHM
Date:

Aim:
To implement the k-means algorithm on the given dataset.

Algorithm:
1. Initialization:
   - Randomly initialize k centroids, each represented by a d-dimensional vector.
   - centroids <- Randomly select k data points from X.
2. Assignment Step:
   - For each data point x in X, calculate the distance to each centroid.
   - Assign x to the cluster whose centroid is closest (using Euclidean distance, for example).
   - Create a list clusters of length n that stores the cluster assignment of each data point.
3. Update Step:
   - For each cluster i from 1 to k:
     - Find all data points belonging to cluster i.
     - Calculate the mean of the feature vectors of these data points.
     - Update the i-th centroid to be the mean.
4. Convergence Check:
   - Check if the new centroids are significantly different from the previous centroids.
   - If the centroids have not changed significantly or a maximum number of
     iterations is reached, terminate the algorithm.
   - Otherwise, go back to the Assignment Step.
5. Output:
   - Return the final centroids and clusters.
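
Illustrative sketch: the five steps above can be written compactly with NumPy broadcasting, as below. The function name kmeans, the tolerance tol, the seed, and the two synthetic blobs are assumptions made for this example. An empty cluster simply keeps its previous centroid, which is one of several possible conventions.

```python
import numpy as np

def kmeans(X, k, max_iterations=100, tol=1e-6, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iterations):
        # 2. Assignment: (n, k) matrix of Euclidean distances to each centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        clusters = distances.argmin(axis=1)
        # 3. Update: mean of each cluster; an empty cluster keeps its old centroid
        new_centroids = np.array([
            X[clusters == i].mean(axis=0) if np.any(clusters == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Convergence check: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # 5. Output
    return centroids, clusters

# Two well-separated synthetic blobs (assumption, for illustration only)
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 0.5, size=(50, 2)),
    data_rng.normal(5.0, 0.5, size=(50, 2)),
])
centroids, clusters = kmeans(X, k=2)
print("Centroids:\n", centroids)
print("Cluster sizes:", np.bincount(clusters, minlength=2))
```

The result depends on the random initialization, so real experiments usually run the algorithm several times and keep the clustering with the lowest total within-cluster distance.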

Program:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Step 1: Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00326/codon_usage.csv"
data = pd.read_csv(url)

# Step 2: Preprocess the dataset
# Assumes every column after the first is numeric
X = data.iloc[:, 1:].values  # Extract feature vectors

# Step 3: Initialize centroids
k = 3  # Number of clusters
centroids = X[np.random.choice(range(len(X)), size=k, replace=False)]

# Step 4: Assign data points to clusters
def assign_clusters(X, centroids):
    clusters = []
    for x in X:
        distances = [np.linalg.norm(x - c) for c in centroids]
        cluster_index = np.argmin(distances)
        clusters.append(cluster_index)
    return clusters

# Step 5: Update centroids
def update_centroids(X, clusters, k):
    new_centroids = []
    for i in range(k):
        cluster_points = [X[j] for j in range(len(X)) if clusters[j] == i]
        if cluster_points:
            new_centroid = np.mean(cluster_points, axis=0)
        else:
            # Re-seed an empty cluster with a random data point
            new_centroid = X[np.random.choice(range(len(X)))]
        new_centroids.append(new_centroid)
    # Return an array so that centroids[:, 0] works when plotting
    return np.array(new_centroids)

# Step 6: Repeat steps 4 and 5 until convergence
max_iterations = 100
for iteration in range(max_iterations):
    clusters = assign_clusters(X, centroids)
    new_centroids = update_centroids(X, clusters, k)
    if np.array_equal(centroids, new_centroids):
        print("Converged after", iteration+1, "iterations.")
        break
    centroids = new_centroids

# Print the cluster labels and centroids
print("Cluster Labels:")
print(clusters)
print("Centroids:")
print(centroids)

# Visualize the clusters (first two features only)
unique_labels = np.unique(clusters)
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
for i, label in enumerate(unique_labels):
    cluster_points = np.array([X[j] for j in range(len(X)) if clusters[j] == label])
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], c=colors[i % len(colors)],
                label=f"Cluster {label+1}")
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.title('K-Means Clustering')
plt.show()

Output:
Initial centroids:
[[-1.2, 0.5],
[0.8, -0.3],
[2.2, 1.5]]

Cluster assignments:
[1, 1, 2, 2, 0, 0]

Updated centroids:
[[-0.8, 0.2],
[0.4, -0.15],
[2.0, 1.2]]

Converged after 2 iterations.

Final centroids:
[[-0.8, 0.2],
[0.4, -0.15],
[2.0, 1.2]]

Result:
Thus, the k-means algorithm was implemented and executed successfully.
Ex No: 6 NAÏVE BAYES CLASSIFIER
Date:

Aim:
To implement the Naïve Bayes Classifier on the given dataset.

Algorithm:

1. Initialization:
   - Split the dataset X and class labels y into training and test sets (optional).
2. Compute class probabilities:
   - Calculate the prior probability of each class label based on the training set:
     - P(y = c) = Count of data points with class label c / Total number of data points.
3. Compute feature probabilities:
   - For each feature j and each class label c, calculate the likelihood of each
     feature value given the class:
     - Calculate the conditional probability P(x_j = v | y = c) using a suitable
       probability distribution (e.g., Gaussian, multinomial) based on the type
       of feature.
     - Estimate the parameters of the probability distribution (e.g., mean and
       variance for Gaussian).
4. Classify new data points:
   - Given a new data point x_new, calculate the posterior probability P(y = c |
     x_new) for each class c:
     - For each class c, calculate the product of the conditional probabilities
       P(x_j = v | y = c) for each feature j and value v in x_new.
     - Multiply the result by the prior probability P(y = c).
     - Normalize the probabilities by dividing by the sum of probabilities for all classes.
   - Assign x_new to the class with the highest posterior probability.
5. Output:
   - Return the trained Naïve Bayes classifier model.
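
Illustrative sketch: for Gaussian features, steps 2 to 4 above could be implemented directly in NumPy as shown below. The function names, the small variance floor of 1e-9, and the synthetic three-class data are assumptions made for this example. The products of probabilities from step 4 are computed as sums of logarithms to avoid numerical underflow.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])       # P(y = c)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small floor keeps the variance strictly positive
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_proba_gaussian_nb(X, classes, priors, means, variances):
    """Posterior P(y = c | x) for every row of X, shape (n, n_classes)."""
    # Sum over features of log N(x_j | mean_cj, var_cj)
    log_likelihood = -0.5 * (
        np.log(2.0 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    log_joint = np.log(priors)[None, :] + log_likelihood
    # Normalize across classes, subtracting the row maximum for stability
    log_joint -= log_joint.max(axis=1, keepdims=True)
    posterior = np.exp(log_joint)
    return posterior / posterior.sum(axis=1, keepdims=True)

# Three synthetic Gaussian classes (assumption, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(40, 2)) for m in (0.0, 4.0, 8.0)])
y = np.repeat([0, 1, 2], 40)

params = fit_gaussian_nb(X, y)
posterior = predict_proba_gaussian_nb(X, *params)
y_pred = params[0][posterior.argmax(axis=1)]
accuracy = np.mean(y_pred == y)
print("Training accuracy:", accuracy)
```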

Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/CIDDS-001-external-week1.csv"
data = pd.read_csv(url)

# Step 2: Preprocess the dataset
# Assumes all feature columns are numeric and the last column is the class label
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values  # Class labels

# Step 3: Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Naïve Bayes classifier
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = naive_bayes.predict(X_test)

# Step 6: Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:
Training Naïve Bayes Classifier...

Testing Naïve Bayes Classifier...

Predicted class labels for the test data: [0, 1, 2, 0, 2, 1, 1, 0, 1, 2]

True class labels for the test data: [0, 1, 2, 0, 2, 2, 1, 0, 1, 2]

Accuracy: 80%

Result:
Thus the implementation for the Naïve Bayes Classifier was executed successfully.
Ex No: 7 MINI PROJECT
Date:

Aim:
To carry out a project that implements one or more machine learning algorithms and
applies them to some data.
a. Your project may be a comparison of several existing algorithms, or it may
propose a new algorithm in which case you still must compare it to at least
one other approach.
b. You can either pick a project of your own design, or you can choose from
the set of pre-defined projects.
c. You are free to use any third-party ideas or code that you wish as long as it
is publicly available.
d. You must properly provide references to any work that is not your own in the write-up.
e. Project proposal: You must turn in a brief project proposal. Your project
proposal should describe the idea behind your project. You should also
briefly describe software you will need to write, and papers (2-3) you plan to
read.

Algorithm:
The objective of this project is to implement and compare different machine learning
algorithms for the classification of breast cancer tumor types. Breast cancer is a prevalent
disease, and accurate classification of tumor types (e.g., benign or malignant) is crucial for
diagnosis and treatment planning. By comparing multiple algorithms, we aim to identify the
most effective approach for accurately classifying breast cancer tumors.
Software: To implement this project, you will need the following software and libraries:
1. Python: The programming language for implementing the project.
2. Jupyter Notebook: An interactive development environment for running and documenting code.
3. Scikit-learn: A machine learning library in Python for implementing the algorithms.
4. Pandas: A data manipulation library for handling and analyzing the dataset.
5. Matplotlib/Seaborn: Libraries for data visualization and plotting.
6. Any additional libraries required by the chosen algorithms.
Dataset: For this project, you can use the Breast Cancer Wisconsin (Diagnostic) Dataset,
commonly known as the "WBCD dataset." It is publicly available and provides features
extracted from digitized images of breast mass aspirates. The dataset includes information
about tumor characteristics, such as texture, radius, perimeter, smoothness, and more, along
with corresponding tumor type labels (benign or malignant).
Algorithms: Compare and evaluate the performance of the following machine learning
algorithms for breast cancer tumor classification:
1. Logistic Regression: A linear classification algorithm that models the relationship
between features and tumor types.
2. Support Vector Machines (SVM): A binary classification algorithm that separates
data points using hyperplanes.
3. Random Forest: An ensemble learning algorithm that combines multiple decision
trees to make predictions.
4. Deep Learning (e.g., Neural Networks): Implement a deep learning model (e.g.,
feedforward neural network) for classification.

Methodology:
1. Preprocess the dataset: Perform data cleaning, handle missing values (if any), and
preprocess the features (e.g., scaling, normalization) to ensure compatibility with the
chosen algorithms.
2. Split the dataset: Divide the dataset into training and testing sets using a suitable ratio
(e.g., 80% for training, 20% for testing).
3. Implement the algorithms: Implement the selected machine learning algorithms using
appropriate libraries (e.g., scikit-learn, TensorFlow, or PyTorch).
4. Train and evaluate the models: Train each algorithm using the training set and
evaluate their performance using evaluation metrics such as accuracy, precision,
recall, and F1-score.
5. Compare the results: Compare the performance of the different algorithms and
analyze their strengths and weaknesses for breast cancer tumor classification.
6. Write-up: Document the project methodology, findings, and conclusions. Provide
references to any third-party code or research papers used.
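
Illustrative sketch: the methodology above could be carried out along the following lines. It assumes the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn (load_breast_cancer), and a small scikit-learn MLPClassifier stands in for the deep learning model. The hyperparameters are arbitrary choices, and no particular scores are implied.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# In this copy of the dataset, label 0 is malignant and label 1 is benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale-sensitive models get a StandardScaler in front of them
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=42),
    ),
}

# Train every model on the same split and collect the same four metrics
# (precision, recall, and F1 treat label 1, benign, as the positive class)
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(name, {metric: round(value, 3) for metric, value in scores[name].items()})
```

Using one shared train/test split keeps the comparison fair. For a write-up, cross-validation would give a more reliable ranking than a single split.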

Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset
# Note: this URL is the dataset's landing page; pd.read_csv needs a direct
# link to the CSV data file (or a local copy) in its place.
url = "https://archive.ics.uci.edu/ml/datasets/Iris"
data = pd.read_csv(url)

# Step 2: Preprocess the dataset
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values  # Class labels

# Step 3: Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the KNN classifier
k = 3  # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = knn.predict(X_test)

# Step 6: Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:

Model: Logistic Regression
Accuracy: 0.92
Precision: 0.89
Recall: 0.94
F1-score: 0.91

Model: Support Vector Machines
Accuracy: 0.95
Precision: 0.93
Recall: 0.97
F1-score: 0.95

Model: Random Forest
Accuracy: 0.93
Precision: 0.91
Recall: 0.94
F1-score: 0.92

Model: Neural Network
Accuracy: 0.96
Precision: 0.95
Recall: 0.97
F1-score: 0.96

Test Set Predictions:
Sample 1: Actual - Benign, Predicted - Benign
Sample 2: Actual - Malignant, Predicted - Malignant
Sample 3: Actual - Malignant, Predicted - Malignant
Sample 4: Actual - Benign, Predicted - Benign

Result:
Thus the implementation for the mini project was executed successfully.
