CP4252 Machine Learning lab manual
BONAFIDE CERTIFICATE
Register No.:
This is to certify that this is a bonafide record of the work done by the above student with Roll No. _______ of _______ Semester B.E. Degree in _____________ in the _____________ Laboratory during the academic year 2022 – 2023.
Date:
AIM:
To implement the Naive Bayes theorem to classify English text.
DESCRIPTION:
The challenge of text classification is to attach labels to bodies of text (e.g., tax document, medical form) based on the text itself. For example, think of the spam folder in your email. How does your email provider know that a particular message is spam or “ham” (not spam)? We’ll take a look at one natural language processing technique for text classification called Naive Bayes.
SOURCE CODE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Load the labelled messages and map the text labels to numbers
msg = pd.read_csv('document.csv', names=['message', 'label'])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

# Convert the raw text into a bag-of-words document-term matrix
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())

# Train a multinomial Naive Bayes classifier and evaluate on the test set
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
print('Accuracy Metrics:')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
document.csv:
I love this sandwich,pos
This is an amazing place,pos
He is my sworn enemy,neg
My boss is horrible,neg
I love to dance,pos
OUTPUT:
Total Instances of Dataset: 18
Recall: 0.6666666666666666
Precision: 0.6666666666666666
Confusion Matrix:
[[1 1]
[1 2]]
VIVA QUESTIONS & ANSWERS:
Let’s understand it using an example. Below is a training data set of weather and the corresponding target variable ‘Play’ (suggesting the possibility of playing). We need to classify whether players will play or not based on weather conditions. Let’s follow the steps below to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, e.g., Overcast probability = 0.29 and probability of playing = 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class
with the highest posterior probability is the outcome of prediction.
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has the higher probability.
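The same arithmetic can be verified in a few lines of Python, using the counts from the weather table above:

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9   # Sunny days among the 9 "Yes" rows
p_sunny = 5 / 14            # Sunny days in the full 14-row table
p_yes = 9 / 14              # "Yes" rows in the full table
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6, so "Yes" is the predicted class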
Naive Bayes uses a similar method to predict the probabilities of different classes based on various attributes. This algorithm is mostly used in text classification and in problems having multiple classes.
Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature; it can predict the probability of multiple classes of the target variable.
Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
Recommendation System: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
PROGRAM 2:
AIM:
To implement classification with K-Nearest Neighbours using scikit-learn’s KNN classifier.
DESCRIPTION:
In this program, you will use scikit-learn’s KNN classifier to classify real vs. fake news headlines. The aim of this question is for you to read the scikit-learn API and get comfortable with training and validating models.
K-Nearest-Neighbour Algorithm:
1. Calculate the distance between test data and each row of training data. Here we will use Euclidean
distance as our distance metric since it’s the most popular method. The other metrics that can be
used are Chebyshev, cosine, etc.
2. Sort the calculated distances in ascending order based on distance values
3. Get top k rows from the sorted array.
4. Get the most frequent class of these rows i.e Get the labels of the selected K entries.
5. Return the predicted class: for regression, return the mean of the K labels; for classification, return the mode of the K labels (see the sketch below).
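As a quick illustration of these five steps, here is a minimal NumPy sketch of a KNN classifier; the toy training arrays and test point are made up for the example:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 1: Euclidean distance from the test point to every training row
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Steps 2-3: sort ascending and take the indices of the k nearest rows
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: classification returns the mode of the k labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # prints 'A'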
Confusion matrix:
Note:
• Class 1: Positive
• Class 2: Negative
ACCURACY = (TP + TN) / (TP + TN + FP + FN)
RECALL = TP / (TP + FN)
PRECISION = TP / (TP + FP)
F-MEASURE = (2 * RECALL * PRECISION) / (RECALL + PRECISION)
EXAMPLE:
              PREDICTED: NO   PREDICTED: YES   Total
ACTUAL: NO    TN = 50         FP = 10          60
ACTUAL: YES   FN = 5          TP = 100         105
Total         55              110              165
True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
also known as "Sensitivity" or "Recall".
False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus the False Positive Rate; also known as "Specificity".
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
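These rates follow directly from the four cells of the table; a short check in Python using the same counts:

TP, TN, FP, FN = 100, 50, 10, 5   # counts from the example table above
print("True Positive Rate:", round(TP / (TP + FN), 2))            # 0.95
print("False Positive Rate:", round(FP / (FP + TN), 2))           # 0.17
print("True Negative Rate:", round(TN / (TN + FP), 2))            # 0.83
print("Prevalence:", round((TP + FN) / (TP + TN + FP + FN), 2))   # 0.64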
Source Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Load the Iris dataset; the last column holds the class label
dataset = pd.read_csv("iris.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

# Fit a KNN classifier with k = 8 and Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=8, metric='euclidean')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix is as follows\n', cm)
print('Accuracy Metrics')
print(classification_report(y_test, y_pred))
print("Correct prediction", accuracy_score(y_test, y_pred))
print("Wrong prediction", (1 - accuracy_score(y_test, y_pred)))
Output :
Confusion matrix is as follows
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
Accuracy Metrics:
AIM:
To implement Linear Regression with a real dataset and experiment with different features in the model.
DESCRIPTION:
1. Look at the big picture and frame the problem.
2. Get the data and explore it to gain insights.
3. Create the training and test sets using proper sampling methods, e.g., random vs. stratified.
4. Correlation analysis (pair-wise and attribute combinations).
5. Data cleaning (missing data, outliers, data errors).
6. Data transformation via pipelines (categorical text to number using one hot encoding, feature
scaling via normalization/standardization, feature combinations).
7. Train and cross-validate different models and select the most promising one (Linear Regression, Decision Tree, and Random Forest were tried in this tutorial; a sketch of steps 6 and 7 follows this list).
8. Fine-tune the model by trying different combinations of hyperparameters.
9. Evaluate the model with best estimators in the test set.
10. Launch, monitor, and refresh the model and system.
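Steps 6 and 7 are the least obvious to code, so here is a minimal scikit-learn sketch of a transformation pipeline plus cross-validation. It assumes the housing.csv file used in the source code below; the imputation strategy and model choice are illustrative, not the tutorial's exact settings:

import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

housing = pd.read_csv("housing.csv")
X = housing.drop(columns=["median_house_value"])
y = housing["median_house_value"]

# Step 6: impute and scale numeric columns, one-hot encode the text column
num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scale", StandardScaler())])
prep = ColumnTransformer([
    ("num", num_pipe, make_column_selector(dtype_include="number")),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Step 7: cross-validate a candidate model end to end
model = Pipeline([("prep", prep), ("linreg", LinearRegression())])
scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
print("RMSE per fold:", -scores)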
SOURCE CODE:
# This Python 3 environment comes with many helpful analytics libraries installed.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
# For example, running this (by clicking Run or pressing Shift+Enter)
# will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
['anscombe.csv', 'housing.csv']
# loading data
data_path = "../input/housing.csv"
housing = pd.read_csv(data_path)
housing.info()
<class 'pandas.core.frame.DataFrame'>
Input(3): housing.head(10)
Input(4): housing.describe()
Input(7): housing['ocean_proximity'].value_counts()
Output(7):
INLAND 6551
ISLAND 5
Input(8):
op_count = housing['ocean_proximity'].value_counts()
plt.figure(figsize=(10,5))
op_count.plot(kind='bar')  # bar chart of the category counts
plt.show()
Output(8): (bar chart of ocean_proximity counts)
Input(9): housing['median_income'].hist()
Input(12):
# corr() needs numeric columns only; 'ocean_proximity' is text
corr_matrix = housing.corr(numeric_only=True)
# Check how much each attribute correlates with the median house value
corr_matrix['median_house_value'].sort_values(ascending=False)
Output(12):
median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
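Since median_income correlates most strongly with the target, a minimal regression sketch can close the loop. It continues from the housing DataFrame above; fitting on a single feature is just for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Use the strongest single predictor found above
X = housing[["median_income"]]
y = housing["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
pred = linreg.predict(X_test)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))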
AIM:
To implement the K-Nearest Neighbour algorithm and get comfortable with the scikit-learn API and training/validation workflows.
DESCRIPTION:
Classification with Nearest Neighbours. In this question, you will use scikit-learn’s KNN classifier to classify real vs. fake news headlines. The aim of this question is for you to read the scikit-learn API and get comfortable with training and validating models.
SOURCE CODE:
import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet, testSet):
    # Read the CSV file and randomly split rows into training and test sets
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # Rank all training rows by distance to the test instance
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    # Majority vote over the class labels of the k neighbours
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data ('iris.csv' is assumed here, the file used earlier)
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset('iris.csv', split, trainingSet, testSet)
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
    print('Accuracy: ' + repr(getAccuracy(testSet, predictions)) + '%')

main()
OUTPUT:
[[11 0 0]
[0 9 1]
[0 1 8]]
Accuracy metrics:
AIM:
To experiment with validation sets and test sets using the given datasets.
DESCRIPTION:
To experiment with validation sets and test sets using the datasets. Split a training set into a smaller training set and a validation set. Analyze deltas between training set and validation set results. Test the trained model with a test set to determine whether your trained model is overfitting. Detect and fix a common training problem.
SOURCE CODE:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate, train_test_split

np.random.seed(42)
# Generate a noisy sine wave as the toy dataset
N = 300
x = np.linspace(0, 7*np.pi, N)
smooth = 1 + 0.5*np.sin(x)
y = smooth + 0.2*np.random.randn(N)
plt.plot(x, y)
plt.plot(x, smooth)
plt.xlabel("x")
plt.ylabel("y")
plt.ylim(0,2)
plt.show()

X = x.reshape(-1,1)
degree = 2
polyreg = Pipeline([("poly", PolynomialFeatures(degree=degree)),
                    ("linreg", LinearRegression())])
linreg = LinearRegression()

# Cross-validation: compare linear and polynomial fits on held-out folds
scoring = "neg_root_mean_squared_error"
polyscores = cross_validate(polyreg, X, y, scoring=scoring, return_estimator=True)
linscores = cross_validate(linreg, X, y, scoring=scoring, return_estimator=True)

# This starts from the constant term and in ascending order of powers
print(polyscores["estimator"][0].steps[1][1].coef_)

plt.plot(x, y)
plt.plot(x, smooth)
plt.plot(x, polyscores["estimator"][0].predict(X))
plt.plot(x, linscores["estimator"][0].predict(X))
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

import sklearn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
linreg = sklearn.base.clone(linreg)
linreg.fit(X_train, y_train)
AIM:
To implement a binary classification model for a binary question such as "Are houses in this neighborhood above a certain price?"
DESCRIPTION:
Implement a binary classification model for a binary question such as "Are houses in this neighborhood above a certain price?" (use data from exercise 1). Modify the classification threshold and determine how that modification influences the model. Experiment with different classification metrics to determine your model's effectiveness.
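As a quick illustration of how moving the threshold trades precision against recall, here is a small sketch with made-up predicted probabilities (y_true and y_prob are invented for the example; the real exercise uses the model's own predictions):

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.3, 0.55, 0.6])

# Raising the threshold trades recall for precision
for threshold in (0.35, 0.5, 0.65):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          "precision:", round(precision_score(y_true, y_pred), 2),
          "recall:", round(recall_score(y_true, y_pred), 2))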
SOURCE CODE:
import numpy as np
import pandas as pd
import tensorflow as tf

# Load the California Housing training and test sets
train_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
test_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv")
train_df = train_df.reindex(np.random.permutation(train_df.index))

# Normalize each feature with its Z-score
train_df_norm = (train_df - train_df.mean()) / train_df.std()
test_df_norm = (test_df - test_df.mean()) / test_df.std()

threshold = 265000  # This is the 75th percentile for median house values.
# Create the binary label: median_house_value_is_high column
train_df_norm["median_house_value_is_high"] = (train_df["median_house_value"] > threshold).astype(float)
test_df_norm["median_house_value_is_high"] = (test_df["median_house_value"] > threshold).astype(float)
train_df_norm["median_house_value_is_high"].head(8000)

inputs = {
    'total_rooms': tf.keras.Input(shape=(1,))
}
learning_rate = 0.001
epochs = 20
batch_size = 100
classification_threshold = 0.35
label_name = "median_house_value_is_high"

METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy',
                                    threshold=classification_threshold),
    tf.keras.metrics.Precision(thresholds=classification_threshold,
                               name='precision'),
    # The "write code here" slot: a recall metric at the same threshold
    tf.keras.metrics.Recall(thresholds=classification_threshold,
                            name='recall'),
]

# (model-building and training helper calls are elided in the manual, e.g.
# history = train_model(my_model, train_df_norm, label_name, batch_size))

# Plot metrics vs. epochs
list_of_metrics_to_plot = ['accuracy', 'precision', 'recall']
AIM:
To build an Artificial Neural Network by implementing the Back Propagation algorithm and test it using an appropriate dataset.
DESCRIPTION:
We can define the back propagation algorithm as an algorithm that trains a given feed-forward neural network for input patterns whose classifications are known to us. When each entry of the sample set is presented to the network, the network compares its output response with the expected output for that input pattern and measures the error value. The connection weights are then adjusted based on the measured error.
Back propagation was first introduced in the 1960s and was popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their famous 1986 paper, and it remains the standard way neural networks are trained today. In this approach, we fine-tune the weights of a neural net based on the error rate obtained in the previous run; applying the technique correctly reduces error rates and makes the model more reliable. Back propagation trains the network using the chain rule: after each forward pass through the network, the algorithm performs a backward pass to adjust the model's parameters (weights and biases). A typical supervised learning algorithm attempts to find a function that maps input data to the right output, and back propagation lets a multi-layered neural network learn internal representations of this input-to-output mapping.
The back propagation algorithm is a supervised learning method for multilayer feed-forward networks.
Feed-forward neural networks are inspired by the information processing of one or more neural cells, called neurons. A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. The axon carries the signal out to synapses, which are the connections of a cell's axon to other cells' dendrites.
The principle of the back propagation approach is to model a given function by modifying the internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system's output and a known expected output is presented to the system and used to modify its internal state.
Technically, the back propagation algorithm is a method for training the weights in a multilayer feed-forward neural network. As such, it requires a network structure of one or more layers, where each layer is fully connected to the next. A standard network structure is one input layer, one hidden layer, and one output layer. Back propagation can be used for both classification and regression problems.
In classification problems, best results are achieved when the network has one neuron in the output layer for each class value. For example, consider a 2-class (binary) classification problem with class values A and B. The expected outputs would have to be transformed into binary vectors with one column for each class value, such as [1, 0] and [0, 1] for A and B respectively. This is called one-hot encoding.
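For instance, a quick way to produce such vectors with NumPy (the class and label lists here are made up for the example):

import numpy as np

classes = ['A', 'B']
labels = ['A', 'B', 'B', 'A']
# Each label becomes a binary vector with a 1 in its class's column
one_hot = np.eye(len(classes))[[classes.index(l) for l in labels]]
print(one_hot)  # [[1. 0.] [0. 1.] [0. 1.] [1. 0.]]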
Let us take a look at how back propagation works. The network considered here has four layers: an input layer, hidden layer I, hidden layer II, and a final output layer. The three main layer types are:
1. Input layer
2. Hidden layer
3. Output layer
Each layer has its own role in transforming the signal so that we obtain the desired results. The sketch below illustrates the weight updates, followed by the program's source code.
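Before the scikit-learn program below, the weight-update mechanics can be seen in a tiny NumPy sketch of back propagation on a one-hidden-layer network. The XOR data, layer sizes, and learning rate are illustrative choices, not part of this lab's prescribed program:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))
b1, b2 = np.zeros((1, 4)), np.zeros((1, 1))
lr = 0.5
for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule gives each layer's error signal
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Adjust the connection weights against the gradient
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # typically converges close to [[0], [1], [1], [0]]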
SOURCE CODE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

# Load the labelled messages and map the text labels to numbers
msg = pd.read_csv('document.csv', names=['message', 'label'])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

# Bag-of-words features
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())

# Train a multilayer perceptron (back propagation) classifier
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2),
                    random_state=1)
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)

print('Accuracy Metrics:')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix:\n', confusion_matrix(ytest, pred))
document.csv:
I love this sandwich,pos
This is an amazing place,pos
This is my best work,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
He is my sworn enemy,neg
My boss is horrible,neg
I am sick and tired of this place,neg
We will have good fun tomorrow,pos
I went to my enemy's house today,neg
OUTPUT:
Accuracy Metrics:
Accuracy: 0.8
Recall: 1.0
Precision: 0.75
Confusion Matrix:
[[1 1]
[0 3]]
VIVA QUESTIONS:
What is clustering?
Define regression.
What is ANN?