ML Lab Experiments (1) - Pages-3
Experiment No: 5
Objective: Write a program to implement the naïve Bayesian classifier for a sample training
data set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data
sets.
Explanation:
Bayes’ Theorem is stated as:
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
Where,
P(h|D) is the probability of hypothesis h given the data D. This is called the posterior
probability.
P(D|h) is the probability of the data D given that the hypothesis h is true.
P(h) is the probability of hypothesis h being true. This is called the prior probability of h.
P(D) is the probability of the data. This is called the prior probability of D.
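After calculating the posterior probability for a number of different hypotheses h, the learner is
interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such
maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
Using Bayes’ theorem to calculate the posterior probability of each candidate hypothesis, hMAP is a
MAP hypothesis provided
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$
The final step drops P(D) because it is a constant that does not depend on h.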
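As a small illustration of the MAP rule, the following sketch computes the posterior of each hypothesis with Bayes' theorem and then picks the most probable one. The two hypotheses, their priors and likelihoods are hypothetical numbers chosen for the example, and the function names `posteriors` and `map_hypothesis` are not part of the lab program. The sketch assumes the hypotheses are mutually exclusive and exhaustive, so that P(D) can be obtained by summing P(D|h) P(h) over all of them.

```python
def posteriors(priors, likelihoods):
    """Return P(h|D) for every hypothesis h using Bayes' theorem."""
    # Unnormalised scores P(D|h) * P(h) for each hypothesis
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    # P(D) is the sum of the scores over all hypotheses
    evidence = sum(scores.values())
    return {h: s / evidence for h, s in scores.items()}


def map_hypothesis(priors, likelihoods):
    """Return the hypothesis that maximises P(D|h) * P(h); P(D) is not needed."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


# Hypothetical numbers: two hypotheses and one observed positive test result
priors = {"diabetic": 0.3, "healthy": 0.7}          # P(h)
likelihoods = {"diabetic": 0.9, "healthy": 0.2}     # P(D|h)

post = posteriors(priors, likelihoods)
h_map = map_hypothesis(priors, likelihoods)
print(post)
print(h_map)
```

Because P(D) is the same for every hypothesis, `map_hypothesis` selects the same hypothesis that has the largest entry in `post`.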
Laboratory File
Machine Learning Lab (IT804) Jan-Jun 2021
Sample Examples:
Example  Pregnancies  Glucose  Blood Pressure  Skin Thickness  Insulin  BMI   Diabetic Pedigree Function  Age  Outcome
1        6            148      72              35              0        33.6  0.627                       50   1
2        1            85       66              29              0        26.6  0.351                       31   0
3        8            183      64              0               0        23.3  0.672                       32   1
4        1            89       66              23              94       28.1  0.167                       21   0
5        0            137      40              35              168      43.1  2.288                       33   1
6        5            116      74              0               0        25.6  0.201                       30   0
7        3            78       50              32              88       31    0.248                       26   1
8        10           115      0               0               0        35.3  0.134                       29   0
9        2            197      70              45              543      30.5  0.158                       53   1
10       8            125      96              0               0        0     0.232                       54   1
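The program in this experiment treats each attribute as normally distributed within a class. As a rough illustration on a single attribute, the sketch below takes only the Glucose and Outcome columns of the ten sample rows, estimates a prior, mean and sample standard deviation per class, and scores a new reading. The reading of 160 and the names `gaussian_pdf`, `by_class` and `model` are assumptions made for this illustration and do not appear in the lab program.

```python
import math
import statistics

# (Glucose, Outcome) pairs copied from the ten sample rows
samples = [(148, 1), (85, 0), (183, 1), (89, 0), (137, 1),
           (116, 0), (78, 1), (115, 0), (197, 1), (125, 1)]


def gaussian_pdf(x, mean, stdev):
    """Density of the normal distribution N(mean, stdev^2) at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


# Group the glucose readings by class label
by_class = {}
for glucose, outcome in samples:
    by_class.setdefault(outcome, []).append(glucose)

# Per class: (prior, mean, sample standard deviation with n - 1 denominator)
model = {c: (len(v) / len(samples), statistics.mean(v), statistics.stdev(v))
         for c, v in by_class.items()}

# Score a hypothetical new reading: prior times Gaussian likelihood
x = 160
scores = {c: prior * gaussian_pdf(x, mu, sd) for c, (prior, mu, sd) in model.items()}
prediction = max(scores, key=scores.get)
print(model)
print(prediction)
```

With several attributes, the naive Bayes assumption multiplies one such Gaussian likelihood per attribute before comparing the classes.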
Program:
import csv
import random
import math
def loadcsv(filename):
    #read every row of the CSV file
    with open(filename, "r") as f:
        dataset = list(csv.reader(f))
    for i in range(len(dataset)):
        #converting strings into numbers for processing
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
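def splitdataset(dataset, splitratio):
    #number of rows that go into the training set, e.g. 67% of the data
    trainsize = int(len(dataset) * splitratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        #generate indices for the dataset list randomly to pick elements for training data
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    #the rows left in copy form the test set
    return [trainset, copy]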
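def separatebyclass(dataset):
    separated = {} #dictionary of classes 1 and 0
    for vector in dataset:
        #the last column of each row holds the class label
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated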
def mean(numbers):
    return sum(numbers)/float(len(numbers))
def stdev(numbers):
    avg = mean(numbers)
    #sample variance: divide by n-1
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)
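def summarize(dataset):
    #mean and sd of every attribute column; the last column (class label) is dropped
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries
def summarizebyclass(dataset):
    separated = separatebyclass(dataset)
    #print(separated)
    summaries = {}
    for classvalue, instances in separated.items():
        summaries[classvalue] = summarize(instances)
    return summaries
def calculateprobability(x, mean, stdev):
    #Gaussian (normal) probability density of x
    exponent = math.exp(-math.pow(x - mean, 2) / (2 * math.pow(stdev, 2)))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
def calculateclassprobabilities(summaries, inputvector):
    probabilities = {}
    for classvalue, classsummaries in summaries.items():
        probabilities[classvalue] = 1
        for i in range(len(classsummaries)):
            #take mean and sd of every attribute for class 0 and 1 separately
            mean, stdev = classsummaries[i]
            x = inputvector[i] #value of the test vector's i-th attribute
            probabilities[classvalue] *= calculateprobability(x, mean, stdev) #use normal dist
    return probabilities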
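def predict(summaries, inputvector):
    #choose the class with the largest probability for this test vector
    probabilities = calculateclassprobabilities(summaries, inputvector)
    return max(probabilities, key=probabilities.get)
def getpredictions(summaries, testset):
    return [predict(summaries, row) for row in testset]
def getaccuracy(testset, predictions):
    #percentage of test rows whose class label (last column) was predicted correctly
    correct = sum(1 for row, p in zip(testset, predictions) if row[-1] == p)
    return (correct / float(len(testset))) * 100.0
def main():
    filename = 'naivedata.csv'
    splitratio = 0.67
    dataset = loadcsv(filename)
    trainingset, testset = splitdataset(dataset, splitratio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset),
          len(trainingset), len(testset)))
    # prepare model
    summaries = summarizebyclass(trainingset)
    #print(summaries)
    # test model
    predictions = getpredictions(summaries, testset)
    #find the predictions of test data with the training data
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy of the classifier is: {0}%'.format(accuracy))
main()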
Output:
Split 768 rows into train=514 and test=254 rows
Accuracy of the classifier is: 71.65354330708661%
Experiment No: 6
Objective: Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
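The three measures named in the objective can be computed directly from the actual and predicted labels. The sketch below is one minimal way to do so for binary labels; the function name `evaluate` and the five label pairs are hypothetical and serve only to show the arithmetic. Precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that were found.

```python
def evaluate(actual, predicted, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels for five test documents (1 = pos, 0 = neg)
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
accuracy, precision, recall = evaluate(actual, predicted)
print(accuracy, precision, recall)
```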
Explanation:
Naive Bayes algorithms for learning and classifying text
LEARN_NAIVE_BAYES_TEXT (Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible
target values. This function learns the probability terms P(wk|vj), describing the probability that
a randomly drawn word from a document in class vj will be the English word wk. It also learns
the class prior probabilities P(vj).
1. Collect all words, punctuation, and other tokens that occur in Examples
   Vocabulary ← the set of all distinct words and other tokens occurring in any text
   document from Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
   For each target value vj in V do
      docsj ← the subset of documents from Examples for which the target value is vj
      P(vj) ← |docsj| / |Examples|
      Textj ← a single document created by concatenating all members of docsj
      n ← total number of distinct word positions in Textj
      For each word wk in Vocabulary
         nk ← number of times word wk occurs in Textj
         P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the ith
position within Doc.
   positions ← all word positions in Doc that contain tokens found in Vocabulary
   Return vNB, where
      vNB = argmax over vj ∈ V of P(vj) × Π (i ∈ positions) P(ai|vj)
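The two procedures above can be sketched in plain Python as follows. This is a minimal illustration, not the program of this experiment: tokenising by lower-casing and splitting on whitespace is an assumption, the function names simply mirror the pseudocode, and logarithms are summed instead of multiplying probabilities to avoid underflow on long documents. The six training documents are taken from the data set of this experiment, and the classified sentence is a made-up test input.

```python
import math
from collections import Counter


def learn_naive_bayes_text(examples):
    """examples: list of (text, label). Returns (vocabulary, priors, word_probs)."""
    vocabulary = {w for text, _ in examples for w in text.lower().split()}
    priors, word_probs = {}, {}
    for label in {lab for _, lab in examples}:
        docs = [text for text, lab in examples if lab == label]
        priors[label] = len(docs) / len(examples)      # P(vj)
        # Textj: all documents of this class concatenated into one word list
        words = " ".join(docs).lower().split()
        n = len(words)                                 # word positions in Textj
        counts = Counter(words)
        # Smoothed estimate P(wk|vj) = (nk + 1) / (n + |Vocabulary|)
        word_probs[label] = {w: (counts[w] + 1) / (n + len(vocabulary))
                             for w in vocabulary}
    return vocabulary, priors, word_probs


def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    """Return the label maximising P(vj) times the product of P(ai|vj)."""
    # Keep only the positions whose token is in the vocabulary
    positions = [w for w in doc.lower().split() if w in vocabulary]
    scores = {label: math.log(priors[label])
              + sum(math.log(word_probs[label][w]) for w in positions)
              for label in priors}
    return max(scores, key=scores.get)


examples = [("I love this sandwich", "pos"), ("This is an amazing place", "pos"),
            ("What an awesome view", "pos"), ("I do not like this restaurant", "neg"),
            ("I am tired of this stuff", "neg"), ("My boss is horrible", "neg")]
vocabulary, priors, word_probs = learn_naive_bayes_text(examples)
label = classify_naive_bayes_text("What an amazing view", vocabulary, priors, word_probs)
print(label)
```

The added 1 in the numerator keeps a word that never occurs in a class from forcing that class's score to zero.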
Program:
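import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
msg=pd.read_csv('naivetext.csv',names=['message','label'])
msg['labelnum']=msg.label.map({'pos':1,'neg':0}) #pos -> 1, neg -> 0
X=msg.message
Y=msg.labelnum
print(X)
print(Y)
xtrain,xtest,ytrain,ytest=train_test_split(X,Y) #split into training and test data
print('The total number of Training Data:',ytrain.shape)
print('The total number of Test Data:',ytest.shape)
count_vect=CountVectorizer() #counts how often each word occurs in a document
xtrain_dtm=count_vect.fit_transform(xtrain)
df=pd.DataFrame(xtrain_dtm.toarray(),columns=count_vect.get_feature_names_out())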
Data set:
No.  Text Documents                          Label
1    I love this sandwich                    pos
2    This is an amazing place                pos
3    I feel very good about these beers      pos
4    This is my best work                    pos
5    What an awesome view                    pos
6    I do not like this restaurant           neg
7    I am tired of this stuff                neg
8    I can't deal with this                  neg
9    He is my sworn enemy                    neg
10   My boss is horrible                     neg
11   This is an awesome place                pos
12   I do not like the taste of this juice   neg
13   I love to dance                         pos
14   I am sick and tired of this place       neg
15   What a great holiday                    pos
16   That is a bad locality to stay          neg
17   We will have good fun tomorrow          pos
18   I went to my enemy's house today        neg
Output:
14 1
15 0
16 1
17 0
Name: labelnum, dtype: int64
The total number of Training Data: (13,)
The total number of Test Data: (5,)