Naïve Bayesian Classifier Example

Uploaded by senthil7111
6 Naïve Bayesian Classifier

############### PROGRAM ###############


import csv
import random
import math


def loadcsv(filename):
    # Load a CSV of numeric rows; the last column is the class label
    with open(filename, "r") as file:
        lines = csv.reader(file)
        dataset = list(lines)
    dataset = dataset[1:]  # skip the header row
    dataset = [[float(x) for x in row] for row in dataset]
    return dataset


def splitdataset(dataset, splitratio):
    # Randomly move rows into the training set until the split ratio is met
    trainsize = int(len(dataset) * splitratio)
    trainset = []
    copy = list(dataset)
    while len(trainset) < trainsize:
        index = random.randrange(len(copy))
        trainset.append(copy.pop(index))
    return [trainset, copy]


def separatebyclass(dataset):
    # Group rows by their class label (the last value in each row)
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if vector[-1] not in separated:
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated


def mean(numbers):
    return sum(numbers) / float(len(numbers))


def stdev(numbers):
    # Sample standard deviation (n - 1 in the denominator)
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)


def summarize(dataset):
    # One (mean, stdev) pair per attribute; drop the class column
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries


def summarizebyclass(dataset):
    separated = separatebyclass(dataset)
    summaries = {}
    for classvalue, instances in separated.items():
        summaries[classvalue] = summarize(instances)
    return summaries


def calculateprobability(x, mean, stdev):
    # Gaussian probability density function
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent


def calculateclassprobabilities(summaries, inputvector):
    # Multiply the per-feature Gaussian densities for each class
    probabilities = {}
    for classvalue, classsummaries in summaries.items():
        probabilities[classvalue] = 1
        for i in range(len(classsummaries)):
            mean, stdev = classsummaries[i]
            x = inputvector[i]
            probabilities[classvalue] *= calculateprobability(x, mean, stdev)
    return probabilities


def predict(summaries, inputvector):
    # Pick the class with the highest combined probability
    probabilities = calculateclassprobabilities(summaries, inputvector)
    bestLabel, bestProb = None, -1
    for classvalue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classvalue
    return bestLabel


def getpredictions(summaries, testset):
    predictions = []
    for i in range(len(testset)):
        result = predict(summaries, testset[i])
        predictions.append(result)
    return predictions


def getaccuracy(testset, predictions):
    correct = 0
    for i in range(len(testset)):
        if testset[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testset))) * 100.0


def main():
    filename = 'D:\\New folder\\[Link]'  # path to the CSV dataset
    splitratio = 0.67
    dataset = loadcsv(filename)
    trainingset, testset = splitdataset(dataset, splitratio)
    print('Split {0} rows into train={1} and test={2} rows'.format(
        len(dataset), len(trainingset), len(testset)))
    summaries = summarizebyclass(trainingset)
    predictions = getpredictions(summaries, testset)
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy of the classifier is : {0}%'.format(accuracy))


main()

############### OUTPUT ###############

Split 9 rows into train=6 and test=3 rows


Accuracy of the classifier is : 33.33333333333333%

Common questions


How does the `calculateclassprobabilities` function contribute to the final prediction?

The `calculateclassprobabilities` function calculates the probability of the input vector belonging to each possible class. It uses the mean and standard deviation of each feature under each class to compute the probability of that feature value for the class, multiplying these per-feature probabilities together to get a combined probability for the class. The class with the highest combined probability is the one the input vector most likely belongs to, so this function feeds directly into the final prediction.
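This step can be sketched as a small standalone example (the helper names and the two-class, one-feature summaries below are illustrative, not taken from the program above):

```python
import math

def gaussian_pdf(x, mean, stdev):
    # Gaussian probability density of x under N(mean, stdev^2)
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def class_probabilities(summaries, inputvector):
    # summaries maps class label -> list of (mean, stdev), one per feature
    probabilities = {}
    for classvalue, classsummaries in summaries.items():
        probabilities[classvalue] = 1.0
        for i, (mean, stdev) in enumerate(classsummaries):
            probabilities[classvalue] *= gaussian_pdf(inputvector[i], mean, stdev)
    return probabilities

# Hypothetical model: class 0 centred at 1.0, class 1 at 5.0 (both stdev 0.5)
summaries = {0: [(1.0, 0.5)], 1: [(5.0, 0.5)]}
probs = class_probabilities(summaries, [1.2])
# probs[0] dominates, so an input of 1.2 would be predicted as class 0
```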

How could the classification accuracy of this example be improved?

To improve the classification accuracy, one could use a larger and more representative training dataset: the example uses only 9 rows split into 6 training rows and 3 testing rows, which is almost certainly insufficient for accurate learning. Feature scaling, handling missing values, and better feature selection could also improve performance, and it is worth testing different split ratios for the training and testing sets.

Why is separating the dataset by class important in Naïve Bayes classification?

Separating the dataset by class is important in Naïve Bayes classification because it allows the model to calculate statistics for each class independently, which is essential for determining the class-conditional probabilities. In the provided code, this is achieved by iterating through the dataset and grouping rows into a dictionary whose keys are class labels and whose values are lists of the data points belonging to each class.
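The grouping can be seen on a made-up three-row dataset (the rows here are invented for illustration):

```python
# Toy dataset: each row is [feature1, feature2, class_label]
dataset = [
    [1.0, 2.0, 0],
    [1.5, 1.8, 0],
    [5.0, 8.0, 1],
]

# Group rows by the class label in the last column
separated = {}
for vector in dataset:
    separated.setdefault(vector[-1], []).append(vector)
# separated now maps 0 -> two rows and 1 -> one row
```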

What role does the standard deviation play in Gaussian Naïve Bayes?

In the Gaussian Naïve Bayes classification process, the standard deviation measures the spread of feature values around the mean for each class, and it shapes the Gaussian distribution used when calculating the probability of a feature value. A smaller standard deviation gives a sharper peak in the distribution, while a larger one gives a wider distribution, which significantly affects the likelihood calculations.
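A quick numeric sketch of this effect (the means and standard deviations are arbitrary illustrative values):

```python
import math

def gaussian_pdf(x, mean, stdev):
    # Gaussian probability density of x under N(mean, stdev^2)
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

# At the mean, a smaller stdev gives a taller, sharper peak...
peak_narrow = gaussian_pdf(5.0, 5.0, 0.5)
peak_wide = gaussian_pdf(5.0, 5.0, 2.0)

# ...but far from the mean, the wider distribution assigns more density
tail_narrow = gaussian_pdf(8.0, 5.0, 0.5)
tail_wide = gaussian_pdf(8.0, 5.0, 2.0)
```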

What do the `summarize` and `summarizebyclass` functions do?

The `summarize` function calculates the mean and standard deviation for each attribute of the dataset, excluding the class value; these statistics are then used in the probability calculations. The `summarizebyclass` function first separates the dataset by class and then applies `summarize` to each class. Together they provide the statistical summary needed to compute the class-conditional probabilities that drive classification.
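A minimal standalone sketch of the summarising step (the two-row dataset is made up; note the sample standard deviation uses n − 1 in the denominator, as in the program above):

```python
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    # Sample standard deviation (n - 1 in the denominator)
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return variance ** 0.5

def summarize(dataset):
    # One (mean, stdev) pair per column; drop the class column
    summaries = [(mean(col), stdev(col)) for col in zip(*dataset)]
    del summaries[-1]
    return summaries

# Hypothetical rows: [feature, class_label]
rows = [[1.0, 0], [3.0, 0]]
stats = summarize(rows)  # one pair, roughly (2.0, 1.414); class column excluded
```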

How does the classifier compute the probability that a data point belongs to a class?

The Naïve Bayes classifier calculates the probability of a data point belonging to a specific class using the Gaussian probability density function. For each feature under each class, it evaluates the density using that class's mean and standard deviation for the feature. The per-feature probabilities are then multiplied together to get the total probability of the data point belonging to that class.

What are the implications of the random train/test splitting strategy?

The dataset splitting strategy uses random selection with a split ratio, so which data points end up in training versus testing varies between runs. This can produce different model results on each run unless results are averaged over multiple runs, and it may introduce bias if certain classes are under-represented in either set due to random sampling, which could skew training and reduce test accuracy.
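One common mitigation is to make the split reproducible by threading an explicit seed through it. This sketch adds a `seed` parameter that is not part of the original program:

```python
import random

def splitdataset(dataset, splitratio, seed=None):
    # An optional seed makes the random train/test split reproducible
    rng = random.Random(seed)
    trainsize = int(len(dataset) * splitratio)
    copy = list(dataset)
    trainset = []
    while len(trainset) < trainsize:
        trainset.append(copy.pop(rng.randrange(len(copy))))
    return trainset, copy

# Made-up 9-row dataset, mirroring the size used in the example output
data = [[float(i), i % 2] for i in range(9)]
train_a, test_a = splitdataset(data, 0.67, seed=42)
train_b, test_b = splitdataset(data, 0.67, seed=42)
# With the same seed, both splits are identical
```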

What are the main limitations of a Naïve Bayes classifier?

One of the main limitations of a Naïve Bayes classifier is its assumption of independence among features, which rarely holds in real-world data and can hurt performance. It also tends to work poorly with small datasets, as demonstrated here, since few samples lead to imprecise estimates of the mean and standard deviation and thus skewed probabilities. Performance may also suffer if feature distributions deviate significantly from Gaussian.

How does the `getaccuracy` function measure performance?

The `getaccuracy` function determines accuracy by comparing the predicted class labels with the actual class labels in the test set. It counts the number of correct predictions and expresses that count as a percentage of the total number of test instances.
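A compact standalone sketch of this computation, rewritten with `zip` (the toy test set and predictions are invented):

```python
def getaccuracy(testset, predictions):
    # Each test row's last value is its true class label
    correct = sum(1 for row, pred in zip(testset, predictions) if row[-1] == pred)
    return correct / float(len(testset)) * 100.0

testset = [[1.0, 0], [2.0, 1], [3.0, 1]]
predictions = [0, 1, 0]          # 2 of 3 correct
acc = getaccuracy(testset, predictions)
```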

How does Naïve Bayes handle continuous numeric features?

The Naïve Bayes algorithm handles continuous numeric input features by assuming they follow a Gaussian distribution. For each feature, the mean and standard deviation are calculated per class, and the probability of a given feature value is then determined from the Gaussian probability density function using those statistics.
