
PART A

(PART A: TO BE REFERRED BY STUDENTS)


Experiment No.04

Aim: Implementation of naïve Bayesian Classifier using Weka Tool.

Outcome:
After successful completion of this experiment, students will be able to:
1. Demonstrate an understanding of the importance of data mining.
2. Organize and prepare the data needed for data mining using preprocessing techniques.
3. Perform exploratory analysis of the data to be used for mining.
4. Implement appropriate data mining methods such as classification.

Theory:

Naïve Bayes:

Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the
“naïve” assumption of conditional independence between every pair of features given the value of the class
variable. Bayes’ theorem states the following relationship, given a class variable y and a dependent feature
vector x1 through xn:

P(y | x1, …, xn) = P(y) P(x1, …, xn | y) / P(x1, …, xn)

Using the naïve conditional independence assumption that

P(xi | y, x1, …, xi−1, xi+1, …, xn) = P(xi | y)

for all i, this relationship simplifies to

P(y | x1, …, xn) = P(y) ∏i=1..n P(xi | y) / P(x1, …, xn)

Since P(x1, …, xn) is constant given the input, we can use the following classification rule:

P(y | x1, …, xn) ∝ P(y) ∏i=1..n P(xi | y)

ŷ = argmax_y P(y) ∏i=1..n P(xi | y),
and we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(xi∣y); the former is then the
relative frequency of class y in the training set.
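
As a concrete illustration of this MAP rule, the following minimal Java sketch classifies a single toy instance with two categorical features. All counts and probabilities below are hypothetical values chosen for illustration only; this is the arithmetic behind the rule, not Weka code.

// Toy MAP classification with Naive Bayes (hypothetical counts, illustration only)
public class NaiveBayesToy {
    public static void main(String[] args) {
        // Priors estimated as relative class frequencies (e.g. 9 "yes", 5 "no" out of 14)
        double pYes = 9.0 / 14, pNo = 5.0 / 14;
        // Class-conditional probabilities for the two observed attribute values
        double pSunnyGivenYes = 2.0 / 9, pSunnyGivenNo = 3.0 / 5;
        double pWindyGivenYes = 3.0 / 9, pWindyGivenNo = 3.0 / 5;
        // Unnormalised posterior scores: P(y) * product over i of P(xi | y)
        double scoreYes = pYes * pSunnyGivenYes * pWindyGivenYes;
        double scoreNo  = pNo  * pSunnyGivenNo  * pWindyGivenNo;
        // MAP decision: predict the class with the larger score
        System.out.println(scoreYes > scoreNo ? "yes" : "no");
    }
}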

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of
P(xi∣y).
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in
many real-world situations, most famously document classification and spam filtering. They require only a
small amount of training data to estimate the necessary parameters.

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The
decoupling of the class-conditional feature distributions means that each distribution can be independently
estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse
of dimensionality.
Gaussian Naive Bayes:
When working with continuous data, a common assumption is that the continuous values associated with
each class are distributed according to a normal (Gaussian) distribution. The likelihood of the features is then
assumed to be

P(xi | y) = (1 / √(2π σy²)) exp( −(xi − μy)² / (2σy²) )

where μy and σy² are estimated from the training data for each class y. The variance is sometimes assumed
to be:

● independent of Y (i.e., σi),

● or independent of Xi (i.e., σk),

● or both (i.e., σ).

Gaussian Naive Bayes supports continuous-valued features and models each as conforming to a Gaussian
(normal) distribution.
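
As an illustration, the following minimal sketch (plain Java, not Weka API code) evaluates the Gaussian class-conditional likelihood above once a mean and variance have been estimated for each class. The per-class estimates and the observed value are hypothetical.

// Gaussian class-conditional likelihood (hypothetical per-class estimates)
public class GaussianLikelihood {
    // Normal density N(x; mu, sigma^2)
    static double gaussian(double x, double mean, double variance) {
        double diff = x - mean;
        return Math.exp(-diff * diff / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    public static void main(String[] args) {
        // Hypothetical per-class estimates for one continuous feature
        double muYes = 73.0, varYes = 38.0;   // class "yes"
        double muNo  = 74.6, varNo  = 62.0;   // class "no"
        double x = 66.0;                      // observed feature value
        System.out.println("P(x|yes) = " + gaussian(x, muYes, varYes));
        System.out.println("P(x|no)  = " + gaussian(x, muNo, varNo));
    }
}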
PART B
(PART B: TO BE COMPLETED BY STUDENTS)

Roll. No: A49 Name: Soham B. Gharat


Class: TE AI & DS Batch: A3
Date of Experiment: 16/08/2024 Date of Submission: 30/08/2024
Grade:

Input and Output:


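The experiment was performed in the Weka Explorer GUI. As a rough programmatic equivalent, the sketch below trains the NaiveBayes classifier using Weka's Java API; the file "data/weather.nominal.arff" refers to the sample dataset shipped with Weka and should be adjusted to your installation.

// Hedged sketch: training NaiveBayes with Weka's Java API
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // class = last attribute

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // Printing the model shows the per-class counts/estimates, much like
        // the "Classifier output" pane in the Explorer.
        System.out.println(nb);
    }
}
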
Observations and learning:
The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem with the assumption of
independence between features. Weka's user-friendly interface simplified data import, preprocessing, and
model training, making it straightforward to handle various datasets. The Naïve Bayes classifier, known for
its simplicity and efficiency, performed well even with the assumption of feature independence.

Conclusion:
In this experiment, we implemented a Naïve Bayesian Classifier using Weka, successfully training it to
classify data with good accuracy. The results demonstrated the model's efficiency and scalability for
classification tasks, with insights into areas for potential improvement.

Question of Curiosity:

Q.1] What types of datasets are suitable for the Naive Bayesian Classifier in Weka, and why?

Ans: Naive Bayes is particularly well-suited for datasets with the following characteristics:


1. Categorical Data: Naive Bayes works exceptionally well with categorical data because it estimates the
probability of each category independently. In Weka, datasets with categorical features allow the Naive
Bayes algorithm to directly calculate the probability of a given class based on the frequency of attribute
values.
2. Text Data (Bag-of-Words Representation): Naive Bayes is effective in text classification tasks, such as
spam detection or sentiment analysis, where data is often represented as a bag-of-words. Each word
(feature) is treated as independent, making the algorithm simple and fast for text data.
3. Low Dimensional Datasets: Naive Bayes performs well on datasets with a relatively small number of
features. In cases where the dimensionality is high but the features are sparse (like text data), the
independence assumption of Naive Bayes simplifies computation and often leads to good results.
4. Moderately Sized Datasets: It is particularly effective on small to medium-sized datasets where other,
more complex models might overfit. The simplicity of Naive Bayes allows it to generalize well even with
limited data.

The Naive Bayesian Classifier assumes that all features are independent given the class label (the so-called
"naive" assumption). This simplification makes the algorithm computationally efficient and easy to
implement, even with relatively simple data. Despite the independence assumption, Naive Bayes often yields
good results, especially when the independence assumption is approximately true or when the model's
simplicity outweighs the impact of any correlations between features.
Q.2] How do you preprocess a dataset in Weka before applying the Naive Bayesian Classifier?
Ans: Before applying the Naive Bayesian Classifier, the following preprocessing steps are recommended in
Weka (a short code sketch of these steps follows the list):
1. Data Cleaning: Remove or impute missing values using Weka's "Filter" option under the "Preprocess"
tab.
2. Discretization: If the dataset has continuous numerical attributes, consider discretizing them into
categorical intervals using filters like unsupervised.attribute.Discretize.
3. Attribute Selection: Use Weka's attribute selection filters to remove irrelevant or redundant features,
which can improve the model’s performance.
4. Normalization: Though Naive Bayes generally handles raw data well, normalization can be applied to
ensure that attributes are on a similar scale, especially when dealing with continuous data.
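
The sketch below applies two of the steps above (missing-value handling and discretization) with Weka's Java API; these are the same filters available from the Preprocess tab. The file name "weather.arff" is only a placeholder.

// Hedged sketch: preprocessing with Weka filters before Naive Bayes
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");      // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        // 1. Replace missing values with means/modes from the data
        ReplaceMissingValues missing = new ReplaceMissingValues();
        missing.setInputFormat(data);
        data = Filter.useFilter(data, missing);

        // 2. Discretize continuous attributes into categorical intervals
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        data = Filter.useFilter(data, discretize);

        System.out.println(data.numInstances() + " instances after preprocessing");
    }
}
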
Q.3] How can you interpret the confusion matrix generated by the Naive Bayesian Classifier in Weka?
Ans: The confusion matrix generated by the Naive Bayesian Classifier in Weka is a key tool for evaluating
the performance of your model. It can be interpreted as follows (a code sketch showing how to obtain it
programmatically appears after the list):
1. True Positives (TP): This value represents the number of instances that were correctly predicted as
belonging to the positive class. For example, if you are classifying emails as "spam" or "not spam," TP
would be the number of emails correctly identified as "spam."
2. True Negatives (TN): This value indicates the number of instances that were correctly predicted as
belonging to the negative class. Continuing with the spam example, TN would be the number of emails
correctly identified as "not spam."
3. False Positives (FP): These are the instances where the classifier incorrectly predicted the positive class
when it should have predicted the negative class. In the spam example, FP represents the number of
"not spam" emails that were incorrectly classified as "spam." This is also known as a Type I error.
4. False Negatives (FN): These are the instances where the classifier incorrectly predicted the negative
class when it should have predicted the positive class. In our example, FN would be the number of
"spam" emails that were incorrectly classified as "not spam." This is also known as a Type II error.
5. Interpretation:
• A high number of True Positives (TP) and True Negatives (TN) indicates that the classifier is
performing well.
• A high number of False Positives (FP) or False Negatives (FN) suggests areas where the model
is misclassifying data, potentially requiring further tuning or alternative modeling approaches.
• By analyzing the balance between Precision and Recall (via the F1-score), you can assess
whether the model is biased towards one class, which is particularly important in imbalanced
datasets.
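
The same confusion matrix, together with precision, recall, and the F1-score, can also be obtained programmatically. The following is a minimal sketch using Weka's Java API, assuming a placeholder dataset "weather.arff" with a nominal class attribute.

// Hedged sketch: cross-validating NaiveBayes and printing the confusion matrix
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConfusionMatrixSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");      // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation, as in the Explorer's "Classify" tab
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());   // accuracy and error rates
        System.out.println(eval.toMatrixString());    // the confusion matrix
        System.out.println("Precision: " + eval.precision(0));   // for class index 0
        System.out.println("Recall:    " + eval.recall(0));
        System.out.println("F1-score:  " + eval.fMeasure(0));
    }
}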
