
A Bayesian Model for Income Bracket Classification

Department of Chemical Engineering
IIT Madras
Chennai, India

Abstract—This paper explores the prediction of individuals' income levels based on the 1994 Census Bureau database by Ronny Kohavi and Barry Becker, using a Naive Bayes Classifier. The study focuses on determining whether a person's income exceeds $50,000, utilizing demographic and socio-economic attributes like education level, marital status, capital gains and losses, and more. The census data is cleaned and processed. A Naive Bayes Classifier is used for the predictive model and is evaluated using metrics like accuracy and precision, with confidence intervals obtained by bootstrapping. The classifier is effective in income prediction, and we emphasize its potential applications in decision-making processes in fields like social policy planning and targeted marketing. Overall, this research demonstrates the feasibility and significance of machine learning techniques in income classification.

Index Terms—naive Bayes, bootstrapping, 1994 census, Kohavi and Becker, cross-validation

I. INTRODUCTION

Income prediction is an important part of social policy planning and business marketing strategies. Accurately predicting an individual's income level enables more effective resource allocation, targeted assistance, and improved decision-making. Bayesian models offer a promising avenue for income classification, and in this study, we delve into the development and evaluation of a Naive Bayes Classifier for predicting income levels based on demographic and socio-economic features.

The data is taken from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker, containing information such as education level, marital status, and capital gains and losses. It offers a comprehensive view of the factors that may influence an individual's income. Using this dataset, our study aims to construct a robust predictive model capable of categorizing individuals into income groups: those earning more than $50,000 and those earning less.

The choice of a Naive Bayes Classifier is motivated by its simplicity, efficiency, and ability to handle categorical and continuous data. By exploiting conditional independence among attributes, the Naive Bayes Classifier provides an intuitive framework for modeling complex relationships in the data.

We first preprocess the data, imputing missing values and encoding categorical features. Additionally, we employ feature selection techniques to identify the most influential variables, improving the model's interpretability and efficiency.

The primary objective of this study is to evaluate the effectiveness of the Naive Bayes Classifier in predicting income levels based on the provided dataset. To achieve this, we employ rigorous evaluation metrics, including accuracy, precision, recall, and F1-score, while applying the Bootstrap Technique to assess the model's generalization capabilities.

II. DATA AND CLEANING

A. The Datasets

One dataset (adult.xlsx) was provided to train the Naive Bayes model. This dataset contained around 32,000 training samples. The target label was the binary class 'income-category', with a person's income either being above $50,000 or below it. The dataset contained a mixture of categorical and numerical variables. The descriptions of the features in the dataset are summarized in Table I.

TABLE I
Table of the features in the given datasets along with their descriptions. We observe that most variables are categorical, but there are some important numerical variables that could be powerful indicators of the income bracket.

Feature           Description          Type
age               Age                  Continuous
workclass         Work Class           Categorical (8)
fnlwgt            -                    Numerical
education         Lvl. of education    Categorical (16)
education-num     Years of education   Numerical
marital-status    Marital Status       Categorical (7)
occupation        Occupation           Categorical (14)
relationship      Relationship         Categorical (6)
race              Race                 Categorical (5)
sex               Gender               Categorical (2)
capital-gain      Capital Gain         Numerical
capital-loss      Capital Loss         Numerical
hours-per-week    Hours per week       Numerical
native-country    Native Country       Categorical (41)
income-category   Income Bracket       Categorical (2)

B. Data Cleaning

A pipeline is coded to take a dataset of the above format and a flag ('train' or 'test') and clean it. Persons with missing values in variables that cannot be imputed, such as 'income-category', are removed. We find that the placeholder for missing values is ' ?'. We do not drop any variables with missing data, instead choosing to impute them.

A Simple Imputer based on the most frequent value is used on the dataset to impute missing values. This largely preserves the variable distributions. Finally, the variables are converted to their appropriate types and the cleaned dataset is returned. No confounding symbols are present in the train or test data; we only find missing values.

There are multiple imputation techniques available. One can impute missing values with 0, with the mean or median, based on the k-NN of the data point, or by randomly sampling from the distribution of the variable. The Expectation Imputers distort the distribution of the imputed data about the expectation estimator used, when compared to the Random Sampling Imputer (RSI) and the KNN Imputer.

Unfortunately, the RSI is a slow imputation technique: either a prior distribution must be assumed and its parameters estimated from the data, or a non-parametric method such as a Kernel Density Estimate (KDE) must be used.

However, given that we are dealing with multiple categorical variables, we choose to use the most frequent value for imputation, given the KNN Imputer's difficulty in handling categorical variables.
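A minimal sketch of such a cleaning and imputation pipeline, assuming pandas and scikit-learn; the function name `clean`, the exact column handling, and the dtype conversions are illustrative rather than the authors' code:

```python
# Illustrative cleaning pipeline: ' ?' placeholders become NaN, rows with a
# missing target are dropped, and remaining gaps are filled with the most
# frequent value (a Simple Imputer), as described in Section II-B.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

CATEGORICAL = ["workclass", "education", "marital-status", "occupation",
               "relationship", "race", "sex", "native-country"]
NUMERICAL = ["age", "fnlwgt", "education-num", "capital-gain",
             "capital-loss", "hours-per-week"]
TARGET = "income-category"

def clean(df: pd.DataFrame, flag: str = "train") -> pd.DataFrame:
    df = df.replace(" ?", np.nan)
    if flag == "train":
        df = df.dropna(subset=[TARGET])          # the target cannot be imputed
    cols = CATEGORICAL + NUMERICAL
    imputer = SimpleImputer(strategy="most_frequent")
    df[cols] = imputer.fit_transform(df[cols])
    df[NUMERICAL] = df[NUMERICAL].apply(pd.to_numeric)   # restore numeric dtypes
    df[CATEGORICAL] = df[CATEGORICAL].astype("category")
    return df

train_df = clean(pd.read_excel("adult.xlsx"), flag="train")
```

For a 'test' dataset one would reuse the imputer fitted on the training data; the sketch fits it in place only for brevity.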

We can also observe this empirically. In Figs. 1-4, we present the Kernel Density Estimate (KDE) and Empirical Cumulative Distribution Function (ECDF) of the numerical variables in the train dataset, after imputation, for both categories. Finally, all categorical variables are encoded as features. In Figs. 5-8, we present the count plots of some categorical variables in the train dataset, after imputation.

Fig. 1. The probability and cumulative distributions of the Age of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Fig. 2. The probability and cumulative distributions of the FNL Weight of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Fig. 3. The probability and cumulative distributions of the Years of Education of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Fig. 4. The probability and cumulative distributions of the Hours per Week of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Fig. 5. The count plot of the various classes of Work Class for various persons, shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

III. METHODS

A. Naive Bayes Classifier


The Naive Bayes classifier is a probabilistic machine learning model based on Bayes' theorem. It is widely used for classification tasks, particularly in natural language processing and spam filtering. The key assumption of the Naive Bayes classifier is the naive assumption of conditional independence among features given the class label.

Let X = {x1, x2, . . . , xn} represent a set of features, and C represent the class label. Bayes' theorem relates the probability of the class given the features to the likelihood of the features given the class:

P(C|X) = P(X|C) · P(C) / P(X)    (1)

Here,
• P(C|X) is the posterior probability of class C given the features X.
• P(X|C) is the likelihood of the features given class C.
• P(C) is the prior probability of class C.
• P(X) is the marginal likelihood of the features.

The naive assumption in Naive Bayes is that the features are conditionally independent given the class label. Mathematically, this is expressed as:

P(X|C) = P(x1|C) · P(x2|C) · . . . · P(xn|C)    (2)

This simplifies the likelihood calculation, making it computationally more tractable.

Naive Bayes involves two main prior assumptions:
1) Class Prior (P(C)): This is the prior probability of each class, and it represents the likelihood of encountering each class in the absence of any feature information.
2) Feature Independence (P(X|C)): As per the naive assumption, features are assumed to be conditionally independent given the class label. This significantly simplifies the computation but may not hold in reality for all datasets.

The Bayes Error Rate is the lowest possible error rate that any classifier can achieve. It is given by:

Bayes Error Rate = 1 − max_i P(Ci|X)    (3)

where Ci is the i-th class. The Bayes Error Rate is a theoretical measure, and achieving it in practice is challenging due to the assumptions and limitations of real-world data.

Fig. 6. The count plot of the various classes of Race for various persons, shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

Fig. 7. The count plot of the various classes of Marital Status for various persons, shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.
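To make Eqs. (1)-(2) concrete, the following is a minimal sketch of Naive Bayes scoring in log-space for categorical features. The toy data, the Laplace smoothing constant, and the helper names are illustrative; the paper itself uses the Gaussian variant described later in this section:

```python
# Illustrative log-space Naive Bayes following Eqs. (1)-(2): the score of a
# class is log P(C) plus the sum of log P(x_i | C) over the features.
import math
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Estimate class priors P(C) and per-feature value counts for P(x_i|C)."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    counts = defaultdict(Counter)               # (class, feature index) -> value counts
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(c, i)][value] += 1
    return priors, counts

def predict(row, priors, counts, alpha=1.0):
    """Return the class maximising log P(C) + sum_i log P(x_i|C), Laplace-smoothed."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(row):
            value_counts = counts[(c, i)]
            total = sum(value_counts.values())
            n_values = len(value_counts) + 1    # crude estimate of the value-set size
            score += math.log((value_counts[value] + alpha) / (total + alpha * n_values))
        scores[c] = score
    return max(scores, key=scores.get)

rows = [("Bachelors", "Married"), ("HS-grad", "Never-married"), ("Masters", "Married")]
labels = ["<=50K", "<=50K", ">50K"]
priors, counts = fit_naive_bayes(rows, labels)
print(predict(("Masters", "Married"), priors, counts))
```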
Both Naive Bayes and Logistic Regression are popular classification algorithms, but they differ in their underlying assumptions and modeling approaches.

Naive Bayes assumes that features are conditionally independent given the class label, which simplifies the modeling process. In contrast, Logistic Regression does not make such a strong assumption about feature independence, allowing it to capture more complex relationships between features.

Logistic Regression models the relationship between the features and the log-odds of the outcome using a linear function. This allows it to handle non-linear relationships through feature engineering or higher-order terms. Naive Bayes, on the other hand, is a simpler model due to its assumption of feature independence.

Naive Bayes can handle missing data gracefully, since the conditional independence assumption allows it to ignore missing features when estimating probabilities. Logistic Regression may struggle with missing data, and imputation or other techniques may be necessary.

Fig. 8. The count plot of the various classes of Occupation for various persons, shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.
Logistic Regression provides interpretable coefficients for each feature, indicating the direction and strength of their influence on the outcome. Naive Bayes, due to its conditional independence assumption, does not provide such direct interpretability.

Naive Bayes is generally robust to irrelevant features since it assumes independence. Logistic Regression may be sensitive to irrelevant features, and feature selection techniques might be necessary.

Naive Bayes often performs well on small datasets, and its simplicity makes it computationally efficient. Logistic Regression may require larger datasets to capture complex relationships effectively.

The choice between Naive Bayes and Logistic Regression depends on the nature of the data, the assumptions that can reasonably be made, and the desired level of interpretability. Naive Bayes is a good choice for simple and small-scale problems, while Logistic Regression is more flexible and suitable for situations where feature independence cannot be assumed to hold.

In conclusion, Naive Bayes is a simple yet powerful classifier based on Bayes' theorem. Its effectiveness is influenced by the naive assumption of feature independence and the prior assumptions about class probabilities. Understanding these assumptions and their implications is crucial for effectively applying Naive Bayes in various machine learning tasks.

The Gaussian Naive Bayes Classifier is suitable for continuous features and can handle multivariate Gaussian distributions efficiently. It is an extension of the basic Naive Bayes Classifier and is particularly effective when the data distribution aligns with the Gaussian assumption.

In this paper, we use the Gaussian Naive Bayes Classifier to predict income categories based on the 1994 Census Bureau database. We assess its performance using various evaluation metrics to determine its suitability for the task at hand.

B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier. Some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives. In machine learning, the evaluation of classification models is crucial to assess their performance; several metrics provide insights into different aspects of a classifier's effectiveness. This section discusses key classification metrics, including Accuracy, Precision, Recall, F1 Score, and the Area Under the Receiver Operating Characteristic (AUROC) curve.

C. Accuracy

Accuracy is a fundamental metric that measures the overall correctness of predictions. It is defined as the ratio of correctly predicted instances to the total number of instances:

Accuracy = Number of Correct Predictions / Total Number of Predictions    (4)

D. Precision

Precision is a metric that focuses on the accuracy of positive predictions. It measures the ratio of correctly predicted positive instances to the total number of instances predicted as positive:

Precision = True Positives / (True Positives + False Positives)    (5)

E. Recall

Recall, also known as Sensitivity or True Positive Rate, emphasizes the ability of a model to capture all positive instances. It is defined as the ratio of correctly predicted positive instances to the total number of actual positive instances:

Recall = True Positives / (True Positives + False Negatives)    (6)

F. F1 Score

The F1 Score is the harmonic mean of Precision and Recall. It provides a balanced measure that considers both false positives and false negatives. The formula for the F1 Score is:

F1 Score = 2 × Precision × Recall / (Precision + Recall)    (7)

G. AUROC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at various classification thresholds. The Area Under the ROC Curve (AUROC) summarizes performance across all possible threshold values and quantifies the model's overall performance: a model with a higher AUROC score is better at discriminating between positive and negative instances.

Fig. 9. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity. Depending on the problem, we may be required to optimize for only one.

These classification metrics offer a comprehensive evaluation of a model's performance. While Accuracy provides an overall view, Precision, Recall, and F1 Score focus on specific aspects of classification. The AUROC curve and its associated score are particularly useful for binary classification tasks, providing insights into the model's ability to discriminate between classes.
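As a minimal illustration of Eqs. (4)-(7), the metrics can be computed directly from confusion-matrix counts; the numbers below are placeholders, not the results reported in Section IV:

```python
# Illustrative computation of Eqs. (4)-(7) from a binary confusion matrix.
# The counts are made up for demonstration; they are not the paper's results.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + fp + tn + fn)            # Eq. (4)
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # Eq. (5)
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # Eq. (6)
    f1 = (2 * precision * recall / (precision + recall)   # Eq. (7)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(classification_metrics(tp=120, fp=40, tn=600, fn=90))
```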
H. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique used in linear algebra and numerical analysis. It plays a crucial role in various applications, including dimensionality reduction, data compression, and solving linear systems of equations.

Given an m × n matrix A, the Singular Value Decomposition of A is given by:

A = U Σ V^T    (8)

where:
• U is an m × m orthogonal matrix.
• Σ is an m × n diagonal matrix with non-negative real numbers on the diagonal, known as the singular values of A.
• V is an n × n orthogonal matrix, and V^T is its transpose.

The SVD can be calculated using various methods, such as the power iteration method or the Jacobi method. The singular values in Σ are typically arranged in descending order.
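A small numerical check of the factorization in Eq. (8), assuming NumPy; the matrix is synthetic and deliberately includes a linearly dependent column, so the near-zero singular value it produces previews how such values signal redundant features in Section IV-A:

```python
# Verify A = U Σ V^T numerically and inspect the singular values (NumPy assumed).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
A[:, 3] = A[:, 0] + 2 * A[:, 1]          # make one column linearly dependent

U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstruction = U @ np.diag(s) @ Vt

print(np.allclose(A, reconstruction))    # True: the factorization reproduces A
print(s)                                 # last singular value is ~0 (redundant column)
```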
For a given matrix A, the singular values σ1, σ2, . . . , σp (where p = min(m, n)) are the square roots of the eigenvalues of A^T A (or A A^T). The columns of U are the corresponding eigenvectors of A A^T, and the columns of V are the corresponding eigenvectors of A^T A.

SVD is widely used for dimensionality reduction. By retaining only the first k singular values and their corresponding columns in U and V, one can approximate the original matrix A with reduced rank. This is particularly useful in applications like image compression.

The SVD also provides a way to compute the pseudo-inverse of a matrix. If A = U Σ V^T, then the pseudo-inverse of A is given by A^+ = V Σ^+ U^T, where Σ^+ is obtained by taking the reciprocal of the non-zero singular values in Σ.

SVD is closely related to the eigenvalue decomposition of a symmetric matrix. For a symmetric matrix M, the eigendecomposition is M = Q Λ Q^T, where Q is an orthogonal matrix of eigenvectors and Λ is a diagonal matrix of eigenvalues. The SVD of M can be expressed as M = U Σ U^T, where U contains the eigenvectors of M M^T and Σ contains the square roots of the eigenvalues of M M^T.

SVD is a powerful mathematical tool with a wide range of applications in various fields. Its ability to decompose a matrix into its constituent parts facilitates numerous computational and analytical tasks, making it a cornerstone in the field of linear algebra.

IV. RESULTS

A. Existence of Linear Relationships among income factors

Exploratory analysis of the Independent Variables indicates the existence of linear relationships between them. This could allow us to losslessly reduce the number of independent variables used in our model. This is evident from Fig. 10, where the singular values of the Independent Variables dataset are presented. Three linear relationships exist between the variables.

Fig. 10. Singular values of the Independent Variables are presented. The last three singular values are of order < 10^-13 and can be considered to be 0. This allows us to losslessly remove up to three variables from the dataset.

The correlation heatmap for the independent variables is shown in Fig. 11. We observe several variables that are perfectly correlated with each other. This is an artefact of our encoding method: when we encoded our categorical variables, at least one class will be highly correlated with all other classes. For example, in our 'sex' feature, only the 'M' and 'F' classes are present. If a sample has the 'sex' attribute 'M', then it cannot have 'F', making the two classes, which have now become features, perfectly negatively correlated.

To verify this, we plot the heatmap of only the numerical features in Fig. 12. We find no correlation between them, confirming our suspicion.

B. Naive Bayes is a fast and accurate classifier

To train and evaluate our Naive Bayes model, we split our train data into train and validation splits. This is done using a fixed random seed for replicability, with 20% of our given data in the validation split.
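A minimal sketch of this split, together with the bootstrap evaluation described in the next paragraph, assuming scikit-learn; `train_df` refers to the cleaned dataframe from the earlier sketch, and the one-hot encoding step and the label string '>50K' are illustrative assumptions rather than the authors' exact code:

```python
# Sketch of the 80/20 split, a Gaussian Naive Bayes fit, and bootstrap CIs.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X = pd.get_dummies(train_df.drop(columns=["income-category"]))
y = (train_df["income-category"] == ">50K").astype(int)   # label string assumed

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42)                # fixed seed for replicability

model = GaussianNB().fit(X_tr, y_tr)
preds = model.predict(X_val)

rng = np.random.default_rng(42)
scores = []
for _ in range(1000):                                      # 1000 bootstrap resamples
    idx = rng.integers(0, len(y_val), len(y_val))
    scores.append(accuracy_score(y_val.iloc[idx], preds[idx]))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"Accuracy 95% CI: ({lo:.2f}, {hi:.2f})")
```

The same resampling loop can be reused for precision, recall, and F1 by swapping the metric function.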
The Naive Bayes model is first trained on the train split without any regularization. We then bootstrap the validation set (1000 bootstrap samples) and compute the evaluation metrics presented in Section III-B. We provide the 95% CIs for our evaluation metrics in Table II. The probability distributions and ECDFs of our evaluation metrics are presented in Figs. 13-16.

TABLE II
Evaluation metrics of the Naive Bayes classifier. We find that Accuracy and Precision are reasonably high. The variance in these estimates is also acceptable.

Metric       Value   95% CI
Accuracy     0.80    (0.79, 0.81)
Precision    0.68    (0.64, 0.72)
Recall       0.32    (0.29, 0.35)
F1 Score     0.43    (0.40, 0.46)

The ROC curve for the Naive Bayes Classifier is shown in Fig. 17. We find that it performs significantly better than a random classifier.

Fig. 11. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of independent variables. The color gradient indicates the magnitude of the correlation between the variables.

Fig. 12. The correlation heatmap between all numerical independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of numerical independent variables. The color gradient indicates the magnitude of the correlation between the variables.

Fig. 13. The left plot contains the histogram of the accuracy obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the accuracy obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 14. The left plot contains the histogram of the recall obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the recall obtained for each bootstrap sample from the validation split. We find that the metric is moderate and its variance is acceptable.

Fig. 15. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.
Fig. 16. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is moderate and its variance is acceptable.

Fig. 17. The Receiver Operating Characteristic curve obtained for the Naive Bayes classifier. We find that we can achieve a good True Positive Rate with a small False Positive Rate, indicating that our classifier is robust to class imbalances. We also find that the classifier is significantly better than a random classifier.

V. DISCUSSION

Our analysis indicates that the Gaussian Naive Bayes Classifier provides good performance in predicting income levels based on the 1994 Census Bureau database. We observe that our classifier has high precision. This suggests that the classifier is particularly adept at minimizing false positives, which are instances where it predicts a higher income when that is not the case. High precision is crucial in scenarios such as targeted marketing, where false positives can result in inefficient resource allocation.

While our classifier demonstrates high precision, it is important to acknowledge that its recall falls in the medium range. This implies that the classifier is effective at capturing a substantial portion of individuals with incomes above $50,000 but may miss some such instances. In other words, there is a trade-off between precision and recall. The balance between these two metrics depends on the specific application context. In cases where identifying all high-income individuals is critical, further model refinement may be needed to enhance recall.

VI. CONCLUSIONS AND FUTURE WORK

The classifier exhibits high precision, indicating its ability to make accurate predictions when identifying individuals with incomes exceeding $50,000. This precision ensures that resources are efficiently allocated to those who genuinely qualify for certain programs or benefits.

While precision is high, we observed a trade-off with recall, which falls in the medium range. This means that while the classifier excels at minimizing false positives, it may miss some high-income individuals. The balance between precision and recall should be carefully considered based on the specific application's priorities.

There is room for improvement in terms of recall without significantly sacrificing precision. Future work should focus on refining the model to better capture high-income individuals. This could involve feature engineering, incorporating additional data sources, or exploring alternative machine learning algorithms.

Ensemble methods and interpretability techniques such as SHAP values can be incorporated into the classifier model. Future work must also consider the socio-economic implications of using these models when deciding public policy and economic planning. Temporal data may also provide a more comprehensive picture.
