DAL Assignment 3 Endsem
Classification
Department of Chemical Engineering
IIT Madras
Chennai, India
Abstract—This paper explores the prediction of individuals' income levels based on the 1994 Census Bureau database by Ronny Kohavi and Barry Becker, using a Naive Bayes Classifier. The study focuses on determining whether a person's income exceeds $50,000, utilizing demographic and socio-economic attributes such as education level, marital status, capital gains and losses, and more. The census data is cleaned and processed. A Naive Bayes Classifier is used for the predictive model and is evaluated using metrics like accuracy and precision by cross-validation. The classifier is effective in income prediction, and we emphasize its potential applications in decision-making processes in fields like social policy planning and targeted marketing. Overall, this research demonstrates the feasibility and significance of machine learning techniques in income classification.

Index Terms—naive Bayes, bootstrapping, 1994 census, Kohavi and Becker, cross-validation

I. INTRODUCTION
Income prediction is an important part of social policy planning and business marketing strategies. Accurately predicting an individual's income level enables more effective resource allocation, targeted assistance, and improved decision-making. Bayesian models offer a promising avenue for income classification, and in this study, we delve into the development and evaluation of a Naive Bayes Classifier for predicting income levels based on demographic and socio-economic features.
The data is taken from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker, containing information such as education level, marital status, and capital gains and losses. It offers a comprehensive view of the factors that may influence an individual's income. Using this dataset, our study aims to construct a robust predictive model capable of categorizing individuals into income groups: those earning more than $50,000 and those earning less.

The choice of a Naive Bayes Classifier is motivated by its simplicity, efficiency, and ability to handle categorical and continuous data. By exploiting conditional independence among attributes, the Naive Bayes Classifier provides an intuitive framework for modeling complex relationships in the data.
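Concretely, for a feature vector (x_1, ..., x_d) and income class y, the conditional-independence assumption gives the standard factorization

    P(y | x_1, ..., x_d) ∝ P(y) · ∏_{j=1}^{d} P(x_j | y),

where each class-conditional P(x_j | y) is estimated from category frequencies for categorical features, while continuous features such as age use a Gaussian density N(x_j; μ_{j,y}, σ²_{j,y}) in the Gaussian variant of the classifier. The predicted class is the one with the larger posterior.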
We first preprocess the data, imputing missing values and encoding categorical features. Additionally, we employ feature selection techniques to identify the most influential variables, improving the model's interpretability and efficiency.

The primary objective of this study is to evaluate the effectiveness of the Naive Bayes Classifier in predicting income levels based on the provided dataset. To achieve this, we employ rigorous evaluation metrics, including accuracy, precision, recall, and F1-score, while applying the Bootstrap Technique to assess the model's generalization capabilities.

II. DATA AND CLEANING

A. The Datasets

One dataset (adult.xlsx) was provided to train the Naive Bayes model. This dataset contained around 32,000 training samples. The target label was the binary class 'income-category', with a person's income either being above $50,000 or below it. The dataset contained a mixture of categorical and numerical variables. The features in the dataset are summarized in Table I.

TABLE I
TABLE OF THE FEATURES IN THE GIVEN DATASETS ALONG WITH THEIR DESCRIPTIONS. WE OBSERVE THAT MOST VARIABLES ARE CATEGORICAL, BUT THERE ARE SOME IMPORTANT NUMERICAL VARIABLES THAT COULD BE POWERFUL INDICATORS OF THE INCOME BRACKET.

Feature          Description         Type
age              Age                 Continuous
workclass        Work Class          Categorical (8)
fnlwgt           -                   Numerical
education        Lvl. of education   Categorical (16)
education-num    Years of education  Numerical
marital-status   Marital Status      Categorical (7)
occupation       Occupation          Categorical (14)
relationship     Relationship        Categorical (6)
race             Race                Categorical (5)
sex              Gender              Categorical (2)
capital-gain     Capital Gain        Numerical
capital-loss     Capital Loss        Numerical
hours-per-week   Hours per week      Numerical
native-country   Native Country      Categorical (41)
income-category  Income Bracket      Categorical (2)

B. Data Cleaning

A pipeline is coded to take a dataset of the above format and a flag ('train' or 'test') and clean it. Persons with variables that cannot be imputed, such as 'income-category', having missing values are removed. We find that the placeholder for missing values is ' ?'. We do not drop any variables with missing data, instead choosing to impute them.
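A minimal sketch of such a pipeline is given below, assuming pandas; the function name and the most-frequent imputation rule are illustrative choices consistent with the "Most Freq. Imputation" used in the figures, not the original code.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame, flag: str = "train") -> pd.DataFrame:
    """Clean a dataset of the format in Table I."""
    # The placeholder for missing values is ' ?'; convert it to NaN.
    df = df.replace(" ?", np.nan)

    # The target cannot be imputed: drop persons whose
    # 'income-category' is missing (relevant for the train flag).
    if flag == "train":
        df = df.dropna(subset=["income-category"])

    # No variable is dropped; remaining missing entries are imputed
    # with the most frequent value of their column.
    for col in df.columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

train_df = clean(pd.read_excel("adult.xlsx"), flag="train")
```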
Fig. 3. The probability and cumulative distributions of the Years of Education of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.

Unfortunately, the RSI is a slow imputation technique. Either a prior distribution must be assumed and its parameters estimated from data, or a non-parametric method such as a Kernel Density Estimate (KDE) can be used.
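The two routes can be sketched as follows, using NumPy/SciPy on a toy 'age' column; the variable names and the Gaussian choice for the parametric route are assumptions for illustration, not the paper's code.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy column with missing entries standing in for the census data.
df = pd.DataFrame({"age": [25, 38, 52, np.nan, 41, np.nan, 30, 47]})
observed = df["age"].dropna().to_numpy()
n_missing = int(df["age"].isna().sum())
rng = np.random.default_rng(0)

# Parametric route: assume a prior family (here Gaussian) and
# estimate its parameters from the observed data.
mu, sigma = observed.mean(), observed.std()
parametric_draws = rng.normal(mu, sigma, size=n_missing)

# Non-parametric route: fit a Gaussian KDE to the observed data
# and sample imputed values from it.
kde = stats.gaussian_kde(observed)
kde_draws = kde.resample(n_missing, seed=0).ravel()

df.loc[df["age"].isna(), "age"] = kde_draws  # impute with KDE samples
```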
Fig. 1. The probability and cumulative distributions of the Age of the various persons are plotted. The left image contains the KDE of the data after Most Freq. Imputation for both classes. The right image shows the ECDFs of the data after Most Freq. Imputation for both classes.
categorical variables in the train dataset, after imputation.

Fig. 5. The count plot of the various classes of Work Class for various persons is shown after Most Frequent Imputation. Unlike numerical variables, categorical variables are not visualized well using density plots.

III. METHODS
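The modeling step can be set up as in the sketch below; scikit-learn's GaussianNB with ordinal encoding of categorical features is shown as one plausible implementation, since the text does not specify the library or encoding scheme used.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# train_df is the cleaned frame from the pipeline in Section II-B.
X = train_df.drop(columns=["income-category"])
y = train_df["income-category"]

# Encode categorical columns as integers so every feature can be
# handled by the Gaussian class-conditional model.
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])

# Hold out a validation split for the bootstrap evaluation below.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = GaussianNB().fit(X_train, y_train)  # no regularization
```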
TABLE II
EVALUATION METRICS OF THE NAIVE BAYES CLASSIFIER. WE FIND THAT ACCURACY AND PRECISION ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.
Fig. 11. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of independent variables. The color gradient indicates the magnitude of the correlation between the variables.

Fig. 12. The correlation heatmap between all numerical independent variables. This was obtained by finding the pairwise correlation coefficient between each pair of numerical independent variables. The color gradient indicates the magnitude of the correlation between the variables.
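Heatmaps of this kind can be reproduced with a short pandas/seaborn sketch, shown below for the numerical variables of Fig. 12; seaborn and the styling options are assumptions, not the paper's plotting code.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation between the numerical independent
# variables of the cleaned training frame (train_df from above).
corr = train_df.select_dtypes(include="number").corr()

sns.heatmap(corr, cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.title("Correlation between numerical independent variables")
plt.tight_layout()
plt.show()
```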
Fig. 13. The left plot contains the histogram of the accuracy obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the accuracy obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 14. The left plot contains the histogram of the recall obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the recall obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.
Fig. 15. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

The Naive Bayes model is first trained on the train split without any regularization. We then bootstrap the validation set (1,000 bootstrap samples) and compute the evaluation metrics presented in Section III-B. We provide the 95% CIs for our evaluation metrics in Table II. The probability distributions and ECDFs of our evaluation metrics are shown in Figs. 13-15. The ROC curve for the Naive Bayes Classifier is shown in Fig. 17. We find that it performs significantly better than a random classifier.
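The bootstrap step can be sketched as follows, using percentile intervals over 1,000 resamples; the names model, X_val, and y_val continue the earlier sketches and are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_true = y_val.to_numpy()
y_pred = model.predict(X_val)

# Resample (y_true, y_pred) pairs with replacement 1000 times and
# recompute the metric on each bootstrap sample.
n, B = len(y_true), 1000
boot_acc = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot_acc[b] = accuracy_score(y_true[idx], y_pred[idx])

# 95% percentile confidence interval, as reported in Table II;
# precision, recall, and F1 follow the same recipe.
lo, hi = np.percentile(boot_acc, [2.5, 97.5])
print(f"accuracy: {boot_acc.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```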
Fig. 17. The Receiver Operating Characteristic curve obtained for the Naive Bayes classifier. We find that we can achieve a good True Positive Rate with a small False Positive Rate, indicating that our classifier is robust to class imbalances. We also find that the classifier is significantly better than a random classifier.

V. DISCUSSION

Our analysis indicates that the Gaussian Naive Bayes Classifier provides good performance in predicting income levels based on the 1994 Census Bureau database. We observe that our classifier has high precision. This suggests that the classifier is particularly adept at minimizing false positives, which are instances where it predicts a higher income when that is not the case. High precision is crucial in scenarios such as targeted marketing, where false positives can result in inefficient resource allocation.

VI. CONCLUSIONS AND FUTURE WORK

The classifier exhibits high precision, indicating its ability to make accurate predictions for identifying individuals with incomes exceeding $50,000. This precision ensures that resources are efficiently allocated to those who genuinely qualify for certain programs or benefits.