
International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169

Volume: 5 Issue: 4, pp. 451-455


______________________________________________________________________________________________

Comparative Analysis of Classification Models on Income Prediction

1Bhavin Patel, 2V. Kakulapati, 3VVSSS Balaram
1,2,3Sreenidhi Institute of Science & Technology,
Yamnampet, Ghatkesar, Hyderabad, India
[email protected], [email protected], [email protected]

Abstract: Predictive analytics can be described simply as the practice of using past data systematically to predict future outcomes. It is the branch of advanced analytics used to make predictions about unknown future events, drawing on techniques from data mining, statistics, modeling, machine learning and artificial intelligence. It extracts information from data and uses it to predict trends and behavior patterns, and it can be applied to any event of interest, whether past, present or future. It applies statistical algorithms and machine learning techniques to estimate the likelihood of future outcomes based on historical data. Income determination is an important application of predictive analytics in which customers are segmented on the basis of demographic data. In this paper, we address this problem with a novel approach using different classification techniques, aiming to minimize the risk and cost involved in predicting income levels. We demonstrate the performance of each algorithm on the task of identifying customers, and we additionally analyze true positives, false negatives, scored labels and scored probabilities.

Keywords: Predictive Analytics, Statistics, Machine Learning, Data Mining, Classification

__________________________________________________*****_________________________________________________

I. INTRODUCTION

With the power to predict income shifts based on past transactions, an income prediction model can provide a greater understanding of consumer and market behavior. With that intelligence, programs and apps can be targeted at specific income demographics, and the right offers can be promoted to the right consumers.

Sales and marketing campaigns often require buyers with certain levels of disposable income for their products and solutions, yet most campaigns are limited in their ability to produce accurate results. Almost all companies use some form of statistical modeling and regression to identify the potential customers of value to them. They use tools such as the open-source R language, with packages such as ggplot2 to visualize trends, or scripting languages such as Python to build models. Retailers, however, generally need more informed decisions about the types of products they should stock, and to accomplish this they seek out commercial tools that are expensive.

The absence of good data is the greatest obstacle for organizations trying to use predictive analytics. To make more accurate predictions [1], they need attributes of the products that buyers have purchased in the past, together with demographic attributes of the customer such as age, gender and location.

Regression analysis [2] is the primary tool organizations use for predictive analytics. For instance, an analyst may hypothesize a set of independent variables that are statistically associated with the purchase of a product; using regression, how much each variable affects the behavior can then be determined. In this paper, we examine an emerging yet popular ML tool, Azure Machine Learning Studio, which can assist us in understanding such similarities and differences.

II. RELATED WORK

There are two major machine learning task types: classification, which assigns each record to one of a set of predefined classes, and prediction, which estimates a continuous-valued function. The goal of classification is to calculate the target class precisely for every record in a given data set.

SVM and PCA [3] have been used to build and evaluate income prediction models on the Current Population Survey published by the U.S. Census Bureau; a detailed statistical review focused on relevant feature selection was found to increase efficiency and even improve classification accuracy.

Vrushali [4] investigated the application of several algorithms, including J48, Naive Bayes and Random Forest classifiers, evaluated on a fertility index to improve algorithm performance.

According to Y. Bengio et al. [5], neural networks are suitable in data-rich settings and are typically used to extract knowledge embedded in the data in the form of rules.
IJRITCC | April 2017, Available @ http://www.ijritcc.org
These networks also support quantitative assessment of the extracted rules, clustering, self-organization, classification and regression, and they have an advantage over other kinds of machine learning algorithms in how they scale.

S. Archana et al. [6] surveyed the major classification algorithms, including C4.5, the k-nearest neighbor classifier, Naive Bayes, SVM and IB3, and explained the benefits and drawbacks of each.

III. OUR APPROACH

To develop a predictive model, our experiments use a sample data set from the UCI repository, where thousands of data sets belonging to different categories are available for free use and download. The data set contains attributes tagged as continuous, since they hold numerical values, while other attributes take values from a predefined list. The sample Census Income data set is summarized as follows:

Data set characteristics: Multivariate
Number of instances: 48842
Attribute characteristics: Categorical, Integer
Number of attributes: 14
Associated tasks: Classification
Missing values: Yes

Upon successful selection of the data set, the next major step is to create an environment in which the data can be modeled and tested. To accomplish this, missing values and inconsistencies (noise and outliers) must be removed, so that there is no duplication or unnecessary repetition and only valid values remain. Missing values must either be updated with accurate values or dropped in order to obtain accurate results. The Clean Missing Data module built into ML Studio can be used here, selecting an appropriate cleaning mode such as: Custom Substitution Value, Replace with Mean/Median/Mode, Remove Entire Row, or Remove Entire Column.

The next step is to split the data set, partitioning it into training data (used to generate an analytical model from patterns found in historical data) and validation data (used to test the new predictive model against known outcomes). Here again a splitting mode can be chosen, such as Split Rows, Recommender Split, Regular Expression or Relative Expression.

IV. PREDICTIVE ANALYSIS

Now we insert an ML model so that we can train it and use it to evaluate the data. The question is not whether you can find the answers; the question is how. Depending on what you find, you can choose the appropriate algorithm. For instance:

Fig 4.1. Framework of predictive analytics for income prediction

4. i. Neural Network
This is a useful classification algorithm built on the concept of neurons, which loosely model the working of the human brain. In this classification process, data values are represented by neurons and connectivity is represented by synapses. It is a layered approach with two endpoint layers, the input layer and the output layer. Between these two, the model also has intermediate layers called hidden layers. A weight is assigned at each layer: the middle layers of the network assign weights to the different input values so that effective classification can be done. The data set feeds the input layer and is characterized by the network nodes; predictor weights are applied to these nodes in the hidden layer, which defines the degree of connectivity between the nodes. After the weights are adjusted, the output layer yields the final result [7, 8].

4. ii. Support Vector Machine
SVM is another robust and successful classification algorithm. It works as a linear separator between data points to identify two different classes in a multidimensional space. The main aim of SVM is to maximize the margin between the classes while minimizing the distance between points and the boundary. SVM models the interactions among the features, dividing the data set into two vector sets in an n-dimensional space. The algorithm builds a hyperplane against which every element is compared.
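To make the linear-separator idea concrete, the following is a minimal sketch of our own (not from the paper) using scikit-learn's SVC in place of the Azure ML Studio module used in the experiment; the two features and the labels are synthetic, chosen only to illustrate a separable income task:

```python
from sklearn.svm import SVC

# Toy data: [age, hours-per-week]; label 1 = income >50K, 0 = <=50K.
# These six records are fabricated for illustration.
X = [[25, 20], [30, 25], [28, 22], [45, 60], [50, 55], [48, 65]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel learns the separating hyperplane described above.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Scored labels for two unseen records.
print(clf.predict([[27, 21], [49, 62]]))  # -> [0 1]
```

On real data, the same call would be preceded by the cleaning and splitting steps described in Section III.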
The hyperplane performs the data separation: the classes are identified by a largest-distance analysis, and to reduce the error rate a largest-margin classifier is defined. The analysis also considers the margin vectors along with the support vectors [9].

4. iii. Regression
Regression generates an equation to describe the statistical relationship between one or more predictors and the response variable, and to predict new observations.

a. Ordinal Regression
This is used when the label or target column contains numbers, but the numbers represent a ranking or order rather than a numeric measurement. Predicting ordinal numbers requires a different design than predicting values on a continuous scale, because the numbers assigned to represent rank order have no intrinsic scale.

b. Poisson Regression
This is a special kind of regression typically used to model counts, for example approximating the number of emergency service calls during an event, or estimating customer inquiries generated by a promotion. Because the response variable has a Poisson distribution, the underlying assumptions about the probability distribution differ from those of least-squares regression, and Poisson models must be interpreted differently from other regression models.

c. Linear Regression
For prediction tasks, linear regression is a good choice. This type of regression is likely to work well on high-dimensional, sparse data sets that lack complexity. In Azure Machine Learning Studio it is used to solve regression problems; the typical problem involves a single independent variable and a dependent variable, which is called simple regression.

d. Bayesian Linear Regression
In statistics, the Bayesian approach to regression is often distinguished from the frequentist approach. The Bayesian approach supplements linear regression with additional information in the form of a prior probability distribution: prior information about the parameters is combined with a likelihood function to produce estimates for the parameters. In contrast, the frequentist approach, represented by ordinary least-squares linear regression, assumes that the data contain sufficient measurements to create a meaningful model.

e. Neural Network Regression
A neural network can be viewed as a weighted directed acyclic graph. The nodes of the graph are organized in layers and are connected by weighted edges to nodes in the following layer. The first layer is the input layer and the final layer is the output layer; for a regression model, the output layer contains a single node. The remaining layers are the hidden layers. To compute the output of the network for a given input case, a value is calculated for every node in the hidden layers and in the output layer: for each node, the value is set by computing the weighted sum of the values of the nodes in the previous layer and applying an activation function to that weighted sum.

The structure of a neural network model is described by the following attributes:
the number of hidden layers
the number of nodes in each hidden layer
the connections between the layers
the choice of activation functions
the weights on the graph edges

The topology of the graph and the activation function are chosen by the user; the weights on the edges are determined by training the neural network on the input data. These networks can be computationally expensive, owing to the many hyper-parameters and the variety of possible network topologies. Although neural networks often produce better results than other algorithms, obtaining such results may involve a considerable number of sweeps (iterations) over the hyper-parameters.

f. Decision Forest Regression
Decision trees are non-parametric models that perform a sequence of simple tests for each instance, traversing a binary tree data structure until a leaf node is reached. They are efficient in both computation and memory usage during training and prediction, they can represent non-linear decision boundaries, and they perform integrated feature selection and classification while remaining resilient in the presence of noisy features. A decision forest regression model consists of an ensemble of decision trees: each tree in the forest outputs a Gaussian distribution as its prediction, and an aggregation is performed over the ensemble to find the Gaussian distribution closest to the combined distribution of all trees in the model.

g. Boosted Decision Tree Regression
Boosting is one of several classic methods for creating ensemble models, along with bagging, random forests, and so forth. Boosted decision trees in AML Studio use an efficient implementation of gradient boosting.
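As an illustration of the stage-wise idea behind boosted regression trees, here is a minimal sketch of ours using scikit-learn's GradientBoostingRegressor as a stand-in for the AML Studio module; the age/income figures are fabricated for the example:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Toy regression data: income (in $1000s) as a rough function of age.
X = [[22], [25], [30], [35], [40], [45], [50], [55]]
y = [18.0, 22.0, 30.0, 38.0, 45.0, 50.0, 52.0, 53.0]

# Each of the 100 shallow trees is built stage-wise: the squared-error
# loss measures the residual at every step and the next tree corrects it.
model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0)
model.fit(X, y)

# Estimated income for a 33-year-old, interpolated by the ensemble.
print(model.predict([[33]]))
```

Lowering `learning_rate` while raising `n_estimators` is the usual way to trade training time for a smoother fit.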

Gradient boosting is a machine learning technique for solving regression problems; the MART implementation builds each successive tree in a stage-wise fashion, using a predefined loss function to measure the error at every step and correct for it in the next. The prediction model is therefore an ensemble of weaker prediction models. In regression problems, boosting builds a series of trees step-wise and then selects the optimal tree using an arbitrary differentiable loss function.

In this paper, since we are concerned with predicting values and estimating the relationship between variables, we use regression; and in order to predict categories and identify which category new data belongs to, we use classification. Here we select the column to be predicted based on the other columns. We train the model, determine its suitability for the solution, and later visualize the newly trained model.

V. EVALUATING EXPERIMENT RESULTS

To evaluate the results, we establish the model's accuracy. Here we examine the set of curves and metrics that are useful in evaluating the model. For instance, we have two additional output columns:

Scored Labels: the predicted label, True if the scored probability is greater than 0.5 and False otherwise.

Scored Probabilities: the probability, as determined by the algorithm, that the record belongs to the True category.

Accuracy = Correctly Predicted Observations / Total Number of Events = 0.94 (indicating 94% accuracy)

Fig 3.2. Receiver Operating Characteristic (ROC) Curve

The curve above is the Receiver Operating Characteristic plot: the horizontal axis shows the false positive rate (the event is negative) and the vertical axis shows the true positive rate (the event is positive).

True positives: 1630
False negatives: 722
False positives: 498
True negatives: 6918
Positive label: >50K
Negative label: <=50K
Accuracy: 0.90

Confusion Matrix: the following matrix, known as the confusion matrix, tabulates the scored labels against the actual classes. For instance, this matrix indicates the accuracy of the multi-class decision forest algorithm against the actual classes, i.e. High, Medium and Low. In short, it relates the scored probabilities of the predicted class to the ground truth.

The average accuracy of the multi-class decision forest on the sample data set turned out to be 0.7.

Fig 3.3. Multi-Classification Decision Forest

VI. CONCLUSION AND FUTURE WORK

With the proposed model, we measure the accuracy of various classification models directly by comparison over the whole data set. This comparison gives positive and improved results under the given metrics. The algorithm retrains itself each time an input variable is passed and compares itself to the previous scored label. However, we also observe that at times it gives negative results; to overcome this negative impact, we enhance the model by training it carefully.

REFERENCES
[1] Anton Antonov et al., Classification and Association Rules for Census Data, Mathematica for Prediction Algorithms, March 30, 2014.
[2] Azamat Kibekbaev et al., Benchmarking Regression Algorithms for Income Prediction Modeling, 2015 International Conference on Computational Science and
Computational Intelligence, 978-1-4673-9795-7/15, 2015 IEEE, DOI 10.1109/CSCI.2015.162, pp. 180-185.
[3] A. Lazar, Income Prediction via Support Vector Machine, IEEE Conference on Machine Learning and Applications, 16-18 Dec. 2004, DOI: 10.1109/ICMLA.2004.1383506.
[4] Vrushali, Comparative Analysis of Classification Techniques on Soil Data to Predict Fertility Rate for Aurangabad District, IJETTCS, Volume 3, Issue 2, March-April 2014, ISSN 2278-6856, pp. 200-203.
[5] Y. Bengio et al., "Introduction to the special issue on neural networks for data mining and knowledge discovery," IEEE Trans. Neural Networks, vol. 11, pp. 545-549, 2000.
[6] S. Archana et al., Survey of Classification Techniques in Data Mining, International Journal of Computer Science and Mobile Applications, Vol. 2, Issue 2, February 2014.
[7] Kumari et al., Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction, International Journal of Computer Science and Technology, Vol. 2, Issue 2, pp. 304-308, 2011.
[8] Ture, M. et al., Comparing classification techniques for predicting essential hypertension, Expert Systems with Applications 29, pp. 583-588, 2011.
[9] Burges, C., A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998.
