SALARY PREDICTION

The document discusses a project aimed at predicting employee salaries based on past experiences and educational qualifications using machine learning techniques. It outlines various methodologies, including data collection, cleaning, and model training, while also detailing different machine learning algorithms such as supervised, unsupervised, and reinforcement learning. Additionally, it highlights the tools and libraries required for implementation, including Python, Jupyter Notebook, and Flask.


CHAPTER 1

INTRODUCTION
Nowadays, one of the major reasons an employee switches companies is salary. Employees keep switching companies to get their expected salary, and this results in a loss for the company. To overcome this loss, we propose the following idea: what if the employee receives the desired or expected salary from the company or organization? In this competitive world, everyone has high expectations and goals.

But we cannot simply give everyone their expected salary; there should be a system that measures an employee's ability relative to the expected salary. We cannot decide the exact salary, but we can predict it using suitable data sets.

A prediction is an assumption about a future event. A prediction is sometimes, though not always, based on knowledge or experience. Future events are not necessarily certain, so exact confirmed data about the future is in many cases impossible to obtain; nevertheless, a prediction can be useful for preparing plans about probable developments. In this project, the salary of an employee of an organization is predicted on the basis of the past experience and the educational qualifications of the individual. The salary history is observed, and on that basis the salary of a person after a certain period of time can be calculated automatically.

In order to gain useful insights into job recruitment, we compare different strategies and machine learning models. The methodology comprises several phases: data collection, data cleaning, manual feature engineering, data set description, automatic feature selection, model selection, model training and validation, and model comparison.

The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.

Machine learning algorithms are broadly classified into three divisions, namely: supervised learning, unsupervised learning and reinforcement learning.

• Supervised Learning: Supervised learning is learning in which we teach or train the machine using data that is well labelled, meaning that each example is already tagged with the correct answer. After that, the machine is provided with a new set of examples so that the supervised learning algorithm analyses the training data and produces a correct outcome from the labelled data.
Basically, such algorithms apply what has been learned in the past to new data, using labelled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly.

• Unsupervised Learning: In contrast, unsupervised machine learning algorithms are used when the information used for training is neither classified nor labelled. Unsupervised learning studies how systems can infer a function that describes a hidden structure in unlabelled data. The system does not figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures in unlabelled data. Unsupervised learning is the training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences, without any prior training on the data. Unlike supervised learning, no teacher is provided, which means no correct answers are given to the machine. The machine is therefore left to find the hidden structure in the unlabelled data by itself.
• Reinforcement Learning: Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software systems and machines to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that in supervised learning the training data carries the answer key, so the model is trained with the correct answer itself, whereas in reinforcement learning there is no answer and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience. A reinforcement learning algorithm is a learning method that interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. This method allows machines and software agents to automatically determine the ideal behaviour within a specific context in order to maximize performance. Simple reward feedback is required for the agent to learn which action is best; this is known as the reinforcement signal.

The project uses various regression techniques for predicting the salaries of employees. The techniques are listed below, and a short code sketch comparing them follows the list.

1. Linear Regression: In linear regression we are given a number of predictor variables and a continuous response variable, and we try to find a relationship between those variables that allows us to predict a continuous outcome. For example, given X and Y, we fit a straight line that minimizes the distance between the sample points and the fitted line, using methods such as Ordinary Least Squares or Gradient Descent to estimate the coefficients.

2. Decision Tree Regressor: A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing values of the attribute tested. A leaf node represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

3. Random Forest Regressor: Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time; for regression tasks, the mean (average) prediction of the individual trees is returned.
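
As a brief, hedged illustration of how these three techniques can be compared, the following sketch fits each regressor on a salary dataset and reports its R² score on held-out data. The file name Salary_Data_SLR.csv and its columns follow the dataset used in Chapter 7; the 70:30 split, n_estimators and random_state values are illustrative assumptions, not fixed choices of the project.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

dataset = pd.read_csv("Salary_Data_SLR.csv")        # assumed file name
X = dataset[["YearsExperience"]].values             # predictor variable(s)
y = dataset["Salary"].values                        # continuous response

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)                     # learn from labelled examples
    print(name, r2_score(y_test, model.predict(X_test)))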

REQUIRED TOOLS

• Python: Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.

• Jupyter Notebook: The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

• Anaconda Navigator: Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda® distribution that allows you to launch applications and easily manage conda packages, environments, and channels without using command-line commands.

CHAPTER 2
LITERATURE SURVEY
1) Susmita Ray, "A Quick Review of Machine Learning Algorithms," 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), India, 14th–16th Feb 2019. The paper presents a brief review of the machine learning algorithms most frequently used to solve classification, regression and clustering problems. The advantages and disadvantages of these algorithms are discussed, along with a comparison of different algorithms (wherever possible) in terms of performance, learning rate, etc. Examples of practical applications of these algorithms are also discussed.

2) Sananda Dutta, Airiddha Halder, Kousik Dasgupta, "Design of a novel Prediction Engine for predicting suitable salary for a job," 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN). The paper focuses on the problem of predicting the salary for job advertisements in which the salary is not mentioned, and also tries to help freshers predict possible salaries at different companies in different locations. The cornerstone of this study is a dataset provided by ADZUNA. The model is well capable of predicting precise values.

3) Pornthep Khongchai, Pokpong Songmuang, "Improving Students' Motivation to Study using Salary Prediction System." The paper proposes a prediction model using the decision tree technique with seven features. Moreover, the output of the system is not only a predicted salary but also the three highest salaries of graduated students who share common attributes with the user. To test the system's efficiency, the authors set up an experiment using 13,541 records of actual graduate data. The overall accuracy is 41.39%.

4) Phuwadol Viroonluecha, Thongchai Kaewkiriya, "Salary Predictor System for Thailand Labour Workforce using Deep Learning." The paper uses deep learning techniques to construct a model that predicts the monthly salary of job seekers in Thailand, treating the task as a regression problem with a numerical outcome. Five months of personal profile data from a well-known job search website were used for the analysis. As a result, the deep learning model shows strong performance in both accuracy and processing time, with an RMSE of 0.774 × 10⁴ and a runtime of only 17 seconds.

CHAPTER 3
MERITS OF THE SYSTEM
1. Easily identifies trends and patterns: Machine learning models can review large volumes of data and discover specific trends and patterns that would not be apparent to humans. For instance, for an e-commerce website like Amazon, this serves to understand the browsing behaviours and purchase histories of its users, helping it offer the right products, deals, and reminders to them. It uses the results to show them relevant advertisements.

2. No human intervention needed (automation): With the implementation of an ML model, there is no need to watch over the project at every step of the way. Giving machines the ability to learn lets them make predictions and also improve the algorithms on their own. A common example of this is anti-virus software, which learns to filter new threats as they are recognized. ML is also good at recognizing spam.

3. Continuous Improvement: As ML algorithms gain experience, they keep improving in accuracy and efficiency. This lets them make better decisions.

4. Handling multi-dimensional and multi-variety data: Machine learning algorithms are good at handling data that are multi-dimensional and multi-variety, and they can do this in dynamic or uncertain environments.

5. Wide Applications: You could be an e-tailer or a healthcare provider and make ML work for you. Where it does apply, it holds the capability to deliver a much more personal experience to customers while also targeting the right customers.

CHAPTER 4
ARCHITECTURAL FLOW OF THE SYSTEM

[Architecture flow: Data Collection → Data Exploration → Data Manipulation → Data Analysis → Application of Algorithm → Evaluation]

An architectural diagram, or pipeline, is used to help automate machine learning workflows. Pipelines operate by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

The pipeline consists of several steps to train a model. Machine learning pipelines are iterative, as every step is repeated to continuously improve the accuracy of the model and achieve a successful algorithm. To build better machine learning models, and to get the most value from them, accessible, scalable and durable storage solutions are imperative, paving the way for on-premises object storage. The steps include:

• Data Collection: collecting raw data from the vast number of datasets available.
• Data Exploration: exploring the data and the related features, and becoming familiar with the data types.
• Data Manipulation: cleaning the data, including the treatment of missing and repetitive values.
• Data Analysis: analysing the data to increase efficiency while choosing the best algorithm and features according to our preferences.
• Application of Algorithm: applying the algorithm to the model; a minimal pipeline sketch follows this list.
• Evaluation: using evaluation metrics to measure the error and iterating over the steps above to make further improvements.
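
A minimal sketch of such a pipeline, assuming scikit-learn is used end to end; the imputation strategy and the choice of LinearRegression are illustrative, not prescribed by the report.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # data manipulation: fill missing values
    ("scale", StandardScaler()),                    # bring features onto one scale
    ("model", LinearRegression()),                  # application of algorithm
])
# pipe.fit(X_train, y_train) trains every stage in order;
# pipe.score(X_test, y_test) then evaluates the whole chain.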

CHAPTER 5
DESCRIPTION OF MODULES
For the salary prediction model, we will be using Python library modules such as numpy, pandas, matplotlib and sklearn, through which we will import different functions for computing the model. Further, we will use the Flask library, which is one of the most important, as it connects our back-end code with the front end.

1. Pandas: Pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. It is a high-level data manipulation tool built on the NumPy package, and its key data structure is called the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
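
A tiny illustration of the DataFrame structure described above; the values are made up for the example.

import pandas as pd

# Rows are observations, columns are variables.
df = pd.DataFrame({
    "YearsExperience": [1.1, 3.2, 5.0],
    "Salary": [39000, 64000, 91000],
})
print(df.head())            # first rows of the table
print(df["Salary"].mean())  # column-wise aggregation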
2. Numpy: NumPy is a Python package whose name stands for 'Numerical Python'. It is the core library for scientific computing and contains a powerful N-dimensional array object organised in rows and columns. We can initialize NumPy arrays from nested Python lists and access their elements.
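
A short illustration of initializing a NumPy array from nested lists and accessing its elements:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array from nested Python lists
print(a.shape)                         # (2, 3): two rows, three columns
print(a[1, 2])                         # element in row 1, column 2 -> 6
print(a.mean(axis=0))                  # mean of each column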
3. Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib offers several plot types, such as line, bar, scatter and histogram plots.
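
A small, hedged example of the kind of plot used in this project; the data points are invented for illustration.

import matplotlib.pyplot as plt

years = [1, 2, 3, 4, 5]
salary = [40000, 50000, 61000, 68000, 81000]   # illustrative values

plt.scatter(years, salary)                     # scatter plot of the raw points
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Salary vs Experience")
plt.show()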
4. Scikit-learn: Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction. sklearn is used to build machine learning models; it is not intended for reading, manipulating or summarizing data. Components of scikit-learn include the following (a short sketch of cross-validation follows the list):
• Supervised learning algorithms

• Cross-validation

• Unsupervised learning algorithms

• Various toy datasets

• Feature extraction
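
As a small sketch of the cross-validation component, the snippet below scores a regressor on five train/validation folds instead of a single split; the toy data is invented for the example.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.arange(20, dtype=float).reshape(-1, 1)   # toy predictor
y = 3.0 * X.ravel() + 5.0                       # toy linear response

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean())                            # average R^2 across the 5 folds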

5. Flask: Flask is a web application framework written in Python. It was developed by Armin Ronacher, who leads an international group of Python enthusiasts named Pocco. Flask is based on the Werkzeug WSGI toolkit and the Jinja2 template engine. Flask is considered more Pythonic than the Django web framework because in common situations the equivalent Flask web application is more explicit. Flask is also easy for a beginner to get started with, because there is little boilerplate code for getting a simple app up and running.
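
A minimal sketch of the kind of Flask endpoint that could connect the trained model to the front end; the model.pkl file, the /predict route and the experience form field are assumptions for illustration, not the project's actual code.

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))        # assumed saved regressor

@app.route("/predict", methods=["POST"])
def predict():
    yoe = float(request.form["experience"])         # years of experience from the form
    salary = float(model.predict([[yoe]])[0])       # the model expects a 2-D input
    return jsonify({"predicted_salary": round(salary, 2)})

if __name__ == "__main__":
    app.run(debug=True)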
6. Html & CSS: Hypertext Markup Language (HTML) is the standard
markup language for documents designed to be displayed in a web browser. It
can be assisted by technologies such as Cascading Style Sheets (CSS) and
scripting languages such as JavaScript. Cascading Style Sheets (CSS) is a style
sheet language used for describing the presentation of a document written in a
markup language such as HTML. CSS is a cornerstone technology of the World
Wide Web, alongside HTML and JavaScript.
7. Heroku: Heroku is a cloud platform as a service (PaaS) supporting
several programming languages. One of the first cloud platforms, Heroku has
been in development since June 2007. We will deploy our model on the Heroku
platform.

CHAPTER 6
UML DIAGRAMS
1. ER Diagram: An entity relationship diagram (ERD) shows the relationships of entity sets stored in a database. An entity in this context is an object, a component of data. An entity set is a collection of similar entities. These entities can have attributes that define their properties.

2. Use Case Diagram: A use case diagram is a graphical depiction of a user's possible interactions with a system. A use case diagram shows the various use cases and the different types of users the system has, and it will often be accompanied by other types of diagrams as well.

3. Activity Diagram: An activity diagram is a behavioural diagram, i.e., it depicts the behaviour of a system. An activity diagram portrays the control flow from a start point to a finish point, showing the various decision paths that exist while the activity is being executed.

CHAPTER 7
IMPLEMENTATION OF THE SYSTEM
SOURCE CODE:
In[1]: import pandas as pd
       dataset = pd.read_csv('/content/Salary_Data_SLR.csv')   # load the raw data
In[2]: dataset
In[3]: X = dataset.iloc[:, 0].values    # predictor: years of experience
In[4]: X
In[5]: X = X.reshape(-1, 1)             # reshape to a single feature column
In[6]: X
In[7]: Y = dataset.iloc[:, -1].values   # target: salary
In[8]: Y
In[9]: import plotly.express as px
       fig = px.line(dataset, x="YearsExperience", y="Salary")
       fig.show()
In[10]: from sklearn.model_selection import train_test_split
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
In[11]: X_train.shape
In[12]: X_test.shape
In[13]: Y_train.shape
In[14]: Y_test.shape
In[15]: from sklearn.linear_model import LinearRegression
        regressor = LinearRegression()
        regressor.fit(X_train, Y_train)
In[16]: regressor.coef_          # learned slope (B1)
In[17]: regressor.intercept_     # learned intercept (B0)
In[18]: from sklearn.metrics import r2_score
        Y_pred = regressor.predict(X_test)
        print(r2_score(Y_test, Y_pred))   # note: r2_score expects y_true first
In[19]: yoe = float(input("Enter Years of Experience: "))
        regressor.predict([[yoe]])        # predicted salary for the given experience

1. The Data: Data collection is the first real step towards the actual development of a machine learning model. This is a critical step that determines how good the model will be: the more and better data we get, the better our model will perform. Our dataset, named "survey_results_public", is a raw dataset, meaning that a lot of pre-processing is required before it becomes useful for evaluation. The dataset consists of 83,439 rows and 48 features that will help us predict the salary, and it is a fairly big dataset.

2. Loading the Data: We load the dataset into our notebook using the
pandas dataframe.

3. Data Pre-Processing: Our next step is to convert our data set into the best possible format so that we can extract the features required to predict the salary. This is where all the cleaning of our data takes place, be it treating missing values, treating repetitive values, or adding different features according to our needs. Once missing values are identified, there are several ways to deal with them: eliminating the samples or features with missing values (at the risk of deleting relevant information or too many samples), or imputing the missing values with a pre-built estimator such as scikit-learn's SimpleImputer class (the successor of the older Imputer class), as sketched below.
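
A short sketch of the imputation option, assuming scikit-learn's current SimpleImputer; the array values are invented.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 40000.0],
              [3.0, np.nan],       # missing salary value
              [5.0, 90000.0]])

imputer = SimpleImputer(strategy="mean")   # also: "median", "most_frequent"
print(imputer.fit_transform(X))            # NaN replaced by the column mean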

4. Data Exploration: Further, we explore our data as much as possible to get to know the features very well. We learn the count of each feature, its mean value, standard deviation, minimum and maximum values, etc.
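
For example, reusing the dataset loaded in the source code above, two pandas calls cover most of this step:

print(dataset.describe())   # count, mean, std, min, max per numeric feature
dataset.info()              # data types and non-null counts per column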

5. Univariate Analysis: 'Uni' means one and 'variate' means variable, so in univariate analysis there is only one variable under study. The objective of univariate analysis is to take the data, define and summarize it, and analyze the pattern present in it. In a dataset, it explores each variable separately.

[Figure: bar plot of Education Level]

6. Data Manipulation: Data manipulation plays a very crucial role in the machine learning pipeline, as all the cleaning of the data takes place in this step. The process includes finding missing values in the dataset and imputing them with different techniques, such as the mean, mode or median, or even dropping the column (if irrelevant). Outliers are also treated in this step, as they deviate the plots from their actual positions.

7. Bivariate Analysis: As the name suggests, bivariate analysis is the analysis of two features taken together. It is one of the simplest forms of statistical analysis, used to find out whether there is a relationship between two sets of values. It usually involves the variables X and Y. We pick any two features, one pair at a time, and analyse them using histograms, bar graphs, scatter plots, etc.

8. Feature Scaling: This is a crucial step in the preprocessing phase, as the majority of machine learning algorithms perform much better when dealing with features that are on the same scale. The most common techniques, sketched after this list, are:

• Normalization: rescaling the features to the range [0, 1], which is a special case of min-max scaling. To normalize our data we simply apply the min-max scaling method to each feature column.

• Standardization: centering the feature columns at mean 0 with standard deviation 1, so that the feature columns have the same parameters as a standard normal distribution (zero mean and unit variance). This makes it much easier for the learning algorithms to learn the weights of the parameters. In addition, it keeps useful information about outliers and makes the algorithms less sensitive to them.
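
A short sketch of both techniques, assuming scikit-learn's scalers; the toy values are invented, and the scalers are fitted on the training data only so that no test-set information leaks into the transformation.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [3.0], [5.0], [10.0]])   # toy feature column
X_test = np.array([[2.0], [7.0]])

norm = MinMaxScaler()                      # normalization: rescale to [0, 1]
print(norm.fit_transform(X_train))

std = StandardScaler()                     # standardization: mean 0, unit variance
print(std.fit_transform(X_train))
print(std.transform(X_test))               # reuse the training statistics on test data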

9. Implementing the Model: Here comes the part where the actual machine learning algorithm is implemented. As stated above, we use the linear regression machine learning algorithm to predict the salary in our salary prediction model.

10. Segregating Dependent and Independent Variables: Independent variables (also referred to as features) are the input for the process being analyzed. Dependent variables are the output of the process. For example:

y = f(x), where

x = independent variable

y = dependent variable

This means any change in x will cause a change in the value of y; the change can be negative or positive. In our model, we have SALARY as our target (dependent) variable, and all other features are considered independent variables.

11. Splitting the Data Set into Train and Test Datasets: We split our data into three parts: training, testing and validation sets. We train our model with the training data, evaluate it on the validation data and finally, once it is ready to use, test it one last time on the test data. The ultimate goal is for the model to generalize well on unseen data, in other words, to predict accurate results from new data based on the internal parameters adjusted while it was trained and validated.

In our model, we have divided the dataset in a 70:30 ratio, i.e., the training data consists of 70% of the dataset while the testing data consists of the remaining 30%. To split the data we use the train_test_split function provided by the scikit-learn library, as sketched below.
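
A minimal sketch of the three-way split using two successive train_test_split calls; the 70:30 first split follows the text, while splitting the 30% hold-out evenly into validation and test halves is an assumption for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100, dtype=float).reshape(-1, 1)   # toy data
y = 2.0 * X.ravel() + 1.0

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))     # 70 / 15 / 15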

12. Implementing Linear Regression:
i) Learning Phase: In linear regression we are given a number of predictor variables and a continuous response variable, and we try to find a relationship between those variables that allows us to predict a continuous outcome. For example, given X and Y, we fit a straight line that minimizes the distance between the sample points and the fitted line, using methods such as Ordinary Least Squares or Gradient Descent to estimate the coefficients. We then use the learned intercept and slope, which define the fitted line, to predict the outcome of new data.

The formula for the straight line is y = B0 + B1·x + u, where x is the input, B1 is the slope, B0 the y-intercept, u the residual, and y the value of the line at position x. The values available for training are B0 and B1, the values that determine the position of the line, since the only other variables are x (the input) and y (the output); the residual is not trained. These values (B0 and B1) are the "weights" of the predicting function.
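
A minimal gradient-descent sketch that learns B0 and B1 for the line y = B0 + B1·x on invented data; the learning rate and iteration count are illustrative choices, not values used by the project (which relies on scikit-learn's LinearRegression instead).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # years of experience
y = np.array([40.0, 52.0, 61.0, 68.0, 81.0])   # salary in thousands (invented)

b0, b1 = 0.0, 0.0      # start from an arbitrary line
lr = 0.01              # learning rate
for _ in range(5000):  # each iteration moves the line closer to the best fit
    error = (b0 + b1 * x) - y
    b0 -= lr * error.mean()           # gradient of the squared error w.r.t. B0
    b1 -= lr * (error * x).mean()     # gradient of the squared error w.r.t. B1

print(b0, b1)          # learned intercept and slope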

These weights, together with others called biases, are the parameters that will be arranged together as matrices.

The process is repeated one iteration (or step) at a time. In each iteration the initial random line moves closer to the ideal, more accurate one.

ii) Overfitting & Underfitting: One of the most important problems when training models is the tension between optimization and generalization.

• Optimization is the process of adjusting a model to get the best performance possible on the training data (the learning process).

• Generalization is how well the model performs on unseen data. The goal is to obtain the best generalization ability.

At the beginning of training, those two issues are correlated: the lower the loss on the training data, the lower the loss on the test data. This happens while the model is still underfitted: there is still learning to be done, as it has not yet modelled all the relevant patterns in the data. After a certain number of iterations, however, generalization stops improving while performance on the training data keeps rising: the model begins to overfit, memorizing patterns specific to the training samples.

There are two ways to avoid this overfitting: getting more data and regularization.

• Getting more data is usually the best solution; a model trained on more data will naturally generalize better.
• Regularization is applied when getting more data is not possible. It is the process of modulating the quantity of information that the model can store, or of adding constraints on what information it is allowed to keep. If the model can only memorize a small number of patterns, the optimization will make it focus on the most relevant ones, improving the chance of generalizing well.
Regularization is done mainly by the following techniques, the second of which is sketched below:

• Reducing the model's size: reducing the number of learnable parameters in the model, and with them its learning capacity.
• Adding weight regularization: L1 regularization and L2 regularization.
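
A brief sketch of weight regularization with scikit-learn, where Ridge applies an L2 penalty and Lasso an L1 penalty on the coefficients; the alpha values and toy data are illustrative.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                                  # toy features
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=50)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: shrinks all weights towards zero
print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: drives some weights exactly to zero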

13. Implementing Decision Tree: A decision tree is a decision-making tool that uses a flowchart-like tree structure; it is a model of decisions and all of their possible results, including outcomes, input costs, and utility. The decision-tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables.
14. Evaluation of the Model: The final step is evaluating the model using appropriate evaluation metrics. We have evaluated our model using the score() method and the R² metric, as they suit our model perfectly.

CHAPTER 8
CONCLUSION
In today's world, it has become tough to store such huge amounts of data and to extract them for one's own requirements; the extracted data should also be useful. The system makes optimal use of the linear regression algorithm and uses such data in the most efficient way. The linear regression algorithm helps satisfy users by increasing the accuracy of the salary estimate and reducing the risk in salary decisions.

Our model achieved an accuracy score of 95.68% on the training dataset and an accuracy score of 95.33% on the testing dataset. Since there is only a very small difference between the training and testing scores, we can say that our model has performed extremely well on the given dataset, and with a high score at that. This illustrates that the approach contributes positively according to the evaluation.

FUTURE SCOPE

Since nothing in this universe can be termed 'perfect', a lot of features can be added to make the system more widely acceptable and more user friendly. In the upcoming phase of our project we will be able to connect an even larger dataset to this model so that the training can be even better. The model should check for new data once a month and incorporate it, to expand the dataset and produce better results. We can try out other dimensionality reduction techniques, such as univariate feature selection and recursive feature elimination, in the initial stages. Another major addition would be to provide the model with salary data from more cities, which would let users explore more options and reach an accurate decision. More factors that affect the job salary of a graduate, such as the training period, shall be added. In-depth details of every individual will be added to provide ample information for the desired prediction. This will help the system to run on a larger level.
