0% found this document useful (0 votes)

14 views30 pages

Salary Prediction Document

This document is a Summer Internship report on 'Salary Prediction' submitted by students of the Bachelor of Technology in Computer Science and Engineering at Loyola Institute of Technology and Management. The project utilizes machine learning algorithms, particularly Linear Regression, Decision Tree Regressor, and Random Forest Regressor, to predict employee salaries based on historical data and qualifications. The report outlines the methodology, tools used, and the significance of accurate salary predictions in the job market.

Uploaded by

Sai Vetcha

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views30 pages

Salary Prediction Document

Uploaded by

Sai Vetcha

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

SALARY PREDICTION

This Summer Internship report submitted for partial fulfilment of the requirements
for the award of degree.
BACHELOR OF TECHNOLOGY

COMPUTER SCIENCE & ENGINEERING

Submitted By

21A41A0562 : SK. FAYAZ AHAMMED

21A41A0563: SK. SARDAR HUSSAIN
21A41A0569: SK. NAGUL MEERA
21A41A0582: V. LEELA SAI BALAJI MANIKANTA
21A41A0586: M. MANIKANTA

Under the Esteemed Guidance of

Mrs. B. Prasanna Jyothi

(Assistant Professor)

Department of Computer science & Engineering

LOYOLA INSTITUTE OF TECHNOLOGY AND MANAGEMENT
(Approved by AICTE, New Delhi & Affiliated to JNTUK, Kakinada)
ESTD: 2001, Sponsored by: SanthiNiketan Minority Education Society.
NAAC ”A” graded and an ISO 9001:2015 Certified Institution
Loyola Nagar, Dhulipalla (V), Sattenapalli (M)- 522 412
A.Y. 2023 – 2024
LOYOLA INSTITUTE OF TECHNOLOGY AND MANAGEMENT
(Approved by AICTE, New Delhi & Affiliated to JNTUK, Kakinada)
ESTD: 2001, Sponsored by: SanthiNiketan Minority Education Society.
NAAC ”A” graded and an ISO 9001:2015 Certified Institution Loyola Nagar, Dhulipalla (V),
Sattenapalli (M)- 522 412.

Department of Computer science &Engineering

CERTIFICATE
This is to certify that the Summer Internship report entitled as “ SALARY PREDICTION ” is the Bonafide
work of done SK.FAYAZ AHAMMAED, SK.J.SARDAR HUSSAIN, SK.NAGUL MEERA, V.LEELA SAI
BALAJI MANIKANTA, M.MANIKANTA. Submitted to the Department of Computer Science and
Engineering, Loyola Institute of Technology and Management, in partial fulfilment of the requirements for
the award of the Degree in Bachelor of Technology, Computer Science and Engineering.

GUIDE HEAD OF THE DEPARTMENT

B. Prasanna Jyothi R. Pitchaiah

(Assistant Professor) (Associate Professor)

EXTERNAL EXAMINER
ACKNOWLEDGEMENT

We would like to express our profound gratitude to our supervisor, for providing
us with the opportunity to work on this Summer Internship, “SALARY PREDICTION”.
We especially acknowledge him for his advice, supervision, and the vital contribution as
and when required during this research. His involvement with originality has triggered
and nourished our intellectual maturity that will help us for a long time to come. We are
proud to record that we had the opportunity to work with an exceptionally experienced
Professor like him.

We'd want to express our gratitude to our classmates and friends for assisting us in
dispelling our doubts. We'd want to use this occasion to express our gratitude to our parents
for their financial support.

Finally, we'd want to express our heartfelt gratitude to everyone who assisted us in
the execution of this project.
DECLARATION
We hereby declare that the Summer Internship report entitled “SALARY PREDICTION”
is submitted to JNTUK in partial fulfillment of the requirements for the award of degree of
B.Tech is a Bonafide work carried out by us. The matter embodied in this project is genuine
work done by us and has not been submitted earlier to this or any other university for the
award of any degree.

Declared by;

SK.FAYAZ AHAMMED - (21A41A0562)

SK.SARDAR HUSSAIN - (21A41A0563)
SK.NAGUL MEERA - (21A41A0569)
V.LEELA SAI BALAJI MANIKANTA - (21A41A0582)
M.MANIKANTA - (21A41A0586)
Abstract
Mасhine leаrning is а teсhnоlоgy whiсh аllоws а sоftwаre рrоgrаm tо beсаme mоre aссurаte аt
рretending mоre ассurаte results withоut being exрliсitly рrоgrаmmed аnd аlsо ML аlgоrithms uses
histоriс dаtа tо рrediсts the new оutрuts. Beсаuse оf this ML gets а distinguish аttentiоn. Nоw а dаy’s
рrediсtiоn engine hаs beсоme sо рорulаr thаt they аre generаting ассurаte аnd аffоrdаble рrediсtiоns
just like а humаn, аnd being using industry tо sоlve mаny оf the рrоblems. Рrediсting justified sаlаry fоr
emрlоyee is аlwаys being а сhаllenging jоb fоr аn emрlоyer. In this project, a sаlаry рrediсtiоn mоdel is
made with suitаble аlgоrithm using key feаtures required tо рrediсt the sаlаry оf emрlоyee.
The main aim of the project is to predict the salary of graduates and make a suitable user-friendly graph.
From this prediction the salary of an employee can be observed according to a particular field according
to their qualifications. It helps to see the growth of any field.
In the project, we have used Linear Regression as an algorithm for prediction. Linear regression carries out
a task that may predict the value of a dependent variable (y) on basis of an independent variable (x) that is
given. Therefore, this kind of regression technique looks for a linear type of relationship between input x
and output y. Apart from Linear Regression, other types of regression techniques are also used like the
Decision Tree Regressor and Random Forest Regressor.

Since nothing in this universe can be termed as “perfect”, thus a lot of features can be added to make the
system more widely acceptable and more user friendly. This will not only help to predict salaries of other
fields but also will be more user beneficial.
In the upcoming phase of our project we will be able to connect an even larger dataset to this model so
that the training can be even better. This mоdel shоuld сheсk fоr new dаtа, оnсe in а mоnth, аnd
inсоrроrаte them tо exраnd the dаtаset аnd рrоduсe better results.
Table of Contents
Title Page No.
Candidates Declaration I
Abstract II
Table of Contents III
List of Figures IV
Acronyms and Terminology Used V
Chapter 1 Introduction
9
1.1 Introduction
1.2 Required Tools 12

Chapter 2
Literature Survey & Project Design 13
Chapter 3
Merits of the Proposed System 14
Chapter 4
Architectural Flow of the Proposed Model 15
Chapter 5
Description of Modules 16
Chapter 6 UML Diagrams
6.1 ER Diagram 18

6.2 Use Case Diagram 19

6.3 Activity Diagram 20

Chapter 7 Implementation of the Model

7.1 Data Collection 21

7.2 Loading the Data

7.3 Data Preprocessing 22
23
7.4 Data Exploration
24
7.5 Univariate Analysis
25
7.6 Data Manipulation
26
7.7 Bivariate Analysis
7.8 Feature Scaling
7.9 Implementing the model
7.10 Segregating dependent & independent variables
7.11 Train and Test Split
7.12 Learning Phase 27
Overfitting and Underfitting 29
7.13 Implementing Decision Tree 30

7.14 Evaluation of the Model

Chapter 8
Conclusion and Future Scope 31

References 32
List of Figures
S.No Particulars Page No

1 Architectural Diagram 15

2 ER Diagram 18

3 Use Case Diagram 19

4 Activity Diagram 20

5 Loading the Dataset 21

6 Data Pre-processing 21

7 Data Exploration 22

8 Univariate Analysis 23

9 Data Manipulation 24

10 Bivariate Analysis 25

11 Feature Scaling 26

12 Standardization 27

13 Model Implementation 28

14 Overfitting and Underfitting 29

15 Evaluation of the Model 30

ACRONYMS
I/P & O/P Input and Output
ML Machine Learning
ANN Artificial Neural Network
LR Linear Regression
DC Decision Trees
CV Computer Vision
NLP Natural Language Processing

Terminology Used
• An Algorithm is a set of rules that a machine follows to achieve a particular goal. An algorithm can
be considered as a recipe that defines the inputs, the output and all the steps needed to get from
the inputs to the output.

• Machine Learning is a set of methods that allow computers to learn from data to make and
improve predictions.

• A Machine Learning Model is the learned program that maps inputs to predictions. This can be a
set of weights for a linear model or for a neural network.

• A Dataset is a table with the data from which the machine learns. The dataset contains the
features and the target to predict. When used to induce a model, the dataset is called training
data.

• The Prediction is what the machine learning model "guesses" what the target value should be
based on the given features.
CHAPTER 1

INTRODUCTION
Now days, one of the major reasons an employee switches a company is the salary of the employee.
Employees keep switching the company to get the expected salary. And it results in loss for the company
and to overcome this loss we came with an idea what if the employee gets the desired/expected salary
from the Company or Organization. In this Competitive world everyone has a higher expectation and
goals.
But we cannot randomly provide everyone their expected salary there should be a system which should
measure the ability of the Employee for the Expected salary. We cannot decide the exact salary but we can
predict it by using certain data sets.
A prediction is an assumption about a future event. A prediction is sometimes, though not always, is based
upon knowledge or experience. Future events are not necessarily certain, thus confirmed exact data
about the future is in many cases are impossible, a prediction may be useful to help in preparing
plans about probable developments. In this project, the salary of an employee of an organization is to
be predicted on basis of past experience and the educational qualifications of the individual. Here the
history of salary has been observed and then on basis of that salary of a person after a certain period of
time it can be calculated automatically.
In order to gain useful insights into the job recruitment, we compare different strategies and machine
learning models. The methodology different phases like: Data collection, Data cleaning, Manual feature
engineering, Data set description, Automatic feature selection, Model selection, Model training and
validation, Model comparison.
The process of learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow the computers learn automatically without human
intervention or assistance and adjust actions accordingly.

Machine learning algorithms are broadly classified into three divisions, namely; Supervised learning,
Unsupervised learning and Reinforcement learning.

• Supervised learning:- Supervised learning is a learning in which we teach or train the machine
using data which is well labelled that means some data is already tagged with correct answer. After
that, machine is provided with new set of examples so that supervised learning algorithm analyses
the training data and produces a correct outcome from labelled data.
Basically, they can apply what has been learned in the past to new data using labelled examples to
predict future events. Starting from the analysis of a known training dataset, the learning
algorithm produces an inferred function to make predictions about the output values. The system
is able to provide targets for any new input after sufficient training. The learning algorithm can also
compare its output with the correct, intended output and find errors in order to modify the model
accordingly.
• Unsupervised Learning:- In contrast, unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labelled. Unsupervised learning studies how
systems can infer a function to describe a hidden structure from unlabelled data. The system
doesn’t figure out the right output, but it explores the data and can draw inferences from datasets
to describe hidden structures from unlabelled data. Unsupervised learning is the training of
machine using information that is neither classified nor labelled and allowing the algorithm to act
on that information without guidance. Here the task of machine is to group unsorted information
according to similarities, patterns and differences without any prior training of data. Unlike,
supervised learning, no teacher is provided that means no training will be given to the machine.
Therefore, machine is restricted to find the hidden structure in unlabelled data by our-self.
• Reinforcement learning:- Reinforcement learning is an area of Machine Learning. Reinforcement.
It is about taking suitable action to maximize reward in a particular situation. It is employed by
various software and machines to find the best possible behaviour or path it should take in a
specific situation. Reinforcement learning differs from the supervised learning in a way that in
supervised learning the training data has the answer key with it so the model is trained with the
correct answer itself whereas in reinforcement learning, there is no answer but the reinforcement
agent decides what to do to perform the given task. In the absence of training dataset, it is bound
to learn from its experience. Reinforcement machine learning algorithms is a learning method that
interacts with its environment by producing actions and discovers errors or rewards. Trial and error
search and delayed reward are the most relevant characteristics of reinforcement learning. This
method allows machines and software agents to automatically determine the ideal behavior within
a specific context in order to maximize its performance. Simple reward feedback is required for the
agent to learn which action is best; this is known as the reinforcement signal.

The project uses various regression techniques for predicting the salary of the employees. The techniques
are listed as follows.
1. Linear Regression: In Linear regression we are given a number of predictor variables and a
continuous response variable, and we try to find a relationship between those variables that allows
us to predict a continuous outcome.
2. For example, given X and Y, we fit a straight line that minimize the distance using methods to
estimate the coefficients like Ordinary Least Squares and Gradient Descent between the sample
points and the fitted line.
3. Decision Tree Regressor: Decision tree builds regression or classification models in the form of a
tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an
associated decision tree is incrementally developed. The final result is a tree with decision nodes
and leaf nodes. A decision node has two or more branches, each representing values for the
attribute tested. Leaf node represents a decision on the numerical target. The topmost decision
node in a tree which corresponds to the best predictor called root node. Decision trees can handle
both categorical and numerical data.
4. Random Forest Regressor: Random forests or random decision forests are an ensemble learning
method for classification, regression and other tasks that operates by constructing a multitude of
decision trees at training time For regression tasks, the mean or average prediction of the
individual trees is returned.

REQUIRED TOOLS
• Python: Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a scripting
or glue language to connect existing components together.

• Jupyter Notebook: The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations and narrative text.
Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.

• Anaconda Navigator: Anaconda Navigator is a desktop graphical user interface (GUI) included in
Anaconda® distribution that allows you to launch applications and easily manage conda packages,
environments, and channels without using command-line commands.
CHAPTER 2
LITERATURE SURVEY

1) Susmita Ray," A Quick Review of Machine Learning Algorithms," 2019 International

Conference on Machine Learning, Big Data, Cloud and Parallel Computing (Com-ITCon), India, 14th -16th
Feb 2019 a brief review of various machine learning algorithms which are most frequently used to solve
classification, regression and clustering problems. The advantages, disadvantages of these algorithms have
been discussed along with comparison of different algorithms (wherever possible) in terms of
performance, learning rate etc. Along with that, examples of practical applications of these algorithms
have been discussed.
2) Sananda Dutta, Airiddha Halder, Kousik Dasgupta,” Design of a novel Prediction Engine for predicting
suitable salary for a job” 2018 Fourth International Conference on Research in Computational
Intelligence and Communication Networks (ICRCICN) - focused on the problem of predicting salary for
job advertisements in which salary are not mentioned and also tried to help fresher to predict possible
salary for different companies in different locations. The corner stone of this study is a dataset
provided by ADZUNA. model is well capable to predict precise value.

3) Pornthep Khongchai, Pokpong Songmuang, “Improving Students’ Motivation to Study using Salary
Prediction System” - proposed prediction model using Decision tree technique with seven features.
Moreover, the result of the system is not only a predicted salary, but also the 3-highest salary of the
graduated students which share common attributes to the users. To test the system’s efficiency, they
set up an experiment by using 13,541 records of actual graduated student data.
The total result in accuracy is 41.39%.

4) Phuwadol Viroonluecha, Thongchai Kaewkiriya,” Salary Predictor System for Thailand Labour
Workforce using Deep Learning” - used Deep learning techniques to construct a model which predicts
the monthly salary of job seekers in Thailand solving a regression problem which is a numerical
outcome is effective. We used five-month personal profile data from wellknown job search website for
the analysis. As a result, Deep learning model has strong performance whether accuracy or process
time by RMSE 0.774 x 104 and only 17 seconds for runtime.

MERITS OF THE PROPOSED SYSTEM

1. Easily identifies trends and patterns: Machine Learning Models can review large volumes of data
and discover specific trends and patterns that would not be apparent to humans. For instance, for an e-
commerce website like Amazon, it serves to understand the browsing behaviours and purchase histories
of its users to help cater to the right products, deals, and reminders relevant to them. It uses the results to
reveal relevant advertisements to them.

2. No human intervention needed (automation): With implementation of ML model, there is no need

to have any eye on the project at every step of the way. Since, giving machines the ability to learn, lets
them make predictions and also improve the algorithms on their own. A common example of this is anti-
virus softwares; they learn to filter new threats as they are recognized. ML is also good at recognizing
spam.

3. Continuous Improvement: As ML algorithms gain experience, they keep improving in accuracy and
efficiency. This lets them make better decisions.

4. Handling multi-dimensional and multi-variety data: Machine Learning algorithms are good at
handling data that are multi-dimensional and multi-variety, and they can do this in dynamic or uncertain
environments.

5. Wide Applications: You could be an e-tailer or a healthcare provider and make ML work for you.
Where it does apply, it holds the capability to help deliver a much more personal experience to customers
while also targeting the right customers.
CHAPTER 3
ARCHITECTURAL FLOW OF THE PROPOSED SYSTEM

Data Data Data Application of

Data Analysis Evaluation
collection Exploration manipulation Algorithm

An Architectural Diagram or a pipeline is used to help automate machine learning workflows. They
operate by enabling a sequence of data to be transformed and correlated together in a model that can be
tested and evaluated to achieve an outcome, whether positive or negative.
The pipeline/ Diagram consists of several steps to train a model. Machine learning pipelines are iterative
as every step is repeated to continuously improve the accuracy of the model and achieve a successful
algorithm. To build better machine learning models, and get the most value from them, accessible,
scalable and durable storage solutions are imperative, paving the way for onpremises object storage. The
steps include:

• Data Collection: Collecting raw data from billions of datasets available.

• Data Exploration: Exploring the data & the features related and being familiar with the data-types.
• Data Manipulation: Includes Cleaning of data, treating missing, repetitive values that are present.
• Data Analysis: Analysing the data to increase efficiency while applying the best Algorithm & feature
selection according to our preferences.
• Application of Algorithm: Applying the algorithm to the model.
• Evaluation: Using evaluation metrices to calculate the least error and following the above to make
further changes.
CHAPTER 4

DESCRIPTION OF MODULES

For the Salary Prediction Model, we will be using the python library modules such as numpy, pandas,
matplotlib & sklearn, though which we will import different functions for computing the model. Further,
we will use Flask library which is one of the most important module as it connects our backend end code
with the front end.
1. Pandas: Pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the python programming language. It is a high-level data manipulation
tool It is built on the Numpy package and its key data structure is called the DataFrame. DataFrames allow
you to store and manipulate tabular data in rows of observations and columns of variables.

2. Numpy: NumPy is a Python package that stands for ‘Numerical Python’. It is the core library for
scientific computing, which contains a powerful n-dimensional array object.
It is a powerful N-dimensional array object which is in the form of rows and columns. We can initialize
NumPy arrays from nested Python lists and access it elements.

3. Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. It is a multi-platform data visualization library built on NumPy arrays and designed
to work with the broader SciPy stack. One of the greatest benefits of visualization is that it allows us visual
access to huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like line, bar,
scatter, histogram etc.

4. Scikit learn: Scikit-learn is probably the most useful library for machine learning in Python. The
sklearn library contains a lot of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction. sklearn is used to build machine
learning models. It should not be used for reading the data, manipulating and summarizing it.
Components of scikit-learn:
• Supervised learning algorithms
• Cross-validation
• Unsupervised learning algorithms
• Various toy datasets
• Feature extraction
5. Flask: Flask is a web application framework written in Python. Armin Ronacher, who leads an
international group of Python enthusiasts named Pocco, develops it. Flask is based on Werkzeug WSGI
toolkit and Jinja2 template engine. Flask is considered more Pythonic than the Django web framework
because in common situations the equivalent Flask web application is more explicit. Flask is also easy to
get started with as a beginner because there is little boilerplate code for getting a simple app up and
running.
6. Html & CSS: Hypertext Markup Language (HTML) is the standard markup language for documents
designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style
Sheets (CSS) and scripting languages such as JavaScript. Cascading Style Sheets (CSS) is a style sheet
language used for describing the presentation of a document written in a markup language such as HTML.
CSS is a cornerstone technology of the World Wide Web, alongside HTML and JavaScript.
7. Heroku: Heroku is a cloud platform as a service (PaaS) supporting several programming languages.
One of the first cloud platforms, Heroku has been in development since June 2007. We will deploy our
model on the Heroku platform.
CHAPTER 5
UML DIAGRAMS

1. ER Diagram: An entity relationship diagram (ERD) shows the relationships of entity sets stored in a
database. An entity in this context is an object, a component of data. An entity set is a collection of similar
entities. These entities can have attributes that define its properties.

1. Use Case Diagram: A use case diagram is a graphical depiction of a user's possible interactions with
a system. A use case diagram shows various use cases and different types of users the system has and will
often be accompanied by other types of diagrams as well.
2. Activity Diagram: An activity diagram is a behavioural diagram i.e. it depicts the behaviour of a
system. An activity diagram portrays the control flow from a start point to a finish point showing the various
decision paths that exist while the activity is being executed.
CHAPTER 7
IMPlLEMENTATION OF THE MODEL

SOURCE CODE:
In[1]: import pandas as pd
dataset=pd.read_csv('/content/Salary_Data_SLR.csv')
In[2]: dataset
In[3]: X=dataset.iloc[:,0].values
In[4]: X
In[5]: X=X.reshape(-1,1)
In[6]: X
In[7]: Y=dataset.iloc[:,-1].values
In[8]: Y
In[9]: import plotly.express as px
fig=px.line(dataset,x="YearsExperience",y="Salary")
fig.show()
In[10]: from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)
In[11]: X_train.shape
In[12]: X_test.shape
In[13]: Y_train.shape
In[14]: Y_test.shape
In[15]: from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,Y_train)
In[16]: regressor.coef_
In[17]: regressor.intercept_
In[18]: from sklearn.metrics import r2_score
Y_pred=regressor.predict(X_test)
print(r2_score(Y_pred,Y_test))
In[19]: yoe=float(input("Enter Years of Experience"))
regressor.predict([[yoe]])
1. The Data: Data collection is the first real step towards the real development of a machine learning
model, collecting data. This is a critical step that will cascade in how good the model will be, the
more and better data that we get, the better our model will perform. Our dataset named
“survey_results_public” is a raw dataset. It means that a lot of pre-processing is required so that all
it becomes useful for evaluation. Our dataset consists of 83439 rows and 48 features that will help
us to predict the sales of the house and is fairly a big dataset.

2. Loading the Data: We load the dataset into our notebook using the pandas dataframe.

3. Data Pre-Processing: Our Next step is to convert our data set into best possible format so that we
can extract what all features are required to predict the price of the house. This is where all
cleaning of our data takes place, be it treating the missing values, treating repetitive values, or
addition of different features according to our needs. Once they are identified, there are several
ways to deal with them:
Eliminating the samples or features with missing values. (we risk to delete relevant information or
too many samples).
Imputing the missing values, with some pre-built estimators such as the Imputer class from scikit
learn.

4. Data Exploration: Further, we explore our data as much as possible to know the features very well.
We get to know the count of each features, their mean values, standard deviation, min and max
value etc.
5. Univariate Analysis: Uni means one аnd vаriаte meаns vаriаble, sо in univаriаte аnаlysis,
there is оnly оne deрendаble vаriаble. The оbjeсtive оf univаriаte аnаlysis is tо derive the
dаtа, define аnd summаrize it, аnd аnаlyze the раttern рresent in it. In а dаtаset, it
exрlоres eасh vаriаble seраrаtely.
Bar Plot for
Education Level

Data Manipulation: Data Manipulation plays a very crucial role in the machine learning pipeline, as all the
cleaning of the data takes place in this step. The process includes finding and treating missing values in the
datset and then imputing them with different techniques like mean, mode, median, average or even
droping the column(if irrelevant). Also, outliers are treated in this very step as deviate the plots from their
actual positions.
Bivariate Analysis: As the name suggests, bivariate analysis is the analysis of 2 features taken
together. It is one of the simplest forms of statistical analysis, used to find out if there is a
relationship between two sets of values. It usually involves the variables X and Y. Again, we
randomly pick up any two features, one pair at a time and analyse it using histograms, bar graphs,
plots etc.
Feature Scaling: This is a crucial step in the preprocessing phase as the majority of machine learning
algorithms perform much better when dealing with features that are on the same scale. The most
common techniques are:
• Normalization: it refers to rescaling the features to a range of [0,1], which is a special case of min-
max scaling. To normalize our data we’ll simply need to apply the min-max scaling method to each
feature column.
• Standardization: it consists in centering the feature columns at mean 0 with standard deviation 1 so
that the feature columns have the same parameters as a standard normal distribution (zero mean
and unit variance). This makes much more easier for
the learning algorithms to learn the weights of the
parameters. In addition, it keeps useful
information about outliers and makes the
algorithms less sensitive to them.

9. Implementing the Model: Here comes the part where the actual machine learning algorithms are
being implemented. As stated above, we are using Linear Regression Machine Learning Algorithm to
predict the house price under Chennai house price Prediction model.

10. Segregating Dependant and Independent Variables: Independent variables (also referred to as
Features) are the input for a process that is being analyzes. Dependent variables are the output of
the process. For example:

y=f(x) Where,

x=independentvariable y=

dependent variable

This means any changes in x will cause a change in the value of y. The change can be negative or positive.
In Our Model, we have “SALARY” as our target/ dependent variable and all other features are considered as
independent variables.

11. Splitting the Data Set into Train and Test Dataset: We will split our data in three parts: training,
testing and validating sets. We train our model with training data, evaluate it on validation data and
finally, once it is ready to use, test it one last time on test data. The ultimate goal is that the model
can generalize well on unseen data, in other words, predict accurate results from new data, based
on its internal parameters adjusted while it was trained and validated.
In our Model, we have divided our dataset into a 70:30 ration, i.e., the training data consists of 70% of the
dataset while the testing data consists of the remaining 30% of the dataset. To split the data we use
train_test_split function provided by scikit-learn library.

12. Implementing Linear Regression:

i) Learning Phase: In Linear regression we are given a number of predictor variables and a
continuous response variable, and we try to find a relationship between those variables that
allows us to predict a continuous outcome.
For example, given X and Y, we fit a straight line that minimize the distance using methods
to estimate the coefficients like Ordinary Least Squares and Gradient
Descent between the sample points and the fitted line. We’ll use the intercept and slope
learned, that form the fitted line, to predict the outcome of new data.

The formula for the straight line is y = B0 + B1x +u. Where x is the input, B1 is the slope, B0 the y-intercept,
u the residual and y is the value of the line at the postion x.
The values available for being trained are B0 and B1, which are the values that affect the position of the
line, since the only other variables are x (the input and y, the output (the residual is not considered). These
values (B0 and B1) are the “weights” of the predicting funtion.
These weights and other, called biases, are the parameters that will be arranged together as matrixes.
The process is repeated, one iteration (or step) at a time. In each iteration the initial random line moves
closer to the ideal and more accurate one.

ii) Overfitting & Underfitting: One of the most important problems when considering the
training of models is the tension between optimization and generalization.

• Optimization is the process of adjusting a model to get the best performance

possible on training data (the learning process).

• Generalization is how well the model performs on unseen data. The goal is to obtain
the best generalization ability.

At the beginning of training, those two issues are correlated, the lower the loss on training data, the
lower the loss on test data. This happens while the model is still underfitted: there is still learning to
be done, it hasn’t been modelled yet all the relevant parameters of the model.
There are two ways to avoid this overfitting, getting more data and regularization.

 Getting more data is usually the best solution, a model trained on more data will naturally
generalize better.
 Regularization is done when the latter is not possible, it is the process of modulating the quantity of
information that the model can store or to add constraints on what information it is allowed to keep.
If the model can only memorize a small number of patterns, the optimization will make it to focus
on the most relevant ones, improving the chance of generalizing well.
 Regularization is done mainly by the following techniques:

• Reducing the model’s size: Reducing the number of learnable parameters in the
model, and with them its learning capacity.
• Adding weight regularization: L1 Regularization & L2 Regularization.

13. Implementing Decision Tree: Decision Tree is a decision-making tool that uses a flowchart-like tree
structure or is a model of decisions and all of their possible results, including outcomes, input costs,
and utility.
Decision-tree algorithm falls under the category of supervised learning algorithms. It works for both
continuous as well as categorical output variables.

14. Evaluation of the Model: The final Step of the model is evaluating it using appropriate evaluation
matrices. We have evaluated our model using score() and R2-sqaure metrics as it suits our model
perfectly.
CHAPTER 8
CONCLUSION
In today’s real world, it has become tough to store such huge data and extract them for one’s own
requirement. Also, the extracted data should be useful. The system makes optimal use of the Linear
Regression Algorithm. The system makes use of such data in the most efficient way. The linear regression
algorithm helps to fulfil customers by increasing the accuracy of estate choice and reducing the risk of
investing in an estate.
Our Model Predicted An Accuracy score of 95.68% on the training dataset while it predicted an Accuracy
score of 95.33% on the testing dataset. Since there is a very minute difference between the training and
testing scores, we can say that our model has performed extremely well on the given dataset, that with
such a high % score. It is illustrated that the approach contributes positively according to the evaluation.
FUTURE WORKS
Since nothing in this universe can be termed as “perfect”, thus a lot of features can be added to make the
system more widely acceptable and more user friendly. This will not only help to predict rates of other aras
in the city but also will be more user beneficial.
In the upcoming phase of our project we will be able to connect an even larger dataset to this model so
that the training can be even better. This mоdel shоuld сheсk fоr new dаtа, оnсe in а mоnth, аnd
inсоrроrаte them tо exраnd the dаtаset аnd рrоduсe better results
We can try out other dimensionality reduction techniques like Uni-variate Feature Selection and Recursive
feature elimination in the initial stages.
Another major future scope that can be added is providing the model with estate database of more cities
which will provide the user to explore more graduates and reach an accurate decision. More factors like
training period that affect the job salary of a graduate shall be added. In-depth details of every individual
will be added to provide ample details of a desired estate. This will help the system to run on a larger level.

Operating System Concepts Test
No ratings yet
Operating System Concepts Test
11 pages
1 - People V Adriano, GR 205228
50% (2)
1 - People V Adriano, GR 205228
1 page
AWS AI ML Virtual Internship Full Report
No ratings yet
AWS AI ML Virtual Internship Full Report
33 pages
Student Performance Analysis Using Machine Learning
No ratings yet
Student Performance Analysis Using Machine Learning
40 pages
Finite Element Method For Electromagnetics
No ratings yet
Finite Element Method For Electromagnetics
360 pages
Project Synopsis
33% (3)
Project Synopsis
4 pages
Internship Introduction Pages
No ratings yet
Internship Introduction Pages
10 pages
Salary Prediction
No ratings yet
Salary Prediction
4 pages
Mini Project Report
No ratings yet
Mini Project Report
10 pages
Adnan
100% (1)
Adnan
21 pages
BT4234 - RPT - Mr. Sreenarayanan N M
No ratings yet
BT4234 - RPT - Mr. Sreenarayanan N M
32 pages
Machine Learning Algorithms
No ratings yet
Machine Learning Algorithms
32 pages
Final Last
No ratings yet
Final Last
34 pages
Sample Project Report
No ratings yet
Sample Project Report
19 pages
1SJ18CS117 Venkatesh Murthy
No ratings yet
1SJ18CS117 Venkatesh Murthy
37 pages
ML Report
No ratings yet
ML Report
20 pages
First Project
No ratings yet
First Project
34 pages
Project Proposal 260 Copy
No ratings yet
Project Proposal 260 Copy
38 pages
Internship Report Final
No ratings yet
Internship Report Final
21 pages
Chapter 1 5 Thesis Sample
100% (2)
Chapter 1 5 Thesis Sample
64 pages
Adnan
No ratings yet
Adnan
19 pages
Internship Report
No ratings yet
Internship Report
33 pages
Jayanth Documentation
No ratings yet
Jayanth Documentation
34 pages
Sanjib Final
No ratings yet
Sanjib Final
41 pages
7th Sem Final Report
No ratings yet
7th Sem Final Report
67 pages
Goutham
No ratings yet
Goutham
30 pages
Employee Face Attendance System
No ratings yet
Employee Face Attendance System
75 pages
Mounojit Das SIP Report
No ratings yet
Mounojit Das SIP Report
61 pages
Report
No ratings yet
Report
112 pages
Final 30
No ratings yet
Final 30
20 pages
Final Report22.4 PDF
No ratings yet
Final Report22.4 PDF
118 pages
Final-Report22 3 PDF
No ratings yet
Final-Report22 3 PDF
124 pages
Internshipreport FINAL441
No ratings yet
Internshipreport FINAL441
14 pages
Lecture O03: ENGR90024 Computational Fluid Dynamics
No ratings yet
Lecture O03: ENGR90024 Computational Fluid Dynamics
43 pages
Documentation
No ratings yet
Documentation
62 pages
Fender
No ratings yet
Fender
14 pages
Walter Pullen B Com (Entrepreneurship) CV 2010
No ratings yet
Walter Pullen B Com (Entrepreneurship) CV 2010
4 pages
Pavanikha S N-727822TUAD034
No ratings yet
Pavanikha S N-727822TUAD034
4 pages
Physical Properties of Metals
No ratings yet
Physical Properties of Metals
4 pages
Lab Report: Submitted To
No ratings yet
Lab Report: Submitted To
6 pages
Data Science: Industrial Training Report
No ratings yet
Data Science: Industrial Training Report
45 pages
BUSINESS FORECASTING SYSTEM 181103 Update 29 12 22
No ratings yet
BUSINESS FORECASTING SYSTEM 181103 Update 29 12 22
52 pages
Backup and Recovery Basics Presentation
No ratings yet
Backup and Recovery Basics Presentation
15 pages
Project Document
No ratings yet
Project Document
30 pages
Design and Manufacturing of Carbon Fiber Composite Drive Shaft As An Alternative To Conventional Steel Drive Shaft
No ratings yet
Design and Manufacturing of Carbon Fiber Composite Drive Shaft As An Alternative To Conventional Steel Drive Shaft
10 pages
Aiml Report
No ratings yet
Aiml Report
70 pages
Tushar Internship Report 4th Year
No ratings yet
Tushar Internship Report 4th Year
17 pages
The World During Rizal's Time PDF
No ratings yet
The World During Rizal's Time PDF
29 pages
Arul Final PPP
No ratings yet
Arul Final PPP
45 pages
Group Thesis Part 1
No ratings yet
Group Thesis Part 1
17 pages
Project
No ratings yet
Project
34 pages
1822 B.E Cse Batchno 7
No ratings yet
1822 B.E Cse Batchno 7
60 pages
Ross Girshick Et Al - in 2013 Proposed An Architecture Called R-CNN (Region
No ratings yet
Ross Girshick Et Al - in 2013 Proposed An Architecture Called R-CNN (Region
6 pages
B1 Booster v1
No ratings yet
B1 Booster v1
32 pages
BZ-08-062-F Forklift Handover Checklist Form
No ratings yet
BZ-08-062-F Forklift Handover Checklist Form
2 pages
Sono 336 Carotid-Worksheet
No ratings yet
Sono 336 Carotid-Worksheet
1 page
Visvesvaraya Technological University
No ratings yet
Visvesvaraya Technological University
11 pages
Panasonic 120 150 PDF
No ratings yet
Panasonic 120 150 PDF
5 pages
QS-302 110907
No ratings yet
QS-302 110907
3 pages
Regular Letter 2024 Dulguime Jesus Carl 1
No ratings yet
Regular Letter 2024 Dulguime Jesus Carl 1
2 pages
Internship
No ratings yet
Internship
30 pages
INTERNSHIPREPORT
No ratings yet
INTERNSHIPREPORT
34 pages
Sarumathi Intern18
No ratings yet
Sarumathi Intern18
37 pages
Abhi Inter 01
No ratings yet
Abhi Inter 01
68 pages
Predicting Student Performance
No ratings yet
Predicting Student Performance
38 pages
XI - BST - 3 - Private, Public and Global Enterprises
No ratings yet
XI - BST - 3 - Private, Public and Global Enterprises
3 pages
Updated Project Report Format 2025
No ratings yet
Updated Project Report Format 2025
58 pages
Internship Report 1
No ratings yet
Internship Report 1
19 pages
CONTOH SKRIPSI (Analisa Penempatan Shear Wall)
No ratings yet
CONTOH SKRIPSI (Analisa Penempatan Shear Wall)
61 pages
DiGi KaGB T&C
No ratings yet
DiGi KaGB T&C
5 pages
TECH-5 - Rahul Dhall CV
No ratings yet
TECH-5 - Rahul Dhall CV
3 pages
2 Final
No ratings yet
2 Final
45 pages
Anthony 8
No ratings yet
Anthony 8
2 pages
4-2 Project Documentation
No ratings yet
4-2 Project Documentation
72 pages
Final Report
No ratings yet
Final Report
22 pages
Batch 1 Job Market Analysis and Prediction-1
No ratings yet
Batch 1 Job Market Analysis and Prediction-1
60 pages
Gotaq QPCR Master Mix Quick Protocol
No ratings yet
Gotaq QPCR Master Mix Quick Protocol
1 page
The Wizard's Harem - Volume Five - His Elven Dancer - Griz T. Orc & Kimiko Petaway - 2020 - Anna's Archive
No ratings yet
The Wizard's Harem - Volume Five - His Elven Dancer - Griz T. Orc & Kimiko Petaway - 2020 - Anna's Archive
45 pages
Empowering Small Companies With Automated Sales Forecasting
No ratings yet
Empowering Small Companies With Automated Sales Forecasting
66 pages
Ci Driver Do Motor Do CD Rom Datasheet
No ratings yet
Ci Driver Do Motor Do CD Rom Datasheet
11 pages
21P31A05C3
No ratings yet
21P31A05C3
54 pages
Chotu 101
No ratings yet
Chotu 101
28 pages
Final Project Sample Report
No ratings yet
Final Project Sample Report
69 pages
20MCA041
No ratings yet
20MCA041
72 pages
Egsh064784 (1) - 060844
No ratings yet
Egsh064784 (1) - 060844
1 page
111 Merged
No ratings yet
111 Merged
10 pages
Đề Khảo Sát Cuối Kỳ Ii
No ratings yet
Đề Khảo Sát Cuối Kỳ Ii
5 pages
The Ultimate Guide To Reading The Water
No ratings yet
The Ultimate Guide To Reading The Water
39 pages
23KE1F0024 (5)_merged
No ratings yet
23KE1F0024 (5)_merged
58 pages

Salary Prediction Document

Uploaded by

Salary Prediction Document

Uploaded by

SALARY PREDICTION

COMPUTER SCIENCE & ENGINEERING

21A41A0562 : SK. FAYAZ AHAMMED

Under the Esteemed Guidance of

Mrs. B. Prasanna Jyothi

Department of Computer science & Engineering

Department of Computer science &Engineering

GUIDE HEAD OF THE DEPARTMENT

B. Prasanna Jyothi R. Pitchaiah

SK.FAYAZ AHAMMED - (21A41A0562)

6.2 Use Case Diagram 19

6.3 Activity Diagram 20

Chapter 7 Implementation of the Model

7.2 Loading the Data

7.14 Evaluation of the Model

3 Use Case Diagram 19

5 Loading the Dataset 21

14 Overfitting and Underfitting 29

15 Evaluation of the Model 30

1) Susmita Ray," A Quick Review of Machine Learning Algorithms," 2019 International

MERITS OF THE PROPOSED SYSTEM

2. No human intervention needed (automation): With implementation of ML model, there is no need

Data Data Data Application of

• Data Collection: Collecting raw data from billions of datasets available.

12. Implementing Linear Regression:

• Optimization is the process of adjusting a model to get the best performance

You might also like