Predict Employee Retention Using Data Science

Article · March 2023

Dr. Anil Kumar Dubey¹, Ila Maheshwari², Ashutosh Mishra³
¹Associate Professor, ²,³UG Scholar, Dept. of CSE, Poornima Institute of Engineering & Technology, Jaipur, India

Abstract: Nowadays, data science predictions are used in IT industries to improve market investment, employee management, and more. Retaining valuable employees has become an important issue for organizations, because it is hard to find out why employees leave and keeping them satisfied is a big challenge. To address this, this report predicts the retention of an employee in an organization using Python programming with data science methods. The main idea is to find out which valuable employees will leave the company and which features influence that decision, such as salary level, number of hours worked per week, promotions, and number of work accidents. The application was developed in the Python programming language, and predictions are made with the help of data science and machine learning models. The design criteria and the implementation details are presented in this report.

Keywords: Data Science, Preprocessing Techniques, Machine Learning, Supervised Learning, Logistic Regression.

I. INTRODUCTION

Data mining is the next big thing in the world of Information Technology, and the use of data extraction is increasing day by day. Data science is the process of mining useful insights from large amounts of data for use in development. Several algorithms, methods, and analysis processes are used to extract data, depending on the kind of data available and what the analyst intends to do with it. Data arrives in raw form and must be preprocessed before an algorithm can be applied to it. Preprocessing techniques include collection, noise removal, data reduction, and transformation. Data science methodologies are mainly classified into two categories: prediction and pattern discovery. Prediction is the process of producing an estimate by analyzing previous results, and is known as regression or supervised learning. Pattern discovery applies different approaches to find similarities and dissimilarities in the given data by assigning class labels, and is known as clustering or unsupervised learning.

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. A Data Analyst usually explains what is happening by processing the history of the data. A Data Scientist, on the other hand, not only performs exploratory analysis to discover insights from the data, but also uses various advanced machine learning algorithms to identify the occurrence of a particular event in the future. A Data Scientist looks at the data from many angles, sometimes angles not known before. Data Science is thus primarily used to make decisions and predictions using predictive causal analytics, prescriptive analytics (predictive plus decision science), and machine learning.

Larger companies have more than a thousand employees, so attending to the needs and satisfaction of each employee is a challenging task. As a result, valuable and talented employees leave the company without giving a proper reason. This paper addresses this problem with a prediction model that can be used to predict which employees will leave the company and which will not. It also helps identify the reasons that motivate employees to change companies, such as lower salary, fewer promotions, or heavy workload. To obtain a yes-or-no result, we use logistic regression, which predicts a binary value: 0 means the employee will not leave the company, and 1 means he/she will.

II. PREVIOUS WORK

Retaining valuable employees is a major issue for companies, and several efforts have been made to identify suitable employee management policies. We discuss some of this work below.

Piotr Płoński (MLJAR) et al. [1] proposed analytic methods that can improve Human Resources (HR) management for companies with a large number of employees by providing approaches to predict employee attrition with machine learning. They used data on 1200 employees for training; the data describes each employee, while retention is unknown and is predicted using binary classification.

Le Zhang and Graham Williams et al. [2] argued that employee retention is the biggest challenge for a company, so it is important for a company to recognize behavioural patterns in order to understand its employees better. They used R for prediction, with feature extraction methods such as word-to-vector, term frequency, and term frequency-inverse document frequency, and R packages such as tm. They concluded that ensemble techniques can be deployed to effectively boost model performance.

Ashish Mishra et al. [3] proposed that the first step in talent management is recruiting the right person, and that the most easily available data source for present and past candidates is their résumé. Their paper provides a method to calculate an employee score from educational and business experience scores. They concluded that information such as number of years of education, number of organizations worked for, number of positions held in the past, and age can easily be translated into a score for every employee, which can be used for predicting retention.

Rupesh Khare, Dimple Kaloya and Gauri Gupta et al. [4] proposed that a risk equation can be developed and used to assess attrition risk for a company's current employees. They concluded that, among the various attrition prediction techniques available, Logistic Regression and Discriminant Analysis come closest to a solution that produces highly accurate results.

Randy Lao et al. [5] states that a company that creates a healthy environment and provides equal opportunities for employees to shine grows rapidly. The goal of that work is to create a model that helps improve retention strategies for targeted employees. It used the R programming language and concluded that employees with higher satisfaction and evaluation ratings are less likely to leave the company.

III. MACHINE LEARNING

Machine learning is the process of making a machine learn by itself through patterns and training datasets. Training datasets are data given to the machine so that it can understand the hidden patterns within the data and form relations from them. This helps machines work efficiently by making them process information like a human brain. Pattern recognition is the most challenging task for developers, who must use algorithms that allow different machines to work according to requirements.

This paper focuses on predicting the retention of an employee within an organization, that is, whether the employee will leave the company or continue with it. It uses data on previous employees who have worked for the company and, by finding patterns, predicts retention as yes or no. It uses employee parameters such as salary, number of years spent in the company, promotions, number of hours, work accidents, and financial background.

Because of new computing technologies, machine learning today is not like machine learning of the past. It was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks; researchers interested in artificial intelligence [6] wanted to see whether computers could learn from data. The iterative aspect of machine learning is important because, as models are exposed to new data, they can independently adapt. They learn from previous computations to produce reliable, repeatable decisions and results. It is a science that is not new, but one that is gaining fresh momentum. While many machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical calculations to big data, over and over and faster and faster, is a recent development.

Fig. 1. Types of Machine Learning

Machine learning algorithms are differentiated as supervised or unsupervised.

A. Supervised machine learning algorithms can apply what has been learned in the past to new data, using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system can provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly.

B. In contrast, unsupervised machine learning algorithms are used when the data used for training is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabeled data. The system does not figure out the right output; rather, it explores the data and can draw inferences from datasets to describe hidden structures in unlabeled data.

C. Semi-supervised [7] machine learning algorithms fall somewhere between supervised and unsupervised learning, since they use both labeled and unlabeled data for training: typically a small amount of labeled data and a large amount of unlabeled data. Systems that use this method can considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the acquired labeled data requires skilled and relevant resources in order to train on it or learn from it. Otherwise, acquiring unlabeled data generally does not require additional resources.
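As a rough illustration of the supervised/unsupervised distinction described in this section, the following sketch trains one model with labels and one without. The toy data and all variable names are assumptions made for illustration and do not come from the paper; scikit-learn's LogisticRegression and KMeans are used only as convenient stand-ins for each family.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: learn from labeled examples, then predict labels for new inputs.
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(np.array([[1.1, 0.9], [7.9, 8.0]]))

# Unsupervised: no labels are given; KMeans groups the points by similarity.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_  # cluster numbering is arbitrary (0/1 may be swapped)

print(supervised_pred)
print(cluster_ids)
```

The supervised model returns the class labels it was taught, while the clustering result only says which points belong together; mapping clusters to meaningful names is left to the analyst.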

D. Reinforcement machine learning is a learning method that interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. This method allows machines and software agents to automatically determine the ideal behaviour within a specific context in order to maximize performance. Simple reward feedback is required for the agent to learn which action is best; this is known as the reinforcement signal.

IV. TECHNOLOGY

We have used the Python programming language, which is an interpreted, dynamically typed language with a simple syntax. Python is used for many kinds of applications, such as IoT development, data science, web development, and scripting. As a result, it is now widely used across the globe.

Python has a large number of libraries, which makes it easy to use for each application: Beautiful Soup for web scraping, Tkinter for GUI development, urllib2 for web connectivity, and sklearn [8], numpy, pandas and so on for machine learning. Python is one of the most widely used languages for Data Science applications because it provides libraries such as Pandas and nltk, which can manage large datasets in a suitable way, and visualization libraries such as Matplotlib, Bokeh, and Seaborn, which are highly expressive for graph and plot representations.

The sklearn library provides a large number of machine learning algorithms, such as linear and multiple regression, polynomial regression, and decision tree classification, for prediction, clustering, and classification over data that can number in the billions of records. Machine learning is a branch of computer science that studies the design of algorithms that can learn. Typical tasks are concept learning, function learning or "predictive modeling", clustering, and finding predictive patterns. These tasks are learned through available data that were observed through experiences or instructions, for example. The hope that comes with this discipline is that including the experience in its tasks will eventually improve the learning. The ultimate goal is for this improvement to occur in such a way that the learning itself becomes automatic, so that humans no longer need to interfere.

Scikit-learn is the most useful library for machine learning in Python. It is built on NumPy, SciPy and matplotlib, and contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. Scikit-learn provides a range of supervised and unsupervised learning algorithms through a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed under many Linux distributions, encouraging academic and commercial use.

Fig. 2. Prediction Methodology

V. PREPROCESSING TECHNIQUES

In simple words, pre-processing [9] refers to the transformations applied to the data before feeding it to the algorithm. In Python, the scikit-learn library has pre-built functionality under sklearn.preprocessing. The data we get from the user is raw data, so it must be cleaned, transformed, and reduced to make it suitable for applying methods to it; this process is known as preprocessing. It requires an analytic sandbox in which analysis can be performed for the entire duration of the project. The data must be explored, preprocessed, and conditioned prior to modeling. Further, ETLT (extract, transform, load, and transform) is performed to get data into the sandbox. Preprocessing improves the overall quality of the data and the efficiency of the model in producing results. There are many options for pre-processing, such as:

Fig. 3. Preprocessing Techniques

A. Feature Scaling:

Feature scaling is the method of limiting the range of variables so that they can be compared on common grounds. It is performed on continuous variables.

B. Label Encoding:

Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1.
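The two steps just described, feature scaling and label encoding, can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the sample values and variable names are hypothetical, and standardization via StandardScaler is only one of several scaling choices (the paper does not say which scaler was used).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical values; the paper's dataset is not reproduced here.
salary = ["low", "medium", "high", "low"]                        # categorical
monthly_hours = np.array([[150.0], [200.0], [250.0], [200.0]])   # continuous

# Label encoding: each category becomes an integer in 0 .. n_classes-1.
# Classes are sorted alphabetically, so high=0, low=1, medium=2.
encoder = LabelEncoder()
salary_codes = encoder.fit_transform(salary)

# Feature scaling (standardization is assumed here as one common choice):
# subtract the column mean and divide by the column standard deviation.
scaler = StandardScaler()
hours_scaled = scaler.fit_transform(monthly_hours)

print(list(encoder.classes_))   # ['high', 'low', 'medium']
print(salary_codes.tolist())    # [1, 2, 0, 1]
print(hours_scaled.ravel())
```

Note that the integer codes follow alphabetical order, not the natural low < medium < high order. scikit-learn documents LabelEncoder for target labels; for input features, OrdinalEncoder with an explicit category order is a common alternative.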
International Journal of Electrical Electronics & Computer Science Engineering
Special Issue - NCSCT-2018 | E-ISSN : 2348-2273 | P-ISSN : 2454-1222
March, 2018 | Available Online at www.ijeecse.com

C. One-Hot Encoding:

One-Hot Encoding transforms each categorical feature with n possible values into n binary features, with only one active. Most ML algorithms either learn a single weight for each feature or compute distances between samples.

VI. METHODOLOGY USED FOR PREDICTION

This prediction model aims to predict whether an employee will stay with or leave the organization, based on analysis of data on past employees. The predictor variables include satisfaction level, last evaluation, average monthly hours, salary, work accident, promotion, time spent at the company, and department. Based on these parameters, different machine learning models, such as logistic regression and decision tree classification, are applied to predict which employee will leave next and which factors are most significant in that decision.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter, of the conditional distribution of the dependent variable given the independent variables. In all cases, a function of the independent variables called the regression function is to be estimated. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the prediction of the regression function using a probability distribution. A related but distinct approach is Necessary Condition Analysis (NCA), which estimates the maximum (rather than average) value of the dependent variable for a given value of the independent variable (ceiling line rather than central line) in order to identify what value of the independent variable is necessary but not sufficient for a given value of the dependent variable.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not prove causation.

Many techniques for carrying out regression analysis have been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.

Through this prediction model, an organization can choose its policies to keep good employees from leaving. The data science part of this project takes raw data from a CSV file and then applies different processing techniques to make the data useful for decision making, such as arranging the dataset, label encoding, one-hot encoding, and feature scaling. Regression is the most common technique used for making predictions with the Python programming language. Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers, data analysts, and data scientists to eliminate and evaluate the best set of variables to be used for building predictive models.

A. Linear Regression:

Linear regression is the process of finding the relationship between two variables using a linear equation. It is the most basic form of prediction using regression, known as supervised learning, in which a training dataset is used to train the machine so that, when asked to make predictions, it can produce results using the relationship between the variables. It can be used for at most two variables; for predictions involving more variables, multiple or polynomial regression is used. It produces its output as a value after different preprocessing techniques have been applied. The least squares method is the most common approach for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are first squared, when added, there is no cancelling out between positive and negative values.

B. Polynomial Regression:

Polynomial regression is the approach in which the relationship between at least two variables is found in the form of a polynomial equation; this equation is then used to make predictions for the test dataset. It is an extended form of linear regression, as it can be used for finding relationships between more than two variables. While there might be a temptation to fit a higher-degree polynomial to get a lower error, this can result in overfitting. Always plot the relationships to see the fit, and focus on making sure that the curve fits the nature of the problem. In particular, look out for curves towards the ends and see whether those shapes and trends make sense. Higher-degree polynomials can end up producing weird results on extrapolation.

C. Logistic Regression:

Regression is the process of predicting the form of the relationship between two variables. The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula [10]

y = m + c*x

where y = estimated dependent variable score, m = constant, c = regression coefficient, and x = score on the independent variable.

Fig. 4. Logistic Regression

Logistic regression is a special type of regression in which predictions are made as yes or no, that is, as binary values. Here we have to predict whether an employee will leave or not, so it is the most appropriate regression method for making predictions. It is widely used for classification problems. Logistic regression does not require a linear relationship between the dependent and independent variables. It can handle various types of relationships because it applies a non-linear log transformation to the predicted odds ratio. To avoid overfitting and underfitting, we should include all significant variables. A good way to ensure this is to use a stepwise method to estimate the logistic regression. It requires large sample sizes, because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares. The independent variables should not be correlated with each other, i.e. no multicollinearity. However, we have the option to include interaction effects of categorical variables in the analysis and in the model. If the values of the dependent variable are ordinal, it is called ordinal logistic regression; if the dependent variable is multi-class, it is known as multinomial logistic regression.

D. Lasso Regression:

Lasso (Least Absolute Shrinkage and Selection Operator) penalizes the absolute size of the regression coefficients. In addition, it can reduce the variability and improve the accuracy of linear regression models. Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. Penalizing (or, equivalently, constraining) the sum of the absolute values of the estimates causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates are shrunk towards zero. This results in variable selection out of the given n variables.

VII. RESULT AND DISCUSSION

This report aims to predict whether an employee will stay with or leave the organization, based on analysis of data on past employees. The predictor variables include satisfaction level, last evaluation, average monthly hours, salary, work accident, promotion, time spent at the company, and department. Based on these parameters, different machine learning models, such as logistic regression and decision tree classification, are applied to predict which employee will leave next and which factors are most significant in that decision.

Through this paper, an organization can choose its strategies to keep good employees from leaving. The data science part of this report takes raw data from a CSV document and then applies different processing steps to make the data useful for decision making, such as arranging the dataset, label encoding, one-hot encoding, and feature scaling.

It then applies different regression models to predict whether the employee will leave the organization or not, as 0 and 1. An outcome of 0 means that the employee will continue with the organization, whereas 1 means that the employee will leave.

The sample data that we used for making predictions is given here in tabular form; it contains columns for satisfaction level, last evaluation, number of projects, average monthly hours, years spent in the company, work accident, promotion, department, and salary.
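The workflow described in this section (tabular data, preprocessing, a logistic regression model, predictions as 0 or 1) could be sketched roughly as follows. This is not the authors' code: the column names, the synthetic rows standing in for the CSV file, the train/test split, and the use of one-hot encoding plus standardization inside a scikit-learn pipeline are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the CSV; with a real file use pd.read_csv(path).
# Column names are assumptions modelled on the fields listed in the text.
rows = []
for i in range(40):
    leaver = i % 2  # 1 = leaves, 0 = stays (synthetic, trivially separable)
    rows.append({
        "satisfaction_level": 0.2 if leaver else 0.8,
        "average_monthly_hours": 260 + i if leaver else 160 + i,
        "salary": "low" if leaver else ("medium" if i % 4 == 0 else "high"),
        "department": ["sales", "technical", "hr"][i % 3],
        "left": leaver,
    })
df = pd.DataFrame(rows)

X = df.drop(columns="left")
y = df["left"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scale the continuous columns and one-hot encode the categorical ones.
numeric = ["satisfaction_level", "average_monthly_hours"]
categorical = ["salary", "department"]
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)

# Predictions are 0 (stays) or 1 (leaves); compare them with held-out labels.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(acc)
print(cm)
```

Because the synthetic rows are separable by construction, the printed score says nothing about real data; the sketch only shows how the preprocessing and the model fit together, and how predictions can be compared with known outcomes.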

Fig. 5. Dataset for Prediction

The accuracy of the result is calculated from the previously analysed data with the help of a confusion matrix and the accuracy score. The predictions are compared with the available data, and 97% of the predictions are correct.

Fig. 6. Result

The figure contains the result in the form of 0 or 1, with 0 representing an employee who will not leave the company and 1 representing an employee who is going to leave the company.

VIII. CONCLUSIONS

In this study, we learn that the retention of an employee within an organization can be predicted using the logistic regression technique, which produces a result with 97% accuracy. It can also help identify the factors that affect the employees in the organization, such as pay level, workload, and promotions.

The future scope of data science is bright; hence, this technique can be used in any organization for better employee management and satisfaction. This work can be extended further: it currently requires data as .csv files only, and this limitation can be removed.

ACKNOWLEDGMENT

This research is guided by Dr. Anil Kumar Dubey. We thank our guide from Poornima Institute of Engineering and Technology, Jaipur, who provided insight and expertise that greatly assisted the research for this paper.

IX. REFERENCES

[1] Piotr Płoński (MLJAR), “Human-first Machine Learning Platform,” Human Resource Analytics Predict Employee Attrition.
[2] Le Zhang and Graham Williams (Data Scientist, Microsoft), “Employee Retention with R based Data Science Accelerator”.
[3] Ashish Mishra (Data Scientist, Experfy), “Using Machine Learning to Predict and explain Employee Attrition”.
[4] Rupesh Khare, Dimple Kaloya and Gauri Gupta, “Employee Attrition Risk Assessment using Logistic Regression Analysis,” from 2nd IIMA International Conference on Advanced Data Analysis, Business Analytics and Intelligence.
[5] Randy Lao, “Predicting Employee Kernelover,” Kaggle.
[6] Sandra W. Pyke & Peter M. Sheridan, “Logistic Regression Analysis of Graduate Student Retention,” from The Canadian Journal of Higher Education, Vol. XXIII-2, 1993.
[7] Prof. Dr. Vjollca Hasani and Prof. Dr. Alba Dumi, “Application of Logistic Regression in the Study of Students’ Performance Level,” Journal of Educational and Social Research Italy.
[8] Dr. Jonathan Erhardt, “Artificial Intelligence: Opportunities and Risks,” Policy paper by the Effective Altruism Foundation.
[9] Sofia Stromberg, “Binary Logistic Regression and its application to data from a study of children's recognition of their own recorded voices,” term paper in statistical methods.
[10] Anish Talwar and Yogesh Kumar, “Machine Learning: An artificial intelligence methodology,” from International Journal of Engineering and Computer Science, ISSN: 2319-7242, Volume 2, Issue 12, Dec. 2013, Page No. 3400-3404.
