ML Project 2

The document outlines a data preprocessing workflow for an HR analytics dataset, including steps such as loading data, handling missing values, encoding categorical variables, and scaling numerical features. It also includes visualizations to analyze employee attrition based on factors such as age and monthly income. The final dataset consists of 48 columns after processing, ready for further analysis or modeling.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df = pd.read_csv("HR_Analytics.csv")
df.head()

   EmpID  Age AgeGroup Attrition     BusinessTravel  DailyRate  \
0  RM297   18    18-25       Yes      Travel_Rarely        230
1  RM302   18    18-25        No      Travel_Rarely        812
2  RM458   18    18-25       Yes  Travel_Frequently       1306
3  RM728   18    18-25        No         Non-Travel        287
4  RM829   18    18-25       Yes         Non-Travel        247

               Department  DistanceFromHome  Education EducationField  ...  \
0  Research & Development                 3          3  Life Sciences  ...
1                   Sales                10          3        Medical  ...
2                   Sales                 5          3      Marketing  ...
3  Research & Development                 5          2  Life Sciences  ...
4  Research & Development                 8          1        Medical  ...

   RelationshipSatisfaction  StandardHours  StockOptionLevel  \
0                         3             80                 0
1                         1             80                 0
2                         4             80                 0
3                         4             80                 0
4                         4             80                 0

   TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany  \
0                  0                      2                3               0
1                  0                      2                3               0
2                  0                      3                3               0
3                  0                      2                3               0
4                  0                      0                3               0

   YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
0                   0                        0                   0.0
1                   0                        0                   0.0
2                   0                        0                   0.0
3                   0                        0                   0.0
4                   0                        0                   0.0

[5 rows x 38 columns]

df.isnull().sum()

EmpID 0
Age 0
AgeGroup 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EmployeeCount 0
EmployeeNumber 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
SalarySlab 0
MonthlyRate 0
NumCompaniesWorked 0
Over18 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StandardHours 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 57
dtype: int64
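Only YearsWithCurrManager has gaps (57 of 1480 rows, roughly 3.9%). As a quick follow-up, not part of the original run, the gap can be expressed as a percentage per column:

# Share of missing values per column, shown only where something is missing
missing_pct = df.isnull().mean().mul(100).round(2)
print(missing_pct[missing_pct > 0])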

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 EmpID 1480 non-null object
1 Age 1480 non-null int64
2 AgeGroup 1480 non-null object
3 Attrition 1480 non-null object
4 BusinessTravel 1480 non-null object
5 DailyRate 1480 non-null int64
6 Department 1480 non-null object
7 DistanceFromHome 1480 non-null int64
8 Education 1480 non-null int64
9 EducationField 1480 non-null object
10 EmployeeCount 1480 non-null int64
11 EmployeeNumber 1480 non-null int64
12 EnvironmentSatisfaction 1480 non-null int64
13 Gender 1480 non-null object
14 HourlyRate 1480 non-null int64
15 JobInvolvement 1480 non-null int64
16 JobLevel 1480 non-null int64
17 JobRole 1480 non-null object
18 JobSatisfaction 1480 non-null int64
19 MaritalStatus 1480 non-null object
20 MonthlyIncome 1480 non-null int64
21 SalarySlab 1480 non-null object
22 MonthlyRate 1480 non-null int64
23 NumCompaniesWorked 1480 non-null int64
24 Over18 1480 non-null object
25 OverTime 1480 non-null object
26 PercentSalaryHike 1480 non-null int64
27 PerformanceRating 1480 non-null int64
28 RelationshipSatisfaction 1480 non-null int64
29 StandardHours 1480 non-null int64
30 StockOptionLevel 1480 non-null int64
31 TotalWorkingYears 1480 non-null int64
32 TrainingTimesLastYear 1480 non-null int64
33 WorkLifeBalance 1480 non-null int64
34 YearsAtCompany 1480 non-null int64
35 YearsInCurrentRole 1480 non-null int64
36 YearsSinceLastPromotion 1480 non-null int64
37 YearsWithCurrManager 1423 non-null float64
dtypes: float64(1), int64(25), object(12)
memory usage: 439.5+ KB

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Drop identifier and constant columns that carry no predictive signal
df.drop(columns=["EmpID", "EmployeeNumber", "Over18", "EmployeeCount", "StandardHours"], inplace=True)

# Fill missing values in 'YearsWithCurrManager' with the median
# (assignment avoids the chained-inplace warning in recent pandas)
df['YearsWithCurrManager'] = df['YearsWithCurrManager'].fillna(df['YearsWithCurrManager'].median())

# Encode the target 'Attrition' (No = 0, Yes = 1) and binary 'Gender'
label_encoder = LabelEncoder()
df['Attrition'] = label_encoder.fit_transform(df['Attrition'])
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

# One-hot encode nominal categorical features
categorical_cols = ['BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus', 'OverTime']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Scale numerical features to zero mean and unit variance
scaler = StandardScaler()
numerical_cols = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate', 'YearsAtCompany']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Check the final dataset
df.head()

        Age AgeGroup  Attrition  DailyRate  DistanceFromHome  Education  \
0  -2.07305    18-25          1  -1.417860         -0.765246          3
1  -2.07305    18-25          0   0.026342          0.095926          3
2  -2.07305    18-25          1   1.252176         -0.519197          3
3  -2.07305    18-25          0  -1.276417         -0.519197          2
4  -2.07305    18-25          1  -1.375675         -0.150123          1

   EnvironmentSatisfaction  Gender  HourlyRate  JobInvolvement  ...  \
0                        3       1   -0.582896               3  ...
1                        4       0    0.155242               2  ...
2                        2       1    0.155242               3  ...
3                        2       1    0.352079               3  ...
4                        3       1    0.696543               3  ...

   JobRole_Laboratory Technician  JobRole_Manager  \
0                              1                0
1                              0                0
2                              0                0
3                              0                0
4                              1                0

   JobRole_Manufacturing Director  JobRole_Research Director  \
0                               0                          0
1                               0                          0
2                               0                          0
3                               0                          0
4                               0                          0

   JobRole_Research Scientist  JobRole_Sales Executive  \
0                           0                        0
1                           0                        0
2                           0                        0
3                           1                        0
4                           0                        0

   JobRole_Sales Representative  MaritalStatus_Married  MaritalStatus_Single  \
0                             0                      0                     1
1                             1                      0                     1
2                             1                      0                     1
3                             0                      0                     1
4                             0                      0                     1

   OverTime_Yes
0             0
1             0
2             1
3             0
4             0

[5 rows x 48 columns]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 48 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1480 non-null float64
1 AgeGroup 1480 non-null object
2 Attrition 1480 non-null int32
3 DailyRate 1480 non-null float64
4 DistanceFromHome 1480 non-null float64
5 Education 1480 non-null int64
6 EnvironmentSatisfaction 1480 non-null int64
7 Gender 1480 non-null int64
8 HourlyRate 1480 non-null float64
9 JobInvolvement 1480 non-null int64
10 JobLevel 1480 non-null int64
11 JobSatisfaction 1480 non-null int64
12 MonthlyIncome 1480 non-null float64
13 SalarySlab 1480 non-null object
14 MonthlyRate 1480 non-null float64
15 NumCompaniesWorked 1480 non-null int64
16 PercentSalaryHike 1480 non-null int64
17 PerformanceRating 1480 non-null int64
18 RelationshipSatisfaction 1480 non-null int64
19 StockOptionLevel 1480 non-null int64
20 TotalWorkingYears 1480 non-null int64
21 TrainingTimesLastYear 1480 non-null int64
22 WorkLifeBalance 1480 non-null int64
23 YearsAtCompany 1480 non-null float64
24 YearsInCurrentRole 1480 non-null int64
25 YearsSinceLastPromotion 1480 non-null int64
26 YearsWithCurrManager 1480 non-null float64
27 BusinessTravel_TravelRarely 1480 non-null uint8
28 BusinessTravel_Travel_Frequently 1480 non-null uint8
29 BusinessTravel_Travel_Rarely 1480 non-null uint8
30 Department_Research & Development 1480 non-null uint8
31 Department_Sales 1480 non-null uint8
32 EducationField_Life Sciences 1480 non-null uint8
33 EducationField_Marketing 1480 non-null uint8
34 EducationField_Medical 1480 non-null uint8
35 EducationField_Other 1480 non-null uint8
36 EducationField_Technical Degree 1480 non-null uint8
37 JobRole_Human Resources 1480 non-null uint8
38 JobRole_Laboratory Technician 1480 non-null uint8
39 JobRole_Manager 1480 non-null uint8
40 JobRole_Manufacturing Director 1480 non-null uint8
41 JobRole_Research Director 1480 non-null uint8
42 JobRole_Research Scientist 1480 non-null uint8
43 JobRole_Sales Executive 1480 non-null uint8
44 JobRole_Sales Representative 1480 non-null uint8
45 MaritalStatus_Married 1480 non-null uint8
46 MaritalStatus_Single 1480 non-null uint8
47 OverTime_Yes 1480 non-null uint8
dtypes: float64(8), int32(1), int64(16), object(2), uint8(21)
memory usage: 336.9+ KB

# Get unique values in SalarySlab
print("Unique values in SalarySlab:")
print(df['SalarySlab'].unique())

# Get unique values in AgeGroup
print("\nUnique values in AgeGroup:")
print(df['AgeGroup'].unique())

# Encoding order
# 1. SalarySlab: "Upto 5k" -> 0, "5k-10k" -> 1, "10k-15k" -> 2, "15k+" -> 3
# 2. AgeGroup:   "18-25" -> 0, "26-35" -> 1, "36-45" -> 2, "46-55" -> 3, "55+" -> 4

Unique values in SalarySlab:
['Upto 5k' '5k-10k' '10k-15k' '15k+']

Unique values in AgeGroup:
['18-25' '26-35' '36-45' '46-55' '55+']

from sklearn.preprocessing import OrdinalEncoder

# Define the category order for SalarySlab and AgeGroup
salary_slab_order = ["Upto 5k", "5k-10k", "10k-15k", "15k+"]
age_group_order = ["18-25", "26-35", "36-45", "46-55", "55+"]

# Apply ordinal encoding
ordinal_encoder = OrdinalEncoder(categories=[salary_slab_order, age_group_order])
df[['SalarySlab', 'AgeGroup']] = ordinal_encoder.fit_transform(df[['SalarySlab', 'AgeGroup']])

# Convert to integer for better readability
df[['SalarySlab', 'AgeGroup']] = df[['SalarySlab', 'AgeGroup']].astype(int)
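As a sanity check, not in the original notebook, the fitted encoder's stored category order determines the integer codes and can be printed to confirm it matches the mapping listed above:

# The fitted OrdinalEncoder keeps its category order in categories_;
# each category's position is the integer code it receives
for col, cats in zip(['SalarySlab', 'AgeGroup'], ordinal_encoder.categories_):
    print(col, '->', {cat: code for code, cat in enumerate(cats)})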

import matplotlib.pyplot as plt
import seaborn as sns

# Countplot of Attrition
plt.figure(figsize=(6, 4))
sns.countplot(x=df['Attrition'], palette="coolwarm")
plt.title("Employee Attrition Count")
plt.xlabel("Attrition (0 = No, 1 = Yes)")
plt.ylabel("Count")
plt.show()

# Age vs monthly income, colored by attrition
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df['Age'], y=df['MonthlyIncome'], hue=df['Attrition'], alpha=0.7, palette="coolwarm")
plt.title("Age vs Monthly Income (Colored by Attrition)")
plt.xlabel("Age")
plt.ylabel("Monthly Income")
plt.legend(title="Attrition", labels=["No", "Yes"])
plt.show()

# Attrition rate by salary slab
plt.figure(figsize=(6, 4))
sns.barplot(x=df['SalarySlab'], y=df['Attrition'], palette="coolwarm")
plt.title("Attrition Rate by Salary Slab")
plt.xlabel("Salary Slab (Encoded)")
plt.ylabel("Attrition Rate")
plt.show()

# Work-life balance vs attrition
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['WorkLifeBalance'], y=df['Attrition'], palette="coolwarm")
plt.title("Work-Life Balance vs Attrition")
plt.xlabel("Work-Life Balance (1 = Worst, 4 = Best)")
plt.ylabel("Attrition")
plt.show()

# Years at company by attrition
plt.figure(figsize=(8, 5))
sns.boxplot(x=df['Attrition'], y=df['YearsAtCompany'], palette='Set2')
plt.title("Attrition vs Years at Company")
plt.xlabel("Attrition (0 = No, 1 = Yes)")
plt.ylabel("Years at Company")
plt.show()

# Distribution of monthly income
plt.figure(figsize=(8, 5))
sns.histplot(df['MonthlyIncome'], bins=30, kde=True, color='purple')
plt.title("Distribution of Monthly Income")
plt.xlabel("Monthly Income")
plt.ylabel("Count")
plt.show()

# Years at company vs monthly income, colored by attrition
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df['YearsAtCompany'], y=df['MonthlyIncome'], hue=df['Attrition'], alpha=0.6, palette='coolwarm')
plt.title("Years at Company vs Monthly Income (Colored by Attrition)")
plt.xlabel("Years at Company")
plt.ylabel("Monthly Income")
plt.show()
from imblearn.over_sampling import SMOTE

X = df.drop(columns=['Attrition'])  # Features
y = df['Attrition']                 # Target

# Train-test split (stratified to preserve the class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to balance the classes in the training set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
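A quick check, not in the original run, confirms what SMOTE did to the class balance:

# Class counts before and after resampling; SMOTE oversamples the
# minority class until both classes match the majority count
print("Before SMOTE:", y_train.value_counts().to_dict())
print("After SMOTE: ", pd.Series(y_train_resampled).value_counts().to_dict())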

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define models
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(kernel='linear', probability=True)
}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Model: {name}")
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("-" * 50)

Model: Logistic Regression
Accuracy: 0.9155
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95       248
           1       0.83      0.60      0.70        48

    accuracy                           0.92       296
   macro avg       0.88      0.79      0.82       296
weighted avg       0.91      0.92      0.91       296

Confusion Matrix:
 [[242   6]
 [ 19  29]]
--------------------------------------------------
Model: Random Forest
Accuracy: 0.8682
Classification Report:
               precision    recall  f1-score   support

           0       0.87      1.00      0.93       248
           1       0.91      0.21      0.34        48

    accuracy                           0.87       296
   macro avg       0.89      0.60      0.63       296
weighted avg       0.87      0.87      0.83       296

Confusion Matrix:
 [[247   1]
 [ 38  10]]
--------------------------------------------------
Model: Decision Tree
Accuracy: 0.8209
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.94      0.90       248
           1       0.41      0.23      0.29        48

    accuracy                           0.82       296
   macro avg       0.63      0.58      0.60       296
weighted avg       0.79      0.82      0.80       296

Confusion Matrix:
 [[232  16]
 [ 37  11]]
--------------------------------------------------
Model: SVM
Accuracy: 0.9088
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.98      0.95       248
           1       0.84      0.54      0.66        48

    accuracy                           0.91       296
   macro avg       0.88      0.76      0.80       296
weighted avg       0.90      0.91      0.90       296

Confusion Matrix:
 [[243   5]
 [ 22  26]]
--------------------------------------------------
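One caveat in the workflow above: the four models were trained on the scaled but unresampled X_train, so SMOTE's output never reached them, while the grid searches below fit on the resampled yet unscaled features. A minimal sketch of one way to keep the steps consistent, using imblearn's Pipeline (the split is recreated here because X_train was overwritten by its scaled version, and the classifier choice is illustrative):

from imblearn.pipeline import Pipeline as ImbPipeline

# Scaling and SMOTE run only on the training data during fit, so the
# test split stays untouched and the resampled data is actually used
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipe = ImbPipeline(steps=[
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print("Pipeline test accuracy:", pipe.score(X_te, y_te))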

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}  # Regularization parameter

log_reg = LogisticRegression(class_weight='balanced', solver='liblinear')
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train_resampled, y_train_resampled)

print("Best parameters for Logistic Regression:", grid_search.best_params_)

Best parameters for Logistic Regression: {'C': 0.1}


param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

svm = SVC(class_weight='balanced')
grid_search_svm = GridSearchCV(svm, param_grid_svm, cv=5, scoring='f1')
grid_search_svm.fit(X_train_resampled, y_train_resampled)

print("Best parameters for SVM:", grid_search_svm.best_params_)

Best parameters for SVM: {'C': 10, 'kernel': 'rbf'}

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

rf = RandomForestClassifier(class_weight='balanced', random_state=42)
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='f1')
grid_search_rf.fit(X_train_resampled, y_train_resampled)

print("Best parameters for Random Forest:", grid_search_rf.best_params_)

Best parameters for Random Forest: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
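Each fitted GridSearchCV also exposes its best cross-validated score, which makes the three searches easy to compare before building an ensemble (a small addition, not in the original run):

# Best mean cross-validated F1 found by each search
for name, gs in [("Logistic Regression", grid_search), ("SVM", grid_search_svm), ("Random Forest", grid_search_rf)]:
    print(f"{name}: best CV F1 = {gs.best_score_:.3f}")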

from sklearn.ensemble import VotingClassifier

# Define individual models with the tuned hyperparameters
lr = LogisticRegression(C=0.1, class_weight='balanced')
svm = SVC(C=10, kernel='rbf', probability=True, class_weight='balanced')

# Create the ensemble; soft voting averages predicted probabilities,
# and the weights give Logistic Regression twice the SVM's influence
ensemble = VotingClassifier(estimators=[
    ('log_reg', lr),
    ('svm', svm)
], voting='soft', weights=[2, 1])

# Train and evaluate
ensemble.fit(X_train, y_train)
y_pred_ensemble = ensemble.predict(X_test)

# Report; score the ensemble's own predictions, not the y_pred left over
# from the earlier model loop
print("accuracy ", accuracy_score(y_test, y_pred_ensemble))
print(classification_report(y_test, y_pred_ensemble))

accuracy  0.8581081081081081
              precision    recall  f1-score   support

           0       0.91      0.86      0.89       249
           1       0.44      0.57      0.50        47

    accuracy                           0.81       296
   macro avg       0.68      0.72      0.69       296
weighted avg       0.84      0.81      0.82       296
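Soft voting averages the sub-models' predicted probabilities with the given weights and then takes the argmax. A hedged reconstruction, relying on the fitted sub-estimators that VotingClassifier exposes via named_estimators_, makes the rule explicit:

# Weighted average of per-class probabilities, then argmax per row;
# for soft voting this should reproduce ensemble.predict(X_test)
probs = np.average(
    [est.predict_proba(X_test) for est in ensemble.named_estimators_.values()],
    axis=0, weights=[2, 1],
)
manual_pred = probs.argmax(axis=1)
print("Matches ensemble.predict:", np.array_equal(manual_pred, y_pred_ensemble))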

from sklearn.metrics import precision_recall_curve

y_probs = ensemble.predict_proba(X_test)[:, 1]  # Probabilities for class 1
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)

# Find the best precision-recall tradeoff; precision and recall have one
# more entry than thresholds, so drop the last element before indexing
best_threshold = thresholds[np.argmax((precision * recall)[:-1])]
y_pred_adjusted = (y_probs >= best_threshold).astype(int)

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Define individual models
lr = LogisticRegression(C=0.1, class_weight='balanced')
svm = SVC(C=10, kernel='rbf', probability=True, class_weight='balanced')
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced')

# Create an ensemble with all three models; weights are adjusted
# based on each model's individual performance
ensemble = VotingClassifier(estimators=[
    ('log_reg', lr),
    ('svm', svm),
    ('rf', rf)
], voting='soft', weights=[2, 1, 2])

# Train and evaluate
ensemble.fit(X_train, y_train)
y_pred_ensemble = ensemble.predict(X_test)

# Report
print("Accuracy:", accuracy_score(y_test, y_pred_ensemble))
print(classification_report(y_test, y_pred_ensemble))

Accuracy: 0.8614864864864865
              precision    recall  f1-score   support

           0       0.89      0.95      0.92       249
           1       0.60      0.38      0.47        47

    accuracy                           0.86       296
   macro avg       0.75      0.67      0.69       296
weighted avg       0.84      0.86      0.85       296

from sklearn.metrics import precision_recall_curve

y_probs = ensemble.predict_proba(X_test)[:, 1]  # Probabilities for class 1
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)

# Find a threshold that balances precision and recall (drop the last
# precision/recall entry, which has no corresponding threshold)
best_threshold = thresholds[(precision * recall)[:-1].argmax()]
y_pred_adjusted = (y_probs >= best_threshold).astype(int)

# Re-evaluate at the adjusted threshold
print("Adjusted Accuracy:", accuracy_score(y_test, y_pred_adjusted))
print(classification_report(y_test, y_pred_adjusted))

Adjusted Accuracy: 0.652027027027027
              precision    recall  f1-score   support

           0       0.98      0.60      0.74       249
           1       0.31      0.94      0.46        47

    accuracy                           0.65       296
   macro avg       0.64      0.77      0.60       296
weighted avg       0.87      0.65      0.70       296
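The tradeoff behind the chosen threshold can also be plotted with the arrays already computed, in the same style as the earlier charts (a sketch, not part of the original run):

# Precision and recall as functions of the decision threshold; the last
# precision/recall entries have no threshold, hence the [:-1] slices
plt.figure(figsize=(6, 4))
plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.axvline(best_threshold, color='gray', linestyle='--', label="Chosen threshold")
plt.xlabel("Decision Threshold")
plt.ylabel("Score")
plt.title("Precision and Recall vs Threshold")
plt.legend()
plt.show()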
