
Made by Ishaan Mittal (Btech ECE’23 Jamia Millia Islamia)

DOCUMENTATION

Task-1 Data Loading and Preprocessing:


For preprocessing,
1. The dropna() function of the pandas DataFrame is used to remove all rows that contain NULL values.
2. The drop_duplicates() function of the pandas DataFrame is used to remove all duplicate rows in place.
The head() function of the pandas DataFrame is used to display the first few rows; its output is shown after the sketch below.
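A minimal sketch of these preprocessing steps, assuming the data is loaded from a CSV file (the file name data.csv is a placeholder):

import pandas as pd

# Load the dataset (file name assumed for illustration)
df = pd.read_csv("data.csv")

# Remove all rows that contain NULL values
df = df.dropna()

# Remove duplicate rows in place
df.drop_duplicates(inplace=True)

# Display the first few rows
print(df.head())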
   Sno     job  education  gender English speaker
0    1  manage         15    male              no
1    2   admin         16    male              no
2    3   admin         12  female              no
3    4   admin          8  female              no
4    5   admin         15    male              no

Task-2 Exploratory Data Analysis:


The Matplotlib library is used for visualisation, and the Seaborn library is used on top of it to provide a higher-level interface.

For the percentage calculations, value_counts() is used with the normalize parameter to get the proportions, which are then multiplied by 100 to get the expected result, as sketched below.
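A minimal sketch of this calculation, assuming the dataframe is named df:

# Proportion of each job role, expressed as a percentage
print(df["job"].value_counts(normalize=True) * 100)

# The same pattern is repeated for the other columns
for col in ["education", "gender", "English speaker"]:
    print(df[col].value_counts(normalize=True) * 100)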
Percentage of individuals belonging to different job roles:
admin 76.582278
manage 17.721519
custodial 5.696203
Name: job, dtype: float64

Percentage of individuals with different education levels:


12 40.084388
15 24.472574
16 12.447257
8 11.181435
19 5.696203
17 2.320675
18 1.898734
14 1.265823
20 0.421941
21 0.210970
Name: education, dtype: float64

Percentage of individuals with different genders:


male 54.43038
female 45.56962
Name: gender, dtype: float64

Percentage of individuals with different English-speaking status:
no 78.059072
yes 21.940928
Name: English speaker, dtype: float64

Task-3 Gender and English speaker Analysis:


To calculate the average education level for each gender, the dataframe is grouped by the 'gender' column using groupby(), and mean() is applied to obtain the expected result, as sketched below.
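A minimal sketch, assuming the dataframe is named df and the education column is numeric:

# Average education level for each gender
print(df.groupby("gender")["education"].mean())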
gender
female 12.370370
male 14.430233

To compare the distribution of job roles across gender groups using a stacked bar chart, the dataframe is grouped by two columns, 'job' and 'gender'. The size (count) of each group is calculated, and the results are reshaped with unstack(fill_value=0) into a table where job roles are rows, genders are columns, and any missing combination is filled with 0. The stacked bar chart is then plotted by passing kind='bar' and stacked=True to plot(), as sketched below.
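A sketch of this plot; the dataframe name df and the axis labels are assumptions:

import matplotlib.pyplot as plt

# Count of each (job, gender) combination, reshaped so job roles are rows
# and genders are columns; missing combinations become 0
job_gender = df.groupby(["job", "gender"]).size().unstack(fill_value=0)

# Stacked bar chart of job roles by gender
job_gender.plot(kind="bar", stacked=True)
plt.xlabel("Job role")
plt.ylabel("Count")
plt.show()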

To create a histogram showing the distribution of education levels among English-speaking and non-English-speaking individuals, the hist() function of the Matplotlib library is used. The data is divided into 5 bins, and alpha=0.5 makes the bars semi-transparent, as sketched below.
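A sketch of this histogram; the dataframe name df and the labels are assumptions:

import matplotlib.pyplot as plt

# Education-level histograms for English speakers and non-speakers,
# 5 bins each, semi-transparent so the overlap stays visible
plt.hist(df.loc[df["English speaker"] == "yes", "education"],
         bins=5, alpha=0.5, label="English speaker")
plt.hist(df.loc[df["English speaker"] == "no", "education"],
         bins=5, alpha=0.5, label="Non-English speaker")
plt.xlabel("Education level")
plt.ylabel("Frequency")
plt.legend()
plt.show()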

Task-4 Predictive Modelling:


● For encoding categorical values, LabelEncoder() from sklearn's preprocessing module is used via its fit_transform() function.
● The dataset is split into training and test sets with sklearn's train_test_split() function.
● A Decision Tree Classifier (the inbuilt model from sklearn) is used as the classification model; a sketch of these steps follows the classification report below.
● Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.96      0.95        77
           1       1.00      0.50      0.67         2
           2       0.81      0.81      0.81        16

    accuracy                           0.93        95
   macro avg       0.92      0.76      0.81        95
weighted avg       0.93      0.93      0.92        95
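A minimal end-to-end sketch of the modelling steps listed above; the feature set, test size, and random_state are assumptions, and the column names are taken from the head() output in Task-1:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Encode the categorical columns as integers
le = LabelEncoder()
for col in ["job", "gender", "English speaker"]:
    df[col] = le.fit_transform(df[col])

# Features and target (Sno is excluded as a row identifier)
X = df[["education", "gender", "English speaker"]]
y = df["job"]

# Hold out a test set (split ratio and random_state are assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the decision tree and print the classification report
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)
print(classification_report(y_test, classifier.predict(X_test)))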

● Classification Report Analysis:


Precision:
➢ Class 0 (admin): Precision is high at 0.95, indicating that when
the model predicts someone as an admin, it's correct 95% of
the time.
➢ Class 1 (manager): Precision is 1.00, meaning that every time the
model predicts someone as a manager on this test set, it is
correct.
➢ Class 2 (custodial): Precision is decent at 0.81, meaning that
when the model predicts someone as custodial, it's correct
81% of the time.

Recall:
➢ Class 0 (admin): Recall is good at 0.96, indicating that the
model correctly identifies 96% of the actual admin roles.
➢ Class 1 (manager): Recall is lower at 0.50, suggesting that the
model captures only half of the actual manager roles.
➢ Class 2 (custodial): Recall is reasonable at 0.81, meaning the
model correctly identifies 81% of the actual custodial roles.
F1-Score:
➢ Class 0 (admin): F1-score is high at 0.95, which is a balanced
measure of precision and recall.
➢ Class 1 (manager): F1-score is 0.67, pulled down by the low recall
despite the perfect precision for managers.
➢ Class 2 (custodial): F1-score is 0.81, showing a good balance
between precision and recall for custodial roles.


Accuracy: The overall accuracy of the model is 93%, which means its
predictions are correct for 93% of the instances in the test set.
Macro Avg: The macro-average of precision, recall, and F1-score across
all classes provides a balanced assessment of model
performance:
➢ The macro-avg precision (0.92) is high, indicating overall good
precision across classes.
➢ The macro-avg recall (0.76) is lower, suggesting some issues in
capturing all instances across classes.
➢ The macro-avg F1-score (0.81) is decent, reflecting a
reasonable balance between precision and recall.
Weighted Avg: The weighted average considers class imbalances by
taking into account the number of instances in each class:
➢ The weighted-avg precision (0.93) is high and similar to
macro-avg, showing overall good precision.
➢ The weighted-avg recall (0.93) is also high, indicating a good
balance of recall across classes.
➢ The weighted-avg F1-score (0.92) is a strong measure of overall
model performance, considering class imbalances.

● permutation_importance() from sklearn is also used; it measures the impact on model performance of shuffling the values of a particular feature while keeping the other features unchanged.

● Based on the resulting permutation importance scores, the feature 'education' is the most strongly correlated with the class label.
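A sketch of how these importances can be computed, assuming the trained classifier and the test split from the modelling sketch above:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the average drop in test accuracy
result = permutation_importance(classifier, X_test, y_test,
                                n_repeats=10, random_state=42)
for name, score in zip(X_test.columns, result.importances_mean):
    print(f"{name}: {score:.4f}")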

● y_score = classifier.predict_proba(X_test): this line calculates and stores the predicted probability of each job role for every sample in the test set.
● The following code computes Receiver Operating Characteristic (ROC)
curves and their respective Area Under the Curve (AUC) scores for each
class in a multiclass classification problem. It helps assess the model's
performance for each class by plotting how well it distinguishes
between true positives and false positives at different thresholds.
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

n_classes = y_score.shape[1]  # one probability column per class
y_test_bin = label_binarize(y_test, classes=list(range(n_classes)))  # one-vs-rest labels
fpr, tpr, roc_auc = {}, {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
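A short sketch of how these per-class curves might be plotted; the labelling and styling are assumptions:

import matplotlib.pyplot as plt

# One ROC curve per class, labelled with its AUC
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label=f"Class {i} (AUC = {roc_auc[i]:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance level
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()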

The Decision Tree Classifier worked well on the dataset, achieving 93% accuracy on the test set.
The feature "education" is crucial in determining job roles in this model. It has
a significant impact on the predictions, meaning a person's education level
strongly influences whether they are classified as an admin, manager, or
custodial worker. In practical terms, it suggests that education plays a big
role in job assignments in this dataset, and it's the most important factor the
model considers when making predictions.
