ML Assignment 5

The document discusses the challenges of using decision trees for classification on an imbalanced dataset with high-dimensional features, specifically focusing on a credit dataset. It proposes strategies such as applying SMOTE for class imbalance and selecting important features based on decision tree feature importance to enhance model robustness and generalization. The document includes code snippets for data loading, preprocessing, and model training.



January 3, 2025

Given a dataset credit.csv with imbalanced class distributions and a high-dimensional feature space,
discuss the challenges and considerations in using decision trees for classification. Propose strategies
for mitigating the impact of class imbalance, and for selecting features, to improve model robustness and
generalisation performance. You are also provided with metadata; please read it before starting the
implementation.
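
A note before the implementation: resampling (used below) is not the only way to counter class imbalance. scikit-learn's DecisionTreeClassifier also supports cost-sensitive learning via class_weight, which reweights the impurity criterion instead of generating synthetic rows. A minimal sketch of that alternative (unexecuted; it assumes the X_train/y_train split created in cell [3]):

[ ]: from sklearn.tree import DecisionTreeClassifier

# 'balanced' weights each class inversely proportional to its frequency,
# so minority-class errors cost more during split selection.
clf_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=42)
# clf_weighted.fit(X_train, y_train)  # fit on the training split from cell [3]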
[1]: import pandas as pd

# Load the dataset
url = "https://itv-contentbucket.s3.ap-south-1.amazonaws.com/Exams/ML/Decision+Tree/credit.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())

# Display summary statistics
print(data.describe())

# Display information about the dataset
# (info() prints directly, so no print() wrapper is needed)
data.info()

Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941

V8 V9 … V21 V22 V23 V24 V25 \
0 0.098698 0.363787 … -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 … -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 … 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 … -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 … -0.009431 0.798278 -0.137458 0.141267 -0.206010

V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0

[5 rows x 31 columns]
Time V1 V2 V3 V4 \
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 94813.859575 1.168375e-15 3.416908e-16 -1.379537e-15 2.074095e-15
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01

V5 V6 V7 V8 V9 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 9.604066e-16 1.487313e-15 -5.556467e-16 1.213481e-16 -2.406331e-15
std 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00
min -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25% -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50% -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02
75% 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01
max 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01

… V21 V22 V23 V24 \
count … 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean … 1.654067e-16 -3.568593e-16 2.578648e-16 4.473266e-15
std … 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01
min … -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00
25% … -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01
50% … -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02
75% … 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01
max … 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00

V25 V26 V27 V28 Amount \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000
mean 5.340915e-16 1.683437e-15 -3.660091e-16 -1.227390e-16 88.349619
std 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01 250.120109
min -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000
25% -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000
50% 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02 22.000000
75% 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02 77.165000
max 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000

Class
count 284807.000000
mean 0.001727
std 0.041527
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

[8 rows x 31 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

[2]: import matplotlib.pyplot as plt
import seaborn as sns

# Check class distribution
class_counts = data['Class'].value_counts()
print(class_counts)

# Visualize class distribution
sns.countplot(x='Class', data=data)
plt.title('Class Distribution')
plt.show()

# Visualize correlations between features
# (with 31 features the annotated heatmap is dense; annot=False may be more readable)
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.show()

Class
0 284315
1 492
Name: count, dtype: int64
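
To make the imbalance concrete, a quick sanity check on the counts printed above (an unexecuted sketch reusing the class_counts Series from this cell):

[ ]: # With 492 positives out of 284,807 rows, a model that always predicts
# the majority class already scores ~99.8% accuracy, so plain accuracy
# is a misleading metric here.
minority_fraction = class_counts.loc[1] / class_counts.sum()
baseline_accuracy = class_counts.loc[0] / class_counts.sum()
print(f"Fraud fraction: {minority_fraction:.4%}")              # ≈ 0.1727%
print(f"Majority-baseline accuracy: {baseline_accuracy:.4%}")  # ≈ 99.8273%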

[3]: from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Handle missing values (if any)
data = data.dropna()

# Separate features and target variable
X = data.drop('Class', axis=1)
y = data['Class']

# Split the data into training and testing sets, stratified so the rare
# positive class keeps the same proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scale the features. Decision trees themselves are scale-invariant, but
# SMOTE (next cell) uses nearest-neighbour distances, which standardization helps.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

[4]: from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data only, so synthetic samples never leak
# into the test set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Check the new class distribution
print(pd.Series(y_train_resampled).value_counts())

Class
0 199020
1 199020
Name: count, dtype: int64
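
One caveat: because SMOTE must see only training data, any cross-validation should re-apply it inside each fold rather than on the full training set up front. imbalanced-learn's Pipeline does exactly that. An unexecuted sketch, assuming the X_train_scaled and y_train variables from cell [3]:

[ ]: from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# The sampler is re-fit on each training fold, so no synthetic points
# leak into the corresponding validation fold.
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('tree', DecisionTreeClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train_scaled, y_train, scoring='f1', cv=cv)
print("Cross-validated F1 on the fraud class:", scores.mean())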

[10]: from sklearn.tree import DecisionTreeClassifier

# Train a decision tree to get feature importance
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_resampled, y_train_resampled)

# Get feature importance scores
importances = clf.feature_importances_
feature_importance = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(feature_importance)

# Convert resampled data back to DataFrame to select features by their names
X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns)

# Select top features (e.g., top 10 features)
selected_features = feature_importance.head(10).index
X_train_selected = X_train_resampled_df[selected_features]
X_test_selected = X_test_scaled_df[selected_features]

V14 0.773470
V4 0.058559
V12 0.020056
V10 0.014436
V8 0.013011
V13 0.010029
V7 0.007680
V1 0.006985
V11 0.006978
Time 0.006907
V23 0.006864
V19 0.006841
V6 0.006788
V17 0.005915
V26 0.005891
V3 0.005694
V21 0.005551
V18 0.005128
V24 0.005046
V9 0.003899
V25 0.003670
V20 0.003646
V16 0.003503
V22 0.003228
V5 0.002949
V15 0.002785
Amount 0.001979
V27 0.001449
V2 0.000643
V28 0.000419
dtype: float64
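
Note that V14 dominates the ranking, and impurity-based importances from a single deep tree are high-variance, so a fixed top-10 cut is somewhat arbitrary. An automated threshold via scikit-learn's SelectFromModel is a common alternative; an unexecuted sketch reusing the fitted clf and the DataFrames from cell [10]:

[ ]: from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the median importance,
# rather than a fixed top-k cut-off.
selector = SelectFromModel(clf, threshold='median', prefit=True)
X_train_sfm = selector.transform(X_train_resampled_df)
X_test_sfm = selector.transform(X_test_scaled_df)
print("Features kept:",
      list(X_train_resampled_df.columns[selector.get_support()]))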

[11]: from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Train the decision tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_selected, y_train_resampled)

# Make predictions
y_pred = clf.predict(X_test_selected)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.9964654799105837
Confusion Matrix:
[[85035   260]
 [   42   106]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85295
           1       0.29      0.72      0.41       148

    accuracy                           1.00     85443
   macro avg       0.64      0.86      0.71     85443
weighted avg       1.00      1.00      1.00     85443
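
The headline accuracy is deceptive: precision on the fraud class is only 0.29 (260 false positives against 106 frauds caught). Two common follow-ups are restricting tree complexity (e.g. max_depth or ccp_alpha) and tuning the decision threshold on predict_proba instead of the default 0.5. A threshold-tuning sketch (unexecuted; it assumes the fitted clf and the test split from cell [11], and note that an unpruned tree yields only a few distinct probability values):

[ ]: import numpy as np
from sklearn.metrics import precision_recall_curve

# Scan the precision-recall trade-off over all candidate thresholds and
# pick the one maximizing F1 on the positive (fraud) class.
# (In practice, tune on a validation split, not the final test set.)
proba = clf.predict_proba(X_test_selected)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last PR point has no matching threshold
print(f"Best F1 {f1[best]:.3f} at threshold {thresholds[best]:.3f}")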
