Assignment 1 - CIS 508
using Azure ML
1. Introduction:
Santander Bank wants to identify dissatisfied customers early in the bank's relationship with them, so that it can take appropriate steps to avoid customer churn. The bank has provided raw data with a number of features and one TARGET column distinguishing satisfied from dissatisfied customers. The training set provided contains 76,020 records with 370 features and 1 TARGET column.
Our goal is to build and train a model using the Two-Class Boosted Decision Tree algorithm with the training data set. As there are 370 features present in the training set, we need to select a subset of the most relevant features to train the model to predict customer status.
The main hurdle in building and training the model is interpreting the various features. No description is provided for the features present in the file, and the feature names are not self-explanatory, e.g., var3 and var15 are feature names with no description.
To get insights from the data, I performed analysis on the raw data using various techniques, including summarizing the data and visualization. I also used the built-in feature selection methods of Azure ML to increase the prediction rate: I used Filter Based Feature Selection with mutual information as the feature-scoring method. Apart from this, I tried the algorithm with different parameters and analyzed the different trained models to select the best one.
2. Exploration of Raw Data:
We start by importing the data to create the model. Going through the data, we conclude that there are no missing values, so we don't need any missing-value treatment. But we might need data cleaning for other anomalies; to figure out what those anomalies are, we need to explore the data further.
Before proceeding to data cleaning, we edit the metadata using the Edit Metadata module.
We then start by visualizing the data, going through all the columns (Fig. 2.1).
Fig. 2.1
I checked histograms and box plots for various features to get insights into the distribution of each feature.
Fig. 2.2 and Fig. 2.3 are the histogram and box plot of var15, respectively. We can see that var15 is not normally distributed; it is right-skewed. There are some extreme outliers in the data, but they might not affect the overall result.
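The skew of a feature such as var15 can be checked numerically as well as visually. A minimal sketch with pandas, using a small synthetic stand-in for the var15 column (the real Santander values are not reproduced here):

```python
# A minimal skew check for a var15-like column.
# NOTE: synthetic stand-in values; the real Santander file is not reproduced here.
import pandas as pd

var15 = pd.Series([23, 24, 25, 25, 26, 27, 28, 30, 45, 60, 85, 100])

skewness = var15.skew()  # a positive value indicates a right-skewed distribution
print(f"skewness of var15: {skewness:.2f}")
```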
The box plot of var3 (Fig. 2.4) shows an extreme outlier on the negative side. We need to remove this outlier during data cleaning.
Fig. 2.4
To dive deeper into data exploration, we summarize the data, which gives us various statistics for each feature.
Fig. 2.5
The extreme outlier we found in the box plot of the var3 feature can be seen here. The minimum value of var3, -999999, is the extreme outlier. This value is affecting the mean of the entire column. Looking at the other entries of var3, we can safely say that this is a data-entry error or a missing-value marker. We will remove this outlier in a later step.
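This treatment can be sketched in pandas: mark the -999999 entries as missing, then impute them. The column values below are toy data, and imputing with the column mode is an assumption, not necessarily the exact step used in Azure ML:

```python
# Sketch of the var3 outlier treatment: mark -999999 as missing, then impute.
# NOTE: toy values; imputing with the column mode is an assumption.
import numpy as np
import pandas as pd

df = pd.DataFrame({"var3": [2, 2, 2, 3, 2, -999999, 2, 8]})

df["var3"] = df["var3"].replace(-999999, np.nan)      # treat the sentinel as missing
df["var3"] = df["var3"].fillna(df["var3"].mode()[0])  # impute with the mode
```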
The ID column has 76,020 values and is unique for every row, so we can remove this column as it carries no information for prediction.
Some columns contain only 0 values; we need to remove these columns as they don't give the model any information:
ind_var41
num_var27_0
num_var28_0
num_var28
num_var27
num_var41
num_var46_0
num_var46
saldo_var28
saldo_var27
saldo_var41
saldo_var46
saldo_medio_var13_medio_hace3
imp_amort_var18_hace3
imp_amort_var34_hace3
imp_reemb_var13_hace3
imp_reemb_var33_hace3
imp_trasp_var17_out_hace3
imp_trasp_var33_out_hace3
num_var2_0_ult1
num_var2_ult1
num_reemb_var13_hace3
num_reemb_var33_hace3
num_trasp_var17_out_hace3
num_trasp_var33_out_hace3
saldo_var2_ult1
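Finding such all-zero columns can be automated. A minimal pandas sketch on a toy frame (column names are borrowed from the list above; the values are invented):

```python
# Sketch: detect columns that contain only zeros so they can be dropped.
# NOTE: toy frame; names borrowed from the list above, values invented.
import pandas as pd

df = pd.DataFrame({
    "ind_var41": [0, 0, 0, 0],      # constant zero -> no information
    "num_var27": [0, 0, 0, 0],      # constant zero -> no information
    "var15":     [23, 40, 27, 60],  # real variation -> keep
})

zero_cols = [c for c in df.columns if (df[c] == 0).all()]
df = df.drop(columns=zero_cols)
```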
The columns below contain only 0 and 9999999999, which seems to be a data-entry error, so we remove these columns as well:
delta_imp_amort_var18_1y3
delta_imp_amort_var34_1y3
delta_imp_reemb_var13_1y3
delta_imp_reemb_var33_1y3
delta_imp_trasp_var17_out_1y3
delta_imp_trasp_var33_out_1y3
delta_num_reemb_var13_1y3
delta_num_reemb_var33_1y3
delta_num_trasp_var17_out_1y3
delta_num_trasp_var33_out_1y3
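A sketch of how such columns could be detected programmatically, assuming the data is held in a pandas DataFrame; the toy values below are invented:

```python
# Sketch: find columns whose only values are 0 and the 9999999999 sentinel.
# NOTE: toy values; column names borrowed from the lists in the text.
import pandas as pd

df = pd.DataFrame({
    "delta_imp_amort_var18_1y3": [0, 0, 9999999999, 0],  # only 0 / sentinel -> drop
    "delta_imp_aport_var13_1y3": [0, 1, 2, 9999999999],  # has real values -> keep
})

bad_cols = [c for c in df.columns
            if set(df[c].unique()) <= {0, 9999999999}]
```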
Extreme Outliers:
The columns below have the extreme outlier 9999999999 but also contain other, valid values, so instead of dropping these columns we remove the outlier value:
delta_imp_aport_var13_1y3
delta_imp_aport_var17_1y3
delta_imp_aport_var33_1y3
delta_imp_compra_var44_1y3
delta_imp_reemb_var17_1y3
delta_imp_trasp_var17_in_1y3
delta_imp_trasp_var33_in_1y3
delta_imp_venta_var44_1y3
delta_num_aport_var13_1y3
delta_num_aport_var17_1y3
delta_num_aport_var33_1y3
delta_num_compra_var44_1y3
delta_num_reemb_var17_1y3
delta_num_trasp_var17_in_1y3
delta_num_trasp_var33_in_1y3
delta_num_venta_var44_1y3
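For these columns, one way to treat the sentinel without dropping the column can be sketched as below. Replacing 9999999999 with the column median is an assumption; the document does not specify the replacement value:

```python
# Sketch: replace the 9999999999 sentinel in a column that also holds valid values.
# NOTE: toy values; using the column median as replacement is an assumption.
import numpy as np
import pandas as pd

col = pd.Series([0, 1, 2, 3, 9999999999, 1], name="delta_num_aport_var13_1y3")

col = col.replace(9999999999, np.nan)  # sentinel -> missing
col = col.fillna(col.median())         # median of the remaining values
```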
3. Data Cleaning
There are some columns with only 0 and 1 values. I converted all these columns to
categorical.
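This conversion can be sketched in pandas: any column whose values are a subset of {0, 1} is cast to a categorical dtype. Toy frame below; the real dataset has 370 columns:

```python
# Sketch: cast every column whose values are a subset of {0, 1} to categorical.
# NOTE: toy frame; the real dataset has 370 columns.
import pandas as pd

df = pd.DataFrame({
    "ind_var5": [0, 1, 1, 0],
    "var15":    [23, 40, 27, 60],
})

for c in df.columns:
    if set(df[c].unique()) <= {0, 1}:
        df[c] = df[c].astype("category")
```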
To remove the outliers from var3 and the other columns, we use the following structure: we clean the data by treating the value -999999 as missing, as shown in Fig. 3.1.
Fig. 3.1
With this, the outliers -999999 and 9999999999 are removed, and we can confirm this by visualizing the columns.
Checking the summary of the data, we can see that the value -999999 no longer appears as the minimum of the var3 column. Also, as a result of removing the extreme outlier, the mean of var3 changed to 2.71.
Fig. 3.2
I excluded the columns with only 0 values using the Select Columns in Dataset module before feature selection.
I then used the Filter Based Feature Selection module in Azure ML to select the top 30 features from the dataset using the mutual information feature-scoring method, which measures the contribution of each feature to the TARGET column.
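Azure ML's module itself is not shown here, but the same idea can be sketched with scikit-learn's SelectKBest scored by mutual information. The data below is synthetic, and k=2 stands in for the top 30 used in the assignment:

```python
# Sketch of filter-based feature selection scored by mutual information,
# using scikit-learn's SelectKBest as a stand-in for the Azure ML module.
# NOTE: synthetic data; k=2 instead of the top 30 used in the assignment.
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # TARGET depends only on the first two features

score_fn = partial(mutual_info_classif, random_state=0)  # fixed seed for repeatability
selector = SelectKBest(score_fn, k=2).fit(X, y)
top = sorted(selector.get_support(indices=True).tolist())  # indices of selected features
```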
4. Training different models
To train a model, I split the data 70:30: 70% for training and 30% for testing.
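The same 70:30 split, done in Azure ML with the Split Data module, can be sketched with scikit-learn's train_test_split (synthetic arrays):

```python
# Sketch of the 70:30 train/test split described above.
# NOTE: synthetic arrays stand in for the real dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.zeros(100, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```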
I then evaluated 3 different trained models by changing the parameters of the Two-Class Boosted Decision Tree algorithm (Fig. 4.1 and Fig. 4.2).
Fig. 4.1
Fig. 4.2
Out of these 3 models, model 1 gave the best results (Fig. 4.2). So, I chose model 1 to evaluate the test set and predict the number of satisfied and dissatisfied customers.
All of the above steps were followed to build the model, as shown in Fig. 4.3.
Fig. 4.3
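The overall experiment (train several boosted-tree models with different parameters, compare them, keep the best) can be sketched with scikit-learn's GradientBoostingClassifier as a stand-in for the Two-Class Boosted Decision Tree module; the data and parameter values below are synthetic:

```python
# Sketch: train boosted-tree models with different parameter settings and
# keep the best by AUC, mirroring the model comparison described above.
# NOTE: GradientBoostingClassifier stands in for Azure ML's Two-Class
# Boosted Decision Tree; data and parameter values are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced classes, like the satisfied/dissatisfied TARGET.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

aucs = {}
for n_trees in (50, 100, 200):  # the varied parameter across the 3 models
    clf = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(X_tr, y_tr)
    aucs[n_trees] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

best = max(aucs, key=aucs.get)  # the model we would keep
```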
5. Conclusion
The main challenges with this dataset were:
1. Asymmetric dataset: There are far fewer unsatisfied customers (1s) than satisfied customers (0s), which affects the training of the model, as we don't have many positive examples to learn from.
2. High dimensionality: There are 370 columns with no information about what each represents; since the features don't have descriptions, selecting a feature subset for training the model is a problem.
I learnt how to load data into Azure ML to train a model. I explored the various modules present in Azure ML for data exploration, summarization, and cleaning, as well as how to build models, evaluate them on the basis of various parameters, and choose the best one.