Assignment 1 - CIS 508

This document describes building a machine learning model using Azure ML to predict customer satisfaction from bank customer data. It summarizes data exploration of the 370 features, including removing outliers and zero-value columns. A two-class boosted decision tree model was trained on the top 30 features selected using mutual information. Model 1 with 20 leaves, 130 trees and a learning rate of 0.4 performed best on the test set for predicting satisfied and dissatisfied customers. Key challenges were the imbalanced target classes and lack of feature descriptions.


Customer Satisfaction Prediction Problem using Azure ML

ASSIGNMENT 1

Rahul Bhole | Data Mining CIS 508 | 10/26/2016


1. Overview

Santander Bank wants to identify dissatisfied customers early in the bank's relationship with them, so that it can take appropriate steps to avoid customer churn. The bank has provided raw data with a large number of features and one TARGET column carrying a satisfaction indicator: the value 1 in TARGET indicates an unsatisfied customer and 0 indicates a satisfied customer. The training set contains 370 features across 76,020 records, plus the TARGET column.

Our goal is to build and train a model using the Two-Class Boosted Decision Tree algorithm on the training data set. As there are 370 features in the training set, we need to select a subset of the most relevant features to train the model in order to predict customer status (satisfied or dissatisfied) in the test data set.

The main hurdle in building and training the model is interpreting the various features. No description is provided for the features in the file, and the feature names are not self-explanatory, e.g. var3 and var15 are feature names with no description.

To get insights from the data, I performed analysis on the raw data using various techniques, including summarizing and visualizing it. I also used the built-in feature selection methods of Azure ML to improve prediction: Filter-Based Feature Selection with Mutual Information as the feature scoring method. Apart from this, I tried the algorithm with different parameters and compared the resulting trained models to select the best one.

2. Exploration of Raw Data:

We start by importing the data to create the model. Going through it, we conclude that there are no missing values, so no missing-value treatment is needed. We might, however, still need data cleaning for other anomalies; to identify them, we summarize and visualize the data.
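These checks can be sketched outside Azure ML as well. The snippet below is a minimal illustration with a toy pandas frame standing in for the Santander data (the values are hypothetical; the real set has 76,020 rows and 370 features):

```python
import pandas as pd

# Toy stand-in for the Santander training data; values are hypothetical.
df = pd.DataFrame({
    "var3":   [2, 2, -999999, 2, 3],
    "var15":  [23, 34, 23, 45, 23],
    "TARGET": [0, 0, 1, 0, 0],
})

# Missing-value check: the report found no missing cells in the raw data.
missing = df.isnull().sum().sum()

# Per-column summary statistics, the rough equivalent of Azure ML's
# Summarize Data module (min, max, mean reveal sentinel values like -999999).
summary = df.describe()
print("missing cells:", missing)
print(summary.loc[["min", "mean", "max"], "var3"])
```

In such a summary, the -999999 minimum of var3 stands out immediately, which is how this kind of anomaly surfaces during exploration.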

Before proceeding to data cleaning, we edit the metadata using the Edit Metadata module: change TARGET to Categorical, Boolean, and Label.

We then visualize the data by going through all the columns (Fig. 2.1).

Fig. 2.1

I checked histograms and box plots of various features to get insights into the distributions of these variables. Some of the distributions are shown below.

Fig. 2.2 and Fig. 2.3 are the histogram and box plot of var15, respectively. We can see that var15 is not normally distributed; according to the distribution, it is right-skewed. There are some extreme outliers in the data, but they might not affect the overall result.

Fig. 2.2 Fig. 2.3

Var3 extreme outlier (-999999):

The box plot of var3 (Fig. 2.4) shows an extreme outlier on the negative side. We need to remove this outlier before proceeding to feature selection.

Fig. 2.4

To dive deeper into data exploration, we summarize the data, which gives us various statistics that can be analyzed. Fig. 2.5 shows a snapshot of the summary.

Fig. 2.5

The extreme outlier we found in the box plot of var3 can be seen here: the minimum value of var3, -999999, is the extreme outlier, and it is distorting the mean of the entire column. Looking at the other entries of var3, we can safely say that this value is a data entry error or a missing-value code. We will remove this outlier in a later step.

The ID column has 76,020 values and is unique for every row. We can remove this column from our process, as the model requires generalized rules.

Zero-value columns:

Some columns contain only 0 values; we need to remove them, as they give no additional information.

ind_var41
num_var27_0
num_var28_0
num_var28
num_var27
num_var41
num_var46_0
num_var46
saldo_var28
saldo_var27
saldo_var41
saldo_var46
saldo_medio_var13_medio_hace3
imp_amort_var18_hace3
imp_amort_var34_hace3
imp_reemb_var13_hace3
imp_reemb_var33_hace3
imp_trasp_var17_out_hace3
imp_trasp_var33_out_hace3
num_var2_0_ult1
num_var2_ult1
num_reemb_var13_hace3
num_reemb_var33_hace3
num_trasp_var17_out_hace3
num_trasp_var33_out_hace3
saldo_var2_ult1
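As an illustrative sketch (not the Azure ML workflow itself), such all-zero columns can be detected programmatically; the toy frame below uses one of the listed names with hypothetical values:

```python
import pandas as pd

# Toy frame: num_var27 is all zeros, like the columns listed above.
df = pd.DataFrame({
    "var15":     [23, 34, 45],
    "num_var27": [0, 0, 0],
    "TARGET":    [0, 1, 0],
})

# A column carries no information if every value equals 0.
zero_cols = [c for c in df.columns if (df[c] == 0).all()]
df = df.drop(columns=zero_cols)
print(zero_cols)  # → ['num_var27']
```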

Zero and extreme value columns:

The columns below contain only 0 and 9999999999, which appears to be a data entry error.

delta_imp_amort_var18_1y3
delta_imp_amort_var34_1y3
delta_imp_reemb_var13_1y3
delta_imp_reemb_var33_1y3
delta_imp_trasp_var17_out_1y3
delta_imp_trasp_var33_out_1y3
delta_num_reemb_var13_1y3
delta_num_reemb_var33_1y3
delta_num_trasp_var17_out_1y3
delta_num_trasp_var33_out_1y3

Extreme outliers:

The columns below contain the extreme outlier 9999999999 alongside other values, so we need to remove the outliers from these columns.

delta_imp_aport_var13_1y3
delta_imp_aport_var17_1y3
delta_imp_aport_var33_1y3
delta_imp_compra_var44_1y3
delta_imp_reemb_var17_1y3
delta_imp_trasp_var17_in_1y3
delta_imp_trasp_var33_in_1y3
delta_imp_venta_var44_1y3
delta_num_aport_var13_1y3
delta_num_aport_var17_1y3
delta_num_aport_var33_1y3

delta_num_compra_var44_1y3
delta_num_reemb_var17_1y3
delta_num_trasp_var17_in_1y3
delta_num_trasp_var33_in_1y3
delta_num_venta_var44_1y3
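A quick programmatic sketch of flagging such columns (toy data with hypothetical values; the sentinel constant is taken from the summary above):

```python
import pandas as pd

# Toy frame with the 9999999999 sentinel seen in the delta_* columns.
df = pd.DataFrame({
    "delta_imp_aport_var13_1y3": [0, 9999999999, 1, 0],
    "var15":                     [23, 34, 45, 23],
})

SENTINEL = 9999999999

# Flag every column in which the sentinel appears at least once.
sentinel_cols = [c for c in df.columns if (df[c] == SENTINEL).any()]
print(sentinel_cols)  # → ['delta_imp_aport_var13_1y3']
```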

3. Data Cleaning

Change column types to Categorical:

Some columns contain only 0 and 1 values; I converted all of these columns to categorical.

Removing extreme outliers:

To remove the outliers from var3 and the other columns, we use the following structure:

1. Convert the dataset, marking -999999 as the custom missing value.

2. Clean the missing data, substituting for the values equal to -999999, as shown in Fig. 3.1.

Fig. 3.1

With this, the outliers -999999 and 9999999999 are removed, which we can confirm by visualizing the columns.

Checking the summary of the data, we can see that -999999 no longer appears as the minimum of the var3 column. Also, as a result of removing the extreme outlier, the mean of var3 changed to 2.71.
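The two cleaning steps can be sketched in pandas as follows (a toy var3 column with hypothetical values; filling with the column mode is one possible choice, since the report does not state the replacement value used):

```python
import pandas as pd

SENTINELS = [-999999.0, 9999999999.0]

# Toy var3 column containing the -999999 sentinel.
df = pd.DataFrame({"var3": [2.0, 2.0, -999999.0, 3.0, 2.0]})

# Step 1: treat the sentinel codes as missing values.
df = df.replace(SENTINELS, float("nan"))

# Step 2: clean the missing data; here each gap is filled with the
# column mode (an assumption, one of several reasonable treatments).
df["var3"] = df["var3"].fillna(df["var3"].mode()[0])
print(df["var3"].mean())  # → 2.2
```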

Fig. 3.2

Removing columns with only 0 values:

I excluded the columns containing only 0 values with the Select Columns in Dataset module before feeding the data to the ML algorithm.

I then used the Filter-Based Feature Selection module in Azure to select the top 30 features from the dataset, with mutual information as the feature scoring method. Mutual information measures the contribution of a variable toward reducing uncertainty about the value of TARGET.
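The same idea can be sketched with scikit-learn's mutual_info_classif, which is analogous to (though not identical with) Azure ML's scorer. The data here is synthetic, with one column deliberately made informative, and k=5 stands in for the report's top 30:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the cleaned features: 200 rows, 40 columns,
# with column 0 made informative about the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (X[:, 0] > 0).astype(int)

# Keep the k features with the highest mutual information with the
# target, mirroring Filter-Based Feature Selection (k=5 here, 30 in the report).
selector = SelectKBest(mutual_info_classif, k=5)
X_top = selector.fit_transform(X, y)
print(X_top.shape)  # → (200, 5)
```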

4. Training different models

To train a model, I split the data 70:30: 70% for training and 30% for testing.

I then evaluated three different trained models by changing the parameters of the Two-Class Boosted Decision Tree, as shown in Fig. 4.1.

Fig. 4.1

          Leaves   Trees   Samples   Learning rate
Model 1     20      130       15          0.4
Model 2     10      300       20          0.2
Model 3     10      100       10          0.2

Fig 4.2
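Outside Azure ML, Model 1 can be approximated with scikit-learn's GradientBoostingClassifier on synthetic data; the mapping of Azure ML's parameters (leaves → max_leaf_nodes, trees → n_estimators, samples → min_samples_leaf) is an assumption, not a documented equivalence:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected top-30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 70:30 train/test split, as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Approximate analogue of Two-Class Boosted Decision Tree, Model 1:
# 20 leaves, 130 trees, 15 samples per leaf, learning rate 0.4.
model = GradientBoostingClassifier(
    max_leaf_nodes=20,
    n_estimators=130,
    min_samples_leaf=15,
    learning_rate=0.4,
    random_state=0,
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```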

Out of these three models, Model 1 gave the best results (Fig. 4.2). So I chose Model 1 to score the test set and predict the number of satisfied and dissatisfied customers.

All of the above steps are combined to build the model, as shown in Fig. 4.3.

Fig. 4.3

Kaggle rank is shown in the following screenshot-

5. Conclusion

In conclusion, in my view the Santander problem is a difficult one from the machine learning point of view for two reasons:

1. Imbalanced dataset: there are far fewer unsatisfied customers (1s) than satisfied customers (0s), which affects the training of the model, as we do not have many positive cases in the target column to learn from accurately.

2. High dimensionality: there are 370 columns with no information about any of them. Some of the columns seem to carry redundant or no information, but as we have no descriptions, feature subset selection for training a model is a problem.

I learned how to load data into Azure to train a model, and I explored the various features Azure provides for data exploration, summarization, and cleaning. I also learned how to build models, evaluate them on the basis of various parameters, and choose the best one.

