Assignment 1 - CIS 508
using Azure ML
1. Introduction:
Santander Bank wants to identify dissatisfied customers early in the bank's relationship with them, so that it can take appropriate steps to avoid customer churn. The bank has provided raw data with a number of features and one TARGET column distinguishing satisfied from dissatisfied customers. The training set provided contains 76,020 records with 370 features and 1 TARGET column.
Our goal is to build and train a model using the Two-Class Boosted Decision Tree algorithm with the training data set. As there are 370 features present in the training set, we need to select a subset of the most relevant features to train the model to predict customer status.
The main hurdle in building and training the model is interpreting the various features. No description is provided for the features present in the file, and the feature names are not self-explanatory, e.g., var3 and var15 are feature names with no description.
To get insights from the data, I performed analysis on the raw data using various techniques, including summarizing the data and visualization. I also used the built-in feature selection methods of Azure ML to increase the prediction rate: I used Filter Based Feature Selection with mutual information as the feature-scoring method. Apart from this, I tried the algorithm with different parameters and analyzed the different trained models to select the best one.
2. Exploration of Raw Data:
We start by importing the data to create the model. Going through the data, we conclude that there are no missing values, so we don't need any missing-value treatment. But we might need data cleaning for other anomalies; to figure out what those anomalies are, we need to explore the data further.
Before proceeding to data cleaning, we edit the metadata using the Edit Metadata module.
We then start by visualizing the data, going through all the columns (Fig. 2.1).
Fig. 2.1
I checked histograms and box plots for various features to get insights into the distribution of each feature.
Fig. 2.2 and Fig. 2.3 are the histogram and box plot of var15, respectively. We can see that var15 is not normally distributed; it is right-skewed. There are some extreme outliers in the data, but they might not affect the overall result.
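The skew of a feature such as var15 can be checked numerically as well as visually. A minimal sketch with pandas, using a small synthetic stand-in for the var15 column (the real Santander values are not reproduced here):

```python
# A minimal skew check for a var15-like column.
# NOTE: synthetic stand-in values; the real Santander file is not reproduced here.
import pandas as pd

var15 = pd.Series([23, 24, 25, 25, 26, 27, 28, 30, 45, 60, 85, 100])

skewness = var15.skew()  # a positive value indicates a right-skewed distribution
print(f"skewness of var15: {skewness:.2f}")
```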
The box plot of var3 (Fig. 2.4) shows an extreme outlier on the negative side. We need to remove this outlier during data cleaning.
Fig. 2.4
To dive deeper into data exploration, we summarize the data, which gives us various statistics for each feature.
Fig. 2.5
The extreme outlier we found in the box plot of the var3 feature can be seen here. The minimum value of var3, -999999, is the extreme outlier. This value is affecting the mean of the entire column. Looking at the other entries of var3, we can safely say that this is a data-entry error or a missing-value marker. We will remove this outlier in a later step.
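This treatment can be sketched in pandas: mark the -999999 entries as missing, then impute them. The column values below are toy data, and imputing with the column mode is an assumption, not necessarily the exact step used in Azure ML:

```python
# Sketch of the var3 outlier treatment: mark -999999 as missing, then impute.
# NOTE: toy values; imputing with the column mode is an assumption.
import numpy as np
import pandas as pd

df = pd.DataFrame({"var3": [2, 2, 2, 3, 2, -999999, 2, 8]})

df["var3"] = df["var3"].replace(-999999, np.nan)      # treat the sentinel as missing
df["var3"] = df["var3"].fillna(df["var3"].mode()[0])  # impute with the mode
```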
The ID column has 76,020 values and is unique for every row, so we can remove this column as it carries no information for prediction.
Some columns contain only 0 values; we need to remove these columns as they don't give the model any information:
ind_var41
num_var27_0
num_var28_0
num_var28
num_var27
num_var41
num_var46_0
num_var46
saldo_var28
saldo_var27
saldo_var41
saldo_var46
saldo_medio_var13_medio_hace3
imp_amort_var18_hace3
imp_amort_var34_hace3
imp_reemb_var13_hace3
imp_reemb_var33_hace3
imp_trasp_var17_out_hace3
imp_trasp_var33_out_hace3
num_var2_0_ult1
num_var2_ult1
num_reemb_var13_hace3
num_reemb_var33_hace3
num_trasp_var17_out_hace3
num_trasp_var33_out_hace3
saldo_var2_ult1
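Finding such all-zero columns can be automated. A minimal pandas sketch on a toy frame (column names are borrowed from the list above; the values are invented):

```python
# Sketch: detect columns that contain only zeros so they can be dropped.
# NOTE: toy frame; names borrowed from the list above, values invented.
import pandas as pd

df = pd.DataFrame({
    "ind_var41": [0, 0, 0, 0],      # constant zero -> no information
    "num_var27": [0, 0, 0, 0],      # constant zero -> no information
    "var15":     [23, 40, 27, 60],  # real variation -> keep
})

zero_cols = [c for c in df.columns if (df[c] == 0).all()]
df = df.drop(columns=zero_cols)
```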
The columns below contain only 0 and 9999999999, which seems to be a data-entry error, so we remove these columns as well:
delta_imp_amort_var18_1y3
delta_imp_amort_var34_1y3
delta_imp_reemb_var13_1y3
delta_imp_reemb_var33_1y3
delta_imp_trasp_var17_out_1y3
delta_imp_trasp_var33_out_1y3
delta_num_reemb_var13_1y3
delta_num_reemb_var33_1y3
delta_num_trasp_var17_out_1y3
delta_num_trasp_var33_out_1y3
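A sketch of how such columns could be detected programmatically, assuming the data is held in a pandas DataFrame; the toy values below are invented:

```python
# Sketch: find columns whose only values are 0 and the 9999999999 sentinel.
# NOTE: toy values; column names borrowed from the lists in the text.
import pandas as pd

df = pd.DataFrame({
    "delta_imp_amort_var18_1y3": [0, 0, 9999999999, 0],  # only 0 / sentinel -> drop
    "delta_imp_aport_var13_1y3": [0, 1, 2, 9999999999],  # has real values -> keep
})

bad_cols = [c for c in df.columns
            if set(df[c].unique()) <= {0, 9999999999}]
```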
Extreme Outliers:
The columns below have the extreme outlier 9999999999 but also contain other, valid values, so instead of dropping these columns we remove the outlier value:
delta_imp_aport_var13_1y3
delta_imp_aport_var17_1y3
delta_imp_aport_var33_1y3
delta_imp_compra_var44_1y3
delta_imp_reemb_var17_1y3
delta_imp_trasp_var17_in_1y3
delta_imp_trasp_var33_in_1y3
delta_imp_venta_var44_1y3
delta_num_aport_var13_1y3
delta_num_aport_var17_1y3
delta_num_aport_var33_1y3
delta_num_compra_var44_1y3
delta_num_reemb_var17_1y3
delta_num_trasp_var17_in_1y3
delta_num_trasp_var33_in_1y3
delta_num_venta_var44_1y3
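For these columns, one way to treat the sentinel without dropping the column can be sketched as below. Replacing 9999999999 with the column median is an assumption; the document does not specify the replacement value:

```python
# Sketch: replace the 9999999999 sentinel in a column that also holds valid values.
# NOTE: toy values; using the column median as replacement is an assumption.
import numpy as np
import pandas as pd

col = pd.Series([0, 1, 2, 3, 9999999999, 1], name="delta_num_aport_var13_1y3")

col = col.replace(9999999999, np.nan)  # sentinel -> missing
col = col.fillna(col.median())         # median of the remaining values
```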
3. Data Cleaning
There are some columns with only 0 and 1 values. I converted all these columns to
categorical.
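This conversion can be sketched in pandas: any column whose values are a subset of {0, 1} is cast to a categorical dtype. Toy frame below; the real dataset has 370 columns:

```python
# Sketch: cast every column whose values are a subset of {0, 1} to categorical.
# NOTE: toy frame; the real dataset has 370 columns.
import pandas as pd

df = pd.DataFrame({
    "ind_var5": [0, 1, 1, 0],
    "var15":    [23, 40, 27, 60],
})

for c in df.columns:
    if set(df[c].unique()) <= {0, 1}:
        df[c] = df[c].astype("category")
```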
To remove the outliers from var3 and the other columns, we use the following structure: we clean the data by treating the value -999999 as missing, as shown in Fig. 3.1.
Fig. 3.1
With this, the outliers -999999 and 9999999999 are removed, and we can confirm this by visualizing the columns.
Checking the summary of the data, we can see that the value -999999 no longer appears as the minimum of the var3 column. Also, as a result of removing the extreme outlier, the mean of var3 changed to 2.71.
Fig. 3.2
I excluded the columns with only 0 values using the Select Columns in Dataset module before feature selection.
I then used the Filter Based Feature Selection module in Azure ML to select the top 30 features from the dataset using the mutual information feature-scoring method, which measures the contribution of each feature to the TARGET column.
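Azure ML's module itself is not shown here, but the same idea can be sketched with scikit-learn's SelectKBest scored by mutual information. The data below is synthetic, and k=2 stands in for the top 30 used in the assignment:

```python
# Sketch of filter-based feature selection scored by mutual information,
# using scikit-learn's SelectKBest as a stand-in for the Azure ML module.
# NOTE: synthetic data; k=2 instead of the top 30 used in the assignment.
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # TARGET depends only on the first two features

score_fn = partial(mutual_info_classif, random_state=0)  # fixed seed for repeatability
selector = SelectKBest(score_fn, k=2).fit(X, y)
top = sorted(selector.get_support(indices=True).tolist())  # indices of selected features
```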
4. Training different models
To train a model, I split the data 70:30: 70% for training and 30% for testing.
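The same 70:30 split, done in Azure ML with the Split Data module, can be sketched with scikit-learn's train_test_split (synthetic arrays):

```python
# Sketch of the 70:30 train/test split described above.
# NOTE: synthetic arrays stand in for the real dataset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.zeros(100, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```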
I then evaluated 3 different trained models by changing the parameters of the Two-Class Boosted Decision Tree algorithm (Fig. 4.1 and Fig. 4.2).
Fig. 4.1
Fig. 4.2
Out of these 3 models, model 1 gave the best results (Fig. 4.2). So, I chose model 1 to evaluate the test set and predict the number of satisfied and dissatisfied customers.
All of the above steps were followed to build the model, as shown in Fig. 4.3.
Fig. 4.3
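The overall experiment (train several boosted-tree models with different parameters, compare them, keep the best) can be sketched with scikit-learn's GradientBoostingClassifier as a stand-in for the Two-Class Boosted Decision Tree module; the data and parameter values below are synthetic:

```python
# Sketch: train boosted-tree models with different parameter settings and
# keep the best by AUC, mirroring the model comparison described above.
# NOTE: GradientBoostingClassifier stands in for Azure ML's Two-Class
# Boosted Decision Tree; data and parameter values are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced classes, like the satisfied/dissatisfied TARGET.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

aucs = {}
for n_trees in (50, 100, 200):  # the varied parameter across the 3 models
    clf = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(X_tr, y_tr)
    aucs[n_trees] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

best = max(aucs, key=aucs.get)  # the model we would keep
```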
5. Conclusion
The main challenges with this dataset were:
1. Asymmetric dataset: There are far fewer unsatisfied customers (1s) than satisfied customers (0s), which affects the training of the model, as we don't have many positive examples to learn from.
2. High dimensionality: There are 370 columns with no information about what each represents; since the features don't have descriptions, selecting a feature subset for training the model is a problem.
I learnt how to load data into Azure ML to train a model. I explored the various modules present in Azure ML for data exploration, summarization, and cleaning, as well as how to build models, evaluate them on the basis of various parameters, and choose the best one.