Recognition of Handwritten Digit Using CNN

A Minor Project Report
on
Recognition of Handwritten Digit Using CNN

Submitted
in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
(Artificial Intelligence and Machine Learning)
By
Varshini Manuka
(21261A6641)
2024 - 2025
MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad)
Gandipet, Hyderabad – 500075, Telangana
CERTIFICATE
External Examiner
DECLARATION
This is to certify that the work reported in this project titled "RECOGNITION OF
HANDWRITTEN DIGIT USING CNN" is a record of work done by me in the Department of
Emerging Technologies, Mahatma Gandhi Institute of Technology, Hyderabad.
No part of the work is copied from books, journals, or the internet, and wherever a portion is
taken, it has been duly referenced in the text. The report is based on work done entirely by me
and not copied from any other source.
ACKNOWLEDGEMENTS
I would like to express my sincere thanks to Dr. G. Chandra Mohan Reddy, Principal, MGIT,
for providing the working facilities in the college.
I wish to express my sincere thanks and gratitude to Dr. M. Rama Bai, Professor and HOD,
Department of ET, MGIT, for all the timely support and valuable suggestions during the project
period.
I am extremely thankful to Dr. M. Rama Bai, Professor and HOD, and Dr. M. Srikanth Sagar,
Associate Professor, Department of Emerging Technologies, MGIT, minor project coordinators,
for their encouragement and support throughout the project.
I am extremely thankful and indebted to my internal guide, Ms. Najini Mohd, Assistant
Professor, Department of Emerging Technologies, for her constant guidance, encouragement and
moral support throughout the project.
Finally, I would also like to thank all the faculty and staff of the Emerging Technologies
Department who helped me directly or indirectly in completing this project.
TABLE OF CONTENTS
Certificate i
Declaration ii
Acknowledgement iii
List of Figures vi
Abstract vii
1. Introduction 1
1.1 Motivation 2
2. Literature Survey 12
3.5.1 Data Flow Diagram 26
3.5.2 Sequence Diagram 26
3.5.3 Activity Diagram 27
3.5.4 Collaboration Diagram 27
4. Testing and Results 28
4.1 Model Performances 32
4.2 Comparison of Models 38
5. Conclusion and Future Work 39
5.1 Conclusion 39
5.2 Future Work 39
Bibliography 40
Appendix 41
LIST OF FIGURES

Figure 4.10 Classification Report and ROC of KNN 36
Figure 4.11 Classification Report and ROC of SVM 36
Figure 4.12 Feature Importance of SVM 37
Figure 4.13 Graphical summarization of model performances 38
LIST OF TABLES

Table 2.1 Comparison of Literature survey 14
Table 3.1 Description of attributes used in the dataset 18
Table 4.1 Comparison of Results 38
ABSTRACT
Handwritten digit recognition is a key problem in the field of computer vision, with significant
applications in areas such as postal services, banking, and document processing. This study
proposes a deep learning-based approach using Convolutional Neural Networks (CNN) for
classifying digits drawn by users through an interactive drawing interface. The system allows
users to directly sketch digits on a digital canvas, which are then preprocessed and fed into the
CNN for classification. The CNN model consists of multiple convolutional and pooling layers for
automatic feature extraction, followed by fully connected layers to classify the digits from 0 to 9.
A key component of this project is the design of an intuitive interface that allows users to easily
input handwritten digits, providing a real-time, interactive experience. Techniques such as data
augmentation, dropout, and batch normalization are used to enhance model performance and
ensure robust classification. The results demonstrate the ability of the CNN to accurately
recognize drawn digits, highlighting the potential of deep learning and interactive interfaces for
real-time digit recognition applications.
Keywords: Handwritten recognition, digit recognition, epochs, hidden layers, machine learning,
neural network, CNN, Random Forest Classifier, Decision Tree Classifier, Gini Index, Entropy,
K-Nearest Neighbor, Support Vector Machine, Gaussian Naive Bayes, Bagging Classifier.
CHAPTER 1
INTRODUCTION
With the continuous development of the economy and society, digital applications in daily life
are becoming more extensive, and their use scenarios more varied. The corresponding demand
for handwritten digit recognition has also increased significantly. Handwritten digit recognition
is an important technology in computer vision, with a very wide range of applications in postal
codes, financial statements, and grade judgments. At present, handwritten digit recognition
technology is used in systems such as postal code recognition, automatic entry of bank
documents, and automatic entry of financial statements.
Many algorithms for handwritten digit recognition have been researched at home and abroad.
Commonly used algorithms include the support vector machine, Bayesian classification, neural
networks, the K-nearest neighbor algorithm, and so on. However, the recognition accuracy of
these algorithms is limited, because they have limited mathematical expressiveness and weak
generalization ability for complex classification problems. The emergence of convolutional
neural networks provides a way to overcome this accuracy limitation.
Using the fundamental principles of convolutional neural networks, this project develops a
PyTorch-based model and applies dropout regularization to prevent overfitting and enhance
recognition performance. The model is trained on the MNIST dataset and evaluated on the VS
Code platform. Experimental results demonstrate a recognition accuracy of approximately 98%,
indicating a high level of accuracy in recognizing handwritten digits.
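The exact network configuration is not reproduced in this chapter, so the following is only a
minimal PyTorch sketch of a CNN of the kind described – two convolution/pooling stages for
feature extraction, dropout for regularization, and fully connected layers classifying digits 0–9.
The layer sizes shown here are assumptions, not the project's verified settings.

# Minimal sketch of the kind of CNN described above (assumed layer sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DigitCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # 28x28 -> 28x28
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # 14x14 -> 14x14
        self.dropout = nn.Dropout(0.5)        # dropout regularization against overfitting
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)         # one output per digit 0-9

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28x28 -> 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 14x14 -> 7x7
        x = torch.flatten(x, 1)
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)                    # raw logits; train with CrossEntropyLoss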
1.1 Motivation
The ability to recognize handwritten digits accurately has become a cornerstone in various fields,
including banking, postal services, and document digitization. Traditional methods often rely on
rule-based algorithms or manual effort, which are prone to errors and inefficiencies when faced
with diverse handwriting styles. The advent of deep learning, particularly Convolutional Neural
Networks (CNNs), has revolutionized how we approach image recognition tasks by enabling
automated and highly accurate solutions.
This project was inspired by the need to create an accessible and interactive tool that
demonstrates the power of deep learning in real-time applications. By integrating a CNN model
with an intuitive drawing interface, the project seeks to bridge the gap between advanced
machine learning techniques and user-friendly applications. The motivation lies in showcasing
how AI can simplify complex tasks, such as digit recognition, while providing a hands-on,
engaging experience for users. Additionally, this project aims to highlight the practical
implications of AI in solving everyday problems, encouraging further innovation and exploration
in this domain.
The goal of this project is to develop a reliable handwritten digit recognition system that allows
users to draw digits interactively on a digital canvas. The system should preprocess the input,
classify the digit using a trained Convolutional Neural Network (CNN), and display the result in
real time. The primary challenges include handling variations in handwriting, ensuring accurate
predictions, and creating an intuitive interface for seamless user interaction. This project aims to
address these challenges and demonstrate the potential of deep learning in solving real-world
classification tasks effectively.
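The preprocessing step is described only at a high level here; as one hedged illustration, a canvas
drawing could be converted into a model-ready MNIST-style array along the following lines
(PIL and NumPy assumed – the project's exact pipeline may differ).

# Hypothetical preprocessing of a canvas drawing into an MNIST-style tensor.
import numpy as np
from PIL import Image

def preprocess(canvas_img: Image.Image) -> np.ndarray:
    img = canvas_img.convert("L")               # grayscale
    img = img.resize((28, 28), Image.LANCZOS)   # MNIST resolution
    arr = np.asarray(img, dtype=np.float32) / 255.0
    # MNIST digits are white strokes on a black background; invert if the
    # canvas uses dark ink on a light background.
    if arr.mean() > 0.5:
        arr = 1.0 - arr
    return arr.reshape(1, 1, 28, 28)            # (batch, channel, height, width)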
1.3 Existing System
South Asia. Another model utilized a CRM framework using neural networks and data mining to
predict customer behavior in banking. An algorithm was also developed based on clickstream
data of a website to extract information, and the predictive power of the model was tested on
data such as the number of clicks, repeated visits, repetitive purchases, etc. Nonetheless, these
models raised a few concerns that are yet to be addressed. The main drawbacks of the existing
systems include:
● Most of them were suited only for applying suitable models and drawing inferences from
predictions.
● None of them focused on the attributes crucial to customer churn.
1.5.2 Hardware Requirements
● RAM – 4GB minimum
1.5.1.1 Python
Python is a high-level, interpreted, interactive, and object-oriented scripting language created by
Guido van Rossum in 1989. Python is designed to be highly readable. Its language constructs
and object-oriented approach aim to help programmers write clear, logical code for small and
large-scale projects, and it is ideally suited to rapid prototyping of complex applications.
It has interfaces to many OS system calls and libraries and is extensible in C or C++. Python is
dynamically typed and garbage-collected. It supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. Python is widely used in
Artificial Intelligence, Natural Language Generation, Neural Networks, and other advanced
fields of Computer Science.
History of Python
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC language, capable of exception
handling and interfacing with the Amoeba operating system. Language developer Guido van
Rossum shouldered sole responsibility for the project until July 2018 but now shares his
leadership as a member of a five-person steering council.
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-
detecting garbage collector and support for Unicode.
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not
completely backward-compatible. Many of its major features were backported to the Python
2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at
least partially) the translation of Python 2 code to Python 3.
Features of Python
Python's features include:
• Easy to learn: Python has few keywords, a simple structure, and a clearly defined syntax. This
allows the student to pick up the language quickly.
• Easy to read: Python code is clearly defined and visible to the eyes.
• Easy to maintain: Python's source code is fairly easy to maintain.
• A broad standard library: the bulk of Python's library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.
• Interactive mode: Python supports an interactive mode which allows interactive testing and
debugging of snippets of code.
• Portable: Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.
• Extendable: you can add low-level modules to the Python interpreter. These modules enable
programmers to add to or customize their tools to be more efficient.
• Databases: Python provides interfaces to all major commercial databases.
• GUI programming: Python supports GUI applications that can be created and ported to many
system calls, libraries, and windowing systems, such as Windows MFC, Macintosh, and the X
Window System of Unix.
• Scalable: Python provides better structure and support for large programs than shell scripting.
Apart from the above-mentioned features, Python has a long list of good features, a few of
which are listed below:
• It supports functional and structured programming methods as well as OOP.
• It can be used as a scripting language or can be compiled to byte-code for building large
applications.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It supports automatic garbage collection.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Python Modules

NumPy
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains various features
including these important ones:
● A powerful N-dimensional array object
● Sophisticated (broadcasting) functions
● Tools for integrating C/C++ and Fortran code
● Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined using NumPy, which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
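As a small illustration of these capabilities (a generic example, not taken from the project code):

import numpy as np

a = np.arange(6).reshape(2, 3)       # a 2x3 N-dimensional array object
b = np.array([10, 20, 30])
print(a + b)                         # broadcasting: b is stretched across both rows
print(np.linalg.norm(a, axis=1))     # linear algebra helper
print(np.fft.fft(b))                 # Fourier transform capability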
Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with structured (tabular, multidimensional, potentially heterogeneous) and time
series data both easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis / manipulation tool available
in any language.
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-
dimensional), handle the vast majority of typical use cases in finance, statistics, social science,
and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame
provides and much more. pandas is built on top of NumPy and is intended to integrate well
within a scientific computing environment with many other 3rd party libraries.
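A brief illustrative example of the two structures (the values are invented for demonstration):

import pandas as pd

s = pd.Series([29.85, 56.95, 53.85], name="MonthlyCharges")      # 1-dimensional
df = pd.DataFrame({"customerID": ["A1", "B2", "C3"],
                   "MonthlyCharges": [29.85, 56.95, 53.85],
                   "Churn": ["No", "Yes", "No"]})                # 2-dimensional
print(df.describe())                                  # summary statistics
print(df.groupby("Churn")["MonthlyCharges"].mean())   # split-apply-combine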
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a
multi-platform data visualization library built on NumPy arrays and designed to work with the
broader SciPy stack. It was introduced by John Hunter in the year 2002.
One of the greatest benefits of visualization is that it gives us visual access to huge amounts of
data in an easily digestible form. Matplotlib provides several kinds of plots, such as line, bar,
scatter, and histogram plots. For simple plotting, the pyplot module provides a MATLAB-like
interface, particularly when combined with IPython. Power users have full control of line styles,
font properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
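For example, a line plot and a histogram through the pyplot interface (a generic sketch):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")        # line plot via the MATLAB-like interface
plt.hist(np.random.randn(1000), bins=30,
         alpha=0.3, label="random noise")     # histogram on the same axes
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()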
Plotly
Plotly is an interactive, open-source, and browser-based graphing library for Python. Built on top
of plotly.js, plotly.py is a high-level, declarative charting library. plotly.js ships with over 30
chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, financial charts,
and more.
Plotly has some amazing features that make it better than other graphing libraries:
● It is interactive by default.
● Charts are not saved as images but serialized as JSON, making them easy to read with R,
MATLAB, Julia, and other languages.
● It exports vector graphics for print/publication.
● It is easy to manipulate/embed on the web, as the sketch below shows.
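A minimal example of the default interactivity (the counts are illustrative, not measured):

import plotly.graph_objs as go

fig = go.Figure(data=[go.Bar(x=["No churn", "Churn"], y=[5174, 1869])])  # illustrative counts
fig.update_layout(title="Customer attrition (example)")
fig.show()   # the chart is serialized as JSON and rendered interactively by plotly.js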
1.5.1.2 Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda®
distribution that allows you to launch applications and easily manage conda packages,
environments, and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows,
macOS, and Linux.
Usage of Navigator
In order to run, many scientific packages depend on specific versions of other packages. Data
scientists often use multiple versions of many packages and use multiple environments to
separate these different versions.
The command-line program conda is both a package manager and an environment manager. This
helps data scientists ensure that each version of each package has all the dependencies it requires
and works correctly.
Navigator is an easy, point-and-click way to work with packages and environments without
needing to type conda commands in a terminal window. You can use it to find the packages you
want, install them in an environment, run the packages, and update them – all inside Navigator.
Advanced conda users can also build their own Navigator applications.
Notebook documents: A representation of all content visible in the web application, including
inputs and outputs of the computations, explanatory text, mathematics, images, and rich media
representations of objects.
● In-browser editing for code, with automatic syntax highlighting, indentation, and tab
completion/introspection.
● The ability to execute code from the browser, with the results of computations attached to the
code which generated them.
● Displaying the result of computation using rich media representations, such as HTML, LaTeX,
PNG, SVG, etc. For example, publication-quality figures rendered by the matplotlib library, can
be included inline.
● In-browser editing for rich text using the Markdown markup language, which can provide
commentary for the code and is not limited to plain text.
● The ability to easily include mathematical notation within Markdown cells using LaTeX,
which is rendered natively by MathJax.
Notebook name: The name displayed at the top of the page, next to the Jupyter logo, reflects the
name of the .ipynb file. Clicking on the notebook name brings up a dialog which allows you to
rename it. Thus, renaming a notebook from "Untitled0" to "My first notebook" in the browser
renames the Untitled0.ipynb file to My first notebook.ipynb.
Menu bar: The menu bar presents different options that may be used to manipulate the way the
notebook functions.
Toolbar: The toolbar gives a quick way of performing the most-used operations within the
notebook, by clicking on an icon.
Code cell: The default type of cell; read on for an explanation of cells.
2. LITERATURE SURVEY
Research work carried out on various techniques for churn prediction in different areas such as
telecom, e-commerce, and banking is discussed in the following paragraphs. Various researchers
have employed different mechanisms for predicting customer churn and for finding the most
useful features used in the prediction.
[3] Muhammad Azeem, Muhammad Usman and A. C. M. Fong published a paper titled "A
churn prediction model for prepaid customers in telecom using fuzzy classifiers". In this paper, a
fuzzy-based churn prediction model is proposed and validated using real data from a telecom
company in South Asia. A number of predominant classifiers, namely Neural Networks, Linear
Regression, C4.5, Support Vector Machines, AdaBoost, Gradient Boosting and Random Forest,
were compared with fuzzy classifiers to highlight the superiority of fuzzy classifiers in
predicting the accurate set of churners. Parameters such as TP rate and AUC were considered
and enhanced using the model.
[4] J. Vijaya and E. Sivasankar published a paper titled "An efficient system for customer
churn prediction through particle swarm optimization based feature selection model with
simulated annealing". It employs particle swarm optimization (PSO) and proposes three variants
of PSO for churn prediction, one of them being PSO incorporated with feature selection as its
pre-processing mechanism. The proposed classifiers were compared with a decision tree, naive
Bayes, K-nearest neighbor, support vector machine, random forest, and three hybrid models to
analyze their predictability levels and performance aspects. Experiments reveal that the meta-
heuristics were more efficient and also exhibited better predictability levels.
underlying fundamental context of the given optimization, which results in the identification of
critical factors leading to a well-performing ANN classification model.
[6] Gordini and Veglio published a paper titled "Customers churn prediction and marketing
retention strategies. An application of support vector machines based on the AUC parameter-
selection technique in B2B e-commerce industry". Parameters such as recentness, frequency,
length, product category, failure, monetary value, age, profession, gender, and request status
were taken for performance comparison. The prediction power of the proposed method was
found to be better compared to Linear Regression, Neural Networks, and SVM, especially for
noisy, imbalanced, and nonlinear data. Thus, their findings confirm that the data-driven approach
to churn prediction and the development of retention strategies outperforms commonly used
managerial heuristics in the B2B e-commerce industry.
[7] Femina Bahari and Sudheep Elayidom published a paper titled "An Efficient CRM-Data
Mining Framework for the Prediction of Customer Behaviour" in banking. The UCI dataset
containing direct bank marketing campaigns of a Portuguese bank was taken. The model is used
to predict the behaviour of customers in order to enhance the decision-making processes for
retaining valued customers. Two classification models, Naïve Bayes and Neural Networks, were
studied, and it was concluded that the Neural Network was better than the Naïve Bayes
algorithm for accuracy and specificity, while Naïve Bayes was better for sensitivity, TPR, FPR,
and ROC area. The Neural Network classified 4007/514 and Naïve Bayes 3977/544 instances
correctly/incorrectly.
[8] Anil Kumar D. and Ravi V. published a paper titled "Predicting credit card customer churn
in banks using data mining". An ensemble system was developed incorporating majority voting
and involving Multilayer Perceptron (MLP), Logistic Regression (LR), decision trees (J48),
Random Forest (RF), Radial Basis Function (RBF) network, and Support Vector Machine
(SVM) as the constituents. Since the dataset is highly unbalanced, with 93% loyal and 7%
churned customers, (1) undersampling, (2) oversampling, (3) a combination of undersampling
and oversampling, and (4) the Synthetic Minority Oversampling Technique (SMOTE) were
employed to balance it. The results indicated that SMOTE achieved good overall accuracy.
Also, SMOTE and a combination of undersampling and oversampling improved the sensitivity
and overall accuracy under majority voting. Moreover, the rules generated by the J48 decision
tree can act as an early warning expert system.
Table 2.1 Comparison of Literature survey

S.No: 3 | Year: 2017 | Author: Martin Fridrich
Title: Hyperparameter Optimization of Artificial Neural Network in Customer Churn Prediction
Techniques: Artificial Neural Network, Genetic Algorithms
Advantages: It provides a hyperparametric approach to optimize the customer churn prediction
model.
Disadvantages: Improved LR, decision tree, and fuzzy methods can be used; e-commerce
datasets and more parameters can be used for better prediction accuracy.

S.No: 4 | Year: 2016 | Author: Gordini and Veglio
Title: Customers churn prediction and marketing retention strategies
Techniques: Logistic Regression, Neural Networks & Support Vector Machines
Advantages: The parameter optimization procedure plays a key role in predictive performance;
SVMauc shows good generalization performance when applied to noisy data.
Disadvantages: The staying power of the model is not predicted; selection of the SVM kernel
function can be done more accurately, and more prediction variables can be included.
3. CUSTOMER CHURN PREDICTION METHODOLOGY
The basic layer for predicting future customer churn is data from the past. We look at data from
customers that have already churned (response) and their characteristics/behaviour (predictors)
before the churn happened. By fitting a statistical model that relates the predictors to the
response, the response for existing customers is predicted. The overall scope of work to forecast
customer attrition may look like the following:
● Data collection
● Classification
The goal of classification is to determine to which class or category a data point (that is, a
customer) belongs. For classification problems, historical data with predefined target variables,
that is, labels (churner/non-churner) – the answers that need to be predicted – is used to train an
algorithm. With classification, businesses can answer the following question:
● Will this customer churn or not?
● Regression
Customer churn prediction can also be formulated as a regression task. Regression analysis is a
statistical technique to estimate the relationship between a target variable and the other data
values that influence it, expressed in continuous values. The result of regression is always a
number, while classification always suggests a category. In addition, regression analysis allows
for estimating how strongly different variables in the data influence the target variable. With
regression, businesses can forecast in what period of time a specific customer is likely to churn,
or receive a probability estimate of churn per customer.
Data collection
Once the kinds of insights to look for are identified, the data sources necessary for further
predictive modeling can be decided. The dataset used for this project contains demographic
details of customers, their total charges, and the type of service they receive from the company.
It comprises churn data of over 7000 customers spread over 21 attributes (described in Table
3.1) obtained from the Kaggle website (as shown in Figure 3.1). It can be used to analyze all
relevant customer data and develop focused customer retention programs.
In Figure 3.1, each row represents a customer, and each column contains a customer attribute
described in the column metadata.
The data set includes information about:
● Customers who left within the last month – the column is called Churn
● Services that each customer has signed up for – phone, multiple lines, internet, online
security, online backup, device protection, tech support, and streaming TV and movies
● Customer account information – how long they’ve been a customer, contract, payment
method, paperless billing, monthly charges, and total charges
● Demographic info about customers – gender, age range, and if they have partners and
dependents
Table 3.1 Description of attributes used in the dataset

ATTRIBUTE        DESCRIPTION
customerID       Customer ID
gender           Whether the customer is a male or a female
PhoneService     Whether the customer has a phone service or not (Yes, No)
MultipleLines    Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService  Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity   Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup     Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport      Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV      Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies  Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract         The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling Whether the customer has paperless billing or not (Yes, No)
PaymentMethod    The customer’s payment method (Electronic check, Mailed check, Bank
                 transfer (automatic), Credit card (automatic))
MonthlyCharges   The amount charged to the customer monthly
TotalCharges     The total amount charged to the customer
Churn            Whether the customer churned or not (Yes or No)
Data preparation and preprocessing
Historical data that was selected for solving the problem must be transformed into a format
suitable for machine learning. Since model performance and therefore the quality of received
insights depend on the quality of data, the primary aim is to make sure all data points are
presented using the same logic, and the overall dataset is free of inconsistencies.
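As a sketch of what this can look like for the Telco dataset (column names as in Table 3.1; the
file name is hypothetical, and this is not the project's exact code, which appears in the Appendix):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

telcom = pd.read_csv("telco_churn.csv")          # hypothetical file name

# Coerce TotalCharges to numeric and drop the inconsistent rows.
telcom["TotalCharges"] = pd.to_numeric(telcom["TotalCharges"], errors="coerce")
telcom = telcom.dropna(subset=["TotalCharges"])

# Encode binary columns as 1/0 so every model sees the same logic.
le = LabelEncoder()
for col in ["gender", "PaperlessBilling", "Churn"]:   # example binary columns
    telcom[col] = le.fit_transform(telcom[col])

# Put numeric columns on a comparable scale.
num_cols = ["MonthlyCharges", "TotalCharges"]
telcom[num_cols] = StandardScaler().fit_transform(telcom[num_cols])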
Figure 3.2. How different user behavior, subscription, and demographic features correlate with
churn [9].
Feature extraction aims at reducing the number of variables (attributes) by keeping the ones that
carry the most discriminative information. Feature extraction helps to reduce the data
dimensionality (dimensions are columns with attributes in a dataset) and exclude irrelevant
information [10]. During this process, specialists revise previously extracted features and define
a subgroup of them that is most correlated with customer churn. As a result of feature selection,
specialists have a dataset with only relevant features.
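A hedged sketch of this selection step; the Appendix uses SelectKBest with a chi-squared test in
the same spirit, and X and y below stand for the prepared feature matrix and churn labels:

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=10)    # keep the 10 best-scoring features
X_reduced = selector.fit_transform(X, y)         # X must be non-negative for chi2
kept = X.columns[selector.get_support()]         # names of the retained features
print("Most discriminative features:", list(kept))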
3.1 Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a
binary dependent variable. Mathematically, a binary logistic model has a dependent variable with
two possible values, such as pass/fail which is represented by an indicator variable, where the
two values are labeled "0" and "1".
Figure 3.3 The binary logistic regression model basically gives you two possible values – 0/1,
happy/sad and churn/not churn.
Logistic Function
Logistic regression is named for the function used at the core of the method, the logistic
function. The logistic function, also called the sigmoid function, is an S-shaped curve (as shown
in the figure) that can take any real-valued number and map it into a value between 0 and 1, but
never exactly at those limits:

$\sigma(t) = \frac{1}{1 + e^{-t}}$

where $e$ is the base of the natural logarithms and $t$ is the actual numerical value that you
want to transform.
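In scikit-learn terms, the model described above reduces to a few lines (a sketch only; X_train,
X_test, and y_train stand for the prepared churn data and are assumptions here):

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)                    # y_train holds 0/1 churn labels
p_churn = logit.predict_proba(X_test)[:, 1]    # sigmoid output, a value in (0, 1)
pred = (p_churn >= 0.5).astype(int)            # threshold at 0.5 gives the 0/1 label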
3.2 Decision Trees
Decision tree learning is one of the predictive modeling approaches that uses a decision tree (as a
predictive model) to go from observations about an item i.e. attribute (represented in the
branches) to conclusions about the item's target value i.e. churn or not (represented in the leaves).
Tree models where the target variable can take a discrete set of values are called classification
trees; in these tree structures, leaves represent class labels and branches represent conjunctions of
features that lead to those class labels. Decision trees where the target variable can take
continuous values (typically real numbers) are called regression trees. This algorithm splits a
data sample into two or more homogeneous sets based on the most significant differentiator in
the input variables to make a prediction. With each split, a part of the tree is generated. As a
result, a tree with decision nodes and leaf nodes (which are decisions or classifications) is
developed. A tree starts from a root node – the best predictor.
Prediction results of decision trees can be easily interpreted and visualized. Even people without
an analytical or data science background can understand how a certain output appeared.
Compared to other algorithms, decision trees require less data preparation, which is also an
advantage. However, they may be unstable if any small changes were made in data. In other
words, variations in data may lead to radically different trees being generated.
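A minimal scikit-learn sketch; the criterion argument selects the split-quality measure (Gini
index or entropy), and limiting the depth is one way to counter the instability noted above
(X_train, y_train, and X_test are assumed placeholders):

from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(criterion="gini",   # or "entropy"
                               max_depth=3,        # a shallow tree stays interpretable
                               random_state=0)
dtree.fit(X_train, y_train)
print(dtree.predict(X_test[:5]))                   # leaf decisions: churn / not churn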
3.3 K-Nearest Neighbors Algorithm
The k-nearest neighbors algorithm (k-NN) is a method used for classification and regression. In
both cases, the input consists of the k closest training examples in the feature space.
In k-NN classification, the output is a class membership (churn or not). A customer is classified
by a plurality vote of its neighbors, with the customer being assigned to the class most common
among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the
customer is simply assigned to the class of that single nearest neighbor.
To determine which of the k instances in the training dataset are most similar to a new input, a
distance measure is used. For real-valued input variables, the most popular distance measure is
the Euclidean distance.
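For reference, the Euclidean distance between a new input $x$ and a training example $x'$ over
$n$ real-valued features is

$d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$

and the k training points with the smallest $d$ are the neighbors that vote.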
3.4 Support Vector Machine
Figure 3.7 Support vectors used to classify data items by separating them with a hyperplane
Advantages:
● It works really well with a clear margin of separation.
● It is effective in high-dimensional spaces.
● It is effective in cases where the number of dimensions is greater than the number of samples.
● It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
Disadvantages:
● It does not perform well when we have a large data set, because the required training time is
higher.
● It also does not perform very well when the data set has more noise, i.e. the target classes are
overlapping.
● SVM does not directly provide probability estimates; these are calculated using an expensive
five-fold cross-validation. This is related to the SVC method of the Python scikit-learn library.
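A short scikit-learn sketch reflecting the notes above; probability=True is what triggers the
expensive internal cross-validation (X_train, y_train, and X_test are assumed placeholders):

from sklearn.svm import SVC

svm = SVC(kernel="rbf", probability=True, random_state=0)
svm.fit(X_train, y_train)
p_churn = svm.predict_proba(X_test)[:, 1]   # calibrated via internal cross-validation
print("Support vectors per class:", svm.n_support_)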
Deployment and monitoring
The selected models need to be put into production. Predicting customer churn with machine
learning is an iterative process that never ends. We monitor model performance and adjust
features as necessary to improve accuracy when customer-facing teams give us feedback or new
data becomes available. At the point of any human interaction – a support call, a CSM QBR
[quarterly business review], a Sales discovery call – we monitor and log the human
interpretation of customer health, which augments the machine learning models and increases
the accuracy of our health prediction for each customer.
Insights and Actions
Lastly, we have to evaluate and interpret the outcomes. Predicting customer churn is only half of
the job; many people forget that customers who are merely predicted to leave can still leave. In
our case we actually want to stop them from leaving. The selection of the most significant
features for a model influences its predictive performance: the higher the quality of the dataset,
the more precise the forecasts.
The ability to identify customers that aren’t happy with provided solutions allows businesses to
learn about product or pricing plan weak points, operation issues, as well as customer preferences
and expectations to proactively reduce reasons for churn.
3.5 UML Diagrams
3.5.1 Data Flow Diagram
3.5.3 Activity Diagram
4. TESTING AND RESULTS
Testing is a crucial phase that determines the quality of the models used as well as the
importance of all the features under consideration. The algorithms used in this project have been
rigorously tested on various factors including accuracy, recall, precision, F1 score, and the kappa
statistic.
Accuracy - It measures how many observations, both positive and negative, were correctly
classified.
$\text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn}$ … (1)
Figure 4.1. From the above figure, it is clear that Logistic Regression has the highest accuracy
of 0.802, while the KNN classifier performed the worst with an accuracy of 0.699.
Recall - It measures how many observations, out of all positive observations, we have classified
as positive. Taking our customer churn example, it tells us how many churned customers we
recalled from all the churned customers.
$\text{Recall} = \frac{tp}{tp + fn}$ … (2)
While optimizing recall, you want to make sure you have identified ALL the customers who
could churn.
For the SVM classifier, Recall = 0.798 = 79.8%.
For the Decision Tree, Recall = 0.455 = 45.5%.
Figure 4.2. From the above figure, it is clear that the SVM classifier has the highest recall score
of 0.798, while the Decision Tree has the least recall score of 0.455.
Precision - It measures how many observations predicted as positive are in fact positive. Taking
our churn example, it tells us the ratio of customers correctly classified as churned.
$\text{Precision} = \frac{tp}{tp + fp}$ … (3)
While optimizing precision, you want to make sure that the customers that you classify as
churned ARE ACTUALLY CHURNED.
For Logistic Regression, Precision = 0.688 = 68.8%.
For the KNN classifier, Precision = 0.474 = 47.4%.
Figure 4.3. From the above figure, it is clear that Logistic Regression is the most precise at
0.688, while the KNN classifier performed the worst with a precision of 0.474.
F-1 score - Simply put, it combines precision and recall into one metric. It’s the harmonic mean
between precision and recall. A perfect F1-score is 1.0 or 100%. The closer it is to 1.0, the better
the model. You can calculate it in the following way:
$F_{\beta} = (1 + \beta^2) \cdot \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$, where $\beta = 1$ … (4)

For the SVM classifier,
$F_1 = 2 \times \frac{0.5312 \times 0.798}{0.5312 + 0.798} = 0.638$

For the Decision Tree,
$F_1 = 2 \times \frac{0.5947 \times 0.4551}{0.5947 + 0.4551} = 0.515$
Figure 4.4. From the above figure, it is clear that the SVM classifier has the highest F1-score of
0.638, while the Decision Tree has the least score of 0.515.
In the formulae of the above metrics, tp is the number of true positives, fp the number of false
positives, tn the number of true negatives, and fn the number of false negatives.
Kappa Metric - It is a measure of agreement between the predictions and the actual labels. It
can also be interpreted as a comparison of the overall accuracy to the expected random chance
accuracy. The higher the Kappa metric is, the better your classifier is compared to a random
chance classifier. Kappa is defined as the difference between the overall accuracy and the
expected accuracy divided by 1 minus the expected accuracy.
$K = \frac{p_o - p_e}{1 - p_e}$ … (5)

where $p_o$ is the observed agreement w.r.t. our classifier and $p_e$ is the expected agreement
w.r.t. a random classifier.

For Logistic Regression, $K = \frac{0.8020 - 0.7425}{1 - 0.7425} = 0.470$.

For the KNN classifier, $K = \frac{0.6991 - 0.7425}{1 - 0.7425} = 0.353$.
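All five metrics above are available directly in scikit-learn; a sketch, with y_test and predictions
standing for the held-out labels and the model outputs:

from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, cohen_kappa_score)

print("Accuracy :", accuracy_score(y_test, predictions))    # eq. (1)
print("Recall   :", recall_score(y_test, predictions))      # eq. (2)
print("Precision:", precision_score(y_test, predictions))   # eq. (3)
print("F1       :", f1_score(y_test, predictions))          # eq. (4) with beta = 1
print("Kappa    :", cohen_kappa_score(y_test, predictions)) # eq. (5)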
Figure 4.5. From the above figure, it is clear that Logistic Regression has the highest Kappa
metric of 0.470 while the KNN classifier has the least value of 0.353.
4.1 Model Performances

4.1.1 Logistic Regression
Figure 4.6. From the above classification report and ROC, the following information can be
concluded:
● Accuracy of 0.80 indicates that 80% of the customers were correctly classified.
● Precision of 0.83 indicates that 83% of the churned customers predicted by the model had
actually churned.
● Recall score of 0.91 indicates that the model was able to predict 91% of the actual churned
customers as churned.
● F1 score of 0.87 out of a maximum of 1 indicates that the model performs really well.
● The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is 0.71 out
of a maximum of 1, which again indicates that the model performs well.
Figure 4.7. From the Logistic Regression algorithm, the above graph highlights the following
points:
● Attributes such as Contract_Two_year, Tenure_group_12-24 and InternetService_No contribute
the most towards churn. This implies that customers having a two-year contract with the
company, or who have stayed with the company for 12 to 24 months or have no internet service
are more likely to leave the company.
● Attributes such as Contract_Month-to-month, Tenure_group_48-60, Tenure_group_gt_60 and
PaperlessBilling contribute the least towards churn. This implies that customers having a monthly
contract with the company, or who have stayed with the company for more than 48 months or
have enrolled for a paperless billing service are more likely to stay with the company.
● Attributes such as Partner, DeviceProtection and MultipleLines_Yes have a negligible
contribution in deciding customer churn. This implies that having a partner or not, the device
protection service or not, or multiple phone lines plays an insignificant role in estimating the
likelihood of a customer leaving the company.
4.1.2 Decision Tree Classifier
Figure 4.8. From the above classification report and ROC, the following information can be
concluded:
● Accuracy of 0.72 indicates that 72% of the customers were correctly classified.
● Precision of 0.85 indicates that 85% of the churned customers predicted by the model have
actually churned.
● Recall score of 0.74 indicates that the model was able to predict 74% of the actual churned
customers as churned.
● F1 score of 0.79 out of a maximum of 1 indicates that the model performs well.
● The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is 0.70 out
of a maximum of 1, which again indicates that the model performs well.
Figure 4.9. From the Decision Tree Classifier, the above graph highlights the following points:
● Attributes such as Tenure_group_gt_60, TotalCharges and MonthlyCharges contribute the most
towards churn. This implies that customers who have stayed with the company for more than 60
months are more likely to leave the company. High charges levied upon the customer for the
services provided also contribute to churn.
● Attribute such as Contract_Month-to-month contributes the least towards churn. This implies that
customers having a monthly contract are given the flexibility to choose among different plans and
hence are more likely to stay with the company.
● Attributes such as Gender, PhoneService and PaymentMethod_Bank_transfer have a
negligible contribution in deciding customer churn. This implies that the customer’s gender,
having a phone service or not, or using a bank transfer as a payment method plays an
insignificant role in estimating the likelihood of a customer leaving the company.
4.1.3 KNN Classifier
Figure 4.10. From the above classification report and ROC, the following information can be
concluded:
● Accuracy of 0.69 indicates that 69% of the customers were correctly classified.
● Precision of 0.86 indicates that 86% of the churned customers predicted by the model have
actually churned.
● Recall score of 0.69 indicates that the model was able to predict 69% of the actual churned
customers as churned.
● F1 score of 0.77 out of a maximum of 1 indicates that the model performs well.
● The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is 0.70 out
of a maximum of 1, which again indicates that the model performs well.
4.1.4 SVM Classifier
Figure 4.11. From the above classification report and ROC, the following information can be
concluded:
● Accuracy of 0.75 indicates that 75% of the customers were correctly classified.
● Precision of 0.90 indicates that 90% of the churned customers predicted by the model have
actually churned.
● Recall score of 0.73 indicates that the model was able to predict 73% of the actual churned
customers as churned.
● F1 score of 0.81 out of a maximum of 1 indicates that the model performs well.
● The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is 0.76 out
of a maximum of 1, which again indicates that the model performs well.
Figure 4.12. From the SVM Classifier, the above graph highlights the following points:
4.2 Comparison of Models
A thorough comparison of algorithms based on the metrics mentioned above gives a
comprehensive insight into the performance and efficiency of each of them. Their performances
can be summarized as follows:
Figure 4.13. Graphical summarization of the performances of all the algorithms used

Table 4.1. Comparison of Results
Model Accuracy_score Recall_score Precision f1_score Kappa_metric
Logistic Regression 0.8020 0.5286 0.6888 0.5982 0.4698
Decision Tree 0.7218 0.4551 0.5947 0.5156 0.3612
KNN Classifier 0.6991 0.7163 0.4737 0.5703 0.3532
SVM Classifier 0.7474 0.798 0.5312 0.6378 0.4557
From the above table, we observe that the results predicted by the Logistic Regression algorithm
are the most efficient, evident from the high accuracy, precision, kappa metric and f1 score.
5. CONCLUSION AND FUTURE WORK
5.1 Conclusion
Churn prediction is one of the most effective strategies used in the telecom sector to retain
existing customers. It leads directly to improved cost allocation in customer relationship
management activities, retaining revenue and profits in the future. It also has several positive
indirect impacts, such as increasing customers' loyalty, lowering customers' sensitivity to
competitors' marketing activities, and helping to build a positive image through satisfied
customers.
The results predicted by the Logistic Regression algorithm were the most efficient, with an
accuracy of 80.2%. Therefore, companies that want to prevent customer churn should utilize this
algorithm and replace features like long-term contracts with monthly or short-term contracts,
thereby giving customers more flexibility. Providing additional services such as device
protection and multiple phone lines proves to have little effect on customer attrition. Lastly,
focusing on enhancing the experience of loyal customers who have stayed with the company for
long will prove worthwhile, ensuring their retention. The ability to identify customers that aren't
happy with the provided solutions allows businesses to learn about product or pricing plan weak
points, operational issues, as well as customer preferences and expectations, and to proactively
reduce the reasons for churn.
5.2 Future Work
A comparative analysis of prediction model building time with respect to different classifiers
could be done in order to assist telecom analysts in picking a classifier which not only gives
accurate results in terms of TP rate, AUC, and lift curve but also scales well with the high
dimensionality and large volume of call records data. As the concrete findings relate to the
telecom dataset, datasets from other domains might be the subject of further exploration and
testing. Also, a greater number of performance metrics with respect to business context and
interpretability might be explored in the future.
BIBLIOGRAPHY
[1] Pavan Raj. Telecom Customer Churn Prediction, October 29th, 2018. Available:
https://www.kaggle.com/pavanraj159/telecom-customer-churn-prediction/
[2] Dataset resource link. Available: https://www.kaggle.com/blastchar/telco-customer-churn
[3] Azeem, M., Usman, M. & Fong, A.C.M. A churn prediction model for prepaid customers in
telecom using fuzzy classifiers. Telecommun Syst 66, 603–614 (2017).
[4] Vijaya, J. & Elango, Sivasankar. An efficient system for customer churn prediction through
particle swarm optimization based feature selection model with simulated annealing. Cluster
Computing. 22. 10.1007/s10586-017-1172-1 (2017).
[5] Fridrich, Martin. Hyperparameter optimization of artificial neural network in customer churn
prediction using genetic algorithm. 11. 9. 10.13164/trends.2017.28.9 (2017).
[6] Gordini, Niccolo & Veglio, Valerio. Customers Churn Prediction And Marketing Retention
Strategies. An Application of Support Vector Machines Based On the Auc Parameter-Selection
Technique In B2B E-Commerce Industry. Industrial Marketing Management. 62.
10.1016/j.indmarman.2016.08.003 (2016).
[7] Bahari, Femina T. and M. Sudheep Elayidom. "An Efficient CRM-Data Mining Framework
for the Prediction of Customer Behaviour" (2015).
[8] Kumar, Dudyala & Ravi, Vadlamani. Predicting credit card customer churn in banks using
data mining. International Journal of Data Analysis Techniques and Strategies. 1. 4-28.
10.1504/IJDATS.2008.020020 (2008).
[9] Customer Churn Prediction for Subscription Businesses Using Machine Learning: Main
Approaches and Models. Available: https://www.altexsoft.com/blog/business/customer-churn-
prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/
[10] Hands-on: Predict Customer Churn. Available: https://towardsdatascience.com/hands-on-
predict-customer-churn-5c2a42806266
APPENDIX
# Importing libraries
import numpy as np
import pandas as pd  # data processing
import os
import matplotlib.pyplot as plt  # visualization
from PIL import Image
%matplotlib inline
import seaborn as sns  # visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
import plotly.offline as py  # visualization
py.init_notebook_mode(connected=True)  # visualization
import plotly.graph_objs as go  # visualization
import plotly.tools as tls  # visualization
import plotly.figure_factory as ff  # visualization
# Data overview
print("Rows    : ", telcom.shape[0])
print("Columns : ", telcom.shape[1])
print("\nFeatures : \n", telcom.columns.tolist())
print("\nMissing values : ", telcom.isnull().sum().values.sum())
print("\nUnique values : \n", telcom.nunique())
# Data Manipulation
# Replacing spaces with null values in the TotalCharges column
telcom['TotalCharges'] = telcom["TotalCharges"].replace(" ", np.nan)

# Dropping null values from the TotalCharges column, which contains 0.15% missing data
telcom = telcom[telcom["TotalCharges"].notnull()]
telcom = telcom.reset_index()[telcom.columns]
# Replace 'No internet service' with 'No' for the following columns
replace_cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport', 'StreamingTV', 'StreamingMovies']
for i in replace_cols:
    telcom[i] = telcom[i].replace({'No internet service': 'No'})

# Replace values
telcom["SeniorCitizen"] = telcom["SeniorCitizen"].replace({1: "Yes", 0: "No"})
trace = go.Pie(labels = lab,
               values = val,
               marker = dict(colors = ['royalblue', 'lime'],
                             line = dict(color = "white", width = 1.3)),
               rotation = 90,
               hoverinfo = "label+value+text",
               hole = .5)
layout = go.Layout(dict(title = "Customer attrition in data",
                        plot_bgcolor = "rgb(243,243,243)",
                        paper_bgcolor = "rgb(243,243,243)"))
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

# Variables distribution in customer attrition
# Function for pie plot for customer attrition types
def plot_pie(column):
name = "Non churn customers"
)
                   opacity = .9)

    data = [trace1, trace2]
    layout = go.Layout(dict(title = column + " distribution in customer attrition",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)', title = column,
                                         zerolinewidth = 1, ticklen = 5, gridwidth = 2),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)', title = "percent",
                                         zerolinewidth = 1, ticklen = 5, gridwidth = 2),
                            ))
    fig = go.Figure(data = data, layout = layout)
    py.iplot(fig)
# Function for scatter plot matrix for numerical columns in data
def scatter_matrix(df):
    pl_colorscale = "Portland"
    text = [df.loc[k, "Churn"] for k in range(len(df))]

    layout = go.Layout(dict(title = "Scatter plot matrix for Numerical columns for customer attrition",
                            autosize = False,
                            height = 800,
                            width = 800,
                            dragmode = "select",
                            hovermode = "closest",
                            plot_bgcolor = 'rgba(240,240,240, 0.95)',
                            xaxis1 = dict(axis), yaxis1 = dict(axis),
                            xaxis2 = dict(axis), yaxis2 = dict(axis),
                            xaxis3 = dict(axis), yaxis3 = dict(axis),
                            ))
    data = [trace]
    fig = go.Figure(data = data, layout = layout)
    py.iplot(fig)

# Bar chart - churn customers
trace1 = go.Bar(x = tg_ch["tenure_group"], y = tg_ch["count"],
                name = "Churn Customers",
                marker = dict(line = dict(width = .5, color = "black")),
                opacity = .9)
data = [trace1, trace2]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)
# Data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Customer id column
Id_col = ['customerID']
# Target column
target_col = ["Churn"]
# Categorical columns
cat_cols = telcom.nunique()[telcom.nunique() < 6].keys().tolist()
cat_cols = [x for x in cat_cols if x not in target_col]
# Numerical columns
num_cols = [x for x in telcom.columns if x not in cat_cols + target_col + Id_col]
# Binary columns with 2 values
bin_cols = telcom.nunique()[telcom.nunique() == 2].keys().tolist()
# Columns with more than 2 values
multi_cols = [i for i in cat_cols if i not in bin_cols]
# Variable summary
summary = (df_telcom_og[[i for i in df_telcom_og.columns if i not in Id_col]]
           .describe().transpose().reset_index())

# Correlation matrix
correlation = telcom.corr()
# Tick labels
matrix_cols = correlation.columns.tolist()
# Convert to array
corr_array = np.array(correlation)

# Plotting
trace = go.Heatmap(z = corr_array,
                   x = matrix_cols,
                   y = matrix_cols,
                   colorscale = "Viridis",
                   colorbar = dict(title = "Pearson Correlation coefficient",
                                   titleside = "right"),
                   )
layout = go.Layout(dict(title = "Correlation Matrix for variables",
                        autosize = False,
                        height = 720,
                        width = 800,
                        margin = dict(r = 0, l = 210, t = 25, b = 210),
                        yaxis = dict(tickfont = dict(size = 9)),
                        xaxis = dict(tickfont = dict(size = 9))))
data = [trace]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)
# Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import roc_auc_score, roc_curve, scorer
from sklearn.metrics import f1_score
import statsmodels.api as sm
from sklearn.metrics import precision_score, recall_score
from yellowbrick.classifier import DiscriminationThreshold

# Splitting train and test data
train, test = train_test_split(telcom, test_size = .25, random_state = 111)

# Function attributes
# dataframe  - processed dataframe
# Algorithm  - Algorithm used
# training_x - predictor variables dataframe (training)
# testing_x  - predictor variables dataframe (testing)
# training_y - target variable (training)
# testing_y  - target variable (testing)
# cf - ["coefficients","features"] (coefficients for logistic
#      regression, features for tree-based models)
# threshold_plot - if True, returns the discrimination threshold plot

def telecom_churn_prediction(algorithm, training_x, testing_x,
                             training_y, testing_y, cols, cf, threshold_plot):
    column_df = pd.DataFrame(cols)
    coef_sumry = (pd.merge(coefficients, column_df, left_index = True,
                           right_index = True, how = "left"))
    coef_sumry.columns = ["coefficients", "features"]
    coef_sumry = coef_sumry.sort_values(by = "coefficients", ascending = False)

    print(algorithm)
    print("\n Classification report : \n", classification_report(testing_y, predictions))
    print("Accuracy Score : ", accuracy_score(testing_y, predictions))

    # Confusion matrix
    conf_matrix = confusion_matrix(testing_y, predictions)
    # roc_auc_score
    model_roc_auc = roc_auc_score(testing_y, predictions)
    print("Area under curve : ", model_roc_auc, "\n")
    fpr, tpr, thresholds = roc_curve(testing_y, probabilities[:, 1])
    # Plot ROC curve
    trace2 = go.Scatter(x = fpr, y = tpr,
                        name = "Roc : " + str(model_roc_auc),
                        line = dict(color = 'rgb(22, 96, 167)', width = 2))
    trace3 = go.Scatter(x = [0, 1], y = [0, 1],
                        line = dict(color = 'rgb(205, 12, 24)', width = 2, dash = 'dot'))

    # Plot coefficients
    trace4 = go.Bar(x = coef_sumry["features"], y = coef_sumry["coefficients"],
                    name = "coefficients",
                    marker = dict(color = coef_sumry["coefficients"],
                                  colorscale = "Picnic",
                                  line = dict(width = .6, color = "black")))

    # Subplots
    fig = tls.make_subplots(rows = 2, cols = 2,
                            specs = [[{}, {}], [{'colspan': 2}, None]],
                            subplot_titles = ('Confusion Matrix',
                                              'Receiver operating characteristic',
                                              'Feature Importances'))

    if threshold_plot == True:
        visualizer = DiscriminationThreshold(algorithm)
        visualizer.fit(training_x, training_y)
        visualizer.poof()
logit = LogisticRegression(C = 1.0, class_weight = None, dual = False, fit_intercept = True,
                           intercept_scaling = 1, max_iter = 100, multi_class = 'ovr',
                           n_jobs = 1, penalty = 'l2', random_state = None,
                           solver = 'liblinear', tol = 0.0001, verbose = 0, warm_start = False)

telecom_churn_prediction(logit, train_X, test_X, train_Y, test_Y, cols,
                         "coefficients", threshold_plot = True)

telecom_churn_prediction(logit_smote, os_smote_X, test_X, os_smote_Y, test_Y,
                         cols, "coefficients", threshold_plot = True)
# Recursive Feature Elimination
from sklearn.feature_selection import RFE

logit = LogisticRegression()
rfe = RFE(logit, 10)
rfe = rfe.fit(os_smote_X, os_smote_Y.values.ravel())
rfe.support_
rfe.ranking_

tab_rk = ff.create_table(idc_rfe)
py.iplot(tab_rk)
# Univariate Selection
# Feature extraction with univariate statistical tests (chi-squared for classification)
# Uses the chi-squared (chi^2) statistical test for non-negative features to select the best features
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
# Select columns
cols = [i for i in telcom.columns if i not in Id_col + target_col]
# Dataframe with non-negative values
df_x = df_telcom_og[cols]
df_y = df_telcom_og[target_col]

# Create dataframe of scores
score = pd.DataFrame({"features": cols, "scores": fit.scores_, "p_values": fit.pvalues_})
score = score.sort_values(by = "scores", ascending = False)

# Plot
trace = go.Scatter(x = score[score["feature_type"] == "Categorical"]["features"],
                   y = score[score["feature_type"] == "Categorical"]["scores"],
                   name = "Categorical",
                   mode = "lines+markers",
                   marker = dict(color = "red", line = dict(width = 1))
                   )
layout = go.Layout(dict(xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                     tickfont = dict(size = 10), domain = [0, 0.7],
                                     tickangle = 90, zerolinewidth = 1,
                                     ticklen = 5, gridwidth = 2),
                        yaxis = dict(gridcolor = 'rgb(255, 255, 255)', title = "scores",
                                     zerolinewidth = 1, ticklen = 5, gridwidth = 2),
                        margin = dict(b = 200),
                        xaxis2 = dict(domain = [0.8, 1], tickangle = 90,
                                      gridcolor = 'rgb(255, 255, 255)'),
                        yaxis2 = dict(anchor = 'x2', gridcolor = 'rgb(255, 255, 255)')))
data = [trace, trace1]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)
# Decision Tree
# Using top three numerical features
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn import tree
from graphviz import Source
from IPython.display import SVG, display

# Function attributes
# columns           - selected columns
# maximum_depth     - depth of tree
# criterion_type    - ["gini" or "entropy"]
# split_type        - ["best" or "random"]
# Model Performance - True (gives model output)
# Separating dependent and independent variables
dtc_x = df_x[columns]
dtc_y = df_y[target_col]

# Model
dt_classifier = DecisionTreeClassifier(max_depth = maximum_depth,
                                       splitter = split_type,
                                       criterion = criterion_type,
                                       )
dt_classifier.fit(dtc_x, dtc_y)

# Model performance
if model_performance == True:
    telecom_churn_prediction(dt_classifier,
                             dtc_x, test_X[columns], dtc_y, test_Y,
                             columns, "features", threshold_plot = True)

display(graph)
plot_decision_tree(features_num, 3, "gini", "best")
# Model
algorithm.fit(training_x, training_y)
predictions = algorithm.predict(testing_x)
probabilities = algorithm.predict_proba(testing_x)

print(algorithm)
print("\n Classification report : \n", classification_report(testing_y, predictions))
print("Accuracy Score : ", accuracy_score(testing_y, predictions))

# Confusion matrix
conf_matrix = confusion_matrix(testing_y, predictions)
# roc_auc_score
model_roc_auc = roc_auc_score(testing_y, predictions)
print("Area under curve : ", model_roc_auc)
fpr, tpr, thresholds = roc_curve(testing_y, probabilities[:, 1])
                        zerolinewidth = 1, ticklen = 5, gridwidth = 2),
                        margin = dict(b = 200),
                        xaxis2 = dict(domain = [0.7, 1], tickangle = 90,
                                      gridcolor = 'rgb(255, 255, 255)'),
                        yaxis2 = dict(anchor = 'x2', gridcolor = 'rgb(255, 255, 255)')))

data = [trace1, trace2, trace3]
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

if threshold_plot == True:
    visualizer = DiscriminationThreshold(algorithm)
    visualizer.fit(training_x, training_y)
    visualizer.poof()
from sklearn.metrics import cohen_kappa_score

# Gives the model report as a dataframe
def model_report(model, training_x, testing_x, training_y, testing_y, name):
    model.fit(training_x, training_y)
    predictions = model.predict(testing_x)
    accuracy = accuracy_score(testing_y, predictions)
    recallscore = recall_score(testing_y, predictions)
    precision = precision_score(testing_y, predictions)
    roc_auc = roc_auc_score(testing_y, predictions)
    f1score = f1_score(testing_y, predictions)
    kappa_metric = cohen_kappa_score(testing_y, predictions)
table = ff.create_table(np.round(model_performances, 4))
py.iplot(table)