CS306 Data Analysis and Visualization

This document is a project report for analyzing and visualizing customer data from a bank. It summarizes the data cleaning and geocoding process, how customers were classified into household and business types, and the results of clustering addresses using k-means. Visualizations including scatter plots and heatmaps of current, permanent, household and business addresses on maps of the world and India are presented.

Uploaded by

Dhruvesh Asnani


CS306

Data Analysis and Visualization

Project Report

Prepared by:

Tirth Shah - 201601009

Dhruvesh Asnani - 201601423

1 Data Cleaning and Geocoding
We first reduced the size of the data by taking the first 3 lakh (300,000) entries of yes bank.csv. We then
inspected this reduced file to derive criteria for removing invalid records. We observed that many records
had the value addpin in their current pin code field. Such records had no value in any of the fields related
to the permanent address and had similar values, such as addcity and addstate, in the fields related to the
current address. We therefore removed every record that had addpin in the current pin code field. This was
the sole cleaning criterion at this stage. After this step, 161065 records remained (roughly half).
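
The cleaning rule above can be sketched with Python's standard csv module. This is a minimal illustration, not the code used in the project: the column names (customer_id, current_pincode, permanent_pincode) and the sample rows are hypothetical, since the report does not give the real header of the input file.

```python
import csv
import io

# Hypothetical column name: the report does not list the real CSV header.
CURRENT_PIN_FIELD = "current_pincode"
INVALID_VALUE = "addpin"


def drop_invalid_rows(rows, limit=300_000):
    """Read at most `limit` rows and keep those whose current pin code is not 'addpin'."""
    kept = []
    for index, row in enumerate(rows):
        if index >= limit:
            break
        if row[CURRENT_PIN_FIELD].strip().lower() != INVALID_VALUE:
            kept.append(row)
    return kept


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "customer_id,current_pincode,permanent_pincode\n"
    "1,382007,382007\n"
    "2,addpin,\n"
    "3,400001,110001\n"
)
cleaned = drop_invalid_rows(csv.DictReader(sample))
print([row["customer_id"] for row in cleaned])
```

Reading rows lazily and stopping at the limit avoids loading the full file when only the first 3 lakh entries are needed.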
Next, we performed geocoding. We found a CSV file on GitHub [1] that lists the pincodes of Indian regions
along with other relevant data, including the latitude and longitude of a point in the region represented
by each pincode. Using this file, we geocoded both the current and the permanent address of around 1.5 lakh
records based on their pincodes, that is, around 3 lakh addresses in total. Around 10,000 records remained.
We geocoded these using the geocoder API in Python, specifically the geocoding service provided by
Nominatim, which resolved around another 5000 records. In the end, around 5000 records could not be
geocoded, and we treated these records as invalid.
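
The two-stage geocoding described above (pincode lookup first, a geocoding service for the remainder) can be sketched as follows. This is an assumption-laden illustration rather than the project's code: the record keys, the lookup table contents, and the coordinates are made up, and the fallback is a stub so that the example runs offline. In practice the fallback would wrap a call to a service such as Nominatim.

```python
def geocode_records(records, pincode_table, fallback=None):
    """Split records into (geocoded, failed).

    records:       list of dicts with hypothetical keys 'customer_id', 'pincode', 'address'
    pincode_table: dict mapping pincode -> (latitude, longitude)
    fallback:      optional callable taking a record and returning (lat, lon) or None
    """
    geocoded, failed = [], []
    for record in records:
        # Stage 1: direct lookup by pincode.
        coords = pincode_table.get(record["pincode"])
        # Stage 2: only query the fallback when the lookup misses.
        if coords is None and fallback is not None:
            coords = fallback(record)
        if coords is None:
            failed.append(record)
        else:
            geocoded.append({**record, "lat": coords[0], "lon": coords[1]})
    return geocoded, failed


# Illustrative values only, not taken from the dataset.
pincode_table = {"382007": (23.19, 72.63)}


def stub_fallback(record):
    """Stand-in for a geocoding service call; resolves a single known address."""
    known = {"Fort, Mumbai": (18.93, 72.83)}
    return known.get(record["address"])


records = [
    {"customer_id": 1, "pincode": "382007", "address": "Gandhinagar"},
    {"customer_id": 2, "pincode": "", "address": "Fort, Mumbai"},
    {"customer_id": 3, "pincode": "", "address": "unknown"},
]
geocoded, failed = geocode_records(records, pincode_table, stub_fallback)
print(len(geocoded), len(failed))
```

Trying the local table first keeps the number of service requests small, which matters when the service is rate limited.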
We did not perform data normalization or augmentation. Our final CSV contains 158022 records in total.
Each record has 5 fields: customer id, current address latitude, current address longitude, permanent
address latitude and permanent address longitude.

2 Classification and Clustering

To classify customers as household or business customers, we used the following scheme: if the current
address and the permanent address of a customer are the same, the customer is a household customer;
otherwise, the customer is a business customer. Under this classification, there were 136652 household
customers and 21370 business customers.
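
The classification rule can be sketched as below. The report does not say how address equality was tested; comparing the geocoded coordinates is an assumption made here because the final CSV stores only coordinates. The field names and the sample values are hypothetical.

```python
from collections import Counter


def classify(record):
    """Return 'household' when current and permanent coordinates match, else 'business'.

    Assumption: two addresses count as the same when their geocoded
    latitude and longitude are identical.
    """
    same = (
        record["cur_lat"] == record["perm_lat"]
        and record["cur_lon"] == record["perm_lon"]
    )
    return "household" if same else "business"


# Made-up records using hypothetical field names.
records = [
    {"customer_id": 1, "cur_lat": 23.19, "cur_lon": 72.63, "perm_lat": 23.19, "perm_lon": 72.63},
    {"customer_id": 2, "cur_lat": 18.93, "cur_lon": 72.83, "perm_lat": 28.61, "perm_lon": 77.21},
    {"customer_id": 3, "cur_lat": 12.97, "cur_lon": 77.59, "perm_lat": 12.97, "perm_lon": 77.59},
]
counts = Counter(classify(record) for record in records)
print(dict(counts))
```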
For clustering, we used the K-means algorithm to cluster the addresses based on geographical location. We
performed this using the KMeans implementation provided by the scikit-learn package [2].
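
A minimal sketch of clustering coordinates with scikit-learn's KMeans follows. The points are synthetic (random scatter around three made-up centres), and the parameter choices (n_init=10, random_state=0) are assumptions for reproducibility, not settings taken from the report. Note that KMeans uses Euclidean distance, so treating latitude and longitude as plane coordinates is only an approximation of geographic distance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) points scattered around three made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[19.07, 72.88], [28.61, 77.21], [12.97, 77.59]])
points = np.vstack([c + rng.normal(scale=0.2, size=(50, 2)) for c in centres])

# Fit K-means and obtain one cluster label per point.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(points)

print(model.cluster_centers_)
```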

3 Scatter Plots, Heatmaps and other graphs
We used the basemap package [3] to plot our data on the world map. For each set of addresses, we first show
a scatter plot and then a heatmap of the same addresses, so that their distribution is easier to see.
Finally, we show the result of applying the clustering algorithm to the addresses.
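
A heatmap of addresses conveys how many points fall in each region of the map. The report does not state how its heatmaps were computed; the sketch below assumes simple binning of coordinates into fixed-size latitude/longitude cells, using only the standard library. The cell size and the sample coordinates are made up, and drawing the resulting grid on a map is not shown.

```python
import math
from collections import Counter


def bin_counts(coords, cell_deg=1.0):
    """Count points per grid cell of size `cell_deg` degrees.

    Each cell is keyed by the integer index of its lower-left corner.
    """
    counts = Counter()
    for lat, lon in coords:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


# Made-up coordinates: two points share a cell, one lies elsewhere.
coords = [(19.07, 72.88), (19.20, 72.97), (28.61, 77.21)]
grid = bin_counts(coords)
print(grid.most_common(1))
```

Smaller cells give a finer heatmap at the cost of sparser counts per cell.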

3.1 Current Addresses

Figure 1: Scatter plot of current addresses

Figure 2: Scatter plot of current addresses in India

Figure 3: Heatmap of current addresses

Figure 4: Heatmap of current addresses in India

3.2 Permanent Addresses

Figure 5: Scatter plot of permanent addresses

Figure 6: Scatter plot of permanent addresses in India

Figure 7: Heatmap of permanent addresses

Figure 8: Heatmap of permanent addresses in India

3.3 Household Addresses

Figure 9: Scatter plot of household addresses

Figure 10: Scatter plot of household addresses in India

Figure 11: Heatmap of household addresses

Figure 12: Heatmap of household addresses in India

3.4 Business Addresses

Figure 13: Scatter plot of business addresses

Figure 14: Scatter plot of business addresses in India

Figure 15: Heatmap of business addresses

Figure 16: Heatmap of business addresses in India

3.5 Clustering algorithm results

For current addresses, Figure 17 plots the error against the number of clusters. The error is defined as
the sum of squared distances of samples to their closest cluster center.
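
The error defined above corresponds to the inertia_ attribute of a fitted scikit-learn KMeans model, so an elbow curve can be produced by fitting one model per value of k and recording that attribute. The sketch below uses synthetic points around four made-up centres; the range of k and the parameters n_init=10 and random_state=0 are assumptions, not values from the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) points scattered around four made-up centres.
rng = np.random.default_rng(0)
centres = np.array([[19.0, 73.0], [28.5, 77.0], [13.0, 80.0], [22.5, 88.5]])
points = np.vstack([c + rng.normal(scale=0.3, size=(40, 2)) for c in centres])

# inertia_ is the sum of squared distances of samples to their closest centre.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, err in zip(ks, inertias):
    print(k, round(err, 2))
```

The elbow is the value of k after which the error stops dropping sharply; it is usually read off a plot of these values.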

Figure 17: The elbow occurs at k = 7

Here is the result with 7 clusters:

Figure 18: Clustering of current addresses with 7 clusters

Figure 19: Clustering of permanent addresses with 7 clusters

References
[1] https://github.com/arswright/data-geonames

[2] https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

[3] https://matplotlib.org/basemap/
