CS306 Data Analysis and Visualization
Project Report
Prepared by:
1 Data Cleaning and Geocoding
We first reduced the size of our data by taking the first 3 lakh (300,000) entries of yes bank.csv. We opened
and viewed this reduced file to come up with criteria for removing invalid records. We observed that
many records had the value addpin in their current pin code field. Such records had no value in any of
the fields related to the permanent address and had values like addcity and addstate in the fields related
to the current address. We therefore removed all records that had the value addpin in the current pin
code field. This was the sole criterion for our cleaning at this stage. After this step, the number of
records reduced to 161065 (almost half).
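The filtering rule described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the column name current_pincode and the three sample rows are assumptions made for the example, not the actual header or contents of the bank file.

```python
import csv
import io

# Hypothetical column name; the real header in the source CSV may differ.
PIN_COLUMN = "current_pincode"
# Placeholder string that marks an invalid record, as described in the text.
PLACEHOLDER = "addpin"


def drop_placeholder_rows(rows, pin_column=PIN_COLUMN, placeholder=PLACEHOLDER):
    """Keep only rows whose current pin code is not the placeholder string."""
    return [row for row in rows if row[pin_column].strip().lower() != placeholder]


# Tiny made-up sample standing in for the first rows of the reduced file.
sample = io.StringIO(
    "customer_id,current_pincode,current_city\n"
    "1,110001,Delhi\n"
    "2,addpin,addcity\n"
    "3,400001,Mumbai\n"
)
rows = list(csv.DictReader(sample))
cleaned = drop_placeholder_rows(rows)
print(len(rows), "->", len(cleaned))
```

Comparing after strip() and lower() is a design choice of this sketch, so that stray whitespace or capitalisation in the placeholder does not let an invalid row through.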
Next, we performed geocoding. We found a csv file on GitHub [1] that contains the pincodes of Indian
regions along with other relevant data, including the latitude and longitude of a point in the region
represented by each pincode. Using this file, we were able to geocode around 1.5 lakh records based on
their pincodes. By this we mean that both the current and the permanent address in each of these records
were geocoded, so in total around 3 lakh addresses were geocoded. Around 10,000 records remained. These
were geocoded using the geocoder API in Python; specifically, we used the geocoding service provided by
Nominatim. With this we were able to geocode around another 5000 records. In the end, around 5000 records
could not be geocoded, and we assumed these records to be invalid.
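The two-stage procedure above (pincode table first, external service second) might look roughly like the following sketch. The function names, the sample pincode table with its approximate coordinates, and the stub fallback are all hypothetical; in the report the fallback role was played by the Nominatim service, which is deliberately not called here so that the example runs offline.

```python
def geocode_address(pincode, pincode_table, fallback=None):
    """Return (lat, lon) for a pincode, or None if it cannot be geocoded."""
    key = str(pincode).strip()
    if key in pincode_table:
        # Stage 1: direct lookup in the pincode table.
        return pincode_table[key]
    if fallback is not None:
        # Stage 2: ask an external geocoding service; it may also return None.
        return fallback(key)
    return None


def geocode_record(current_pin, permanent_pin, pincode_table, fallback=None):
    """Return (cur_lat, cur_lon, perm_lat, perm_lon), or None if either address fails."""
    current = geocode_address(current_pin, pincode_table, fallback)
    permanent = geocode_address(permanent_pin, pincode_table, fallback)
    if current is None or permanent is None:
        return None  # the record is treated as invalid
    return current + permanent


# Illustrative table with approximate coordinates; not taken from the file in [1].
table = {"110001": (28.63, 77.22), "400001": (18.94, 72.84)}


def stub_service(pincode):
    """Stand-in for a real geocoding service call; knows a single pincode."""
    return {"560001": (12.98, 77.59)}.get(pincode)


print(geocode_record("110001", "400001", table))
print(geocode_record("110001", "560001", table, fallback=stub_service))
print(geocode_record("110001", "999999", table, fallback=stub_service))
```

Passing the fallback as a function keeps the table lookup testable on its own and makes it easy to swap in a real service client later.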
We did not perform data normalization or augmentation. Our final csv contains 158022 records in total.
Each record has 5 fields: customer id, current address latitude, current address longitude, permanent
address latitude and permanent address longitude.
3 Scatter Plots, Heatmaps and other graphs
We used the basemap package [3] to plot our data on the world map. We first show the scatter plot of the
addresses and then the heatmap of the same addresses, so as to have a good visualization. Finally, we show
the result of applying the clustering algorithm to the addresses.
Figure 2: Scatter plot of current addresses in India
Figure 4: Heatmap of current addresses in India
Figure 6: Scatter plot of permanent addresses in India
Figure 8: Heatmap of permanent addresses in India
Figure 10: Scatter plot of household addresses in India
Figure 12: Heatmap of household addresses in India
Figure 14: Scatter plot of business addresses in India
Figure 16: Heatmap of business addresses in India
For current addresses, here is the plot of error versus the number of clusters. The error is defined as the
sum of squared distances of samples to their closest cluster center.
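To make the error definition concrete, here is a small sketch that computes it for a fixed set of cluster centers. It assumes plain Euclidean distance on two-dimensional points (for example latitude and longitude treated as planar coordinates), and the points and centers are made up for illustration. The same quantity is what the KMeans implementation in [2] reports as its inertia_ attribute.

```python
def sum_squared_error(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    total = 0.0
    for px, py in points:
        # Squared Euclidean distance from this point to its closest center.
        total += min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
    return total


# Made-up data: two tight groups of points far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(sum_squared_error(points, one_center))
print(sum_squared_error(points, two_centers))
```

With a single center, every point is far from it and the error is large. With one center per group, the error drops sharply. Plotting this value against the number of clusters gives the kind of curve described above, and the number of clusters is usually chosen where the curve stops dropping steeply.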
Here is the result with 7 clusters:
References
[1] https://github.com/arswright/data-geonames
[2] https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[3] https://matplotlib.org/basemap/