Coursera Capstone Project Final
1) Introduction/Business Problem
Around the world, hundreds of people try every day to open small and medium-sized
businesses. Whatever city they choose, they look for the best location in order to
increase their earnings. This project aims to help future entrepreneurs choose the
best location for their businesses in New York City by providing data about each
neighborhood's characteristics and common venues.
To reach this goal, the results need to be presented in a particular structure. In this
case, we were asked to follow the standard data science methodology. I hope to do my
best throughout the project.
2) Data
To reach the goal of this project and provide information to stakeholders, I'll be using
New York data and the Foursquare API to extract competitors in the same neighborhoods.
The New York data can be found at https://geo.nyu.edu/catalog/nyu_2451_34572
2.1 Neighborhoods
The neighborhood data for New York can be extracted from the JSON file found at
https://cocl.us/new_york_dataset.
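As a rough illustration of this step, the sketch below flattens GeoJSON-style records into a pandas Data Frame. The file layout (a `features` list whose entries carry `properties.borough`, `properties.name` and `geometry.coordinates` in longitude, latitude order) is an assumption, as are the function name and the two hand-made sample records; the real file should be inspected before relying on it.

```python
import pandas as pd

def neighborhoods_to_frame(geojson):
    """Flatten GeoJSON-style features into one row per neighborhood."""
    rows = []
    for feature in geojson["features"]:
        props = feature["properties"]
        # GeoJSON stores coordinates as (longitude, latitude).
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append({
            "Borough": props["borough"],
            "Neighborhood": props["name"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return pd.DataFrame(rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])

# Hand-made sample standing in for the downloaded file (approximate, illustrative values).
sample = {
    "features": [
        {"properties": {"borough": "Bronx", "name": "Wakefield"},
         "geometry": {"coordinates": [-73.847, 40.895]}},
        {"properties": {"borough": "Manhattan", "name": "Marble Hill"},
         "geometry": {"coordinates": [-73.911, 40.877]}},
    ]
}

neighborhoods = neighborhoods_to_frame(sample)
print(neighborhoods)
```

With the real file, the dictionary passed in would come from `json.load` on the downloaded JSON rather than from the inline sample.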
Using the location data obtained previously, the venue data is retrieved by passing the
required parameters to the Foursquare API and creating another Data Frame that holds
the venue details along with their respective neighborhoods.
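The request and parsing logic might look roughly like the sketch below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its response layout, which are not named in the text and may no longer be available; the function names, the version string and the sample payload are placeholders for illustration only. No network call is made here.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; check the current Foursquare documentation before use.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the query URL for venues around one neighborhood center."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # placeholder API version date
        "ll": f"{lat},{lng}",
        "radius": radius,        # metres
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"

def parse_venues(neighborhood, payload):
    """Turn one (assumed) explore response into flat rows for a Data Frame."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0].get("name"),
        })
    return rows

url = build_explore_url(40.877, -73.911, "YOUR_ID", "YOUR_SECRET")

# Hand-made payload mimicking the assumed response shape.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 40.8771, "lng": -73.9105},
               "categories": [{"name": "Coffee Shop"}]}}
]}]}}
rows = parse_venues("Marble Hill", payload)
print(rows)
```

The rows from every neighborhood can be collected into one list and passed to `pandas.DataFrame` to obtain the venue Data Frame described above.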
3) Methodology
3.1 Folium
Folium builds on the data wrangling strengths of the Python ecosystem and the mapping
strengths of the Leaflet.js library. All cluster visualization is done with the help of
Folium, which in turn generates a Leaflet map rendered on OpenStreetMap tiles.
One-hot encoding is a process by which categorical variables are converted into binary
indicator columns, a numeric form that ML algorithms can work with. For the K-means
Clustering Algorithm, all unique items under Venue Category are one-hot encoded.
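The encoding can be sketched with `pandas.get_dummies`, as below. The sample venues are made up, and the final averaging per neighborhood is an assumption about how the encoded rows are summarised (it is a common follow-up, not something stated in the text).

```python
import pandas as pd

# Made-up venue rows: one row per venue, tagged with its neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Pizza Place"],
})

# One indicator column per unique venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Assumed summary step: share of each category within a neighborhood.
# Grouping by the external Series avoids a clash if a category is itself
# called "Neighborhood".
frequencies = onehot.groupby(venues["Neighborhood"]).mean()
print(frequencies)
```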
Because of the high variety of venues, only the top 10 most common venues are selected,
and a new Data Frame is built from them and used to train the K-means Clustering Algorithm.
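One way to pick the most common venues is sketched below. It assumes the selection is done per neighborhood from a table of category frequencies; the function name, column labels and sample numbers are hypothetical.

```python
import pandas as pd

def top_venues(frequencies, n=10):
    """Return the n most frequent venue categories for each neighborhood."""
    n = min(n, frequencies.shape[1])  # cannot rank more categories than exist
    columns = [f"Common Venue {i + 1}" for i in range(n)]
    ranked = {
        hood: row.sort_values(ascending=False, kind="stable").index[:n].tolist()
        for hood, row in frequencies.iterrows()
    }
    return pd.DataFrame.from_dict(ranked, orient="index", columns=columns)

# Made-up frequencies: rows are neighborhoods, columns are venue categories.
frequencies = pd.DataFrame(
    {"Cafe": [0.6, 0.1], "Park": [0.3, 0.2], "Pizza Place": [0.1, 0.7]},
    index=["A", "B"],
)

top = top_venues(frequencies, n=2)
print(top)
```

With real data the call would use the default `n=10`, giving ten ranked columns per neighborhood.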
3.4 K-means clustering
The K-means Clustering Algorithm is then trained on the venue data to obtain the
clusters on which the analysis is based. K-means was chosen because the number of
variables (Venue Categories) is large, and in such situations K-means is generally
computationally faster than many other clustering algorithms.
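A minimal sketch of the clustering step with scikit-learn follows. It assumes the input is a numeric table with one row per neighborhood; the sample values, the choice of two clusters and the `random_state` are illustrative and not the settings used in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up numeric features: rows are neighborhoods, columns are venue categories.
frequencies = pd.DataFrame(
    {"Cafe": [0.9, 0.8, 0.1, 0.0], "Park": [0.1, 0.2, 0.9, 1.0]},
    index=["A", "B", "C", "D"],
)

# Fit K-means; n_clusters is a tuning choice, fixed to 2 for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(frequencies)

# Attach the cluster label to each neighborhood.
clusters = pd.Series(kmeans.labels_, index=frequencies.index, name="Cluster")
print(clusters)
```

The resulting labels can be joined back to the neighborhood coordinates so that each point on the map is coloured by its cluster.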
4) Results