
Coursera Capstone Project Final

The document summarizes a Coursera capstone project aimed at helping future entrepreneurs choose the best location to open small or medium businesses in New York City. The project uses data on New York City neighborhoods and venues from Foursquare to cluster neighborhoods based on common venue types. K-means clustering is applied to venue data from the top 10 most common venue categories. The results divide neighborhoods into clusters that are visualized on a map. Cluster 4 is identified as prime for restaurants, containing 9 neighborhoods in the Bronx.

Uploaded by

Yader Carrillo

Coursera Capstone Project: Applied Data Science

Yader Rafael Carrillo Jaime


[email protected]
National Autonomous University of Nicaragua, Managua

1) Introduction/Business Problem
Around the world, hundreds of people try every day to open small and medium-sized
businesses. Whatever city they choose, they look for the location most likely to
increase their earnings. This project aims to help future entrepreneurs choose the best
location for a business in New York City by providing data about each neighborhood's
characteristics and its most common venues.

To reach this goal, the results need to be presented in a particular structure. In this
case, we were asked to follow the standard data science methodology. I hope to do my best
throughout the project.

2) Data

To reach the goal of this project and provide information to stakeholders, I use New York
neighborhood data and the Foursquare API to identify competitors in the same neighborhoods.
The New York data can be found at https://geo.nyu.edu/catalog/nyu_2451_34572

2.1 Neighborhoods
The New York neighborhood data can be extracted from the JSON file available at
https://cocl.us/new_york_dataset.
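As a rough illustration of this extraction step, the sketch below flattens a neighborhoods
JSON document into one record per neighborhood. The file layout (a "features" list with
"properties" and "geometry" entries), the function name load_neighborhoods and the sample
coordinates are assumptions made for illustration; they are not taken from the report.

```python
import json

# Hypothetical sample mirroring the layout ASSUMED for the neighborhoods file:
# a "features" list whose items carry "properties" (borough, name) and a
# "geometry" holding [longitude, latitude]. Coordinates are illustrative only.
SAMPLE = """
{"features": [
  {"properties": {"borough": "Bronx", "name": "Belmont"},
   "geometry": {"coordinates": [-73.888, 40.857]}},
  {"properties": {"borough": "Manhattan", "name": "Chinatown"},
   "geometry": {"coordinates": [-73.994, 40.716]}}
]}
"""

def load_neighborhoods(raw_json, borough=None):
    """Return one dict per neighborhood, optionally filtered by borough."""
    rows = []
    for feature in json.loads(raw_json)["features"]:
        props = feature["properties"]
        lng, lat = feature["geometry"]["coordinates"]  # longitude comes first
        if borough is None or props["borough"] == borough:
            rows.append({"borough": props["borough"],
                         "neighborhood": props["name"],
                         "latitude": lat,
                         "longitude": lng})
    return rows

# Keep only the Bronx records, as the later analysis focuses on the Bronx.
bronx = load_neighborhoods(SAMPLE, borough="Bronx")
```

The resulting list of records can be loaded into a Data Frame or used directly.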

2.2 Geopy library


I used the geopy library to get the latitude and longitude of the Bronx.
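A minimal sketch of such a lookup follows, assuming geopy's Nominatim geocoder is the one
used (the report does not say which geocoder was chosen). The helper name get_coordinates
and the user agent string are hypothetical.

```python
def get_coordinates(place, geocoder=None):
    """Return (latitude, longitude) for a place name, or None if not found."""
    if geocoder is None:
        # Imported lazily so the helper can also be tried with a stand-in geocoder.
        from geopy.geocoders import Nominatim
        # "capstone_explorer" is a hypothetical user agent string.
        geocoder = Nominatim(user_agent="capstone_explorer")
    location = geocoder.geocode(place)
    if location is None:
        return None
    return location.latitude, location.longitude

# Usage (needs network access and the geopy package):
# bronx_lat, bronx_lng = get_coordinates("Bronx, NY")
```

Passing the geocoder as a parameter keeps the network call out of the function's core logic,
which makes it easy to try with a fake geocoder.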

2.3 Venue Data

Using the location data obtained previously, the venue data is retrieved by passing the
required parameters to the Foursquare API. The results are stored in another Data Frame
that holds all the venue details along with their respective neighborhoods.
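The request and the flattening of its response might look like the sketch below. It
assumes the Foursquare v2 "explore" endpoint, its usual query parameters and its nested
response layout; the report does not state which endpoint or parameter values were used,
so the defaults and the function names here are illustrative only.

```python
from urllib.parse import urlencode

# ASSUMED endpoint: Foursquare v2 "explore".
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a request URL for venues around one neighborhood."""
    params = {"client_id": client_id,
              "client_secret": client_secret,
              "v": version,            # API version date, YYYYMMDD
              "ll": f"{lat},{lng}",    # latitude,longitude of the neighborhood
              "radius": radius,        # search radius in meters
              "limit": limit}          # maximum number of venues returned
    return f"{EXPLORE_URL}?{urlencode(params)}"

def parse_venues(neighborhood, payload):
    """Flatten one explore response (already decoded from JSON) into rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append({"neighborhood": neighborhood,
                         "venue": venue["name"],
                         "venue_latitude": venue["location"]["lat"],
                         "venue_longitude": venue["location"]["lng"],
                         "venue_category": categories[0].get("name")})
    return rows
```

Calling parse_venues once per neighborhood and concatenating the rows gives the combined
venue table described above.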

3) Methodology
3.1 Folium

Folium builds on the data wrangling strengths of the Python ecosystem and the mapping
strengths of the Leaflet.js library. All cluster visualization is done with Folium, which
generates a Leaflet map rendered on OpenStreetMap tiles.
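A cluster map of this kind could be drawn roughly as follows. This is a sketch under
assumptions: each row is a dict with "neighborhood", "latitude", "longitude" and an integer
"cluster" label, and the helper names, marker size and zoom level are illustrative choices
rather than the report's settings.

```python
import colorsys

def cluster_colors(n_clusters):
    """Return n evenly spaced hex colors, one per cluster label."""
    colors = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        colors.append("#{:02x}{:02x}{:02x}".format(
            int(r * 255), int(g * 255), int(b * 255)))
    return colors

def build_cluster_map(rows, center, n_clusters):
    """Draw one circle marker per neighborhood, colored by cluster label."""
    import folium  # imported lazily; requires the folium package
    colors = cluster_colors(n_clusters)
    cluster_map = folium.Map(location=center, zoom_start=11)
    for row in rows:
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=5,
            popup=f'{row["neighborhood"]} (cluster {row["cluster"]})',
            color=colors[row["cluster"]],
            fill=True,
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map

# Usage (hypothetical data):
# m = build_cluster_map(rows, center=[bronx_lat, bronx_lng], n_clusters=5)
# m.save("clusters.html")
```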

3.2 One hot encoding

One-hot encoding converts a categorical variable into a set of binary columns, one per
category, so that it can be used as numeric input by machine learning algorithms. For the
K-means clustering algorithm, every unique value of Venue Category is one-hot encoded.
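The encoding can be sketched in plain Python as shown below. The input format (a list of
neighborhood and category pairs) and both function names are assumptions for illustration.
Averaging the one-hot rows per neighborhood is also an assumption: it is one common way to
obtain a single numeric row per neighborhood, but the report does not describe that step.
With a Data Frame, pandas.get_dummies would typically do the encoding.

```python
def one_hot_encode(venues):
    """Turn (neighborhood, category) pairs into one-hot rows.

    Returns the sorted category list and one (neighborhood, vector) pair per
    venue, where the vector holds a single 1 in the venue's category column.
    """
    categories = sorted({category for _, category in venues})
    column = {category: i for i, category in enumerate(categories)}
    encoded = []
    for neighborhood, category in venues:
        vector = [0] * len(categories)
        vector[column[category]] = 1
        encoded.append((neighborhood, vector))
    return categories, encoded

def neighborhood_profiles(categories, encoded):
    """Average the one-hot rows per neighborhood (ASSUMED aggregation step)."""
    sums, counts = {}, {}
    for neighborhood, vector in encoded:
        total = sums.setdefault(neighborhood, [0] * len(categories))
        for i, value in enumerate(vector):
            total[i] += value
        counts[neighborhood] = counts.get(neighborhood, 0) + 1
    return {n: [v / counts[n] for v in total] for n, total in sums.items()}

# Hypothetical venue rows, not taken from the project's data.
venues = [("Belmont", "Pizza Place"), ("Belmont", "Italian Restaurant"),
          ("Belmont", "Pizza Place"), ("Van Nest", "Deli")]
categories, encoded = one_hot_encode(venues)
profiles = neighborhood_profiles(categories, encoded)
```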

3.3 Top 10 most common venues

Because the venues vary widely, only the top 10 most common venue categories are kept. A
new Data Frame is built from them and used as input to the K-means clustering algorithm.
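One way to obtain such a ranking is sketched below, assuming the top 10 is computed
separately for each neighborhood (the report does not say whether the ranking is per
neighborhood or global). The function name and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def top_venue_categories(venues, top_n=10):
    """Return each neighborhood's most common categories, most frequent first."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {neighborhood: [category for category, _ in counter.most_common(top_n)]
            for neighborhood, counter in counts.items()}

# Hypothetical venue rows, not taken from the project's data.
sample = [("Belmont", "Pizza Place"), ("Belmont", "Italian Restaurant"),
          ("Belmont", "Pizza Place"), ("Belmont", "Pizza Place"),
          ("Belmont", "Italian Restaurant"), ("Belmont", "Bakery"),
          ("Van Nest", "Deli")]
ranking = top_venue_categories(sample, top_n=2)
```

Neighborhoods with fewer than top_n distinct categories simply get a shorter list.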

3.4 K-means clustering

The K-means clustering algorithm is then fitted to the venue data to obtain the clusters
on which the analysis is based. K-means was chosen because the number of variables (Venue
Categories) is large, and in such situations K-means is typically computationally faster
than many other clustering algorithms.
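To make the algorithm concrete, here is a small dependency-free sketch of K-means (Lloyd's
algorithm). It is not the project's code: the report does not say which implementation was
used, and a library class such as scikit-learn's KMeans is the more usual choice. The
function names, the seeded random initialization and the sample points are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct input points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical neighborhood profiles: two look alike, then two others look alike.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(points, k=2)
```

Each label is the cluster index assigned to the point at the same position in the input.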

4) Results

The neighborhoods are divided into n clusters, where n is the number of clusters found
using the optimal approach. The clustered neighborhoods are visualized in different
colors so that they are easy to tell apart.

5) Discussion

After analyzing the clusters produced by the machine learning algorithm, cluster 4 is the
best fit for the problem of finding a cluster in which restaurants are a common venue.
Its nine neighborhoods (Pelham Parkway, Morris Park, Van Nest, Throgs Neck, Belmont,
North Riverdale, Pelham Bay, Edgewater Park and Bronxdale) are the best places to set up
the venture in the Bronx, New York City.
