Correlation Between A Neighborhood Real Estate Price and Its Surrounding Venues
Table of contents:
I. Introduction:
II. Data description:
III. Methodology:
1. First insight using visualization:
2. Linear Regression:
3. Principal Component Regression (PCR):
IV. Results:
V. Discussion:
VI. Conclusion:
References:
Table of Figures:
TOAN, LE T. 1
Capstone project report 2018 Data Science Specialization - IBM
I. Introduction:
This report is for the final course of the Data Science Specialization, a
nine-course series created by IBM and hosted on the Coursera platform.
The problem and the analysis approach are left for the learner to decide,
with the requirement of leveraging the Foursquare location data to
explore or compare neighborhoods or cities of the learner’s choice, or to
come up with a problem that the Foursquare location data can help solve.
The main goal is to explore the neighborhoods of New York City in
order to measure the correlation between a neighborhood’s real estate
value and its surrounding venues.
The idea comes from the process of an ordinary family finding a place to
live after moving to another city. Owners and agents commonly
advertise that their properties are close to certain kinds of venues, such
as supermarkets, restaurants or coffee shops, to show the
“convenience” of the location and raise the house’s value.
So, can the surrounding venues affect the price of a house? If so, what
types of venues have the most effect, both positively and negatively?
The target audience for this report is:
- Potential buyers, who can roughly estimate the value of a house
based on the surrounding venues and the average price.
- Real estate developers and planners, who can decide what kinds of
venues to put around their properties to maximize the selling price.
- House sellers, who can optimize their advertisements.
- And of course, this course’s instructors and learners who will grade
this project, as well as anyone who comes across it on social media,
where it shows that I can use Python data science tools.
III. Methodology:
The assumption is that real estate price depends on the surrounding
venues. Thus, regression techniques will be used to analyze the dataset.
The regressors will be the occurrences of venue types, and the
dependent variable will be the standardized average price.
The end product is a regression model, along with a list of coefficients
that describes how each venue type may be related to an increase or
decrease of a neighborhood’s average real estate price around the
mean.
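The setup described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the neighborhood names, venue types and prices are made up, "occurrences" is taken to mean a raw count per venue type (the report may use frequencies instead), and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical venue records: one row per venue found near a neighborhood.
venues = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B", "C"],
    "venue_type": ["Coffee Shop", "Coffee Shop", "Bar", "Bar", "Park", "Park"],
})

# Hypothetical average prices per neighborhood (arbitrary units).
prices = pd.Series({"A": 900.0, "B": 600.0, "C": 300.0}, name="avg_price")

# Regressors: number of occurrences of each venue type per neighborhood.
X = pd.crosstab(venues["neighborhood"], venues["venue_type"])

# Dependent variable: average price standardized to mean 0 and unit variance.
y = (prices - prices.mean()) / prices.std(ddof=0)
y = y.loc[X.index]  # align the prices with the rows of X
```

After standardization, a positive value of `y` means a neighborhood is priced above the mean and a negative value means it is priced below, which is what lets the coefficients be read as movements "around the mean".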
Python data science tools will be used to analyze the data.
The complete code can be found here:
https://github.com/lethien/coursera-ibm-ds-capstone/blob/master/Capstone_Analyze.ipynb
1. First insight using visualization:
Visualization is a natural way to get a first insight into how the
average real estate price in New York City differs between neighborhoods.
The medium chosen is a choropleth map, which uses differences in
shading or coloring to indicate the value or quantity of a property within
predefined areas. It is well suited to showing how real estate prices
differ between neighborhoods across the New York City map.
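The core of a choropleth is the step that turns each area's value into a shade. A minimal sketch of that binning step alone is given here, assuming made-up neighborhood names and prices and three hypothetical shade labels. Drawing the actual map also needs neighborhood boundary geometry and a mapping library, neither of which is shown.

```python
import pandas as pd

# Hypothetical average prices per neighborhood (arbitrary units).
avg_price = pd.Series(
    {"N1": 300.0, "N2": 450.0, "N3": 600.0, "N4": 900.0, "N5": 1500.0, "N6": 2000.0}
)

# Shade labels ordered from lowest to highest price (assumed for illustration).
shades = ["light", "medium", "dark"]

# Quantile binning: each shade covers roughly the same number of neighborhoods,
# so a few very expensive areas do not wash out the rest of the map.
shade = pd.qcut(avg_price, q=len(shades), labels=shades)
```

Quantile bins are one design choice among several. Equal-width bins (`pd.cut`) would instead keep the shade proportional to the price range, at the cost of putting most neighborhoods in the lightest shade when prices are skewed.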
The map (Figure 2) shows high prices in neighborhoods located
around Central Park, Midtown and Lower Manhattan. The price decreases
toward North Manhattan and toward Brooklyn.
Manhattan can be considered the heart of New York City. It is where
most businesses, tourist attractions and entertainment venues are located.
So, the venue types that can attract many people are expected to have the
most positive coefficients in the regression model.
Figure 2 - New York City real estate price spread between neighborhoods
2. Linear Regression:
Linear Regression was chosen because it is a simple technique, and
with the scikit-learn library, implementing the model is quick and easy.
This makes it a good starting point for the analysis.
The model will contain a list of coefficients corresponding to venue
types. The R² score (coefficient of determination) and the Mean Squared
Error (MSE) will be used to see how well the model fits the data.
The result (Figure 3) does not seem very promising. The R² score is
small, which means the model explains little of the variance in price and
may not be suitable for the data.
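The fitting and scoring steps can be sketched with scikit-learn as follows. The data here is synthetic and the venue type names are placeholders, so the scores say nothing about the real dataset. The scores are also computed on the training data for brevity, whereas a held-out split would give a more honest estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 neighborhoods, counts for 3 venue types.
venue_types = ["Coffee Shop", "Bar", "Park"]
X = rng.poisson(lam=3.0, size=(50, 3)).astype(float)

# Synthetic price that depends on the first two venue types plus noise,
# then standardized to mean 0 and unit variance.
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=50)
y = (y - y.mean()) / y.std()

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

r2 = r2_score(y, y_pred)             # share of variance explained
mse = mean_squared_error(y, y_pred)  # average squared prediction error

# One coefficient per venue type, sorted from most negative to most positive.
coefficients = sorted(zip(venue_types, model.coef_), key=lambda pair: pair[1])
```

Because the target is standardized to unit variance, the in-sample MSE equals one minus the in-sample R², so the two scores carry the same information in this particular setup.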
IV. Results:
Even though the scores seem to improve after applying a more
sophisticated method, the model is still not suitable for the dataset. Thus,
it can’t be used to precisely predict a neighborhood’s average price.
Possible explanations for the poor model are:
- Real estate prices are hard to predict.
- The data is incomplete (small sample size, missing deciding factors).
- The machine learning techniques were chosen or applied poorly.
But again, on the bright side, the insight gained from observing the
analysis results seems consistent and logical: business venues that
serve the needs of most ordinary people are usually situated in pricey
neighborhoods.
V. Discussion:
The real challenge is constructing the dataset:
- Usually the needed data isn’t publicly available.
- When combining data from multiple sources, inconsistencies can
occur, and a lot of effort is required to check, research and
change the data before merging.
- For data obtained through API calls, different results are returned for
different sets of parameters and at different points in time. Multiple
trial-and-error runs are required to get the optimal result.
- Even after the dataset has been constructed, a lot of research and
analysis is required to decide whether the data should be kept as is or
be transformed by normalization or standardization.
Constructing the dataset can be considered the most important step in
the whole data science pipeline, as it has the greatest effect on the result.
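The choice between normalization and standardization mentioned in the list above can be illustrated with a small sketch. The prices are made up, and "normalization" is taken here to mean min-max scaling, which is one common reading of the term.

```python
import numpy as np

# Hypothetical average prices for five neighborhoods (arbitrary units).
prices = np.array([300.0, 450.0, 600.0, 900.0, 2000.0])

# Min-max normalization: rescales the values into the range [0, 1].
normalized = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization (z-score): rescales to mean 0 and standard deviation 1.
standardized = (prices - prices.mean()) / prices.std()
```

Min-max scaling is driven entirely by the two extreme values, so a single very expensive neighborhood compresses all the others toward zero. Standardization keeps the sign of each value meaningful as "above or below the mean".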
On the other hand, choosing a suitable technique to construct the
model is also worthwhile. As this report shows, applying a different
method can improve the result.
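The second method named in the table of contents is Principal Component Regression (PCR). A minimal sketch of how it could be compared against plain linear regression with scikit-learn follows. The data is synthetic, and the number of components and the 5-fold cross-validation are assumptions made for illustration, not the report's actual settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 neighborhoods and 20 correlated venue-type
# features that are all driven by 2 hidden factors plus noise.
latent = rng.normal(size=(60, 2))
X = latent @ rng.normal(size=(2, 20)) + rng.normal(scale=0.3, size=(60, 20))
y = latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(scale=0.3, size=60)

# Plain linear regression on all 20 features.
plain = LinearRegression()

# PCR: scale the features, project onto the top principal components,
# then run linear regression on those components.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())

# Mean out-of-sample R² over 5 folds for each model.
plain_r2 = cross_val_score(plain, X, y, cv=5, scoring="r2").mean()
pcr_r2 = cross_val_score(pcr, X, y, cv=5, scoring="r2").mean()
```

PCR tends to help when there are many correlated regressors and few samples, which is plausible for venue-type counts. One cost is that the regression coefficients then apply to components rather than to individual venue types, so they must be mapped back through the PCA loadings before being interpreted.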
VI. Conclusion:
Unfortunately, the analysis could not produce a precise model or
show a strong coefficient for any venue type. But we
can still get some meaningful and logical insights from the result.
Doing this project helps learners practice every topic in the
specialization, and thus equips them with Data Science methodology and
tools using Python libraries. Doing a real project also helps one learn
much more outside the curriculum, and realize what to research further
after completing the program. And as this report shows,
there are surely a lot of things to dig into.
Some notes on the analysis result:
- This project was done by a web developer who had been self-studying
Data Science for only 4 months. So please take it with a grain of
salt.
- The coefficients only show correlation, not causation. So, if your
neighborhood’s average price is low, please don’t go destroying the
surrounding bars and food trucks. There might be another reason.
To whoever went through this project, many thanks for your
time and patience.
References:
Table of Figures: