
2018 International Conference on Machine Learning and Data Engineering (iCMLDE)

Housing Price Prediction using Machine Learning


Algorithms: The Case of Melbourne City, Australia
The Danh Phan
Macquarie University
Sydney, Australia
[email protected]

Abstract—House price forecasting is an important topic of real estate. The literature attempts to derive useful knowledge from historical data of property markets. Machine learning techniques are applied to analyze historical property transactions in Australia to discover useful models for house buyers and sellers. The experiments reveal a high discrepancy between house prices in the most expensive and most affordable suburbs in the city of Melbourne. Moreover, they demonstrate that the combination of Stepwise and Support Vector Machine, evaluated on mean squared error, is a competitive approach.

Keywords—House price prediction, Regression Trees, Neural Network, Support Vector Machine, Stepwise, Principal Component Analysis

I. INTRODUCTION

Buying a house is undoubtedly one of the most important decisions in a person's life. The price of a house may depend on a wide variety of factors, ranging from the house's location and features to the demand and supply in the real estate market. The housing market is also a crucial element of the national economy. Therefore, forecasting housing values is beneficial not only for buyers, but also for real estate agents and economic professionals.

Studies on housing market forecasting investigate house values, growth trends, and their relationships with various factors. The improvement of machine learning techniques and the proliferation of available data have paved the way for real estate studies in recent years. A variety of research leverages statistical learning methods to investigate the housing market. In these studies, the most frequently investigated locations are the United States [1], [2], [3], [4], [5], [6]; Europe [7], [8], [9]; China [10], [11], [12], [13]; and Taiwan [14], [15]. However, research applying data analytics with machine learning algorithms to the Australian housing market is rare, or elusive to find.

The goal of this study is to analyze a real historical transactional dataset to derive valuable insight into the housing market of Melbourne. It seeks useful models to predict the value of a house given a set of its characteristics. Effective models could allow home buyers or real estate agents to make better decisions. Moreover, they could benefit the projection of future house prices and the policymaking process for the real estate market.

The study follows the cross-industry standard process for data mining, known as CRISP-DM [16], a commonly used approach for tackling data analytics problems. The rest of the paper is organized as follows: Section 2 reviews previous work on housing market forecasting applying different machine learning techniques. Section 3 explains the dataset and how it is transformed into cleaned data. Section 4 proposes various machine learning methodologies. Model implementation and evaluation are discussed in Section 5, and the conclusion is drawn in Section 6.

II. RELATED WORK

Previous studies on the real estate market using machine learning approaches can be categorized into two groups: trend forecasting of house price indices, and house price valuation. The literature review indicates that studies in the former category predominate.

In house price growth forecasting, researchers try to find optimal solutions to predict the movement of the housing market using historical growth rates or house price indices, which are often calculated from a national average house price [3], [5], [6], or the median house price [4]. [1] contends that house growth forecasting could act as a leading indicator for policymakers to assess the overall economy. Factors that affect house price growth tend to be macroeconomic features such as income, credit, interest rates, and construction costs. In these papers, the Vector Autoregression (VAR) model was commonly applied in earlier periods [10], [11], while Dynamic Model Averaging (DMA) has become more popular in recent years [3], [13], [14].

On the other hand, house price valuation studies focus on the estimation of house values [2], [12], [15]. These studies seek useful models to predict the house price given its characteristics, such as location, land size, and the number of rooms. The Support Vector Machine (SVM) and its combinations with other techniques have been commonly adopted for house value prediction. For instance, [12] integrates SVM with a Genetic Algorithm to improve accuracy, while [15] combines SVM and Stepwise to effectively project house prices. Furthermore, other methods such as Neural Networks and Partial Least Squares (PLS) are also employed for house value prediction [2].

It is underscored that Neural Networks (NN) and SVM have recently been applied in a wide variety of applications across numerous industries. Neural Networks have been further developed into deep networks, or Deep Learning methods. Besides, advances in SVM have been achieved by integrating it with other algorithms.

978-1-7281-0404-1/19/$31.00 ©2019 IEEE


DOI 10.1109/iCMLDE.2018.00017
For example, Principal Component Analysis (PCA) is combined with SVM to address prediction issues in different industries such as information security [17], the stock market [18], and industrial processes [19]. In addition, the combination of Stepwise and SVM is widely used for credit scoring [20], fault detection in semiconductor production [21], and dimension reduction of high-dimensional datasets [22]. Both NN and SVM methods will be implemented and discussed in this paper.

III. DATA PREPARATION AND EXPLORATION

1. Original Data

The data used in this study is the Melbourne Housing Market dataset downloaded from the Kaggle website [23]. The original dataset has 34,857 observations and 21 variables. Each observation represents a real house sale transaction in the city of Melbourne from 2016 to 2018. These variables can be categorized into three groups:

• Transactional variables, including Price, Date, Method, Seller, and Property count.
• Location-related predictors, which contain Address, Suburb, Distance to CBD, Postcode, Building Area, Council Area, Region name, Longitude, and Latitude.
• Other house features, such as House Type, Number of Bedrooms, Number of Bathrooms, Number of Car spots, and Land size.

The outcome of house value prediction is the price, which is a continuous value; the predictors consist of the other features, with both numeric and categorical types.

2. Data preparation

Before applying models for house price prediction, the dataset needs to be pre-processed. The investigation of missing data is performed first. Several missing-data patterns are assessed rigorously, since they play an important role in deciding suitable methods for handling missing data [24].

Columns with more than 55% of values missing are removed from the original dataset, since it is difficult to impute these missing values with an acceptable level of accuracy. In addition, there are many rows with missing values of the outcome variable (Price). Since the imputation of these values could increase bias in the input data, observations with missing values in the Price column are deleted.

Imputation is then performed for other predictors with a small portion of missing values. Longitude and Latitude, for instance, are imputed from house addresses using the Google Maps Application Programming Interface (API). Another example is the imputation of Land size values using the median value grouped by house type and suburb.

Furthermore, outliers are also discovered and addressed. An outlier is defined as an observation which appears to be inconsistent with the remainder of the dataset [25]. Outliers may stem from factors such as human error, the underlying probability model, or even structured situations [25]. For instance, land sizes of less than 10 square meters are removed.

As a result, the cleaned data, which is used to build and evaluate models, has 11 variables (TABLE I) and more than 20 thousand observations.

TABLE I. FEATURES DESCRIPTION

Name            Type         Description
Price           Numerical    House price (prediction outcome)
Year            Numerical    Sold year: 2016-2018
Property count  Numerical    Number of properties
Distance        Numerical    Distance to CBD
Longitude       Numerical    House's longitude
Latitude        Numerical    House's latitude
Rooms           Numerical    Number of bedrooms
Bathrooms       Numerical    Number of bathrooms
Car             Numerical    Number of car spots
Land size       Numerical    House's land size
Type            Categorical  House's type: u-unit, h-house, t-townhouse

3. Descriptive exploration

This section presents only the most important findings. The data summary and other informative figures are provided in the Appendix.

Descriptive analysis indicates that a median house has three bedrooms, one bathroom, and a land size above 500 square meters. Its median price is roughly 900 thousand dollars. Figs. 1 and 2 show the histograms of Price and log(Price). While the range of Price values varies widely with a long tail, log(Price) appears to follow a normal distribution. Thus, log(Price) will be used as the output in the model building and evaluation phases.

Fig. 1. Histogram of Price

Fig. 2. Histogram of log(Price)
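The preparation steps above were carried out on the full Kaggle dataset; as a rough illustration only, the same pipeline can be sketched in Python/pandas on a fabricated toy frame (column names simplified and values invented, not taken from the actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the Kaggle Melbourne dataset.
df = pd.DataFrame({
    "Suburb":       ["Kooyong", "Kooyong", "Melton", "Melton", "Melton"],
    "Type":         ["h", "h", "h", "u", "h"],
    "Price":        [2.9e6, np.nan, 4.1e5, 3.8e5, 4.3e5],
    "Landsize":     [650.0, 600.0, np.nan, 5.0, 580.0],
    "BuildingArea": [np.nan, np.nan, np.nan, np.nan, 120.0],
})

# 1. Drop columns with more than 55% missing values.
df = df.loc[:, df.isna().mean() <= 0.55]

# 2. Drop rows where the outcome (Price) is missing rather than imputing it.
df = df.dropna(subset=["Price"])

# 3. Impute Landsize with the median of its (Type, Suburb) group.
df["Landsize"] = df.groupby(["Type", "Suburb"])["Landsize"].transform(
    lambda s: s.fillna(s.median()))

# 4. Remove outlier land sizes below 10 square meters.
df = df[df["Landsize"] >= 10]

# 5. Model the log of price, which is closer to normally distributed.
df["LogPrice"] = np.log(df["Price"])
```

The order matters: rows lacking the outcome are dropped before group medians are computed, so the imputed land sizes never lean on observations that will not be modeled.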

The suburbs with the most expensive and cheapest median house prices are shown in Figs. 3 and 4, respectively.

Fig. 3. Suburbs with most expensive houses

Kooyong, in Fig. 3, is the suburb with the highest median price at nearly three million dollars, and one may spend around two million dollars to buy a house in the other expensive suburbs.

Fig. 4. Suburbs with cheapest houses

On the other hand, in the cheap suburbs of Fig. 4, such as Kurunjang, Melton, and Melton South, one can own a property for about 400 thousand dollars. Other affordable suburbs have median house prices below 500 thousand dollars. Hence, the difference in median house prices between low- and high-price suburbs is significant, varying from around four- to six-fold.

IV. METHODOLOGIES

Data reduction and transformation

In order to improve interpretability and enhance the performance of prediction models, data reduction techniques such as Stepwise and Boosting are exploited to derive the most important predictors. Moreover, PCA, a data transformation technique, is applied to extract significant components for integration with SVM.

Model selection and evaluation

The paper implements different regression models to find the useful ones. The attribute subset from Stepwise is used as input to Linear Regression, Polynomial Regression, Regression Trees, as well as Neural Networks and SVM. In addition, SVM is also integrated with PCA to compare the accuracy of that integration against Stepwise.

Linear regression is used as the baseline for model evaluation, which is based on the Mean Squared Error (MSE) measured on an evaluation dataset. MSE is the most popular tool to measure the quality of fit [26]. It is calculated as:

MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²

In this formulation, n is the number of observations, while f̂(x_i) is the prediction for the ith observation.

Before fitting data into models, the cleaned dataset is divided into training and evaluation data. The evaluation set is kept isolated from model building and only used for model evaluation. The model fitting process uses the training data with ten-fold cross-validation. It is noted that cross-validation is applied in both the data reduction and model construction stages.

The next subsections introduce several important machine learning techniques utilized in this study.

1. Stepwise

Stepwise is a commonly used method for subset selection. It is an improvement over Best Subset Selection [26], which fits a least squares regression for each of the 2^p possible models of p predictors. In this study, we use forward stepwise selection [26], which only involves fitting 1 + p(p+1)/2 models. Fig. 5 indicates the importance level of the predictors for the outcome log(Price).

Fig. 5. Feature importance scores

The five most important variables related to the outcome variable are derived from cross-validation results. These predictors comprise Rooms, Distance, Latitude, Longitude, and Type of house. Interestingly, land size contributes an insignificant portion of house prices. There is also an insignificant influence of the number of car spots and the year when the house was sold. Moreover, the Boosting method produces a similar feature importance list, which confirms the reliability of these five predictors. The detailed importance scores of the predictors from Boosting are shown in TABLE II.

TABLE II. FEATURE IMPORTANCE SCORES IN BOOSTING

Predictor   Importance
Type        27.3600
Latitude    20.7183
Distance    17.8677
Rooms       15.0577
Longitude   14.3742
Bathrooms    4.5816
Land size    0.0405
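The stepwise search itself was run in R; a minimal sketch of forward stepwise selection scored by ten-fold cross-validated MSE, on synthetic data with purely illustrative feature names, could look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for the predictors; only the first two carry signal.
X = rng.normal(size=(n, 4))
names = ["Rooms", "Distance", "Latitude", "Longitude"]
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=n)

selected, remaining = [], list(range(X.shape[1]))
best_mse = np.inf
while remaining:
    # Try adding each remaining predictor; keep the one that lowers CV MSE most.
    scores = {}
    for j in remaining:
        cols = selected + [j]
        cv = cross_val_score(LinearRegression(), X[:, cols], y,
                             scoring="neg_mean_squared_error", cv=10)
        scores[j] = -cv.mean()
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best_mse:
        break  # no candidate improves the cross-validated MSE
    best_mse = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print([names[j] for j in selected])
```

Each pass greedily adds the predictor that most lowers the cross-validated MSE and stops once nothing improves it, which is what keeps the search to at most 1 + p(p+1)/2 model fits instead of 2^p.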

2. Principal component analysis

Principal component analysis (PCA), an unsupervised approach, can be utilized for data reduction. PCA allows us to create a low-dimensional representation of the data that captures as much of the feature variation as possible [26]. It can assist in improving SVM performance.

After applying PCA to the training data using cross-validation, the first six components are extracted for further analysis, since they account for nearly 80% of all predictors' variance. Fig. 6 shows the scree plot of the cumulative proportion of explained variance against the number of principal components.

Fig. 6. The cumulative proportion of variance

3. Polynomial Regression

Polynomial regression is a standard extension of linear regression [26]. It extends the simple linear regression model:

y = β_0 + β_1 x + ε

to a polynomial of degree d:

y = β_0 + β_1 x + β_2 x² + … + β_d x^d + ε

The degree d in polynomial regression is usually less than five, since a polynomial of larger degree tends to be over-flexible [26]. Therefore, polynomial regression models are fitted using cross-validation with the degree d varying from one to five. Fig. 7 indicates that three is the optimal degree for polynomial regression.

Fig. 7. The plot of polynomial degree and MSE

4. Regression Trees

Decision Trees are a widely known methodology for classification; Regression Trees, which are used for continuous outcome prediction, are a special case of Decision Trees. Each leaf contains a prediction value, which is the mean price of all observations in that leaf. The selection of a feature as a node in a Regression Tree is based on minimizing the Residual Sum of Squares (RSS) [26]:

RSS = Σ_{j=1}^{J} Σ_{i∈R_j} (y_i − ŷ_{R_j})²

where ŷ_{R_j} is the mean outcome of the training observations in region R_j.

The tree induction is implemented with ten-fold cross-validation to obtain the minimum RSS, with a tree size of twelve. The tree is then pruned to derive an optimal tree, as shown in Fig. 8.

Fig. 8. A pruned regression tree

5. Neural Network

The Neural Network is a methodology that has been widely deployed in many real-world systems. The idea of a neural network is a connected network of nodes, or units, with associated weights and biases [27]. These units are confined to different layers. A neural network normally has one input layer, one output layer, and one or more hidden layers. Complexity arises when the number of hidden layers and/or the number of units in each layer increases.

The network learns by adjusting the weights to reduce the prediction error [27]. Initially, all weights and biases are allocated randomly. The algorithm then runs iteratively, and each iteration comprises two steps: forward feeding and backpropagation.

In the forward feeding phase, the output of each unit is calculated from the outputs of the nodes in the previous layer, as depicted in Fig. 9. The prediction of the output layer is then compared to the observed outcome to derive the learning rate and errors.
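The forward-feeding step can be sketched in a few lines of NumPy; the two-hidden-layer architecture below is illustrative (the layer sizes and logistic activation are assumptions, not the configuration actually tuned in the paper):

```python
import numpy as np

def forward(x, layers):
    """One forward pass: each unit takes a weighted sum of the previous
    layer's outputs plus a bias, then applies its activation."""
    a = x
    *hidden, output = layers
    for W, b in hidden:
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))  # logistic hidden units
    W, b = output
    return W @ a + b  # linear output unit for a continuous outcome

rng = np.random.default_rng(0)
# A 5-input network with two hidden layers (8 and 4 units) and one output;
# weights and biases start random, as in the paper's description of training.
sizes = [5, 8, 4, 1]
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

y_hat = forward(rng.normal(size=5), layers)
```

Backpropagation would then push the output error back through these same weights to update them; in practice a library implementation handles both passes.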

Fig. 9. A neural network unit [27]

In backpropagation, given the learning rate and errors, the network recalculates the weights and biases in the hidden layers, making appropriate changes to reduce the prediction error.

In this research, different neural networks are tested with one to three hidden layers. The results demonstrate that the neural network with two hidden layers, shown in Fig. 10, has the smallest Mean Squared Error.

Fig. 10. A 2-hidden-layer neural network.

6. Support Vector Machine

The Support Vector Machine (SVM) is a powerful technique for supervised learning. The SVM algorithm transforms the original data into a high-dimensional space to seek a hyperplane for data segregation [27]. The hyperplane is established by "essential training tuples", which are called support vectors. In comparison with other models, SVM tends to deliver better accuracy due to its ability to fit nonlinear boundaries [27].

SVM models are implemented with two different sets of input variables. The first stems from the five most important features of the Stepwise subset selection. The second input set is the six components from the PCA transformation.

There are four basic kernels in SVM: linear, polynomial, radial basis function (RBF), and sigmoid. The RBF kernel is selected, since the number of variables is not large and RBF is deemed suitable for regression problems [28].

The selection of the other parameters is at first arbitrary, with a Cost of 10 and a Gamma of 0.1. Tuning functions are then run to obtain the best parameters. TABLE III and TABLE IV show the details of these parameters.

TABLE III. SVM WITH STEPWISE PARAMETERS

Parameters                      SVM     Tuned SVM
Cost                            10      1
Gamma                           0.1     1
The number of support vectors   12842   12599

TABLE IV. SVM WITH PCA PARAMETERS

Parameters                      SVM     Tuned SVM
Cost                            10      1
Gamma                           0.1     1
The number of support vectors   13402   13032

V. RESULTS

The experiments were run in the R language on a Windows system. The Mean Squared Error (MSE) on both the training and evaluation datasets is presented in TABLE V. As discussed previously, linear regression acts as the baseline for model comparison. The evaluation ratio of each model is its evaluation MSE divided by the evaluation MSE of linear regression. The smaller the evaluation ratio, the higher the accuracy of the model's prediction.

TABLE V. PREDICTION RESULTS

Model                   Train MSE   Eval. MSE   Eval. Ratio
Linear regression       0.0948      0.0994      1.00
Polynomial regression   0.0773      0.0832      0.84
Regression tree         0.0925      0.0985      0.99
Neural Network          0.2657      0.2749      2.77
Stepwise & SVM          0.0558      0.0615      0.62
Stepwise & tuned SVM    0.0480      0.0561      0.56
PCA & SVM               0.0721      0.0810      0.82
PCA & tuned SVM         0.0474      0.0728      0.74

It can be seen from TABLE V that the Regression tree delivers a prediction result as good as linear regression, while Polynomial regression achieves lower, acceptable errors. Furthermore, the Neural Network does not seem to work effectively with this dataset; this may not represent the efficacy of modern deep learning methods.

In addition, PCA with tuned SVM delivers relatively high accuracy. However, there is an over-fitting issue in the PCA with tuned SVM case, since its evaluation MSE increases significantly compared with its training MSE. The combination of Stepwise and tuned SVM, which produces the lowest error on this dataset, is the most competitive model.
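The SVM experiments above were run in R; the following Python sketch reproduces the shape of the PCA-plus-RBF-SVM setup, with a grid search over Cost and Gamma standing in for the R tuning step (the data is synthetic and the grid values, sample sizes, and component count are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
# Synthetic correlated features standing in for the cleaned predictors.
latent = rng.normal(size=(400, 4))
X = latent @ rng.normal(size=(4, 8)) + 0.1 * rng.normal(size=(400, 8))
y = np.sin(latent[:, 0]) + 0.5 * latent[:, 1] + 0.05 * rng.normal(size=400)

# Keep an isolated evaluation set, as in the paper's protocol.
X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, test_size=0.25, random_state=0)

# PCA components feeding an RBF-kernel SVR; ten-fold CV over Cost and Gamma.
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=6)),
                 ("svm", SVR(kernel="rbf"))])
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]},
                    scoring="neg_mean_squared_error", cv=10)
grid.fit(X_tr, y_tr)

eval_mse = mean_squared_error(y_ev, grid.predict(X_ev))
```

Fitting the scaler and PCA inside the pipeline ensures the transformation is re-estimated on each cross-validation fold, so the held-out evaluation MSE is not contaminated by the tuning.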

Regarding model performance, as the complexity of the models increases, the model fitting time also goes up. While Linear and Polynomial regression deliver results instantly, other models can take considerably longer, as indicated in TABLE VI.

TABLE VI. FITTING MODEL RUNTIME

Model                   Time (min.)
Regression tree         0.033
Neural Network          0.033
Stepwise & SVM          1.583
Stepwise & tuned SVM    1.400
PCA & SVM               2.317
PCA & tuned SVM         2.733

In comparison with SVM, the Regression tree and Neural Network are relatively fast. Therefore, there is a trade-off between a model's runtime and its prediction accuracy. It is also underlined that PCA with SVM requires more training time than Stepwise with SVM. Thus, in this case, Stepwise seems more efficient than PCA when combined with SVM.

In terms of interpretability, it is easy to explain the prediction results of simple models such as Linear regression, Polynomial regression, and the Decision tree. For instance, we can obtain the coefficients of the related features in the Polynomial function, while using a decision tree for explanation is straightforward. In contrast, it is more difficult to interpret the prediction outcome of the Neural Network and SVM. These models run like "black boxes", and we do not know the relationship between the predictors and the price prediction.

For further investigation, it is suggested to deploy two models, Stepwise-SVM and Polynomial regression, to predict observations with no outcome values. Polynomial regression can act as a new baseline for comparing prediction results. This implementation should be rigorously tested on historical datasets from different cities in Australia. The results could help to improve the performance and accuracy of these models.

VI. CONCLUSION

In summary, this paper seeks useful models for house price prediction. It also provides insights into the Melbourne housing market. First, the original data is prepared and transformed into a cleaned dataset ready for analysis. Data reduction and transformation are then applied using the Stepwise and PCA techniques. Different methods are then implemented and evaluated to achieve an optimal solution. The evaluation phase indicates that the combination of Stepwise and SVM is a competitive approach. Therefore, it could be used for further deployment. This research can also be applied to transactional datasets of the housing market from different locations across Australia.

REFERENCES

[1] Gupta, R., Kabundi, A., & Miller, S. M. (2011). Forecasting the US real house price index: Structural and non-structural models with and without fundamentals. Economic Modelling, 28(4), 2013-2021.
[2] Mu, J., Wu, F., & Zhang, A. (2014). Housing value forecasting based on machine learning methods. Abstract and Applied Analysis, 2014(2014), p. 7.
[3] Bork, L., & Moller, S. (2015). Forecasting house prices in the 50 states using Dynamic Model Averaging and Dynamic Model Selection. International Journal of Forecasting, 31(1), pp. 63-78.
[4] Balcilar, M., Gupta, R., & Miller, S. M. (2015). The out-of-sample forecasting performance of nonlinear models of regional housing prices in the US. Applied Economics, 47(22), 2259-2277.
[5] Park, B., & Bae, J. K. (2015). Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Systems with Applications, 42(6), 2928-2934.
[6] Plakandaras, V., Gupta, R., Gogas, P., & Papadimitriou, T. (2015). Forecasting the US real house price index. Economic Modelling, 45, 259-267.
[7] Ng, A., & Deisenroth, M. (2015). Machine learning for a London housing price prediction mobile application. Technical Report, June 2015, Imperial College, London, UK.
[8] Rahal, C. (2015). House price forecasts with factor combinations (No. 15-05).
[9] Risse, M., & Kern, M. (2016). Forecasting house-price growth in the Euro area with dynamic model averaging. North American Journal of Economics and Finance, 38, pp. 70-85.
[10] Jie, T. J. Z. (2005). What pushes up the house price: Evidence from Shanghai [J]. World Economy, 5, 005.
[11] Changrong, X. K. M. Y. D. (2010). Volatility clustering and short-term forecast of China house price [J]. Chinese Journal of Management, 6, 024.
[12] Gu, J., Zhu, M., & Jiang, L. (2011). Housing price forecasting based on genetic algorithm and support vector machine. Expert Systems with Applications, 38(4), pp. 3383-3386.
[13] Wei, Y., & Cao, Y. (2017). Forecasting house prices using dynamic model averaging approach: Evidence from China. Economic Modelling, 61, pp. 147-155.
[14] Chen, P. F., Chien, M. S., & Lee, C. C. (2011). Dynamic modeling of regional house price diffusion in Taiwan. Journal of Housing Economics, 20(4), 315-332.
[15] Chen, J.-H., et al. (2017). Forecasting spatial dynamics of the housing market using Support Vector Machine. International Journal of Strategic Property Management, 21(3), pp. 273-283.
[16] Wirth, R., & Hipp, J. (2000, April). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining (pp. 29-39).
[17] Ahmad, I., Hussain, M., Alghamdi, A., & Alelaiwi, A. (2014). Enhancing SVM performance in intrusion detection using optimal feature subset selection based on genetic principal components. Neural Computing and Applications, 24(7-8), 1671-1682.
[18] Yu, H., Chen, R., & Zhang, G. (2014). A SVM stock selection model within PCA. Procedia Computer Science, 31, 406-412.
[19] Jing, C., & Hou, J. (2015). SVM and PCA based fault classification approaches for complicated industrial process. Neurocomputing, 167, 636-642.
[20] Yao, P. (2009, June). Feature selection based on SVM for credit scoring. In Computational Intelligence and Natural Computing, 2009. CINC'09. International Conference on (Vol. 2, pp. 44-47). IEEE.
[21] An, D., Ko, H. H., Gulambar, T., Kim, J., Baek, J. G., & Kim, S. S. (2009, November). A semiconductor yields prediction using stepwise support vector machine. In Assembly and Manufacturing, 2009. ISAM 2009. IEEE International Symposium on (pp. 130-136). IEEE.
[22] Chou, E. P., & Ko, T. W. (2017). Dimension reduction of high-dimensional datasets based on stepwise SVM. arXiv preprint arXiv:1711.03346.

[23] Pino, A. (2018). Melbourne Housing Market data. Kaggle. https://www.kaggle.com/anthonypino/melbourne-housing-market.
[24] Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing
data (Vol. 333). John Wiley & Sons.
[25] Barnett, V., & Lewis, T. (1974). Outliers in statistical data. Wiley.
[26] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
[27] Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.
[28] Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to
support vector classification.

APPENDIX
Additional informative figures from the data preparation and descriptive exploration processes.

Fig. 11. Data summary of numeric predictors in data preparation

Fig. 12. Median House Price by Suburbs in the City of Melbourne


Fig. 13. Numeric variable correlation in data preparation

Fig. 14. Missing data patterns in data preparation
