Abstract—House price forecasting is an important topic in real estate. The literature attempts to derive useful knowledge from historical data of property markets. In this study, machine learning techniques are applied to analyze historical property transactions in Australia and to discover useful models for house buyers and sellers. The analysis reveals a high discrepancy between house prices in the most expensive and most affordable suburbs in the city of Melbourne. Moreover, experiments demonstrate that the combination of Stepwise and Support Vector Machine, based on mean squared error measurement, is a competitive approach.

Keywords—House price prediction, Regression Trees, Neural Network, Support Vector Machine, Stepwise, Principal Component Analysis

I. INTRODUCTION

Buying a house is undoubtedly one of the most important decisions in one's life. The price of a house may depend on a wide variety of factors, ranging from the house's location and its features to the demand and supply in the real estate market. The housing market is also a crucial element of the national economy. Therefore, forecasting housing values is beneficial not only for buyers, but also for real estate agents and economic professionals.

Studies on housing market forecasting investigate house values, growth trends, and their relationships with various factors. The improvement of machine learning techniques and the proliferation of available data, or big data, have paved the way for real estate studies in recent years. A variety of research leverages statistical learning methods to investigate the housing market. In these studies, the most frequently investigated locations are the United States [1], [2], [3], [4], [5], [6]; Europe [7], [8], [9]; China [10], [11], [12], [13]; and Taiwan [14], [15]. However, research that applies data analytics with machine learning algorithms to the Australian housing market is rare, or elusive to find.

The goal of this study is to analyze a real historical transactional dataset to derive valuable insights into the housing market of the city of Melbourne. It seeks useful models to predict the value of a house given a set of its characteristics. Effective models could allow home buyers or real estate agents to make better decisions. Moreover, they could benefit the projection of future house prices and the policymaking process for the real estate market.

The study follows the cross-industry standard process for data mining, known as CRISP-DM [16]. This is a commonly used approach for tackling data analytics problems. The remaining parts of the paper are organized as follows: Section 2 reviews previous work on housing market forecasting applying different machine learning techniques. Section 3 explains the dataset and how it is transformed into cleaned data. In Section 4, various machine learning methodologies are proposed. Model implementation and evaluation are discussed in Section 5, and the conclusion is drawn in Section 6.

II. RELATED WORK

Previous studies on the real estate market using machine learning approaches can be categorized into two groups: trend forecasting of house price indices, and house price valuation. The literature review indicates that studies in the former category are predominant.

In house growth forecasting, researchers try to find optimal solutions to predict the movement of the housing market using historical growth rates or house price indices, which are often calculated from a national average house price [3], [5], [6], or the median house price [4]. [1] contends that house growth forecasting could act as a leading indicator for policymakers to assess the overall economy. Factors that affect house price growth tend to be macroeconomic features such as income, credit, interest rates, and construction costs. In these papers, the Vector Autoregression (VAR) model was commonly applied in earlier periods [10], [11], while Dynamic Model Averaging (DMA) has become more popular in recent years [3], [13], [14].

On the other hand, house price valuation studies focus on the estimation of house values [2], [12], [15]. These studies seek useful models to predict the house price given its characteristics, such as location, land size, and the number of rooms. Support Vector Machine (SVM) and its combination with other techniques have been commonly adopted for house value prediction. For instance, [12] integrates SVM with a Genetic Algorithm to improve accuracy, while [15] combines SVM and Stepwise to effectively project house prices. Furthermore, other methods such as Neural Network and Partial Least Squares (PLS) are also employed for house value prediction [2].

It is underscored that Neural Network (NN) and SVM have recently been applied in a wide variety of applications across numerous industries. Neural Networks have been further developed into deep networks, or Deep Learning methods. Besides, advances in SVM are achieved by integrating it with other algorithms. For example, Principal Component Analysis and Stepwise feature selection have been integrated with SVM in a variety of applications [17], [18], [19], [20], [21], [22].
III. THE DATASET

The study analyzes historical property transactions in Melbourne [23]. The variables can be grouped as follows:

• Transactional variables, which include Price, Date, Method, Seller, and Property count.
• Location-related predictors, which contain Address, Suburb, Distance to CBD, Postcode, Building Area, Council Area, Region name, Longitude, and Latitude.
• Other house features, such as House Type, Number of Bedrooms, Number of Bathrooms, Number of Car spots, and Land size.

The outcome of house value prediction is the price, which is a continuous value, and the predictors consist of the other features, with both numeric and categorical types. The table below describes a subset of these variables.

Variable     Type         Description
Car          Numerical    Number of car spots
Land size    Numerical    House's land size
Type         Categorical  House's type: u - unit, h - house, t - townhouse
2. Data preparation

Before applying models for house price prediction, the dataset needs to be pre-processed. The investigation of missing data is performed first. Several missing-data patterns are assessed rigorously, since they play an important role in deciding suitable methods for handling missing data [24].

Columns with more than 55% of values missing are removed from the original dataset, since it is difficult to impute these missing values with an acceptable level of accuracy. In addition, there are many rows with missing values of the outcome variable (Price). Since the imputation of these values could increase bias in the input data, observations with missing values in the Price column are deleted.

Imputation is then performed for other predictors with a small portion of missing values. Longitude and Latitude, for instance, are imputed from house addresses using a Google Maps Application Programming Interface (API). Another example is the imputation of Land size values using median values grouped by house type and suburb.

Furthermore, outliers are also discovered and addressed. An outlier is defined as an observation which appears to be inconsistent with the remainder of the dataset [25]. Outliers may stem from factors such as human errors, relationships with probability models, and even structured situations [25]. For instance, land sizes of less than 10 square meters are removed. A minimal sketch of these cleaning steps is given below.
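The following R sketch illustrates the cleaning steps described above. It is a minimal sketch, not the paper's exact implementation: the file name and the column names (Price, Landsize, Type, Suburb) are assumptions based on the public Melbourne housing dataset [23], and the exact columns dropped may differ.

# Minimal data-cleaning sketch (assumed file and column names)
raw <- read.csv("Melbourne_housing.csv", stringsAsFactors = TRUE)

# 1. Drop columns with more than 55% missing values
missing_rate <- colMeans(is.na(raw))
clean <- raw[, missing_rate <= 0.55]

# 2. Remove rows where the outcome (Price) is missing
clean <- clean[!is.na(clean$Price), ]

# 3. Impute Landsize with the median within each Type-Suburb group
grp_median <- ave(clean$Landsize, clean$Type, clean$Suburb,
                  FUN = function(x) median(x, na.rm = TRUE))
clean$Landsize[is.na(clean$Landsize)] <- grp_median[is.na(clean$Landsize)]

# 4. Remove outliers: land sizes of less than 10 square meters
clean <- subset(clean, Landsize >= 10)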
3. Descriptive exploration

This section only presents the most important findings. The data summary and other informative figures are provided in the Appendix.

Descriptive analysis indicates that the median house has three bedrooms, one bathroom, and a land size above 500 square meters. Its median price is roughly 900 thousand dollars. Figs. 1 and 2 show the histograms of Price and log(Price). While the range of Price values varies widely with a long tail, log(Price) appears to follow a normal distribution. Thus, log(Price) will be used as the output in the model building and evaluation phases.

Fig. 1. Histogram of Price

Fig. 2. Histogram of log(Price)

The lists of suburbs with the most expensive and cheapest median house prices are shown in Figs. 3 and 4, respectively.

Fig. 3. Suburbs with the most expensive houses

IV. METHODOLOGY

Models are evaluated using the Mean Squared Error (MSE):

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2

In this formulation, n is the number of observations, while \hat{f}(x_i) is the prediction for the ith observation.
Before fitting the data into models, the cleaned dataset is divided into training and evaluation sets. The evaluation set is kept isolated from model building and is only used for model evaluation. The model fitting process utilizes the training data with ten-fold cross-validation. It is noted that cross-validation is applied in both the data reduction and model construction stages. A sketch of this split and of the MSE computation follows.
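The R sketch below shows how such a split and the MSE could be computed. The 80/20 split ratio and the seed are assumptions, as the paper does not state them; `clean` is the cleaned data frame from the preparation step.

# Log-transform the target and hold out an evaluation set
set.seed(1)
clean$LogPrice <- log(clean$Price)   # log(Price) is the modeling target (Fig. 2)
clean$Price <- NULL

n <- nrow(clean)
train_idx <- sample(seq_len(n), size = round(0.8 * n))  # 80/20 is assumed
train    <- clean[train_idx, ]
eval_set <- clean[-train_idx, ]      # isolated; used only for final evaluation

# MSE = (1/n) * sum((y_i - f_hat(x_i))^2)
mse <- function(y, y_hat) mean((y - y_hat)^2)

# Linear regression baseline (factor levels assumed consistent across the split)
fit_lm <- lm(LogPrice ~ ., data = train)
mse(eval_set$LogPrice, predict(fit_lm, newdata = eval_set))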
The next subsections introduce several important machine learning techniques utilized in this study.

2. Principal component analysis

Principal component analysis (PCA), an unsupervised approach, can be utilized for data reduction. PCA allows us to create a low-dimensional representation of the data that captures as much of the feature variation as possible [26]. It can assist in improving SVM performance.

After applying PCA to the training data using cross-validation, the first six components are extracted for further analysis, since they account for nearly 80% of the variance of all predictors. Fig. 6 shows the scree plot of the cumulative proportion of explained variance against the number of principal components.
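A minimal sketch of this step with base R's prcomp; `train_x` and `eval_x` are assumed data frames holding the numeric predictors of the training and evaluation sets, and the six-component cutoff follows the ~80% variance criterion stated above.

# PCA on the standardized numeric predictors
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Cumulative proportion of explained variance (the scree plot of Fig. 6)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b",
     xlab = "Number of principal components",
     ylab = "Cumulative proportion of explained variance")

# The first six components explain nearly 80% of the variance in this study
n_comp    <- which(cum_var >= 0.80)[1]
train_pcs <- as.data.frame(pca$x[, seq_len(n_comp)])
eval_pcs  <- as.data.frame(predict(pca, newdata = eval_x)[, seq_len(n_comp)])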
3. Polynomial Regression

Polynomial regression is a standard extension of linear regression [26]. It moves from a simple linear regression model,

y_i = \beta_0 + \beta_1 x_i + \epsilon_i

to a polynomial formulation of degree d:

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i

The degree d in polynomial regression is often less than five, since as the degree becomes larger, the polynomial model tends to be over-flexible [26]. Therefore, polynomial regression models are implemented using cross-validation with the degree d varying from one to five. Fig. 7 indicates that three is the optimal degree for polynomial regression.
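A sketch of this degree selection using ten-fold cross-validation via boot::cv.glm; the single predictor `Distance` is an assumed placeholder, since the paper does not list the variables entering the polynomial model.

library(boot)   # cv.glm for k-fold cross-validation

# Ten-fold CV error for polynomial degrees 1..5
cv_errors <- sapply(1:5, function(d) {
  fit <- glm(LogPrice ~ poly(Distance, d), data = train)  # Distance is a placeholder
  cv.glm(train, fit, K = 10)$delta[1]
})
best_d <- which.min(cv_errors)  # the paper reports an optimal degree of 3 (Fig. 7)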
4. Regression Trees

Decision Trees are a widely known methodology for classification, and Regression Trees, which are used for continuous outcome prediction, are a special case of Decision Trees. Each leaf contains the prediction value, which is the mean of the prices of all observations in that leaf. The selection of a feature as a node in a Regression Tree is based on the goal of minimizing the Residual Sum of Squares (RSS) [26]:

\mathrm{RSS} = \sum_{j=1}^{J}\sum_{i \in R_j}\bigl(y_i - \hat{y}_{R_j}\bigr)^2

where R_1, ..., R_J are the leaf regions and \hat{y}_{R_j} is the mean outcome of the training observations in R_j.

The tree induction is implemented with ten-fold cross-validation to get the minimum RSS with a tree size of twelve. The tree is then pruned to derive an optimal tree, as shown in Fig. 8.

Fig. 8. A pruned regression tree
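A minimal rpart sketch of growing and pruning such a tree; the formula and control settings are assumptions, not the paper's exact configuration.

library(rpart)

# Grow a regression tree; method = "anova" splits by minimizing RSS,
# and xval = 10 requests ten-fold cross-validation.
tree_fit <- rpart(LogPrice ~ ., data = train, method = "anova",
                  control = rpart.control(xval = 10))

# Prune at the complexity parameter with the lowest cross-validated error
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree_fit, cp = best_cp)

mse(eval_set$LogPrice, predict(pruned, newdata = eval_set))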
5. Neural Network

Neural Networks are a methodology that has been widely deployed in many real-world systems. The idea of a neural network is a connected network of nodes, or units, with associated weights and biases [27]. These units are organized into layers: a neural network normally has one input layer, one output layer, and one or more hidden layers. The complexity arises when the number of hidden layers and/or the number of units in each layer increases.

The network learns by adjusting the weights to reduce the prediction error [27]. Initially, all weights and biases are assigned randomly. The algorithm then runs iteratively, and each iteration comprises two steps: forward feeding and backpropagation.

In the forward feeding phase, the output of each unit is calculated from the outputs of the nodes in the previous layer, as depicted in Fig. 9. The prediction of the output layer is then compared to the observed outcome to derive the learning rate and errors.

In backpropagation, given the learning rate and errors, the weights and biases in the hidden layers are recalculated and adjusted appropriately to reduce prediction errors.

In this research, different neural networks are tested with one to three hidden layers. Results demonstrate that the neural network with two hidden layers, shown in Fig. 10, has the smallest Mean Squared Error.

Fig. 10. A 2-hidden-layer neural network.
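A sketch using the neuralnet package; the min-max scaling, the numeric-only data frame `train_num`, and the hidden-layer sizes c(8, 4) are assumptions, since the paper reports only the number of hidden layers.

library(neuralnet)

# Min-max scale the numeric columns (common NN preprocessing; the paper
# does not state its scaling). train_num is an assumed numeric-only data
# frame that includes LogPrice.
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
train_s <- as.data.frame(lapply(train_num, scale01))

# Two hidden layers, as in Fig. 10; the layer sizes c(8, 4) are illustrative
predictors <- setdiff(names(train_s), "LogPrice")
fml <- as.formula(paste("LogPrice ~", paste(predictors, collapse = " + ")))
nn_fit <- neuralnet(fml, data = train_s, hidden = c(8, 4), linear.output = TRUE)

# Predictions are on the scaled target; map back before computing MSE
pred_scaled <- compute(nn_fit, train_s[, predictors])$net.result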
6. Support Vector Machine

Support Vector Machine (SVM) is a powerful technique for supervised learning. The SVM algorithm transforms the original data into a higher dimension to seek a hyperplane for data segregation [27]. The hyperplane is established by "essential training tuples", which are called support vectors. In comparison with other models, SVM tends to deliver better accuracy due to its ability to fit nonlinear boundaries [27].

SVM models are implemented with two different sets of input variables. The first stems from the five most important features of Stepwise subset selection. The second input set is the six components from the PCA transformation.

There are four basic kernels in SVM: linear, polynomial, radial basis function (RBF), and sigmoid. The RBF kernel is selected since the number of variables is not large, and RBF is deemed suitable for regression problems [28].

The selection of the other parameters is at first arbitrary, with a Cost of 10 and a Gamma of 0.1. Tuning functions are then performed to get the best parameters. TABLE III and TABLE IV show the detailed information of these parameters.

TABLE III. SVM WITH STEPWISE PARAMETERS

Parameter                   Initial   Tuned
Cost                        10        1
Gamma                       0.1       1
Number of support vectors   12842     12599

TABLE IV. SVM WITH PCA PARAMETERS

Parameter                   Initial   Tuned
Gamma                       0.1       1
Number of support vectors   13402     13032
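A sketch with the e1071 package showing the initial parameters and the tuning step; `train_sw` and `eval_sw` are assumed data frames holding LogPrice plus the five Stepwise-selected features, which the paper does not list.

library(e1071)

# RBF-kernel SVM with the initial, arbitrary parameters (Cost = 10, Gamma = 0.1)
svm_fit <- svm(LogPrice ~ ., data = train_sw, kernel = "radial",
               cost = 10, gamma = 0.1)

# Grid search with built-in cross-validation to tune cost and gamma
tuned <- tune(svm, LogPrice ~ ., data = train_sw, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
best_svm <- tuned$best.model   # TABLE III reports tuned cost = 1, gamma = 1

mse(eval_sw$LogPrice, predict(best_svm, newdata = eval_sw))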
V. RESULTS

The experiments have been deployed in the R language on a Windows system. The Mean Squared Error (MSE) of both the training and evaluation datasets is presented in TABLE V. As in the previous discussion, linear regression acts as the baseline for model comparison. The evaluation ratio of each model is equal to its evaluation MSE divided by the evaluation MSE of linear regression. The smaller the evaluation ratio, the higher the accuracy of the model's prediction.

TABLE V. MSE AND EVALUATION RATIO OF MODELS

Model                   Train MSE   Evaluation MSE   Evaluation ratio
Polynomial regression   0.0773      0.0832           0.84
Regression tree         0.0925      0.0985           0.99
Neural Network          0.2657      0.2749           2.77
Stepwise & SVM          0.0558      0.0615           0.62
Stepwise & tuned SVM    0.0480      0.0561           0.56
PCA & SVM               0.0721      0.0810           0.82
PCA & tuned SVM         0.0474      0.0728           0.74

It can be seen from TABLE V that the Regression tree delivers a prediction result about as good as linear regression, while Polynomial regression results in lower errors, which is acceptable. Furthermore, the Neural Network does not seem to work effectively with this dataset. This may not represent the efficacy of modern deep learning methods.

In addition, PCA with tuned SVM delivers relatively high accuracy. However, there is an over-fitting issue in the PCA and tuned SVM case, since its evaluation MSE increases significantly compared with its train MSE. The combination of Stepwise and tuned SVM, which produces the lowest error on this dataset, is the most competitive model.

Regarding the models' performance, when the complexity of the models increases, the model fitting time also goes up. While Linear and Polynomial regression deliver results instantly, other models can take considerably longer, as indicated in TABLE VI.

TABLE VI. FITTING MODEL RUNTIME

Model                   Time (min.)
Regression tree         0.033
Neural Network          0.033
Stepwise & SVM          1.583
Stepwise & tuned SVM    1.400
PCA & SVM               2.317
PCA & tuned SVM         2.733

In comparison with SVM, the Regression tree and Neural Network are relatively fast. Therefore, there is a trade-off between a model's runtime and its prediction accuracy. It is also underlined that PCA with SVM spends more training time than Stepwise with SVM. Thus, in this case, Stepwise seems more efficient than PCA when combined with SVM.

In terms of interpretability, it is easy to explain the prediction results of simple models such as Linear regression, Polynomial regression, and Decision trees. For instance, we can obtain the coefficients of the related features in the polynomial function, and using a decision tree for explanation is straightforward. In contrast, it is more difficult to interpret the prediction outcome of a Neural Network or SVM. These models run like "black boxes", and we do not know the relationship between the predictors and the price prediction.

For further investigation, it is suggested to deploy two models, Stepwise-SVM and Polynomial regression, to predict observations with no outcome values. Polynomial regression can act as a new baseline for comparing prediction results. This implementation should be rigorously tested on historical datasets from different cities in Australia. The results could help improve the performance and accuracy of these models.
VI. CONCLUSION

In summary, this paper seeks useful models for house price prediction. It also provides insights into the Melbourne housing market. Firstly, the original data is prepared and transformed into a cleaned dataset ready for analysis. Data reduction and transformation are then applied using the Stepwise and PCA techniques. Different methods are then implemented and evaluated to achieve an optimal solution. The evaluation phase indicates that the combination of the Stepwise and SVM model is a competitive approach. Therefore, it could be used for further deployment. This research can also be applied to transactional datasets of the housing market from different locations across Australia.

References

[1] Gupta, R., Kabundi, A., & Miller, S. M. (2011). Forecasting the US real house price index: Structural and non-structural models with and without fundamentals. Economic Modelling, 28(4), 2013-2021.
[2] Mu, J., Wu, F., & Zhang, A. (2014). Housing value forecasting based on machine learning methods. Abstract and Applied Analysis, 2014(2014), p. 7.
[3] Bork, L., & Moller, S. (2015). Forecasting house prices in the 50 states using Dynamic Model Averaging and Dynamic Model Selection. International Journal of Forecasting, 31(1), 63-78.
[4] Balcilar, M., Gupta, R., & Miller, S. M. (2015). The out-of-sample forecasting performance of nonlinear models of regional housing prices in the US. Applied Economics, 47(22), 2259-2277.
[5] Park, B., & Bae, J. K. (2015). Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data. Expert Systems with Applications, 42(6), 2928-2934.
[6] Plakandaras, V., Gupta, R., Gogas, P., & Papadimitriou, T. (2015). Forecasting the US real house price index. Economic Modelling, 45, 259-267.
[7] Ng, A., & Deisenroth, M. (2015). Machine learning for a London housing price prediction mobile application. Technical Report, June 2015, Imperial College, London, UK.
[8] Rahal, C. (2015). House price forecasts with factor combinations (No. 15-05).
[9] Risse, M., & Kern, M. (2016). Forecasting house-price growth in the Euro area with dynamic model averaging. North American Journal of Economics and Finance, 38, 70-85.
[10] Jie, T. J. Z. (2005). What pushes up the house price: Evidence from Shanghai. World Economy, 5, 005.
[11] Changrong, X. K. M. Y. D. (2010). Volatility clustering and short-term forecast of China house price. Chinese Journal of Management, 6, 024.
[12] Gu, J., Zhu, M., & Jiang, L. (2011). Housing price forecasting based on genetic algorithm and support vector machine. Expert Systems with Applications, 38(4), 3383-3386.
[13] Wei, Y., & Cao, Y. (2017). Forecasting house prices using dynamic model averaging approach: Evidence from China. Economic Modelling, 61, 147-155.
[14] Chen, P. F., Chien, M. S., & Lee, C. C. (2011). Dynamic modeling of regional house price diffusion in Taiwan. Journal of Housing Economics, 20(4), 315-332.
[15] Chen, J.-H., et al. (2017). Forecasting spatial dynamics of the housing market using Support Vector Machine. International Journal of Strategic Property Management, 21(3), 273-283.
[16] Wirth, R., & Hipp, J. (2000, April). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining (pp. 29-39).
[17] Ahmad, I., Hussain, M., Alghamdi, A., & Alelaiwi, A. (2014). Enhancing SVM performance in intrusion detection using optimal feature subset selection based on genetic principal components. Neural Computing and Applications, 24(7-8), 1671-1682.
[18] Yu, H., Chen, R., & Zhang, G. (2014). A SVM stock selection model within PCA. Procedia Computer Science, 31, 406-412.
[19] Jing, C., & Hou, J. (2015). SVM and PCA based fault classification approaches for complicated industrial process. Neurocomputing, 167, 636-642.
[20] Yao, P. (2009, June). Feature selection based on SVM for credit scoring. In Computational Intelligence and Natural Computing, 2009. CINC'09. International Conference on (Vol. 2, pp. 44-47). IEEE.
[21] An, D., Ko, H. H., Gulambar, T., Kim, J., Baek, J. G., & Kim, S. S. (2009, November). A semiconductor yields prediction using stepwise support vector machine. In Assembly and Manufacturing, 2009. ISAM 2009. IEEE International Symposium on (pp. 130-136). IEEE.
[22] Chou, E. P., & Ko, T. W. (2017). Dimension reduction of high-dimensional datasets based on stepwise SVM. arXiv preprint arXiv:1711.03346.
[23] Pino, A. (2018). Melbourne Housing Market data. Kaggle. https://www.kaggle.com/anthonypino/melbourne-housing-market.
[24] Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). John Wiley & Sons.
[25] Barnett, V., & Lewis, T. (1974). Outliers in statistical data. Wiley.
[26] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
[27] Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Elsevier.
[28] Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification.
APPENDIX
Additional informative figures from the data preparation and descriptive exploration processes.
Fig. 13. Numeric variable correlation in data preparation

Fig. 14. Missing data patterns in data preparation (proportion of missing values and missing-pattern combinations across variables such as YearBuilt, Car, Rooms, Bedroom2, Postcode, Longtitude, Suburb, Method, and Date)