house-price-prediction-using-machine-learning-and-artificial-intelligence.
All content following this page was uploaded by Fatbardha Maloku on 12 August 2024.
Master of Science in Business Analytics Student Candidates, Ageno School of Business, Golden Gate University, San Francisco, California 94105, USA
ABSTRACT
The escalating annual rise in housing prices introduces volatility and uncertainty into the real estate market, underscoring the critical need for accurate
price forecasting systems. Predicting house prices accurately remains challenging due to the multitude of influencing factors. This study aims to identify
and analyze key determinants affecting house prices, employing two established machine learning models. Through comparative analysis, the research will
recommend the most effective model for enhancing the accuracy of house price predictions.
*Corresponding author
Fatbardha Maloku, Master of Science in Business Analytics Student Candidates, Ageno School of Business, Golden Gate University, San Francisco, California
94105, USA.
Received: July 04, 2024; Accepted: July 08, 2024; Published: August 12, 2024
• SalePrice: The sale price of the houses
The data set has a total of 30 columns and 1460 rows. The vast majority of the data set's columns are classified as explanatory
variables. The column labeled "Sales Price" will serve as the target (response) variable of the analysis. We will examine how these
explanatory variables or factors affect the prediction of a house's sales price during this analysis. To learn more about the statistical
values of our dataset, we can use Python functions and methods. We used the describe function to obtain the count, mean, standard
deviation, min, and max of our data set, as well as the summary statistics of the columns in the house price data collection.
Through exploratory data analysis we discover that the maximum sales price for a house is $755,000, the minimum is $34,900, and
the average price of a house is approximately $163,000. There is a total of 30 features or variables, 8 of which are 'object', 20 of
which are 'int64', and 2 of which are 'float64'. According to the documentation of the competition, the variable 'SalePrice', which
has a data type of 'int64', is the target or label that we are going to predict.
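The summary step described above can be sketched in a few lines of pandas. This is a minimal illustration, not the study's own code: the four-row table and its values are made up, and only the column name SalePrice comes from the text.

```python
import pandas as pd

# Tiny stand-in for the house price data; every value here is invented for illustration.
houses = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "NoRidge"],
    "LotArea": [8450, 9600, 11250, 14260],
    "OverallCond": [5, 8, 5, 5],
    "SalePrice": [208500, 181500, 223500, 250000],
})

# Number of columns per data type (the text reports object, int64 and float64 columns).
dtype_counts = houses.dtypes.astype(str).value_counts()

# describe() returns count, mean, std, min, quartiles and max for a numeric column.
price_summary = houses["SalePrice"].describe()

print(dtype_counts)
print(price_summary[["count", "mean", "std", "min", "max"]])
```

Calling describe() on the whole DataFrame instead of a single column would produce the same statistics for every numeric column at once.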
We learned during the analysis that the neighborhood is a key variable in the dataset. The neighborhood visualization shows that the
cost of houses varies depending on the neighborhood within the same city, with a diverse distribution of house values among
neighborhoods.
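One way to put numbers behind such a neighborhood comparison is to group the sale prices by neighborhood. The sketch below assumes a column named Neighborhood and uses invented prices; it only illustrates the grouping idea, not the study's figures.

```python
import pandas as pd

# Invented prices for three neighborhoods (illustration only).
houses = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NoRidge", "NoRidge", "OldTown", "OldTown"],
    "SalePrice": [140000, 150000, 320000, 340000, 110000, 130000],
})

# Minimum, median and maximum sale price per neighborhood, most expensive first.
price_by_neighborhood = (
    houses.groupby("Neighborhood")["SalePrice"]
    .agg(["min", "median", "max"])
    .sort_values("median", ascending=False)
)

print(price_by_neighborhood)
```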
What time of year is ideal for selling a home? According to an analysis by the real estate research company ATTOM Data Solutions,
late spring and early summer are the best times of the year to sell a home [3]. Our house price dataset goes a bit more in-depth and
shows home sales over a period of four years, from 2006 through 2010. From the graph, we can see an increase in the Sales Price in
2007 and a sharp decrease in the following year, 2008. The year 2007 had the highest sales prices in the market, whereas in the
following year prices shrank by approximately $8,000.
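A year-over-year comparison of this kind can be computed by averaging the sale price per year and taking the difference between consecutive years. The column name YrSold and all prices below are assumptions chosen to mirror the pattern described, not values from the dataset.

```python
import pandas as pd

# Invented sales records; the numbers only mimic a rise in 2007 and a drop in 2008.
sales = pd.DataFrame({
    "YrSold": [2006, 2006, 2007, 2007, 2008, 2008],
    "SalePrice": [180000, 184000, 185000, 187000, 176000, 180000],
})

# Average sale price for each year of sale.
mean_by_year = sales.groupby("YrSold")["SalePrice"].mean()

# Change in the average price relative to the previous year (first year has no predecessor).
yearly_change = mean_by_year.diff()

print(mean_by_year)
print(yearly_change)
```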
The descriptive analysis becomes clearer with distribution graphics of the explanatory variables in relation to the Sales Price target
variable. In this way, we can visualize how our data is distributed before predicting any values. To do that, we created a function
that goes through the explanatory columns in the house price dataset and visualizes the results.
The next set of visuals demonstrates the distribution of the MSSubClass, Utilities, and BldgType variables. From the chart, we can
see the distribution of our data across the different categories. The MSSubClass with the highest values is class 60. The most common
utilities category is houses that include all public utilities (E, G, W, & S). The building type is distributed approximately the same
between single-family houses and townhouse end units.
When it comes to the general state of the houses, the sales price tends to be higher when the condition of the house is rated 9 or
above. The exterior condition of the property is crucial: when it receives top ratings and presents a better impression, the sales
price of the home tends to be greater. Conversely, when the exterior condition of the home is poor, the sales price is lower. The
general condition of the basement is another factor that plays a large role in the overall sales price of the house. When the
condition of the basement is good, the price of the house tends to be higher than for houses where the condition of the basement is
poor.
The heating type is also important. Houses with a gas forced warm air furnace tend to be considerably more expensive than those with
a floor furnace or gravity furnace. Houses with a central air conditioning system installed tend to be priced higher than those
without one. Houses with more fireplaces are also more expensive.
The garage and the number of parking spaces are two other factors that consumers consider when purchasing a home. We learned
through the analysis that homes with three garages are the most popular among buyers. Additionally, homes with 555 square feet of
pool space are more expensive. The month in which the residences were sold should not be overlooked either. As we can see, the
distribution is quite consistent across the months. We noticed a slight difference in the rate of price increases in September.
A box plot can be used to display the distribution of the sales price for each sale type. The graphic shows that the distribution
changes between the various categories. The category “New” covers houses that have just been constructed, and its sales prices
appear to be distributed higher than those of the other sale types.
The density plot of the Sales Price variable shows that the target variable has a right-skewed distribution and is not symmetric,
which would have an impact on the model.
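Right skew can be quantified with the sample skewness, which is positive when the long tail lies to the right. The sketch below uses synthetic log-normal prices, not the study's data. It also shows a log transform, a common remedy for skewed targets that is an assumption here: the text does not say such a transform was applied.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: third central moment divided by the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / values.std() ** 3)

# Synthetic right-skewed prices drawn from a log-normal distribution (illustration only).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skew_raw = sample_skewness(prices)            # clearly positive for right-skewed data
skew_log = sample_skewness(np.log1p(prices))  # close to zero after the log transform

print(f"skewness of raw prices: {skew_raw:.2f}")
print(f"skewness of log prices: {skew_log:.2f}")
```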
Predictive Analysis
Predictive analytics, a subset of artificial intelligence, is a statistics-based technique that data analysts use to formulate hypotheses
and analyze past data in order to estimate the chance of a specific future result. By combining machine learning with historical data
such as trends and behaviors, predictive analytics enhances processes. Both predictive analytics and machine learning are used to
make predictions on a set of future data. In our research, we used machine learning models to predict the likely house price based on
the different influential factors discussed in the section above.
Before building our machine learning models, we started by running a quick correlation test between variables to identify the most
influential variables, those with a significant effect on house prices. As mentioned earlier, we considered all the non-economic
factors to see which of them influence house prices the most.
Correlation Analysis
Correlation analysis is a test to examine the relationship between the variables in a dataset. With the correlation heatmap and correlation
matrix, one can observe the relationship between the explanatory and the response variables. For our study, we used a correlation
heatmap to examine the relationship between the variables.
As we can observe from the heatmap, most of the variables have a positive correlation with the sales price, which is also referred to
as the house price in our study. Variables such as lot area, house condition, basement condition, and garage area have a significant
relationship with the house prices. We take all of these variables forward to build our machine learning predictive models.
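The numbers behind a correlation heatmap come from a correlation matrix, from which the column for the target can be read off and sorted. The sketch below uses synthetic data in which the price is deliberately built from two of the features; the column names and coefficients are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features; SalePrice is constructed from LotArea and GarageArea plus noise.
rng = np.random.default_rng(42)
n = 200
lot_area = rng.normal(10000, 2000, n)
garage_area = rng.normal(500, 100, n)

df = pd.DataFrame({
    "LotArea": lot_area,
    "GarageArea": garage_area,
    "Unrelated": rng.normal(0, 1, n),  # has no influence on the price by construction
    "SalePrice": 50000 + 10 * lot_area + 100 * garage_area + rng.normal(0, 10000, n),
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = df.corr()

# Correlation of each explanatory variable with the target, strongest first.
corr_with_price = corr_matrix["SalePrice"].drop("SalePrice").sort_values(ascending=False)

print(corr_with_price)
```

The full matrix in corr_matrix is what a heatmap would display, with one colored cell per pair of variables.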
We first defined our x and y. We removed the garage year built and lot frontage, as these are already covered by the year the house
was built and the square footage of the entire house. We took the sales price as y and the other independent variables as x. We then
split the data into a training set and a test set with proportions of 0.80 and 0.20.
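The preparation and split described above might look as follows with pandas and scikit-learn. This is a sketch under assumptions: the column names GarageYrBlt and LotFrontage, the random data, and the random_state value are illustrative choices, not details given in the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data with 100 rows (illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, 100),
    "YearBuilt": rng.integers(1950, 2010, 100),
    "GarageYrBlt": rng.integers(1950, 2010, 100),
    "LotFrontage": rng.normal(70, 10, 100),
    "SalePrice": rng.integers(100000, 300000, 100),
})

# y is the sale price; x is every other column except the two dropped variables.
x = df.drop(columns=["SalePrice", "GarageYrBlt", "LotFrontage"])
y = df["SalePrice"]

# 80% of the rows go to training, 20% to testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

print(x_train.shape, x_test.shape)
```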
After deriving the training and test sets, we trained our linear regression model and measured its accuracy on both sets. According
to the results, the training set has an accuracy score of 79.44%, whereas the test set has an accuracy score of 58.89%, which is
acceptable but notably lower than the training score.
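For a regression model, an "accuracy score" of this kind is most plausibly the coefficient of determination (R²), which is what the score method of scikit-learn regressors returns. That reading is an assumption, since the text does not name the metric. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the price depends linearly on two of three features, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))
y = 200000 + 30000 * x[:, 0] + 15000 * x[:, 1] + rng.normal(0, 20000, 300)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# Fit on the training rows only, then score on both sets.
linear_model = LinearRegression().fit(x_train, y_train)
train_r2 = linear_model.score(x_train, y_train)  # R^2 on data the model has seen
test_r2 = linear_model.score(x_test, y_test)     # R^2 on held-out data

print(f"training R^2: {train_r2:.4f}")
print(f"test R^2:     {test_r2:.4f}")
```

A large gap between the two scores suggests that the model does not generalize well beyond the training rows.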
We used the same split to build and train a random forest regressor model. According to this model's results, the training set has an
accuracy score of 97.32% and the test set has an accuracy score of 82.29%.
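The comparison between the two models can be reproduced in outline by fitting both on the same split and scoring them the same way. The sketch below uses synthetic data with a deliberately non-linear relationship, where a forest would be expected to do better. The n_estimators and random_state values are illustrative assumptions, and the score is again taken to be R².

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, non-linear relationship between two features and the price (illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(400, 2))
y = 100000 * np.sin(x[:, 0]) + 50000 * x[:, 1] ** 2 + rng.normal(0, 10000, 400)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# Fit both models on the same training rows.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)
linear = LinearRegression().fit(x_train, y_train)

# Score each model on the training set and on the held-out test set.
scores = {
    "random forest": (forest.score(x_train, y_train), forest.score(x_test, y_test)),
    "linear regression": (linear.score(x_train, y_train), linear.score(x_test, y_test)),
}

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: training R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A forest of fully grown trees typically scores higher on its own training data than on unseen data, so the test score is the more informative of the two when comparing models.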
As we can observe, the random forest regressor model clearly performs better than the linear regression model. A higher accuracy
score indicates that the predicted outcome is more accurate. Therefore, the random forest regressor model is the clear winner in this
case. Further, this trained random forest regressor model can be applied to other datasets that consist of similar variables.

Model Results
The analysis provided us with a collection of new information. Initially, the data contained both categorical and continuous
features, and the target feature had a continuous numeric value. The data types of the feature values are a mix of int, float, and
object. Numerous columns had a significant number of missing values. Most continuous feature variables have outliers, which we
dealt with during data pre-processing. Based on the heatmap and plots, some features are closely associated with other features. In
our analysis, we also noticed the diverse distribution of house values among neighborhoods. When it comes to the general state of
the houses, the sales price tends to be higher when the condition of the house is rated 9 or above.

The exterior condition of the property is crucial. We learned through the analysis that homes with three garages are the most
popular among buyers. We noticed a slight difference in the rate of price increases in September. The sales price target variable
had a right-skewed distribution, which affects the machine learning model.

The machine learning models used in this paper are the linear regression model and the random forest model. We considered all the
variables when training our models, except for garage year built and lot frontage, as they are already covered by the house build
year and the size of the house. We proceeded with all other variables and trained our models. Both models achieved a reasonable
accuracy score. However, the Random Forest Regressor model had a higher accuracy score, with 97.3% for the training set and 82.3%
for the testing set, in comparison with the linear regression model, with only 79.4% and 58.9% for training and testing, respectively.
Therefore, we can conclude that the Random Forest Regression model is the best model for our study and can be used to predict house
prices with higher accuracy, given that it considers the same variables.

Conclusion
For many years, house prices have been interpreted through numerical values that contain various information about the houses. The
availability of statistical data has increased the amount of scientific research based on housing data. One of the most frequently
studied topics in this research is the prediction of house prices, and our study is one example. The house price prediction model
uses machine learning algorithms and models. We built two popular machine learning models and trained them on the dataset. We
measured their accuracy scores on both the training and testing sets. Based on the accuracy scores, we strongly recommend a random
forest regression model for better prediction of house prices.

The model will be able to predict house prices more accurately, so that buyers and sellers are not left guessing. This will be useful
to both buyers and sellers as well as real estate agents and companies.

Recommendations
As mentioned above, the Random Forest Regression model has the highest accuracy score on both the training and test sets. To
improve the prediction performance obtained in the study, both the study dataset and the methods used can be enriched. Many other
variables affect house prices but are not taken into account. Variables such as location, room size, and bathroom size, which are
ignored in most scientific studies, can also directly affect the results. For these reasons, instead of a dataset consisting only of
limited variables, a larger dataset that also includes variables such as those mentioned here can be used. In addition, there are
several other methods that can be used to increase the accuracy of the predictions.