House Price Prediction with ML Techniques
2024-25
GANDHI INSTITUTE OF TECHNOLOGY AND MANAGEMENT
(GITAM) BHUBANESWAR, ODISHA
CERTIFICATE
This is to certify that the work in this Project Report entitled
“HOUSE PRICE PREDICTION USING PYTHON” by
Suchismita Sahoo (2221304049), Suraj Sahoo (2221304051), Tusar
Kanta Dhal (2221304052), and Dibyarashmi Bhanja (2231304002)
has been carried out under my supervision in partial fulfillment of
the requirements for the [Link] in Computer Science &
Engineering during the session 2024-2025 in the Department of Computer
Science & Engineering of GITAM, and this work is the original
work of the above students.
ACKNOWLEDGMENT
DECLARATION
We declare that every part of project report submitted is
genuinely our work and has not been submitted for the
award of any other degree. We acknowledge that, if any
sort of malpractice is detected in relation to this project,
we shall be held liable for it.
Submitted By:
Suchismita Sahoo
Suraj Sahoo
Tusar Kanta Dhal
Dibyarashmi Bhanja
ABSTRACT
LIST OF TABLES
Table 1 Application
LIST OF FIGURES
TABLE OF CONTENTS
Page Number
Certificate ii
Acknowledgement iii
Abstract iv
List of Tables v
List of Figures v
PHASE-I
1. Introduction
1.1 Introduction 1
1.2 Motivation 2
2. Literature Survey
2.1 Literature Survey 3
3. Proposed Work
3.1 Objective of proposed work 9
3.2 Methodology 9
3.2.1 Introduction to machine learning 9
3.2.2 How does Machine Learning work? 11
3.2.3 Need for Machine Learning 12
3.2.4 Applications of Machine learning: - 12
3.2.5 Machine Learning Classifications 15
3.2.5.1 Supervised Learning 15
3.2.5.2 Unsupervised Machine Learning 28
3.2.5.3 Reinforcement Learning 30
PHASE-II
4. Implementation
4.1 Code 31
5. Result Analysis
5.1 Visualization Insights 33
5.2 Advantages 36
5.3 Disadvantages 37
5.4 Maintenance 39
5.5 Application 41
6. Conclusion and Future Development 42
Reference 43
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION:
1.2 Motivation
We are highly interested in anything related to Machine Learning, and this
independent project gave us the opportunity to study the subject and reaffirm our
passion for it. The capacity to generate estimates and forecasts, and to give
machines the ability to learn on their own, is both powerful and nearly unlimited
in its application possibilities. Machine Learning may be applied in finance,
medicine, and virtually any other field. That is why we chose to base our project
on Machine Learning.
CHAPTER 2
LITERATURE SURVEY
Regardless, we still lack accurate, standardized approaches for estimating real
estate property values.
First, we reviewed several articles and discussions on machine learning for
housing price prediction. One article, titled "House Price Prediction", is based
on machine learning and neural networks and reports minimal error and high
accuracy. Another paper builds hedonic models on price data from Belfast and
infers that submarkets matter for residential valuation; the model is used to
identify submarkets over a larger spatial scale, with implications for the
evaluation process, the selection of comparable evidence, and the quality of the
variables that the valuations may require. Understanding current developments in
house prices and homeownership is the subject of a further study, which describes
a feedback mechanism, or social epidemic, that fosters a perception of property
as an essential market investment.
In this section, first of all, the basic economic structure affecting housing prices is
emphasized. Houses meet the shelter needs of people and are also an investment tool.
The housing market differs from other markets in that housing is both a
consumption and an investment good. Housing markets differ from other markets in
that the housing supply is very costly, the housing is permanent and continuous,
heterogeneous, fixed, causes growth in the secondary markets, and is used as a
guarantee (Iacoviello, 2000). The housing market is formed through a mechanism of
housing supply and demand. In the housing market, unlike the goods and services
market, the housing supply is inelastic. Supply and demand for housing change and
develop over time depending on the economic, social, cultural, geographical, and
demographic realities of the countries. Meeting the housing demand is associated
with housing policies and economic conditions. Housing demand arises for different
purposes such as consumption, investment, and wealth accumulation. The supply
and demand factors change according to the type of housing demand. In addition to
the input costs of the house as a product,
the determination of the price of the house is affected by many variables such as
people’s income level, marital status, industrialization of the society and agricultural
employment rate, interest rates, population growth and migration, and all variables
also affect the price. Since changes in housing prices affect both socio-economic
conditions and national economic conditions, it is an important issue that concerns
governments and individuals (Kim and Park, 2005). In
this part of the literature, some studies that estimate housing prices are cited. The
prediction of house prices from real factors is important for such studies. With the
developments in artificial intelligence methods, many problems in daily life, such
as purchasing a house, can now be solved. The competitive nature of the housing
sector supports the data mining process in this industry, processing its data and
predicting future trends. Regression is a machine learning tool that builds
predictions from available measurable information by exploiting the links between
the target parameter and many different independent parameters. The cost of a house
depends on several parameters, and machine learning is one of the most important
fields in which to model these parameters and predict prices with high accuracy.
The machine learning method is one of the most recent methods used for prediction.
It is used to interpret and analyze highly complex data structures and patterns
(Ngiam and Khor, 2019). Machine learning enables computers to learn and behave
like humans (Feggella, 2019). Machine learning means providing a valid dataset on
which predictions are based; the machine learns how important a particular event
might be to the whole system from pre-loaded data and predicts the outcome
accordingly. Various modern applications of this technique include predicting
stock prices, predicting the probability of an earthquake, and predicting company
sales, and the list has infinite possibilities (Shiller, 2007). Unlike traditional
econometric models, machine learning algorithms do not require the training data
to be normally distributed. Many statistical tests rely on the assumption of
normality. If the data are not
normally distributed, these statistical tests fail and become invalid. These
processes used to take a long time; today, however, they can be completed quickly
with the high-speed computing power of modern computers, making the technique less
costly and less time-consuming. Rafiei and Adeli (2016) used SVR to
determine whether a property developer should build a new development or stop the
construction at the beginning of a project based on the prediction of future house
prices. The study, in which data from 350 apartment houses built in Tehran (Iran)
between 1993 and 2008 were used, had 26 features such as zip code, gross floor area,
land area, estimated cost of construction, construction time, and property prices. Its
results revealed that SVR was a suitable method for making home price predictions
since the loss of prediction (error) was as low as 3.6% of the test data. Therefore, the
prediction results provide valuable input to the property developer’s decision-making
process. Cechin et al. (2000) analyzed the data of buildings for sale and rental in
Porto Alegre, Brazil, using linear regression and artificial neural network methods.
They used parameters such as the size of the house, district, geographical location,
environmental arrangement, number of rooms, building construction date and total
area of use. According to the study, they reported that the artificial neural network
method was more useful compared to linear regression. Yu and Wu (2016) used
classification and regression algorithms. According to their analysis, living area
in square meters, roof content, and neighborhood have the greatest statistical
significance in predicting the selling price of a house, and the prediction analysis
can be improved by the Principal Component Analysis (PCA) technique, because
the value of a particular property is closely associated with the infrastructure
facilities surrounding it. Koktashev et al. (2019) attempted to predict
house values in the city of Krasnoyarsk by using 1,970 housing transaction records.
The number of rooms, total area, floor, parking lot, type of repair, number of
balconies, type of bathroom, number of elevators, garbage disposal, year of
construction and accident rate of the house were discussed as the features in that
study. They applied random forest, ridge regression, and linear regression to predict
the property prices. Their study concluded
that the random forest outperformed the other two algorithms, as evaluated by the
Mean Absolute Error (MAE). Park and Bae (2015) developed a house price
prediction model with machine learning algorithms in real estate research and
compared their performance in terms of classification accuracy. Their study aimed at
helping real estate sellers or real estate agents to make rational decisions in real
estate transactions. The tests showed that the accuracy-based Repeated Incremental
Pruning to Produce Error Reduction (RIPPER) consistently outperformed other
models in house price prediction performance. Bhagat et al. (2016) studied on linear
regression algorithms for house prediction. The aim of the study was to predict the
effective price of the real estate for clients based on their budget and priorities. They
indicated that the linear regression technique of the analysis of past market trends
and price ranges could be used to determine future house prices. In their study,
Mora-Esperanza and Gallego (2004) analyzed house prices in Madrid using 12
parameters. The parameters they used were the distance to the city center, road, size
of the district, construction class, age of the building, renovation status, housing
area, terrace area, location within the district, housing design, the floor and the
presence of outbuildings. The dataset was created assuming that the sales values of
100 houses for sale in the region were the real values. The researchers, who used
ANN and linear regression analysis techniques, reported that the ANN technique
was more successful, achieving an average agreement of 95% and an accuracy of
86%. Wang and Wu (2018) used 27,649 home appraisal price records from
Arlington County, Virginia, USA in 2015 and suggested that Random Forest
outperformed linear regression in terms of accuracy. In their study on the case of
Mumbai, India, Varma et al. (2018) attempted to predict the price of the house by
using various regression techniques (Linear Regression, Forest regression, boosted
regression) and artificial neural network technique based on the features of the house
(usage area, number of rooms, number of bathrooms, parking lot, elevator,
furniture). In conclusion, they determined that the efficiency of the algorithm with
the use of artificial neural networks was higher compared to other regression
techniques. They also revealed that the system prevented the
risk of investing in the wrong house by providing the right output. Thamarai and
Malarvizhi (2020) attempted to predict the prices of houses from real-time data
after the large fluctuation in house price increases in 2018 at the Tadepalligudem
location of West Godavari District in Andhra Pradesh, India using the features of the
number of bedrooms, age of the house, transportation facilities, nearby schools, and
shopping opportunities. They applied these models in decision tree regression and
multiple linear regression techniques, which are among the machine learning
techniques. They suggested that the performance of multiple linear regression was
better than decision tree regression in predicting the house prices.
Zhao et al. [1] applied deep learning in combination with Extreme Gradient
Boosting (XGBoost) to real estate price prediction, analyzing historical
property sale records. The dataset was extracted from an online real estate
website and split into 80% for training and 20% for testing. According to Satish
et al. [2], regression deals with specifying the relationship between a dependent
variable (also called the response or outcome) and independent variables
(predictors). Their study aimed to predict future house prices with the help of
machine learning algorithms.
CHAPTER 3
PROPOSED WORK
3.1 Objective of proposed work
3.2 Methodology
3.2.1 Introduction to Machine Learning
We are surrounded by humans who can learn everything from their experiences with
their learning capability, and we have computers or machines which work on our
instructions. But can a machine also learn from experiences or past data like a
human does? Here comes the role of Machine Learning. It is a science that will
improve further in the future. The reason behind this development is the difficulty
of analyzing and processing the rapidly increasing data. Machine learning is based
on the principle of finding the best model for new data among the previous data,
thanks to this increasing data. Therefore, machine learning research will continue
in parallel with the growth of data. This research includes the history of machine
learning, the methods used in machine learning, its application fields, and the
research on this field. The aim of this study is to transmit knowledge on machine
learning, which has become very popular nowadays, and its applications to
researchers. There is no error margin in the operations carried out by computers
based on an algorithm, as the operation follows certain steps. Different from
commands written to produce an output from an input, there are situations where
computers make decisions based upon the present sample data. In those situations,
computers may make mistakes, just like people, in the decision-making process.
That is, machine learning is the process of equipping computers with the ability
to learn by using data and experience like a human brain. The main aim of machine
learning is to create models which can train themselves to improve, perceive
complex patterns, and find solutions to new problems by using previous data.
3.2.2 How does Machine Learning work?
A Machine Learning system learns from historical data, builds prediction models,
and, whenever it receives new data, predicts the output for it. The accuracy of
the predicted output depends upon the amount of data, as a huge amount of data
helps to build a better model which predicts the output more accurately. Suppose
we have a complex problem where we need to perform some predictions; instead of
writing code for it, we just feed the data to generic algorithms, and with the
help of these algorithms, the machine builds the logic as per the data and
predicts the output. Machine learning has changed our way of thinking about such
problems. The block diagram below explains the working of a Machine Learning
algorithm:
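The learn-from-data loop described above can be sketched with scikit-learn; the numbers below are invented toy data (house size in sqft versus price in lakhs), not the project's dataset.

```python
# Minimal sketch of "historical data in, model out, prediction on new data".
from sklearn.linear_model import LinearRegression

# Illustrative historical data: size in sqft -> price in lakhs (made up)
X_train = [[600], [800], [1000], [1200]]
y_train = [30, 40, 50, 60]

model = LinearRegression()
model.fit(X_train, y_train)           # build the prediction model from historical data

prediction = model.predict([[900]])   # new, unseen input
print(round(prediction[0], 1))        # -> 45.0 on this perfectly linear toy data
```

As the text notes, more (and better) historical data generally yields a better-fitted model and more accurate predictions.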
3.2.3 Need for Machine Learning
The need for machine learning is increasing day by day. The reason behind the need
for machine learning is that it is capable of doing tasks that are too complex for a
person to implement directly. As humans, we have some limitations: we cannot
access huge amounts of data manually, so we need computer systems, and here
machine learning makes things easy for us. We can train machine learning
algorithms by providing them with huge amounts of data and letting them explore
the data, construct models, and predict the required output automatically. The
performance of a machine learning algorithm depends on the
amount of data, and it can be determined by the cost function. With the help of
machine learning, we can save both time and money. The importance of machine
learning can be easily understood by its use cases. Currently, machine learning
is used in self-driving cars, cyber fraud detection, face recognition, friend
suggestion by Facebook, etc. Various top companies such as Netflix and Amazon
have built machine learning models that use vast amounts of data to analyze user
interests and recommend products accordingly.
Machine learning also powers everyday tools such as Google Maps, Google
Assistant, Alexa, etc.
Machine Learning Life Cycle:
The machine learning life cycle is a cyclic process to build an efficient machine
learning project. The main purpose of the life cycle is to find a solution to the
problem or project. The machine learning life cycle involves seven major steps,
which are given below:
• Gathering Data
• Data preparation
• Data Wrangling
• Analyze Data
• Train the model
• Test the model
• Deployment
Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of
this step is to identify and obtain all the data relevant to the problem. In this
step, we need to identify the different data sources, as data can be collected
from various sources such as files, databases, the internet, or mobile devices.
It is one of the most important steps of the life cycle: the quantity and quality
of the collected data determine the efficiency of the output. The more data there
is, the more accurate the prediction will be. This step includes the tasks below:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
Data preparation:
After collecting the data, we need to prepare it for further steps. Data preparation is
a step where we put our data into a suitable place and prepare it to use in our
machine learning training. In this step, first, we put all data together, and then
randomize the ordering of data. This step can be further divided into two processes:
Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.
Data Wrangling:
Data wrangling is the process of cleaning and converting raw data into a usable
format. It is the process of cleaning the data, selecting the variable to use, and
transforming the data in a proper format to make it more suitable for analysis in the
next step. It is one of the most important steps of the complete process.
Cleaning the data is required to address quality issues. The data we have
collected is not necessarily all useful, as some of it may be irrelevant. In
real-world applications, collected data may have various issues, including:
• Missing Values
• Duplicate data
• Invalid data
• Noise
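As a hedged illustration, the first two issues above (missing values and duplicate data) can be handled with pandas; the table below is invented for the example.

```python
# Toy wrangling sketch: drop duplicate rows, then rows with missing values.
import pandas as pd

raw = pd.DataFrame({
    'total_sqft': [1000, 1000, None, 1500],
    'price':      [50,   50,   60,   75],
})

clean = (raw
         .drop_duplicates()   # duplicate data
         .dropna())           # missing values

print(len(raw), len(clean))   # 4 rows in, 2 rows out
```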
Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step
involves:
• Selection of analytical techniques
• Building models
• Reviewing the results
The aim of this step is to build a machine learning model to analyze the data using
various analytical techniques and review the outcome. It starts with determining
the type of problem, where we select machine learning techniques such as
Classification, Regression, Cluster Analysis, Association, etc., then build the
model using the prepared data, and evaluate the model.
Deployment
The last step of machine learning life cycle is deployment, where we deploy the
model in the real-world system. If the above-prepared model is producing an accurate
result as per our requirement with acceptable speed, then we deploy the model in the
real system. But before deploying the project, we check whether the model's
performance improves with the available data. The deployment phase is similar to
making the final report for a project.
3.2.5.1 Supervised Learning
In supervised learning, machines are trained using well-labelled training data to
understand the dataset and learn about each example; once training and processing
are done, we test the model by providing sample data to check whether it predicts
the correct output. The goal of supervised learning is to map input data to output
data. Supervised learning is based on supervision, just as a student learns under
the supervision of a teacher. An example of supervised learning is spam
filtering. Supervised learning is a
process of providing input data as well as correct output data to the machine learning
model. The aim of a supervised learning algorithm is to find a mapping function to
map the input variable(x) with the output variable(y). In the real-world, supervised
learning can be used for Risk Assessment, Image classification, Fraud Detection,
spam filtering, etc. In supervised learning, models are trained using labelled dataset,
where the model learns about each type of data. Once the training process is
completed, the model is tested on the basis of test data (a held-out subset of the
dataset), and then it predicts the output. The working of supervised learning can
be easily
understood by the below example and diagram:
• If the given shape has four sides, and all the sides are equal, then it will be
labelled as a Square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the
model is to identify the shape. The machine is already trained on all types of
shapes, and when it finds a new shape, it classifies the shape on the basis of
the number of sides and predicts the output.
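The shape example above amounts to a simple rule: the label depends on the number of (equal) sides. The tiny hand-written classifier below makes that mapping explicit; a trained supervised model would learn the same rules from labelled examples instead of having them coded by hand.

```python
# Rule-based version of the shape-labelling example above (illustrative only).
def classify_shape(num_sides, all_sides_equal=True):
    if num_sides == 3:
        return 'triangle'
    if num_sides == 4 and all_sides_equal:
        return 'square'
    if num_sides == 6 and all_sides_equal:
        return 'hexagon'
    return 'unknown'

print(classify_shape(4))  # -> square
```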
Fig. 5 Types of supervised Machine learning
Regression
Regression algorithms are used if there is a relationship between the input variable
and the output variable. It is used for the prediction of continuous variables, such as
Weather forecasting, Market Trends, etc. Below are some popular Regression
algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc. The linear regression algorithm shows a linear
relationship between a dependent (y) variable and one or more independent (x)
variables, hence it is called linear regression. Since linear regression shows a
linear relationship, it finds how the value of the dependent
variable is changing according to the value of the independent variable. The linear
regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
• Simple Linear Regression: If a single independent variable is used to
predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Simple Linear Regression.
• Multiple Linear Regression: If more than one independent variable is used
to predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Multiple Linear Regression.
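The distinction above can be sketched in scikit-learn: the only difference is whether the feature matrix has one column or several. The feature values (sqft, bedrooms) are illustrative, not the project's data.

```python
from sklearn.linear_model import LinearRegression

y = [40, 50, 60]  # illustrative prices

# Simple linear regression: one independent variable (sqft)
simple = LinearRegression().fit([[800], [1000], [1200]], y)

# Multiple linear regression: several independent variables (sqft, bedrooms)
multi = LinearRegression().fit([[800, 2], [1000, 2], [1200, 3]], y)

print(len(simple.coef_), len(multi.coef_))  # -> 1 2 (one vs two coefficients)
```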
Fig 8 Negative Linear Relationship
Cost function
• Different values for the weights or line coefficients (a0, a1) give different
regression lines, and the cost function is used to estimate the coefficient
values for the best-fit line.
• The cost function optimizes the regression coefficients or weights. It measures
how well a linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function is
also known as the hypothesis function.
• For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and the
actual values. For the linear equation y = a1*x + a0, the MSE can be calculated
as:

MSE = (1/N) * Σ (Yi − (a1*xi + a0))²

Where:
N = total number of observations
Yi = actual value
(a1*xi + a0) = predicted value
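The MSE above can be computed directly with NumPy; the points below are a made-up example where the candidate line fits the data exactly, so the cost is zero.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # actual values Yi
a1, a0 = 2.0, 0.0               # candidate line coefficients

predicted = a1 * x + a0          # (a1*xi + a0)
mse = np.mean((y - predicted) ** 2)
print(mse)                       # -> 0.0 for this perfect fit
```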
Classification
Classification algorithms are used when the output variable is categorical, which
means there are classes such as Yes-No, Male-Female, True-False, etc. Below are
some popular classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Random Forest is a classifier that combines a number of decision trees built on
subsets of the given dataset and averages their predictions to improve accuracy
and reduce the problem of overfitting. The diagram below explains the working of
the Random Forest algorithm:
• Land Use: We can identify the areas of similar land use by this
algorithm.
• Marketing: Marketing trends can be identified using this
algorithm.
Decision Tree Classification Algorithm: -
• Decision Tree is a supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and
each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and
Leaf Node. Decision nodes are used to make any decision and have multiple
branches, whereas Leaf nodes are the output of those decisions and do not
contain any further branches.
• The decisions or the test are performed on the basis of features of the given
dataset.
• In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No),
further splits the tree into subtrees.
• Below diagram explains the general structure of a decision tree:
How does the Decision Tree algorithm work?
• Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
• Step-3: Divide S into subsets that contain possible values for the best
attributes.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the
nodes cannot be classified further; such a final node is called a leaf node. In
the example, the decision node finally splits into two leaf nodes (Accepted
offers and Declined offers). Consider the below diagram:
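A minimal sketch of a CART-style decision tree with scikit-learn, loosely mirroring the accepted/declined-offer example above; the salary feature and labels are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Feature: [salary in lakhs]; label: 1 = offer accepted, 0 = declined (toy data)
X = [[3], [4], [8], [9]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)  # CART under the hood
print(tree.predict([[7]])[0])  # -> 1 (falls on the "accepted" side of the split)
```

scikit-learn's decision trees use an optimized version of the CART algorithm mentioned above.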
Logistic Regression
• Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using continuous
and discrete datasets.
• Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for
the classification. The below image is showing the logistic function:
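A hedged sketch of the probability output described above, on invented pass/fail data (hours studied versus outcome):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied -> pass (1) / fail (0)
X = [[1], [2], [3], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[9]])[0][1]     # probability of class 1 (pass)
print(clf.predict([[9]])[0], proba > 0.5)  # class 1, with probability above 0.5
```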
3.2.5.2 Unsupervised Machine Learning:
Unsupervised learning cannot be directly applied to a regression or classification
problem because unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to similarities,
and represent the dataset in a compressed format.
Here, we take unlabeled input data, which means it is not categorized and
corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. The model first interprets the raw
data to find hidden patterns and then applies suitable algorithms such as k-means
clustering, hierarchical clustering, etc. Popular unsupervised learning
algorithms include:
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
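Of the algorithms listed above, k-means is perhaps the easiest to demonstrate; a minimal sketch on made-up 2-D points:

```python
from sklearn.cluster import KMeans

# Two obvious groups of unlabeled points (invented)
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Nearby points share a cluster label; distant points do not
print(km.labels_[0] == km.labels_[1], km.labels_[0] != km.labels_[3])  # -> True True
```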
Apart from these general usages, clustering is used by Amazon in its
recommendation system to provide recommendations based on users' past product
searches. Netflix also uses this technique to recommend movies and web series to
its users based on their watch history. The diagram below explains the working of
the clustering algorithm: we can see the different fruits being divided into
several groups with similar properties.
CHAPTER 4
4. IMPLEMENTATION
4.1 CODE

import numpy as np
import matplotlib.pyplot as plt

def plot_scatter_chart(df, location):
    # Compare 2 BHK and 3 BHK prices for a given location
    bhk2 = df[(df.location == location) & (df.BHK == 2)]
    bhk3 = df[(df.location == location) & (df.BHK == 3)]
    plt.rcParams['figure.figsize'] = (15, 10)
    plt.scatter(bhk2.total_sqft, bhk2.price, color='blue', label='2 BHK', s=50)
    plt.scatter(bhk3.total_sqft, bhk3.price, color='green', marker='+',
                label='3 BHK', s=50)
    plt.xlabel('Total Square Foot')
    plt.ylabel('Price')
    plt.title(location)
    plt.legend()

plot_scatter_chart(data3, "Rajaji Nagar")

def remove_bhk_outliers(df):
    # For each location, drop n-BHK flats priced below the mean
    # price-per-sqft of (n-1)-BHK flats in the same location
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for BHK, BHK_df in location_df.groupby('BHK'):
            bhk_stats[BHK] = {
                'mean': np.mean(BHK_df.price_per_sqft),
                'std': np.std(BHK_df.price_per_sqft),
                'count': BHK_df.shape[0]
            }
        for BHK, BHK_df in location_df.groupby('BHK'):
            stats = bhk_stats.get(BHK - 1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(
                    exclude_indices,
                    BHK_df[BHK_df.price_per_sqft < stats['mean']].index.values)
    return df.drop(exclude_indices, axis='index')

data4 = remove_bhk_outliers(data3)
data4.shape
CHAPTER 5
5 RESULT ANALYSIS
5.1 VISUALIZATION INSIGHTS:
2BHK Preference:
The observation that most houses sold are 2BHK suggests that buyers may prefer
smaller-sized homes, possibly due to factors such as affordability, family size, or
lifestyle preferences.
Location Diversity:
With houses from 255 different locations, 'Whitefield' and 'Sarjapur Road' emerge as
popular areas. This information is valuable for understanding market demand and
can aid in targeted marketing or investment decisions.
Distribution Plots:
The distribution plots for 'bath', 'bhk', 'price', and 'total_sqft' provide insights into the
spread and variability of these features. Understanding their distributions can help in
identifying outliers, understanding central tendencies, and assessing data quality.
Train-Test Split and Model Building:
Data Splitting:
The dataset is split into training and testing sets, with 80% of the data used for
training and 20% for testing. This ensures that the model's performance is evaluated
on unseen data, providing a more accurate assessment of its generalization ability.
Model Selection:
Three regression models - Linear Regression, Lasso Regression, and Ridge
Regression - are chosen for predicting house prices. These models offer different
approaches to regression and can capture different aspects of the data's underlying
relationships.
Preprocessing:
One-hot encoding is used to handle the categorical feature 'location', while standard
scaling ensures that all features are on a similar scale, preventing any particular
feature from dominating the model training process.
Evaluation Metric:
R2 score, also known as the coefficient of determination, is employed as the
evaluation metric. It represents the proportion of the variance in the dependent
variable (house prices) that is predictable from the independent variables.
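The split / preprocess / model / evaluate workflow described above can be sketched end to end. The column names (location, total_sqft, bath, BHK, price) follow the report, but the data below is synthetic, so the R2 values it prints are illustrative only, not the report's results.

```python
# Hedged end-to-end sketch of the workflow: 80/20 split, one-hot encoding of
# 'location', standard scaling of numeric features, three regressors, R2 score.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'location': rng.choice(['Whitefield', 'Sarjapur Road'], n),
    'total_sqft': rng.uniform(500, 2500, n),
    'bath': rng.integers(1, 4, n),
    'BHK': rng.integers(1, 5, n),
})
df['price'] = 0.05 * df.total_sqft + 5 * df.BHK + rng.normal(0, 5, n)  # synthetic

X, y = df.drop(columns='price'), df.price
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # 80% train, 20% test

pre = ColumnTransformer([
    ('loc', OneHotEncoder(handle_unknown='ignore'), ['location']),
    ('num', StandardScaler(), ['total_sqft', 'bath', 'BHK']),
])

for name, reg in [('Linear', LinearRegression()),
                  ('Lasso', Lasso()),
                  ('Ridge', Ridge())]:
    model = make_pipeline(pre, reg).fit(X_train, y_train)
    print(name, round(r2_score(y_test, model.predict(X_test)), 2))
```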
Fig. 17 Evaluation Metric
Result Analysis:
Model Performance:
Linear Regression and Ridge Regression exhibit similar performance, with R2
scores of around 0.82. This indicates that approximately 82% of the variance in
house prices is captured by these models.
Impact of Regularization:
Lasso Regression, which applies L1 regularization, slightly underperforms
compared to the other two models. The negligible difference in performance
between Ridge and Linear Regression suggests that regularization might not
significantly affect model performance in this scenario.
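The qualitative difference between the L1 and L2 penalties can be seen directly on synthetic data (this toy example is not the project's dataset):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 2))  # L1 sets irrelevant coefficients to exactly 0
print(np.round(ridge.coef_, 2))  # L2 only shrinks them toward 0
```

When few features are irrelevant, as seems to be the case here, Lasso's aggressive shrinkage can cost a little accuracy, which is consistent with the results above.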
Overall, the analysis demonstrates the effectiveness of the chosen regression models in predicting house prices. The insights gleaned from data visualization aid in understanding market dynamics, while model evaluation provides valuable feedback for refining model selection and preprocessing techniques.
5.2 ADVANTAGES
• Integration: Python fits naturally into end-to-end data science projects. It can be integrated with databases, web frameworks, and cloud services, allowing for end-to-end development and deployment of predictive models.
• Machine Learning Ecosystem: Python's machine learning ecosystem is well-established and constantly evolving. It offers state-of-the-art algorithms, techniques, and methodologies for solving predictive modelling problems, including house price prediction.
• Interpretability: Python-based machine learning models are often highly
interpretable, allowing stakeholders to understand the factors driving
predictions. This transparency is crucial, especially in real estate, where
buyers, sellers, and agents seek to understand the rationale behind house
price estimates.
• Open Source: Python is open source and free to use, making it accessible to
everyone. This democratization of technology enables individuals and
organizations of all sizes to leverage machine learning for various
applications, including house price prediction.
5.3 DISADVANTAGES
While Python offers numerous advantages for house price prediction, there
are also some potential disadvantages to consider:
• GIL Limitation: Python's Global Interpreter Lock (GIL) can hinder
multithreaded performance, particularly in CPU-bound tasks. While libraries
like NumPy and Pandas can offload computation to optimized C or Fortran
code, certain operations may still be affected by the GIL, impacting parallel
processing performance.
• Dependency Management: Python's dependency management system,
particularly with respect to package versions and compatibility, can
sometimes be challenging. Dependency conflicts or version mismatches
between libraries may arise, requiring careful management and potentially
causing issues with model reproducibility.
• Debugging Complexity: Python's dynamic typing and flexible syntax, while
advantageous for development speed, can sometimes lead to more
challenging debugging processes. Errors may not be caught until runtime,
and troubleshooting issues in complex machine learning pipelines may
require significant effort.
• Limited Deployment Options: While Python excels in model development
and experimentation, deploying Python-based machine learning models into
production environments may present challenges.
• Interpretability: While Python-based machine learning models can offer
interpretability, certain advanced techniques such as deep learning may
produce less interpretable models. Understanding and explaining the
predictions of complex models may require additional effort and expertise.
• Security Risks: Python's open-source nature and extensive library ecosystem can introduce security risks, particularly when using third-party packages or dependencies. Ensuring the security of machine learning pipelines and protecting against vulnerabilities requires careful attention and proactive measures.
• Learning Curve: While Python's syntax is relatively easy to learn,
mastering the full spectrum of machine learning techniques and libraries can
be challenging. Beginners may face a steep learning curve, requiring time
and dedication to gain proficiency in data preprocessing, model selection,
and evaluation.
5.4 MAINTENANCE
• Model Versioning: Maintain versioned copies of the predictive models, along with the associated data preprocessing and feature engineering pipelines.
• Security Measures: Implement security measures to protect the integrity and
confidentiality of the data used in the prediction system. Use encryption,
access controls, and secure communication protocols to safeguard sensitive
information.
• Scalability: Monitor system performance and scalability as the volume of
data and user traffic grows. Optimize code and infrastructure to handle
increasing workloads efficiently and ensure timely responses to user queries.
• Documentation: Maintain comprehensive documentation for the prediction
system, including model specifications, data sources, preprocessing steps,
and evaluation metrics. Document any changes or updates made to
the system over time.
• User Feedback Incorporation: Gather feedback from users and stakeholders
to identify areas for improvement and address any usability issues.
Incorporate user feedback into future iterations of the prediction system to
enhance user satisfaction and adoption.
• Continual Improvement: Continuously evaluate and refine the prediction
system based on feedback, performance metrics, and evolving business
requirements. Experiment with new algorithms, techniques, or features to
improve predictive accuracy and relevance.
5.5 APPLICATION
Table 1 Application
CHAPTER 6
6 CONCLUSION
The proposed system predicts property prices in Bangalore from several characteristics. We experimented with different machine learning algorithms to find the best model; compared with all the other algorithms, the Decision Tree algorithm achieved the lowest loss and the highest R-squared. The website was created with Flask.
To see the project in action, open the HTML web page we generated and run the [Link] file in the backend. Enter the property's square footage, the number of bedrooms, the number of bathrooms, and the location, then click 'ESTIMATE PRICE'. The system then forecasts the cost of what may be someone's ideal home.
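A minimal sketch of what such a Flask backend might look like (the route name, field names, and stand-in pricing function are assumptions for illustration; the real project would load its trained model instead):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for the trained model; the real project would
# load its pickled scikit-learn pipeline here instead.
def predict_price(location, total_sqft, bath, bhk):
    return round(50 + 0.05 * total_sqft + 5 * bath + 10 * bhk, 2)

@app.route("/predict", methods=["POST"])
def predict():
    d = request.get_json()
    price = predict_price(d["location"], d["total_sqft"], d["bath"], d["bhk"])
    return jsonify({"estimated_price": price})

# To serve the API locally: app.run(debug=True)
```

The HTML page would POST the form fields to this endpoint and display the returned estimate.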
The goal of the project "House Price Prediction Using Machine Learning" is to forecast house prices from the features in the provided data. After training and testing the model, our best accuracy was around 90%. To distinguish this model from other prediction systems, additional parameters such as tax and air quality could be included. With such a tool, people can purchase houses within their budget and minimize financial loss. Several algorithms were compared to determine house values, and the selling price was estimated with greater precision and accuracy, which will benefit buyers and sellers alike. The many factors that influence housing prices must be taken into account and handled carefully.
REFERENCES
[1] Model: "Bangalore House Price Prediction Model".
[2] Heroku Documentation.
[3] Repository: "Web Application", [Link]House-Price-Prediction
[4] Repository: "Web Application", [Link]Thakur/BANGALORE-HOUSE-PRICE-PREDICTION
[5] Pickle Documentation.
[6] A. Varma, A. Sarma, S. Doshi and R. Nair, "House Price Prediction Using
Machine Learning and Neural Networks," 2018 Second International Conference on
Inventive Communication and Computational Technologies (ICICCT), 2018, pp.
1936-1939, doi: 10.1109/ICICCT.2018.8473231.
[7] Furia, Palak, and Anand Khandare. "Real Estate Price Prediction Using Machine
Learning Algorithm." e-Conference on Data Science and Intelligent Computing.
2020.
[8] Musciano, Chuck, and Bill Kennedy. HTML & XHTML: The Definitive Guide. O'Reilly Media, Inc., 2002.
[9] Aggarwal, Shalabh. Flask framework cookbook. Packt Publishing Ltd, 2014.
[10] Grinberg, Miguel. Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc., 2018.
[11] Middleton, Neil, and Richard Schneeman. Heroku: Up and Running: Effortless Application Deployment and Scaling. O'Reilly Media, Inc., 2013.
[12] Available: [Link]Price_Prediction_using_a_Machine_Learning_Model_A_Survey_of_Literature
[13] Limsombunchai, V., Christopher Gan, and Minsoo Lee. House price prediction using a hedonic price model vs an artificial neural network. American Journal of Applied Sciences. 3:193–201.
[14] Joep Steegmans and Wolter Hassink. An empirical investigation of how wealth and income affect one's financial status and ability to purchase a home. Journal of Housing Economics. 2017;36:8–24.
[15] Ankit Mohokar, Nihar Baghat, and Shreyash Mane. House Price Forecasting
Using Data Mining, International Journal of Computer Applications. 152:23–26.
[16] Luis Torgo and Joao Gama. Regression using classification algorithms. Intelligent Data Analysis. 4:275–292.
[17] Fabian Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 12:2825–2830.
[18] Bork, M. and Moller, V.S. House Price Forecast Ability: A Factor Analysis. Real Estate Economics. 46:582–611.
[19] Hy Dang, Minh Nguyen, Bo Mei, and Quang Truong. Improvements to home price prediction methods using machine learning. Procedia Engineering. 174:433–442.
[20] Atharva Chogle, Priyanka Khaire, Akshata Gaud, and Jinal Jain. House Price Forecasting Using Data Mining Techniques. International Journal of Advanced Research in Computer and Communication Engineering. 6:24–28.
[21] Kai-Hsuan Chu and Li Li. Prediction of real estate price variation based on economic parameters. International Conference on Applied System Innovation (ICASI), IEEE, 2017.
[22] Subhani Shaik, Uppu Ravibabu. Classification of EMG Signal Analysis based
on Curvelet Transform and Random Forest tree Method. Paper selected for Journal of
Theoretical and Applied Information Technology (JATIT). 95.