Oral Presentation
✓ [email protected] ,
[email protected]
✓ Highlights (what was first discovered and why it is a breakthrough)
Housing data from Kaggle is used to build price prediction models. Such data typically includes attributes like square
footage, location and number of bedrooms. Location requires special preprocessing because it has an outsized impact on
prices; one-hot encoding of neighborhoods is a common technique. Identifying and removing outliers is also very
important. Predictive accuracy in the 70-80% range on unseen test data is considered quite good. Model
interpretability matters as well, since it shows which factors are most influential. Additional insights can sometimes
be extracted, e.g. ranking locations by average price per square foot, which adds business value beyond the predictions.
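As a small illustration of the one-hot encoding mentioned above, the sketch below turns a locality column into 0/1
indicator columns in plain Python. The helper name one_hot and the locality labels are assumptions made for this
example and are not taken from the study's code.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one indicator is set per record
        rows.append(row)
    return categories, rows


# Hypothetical locality labels, for illustration only.
localities = ["Whitefield", "Indiranagar", "Whitefield", "Hebbal"]
categories, encoded = one_hot(localities)
```

Each record ends up with a single 1 in the column of its own locality, so a linear model can learn one price offset per
neighborhood.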
✓ Abstract (self-explanation of the main discovery)
This study addresses house price prediction in Bengaluru using linear and multiple regression techniques. Utilizing a
dataset of 1298 unique localities, the research focuses on forecasting land prices in the Bengaluru Metropolitan
Area (BMA) in Karnataka, India. Beyond the House Price Index (HPI), factors such as area type, availability, location,
society, and apartment size are considered. The goal is to predict the price per square foot for apartments. In
metropolitan cities like Bengaluru, determining accurate sales prices remains challenging, making predictive
modeling crucial for real estate decision-making. The models aim to capture the complex interplay of these factors
in influencing individual house prices in the dynamic real estate market of Bengaluru.
✓ Introduction (clarify the complexity of the topic and justify the urgency to investigate the
research hypothesis)
Housing is an essential human need, and real estate markets have a strong impact on economies. Valuing properties
accurately is important but challenging because many factors influence price, including size, number of rooms and
location. Regression techniques are commonly used to predict sales prices. Bengaluru, India is seeing rising housing
demand, and buyers consider amenities, area and facilities when purchasing. This study develops a model to forecast
Bengaluru house prices per square foot using machine learning algorithms. It is based on a dataset covering 1298
locations with details such as number of rooms, number of bathrooms and location features. We compare linear
regression, lasso regression and decision tree models, and tune the data by handling outliers and missing values.
Multiple linear regression provides 85% accuracy in the final model. Location data requires preprocessing because it
strongly influences prices. The model helps determine fair valuations across many neighborhoods and is made usable in
practice through a web interface that provides price estimates. The work demonstrates the feasibility of applying
machine learning to a complex real estate market. Overall, it is a breakthrough in bringing efficiency, transparency
and analytical rigor to property pricing, with implications for home buyers, investors and developers because it
accounts for many parameters.
✓ Methods (reproducible instructions to confirm or disprove a research hypothesis)
Data Set:
Utilized the "Bangalore_House_data prediction" dataset with 13320 rows and 9 columns. The target variable is "price," which is to
be predicted.
Data Preprocessing and Integration:
Cleaning: Ensured data quality by addressing missing values through mean or median replacement.
Refinement: Improved model efficiency by removing irrelevant columns, focusing on essential data.
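The cleaning and refinement steps can be sketched as follows. This is a minimal, dependency-free illustration, not the
study's actual code: the column names (total_sqft, bath, society), the toy records and the choice of which column to
drop are all assumptions.

```python
from statistics import median


def impute_median(rows, column):
    """Replace None in `column` with the median of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def drop_columns(rows, columns):
    """Return copies of the rows without the listed columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Toy records; column names and values are assumed for illustration.
rows = [
    {"total_sqft": 1200, "bath": 2, "society": "A"},
    {"total_sqft": 1500, "bath": None, "society": None},
    {"total_sqft": 900, "bath": 1, "society": "B"},
    {"total_sqft": 2000, "bath": 3, "society": None},
]
cleaned = drop_columns(impute_median(rows, "bath"), {"society"})
```

The median is used here rather than the mean because it is less sensitive to the extreme values that are removed only
in the later outlier step.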
Outlier Detection:
Identification: Identified outliers using statistical measures such as interquartile range and visualizations like boxplots.
Handling: Employed the removeOutliers function to enhance data accuracy by eliminating outliers.
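The body of the removeOutliers function is not given in this abstract. The sketch below shows one common way such a
filter can be written with the interquartile range rule; the name remove_outliers_iqr, the multiplier k = 1.5 and the
price_per_sqft column are assumptions, and the original function may work differently.

```python
from statistics import quantiles


def remove_outliers_iqr(rows, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([row[column] for row in rows], n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if lower <= row[column] <= upper]


# Toy price-per-square-foot values with one obvious outlier.
prices = [4000, 4200, 4500, 4700, 5000, 5200, 5500, 30000]
rows = [{"price_per_sqft": p} for p in prices]
kept = remove_outliers_iqr(rows, "price_per_sqft")
```

On this toy data only the 30000 record falls outside the fences and is dropped.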
Data Visualization:
Techniques: Utilized box plots for effective visualization, specifically focusing on trends in location area vs. prices.
Purpose: Enhanced interpretability and understanding of data patterns, aiding subsequent modeling.
Test Train Split:
Procedure: Applied the train_test_split() method, allocating 75% of data for training and 25% for testing.
Objective: Facilitated robust model evaluation by segregating data appropriately.
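To make the 75/25 split concrete, the following is a dependency-free stand-in for what a shuffled train/test split
does. It is not scikit-learn's train_test_split() itself; the function name split_train_test and the seed are
assumptions made for this sketch.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 data records
train, test = split_train_test(rows, test_fraction=0.25)
```

Fixing the seed matters for reproducibility: without it, each run evaluates the model on a different test set and the
reported score changes from run to run.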
Machine Learning Models:
Linear Regression:
Modeling Approach: Developed a supervised machine learning model capturing a linear relationship
between dependent and independent variables.
Representation: Y = a0 + a1X + ε.
Focus: Predicted house prices based on individual factors.
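For a single predictor, the coefficients a0 and a1 in the representation above have a closed-form least-squares
solution. The sketch below computes them on made-up data; the areas, prices and price unit are assumptions used only
to show the calculation.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (a0, a1) for y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx            # slope: price change per extra square foot
    a0 = mean_y - a1 * mean_x  # intercept
    return a0, a1


# Toy data: area in sq ft vs. price (values and unit are illustrative).
areas = [1000, 1500, 2000, 2500]
prices = [52, 74, 101, 123]
a0, a1 = fit_simple_linear(areas, prices)
predicted = a0 + a1 * 1800  # estimate for an 1800 sq ft home
```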
Multiple Linear Regression:
Approach: Explored relationships between house prices and multiple independent variables.
Utility: Identified and utilized various factors contributing to house price prediction.
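With several predictors the same least-squares idea leads to the normal equations (XᵀX)a = Xᵀy. The sketch below
solves them in plain Python for three assumed features (area, bathrooms, BHK). The toy targets are generated from
known coefficients so the fit can be checked; in practice a library solver would normally be used instead of
hand-written elimination.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_multiple_linear(features, targets):
    """Least-squares coefficients [a0, a1, ..., ap] via the normal equations."""
    design = [[1.0] + list(row) for row in features]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy rows of (area_sqft, bath, bhk); targets follow 10 + 0.04*area + 5*bath + 3*bhk.
features = [(1000, 2, 2), (1500, 2, 3), (1200, 1, 2),
            (2000, 3, 3), (1800, 3, 4), (900, 1, 1)]
targets = [66, 89, 69, 114, 109, 54]
coef = fit_multiple_linear(features, targets)
```

Because the toy targets are noise-free, the recovered coefficients match the generating ones up to floating-point
error; real housing data would leave a residual.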
Random Forest:
Advantages: Handled missing values efficiently, maintained accuracy, and addressed overfitting
concerns.
Implementation: Developed decision trees based on random data and variable selection.
Application: Particularly effective for predicting house prices in large datasets.
✓ Results (present only the most environmentally and industrially important results, make
sure to provide corresponding units)
After preprocessing and visualizing our dataset, we found that several models were suitable for the
selected attributes, including Multiple Linear Regression, Lasso Regression and Decision Tree.
Evaluating them further with GridSearchCV, we observed that multiple linear regression was the most
suitable model and gave the best scores. Using MSE, R² and RMSE as evaluation metrics, we obtained an
accuracy of 85% and were therefore able to predict the price of houses in Bangalore from the final
parameters: location, area in sq ft, number of bathrooms and BHK.
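The three evaluation metrics named above can be computed as in the sketch below. The true and predicted prices are
made-up numbers. If the reported 85% refers to R² (an assumption; the text does not say which metric it is), it would
correspond to a value of 0.85 from this calculation.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R² for paired true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Illustrative values only; RMSE carries the same unit as the price itself.
y_true = [50, 75, 100, 125]
y_pred = [55, 70, 100, 130]
metrics = regression_metrics(y_true, y_pred)
```

MSE is expressed in squared price units, RMSE in the price unit itself, and R² is unitless.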
We also compared the three models and found linear regression to be the best among them; the
comparison is visualized as a bar chart. We further built a website that takes the parameters area
(in sq ft), BHK, number of bathrooms and locality as input and returns the predicted house price
using the multiple regression model, which gave the best accuracy among the three models we chose.
The trained model is exported from the notebook as a pickle file. It is integrated into a simple, user-friendly
website using a Flask server: API requests from the user are passed to the imported model, and the server returns a
suitable HTML response.
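The export-and-serve flow can be sketched without any web framework. In the example below the "model" is a plain
dictionary of made-up coefficients, and handle_request stands in for the body of a server route; both names, the JSON
field names and the coefficients are assumptions, and locality is omitted for brevity. A real Flask view would wrap
this logic and load the pickle once at startup.

```python
import json
import pickle


def predict_price(model, area_sqft, bath, bhk):
    """Apply linear coefficients stored in a plain dict."""
    coef = model["coef"]
    return (model["intercept"] + coef["area_sqft"] * area_sqft
            + coef["bath"] * bath + coef["bhk"] * bhk)


def handle_request(model_bytes, body):
    """Decode a JSON request body, run the model, return a JSON response."""
    model = pickle.loads(model_bytes)  # only unpickle data from a trusted source
    params = json.loads(body)
    price = predict_price(model, params["area_sqft"], params["bath"], params["bhk"])
    return json.dumps({"estimated_price": round(price, 2)})


# Made-up coefficients standing in for the trained model.
model = {"intercept": 10.0, "coef": {"area_sqft": 0.04, "bath": 5.0, "bhk": 3.0}}
model_bytes = pickle.dumps(model)  # the "export" step done in the notebook
response = handle_request(model_bytes, '{"area_sqft": 1000, "bath": 2, "bhk": 2}')
```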
The working looks like this -
✓ Discussion (critically compare your results to existing literature and reveal the
mechanism that might have caused the differences, identify the weaknesses of your methods,
do not ignore (economic) reality, identify promising directions for future research)
Our project findings align with Wang et al. (2021) and Varma et al. (2018), employing machine learning algorithms like
Linear Regression for accurate house price prediction. Similar to Varma et al. (2018), our study utilized machine
learning techniques achieving consumer satisfaction with accurate outputs. However, in contrast to Phan (2018), who
employed Random Forest algorithms, we found Multiple Linear Regression to be the optimal model among
alternatives. A limitation of our approach is the exclusive reliance on machine learning algorithms, potentially
overlooking crucial economic factors influencing house prices, as noted in the literature by Phan (2018). Future
research directions should integrate economic and real estate indicators into machine learning models, addressing
these limitations for a more comprehensive prediction approach. Moreover, incorporating advanced data visualization
techniques like Augmented Reality, proposed by Varma et al. (2018), could enhance user experience and decision-
making in real estate, offering avenues for further research. While our findings align with existing literature, we
acknowledge the need for future research integrating economic reality, advanced visualization, and a broader set of
influencing variables for more accurate and comprehensive house price prediction models. Such research would bridge
the gap between machine learning and economic reality, providing enhanced tools for real estate stakeholders.
✓ Conclusions (make sure you are building synthesis above your discussion and not
repeating your results, clearly indicate whether the research hypothesis tends to be
confirmed or not and whether the concept seems to be industrially promising
(economically sustainable))
The main goal of this project was to predict house prices, which we accomplished using several machine learning
algorithms: Linear Regression, Lasso and Decision Tree. Our evaluation shows that the Linear Regression model is more
accurate than the others. Moreover, the project provides a way to determine each attribute's contribution to the
prediction, so we conclude that it would be helpful to a variety of people. These prediction models are very efficient
for linearly dependent data, which is why we use linear regression techniques. Exploratory Data Analysis helps us
visualize the data better and decide which regression technique should be deployed. We use scatter plots to compare the
dependent variables and a bar plot to compare the accuracy of the individual models, which helps us decide which model
should be used. The same model may reach different accuracies when train_test_split is called with different values of
the test_size parameter. Currently we have used 90% of the data for training and 10% for testing. GridSearchCV
should be further calibrated so that it can handle not only more parameters for a given model but also more models at
a time.
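One way to compare several models at a time is to loop over the candidates and score each with the same
cross-validation folds, which also reduces the dependence on a single test_size choice. The sketch below does this in
plain Python for two toy candidates; it is not GridSearchCV, and the function names, fold count and data are
assumptions. The toy data is sorted and the folds are contiguous, so real data would normally be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple least-squares line y = a0 + a1 * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return lambda x: a0 + a1 * x


def cross_val_mse(fit, xs, ys, k=3):
    """Average test-fold MSE of the model produced by `fit`."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


# Toy area/price pairs, for illustration only.
xs = [900, 1000, 1200, 1500, 1800, 2000]
ys = [46, 50, 61, 74, 91, 99]
candidates = {"mean_baseline": fit_mean, "linear": fit_line}
results = {name: cross_val_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(results, key=results.get)  # lowest cross-validated MSE wins
```

Adding another candidate only requires one more entry in the candidates dictionary, which is the kind of extension the
last sentence above calls for.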