Project Report
net/publication/349477129
1 author:
Udit Deo
Amdocs
All content following this page was uploaded by Udit Deo on 21 February 2021.
A MINI PROJECT
REPORT
Submitted by
Udit Deo
17bcs057
Uday Deo
17bcs056
B. TECH.
IN
CERTIFICATE
REGRESSION TECHNIQUES” is the work of Udit Deo and Uday Deo who carried out the mini
Pooja Sharma
Assistant Professor
CSE, SMVDU
Abstract
List of Tables
List of Figures
List of Symbols and Abbreviations
1. INTRODUCTION
1.1 AIM and IMPORTANCE
1.1.1 Aim
1.1.2 Need and Motivation
2. DATASET
2.1 Steps in Preparing Data for Model
5. CONCLUSION
Abstract
House price forecasting is an important topic in real estate. The literature attempts to
derive useful knowledge from historical data of property markets. Machine learning
useful models for house buyers and sellers. The analysis reveals a large discrepancy between
house prices in the most expensive and the most affordable suburbs of Mumbai.
Moreover, experiments demonstrate that the Multiple Linear Regression that is based on
Aim
• Identify the important home price attributes which feed the model’s predictive power.
Need and Motivation
Having lived in India for many years, the one thing I had taken for granted is that
housing and rental prices continue to rise. Since the housing crisis of 2008, housing
prices have recovered remarkably well, especially in major housing markets. However, in the
4th quarter of 2016, I was surprised to read that Bombay housing prices had fallen the most in
the last 4 years. In fact, median resale prices for condos and coops fell 6.3%, marking the first
time there was a decline since Q1 of 2017. The decline has been partly attributed to political
uncertainty domestically and abroad and the 2014 election. This model is therefore intended to
maintain transparency among customers and to make price comparison easy. If a customer
finds the price of a house on a given website higher than the price predicted by the model, so
We scraped the data from 99acres.com, one of the leading real estate websites operating in
India.
Data exploration is the first step in data analysis and typically involves summarizing the main
characteristics of a data set, including its size, accuracy, initial patterns in the data and other
attributes. It is commonly conducted by data analysts using visual analytics tools, but it can
also be done with more advanced statistical software or in Python. Before it can conduct analysis on
data collected by multiple data sources and stored in data warehouses, an organization must
know how many cases are in a data set, what variables are included, how many missing
values there are and what general hypotheses the data is likely to support. An initial
exploration of the data set can help answer these questions by familiarizing analysts with the
We split the data 9:1 into training and testing sets, respectively.
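The 9:1 split can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split_9_1`, the fixed seed, and the sample listings are assumptions made for the example, not the report's actual code or data.

```python
import random


def train_test_split_9_1(rows, seed=0):
    """Shuffle the rows and return (train, test) in a 9:1 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = len(shuffled) * 9 // 10          # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Hypothetical listings: (area in sq. ft., price), invented for illustration only.
listings = [(500 + 10 * i, 50.0 + 1.5 * i) for i in range(100)]
train, test = train_test_split_9_1(listings)
print(len(train), len(test))
```

Shuffling before cutting matters when the scraped rows arrive in a meaningful order (for example, grouped by locality); otherwise the test set would cover only the last listings scraped.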
Data Visualization
By using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data. In the
world of Big Data, data visualization tools and technologies are essential to analyse
type and source, as well as suitable instruments to collect data. Data selection
precedes the actual practice of data collection. This definition distinguishes data
selection from selective data reporting (selectively excluding data that is not
analyses). The process of selecting suitable data for a research project can impact data
integrity.
The primary objective of data selection is the determination of appropriate data type,
the nature of the investigation, existing literature, and accessibility to necessary data
sources.
Correlation Heatmap
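A correlation heatmap is a colour-coded view of the pairwise Pearson correlation matrix of the numeric columns. The sketch below computes such a matrix with the standard library only; the column names and values are hypothetical and do not come from the scraped dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical columns, invented for illustration only.
columns = {
    "area": [600, 850, 1000, 1200, 1500],
    "bedrooms": [1, 2, 2, 3, 3],
    "price": [60, 90, 100, 130, 150],
}
names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

Each cell of the heatmap corresponds to one entry of `matrix`; values near 1 or -1 indicate attributes that move strongly with the price.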
Data Transformation
The log transformation can be used to make highly skewed distributions less skewed. This
can be valuable both for making patterns in the data more interpretable and for helping to
It is hard to discern a pattern in the upper panel whereas the strong relationship is shown
clearly in the lower panel. The comparison of the means of log-transformed data is actually a
comparison of geometric means. This occurs because, as shown below, the anti-log of
[Figure panels: Normal Price; Skewed Area; Normal Area; Skewed Price/Sq.; Normal Price/Sq.]
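The effect of the log transformation on a skewed variable, and the link between the mean of log-transformed data and the geometric mean, can be checked numerically. The sketch below is an illustration under assumptions: the `prices` list is invented, and `skewness` is a plain moment-based estimate written for this example.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed prices, invented for illustration only.
prices = [40, 45, 50, 55, 60, 70, 90, 150, 400, 1200]
log_prices = [math.log(p) for p in prices]

# The log-transformed values are less skewed than the raw values.
print(skewness(prices), skewness(log_prices))

# The anti-log of the mean of the logs equals the geometric mean.
mean_of_logs = sum(log_prices) / len(log_prices)
geometric_mean = math.exp(mean_of_logs)
direct_geometric_mean = math.prod(prices) ** (1 / len(prices))
print(geometric_mean, direct_geometric_mean)
```

In symbols, exp((1/n) Σ ln x_i) = (Π x_i)^(1/n), which is why comparing means on the log scale amounts to comparing geometric means.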
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
XGBoost
MODELS USED
Regression Model
• It is mostly used for finding out the relationship between variables and forecasting.
Real Vs Predicted
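To make the regression idea concrete, the sketch below fits a multiple linear regression by solving the normal equations (XᵀX)β = Xᵀy in pure Python. It is not the report's implementation; the helper names and the two features (area and bedrooms) are assumptions chosen for illustration, and the targets are generated from a known linear rule so the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear_regression(features, targets):
    """Return [intercept, coef_1, ..., coef_k] minimising squared error."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, feature_row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))


# Hypothetical data: (area in hundreds of sq. ft., bedrooms) -> price.
features = [(6, 1), (8, 2), (10, 2), (12, 3), (15, 3), (9, 1)]
targets = [5 + 10 * area + 4 * beds for area, beds in features]

coefs = fit_linear_regression(features, targets)
print([round(c, 3) for c in coefs])
print(round(predict(coefs, (11, 2)), 3))
```

In practice a library routine such as scikit-learn's `LinearRegression` would normally be used; the hand-written solver here only serves to show what is being computed.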
Random Forest Regression Model
• Bagging, in the Random Forest method, involves training each decision tree on a
different data sample where sampling is done with replacement.
• The basic idea behind this is to combine multiple decision trees in determining the
final output rather than relying on individual decision trees.
Real Vs Predicted
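The bagging step can be sketched with very small trees. The example below is a simplification, not a full Random Forest: each "tree" is a one-split regression stump on a single feature, and there is no random feature selection. All names and the data are assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a split must leave points on both sides
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the overall mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical data: area in sq. ft. -> price, invented for illustration only.
areas = [500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400]
prices = [52, 61, 69, 83, 88, 101, 112, 118, 133, 140]

forest = fit_bagged_stumps(areas, prices)
print(predict_forest(forest, 500), predict_forest(forest, 1400))
```

Because each stump sees a different bootstrap sample, the split thresholds differ from tree to tree, and averaging them gives a smoother prediction than any single stump.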
XGBoost Regressor Model
• The XGBoost library implements the gradient boosting decision tree algorithm.
• Boosting is an ensemble technique where new models are added to correct the errors
made by existing models.
Real Vs Predicted
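The boosting idea, where each new model corrects the errors of the models before it, can be sketched as plain gradient boosting with squared-error loss. This is a simplified stand-in and not XGBoost itself: it omits XGBoost's regularised objective and second-order terms, and it uses one-split stumps on a single feature. The names, learning rate, and data are assumptions for illustration.

```python
def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a split must leave points on both sides
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the overall mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the current model
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: area in sq. ft. -> price, invented for illustration only.
areas = [500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400]
prices = [52, 61, 69, 83, 88, 101, 112, 118, 133, 140]

model = fit_boosted_stumps(areas, prices)
mean_price = sum(prices) / len(prices)
mse_baseline = mse(prices, [mean_price] * len(prices))
mse_boosted = mse(prices, [predict_boosted(model, x) for x in areas])
print(mse_baseline, mse_boosted)
```

The contrast with bagging is that the stumps here are trained sequentially on residuals rather than independently on bootstrap samples.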
RESULTS AND DISCUSSIONS
Linear Regression displayed the best performance on this dataset and is suitable for
deployment.
The Random Forest Regressor and XGBoost Regressor performed considerably worse, so they are
not recommended for deployment.
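A comparison of this kind is usually based on error metrics computed on the test set. The sketch below shows two common choices, RMSE and R²; which metric the report actually used is not stated here, so this is an assumption, and the numbers are invented placeholders rather than the report's results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: lower is better."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def r2_score(actual, predicted):
    """Coefficient of determination: 1.0 is a perfect fit."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Placeholder values, invented for illustration only (not the report's results).
actual = [100.0, 150.0, 200.0, 250.0]
predictions = {
    "model_a": [110.0, 140.0, 210.0, 240.0],
    "model_b": [150.0, 150.0, 220.0, 200.0],
}

for name, predicted in predictions.items():
    print(name, round(rmse(actual, predicted), 3), round(r2_score(actual, predicted), 3))
```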
The model is deployed as a Flask web application written in Python, with an HTML and CSS
front end.
Conclusion
We have achieved our aim, meeting all the objectives listed in the Aim section.
The circle rate is the most effective attribute for predicting the
house price, and Linear Regression is the most effective model for our dataset with