0% found this document useful (0 votes)
41 views14 pages

House Price Prediction

This document discusses predicting house prices using machine learning models. It performs data preprocessing steps like dropping unnecessary columns, handling missing values, feature scaling and encoding categorical variables. Different regression models like linear regression, ridge regression, decision trees are applied to predict house prices.

Uploaded by

Vinay Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views14 pages

House Price Prediction

This document discusses predicting house prices using machine learning models. It performs data preprocessing steps like dropping unnecessary columns, handling missing values, feature scaling and encoding categorical variables. Different regression models like linear regression, ridge regression, decision trees are applied to predict house prices.

Uploaded by

Vinay Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

7/8/23, 4:15 PM house price

House Price Prediction


In [ ]: #Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import FastMarkerCluster
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Ridge

In [ ]: # Importing the dataset


data = pd.read_csv('https://raw.githubusercontent.com/rashida048/Datasets/master
data.head()

Out[ ]: id date price bedrooms bathrooms sqft_living sqft_lot f

0 7129300520 20141013T000000 221900 3 1.00 1180 5650

1 6414100192 20141209T000000 538000 3 2.25 2570 7242

2 5631500400 20150225T000000 180000 2 1.00 770 10000

3 2487200875 20141209T000000 604000 4 3.00 1960 5000

4 1954400510 20150218T000000 510000 3 2.00 1680 8080

5 rows × 21 columns

In [ ]: #droping the unnecessary columns such as id, date, zipcode , lat and long
data.drop(['id','date'],axis=1,inplace=True)
data.head()

file:///E:/Data Science Course/Projects/house price.html 1/13


7/8/23, 4:15 PM house price

Out[ ]: price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condit

0 221900 3 1.00 1180 5650 1.0 0 0

1 538000 3 2.25 2570 7242 2.0 0 0

2 180000 2 1.00 770 10000 1.0 0 0

3 604000 4 3.00 1960 5000 1.0 0 0

4 510000 3 2.00 1680 8080 1.0 0 0

In [ ]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 21613 non-null int64
1 bedrooms 21613 non-null int64
2 bathrooms 21613 non-null float64
3 sqft_living 21613 non-null int64
4 sqft_lot 21613 non-null int64
5 floors 21613 non-null float64
6 waterfront 21613 non-null int64
7 view 21613 non-null int64
8 condition 21613 non-null int64
9 grade 21613 non-null int64
10 sqft_above 21613 non-null int64
11 sqft_basement 21613 non-null int64
12 yr_built 21613 non-null int64
13 yr_renovated 21613 non-null int64
14 zipcode 21613 non-null int64
15 lat 21613 non-null float64
16 long 21613 non-null float64
17 sqft_living15 21613 non-null int64
18 sqft_lot15 21613 non-null int64
dtypes: float64(4), int64(15)
memory usage: 3.1 MB

In [ ]: data.describe()

file:///E:/Data Science Course/Projects/house price.html 2/13


7/8/23, 4:15 PM house price

Out[ ]: price bedrooms bathrooms sqft_living sqft_lot fl

count 2.161300e+04 21613.000000 21613.000000 21613.000000 2.161300e+04 21613.000

mean 5.400881e+05 3.370842 2.114757 2079.899736 1.510697e+04 1.494

std 3.671272e+05 0.930062 0.770163 918.440897 4.142051e+04 0.539

min 7.500000e+04 0.000000 0.000000 290.000000 5.200000e+02 1.000

25% 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 1.000

50% 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500

75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 2.000

max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500

In [ ]: # checking for null values/missing values


data.isnull().sum()

Out[ ]: price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0
dtype: int64

In [ ]: data.nunique()

file:///E:/Data Science Course/Projects/house price.html 3/13


7/8/23, 4:15 PM house price

Out[ ]: price 4032


bedrooms 13
bathrooms 30
sqft_living 1038
sqft_lot 9782
floors 6
waterfront 2
view 5
condition 5
grade 12
sqft_above 946
sqft_basement 306
yr_built 116
yr_renovated 70
zipcode 70
lat 5034
long 752
sqft_living15 777
sqft_lot15 8689
dtype: int64

Data Preprocessing
In [ ]: # changing float to integer
data['bathrooms'] = data['bathrooms'].astype(int)
data['floors'] = data['floors'].astype(int)
# renaming the column yr_built to age and changing the values to age
data.rename(columns={'yr_built':'age'},inplace=True)
data['age'] = 2023 - data['age']
# changing the column yr_renovated to renovated and changing the values to 0 and
data.rename(columns={'yr_renovated':'renovated'},inplace=True)
data['renovated'] = data['renovated'].apply(lambda x: 0 if x == 0 else 1)

In [ ]: # using simple feature scaling


data['sqft_living'] = data['sqft_living']/data['sqft_living'].max()
data['sqft_living15'] = data['sqft_living15']/data['sqft_living15'].max()
data['sqft_lot'] = data['sqft_lot']/data['sqft_lot'].max()
data['sqft_above'] = data['sqft_above']/data['sqft_above'].max()
data['sqft_basement'] = data['sqft_basement']/data['sqft_basement'].max()
data['sqft_lot15'] = data['sqft_lot15']/data['sqft_lot15'].max()

In [ ]: data.head()

Out[ ]: price bedrooms bathrooms sqft_living sqft_lot floors waterfront view cond

0 221900 3 1 0.087149 0.003421 1 0 0

1 538000 3 2 0.189808 0.004385 2 0 0

2 180000 2 1 0.056869 0.006056 1 0 0

3 604000 4 3 0.144756 0.003028 1 0 0

4 510000 3 2 0.124077 0.004893 1 0 0

file:///E:/Data Science Course/Projects/house price.html 4/13


7/8/23, 4:15 PM house price

Exploratory Data Analysis


Correlation Matrix to find the relationship between the variables

In [ ]: # using correlation statistical method to find the relation between the price an
data.corr()['price'].sort_values(ascending=False)

Out[ ]: price 1.000000


sqft_living 0.702035
grade 0.667434
sqft_above 0.605567
sqft_living15 0.585379
bathrooms 0.510072
view 0.397293
sqft_basement 0.323816
bedrooms 0.308350
lat 0.307003
waterfront 0.266369
floors 0.237211
renovated 0.126092
sqft_lot 0.089661
sqft_lot15 0.082447
condition 0.036362
long 0.021626
zipcode -0.053203
age -0.054012
Name: price, dtype: float64

In [ ]: plt.figure(figsize=(20,20))
sns.heatmap(data.corr(),annot=True)
plt.show()

file:///E:/Data Science Course/Projects/house price.html 5/13


7/8/23, 4:15 PM house price

Visualizing the coorelation with price

In [ ]: data.corr()['price'][:-1].sort_values().plot(kind='bar')

Out[ ]: <Axes: >

file:///E:/Data Science Course/Projects/house price.html 6/13


7/8/23, 4:15 PM house price

Visulaizing the data

In [ ]: # visualizing the relation between price and sqft_living, sqft_lot, sqft_above,


fig, ax = plt.subplots(4,4,figsize=(20,20))
sns.scatterplot( x = data['sqft_living'], y = data['price'],ax=ax[0,0])
sns.scatterplot( x = data['sqft_lot'], y = data['price'],ax=ax[0,1])
sns.scatterplot( x = data['sqft_above'], y = data['price'],ax=ax[0,2])
sns.scatterplot( x = data['sqft_basement'], y = data['price'],ax=ax[0,3])
sns.scatterplot( x = data['sqft_living15'], y = data['price'],ax=ax[1,0])
sns.scatterplot( x = data['sqft_lot15'], y = data['price'],ax=ax[1,1])
sns.lineplot( x = data['age'], y = data['price'],ax=ax[1,2])
sns.boxplot( x = data['renovated'], y = data['price'],ax=ax[1,3])
sns.scatterplot( x = data['bedrooms'], y = data['price'],ax=ax[2,0])
sns.lineplot( x = data['bathrooms'], y = data['price'],ax=ax[2,1])
sns.barplot( x = data['floors'], y = data['price'],ax=ax[2,2])
sns.boxplot( x = data['waterfront'], y = data['price'],ax=ax[2,3])
sns.barplot( x = data['view'], y = data['price'],ax=ax[3,0])
sns.barplot( x = data['condition'], y = data['price'],ax=ax[3,1])
sns.lineplot( x = data['grade'], y = data['price'],ax=ax[3,2])
sns.lineplot( x = data['age'], y = data['renovated'],ax=ax[3,3])
plt.show()

file:///E:/Data Science Course/Projects/house price.html 7/13


7/8/23, 4:15 PM house price

Plotting the location of the houses based on longitude and latitude on


the map

In [ ]: # adding a new column price_range and categorizing the price into 4 categories
data['price_range'] = pd.cut(data['price'],bins=[0,321950,450000,645000,1295648]

In [ ]: map = folium.Map(location=[47.5480, -121.9836],zoom_start=8)


marker_cluster = FastMarkerCluster(data[['lat', 'long']].values.tolist()).add_to
map

file:///E:/Data Science Course/Projects/house price.html 8/13


7/8/23, 4:15 PM house price

22
Out[ ]: Make this Notebook Trusted to load map: File -> Trust Notebook
+ 13 34

− 47
6

25

36

35

7 56
Leaflet (https://leafletjs.com) | Data by © OpenStreetMap (http://openstreetmap.org), under ODbL
(http://www.openstreetmap.org/copyright).
30 52

Train/Test Split
In [ ]: data.drop(['price_range'],axis=1,inplace=True)
X_train, X_test, y_train, y_test = train_test_split(data.drop('price',axis=1),da

Model Training

Using pipeline to combine the transformers and estimators


and fit the model
In [ ]: input = [('scale',StandardScaler()),('polynomial', PolynomialFeatures(degree=2))
pipe = Pipeline(input)
pipe

Out[ ]: ▸ Pipeline

▸ StandardScaler

▸ PolynomialFeatures

▸ LinearRegression

In [ ]: #training the model


pipe.fit(X_train,y_train)
pipe.score(X_test,y_test)

Out[ ]: 0.8271896429378042

In [ ]: #testing the model


pipe_pred = pipe.predict(X_test)
r2_score(y_test,pipe_pred)

file:///E:/Data Science Course/Projects/house price.html 9/13


7/8/23, 4:15 PM house price

Out[ ]: 0.8271896429378042

Ridge Regression
In [ ]: Ridgemodel = Ridge(alpha = 0.001)
Ridgemodel

Out[ ]: ▾ Ridge

Ridge(alpha=0.001)

In [ ]: # training the model


Ridgemodel.fit(X_train,y_train)
Ridgemodel.score(X_test,y_test)

In [ ]: #testing the model


r_pred = Ridgemodel.predict(X_test)
r2_score(y_test,r_pred)

Out[ ]: 0.7123220593275169

Random Forest Regression


In [ ]: from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor

Out[ ]: ▾ RandomForestRegressor

RandomForestRegressor(random_state=0)

In [ ]: # training the model


regressor.fit(X_train,y_train)
regressor.score(X_test,y_test)

Out[ ]: 0.878968081057204

In [ ]: #testing the model


yhat = regressor.predict(X_test)
r2_score(y_test,yhat)

Out[ ]: 0.878968081057204

Model Evalution

Distribution plot from the models predictions and the


actual values
In [ ]: # displot of the actual price and predicted price for all models
fig, ax = plt.subplots(1,3,figsize=(20,5))
sns.distplot(y_test,ax=ax[0])
sns.distplot(pipe_pred,ax=ax[0])

file:///E:/Data Science Course/Projects/house price.html 10/13


7/8/23, 4:15 PM house price

sns.distplot(y_test,ax=ax[1])
sns.distplot(r_pred,ax=ax[1])
sns.distplot(y_test,ax=ax[2])
sns.distplot(yhat,ax=ax[2])
# legends
ax[0].legend(['Actual Price','Predicted Price'])
ax[1].legend(['Actual Price','Predicted Price'])
ax[2].legend(['Actual Price','Predicted Price'])
#model name as title
ax[0].set_title('Linear Regression')
ax[1].set_title('Ridge Regression')
ax[2].set_title('Random Forest Regression')
plt.show()

Error Evaluation
In [ ]: #plot the graph to compare mae, mse, rmse for all models
fig, ax = plt.subplots(1,3,figsize=(20,5))
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest'],y=[mean_a
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest'],y=[mean_s
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest'],y=[np.sqr
# label for the graph
ax[0].set_ylabel('Mean Absolute Error')
ax[1].set_ylabel('Mean Squared Error')
ax[2].set_ylabel('Root Mean Squared Error')
plt.show()

Accuracy Evaluation

In [ ]: # plot accuracy of all models in the same graph


fig, ax = plt.subplots(figsize=(7,5))
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest Regression'
ax.set_title('Accuracy of all models')
plt.show()

file:///E:/Data Science Course/Projects/house price.html 11/13


7/8/23, 4:15 PM house price

Predicting the price of a new house


In [ ]: #input the values
bedrooms = 3
bathrooms = 2
sqft_living = 2000
sqft_lot = 10000
floors = 2
waterfront = 0
view = 0
condition = 3
grade = 8
sqft_above = 2000
sqft_basement = 0
yr_built = 1990
yr_renovated = 0
zipcode = 98001
lat = 47.5480
long = -121.9836
sqft_living15 = 2000
sqft_lot15 = 10000

In [ ]: #predicting the price using random forest regression


price = regressor.predict([[bedrooms,bathrooms,sqft_living,sqft_lot,floors,water
print('The price of the house is $',price[0])

The price of the house is $ 1078694.0533333335

Conclusion

file:///E:/Data Science Course/Projects/house price.html 12/13


7/8/23, 4:15 PM house price

From the analysis, we can see that the Random Forest Regression model performed
better than the Ridge Regression model and Polynomial Regression model.

During the EDA process, we found out that the location of the house is a very important
factor in determining the price of the house, since houese with similar area and other
features can have different prices depending on the location of the house.

The location of the houses has been plotted on the map using the longitude and latitude
values which makesrole of location in determining the price of the house more clear.

file:///E:/Data Science Course/Projects/house price.html 13/13


NOTE: For some reasons, the map was not rendered properly when
the notebook was converted into pdf. So here is the image of the
rendered map showing the locations of the houses, color coded
according to their price range

You might also like