House Price Prediction
5 rows × 21 columns
In [ ]: # dropping the unnecessary columns id and date (zipcode, lat and long are kept)
data.drop(['id','date'],axis=1,inplace=True)
data.head()
Out[ ]: price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condit
In [ ]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 21613 non-null int64
1 bedrooms 21613 non-null int64
2 bathrooms 21613 non-null float64
3 sqft_living 21613 non-null int64
4 sqft_lot 21613 non-null int64
5 floors 21613 non-null float64
6 waterfront 21613 non-null int64
7 view 21613 non-null int64
8 condition 21613 non-null int64
9 grade 21613 non-null int64
10 sqft_above 21613 non-null int64
11 sqft_basement 21613 non-null int64
12 yr_built 21613 non-null int64
13 yr_renovated 21613 non-null int64
14 zipcode 21613 non-null int64
15 lat 21613 non-null float64
16 long 21613 non-null float64
17 sqft_living15 21613 non-null int64
18 sqft_lot15 21613 non-null int64
dtypes: float64(4), int64(15)
memory usage: 3.1 MB
In [ ]: data.describe()
In [ ]: data.isnull().sum()
Out[ ]: price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0
dtype: int64
In [ ]: data.nunique()
Data Preprocessing
In [ ]: # changing float to integer
data['bathrooms'] = data['bathrooms'].astype(int)
data['floors'] = data['floors'].astype(int)
# renaming the column yr_built to age and changing the values to age
data.rename(columns={'yr_built':'age'},inplace=True)
data['age'] = 2023 - data['age']
# changing the column yr_renovated to renovated and changing the values to 0 and
data.rename(columns={'yr_renovated':'renovated'},inplace=True)
data['renovated'] = data['renovated'].apply(lambda x: 0 if x == 0 else 1)
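The preprocessing steps above can be tried on a small frame first. The following is a minimal sketch: the rows are made up for illustration, and `toy`, `prepared` and `reference_year` are hypothetical names, not from the notebook. It also makes visible that `astype(int)` truncates, so 2.75 bathrooms becomes 2.

```python
import pandas as pd

# Toy rows (made up for illustration) using the dataset's column names
toy = pd.DataFrame({
    "bathrooms": [1.0, 2.25, 2.75],
    "floors": [1.0, 1.5, 2.0],
    "yr_built": [1955, 1987, 2014],
    "yr_renovated": [0, 1991, 0],
})

reference_year = 2023  # the constant the notebook subtracts yr_built from

prepared = toy.assign(
    # astype(int) truncates toward zero rather than rounding
    bathrooms=toy["bathrooms"].astype(int),
    floors=toy["floors"].astype(int),
    # age of the house in years at the reference year
    age=reference_year - toy["yr_built"],
    # 1 if a renovation year is recorded, 0 otherwise
    renovated=(toy["yr_renovated"] != 0).astype(int),
).drop(columns=["yr_built", "yr_renovated"])

print(prepared)
```

Building a new frame with `assign` instead of `inplace=True` keeps the raw columns available for comparison, which can help when checking how much information the truncation discards.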
In [ ]: data.head()
Out[ ]: price bedrooms bathrooms sqft_living sqft_lot floors waterfront view cond
In [ ]: # using correlation statistical method to find the relation between the price an
data.corr()['price'].sort_values(ascending=False)
In [ ]: plt.figure(figsize=(20,20))
sns.heatmap(data.corr(),annot=True)
plt.show()
In [ ]: data.corr()['price'][:-1].sort_values().plot(kind='bar')
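One detail in the cell above is worth checking: `[:-1]` slices by position, so it drops the last column of the frame rather than the `price` entry itself. A minimal sketch of dropping the target by label instead, on synthetic data (the numbers and the name `price_corr` are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
sqft_living = rng.uniform(500, 4000, n)
grade = rng.integers(3, 13, n)
sqft_lot = rng.uniform(1000, 20000, n)
# Synthetic price: driven mostly by living area, then grade, plus noise
price = 150 * sqft_living + 20000 * grade + rng.normal(0, 50000, n)

toy = pd.DataFrame({
    "price": price,
    "sqft_living": sqft_living,
    "grade": grade,
    "sqft_lot": sqft_lot,
})

# Correlation of every feature with the target; the target's own
# entry (always 1.0) is removed by label, independent of column order
price_corr = toy.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

The resulting series can be passed to `.plot(kind='bar')` exactly as in the notebook cell.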
In [ ]: # adding a new column price_range and categorizing the price into 4 categories
data['price_range'] = pd.cut(data['price'],bins=[0,321950,450000,645000,1295648]
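The binning call above is cut off in this export, so its labels are not visible. As a rough sketch of how `pd.cut` behaves with the same bin edges, the example below uses hypothetical labels (`low`, `medium`, `high`, `very high`) and made-up prices. By default the intervals are closed on the right, and values beyond the last edge become `NaN`.

```python
import pandas as pd

# Made-up prices, including one exactly on an edge and one above the last edge
prices = pd.Series([250000, 321950, 400000, 600000, 1000000, 2000000])

bins = [0, 321950, 450000, 645000, 1295648]      # bin edges from the notebook cell
labels = ["low", "medium", "high", "very high"]  # hypothetical labels

# Intervals are (0, 321950], (321950, 450000], ... so 321950 falls in "low"
price_range = pd.cut(prices, bins=bins, labels=labels)
print(price_range)
```

If prices above the last edge should be kept in the top category, one option is to replace the last edge with `float("inf")`.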
Out[ ]: [Interactive Leaflet map of house locations; not rendered in this export. Map data © OpenStreetMap, under ODbL.]
Train/Test Split
In [ ]: data.drop(['price_range'],axis=1,inplace=True)
X_train, X_test, y_train, y_test = train_test_split(data.drop('price',axis=1),da
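The split call above is truncated, so its remaining arguments are unknown. The sketch below shows the general pattern on a synthetic frame; `test_size=0.2` and `random_state=42` are illustrative choices, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the housing frame
toy = pd.DataFrame({
    "price": rng.uniform(100000, 900000, 100),
    "sqft_living": rng.uniform(500, 4000, 100),
    "grade": rng.integers(3, 13, 100),
})

X = toy.drop(columns="price")  # features: everything except the target
y = toy["price"]               # target

# test_size and random_state are assumptions made for this example
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so scores from different models are compared on the same held-out rows.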
Model Training
Out[ ]: ▸ Pipeline
▸ StandardScaler
▸ PolynomialFeatures
▸ LinearRegression
Out[ ]: 0.8271896429378042
Out[ ]: 0.8271896429378042
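The cell that built this pipeline is missing from the export; only its display (StandardScaler, PolynomialFeatures, LinearRegression) survives. A minimal sketch of such a pipeline on synthetic data follows. The step names, `degree=2` and the data are assumptions, and the printed score is for the synthetic data, not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
# Synthetic target with a quadratic term that a purely linear fit would miss
y = 3 * X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.1, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

poly_model = Pipeline([
    ("scale", StandardScaler()),             # standardize each feature
    ("poly", PolynomialFeatures(degree=2)),  # degree is an assumption
    ("reg", LinearRegression()),             # ordinary least squares on expanded features
])
poly_model.fit(X_train, y_train)

# For regressors, score() returns R^2 on the given data
r2 = poly_model.score(X_test, y_test)
print(round(r2, 3))
```

Wrapping the steps in a `Pipeline` means the scaler is fitted on the training rows only and then reused on the test rows, which avoids leaking test statistics into training.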
Ridge Regression
In [ ]: Ridgemodel = Ridge(alpha = 0.001)
Ridgemodel
Out[ ]: ▾ Ridge
Ridge(alpha=0.001)
Out[ ]: 0.7123220593275169
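The fit and score cells for Ridge are also missing here. As a hedged illustration of what `alpha` controls, the sketch below fits Ridge on synthetic data with the notebook's `alpha=0.001` and with a deliberately exaggerated `alpha=1000.0` (an assumption chosen for contrast). With a penalty as small as 0.001, the fit is typically very close to plain least squares.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic linear target with known coefficients plus noise
y = X @ np.array([4.0, -2.0, 0.5]) + rng.normal(0, 0.5, 200)

weak = Ridge(alpha=0.001).fit(X, y)     # alpha value used in the notebook
strong = Ridge(alpha=1000.0).fit(X, y)  # exaggerated penalty, for contrast only

# A larger alpha shrinks the coefficient vector toward zero
weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak.score(X, y), weak_norm, strong_norm)
```

Because the penalty acts on coefficient size, Ridge is sensitive to feature scale; standardizing the features before fitting is a common precaution.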
Out[ ]: ▾ RandomForestRegressor
RandomForestRegressor(random_state=0)
Out[ ]: 0.878968081057204
Out[ ]: 0.878968081057204
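The training cell for the random forest is not in the export either; only the estimator display `RandomForestRegressor(random_state=0)` is. A minimal sketch with that same constructor call on synthetic data is shown below. The data, the split settings and the names `forest`, `rf_r2` and `importances` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
# Synthetic target with a step and an interaction, both awkward for linear models
y = 10 * (X[:, 0] > 0.5) + 5 * X[:, 1] * X[:, 2] + rng.normal(0, 0.2, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(random_state=0)  # same settings as the display above
forest.fit(X_train, y_train)

rf_r2 = forest.score(X_test, y_test)       # R^2 on held-out rows
importances = forest.feature_importances_  # impurity-based importances, summing to 1
print(round(rf_r2, 3), importances)
```

Tree ensembles need no feature scaling and can capture thresholds and interactions directly, which is one plausible reason a forest can outscore linear models on data like house prices.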
Model Evaluation
sns.distplot(y_test,ax=ax[1])
sns.distplot(r_pred,ax=ax[1])
sns.distplot(y_test,ax=ax[2])
sns.distplot(yhat,ax=ax[2])
# legends
ax[0].legend(['Actual Price','Predicted Price'])
ax[1].legend(['Actual Price','Predicted Price'])
ax[2].legend(['Actual Price','Predicted Price'])
#model name as title
ax[0].set_title('Linear Regression')
ax[1].set_title('Ridge Regression')
ax[2].set_title('Random Forest Regression')
plt.show()
Error Evaluation
In [ ]: #plot the graph to compare mae, mse, rmse for all models
fig, ax = plt.subplots(1,3,figsize=(20,5))
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest'],y=[mean_a
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest'],y=[mean_s
sns.barplot(x=['Linear Regression','Ridge Regression','Random Forest'],y=[np.sqr
# label for the graph
ax[0].set_ylabel('Mean Absolute Error')
ax[1].set_ylabel('Mean Squared Error')
ax[2].set_ylabel('Root Mean Squared Error')
plt.show()
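The bar plot calls above are truncated, but the three metrics they compare can be computed as in the sketch below. The prices and predictions are made up, and `y_true` and `y_hat` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted prices
y_true = np.array([300000.0, 450000.0, 600000.0, 800000.0])
y_hat = np.array([310000.0, 430000.0, 650000.0, 760000.0])

mae = mean_absolute_error(y_true, y_hat)  # mean of |error|, in price units
mse = mean_squared_error(y_true, y_hat)   # mean of error^2, in squared price units
rmse = np.sqrt(mse)                       # back in price units

print(mae, mse, rmse)
```

MAE and RMSE are in the same unit as the price, while MSE is in squared units, so the three panels sit on very different scales. RMSE weights large errors more heavily than MAE does, so a wide gap between the two suggests a few large misses.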
Accuracy Evaluation
Conclusion
From the analysis, we can see that the Random Forest Regression model performed
better than the Ridge Regression model and the Polynomial Regression model, with a
score of about 0.88 against about 0.71 and 0.83 respectively.
During the EDA process, we found that the location of a house is a very important
factor in determining its price, since houses with similar area and other features
can have different prices depending on where they are located.
The locations of the houses have been plotted on a map using the longitude and latitude
values, which makes the role of location in determining the price clearer.