ml project part a 1

This project aims to predict property prices in California using district-level features through simple and multiple linear regression models. The objective is to identify key variables influencing housing prices and evaluate model performance using metrics like MSE, RMSE, and R-Squared. Deliverables include exploratory data analysis, data preprocessing, model development, and proper documentation of the process.

Uploaded by

Fahad King
Copyright
© All Rights Reserved

Part A: Property Price Prediction

1. Overview
This project focuses on predicting property prices in various districts of
California using several district-level features. By building a predictive model, we
aim to identify key variables that influence housing prices and improve the
accuracy of house value predictions. We will utilize simple linear regression and
multiple linear regression to address this regression task, ensuring proper data
handling and model evaluation.

2. Problem Statement
The objective is to predict the median house value in California districts based
on features such as income, the number of rooms, geographical location, and
proximity to the ocean. Given the dataset, we will develop regression models,
evaluate their performance, and determine which model provides the best
balance between predictive accuracy and interpretability.

3. Dataset Information
The dataset information and variables can be found in the Data Information.pdf
file.

4. Deliverables
- Exploratory Data Analysis (EDA) with visualizations and summary statistics.
- Data Preprocessing, including handling missing values and encoding
categorical variables.
- Model Development using:
  - Simple Linear Regression
  - Multiple Linear Regression
- Evaluation of the models using relevant metrics (MSE, RMSE, R-Squared).

5. Success Criteria
- The model should achieve high predictive accuracy while remaining
interpretable.
- Evaluation metrics such as MSE, RMSE, and R-Squared will be used to
measure the model’s performance.
- Ensure proper documentation of all steps and present visualizations that
help explain the data and model outcomes.
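
The three metrics named above can be computed directly from their definitions. The sketch below is a minimal illustration, not part of the original assignment: the function name `regression_metrics` and the sample values are assumptions chosen for the example.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))                    # mean of squared errors
    rmse = float(np.sqrt(mse))                              # same units as the target
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Made-up values, for illustration only
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

RMSE is usually the easiest of the three to communicate because it is expressed in the same units as the target, while R-Squared is unitless and describes the share of variance explained.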

6. Guidelines
- Split your data into training and testing sets so that overfitting can be detected on data the model has not seen.
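- Where a model exposes hyperparameters, tune them to improve performance (plain linear regression has very few, so feature selection and preprocessing choices matter more here).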
- Report all the steps taken in the data preprocessing, modeling, and evaluation
phases.
- Provide a final model that balances accuracy with interpretability.
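
One way to act on the first guideline is to compare the score on the training set with the score on the held-out test set; a large gap between the two suggests overfitting. The following is a rough sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions made up for this example, not the project dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the project dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)  # R-squared on data the model was fitted to
test_r2 = model.score(X_te, y_te)   # R-squared on held-out data

print(f"Train R2: {train_r2:.3f}, Test R2: {test_r2:.3f}")
```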

7. Tools Required
- Python (with libraries such as pandas, scikit-learn, matplotlib, seaborn, etc.)
- Jupyter Notebook or any IDE suitable for running Python code

Step-by-Step Guide

Step 1: Exploratory Data Analysis (EDA)

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('path_to_data_file.csv')

# Display basic information about the dataset
data.info()  # info() prints its summary itself and returns None
print(data.describe())

# Visualize the distribution of the median house value
plt.figure(figsize=(10, 6))
sns.histplot(data['median_house_value'], kde=True)
plt.title('Distribution of Median House Value')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()

# Visualize relationships between features and median house value.
# pairplot creates its own figure, so no plt.figure() call is needed.
grid = sns.pairplot(
    data,
    x_vars=['median_income', 'total_rooms', 'housing_median_age'],
    y_vars=['median_house_value'],
    kind='scatter',
)
# Title the whole grid rather than only the last subplot
grid.figure.suptitle('Relationships between Features and Median House Value', y=1.02)
plt.show()
```
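
For the summary-statistics part of the EDA, one possible complement to the plots is to rank features by their correlation with the target. This is a small sketch on a toy frame: the values are made up, and only the column names are taken from the code above.

```python
import pandas as pd

# Toy stand-in for the project data; the values are made up for illustration
toy = pd.DataFrame({
    "median_income": [1.5, 2.0, 3.5, 5.0, 8.0],
    "housing_median_age": [40, 15, 30, 22, 10],
    "median_house_value": [90000, 120000, 200000, 280000, 450000],
})

# Pearson correlation of each numeric column with the target, strongest first
correlations = (
    toy.select_dtypes("number")
    .corr()["median_house_value"]
    .drop("median_house_value")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

The feature with the largest absolute correlation is a natural candidate for the single predictor in a simple linear regression.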

Step 2: Data Preprocessing

```python
from sklearn.model_selection import train_test_split

# Handle missing values (if any)
data = data.dropna()

# Encode categorical variables (if any)
data = pd.get_dummies(data, drop_first=True)

# Split the data into training and testing sets
X = data.drop('median_house_value', axis=1)
y = data['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
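
To see what `pd.get_dummies(..., drop_first=True)` does to a categorical column, consider the toy example below. The column name `ocean_proximity` and its categories are assumptions for illustration; check the Data Information.pdf file for the actual names.

```python
import pandas as pd

# Hypothetical categorical column; the name and categories are assumptions
toy = pd.DataFrame({
    "median_income": [2.5, 4.0, 6.1],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"],
})

# Each category becomes an indicator column; drop_first removes the first
# category (alphabetically), which then serves as the reference level
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear indicator columns, which keeps the coefficients of a linear regression identifiable.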

Step 3: Model Development


Simple Linear Regression

```python
from sklearn.linear_model import LinearRegression

# Simple Linear Regression: a single predictor, median_income
simple_model = LinearRegression()
simple_model.fit(X_train[['median_income']], y_train)

# Prediction using the simple model
y_pred_simple = simple_model.predict(X_test[['median_income']])
```
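
Part of the appeal of a simple linear regression is that its two fitted parameters can be read directly. The sketch below uses made-up points that lie exactly on a line (an assumption for illustration) to show where scikit-learn exposes the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 50000 * x + 20000
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 50000.0 * x_demo.ravel() + 20000.0

demo_model = LinearRegression().fit(x_demo, y_demo)
slope = demo_model.coef_[0]        # change in predicted value per unit of x
intercept = demo_model.intercept_  # predicted value when x is 0
print(f"slope={slope:.1f}, intercept={intercept:.1f}")
```

Applied to `simple_model`, the same attributes would give the estimated change in median house value per unit of `median_income`.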

Multiple Linear Regression

```python
# Multiple Linear Regression: all available features
multiple_model = LinearRegression()
multiple_model.fit(X_train, y_train)

# Prediction using the multiple model
y_pred_multiple = multiple_model.predict(X_test)
```
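
To identify which variables influence the prediction, the fitted coefficients can be paired with their feature names. This is a minimal sketch on synthetic, noise-free data; the feature names `feature_a` and `feature_b` are placeholders, not columns of the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data so the recovered coefficients are easy to check
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
y_demo = 3.0 * X_demo["feature_a"] - 1.0 * X_demo["feature_b"] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its column name and sort by magnitude
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients.sort_values(key=abs, ascending=False))
```

Coefficient magnitudes are only directly comparable when the features are on similar scales, so standardizing the inputs first may be worth considering before ranking them this way.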

Step 4: Model Evaluation


```python
from sklearn.metrics import mean_squared_error, r2_score

# Simple Linear Regression Evaluation
mse_simple = mean_squared_error(y_test, y_pred_simple)
# RMSE is the square root of the MSE (the squared=False argument is
# deprecated in newer scikit-learn releases)
rmse_simple = mse_simple ** 0.5
r2_simple = r2_score(y_test, y_pred_simple)

print(f'Simple Linear Regression - MSE: {mse_simple}, RMSE: {rmse_simple}, R2: {r2_simple}')

# Multiple Linear Regression Evaluation
mse_multiple = mean_squared_error(y_test, y_pred_multiple)
rmse_multiple = mse_multiple ** 0.5
r2_multiple = r2_score(y_test, y_pred_multiple)

print(f'Multiple Linear Regression - MSE: {mse_multiple}, RMSE: {rmse_multiple}, R2: {r2_multiple}')
```
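
Since both models are scored with the same three metrics, the comparison could also be collected into a single table. The helper below is one possible sketch; the name `summarize` and the sample predictions are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def summarize(y_true, predictions):
    """Build one row of MSE, RMSE, and R2 per named set of predictions."""
    rows = {}
    for name, y_pred in predictions.items():
        mse = mean_squared_error(y_true, y_pred)
        rows[name] = {
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            "R2": r2_score(y_true, y_pred),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up values for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
report = summarize(
    y_true,
    {"simple": [1.5, 2.5, 2.5, 3.5], "multiple": [1.0, 2.0, 3.0, 4.0]},
)
print(report)
```

With the variables from the steps above, the call would look like `summarize(y_test, {"simple": y_pred_simple, "multiple": y_pred_multiple})`.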
