Final Data Science Report
Abstract
This project explores a data science approach to solving a real-world problem using data collection, preprocessing, exploratory analysis, predictive modeling, and evaluation. A Random Forest model is trained on the Ames, Iowa housing dataset to predict sale prices, and its performance is assessed with standard regression metrics.
1. Introduction
The purpose of this project is to demonstrate the application of data science methodologies to analyze and
extract insights from data. It covers the end-to-end pipeline, including problem definition, data collection, preprocessing, exploratory analysis, model building, and evaluation.
5. Exploratory Data Analysis
To understand the relationships between features, we created a correlation heatmap. This visual highlights which features are most strongly correlated with the target variable `SalePrice`. A strong correlation is a good indication that a feature will be informative for prediction.
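The heatmap can be reproduced with a short snippet such as the one below; this is a minimal sketch that reloads `housing.csv` (the file used later in the report) and restricts the correlation matrix to numeric columns.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the raw data and compute pairwise correlations between numeric features
data = pd.read_csv('housing.csv')
corr = data.select_dtypes(include='number').corr()

# Draw the heatmap; cell annotations are omitted because the matrix is large
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation heatmap of numeric features')
plt.tight_layout()
plt.show()

Sorting `corr['SalePrice']` in descending order gives a quick ranking of the features most correlated with the target.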
6. Feature Importance
Using a Random Forest model, we identified the most influential features for predicting housing prices. The bar chart below shows the relative importance of key variables. `GrLivArea` has the highest impact, followed by other features with progressively smaller contributions.
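A chart of this kind can be produced from the model's `feature_importances_` attribute; the sketch below assumes the fitted `model` and feature matrix `X` defined in the code listing that follows.

import pandas as pd
import matplotlib.pyplot as plt

# Rank features by the Random Forest's impurity-based importance scores
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(15)

# Horizontal bar chart of the most influential features
top_features.sort_values().plot(kind='barh')
plt.xlabel('Relative importance')
plt.title('Top 15 features by Random Forest importance')
plt.tight_layout()
plt.show()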
The following Python code was used for data preprocessing, visualization, and model building.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Load data
data = pd.read_csv('housing.csv')

# Preprocessing: fill missing values and one-hot encode categorical features
data.fillna(0, inplace=True)
data = pd.get_dummies(data)
# Model: separate the features from the target and fit a Random Forest regressor
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
model = RandomForestRegressor()
model.fit(X, y)
2. Related Work
Previous research in housing price prediction has employed various machine learning algorithms, including
linear regression, decision trees, and ensemble models such as Gradient Boosting and Random Forest. Studies
have shown that ensemble models tend to perform better due to their ability to reduce variance and bias.
3. Data Collection
The dataset used in this project was sourced from a publicly available Kaggle competition on house price
prediction. The dataset contains various numerical and categorical features describing residential homes in
Ames, Iowa. The data were downloaded in CSV format, then cleaned and preprocessed before modeling.
4. Data Preprocessing
Data preprocessing involved handling missing values, encoding categorical features, and normalizing the
dataset. Missing values were filled using mean/mode imputation, while categorical variables were transformed
using one-hot encoding. This step ensures that the model can learn effectively from the data.
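As an illustration of this strategy (and distinct from the simpler `fillna(0)` used in the code listing above), a minimal sketch of mean/mode imputation followed by one-hot encoding might look like this:

import pandas as pd

df = pd.read_csv('housing.csv')

# Mean imputation for numeric columns, mode imputation for categorical columns
numeric_cols = df.select_dtypes(include='number').columns
categorical_cols = df.select_dtypes(exclude='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

# One-hot encode the categorical variables
df = pd.get_dummies(df)

Feature scaling is omitted in this sketch because tree-based models such as Random Forest are insensitive to monotonic rescaling of the inputs.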
7. Model Evaluation
To evaluate model performance, we used the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the R² score. Together these metrics indicate how closely the model's predictions match the observed sale prices.
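A minimal evaluation sketch using scikit-learn's metrics on a held-out split is shown below; the 80/20 split and the random seed are illustrative choices rather than values taken from the original experiment, and `X` and `y` are the feature matrix and target from the earlier code.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print('MAE:', mean_absolute_error(y_test, preds))
print('MSE:', mean_squared_error(y_test, preds))
print('R2 :', r2_score(y_test, preds))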
While Random Forest performed well on the dataset, further improvements could be achieved using hyperparameter tuning and ensemble methods such as XGBoost. Additionally, incorporating more informative features or additional external data could further improve predictive accuracy.
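As one possible direction, a small grid search over Random Forest hyperparameters could be run with scikit-learn's GridSearchCV; the parameter grid below is illustrative rather than the one used in this project, and it reuses the training split from the evaluation sketch above.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid; wider ranges would be explored in practice
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring='neg_mean_absolute_error',
    cv=3,
)
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
print('Best CV MAE:', -search.best_score_)

An XGBoost regressor could be tuned the same way, since its scikit-learn wrapper is compatible with GridSearchCV.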
The Python code listing above demonstrates how the data was loaded, preprocessed, and modeled using Random Forest. Seaborn and Matplotlib were used for data visualization. The RandomForestRegressor was selected for its robustness and its ability to capture non-linear relationships between the features and the sale price.