Final_Data_Science_Report_25_Pages

This report presents a data science project focused on predicting housing prices using a dataset from a Kaggle competition. It details the entire data science pipeline including data collection, preprocessing, exploratory data analysis, modeling with a Random Forest algorithm, and evaluation of model performance. Key findings highlight the importance of features such as `GrLivArea`, `OverallQual`, and `TotalBsmtSF` in predicting `SalePrice`.

Uploaded by

Khadija Tajir
Copyright
© All Rights Reserved

Final Data Science Report

Data Science Project Report

Submitted by: Khadija Tajir

Course: Data Science

Institution: Chandigarh University

Date: April 2025

Abstract

This project applies a data science approach to a real-world problem, predicting house sale prices, through data collection, preprocessing, exploratory data analysis, modeling, and evaluation.

1. Introduction

The purpose of this project is to demonstrate the application of data science methodologies to analyze and extract insights from data. It covers the end-to-end pipeline including problem definition, data collection, preprocessing, modeling, evaluation, and visualization.

5. Exploratory Data Analysis (Expanded)

To understand the relationships between features, we created a correlation heatmap. This visual highlights which features are most strongly correlated with the target variable `SalePrice`. A strong correlation is observed with `GrLivArea`, `OverallQual`, and `TotalBsmtSF`.
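As a minimal sketch of this step, the code below ranks numeric features by their absolute Pearson correlation with the target and shows how a heatmap could be drawn. The helper names `top_correlations` and `plot_heatmap` are hypothetical, and the data frame is synthetic stand-in data rather than the Ames dataset, so the printed values say nothing about the real results.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, target: str = "SalePrice", k: int = 3) -> pd.Series:
    """Return the k numeric features with the largest absolute Pearson r to `target`."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(k)


def plot_heatmap(df: pd.DataFrame) -> None:
    """Draw a correlation heatmap (needs seaborn and matplotlib installed)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(df.select_dtypes(include="number").corr(), cmap="coolwarm", center=0)
    plt.tight_layout()
    plt.show()


# Synthetic stand-in data (an assumption, NOT the Ames dataset).
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
quality = rng.integers(1, 11, n)
demo = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": quality,
    "Unrelated": rng.normal(0, 1, n),
    "SalePrice": 100 * area + 15000 * quality + rng.normal(0, 20000, n),
})

ranked = top_correlations(demo, k=2)
print(ranked)
```

Calling `plot_heatmap(demo)` would render the full correlation matrix; on the real data the same call would take the loaded housing data frame instead.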


6. Feature Importance

Using a Random Forest model, we identified the most influential features for predicting housing prices. The bar chart below shows the relative importance of key variables. `GrLivArea` has the highest impact, followed by `OverallQual` and `TotalBsmtSF`.
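The ranking described here can be obtained from the `feature_importances_` attribute of a fitted `RandomForestRegressor`. The sketch below is illustrative only: it uses synthetic data (an assumption, not the Ames dataset), and the `Noise` column is a made-up feature added to show how an irrelevant variable ranks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, NOT the Ames dataset).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Noise": rng.normal(0, 1, n),  # hypothetical irrelevant feature
})
y = 120 * X["GrLivArea"] + 8000 * X["OverallQual"] + rng.normal(0, 5000, n)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per feature, normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.plot.barh()` would produce a bar chart of these values. Impurity-based importances can favour high-cardinality features, so they are best read as a rough ranking rather than an exact measure of effect.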

Appendix: Source Code Snippets

The following Python code was used for data preprocessing, visualization, and model building.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Load the housing data from a local CSV file
data = pd.read_csv('housing.csv')

# Preprocessing: fill missing values with 0, then one-hot encode
# the categorical columns
data = data.fillna(0)
data = pd.get_dummies(data)

# Model: separate the features from the target and fit a Random Forest
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
model = RandomForestRegressor()
model.fit(X, y)

2. Related Work

Previous research in housing price prediction has employed various machine learning algorithms, including linear regression, decision trees, and ensemble models such as Gradient Boosting and Random Forest. Studies have shown that ensemble models tend to perform better than single models: bagging-based methods such as Random Forest mainly reduce variance, while boosting methods mainly reduce bias.

3. Data Collection

The dataset used in this project was sourced from a publicly available Kaggle competition on house price prediction. The dataset contains various numerical and categorical features describing residential homes in Ames, Iowa. Data was downloaded in CSV format and then cleaned and preprocessed before model application.

4. Data Preprocessing

Data preprocessing involved handling missing values, encoding categorical features, and normalizing the dataset. Missing values were filled using mean/mode imputation (the column mean for numerical features and the most frequent value for categorical ones), while categorical variables were transformed using one-hot encoding. This step ensures that the model receives complete, fully numeric input and can learn effectively from the data.
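The imputation and encoding steps described above could be sketched as follows. This is a minimal illustration under stated assumptions: the `preprocess` helper is hypothetical, the tiny data frame and its column names are illustrative only, and normalization is left out of the sketch.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-impute numeric columns, mode-impute the rest, then one-hot encode."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)

    # Numeric columns: replace missing values with the column mean.
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())

    # Categorical columns: replace missing values with the most frequent value.
    # (Assumes every categorical column has at least one non-missing entry.)
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # One-hot encode the categorical columns.
    return pd.get_dummies(out, columns=list(cat_cols))


# Illustrative toy data (an assumption, not rows from the real dataset).
demo = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 10000.0, 9000.0],
    "Street": ["Pave", None, "Grvl", "Pave"],
})
clean = preprocess(demo)
print(clean)
```

Unlike filling every gap with a constant, this keeps imputed values within the range the column already takes, which is the usual motivation for mean/mode imputation.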

7. Model Evaluation


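To evaluate model performance, we used metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² score. These metrics provide insights into how well the model predicts the target variable: lower MAE and MSE indicate smaller prediction errors, while an R² closer to 1 indicates that the model explains more of the variance in `SalePrice`.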
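A minimal sketch of how these three metrics can be computed with scikit-learn is shown below. It assumes a held-out test split, which the report does not describe explicitly, and uses synthetic data, so the printed numbers are not the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, NOT the Ames dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Hold out 20% of the rows so the metrics reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)  # average absolute error
mse = mean_squared_error(y_test, pred)   # average squared error
r2 = r2_score(y_test, pred)              # share of variance explained
print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")
```

Scoring on held-out rows matters here because a Random Forest evaluated on its own training data typically looks far better than it performs on new houses.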

8. Limitations and Future Work

While Random Forest performed well on the dataset, further improvements could be achieved using hyperparameter tuning and gradient-boosting methods such as XGBoost. Additionally, incorporating more domain-specific features or external economic data could enhance prediction accuracy.
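One way the hyperparameter tuning mentioned above might look is a small cross-validated grid search. The grid values, the 3-fold setting, and the MAE-based scoring below are illustrative assumptions, not choices taken from the project, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption, NOT the Ames dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=150)

# Hypothetical search grid; real ranges would depend on the dataset.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # higher (closer to 0) is better
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best setting
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied data with the best parameters found.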

Appendix: Code Explanation

The Python code snippet demonstrates how the data was loaded, preprocessed, and modeled using Random Forest. Seaborn and Matplotlib were used for data visualization. The RandomForestRegressor was selected for its ability to handle complex data with minimal preprocessing.
