Project Report
ON
Predicting House Prices with Machine Learning
YASH PARCHA
SAGAR
DEVANSH GARG
INDEX
1. Introduction to Python and Machine Learning
2. Importing Data
8. Logistic Regressor
Introduction to Python and Machine Learning
Introduction to Python:
Python is a general-purpose, dynamically typed, high-level, interpreted, garbage-collected,
multi-paradigm programming language that supports procedural, object-oriented, and functional
programming. Created by Guido van Rossum and first released in 1991, Python emphasizes code
readability and allows programmers to express concepts in fewer lines of code compared to
languages like C++ or Java.
o Interpreted Language: Python requires no separate compilation step, allowing rapid development and
testing. The interpreter translates source code to bytecode and runs it directly.
o Extensive Libraries: Python has a rich ecosystem of libraries and frameworks, such as NumPy,
Pandas, and Matplotlib, which simplify tasks like data manipulation and visualization.
o Machine Learning: Python is widely used for machine learning due to its simplicity, ease of
use, and availability of powerful machine learning libraries.
Introduction to Machine Learning
Machine learning (ML) is a subfield of artificial intelligence (AI) that involves the development of
algorithms and statistical models enabling computers to perform tasks without explicit instructions.
Instead, these systems learn patterns and make decisions based on data. Machine learning is
transforming various industries by automating complex processes, providing insights from large datasets,
and creating new opportunities for innovation.
5. Evaluation: Assessing the model's performance using metrics and validation techniques.
7. Maintenance: Continuously updating and refining the model as new data becomes available.
1. Supervised Learning: The model is trained on a labeled dataset, meaning that each training example is paired
with an output label. Common algorithms include:
o Linear Regression
o Decision Trees
o Neural Networks
2. Unsupervised Learning: The model is provided with unlabeled data and must find inherent patterns or
groupings. Common algorithms include:
3. Reinforcement Learning: The model learns by interacting with an environment, receiving rewards or penalties
based on its actions, and aims to maximize cumulative rewards. Key concepts include:
o Q-Learning
IMPORTING DATA
o Libraries
Pandas provides data structures such as DataFrame and Series to handle and analyze data
efficiently. NumPy is a library for numerical computing in Python. Seaborn is a statistical data
visualization library. Matplotlib is a plotting library for drawing graphs, histograms, and scatter
plots, and for customizing visualizations.
o Dataset Read
pd.read_csv('Housing.csv') is a Pandas function that reads a comma-separated values (CSV)
file into a DataFrame, here assigned to df. In this project it reads the housing dataset.
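A minimal sketch of the read step follows. Because Housing.csv itself is not reproduced in this report, the example reads a few hypothetical rows from an in-memory buffer; the column names are the ones mentioned in this report, and the values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Housing.csv (illustrative values only).
sample_csv = io.StringIO(
    "price,area,bedrooms,bathrooms,stories,parking\n"
    "500000,3000,3,1,2,1\n"
    "750000,4500,4,2,2,2\n"
    "320000,1650,2,1,1,0\n"
)

# pd.read_csv accepts a file path or any file-like object.
# With the real file this would be: df = pd.read_csv("Housing.csv")
df = pd.read_csv(sample_csv)

print(df.head())
```

The header row becomes the column names, and each remaining line becomes one row of the DataFrame.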
Gathering Basic information
df.shape
df.info()
df.info() prints the number of rows, the non-null count, and the
datatype of each column. For example, the column price has 545
non-null values and its datatype is int64.
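The two calls can be tried on a small frame as sketched below. The DataFrame here is a hypothetical three-row stand-in for the housing data, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (illustrative values only).
df = pd.DataFrame({
    "price": [500000, 750000, 320000],
    "area": [3000, 4500, 1650],
    "bedrooms": [3, 4, 2],
})

# shape is an attribute (no parentheses): a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# info() prints its summary to standard output and returns None.
result = df.info()
```

Note that df.info() is called for its printed output; assigning its return value, as done here for illustration, yields None.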
df.count()
df.min()
df.min() returns the minimum value of each column. For example,
area has 1650 as its minimum value.
df.describe()
df.head()
df.tail()
df.isnull().sum()
df.duplicated().sum()
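The summary calls listed in this section can be sketched together on a small frame. The data below is hypothetical and deliberately contains one missing value and one duplicated row so that each call has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one repeated row.
df = pd.DataFrame({
    "price": [500000, 750000, 320000, 320000],
    "area": [3000.0, np.nan, 1650.0, 1650.0],
})

non_null_counts = df.count()            # non-null values per column
column_minimums = df.min()              # minimum per column (NaN is skipped)
summary = df.describe()                 # count, mean, std, min, quartiles, max
first_rows = df.head(2)                 # first two rows
last_rows = df.tail(2)                  # last two rows
missing_per_column = df.isnull().sum()  # missing values per column
duplicate_rows = df.duplicated().sum()  # rows identical to an earlier row

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

A non-zero result from either of the last two calls signals that the dataset needs cleaning before modelling.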
Identification of Outliers using Box Plots
→ Figure represents the Boxplot for area.
→ Figure represents the Boxplot for bedrooms.
→ Figure represents the Boxplot for bathrooms.
→ Figure represents the Boxplot for stories.
→ Figure represents the Boxplot for parking.
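Box plots like the ones above can be produced along the following lines. This is a sketch using Matplotlib directly on hypothetical area values; the report's own figures may have been drawn with Seaborn, and the data here is illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical area values; the last one is far from the rest.
df = pd.DataFrame({"area": [3000, 3200, 3500, 3600, 4000, 16200]})

fig, ax = plt.subplots()
# By default the whiskers extend to the last point within 1.5 * IQR of the box.
box = ax.boxplot(df["area"].to_numpy())
ax.set_title("Boxplot for area")
ax.set_ylabel("area")

# Points drawn beyond the whiskers are the candidate outliers.
fliers = box["fliers"][0].get_ydata()
print(fliers)
```

The same pattern applies to bedrooms, bathrooms, stories, and parking by changing the column name.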
RESETTING INDEX AFTER REMOVING OUTLIERS
Outliers are data points that deviate significantly from the rest of the
dataset. They can arise due to measurement errors, data entry errors,
or inherent variability in the data. Outliers can skew the analysis results,
leading to inaccurate conclusions. Therefore, it is essential to identify
and handle outliers before performing statistical analysis.
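The report does not state which rule was used to remove outliers. The sketch below assumes the common 1.5 × IQR rule (the same rule box-plot whiskers use) and then resets the index; the helper name remove_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd


def remove_outliers_iqr(frame, column, k=1.5):
    """Drop rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    keep = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    # reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index,
    # so no gaps remain where the removed rows used to be.
    return frame[keep].reset_index(drop=True)


# Hypothetical area values; 16200 lies far above the upper fence.
df = pd.DataFrame({"area": [3000, 3200, 3500, 16200, 3600, 4000]})
cleaned = remove_outliers_iqr(df, "area")
print(cleaned)
```

Without the reset, the filtered frame would keep the original labels (0, 1, 2, 4, 5), which can cause surprises in later positional operations.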
Dataset Trend Visualization
PAIRPLOTS
Pair plots are particularly useful in the context of outlier detection and data preprocessing. They
provide a clear visual representation of how each variable interacts with others, making it easier
to spot anomalies that do not follow the general pattern of the data.
SCATTER PLOT
REGRESSION PLOT
BAR PLOT
A bar plot (or bar chart) is a graphical display of data using bars of
different heights. It is commonly used to compare quantities across
different categories. Here’s an explanation of how to create and
interpret a bar plot.
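As a rough illustration, the sketch below draws a bar plot of mean price per bedroom count with Matplotlib. The choice of columns and the values are assumptions made for the example, not the report's actual figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data (illustrative values only).
df = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 4, 4],
    "price": [300000, 450000, 550000, 700000, 800000, 900000],
})

# One bar per category: the mean price for each bedroom count.
mean_price = df.groupby("bedrooms")["price"].mean()

fig, ax = plt.subplots()
bars = ax.bar(mean_price.index.astype(str), mean_price.values)
ax.set_xlabel("bedrooms")
ax.set_ylabel("mean price")
```

Each bar's height is the aggregated value for its category, so taller bars here mean a higher average price for that bedroom count.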
HISTOGRAM
LINE PLOT
A line plot (or line chart) is a type of chart used to display information
as a series of data points called 'markers' connected by straight line
segments. It is commonly used to visualize trends over time. Here’s a
step-by-step explanation of how to create and interpret a line plot.
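A line plot can be sketched as follows. The housing data mentioned in this report has no time column, so this example assumes price plotted against area, sorted by area; the values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data (illustrative values only).
df = pd.DataFrame({
    "area": [4200, 1650, 3000, 6000],
    "price": [700000, 300000, 450000, 900000],
})

# Sort by the x variable first; otherwise the line zigzags between points.
ordered = df.sort_values("area")

fig, ax = plt.subplots()
(line,) = ax.plot(ordered["area"], ordered["price"], marker="o")
ax.set_xlabel("area")
ax.set_ylabel("price")
```

The markers are the individual data points and the segments between them show the direction of the trend.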
COUNT PLOT
CORRELATION HEATMAP
Test and Train Splitting
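A minimal sketch of the split, using scikit-learn's train_test_split. The report does not state the split ratio, so the 80/20 split, the random seed, and the synthetic arrays below are assumptions; the variable names x_train, x_test, y_train, and y_test match those used later in this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing. random_state fixes the shuffle so the
# split is reproducible; stratify=y keeps the class proportions in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
```

The model is fitted only on the training part, and the held-out test part gives an estimate of how it performs on unseen data.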
LOGISTIC REGRESSOR
DECISION TREE
`DecisionTreeClassifier` is imported from `sklearn.tree` and instantiated as `treemodel`
with a maximum depth of 2. The model is then trained using `x_train` and `y_train`.
The fitted tree, limited to a depth of 2, is then plotted. A tree plot visually
represents the structure of a decision tree, illustrating how decisions are made based on feature
values. It shows the tree's nodes, branches, and leaves, detailing the splits and outcomes at
each node.
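The steps described above can be sketched as follows. The data is synthetic (generated with `make_classification`) because the report's prepared features are not shown, and the tree is printed as text with `export_text` rather than drawn; `sklearn.tree.plot_tree` is the function that produces the graphical version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the prepared housing features and labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth=2 allows at most two levels of splits, so at most four leaves.
treemodel = DecisionTreeClassifier(max_depth=2, random_state=0)
treemodel.fit(x_train, y_train)

# Text rendering of the splits; each indented line is one node of the tree.
print(export_text(treemodel))
```

Reading the output from the top, each condition is a split on one feature, and each `class:` line is a leaf giving the predicted class.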
RANDOM FOREST
The following functions and classes are imported from `sklearn`: `roc_curve`, `auc`,
`classification_report`, `GridSearchCV`, and `RandomForestClassifier`. The `time` module is also
imported.
A `RandomForestClassifier` model is created.
Hyperparameters such as `max_depth`, `bootstrap`, `max_features`, and `criterion` are tuned to
maximize the model's accuracy. The best combination is determined using `GridSearchCV`. The `cv_rf`
object (a `GridSearchCV` instance) is fitted on `x_train` and `y_train`.
The values of `x_test` are used to generate predictions, and these predictions are compared
with `y_test` to calculate the accuracy.
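A sketch of the grid search and accuracy check follows. The parameter grid values, the number of trees, the fold count, and the synthetic data are all assumptions chosen to keep the example small; only the parameter names and `cv_rf` come from this report, and `fit_rf` is a hypothetical name for the base estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared housing features and labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed candidate values; every combination is tried with cross-validation.
param_grid = {
    "max_depth": [2, 3],
    "bootstrap": [True, False],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}

fit_rf = RandomForestClassifier(n_estimators=20, random_state=0)
cv_rf = GridSearchCV(fit_rf, param_grid=param_grid, cv=3)
cv_rf.fit(x_train, y_train)

best_params = cv_rf.best_params_

# After fitting, predict() uses the estimator refitted with the best parameters.
predictions = cv_rf.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
print(best_params, accuracy)
```

Accuracy is the fraction of test rows whose predicted label equals the true label in `y_test`.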
K-Nearest Neighbor
Gaussian Naive Bayes
A heatmap for true and predicted values visualizes the performance of a
classification model by showing how often each combination of actual and
predicted classes occurs. It helps in understanding the distribution of prediction
errors and correct classifications.
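Such a heatmap is usually drawn from a confusion matrix. The sketch below uses hypothetical label arrays, builds the matrix with scikit-learn, and plots it with Seaborn; the colour map and axis labels are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a binary classifier.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
print(cm)
```

Cells on the diagonal count correct predictions, and off-diagonal cells count the errors for each actual/predicted pair.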
Data Comparison
END OF REPORT