Assignment Cars

Uploaded by afaq0456

Question 1

Overfitting: The phenomenon where a machine learning model learns the training data too
closely, capturing noise and randomness, which leads to a lack of generalization to new data.
The impact of overfitting on machine learning models is significant and generally negative:
1. Poor Generalization: Overfit models perform very well on the training data but poorly
on new, unseen data. They essentially memorize the training data rather than learning the
actual underlying patterns. This means that they are not useful for making accurate
predictions in real-world scenarios.
2. Reduced Model Robustness: Overfit models are sensitive to small variations in the
training data. If the training data changes slightly, the model's performance can degrade
rapidly.
3. Inefficient Use of Resources: Overfit models often use more complex structures and
more features than necessary, which can lead to increased computational resources and
longer training times. This inefficiency can be problematic in practical applications.
4. Loss of Interpretability: Complex models that overfit the data are often harder to
interpret. This is especially important when you need to understand the relationships and
factors that influence the predictions.
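The effect described above can be seen on synthetic data: a high-degree polynomial fits the training points far more closely than a straight line, which is exactly the "memorization" behavior. This is a minimal illustrative sketch (the data and degrees are made up, not from the assignment's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem: the true relationship is y = x,
# observed with a little noise.
x_train = np.linspace(0, 1, 15)
y_train = x_train + rng.normal(0, 0.1, size=x_train.shape)
x_test = np.linspace(0, 1, 100)
y_test = x_test  # noise-free ground truth

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

results = {}
for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (rmse(y_train, np.polyval(coeffs, x_train)),
                       rmse(y_test, np.polyval(coeffs, x_test)))
    print(f"degree {degree:2d}: train RMSE {results[degree][0]:.3f}, "
          f"test RMSE {results[degree][1]:.3f}")
```

The degree-12 fit always achieves a lower training error than the line (more parameters can only shrink the least-squares residual), while its behavior between and beyond the training points is typically much worse.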
Mitigating Overfitting (Solution):
To mitigate the overfitting caused by the 'Car' and 'Model' features, we apply feature
selection: we remove these two features, which appear to be driving the overfitting. Feature
selection reduces the dimensionality of the dataset and simplifies the model, lowering the
risk of overfitting.
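With the data in a pandas DataFrame, dropping the two columns before training is a one-liner. The toy frame below is a stand-in with assumed column names, not the assignment's actual dataset:

```python
import pandas as pd

# Toy stand-in for the car dataset (column names assumed from the text).
df = pd.DataFrame({
    "Car":    ["Toyota", "Honda", "Ford"],
    "Model":  ["Corolla", "Civic", "Focus"],
    "Volume": [1300, 1600, 2000],
    "Weight": [990, 1112, 1328],
    "CO2":    [99, 105, 114],
})

# Drop the identifier-like columns that drive overfitting; train on the rest.
X = df.drop(columns=["Car", "Model"])
print(list(X.columns))  # → ['Volume', 'Weight', 'CO2']
```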

Scenario                                                  R2           RMSE
Original Model (without Car and Model features)          -0.172968     7.112579
With Car and Model features                               0.850000     3.500000
New Solution (with Feature Selection/Regularization)     -0.129553     6.979711

Question 2
EDA
1. Data Sample (First 5 Rows):
 This section displays the first 5 rows of your dataset. Each row represents an
individual data point, and each column represents a different attribute or feature of
that data point. The dataset has 12 columns.
2. Data Types and Non-Null Counts:
 This part provides information about the data types and non-null counts for each
column in your dataset.
 The "Data columns" section shows the names of the columns.
 The "Non-Null Count" column indicates how many non-null (non-missing) values
are present in each column.
 The "Dtype" column shows the data type of each column.
Here's a breakdown of the dataset:
 The dataset contains 96,453 rows and 12 columns.
 The columns contain a mix of data types:
 8 columns have a data type of float64, which typically represents numerical data.
 4 columns have a data type of object, which typically represents text or
categorical data.
 Some columns have missing values (null values). For example, the "Precip Type" column
has some missing values because it contains 95,936 non-null values, which is less than
the total number of rows (96,453).
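These checks map directly onto a few pandas calls. The miniature frame below only mimics the weather dataset's structure (column names assumed), so the counts are illustrative:

```python
import numpy as np
import pandas as pd

# Miniature frame mimicking the weather dataset's structure (names assumed).
df = pd.DataFrame({
    "Temperature (C)": [9.47, 9.36, np.nan, 8.29],
    "Humidity":        [0.89, 0.86, 0.83, 0.83],
    "Precip Type":     ["rain", "rain", None, "snow"],
})

print(df.head())            # data sample (first rows)
print(df.shape)             # (number of rows, number of columns)
print(df.dtypes)            # data type of each column
print(df.isnull().sum())    # missing-value count per column
```

On the real dataset, `df.isnull().sum()` is what reveals the 517 missing entries in "Precip Type" (96,453 rows minus 95,936 non-null values).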
Data cleaning
The 'Precip Type' column has missing values (96,453 - 95,936 = 517 rows). To handle them, I
apply data cleaning, filling the missing values with a suitable strategy such as the most
frequent category.
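One reasonable strategy for a categorical column is to fill gaps with the most frequent value (the mode); dropping the affected rows is another option. A minimal sketch on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"Precip Type": ["rain", "rain", None, "snow"]})

# Fill missing categories with the most frequent value (the mode) --
# one of several reasonable strategies for a categorical column.
mode_value = df["Precip Type"].mode()[0]
df["Precip Type"] = df["Precip Type"].fillna(mode_value)
print(int(df["Precip Type"].isnull().sum()))  # → 0
```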
EDA plots
1. Pair Plot:
 Purpose: The pair plot (created using sns.pairplot) is used to visualize the
relationships between numerical variables in the dataset. It displays scatter plots
for all possible pairs of numerical variables and histograms on the diagonal.
 Interpretation: Each scatter plot shows how two variables relate to each other.
We can observe whether there is a linear or non-linear relationship between
variables. The histograms on the diagonal show the distributions of individual
variables.
2. Histogram of Temperature:
 Purpose: This histogram (created using sns.histplot) provides insights into the
distribution of temperature in the dataset.
 Interpretation: It shows the frequency or count of temperature values within
specified bins. We can identify the central tendency, spread, and shape of the
temperature distribution. For example, it helps you check if the temperature data
follows a normal distribution.

3. Correlation Heatmap:
 Purpose: The correlation heatmap (created using sns.heatmap) helps you
understand the relationships between numerical variables and identify potential
correlations.
 Interpretation: The heatmap uses color coding to represent the strength and
direction of correlations. Positive correlations are shown in one color, while
negative correlations are shown in another color. It helps you determine which
variables are strongly correlated and might influence each other.
4. Box Plot of Temperature by Precip Type:
 Purpose: This box plot (created using sns.boxplot) allows you to compare
temperature values based on different categories of precipitation type.
 Interpretation: The box plot displays the distribution of temperature for each
category of precipitation (e.g., 'rain' and 'snow'). It shows the median, quartiles,
and potential outliers. We can see how temperature varies between different
precipitation types.
5. Time Series Plot:
 Purpose: The time series plot (created using sns.lineplot) visualizes the
temperature changes over time.
 Interpretation: This plot shows how temperature varies with time, making it
suitable for time-based datasets. It helps you identify trends, seasonality, and other
patterns in temperature data.
Model performance
1. Model Performance Before Data Cleaning:
 R-squared (R2) Score: R2 measures the proportion of the variance in the target
variable (temperature) that is explained by the independent variables (features) in
the model. An R2 score of 0.5817 indicates that approximately 58.17% of the
variation in temperature is explained by the features included in the model. In
other words, the model can capture about 58.17% of the variation in temperature,
and the rest is unexplained.
 Root Mean Squared Error (RMSE): RMSE quantifies the average error
between the predicted temperature values and the actual (observed) temperature
values. An RMSE of 6.2086 suggests that, on average, the model's predictions are
off by approximately 6.2086 degrees Celsius. Lower RMSE values indicate better
model performance.
2. Model Performance After Data Cleaning and Transformation:
 The R2 score remains the same at 0.5817, indicating that the proportion of
explained variance in temperature did not change after data cleaning and
transformation. In other words, the model still captures about 58.17% of the
variation in temperature.
 The RMSE value also remains the same at 6.2086, suggesting that, on average,
the model's predictions continue to have an error of approximately 6.2086 degrees
Celsius after data cleaning and transformation.
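Both metrics follow directly from their definitions; the sketch below computes them by hand on made-up numbers (scikit-learn's `r2_score` and a square-rooted `mean_squared_error` would give the same results):

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)               # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)      # total sum of squares
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Average prediction error in the target's own units (degrees Celsius here)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up temperatures purely to exercise the formulas.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.5, 14.5, 15.0])
print(round(r2_score(y_true, y_pred), 4))  # → 0.875
print(round(rmse(y_true, y_pred), 4))      # → 0.7906
```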
