Practical Example Full Notes
Preprocessing
Exploring the descriptive statistics of the variables
# Descriptive statistics are very useful for initial exploration of the variables
# By default, only descriptives for the numerical variables are shown
# To include the categorical ones, you should specify this with an argument
raw_data.describe(include='all')
# Note that categorical variables don't have some types of numerical descriptives
# and numerical variables don't have some types of categorical descriptives
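The missing-value counts below appear without the code that produced them; a minimal sketch of how they were presumably obtained, followed by the drop of incomplete rows (the name data_no_mv is used later in these notes, but the exact calls are an assumption):
# Hedged reconstruction: count missing values per column, then drop rows with missing values
raw_data.isnull().sum()
data_no_mv = raw_data.dropna(axis=0)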
Brand 0
Price 172
Body 0
Mileage 0
EngineV 150
Engine Type 0
Registration 0
Year 0
dtype: int64
Dealing with outliers
# Obviously there are some outliers present
# Without diving too deep into the topic, we can deal with the problem easily by removing 0.5% or 1% of the problematic samples
# Here, the outliers are situated around the higher prices (right side of the graph)
# Logic should also be applied
# This is a dataset about used cars, therefore one can imagine how $300,000 is an excessive price
# Outliers are a great issue for OLS, thus we must deal with them in some way
# It may be a useful exercise to try training a model without removing the outliers
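The removal itself is not shown in these notes; a minimal sketch, assuming data_no_mv is the data frame after dropping missing values and a 99th-percentile cutoff on 'Price' (the exact percentile is an assumption):
# Hedged reconstruction: keep only observations below the 99th percentile of 'Price'
q = data_no_mv['Price'].quantile(0.99)
data_1 = data_no_mv[data_no_mv['Price']<q]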
# We can check the PDF once again to ensure that the result is still distributed in the same way overall;
# however, there are much fewer outliers
sns.distplot(data_1['Price'])
# We can treat the other numerical variables in a similar way
sns.distplot(data_no_mv['Mileage'])
q = data_1['Mileage'].quantile(0.99)
data_2 = data_1[data_1['Mileage']<q]
# A simple Google search can indicate the natural domain of this variable
# Car engine volumes are usually (always?) below 6.5l
# This is a prime example of the fact that a domain expert (a person working in the car industry)
# may find it much easier to determine problems with the data than an outsider
data_3 = data_2[data_2['EngineV']<6.5]
# Finally, the situation with 'Year' is similar to 'Price' and 'Mileage'
# However, the outliers are on the low end
sns.distplot(data_no_mv['Year'])
# I'll simply remove them
q = data_3['Year'].quantile(0.01)
data_4 = data_3[data_3['Year']>q]
plt.show()
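'data_cleaned' is used in the cells below but never defined in these notes; presumably it is the outlier-free data frame with its index reset, roughly:
# Hedged reconstruction: both the name and the reset-index step are assumptions
data_cleaned = data_4.reset_index(drop=True)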
# From the subplots and the PDF of price, we can easily determine that 'Price' is exponentially distributed
# A good transformation in that case is a log transformation
sns.distplot(data_cleaned['Price'])
plt.show()
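The log transformation itself is not reproduced above; a minimal sketch, assuming the new column is called 'log_price' (that name does appear in the output further down) and the original 'Price' column is then dropped:
# Hedged reconstruction of the log transformation step
log_price = np.log(data_cleaned['Price'])
data_cleaned['log_price'] = log_price
data_cleaned = data_cleaned.drop(['Price'], axis=1)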
Multicollinearity
# Let's quickly see the columns of our data frame
data_cleaned.columns.values
# we create a new data frame which will include all the VIFs
# note that each variable has its own variance inflation factor as this measure is variable specific (not model specific)
vif = pd.DataFrame()
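The rest of the VIF computation is missing from these notes; a minimal sketch using statsmodels' variance_inflation_factor on the three numerical features listed in the table below (the import and the feature selection are assumptions):
# Hedged reconstruction: compute one VIF per numerical feature
from statsmodels.stats.outliers_influence import variance_inflation_factor
variables = data_cleaned[['Mileage','Year','EngineV']]
vif['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif['Features'] = variables.columns
vif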
VIF Features
0 3.791584 Mileage
1 10.354854 Year
2 7.662068 EngineV
# Since Year has the highest VIF, I will remove it from the model
# This will drive the VIF of the other variables down!!!
# So even if EngineV seems to have a high VIF, too, once 'Year' is gone that will no longer be the case
data_no_multicollinearity = data_cleaned.drop(['Year'],axis=1)
Rearrange a bit
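'data_with_dummies' is never constructed in these notes; presumably the remaining categorical columns were one-hot encoded, along these lines (drop_first is inferred from dummy names such as 'Registration_yes' in the weights table further down):
# Hedged reconstruction: dummy-encode the categorical variables, dropping the first category of each
data_with_dummies = pd.get_dummies(data_no_multicollinearity, drop_first=True)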
# To make our data frame more organized, we prefer to place the dependent variable at the beginning of the df
# Since each problem is different, that must be done manually
# We can display all possible features and then choose the desired order
data_with_dummies.columns.values
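Before the transform call below can run, the target and the inputs must be declared and a scaler fitted; none of that appears in these notes, so here is a minimal sketch assuming sklearn's StandardScaler and 'log_price' as the target:
# Hedged reconstruction: declare the target and the inputs, then fit the scaler on the inputs
from sklearn.preprocessing import StandardScaler
targets = data_with_dummies['log_price']
inputs = data_with_dummies.drop(['log_price'], axis=1)
scaler = StandardScaler()
scaler.fit(inputs)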
# Scale the features and store them in a new variable (the actual scaling procedure)
inputs_scaled = scaler.transform(inputs)
Train Test Split
# Import the module for the split
from sklearn.model_selection import train_test_split
# Split the variables with an 80-20 split and some random state
# To have the same split as mine, use random_state = 365
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=365)
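The regression itself is not shown in these notes; a minimal sketch assuming sklearn's LinearRegression, followed by the residuals plot that the title below refers to:
# Hedged reconstruction: fit the model on the training data and plot the PDF of the training residuals
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
y_hat = reg.predict(x_train)
sns.distplot(y_train - y_hat)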
# Include a title
plt.title("Residuals PDF", size=18)
# Find the R-squared of the model
reg.score(x_train,y_train)
0.744996578792662
9.415239458021299
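(The stray value just above is presumably the model's intercept, reg.intercept_, whose cell is not reproduced.) The code producing the weights table below is also missing; presumably it pairs the input column names with reg.coef_, roughly like this (the name 'reg_summary' is an assumption):
# Hedged reconstruction of the summary table shown below
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary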
Features Weights
0 Mileage -0.448713
1 EngineV 0.209035
2 Brand_BMW 0.014250
3 Brand_Mercedes-Benz 0.012882
4 Brand_Mitsubishi -0.140552
5 Brand_Renault -0.179909
6 Brand_Toyota -0.060550
7 Brand_Volkswagen -0.089924
8 Body_hatch -0.145469
9 Body_other -0.101444
10 Body_sedan -0.200630
11 Body_vagon -0.129887
12 Body_van -0.168597
13 Engine Type_Gas -0.121490
14 Engine Type_Other -0.033368
15 Engine Type_Petrol -0.146909
16 Registration_yes 0.320473
Testing
# Once we have trained and fine-tuned our model, we can proceed to testing it
# Testing is done on a dataset that the algorithm has never seen
# Luckily we have prepared such a dataset
# Our test inputs are 'x_test', while the outputs: 'y_test'
# We SHOULD NOT TRAIN THE MODEL ON THEM, we just feed them and find the predictions
# If the predictions are far off, we will know that our model overfitted
y_hat_test = reg.predict(x_test)
# Create a scatter plot with the test targets and the test predictions
# You can include the argument 'alpha' which will introduce opacity to the graph
plt.scatter(y_test, y_hat_test, alpha=0.2)
plt.xlabel('Targets (y_test)',size=18)
plt.ylabel('Predictions (y_hat_test)',size=18)
plt.xlim(6,13)
plt.ylim(6,13)
plt.show()
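The data frame of predictions shown next is not constructed anywhere above; presumably the predicted log prices are exponentiated back into prices, along these lines (the name df_pf and np.exp are both used later in these notes):
# Hedged reconstruction: turn the predicted log prices back into prices and store them in a data frame
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf.head()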
Prediction
0 10685.501696
1 3499.255242
2 7553.285218
3 7463.963017
4 11353.490075
# We can also include the test targets in that data frame (so we can manually compare them)
df_pf['Target'] = np.exp(y_test)
df_pf
# Note that we have a lot of missing values
# There is no reason to have ANY missing values, though
# This suggests that something is wrong with the data frame / indexing
Prediction Target
0 10685.501696 NaN
1 3499.255242 7900.0
2 7553.285218 NaN
3 7463.963017 NaN
4 11353.490075 NaN
5 21289.799394 14200.0
6 20159.189144 NaN
7 20349.617702 NaN
8 11581.537864 11950.0
9 33614.617349 NaN
10 7241.068243 NaN
11 5175.769541 10500.0
12 5484.015362 NaN
13 13292.711243 NaN
14 8248.666686 NaN
15 10621.836767 NaN
16 23721.581637 3500.0
17 11770.636010 NaN
18 37600.146722 7500.0
19 16178.143307 6800.0
20 11876.820988 NaN
21 31557.804999 NaN
22 6102.358118 NaN
23 13111.914144 NaN
24 23650.150725 NaN
25 45272.248411 NaN
26 2178.941672 NaN
27 2555.022542 NaN
28 35991.510539 NaN
29 26062.229419 NaN
.. ... ...
744 2379.583414 NaN
745 6421.180201 7777.0
746 13355.106770 NaN
747 8453.281424 10500.0
748 48699.979367 NaN
749 6082.849234 4100.0
750 10381.621436 NaN
751 8493.042746 NaN
752 8591.658845 13999.0
753 6358.547301 NaN
754 17028.451182 NaN
755 15885.658673 NaN
756 3752.540952 NaN
757 12028.905190 NaN
758 9380.459827 16999.0
759 10125.265176 NaN
760 13443.324968 NaN
761 9097.127448 NaN
762 12201.288474 4700.0
763 12383.352887 NaN
764 14049.760996 NaN
765 11034.660068 3750.0
766 18982.148845 NaN
767 24323.483753 NaN
768 38260.361723 NaN
769 29651.726363 6950.0
770 10732.071179 NaN
771 13922.446953 NaN
772 27487.751303 NaN
773 13491.163043 NaN
# Therefore, to get a proper result, we must reset the index and drop the old indexing
y_test = y_test.reset_index(drop=True)
0 7.740664
1 7.937375
2 7.824046
3 8.764053
4 9.121509
Name: log_price, dtype: float64
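With the index reset, the Target column is presumably overwritten so that predictions and targets line up, which yields the table below (a hedged sketch):
# Hedged reconstruction: reassign the targets now that the indices match
df_pf['Target'] = np.exp(y_test)
df_pf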
Prediction Target
0 10685.501696 2300.0
1 3499.255242 2800.0
2 7553.285218 2500.0
3 7463.963017 6400.0
4 11353.490075 9150.0
5 21289.799394 20000.0
6 20159.189144 38888.0
7 20349.617702 16999.0
8 11581.537864 12500.0
9 33614.617349 41000.0
10 7241.068243 12800.0
11 5175.769541 5000.0
12 5484.015362 7900.0
13 13292.711243 16999.0
14 8248.666686 9200.0
15 10621.836767 11999.0
16 23721.581637 20500.0
17 11770.636010 9700.0
18 37600.146722 39900.0
19 16178.143307 16400.0
20 11876.820988 15200.0
21 31557.804999 24500.0
22 6102.358118 5650.0
23 13111.914144 12900.0
24 23650.150725 20900.0
25 45272.248411 31990.0
26 2178.941672 3600.0
27 2555.022542 11600.0
28 35991.510539 43999.0
29 26062.229419 42500.0
.. ... ...
744 2379.583414 3000.0
745 6421.180201 4400.0
746 13355.106770 7500.0
747 8453.281424 10900.0
748 48699.979367 77500.0
749 6082.849234 7450.0
750 10381.621436 3000.0
751 8493.042746 12800.0
752 8591.658845 12000.0
753 6358.547301 4850.0
754 17028.451182 18700.0
755 15885.658673 17300.0
756 3752.540952 2600.0
757 12028.905190 10500.0
758 9380.459827 7950.0
759 10125.265176 6700.0
760 13443.324968 9000.0
761 9097.127448 8000.0
762 12201.288474 12999.0
763 12383.352887 10800.0
764 14049.760996 10700.0
765 11034.660068 9800.0
766 18982.148845 17900.0
767 24323.483753 18800.0
768 38260.361723 75555.0
769 29651.726363 29500.0
770 10732.071179 9600.0
771 13922.446953 18300.0
772 27487.751303 68500.0
773 13491.163043 10800.0
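A 'Residual' column is used in the next cell but never created above; presumably it is the difference between targets and predictions in the original price scale (a hedged sketch):
# Hedged reconstruction: residuals in the original (price) scale
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']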
# Finally, it makes sense to see how far off we are from the result percentage-wise
# Here, we take the absolute difference in %, so we can easily order the data frame
df_pf['Difference%'] = np.absolute(df_pf['Residual']/df_pf['Target']*100)
df_pf