

GridSearchCV

GridSearchCV is a class in scikit-learn, a popular machine learning library in Python. It's used for hyperparameter optimization: it exhaustively tries every combination of candidate hyperparameter values from a grid, using cross-validation to find the set that works best for a machine learning model.
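Before building the full example, here is a minimal sketch of the pattern. The toy data, parameter values and variable names are illustrative only; the rest of this notebook builds a complete pipeline version.

# a minimal, self-contained sketch of the GridSearchCV pattern
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))   # 100 samples, 3 features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={'n_neighbors': [3, 5, 7]},   # candidate values to try
    cv=3                                      # 3-fold cross-validation
)
search.fit(X_demo, y_demo)     # fits one model per candidate value per fold
print(search.best_params_)     # the winning combination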

Let’s import some packages


We begin by importing necessary packages and modules. The KNeighborsRegressor model is
imported from the sklearn.neighbors module.

# Let's import some packages


from dataidea.packages import * # imports np, pd, plt, etc
from dataidea.datasets import loadDataset
from sklearn.neighbors import KNeighborsRegressor

Let’s import necessary components from sklearn


We import essential components from sklearn: Pipeline, which we'll use to create a pipeline as in the previous section, along with ColumnTransformer, StandardScaler, and OneHotEncoder, which we'll use to transform the numeric and categorical columns respectively into a form suitable for modelling.

# Let's import the Pipeline and preprocessing components from sklearn


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

Loading the dataset


We load the dataset named boston using the loadDataset function, which is built into the dataidea
package. The loaded dataset is stored in the variable data. More about the boston dataset can be
found in the Pipeline notebook.

# loading the data set


data = loadDataset('boston')
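If you don't have the dataidea package available, a hedged alternative is to fetch the classic Boston housing data from OpenML; this assumes the dataset is published there under the name 'boston', and the column names or casing may differ from the dataidea copy.

# alternative: fetch the Boston housing data from OpenML (assumption: name='boston')
from sklearn.datasets import fetch_openml

boston = fetch_openml(name='boston', version=1, as_frame=True)
data_alt = boston.frame  # a pandas DataFrame with features and target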


# looking at the top part


data.head()

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV

0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4

4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2

Selecting features (X) and target variable (y)


We separate the features (X) from the target variable (y). Features are stored in X, excluding the
target variable 'MEDV', which is stored in y.

# Selecting our X set and y


X = data.drop('MEDV', axis=1)
y = data.MEDV

Defining numeric and categorical columns


We define lists of column names representing the numeric and categorical features in the dataset. We
identified these columns as the best features in the previous section of this week.

# numeric columns
numeric_cols = [
'INDUS', 'NOX', 'RM',
'TAX', 'PTRATIO', 'LSTAT'
]

# categorical columns
categorical_cols = ['CHAS', 'RAD']
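As a quick sanity check (not in the original notebook), we can confirm that the categorical columns are low-cardinality, which is what makes one-hot encoding reasonable here:

# sanity check: number of distinct values in each categorical column
print(data[categorical_cols].nunique())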

Preprocessing steps
We define transformers for preprocessing numeric and categorical features. StandardScaler is used
for standardizing numeric features, while OneHotEncoder is used for one-hot encoding categorical
features. These transformers are applied to respective feature types using ColumnTransformer as we
learned in the previous section.

# Preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessing steps


column_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_cols),
        ('categorical', categorical_transformer, categorical_cols)
    ]
)
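To see what the combined transformer produces, a small optional check (assuming scikit-learn 1.0 or newer, where ColumnTransformer has get_feature_names_out) is to fit it on X and inspect the output:

# fit the transformer and inspect the transformed feature matrix
transformed = column_transformer.fit_transform(X)
print(transformed.shape)                            # (n_samples, n_output_features)
print(column_transformer.get_feature_names_out())   # names of the output columns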

Defining the pipeline


We construct a machine learning pipeline using Pipeline . The pipeline consists of preprocessing
steps (defined in column_transformer ) and a KNeighborsRegressor model with 10 neighbors.

# Pipeline
pipe = Pipeline([
('column_transformer', column_transformer),
('model', KNeighborsRegressor(n_neighbors=10))
])

pipe

(Rendered pipeline diagram: a Pipeline whose column_transformer step applies StandardScaler to the numeric columns and OneHotEncoder to the categorical columns, followed by a KNeighborsRegressor model.)

Fitting the pipeline


As we learned, the Pipeline has fit, score and predict methods. We use fit to train on the dataset
(X, y), score() to evaluate the model's performance (for a regressor this returns the R² score), and
predict to make predictions.

# Fit the pipeline


pipe.fit(X, y)

# Score the pipeline


pipe_score = pipe.score(X, y)

# Predict using the pipeline


pipe_predicted_y = pipe.predict(X)

print('Pipe Score:', pipe_score)

Pipe Score: 0.818140222027107
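Note that this score is computed on the same data the pipeline was fit on, so it is an optimistic estimate of performance. A minimal sketch of a more honest check, holding out a test set with sklearn's train_test_split (this step is not in the original notebook):

# hold out a test set for a less optimistic score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipe.fit(X_train, y_train)
print('Holdout score:', pipe.score(X_test, y_test))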

Hyperparameter tuning using GridSearchCV


We perform hyperparameter tuning using GridSearchCV. The pipeline (pipe) serves as the base
estimator, and we define a grid of hyperparameters to search through, focusing on the number of
neighbors for the KNN model.


from sklearn.model_selection import GridSearchCV

model = GridSearchCV(
estimator=pipe,
param_grid={
'model__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
},
cv=3
)
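The model__ prefix routes each parameter to the pipeline step named 'model'. The grid can also hold several parameters at once, in which case GridSearchCV tries every combination. As an illustration (not part of the original notebook), KNeighborsRegressor also exposes a weights parameter:

# illustrative: a wider grid searching two parameters at once (10 x 2 = 20 candidates)
model_wide = GridSearchCV(
    estimator=pipe,
    param_grid={
        'model__n_neighbors': list(range(1, 11)),
        'model__weights': ['uniform', 'distance']
    },
    cv=3
)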

Fitting the model for hyperparameter tuning


We fit the GridSearchCV model on the dataset to find the optimal hyperparameters. This involves
preprocessing the data and training the model multiple times using cross-validation.

model.fit(X, y)

(Rendered GridSearchCV diagram: the fitted search object wrapping the same Pipeline, with the ColumnTransformer's StandardScaler and OneHotEncoder steps followed by the KNeighborsRegressor model.)

Extracting and displaying cross-validation results


We extract the results of cross-validation performed during hyperparameter tuning and present them
in a tabular format using a DataFrame.

cv_results = pd.DataFrame(model.cv_results_)
cv_results

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_model__n_neighbors                      params
0       0.004732      0.000748         0.003540        0.000235                         1   {'model__n_neighbors': 1}
1       0.003898      0.000124         0.002893        0.000093                         2   {'model__n_neighbors': 2}
2       0.004030      0.000161         0.003201        0.000090                         3   {'model__n_neighbors': 3}
3       0.003658      0.000036         0.002809        0.000061                         4   {'model__n_neighbors': 4}
4       0.003693      0.000060         0.002730        0.000054                         5   {'model__n_neighbors': 5}
5       0.003681      0.000137         0.002855        0.000016                         6   {'model__n_neighbors': 6}
6       0.004109      0.000393         0.002984        0.000106                         7   {'model__n_neighbors': 7}
7       0.003641      0.000018         0.002798        0.000052                         8   {'model__n_neighbors': 8}
8       0.003670      0.000036         0.002757        0.000006                         9   {'model__n_neighbors': 9}
9       0.003590      0.000023         0.002807        0.000031                        10  {'model__n_neighbors': 10}

(The remaining columns, including the per-fold test scores, mean_test_score and rank_test_score, are truncated in this display.)

These are the results of the grid search cross-validation performed on our pipeline (pipe). Let's break
down each column:

• mean_fit_time : The average time taken to fit the estimator on the training data across all folds.
• std_fit_time : The standard deviation of the fitting time across all folds.
• mean_score_time : The average time taken to score the estimator on the test data across all
folds.
• std_score_time : The standard deviation of the scoring time across all folds.
• param_model__n_neighbors : The value of the n_neighbors parameter of the
KNeighborsRegressor model in our pipeline for this particular grid search iteration.
• params : A dictionary containing the parameters used in this grid search iteration.
• split0_test_score , split1_test_score , split2_test_score : The test scores obtained for
each fold of the cross-validation. Each fold corresponds to one entry here.
• mean_test_score : The average test score across all folds.
• std_test_score : The standard deviation of the test scores across all folds.
• rank_test_score : The rank of this model configuration based on the mean test score. Lower
values indicate better performance.

These results allow you to compare different parameter configurations and select the one that
performs best based on the mean test score and other relevant metrics.
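Rather than scanning the full table, you can pull out the columns that matter most for model selection, and the fitted GridSearchCV object exposes the winning configuration directly (best_params_, best_score_ and best_estimator_ are standard attributes of a fitted GridSearchCV):

# the columns most relevant for model selection
print(cv_results[['param_model__n_neighbors', 'mean_test_score', 'rank_test_score']])

# the best configuration found by the search
print('Best parameters:', model.best_params_)
print('Best mean CV score:', model.best_score_)

# best_estimator_ is the pipeline refitted on all of (X, y) with the best parameters
best_pipe = model.best_estimator_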


