Grid Search CV
Photo by DATAIDEA
GridSearchCV
GridSearchCV is a class in scikit-learn, a popular machine learning library for Python. It’s used for
hyperparameter optimization: it tries every combination of hyperparameter values in a grid that you
specify, evaluates each combination with cross-validation, and reports which one scored best.
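To make the idea of a "grid" concrete, the sketch below lists the candidates a grid search would try. The second hyperparameter (weights) and the specific values are assumptions added for illustration; this section itself only tunes n_neighbors.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values of n_neighbors x 2 values of weights
param_grid = {
    'n_neighbors': [1, 5, 10],
    'weights': ['uniform', 'distance'],
}

# ParameterGrid expands the dictionary into every combination
candidates = list(ParameterGrid(param_grid))

for params in candidates:
    print(params)

print(len(candidates), 'candidates')
```

With cv=3, each candidate is fit three times, so this hypothetical grid would cost 6 × 3 = 18 fits, plus one final refit on all the data when refit=True (the default).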
2 of 6 5/7/24, 14:53
DATAIDEA https://courses.dataidea.org/Python%20Data%20Analy...
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2
# numeric columns
numeric_cols = [
    'INDUS', 'NOX', 'RM',
    'TAX', 'PTRATIO', 'LSTAT'
]
# categorical columns
categorical_cols = ['CHAS', 'RAD']
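The lines that load the data and define X and y do not appear in this excerpt. A minimal sketch of how they might look, assuming the table above is held in a DataFrame (called data here, a hypothetical name) and that MEDV is the prediction target. Only the five sample rows shown above are used so that the sketch runs on its own:

```python
import pandas as pd

# The five sample rows shown above, restricted to the columns used here.
# `data` is a hypothetical name; the full dataset would be loaded instead.
data = pd.DataFrame({
    'INDUS': [2.31, 7.07, 7.07, 2.18, 2.18],
    'NOX': [0.538, 0.469, 0.469, 0.458, 0.458],
    'RM': [6.575, 6.421, 7.185, 6.998, 7.147],
    'TAX': [296.0, 242.0, 242.0, 222.0, 222.0],
    'PTRATIO': [15.3, 17.8, 17.8, 18.7, 18.7],
    'LSTAT': [4.98, 9.14, 4.03, 2.94, 5.33],
    'CHAS': [0, 0, 0, 0, 0],
    'RAD': [1, 2, 2, 3, 3],
    'MEDV': [24.0, 21.6, 34.7, 33.4, 36.2],
})

numeric_cols = ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']
categorical_cols = ['CHAS', 'RAD']

# Assumption: MEDV is the target; the selected columns are the features
X = data[numeric_cols + categorical_cols]
y = data['MEDV']

print(X.shape, y.shape)
```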
Preprocessing steps
We define transformers for preprocessing numeric and categorical features. StandardScaler
standardizes the numeric features, while OneHotEncoder one-hot encodes the categorical features.
Each transformer is applied to its own group of columns using ColumnTransformer, as we learned in
the previous section.
# Preprocessing steps
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
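The pipeline that follows refers to a column_transformer whose definition is missing from this excerpt. One plausible reconstruction, assuming the step names numeric and categorical (they appear in the pipeline diagram) and the column lists defined earlier. The small demo DataFrame is illustrative only, with partly invented values, so that the sketch runs on its own:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']
categorical_cols = ['CHAS', 'RAD']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Route each group of columns to its own transformer
column_transformer = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_cols),
    ('categorical', categorical_transformer, categorical_cols),
])

# Illustrative rows, only to show the transformer running
demo = pd.DataFrame({
    'INDUS': [2.31, 7.07, 2.18], 'NOX': [0.538, 0.469, 0.458],
    'RM': [6.575, 6.421, 6.998], 'TAX': [296.0, 242.0, 222.0],
    'PTRATIO': [15.3, 17.8, 18.7], 'LSTAT': [4.98, 9.14, 2.94],
    'CHAS': [0, 0, 1], 'RAD': [1, 2, 3],
})
transformed = column_transformer.fit_transform(demo)

# 6 scaled numeric columns + one column per category seen in CHAS and RAD
print(transformed.shape)
```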
# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline([
    ('column_transformer', column_transformer),
    ('model', KNeighborsRegressor(n_neighbors=10))
])
pipe
[Output: diagram of the Pipeline, showing column_transformer (a ColumnTransformer with numeric: StandardScaler and categorical: OneHotEncoder) followed by KNeighborsRegressor]
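The step names chosen in the pipeline matter for the grid search, because a hyperparameter inside a pipeline step is addressed as step name, double underscore, parameter name. A small sketch of that naming scheme follows; the shortened column lists are an assumption made to keep it brief, and no data is needed because nothing is fitted:

```python
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

column_transformer = ColumnTransformer([
    ('numeric', StandardScaler(), ['RM', 'LSTAT']),  # shortened column list
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['CHAS', 'RAD']),
])
pipe = Pipeline([
    ('column_transformer', column_transformer),
    ('model', KNeighborsRegressor(n_neighbors=10)),
])

# Nested parameters are exposed as <step name>__<parameter name>
params = pipe.get_params()
print(params['model__n_neighbors'])

model_param_names = sorted(name for name in params if name.startswith('model__'))
print(model_param_names)

# set_params uses the same naming scheme, which is what a grid search relies on
pipe.set_params(model__n_neighbors=5)
print(pipe.named_steps['model'].n_neighbors)
```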
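from sklearn.model_selection import GridSearchCV

# Try every n_neighbors from 1 to 10 using 3-fold cross-validation
model = GridSearchCV(
    estimator=pipe,
    param_grid={
        'model__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    cv=3
)
model.fit(X, y)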
[Output: diagram of the fitted GridSearchCV, whose estimator is the Pipeline: column_transformer (numeric: StandardScaler, categorical: OneHotEncoder) followed by KNeighborsRegressor]
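Once fitted, a GridSearchCV object exposes the winning configuration directly. The sketch below shows the most commonly used attributes. It runs on synthetic data from make_regression with a simplified pipeline, both of which are assumptions made so the example is self-contained; they are not the housing data or the pipeline used in this section.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsRegressor()),
])
search = GridSearchCV(
    estimator=demo_pipe,
    param_grid={'model__n_neighbors': [1, 3, 5, 7, 9]},
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning parameter combination
print(search.best_score_)   # its mean cross-validated score (R^2 by default for regressors)

# With the default refit=True, the best pipeline is refit on all the data,
# so the search object can be used to predict directly.
predictions = search.predict(X_demo[:5])
print(predictions.shape)
```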
cv_results = pd.DataFrame(model.cv_results_)
cv_results
These are the results of a grid search cross-validation performed on our pipeline (pipe). Let’s break
down each column:
• mean_fit_time: The average time, in seconds, taken to fit the estimator on the training portion of
each fold.
• std_fit_time: The standard deviation of the fitting time across all folds.
• mean_score_time: The average time, in seconds, taken to score the estimator on the held-out
portion of each fold.
• std_score_time: The standard deviation of the scoring time across all folds.
• param_model__n_neighbors: The value of the n_neighbors parameter of the
KNeighborsRegressor model in our pipeline for the candidate in this row.
• params: A dictionary containing all the parameter settings for the candidate in this row.
• split0_test_score, split1_test_score, split2_test_score: The score obtained on the held-out
portion of each of the three folds (cv=3), one column per fold. Because no scoring argument was
passed, this is the estimator's default score, which for a regressor is R^2.
• mean_test_score: The average test score across all folds.
• std_test_score: The standard deviation of the test scores across all folds.
• rank_test_score: The rank of this candidate based on the mean test score. Rank 1 is the best, so
lower values indicate better performance.
These results allow you to compare different parameter configurations and select the one that
performs best based on the mean test score and other relevant metrics.
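The comparison described above can be sketched as follows. The example checks that mean_test_score is the average of the per-fold columns, then sorts the candidates by rank_test_score. It uses synthetic data and a bare KNeighborsRegressor instead of the pipeline, both assumptions made to keep it self-contained, which is why the parameter column is named param_n_neighbors here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)

search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={'n_neighbors': [1, 3, 5, 7, 9]},
    cv=3,
)
search.fit(X_demo, y_demo)

results = pd.DataFrame(search.cv_results_)

# mean_test_score is the average of the per-fold scores
split_cols = ['split0_test_score', 'split1_test_score', 'split2_test_score']
recomputed_mean = results[split_cols].mean(axis=1)
print(np.allclose(recomputed_mean, results['mean_test_score']))

# Keep the most informative columns and put the best candidate first
summary = (
    results[['param_n_neighbors', 'mean_test_score',
             'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

Looking at std_test_score next to mean_test_score is a useful habit: two candidates with nearly equal means but very different spreads are not equally trustworthy.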