Feature selection using SelectFromModel and LassoCV in Scikit Learn
Last Updated: 06 Mar, 2024
Feature selection is a critical step in machine learning and data analysis, aimed at identifying and retaining the most relevant variables in a dataset. It not only enhances model performance but also reduces overfitting and improves interpretability. In this guide, we delve into the world of feature selection using Scikit-Learn, a popular Python library for machine learning. Specifically, we explore the SelectFromModel class and the LassoCV model, showcasing their synergy for efficient feature selection.
Feature Selection
In machine learning, feature selection is the process of choosing the most significant and pertinent features from the initial set of variables. Its objective is to improve interpretability, minimize overfitting, and reduce dimensionality, all of which improve model performance. Selecting only the most informative features makes the model more effective, with quicker training times and better generalization to new, unseen data. Common methodologies include statistical tests, feature importance scores, and model-based procedures.
Concepts related to Feature selection using SelectFromModel and LassoCV
- L1 Regularization (Lasso): L1 regularization, also referred to as Lasso, is a regularization technique that penalizes the absolute values of a model's coefficients. By adding this penalty term to the cost function, it pushes some coefficients to exactly zero, which promotes sparsity. Lasso works well for feature selection because it automatically identifies the most important features while shrinking the coefficients of less important ones to zero (a short sketch after this list illustrates the sparsity effect). This regularization improves the interpretability of the model, inhibits overfitting, and contributes to better generalization performance.
- SelectFromModel: SelectFromModel is a scikit-learn feature selection method that uses the feature importance scores of a pre-trained model (either tree-based or linear) to automatically determine which features are the most significant. After the model is trained, only the features whose importance meets a user-specified threshold are retained. This strategy simplifies models by emphasizing the most informative features, improving efficiency and interpretability while maintaining or even improving predictive performance.
- LassoCV: LassoCV is a scikit-learn estimator that performs L1-regularized (Lasso) regression with a cross-validated selection of the regularization strength (alpha). It automates alpha tuning by using internal cross-validation to choose the value that minimizes mean squared error. This lets a linear model regularize itself and select features effectively, balancing predictive power and simplicity while reducing overfitting.
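To make the sparsity effect concrete, here is a minimal, self-contained sketch (an addition, not part of the original walkthrough) showing how increasing Lasso's alpha zeroes out more coefficients on synthetic regression data; the exact counts will vary with the random data:
Python3
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which carry real signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Larger alpha means a stronger L1 penalty and more zeroed coefficients
for alpha in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: {np.sum(lasso.coef_ != 0)} nonzero coefficients")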
How SelectFromModel and LassoCV work together
SelectFromModel leverages the coefficients produced by LassoCV. LassoCV first fits a Lasso regression model on the training data, choosing the regularization strength by cross-validation; the L1 penalty drives the coefficients of uninformative features to exactly zero. SelectFromModel then treats the absolute values of these coefficients as importance scores and keeps only the features whose score exceeds a threshold (for L1-penalized estimators such as Lasso, scikit-learn's default threshold is a small value close to zero, so in practice the features with nonzero coefficients are retained).
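As a rough sketch of the selection rule itself (illustrative coefficient values, not the article's actual results):
Python3
import numpy as np

# Hand-rolled equivalent of SelectFromModel's rule for an L1 model:
# keep features whose absolute coefficient exceeds the threshold
# (scikit-learn's default for Lasso-type estimators is close to zero)
coef = np.array([0.0, -0.7, 0.0, 0.02, 0.0])  # illustrative coefficients
threshold = 1e-5
mask = np.abs(coef) > threshold
print(mask)  # [False  True False  True False] -> features 1 and 3 kept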
Feature selection using SelectFromModel and LassoCV in Scikit Learn
Importing necessary libraries
Python3
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
This code walks through feature selection on the Breast Cancer dataset using scikit-learn. After dividing the data into training and testing sets, LassoCV is fitted and SelectFromModel is used to retain the significant features identified by the L1 regularization; a RandomForestClassifier is then trained on those selected features. Finally, the classifier is evaluated with a classification report, offering an understanding of the model's effectiveness.
Loading the Breast Cancer dataset and splitting the data
Python3
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The load_breast_cancer function from scikit-learn is used in this code to load the Breast Cancer dataset. Then, using train_test_split from scikit-learn, it divides the data into training and testing sets, with 80% going to training and 20% to testing. By fixing the random seed used for the data split, the random_state parameter guarantees reproducibility.
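As a quick sanity check (the Breast Cancer dataset has 569 samples and 30 features, so an 80/20 split yields 455 training rows and 114 test rows):
Python3
# Verify the shapes of the split
print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
print(y_train.shape, y_test.shape)  # (455,) (114,)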
Fitting the LassoCV model
Python3
# Fit LassoCV model
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)
Using 5-fold cross-validation (cv=5), this code fits a LassoCV model from scikit-learn. It automatically evaluates performance over several folds to find the best regularization strength (alpha), then refits on the full training data (X_train and y_train). Note that the Breast Cancer target is binary (0/1); LassoCV treats it as a continuous regression target, which is acceptable here for ranking features by coefficient magnitude, even though a penalized logistic regression would be the more conventional choice for a final classifier.
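After fitting, the cross-validated choice of alpha is available on the estimator; the exact value depends on the data and the fold assignment:
Python3
# Regularization strength chosen by 5-fold cross-validation
print("Best alpha:", lasso_cv.alpha_)

# The grid of candidate alphas that was evaluated
print("Number of alphas tried:", len(lasso_cv.alphas_))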
Feature selection using SelectFromModel
Python3
# Feature selection
sfm = SelectFromModel(lasso_cv, prefit=True)
X_train_selected = sfm.transform(X_train)
X_test_selected = sfm.transform(X_test)
Using scikit-learn's SelectFromModel, this code selects features. It selects the most significant features from the training and testing sets (X_train and X_test) using the pre-trained LassoCV model (lasso_cv). Only the features determined to be relevant by the L1 regularization are included in the final selected feature sets, which are saved in X_train_selected and X_test_selected.
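It is worth confirming how many features survived the selection; given the output shown later in this article, four of the original 30 features are retained:
Python3
print("Before selection:", X_train.shape)           # (455, 30)
print("After selection: ", X_train_selected.shape)  # (455, 4)
print("Selection mask:  ", sfm.get_support())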
Training a Random Forest Classifier using the selected features
Python3
# Train a Random Forest Classifier using the selected features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_selected, y_train)
Using the RandomForestClassifier from scikit-learn, this code trains a Random Forest classifier. For reproducibility, a fixed random state (random_state=42) and 100 decision trees (n_estimators=100) are used in the classifier's configuration. After that, it is fitted to the training set (X_train_selected and y_train) using just the features that were chosen during the LassoCV feature selection process.
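To judge whether the reduced feature set costs any accuracy, a natural optional check (an addition here, not part of the original walkthrough) is to train the same classifier on all 30 features and compare test accuracy:
Python3
# Baseline: the same classifier trained on the full feature set
baseline = RandomForestClassifier(n_estimators=100, random_state=42)
baseline.fit(X_train, y_train)
print("Accuracy, all 30 features:  ", baseline.score(X_test, y_test))
print("Accuracy, selected features:", model.score(X_test_selected, y_test))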
Evaluating the model
Python3
# Evaluate the model
y_pred = model.predict(X_test_selected)
print(classification_report(y_test, y_pred))
Output:
              precision    recall  f1-score   support

           0       0.97      0.86      0.91        43
           1       0.92      0.99      0.95        71

    accuracy                           0.94       114
   macro avg       0.95      0.92      0.93       114
weighted avg       0.94      0.94      0.94       114
Using the trained Random Forest Classifier on the testing set (X_test_selected), this code produces predictions (y_pred). It then uses the classification_report function from scikit-learn to print detailed performance metrics for each class in the binary classification task, including precision, recall, and F1-score. This report facilitates the evaluation of the model's performance on unseen data.
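For a complementary view of where the errors occur, a confusion matrix can be printed alongside the report (a small optional addition):
Python3
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))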
Analyzing selected features and their importance
Python3
# Analyze selected features and their importance
selected_feature_indices = np.where(sfm.get_support())[0]
selected_features = cancer.feature_names[selected_feature_indices]
coefficients = lasso_cv.coef_
print("Selected Features:", selected_features)
print("Feature Coefficients:", coefficients)
Output:
Selected Features: ['mean area' 'worst texture' 'worst perimeter' 'worst area']
Feature Coefficients: [-0.         -0.         -0.          0.00025492 -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.00999371 -0.01746145  0.00026856
 -0.         -0.         -0.         -0.         -0.         -0.        ]
This code inspects the features chosen by the LassoCV feature selection and their coefficients. Using sfm.get_support(), it obtains the indices of the selected features and maps them to the corresponding feature names in the Breast Cancer dataset. It then prints the coefficients from the LassoCV model (lasso_cv.coef_). This information shows which features the Lasso model considers relevant and how strongly each one contributes.
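Since most of the Lasso coefficients are exactly zero, it can be clearer to print only the nonzero entries next to their feature names (a small convenience sketch):
Python3
# Pair each nonzero coefficient with its feature name
for i in np.flatnonzero(lasso_cv.coef_):
    print(f"{cancer.feature_names[i]}: {lasso_cv.coef_[i]:.8f}")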
Extracting selected features from the original dataset and creating a DataFrame
Python3
# Extract the selected features from the original dataset
X_selected_features = X_train[:, selected_feature_indices]
# Create a DataFrame for better visualization
selected_features_df = pd.DataFrame(X_selected_features, columns=selected_features)
# Add the target variable for coloring
selected_features_df['target'] = y_train
Using the selected feature indices, this code extracts the chosen features from the original training data (X_train) and builds a DataFrame (selected_features_df) for easier visualization, with one column per selected feature. The target variable ('target') is added as an extra column so it can be used to color the points in the plot that follows. This structured representation makes it easier to examine the relationships between the selected features and the target variable.
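A quick preview confirms the layout of the DataFrame, one column per selected feature plus the target:
Python3
# First few rows of the selected-feature DataFrame
print(selected_features_df.head())

# Class balance of the training targets
print(selected_features_df['target'].value_counts())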
Plotting a scatter plot of the two most important features
Python3
# Plot the two most important features
sns.scatterplot(x='mean area', y='worst area', hue='target', data=selected_features_df, palette='viridis')
plt.xlabel('Mean Area')
plt.ylabel('Worst Area')
plt.title('Scatter Plot of Two Most Important Features - Breast Cancer Dataset')
plt.show()
Output: A scatter plot of 'mean area' versus 'worst area' for the Breast Cancer dataset, with points colored by target class.
This code uses Seaborn to create a scatter plot that shows the association between the "mean area" and "worst area," two features that were chosen from the Breast Cancer dataset. The target variable ("target") determines the color of the points, making it possible to visually examine any potential patterns or separability between the two classes.
In this article, we discussed the concept of feature selection using SelectFromModel and LassoCV in scikit-learn. This method allows you to automatically select the most relevant features and improve the efficiency of your machine learning models. By incorporating L1 regularization and cross-validation, you can enhance the robustness and accuracy of your feature selection process.