Support Vector Machines (SVMs) are powerful machine learning algorithms used for classification and regression. An SVM works by finding the boundary between classes that maximizes the margin between them. The challenge with SVMs, however, is that they can demand a large amount of computational power and are sensitive to the choice of features, which can make the model more complex and harder to interpret.
Univariate feature selection is a method used to select the most important features in a dataset. The idea behind this method is to evaluate each individual feature's relationship with the target variable and select the ones that have the strongest correlation. This process is repeated for each feature and the best ones are selected based on defined criteria, such as the highest correlation or statistical significance.
In univariate feature selection, the focus is on individual features and their contribution to the target variable, rather than considering the relationships between features. This method is simple and straightforward, but it does not take into account any interactions or dependencies between features.
Univariate feature selection is useful when working with a large number of features and the goal is to reduce the dimensionality of the data and simplify the modeling process. It is also useful for feature selection in cases where the relationship between the target variable and individual features is not complex and can be understood through a simple statistical analysis.
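To make the idea concrete, here is a minimal sketch (on synthetic data, not part of the examples below) that scores four features independently against a class label using scikit-learn's f_classif, which is introduced formally in the next sections:
Python3
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # four candidate features
# only feature 2 actually drives the class label
y = (X[:, 2] + 0.1 * rng.normal(size=100) > 0).astype(int)

scores, pvalues = f_classif(X, y)             # one score per feature, computed independently
for i, (s, p) in enumerate(zip(scores, pvalues)):
    print(f"feature {i}: F = {s:.2f}, p = {p:.3g}")
Feature 2 receives a far higher score than the others, and that per-feature ranking is exactly the signal univariate selection uses.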
Syntax of SelectKBest():
Select features according to the k highest scores.
sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)
score_func : the scoring function to use; options include f_classif, f_regression, chi2, mutual_info_classif, and mutual_info_regression. The default is f_classif, which is suited to classification data. A score function takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores (see the sketch after this list).
k : an integer giving the number of top features to select, or "all" to keep every feature. The default value is 10.
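Because score_func is just a callable with the contract described above, we can also plug in our own scorer. The absolute-correlation scorer below is a hypothetical illustration, not a scikit-learn built-in:
Python3
import numpy as np
from sklearn.feature_selection import SelectKBest

def abs_correlation(X, y):
    # hypothetical scorer: |Pearson correlation| of each column with the target
    X = np.asarray(X, dtype=float)
    return np.array([abs(np.corrcoef(col, y)[0, 1]) for col in X.T])

# returning a single array of scores (no p-values) is also accepted;
# selector.fit(X, y) would then rank the columns by this score
selector = SelectKBest(score_func=abs_correlation, k=2)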
ANOVA stands for Analysis of Variance and is a statistical technique used to determine the relationship between a dependent variable (label) and one or more independent variables (features). It measures the variability between different groups of data and helps to identify which independent variable has a significant impact on the dependent variable.
In machine learning, ANOVA is used as a univariate feature selection method, scoring each feature against the label. This helps identify the features in a dataset that have the greatest impact on the target variable.
Univariate statistical tests are a class of statistical tests that are used to analyze the distribution of a single variable. The goal of these tests is to determine whether there is significant variation in the variable and to identify any patterns or relationships in the data. Some common univariate statistical tests include:
The F-score, also known as the F-statistic, is a ratio of two variances used in ANOVA. It is calculated as the ratio of the variance between the groups to the variance within the groups. The F-score is used to test the hypothesis that the means of the groups are equal.
Formula:
The F-score can be calculated as follows:
F = (MSB / MSW)
where:
MSB = Mean Square Between (variance between groups)
MSW = Mean Square Within (variance within groups)
The F-score is used to test the null hypothesis, which states that the means of the groups are equal. If the calculated F-score is larger than the critical value from the F-distribution, the null hypothesis is rejected, and it is concluded that there is a significant difference between the means of the groups.
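As a quick sanity check of the formula, the sketch below (toy numbers, assuming SciPy is available) computes the one-way ANOVA F-statistic with scipy.stats.f_oneway and confirms that scikit-learn's f_classif produces the same value when the three groups are encoded as a label:
Python3
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

g1 = np.array([5.1, 4.9, 5.0, 5.2])   # group A
g2 = np.array([6.0, 6.2, 5.9, 6.1])   # group B
g3 = np.array([6.9, 7.1, 7.0, 6.8])   # group C

F, p = f_oneway(g1, g2, g3)           # F = MSB / MSW
print(f"F = {F:.2f}, p = {p:.3g}")

# the same numbers from f_classif, with the groups encoded as a label
X = np.concatenate([g1, g2, g3]).reshape(-1, 1)
y = np.repeat([0, 1, 2], 4)
print(f_classif(X, y))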
Here is how these tests appear in Scikit-Learn as scoring functions, which we will pass as score_func:
f_classif : used for classification problems. It calculates the ANOVA (analysis of variance) F-value between each feature and the target variable, and the features with the highest F-values are selected as the top k features. We use it in Example 1 below as SelectKBest(f_classif, k=2).
f_regression : used for regression problems. It calculates the F-value between each feature and the target variable, and the features with the highest F-values are selected as the top k features. We use it in Example 2 below as SelectKBest(f_regression, k=3).
chi2 : used to determine whether there is a significant association between two categorical variables. It compares the observed frequency of occurrences with the expected frequency, and requires non-negative feature values such as counts.
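Because chi2 requires non-negative inputs, here is a minimal sketch on synthetic count data; the feature names passed to get_feature_names_out are made up for the example:
Python3
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 4))   # non-negative count features
y = (X[:, 0] > 4).astype(int)            # class driven by feature 0

selector = SelectKBest(chi2, k=1).fit(X, y)
print(selector.scores_)                  # feature 0 scores highest
print(selector.get_feature_names_out(['f0', 'f1', 'f2', 'f3']))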
EXAMPLE 1 :
In this article, we will use the iris dataset from the sci-kit-learn library and apply univariate feature selection to the data before training an SVM. The iris dataset contains 150 samples of iris flowers, with four features: sepal length, sepal width, petal length, and petal width. The goal is to use SVM to classify the iris flowers into three different species based on their features.
Step 1: Load the iris dataset and split the data into training and test sets:
Python3
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(['target'], axis = 1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
Step 2: Univariate Feature Selection
We will use the SelectKBest class from the sklearn.feature_selection module to perform univariate feature selection. Here, SelectKBest(f_classif, k=2) applies the f_classif scoring function described above: it computes the ANOVA F-value between each feature and the target and keeps the k features with the highest values. We set k=2, so the two best features of the dataset are retained.
Python3
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(f_classif, k=2)
selector.fit(X_train, y_train)
print('Number of input features:', selector.n_features_in_)
print('Input features Names :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())
Output:
Number of input features: 4
Input features Names : ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
'petal width (cm)']
Input features scores : [ 84.80836804 41.29284269 925.55642345 680.77560309]
Input features pvalues: [1.72477507e-23 2.69962606e-14 1.93619072e-72 3.57639330e-65]
Output features Names : ['petal length (cm)' 'petal width (cm)']
Now we will keep only petal length and petal width by applying selector.transform to the training and test features.
Python3
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
Step 3: Train the Support Vector Machine classifier on the selected features.
Now that we have selected the best two features, we will train an SVM classifier using these features:
Python3
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train_selected, y_train)
Step 4: Evaluate the performance of the SVM classifier
Finally, we will evaluate the performance of the SVM classifier by calculating its accuracy on the test set:
Python3
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
This means that the SVM classifier was able to classify 100% of the test samples correctly, using only two features. By reducing the number of features in the model, we have made it simpler and more interpretable, while still achieving good performance.
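One caveat: a perfect score on a 30-sample test set can partly be luck of the split. As a hedged sanity check (reusing the variables from the steps above), we can repeat the evaluation with 5-fold cross-validation:
Python3
from sklearn.model_selection import cross_val_score

# note: the selector was fitted on X_train only; refitting it inside each
# fold (e.g. via a Pipeline, see the conclusion) would be stricter
scores = cross_val_score(SVC(kernel='linear', C=1, random_state=42),
                         selector.transform(X), y, cv=5)
print("CV accuracy:", scores.mean())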
Full code:
Python3
# Import the necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
# Load the datasets
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(['target'], axis = 1)
y = df['target']
# Split train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Select the best features
selector = SelectKBest(f_classif, k=2)
selector.fit(X_train, y_train)
print('Number of input features:', selector.n_features_in_)
print('Input features Names :', selector.feature_names_in_)
print('Input features scores :', selector.scores_)
print('Input features pvalues:', selector.pvalues_)
print('Output features Names :', selector.get_feature_names_out())
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Train the classifier
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train_selected, y_train)
# Prediction
y_pred = clf.predict(X_test_selected)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("\n Accuracy:", accuracy)
Output:
Number of input features: 4
Input features Names : ['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
'petal width (cm)']
Input features scores : [ 84.80836804 41.29284269 925.55642345 680.77560309]
Input features pvalues: [1.72477507e-23 2.69962606e-14 1.93619072e-72 3.57639330e-65]
Output features Names : ['petal length (cm)' 'petal width (cm)']
Accuracy: 1.0
Example 2:
In this example, we use the SelectKBest class from the sklearn.feature_selection module with f_regression as the scoring function, which computes the F-value between each feature and the target. The fit method fits the selector to the data, the scores_ attribute holds the score of each feature, and get_feature_names_out() returns the names of the selected features.
Here, SelectKBest(f_regression, k=3) selects the three features of the diabetes dataset with the highest F-values. We then train a support vector regressor (SVR) on those features and measure the mean squared error on the test set.
Python3
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.metrics import mean_squared_error
# Load the diabetes dataset
data = load_diabetes(as_frame=True)
df = data.frame
X = df.drop(['target'], axis = 1)
y = df['target']
# Split train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Create the feature selector
selector = SelectKBest(f_regression, k=3)
# Fit the selector to the data
selector.fit(X_train, y_train)
print('Number of input features:', selector.n_features_in_)
print('Input features Names :', selector.feature_names_in_)
# Get the scores for each feature
print('Input features scores :', selector.scores_)
# Get the pvalues for each feature
print('Input features pvalues:', selector.pvalues_)
# Print the names of the best features
print('Output features Names :', selector.get_feature_names_out())
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Train the classifier
reg = SVR(kernel='rbf')
reg.fit(X_train_selected, y_train)
# Prediction
y_pred = reg.predict(X_test_selected)
mse = mean_squared_error(y_test, y_pred)
print("\Mean Squared Error :", mse)
Output:
Number of input features: 10
Input features Names : ['age' 'sex' 'bmi' 'bp' 's1' 's2' 's3' 's4' 's5' 's6']
Input features scores : [1.40986700e+01 1.77755064e-02 2.02386965e+02 8.65580384e+01
1.45561098e+01 8.63143031e+00 6.07087750e+01 7.74171182e+01 1.53967806e+02 6.31023038e+01]
Input features pvalues: [2.02982942e-04 8.94012908e-01 1.39673719e-36 1.49839640e-18
1.60730187e-04 3.52250747e-03 7.56195523e-14 6.36582277e-17 1.45463546e-29 2.69104622e-14]
Output features Names : ['bmi' 'bp' 's5']
Mean Squared Error : 3668.63356096246
In conclusion, univariate feature selection is a useful technique for reducing the complexity of SVM models while retaining good predictive performance.
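A final practical note: fitting the selector once on the training set and reusing it everywhere can leak information into evaluation. A minimal sketch of the usual remedy is to chain selection and the SVM in a scikit-learn Pipeline, so the selector is refitted inside every cross-validation fold:
Python3
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=2)),   # univariate selection step
    ('svm', SVC(kernel='linear', C=1)),        # SVM on the selected features
])
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())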