
Selecting Top Features with tsfresh: A Technical Guide

Last Updated : 11 Jul, 2024

tsfresh (Time Series Feature extraction based on scalable hypothesis tests) is a Python library that automatically extracts a large number of features from time series data. The resulting feature matrix can feed classification, regression, or clustering models. Because many of the generated features are irrelevant to any given target, keeping all of them invites overfitting, which makes feature selection crucial.

tsfresh's Approach to Feature Selection

tsfresh's feature selection module leverages statistical hypothesis tests to assess the relevance of each feature. This is particularly beneficial for time series data, where features often exhibit complex dependencies.

The relevance table is the central data structure in tsfresh's feature selection. It is generated by the tsfresh.feature_selection.relevance module, which computes a p-value for each feature with a univariate test against the target. The Benjamini-Hochberg procedure is then applied to these p-values to control the false discovery rate and decide which features are significant and should be retained.
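To make the Benjamini-Hochberg step concrete, here is a minimal pure-Python sketch of the procedure applied to a handful of made-up p-values. The function name benjamini_hochberg and the sample values are illustrative assumptions; tsfresh's own implementation may differ in detail.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    (the feature is kept) under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k / m * fdr_level
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr_level:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five features
p_values = [0.001, 0.045, 0.025, 0.20, 0.012]
relevant = benjamini_hochberg(p_values, fdr_level=0.05)
print(relevant)
```

In this example the feature with p-value 0.045 would pass a naive 0.05 cutoff but is not retained, because its rank-adjusted threshold is stricter.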

Selecting Top Features with tsfresh

1. Calculating the Relevance Table

To calculate the relevance table, use the calculate_relevance_table function from the tsfresh.feature_selection.relevance module. Its main parameters are the feature matrix X, the target vector y, and the machine learning task ml_task. It returns a pandas DataFrame with one row per feature, containing the feature name, its type, the p-value, and a boolean relevant flag.

from tsfresh.feature_selection.relevance import calculate_relevance_table

relevance_table = calculate_relevance_table(X, y, ml_task='auto')
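In practice, X and y should refer to the same samples in the same order before they are passed in; tsfresh's selection functions generally expect the two to share an index. The pandas-only sketch below shows one way to line them up. The feature name, ids, and labels are hypothetical.

```python
import pandas as pd

# Hypothetical feature matrix indexed by time series id
X = pd.DataFrame({"value__mean": [15.0, 30.0, 22.0]}, index=[1, 2, 3])

# Hypothetical labels, stored in a different order and with an extra id
labels = pd.Series({3: 1, 1: 0, 2: 1, 4: 0}, name="target")

# Reorder y to follow X's index and drop ids that have no features
y = labels.reindex(X.index)

# Fail early if any id in X has no label
if y.isna().any():
    raise ValueError("Some ids in X have no label")

print(y)
```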

2. Sorting and Selecting Top Features

Once the relevance table is calculated, you can sort it by the p-values to identify the most significant features. To select the top n features, you can use pandas to select the first n rows of the sorted table.

# Number of features to keep (example value)
n = 5

# Sort the relevance table by p-value, smallest first
sorted_table = relevance_table.sort_values(by='p_value')

# Select the top n features
top_n_features = sorted_table.head(n)

For a practical implementation, let's create a random dataset and demonstrate how to perform feature selection using tsfresh. Here's a step-by-step example:

Python
import numpy as np
import pandas as pd

np.random.seed(42)  # For reproducibility
n_samples = 100
n_features = 20

X = pd.DataFrame(np.random.randn(n_samples, n_features), columns=[f'Feature_{i+1}' for i in range(n_features)])
y = pd.Series(np.random.randint(0, 2, n_samples), name='Target')

# Calculate Relevance Table using tsfresh
from tsfresh.feature_selection.relevance import calculate_relevance_table

relevance_table = calculate_relevance_table(X, y, ml_task='auto')

# Sort and select the top features.
# calculate_relevance_table already returns a pandas DataFrame,
# so it can be sorted by p-value directly.
sorted_table = relevance_table.sort_values(by='p_value')

# Define the number of top features you want to select
n_top_features = 5  # Example: Selecting top 5 features

# Select the top n features
top_n_features = sorted_table.head(n_top_features)
print("Top", n_top_features, "features:")
print(top_n_features)

Output:

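Top 5 features:
               feature  type   p_value  relevant
feature
Feature_18  Feature_18  real  0.055744     False
Feature_19  Feature_19  real  0.091964     False
Feature_20  Feature_20  real  0.211287     False
Feature_7    Feature_7  real  0.254481     False
Feature_4    Feature_4  real  0.266178     False

Because X and y are independent random noise here, none of the features is flagged as relevant. The ranking only orders them by p-value.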
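Selecting rows of the relevance table gives you feature names, not a reduced feature matrix. The sketch below shows one way to carry the names over to the columns of X. The relevance table here is a hand-built stand-in with made-up p-values, not real tsfresh output.

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix with four features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(6, 4)), columns=["f1", "f2", "f3", "f4"])

# Hand-built stand-in for a relevance table (the p-values are made up)
relevance_table = pd.DataFrame(
    {
        "feature": ["f1", "f2", "f3", "f4"],
        "p_value": [0.30, 0.01, 0.20, 0.04],
    }
)

n_top_features = 2

# Take the n rows with the smallest p-values and read off their names
top_rows = relevance_table.nsmallest(n_top_features, "p_value")
top_names = top_rows["feature"].tolist()

# Keep only those columns of the feature matrix
X_top = X[top_names]
print(top_names, X_top.shape)
```

If you only want features that also pass the hypothesis test, filter the table on its relevant column before taking the top n.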

How to Select Top Features: Alternative Methods

tsfresh's own selection keeps every feature that passes the hypothesis test, so the number of retained features is not fixed in advance. When you need exactly k features, one alternative is to extract features with tsfresh and rank them with scikit-learn's SelectKBest. The steps below walk through that approach with code examples.

Step 1: Install tsfresh

First, ensure you have tsfresh installed in your Python environment. You can install it using pip:

pip install tsfresh

Step 2: Import Necessary Libraries

Next, import the necessary libraries: pandas, tsfresh, and the scikit-learn classes used for imputation and selection.

Python
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

Step 3: Prepare Your Data

Let's assume you have a time series dataset. We'll create a small sample dataframe in the long format used by tsfresh, with an id column, a time column, and a value column:

Python
# Sample time series data
data = {
    'id': [1, 1, 1, 2, 2, 2],
    'time': [1, 2, 3, 1, 2, 3],
    'value': [10, 20, 15, 30, 25, 35]
}
df = pd.DataFrame(data)
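Two ids with three observations each are enough to show the input format, but too few for any statistical selection to be meaningful. As a rough sketch, a larger long-format frame with one label per id could be generated as follows. The sizes, the class-dependent mean, and the names df_long and y_long are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_series, n_steps = 20, 50  # hypothetical sizes

frames = []
labels = {}
for series_id in range(n_series):
    label = series_id % 2  # alternate between the two classes
    # Class 1 series are drawn around a higher mean than class 0 series
    values = rng.normal(loc=5.0 * label, scale=1.0, size=n_steps)
    frames.append(
        pd.DataFrame({"id": series_id, "time": np.arange(n_steps), "value": values})
    )
    labels[series_id] = label

# One row per (id, time) pair, and one label per id
df_long = pd.concat(frames, ignore_index=True)
y_long = pd.Series(labels, name="target")
print(df_long.shape, y_long.value_counts().to_dict())
```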

Step 4: Extract Features

Use tsfresh to extract features from your time series data.

Python
# Extract features
extracted_features = extract_features(df, column_id="id", column_sort="time", default_fc_parameters=ComprehensiveFCParameters())

# Drop columns with all NaN values (if any)
extracted_features = extracted_features.dropna(axis=1, how='all')

Output

WARNING:tsfresh.feature_extraction.settings:Dependency not available for matrix_profile, this feature will be disabled!
Feature Extraction: 100%|██████████| 2/2 [00:00<00:00, 24.44it/s]

Step 5: Handle NaNs

Impute missing values using the mean of the feature.

Python
# Impute missing values with the mean of the respective feature
imputer = SimpleImputer(strategy='mean')
imputed_features = imputer.fit_transform(extracted_features)
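SimpleImputer returns a NumPy array, so the column names are lost, which is why Step 6 reads them back from extracted_features.columns. A pandas-only alternative that keeps the names, and also treats infinite values as missing, might look like the sketch below. The feature names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical extracted features with a NaN and an infinite value
features = pd.DataFrame(
    {
        "value__mean": [15.0, 30.0, 45.0],
        "value__ratio": [np.nan, 2.0, 4.0],
        "value__slope": [1.0, np.inf, 3.0],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Treat infinities as missing, then fill each column with its own mean
cleaned = features.replace([np.inf, -np.inf], np.nan)
imputed = cleaned.fillna(cleaned.mean())
print(imputed)
```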

Step 6: Select Top Features

To select the top features, we'll use SelectKBest from scikit-learn.

Python
# Assuming you have a target variable
y = [0, 1]  # Sample target variable for two different ids

# Select top 5 features
selector = SelectKBest(score_func=f_classif, k=5)
selected_features = selector.fit_transform(imputed_features, y)

# Get the names of the selected features
selected_feature_mask = selector.get_support()
selected_feature_names = extracted_features.columns[selected_feature_mask]
print(selected_feature_names)

Output:

Index(['value__fourier_entropy__bins_3', 'value__fourier_entropy__bins_5',
       'value__fourier_entropy__bins_10', 'value__fourier_entropy__bins_100',
       'value__permutation_entropy__dimension_3__tau_1'],
      dtype='object')
/usr/local/lib/python3.10/dist-packages/sklearn/feature_selection/_univariate_selection.py:109: RuntimeWarning: invalid value encountered in divide
  msw = sswn / float(dfwn)

Two details keep the feature names aligned with the selector's output:

  • Columns that are entirely NaN are dropped before imputation, so the imputer does not silently discard them.
  • The boolean mask from get_support() therefore has the same length as extracted_features.columns.

The snippet imputes the remaining NaNs with the mean of each feature and then selects the top 5 features by their ANOVA F-value with respect to the label. Note the RuntimeWarning in the output: with only one sample per class, the within-class variance is undefined, so the F-scores are invalid and the five names printed above are effectively arbitrary. On a real dataset with several time series per class, the same code produces a meaningful ranking.
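The RuntimeWarning in the output above can be reproduced by computing the ANOVA F statistic by hand. The sketch below is a simplified NumPy version written for illustration (anova_f is a hypothetical helper, not scikit-learn's implementation). It shows that the statistic is undefined with one sample per class and well defined with several.

```python
import numpy as np


def anova_f(x, y):
    """One-way ANOVA F statistic of feature x across the classes in y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    classes = np.unique(y)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    if df_within <= 0:
        # No within-class variation can be estimated
        return float("nan")
    grand_mean = x.mean()
    # Variation of the class means around the overall mean
    ss_between = sum(
        (y == c).sum() * (x[y == c].mean() - grand_mean) ** 2 for c in classes
    )
    # Variation of the samples around their own class mean
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    return (ss_between / df_between) / (ss_within / df_within)


# One sample per class, as in Step 6: the statistic is undefined
f_tiny = anova_f([15.0, 30.0], [0, 1])

# Three samples per class: the statistic is well defined
f_ok = anova_f([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
print(f_tiny, f_ok)
```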

Conclusion

tsfresh is a powerful tool for automatic feature extraction from time series data. It can generate hundreds of candidate features and integrates with popular Python libraries such as pandas and scikit-learn. By combining extraction with a selection step, either tsfresh's relevance table or a fixed-size selector such as SelectKBest, you can reduce the feature set to the most informative features and build more robust predictive models.

