vertopal.com_experiment11

The document outlines a feature selection workflow on the scikit-learn breast cancer dataset, applying four methods: univariate selection with SelectKBest (ANOVA F-test), Recursive Feature Elimination (RFE) with logistic regression, Random Forest feature importances, and L1-regularized (Lasso-style) logistic regression. Each method prints the set of features it selects, and a bar plot visualizes the top 10 Random Forest importances.


import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3, random_state=42)

# ------------------------------
# 1. Univariate Feature Selection (SelectKBest)
# ------------------------------
select_k = SelectKBest(score_func=f_classif, k=10)
select_k.fit(X_train, y_train)
selected_features_kbest = X.columns[select_k.get_support()]

print("\n📌 Top 10 Features (SelectKBest):")


print(selected_features_kbest)
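
# Illustrative addition (not in the original run): SelectKBest exposes the
# ANOVA F-scores via scores_ after fitting, so the selection can be
# inspected as a ranked list rather than just a boolean mask.
kbest_scores = pd.Series(select_k.scores_, index=X.columns)
print(kbest_scores.sort_values(ascending=False).head(10))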

# ------------------------------
# 2. Recursive Feature Elimination (RFE)
# ------------------------------
model = LogisticRegression(max_iter=10000)
rfe = RFE(estimator=model, n_features_to_select=10)
rfe.fit(X_train, y_train)
selected_features_rfe = X.columns[rfe.support_]

print("\n📌 Top 10 Features (RFE):")


print(selected_features_rfe)
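
# Illustrative addition (not in the original run): RFE's ranking_ array
# records the elimination order (1 = kept; higher ranks were dropped
# earlier), which shows how close each feature came to being selected.
rfe_ranking = pd.Series(rfe.ranking_, index=X.columns)
print(rfe_ranking.sort_values().head(10))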

# ------------------------------
# 3. Feature Importances from Random Forest
# ------------------------------
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
top_10_rf = X.columns[indices[:10]]

print("\n📌 Top 10 Features (Random Forest Importance):")


print(top_10_rf)

# Plot top 10 feature importances (hue mirrors y with legend=False to
# avoid seaborn's deprecated palette-without-hue usage)
plt.figure(figsize=(8, 5))
sns.barplot(x=importances[indices[:10]], y=X.columns[indices[:10]],
            hue=X.columns[indices[:10]], palette="viridis", legend=False)
plt.title("Top 10 Feature Importances (Random Forest)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
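
# Illustrative addition (not in the original run): the spread of each
# feature's importance across the individual trees (rf.estimators_) gives
# a rough sense of how stable the ranking is.
importance_std = np.std(
    [tree.feature_importances_ for tree in rf.estimators_], axis=0)
print(pd.Series(importance_std, index=X.columns)[top_10_rf])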

# ------------------------------
# 4. Lasso (L1-based) Feature Selection
# ------------------------------
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1,
max_iter=10000)
lasso.fit(X_train, y_train)
selected_features_lasso = X.columns[lasso.coef_[0] != 0]

print("\n📌 Features selected by Lasso (L1 regularization):")


print(selected_features_lasso)
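
# Illustrative addition (not in the original run): intersect the four
# selections computed above to see which features every method agrees on.
common = (set(selected_features_kbest) & set(selected_features_rfe)
          & set(top_10_rf) & set(selected_features_lasso))
print("\nFeatures common to all four methods:", sorted(common))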

📌 Top 10 Features (SelectKBest):

Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'worst radius', 'worst perimeter', 'worst area',
       'worst concavity', 'worst concave points'],
      dtype='object')

📌 Top 10 Features (RFE):

Index(['mean concave points', 'radius error', 'area error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst concavity',
       'worst concave points'],
      dtype='object')

📌 Top 10 Features (Random Forest Importance):

Index(['mean concave points', 'worst concave points', 'worst area',
       'mean concavity', 'worst radius', 'worst perimeter', 'mean perimeter',
       'mean area', 'worst concavity', 'mean radius'],
      dtype='object')
[Bar plot: Top 10 Feature Importances (Random Forest)]

📌 Features selected by Lasso (L1 regularization):

Index(['mean concave points', 'radius error', 'worst radius', 'worst texture',
       'worst smoothness', 'worst concavity', 'worst concave points',
       'worst symmetry'],
      dtype='object')
