Feature Selection

The document outlines a series of data analysis tasks using Python, including creating datasets, applying statistical methods, and visualizing results. It covers topics such as feature selection for predicting final grades, generating random datasets for height and weight, and applying dimensionality reduction techniques like PCA and SVD. The document also includes code snippets for implementing these analyses in a Jupyter notebook environment.



Features considered for predicting the price of a house:

Size of the House (sqft)
Number of Bedrooms
Age of the House (years)
Distance to the City Center
Color of the Front Door

# Import necessary libraries
import pandas as pd

# Create a small sample dataset
data = pd.DataFrame({
    'Size (sqft)': [1500, 1600, 1700, 1800, 1900],
    'Bedrooms': [3, 3, 4, 4, 5],
    'Age of House (years)': [10, 15, 20, 25, 30],
    'Distance to City Center (km)': [5, 4, 6, 3, 2],
    'Front Door Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Price': [300000, 320000, 350000, 370000, 400000]
})

# Display the data
data

   Size (sqft)  Bedrooms  Age of House (years)  Distance to City Center (km) Front Door Color   Price
0         1500         3                    10                             5              Red  300000
1         1600         3                    15                             4             Blue  320000
2         1700         4                    20                             6            Green  350000
3         1800         4                    25                             3             Blue  370000
4         1900         5                    30                             2              Red  400000

# Drop the "Front Door Color" column as it doesn't add value to our prediction
data = data.drop(columns=['Front Door Color'])


# Display the modified data
data

   Size (sqft)  Bedrooms  Age of House (years)  Distance to City Center (km)   Price
0         1500         3                    10                             5  300000
1         1600         3                    15                             4  320000
2         1700         4                    20                             6  350000
3         1800         4                    25                             3  370000
4         1900         5                    30                             2  400000
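As a quick sanity check on the remaining features, each column's correlation with Price can be inspected; a minimal sketch, assuming the data DataFrame after the column drop above (all columns are numeric at this point):

# Correlation of each feature with the target price
print(data.corr()['Price'].sort_values(ascending=False))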

In np.random.normal(170, 10, 100):
170: This is the mean (average) of the distribution. The generated numbers will center around 170.
10: This is the standard deviation (spread) of the distribution. It determines how much the numbers vary around the mean.
100: This is the number of values to generate. In this case, it will create an array of 100 random numbers.

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seed for reproducibility
np.random.seed(0)

# Create a simple dataset for height and weight
height = np.random.normal(170, 10, 100)  # Average height in cm


weight = height * 0.5 + np.random.normal(0, 5, 100) # Weight dependent on height

# Combine data into a DataFrame
data = pd.DataFrame({'Height': height, 'Weight': weight})

# Plot the data
plt.figure(figsize=(8, 6))
plt.scatter(data['Height'], data['Weight'], color='b', alpha=0.7)
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Scatter plot of Height vs. Weight')
plt.show()

# Apply PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
data_transformed = pca.fit_transform(data)

# Print the transformed data (principal component)
print("Transformed Data (1 Principal Component):\n", data_transformed[:10])

# Plot the data in 1D (principal component)
plt.scatter(data_transformed, np.zeros_like(data_transformed))
plt.xlabel("Principal Component 1")
plt.title("Data after PCA (1D)")
plt.show()
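Scikit-learn's PCA reports how much of the total variance each component captures, which is a quick check on how much information the 1D projection keeps. A minimal sketch, assuming the pca object fitted above:

# Fraction of the total variance captured by the first principal component;
# a value near 1.0 means the height/weight cloud is nearly one-dimensional
print("Explained variance ratio:", pca.explained_variance_ratio_)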


from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
"The cat is on the table",
"The dog is under the table",
"Cats and dogs are friends",
"Dogs run and cats jump"
]

# Convert text data into a term-document matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Print the term-document matrix
print("Original Term-Document Matrix:\n", X.toarray())
print("\nFeature Names:", vectorizer.get_feature_names_out())

Original Term-Document Matrix:
[[0 0 1 0 0 0 0 1 0 1 0 1 2 0]
 [0 0 0 0 1 0 0 1 0 0 0 1 2 1]
 [1 1 0 1 0 1 1 0 0 0 0 0 0 0]
 [1 0 0 1 0 1 0 0 1 0 1 0 0 0]]

Feature Names: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'friends' 'is' 'jump' 'on' 'run'
 'table' 'the' 'under']

# Apply Truncated SVD
svd = TruncatedSVD(n_components=2)  # Reducing to 2 components
X_reduced = svd.fit_transform(X)

# Print the reduced matrix
print("Reduced Term-Document Matrix:\n", X_reduced)

# Plot the reduced representation of documents
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Document Representation after SVD")
plt.show()

Reduced Term-Document Matrix:
[[ 2.64575131e+00  3.92045402e-18]
 [ 2.64575131e+00  1.01846971e-15]
 [-7.34164894e-16  2.00000000e+00]
 [-7.35324166e-16  2.00000000e+00]]
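The reduced matrix shows the first two documents (the "cat/dog on the table" sentences) loading on component 1 and the last two (the "cats and dogs" sentences) on component 2. TruncatedSVD also reports how much of the original matrix each component explains; a minimal sketch, assuming the svd object fitted above:

# Proportion of the term-document matrix's variance captured by each component
print("Explained variance ratio:", svd.explained_variance_ratio_)
print("Total variance captured:", svd.explained_variance_ratio_.sum())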

Features considered for predicting a student's final grade:

Hours Studied
Attendance Rate
Participation in Class
Previous Grades
Extra-Curricular Activities


Forward selection proceeds as follows:

Start with No Features.
Evaluate Each Feature Individually to see which one gives the best prediction.
Add the Best Feature to the model.
Evaluate the Remaining Features with the selected feature(s).
Repeat until adding more features doesn't significantly improve the model.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Sample dataset: Predicting Final Grade based on various factors
data = {
    'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
    'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
    'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
    'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
    'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
    'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)
X = df.drop('Final_Grade', axis=1)  # Features (input data)
y = df['Final_Grade']               # Target variable (what we want to predict)

# Initialize variables for forward selection
remaining_features = list(X.columns)  # Start with all features available for selection
selected_features = []                # Start with an empty set of selected features
model = LinearRegression()            # Initialize a linear regression model

# Forward Selection Process
print("Forward Feature Selection Process:")

# Keep adding features until no significant improvement is observed
while remaining_features:
    scores = {}  # Dictionary to store scores for each feature

    # Test adding each feature not yet selected
    for feature in remaining_features:
        # Temporarily add the current feature to the selected set
        current_features = selected_features + [feature]

        # Use only the selected features in the model
        X_subset = X[current_features]

        # Evaluate model performance using cross-validation and average the score
        score = cross_val_score(model, X_subset, y, cv=3).mean()

        # Store the score of this model in the dictionary
        scores[feature] = score

    # Find the feature that gives the best improvement in model performance
    best_feature = max(scores, key=scores.get)  # Feature with the highest score

    # Add the best feature to the selected features list
    selected_features.append(best_feature)

    # Remove the best feature from the remaining features list
    remaining_features.remove(best_feature)

    # Print the selected feature and its score for tracking
    print(f"Selected Feature: {best_feature}, Score: {scores[best_feature]:.4f}")

# Print the order in which features were selected
print("\nSelected Features in Order:", selected_features)

Forward Feature Selection Process:
Selected Feature: Hours_Studied, Score: 0.9679
Selected Feature: Extra_Curricular, Score: 0.9619
Selected Feature: Participation, Score: 0.9374
Selected Feature: Attendance_Rate, Score: 0.9163
Selected Feature: Previous_Grades, Score: -1.0925

Selected Features in Order: ['Hours_Studied', 'Extra_Curricular', 'Participation', 'Attendance_Rate', 'Previous_Grades']
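Note that the loop above keeps adding features even when the cross-validation score drops (Previous_Grades enters with a negative score). A minimal sketch of the "stop when there is no significant improvement" step from the outline, reusing the X, y, and model defined above, with a hypothetical threshold min_improvement:

best_score = -float('inf')  # Best cross-validation score so far
min_improvement = 0.01      # Hypothetical threshold for a "significant" gain
remaining_features = list(X.columns)
selected_features = []

while remaining_features:
    # Score each candidate feature added to the current selection
    scores = {f: cross_val_score(model, X[selected_features + [f]], y, cv=3).mean()
              for f in remaining_features}
    best_feature = max(scores, key=scores.get)

    # Stop once the best candidate no longer improves the score meaningfully
    if scores[best_feature] - best_score < min_improvement:
        break

    best_score = scores[best_feature]
    selected_features.append(best_feature)
    remaining_features.remove(best_feature)

print("Selected with early stopping:", selected_features)

With the scores printed above, this variant would stop after Hours_Studied, since no second feature improves on 0.9679 by the threshold.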

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


# Sample dataset: Predicting Final Grade based on various factors
data = {
    'Hours_Studied': [5, 8, 2, 6, 9, 3, 7, 10, 4, 5],
    'Attendance_Rate': [80, 90, 70, 85, 95, 60, 88, 92, 75, 82],
    'Participation': [3, 4, 2, 3, 5, 1, 4, 5, 2, 3],
    'Previous_Grades': [75, 85, 65, 78, 88, 70, 80, 90, 68, 74],
    'Extra_Curricular': [1, 2, 0, 1, 2, 0, 1, 2, 1, 1],
    'Final_Grade': [78, 88, 65, 80, 90, 68, 85, 92, 72, 76]
}

# Convert the data to a DataFrame
df = pd.DataFrame(data)
X = df.drop('Final_Grade', axis=1)  # Features (input data)
y = df['Final_Grade']               # Target variable (what we want to predict)

remaining_features = list(X.columns)  # Start with all features available for selection
selected_features = []                # Start with an empty set of selected features
model = LinearRegression()            # Initialize a linear regression model

while remaining_features:: Start a loop that runs as long as there are still features we haven't selected.

scores = {}: Initialize an empty dictionary to keep track of model scores for each feature we test.

Inner Loop: Testing Each Feature:

for feature in remaining_features:: Loop over each feature we haven't selected yet.

current_features = selected_features + [feature]: Add the current feature to the selected list temporarily.

X_subset = X[current_features]: Create X_subset, which includes only the selected features.

score = cross_val_score(model, X_subset, y, cv=3).mean(): Evaluate the model's performance with 3-fold cross-validation and take the average score.

print("Forward Feature Selection Process:")


while remaining_features:
scores = {} # Dictionary to store scores for each feature

# Test adding each feature not yet selected


for feature in remaining_features:
# Temporarily add the current feature to the selected set
current_features = selected_features + [feature]

# Use only the selected features in the model


X_subset = X[current_features]

# Evaluate model performance using cross-validation and calculate average score


score = cross_val_score(model, X_subset, y, cv=3).mean()

# Store the score of this model in the dictionary


scores[feature] = score

print("\nSelected Features in Order:", selected_features)

import numpy as np

# Set the random seed for reproducibility
np.random.seed(0)


# Simulate flipping a coin 100 times (0 = Tails, 1 = Heads)
flips = np.random.choice([0, 1], size=100)  # 0 for Tails, 1 for Heads

# Count the number of Heads and Tails
heads_count = np.sum(flips == 1)
tails_count = np.sum(flips == 0)

# Calculate empirical probabilities
p_heads = heads_count / len(flips)
p_tails = tails_count / len(flips)

print(f"Empirical Probability of Heads: {p_heads}")
print(f"Empirical Probability of Tails: {p_tails}")

np.random.choice([0, 1], size=100) simulates flipping a coin 100 times. Here, 0 represents Tails and 1 represents Heads.

We count how many times 1 (Heads) and 0 (Tails) occur in the array.

By dividing the count of each outcome by the total number of flips, we get the empirical probability of each outcome.
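With a fair coin the theoretical probability of Heads is 0.5, and the empirical estimate converges toward it as the number of flips grows. A minimal sketch of that convergence, assuming the numpy import and seed above:

# Empirical probability of Heads for increasing numbers of flips
# (illustrates the law of large numbers)
for n in [10, 100, 1000, 10000]:
    flips = np.random.choice([0, 1], size=n)
    print(f"{n:>6} flips -> P(Heads) = {np.mean(flips):.4f}")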

import numpy as np

# Data for study time (hours) and test scores
study_time = [2, 3, 4, 5, 6]
test_scores = [60, 65, 70, 75, 80]

# Calculate correlation between the two variables
correlation = np.corrcoef(study_time, test_scores)[0, 1]

print("Correlation between study time and test scores:", correlation)

Correlation between study time and test scores: 1.0
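The correlation is exactly 1.0 because the data are perfectly linear: each extra hour of study adds exactly 5 points (test_scores = 50 + 5 * study_time). A minimal sketch computing the Pearson coefficient by hand, to confirm what np.corrcoef returns for these lists:

import numpy as np

x = np.array([2, 3, 4, 5, 6], dtype=float)
y = np.array([60, 65, 70, 75, 80], dtype=float)

# Pearson r = cov(x, y) / (std(x) * std(y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (x.std() * y.std())
print("Pearson r computed manually:", r)  # 1.0 for perfectly linear data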
