
f10

The document outlines a feature engineering process for a restaurant dataset, creating new features such as name length, address length, cuisine count, and binary indicators for table booking and online delivery. It also categorizes ratings and costs, normalizing costs to USD for comparison. Finally, it analyzes and visualizes the correlation between these new features and the aggregate rating using correlation matrices and box plots.
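
The length, count, and binary features described above can be sketched on a toy frame. The three sample rows are invented for illustration and are not taken from the dataset, and the vectorised `.str` accessors are a swapped-in alternative to the row-wise `apply` calls used in the script, not the author's method.

```python
import pandas as pd

# Hypothetical sample rows; the real script works on df_processed.
toy = pd.DataFrame({
    "Restaurant Name": ["Cafe Uno", "Spice Route", None],
    "Cuisines": ["Italian, Pizza", "North Indian", None],
    "Has Table booking": ["Yes", "No", "No"],
})

# Missing values become 0, matching the script's pd.notna guard.
toy["Name_Length"] = toy["Restaurant Name"].str.len().fillna(0).astype(int)

# Splitting on commas yields one list per row; its length is the cuisine count.
toy["Cuisine_Count"] = (
    toy["Cuisines"].str.split(",").str.len().fillna(0).astype(int)
)

# Yes/No flags map to 1/0; any other value would become NaN.
toy["Has_Table_Booking_Binary"] = toy["Has Table booking"].map({"Yes": 1, "No": 0})

print(toy[["Name_Length", "Cuisine_Count", "Has_Table_Booking_Binary"]])
```

Because `map` returns NaN for values outside the dictionary, checking the binary columns with `isna().sum()` is a cheap way to spot unexpected labels before computing correlations.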

Uploaded by

sambhaviasingh

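# LEVEL 2 - TASK 3: FEATURE ENGINEERING

# Imports this section relies on; df_processed is assumed to be defined earlier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("LEVEL 2 - TASK 3: FEATURE ENGINEERING")
print("===================================")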

# Create a copy of the dataframe for feature engineering
df_features = df_processed.copy()

# 1. Extract additional features from existing columns

# Length of restaurant name
df_features['Name_Length'] = df_features['Restaurant Name'].apply(
    lambda x: len(str(x)) if pd.notna(x) else 0
)

# Length of address
df_features['Address_Length'] = df_features['Address'].apply(
    lambda x: len(str(x)) if pd.notna(x) else 0
)

# Number of cuisines
df_features['Cuisine_Count'] = df_features['Cuisines'].apply(
    lambda x: len(str(x).split(',')) if pd.notna(x) else 0
)

# 2. Create new features by encoding categorical variables

# Has Table Booking (binary)
df_features['Has_Table_Booking_Binary'] = (
    df_features['Has Table booking'].map({'Yes': 1, 'No': 0})
)

# Has Online Delivery (binary)
df_features['Has_Online_Delivery_Binary'] = (
    df_features['Has Online delivery'].map({'Yes': 1, 'No': 0})
)

# Is Delivering Now (binary)
df_features['Is_Delivering_Now_Binary'] = (
    df_features['Is delivering now'].map({'Yes': 1, 'No': 0})
)

# Rating Category
def categorize_rating(rating):
    # Thresholds are checked from highest to lowest; anything below 2.0
    # (including 0 and NaN) falls through to 'Very Poor'.
    if rating >= 4.5:
        return 'Excellent'
    elif rating >= 4.0:
        return 'Very Good'
    elif rating >= 3.5:
        return 'Good'
    elif rating >= 3.0:
        return 'Average'
    elif rating >= 2.0:
        return 'Poor'
    else:
        return 'Very Poor'

df_features['Rating_Category'] = df_features['Aggregate rating'].apply(categorize_rating)

# Cost Category
def categorize_cost(cost, currency):
    if pd.isna(cost) or pd.isna(currency):
        return 'Unknown'

    # Normalize to USD for comparison (very simplified)
    if currency == 'Dollar($)':
        normalized_cost = cost
    elif currency == 'Indian Rupees(Rs.)':
        normalized_cost = cost / 75  # Approximate conversion
    elif currency == 'Pounds(£)':
        normalized_cost = cost * 1.3  # Approximate conversion
    elif currency == 'Turkish Lira(TL)':
        normalized_cost = cost / 8  # Approximate conversion
    elif currency == 'Brazilian Real(R$)':
        normalized_cost = cost / 5  # Approximate conversion
    elif currency == 'Indonesian Rupiah(IDR)':
        normalized_cost = cost / 14000  # Approximate conversion
    else:
        normalized_cost = cost  # Default case: unlisted currencies are left unconverted

    # Bucket the USD-normalized cost
    if normalized_cost < 20:
        return 'Budget'
    elif normalized_cost < 50:
        return 'Moderate'
    elif normalized_cost < 100:
        return 'Expensive'
    else:
        return 'Very Expensive'

df_features['Cost_Category'] = df_features.apply(
    lambda row: categorize_cost(row['Average Cost for two'], row['Currency']),
    axis=1
)

# Display the new features
print("\nNew Features Created:")
print(df_features[['Name_Length', 'Address_Length', 'Cuisine_Count',
                   'Has_Table_Booking_Binary', 'Has_Online_Delivery_Binary',
                   'Is_Delivering_Now_Binary', 'Rating_Category',
                   'Cost_Category']].head())

# Analyze the relationship between new features and rating
print("\nCorrelation between new features and rating:")
feature_corr = df_features[['Name_Length', 'Address_Length', 'Cuisine_Count',
                            'Has_Table_Booking_Binary',
                            'Has_Online_Delivery_Binary',
                            'Is_Delivering_Now_Binary', 'Aggregate rating']].corr()
print(feature_corr['Aggregate rating'].sort_values(ascending=False))

# Visualize the correlation
plt.figure(figsize=(12, 10))
sns.heatmap(feature_corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Between Features and Rating', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()

# Analyze the relationship between categorical features and rating
plt.figure(figsize=(14, 6))
sns.boxplot(x='Rating_Category', y='Aggregate rating', data=df_features,
            palette='viridis')
plt.title('Rating Distribution by Rating Category', fontsize=16)
plt.xlabel('Rating Category', fontsize=14)
plt.ylabel('Aggregate Rating', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.show()

plt.figure(figsize=(14, 6))
sns.boxplot(x='Cost_Category', y='Aggregate rating', data=df_features,
            palette='viridis')
plt.title('Rating Distribution by Cost Category', fontsize=16)
plt.xlabel('Cost Category', fontsize=14)
plt.ylabel('Aggregate Rating', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.show()
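
The rating and cost thresholds in the script could also be expressed as bins. The following is a minimal sketch using `pd.cut` on invented sample rows, offered as an alternative to the `if`/`elif` chains rather than as the script's own approach. The `usd_per_unit` table is a hypothetical name that restates the script's approximate conversion factors. Missing values are handled differently here: `pd.cut` leaves them as NaN, whereas the script returns 'Unknown' for costs and 'Very Poor' for ratings.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Aggregate rating": [4.7, 4.0, 3.6, 3.0, 2.5, 0.0],
    "Average Cost for two": [40, 3000, 30, 80, 840000, 10],
    "Currency": ["Dollar($)", "Indian Rupees(Rs.)", "Pounds(£)",
                 "Turkish Lira(TL)", "Indonesian Rupiah(IDR)", "Dollar($)"],
})

# Left-closed bins ([a, b)) reproduce the script's ">=" thresholds.
rating_bins = [-float("inf"), 2.0, 3.0, 3.5, 4.0, 4.5, float("inf")]
rating_labels = ["Very Poor", "Poor", "Average", "Good", "Very Good", "Excellent"]
toy["Rating_Category"] = pd.cut(
    toy["Aggregate rating"], bins=rating_bins, labels=rating_labels, right=False
)

# The script's approximate conversions, written as USD per unit of currency.
usd_per_unit = {
    "Dollar($)": 1.0,
    "Indian Rupees(Rs.)": 1 / 75,
    "Pounds(£)": 1.3,
    "Turkish Lira(TL)": 1 / 8,
    "Brazilian Real(R$)": 1 / 5,
    "Indonesian Rupiah(IDR)": 1 / 14000,
}

# Unlisted currencies fall back to a factor of 1, as in the script's default case.
factor = toy["Currency"].map(usd_per_unit).fillna(1.0)
cost_usd = toy["Average Cost for two"] * factor

# Left-closed bins reproduce the script's "<" thresholds at 20, 50, and 100.
cost_bins = [-float("inf"), 20, 50, 100, float("inf")]
cost_labels = ["Budget", "Moderate", "Expensive", "Very Expensive"]
toy["Cost_Category"] = pd.cut(cost_usd, bins=cost_bins, labels=cost_labels, right=False)

print(toy[["Rating_Category", "Cost_Category"]])
```

One side effect of this design is that `pd.cut` produces ordered categoricals, so box plots grouped by these columns would follow the label order instead of the order in which categories first appear in the data.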
