ML Lab File Final.docx - Google Docs
AIM: Use PCA on a high-dimensional dataset to reduce its dimensionality while retaining most of the variance, and visualize the data.
CODE:
df = pd.read_csv('USA_Housing.csv')
# Output variable
y = df['Price']
# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions for visualization
X_pca = pca.fit_transform(X_scaled)
plt.ylabel('Principal Component 2 (Explained Variance: {:.2f}%)'.format(explained_variance[1]*100))
plt.show()
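The axis label above reports explained_variance[1], the share of total variance captured by the second principal component. The listing does not show how explained_variance is computed; with scikit-learn it would typically come from pca.explained_variance_ratio_, which is an assumption here. As a minimal sketch of what that ratio means, the following standard-library example computes it for a hypothetical two-feature dataset, using the closed-form eigenvalues of the 2x2 covariance matrix. The function name and the sample values are illustrative only.

```python
import math

def explained_variance_ratio_2d(xs, ys):
    """Return the explained-variance ratios of the two principal
    components of a two-feature dataset (largest first)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance matrix entries (divide by n - 1).
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]] in closed form.
    trace = var_x + var_y
    det = var_x * var_y - cov_xy ** 2
    root = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    eig_1 = trace / 2 + root
    eig_2 = trace / 2 - root
    # Each eigenvalue is the variance along one principal component.
    return [eig_1 / trace, eig_2 / trace]

# Hypothetical, strongly correlated features (illustration only).
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 8.1, 9.8]
ratios = explained_variance_ratio_2d(feature_a, feature_b)
print("PC1: {:.2f}%  PC2: {:.2f}%".format(ratios[0] * 100, ratios[1] * 100))
```

Because the two hypothetical features are almost perfectly correlated, nearly all of the variance falls on the first component, which is the situation in which dropping later components loses little information.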
OUTPUT:
Scatter Plot
Before PCA
After PCA
EXPERIMENT – 3
AIM: Perform a linear regression analysis on a dataset to predict a continuous target variable based on one or more predictor variables. Evaluate the model's performance using metrics like RMSE and R-squared.
CODE:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
df = pd.read_csv('USA_Housing.csv')
# Output variable
y = df['Price']
y_pred = lm.predict(X_test)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Calculate R-squared
r_squared = r2_score(y_test, y_pred)
print("Linear regression model performance:\nRoot Mean Squared Error (RMSE):", rmse)
print("R-squared:", r_squared)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted',
color='cyan')
plt.scatter(y_test, y_test, alpha=0.5, label='Actual', color='blue')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.legend()
plt.show()
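The two metrics used above can also be written out by hand, which makes their definitions explicit: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses only the standard library; the function names and the sample values are hypothetical and are not taken from the housing dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]
print("RMSE:", rmse(actual, predicted))
print("R-squared:", r_squared(actual, predicted))
```

RMSE is expressed in the units of the target (here, price), while R-squared is unitless, so the two are best read together rather than compared with each other.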
OUTPUT:
EXPERIMENT – 4
AIM: Compare the performance of various classification algorithms (e.g., Logistic Regression, Decision Trees, Random Forest, SVM and Naïve Bayes) on a common dataset using accuracy, precision, recall and F1-Score.
CODE:
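import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
df = pd.read_csv('gender_classification_v7.csv')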
df['gender'] = df['gender'].apply(lambda x: 0 if x == 'Male' else 1)
plt.figure(figsize=(2, 4))
plt.title('Count of Gender', size=10)
sns.countplot(data=df, x="gender")
plt.ylabel('Count', size=12)
plt.xlabel('Gender', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()
X = df.drop(columns=['gender'])
y = df['gender']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.2, random_state=42, stratify=y)
Name, Accuracy, Precision, Recall, F1_Score = [], [], [], [], []
# Logistic Regression
regression = LogisticRegression()
regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Logistic Regression')
# Decision Tree
tree = DecisionTreeClassifier(criterion="gini", random_state=100,
max_depth=3, min_samples_leaf=5)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Decision Tree')
# Random Forest
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Random Forest')
# SVM
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('SVM')
# Naive Bayes
naiveBayes = GaussianNB()
naiveBayes.fit(X_train, y_train)
y_pred = naiveBayes.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Naive Bayes')
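The four metrics collected for each model above all derive from the confusion-matrix counts (true/false positives and negatives). The following standard-library sketch spells that out; the function name and the label vectors are hypothetical. Note that scikit-learn's binary precision, recall and F1 scorers default to pos_label=1, which under the encoding above is the class that is not 'Male'.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical true and predicted labels, for illustration only.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
pred_labels = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(true_labels, pred_labels)
print(metrics)
```

Precision penalizes false positives and recall penalizes false negatives; F1 is their harmonic mean, so it is high only when both are high.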
OUTPUT:
Dataset
Count Plot
RESULT: All models are trained and evaluated. Naïve Bayes performs best for the given dataset.
EXPERIMENT – 5
AIM: Implement ensemble methods such as Bagging (e.g., Random Forest) and Boosting (e.g., AdaBoost) on a classification task and compare their performance to individual models.
CODE:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
df = pd.read_csv('gender_classification_v7.csv')
df['gender'] = df['gender'].apply(lambda x: 0 if x == 'Male' else 1)
plt.figure(figsize=(2, 4))
plt.title('Count of Gender', size=10)
sns.countplot(data=df, x="gender")
plt.ylabel('Count', size=12)
plt.xlabel('Gender', size=12)
sns.despine(top=True, right=True, left=False, bottom=False)
plt.show()
X = df.drop(columns=['gender'])
y = df['gender']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.2, random_state=42, stratify=y)
Name, Accuracy, Precision, Recall, F1_Score = [], [], [], [], []
# Ensemble Methods
# Bagging - Random Forest
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('Random Forest')
# Boosting - AdaBoost Classifier
adaBoost = AdaBoostClassifier()
adaBoost.fit(X_train, y_train)
y_pred = adaBoost.predict(X_test)
Accuracy.append(float("{:.2f}".format(accuracy_score(y_test, y_pred)
*100)))
Precision.append(precision_score(y_test, y_pred))
Recall.append(recall_score(y_test, y_pred))
F1_Score.append(f1_score(y_test, y_pred))
Name.append('AdaBoost')
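To make the bagging idea behind Random Forest concrete, the sketch below trains several copies of a very simple base learner on bootstrap samples (drawn with replacement) and combines their predictions by majority vote. It is a standard-library illustration under stated assumptions, not the scikit-learn implementation: the one-feature threshold "stump", the dataset, and all function names are hypothetical. Boosting, which instead reweights the training points after each round, is not shown.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Hypothetical base learner: a one-feature threshold placed midway
    between the two class means. Returns a predict function."""
    xs0 = [x for x, label in rows if label == 0]
    xs1 = [x for x, label in rows if label == 1]
    if not xs0 or not xs1:
        # Degenerate replicate containing a single class: always predict it.
        only = 1 if xs1 else 0
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def majority_vote(votes):
    """Return the most common label among the base models' votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-feature dataset: (feature value, class label).
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

def bagged_predict(x):
    """Aggregate the base models' predictions by majority vote."""
    return majority_vote([model(x) for model in models])

print(bagged_predict(1.2), bagged_predict(7.2))
```

An odd number of base models is used so that a two-class vote cannot tie.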
OUTPUT:
Dataset
Count Plot
RESULT: Both models are trained and evaluated. AdaBoostClassifier performs best for the given dataset.
EXPERIMENT – 6
AIM: Write code for feature selection techniques to reduce the number of features in a dataset while maintaining or improving the model's performance.
CODE:
rmse = math.sqrt(mean_squared_error(y_test, y_pred))
r_squared = r2_score(y_test, y_pred)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted',
color='cyan')
plt.scatter(y_test, y_test, alpha=0.5, label='Actual', color='blue')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.legend()
plt.show()
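The feature-selection step itself does not appear in the listing above. The following is a hedged sketch of one common approach, assuming a univariate F-test in which each feature is scored by the F-statistic of a simple linear regression of the target on that feature, F = r^2 / (1 - r^2) * (n - 2), and the k highest-scoring features are kept. It uses only the standard library; the function names, feature names and values are all hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def f_score(xs, ys):
    """F-statistic of a univariate linear regression of ys on xs.
    Assumes the fit is not perfect (r^2 < 1)."""
    r2 = pearson_r(xs, ys) ** 2
    return r2 / (1.0 - r2) * (len(xs) - 2)

def select_k_best(features, target, k):
    """Keep the k feature names with the highest F-scores."""
    scores = {name: f_score(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical features and target, for illustration only.
features = {
    "strong":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "moderate": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
    "noise":    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
target = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

selected, scores = select_k_best(features, target, k=2)
print("Selected features:", selected)
```

A univariate test scores each feature in isolation, so it can drop a feature that is only useful in combination with others; comparing model metrics before and after selection, as the experiment does, guards against that.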
OUTPUT:
RESULT: After feature reduction using the ANOVA F-test, model performance is good. Results are shown below:
● Root Mean Squared Error (RMSE): 100367.9313
● R-squared: 0.9181
● Original Features: ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']
● Selected Features: ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Area Population']
EXPERIMENT – 7
AIM: Write code to apply the Apriori algorithm to discover association rules in a retail transaction dataset and identify frequently co-occurring items in customer purchases.
CODE:
dummy = pd.get_dummies(data['itemDescription'])
data.drop(['itemDescription'], inplace =True, axis=1)
data = data.join(dummy)
data.head()
# Transaction: if a customer bought multiple products in one day, it
# will be considered as 1 transaction:
def product_names(x):
    for product in products:
        if x[product] > 0:
            x[product] = product
    return x
x = data1.values
x = [sub[~(sub == 0)].tolist() for sub in x if sub[sub != 0].tolist()]
transactions = x
transactions[0:10]
rules = apriori(transactions, min_support = 0.00030, min_lift = 3,
max_length = 2, target = "rules")
association_results = list(rules)
print(association_results[0])
pair = item[0]
items = [x for x in pair]
print("=============================")
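The min_support and min_lift thresholds passed to apriori above filter rules by two quantities: support (the fraction of transactions containing both items) and lift (how much more often the items co-occur than they would if independent). The standard-library sketch below computes these for item pairs only, matching max_length = 2. It is a brute-force count for illustration, not the level-wise candidate pruning that Apriori proper performs, and the function name, transactions and thresholds are hypothetical.

```python
from itertools import combinations

def pair_rules(transactions, min_support, min_lift):
    """Return (lhs, rhs, support, confidence, lift) for item pairs that
    meet both thresholds."""
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within a basket
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Evaluate the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            lift = confidence / (item_count[rhs] / n)
            if lift >= min_lift:
                rules.append((lhs, rhs, support, confidence, lift))
    return rules

# Hypothetical baskets, for illustration only.
baskets = [
    ["milk", "bread"],
    ["milk", "bread"],
    ["beer", "chips"],
    ["beer", "chips"],
    ["bread"],
    ["milk"],
    ["bread", "butter"],
]
found = pair_rules(baskets, min_support=0.2, min_lift=1.5)
for lhs, rhs, support, confidence, lift in found:
    print("{} -> {}: support={:.3f} confidence={:.3f} lift={:.3f}".format(
        lhs, rhs, support, confidence, lift))
```

A lift above 1 means the pair appears together more often than chance would predict; here milk and bread co-occur as often as beer and chips, but because milk and bread are individually common their lift falls below the threshold.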
OUTPUT:
EXPERIMENT – 8
AIM: Implement k-fold cross-validation on a classification task to assess the model's performance, addressing the issue of overfitting.
CODE:
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = data.target
df
df['species'].value_counts()
X = df.drop(['species'], axis='columns')
Y = data.target
for i in range(2, 16):
    kf = KFold(n_splits=i, random_state=1, shuffle=True)
    scores = cross_val_score(model, X, Y, scoring='accuracy', cv=kf,
                             n_jobs=-1)
    print('n-split:', i)
    print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
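The loop above relies on KFold to partition the samples. As a minimal sketch of what such a partition looks like, the standard-library generator below shuffles the sample indices once and then hands out each index to exactly one test fold, with the remaining indices forming that fold's training set. The function name is hypothetical, and the shuffle order will not match scikit-learn's for the same seed.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=1):
    """Yield (train_indices, test_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", sorted(train), "test:", sorted(test))
```

Because every sample is used for testing exactly once, the mean score across folds is less sensitive to one lucky or unlucky split than a single train/test split, which is why it gives a more honest view of overfitting.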
OUTPUT:
EXPERIMENT – 9
AIM: To implement a simple classification model to predict the species of iris flowers in the Iris dataset using basic algorithms like logistic regression or k-nearest neighbors.
CODE:
y_pred_knn5 = knn5.predict(X_test)
y_pred_knn1 = knn1.predict(X_test)
print("Accuracy with KNN at k=5", accuracy_score(y_test,
y_pred_knn5)*100)
print("Accuracy with KNN at k=1", accuracy_score(y_test,
y_pred_knn1)*100)
log_regr = LogisticRegression(solver='lbfgs', max_iter=1000)
log_regr.fit(X_train, y_train)
# Predict labels of unseen (test) data
y_pred_lr=log_regr.predict(X_test)
score=accuracy_score(y_test,y_pred_lr)
# accuracy_score returns the fraction of correctly classified samples
print("Accuracy of logistic regression ", score*100)
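The k-nearest-neighbors models compared above (k=1 and k=5) classify a point by looking at the labels of its closest training points. A minimal standard-library sketch of that rule follows, assuming Euclidean distance and a simple majority vote. The function name is hypothetical, and the two-feature measurements are made-up values chosen to resemble petal length and width; they are not rows from the Iris dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (petal length, petal width) measurements.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3),
          (5.5, 2.1), (5.8, 2.2), (6.0, 2.5)]
labels = ["setosa", "setosa", "setosa",
          "virginica", "virginica", "virginica"]

print(knn_predict(points, labels, (1.4, 0.25), k=1))
print(knn_predict(points, labels, (1.4, 0.25), k=5))
print(knn_predict(points, labels, (5.7, 2.2), k=3))
```

With k=1 the prediction follows the single closest point and is sensitive to noise; a larger k smooths the decision at the cost of blurring class boundaries.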
OUTPUT:
RESULT: Simple classification models (K-Nearest Neighbor and Logistic Regression) are trained and evaluated.
EXPERIMENT – 10
AIM: Predict the quality of wine based on features like acidity, alcohol content, and pH by using either linear regression or decision trees.
CODE:
df.isnull().sum()
df.update(df.fillna(df.mean()))
X = df[['fixed acidity', 'volatile acidity', 'citric acid',
        'residual sugar', 'chlorides', 'free sulfur dioxide',
        'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']].values
Y = df['quality'].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size
= 0.2, random_state = 0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coeff_df = pd.DataFrame(regressor.coef_, ['fixed acidity', 'volatile acidity',
    'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',
    'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
    columns=['Coefficient'])
coeff_df
print(regressor.intercept_)
y_pred = regressor.predict(X_test)
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Calculate R-squared
r_squared = r2_score(y_test, y_pred)
plt.legend()
plt.show()
OUTPUT: