ML Lab Manual
(NBA Accredited)
II SEMESTER – R 2021
EX.NO: 1
Implement a linear regression model with a real dataset. Experiment with different features and tune the model hyperparameters.
AIM:
To implement a linear regression model with a real dataset, experiment with different features, and tune the model hyperparameters, using the "Housing Prices" dataset from Kaggle.
Step 1: Import the necessary libraries and load the Housing Prices dataset
import numpy as np
import pandas as pd
# Load the Housing Prices dataset (replace 'housing_prices.csv' with the actual file path)
data_url = "https://example.com/housing_prices.csv"
df = pd.read_csv(data_url)
Step 2: Prepare the data and define the features and target
# Select the features and target variable
features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF']
target = 'SalePrice'
# Separate the features (X) and the target (y)
X = df[features]
y = df[target]
Step 3: Split the data into training and test sets
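A minimal sketch of the split and the fit_intercept experiment described in the results, assuming scikit-learn's LinearRegression and mean squared error as the evaluation metric:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split the data into training and test sets (an 80/20 split is assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Tune the fit_intercept hyperparameter and evaluate each model
for fit_intercept in [True, False]:
    model = LinearRegression(fit_intercept=fit_intercept)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"fit_intercept={fit_intercept}: MSE = {mse}")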
RESULTS:
In this code, we iterate over different values of the fit_intercept hyperparameter (True and False) and create a
linear regression model with each value. We then train the model, make predictions on the test set, and evaluate
the model's performance using mean squared error. This allows you to assess how the hyperparameter influences
the model's performance.
EX.NO: 2
Implement a binary classification model, that is, one that answers a binary question such as "Are houses in this neighborhood above a certain price?" (use data from exercise 1). Modify the classification threshold and determine how that modification influences the model. Experiment with different classification metrics to determine your model's effectiveness.
AIM:
To implement a binary classification model that answers a question such as "Are houses in this neighborhood above a certain price?" using the Housing Prices dataset from exercise 1, and to experiment with different classification thresholds and metrics.
PROCEDURE:
Step 1: Import the necessary libraries and load the housing dataset (from exercise 1)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Load the housing dataset (assuming it is loaded as 'df' from exercise 1)
Step 2: Prepare the data and define the binary classification target
# Assume the target is whether the sale price is above a certain threshold (e.g., $200,000)
threshold = 200000
df['above_threshold'] = (df['SalePrice'] > threshold).astype(int)
y = df['above_threshold']
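A minimal sketch of the intervening steps, assuming the feature columns from exercise 1 and an 80/20 train/test split:
Step 3: Split the data into training and test sets
X = df[features]  # feature columns defined in exercise 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train a logistic regression model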
model = LogisticRegression()
model.fit(X_train, y_train)
Step 5: Evaluate the model's performance with different classification thresholds and metrics
# Predict probabilities for the test set
y_pred_probs = model.predict_proba(X_test)[:, 1]  # Probability of class 1 (above threshold)
# Evaluate the metrics at several classification thresholds (example values)
for clf_threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_pred_probs >= clf_threshold).astype(int)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_probs)
    print(f"Threshold: {clf_threshold}")
    print(f"Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print(f"F1 Score: {f1}")
    print(f"ROC AUC: {roc_auc}")
    print("------------------------------")
RESULTS:
In this code, we train a binary classification model using logistic regression and define a threshold to determine the class predictions (above or below the specified house price). The model was implemented and evaluated successfully.
EX.NO: 3
Classification with Nearest Neighbors. In this question, you will use scikit-learn's KNN classifier to classify real vs. fake news headlines.
AIM:
To classify real vs. fake news headlines using scikit-learn's KNN classifier and perform a training/validation split, using a dataset of news headlines.
PROCEDURE:
This exercise follows the general process of using scikit-learn's KNN classifier with a dataset that includes real vs. fake news headlines. Let's assume we have a suitable dataset called the "News Headlines Dataset" for this task.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the dataset (replace 'news_headlines.csv' with the actual file path or URL)
data_url = "https://example.com/news_headlines.csv"
df = pd.read_csv(data_url)
X = df["headline"].values
y = df["label"].values

# Preprocess the data if necessary (e.g., text cleaning, feature extraction, etc.)
# Vectorize the headlines using TF-IDF to convert them into numerical features suitable for training the KNN classifier
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# Perform a training/validation split
X_train, X_val, y_train, y_val = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

# Train the KNN classifier and evaluate on the validation set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_val_pred = knn.predict(X_val)
print("Validation accuracy:", accuracy_score(y_val, y_val_pred))

# Analyze the validation results and make improvements to the model if necessary
# You can try different values of 'n_neighbors' or explore other hyperparameters and techniques to improve the model's performance
# You can also perform additional preprocessing steps, feature engineering, or use more advanced models if needed
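As one example of the tuning suggested above, a simple sweep over 'n_neighbors' (the candidate values are chosen only for illustration):
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = accuracy_score(y_val, knn.predict(X_val))
    print(f"k={k}: validation accuracy = {acc:.3f}")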
RESULTS:
The code performs a training/validation split, trains the KNN classifier on the training data, makes predictions on
the validation data, and evaluates the accuracy of the classifier.
EX.NO: 4
Experiment with validation sets and test sets using the dataset. Split a training set into a smaller training set and a validation set. Analyze deltas between training set and validation set results. Test the trained model with a test set to determine whether your trained model is overfitting. Detect and fix a common training problem.
AIM:
To experiment with validation sets and test sets, and analyze the deltas between training set and validation set
results, as well as detect and fix overfitting, we can modify the previous implementation.
PROCEDURE:
Step 1: Import the necessary libraries and load the dataset
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00323/data.zip"
df = pd.read_csv(data_url, compression='zip')
Step 2: Encode the categorical data and scale the features
label_encoder = LabelEncoder()
df_encoded = df.apply(label_encoder.fit_transform)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_encoded)
Step 3: Split the data into training, validation, and test sets
# Split the training set into smaller training and validation sets
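A minimal sketch of the splits, assuming a 60/20/20 train/validation/test division via two calls to train_test_split:
X_temp, X_test = train_test_split(df_scaled, test_size=0.2, random_state=42)
X_train_small, X_val = train_test_split(X_temp, test_size=0.25, random_state=42)  # 0.25 of the remaining 80% = 20%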
Step 4: Perform k-means clustering on the smaller training set and evaluate on the validation set
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_train_small)
train_labels = kmeans.labels_
val_labels = kmeans.predict(X_val)
Step 5: Analyze the deltas between training set and validation set results
# Compare the cluster labels between training set and validation set
test_labels = kmeans.predict(X_test)
# Compare the cluster labels between training set and test set
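One way to quantify these comparisons, as a sketch (silhouette score is assumed as the clustering metric; inertia or other metrics would also work):
from sklearn.metrics import silhouette_score

train_score = silhouette_score(X_train_small, train_labels)
val_score = silhouette_score(X_val, val_labels)
test_score = silhouette_score(X_test, test_labels)
print(f"Train: {train_score:.3f}, Validation: {val_score:.3f}, Test: {test_score:.3f}")
# A large drop from the training score to the validation/test scores suggests overfitting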
RESULT:
Thus, the training set was split into a smaller training set and a validation set, the deltas between the training set and validation set results were analyzed, and the trained model was tested on a separate test set to check for overfitting.
EX.NO: 5
Implement the k-means algorithm using the Codon usage dataset from the UCI Machine Learning Repository.
AIM:
To implement the k-means algorithm using the Codon usage dataset from the UCI Machine Learning Repository.
PROCEDURE:
The k-means algorithm is an unsupervised machine learning algorithm used for clustering. This procedure implements the k-means algorithm using the Codon usage dataset from the UCI Machine Learning Repository. Before we start, please note that the Codon usage dataset consists of categorical data, and k-means is typically used for numerical data. However, we can still apply k-means by converting the categorical data into numerical representations.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00323/data.zip"
df = pd.read_csv(data_url, compression='zip')
# Encode the categorical columns as integers
label_encoder = LabelEncoder()
df_encoded = df.apply(label_encoder.fit_transform)
# Scale the encoded features so all columns contribute comparably to the distance computations
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_encoded)
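A minimal sketch of fitting k-means on the scaled data, assuming k = 3 as in exercise 4 (any k can be substituted):
from sklearn.cluster import KMeans

k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(df_scaled)
print("Cluster sizes:", np.bincount(cluster_labels))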
RESULT:
The k-means algorithm on the Codon usage dataset was implemented successfully.
EX.NO: 6
Implement the Naïve Bayes Classifier using
https://archive.ics.uci.edu/ml/datasets/Gait+Classification dataset
AIM:
To implement the Naïve Bayes Classifier using the Gait Classification dataset, we'll need to perform the following
steps:
PROCEDURE:
Load the dataset: Download the dataset from the provided URL
(https://archive.ics.uci.edu/ml/datasets/Gait+Classification) and load it into your programming
environment. The dataset contains both the training and testing data.
Preprocess the data: Preprocess the dataset to prepare it for the Naïve Bayes Classifier. This may involve
handling missing values, normalizing the data, and converting categorical variables into numerical
representations if necessary.
Train the Naïve Bayes Classifier: Implement the training phase of the Naïve Bayes Classifier using the
training data. Calculate the class priors and class conditional probabilities based on the training samples.
Classify test samples: Use the trained model to classify the test samples by calculating the posterior probability for each class given the test sample. The class with the highest probability will be assigned as the predicted class for that sample (a from-scratch sketch of this computation follows this list).
Evaluate the model: Compare the predicted classes with the true labels of the test samples to evaluate the
performance of the Naïve Bayes Classifier. Calculate metrics such as accuracy, precision, recall, or F1
score to assess the classifier's effectiveness.
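The training and classification steps above can be sketched from scratch as follows, assuming Gaussian class-conditional likelihoods (scikit-learn's GaussianNB performs an equivalent computation):
import numpy as np

def gaussian_nb_predict(X_train, y_train, x):
    """Predict the class of a single sample x via Bayes' rule."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        log_prior = np.log(len(Xc) / len(X_train))          # class prior P(c)
        mean, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9  # per-feature Gaussian parameters
        # log of the product of per-feature Gaussian likelihoods P(x | c)
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        log_posteriors.append(log_prior + log_likelihood)
    # The class with the highest posterior probability wins
    return classes[np.argmax(log_posteriors)]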
PROGRAM:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load the training data (the file name "train_data.csv" is assumed; adjust to your dataset)
data = pd.read_csv("train_data.csv")
# Separate features (X) and target labels (y) of the training data
X = data.drop("target_class", axis=1)
y = data["target_class"]
# Train the Naïve Bayes Classifier (GaussianNB is assumed here, suitable for continuous features)
model = GaussianNB()
model.fit(X, y)
# Load the test data (assuming it's stored in a separate CSV file named "test_data.csv")
test_data = pd.read_csv("test_data.csv")
# Separate features (X_test) and true labels (y_true) of the test data
X_test = test_data.drop("target_class", axis=1)
y_true = test_data["target_class"]
# Classify the test samples
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)
RESULTS:
The code assumes that you have the necessary libraries installed, such as pandas and scikit-learn. Additionally,
make sure to adjust the code based on the specific preprocessing steps required for your dataset.