Cat 2 Document Likkitha

Uploaded by

Likkitha
Copyright © All Rights Reserved
TOPICS

MACHINE LEARNING LAB RECORD – 21MDS48

SUBMITTED BY
LIKKITHA S
71762132023

SUBMITTED TO
MRS. D. SUDHA DEVI

COIMBATORE INSTITUTE OF TECHNOLOGY


30 JUNE 2023
EX NO DATE TOPICS

1 09/03/23 SIMPLE LINEAR REGRESSION

2 16/03/23 FIND S ALGORITHM

3 30/03/23 CANDIDATE ELIMINATION ALGORITHM

4 30/03/23 MULTIPLE LINEAR REGRESSION

5 18/04/23 POLYNOMIAL REGRESSION

6 20/04/23 LOGISTIC REGRESSION

7 18/05/23 GAUSSIAN NAÏVE BAYES MODEL

8 23/05/23 BERNOULLI NAÏVE BAYES MODEL

9 23/05/23 MULTINOMIAL NAÏVE BAYES MODEL

10 23/05/23 K-NEAREST NEIGHBOR MODEL

11 25/05/23 K-MEANS CLUSTERING MODEL

12 30/05/23 HIERARCHICAL CLUSTERING MODEL

13 01/06/23 PRINCIPAL COMPONENT ANALYSIS

14 06/06/23 DECISION TREE CLASSIFIER

15 08/06/23 RANDOM FOREST

16 13/06/23 SUPPORT VECTOR MACHINE


EX NO 01 SIMPLE LINEAR REGRESSION
DATE: 21.03.2023

PROBLEM STATEMENT:
Predicting the cost of homes in any rural area has become a significant difficulty for
construction companies. In order to anticipate the cost of dwellings in Coimbatore for a
specific square foot, the least squares method must be used.

PROBLEM ANALYSIS:
In this machine learning problem, we aim to build a model that predicts an applicant's chance
of admission to a graduate school from their profile. The dataset contains, for each applicant,
the GRE score, TOEFL score, university rating, statement of purpose (SOP) score, letter of
recommendation (LOR) score, undergraduate CGPA, research experience, and the corresponding
chance of admission. The objective is to determine the relationship between these features and
the chance of admission and to use it to make accurate predictions for new, unseen applicants.
The code in this exercise fits a simple linear regression by the least squares method, with
GRE Score as the independent variable and CGPA as the dependent variable.
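
As a rough illustration of the least squares method used in this exercise, the sketch below fits a line to a few made-up points in plain Python. The function name `fit_line` and the toy data are assumptions for illustration only and are not part of the lab dataset.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of co-deviations / sum of squared deviations of x
    numer = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = numer / denom
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data lying exactly on y = 2x + 1
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(toy_x, toy_y)
print(slope, intercept)
```

Because the toy points lie exactly on a line, the fitted slope and intercept recover that line; on real data the same formulas give the line that minimises the sum of squared residuals.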

SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the admissions data: GRE Score is the predictor, CGPA the response
data = pd.read_csv('adm_data.csv')
X = data['GRE Score'].values
Y = data['CGPA'].values
print(data.head())

# Least squares estimates of the slope (m) and intercept (c)
mean_x = np.mean(X)
mean_y = np.mean(Y)
n = len(X)
numer = 0
denom = 0
for i in range(n):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = numer / denom
c = mean_y - (m * mean_x)

# Printing coefficients
print("Coefficients")
print(m, c)

# Plot the regression line over the scatter of the data
max_x = np.max(X) + 30
min_x = np.min(X) - 30
x = np.linspace(min_x, max_x, 1000)
y = c + m * x
plt.plot(x, y, color='#58b970', label='Regression Line')
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('GRE Score')
plt.ylabel('CGPA')
plt.legend()
plt.show()

# Root mean squared error
rmse = 0
for i in range(n):
    y_pred = c + m * X[i]
    rmse += (Y[i] - y_pred) ** 2
rmse = np.sqrt(rmse / n)
print("RMSE")
print(rmse)

# Coefficient of determination (R2)
ss_tot = 0
ss_res = 0
for i in range(n):
    y_pred = c + m * X[i]
    ss_tot += (Y[i] - mean_y) ** 2
    ss_res += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_res / ss_tot)
print("R2 Score")
print(r2)

OUTPUT:
CODE 2 - USING LIBRARIES:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

dataset = pd.read_csv('adm_data.csv')
X = dataset['GRE Score'].values
Y = dataset['CGPA'].values
# scikit-learn expects 2-D arrays, so reshape each variable to a single column
x = np.reshape(X, (-1, 1))
y = np.reshape(Y, (-1, 1))
print(dataset.head())

# Hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Plot the fitted line over the training data
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title('GRE Score vs CGPA (Training set)')
plt.xlabel('GRE Score')
plt.ylabel('CGPA')
plt.show()

# RMSE and R2 on the training set
y_pred = regressor.predict(x_train)
print(np.sqrt(mean_squared_error(y_train, y_pred)))
print(r2_score(y_train, y_pred))

# RMSE and R2 on the test set
y_pred = regressor.predict(x_test)
print(np.sqrt(mean_squared_error(y_test, y_pred)))
print(r2_score(y_test, y_pred))

OUTPUT:

PLOT:
EX NO 02 FIND S ALGORITHM
DATE: 23.03.2023

PROBLEM STATEMENT:
To implement Find S algorithm to find the specific hypothesis that fits all the positive
examples.

SAMPLE DATASET:

SOURCE CODE:
import pandas as pd
import numpy as np

data = pd.read_csv("ws.csv")
print(data, "\n")
d = np.array(data)[:, :-1]
print("\n The attributes are: ", d)
target = np.array(data)[:, -1]
print("\n The target is: ", target)

def train(c, t):
    # Start from the first positive example
    for i, val in enumerate(t):
        if val == "Yes":
            specific_hypothesis = c[i].copy()
            break
    # Generalise any attribute that disagrees with a positive example
    for i, val in enumerate(c):
        if t[i] == "Yes":
            for x in range(len(specific_hypothesis)):
                if val[x] != specific_hypothesis[x]:
                    specific_hypothesis[x] = '?'
    return specific_hypothesis

print("\n The final hypothesis is:", train(d, target))

OUTPUT:

INFERENCE:
The Find-S algorithm returns a single maximally specific hypothesis that is consistent with
all the positive training examples. Negative examples are ignored by the algorithm.
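
To make the behaviour concrete, here is a minimal self-contained sketch of Find-S on an in-memory toy table, so it runs without ws.csv. The function name `find_s` and the four rows (written in the style of the classic EnjoySport example) are assumptions for illustration, not the contents of the lab's dataset.

```python
def find_s(examples):
    """examples: list of (attribute_list, label) pairs; label 'Yes' marks positives."""
    hypothesis = None
    for attributes, label in examples:
        if label != 'Yes':
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # start from the first positive example
        else:
            # Keep matching attribute values, generalise mismatches to '?'
            hypothesis = [h if h == a else '?' for h, a in zip(hypothesis, attributes)]
    return hypothesis

# Hypothetical toy data
toy_examples = [
    (['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'], 'Yes'),
    (['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'], 'Yes'),
    (['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'], 'No'),
    (['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'], 'Yes'),
]
print(find_s(toy_examples))
```

Only the attributes on which all positive rows agree survive; the others become '?'.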

EX NO 03 CANDIDATE ELIMINATION
DATE: 28.03.2023
PROBLEM STATEMENT:
The aim of the Candidate Elimination algorithm is to learn a hypothesis that approximates the
target concept based on a set of training examples and a hypothesis space. It seeks to find the
most specific and general hypotheses that can accurately classify the training examples and
generalize to unseen instances, thereby effectively narrowing down the hypothesis space. The
algorithm aims to provide a concise and accurate representation of the target concept using a
minimal set of hypotheses.

SAMPLE DATASET:

SOURCE CODE:
import numpy as np
import pandas as pd

data = pd.read_csv('candidate elimination.csv')
concepts = np.array(data.iloc[:, 0:-1])
print("\nInstances are:\n", concepts)
target = np.array(data.iloc[:, -1])
print("\nTarget Values are: ", target)

def learn(concepts, target):
    # Specific boundary starts at the first instance; general boundary is all '?'
    specific_h = concepts[0].copy()
    print("\nInitialization of specific_h and general_h")
    print("\nSpecific Boundary: ", specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print("\nGeneric Boundary: ", general_h)
    for i, h in enumerate(concepts):
        print("\nInstance", i + 1, "is ", h)
        if target[i] == "yes":
            # Positive instance: generalise the specific boundary
            print("Instance is Positive ")
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "no":
            # Negative instance: specialise the general boundary
            print("Instance is Negative ")
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print("Specific Boundary after ", i + 1, "Instance is ", specific_h)
        print("Generic Boundary after ", i + 1, "Instance is ", general_h)
        print("\n")
    # Drop rows of the general boundary that are still fully general
    fully_general = ['?'] * len(specific_h)
    general_h = [g for g in general_h if g != fully_general]
    return specific_h, general_h

s_final, g_final = learn(concepts, target)
print("Final Specific_h: ", s_final, sep="\n")
print("Final General_h: ", g_final, sep="\n")
OUTPUT:
INFERENCE:
The candidate Elimination algorithm finds all hypotheses that match all the given training
examples. Unlike in Find-S algorithm, it goes through both negative and positive examples,
eliminating any inconsistent hypothesis.
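
The notion of a hypothesis being consistent with an example can be sketched in a few lines. The helper names `covers` and `is_consistent` and the sample rows are hypothetical and only illustrate the check; they are not taken from the lab's code or dataset.

```python
def covers(hypothesis, instance):
    """True if every attribute constraint is '?' or equals the instance value."""
    return all(h == '?' or h == v for h, v in zip(hypothesis, instance))

def is_consistent(hypothesis, instance, is_positive):
    """A hypothesis is consistent when it covers positives and rejects negatives."""
    return covers(hypothesis, instance) == is_positive

# Hypothetical hypothesis and examples
h = ['Sunny', 'Warm', '?', 'Strong', '?', '?']
positive = ['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change']
negative = ['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change']
print(is_consistent(h, positive, True))
print(is_consistent(h, negative, False))
```

Candidate Elimination keeps only the hypotheses for which this check holds on every training example, positive or negative.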

EX NO 04 MULTI - LINEAR REGRESSION


DATE: 13.04.2023

PROBLEM STATEMENT:
Predicting house prices on the outskirts of the city has become a major problem for
construction companies. Our problem here is to predict the prices of houses in Coimbatore from
the given square feet, number of bedrooms and age of the home using multiple linear regression.

PROBLEM ANALYSIS:

Here we will develop and evaluate the performance and the predictive power of a
model trained and tested on data collected from houses in Coimbatore’s suburbs. Once we get
a good fit, we will use this model to predict the monetary value of a house located in any part
of Coimbatore. A model like this would be very valuable for real estate agents and
construction companies, who could make use of the information it provides on a daily
basis. Our data set consists of four columns (Area, Bedrooms, Age of the home, Price) and thirty
rows. We analyse the problem by applying the least squares method to the given dataset.
We separate the dataset into training and test data, train the MLR model on the training
data, and then predict the test results and visualize them. The R square value we get from
the MLR model tells us how well the model fits the data.
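
For reference, the two metrics reported in this exercise can be written out in plain Python as below. The function names `rmse` and `r_squared` and the four sample values are illustrative assumptions, not figures from the Houses dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_a = sum(actual) / len(actual)
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

# Hypothetical values: two predictions are off by 1, two are exact
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 7.0, 10.0]
print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

RMSE is in the units of the target (here, price), while R square is the fraction of the target's variance explained by the model; it is not a classification accuracy.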

SAMPLE DATASET:
CODE 1 - FROM SCRATCH:

import numpy as np
import pandas as pd

data = pd.read_csv(r'C:\Users\Selva Vignesh\Downloads\Houses.csv')

# Predictors: area, bedrooms, homeage; response: price
X = data[['area', 'bedrooms', 'homeage']].values.astype(float)
Y = data['price'].values.astype(float)
n = len(Y)

# Add a leading column of ones so that b[0] is the intercept
X_b = np.column_stack((np.ones(n), X))

# Least squares estimate: solve the normal equations (X'X) b = X'Y
b = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y)

# Printing coefficients (one per predictor; the intercept is b[0])
print("Coefficients")
print(*b[1:])

# Predictions for every row of the dataset
y_pred = X_b @ b

# Root mean squared error
rmse = np.sqrt(np.sum((Y - y_pred) ** 2) / n)
print("RMSE")
print(rmse)

# Coefficient of determination (R2)
ss_tot = np.sum((Y - np.mean(Y)) ** 2)
ss_res = np.sum((Y - y_pred) ** 2)
r2 = 1 - (ss_res / ss_tot)
print("R2 Score")
print(r2)
OUTPUT:
Coefficients
97.26987365 -68.67032333 6.78195499
RMSE
245.0986827382367
R2 Score
0.984526627789891

INFERENCE:

From the MLR scratch model we get an RMSE value of 245.09868 and an R square
value of 0.98, from which we can infer that the model explains about 98% of the variance in
price and has performed very well. With x1 as the area, x2 as the number of bedrooms and x3 as
the age of the home, the fitted equation is y = 97.2698 * x1 - 68.6703 * x2 + 6.7819 * x3 plus
an intercept term. This model can be improved further by training it on a larger dataset.

CODE 2 - USING LIBRARIES:


import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv(r'C:\Users\Selva Vignesh M\Desktop\Houses.csv')
dt = pd.DataFrame(data, columns=['area', 'bedrooms', 'homeage', 'price'])

# Fit on the full dataset to report the intercept and coefficients
x = dt[['area', 'bedrooms', 'homeage']]
y = dt['price']
reg = LinearRegression()
reg.fit(x, y)
print("Intercept: ", reg.intercept_)
print("Coefficients: ", reg.coef_)

# Extracting independent variables (area, bedrooms, homeage)
x = dt.iloc[:, :-1].values
print(x)
# Extracting dependent variable (price)
y = dt.iloc[:, 3:].values
print(y)

# Hold out 20% of the rows for testing and fit on the rest
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
lr = LinearRegression()
lr.fit(x_train, y_train)
pred_train_lr = lr.predict(x_train)
pred_test_lr = lr.predict(x_test)

print("RMSE and r-square for train set:")
print(np.sqrt(mean_squared_error(y_train, pred_train_lr)))
print(r2_score(y_train, pred_train_lr))
print("RMSE and r-square for test set:")
print(np.sqrt(mean_squared_error(y_test, pred_test_lr)))
print(r2_score(y_test, pred_test_lr))
OUTPUT:
Intercept: 3046.2225659549213
Coefficients: [ 97.26987365 -68.67032333 6.78195499]
RMSE and r-square for train set:
243.14341291104748
0.9842721957215742
RMSE and r-square for test set:
214.29771477177258
0.9922371865519531

INFERENCE:

From the model we can infer that the training set has an RMSE value of
243.14341 with an R square value of 0.98, and the test set has an RMSE of 214.29771
with an R square value of 0.99. The model explains about 99% of the variance in price on the
test set, which indicates that it fits both the training and the test data very well.

Now let’s try to give an input to the model and see how it predicts the price. Here the
input is the area in square feet.
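
One way to sketch such a prediction is to plug inputs into the fitted equation directly. The intercept and coefficients below are copied from the printed OUTPUT of Code 2; the function name `predict_price` and the example house are assumptions for illustration, and since the fitted model has three predictors the sketch supplies values for all three.

```python
# Intercept and coefficients as printed in the OUTPUT of Code 2
INTERCEPT = 3046.2225659549213
COEF_AREA = 97.26987365
COEF_BEDROOMS = -68.67032333
COEF_HOMEAGE = 6.78195499

def predict_price(area, bedrooms, homeage):
    """Evaluate the fitted linear equation for one house."""
    return (INTERCEPT
            + COEF_AREA * area
            + COEF_BEDROOMS * bedrooms
            + COEF_HOMEAGE * homeage)

# Hypothetical house: 1500 sq ft, 3 bedrooms, 10 years old
estimate = predict_price(1500, 3, 10)
print(round(estimate, 2))
```

With the fitted scikit-learn model at hand, the equivalent call would be `reg.predict` on a row holding the same three values in the same column order.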
EX NO 05 POLYNOMIAL REGRESSION
DATE: 20.04.2023

PROBLEM STATEMENT:
The objective of this problem statement is to provide comprehensive support to 50 startups by
addressing their key challenges and enabling their growth and success. The problem is to
develop a program that supports the growth of 50 startups by addressing their critical needs
and challenges. The program should encompass various areas and provide tailored solutions
to meet the unique requirements of each startup.

PROBLEM ANALYSIS:
Here we will develop and evaluate the performance and the predictive power of a model
trained and tested on data collected from a company. Once we get a good fit, we will use this
model to predict the salary of an employee based on their position. The data set consists of
five columns (R&D Spend, Administration, Marketing Spend, State, Profit) and fifty rows. We
separate the dataset into training and test data, train the model on the training data, and
then predict the test results and visualize them. The R square value tells us how well the
model fits the data.
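
Polynomial regression is ordinary linear regression applied to powers of the input. The short sketch below shows only that feature expansion and how a prediction is evaluated from a set of coefficients; the function names and the numbers are hypothetical and chosen for easy checking.

```python
def polynomial_features(x, degree):
    """Return [1, x, x**2, ..., x**degree] for a single input value."""
    return [x ** p for p in range(degree + 1)]

def evaluate_polynomial(coeffs, x):
    """Evaluate b0 + b1*x + ... + bd*x**d as a dot product with the features."""
    feats = polynomial_features(x, len(coeffs) - 1)
    return sum(b * f for b, f in zip(coeffs, feats))

# Hypothetical input and coefficients
print(polynomial_features(2, 3))
print(evaluate_polynomial([1, 2, 3], 2))
```

Stacking these feature rows for every training input gives the design matrix on which the least squares fit is performed.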

SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
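import numpy as np
import pandas as pd

data = pd.read_csv("/kaggle/input/position-salaries/Position_Salaries.csv")
y = data['Salary']

n = int(input("Enter the degree:"))
xn = data['Level']
# Design matrix with columns x^0 (intercept), x^1, ..., x^n
xval = np.column_stack([xn ** i for i in range(n + 1)])
yval = np.array(y)
# Least squares coefficients from the normal equation
coeffs = np.linalg.inv(xval.T @ xval) @ xval.T @ yval
print('Dimensions of coeff matrix:', coeffs.shape)
print('Coefficients:', coeffs)

new = int(input("Enter the value of x to be predicted:"))
# Powers of the new input, in the same order as the design matrix columns
X_new = np.array([new ** i for i in range(n + 1)])
y_pred = X_new.T @ coeffs
print('Prediction:', y_pred)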

OUTPUT:
y = -38494.26 + 66878.12x + 287369.29x^2 + 460744.27x^3
RMSE: 100912.45186113848
R2 score: 0.8737535471595872
CODE 2 - USING LIBRARIES:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv("/kaggle/input/position-salaries/Position_Salaries.csv")
x = data.iloc[:, 1:2]
y = data['Salary']

# Expand Level into polynomial terms up to degree 4
poly_reg = PolynomialFeatures(degree=4)
x_poly = poly_reg.fit_transform(x)

# Fit a linear model on the expanded features
linear = LinearRegression()
linear.fit(x_poly, y)
y_pred = linear.predict(x_poly)

rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)
r2_scr = r2_score(y, y_pred)
print('R2 SCORE:', r2_scr)

OUTPUT:
Salary = -195333.33333333337 + 80878.78787878789 * Level
RMSE : 163388.73519272613
R2 score: 0.6690412331929895
PLOT:
plt.scatter(x,y,color='Black')
plt.plot(x,linear.predict(x_poly),color='Red')
plt.xlabel('Levels')
plt.ylabel('Salary')

INFERENCE:
Using the Polynomial Regression model, we have predicted the salaries of the employees of a
company. The model has computed an R square score of 0.99739, which means it explains
about 99.7% of the variance in salary.
EX NO 06 LOGISTIC REGRESSION
DATE: 18.05.2023

PROBLEM STATEMENT:
The problem statement is that we have to implement logistic regression in Python both
from scratch and with built-in functions, using numpy and pandas.

PROBLEM ANALYSIS:
The dataset contains the SAT score achieved by a student (input feature) and a binary variable
indicating whether the student was admitted (output label). Our goal is to build a logistic
regression model that can accurately predict the admission decision based on the SAT scores.
Logistic regression is a binary classification algorithm commonly used for predicting binary
outcomes, such as "admitted" or "not admitted" in our case. It models the relationship between
the input features and the probability of the output label using a logistic function. To train
the logistic regression model, we will split the dataset into two parts: a training set and a
test set. The training set will be used to train the model, and the test set will be used to
evaluate its performance. We will use evaluation metrics such as accuracy, precision, recall,
and F1-score to assess how well the model predicts the admission decisions based on the SAT
scores. Once the logistic regression model is trained, we can use it to make predictions on
new, unseen data. Given a student's SAT score, the model will output the probability of being
admitted. We can then apply a threshold (e.g., 0.5) to classify the student as admitted or not
admitted.
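
The logistic function and the thresholding step described above can be sketched as follows. The parameter values `b0` and `b1` are made up for illustration (they are not the fitted values from this exercise), and the inputs are assumed to be standardized scores.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Return 1 (admitted) if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0

# Hypothetical parameters for a single standardized feature
b0, b1 = 0.3, 1.9
for x in (-2.0, 0.0, 2.0):
    p = sigmoid(b0 + b1 * x)
    print(x, round(p, 3), classify(p))
```

Raising the threshold above 0.5 makes the classifier more conservative about predicting admission; lowering it does the opposite.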

SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import numpy as np
import pandas as pd

# `data` is assumed to be the DataFrame already loaded with the 'SAT' and
# 'Admitted' columns, with 'Admitted' encoded as 0/1
X = data['SAT'].values.astype(float)
y = data['Admitted'].values

# Normalize the independent variable (optional but recommended)
x_mean, x_std = np.mean(X), np.std(X)
X = (X - x_mean) / x_std
# Add a column of ones to X for the bias term
X = np.column_stack((np.ones(len(X)), X))

# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define the cost function (average binary cross-entropy)
def cost_function(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    cost = (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cost

# Define the gradient descent function
def gradient_descent(X, y, theta, alpha, num_iterations):
    m = len(y)
    cost_history = []
    for i in range(num_iterations):
        h = sigmoid(np.dot(X, theta))
        gradient = (1 / m) * np.dot(X.T, (h - y))
        theta -= alpha * gradient
        cost = cost_function(X, y, theta)
        cost_history.append(cost)
    return theta, cost_history

# Set the learning rate and number of iterations
learning_rate = 0.01
num_iterations = 1000
# Initialize the parameters (theta)
theta = np.zeros(X.shape[1])
# Run gradient descent to train the model
theta_optimized, cost_history = gradient_descent(X, y, theta, learning_rate, num_iterations)
# Print the optimized parameters (theta)
print("Optimized parameters (theta):", theta_optimized)

# Example test data: bias term plus an SAT score of 1600, normalized like the training data
X_test = np.array([1, (1600 - x_mean) / x_std])
# Predict the admission using the optimized parameters
h_test = sigmoid(np.dot(X_test, theta_optimized))
prediction = 1 if h_test >= 0.5 else 0

# Calculate the accuracy on the training set
y_pred_train = sigmoid(np.dot(X, theta_optimized))
y_pred_train = np.round(y_pred_train)  # Round the predictions to 0 or 1
accuracy = np.mean(y_pred_train == y) * 100

# Print the predicted admission and accuracy
print("Predicted admission:", prediction)
print("Accuracy on the training set:", accuracy)

OUTPUT:
CODE 2 - USING LIBRARIES:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create the dependent and independent variables
# (`data` is the DataFrame with the 'SAT' and 'Admitted' columns)
y = data['Admitted']
x1 = data['SAT']
X = x1.values.reshape(-1, 1)

# Create an instance of LogisticRegression model
logreg = LogisticRegression()
# Fit the model to the training data
logreg.fit(X, y)
X_test = np.array([[1600]])  # Your test data
# Predict the admission for the test data
y_pred = logreg.predict(X_test)
# Calculate the accuracy
accuracy = accuracy_score(y, logreg.predict(X))
# Print the predicted admission and accuracy
print("Predicted admission:", y_pred[0])
print("Accuracy:", accuracy)

plt.scatter(x1, y, color='C0')
# Don't forget to label your axes!
plt.xlabel('SAT', fontsize=20)
plt.ylabel('Admitted', fontsize=20)
plt.show()

# Fit the same model with statsmodels to plot the logistic curve
x = sm.add_constant(x1)
reg_log = sm.Logit(y, x)
results_log = reg_log.fit()

def f(x, b0, b1):
    return np.array(np.exp(b0 + x * b1) / (1 + np.exp(b0 + x * b1)))

f_sorted = np.sort(f(x1, results_log.params.iloc[0], results_log.params.iloc[1]))
x_sorted = np.sort(np.array(x1))
plt.scatter(x1, y, color='C0')
plt.xlabel('SAT', fontsize=20)
plt.ylabel('Admitted', fontsize=20)
plt.plot(x_sorted, f_sorted, color='C8')
plt.show()

OUTPUT:

PLOT:
INFERENCE:
We have implemented Logistic Regression both from scratch and with built-in functions.
The accuracies of the from-scratch model and the built-in model are found to differ only
slightly, which indicates that the two implementations agree.

EX NO 07 GAUSSIAN NAÏVE BAYES MODEL


DATE: 23.05.2023

PROBLEM STATEMENT:
The problem statement is that we have to implement the Gaussian naive Bayes classifier in
Python both from scratch and with built-in functions, using numpy and pandas.

PROBLEM ANALYSIS:
The code begins by importing the necessary libraries: pandas, numpy, and scikit-learn's
train_test_split, GaussianNB, and accuracy_score. It uses the pandas library to read
the CSV file 'fraud_oracle.csv' and store it in a DataFrame called 'df', and then drops the
'PolicyNumber' and 'RepNumber' columns from the DataFrame using the drop() method. The
code selects four features ('DriverRating', 'Deductible', 'Age', 'WeekOfMonth') as input
features (X) for the model and assigns the 'FraudFound_P' column as the target variable
(y). The code splits the dataset into training and testing sets using the train_test_split()
function from scikit-learn. It assigns 80% of the data for training (X_train, y_train) and 20%
for testing (X_test, y_test). The random_state parameter is set to 42 for reproducibility.
Finally, the code prints the accuracy score obtained from the accuracy_score() function.
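
Gaussian naive Bayes scores a numeric feature value with the normal density, using the mean and standard deviation of that feature within each class. A minimal sketch of that density in plain Python follows; the function name `gaussian_pdf` and the sample inputs are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density N(x; mean, std**2); std must be positive."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)

# Density of the standard normal at its mean and one standard deviation away
print(gaussian_pdf(0.0, 0.0, 1.0))
print(gaussian_pdf(1.0, 0.0, 1.0))
```

Note the squared terms: both the deviation from the mean and the standard deviation are raised to the power of two, not multiplied by two.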
SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
df = pd.read_csv('/kaggle/input/vehicle-claim-fraud-detection/fraud_oracle.csv')

# Drop irrelevant columns
df.drop(['PolicyNumber', 'RepNumber'], axis=1, inplace=True)

# Convert categorical variables to numerical
df['Sex'] = df['Sex'].map({'Male': 0, 'Female': 1})
df['MaritalStatus'] = df['MaritalStatus'].map({'Single': 0, 'Married': 1})
df = pd.get_dummies(df, columns=['Make', 'AccidentArea', 'DayOfWeekClaimed',
                                 'MonthClaimed', 'VehicleCategory', 'PolicyType',
                                 'AgentType', 'AddressChange_Claim'])
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Every column except the target is a feature
feature_cols = [col for col in train.columns if col != 'FraudFound_P']

# Separate the fraudulent and non-fraudulent claims in the training set
fraud_train = train[train['FraudFound_P'] == 1]
nonfraud_train = train[train['FraudFound_P'] == 0]

# Compute the prior probabilities
prior_fraud = len(fraud_train) / len(train)
prior_nonfraud = len(nonfraud_train) / len(train)

# Compute the likelihoods for each feature and class:
# (mean, std) for float features, relative frequencies for the rest
likelihood_fraud = {}
likelihood_nonfraud = {}
for col in feature_cols:
    if train[col].dtype == 'float64':
        likelihood_fraud[col] = (np.mean(fraud_train[col]), np.std(fraud_train[col]))
        likelihood_nonfraud[col] = (np.mean(nonfraud_train[col]), np.std(nonfraud_train[col]))
    else:
        likelihood_fraud[col] = dict(fraud_train[col].value_counts(normalize=True))
        likelihood_nonfraud[col] = dict(nonfraud_train[col].value_counts(normalize=True))

# Make predictions on the testing set
predictions = []
for index, row in test.iterrows():
    # Compute the posterior probabilities, starting from the priors
    posterior_fraud = prior_fraud
    posterior_nonfraud = prior_nonfraud
    for col in feature_cols:
        if test[col].dtype == 'float64':
            # Gaussian density with the class mean and standard deviation
            mean_f, std_f = likelihood_fraud[col]
            mean_nf, std_nf = likelihood_nonfraud[col]
            likelihood_f = (np.exp(-(row[col] - mean_f) ** 2 / (2 * std_f ** 2))
                            / (np.sqrt(2 * np.pi) * std_f))
            likelihood_nf = (np.exp(-(row[col] - mean_nf) ** 2 / (2 * std_nf ** 2))
                             / (np.sqrt(2 * np.pi) * std_nf))
        else:
            # Unseen categories get a likelihood of 0
            likelihood_f = likelihood_fraud[col].get(row[col], 0)
            likelihood_nf = likelihood_nonfraud[col].get(row[col], 0)
        posterior_fraud *= likelihood_f
        posterior_nonfraud *= likelihood_nf

    # Make the prediction
    if posterior_fraud > posterior_nonfraud:
        predictions.append(1)
    else:
        predictions.append(0)

# Evaluate the model
accuracy = accuracy_score(test['FraudFound_P'], predictions)
confusion = confusion_matrix(test['FraudFound_P'], predictions)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion)
OUTPUT:

CODE 2 - USING LIBRARIES:


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.read_csv('/kaggle/input/vehicle-claim-fraud-detection/fraud_oracle.csv')
df.drop(['PolicyNumber', 'RepNumber'], axis=1, inplace=True)

# Four numeric input features and the fraud label as the target
X = df[['DriverRating', 'Deductible', 'Age', 'WeekOfMonth']]
y = df['FraudFound_P']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

OUTPUT:

INFERENCE:
We have implemented the Gaussian Naive Bayes classifier both from scratch and with built-in
functions. The accuracies of the from-scratch classifier and the built-in classifier are
found to be the same, which indicates that the two implementations agree.

EX NO 08 BERNOULLI NAÏVE BAYES MODEL


DATE: 23.05.2023

PROBLEM STATEMENT:
The problem statement is that we have to implement the Bernoulli naive Bayes classifier in
Python both from scratch and with built-in functions, using numpy and pandas.

PROBLEM ANALYSIS:
The code starts by importing the necessary libraries: NumPy, Pandas, and scikit-learn
modules. The Titanic dataset is read from a CSV file using the pd.read_csv() function and
stored in the dataset variable. The categorical variables in the dataset are encoded using the
LabelEncoder from scikit-learn. The apply() function is used to apply the encoding to
object-type columns, while numerical columns are left unchanged. The encoded features are
stored in the X_encoded variable. The target variable is also encoded using the LabelEncoder,
and the encoded labels are stored in the y_encoded variable. The dataset is split into training
and testing sets using the train_test_split() function from scikit-learn. The training set
consists of 75% of the data, while the testing set contains the remaining 25%. The random state
is set to 42 for reproducibility. An instance of the BernoulliNB class is created and assigned
to the bnb variable. The model is then trained on the training data using the fit() method. The
accuracy of the model is calculated by comparing the predicted labels (y_pred) with the actual
labels from the testing set (y_test) using the accuracy_score() function. The accuracy score is
printed to the console.
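
The core of a Bernoulli naive Bayes model is an estimate of P(feature = 1) for each class, multiplied across features. The sketch below uses add-one smoothing of the form (count + 1) / (n + 2), the same form that appears in the from-scratch code of this exercise; the function names and the three toy rows are hypothetical.

```python
def smoothed_feature_probability(values):
    """Estimate P(feature = 1) from 0/1 values with add-one (Laplace) smoothing."""
    return (sum(values) + 1) / (len(values) + 2)

def bernoulli_likelihood(sample, probabilities):
    """Product over features of p if the feature is 1, else (1 - p)."""
    likelihood = 1.0
    for value, p in zip(sample, probabilities):
        likelihood *= p if value == 1 else 1 - p
    return likelihood

# Hypothetical class with three training rows and two binary features
rows = [[1, 0], [1, 1], [0, 0]]
probs = [smoothed_feature_probability(col) for col in zip(*rows)]
print(probs)
print(bernoulli_likelihood([1, 0], probs))
```

Smoothing keeps every estimated probability strictly between 0 and 1, so a feature value never seen with a class cannot zero out the whole product.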

SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('/kaggle/input/bernoulli-naive-bayes/titanic_prediction.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

class BernoulliNaiveBayes:
    def __init__(self):
        self.class_probabilities = None
        self.feature_probabilities = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.class_probabilities = {}
        self.feature_probabilities = {}
        classes, class_counts = np.unique(y, return_counts=True)
        # Prior probability of each class
        for i in range(len(classes)):
            self.class_probabilities[classes[i]] = class_counts[i] / n_samples
        # P(feature = 1 | class) with add-one (Laplace) smoothing
        for feature in range(n_features):
            self.feature_probabilities[feature] = {}
            for class_name in classes:
                class_samples = X[y == class_name]
                feature_counts = np.sum(class_samples[:, feature] == 1)
                feature_probability = (feature_counts + 1) / (len(class_samples) + 2)
                self.feature_probabilities[feature][class_name] = feature_probability

    def predict(self, X):
        y_pred = []
        for sample in X:
            class_probabilities = {}
            for class_name, class_probability in self.class_probabilities.items():
                # Start from the prior, then multiply in each feature's likelihood
                posterior = class_probability
                for feature, feature_value in enumerate(sample):
                    p = self.feature_probabilities[feature][class_name]
                    posterior *= (1 - p) if feature_value == 0 else p
                class_probabilities[class_name] = posterior
            predicted_class = max(class_probabilities, key=class_probabilities.get)
            y_pred.append(predicted_class)
        return y_pred

naive_bayes = BernoulliNaiveBayes()
naive_bayes.fit(X_train, y_train)
y_pred = naive_bayes.predict(X_test)
accuracy = np.mean(np.array(y_pred) == y_test)
print("Accuracy:", accuracy)

OUTPUT:

CODE 2 - USING LIBRARIES:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

dataset = pd.read_csv("/kaggle/input/bernoulli-naive-bayes/titanic_prediction.csv")

# Encode object-type feature columns as integers; leave numeric columns unchanged
label_encoder = LabelEncoder()
X_encoded = dataset.iloc[:, :-1].apply(
    lambda x: label_encoder.fit_transform(x) if x.dtype == "object" else x)
y_encoded = label_encoder.fit_transform(dataset.iloc[:, -1])

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_encoded, test_size=0.25, random_state=42)

bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred = bnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:

INFERENCE:
We have implemented the Bernoulli Naive Bayes classifier both from scratch and with built-in
functions, and computed the accuracy of each implementation using the numpy and pandas
libraries in Python.

EX NO 09 MULTINOMIAL NAÏVE BAYES MODEL


DATE: 23.05.2023

PROBLEM STATEMENT:
The problem statement is that we have to implement the Multinomial naive Bayes classifier in
Python both from scratch and with built-in functions, using numpy and pandas.

PROBLEM ANALYSIS:
The necessary libraries are imported, including pandas, which is used for data manipulation,
and various modules from scikit-learn for the machine learning tasks. The code reads a CSV
file containing heart attack data and assigns it to the variable data. It then separates the
features (X) from the target variable (y). The dataset is split into training and testing sets
using the train_test_split function from scikit-learn. The testing set size is set to 20% of the
data, and a random state of 42 is used for reproducibility. A Multinomial Naive Bayes
classifier is instantiated using MultinomialNB() and trained on the training set using the fit
method. The trained classifier is used to make predictions on the test set (X_test) using the
predict method. The accuracy of the classifier is calculated by comparing the predicted values
(y_pred) with the actual target values (y_test) using the accuracy_score function. The
accuracy is then printed.

SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


import csv
import math

# Load the dataset
dataset = []
with open('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv', 'r') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        dataset.append(row)

# Remove the header
header = dataset[0]
dataset = dataset[1:]

# Convert string values to float
for i in range(len(dataset)):
    dataset[i] = [float(x) for x in dataset[i]]

# Split the dataset into features and target
X = [row[:-1] for row in dataset]
y = [row[-1] for row in dataset]

# Function to split the dataset based on the target variable
def split_dataset(X, y, target_value):
    X_subset = []
    y_subset = []
    for i in range(len(X)):
        if y[i] == target_value:
            X_subset.append(X[i])
            y_subset.append(y[i])
    return X_subset, y_subset

# Function to calculate the probability of a value given a mean and standard deviation
# (normal density; assumes stdev > 0)
def calculate_probability(value, mean, stdev):
    exponent = math.exp(-((value - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

# Function to train the Naive Bayes model
# Note: each feature is summarised by a per-class mean and standard deviation,
# so the likelihood used here is Gaussian rather than multinomial
def train(X_train, y_train):
    # Separate the dataset by class
    separated = {}
    for i in range(len(X_train)):
        features = X_train[i]
        target = y_train[i]
        if target not in separated:
            separated[target] = []
        separated[target].append(features)

    # Calculate the mean and standard deviation for each feature by class
    summaries = {}
    for target, features in separated.items():
        summaries[target] = []
        for i in range(len(features[0])):
            values = [row[i] for row in features]
            mean = sum(values) / len(values)
            stdev = math.sqrt(sum([(x - mean) ** 2 for x in values]) / len(values))
            summaries[target].append((mean, stdev))

    return summaries

# Function to make predictions using the trained model
# (class priors are not included; only the feature likelihoods are multiplied)
def predict(X_test, summaries):
    predictions = []
    for features in X_test:
        probabilities = {}
        for target, class_summaries in summaries.items():
            probabilities[target] = 1
            for i in range(len(class_summaries)):
                mean, stdev = class_summaries[i]
                value = features[i]
                probabilities[target] *= calculate_probability(value, mean, stdev)

        # Select the class with the highest probability
        best_class = None
        best_probability = -1
        for target, probability in probabilities.items():
            if best_class is None or probability > best_probability:
                best_class = target
                best_probability = probability

        predictions.append(best_class)

    return predictions

# Split the dataset into training and testing sets
# (sequential split without shuffling; shuffle first if the file is ordered by class)
split_ratio = 0.8
split_point = int(split_ratio * len(dataset))
X_train = X[:split_point]
y_train = y[:split_point]
X_test = X[split_point:]
y_test = y[split_point:]

# Train the model
model = train(X_train, y_train)

# Make predictions on the test set
predictions = predict(X_test, model)

# Calculate accuracy
correct_predictions = sum(1 for pred, true in zip(predictions, y_test) if pred == true)
accuracy = correct_predictions / len(y_test)
print("Accuracy:", accuracy)

OUTPUT:
CODE 2 - USING LIBRARIES:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
data = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
data.head()
X=data.iloc[:,:-1]
y=data.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes classifier


nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the test set


y_pred = nb_classifier.predict(X_test)

# Evaluate the accuracy of the classifier


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:
INFERENCE:
The Naive Bayes classifier was implemented both from scratch and with scikit-learn's built-in
MultinomialNB on the heart attack dataset, and the accuracy of each version was computed in
Python.

EX NO 10 K-NEAREST NEIGHBOR MODEL


DATE: 25.05.2023

PROBLEM STATEMENT:
Implement the KNN classifier in Python, both from scratch and with the built-in library
functions, using NumPy and pandas.

SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self.X_train = X
        self.y_train = y

    def euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def predict(self, X):
        y_pred = []
        for x_test in X:
            # Distance from the test point to every training point
            distances = []
            for x_train in self.X_train:
                dist = self.euclidean_distance(x_test, x_train)
                distances.append(dist)

            # Majority vote among the k nearest training points
            indices = np.argsort(distances)[:self.k]
            k_nearest_labels = self.y_train[indices]
            unique_labels, counts = np.unique(k_nearest_labels, return_counts=True)
            predicted_label = unique_labels[np.argmax(counts)]
            y_pred.append(predicted_label)
        return np.array(y_pred)

data = pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
data.head()

x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# K-NN classification
knn = KNNClassifier()  # the k value is missing in the original; the class default (k=3) is used
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Calculate accuracy with scikit-learn
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate accuracy manually as a cross-check
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

OUTPUT:

CODE 2 - USING LIBRARIES:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset


dataset=pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
dataset.head()

# Split the dataset into features and labels


X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the KNN model


knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model on the training data


knn.fit(X_train, y_train)

# Predict the labels for the test data


y_pred = knn.predict(X_test)

# Calculate the accuracy of the model


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

OUTPUT:
INFERENCE:
The KNN classifier was implemented both from scratch and with scikit-learn's built-in
KNeighborsClassifier on the Iris dataset, and the accuracy of each version was computed on the
test set.

EX NO 11 K-MEANS CLUSTERING MODEL


DATE: 30.05.2023

PROBLEM STATEMENT:
The problem at hand is to cluster a given dataset into k distinct groups using the K-Means
Clustering algorithm. The dataset consists of various data points/features, and the goal is to
identify natural groupings or patterns within the data. The number of clusters, 'k', needs to be
determined based on the nature of the data or domain knowledge.
PROBLEM ANALYSIS:
The K-Means Clustering model is developed using a programming language or a machine learning
library that supports the algorithm. The necessary steps are initializing the centroids,
assigning data points to clusters, and updating the centroids iteratively. The performance of
the clustering model can be evaluated using metrics such as the silhouette score or the average
distance between data points and their cluster centroids, which indicate the quality and
coherence of the obtained clusters. Visualizing the clusters, by plotting the data points and
their respective clusters, helps in understanding the structure and patterns within the dataset.
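
One of the evaluation measures mentioned above, the average distance between data points and
their cluster centroids, can be sketched in a few lines. This is an illustrative example only:
the helper name average_distance_to_centroid and the toy points are assumptions, not part of
the original record.

```python
import numpy as np

def average_distance_to_centroid(X, labels, centroids):
    """Mean Euclidean distance between each point and its assigned centroid."""
    diffs = X - centroids[labels]  # centroids[labels] picks one centroid per point
    return float(np.mean(np.linalg.norm(diffs, axis=1)))

# Toy data: two tight clusters, each point exactly one unit from its centroid
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
avg_dist = average_distance_to_centroid(X_toy, labels, centroids)
print(avg_dist)
```

A smaller value indicates tighter clusters. For the silhouette score, scikit-learn provides
sklearn.metrics.silhouette_score(X, labels), which can be applied to the labels produced by
either implementation.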

CODE 1 - FROM SCRATCH:


import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

class KMeansClustering:
    def __init__(self, num_clusters, max_iterations=100):
        self.K = num_clusters
        self.max_iterations = max_iterations
        self.centroids = None
        self.clusters = None

    def initialize_random_centroids(self, X):
        # Pick K random data points as the initial centroids
        centroids = np.zeros((self.K, X.shape[1]))
        for k in range(self.K):
            centroid = X[np.random.randint(0, X.shape[0])]
            centroids[k] = centroid
        return centroids

    def calculate_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def create_clusters(self, X, centroids):
        # Assign every point to the nearest centroid
        clusters = [[] for _ in range(self.K)]
        for i in range(X.shape[0]):
            distances = [self.calculate_distance(X[i], centroid) for centroid in centroids]
            cluster_idx = np.argmin(distances)
            clusters[cluster_idx].append(i)
        return clusters

    def calculate_new_centroids(self, X, clusters):
        # Move every centroid to the mean of its assigned points
        centroids = np.zeros((self.K, X.shape[1]))
        for cluster_idx, cluster in enumerate(clusters):
            if len(cluster) > 0:
                new_centroid = np.mean(X[cluster], axis=0)
                centroids[cluster_idx] = new_centroid
        return centroids

    def fit(self, X):
        self.centroids = self.initialize_random_centroids(X)
        for _ in range(self.max_iterations):
            self.clusters = self.create_clusters(X, self.centroids)
            prev_centroids = np.copy(self.centroids)
            self.centroids = self.calculate_new_centroids(X, self.clusters)
            # Stop when the centroids no longer move
            if np.all(prev_centroids == self.centroids):
                break
        return self.clusters, self.centroids

    def predict(self, X):
        distances = np.zeros((X.shape[0], self.K))
        for i in range(X.shape[0]):
            for j in range(self.K):
                distances[i, j] = self.calculate_distance(X[i], self.centroids[j])
        return np.argmin(distances, axis=1)

    def plot_clusters(self, X, clusters):
        colors = ['r', 'g', 'b', 'c', 'm', 'y']
        for cluster_idx, cluster in enumerate(clusters):
            for idx in cluster:
                plt.scatter(X[idx, 0], X[idx, 1], color=colors[cluster_idx])
        for centroid in self.centroids:
            plt.scatter(centroid[0], centroid[1], marker='x', color='k', s=100)
        plt.show()

# Example usage
np.random.seed(10)
num_clusters = 3
X, _ = make_blobs(n_samples=1000, n_features=2, centers=num_clusters)

kmeans = KMeansClustering(num_clusters)
clusters, centroids = kmeans.fit(X)
predictions = kmeans.predict(X)
kmeans.plot_clusters(X, clusters)

OUTPUT:
CODE 2 - USING LIBRARIES:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")
# Generate sample data
np.random.seed(10)
num_clusters = 3
X, _ = make_blobs(n_samples=1000, n_features=2, centers=num_clusters)
# Perform K-means clustering using scikit-learn
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(X)
# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Plot the clusters and centroids
colors = ['r', 'g', 'b', 'c', 'm', 'y']
for i in range(num_clusters):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], c=colors[i])
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', color='k', s=100)
plt.show()

OUTPUT:

INFERENCE:
K-Means clustering was implemented both from scratch and with scikit-learn's built-in KMeans.
Both versions plot the data points together with the cluster centroids, and these plots make it
possible to identify the clusters.

EX NO 12 HIERARCHICAL CLUSTERING MODEL


DATE: 30.05.2023

PROBLEM STATEMENT:
Implement the hierarchical clustering model in Python, both from scratch and with the built-in
library functions, using NumPy and pandas.

PROBLEM ANALYSIS:
The code imports the required libraries, including numpy, scikit-learn's
AgglomerativeClustering class and make_blobs function, and matplotlib for plotting. The code
generates a synthetic dataset using the make_blobs function. It creates 50 samples distributed
among 3 clusters, with a specified standard deviation. An instance of AgglomerativeClustering
is created with the desired number of clusters (3 in this case). The fit method is then called
on the clustering object, which performs the clustering on the given dataset. The scatter
function from matplotlib is used to create a scatter plot of the data points. Each point is
colored based on its assigned cluster label. The xlabel, ylabel, and title functions set the
labels and title for the plot. The centroids of each cluster are computed by taking the mean of
the points in that cluster. For each unique label in the clustering labels, the code retrieves
the points belonging to that cluster and calculates the centroid. These centroids are then
plotted on the scatter plot as red crosses. The show function is called to display the plot
with the clustered data and marked centroids.
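
Agglomerative clustering repeatedly merges the two closest clusters, and what counts as
"closest" depends on the linkage criterion. The sketch below computes three common criteria
(single, complete and average linkage) for two small clusters. The helper name
linkage_distances and the points are made up for illustration.

```python
import numpy as np

def linkage_distances(A, B):
    """Return single, complete and average linkage distances between clusters A and B."""
    # Pairwise Euclidean distances, shape (len(A), len(B))
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(), d.max(), d.mean()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
single, complete, average = linkage_distances(A, B)
print(single, complete, average)
```

The from-scratch code in this exercise merges clusters by the mean pairwise distance, which
corresponds to average linkage, whereas scikit-learn's AgglomerativeClustering defaults to Ward
linkage, so the two versions are not guaranteed to perform identical merges.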

CODE 1 - FROM SCRATCH:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs  # used only to generate the sample data

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def hierarchical_clustering(X, n_clusters):
    num_samples = X.shape[0]
    distances = np.zeros((num_samples, num_samples))

    # Calculate pairwise distances (stored symmetrically so any index order works)
    for i in range(num_samples):
        for j in range(i + 1, num_samples):
            distances[i, j] = distances[j, i] = euclidean_distance(X[i], X[j])

    # Initialize clusters: every sample starts in its own cluster
    clusters = [[i] for i in range(num_samples)]

    # Perform clustering
    while len(clusters) > n_clusters:
        min_dist = np.inf
        merge_indices = (0, 0)

        # Find the closest clusters (average linkage)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cluster1 = clusters[i]
                cluster2 = clusters[j]
                dist = np.mean(distances[np.ix_(cluster1, cluster2)])
                if dist < min_dist:
                    min_dist = dist
                    merge_indices = (i, j)

        # Merge the closest clusters
        merged_cluster = clusters[merge_indices[0]] + clusters[merge_indices[1]]
        clusters = [c for idx, c in enumerate(clusters) if idx not in merge_indices] + [merged_cluster]

    # Calculate and return centroids
    centroids = []
    for cluster in clusters:
        cluster_points = X[cluster]
        centroid = np.mean(cluster_points, axis=0)
        centroids.append(centroid)
    return clusters, centroids

# Generate sample data
np.random.seed(0)
X, y = make_blobs(n_samples=50, centers=3, random_state=0, cluster_std=0.5)

# Perform hierarchical clustering
clusters, centroids = hierarchical_clustering(X, n_clusters=3)

# Plotting the clusters and centroids
colors = ['red', 'blue', 'green']
for i, cluster in enumerate(clusters):
    points = X[cluster]
    plt.scatter(points[:, 0], points[:, 1], color=colors[i])
    centroid = centroids[i]
    plt.scatter(centroid[0], centroid[1], marker='x', color='black', s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Clustering with Centroid Markers')
plt.show()

OUTPUT:

CODE 2 - USING LIBRARIES:


import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data


np.random.seed(0)
X, y = make_blobs(n_samples=50, centers=3, random_state=0, cluster_std=0.5)

# Perform hierarchical clustering


clustering = AgglomerativeClustering(n_clusters=3)
clustering.fit(X)

# Plotting the clusters and centroids


plt.scatter(X[:, 0], X[:, 1], c=clustering.labels_, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Clustering')

# Calculate and mark centroids


centroids = []
for label in np.unique(clustering.labels_):
    cluster_points = X[clustering.labels_ == label]
    centroid = np.mean(cluster_points, axis=0)
    centroids.append(centroid)
    plt.scatter(centroid[0], centroid[1], marker='x', s=100, color='red')
plt.show()

OUTPUT:
INFERENCE:
Hierarchical clustering was implemented both from scratch and with scikit-learn's built-in
AgglomerativeClustering. Both versions plot the clustered points together with the cluster
centroids, and these plots make it possible to identify the clusters.

EX NO 13 PRINCIPAL COMPONENT ANALYSIS


DATE: 01.06.2023

PROBLEM STATEMENT:
Implement principal component analysis (PCA) in Python, both from scratch and with the built-in
library functions, using NumPy and pandas.

SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('/kaggle/input/iris-flower-dataset/IRIS.csv')
data.head()

X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
        self.mean = None

    def fit(self, X):
        # Center the data and remember the mean for transform()
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean

        # Compute the covariance matrix
        covariance_matrix = np.cov(X_centered, rowvar=False)

        # Perform eigendecomposition (eigh is suited to symmetric matrices
        # and returns real eigenvalues)
        eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

        # Sort eigenvectors by decreasing eigenvalue
        indices = np.argsort(eigenvalues)[::-1]
        sorted_eigenvalues = eigenvalues[indices]
        sorted_eigenvectors = eigenvectors[:, indices]

        # Select the top n_components eigenvectors
        self.components = sorted_eigenvectors[:, :self.n_components]

    def transform(self, X):
        # Center the data with the mean learned in fit()
        X_centered = X - self.mean

        # Project the data onto the selected components
        transformed_data = np.dot(X_centered, self.components)

        return transformed_data
pca = PCA(n_components=2)
pca.fit(X)
transformed_data = pca.transform(X)
print("Original data shape:", X.shape)
print("Transformed data shape:", transformed_data.shape)
species_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
color_labels = [species_map[label] for label in y]

OUTPUT:

PLOT 1:
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)  # left panel; the transformed data goes in the right panel
plt.scatter(X[:, 0], X[:, 1], c=color_labels)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data')
PLOT 2:
plt.subplot(1, 2, 2)
plt.scatter(transformed_data[:, 0], transformed_data[:, 1], c=color_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Transformed Data')

plt.tight_layout()
plt.show()

CODE 2 - USING LIBRARIES:


from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
pca_samples = pca.transform(X)
df = pd.DataFrame(pca_samples)
df.head()
from sklearn.model_selection import train_test_split
# Train on the PCA-reduced features so that the classifier evaluates the projection
x_train, x_test, y_train, y_test = train_test_split(pca_samples, y, test_size=0.3, shuffle=True)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_pred,y_test)

OUTPUT:

PLOT:
plt.subplot(1, 2, 2)
plt.scatter(pca_samples[:, 0], pca_samples[:, 1], c=color_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Transformed Data')
plt.tight_layout()
plt.show()
INFERENCE:
Principal component analysis was implemented both from scratch and with scikit-learn's built-in
PCA on the Iris dataset. The scatter plots of the first two principal components make it
possible to compare the three species in the reduced two-dimensional space.
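
A useful check when reducing dimensions is how much of the total variance each principal
component retains. The sketch below computes this ratio from the eigenvalues of the covariance
matrix, the same quantities the from-scratch class sorts. The function name
explained_variance_ratio and the toy data are assumptions for illustration.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order, so reverse them
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    return eigenvalues / eigenvalues.sum()

# Toy data lying on a straight line: one component should carry all the variance
X_toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
ratio = explained_variance_ratio(X_toy)
print(ratio)
```

With scikit-learn, the fitted PCA object exposes the same information through its
explained_variance_ratio_ attribute.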

EX NO 14 DECISION TREE CLASSIFIER


DATE: 06.06.2023

PROBLEM STATEMENT:
Implement the Decision Tree classifier in Python, both from scratch and with the built-in
library functions, using NumPy and pandas.

SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


import pandas as pd
import numpy as np

# Load the dataset


data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

# Drop unnecessary columns


data.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

# Convert the diagnosis column to numeric values (0 for benign, 1 for malignant)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

# Split the dataset into features and target variable


X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Classifier


class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth

    def fit(self, X, y):
        self.X = X
        self.y = y
        self.n_classes = len(np.unique(y))
        self.n_features = X.shape[1]
        self.tree = self.build_tree()

    def build_tree(self):
        # Append the target as the last column so rows can be split together
        Xy = np.concatenate((self.X, self.y[:, np.newaxis]), axis=1)
        return self.recursive_build(Xy)

    def recursive_build(self, Xy, depth=0):
        # Stop when the depth limit is reached (depth is tracked per call)
        if self.max_depth is not None and depth >= self.max_depth:
            return self.get_leaf_node(Xy)

        # Stop when the node is pure
        if np.unique(Xy[:, -1]).size == 1:
            return self.get_leaf_node(Xy)

        best_split = self.get_best_split(Xy)
        if best_split is None:
            return self.get_leaf_node(Xy)

        left_child = self.recursive_build(best_split['left'], depth + 1)
        right_child = self.recursive_build(best_split['right'], depth + 1)

        return {
            'feature_index': best_split['feature_index'],
            'threshold': best_split['threshold'],
            'left': left_child,
            'right': right_child
        }

    def get_leaf_node(self, Xy):
        leaf_node = {'class_counts': np.bincount(Xy[:, -1].astype(int))}
        return leaf_node

    def get_best_split(self, Xy):
        best_split = None
        best_gini = 1.0
        for feature_index in range(self.n_features):
            feature_values = np.unique(Xy[:, feature_index])
            for threshold in feature_values:
                left = Xy[Xy[:, feature_index] <= threshold]
                right = Xy[Xy[:, feature_index] > threshold]

                # Skip splits that leave one side empty (avoids division by zero)
                if left.shape[0] == 0 or right.shape[0] == 0:
                    continue

                gini = self.calculate_gini(left, right)

                if gini < best_gini:
                    best_gini = gini
                    best_split = {
                        'feature_index': feature_index,
                        'threshold': threshold,
                        'left': left,
                        'right': right
                    }
        return best_split

    def calculate_gini(self, left, right):
        left_counts = np.bincount(left[:, -1].astype(int), minlength=self.n_classes)
        right_counts = np.bincount(right[:, -1].astype(int), minlength=self.n_classes)

        left_size = left.shape[0]
        right_size = right.shape[0]
        total_size = left_size + right_size

        gini_left = 1.0 - sum((left_counts[i] / left_size) ** 2 for i in range(self.n_classes))
        gini_right = 1.0 - sum((right_counts[i] / right_size) ** 2 for i in range(self.n_classes))

        # Weighted average of the two child impurities
        gini = (left_size / total_size) * gini_left + (right_size / total_size) * gini_right
        return gini

    def predict(self, X):
        return np.array([self.traverse_tree(x, self.tree) for x in X])

    def traverse_tree(self, x, node):
        # Leaf: return the majority class
        if 'class_counts' in node:
            return np.argmax(node['class_counts'])

        if x[node['feature_index']] <= node['threshold']:
            return self.traverse_tree(x, node['left'])
        else:
            return self.traverse_tree(x, node['right'])

# Create an instance of the Decision Tree Classifier


dt_classifier = DecisionTreeClassifier(max_depth=5)

# Fit the classifier to the training data


dt_classifier.fit(X_train.values, y_train.values)

# Predict the test data


y_pred = dt_classifier.predict(X_test.values)

# Calculate the accuracy


from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test.values, y_pred)
print(f"Accuracy: {accuracy}")

OUTPUT:

CODE 2 - USING LIBRARIES:


import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
dataset.head()
dataset = dataset.drop(["id"], axis = 1)
dataset = dataset.drop(["Unnamed: 32"], axis = 1)
dataset.diagnosis = [1 if i == "M" else 0 for i in dataset.diagnosis]
x = dataset.drop(["diagnosis"], axis = 1)
y = dataset.diagnosis.values
x = (x - x.min()) / (x.max() - x.min())  # column-wise min-max scaling
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
dt.score(x_test, y_test)

OUTPUT:

INFERENCE:
The Decision Tree classifier was implemented both from scratch and with scikit-learn's built-in
DecisionTreeClassifier, and the accuracy of each version was computed using NumPy and pandas.
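
The from-scratch classifier in this exercise chooses each split by minimising the weighted Gini
impurity of the two child nodes. A small worked example, with made-up labels and hypothetical
helper names, shows the arithmetic:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

left = [0, 0, 0, 1]   # labels that fall on the left side of a candidate split
right = [1, 1]        # labels that fall on the right side
parent_gini = gini_impurity(left + right)
split_gini = weighted_gini(left, right)
print(parent_gini, split_gini)
```

Here the parent node holds three samples of each class, so its impurity is 0.5. The left child
has impurity 0.375 and the right child is pure, giving a weighted impurity of 0.25, so this
split reduces the impurity.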

EX NO 15 RANDOM FOREST
DATE: 08.06.2023
PROBLEM STATEMENT:
Implement the Random Forest classifier in Python, both from scratch and with the built-in
library functions, using NumPy and pandas.

SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


import numpy as np
import pandas as pd

class Randomforestclassifier:
    def __init__(self, num_trees=100, max_features=None, max_depth=None):
        self.num_trees = num_trees
        self.max_features = max_features
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        X = np.array(X)  # Convert X to a NumPy array
        y = np.array(y)
        num_samples = len(X)
        num_features = len(X[0])
        self.trees = []

        for _ in range(self.num_trees):
            # Randomly select a subset of features
            if self.max_features:
                selected_features = np.random.choice(num_features, self.max_features, replace=False)
            else:
                selected_features = np.arange(num_features)
            X_subset = X[:, selected_features]

            # Randomly select a subset of samples (bootstrap aggregating)
            indices = np.random.choice(num_samples, num_samples, replace=True)
            X_bootstrap = X_subset[indices]
            y_bootstrap = y[indices]

            # Create a decision tree using the bootstrap samples
            tree = DecisionTreeClassifier(max_depth=self.max_depth)
            tree.fit(X_bootstrap, y_bootstrap)
            # Keep the feature subset so the same columns are used at prediction time
            self.trees.append((tree, selected_features))

    def predict(self, X):
        X = np.array(X)  # Convert X to a NumPy array
        predictions = []

        for tree, selected_features in self.trees:
            predictions.append(tree.predict(X[:, selected_features]))

        # Voting for the majority class (rounding the mean vote assumes binary 0/1 labels)
        predictions = np.array(predictions)
        return np.round(np.mean(predictions, axis=0))

class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None

    def fit(self, X, y):
        X = np.array(X)  # Convert X to a NumPy array
        y = np.array(y)
        self.tree = self.build_tree(X, y)

    def predict(self, X):
        X = np.array(X)  # Convert X to a NumPy array
        predictions = [self.predict_sample(x, self.tree) for x in X]
        return predictions

    def predict_sample(self, sample, node):
        if 'class' in node:
            return node['class']

        feature_value = sample[node['feature']]

        if feature_value <= node['value']:
            return self.predict_sample(sample, node['left'])
        else:
            return self.predict_sample(sample, node['right'])

    def build_tree(self, X, y, depth=0):
        num_samples, num_features = X.shape
        num_classes = len(np.unique(y))

        # Base cases: if all samples have the same class or maximum depth is reached,
        # return a leaf holding the majority class (integer labels assumed)
        if num_classes == 1 or (self.max_depth and depth == self.max_depth):
            return {'class': np.argmax(np.bincount(y))}

        # Find the best split point
        best_feature, best_value = self.find_best_split(X, y)

        # Handle the case where best_feature or best_value is None
        if best_feature is None or best_value is None:
            return {'class': np.argmax(np.bincount(y))}

        # Recursive splitting
        left_indices = np.where(X[:, best_feature] <= best_value)[0]
        right_indices = np.where(X[:, best_feature] > best_value)[0]

        left_tree = self.build_tree(X[left_indices], y[left_indices], depth + 1)
        right_tree = self.build_tree(X[right_indices], y[right_indices], depth + 1)

        return {'feature': best_feature, 'value': best_value, 'left': left_tree, 'right': right_tree}

    def find_best_split(self, X, y):
        best_gain = 0
        best_feature = None
        best_value = None

        for feature in range(X.shape[1]):
            values = np.unique(X[:, feature])

            for value in values:
                gain = self.calculate_gain(X, y, feature, value)

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_value = value

        return best_feature, best_value

    def calculate_gain(self, X, y, feature, value):
        parent_entropy = self.calculate_entropy(y)

        left_indices = np.where(X[:, feature] <= value)[0]
        right_indices = np.where(X[:, feature] > value)[0]

        if len(left_indices) == 0 or len(right_indices) == 0:
            return 0

        left_entropy = self.calculate_entropy(y[left_indices])
        right_entropy = self.calculate_entropy(y[right_indices])

        left_weight = len(left_indices) / len(X)
        right_weight = len(right_indices) / len(X)

        # Information gain: parent entropy minus the weighted child entropies
        gain = parent_entropy - (left_weight * left_entropy) - (right_weight * right_entropy)
        return gain

    def calculate_entropy(self, y):
        classes, class_counts = np.unique(y, return_counts=True)
        class_probs = class_counts / len(y)
        entropy = -np.sum(class_probs * np.log2(class_probs + 1e-10))
        return entropy

# x_train, x_test, y_train and y_test are not defined in this listing; they are assumed
# to come from the data preparation shown in CODE 2
from sklearn.metrics import accuracy_score
rf_classifier = Randomforestclassifier(num_trees=100, max_features=3, max_depth=5)
rf_classifier.fit(x_train, y_train)
y_pred = rf_classifier.predict(x_test)
accuracy_score(y_test, y_pred)

OUTPUT:

CODE 2 - USING LIBRARIES:


import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('/kaggle/input/full-filled-brain-stroke-dataset/full_data.csv')
df.head()
X = df.drop(['stroke'],axis=1)
y = df['stroke']
X= pd.get_dummies(X)
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.3, shuffle=True)


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
classifier= RandomForestClassifier(n_estimators= 10, criterion="entropy")
classifier.fit(x_train, y_train)
y_pred=classifier.predict(x_test)
accuracy_score(y_test,y_pred)

OUTPUT:

INFERENCE:
The Random Forest classifier was implemented both from scratch and with scikit-learn's built-in
RandomForestClassifier, and the accuracy of each version was computed using NumPy and pandas.
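
The from-scratch forest combines the tree outputs by rounding the mean prediction, which acts as
a majority vote only when the labels are binary 0/1. A vote that also works for more than two
classes can be sketched as follows. The function name majority_vote and the example votes are
illustrative assumptions, and integer class labels are assumed.

```python
import numpy as np

def majority_vote(tree_predictions):
    """tree_predictions has shape (n_trees, n_samples) and holds integer class labels."""
    tree_predictions = np.asarray(tree_predictions)
    # For each sample (column), count the votes per class and keep the most frequent one
    return np.array([np.bincount(column).argmax() for column in tree_predictions.T])

# Three trees voting on two samples, with class labels 0, 1 and 2
votes = [[0, 2],
         [2, 2],
         [0, 1]]
combined = majority_vote(votes)
print(combined)
```

For the first sample two of the three trees vote for class 0, so class 0 is returned, whereas
rounding the mean of 0, 2 and 0 would give class 1, which no tree predicted.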

EX NO 16 SUPPORT VECTOR MACHINE


DATE: 13.06.2023
PROBLEM STATEMENT:
Implement the Support Vector Machine (SVM) in Python, both from scratch and with the built-in
library functions, using NumPy and pandas.

SAMPLE DATASET:

CODE 1 - FROM SCRATCH:


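import numpy as np
from sklearn.metrics import accuracy_score

class SVM:
    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iters=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape

        # Map the labels to -1 / +1 as required by the hinge loss
        y_ = np.where(y <= 0, -1, 1)

        # initialize weights
        self.w = np.zeros(n_features)
        self.b = 0

        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1
                if condition:
                    # Margin satisfied: only the regularisation term is updated
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # Margin violated: the hinge loss term also contributes
                    self.w -= self.lr * (2 * self.lambda_param * self.w - np.dot(x_i, y_[idx]))
                    self.b -= self.lr * y_[idx]

    def predict(self, X):
        approx = np.dot(X, self.w) - self.b
        return np.sign(approx)

# The original listing does not show how the model is fitted. The lines below assume that
# X_train, X_test, y_train and y_test are prepared as in CODE 2.
svm = SVM()
svm.fit(np.asarray(X_train), np.asarray(y_train))
y_pred = np.where(svm.predict(np.asarray(X_test)) <= 0, 0, 1)  # map -1/+1 back to 0/1 labels
acc = accuracy_score(y_test, y_pred) * 100
print('Accuracy of the model: {0}%'.format(acc))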

OUTPUT:

CODE 2 - USING LIBRARIES:


import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv('/kaggle/input/support-vector-machine/Social_Network_Ads.csv')
dataset.head(5)
dataset=dataset.drop(['User ID'],axis=1)

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
dataset['Gender'] = le.fit_transform(dataset['Gender'])
X = dataset.drop(['Purchased'], axis=1)  # features only; the target must not be part of X
y = dataset['Purchased']
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X , y ,test_size=0.3, random_state=0)
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train , y_train)
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score , classification_report
acc = accuracy_score(y_test,y_pred)*100
print('Accuracy of the model: {0}%'.format(acc))

OUTPUT:

INFERENCE:
The SVM was implemented both from scratch and with scikit-learn's built-in LinearSVC, and the
accuracy of each version was computed using NumPy and pandas.
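
The from-scratch SVM is trained by stochastic sub-gradient descent on the regularised hinge
loss, lambda * ||w||^2 + max(0, 1 - y_i * (w . x_i - b)). A single update step, isolated from
the training loop, can be sketched as below. The function name hinge_step and the sample values
are assumptions for illustration; the update rule follows the one used in CODE 1.

```python
import numpy as np

def hinge_step(w, b, x_i, y_i, lr=0.001, lambda_param=0.01):
    """One stochastic sub-gradient step; y_i must be -1 or +1."""
    if y_i * (np.dot(x_i, w) - b) >= 1:
        # Correct side of the margin: only the regulariser contributes
        w = w - lr * (2 * lambda_param * w)
    else:
        # Margin violated: the hinge term adds -y_i * x_i to the gradient of w
        w = w - lr * (2 * lambda_param * w - y_i * x_i)
        b = b - lr * y_i
    return w, b

# Starting from zero weights, the margin is violated, so both w and b move
w, b = hinge_step(np.zeros(2), 0.0, np.array([1.0, 2.0]), 1, lr=0.1)
print(w, b)
```

Starting from w = (0, 0) and b = 0 with a learning rate of 0.1, the positive sample (1, 2)
violates the margin, so the step gives w = (0.1, 0.2) and b = -0.1.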
