Cat 2 Document Likkitha
SUBMITTED BY
LIKKITHA S
71762132023
SUBMITTED TO
MRS. D. SUDHA DEVI
PROBLEM STATEMENT:
Predicting the cost of homes in any rural area has become a significant difficulty for
construction companies. In order to anticipate the cost of dwellings in Coimbatore for a
specific square foot, the least squares method must be used.
PROBLEM ANALYSIS:
In this machine learning problem, we aim to build a model to predict the chance of admission
to a graduate school based on various features. The dataset provided contains information
about different applicants, including their GRE scores, TOEFL scores, university ratings,
statement of purpose (SOP) scores, letter of recommendation (LOR) scores, undergraduate
CGPA, research experience, and their corresponding chances of admission. The objective is
to create a model that can predict the likelihood of an applicant's admission based on their
profile. We want to determine the relationship between the various features and the chance of
admission and use this information to make accurate predictions for new, unseen applicants.
SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('adm_data.csv')
X = data['GRE Score'].values
Y = data['CGPA'].values
data.head()
mean_x = np.mean(X)
mean_y = np.mean(Y)
n = len(X)
numer = 0
denom = 0
for i in range(n):
    numer += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2
m = numer / denom
c = mean_y - (m * mean_x)
# Printing coefficients
print("Coefficients")
print(m, c)
max_x = np.max(X) + 30
min_x = np.min(X) - 30
x = np.linspace(min_x, max_x, 1000)
y=c+m*x
plt.plot(x, y, color='#58b970', label='Regression Line')
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')
plt.xlabel('GRE Score')
plt.ylabel('CGPA')
plt.legend()
plt.show()
rmse = 0
for i in range(n):
    y_pred = c + m * X[i]
    rmse += (Y[i] - y_pred) ** 2
rmse = np.sqrt(rmse/n)
print("RMSE")
print(rmse)
ss_tot = 0
ss_res = 0
for i in range(n):
    y_pred = c + m * X[i]
    ss_tot += (Y[i] - mean_y) ** 2
    ss_res += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_res/ss_tot)
print("R2 Score")
print(r2)
OUTPUT:
CODE 2 - USING LIBRARIES:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
dataset = pd.read_csv('adm_data.csv')
X = dataset['GRE Score'].values
Y = dataset['CGPA'].values
# scikit-learn expects 2-D arrays of shape (n_samples, n_features)
x = np.reshape(X, (-1, 1))
y = np.reshape(Y, (-1, 1))
dataset.head()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 0)
regressor = LinearRegression()
regressor.fit(x_train, y_train)
plt.scatter(x_train, y_train, color = 'red')
plt.plot(x_train, regressor.predict(x_train), color = 'blue')
plt.title('GRE Score vs CGPA (Training set)')
plt.xlabel('GRE Score')
plt.ylabel('CGPA')
plt.show()
# RMSE and R2 score on the training set
y_pred = regressor.predict(x_train)
print(np.sqrt(mean_squared_error(y_train, y_pred)))
print(r2_score(y_train, y_pred))
# RMSE and R2 score on the test set
y_pred = regressor.predict(x_test)
print(np.sqrt(mean_squared_error(y_test, y_pred)))
print(r2_score(y_test, y_pred))
OUTPUT:
PLOT:
EX NO 02 FIND S ALGORITHM
DATE: 23.03.2023
PROBLEM STATEMENT:
To implement Find S algorithm to find the specific hypothesis that fits all the positive
examples.
SAMPLE DATASET:
SOURCE CODE:
import pandas as pd
import numpy as np
data = pd.read_csv("ws.csv")
print(data,"\n")
d = np.array(data)[:,:-1]
print("\n The attributes are: ",d)
target = np.array(data)[:,-1]
print("\n The target is: ",target)
def train(c,t):
    # Start from the first positive example
    for i, val in enumerate(t):
        if val == "Yes":
            specific_hypothesis = c[i].copy()
            break
    # Generalise every attribute that disagrees with a positive example
    for i, val in enumerate(c):
        if t[i] == "Yes":
            for x in range(len(specific_hypothesis)):
                if val[x] != specific_hypothesis[x]:
                    specific_hypothesis[x] = '?'
                else:
                    pass
    return specific_hypothesis
print("\n The final hypothesis is:",train(d,target))
OUTPUT:
INFERENCE:
The Find-S algorithm finds the maximally specific hypothesis. Starting from the first positive
example, it generalises only as far as needed to cover every positive example in the training
set, and it ignores the negative examples.
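As a small illustration of the inference above, the sketch below runs Find-S on a hand-made dataset. The attribute values and the function name find_s are illustrative assumptions and are not taken from ws.csv.

```python
def find_s(examples, labels, positive="Yes"):
    """Return the maximally specific hypothesis covering all positive examples."""
    hypothesis = None
    for attrs, label in zip(examples, labels):
        if label != positive:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attrs)  # start from the first positive example
            continue
        # Generalise every attribute that disagrees with this positive example
        hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attrs)]
    return hypothesis


# Hypothetical toy data: (sky, temperature, humidity, wind)
examples = [
    ("Sunny", "Warm", "Normal", "Strong"),
    ("Sunny", "Warm", "High", "Strong"),
    ("Rainy", "Cold", "High", "Strong"),
    ("Sunny", "Warm", "High", "Weak"),
]
labels = ["Yes", "Yes", "No", "Yes"]

print(find_s(examples, labels))  # ['Sunny', 'Warm', '?', '?']
```

Humidity and wind differ across the positive examples, so they are generalised to '?', while the negative example has no effect on the result.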
EX NO 03 CANDIDATE ELIMINATION
DATE: 28.03.2023
PROBLEM STATEMENT:
The aim of the Candidate Elimination algorithm is to learn a hypothesis that approximates the
target concept based on a set of training examples and a hypothesis space. It seeks to find the
most specific and general hypotheses that can accurately classify the training examples and
generalize to unseen instances, thereby effectively narrowing down the hypothesis space. The
algorithm aims to provide a concise and accurate representation of the target concept using a
minimal set of hypotheses.
SAMPLE DATASET:
SOURCE CODE:
import numpy as np
import pandas as pd
data = pd.read_csv('candidate elimination.csv')
concepts = np.array(data.iloc[:,0:-1])
print("\nInstances are:\n",concepts)
target = np.array(data.iloc[:,-1])
print("\nTarget Values are: ",target)
def learn(concepts, target):
    specific_h = concepts[0].copy()
    print("\nInitialization of specific_h and general_h")
    print("\nSpecific Boundary: ", specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print("\nGeneric Boundary: ",general_h)
    for i, h in enumerate(concepts):
        print("\nInstance", i+1 , "is ", h)
        if target[i] == "yes":
            print("Instance is Positive ")
            # Generalise the specific boundary where it disagrees with the example
            for x in range(len(specific_h)):
                if h[x]!= specific_h[x]:
                    specific_h[x] ='?'
                    general_h[x][x] ='?'
        if target[i] == "no":
            print("Instance is Negative ")
            # Specialise the general boundary so that it excludes the example
            for x in range(len(specific_h)):
                if h[x]!= specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print("Specific Boundary after ", i+1, "Instance is ", specific_h)
        print("Generic Boundary after ", i+1, "Instance is ", general_h)
        print("\n")
    # Drop the rows of the general boundary that are still fully general
    all_general = ['?'] * len(specific_h)
    general_h = [g for g in general_h if g != all_general]
    return specific_h, general_h
s_final, g_final = learn(concepts, target)
print("Final Specific_h: ", s_final, sep="\n")
print("Final General_h: ", g_final, sep="\n")
OUTPUT:
INFERENCE:
The Candidate Elimination algorithm finds all hypotheses that are consistent with the given
training examples. Unlike the Find-S algorithm, it uses both negative and positive examples,
eliminating any inconsistent hypothesis.
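To make "consistent" concrete, here is a minimal sketch of the check behind the inference above: a hypothesis is kept only if it covers every positive example and no negative example. The helper names covers and is_consistent and the toy data are assumptions made for illustration.

```python
def covers(hypothesis, instance):
    """True if every attribute is '?' or equals the instance's value."""
    return all(h == "?" or h == a for h, a in zip(hypothesis, instance))


def is_consistent(hypothesis, examples, labels, positive="yes"):
    """True if the hypothesis covers exactly the positive examples."""
    return all(covers(hypothesis, x) == (y == positive)
               for x, y in zip(examples, labels))


# Hypothetical toy data: (sky, temperature, humidity)
examples = [
    ("Sunny", "Warm", "Normal"),
    ("Sunny", "Warm", "High"),
    ("Rainy", "Cold", "High"),
]
labels = ["yes", "yes", "no"]

print(is_consistent(("Sunny", "Warm", "?"), examples, labels))  # True
print(is_consistent(("?", "?", "High"), examples, labels))      # False
```

The second hypothesis is rejected because it misses the first positive example and covers the negative one.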
PROBLEM STATEMENT:
Predicting house prices in outlying areas has become a major problem for construction
companies. The problem here is to predict the price of a house in Coimbatore from its area in
square feet, number of bedrooms and age, using multiple linear regression (MLR).
PROBLEM ANALYSIS:
Here we develop and evaluate the predictive power of a model trained and tested on data
collected from houses in Coimbatore’s suburbs. Once we get a good fit, we will use this model
to predict the monetary value of a house located anywhere in Coimbatore. A model like this
would be valuable for real estate agents and construction companies, who could make use of
the information on a daily basis. Our data set consists of four columns (Area, Bedrooms, Age
of the home, Price) and thirty rows. We analyse the problem by applying the least squares
method to the given dataset. We separate the dataset into training and test data, train the MLR
model on the training data, predict the test results and visualize them. The R square value
obtained from the model indicates how much of the variation in price the model explains.
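The two evaluation measures used in this exercise can be sketched in a few lines. This is a minimal illustration with made-up prices and predictions; the function names rmse and r_squared are assumptions, not part of the original code.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


# Hypothetical prices and predictions
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

print("RMSE:", rmse(y_true, y_pred))
print("R2 Score:", r_squared(y_true, y_pred))
```

An R square value close to 1 means the predictions track the actual prices closely; it measures explained variance rather than classification accuracy.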
SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
# X1, X2 (predictors) and Y (price) are assumed to be NumPy arrays already loaded
# from the dataset; the loading step is not shown in this report.
batch_size = 30
no = batch_size
mean_x1 = np.mean(X1)
mean_x2 = np.mean(X2)
mean_y = np.mean(Y)
# Accumulate the centred sums of squares and cross-products
s11 = s22 = s12 = s1y = s2y = 0
for i in range(no):
    d1 = X1[i] - mean_x1
    d2 = X2[i] - mean_x2
    dy = Y[i] - mean_y
    s11 += d1 ** 2
    s22 += d2 ** 2
    s12 += d1 * d2
    s1y += d1 * dy
    s2y += d2 * dy
# Least-squares slopes for two predictors
dr = (s11 * s22) - (s12 ** 2)
b1 = ((s22 * s1y) - (s12 * s2y)) / dr
b2 = ((s11 * s2y) - (s12 * s1y)) / dr
# Intercept
b0 = mean_y - (b1 * mean_x1) - (b2 * mean_x2)
# Printing coefficients
print("Coefficients")
print(b0, b1, b2)
rmse = 0
for i in range(no):
    y_pred = b0 + b1 * X1[i] + b2 * X2[i]
    rmse += (Y[i] - y_pred) ** 2
rmse = np.sqrt(rmse/no)
print("RMSE")
print(rmse)
ss_tot = 0
ss_res = 0
for i in range(no):
    y_pred = b0 + b1 * X1[i] + b2 * X2[i]
    ss_tot += (Y[i] - mean_y) ** 2
    ss_res += (Y[i] - y_pred) ** 2
r2 = 1 - (ss_res/ss_tot)
print("R2 Score")
print(r2)
OUTPUT:
Coefficients
97.26987365 -68.67032333 6.78195499
RMSE
245.0986827382367
R2 Score
0.984526627789891
INFERENCE:
From the MLR scratch model we get an RMSE of 245.09868 and an R square value of 0.98,
meaning the model explains about 98% of the variance in price, so it fits the data very well.
The equation obtained here is y = 97.2698 * x1 + 6.7819 * x2 - 68.6703 * x3. The model can
be improved further by training it on a larger dataset.
INFERENCE:
From the model, the training dataset gives an RMSE of 243.14314 with an R square value of
0.98, and the test dataset gives an RMSE of 214.29771 with an R square value of 0.99. The
model explains about 99% of the variance on the test data, which indicates a very good fit on
both sets.
Now let’s give an input to the model and see how it predicts the price. Here the input is the
area in square feet.
EX NO 05 POLYNOMIAL REGRESSION
DATE: 20.04.2023
PROBLEM STATEMENT:
The objective of this problem statement is to provide comprehensive support to 50 startups by
addressing their key challenges and enabling their growth and success. The problem is to
develop a program that supports the growth of 50 startups by addressing their critical needs
and challenges. The program should encompass various areas and provide tailored solutions
to meet the unique requirements of each startup.
PROBLEM ANALYSIS:
Here we develop and evaluate the predictive power of a model trained and tested on data
collected from a company. Once we get a good fit, we will use this model to predict the salary
of an employee based on their position. The data set consists of five columns (R&D Spend,
Administration, Marketing Spend, State, Profit) and fifty rows. We separate the dataset into
training and test data, train the model on the training data, predict the test results and
visualize them. The R square value is then used to judge how well the model fits.
SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
# np, data (a DataFrame with a 'Level' column) and y (the target) are assumed to be
# defined already; the loading step is not shown in this report.
n=int(input("Enter the degree:"))
xn=data['Level']
# Add one column per power of x: x^1 ... x^n
for i in range(1,n+1):
    data['x',i]=xn**i
xval=np.array(data.iloc[:,-n:])
yval=np.array(y)
# Normal equation: coeffs = (X^T X)^-1 X^T y
coeffs = np.linalg.inv(xval.T @ xval) @ xval.T @ yval
print('Dimensions of coeff matrix:',coeffs.shape)
print('Coefficients:', coeffs)
new=int(input("Enter the value of x to be predicted:"))
# Build the same powers for the new input
X_n=[]
for i in range(1,n+1):
    X_n.append(new**i)
X_new=np.array(X_n)
y_pred = X_new.T @ coeffs
print('Prediction:', y_pred)
OUTPUT:
y = -38494.26 + 66878.12x + 287369.29x^2 + 460744.27x^3
RMSE: 100912.45186113848
R2 score: 0.8737535471595872
CODE 2 - USING LIBRARIES:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
print("Done")
data=pd.read_csv("/kaggle/input/position-salaries/Position_Salaries.csv")
x=data.iloc[:, 1:2]
y=data['Salary']
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
# fit_transform both fits the transformer and expands x into its polynomial terms
x_poly = poly_reg.fit_transform(x)
linear = LinearRegression()
linear.fit(x_poly, y)
y_pred=linear.predict(x_poly)
from sklearn.metrics import mean_squared_error,r2_score
rmse=np.sqrt(mean_squared_error(y,y_pred))
print('RMSE:',rmse)
r2_scr=r2_score(y,y_pred)
print('R2 SCORE:',r2_scr)
OUTPUT:
Salary = -195333.33333333337 + 80878.78787878789 * Level
RMSE : 163388.73519272613
R2 score: 0.6690412331929895
PLOT:
plt.scatter(x,y,color='Black')
plt.plot(x,linear.predict(x_poly),color='Red')
plt.xlabel('Levels')
plt.ylabel('Salary')
INFERENCE:
Using the polynomial regression model, we have predicted the salaries of the employees of a
company. The model gives an R square score of 0.99739, meaning it explains about 99.7% of
the variance in salary.
EX NO 06 LOGISTIC REGRESSION
DATE: 18.05.2023
PROBLEM STATEMENT:
The task is to implement logistic regression both from scratch and with built-in library
functions in Python, using NumPy and pandas.
PROBLEM ANALYSIS:
The dataset contains two variables: the SAT score achieved by a student (input feature) and a
binary variable indicating whether the student was admitted (output label). Our goal is to
build a logistic regression model that can accurately predict the admission decision based on
the SAT score. Logistic regression is a binary classification algorithm commonly used for
predicting binary outcomes, such as "admitted" or "not admitted" in our case. It models the
relationship between the input features and the probability of the output label using a
logistic function. To train the logistic regression model, we split the dataset into two parts:
a training set and a test set. The training set is used to train the model, and the test set is
used to evaluate its performance. We use evaluation metrics such as accuracy, precision,
recall, and F1-score to assess how well the model predicts the admission decisions based on
the SAT scores. Once the logistic regression model is trained, we can use it to make
predictions on new, unseen data. Given a student's SAT score, the model outputs the
probability of being admitted. We can then apply a threshold (e.g., 0.5) to classify the
student as admitted or not admitted.
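The thresholding step and the four metrics named above can be sketched as follows. The scores and labels are made up for illustration, and the helper names classify and classification_metrics are assumptions rather than part of the original code.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 decision."""
    return 1 if probability >= threshold else 0


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1-score for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f1


# Hypothetical linear scores z = b0 + b1 * x and true labels
scores = [-2.0, -0.5, 0.3, 1.5, 2.2, -1.0]
y_true = [0, 1, 1, 1, 1, 0]

y_pred = [classify(sigmoid(z)) for z in scores]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(y_pred)
print(accuracy, precision, recall, f1)
```

With a threshold of 0.5, a student is classified as admitted exactly when the linear score is non-negative, because sigmoid(0) equals 0.5.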
SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import numpy as np
import pandas as pd
# X (SAT scores) and y (0/1 admission labels) are assumed to be NumPy arrays already
# loaded from the dataset; the loading step is not shown in this report.
# Normalize the independent variable (optional but recommended)
x_mean, x_std = np.mean(X), np.std(X)
X = (X - x_mean) / x_std
# Add a column of ones to X for the bias term
X = np.column_stack((np.ones(len(X)), X))
# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Define the cost function
def cost_function(X, y, theta):
    m = len(y)
    h = sigmoid(np.dot(X, theta))
    cost = (-1/m) * np.sum(y*np.log(h) + (1-y)*np.log(1-h))
    return cost
# Define the gradient descent function
def gradient_descent(X, y, theta, alpha, num_iterations):
    m = len(y)
    cost_history = []
    for i in range(num_iterations):
        h = sigmoid(np.dot(X, theta))
        gradient = (1/m) * np.dot(X.T, (h - y))
        theta -= alpha * gradient
        cost = cost_function(X, y, theta)
        cost_history.append(cost)
    return theta, cost_history
# Set the learning rate and number of iterations
learning_rate = 0.01
num_iterations = 1000
# Initialize the parameters (theta)
theta = np.zeros(X.shape[1])
# Run gradient descent to train the model
theta_optimized, cost_history = gradient_descent(X, y, theta, learning_rate, num_iterations)
# Print the optimized parameters (theta)
print("Optimized parameters (theta):", theta_optimized)
# Optimized parameters (theta) obtained from the run above
theta_optimized = np.array([0.28904728, 1.85989727])
# Example test data: bias term, then the SAT score normalized like the training data
X_test = np.array([1, (1600 - x_mean) / x_std])
# Predict the admission using the optimized parameters
h_test = sigmoid(np.dot(X_test, theta_optimized))
prediction = 1 if h_test >= 0.5 else 0
# Calculate the accuracy on the training set
y_pred_train = sigmoid(np.dot(X, theta_optimized))
y_pred_train = np.round(y_pred_train) # Round the predictions to 0 or 1
accuracy = np.mean(y_pred_train == y) * 100
OUTPUT:
CODE 2 - USING LIBRARIES:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# data is assumed to be the DataFrame already loaded from the dataset
# (the loading step is not shown in this report).
# Create the dependent and independent variables
y = data['Admitted']
x1 = data['SAT']
X = x1.values.reshape(-1, 1)
# Create an instance of LogisticRegression model
logreg = LogisticRegression()
# Fit the model to the training data
logreg.fit(X, y)
X_test = np.array([[1600]]) # Your test data
# Predict the admission for the test data
y_pred = logreg.predict(X_test)
# Calculate the accuracy
accuracy = accuracy_score(y, logreg.predict(X))
# Print the predicted admission and accuracy
print("Predicted admission:", y_pred[0])
print("Accuracy:", accuracy)
plt.scatter(x1,y, color='C0')
# Don't forget to label your axes!
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('Admitted', fontsize = 20)
plt.show()
# Fit the same model with statsmodels to plot the logistic curve
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()
def f(x,b0,b1):
    return np.array(np.exp(b0+x*b1) / (1 + np.exp(b0+x*b1)))
f_sorted = np.sort(f(x1,results_log.params.iloc[0],results_log.params.iloc[1]))
x_sorted = np.sort(np.array(x1))
plt.scatter(x1,y,color='C0')
plt.xlabel('SAT', fontsize = 20)
plt.ylabel('Admitted', fontsize = 20)
plt.plot(x_sorted,f_sorted,color='C8')
plt.show()
OUTPUT:
PLOT:
INFERENCE:
We have implemented logistic regression both from scratch and with built-in functions. The
accuracies of the from-scratch model and the built-in model differ only slightly, which
indicates that the two implementations agree closely.
PROBLEM STATEMENT:
The task is to implement the Gaussian Naive Bayes classifier both from scratch and with
built-in library functions in Python, using NumPy and pandas.
PROBLEM ANALYSIS:
The code begins by importing the necessary libraries: pandas, numpy, and scikit-learn's
train_test_split, GaussianNB, and accuracy_score. The code uses the pandas library to read
the CSV file 'fraud_oracle.csv' and store it in a DataFrame called 'df'. The code drops the
'PolicyNumber' and 'RepNumber' columns from the DataFrame using the drop() method. The
code selects four features ('DriverRating', 'Deductible', 'Age', 'WeekOfMonth') as input
features (X) for the model and assigns the 'FraudFound_P' column as the target variable (y).
The code splits the dataset into training and testing sets using the train_test_split()
function from scikit-learn. It assigns 80% of the data for training (X_train, y_train) and 20%
for testing (X_test, y_test). The random_state parameter is set to 42 for reproducibility. The
code prints the accuracy score obtained from the accuracy_score() function.
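The idea behind the Gaussian variant can be sketched without any library: each class stores a mean and standard deviation per feature, and a new row is given to the class with the highest log-posterior. The data and the function names below are illustrative assumptions and do not come from 'fraud_oracle.csv'.

```python
import math
from collections import defaultdict


def gaussian_pdf(x, mean, stdev):
    """Gaussian probability density of x for the given mean and stdev."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)


def fit_gaussian_nb(X, y):
    """Return {label: (prior, [(mean, stdev) per feature])}."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            # The small constant avoids a zero standard deviation
            stats.append((mean, math.sqrt(var) + 1e-9))
        model[label] = (len(rows) / len(X), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the label with the highest log-posterior for one row."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, stdev) in zip(row, stats):
            # The tiny constant guards against log(0) after underflow
            score += math.log(gaussian_pdf(value, mean, stdev) + 1e-300)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-feature training data with two classes
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.5, 5.8], [4.8, 6.3]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [5.1, 6.1]))
```

Working in log space keeps the product of many small likelihoods from underflowing, which is why the scores are summed rather than multiplied.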
SAMPLE DATASET:
if row[col] in likelihood_nonfraud[col]:
    likelihood_nf = likelihood_nonfraud[col][row[col]]
else:
    likelihood_nf = 0
posterior_fraud *= likelihood_f
posterior_nonfraud *= likelihood_nf
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion)
OUTPUT:
OUTPUT:
INFERENCE:
We have implemented the Gaussian Naive Bayes classifier both from scratch and with built-in
functions. The accuracies of the from-scratch classifier and the built-in classifier are found
to be the same, which indicates that the two implementations agree.
PROBLEM STATEMENT:
The task is to implement the Bernoulli Naive Bayes classifier both from scratch and with
built-in library functions in Python, using NumPy and pandas.
PROBLEM ANALYSIS:
The code starts by importing the necessary libraries: NumPy, Pandas, and scikit-learn
modules. The Titanic dataset is read from a CSV file using the pd.read_csv() function and
stored in the dataset variable. The categorical variables in the dataset are encoded using the
LabelEncoder from scikit-learn. The apply() function is used to apply the encoding to
object-type columns, while numerical columns are left unchanged. The encoded features are
stored in the X_encoded variable. The target variable is also encoded using the LabelEncoder,
and the encoded labels are stored in the y_encoded variable. The dataset is split into training
and testing sets using the train_test_split() function from scikit-learn. The training set
consists of 75% of the data, while the testing set contains the remaining 25%. The random
state is set to 42 for reproducibility. An instance of the BernoulliNB class is created and
assigned to the bnb variable. The model is then trained on the training data using the fit()
method. The accuracy of the model is calculated by comparing the predicted labels (y_pred)
with the actual labels from the testing set (y_test) using the accuracy_score() function. The
accuracy score is printed to the console.
SAMPLE DATASET:
OUTPUT:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
dataset = pd.read_csv("/kaggle/input/bernoulli-naive-bayes/titanic_prediction.csv")
label_encoder = LabelEncoder()
X_encoded = dataset.iloc[:, :-1].apply(lambda x: label_encoder.fit_transform(x) if x.dtype ==
"object" else x)
y_encoded = label_encoder.fit_transform(dataset.iloc[:, -1])
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.25,
random_state=42)
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred = bnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
OUTPUT:
INFERENCE:
We have implemented the Bernoulli Naive Bayes classifier both from scratch and with built-in
functions, and computed the accuracy of each in Python using the NumPy and pandas
libraries.
PROBLEM STATEMENT:
The task is to implement the Multinomial Naive Bayes classifier both from scratch and with
built-in library functions in Python, using NumPy and pandas.
PROBLEM ANALYSIS:
The necessary libraries are imported, including pandas, which is used for data manipulation,
and various modules from scikit-learn for the machine learning tasks. The code reads a CSV
file containing heart attack data and assigns it to the variable data. It then separates the
features (X) from the target variable (y). The dataset is split into training and testing sets
using the train_test_split function from scikit-learn. The testing set size is set to 20% of the
data, and a random state of 42 is used for reproducibility. A Multinomial Naive Bayes
classifier is instantiated using MultinomialNB() and trained on the training set using the fit
method. The trained classifier is used to make predictions on the test set (X_test) using the
predict method. The accuracy of the classifier is calculated by comparing the predicted
values (y_pred) with the actual target values (y_test) using the accuracy_score function. The
accuracy is then printed.
SAMPLE DATASET:
import math
# Function to calculate the probability of a value given a mean and standard deviation
def calculate_probability(value, mean, stdev):
    exponent = math.exp(-((value - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
# Calculate the mean and standard deviation for each feature by class
# (fragment: the enclosing function definition is not shown in this report)
    summaries = {}
    for target, features in separated.items():
        summaries[target] = []
        for i in range(len(features[0])):
            values = [row[i] for row in features]
            mean = sum(values) / len(values)
            stdev = math.sqrt(sum([(x - mean) ** 2 for x in values]) / len(values))
            summaries[target].append((mean, stdev))
    return summaries
# (fragment: end of the prediction function, whose start is not shown in this report)
        predictions.append(best_class)
    return predictions
# Calculate accuracy
correct_predictions = sum(1 for pred, true in zip(predictions, y_test) if pred == true)
accuracy = correct_predictions / len(y_test)
print("Accuracy:", accuracy)
OUTPUT:
CODE 2 - USING LIBRARIES:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
data = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
data.head()
X=data.iloc[:,:-1]
y=data.iloc[:,-1]
OUTPUT:
INFERENCE:
We have implemented the Multinomial Naive Bayes classifier both from scratch and with
built-in functions, and computed the accuracy of each in Python using the NumPy and pandas
libraries.
PROBLEM STATEMENT:
The task is to implement the KNN classifier both from scratch and with built-in library
functions in Python, using NumPy and pandas.
SAMPLE DATASET:
CODE 1 - FROM SCRATCH:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self.X_train = X
        self.y_train = y
    def euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))
    def predict(self, X):
        y_pred = []
        for x_test in X:
            distances = []
            for x_train in self.X_train:
                dist = self.euclidean_distance(x_test, x_train)
                distances.append(dist)
            # Indices of the k closest training points
            indices = np.argsort(distances)[:self.k]
            k_nearest_labels = self.y_train[indices]
            # Majority vote among the k nearest labels
            unique_labels, counts = np.unique(k_nearest_labels, return_counts=True)
            predicted_label = unique_labels[np.argmax(counts)]
            y_pred.append(predicted_label)
        return np.array(y_pred)
data=pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
data.head()
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# K-NN classification
# The value of k is missing in the source; the class default of 3 is assumed here
knn = KNNClassifier(k=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Calculate accuracy with scikit-learn
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Calculate the same accuracy by hand
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)
OUTPUT:
OUTPUT:
INFERENCE:
We have implemented the KNN classifier both from scratch and with built-in functions, using
the NumPy and pandas libraries. The plots produced by both methods let us identify the
neighbours.
PROBLEM STATEMENT:
The problem at hand is to cluster a given dataset into k distinct groups using the K-Means
Clustering algorithm. The dataset consists of various data points/features, and the goal is to
identify natural groupings or patterns within the data. The number of clusters, 'k', needs to be
determined based on the nature of the data or domain knowledge.
PROBLEM ANALYSIS:
The K-Means Clustering model is developed using a programming language or a machine
learning library that supports the algorithm. The implementation covers the necessary steps:
initializing centroids, assigning data points to clusters, and updating centroids iteratively.
The performance of the clustering model is then evaluated, using metrics such as the
silhouette score or the average distance between data points and their cluster centroids, to
assess the quality and coherence of the obtained clusters. Finally, the clusters are visualized
to gain insights and interpret the results effectively. Plotting the data points and their
respective clusters helps in understanding the structure and patterns within the dataset.
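The three steps named in the analysis (initializing centroids, assigning points, updating centroids) can be sketched in plain Python as follows. This is a minimal illustration on made-up points; the function name k_means and its parameters are assumptions and it is not the KMeansClustering class used in this exercise.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise centroids from the data
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Hypothetical 2-D points forming two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

The loop stops as soon as an iteration leaves every assignment unchanged, which is the usual convergence test for this algorithm.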
kmeans = KMeansClustering(num_clusters)
clusters, centroids = kmeans.fit(X)
predictions = kmeans.predict(X)
kmeans.plot_clusters(X, clusters)
OUTPUT:
CODE 2 - USING LIBRARIES:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")
# Generate sample data
np.random.seed(10)
num_clusters = 3
X, _ = make_blobs(n_samples=1000, n_features=2, centers=num_clusters)
# Perform K-means clustering using scikit-learn
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(X)
# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Plot the clusters and centroids
colors = ['r', 'g', 'b', 'c', 'm', 'y']
for i in range(num_clusters):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], c=colors[i])
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', color='k', s=100)
plt.show()
OUTPUT:
INFERENCE:
We have implemented K-Means Clustering both from scratch and with built-in functions, using
the NumPy and pandas libraries. The plots produced by both methods let us identify the
clusters.
PROBLEM STATEMENT:
The task is to implement the hierarchical clustering model both from scratch and with
built-in library functions in Python, using NumPy and pandas.
PROBLEM ANALYSIS:
The code imports the required libraries, including numpy, scikit-learn's
AgglomerativeClustering and make_blobs functions, and matplotlib for plotting. The code
generates a synthetic dataset using the make_blobs function. It creates 50 samples distributed
among 3 clusters, with a specified standard deviation. An instance of AgglomerativeClustering
is created with the desired number of clusters (3 in this case). The fit method is then called
on the clustering object, which performs the clustering on the given dataset. The scatter
function from matplotlib is used to create a scatter plot of the data points. Each point is
colored based on its assigned cluster label. The xlabel, ylabel, and title functions set the
labels and title for the plot. The centroids of each cluster are computed by taking the mean of
the points in that cluster. For each unique label in the clustering labels, the code retrieves
the points belonging to that cluster and calculates the centroid. These centroids are then
plotted on the scatter plot as red crosses. The show function is called to display the plot with
the clustered data and marked centroids.
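A from-scratch view of the same idea is sketched below: every point starts as its own cluster, the two closest clusters are merged repeatedly, and the centroid of each final cluster is the mean of its points. Single linkage is used here as a simplifying assumption (it is not the default linkage of scikit-learn's AgglomerativeClustering), and the points and function names are made up for illustration.

```python
def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage; returns one label per point."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def cluster_centroids(points, labels):
    """Mean of the points in each cluster, keyed by label."""
    result = {}
    for label in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == label]
        result[label] = tuple(sum(col) / len(members) for col in zip(*members))
    return result


# Hypothetical 2-D points forming three groups
points = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.4),
          (5.0, 5.0), (5.3, 4.8), (4.9, 5.4),
          (10.0, 0.0), (10.2, 0.3), (9.8, -0.2)]

labels = agglomerative(points, n_clusters=3)
centroids = cluster_centroids(points, labels)
print(labels)
print(centroids)
```

This pairwise search is slow for large datasets, so it is meant only to show the merging logic on a handful of points.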
OUTPUT:
OUTPUT:
INFERENCE:
We have implemented hierarchical clustering both from scratch and with built-in functions,
using the NumPy and pandas libraries. The plots produced by both methods let us identify the
clusters.
PROBLEM STATEMENT:
The task is to implement principal component analysis (PCA) both from scratch and with
built-in library functions in Python, using NumPy and pandas.
SAMPLE DATASET:
data = pd.read_csv('/kaggle/input/iris-flower-dataset/IRIS.csv')
data.head()
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
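class PCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
    # (fragment of the fit method; the rest is not shown in this report)
        # Perform eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
    # (fragment of the transform method; the rest is not shown in this report)
        return transformed_data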
pca = PCA(n_components=2)
pca.fit(X)
transformed_data = pca.transform(X)
print("Original data shape:", X.shape)
print("Transformed data shape:", transformed_data.shape)
species_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
color_labels = [species_map[label] for label in y]
OUTPUT:
PLOT 1:
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=color_labels)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data')
PLOT 2:
plt.subplot(1, 2, 2)
plt.scatter(transformed_data[:, 0], transformed_data[:, 1], c=color_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Transformed Data')
plt.tight_layout()
plt.show()
OUTPUT:
PLOT:
plt.subplot(1, 2, 2)
plt.scatter(pca_samples[:, 0], pca_samples[:, 1], c=color_labels)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Transformed Data')
plt.tight_layout()
plt.show()
INFERENCE:
We have implemented principal component analysis both from scratch and with built-in
functions, using the NumPy and pandas libraries. The plots produced by both methods show
how the samples separate along the first two principal components.
PROBLEM STATEMENT:
The task is to implement the decision tree classifier both from scratch and with built-in
library functions in Python, using NumPy and pandas.
SAMPLE DATASET:
# Convert the diagnosis column to numeric values (0 for benign, 1 for malignant)
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
    def build_tree(self):
        Xy = np.concatenate((self.X, self.y[:, np.newaxis]), axis=1)
        return self.recursive_build(Xy)
    # The def line of recursive_build is missing in this report; its signature is
    # inferred from the call above
    def recursive_build(self, Xy):
        # Stop when every remaining sample has the same class
        if np.unique(Xy[:, -1]).size == 1:
            return self.get_leaf_node(Xy)
        best_split = self.get_best_split(Xy)
        if best_split is None:
            return self.get_leaf_node(Xy)
        left_child = self.recursive_build(best_split['left'])
        right_child = self.recursive_build(best_split['right'])
        return {
            'feature_index': best_split['feature_index'],
            'threshold': best_split['threshold'],
            'left': left_child,
            'right': right_child
        }
left_size = left.shape[0]
right_size = right.shape[0]
total_size = left_size + right_size
OUTPUT:
OUTPUT:
INFERENCE:
We have implemented the decision tree classifier both from scratch and with built-in
functions, using the NumPy and pandas libraries, and the accuracy of each method is displayed.
EX NO 15 RANDOM FOREST
DATE: 08.06.2023
PROBLEM STATEMENT:
The task is to implement the random forest classifier both from scratch and with built-in
library functions in Python, using NumPy and pandas.
SAMPLE DATASET:
class Randomforestclassifier:
    def __init__(self, num_trees=100, max_features=None, max_depth=None):
        self.num_trees = num_trees
        self.max_features = max_features
        self.max_depth = max_depth
        self.trees = []
    # (fragment of the fit method; the rest is not shown in this report)
        for _ in range(self.num_trees):
            # Randomly select a subset of features
            if self.max_features:
                selected_features = np.random.choice(num_features, self.max_features,
                                                     replace=False)
                X_subset = X[:, selected_features]
            else:
                X_subset = X
class DecisionTreeClassifier:
    def __init__(self, max_depth=None):
        self.max_depth = max_depth
        self.tree = None
    # (fragments of the remaining methods; the rest is not shown in this report)
        feature_value = sample[node['feature']]
        # Base cases: if all samples have the same class or maximum depth is reached
        if len(np.unique(y)) == 1 or (self.max_depth and depth == self.max_depth):
            return {'class': y[0]}
        # Recursive splitting
        left_indices = np.where(X[:, best_feature] <= best_value)[0]
        right_indices = np.where(X[:, best_feature] > best_value)[0]
        if len(left_indices) == 0 or len(right_indices) == 0:
            return 0
        left_entropy = self.calculate_entropy(y[left_indices])
        right_entropy = self.calculate_entropy(y[right_indices])
# No need to convert x_train and y_train to NumPy arrays if they are already in that format
rf_classifier = Randomforestclassifier(num_trees=100, max_features=3, max_depth=5)
rf_classifier.fit(x_train, y_train)
y_pred = rf_classifier.predict(x_test)
accuracy_score(y_test, y_pred)
OUTPUT:
OUTPUT:
INFERENCE:
We have implemented the random forest classifier both from scratch and with built-in
functions, using the NumPy and pandas libraries, and the accuracy of each method is displayed.
SAMPLE DATASET:
# initialize weights
self.w = np.zeros(n_features)
self.b = 0
for _ in range(self.n_iters):
    for idx, x_i in enumerate(X):
        # True when the sample lies on the correct side of the margin
        condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1
        if condition:
            self.w -= self.lr * (2 * self.lambda_param * self.w)
        else:
            self.w -= self.lr * (2 * self.lambda_param * self.w - np.dot(x_i, y_[idx]))
            self.b -= self.lr * y_[idx]
OUTPUT:
OUTPUT:
INFERENCE:
We have implemented the SVM both from scratch and with built-in functions, using the NumPy
and pandas libraries, and the accuracy of each method is displayed.