ML Lab
MACHINE LEARNING
LABORATORY - 15CSL76
LAB MANUAL
Prepared By:
Mrs. Aruna M G, Associate Professor, Dept. of CSE, MSEC
Y Vishnuvardhan, Assistant Professor, Dept. of CSE, MSEC
2. For a given set of training data examples stored in a .CSV file, implement and demonstrate the
Candidate-Elimination algorithm to output a description of the set of all hypotheses consistent
with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use an
appropriate data set for building the decision tree and apply this knowledge to classify a new
sample.
5. Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering few test data sets.
6. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program. Calculate
the accuracy, precision, and recall for your data set.
7. Write a program to construct a Bayesian network considering medical data. Use this model to
demonstrate the diagnosis of heart patients using standard Heart Disease Data Set. You can use
Java/Python ML library classes/API.
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set for
clustering using k-Means algorithm. Compare the results of these two algorithms and comment
on the quality of clustering. You can add Java/Python ML library classes/API in the program.
9. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions. Java/Python ML library classes can be used for this
problem.
10. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
REFERENCES
Text Books:
1. Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education.
Step 7: Choose whether to add Anaconda to your PATH environment variable. We recommend
not adding Anaconda to the PATH environment variable, since this can interfere with other
software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda
Prompt from the Start Menu.
Step 8: Choose whether to register Anaconda as your default Python. Unless you plan on
installing and running multiple versions of Anaconda, or multiple versions of Python, accept the
default and leave this box checked.
Step 9: Click the Install button. If you want to watch the packages Anaconda is installing, click
Show Details.
Step 11: Optional: To install VS Code, click the Install Microsoft VS Code button. After the
install completes click the Next button.
Step 12: After a successful installation you will see the “Thanks for installing Anaconda” dialog
box:
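Once the installer finishes, it is worth confirming that the environment works before starting the programs below. A minimal check, run from the Anaconda Prompt or a Jupyter notebook (the exact version numbers printed will differ from machine to machine):
import sys
import numpy, pandas, sklearn, matplotlib
print(sys.version)
print(numpy.__version__, pandas.__version__, sklearn.__version__, matplotlib.__version__)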
Introduction
Machine learning
Machine learning is a subset of artificial intelligence in the field of computer science that often
uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve
performance on a specific task) with data, without being explicitly programmed. In the past
decade, machine learning has given us self-driving cars, practical speech recognition, effective
web search, and a vastly improved understanding of the human genome.
Machine learning tasks are typically classified into several broad categories, depending on whether
there is a learning "signal" or "feedback" available to the learning system:
1. Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
As special cases, the input signal can be only partially available, or restricted to special feedback:
2. Semi-supervised learning: The computer is given only an incomplete training signal: a training set with some (often many) of the target outputs missing.
3. Active learning: the computer can only obtain training labels for a limited set of instances
(based on a budget), and also has to optimize its choice of objects to acquire labels for. When
used interactively, these can be presented to the user for labeling.
4. Reinforcement learning: training data (in form of rewards and punishments) is given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing
a game against an opponent.
5. Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to
find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
In classification, inputs are divided into two or more classes, and the learner must produce a
model that assigns unseen inputs to one or more (multi-label classification) of these classes. This
is typically tackled in a supervised manner. Spam filtering is an example of classification, where
the inputs are email (or other) messages and the classes are "spam" and "not spam".
In regression, also a supervised problem, the outputs are continuous rather than discrete.
In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are
not known beforehand, making this typically an unsupervised task.
Dimensionality reduction simplifies inputs by mapping them into a lower dimensional space.
Topic modeling is a related problem, where a program is given a list of human language
documents and is tasked with finding out which documents cover similar topics.
4. Deep learning
Falling hardware prices and the development of GPUs for personal use in the last few years have
contributed to the development of the concept of deep learning which consists of multiple hidden
layers in an artificial neural network. This approach tries to model the way the human brain
processes light and sound into vision and hearing. Some successful applications of deep learning
are computer vision and speech recognition.
6. Support vector machines
Support vector machines (SVMs) are supervised learning methods used for classification and
regression. Given a set of training examples, each marked as belonging to one of two categories,
an SVM training algorithm builds a model that predicts whether a new example falls into one
category or the other.
7. Clustering
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to some pre designated criterion or
criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated for example by internal compactness (similarity between
members of the same cluster) and separation between different clusters. Other methods are based
on estimated density and graph connectivity. Clustering is a method of unsupervised learning,
and a common technique for statistical data analysis.
8. Bayesian networks
A Bayesian network, belief network or directed acyclic graphical model is a probabilistic
graphical model that represents a set of random variables and their conditional independencies
via a directed acyclic graph (DAG). For example, a Bayesian network could represent the
probabilistic relationships between diseases and symptoms. Given symptoms, the network can be
used to compute the probabilities of the presence of various diseases. Efficient algorithms exist
that perform inference and learning.
9. Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an environment
so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt
to find a policy that maps states of the world to the actions the agent ought to take in those states.
Reinforcement learning differs from the supervised learning problem in that correct input/output
pairs are never presented, nor sub-optimal actions explicitly corrected.
10. Similarity and metric learning
In this problem, the learning machine is given pairs of examples that are considered similar and
pairs of less similar objects. It then needs to learn a similarity function (or a distance metric
function) that can predict if new objects are similar. It is sometimes used in Recommendation
systems.
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data from a
.CSV file.
Find-S Algorithm:
1. Load Data set
2. Initialize h to the most specific hypothesis in H
3. For each positive training instance x
• For each attribute constraint ai in h
If the constraint ai in h is satisfied by x then do nothing
else replace ai in h by the next more general constraint that is satisfied by x
4. Output hypothesis h
Input:
finds.csv
Sky Airtemp Humidity Wind Water Forecast Enjoysport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rain Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
4. Output hypothesis h
['Sunny', 'Warm', '?', 'Strong', '?', '?']
Source Code:
import random
import csv
def read_data(filename):
    with open(filename, 'r') as csvfile:
        datareader = csv.reader(csvfile, delimiter=',')
        traindata = []
        for row in datareader:
            traindata.append(row)
    return traindata
h=['phi','phi','phi','phi','phi','phi']
data=read_data('finds.csv')
def isConsistent(h,d):
    if len(h)!=len(d)-1:
        print('Number of attributes are not the same in the hypothesis.')
        return False
    else:
        matched=0
        for i in range(len(h)):
            if ( (h[i]==d[i]) | (h[i]=='any') ):
                matched=matched+1
        if matched==len(h):
            return True
        else:
            return False
def makeConsistent(h,d):
    for i in range(len(h)):
        if((h[i] == 'phi')):
            h[i]=d[i]
        elif(h[i]!=d[i]):
            h[i]='any'
    return h
print('Begin : Hypothesis :',h)
print('==========================================')
for d in data:
    if d[len(d)-1]=='Yes':
        if ( isConsistent(h,d)):
            pass
        else:
            h=makeConsistent(h,d)
        print ('Training data :',d)
        print ('Updated Hypothesis :',h)
        print()
        print('--------------------------------')
print('==========================================')
print('maximally specific data set End: Hypothesis :',h)
Output:
Begin : Hypothesis : ['phi', 'phi', 'phi', 'phi', 'phi', 'phi']
==========================================
Training data : ['Cloudy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'Yes']
Updated Hypothesis : ['Cloudy', 'Cold', 'High', 'Strong', 'Warm', 'Change']
--------------------------------
Training data : ['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', 'Yes']
Updated Hypothesis : ['any', 'any', 'any', 'Strong', 'Warm', 'any']
--------------------------------
Training data : ['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', 'Yes']
Updated Hypothesis : ['any', 'any', 'any', 'Strong', 'Warm', 'any']
--------------------------------
Training data : ['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', 'Yes']
Updated Hypothesis : ['any', 'any', 'any', 'Strong', 'any', 'any']
--------------------------------
Training data : ['Overcast', 'Cool', 'Normal', 'Strong', 'Warm', 'Same', 'Yes']
Updated Hypothesis : ['any', 'any', 'any', 'Strong', 'any', 'any']
--------------------------------
==========================================
maximally specific data set End: Hypothesis : ['any', 'any', 'any', 'Strong', 'any', 'any']
OR
import csv
def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = dataset[i]
    return dataset
attributes = ['Sky','Temp','Humidity','Wind','Water','Forecast']
print('Attributes =',attributes)
num_attributes = len(attributes)
filename = "finds.csv"
dataset = loadCsv(filename)
print(dataset)
hypothesis=['0'] * num_attributes
print("Initial Hypothesis")
print(hypothesis)
print("The Hypothesis are")
for i in range(len(dataset)):
    target = dataset[i][-1]
    if(target == 'Yes'):
        for j in range(num_attributes):
            if(hypothesis[j]=='0'):
                hypothesis[j] = dataset[i][j]
            if(hypothesis[j]!= dataset[i][j]):
                hypothesis[j]='?'
        print(i+1,'=',hypothesis)
print("Final Hypothesis")
print(hypothesis)
Output:
Attributes = ['Sky', 'Temp', 'Humidity', 'Wind', 'Water', 'Forecast']
[['sky', 'Airtemp', 'Humidity', 'Wind', 'Water', 'Forecast', 'WaterSport'],
['Cloudy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'Yes'],
['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', 'Yes'],
['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', 'Yes'],
['Cloudy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'No'],
['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Cool', 'Change', 'No'],
['Rain', 'Cool', 'Normal', 'Weak', 'Cool', 'Same', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Warm', 'Same', 'Yes']]
Initial Hypothesis
['0', '0', '0', '0', '0', '0']
The Hypothesis are
2 = ['Cloudy', 'Cold', 'High', 'Strong', 'Warm', 'Change']
3 = ['?', '?', '?', 'Strong', 'Warm', '?']
4 = ['?', '?', '?', 'Strong', 'Warm', '?']
6 = ['?', '?', '?', 'Strong', '?', '?']
9 = ['?', '?', '?', 'Strong', '?', '?']
Final Hypothesis
['?', '?', '?', 'Strong', '?', '?']
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of all
hypotheses consistent with the training examples.
Candidate-Elimination Algorithm:
1. Load data set
2. G <-maximally general hypotheses in H
3. S <- maximally specific hypotheses in H
4. For each training example d=<x,c(x)>
Case 1 : If d is a positive example
    Remove from G any hypothesis that is inconsistent with d
    For each hypothesis s in S that is not consistent with d
        • Remove s from S.
        • Add to S all minimal generalizations h of s such that
            • h is consistent with d
            • Some member of G is more general than h
        • Remove from S any hypothesis that is more general than another hypothesis in S
Case 2 : If d is a negative example
    Remove from S any hypothesis that is inconsistent with d
    For each hypothesis g in G that is not consistent with d
        • Remove g from G.
        • Add to G all minimal specializations h of g such that
            • h is consistent with d
            • Some member of S is more specific than h
        • Remove from G any hypothesis that is less general than another hypothesis in G
Step 1:
S0: {∅,∅,∅,∅,∅,∅}
G0: {⟨?,?,?,?,?,?⟩}
1. ⟨Sunny,Warm,Normal,Strong,Warm,Same⟩, EnjoySport = yes
S0: {∅,∅,∅,∅,∅,∅}
S1: {⟨Sunny,Warm,Normal,Strong,Warm,Same⟩}
G0, G1: {⟨?,?,?,?,?,?⟩}
2. ⟨Sunny,Warm,High,Strong,Warm,Same⟩, EnjoySport = yes
S0: {∅,∅,∅,∅,∅,∅}
S1: {⟨Sunny,Warm,Normal,Strong,Warm,Same⟩}
S2: {⟨Sunny,Warm,?,Strong,Warm,Same⟩}
G0, G1, G2: {⟨?,?,?,?,?,?⟩}
3. ⟨Rainy,Cold,High,Strong,Warm,Change⟩, EnjoySport = no
S0: {∅,∅,∅,∅,∅,∅}
S1: {⟨Sunny,Warm,Normal,Strong,Warm,Same⟩}
S2, S3: {⟨Sunny,Warm,?,Strong,Warm,Same⟩}
G3: {⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩, ⟨?,?,?,?,?,Same⟩}
G0, G1, G2: {⟨?,?,?,?,?,?⟩}
4. ⟨Sunny,Warm,High,Strong,Cool,Change⟩, EnjoySport = yes
S0: {∅,∅,∅,∅,∅,∅}
S1: {⟨Sunny,Warm,Normal,Strong,Warm,Same⟩}
S2, S3: {⟨Sunny,Warm,?,Strong,Warm,Same⟩}
S4: {⟨Sunny,Warm,?,Strong,?,?⟩}
G4: {⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩}
G3: {⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩, ⟨?,?,?,?,?,Same⟩}
G0, G1, G2: {⟨?,?,?,?,?,?⟩}
Source Code:
import numpy as np
import pandas as pd
data = pd.DataFrame(data=pd.read_csv('finds1.csv'))
concepts = np.array(data.iloc[:,0:-1])
target = np.array(data.iloc[:,-1])
def learn(concepts, target):
    specific_h = concepts[0].copy()
    print("initialization of specific_h and general_h")
    print(specific_h)
    general_h = [["?" for i in range(len(specific_h))] for i in range(len(specific_h))]
    print(general_h)
    for i, h in enumerate(concepts):
        if target[i] == "Yes":
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    specific_h[x] = '?'
                    general_h[x][x] = '?'
        if target[i] == "No":
            for x in range(len(specific_h)):
                if h[x] != specific_h[x]:
                    general_h[x][x] = specific_h[x]
                else:
                    general_h[x][x] = '?'
        print(" steps of Candidate Elimination Algorithm",i+1)
        print("Specific_h ",i+1,"\n ")
        print(specific_h)
        print("general_h ", i+1, "\n ")
        print(general_h)
    indices = [i for i, val in enumerate(general_h) if val == ['?', '?', '?', '?', '?', '?']]
    for i in indices:
        general_h.remove(['?', '?', '?', '?', '?', '?'])
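    # The listing above stops before the learned boundaries are returned. The completion
    # below is a minimal sketch so that the "Final Specific_h" / "Final General_h" lines
    # shown in the OUTPUT section are actually printed.
    return specific_h, general_h
s_final, g_final = learn(concepts, target)
print("Final Specific_h:")
print(s_final)
print("Final General_h:")
print(g_final)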
INPUT:
Sky Airtemp Humidity Wind Water Forecast Enjoysport
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rain Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
OUTPUT
initialization of specific_h and general_h
['Cloudy' 'Cold' 'High' 'Strong' 'Warm' 'Change']
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?',
'?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
steps of Candidate Elimination Algorithm 8
Specific_h 8
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', 'Strong', '?', '?'], ['?',
'?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?']]
Final Specific_h:
['?' '?' '?' 'Strong' '?' '?']
Final General_h:
[['?', '?', '?', 'Strong', '?', '?']]
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
[Figure: sample decision tree with root node Outlook; its branches lead to the leaf labels no / yes / yes / no]
Source Code:
import numpy as np
import math
from data_loader import read_data
class Node:
    def __init__(self, attribute):
        self.attribute = attribute
        self.children = []
        self.answer = ""
    def __str__(self):
        return self.attribute
def subtables(data, col, delete):
    dict = {}
    items = np.unique(data[:, col])
    count = np.zeros((items.shape[0], 1), dtype=np.int32)
    for x in range(items.shape[0]):
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                count[x] += 1
    for x in range(items.shape[0]):
        dict[items[x]] = np.empty((int(count[x]), data.shape[1]), dtype="|S32")
        pos = 0
        for y in range(data.shape[0]):
            if data[y, col] == items[x]:
                dict[items[x]][pos] = data[y]
                pos += 1
        if delete:
            dict[items[x]] = np.delete(dict[items[x]], col, 1)
    return items, dict
def entropy(S):
    items = np.unique(S)
    if items.size == 1:
        return 0
    counts = np.zeros((items.shape[0], 1))
    sums = 0
    for x in range(items.shape[0]):
        counts[x] = sum(S == items[x]) / (S.size * 1.0)
    for count in counts:
        sums += -1 * count * math.log(count, 2)
    return sums
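# The manual's listing omits the remaining helpers (information gain, tree construction,
# tree printing) and the data_loader module. The sketch below is one consistent completion
# of the Node/subtables/entropy code above; the CSV filename at the end is an assumption.
def gain_ratio(data, col):
    items, dict = subtables(data, col, delete=False)
    total_size = data.shape[0]
    entropies = np.zeros((items.shape[0], 1))
    for x in range(items.shape[0]):
        ratio = dict[items[x]].shape[0] / (total_size * 1.0)
        entropies[x] = ratio * entropy(dict[items[x]][:, -1])
    total_entropy = entropy(data[:, -1])
    for x in range(entropies.shape[0]):
        total_entropy -= entropies[x]
    return total_entropy
def create_node(data, metadata):
    # stop when every remaining example carries the same class label
    if (np.unique(data[:, -1])).shape[0] == 1:
        node = Node("")
        node.answer = np.unique(data[:, -1])[0]
        return node
    gains = np.zeros((data.shape[1] - 1, 1))
    for col in range(data.shape[1] - 1):
        gains[col] = gain_ratio(data, col)
    split = np.argmax(gains)                      # attribute with the highest gain
    node = Node(metadata[split])
    metadata = np.delete(metadata, split, 0)
    items, dict = subtables(data, split, delete=True)
    for x in range(items.shape[0]):
        child = create_node(dict[items[x]], metadata)
        node.children.append((items[x], child))
    return node
def print_tree(node, level):
    if node.answer != "":
        print("  " * level, node.answer)
        return
    print("  " * level, node.attribute)
    for value, child in node.children:
        print("  " * (level + 1), value)
        print_tree(child, level + 2)
# data_loader.read_data is assumed to return (metadata, traindata), i.e. the header row
# and the remaining rows of the CSV file; the filename below is only an example.
metadata, traindata = read_data("tennisdata.csv")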
data = np.array(traindata)
node = create_node(data, metadata)
print_tree(node, 0)
INPUT:
outlook,temperature,humidity,wind,answer
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no
OUTPUT:
outlook
overcast
b'yes'
rain
wind
b'strong'
b'no'
b'weak'
b'yes'
sunny
humidity
b'high'
b'no'
b'normal'
b'yes'
OR
import pandas as pd
import numpy as np
dataset = pd.read_csv('playtennis.csv', names=['outlook','temperature','humidity','wind','class'])
def entropy(target_col):
    elements, counts = np.unique(target_col, return_counts=True)
    entropy = np.sum([(-counts[i]/np.sum(counts))*np.log2(counts[i]/np.sum(counts)) for i in range(len(elements))])
    return entropy
def InfoGain(data, split_attribute_name, target_name="class"):
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    Weighted_Entropy = np.sum([(counts[i]/np.sum(counts))*entropy(data.where(data[split_attribute_name]==vals[i]).dropna()[target_name]) for i in range(len(vals))])
    Information_Gain = total_entropy - Weighted_Entropy
    return Information_Gain
def ID3(data, originaldata, features, target_attribute_name="class", parent_node_class=None):
    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]
    elif len(data) == 0:
        return np.unique(originaldata[target_attribute_name])[np.argmax(np.unique(originaldata[target_attribute_name], return_counts=True)[1])]
    elif len(features) == 0:
        return parent_node_class
    else:
        parent_node_class = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name], return_counts=True)[1])]
        item_values = [InfoGain(data, feature, target_attribute_name) for feature in features]   # information gain of each candidate feature
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]
        tree = {best_feature: {}}
        features = [i for i in features if i != best_feature]
        for value in np.unique(data[best_feature]):
            value = value
            sub_data = data.where(data[best_feature] == value).dropna()
            subtree = ID3(sub_data, dataset, features, target_attribute_name, parent_node_class)
            tree[best_feature][value] = subtree
        return (tree)
tree = ID3(dataset, dataset, dataset.columns[:-1])
print(' \nDisplay Tree\n', tree)
print(' \nDisplay Tree\n',tree)
INPUT:
outlook,temperature,humidity,wind,answer
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no
OUTPUT:
Display Tree
{'outlook': {'Overcast': 'Yes', 'Rain': {'wind': {'Strong': 'No', 'Weak': 'Yes'}}, 'Sunny':
{'humidity': {'High': 'No', 'Normal': 'Yes'}}}}
4. Build an Artificial Neural Network by implementing the Backpropagation algorithm and
test the same using appropriate data sets.
Source Code:
import numpy as np
#Sigmoid Function
def sigmoid(x):
    return 1/(1 + np.exp(-x))
#Derivative of Sigmoid Function
def derivatives_sigmoid(x):
    return x * (1 - x)
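# The training data for this program is not listed in the manual. The arrays below are
# reconstructed from the "Input" / "Actual Output" values printed in the OUTPUT section:
# two input features and one target per sample, normalised before training.
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X / np.amax(X, axis=0)   # scale each feature column to [0, 1]
y = y / 100                  # scale the target to [0, 1]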
#Variable initialization
epoch=7000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = 2 #number of features in data set
hiddenlayer_neurons = 3 #number of hidden layers neurons
output_neurons = 1 #number of neurons at output layer
#weight and bias initialization
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))
# draws a random range of numbers uniformly of dim x*y
#Forward Propagation
for i in range(epoch):
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    #Backpropagation
    EO = y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)   # how much hidden layer weights contributed to error
    d_hiddenlayer = EH * hiddengrad
    wout += hlayer_act.T.dot(d_output) * lr        # dot product of next-layer error and current-layer output
    bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hiddenlayer) * lr
    bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr
print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
INPUT:
Output:
Input:
[[ 0.66666667 1. ]
[ 0.33333333 0.55555556]
[ 1. 0.66666667]]
Actual Output:
[[ 0.92]
[ 0.86]
[ 0.89]]
Predicted Output:
[[ 0.89559591]
[ 0.88142069]
[ 0.8928407 ]]
5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering few test data
sets.
Problem statement:
– Given features X1 ,X2 ,…,Xn
– Predict a label Y
X = (Rainy, Hot, High, False)
y = No
Or
P(H) is the probability of hypothesis H being true. This is known as the prior probability.
P(E) is the probability of the evidence(regardless of the hypothesis).
P(E|H) is the probability of the evidence given that hypothesis is true.
P(H|E) is the probability of the hypothesis given that the evidence is there.
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of prediction.
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different class based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
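The posterior worked out above can be checked with a couple of lines of Python; the three probabilities are the values read off the play-tennis frequency table in the example.
p_sunny_given_yes = 3/9    # P(Sunny | Yes)
p_sunny = 5/14             # P(Sunny)
p_yes = 9/14               # P(Yes)
print(round(p_sunny_given_yes * p_yes / p_sunny, 2))   # prints 0.6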
Source Code:
import csv
import random
import math
def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
def mean(numbers):
    return sum(numbers)/float(len(numbers))
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries
def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries
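# The manual's listing jumps from summarizeByClass() straight to main(), so the helpers
# used there (splitDataset, getPredictions, getAccuracy and the Gaussian likelihood) are
# missing. The definitions below are a minimal sketch of those helpers, built on the
# mean/stdev summaries computed above, so the program runs end-to-end.
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))      # pick rows at random for the train split
        trainSet.append(copy.pop(index))
    return [trainSet, copy]
def calculateProbability(x, mean, stdev):
    # Gaussian probability density of x under N(mean, stdev^2)
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        predictions.append(predict(summaries, testSet[i]))
    return predictions
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0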
def main():
    filename = 'data.csv'
    splitRatio = 0.67
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: {0}%'.format(accuracy))
main()
INPUT:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
OUTPUT :
6. Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
The dataset is divided into two parts, namely, feature matrix and the response vector.
Feature matrix contains all the vectors(rows) of dataset in which each vector consists of
the value of dependent features. In above dataset, features are ‘Outlook’, ‘Temperature’,
‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable(prediction or output) for each row of
feature matrix. In above dataset, the class variable name is ‘Play golf’.
Source Code:
import pandas as pd
msg=pd.read_csv('naivetext1.csv',names=['message','label'])
print('The dimensions of the dataset',msg.shape)
msg['labelnum']=msg.label.map({'pos':1,'neg':0})
X=msg.message
y=msg.labelnum
print(X)
print(y)
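# The listing stops after printing the sentences and labels. The lines below are one way
# (an assumption, not necessarily the manual's original code) to finish the program with
# scikit-learn so that the accuracy / confusion matrix / recall / precision shown in the
# Output section can be produced; the exact numbers depend on the random train/test split.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
xtrain, xtest, ytrain, ytest = train_test_split(X, y)
print(xtest.shape)
print(xtrain.shape)
count_vect = CountVectorizer()
xtrain_dtm = count_vect.fit_transform(xtrain)   # bag-of-words matrix for the training text
xtest_dtm = count_vect.transform(xtest)
clf = MultinomialNB().fit(xtrain_dtm, ytrain)
predicted = clf.predict(xtest_dtm)
print('Accuracy metrics')
print('Accuracy of the classifier is', metrics.accuracy_score(ytest, predicted))
print('Confusion matrix')
print(metrics.confusion_matrix(ytest, predicted))
print('Recall and Precision')
print(metrics.recall_score(ytest, predicted))
print(metrics.precision_score(ytest, predicted))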
INPUT:
I love this sandwich,pos
This is an amazing place,pos
I feel very good about these beers,pos
This is my best work,pos
What an awesome view,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
I can't deal with this,neg
He is my sworn enemy,neg
My boss is horrible,neg
Output:
16 1
17 0
Name: labelnum, dtype: int64
(5,)
(13,)
(5,)
(13,)
Accuracy metrics
Accuracy of the classifier is 0.8
Confusion matrix
[[3 1]
[0 1]]
Recall and Precision
1.0
0.5
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
Attribute Information:
-- Only 14 used
-- 1. #3 (age)
-- 2. #4 (sex)
-- 3. #9 (cp)
-- 4. #10 (trestbps)
-- 5. #12 (chol)
-- 6. #16 (fbs)
-- 7. #19 (restecg)
-- 8. #32 (thalach)
-- 9. #38 (exang)
-- 10. #40 (oldpeak)
-- 11. #41 (slope)
-- 12. #44 (ca)
-- 13. #51 (thal)
-- 14. #58 (num)
Source Code:
import numpy as np
from urllib.request import urlopen
import urllib
import pandas as pd
from pgmpy.inference import VariableElimination
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca',
'thal', 'heartdisease']
heartDisease = pd.read_csv('heart.csv', names = names)
heartDisease = heartDisease.replace('?', np.nan)
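# The manual does not show the network structure passed to BayesianModel. The edge list
# below is one plausible DAG relating the risk factors to 'heartdisease' and is only an
# assumption; any reasonable structure over the listed attributes can be used instead.
model = BayesianModel([('age', 'heartdisease'), ('sex', 'heartdisease'),
                       ('exang', 'heartdisease'), ('cp', 'heartdisease'),
                       ('heartdisease', 'restecg'), ('heartdisease', 'chol')])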
model.fit(heartDisease, estimator=MaximumLikelihoodEstimator)
from pgmpy.inference import VariableElimination
HeartDisease_infer = VariableElimination(model)
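# Example query (the evidence attribute and value are chosen only for illustration):
# estimate P(heartdisease | age = 28). Some pgmpy versions return a dict of factors,
# in which case print(q['heartdisease']) is needed instead of print(q).
q = HeartDisease_infer.query(variables=['heartdisease'], evidence={'age': 28})
print(q)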
INPUT:
age,sex,cp,testbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heartdisease
28,1,2,130,132,0,2,185,0,0,?,?,?,0
29,1,2,120,243,0,0,160,0,0,?,?,?,0
29,1,2,140,?,0,0,170,0,0,?,?,?,0
30,0,1,170,237,0,1,170,0,0,?,?,6,0
31,0,2,100,219,0,1,150,0,0,?,?,?,0
32,0,2,105,198,0,0,165,0,0,?,?,?,0
32,1,2,110,225,0,0,184,0,0,?,?,?,0
32,1,2,125,254,0,0,155,0,0,?,?,?,0
33,1,3,120,298,0,0,185,0,0,?,?,?,0
OUTPUT:
+----------------+---------------------+
| heartdisease   | phi(heartdisease)   |
+================+=====================+
| heartdisease_0 | 0.5593              |
+----------------+---------------------+
| heartdisease_1 | 0.4407              |
+----------------+---------------------+
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Java/Python ML library classes/API in
the program.
K-Means Algorithm
1. Load data set
2. Clusters the data into k groups where k is predefined.
3. Select k points at random as cluster centers.
4. Assign objects to their closest cluster center according to the Euclidean
distance function.
5. Calculate the centroid or mean of all objects in each cluster.
6. Repeat steps 4 and 5 until the same points are assigned to each cluster in consecutive
rounds.
Example:
Suppose we want to group the visitors to a website using just their age (one-dimensional
space) as follows:
n = 19
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
Initial clusters (random centroid or average):
k=2
c1 = 16
c2 = 22
Iteration 1:
c1 = 15.33
c2 = 36.25
xi   c1   c2   Distance 1   Distance 2   Nearest Cluster
15   16   22   1            7            1
15   16   22   1            7            1
16   16   22   0            6            1
19   16   22   9            3            2
19   16   22   9            3            2
20   16   22   16           2            2
20   16   22   16           2            2
21   16   22   25           1            2
22   16   22   36           0            2
28   16   22   12           6            2
35   16   22   19           13           2
40   16   22   24           18           2
41   16   22   25           19           2
42   16   22   26           20           2
43   16   22   27           21           2
44   16   22   28           22           2
60   16   22   44           38           2
61   16   22   45           39           2
65   16   22   49           43           2
New centroids: c1 = 15.33, c2 = 36.25
Iteration 2:
c1 = 18.56
c2 = 45.90
xi   c1      c2      Distance 1   Distance 2   Nearest Cluster
15   15.33   36.25   0.33         21.25        1
15   15.33   36.25   0.33         21.25        1
16   15.33   36.25   0.67         20.25        1
19   15.33   36.25   3.67         17.25        1
19   15.33   36.25   3.67         17.25        1
20   15.33   36.25   4.67         16.25        1
20   15.33   36.25   4.67         16.25        1
21   15.33   36.25   5.67         15.25        1
22   15.33   36.25   6.67         14.25        1
28   15.33   36.25   12.67        8.25         2
35   15.33   36.25   19.67        1.25         2
New centroids: c1 = 18.56, c2 = 45.9
Iteration 3:
c1 = 19.50
c2 = 47.89
xi   c1      c2     Distance 1   Distance 2   Nearest Cluster
15   18.56   45.9   3.56         30.9         1
15   18.56   45.9   3.56         30.9         1
16   18.56   45.9   2.56         29.9         1
19   18.56   45.9   0.44         26.9         1
19   18.56   45.9   0.44         26.9         1
20   18.56   45.9   1.44         25.9         1
20   18.56   45.9   1.44         25.9         1
21   18.56   45.9   2.44         24.9         1
22   18.56   45.9   3.44         23.9         1
28   18.56   45.9   9.44         17.9         1
35   18.56   45.9   16.44        10.9         2
40   18.56   45.9   21.44        5.9          2
41   18.56   45.9   22.44        4.9          2
42   18.56   45.9   23.44        3.9          2
43   18.56   45.9   24.44        2.9          2
44   18.56   45.9   25.44        1.9          2
60   18.56   45.9   41.44        14.1         2
61   18.56   45.9   42.44        15.1         2
65   18.56   45.9   46.44        19.1         2
New centroids: c1 = 19.50, c2 = 47.89
Iteration 4:
c1 = 19.50
c2 = 47.89
xi   c1     c2      Distance 1   Distance 2   Nearest Cluster
15   19.5   47.89   4.50         32.89        1
15   19.5   47.89   4.50         32.89        1
New centroids: c1 = 19.50, c2 = 47.89
No change between iterations 3 and 4 has been noted. By using clustering, 2 groups have been
identified 15-28 and 35-65. The initial choice of centroids can affect the output clusters, so the
algorithm is often run multiple times with different starting conditions in order to get a fair
view of what the clusters should be.
EM algorithm
These are the two basic steps of the EM algorithm, namely E Step or Expectation
Step or Estimation Step and M Step or Maximization Step.
Estimation step (E-step):
Initialize the parameters μk, Σk and πk with some random values, or with the results of
K-Means clustering or of hierarchical clustering.
Then, for those given parameter values, estimate the values of the latent variables
(i.e. the responsibilities γk).
Maximization step (M-step):
Update the values of the parameters (i.e. μk, Σk and πk) using the maximum-likelihood
estimates computed from the current responsibilities.
Source Code:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
import pandas as pd
X=pd.read_csv("kmeansdata.csv")
x1 = X['Distance_Feature'].values
x2 = X['Speeding_Feature'].values
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
plt.plot()
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()
#code for EM
gmm = GaussianMixture(n_components=3)
gmm.fit(X)
em_predictions = gmm.predict(X)
print("\nEM predictions")
print(em_predictions)
print("mean:\n",gmm.means_)
print('\n')
print("Covariances\n",gmm.covariances_)
print(X)
plt.title('Expectation Maximization')
plt.scatter(X[:,0], X[:,1],c=em_predictions,s=50)
plt.show()
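# code for K-Means (the fitting step is not shown in the manual; 3 clusters are assumed
# here, matching the 3 Gaussian components used above)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)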
print(kmeans.cluster_centers_)
print(kmeans.labels_)
plt.title('KMEANS')
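# scatter the points coloured by their K-Means cluster label (sketch; omitted in the manual)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50)
plt.show()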
INPUT:
Driver_ID,Distance_Feature,Speeding_Feature
3423311935,71.24,28.0
3423313212,52.53,25.0
3423313724,64.54,27.0
3423311373,55.69,22.0
3423310999,54.58,25.0
OUTPUT:
EM predictions
[0 0 0 1 0 1 1 1 2 1 2 2 1 1 2 1 2 1 0 1 0 1 1]
mean:
[[57.70629058 25.73574491]
[52.12044022 22.46250453]
[46.4364858 39.43288647]]
Covariances
[[[83.51878796 14.926902 ]
[14.926902 2.70846907]]
[[29.95910352 15.83416554]
[15.83416554 67.01175729]]
[[79.34811849 29.55835938]
[29.55835938 18.17157304]]]
[[71.24 28. ]
[52.53 25. ]
[64.54 27. ]
[55.69 22. ]
[54.58 25. ]
[41.91 10. ]
[58.64 20. ]
[52.02 8. ]
[31.25 34. ]
[44.31 19. ]
[49.35 40. ]
[58.07 45. ]
[44.22 22. ]
[55.73 19. ]
[46.63 43. ]
[52.97 32. ]
[46.25 35. ]
[51.55 27. ]
[57.05 26. ]
[58.45 30. ]
[43.42 23. ]
[55.68 37. ]
[55.15 18. ]
9. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be used
for this problem.
• Principle: points (documents) that are close in the space belong to the same class
Distance Metrics
K-Nearest-Neighbour Algorithm:
Confusion matrix:
Note,
• Class 1 : Positive
• Class 2 : Negative
Example :
Source Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
dataset=pd.read_csv("iris.csv")
X=dataset.iloc[:,1:-1]   # feature columns (the Id column is dropped; this slicing assumes the CSV layout shown below)
y=dataset.iloc[:,-1]     # Species label
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0,test_size=0.25)
classifier=KNeighborsClassifier(n_neighbors=8,p=3,metric='euclidean')
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
cm=confusion_matrix(y_test,y_pred)
print('Confusion matrix is as follows\n',cm)
print('Accuracy Metrics')
print(classification_report(y_test,y_pred))
print(" correct prediction",accuracy_score(y_test,y_pred))
print(" wrong prediction",(1-accuracy_score(y_test,y_pred)))
INPUT:
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
6,5.4,3.9,1.7,0.4,Iris-setosa
7,4.6,3.4,1.4,0.3,Iris-setosa
8,5.0,3.4,1.5,0.2,Iris-setosa
OUTPUT :
Confusion matrix is as follows
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]
Accuracy Metrics
precision recall f1-score support
10. Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select appropriate data set for your experiment and draw graphs.
• Regression is a technique from statistics that is used to predict values of a desired target
quantity when the target quantity is continuous.
• In regression, we seek to identify (or estimate) a continuous variable y associated with a given
input vector x.
• y is called the dependent variable.
• x is called the independent variable.
Lowess Algorithm: Locally weighted regression is a very powerful non-parametric model used
in statistical learning. Given a dataset X, y, we attempt to find a model parameter β(x) that
minimizes the residual sum of weighted squared errors. The weights are given by a kernel
function (k or w) which can be chosen arbitrarily.
Prediction = x0*β
Source Code:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
from bokeh.layouts import gridplot
from bokeh.io import push_notebook
# The two helpers below are reconstructed: the manual shows only the final "return" line
# of the local regression function, so treat these definitions as a sketch.
def radial_kernel(x0, X, tau):
    return np.exp(np.sum((X - x0) ** 2, axis=1) / (-2 * tau * tau))
def local_regression(x0, X, Y, tau):
    x0 = np.r_[1, x0]                        # add bias term
    X = np.c_[np.ones(len(X)), X]
    xw = X.T * radial_kernel(x0, X, tau)     # weight every sample by its kernel value
    beta = np.linalg.pinv(xw @ X) @ xw @ Y   # weighted least-squares solution
    # predict value
    return x0 @ beta # @ Matrix Multiplication or Dot Product for prediction
n = 1000
# generate dataset
X = np.linspace(-3, 3, num=n)
print("The Data Set ( 10 Samples) X :\n",X[1:10])
Y = np.log(np.abs(X ** 2 - 1) + .5)
print("The Fitting Curve Data Set (10 Samples) Y :\n",Y[1:10])
# jitter X
X += np.random.normal(scale=.1, size=n)
print("Normalised (10 Samples) X :\n",X[1:10])
# evaluation grid for the fitted curve (this grid produces the "Xo Domain Space" values
# printed in the output below)
domain = np.linspace(-3, 3, num=300)
print(" Xo Domain Space(10 Samples) :\n",domain[1:10])
def plot_lwr(tau):
    # prediction through regression
    prediction = [local_regression(x0, X, Y, tau) for x0 in domain]
    plot = figure(plot_width=400, plot_height=400)   # use width/height on Bokeh 3.x
    plot.title.text = 'tau=%g' % tau
    plot.scatter(X, Y, alpha=.3)
    plot.line(domain, prediction, line_width=2, color='red')
    return plot
# example bandwidths: larger tau gives a smoother fit, smaller tau follows the data closely
show(gridplot([[plot_lwr(10.), plot_lwr(1.)], [plot_lwr(0.1), plot_lwr(0.01)]]))
INPUT:
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
25.29,4.71,Male,No,Sun,Dinner,4
8.77,2.0,Male,No,Sun,Dinner,2
26.88,3.12,Male,No,Sun,Dinner,4
OUTPUT:
The Data Set ( 10 Samples) X :
[-2.99399399 -2.98798799 -2.98198198 -2.97597598 -2.96996997 -2.96396396
-2.95795796 -2.95195195 -2.94594595]
The Fitting Curve Data Set (10 Samples) Y :
[2.13582188 2.13156806 2.12730467 2.12303166 2.11874898 2.11445659
2.11015444 2.10584249 2.10152068]
Normalised (10 Samples) X :
[-3.10518137 -3.00247603 -2.9388515 -2.79373602 -2.84946247 -2.85313888
-2.9622708 -3.09679502 -2.69778859]
Xo Domain Space(10 Samples) :
[-2.97993311 -2.95986622 -2.93979933 -2.91973244 -2.89966555 -2.87959866
-2.85953177 -2.83946488 -2.81939799]
OR
from numpy import *
import operator
from os import listdir
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy.linalg
from scipy.stats.stats import pearsonr
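# The manual omits the kernel function and the data-loading code used below; the sketch
# here is one consistent completion. The file name 'tips.csv' and the column names are
# taken from the INPUT section of this program and are otherwise an assumption.
def kernel(point, xmat, k):
    m, n = shape(xmat)
    weights = mat(eye(m))
    for j in range(m):
        diff = point - xmat[j]
        weights[j, j] = exp(diff * diff.T / (-2.0 * k ** 2))   # Gaussian weight for sample j
    return weights
data = pd.read_csv('tips.csv')
bill = array(data.total_bill)
tip = array(data.tip)
mbill = mat(bill)
mtip = mat(tip)
m = shape(mbill)[1]
one = mat(ones(m))
X = hstack((one.T, mbill.T))   # design matrix: bias column + total_bill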
def localWeight(point,xmat,ymat,k):
    wei = kernel(point,xmat,k)
    W = (X.T*(wei*X)).I*(X.T*(wei*ymat.T))
    return W
def localWeightRegression(xmat,ymat,k):
    m,n = shape(xmat)
    ypred = zeros(m)
    for i in range(m):
        ypred[i] = xmat[i]*localWeight(xmat[i],xmat,ymat,k)
    return ypred
#set k here
ypred = localWeightRegression(X,mtip,0.2)
SortIndex = X[:,1].argsort(0)
xsort = X[SortIndex][:,0]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(bill,tip, color='green')
ax.plot(xsort[:,1],ypred[SortIndex], color = 'red', linewidth=5)
plt.xlabel('Total bill')
plt.ylabel('Tip')
plt.show();
Output:
A scatter plot of Total bill vs. Tip for the Tips.csv dataset (256 rows), with the locally weighted regression curve drawn in red.
VIVA Questions
In this technique, a model is usually given a dataset of known data on which training is run
(the training data set) and a dataset of unknown data against which the model is tested (the
test data set). The idea of cross-validation is to define a dataset to "test" the model in the
training phase.
c) Probabilistic networks
d) Nearest Neighbor
a) Supervised Learning
b) Unsupervised Learning
c) Semi-supervised Learning
d) Reinforcement Learning
e) Transduction
f) Learning to Learn
9) What are the three stages to build the hypotheses or model in machine learning?
a) Model building
b) Model testing
c) Applying the model
a) Artificial Intelligence
b) Speech recognition
c) Regression
e) Annotate strings
17) What is the difference between artificial intelligence and machine learning?
Designing and developing algorithms that learn behaviour from empirical data is known as
Machine Learning. Artificial intelligence, in addition to machine learning, also covers other
aspects like knowledge representation, natural language processing, planning, robotics etc.
a) Computer Vision
b) Speech Recognition
c) Data Mining
d) Statistics
e) Information Retrieval
f) Bio-Informatics
24) What are the two methods used for the calibration in Supervised Learning?
The two methods used for predicting good probabilities in Supervised Learning are
a) Platt Calibration
b) Isotonic Regression
These methods are designed for binary classification; extending them to multi-class problems is not trivial.
26) What is the difference between heuristic for rule learning and heuristics for decision
trees?
The difference is that the heuristics for decision trees evaluate the average quality of a number of
disjointed sets while rule learners only evaluate the quality of the set of instances that is covered
with the candidate rule.
30) Why is an instance-based learning algorithm sometimes referred to as a lazy learning
algorithm?
Instance-based learning algorithms are also referred to as lazy learning algorithms because they
delay the induction or generalization process until classification is performed.
31) What are the two classification methods that SVM (Support Vector Machine) can
handle?
a) Combining binary classifiers
b) Modifying binary classifiers to incorporate multiclass learning
36) What is the general principle of an ensemble method and what is bagging and
boosting in ensemble method?
The general principle of an ensemble method is to combine the predictions of several models
built with a given learning algorithm in order to improve robustness over a single model.
a) Data Acquisition
d) Query Type
e) Scoring Metric
f) Significance Test
43) What are the different methods for Sequential Supervised Learning?
The different methods to solve Sequential Supervised Learning problems are
a) Sliding-window methods
44) What are the areas in robotics and information processing where sequential
prediction problem arises?
The areas in robotics and information processing where sequential prediction problem arises are
a) Imitation Learning
b) Structured prediction
47) What are the different categories you can categorized the sequence learning process?
a) Sequence prediction
b) Sequence generation
c) Sequence recognition
d) Sequential decision
b) Inductive Learning
50) Give a popular application of machine learning that you see on a day-to-day basis.
The recommendation engine implemented by major ecommerce websites uses Machine Learning