
MACHINE LEARNING LABORATORY MANUAL
Machine learning
Machine learning is a subset of artificial intelligence in the field of computer science that often
uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve
performance on a specific task) from data, without being explicitly programmed. In the past
decade, machine learning has given us self-driving cars, practical speech recognition, effective
web search, and a vastly improved understanding of the human genome.

Machine learning tasks


Machine learning tasks are typically classified into two broad categories, depending on whether
there is a learning "signal" or "feedback" available to a learning system:

Supervised learning: The computer is presented with example inputs and their desired outputs,
given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. As
special cases, the input signal can be only partially available, or restricted to special feedback:

Semi-supervised learning: the computer is given only an incomplete training signal: a training
set with some (often many) of the target outputs missing.

Active learning: the computer can only obtain training labels for a limited set of instances (based
on a budget), and also has to optimize its choice of objects to acquire labels for. When used
interactively, these can be presented to the user for labeling.

Reinforcement learning: training data (in the form of rewards and punishments) is given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing
a game against an opponent.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to
find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).

Supervised learning: Find-S algorithm, Candidate elimination algorithm, Decision tree algorithm, Back propagation algorithm, Naïve Bayes algorithm

Unsupervised learning: EM algorithm, K-means algorithm

Instance-based learning: Locally weighted regression algorithm, K-nearest neighbour algorithm (lazy learning algorithm)
Machine learning applications
In classification, inputs are divided into two or more classes, and the learner must produce a
model that assigns unseen inputs to one or more (multi-label classification) of these classes. This
is typically tackled in a supervised manner. Spam filtering is an example of classification, where
the inputs are email (or other) messages and the classes are "spam" and "not spam". In
regression, also a supervised problem, the outputs are continuous rather than discrete.

In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are
not known beforehand, making this typically an unsupervised task. Density estimation finds the
distribution of inputs in some space. Dimensionality reduction simplifies inputs by mapping
them into a lower-dimensional space. Topic modeling is a related problem, where a program is
given a list of human language documents and is tasked with finding out which documents
cover similar topics.

Machine learning Approaches

Decision tree learning: Decision tree learning uses a decision tree as a predictive model, which maps
observations about an item to conclusions about the item's target value.

Association rule learning: Association rule learning is a method for discovering interesting relations
between variables in large databases.

Artificial neural networks

An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is
a learning algorithm that is vaguely inspired by biological neural networks. Computations are
structured in terms of an interconnected group of artificial neurons, processing information using
a connectionist approach to computation. Modern neural networks are non-linear statistical data
modeling tools. They are usually used to model complex relationships between inputs and
outputs, to find patterns in data, or to capture the statistical structure in an unknown joint
probability distribution between observed variables.

Deep learning

Falling hardware prices and the development of GPUs for personal use in the last few years
have contributed to the development of the concept of deep learning which consists of multiple
hidden layers in an artificial neural network. This approach tries to model the way the human
brain processes light and sound into vision and hearing. Some successful applications of deep
learning are computer vision and speech recognition.

Inductive logic programming


Inductive logic programming (ILP) is an approach to rule learning using logic programming as a
uniform representation for input examples, background knowledge, and hypotheses. Given an
encoding of the known background knowledge and a set of examples represented as a logical
database of facts, an ILP system will derive a hypothesized logic program that entails all positive
and no negative examples. Inductive programming is a related field that considers any kind of
programming languages for representing hypotheses (and not only logic programming), such as
functional programs.
Support vector machines

Support vector machines (SVMs) are a set of related supervised learning methods used for
classification and regression. Given a set of training examples, each marked as belonging to one
of two categories, an SVM training algorithm builds a model that predicts whether a new
example falls into one category or the other.
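
For concreteness, here is a minimal sketch of this fit/predict workflow using scikit-learn's SVC class (the toy data, the linear kernel, and the use of scikit-learn itself are illustrative assumptions; the lab programs implement their own algorithms where required):

from sklearn import svm

# Toy training data: two categories separable by the first coordinate
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel='linear')    # linear support vector classifier
clf.fit(X, y)                     # build the model from labelled examples
print(clf.predict([[0.9, 0.5]]))  # predicts category 1 for a new example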

Clustering

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to some pre-designated criterion or
criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated for example by internal compactness (similarity between
members of the same cluster) and separation between different clusters. Other methods are based
on estimated density and graph connectivity. Clustering is a method of unsupervised learning,
and a common technique for statistical data analysis.

Bayesian networks

A Bayesian network, belief network or directed acyclic graphical model is a probabilistic
graphical model that represents a set of random variables and their conditional independencies
via a directed acyclic graph (DAG). For example, a Bayesian network could represent the
probabilistic relationships between diseases and symptoms. Given symptoms, the network can be
used to compute the probabilities of the presence of various diseases. Efficient algorithms exist
that perform inference and learning.
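
In symbols, the DAG encodes a factorization of the joint distribution (a standard identity, stated here for reference):

P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))

so, in the disease/symptom example, the joint probability decomposes into one small conditional table per node.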

Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an environment
so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt
to find a policy that maps states of the world to the actions the agent ought to take in those states.
Reinforcement learning differs from the supervised learning problem in that correct input/output
pairs are never presented, nor sub-optimal actions explicitly corrected.
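
As one concrete example of such an algorithm, Q-learning (not one of the lab programs; shown only to make "policy" precise) keeps a value Q(s, a) for each state-action pair and updates it from reward feedback alone:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where \alpha is the learning rate, \gamma the discount factor, r the immediate reward and s' the next state; the learned policy then picks \arg\max_a Q(s, a) in each state.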

Similarity and metric learning


In this problem, the learning machine is given pairs of examples that are considered similar and
pairs of less similar objects. It then needs to learn a similarity function (or a distance metric
function) that can predict if new objects are similar. It is sometimes used in Recommendation
systems.

Genetic algorithms
A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and
uses methods such as mutation and crossover to generate new genotype in the hope of finding
good solutions to a given problem. In machine learning, genetic algorithms found some uses in
the 1980s and 1990s. Conversely, machine learning techniques have been used to improve the
performance of genetic and evolutionary algorithms.
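
The mutation/crossover loop described above can be made concrete with a minimal sketch (assumptions: a bit-string genotype and the toy "count the ones" fitness; every name here is illustrative):

import random

def fitness(bits):
    # Toy objective: number of 1s in the genotype (the "OneMax" problem)
    return sum(bits)

def crossover(a, b):
    # Single-point crossover between two parent genotypes
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(bits, rate=0.01):
    # Flip each bit with a small probability
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    # Keep the fitter half as parents, refill the rest with mutated offspring
    population.sort(key=fitness, reverse=True)
    parents = population[:15]
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(15)]
    population = parents + offspring

print(max(fitness(ind) for ind in population))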
Rule-based machine learning

Rule-based machine learning is a general term for any machine learning method that identifies,
learns, or evolves "rules" to store, manipulate or apply, knowledge. The defining characteristic
of a rule-based machine learner is the identification and utilization of a set of relational rules that
collectively represent the knowledge captured by the system. This is in contrast to other machine
learners that commonly identify a singular model that can be universally applied to any instance
in order to make a prediction. Rule-based machine learning approaches include learning
classifier systems, association rule learning, and artificial immune systems.

Feature selection approach

Feature selection is the process of selecting an optimal subset of relevant features for use in
model construction. It is assumed the data contains some features that are either redundant or
irrelevant, and can thus be removed to reduce calculation cost without incurring much loss of
information. Common optimality criteria include accuracy, similarity and information measures.
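
As a minimal sketch of this process (assuming scikit-learn is available; the iris data and the choice of mutual information as the criterion are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
# Keep the two features carrying the most mutual information about the class
selector = SelectKBest(mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)   # (150, 4) -> (150, 2)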
MACHINE LEARNING LABORATORY
[As per Choice Based Credit System (CBCS) scheme]

(Effective from the academic year 2016-2017)
SEMESTER – VII

Subject Code: 21CSL68                      IA Marks: 40
Number of Lecture Hours/Week: 01I + 02P    Exam Marks: 60
Total Number of Lecture Hours: 40          Exam Hours: 03

CREDITS – 02

Course objectives: This course will enable students to

1. Make use of Data sets in implementing the machine learning algorithms


2. Implement the machine learning concepts and algorithms in any suitable
language of choice.

Description (If any):

1. The programs can be implemented in either JAVA or Python.


2. For Problems 1 to 6 and 10, programs are to be developed without using the
built-in classes or APIs of Java/Python.
3. Data sets can be taken from standard repositories
(https://archive.ics.uci.edu/ml/datasets.html) or constructed by the students.

Lab Experiments:

1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data
from a .CSV file.

2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of
all hypotheses consistent with the training examples.

3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge
to classify a new sample.

4. Build an Artificial Neural Network by implementing the Backpropagation algorithm
and test the same using appropriate data sets.

5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
6. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write
the program. Calculate the accuracy, precision, and recall for your data set.

7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.

8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data
set for clustering using k-Means algorithm. Compare the results of these two
algorithms and comment on the quality of clustering. You can add Java/Python ML
library classes/API in the program.

9. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes can
be used for this problem.

10. Implement the non-parametric Locally Weighted Regression algorithm to fit data
points. Select the appropriate data set for your experiment and draw graphs.

Study Experiment / Project:


Course outcomes: The students should be able to:

1. Understand the implementation procedures for the machine learning algorithms.


2. Design Java/Python programs for various Learning algorithms.
3. Apply appropriate data sets to the Machine Learning algorithms.
4. Identify and apply Machine Learning algorithms to solve real-world problems.

Conduction of Practical Examination:

 All laboratory experiments are to be included for the practical examination.
 Students are allowed to pick one experiment from the lot.
 Strictly follow the instructions as printed on the cover page of the answer script.
 Marks distribution: Procedure + Conduction + Viva: 20 + 50 + 10 (80)
 Change of experiment is allowed only once, and the marks allotted to the procedure part will be made zero.
1. Implement and demonstrate the FIND-S algorithm for finding the most specific
hypothesis based on a given set of training data samples. Read the training data
from a .CSV file.

import csv

# Load data from CSV file


with open("tennis.csv", "r") as f:
reader = csv.reader(f)
data = list(reader)

# Initialize hypothesis with most specific values


hypothesis = [["0", "0", "0", "0", "0", "0"]]

# Iterate through each instance in the dataset


for instance in data:
print(instance)
# If the instance is positive (the last attribute is "True")
if instance[-1] == "True":
j = 0
# Update hypothesis based on the current instance
for x in instance:
if x != "True":
if x != hypothesis[0][j] and hypothesis[0][j] == "0":
hypothesis[0][j] = x
elif x != hypothesis[0][j] and hypothesis[0][j] != "0":
hypothesis[0][j] = "?"
else:
pass
j = j + 1

# Print the most specific hypothesis


print("Most specific hypothesis is:")
print(hypothesis)

Output

'Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same',True


'Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same',True
'Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change',False
'Sunny', 'Warm', 'High', 'Strong', 'Cool','Change',True

Maximally Specific set


[['Sunny', 'Warm', '?', 'Strong', '?', '?']]
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of
all hypotheses consistent with the training examples.

class Holder:
def __init__(self, attr):
self.factors = {i: [] for i in attr}
self.attributes = attr

def add_values(self, factor, values):


self.factors[factor] = values

class CandidateElimination:
def __init__(self, data, holder):
self.dataset = data
self.num_factors = len(data[0][0])
self.factors = holder.factors
self.attributes = holder.attributes

def run_algorithm(self):
G = self.initializeG()
S = self.initializeS()

for trial_set in self.dataset:


if self.is_positive(trial_set):
G = self.remove_inconsistent_G(G, trial_set[0])

S_new = S[:]
for s in S:
if not self.consistent(s, trial_set[0]):
S_new.remove(s)
generalization = self.generalize_inconsistent_S(s,
trial_set[0])
if self.get_general(generalization, G):
S_new.append(generalization)
S = self.remove_more_general(S_new)
else:
S = self.remove_inconsistent_S(S, trial_set[0])

G_new = G[:]
for g in G:
if self.consistent(g, trial_set[0]):
G_new.remove(g)
specializations = self.specialize_inconsistent_G(g,
trial_set[0])
G_new.extend(self.get_specific(specializations, S))
G = self.remove_more_specific(G_new)

print("Final S:", S)
print("Final G:", G)

def initializeS(self):
return [tuple(['-' for _ in range(self.num_factors)])]

def initializeG(self):
return [tuple(['?' for _ in range(self.num_factors)])]

def is_positive(self, trial_set):


return trial_set[1] == 'Y'

def match_factor(self, value1, value2):


return value1 == '?' or value2 == '?' or value1 == value2

def consistent(self, hypothesis, instance):


return all(self.match_factor(factor, instance[i]) for i, factor in
enumerate(hypothesis))

def remove_inconsistent_G(self, hypotheses, instance):


return [g for g in hypotheses if self.consistent(g, instance)]

def remove_inconsistent_S(self, hypotheses, instance):


return [s for s in hypotheses if not self.consistent(s, instance)]

def remove_more_general(self, hypotheses):


return [h for h in hypotheses if not any(self.more_general(h2, h) for h2 in
hypotheses if h != h2)]

def remove_more_specific(self, hypotheses):


return [h for h in hypotheses if not any(self.more_specific(h2, h) for h2
in hypotheses if h != h2)]

def generalize_inconsistent_S(self, hypothesis, instance):


return tuple(instance[i] if factor == '-' else '?' if not
self.match_factor(factor, instance[i]) else factor
for i, factor in enumerate(hypothesis))

def specialize_inconsistent_G(self, hypothesis, instance):


specializations = []
for i, factor in enumerate(hypothesis):
if factor == '?':
for value in self.factors[self.attributes[i]]:
if value != instance[i]:
specialization = list(hypothesis)
specialization[i] = value
specializations.append(tuple(specialization))
return specializations

def get_general(self, generalization, G):


return generalization if any(self.more_general(g, generalization) for g in
G) else None

def get_specific(self, specializations, S):


return [hypo for hypo in specializations if any(self.more_specific(s, hypo)
or s == self.initializeS()[0] for s in S)]

def more_general(self, hyp1, hyp2):


return all(h1 == '?' or h1 == h2 for h1, h2 in zip(hyp1, hyp2))

def more_specific(self, hyp1, hyp2):


return self.more_general(hyp2, hyp1)

dataset = [
(('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), 'Y'),
(('sunny', 'warm', 'high', 'strong', 'warm', 'same'), 'Y'),
(('rainy', 'cold', 'high', 'strong', 'warm', 'change'), 'N'),
(('sunny', 'warm', 'high', 'strong', 'cool', 'change'), 'Y')
]

attributes = ('Sky', 'Temp', 'Humidity', 'Wind', 'Water', 'Forecast')


f = Holder(attributes)
f.add_values('Sky', ('sunny', 'rainy', 'cloudy'))
f.add_values('Temp', ('cold', 'warm'))
f.add_values('Humidity', ('normal', 'high'))
f.add_values('Wind', ('weak', 'strong'))
f.add_values('Water', ('warm', 'cold'))
f.add_values('Forecast', ('same', 'change'))
a = CandidateElimination(dataset, f)
a.run_algorithm()
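
Expected Output

For the EnjoySport data above, the algorithm should converge to the classic boundary sets for this data set (exact print formatting depends on the code):

Final S: [('sunny', 'warm', '?', 'strong', '?', '?')]
Final G: [('sunny', '?', '?', '?', '?', '?'), ('?', 'warm', '?', '?', '?', '?')]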
3. Write a program to demonstrate the working of the decision tree-based ID3
algorithm. Use an appropriate data set for building the decision tree and apply this
knowledge to classify a new sample.
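
Although the heading says ID3, the listing below ranks attributes by gain ratio (the C4.5 refinement of ID3's information gain). In standard notation the quantities it computes are

H(S) = -\sum_i p_i \log_2 p_i

\mathrm{GainRatio}(S, A) = \frac{H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)}{-\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}

where S_v is the subset of examples taking value v on attribute A.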

import numpy as np
import math
import csv

class Node:
def __init__(self, attribute):
self.attribute = attribute
self.children = []
self.answer = ""

def __str__(self):
return self.attribute

def subtables(data, col, delete):


items = np.unique(data[:, col])
sub_dict = {item: data[data[:, col] == item] for item in items}
if delete:
sub_dict = {item: np.delete(sub_data, col, 1) for item, sub_data in
sub_dict.items()}
return items, sub_dict

def entropy(S):
_, counts = np.unique(S, return_counts=True)
probabilities = counts / len(S)
return -sum(prob * math.log2(prob) for prob in probabilities)

def gain_ratio(data, col):


total_entropy = entropy(data[:, -1])
items, sub_dict = subtables(data, col, delete=False)
total_size = data.shape[0]

sub_entropies = np.array(
[entropy(sub_dict[item][:, -1]) * (sub_dict[item].shape[0] / total_size)
for item in items])
intrinsic_value = -sum(
(sub_dict[item].shape[0] / total_size) *
math.log2(sub_dict[item].shape[0] / total_size) for item in items)

    return (total_entropy - sum(sub_entropies)) / intrinsic_value if intrinsic_value != 0 else 0

def create_node(data, metadata):


if len(np.unique(data[:, -1])) == 1:
node = Node("")
node.answer = np.unique(data[:, -1])[0]
return node

gains = [gain_ratio(data, col) for col in range(data.shape[1] - 1)]


best_col = np.argmax(gains)

node = Node(metadata[best_col])
items, sub_dict = subtables(data, best_col, delete=True)
new_metadata = np.delete(metadata, best_col, 0)
for item in items:
child = create_node(sub_dict[item], new_metadata)
node.children.append((item, child))

return node

def print_tree(node, level=0):


if node.answer:
print(" " * level, node.answer)
else:
print(" " * level, node.attribute)
for value, child in node.children:
print(" " * (level + 1), value)
print_tree(child, level + 2)

def read_data(filename):
with open(filename, 'r') as csvfile:
datareader = csv.reader(csvfile, delimiter=',')
metadata = next(datareader)
traindata = [row for row in datareader]
return metadata, traindata
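
The listing above defines the tree builder but never invokes it. A minimal driver, assuming the functions above and the Tennis.csv file shown next, would be:

metadata, traindata = read_data("tennis.csv")
data = np.array(traindata)
node = create_node(data, np.array(metadata))
print_tree(node)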
Tennis.csv

outlook,temperature,humidity,wind,answer
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no

Output
outlook
 overcast
   yes
 rain
  wind
   strong
     no
   weak
     yes
 sunny
  humidity
   high
     no
   normal
     yes
4. Build an Artificial Neural Network by implementing the Backpropagation
algorithm and test the same using appropriate data sets.

import numpy as np

# Dataset
X = np.array([[2, 9], [1, 5], [3, 6]], dtype=float)
y = np.array([[92], [86], [89]], dtype=float)

# Normalize the data


X = X / np.amax(X, axis=0)
y = y / 100

# Sigmoid Function
def sigmoid(x):
return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid (assumes x is already a sigmoid output)
def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 7000 # Training iterations
lr = 0.1 # Learning rate
inputlayer_neurons = X.shape[1] # Number of features
hiddenlayer_neurons = 3 # Number of hidden layer neurons
output_neurons = 1 # Number of output layer neurons

# Weight and bias initialization


wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

# Training algorithm
for i in range(epoch):
# Forward Propagation
hinp1 = np.dot(X, wh)
hinp = hinp1 + bh
hlayer_act = sigmoid(hinp)

outinp1 = np.dot(hlayer_act, wout)


outinp = outinp1 + bout
output = sigmoid(outinp)

# Backpropagation
EO = y - output # Error at output
outgrad = derivatives_sigmoid(output)
d_output = EO * outgrad

EH = d_output.dot(wout.T) # Error at hidden layer


hiddengrad = derivatives_sigmoid(hlayer_act)
d_hiddenlayer = EH * hiddengrad

# Updating weights and biases


wout += hlayer_act.T.dot(d_output) * lr
bout += np.sum(d_output, axis=0, keepdims=True) * lr
wh += X.T.dot(d_hiddenlayer) * lr
bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

# Output
print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
Output
Input:
[[ 0.66666667 1. ]
[ 0.33333333 0.55555556]
[ 1. 0.66666667]]
Actual Output:
[[ 0.92]
[ 0.86]
[ 0.89]]
Predicted Output:
[[ 0.89559591]
[ 0.88142069]
[ 0.8928407 ]]
5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
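
The listing below is the Gaussian variant of naïve Bayes: each numeric attribute is summarized per class by its mean \mu and standard deviation \sigma, and calculate_probability() evaluates the normal density

P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The per-class scores are then products of these per-attribute likelihoods.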
import csv
import random
import math

def load_csv(filename):
with open(filename, "r") as file:
lines = csv.reader(file)
dataset = list(lines)
for i in range(len(dataset)):
dataset[i] = [float(x) for x in dataset[i]]
return dataset

def split_dataset(dataset, split_ratio):


train_size = int(len(dataset) * split_ratio)
train_set = []
copy = list(dataset)
while len(train_set) < train_size:
index = random.randrange(len(copy))
train_set.append(copy.pop(index))
return train_set, copy

def separate_by_class(dataset):
separated = {}
for i in range(len(dataset)):
vector = dataset[i]
if vector[-1] not in separated:
separated[vector[-1]] = []
separated[vector[-1]].append(vector)
return separated

def mean(numbers):
return sum(numbers) / float(len(numbers))

def stdev(numbers):
avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
summaries = [(mean(attribute), stdev(attribute)) for attribute in
zip(*dataset)]
del summaries[-1]
return summaries

def summarize_by_class(dataset):
separated = separate_by_class(dataset)
summaries = {}
for class_value, instances in separated.items():
summaries[class_value] = summarize(instances)
return summaries
def calculate_probability(x, mean, stdev):
exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev,
2))))
return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

def calculate_class_probabilities(summaries, input_vector):


probabilities = {}
for class_value, class_summaries in summaries.items():
probabilities[class_value] = 1
for i in range(len(class_summaries)):
mean, stdev = class_summaries[i]
x = input_vector[i]
probabilities[class_value] *= calculate_probability(x, mean,
stdev)
return probabilities

def predict(summaries, input_vector):


probabilities = calculate_class_probabilities(summaries, input_vector)
best_label, best_prob = None, -1
for class_value, probability in probabilities.items():
if best_label is None or probability > best_prob:
best_prob = probability
best_label = class_value
return best_label

def get_predictions(summaries, test_set):


predictions = []
for i in range(len(test_set)):
result = predict(summaries, test_set[i])
predictions.append(result)
return predictions

def get_accuracy(test_set, predictions):


correct = 0
for i in range(len(test_set)):
if test_set[i][-1] == predictions[i]:
correct += 1
return (correct / float(len(test_set))) * 100.0

def main():
filename = '5data.csv'
split_ratio = 0.67
dataset = load_csv(filename)

training_set, test_set = split_dataset(dataset, split_ratio)


    print(f'Split {len(dataset)} rows into train={len(training_set)} and test={len(test_set)} rows')

summaries = summarize_by_class(training_set)

predictions = get_predictions(summaries, test_set)


accuracy = get_accuracy(test_set, predictions)
print(f'Accuracy of the classifier is: {accuracy:.2f}%')

main()

Output

Confusion matrix is as follows:
[[17  0  0]
 [ 0 17  0]
 [ 0  0 11]]

Accuracy metrics:
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        17
          1       1.00      1.00      1.00        17
          2       1.00      1.00      1.00        11
avg / total       1.00      1.00      1.00        45
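6. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in classes/API can be used to write
the program. Calculate the accuracy, precision, and recall for your data set.

A minimal sketch using scikit-learn is given below; the file name documents.csv and its message/label columns are assumed for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Assumed CSV layout: a 'message' text column and a 'label' class column
data = pd.read_csv("documents.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data["message"], data["label"], test_size=0.33, random_state=42)

vectorizer = CountVectorizer()                 # bag-of-words features
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

clf = MultinomialNB().fit(X_train_counts, y_train)
predicted = clf.predict(X_test_counts)

print("Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Precision:", metrics.precision_score(y_test, predicted, average="macro"))
print("Recall:", metrics.recall_score(y_test, predicted, average="macro"))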
7. Write a program to construct a Bayesian network considering medical data.
Use this model to demonstrate the diagnosis of heart patients using the standard
Heart Disease Data Set. You can use Java/Python ML library classes/API.
(The listing below illustrates the network construction on the classic Asia network
shipped with pomegranate; the same steps apply to heart-disease variables.)

from pomegranate import *

# Define the distributions


asia = DiscreteDistribution({'True': 0.5, 'False': 0.5})

tuberculosis = ConditionalProbabilityTable(
[['True', 'True', 0.2],
['True', 'False', 0.8],
['False', 'True', 0.01],
['False', 'False', 0.99]], [asia])

smoking = DiscreteDistribution({'True': 0.5, 'False': 0.5})

lung = ConditionalProbabilityTable(
[['True', 'True', 0.75],
['True', 'False', 0.25],
['False', 'True', 0.02],
['False', 'False', 0.98]], [smoking])

bronchitis = ConditionalProbabilityTable(
[['True', 'True', 0.92],
['True', 'False', 0.08],
['False', 'True', 0.03],
['False', 'False', 0.97]], [smoking])

tuberculosis_or_cancer = ConditionalProbabilityTable(
[['True', 'True', 'True', 1.0],
['True', 'True', 'False', 0.0],
['True', 'False', 'True', 1.0],
['True', 'False', 'False', 0.0],
['False', 'True', 'True', 1.0],
['False', 'True', 'False', 0.0],
['False', 'False', 'True', 0.0],
['False', 'False', 'False', 1.0]], [tuberculosis, lung])

xray = ConditionalProbabilityTable(
[['True', 'True', 0.885],
['True', 'False', 0.115],
['False', 'True', 0.04],
['False', 'False', 0.96]], [tuberculosis_or_cancer])

dyspnea = ConditionalProbabilityTable(
[['True', 'True', 'True', 0.96],
['True', 'True', 'False', 0.04],
['True', 'False', 'True', 0.89],
['True', 'False', 'False', 0.11],
['False', 'True', 'True', 0.96],
['False', 'True', 'False', 0.04],
['False', 'False', 'True', 0.89],
['False', 'False', 'False', 0.11]], [tuberculosis_or_cancer, bronchitis])

# Create the states


s0 = State(asia, name="asia")
s1 = State(tuberculosis, name="tuberculosis")
s2 = State(smoking, name="smoking")
s3 = State(lung, name="lung")
s4 = State(bronchitis, name="bronchitis")
s5 = State(tuberculosis_or_cancer, name="tuberculosis_or_cancer")
s6 = State(xray, name="xray")
s7 = State(dyspnea, name="dyspnea")

# Build the Bayesian Network


network = BayesianNetwork("Asia Network")
network.add_states(s0, s1, s2, s3, s4, s5, s6, s7)

# Add edges
network.add_edge(s0, s1)
network.add_edge(s2, s3)
network.add_edge(s2, s4)
network.add_edge(s1, s5)
network.add_edge(s3, s5)
network.add_edge(s5, s6)
network.add_edge(s5, s7)
network.add_edge(s4, s7)

# Finalize the network


network.bake()

# Make a prediction
print(network.predict_proba({'tuberculosis': 'True'}))
8. Apply the EM algorithm to cluster a set of data stored in a .CSV file. Use the same
data set for clustering using the k-Means algorithm. Compare the results of these two
algorithms and comment on the quality of clustering. You can add Java/Python ML
library classes/API in the program.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from matplotlib.patches import Ellipse

# Generate synthetic data


X, y_true = make_blobs(n_samples=100, centers=4, cluster_std=0.60, random_state=0)
X = X[:, ::-1] # Flip axes for better plotting

# Fit GMM model


gmm = GaussianMixture(n_components=4, random_state=42).fit(X)
labels = gmm.predict(X)

# Plot data points with GMM labels


plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
plt.title("GMM Clustering")
plt.show()

# Predict probabilities
probs = gmm.predict_proba(X)
print(probs[:5].round(3))

# Plot data points with size based on probabilities


size = 50 * probs.max(1) ** 2 # Square emphasizes differences
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=size)
plt.title("GMM Clustering with Probabilities")
plt.show()

# Function to draw ellipse


def draw_ellipse(position, covariance, ax=None, **kwargs):
"""Draw an ellipse with a given position and covariance"""
ax = ax or plt.gca()

# Convert covariance to principal axes


if covariance.shape == (2, 2):
U, s, Vt = np.linalg.svd(covariance)
angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
width, height = 2 * np.sqrt(s)
else:
angle = 0
width, height = 2 * np.sqrt(covariance)

# Draw the ellipse


    for nsig in range(1, 4):
        ax.add_patch(Ellipse(position, nsig * width, nsig * height, angle=angle, **kwargs))

# Function to plot GMM


def plot_gmm(gmm, X, label=True, ax=None):
ax = ax or plt.gca()
labels = gmm.fit(X).predict(X)

if label:
ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
else:
ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)

ax.axis('equal')

w_factor = 0.2 / gmm.weights_.max()


for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
draw_ellipse(pos, covar, alpha=w * w_factor, ax=ax)

# Plot GMM results


plt.figure(figsize=(8, 6))
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
plot_gmm(gmm, X)
plt.title("GMM with Ellipses")
plt.show()

Output

[[1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]]

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Load data
data = pd.read_csv("kmeansdata.csv")
df1 = pd.DataFrame(data)
print(df1)

# Extract features
f1 = df1['Distance_Feature'].values
f2 = df1['Speeding_Feature'].values

X = np.array(list(zip(f1, f2)))

# Plot the dataset


plt.figure(figsize=(8, 6))
plt.scatter(f1, f2, c='black', marker='o')
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('Dataset')
plt.ylabel('Speeding_Feature')
plt.xlabel('Distance_Feature')
plt.show()

# Apply KMeans algorithm


kmeans_model = KMeans(n_clusters=3, random_state=42).fit(X)

# Plot the clustered data


plt.figure(figsize=(8, 6))
colors = ['b', 'g', 'r']
markers = ['o', 'v', 's']

for i, l in enumerate(kmeans_model.labels_):
plt.scatter(f1[i], f2[i], color=colors[l], marker=markers[l])

plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('K-means Clustering')
plt.ylabel('Speeding_Feature')
plt.xlabel('Distance_Feature')
plt.show()
kmeansdata.csv

Driver_ID,Distance_Feature,Speeding_Feature
3423311935,71.24,28
3423313212,52.53,25
3423313724,64.54,27
3423311373,55.69,22
3423310999,54.58,25
3423313857,41.91,10
3423312432,58.64,20
3423311434,52.02,8
3423311328,31.25,34
3423312488,44.31,19
3423311254,49.35,40
3423312943,58.07,45
3423312536,44.22,22
3423311542,55.73,19
3423312176,46.63,43
3423314176,52.97,32
3423314202,46.25,35
3423311346,51.55,27
3423310666,57.05,26
3423313527,58.45,30
3423312182,43.42,23
3423313590,55.68,37
3423312268,55.15,18
9. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes can
be used for this problem.

import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet=[], testSet=[]):


with open(filename, 'r') as csvfile:
lines = csv.reader(csvfile)
dataset = list(lines)
for x in range(len(dataset) - 1):
for y in range(4):
dataset[x][y] = float(dataset[x][y])
if random.random() < split:
trainingSet.append(dataset[x])
else:
testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):


distance = 0
for x in range(length):
distance += (instance1[x] - instance2[x]) ** 2
return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):


distances = []
length = len(testInstance) - 1
for x in range(len(trainingSet)):
dist = euclideanDistance(testInstance, trainingSet[x], length)
distances.append((trainingSet[x], dist))
distances.sort(key=operator.itemgetter(1))
neighbors = []
for x in range(k):
neighbors.append(distances[x][0])
return neighbors

def getResponse(neighbors):
classVotes = {}
for x in range(len(neighbors)):
response = neighbors[x][-1]
if response in classVotes:
classVotes[response] += 1
else:
classVotes[response] = 1
sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1),
reverse=True)
return sortedVotes[0][0]

def getAccuracy(testSet, predictions):


correct = 0
for x in range(len(testSet)):
if testSet[x][-1] == predictions[x]:
correct += 1
return (correct / float(len(testSet))) * 100.0

def main():
trainingSet = []
testSet = []
split = 0.67
loadDataset('iris.data', split, trainingSet, testSet)
print('Train set: ' + repr(len(trainingSet)))
print('Test set: ' + repr(len(testSet)))

predictions = []
k = 3
for x in range(len(testSet)):
neighbors = getNeighbors(trainingSet, testSet[x], k)
result = getResponse(neighbors)
predictions.append(result)
print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))

accuracy = getAccuracy(testSet, predictions)


print('Accuracy: ' + repr(accuracy) + '%')

if __name__ == "__main__":
main()
OUTPUT

Confusion matrix is as follows:
[[11  0  0]
 [ 0  9  1]
 [ 0  1  8]]

Accuracy metrics:
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        11
          1       0.90      0.90      0.90        10
          2       0.89      0.89      0.89         9
Avg/Total         0.93      0.93      0.93        30


10. Implement the non-parametric Locally Weighted Regression algorithm in order
to fit data points. Select an appropriate data set for your experiment and draw graphs.
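
For reference, the quantities the program computes at every query point x are a Gaussian weight for each training point and a weighted least-squares solution:

w_j = \exp\left(-\frac{\lVert x - x_j \rVert^2}{2k^2}\right), \qquad \hat{\beta}(x) = (X^\top W X)^{-1} X^\top W y, \qquad \hat{y}(x) = x^\top \hat{\beta}(x)

where W is the diagonal matrix of the weights w_j and k is the bandwidth parameter set near the end of the program.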
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import numpy.linalg as linalg

def kernel(point, xmat, k):


m, n = np.shape(xmat)
weights = np.mat(np.eye(m))
for j in range(m):
diff = point - xmat[j]
weights[j, j] = np.exp(diff * diff.T / (-2.0 * k ** 2))
return weights

def localWeight(point, xmat, ymat, k):


wei = kernel(point, xmat, k)
W = (xmat.T * (wei * xmat)).I * (xmat.T * (wei * ymat.T))
return W

def localWeightRegression(xmat, ymat, k):


m, n = np.shape(xmat)
ypred = np.zeros(m)
for i in range(m):
ypred[i] = xmat[i] * localWeight(xmat[i], xmat, ymat, k)
return ypred

# Load data points


data = pd.read_csv('data10.csv')
bill = np.array(data['total_bill'])
tip = np.array(data['tip'])

# Preparing and adding a column of ones for the intercept term


mbill = np.mat(bill)
mtip = np.mat(tip)
m = np.shape(mbill)[1]
one = np.mat(np.ones(m))
X = np.hstack((one.T, mbill.T))

# Set bandwidth parameter k here


k = 2
ypred = localWeightRegression(X, mtip, k)

# Plotting the results


plt.scatter(bill, tip, color='blue', label='Data Points')
plt.plot(bill, ypred, color='red', label='LWR Fit')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend()
plt.show()

Output: a scatter plot of the (total_bill, tip) data points with the fitted locally weighted regression curve drawn through them.

11. Implement and demonstrate the working of the SVM algorithm for
classification (an additional experiment beyond the ten prescribed programs).

import numpy as np

class SVM:
    """Linear SVM trained with a simplified SMO (sequential minimal optimization) loop."""

    def __init__(self, C=1.0, tol=0.001, max_iter=100):
        self.C = C                # regularization parameter
        self.tol = tol            # tolerance for the KKT violation check
        self.max_iter = max_iter  # maximum number of passes over the data
        self.alpha = None         # Lagrange multipliers, one per training sample
        self.b = 0.0              # bias term

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.X, self.y = X, y
        self.alpha = np.zeros(n_samples)

        for _ in range(self.max_iter):
            num_changed_alphas = 0
            for i in range(n_samples):
                Ei = self._decision(X[i]) - y[i]
                # Optimize alpha_i only if it violates the KKT conditions
                if (y[i] * Ei < -self.tol and self.alpha[i] < self.C) or \
                   (y[i] * Ei > self.tol and self.alpha[i] > 0):
                    j = self._select_random_j(i, n_samples)
                    Ej = self._decision(X[j]) - y[j]
                    alpha_i_old, alpha_j_old = self.alpha[i], self.alpha[j]
                    L, H = self._calculate_L_H(alpha_i_old, alpha_j_old, y[i], y[j])
                    if L == H:
                        continue
                    eta = 2.0 * np.dot(X[i], X[j]) - np.dot(X[i], X[i]) - np.dot(X[j], X[j])
                    if eta >= 0:
                        continue
                    # Update alpha_j along the constraint, then clip to [L, H]
                    self.alpha[j] -= y[j] * (Ei - Ej) / eta
                    self.alpha[j] = self._clip_alpha(self.alpha[j], H, L)
                    if abs(self.alpha[j] - alpha_j_old) < 1e-5:
                        continue
                    # Move alpha_i by the matching amount in the opposite direction
                    self.alpha[i] += y[i] * y[j] * (alpha_j_old - self.alpha[j])
                    self.b = self._compute_bias(Ei, Ej, i, j, alpha_i_old, alpha_j_old)
                    num_changed_alphas += 1
            if num_changed_alphas == 0:
                break

    def predict(self, X):
        return np.sign(np.array([self._decision(x) for x in X]))

    def _decision(self, x):
        # f(x) = sum_j alpha_j * y_j * <x_j, x> + b
        return np.sum(self.alpha * self.y * self.X.dot(x)) + self.b

    def _select_random_j(self, i, n_samples):
        j = i
        while j == i:
            j = np.random.randint(0, n_samples)
        return j

    def _clip_alpha(self, alpha, H, L):
        return max(min(alpha, H), L)

    def _calculate_L_H(self, alpha_i, alpha_j, y_i, y_j):
        # Box constraints keeping the pair of multipliers feasible
        if y_i != y_j:
            L = max(0, alpha_j - alpha_i)
            H = min(self.C, self.C + alpha_j - alpha_i)
        else:
            L = max(0, alpha_i + alpha_j - self.C)
            H = min(self.C, alpha_i + alpha_j)
        return L, H

    def _compute_bias(self, Ei, Ej, i, j, alpha_i_old, alpha_j_old):
        xi, xj, yi, yj = self.X[i], self.X[j], self.y[i], self.y[j]
        b1 = self.b - Ei - yi * (self.alpha[i] - alpha_i_old) * np.dot(xi, xi) \
             - yj * (self.alpha[j] - alpha_j_old) * np.dot(xi, xj)
        b2 = self.b - Ej - yi * (self.alpha[i] - alpha_i_old) * np.dot(xi, xj) \
             - yj * (self.alpha[j] - alpha_j_old) * np.dot(xj, xj)
        if 0 < self.alpha[i] < self.C:
            return b1
        elif 0 < self.alpha[j] < self.C:
            return b2
        return (b1 + b2) / 2
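
A short usage sketch on a linearly separable toy problem (the data here is illustrative; labels must be coded as ±1 for this formulation):

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

model = SVM(C=1.0, max_iter=200)
model.fit(X, y)
print(model.predict(np.array([[2.0, 2.0], [7.0, 7.0]])))  # expected: [-1.  1.]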
Viva Questions
1. What is machine learning?
2. Define supervised learning.
3. Define unsupervised learning.
4. Define semi-supervised learning.
5. Define reinforcement learning.
6. What do you mean by hypotheses?
7. What is classification?
8. What is clustering?
9. Define precision, accuracy and recall.
10. Define entropy.
11. Define regression.
12. How is KNN different from k-means clustering?
13. What is concept learning?
14. Define specific boundary and general boundary.
15. Define target function.
16. Define decision tree.
17. What is ANN?
18. Explain gradient descent approximation.
19. State Bayes theorem.
20. Define Bayesian belief networks.
21. Differentiate hard and soft clustering.
22. Define variance.
23. What is inductive machine learning?
24. Why is the K nearest neighbour algorithm a lazy learning algorithm?
25. Why is naïve Bayes "naïve"?
26. Mention classification algorithms.
27. Define pruning.
28. Differentiate clustering and classification.
29. Mention clustering algorithms.
30. Define bias.
