LABORATORY MANUAL
Machine learning
Machine learning is a subset of artificial intelligence in the field of computer science that often
uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve
performance on a specific task) with data, without being explicitly programmed. In the past
decade, machine learning has given us self-driving cars, practical speech recognition, effective
web search, and a vastly improved understanding of the human genome.
Supervised learning: The computer is presented with example inputs and their desired outputs,
given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. As
special cases, the input signal can be only partially available, or restricted to special feedback:
Semi-supervised learning: the computer is given only an incomplete training signal: a training
set with some (often many) of the target outputs missing.
Active learning: the computer can only obtain training labels for a limited set of instances (based
on a budget) and must also optimize its choice of objects to acquire labels for. When used
interactively, the instances to be labeled can be presented to the user for labeling.
Reinforcement learning: training data (in form of rewards and punishments) is given only as
feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing
a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to
find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
In clustering, a set of inputs is to be divided into groups. Unlike in classification, the groups are
not known beforehand, making this typically an unsupervised task. Density estimation finds the
distribution of inputs in some space. Dimensionality reduction simplifies inputs by mapping
them into a lower-dimensional space. Topic modeling is a related problem, where a program is
given a list of human language documents and is tasked with finding out which documents
cover similar topics.
Decision tree learning: Decision tree learning uses a decision tree as a predictive model, which maps
observations about an item to conclusions about the item's target value.
Association rule learning: Association rule learning is a method for discovering interesting relations
between variables in large databases.
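As a quick illustration (not one of the prescribed lab programs), a decision tree classifier can be trained and queried in a few lines with scikit-learn; the tiny encoded data set below is made up for the example:
from sklearn.tree import DecisionTreeClassifier

# Encoded toy data: [outlook, windy] -> play (0 = no, 1 = yes); values are illustrative
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [1, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(criterion='entropy')   # ID3-style entropy splitting criterion
tree.fit(X, y)
print(tree.predict([[0, 0], [2, 1]]))                # predicted play/not-play for two new days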
An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is
a learning algorithm that is vaguely inspired by biological neural networks. Computations are
structured in terms of an interconnected group of artificial neurons, processing information using
a connectionist approach to computation. Modern neural networks are non-linear statistical data
modeling tools. They are usually used to model complex relationships between inputs and
outputs, to find patterns in data, or to capture the statistical structure in an unknown joint
probability distribution between observed variables.
Deep learning
Falling hardware prices and the development of GPUs for personal use in the last few years
have contributed to the development of the concept of deep learning which consists of multiple
hidden layers in an artificial neural network. This approach tries to model the way the human
brain processes light and sound into vision and hearing. Some successful applications of deep
learning are computer vision and speech recognition.
Support vector machines (SVMs) are a set of related supervised learning methods used for
classification and regression. Given a set of training examples, each marked as belonging to one
of two categories, an SVM training algorithm builds a model that predicts whether a new
example falls into one category or the other.
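For instance, a linear SVM can be fit on a toy two-class problem with scikit-learn; the data points and parameters below are illustrative only:
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: points near (0, 0) labelled 0, points near (3, 3) labelled 1
X = np.array([[0, 0], [0.5, 0.5], [0.2, 0.8], [3, 3], [3.5, 2.5], [2.8, 3.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)              # linear kernel, default regularization
clf.fit(X, y)
print(clf.predict([[0.3, 0.4], [3.1, 2.9]]))   # expected: [0 1]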
Clustering
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that
observations within the same cluster are similar according to some pre-designated criterion or
criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated for example by internal compactness (similarity between
members of the same cluster) and separation between different clusters. Other methods are based
on estimated density and graph connectivity. Clustering is a method of unsupervised learning,
and a common technique for statistical data analysis.
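A minimal k-means sketch with scikit-learn, on made-up two-dimensional points (the choice of two clusters is part of the example, not a general rule):
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the two learned centroids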
Bayesian networks
A Bayesian network is a probabilistic graphical model that represents a set of random variables
and their conditional dependencies via a directed acyclic graph. For example, a Bayesian network
can represent the probabilistic relationships between diseases and symptoms: given symptoms, the
network can be used to compute the probabilities of the presence of various diseases.
Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an environment
so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt
to find a policy that maps states of the world to the actions the agent ought to take in those states.
Reinforcement learning differs from the supervised learning problem in that correct input/output
pairs are never presented, nor sub-optimal actions explicitly corrected.
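The core loop can be sketched with tabular Q-learning on a toy chain environment; every name and number below is illustrative and not part of the manual's experiments:
import random

# Toy 1-D chain: states 0..4, actions 0 = left, 1 = right; reaching state 4 gives reward 1
n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount factor, exploration rate

def step(state, action):
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    while state != n_states - 1:
        if random.random() < epsilon:
            action = random.randrange(n_actions)                        # explore
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])   # exploit
        next_state, reward = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)   # after training, the "right" action should dominate in every state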
Genetic algorithms
A genetic algorithm (GA) is a search heuristic that mimics the process of natural selection, and
uses methods such as mutation and crossover to generate new genotypes in the hope of finding
good solutions to a given problem. In machine learning, genetic algorithms found some uses in
the 1980s and 1990s. Conversely, machine learning techniques have been used to improve the
performance of genetic and evolutionary algorithms.
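A minimal genetic-algorithm sketch on the standard "OneMax" toy problem (evolving bit strings to maximize the number of ones); all parameters below are illustrative:
import random

def fitness(individual):
    # OneMax: the more 1-bits, the fitter the individual
    return sum(individual)

def crossover(a, b):
    point = random.randint(1, len(a) - 1)       # single-point crossover
    return a[:point] + b[point:]

def mutate(individual, rate=0.01):
    return [1 - bit if random.random() < rate else bit for bit in individual]

length, pop_size, generations = 20, 30, 50
population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

for _ in range(generations):
    # Selection: keep the fitter half, then refill with mutated offspring of random parents
    population.sort(key=fitness, reverse=True)
    parents = population[:pop_size // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(pop_size - len(parents))]
    population = parents + children

print(max(fitness(ind) for ind in population))   # approaches 20 as the GA converges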
Rule-based machine learning
Rule-based machine learning is a general term for any machine learning method that identifies,
learns, or evolves "rules" to store, manipulate or apply, knowledge. The defining characteristic
of a rule-based machine learner is the identification and utilization of a set of relational rules that
collectively represent the knowledge captured by the system. This is in contrast to other machine
learners that commonly identify a singular model that can be universally applied to any instance
in order to make a prediction. Rule-based machine learning approaches include learning
classifier systems, association rule learning, and artificial immune systems.
Feature selection is the process of selecting an optimal subset of relevant features for use in
model construction. It is assumed the data contains some features that are either redundant or
irrelevant, and can thus be removed to reduce calculation cost without incurring much loss of
information. Common optimality criteria include accuracy, similarity and information measures.
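For example, scikit-learn's SelectKBest scores each feature with a chosen criterion and keeps the top k; the iris data and k = 2 below are illustrative:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)

print(X.shape, '->', X_new.shape)   # (150, 4) -> (150, 2)
print(selector.scores_)             # chi-squared score of each original feature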
MACHINE LEARNING LABORATORY
[As per Choice Based Credit System (CBCS) scheme]
CREDITS – 02
Lab Experiments:
2. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of
all hypotheses consistent with the training examples.
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge
to classify a new sample.
5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
6. Assuming a set of documents that need to be classified, use the naïve Bayesian
Classifier model to perform this task. Built-in Java classes/API can be used to write
the program. Calculate the accuracy, precision, and recall for your data set.
7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
8. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data
set for clustering using k-Means algorithm. Compare the results of these two
algorithms and comment on the quality of clustering. You can add Java/Python ML
library classes/API in the program.
9. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes can
be used for this problem.
10. Implement the non-parametric Locally Weighted Regression algorithm to fit data
points. Select the appropriate data set for your experiment and draw graphs.
2. For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Candidate-Elimination algorithm to output a description of the set of all hypotheses
consistent with the training examples.
import csv
class Holder:
def __init__(self, attr):
self.factors = {i: [] for i in attr}
self.attributes = attr
class CandidateElimination:
def __init__(self, data, holder):
self.dataset = data
self.num_factors = len(data[0][0])
self.factors = holder.factors
self.attributes = holder.attributes
    def run_algorithm(self):
        # Version-space search: S starts maximally specific, G maximally general.
        G = self.initializeG()
        S = self.initializeS()
        for trial_set in self.dataset:          # each element is (attributes, label)
            if trial_set[1] == 'Y':             # positive example: generalize S
                G = [g for g in G if self.consistent(g, trial_set[0])]
                S_new = S[:]
                for s in S:
                    if not self.consistent(s, trial_set[0]):
                        S_new.remove(s)
                        generalization = self.generalize_inconsistent_S(s, trial_set[0])
                        if self.get_general(generalization, G):
                            S_new.append(generalization)
                S = self.remove_more_general(S_new)
            else:                               # negative example: specialize G
                S = self.remove_inconsistent_S(S, trial_set[0])
                G_new = G[:]
                for g in G:
                    if self.consistent(g, trial_set[0]):
                        G_new.remove(g)
                        specializations = self.specialize_inconsistent_G(g, trial_set[0])
                        G_new.extend(self.get_specific(specializations, S))
                G = self.remove_more_specific(G_new)
        # The helper methods used above (consistent, generalize_inconsistent_S, get_general,
        # remove_more_general, remove_inconsistent_S, specialize_inconsistent_G,
        # get_specific, remove_more_specific) are not reproduced in this listing.
        print("Final S:", S)
        print("Final G:", G)
def initializeS(self):
return [tuple(['-' for _ in range(self.num_factors)])]
def initializeG(self):
return [tuple(['?' for _ in range(self.num_factors)])]
dataset = [
(('sunny', 'warm', 'normal', 'strong', 'warm', 'same'), 'Y'),
(('sunny', 'warm', 'high', 'strong', 'warm', 'same'), 'Y'),
(('rainy', 'cold', 'high', 'strong', 'warm', 'change'), 'N'),
(('sunny', 'warm', 'high', 'strong', 'cool', 'change'), 'Y')
]
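The class above omits its consistency and generalization helpers. For reference, the following compact, self-contained sketch applies the same version-space idea directly to the dataset defined above; the function and variable names in this sketch are illustrative, not part of the original listing:
def candidate_elimination(examples):
    # examples: list of (attribute_tuple, label) pairs, with labels 'Y' / 'N'
    n = len(examples[0][0])
    S = ['0'] * n                 # most specific hypothesis ('0' = nothing seen yet)
    G = [['?'] * n]               # set of maximally general hypotheses

    def covers(h, x):
        return all(hv in ('?', xv) for hv, xv in zip(h, x))

    for x, label in examples:
        if label == 'Y':
            # Positive example: generalize S minimally so it covers x,
            # and drop members of G that reject x.
            for i in range(n):
                if S[i] == '0':
                    S[i] = x[i]
                elif S[i] != x[i]:
                    S[i] = '?'
            G = [g for g in G if covers(g, x)]
        else:
            # Negative example: replace each member of G that covers x with its
            # minimal specializations drawn from S (assumes a positive example
            # has already been seen, as in the dataset above).
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == '?' and S[i] not in ('?', x[i]):
                        h = g[:]
                        h[i] = S[i]
                        new_G.append(h)
            G = new_G
    return S, G

S, G = candidate_elimination(dataset)
print("Final specific boundary S:", S)
print("Final general boundary G:", G)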
3. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
import numpy as np
import math
import csv
class Node:
def __init__(self, attribute):
self.attribute = attribute
self.children = []
self.answer = ""
def __str__(self):
return self.attribute
def entropy(S):
_, counts = np.unique(S, return_counts=True)
probabilities = counts / len(S)
return -sum(prob * math.log2(prob) for prob in probabilities)
def gain_ratio(data, col):
    # Gain ratio of splitting on column `col`; `subtables` (defined below)
    # partitions the rows of `data` by the values appearing in that column.
    total_size = data.shape[0]
    items, sub_dict = subtables(data, col, delete=False)
    total_entropy = entropy(data[:, -1])
    sub_entropies = np.array(
        [entropy(sub_dict[item][:, -1]) * (sub_dict[item].shape[0] / total_size)
         for item in items])
    intrinsic_value = -sum(
        (sub_dict[item].shape[0] / total_size) *
        math.log2(sub_dict[item].shape[0] / total_size) for item in items)
    return (total_entropy - sum(sub_entropies)) / intrinsic_value

def create_node(data, metadata):
    # Leaf node: all remaining examples carry the same label
    if np.unique(data[:, -1]).shape[0] == 1:
        node = Node("")
        node.answer = np.unique(data[:, -1])[0]
        return node
    # Otherwise split on the attribute with the highest gain ratio
    gains = np.array([gain_ratio(data, col) for col in range(data.shape[1] - 1)])
    best_col = np.argmax(gains)
    node = Node(metadata[best_col])
    items, sub_dict = subtables(data, best_col, delete=True)
    new_metadata = np.delete(metadata, best_col, 0)
    for item in items:
        child = create_node(sub_dict[item], new_metadata)
        node.children.append((item, child))
    return node
def read_data(filename):
with open(filename, 'r') as csvfile:
datareader = csv.reader(csvfile, delimiter=',')
metadata = next(datareader)
traindata = [row for row in datareader]
return metadata, traindata
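# The program above relies on a few helpers and a driver that are not shown.
# A minimal sketch of them, reusing the names assumed above (subtables, print_tree):

def subtables(data, col, delete):
    # Partition the rows of `data` by the values appearing in column `col`.
    items = np.unique(data[:, col])
    sub_dict = {}
    for item in items:
        rows = data[data[:, col] == item]
        if delete:
            rows = np.delete(rows, col, 1)   # drop the attribute already used for the split
        sub_dict[item] = rows
    return items, sub_dict

def print_tree(node, level):
    # Recursively print the decision tree with indentation per level.
    if node.answer != "":
        print("  " * level, node.answer)
        return
    print("  " * level, node.attribute)
    for value, child in node.children:
        print("  " * (level + 1), value)
        print_tree(child, level + 2)

# Driver: build the tree from Tennis.csv and print it
metadata, traindata = read_data("Tennis.csv")
data = np.array(traindata)
node = create_node(data, np.array(metadata))
print_tree(node, 0)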
Tennis.csv
outlook,temperature,humidity,wind,answer
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no
Output
outlook
  overcast
    yes
  rain
    wind
      strong
        no
      weak
        yes
  sunny
    humidity
      high
        no
      normal
        yes
4. Build an Artificial Neural Network by implementing the Backpropagation
algorithm and test the same using appropriate data sets.
import numpy as np

# Dataset: [hours slept, hours studied] -> test score
X = np.array([[2, 9], [1, 5], [3, 6]], dtype=float)
y = np.array([[92], [86], [89]], dtype=float)

# Normalize inputs and outputs to the [0, 1] range
X = X / np.amax(X, axis=0)
y = y / 100

# Sigmoid activation and its derivative (in terms of the activation value)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivatives_sigmoid(x):
    return x * (1 - x)

# Variable initialization
epoch = 7000                       # Training iterations
lr = 0.1                           # Learning rate
inputlayer_neurons = X.shape[1]    # Number of features
hiddenlayer_neurons = 3            # Number of hidden layer neurons
output_neurons = 1                 # Number of output layer neurons

# Weight and bias initialization
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

# Training algorithm
for i in range(epoch):
    # Forward propagation
    hinp = np.dot(X, wh) + bh
    hlayer_act = sigmoid(hinp)
    outinp = np.dot(hlayer_act, wout) + bout
    output = sigmoid(outinp)

    # Backpropagation
    EO = y - output                        # Error at the output layer
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)              # Error propagated to the hidden layer
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hidden = EH * hiddengrad

    # Weight and bias updates
    wout += hlayer_act.T.dot(d_output) * lr
    bout += np.sum(d_output, axis=0, keepdims=True) * lr
    wh += X.T.dot(d_hidden) * lr
    bh += np.sum(d_hidden, axis=0, keepdims=True) * lr

# Output
print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n", output)
output
Input:
[[ 0.66666667 1. ]
[ 0.33333333 0.55555556]
[ 1. 0.66666667]]
Actual Output:
[[ 0.92]
[ 0.86]
[ 0.89]]
Predicted Output:
[[ 0.89559591]
[ 0.88142069]
[ 0.8928407 ]]
5. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
import csv
import random
import math
def load_csv(filename):
with open(filename, "r") as file:
lines = csv.reader(file)
dataset = list(lines)
for i in range(len(dataset)):
dataset[i] = [float(x) for x in dataset[i]]
return dataset
def separate_by_class(dataset):
separated = {}
for i in range(len(dataset)):
vector = dataset[i]
if vector[-1] not in separated:
separated[vector[-1]] = []
separated[vector[-1]].append(vector)
return separated
def mean(numbers):
return sum(numbers) / float(len(numbers))
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = {}
    for class_value, instances in separated.items():
        summaries[class_value] = summarize(instances)
    return summaries

def calculate_probability(x, mean, stdev):
    # Gaussian probability density function
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent
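# The remaining steps (train/test split, class-conditional probabilities, prediction,
# and accuracy) are not shown in the original listing; a minimal sketch, reusing the
# helper names assumed by main() below:

def split_dataset(dataset, split_ratio):
    # Randomly split the rows into a training set and a test set.
    train_size = int(len(dataset) * split_ratio)
    copy = list(dataset)
    random.shuffle(copy)
    return copy[:train_size], copy[train_size:]

def calculate_class_probabilities(summaries, input_vector):
    # Multiply the per-attribute Gaussian likelihoods for each class.
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = 1
        for i in range(len(class_summaries)):
            mean_val, stdev_val = class_summaries[i]
            probabilities[class_value] *= calculate_probability(
                input_vector[i], mean_val, stdev_val)
    return probabilities

def predict(summaries, input_vector):
    # Choose the class with the highest score.
    probabilities = calculate_class_probabilities(summaries, input_vector)
    return max(probabilities, key=probabilities.get)

def get_predictions(summaries, test_set):
    return [predict(summaries, row) for row in test_set]

def get_accuracy(test_set, predictions):
    correct = sum(1 for i in range(len(test_set))
                  if test_set[i][-1] == predictions[i])
    return (correct / float(len(test_set))) * 100.0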
def main():
    filename = '5data.csv'
    split_ratio = 0.67
    dataset = load_csv(filename)
    training_set, test_set = split_dataset(dataset, split_ratio)
    print('Split {0} rows into train={1} and test={2} rows'.format(
        len(dataset), len(training_set), len(test_set)))
    summaries = summarize_by_class(training_set)
    predictions = get_predictions(summaries, test_set)
    print('Accuracy: {0}%'.format(get_accuracy(test_set, predictions)))

main()
Output
Confusion matrix is as follows:
[[17  0  0]
 [ 0 17  0]
 [ 0  0 11]]
Accuracy metrics
precision recall f1-score support
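7. Write a program to construct a Bayesian network considering medical data. Use this
model to demonstrate the diagnosis of heart patients using standard Heart Disease
Data Set. You can use Java/Python ML library classes/API.
The conditional probability tables below follow the classic "asia" network example. The listing omits the imports and the prior distributions of the two root nodes; a minimal sketch of that setup, assuming the pomegranate library (pre-1.0 API):
from pomegranate import (BayesianNetwork, DiscreteDistribution,
                         ConditionalProbabilityTable, State)

# Priors for the two root nodes (the 0.5/0.5 values here are illustrative)
asia = DiscreteDistribution({'True': 0.5, 'False': 0.5})
smoking = DiscreteDistribution({'True': 0.5, 'False': 0.5})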
tuberculosis = ConditionalProbabilityTable(
[['True', 'True', 0.2],
['True', 'False', 0.8],
['False', 'True', 0.01],
['False', 'False', 0.99]], [asia])
lung = ConditionalProbabilityTable(
[['True', 'True', 0.75],
['True', 'False', 0.25],
['False', 'True', 0.02],
['False', 'False', 0.98]], [smoking])
bronchitis = ConditionalProbabilityTable(
[['True', 'True', 0.92],
['True', 'False', 0.08],
['False', 'True', 0.03],
['False', 'False', 0.97]], [smoking])
tuberculosis_or_cancer = ConditionalProbabilityTable(
[['True', 'True', 'True', 1.0],
['True', 'True', 'False', 0.0],
['True', 'False', 'True', 1.0],
['True', 'False', 'False', 0.0],
['False', 'True', 'True', 1.0],
['False', 'True', 'False', 0.0],
['False', 'False', 'True', 0.0],
['False', 'False', 'False', 1.0]], [tuberculosis, lung])
xray = ConditionalProbabilityTable(
[['True', 'True', 0.885],
['True', 'False', 0.115],
['False', 'True', 0.04],
['False', 'False', 0.96]], [tuberculosis_or_cancer])
dyspnea = ConditionalProbabilityTable(
[['True', 'True', 'True', 0.96],
['True', 'True', 'False', 0.04],
['True', 'False', 'True', 0.89],
['True', 'False', 'False', 0.11],
['False', 'True', 'True', 0.96],
['False', 'True', 'False', 0.04],
['False', 'False', 'True', 0.89],
['False', 'False', 'False', 0.11]], [tuberculosis_or_cancer, bronchitis])
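# Wrap each distribution in a State and assemble the network (a sketch assuming
# pomegranate's pre-1.0 API; the state names, including 'tuberculosis', must match
# the keys later used in predict_proba).
s0 = State(asia, name="asia")
s1 = State(tuberculosis, name="tuberculosis")
s2 = State(smoking, name="smoker")
s3 = State(lung, name="cancer")
s4 = State(bronchitis, name="bronchitis")
s5 = State(tuberculosis_or_cancer, name="TvC")
s6 = State(xray, name="xray")
s7 = State(dyspnea, name="dyspnea")

network = BayesianNetwork("asia")
network.add_states(s0, s1, s2, s3, s4, s5, s6, s7)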
# Add edges
network.add_edge(s0, s1)
network.add_edge(s2, s3)
network.add_edge(s2, s4)
network.add_edge(s1, s5)
network.add_edge(s3, s5)
network.add_edge(s5, s6)
network.add_edge(s5, s7)
network.add_edge(s4, s7)
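# Finalize the network structure before running inference
network.bake()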
# Make a prediction
print(network.predict_proba({'tuberculosis': 'True'}))
8. Apply the EM algorithm to cluster a set of data stored in a .CSV file. Use the same
data set for clustering using k-Means algorithm. Compare the results of these two
algorithms and comment on the quality of clustering. You can add Java/Python ML
library classes/API in the program.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Generate sample data: four well-separated blobs (parameters chosen for illustration)
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# Fit a Gaussian mixture model (trained with the EM algorithm)
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)

# Predict per-component membership probabilities
probs = gmm.predict_proba(X)
print(probs[:5].round(3))

# Plot the points coloured by their cluster assignment
ax = plt.gca()
ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
ax.axis('equal')
plt.show()
Output
[[1, 0, 0, 0]
 [0, 0, 1, 0]
 [1, 0, 0, 0]
 [1, 0, 0, 0]
 [1, 0, 0, 0]]
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
# Load data
data = pd.read_csv("kmeansdata.csv")
df1 = pd.DataFrame(data)
print(df1)
# Extract features
f1 = df1['Distance_Feature'].values
f2 = df1['Speeding_Feature'].values
X = np.array(list(zip(f1, f2)))

# Fit k-Means with three clusters (the number of clusters is chosen for illustration)
kmeans_model = KMeans(n_clusters=3).fit(X)

# Plot each point with the colour/marker of its assigned cluster
colors = ['b', 'g', 'r']
markers = ['o', 'v', 's']
for i, l in enumerate(kmeans_model.labels_):
    plt.scatter(f1[i], f2[i], color=colors[l], marker=markers[l])
plt.xlim([0, 100])
plt.ylim([0, 50])
plt.title('K-means Clustering')
plt.ylabel('Speeding_Feature')
plt.xlabel('Distance_Feature')
plt.show()
Driver_ID,Distance_Feature,Speeding_Feature
3423311935,71.24,28
3423313212,52.53,25
3423313724,64.54,27
3423311373,55.69,22
3423310999,54.58,25
3423313857,41.91,10
3423312432,58.64,20
3423311434,52.02,8
3423311328,31.25,34
3423312488,44.31,19
3423311254,49.35,40
3423312943,58.07,45
3423312536,44.22,22
3423311542,55.73,19
3423312176,46.63,43
3423314176,52.97,32
3423314202,46.25,35
3423311346,51.55,27
3423310666,57.05,26
3423313527,58.45,30
3423312182,43.42,23
3423313590,55.68,37
3423312268,55.15,18
9. Write a program to implement the k-Nearest Neighbour algorithm to classify the iris
data set. Print both correct and wrong predictions. Java/Python ML library classes can
be used for this problem.
import csv
import random
import math
import operator
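# The helpers below are not shown in the original listing; a minimal sketch, reusing
# the names called from main() (loadDataset, getNeighbors, getAccuracy) and assuming
# the standard comma-separated iris.data file:

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
    for x in range(len(dataset)):
        if not dataset[x]:
            continue                      # skip blank lines
        for y in range(4):
            dataset[x][y] = float(dataset[x][y])
        if random.random() < split:
            trainingSet.append(dataset[x])
        else:
            testSet.append(dataset[x])

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # Return the k training instances closest to testInstance.
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    return [distances[x][0] for x in range(k)]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0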
def getResponse(neighbors):
classVotes = {}
for x in range(len(neighbors)):
response = neighbors[x][-1]
if response in classVotes:
classVotes[response] += 1
else:
classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
return sortedVotes[0][0]
def main():
trainingSet = []
testSet = []
split = 0.67
loadDataset('iris.data', split, trainingSet, testSet)
print('Train set: ' + repr(len(trainingSet)))
print('Test set: ' + repr(len(testSet)))
predictions = []
k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
if __name__ == "__main__":
main()
OUTPUT
Confusion matrix is as follows
[[11 0 0]
[0 9 1]
[0 1 8]]
Accuracy metrics
10. Implement and demonstrate the working of the SVM algorithm for classification.
import numpy as np

class SVM:
    """Linear SVM trained with a simplified SMO (Sequential Minimal Optimization) procedure."""

    def __init__(self, C=1.0, tol=0.001, max_iter=100):
        self.C = C                 # regularization parameter
        self.tol = tol             # tolerance for the KKT check
        self.max_iter = max_iter   # maximum number of passes over the data
        self.alpha = None          # Lagrange multipliers (one per training sample)
        self.b = 0.0               # bias term

    def fit(self, X, y):
        # y is expected to contain the labels -1 and +1
        self.X, self.y = X, y
        n_samples = X.shape[0]
        self.alpha = np.zeros(n_samples)
        for _ in range(self.max_iter):
            num_changed_alphas = 0
            for i in range(n_samples):
                Ei = self._predict_one(X[i]) - y[i]
                if (y[i] * Ei < -self.tol and self.alpha[i] < self.C) or \
                   (y[i] * Ei > self.tol and self.alpha[i] > 0):
                    j = self._select_random_j(i, n_samples)
                    Ej = self._predict_one(X[j]) - y[j]
                    alpha_i_old, alpha_j_old = self.alpha[i], self.alpha[j]
                    L, H = self._calculate_L_H(self.alpha[i], self.alpha[j], y[i], y[j])
                    if L == H:
                        continue
                    eta = 2.0 * np.dot(X[i], X[j]) - np.dot(X[i], X[i]) - np.dot(X[j], X[j])
                    if eta >= 0:
                        continue
                    # Update alpha_j along the constraint line, then clip it into [L, H]
                    self.alpha[j] -= y[j] * (Ei - Ej) / eta
                    self.alpha[j] = self._clip_alpha(self.alpha[j], H, L)
                    if abs(self.alpha[j] - alpha_j_old) < 1e-5:
                        continue
                    # Update alpha_i by the same amount in the opposite direction
                    self.alpha[i] += y[i] * y[j] * (alpha_j_old - self.alpha[j])
                    self.b = self._compute_bias(i, j, Ei, Ej, alpha_i_old, alpha_j_old)
                    num_changed_alphas += 1
            if num_changed_alphas == 0:
                break
        return self

    def predict(self, X):
        return np.sign([self._predict_one(x) for x in X])

    def _predict_one(self, x):
        # Decision value f(x) = sum_i alpha_i * y_i * <x_i, x> + b (linear kernel)
        return np.sum(self.alpha * self.y * self.X.dot(x)) + self.b

    def _select_random_j(self, i, n_samples):
        j = np.random.randint(n_samples)
        while j == i:
            j = np.random.randint(n_samples)
        return j

    def _calculate_L_H(self, alpha_i, alpha_j, yi, yj):
        # Bounds that keep both multipliers inside the box [0, C]
        if yi != yj:
            return max(0, alpha_j - alpha_i), min(self.C, self.C + alpha_j - alpha_i)
        return max(0, alpha_i + alpha_j - self.C), min(self.C, alpha_i + alpha_j)

    def _clip_alpha(self, alpha, H, L):
        return max(L, min(alpha, H))

    def _compute_bias(self, i, j, Ei, Ej, alpha_i_old, alpha_j_old):
        xi, xj, yi, yj = self.X[i], self.X[j], self.y[i], self.y[j]
        b1 = (self.b - Ei - yi * (self.alpha[i] - alpha_i_old) * np.dot(xi, xi)
              - yj * (self.alpha[j] - alpha_j_old) * np.dot(xi, xj))
        b2 = (self.b - Ej - yi * (self.alpha[i] - alpha_i_old) * np.dot(xi, xj)
              - yj * (self.alpha[j] - alpha_j_old) * np.dot(xj, xj))
        if 0 < self.alpha[i] < self.C:
            return b1
        elif 0 < self.alpha[j] < self.C:
            return b2
        return (b1 + b2) / 2
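# A small illustrative driver (not part of the original listing); the toy data below is
# linearly separable and the labels must be -1/+1.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
model = SVM(C=1.0).fit(X, y)
print(model.predict(X))   # prints the predicted labels for the training points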
Viva Questions
1. What is machine learning?
2. Define supervised learning
3. Define unsupervised learning
4. Define semi supervised learning
5. Define reinforcement learning
6. What do you mean by a hypothesis?
7. What is classification?
8. What is clustering?
9. Define precision, accuracy and recall
10. Define entropy
11. Define regression
12. How is k-NN different from k-means clustering?