AI ML Lab Manual
JSS MAHAVIDYAPEETHA
JSS ACADEMY OF TECHNICAL EDUCATION
Department of Information Science and Engineering
JSSATE Campus, Dr. Vishnuvardhan Road, Bangalore – 560 060
Phone: 080-28611902, 28612797 Fax: 080-28612706 www.jssateb.ac.in
VII SEMESTER
[As per Choice Based Credit System (CBCS) scheme]
(Effective from the academic year 2018 -2019)
Artificial Intelligence and Machine Learning Laboratory
[18CSL76]
Compiled By:
VISION
To emerge as a centre for achieving academic excellence, by producing competent professionals
to meet the global challenges in the field of Information science and Technology
MISSION
M1: To prepare the students as competent professionals to meet the advancements in the
industry and academia by imparting quality technical education.
M2: To enrich the technical ability of students to face the world with confidence, commitment
and teamwork.
M3: To inculcate and practice strong techno-ethical values to serve the society.
PEO3: To engage in research and development leading to new innovations and products.
PSO1: Apply the mathematical concepts for solving engineering problems by using appropriate
Programming constructs.
PSO3: Demonstrate the knowledge towards the domain specific initiatives of Information
Science and Engineering
SEMESTER – VII
Subject Code 18CSL76 IA Marks 40
Number of Lecture Hours/Week 01I + 02P Exam Marks 60
Total Number of Lecture Hours 40 Exam Hours 03
CREDITS – 02
Laboratory Experiments:
1. Implement A* algorithm.
2. Implement AO* algorithm.
3. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of
all hypotheses consistent with the training examples.
4. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
5. Build an Artificial Neural Network by implementing the Back propagation algorithm
and test the same using appropriate data sets.
6. Write a program to implement the naïve Bayesian classifier for a sample training data
set stored as a .CSV file. Compute the accuracy of the classifier, considering a few test
data sets.
7. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can add Java/Python ML library
classes/API in the program.
8. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data
set. Print both correct and wrong predictions. Java/Python ML library classes can be
used for this problem.
9. Implement the non-parametric Locally Weighted Regression algorithm in order to fit
data points. Select appropriate data set for your experiment and draw graphs.
Laboratory Outcomes:
1. Implement and demonstrate AI and ML algorithms
2. Evaluate different algorithms
Note: In the examination, each student picks one question by lot from among all the
questions.
COURSE OUTCOMES
SUBJECT: Artificial Intelligence and Machine Learning Laboratory
SUB CODE: 18CSL76 SEM:7 COURSE CODE: C413
CO-PO Mapping:

CO       PO1  PO2  PO3   PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
C413.1    2    -    1     -    3    -    -    1    1    1     -     1
C413.2    2    1    2     1    3    1    -    2    1    1     -     2
C413.3    2    1    2     1    3    1    -    1    1    1     -     2
AVG       2    1    1.67  1    3    1    -    1.3  1    1     -     1.67

CO-PSO Mapping:

CO       PSO1  PSO2  PSO3
C413.1    2     -     -
C413.2    2     1     1
C413.3    2     1     1
AVG       2     1     1

COs: Course Outcomes, PSOs: Program Specific Outcomes, CRLs: Correlation Levels
CONTENTS
3. Overview of AI and ML
4. About Dataset
Program-1: A* Algorithm
6. Viva Questions
➢ Processors:
o Intel® Core™ i5 processor 4300M at 2.60 GHz or 2.59 GHz (1 socket, 2 cores, 2 threads
per core), 8 GB of DRAM
o Intel® Xeon® processor E5-2698 v3 at 2.30 GHz (2 sockets, 16 cores each, 1 thread per
core), 64 GB of DRAM
o Intel® Xeon Phi™ processor 7210 at 1.30 GHz (1 socket, 64 cores, 4 threads per core), 32
GB of DRAM, 16 GB of MCDRAM (flat mode enabled)
➢ Disk space: 2 to 3 GB
➢ Operating systems: Windows® 10, macOS*, and Linux*
1.3 Software
➢ Ubuntu*: Python 3.6.2, PIP, NumPy 1.13.1, scikit-learn 0.18.2 (installed with PIP)
➢ Windows: Python 3.6.2, PIP, NumPy 1.13.1, scikit-learn 0.18.2
➢ Intel® Distribution for Python* 2018
Weka, CNTK, KNIME, RapidMiner, Deeplearning4j, R, Mahout, H2O, GNU Octave, MOA (Massive Online
Analysis), Tanagra, Orange, Python, Shogun, TensorFlow, Torch, etc.
3. Overview of AI and ML
Machine learning is a branch of artificial intelligence that allows computer systems to learn directly from
examples, data, and experience. Through enabling computers to perform specific tasks intelligently,
machine learning systems can carry out complex processes by learning from data, rather than following
pre-programmed rules.
Recent years have seen exciting advances in machine learning, which have raised its capabilities across a
suite of applications. Increasing data availability has allowed machine learning systems to be trained on
a large pool of examples, while increasing computer processing power has supported the analytical
capabilities of these systems. Within the field itself there have also been algorithmic advances, which
have given machine learning greater power. As a result of these advances, systems which only a few
years ago performed at noticeably below-human levels can now outperform humans at some specific
tasks.
Many people now interact with systems based on machine learning every day, for example in image
recognition systems, such as those used on social media; voice recognition systems, used by virtual
personal assistants; and recommender systems, such as those used by online retailers. As the field
develops further, machine learning shows promise of supporting potentially transformative advances in
a range of areas, and the social and economic opportunities which follow are significant. In healthcare,
machine learning is creating systems that can help doctors give more accurate or effective diagnoses for
certain conditions. In transport, it is supporting the development of autonomous vehicles, and helping to
make existing transport networks more efficient. For public services it has the potential to target
support more effectively to those in need, or to tailor services to users. And in science, machine learning
is helping to make sense of the vast amount of data available to researchers today, offering new insights
into biology, physics, medicine, the social sciences, and more.
The word ‘Machine’ in Machine Learning means computer, as you would expect. So how does a machine
learn?
Given data, we can do all kinds of magic with statistics, and so can computer algorithms. These
algorithms can solve problems including prediction, classification and clustering. A machine
learning algorithm will learn from new data.
Definition: "A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E."
Data
The experiences that are used to improve performance in the task
Measure of improvement
How the improvement is measured - for example, new skills that were not present initially, increasing
accuracy in prediction, or improved speed
Data: Facts and statistics collected together for reference or analysis. It is a set of values of
qualitative or quantitative variables: numbers, characters or images that designate an attribute
of a phenomenon.
Facts :
• Facts are statements that can be proved true or false.
• Facts tell what actually happened.
• Facts tell what is happening now.
• Facts state something that can be easily observed or verified.
Opinions:
• Opinions are statements that cannot be proved true or false because they express a person's thoughts,
beliefs, feelings, or estimates.
• Opinions express worth or value.
• Opinions express what the author or speaker thinks should or should not be thought or done.
• Opinions are based on what seems true or probable.
Example 1:
Fact: We use AI in many ways today from computer games to digital personal assistants to self driving
cars.
Opinion: Greater use of AI will be even more beneficial to humanity.
In this example, the opinion speculates about a future outcome that cannot yet be known.
Example 2:
Fact: Some jobs have been lost through automation in the past.
Opinion: The use of intelligent machines will replace human jobs and drive down wages for human
workers.
In this example, the author is generalizing without substantiating evidence
4. About Dataset
Before going deeply into machine learning, we first describe the notion of a dataset, which will
be used throughout the semester as well as in the lab sessions. A good source of high-quality,
real-world, and well-understood machine learning datasets for practising applied machine
learning is the UCI machine learning repository; you can use it to structure a self-study
program and build a solid foundation in machine learning.
Each dataset gets its own webpage that lists all the details known about it, including any
relevant publications that investigate it. The datasets themselves can be downloaded as ASCII
files, often in the useful CSV format.
For each problem, the student is advised to work on it systematically from end-to-end, for
example, go through the following steps in the applied machine learning process:
1. Define the problem
2. Prepare data
3. Evaluate algorithms
4. Improve results
5. Write-up results
Select a systematic and repeatable process that you can use to deliver results consistently.
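As an illustration of these five steps (not one of the prescribed experiments), here is a minimal end-to-end sketch using scikit-learn, which appears in the software list above; the iris dataset and the decision-tree model are assumptions chosen only for demonstration:

```python
# Minimal end-to-end sketch: define problem, prepare data, evaluate, report.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # 1. define the problem / load data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)            # 2. prepare data
model = DecisionTreeClassifier().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))  # 3. evaluate the algorithm
print("Accuracy: {:.2f}".format(acc))                # 5. write up the result
```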
Path- A → F → G
Step-03:
Node I can be reached from node G.
A* Algorithm calculates f(I).
f(I) = (3+1+3) + 1 = 8
It decides to go to node I.
Path- A → F → G → I
Step-04:
Node E, Node H and Node J can be reached from node I.
A* Algorithm calculates f(E), f(H) and f(J).
• f(E) = (3+1+3+5) + 3 = 15
• f(H) = (3+1+3+2) + 3 = 12
• f(J) = (3+1+3+3) + 0 = 10
Since f(J) is least, it decides to go to node J.
Path- A → F → G → I → J
This is the required shortest path from node A to node J.
Important Note-
It is important to note that-
• A* Algorithm is one of the best path-finding algorithms.
• But it does not always produce the shortest path, because it depends heavily on the quality
of the heuristic.
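The Step-04 evaluation above can be checked in a few lines of Python; the g and h values are read directly off the worked example:

```python
# f(n) = g(n) + h(n) for the candidate nodes at Step-04
g = {'E': 3 + 1 + 3 + 5, 'H': 3 + 1 + 3 + 2, 'J': 3 + 1 + 3 + 3}  # path cost from A
h = {'E': 3, 'H': 3, 'J': 0}                                       # heuristic estimates
f = {n: g[n] + h[n] for n in g}
print(f)                   # {'E': 15, 'H': 12, 'J': 10}
print(min(f, key=f.get))   # J has the least f-value, so it is expanded next
```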
def aStarAlgo(start_node, stop_node):
    open_set = set([start_node])
    closed_set = set()
    g = {}          # store distance from starting node
    parents = {}    # parents contains an adjacency map of all nodes
    g[start_node] = 0
    parents[start_node] = start_node
    while len(open_set) > 0:
        n = None
        # node with the lowest f() = g() + h() is chosen from the open set
        for v in open_set:
            if n is None or g[v] + heuristic(v) < g[n] + heuristic(n):
                n = v
        if n == stop_node or get_neighbors(n) is None:
            pass
        else:
            for (m, weight) in get_neighbors(n):
                # nodes m not in both open and closed set are added to the open
                # set, and n is set as their parent
                if m not in open_set and m not in closed_set:
                    open_set.add(m)
                    parents[m] = n
                    g[m] = g[n] + weight
                # for each node m, compare its distance from start i.e. g(m) to the
                # distance from start through node n
                else:
                    if g[m] > g[n] + weight:
                        # update g(m) and change parent of m to n
                        g[m] = g[n] + weight
                        parents[m] = n
                        if m in closed_set:
                            closed_set.remove(m)
                            open_set.add(m)
        if n is None:
            print('Path does not exist!')
            return None
        # if the current node is the stop node, reconstruct the path back to start
        if n == stop_node:
            path = []
            while parents[n] != n:
                path.append(n)
                n = parents[n]
            path.append(start_node)
            path.reverse()
            print('Path found: {}'.format(path))
            return path
        # remove n from the open set and add it to the closed set
        open_set.remove(n)
        closed_set.add(n)
    print('Path does not exist!')
    return None

def get_neighbors(v):
    return Graph_nodes[v] if v in Graph_nodes else None

def heuristic(n):
    H_dist = {'A': 10, 'B': 8, 'C': 5, 'D': 7, 'E': 3,
              'F': 6, 'G': 5, 'H': 3, 'I': 1, 'J': 0}
    return H_dist[n]

#Describe your graph here
Graph_nodes = {
    'A': [('B', 6), ('F', 3)],
    'B': [('C', 3), ('D', 2)],
    'C': [('D', 1), ('E', 5)],
    'D': [('C', 1), ('E', 8)],
    'E': [('I', 5), ('J', 5)],
    'F': [('G', 1), ('H', 7)],
    'G': [('I', 3)],
    'H': [('I', 2)],
    'I': [('E', 5), ('J', 3)],
}

aStarAlgo('A', 'J')
AO* Algorithm
The AO* Algorithm is based on problem decomposition (breaking a problem down into small
pieces). When a problem can be divided into a set of sub-problems, where each sub-problem can
be solved separately and a combination of these forms a solution, AND-OR graphs or AND-OR
trees are used for representing the solution. The decomposition of the problem, or problem
reduction, generates AND arcs.
AND-OR Graph
Procedure:
1. In the above diagram we have two ways from A: to D, or to B-C (because of the AND
condition). Calculate the cost to select a path.
2. F(A-D) = 1 + 10 = 11 and F(A-BC) = 1 + 1 + 6 + 12 = 20
3. Since F(A-D) is less than F(A-BC), the algorithm chooses the path A-D.
4. From D we have one choice, that is F-E.
5. F(A-D-FE) = 1 + 1 + 4 + 4 = 10
6. Basically, 10 is the cost of reaching FE from D, and the heuristic value of node D also
denotes the cost of reaching FE from D. So, the new heuristic value of D is 10.
7. And the cost from A to D remains the same, that is 11.
Suppose we have searched this path and reached the goal state; then we would never explore the
other path. (This is what AO* says, but here we explore the other path as well, to see what
happens.)
8. F(G-I) = 1 + 1 = 2, which is less than the heuristic value 5. So, the new heuristic value
from G to I is 2.
9. With this new value, the cost from B to G must also have changed. Let's see the new cost
from B to G.
10. F(B-G) = 1 + 2 = 3, meaning the new heuristic value of B is 3.
11. But A is associated with both B and C.
12. As we can see from the diagram, C has only one choice, one node to explore, that is J.
The heuristic value of C is 12.
13. Cost from C to J: F(C-J) = 1 + 1 = 2, which is less than the heuristic value.
14. Now the new heuristic value of C is 2.
15. And the new cost from A to BC, that is F(A-BC) = 1 + 1 + 2 + 3 = 7, which is less than
F(A-D) = 11.
16. In this case, choosing path A-BC is more cost-effective than A-D.
But this would only happen if the algorithm explored this path as well; according to the
algorithm, it will not explore this path (here we have just done it to see how the other path
could also be correct). So it is not guaranteed in every case that the algorithm will find the
optimal solution.
class Graph:
    def __init__(self, graph, heuristicNodeList, startNode):
        # instantiate graph object with graph topology, heuristic values, start node
        self.graph = graph
        self.H = heuristicNodeList
        self.start = startNode
        self.parent = {}
        self.status = {}
        self.solutionGraph = {}

    def applyAOStar(self):
        # start the AO* search from the start node
        self.aoStar(self.start, False)

    def getNeighbors(self, v):
        # AND/OR child groups of node v (empty for leaf nodes)
        return self.graph.get(v, '')

    def getStatus(self, v):
        # 0: unexpanded, >0: expanded, -1: solved
        return self.status.get(v, 0)

    def computeMinimumCostChildNodes(self, v):
        # return the cost of the cheapest AND/OR child group of v, and that group
        minimumCost, best = None, []
        for nodeInfoTupleList in self.getNeighbors(v):
            cost = sum(weight + self.H.get(c, 0) for c, weight in nodeInfoTupleList)
            if minimumCost is None or cost < minimumCost:
                minimumCost, best = cost, [c for c, _ in nodeInfoTupleList]
        return (minimumCost if minimumCost is not None else 0), best

    def aoStar(self, v, backTracking):
        print("HEURISTIC VALUES :", self.H)
        print("PROCESSING NODE  :", v)
        if self.getStatus(v) >= 0:       # node not yet marked solved
            minimumCost, childNodeList = self.computeMinimumCostChildNodes(v)
            self.H[v] = minimumCost      # revise the heuristic value of v
            self.status[v] = len(childNodeList)
            solved = all(self.getStatus(c) == -1 for c in childNodeList)
            for childNode in childNodeList:
                self.parent[childNode] = v
            if solved:                   # all children solved => v is solved
                self.status[v] = -1
                self.solutionGraph[v] = childNodeList
            if v != self.start:          # propagate the revised cost upwards
                self.aoStar(self.parent[v], True)
            if not backTracking:         # expand the chosen children
                for childNode in childNodeList:
                    self.status[childNode] = 0
                    self.aoStar(childNode, False)

    def printSolution(self):
        print("FOR GRAPH SOLUTION, TRAVERSE THE GRAPH FROM THE START NODE:", self.start)
        print("------------------------------------------------------------")
        print(self.solutionGraph)
        print("------------------------------------------------------------")
h1 = {'A': 1, 'B': 6, 'C': 2, 'D': 12, 'E': 2, 'F': 1, 'G': 5, 'H': 7, 'I': 7, 'J': 1, 'T': 3}
graph1 = {
'A': [[('B', 1), ('C', 1)], [('D', 1)]],
'B': [[('G', 1)], [('H', 1)]],
'C': [[('J', 1)]],
'D': [[('E', 1), ('F', 1)]],
'G': [[('I', 1)]]
}
G1= Graph(graph1, h1, 'A')
G1.applyAOStar()
G1.printSolution()
h2 = {'A': 1, 'B': 6, 'C': 12, 'D': 10, 'E': 4, 'F': 4, 'G': 5, 'H': 7}  # heuristic values of nodes
graph2 = {                                    # graph of nodes and edges
    'A': [[('B', 1), ('C', 1)], [('D', 1)]],  # neighbors of node 'A': B, C & D with respective weights
    'B': [[('G', 1)], [('H', 1)]],            # neighbors are included in a list of lists
    'D': [[('E', 1), ('F', 1)]]               # each sublist indicates "OR" nodes or "AND" nodes
}
G2 = Graph(graph2, h2, 'A')  # instantiate Graph object with graph, heuristic values and start node
G2.applyAOStar()             # run the AO* algorithm
Output:
3. For a given set of training data examples stored in a .CSV file, implement and
demonstrate the Candidate-Elimination algorithm to output a description of the set of all
hypotheses consistent with the training examples.
Dataset: weather.csv
• Remove from G any hypothesis that is less general than another hypothesis in G
Program:
import csv

hypo = []
data = []
temp = []
gen = ['?', '?', '?', '?', '?', '?']
with open('weather2.csv') as csv_file:
    fd = csv.reader(csv_file)
    print("\nThe given training examples are:")
    for line in fd:
        print(line)
        temp.append(line)
        if line[-1] == "Yes":
            data.append(line)
print("\nThe positive examples are:")
for line in data:
    print(line)
row = len(data)
col = len(data[0])
print("\nThe final specific output......................")
for j in range(col-1):
    hypo.append(data[0][j])
for i in range(row):
    for j in range(col-1):
        if hypo[j] != data[i][j]:
            hypo[j] = '?'
print(hypo)
print("\nThe final generalized output..................")
row = len(temp)
col = len(temp[0])
for i in range(row):
    if temp[i][-1] == "No":
        for j in range(col-1):
            if temp[i][j] != hypo[j]:
                gen[j] = hypo[j]
                print(gen)
                gen[j] = '?'
Output:
The given training examples are:
['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same', 'Yes']
['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same', 'Yes']
['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change', 'No']
['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change', 'Yes']
4. Write a program to demonstrate the working of the decision tree based ID3 algorithm.
Use an appropriate data set for building the decision tree and apply this knowledge to
classify a new sample.
Decision Tree is one of the most powerful and popular algorithm. Decision-tree algorithm falls
under the category of supervised learning algorithms. It works for both continuous as well as
categorical output variables. Decision tree algorithms are a method for approximating
discrete-valued target functions, in which the learned function is represented by a decision
tree. These kinds of algorithms are famous in inductive learning and have been
successfully applied to a broad range of tasks. Decision trees classify instances by sorting
them down the tree from the root to some leaf node, which provides the classification of the
instance. Each node in the tree specifies a test of some attribute of the instance and each
branch descending from that node corresponds to one of the possible values for this
attribute. Types of decision trees are CART, C4.5, & ID3.
ID3 is a non-incremental algorithm, meaning it derives its classes from a fixed set of training instances.
An incremental algorithm revises the current concept definition, if necessary, with a new sample. The
classes created by ID3 are inductive, that is, given a small set of training instances, the specific classes
created by ID3 are expected to work for all future instances. The distribution of the unknowns must be
the same as the test cases. Induction classes cannot be proven to work in every case, since they
may classify an infinite number of instances. Note that ID3 (or any inductive algorithm) may
misclassify data. To get an intuition, think of a decision tree as a set of if-else rules, where
each if-else condition leads to a certain answer at the end. You might have seen online games
which ask several questions and lead to something you would have thought of at the end. A classic
example where a decision tree is used is known as Play Tennis.
Each nonleaf node is connected to a test that splits its set of possible answers into subsets
corresponding to different test results.
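For intuition only, the well-known Play Tennis tree can be hand-written as nested if-else tests. This is a sketch of the tree ID3 typically learns from the dataset below; it is not produced by the program:

```python
# A decision tree is just nested attribute tests: each branch picks a value,
# each leaf gives the classification.
def play_tennis(outlook, humidity, wind):
    if outlook == 'overcast':
        return 'yes'
    elif outlook == 'sunny':
        return 'yes' if humidity == 'normal' else 'no'
    else:  # rainy
        return 'yes' if wind == 'weak' else 'no'

print(play_tennis('sunny', 'high', 'weak'))   # matches row 8 of the dataset: no
```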
Algorithm:
ID3 (Examples, Target_Attribute, Attributes)
Create a root node for the tree
If all examples are positive, Return the single-node tree Root, with label = +.
If all examples are negative, Return the single-node tree Root, with label = -.
If number of predicting attributes is empty, then Return the single node tree Root,
with label = most common value of the target attribute in the examples.
Otherwise Begin
A ← The Attribute that best classifies examples.
Decision Tree attribute for Root = A.
For each possible value, vi, of A,
Add a new tree branch below Root, corresponding to the test A = vi.
Let Examples(vi) be the subset of examples that have the value vi for A
If Examples(vi) is empty
Then below this new branch add a leaf node with label = most common target value in
the examples
Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute,
Attributes – {A})
End
Return Root
Dataset:
outlook temperature humidity wind play
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rainy mild high weak yes
rainy cool normal weak yes
rainy cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rainy mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rainy mild high strong no
A key point to note here is that the more uniform the probability distribution, the greater its
entropy. For a boolean classification, Entropy(S) = -(p+)log2(p+) - (p-)log2(p-), where p+ and
p- are the proportions of positive and negative examples in S.
Information gain: it measures the expected reduction in entropy obtained by partitioning the
examples according to an attribute. The information gain Gain(S, A) of an attribute A, relative
to the collection of examples S, is defined as

Gain(S, A) = Entropy(S) - SUM over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for
which attribute A has value v. We can use this measure to rank attributes and build the decision
tree where at each node is located the attribute with the highest information gain among the
attributes not yet considered in the path from the root.
The Wind attribute has two labels, weak and strong; we plug each of them into the formula.
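Working this out in code for the dataset above (counts read off the table: 9 yes / 5 no overall; Wind = weak: 6 yes / 2 no; Wind = strong: 3 yes / 3 no):

```python
from math import log2

def entropy(pos, neg):
    # entropy of a boolean collection with pos positive and neg negative examples
    tot = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    return -pos/tot*log2(pos/tot) - neg/tot*log2(neg/tot)

E_S = entropy(9, 5)        # whole dataset: 9 yes, 5 no
E_weak = entropy(6, 2)     # Wind = weak: 8 examples, 6 yes, 2 no
E_strong = entropy(3, 3)   # Wind = strong: 6 examples, 3 yes, 3 no
gain_wind = E_S - (8/14)*E_weak - (6/14)*E_strong
print(round(gain_wind, 3))  # 0.048
```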
Program:
import csv
import pprint
from math import *

lines = list(csv.reader(open('weather.csv', 'r')))
data = lines.pop(0)
print(data)
print()
print(lines)

def entropy(pos, neg):
    if pos == 0 or neg == 0:
        return 0
    tot = pos + neg
    return -pos/tot*log(pos/tot, 2) - neg/tot*log(neg/tot, 2)

def gain(lines, attr, pos, neg):
    d, E, acu = {}, entropy(pos, neg), 0
    for i in lines:
        if i[attr] not in d:
            d[i[attr]] = {}
        d[i[attr]][i[-1]] = 1 + d[i[attr]].get(i[-1], 0)
    for i in d:
        tot = d[i].get('yes', 0) + d[i].get('no', 0)
        acu += tot/(pos+neg)*entropy(d[i].get('yes', 0), d[i].get('no', 0))
    return E - acu

def build(lines, data):
    pos = len([x for x in lines if x[-1] == 'yes'])
    sz = len(lines[0]) - 1
    neg = len(lines) - pos
    if neg == 0 or pos == 0:
        return 'yes' if neg == 0 else 'no'
    root = max([[gain(lines, i, pos, neg), i] for i in range(sz)])[1]
    fin, res = {}, {}
    uniq_attr = set([x[root] for x in lines])
    print(">>>", uniq_attr)
    for i in uniq_attr:
        res[i] = build([x[:root]+x[root+1:] for x in lines if x[root] == i],
                       data[:root]+data[root+1:])
    fin[data[root]] = res
    return fin

tree = build(lines, data)
pprint.pprint(tree)

def classify(instance, tree, default=None):
    attribute = next(iter(tree))
    if instance[attribute] in tree[attribute].keys():
        result = tree[attribute][instance[attribute]]
        if isinstance(result, dict):
            return classify(instance, result)
        else:
            return result
    else:
        return default

import pandas as pd
df_new = pd.read_csv('test.csv')
df_new['predicted'] = df_new.apply(classify, axis=1, args=(tree, '?'))
print(df_new)
Output:
The cost function is what’s used to learn the optimal solution to the problem being solved. This involves
determining the best values for all of the tuneable model parameters, with neuron path adaptive weights
being the primary target, along with algorithm tuning parameters such as the learning rate. It’s usually
done through optimization techniques such as gradient descent or stochastic gradient descent.
These optimization techniques basically try to make the ANN solution be as close as possible to the
optimal solution, which when successful means that the ANN is able to solve the intended problem with
high performance.
Algorithm:
Phase 1: propagation
Each propagation involves the following steps:
1. The weight's output delta and input activation are multiplied to find the gradient of the weight.
2. A ratio (percentage) of the weight's gradient is subtracted from the weight.
This ratio (percentage) influences the speed and quality of learning; it is called the learning rate. The greater
the ratio, the faster the neuron trains, but the lower the ratio, the more accurate the training is. The sign of
the gradient of a weight indicates whether the error varies directly with, or inversely to, the weight.
Therefore, the weight must be updated in the opposite direction, "descending" the gradient.
Learning is repeated (on new batches) until the network performs adequately.
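The descent step described above amounts to one line; the weight, gradient and learning-rate values below are purely illustrative:

```python
def update_weight(w, grad, learning_rate=0.5):
    # step against the gradient: w_new = w - learning_rate * dE/dw
    return w - learning_rate * grad

print(round(update_weight(0.8, 0.2), 4))   # 0.8 - 0.5*0.2 = 0.7
```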
• two inputs
• two hidden neurons
• two output neurons
• two biases
We will repeat this process for the output layer neurons, using the output from the hidden layer
neurons as inputs.
Now, we will propagate backwards. This way we will try to reduce the error by changing the
values of weights and biases.
Consider W5, we will calculate the rate of change of error w.r.t change in weight W5.
Since we are propagating backwards, first thing we need to do is, calculate the change in total
errors w.r.t the output O1 and O2.
Now, we will propagate further backwards and calculate the change in output O1 w.r.t to its
total net input.
Let’s see now how much does the total net input of O1 changes w.r.t W5?
Step – 3: Putting all the values together and calculating the updated weight value
Program:
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)  # Features (Hours Slept, Hours Studied)
y = np.array(([92], [86], [89]), dtype=float)        # Labels (Marks obtained)
X = X/np.amax(X, axis=0)  # Normalize
y = y/100

def sigmoid(x):
    return 1/(1 + np.exp(-x))

def sigmoid_grad(x):
    return x * (1 - x)

epoch = 5000       # number of training iterations
lr = 0.1           # learning rate
input_neurons = 2
hidden_neurons = 3
output_neurons = 1

wh = np.random.uniform(size=(input_neurons, hidden_neurons))     # 2x3
bh = np.random.uniform(size=(1, hidden_neurons))                 # 1x3
wout = np.random.uniform(size=(hidden_neurons, output_neurons))  # 3x1
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # Forward Propagation
    h_ip = np.dot(X, wh) + bh   # Dot product + Bias
    h_act = sigmoid(h_ip)       # Activation function
    o_ip = np.dot(h_act, wout) + bout
    output = sigmoid(o_ip)

    # Backpropagation: error at output layer
    Eo = y - output             # Error at o/p
    outgrad = sigmoid_grad(output)
    d_output = Eo * outgrad     # Errj = Oj(1-Oj)(Tj-Oj)

    # Error at hidden layer
    Eh = d_output.dot(wout.T)
    hiddengrad = sigmoid_grad(h_act)
    d_hidden = Eh * hiddengrad

    # Weight updates
    wout += h_act.T.dot(d_output) * lr
    wh += X.T.dot(d_hidden) * lr

print("Normalized Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n" + str(output))
Output:
Normalized Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[0.92]
[0.86]
[0.89]]
Predicted Output:
[[0.83940487]
[0.82214181]
[0.84026615]]
6. Write a program to implement the naïve Bayesian classifier for a sample training data set
stored as a .CSV file. Compute the accuracy of the classifier, considering a few test data sets.
Naive Bayesian classifier is a statistical method that can predict class membership
probabilities such as the probability that a given tuple belongs to a particular class.
Bayesian classifier is based on the Bayes theorem and it assumes that the effect of an
attribute value on a given class is independent of the values of the other attributes. This
assumption is called class conditional independence. It is made to simplify the computations
involved and in this sense, is called as "naive".
The naive Bayesian classifier is fast and incremental, can deal with discrete and continuous
attributes, and has excellent performance in real-life problems. Here, the algorithm of the
naive Bayesian classifier is deployed successively, enabling it to solve classification
problems while retaining all the advantages of the naive Bayesian classifier. The comparison
of performance in various domains confirms the advantages of successive learning and suggests
its application to other learning algorithms.
The problem with the above formulation is that if the number of features n is large, or if a
feature can take on a large number of values, then basing such a model on probability tables is
infeasible. We therefore reformulate the model to make it more tractable. Using Bayes'
theorem, the conditional probability can be decomposed as

P(C | x1, ..., xn) = P(C) * P(x1, ..., xn | C) / P(x1, ..., xn)

In plain English, using Bayesian probability terminology, the above equation can be written as

posterior = (prior * likelihood) / evidence
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be
distributed according to a Gaussian distribution.
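Under this assumption, the likelihood of a value x for a given class is the normal density with that class's mean and standard deviation; a minimal sketch:

```python
import math

def gaussian_probability(x, mean, stdev):
    # N(x; mean, stdev) = exp(-(x-mean)^2 / (2*stdev^2)) / (sqrt(2*pi)*stdev)
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

print(round(gaussian_probability(0.0, 0.0, 1.0), 4))  # 0.3989, peak of the standard normal
```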
Working
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of prediction.
Pros:
Cons:
1) If categorical variable has a category (in test data set), which was not observed in training
data set, then model will assign a 0 (zero) probability and will be unable to make a
prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing
technique. One of the simplest smoothing techniques is called Laplace estimation.
2) On the other side naive Bayes is also known as a bad estimator, so the probability outputs
from predict_proba are not to be taken too seriously.
3) Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it
is almost impossible that we get a set of predictors which are completely independent.
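The "Zero Frequency" fix from point 1 can be sketched as Laplace (add-one) estimation; the counts here are invented for illustration:

```python
def laplace_estimate(count, total, k):
    # add one pseudo-count per category so no probability is ever exactly zero
    return (count + 1) / (total + k)

# a category value never seen in training still gets a small non-zero probability
print(laplace_estimate(0, 10, 3))   # 1/13 instead of 0/10
```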
----------------------------------Explanation-------------------------------------
It is a classification technique based on Bayes' Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that
the presence of a particular feature in a class is unrelated to the presence of any other
feature. For example, a fruit may be considered to be an apple if it is red, round, and
about 3 inches in diameter. Even if these features depend on each other or upon the
existence of the other features, all of these properties independently contribute to the
probability that this fruit is an apple, and that is why it is known as 'Naive'.
1) Handling Of Data:
➢ Load the data from the CSV file and split in to training and test data set.
➢ Training data set can be used to by Naïve Bayes to make predictions.
➢ And Test data set can be used to evaluate the accuracy of the model.
Feature vectors represent the frequencies with which certain events have been generated by a
multinomial distribution.
2) Summarize Data:
The summary of the training data collected involves the mean and the standard deviation for
each attribute, by class value.
➢ These are required when making predictions to calculate the probability of
specific attribute values belonging to each class value.
➢ The summary data can be broken down into the following sub-tasks:
a) Separate Data By Class:The first task is to separate the training dataset instances by class
value so that we can calculate statistics for each class. We can do that by creating a map of each
class value to a list of instances that belong to that class and sort the entire dataset of instances
into the appropriate lists.
b) Calculate Mean: We need to calculate the mean of each attribute for a class value. The mean
is the central tendency of the data, and we will use it as the middle of our Gaussian
distribution when calculating probabilities.
3) Calculate Standard Deviation: We also need to calculate the standard deviation of each
attribute for a class value. The standard deviation describes the variation of spread of the
data, and we will use it to characterize the expected spread of each attribute in our Gaussian
distribution when calculating probabilities.
4) Summarize Dataset: For a given list of instances (for a class value) we can calculate the
mean and the standard deviation for each attribute.
The zip function groups the values for each attribute across our data instances into
their own lists so that we can compute the mean and standard deviation values for the
attribute.
5) Summarize Attributes By Class: We can pull it all together by first separating our
training dataset into instances grouped by class. Then calculate the summaries for each
attribute.
3) Make Predictions:
• Making predictions involves calculating the probability that a given data instance
belongs to each class,
• then selecting the class with the largest probability as the prediction.
• Finally, estimation of the accuracy of the model by making predictions for each data
instance in the test dataset.
4) Evaluate Accuracy: The predictions can be compared to the class values in the test dataset,
and a classification accuracy can be calculated as an accuracy ratio between 0% and 100%.
Dataset: This problem comprises 768 observations of medical details for Pima Indian
patients. The records describe instantaneous measurements taken from the patient,
such as their age, the number of times pregnant, and blood workup. All patients are
women aged 21 or older. All attributes are numeric, and their units vary from attribute to
attribute.
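The per-attribute likelihood in step 3 comes from the Gaussian (normal) density built from the stored mean and standard deviation. A minimal sketch, assuming this function shape; the attribute value 71.5 and the summary (mean 73, stdev 6.2) are illustrative:

```python
import math

def calculateProbability(x, mean, stdev):
    # Gaussian probability density of attribute value x for one class summary
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

# e.g. attribute value 71.5 under a class summary of mean=73, stdev=6.2
p = calculateProbability(71.5, 73, 6.2)
print(round(p, 4))  # -> 0.0625
```

Multiplying these per-attribute densities (one per attribute, per class) gives the class-conditional probability used to pick the most likely class.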
Program:
import csv
import random
import math
def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
def mean(numbers):
    return sum(numbers)/float(len(numbers))
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]   # drop the class column
    return summaries
def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries
def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            probabilities[classValue] *= calculateProbability(inputVector[i], mean, stdev)
    return probabilities
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet)))*100.0
def main():
    filename = 'diabetes.csv'
    splitRatio = 0.87
    dataset = loadCsv(filename)
    trainingSet, testSet = splitDataset(dataset, splitRatio)
    print("----------------------------------Output of naïve Bayesian classifier\n")
    print('Splitting {} rows into training={} and testing={} rows'.format(len(dataset),
          len(trainingSet), len(testSet)))
    # prepare model
    summaries = summarizeByClass(trainingSet)
    # test model
    predictions = getPredictions(summaries, testSet)
    accuracy = getAccuracy(testSet, predictions)
    print('Classification Accuracy: {}%'.format(accuracy))
    print("-----------------------------------------------------")
main()
Output:
7. Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Java/Python ML library classes/API in
the program.
Clustering is a technique for finding similarity groups in data, called clusters. It attempts to
group individuals in a population together by similarity, but is not driven by a specific purpose.
Clustering is often called unsupervised learning, as you don’t have prescribed labels in the
data and no class values denoting an a priori grouping of the data instances are given.
Types of clustering:
Hierarchical clustering: Also known as 'nested clustering', as it allows clusters to exist within
bigger clusters, forming a tree.
Partition clustering: This is simply a division of the set of data objects into non-overlapping
clusters such that each object is in exactly one subset.
Overlapping clustering: This is used to reflect the fact that an object can simultaneously belong
to more than one group.
Fuzzy clustering: Every object belongs to every cluster with a membership weight that ranges
between 0 (if it absolutely does not belong to the cluster) and 1 (if it absolutely belongs to the
cluster).
Strategy: Use structure inherent in the probabilistic model to separate the original maximum-
likelihood problem into two closely linked sub-problems, each of which is hopefully in some
sense more tractable than the original problem.
The original purpose of the EM algorithm was to estimate a mixture probability density
distribution from ‘incomplete’ data samples.
EM is a general algorithm for dealing with hidden data, but we will study it in the context of
unsupervised learning (hidden class labels = clustering) first.
• EM is an optimization strategy for objective functions that can be interpreted as
likelihoods in the presence of missing data.
• EM is much simpler than gradient methods: no need to choose a step size.
• EM is an iterative algorithm with two linked steps:
o E-step: fill in hidden values using inference
o M-step: apply the standard MLE/MAP method to the completed data
• This procedure monotonically improves the likelihood (or leaves it unchanged), so EM
always converges to a local optimum of the likelihood.
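The two linked steps can be sketched for a one-dimensional mixture of two Gaussians; the synthetic data, the crude min/max initialization, and the iteration count below are illustrative choices, not part of the prescribed lab program:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # two-component 1-D Gaussian mixture fitted by EM
    mu = np.array([x.min(), x.max()], dtype=float)   # crude initialization
    sigma = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])                        # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * (1 / (np.sqrt(2 * np.pi) * sigma)) * \
               np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: standard MLE on the "completed" (responsibility-weighted) data
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
        pi = Nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
mu, sigma, pi = em_gmm_1d(x)
print(np.sort(mu).round(1))   # means recovered near 0 and 5
```

Each iteration either raises the likelihood or leaves it unchanged, which is why the procedure converges to a local optimum.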
K-Means clustering intends to partition n objects into k clusters in which each object belongs to
the cluster with the nearest mean. This method produces exactly k different clusters of the
greatest possible distinction. The best number of clusters k leading to the greatest separation
(distance) is not known a priori and must be computed from the data. The objective of K-Means
clustering is to minimize the total intra-cluster variance, i.e. the squared error function:
J = Σ (j = 1..k) Σ (xi in Cj) || xi − μj ||², where μj is the mean (centroid) of cluster Cj.
K-Means is a relatively efficient method. However, we need to specify the number of clusters in
advance, and the final results are sensitive to initialization and often terminate at a local
optimum. Unfortunately, there is no global theoretical method to find the optimal number of
clusters. A practical approach is to compare the outcomes of multiple runs with different k and
choose the best one based on a predefined criterion. In general, a large k probably decreases
the error but increases the risk of overfitting.
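The multiple-runs idea can be sketched with a tiny k-means that reports the squared-error (inertia) criterion; the two-blob synthetic data below is illustrative:

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    # plain k-means; returns the total intra-cluster squared error
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    return ((X - centers[labels]) ** 2).sum()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, .5, (100, 2)), rng.normal(5, .5, (100, 2))])
for k in (1, 2, 3):
    print(k, round(kmeans_inertia(X, k), 1))
```

The error drops sharply from k=1 to k=2 (the true number of blobs) and only marginally afterwards; picking k at that "elbow" is the predefined criterion mentioned above.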
Advantages
1) Fast, robust and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the
dimension of each object, and t the number of iterations. Normally, k, t, d << n.
3) Gives the best results when the data sets are distinct or well separated from each other.
Disadvantages
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) The use of exclusive assignment: if there are two highly overlapping clusters then k-means
will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations, i.e. with different
representations of the data we get different results (data represented in the form of Cartesian
coordinates and polar coordinates will give different results).
4) Euclidean distance measures can unequally weight underlying factors.
5) The learning algorithm provides only a local optimum of the squared error function.
6) Randomly choosing the cluster centers may not lead to a fruitful result (please refer to the
figure).
7) Applicable only when the mean is defined, i.e. it fails for categorical data.
8) Unable to handle noisy data and outliers.
9) The algorithm fails for non-linear data sets.
Program:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm
import pandas as pd
import numpy as np
%matplotlib inline
iris=datasets.load_iris()
X=pd.DataFrame(iris.data)
X.columns=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
y=pd.DataFrame(iris.target)
y.columns=['Targets']
colormap=np.array(['red','lime','black'])   # one colour per class/cluster
plt.figure(figsize=(14,7))
plt.subplot(1,2,1)
plt.scatter(X.Sepal_Length,X.Sepal_Width,c=colormap[y.Targets],s=40)
plt.title('Sepal')
plt.subplot(1,2,2)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[y.Targets],s=40)
plt.title('Petal')
Output: Text(0.5,1,'Petal')
model=KMeans(n_clusters=3)
model.fit(X)
model.labels_
Output: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
predY=model.labels_   # cluster labels used to colour the k-means predictions
plt.figure(figsize=(14,7))
plt.subplot(1,2,2)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[predY],s=40)
plt.title('K Mean Classification')
sm.accuracy_score(y,model.labels_)
Output: 0.24
sm.confusion_matrix(y,model.labels_)
from sklearn import preprocessing            # scaling is assumed here, since the
from sklearn.mixture import GaussianMixture  # predictions below use a variable xs
scaler=preprocessing.StandardScaler()
scaler.fit(X)
xs=pd.DataFrame(scaler.transform(X),columns=X.columns)
gmm=GaussianMixture(n_components=3)
gmm.fit(xs)
Output: GaussianMixture(covariance_type='full', init_params='kmeans', max_iter=100,
means_init=None, n_components=3, n_init=1, precisions_init=None,
random_state=None, reg_covar=1e-06, tol=0.001, verbose=0,
verbose_interval=10, warm_start=False, weights_init=None)
y_cluster_gmm=gmm.predict(xs)
y_cluster_gmm
plt.subplot(1,2,1)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[y_cluster_gmm],s=40)
plt.title('GMM Classification')
plt.subplot(1,2,2)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[predY],s=40)
plt.title('K Mean Classification')
sm.accuracy_score(y,y_cluster_gmm)
Output: 0.03333333333333333
sm.confusion_matrix(y,y_cluster_gmm)
Output: array([[ 0, 0, 50],
[45, 5, 0],
[ 0, 50, 0]], dtype=int64)
8. Write a program to implement k-Nearest Neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions. Java/Python ML library classes can be used for this
problem.
The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for
more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector
Machines (SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is
used in a variety of applications such as economic forecasting, data compression and genetics.
For example, KNN was leveraged in a 2006 study of functional genomics for the assignment of
genes based on their expression profiles.
Let’s first start by establishing some definitions and notations. We will use x to denote a feature
(aka. predictor, attribute) and y to denote the target (aka. label, class) we are trying to predict.
KNN falls in the supervised learning family of algorithms. Informally, this means that we are
given a labelled dataset consisting of training observations (x, y) and would like to capture the
relationship between x and y. More formally, our goal is to learn a function h: X → Y so that,
given an unseen observation x, h(x) can confidently predict the corresponding output y.
The KNN classifier is also a non-parametric and instance-based learning algorithm.
It is worth noting that the minimal training phase of KNN comes both at a memory cost, since
we must store a potentially huge data set, as well as a computational cost during test time since
classifying a given observation requires a run down of the whole data set. Practically speaking,
this is undesirable since we usually want fast responses.
Example of k-NN classification: the test sample (inside the circle) should be classified either to
the first class of blue squares or to the second class of red triangles. If k = 3 (inner circle) it is
assigned to the second class because there are 2 triangles and only 1 square inside the inner
circle. If, for example, k = 5 it is assigned to the first class (3 squares vs. 2 triangles inside the
outer circle).
In the classification setting, the most commonly used distance measure is the Euclidean
distance, but other measures can be more suitable for a given setting; these include the
Manhattan, Chebyshev and Hamming distances.
More formally, given a positive integer K, an unseen observation x and a similarity metric d,
the KNN classifier performs the following two steps:
It runs through the whole dataset computing d between x and each training observation. We’ll
call the K points in the training data that are closest to x the set A. Note that K is usually odd to
prevent tie situations.
It then estimates the conditional probability for each class, that is, the fraction of points in A
with that given class label. (Note I(x) is the indicator function which evaluates to 1 when the
argument x is true and 0 otherwise)
Finally, our input x gets assigned to the class with the largest probability.
KNN searches the memorized training observations for the K instances that most closely
resemble the new instance and assigns it their most common class.
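The two steps can be sketched from scratch; the toy training points below are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # step 1: distances from x to every training observation
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # step 2: labels of the K nearest points (the set A)
    nearest = y_train[np.argsort(d)[:k]]
    # step 3: class with the largest fraction of points in A
    return Counter(nearest.tolist()).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([8.5, 8.5])))  # -> 1
```

Keeping k odd (here k=3) avoids ties between the two classes.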
More on K
At this point, you’re probably wondering how to pick the variable K and what its effects are on
your classifier. Well, like most machine learning algorithms, the K in KNN is a hyperparameter
that you, as a designer, must pick in order to get the best possible fit for the data set. Intuitively,
you can think of K as controlling the shape of the decision boundary we talked about earlier.
When K is small, we are restraining the region of a given prediction and forcing our classifier to
be “more blind” to the overall distribution. A small value for K provides the most flexible fit,
which will have low bias but high variance. Graphically, our decision boundary will be more
jagged.
Pros:
• No assumptions about data — useful, for example, for nonlinear data
• Simple algorithm — to explain and understand/interpret
• High accuracy (relatively) — it is pretty high but not competitive in comparison to better
supervised learning models
• Versatile — useful for classification or regression
Cons:
• Computationally expensive — because the algorithm stores all of the training data
• High memory requirement
• Stores all (or almost all) of the training data
• Prediction stage might be slow (with big N)
• Sensitive to irrelevant features and the scale of the data
Applications of KNN
Credit ratings — collecting financial characteristics vs. comparing people with similar financial
features to a database. By the very nature of a credit rating, people who have similar financial
details would be given similar credit ratings. Therefore, they would like to be able to use this
existing database to predict a new customer’s credit rating, without having to perform all the
calculations.
Should the bank give a loan to an individual? Would an individual default on his or her loan? Is
that person closer in characteristics to people who defaulted or did not default on their loans?
In political science — classing a potential voter to a “will vote” or “will not vote”, or to “vote
Democrat” or “vote Republican”.
More advanced examples could include handwriting detection (like OCR), image recognition and
even video recognition.
Program:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import train_test_split
iris_dataset=load_iris()
print("\nIRIS FEATURES & TARGET NAMES: \n",iris_dataset.target_names)
for i in range(len(iris_dataset.target_names)):
    print("\n[{0}]:[{1}]".format(i,iris_dataset.target_names[i]))
print("\nIRIS DATA :\n",iris_dataset["data"])
X_train,X_test,y_train,y_test=train_test_split(iris_dataset["data"],
                                               iris_dataset["target"],random_state=0)
print("\nTarget :\n",iris_dataset["target"])
print("\nX TRAIN \n",X_train)
print("\nX TEST \n",X_test)
print("\nY TRAIN \n",y_train)
print("\nY TEST \n",y_test)
kn=KNeighborsClassifier(n_neighbors=1)
kn.fit(X_train,y_train)
x_new=np.array([[5,2.9,1,0.2]])
print("\nXNEW \n",x_new)
prediction=kn.predict(x_new)
print("\nPredicted target value: {}\n".format(prediction))
print("\nPredicted feature name: {}\n".format(iris_dataset["target_names"][prediction]))
i=1
x=X_test[i]
x_new=np.array([x])
print("\nXNEW \n",x_new)
for i in range(len(X_test)):
    x=X_test[i]
    x_new=np.array([x])
    prediction=kn.predict(x_new)
    print("\n Actual:[{0}][{1}] \t, Predicted:{2}{3}".format(
        y_test[i],iris_dataset["target_names"][y_test[i]],
        prediction,iris_dataset["target_names"][prediction]))
print("\nTEST SCORE[ACCURACY]: {:.2f}\n".format(kn.score(X_test,y_test)))
Target :
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
X TEST
[[5.8 2.8 5.1 2.4]
[6. 2.2 4. 1. ]
[5.5 4.2 1.4 0.2]
[7.3 2.9 6.3 1.8]
[5. 3.4 1.5 0.2]
[6.3 3.3 6. 2.5]
[5. 3.5 1.3 0.3]
[6.7 3.1 4.7 1.5]
[6.8 2.8 4.8 1.4]
[6.1 2.8 4. 1.3]
[6.1 2.6 5.6 1.4]
[6.4 3.2 4.5 1.5]
[6.1 2.8 4.7 1.2]
[6.5 2.8 4.6 1.5]
[6.1 2.9 4.7 1.4]
[4.9 3.1 1.5 0.1]
[6. 2.9 4.5 1.5]
[5.5 2.6 4.4 1.2]
[4.8 3. 1.4 0.3]
[5.4 3.9 1.3 0.4]
[5.6 2.8 4.9 2. ]
[5.6 3. 4.5 1.5]
[4.8 3.4 1.9 0.2]
[4.4 2.9 1.4 0.2]
[6.2 2.8 4.8 1.8]
[4.6 3.6 1. 0.2]
[5.1 3.8 1.9 0.4]
[6.2 2.9 4.3 1.3]
[5. 2.3 3.3 1. ]
[5. 3.4 1.6 0.4]
[6.4 3.1 5.5 1.8]
[5.4 3. 4.5 1.5]
[5.2 3.5 1.5 0.2]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.2]
[5.2 2.7 3.9 1.4]
[5.7 3.8 1.7 0.3]
[6. 2.7 5.1 1.6]]
Y TEST
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
1]
XNEW
[[5. 2.9 1. 0.2]]
XNEW
[[6. 2.2 4. 1. ]]
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[2][virginica] , Predicted:[2]['virginica']
Actual:[1][versicolor] , Predicted:[1]['versicolor']
Actual:[0][setosa] , Predicted:[0]['setosa']
Actual:[1][versicolor] , Predicted:[2]['virginica']
Locally weighted learning is simple but appealing, both intuitively and statistically, and it has
been around since the turn of the century. When you want to predict what is going to happen in
the future, you simply reach into a database of all your previous experiences, grab some similar
experiences, combine them (perhaps by a weighted average that weights more similar
experiences more strongly) and use the combination to make a prediction, do a regression, or
perform many other more sophisticated operations. We like this approach to learning, especially
for learning process dynamics or robot dynamics, because it is very flexible (low bias), so
provided we have plenty of data we will eventually get an accurate model.
The locally weighted cost function is
J(θ) = Σi w(i) (θT x(i) − y(i))²
where x(i) is the ith instance, y(i) is its corresponding target value, θ are the model parameters,
and w(i) is the weight of the ith instance, given by
w(i) = exp( −(x(i) − x)² / (2τ²) )
where τ is the bandwidth, and x is the query point, which is fixed for a given regression model
and is typically one of the instances.
Locally weighted linear regression is a non-parametric method for fitting data points. What
does that mean?
Instead of fitting a single regression line, you fit many linear regression models. The final
resulting smooth curve is the product of all those regression models.
Obviously, we can't fit the same linear model again and again. Instead, for each linear model we
want to fit, we pick a point x and use it for fitting a local regression model.
We find the points closest to x to fit each of our local regression models. That's why you'll see
the algorithm referred to as a nearest-neighbours algorithm in the literature.
Now, suppose your data points have x-values from 1 to 100: [1, 2, 3, ..., 98, 99, 100]. The
algorithm would fit a linear model for each of 1, 2, 3, ..., 98, 99, 100. That means you'll have 100
regression models. Again, when we fit each of the models, we can't just use all the data points
in the sample. For each of the models, we find the closest points and use them for fitting. For
example, if the algorithm wants to fit for x = 50, it will put higher weight on [48, 49, 50, 51, 52]
and less weight on [45, 46, 47, 53, 54, 55]. When it tries to fit for x = 95, the points
[92, 93, 95, 96, 97] will have higher weight than any other data points.
Linear regression only gives you an overall prediction (a single line), so it is often not helpful
for real-world data. Locally weighted linear regression introduces some bias into our estimator
through weighting. The weights can be computed in many ways; in this case, we use the RBF
equation to set them up. The RBF equation, also called the RBF kernel, is a way to measure the
distance between one point and the others. The characteristic of the RBF kernel is that it gives
stronger weights to the data points near the point we are interested in.
1. Take a data point and calculate its weight relative to the others using the RBF kernel. We
define this weight as Wi.
2. Take the weights Wi into the normal equation, so we get the normal equation with local
weighting. If everything goes right, you will get a parameter vector W that is specific to that
point.
3. Finally, use W and the data point to test how good the estimator is.
Run steps 1–3 for every point, store the results in a vector, and print it.
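Steps 1–3 amount to solving the weighted normal equation θ = (AᵀWA)⁻¹AᵀWy at each query point. A minimal sketch; the data, the bandwidth τ, and the function name are illustrative:

```python
import numpy as np

def lwr_predict(x0, X, y, tau=0.5):
    # step 1: RBF weights, larger for training points near the query x0
    w = np.exp(-((X - x0) ** 2) / (2 * tau ** 2))
    # step 2: weighted normal equation  theta = (A^T W A)^-1 A^T W y
    A = np.c_[np.ones_like(X), X]        # design matrix with intercept column
    W = np.diag(w)
    theta = np.linalg.pinv(A.T @ W @ A) @ A.T @ W @ y
    # step 3: evaluate the local line at the query point
    return np.array([1.0, x0]) @ theta

X = np.linspace(0, 10, 50)
y = 2 * X + 1                            # exactly linear data is recovered exactly
print(round(lwr_predict(5.0, X, y), 2))  # -> 11.0
```

On curved data the same call traces out the local behaviour of the curve, one query point at a time.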
LWR uses only the points within a short distance of one another to estimate a polynomial
expression at each point, subject to smoothing. This method is effective with diverse time-series
data whose behaviour varies in linearity, periodicity and non-linearity. While this method is
quite similar to multiple linear regression, the analysis takes place solely on the values of the
nearby data, according to the conditions. Because this is a non-parametric method, no
assumptions are imposed by the model.
Program:
import numpy as np
from bokeh.plotting import figure, show, output_notebook
from bokeh.layouts import gridplot
output_notebook()
def radial_kernel(x0, X, tau):
    # RBF weights for query point x0
    return np.exp(np.sum((X - x0) ** 2, axis=1) / (-2 * tau * tau))
def local_regression(x0, X, Y, tau):
    # fit a weighted linear model around x0 and evaluate it there
    x0 = np.r_[1, x0]                        # add bias term
    X = np.c_[np.ones(len(X)), X]
    xw = X.T * radial_kernel(x0, X, tau)     # weighted design matrix
    beta = np.linalg.pinv(xw @ X) @ xw @ Y   # normal equation
    return x0 @ beta
# sample non-linear data with noise
n = 1000
X = np.linspace(-3, 3, num=n)
Y = np.log(np.abs(X ** 2 - 1) + .5)
X += np.random.normal(scale=.1, size=n)
def plot_lwr(tau):
    # prediction across the domain
    domain = np.linspace(-3, 3, num=300)
    prediction = [local_regression(x0, X, Y, tau) for x0 in domain]
    plot = figure(plot_width=400, plot_height=400)
    plot.title.text = 'tau=%g' % tau
    plot.scatter(X, Y, alpha=.3)
    plot.line(domain, prediction, line_width=2, color='red')
    return plot
show(gridplot([
    [plot_lwr(10.), plot_lwr(1.)],
    [plot_lwr(0.1), plot_lwr(0.01)]
]))
6. Viva Questions
7. AI and ML Glossary
accuracy
The fraction of predictions that a classification model got right. In multi-class classification,
accuracy is defined as follows:
Accuracy = Correct Predictions / Total Number of Examples
In binary classification, accuracy has the following definition:
Accuracy = (True Positives + True Negatives) / Total Number of Examples
See true positive and true negative.
activation function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs
from the previous layer and then generates and passes an output value (typically nonlinear) to
the next layer.
backpropagation
The primary algorithm for performing gradient descent on neural networks. First, the output
values of each node are calculated (and cached) in a forward pass. Then, the partial
derivative of the error with respect to each parameter is calculated in a backward pass through
the graph.
baseline
A simple model or heuristic used as reference point for comparing how well a model is
performing. A baseline helps model developers quantify the minimal, expected performance on
a particular problem.
batch
The set of examples used in one iteration (that is, one gradient update) of model training.
See also batch size.
batch size
The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size
of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and
inference; however, TensorFlow does permit dynamic batch sizes.
bias (math)
An intercept or offset from an origin. Bias (also known as the bias term) is referred to
as b or w0 in machine learning models. For example, bias is the b in the following formula:
y′ = b + w1x1 + w2x2 + … + wnxn
binary classification
A type of classification task that outputs one of two mutually exclusive classes. For example, a
machine learning model that evaluates email messages and outputs either "spam" or "not
spam" is a binary classifier.
candidate sampling
A training-time optimization in which a probability is calculated for all the positive labels, using,
for example, softmax, but only for a random sample of negative labels. For example, if we have
an example labeled beagle and dog, candidate sampling computes the predicted probabilities
and corresponding loss terms for the beagle and dog class outputs in addition to a random
subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can
learn from less frequent negative reinforcement as long as positive classes always get proper
positive reinforcement, and this is indeed observed empirically. The motivation for candidate
sampling is a computational efficiency win from not computing predictions for all negatives.
categorical data
Features having a discrete set of possible values. For example, consider a categorical feature
named house style, which has a discrete set of three possible values: Tudor, ranch, colonial. By
representing house style as categorical data, the model can learn the separate impacts
of Tudor, ranch, and colonial on house price.
Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied
to a given example. For example, a car maker categorical feature would probably permit only a
single value (Toyota) per example. Other times, more than one value may be applicable. A single
car could be painted more than one different color, so a car color categorical feature would
likely permit a single example to have multiple values (for example, red and white).
Categorical features are sometimes called discrete features.
Contrast with numerical data.
centroid
The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is
3, then the k-means or k-median algorithm finds 3 centroids.
class
One of a set of enumerated target values for a label. For example, in a binary
classification model that detects spam, the two classes are spam and not spam. In a multi-class
classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so
on.
classification model
A type of machine learning model for distinguishing among two or more discrete classes. For
example, a natural language processing classification model could determine whether an input
sentence was in French, Spanish, or Italian. Compare with regression model.
classification threshold
A scalar-value criterion that is applied to a model's predicted score in order to separate
the positive class from the negative class. Used when mapping logistic regression results
to binary classification. For example, consider a logistic regression model that determines the
probability of a given email message being spam. If the classification threshold is 0.9, then
logistic regression values above 0.9 are classified as spam and those below 0.9 are classified
as not spam.
clustering
Grouping related examples, particularly during unsupervised learning. Once all the examples
are grouped, a human can optionally supply meaning to each cluster.
Many clustering algorithms exist. For example, the k-means algorithm clusters examples based
on their proximity to a centroid, as in the following diagram:
[Diagram: tree height vs. tree width, with a centroid marking each of cluster 1 and cluster 2]
A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf
trees" and cluster 2 as "full-size trees."
As another example, consider a clustering algorithm based on an example's distance from a
center point, illustrated as follows:
[Diagram: cluster 1, cluster 2 and cluster 3 grouped around their center points]
collaborative filtering
Making predictions about the interests of one user based on the interests of many other users.
Collaborative filtering is often used in recommendation systems.
confirmation bias
#fairness
The tendency to search for, interpret, favor, and recall information in a way that confirms one's
preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or
label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias
is a form of implicit bias.
Experimenter's bias is a form of confirmation bias in which an experimenter continues
training models until a preexisting hypothesis is confirmed.
confusion matrix
An NxN table that summarizes how successful a classification model's predictions were; that
is, the correlation between the label and the model's classification. One axis of a confusion
matrix is the label that the model predicted, and the other axis is the actual label. N represents
the number of classes. In a binary classification problem, N=2. For example, here is a sample
confusion matrix for a binary classification problem:
                     Tumor (predicted)   Non-Tumor (predicted)
Tumor (actual)             18                      1
Non-Tumor (actual)          6                    452
The preceding confusion matrix shows that of the 19 samples that actually had tumors, the
model correctly classified 18 as having tumors (18 true positives), and incorrectly classified 1
as not having a tumor (1 false negative). Similarly, of 458 samples that actually did not have
tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6
false positives).
The confusion matrix for a multi-class classification problem can help you determine mistake
patterns. For example, a confusion matrix could reveal that a model trained to recognize
handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7.
Confusion matrices contain sufficient information to calculate a variety of performance metrics,
including precision and recall.
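As a quick check, those metrics can be computed directly from the sample matrix above:

```python
# values read off the sample confusion matrix: TP=18, FN=1, FP=6, TN=452
tp, fn, fp, tn = 18, 1, 6, 452
precision = tp / (tp + fp)                  # 18/24 = 0.75
recall = tp / (tp + fn)                     # 18/19 ~ 0.947
accuracy = (tp + tn) / (tp + fn + fp + tn)  # 470/477 ~ 0.985
print(precision, round(recall, 3), round(accuracy, 3))
```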
continuous feature
A floating-point feature with an infinite range of possible values. Contrast with discrete
feature
DataFrame
A popular datatype for representing data sets in Pandas. A DataFrame is analogous to a table.
Each column of the DataFrame has a name (a header), and each row is identified by a number.
data set
A collection of examples.
decision boundary
The separator between classes learned by a model in binary class or multi-class
classification problems. For example, in the following image representing a binary
classification problem, the decision boundary is the frontier between the orange class and the
blue class:
discrete feature
A feature with a finite set of possible values. For example, a feature whose values may only
be animal, vegetable, or mineral is a discrete (or categorical) feature. Contrast with continuous
feature
dynamic model
A model that is trained online in a continuously updating fashion. That is, data is continuously
entering the model.
ensemble
A merger of the predictions of multiple models. You can create an ensemble via one or more of
the following:
• different initializations
• different hyperparameters
• different overall structure
Deep and wide models are a kind of ensemble.
epoch
A full training pass over the entire data set such that each example has been seen once. Thus, an
epoch represents N/batch size training iterations, where N is the total number of examples.
Estimator
An instance of the tf.Estimator class, which encapsulates logic that builds a TensorFlow graph
and runs a TensorFlow session. You may create your own custom Estimators (as
described here) or instantiate premade Estimators created by others.
feature
An input variable used in making predictions.
feature cross
A synthetic feature formed by crossing (multiplying or taking a Cartesian product of)
individual features. Feature crosses help represent nonlinear relationships.
feature set
The group of features your machine learning model trains on. For example, postal code,
property size, and property condition might comprise a simple feature set for a model that
predicts housing prices.
gradient
The vector of partial derivatives with respect to all of the independent variables. In machine
learning, the gradient is the vector of partial derivatives of the model function. The gradient
points in the direction of steepest ascent.
gradient descent
A technique to minimize loss by computing the gradients of loss with respect to the model's
parameters, conditioned on training data. Informally, gradient descent iteratively adjusts
parameters, gradually finding the best combination of weights and bias to minimize loss.
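The iterative adjustment described above can be sketched in plain Python for a one-feature linear model with squared loss; the data, learning rate, and iteration count are illustrative assumptions:

```python
# Gradient descent fitting y = w*x + b. The data is generated by y = 2x + 1,
# so the weights should converge toward w = 2, b = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    # Gradients of mean squared loss with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # The gradient step is the learning rate times the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # approaches w = 2, b = 1
```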
heuristic
A practical and nonoptimal solution to a problem, which is sufficient for making progress or for
learning from.
hidden layer
A synthetic layer in a neural network between the input layer (that is, the features) and
the output layer (the prediction). A neural network contains one or more hidden layers.
hyperplane
A boundary that separates a space into two subspaces. For example, a line is a hyperplane in
two dimensions and a plane is a hyperplane in three dimensions. More typically in machine
learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support
Vector Machines use hyperplanes to separate positive classes from negative classes, often in a
very high-dimensional space.
iteration
A single update of a model's weights during training. An iteration consists of computing the
gradients of the parameters with respect to the loss on a single batch of data.
learning rate
A scalar used to train a model via gradient descent. During each iteration, the gradient
descent algorithm multiplies the learning rate by the gradient. The resulting product is called
the gradient step. The learning rate is a key hyperparameter.
linear regression
A type of regression model that outputs a continuous value from a linear combination of input
features.
logistic regression
A model that generates a probability for each possible discrete label value in classification
problems by applying a sigmoid function to a linear prediction. Although logistic regression is
often used in binary classification problems, it can also be used in multi-class classification
problems (where it is called multi-class logistic regression or multinomial regression).
metric
A number that you care about. May or may not be directly optimized in a machine-learning
system. A metric that your system tries to optimize is called an objective.
neural network
A model that, taking inspiration from the brain, is composed of layers (at least one of which
is hidden) consisting of simple connected units or neurons followed by nonlinearities.
neuron
A node in a neural network, typically taking in multiple input values and generating one
output value. The neuron calculates the output value by applying an activation
function (nonlinear transformation) to a weighted sum of input values.
normalization
The process of converting an actual range of values into a standard range of values, typically -1
to +1 or 0 to 1. For example, suppose the natural range of a certain feature is 800 to 6,000.
Through subtraction and division, you can normalize those values into the range -1 to +1.
numpy
An open-source math library that provides efficient array operations in Python. pandas is built
on numpy.
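A minimal sketch of the array operations numpy provides:

```python
# numpy performs elementwise (vectorized) operations on whole arrays,
# avoiding explicit Python loops.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

print(a + b)         # [11. 22. 33.]  -- elementwise addition
print(a.mean())      # 2.0
print(np.dot(a, b))  # 140.0          -- inner product
```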
outliers
Values distant from most other values. In machine learning, any of the following are outliers:
• Weights with high absolute values.
• Predicted values relatively far away from the actual values.
• Input data whose values are more than roughly 3 standard deviations from the mean.
Outliers often cause problems in model training.
output layer
The "final" layer of a neural network. The layer containing the answer(s).
overfitting
Creating a model that matches the training data so closely that the model fails to make correct
predictions on new data.
pandas
A column-oriented data analysis API. Many ML frameworks, including TensorFlow, support
pandas data structures as input. See pandas documentation.
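A small sketch of pandas' column-oriented data structure; the column names and values are illustrative, echoing the housing feature set mentioned earlier:

```python
# A DataFrame stores data by named columns, each of which can be
# selected and aggregated independently.
import pandas as pd

df = pd.DataFrame({
    "postal_code": ["560060", "560001"],
    "property_size": [1200, 800],
    "price": [50.0, 65.0],
})

print(df["property_size"].mean())  # 1000.0
```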
parameter
A variable of a model that the ML system trains on its own. For example, weights are
parameters whose values the ML system gradually learns through successive training
iterations. Contrast with hyperparameter.
performance
Overloaded term with the following meanings:
• The traditional meaning within software engineering. Namely: How fast (or efficiently) does
this piece of software run?
• The meaning within ML. Here, performance answers the following question: How correct is
this model? That is, how good are the model's predictions?
precision
A metric for classification models. Precision identifies the frequency with which a model
was correct when predicting the positive class. That is:
Precision = True Positives / (True Positives + False Positives)
prediction
A model's output when provided with an input example.
recall
A metric for classification models that answers the following question: Out of all the possible
positive labels, how many did the model correctly identify? That is:
Recall = True Positives / (True Positives + False Negatives)
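Both formulas can be checked on hypothetical confusion-matrix counts (the numbers below are illustrative):

```python
# Precision: of the examples predicted positive, how many were correct?
def precision(tp, fp):
    return tp / (tp + fp)

# Recall: of the actual positives, how many did the model find?
def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 8/12, roughly 0.667
```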
regression model
A type of model that outputs continuous (typically, floating-point) values. Compare
with classification models, which output discrete values, such as "day lily" or "tiger lily."
regularization
The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds
of regularization include:
• L1 regularization
• L2 regularization
• dropout regularization
• early stopping (this is not a formal regularization method, but can effectively limit
overfitting)
regularization rate
A scalar value, represented as lambda, specifying the relative importance of the regularization
function. The following simplified loss equation shows the regularization rate's influence:
minimize(loss function + λ(regularization function))
Raising the regularization rate reduces overfitting but may make the model less accurate.
scikit-learn
A popular open-source ML platform. See www.scikit-learn.org.
semi-supervised learning
Training a model on data where some of the training examples have labels but others don’t. One
technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to
train on the inferred labels to create a new model. Semi-supervised learning can be useful if
labels are expensive to obtain but unlabeled examples are plentiful.
sigmoid function
A function that maps logistic or multinomial regression output (log odds) to probabilities,
returning a value between 0 and 1.
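The sigmoid can be written in one line; it squashes any real-valued score into the open interval (0, 1):

```python
import math

# Maps log odds z to a probability: large positive z -> near 1,
# large negative z -> near 0, z = 0 -> exactly 0.5.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5
print(sigmoid(4))   # close to 1
print(sigmoid(-4))  # close to 0
```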
sparsity
The number of elements set to zero (or null) in a vector or matrix divided by the total number
of entries in that vector or matrix. For example, consider a 10x10 matrix in which 98 cells
contain zero. The calculation of sparsity is as follows:
sparsity = 98/100 = 0.98
Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the
sparsity of the model weights.
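The 10x10 example above can be verified directly:

```python
# Sparsity: fraction of zero entries in a matrix.
def sparsity(matrix):
    cells = [v for row in matrix for v in row]
    return cells.count(0) / len(cells)

# A 10x10 matrix with only two nonzero cells, as in the example.
m = [[0] * 10 for _ in range(10)]
m[0][0], m[5][5] = 7, 3

print(sparsity(m))  # 0.98
```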
squared loss
The loss function used in linear regression. (Also known as L2 Loss.) This function calculates
the square of the difference between a model's predicted value for a labeled example and the
actual value of the label. Due to squaring, this loss function amplifies the influence of bad
predictions. That is, squared loss reacts more strongly to outliers than L1 loss.
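The amplification effect is easy to see numerically (the prediction and label values below are illustrative):

```python
# Squared (L2) loss for a single labeled example.
def squared_loss(predicted, actual):
    return (predicted - actual) ** 2

print(squared_loss(3.0, 2.0))   # 1.0
print(squared_loss(12.0, 2.0))  # 100.0 -- a 10x larger error costs 100x more
```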
static model
A model that is trained offline.
step
A forward and backward evaluation of one batch.
step size
Synonym for learning rate.
target
Synonym for label.
temporal data
Data recorded at different points in time. For example, winter coat sales recorded for each day
of the year would be temporal data.
unlabeled example
An example that contains features but no label. Unlabeled examples are the input to inference.
In semi-supervised and unsupervised learning, unlabeled examples are used during training.
validation set
A subset of the data set, disjoint from the training set, that you use to adjust
hyperparameters.
Contrast with the training set and test set.
weight
A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a
linear model is to determine the ideal weight for each feature. If a weight is 0, then its
corresponding feature does not contribute to the model.