
SCSA3016

DATA SCIENCE
UNIT -IV

Mrs. M. VANATHI
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

1 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

UNIT 4: MACHINE LEARNING TOOLS, TECHNIQUES AND APPLICATIONS (9 Hrs.)

Supervised Learning, Unsupervised Learning, Reinforcement Learning,
Dimensionality Reduction, Principal Component Analysis, Classification and
Regression models, Tree and Bayesian network models, Neural Networks,
Testing, Evaluation and Validation of Models.

2 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

What is Learning?
• Herbert Simon: “Learning is any process by which a
system improves performance from experience.”
• For a machine, experiences come in the form of
data.
• What is the task? Classification/Problem solving /
planning / control
• What does it mean to improve performance? Learning
is guided by an objective – a notion of loss or gain

3 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

To do better in the future based on what was experienced in the


past.

4 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Why Machine Learning?


•We need computers to make informed decisions on new,
unseen data.
•Often it is too difficult to design a set of rules “by
hand”.
•Machine learning is about automatically extracting
relevant information from data and applying it to
analyze new data.

5 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Machine Learning - Definitions
• Machine Learning - Field of study that gives computers
the ability to learn without being explicitly programmed
(Arthur Samuel 1959)
• A branch of artificial intelligence, concerned with the
design and development of algorithms that allow
computers to evolve behaviors based on empirical data
(experience)
• Learning denotes changes in the system that are adaptive
in the sense that they enable the system to do the same
task ( or tasks drawn from a population of similar tasks)
more effectively next time (H.Simon 1983)
6 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
So What Is Machine Learning?
•Automating automation -Getting computers to program themselves
•Writing software is the bottleneck -Let the data do the work instead!

Traditional Programming

Machine Learning

7 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
Traditional Programming: write programs using hard-coded (fixed) rules.

Machine Learning: learn rules by looking at some training data. (A small
sketch of the contrast follows.)
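A minimal sketch of the contrast, using a hypothetical spam example (the keywords, labels and scoring scheme are illustrative assumptions, not from the slides):

```python
# Traditional programming: a hard-coded rule written by hand.
def is_spam_rule(email_text: str) -> bool:
    # The programmer fixes the keywords and the decision logic in advance.
    return "free money" in email_text.lower() or "winner" in email_text.lower()

# Machine learning: learn the rule (here, simple keyword weights) from labeled data.
def learn_keyword_weights(training_data):
    """training_data: list of (email_text, is_spam) pairs."""
    weights = {}
    for text, label in training_data:
        for word in text.lower().split():
            weights[word] = weights.get(word, 0) + (1 if label else -1)
    return weights

def is_spam_learned(email_text: str, weights, threshold=0) -> bool:
    score = sum(weights.get(w, 0) for w in email_text.lower().split())
    return score > threshold

train = [("win free money now", True), ("meeting agenda attached", False)]
w = learn_keyword_weights(train)
print(is_spam_rule("You are a winner"), is_spam_learned("free money offer", w))
```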

8 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Machine Learning teams
• Observe the world (Labeled Data)
• Develop models that match observations
• Teach computer to learn these models
• Computer applies learned model to the world

Provides various techniques that can learn from and make predictions on
data
9 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Machine Learning Workflow

10 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

11 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
ML Terminology
Examples: Items or instances used for learning or evaluation.
Features: Set of attributes represented as a vector associated with an example.
Labels: Values or categories assigned to examples. In classification the labels
are categories; in regression the labels are real numbers.
Target: The correct label for a training example. This is extra data that is
needed for supervised learning.
Output: The label predicted for an input set of features using the trained model.
Training sample: Examples used to train a machine learning algorithm.
Validation sample: Examples used to tune parameters of a learning
algorithm.
Model: Information that the machine learning algorithm stores after training.
The model is used when predicting the output labels of new, unseen examples.

12 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

13 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

14 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

15 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

16 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Key Elements of Machine Learning

Representation: how to represent knowledge

Evaluation: accuracy, precision and recall, squared error, likelihood,
posterior probability, cost, margin, entropy

Optimization: combinatorial optimization, convex optimization, constrained
optimization

17 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Machine Learning Areas
Supervised Learning: Data and corresponding labels are given

Unsupervised Learning: Only data is given, no labels provided

Semi-supervised Learning: Some (if not all) labels are present

Reinforcement Learning: An agent interacting with the world makes
observations, takes actions, and is rewarded or punished; it should learn to
choose actions in such a way as to obtain a lot of reward
18 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Supervised Learning : Important Concepts
 Data: labeled instances <xi, yi>, e.g. emails marked spam/not spam
 Training Set
 Held-out Set
 Test Set
 Features: attribute-value pairs which characterize each xi
 Experimentation cycle
 Learn parameters (e.g. model probabilities) on the training set
 (Tune hyper-parameters on the held-out set)
 Compute accuracy on the test set
 Very important: never “peek” at the test set!
 Evaluation
 Accuracy: fraction of instances predicted correctly
 Overfitting and generalization
 Want a classifier which does well on test data
 Overfitting: fitting the training data very closely, but not generalizing well
(A small sketch of this train / held-out / test workflow follows.)
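A minimal sketch of the experimentation cycle above, using scikit-learn; the synthetic dataset, the logistic-regression model and the C values being tuned are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split into training, held-out (validation) and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_heldout, X_test, y_heldout, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:          # tune a hyper-parameter on the held-out set
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_heldout, model.predict(X_heldout))
    if acc > best_acc:
        best_model, best_acc = model, acc

# Only now do we "peek" at the test set, once, to report final accuracy.
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```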
19 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

20 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example: Spam Filter

21 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

22 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Classification
 In classification, we predict labels y (classes) for inputs x

 Examples:
 OCR (input: images, classes: characters)
 Medical diagnosis (input: symptoms, classes: diseases)
 Automatic essay grader (input: document, classes: grades)
 Fraud detection (input: account activity, classes: fraud / no fraud)
 Customer service email routing
 Recommended articles in a newspaper, recommended books
 DNA and protein sequence identification
 Categorization and identification of astronomical images
 Financial investments
 … many more

23 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Inductive learning
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))

Pure induction task:


Given a collection of examples of f, return a function h
that approximates f.
find a hypothesis h, such that h ≈ f, given a training set of
examples

(This is a highly simplified model of real learning:


Ignores prior knowledge
Assumes examples are given)

24 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Inductive learning method
Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting:

25 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,

26 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Supervised Learning
Learning a discrete function: Classification
Boolean classification:
 Each example is classified as true(positive) or false(negative).

Learning a continuous function: Regression

27 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

28 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

29 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Applications of SL

 Pattern recognition

 Medical diagnosis

 Speech recognition

 Face recognition

30 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Unsupervised Learning
 In unsupervised learning, there is no such supervisor and we

only have input data.

 The aim is to find the regularities in the input. There is a

structure to the input space such that certain patterns occur more
often than others, and we want to see what generally happens
and what does not.

 In statistics, this is called density estimation: modelling how the input data
are distributed.

31 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

32 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Types of USL….

Dimension reduction

Clustering

Association analysis ( e-commerce - recommendation

engine).

33 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Scenario
 A company has data about its past customers; the customer data contains
demographic information as well as past transactions with the company. The
company may want to see the distribution of the profiles of its customers,
i.e., what types of customers occur frequently.
 In such a case, a clustering model allocates customers who are similar in
their attributes to the same group, providing the company with natural
groupings of its customers; this is called customer segmentation. (A small
clustering sketch follows.)
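A minimal customer-segmentation sketch with k-means; the features, the toy customer table and the choice of 3 segments are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table: [age, annual_spend, n_transactions]
customers = np.array([
    [25, 1200, 10],
    [32, 4500, 40],
    [47,  800,  5],
    [51, 5200, 45],
    [23,  300,  2],
    [38, 2500, 22],
])

X = StandardScaler().fit_transform(customers)      # normalize features first
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("segment of each customer:", kmeans.labels_)
print("segment centers (standardized units):")
print(kmeans.cluster_centers_)
```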
34 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Reinforcement learning
 Reinforcement learning is an area of Machine Learning. It
is about taking suitable action to maximize reward in a
particular situation.
 It is employed by various software and machines to find the
best possible behavior or path it should take in a specific
situation.
 Reinforcement learning differs from supervised learning in that, in
supervised learning, the training data comes with the answer key, so the
model is trained with the correct answers; in reinforcement learning there is
no such answer key, and the reinforcement agent decides what to do to
perform the given task.
35 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example
The problem is as follows: we have an agent and a
reward, with many hurdles in between. The agent is
supposed to find the best possible path to reach the
reward. The following example illustrates this.

36 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
The image above shows a robot, a diamond, and fire.
The goal of the robot is to get the reward, i.e., the
diamond, while avoiding the hurdles, i.e., the fire.
The robot learns by trying all the possible paths and
then choosing the path which gives it the reward with
the fewest hurdles.
Each right step gives the robot a reward and each
wrong step subtracts from the robot's reward. The
total reward is calculated when it reaches the final
reward, the diamond. (A toy Q-learning sketch of this
idea follows.)
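A toy Q-learning sketch of the robot/diamond idea above; the 4x4 grid, the reward values and the hyper-parameters are illustrative assumptions:

```python
import random

N = 4                      # 4x4 grid; state = (row, col)
GOAL, FIRE = (3, 3), (1, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
alpha, gamma, eps = 0.5, 0.9, 0.1

Q = {((r, c), a): 0.0 for r in range(N) for c in range(N) for a in range(4)}

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    nxt = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    if nxt == GOAL:
        return nxt, 10.0, True      # reaching the diamond: positive reward
    if nxt == FIRE:
        return nxt, -10.0, True     # stepping into the fire: negative reward
    return nxt, -1.0, False         # every other step costs a little

for episode in range(500):
    s, done = (0, 0), False
    while not done:
        a = random.randrange(4) if random.random() < eps else max(range(4), key=lambda x: Q[(s, x)])
        s2, reward, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, x)] for x in range(4))
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s2

print("best first move from (0,0):", ["up", "down", "left", "right"][max(range(4), key=lambda x: Q[((0, 0), x)])])
```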

37 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

REINFORCEMENT LEARNING vs. SUPERVISED LEARNING

Reinforcement learning is all about making decisions sequentially: the output
depends on the state of the current input, and the next input depends on the
output of the previous input. In supervised learning, the decision is made on
the initial input, i.e., the input given at the start.

In reinforcement learning decisions are dependent, so we give labels to
sequences of dependent decisions. In supervised learning the decisions are
independent of each other, so a label is given to each decision.

Example: a chess game (reinforcement learning) vs. object recognition
(supervised learning).
38 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Types of Reinforcement
1. Positive –

Positive Reinforcement occurs when an event, occurring due to a particular
behavior, increases the strength and the frequency of that behavior. In other
words, it has a positive effect on behavior.
Advantages:
1. Maximizes performance
2. Sustains change for a long period of time
Disadvantage: too much reinforcement can lead to an overload of states,
which can diminish the results.


39 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
2. Negative –

Negative Reinforcement is defined as the strengthening of behavior because a
negative condition is stopped or avoided.
Advantages:
1. Increases behavior
2. Helps enforce a minimum standard of performance
Disadvantage: it only provides enough to meet the minimum behavior.

40 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Comparison Table

Criteria         | Supervised ML                      | Unsupervised ML                  | Reinforcement ML
Definition       | Learns by using labelled data      | Trained using unlabelled data    | Works by interacting with
                 |                                    | without any guidance             | the environment
Type of data     | Labelled data                      | Unlabelled data                  | No predefined data
Type of problems | Regression and classification      | Association and clustering       | Exploitation or exploration
Supervision      | Extra supervision                  | No supervision                   | No supervision
Algorithms       | Linear Regression, Logistic        | K-Means, C-Means, Apriori        | Q-Learning, SARSA
                 | Regression, SVM, KNN etc.          |                                  |
Aim              | Calculate outcomes                 | Discover underlying patterns     | Learn a series of actions
Application      | Risk evaluation, sales forecasting | Recommendation systems,          | Self-driving cars, gaming,
                 |                                    | anomaly detection                | healthcare

41 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Dimensionality
Reduction

42 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Curse of Dimensionality
 Increasing the number of features will not always improve classification
accuracy.

 In practice, the inclusion of more features might actually lead to worse
performance.

 The number of training examples required increases exponentially with the
dimensionality d (i.e., k^d bins).

(Figure: with k = 3 intervals per feature, the number of bins grows as 3, 3^2,
3^3 for d = 1, 2, 3.)
43 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Dimensionality Reduction
 What is the objective?
Choose an optimum set of features of lower dimensionality to
improve classification accuracy.

 Different methods can be used to reduce dimensionality:


Feature extraction
Feature selection

44 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
Feature extraction: finds a set of new features (i.e., through some mapping
f(), which could be linear or non-linear) from the existing features:

x = [x1, x2, ..., xN]^T  --f(x)-->  y = [y1, y2, ..., yK]^T

Feature selection: chooses a subset of the original features:

x = [x1, x2, ..., xN]^T  -->  y = [xi1, xi2, ..., xiK]^T
45 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feature Extraction
• Linear combinations are particularly attractive because they are simpler to
compute and analytically tractable.

• Given x ∈ R^N, find a K x N matrix T such that:

y = Tx ∈ R^K, where K << N

This is a projection from the N-dimensional space to a K-dimensional space.
46 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
 Popular linear feature extraction methods:
Principal Components Analysis (PCA): Seeks a projection that
preserves as much information in the data as possible.
Linear Discriminant Analysis (LDA): Seeks a projection that
best discriminates the data.

 Many other methods:


Making features as independent as possible (Independent
Component Analysis or ICA).
Retaining interesting directions (Projection Pursuit).
Embedding to lower dimensional manifolds (Isomap, Locally
Linear Embedding or LLE).

47 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Vector Representation
• A vector x ∈ R^N can be represented by its N components:
x = [x1, x2, ..., xN]^T

• Assuming the standard basis <v1, v2, ..., vN> (i.e., unit vectors in each
dimension), xi can be obtained by projecting x along the direction of vi:

xi = (x^T vi) / (vi^T vi) = x^T vi

• x can be “reconstructed” from its projections as follows:

x = Σ (i=1..N) xi vi = x1 v1 + x2 v2 + ... + xN vN

• Since the basis vectors are the same for all x ∈ R^N (standard basis), we
typically represent x simply as an N-component vector.
48 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
• Example assuming N = 2: x = [x1, x2]^T = [3, 4]^T

• Assuming the standard basis <v1 = i = [1, 0]^T, v2 = j = [0, 1]^T>, xi can be
obtained by projecting x along the direction of vi:

x1 = x^T i = [3 4][1 0]^T = 3
x2 = x^T j = [3 4][0 1]^T = 4

• x can be “reconstructed” from its projections as follows: x = 3i + 4j
49 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Principal Component Analysis (PCA)
• If x ∈ R^N, then it can be written as a linear combination of an orthonormal
set of N basis vectors <v1, v2, ..., vN> in R^N (e.g., using the standard base):

vi^T vj = 1 if i = j, 0 otherwise

x = Σ (i=1..N) xi vi = x1 v1 + x2 v2 + ... + xN vN

• PCA seeks to approximate x in a subspace of R^N using a new set of K << N
basis vectors <u1, u2, ..., uK> in R^N:

x̂ = Σ (i=1..K) yi ui = y1 u1 + y2 u2 + ... + yK uK   (reconstruction)

where yi = (x^T ui) / (ui^T ui) = x^T ui, so x̂ is represented by [y1, ..., yK]^T,

such that || x − x̂ || is minimized
(i.e., minimize information loss)
50 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Principal Component Analysis (PCA)
The “optimal” set of basis vectors <u1, u2, …,uK> can be found as
follows (we will see why):
(1) Find the eigenvectors u𝑖 of the covariance matrix of the
(training) data Σx
Σx u𝑖= 𝜆𝑖 u𝑖

(2) Choose the K “largest” eigenvectors u𝑖 (i.e., corresponding to the


K “largest” eigenvalues 𝜆𝑖)

<u1, u2, …,uK> correspond to the “optimal” basis!

We refer to the “largest” eigenvectors u𝑖 as principal components.


51 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
PCA - Steps
Suppose we are given x1, x2, ..., xM (N x 1) vectors
(N: # of features, M: # of data)

Step 1: compute the sample mean
x̄ = (1/M) Σ (i=1..M) xi

Step 2: subtract the sample mean (i.e., center the data at zero)
Φi = xi − x̄

Step 3: compute the sample covariance matrix Σx
Σx = (1/M) Σ (i=1..M) (xi − x̄)(xi − x̄)^T = (1/M) Σ (i=1..M) Φi Φi^T = (1/M) A A^T

where A = [Φ1 Φ2 ... ΦΜ], i.e., the columns of A are the Φi
(N x M matrix)
52 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of Σx

Σx ui = λi ui, where we assume λ1 ≥ λ2 ≥ ... ≥ λN

Note: most software packages return the eigenvalues (and corresponding
eigenvectors) in decreasing order; if not, you can explicitly put them in this
order.

Since Σx is symmetric, <u1, u2, ..., uN> form an orthogonal basis in R^N and
we can represent any x ∈ R^N as:

x − x̄ = Σ (i=1..N) yi ui = y1 u1 + y2 u2 + ... + yN uN

where yi = ((x − x̄)^T ui) / (ui^T ui) = (x − x̄)^T ui if || ui || = 1
(i.e., this is just a “change” of basis)

Note: most software packages normalize ui to unit length to simplify
calculations; if not, you can explicitly normalize them.
53 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
PCA - Steps
Step 5: dimensionality reduction step – approximate x using only the first K
eigenvectors (K << N) (i.e., corresponding to the K largest eigenvalues, where
K is a parameter):

x − x̄ = Σ (i=1..N) yi ui = y1 u1 + y2 u2 + ... + yN uN

approximate using the first K eigenvectors only:

x̂ − x̄ = Σ (i=1..K) yi ui = y1 u1 + y2 u2 + ... + yK uK   (reconstruction)

so x is now represented by the K-dimensional vector [y1, y2, ..., yK]^T.
Note that if K = N, then x̂ = x (i.e., zero reconstruction error).

54 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
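A minimal NumPy sketch of Steps 1-5 above; the random toy data matrix and the choice K = 2 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # M = 100 samples, N = 5 features (rows = samples)

# Steps 1-2: compute and subtract the sample mean
mean = X.mean(axis=0)
Phi = X - mean

# Step 3: sample covariance matrix (N x N)
Sigma = (Phi.T @ Phi) / X.shape[0]

# Step 4: eigenvalues/eigenvectors, sorted in decreasing order of eigenvalue
eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh: Sigma is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: keep the first K eigenvectors and project (y = U^T (x - mean))
K = 2
U = eigvecs[:, :K]
Y = Phi @ U                                   # M x K matrix of projections

# Reconstruction and its error
X_hat = Y @ U.T + mean
print("mean reconstruction error:", np.mean(np.linalg.norm(X - X_hat, axis=1)))
print("fraction of variance preserved:", eigvals[:K].sum() / eigvals.sum())
```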
What is the Linear Transformation implied by PCA?
The linear transformation y = Tx which performs the dimensionality
reduction in PCA is:

x̂ − x̄ = Σ (i=1..K) yi ui = U [y1 y2 ... yK]^T

where U = [u1 u2 ... uK] is the N x K matrix whose columns are the first K
eigenvectors of Σx, and

[y1 y2 ... yK]^T = U^T (x − x̄), i.e., T = U^T is the K x N matrix whose rows
are the first K eigenvectors of Σx.

55 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
What is the form of Σy ?
Σx = (1/M) Σ (i=1..M) (xi − x̄)(xi − x̄)^T = (1/M) Σ (i=1..M) Φi Φi^T

Using diagonalization: Σx = P Λ P^T, where the columns of P are the
eigenvectors of Σx and the diagonal elements of Λ are the eigenvalues (i.e., the
variances).

yi = U^T (xi − x̄) = P^T Φi   (keeping all N components, i.e., U = P)

Σy = (1/M) Σ (i=1..M) (yi − ȳ)(yi − ȳ)^T = (1/M) Σ (yi)(yi)^T
   = (1/M) Σ (P^T Φi)(P^T Φi)^T = P^T [ (1/M) Σ Φi Φi^T ] P
   = P^T Σx P = P^T (P Λ P^T) P = Λ

Σy = Λ : PCA de-correlates the data and preserves the original variances!

56 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Interpretation of PCA
• PCA chooses the eigenvectors of the covariance matrix corresponding to the
largest eigenvalues.
• The eigenvalues correspond to the variance of the data along the
eigenvector directions.
• Therefore, PCA projects the data along the directions where the data varies
most (u1: direction of max variance; u2: orthogonal to u1).
• PCA preserves as much information in the data as possible by preserving as
much of its variance as possible.

57 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example
• Compute the PCA of the following dataset:
(1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)

• Compute the sample covariance matrix:

Σ̂ = (1/n) Σ (k=1..n) (xk − μ̂)(xk − μ̂)^T

• The eigenvalues can be computed by finding the roots of the characteristic
polynomial. (A short numeric check follows.)
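A short NumPy check of the example above; the printed values are whatever the computation yields, nothing is quoted from the slides:

```python
import numpy as np

X = np.array([[1, 2], [3, 3], [3, 5], [5, 4], [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

mu = X.mean(axis=0)                       # sample mean
Phi = X - mu
Sigma = (Phi.T @ Phi) / len(X)            # covariance with the 1/n convention used above

eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending order
print("mean:", mu)
print("covariance matrix:\n", Sigma)
print("eigenvalues (ascending):", eigvals)
print("eigenvectors (columns):\n", eigvecs)
```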

58 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
• The eigenvectors are the solutions of the systems:

Σx ui = λi ui

Note: if ui is a solution, then c·ui is also a solution, where c ≠ 0.

Eigenvectors can be normalized to unit length using: v̂i = vi / || vi ||

59 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
How do we choose K ?
• K is typically chosen based on how much information (variance) we want to
preserve. Choose the smallest K that satisfies the following inequality:

[ Σ (i=1..K) λi ] / [ Σ (i=1..N) λi ] > T, where T is a threshold (e.g., 0.9)

• If T = 0.9, for example, we “preserve” 90% of the information (variance) in
the data.

• If K = N, then we “preserve” 100% of the information in the data
(i.e., just a “change” of basis and x̂ = x).
60 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Approximation Error
• The approximation error (or reconstruction error) can be computed by:

|| x − x̂ ||, where x̂ = Σ (i=1..K) yi ui + x̄ = y1 u1 + y2 u2 + ... + yK uK + x̄
(reconstruction)

• It can also be shown that the approximation error can be computed as
follows:

|| x − x̂ || = (1/2) Σ (i=K+1..N) λi
61 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Data Normalization
• The principal components are dependent on the units used to
measure the original variables as well as on the range of values
they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data to have zero
mean and unit standard deviation:

(xi − μ) / σ, where μ and σ are the mean and standard deviation of the i-th
feature xi

62 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Application to Images
• The goal is to represent images in a space of lower
dimensionality using PCA.
o Useful for various applications, e.g., face recognition, image
compression, etc.

• Given M images of size N x N, first represent each image as a


1D vector (i.e., by stacking the rows together).
o Note that for face recognition, faces must be centered and of the same
size.

63 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Application to Images Contd.,
• The key challenge is that the covariance matrix Σx is now very large (i.e.,
N^2 x N^2) – see Step 3:

Step 3: compute the covariance matrix Σx

Σx = (1/M) Σ (i=1..M) Φi Φi^T = (1/M) A A^T, where A = [Φ1 Φ2 ... ΦΜ]
(N^2 x M matrix)

• Σx is now an N^2 x N^2 matrix – computationally expensive to compute its
eigenvalues/eigenvectors λi, ui:

(A A^T) ui = λi ui

64 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Application to Images Contd.,
• We will use a simple “trick” to get around this by relating the
eigenvalues/eigenvectors of A A^T to those of A^T A.

• Let us consider the matrix A^T A instead (i.e., an M x M matrix).
− Suppose its eigenvalues/eigenvectors are μi, vi:
(A^T A) vi = μi vi
− Multiply both sides by A:
A (A^T A) vi = A μi vi, or (A A^T)(A vi) = μi (A vi)
− Comparing with (A A^T) ui = λi ui:
λi = μi and ui = A vi

where A = [Φ1 Φ2 ... ΦΜ] (N^2 x M matrix)

65 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Application to Images Contd.,
• But do A A^T and A^T A have the same number of eigenvalues/eigenvectors?
− A A^T can have up to N^2 eigenvalues/eigenvectors.
− A^T A can have up to M eigenvalues/eigenvectors.
− It can be shown that the M eigenvalues/eigenvectors of A^T A are also the
M largest eigenvalues/eigenvectors of A A^T.

• Steps 3-5 of PCA need to be updated as follows:

66 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Application to Images Contd.,
Step 3: compute A^T A (i.e., instead of A A^T)

Step 4: compute μi, vi of A^T A

Step 4b: compute λi, ui of A A^T using λi = μi and ui = A vi, then normalize
ui to unit length.

Step 5: dimensionality reduction step – approximate x using only the first K
eigenvectors (K < M):

x̂ − x̄ = Σ (i=1..K) yi ui = y1 u1 + y2 u2 + ... + yK uK

so each image can be represented by a K-dimensional vector [y1, y2, ..., yK]^T.
(A small sketch of this trick follows.)

67 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
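A minimal NumPy sketch of the A^T A "trick" above; the image size, the number of training images and the random pixel values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, side = 20, 32                       # 20 training images of size 32 x 32 (N^2 = 1024 pixels)
images = rng.random((M, side * side))  # each row is one image flattened to a 1D vector

mean_face = images.mean(axis=0)
A = (images - mean_face).T             # A = [Phi_1 ... Phi_M], shape (N^2, M)

# Eigen-decompose the small M x M matrix A^T A instead of the huge N^2 x N^2 one
small = A.T @ A
mu, V = np.linalg.eigh(small)          # ascending eigenvalues
order = np.argsort(mu)[::-1]
mu, V = mu[order], V[:, order]

# Map back: u_i = A v_i, then normalize to unit length (these are the eigenfaces)
U = A @ V
U /= np.linalg.norm(U, axis=0)

# Represent an image by its first K coefficients y = U_K^T (x - mean_face)
K = 5
y = U[:, :K].T @ (images[0] - mean_face)
print("first image as a K-dimensional vector:", y)
```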
Example

Dataset

68 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example Contd., Top eigenvectors: u1,…uk
(visualized as an image - eigenfaces)

69 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example Contd.,
How can you visualize the eigenvectors (eigenfaces) as an
image?
• Their values must be first mapped to integer values in the
interval [0, 255] (required by PGM format).
• Suppose fmin and fmax are the min/max values of a given
eigenface (could be negative).
• If xϵ[fmin, fmax] is the original value, then the new value
yϵ[0,255] can be computed as follows:

y=(int)255(x - fmin)/(fmax - fmin)

70 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example Contd.,
Interpretation: represent a face in terms of eigenfaces

x̂ = Σ (i=1..K) yi ui + x̄ = y1 u1 + y2 u2 + ... + yK uK + x̄

so the face x is represented by the coefficient vector [y1, y2, ..., yK]^T.

71 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Limitations
• Background changes cause problems
- De-emphasize the outside of the face (e.g., by multiplying the input
image by a 2D Gaussian window centered on the face).
• Light changes degrade performance
- Light normalization might help but this is a challenging issue.
• Performance decreases quickly with changes to face size
- Scale input image to multiple sizes.
- Multi-scale eigenspaces.
• Performance decreases with changes to face orientation (but not as fast
as with scale changes)
- Out-of-plane rotations are more difficult to handle.
- Multi-orientation eigenspaces.

72 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Limitations contd.,
• Not robust to misalignment

73 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Limitations contd.,
• PCA is not always an optimal dimensionality-reduction technique
for classification purposes.

74 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Discriminant Analysis (LDA)
What is the goal of LDA?
• Seeks to find directions along which the classes are best separated
(i.e., increase discriminatory information).
• It takes into consideration the scatter (i.e., variance) within-classes
and between-classes.

Bad separability Good separability


75 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Discriminant Analysis (LDA) Contd.,
• Let us assume C classes, with each class containing Mi samples, i = 1, 2, ..., C,
and M the total number of samples:

M = Σ (i=1..C) Mi

• Let μi be the mean of the i-th class, i = 1, 2, ..., C, and μ the mean of the
whole dataset:

μ = (1/C) Σ (i=1..C) μi

Within-class scatter matrix:
Sw = Σ (i=1..C) Σ (j=1..Mi) (xj − μi)(xj − μi)^T

Between-class scatter matrix:
Sb = Σ (i=1..C) (μi − μ)(μi − μ)^T

76 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Discriminant Analysis (LDA) Contd.,
• Suppose the desired projection transformation is: y = U^T x

• Suppose the scatter matrices of the projected data y are S̃b and S̃w.

• LDA seeks transformations that maximize the between-class scatter and
minimize the within-class scatter:

max | S̃b | / | S̃w |   or   max | U^T Sb U | / | U^T Sw U |
77 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Discriminant Analysis (LDA) Contd.,
• It can be shown that the columns of the matrix U are the
eigenvectors (i.e., called Fisherfaces) corresponding to the largest
eigenvalues of the following generalized eigen-problem:

Sb uk  k S wuk
• It can be shown that Sb has at most rank C-1; therefore, the max
number of eigenvectors with non-zero eigenvalues is C-1, that
is:
max dimensionality of LDA sub-space is C-1
e.g., when C=2, we always end up with one LDA feature
no matter what the original number of features was!
78 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example

79 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Discriminant Analysis (LDA) Contd.,
• If Sw is non-singular, we can solve a conventional eigenvalue problem as
follows:

Sb uk = λk Sw uk   becomes   Sw^(-1) Sb uk = λk uk

• In practice, Sw is singular due to the high dimensionality of the data (e.g.,
images) and a much lower number of data points (M << N).

80 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Discriminant Analysis (LDA) Contd.,
To alleviate this problem, PCA could be applied first:

1) First, apply PCA to reduce data dimensionality:
x = [x1, ..., xN]^T  --PCA-->  y = [y1, ..., yM]^T

2) Then, apply LDA to find the most discriminative directions:
y = [y1, ..., yM]^T  --LDA-->  z = [z1, ..., zK]^T

(A small PCA + LDA pipeline sketch follows.)

81 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
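A minimal sketch of the PCA-then-LDA pipeline above, using scikit-learn; the digits dataset and the component counts are illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                  # 10 classes, 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) PCA reduces dimensionality (helping keep the within-class scatter non-singular),
# 2) LDA then finds at most C - 1 = 9 discriminative directions and classifies.
model = make_pipeline(PCA(n_components=30), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```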

Classification

82 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the classified
result from the model
 Test set is independent of training set, otherwise over-fitting
will occur
 If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known

83 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

84 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Workflow: Training Set → Learning algorithm (Induction) → Learn Model →
Model → Apply Model (Deduction) → Test Set
85 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Issues: Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle missing
values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize data to higher-level concepts (e.g., concept hierarchies, discretization)
– Normalize attribute values

86 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Classification Techniques
Decision Tree based Methods
Rule-based Methods
KNN
Naïve Bayes and Bayesian Belief Networks
Neural Networks
Support Vector Machines
and more...

87 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Learning decision trees
Example Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60,
>60)

88 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feature(Attribute)-based representations
• Examples described by feature(attribute) values
– (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)


89 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait:

90 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Expressiveness
 Decision trees can express any function of the input attributes.
 E.g., for Boolean functions, truth table row → path to leaf:

 Trivially, there is a consistent decision tree for any training set with
one path to leaf for each example (unless f nondeterministic in x) but
it probably won't generalize to new examples

 Prefer to find more compact decision trees

91 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Rule-Based Classifier
 Classify records by using a collection of “if…then…” rules

 Rule: (Condition) → y
where
 Condition is a conjunction of attribute tests
 y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
 (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
 (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

92 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
93 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Rule Extraction from a Decision Tree
 Rules are easier to understand than large trees
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction: the leaf holds
the class prediction

(Decision tree: root node age? with branches <=30, 31..40 and >40; the <=30
branch tests student? and the >40 branch tests credit rating?)

 Example: Rule extraction from our buys_computer decision tree

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = yes
IF age = old AND credit_rating = fair THEN buys_computer = no
94 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-Nearest Neighbors
Given a query item:
Find k closest matches
in a labeled dataset ↓

95 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-Nearest Neighbors
Given a query item: Return the most
Find k closest matches Frequent label

96 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-Nearest Neighbors
k = 3 votes for “cat”

97 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-Nearest Neighbors
2 votes for cat,
1 each for Buffalo, Cat wins…
Deer, Lion

98 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-NN Issues
The Data is the Model
• No training needed.
• Accuracy generally improves with more data.
• Matching is simple and fast (and single pass).
• Usually need data in memory, but can be run off disk.

Minimal Configuration:
• Only parameter is k (number of neighbors)
• Two other choices are important:
• Weighting of neighbors (e.g. inverse distance)
• Similarity metric
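A minimal k-nearest-neighbors sketch covering these choices, using scikit-learn; the iris dataset, k = 3 and the distance weighting are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k (number of neighbors) plus the two other important choices: neighbor
# weighting (inverse distance here) and the similarity metric (Euclidean here).
knn = KNeighborsClassifier(n_neighbors=3, weights="distance", metric="euclidean")
knn.fit(X_train, y_train)                     # "training" just stores the data

print("test accuracy:", knn.score(X_test, y_test))
print("predicted label of first test item:", knn.predict(X_test[:1])[0])
```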

99 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-NN Metrics
• Euclidean Distance: Simplest, fast to compute

• Cosine Distance: Good for documents, images, etc.

• Jaccard Distance: For set data:

• Hamming Distance: For string data:

100 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
K-NN Metrics
• Manhattan Distance: Coordinate-wise distance

• Edit Distance: for strings, especially genetic data.

• Mahalanobis Distance: Normalized by the sample


covariance matrix – unaffected by coordinate
transformations.

101 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Regression
• We want to find the best line (linear function y=f(X)) to explain the
data.

102 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Regression
• The predicted value of y is given by: ŷ = β0 + Σ (j=1..p) xj βj

• The vector of coefficients β is the regression model.

• If x0 = 1 (a constant intercept feature), the formula becomes a matrix
product: ŷ = X β

103 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Linear Regression
• We can write all of the input samples in a single matrix X:
i.e., rows of X are distinct observations, and columns of X are input features.

104 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Residual Sum-of-Squares
To determine the model parameters β from some data, we can write down the
Residual Sum of Squares:

RSS(β) = Σ_i (yi − xi^T β)^2, or symbolically RSS(β) = (y − Xβ)^T (y − Xβ).

To minimize it, take the derivative with respect to β, which gives:
X^T (y − Xβ) = 0

And if X^T X is non-singular, the unique solution is:
β̂ = (X^T X)^(-1) X^T y
(A small numeric sketch of this closed-form solution follows.)
105 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Iterative Regression Solutions
• The exact method requires us to invert a matrix whose size is
nfeatures x nfeatures. This will often be too big.

• There are many gradient-based methods which reduce the RSS error by
taking the derivative with respect to β, which was X^T (y − Xβ) (up to a
constant factor).

106 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Stochastic Gradient
• A very important set of iterative algorithms use stochastic gradient updates.

• They use a small subset or mini-batch X of the data, and use it to compute a
gradient which is added to the model:

β ← β + α X^T (y − Xβ)   (for the mini-batch X, y)

where α is called the learning rate.

• These updates happen many times in one pass over the dataset.

• It's possible to compute high-quality models with very few passes, sometimes
with less than one pass over a large dataset.
(A small mini-batch sketch follows.)
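A minimal mini-batch stochastic-gradient sketch for the linear model above; the batch size, learning rate and number of passes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta = np.zeros(p + 1)
alpha, batch_size = 0.05, 50

for epoch in range(10):                                 # a few passes over the data
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = -2 * Xb.T @ (yb - Xb @ beta) / len(idx)  # gradient of the mini-batch RSS
        beta -= alpha * grad                            # step against the gradient

print("estimated coefficients:", beta)
```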

107 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
R2-values and P-values
• We can always fit a linear model to any dataset, but how do we know
if there is a real linear relationship?

108 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
R2-values and P-values
Approach: Use a hypothesis test. The null hypothesis is that there is no linear
relationship (β = 0).

Statistic: Some value which should be small under the null hypothesis, and
large if the alternate hypothesis is true.

R-squared: a suitable statistic. Let ŷi be a predicted value and ȳ the sample
mean. Then the R-squared statistic is

R^2 = 1 − Σ_i (yi − ŷi)^2 / Σ_i (yi − ȳ)^2

and can be described as one minus the fraction of the total variance not
explained by the model.

109 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
R-squared

R^2 = 1 − Σ_i (yi − ŷi)^2 / Σ_i (yi − ȳ)^2

The numerator (the residual sum of squares about the fitted line) is small if
the fit is good; the denominator is the total sum of squares about the mean.

(Figure: data points plotted against X, with the fitted regression line and the
horizontal line at the mean of y.)
110 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
R2-values and P-values
• Statistic: From R-squared we can derive another statistic (using degrees of
freedom) that has a standard distribution called an F-distribution.

• From the CDF for the F-distribution, we can derive a P-value for the
data.

• The P-value is, as usual, the probability of observing the data under
the null hypothesis of no linear relationship.

• If p is small, say less than 0.05, we conclude that there is a linear


relationship.

111 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Tree and Bayesian


network
112 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Decision tree learning
 Aim: find a small tree consistent with the training examples
 Idea: (recursively) choose "most significant" attribute as root of
(sub)tree

113 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Decision Tree Construction Algorithm
 Principle
Basic algorithm (adopted by ID3, C4.5 and CART): a greedy
algorithm
Tree is constructed in a top-down recursive divide-and-conquer
manner
 Iterations
At start, all the training tuples are at the root
Tuples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g, information gain)
 Stopping conditions
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning
–majority voting is employed for classifying the leaf
There are no samples left
114 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3 (Playing Tennis):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

115 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example

116 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,

117 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,

118 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,

119 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,

120 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,

121 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,

122 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Tree Induction
 Greedy strategy.
Split the records based on an attribute test that optimizes certain
criterion.

 Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting

123 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Choosing an attribute
 Idea: a good attribute splits the examples into subsets that are
(ideally) "all positive" or "all negative"

 Patrons? is a better choice

124 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
How to determine the Best Split
Greedy approach:
Nodes with homogeneous class distribution are preferred

Need a measure of node impurity:

Example: a node with class counts C0: 5, C1: 5 is non-homogeneous (high
degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of
impurity).

125 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Measures of Node Impurity
 Information Gain

 Gini Index

 Misclassification error

Choose attributes to split to achieve minimum impurity

126 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Attribute Selection Measure: Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

Info(D) = − Σ (i=1..m) pi log2(pi)

 Information needed (after using A to split D into v partitions) to classify D:

InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)

 Information gained by branching on attribute A:

Gain(A) = Info(D) − InfoA(D)

(A small sketch computing these quantities follows.)
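A small sketch computing entropy and information gain for attribute "age" on the buys_computer data from the earlier slide; pure Python, no libraries:

```python
from math import log2
from collections import Counter

# (age, buys_computer) pairs taken from the training dataset slide
data = [("<=30", "no"), ("<=30", "no"), ("31…40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31…40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31…40", "yes"),
        ("31…40", "yes"), (">40", "no")]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = [label for _, label in data]
info_d = info(labels)

# Information needed after splitting on "age"
values = set(age for age, _ in data)
info_age = sum(
    (len([l for a, l in data if a == v]) / len(data)) * info([l for a, l in data if a == v])
    for v in values
)

print("Info(D)     =", round(info_d, 3))
print("Info_age(D) =", round(info_age, 3))
print("Gain(age)   =", round(info_d - info_age, 3))
```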


127 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Information gain
 For the training set, p = n = 6, I(6/12, 6/12) = 1 bit
 Consider the attributes Patrons and Type (and others too):

IG(Patrons) = 1 − [ (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits
IG(Type) = 1 − [ (2/12) I(1/2,1/2) + (2/12) I(1/2,1/2) + (4/12) I(2/4,2/4) + (4/12) I(2/4,2/4) ] = 0 bits

 Patrons has the highest IG of all attributes and so is chosen by the DTL
algorithm as the root

128 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example contd.,
 Decision tree learned from the 12 examples:

 Substantially simpler than “true” tree---a more complex


hypothesis isn’t justified by small amount of data
129 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Measure of Impurity: GINI
(CART, IBM Intelligent Miner)
 Gini Index for a given node t:

GINI(t) = 1 − Σ_j [ p(j | t) ]^2

(NOTE: p(j | t) is the relative frequency of class j at node t.)

Maximum (1 − 1/nc, where nc is the number of classes) when records are
equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most
interesting information.

Examples (counts of classes C1/C2 at a node):
C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
130 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Splitting Based on GINI
Used in CART, SLIQ, SPRINT.
When a node p is split into k partitions (children), the quality of the split is
computed as:

GINI_split = Σ (i=1..k) (ni / n) GINI(i)

where ni = number of records at child i, and n = number of records at
node p.

131 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Comparison of Attribute Selection Methods
 The three measures return good results but
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much
smaller than the others
Gini index:
biased to multivalued attributes
has difficulty when # of classes is large
tends to favor tests that result in equal-sized partitions and purity
in both partitions

132 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example Algorithm: C4.5
 Simple depth-first construction.

 Uses Information Gain

 Sorts Continuous Attributes at each node.

 Needs entire data to fit in memory.

 Unsuitable for Large Datasets.

 You can download the software from Internet

133 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Decision Tree Based Classification
 Advantages:

Easy to construct/implement
Extremely fast at classifying unknown records
Models are easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for
many simple data sets
Tree models make no assumptions about the distribution of the
underlying data : nonparametric
Have a built-in feature selection method that makes them
immune to the presence of useless variables

134 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Decision Tree Based Classification
Disadvantages

Computationally expensive to train


Some decision trees can be overly complex that do not
generalise the data well.
Less expressivity: There may be concepts that are hard to learn
with limited decision trees

135 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise or outliers
Poor accuracy for unseen samples

 Two approaches to avoid overfitting


Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
Use a set of data different from the training data to decide which is
the “best pruned tree”
136 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
What is a Bayesian Network ?
• Bayesian nets (BN) -network-based framework for representing
and analyzing models involving uncertainty;
• BN are different from other knowledge-based systems tools
because uncertainty is handled in a mathematically rigorous yet
efficient and simple way
• BN are different from other probabilistic analysis tools because of
o network representation of problems, use of Bayesian
statistics, and the synergy between these
• A graphical model that efficiently encodes the joint probability
distribution for a large set of variables
o Naïve Bayes is based on assumption of conditional
independence
o Bayesian networks provide a tractable method for specifying
dependencies among variables
137 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Why learn Bayesian networks?
• Combining domain expert knowledge with data
• Efficient representation and inference
• Incremental Learning
• Handling missing data
• Learning causal relationships

138 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
What Bayesian Networks are good for?
 Diagnosis: P(cause | symptom) = ?
 Prediction: P(symptom | cause) = ?
 Classification: max over class of P(class | data)
 Decision-making (given a cost function)

(Figure: a small network with causes C1, C2 pointing to a symptom node.)

Application areas: stock market, bio-informatics, medicine, computer
troubleshooting, speech recognition, text classification.

139 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
App. : Recommendation Systems

 Given user preferences, suggest recommendations


 Example: Amazon.com
 Input: movie preferences of many users
 Solution: model correlations between movie features
Users who like comedy, often like drama
Users who like action, often do not like cartoons
Users who like Robert Deniro films often like Al Pacino films
 Given user preferences, can predict probability that new movies match
preferences

140 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Sample of General Product Rule

X1

X2 X3

X5

X4 X6

p(x1, x2, x3, x4, x5, x6) = p(x6 | x5) p(x5 | x3, x2) p(x4 | x2, x1) p(x3 | x1) p(x2 | x1) p(x1)

141 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Terminology
 A Bayesian Belief Network describes the probability distribution
over a set of random variables Y1, Y2, …Yn
 Each variable Yi can take on the set of values V(Yi)
 The joint space of the set of variables Y is the cross product
V(Y1)  V(Y2) …  V(Yn)
 Joint probability distribution: specifies the probabilities of the
items in the joint space
 A Bayesian Network provides a way to describe the joint
probability distribution in a compact and structured manner.
 A set of random variables makes up the nodes of the network
 A set of directed links or arrows connects pairs of nodes. The
intuitive meaning of an arrow from X to Y is that X has a direct
influence on Y- direct dependence
142 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bayesian Networks
 Requires that graph is acyclic (no directed cycles - DAG)

 Each node has a conditional probability table that quantifies the


effects that the parents have on the node. The parents of a node
are all those nodes that have arrows pointing to it.

 2 components to a Bayesian network


The graph structure (conditional independence assumptions)
The numerical probabilities (for each variable given its parents)

143 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bayesian Networks - The Joint Probability Distribution
Due to the Markov condition, we can compute the joint probability
distribution over all the variables X1, ..., Xn in the Bayesian net using the
formula:

P(X1 = x1, ..., Xn = xn) = Π (i=1..n) P(Xi = xi | Parents(Xi))

(the full joint distribution equals the graph-structured approximation)

where Parents(Xi) means the values of the parents of the node Xi with respect
to the graph.

144 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bayesian Networks
Example network: Good Writer (W) and Smart (S) are parents of Review
Quality (Q); Reviewer Mood (M) depends on W; Review Length (L) depends
on M; Accepted (A) depends on M and Q.

Conditional probability table (CPT) for P(Q | W, S): one row per combination
of W and S, each giving the probabilities of the two values of Q
(0.6/0.4, 0.3/0.7, 0.4/0.6, 0.1/0.9).

nodes = domain variables
edges = direct causal influence

Network structure encodes conditional independencies:
I(Review-Length, Good-Writer | Reviewer-Mood)

145 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
BN Semantics
conditional independencies in the BN structure + local CPTs = the full joint
distribution over the domain

P(w, s, m, q, l, a) = P(w) P(s) P(m | w) P(q | w, s) P(l | m) P(a | m, q)

 Compact & natural representation:
for n nodes with at most k parents each, O(2^k n) parameters vs. O(2^n) for
the full joint, and the parameters are natural (local conditional probabilities)

146 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Reasoning in BNs
 Full joint distribution answers any query
 P(event | evidence)
W S

M Q

L A
 Allows combination of different types of reasoning:
Causal: P(Reviewer-Mood | Good-Writer)
Evidential: P(Reviewer-Mood | not Accepted)
Intercausal: P(Reviewer-Mood | not Accepted, Quality)

147 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bayesian Belief Networks (BNs)
A Bayesian network is made up of:
1. A Directed Acyclic Graph (here: A → B, B → C, B → D)
2. A set of tables for each node in the graph

P(A):       A=false 0.6,  A=true 0.4

P(B | A):   A=false: B=false 0.01, B=true 0.99
            A=true:  B=false 0.7,  B=true 0.3

P(D | B):   B=false: D=false 0.02, D=true 0.98
            B=true:  D=false 0.05, D=true 0.95

P(C | B):   B=false: C=false 0.4,  C=true 0.6
            B=true:  C=false 0.9,  C=true 0.1

148 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
A Directed Acyclic Graph
Each node in the graph is a A node X is a parent of another
random variable A node Y if there is an arrow from
node X to node Y eg. A is a
B parent of B

C D

Informally, an arrow from


node X to node Y means X
has a direct influence on Y

149 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
A Set of Tables for Each Node
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi))
that quantifies the effect of the parents on the node. The parameters are the
probabilities in these conditional probability tables (CPTs).

(The CPTs for P(A), P(B | A), P(C | B) and P(D | B) are the ones shown on
the previous slide.)

150 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
A Set of Tables for Each Node
Conditional Probability Distribution for C given B:
  B      C      P(C|B)
  false  false  0.4
  false  true   0.6
  true   false  0.9
  true   true   0.1

For a given combination of values of the parents (B in this example), the entries for P(C=true | B) and P(C=false | B) must add up to 1,
e.g. P(C=true | B=false) + P(C=false | B=false) = 1
If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored)

151 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Using a Bayesian Network Example

Using the network in the example (A → B; B → C; B → D), suppose you want to calculate:
P(A = true, B = true, C = true, D = true)
   = P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)   (this is from the graph structure)
   = (0.4) * (0.3) * (0.1) * (0.95)   (these numbers are from the conditional probability tables)
Using a Bayesian network to compute probabilities is called inference.
In general, inference involves queries of the form P(X | E), where
X = The query variable(s), E = The evidence variable(s)
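As a sanity check, the factorization can be evaluated in a few lines of Python (an illustrative sketch, not part of the original slides; the dictionaries simply encode the CPTs given above):

# Sketch: computing P(A=true, B=true, C=true, D=true) for the network A -> B, B -> C, B -> D
p_A = {True: 0.4, False: 0.6}                                # P(A)
p_B_given_A = {(True, True): 0.3, (True, False): 0.7,
               (False, True): 0.99, (False, False): 0.01}    # P(B=b | A=a), keyed by (a, b)
p_C_given_B = {(True, True): 0.1, (True, False): 0.9,
               (False, True): 0.6, (False, False): 0.4}      # P(C=c | B=b), keyed by (b, c)
p_D_given_B = {(True, True): 0.95, (True, False): 0.05,
               (False, True): 0.98, (False, False): 0.02}    # P(D=d | B=b), keyed by (b, d)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(a) P(b|a) P(c|b) P(d|b)."""
    return p_A[a] * p_B_given_A[(a, b)] * p_C_given_B[(b, c)] * p_D_given_B[(b, d)]

print(joint(True, True, True, True))   # 0.4 * 0.3 * 0.1 * 0.95 = 0.0114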

152 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example
• Topology of network encodes conditional independence
assertions:

• Weather is independent of the other variables


• Toothache and Catch are conditionally independent given Cavity

153 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Independence
Age and Gender independent (no edge between them):
  P(A|G) = P(A)    A ⊥ G
  P(G|A) = P(G)    G ⊥ A
  P(A,G) = P(A|G) P(G) = P(A) P(G)
  P(A,G) = P(G|A) P(A) = P(G) P(A)

[Network: Age, Gender → Smoking; Smoking → Cancer; Cancer → Serum Calcium, Lung Tumor]

Cancer is independent of Age and Gender given Smoking:
  P(C|A,G,S) = P(C|S)    C ⊥ A,G | S
Age is dependent on Gender, given Smoking.
Serum Calcium and Lung Tumor are dependent, but Serum Calcium is independent of Lung Tumor, given Cancer:
  P(L|SC,C) = P(L|C)

154 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Put it all together

[Network: Age → Exposure to Toxics; Age, Gender → Smoking; Exposure to Toxics, Smoking → Cancer; Cancer → Serum Calcium, Lung Tumor. Age and Gender are non-descendants of Cancer; Exposure to Toxics and Smoking are its parents; Serum Calcium and Lung Tumor are its descendants.]

P(A, G, E, S, C, L, SC) = P(A) · P(G) · P(E | A) · P(S | A, G) · P(C | E, S) · P(SC | C) · P(L | C)

Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.
Cancer is independent of Diet given Exposure to Toxics and Smoking.

155 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,

• Bayesian networks provide a natural representation for


(causally induced) conditional independence
• Topology + CPTs = compact representation of joint
distribution
• Generally easy for domain experts to construct

156 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Limitations Of Bayesian Networks

• Typically require initial knowledge of many probabilities; quality and extent of prior knowledge play an important role
• Significant computational cost (inference is an NP-hard task)
• Unanticipated probability of an event is not taken care of

157 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Constructing Bayesian Networks

158 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Constructing Bayesian networks

1. Choose an ordering of variables X1, … , Xn
2. For i = 1 to n
     add Xi to the network
     select parents from X1, … , Xi-1 such that
     P(Xi | Parents(Xi)) = P(Xi | X1, … , Xi-1)

This choice of parents guarantees:
  P(X1, … , Xn) = ∏_{i=1}^{n} P(Xi | X1, … , Xi-1)     (chain rule)
                = ∏_{i=1}^{n} P(Xi | Parents(Xi))      (by construction)

159 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bayesian Networks – Example I
[Network: Smoking → Cancer]
S ∈ {no, light, heavy}        C ∈ {none, benign, malignant}

P(S):   P(S=no) = 0.80    P(S=light) = 0.15    P(S=heavy) = 0.05

P(C|S):               Smoking = no    light    heavy
   P(C=none)                   0.96    0.88     0.60
   P(C=benign)                 0.03    0.08     0.25
   P(C=malignant)              0.01    0.04     0.15

160 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Product Rule
P(C,S) = P(C|S) · P(S)

  S \ C     none     benign    malignant
  no        0.768    0.024     0.008
  light     0.132    0.012     0.006
  heavy     0.035    0.010     0.005

161 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Marginalization

P(C,S) = P(C|S) · P(S)

  S \ C     none     benign    malignant    total (= P(Smoking))
  no        0.768    0.024     0.008        0.800
  light     0.132    0.012     0.006        0.150
  heavy     0.035    0.010     0.005        0.050
  total     0.935    0.046     0.019        ← P(Cancer)

162 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bayes Rule Revisited
P(S|C) = P(C|S) · P(S) / P(C) = P(C,S) / P(C)

  S \ C     none          benign        malignant
  no        0.768/0.935   0.024/0.046   0.008/0.019
  light     0.132/0.935   0.012/0.046   0.006/0.019
  heavy     0.035/0.935   0.010/0.046   0.005/0.019

                 Cancer = none    benign    malignant
  P(S=no)                 0.821    0.522     0.421
  P(S=light)              0.141    0.261     0.316
  P(S=heavy)              0.037    0.217     0.263
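The product rule, marginalization, and Bayes rule steps above can be reproduced with a few lines of NumPy (an illustrative sketch, not part of the slides):

# Sketch: product rule, marginalization and Bayes rule for the Smoking -> Cancer example
import numpy as np

p_S = np.array([0.80, 0.15, 0.05])                 # P(S) for S = no, light, heavy
p_C_given_S = np.array([[0.96, 0.03, 0.01],        # P(C | S=no),   C = none, benign, malignant
                        [0.88, 0.08, 0.04],        # P(C | S=light)
                        [0.60, 0.25, 0.15]])       # P(C | S=heavy)

p_CS = p_C_given_S * p_S[:, None]                  # product rule: P(C,S) = P(C|S) P(S)
p_C = p_CS.sum(axis=0)                             # marginalization: P(C) = sum over S of P(C,S)
p_S_given_C = p_CS / p_C[None, :]                  # Bayes rule: P(S|C) = P(C,S) / P(C)

print(np.round(p_C, 3))            # [0.935 0.046 0.019]
print(np.round(p_S_given_C, 3))    # columns match the table above, e.g. P(S=no | C=none) ≈ 0.821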

163 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Causes and Bayes’ Rule

Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause?
(causal direction: Rain → WetGrass; diagnostic direction: WetGrass → Rain)

P(R | W) = P(W | R) P(R) / P(W)
         = P(W | R) P(R) / [ P(W | R) P(R) + P(W | ~R) P(~R) ]
         = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6)
         = 0.75

164 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example II - Alarm (from Judea Pearl)
 You have a new burglar alarm installed at home. It is fairly reliable at
detecting a burglary, but also responds on occasion to minor earthquakes.
 You also have two neighbors, John and Mary, who have promised to call
you at work when they hear the alarm.
 John always calls when he hears the alarm, but sometimes confuses the
telephone ringing with the alarm and calls then, too.
 Mary, on the other hand, likes rather loud music and sometimes misses the
alarm altogether.
 Given the evidence of who has or has not called, we would like to estimate
the probability of a burglary.

165 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Step 1
 Determine what the propositional (random) variables should
be
 Determine causal (or another type of influence)
relationships and develop the topology of the network

166 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example II – Alarm Contd.,

 Variables: Burglary, Earthquake, Alarm, JohnCalls,


MaryCalls
 Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call

167 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Topology of Belief Network

Burglary Earthquake

Alarm

John Calls Mary Calls

168 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Step 2
 Specify a conditional probability table or CPT for each node.

 Each row in the table contains the conditional probability of each

node value for a conditioning case (possible combinations of values


for parent nodes).
 In the example, the possible values for each node are true/false.

 The sum of the probabilities for each value of a node given a

particular conditioning case is 1.

169 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example II-CPT for Alarm Node

  B   E    P(A=True | B,E)    P(A=False | B,E)
  T   T        0.950              0.050
  T   F        0.940              0.060
  F   T        0.290              0.710
  F   F        0.001              0.999

170 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example II – Alarm Contd.,

[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = 0.001        P(E) = 0.002

  B   E   P(A|B,E)          A   P(J|A)          A   P(M|A)
  T   T    0.95             T    0.90           T    0.70
  T   F    0.94             F    0.05           F    0.01
  F   T    0.29
  F   F    0.001

171 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Population-wide ANomaly Detection and
Assessment (PANDA)
 A detector specifically for a large-scale outdoor release of

inhalational anthrax
 Uses a massive causal Bayesian network

 Population-wide approach: each person in the population is

represented as a subnetwork in the overall model

172 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Population-Wide Approach
[Model structure: Anthrax Release (global node); Location of Release and Time of Release (interface nodes); one Person Model subnetwork for each person in the population]

 Note the conditional independence assumptions


 Anthrax is infectious but non-contagious
 Structure designed by expert judgment
 Parameters obtained from census data, training data, and expert
assessments informed by literature and experience
173 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Full Joint Distribution
P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062

174 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Conditional Independence in BNs
Three Types of Connections:
• Diverging (common cause): knowing S makes L and B independent
  [e.g. Lung Cancer (L) and Bronchitis (B) with common cause S]
• Serial (intermediate cause): knowing T makes A and X independent
• Converging (common effect): NOT knowing D or M makes L and B independent

175 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Semantics of Belief Networks

 View 1: A belief network is a representation of the joint

probability distribution (“joint”) of a domain.


 The joint completely specifies an agent’s probability

assignments to all propositions in the domain (both simple


and complex.)

176 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Example Calculation

Calculate the probability of the event that the alarm has


sounded but neither a burglary nor an earthquake has occurred,
and both John and Mary call.

P(J ^ M ^ A ^ ~B ^ ~E)
= P(J|A) P(M|A) P(A|~B,~E) P(~B) P(~E)

= 0.90 * 0.70 * 0.001 * 0.999 * 0.998


= 0.00062

177 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Network as representation of joint
 A generic entry in the joint probability distribution is the
probability of a conjunction of particular assignments to each
variable, such as:

P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | Parents(Xi))

• Each entry in the joint is represented by the product of


appropriate elements of the CPTs in the belief network.

178 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Semantics
 View 2: Encoding of a collection of conditional independence

statements.
JohnCalls is conditionally independent of other variables in

the network given the value of Alarm


 This view is useful for understanding inference procedures for

the networks.

179 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Causal Inferences
[Network and CPTs as in the alarm example: P(B) = 0.001, P(E) = 0.002; P(A|B,E): TT 0.95, TF 0.94, FT 0.29, FF 0.001; P(J|A): T 0.90, F 0.05; P(M|A): T 0.70, F 0.01]

Inference from cause to effect. E.g. given a burglary, what is the probability of John calling, P(J|B)?

P(A | B) = P(¬E)(0.94) + P(E)(0.95)
         = (0.998)(0.94) + (0.002)(0.95)
         ≈ 0.94
P(J | B) = P(A|B)(0.9) + P(¬A|B)(0.05)
         = (0.94)(0.9) + (0.06)(0.05)
         ≈ 0.85

P(M|B) = 0.67 via similar calculations

180 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Diagnostic Inferences
From effect to cause. E.g. given that John calls, what is P(Burglary)?

P(B | J) = P(J | B) P(B) / P(J)

What is P(J)? Need P(A) first:
P(A) = P(B)P(E)(0.95) + P(¬B)P(E)(0.29) + P(B)P(¬E)(0.94) + P(¬B)P(¬E)(0.001)
     = (0.001)(0.002)(0.95) + (0.999)(0.002)(0.29) + (0.001)(0.998)(0.94) + (0.998)(0.999)(0.001)
     ≈ 0.002517

P(J) = P(A)(0.9) + P(¬A)(0.05)
     = (0.002517)(0.9) + (0.9975)(0.05)
     ≈ 0.052

P(B | J) = (0.85)(0.001) / (0.052) ≈ 0.016

Many false positives.
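Both the causal query P(J|B) and the diagnostic query P(B|J) can be checked by brute-force inference by enumeration over the full joint. The following Python sketch (illustrative, not from the slides) encodes the CPTs above:

# Sketch: inference by enumeration in the burglary-alarm network
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}      # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                          # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                          # P(M=true | A)

def joint(b, e, a, j, m):
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def prob(query, evidence):
    """P(query | evidence); query and evidence are dicts over {'B','E','A','J','M'}."""
    names = ["B", "E", "A", "J", "M"]
    num = den = 0.0
    for values in product([True, False], repeat=5):
        assign = dict(zip(names, values))
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = joint(*values)
        den += p
        if all(assign[k] == v for k, v in query.items()):
            num += p
    return num / den

print(round(prob({"J": True}, {"B": True}), 3))   # causal: P(J|B) ≈ 0.849
print(round(prob({"B": True}, {"J": True}), 3))   # diagnostic: P(B|J) ≈ 0.016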
181 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

182 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Neural Network

183 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Real Neurons
•Cell Structure
1. Cell body (SOMA) - processes the inputs
2. Dendrites - accept the inputs
3. Axon - turns the processed input into output
4. Synaptic terminals - electro-chemical contacts between neurons

184 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
The Neuron Metaphor
 Neurons:
• The main purpose of neurons is to
receive, analyze and transmit
further the information in a form
of signals (electric pulses).

 Multiply inputs by weights along


edges
 Apply some function to the set of
inputs at each node

185 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
The goals of neural computation
 To understand how the brain actually works
   It's big, complicated and made of yucky stuff that dies when you poke it around.
 To understand a new style of computation
   Inspired by neurons and their adaptive connections.
   Different from sequential computation
   Good for cognitive processes (e.g. vision)
   Bad for computation processes (e.g. 23 x 71)
 To solve practical problems: learned algorithms are useful even if they don't depict how the brain works

186 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
A Neural Network
Can be viewed as a generalization of linear models.

Each ANN is composed of a collection of perceptrons grouped

in layers.
A typical structure has three layers: input, intermediate (called

the hidden layer) and output.


Several hidden layers can be placed between the input and

output layers.

187 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Artificial Neural Network
 ANNs are built of densely interconnected set of simple units,

each taking several real-valued inputs and producing single-


valued output.
 ANN is characterized by its architecture, its training or learning

algorithm and its activation functions.


 One motivation is to capture highly parallel computations on

distributed processes.
 Most ANN software run on sequential machines emulating

distributed processes.
188 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Pros and Cons of Neural Networks
 Pros
   Good to use in continuous domains with little knowledge.
   Ability to solve problems that are difficult to define.
   Can be used when a good functional model is not known.
   Provides human characteristics to problem solving that are difficult to simulate.
   Flexibility and ease of maintenance
   Fast processing speed.
 Cons
   Not interpretable, “black box”.
   Learning is slow.
   Good generalization can require many data points.

189 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Origin of Neural Networks
• To find information processing models of biological systems
• Term covers wide range of models
• Exaggerated claims of biological plausibility
• Biological realism imposes unnecessary constraints
• Neural networks are efficient models for machine learning
• Particularly multilayer perceptrons
• Network parameters are obtained in maximum likelihood framework
• A nonlinear optimization problem
• Requires evaluating derivative of log-likelihood function w.r.t network
parameters
• Done efficiently using error back propagation
190 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Types of Neural Network
Feed-forward neural network

Information travels in one direction only
direction only
No feedback or cycles –

Directed Acyclic graph


Simplest and most used one

Used for pattern recognition

191 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Types of Neural Network
Recurrent Neural Network
Recurrent neural network

Signals travel in both


directions
Complicated and dynamic

Can use their internal


memory to process arbitrary
sequences of inputs
Used to learn temporal
patterns
192 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Types of Neural Network

Linear Neuron
Logistic Neuron

Potentially more. Require a


convex loss function for
Perceptron gradient descent training.

193 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feed Forward Network Functions
 A neural network can also be represented similar to linear models but
basis functions are generalized.

Activation function:
  For regression: identity function
  For classification: a nonlinear function, e.g., sigmoid
  Coefficients wj adjusted during training
  There can be several activation functions
Basis functions:
  ϕj(x): a nonlinear function, e.g., tanh, of a linear combination of the D inputs
  Its parameters are adjusted during training

194 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
First Layer: Basis Functions
•D input variables x1, ..., xD
•M linear combinations in the form
    aj = Σ_{i=1}^{D} w_ji^(1) xi + w_j0^(1),   j = 1, ..., M
• Superscript (1) indicates parameters are in the first layer of the network
• Parameters wji are referred to as weights and wj0 as biases
• Quantities aj are known as activations
195 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
First Layer: Basis Functions

• Each activation aj is transformed using differentiable nonlinear


activation functions zj=h(aj)

• The zj correspond to outputs of basis functions ϕj(x)


• or first layer of network or hidden units
• Nonlinear functions h chosen to be sigmoidal
• Two examples of activation functions
 1. logistic sigmoid: σ(a) = 1/(1+exp(-a))
 2. tanh: tanh(a) = (e^a - e^(-a))/(e^a + e^(-a))
196 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Activation Function
 Determined by the nature of the data and the assumed

distribution of target variables.


 Can use a variety of activation function

Sigmoidal (S-shaped)

Logistic sigmoid: 1/(1+exp(-a)) (Used for binary classification)
Hyperbolic tangent - tanh

Radial Basis Function

197 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Activation Function

Softmax

(Used for multi-class classification)


Identity yk = ak (Useful for regression)

 Needs to be differentiable for gradient-based learning (later).

 Can use different activation function in each unit.
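For reference, the activation functions named above can be written in a few lines of NumPy (illustrative sketch, not from the slides):

# Sketch: common activation functions
import numpy as np

def logistic_sigmoid(a):        # 1 / (1 + exp(-a)), used for binary classification outputs
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                    # (e^a - e^-a) / (e^a + e^-a), a common hidden-unit nonlinearity
    return np.tanh(a)

def softmax(a):                 # normalized exponentials, used for multi-class classification outputs
    e = np.exp(a - np.max(a))   # subtract max for numerical stability
    return e / e.sum()

def identity(a):                # y_k = a_k, useful for regression outputs
    return a

print(logistic_sigmoid(np.array([-2.0, 0.0, 2.0])))
print(softmax(np.array([1.0, 2.0, 3.0])))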

198 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Second Layer: Activation Functions
•Values zj are again linearly combined to give output unit activations
    ak = Σ_{j=1}^{M} w_kj^(2) zj + w_k0^(2),   k = 1, ..., K
•Where K is the total number of outputs
• Output unit activations are transformed by using an appropriate activation function to give network outputs yk

199 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Overall Network Function
• Combining stages of the overall function with sigmoidal output:
    yk(x, w) = σ( Σ_{j=1}^{M} w_kj^(2) h( Σ_{i=1}^{D} w_ji^(1) xi + w_j0^(1) ) + w_k0^(2) )
• Where w is the set of all weights and bias parameters. Note presence of both σ and h functions
• Thus a neural network is simply
  – a set of nonlinear functions from input variables {xi} to output variables {yk}
  – controlled by vector w of adjustable parameters


200 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Forward Propagation
• Bias parameters can be absorbed into weight parameters by
defining a new input variable x0

• Process of evaluation is forward propagation through network


• Multilayer perceptron is a misnomer
– Since only continuous sigmoidal functions are used
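To make the evaluation concrete, here is a minimal NumPy sketch (illustrative only, not from the slides) of forward propagation through a network with one hidden layer of tanh units and sigmoidal outputs; the dimensions D, M, K and the random weights are placeholders:

# Sketch: forward propagation, y_k(x, w) = sigma( W2 · h( W1 · x + b1 ) + b2 )
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                                 # inputs, hidden units, outputs

W1, b1 = rng.normal(size=(M, D)), np.zeros(M)     # first-layer weights and biases
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)     # second-layer weights and biases

def forward(x):
    a = W1 @ x + b1                    # first-layer activations a_j
    z = np.tanh(a)                     # hidden units z_j = h(a_j)
    a_out = W2 @ z + b2                # output activations a_k
    y = 1.0 / (1.0 + np.exp(-a_out))   # sigmoidal outputs (classification)
    return y

print(forward(np.array([0.5, -1.0, 2.0])))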

201 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feed Forward Networks
• Connect together a number of
these units into a feed-forward
network (DAG)
• Figure shows a network with one
layer of hidden unit
• Implements function

202 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feed Forward Networks
 We have looked at generalized linear models of the form
     y(x, w) = f( Σ_j wj ϕj(x) )
 for fixed non-linear basis functions ϕ(·).
 We now extend this model by allowing adaptive basis functions and learning their parameters.

203 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feed Forward Networks
 In feed-forward networks (a.k.a. Multi-layer perceptrons) we let

each basis function be another non-linear function of a linear
combination of the inputs.

204 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feed Forward Networks
• Starting with input x= (x1,…,xD), construct linear combinations:
• These aj are known as activations.
• Pass through an activation function h(.)
• to get output zj=h(aj)
• Model of an individual neuron

205 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Representation power of Feed Forward
Networks
 Depends on the width and depth of the networks
 Boolean functions:
Represented by network with two layers of units where the
number of hidden units required grows exponentially.
 Continuous functions(bounded):
Can be approximated with arbitrarily small error, by network
with two layers of units.
 Arbitrary functions:
Can be approximated to arbitrary accuracy by a network with
three layers of units.

206 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Two-class Classification
• Neural Network:
• Two inputs, two hidden units with
tanh activation functions
• Red line:
• decision boundary for network
• Dashed lines:
• contours for two hidden units
• Green line:
• optimal decision boundary from
distributions of the data
• Multiple distinct choices for the
weight vector w for same mapping
function exists.

207 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Network Training
Given a specified network structure, how do we set its parameters
(weights)?
As usual, we define a criterion to measure how well our
network performs, optimize against it

For regression, training data are (xn, tn), tn ∈ ℝ. Squared error arises:
    E(w) = ½ Σ_{n=1}^{N} || y(xn, w) − tn ||²

208 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Network Training
For binary classification, this is another discriminative model; maximum likelihood (ML) gives the cross-entropy error:
    E(w) = − Σ_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) },  where yn = y(xn, w)

209 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Parameter Optimization: Geometrical View
• For either of these problems, the error
function E(w) is nasty.
• Nasty: non-convex, with local minima
• E(w): surface sitting over weight space
• wA:a local minimum; wB global
minimum
• Need to find minimum
• At point wC local gradient
• is given by vector ∇E(w)
• points in direction of greatest rate of
increase of E(w)
• Negative gradient points to rate of
greatest decrease
210 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Neural Network Learning Problem
• Goal is to learn the weights w from a labelled set of training
samples
• Learning procedure has two stages
• 1. Evaluate derivatives of error function ∇E(w) with respect to
weights w1,..wT
• 2. Use derivatives to compute adjustments to weights
• w(t+1) = w(t)−η∇E(w(t) )
• No. of weights is T=(D+1)M+(M+1)K
=M(D+K+1)+K
• Where D is no of inputs, M is no of hidden units, K is no of
outputs

211 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Descent Methods
• The typical strategy for optimization problems of this sort is a descent
method:

w(t+1) = w(t) + Δw(t)

• These come in many flavours


• Gradient descent ∇E(w(t))
• w(τ+1)=w(τ)−η∇E(w(τ))
• η – the learning rate
• Stochastic gradient descent ∇En(w(t))

• w(τ+1)=w(τ)−η∇En(w(τ))
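A minimal sketch (illustrative, not from the slides) contrasting batch gradient descent and stochastic gradient descent on a simple least-squares error; the data, learning rates, and iteration counts are placeholders:

# Sketch: batch vs. stochastic gradient descent, w <- w - eta * grad E(w)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_E(w, Xb, tb):                 # gradient of 0.5 * sum (Xb w - tb)^2
    return Xb.T @ (Xb @ w - tb)

w = np.zeros(3)
for step in range(500):                # batch gradient descent on the mean error
    w = w - 0.1 * grad_E(w, X, t) / len(X)

w_sgd = np.zeros(3)
for epoch in range(20):                # stochastic gradient descent: one example at a time
    for n in rng.permutation(len(X)):
        w_sgd = w_sgd - 0.01 * grad_E(w_sgd, X[n:n+1], t[n:n+1])

print(np.round(w, 2), np.round(w_sgd, 2))   # both end up close to the true weights [1.0, -2.0, 0.5]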
212 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Descent Methods
• Newton-Raphson (second order) ∇2
• All of these can be used here, stochastic gradient descent is
particularly effective

• For good optimization run gradient based algorithm multiple


times, using a different starting point every time.

• Redundancy in training data, escaping local minima

213 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Computing Gradients
• The function y(xn, w) implemented by a network is complicated
• It isn’t obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use finite differences, e.g.
    ∂En/∂wji ≈ [ En(wji + ε) − En(wji − ε) ] / (2ε)
• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W²) total per gradient descent step
• Use of gradients (error back propagation) improves computation speed.
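A small sketch (illustrative) of the finite-difference approximation; the toy error function is a placeholder chosen so the true gradient is known:

# Sketch: central finite differences, O(W) work per derivative
import numpy as np

def finite_difference_grad(E, w, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(w.size):              # one pair of function evaluations per weight
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return grad

E = lambda w: 0.5 * np.sum(w ** 2)       # toy error function, true gradient is w itself
w = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(E, w))      # ≈ [ 1. -2.  3.]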

214 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

Testing, Evaluation and


Validation of Models

215 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Training, Validation, and Test sets
 Training set: A set of examples used for learning, that is to fit

the parameters of the classifier.


 Validation set: A set of examples used to tune the parameters of

a classifier, for example to choose the number of hidden units in


a neural network.
 Test set: A set of examples used only to assess the performance

of a fully-specified classifier.

216 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Evaluating a model
• The most important thing we can do to properly evaluate the
model is to not train the model on the entire dataset.

• A typical train/test split would be to use 70% of the data for


training and 30% of the data for testing.

• It's important to use new data when evaluating the model to avoid overfitting to the training set. However, sometimes it's useful to evaluate the model while building it, to find the best parameters of a model – but we can't use the test set for this evaluation, or else
217 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Evaluating a model
• we'll end up selecting the parameters that perform best on the
test data but maybe not the parameters that generalize best.

• To evaluate the model while still building and tuning the model,
to create a third subset of the data known as the validation set.

• A typical train/test/validation split would be to use 60% of the


data for training, 20% of the data for validation, and 20% of the
data for testing.
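A typical way to produce such a 60/20/20 split with scikit-learn (illustrative sketch; the iris data just stands in for any dataset):

# Sketch: 60% train / 20% validation / 20% test split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# first split off 40% of the data, then split that part half-and-half into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # 90 30 30 for the 150-sample iris data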

218 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Classification Metrics
 When performing classification predictions, there's four types of
outcomes that could occur.

 True positives are when you predict an observation belongs to a


class and it actually does belong to that class.
 True negatives are when you predict an observation does not
belong to a class and it actually does not belong to that class.
 False positives occur when you predict an observation belongs
to a class when in reality it does not.
 False negatives occur when you predict an observation does not
belong to a class when in fact it does.

219 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Classification Metrics contd.,
• These four outcomes are often plotted on a confusion matrix.
• The following confusion matrix is an example for the case of binary classification. Generate this matrix after making predictions on your test data and then identify each prediction as one of the four possible outcomes described above.

220 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Three Main Metrics
Accuracy is defined as the percentage of correct predictions for the
test data. It can be calculated easily by dividing the number of
correct predictions by the number of total predictions.

Precision is defined as the fraction of relevant examples (true


positives) among all of the examples which were predicted to belong
in a certain class.

221 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
Recall is defined as the fraction of examples which were predicted
to belong to a class with respect to all of the examples that truly
belong in the class.
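These quantities can be computed directly from predictions, e.g. with scikit-learn (illustrative sketch; the label vectors are made up):

# Sketch: confusion matrix, accuracy, precision and recall for a binary classifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                    # 4 true positives, 4 true negatives, 1 false positive, 1 false negative
print(accuracy_score(y_true, y_pred))    # (tp + tn) / total = 0.8
print(precision_score(y_true, y_pred))   # tp / (tp + fp) = 0.8
print(recall_score(y_true, y_pred))      # tp / (tp + fn) = 0.8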

222 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Precision and Recall

223 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
F-score
• Combine the precision and recall metrics; the common approach for combining these metrics is known as the f-score:
    F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)
• The β parameter allows us to control the tradeoff of importance between precision and recall. β < 1 focuses more on precision while β > 1 focuses more on recall.
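scikit-learn exposes this as f1_score and fbeta_score (illustrative sketch, using the same made-up labels as the earlier example):

# Sketch: F1 and F-beta scores
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(f1_score(y_true, y_pred))               # beta = 1: harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1: weights precision more heavily
print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1: weights recall more heavily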

224 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Regression metrics
 Evaluation metrics for regression models are quite different than the

classification models because we are now predicting in a continuous range

instead of a discrete number of classes.

 If your regression model predicts the price of a house to be $400K and it

sells for $405K, that's a pretty good prediction.

 However, in the classification examples we were only concerned with

whether or not a prediction was correct or incorrect, there was no ability

to say a prediction was "pretty good".

 Thus, we have a different set of evaluation metrics for regression models.

225 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Metrics For Regression Models
Explained variance compares the variance within the expected
outcomes, and compares that to the variance in the error of our model.
This metric essentially represents the amount of variation in the
original dataset that our model is able to explain.

Mean squared error is simply defined as the average of squared


differences between the predicted output and the true output.

226 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
Squared error is commonly used because it is agnostic to whether
the prediction was too high or too low, it just reports that the
prediction was incorrect.

The R2 coefficient represents the proportion of variance in the


outcome that our model is capable of predicting based on its
features.
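All three regression metrics are available in scikit-learn (illustrative sketch; the price values are made up):

# Sketch: explained variance, mean squared error and R^2 for a regression model
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

y_true = [400, 250, 310, 500, 275]   # e.g. true house prices (in $K)
y_pred = [405, 240, 330, 480, 280]   # model predictions

print(explained_variance_score(y_true, y_pred))  # share of variance in y_true the model explains
print(mean_squared_error(y_true, y_pred))        # average squared difference
print(r2_score(y_true, y_pred))                  # R^2 coefficient of determination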

227 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Contd.,
R² coefficient:
    R² = 1 − Σ_n (tn − yn)² / Σ_n (tn − t̄)²

228 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bias vs Variance
 The ultimate goal of any machine learning model is to learn

from examples and generalize some degree of knowledge


regarding the task we're training it to perform.
 Some machine learning models provide the framework for

generalization by suggesting the underlying structure of that


knowledge.
 For example, a linear regression model imposes a framework to

learn linear relationships between the information we feed it.

229 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Bias vs Variance Contd.,
 However, sometimes we provide a model with

too much pre-built structure that we limit the


model's ability to learn from the examples -
such as the case where we train a linear model
on an exponential dataset.
 In this case, our model is biased by the pre-

imposed structure and relationships.


 Models with high bias pay little attention to

the data presented; this is also known


as underfitting
230 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Validation curves
 The goal with any machine learning model is generalization.

Validation curves allow us to find the sweet spot between underfitting


and overfitting a model to build a model that generalizes well.
 A typical validation curve is a plot of the model's error as a function of

some model hyperparameter which controls the model's tendency to


overfit or underfit the data.
 Which parameter we choose depends on the specific model we're evaluating; for example, we might choose to plot the degree of polynomial features (typically, this means you have polynomial features up to this degree) for a linear regression model.

231 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Validation curves Contd.,
 Generally, the chosen parameter will

have some degree of control over the


model's complexity.
 On this curve, we plot both the

training error and the validation error


of the model. Using both of these
errors combined, we can diagnose
whether a model is suffering from
high bias or high variance.

232 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Learning curves
 The second tool we'll discuss for diagnosing bias and variance in

a model is learning curves. Here, we'll plot the error of a model


as a function of the number of training examples. Similar to
validation curves, we'll plot the error for both the training data
and validation data.
 If our model has high bias, we'll observe fairly quick

convergence to a high error for the validation and training


datasets. If the model suffers from high bias, training on more
data will do very little to improve the model.
233 UNIT-IV 06/02/2024
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Learning curves Contd.,
 This is because models which underfit the data pay little

attention to the data, so feeding in more data will be useless.


 A better approach to improving models which suffer from high

bias is to consider adding additional features to the dataset so


that the model can be more equipped to learn the proper
relationships.

234 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Learning curves Contd.,
 If the model has high variance, we'll see a

gap between the training and validation error.


This is because the model is performing well
for the training data, since it has been overfit
to that subset, and performs poorly for the
validation data since it was not able to
generalize the proper relationships.
 In this case, feeding more data during training

can help improve the model's performance.
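scikit-learn's learning_curve helper produces exactly this kind of plot; a short illustrative sketch (the classifier and dataset are placeholders):

# Sketch: learning curve, error vs. number of training examples
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - val_scores.mean(axis=1), label="validation error")
plt.xlabel("number of training examples"); plt.ylabel("error"); plt.legend(); plt.show()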

235 UNIT-IV 06/02/2024


DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING

236 UNIT-IV 06/02/2024
