
Machine learning (unit 3)

22 July 2024 10:07 AM

Bayesian Reasoning: An Introduction


What is Bayesian Reasoning?
Bayesian reasoning provides a way to make inferences using probability. It assumes that the
things we are interested in are governed by probability distributions. By understanding and
calculating these probabilities along with observed data, we can make optimal decisions.

Why is it important?

In machine learning, Bayesian reasoning helps in quantitatively evaluating the evidence that
supports different hypotheses. It helps us understand and decide which hypothesis is more likely
to be true given the data we have.
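
The basic tool is Bayes' theorem, which relates these quantities for a hypothesis h and observed data D (standard form, stated here for reference):

P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}

Here P(h) is the prior probability of the hypothesis, P(D \mid h) is the likelihood of the data under it, P(D) is the probability of the data, and P(h \mid D) is the posterior probability of the hypothesis after seeing the data.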

Relevance of Bayesian Learning:

Bayesian Learning Algorithms


Bayesian learning algorithms, such as the Naive Bayes classifier, are practical and effective for
certain types of learning problems. These algorithms explicitly calculate the probabilities of
hypotheses and help us make informed decisions based on those probabilities.

Bayesian Analysis in Neural Network Learning Algorithms


Key Design Choice: Minimizing the Sum of Squared Errors
One common design choice in neural network learning is to minimize the sum of squared errors
(SSE). This means that the algorithm tries to make the difference between the predicted values
and the actual values as small as possible.
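
For training targets t_i and predicted outputs o_i, the sum of squared errors is commonly written as

E = \frac{1}{2} \sum_{i} (t_i - o_i)^2

where the factor of 1/2 is a convention that simplifies the gradient.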

Alternative Error Function: Cross Entropy


When the target function predicts probabilities, the sum of squared errors might not be
the best approach. Instead, the cross-entropy error function is more appropriate.

• Squared Error is about minimizing the differences between actual and predicted values for
each data point (used in regression).
• Cross Entropy is about comparing the distributions of actual and predicted probabilities,
focusing on the accuracy of the predicted probabilities (used in classification).
• Cross entropy penalizes confident wrong predictions more severely and encourages accurate
probability estimates (a short comparison in code follows this list).
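
A small sketch in plain Python (the numbers are made up purely for illustration) showing how the two error functions score the same wrong probability prediction:

import math

def squared_error(target, predicted):
    # difference between actual and predicted value, squared (regression-style loss)
    return (target - predicted) ** 2

def cross_entropy(target, predicted):
    # negative log-likelihood of the true label under the predicted probability
    return -(target * math.log(predicted) + (1 - target) * math.log(1 - predicted))

# It actually rained (target = 1) but the model predicted only a 10% chance of rain.
print(squared_error(1, 0.1))   # 0.81  -> modest penalty
print(cross_entropy(1, 0.1))   # ~2.30 -> much harsher penalty for a confident wrong probability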

Bayesian Analysis in Decision Tree Learning Algorithms


Bias in Decision Trees
Decision tree algorithms often favor shorter trees (trees with fewer splits and nodes). This is a
form of bias.

Bayesian Perspective:
From a Bayesian perspective, shorter trees are favored because they are simpler and less likely
to overfit the data. By favoring shorter trees, the algorithm aims to find a balance between
fitting the data well and maintaining simplicity, which is a key principle in Bayesian reasoning.

Features of Bayesian Learning Methods
• Incremental Updates:
○ Each new training example can adjust (increase or decrease) the estimated
probability that a hypothesis is correct.
• Combining Knowledge:
○ Prior knowledge is combined with new data to determine the final probability of
a hypothesis.
• Prior Knowledge:
○ Prior Probability: A belief about the likelihood of each hypothesis before seeing
the data.
Use Cases:
• Hypotheses can make predictions with probabilities (e.g., "93% chance of recovery").
• Combine predictions from multiple hypotheses, weighted by their probabilities, to
classify new instances.
• Bayesian methods set a standard for optimal decision-making against which other
methods can be compared.

Challenges of Bayesian methods:


• They typically require initial knowledge of many probabilities (priors and conditional
probabilities), which often must be estimated.
• Determining the Bayes optimal hypothesis can carry a significant computational cost in the
general case.

Maximum Likelihood and Least-Squared Error Hypothesis


Bayesian Analysis
Bayesian analysis helps us understand the relationship between the target function we want
to learn and the noise in the data. It shows that minimizing the squared error in predictions
is equivalent to finding the most likely hypothesis, given the data.

Hypothesis Space
Instance Space (X): The set of all possible inputs.
Hypothesis Space (H): A set of real-valued functions that map inputs from X to real numbers.

Learning Task
The learner’s task is to find a hypothesis that maximizes the likelihood of the observed data,
given that the hypotheses are equally probable before seeing any data.

Why Normal Distribution for Noise?


• Mathematical Convenience:
○ The properties of the Normal distribution make the analysis and calculations
simpler and more tractable.
• Real-World Relevance:
○ Many types of noise in physical systems naturally follow a Normal distribution.
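
Putting these pieces together: if each observed target value d_i equals the true function value plus Normally distributed noise, then (under the assumptions above, following the standard derivation) the maximum likelihood hypothesis is exactly the one that minimizes the sum of squared errors:

h_{ML} = \arg\max_{h \in H} \, p(D \mid h) = \arg\min_{h \in H} \, \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2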

Explanation of Maximum Likelihood Hypotheses for Predicting Probabilities

The Problem
Imagine you want to predict outcomes that are not certain, like whether a patient will
survive a disease or whether a loan applicant will repay their loan. These outcomes have
some inherent uncertainty and can be represented as probabilities.

Maximum Likelihood Estimation (MLE)


Goal: Estimate the parameters of a model such that it matches the observed data as
closely as possible.
Use Case: Common in probabilistic models like logistic regression and neural networks
that predict probabilities.

Analogy:
Think of predicting weather. Instead of saying "It will rain" (1) or "It won't rain" (0), you
say "There’s a 70% chance of rain." If your model is accurate, on days when it predicts a
70% chance, it will actually rain 70% of the time over the long run.
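
For 0/1 outcomes d_i and a hypothesis h that outputs a probability h(x_i), the maximum likelihood hypothesis is the one that maximizes the following expression (the standard cross-entropy form, stated here for reference):

h_{ML} = \arg\max_{h \in H} \, \sum_{i=1}^{m} \Big[ d_i \ln h(x_i) + (1 - d_i) \ln\big(1 - h(x_i)\big) \Big]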

How to Learn Probabilistic Functions using Neural Networks


The Goal
We want our neural network to predict probabilities, like predicting the chance of rain
based on weather conditions.



Gradient Search to Maximize Likelihood
In neural networks, we optimize parameters (weights and biases) using gradient-based
optimization techniques to maximize the likelihood (or minimize an equivalent loss
function).

Steps to Train a Neural Network (a short code sketch follows these steps):

Define the Model:


• Specify the architecture of the neural network (number of layers, types of layers,
activation functions).
○ Example: A simple neural network with an input layer, one hidden layer, and
an output layer.

Initialize Parameters:
• Start with small random values for weights and biases. This helps the network
begin learning.
○ Example: Weights could be initialized by sampling small random values (for instance,
from a Gaussian with a small variance).

Forward Propagation:
• Pass the input data through the network to get predicted outputs.
○ Example: For weather data, input features (like humidity, temperature) are
processed through the network to predict the probability of rain.

Compute the Loss:


• Calculate how far off the predicted values are from the actual values using a loss
function.
• Common Loss Function: Cross-entropy loss for probability predictions.
○ Example: If the model predicts a 70% chance of rain but it didn’t rain, the loss
function will show the discrepancy.

Backward Propagation:
• Use the chain rule to calculate the gradients (slopes) of the loss function with
respect to each parameter (weight and bias).
• This tells us how to adjust each parameter to reduce the loss.
○ Example: If increasing a weight slightly reduces the loss, the gradient will
indicate this direction.

Update Parameters:
• Adjust the parameters in the direction that reduces the loss using gradient descent
or its variants (like Adam, RMSprop).
○ Example: If the gradient indicates that a certain weight should be increased
to reduce the loss, we adjust it accordingly.
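
A minimal sketch of these steps, assuming PyTorch and made-up weather data (the feature values, network size, and learning rate are purely illustrative, not part of the original notes):

import torch
import torch.nn as nn

# 1. Define the model: input layer -> one hidden layer -> sigmoid output
model = nn.Sequential(
    nn.Linear(2, 8),   # 2 input features (e.g., humidity, temperature)
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),      # output is a probability in (0, 1)
)

# 2. Parameters start as small random values (PyTorch initializes them by default).

# Toy data: [humidity, temperature] -> did it rain (1) or not (0)?
X = torch.tensor([[0.9, 0.3], [0.2, 0.8], [0.8, 0.4], [0.1, 0.9]])
y = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

loss_fn = nn.BCELoss()                                     # cross-entropy loss for probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)  # gradient-based update rule

for epoch in range(200):
    p = model(X)              # 3. forward propagation: predicted probabilities of rain
    loss = loss_fn(p, y)      # 4. compute the loss (negative log-likelihood)
    optimizer.zero_grad()
    loss.backward()           # 5. backward propagation: gradients via the chain rule
    optimizer.step()          # 6. update parameters in the direction that reduces the loss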

Minimum Description Length principle:


The Minimum Description Length principle recommends choosing the
hypothesis that minimizes the description length of the hypothesis and the
description length of the data given the hypothesis.

Bayes' theorem and basic results from information theory can be used to
provide a rationale for this principle.
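
Formally, writing L_C(·) for the length of a description under a chosen encoding C, the principle selects (standard formulation):

h_{MDL} = \arg\min_{h \in H} \, \Big[ L_{C_1}(h) + L_{C_2}(D \mid h) \Big]

where C_1 encodes hypotheses and C_2 encodes the data given a hypothesis.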

Bayes Optimal Classifier:
The Bayes optimal classifier assigns a new instance the most probable classification, obtained by
combining the predictions of all hypotheses weighted by their posterior probabilities. On average,
no other classification method using the same hypothesis space and prior knowledge can
outperform it.

Drawback: Very costly in terms of computation, because it involves evaluating and combining
every hypothesis in the hypothesis space.
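
The standard formulation: for each possible classification v_j, weight every hypothesis's prediction by that hypothesis's posterior probability, and choose the classification with the largest combined value:

v_{OB} = \arg\max_{v_j \in V} \, \sum_{h_i \in H} P(v_j \mid h_i) \, P(h_i \mid D)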

Gibbs Algorithm:
Goal: Make a good prediction with much less computation.
Process:
• Random Selection: Choose one hypothesis from the hypothesis space 𝐻
at random. The selection is based on the probability that each hypothesis is correct
(posterior probability).
• Use the Chosen Hypothesis: Use this randomly chosen hypothesis to predict the
classification of the next instance.
Result: Not as accurate as the Bayes optimal classifier but still quite good.
Benefit: Much faster and less computationally expensive.
Expected Performance: Under certain conditions, the Gibbs Algorithm's error is at most twice
that of the Bayes optimal classifier.

The Naive Bayes classifier assumes that all the features are independent of one another given the
class. In reality, features are often not independent.
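
For reference, under that independence assumption the naive Bayes rule classifies a new instance with attribute values a_1, ..., a_n by choosing

v_{NB} = \arg\max_{v_j \in V} \, P(v_j) \prod_{i} P(a_i \mid v_j)

where the priors P(v_j) and the conditionals P(a_i \mid v_j) are estimated from the training data.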

Bayesian Belief Networks (BBNs)


Basic Idea:
A Bayesian Belief Network (BBN) is a more flexible and realistic model compared to Naive
Bayes.
It allows for representing dependencies between features.

Structure:
Think of a BBN as a map or network. Each node represents a variable (like color, size, shape),
and arrows (or edges) show how these variables are related.
For example, in a BBN, there might be an arrow from "season" to "fruit type" indicating that
the type of fruit you find depends on the season.

How It Works:
Instead of assuming that all features are independent, a BBN specifies which features are
directly related to each other.
It uses these relationships to calculate the probability of different outcomes.

Example:
Suppose you have a BBN with nodes for weather, road conditions, and accident risk. There
might be an arrow from weather to road conditions (bad weather leads to poor road
conditions) and another from road conditions to accident risk (poor road conditions increase
accident risk).
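
A small sketch in plain Python of how this example network would combine its conditional probabilities (all of the probability numbers below are made up for illustration):

# Hypothetical conditional probability tables for weather -> road -> accident.
P_bad_weather = 0.3                        # P(weather = bad)
P_poor_road   = {"bad": 0.8, "good": 0.1}  # P(road = poor | weather)
P_accident    = {"poor": 0.4, "ok": 0.05}  # P(accident | road)

# Joint probability of one full assignment, read straight off the network structure:
# P(weather, road, accident) = P(weather) * P(road | weather) * P(accident | road)
p_joint = P_bad_weather * P_poor_road["bad"] * P_accident["poor"]
print(p_joint)  # P(bad weather, poor road, accident) = 0.3 * 0.8 * 0.4 = 0.096

# Marginal probability of an accident: sum the joint over weather and road conditions.
p_accident_total = 0.0
for w, p_w in [("bad", P_bad_weather), ("good", 1 - P_bad_weather)]:
    for r in ["poor", "ok"]:
        p_r = P_poor_road[w] if r == "poor" else 1 - P_poor_road[w]
        p_accident_total += p_w * p_r * P_accident[r]
print(p_accident_total)  # = 0.1585 with these made-up numbers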

Why Use Bayesian Belief Networks?

Advantages:
• More Accurate: BBNs can model complex relationships between variables, leading to
more accurate predictions than Naive Bayes.
• Flexible: You can include your knowledge about which variables are related and which
are not.

Bayesian belief networks allow stating conditional independence assumptions that apply to
subsets of the variables.

A Bayesian belief network describes the joint probability distribution for a set
of variables.

A Bayesian network can be used to compute the probability distribution for any
subset of network variables given the values or distributions for any subset of the
remaining variables.
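
In symbols, for variables Y_1, ..., Y_n with Parents(Y_i) denoting the immediate predecessors of Y_i in the network, the joint distribution factorizes as (standard form):

P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P\big(y_i \mid \mathrm{Parents}(Y_i)\big)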

Expectation Maximization Algorithm (EM Algorithm)


What is the EM Algorithm?
The EM algorithm is a way to deal with missing or incomplete data. It helps us to estimate
the values of the missing data and update our model parameters until we get a good fit for
the data.

Steps of the EM Algorithm


• Initialization Step: Start with some initial guesses for the parameters of your model.

• Expectation Step (E-step): Use the current parameters to estimate the missing data.
Think of it as filling in the blanks using the information you already have.

• Maximization Step (M-step): With the newly filled-in data, update the parameters to
better fit the data.

• Repeat: Go back to the E-step and M-step and repeat these steps until the parameters
stop changing much (this is called convergence).

Advantages of EM Algorithm
• Easy to Implement: The basic steps (E-step and M-step) are often straightforward for
many problems.
• Closed-form Solution: Often, the M-step has a simple mathematical solution.
• Guaranteed Improvement: Each iteration improves the likelihood of the data fitting
the model.

Disadvantages of EM Algorithm
• Slow Convergence: It can take many iterations to converge.
• Local Optimum: It might not find the best solution, just a good one (local optimum).
• Complex Calculations: In some applications (for example, hidden Markov models) it involves
both forward and backward probability computations, which can become complex.

Applications of EM Algorithm
• Gaussian Density Estimation: Estimating the parameters of Gaussian distributions.
• Filling Missing Data: Filling in missing values in datasets.
• Natural Language Processing (NLP): Useful in language models.
• Computer Vision: Helps in image reconstruction.
• Medicine and Structural Engineering: Used in image reconstruction.
• Hidden Markov Models (HMM): Estimating parameters for HMMs.
• Gaussian Mixture Models: Estimating the parameters of mixture models.

Analogy to understand EM Algorithm:

Total Survey Sample Size: This is the total number of people you are interested in, let's say
200 people.
Survey Responses: Out of these 200 people, a certain number have responded to the survey,
and the remaining have not.

Detailed Example with Numbers


Total Number of People: 200
Number of People Who Answered the Survey: 100
Number of People Who Didn't Answer the Survey: 100

Now let's walk through the EM algorithm steps with these numbers:

Initialization
Initial Guess: Suppose you initially think that 50% of the people like ice cream.

E-step (Expectation Step)


• Estimate Missing Data: Based on your initial guess, you estimate the preferences of the
100 people who didn't answer the survey.
○ If 50% like ice cream, you estimate that 50 out of these 100 non-responders like
ice cream.



M-step (Maximization Step)
• Recalculate with Estimated Data: Now, you combine your initial survey responses with
your estimates:
○ Suppose in the 100 people who answered, 40 said they like ice cream.
○ So, initially, you have 40 (from survey responses) + 50 (from your estimate) = 90
people who like ice cream out of the total 200 people.
○ This means 90/200 = 45% of people like ice cream.

• Update Parameters: Based on this recalculation, you update your estimate to 45%.

Repeat
• Reiterate the Steps: You repeat the E-step and M-step:

• New E-step: Using 45%, estimate that 45% of the 100 non-responders like ice cream.
This gives you 45 people.
• New M-step: Combine again: 40 (from survey responses) + 45 (from new estimate) =
85 people out of 200.
• New percentage is 85/200 = 42.5%.

You keep repeating this process:

Each time, you use the latest estimate to update your guess.
Eventually, the percentage will stabilize and not change much with each iteration.

Convergence
Final Step: After several iterations the percentage stabilizes at 40% (each update computes
(40 + 100 × p) / 200, and the only value that no longer changes is p = 40%). This means your
final estimate is that 40% of the total 200 people like ice cream.
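
A short sketch of this iteration in plain Python (same numbers as above), showing the estimate settling at 40%:

# 100 of 200 people answered the survey; 40 of the responders like ice cream.
total_people   = 200
responders     = 100
likes_observed = 40

p = 0.5  # initial guess for the fraction of people who like ice cream
for step in range(50):
    # E-step: estimate how many of the 100 non-responders like ice cream
    estimated_likes = p * (total_people - responders)
    # M-step: recompute the fraction using observed + estimated counts
    new_p = (likes_observed + estimated_likes) / total_people
    if abs(new_p - p) < 1e-6:  # convergence: the estimate stops changing
        break
    p = new_p

print(round(p, 3))  # 0.4, i.e. 40%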
