Machine Learning (Unit 3)
Why is it important?
In machine learning, Bayesian reasoning helps in quantitatively evaluating the evidence that
supports different hypotheses. It helps us understand and decide which hypothesis is more likely
to be true given the data we have.
• Squared Error is about minimizing the differences between actual and predicted values for
each data point (used in regression).
• Cross Entropy is about comparing the distributions of actual and predicted probabilities,
focusing on the accuracy of the predicted probabilities (used in classification).
• Cross entropy penalizes wrong predictions more severely and encourages accurate
probability estimates
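To make the contrast concrete, here is a minimal Python sketch (the labels, predictions, and helper names are illustrative, not from the notes) computing both losses on the same predictions:

```python
import math

def squared_error(y_true, y_pred):
    """Mean squared error: average squared difference (regression loss)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy: compares true labels with predicted probabilities."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1]          # actual labels
y_good = [0.9, 0.1, 0.8, 0.7]  # well-calibrated, mostly correct probabilities
y_bad  = [0.6, 0.4, 0.5, 0.1]  # poorly calibrated probabilities

print(squared_error(y_true, y_good), squared_error(y_true, y_bad))
print(cross_entropy(y_true, y_good), cross_entropy(y_true, y_bad))
# Cross-entropy grows much faster than squared error on the confident wrong
# prediction (true label 1 predicted at 0.1), illustrating the heavier penalty.
```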
Bayesian Perspective:
From a Bayesian perspective, shorter trees are favored because they are simpler and less likely
to overfit the data. By favoring shorter trees, the algorithm aims to find a balance between
fitting the data well and maintaining simplicity, which is a key principle in Bayesian reasoning.
Hypothesis Space
Instance Space (X): The set of all possible inputs.
Hypothesis Space (H): A set of real-valued functions that map inputs from X to real numbers.
Learning Task
The learner’s task is to find a hypothesis that maximizes the likelihood of the observed data,
given that the hypotheses are equally probable before seeing any data.
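As a rough sketch of this idea (the coin-flip hypotheses and data below are invented for illustration), finding the maximum likelihood hypothesis simply means picking the hypothesis under which the observed data is most probable, since all hypotheses start out equally probable:

```python
# Hypotheses: candidate probabilities that a coin lands heads.
hypotheses = [0.3, 0.5, 0.7]

# Observed data: 1 = heads, 0 = tails.
data = [1, 1, 0, 1, 1]

def likelihood(h, data):
    """P(D | h) assuming independent flips with P(heads) = h."""
    p = 1.0
    for x in data:
        p *= h if x == 1 else (1 - h)
    return p

# With equal priors, the most probable hypothesis is the maximum likelihood one.
h_ml = max(hypotheses, key=lambda h: likelihood(h, data))
print(h_ml)  # 0.7 fits four heads out of five flips best
```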
Analogy:
Think of predicting weather. Instead of saying "It will rain" (1) or "It won't rain" (0), you
say "There’s a 70% chance of rain." If your model is accurate, on days when it predicts a
70% chance, it will actually rain 70% of the time over the long run.
Initialize Parameters:
• Start with small random values for weights and biases. This helps the network
begin learning.
○ Example: Weights could be initialized using a small random distribution.
Forward Propagation:
• Pass the input data through the network to get predicted outputs.
○ Example: For weather data, input features (like humidity, temperature) are
processed through the network to predict the probability of rain.
Backward Propagation:
• Use the chain rule to calculate the gradients (slopes) of the loss function with
respect to each parameter (weight and bias).
• This tells us how to adjust each parameter to reduce the loss.
○ Example: If increasing a weight slightly reduces the loss, the gradient will
indicate this direction.
Update Parameters:
• Adjust the parameters in the direction that reduces the loss using gradient descent
or its variants (like Adam, RMSprop).
○ Example: If the gradient indicates that a certain weight should be increased
to reduce the loss, we adjust it accordingly.
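The four steps above can be seen end to end in a minimal sketch. The code below trains a single sigmoid neuron (not a full network) with plain gradient descent on made-up humidity/temperature data; all names and numbers are illustrative assumptions:

```python
import math, random

random.seed(0)

# Initialize parameters: small random weights and a zero bias.
w = [random.uniform(-0.1, 0.1) for _ in range(2)]
b = 0.0
lr = 0.1  # learning rate

# Made-up data: [humidity, temperature] -> did it rain (1) or not (0)?
X = [[0.9, 0.6], [0.8, 0.7], [0.2, 0.9], [0.1, 0.8]]
y = [1, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(1000):
    for xi, yi in zip(X, y):
        # Forward propagation: compute the predicted probability of rain.
        z = w[0] * xi[0] + w[1] * xi[1] + b
        p = sigmoid(z)

        # Backward propagation: the chain rule for cross-entropy loss with a
        # sigmoid output gives dLoss/dz = (p - yi).
        dz = p - yi
        dw = [dz * xi[0], dz * xi[1]]
        db = dz

        # Update parameters: step against the gradient (plain gradient descent;
        # Adam or RMSprop would adapt this step size automatically).
        w = [w[0] - lr * dw[0], w[1] - lr * dw[1]]
        b -= lr * db

print([round(sigmoid(w[0] * xi[0] + w[1] * xi[1] + b), 2) for xi in X])
# High predicted probabilities for the rainy rows, low for the dry ones.
```

In a real network the same loop runs over many layers, with the chain rule propagating gradients backward through each layer in turn.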
Bayes' theorem and basic results from information theory can be used to
provide a rationale for the principle of preferring shorter (simpler) hypotheses described above.
Gibbs Algorithm:
Goal: Make a good prediction with much less computation.
Process:
• Random Selection: Choose one hypothesis from the hypothesis space 𝐻
at random. The selection is based on the probability that each hypothesis is correct
(posterior probability).
• Use the Chosen Hypothesis: Use this randomly chosen hypothesis to predict the
classification of the next instance.
Result: Not as accurate as the Bayes optimal classifier but still quite good.
Benefit: Much faster and less computationally expensive.
Expected Performance: Under certain conditions, the Gibbs Algorithm's error is at most twice
that of the Bayes optimal classifier.
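A minimal sketch of the idea (the threshold hypotheses and their posterior probabilities are invented for illustration): rather than combining every hypothesis as the Bayes optimal classifier does, draw a single hypothesis in proportion to its posterior and let it classify the instance on its own.

```python
import random

# Hypotheses: simple threshold rules on one feature, each paired with its
# (made-up) posterior probability of being correct.
hypotheses = [
    (lambda x: x > 0.3, 0.2),
    (lambda x: x > 0.5, 0.5),
    (lambda x: x > 0.7, 0.3),
]

def gibbs_classify(x):
    """Draw one hypothesis according to the posterior and use it alone."""
    rules = [h for h, _ in hypotheses]
    posteriors = [p for _, p in hypotheses]
    h = random.choices(rules, weights=posteriors, k=1)[0]
    return int(h(x))

print(gibbs_classify(0.6))  # 0 or 1, depending on which hypothesis was drawn
```

The Bayes optimal classifier would instead take a posterior-weighted vote over all three rules for every instance, which is more accurate but more expensive.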
The Naive Bayes classifier assumes that all features are conditionally independent given the
class. In reality, features are often not independent.
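A small sketch of what that assumption means in practice (all probabilities below are invented): the score for each class is just the prior multiplied by per-feature likelihoods, as if the features did not influence each other once the class is known.

```python
# Made-up conditional probability tables for a tiny fruit example.
prior = {"apple": 0.6, "orange": 0.4}
p_color = {"apple": {"red": 0.7, "orange": 0.1},
           "orange": {"red": 0.05, "orange": 0.8}}
p_shape = {"apple": {"round": 0.9, "oval": 0.1},
           "orange": {"round": 0.95, "oval": 0.05}}

def naive_bayes(color, shape):
    # Independence assumption: P(color, shape | class) = P(color | class) * P(shape | class)
    scores = {c: prior[c] * p_color[c][color] * p_shape[c][shape] for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalize to posteriors

print(naive_bayes("red", "round"))  # apple wins: roughly 0.95 vs 0.05
```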
Structure:
Think of a BBN as a map or network. Each node represents a variable (like color, size, shape),
and arrows (or edges) show how these variables are related.
For example, in a BBN, there might be an arrow from "season" to "fruit type" indicating that
the type of fruit you find depends on the season.
How It Works:
Example:
Suppose you have a BBN with nodes for weather, road conditions, and accident risk. There
might be an arrow from weather to road conditions (bad weather leads to poor road
conditions) and another from road conditions to accident risk (poor road conditions increase
accident risk).
Advantages:
• More Accurate: BBNs can model complex relationships between variables, leading to
more accurate predictions than Naive Bayes.
• Flexible: You can include your knowledge about which variables are related and which
are not.
A Bayesian belief network describes the joint probability distribution for a set
of variables.
A Bayesian network can be used to compute the probability distribution for any
subset of network variables given the values or distributions for any subset of the
remaining variables.
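A rough sketch of how a chain like weather → road conditions → accident risk encodes the joint distribution (all probability tables below are invented): the joint probability factorizes along the arrows, and any conditional probability can then be obtained by summing over the joint.

```python
from itertools import product

# Invented conditional probability tables for the chain
# weather -> road conditions -> accident.
p_weather = {"bad": 0.3, "good": 0.7}
p_road = {"bad": {"poor": 0.8, "ok": 0.2},
          "good": {"poor": 0.1, "ok": 0.9}}
p_accident = {"poor": {"yes": 0.4, "no": 0.6},
              "ok": {"yes": 0.05, "no": 0.95}}

def joint(weather, road, accident):
    # Factorization along the arrows of the network:
    # P(W, R, A) = P(W) * P(R | W) * P(A | R)
    return p_weather[weather] * p_road[weather][road] * p_accident[road][accident]

# P(accident = yes | weather = bad), computed by summing out road conditions.
num = sum(joint("bad", r, "yes") for r in ("poor", "ok"))
den = sum(joint("bad", r, a) for r, a in product(("poor", "ok"), ("yes", "no")))
print(num / den)  # ~0.33: bad weather raises the accident probability
```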
• Expectation Step (E-step): Use the current parameters to estimate the missing data.
• Maximization Step (M-step): With the newly filled-in data, update the parameters to
better fit the data.
• Repeat: Go back to the E-step and M-step and repeat these steps until the parameters
stop changing much (this is called convergence).
Advantages of EM Algorithm
• Easy to Implement: The basic steps (E-step and M-step) are often straightforward for
many problems.
• Closed-form Solution: Often, the M-step has a simple mathematical solution.
• Guaranteed Improvement: Each iteration improves the likelihood of the data fitting
the model.
Disadvantages of EM Algorithm
• Slow Convergence: It can take many iterations to converge.
• Local Optimum: It might not find the best solution, just a good one (local optimum).
• Complex Calculations: It involves both forward and backward probabilities, making it
complex.
Applications of EM Algorithm
• Gaussian Density Calculation: Estimating the density of Gaussian functions.
• Filling Missing Data: Filling in missing values in datasets.
• Natural Language Processing (NLP): Useful in language models.
• Computer Vision: Helps in image reconstruction.
• Medicine and Structural Engineering: Used in image reconstruction.
• Hidden Markov Models (HMM): Estimating parameters for HMMs.
• Gaussian Mixture Models: Estimating the parameters of mixture models.
Total Survey Sample Size: This is the total number of people you are interested in, let's say
200 people.
Survey Responses: Out of these 200 people, 100 responded to the survey (40 of them said they
like ice cream), and the remaining 100 did not respond.
Now let's walk through the EM algorithm steps with these numbers:
Initialization
Initial Guess: Suppose you initially think that 50% of the people like ice cream.
Expectation Step (E-step)
• Estimate Missing Data: Using the 50% guess, estimate that 50 of the 100 non-responders like
ice cream.
Maximization Step (M-step)
• Update Parameters: Combine the observed and estimated counts: 40 (from survey responses) +
50 (estimated) = 90 people out of 200, so you update your estimate to 90/200 = 45%.
Repeat
• Reiterate the Steps: You repeat the E-step and M-step:
• New E-step: Using 45%, estimate that 45% of the 100 non-responders like ice cream.
This gives you 45 people.
• New M-step: Combine again: 40 (from survey responses) + 45 (from new estimate) =
85 people out of 200.
• New percentage is 85/200 = 42.5%.
Each time, you use the latest estimate to update your guess.
Eventually, the percentage will stabilize and not change much with each iteration.
Convergence
Final Step: After several more iterations, the percentage stabilizes at 40% (each step moves the
estimate closer to the 40% observed among the responders). This means your final estimate is
that 40% of the total 200 people like ice cream.
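A short sketch of this walkthrough in code, using the same numbers (100 responders, 40 of whom like ice cream, and 100 non-responders); it shows the estimate settling at 40%:

```python
# Observed data from the survey.
responders = 100
likes_among_responders = 40
non_responders = 100
total = responders + non_responders

p = 0.50  # initial guess: 50% of people like ice cream
for i in range(20):
    # E-step: fill in the missing answers using the current estimate.
    expected_likes_missing = p * non_responders
    # M-step: re-estimate the percentage from observed + filled-in counts.
    p = (likes_among_responders + expected_likes_missing) / total
    print(f"iteration {i + 1}: {p:.4f}")
# The estimate moves 0.50 -> 0.45 -> 0.425 -> ... and converges to 0.40.
```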