Expectation Maximization - Georgia Tech - Machine Learning - English

The document discusses expectation maximization (EM), an algorithm that alternates between two steps: expectation (E-step) and maximization (M-step). In the E-step, it computes the probability that each data point belongs to each cluster. In the M-step, it recomputes the mean of each cluster as a weighted average based on the probabilities from the E-step. EM is similar to k-means clustering, but it soft-assigns data points to multiple clusters based on probabilities rather than hard-assigning each point to a single cluster. If the probabilities in EM were restricted to 0/1, it would be equivalent to k-means clustering.


So, this is going to lead us to the concept of expectation maximization. So, expectation maximization is actually, at an algorithmic level, surprisingly similar to k-means. What we're going to do is tick-tock back and forth between two different probabilistic calculations. So, you see that? I kind of drew it like the other one.

>> Mm hm. The names of the two phases are expectation and maximization. Sort of, you know, our name is our algorithm.

>> I like that.

>> So, what we're going to do is, we're going to move back and forth between a soft clustering, and computing the means from that soft clustering. So the soft clustering goes like this. This probabilistic indicator variable, Z_ij, represents the likelihood that data element i comes from cluster j. And so, the way we're going to do that, since we're in the maximum likelihood setting, is to use Bayes' rule and say, well, that's going to be proportional to the probability that data element i was produced by cluster j. And then we have a normalization factor. Normally, we'd also have the prior in there. So why is the prior gone, Charles?

>> Well, because you said it was the maximum likelihood scenario.

>> Yeah, right. We talked about how that just meant that it was uniform, and that allowed us to just leave that component out. It's not going to have any impact on the normalization.

>> Right.

>> So that's what the Z step is: if we had the clusters, if we knew where the means were, then we could compute how likely it is that the data would come from the means, and that's just this calculation here. So that's computing the expectation, defining the Z variables from the μ's, the centers. We're going to pass that information, that clustering information Z, over to the maximization step. What the maximization step is going to say is, okay, well if that's the clustering, we can compute the means from those clusters. All we have to do is just take the average variable value, right? So the average of the x_i's within each cluster j, weighted by the likelihood that each one came from cluster j, and then again we have to normalize. If you think of this as being a 0/1 indicator variable, then really it is just the average of the things we assign to that cluster. But here we actually are kind of soft assigning, so we could have half of one of the data points in there, and it only counts half towards the average, and we could have a tenth in another place, and a whole value in another place, and so we're just doing this weighted average of the data points.
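To make the two steps described above concrete, here is a minimal sketch in Python, assuming a mixture of k spherical, unit-variance Gaussians with uniform priors, i.e. the maximum likelihood setting from the lecture. The names X, mus, z, e_step, m_step, and em are illustrative, not part of the lecture.

import numpy as np

def e_step(X, mus):
    # Expectation: soft-assign each point x_i to each cluster j.
    # z[i, j] is proportional to P(x_i | mu_j); each row is then normalized.
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances to each center
    # Unnormalized Gaussian likelihoods; shifting by the row minimum only avoids
    # underflow and cancels in the normalization (uniform prior drops out too).
    lik = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)))
    return lik / lik.sum(axis=1, keepdims=True)

def m_step(X, z):
    # Maximization: each mean is the z-weighted average of the data points.
    return (z.T @ X) / z.sum(axis=0)[:, None]

def em(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]  # start from k random data points
    for _ in range(iters):
        z = e_step(X, mus)    # E: soft clustering from the current means
        mus = m_step(X, z)    # M: new means from the soft clustering
    return mus, z

Given data X of shape (n, d), em(X, k) returns the k cluster centers and the (n, k) array of soft assignments.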

>> So, can I ask you a question, Michael?

>> Yeah, shoot.

>> So, this makes sense to me, and I even get that for the Gaussian case, the Z_ij variables will always be non-zero in the end, because there's always some probability they come from some Gaussian, since Gaussians have infinite extent. So this all makes sense to me. Is there a way to take exactly this algorithm and turn it into k-means? I'm staring at it, and it feels like if all your probabilities were ones and zeroes, you would end up with exactly k-means. I think.
I think.
>> I dunno, I never really thought about that. Let's think about that for a moment. It's certainly the case that if all the Z variables were 0 or 1, then the maximization step would just give the means, which is what k-means does.

>> Mm-hm. Then, what would happen? We send these means back, and what we do in k-means is we say each data point belongs to its closest center.

>> Mm-hm.

>> Which is very similar, actually, to what this does, except that here we then make it proportional. So I guess it would be exactly that if we made these clustering assignments and pushed them to 0 or 1 depending on which was the most likely cluster. Right, so if you made it so that the probability of belonging to a cluster still depends upon all the clusters, but you always got a 1 or a 0, basically like a hidden argmax kind of a thing, or a hidden max or something, then you would end up with exactly k-means.

>> I think you're right.

>> Huh.

>> Yeah, I never thought about that.

>> Okay.
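Here is a sketch of the 0/1, "hidden argmax" variant being discussed, under the same assumptions as the earlier sketch: pushing each row of the assignments to a one-hot vector for its most likely cluster turns the E step into the nearest-center assignment of k-means, and the M step then reduces to a plain per-cluster average. The function name hard_e_step is illustrative.

import numpy as np

def hard_e_step(X, mus):
    # Hard assignment: 1 for the closest (most likely) center, 0 for all the rest.
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    z = np.zeros_like(d2)
    z[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return z

Swapping hard_e_step in for e_step in the earlier loop gives exactly the k-means (Lloyd's) iteration, with the usual caveat that a cluster left with no points would need special handling in the M step.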

>> So it really does end up being an awful lot like the k-means algorithm, which is improving in the error metric, this squared error metric. This is actually going to be improving in a probabilistic metric, right: the data is going to be more and more likely over time.

>> That makes sense.
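The "probabilistic metric" mentioned here can be checked directly. Under the same unit-variance Gaussian, uniform-prior assumptions as the sketches above, the log-likelihood of the data should be non-decreasing across EM iterations, the way the squared error of k-means is non-increasing. The helper name log_likelihood is illustrative.

import numpy as np

def log_likelihood(X, mus):
    # Log-likelihood of the data (up to an additive constant) under unit-variance
    # Gaussians centered at mus, mixed with uniform weights; log-sum-exp for stability.
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    m = -0.5 * d2
    mmax = m.max(axis=1, keepdims=True)
    return (mmax[:, 0] + np.log(np.exp(m - mmax).mean(axis=1))).sum()

Calling this after each E/M pass of the earlier loop should produce a monotonically non-decreasing sequence of values.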
