Expectation maximization
We have observed variable \(x\), latent variable \(z\), and parameters \(\theta\). The complete-data log likelihood is \(l(\theta;x,z) = \log p(x,z | \theta)\). However, since we do not observe \(z\), the observed-data log likelihood \(l(\theta;x) = \log p(x | \theta) = \log \sum_z p(x,z | \theta)\) does not decouple, because of the log of a sum.
- E-step: Hallucinate some data, treating parameters as observed.
- M-step: Standard Bayes Net training, treating latent variables as observed.
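To make these two steps concrete, here is a minimal sketch of EM for a two-component 1D Gaussian mixture; the model choice, the function name `em_gmm`, and the initialization scheme are illustrative assumptions, not part of these notes.

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=100, rng=None):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = 0.5                                    # mixing weight of component 1
    mu = rng.choice(x, size=2, replace=False)   # random initial means
    var = np.array([x.var(), x.var()])          # start from the data variance
    log_liks = []
    for _ in range(n_iter):
        # E-step: "hallucinate" the latent assignments by computing
        # responsibilities q(z) = p(z | x, theta) for every data point.
        p1 = pi * gauss(x, mu[0], var[0])
        p2 = (1 - pi) * gauss(x, mu[1], var[1])
        r = p1 / (p1 + p2)                      # responsibility of component 1
        log_liks.append(np.log(p1 + p2).sum())  # observed-data log likelihood
        # M-step: weighted maximum likelihood, treating the soft assignments
        # as observed. Closed-form updates: no step size, and the constraints
        # 0 <= pi <= 1 and var > 0 hold automatically.
        pi = r.mean()
        mu = np.array([np.average(x, weights=r),
                       np.average(x, weights=1 - r)])
        var = np.array([np.average((x - mu[0]) ** 2, weights=r),
                        np.average((x - mu[1]) ** 2, weights=1 - r)])
    return (pi, mu, var), np.array(log_liks)
```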
So we define a new objective function: the expectation \(E_q[l(\theta;x,z)]\) of the complete log likelihood over \(z \sim q(z)\). Properties are:
- \(E_q[l(\theta;x,z)] + H_q\) is a lower bound on \(l(\theta;x)\), where \(H_q\) is the entropy of \(q\): \(l(\theta;x) \geq E_q[l(\theta;x,z)] + H_q\) (derivation below).
- Define the right-hand side as \(F = E_q[l(\theta;x,z)] + H_q\). The EM algorithm is coordinate ascent on \(F\), alternating between \(q\) (E-step) and \(\theta\) (M-step).
- The M-step maximizes this lower bound on the likelihood with respect to \(\theta\). The E-step closes the gap by setting \(q(z) = p(z | x, \theta)\).
- The EM objective is non-convex, so EM is a local optimization algorithm; choose multiple random starts.
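The lower bound above follows from Jensen's inequality, for any distribution \(q(z)\):

\[
l(\theta;x) = \log \sum_z p(x,z | \theta)
= \log \sum_z q(z)\,\frac{p(x,z | \theta)}{q(z)}
\geq \sum_z q(z) \log \frac{p(x,z | \theta)}{q(z)}
= E_q[l(\theta;x,z)] + H_q = F.
\]

The gap is exactly \(KL(q(z) \,\|\, p(z | x, \theta))\), which is why setting \(q(z) = p(z | x, \theta)\) in the E-step makes the bound tight.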
- Some good things about EM:
- no learning rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration is guaranteed not to decrease the likelihood
- Some bad things about EM:
- can get stuck in local optima
- can be slower than conjugate gradient (especially near convergence)
- requires expensive inference step
- is a maximum likelihood/MAP method
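Because EM only reaches a local optimum, a common pattern is to run it from several random starts and keep the best run; within each run the log-likelihood trace should be non-decreasing. A hypothetical usage of the `em_gmm` sketch above:

```python
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Several random restarts; keep the run with the highest final log likelihood.
runs = [em_gmm(x, rng=np.random.default_rng(seed)) for seed in range(5)]
(pi, mu, var), log_liks = max(runs, key=lambda run: run[1][-1])

# Each iteration is guaranteed not to decrease the likelihood.
assert np.all(np.diff(log_liks) >= -1e-9)
```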