Expectation maximization
We have observed variable \(x\), latent variable \(z\), and parameters \(\theta\). The complete-data log likelihood is \(l(\theta;x,z) = \log p(x,z | \theta)\). However, since we do not observe \(z\), the observed-data log likelihood \(l(\theta;x) = \log p(x | \theta) = \log \sum_z p(x,z | \theta)\) does not decouple, because of the log of a sum.
- E-step: Hallucinate some data, treating parameters as observed.
- M-step: Standard Bayes Net training, treating latent variables as observed.
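To make these two steps concrete, here is a minimal sketch of EM for a two-component 1D Gaussian mixture; the model choice, the function name `em_gmm`, and the initialization scheme are illustrative assumptions, not part of these notes.

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=100, rng=None):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = 0.5                                    # mixing weight of component 1
    mu = rng.choice(x, size=2, replace=False)   # random initial means
    var = np.array([x.var(), x.var()])          # start from the data variance
    log_liks = []
    for _ in range(n_iter):
        # E-step: "hallucinate" the latent assignments by computing
        # responsibilities q(z) = p(z | x, theta) for every data point.
        p1 = pi * gauss(x, mu[0], var[0])
        p2 = (1 - pi) * gauss(x, mu[1], var[1])
        r = p1 / (p1 + p2)                      # responsibility of component 1
        log_liks.append(np.log(p1 + p2).sum())  # observed-data log likelihood
        # M-step: weighted maximum likelihood, treating the soft assignments
        # as observed. Closed-form updates: no step size, and the constraints
        # 0 <= pi <= 1 and var > 0 hold automatically.
        pi = r.mean()
        mu = np.array([np.average(x, weights=r),
                       np.average(x, weights=1 - r)])
        var = np.array([np.average((x - mu[0]) ** 2, weights=r),
                        np.average((x - mu[1]) ** 2, weights=1 - r)])
    return (pi, mu, var), np.array(log_liks)
```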
So we define a new objective function: the expectation \(E_q[l(\theta;x,z)]\) of the complete log likelihood over \(z \sim q(z)\). Properties are:
- \(E_q[l(\theta;x,z)] + H_q\) is a lower bound on \(l(\theta;x)\), where \(H_q\) is the entropy of \(q\): \(l(\theta;x) \geq E_q[l(\theta;x,z)] + H_q\) (derivation below).
- Define the right-hand side as \(F = E_q[l(\theta;x,z)] + H_q\). The EM algorithm is coordinate ascent on \(F\), alternating between \(q\) (E-step) and \(\theta\) (M-step).
- The M-step maximizes this lower bound on the likelihood with respect to \(\theta\). The E-step closes the gap by setting \(q(z) = p(z | x, \theta)\).
- The EM objective is non-convex, so EM is a local optimization algorithm; choose multiple random starts.
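The lower bound above follows from Jensen's inequality, for any distribution \(q(z)\):

\[
l(\theta;x) = \log \sum_z p(x,z | \theta)
= \log \sum_z q(z)\,\frac{p(x,z | \theta)}{q(z)}
\geq \sum_z q(z) \log \frac{p(x,z | \theta)}{q(z)}
= E_q[l(\theta;x,z)] + H_q = F.
\]

The gap is exactly \(KL(q(z) \,\|\, p(z | x, \theta))\), which is why setting \(q(z) = p(z | x, \theta)\) in the E-step makes the bound tight.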
- Some good things about EM:
- no learning rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration is guaranteed not to decrease the likelihood
- Some bad things about EM:
- can get stuck in local optima
- can be slower than conjugate gradient (especially near convergence)
- requires expensive inference step
- is a maximum likelihood/MAP method
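Because EM only reaches a local optimum, a common pattern is to run it from several random starts and keep the best run; within each run the log-likelihood trace should be non-decreasing. A hypothetical usage of the `em_gmm` sketch above:

```python
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Several random restarts; keep the run with the highest final log likelihood.
runs = [em_gmm(x, rng=np.random.default_rng(seed)) for seed in range(5)]
(pi, mu, var), log_liks = max(runs, key=lambda run: run[1][-1])

# Each iteration is guaranteed not to decrease the likelihood.
assert np.all(np.diff(log_liks) >= -1e-9)
```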