Expectation maximization

We have an observed variable \(x\), a latent variable \(z\), and parameters \(\theta\). The complete log-likelihood is \(l(\theta;x,z) = \log p(x,z | \theta)\). However, since we do not observe \(z\), we only have the incomplete log-likelihood \(l(\theta;x) = \log p(x | \theta) = \log \sum_z p(x,z | \theta)\), and the sum inside the log means the parameters do not decouple.
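
As a concrete illustration of why the sum inside the log is the problem, here is a minimal sketch of the incomplete log-likelihood for a two-component 1-D Gaussian mixture; the function name `log_incomplete_likelihood` and the toy numbers are assumptions for illustration, not part of the original note:

```python
import numpy as np
from scipy.stats import norm

def log_incomplete_likelihood(x, weights, means, stds):
    """log p(x | theta) = sum_i log sum_z pi_z N(x_i; mu_z, sigma_z)."""
    # joint[i, k] = p(x_i, z=k | theta) = pi_k * N(x_i; mu_k, sigma_k)
    joint = weights * norm.pdf(x[:, None], means, stds)
    # The sum over z sits *inside* the log, so the per-component
    # parameters stay coupled instead of decoupling term by term.
    return np.sum(np.log(joint.sum(axis=1)))

x = np.array([-2.0, -1.5, 1.0, 2.5])
print(log_incomplete_likelihood(x, np.array([0.5, 0.5]),
                                np.array([-2.0, 2.0]),
                                np.array([1.0, 1.0])))
```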

  1. E-step: "hallucinate" the missing data (infer \(z\)), treating the parameters as observed.
  2. M-step: standard Bayes net training (maximum likelihood), treating the latent variables as observed (a sketch of one such iteration follows this list).
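
Below is a minimal sketch of one EM iteration for a 1-D Gaussian mixture, under the assumption that this is the model being fit; the helper name `em_step` and the synthetic data are mine, chosen only to illustrate the two steps:

```python
import numpy as np
from scipy.stats import norm

def em_step(x, weights, means, stds):
    """One EM iteration for a 1-D Gaussian mixture (illustrative sketch)."""
    # E-step: "hallucinate" z via responsibilities r[i, k] = p(z=k | x_i, theta)
    joint = weights * norm.pdf(x[:, None], means, stds)        # shape (n, K)
    resp = joint / joint.sum(axis=1, keepdims=True)

    # M-step: treat the soft assignments as observed counts and do
    # standard weighted maximum-likelihood updates.
    nk = resp.sum(axis=0)                                       # effective counts
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
    return weights, means, stds

# Tiny usage example: synthetic data, one random start, 20 iterations.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])
w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    w, mu, sd = em_step(x, w, mu, sd)
print(w, mu, sd)
```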

So we define a new objective function: the expectation of \(l(\theta;x,z)\) over \(z \sim q(z)\), for some distribution \(q\). Its properties are:

  • \(E_q[l(\theta;x,z)] + H_q\) is a lower bound on \(l(\theta;x)\) (by Jensen's inequality): \(l(\theta;x) \geq E_q[l(\theta;x,z)] + H_q\), where \(H_q\) is the entropy of \(q\).
  • Define the right-hand side as the free energy \(F(q,\theta) = E_q[l(\theta;x,z)] + H_q\). The EM algorithm is then coordinate ascent on \(F\).
  • The M-step maximizes \(F\) over \(\theta\), i.e. optimizes a lower bound on the likelihood. The E-step maximizes \(F\) over \(q\), closing the gap by setting \(q(z) = p(z|x,\theta)\); a small numerical check of this bound follows the list.
  • The EM objective is non-convex, so EM is a local optimization algorithm; in practice, use multiple random restarts.
  • Some good things about EM:
    • no learning rate (step-size) parameter
    • automatically enforces parameter constraints
    • very fast for low dimensions
    • each iteration guaranteed to improve likelihood
  • Some bad things about EM:
    • can get stuck in local optima
    • can be slower than conjugate gradient (especially near convergence)
    • requires expensive inference step
    • is only a maximum likelihood/MAP method (point estimates rather than a full posterior)
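
The numerical check mentioned above: a small sketch, assuming a single observation from a two-component 1-D Gaussian mixture, that the free energy \(F(q,\theta)\) lower-bounds \(\log p(x|\theta)\) and that the gap closes when \(q\) is the posterior (the helper `free_energy` and the toy parameters are illustrative, not from the original note):

```python
import numpy as np
from scipy.stats import norm

# Two-component mixture and a single observation x.
weights, means, stds = np.array([0.3, 0.7]), np.array([-1.0, 2.0]), np.array([1.0, 1.5])
x = 0.5

log_joint = np.log(weights) + norm.logpdf(x, means, stds)   # log p(x, z | theta)
log_px = np.logaddexp.reduce(log_joint)                     # log p(x | theta)

def free_energy(q):
    """F(q, theta) = E_q[log p(x, z | theta)] + H_q for a distribution q over z."""
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = np.exp(log_joint - log_px)                    # E-step choice: p(z | x, theta)

print(free_energy(q_arbitrary), "<=", log_px)   # strict gap for a generic q
print(free_energy(q_posterior), "==", log_px)   # gap closes at the posterior
```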