Approximate Inference


(Summarized from the lecture notes of CMU 10-708, Spring 2017, by Eric Xing)

Motivation

When we want to know $E_p[f(x)]$, where $x \sim p(x)$, we need to know the distribution of $x$. One way is to use a sample-based average to approximate the expectation: $E_p[f(x)] \approx \frac{1}{N}\sum_{t=1}^{N} f(x^{(t)})$ (see the sketch after the list below). However, a few challenges remain:

  • How do we draw samples from a given distribution (not all distributions can be trivially sampled)?
  • How do we make better use of the samples (not all samples are useful, or equally useful; see an example later)?
  • How do we know we have sampled enough? (For a model with hundreds or more variables, it is very hard to gather enough samples of rare events, even after a long time of sampling.)

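As a toy illustration of the sample-based average (a minimal sketch; the choices $f(x) = x^2$ and $p = \mathcal{N}(0, 1)$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[f(x)] for f(x) = x^2 with x ~ N(0, 1).
# The exact value is 1 (the variance of a standard normal).
N = 100_000
x = rng.standard_normal(N)   # x^(t) ~ p(x)
estimate = np.mean(x ** 2)   # (1/N) sum_t f(x^(t))
print(estimate)              # ~1.0
```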
So we have a few methods that belong to the family of Monte Carlo methods:

  • Rejection sampling: create samples as in direct sampling, but only keep those that are consistent with the given evidence.
  • Likelihood weighting: sample the variables and weight each sample by the likelihood of the evidence, so only samples that support the evidence are created.
  • Markov chain Monte Carlo (MCMC): Metropolis-Hastings, Gibbs sampling.

Rejection sampling:

How do we generate samples that follow a given distribution $p(x)$? Equivalently, how do we convert samples from a Uniform[0,1] generator into samples that follow $p(x)$? One way is to compute the CDF $h(y) = \int_{-\infty}^{y} p(x)\,dx$, a non-decreasing function whose range is $[0,1]$; then for every sample $u$ from Uniform[0,1], $x = h^{-1}(u)$ satisfies $x \sim p(x)$. But this won't work if $\hat{p}(x)$ is not normalized (assume $p(x) = \frac{\hat{p}(x)}{Z}$). This leads to the idea of rejection sampling, whose steps are listed after the sketch below.

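For instance, a minimal sketch of inverse-CDF sampling for an exponential distribution, where $h^{-1}$ has a closed form (the rate $\lambda = 2$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample x ~ Exponential(lam) by inverting its CDF.
# h(x) = 1 - exp(-lam * x)  =>  h^{-1}(u) = -log(1 - u) / lam
lam = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)  # u ~ Uniform[0, 1]
x = -np.log(1.0 - u) / lam               # x ~ p(x)
print(x.mean())                          # ~1 / lam = 0.5
```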
  1. Come up with a proposal distribution $Q(x)$ (usually a Gaussian) from which we can easily draw samples.
  2. Find a constant $k$ such that $\frac{\hat{p}(x)}{kQ(x)} \le 1$ for all $x$.
  3. Draw a sample $x_0$ from $Q$.
  4. Accept the sample with probability $\frac{\hat{p}(x_0)}{kQ(x_0)}$: draw $u_0$ from Uniform$[0, kQ(x_0)]$ and accept the sample if $u_0 < \hat{p}(x_0)$.

(Figure: rejection sampling)

And it can be proved that the accepted samples follow $p(x)$; a minimal sketch is given below.

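Here is a minimal sketch of the accept/reject loop, assuming an illustrative unnormalized two-bump target $\hat{p}$ and a Gaussian proposal $Q = \mathcal{N}(1, 3^2)$; the envelope constant $k = 12$ was checked numerically for this particular $\hat{p}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    # Unnormalized target: a two-bump density (illustrative choice).
    return np.exp(-x**2 / 2) + 0.5 * np.exp(-(x - 3)**2 / 2)

def q_pdf(x, mu=1.0, sigma=3.0):
    # Proposal Q = N(mu, sigma^2), easy to sample from.
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

k = 12.0  # step 2: p_hat(x) <= k * q_pdf(x) for all x (a loose envelope)

accepted = []
for _ in range(50_000):
    x0 = rng.normal(1.0, 3.0)             # step 3: draw x0 from Q
    u0 = rng.uniform(0.0, k * q_pdf(x0))  # step 4: u0 ~ Uniform[0, k Q(x0)]
    if u0 < p_hat(x0):                    # accept with prob p_hat(x0) / (k Q(x0))
        accepted.append(x0)

print(len(accepted) / 50_000)  # acceptance rate; accepted samples follow p(x)
```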
Pitfalls:

  • Scales badly with dimensionality (the acceptance rate decreases exponentially).
  • If $Q(x)$ is chosen badly, the acceptance rate is very low; even when $p$ and $Q$ are both Gaussians, a slight mismatch in their scales forces $k$, and hence the expected number of rejections, to grow exponentially with dimension.

Importance sampling:

A different idea is to retain all the samples and re-weight them when computing the mean: $E_p[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)})\,\frac{p(x^{(n)})}{q(x^{(n)})}$, where $x^{(n)} \sim q(x)$. This is an unbiased estimator. However, when we only have the unnormalized distribution $\hat{p}(x)$, we use $E_p[f(x)] \approx \sum_n f(x^{(n)})\,w^{(n)}$, where $w^{(n)} = \frac{\hat{p}(x^{(n)})/\hat{q}(x^{(n)})}{\sum_l \hat{p}(x^{(l)})/\hat{q}(x^{(l)})}$. This estimator is biased (a minimal sketch follows).

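A minimal sketch of the self-normalized estimator, assuming an unnormalized standard-normal target and a wider Gaussian proposal (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    # Unnormalized target: N(0, 1) without its normalizing constant.
    return np.exp(-x**2 / 2)

def q_pdf(x, sigma=2.0):
    # Proposal q = N(0, sigma^2), deliberately wider than the target.
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)  # x^(n) ~ q
w = p_hat(x) / q_pdf(x)           # unnormalized weights
w /= w.sum()                      # self-normalize (this step introduces the bias)

# Estimate E_p[f(x)] for f(x) = x^2; the exact value is 1.
print(np.sum(w * x**2))
```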
Pitfalls:

  • Scales badly with dimensionality.
  • Unless $q$ is close to $p$ ($q = p$ gives the best performance), a small number of dominant weights will lead to high variance. This is particularly evident in high dimensions.
  • This problem can be alleviated by resampling according to the weights.

Weighted resampling

  1. Draw $N$ samples $\{x^{(n)}\}_{n=1}^{N}$ from $Q$.
  2. Construct the weights $w^{(n)} = \frac{\hat{p}(x^{(n)})/\hat{q}(x^{(n)})}{\sum_l \hat{p}(x^{(l)})/\hat{q}(x^{(l)})}$.
  3. Subsample $x$ from $\{x^{(n)}\}$ with probabilities $\{w^{(n)}\}$ (see the sketch below).

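A minimal sketch, reusing the same illustrative target and proposal as above:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    return np.exp(-x**2 / 2)  # unnormalized N(0, 1) target

def q_pdf(x, sigma=2.0):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)        # step 1: draw {x} from Q
w = p_hat(x) / q_pdf(x)
w /= w.sum()                            # step 2: construct the weights {w}

resampled = rng.choice(x, size=N, p=w)  # step 3: subsample according to {w}
print(resampled.mean(), resampled.var())  # ~0 and ~1, matching N(0, 1)
```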
Particle filters

  • Apply resampled importance sampling to a temporal, high-dimensional distribution $p(x_1, \ldots, x_t)$. The problem reduces to finding a proposal distribution for a one-dimensional problem at each time step.
  • The goal is to yield samples from the posterior $p(X_t \mid Y_{1:t})$; a minimal sketch follows.

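Below is a minimal bootstrap particle filter sketch; the linear-Gaussian state-space model (transition coefficient 0.9, observation noise 0.5) is an illustrative assumption, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: x_t = 0.9 x_{t-1} + N(0, 1), y_t = x_t + N(0, 0.5^2).
# Goal: particles approximating the posterior p(x_t | y_{1:t}).
T, N = 50, 1000
x_true = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.standard_normal()
    y[t] = x_true[t] + 0.5 * rng.standard_normal()

particles = rng.standard_normal(N)  # initial particle cloud
for t in range(1, T):
    # Propose each particle from the transition model (a one-dimensional proposal).
    particles = 0.9 * particles + rng.standard_normal(N)
    # Weight by the likelihood of the current observation.
    w = np.exp(-(y[t] - particles)**2 / (2 * 0.5**2))
    w /= w.sum()
    # Weighted resampling, to avoid weight degeneracy.
    particles = rng.choice(particles, size=N, p=w)

print(particles.mean(), x_true[-1])  # filtered mean vs. true final state
```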
Markov Chain Monte Carlo

Instead of using a fixed proposal $Q(x)$ as in the previous methods, we use $Q(x' \mid x)$, where $x'$ is the new state being sampled and $x$ is the previous sample. So as $x$ changes, $Q(x' \mid x)$ can also change. One of the most famous MCMC algorithms is the Metropolis-Hastings algorithm.

Metropolis-Hastings

  • Draw a sample $x'$ from $Q(x'; x)$.
  • The new sample $x'$ is accepted or rejected with some probability $A(x'; x)$.
    • The acceptance probability is $A(x'; x) = \min\left(1, \frac{p(x')\,Q(x; x')}{p(x)\,Q(x'; x)}\right)$.
    • $A(x'; x)$ is like a ratio of importance sampling weights:
      • $\frac{p(x')}{Q(x'; x)}$ is the importance weight for $x'$, and $\frac{p(x)}{Q(x; x')}$ is the importance weight for $x$.
      • We divide the importance weight of $x'$ by that of $x$.
      • Notice that we only need the ratio $p(x')/p(x)$ rather than $p(x')$ or $p(x)$ separately, so any normalization constant cancels.
    • $A(x'; x)$ ensures that, after sufficiently many draws, our samples come from the true distribution.

Steps for the algorithm:

First we have a burn-in stage: while the samples have not converged, repeatedly draw $x'$ from $Q(x'; x)$ and accept or reject it with probability $A(x'; x)$, discarding these draws. Then we take samples using the same procedure as above and keep those samples (see the sketch below).

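A minimal sketch with a random-walk Gaussian proposal; the two-bump target $\hat{p}$, step size, and burn-in length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    # Unnormalized target; only the ratio p_hat(x') / p_hat(x) is needed.
    return np.exp(-x**2 / 2) + 0.5 * np.exp(-(x - 3)**2 / 2)

n_iters, burn_in, step = 20_000, 2_000, 1.0
x = 0.0
samples = []
for i in range(n_iters):
    x_new = x + step * rng.standard_normal()  # draw x' from Q(x'; x)
    # The random-walk proposal is symmetric, Q(x; x') = Q(x'; x), so the
    # acceptance probability reduces to min(1, p_hat(x') / p_hat(x)).
    A = min(1.0, p_hat(x_new) / p_hat(x))
    if rng.uniform() < A:
        x = x_new                 # accept; otherwise keep the previous x
    if i >= burn_in:
        samples.append(x)         # keep samples only after burn-in

print(np.mean(samples))
```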
Gibbs sampling

Gibbs sampling is an MCMC algorithm that samples each random variable of a graphical model one at a time, conditioned on the current values of the remaining variables. It is a special case of the MH algorithm: the proposal is the full conditional distribution, and such proposals are always accepted.
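
Below is a minimal sketch for a bivariate Gaussian with correlation $\rho$, an illustrative target chosen because both conditionals are Gaussian and can be sampled exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: bivariate Gaussian with unit variances and correlation rho.
# Each full conditional is Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2).
rho = 0.8
n_iters, burn_in = 20_000, 2_000
x1, x2 = 0.0, 0.0
samples = []
for i in range(n_iters):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))  # sample x1 | x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))  # sample x2 | x1
    if i >= burn_in:
        samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])  # ~0.8, recovering rho
```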