Approximate Inference
(Summarized from the lecture notes of CMU 10-708, Spring 2017, Eric Xing)
Motivation
When we want to know $E_p[f(x)]$, where $x \sim p(x)$, we need to know the distribution of $x$. One way is to use a sample-based average to approximate the expectation: $E_p[f(x)] \approx \frac{1}{N}\sum_{t=1}^{N} f(x^{(t)})$. However, a few challenges remain (a minimal sketch of this estimator follows the list):
- how to draw samples from a given distribution (not all distributions can be trivially sampled)?
- how to make better use of the samples (not all samples are useful, or equally useful; see an example later)?
- how to know we've sampled enough? (For a model with hundreds or more variables, it is very hard to garner enough samples of rare events, even after a long time of sampling.)
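A minimal sketch of the plain Monte Carlo estimator, with $p = \mathcal{N}(0,1)$ and $f(x) = x^2$ as illustrative choices (so the true value is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[f(x)] for f(x) = x^2 under p = N(0, 1); the true value is 1.
N = 100_000
samples = rng.standard_normal(N)   # x^(t) ~ p(x)
estimate = np.mean(samples ** 2)   # (1/N) * sum_t f(x^(t))
print(estimate)                    # ~1.0
```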
So we have a few methods belonging to the family of Monte Carlo methods:
- Rejection sampling: Create samples as in direct sampling, but only count samples that are consistent with the given evidence.
- Likelihood weighting: Sample variables and calculate an evidence weight; only create samples that support the evidence.
- Markov chain Monte Carlo (MCMC): Metropolis-Hastings, Gibbs sampling
Rejection sampling:
How do we generate samples from a given distribution $p(x)$? How do we convert samples from a Uniform[0,1] generator into samples that follow $p(x)$? One way is to compute the CDF $h(y)$, a non-decreasing function onto $[0,1]$. Then for every sample $u$ from Uniform[0,1], we get $x = h^{-1}(u)$ with $x \sim p(x)$. For example, for an Exponential($\lambda$) target, $h(x) = 1 - e^{-\lambda x}$, so $x = -\ln(1-u)/\lambda$. But this won't work if we only have an unnormalized $\hat{p}(x)$ (assume $p(x) = \hat{p}(x)/Z$). This leads to the idea of rejection sampling:
- Come up with a probability distribution $Q(x)$ (usually a Gaussian) that we can easily draw samples from.
- Find a constant $k$ such that $\frac{\hat{p}(x)}{kQ(x)} \le 1$ for all $x$.
- Draw a sample $x_0$ from $Q$.
- Accept the sample with probability $\frac{\hat{p}(x_0)}{kQ(x_0)}$: sample $u_0$ from Uniform$[0, kQ(x_0)]$, and accept the sample if $u_0 < \hat{p}(x_0)$.
And it can be proved that the accepted samples follow $p(x)$.
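A minimal sketch of the procedure, assuming a hypothetical unnormalized target $\hat{p}(x) = e^{-x^2/2}(1+\sin^2 3x)$ and a standard Gaussian proposal (both illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    # Hypothetical unnormalized target, for illustration only.
    return np.exp(-0.5 * x**2) * (1.0 + np.sin(3.0 * x) ** 2)

def q_pdf(x):
    # Proposal Q = N(0, 1).
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Since p_hat(x) <= 2 * exp(-x^2/2) = 2*sqrt(2*pi) * Q(x), choosing
# k = 2*sqrt(2*pi) guarantees p_hat(x) / (k * Q(x)) <= 1 everywhere.
k = 2.0 * np.sqrt(2.0 * np.pi)

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x0 = rng.standard_normal()             # draw x0 from Q
        u0 = rng.uniform(0.0, k * q_pdf(x0))   # u0 ~ Uniform[0, kQ(x0)]
        if u0 < p_hat(x0):                     # accept w.p. p_hat(x0)/(kQ(x0))
            samples.append(x0)
    return np.array(samples)

print(rejection_sample(5))
```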
Pitfalls:
- Scales badly with dimensionality (the acceptance rate decreases exponentially with dimension)
- If $Q(x)$ is chosen badly, the acceptance rate is really low; even when $p$ and $Q$ are both Gaussian, a slight mismatch between them can make acceptances rare in high dimensions
Importance sampling:
A different idea is to retain all the samples and re-weight them when computing the mean: $E_p[f(x)] = \int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q(x)}q(x)\,dx \approx \frac{1}{N}\sum_{n=1}^{N} f(x^{(n)})\frac{p(x^{(n)})}{q(x^{(n)})}$, where $x^{(n)} \sim q(x)$. This is an unbiased estimator. However, when we only have the unnormalized distribution $\hat{p}(x)$, we use $E_p[f(x)] \approx \sum_n f(x^{(n)})w^{(n)}$, where $w^{(n)} = \frac{\hat{p}(x^{(n)})/\hat{q}(x^{(n)})}{\sum_l \hat{p}(x^{(l)})/\hat{q}(x^{(l)})}$. This estimator is biased.
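A minimal sketch, with illustrative choices $p = \mathcal{N}(1,1)$, $q = \mathcal{N}(0,4)$, and $f(x) = x$ (so the true value is $1$); both the unbiased and the self-normalized estimators are shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):
    # Normalized target p = N(1, 1).
    return -0.5 * (x - 1.0) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_q(x):
    # Proposal q = N(0, 2^2).
    return -0.5 * (x / 2.0) ** 2 - 0.5 * np.log(2.0 * np.pi * 4.0)

N = 100_000
x = 2.0 * rng.standard_normal(N)   # x^(n) ~ q
w = np.exp(log_p(x) - log_q(x))    # importance ratios p(x^(n)) / q(x^(n))

print(np.mean(x * w))              # unbiased estimator, ~1.0
print(np.sum(x * w / w.sum()))     # self-normalized (biased) estimator, ~1.0
```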
Pitfalls:
- Scales badly with dimensionality
- Unless $q$ is close to $p$ ($q = p$ gives the best performance), a small number of dominant weights leads to high variance; this is particularly evident in high dimensions
- This problem can be alleviated by resampling according to the weights
Weighted resampling
- Draw $N$ samples $\{x^{(n)}\}_{n=1}^{N}$ from $Q$
- Construct the weights $w^{(n)} = \frac{\hat{p}(x^{(n)})/\hat{q}(x^{(n)})}{\sum_l \hat{p}(x^{(l)})/\hat{q}(x^{(l)})}$
- Subsample $x$ from $\{x^{(n)}\}$ with probabilities $\{w^{(n)}\}$ (see the sketch after this list)
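A minimal sketch of the weighted-resampling step, reusing the illustrative target ($\hat{p}$ an unnormalized $\mathcal{N}(1,1)$) and proposal ($q = \mathcal{N}(0,4)$) from above:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    # Unnormalized N(1, 1) target (illustrative).
    return np.exp(-0.5 * (x - 1.0) ** 2)

def q_pdf(x):
    # Proposal q = N(0, 2^2).
    return np.exp(-0.5 * (x / 2.0) ** 2) / np.sqrt(8.0 * np.pi)

N = 100_000
x = 2.0 * rng.standard_normal(N)         # draw {x^(n)} from q
w = p_hat(x) / q_pdf(x)
w /= w.sum()                             # normalized weights {w^(n)}
resampled = rng.choice(x, size=N, p=w)   # subsample with probabilities w^(n)
print(resampled.mean())                  # ~1.0, the mean of p
```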
Particle filters
- Apply the resampled importance sampling to a temporal, high-dimensional distribution $p(x_1, \ldots, x_t)$. The problem can then be reduced to finding a proposal distribution for a one-dimensional problem.
- The goal is to yield samples from the posterior $p(x_t \mid y_{1:t})$ (a bootstrap-filter sketch follows)
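A minimal bootstrap particle filter sketch for a hypothetical linear-Gaussian state-space model; the model, noise scales, and particle count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D model: x_t = 0.9 x_{t-1} + N(0,1), y_t = x_t + N(0, 0.5^2).
def particle_filter(ys, n_particles=1000):
    particles = rng.standard_normal(n_particles)   # particles for x_0
    means = []
    for y in ys:
        # Propagate through the transition (used here as the proposal).
        particles = 0.9 * particles + rng.standard_normal(n_particles)
        # Weight by the emission likelihood p(y_t | x_t).
        w = np.exp(-0.5 * ((y - particles) / 0.5) ** 2)
        w /= w.sum()
        # Resample to fight weight degeneracy.
        particles = rng.choice(particles, size=n_particles, p=w)
        means.append(particles.mean())             # estimate of E[x_t | y_1:t]
    return means

print(particle_filter([0.5, 1.0, 0.8]))
```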
Markov Chain Monte Carlo
Instead of using a fixed proposal $Q(x)$ as in the previous methods, we use $Q(x' \mid x)$, where $x'$ is the new state being sampled and $x$ is the previous sample. So as $x$ changes, $Q(x' \mid x)$ can also change. One of the most famous MCMC algorithms is the Metropolis-Hastings algorithm.
Metropolis-Hastings
- Draw a sample $x'$ from $Q(x'; x)$
- The new sample $x'$ is accepted or rejected with some probability $A(x'; x)$
- This acceptance probability is $A(x'; x) = \min\left(1, \frac{p(x')\,Q(x; x')}{p(x)\,Q(x'; x)}\right)$
- $A(x'; x)$ is like a ratio of importance sampling weights
- $\frac{p(x')}{Q(x'; x)}$ is the importance weight for $x'$, and $\frac{p(x)}{Q(x; x')}$ is the importance weight for $x$
- We divide the importance weight for $x'$ by that of $x$
- Notice that we only need to compute the ratio $p(x')/p(x)$ rather than $p(x')$ or $p(x)$ separately, so the normalizing constant cancels and an unnormalized $\hat{p}$ suffices
- $A(x'; x)$ ensures that, after sufficiently many draws, our samples will come from the true distribution
Steps for the algorithm:
First we have a burn-in stage: while the samples have not converged, repeatedly draw $x'$ from $Q(x'; x)$, accept it with probability $A(x'; x)$ (setting $x \leftarrow x'$ on acceptance, keeping $x$ otherwise), and discard these draws.
Then, we take samples using the same procedure as above, and keep those samples (a sketch follows).
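A minimal sketch with a symmetric Gaussian random-walk proposal $Q(x'; x) = \mathcal{N}(x, \text{step}^2)$, so the $Q$ terms cancel and $A$ reduces to $\min(1, \hat{p}(x')/\hat{p}(x))$; the target $\hat{p}$ and all tuning constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_hat(x):
    # Hypothetical unnormalized target; only the ratio p_hat(x')/p_hat(x) is used.
    return np.exp(-0.5 * x**2) * (1.0 + np.sin(3.0 * x) ** 2)

def metropolis_hastings(n_samples, step=1.0, n_burn_in=1000):
    x = 0.0
    samples = []
    for i in range(n_burn_in + n_samples):
        x_new = x + step * rng.standard_normal()   # draw x' from Q(x'; x)
        # Symmetric proposal: A = min(1, p_hat(x') / p_hat(x)).
        if rng.uniform() < min(1.0, p_hat(x_new) / p_hat(x)):
            x = x_new                              # accept; otherwise keep x
        if i >= n_burn_in:                         # keep samples after burn-in
            samples.append(x)
    return np.array(samples)

print(metropolis_hastings(5000).mean())
```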
Gibbs sampling
Gibbs Sampling is an MCMC algorithm that samples each random variable of a graphical model, one at a time, conditioned on the current values of all the other variables. GS is a special case of the MH algorithm in which the proposal is the full conditional distribution and the acceptance probability is always 1. A minimal sketch for a bivariate Gaussian follows.
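An illustrative target for which the full conditionals are available in closed form: a bivariate Gaussian with unit variances and correlation $\rho$, where $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2, 1-\rho^2)$ and symmetrically for $x_2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampling for a bivariate Gaussian with unit variances, correlation rho.
def gibbs(n_samples, rho=0.8, n_burn_in=1000):
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho**2)                      # conditional std. deviation
    samples = []
    for i in range(n_burn_in + n_samples):
        x1 = rho * x2 + sd * rng.standard_normal()  # sample x1 | x2
        x2 = rho * x1 + sd * rng.standard_normal()  # sample x2 | x1
        if i >= n_burn_in:
            samples.append((x1, x2))
    return np.array(samples)

s = gibbs(5000)
print(np.corrcoef(s.T)[0, 1])   # sample correlation, ~0.8
```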