I ran into an interesting debate last week. Consider the factor analysis model:
where y is an observed vector, G is the mixing matrix, x is a vector of latent variables and the last term is isotropic Gaussian noise. Assume we observe a sequence of n y’s drawn iid from the model. Then writing and we have
A very naive way to infer the posterior would be to perform Gibbs sampling: successively sample , then . Assume conjugate Gaussian priors on G and X, this becomes particularly easy. Note that the likelihood function, assuming Gaussian noise with precision , is
Now assume however that some of the elements of Y are missing, at random. How should we cope with this? Two, superficially quite different approaches are possible.
In the first, the simplest way, would be to consider the unobserved elements of Y as latent variables. Then in our Gibbs sampling scheme, have initialized G and X, we simply sample , i.e. set the unobserved elements of Y to the corresponding elements of (sampling E as noise). Then our sampling steps for G and X are the same as before. This does not change the model structure in any way, and is a completely valid Gibbs sampling scheme.
The second approach is to exclude terms involving the missing values from the likelihood function. We can achieve this algebraically by element-wise multiplication of the (Y-GX) term by a binary matrix H. Element (i,j) of H is equal to one iff element (i,j) of Y was observed. Now the likelihood function becomes:
where is the number of observed elements of Y. This approach makes sampling G and X a little more tricky because H affects the covariance structure of the conditional distributions. To deal with the algebra (specifically to get rid of the troublesome Hadamard product) I use some ideas from Tom Minka’s matrix algebra notes.
Having had two pretty smart people arguing the case for both sides, we decided to give this some more thought. Both schemes are valid Gibbs samplers for the same model. The first is considerably easier to implement, but intuitively should not perform as well. Once we have a sample for the unobserved elements of Y, when we sample G and X the algorithm has no knowledge about which elements of Y are observed, so we must be introducing more noise into the algorithm than we would leaving these terms out.
Jurgen made the very good point that leaving the terms out of the likelihood function is equivalent to integrating out the unobserved variables. This leads to the intuition that the question of which approach to take is actually equivalent to the much broader question of how much to collapse a Gibbs sampler. This question has garnered a great deal of interest of late, and seems to be very much model dependent. If you have two highly correlated variables (such as G and X above) then integrating out one would seem very beneficial, since exploring the joint parameter space will be very slow otherwise, whereas Jurgen and Finale have found cases (specifically the linear Gaussian IBP model) where the uncollapsed Gibbs sampler is much faster.
I’ll report back again when I have some hard numbers to support these ideas!