A couple of weeks ago, I wrote about variational inference for probit regression, which involved some pretty ugly algebra. Although variational inference is a powerful method for approximate Bayesian inference, it can be tedious to come up with the variational updates for every model (which aren’t always available in closed-form), and these updates are model-specific.
Black Box Variational Inference (BBVI) offers a solution to this problem. Instead of computing all the updates in closed form, BBVI uses sampling to approximate the gradient of our bound, and then uses stochastic optimization to optimize this bound. Below, I’ll briefly go over the main ideas behind BBVI, and then demonstrate how easy it makes inference for Bayesian logistic regression. I want to emphasize that the original BBVI paper describes the method better than I ever could, so I encourage you to read the paper as well.
In the context of Bayesian statistics, we’re frequently modeling the distribution of observations, , conditioned on some (random) latent variables . We would like to evaluate , but this distribution is often intractable. The idea behind variational inference is to introduce a family of distributions over that depend on variational parameters , , and find the values of that minimize the KL divergence between and . One of the most common forms of comes from the mean-field variational family, where factors into conditionally independent distributions each governed by some set of parameters, . Minimizing the KL divergence is equivalent to maximizing the Evidence Lower Bound (ELBO), given by
It can involve a lot of tedious computation to evaluate the gradient in closed form (when a closed form expression exists). The key insight behind BBVI is that it’s possible to write the gradient of the ELBO as an expectation:
So instead of evaluating a closed form expression for the gradient, we can use Monte Carlo samples and take the average to get a noisy estimate of the gradient. That is, for our current set of parameters , we can sample for , and for each of these samples evaluate the above expression, replacing with the sample . If we take the mean over all samples, we will have a (noisy) estimate for the gradient. Finally, by applying an appropriate step-size at every iteration, we can optimize the ELBO with stochastic gradient descent.
The above expression may look daunting, but it’s straightforward to evaluate. The first term is the gradient of , which is also known as the score function. As we’ll see in the logistic regression example, this expression is straightforward to evaluate for many distributions, but we can even use automatic differentiation to streamline this process if we have a more complicated model (or if we’re feeling lazy). The next two terms are log-likelihoods that we specify, so we can compute them with a sample .
Consider data with binary outputs . We can model , with the inverse-logit function and drawn from a -dimensional multivariate normal with independent components, . We would like to evaluate , but this is not available in closed form. Instead, we posit a variational distribution over , . To be clear, we model each as an independent Gaussian with mean and , and we use BBVI to learn the optimal values of . We’ll use the shorthand and .
Since is constrained to be positive, we will instead optimize over . First, evaluating the score function, it’s straightforward to see
Note that we use the chain rule in the derivation for . For the complete data log-likelihood, we can decompose , using the chain rule of probability (and noting that is a constant). Thus, it’s straightforward to calculate
The notation refers to evaluating the standard normal pdf with mean and variance at the point .
And that’s it. Thus, given a sample
z_sample from and current variational parameters
sigma , we can approximate the gradient using the following Python code:
def elbo_grad(z_sample, mu, sigma): score_mu = (z_sample - mu)/(sigma) score_logsigma = (-1/(2*sigma) + np.power((z_sample - mu),2)/(2*np.power(sigma,2))) * sigma log_p = np.sum(y * np.log(sigmoid(np.dot(X,z_sample))) + (1-y) * np.log(1-sigmoid(np.dot(X,z_sample)))) + np.sum(norm.logpdf(z_sample, np.zeros(P), np.ones(P))) log_q = np.sum(norm.logpdf(z_sample, mu, np.sqrt(sigma))) return np.concatenate([score_mu,score_logsigma])*(log_p - log_q)
To test this out, I simulated data from the model with and . I set the step-size with AdaGrad, I used 10 samples at every iteration, and I stopped optimizing when the distance between variational means was less than 0.01. The following plot shows the true values of , along with their learned variational distributions (the curves belonging to each parameter are a different color):
It appears that BBVI does a pretty decent job of picking up the distribution over true values. The following plots depict the value of each variational mean at every iteration (left), along with the change in variational means (right).
Again, I highly recommend checking out the original paper. This Python tutorial by David Duvenaud and Ryan Adams, which uses BBVI to train Bayesian neural networks in only a few lines of Python code, is also a great resource.
All my code is available here.