<h1 id="black-box-variational-inference-for-logistic-regression">Black Box Variational Inference for Logistic Regression</h1>
<p><em>2017-04-01</em></p>
<p>A couple of weeks ago, I wrote about <a href="http://keyonvafa.com/variational-inference-probit-regression/">variational inference for probit regression</a>, which involved some pretty ugly algebra. Although variational inference is a powerful method for approximate Bayesian inference, it can be tedious to derive the variational updates for every model (they aren’t always available in closed form), and the updates are model-specific.</p>
<p><a href="http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf">Black Box Variational Inference</a> (BBVI) offers a solution to this problem. Instead of computing all the updates in closed form, BBVI uses <em>sampling</em> to approximate the gradient of our bound, and then uses stochastic optimization to optimize this bound. Below, I’ll briefly go over the main ideas behind BBVI, and then demonstrate how easy it makes inference for Bayesian logistic regression. I want to emphasize that the <a href="http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf">original BBVI paper</a> describes the method better than I ever could, so I encourage you to read the paper as well.</p>
<h2 id="black-box-variational-inference-a-brief-overview">Black Box Variational Inference: A Brief Overview</h2>
<p>In the context of Bayesian statistics, we’re frequently modeling the distribution of observations, <script type="math/tex">x</script>, conditioned on some (random) latent variables <script type="math/tex">z</script>. We would like to evaluate <script type="math/tex">p(z \vert x)</script>, but this distribution is often intractable. The idea behind variational inference is to introduce a family of distributions over <script type="math/tex">z</script> that depend on <em>variational parameters</em> <script type="math/tex">\lambda</script>, <script type="math/tex">q(z \vert \lambda)</script>, and find the values of <script type="math/tex">\lambda</script> that minimize the KL divergence between <script type="math/tex">q(z \vert \lambda)</script> and <script type="math/tex">p(z \vert x)</script>. One of the most common forms of <script type="math/tex">q</script> comes from the <em>mean-field variational family</em>, where <script type="math/tex">q</script> factors into conditionally independent distributions each governed by some set of parameters, <script type="math/tex">q(z \vert \lambda) = \prod_{j=1}^m q_j(z_j \vert \lambda)</script>. Minimizing the KL divergence is equivalent to maximizing the <em>Evidence Lower Bound</em> (ELBO), given by</p>
<script type="math/tex; mode=display">L(\lambda) = E_{q_{\lambda}(z)}[\log p(x,z) - \log q(z)].</script>
<p>It can involve a lot of tedious computation to evaluate the gradient in closed form (when a closed form expression exists). The key insight behind BBVI is that it’s possible to write the gradient of the ELBO as an expectation:</p>
<script type="math/tex; mode=display">\nabla_{\lambda}L(\lambda) = E_q[(\nabla_{\lambda} \log q(z \vert \lambda)) (\log p(x,z) - \log q(z \vert \lambda))].</script>
<p>So instead of evaluating a closed form expression for the gradient, we can use Monte Carlo samples and take the average to get a noisy estimate of the gradient. That is, for our current set of parameters <script type="math/tex">\lambda</script>, we can sample <script type="math/tex">z_s \sim q(z \vert \lambda)</script> for <script type="math/tex">s \in 1, \dots, S</script>, and for each of these samples evaluate the above expression, replacing <script type="math/tex">z</script> with the sample <script type="math/tex">z_s</script>. If we take the mean over all samples, we will have a (noisy) estimate for the gradient. Finally, by applying an appropriate step-size at every iteration, we can optimize the ELBO with stochastic gradient descent.</p>
<p>The above expression may look daunting, but it’s straightforward to evaluate. The first term is the gradient of <script type="math/tex">\log q(z \vert \lambda)</script>, which is also known as the score function. As we’ll see in the logistic regression example, this expression is straightforward to evaluate for many distributions, but we can even use automatic differentiation to streamline this process if we have a more complicated model (or if we’re feeling lazy). The next two terms are log-likelihoods that we specify, so we can compute them with a sample <script type="math/tex">z_s</script>.</p>
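<p>As a concrete illustration of the estimator (this toy model and all names below are my own, not from the paper): take the conjugate model <script type="math/tex">z \sim \mathcal N(0,1)</script>, <script type="math/tex">x \vert z \sim \mathcal N(z, 1)</script>, with variational family <script type="math/tex">q(z \vert \lambda) = \mathcal N(\lambda, 1)</script>. The noisy gradient is then a one-liner:</p>

```python
import numpy as np
from scipy.stats import norm

def noisy_elbo_grad(lam, x, n_samples=1000, seed=0):
    """Score-function (Monte Carlo) estimate of the ELBO gradient."""
    rng = np.random.default_rng(seed)
    z = rng.normal(lam, 1.0, size=n_samples)             # z_s ~ q(z | lam)
    score = z - lam                                      # grad_lam of log N(z | lam, 1)
    log_p = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)  # log p(x, z) = log p(z) + log p(x | z)
    log_q = norm.logpdf(z, lam, 1)                       # log q(z | lam)
    return np.mean(score * (log_p - log_q))              # average over the S samples
```

<p>For this model the ELBO gradient is available exactly, <script type="math/tex">x - 2\lambda</script>, so at <script type="math/tex">\lambda = 0</script>, <script type="math/tex">x = 2</script> the Monte Carlo estimate should be close to 2 for large <script type="math/tex">S</script>.</p>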
<h2 id="bbvi-for-bayesian-logistic-regression">BBVI for Bayesian Logistic Regression</h2>
<p>Consider data <script type="math/tex">\boldsymbol X \in \mathbb{R}^{N \times P}</script> with binary outputs <script type="math/tex">\boldsymbol y \in \{0,1\}^{N}</script>. We can model <script type="math/tex">y_i \vert \boldsymbol x_i, \boldsymbol z \sim \text{Bern}(\sigma(\boldsymbol z^T \boldsymbol x_i))</script>, with <script type="math/tex">\sigma(\cdot)</script> the inverse-logit function and <script type="math/tex">\boldsymbol z</script> drawn from a <script type="math/tex">P</script>-dimensional multivariate normal with independent components, <script type="math/tex">\boldsymbol z \sim \mathcal N(\boldsymbol 0, \boldsymbol I_P)</script>. We would like to evaluate <script type="math/tex">p(\boldsymbol z \vert \boldsymbol X, \boldsymbol y)</script>, but this is not available in closed form. Instead, we posit a variational distribution over <script type="math/tex">\boldsymbol z</script>, <script type="math/tex">q(\boldsymbol z \vert \lambda) = \prod_{j=1}^P \mathcal N(z_j \vert \mu_j, \sigma_j^2)</script>. To be clear, we model each <script type="math/tex">z_j</script> as an independent Gaussian with mean <script type="math/tex">\mu_j</script> and variance <script type="math/tex">\sigma_j^2</script>, and we use BBVI to learn the optimal values of <script type="math/tex">\lambda = \{\mu_j,\sigma_j^2\}_{j=1}^P</script>. We’ll use the shorthand <script type="math/tex">\boldsymbol \mu = (\mu_1, \dots, \mu_P)</script> and <script type="math/tex">\boldsymbol \sigma^2 = (\sigma_1^2, \dots, \sigma_P^2)</script>.</p>
<p>Since <script type="math/tex">\sigma_j^2</script> is constrained to be positive, we will instead optimize over <script type="math/tex">\alpha_j = \log(\sigma_j^2)</script>. First, evaluating the score function, it’s straightforward to see</p>
<script type="math/tex; mode=display">\nabla_{\mu_j}\log q(\boldsymbol z \vert \lambda ) = \nabla_{\mu_j} \sum_{i=1}^P -\frac{\log(\sigma_i^2)}{2}-\frac{(z_i-\mu_i)^2}{2\sigma_i^2} = \frac{(z_j-\mu_j)}{\sigma^2_j}.\\
\nabla_{\alpha_j}\log q(\boldsymbol z \vert \lambda ) = \nabla_{\sigma_j^2} \left(\sum_{i=1}^P -\frac{\log(\sigma_i^2)}{2}-\frac{(z_i-\mu_i)^2}{2\sigma_i^2}\right) \cdot \nabla_{\alpha_j}(\sigma_j^2) = \left(-\frac{1}{2\sigma_j^2} + \frac{(z_j-\mu_j)^2}{2(\sigma_j^2)^2}\right) \cdot (\sigma_j^2).</script>
<p>Note that we use the chain rule in the derivation for <script type="math/tex">\nabla_{\alpha_j}\log q(\boldsymbol z \vert \lambda )</script>. For the complete data log-likelihood, we can decompose <script type="math/tex">\log p( \boldsymbol y, \boldsymbol X, \boldsymbol z) = \log p( \boldsymbol y \vert \boldsymbol X, \boldsymbol z) + \log p(\boldsymbol z)</script>, using the chain rule of probability (and noting that <script type="math/tex">\boldsymbol X</script> is a constant). Thus, it’s straightforward to calculate</p>
<script type="math/tex; mode=display">\log p(\boldsymbol y, \boldsymbol X, \boldsymbol z) = \sum_{i=1}^N [y_i \log(\sigma(\boldsymbol z^T \boldsymbol x_i)) + (1-y_i)\log(1-\sigma(\boldsymbol z^T \boldsymbol x_i))] + \sum_{j=1}^P \log \varphi(z_j \vert 0, 1).\\
\log q(\boldsymbol z \vert \lambda) = \sum_{j=1}^P \log \varphi(z_j \vert \mu_j, \sigma_j^2).</script>
<p>The notation <script type="math/tex">\varphi(z_j \vert \mu, \sigma^2)</script> refers to the normal pdf with mean <script type="math/tex">\mu</script> and variance <script type="math/tex">\sigma^2</script>, evaluated at the point <script type="math/tex">z_j</script>.</p>
<p>And that’s it. Thus, given a sample <code class="highlighter-rouge">z_sample</code> from <script type="math/tex">q(\boldsymbol z \vert \lambda) \sim \mathcal N(\boldsymbol \mu, \text{diag}(\boldsymbol \sigma^2))</script> and current variational parameters <code class="highlighter-rouge">mu</code> <script type="math/tex">= \boldsymbol \mu</script> and <code class="highlighter-rouge">sigma</code> <script type="math/tex">= \boldsymbol \sigma^2</script>, we can approximate the gradient using the following Python code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from scipy.special import expit as sigmoid  # inverse-logit
from scipy.stats import norm

def elbo_grad(z_sample, mu, sigma):
    # mu and sigma (the variances sigma_j^2) are the variational parameters;
    # X, y, and P are the data and dimension, defined globally
    score_mu = (z_sample - mu) / sigma
    score_logsigma = (-1 / (2 * sigma) + np.power(z_sample - mu, 2) / (2 * np.power(sigma, 2))) * sigma
    log_p = (np.sum(y * np.log(sigmoid(np.dot(X, z_sample)))
                    + (1 - y) * np.log(1 - sigmoid(np.dot(X, z_sample))))
             + np.sum(norm.logpdf(z_sample, np.zeros(P), np.ones(P))))
    log_q = np.sum(norm.logpdf(z_sample, mu, np.sqrt(sigma)))
    return np.concatenate([score_mu, score_logsigma]) * (log_p - log_q)
</code></pre></div></div>
<p>To test this out, I simulated data from the model with <script type="math/tex">N = 100</script> and <script type="math/tex">P = 4</script>. I set the step-size with <a href="http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf">AdaGrad</a>, I used 10 samples at every iteration, and I stopped optimizing when the distance between variational means was less than 0.01. The following plot shows the true values of <script type="math/tex">z_1, \dots, z_4</script>, along with their learned variational distributions (the curves belonging to each parameter are a different color):</p>
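<p>The outer loop can be sketched as follows (this is my own minimal version of the procedure described above, not the code from the repo; the function names, AdaGrad constants, and defaults are made up). Here <code class="highlighter-rouge">elbo_grad</code> is a self-contained variant of the function above that clips probabilities for numerical stability:</p>

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit as sigmoid

def elbo_grad(z_sample, mu, sigma, X, y):
    # sigma holds the variances sigma_j^2
    probs = np.clip(sigmoid(X @ z_sample), 1e-10, 1 - 1e-10)  # avoid log(0)
    score_mu = (z_sample - mu) / sigma
    score_logsigma = (-1 / (2 * sigma) + (z_sample - mu) ** 2 / (2 * sigma ** 2)) * sigma
    log_p = (np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))
             + np.sum(norm.logpdf(z_sample, 0.0, 1.0)))
    log_q = np.sum(norm.logpdf(z_sample, mu, np.sqrt(sigma)))
    return np.concatenate([score_mu, score_logsigma]) * (log_p - log_q)

def run_bbvi(X, y, n_samples=10, eta=1.0, tol=0.01, max_iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    P = X.shape[1]
    mu, sigma = np.zeros(P), np.ones(P)   # variational means and variances
    grad_sq_sum = np.zeros(2 * P)         # AdaGrad accumulator
    for _ in range(max_iters):
        samples = rng.normal(mu, np.sqrt(sigma), size=(n_samples, P))
        grad = np.mean([elbo_grad(z, mu, sigma, X, y) for z in samples], axis=0)
        grad_sq_sum += grad ** 2
        step = eta / np.sqrt(grad_sq_sum + 1e-8)             # AdaGrad step sizes
        new_mu = mu + step[:P] * grad[:P]                    # gradient ascent on the ELBO
        sigma = np.exp(np.log(sigma) + step[P:] * grad[P:])  # update alpha = log(sigma^2)
        done = np.linalg.norm(new_mu - mu) < tol             # stop when the means stop moving
        mu = new_mu
        if done:
            break
    return mu, sigma
```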
<p><img src="/assets/images/logistic_regression_bbvi_blog/densities.png" alt="Variational densities" /></p>
<p>It appears that BBVI does a pretty decent job of picking up the distribution over true values. The following plots depict the value of each variational mean at every iteration (left), along with the change in variational means (right).</p>
<p><img src="/assets/images/logistic_regression_bbvi_blog/trace_plots.png" alt="Trace plots" /></p>
<p>Again, I highly recommend checking out the <a href="http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf">original paper</a>. This <a href="http://people.seas.harvard.edu/~dduvenaud/papers/blackbox.pdf">Python tutorial</a> by <a href="https://www.cs.toronto.edu/~duvenaud/">David Duvenaud</a> and <a href="http://people.seas.harvard.edu/~rpa/">Ryan Adams</a>, which uses BBVI to train Bayesian neural networks in only a few lines of Python code, is also a great resource.</p>
<p>All my code is available <a href="https://github.com/keyonvafa/logistic-reg-bbvi-blog">here</a>.</p>
<h1 id="us-senators-and-pca">US Senators and PCA</h1>
<p><em>2017-03-28</em></p>
<p>A couple of weeks ago, I wrote a <a href="http://keyonvafa.com/ideal-points/">blog post about modeling ideal points of US senators</a>. I wanted to follow up (very briefly), since I was curious about comparing the Bayesian method there with Principal Component Analysis (PCA).</p>
<p>Here are the (new) results performing PCA on the voting record:</p>
<iframe width="1000" height="300" frameborder="0" scrolling="no" src="https://plot.ly/~keyonvafa/114.embed"></iframe>
<p>Here are the (older) results using ideal point modeling:</p>
<iframe width="1000" height="300" frameborder="0" scrolling="no" src="https://plot.ly/~keyonvafa/58.embed"></iframe>
<p>It’s interesting to compare the methods (the scale on the x-axis is irrelevant). Both models do a good job of capturing the more moderate senators, since <a href="https://en.wikipedia.org/wiki/Susan_Collins">Susan Collins</a>, <a href="https://en.wikipedia.org/wiki/Lisa_Murkowski">Lisa Murkowski</a>, and <a href="https://en.wikipedia.org/wiki/Kelly_Ayotte">Kelly Ayotte</a> are in the middle in both methods. The furthest left senator using PCA is <a href="https://en.wikipedia.org/wiki/Maria_Cantwell">Maria Cantwell</a>, who is also pretty far left with ideal points. Meanwhile, the furthest right senator with PCA is <a href="https://en.wikipedia.org/wiki/Tom_Coburn">Tom Coburn</a> (whose <a href="https://en.wikipedia.org/wiki/Tom_Coburn">Wikipedia page</a> describes him as “the godfather of the modern conservative, austerity movement”), yet he is further left than 8 senators with ideal point modeling.</p>
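<p>For concreteness, the PCA step itself is only a few lines (the toy vote matrix below is made up, and the post doesn’t say how missing votes were handled; rows are senators, columns are bills):</p>

```python
import numpy as np

# Toy senator-by-bill vote matrix: 1 = Yea, 0 = Nay.
votes = np.array([[1, 1, 0, 0, 1],
                  [1, 1, 1, 0, 1],
                  [0, 0, 1, 1, 0],
                  [0, 0, 1, 1, 1]], dtype=float)
centered = votes - votes.mean(axis=0)   # center each bill's column
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U[:, 0] * S[0]                 # each senator's first principal component score
```

<p>Here the first two (similar-voting) senators land on one side of the component and the last two on the other; only relative positions matter, since the overall sign of a principal component is arbitrary.</p>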
<p>Overall, I was surprised by how similar these results were, given how differently the two methods are motivated. Ideal point modeling yields scores for every bill and senator (along with a predictive interpretation), while PCA can reduce the voting data to any number of dimensions to capture senator voting habits (not to mention it’s much faster). I would definitely be interested in exploring these methods with more rigor.</p>
<h1 id="variational-inference-for-bayesian-probit-regression">Variational Inference for Bayesian Probit Regression</h1>
<p><em>2017-03-16</em></p>
<p>Variational inference has become one of the most important approximate inference techniques for Bayesian statistics, but it has taken me a long time to wrap my head around the central ideas (and I’m still learning). Since I’ve found that going through examples is the most efficient way to learn, I thought I would go through a single example in this post, performing variational inference on Bayesian probit regression.</p>
<p>I’m going to assume the reader is somewhat familiar with the basic ideas behind variational inference. If you’ve never seen variational inference before, I strongly recommend <a href="https://arxiv.org/pdf/1601.00670.pdf">this tutorial</a> by <a href="http://www.cs.columbia.edu/~blei/">David Blei</a>, <a href="http://www.proditus.com/">Alp Kucukelbir</a>, and <a href="https://www.stat.berkeley.edu/~jon/">Jon McAuliffe</a>. These <a href="https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf">course notes</a> from David Blei are also very <a href="https://www.youtube.com/watch?v=eXiwYUCe_bY">handy</a>.</p>
<h2 id="variational-inference-a-very-brief-overview">Variational Inference: A (Very) Brief Overview</h2>
<p>Bayesian statistics often requires computing the conditional density <script type="math/tex">p(\boldsymbol z \vert \boldsymbol x)</script> of latent variables <script type="math/tex">\boldsymbol z = z_{1:m}</script> given observed variables <script type="math/tex">\boldsymbol x = x_{1:n}</script>. Since this distribution is typically intractable, variational inference learns an approximate distribution <script type="math/tex">q(\boldsymbol z)</script> that is meant to be “close” to <script type="math/tex">p(\boldsymbol z \vert \boldsymbol x)</script>, using <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a> as a measure.</p>
<p>Thus, there are two steps. The first comes from providing a form for the variational distribution, <script type="math/tex">q(\boldsymbol z)</script>. The most frequently used form comes from the <em>mean-field variational family</em>, where <script type="math/tex">q</script> factors into conditionally independent distributions each governed by some set of parameters, <script type="math/tex">q(\boldsymbol z) = \prod_{j=1}^m q_j(z_j)</script>. Once we have specified the factorization of the distribution, we still need to figure out the optimal form of each factor, both in terms of its family and its parameters (although these can be considered the same thing). Thus, the second step is optimizing <script type="math/tex">KL(q \vert \vert p)</script>.</p>
<p>It turns out the optimal form of each factor is straightforward: <script type="math/tex">q_j^*(z_j) \propto \exp\left\{E_{-j}[\log p(\boldsymbol z, \boldsymbol x)]\right\}</script>, where <script type="math/tex">E_{-j}[\cdot]</script> refers to the expectation when omitting variable <script type="math/tex">z_j</script>. To minimize <script type="math/tex">KL(q \vert \vert p)</script>, we cycle between latent factors <script type="math/tex">q_j</script> and update the mean (with respect to the current parameters) according to the equation above. If these results are unfamiliar, definitely check out <a href="https://arxiv.org/pdf/1601.00670.pdf">the tutorial</a> I mentioned earlier.</p>
<h2 id="variational-inference-for-bayesian-probit-regression">Variational Inference for Bayesian Probit Regression</h2>
<p>Consider a probit regression problem, where we have data <script type="math/tex">\boldsymbol x \in \mathbb{R}^{N \times 1}</script> and a binary outcome <script type="math/tex">\boldsymbol y \in \{0,1\}^{N}</script>. In probit regression, we assume <script type="math/tex">p(y_i = 1) = \Phi(a + bx_i)</script>, where <script type="math/tex">a</script> and <script type="math/tex">b</script> are unknown and random, with a uniform prior, and <script type="math/tex">\Phi(\cdot)</script> is the standard normal CDF. To simplify things, we can introduce variables <script type="math/tex">z_i \sim \mathcal{N}(a+bx_i,1)</script> so <script type="math/tex">y_i = 1</script> if <script type="math/tex">z_i > 0</script> and <script type="math/tex">y_i = 0</script> if <script type="math/tex">z_i \leq 0</script>.</p>
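<p>The augmentation is easy to sanity-check by simulation (the values of <script type="math/tex">a</script>, <script type="math/tex">b</script>, and <script type="math/tex">n</script> below are made up for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b = 1000, 0.5, -1.0
x = rng.normal(size=n)
z = rng.normal(a + b * x, 1.0)   # latent z_i ~ N(a + b*x_i, 1)
y = (z > 0).astype(int)          # y_i = 1 exactly when z_i > 0
```

<p>Marginalizing out <script type="math/tex">z_i</script> recovers the probit likelihood, since <script type="math/tex">P(z_i > 0) = \Phi(a + bx_i)</script>.</p>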
<p>The first step is writing down the log posterior density <script type="math/tex">\log p(a,b,\boldsymbol z \vert \boldsymbol y)</script> up to a constant. It is straightforward to see</p>
<script type="math/tex; mode=display">\log p(a, b, \boldsymbol z \vert \boldsymbol y) \propto \sum_{i=1}^n y_i \log I(z_i > 0) + (1-y_i)\log(I(z_i \leq 0)) - \frac{\sum_{i=1}^n (z_i - (a+bx_i))^2}{2}.</script>
<p>The next step is defining our variational distribution <script type="math/tex">q</script>. We will provide one factor for each <script type="math/tex">z_i</script>, along with independent factors for each of <script type="math/tex">a</script> and <script type="math/tex">b</script>. Therefore, <script type="math/tex">q</script> consists of <script type="math/tex">n + 2</script> independent factors:</p>
<script type="math/tex; mode=display">q(a, b, \boldsymbol z) = q_a(a) q_b(b) \prod_{j=1}^n q_j(z_j).</script>
<p>To learn the optimal form of each factor, we use the rule described above. That is, consider a single <script type="math/tex">z_j</script>. The optimal distribution is therefore <script type="math/tex">q_j^*(z_j) \propto \exp \left\{E_{a,b,\boldsymbol z_{-j}}[\log p(a, b, \boldsymbol z \vert \boldsymbol y)]\right\}</script>. Writing this out, we see</p>
<script type="math/tex; mode=display">E_{a,b,\boldsymbol z_{-j}}[\log p(a, b, \boldsymbol z \vert \boldsymbol y)] \propto y_j \log I(z_j > 0) + (1-y_j)\log I(z_j \leq 0) - \frac{E_{a,b}(z_j-(a+bx_j))^2}{2}.</script>
<p>Thus, after exponentiating, we have that the ideal form is a truncated normal distribution. That is, <script type="math/tex">q_j(z_j) \sim \mathcal N^+(E(a)+E(b)x_j,1)</script> if <script type="math/tex">y_j = 1</script> and <script type="math/tex">q_j(z_j) \sim \mathcal N^-(E(a)+E(b)x_j,1)</script> if <script type="math/tex">y_j = 0</script>, where <script type="math/tex">\mathcal N^+</script> and <script type="math/tex">\mathcal N^-</script> are normal distributions truncated to be positive and negative, respectively.</p>
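<p>For reference, these truncated normals have standard closed-form means, which are what the coordinate updates need:</p>

<script type="math/tex; mode=display">E[z_j] = \mu_j + \frac{\varphi(-\mu_j)}{1 - \Phi(-\mu_j)} \;\text{ if } y_j = 1, \qquad E[z_j] = \mu_j - \frac{\varphi(-\mu_j)}{\Phi(-\mu_j)} \;\text{ if } y_j = 0,</script>

<p>where <script type="math/tex">\mu_j = E(a) + E(b)x_j</script>, and <script type="math/tex">\varphi</script> and <script type="math/tex">\Phi</script> are the standard normal pdf and cdf.</p>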
<p>Similarly, for <script type="math/tex">a</script>, we have <script type="math/tex">E_{b,\boldsymbol z}[\log p(a, b, \boldsymbol z \vert \boldsymbol y)] \propto E_{b,\boldsymbol z}\left(-\frac{\sum_{i=1}^n (z_i - (a+bx_i))^2}{2}\right)</script>. Removing terms that do not depend on <script type="math/tex">a</script> and completing the square, we have the optimal form as <script type="math/tex">q_a(a) \sim \mathcal N\left(\frac{\sum_{i=1}^n [E(z_i)-E(b)x_i]}{n},\frac{1}{n}\right)</script>.</p>
<p>Finally, for <script type="math/tex">b</script>, we have <script type="math/tex">E_{a,\boldsymbol z}[\log p(a, b, \boldsymbol z)] \propto E_{a, \boldsymbol z}\left(-\frac{\sum_{i=1}^n (z_i - (a+ bx_i))^2}{2}\right)</script>. Again removing the terms that do not depend on <script type="math/tex">b</script> and completing the square, we have the following optimal form:</p>
<script type="math/tex; mode=display">q_b(b) \sim \mathcal N \left(\frac{\sum_{i=1}^n x_i[E(z_i)-E(a)]}{\sum_{i=1}^n x_i^2}, \frac{1}{\sum_{i=1}^n x_i^2}\right).</script>
<p>Now that we know the form of all the factors, it’s time to optimize. To do this, we set each parameter to the mean of its optimal factored distribution. The updates can take the following form in R:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">update_M_zj</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">M_a</span><span class="p">,</span><span class="n">M_b</span><span class="p">,</span><span class="n">j</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">M_a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">M_b</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="m">-1</span><span class="o">*</span><span class="n">mu</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">pnorm</span><span class="p">(</span><span class="m">-1</span><span class="o">*</span><span class="n">mu</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">mu</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="m">-1</span><span class="o">*</span><span class="n">mu</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">pnorm</span><span class="p">(</span><span class="m">-1</span><span class="o">*</span><span class="n">mu</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">update_M_a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">M_z</span><span class="p">,</span><span class="n">M_b</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">M_z</span><span class="o">-</span><span class="n">M_b</span><span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="o">/</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">update_M_b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">M_z</span><span class="p">,</span><span class="n">M_a</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="p">(</span><span class="n">M_z</span><span class="o">-</span><span class="n">M_a</span><span class="p">))</span><span class="o">/</span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Therefore, a single updating step would look like</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">M_z</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">update_M_zj</span><span class="p">(</span><span class="n">M_a</span><span class="p">,</span><span class="n">M_b</span><span class="p">,</span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">M_a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">update_M_a</span><span class="p">(</span><span class="n">M_z</span><span class="p">,</span><span class="n">M_b</span><span class="p">)</span><span class="w">
</span><span class="n">M_b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">update_M_b</span><span class="p">(</span><span class="n">M_z</span><span class="p">,</span><span class="n">M_a</span><span class="p">)</span><span class="w">
</span><span class="n">as</span><span class="p">[</span><span class="n">iteration</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">M_a</span><span class="w">
</span><span class="n">bs</span><span class="p">[</span><span class="n">iteration</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">M_b</span><span class="w">
</span></code></pre></div></div>
<p>Again, variational inference is an incredibly powerful tool, and I cannot overstate how helpful the links I posted above are in understanding all of this. Hopefully this tutorial clears up some of the confusion about variational inference.</p>
<h1 id="ideal-points-of-us-senators">Ideal Points of US Senators</h1>
<p><em>2017-03-09</em></p>
<p><a href="http://k7moa.com/pdf/Upside_Down-A_Spatial_Model_for_Legislative_Roll_Call_Analysis_1983.pdf">Popularized by Keith Poole and Howard Rosenthal</a>, ideal point modeling is a powerful way to extract the relative ideologies of politicians based solely on their voting records. <a href="http://www.acrwebsite.org/search/view-conference-proceedings.aspx?Id=9188">A</a> <a href="http://www.stat.columbia.edu/~gelman/research/published/171.pdf">lot</a> <a href="https://www.cs.princeton.edu/~blei/papers/GerrishBlei2011.pdf">has</a> <a href="http://pablobarbera.com/static/barbera_twitter_ideal_points.pdf">been</a> <a href="https://www.jstor.org/stable/1558585">written</a> on ideal point models, so I’m not going to add anything new, but I wanted to give a brief overview of the Bayesian perspective.</p>
<p>First, some results. The following plot shows the ideal points (essentially inferred ideologies) of US senators based solely on roll call voting from 2013-2015 (scroll over the points to see names):</p>
<iframe width="1000" height="300" frameborder="0" scrolling="no" src="https://plot.ly/~keyonvafa/58.embed"></iframe>
<p>More extreme scores (i.e. away from zero) represent more extreme political views. While the liberal-conservative spectrum is not explicitly encoded into the model, the model picks this up naturally from voting patterns. On the far left are some of the most liberal members of the US Senate, such as <a href="https://en.wikipedia.org/wiki/Brian_Schatz">Brian Schatz</a>, while the far right has some of the most conservative members, such as <a href="https://en.wikipedia.org/wiki/Jim_Risch">Jim Risch</a> and <a href="https://en.wikipedia.org/wiki/Ted_Cruz">Ted Cruz</a>. In the middle are senators sometimes referred to as <a href="https://en.wikipedia.org/wiki/Democrat_In_Name_Only">DINOs</a> and <a href="https://en.wikipedia.org/wiki/Republican_In_Name_Only">RINOs</a>, such as <a href="https://en.wikipedia.org/wiki/Joe_Manchin">Joe Manchin</a>, <a href="https://en.wikipedia.org/wiki/Susan_Collins">Susan Collins</a>, and <a href="https://en.wikipedia.org/wiki/Lisa_Murkowski">Lisa Murkowski</a>.</p>
<p>The basic model is as follows. Consider a legislator <script type="math/tex">u</script> and a particular bill <script type="math/tex">d</script>. The vote <script type="math/tex">u</script> places on <script type="math/tex">d</script> is denoted as a binary variable, <script type="math/tex">v_{ud} = 1</script> for Yea and <script type="math/tex">v_{ud} = 0</script> for Nay. Each legislator has an <em>ideal point</em> <script type="math/tex">x_u</script>; a value of 0 is political neutrality, whereas large values in either direction indicate more political extremism in the respective direction. Every bill has its own <em>discrimination</em> <script type="math/tex">b_d</script>, which is on the same scale as the ideal points for legislators. If <script type="math/tex">x_ub_d</script> is high, the legislator is likely to vote for the bill, and if the value is low, the legislator is less likely to vote for it. Finally, each bill also has an offset <script type="math/tex">a_d</script> that indicates how popular the bill is overall, regardless of political affiliation. Formally, the model is as follows:</p>
<script type="math/tex; mode=display">P(v_{ud} = 1) = \sigma(x_ub_d + a_d),</script>
<p>where <script type="math/tex">\sigma(\cdot)</script> is some sigmoidal function, such as the inverse-logit or the standard normal CDF. If a senator didn’t vote on a particular bill, this data is considered missing at random.</p>
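To make the model concrete, here is a minimal sketch of the vote probability using the inverse-logit as the sigmoidal link. This is an illustration rather than the estimation code (the post's actual code is in R, linked below), and all parameter values here are made up:

```python
import math

def vote_probability(x_u, b_d, a_d):
    """P(v_ud = 1): probability that legislator u votes Yea on bill d,
    using the inverse-logit as the sigmoidal function."""
    return 1.0 / (1.0 + math.exp(-(x_u * b_d + a_d)))

# Hypothetical values: a conservative legislator (x_u = 2) facing a bill
# with conservative discrimination (b_d = 1.5) and a neutral offset (a_d = 0)
# is very likely to vote Yea; a perfectly neutral legislator (x_u = 0)
# votes Yea with probability exactly 0.5.
p_conservative = vote_probability(2.0, 1.5, 0.0)
p_neutral = vote_probability(0.0, 1.5, 0.0)
```

Note how the offset `a_d` shifts the probability for everyone: a universally popular bill (large `a_d`) draws Yea votes regardless of ideology.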
<p>Inference requires learning the vectors <script type="math/tex">X, B</script>, and <script type="math/tex">A</script>. I took a Bayesian approach and put (independent) normal priors on each variable. I then used an EM algorithm derived by <a href="http://imai.princeton.edu/research/files/fastideal.pdf">Kosuke Imai et al</a>. The E-Step and M-Step are described in full detail in the paper, and I followed their setup, except I removed senators with fewer than 50 votes, and I stopped after 500 iterations.</p>
<p>All my code is available <a href="https://github.com/keyonvafa/ideal-point-blog">here</a>.</p>
<h2><a href="https://keyonvafa.github.io/box-muller-transform">The Box-Muller Transform</a> (2017-02-27)</h2>
<p>Every statistician has a favorite way of generating samples from a distribution (not sure if I need a citation for this one). From <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampling</a> to <a href="https://arxiv.org/pdf/1206.1901.pdf">Hamiltonian Monte Carlo</a>, there are countless methods to choose from (my personal favorite is <code class="highlighter-rouge">rnorm</code>).</p>
<p>One of the most interesting and counterintuitive sampling techniques is the Box-Muller transform. I’m not sure how widely it’s used today, but given two samples from a uniform distribution, it can generate two <em>independent</em> samples from a standard normal distribution.</p>
<!--Given a uniform sample $$U \sim \text{Unif}(0,1)$$, we can generally sample from a distribution with cdf $$F$$ by taking $$F^{-1}(U)$$. Since we cannot write the normal cdf in closed form, we must rule out the inverse cdf method.-->
<p>The idea behind the Box-Muller transform is to imagine two independent samples <script type="math/tex">X, Y \sim \mathcal{N}(0,1)</script> plotted in the Cartesian plane, and then represent these points as polar coordinates. Recall that to convert to polar coordinates, we need the distance <script type="math/tex">R</script> between <script type="math/tex">(X,Y)</script> and the origin, along with <script type="math/tex">\theta</script>, the angle this line segment makes with the x-axis.</p>
<p>We start with the distance from the origin, <script type="math/tex">R = \sqrt{X^2 + Y^2}</script>. For simplicity, we work with <script type="math/tex">R^2 = X^2 + Y^2</script>. The sum of two independent squared standard normals follows a <a href="https://en.wikipedia.org/wiki/Chi-squared_distribution">chi-squared distribution</a> with 2 degrees of freedom. It is also a <a href="https://en.wikipedia.org/wiki/Chi-squared_distribution#Gamma.2C_exponential.2C_and_related_distributions">known fact</a> that a chi-squared distribution with 2 degrees of freedom is equivalent to a <script type="math/tex">\text{Gamma}(1,\frac{1}{2})</script> random variable, which is itself <a href="http://stats.stackexchange.com/questions/27908/sum-of-exponential-random-variables-follows-gamma-confused-by-the-parameters">equivalent</a> to a <script type="math/tex">\text{Expo}(\frac{1}{2})</script> variable. Finally, we can express an exponential random variable as the <a href="http://math.stackexchange.com/questions/199614/distribution-of-log-x-if-x-is-uniform">log of a uniform</a>. More succinctly,</p>
<script type="math/tex; mode=display">R^2 \sim \chi^2_{df=2} \sim \text{Gamma}\left(1,\frac{1}{2}\right) \sim \text{Expo}\left(\frac{1}{2}\right) \sim -2\log U_1</script>
<p>where <script type="math/tex">U_1 \sim \text{Unif}(0,1).</script></p>
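This chain of equivalences is easy to sanity-check by simulation. An <script type="math/tex">\text{Expo}(\frac{1}{2})</script> variable (equivalently, chi-squared with 2 degrees of freedom) has mean 2 and variance 4, and samples of <script type="math/tex">-2\log U_1</script> should match. A quick sketch in Python (the post's own code is in R; the seed and sample size are arbitrary):

```python
import math
import random

rng = random.Random(42)
n = 200_000
# 1 - rng.random() lies in (0, 1], so the log is always defined
r_squared = [-2.0 * math.log(1.0 - rng.random()) for _ in range(n)]

mean = sum(r_squared) / n
var = sum((s - mean) ** 2 for s in r_squared) / n
# Expo(1/2) -- equivalently chi-squared with 2 df -- has mean 2 and variance 4,
# so mean should be close to 2 and var close to 4
```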
<p>What about the angle, <script type="math/tex">\theta</script>? If we write the joint density of <script type="math/tex">X</script> and <script type="math/tex">Y</script>, we can see</p>
<script type="math/tex; mode=display">f_{X,Y}(x,y) = \frac{1}{2\pi} e^{-\frac{x^2}{2}}e^{-\frac{y^2}{2}} = \frac{1}{2\pi}e^{-\frac{(x^2+y^2)}{2}} = \frac{1}{2\pi}e^{-\frac{r^2}{2}}.</script>
<p>Thus, conditional on <script type="math/tex">R^2</script>, the squared distance between <script type="math/tex">(X,Y)</script> and the origin, the point <script type="math/tex">(X,Y)</script> is distributed uniformly. That is, as long as <script type="math/tex">(X,Y)</script> is a pair satisfying <script type="math/tex">X^2 + Y^2 = R^2</script>, it can be any point on the circle with radius <script type="math/tex">R</script>. As a result, we can simply take <script type="math/tex">\theta = 2\pi U_2</script>, where <script type="math/tex">U_2 \sim \text{Unif}(0,1).</script></p>
<p>Putting all these results together, if we take <script type="math/tex">R = \sqrt{-2\log U_1}</script> and <script type="math/tex">\theta = 2\pi U_2</script> for <script type="math/tex">U_1, U_2 \sim \text{Unif}(0,1)</script>, we have the polar coordinates for two independent standard normal draws. Thus, converting back to Cartesian, we have</p>
<script type="math/tex; mode=display">X = R\cos\theta = \sqrt{-2\log U_1}\cos(2\pi U_2)\\
Y = R\sin\theta = \sqrt{-2\log U_1}\sin(2\pi U_2).</script>
<p>This is straightforward to implement in R:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nsims</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="n">nsims</span><span class="o">*</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">sim</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nsims</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">us</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">-2</span><span class="o">*</span><span class="nf">log</span><span class="p">(</span><span class="n">us</span><span class="p">[</span><span class="m">1</span><span class="p">]))</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="o">*</span><span class="n">us</span><span class="p">[</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">samples</span><span class="p">[</span><span class="m">2</span><span class="o">*</span><span class="n">sim</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">R</span><span class="o">*</span><span class="nf">cos</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w">
</span><span class="n">samples</span><span class="p">[</span><span class="m">2</span><span class="o">*</span><span class="n">sim</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">R</span><span class="o">*</span><span class="nf">sin</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Using the above code, I compared the histogram of Box-Muller samples to those using <code class="highlighter-rouge">rnorm</code>, which were nearly identical:</p>
<p><img src="/assets/images/box_muller_blog/box_muller_samples.png" alt="Box-Muller Samples" /></p>
<p><em>Interesting, but this is nothing more than a cool sampling trick, right?</em> Wrong. If we represent normal random variables in Box-Muller form, it can become easier to prove results about the normal distribution.</p>
<p>For example, consider the problem of proving that for independent draws <script type="math/tex">X,Y \sim \mathcal{N}(0,1)</script>, <script type="math/tex">X+Y</script> is independent of <script type="math/tex">X-Y</script>, and both distributed as <script type="math/tex">\mathcal{N}(0,2)</script>. A proof that doesn’t require the use of pdfs involves representing <script type="math/tex">X</script> and <script type="math/tex">Y</script> in Box-Muller form (I first saw this solution in <a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein’s</a> class <a href="https://locator.tlt.harvard.edu/course/colgsas-111696">Stat 210</a>, which I encourage any Harvard student who’s reading this to take). Let <script type="math/tex">R^2 \sim \chi^2_{df=2}</script> and <script type="math/tex">U \sim \text{Unif}(0,1)</script>, as in the representation above. Thus, <script type="math/tex">X = R\cos(\theta) = R\cos(2\pi U)</script>, and <script type="math/tex">Y = R\sin(\theta) = R\sin(2\pi U)</script>. This form gives us</p>
<script type="math/tex; mode=display">X + Y = R\cos(2\pi U) + R\sin(2\pi U) = \sqrt{2}R\sin(2\pi U + \pi/4)\\
X - Y = R\cos(2\pi U) - R\sin(2\pi U) = \sqrt{2}R\cos(2\pi U + \pi/4)</script>
<p>Note that we use the trigonometric identities for <script type="math/tex">\cos(\alpha + \beta)</script> and <script type="math/tex">\sin(\alpha + \beta)</script> in the derivation. The final form should look familiar – we’ve recovered the Box-Muller representation, albeit with some modifications. The <script type="math/tex">\sqrt{2}</script> in front scales the standard normal so it now has a variance of 2. Additionally, note that we are using <script type="math/tex">2\pi U + \pi/4</script> as <script type="math/tex">\theta</script> instead of <script type="math/tex">2\pi U</script>. However, this makes no difference: since angles wrap around the circle, <script type="math/tex">2\pi U + \pi/4</script> still yields a uniform sample over the possible angles.</p>
<p>Thus, <script type="math/tex">X+Y</script> and <script type="math/tex">X-Y</script> are independent draws from the distribution <script type="math/tex">\mathcal{N}(0,2)</script>.</p>
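This conclusion is also easy to check empirically. Since <script type="math/tex">X+Y</script> and <script type="math/tex">X-Y</script> are jointly Gaussian, zero correlation implies independence, so verifying that the sample correlation is near zero and both sample variances are near 2 is a reasonable Monte Carlo check. A Python sketch (seed and sample size are arbitrary):

```python
import random

rng = random.Random(0)
n = 100_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
s = [a + b for a, b in zip(x, y)]  # X + Y
d = [a - b for a, b in zip(x, y)]  # X - Y

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (var(u) ** 0.5 * var(v) ** 0.5)

# var(s) and var(d) should be near 2; corr(s, d) should be near 0
```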
<!--between the x-axis and the line segment connecting the origin and $$(X,Y)$$. -->
<!--I first came across the method in a class taught by <a href='http://www.people.fas.harvard.edu/~blitz/Site/Home.html'>Joe Blitzstein</a>, and a conversation today with another PhD student inspired me to write up a short tutorial.-->
<h2><a href="https://keyonvafa.github.io/smoking-causal-inference-paradox">Lies, Damned Lies, and Causal Inference</a> (2017-02-18)</h2>
<p>To paraphrase <a href="https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics">Benjamin Disraeli</a>, statistics makes it easy to lie. In this post, I’ll go over an example from Judea Pearl’s excellent textbook, <a href="https://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X">Causality</a>, that shows how different statistical approaches can lead to different estimates of the causal effect of smoking on lung cancer.</p>
<p>First, the (fictional) data, which is taken from Section 3.3 of <a href="https://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X">Causality</a>. Say we have results from an observational (i.e. non-randomized) study that aims to assess the effect of smoking on developing lung cancer. For every person, we have a binary variable <script type="math/tex">X</script> that indicates whether that person is a smoker and a binary outcome variable <script type="math/tex">Y</script> that indicates whether that person developed lung cancer. Additionally, we have a binary variable <script type="math/tex">Z</script> that indicates whether each person had a significant amount of tar in their lungs.</p>
<p>The results from the (fictional) study are depicted in the table below:</p>
<p>\begin{array}{c|c|c|c}
\text{Smoker } (X) & \text{Tar }(Z) & \text{Group Size (% of population)} & \text{Cancer Prevalence (% of group)} \\
\hline
0 & 0 & 47.5\% & 10\%\\
0 & 1 & 2.5\% & 5\%\\
1 & 0 & 2.5\% & 90\%\\
1 & 1 & 47.5\% & 85\%\\
\end{array}</p>
<p>At first glance, it seems that smoking is likely to cause cancer. Ignoring <script type="math/tex">Z</script>, both groups of <script type="math/tex">X = 1</script> have a far larger prevalence of cancer than <script type="math/tex">X = 0</script>. Even considering <script type="math/tex">Z</script>, smokers with tar buildup are more likely to have cancer than nonsmokers with tar buildup, and smokers without tar buildup are still more likely to have cancer than nonsmokers without tar buildup.</p>
<p>Indeed, simple calculations using Bayes’ rule verify <script type="math/tex">P(Y = 1 \vert X =0) = .10</script> and <script type="math/tex">P(Y = 1 \vert X = 1) = .85</script>, indicating one is much more likely to have lung cancer if that person is also a smoker.</p>
<p>However, this might be misleading. The Bayes’ rule calculation above corresponds to a <em>prediction</em> problem: What’s the probability someone has cancer if she’s a smoker? In real life, we may be more curious about the <em>causal</em> problem: What’s the probability that smoking will <em>cause</em> someone to have cancer? The distinction may seem like a subtle one, but it’s important. It may be possible that lung cancer and smoking are correlated due to a common cause, but that smoking does not directly (or indirectly) cause lung cancer. Since we’re concerned with an intervention (i.e. choosing to smoke or not), we would like to estimate the causal effect of this intervention.</p>
<p>This problem came up in a <a href="http://www.cs.columbia.edu/~blei/seminar/2017_applied_causality/index.html">causal inference class</a> I’m taking this semester, and our professor likes to say it’s easy to go down philosophical rabbit holes when defining causality. I’ll leave that to the experts (there are excellent textbooks by <a href="https://www.amazon.com/Causal-Inference-Statistics-Biomedical-Sciences/dp/0521885884">Guido Imbens and Don Rubin</a> along with <a href="https://www.amazon.com/Counterfactuals-Causal-Inference-Principles-Analytical/dp/0521671930">Stephen Morgan and Christopher Winship</a>).</p>
<p>An intuitive approach for me is through the use of causal graphs. I won’t go over all the details, but the main idea is that every node in the graph represents a variable in the causal problem of interest, and the arrows between each node show the causal direction. Nodes can either be observed (shaded) or latent (unshaded).</p>
<p>For example, in the smoking example, we would depict <script type="math/tex">X</script>, <script type="math/tex">Y</script>, and <script type="math/tex">Z</script> with observed nodes. It’s fair to imagine that the decision to smoke will cause the amount of tar buildup in the lungs, and we can also assume that lung cancer is only caused by tar in the lungs. In this case, we would have an arrow from <script type="math/tex">X</script> to <script type="math/tex">Z</script> followed by another arrow from <script type="math/tex">Z</script> to <script type="math/tex">Y</script>.</p>
<p>This is unrealistic, however, as there are likely unknown, unobserved causes that <em>confound</em> these variables. For example, genetics can influence our decision to smoke, and it can also determine our predisposition to cancer. It wouldn’t be a stretch to assume that tar buildup is determined only by smoking. (These assumptions are definitely simplifying and unrealistic, but that’s beside the point for this example.) Accounting for this <em>confounder</em> illuminates the difficulties posed by the causal approach: people who are genetically inclined to smoke may also be more genetically likely to have cancer, correlating these two variables without a causal relationship.</p>
<p>Denoting genetics as the latent variable <script type="math/tex">U</script>, the causal graph is depicted in subfigure (a) below:</p>
<p><img src="/assets/images/causal_inference_lies_blog/observed_do_model.png" alt="Causal Graphs" /></p>
<p>If we’re interested in the causal effect of <script type="math/tex">X</script> on <script type="math/tex">Y</script>, we are thinking in terms of interventions; that is, <script type="math/tex">X</script> would no longer depend on <script type="math/tex">U</script> if someone is forced to smoke or to not smoke. Thus, Pearl introduces the <script type="math/tex">do(\cdot)</script> operator, which imagines the causal graph under intervention. If <script type="math/tex">do(X = 1)</script>, we force <script type="math/tex">X</script> to be 1, and imagine that <script type="math/tex">X</script> is only caused by the “do-er” as opposed to any of its causal predecessors, since we can intervene. Thus, the causal effect of interest becomes <script type="math/tex">P(Y = 1 \vert do(X = 1))</script> as opposed to <script type="math/tex">P(Y = 1 \vert X = 1)</script>. This scenario is depicted in subfigure (b) above.</p>
<p>Because of the confounding variable <script type="math/tex">U</script>, the numbers at the beginning of this post do not accurately reflect the causal effect. There are several sets of criteria for calculating causal effects based on causal graphs, most notably the <a href="http://bayes.cs.ucla.edu/BOOK-2K/ch3-3.pdf">back-door and front-door criteria</a>. Using the front-door criterion (which I won’t elaborate on here but deserves its own post), we can see that <script type="math/tex">Z</script> mediates the causal effect. That is, <script type="math/tex">Y</script> depends on <script type="math/tex">X</script> only through <script type="math/tex">Z</script>.</p>
<p>We can then calculate the effect of <script type="math/tex">Z</script> on <script type="math/tex">Y</script>; however, there exists what’s called a <em>back-door path</em> from <script type="math/tex">Z</script> to <script type="math/tex">Y</script> through <script type="math/tex">X</script>. That is, if we just calculate the causal effect of <script type="math/tex">Z</script> on <script type="math/tex">Y</script>, because of the confounder <script type="math/tex">U</script>, we would include spurious effects that are due to <script type="math/tex">X</script>. Therefore, we must <em>block</em> this back-door path by conditioning on <script type="math/tex">X</script> when calculating the causal effect. Chapter 3.3 of <a href="https://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X">Pearl’s textbook</a> goes through these derivations in more depth.</p>
<p>Mathematically, then, we can calculate</p>
<script type="math/tex; mode=display">P(Y = 1 \vert do(X = x)) = \sum_{z=0}^1 P(Z = z \vert X = x) \sum_{x'=0}^{1} P(Y = 1 \vert X = x', Z = z)P(X = x').</script>
<p>The <script type="math/tex">P(Z = z \vert X = x)</script> term accounts for the intermediate causal effect of <script type="math/tex">X</script> on <script type="math/tex">Z</script>. The term in the sum estimates <script type="math/tex">P(Y = 1 \vert do(Z = z))</script> by conditioning on <script type="math/tex">X</script> to account for the final causal effect of <script type="math/tex">Z</script> on <script type="math/tex">Y</script>. Using this formula with the same data (re-posted below), we can calculate <script type="math/tex">P(Y = 1 \vert do(X = 1)) = 0.45</script> and <script type="math/tex">P(Y = 1 \vert do(X = 0)) = 0.50</script>, indicating that smoking would actually <em>decrease</em> the chance of lung cancer.</p>
<p>Intuitively, what’s going on? It appears that smoking increases the amount of tar buildup in the lungs, which is easily verified in the table below, since <script type="math/tex">P(Z = 1 \vert X = 1) = 0.95</script> and <script type="math/tex">P(Z = 1 \vert X = 0) = 0.05</script>. However, we can see that conditional on <script type="math/tex">X</script>, tar buildup <em>decreases</em> your likelihood of getting lung cancer. That is, <script type="math/tex">P(Y = 1 \vert X = 1, Z = 0) > P(Y = 1 \vert X = 1, Z =1)</script> and <script type="math/tex">P(Y = 1 \vert X = 0, Z = 0) > P(Y = 1 \vert X = 0, Z = 1).</script> Thus, combining these results: smoking causes a larger amount of tar buildup in the lungs, and a large tar buildup in the lungs lowers the chance of cancer.</p>
<p>\begin{array}{c|c|c|c}
\text{Smoker } (X) & \text{Tar }(Z) & \text{Group Size (% of population)} & \text{Cancer Prevalence (% of group)} \\
\hline
0 & 0 & 47.5\% & 10\%\\
0 & 1 & 2.5\% & 5\%\\
1 & 0 & 2.5\% & 90\%\\
1 & 1 & 47.5\% & 85\%\\
\end{array}</p>
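Plugging the table into the front-door formula is a few lines of arithmetic. The sketch below (in Python; the variable names are mine) recovers both the naive conditionals from the Bayes’ rule calculation and the interventional probabilities quoted above:

```python
# Joint table from the (fictional) study: group shares P(X=x, Z=z)
# and cancer prevalence P(Y=1 | X=x, Z=z)
share = {(0, 0): 0.475, (0, 1): 0.025, (1, 0): 0.025, (1, 1): 0.475}
p_y1 = {(0, 0): 0.10, (0, 1): 0.05, (1, 0): 0.90, (1, 1): 0.85}

p_x = {x: share[(x, 0)] + share[(x, 1)] for x in (0, 1)}  # P(X=x) = 0.5 each
p_z_given_x = {(x, z): share[(x, z)] / p_x[x]             # P(Z=z | X=x)
               for x in (0, 1) for z in (0, 1)}

# Naive (predictive) conditional P(Y=1 | X=x), as in the Bayes' rule calculation
naive = {x: sum(p_y1[(x, z)] * p_z_given_x[(x, z)] for z in (0, 1))
         for x in (0, 1)}

def p_y1_do_x(x):
    """Front-door adjustment:
    P(Y=1 | do(X=x)) = sum_z P(z|x) * sum_x' P(Y=1 | x', z) P(x')."""
    return sum(p_z_given_x[(x, z)]
               * sum(p_y1[(xp, z)] * p_x[xp] for xp in (0, 1))
               for z in (0, 1))

# naive[1] is roughly 0.85 and naive[0] roughly 0.10, yet
# p_y1_do_x(1) is roughly 0.45, below p_y1_do_x(0) at roughly 0.50
```

The flip between the naive and interventional numbers is exactly the paradox described above.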
<p>I want to stress this data is fictional, and the arguments are simplistic. One could easily come up with another causal diagram to show that smoking increases the likelihood of cancer. However, I think this example illustrates the importance of being careful when performing causal inference analyses, along with the differences between causal inference and prediction problems.</p>
<h2><a href="https://keyonvafa.github.io/tweet-counts-poisson-glm">Tweet Counts as Poisson GLMs</a> (2017-02-10)</h2>
<p><em>Last week, <a href="http://keyonvafa.com/tweet-counts-poisson-processes/">I wrote about modeling tweet counts as a simple Poisson process</a>. In this post, I’ll dive into a slightly more sophisticated method, so check out the previous post for some background.</em></p>
<p>I’m interested in estimating the number of tweets President Trump will post in a given week so I can use the model to <a href="https://www.predictit.org/Market/2956/How-many-tweets-will-%40realDonaldTrump-post-from-noon-Feb-8-to-noon-Feb-15">bet on PredictIt</a>. <a href="http://keyonvafa.com/tweet-counts-poisson-processes/">My post last week</a> demonstrated that a stationary Poisson process had some weaknesses – the rate wasn’t constant everywhere, and Trump’s tweets seemed to self-excite (i.e. if he’s in the middle of a tweet storm, he’s likely to keep tweeting).</p>
<p>In this post, I’ll focus on modeling tweet counts as a Poisson <em>generalized linear model</em> (GLM). (You probably won’t need to know much about GLMs to understand this post, but if you’re interested, the <a href="https://www.amazon.com/Generalized-Chapman-Monographs-Statistics-Probability/dp/0412317605">canonical text</a> is by <a href="https://galton.uchicago.edu/~pmcc/">Peter McCullagh</a> and <a href="https://en.wikipedia.org/wiki/John_Nelder">John Nelder</a>. I also highly recommend <a href="http://www.stat.ufl.edu/~aa/">Alan Agresti’s</a> <a href="https://www.amazon.com/Foundations-Linear-Generalized-Probability-Statistics/dp/1118730038">textbook</a>, which I used in his class.) The model will be autoregressive, as I will include the tweet counts for the previous few days among my set of predictors.</p>
<p>First I’ll go over the results, so <a href="#model">jump ahead</a> if you’re interested in the more technical model details.</p>
<h2 id="results">Results</h2>
<p>In short, my model uses simulations to predict the weekly tweet count probabilities. That is, it simulates 5,000 possible versions of the week, and counts how many of these simulations are in each <a href="https://www.predictit.org/Market/2956/How-many-tweets-will-%40realDonaldTrump-post-from-noon-Feb-8-to-noon-Feb-15">PredictIt bucket</a>. It uses these counts to assign probabilities to each bucket.</p>
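The bucketing step can be sketched in a few lines. The sketch below is not the model itself (that is the autoregressive GLM described later): it uses a constant, hypothetical daily rate and coarser buckets than the market’s, purely to illustrate simulate-and-count:

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method for sampling Pois(lam);
    fine for the small daily rates involved here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_week(daily_rate, nsims=5000, seed=1):
    """Simulate nsims weeks of 7 iid Pois(daily_rate) days and return
    the share of simulated weekly totals landing in each bucket."""
    rng = random.Random(seed)
    counts = {"24 or fewer": 0, "25-34": 0, "35-44": 0, "45 or more": 0}
    for _ in range(nsims):
        total = sum(poisson_draw(daily_rate, rng) for _ in range(7))
        if total <= 24:
            counts["24 or fewer"] += 1
        elif total <= 34:
            counts["25-34"] += 1
        elif total <= 44:
            counts["35-44"] += 1
        else:
            counts["45 or more"] += 1
    return {bucket: c / nsims for bucket, c in counts.items()}

probs = simulate_week(5.0)  # hypothetical rate of 5 tweets per day
```

With a daily rate of 5, the weekly total is centered near 35, so most of the probability mass lands in the middle buckets.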
<p>I ran the model last night and compared the results to the probabilities on PredictIt – all of my predictions were within three percentage points of those online, with the exception of one bucket that was eight off (the “55 or more” bucket, which my model thought was less likely than the market). Running it again this morning, however, something was off – the market prices had shifted considerably toward fewer tweets, diverging from my model.</p>
<p>Confused, I read the comments, which indicated that seven tweets had been removed from Trump’s account this morning. However, the removed tweets were from a while ago, so I was confused why they would make a difference in this week’s count. Then I read the market rules:</p>
<blockquote>
<p><em>“The number of total tweets posted by the Twitter account realDonaldTrump shall exceed 34,455 by the number or range identified in the question…The number by which the total tweets at expiration exceeds 34,455 may not equal the number of tweets actually posted over that time period … [since] <strong>tweets may be deleted prior to expiration of this market</strong>.”</em></p>
</blockquote>
<p>D’oh. That didn’t seem like the smartest rule. It meant the number of weekly tweets could be negative if Trump deleted a whole bunch of tweets from before the week. There weren’t many options for modeling these purges with the data at hand. Therefore, I decided to assume that no more tweets would be deleted this week, and subtracted the 7 missing tweets from the simulation.</p>
<p>I ran the model on Friday evening, with the following histogram depicting the distribution of simulated total weekly tweet counts:</p>
<p><img src="/assets/images/tweet_counts_poisson_glm_blog/simulated_tweet_hist.png" alt="Simulated tweet histogram" /></p>
<p>The following plot shows the simulated trajectories for the week, with 4 paths randomly colored for emphasis:</p>
<p><img src="/assets/images/tweet_counts_poisson_glm_blog/simulated_tweet_paths.png" alt="Simulated tweet paths" /></p>
<p>Finally, the following table shows my model probabilities, compared to those on PredictIt as of this writing:</p>
<p>\begin{array}{c|cccc}
\text{Number of tweets} & \text{“Yes” Price} & \text{Model “Yes” Probability} & \text{“No” Price} & \text{Model “No” Probability} \\
\hline\text{24 or fewer} & $0.11 & 1\% & $0.90 & 99\%\\
\text{25 - 29} & $0.14 & 7\% & $0.88 & 93\%\\
\text{30 - 34} & $0.23 & 24\% & $0.79 & 76\%\\
\text{35 - 39} & $0.31 & 35\% & $0.73 & 65\%\\
\text{40 - 44} & $0.19 & 23\% & $0.84 & 77\%\\
\text{45 - 49} & $0.09 & 9\% & $0.93 & 91\%\\
\text{50 - 54} & $0.05 & 2\% & $0.96 & 98\%\\
\text{55 or more} & $0.04 & 0.3\% & $0.97 & 99.7\%\\
\end{array}</p>
<p>Thus, compared to my model, the market believes Trump will have a quiet week. This may reflect the possibility of Trump deleting more tweets, or it could be some market knowledge that Trump will be preoccupied by various presidential engagements.</p>
<p>In general, however, the market prices align nicely with the model; aside from the first two buckets, none disagrees with the model probability by more than 4%. I think this is definitely a more robust model than the simple Poisson process, as the probabilities align quite well with the market. Thus, not expecting much in returns, I bought shares of “No” for “24 or fewer” and “25-29” and “Yes” for “35-39” and “40-44”.</p>
<h2 id="model">Model</h2>
<p>For this analysis, I thought it made sense to predict tweets as daily counts as opposed to weekly counts, so the predictions would be more fine-tuned. Thus, denote by <script type="math/tex">y_t</script> the number of tweets made by Trump on day <script type="math/tex">t</script>. Given a vector of predictors <script type="math/tex">\boldsymbol x_t</script> for day <script type="math/tex">t</script> and a vector of (learned) coefficients <script type="math/tex">\boldsymbol \beta</script>, the model I used was</p>
<script type="math/tex; mode=display">y_t \sim \text{Pois}(\exp(\boldsymbol x_t^T \boldsymbol \beta)).</script>
<p>Note that because we are exponentiating <script type="math/tex">\boldsymbol x_t^T \boldsymbol\beta</script>, the rate parameter will never be negative, so there are no constraints on the sign of <script type="math/tex">\boldsymbol \beta</script>.</p>
<p>To keep the model simple, I was fairly limited in my set of predictors. I included an intercept term, the day of the week, and binary variables indicating whether the day fell after Trump won the election and whether it fell after the inauguration (the graph from <a href="http://keyonvafa.com/tweet-counts-poisson-processes/">my previous post</a> indicates a significant changepoint after the election). I also included an indicator for whether there was a presidential or vice presidential debate that day – although these won’t happen again, they explain spikes in the existing data.</p>
<p>It also seemed reasonable that the number of Trump’s tweets today would depend on how many tweets he had in the previous few days. Thus, as a first attempt, I included the past 5 days of history, and used the following model:</p>
<script type="math/tex; mode=display">y_t |\boldsymbol x_t,y_{t-1}, \dots, y_{t-5} \sim \text{Pois}\left(\exp\left(\boldsymbol\beta^T \boldsymbol x_t + \sum_{k=1}^5 \gamma_k y_{t-k} \right)\right).</script>
<p>Here, <script type="math/tex">\boldsymbol x_t</script> is the vector of aforementioned predictors, i.e. intercept, day of week, etc. At time <script type="math/tex">t</script>, the scalars <script type="math/tex">y_{t-1}, \dots, y_{t-5}</script> indicate the counts of the previous 5 days, and each count has its own parameter to be estimated, <script type="math/tex">\gamma_k</script>. Thus, this model requires that we estimate <script type="math/tex">\boldsymbol \beta</script> along with <script type="math/tex">\gamma_1, \dots, \gamma_5</script>.</p>
<p>I used the built-in <code class="highlighter-rouge">glm</code> function in R to estimate these variables using maximum likelihood. If you’re unfamiliar with maximum likelihood, the basic idea is that we can maximize <script type="math/tex">\sum_{t=1}^T \log p(y_t\vert x_t,y_{t-1}, \dots, y_{t-5})</script> by taking the gradient with respect to our parameters <script type="math/tex">\boldsymbol \gamma</script> and <script type="math/tex">\boldsymbol \beta</script> and using an iterative method to set the gradient to 0. (I’d like to get a blog post up someday about GLMs in general so I could focus on maximum likelihood estimation and discuss some other nice properties.)</p>
<p>After fitting to the current data, I found that among the <script type="math/tex">\boldsymbol\gamma</script>, only <script type="math/tex">\gamma_1</script> and <script type="math/tex">\gamma_2</script> were deemed statistically significant (and even these estimates were quite small). Besides the intercept and debate indicator, the most statistically significant <script type="math/tex">\boldsymbol\beta</script> coefficient was the indicator for being after the election, at <script type="math/tex">-0.44</script> (recall that these end up getting exponentiated). Thus, I re-ran the model using only the past two days of history (as opposed to five) in the autoregressive component. The following graph shows how the model mean fits the training data:</p>
<p><img src="/assets/images/tweet_counts_poisson_glm_blog/trained_tweet_data.png" alt="Trained tweet data" /></p>
<p>Not perfect, but reasonable given the basic set of predictors, and it appears to get the general trends right. Note that the four spikes correspond exactly to the debates.</p>
<p>I was initially worried about overdispersion – recall that in a Poisson model, the variance of the output <script type="math/tex">y_t</script> is equal to the mean, so if the variance in reality is larger than the mean, a Poisson would be a poor approximation. Thus, I also tried using a negative binomial to model the data, which performed worse in training log-likelihood and training error. As a result, I stuck with the original Poisson model.</p>
<p>After estimating all the coefficients, it was time to model the probability of finishing in each <a href="https://www.predictit.org/Market/2956/How-many-tweets-will-%40realDonaldTrump-post-from-noon-Feb-8-to-noon-Feb-15">bucket on PredictIt</a>. Because the number of tweets in one day would affect the number of tweets for the next day, I couldn’t model these probabilities analytically. Thus, I ran 5,000 simulations to approximate the probability of being in each bucket by Wednesday noon.</p>
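<p>The simulation step can be sketched as follows – roll the fitted model forward one day at a time, feed each simulated count back in as a lag, and tally which bucket each simulated total lands in (the coefficient values and starting history here are made up for illustration, not the fitted ones):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
beta0 = 2.0                          # intercept (illustrative, not the fitted value)
gamma = np.array([0.01, 0.005])      # lag-1 and lag-2 coefficients (illustrative)
history = [9, 12]                    # last two observed daily counts (made up)
days_left, n_sims = 5, 5000
buckets = [(0, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, np.inf)]

totals = np.zeros(n_sims)
for s in range(n_sims):
    lags = list(history)
    for _ in range(days_left):
        # Rate depends on the two most recent (possibly simulated) counts.
        rate = np.exp(beta0 + gamma @ np.array(lags[-2:][::-1]))
        y = rng.poisson(rate)
        totals[s] += y
        lags.append(y)

# Monte Carlo estimate of the probability of each PredictIt bucket.
probs = {f"{lo}-{hi}": np.mean((totals >= lo) & (totals <= hi)) for lo, hi in buckets}
print(probs)
```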
<p>One final note about the model – it predicts tweets for full-day length intervals, i.e. noon Monday to noon Tuesday. However, what if it’s 8 pm on Sunday, and we’re curious how often Trump will tweet before Wednesday at noon? Predicting 2 more full days would not be enough (finishing Tuesday at 8 pm), and 3 would be too much (finishing Wednesday at 8 pm). Thus, I decided to run an additional model that rounded at the nearest noon. That is, I duplicated the above model, except I used the number of tweets between now and the next noon as the response variable. For example, if I were running the program at 8 pm on Sunday, I would model how often Trump tweets between 8 pm and the following day’s noon for every day in the history. Then, I would use this set of coefficients to predict the tweets between now and the next noon, and then finish off all remaining full days with the coefficients from the aforementioned model. (If none of this paragraph makes sense, don’t worry about it, as it’s a pretty minor detail.)</p>
<p>In the future, I’d be interested in more complicated variations, such as modeling tweet deletions or using a larger set of predictors (along with performing a more rigorous dispersion analysis).</p>
<p>All code is available <a href="https://github.com/keyonvafa/tweet-count-poisson-blog">here</a>.</p>
<h2 id="update">Update</h2>
<p>I bought shares in four markets (two Yes’s and two No’s). The tweet count ended up being in one of the Yes markets, good enough for a 25% return. That’s a great return, but it’s too early to say anything conclusive about the model because <script type="math/tex">N = 1</script>. That being said, I’ll continue to use the GLM because the results seem promising so far.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>Thanks to <a href="http://www.columbia.edu/~swl2133/">Scott Linderman</a> for suggesting an autoregressive GLM model. Also thanks to <a href="https://medium.com/@Teddy__Kim">Teddy Kim</a> for various suggestions and brainstorming help. A final thank you to <a href="http://stat.columbia.edu/department-directory/name/owen-ward/">Owen Ward</a> for suggesting the connection between spikes and debates in the model.</p>
<hr />
<h2 id="tweet-counts-as-poisson-processes">Tweet Counts as Poisson Processes</h2>
<p><em>keyonvafa, 2017-02-02, <a href="https://keyonvafa.github.io/tweet-counts-poisson-processes">https://keyonvafa.github.io/tweet-counts-poisson-processes</a></em></p>
<p>I’ve written before on <a href="http://keyonvafa.com/gp-predictit/">using statistics to bet on PredictIt’s political markets</a>, and this morning a new market caught my eye: <a href="https://www.predictit.org/Market/2934/How-many-tweets-will-%40realDonaldTrump-post-from-noon-Feb-1-to-noon-Feb-8">How many tweets will Donald Trump post from noon Feb. 1 to noon Feb. 8?</a>. If statistics can be used to inform any prediction market, it’s this one, so I figured I’d give this counting problem a go.</p>
<p>At the time of this writing, the market looked like the following: 8 buckets for the number of tweet counts, so I would need to assign a probability to each bucket:</p>
<p><img src="/assets/images/tweet_counts_poisson_process_blog/tweet_count_predictit_screenshot.png" alt="Tweet count screenshot" /></p>
<p>I started by scraping Trump’s tweets on Twitter – I essentially lifted code from <a href="http://www.craigaddyman.com/mining-all-tweets-with-python/">this tutorial</a> by Craig Addyman. I only downloaded his last 1800 tweets, which seemed like enough because I reasoned his older tweeting habits wouldn’t be so informative nowadays.</p>
<p>I decided to model the number of weekly tweets as a Poisson process. For those unfamiliar with a Poisson process, the main idea is that the number of tweets <script type="math/tex">N(t)</script> in a given interval, say <script type="math/tex">[0,t)</script> where <script type="math/tex">t</script> is a scalar denoting seconds, follows a Poisson distribution with mean <script type="math/tex">\lambda t</script>, where <script type="math/tex">\lambda</script> is the rate:</p>
<script type="math/tex; mode=display">N(t) \sim \text{Pois}(\lambda t),\text{ so } P(N(t) = n) = \frac{(\lambda t)^n}{n!}e^{-\lambda t}.</script>
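<p>As a quick numerical sanity check of this pmf, <code class="highlighter-rouge">scipy</code> can evaluate it directly (the daily rate here is the estimate discussed below, plugged in purely as an example):</p>

```python
from scipy.stats import poisson

lam, t = 9.24, 7                  # e.g. ~9.24 tweets/day over a one-week window
p40 = poisson.pmf(40, lam * t)    # P(N(t) = 40)
print(p40)
```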
<p>Moreover, this model assumes that the numbers of tweets in disjoint intervals are independent, and that the rate is constant over time. These assumptions are definitely wrong. There are numerous instances where Trump has rapidly strung a series of tweets together on the same topic one after another, which breaks both assumptions. Additionally, the rate is not constant, as he is much more likely to tweet during the day than <a href="https://www.theatlantic.com/politics/archive/2016/09/trump-tweets-alicia-machado/502415/">during sleeping hours</a>.</p>
<p>Although these assumptions are violated, I decided to use a Poisson process because it’s intuitive/straightforward (read: I’m lazy) and I didn’t have a ton of time (read: I procrastinate). Next week, I hope to use a more complicated model like a <a href="http://www.dcscience.net/Hawkes-Biometrika-1971.pdf">Hawkes process</a>, so stay tuned.</p>
<p>The plot below shows the number of weekly tweets for all the scraped data. We can see that the stationary rate assumption is <em>definitely</em> violated, as it looks like his tweeting rate dropped severely after the election. As a result, I decided to only use tweet counts from after the election (even this small sample isn’t perfect, but it’s the best I could do given that there’s only been one week of presidential tweeting data).</p>
<p><img src="/assets/images/tweet_counts_poisson_process_blog/full_tweet_counts.png" alt="Tweet counts graph" /></p>
<p>I used the maximum likelihood estimate (the sample average) to estimate the rate, which came out to 64.7 tweets per week, or 9.24 per day. Given the rate, I was able to put percentages on each of the buckets by taking quantiles of the Poisson distribution (after accounting, of course, for the 9 times he’s already tweeted this week).</p>
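<p>For the curious, here’s roughly how those bucket probabilities can be computed from the Poisson CDF – a sketch, where the fraction of the market window remaining is an assumption for illustration:</p>

```python
from scipy.stats import poisson

rate_remaining = 64.7 * (6 / 7)   # e.g. six days of the market window left (assumed)
already = 9                        # tweets already posted this week
buckets = [(0, 29), (30, 34), (35, 39), (40, 44), (45, 49), (50, 54), (55, 59)]

probs = {}
for lo, hi in buckets:
    # P(lo <= already + N <= hi) for N ~ Pois(rate_remaining).
    probs[f"{lo}-{hi}"] = (poisson.cdf(hi - already, rate_remaining)
                           - poisson.cdf(lo - already - 1, rate_remaining))
probs["60+"] = poisson.sf(59 - already, rate_remaining)

for bucket, p in probs.items():
    print(f"{bucket}: {p:.1%}")
```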
<p>Here are my estimates compared to their price on PredictIt:</p>
<p>\begin{array}{c|cccc}
\text{Number of tweets} & \text{“Yes” Price} & \text{Model “Yes” Probability} & \text{“No” Price} & \text{Model “No” Probability} \\
\hline\text{29 or fewer} & \$0.05 & 3\% & \$0.97 & 97\%\\
\text{30 - 34} & \$0.09 & 15\% & \$0.93 & 85\%\\
\text{35 - 39} & \$0.18 & 33\% & \$0.83 & 67\%\\
\text{40 - 44} & \$0.31 & 31\% & \$0.70 & 69\%\\
\text{45 - 49} & \$0.27 & 14\% & \$0.75 & 86\%\\
\text{50 - 54} & \$0.16 & 3\% & \$0.86 & 96\%\\
\text{55 - 59} & \$0.07 & 0.5\% & \$0.94 & 99\%\\
\text{60 or more} & \$0.05 & 0.04\% & \$0.96 & 99.9\%\\
\end{array}</p>
<p>It looks like my model prefers lower tweet counts since it accounts for all post-election tweets, and Trump tweeted less in November than he has in the past few weeks. Interestingly, when I ran the same model using only the last week of data for training, none of my estimates differed from the market price by more than 0.06. Therefore, it appears that the market puts more weight on his most recent tweet totals than my (admittedly basic) model.</p>
<p>At any rate, I decided to buy 5 shares of “Yes” for 35-39 tweets, 2 shares of “No” for 45-49 tweets, and 2 shares of “No” for 50-54 tweets (everywhere my model differed with the market prices by at least 10 percentage points). Stay tuned for updates on how I do, along with a more complicated model.</p>
<p>All code is available <a href="https://github.com/keyonvafa/tweet-count-poisson-blog">here</a>.</p>
<h2 id="update">Update</h2>
<p>Trump ended up tweeting 44 times; therefore, I lost the “Yes” for 35-39 tweets, but I won both “No” markets I invested in, 45-49 and 50-54. Though it’s impossible to tell with a sample size of one, it seems like this model is a good start – Trump’s final tweet count was in the 40-44 bucket, which was deemed the second-most likely bucket by the model, at 31%.</p>
<hr />
<h2 id="inauguration-word-clouds-with-tf-idf">Inauguration Word Clouds with tf-idf</h2>
<p><em>keyonvafa, 2017-01-21, <a href="https://keyonvafa.github.io/inauguration-wordclouds">https://keyonvafa.github.io/inauguration-wordclouds</a></em></p>
<p>Trump’s inauguration was yesterday, and we’re all coping with it in different ways. Instead of watching yesterday’s events, I decided to download a bunch of inaugural addresses and make some word clouds (jump ahead a little bit if you’re curious about the more technical details).</p>
<p>These word clouds aren’t meant to show the most frequent words of each address – rather, the words depicted are both frequent in a given speech and rare compared to every other inaugural address, based on the metric <em>tf-idf</em> (more on that later). Thus, if a word appears large in a president’s cloud, it means that the word was used more by that president than in the typical inaugural address.</p>
<h2 id="donald-trump-2017">Donald Trump (2017)</h2>
<p><img src="/assets/images/inauguration_wordclouds_blog/trump_wordcloud_2017.png" alt="Trump Word Cloud" /></p>
<h2 id="barack-obama-2009">Barack Obama (2009)</h2>
<p><img src="/assets/images/inauguration_wordclouds_blog/obama_wordcloud_2009.png" alt="Obama Word Cloud" /></p>
<h2 id="george-bush-2005">George Bush (2005)</h2>
<p><img src="/assets/images/inauguration_wordclouds_blog/bush_wordcloud_2005.png" alt="Bush Word Cloud" /></p>
<hr />
<p>A few trends stand out off the bat. Trump’s address is focused on jobs and success, as the words “jobs”, “workers”, “factories”, and “winning” all appear large. Additionally, “politicians” received a lot of Trump’s attention, largely in a negative context. Note “carnage” in the lower right corner, which for me was the most notable word of the speech.</p>
<p>In comparison, Obama’s address is more policy-driven – note the words “healthcare”, “warming”, and “women”. There is an additional focus on storytelling and optimism, as demonstrated by the words “father”, “journey”, “generation”, and “ambitions”. Taking place during the Great Recession, the speech also highlights “crisis”, “winter”, and “icy”. The top words in Bush’s address highlight the nationalistic mood of post-9/11 America, with “tyranny”, “defended”, and “freedom” prominently featured.</p>
<p>Finally, note that Trump’s word cloud has a mix of very large and small words, while Obama’s lexicon is more uniformly distributed. This suggests two things: 1) Trump repeated himself and 2) he used certain words and phrases that were very atypical for an inaugural address.</p>
<h2 id="technical-details">Technical Details</h2>
<p>First, I want to thank <a href="http://amueller.github.io/">Andreas Muller</a> for making the <a href="https://github.com/amueller/word_cloud">“word_cloud” library for Python</a> publicly available; the only reason these graphics exist is because the library is straightforward to use and incredibly well-documented.</p>
<p>To gather the data, I downloaded most of the addresses from <a href="http://avalon.law.yale.edu/subject_menus/inaug.asp"> The Avalon Project</a> at Yale. Some speeches were missing, so I found them in various online resources.</p>
<p>I preprocessed the data by removing stop-words I found from a <a href="https://pypi.python.org/pypi/stop-words">standard list</a>. I added the names of all former presidents to the list of stop-words – since the new president typically thanks the former president in his speech, I did not think they would be informative in the diagrams.</p>
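<p>The preprocessing amounts to something like the following sketch (the stop-word and president-name lists are heavily abbreviated here for illustration):</p>

```python
import re

# Abbreviated stand-ins for the full stop-word list and president names.
stop_words = {"the", "and", "of", "to", "a"} | {"obama", "bush", "trump"}

def preprocess(text):
    # Lowercase, keep only word characters, and drop stop-words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess("The jobs, the factories -- and President Obama."))
```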
<p>I used <em>tf-idf</em> to find the most important words for each speech, which is essentially the product of how common a word is for a certain speech (<em>tf</em>, or term frequency), and how rare that word is in comparison to the other speeches (<em>idf</em>, or inverse document frequency). We have a score for each speech <script type="math/tex">d</script> and word <script type="math/tex">t</script> (with <script type="math/tex">N</script> total speeches), given by <script type="math/tex">tfidf(d,t)</script> where</p>
<ul>
<li><script type="math/tex">tf(d,t) =</script> number of times word <script type="math/tex">t</script> appears in speech <script type="math/tex">d</script></li>
<li><script type="math/tex">idf(t) = \log \frac{N}{\text{number of speeches with word } t}</script> (note this score is shared across speeches)</li>
<li><script type="math/tex">tfidf(d,t) = tf(d,t) \cdot idf(t)</script>.</li>
</ul>
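<p>The formulas above translate almost directly into code; here’s a toy sketch on a three-“speech” corpus (my actual analysis used <code class="highlighter-rouge">gensim</code>’s implementation rather than this hand-rolled version):</p>

```python
import math

# A tiny toy corpus of tokenized "speeches".
speeches = [
    "jobs jobs factories winning".split(),
    "journey generation crisis winter".split(),
    "freedom tyranny freedom defended".split(),
]
N = len(speeches)  # total number of speeches

def tfidf(doc_tokens, term):
    tf = doc_tokens.count(term)                       # term frequency in this speech
    df = sum(1 for s in speeches if term in s)        # number of speeches with the word
    return tf * math.log(N / df) if df else 0.0       # tf * idf

print(tfidf(speeches[0], "jobs"))     # frequent here, absent elsewhere -> high score
print(tfidf(speeches[2], "freedom"))
```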
<p>Thus, words in a speech with a high <script type="math/tex">tfidf</script> are used frequently in that speech yet rarely mentioned in other speeches. I used an <a href="https://radimrehurek.com/gensim/models/tfidfmodel.html">off-the-shelf implementation</a> of tf-idf from <code class="highlighter-rouge">gensim</code>.</p>
<p>Finally, I used the <a href="https://github.com/amueller/word_cloud">“word_cloud” library</a> to make the graphics. I found <a href="https://github.com/amueller/word_cloud/blob/master/examples/colored.py">these</a> <a href="https://github.com/amueller/word_cloud/blob/master/examples/a_new_hope.py">examples</a> incredibly helpful, and the library is very straightforward to use with basic Python experience. I used these images of <a href="https://img1.etsystatic.com/140/0/6522319/il_fullxfull.990448319_izew.jpg">Trump</a>, <a href="https://tmillan1.files.wordpress.com/2012/03/nobackgroundobama.png">Obama</a>, and <a href="https://lhhs.neocities.org/georgebush.png">Bush</a> as masks.</p>
<p>All my code is available <a href="https://github.com/keyonvafa/inaugural-wordclouds">here</a>.</p>
<hr />
<h2 id="causal-inference-for-genetic-associations">Causal Inference for Genetic Associations</h2>
<p><em>keyonvafa, 2017-01-19, <a href="https://keyonvafa.github.io/genetic-associations-song-hao-storey">https://keyonvafa.github.io/genetic-associations-song-hao-storey</a></em></p>
<p>Recently I’ve become interested in causal inference, as I’ve been exploring both the <a href="https://www.cambridge.org/core/books/causal-inference-for-statistics-social-and-biomedical-sciences/71126BE90C58F1A431FE9B2DD07938AB">foundational approaches</a> and the more recent <a href="http://icml.cc/2012/papers/625.pdf">applications to machine learning</a>. An important application is genome-wide association studies (GWAS), where biologists attempt to uncover the causal link between genotypes and traits of interest (i.e. what part of the genome causes orange hair?).</p>
<p>In this post, I’ll go over one specific GWAS approach by Minsun Song, Wei Hao, and John Storey, as described in “<a href="https://www.ncbi.nlm.nih.gov/pubmed/25822090">Testing for genetic associations in arbitrarily structured populations</a>.” Although their writeup is specific to genetic studies, the main ideas of the paper extend to applications beyond GWAS. No background in genetics is required for this summary.</p>
<h2 id="the-genotype-conditional-association-test">The Genotype-Conditional Association Test</h2>
<p>We’re interested in testing whether certain genes cause a trait, but there are two confounding problems:</p>
<ol>
<li>Genotype frequencies aren’t homogeneous – specific and complicated population patterns are encoded into the genome (these may also affect the trait of interest). In statistical terms, I think this amounts to the genotypes not being i.i.d.</li>
<li>There may be non-genetic factors that affect the trait of interest, such as lifestyle or environment.</li>
</ol>
<p>Song et al.’s solution to this problem is, in my opinion, the coolest thing about <a href="https://www.ncbi.nlm.nih.gov/pubmed/25822090">the paper</a>: they introduce a latent catch-all variable, <script type="math/tex">z</script>, which captures information including population structure (which directly affects the genotype frequencies) and non-genetic factors (which directly affect the trait of interest).</p>
<p>First, some notation (I’m diverging slightly from the notation used in the paper): we have <script type="math/tex">n</script> human beings, and for each human <script type="math/tex">i</script>, we are interested in a particular trait <script type="math/tex">y_i</script>. Each human has <script type="math/tex">m</script> SNPs, referred to as <script type="math/tex">x_{ij} \in \{0,1,2\}</script> for human <script type="math/tex">i</script> and SNP <script type="math/tex">j \in 1, \dots, m.</script> In the causal inference framework, each <script type="math/tex">x_{ij}</script> is a treatment, and we would like to know which are causally linked to the outcome <script type="math/tex">y_i</script> (which I’ll assume is continuous for this post). SNPs refer to specific genome locations, and the <script type="math/tex">x_{ij}</script> values refer to the possible pairs of letters the alleles can take on. Introducing <script type="math/tex">z_i</script>, the diagram below (modified from the original paper) depicts the relationships of interest:</p>
<p><img src="/assets/images/genetic_associations_blog/song_gwas_figure.png" alt="Model from Song et al." /></p>
<p>In this diagram, we are testing the causal effect of <script type="math/tex">x_{ij}</script> on <script type="math/tex">y_i</script>. The latent variable <script type="math/tex">z_i</script> captures information including population structure (which directly affect the <script type="math/tex">x_{ij}</script>, through <script type="math/tex">\pi</script>) and non-genetic factors (which directly affect the <script type="math/tex">y_i,</script> through <script type="math/tex">\lambda</script>). Thus, by assuming the treatments <script type="math/tex">x_{ij}</script> only depend on <script type="math/tex">z_i</script> through <script type="math/tex">\pi</script>, we can remove the confounding effect of <script type="math/tex">z_i</script> by modeling <script type="math/tex">\pi</script>.</p>
<p>Their full process, known as a “genotype-conditional association test” (GCAT, which are also the four possible nucleotide letters) has two parts:</p>
<ol>
<li>Modeling <script type="math/tex">\pi_j(z_i)</script> to account for the confounding effect of <script type="math/tex">z_i</script>.</li>
<li>Testing for causality between <script type="math/tex">x_{ij}</script> and <script type="math/tex">y_i</script> after accounting for <script type="math/tex">\pi_j(z_i)</script>.</li>
</ol>
<p>Song et al. use the <a href="https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle">Hardy-Weinberg Equilibrium</a> to model <script type="math/tex">\pi_j(z_i)</script>. For those of us (like me) who are unfamiliar with biology, the idea is that <script type="math/tex">x_{ij}</script> takes on values 0, 1, or 2 based on a binomial distribution with some probability; thus, <script type="math/tex">\pi_j(z_i)</script> is an attempt to model this probability. The authors introduce a method they call “<a href="https://arxiv.org/pdf/1312.2041v2.pdf">logistic factor analysis</a>” (LFA) as a solution. The full math is a little hairy and is largely based on singular value decompositions and projections which I don’t have the intuition for (check out the <a href="https://arxiv.org/pdf/1312.2041v2.pdf">paper</a> for more details). The basic model is a matrix decomposition with <script type="math/tex">d</script> latent factors, so <script type="math/tex">\text{logit}(\pi_j(z_i)) = \sum_{k=1}^d h_{ik}a_{kj} = h_i^Ta_j</script>, and</p>
<script type="math/tex; mode=display">x_{ij} \vert z_i \sim \text{Bin}(2,\text{logit}^{-1}(h_i^Ta_j)).</script>
<p>In this model, the person-specific <script type="math/tex">h_i</script> and the SNP-specific <script type="math/tex">a_j</script> are learned using maximum likelihood.</p>
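<p>The generative side of this model is easy to sketch – given latent factors <script type="math/tex">h_i</script> and loadings <script type="math/tex">a_j</script> (both made up below), genotypes are drawn from the binomial above. Note this simulates <em>from</em> the model; it is not the LFA fitting procedure:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d = 100, 50, 3            # people, SNPs, latent dimensions (arbitrary)
h = rng.normal(size=(n, d))     # person-specific latent factors h_i
a = rng.normal(size=(d, m))     # SNP-specific loadings a_j

logits = h @ a                  # logit(pi_j(z_i)) = h_i^T a_j
pi = 1.0 / (1.0 + np.exp(-logits))   # inverse logit
x = rng.binomial(2, pi)              # genotypes x_ij in {0, 1, 2}
print(x.shape, x.min(), x.max())
```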
<p>Finally, we would like to test the causal relationship between <script type="math/tex">x_{ij}</script> and <script type="math/tex">y_i</script>. They test significance by checking if <script type="math/tex">\beta_j = 0</script> in the following model:</p>
<script type="math/tex; mode=display">y_i = \alpha + \sum_{j=1}^m \beta_jx_{ij} + \lambda_i + \epsilon_i,</script>
<p>where <script type="math/tex">\lambda_i</script> is the non-genetic effect and <script type="math/tex">\epsilon_i</script> is a Gaussian error (both functions of <script type="math/tex">z_i</script>). However, the authors would prefer to not model <script type="math/tex">y_i</script>, so as to not encode any assumptions about its distribution. The distribution of <script type="math/tex">x_{ij}</script> is the only one based on a known, scientific phenomenon (the <a href="https://en.wikipedia.org/wiki/Hardy%E2%80%93Weinberg_principle">Hardy-Weinberg Equilibrium</a>), so we would like to model as little else as possible. They claim that testing <script type="math/tex">\beta_j = 0</script> in the above model is equivalent to testing <script type="math/tex">b_j = 0</script> in:</p>
<script type="math/tex; mode=display">x_{ij} \vert y_i, z_i \sim \text{Bin}(2, \text{logit}^{-1}(a_j + b_jy_i + \text{logit}(\pi_j(z_i)))).</script>
<p>This is known as setting up the problem as an <em>inverse-regression</em>. Finally, they run a likelihood-ratio test to test significance. The paper includes simulation studies that validate the effectiveness of the model.</p>
<h2 id="final-thoughts">Final Thoughts</h2>
<p>I think the paper does a great job of explaining a novel model, even to someone (like me) who is unfamiliar with biology. There were only a couple of things that confused me: for one, I wasn’t entirely sure why the inverse-regression step was necessary. They claim it’s done to avoid putting a distribution on <script type="math/tex">y_i</script>, but it appears that they’re doing that by including a Gaussian term in the specification of <script type="math/tex">y_i</script>. Another thing I’m not sure about is false discovery rates – it looks like we’re performing the same test for each SNP independently, but I don’t see where we correct for multiple testing (unless it’s somehow incorporated in the <script type="math/tex">z_i</script> term).</p>
<p>Overall, I highly recommend reading the paper, and I would be excited to see applications in areas outside of genetics.</p>