Recently I’ve become interested in causal inference, and I’ve been exploring both the foundational approaches and the more recent applications to machine learning. An important application is genome-wide association studies (GWAS), where biologists attempt to uncover the causal link between genotypes and traits of interest (e.g., what part of the genome causes orange hair?).
In this post, I’ll go over one specific GWAS approach by Minsun Song, Wei Hao, and John Storey, as described in “Testing for genetic associations in arbitrarily structured populations.” Although their writeup is specific to genetic studies, the main ideas of the paper extend to applications beyond GWAS. No background in genetics is required for this summary.
We’re interested in testing whether certain genes cause a trait, but there are two sources of confounding: population structure and non-genetic factors.
Song et al.’s solution to this problem is, in my opinion, the coolest thing about the paper: they introduce a latent catch-all variable, $z$, which captures information including population structure (which directly affects the genotype frequencies) and non-genetic factors (which directly affect the trait of interest).
First, some notation (I’m diverging slightly from the notation used in the paper): we have $n$ human beings, and for each human $i$, we are interested in a particular trait $y_i$. Each human has $m$ SNPs, referred to as $x_{ij}$ for human $i$ and SNP $j$. In the causal inference framework, each $x_{ij}$ is a treatment, and we would like to know which SNPs are causally linked to the outcome $y_i$ (which I’ll assume is continuous for this post). SNPs refer to specific genome locations, and the values of $x_{ij}$ encode the possible pairs of letters the alleles can take on. Introducing the latent variable $z_i$ for each human, the diagram below (modified from the original paper) depicts the relationships of interest:
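In this diagram, we are testing the causal effect of the genotype $x_{ij}$ on the trait $y_i$. The latent variable $z_i$ captures information including population structure (which directly affects the genotypes $x_{ij}$, through the allele frequencies $\pi_{ij}$) and non-genetic factors (which directly affect the trait $y_i$ through a non-genetic effect $\lambda_i$). Thus, by assuming the treatments $x_{ij}$ only depend on $z_i$ through $\pi_{ij}$, we can remove the confounding effect of $z_i$ by modeling $\pi_{ij}$.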
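To make the confounding concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration and not taken from the paper: a binary latent variable `z` standing in for two subpopulations, allele frequencies of 0.2 and 0.8, and a trait that depends on `z` alone. The SNP has no causal effect at all, yet it looks associated with the trait until we condition on `z`.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n=4000, seed=0):
    """Toy data: z drives both genotype and trait; the SNP itself has no effect."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        z_i = rng.randint(0, 1)                           # hypothetical two-population structure
        pi_i = 0.8 if z_i == 1 else 0.2                   # allele frequency depends on z only
        x_i = sum(rng.random() < pi_i for _ in range(2))  # Binomial(2, pi) genotype
        y_i = 2.0 * z_i + rng.gauss(0.0, 1.0)             # trait depends on z only, not on x
        z.append(z_i)
        x.append(x_i)
        y.append(y_i)
    return z, x, y


z, x, y = simulate()

# Ignoring z: the genotype and the trait appear strongly related.
naive = correlation(x, y)

# Conditioning on z: within each subpopulation the relationship vanishes.
within = []
for k in (0, 1):
    x_k = [x_i for x_i, z_i in zip(x, z) if z_i == k]
    y_k = [y_i for y_i, z_i in zip(y, z) if z_i == k]
    within.append(correlation(x_k, y_k))

print(f"naive correlation: {naive:.3f}")
print("within-population correlations:", [round(c, 3) for c in within])
```

In real data the latent variable is not observed, which is why the paper has to estimate the allele frequencies instead of simply stratifying as this toy example does.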
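Their full process, known as a “genotype-conditional association test” (GCAT, an acronym that doubles as the four nucleotide letters), has two parts: first, estimate the genotype probabilities implied by the latent structure; second, test each SNP for association with the trait while conditioning on those probabilities.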
Song et al. use the Hardy-Weinberg Equilibrium to model the genotype $x_{ij}$. For those of us (like me) who are unfamiliar with biology, the idea is that $x_{ij}$ takes on values 0, 1, or 2 based on a binomial distribution with some probability; thus, $\pi_{ij}$ is an attempt to model this probability. The authors introduce a method they call “logistic factor analysis” (LFA) as a solution. The full math is a little hairy and is largely based on singular value decompositions and projections which I don’t have the intuition for (check out the paper for more details). The basic model is a matrix decomposition with $d$ latent factors, so $z_i \in \mathbb{R}^d$, and

$$x_{ij} \mid z_i \sim \text{Binomial}(2, \pi_{ij}), \qquad \text{logit}(\pi_{ij}) = a_j^\top z_i.$$

In this model, the person-specific $z_i$ and the SNP-specific $a_j$ are learned using maximum likelihood.
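As a rough illustration of what fitting such a model by maximum likelihood can look like, here is a minimal sketch. It is not the authors’ LFA algorithm (which relies on singular value decompositions and projections); it is plain gradient ascent on the binomial log-likelihood. The names `fit_factor_model`, `u`, `a0`, and `a1` are hypothetical, and I assume a single latent factor `u[i]` per person plus a per-SNP intercept, which corresponds to a two-dimensional latent vector whose first coordinate is fixed to 1.

```python
import math
import random


def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp so exp() and log() stay well-behaved
    return 1.0 / (1.0 + math.exp(-t))


def log_likelihood(x, u, a0, a1):
    """Binomial(2, pi_ij) log-likelihood, dropping the constant binomial coefficient."""
    total = 0.0
    for i, row in enumerate(x):
        for j, x_ij in enumerate(row):
            p = sigmoid(a0[j] + a1[j] * u[i])
            total += x_ij * math.log(p) + (2 - x_ij) * math.log(1.0 - p)
    return total


def fit_factor_model(x, steps=100, lr=0.2, seed=0):
    """Gradient ascent on logit(pi_ij) = a0[j] + a1[j] * u[i]."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    u = [rng.gauss(0.0, 0.1) for _ in range(n)]   # person-specific factor
    a0 = [0.0] * m                                # SNP-specific intercept
    a1 = [rng.gauss(0.0, 0.1) for _ in range(m)]  # SNP-specific loading
    for _ in range(steps):
        # x_ij - 2 * pi_ij is the derivative of the log-likelihood w.r.t. logit(pi_ij).
        resid = [[x[i][j] - 2.0 * sigmoid(a0[j] + a1[j] * u[i]) for j in range(m)]
                 for i in range(n)]
        # Gradients are averaged over SNPs (for u) or people (for a0, a1).
        grad_u = [sum(resid[i][j] * a1[j] for j in range(m)) / m for i in range(n)]
        grad_a0 = [sum(resid[i][j] for i in range(n)) / n for j in range(m)]
        grad_a1 = [sum(resid[i][j] * u[i] for i in range(n)) / n for j in range(m)]
        u = [v + lr * g for v, g in zip(u, grad_u)]
        a0 = [v + lr * g for v, g in zip(a0, grad_a0)]
        a1 = [v + lr * g for v, g in zip(a1, grad_a1)]
    return u, a0, a1


# Simulated genotypes from two hypothetical subpopulations (pop = -1 or +1).
rng = random.Random(1)
n, m = 60, 30
pop = [rng.choice((-1.0, 1.0)) for _ in range(n)]
true_b0 = [rng.gauss(0.0, 1.0) for _ in range(m)]
true_b1 = [rng.gauss(0.0, 1.0) for _ in range(m)]
x = [[sum(rng.random() < sigmoid(true_b0[j] + true_b1[j] * pop[i]) for _ in range(2))
      for j in range(m)] for i in range(n)]

u, a0, a1 = fit_factor_model(x)
baseline = log_likelihood(x, [0.0] * n, [0.0] * m, [0.0] * m)  # every pi_ij = 0.5
fitted = log_likelihood(x, u, a0, a1)
print(f"baseline log-likelihood: {baseline:.1f}, fitted: {fitted:.1f}")
```

The fitted probabilities `sigmoid(a0[j] + a1[j] * u[i])` play the role of the estimated allele frequencies. The step size and iteration count are arbitrary choices for this toy problem, not tuned values.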
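Finally, we would like to test the causal relationship between $x_{ij}$ and $y_i$. They test significance by checking if $\beta_j = 0$ in the following model:

$$y_i = \alpha + \sum_{j=1}^{m} \beta_j x_{ij} + \lambda_i + \epsilon_i,$$

where $\lambda_i$ is the non-genetic effect and $\epsilon_i$ is a Gaussian error (both functions of $z_i$). However, the authors would prefer to not model $y_i$, so as to not encode any assumptions about its distribution. The distribution of $x_{ij}$ is the only one based on a known, scientific phenomenon (the Hardy-Weinberg Equilibrium), so we would like to model as little else as possible. They claim that testing $\beta_j = 0$ in the above model is equivalent to testing $b_j = 0$ in:

$$x_{ij} \mid y_i, z_i \sim \text{Binomial}\left(2, \ \text{logit}^{-1}\left(c_j + b_j y_i + \text{logit}(\pi_{ij})\right)\right).$$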
This is known as setting up the problem as an inverse regression. They then run a likelihood-ratio test to assess significance. The paper includes simulation studies that validate the effectiveness of the model.
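Here is a minimal sketch of what the inverse-regression test might look like for a single SNP. It fits a binomial regression of the genotype on the trait, with the logit of the allele frequency as a fixed offset, and compares it to the same model without the trait via a likelihood-ratio test with one degree of freedom. The function names are hypothetical, the data are simulated, and I plug in the true allele frequencies where the actual method would use the LFA estimates.

```python
import math
import random


def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp so exp() and log() stay well-behaved
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def max_log_likelihood(x, y, offset, use_trait):
    """Newton's method for x_i ~ Binomial(2, sigmoid(c + b * y_i + offset_i)).

    When use_trait is False, b is held at 0 (the null model).
    """
    c = b = 0.0
    for _ in range(25):
        g_c = g_b = h_cc = h_cb = h_bb = 0.0
        for x_i, y_i, o_i in zip(x, y, offset):
            p = sigmoid(c + b * y_i + o_i)
            r = x_i - 2.0 * p            # score contribution
            w = 2.0 * p * (1.0 - p)      # binomial variance (information weight)
            g_c += r
            g_b += r * y_i
            h_cc += w
            h_cb += w * y_i
            h_bb += w * y_i * y_i
        if use_trait:
            det = h_cc * h_bb - h_cb * h_cb
            c, b = (c + (h_bb * g_c - h_cb * g_b) / det,
                    b + (h_cc * g_b - h_cb * g_c) / det)
        else:
            c += g_c / h_cc
    total = 0.0
    for x_i, y_i, o_i in zip(x, y, offset):
        p = sigmoid(c + b * y_i + o_i)
        total += x_i * math.log(p) + (2 - x_i) * math.log(1.0 - p)
    return total


def lrt_pvalue(x, y, offset):
    """Likelihood-ratio test of b = 0 against a chi-squared with 1 degree of freedom."""
    stat = 2.0 * (max_log_likelihood(x, y, offset, True)
                  - max_log_likelihood(x, y, offset, False))
    stat = max(0.0, stat)                    # guard against tiny negative rounding
    return math.erfc(math.sqrt(stat / 2.0))  # chi-squared(1) survival function


# Simulated data (all values are assumptions): one causal SNP, one null SNP,
# both with allele frequencies that depend on a binary latent variable z.
rng = random.Random(0)
n = 2000
x_causal, x_null, y, off_causal, off_null = [], [], [], [], []
for _ in range(n):
    z_i = rng.randint(0, 1)
    pi_c, pi_n = (0.6, 0.8) if z_i == 1 else (0.3, 0.2)
    xc = sum(rng.random() < pi_c for _ in range(2))
    xn = sum(rng.random() < pi_n for _ in range(2))
    x_causal.append(xc)
    x_null.append(xn)
    y.append(0.6 * xc + 2.0 * z_i + rng.gauss(0.0, 1.0))  # only the causal SNP enters
    off_causal.append(logit(pi_c))
    off_null.append(logit(pi_n))

p_causal = lrt_pvalue(x_causal, y, off_causal)
p_null = lrt_pvalue(x_null, y, off_null)
p_null_naive = lrt_pvalue(x_null, y, [0.0] * n)  # same test without the offset

print(f"causal SNP: {p_causal:.3g}")
print(f"null SNP with offset: {p_null:.3g}")
print(f"null SNP without offset: {p_null_naive:.3g}")
```

The last line is the interesting comparison: dropping the offset is one way to see what the allele-frequency term is buying us, since without it the null SNP inherits the confounding from the latent variable.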
I think the paper does a great job of explaining a novel model, even to someone (like me) who is unfamiliar with biology. There were only a couple of things that confused me. For one, I wasn’t entirely sure why the inverse-regression step was necessary. They claim it’s done to avoid putting a distribution on $y_i$, but it appears that they’re doing that anyway by including a Gaussian term in the specification of $y_i$. Another thing I’m not sure about is false discovery rates: it looks like we’re performing the same test for each SNP independently, but I don’t see where we correct for multiple testing (unless it’s somehow incorporated in one of the model’s terms).
Overall, I highly recommend reading the paper, and I would be excited to see applications in areas outside of genetics.