Incorporating false discovery rates into genetic association in autism

This guest post was contributed by Joseph Buxbaum, Mark Daly, Silvia De Rubeis, Bernie Devlin, Kathryn Roeder, and Kaitlin Samocha from the Autism Sequencing Consortium (see affiliations and details at the end of the post).

Autism spectrum disorder (ASD) is a highly heritable condition characterized by deficits in social communication, and by the presence of repetitive behaviors and/or stereotyped interests. While it is clear from family and twin studies that genetic factors contribute strongly to the onset of this disorder, the search for specific risk genes for ASD has only recently begun to yield fruit. Finding these specific genes is critical not only for providing potential diagnoses for individual families, but also for obtaining insights into the pathological processes that underlie this neurodevelopmental disorder, which may ultimately lead to novel therapeutic approaches. Identification of ASD genes may at some point also reveal part of what makes us social beings.

In a paper published in Nature last week we and the other members of the Autism Sequencing Consortium (ASC) describe the application of whole exome sequencing (WES), which selectively sequences the coding regions of the genome, to identify rare genetic variants and then genes associated with risk for ASD. WES data were analyzed from nearly 4,000 individuals with autism and nearly 10,000 controls. In these analyses, we identify and subsequently analyze a set of 107 autosomal genes with a false discovery rate (FDR) of <30%; in total, this larger set of genes harbors de novo loss of function (LoF) mutations in 5% of cases, and numerous de novo missense and inherited LoF mutations in additional cases. Three critical pathways contributing to ASD were identified: chromatin remodelling, transcription and splicing, and synaptic function. Chromatin remodelling controls events underlying neural connectivity. Risk variation also impacted multiple components of synaptic networks. Because a wide set of synaptic genes is disrupted in ASD, it seems reasonable to suggest that altered chromatin dynamics and transcription, induced by disruption of relevant genes, lead to impaired synaptic function as well.

In this post we wanted to focus on an easily-overlooked aspect of this paper: the use of a false discovery rate (FDR) approach to identifying genes for follow-up analysis. While FDR is a well-recognized approach in biology, one could also argue for using a family wise error rate (FWER), which has been the norm in recent large-scale, genome-wide association studies (GWAS). So why did we decide to take this alternative approach here?

Let’s start with how to interpret FDR (i.e., the False Discovery Rate). To make it concrete, within the top 107 genes identified, approximately 30% are not actually associated with autism (and could be considered ‘false positives’). They are simply present on the list by chance. We emphasize this mathematical point for an important reason. There is a profusion of papers and grant applications that cite weak genetic evidence for a particular gene as a rationale for a study and we would not want to contribute to that trend. If a postdoc dedicates his/her career to a gene with FDR = .29, we would have to wonder about the advisor; this is a gamble that we would not encourage in any way. Indeed, in our study, we only use the complete FDR < .3 results for enrichment and pathway analyses. For a more complete look, we examined genes at multiple levels, including FDR controlled at .05, .1 and .3. Defined rigorously and interpreted appropriately, we think FDR is the best approach for extracting insights from these highly discrete data – and the fundamental differences from approaches used with common variation are important to explain clearly.

Why is the FDR a good fit to our study design? A first reason is the nature of the signal here. WES captures discrete independent events within a gene, almost all of them de novo or very recent mutations, each of which can be assigned to exactly one gene. Moreover, the data themselves sketch a scenario where there is a strong genome-wide excess of LoF and other deleterious events in cases, but there are very few such events in each gene, even looking at the top genes. We have a good handle on both the excess of these events and the accounting for such events, given the availability of accurate per gene mutation rates. With over 1,000 genes relevant to ASD, FDR then becomes a useful construct for exploring, in a principled manner, the patterns and processes implicated. We should note that we did compute p-values from de novo LoF events (as in Samocha et al., 2014) and, while some genes exceed genome-wide significance, the de novo LoF events in this small list of genes explain only a small amount of the overall excess of events.

Specifically – considering only de novo LoF mutations observed in the 2297 trios analyzed in the ASC paper, we observed 317 mutations and expected only 197, suggesting that roughly 120 constitute true signal. In a gene-by-gene analysis, 7 individual genes exceed a traditional “genome-wide significance” threshold – here set at .05 / # genes in the genome or roughly 2.5×10^-6. These genes harbor in total only 26 of the de novo LoF mutations and therefore obviously account for a distinct minority of the relevant genes hit by mutation. Moreover, such an analysis, while a useful starting point, quite clearly can be expanded to inherited variation and mutations in other categories in order to fully assess the role of rare variation (per-gene and genome-wide) in autism.
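The arithmetic in the paragraph above can be checked in a few lines. This is only a sketch: the counts are the ones quoted in the text, while the figure of 20,000 genes is an assumption, chosen because it is the gene count implied by the quoted threshold of roughly 2.5×10^-6. The last ratio is a rough upper bound that treats all 26 mutations in the significant genes as true signal.

```python
# Counts quoted in the paragraph above.
observed_lof = 317              # de novo LoF mutations observed in the trios
expected_lof = 197              # de novo LoF mutations expected by chance
lof_in_significant_genes = 26   # mutations in the 7 genome-wide significant genes

# Assumption: about 20,000 genes, the count implied by the ~2.5e-6 threshold.
n_genes_tested = 20_000
alpha = 0.05

# Excess of observed over expected mutations, read as "true signal".
excess = observed_lof - expected_lof

# Bonferroni-style genome-wide significance threshold per gene.
bonferroni_threshold = alpha / n_genes_tested

# Share of all observed de novo LoF mutations that fall in the 7 genes.
share_of_observed = lof_in_significant_genes / observed_lof

# Rough upper bound on the share of the excess captured by the 7 genes,
# assuming every one of their 26 mutations is true signal.
share_of_excess = lof_in_significant_genes / excess

print(f"excess mutations: {excess}")
print(f"per-gene threshold: {bonferroni_threshold:.1e}")
print(f"share of observed mutations in significant genes: {share_of_observed:.1%}")
print(f"upper bound on share of excess explained: {share_of_excess:.1%}")
```

Under these assumptions the 7 genes hold about 8% of the observed mutations and at most about a fifth of the estimated excess, which is the sense in which they account for a distinct minority of the signal.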

To incorporate additional data beyond de novo LoF variation, the ASC used a Bayesian framework, called TADA (He et al., 2013). While this Bayesian model does not produce a p-value, Bayesian posterior probabilities work comfortably with FDR and perform very similarly, in that they control false discoveries at the expected rate. Extrapolating p-values from the Bayesian analyses would require many more approximations. So here, too, an FDR approach is justified. The two approaches, however, are quite reliably linked – the 7 genes noted above, along with 6 others, all have FDR < 0.01 – demonstrating both consistency and the immediate value of expanding the analytic framework in a robust fashion.
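To illustrate how posterior probabilities can be turned into an FDR, here is a minimal sketch of one standard Bayesian FDR calculation: rank genes by their posterior probability of association, and define each gene's q-value as the average posterior probability of no association among all genes ranked at or above it. This is not claimed to be the exact TADA implementation; the function name `bayesian_q_values` and the posterior values are hypothetical and chosen only for illustration.

```python
from typing import List


def bayesian_q_values(posterior_alt: List[float]) -> List[float]:
    """Return one q-value per gene, in the input order.

    posterior_alt[i] is the posterior probability that gene i is a true
    risk gene. The q-value of a gene is the mean posterior probability of
    the null among all genes ranked at or above it.
    """
    # Indices of genes sorted from strongest to weakest evidence.
    order = sorted(range(len(posterior_alt)),
                   key=lambda i: posterior_alt[i], reverse=True)
    q_values = [0.0] * len(posterior_alt)
    running_null = 0.0
    for rank, idx in enumerate(order, start=1):
        running_null += 1.0 - posterior_alt[idx]  # posterior null probability
        q_values[idx] = running_null / rank       # expected false fraction so far
    return q_values


# Hypothetical posterior probabilities for six genes (not real data).
posteriors = [0.99, 0.95, 0.80, 0.60, 0.30, 0.05]
q_values = bayesian_q_values(posteriors)

# Genes retained when the list is controlled at FDR < 0.30.
selected = [i for i, q in enumerate(q_values) if q < 0.30]

for i, (post, q) in enumerate(zip(posteriors, q_values)):
    print(f"gene {i}: posterior = {post:.2f}, q-value = {q:.3f}")
print("selected at FDR < 0.30:", selected)
```

In this toy list the fifth gene enters at FDR < 0.30 even though its own posterior probability is only 0.30. The FDR describes the list as a whole, which is why a gene near the cutoff is weak evidence on its own.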

Why has the FDR approach not been widely embraced for GWAS? It is worth considering that the proximal goal of most genetic studies of common diseases is to define a conclusive link between disease and gene. In GWAS, the task of linking positive SNP associations to specific genes adds considerable but underappreciated complexity. For example, the wide variation in both extent of (1) regional linkage disequilibrium and (2) gene sizes has by itself made even the derivation of a “gene-wise” p-value from GWAS a problem that has yielded numerous meritorious but distinct solutions (gene-size in particular being challenging to harmonize under the null and alternative hypotheses). Further higher-order complexities (e.g., genes in specific functional categories are often systematically larger or smaller than average, related genes are often physically co-localized, all GWAS studies as expected have stronger results in regions near genes and in regions of larger LD since each SNP tested represents more neighboring and potentially functional variants) make it very challenging to use FDR appropriately as related to genes, which are the primary units of study for downstream functional analyses. Moreover, the hypotheses pursued in initial GWAS studies have quite sensibly generally been a) could we prove one or more common variants are conclusively relevant to disease and subsequently b) having done so, could we understand and connect the molecular function of associated alleles to disease biology in an actionable setting. Given the uncertainty of a) and considerable expense of b), a robust means of identifying associated common variants seems critical.

There are other arguments in favor of FWER for GWAS arising from the fundamental distinction between FDR and FWER. FWER and FDR perform essentially the same when there is no signal to be found in the data (i.e., under the null hypothesis of no association). They diverge only when there is signal and diverge widely only when there are many common variants associated with disease. In this sense one can learn more from FDR, but only if the model assumed to be generating the data is a very good approximation for reality. Here is where the nature of GWAS data becomes important. The patchiness of LD – strong and including many common variants in some regions of the genome, yet weak in others – together with potential biases inherent in calling SNPs from microarrays or imputing their genotypes on the basis of LD and measured genotypes, makes it very complex data to model. How one calibrates the FDR to “learn” how much signal is truly excess remains an open question, at least in our opinion. FWER does not “learn” from the pattern of association and thus is robust against these complexities.
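The contrast between the two error rates can be seen in a small deterministic toy example. The sketch below compares a Bonferroni correction (a common way to control FWER) with the Benjamini-Hochberg step-up procedure (a common way to control FDR). The p-values are constructed by hand, not drawn from any real study: an evenly spaced grid stands in for an idealized null, and the "signal" p-values are hypothetical.

```python
from typing import List


def bonferroni_rejections(p_values: List[float], alpha: float) -> int:
    """Number of tests rejected when FWER is controlled by Bonferroni."""
    threshold = alpha / len(p_values)
    return sum(p <= threshold for p in p_values)


def benjamini_hochberg_rejections(p_values: List[float], q: float) -> int:
    """Number of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    largest_k = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest rank whose p-value is under its line.
        if p <= rank * q / m:
            largest_k = rank
    return largest_k


# Idealized null: 1,000 evenly spaced p-values, no signal anywhere.
null_p_values = [(i + 0.5) / 1000 for i in range(1000)]

# Hypothetical mixture: 10 strong signals, 90 moderate signals, 900 null tests.
strong = [1e-6 * (j + 1) for j in range(10)]
moderate = [1e-4 + 4e-5 * j for j in range(90)]
background = [(i + 0.5) / 900 for i in range(900)]
signal_p_values = strong + moderate + background

for label, pvals in [("no signal", null_p_values), ("many signals", signal_p_values)]:
    n_fwer = bonferroni_rejections(pvals, alpha=0.05)
    n_fdr = benjamini_hochberg_rejections(pvals, q=0.05)
    print(f"{label}: Bonferroni rejects {n_fwer}, Benjamini-Hochberg rejects {n_fdr}")
```

With no signal, both procedures reject nothing. With many moderate signals, Bonferroni keeps only the 10 strongest tests while Benjamini-Hochberg rejects 105, of which 5 come from the null grid. That gain depends on the p-values being well calibrated, which is the difficulty raised above for GWAS data.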

Groups interested in specific disorders often use a genetics-first rationale to choose genes for follow up; strong genetic support for a specific gene enhances the validity of the subsequent in vitro or in vivo analyses. In spite of this, there remains significant investment in genes that do not have robust genetic support for a given genetic disorder, even when the rationale of the study begins with genetic relevance. (This phenomenon is particularly surprising when it occurs in industry, where an explicit commitment to the most compelling targets at the outset of a drug discovery pipeline has been proposed as a means of providing some protection against the systemic downstream failures.) We therefore stress again that a major value of the FDR approach is to highlight pathways implicated in disease, and to a lesser extent to implicate specific genes. We also stress that the FDR framework is straightforward and justified for multiple reasons in the particular case of WES analyses of rare variation in complex disease, but is not nearly as obviously or readily applicable to GWAS. Even in the case of GWAS, the emerging evidence of the considerable polygenicity of disease and continued efforts to wrestle with the many analytic complexities unique to GWAS data may yield similarly important uses of FDR, if done carefully.

Moving forward, the identification of ASD genes will lead to better cell and animal models, which in turn will enhance our understanding of the pathological processes involved in ASD. There are several ongoing clinical trials in ASD that have emerged from gene discovery and the subsequent study of animal models, so there is good reason for optimism in ASD.

About the authors
The authors are all members of the Autism Sequencing Consortium (ASC), a multinational collaboration to identify and characterize autism genes. The rationale for, and approaches of, the ASC were described in Buxbaum et al., 2012 and further information can be found here.

Joseph D. Buxbaum (Seaver Autism Center, Department of Psychiatry, Icahn School of Medicine, New York, NY), Mark J. Daly (Analytic and Translational Genetics Unit, Massachusetts General Hospital and the Broad Institute, Boston, MA), Bernie Devlin (Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA) and Kathryn Roeder (Ray and Stephanie Lane Center for Computational Biology, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA), together with Matthew W. State (Department of Psychiatry, University of California, San Francisco, San Francisco, CA), are PIs in the consortium, and Daly, Devlin, and Roeder lead the Statistical Analysis Committee of the ASC. Silvia De Rubeis is a postdoctoral student at the Seaver Autism Center, and Kaitlin Samocha is a graduate student with the Analytic and Translational Genetics Unit.