Author Archive for Joe Pickrell

Should the FDA regulate the interpretation of traditional epidemiology?

Last week, the FDA sent a sternly-worded letter to the personal genomics company 23andMe, arguing that the company is marketing an unapproved diagnostic device. Many have weighed in on this, but I’d like to highlight a thoughtful post by Mike Eisen.

Eisen makes the important point that interpreting the genetics literature is complicated, and a company (like 23andMe) that provides this interpretation as a service could potentially add value. I’d like to add a simple point: this is absolutely not limited to genetics. In fact, there are already many software applications that calculate your risk for various diseases based on standard (i.e. non-genetic) epidemiology. For example, here’s a (NIH-based) site for calculating your risk of having a heart attack:


And here’s a site for calculating your risk of having a stroke in the next 10 years:


And here’s one for diabetes. And colorectal cancer. And breast cancer. And melanoma. And Parkinson’s.

I don’t point this out because it leads to an obvious conclusion; it doesn’t. But all of the scientific points made about risk prediction from 23andMe (the models are not very predictive, they’re missing a lot of important variables, there are likely errors in measurements, etc.) of course apply to traditional epidemiology as well. Ultimately, I think a lot rides on the question: what is the aspect of 23andMe that sets them apart from these websites and makes them more suspect? Is it because they focus on genetic risk factors rather than “traditional” risk factors (though note several of these sites ask about family history, which of course implicitly includes genetic information)? Is it the fact that they’re a for-profit company selling a product? Is it something about the way risks are reported, or the fact that risks for many diseases are presented on a single site? Is it because some genetic risk factors (like BRCA1) have strong effects, while standard epidemiological risk factors are usually of small effect? Or is it something else?

Henrietta Lacks’s genome sequence has been publicly available for years

Last week, scientists at the European Molecular Biology Laboratory reported that they had sequenced the genome of the Henrietta Lacks, or “HeLa”, cell line. This report was met with considerable consternation by those who (justifiably, in my opinion) wondered why scientists are still experimenting on a cell line obtained without consent in the 1950s [1]. In response to a bit of a backlash, the researchers removed the HeLa sequence from the public internet, and even the paper itself might disappear from the formal scientific literature.

However, it is unfair to treat the authors of this paper as scapegoats for the systematic failure of scientists to deal with issues surrounding genomic “privacy”. Consider this important piece of information: the genome sequence of the HeLa cell line has been publicly available for years (and remains so).


My genome, unzipped

As part of the Personal Genome Project (PGP), my genome was recently sequenced by Complete Genomics. My PGP profile, including the sequence, is here, and their report on my genome is here. As I play around with the best ways to analyze these data, I’ll write additional posts, but for now I’ve noticed only one thing: I’m almost surprised by how unsurprising my full genome sequence is.

According to the PGP’s genome annotator, I have two variants of “high” clinical relevance. The first is the APOE4 allele, which Luke had already reported that I carry. The second is a variant that causes alpha-1-antitrypsin deficiency, which is also typed by 23andMe.

Of course, this is all quite reassuring. Long-time readers will remember that last year I was briefly worried that I might have Brugada syndrome. I do not carry any of the known pathogenic mutations (modulo worries about false negatives); this of course is now unsurprising, but would have been really nice information to have, say, when I was talking with a cardiologist last year.

The first steps towards a modern system of scientific publication

About a year ago on this site, I discussed a model for addressing some of the major problems in scientific publishing. The main idea was simple: replace the current system of pre-publication peer review with one in which all research is immediately published and only afterwards sorted according to quality and community interest. This post generated a lot of discussion; in conversations since, however, I’ve learned that almost anyone who has thought seriously about the role of the internet in scientific communication has had similar ideas.

The question, then, is not whether dramatic improvements in the system of scientific publication are possible, but rather how to implement them. There is now a growing trickle of papers posted to pre-print servers ahead of formal publication. I am hopeful that this is bringing us close to dispensing with one of the major obstacles in the path towards a modern system of scientific communication: the lack of rapid and wide distribution of results.*

Questioning the evidence for non-canonical RNA editing in humans

In May of last year, Li and colleagues reported that they had observed over 10,000 sequence mismatches between messenger RNA (mRNA) and DNA from the same individuals (RDD sites, for RNA-DNA differences) [1]. This week, Science has published three technical comments on this article (one that I wrote with Yoav Gilad and Jonathan Pritchard; one by Wei Lin, Robert Piskol, Meng How Tan, and Billy Li; and one by Claudia Kleinman and Jacek Majewski). We conclude that at least ~90% of the Li et al. RDD sites are technical artifacts [2,3,4]. A copy of the comment I was involved in is available here, and Li et al. have responded to these critiques [5].

In this post, I’m going to describe how we came to the conclusion that nearly all of the RDD sites are technical artifacts. For a full discussion, please read the comments themselves.


Position biases in alignments around RDD sites. For each RDD site with at least five reads mismatching the genome, we calculated the fraction of reads with the mismatch (or the match) at each position in the alignment of the RNA-seq read to the genome (on the + DNA strand). Plotted is the average of this fraction across all sites, separately for the alignments which match and mismatch the genome.
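
For readers who want to try this on their own alignments, here is a minimal sketch of the calculation described in the caption. It is not the code used for the comment: the input layout (a dictionary mapping each site to a list of `(position_in_read, is_mismatch)` pairs), the function names, and the fixed read length are all assumptions made for illustration.

```python
from collections import Counter


def position_fractions(positions, read_length):
    """Fraction of reads in which the site falls at each read position."""
    counts = Counter(positions)
    total = len(positions)
    return [counts[p] / total for p in range(read_length)]


def average_position_profile(sites, read_length, min_mismatch_reads=5):
    """Average per-position fractions across sites.

    `sites` (hypothetical layout) maps a site ID to a list of
    (position_in_read, is_mismatch) pairs, one pair per overlapping read.
    Only sites with at least `min_mismatch_reads` mismatching reads are used.
    Returns (mismatch_profile, match_profile).
    """
    mismatch_profiles, match_profiles = [], []
    for reads in sites.values():
        mismatch_pos = [pos for pos, is_mismatch in reads if is_mismatch]
        match_pos = [pos for pos, is_mismatch in reads if not is_mismatch]
        if len(mismatch_pos) < min_mismatch_reads:
            continue  # too few mismatching reads to estimate a profile
        mismatch_profiles.append(position_fractions(mismatch_pos, read_length))
        if match_pos:
            match_profiles.append(position_fractions(match_pos, read_length))

    def mean(profiles):
        if not profiles:
            return [0.0] * read_length
        return [sum(column) / len(profiles) for column in zip(*profiles)]

    return mean(mismatch_profiles), mean(match_profiles)


# Toy data, invented for illustration only.
toy_sites = {
    "site1": [(0, True)] * 5 + [(2, False), (3, False)],
    "site2": [(0, True)] * 2,  # dropped: fewer than five mismatching reads
}
mismatch_profile, match_profile = average_position_profile(toy_sites, read_length=4)
```

In this toy example every mismatching read carries the site at its first position, while the matching reads are spread over other positions. The sketch only computes the two profiles; how to interpret a difference between them is a separate question.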


Identifying targets of natural selection in human and dog evolution

Over the course of the past year or so, I’ve been working (with Jonathan Pritchard) on a statistical method for learning about the history of a set of populations from genetic data. Much of this work is described in a paper we recently made available as a preprint [1]. However, as many readers will know, writing a paper involves deciding which results are important to the main point (and worth fleshing out in detail), and which aren’t. In this post, I’m going to describe some results and thoughts that didn’t quite make the cut, but which I think merit a small note. In particular, I’m going to discuss how having a demographic model for a large number of populations might be used to identify genes important in adaptation, and describe results from humans and dogs.


Imagine you have genome-wide genetic data (from SNP arrays, genome sequencing, or whatever) from a number of populations in a species. A common way to visualize the relationship between your populations is to use a tree. For example, below I’ve built a tree of the 53 human populations from the Human Genome Diversity Panel (using the data from Li et al. [2]).


Maximum likelihood tree of 53 human populations built using TreeMix.


Review of the Lumigenix “Comprehensive” personal genome service

This is the first of a new format on Genomes Unzipped: as we acquire tests from more companies, or get data from others who have been tested, we’ll post reviews of those tests here. The aim of this series is to help potential genetic testing customers to make an informed decision about the products on the market. We’re still tweaking the format, so if you have any suggestions regarding additional analyses or areas that should be covered in more detail, let us know in the comments.


Lumigenix is a relative newcomer to the personal genomics scene: the Australian-based company launched back in March this year, offering a SNP chip-based genotyping service similar in concept to those provided by 23andMe, deCODEme and Navigenics.

The company kindly provided Genomes Unzipped with 12 free “Comprehensive” kits, which provide genotypes at over 700,000 positions in the genome, to enable us to review their product. We note that the company offers several other services, including a lower-priced “Introductory” test that covers fewer SNPs, and whole-genome sequencing for the more ambitious personal genomics enthusiast. This review should be regarded as entirely specific to the Comprehensive test.

Size matters, and other lessons from medical genetics

Size really matters: prior to the era of large genome-wide association studies, the large effect sizes reported in small initial genetic studies often dwindled towards zero (that is, an odds ratio of one) as more samples were studied. Adapted from Ioannidis et al., Nat Genet 29:306-309.

[Last week, Ed Yong at Not Exactly Rocket Science covered a paper positing an association between a genetic variant and an aspect of social behavior called prosociality. On Twitter, Daniel and Joe dismissed this study out of hand due to its small sample size (n = 23), leading Ed to update his post. Daniel and Joe were then contacted by Alex Kogan, the first author of the study in question. He kindly shared his data with us, and agreed to an exchange here on Genomes Unzipped. In this post, we expand on our point about the importance of sample size; Alex’s reply is here.

Edit 01/12/11 (DM): The original version of this post included language that could have been interpreted as an overly broad attack on more serious, well-powered studies in psychiatric disease genetics. I’ve edited the post to reduce the possibility of collateral damage. To be clear: we’re against over-interpretation of results from small studies, not behavioral genetics as a whole, and I apologise for any unintended conflation of the two.]

In October of 1992, genetics researchers published a potentially groundbreaking finding in Nature: a genetic variant in the angiotensin-converting enzyme gene (ACE) appeared to modify an individual’s risk of having a heart attack. This finding was notable at the time for the size of the study, which involved a total of over 500 individuals from four cohorts, and the effect size of the identified variant–in a population initially identified as low-risk for heart attack, the variant had an odds ratio of over 3 (with a corresponding p-value less than 0.0001).

Readers familiar with the history of medical association studies will be unsurprised by what happened over the next few years: initial excitement (this same polymorphism was associated with diabetes! And longevity!) was followed by inconclusive replication studies and, ultimately, disappointment. In 2000, 8 years after the initial report, a large study involving over 5,000 cases and controls found absolutely no detectable effect of the ACE polymorphism on heart attack risk. In the meantime, the same polymorphism had turned up in dozens of other association studies for a wide range of traits, from obstetric cholestasis to meningococcal disease in children, virtually none of which have ever been convincingly replicated.
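
To make the sample-size point concrete, here is a small sketch that computes an odds ratio and an approximate 95% confidence interval from a 2x2 table of carriers and non-carriers among cases and controls. The counts are invented for illustration and are not taken from the ACE studies; the interval uses the standard normal approximation on the log odds ratio, which is only a rough guide when counts are small.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.96):
    """Odds ratio with an approximate 95% CI (normal approximation on log OR)."""
    odds_ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)
    standard_error = math.sqrt(
        1 / case_carriers + 1 / case_noncarriers
        + 1 / control_carriers + 1 / control_noncarriers)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * standard_error)
    upper = math.exp(log_or + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts: the same proportions at two very different sample sizes.
small_study = odds_ratio_ci(12, 8, 8, 12)          # 40 people in total
large_study = odds_ratio_ci(1200, 800, 800, 1200)  # 4,000 people in total

for label, (estimate, lower, upper) in [("small", small_study),
                                        ("large", large_study)]:
    print(f"{label}: OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Both toy studies give the same point estimate, but the interval for the small one includes an odds ratio of one (no effect), while the interval for the large one does not. A large odds ratio from a small study is therefore weak evidence on its own.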

Revisiting RNA-DNA sequence differences

A few months ago, I discussed a paper by Li and colleagues reporting a large number of sequence differences between mRNA and DNA from the same individual [1]. While some such differences are expected due to known mechanisms of RNA editing (e.g. A->I editing, see [2]), Li et al. reported an astonishingly high number of them, including thousands of events inconsistent with any known regulatory mechanism. These results implied at least one, and probably many, new mechanisms of gene regulation, and called into question some basic assumptions in molecular biology.

An alternative explanation for the observations of Li et al. is less exciting–imagine two genes with similar (but not identical) sequences, which produce similar (but not identical) mRNAs. If you accidentally attributed both mRNA sequences to the same gene, you could erroneously conclude that one of the two sequences arose via RNA editing of the other. According to a new paper in PLoS One by Schrider and colleagues [3], this banal artifact accounts for the majority of the reported RNA-DNA sequence differences in Li et al.

Schrider et al. show that RNA-DNA mismatches are enriched in genes with close paralogs or copy number variants, both of which are consistent with the technical artifact mentioned above. However, their most striking result is that, at many of the putative RNA editing sites, the “edited” base from the mRNA is actually present in genomic DNA. To show this, Schrider et al. took advantage of the fact that low-coverage DNA sequencing data is available for the individuals used in the Li et al. study. They searched through these data to find genomic sequences matching the “edited” mRNA form. If these sites were truly due to RNA editing, they shouldn’t find any. Instead, at ~75% of the tested sites, they could find a genomic match to the “edit” in at least one individual. There are some potential complications with the interpretation of this number (as they note, the genomic data could include sequencing errors that happen to be the same base as the “edit”), but this observation strongly suggests that a majority of the sites identified by Li et al. are false positives due to this single technical issue.
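
The logic of this check can be sketched in a few lines. This is an illustration only, not the pipeline used by Schrider et al.: the function name, the input layout (bases observed in DNA reads, per site and per individual), and the toy data are all assumptions.

```python
def fraction_with_genomic_support(edit_sites, dna_bases):
    """Fraction of putative edit sites whose "edited" base appears in DNA reads.

    edit_sites: dict mapping site ID -> "edited" base seen in the mRNA.
    dna_bases:  dict mapping site ID -> {individual: string of bases observed
                in that individual's DNA reads at the site} (hypothetical layout).
    A site counts as supported if at least one individual has a matching DNA read.
    """
    if not edit_sites:
        return 0.0
    supported = 0
    for site, edited_base in edit_sites.items():
        per_individual = dna_bases.get(site, {})
        if any(edited_base in bases for bases in per_individual.values()):
            supported += 1
    return supported / len(edit_sites)


# Toy data, invented for illustration only.
toy_edit_sites = {"chr1:100": "G", "chr2:200": "T", "chr3:300": "C", "chr4:400": "A"}
toy_dna_bases = {
    "chr1:100": {"ind1": "AAAG", "ind2": "AAAA"},  # "G" seen in ind1: supported
    "chr2:200": {"ind1": "CCCC", "ind2": "CCTC"},  # "T" seen in ind2: supported
    "chr3:300": {"ind1": "GGGG"},                  # no "C" in the DNA reads
    # chr4:400 has no DNA coverage at all in this toy example
}
print(fraction_with_genomic_support(toy_edit_sites, toy_dna_bases))
```

As noted above, a single matching DNA read could also be a sequencing error, so a stricter version might require two or more supporting reads before counting a site.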

[1] Li et al. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science. doi:10.1126/science.1207018

[2] Levanon et al. (2004) Systematic identification of abundant A-to-I editing sites in the human transcriptome. Nature Biotechnology. doi:10.1038/nbt996

[3] Schrider et al. (2011) Very Few RNA and DNA Sequence Differences in the Human Transcriptome. PLoS One. doi:10.1371/journal.pone.0025842

The week that I worried I had a rare genetic disease

I recently had a series of moderately unpleasant health problems, which eventually led to my being tested for a rare, and potentially very serious, genetic disease (for worried parties: the test was negative). I thought I would share this anecdote because, first, it’s the only time I’ve wished I had more genetic information about myself in a medical setting, and second, because it illustrates the sorts of gaps in medical knowledge that could be aided by routine genome sequencing.

