Why prediction is a risky business

(This is an extended version of a short piece written as part of a series organized by the excellent Mary Carmichael at Newsweek. Readers eager for more detail on the statistics behind risk prediction should read Kate’s excellent discussion posted yesterday.)

In 2003 Francis Collins, having just led the Human Genome Project to completion, made a prediction: within ten years, “predictive genetic tests will exist for many common conditions” and “each of us can learn of our individual risks for future illness”. The deadline for his prophecy is fast approaching, but how close are we to realizing his vision of being able to get a read-out of disease risk from a person’s DNA?

To evaluate this question, it’s helpful to look back at the state of human disease genetics before the Human Genome Project had even begun. Geneticists had become very efficient at pinpointing the roots of so-called “monogenic” disorders (e.g. sickle cell anemia, cystic fibrosis) where a rare defect in a single gene causes disease. These discoveries provided both insight into the biology of these diseases and highly predictive genetic tests. For instance, the disrupted gene in sickle cell anemia, called HBB, plays a key role in oxygen transport by red blood cells, and the mutation which causes the disease is routinely used as a diagnostic test in at-risk babies in hospitals around the world.

By contrast, the more complex conditions that Collins hopes to predict (such as diabetes or multiple sclerosis) aren’t caused by a catastrophic problem with a single gene, but are instead subtly influenced by a combination of many different genes, as well as environmental factors such as diet and exercise. Progress in understanding the genetic part of that equation has accelerated rapidly in the last four years as genome-wide association studies (GWAS) have identified hundreds of locations in the human genome which influence a wide variety of diseases and traits. Much like the monogenic examples described above, the specific genes associated with each disease have told us a great deal about the underlying biology of disease. For example, the most recent GWAS of type 2 diabetes highlighted a previously unsuspected mechanism: many associated genes are involved in regulating the cell cycle (the fundamental process by which cells grow, replicate and divide throughout our lives).

Unlike monogenic disorders, however, the predictive power of the variants discovered by GWAS is generally very poor. These gene variants barely nudge someone’s overall risk, typically increasing it by a factor of 1.1–1.5. Such tiny effects can only be found by studying tens of thousands of individuals, a fact that is critical to bear in mind when interpreting these findings in light of one person’s disease risk: statistically significant association in a population does not translate into meaningful individual prediction. For instance, GWAS have found 38 genes affecting type 2 diabetes, but these only explain about 10% of its observed heritability. This means that current prediction algorithms based solely on these genes are missing the majority of relevant genetic information, as well as all the environmental factors! The combination of small genetic effects documented to date with the lack of key environmental information severely hampers the statistical models used to predict genetic disease risk.

Genetic risk prediction of these conditions is further clouded by the difficulty in translating results from the scientific literature to something more relevant to an individual. GWAS typically report something called a “relative risk,” which measures the increased chance someone with a particular genetic variant has of getting sick compared to the background rate of that disease in his community. Translating this information to a meaningful personal prediction can be tricky, because the background rate can vary widely around the world. If an individual isn’t well matched to the background in a study, basing his personal predictions on that study could yield highly inaccurate results. Furthermore, someone’s interpretation of a given relative risk could change dramatically depending on the underlying population risks for different diseases: a two-fold increase in predicted risk of multiple sclerosis would be a rounding error to most people (a change from 0.1% to 0.2%) but the same effect size for diabetes would represent an alarming increase from 20% to 40% lifetime risk.
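To make that arithmetic concrete, here is a minimal sketch in Python. It assumes, as a simplification, that a person's absolute risk is just the background rate multiplied by the relative risk; the function name `absolute_risk` is made up for this illustration, and the background rates are the ones quoted above.

```python
def absolute_risk(baseline_risk: float, relative_risk: float) -> float:
    """Scale a background (population) risk by a relative risk.

    Simplifying assumption: absolute risk = baseline * relative risk,
    capped at 1.0 because a probability cannot exceed 100%.
    """
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must be a probability between 0 and 1")
    if relative_risk < 0.0:
        raise ValueError("relative_risk must be non-negative")
    return min(1.0, baseline_risk * relative_risk)


# Background lifetime risks taken from the examples in the text.
baselines = {"multiple sclerosis": 0.001, "type 2 diabetes": 0.20}

for disease, baseline in baselines.items():
    # The same two-fold relative risk is applied to both diseases.
    risk = absolute_risk(baseline, relative_risk=2.0)
    change = (risk - baseline) * 100  # change in percentage points
    print(f"{disease}: {baseline:.1%} -> {risk:.1%} (+{change:.1f} points)")
```

The same two-fold relative risk adds a tenth of a percentage point for multiple sclerosis but twenty points for diabetes, which is why a relative risk means little until the background rate is known.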

Nevertheless, there is some hope for predicting our risk of getting sick from our genes. Genes have already been discovered for some traits, such as severe adverse reactions to certain drugs, which are essentially monogenic, and tests for these are already used clinically. There are also many types of genetic risk factor which are invisible to GWAS technologies (such as low frequency variants of intermediate effect size), and the rapid decrease in the cost of sequencing a person’s whole genome is likely to unleash a new wave of discoveries in coming years. These advances could be combined with prediction models which incorporate non-genetic information, or which are used in conjunction with specific symptoms to aid diagnosis. The clock is ticking, but time hasn’t quite run out for Collins’ prediction about prediction.

5 Responses to “Why prediction is a risky business”

Jean M
03/08/2010 at 20:33

Excellent post. Complexity turned into clarity.
Nick E
04/08/2010 at 06:02

Thanks for the post Jeff. I’ve got two small complaints, though:

“statistically significant association in a population does not translate into meaningful individual prediction. For instance, GWAS have found 38 genes affecting type 2 diabetes, but these only explain about 10% of its observed heritability.”

The fact that a marker only explains a small percentage of heritability does _not_ mean that it necessarily doesn’t translate into meaningful individual prediction. For example, BRCA variants IIRC only explain 5% or so of breast cancer risk; would you really argue that they aren’t meaningful for individuals? And if not, then 10 SNPs with 1.5 ORs have a BRCA-like effect in aggregate (making the numbers up, but you get the point), so what’s the difference?

Second, “GWAS typically report something called a “relative risk,” which measures the increased chance someone with a particular genetic variant has of getting sick compared to the background rate of that disease in his community.”

I’ve never seen a GWAS reporting a relative risk instead of an odds ratio (relative risks aren’t appropriate for case-control studies, etc). And translating odds ratios to absolute risks is indeed tricky, but if the only barrier is getting good prevalence numbers, I think we can agree that this problem is easier than most of the other problems we face in this field.
Jeff Barrett
04/08/2010 at 08:10

@Nick, You make two excellent points.

Regarding the high individual risk prediction from genetic models like BRCA1, that’s what I was hinting at in my concluding paragraph. As I’m sure you know, sequencing-based studies will hopefully turn up variants with higher penetrance than GWAS have (though with lower population frequency); these will provide more meaningful risk prediction to the small number of individuals carrying each specific mutation.

As you further note, however, since these mutations are individually rare they won’t contribute much to the average prediction made across a population. If we find lots of these things, though, then in aggregate they could be the means by which personalized genomic medicine truly materializes.

Regarding relative risks vs odds ratios, you’re completely correct. I elided the difference for the sake of simplicity. For low disease prevalences the two should also be approximately equivalent. Speaking of prevalences, I think it’s harder than it might seem at first to get good estimates! Mark Henderson at the Times had a great piece (sadly behind a paywall, now) last weekend about how two different DTC companies gave him either 2% or 36% risk for glaucoma. They had identical relative risk predictions, but wildly different prevalences in their reference populations.