Archive for the 'Background' Category

Society and the personal genome

Those of us involved in genomics research spend a lot of time thinking about how scientific and technological developments might influence personal genomics. For instance, does the falling cost of sequencing mean that medically useful personal genomics will likely be based on sequence rather than genotype data? (Yes.)

At the Sanger Institute we’ve recently launched (along with our friends at EBI) a project to look more deeply at a question which is less often on the lips of genomics boffins: “How does genomics affect us as people, both individually and in communities?” Because of the obvious resonance with Genomes Unzipped, it should come as no surprise that many of us (including myself, Daniel and Luke) have been intimately involved in this initiative.

The actual line-up of events has been diverse, and a lot of fun. We’ve had two excellent debates, including one between Ewan Birney and Paul Flicek on the value, or lack thereof, of celebrity genomes (covered in more detail here). A poet, Fiona Sampson, spent some time on campus and we’ve commissioned a book of poetry from her. This one raised some eyebrows, but I have to say that talking to her has given me some brand new ways of thinking about my own work. We’re also working on a more interactive project in the hope of making personal genomics a bit more personal. Stay tuned.

At odds with disease risk estimates

The first thing I did when I received my genotyping results from 23andMe was log on to their website and take a look at my estimated disease risks. For most people, these estimates are one of the primary reasons for buying a direct-to-consumer (DTC) genetics kit. But how accurate are these disease risk estimates? How robust is the information that goes into calculating them? In a previous post I focused on how odds ratios (the ratio of the odds of disease if allele A is carried as opposed to allele B) can vary across different populations, environments and age groups and, as a consequence, affect disease risk estimates. It turns out that even if we forget about these concerns for a moment, getting an accurate estimate of disease risk is far from straightforward. One of the primary challenges is deciding which disease loci to include in the risk prediction, and in this post I will investigate the effect this decision can have on risk estimates.

To help me in my quest, I will use ulcerative colitis (UC) as an example throughout the post, estimating Genomes Unzipped members’ risk for the disease as I go. Ulcerative colitis is one of the two common forms of autoimmune inflammatory bowel disease, and I have selected it not on the basis of any special properties (either genetic or biological) but because I am familiar with the genetics of the disease, having worked on it extensively.

The table below gives our ulcerative colitis risks according to 23andMe. The numbers in the table represent the percentage of people 23andMe would expect to suffer from UC given our genotype data (after taking our sex and ethnicity into account). The colours highlight individuals who fall into 23andMe’s “increased risk” (red) or “decreased risk” (blue) categories, based on comparisons with the average risk (males: 0.77%; females: 0.51%). As far as I am aware, none of us actually suffers from UC.
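To make the arithmetic behind a number like that concrete, here is a minimal sketch of the kind of calculation a DTC company could perform under a multiplicative model, where each copy of a risk allele scales the odds of disease by its odds ratio. The loci, odds ratios, frequencies and genotypes below are invented for illustration; this is not 23andMe’s exact method.

    def risk_to_odds(p):
        return p / (1 - p)

    def odds_to_risk(o):
        return o / (1 + o)

    # Hypothetical loci: (per-allele odds ratio, risk allele frequency, my allele count)
    loci = [
        (1.30, 0.40, 2),
        (1.15, 0.25, 0),
        (0.85, 0.60, 1),  # OR < 1: carrying this allele is protective
    ]

    avg_risk = 0.0077  # average male UC risk quoted above

    odds = risk_to_odds(avg_risk)
    for odds_ratio, freq, count in loci:
        # Scaling by OR ** (count - 2 * freq) anchors the average person
        # (who carries 2 * freq risk alleles) at the average risk.
        odds *= odds_ratio ** (count - 2 * freq)

    print(f"Estimated risk: {odds_to_risk(odds):.2%}")  # about 1.0% for these made-up inputs

The point relevant to this post falls straight out of the loop: every locus you include (or exclude) multiplies the odds again, so two perfectly defensible choices of loci can place the same person on opposite sides of the “increased risk” line.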
Continue reading ‘At odds with disease risk estimates’

Estimating heritability using twins

Last week, a post went up on the Bioscience Resource Project blog entitled The Great DNA Data Deficit. This is another in a long string of “Death of GWAS” posts that have appeared over the last year. The authors claim that because GWAS has failed to identify many “major disease genes”, i.e. high-frequency variants with large effects on disease, it was therefore not worthwhile; this is all old stuff that I have discussed elsewhere (see also my “Standard GWAS Disclaimer” below). In this case, the authors argue that the genetic contribution to complex disease has been massively overestimated, and that in fact genetics does not play as large a part in disease as we believe.

The one genuinely new thing about this article is that the authors actually look at the foundation for beliefs about missing heritability: the studies of identical and non-identical twins from which we get our estimates of the heritability of disease. I approve of this: I think all those who are interested in the genetics of disease should be fluent in the methodology of twin studies. However, in this case, the authors come to the rather odd conclusion that heritability measures are largely useless, based on a small statistical misunderstanding of how such studies are done.

I thought I would use this opportunity to explain, in relative detail, where our estimates of heritability come from, why they are generally well measured and robust, and what real issues need to be considered when interpreting twin study results. This post is going to contain a little bit of maths, but don’t worry if it scares you a little; you only really need to get the gist.
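For a taste of that gist, the classic back-of-the-envelope calculation (Falconer’s formula) compares trait correlations between identical (MZ) twins, who share essentially all of their segregating genetic variation, and non-identical (DZ) twins, who share half of it on average. The correlations below are invented, and real studies fit the full ACE model by maximum likelihood rather than taking these simple differences:

    # Hypothetical twin correlations for some quantitative trait
    r_mz = 0.80  # correlation between identical (MZ) twin pairs
    r_dz = 0.50  # correlation between non-identical (DZ) twin pairs

    # Falconer-style ACE decomposition of trait variance
    h2 = 2 * (r_mz - r_dz)  # A: additive genetic variance (heritability)
    c2 = 2 * r_dz - r_mz    # C: shared (family) environment
    e2 = 1 - r_mz           # E: unshared environment, incl. measurement error

    print(f"h2 = {h2:.2f}, c2 = {c2:.2f}, e2 = {e2:.2f}")
    # -> h2 = 0.60, c2 = 0.20, e2 = 0.20 for these made-up numbers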
Continue reading ‘Estimating heritability using twins’

The cell is a messy place: understanding alternative splicing with RNA sequencing

Though this site is largely dedicated to discussions of personal genomics, I’d like to use this post to discuss some of my recent work (done with Athma Pai, Yoav Gilad, and Jonathan Pritchard) on mRNA splicing. Our paper, in which we argue that splicing is a relatively error-prone and noisy process, has just been published in PLoS Genetics [1].

Continue reading ‘The cell is a messy place: understanding alternative splicing with RNA sequencing’

Getting even with the odds ratio

In the recent report from the US Government Accountability Office on direct-to-consumer genetic tests, much was made of the fact that risk predictions from DTC genetic tests may not be applicable to individuals from all ethnic groups. This observation was not new to the report – it has been commented on by numerous critics ever since the inception of the personal genomics industry.

So, why does risk prediction accuracy vary between individuals and what can be done to combat this? Are the DTC companies really to blame?

To explore these questions it is first necessary to understand what is meant by the odds ratio (OR). In genetic case-control association studies the OR typically represents the ratio of the odds of disease when allele A is carried to the odds when allele B is carried. All else being equal, genetic loci with a higher OR are more informative for disease prediction, so getting an accurate estimate is extremely important if prediction underpins your business model. However, getting an accurate estimate of the OR is far from easy, because many, often unmeasured, factors can cause OR estimates to vary. In this post I will try to break down the concept of a single, fixed odds ratio for a disease association, and highlight a number of factors that can cause odds ratios to vary, using examples from the scientific literature.
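For readers who want the mechanics: in a case-control study the OR is estimated from a 2×2 table of allele counts, and its uncertainty from the standard error of the log odds ratio. A minimal sketch with invented counts:

    import math

    # Hypothetical allele counts:  (allele A, allele B)
    cases = (900, 600)
    controls = (800, 800)

    a, b = cases
    c, d = controls

    odds_ratio = (a * d) / (b * c)

    # Approximate 95% confidence interval via the SE of log(OR)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

    print(f"OR = {odds_ratio:.2f} (95% CI {lo:.2f}-{hi:.2f})")
    # -> OR = 1.50 (95% CI 1.30-1.73)

Even this toy version shows why estimates wander: the interval depends directly on the counts, so the same locus typed in a different population or a smaller sample can legitimately yield a different OR.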

Continue reading ‘Getting even with the odds ratio’

Detecting positive natural selection from genetic data

As humans expanded out of Africa into the rest of the world, they adapted to a whole host of new habitats, pathogens, and food sources. In recent years, there has been an explosion of interest in identifying the specific genetic loci underlying these adaptations using whole genome genotyping (and now sequencing). In this post, I’ll outline some of the basic principles of how these methods work.
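As a flavour of those principles: one of the simplest signals of local adaptation is an allele whose frequency differs between populations far more than drift alone would predict, which can be summarized per SNP with F_ST. Here is a toy version of Wright’s F_ST on invented frequencies (real scans use more careful estimators, such as Weir and Cockerham’s, alongside haplotype-based statistics):

    def fst(p1, p2):
        """Wright's F_ST for one biallelic SNP in two equal-sized populations."""
        p_bar = (p1 + p2) / 2
        h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, populations pooled
        h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
        return (h_total - h_within) / h_total

    # Hypothetical allele frequencies in two populations
    print(fst(0.50, 0.55))  # ~0.0025: typical genome-wide background
    print(fst(0.10, 0.90))  # ~0.64: an outlier worth a closer look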

Continue reading ‘Detecting positive natural selection from genetic data’

Basics: Second-Generation Sequencing

This is an edited repost of a year-old article from my blog Genetic Inference. It explains how state-of-the-art Second Generation sequencing works, and how it is being used to sequence thousands of genomes per day. I also try to explain some of the distinctions between First, Second and Third Generation sequencing.

This post follows on from an even older post that explained First Generation sequencing: the technology that was used in the Human Genome Project.

Recap: What are we trying to do?

In a previous post, we saw how DNA is made up of little strings of nucleotides, and we used different shapes to represent the different bases (A = triangle, C = diamond, G = circle, T = pentagon). For instance, a circle-diamond-triangle-pentagon string is GCAT.

We looked at how the DNA polymerase enzyme can be used to amplify DNA, using the Polymerase Chain Reaction (PCR), and how we can determine the sequence of DNA using ddNTPs: nucleotides that, when incorporated into DNA, stop the polymerase working.

In First Generation (Sanger) sequencing, we run a PCR reaction in the presence of a bunch of ddNTPs, with each of the four bases dyed a different colour. We then measure the length and colour of the resulting fragments of DNA, and use that to work out the sequence; a bit of DNA 35 base pairs long ending in a blue ddNTP tells us that the sequence has a “C” at the 35th position.
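In code, that read-out step is essentially a sort: order the fragments by length and read off the colours. A toy reconstruction, with invented fragments and an arbitrary dye-to-base mapping:

    # Hypothetical Sanger read-out: (fragment length, dye colour) pairs,
    # one per terminated fragment, in whatever order they leave the machine.
    fragments = [(3, "green"), (1, "red"), (4, "blue"), (2, "yellow"), (5, "green")]

    base_for_colour = {"red": "A", "blue": "C", "green": "G", "yellow": "T"}

    # A fragment n bases long ending in colour X means position n is base X,
    # so sorting the fragments by length spells out the sequence.
    sequence = "".join(base_for_colour[colour] for _, colour in sorted(fragments))
    print(sequence)  # -> ATGCG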

The problem with this method is that it requires a lot of space; you need a place to run the reaction, and then you need a capillary tube or a gel to determine the length of the DNA. As a result, you could only run perhaps a hundred of these reactions at any one time. There are 3 billion base pairs of DNA in the human genome, meaning about 6 million 500-base pair fragments of DNA; it would take a very long time to sequence all of these if you had to do them one hundred at a time.

Second Generation sequencing techniques overcome this restriction by finding ways to sequence the DNA without having to move it around. You stick the bit of DNA you want to sequence in a little dot, called a cluster, and you do the sequencing there; as a result, you can pack many millions of clusters into one machine. Sequencing a strand of DNA while keeping it held in place is tricky, and requires a lot of cleverness. I’ll explain how Illumina’s Second Generation technology achieves this, as it is the most similar to Sanger sequencing.

Continue reading ‘Basics: Second-Generation Sequencing’

Dude, where are my copy number variants?

The genome scans currently offered by major personal genomics companies provide information about only one kind of genetic variation: single nucleotide polymorphisms, or SNPs. However, SNPs are just one end of a size spectrum of variation, reaching all the way up to large duplications or deletions of DNA known as copy number variants (CNVs). Over the last decade we have learned that CNVs are a surprisingly common form of variation in humans, and they span a formidable chunk of the genome. While there are about 3-3.5 million bases of variation due to SNPs within an individual genome (in, say, a typical person of European descent), there are at least 50-60 million variable bases due to CNVs.

For the personal genome enthusiast with their SNP chip data from 23andMe or deCODEme in hand, there are two important practical questions: (1) can I learn about my CNVs using SNP chip data; and (2) will that information be useful?
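As a flavour of question (1): CNV detection from SNP chip data usually starts from per-probe intensities rather than genotypes. The heavily simplified sketch below flags runs of probes whose log R ratio (total intensity relative to a two-copy reference) is consistently low; real callers such as PennCNV instead fit hidden Markov models over both log R ratio and B allele frequency. The data here are invented:

    # Hypothetical log R ratios along one chromosome (0 means two copies;
    # around -0.5 suggests a deletion, around +0.3 a duplication).
    lrr = [0.02, -0.01, 0.05, -0.55, -0.48, -0.60, -0.52, 0.03, 0.01, 0.00]

    def call_deletion_runs(lrr, threshold=-0.3, min_probes=3):
        """Flag runs of at least min_probes consecutive probes below threshold."""
        calls, start = [], None
        for i, value in enumerate(lrr):
            if value < threshold:
                start = i if start is None else start
            else:
                if start is not None and i - start >= min_probes:
                    calls.append((start, i - 1))
                start = None
        if start is not None and len(lrr) - start >= min_probes:
            calls.append((start, len(lrr) - 1))
        return calls

    print(call_deletion_runs(lrr))  # -> [(3, 6)]: a putative deletion over probes 3-6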

Continue reading ‘Dude, where are my copy number variants?’

How widespread personal genomics could benefit molecular biology

While the majority of the buzz surrounding personal genomics has to do with prediction of disease risk and other medical applications, there’s clearly the potential for these sorts of technologies to influence basic science as well. In this post, I’ll lay out one such potential application: the use of personal genomics in understanding basic molecular biology, in particular the biology of transcriptional regulation in humans.

Continue reading ‘How widespread personal genomics could benefit molecular biology’

Why prediction is a risky business

(This is an extended version of a short piece written as part of a series organized by the excellent Mary Carmichael at Newsweek. Readers eager for more detail on the statistics behind risk prediction should read Kate’s excellent discussion posted yesterday.)

In 2003 Francis Collins, having just led the Human Genome Project to completion, made a prediction: within ten years, “predictive genetic tests will exist for many common conditions” and “each of us can learn of our individual risks for future illness”. The deadline of his prophecy is fast approaching, but how close are we to realizing his vision of being able to get a read-out of disease risk from a person’s DNA?
Continue reading ‘Why prediction is a risky business’

