Personal genomics: the importance of sequencing

Those of us who live and breathe genomics get very excited about sequencing DNA. Genomes Unzipped will be sure to cover the constant battles between sequencing companies to produce complete and accurate genome sequences for low prices; from our point of view, ‘low prices’ means affordable for consumers, or less than £1000 or so for a full sequence of an individual.

But why do we care about sequencing? You can go to a company like 23andMe and get a genotyping chip done; this won’t give you your full DNA sequence, but it will give you information about half a million sites on your genome, at the much lower cost of around £300. The sites picked for these chips are ones that are most variable in the population, and those that are well-studied. Why do we care about the rest? What more does sequencing give you?

What Do We Miss?
There are lots of types of variation that can occur in the 3 billion bases of the human genome. The simplest type, and the one that genotyping chips tend to look at, is the Single Nucleotide Polymorphism (SNP); in these simple mutations a single base is changed (an A changes to a C, for instance). The database dbSNP knows of around 15 million such variations, and any two individuals will differ at around 3 million single sites (more, if you’re from the more variable African populations).

23andMe can probably identify around 250 thousand of these differences (fewer if you’re not of Western European descent). New genotyping technologies, combined with statistical techniques like genotype imputation, can look at several million SNPs; these techniques may allow (more expensive) genotyping to find around half of your 3 million variations. But even all-bells-and-whistles genotyping is going to miss most of the data out there.
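As a rough check on those proportions, the arithmetic can be sketched in a few lines of Python. The numbers are the approximate figures quoted above, not measurements, and the variable names are purely illustrative.

```python
# Approximate figures quoted in the text (rough estimates, not measured values).
variants_per_person = 3_000_000   # single-base differences between two individuals
chip_detected = 250_000           # differences a 23andMe-style chip might identify
imputed_fraction = 0.5            # "around half" with denser chips plus imputation

# Share of single-base differences seen by the chip alone.
chip_fraction = chip_detected / variants_per_person

# Differences found (and still missed) with the more expensive approach.
imputed_detected = int(variants_per_person * imputed_fraction)
imputed_missed = variants_per_person - imputed_detected

print(f"Chip alone: {chip_fraction:.1%} of single-base differences")
print(f"Chip + imputation: about {imputed_detected:,} found, {imputed_missed:,} missed")
```

This only counts single-base differences; indels and structural variants, discussed next, are not included in either total.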

Of course, single-base mutations are not the only source of variation; neither are they the most interesting. Other types of variation are even less likely to be covered by genotyping. Each individual will have around 800,000 small insertions or deletions of DNA (called indels), very few of which are well covered by genotyping chips. Then there are the larger, potentially very interesting structural variants; thousands of bases or more that have been deleted, inserted, moved around or inverted; each individual will have a few thousand of these, and looking at them in the sort of detail required to figure out exactly what change has occurred is virtually impossible with chips.

When you send your DNA for sequencing, you get the chance to see a massive chunk of all of these variations. Craig Venter’s super-high quality genome sequence (costing a crazy £45 million on first generation technology) found basically all variation in his genome, including 3.2 million SNPs, 900,000 indels and a range of other things. When Life Technologies sequenced a single African individual using low-cost second-generation sequencing, they found 3.8 million single-base variations, 230 thousand small indels, 565 large insertions or deletions, 91 inversions and a couple of crazier things, like gene fusions and complex rearrangements. The cost of this sort of analysis is currently massive compared to genotyping, but when you are done, you have captured a big proportion of the variations in your own genome.

(Note the difference between Venter’s 900k indels and Life Tech’s 230k indels; this is because 70% of indels are in repetitive regions of the genome that are hard to sequence using second-generation technology. Our hope is that the next batch of technology, third-generation sequencing, will be able to plug this gap. If the whole First/Second/Third-gen stuff doesn’t mean anything to you yet, sit tight; we’ll cover all this in a later post.)
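The indel gap can be checked with some back-of-the-envelope arithmetic. The sketch below assumes, as a deliberate simplification that is not claimed in the text, that second-generation sequencing finds every indel outside repetitive regions and none inside them; the variable names are illustrative.

```python
# Figures quoted in the text above.
venter_indels = 900_000       # indels in Venter's first-generation genome
repetitive_fraction = 0.70    # share of indels lying in repetitive regions
life_tech_indels = 230_000    # indels reported from second-generation sequencing

# Simplifying assumption: only indels outside repetitive regions are detectable.
expected_outside_repeats = round(venter_indels * (1 - repetitive_fraction))

# How far the reported count falls from that crude expectation.
shortfall = expected_outside_repeats - life_tech_indels

print(f"Expected outside repeats: {expected_outside_repeats:,}")
print(f"Reported by Life Tech:    {life_tech_indels:,}")
print(f"Difference:               {shortfall:,}")
```

Under that crude model the expectation is in the same ballpark as the reported 230k, which is consistent with repetitive regions explaining most of the gap.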

Why Does This Matter?
So we miss a lot of data when we settle for genotyping. But why does this matter? What do we fail to learn from chips that we could learn from sequencing?

The thing to understand about genotyping is that it is ultimately reactive, rather than proactive. You can only look for variants that you have seen before. As a result, you miss variants that are rare, that are population-specific, or that come from understudied populations.

Partly, this just means you are missing a lot of data. If you want to do something like ancestry testing (placing yourself on a genetic map of Europe, pinning down exactly what Y chromosome haplotype you have, etc), the more data the better. But more than that, the variants that are missed are likely to be more population-specific. As more and more individuals are sequenced from many populations, the potential for higher-resolution ancestry tracking appears. However, a 23andMe chip just doesn’t have enough variants on it to make full use of this new data.

The reactive nature of genotyping also means that you may have to do it again whenever new discoveries are made. 23andMe will make sure that they cover all the known risk loci for diseases, but what happens when we find new regions of the genome associated with disease? In that case, you have to hope that 23andMe happens to have them well covered on their chip, and if not, you have to wait until they bring out a new chip that covers the new discoveries. This is especially a problem when the disease-associated variants are rare, as most new disease variants will probably be, or are caused by the sorts of variants not well covered by the chip. However, with sequencing, you have most of your genome there already, so any new discoveries can just be looked up on your genome sequence, without needing to go back to the lab to spend more time and money.
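To make the "just look it up" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the positions, genotypes and the `look_up` helper are invented for illustration and do not correspond to any real chip, file format or company's data.

```python
# Hypothetical toy data: positions and genotypes are made up for illustration.
# Variants called from a person's stored genome sequence, keyed by (chromosome, position).
sequenced_genome = {
    ("chr1", 1_000_100): "A/G",
    ("chr2", 5_250_000): "C/C",
    ("chr7", 300_420): "T/del",
}

# Sites that happened to be included on the (hypothetical) genotyping chip.
chip_sites = {("chr1", 1_000_100)}


def look_up(site):
    """Report what each data source can say about a newly reported site."""
    return {
        "on_chip": site in chip_sites,
        # None means no variant was called here in this toy example.
        "in_sequence": sequenced_genome.get(site),
    }


# A newly published disease-associated site that the chip designers never saw.
new_discovery = ("chr7", 300_420)
result = look_up(new_discovery)
print(result)
```

The chip has nothing to say about the new site, while the stored sequence already holds the answer, so no second trip to the lab is needed.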

There is a particular type of variation that genotype chips can never get at, the type of variation that most people will find most interesting: variation that is unique to you, or to your family. If you get sequenced now, about 200,000 single-base variants in your genome will never have been seen before, ever. These are likely to include changes that modify proteins in a unique way, that may make them act differently in your cells. A big proportion of indels and structural variants will be novel, and these can include strange and exotic things: genes that have been swapped around, jumbled up, fused together, or deleted entirely. There may well be stretches of DNA, hundreds of base pairs long or longer, that have never been observed in another human. Regardless of how “useful” these personal oddities are, to be able to look directly at new genomic discoveries that live inside you makes them invaluable.

The Future of Personal Genomics
None of this is supposed to be an argument to fork out the (frankly ridiculous) $20k to get your own genome sequenced from something like Illumina’s personal service. Instead, I want to show that those of us who are interested in investigating our own genomes should be keeping close tabs on the sequencing wars. Illumina and Life Technologies are currently battling to bring the materials cost of a human genome into the low thousands of dollars; Complete Genomics is trying to sequence and analyse an entire genome as a service for $5,000; and Pacific Biosciences, Life Technologies and Oxford Nanopore are bringing out new tech that may change the entire genomics field. Rather than just being esoteric events in the research and business communities, these developments will fundamentally determine if and when personal genomics can transition from a simple chip-based industry to a richer sequencing-based one.

The first image is a colourised screenshot of aligned Illumina reads from the program MAQ, the second is an illustration of a large structural re-arrangement taken from the brilliant (and copyright-free) NHGRI Talking Glossary of Genetic Terms.

13 Responses to “Personal genomics: the importance of sequencing”

  • Gene sequencing technology has advanced dramatically, but our knowledge and ability to interpret this information fall far behind the technology. It is like a four-year-old reading an encyclopaedia. Even if the cost is as low as $1000 per genome, it is still the kind of spending that will take a couple of years to become worthwhile, and by that time cheaper sequencing technology might be available. On the other hand, genetic testing for BRCA mutations costs about $3000, so $1000 seems a very attractive price for a whole genome. Is there any sign that insurance will cover whole genome sequencing?

  • “the (frankly ridiculous) $20k to get your own genome sequenced” – shows how much we’ve moved on over the last couple of years!

  • Whole genome sequencing will be a great opportunity in the coming years, when the cost will be affordable for everyone and we will have more knowledge about human genomics. For the time being, it is still expensive compared to what we know about our genome. Anyway, we are certainly going in this direction!

  • Good post Luke,
    you mentioned 23andMe a lot, what about the other companies?

  • Indels, allelic conversions and rearrangements are likely to be where most of the excitement is; SNPs only get you so far. However, it is not clear to me that we won’t need 4th or beyond generations of sequencing technology to get inexpensive surveys of larger indels and rearrangements. Ironically, major rearrangements were the first genomic patterns to be observed (salivary gland chromosome banding patterns, for example).

  • @Steve Murphy

    You’re right, concentrating overly on 23andMe was a bit blinkered of me, especially given that they use a smaller chip than most companies (it was mostly just because they are cheapest, and thus gives the strongest contrast to the price of sequencing). So, to plug the gap, here is some information on other companies and other, higher density chips:

    deCODEme uses the next grade up of Illumina chip, the Illumina 1M; this will give you more coverage than 23andMe’s 550k+ chip, but not an awful lot more.

    Pathway Genomics, despite using “a chip capable of detecting thousands of genes”, doesn’t release whole-genome data, and I’m not even sure what chip they use.

    Navigenics and SeqWright both use the Affymetrix 6.0, which covers around 900k SNPs, but also contains a low density set of probes for detecting structural variation; this won’t tell you that much, but at least will catch the relatively rare very large (>5kbp) deletions and duplications – probably giving 40 or so CNVs for each person. However, even this won’t really give much resolution on what is going on; all it will tell you is that some sequence is deleted or duplicated somewhere within a ~10kbp window.

    There are some more advanced chips out there (like the NimbleGen 42M) that can more accurately detect many structural variants (catching everything >1kb, perhaps). However, the ~700 variants these pick up will only account for perhaps 30% of the structurally varied sequence in an individual (<1% of the number of structural variants), and they won't pick up any of the entirely novel sequence.

    I'm getting a lot of my data for this from Pang et al, which looked at all the structural variation from the Venter genome, and how much of it could be picked up by various methods.

  • An excellent, concise review of some of the pitfalls of genotyping vs sequencing. I think the next 5 years of sequencing technology development are going to be some of the most exciting (I’ve got my money on Oxford Nanopore personally).

  • The excitement surrounding the value of the sequencing of personal genomes is extremely misplaced in view of the very limited understanding that we presently possess about the ~23,000 genes in the human genome and the proteins that they encode. The phenotype of individuals cannot be simply predicted from the sequences of these genes or even the levels of the mRNA transcripts from these genes. Post-translational modification of proteins plays an equally important role. Gene expression is strongly influenced by the environment, and the traits of humans and other organisms are manifested ultimately at the protein and metabolome levels. The balance between healthy and pathological states is the outcome of extremely complex regulatory systems that operate at the molecular, cellular and organ levels, often with immense redundancy and feedback controls, which permit organisms to survive and operate in changing environments.

    This article nicely points out some of the millions of variations in nucleotide base changes, insertions and deletions in the genomes of people. It will take many decades to sort out which of these are truly meaningful and useful. Less than 3% of the human genome actually encodes proteins and other recognizable RNA elements such as tRNA, rRNA and microRNA. Over 97% of the genome is probably filler or “junk” DNA. This is pretty apparent when one considers the number of nucleotides in humans compared to other species. Humans have about 2.9 billion base pairs (bp) in their genome, whereas the lungfish has 139 billion bp and the crested newt has 18.9 billion bp. Genome size can vary markedly between other organisms too. For example, the fruit fly Drosophila melanogaster has 0.165 billion bp, whereas the flowering plant Fritillaria assyriaca has 124.9 billion bp. The vast majority of genomic differences in people will be found in these non-coding regions of their genomes.

    The current euphoria about genome sequencing, including eventually the genomes of hundreds of thousands of people over the next decade, will lead to a major diversion of funding and research into relatively non-productive directions. This will probably result in even less attention devoted to understanding the roles and interactions of proteins. Regretfully, without this knowledge, it will be impossible to interpret the results of genome sequencing studies. Government and charitable organizations that fund biomedical research should be doing a much better job of coordinating genomics- and proteomics-based studies.

  • @S. Pelech

    Many detailed studies of the cellular and biochemical basis of complex traits in recent years have come from genetic studies (sequencing or genotyping individuals). For instance, much light has been shed on the underlying etiology of inflammatory bowel disease and type I diabetes by the results of genome-wide association studies, and the functional roles of dozens or hundreds of proteins involved in these diseases have been uncovered. We’ve also been given access to higher level understanding, such as which pathways are and are not shared in common between all autoimmune disorders. Many other phenotypes, such as height and metabolism, are following suit.

    Genetic variation, when associated with phenotypic variation, can give a genome and proteome wide glimpse into biology – and this includes glimpses into the mechanism of environmental and non-genetic effects; for example, see the association between FTO and Type II diabetes, which helped us understand the relationship between obesity, overeating, appetite and diabetes.

    Far from being a diversion into non-productive directions, the genetics of complex traits are helping us get a handle on some of the most complicated areas of biology.

  • Geneticist from the East

    The fact remains that not much is known about variation beyond common SNPs. Therefore there is only a marginal gain from a whole genome sequence versus a gene chip with imputation to 3M HapMap SNPs.

    But then if we can do GWAS with whole genome, then this situation might change. However, that might not happen for at least five more years.

  • I think that there is a reasonable possibility that we will have to go to time-series sequencing. We need to see the genome in action. Sequencing may be like insulin testing, or at least it may be for groups looking for what turns what on and when.

  • What do you think of exome sequencing? Is there much to be gained over standard SNP platforms, or is it better to concentrate all efforts on WG sequencing?
    Thanks for this great blog! Regards

  • @ Luke

    Genome sequencing studies of individuals with hereditary diseases that have a profound phenotype can definitely be informative. However, most common diseases arise from complex multi-gene changes and interactions. Randomly sequencing the genomes of large numbers of people with diverse phenotypes is not likely to be any more informative than the more focused genomic studies that have been performed with families with severe diseases.

