This is an edited repost of a year-old article from my blog Genetic Inference. It explains how the state-of-the-art Second Generation sequencing works, and how it is being used to sequence thousands of genomes per day. I also try to explain some of the distinctions between First, Second and Third Generation sequencing.
This post follows on from an even older post that explained First Generation sequencing; the tech that was used in the Human Genome Project.
Recap: What are we trying to do?
In a previous post, we saw how DNA is made up of little strings of nucleotides, and we used different shapes to represent different base pairs (A = triangle, C = diamond, G = circle, T = pentagon). For instance, circle, diamond, triangle, pentagon is GCAT.
We looked at how the DNA polymerase enzyme can be used to amplify up DNA, using the Polymerase Chain Reaction, and how we can determine the sequence of DNA using ddNTPs; nucleotides that, when incorporated into DNA, stop the polymerase working.
In First Generation (Sanger) sequencing, we run a PCR reaction in the presence of a bunch of ddNTPs, with each different base pair dyed a different colour. We then measure the length and colour of the resulting fragments of DNA, and use that to work out the sequence; a bit of DNA 35 base pairs long ending in a blue ddNTP tells us that the sequence has a “C” at the 35th position.
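To make the "length plus colour" idea concrete, here is a minimal sketch in Python. It assumes each terminated fragment has already been measured and is handed to us as a (length, dye colour) pair. The function name `sanger_read` and the full colour table are illustrative assumptions; the only pairing taken from the text above is blue = C.

```python
# Assumed dye colours; only blue = C comes from the example in the text.
DYE_TO_BASE = {"red": "A", "blue": "C", "yellow": "G", "green": "T"}


def sanger_read(fragments):
    """Rebuild a sequence from (fragment length, dye colour) pairs.

    A fragment of length n ending in a given colour tells us the base
    at position n, so sorting by length puts the bases in order.
    """
    base_at_position = {}
    for length, colour in fragments:
        base_at_position[length] = DYE_TO_BASE[colour]
    return "".join(base_at_position[pos] for pos in sorted(base_at_position))


# Fragments can arrive in any order; length alone fixes the position.
fragments = [(3, "red"), (1, "yellow"), (4, "green"), (2, "blue")]
print(sanger_read(fragments))  # GCAT
```

The sketch ignores everything that makes the real measurement hard (noisy lengths, overlapping dye signals), but it shows why you need something like a gel or capillary: the length is what tells you the position.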
The problem with this method is that it requires a lot of space; you need a place to run the reaction, and then you need a capillary tube or a gel to determine the length of the DNA. As a result, you could only run perhaps a hundred of these reactions at any one time. There are 3 billion base pairs of DNA in the human genome, meaning about 6 million 500-base pair fragments of DNA; it would take a very long time to sequence all of these if you had to do them one hundred at a time.
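The arithmetic behind those numbers is quick to check. This sketch uses only the figures quoted above; it ignores the overlap between fragments that a real project would need, so treat the batch count as a lower bound.

```python
GENOME_SIZE_BP = 3_000_000_000   # base pairs in the human genome
FRAGMENT_LENGTH_BP = 500         # length of one Sanger fragment
REACTIONS_PER_BATCH = 100        # reactions you can run at one time

fragments_needed = GENOME_SIZE_BP // FRAGMENT_LENGTH_BP
batches_needed = fragments_needed // REACTIONS_PER_BATCH

print(fragments_needed)  # 6000000
print(batches_needed)    # 60000
```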
Second Generation sequencing techniques overcome this restriction by finding ways to sequence the DNA without having to move it around. You stick the bit of DNA you want to sequence in a little dot, called a cluster, and you do the sequencing there; as a result, you can pack many millions of clusters into one machine. Sequencing a strand of DNA while keeping it held in place is tricky, and requires a lot of cleverness. I’ll explain how Illumina’s Second Generation technology achieves this, as it is the most similar to Sanger sequencing.
Reversible Terminator Sequencing
Just as Sanger sequencing relies on ddNTPs to stop the PCR reaction, Illumina’s reversible terminator sequencing rests on reversible terminator bases (RT-bases). Just like ddNTPs, these bases stop PCR reactions when they are incorporated; they have additional molecules, including a base-specific dye, attached to the standard base, which stop the PCR enzyme adding more bases (A bases have red dye, C bases have blue dye, G yellow and T green):
However, they have an additional, very useful property: there exists a cleavage enzyme that chops all the extra molecules off, and turns the RT-base into a normally functioning nucleotide. This is hugely useful, and gives us a method of sequencing that doesn’t require moving the DNA.
We multiply up the template strand, i.e. the bit of DNA that we are sequencing, and stick on a few bases of ‘adaptor sequence’; this sequence sticks on to complementary bits of DNA stuck to a surface, which holds the DNA in place while we sequence it:
We then flood the DNA with RT-bases:
We also add a polymerase enzyme, which incorporates the RT-base into the new strand that is complementary to the template strand:
We then wash away all the RT-bases, leaving just those that were incorporated into the new strand; we can read off what base this is by looking at the colour of the dye:
In this case the dye is green, meaning that the base at the first position is a T.
Finally, we send in the cleavage enzyme, which cuts off the terminator region and the dye, leaving a normal base pair. We can then start again to sequence the next base pair.
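The whole cycle (flood, incorporate, wash, read the colour, cleave, repeat) can be sketched as a loop. This is a toy model, not Illumina's software: the function name `sequence_by_synthesis` is made up for illustration, it assumes the polymerase never makes a mistake, and it ignores the fact that the two strands of real DNA run in opposite directions.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Dye colours as given in the text: A red, C blue, G yellow, T green.
DYE = {"A": "red", "C": "blue", "G": "yellow", "T": "green"}
BASE_FOR_DYE = {colour: base for base, colour in DYE.items()}


def sequence_by_synthesis(template, cycles):
    """Simulate reading `cycles` bases off one template strand."""
    new_strand = []
    colours_seen = []
    for position in range(min(cycles, len(template))):
        # Flood + polymerase: the complementary RT-base is incorporated,
        # and its terminator stops any further extension this cycle.
        rt_base = COMPLEMENT[template[position]]
        # Wash + camera: only the incorporated base's dye is left to see.
        colour = DYE[rt_base]
        colours_seen.append(colour)
        new_strand.append(BASE_FOR_DYE[colour])
        # Cleavage: dye and terminator are removed, so the next loop
        # iteration can add the next base.
    return "".join(new_strand), colours_seen


read, colours = sequence_by_synthesis("ACGT", 4)
print(colours)  # ['green', 'yellow', 'blue', 'red']
print(read)     # TGCA
```

Note that the read is the new strand, which is the complement of the template; the first colour is green, so the first base read is a T, as in the walkthrough above.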
In a single Illumina machine we have hundreds of millions of these clusters; cameras look at all of these dots and record how they change colour over time, allowing you to determine the sequence of bases of millions of bits of DNA at once. This animation illustrates how the process works over time; the main image shows the base pairs being incorporated into the DNA, and the little box shows what the camera sees; each dot is a reaction, with our reaction circled.
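From the camera's point of view, the job is just to turn a stack of pictures into reads. A rough sketch, assuming each picture has already been reduced to one colour per cluster (the hard image-processing part is skipped entirely, and `call_bases` is a made-up name):

```python
DYE_TO_BASE = {"red": "A", "blue": "C", "yellow": "G", "green": "T"}


def call_bases(images):
    """Turn per-cycle colour snapshots into one read per cluster.

    `images` is a list of cycles; each cycle is a list holding the
    colour seen at each cluster, in a fixed cluster order.
    """
    reads = [""] * len(images[0])
    for cycle in images:
        for cluster_index, colour in enumerate(cycle):
            reads[cluster_index] += DYE_TO_BASE[colour]
    return reads


# Three cycles, three clusters: each column is one cluster over time.
images = [
    ["green", "red", "blue"],
    ["yellow", "red", "green"],
    ["blue", "yellow", "green"],
]
print(call_bases(images))  # ['TGC', 'AAG', 'CTT']
```

Every cluster gains one base per cycle, which is why adding more clusters costs you no extra time.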
This system is exactly what we were looking for. Note that the sequencing method is pretty inefficient; for each base you read, you have to flood the DNA with RT-bases, wash them off again, and use a cleavage enzyme. This is very slow, and in fact it takes about an hour to read each base. However, this doesn’t really matter; each individual bit of DNA may be slow to sequence, but you can sequence millions of DNA fragments at once. In fact, the way we do sequencing these days is to cut up an entire genome, and sequence all the fragments. The real state-of-the-art machines can produce a pretty high quality human genome in around a week.
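A back-of-the-envelope calculation shows why slow-but-parallel wins. Only the hour per base and the 3 billion base pair genome come from this post; the read length and the cluster count are round numbers I have picked for illustration, not the specification of any particular machine.

```python
HOURS_PER_CYCLE = 1              # from the post: about an hour per base
GENOME_SIZE_BP = 3_000_000_000   # from the post: human genome size
READ_LENGTH_BP = 100             # assumption: bases read per cluster
CLUSTERS = 300_000_000           # assumption: "hundreds of millions"

run_hours = HOURS_PER_CYCLE * READ_LENGTH_BP
total_bases = CLUSTERS * READ_LENGTH_BP
coverage = total_bases / GENOME_SIZE_BP  # times each base is read, on average

print(f"{run_hours / 24:.1f} days")  # 4.2 days
print(total_bases)                   # 30000000000
print(coverage)                      # 10.0
```

Under these assumptions a single run of a few days reads every position in the genome about ten times over, even though each individual cluster only produced one base per hour.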
Illumina sequencing is not the only second generation tech, and it has many disadvantages. Firstly, because it takes so long to produce a single base pair, and because the different molecules in the cluster can get out of sync, it is impossible to sequence long bits of DNA. Mostly, the read length is under 100bp, much less than the 500-1000bp that you can get from Sanger sequencing. 454 sequencing is another second generation sequencing method that gets around this: instead of using dyes they use nucleotides that flash when the polymerase adds them to the DNA; they can get read lengths of greater than 500bp, getting close to Sanger sequencing. However, while 454 has a) longer read lengths b) a cooler name and c) a cooler sequencing method, it cannot rival Illumina for sheer amount of DNA sequenced per unit time.
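To get a feel for why clusters going out of sync caps the read length, here is a deliberately crude model. It assumes that in every cycle each molecule independently has a small chance of falling out of step and never recovers; the 1% figure is an assumption chosen for illustration, not a measured value.

```python
import math


def fraction_in_sync(p_slip, cycles):
    """Fraction of molecules in a cluster still in step after `cycles`."""
    return (1 - p_slip) ** cycles


def max_cycles_in_sync(p_slip, min_fraction):
    """Largest number of cycles that keeps at least `min_fraction` in step."""
    return math.floor(math.log(min_fraction) / math.log(1 - p_slip))


# With an assumed 1% slip chance per cycle, the signal fades quickly.
for cycles in (10, 100, 500):
    print(cycles, round(fraction_in_sync(0.01, cycles), 3))

# Cycles you can run before fewer than half the molecules agree.
print(max_cycles_in_sync(0.01, 0.5))  # 68
```

Once most of the molecules in a cluster disagree about which base they are on, the dot becomes a muddle of colours and the read has to stop.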
Secondly, because it is very easy for the polymerase enzyme to add in the wrong RT-base, Illumina sequencing has a relatively high error rate (1-2% per base). ABI’s SOLiD sequencing adds bases in pairs, rather than singly, and thus sequences pairs of bases rather than single bases. E.g. while Illumina will read “GACT” as “G”, then “A”, then “C”, then “T”, SOLiD reads it as “GA”, “AC”, “CT” etc. Because you sequence each base twice, the error rates are much lower (0.1-0.2% per base)*.
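The difference between the two ways of reading is easy to show in code. This sketch only illustrates the overlapping pairs described above; the function names are made up, and the real SOLiD chemistry and its encoding are more involved than this.

```python
def single_base_reads(sequence):
    """Read one base at a time: 'GACT' -> G, A, C, T."""
    return list(sequence)


def overlapping_pair_reads(sequence):
    """Read overlapping pairs: 'GACT' -> GA, AC, CT."""
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]


def times_each_base_is_read(sequence):
    """Count how many pairs cover each position of the sequence."""
    counts = [0] * len(sequence)
    for start in range(len(sequence) - 1):
        counts[start] += 1
        counts[start + 1] += 1
    return counts


print(single_base_reads("GACT"))        # ['G', 'A', 'C', 'T']
print(overlapping_pair_reads("GACT"))   # ['GA', 'AC', 'CT']
print(times_each_base_is_read("GACT"))  # [1, 2, 2, 1]
```

Every base except the two at the ends shows up in two pairs, so a single misread disagrees with its neighbour and can be spotted.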
Next Next Gen Sequencing
The First Generation sequencing technology, automated capillary sequencing machines, changed the way we thought about DNA; it went from something that we could glimpse to something we could sequence en masse. Second Generation sequencing has changed it again, from something we sequence once, to something we sequence again and again. Second gen sequencing has allowed us to re-sequence many human individuals; the 1000 Genomes Project is using 454, SOLiD and Illumina machines to sequence hundreds of individuals. This sort of thing has allowed us to get an idea, not just of what a single genome looks like, but how the genome changes from person to person; we can look at how much variation there is in the genome, how different populations differ in their genome structure, and even what makes a cancer genome different from a healthy genome.
However, second gen sequencing is not without its flaws. While it has got cheap ($10-20k to sequence a human genome these days), it still requires a lot of reagents, a lot of work and a lot of cost (the RT-bases aren’t cheap, and neither are the enzymes used). The low read lengths are still a problem, as they make it hard to work out precisely where in the genome a bit of DNA came from, and it takes a long time to run the machines to completion. New tech is on the horizon that intends to overcome some of these constraints, including the Second Generation Ion Torrent, and the so-called Third Generation sequencers, Pacific Biosciences, Oxford Nanopore and Life Sciences Qdot technology, all of which sequence single molecules of DNA in real-time. Watch this space for more information as this tech develops.
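The read length problem is easiest to see with a toy example. Below, a made-up reference sequence contains the same stretch twice; a short read from that stretch could have come from either place, while a slightly longer read can only have come from one. The sequences and the `find_positions` helper are invented for illustration; real read mappers are far more sophisticated.

```python
def find_positions(reference, read):
    """Return every position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# A toy reference in which "GATTACA" appears twice.
reference = "GATTACAGGCCGATTACATT"

print(find_positions(reference, "GATTACA"))    # [0, 11]
print(find_positions(reference, "GATTACAGG"))  # [0]
```

The real genome is full of repeats much longer than this, which is why longer reads are so attractive.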