Genetic Sequencing

Abstract

James Watson and Francis Crick's 1953 revelation of the molecular mechanism and structure of the gene forever changed humanity's view of life, moving the center of mass of the scientific community to the biological sciences. All organisms of which we are so far aware are programmed by a sequence of chemical components–adenine, thymine, cytosine, and guanine–known as bases, which make up deoxyribonucleic acid (DNA) and its close relative, ribonucleic acid (RNA). DNA is the software code that programs the hardware contained in the chemical soup of living organisms.

Across the two chains that form a DNA double helix, bases pair off predictably, adenine with thymine and cytosine with guanine, and this pairing is what allows the genetic code to be copied with high fidelity. By separating a double helix lengthwise into its two chains and assembling the complement of each chain, it is possible to produce two new identical double helices.
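The copying rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: strands are plain strings, the two chains are written base-for-base against each other (the opposite chemical orientation of the chains is ignored), and the names `PAIR`, `complement`, and `replicate` are invented for this sketch.

```python
# Pairing rules from the text: adenine with thymine, cytosine with guanine.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base partner of a strand (orientation is ignored)."""
    return "".join(PAIR[base] for base in strand)

def replicate(helix):
    """Separate a double helix into its two chains and rebuild a partner for each."""
    first, second = helix
    return [(first, complement(first)), (complement(second), second)]

original = ("ATGCCGTA", complement("ATGCCGTA"))
copies = replicate(original)  # two helices, each identical to the original
```

Because each chain fully determines its partner, both entries of `copies` equal `original` in this idealized, error-free model.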

Or is it really thus? The 2007 “Breakthrough of the Year,” according to the journal Science, is the recognition that genomes, the total amount of genetic information within an individual or cell, are not the inviolable repository of faithfully recorded information that scientists once thought them to be. 1 Scientists' new understanding of genomes recognizes that they are not an identical copy of some perfect “parental” genes. They are shuffled, mixed, copied, deleted, and even inverted in a process called meiotic recombination. But that is just the start. Born with this gemisch, the human genome has little reason to retain this messy structure much beyond the onset of an individual's reproductive years.

One unfortunate manifestation of drastic genomic shuffling is cancer, though evidence suggests that this shuffling is a consequence, rather than a cause, of cancer. Cancer, the onslaught of pathogens, and even aging may be the ultimate price we pay for genetic instability. But gene shuffling is also the basis of our immune system. The human body uses it to prepare an army of “random” hunters for unknown targets. The very same process of random gene shuffling may be critical to the formation of the complex neural networks that let you read this article. 2 The bottom line: The genome you carry is not just your mother's, not just your father's, and, thanks to environmental pressures, not even really yours to keep. Rather, it reflects a complex set of factors, mostly inherited, but many stochastic and environmental.

In 2001, scientists first sequenced the human genome, determining the order of the bases that made up the genome's array of DNA. 3 This event, an enormous milestone, was really just the first tiny step–the average DNA sequence from essentially one person–in a much larger process. Scientists are hoping to develop a method to sequence the genome not just of individual humans, but ultimately, of individual cells from different types of tissue, in different states of health. The first human genome took 10 years to complete and cost billions of dollars. Scientists need to reduce the cost and time required for sequencing by many orders of magnitude to benefit human health. By gaining deeper insight into the genetic factors that make one person more likely to suffer a disease than another, scientists will be able to tailor treatment to individual characteristics, avoiding many of the side effects of drugs and targeting expensive treatments to patients on whom they will be effective. The key to understanding the composition of and changes that occur in the 7 billion or so human genomes on the planet will be to develop a vanguard technology to make much longer DNA sequencing reads than are currently possible.

Conventional sequencing methods are based on the technique introduced by chemist Frederick Sanger, who won a Nobel Prize–his second–for DNA sequencing research in 1980. The four bases that make up DNA are themselves chemical rings composed of carbon, hydrogen, oxygen, and nitrogen atoms. Each of them is tethered to a sugar molecule that, in turn, is attached to a phosphate group. The assembly of base, sugar, and phosphate is called a nucleotide. In a DNA strand, nucleotides attach at the phosphates to form a long-chain polymer. Sanger's sequencing method is based on “defective” copying of DNA. The DNA to be sequenced is mixed with the enzyme that copies DNA (DNA polymerase) and a mixture of nucleotides, which contains a small fraction of defective nucleotides whose sugar molecules lack a hydroxyl group. Let's say that the defective nucleotides are the ones containing guanine. Some small fraction of the time, the polymerase incorporates one of the defective guanines, and the copy cannot be extended any further. As a consequence, the mixture of copied material will contain all possible fragments of the copied DNA that end with a guanine. The polymerase only copies DNA in one direction, so there is no ambiguity about whether it is working “up” or “down” the DNA chain. The same process is repeated for the other bases. Finally, all four solutions go through a process called electrophoresis, which measures the lengths of the fragments. Reading and ordering these lengths yields the sequence of the original DNA.
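The logic of reading a sequence from fragment lengths can be illustrated with a small Python sketch. It is an idealized model, not a description of laboratory practice: it assumes every position is terminated in at least one fragment, that fragment lengths are measured exactly, and that the sequence being read is the copied strand. The function names are invented for this sketch.

```python
def terminated_lengths(copy: str, base: str) -> list:
    """Lengths of all copied fragments that stop at a defective `base`.

    Idealized: every occurrence of `base` ends at least one fragment."""
    return [i + 1 for i, b in enumerate(copy) if b == base]

def read_sequence(lanes: dict) -> str:
    """Order the fragment lengths from all four reactions, shortest first."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[n] for n in sorted(by_length))

copy = "GATTACAGGC"  # the strand produced by the polymerase
lanes = {base: terminated_lengths(copy, base) for base in "ACGT"}
recovered = read_sequence(lanes)
```

Here `lanes["G"]` is `[1, 8, 9]`, the lengths of the fragments ending in guanine, and `recovered` equals `copy`: each length appears in exactly one of the four reactions, so sorting the lengths spells out the sequence.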

The problem with this method is that it cannot resolve fragments longer than a thousand or so bases. The human genome is about 3 billion bases long, a few million times longer than the Sanger method can read in a single pass. Accordingly, sequencing a human genome using this method requires very elaborate steps to split up the original genome into small packets that are somehow indexed to their source. Using an over-sampling process that generates a large amount of overlapped data, scientists are then able to reassemble the entire genome from the smaller sequencing reads. This sequence assembly process is problematic for other reasons. It cannot easily identify the inversions, deletions, copies, and tandem repeats of our dynamic genomes, since assembly programs tend to put repeated sequences into the same place and fill in deleted gaps.
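A toy example shows both how overlapping reads can be stitched back together and why repeats cause trouble. The sketch below uses a simple greedy strategy (always merge the pair of reads with the longest overlap), which is an assumption made for illustration and not the method used by any particular assembly program; the sequences and function names are invented.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Repeatedly merge the two reads that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Reads from a stretch without repeats reassemble correctly.
unique = greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"])

# Reads from a stretch with five tandem "CAG" repeats:
# the true sequence is TTCAGCAGCAGCAGCAGGA (19 bases).
collapsed = greedy_assemble(["TTCAGCAG", "CAGCAGCAG", "CAGCAGGA"])
```

In this toy case `unique` is `["ATGGCGTACGTTAGC"]`, the original stretch, while `collapsed` is `["TTCAGCAGCAGGA"]`: the assembler has stacked the repeated reads on top of one another and returned three repeats where the source had five.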

A generation after Sanger's work, scientists have developed a radically new set of sequencing technologies based on “sequencing by addition.” They exploit the natural ability of DNA polymerase to copy enormous lengths of DNA at a speed of some 1,000 bases a second. Each time a particular nucleotide is added to a copy of the target DNA, a chemical signal could be released, for example through the quenching of a dye molecule or the detection of a byproduct of the reaction. By sensing this signal, scientists can record the order of bases along a stretch of DNA.

One such scheme, pyrosequencing, is the basis of a new commercial technology introduced by a Connecticut-based company, 454 Life Sciences. To initiate this process, scientists attach many copies of a small DNA chain (a hundred or so bases) to be sequenced to a chemical bead and flow solutions of each of the four nucleoside triphosphates (NTPs) over the chain in the presence of DNA polymerase. As their name implies, NTPs come loaded with extra phosphate groups. When an NTP attaches to a polymer chain grown by the polymerase, two of its phosphates are cleaved off. These detached phosphates, known as pyrophosphates, are energetic and can trigger additional enzyme-mediated reactions in the solution that lead to a light-emitting reaction.

The entire pyrosequencing apparatus consists of millions of beads, each bead playing host to many copies of a given fragment. The entire array is imaged by a digital sensor. Each time a particular NTP passes through the system and produces a flash at a bead, a computer records the complement of that NTP as the next base in the sequence attached to that particular bead. The method is massively parallel, as millions of beads, each with its own sequence of DNA attached, are processed simultaneously. The length of chain this method can read is limited, however, as the inefficiency of the reaction introduces uncertainties in the position of the added nucleotide. Despite these technical achievements, pyrosequencing is still quite slow and costly, and can't produce the very long reads that will be essential for understanding the gross structure of genomes.
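The bookkeeping for a single bead can be sketched as follows. This is an idealized model under stated assumptions: the four nucleotides are flowed in a fixed repeating order (the order `"TACG"` is arbitrary), every reaction runs to completion, and the flash brightness is taken to equal the number of bases added in that flow. Real signals are noisier, which is the source of the read-length limit mentioned above. All names are invented for this sketch.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flowgram(template: str, flow_order: str = "TACG", cycles: int = 8) -> list:
    """Simulate ideal light signals for the fragment attached to one bead.

    Each flow extends the copy across every consecutive template base that
    pairs with the flowed nucleotide; the signal is the number of bases added."""
    signals, pos = [], 0
    for flowed in flow_order * cycles:
        count = 0
        while pos < len(template) and PAIR[template[pos]] == flowed:
            count += 1
            pos += 1
        signals.append((flowed, count))
    return signals

def call_template(signals: list) -> str:
    """Record the complement of each incorporated nucleotide, flash by flash."""
    return "".join(PAIR[flowed] * count for flowed, count in signals)

signals = flowgram("GGATTC")
called = call_template(signals)
```

For the template `GGATTC`, the third flow (cytosine) gives a signal of 2 because two guanines sit side by side, and `called` reproduces the template. Telling a signal of 2 from a signal of 3 is easy in this model and much harder with real, imperfect chemistry.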

So, how might scientists read really long lengths of DNA? This is the subject of an intense research effort–driven by a “$1,000 genome” National Institutes of Health initiative–aimed at making it cost-effective to sequence whole genomes. 4 One approach, under development in many labs, aims to use a “nanopore,” an orifice so small that only one DNA base can pass through it at a time, to read a sequence. Because DNA is negatively charged in an aqueous solution, it can be pulled through the orifice using an electric field (see diagram). This scheme should allow scientists to pass very long lengths of DNA by a single point, one base at a time. Scientists initially hoped that they could measure the characteristic blockages of nanopore current as each base occluded the pore, enabling them to identify the base. In practice, this process is a messy affair, and sequence information is entangled with information about the secondary structure of DNA. This difficulty has caused many to dismiss the nanopore approach. One way to solve this problem would be to attach an independent “reading head” to the nanopore. The nanopore would serve to present each base in turn to the reading head, which would then generate a signal characteristic of each base as it passes the reader. 5

Diagram: A sequencing technique under development uses an electric field to drive a charged DNA molecule through a nanopore; each base would be read as it passes through the nanopore.

Scientists at the Biodesign Institute at Arizona State University, including this author, recently proposed a new method to read the order of bases called “sequencing by recognition,” demonstrating that it is possible to identify at least some of the bases with high fidelity in single-molecule reads. This method exploits chemical recognition of the bases. Two electrodes are placed in close proximity to the nanopore. One is attached to a chemical that specifically forms hydrogen bonds with the phosphates in the backbone of the DNA, the other to a chemical that recognizes one of the four bases. A tunnel current flows between the electrodes if, and only if, a complete circuit is formed by simultaneous completion of the hydrogen bonds between the phosphate-recognition element and the phosphates, and between the target base and the base-recognition element on the second electrode. The scheme works because interactions involving multiple hydrogen bonds can be remarkably specific. A single set of electrodes would identify one particular base, so four different readers would be required to sequence the DNA. Scientists have demonstrated this technique on simple model systems, but a manufacturable device with hundreds, or even thousands, of parallel reading heads is still a long way off. 6 Eventually, reading heads should be able to scan a base in a few milliseconds and read at least hundreds of thousands of bases continuously. The availability of cheap, rapid whole genome sequencing may not be far off.
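A back-of-the-envelope calculation gives a feel for what these figures would mean in practice. The numbers below are assumptions chosen for illustration: a genome of roughly 3 billion bases, 2 milliseconds per base as one reading of “a few milliseconds,” and reading heads that each act as a complete reader and share the work evenly. Oversampling and the need for four base-specific readers are ignored.

```python
def single_pass_hours(genome_bases: float, ms_per_base: float, heads: int) -> float:
    """Hours for `heads` parallel readers to scan `genome_bases` exactly once."""
    seconds = genome_bases * (ms_per_base / 1000.0) / heads
    return seconds / 3600.0

GENOME_BASES = 3e9  # assumption: approximate length of a human genome
MS_PER_BASE = 2.0   # assumption: one reading of "a few milliseconds"

hours_by_heads = {
    heads: single_pass_hours(GENOME_BASES, MS_PER_BASE, heads)
    for heads in (1, 100, 1000)
}
```

Under these assumptions a single head would need about 1,667 hours (roughly 69 days) for one pass over a genome, while a thousand heads working in parallel would need under two hours, which is why a manufacturable array of readers matters so much.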

Reading List

FOR MORE INFORMATION ON THE IDEAS AND ISSUES DISCUSSED IN THIS ESSAY, WE RECOMMEND READING:

“The Race for the $1,000 Genome,” Robert F. Service, Science, vol. 311, no. 5767. The scientific and public demand for more affordable, more capable sequencing technologies is contributing to a proliferation of sequencing research. This article provides a look at ongoing efforts and introduces some of the field's leading scientists.

Making PCR: A Story of Biotechnology, Paul Rabinow (1996). Polymerase chain reaction (PCR) is among the most common tools in use in molecular biology labs around the world. Rabinow, a Berkeley anthropologist, documents the development of this transformative biotechnology and explores how PCR changed the potential of biological research.

The Selected Papers of Frederick Sanger, edited by Frederick Sanger and Margaret Dowding (1996). The originator of the first sequencing method provides backstory to much of his early sequencing work, exposing the genesis of his research. The commentary and primary source materials are technical, but there's nothing like hearing the story straight from the horse's mouth.

An A to Z of DNA Science Jeffre L. Witherly, Galen P. Perry, and Darryl L. Leja (2001). A genomics glossary for the non-scientist, this guidebook from the esteemed Cold Spring Harbor Laboratory Press includes informative illustrations.

The Genomic Revolution: Unveiling the Unity of Life Edited by Rob DeSalle and Michael Yudell (2002). This collection of essays by scientists, including Harold Varmus, Leroy Hood, and Mary Jeanne Kreek, examines the basic implications of genomic research on medicine, agriculture, ethics, and privacy. It also includes a good explanation of “shotgun sequencing,” which helped to complete the first sequencing of the human genome.

The National Plant Genome Initiative: Objectives for 2003-2008 (2002). Plants and trees have smaller genomes than humans, but understanding their genetic structure could have broad effect. This National Academies publication summarizes the work being done to explore the genetic structure of plants with the hope of addressing national needs in agriculture, energy, and waste reduction.

Synthetic Genomics: Options for Governance (2007). The consequence of improved genetic sequencing capabilities will be a boatload of new genetic information. Scientists in related fields have been working to improve genetic synthesis technology, which would allow for the construction of specific genetic sequences. This accessible report, authored by leading scientists in synthetic biology, addresses the range of implications of synthesis technology, and also suggests strategies for ensuring its safe use.

When this type of operation is possible, is a person's genome really their own? Will the ability to record the entire genomes of each and every human change our relationships with health care providers and governments? Not any time soon. Insurance companies already have much simpler procedures that look at just one or a few genes, which can be abused as gate-keeping devices. The early implications of radically new whole genome methods will be for research. Humans will be surprised at the complexity and diversity of possible long-range genomic rearrangements. A full understanding of a genome could provide a better basis for treating humans as the individuals we are, in contrast to current methods that look at a few selected genes.

Life with ubiquitous genetic sequencing certainly will be different. But I doubt that the consequences will be Orwellian. Life is so complex, and no doubt genomes will be found to be as individual as their owners. I foresee a time when diversity and lack of racial bias translate into genomic diversity and lack of genomic bias. Deeper insights into biology will follow.

Supplementary Material

National Science Advisory Board for Biosecurity Meeting Summary

Footnotes

1. Elizabeth Pennisi, “Human Genetic Variation,” Science, vol. 318, pp. 1842-43 (2007).

2. Brian E. Chen et al., “The Molecular Diversity of Dscam Is Functionally Required for Neuronal Wiring Specificity in Drosophila,” Cell, vol. 125, pp. 607-20 (2006).

3. E. S. Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, vol. 409, pp. 860-921 (2001).

4. M. Zwolak and M. Di Ventra, “Physical Approaches to DNA Sequencing and Detection,” Reviews of Modern Physics, vol. 80, pp. 141-65 (2008).

5. Ibid.

6. Jin He et al., “Identification of DNA Base-Pairing Via Tunnel-Current Decay,” Nano Letters, vol. 7, pp. 3854-58 (2007).