Description
1. INTRODUCTION
Considerable interest has developed in the past few years in sequencing
the entire human genome (i.e., all of the genetic material in a human
cell). The task, however, is enormous because it involves the sequencing
of at least 3,000,000,000 base pairs, an effort which is likely to take
ten or more years and cost $3,000,000,000 if undertaken using conventional
technology (1993 Edgington, Bio/Technology 11:39-42, which is incorporated
herein by reference).
The Committee on Mapping and Sequencing the Human Genome of the National
Research Council in their 1988 report entitled, Mapping and Sequencing the
Human Genome (which is incorporated herein by reference), stated that, "No
foreseeable technology will be able to automate DNA sequencing
comprehensively." The present invention is a method and apparatus for
comprehensively automating this effort with substantial improvements in
speed and cost. The invention is applicable to the sequencing of genetic
material from any source, human or otherwise.
2. BACKGROUND OF THE INVENTION
2.1. DNA AND RNA
Deoxyribonucleic acid (DNA) is the primary genetic material of most
organisms. Ribonucleic acid (RNA) is the primary genetic material in
certain viruses. Additionally, a form of RNA known as messenger RNA (mRNA)
is found in all cells and comprises copies of portions of the primary
genetic information found in the DNA. In its natural state, DNA is found
in the form of a pair of complementary chains of nucleotides which are
interconnected as a double helix (see FIG. 1). A nucleotide in turn is
composed of a nitrogenous base (see FIGS. 2 and 3), which identifies the
nucleotide, linked by an N-glycosidic bond to a five-carbon sugar. RNA
differs from DNA in that in DNA the nucleotide sugar is deoxyribose, while
in RNA, the sugar is ribose. A phosphate group serves to link the
nucleotides together, forming the backbone of a single strand of DNA (see
FIG. 2). Normally, the nitrogenous base is one of the following: adenine,
guanine, thymine and cytosine (respectively denoted A, G, T, and C), or
uracil (U) in place of thymine in RNA (see FIG. 3). The order of the
four nucleotides, A, G, T and C, in the chain is often referred to as the
sequence of the DNA and can be specified simply by setting down the
symbols A, G, T and C in the order in which these four nucleotides appear
in the DNA strand.
The two chains (or strands) of a DNA double helix are held together by
hydrogen bonding between the nitrogenous bases of their individual
nucleotides. This hydrogen bonding is specific in that adenine in one
strand must pair with thymine (or uracil in RNA) in the other strand, and
guanine with cytosine. The sequence of bases in one strand of DNA is thus
complementary to the sequence on the other strand.
A DNA chain has polarity: one end of the chain has a free 5'--OH (or
phosphate) group (termed "the 5' end") and the other a free 3'--OH (or
phosphate) group ("the 3' end"). By convention, the nucleotide sequence is
written or read left-to-right in the direction from the 5' end to the 3'
end. The two strands of a DNA double helix have opposite polarities. Thus
the 5' end of one strand pairs with the 3' end of the other strand and the
complementarity of the two strands is revealed by comparing one strand
read in the 5' to 3' direction with the other strand read in the 3' to 5'
direction.
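The pairing rule and the polarity convention described above can be illustrated with a short sketch. The function names below are illustrative only and do not appear in the original disclosure; the example strand is the one used later in connection with FIG. 4.

```python
# Base-pairing rule from the text: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_3_to_5(strand_5_to_3: str) -> str:
    """Return the partner strand written 3'->5', aligned base-for-base."""
    return "".join(PAIRING[base] for base in strand_5_to_3)


def reverse_complement(strand_5_to_3: str) -> str:
    """Return the partner strand written in the conventional 5'->3' direction."""
    return complement_3_to_5(strand_5_to_3)[::-1]


top = "AAGTACT"                  # read 5'->3'
print(complement_3_to_5(top))    # TTCATGA (partner strand, read 3'->5')
print(reverse_complement(top))   # AGTACTT (same partner strand, read 5'->3')
```

Reading the partner strand 3' to 5' lines each base up with its complement, whereas the conventional 5' to 3' notation requires reversing it.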
Genetic information is encoded in the particular sequence (order of
occurrence) of nucleotides along a DNA molecule and DNA sequencing is the
process of determining that order in a particular DNA molecule.
2.2. ENZYMES USED IN DNA SEQUENCING
Two classes of enzyme activity which have been employed in certain methods
used to sequence DNA are DNA polymerase and exonuclease activity.
A DNA polymerase is an enzyme that has the ability to catalytically
synthesize new strands of DNA in vitro. The DNA polymerase carries out
this synthesis by moving along a preexisting single DNA strand ("the
template") and creating a new strand, complementary to the preexisting
strand, by incorporating single nucleotides one at a time into the new
strand following the base-pairing rule described above.
In contrast to polymerase activity, exonuclease activity refers to the
ability of an enzyme (an exonuclease) to cleave off a nucleotide at the
end of a DNA strand. Enzymes are known which can cleave successive
nucleotides in the single DNA strand of a single-chain DNA molecule,
working from the 5' end of the strand to the 3' end; such enzymes are
termed single-stranded 5' to 3' exonucleases. Other enzymes are known
which perform this operation in the opposite direction (single-stranded 3'
to 5' exonucleases). There also exist enzymes which can cleave successive
nucleotides from the end of a single strand of a double-stranded DNA
molecule. These enzymes are termed double-stranded 5' to 3' or 3' to 5'
exonucleases, depending on the direction in which they proceed along the
strand. Exonucleases are also characterized as being distributive or
processive in their action. Distributive exonucleases dissociate from the
DNA following each internucleotide bond cleavage, whereas processive
exonucleases will hydrolyze many internucleotide bonds without
dissociating from the DNA.
2.3. SEQUENCING OF DNA
Approaches to DNA sequencing have varied widely. Use of these enzymes or
other chemical methods, as described below, has made it possible to
sequence small portions of the human genome. Despite these successes, most
of the human genome remains unexplored. Of the 3,000,000,000 base pairs in
the human genome, only about 20 million base pairs have been sequenced
(GenBank® Release 74, December 1992).
2.3.1. SEQUENCING LADDER METHODS
Many techniques for sequencing DNA have involved generating fragments of
labeled DNA, the lengths of which are sequence-dependent, and separating
the fragments according to their lengths by electric field-induced
migration in a gel, so as to be able to discern the DNA sequence from the
appearance of the separated fragments. Such a pattern of
sequence-dependent fragment lengths is known as a sequencing ladder. The
fragments can be generated by either: (a) cleaving the DNA in a
base-specific manner (see FIG. 4), or (b) synthesizing a copy of the DNA
wherein the synthesized strand terminates in a base-specific manner (see
FIG. 5).
The Maxam-Gilbert technique for sequencing (Maxam and Gilbert, 1977, Proc.
Natl. Acad. Sci. USA 74:560, which is incorporated herein by reference)
involves the specific chemical cleavage of DNA. According to this
technique, four samples of the same labeled DNA are each subjected to a
different chemical reaction to effect preferential cleavage of the DNA
molecule at one or two nucleotides of a specific base identity. By
adjusting the conditions to obtain only partial cleavage, DNA fragments
are thus generated in each sample whose lengths are dependent upon the
position within the DNA base sequence of the nucleotide(s) which are
subject to such cleavage. Thus, after partial cleavage is performed, each
sample contains DNA fragments of different lengths each of which ends with
the same one or two of the four nucleotides. In particular, in one sample
each fragment ends with a C, in another sample each fragment ends with a C
or a T, in a third sample each ends with a G, and in a fourth sample each
ends with an A or a G. The fragments so generated are then separated from
one another by electric field-induced migration in a polyacrylamide gel.
The four individual sets of fragments produced by cleavage using chemical
reactions of different specificity are run side-by-side, in separate lanes
of the gel. The DNA fragments are then visualized, and the sequence is
determined by observing the position in the gel of the generated
fragments.
FIG. 4 schematically depicts the visualization of DNA fragments that are
generated by cleaving the labelled DNA having the sequence
5'-AAGTACT-3'-label. The fragments from the four samples are run
side-by-side in the four lanes of the gel identified by G, A+G, C, T+C
where G identifies the sample in which all the fragments end with guanine
nucleotides, A+G identifies the sample in which all the fragments end with
either an adenine or a guanine nucleotide, C identifies the sample in
which all the fragments end with a cytosine nucleotide, and T+C identifies
the sample in which all the fragments end with either a thymine or a
cytosine nucleotide. The distance the fragments migrate in the gel is a
monotonic function of their length. Thus, after the migrating fragments
are visualized, the order of the nucleotides in the labelled DNA molecule
can be read directly from the vertical position of the fragments in the
gel. The fragments that end with adenine that appear in the A+G lane, and
the fragments that end with thymine that appear in the T+C lane, can be
distinguished from the fragments in the same lanes that end with guanine
and cytosine, respectively, by noting that the fragments that end with
guanine and cytosine also appear at the same vertical position in the G
and C lanes, respectively.
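The lane-reading logic just described can be sketched as follows. This is a simplified model under stated assumptions, not the actual chemistry: each labeled fragment is taken to run from the cleaved nucleotide to the labeled 3' end, as in the description above, and band position is represented simply by fragment length. The names `run_gel` and `read_gel` are hypothetical.

```python
# Lanes in which a fragment ending with each base appears (per the text).
LANES_FOR_BASE = {"G": ("G", "A+G"), "A": ("A+G",), "C": ("C", "T+C"), "T": ("T+C",)}


def run_gel(sequence: str) -> dict:
    """Map each lane name to the set of labeled fragment lengths it contains."""
    gel = {"G": set(), "A+G": set(), "C": set(), "T+C": set()}
    n = len(sequence)
    for i, base in enumerate(sequence):
        length = n - i  # assumed: fragment spans the cleaved base to the 3' label
        for lane in LANES_FOR_BASE[base]:
            gel[lane].add(length)
    return gel


def read_gel(gel: dict) -> str:
    """Recover the 5'->3' sequence from the band pattern."""
    longest = max(length for bands in gel.values() for length in bands)
    bases = []
    for length in range(1, longest + 1):  # shortest fragments migrate farthest
        if length in gel["G"]:
            bases.append("G")             # band in both G and A+G lanes
        elif length in gel["A+G"]:
            bases.append("A")             # band in A+G lane only
        elif length in gel["C"]:
            bases.append("C")             # band in both C and T+C lanes
        elif length in gel["T+C"]:
            bases.append("T")             # band in T+C lane only
        else:
            raise ValueError(f"no band of length {length}")
    # Bands were read outward from the labeled 3' end, so reverse for 5'->3'.
    return "".join(reversed(bases))


gel = run_gel("AAGTACT")
print(sorted(gel["A+G"]))  # [3, 5, 6, 7]
print(read_gel(gel))       # AAGTACT
```

The band of length 5 appears in both the G and A+G lanes and is therefore read as guanine, while the bands of length 3, 6 and 7 appear only in the A+G lane and are read as adenine.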
In the DNA of many organisms, a significant fraction of the cytosines are
methylated in vivo at the 5 position to give 5-methylcytosine. Such
methylation is involved in the regulation of gene expression and in
genetic imprinting. Church and Gilbert (1984, Proc. Natl. Acad. Sci. USA
81:1991-1995; incorporated herein by reference) and Saluz and Jost (1987,
"A Laboratory Guide to Genomic Sequencing," BioMethods, Vol. 1,
Birkhauser, Boston; incorporated herein by reference) devised a
modification of the Maxam and Gilbert chemical cleavage method to provide
a means for directly determining the position of 5-methylcytosine in
genomic DNA. In this method, genomic DNA is chemically cleaved, then
completely digested with a restriction enzyme and separated by gel
electrophoresis, resulting in a complex mixture of superimposed sequencing
ladders. The DNA bands forming the rungs of the sequencing ladder are next
transferred and cross linked to a nylon membrane. A specific ladder from
the mixture is then recognized by hybridizing the membrane with a labeled
oligonucleotide probe which uniquely recognizes the sequence immediately
adjacent to a particular restriction site. Frommer et al. (1992, Proc.
Natl. Acad. Sci. USA 89:1827-1831, which is incorporated herein by
reference) have recently developed an alternative genomic DNA sequencing
method wherein cytosines in the sample DNA are converted to uracil by
bisulfite treatment which leaves 5-methylcytosine unmodified. Comparison
of the sequence of modified and unmodified DNA reveals the positions in
the sequence of 5-methylcytosine. Such genomic sequencing methods can
only be carried out with genomic DNA. The methylation pattern is lost
during gene cloning in microorganisms in vivo, and during DNA copying or
amplification in vitro.
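The comparison step of the bisulfite approach can be sketched as follows. This is a minimal single-strand model: the example sequence, the methylated positions, and the function names are hypothetical, and positions are counted from zero.

```python
def bisulfite_convert(sequence: str, methylated: set) -> str:
    """Convert unmethylated C to U; 5-methylcytosine (listed positions) is unchanged."""
    return "".join(
        "U" if base == "C" and i not in methylated else base
        for i, base in enumerate(sequence)
    )


def find_methylated(untreated: str, treated: str) -> list:
    """Positions that read C both before and after treatment were methylated."""
    return [
        i
        for i, (before, after) in enumerate(zip(untreated, treated))
        if before == "C" and after == "C"
    ]


genomic = "ACGTCCGA"                # hypothetical genomic sequence
treated = bisulfite_convert(genomic, methylated={1, 5})
print(treated)                      # ACGTUCGA
print(find_methylated(genomic, treated))  # [1, 5]
```

Only cytosines that survive the treatment are reported, which mirrors the comparison of modified and unmodified sequence described above.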
The plus/minus DNA sequencing method (Sanger and Coulson, 1975, J. Mol.
Biol. 94:441-448, which is incorporated herein by reference) involves: (a)
use of DNA polymerase to generate complementary ³²P-labeled DNA
oligonucleotides of different lengths; (b) (the "minus" system) in four
separate reaction vessels, reaction of one half of the generated DNA with
DNA polymerase and three out of the four nucleotide precursors; and (c)
(the "plus" system) in four separate reaction vessels, reaction of the
remaining half of the generated DNA with DNA polymerase and only one of
the four nucleotide precursors in each vessel. Each reaction mixture generated in
steps (b) and (c) is subjected to a denaturing polyacrylamide gel
electrophoresis. The generated fragments are separated from one another by
migration in the polyacrylamide gel; the shorter the fragment, the greater
the migration. After visualization of the DNA in the gel by detection of
its label, the sequence of the DNA can be determined by observing the
position in the gel of the generated fragments.
The dideoxy method of sequencing was published in 1977 by Sanger and his
colleagues (Sanger et al., 1977, Proc. Natl. Acad. Sci. USA 74:5463, which
is incorporated herein by reference). In contrast to the method of Maxam
and Gilbert which relies on specific chemical cleavage to generate
fragments with lengths which are sequence-dependent, the Sanger dideoxy
method relies on enzymatic activity of a DNA polymerase to synthesize
fragments with lengths that are sequence-dependent. The Sanger dideoxy
method utilizes an enzymatically active fragment of the DNA polymerase
termed E. coli DNA polymerase I, to carry out the enzymatic synthesis of
new DNA strands. The newly synthesized DNA strands consist of fragments of
sequence-dependent length, generated through the use of inhibitors of the
DNA polymerase which cause base-specific termination of synthesis. Such
inhibitors are dideoxynucleotides which, upon their incorporation by the
DNA polymerase, destroy the ability of the enzyme to further elongate the
DNA chain due to their lack of a suitable 3'--OH necessary in the
elongation reaction. When a dideoxy nucleotide whose base can
appropriately hydrogen bond with the template DNA is thus incorporated by
the enzyme, synthesis of the growing DNA strand halts. Thus DNA fragments
are generated by the DNA polymerase, the lengths of which are dependent
upon the position within the DNA base sequence of the nucleotide whose
base identity is the same as that of the incorporated dideoxynucleotide.
The fragments so generated can then be separated in a gel as in the
Maxam-Gilbert procedure, visualized, and the sequence determined.
For example, for the case of a template DNA molecule having the sequence
5'-GCCATCG-3'-label, FIG. 5 depicts the visualization of the DNA fragments
that are generated by the dideoxy method after terminating synthesis at
each of the nucleotides G, A, C and T. Since the distance a fragment
migrates in the gel is a monotonic function of its length, the sequence of
the DNA molecule can be read directly from the gel after the fragments are
visualized.
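The way base-specific termination yields a readable ladder can be sketched as follows. This is a simplified model under stated assumptions: the primer length and the label are ignored, every possible termination product is assumed to be present, and the function names are hypothetical. The newly synthesized strand is complementary to the template, so the template is recovered as the reverse complement of the sequence read from the gel.

```python
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Complementary strand, written 5'->3'."""
    return "".join(PAIRING[base] for base in reversed(seq))


def dideoxy_lanes(template: str) -> dict:
    """Map each terminator (G, A, C, T) to the lengths of its terminated fragments."""
    new_strand = reverse_complement(template)  # strand synthesized by the polymerase
    lanes = {base: [] for base in "GACT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)  # a ddN terminates synthesis at this length
    return lanes


def read_ladder(lanes: dict) -> str:
    """Read bands from shortest to longest, then convert back to the template."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    new_strand = "".join(base_at_length[n] for n in sorted(base_at_length))
    return reverse_complement(new_strand)


lanes = dideoxy_lanes("GCCATCG")
print(lanes)               # {'G': [2, 5, 6], 'A': [3], 'C': [1, 7], 'T': [4]}
print(read_ladder(lanes))  # GCCATCG
```

Because each fragment length occurs in exactly one lane, reading the bands in order of length gives the synthesized strand directly.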
Sanger and colleagues utilized an E. coli DNA polymerase I fragment termed
the Klenow fragment. After the disclosure of the original Sanger dideoxy
technique, the enzyme used in most dideoxy sequencing was the Klenow
fragment. Other enzymes with DNA polymerase activity that have been used
in sequencing include AMV reverse transcriptase and T7 DNA polymerase
(Tabor and Richardson, U.S. Pat. No. 4,795,699, which is incorporated
herein by reference).
DNA sequencing methods have been automated to varying degrees. In the
manual methods, radioactive labels such as ³²P are typically used to
identify the bands of the sequencing ladder by autoradiographic imaging on
X-ray film. Digital imaging systems and pattern recognition software have
been developed by several groups for automatic interpretation and data
entry from such autoradiographs (Elder et al., 1986, Nucl. Acids Res.
14:417-424, which is incorporated herein by reference). Real-time
recording of the sequencing ladder during gel electrophoresis was made
possible by positioning β-emission detectors at the bottom of the gel
(EG&G Biomolecular ACUGEN™ Sequencer, Acugen™ System Report 88-106,
EG&G Biomolecular), or by employing fluorescent labeling techniques in
combination with real-time fluorescence detection during electrophoresis.
Smith et al. (1986, Nature 321:674, which is incorporated herein by
reference) disclose a method for partial automation of DNA sequencing,
which involves use of four different color fluorophores bound to the
primer (Smith et al., 1985, Nucl. Acids Res. 13:2399-2412, which is
incorporated herein by reference) used for synthesis in one of four
reaction vessels, each containing a different dideoxynucleotide in the
Sanger dideoxy method. The reaction mixtures are combined and subjected to
electrophoresis, during which the separated DNA fragments are identified
by a fluorescent detection apparatus, and the sequence information
acquired directly by computer. In an alternative approach, the dideoxy
nucleotide chain terminators have each been chemically linked to different
succinylfluorescein fluorescent dyes which can be distinguished by their
fluorescent emission, allowing the four sequencing reactions to be run in
a single tube (Prober et al., 1987, Science 238:336, which is incorporated
herein by reference). Japanese scientists and engineers are participating
in the development of a completely automated DNA sequencing system,
employing the Sanger dideoxy method of sequencing (Endo et al., 1991,
Nature 352:89-90; Wada et al., 1987, Nature 325:771-772, which are
incorporated herein by reference).
Ladder-based sequencing methods are currently the most widely utilized, and
variations on the Sanger method of generating the sequencing ladder are
used predominantly. The throughput and cost of ladder-based sequencing
methods are currently limited by three major factors: (1) the number of
resolvable bases in a single ladder, (2) the time required to separate the
fragments and generate the ladder, and (3) the number of ladders which can
be run in parallel. Numerous efforts are presently underway to further
improve each of these aspects and to thereby enhance the performance of
ladder-based sequencing methods. Conventional DNA sequencing gels are
typically approximately 300-500 micrometers thick. With such gels it is usually
possible to obtain 300-500 bases of sequence from a single sequencing
ladder. The limit depends on the ability to resolve a band containing
fragments which are N nucleotides long from those containing fragments
which are N+1 or N-1 nucleotides in length. Increased resolution can be
achieved by employing thinner gels, typically approximately 25-100 micrometers,
either in ultrathin slab gels (Kostichka et al., 1992, Bio/Technology
10:78-81) or in capillary gels (Drossman et al., 1990, Anal. Chem.
62:900-903, which are incorporated herein by reference). It has recently
been demonstrated that such gels are capable of resolving >1,000 bases,
and further improvements are projected to achieve approximately 2,000 bases. One
approach to further increase the resolution of the gel is to employ
programmed pulse-field techniques (C. Turmel, E. Brassard, R. Forsyth, J.
Randell, D. Thomas, J. Noolandi (1992) "Sequencing up to 800 bases
manually using pulsed field", IN: Genome Mapping & Sequencing, Cold Spring
Harbor Laboratory, Abstract #112; C. Turmel, E. Brassard, J. Noolandi
(1992) Electrophoresis (in press), which are incorporated herein by
reference). Because ultrathin gels can be cooled more efficiently, they
can be operated at much higher voltages per unit length, thereby reducing
the time required to effect the separation of the sequencing ladder.
Multiple capillaries can be run in parallel or a greater number of samples
can be loaded in slab gels to further increase throughput. Both capillary
and ultrathin slab gels have been demonstrated to have some degree of
reusability. In order to achieve the improved performance offered by
ultrathin gels, it is necessary to reduce the number of DNA molecules
loaded onto the gel, which therefore reduces the number of the DNA
molecules in each band or rung of the sequencing ladder. This requires
more sensitive detection methods which have included the use of
sheath-flow cuvette fluorescence techniques (1991 Chen et al., SPIE Vol.
1435, Optical Methods for Ultrasensitive Detection and Analysis:
Techniques and Applications, p. 161-167, which is incorporated herein by
reference), confocal fluorescence microscopy (1992 Mathies and Huang,
"Capillary array electrophoresis: an approach to high-speed, high
throughput DNA sequencing," Nature 359:167-169, which is incorporated
herein by reference), mass spectrometry (1990 T. Brennan, J. Chakel, P.
Bente, M. Field, "New Methods to Sequence DNA by Mass Spectrometry," SPIE
Vol. 1206, New Technologies in Cytometry and Molecular Biology, pp. 60-77;
1990 T. Brennan, J. Chakel, P. Bente, M. Field, "New Methods to Sequence
DNA by Mass Spectrometry," IN: A. L. Burlingame and J. A. McCloskey (Eds.)
Biological Mass Spectrometry, Elsevier, Amsterdam, pp. 159-177, which are
incorporated herein by reference), and resonance ionization spectroscopy
(RIS) (1979 G. S. Hurst, M. G. Payne, S. D. Kramer, J. P. Young,
"Resonance ionization spectroscopy and one-atom detection", Rev. Mod. Phys.
51:767-819; 1991 H. F. Arlinghaus, M. T. Spaar, N. Thonnard, A. W.
McMahon, K. B. Jacobson, "Application of resonance ionization spectroscopy
for semiconductor, environmental and biomedical analysis, and for DNA
sequencing," SPIE Vol. 1435, Optical Methods for Ultrasensitive Detection
and Analysis: Techniques and Applications, pp. 26-35; 1991 K. B. Jacobson,
H. F. Arlinghaus, H. W. Schmitt, R. A. Sachleben, G.M. Brown, N. Thonnard,
F. V. Sloop, R. S. Foote, F. W. Latimer, R. P. Woychik, M. W. England, K.
L. Burchett, D. A. Jacobson, "An Approach to the Use of Stable Isotopes
for DNA Sequencing," Genomics 9:51-59, which are incorporated herein by
reference).
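The three limiting factors identified above combine into a rough throughput estimate, sketched below. The formula is a simplification that ignores sample preparation and gel loading time. The bases-per-ladder figures are taken from the ranges given above, whereas the number of parallel ladders and the run times are hypothetical values chosen only for illustration.

```python
def bases_per_hour(bases_per_ladder: float, ladders_in_parallel: int,
                   run_hours: float) -> float:
    """Raw throughput: (1) resolvable bases x (3) parallel ladders / (2) run time."""
    if run_hours <= 0:
        raise ValueError("run_hours must be positive")
    return bases_per_ladder * ladders_in_parallel / run_hours


# Hypothetical inputs: 24 ladders per gel, 8 h versus 2 h separation time.
conventional = bases_per_hour(400, ladders_in_parallel=24, run_hours=8.0)
ultrathin = bases_per_hour(1000, ladders_in_parallel=24, run_hours=2.0)
print(conventional)  # 1200.0
print(ultrathin)     # 12000.0
```

Under these assumed values, a longer read combined with a shorter separation time yields a tenfold gain, which illustrates why each factor is being pursued independently.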
Another improvement which was developed from the original genomic
sequencing methods is known as multiplex sequencing (Church and
Kieffer-Higgins, 1988, Science 240:185-188, which is incorporated herein
by reference). In multiplex sequencing, multiple sequencing reactions are
pooled and electrophoresed together in a single gel to generate multiple
superimposed sequencing ladders which are then transferred and bound to a
nitrocellulose membrane. The membrane is then probed with an
oligonucleotide which is specific for only one of the pools in order to
reveal the corresponding ladder. By repeatedly stripping the membrane of
probe and rehybridizing with different oligonucleotides it is possible to
obtain the sequence from each of the individual reactions. Although
originally developed using radioactive isotopes to label the probes and
therefore requiring lengthy autoradiographic exposures in order to
visualize the ladder, newer multiplex sequencing protocols have been
devised which employ chemiluminescent detection of the probes (Gillevet,
1990, Nature 348:657-658, which is incorporated herein by reference) or
fluorescence detection (Yang and Youvan, 1989, Bio/Technology 7:576-580,
which is incorporated herein by reference).
Mass spectrometry offers the potential of further improving ladder-based
sequencing by also eliminating the electrophoresis step and replacing it
with mass separation of conventional sequencing reaction mixtures using
time-of-flight methods which require only milliseconds. Matrix-assisted
laser desorption/ionization is currently being explored to generate mass
ions as large as approximately 300,000 daltons without fragmentation which might
permit the determination of approximately 600 bases. (1992 M. C. Fitzgerald, G.
R. Parr, L. M. Smith, "DNA Sequence Analysis by Mass Spectrometry?" IN:
Genome Mapping & Sequencing, Cold Spring Harbor Laboratory, Abstract #113;
1992 G. R. Parr, M. C. Fitzgerald, L. M. Smith, "Matrix-Assisted Laser
Desorption/Ionization Mass Spectrometry of Synthetic