BACKGROUND OF THE INVENTION
The ability to determine nucleotide sequences has had enormous impact on
biology, medicine and biotechnology. An appreciation of the benefits of
knowing the nucleotide sequences of genes, chromosomes, and entire genomes
has led to the current proposals to determine the nucleotide sequence of
the human genome and the genomes of other well studied or economically
important organisms.
Cloning and mapping specific DNA fragments is an important part of the
strategy for sequencing large genomes. The entire human genome of about
3×10^9 base pairs could be contained in a set of about 100,000
cosmids, each of which contains about 40,000 or more base pairs of human
DNA. Even larger segments can be cloned in yeast artificial chromosomes.
The genome sequencing problem then reduces to the problem of sequencing a
large number of DNAs of 40,000 or more base pairs. Such an enterprise
represents a tremendous increase in scale over the most ambitious
sequencing projects that have been undertaken heretofore. If cosmids were
sequenced at the rate of one a day, a formidable task for a sequencing
center using today's technology, centuries would be required to complete
the task.
Currently useful methods for determining nucleotide sequence involve
generating nucleic acid fragments having defined ends and resolving them
according to size, using gel electrophoresis. These defined fragments are
produced chemically (Maxam & Gilbert, Proc. Nat. Acad. Sci. USA, 74,
560-564 (1977); Methods in Enzymology 65, 499-560 (1980)), enzymatically
(Sanger et al., Proc. Nat. Acad. Sci. USA 74, 5463-5467 (1977)), or by
some combination of the two, and are typically identified in
electrophoresis patterns by radioactivity, fluorescence or chemical
reactivity.
The enzymatic sequencing technique has been highly developed, and several
different DNA polymerases and reverse transcriptases are used for this
purpose. These enzymes can be used for sequencing double-stranded or
single-stranded DNA or RNA. Oligonucleotide primers direct DNA synthesis
from a specific site in the molecule, which generates the common end
needed for sequence analysis. The variable end is typically generated by
incorporation of specific chain terminators, such as dideoxynucleotide
triphosphates, or by incorporation of nucleoside triphosphate derivatives
and subsequent cleavage of the molecule at the site of incorporation.
Specific priming is critical for the success of the enzymatic sequencing
technique. Much is known about the specific association between
oligonucleotides and longer nucleic acids, and about the ability of
specifically associated oligonucleotides to prime DNA synthesis by the
enzymes used for nucleotide sequencing (for example, M. Smith, in "Methods
of DNA and RNA Sequencing", edited by S.M. Weissman, Praeger Publishers,
New York, pp 23-68, 1983). Oligonucleotides as short as three or four
bases long have been reported to prime DNA synthesis, and a mixture of
hexamers is widely used to prime random DNA synthesis for labeling
hybridization probes. Oligonucleotides of length 6 or longer are useful
for priming specific sequencing reactions.
In practice, blocks of nucleotide sequence up to several hundred but rarely
as long as a thousand nucleotides can be determined from the products of a
single sequencing reaction or set of reactions. Cosmid DNAs, and in fact
most nucleic acids of interest, are much longer than the few hundred
nucleotides that is the basic unit of sequence determination. Therefore, a
substantial part of the effort involved in sequencing genes or genomes, or
almost any nucleic acid, must be devoted to obtaining and assembling the
many individual blocks of a few hundred nucleotides of sequence that make
up the entire nucleic acid to be sequenced. If an average of 500
nucleotides of sequence could be obtained in each analysis, a minimum of
160 sets of sequencing reactions would have to be prepared and analyzed to
obtain the sequence of both strands of one cosmid DNA.
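The arithmetic behind these estimates can be checked with a short sketch. The function and variable names are illustrative only; the figures (40,000 bp per cosmid, 500 nucleotides per analysis, 100,000 cosmids sequenced at one per day) are those assumed in the text above.

```python
import math

def min_reaction_sets(base_pairs: int, read_length: int, strands: int = 2) -> int:
    """Minimum number of non-overlapping reads needed to cover each strand once."""
    return strands * math.ceil(base_pairs / read_length)

# Both strands of one 40,000 bp cosmid, 500 nucleotides per analysis.
sets_per_cosmid = min_reaction_sets(40_000, 500)

# 100,000 cosmids sequenced at a rate of one per day.
years_for_genome = 100_000 / 365

print(sets_per_cosmid)           # 160
print(round(years_for_genome))   # 274
```

The count of 160 is a lower bound, since it assumes that no two blocks of sequence overlap.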
Several strategies have been developed for obtaining and ordering the many
individual blocks of sequence needed to determine the entire sequence of
larger molecules. One strategy is to use restriction enzymes to obtain and
map specific fragments of the DNA molecule. The nucleotide sequences of
appropriate fragments are determined, and the sequence of the entire
molecule is assembled from the known positions of the fragments. An
example of the use of this strategy is the determination of the sequence
of T7 DNA, a double-stranded molecule about 40,000 bp long (Dunn &
Studier, J. Mol. Biol. 166, 477-535 (1983)). However, such a strategy is
too labor intensive to be economical for sequencing large numbers of DNA
molecules.
A more typical strategy is to subclone random fragments of the DNA into a
cloning vector, typically derived from M13. The sequence of the cloned DNA
is usually determined by the enzymatic sequencing technique, starting from
a unique priming site within the vector DNA. Randomly selected subclones
are sequenced, and the sequence of the original DNA is reconstructed from
overlaps among the many blocks of sequence obtained from the different
subclones. The sequence of lambda DNA, about 48,500 bp long (Sanger et
al., J. Mol. Biol. 162, 729-773 (1982)), was determined by extensive use
of such a strategy. Although relatively efficient in the early stages, a
random cloning strategy becomes highly redundant in the later stages. In a
purely random strategy, perhaps ten times the minimum possible number of
sequence analyses may have to be done before all of the blocks of sequence
can be overlapped. In practice, labor intensive mapping techniques are
often used to close gaps.
Modifications have improved the efficiency of random cloning strategies.
The length of continuous sequence that can be generated from a single
priming site in a cloning vector can be extended considerably by
generating sets of nested deletions that bring different portions of the
DNA close to the priming site (Barnes, Methods in Enzymology, 152,
538-(1987)). However, this remains relatively labor intensive for a large
scale sequencing effort. Multiplexing improves sequencing efficiency by
allowing a single gel electrophoresis pattern to be probed repeatedly to
determine the sequence of many different cloned DNAs (Church &
Kieffer-Higgins, Science 240, 185-188 (1988)). However, all subcloning
strategies suffer from the necessity to prepare many different clones and
isolate DNA from each of them, an effort that will typically be comparable
to that required to do the sequence analyses themselves.
A directed priming, or "walking" strategy allows the sequence to be
determined directly from a nucleic acid molecule of interest without
mapping or subcloning, a considerable savings in effort. To use directed
priming, at least a small portion of the nucleotide sequence in the
molecule must be known or determined in some other way. This known
sequence information is used to synthesize a primer for enzymatic
sequencing reactions that will extend the sequence into the unknown
region. Such primers are synthesized by well known techniques or can be
purchased commercially and are typically at least 16 nucleotides long, so
as to be unique in the entire molecule. In order to continue extending the
sequence further along the molecule, a new primer must be synthesized for
every few hundred nucleotides of sequence obtained. Although a directed
priming strategy eliminates the considerable effort needed for mapping or
subcloning, the cost of primers nevertheless makes the directed priming
strategy very expensive for large scale sequencing.
The recently described polymerase chain reaction (PCR) for amplifying
specific segments of a DNA molecule is also being used to prepare samples
for sequencing (Saiki et al., Science 239, 487-491 (1988); Stoflet et al.,
Science 239, 491-494 (1988)). This technique can eliminate the subcloning
steps, and the PCR primers themselves can be used as primers for
sequencing by the enzymatic technique. However, the use of this technique
requires knowledge of the nucleotide sequence flanking the region to be
amplified, information that is generally not available at the outset, and
the cost of primers would be comparable to that for the directed priming
strategy.
Although determination of nucleotide sequences has become routine, high
volume sequencing is still a difficult problem. The need for methods that
allow more efficient high volume sequencing is widely recognized and is
being addressed in various ways. Machines are being developed to carry out
sequencing reactions and to automate DNA sample preparation and collection
of data. Completely novel sequencing methods that do not require
resolution of DNA fragments by gel electrophoresis are also being
explored. For example, Drmanac et al. (Genomics 4, 114-128 (1989)) have
proposed a method based on the pattern of hybridization of
oligonucleotides to the DNA to be sequenced. However, these initiatives
have not yet had a practical impact.
The current state of the art in high volume sequencing was summarized in a
brief report in Science (242, 1245, Dec. 2, 1988). Bart Barrell and Ellson
Chen, whose laboratories have led the way in high volume sequencing and
had sequenced the largest contiguous stretches of DNA at that time,
reportedly concluded that the current technology realistically allows one
skilled technician to sequence about 50,000 bases a year, and even that
output is difficult to sustain. This rate of sequencing is still far short
of the capacity needed for projects like sequencing the human genome.
SUMMARY OF THE INVENTION
The present invention is directed to a more efficient method for
determining the sequence of nucleotides in nucleic acids. The method
greatly reduces the cost and effort of nucleotide sequencing and is
particularly suitable for very large scale sequence determinations such as
the proposed determination of the nucleotide sequence of the entire human
genome.
The present invention provides methods for improving the efficiency and
economy of enzymatic nucleotide sequencing. The methods include a random
priming method for determining the sequence of nucleotides in parts of a
nucleic acid molecule where the sequence is not known, the method
comprising the steps of:
(a) mixing said nucleic acid molecule with a primer or primer combination
under conditions suitable for forming a primed substrate for DNA synthesis
by a polymerizing enzyme that is suitable for nucleotide sequencing, said
primer or primer combination having a length and composition such that the
average number of priming sites in those parts of the nucleic acid
molecule where the sequence of nucleotides is not known is expected
statistically to be between 0.05 and 4.5, but excluding primers and primer
combinations that would prime in any parts of the nucleic acid molecule
where the sequence of nucleotides is known, said mixing being either
previous to or simultaneous with step (b);
(b) incubating the mixture of step (a) with a polymerizing enzyme under
conditions suitable for primed synthesis of DNA that can be used for
determining nucleotide sequence;
(c) analyzing the reaction products to determine the sequence of a block of
nucleotides in any DNA that was synthesized from a single priming site in
the nucleic acid molecule; and
(d) repeating steps (a)-(c), using different primers or primer
combinations, until one or more blocks of nucleotide sequence have been
determined.
The present invention further provides a directed priming method that
repeatedly uses the same primers for determining or confirming the
sequence of nucleotides in different nucleic acid molecules for which at
least a portion of the nucleotide sequence is known, the method comprising
the steps of:
(a) selecting a primer having 8, 9 or 10 bases, the primer being perfectly
complementary to one and only one site in the known sequence of
nucleotides in a nucleic acid molecule, said site being located so that
the primer, by associating at said site, is capable of priming a
polymerizing enzyme to synthesize DNA complementary to a region of the
nucleic acid molecule where the nucleotide sequence is to be determined or
confirmed, and said primer being obtained from a primer library or being
newly prepared and the unused portion being deposited in a primer library;
(b) mixing said primer and nucleic acid molecule under conditions suitable
for forming a primed substrate for DNA synthesis by a polymerizing enzyme
that is suitable for nucleotide sequencing, said mixing occurring under
conditions where perfect pairing is sufficiently greater than mismatched
pairing that nucleotide sequence can be determined if exactly one perfect
pairing site exists in the nucleic acid molecule, and said mixing being
either previous to or simultaneous with step (c);
(c) incubating the mixture of step (b) with a polymerizing enzyme under
conditions suitable for primed synthesis of DNA that can be used for
determining nucleotide sequence;
(d) analyzing the reaction products to determine the sequence of
nucleotides in any DNA that was synthesized from a single priming site in
the nucleic acid molecule;
(e) repeating steps (a)-(d) until the desired sequences have been
determined or until all blocks of nucleotide sequence merge or reach the
ends of the molecule; and
(f) repeating steps (a)-(e) to determine nucleotide sequences of different
nucleic acid molecules.
Additionally, the present invention provides a combined random and directed
priming method that repeatedly uses the same primers for determining the
sequence of nucleotides in different nucleic acid molecules, the method
comprising the steps of:
(a) mixing a nucleic acid molecule with a random primer or primer
combination under conditions suitable for forming a primed substrate for
DNA synthesis by a polymerizing enzyme that is suitable for nucleotide
sequencing, said random primer or primer combination having a length or
lengths and composition such that the average number of priming sites in
those parts of the nucleic acid molecule where the sequence of nucleotides
is not known is expected statistically to be between 0.05 and 4.5, but
excluding primers and primer combinations that would prime in any parts of
the nucleic acid molecule where the sequence of nucleotides is known, said
mixing being either previous to or simultaneous with step (b);
(b) incubating the mixture of step (a) with a polymerizing enzyme under
conditions suitable for primed synthesis of DNA that can be used for
determining nucleotide sequence;
(c) analyzing the reaction products to determine the sequence of
nucleotides in DNA that was synthesized from a single priming site in the
nucleic acid molecule;
(d) repeating steps (a)-(c), using different random primers or primer
combinations, sequentially or in parallel, until one or more blocks of
nucleotide sequence have been determined;
(e) selecting a directed primer that is perfectly complementary to one and
only one site in the known sequence of nucleotides in the nucleic acid
molecule, whether said sequence was previously known or determined in
steps (a)-(d), said site being located so that the directed primer, by
associating at said site, is capable of priming a polymerizing enzyme to
synthesize DNA complementary to a region of the nucleic acid molecule
where the nucleotide sequence is to be determined or confirmed, and said
directed primer being obtained from a primer library or being newly
prepared and the unused portion being deposited in a primer library;
(f) mixing said directed primer and nucleic acid molecule under conditions
suitable for forming a primed substrate for DNA synthesis by a
polymerizing enzyme that is suitable for nucleotide sequencing, said
mixing occurring under conditions where perfect pairing is sufficiently
greater than mismatched pairing that nucleotide sequence can be determined
if exactly one perfect pairing site exists in the nucleic acid molecule,
and said mixing being either previous to or simultaneous with step (g);
(g) incubating the mixture of step (f) with a polymerizing enzyme under
conditions suitable for primed synthesis of DNA that can be used for
determining nucleotide sequence;
(h) analyzing the reaction products to determine the sequence of
nucleotides in any DNA that was synthesized from a single priming site in
the nucleic acid molecule;
(i) repeating steps (e)-(h) until the desired sequences have been
determined or until all blocks of nucleotide sequence merge or reach the
ends of the molecule; and
(j) repeating steps (a)-(i) to determine nucleotide sequences of different
nucleic acid molecules.
DETAILED DESCRIPTION OF THE DRAWINGS
FIG. 1. Sequencing the cloned portion of a cosmid DNA by random and
directed priming. Line lengths are to scale for an unknown sequence of
40,000 bp, vector sequences of 2,500 bp at each end, and primings that
produce 500 nucleotides of sequence each. The expected fraction of unique
primings at each stage is given; during directed extension this fraction
would be 0.40-0.74 for octamer primers, 0.80-0.93 for nonamers (as shown),
or 0.94-0.98 for decamers (Table 2). Priming from within the vector
sequences into the ends of the cloned DNA is assumed to use primers that
are long enough to be unique.
FIG. 2. Autoradiogram showing nucleotide sequence primed by an octamer
primer in T7 DNA, as described in Example 1.
DETAILED DESCRIPTION OF THE INVENTION
The invention teaches random and directed priming methods wherein a
statistical approach greatly improves the efficiency and economy of
enzymatic nucleotide sequencing. Individual preparations of
oligonucleotide primers typically provide enough material for hundreds of
thousands of primings and the invention teaches methods for efficient use
of this material to obtain sequence information from many different
nucleic acid molecules. The methods do not require mapping or subcloning
and are applicable to nucleic acid molecules of any size suitable for
primed enzymatic sequencing.
The invention relates to the use of primers, selected from a primer
library, to determine or confirm the sequence of nucleotides in nucleic
acid molecules for which at least a portion of the nucleotide sequence is
known. The primer library is a central supply of primers, a collection of
different primers where each primer in the collection is present in
sufficient quantity so that samples can be removed to be used in many
sequencing reactions. The sequence of each oligonucleotide in each of the
primers included in the primer library is known. For random priming, as
that term is defined herein, sets of preparations of primers or primer
combinations may be used. All the sequences in such sets are known. Each
sample of primer or primer combination taken from the primer library to be
used to prime DNA synthesis in the initial steps of the sequencing methods
described herein comprises only a portion of a single preparation, and
different portions of the single preparation of the primer contained in
the primer library are used to prime DNA synthesis in different nucleic
acid molecules.
STATISTICAL ANALYSIS OF OLIGONUCLEOTIDE PRIMING
The present invention is based on a consideration of the statistics of the
priming of enzymatic DNA synthesis as applied to nucleotide sequencing.
Primers in the enzymatic sequencing method are typically
oligodeoxyribonucleotides. However, oligoribonucleotides,
oligoribonucleotides containing methylphosphonate bonds, and perhaps other
types of linkages of normal DNA or RNA bases, or bases such as inosine,
5-bromouracil, or other modified bases that are not normally found in DNA
or RNA, can also associate specifically with template nucleic acids and
prime sequencing reactions, as can such linkages of bases which are
themselves linked to various reporter groups such as fluorescent tags,
biotin etc.
In this specification, the terms primer or oligonucleotide are meant to
specify a molecule containing a defined sequence of bases linked together
in such a way that said molecule is capable of specific association
according to known base pairing rules with a sequence of bases in the
template nucleic acid, and is capable of priming DNA synthesis reactions
suitable for nucleotide sequencing. The terms hexamer, heptamer, octamer,
nonamer and decamer are meant to refer specifically to primers of length
6, 7, 8, 9 and 10 bases, respectively. The nucleic acid to be sequenced
may be referred to for convenience as DNA, but it should be understood
that this is only for convenience and that the invention applies to
single-stranded or double-stranded DNA molecules or to single-stranded or
double-stranded RNA molecules. A primer that primes at one and only one
site in a nucleic acid molecule may be referred to as a unique primer for
that molecule and the priming site as a unique priming site in that
molecule.
In the statistical analysis of priming, which provides the basis for the
present invention, important parameters are the length of the primer, p,
and the total length of the nucleic acid to be sequenced, T. The length of
the primer is the number of bases in the primer molecule that are capable
of specific base pairing with the template nucleic acid. For a
single-stranded nucleic acid, T=L, where L is the number of bases in the
chain. For a double-stranded nucleic acid having complementary strands of
equal length, T=2L, where L is again the number of bases in a single
chain, which also equals the number of base pairs. For substantially
equimolar mixtures of different nucleic acid chains, including
double-stranded nucleic acids having complementary strands of unequal
length, T=ΣL, the sum of the numbers of bases in the individual
chains. For mixtures of the type that would be equivalent to random
breakage of a unique molecule, T is the total number of bases that would
have been in the unique molecule.
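As a minimal sketch of these definitions, T can be computed as the sum of the chain lengths present in a substantially equimolar mixture. The function name and the chain lengths below are hypothetical values chosen only for illustration.

```python
def total_length(chain_lengths):
    """T for a substantially equimolar mixture: the sum of bases over all chains."""
    return sum(chain_lengths)

# Single-stranded molecule of L bases: T = L.
t_single = total_length([7_000])

# Double-stranded molecule with complementary strands of equal length: T = 2L.
t_double = total_length([45_000, 45_000])

# Double-stranded molecule with strands of unequal length: T = sum of the L values.
t_unequal = total_length([45_000, 44_200])

print(t_single, t_double, t_unequal)   # 7000 90000 89200
```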
In the statistical analysis, it is assumed that primers of arbitrary length
prime DNA synthesis at every perfectly complementary sequence in a
template nucleic acid molecule but at no other sequence. The number of
potential priming sites in the molecule is approximately equal to the
total number of bases T. For a nucleic acid molecule of random sequence,
the expected frequency of priming sites for a single randomly selected
oligonucleotide is approximated by the Poisson distribution
$$P(r) = \frac{n^{r} e^{-n}}{r!} \qquad (1)$$
where P(r) is the probability of having exactly r priming sites in the
nucleic acid molecule and n=T/4^p is the average number of priming
sites for an individual oligonucleotide per nucleic acid molecule of
length T, where 4^p is the number of different combinations of the
four nucleotides that can form an oligonucleotide of length p.
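The Poisson approximation can be evaluated directly. The sketch below uses the standard Poisson form, n^r e^(-n) / r!, with n = T / 4^p as defined above; the function names and the example molecule (a double-stranded DNA of 32,768 base pairs primed with an octamer) are illustrative assumptions.

```python
import math

def mean_sites(T: int, p: int) -> float:
    """Average number of priming sites n = T / 4**p for a primer of length p."""
    return T / 4**p

def poisson(r: int, n: float) -> float:
    """Probability of exactly r priming sites when the average is n."""
    return n**r * math.exp(-n) / math.factorial(r)

# Double-stranded molecule of 32,768 bp, so T = 2L = 65,536 = 4**8.
n = mean_sites(T=2 * 32_768, p=8)

p_none = poisson(0, n)
p_one = poisson(1, n)
p_multiple = 1.0 - p_none - p_one

print(f"n = {n:.3f}")
print(f"P(0) = {p_none:.3f}, P(1) = {p_one:.3f}, P(>1) = {p_multiple:.3f}")
```

With n = 1, both P(0) and P(1) equal e^(-1), or about 0.368.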
Random Priming
Useful sequence information is obtained when DNA synthesis is primed at a
single site in a nucleic acid molecule. For a nucleic acid molecule of
essentially random sequence, the probability P(1) that a randomly selected
oligonucleotide will have a single priming site is a maximum of 0.368 when
n=1 (Table 1). Attempts to prime sequencing reactions where it is not
known whether or where a selected oligonucleotide will prime in the
nucleic acid molecule are referred to in this specification as random
priming. The term "random primer" refers to a primer used for random
priming.
In practice, primers of length 6 or longer are used to prime sequencing
reactions. By simple manipulations of equation 1 and the equation for n,
it is easily shown that for single primers of length 6-10, a value of n
between approximately 0.462 and 1.848, and an expected fraction of
productive primings of sequencing reactions between 0.291 and 0.368, can
be achieved for any single-stranded nucleic acid of length between
approximately 1900 and 1,938,000 bases or any double-stranded nucleic acid
of length between approximately 950 and 969,000 base pairs. This is
illustrated by the figures shown in Table 1, which are rounded off from
the exact calculations. For example, the largest double-stranded molecule
for which P(1) stays above 0.291 with octamer primers is approximately
60,600 base pairs, which is the same size as the smallest molecule for
which P(1) stays above 0.291 with nonamer primers.
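These ranges can be reproduced with a short calculation. The sketch assumes only the Poisson expression for P(1), namely n e^(-n), and the bounds on n quoted above; the names are illustrative. Because the upper bound on n is four times the lower bound, the range served by each primer length ends where the range for the next longer primer begins.

```python
import math

def p_single(n: float) -> float:
    """P(1): probability of exactly one priming site for a Poisson mean of n."""
    return n * math.exp(-n)

N_LOW, N_HIGH = 0.462, 1.848   # bounds on n quoted in the text

# Single-stranded length range (in bases) served by each primer length 6-10.
ranges = {p: (N_LOW * 4**p, N_HIGH * 4**p) for p in range(6, 11)}

print(f"P(1) at bounds: {p_single(N_LOW):.3f}, {p_single(N_HIGH):.3f}")
for p, (low, high) in ranges.items():
    print(f"{p}-mer: {low:,.0f} to {high:,.0f} bases "
          f"({low / 2:,.0f} to {high / 2:,.0f} bp double-stranded)")
```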
The minimum fraction of productive primings can be increased by using a
mixture of two or three primers of the same length. Mixtures of more than
one primer, all of which have the same length, are referred to in this
specification as primer combinations. Increasing the number of primers in
the combination decreases the length of nucleic acid that has a given
value of P(1), and the decrease in length is in the same ratio as the
increase in number of primers. For example, doubling the number of primers
provides the same value of P(1) for a nucleic acid half the length,
quadrupling the number of primers provides the same value of P(1) for a
molecule one-fourth the length, etc.
By the use of single primers or two-primer combinations with primer lengths
in the range of 6 to 10, the value of n can be maintained between
approximately 0.693 and 1.386, and the expected frequency of productive
primings can be maintained between 0.347 and 0.368, for single-stranded
molecules between about 1420 and 1,454,000 bases or for double-stranded
molecules between about 710 and 727,000 base pairs. Again to illustrate
from Table 1, the largest molecule for which P(1) stays above 0.347 with
an octamer primer is 45,400 base pairs. The smallest molecule for which
P(1) stays above 0.347 with a single nonamer primer is twice this size,
90,900 base pairs, but a combination of two nonamers produces the same
value of P(1) for a molecule half the size, which is the same length
molecule as the maximum for octamers.
Extending this analysis, single primers, two-primer combinations, or
three-primer combinations with primer lengths in the range of 6 to 10 can
maintain the value of n between approximately 0.863 and 1.151, and the
expected frequency of productive primings between 0.364 and 0.368, for
single-stranded molecules between about 1180 and 1,206,000 bases or for
double-stranded molecules between about 590 and 603,000 base pairs. Again
to illustrate from Table 1, the largest molecule for which P(1) stays
above 0.364 with an octamer primer is 37,700 base pairs. The smallest
molecule for which P(1) stays above 0.364 with a single nonamer primer is
113,000 base pairs, but a combination of three nonamers produces the same
value of P(1) for a molecule one-third the size, which is the same length
molecule as the maximum for octamers.
Primer combinations containing more than three primers may also be used,
applying the same principles. For example, a single octamer, a combination
of four nonamers, and a combination of 16 decamers all would have the
maximum fraction of productive primings with a double-stranded molecule of
32,800 base pairs (Table 1), as would combinations of 64 primers of length
11 bases, 256 of length 12 bases, or 1024 of length 13 bases. The use of
primer combinations extends the useful range of random priming for a given
nucleic acid molecule to longer primers, which might have advantages in
some situations. For example, longer primers would be expected to have a
higher temperature optimum for priming sequencing reactions.
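The equivalence of these combinations follows from the fact that a combination of k primers of length p has an average of k T / 4^p priming sites. A minimal sketch, with illustrative names, checks the combinations listed above against a double-stranded molecule of 32,768 base pairs (the exact value that the text rounds to 32,800).

```python
def mean_sites(T: int, p: int, primers: int = 1) -> float:
    """Average number of priming sites for a combination of equal-length primers."""
    return primers * T / 4**p

T = 2 * 32_768   # double-stranded, so T = 2L

# (primer length, number of primers in the combination)
combinations = [(8, 1), (9, 4), (10, 16), (11, 64), (12, 256), (13, 1024)]

site_means = [mean_sites(T, p, k) for p, k in combinations]
for (p, k), n in zip(combinations, site_means):
    print(f"{k:>4} primer(s) of length {p:>2}: n = {n}")
```

Each added base in primer length is offset by a fourfold increase in the number of primers, so n remains 1 and P(1) remains at its maximum.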
When using primer combinations, multiple priming may result from unique
priming by more than one primer in the combination. In such cases,
sequence information can be obtained by priming with individual primers
from the combination. The frequency of obtaining sequence information from
such individual primers may be higher than from further random primings,
depending on the average number of priming sites and the number of primers
in the combination.
The above principles allow the method of random priming to be applied to
any nucleic acid molecules that can be analyzed by primed sequencing
techniques. The size ranges given in the above examples are not intended
to limit the invention. The random priming method can also be applied with
any primers suitable for primed sequencing techniques, including primers
longer than 10 and potentially even those shorter than 6. When referring
to random priming, the terms primer and priming are understood to include
the possibility of both single primers and primer combinations unless
stated otherwise.
Directed priming
If the sequence of part of a nucleic acid molecule is known, a primer that
has a single priming site in the known sequence can be used for priming
sequencing reactions. Priming in situations where the primer is known to
have a single priming site within the known sequence is referred to in
this specification as directed priming. The term "directed primer" refers
to a primer used for directed priming. The probability that such a primer
will have only a single priming site in the entire molecule, and will
therefore provide useful sequence information, is the probability P(0)
that no priming site occurs in the unknown sequence. The value of P(0) is
given by equation 1 and depends on the lengths of both the unknown
sequence and the primer.
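For r = 0 the Poisson expression reduces to P(0) = e^(-n), with n computed from the length of the unknown sequence alone. The sketch below illustrates how P(0) varies with primer length; the unknown lengths of 30,000 and 10,000 base pairs are hypothetical values chosen for illustration, and the names are not taken from the specification.

```python
import math

def p_no_site(unknown_bp: int, p: int) -> float:
    """P(0) = exp(-n) for a double-stranded unknown region, where T = 2 * unknown_bp."""
    n = 2 * unknown_bp / 4**p
    return math.exp(-n)

for unknown_bp in (30_000, 10_000):   # hypothetical lengths of unknown sequence
    row = ", ".join(f"{p}-mer {p_no_site(unknown_bp, p):.2f}" for p in (8, 9, 10))
    print(f"{unknown_bp:,} bp unknown: {row}")
```

The probability of a unique priming rises both as the primer is lengthened and as the amount of unknown sequence shrinks.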
SEQUENCING STRATEGY
Oligodeoxyribonucleotides of any desired nucleotide sequence can be
synthesized readily by standard techniques with commercially available
instruments or can be purchased from companies that make them to order
(for example, from Genetic Designs, Inc., Houston, Tex.). Typical
preparations yield 0.2-10 μmole of primer. A sequencing reaction
typically requires about 1 pmole of primer, so each preparation of primer
would contain enough material to prime 2×10^5 to 10^7
separate sequencing reactions.
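The number of reactions supported by one preparation follows from a unit conversion, sketched below with illustrative names. The yields and the 1 pmole requirement are the figures given above.

```python
PMOL_PER_UMOL = 1_000_000   # 1 micromole = 10**6 picomoles

def reactions_per_preparation(yield_umol: float, pmol_per_reaction: float = 1.0) -> int:
    """Number of sequencing reactions that one primer preparation can prime."""
    return round(yield_umol * PMOL_PER_UMOL / pmol_per_reaction)

low = reactions_per_preparation(0.2)
high = reactions_per_preparation(10)

print(f"{low:,} to {high:,} reactions")   # 200,000 to 10,000,000 reactions
```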
The improvement in efficiency of the method of the invention over
conventional methods should be noted. In the conventional directed priming
method, where known sequence is extended from a newly synthesized primer
that primes near the end of the known sequence, primers are typically of
length 16 bases or longer and therefore can be used only once for an
amount of sequence equivalent to the entire human genome. The methods of
this invention use statistical analysis to select primers of lengths that
allow repeated use of primers from the same preparation and therefore have
the potential to lower the cost of primers relative to the amount of
sequence obtained by a factor of 10^5 or more, depending on the volume
of sequencing. The methods will be illustrated with cosmid DNAs such as
might be used for sequencing the human genome.
A typical cosmid DNA might contain 5,000 base pairs of vector DNA and
40,000 base pairs of cloned DNA. The probabilities that randomly selected
primers of lengths 6-12 will have no priming site, exactly one priming
site, or more than one priming site in such a DNA molecule are given in
Table 2. Clearly, hexamers and heptamers are too small to have much chance
of priming useful sequence information in such a cosmid molecule.
Libraries of octamers, nonamers or decamers, on the other hand, could
generate sequence information quite efficiently from large numbers of
different cosmid DNAs.
Combined random and directed priming
In a preferred embodiment of the invention, the sequencing strategy
combines random and directed priming. Initial blocks of sequence are
generated by random priming and these sequences are then extended by
directed priming until they merge. FIG. 1 provides a diagrammatic summary
of this strategy.
Random priming phase
In the first stage, random priming with single octamers would provide
sequence information in a fraction of sequencing reactions equal to 0.348,
the value of P(1) for the cosmid DNAs. The fraction of productive
reactions primed by single nonamers is expected to be only 0.244, but
combinations of 2 nonamers increase this to 0.346 and combinations of 3
nonamers to 0.368. The fraction of productive reactions would also be
0.368 when priming is with combinations of 12 decamers. Thus, the random
priming phase is expected to generate sequence information in slightly
more than one of three sequencing reactions.
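The fractions quoted above can be reproduced with a simple Poisson model. The sketch below is illustrative only: it assumes a 45,000 base pair cosmid (5,000 of vector plus 40,000 of cloned DNA), an essentially random sequence, and priming on either strand. The function name and parameters are hypothetical and are not part of the specification.

```python
import math

def site_probabilities(primer_len, n_primers=1, dna_bp=45_000):
    """Return (P(0), P(1), P(>1)) for the number of priming sites.

    Assumption: sites follow a Poisson distribution in a random sequence.
    """
    # Either strand can be primed, so there are about 2 * dna_bp candidate positions.
    mean_sites = n_primers * 2 * dna_bp / 4 ** primer_len
    p0 = math.exp(-mean_sites)
    p1 = mean_sites * p0
    return p0, p1, 1.0 - p0 - p1

# Cases discussed in the text: (primer length, number of primers combined).
for length, count in [(8, 1), (9, 1), (9, 2), (9, 3), (10, 12)]:
    p0, p1, p_many = site_probabilities(length, count)
    print(f"{count:2d} x {length}-mer: P(0)={p0:.3f}  P(1)={p1:.3f}  P(>1)={p_many:.3f}")
```

Under these assumptions P(1) peaks at about 0.368 when the expected number of sites is one, which is why combining 3 nonamers or 12 decamers gives the highest fraction of productive reactions.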
With current technology, each successful set of sequencing reactions
determines the sequence of several hundred nucleotides. For purposes of
illustration, it will be assumed in this specification that an average of
500 nucleotides of sequence is obtained from each successful priming. It
should be recognized, however, that the same methods apply when the
average lengths of sequence obtained per priming are shorter or longer
than 500. Of course, the longer the block of sequence obtained from each
priming the more efficient will be the sequencing process.
In the random phase, different primers can be used individually and
sequentially or in sets of sequencing reactions that are prepared and
analyzed in parallel. Different sets can themselves be analyzed
sequentially. When priming is done sequentially, succeeding primers or
sets of primers are preferably selected to exclude any that would prime
within the previously determined sequence. In this way, the priming is
restricted to the unknown portion of the molecule. Which of the
embodiments is preferred depends on the specifics of the sequencing
program. In some high volume situations it may be more economical to prime
each DNA individually and sequentially with one primer or primer
combination at a time, although many different DNAs would probably be
analyzed in parallel. On the other hand, where the complete sequence of a
single cosmid or other nucleic acid is desired in the shortest possible
period, it is preferable to perform a set of randomly primed reactions in
parallel to start the sequencing process.
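The exclusion step used in sequential priming can be sketched as a simple filter. This is a minimal illustration that assumes exact matching and sequences written 5' to 3' in the letters A, C, G and T. The function names and the example sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def primes_within(primer, known):
    """True if the primer could anneal to either strand of a known block."""
    # A match to the block itself means priming on its complementary strand;
    # a match of the reverse complement means priming on the block's own strand.
    return primer in known or reverse_complement(primer) in known

def select_next_primers(candidates, known_blocks):
    """Keep only primers with no priming site in previously determined sequence."""
    return [p for p in candidates
            if not any(primes_within(p, block) for block in known_blocks)]

# Made-up example: one known block and three candidate octamers.
known_blocks = ["ACGTTGCAAGGCTTAACCGG"]
candidates = ["TTGCAAGG", "CCTTGCAA", "GGGGGGGG"]
remaining = select_next_primers(candidates, known_blocks)
print(remaining)
```

In this example the first two candidates are dropped because one matches the known block and the other matches its complementary strand. The same filter could be applied to vector sequence or known repeated elements.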
An advantage of the random priming method is that the same set of primers
can be used repeatedly to determine sequences in many different DNAs. What
is meant by repeated use of a primer is that many different samples from
the same preparation of primer are used in many different sequencing
reactions. In this specification, the term "set", as applied to primers,
refers to a group of primers used repeatedly for random priming of many
different DNAs. This is to distinguish "set" from the broader term
"library" which is meant to apply to a collection of primers that is used
repeatedly for directed priming or for both random and directed priming.
Libraries would usually be larger than sets, in which case many different
sets could be assembled from the primers in a library.
In random priming, each successful reaction should produce at least about
500 nucleotides of sequence, and these blocks of sequence should be
distributed at essentially random positions in the DNA molecule. For
cosmid DNAs, the first 10 blocks of randomly primed sequence are expected
to have an average of about one overlap. Because cosmid DNAs are double
stranded, the sequence of the complement of each block of sequence can
also be inferred.
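The estimate of about one overlap among the first 10 blocks can be checked with a rough calculation. The sketch assumes blocks of 500 nucleotides placed independently at random in a 45,000 base pair cosmid and ignores end effects. The function name is hypothetical.

```python
from math import comb

def expected_overlapping_pairs(n_blocks, block_len=500, dna_bp=45_000):
    """Approximate expected number of pairs of blocks that overlap."""
    # Two blocks overlap when their start points lie within one block length
    # of each other, which happens with probability of about 2 * block_len / dna_bp.
    pair_overlap_prob = 2 * block_len / dna_bp
    return comb(n_blocks, 2) * pair_overlap_prob

print(expected_overlapping_pairs(10))
```

With 10 blocks there are 45 pairs, each overlapping with probability of about 1/45, so the expected number of overlaps is approximately one.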
In cosmid DNAs, as in any double-stranded nucleic acid, the complement of
any primer that is unique in the molecule will also be unique. The primer
complement will prime at the same site but will direct DNA synthesis to
the complementary strand and will extend the initial block of sequence in
the opposite direction. Therefore, each of the initial blocks of randomly
primed sequence can be extended at least about 500 base pairs in the
opposite direction. Because of the difficulty in reading nucleotide
sequence close to the primer, there will probably be a short gap between
the two blocks of sequence; however, the location of the gap
is known and such gaps are easily closed when the confirmatory sequences
of the complementary strands are determined by directed priming.
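The relationship between a primer and its complement can be illustrated in a few lines. The sketch assumes exact matching and uses a made-up 20 base sequence. The names `TOP_STRAND` and `priming_sites` are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def priming_sites(primer, top):
    """List (strand, position) for every site whose sequence equals the primer.

    "+" means the primer matches the top strand and is extended rightward;
    "-" means it matches the bottom strand and is extended leftward.
    Positions are reported in top-strand coordinates.
    """
    sites = []
    bottom = reverse_complement(top)
    for strand, text in (("+", top), ("-", bottom)):
        start = text.find(primer)
        while start != -1:
            pos = start if strand == "+" else len(top) - start - len(primer)
            sites.append((strand, pos))
            start = text.find(primer, start + 1)
    return sites

TOP_STRAND = "ACGTTGCAAGGCTTAACCGG"  # made-up example sequence
primer = "TTGCAAGG"
print(priming_sites(primer, TOP_STRAND))
print(priming_sites(reverse_complement(primer), TOP_STRAND))
```

Both calls report a single site at the same position but on opposite strands, so a primer that is unique in the molecule has a complement that is also unique and that directs synthesis in the opposite direction.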
For initiating the sequence of cosmid DNAs by random priming, a set of 30
primers and their complements is expected to generate perhaps 20-25% of
the sequence in 8-11 blocks of about 1000 base pairs each. The same set of
primers could be used repeatedly to initiate the sequence of many
different cosmid DNAs. In each cosmid DNA, about the same fraction of the
30 initial primers is expected to prime uniquely, but the subset of
primers that is unique will normally be different for each DNA molecule
and the blocks of sequence will normally also be different, unless the
cosmid DNAs overlapped in the genomic DNA from which the cosmids were
derived, or unless the priming site is located in a repeated portion of
the genome.
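The figures above are consistent with a rough estimate based on the Poisson model. The sketch assumes, for illustration only, that the 30 primers are octamers used singly, that each uniquely priming primer and its complement together yield a block of about 1000 base pairs, and that overlaps between blocks are ignored. The function name and defaults are hypothetical.

```python
import math

def expected_initial_coverage(n_primers=30, primer_len=8,
                              dna_bp=45_000, block_bp=1000):
    """Return (expected number of blocks, expected fraction of DNA covered)."""
    # Expected number of priming sites per primer, counting both strands.
    mean_sites = 2 * dna_bp / 4 ** primer_len
    # Poisson probability that a primer has exactly one site.
    p_unique = mean_sites * math.exp(-mean_sites)
    blocks = n_primers * p_unique
    return blocks, blocks * block_bp / dna_bp

blocks, fraction = expected_initial_coverage()
print(f"about {blocks:.1f} blocks covering about {fraction:.0%} of the cosmid")
```

Under these assumptions roughly 10 blocks covering a little under a quarter of the molecule are expected, before any allowance for overlapping blocks.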
The primers in the set used to initiate sequencing by random priming can be
selected so as to optimize their usefulness for determining the sequence
of a particular set of nucleic acids. Although it has been assumed that
the nucleotide sequence is essentially random in the nucleic acids to be
sequenced, the statistical analysis can be modified by well known
techniques to take into account known deviations from randomness. For
example, the DNA is often known to be enriched in AT or GC base pairs, and
mammalian DNAs are known to have a strong bias against the dinucleotide
sequence CG, with clustering of the CG sequences that are present. For
some genomes the nucleotide sequences of highly or moderately repeated
elements are known. The sequence of the vector portion that would be
present in each cosmid DNA derived from the same cosmid vector would also
be known or easily determined. The primers in a set used for random
sequencing might for example exclude any that would prime in the vector
portion of the cosmids or in known repeated elements of the genome, and
might be chosen to reflect the average base composition and known
dinucleotide biases of the genome being sequenced. These are examples of
the types of optimization that are possible. Primer selection in individual
cases could be optimized according to what is known about the nucleic acid
being sequenced and the specific goals of the sequencing project.
The initial blocks of sequence provided by random priming give a unique
signature to each cosmid DNA being sequenced. When the same set is used to
prime each cosmid DNA, these initial blocks of sequence are useful for
comparing different cosmids with each other and with emerging blocks of
genomic sequence to detect overlaps. In a large scale genome sequencing
project, it might be more efficient to use the initial blocks of randomly
primed sequence to establish overlaps between cosmids rather than to make
the independent effort to order the cosmid DNAs by other means before
sequencing them. Such a signature provides a great deal more information
than almost any other mapping method, and in a high volume sequencing
facility would be easy to obtain. Where some but not all blocks of
randomly primed se