WikiPatents - Community Patent Review
Method for high-volume sequencing of nucleic acids: random and directed priming with libraries of oligonucleotides    
United States Patent 5407799
Link to this page: http://www.wikipatents.com/5407799.html
Inventor(s): Studier; F. William (Stony Brook, NY)
Abstract: Random and directed priming methods for determining nucleotide sequences by enzymatic sequencing techniques, using libraries of primers of lengths 8, 9 or 10 bases, are disclosed. These methods permit direct sequencing of nucleic acids as large as 45,000 base pairs or larger without the necessity for subcloning. Individual primers are used repeatedly to prime sequence reactions in many different nucleic acid molecules. Libraries containing as few as 10,000 octamers, 14,200 nonamers, or 44,000 decamers would have the capacity to determine the sequence of almost any cosmid DNA. Random priming with a fixed set of primers from a smaller library can also be used to initiate the sequencing of individual nucleic acid molecules, with the sequence being completed by directed priming with primers from the library. In contrast to random cloning techniques, a combined random and directed priming strategy is far more efficient.
   














Inventor     Studier; F. William (Stony Brook, NY)
Owner/Assignee     Associated Universities, Inc. (Washington, DC)
Publication Date     April 18, 1995
Application Number     08/135,317
Filing Date     October 12, 1993
US Classification     435/6 435/91.1
Int'l Classification     C12Q 001/68 C12P 019/34
Examiner     Zitomer; Stephanie W.
Assistant Examiner    
Attorney/Law Firm     Bogosian; Margaret C.
Address
Parent Case     This application is a file wrapper continuation of application Ser. No. 779,290, filed Oct. 18, 1991, now abandoned, which is a continuation-in-part of patent application Ser. No. 407,238, filed Sep. 14, 1989, now abandoned.
Priority Data    
USPTO Field of Search     435/6 435/91.1

References
U.S. References: 5043272, Hartley, 435/5, Aug. 1991
Claims
 


I claim:

1. A statistically-based random-priming method for determining nucleotide sequence in a nucleic acid template having a completely unknown or partly known nucleotide sequence by priming within a region of the template for which the nucleotide sequence is not known, the method comprising the steps of:

a) supplying a template for which the approximate total length of unknown nucleotide sequence is known;

b) selecting a primer or primer combination whose length or lengths relative to the template length are such that the probability P(1) of priming at a single site within the total length of unknown nucleotide sequence in the template is about 0.291-0.368, but excluding any primer that would prime in any part of the template where the nucleotide sequence is known;

c) forming an incubation mixture comprising:

i) the template;

ii) the primer or primer combination selected in step b); and

iii) a polymerizing enzyme;

d) incubating the mixture of step c) under conditions appropriate for primed synthesis of DNA to generate products suitable for determining the nucleotide sequence in the template;

e) analyzing the products of step d) to determine nucleotide sequence, which will be determinable only if priming occurred in step d) and was at a single site in the template; and

f) if necessary, repeating steps b)-e), using different primers or primer combinations, until the nucleotide sequence has been determined.

2. The method of claim 1 wherein the primer combination is a two-primer combination.

3. The method of claim 1 wherein the primer combination is a three-primer combination.

4. The method of claim 1 wherein the primer or primer combination is selected from a primer library.

5. The method of claim 1 wherein the primer is an oligonucleotide 6, 7, 8, 9 or 10 bases long.

6. The method of claim 1 wherein the primer is selected from a primer library comprised of hexamers, heptamers, octamers, nonamers, decamers and primer combinations thereof.

7. The method of claim 1 wherein the nucleic acid template is DNA.

8. The method of claim 1 wherein the nucleic acid template is RNA.

9. A statistically-based directed-priming method that uses primers selected from a primer library to determine nucleotide sequence in a nucleic acid template having a nucleotide sequence which is partly known and partly unknown, the method comprising the steps of:

a) supplying a template of known approximate length;

b) supplying a primer library containing primers having lengths relative to the template length such that the probability P(0) that an individual primer will have no perfectly complementary priming site in any unknown nucleotide sequence in the template is greater than about 0.25, said library being of a size that the average priming interval is less than about half the average length of nucleotide sequence that can be determined from a single priming site;

c) selecting from the primer library a primer that is perfectly complementary to one and only one site in the known nucleotide sequence in the template;

d) forming an incubation mixture comprising:

i) the template;

ii) the primer selected in step c); and

iii) a polymerizing enzyme;

e) incubating the mixture of step d) under conditions appropriate for primed synthesis of DNA to generate products suitable for determining nucleotide sequence in the template; and

f) analyzing the products of step e) to determine nucleotide sequence, which will be determinable only if the priming step e) was at a single site in the template.

10. The method of claim 9 wherein the primer library is comprised of hexamers, heptamers, octamers, nonamers, decamers and primer combinations thereof.

11. The method of claim 10 wherein the nucleic acid template is DNA.

12. The method of claim 10 wherein the nucleic acid template is RNA.

13. A statistically-based combined random- and directed-priming method for determining nucleotide sequence in a nucleic acid template having a completely unknown or partly known nucleotide sequence, comprising the steps of:

a) supplying a template of known approximate length and for which the approximate total length of unknown nucleotide sequence is known;

b) supplying a primer library containing primers having lengths relative to the template length such that the probability P(0) that an individual primer will have no perfectly complementary priming site in any unknown nucleotide sequence part in the template is greater than about 0.25, said library being of a size that the average priming interval is less than about half the average length of nucleotide sequence that can be determined from a single priming site;

c) selecting from the primer library a primer or primer combination whose length or lengths are such that the probability P(1) of priming at a single site within the total length of the unknown nucleotide sequence part in the template is about 0.291-0.368, but excluding any primer that would prime in any part of the template for which the nucleotide sequence is known;

d) forming an incubation mixture comprising:

i) the template;

ii) the primer or primer combination selected in step c); and

iii) a polymerizing enzyme;

e) incubating the mixture of step d) under conditions appropriate for primed synthesis of DNA to generate products suitable for determining nucleotide sequence in the template;

f) analyzing the products of step e) to determine nucleotide sequence, which will be determinable only if priming occurred in step e) and was at a single site in the template;

g) if necessary, repeating steps c)-f), using different primers or primer combinations, until nucleotide sequence information is determined;

h) selecting from the primer library a primer that is perfectly complementary to one and only one site in the nucleotide sequence which is known in the template;

i) forming an incubation mixture comprising:

i) the template;

ii) the primer selected in step h); and

iii) a polymerizing enzyme;

j) incubating the mixture of step i) under conditions appropriate for primed synthesis of DNA to generate products suitable for determining nucleotide sequence in the template; and

k) analyzing the products of step j) to determine nucleotide sequence, which will be determinable only if the priming in step j) was at a single site in the template.

14. The method of claim 13 wherein the primer combination selected in step c) is a two-primer combination.

15. The method of claim 13 wherein the primer combination selected in step c) is a three-primer combination.

16. The method of claim 13 wherein the primer library is comprised of hexamers, heptamers, octamers, nonamers, decamers and primer combinations thereof.

17. The method of claim 13 wherein the nucleic acid template is DNA.

18. The method of claim 13 wherein the nucleic acid template is RNA.
Description
 


BACKGROUND OF THE INVENTION

The ability to determine nucleotide sequences has had enormous impact on biology, medicine and biotechnology. An appreciation of the benefits of knowing the nucleotide sequences of genes, chromosomes, and entire genomes has led to the current proposals to determine the nucleotide sequence of the human genome and the genomes of other well studied or economically important organisms.

Cloning and mapping specific DNA fragments is an important part of the strategy for sequencing large genomes. The entire human genome of about 3×10^9 base pairs could be contained in a set of about 100,000 cosmids, each of which contains about 40,000 or more base pairs of human DNA. Even larger segments can be cloned in yeast artificial chromosomes. The genome sequencing problem then reduces to the problem of sequencing a large number of DNAs of 40,000 or more base pairs. Such an enterprise represents a tremendous increase in scale over the most ambitious sequencing projects that have been undertaken heretofore. If cosmids were sequenced at the rate of one a day, a formidable task for a sequencing center using today's technology, centuries would be required to complete the task.
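The scale figures in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the 365-day year and the variable names are illustrative assumptions.

```python
# Back-of-the-envelope check of the scale figures quoted in the text.
GENOME_BP = 3_000_000_000   # about 3 x 10^9 base pairs, from the text
COSMID_INSERT_BP = 40_000   # human DNA carried by one cosmid, from the text
COSMIDS_QUOTED = 100_000    # "about 100,000 cosmids", from the text

# Fewest cosmids that could tile the genome with no overlap at all.
min_cosmids = GENOME_BP // COSMID_INSERT_BP

# Time to work through the quoted set at one cosmid per day
# (a 365-day year is a simplifying assumption).
years_at_one_per_day = COSMIDS_QUOTED / 365

print(f"non-overlapping cosmids needed: {min_cosmids:,}")
print(f"years at one cosmid per day:    {years_at_one_per_day:.0f}")
```

The quoted set of about 100,000 cosmids is larger than the non-overlapping minimum, and at one cosmid per day it corresponds to well over two centuries of work.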

Currently useful methods for determining nucleotide sequence involve generating nucleic acid fragments having defined ends and resolving them according to size, using gel electrophoresis. These defined fragments are produced chemically (Maxam & Gilbert, Proc. Nat. Acad. Sci. USA, 74, 560-564 (1977); Methods in Enzymology 65, 499-560 (1980)), enzymatically (Sanger et al., Proc. Nat. Acad. Sci. USA 74, 5463-5467 (1977)), or by some combination of the two, and are typically identified in electrophoresis patterns by radioactivity, fluorescence or chemical reactivity.

The enzymatic sequencing technique has been highly developed, and several different DNA polymerases and reverse transcriptases are used for this purpose. These enzymes can be used for sequencing double-stranded or single-stranded DNA or RNA. Oligonucleotide primers direct DNA synthesis from a specific site in the molecule, which generates the common end needed for sequence analysis. The variable end is typically generated by incorporation of specific chain terminators, such as dideoxynucleotide triphosphates, or by incorporation of nucleoside triphosphate derivatives and subsequent cleavage of the molecule at the site of incorporation.

Specific priming is critical for the success of the enzymatic sequencing technique. Much is known about the specific association between oligonucleotides and longer nucleic acids, and about the ability of specifically associated oligonucleotides to prime DNA synthesis by the enzymes used for nucleotide sequencing (for example, M. Smith, in "Methods of DNA and RNA Sequencing", edited by S.M. Weissman, Praeger Publishers, New York, pp 23-68, 1983). Oligonucleotides as short as three or four bases long have been reported to prime DNA synthesis, and a mixture of hexamers is widely used to prime random DNA synthesis for labeling hybridization probes. Oligonucleotides of length 6 or longer are useful for priming specific sequencing reactions.

In practice, blocks of nucleotide sequence up to several hundred but rarely as long as a thousand nucleotides can be determined from the products of a single sequencing reaction or set of reactions. Cosmid DNAs, and in fact most nucleic acids of interest, are much longer than the few hundred nucleotides that is the basic unit of sequence determination. Therefore, a substantial part of the effort involved in sequencing genes or genomes, or almost any nucleic acid, must be devoted to obtaining and assembling the many individual blocks of a few hundred nucleotides of sequence that make up the entire nucleic acid to be sequenced. If an average of 500 nucleotides of sequence could be obtained in each analysis, a minimum of 160 sets of sequencing reactions would have to be prepared and analyzed to obtain the sequence of both strands of one cosmid DNA.
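The figure of 160 sets of reactions follows directly from the cosmid length, the read length, and the requirement to cover both strands. A minimal sketch of that calculation, with a hypothetical helper name and the assumption that reads tile each strand end to end with no overlap:

```python
import math

def min_reaction_sets(length_bp: int, read_length: int, strands: int = 2) -> int:
    """Fewest reads that cover every base of every strand once.

    Assumes reads tile each strand with no overlap, which is a lower
    bound rather than something achievable in practice.
    """
    return strands * math.ceil(length_bp / read_length)

# A 40,000 bp cosmid insert read 500 nucleotides at a time, both strands.
print(min_reaction_sets(40_000, 500))
```

Any overlap between adjacent blocks, or any failed reaction, raises the real number above this minimum.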

Several strategies have been developed for obtaining and ordering the many individual blocks of sequence needed to determine the entire sequence of larger molecules. One strategy is to use restriction enzymes to obtain and map specific fragments of the DNA molecule. The nucleotide sequences of appropriate fragments are determined, and the sequence of the entire molecule is assembled from the known positions of the fragments. An example of the use of this strategy is the determination of the sequence of T7 DNA, a double-stranded molecule about 40,000 bp long (Dunn & Studier, J. Mol. Biol. 166, 477-535 (1983)). However, such a strategy is too labor intensive to be economical for sequencing large numbers of DNA molecules.

A more typical strategy is to subclone random fragments of the DNA into a cloning vector, typically derived from M13. The sequence of the cloned DNA is usually determined by the enzymatic sequencing technique, starting from a unique priming site within the vector DNA. Randomly selected subclones are sequenced, and the sequence of the original DNA is reconstructed from overlaps among the many blocks of sequence obtained from the different subclones. The sequence of lambda DNA, about 48,500 bp long (Sanger et al., J. Mol. Biol. 162, 729-773 (1982)), was determined by extensive use of such a strategy. Although relatively efficient in the early stages, a random cloning strategy becomes highly redundant in the later stages. In a purely random strategy, perhaps ten times the minimum possible number of sequence analyses may have to be done before all of the blocks of sequence can be overlapped. In practice, labor intensive mapping techniques are often used to close gaps.

Modifications have improved the efficiency of random cloning strategies. The length of continuous sequence that can be generated from a single priming site in a cloning vector can be extended considerably by generating sets of nested deletions that bring different portions of the DNA close to the priming site (Barnes, Methods in Enzymology, 152, 538-(1987)). However, this remains relatively labor intensive for a large scale sequencing effort. Multiplexing improves sequencing efficiency by allowing a single gel electrophoresis pattern to be probed repeatedly to determine the sequence of many different cloned DNAs (Church & Kieffer-Higgins, Science 240, 185-188 (1988)). However, all subcloning strategies suffer from the necessity to prepare many different clones and isolate DNA from each of them, an effort that will typically be comparable to that required to do the sequence analyses themselves.

A directed priming, or "walking" strategy allows the sequence to be determined directly from a nucleic acid molecule of interest without mapping or subcloning, a considerable savings in effort. To use directed priming, at least a small portion of the nucleotide sequence in the molecule must be known or determined in some other way. This known sequence information is used to synthesize a primer for enzymatic sequencing reactions that will extend the sequence into the unknown region. Such primers are synthesized by well known techniques or can be purchased commercially and are typically at least 16 nucleotides long, so as to be unique in the entire molecule. In order to continue extending the sequence further along the molecule, a new primer must be synthesized for every few hundred nucleotides of sequence obtained. Although a directed priming strategy eliminates the considerable effort needed for mapping or subcloning, the cost of primers nevertheless makes the directed priming strategy very expensive for large scale sequencing.

The recently described polymerase chain reaction (PCR) for amplifying specific segments of a DNA molecule is also being used to prepare samples for sequencing (Saiki et al., Science 239, 487-491 (1988); Stoflet et al., Science 239, 491-494 (1988)). This technique can eliminate the subcloning steps, and the PCR primers themselves can be used as primers for sequencing by the enzymatic technique. However, the use of this technique requires knowledge of the nucleotide sequence flanking the region to be amplified, information that is generally not available at the outset, and the cost of primers would be comparable to that for the directed priming strategy.

Although determination of nucleotide sequences has become routine, high volume sequencing is still a difficult problem. The need for methods that allow more efficient high volume sequencing is widely recognized and is being addressed in various ways. Machines are being developed to carry out sequencing reactions and to automate DNA sample preparation and collection of data. Completely novel sequencing methods that do not require resolution of DNA fragments by gel electrophoresis are also being explored. For example, Drmanac et al. (Genomics 4, 114-128 (1989)) have proposed a method based on the pattern of hybridization of oligonucleotides to the DNA to be sequenced. However, these initiatives have not yet had a practical impact.

The current state of the art in high volume sequencing was summarized in a brief report in Science (242, 1245, Dec. 2, 1988). Bart Barrell and Ellson Chen, whose laboratories have led the way in high volume sequencing and had sequenced the largest contiguous stretches of DNA at that time, reportedly concluded that the current technology realistically allows one skilled technician to sequence about 50,000 bases a year, and even that output is difficult to sustain. This rate of sequencing is still far short of the capacity needed for projects like sequencing the human genome.

SUMMARY OF THE INVENTION

The present invention is directed to a more efficient method for determining the sequence of nucleotides in nucleic acids. The method greatly reduces the cost and effort of nucleotide sequencing and is particularly suitable for very large scale sequence determinations such as the proposed determination of the nucleotide sequence of the entire human genome.

The present invention provides methods for improving the efficiency and economy of enzymatic nucleotide sequencing. The methods include a random priming method for determining the sequence of nucleotides in parts of a nucleic acid molecule where the sequence is not known, the method comprising the steps of:

(a) mixing said nucleic acid molecule with a primer or primer combination under conditions suitable for forming a primed substrate for DNA synthesis by a polymerizing enzyme that is suitable for nucleotide sequencing, said primer or primer combination having a length and composition such that the average number of priming sites in those parts of the nucleic acid molecule where the sequence of nucleotides is not known is expected statistically to be between 0.05 and 4.5, but excluding primers and primer combinations that would prime in any parts of the nucleic acid molecule where the sequence of nucleotides is known, said mixing being either previous to or simultaneous with step (b);

(b) incubating the mixture of step (a) with a polymerizing enzyme under conditions suitable for primed synthesis of DNA that can be used for determining nucleotide sequence;

(c) analyzing the reaction products to determine the sequence of a block of nucleotides in any DNA that was synthesized from a single priming site in the nucleic acid molecule; and

(d) repeating steps (a)-(c), using different primers or primer combinations, until one or more blocks of nucleotide sequence have been determined.
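Step (a) above constrains the expected number of priming sites in the unknown region to lie between 0.05 and 4.5. The sketch below shows one way such a window could be applied when choosing primer lengths and combination sizes. It rests on assumptions that are not stated in the steps above: the template is treated as random sequence with all four bases equally frequent, the number of sites is treated as Poisson distributed, and the expected sites of a combination are taken as the sum over its primers. All function names are hypothetical.

```python
import math

def expected_sites(total_bases: int, primer_len: int, n_primers: int = 1) -> float:
    """Expected number of perfect-match sites for n_primers primers of one length.

    Assumption: random template, equal base frequencies, so a given primer
    matches a given position with probability 4 ** -primer_len.
    """
    return n_primers * total_bases / 4 ** primer_len

def p_single_site(m: float) -> float:
    """Poisson probability of exactly one priming site when the mean is m."""
    return m * math.exp(-m)

def usable_choices(total_bases, lengths=range(6, 11), max_combo=3, lo=0.05, hi=4.5):
    """Return (primer_len, n_primers, m, P(1)) for every choice with lo <= m <= hi."""
    rows = []
    for p in lengths:
        for k in range(1, max_combo + 1):
            m = expected_sites(total_bases, p, k)
            if lo <= m <= hi:
                rows.append((p, k, m, p_single_site(m)))
    return rows

# Hypothetical case: 40,000 bp of unknown double-stranded sequence,
# i.e. 80,000 bases in which a primer could find a site.
for p, k, m, p1 in usable_choices(80_000):
    print(f"{k} x {p}-mer: expected sites = {m:.3f}, P(exactly one) = {p1:.3f}")
```

Under this model the probability of exactly one site peaks at about 0.368 when the expected number of sites is 1, so choices whose expected number is near 1 waste the fewest reactions.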

The present invention further provides a directed priming method that repeatedly uses the same primers for determining or confirming the sequence of nucleotides in different nucleic acid molecules for which at least a portion of the nucleotide sequence is known, the method comprising the steps of:

(a) selecting a primer having 8, 9 or 10 bases, the primer being perfectly complementary to one and only one site in the known sequence of nucleotides in a nucleic acid molecule, said site being located so that the primer, by associating at said site, is capable of priming a polymerizing enzyme to synthesize DNA complementary to a region of the nucleic acid molecule where the nucleotide sequence is to be determined or confirmed, and said primer being obtained from a primer library or being newly prepared and the unused portion being deposited in a primer library;

(b) mixing said primer and nucleic acid molecule under conditions suitable for forming a primed substrate for DNA synthesis by a polymerizing enzyme that is suitable for nucleotide sequencing, said mixing occurring under conditions where perfect pairing is sufficiently greater than mismatched pairing that nucleotide sequence can be determined if exactly one perfect pairing site exists in the nucleic acid molecule, and said mixing being either previous to or simultaneous with step (c);

(c) incubating the mixture of step (b) with a polymerizing enzyme under conditions suitable for primed synthesis of DNA that can be used for determining nucleotide sequence;

(d) analyzing the reaction products to determine the sequence of nucleotides in any DNA that was synthesized from a single priming site in the nucleic acid molecule;

(e) repeating steps (a)-(d) until the desired sequences have been determined or until all blocks of nucleotide sequence merge or reach the ends of the molecule; and

(f) repeating steps (a)-(e) to determine nucleotide sequences of different nucleic acid molecules.
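The selection in step (a) of the directed priming method, a primer that is perfectly complementary to one and only one site in the known sequence, can be sketched as a simple search over a primer library. This is an illustrative sketch only: the function names, the toy sequence and the three-primer library are hypothetical, the template is assumed to be double-stranded so both strands are searched, and nothing here checks the unknown part of the molecule, where uniqueness can only be estimated statistically.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def count_sites(primer: str, strand: str) -> int:
    """Count positions on `strand` (overlaps allowed) where the primer pairs perfectly."""
    target = reverse_complement(primer)  # the stretch of template the primer anneals to
    return sum(
        1
        for i in range(len(strand) - len(target) + 1)
        if strand.startswith(target, i)
    )

def unique_primers(library, known_top_strand):
    """Library primers with exactly one perfect site in the known region, both strands counted."""
    bottom = reverse_complement(known_top_strand)
    return [
        p for p in library
        if count_sites(p, known_top_strand) + count_sites(p, bottom) == 1
    ]

# Toy data, purely illustrative.
known = "GATTACAGGCTTAACCGGTTAGC"
library = ["AAGCCTGT", "AACCGGTT", "CCCCCCCC"]
print(unique_primers(library, known))
```

In this toy case the second primer is rejected because it is self-complementary and so finds a site on each strand, and the third because it finds no site at all.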

Additionally, the present invention provides a combined random and directed priming method that repeatedly uses the same primers for determining the sequence of nucleotides in different nucleic acid molecules, the method comprising the steps of:

(a) mixing a nucleic acid molecule with a random primer or primer combination under conditions suitable for forming a primed substrate for DNA synthesis by a polymerizing enzyme that is suitable for nucleotide sequencing, said random primer or primer combination having a length or lengths and composition such that the average number of priming sites in those parts of the nucleic acid molecule where the sequence of nucleotides is not known is expected statistically to be between 0.05 and 4.5, but excluding primers and primer combinations that would prime in any parts of the nucleic acid molecule where the sequence of nucleotides is known, said mixing being either previous to or simultaneous with step (b);

(b) incubating the mixture of step (a) with a polymerizing enzyme under conditions suitable for primed synthesis of DNA that can be used for determining nucleotide sequence;

(c) analyzing the reaction products to determine the sequence of nucleotides in DNA that was synthesized from a single priming site in the nucleic acid molecule;

(d) repeating steps (a)-(c), using different random primers or primer combinations, sequentially or in parallel, until one or more blocks of nucleotide sequence have been determined;

(e) selecting a directed primer that is perfectly complementary to one and only one site in the known sequence of nucleotides in the nucleic acid molecule, whether said sequence was previously known or determined in steps (a)-(d), said site being located so that the directed primer, by associating at said site, is capable of priming a polymerizing enzyme to synthesize DNA complementary to a region of the nucleic acid molecule where the nucleotide sequence is to be determined or confirmed, and said directed primer being obtained from a primer library or being newly prepared and the unused portion being deposited in a primer library;

(f) mixing said directed primer and nucleic acid molecule under conditions suitable for forming a primed substrate for DNA synthesis by a polymerizing enzyme that is suitable for nucleotide sequencing, said mixing occurring under conditions where perfect pairing is sufficiently greater than mismatched pairing that nucleotide sequence can be determined if exactly one perfect pairing site exists in the nucleic acid molecule, and said mixing being either previous to or simultaneous with step (g);

(g) incubating the mixture of step (f) with a polymerizing enzyme under conditions suitable for primed synthesis of DNA that can be used for determining nucleotide sequence;

(h) analyzing the reaction products to determine the sequence of nucleotides in any DNA that was synthesized from a single priming site in the nucleic acid molecule;

(i) repeating steps (e)-(h) until the desired sequences have been determined or until all blocks of nucleotide sequence merge or reach the ends of the molecule; and

(j) repeating steps (a)-(i) to determine nucleotide sequences of different nucleic acid molecules.

DETAILED DESCRIPTION OF THE DRAWINGS

FIG. 1. Sequencing the cloned portion of a cosmid DNA by random and directed priming. Line lengths are to scale for an unknown sequence of 40,000 bp, vector sequences of 2,500 bp at each end, and primings that produce 500 nucleotides of sequence each. The expected fraction of unique primings at each stage is given; during directed extension this fraction would be 0.40-0.74 for octamer primers, 0.80-0.93 for nonamers (as shown), or 0.94-0.98 for decamers (Table 2). Priming from within the vector sequences into the ends of the cloned DNA is assumed to use primers that are long enough to be unique.

FIG. 2. Autoradiogram showing nucleotide sequence primed by an octamer primer in T7 DNA, as described in Example 1.

DETAILED DESCRIPTION OF THE INVENTION

The invention teaches random and directed priming methods wherein a statistical approach greatly improves the efficiency and economy of enzymatic nucleotide sequencing. Individual preparations of oligonucleotide primers typically provide enough material for hundreds of thousands of primings and the invention teaches methods for efficient use of this material to obtain sequence information from many different nucleic acid molecules. The methods do not require mapping or subcloning and are applicable to nucleic acid molecules of any size suitable for primed enzymatic sequencing.

The invention relates to the use of primers, selected from a primer library, to determine or confirm the sequence of nucleotides in nucleic acid molecules for which at least a portion of the nucleotide sequence is known. The primer library is a central supply of primers, a collection of different primers where each primer in the collection is present in sufficient quantity so that samples can be removed to be used in many sequencing reactions. The sequence of each oligonucleotide in each of the primers included in the primer library is known. For random priming, as defined herein, sets of preparations of primers or primer combinations may be used. All the sequences in such sets are known. Each sample of primer or primer combination taken from the primer library to be used to prime DNA synthesis in the initial steps of the sequencing methods described herein comprises only a portion of a single preparation, and different portions of the single preparation of the primer contained in the primer library are used to prime DNA synthesis in different nucleic acid molecules.

STATISTICAL ANALYSIS OF OLIGONUCLEOTIDE PRIMING

The present invention is based on a consideration of the statistics of the priming of enzymatic DNA synthesis as applied to nucleotide sequencing. Primers in the enzymatic sequencing method are typically oligodeoxyribonucleotides. However, oligoribonucleotides, oligoribonucleotides containing methylphosphonate bonds, and perhaps other types of linkages of normal DNA or RNA bases, or bases such as inosine, 5-bromouracil, or other modified bases that are not normally found in DNA or RNA, can also associate specifically with template nucleic acids and prime sequencing reactions, as can such linkages of bases which are themselves linked to various reporter groups such as fluorescent tags, biotin etc.

In this specification, the terms primer or oligonucleotide are meant to specify a molecule containing a defined sequence of bases linked together in such a way that said molecule is capable of specific association according to known base pairing rules with a sequence of bases in the template nucleic acid, and is capable of priming DNA synthesis reactions suitable for nucleotide sequencing. The terms hexamer, heptamer, octamer, nonamer and decamer are meant to refer specifically to primers of length 6, 7, 8, 9 and 10 bases, respectively. The nucleic acid to be sequenced may be referred to for convenience as DNA, but it should be understood that this is only for convenience and that the invention applies to single-stranded or double-stranded DNA molecules or to single-stranded or double-stranded RNA molecules. A primer that primes at one and only one site in a nucleic acid molecule may be referred to as a unique primer for that molecule and the priming site as a unique priming site in that molecule.

In the statistical analysis of priming, which provides the basis for the present invention, important parameters are the length of the primer, p, and the total length of the nucleic acid to be sequenced, T. The length of the primer is the number of bases in the primer molecule that are capable of specific base pairing with the template nucleic acid. For a single-stranded nucleic acid, T=L, where L is the number of bases in the chain. For a double-stranded nucleic acid having complementary strands of equal length, T=2L, where L is again the number of bases in a single chain, which also equals the number of base pairs. For substantially equimolar mixtures of different nucleic acid chains, including double-stranded nucleic acids having complementary strands of unequal length, T=ΣL, the sum of the numbers of bases in the individual chains. For mixtures of the type that would be equivalent to random breakage of a unique molecule, T is the total number of bases that would have been in the unique molecule.

In the statistical analysis, it is assumed that primers of arbitrary length prime DNA synthesis at every perfectly complementary sequence in a template nucleic acid molecule but at no other sequence. The number of potential priming sites in the molecule is approximately equal to the total number of bases T. For a nucleic acid molecule of random sequence, the expected frequency of priming sites for a single randomly selected oligonucleotide is approximated by the Poisson distribution

$$P(r) = \frac{n^{r} e^{-n}}{r!} \qquad (1)$$

where P(r) is the probability of having exactly r priming sites in the nucleic acid molecule and n=T/4^p is the average number of priming sites for an individual oligonucleotide per nucleic acid molecule of length T, where 4^p is the number of different combinations of the four nucleotides that can form an oligonucleotide of length p.
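The Poisson expression above can be evaluated directly. The following Python is a minimal sketch under the stated assumptions (random sequence, priming only at perfectly complementary sites); the function names are illustrative and do not appear in the specification.

```python
import math


def mean_priming_sites(total_bases, primer_length):
    """n = T / 4**p: average number of priming sites per molecule."""
    return total_bases / 4 ** primer_length


def prob_sites(r, n):
    """Poisson probability of exactly r priming sites when the mean is n."""
    return n ** r * math.exp(-n) / math.factorial(r)


# Illustrative case: a double-stranded molecule of 32,768 base pairs (T = 2L)
# primed with a single octamer gives n = 1, where P(1) is at its maximum.
n = mean_priming_sites(2 * 32_768, 8)
print(n, round(prob_sites(1, n), 3))
```

Passing T rather than L keeps the single-stranded and double-stranded cases in one function; the caller supplies T=L or T=2L as defined above.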

Random Priming

Useful sequence information is obtained when DNA synthesis is primed at a single site in a nucleic acid molecule. For a nucleic acid molecule of essentially random sequence, the probability P(1) that a randomly selected oligonucleotide will have a single priming site is a maximum of 0.368 when n=1 (Table 1). Attempts to prime sequencing reactions where it is not known whether or where a selected oligonucleotide will prime in the nucleic acid molecule are referred to in this specification as random priming. The term "random primer" refers to a primer used for random priming.

In practice, primers of length 6 or longer are used to prime sequencing reactions. By simple manipulations of equation 1 and the equation for n, it is easily shown that for single primers of length 6-10, a value of n between approximately 0.462 and 1.848, and an expected fraction of productive primings of sequencing reactions between 0.291 and 0.368, can be achieved for any single-stranded nucleic acid of length between approximately 1900 and 1,938,000 bases or any double-stranded nucleic acid of length between approximately 950 and 969,000 base pairs. This is illustrated by the figures shown in Table 1, which are rounded off from the exact calculations. For example, the largest double-stranded molecule for which P(1) stays above 0.291 with octamer primers is approximately 60,600 base pairs, which is the same size as the smallest molecule for which P(1) stays above 0.291 with nonamer primers.
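The size ranges quoted above follow from T = n x 4^p. As a hedged illustration, the Python below computes, for each primer length, the window of molecule sizes over which n stays between the quoted bounds of 0.462 and 1.848. The function name and its keyword arguments are assumptions made for this sketch.

```python
def length_window(primer_length, n_low=0.462, n_high=1.848, strands=1):
    """Molecule sizes (bases, or base pairs when strands=2) keeping n in bounds.

    Since n = T / 4**p, the bounds on n translate to T = n * 4**p, and a
    double-stranded molecule of L base pairs has T = 2L.
    """
    library = 4 ** primer_length
    return n_low * library / strands, n_high * library / strands


# Windows for double-stranded molecules, primer lengths 6 to 10.
for p in range(6, 11):
    smallest, largest = length_window(p, strands=2)
    print(p, round(smallest), round(largest))
```

Because the upper bound on n is four times the lower bound, the window for each primer length ends where the window for the next longer primer begins, which is why the octamer maximum and the nonamer minimum coincide.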

The minimum fraction of productive primings can be increased by using a mixture of two or three primers of the same length. Mixtures of more than one primer, all of which have the same length, are referred to in this specification as primer combinations. Increasing the number of primers in the combination decreases the length of nucleic acid that has a given value of P(1), and the decrease in length is in the same ratio as the increase in number of primers. For example, doubling the number of primers provides the same value of P(1) for a nucleic acid half the length, quadrupling the number of primers provides the same value of P(1) for a molecule one-fourth the length, etc.

By the use of single primers or two-primer combinations with primer lengths in the range of 6 to 10, the value of n can be maintained between approximately 0.693 and 1.386, and the expected frequency of productive primings can be maintained between 0.347 and 0.368, for single-stranded molecules between about 1420 and 1,454,000 bases or for double-stranded molecules between about 710 and 727,000 base pairs. Again to illustrate from Table 1, the largest molecule for which P(1) stays above 0.347 with an octamer primer is 45,400 base pairs. The smallest molecule for which P(1) stays above 0.347 with a single nonamer primer is twice this size, 90,900 base pairs, but a combination of two nonamers produces the same value of P(1) for a molecule half the size, which is the same length molecule as the maximum for octamers.

Extending this analysis, single primers, two-primer combinations, or three-primer combinations with primer lengths in the range of 6 to 10 can maintain the value of n between approximately 0.863 and 1.151, and the expected frequency of productive primings between 0.364 and 0.368, for single-stranded molecules between about 1180 and 1,206,000 bases or for double-stranded molecules between about 590 and 603,000 base pairs. Again to illustrate from Table 1, the largest molecule for which P(1) stays above 0.364 with an octamer primer is 37,700 base pairs. The smallest molecule for which P(1) stays above 0.364 with a single nonamer primer is 113,000 base pairs, but a combination of three nonamers produces the same value of P(1) for a molecule one-third the size, which is the same length molecule as the maximum for octamers.

Primer combinations containing more than three primers may also be used, applying the same principles. For example, a single octamer, a combination of four nonamers, and a combination of 16 decamers all would have the maximum fraction of productive primings with a double-stranded molecule of 32,800 base pairs (Table 1), as would combinations of 64 primers of length 11 bases, 256 of length 12 bases, or 1024 of length 13 bases. The use of primer combinations extends the useful range of random priming for a given nucleic acid molecule to longer primers, which might have advantages in some situations. For example, longer primers would be expected to have a higher temperature optimum for priming sequencing reactions.
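The equivalence among combinations described above can be checked numerically. The sketch below assumes that a combination of k primers of equal length behaves like a single primer with k times the average number of priming sites, which is how the scaling is described in this section; the function name is illustrative.

```python
import math


def prob_single_site(total_bases, primer_length, primers_in_combination=1):
    """P(1) for a combination of k same-length primers, with n = k * T / 4**p."""
    n = primers_in_combination * total_bases / 4 ** primer_length
    return n * math.exp(-n)


# A double-stranded molecule of 32,768 base pairs (about 32,800), so T = 2L.
T = 2 * 32_768
for p, k in [(8, 1), (9, 4), (10, 16), (11, 64), (12, 256), (13, 1024)]:
    print(p, k, round(prob_single_site(T, p, k), 3))
```

Each added base multiplies the number of possible primers by four, so four times as many primers are needed in the combination to keep n, and therefore P(1), unchanged.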

When using primer combinations, multiple priming may result from unique priming by more than one primer in the combination. In such cases, sequence information can be obtained by priming with individual primers from the combination. The frequency of obtaining sequence information from such individual primers may be higher than from further random primings, depending on the average number of priming sites and the number of primers in the combination.

The above principles allow the method of random priming to be applied to any nucleic acid molecules that can be analyzed by primed sequencing techniques. The size ranges given in the above examples are not intended to limit the invention. The random priming method can also be applied with any primers suitable for primed sequencing techniques, including primers longer than 10 and potentially even those shorter than 6. When referring to random priming, the terms primer and priming are understood to include the possibility of both single primers and primer combinations unless stated otherwise.

Directed priming

If the sequence of part of a nucleic acid molecule is known, a primer that has a single priming site in the known sequence can be used for priming sequencing reactions. Priming in situations where the primer is known to have a single priming site within the known sequence is referred to in this specification as directed priming. The term "directed primer" refers to a primer used for directed priming. The probability that such a primer will have only a single priming site in the entire molecule, and will therefore provide useful sequence information, is the probability P(0) that no priming site occurs in the unknown sequence. The value of P(0) is given by equation 1 and depends on the lengths of both the unknown sequence and the primer.
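For r = 0 the Poisson expression reduces to P(0) = e^(-n), with n computed from the length of the unknown sequence only. The Python below is a minimal sketch of that calculation; the function name and the example molecule (35,000 base pairs still unknown) are hypothetical and chosen only for illustration.

```python
import math


def prob_directed_primer_unique(unknown_bases, primer_length):
    """P(0) = exp(-n), with n = T_unknown / 4**p.

    unknown_bases is the total number of bases T in the unknown sequence
    (2L for a double-stranded region of L base pairs).
    """
    n = unknown_bases / 4 ** primer_length
    return math.exp(-n)


# Hypothetical double-stranded molecule with 35,000 base pairs still unknown.
unknown_T = 2 * 35_000
for p in (8, 9, 10):
    print(p, round(prob_directed_primer_unique(unknown_T, p), 3))
```

The probability rises as the unknown region shrinks, so a directed primer chosen late in a project is more likely to be unique than one chosen early.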

SEQUENCING STRATEGY

Oligodeoxyribonucleotides of any desired nucleotide sequence can be synthesized readily by standard techniques with commercially available instruments or can be purchased from companies that make them to order (for example, from Genetic Designs, Inc., Houston, Tex.). Typical preparations yield 0.2-10 µmole of primer. A sequencing reaction typically requires about 1 pmole of primer, so each preparation of primer would contain enough material to prime 2×10^5 to 10^7 separate sequencing reactions.
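The number of reactions per preparation is a unit conversion from micromoles to picomoles. A short sketch of the arithmetic, with an illustrative function name:

```python
PMOL_PER_UMOL = 1_000_000  # 1 micromole = 10**6 picomoles


def reactions_per_preparation(yield_umol, pmol_per_reaction=1.0):
    """Number of sequencing reactions one primer preparation can supply."""
    return round(yield_umol * PMOL_PER_UMOL / pmol_per_reaction)


# Typical yields quoted above: 0.2 to 10 micromoles at about 1 pmole per reaction.
print(reactions_per_preparation(0.2), reactions_per_preparation(10))
```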

The improvement and efficiency in the method of the invention over conventional methods should be noted. In the conventional directed priming method, where known sequence is extended from a newly synthesized primer that primes near the end of the known sequence, primers are typically of length 16 bases or longer and therefore can be used only once for an amount of sequence equivalent to the entire human genome. The methods of this invention use statistical analysis to select primers of lengths that allow repeated use of primers from the same preparation and therefore have the potential to lower the cost of primers relative to the amount of sequence obtained by a factor of 10^5 or more, depending on the volume of sequencing. The methods will be illustrated with cosmid DNAs such as might be used for sequencing the human genome.

A typical cosmid DNA might contain 5,000 base pairs of vector DNA and 40,000 base pairs of cloned DNA. The probabilities that randomly selected primers of lengths 6-12 will have no priming site, exactly one priming site, or more than one priming site in such a DNA molecule are given in Table 2. Clearly, hexamers and heptamers are too small to have much chance of priming useful sequence information in such a cosmid molecule. Libraries of octamers, nonamers or decamers, on the other hand, could generate sequence information quite efficiently from large numbers of different cosmid DNAs.
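Probabilities of the kind tabulated in Table 2 can be recomputed from the Poisson model. The Python below is a sketch under the same assumptions (random sequence, T = 2L for the double-stranded cosmid); the names are illustrative, and values computed this way may differ from the published table in the last rounded digit.

```python
import math

COSMID_BASE_PAIRS = 5_000 + 40_000   # vector plus cloned DNA
T = 2 * COSMID_BASE_PAIRS            # both strands can be primed


def site_probabilities(total_bases, primer_length):
    """Return (P(0), P(1), P(>1)) for a randomly selected primer."""
    n = total_bases / 4 ** primer_length
    p_none = math.exp(-n)
    p_one = n * p_none
    return p_none, p_one, 1.0 - p_none - p_one


print(" p  P(0)   P(1)   P(>1)")
for p in range(6, 13):
    p_none, p_one, p_multi = site_probabilities(T, p)
    print(f"{p:>2}  {p_none:.3f}  {p_one:.3f}  {p_multi:.3f}")
```

For hexamers and heptamers nearly all of the probability falls in the multiple-site column, while for primers of 11 or 12 bases it falls in the no-site column.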

Combined random and directed priming

In a preferred embodiment of the invention, the sequencing strategy combines random and directed priming. Initial blocks of sequence are generated by random priming and these sequences are then extended by directed priming until they merge. FIG. 1 provides a diagrammatic summary of this strategy.

Random priming phase

In the first stage, random priming with single octamers would provide sequence information in a fraction of sequencing reactions equal to 0.348, the value of P(1) for the cosmid DNAs. The fraction of productive reactions primed by single nonamers is expected to be only 0.244, but combinations of 2 nonamers increase this to 0.346 and combinations of 3 nonamers to 0.368. The fraction of productive reactions would also be 0.368 when priming is with combinations of 12 decamers. Thus, the random priming phase is expected to generate sequence information in slightly more than one of three sequencing reactions.
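One way to arrive at combination sizes such as those above is to search for the number of primers k that maximizes P(1) for a given primer length. The following Python is a hedged sketch of such a search for the 45,000 base pair cosmid (T = 90,000); the search procedure and function names are assumptions of this example rather than a procedure stated in the specification.

```python
import math

COSMID_T = 2 * 45_000  # total bases in a double-stranded 45,000 bp cosmid


def prob_single_site(total_bases, primer_length, k=1):
    """P(1) for a combination of k primers, with n = k * T / 4**p."""
    n = k * total_bases / 4 ** primer_length
    return n * math.exp(-n)


def best_combination_size(total_bases, primer_length):
    """Smallest k giving the highest P(1); candidates run up to n of about 2."""
    n_single = total_bases / 4 ** primer_length
    k_max = max(1, math.ceil(2 / n_single))
    return max(range(1, k_max + 1),
               key=lambda k: prob_single_site(total_bases, primer_length, k))


for p in (8, 9, 10):
    k = best_combination_size(COSMID_T, p)
    print(p, k, round(prob_single_site(COSMID_T, p, k), 3))
```

P(1) is flat near its maximum, so neighboring values of k give almost the same fraction of productive reactions and the exact choice is not critical.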

With current technology, each successful set of sequencing reactions determines the sequence of several hundred nucleotides. For purposes of illustration, it will be assumed in this specification that an average of 500 nucleotides of sequence is obtained from each successful priming. It should be recognized, however, that the same methods apply when the average lengths of sequence obtained per priming are shorter or longer than 500. Of course, the longer the block of sequence obtained from each priming the more efficient will be the sequencing process.

In the random phase, different primers can be used individually and sequentially or in sets of sequencing reactions that are prepared and analyzed in parallel. Different sets can themselves be analyzed sequentially. When priming is done sequentially, succeeding primers or sets of primers are preferably selected to exclude any that would prime within the previously determined sequence. In this way, the priming is restricted to the unknown portion of the molecule. Which of the embodiments is preferred depends on the specifics of the sequencing program. In some high volume situations it may be more economical to prime each DNA individually and sequentially with one primer or primer combination at a time, although many different DNAs would probably be analyzed in parallel. On the other hand, where the complete sequence of a single cosmid or other nucleic acid is desired in the shortest possible period, it is preferable to perform a set of randomly primed reactions in parallel to start the sequencing process.

An advantage of the random priming method is that the same set of primers can be used repeatedly to determine sequences in many different DNAs. What is meant by repeated use of a primer is that many different samples from the same preparation of primer are used in many different sequencing reactions. In this specification, the term "set" as applied to primers, refers to a group of primers used repeatedly for random priming of many different DNAs. This is to distinguish "set" from the broader term "library" which is meant to apply to a collection of primers that is used repeatedly for directed priming or for both random and directed priming. Libraries would usually be larger than sets, in which case many different sets could be assembled from the primers in a library.

In random priming, each successful reaction should produce at least about 500 nucleotides of sequence, and these blocks of sequence should be distributed at essentially random positions in the DNA molecule. For cosmid DNAs, the first 10 blocks of randomly primed sequence are expected to have an average of about one overlap. Because cosmid DNAs are double stranded, the sequence of the complement of each block of sequence can also be inferred.
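The estimate of about one overlap among the first 10 blocks can be reproduced with a simple approximation. The sketch below assumes that block positions are independent and uniformly distributed along the molecule and ignores end effects, so that two blocks of length b in a molecule of length L overlap with probability of about 2b/L. This model is an assumption of the example, not a derivation given in the specification.

```python
from math import comb


def expected_overlapping_pairs(num_blocks, block_bp, molecule_bp):
    """Expected number of overlapping block pairs under a uniform-placement model.

    Each of the C(num_blocks, 2) pairs overlaps with probability about
    2 * block_bp / molecule_bp when end effects are ignored.
    """
    return comb(num_blocks, 2) * 2 * block_bp / molecule_bp


# Ten blocks of 500 nucleotides in a 45,000 base pair cosmid.
print(expected_overlapping_pairs(10, 500, 45_000))
```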

In cosmid DNAs, as in any double-stranded nucleic acid, the complement of any primer that is unique in the molecule will also be unique. The primer complement will prime at the same site but will direct DNA synthesis to the complementary strand and will extend the initial block of sequence in the opposite direction. Therefore, each of the initial blocks of randomly primed sequence can be extended at least about 500 base pairs in the opposite direction. Because of the difficulty in reading nucleotide sequence close to the primer, there will probably be a short gap between the two blocks of sequence; however, the location of the gap is known and such gaps are easily closed when the confirmatory sequences of the complementary strands are determined by directed priming.
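The relationship between a primer and its complement can be illustrated with a few lines of Python. This is a minimal sketch: the function names and the short template are hypothetical, and priming sites are counted as exact matches of the primer or its reverse complement in the top strand, which is equivalent to perfect-match priming on either strand of the duplex.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the complementary strand, written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_occurrences(seq, word):
    """Count possibly overlapping occurrences of word in seq."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq.startswith(word, i))


def priming_sites(top_strand, primer):
    """Perfect-match priming sites on both strands of a duplex.

    An occurrence of the primer sequence in the top strand is a site on the
    bottom strand; an occurrence of its reverse complement in the top strand
    is a site on the top strand.
    """
    top = top_strand.upper()
    primer = primer.upper()
    return (count_occurrences(top, primer)
            + count_occurrences(top, reverse_complement(primer)))


template = "GGAACGTTGCCC"           # hypothetical top strand
primer = "AACGTTGC"
complement_primer = reverse_complement(primer)
print(complement_primer,
      priming_sites(template, primer),
      priming_sites(template, complement_primer))
```

Because the count is symmetric in the primer and its reverse complement, a primer that is unique in the duplex always has a unique complement.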

For initiating the sequence of cosmid DNAs by random priming, a set of 30 primers and their complements are expected to generate perhaps 20-25% of the sequence in 8-11 blocks of about 1000 base pairs each. The same set of primers could be used repeatedly to initiate the sequence of many different cosmid DNAs. In each cosmid DNA, about the same fraction of the 30 initial primers is expected to prime uniquely, but the subset of primers that is unique will normally be different for each DNA molecule and the blocks of sequence will normally also be different, unless the cosmid DNAs overlapped in the genomic DNA from which the cosmids were derived, or unless the priming site is located in a repeated portion of the genome.
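A rough check on these figures is to multiply the number of primers by P(1). The sketch below assumes the set consists of single octamers, that each unique primer and its complement together yield a block of about 1000 base pairs, and that blocks do not overlap; none of these simplifications is stated for this estimate in the specification, and the function name is illustrative.

```python
import math


def expected_random_phase(num_primers=30, primer_length=8,
                          cosmid_bp=45_000, block_bp=1_000):
    """Expected number of unique primers and fraction of sequence covered."""
    n = 2 * cosmid_bp / 4 ** primer_length        # T = 2L for duplex DNA
    p_unique = n * math.exp(-n)                   # P(1) from the Poisson model
    blocks = num_primers * p_unique
    coverage = min(1.0, blocks * block_bp / cosmid_bp)
    return blocks, coverage


blocks, coverage = expected_random_phase()
print(round(blocks, 1), f"{coverage:.0%}")
```

Under these assumptions the estimate falls within the ranges of 8-11 blocks and 20-25% of the sequence quoted above.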

The primers in the set used to initiate sequencing by random priming can be selected so as to optimize their usefulness for determining the sequence of a particular set of nucleic acids. Although it has been assumed that the nucleotide sequence is essentially random in the nucleic acids to be sequenced, the statistical analysis can be modified by well known techniques to take into account known deviations from randomness. For example, the DNA is often known to be enriched in AT or GC base pairs, and mammalian DNAs are known to have a strong bias against the dinucleotide sequence CG, with clustering of the CG sequences that are present. For some genomes the nucleotide sequences of highly or moderately repeated elements are known. The sequence of the vector portion that would be present in each cosmid DNA derived from the same cosmid vector would also be known or easily determined. The primers in a set used for random sequencing might for example exclude any that would prime in the vector portion of the cosmids or in known repeated elements of the genome, and might be chosen to reflect the average base composition and known dinucleotide biases of the genome being sequenced. These are examples of the types of optimization that are possible. Primer selection in individual cases could be optimized according to what is known about the nucleic acid being sequenced and the specific goals of the sequencing project.
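A selection step of the kind described above might be sketched as a simple filter. In the Python below, candidate primers are dropped if they would prime on either strand of the vector or of a known repeated element, or if they contain a disfavored dinucleotide. Excluding every CG-containing primer is a deliberate simplification of "reflecting dinucleotide biases"; the function names and all sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the complementary strand, written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def primes_in(primer, known_top_strand):
    """True if the primer has a perfect-match site on either strand."""
    top = known_top_strand.upper()
    primer = primer.upper()
    return primer in top or reverse_complement(primer) in top


def select_random_primers(candidates, vector_seq, repeats=(), avoid=("CG",)):
    """Keep primers that avoid the vector, known repeats, and listed dinucleotides."""
    selected = []
    for primer in candidates:
        if any(dinucleotide in primer.upper() for dinucleotide in avoid):
            continue  # simplification: reject any primer containing e.g. CG
        if any(primes_in(primer, known) for known in (vector_seq, *repeats)):
            continue  # would prime in vector or repeated element
        selected.append(primer)
    return selected


vector = "ATGCGTACCTGAAGTCA"  # toy stand-in for a vector sequence
candidates = ["TACCTGAA", "TTCAGGTA", "AACGTTGA", "GGATTCAA"]
print(select_random_primers(candidates, vector))
```

The same filter could be reused during sequential random priming by passing the sequence determined so far in place of the vector, so that later primers are restricted to the unknown portion of the molecule.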

The initial blocks of sequence provided by random priming give a unique signature to each cosmid DNA being sequenced. When the same set is used to prime each cosmid DNA, these initial blocks of sequence are useful for comparing different cosmids with each other and with emerging blocks of genomic sequence to detect overlaps. In a large scale genome sequencing project, it might be more efficient to use the initial blocks of randomly primed sequence to establish overlaps between cosmids rather than to make the independent effort to order the cosmid DNAs by other means before sequencing them. Such a signature provides a great deal more information than almost any other mapping method, and in a high volume sequencing facility would be easy to obtain. Where some but not all blocks of randomly primed se