BACKGROUND OF THE INVENTION
1. Technical Field
This invention is directed to methods for sequencing nucleic acids by
positional hybridization, to procedures combining these methods with more
conventional sequencing techniques, to the creation of probes useful for
nucleic acid sequencing by positional hybridization, to diagnostic aids
useful for screening biological samples for nucleic acid variations, and
to methods for using these diagnostic aids.
2. Description of the Prior Art
Since the recognition of nucleic acid as the carrier of the genetic code, a
great deal of interest has centered around determining the sequence of
that code in the many forms in which it is found. Two landmark studies made
the process of nucleic acid sequencing, at least with DNA, a common and
relatively rapid procedure practiced in most laboratories. The first
describes a process whereby terminally labeled DNA molecules are
chemically cleaved at single base repetitions (A. M. Maxam and W. Gilbert,
Proc. Natl. Acad. Sci. USA 74:560-564, 1977). Each base position in the
nucleic acid sequence is then determined from the molecular weights of
fragments produced by partial cleavages. Individual reactions were devised
to cleave preferentially at guanine, at adenine, at cytosine and thymine,
and at cytosine alone. When the products of these four reactions are
resolved by molecular weight, using, for example, polyacrylamide gel
electrophoresis, DNA sequences can be read from the pattern of fragments
on the resolved gel.
The second study describes a procedure whereby DNA is sequenced using a
variation of the plus-minus method (F. Sanger et al., Proc. Natl. Acad.
Sci. USA 74:5463-67, 1977). This procedure takes advantage of the chain
terminating ability of dideoxynucleoside triphosphates (ddNTPs) and the
ability of DNA polymerase to incorporate ddNTPs with nearly the same fidelity
as its natural substrates, deoxynucleoside triphosphates
(dNTPs). Briefly, a primer, usually an oligonucleotide, and a template DNA
are incubated together in the presence of a useful concentration of all
four dNTPs plus a limited amount of a single ddNTP. The DNA polymerase
occasionally incorporates a dideoxynucleotide which terminates chain
extension. Because the dideoxynucleotide has no 3'-hydroxyl, the initiation
point for the polymerase enzyme is lost. Polymerization produces a mixture
of fragments of varied sizes, all having identical 3' termini.
Fractionation of the mixture by, for example, polyacrylamide gel
electrophoresis, produces a pattern which indicates the presence and
position of each base in the nucleic acid. Reactions with each of the four
ddNTPs allow one of ordinary skill to read an entire nucleic acid
sequence from a resolved gel.
Despite their advantages, these procedures are cumbersome and impractical
when one wishes to obtain megabases of sequence information. Further,
these procedures are, for all practical purposes, limited to sequencing
DNA. Although variations have developed, it is still not possible using
either process to obtain sequence information directly from any other form
of nucleic acid.
A new method of sequencing has been developed which overcomes some of the
problems associated with current methodologies. In this method, sequence
information is obtained in multiple discrete packages by hybridization.
Instead of having a particular nucleic acid sequenced one base at a time,
groups of contiguous bases are determined simultaneously. Advantages in
speed, expense and accuracy are clear.
Two general approaches of sequencing by hybridization have been suggested.
Their practicality has been demonstrated in pilot studies. In one format,
a complete set of 4.sup.n oligonucleotides of length n is immobilized as an
ordered array on a solid support and an unknown DNA sequence is hybridized
to this array (K. R. Khrapko et al., J. DNA Sequencing and Mapping
1:375-88, 1991). The resulting hybridization pattern provides all n-tuple
words in the sequence. This is sufficient to determine short sequences
except for simple tandem repeats.
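The n-tuple content that such an array reports can be illustrated with a minimal Python sketch. The function name `spectrum` and the example sequence are illustrative assumptions, not part of the cited work, and the sketch ignores hybridization errors and strand orientation.

```python
def spectrum(sequence, n):
    """Return the set of all overlapping n-tuple words found in `sequence`."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


# Hypothetical 8-base target read with 3-base probes.
words = spectrum("ATGCAGGT", 3)
print(sorted(words))

# A simple tandem repeat gives the same word set whatever its length,
# which is why such repeats cannot be resolved from the words alone.
print(spectrum("ATATAT", 3) == spectrum("ATATATAT", 3))
```

Because the result is a set, the number of times a word occurs is lost, which mirrors the information available from a simple positive or negative hybridization signal.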
In the second format, an array of immobilized samples is hybridized with
one short oligonucleotide at a time (Z. Strezoska et al., Proc. Natl. Acad.
Sci. USA 88:10,089-93, 1991). When repeated 4.sup.n times, once for each
oligonucleotide of length n, much of the sequence of all the immobilized
samples would be determined. In both approaches, the intrinsic power of
the method is that many sequenced regions are determined in parallel. In
actual practice the array size is about 10.sup.4 to 10.sup.5.
Another powerful aspect of the method is that information obtained is quite
redundant, especially as the size of the nucleic acid probe grows.
Mathematical simulations have shown that the method is quite resistant to
experimental errors and that far fewer than all probes are necessary to
determine reliable sequence data (P. A. Pevzner et al., J. Biomol. Struc.
& Dyn. 9:399-410, 1991; W. Bains, Genomics 11:295-301, 1991).
In spite of an overall optimistic outlook, there are still a number of
potentially severe drawbacks to actual implementation of sequencing by
hybridization. First and foremost among these is that 4.sup.n rapidly
becomes quite a large number if chemical synthesis of all of the
oligonucleotide probes is actually contemplated. Various schemes of
automating this synthesis and compressing the products into a small scale
array, a sequencing chip, have been proposed.
A second drawback is the poor level of discrimination between a correctly
hybridized, perfectly matched duplex and an end mismatch. In part,
these drawbacks have been addressed at least to a small degree by the
method of continuous stacking hybridization as reported by Khrapko et
al. (FEBS Lett. 256:118-22, 1989). Continuous stacking hybridization is
based upon the observation that when a single stranded oligonucleotide is
hybridized adjacent to a double stranded oligonucleotide, the two duplexes
are mutually stabilized as if they are positioned side to side due to a
stacking contact between them. The stability of the interaction decreases
significantly as stacking is disrupted by nucleotide displacement, gap, or
terminal mismatch. Internal mismatches are presumably ignorable because
their thermodynamic stability is so much less than perfect matches.
Although the approach is promising, a related problem arises: distinguishing
between weak but correct duplex formation and simple background such as
non-specific adsorption of probes to the underlying support matrix.
A third drawback is that detection is monochromatic. Separate sequential
positive and negative controls must be run to discriminate between a
correct hybridization match, a mis-match, and background.
A fourth drawback is that ambiguities develop in reading sequences longer
than a few hundred base pairs on account of sequence recurrences. For
example, if a sequence the same length as the probe recurs three times in
the target, the sequence position cannot be uniquely determined. The
locations of these sequence ambiguities are called branch points.
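As a rough illustration of how such recurrences could be located in a known test sequence, the following Python sketch lists every word of probe length that occurs at least a chosen number of times. The function name `recurring_words`, the `min_count` parameter, and the example sequence are assumptions made for illustration only.

```python
from collections import defaultdict


def recurring_words(sequence, n, min_count=3):
    """Map each n-base word occurring at least `min_count` times to its start positions."""
    positions = defaultdict(list)
    for i in range(len(sequence) - n + 1):
        positions[sequence[i:i + n]].append(i)
    return {word: starts for word, starts in positions.items()
            if len(starts) >= min_count}


# Hypothetical target in which the word ACG recurs three times.
print(recurring_words("ACGTACGTACG", 3))
```

Each word returned marks a candidate branch point, since the hybridization data alone do not indicate which occurrence a neighboring word belongs to.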
A fifth drawback is the effect of secondary structures in the target
nucleic acid. This could lead to blocks of sequences that are unreadable
if the secondary structure is more stable than occurs on the complementary
strand.
A final drawback is the possibility that certain probes will have anomalous
behavior and, for one reason or another, be recalcitrant to hybridization
under whatever standard set of conditions is ultimately used. A
simple example of this is the difficulty in finding matching conditions
for probes rich in G/C content. A more complex example could be sequences
with a high propensity to form triple helices. The only way to rigorously
explore these possibilities is to carry out extensive hybridization
studies with all possible oligonucleotides of length n, under the
particular format and conditions chosen. This is clearly impractical if
many sets of conditions are involved.
SUMMARY OF THE INVENTION
The present invention overcomes the problems and disadvantages associated
with current strategies and design and provides a new method for rapidly
and accurately determining the nucleotide sequence of a nucleic acid by
the herein described methods of positional sequencing by hybridization.
As broadly described herein, this invention is directed to a rapid,
accurate, and reproducible method of sequencing a nucleic acid by
hybridizing that nucleic acid with a set of nucleic acid probes containing
random, but determinable sequences within the single stranded portion
adjacent to a double stranded portion wherein the single stranded portion
of the set preferably comprises every possible combination of sequences
over a predetermined range. Hybridization occurs by complementary
recognition of the single stranded portion of a target with the single
stranded portion of the probe and is thermodynamically favored by the
presence of adjacent double strandedness of the probe.
As broadly described herein, another object of this invention is the
integration of molecular biology techniques with the method of positional
sequencing by hybridization. This includes such techniques as the use of
exonucleases to partially cleave the target nucleic acid prior to
hybridization, and the use of polymerase to extend one strand of a target
hybridized probe using the target as a template. Polymerization can be of
a single nucleotide or of a sequence of nucleotides, as determined by
known methods which are easily applied by one of ordinary skill in the
art.
As broadly described herein, another object of the present invention is the
creation of nucleic acid probes for determining the sequence of an unknown
nucleic acid. These probes comprise a double stranded portion, which is
preferably constant, a single stranded portion, and a determinable random
nucleotide sequence within the single stranded portion which hybridizes to
the target. Probes may comprise a complete set of all possible sequences
of the random single stranded portion or a set comprising only a portion
of all possible combinations.
As broadly described herein, another object of the present invention is the
use of nucleic acid probes as diagnostic aids in the analysis of nucleic
acids of a biological sample. The invention includes diagnostic aids and
methods for using diagnostic aids for the analysis of the relatedness or
unrelatedness of one nucleic acid to another. Probes may be created in
which an unknown or undetermined nucleotide sequence has been identified
as the source of a mutation or genetic variation. Probes created herein
may be used to quickly, easily, and accurately identify that mutation or
variation without having to perform a single conventional sequencing
reaction.
As broadly described herein, another object of this invention is a method
for determining the position of a partial sequence within the whole
nucleic acid by labeling the nucleic acid of interest at one terminal site
with a first detectable label, labeling the nucleic acid of interest at an
internal site with a second detectable label, and comparing the relative
amounts of the first label with the relative amounts of the second label to
determine the position of the partial sequence.
Other objects and advantages of the invention are set forth in part in the
description which follows, and in part, will be obvious from this
description, or may be learned from the practice of this invention. The
accompanying drawings which are incorporated in and constitute a part of
this specification, illustrate and, together with this description, serve
to explain the principle of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 (A) Shown is the first step of the basic scheme for positional
sequencing by hybridization depicting the hybridization of target nucleic
acid with probe forming a 5' overhang of the target. (B) Shown is the
first step of the alternate scheme for positional sequencing by
hybridization depicting the hybridization of target nucleic acid with
probe forming a 3' overhang of the probe.
FIG. 2 Preparation of a random probe array.
FIG. 3 Graphic representation of the ligation step of positional sequencing
by hybridization wherein hybridization of the target nucleic acid produces
(A) a 5' overhang or (B) a 3' overhang.
FIG. 4 Single nucleotide extension of a probe hybridized with a target
nucleic acid using DNA polymerase and a single dideoxynucleotide.
FIG. 5 Preparation of a nested set of targets using labeled target nucleic
acids partially digested with Exonuclease III.
FIG. 6 Determination of positional information using the ratio of internal
label to terminal label.
FIG. 7 (A) Extension of one strand of the probe using hybridized target as
template with a single deoxynucleotide. (B) Hybridization of target with a
fixed probe followed by ligation of probe to target.
FIG. 8 Four color analysis of sequence extensions of the 3' end of a probe
using three labeled nucleoside triphosphates and one unlabeled chain
terminator.
FIG. 9 Extension of a nucleic acid probe by ligation of a pentanucleotide
3' blocked to prevent polymerization.
FIG. 10 Preparation of a customized probe containing a 10 base pair
sequence that was present in the original target nucleic acid.
FIG. 11 Graphic representation of the general procedure of positional
sequencing by hybridization.
FIG. 12 (A) Graphical representation of the ligation efficiency of
positional sequencing. Depicted is the relationship between the amount of
label remaining over the total amounts of label in the reaction, versus
NaCl concentration. (B) Test sequences of biotinylated duplex probes
tethered to streptavidin coated magnetic microbeads utilized to determine
ligation efficiency.
DESCRIPTION OF THE INVENTION
To achieve the objects and in accordance with the purpose of the invention,
as embodied and broadly described herein, the present invention comprises
methods, probes, diagnostic aids, and methods for using the diagnostic
aids to determine sequence information from nucleic acids. Nucleic acids
of the present invention include sequences of deoxyribonucleic acid (DNA)
or ribonucleic acid (RNA) which may be isolated from natural sources,
recombinantly produced, or artificially synthesized. A preferred embodiment
of the present invention is a probe synthesized using traditional chemical
synthesis, using the more rapid polymerase chain reaction (PCR)
technology, or using a combination of these two methods.
Nucleic acids of the present invention further include polyamide nucleic
acid (PNA) or any sequence of what are commonly referred to as bases
joined by a chemical backbone that have the ability to base pair, or
hybridize, with a complementary chemical structure. The bases of DNA, RNA,
and PNA are purines and pyrimidines linearly linked to a chemical
backbone. Common chemical backbone structures are deoxyribose phosphate
and ribose phosphate. Recent studies demonstrated that a number of
additional structures may also be effective, such as the polyamide
backbone of PNA (P. E. Nielsen et al., Sci. 254:1497-1500, 1991).
The purines found in both DNA and RNA are adenine and guanine, but others
known to exist are xanthine, hypoxanthine, 2,6-diaminopurine, and other
more modified bases. The pyrimidines are cytosine, which is common to both
DNA and RNA, uracil found predominantly in RNA, and thymine which
occurs exclusively in DNA. Some of the more atypical pyrimidines include
methylcytosine, hydroxymethylcytosine, methyluracil, hydroxymethyluracil,
dihydroxypentyluracil, and other base modifications. These bases interact
in a complementary fashion to form base pairs, such as, for example,
guanine with cytosine and adenine with thymine. However, this invention
also encompasses situations in which there is nontraditional base pairing
such as Hoogsteen base pairing which has been identified in certain tRNA
molecules and postulated to exist in a triple helix.
One embodiment of the present invention is a method for determining a
nucleotide sequence by positional hybridization comprising the steps of
(a) creating a set of nucleic acid probes wherein each probe has a double
stranded portion, a single stranded portion, and a random sequence within
the single stranded portion which is determinable, (b) hybridizing a
nucleic acid target which is at least partly single stranded to the set of
nucleic acid probes, and (c) determining the nucleotide sequence of the
target which hybridized to the single strand portion of any probe. The set
of nucleic acid probes and the target nucleic acid may comprise DNA, RNA,
PNA, or any combination thereof, and may be derived from natural sources,
recombinant sources, or be synthetically produced. Each probe of the set
of nucleic acid probes has a double stranded portion which is preferably
about 10 to 30 nucleotides in length, a single stranded portion which is
preferably about 4 to 20 nucleotides in length, and a random sequence
within the single stranded portion which is preferably about 4 to 20
nucleotides in length and more preferably about 5 nucleotides in length. A
principal advantage of this probe is in its structure. Hybridization of
the target nucleic acid is encouraged due to the favorable thermodynamic
conditions established by the presence of the adjacent double strandedness
of the probe. An entire set of probes contains at least one example of
every possible random nucleotide sequence.
By way of example only, if the random portion consisted of a four
nucleotide sequence of adenine, guanine, thymine, and cytosine, the total
number of possible combinations would be 4.sup.4 or 256 different nucleic
acid probes. If the number of nucleotides in the random sequence was five,
the number of different probes within the set would be 4.sup.5 or 1,024.
This becomes a very large number indeed when considering sequences of 20
nucleotides or more.
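The growth of the probe set with the length of the random portion can be checked with a short Python sketch. The function names are hypothetical, and explicit enumeration is only practical for small n.

```python
from itertools import product


def all_probe_sequences(n, alphabet="ACGT"):
    """Enumerate every possible random sequence of length n over the four bases."""
    return ["".join(bases) for bases in product(alphabet, repeat=n)]


def probe_set_size(n):
    """Number of distinct random sequences of length n, i.e. 4 to the power n."""
    return 4 ** n


print(len(all_probe_sequences(4)))  # 256
print(len(all_probe_sequences(5)))  # 1024
print(probe_set_size(20))           # 1099511627776
```

For n = 20 the count already exceeds 10.sup.12, so `probe_set_size` computes the number directly instead of building the list.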
However, to determine the complete sequence of a nucleic acid target, the
set of probes need not contain every possible combination of nucleotides
of the random sequence to be encompassed by the method of this invention.
This variation of the invention is based on the theory of degenerated
probes proposed by S. C. Macevicz (U.S. Pat. No. 5,002,867, and herein
specifically incorporated by reference). The probes are divided into four
subsets. In each, one of the four bases is used at a defined number of
positions and all other bases except that one on the remaining positions.
Probes from the first subset contain two elements, A and non-A
(A=adenosine). For a nucleic acid sequence of length k, there are
4(2.sup.k -1), instead of 4.sup.k probes. Where k=8, a set of probes would
consist of only 1020 different members instead of the entire set of
65,536. The savings in time and expense would be considerable. In
addition, it is also a method of the present invention to utilize probes
wherein the random nucleotide sequence contains gapped segments, or
positions along the random sequence which will base pair with any
nucleotide or at least not interfere with adjacent base pairing.
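The probe counts quoted above follow directly from the two formulas, as the following Python sketch shows. It only evaluates the stated expressions 4(2.sup.k -1) and 4.sup.k; the function names are illustrative assumptions and the sketch does not reproduce the construction of the degenerate probes themselves.

```python
def degenerate_probe_count(k):
    """Probes needed with four base-versus-non-base subsets: 4 * (2**k - 1)."""
    return 4 * (2 ** k - 1)


def full_probe_count(k):
    """Probes needed when every sequence of length k is synthesized: 4**k."""
    return 4 ** k


k = 8
print(degenerate_probe_count(k))  # 1020
print(full_probe_count(k))        # 65536
```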
Hybridization between complementary bases of DNA, RNA, PNA, or combinations
of DNA, RNA and PNA, occurs under a wide variety of conditions such as
variations in temperature, salt concentration, electrostatic strength, and
buffer composition. Examples of these conditions and methods for applying
them are described in Nucleic Acid Hybridization: A Practical Approach (B.
D. Hames and S. J. Higgins, editors, IRL Press, 1985), which is herein
specifically incorporated by reference. It is preferred that hybridization
takes place between about 0.degree. C. and about 70.degree. C., for
periods of from about 5 minutes to hours, depending on the nature of the
sequence to be hybridized and its length. It is also preferred that
hybridization between nucleic acids be facilitated using certain reagents
and chemicals. Preferred examples of these reagents include single
stranded binding proteins such as Rec A protein, T4 gene 32 protein, E.
coli single stranded binding protein, and major or minor nucleic acid
groove binding proteins. Preferred examples of other reagents and
chemicals include divalent ions, polyvalent ions, and intercalating
substances such as ethidium bromide, actinomycin D, psoralen, and
angelicin.
The nucleotide sequence of the random portion of each probe is determinable
by methods which are well-known in the art. Two methods for determining
the sequence of the nucleic acid probe are by chemical cleavage, as
disclosed by Maxam and Gilbert (1977), and by chain extension using
ddNTPs, as disclosed by Sanger et al. (1977), both of which are herein
specifically incorporated by reference. Alternatively, another method for
determining the nucleotide sequence of a probe is to individually
synthesize each member of a probe set. The entire set would comprise every
possible sequence within the random portion or some smaller portion of the
set. The method of the present invention could then be conducted with each
member of the set. Another procedure would be to synthesize one or more
sets of nucleic acid probes simultaneously on a solid support. Preferred
examples of a solid support include a plastic, a ceramic, a metal, a
resin, a gel, and a membrane. A more preferred embodiment comprises a
two-dimensional or three-dimensional matrix, such as a gel, with multiple
probe binding sites, such as a hybridization chip as described by Pevzner
et al. (J. Biomol. Struc. & Dyn. 9:399-410, 1991), and by Maskos and
Southern (Nuc. Acids Res. 20:1679-84, 1992), both of which are herein
specifically incorporated by reference.
Hybridization chips can be used to construct very large probe arrays which
are subsequently hybridized with a target nucleic acid. Analysis of the
hybridization pattern of the chip provides an immediate fingerprint
identification of the target nucleotide sequence. Patterns can be manually
or computer analyzed, but it is clear that positional sequencing by
hybridization lends itself to computer analysis and automation. Algorithms
and software have been developed for sequence reconstruction which are
applicable to the methods described herein (R. Drmanac et al., J. Biomol.
Struc. & Dyn. (in press); P. A. Pevzner, J. Biomol. Struc. & Dyn. 7:63-73,
1989, both of which are herein specifically incorporated by reference).
Another embodiment of the invention comprises target nucleic acid labeled
with a detectable label. Label may be incorporated at a 5' terminal site,
a 3' terminal site, or at an internal site within the length of the
nucleic acid. Preferred detectable labels include a radioisotope, a stable
isotope, an enzyme, a fluorescent chemical, a luminescent chemical, a
chromatic chemical, a metal, an electric charge, or a spatial structure.
There are many procedures whereby one of ordinary skill can incorporate
detectable label into a nucleic acid. For example, enzymes used in
molecular biology will incorporate radioisotope labeled substrate into
nucleic acid. These include polymerases, kinases, and transferases. The
labeling isotope is preferably .sup.32 P, .sup.35 S, .sup.14 C, or
.sup.125 I.
Label may be directly or indirectly detected using scintillation fluid or a
PhosphorImager, chromatic or fluorescent labeling, or mass spectrometry.
Other, more advanced methods of detection include evanescent wave
detection of surface plasmon resonance of thin metal film labels such as
gold, by, for example, the BIAcore sensor sold by Pharmacia, or other
suitable biosensors.
Another embodiment of the present invention comprises a method for
determining a nucleotide sequence of a nucleic acid comprising the steps
of labeling the nucleic acid with a first detectable label at a terminal
site, labeling the nucleic acid with a second detectable label at an
internal site, identifying the nucleotide sequences of portions of the
nucleic acid, determining the relationship of the nucleotide sequence
portions to the nucleic acid by comparing the first detectable label and
the second detectable label, and determining the nucleotide sequence of
the nucleic acid. Fragments of target nucleic acids labeled both
terminally and internally can be distinguished based on the relative
amounts of each label within respective fragments. All fragments of a target
nucleic acid that include the terminus labeled with the first detectable
label will carry the same amount of that label.
However, these fragments will have variable amounts of the internal label,
directly proportional to their size and distance from the terminus. By
comparing the relative amount of the first label to the relative amount of
the second label in each fragment, one of ordinary skill is able to
determine the position of the fragment or the position of the nucleotide
sequence of that fragment within the whole nucleic acid.
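The ratio comparison can be sketched in Python under a simplifying assumption that is not stated in the text: the internal label is incorporated uniformly, so that the ratio of internal to terminal signal grows linearly with fragment length. The function name and the calibration constant `ratio_per_base` (the internal-to-terminal signal ratio contributed by each base) are hypothetical.

```python
def estimate_fragment_length(internal_signal, terminal_signal, ratio_per_base):
    """Estimate fragment length in bases from the internal/terminal label ratio.

    Assumes uniform internal labeling, so that
    internal_signal / terminal_signal == ratio_per_base * length.
    """
    if terminal_signal <= 0 or ratio_per_base <= 0:
        raise ValueError("terminal_signal and ratio_per_base must be positive")
    return (internal_signal / terminal_signal) / ratio_per_base


# Hypothetical readings for one hybridized fragment.
print(estimate_fragment_length(150.0, 3.0, 0.5))
```

Because every fragment considered includes the labeled terminus, the estimated length also gives the distance of the fragment's other end from that terminus.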
A further embodiment of the present invention is a method for determining a
nucleotide sequence by hybridization comprising the steps of (a) creating
a set of nucleic acid probes wherein each probe has a double stranded
portion, a single stranded portion, and a random sequence within the
single stranded portion which is determinable, (b) hybridizing a nucleic
acid target which is at least partly single stranded to the set, (c)
ligating the hybridized target to the probe, and (d) determining the
nucleotide sequence of the target which is hybridized to the single stranded
portion of any probe. This embodiment adds a step wherein the hybridized
target is ligated to the probe. Ligation of the target nucleic acid to the
complementary probe increases fidelity of hybridization and allows for
incorrectly hybridized target to be easily washed from correctly
hybridized target (see FIG. 11). Ligation can be accomplished using a
eukaryotic derived or a prokaryotic derived ligase. Preferred is T4 DNA or
RNA ligase. Methods for use of these and other nucleic acid modifying
enzymes are described in Current Protocols in Molecular Biology (F. M.
Ausubel et al., editors, John Wiley & Sons, 1989), which is herein
specifically incorporated by reference.
Another embodiment of the present invention is a method for determining a
nucleotide sequence by hybridization which comprises the steps of (a)
creating a set of nucleic acid probes wherein each probe has a double
stranded portion, a single stranded portion, and a random sequence within
the single stranded portion which is determinable, (b) hybridizing a
target nucleic acid which is at least partly single stranded to the set of
nucleic acid probes, (c) enzymatically extending a strand of the probe
using the hybridized target as a template, and (d) determining the
nucleotide sequence of the single stranded portion of the target nucleic
acid. This embodiment of the invention is similar to the previous
embodiment, as broadly described herein, and includes all of the aspects
and advantages described therein. An alternative embodiment also includes
a step wherein hybridized target is ligated to the probe. Ligation
increases the fidelity of the hybridization and allows for a more
stringent wash step wherein incorrectly hybridized, unligated target can
be removed.
Hybridization produces either a 5' overhang or a 3' overhang of target
nucleic acid. Where there is a 5' overhang, a 3'-hydroxyl is available on
one strand of the probe from which nucleotide addition can be initiated.
Preferred enzymes for this process include eukaryotic or prokaryotic
polymerases such as T3 or T7 polymerase, Klenow fragment, or Taq
polymerase. Each of these enzymes is readily available to those of
ordinary skill in the art as are procedures for their use (see Current
Protocols in Molecular Biology).
Hybridized probes may also be enzymatically extended a predetermined
length. For example, reaction conditions can be established wherein a
single dNTP or ddNTP is utilized as substrate. Only hybridized probes
wherein the first nucleotide to be incorporated is complementary to the
target sequence will be extended, thus providing additional hybridization
fidelity and additional information regarding the nucleotide sequence of
the target. Sanger or Maxam and Gilbert sequencing can be performed which
would provide further target sequence data.
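The logic of single nucleotide extension can be sketched in Python as four parallel reactions, each supplying one nucleotide, where only the reaction whose nucleotide complements the first unpaired template base extends the probe. The function names are hypothetical, and the sketch assumes DNA bases, standard Watson-Crick pairing, and an error-free polymerase.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def extension_results(template_base):
    """For each supplied nucleotide, report whether the probe would be extended."""
    return {nt: COMPLEMENT[template_base] == nt for nt in "ACGT"}


def infer_template_base(results):
    """Infer the template base from which single-nucleotide reaction extended."""
    extended = [nt for nt, ok in results.items() if ok]
    if len(extended) != 1:
        raise ValueError("expected exactly one extending nucleotide")
    return COMPLEMENT[extended[0]]


# Hypothetical case: the first unpaired template base is G, so only C extends.
results = extension_results("G")
print(results)
print(infer_template_base(results))
```

The inferred base adds one position of target sequence beyond what hybridization to the random portion of the probe provides.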
Alternatively, hybridization of target to probe can produce 3' extensions
of target nucleic acids. Hybridized probes can be extended using
nucleoside bisphosphate substrates or short sequences which are ligated to
the 5' terminus.
Another embodiment of the invention is a method for determining a
nucleotide sequence of a target by hybridization comprising the steps of
(a) creating a set of nucleic acid probes wherein each probe has a double
stranded portion, a single stranded portion, and a random nucleotide
sequence within the single stranded portion which is determinable, (b)
cleaving a plurality of nucleic acid targets to form fragments of various
lengths which are at least partly single stranded, (c) hybridizing the
single stranded region of the fragments with the single stranded region of
the probes, (d) identifying the nucleotide sequences of the hybridized
portions of the fragments, and (e) comparing the identified nucleotide
seque | | |