|
Description  |
|
|
BACKGROUND OF THIS INVENTION
Deoxyribonucleic acid (DNA) is the primary genetic material. DNA is an informational molecule which encodes all of the proteins which make up a living organism. This capacity it serves in all living organisms.
DNA consists of two intertwined polynucleotide chains, the double helix. Each chain is a polymer made up of nucleotides. The nucleotide constituents consist of a sugar, a phosphate and a nitrogenous heterocyclic base. Interchain pairing
between the bases, via hydrogen bonding, holds the two chains together in the helical coil.
The nucleotides of DNA contain the sugar 2-deoxyribose and are designated deoxyribonucleotides. Nucleotides which contain the sugar D-ribose are called ribonucleotides: these are the building blocks of ribonucleic (RNA), the intermediary
transcript of DNA which serves as the actual template for protein synthesis.
In all nucleotides the sugar moiety is attached to the nitrogenous base via the glycosidic carbon (1' carbon of the ribose). This combination of sugar and base is called a nucleoside. Phosphorylation of a nucleoside at the 5' carbon of the
sugar gives a nucleotide. The backbone of the DNA polymer is formed of phosphodiester bonds between 2'-deoxynucleoside 5'-monophosphates. (3', 5'-phosphodiester bridges).
The nucleotides of DNA differ only in the nitrogenous base. There are two types of nitrogenous bases in nucleotides, pyrimidines and purines. The pyrimidines are uracil, thymine and cytosine (abbreviated U, T, and C respectively). Nucleotides
containing uracil are found primarily in RNA whereas thymine is found in DNA. The major purines are adenine and guanine (abbreviated A and G respectively). In the DNA helix, complementary DNA chains are held together by base pairing. The
sugar-phosphate backbones are on the outside of the DNA molecule and the purine and pyrimidine bases on the inside. Adenine (A) can pair only with thymine (T) while guanine (G) can pair only with cytosine.
The genetic information of DNA is stored in the linear sequence of the four nucleotides. Most nucleotides along a strand of DNA make up genes which code for specific polypeptides. The nucleotide sequence of DNA is read in groups of three
nucleotides. Each triplet is a "code-word", or codon, for an amino acid. As there are 4 different nucleotides in DNA (distinguished by the bases A, T, G, and C) there are 64 different codons. These codons comprise the entire genetic code. Most of the
codons designate an amino acid; some serve as start and stop signals for protein translation. The genetic code is degenerate because there is more than one codon for most of the amino acids. For example, the amino acid alanine is coded for by the
codons CGA, CGG, CGT and CGC. (In RNA, the triplets are GCU, GCC, GCA and GCG.)
Several techniques have been developed for determining the nucleotide sequence of DNA. Among the more widely practiced are the methods of Maxam and Gilbert and of Sanger.
In the DNA sequencing technique of Maxam and Gilbert, a segment of DNA is labeled at one end with radiolabeled phosphate. The labeled DNA is divided into four samples and each sample is treated with a chemical that specifically destroys one or
two of the four bases in the DNA. The "nicked" molecules are then treated with piperidine which breaks the DNA backbone at the site where the base has been destroyed. This generates a series of labeled fragments the lengths of which depend on the
distance of the destroyed base from the labeled end of the segment. The labeled polynucleotides are separated according to size on an acrylamide gel. The gel is autoradiographed and the patterns of bands on the X-ray film determine which base was
destroyed to produce each radioactive fragment. From this information the position of the destroyed bases can be determined and the overall sequence of the DNA deduced.
The DNA-sequencing technique of Sanger is an enzymatic procedure which entails the synthesis of radiolabeled DNA polynucleotides from the DNA strand to be sequenced. Chain-terminating dideoxynucleoside triphophates are used to stop synthesis at
a particular nucleotide. A Sanger sequencing reaction includes a DNA strand to be sequenced, a labeled primer complementary to the end of that strand, a carefully controlled ratio of one particular dideoxynucleotide with its normal deoxynucleotide and
the other 3 deoxynucleotides. When DNA polymerase is added, normal polymerization begins from the primer; when a dideoxynucleotide is incorporated, the chain is terminated. This results in a series of labeled polynucleotides whose lengths depend upon
the location of a particular base relative to the end of the DNA strand.
Four separate polymerase reactions are conducted each containing one type of dideoxynucleotide. Radiolabeled fragments are separated by size on an acrylamide gel. The pattern of polynucleotides gives the DNA sequence.
The methods of Sanger and Maxam and Gilbert are convenient ways of sequencing single fragments of DNA of about 400 base pairs or less in length. To sequence large pieces of DNA, overlapping fragments of suitable length must be generated and
sequenced individually. The overlapping sequence information among fragments provides the sequential relationship of the fragments so that their relative order can be assigned. From this information, the entire sequence of the parent piece of DNA can
be pieced together.
With these approaches it is apparent that as the length of the DNA strand to be sequenced increases, the probability of obtaining fragments of a DNA strand which overlap sufficiently to eliminate the random occurrence of corresponding sequence
decreases. Consequently, the number of randomly generated fragments of the strand necessary for accurate sequencing increases. For DNA strands even two orders of magnitude less than the length of a human chromosome the number of randomly generated
fragments is immense. Thus, the technique is impractical for sequencing molecules of this size.
The Sanger technique of sequencing long strands of DNA requires DNA cloning procedures because DNA polymerization requires a primer. This is overcome by cloning the fragment into a vector so that it is contiguous with a region of known sequences
so that the complementary primer may be provided. Because of this dependency on cloning technology, the procedure is difficult to automate.
DISCLOSURE OF THE INVENTION
This invention pertains to a method of sequencing DNA which entails the generation of a specific set of polynucleotides from the DNA to be sequenced, determining the nucleic acid base composition and terminal base of these molecules, and solving
the base sequence of the molecules and thus, the sequence of the DNA, by an algorithm called the matrix method of analysis.
The molecules generated from the DNA to be sequenced comprise families of polynucleotides. Each family corresponds to a segment of the DNA to be sequenced and is made up of a longest polynucleotide (the length of which is selected to be within
the analyzable limit of the procedure used to determine base composition and identity of the terminal base) and shorter polynucleotides which form a "sequential subset" of the longest polynucleotide. Grouped heirarchically from the longest to the
shortest polynucleotide, each polynucleotide of the family is progressively one nucleotide shorter than the preceding polynucleotide and has the same sequence except that it lacks the one nucleotide. A further restraint on the elements of the family is
that there is a specific dinucleotide of the sequence contained in each element. The molecules can be envisioned as being built around an "axis" which is at the mid position of the common dinucleotide. The "axis" constitutes an internal reference
point. The polynucleotides vary around the "axis", each containing one less nucleotide on the 3' or 5' end than its longer predecessor in the group. All such molecules are included in the family, from the longest to the shortest, a dinucleotide.
In the preferred embodiment of the invention, the polynucleotides are RNA/DNA hybrid molecules generated from the DNA to be sequenced. To form these hybrids, DNA to be sequenced is broken into fragments and each fragment used as a template to
form one or more RNA transcript(s). The RNA transcript(s) is then extended on the original intact DNA template with deoxyribonucleotides to form the DNA portion of the hybrid(s). The extension is terminated randomly by addition of dideoxynucleotides to
the polymerase reaction. This yields RNA/DNA hybrid molecules which are "random" in length at the 3' end. The molecules can then be randomized at the 5' (RNA) end preferably by using an RNA exonuclease which under appropriate conditions, degrades the
5' RNA portion. The result of this procedure is a family of polynucleotides having the characteristic set forth above. The "axis" referred to above is the dividing line between the RNA and DNA and it immediately follows the 3' most ribonucleotide of
all the hybrid molecules.
The sequence of the DNA portion from which each family of polynucleotides has been made can be solved by determining the base composition (the number of A's, T's, C's and G's) and the identity of the 3' terminal base of each polynucleotide of the
family. A preferred method of analyzing polynucleotide to obtain this qualitative and quantitative data is mass spectroscopic analysis.
In the preferred embodiment the fifth position of the pentose of the nucleotides of the polynucleotide are mass labeled with isotopes of carbon, hydrogen, and oxygen in such a fashion that each possible nucleotide and terminal nucleotide releases
a distinct mass labeled molecule (such as formaldehyde) from the fifth position of the pentose when the polynucleotide is degraded to 3' nucleotides or nucleosides and reacted with periodic acid. The relative abundance of the liberated molecules of
different massess corresponding to different bases is recorded using a mass spectrometer. The intensities of the signals which correspond to the different bases are normalized with the signal corresponding to the terminal base which serves as the
internal standard with a signal of unity. The base composition and internal base identity are given by the normalized data.
The composition and terminal nucleotide data of the elements of each family of polynucleotides are used to solve the sequence of the corresponding DNA portion template by a method of first generating all polynucleotides which can be obtained from
a guessed solution of the sequence by successive removal of a 3' or 5' nucleotide consistent with the data of the change in composition between set elements and with the further constraint that a specific dinucleotide of the sequence must be present in
all polynucleotides. The terminal nucleotide data is used to determine if a subset of the hypothetical family of polynucleotides exists such that the elements have a one to one correspondence with the data of terminal nucleotide as well as composition.
If no such subset exists, the process is repeated for improved guesses until convergence to the correct solution for the sequence occurs.
An algorithm which performs this analysis by testing for the validity of a guess for part of the sequence while solving for the remaining part using the composition and terminal base data independently to execute binary hypothesis testing
decisions compatible with computer logic is the matrix method of analysis algorithm.
The matrix method of analysis is analogous to solving a system of n equations in n unknowns where the knowns are: 1) the structural properties of the polynucleotides, 2) the base composition and the identity of the terminal base, 3) the change in
composition and change in terminal base between a polynucleotide and the next in the family. Knowledge of the 5' half of the sequence may or may not be known. A different form of the matrix method is used depending on whether the sequence of the 5'
half of the polynucleotides is known extrinsically. The method exploits the given information by implementing a reiterative procedure to find a path through a matrix of the possible polynucleotides having sequences consistent with the data. Final
assignment of the sequence is made when the entire path finding procedure can be accomplished without contradictions between sequence assignment and actual data.
A major feature of the sequencing procedure of this invention is that it eliminates the obstacle of overlap among sequenced fragments of DNA occurring by chance. Because of the manner in which the polynucleotides are generated, each DNA fragment
sequenced can be placed in proper order in relationship to all other fragments because a sufficient portion the 5' end of the flanking DNA fragment is elucidated as the sequence of any DNA fragment is determined. This permits the proper ordering of
sequenced DNA fragments of a large DNA molecule to yield the entire sequence of DNA. Thus, an entire gene or even an entire chromosome can be sequenced from one set of restriction enzyme fragments of the gene or chromosome.
For sequencing fragments of DNA containing greater than 400 nucleotides, the methods of Maxam and Gilbert and Sanger rely on the chance overlap of restriction fragments which are isolated from a digest of the DNA to be sequenced. If small
fragments are lost during the isolation procedure, then these molecules will not be sequenced; therefore, certain sequence information may be lost on account of the deletions. The strategy of the present invention, however, circumvents this shortcoming
because the procedure to sequence any given restriction fragment solves for the sequence of the restriction fragment and sequences into the 5' region of the contiguous restriction fragment; any short fragments which are lost during isolation procedures
are sequenced. Thus, no small deletions occur in the solution of the sequence by this method.
In addition, both strands of the template can be sequenced simultaneously; the sequence for any one strand can be verified by the sequence obtained for the antiparallel strand.
Another important attribute of the procedure is that it can be automated with instrumentation. Separation procedures can be automated with instruments such as automated electrophoresis and fraction collectors. A mass spectrometer with a
response time of 10 msecs can scan 100 molecules in a second and NMR equipment which can be used to scan NMR labeled molecules has the capacity to scan 80,000 samples in less than one second. The sequence can be solved from the data obtained by using
the matrix method of analysis programmed into a high speed computer. Thus, this method constitutes a procedure which is automatable to sequence DNA rapidly on a large scale.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic illustration of the Gaussian distribution of polynucleotide reaction products.
FIG. 2 is a flow chart of the steps of the Best Mode of Carrying Out the Invention.
FIG. 3 is a flow chart of the steps of Procedures IV and V of Alternative Modes of Carrying Out the Invention.
FIG. 4 is an example of a configuration matrix used to solve a sequence by the Matrix Method of Analysis I.
FIG. 5 is a lattice used to solve the first rectangular matrix of FIG. 4.
FIG. 6 is a lattice used to solve the fourth rectangular matrix of FIG. 4.
FIG. 7A is an example of NMR data from fragment #15 of FIG. 4.
FIG. 7B is an example of mass spectroscopic data from fragment #15 of FIG. 4 where the data was obtained of labeled CO.sub.2 using scheme 1 as described in the Determination of the Base Composition and Terminal Base Identity by Mass Spectrometry
Section.
FIG. 8 is an automation scheme which parallels the flow chart of FIG. 2.
FIGS. 9A, 9B and 9C are diagrams of the electro-optical ion detector array mass spectrometer.
FIG. 10A is a diagram of the electrophoresis apparatus.
FIG. 10B is a cross section of a tube gel, manifold, and glass bead well.
FIG. 10C is a diagram that illustrates the collection of the glass beads in receiving vessels.
FIG. 10D is a diagram of the concentric disks containing wells of activated glass beads which rotate independently to collect electrophoresed nucleic acid.
FIG. 10E is a diagram of a single unit of the electrophoresis apparatus where DNA is collected from the units through outlets.
PRINCIPLES BEHIND STRATEGY OF THE METHOD OF THE INVENTION
I. Background
Two major techniques have been developed to sequence DNA, a chemical one, the Maxam Gilbert technique, and an enzymatic one, the Sanger technique. The principle of the former involves labeling the 5'-hydroxyl of a DNA strand with .sup.32 P and
establishing chemical conditions that cause the DNA strand to break at the site of occurrence of one specific nucleotide, either A, T, G, or C. The net effect is to produce polynucleotides of differing lengths and the length of these polynucleotides
reflects the occurrence of a base at a specific position in a DNA strand.
The principle of the latter technique involves copying a DNA template from a predetermined position on the template and terminating transcription where the terminating nucleotide is known from the specific reagent used during transcription. DNA
polymerase I, a 5' to 3' DNA polymerase which requires a free 3'OH, is used with a primer to synthesize a radioactive single strand copy of a template DNA strand whose sequence is to be determined. To each of four reaction mixtures, a blocked derivative
of one of the four different nucleotide substrates is added. These reagents are 2', 3' dideoxy analogues of each nucleoside triphosphate; they lack a 3' hydroxyl group. They block further extension of any chain into which they are incorporated. As a
result, a series of polynucleotides is formed whose length reflects the position in the polynucleotide of a base corresponding to the analogues.
In both techniques, the reaction products are analyzed by separating them according to length by electrophoresis on a polyacrylamide gel and exposing the gel to x-ray film. The sequence can be read directly from bottom to top of the film.
Advancing from one position of radiopacity to the next represents the addition of one nucleotide, and the identity of that nucleotide is assigned from the nature of the reaction that produced the radiopaque band.
These popular methods of DNA sequencing rely on the formation of a set of molecules all of which have a common property, a reference point at the 5' end of the molecule. Each member of the entire set has the following properties: they are
superimposable on a portion of the parent molecule of the set of unknown sequence and the nature of the superimposition is such that all of the molecules can be superimposed on the parent when the 5' end of any molecule of the set and the 5' end of the
parent molecule are aligned. The molecules make up a subset of the parent where subset denotes the relationship of identical sequence to some portion of the parent molecule and a total number of nucleotides less than or equal to that of the parent while
maintaining a common 5' end. The entire subset refers to a group of molecules differing in length by one nucleotide ranging in length from one nucleotide to the number of nucleotides of the parent.
Molecules which differ in length by one nucleotide can be separated on a polyacrylamide gel. For any molecule, up to about 400 base pairs in length, a molecule that is identical except for possessing an additional nucleotide at the 3' end, will
have a slower electrophoresis velocity. Thus, the smaller polynucleotide will migrate closer to the anode than the longer. Thus, if the last nucleotide of the molecule at the 3' end is known and all molecules have the same 5' end and are
superimposable, the sequence can be determined from the knowledge of the identity of the last nucleotide. This is the basis of sequencing by the Maxam Gilbert and Sanger techniques.
Another strategy can be envisioned where the set of molecules are of length less than or equal to the parent molecule, are superimposable beginning from the 3' end of the parent and contain elements of length from one nucleotide to the number of
the parent. In this strategy the common reference point is the 3' end. Utilizing the capability of separating of molecules differing by a single nucleotide and the knowledge of the last nucleotide on the 5' end, this procedure could serve to assign the
identity of any 5' nucleotide as the length of the polynucleotide increases by one. The sequence of the parent can be assigned by reiteration of this procedure.
Catalyzed chemical reactions are used to generate the set of molecules in the case of the Sanger technique and chemical cleavage methods are employed to generate the polynucleotide set in the Maxam and Gilbert technique. Chemical or enzymatic
reactions of these types have not been developed for the strategy where all the elements of the polynucleotide set have a common 3' end. However, because the approach is essentially the same, except that the opposite end of the parent molecule is chosen
as the reference point, any limitations of the Maxam and Gilbert and Sanger techniques would probably apply to this strategy.
All of the described techniques are inadequate for sequencing DNA molecules of length greater than about 400 nucleotides due to inherent shortcomings. To sequence DNA of 10.sup.5 nucleotides, for example, by any of the above strategies, a set of
molecules would have to be generated which contains all set elements that differ in length by one nucleotide from length one to 10.sup.5 and the molecules would have to be separatable on a polyacrylmaide gel. Both requirements result in failure of the
above mentioned methods.
Chemical reactions often yield a Gaussian distribution of reaction products. For the Maxam and Gilbert and Sanger techniques, set generation is dependent on chemical reactions for which the yield as a function of length is approximately a
Guassian distribution, and for both methods, the average length is about 200 nucleotides.
In addition, molecules migrate at a velocity which is inversely proportional to their molecular weight. As the molecular weight increases, the difference in migration velocity decreases. Thus, molecules that differ by one nucleotide and range
in length from one nucleotide to a length greater than 400 nucleotides can not be reliably separated by polyacrylamide electrophoresis to produce a band pattern from which the sequence can be determined. For these reasons, these methods are not
appropriate for sequencing DNA molecules greater than 400 nucleotides in length.
To overcome these limitations, an approach of making overlapping fragments of the parent molecule has been widely implemented. The most common method of generating the fragments has been by cleavage with restriction enzymes. These fragments are
individually sequenced by the methods of Maxam and Gilbert or Sanger and the total sequence is put together by overlap of the sequences of the fragments. However, this strategy also fails for sufficiently large molecules because of inherent limitations.
With present technology, a single fragment of DNA which contains less than 400 base pairs can be sequenced using the Sanger method of primed synthesis or the method of sequencing with base specific chemical cleavages of Maxam and Gilbert. These
methods are described with other less predominant methods in Methods of Enzymology, Vol. 65, 1980, pp. 497-701. Both methods are convenient for single fragments of DNA; however, to sequence a very large piece of DNA, both methods rely on overlap of
fragments of the larger piece which are independently sequenced. Sets of fragments are generated to produce overlapping sequence information, where the relationship of one set of fragments to another is unknown; any overlap of fragments from different
sets is due to chance. With this approach it is apparent that as the length of the DNA fragment to be sequenced even approaches two orders of magnitude less than the size of a human chromosome the probability of obtaining sets of fragments whose
sequences overlap each other sufficiently to rule out random overlap, so that the relative order can be assigned and thus allow the assignment of the sequence of the entire parent molecule, goes to zero; thus, the number of different randomly generated
sets of differing fragments that are sequenced individually goes to infinity.
Furthermore, the popular method of primed synthesis depends on DNA cloning. DNA polymerase I extends primers from a 3' OH where the primer is bound to the complementary region at the 3' end of the template molecule which is the complement of the
molecule which is sequenced. One of the difficulties with the Sanger technique is that suitable primers must be generated which will bind to only the 3' end of the template which is of unknown sequence. This problem can be circumvented by cloning the
template into a region of a vector such that both its 5' end, in one orientation, and its 3' end in another orientation are contiguous to a region of the cloning vector of known sequence. Transcription is initiated from a primer which is complementary
to this region.
Methods using cloning strategy are set forth in Methods of Enzymology, Vol. 101, 1983, pp. 3-122. The strategy of generating random sets of fragments that are individually sequenced until enough sets have been sequenced to provide overlap to
assign accurately the relative order of the fragments is discussed in New M13 Vectors for Cloning--Shotgun Cloning, Methods of Enzymology, supra pp. 43-47. Sequencing procedures which depend on cloning as an integral part of the procedure are very
difficult if not impossible to automate.
II. Strategy of the Sequencing Method of the Invention
The method of this invention is a readily automatable method which circumvents the problems inherent in the Maxam and Gilbert and Sanger techniques. The strategy is to create a group of molecules which contain a reference point which is
internal. Initially, location of the reference point is unknown, but it exists in all of the molecules. The molecules are a family of polynucleotides comprising complementary copies of a portion of the parent molecule from which they are generated and
are superimposable on the parent by alignment of this internal point of reference. The location of the point of reference or "axis," and the sequence of the parent molecule is solved for simultaneously by an algorithm called the matrix method of
analysis.
The family of polynucleotides can be thought of as being all molecules which result from the sequential loss of nucleotides from the 5' and 3' end of the longest polynucleotide of the group. An ordered pattern of terminal nucleotide change and
nucleotide compositional change occurs between members of sequential subsets. This algorithm exploits the pattern of ordered or systematic nucleotide compositional change and terminal nucleotide change that a designated longest polynucleotide with a
given internal reference point and given nucleotide loss constraints can produce.
III. Criteria of Polynucleotides
The nucleotide sequence of a DNA strand can be solved by generating a family of polynucleotides overlapping portions of the DNA to be sequenced. Each family of polynucleotides forms a "sequential subset" of the longest polynucleotide of the
group. The molecules are identical less one nucleotide from either the 5' or 3' end of a given molecule, and the former are defined as sequential subsets of the latter. A family of molecules with this intra member relationship is defined as a proper
sequential subset. Also, if all such families are considered collectively, then another type of sequential subsets can be defined. A family of molecules can be selected from this group which are "sequential subsets" in the sense that any molecule of
the family contains the exact same base composition less one nucleotide relative to another molecule of the family. The former is defined as an improper sequential subset of the latter if it does not contain the exact same sequence as the latter D less
one nucleotide from the 3' or 5' end. The family is defined as "improper sequential subsets" and only one polynucleotide of a given length is present in the family.
The molecules can be depicted as follows:
where the series K.sub.1, K.sub.2, K.sub.3, K.sub.4 . . . K.sub.n' represent the nucleotides of the polynucleotide 5' to the internal reference point, or axis, and the series X.sub.1, X.sub.2, X.sub.3, X.sub.4 . . . X.sub.n represents the
nucleotides of the polynucleotide on the 3' side of the axis. The 5' end with respect to the axis is designated as the "known" portion of the molecules (this does not necessarily imply that this sequence is initially known), and the 3' end of the
polynucleotide is designated as the "unknown" portion. Thus, K.sub.1, K.sub.2, K.sub.3, K.sub.4 . . . represent the "known" sequence and X.sub.1, X.sub.2, X.sub.3, X.sub.4 . . . represent the "unknown" sequence. The distinction is that in the matrix,
as described below, K.sub.1, K.sub.2, K.sub.3, K.sub.4 . . . appear as nucleotides, whereas the X's represent variables (define). The nucleotides of the "known" portion can be known extrinsically or they can be guessed.
There are at least three strategies of polynucleotide generation which yield a population of molecules from which sufficient data can be derived and analyzed by the proper form of the matrix method of analysis to solve for the sequence of the
parent molecule uniquely.
A. Strategy I
According to Strategy I the polynucleotides are governed by the following constraints. No polynucleotide contains X.sub.2 without containing X.sub.1. In general terms, no polynucleotide contains X.sub.n without containing X.sub.n-1, X.sub.n-2,
. . . X.sub.1. In addition, no polynucleotide contains K.sub.2 without containing K.sub.1. That is, no polynucleotide contains an unknown without containing all preceeding unknowns and, every polynucleotide contains all succeeding knowns if it
contains any given known. As a set, all the polynucleotides satisfy these criteria and vary randomly at the 3' and 5' ends.
The criteria can be represented symbolically as follows:
X.sub.n .fwdarw.X.sub.1 (X.sub.n implies X.sub.1)
K.sub.n .fwdarw.K.sub.1 (K.sub.n' implies K.sub.1)
. . K.sub.n '-X.sub.n. . . (The polynucleotides are random at the 5' and 3' ends; the knowns and unknowns are variables where K=Known, X=Unknown, n'=1 to 4 . . . and n=1 to 4 . . . )
B. Strategy II
All polynucleotides of a family conform to the following criteria. As in strategy I, no polynucleotide contains X.sub.2 without containing X.sub.1 and no polynucleotide contains X.sub.3 without containing X.sub.2 and X.sub.1, and so on. In
general terms, no polynucleotide contains X.sub.n without containing X.sub.n-1, X.sub.n-2, . . . X.sub.1. However, no restrictions are placed on the "known" nucleotides K.sub.1, K.sub.2, K.sub.3, K.sub.4 . . . . As a set, all polynucleotides conform
to these criteria and vary randomly at the 3' and 5' ends.
These criteria can be represented symbolically as follows:
X.sub.n .fwdarw.X.sub.1 (X.sub.n implies X.sub.1)
K.sub.n' .fwdarw.K.sub.1 (K.sub.n' does not imply K.sub.1)
K.sub.n' -K.sub.n" (The knowns are 3' and 5' random; the knowns are variables where the range of n' and n" is 1 to 4 . . .
K.sub.n' -X.sub.n (the polynucleotides are 5' and 3' random; the knowns and unknowns are variables, where K=known; X=Unknown, n=1 to 4 . . . and n'=1 to 4 . . . )
C. Strateqy III
All polynucleotides of each family must conform to the following criteria. No polynucleotide contains X.sub.2 without containing X.sub.1 and no polynucleotide contains X.sub.3 without containing X.sub.2 and X.sub.1, and so on. In general terms,
no polynucleotide contains X.sub.n without containing X.sub.n-1, X.sub.n-2, . . . X.sub.1. However, no restrictions are placed on the "known" nucleotides; that is all possible polynucleotides of known with unknowns is possible as long as the first rule
is satisfied. Furthermore, a change in known nucleotides of any two polynucleotides of the same composition implies that they have different terminal nucleotides. All polynucleotides adhere to these criteria and vary randomly at the 3' and 5' ends.
These criteria can be represented symbolically as follows:
X.sub.n .fwdarw.X.sub.1 (X.sub.n implies X.sub.1)
K.sub.n' .fwdarw.K.sub.1 (K.sub.n' does not imply K.sub.1)
For K-X.sub.T1 and K'-X.sub.T2, .DELTA.K and .DELTA.K', X.sub.T1 .noteq.X.sub.T2
(Given two polynucleotides which begin with knowns and end with unknowns and have the same composition, a change in knowns for both implies that the terminals are not the same.)
K.sub.n' -X.sub.n (the fragments are 5 and 3, random; the knowns and unknowns are variables, where K=known; X=unknown; n'=1 to 4 . . . and n=1-4 . . . )
An entire set of polynucleotides is defined as a family which satisfies all criteria of one of these three strategies. By determining the base composition and terminal base for sequential subsets, the sequence of the polynucleotides, and
therefore, the sequence of the original DNA strand (the parent), can be solved by the matrix method of analysis.
IV. Principles of Matrix Method of Analysis
The basis of the matrix method of analysis is that there exists in all molecules, as described, a reference point which is a nucleotide position relative to the parent. These molecules compromise a sequential subset of a polynucleotide copy of a
segment of the parent DNA. Nucleotides are lost from the 5' or 3' end. All the set elements are superimposable on the parent about this point of reference. The change in composition and terminal nucleotide and the order in which this change occurs is
unique moving from one set element to the next, from the longest to the shortest element of the set.
The matrix method of analysis is a method that exploits the information that is known from the design of the polynucleotide generating reactions and the data obtained from the polynucleotides. The information is as follows:
1) the composition and change in composition of a polynucleotide as a result of random loss of one nucleotide from either the 3' or 5' end at each step; 2) the constraints on the random 3' and 5'loss described under criteria of polynucleotides;
3) the identity of the terminal base and the change in terminal base at each step; and 4) the order at which these changes occur.
The matrix method of analysis entails setting up a rectangular matrix where the designated longest polynucleotide appears at position (1,1). The sequence of one half of this molecule is "known" and the nucleotide sequence at the other one half
of the molecule is designated "unknown" and represented by variables. The term "known" does not necessarily imply that the nucleotide sequence at the parent molecule is known initially. The division between the "knowns" and "unknowns" is the internal
reference point. The location of the reference point is not necessarily known initially and can be changed by changing the knowns so that this sequence superimposes a different region of the parent molecule. That is, when the sequence is solved, it
will be superimposable upon a region of the parent and the location of the internal reference point will be fixed. The location on the parent is at the line dividing the "knowns" and the "unknowns". If the 5, end of the sequence (and consequently the
entire sequence) were superimposable on a different region of the parent, the location of the internal reference point would be different. Thus, the location of the internal reference point relative to the parent molecule is determined by the "knowns".
An exemplary matrix is shown below for polynucleotides which conform to the criteria set forth for Strategy I. For a designated longest polynucleotide which contains a total of eight (8) nucleotides the matrix consists of 5 rows and 4 columns.
##STR1##
The matrix columns contain polynucleotides which have lost nucleotides at the 5' end; the rows are formed of polynucleotides which have lost from the 3' end. Nucleotides are lost from the 5' end down any column and lost from the 3' end across
any row. The matrix is constructed such that all the constraints governing the polynucleotides are satisfied and all possible polynucleotides are recorded in the matrix according to the described format.
The determination of the sequence of the polynucleotides proceeds as follows: starting at position (1,1) in the matrix, the base which has been lost is determined by the difference in base composition between the longest polynucleotide and the
next longest of the set. The change is consistent with a move to position (1,2) and/or (2,1) of the matrix. The step is repeated for each polynucleotide of the family. These moves are down a column and/or across the row from left to right. Moves down
a column or across a row from left to right are designated from/to moves. The result can be recorded, e.g. in a "lattice" which contains all coordinate positions arranged in levels such that each successive level from top to bottom corresponds to all
possible from/to moves, and each successive level from bottom to top corresponds to all possible to/from moves. A to/from move is a movement up a column and/or across a row from right to left.
______________________________________ General Lattice ##STR2## Lattice Polynucleotide Coordinate Position ______________________________________ K.sub.4 K.sub.3 K.sub.2 K.sub.1 15 K.sub.3 K.sub.2 K.sub.1 25 K.sub.2 K.sub.1 35 K.sub.1
45 ______________________________________
For each step, the base which could have been lost from the 3' or 5' end is determined and the appropriate move to a position in the matrix is made. This establishes the appropriate path in the matrix which can be designated by connecting the
corresponding coordinates in the lattice. This procedure is repeated until all consistent from/to moves are recorded in the lattice. At least one path is formed from coordinate position (1,1) to a point of convergence, i.e., a coordinate position from
which no further from/to moves can be made.
The next step is to determine which path is the correct path. This is accomplished by starting at a point of convergence and determining which to/from steps for all single or binary decisions are consistent with the terminal base data as moves
are made back to position (I,1) from the point of convergence. Assignment of a base to the 3' or 5' end is made by a to/from move which does not contradict the change in base. For all to/from moves, if the path that is chosen from one coordinate to
another corresponds to a move across a row from right to left, then the base is assigned to the 3' end which is consistent with the move. That is the base change determined from the data occurred from the 3' end therefore the base lost is assigned to
the 3' end. A contradiction arises if this assignment is inconsistent with terminal base data for the polynucleotide represented at the coordinate position or if the change in terminal base for this step is inconsistent with the data.
For all to/from moves, if the path that is chosen from one coordinate to another corresponds to a move up a column then the base change for that step indicates which base to assign to the 5' end. A contradiction would arise if the next "known"
up the column in the matrix is different from that indicated by the base change.
The sequence is solved when at least one path is found from (1,1) to a point of convergence by from/to moves and to the (1,1) position from the point of convergence by to/from moves at each data step without contradictions.
The matrix method of analysis yields a unique solution for a matrix of all possible polynucleotides of size (1/2M+1, 1/2M) that conform to the constraints for polynucleotides of Strategy I, or of size (1/2M+1,M) for polynucleotides of Strategy
II, for any set of data of M-1 polynucleotides that are successively one nucleotide less and are sequential subsets from M-1 nucleotides to a dinucleotide. (The longest polynucleotide is M nucleotides in length.)
The key to the matrix method of analysis is that there is convergence to at least one of the terminal possibilities (point in the matrix at which no further from/to moves can be made). It may converge to more than one (e.g., if the sequence
contained only A, or T, or C, or G bases, then it would converge to all possible termini of the matrix that yields the solution of the sequence). Once any terminus is determined to be correct, it can serve as an initiation point, that is, a point, or
coordinate position from which the initial to/from move is made. A terminus representing a single nucleotide or single variable in the matrix is correct if it is consistent with the data. The sequence can be deciphered by making decisions at branch
points and by taking the return path that is determined to be correct by the data, i.e. the terminal base and the change in the terminal base at each step. If more than one path is correct, anyone of the correct paths will yield the sequence.
For a given set of polynucleotides which terminate in the leftmost terminus (K.sub.4 K.sub.3 K.sub.2 K.sub.1) (see lattice, page 25,) if the data contains K.sub.1, K.sub.2, K.sub.3, K.sub.4, then this coordinate (1,5) can not be excluded as an
initiation point. But, if the sequence, K.sub.4 K.sub.3 K.sub.2 K.sub.1, is known then this is an initiation point from which to initiate the path which gives the sequence.
In the special case, where only proper subsets, (i.e., polynucleotides which have the same sequence but differ by one nucleotide) were chosen and the 5' knowns are assigned from extrinsic information, if the sequence of the polynucleotide at the
step at which a terminus is reached is not known but contains the same base composition as the terminus polynucleotide represented in the matrix, then this path can be determined to be correct if no other path exists or is consistent with the data. That
is, it is proven correct by exclusion. Furthermore, if the data converges in this fashion, and also to one or more other possibilities then, a terminus containing no variables may be an initiation point, but it may also not be. The initiation point
cannot be validated unless the sequence is known extrinsically at this point. In all cases, an initiation point is chosen that is consistent with the data; there will always be at least one. And, in the special case described, if it does not exist
directly then it exists by exclusion.
In the case where polynucleotides conform to the rules of Strategy II where all possible polynucleotides of the knowns may be present in the set to be analyzed, these polynucleotides would occupy the appropriate position in the matrix and the
additional coordinates would become part of the lattice at the proper level. Thus, sequential subsets could result in convergence to one or more of these new coordinate positions which would serve as initiation points. This is discussed in more detail
under matrix method of analysis II. A special matrix and lattice is discussed under matrix method of analysis I.
For a population of molecules with special properties specified in Strategy III, a special form of the matrix method of analysis will yield a unique solution where information that was required to solve the sequence with data from a population of
molecules generated with different constraints are not needed under these special conditions. In general, the "knowns" of any given parent molecule at position (1,1) can be defined by two different procedures which have implications as to the type of
polynucleotides which can be generated.
The "knowns" can be assigned extrinsically. That is, a reiterative procedure can be implemented such that when the 3' half of the sequence of one set of polynucleotides is solved the 5' half, the "knowns", for the adjoining group of
polynucleotides is solved. When a method is used which does not solve for the next contiguous half, the 5' half ("knowns") can be guessed. As described above, in essence, this amounts to a guess of the position of the axis on the parent molecule.
Alternately, the axis in the case where the "knowns" are assigned extrinsically may be guessed independently which conversely determines the "knowns."
Sequential subsets, as defined previously, can be polynucleotides of the exact same composition and sequence as the predecessor minus one nucleotide, i.e. a proper sequential subset, or each polynucleotide of the set could be a molecule that has
the exact same composition minus one nucleotide, but a different sequence, an improper sequential subset. All of these variables have implications with regard to the nature of the set of generated molecules and the analysis of these molecules.
As will be described under Matrix Method of Analysis I and Matrix Method of Analysis II, implementation of the Matrix Method of Analysis requires that the solution for the position of the "axis" be guessed (Matrix Method of Analysis II) or
requires that both the position of the "axis" and the "knowns" must be guessed (Matrix Method of Analysis I). Furthermore, a proper set of sequential subsets must be guessed for both methods. Also, a guess can be made for the position of the axis, and
the set of proper sequential subsets with or without the composition of the 5' one half of the polynucleotide being known extrinsically and an unambiguous solution can be found by the matrix method only if the correct guesses are put forth.
From data, the sequence is verified by the absence of contradictions; conversely, fictitious data can be proposed from guesses for the sequence other than the correct sequence and the fictitious data may be consistent with the base composition
and the terminal base of each true polynucleotides, but the matrix generated from the incorrect guess will not yield an unambiguous solution using the real data. The matrix method of analysis reconstructs the sequence in the exact order that the
nucleotides are successively eliminated from the 3' and 5' end. From other guesses a set of polynucleotides consistent with the data may be proposed but a different order of elimination must be imposed and the matrix method will only yield the correct
order for the correct matrix which is unique to each guess. Also, the four data properties are unique due | | |