A computational method for maximizing open reading frame length in an assembly consensus sequence is provided. Systems employing the method are also provided.
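The abstract does not describe how the maximization is carried out. As a minimal sketch of the quantity being maximized, the following measures the longest open reading frame in the three forward frames of a sequence and then picks, from a list of candidate consensus sequences, the one with the longest frame. The function names, the ATG-to-stop definition of an open reading frame, and the idea of choosing among precomputed candidates are all assumptions made for illustration, not the disclosed method.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, length_nt) of the longest ATG..stop ORF in the three
    forward frames, or (-1, 0) if none is found. Hypothetical helper."""
    seq = seq.upper()
    best = (-1, 0)
    for frame in range(3):
        start = None  # index of the current open ATG in this frame, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start  # length includes the stop codon
                if length > best[1]:
                    best = (start, length)
                start = None
    return best

def pick_consensus(candidates):
    """Choose the candidate whose longest ORF is longest (assumed selection rule)."""
    return max(candidates, key=lambda s: longest_orf(s)[1])

# Two made-up candidate consensus sequences differing at one position.
candidates = ["ATGAAATAGCCCTAA", "ATGAAACAGCCCTAA"]
chosen = pick_consensus(candidates)
```

In this toy case the first candidate has a premature stop codon, so the second candidate is chosen.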
A method and apparatus are disclosed for identifying protein-coding regions in a nucleic acid sequence. To predict the coding regions within a sequence, a single metric is used without reference or comparison to a predetermined noise level or signal-to-noise ratio. This single metric, periodicity at 1/3, is analyzed using a discrete cosine transformation (DiCTion) function. For each open reading frame or sequence region, a score is generated using the discrete cosine transformation function. Regions within a sequence that have a score above a predetermined threshold are identified as protein-coding regions. Aspects of the invention may also be applied to gather other information from sequences, such as the location of the borders of a putative coding region within a genomic sequence.
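The scoring step can be illustrated with a minimal sketch. The abstract does not give the exact DiCTion scoring function, so the following is an assumption: each base is turned into a 0/1 indicator sequence, the DCT-II coefficient at index k = 2N/3 (whose basis function repeats every three positions) is computed for each base, and the squared coefficients are summed and divided by N. The function names and the threshold value are hypothetical.

```python
import math

def period3_dct_score(seq):
    """Score periodicity at frequency 1/3 using one DCT-II coefficient per base.
    Assumed scoring rule for illustration; not the patented formula."""
    seq = seq.upper()
    n = len(seq) - len(seq) % 3  # trim so that k = 2n/3 is an integer
    if n == 0:
        return 0.0
    k = 2 * n // 3
    total = 0.0
    for base in "ACGT":
        # DCT-II coefficient of the 0/1 indicator sequence for this base
        coeff = sum(
            math.cos(math.pi * (i + 0.5) * k / n)
            for i in range(n)
            if seq[i] == base
        )
        total += coeff * coeff
    return total / n

def is_coding(seq, threshold):
    """Flag a region as protein-coding when its score exceeds the threshold."""
    return period3_dct_score(seq) > threshold

periodic = "ATG" * 30   # strongly period-3 toy sequence
flat = "A" * 90         # no period-3 structure
flagged = is_coding(periodic, threshold=1.0)  # threshold chosen arbitrarily
```

A single cosine term is sensitive to phase, so a practical score might combine several terms; the sketch only shows how one fixed-frequency coefficient separates a period-3 sequence from a flat one.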
A system and method are provided for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms. The method includes processing information obtained from a base calling system and creating a plurality of refined base calls using a plurality of original base calls and a plurality of intrinsic peak characteristics. A quality value is also assigned to each of the plurality of refined base calls using the plurality of intrinsic peak characteristics. Processing comprises detecting a plurality of peaks, expanding the plurality of peaks, and resolving the plurality of expanded peaks. Resolving may include fitting the plurality of expanded peaks using a model of a peak shape. A peak resolution parameter is calculated and used in processing. The system may also be trained.
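The peak detection step can be sketched as follows. The abstract does not specify the peak shape model or the detection rule, so this example assumes a Gaussian peak shape and a simple local-maximum test on a single synthetic trace. All names and parameter values are hypothetical, and the expanding, fitting, and quality-value steps are not shown.

```python
import math

def gaussian_peak(x, center, height, width):
    """Assumed peak shape model: a Gaussian with the given center, height, width."""
    return height * math.exp(-0.5 * ((x - center) / width) ** 2)

def detect_peaks(trace, min_height=0.0):
    """Return indices of local maxima in the trace that exceed min_height."""
    peaks = []
    for i in range(1, len(trace) - 1):
        if trace[i] > min_height and trace[i - 1] < trace[i] >= trace[i + 1]:
            peaks.append(i)
    return peaks

# Synthetic single-channel trace with two well-separated peaks.
trace = [
    gaussian_peak(x, 20, 1.0, 3.0) + gaussian_peak(x, 40, 0.8, 3.0)
    for x in range(60)
]
peaks = detect_peaks(trace, min_height=0.1)
```

On real electropherograms neighbouring peaks overlap, which is presumably why the abstract adds a model-based resolving step after detection.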