A computer method is disclosed for analyzing text by employing a model known as a paradigm, which provides all the inflectional forms of a word. A file structure is created consisting of two components: a list of words (a dictionary), each word of which is associated with a set of paradigm references, and a file of paradigms consisting of grammatical categories paired with their corresponding ending or affix portions (known as desinences), which specify tense, mood, number, gender or other linguistic attributes. A computer method is also disclosed for generating the file structure of the dictionary by generating all forms of the words from a list of standard forms of the words (known as lemmas, generally the infinitive of a verb or the singular form of a noun), the lemmas being generated with their corresponding paradigms.
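The two-part file structure described in the abstract above can be illustrated with a minimal Python sketch. The paradigm identifiers (`N1`, `V1`), the English sample data, and the function names are hypothetical, and simple suffix concatenation is an assumption; the abstract does not specify how desinences are attached to a lemma.

```python
from collections import defaultdict

# Hypothetical paradigm file: paradigm id -> (grammatical category, desinence) pairs.
PARADIGMS = {
    "N1": [("singular", ""), ("plural", "s")],
    "V1": [
        ("infinitive", ""),
        ("third_person_singular", "s"),
        ("past", "ed"),
        ("present_participle", "ing"),
    ],
}

# Hypothetical lemma list: each lemma carries its paradigm references.
LEMMAS = {"book": ["N1"], "walk": ["V1", "N1"]}


def generate_forms(lemma, paradigm_id):
    """Return (form, category) pairs by appending each desinence to the lemma."""
    return [(lemma + desinence, category)
            for category, desinence in PARADIGMS[paradigm_id]]


def build_dictionary(lemmas):
    """Map every generated form to the (lemma, paradigm, category) triples behind it."""
    dictionary = defaultdict(set)
    for lemma, paradigm_ids in lemmas.items():
        for paradigm_id in paradigm_ids:
            for form, category in generate_forms(lemma, paradigm_id):
                dictionary[form].add((lemma, paradigm_id, category))
    return dict(dictionary)


DICTIONARY = build_dictionary(LEMMAS)
```

Looking up `DICTIONARY["walks"]` in this sketch yields both the verb reading and the noun reading, which is the kind of ambiguity a paradigm-based analysis has to carry forward.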
A machine translation apparatus in which the sentence construction of a source language text entered by an input device is analyzed and converted into a sentence construction in a target language in order to generate the corresponding translated text. The machine translation apparatus uses a device for determining whether or not a word string obtained from the sentence construction analysis is a proper noun with an acronym; a device for examining whether or not the number of first letters of each of a certain number of words corresponds to the number of letters of the acronym, and also for examining whether or not these words are registered in a dictionary; and a device for outputting the corresponding term after it is translated into the target language when the words are registered in the dictionary, and for directly outputting the words whose number of first letters corresponds to the number of letters of the acronym, without translating them, when the words are not registered in the dictionary.
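The acronym check in the abstract above might be sketched as follows. The function name, the term dictionary, and the German sample translation are hypothetical. The abstract only requires that the number of first letters match the number of letters in the acronym; comparing the letters themselves is an extra assumption made here.

```python
def translate_acronym_phrase(words, acronym, term_dictionary):
    """Handle the words preceding an acronym (all names here are hypothetical).

    Returns the dictionary translation if the candidate phrase is registered,
    the untranslated phrase if it is not, and None if the words do not match
    the acronym at all.
    """
    n = len(acronym)
    if n == 0 or len(words) < n:
        return None
    # Take as many words as the acronym has letters, one first letter per word.
    candidate = words[-n:]
    initials = "".join(word[0].upper() for word in candidate)
    # Assumption: the letters must also agree, not only their count.
    if initials != acronym.upper():
        return None
    phrase = " ".join(candidate)
    # Registered phrase: output its translation. Otherwise output it unchanged.
    return term_dictionary.get(phrase.lower(), phrase)


TERMS = {"world health organization": "Weltgesundheitsorganisation"}
registered = translate_acronym_phrase(
    ["the", "World", "Health", "Organization"], "WHO", TERMS)
unregistered = translate_acronym_phrase(
    ["Acme", "Widget", "Group"], "AWG", TERMS)
```

In this sketch the registered phrase is replaced by its dictionary entry, while the unregistered proper noun passes through untranslated.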
To enable inflected word forms to be derived, a method and a data processing unit are provided for inflecting a given word and for adapting the lexical data of the word correspondingly by reference to a classification scheme containing inflection rules for a given natural language. The effect achieved is that inflected word forms do not have to be contained in an electronic word register, and this results in a considerable reduction in memory space occupied by the word register.
A computer method is disclosed for parsing a sentence into sentence parts to be described with functional indications, by means of lexicalized word units. The method includes determining, for each word unit and for each constituent, the functional word category within the constituent or a new constituent and, for each constituent, describing a step relating to closure of the constituent, and allocating a functional label to that constituent. The current constituent is then tested against rules relating to the context of the words and/or subsidiary constituents, and the probability factor allocated to the sentence representation is reevaluated. Each sentence representation having a probability factor above a certain threshold value is then selected. A grammatically incorrect sentence that has already been parsed can be corrected by selecting the grammatically incorrect constituent within that sentence and changing it by reference to rules stored in the computer.
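The reevaluation and threshold selection in the abstract above can be illustrated with a small sketch. Everything concrete in it is an assumption: the class and function names, the single rule, the multiplicative penalty, and the threshold of 0.5 are hypothetical, since the abstract does not say how the probability factor is computed.

```python
from dataclasses import dataclass


@dataclass
class SentenceRepresentation:
    constituents: list          # (functional label, list of words) pairs
    probability: float = 1.0    # probability factor of this representation


def reevaluate(representation, rules):
    """Lower the probability factor once for every constituent that breaks a rule."""
    for check, penalty in rules:
        for label, words in representation.constituents:
            if not check(label, words):
                representation.probability *= penalty
    return representation


def select(representations, threshold):
    """Keep only representations whose probability factor exceeds the threshold."""
    return [r for r in representations if r.probability > threshold]


# Hypothetical rule: a subject constituent must contain at least one word.
RULES = [(lambda label, words: label != "subject" or len(words) > 0, 0.1)]

rep_a = SentenceRepresentation([("subject", ["the", "dog"]), ("predicate", ["barks"])])
rep_b = SentenceRepresentation([("subject", []), ("predicate", ["the", "dog", "barks"])])
selected = select([reevaluate(r, RULES) for r in (rep_a, rep_b)], threshold=0.5)
```

Here the representation with an empty subject is penalized and falls below the threshold, so only the other representation is selected.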
This invention pertains to a dictionary-linked text base apparatus having a text searching function, and to text searching methods using this apparatus at various levels. By linking a text base with an electronic dictionary, the invention produces more accurate search results and enables searches at three levels: at the morpheme level, all texts containing the morpheme included in a searched object word; at the syntactic level, all texts related to the grammatical attribute of the item in the electronic dictionary matching or including the morpheme; and at the semantic level, all texts related to at least one concept of the item in the electronic dictionary. The invention includes an electronic dictionary for storing, in correspondence with an identifier for each item, at least one morpheme identifier forming the item, an identifier of the grammatical attribute of the item, and a concept identifier. In addition, it includes a relative information part for showing an identifier for the related text, in correspondence with each of the morpheme identifiers, the grammatical attribute identifier and the concept identifier in the electronic dictionary.
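The three search levels in the abstract above can be approximated with a small sketch. The dictionary items, identifiers, text numbers, and the `search` function are hypothetical, and the lookup is deliberately simplified: each level just follows the item's identifiers of that kind into the relative information part.

```python
# Hypothetical electronic dictionary: item -> morpheme ids, attribute id, concept ids.
DICTIONARY = {
    "run": {"morphemes": ["run"], "attribute": "verb", "concepts": ["locomotion"]},
    "runner": {"morphemes": ["run", "er"], "attribute": "noun", "concepts": ["athlete"]},
    "sprinter": {"morphemes": ["sprint", "er"], "attribute": "noun", "concepts": ["athlete"]},
}

# Hypothetical relative information part: identifier -> ids of related texts.
RELATED_TEXTS = {
    "morpheme": {"run": {1, 2}, "er": {2, 3}, "sprint": {3}},
    "attribute": {"verb": {1}, "noun": {2, 3}},
    "concept": {"locomotion": {1}, "athlete": {2, 3}},
}


def search(word, level):
    """Return the ids of texts related to `word` at the requested level."""
    item = DICTIONARY[word]
    if level == "morpheme":
        kind, keys = "morpheme", item["morphemes"]
    elif level == "syntactic":
        kind, keys = "attribute", [item["attribute"]]
    elif level == "semantic":
        kind, keys = "concept", item["concepts"]
    else:
        raise ValueError(f"unknown search level: {level}")
    result = set()
    for key in keys:
        result |= RELATED_TEXTS[kind].get(key, set())
    return result
```

In this sample data, a semantic search for "runner" also reaches text 3, which mentions only "sprinter", because both items share the concept identifier "athlete".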
A data processing method is disclosed for storing and retrieving text. The storage part of the method includes the steps of compiling a vocabulary list of words occurring in the text and augmenting the vocabulary list with lemmas of the words in the text to form an augmented word list. It then continues with the steps of compiling a cross reference table relating the lemmas of the words to locations of the words in the text, and storing the text, the augmented word list and the cross reference table. The retrieval part of the method includes the steps of inputting a query word to access a portion of the stored text, searching the augmented word list using the query word as a search term, and accessing the cross reference table with a lemma of the query word to locate the portion of the stored text. The resulting invention enables faster "fuzzy" searches of text in documents, while allowing the cross reference lists used in the retrieval process to be stored compactly.
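The storage and retrieval steps in the abstract above can be sketched as follows. The lemma table, the function names, and the use of character offsets as "locations" are assumptions made for illustration; a real system would use a proper morphological analyzer in place of the hard-coded table.

```python
import re
from collections import defaultdict

# Hypothetical lemma table standing in for a morphological analyzer.
LEMMA_TABLE = {"ran": "run", "runs": "run", "running": "run"}


def lemma_of(word):
    """Return the lemma of a word, or the word itself if no entry exists."""
    return LEMMA_TABLE.get(word, word)


def store(text):
    """Build the augmented word list and the lemma-to-location cross reference table."""
    tokens = [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z]+", text)]
    vocabulary = {word for word, _ in tokens}
    augmented = vocabulary | {lemma_of(word) for word in vocabulary}
    cross_reference = defaultdict(list)
    for word, position in tokens:
        # Locations are indexed by lemma only, which keeps the table compact.
        cross_reference[lemma_of(word)].append(position)
    return {"text": text, "words": augmented, "cross_reference": dict(cross_reference)}


def retrieve(stored, query):
    """Return the character offsets of every form sharing the query word's lemma."""
    word = query.lower()
    if word not in stored["words"]:
        return []
    return stored["cross_reference"].get(lemma_of(word), [])


STORED = store("She ran home. He runs daily.")
```

With this sample text, querying either "run" or "runs" returns the offsets of both "ran" and "runs", because the cross reference table is keyed by the shared lemma rather than by surface form.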