Claims
We claim:
1. An improved machine translation system having a natural language source
module for accepting externally introduced text in said source language,
said module including a lexical database, said system being broadly based
upon the concept of Chaos and conducts a divergent search in the source
language, a morpheme root database, and further including a morphological
word stripping means, said means to be implemented on a data processing
device, said system source module includes means implementing a method
having the steps whereby each of the words in a subject clause, phrase, or
sentence of said externally introduced source language text are
individually compared first to data in said lexical database and if said
individual words are not found among said data in said lexical database
then means are provided whereby said words are subjected to said
morphological word stripping means, said stripping means being directed to
the affixes of said words and first to the stripping of suffixes, if any,
from each said word followed by the step of comparing an individual
stripped word, in the absence of that particular word's stripped suffix,
with the data in said morpheme root database, which comparison normally
proceeds downward through descending length character strings until a
morpheme root match is found, further stripping and comparison with said
database are repeated as often as required to find a root match.
2. A machine translation system as claimed in claim 1 wherein the method
utilizing said word affix stripping means also includes means for
stripping prefixes, and infixes, if any, from said words in the event that
the stripping of suffixes was not adequate for reaching the word root and
matching each said affix stripped word to said morpheme root data base.
3. A machine translation system of the type claimed in claim 2 wherein such
a divergent search can produce a multiplicity of possible solutions.
4. A machine translation system of the type claimed in claim 3 wherein such
a divergent search will also include inflected forms of all words.
5. A machine translation system of the type claimed in claim 2 wherein
means for attaching appropriate tags are provided and at least one
appropriate tag is attached to said root word denoting the affixes such as
prefixes, infixes, as well as suffixes that have been stripped from said
root word, along with syntactic analysis, including but not limited to,
word type, tense, gender, pluralism, and location clause or phrase,
subject, object, and any other identification thought necessary in order
to provide a smooth translation into a target language.
6. An improved machine translation system to be implemented on a data
processing device, as claimed in claim 5, wherein said system generally
consists of three modules, said modules including said source or first
module in a first natural national language adapted to accept said
externally introduced text in said first language that is to be
translated, said text being subjected to said method contained in said
source module, a universal second or intermediate bridge module including
means for translating said first national natural language into a
universal internationally created second language, said second module
including means for carrying out said translation with said at least one
tag attached to each said word for identification and classification
purposes; and a third or target module carrying a second natural national
language, said target module including means and a database capable of
accepting the tagged words from said second module and readily translating
them into said target second natural national language; said universal
second or intermediate bridge module being usable universally with all of
said first and third or source and target national natural language
modules, respectively, regardless of whatever different languages might be
resident therein.
7. A machine translation system of the type claimed in claim 6 wherein said
third module including means for utilizing a portion of its database for
direct translation from said universal internationally created language
into said target second natural national language, and a portion of said
module having means for recombinant morphology usable in the rebuilding
step, if necessary, of root words in said second national natural target
language text by the method of addition of morphemes in said target
language in order to bring about a relatively accurate and true
translation thereof in relation to any stripped affixes carried out in
said first module.
8. A machine translation system of the type claimed in claim 7 wherein said
word stripping is the degenerative stage of morphology in the source
language while said recombinant or replacement of the stripped
suffix/prefix to the root word is the generative stage of morphology in
the target language, the generative stage being substantially a mirror
image of the degenerative stage.
9. A machine translation system of the type claimed in claim 8 wherein said
generative stage is based on substantially the reverse of the degenerative
morphology table of said target language when it is used as a source
language.
10. A machine translation system of the type claimed in claim 9 wherein
said generative morphology is the means for recognizing and being
cognizant of spelling shifts, if they exist, in said target second
language contained in said third module.
11. A machine translation system for translating text from a first national
natural source language to a second national natural target language
through a universal machine method adapted to be implemented on a data
processing device including a first module having a lexical database
identifiable with said source language and said first module including
means capable of performing a syntactic and lexical analysis on said text
and attaching informational tags on each word of said text, a universal
intermediate second module providing an interface having an operating
environment for display to a user and a basis for issuing commands and
receiving information, said second module also including a lexical
database in an intermediate international created language that is capable
of accepting said syntactic and lexical analysis of said text from said
first module and including means for translating said source language
words carrying said informational tags into said international created
language while retaining said informational tags, and a third target
module having a lexical database identifiable with said second national
natural target language, and including means to accept said intermediate
created language with its tagged words of said text and proceed to
translate the text into the target national natural language, said second
module being universally accepted by a multiplicity of differing national
languages each of which has one of its own said first source module or one
of its own said third target module; said first module also including a
root word morpheme database, and having means whereby any individual words
of said source text which cannot be initially matched with a word in said
first module lexical database are then subjected to morphological
stripping of endings and prefixes until the root of said words can be
matched with said root word morpheme database, appropriate designating
tags are attached to each said root word indicating, but not limited to,
the root word designator, type of word, tense, gender, pluralism, and
particular ending or prefix morpheme stripped therefrom, means are
provided so that appropriate morphemes can be added to the translated root
word in the target language of said third module, said system further
including means for inputting text into file means in said first module,
said machine method includes means adapted to read the said input file a
character at a time until it reaches some form of punctuation which
terminates a statement, including periods, commas, exclamation marks,
dashes, ellipsis, question marks; said last mentioned means is directed to
process only one statement at a time and all punctuation falls through as
is appropriate; means are provided wherein each word in the statement is
looked up in the lexical database, if no match is found for an individual
word the lexical database returns an error code; said individual words
returned with an error code go to a morphology database including means
which strips successive affixes, including suffixes or prefixes, off said
word and modifies it to determine if the root of said word is in the
lexical database, such a determination is made by checking said word against
said database each time a morpheme is stripped from said word, and
repeated until a match is found, said lexical database returns grammatical
information about each of said words, however, said morphology database
includes means that has the power to supersede this grammatical
information during said stripping operation, however, if said word is of
the type that may be many different parts of speech, including a verb,
noun, adjective, adverb, article or preposition, and is ambiguous and/or
did not pass through morphological stripping; means are provided for an
indeterminate flag to be set and additional means are provided whereby a
grammatical analysis is performed by examination of the proximal words, if
the said word is the first word followed by a noun the probability of it
being adjectival is very high, if, on the other hand, the word before
said word is an article said word must be a noun, in either event said
word is appropriately flagged as to word type, once said word type has
been resolved, in the lexical database, means are provided whereby it is
tagged as to type and the proper individual identification for said word,
which identification remains the same regardless of what language or what
module the text may reside, if said word has multiple possibilities as to
its type, as set forth above, namely including verb, noun, adjective,
adverb, article, or preposition, then means are provided whereby a
heuristic approach is utilized and it will appear as many times as there
are possibilities, lookups are repeated a plurality of times until no
ambiguities are remaining, said system further including program means
whereby verbs are identified next by starting at the end of the sentence
and/or statement and working forward until a first main verb is located,
said system program means is intelligent since it stops processing when it
encounters any additional main verbs or definite clause markers or
punctuation, said program means then continues and if a verb is marked as
an infinitive said program means moves on to further translation, verbs
are tensed and during this process the said program means checks for
modals and auxiliary verbs and sets them aside for later treatment.
12. A machine translation system as claimed in claim 11 including means
whereby, after verbs are tensed, subjects and objects are located by
proximal rule along with clausal analysis, said means then commences with
the last verb in the statement and works backwards looking for nouns as
well as moving forward from the verb to look for nouns, on each side of
said verb the program means looks for clause and direction markers,
direction indicates an object when after said verb, and the program looks
for nouns before the verb for subjects.
13. A machine translation system as claimed in claim 12 including means
wherein phrases are identified idiomatically at the end of each sentence,
if a word is part of a phrase it is assembled, using the same mechanism by
which verbs are handled, a separate phrase/idiom database is provided and
when it is identified an intermediate number is used in place of the
phrase, means are provided in said third module database for accepting
phrases in their own database and translating them into the target
language from the intermediate language bridge.
14. A machine translation system as claimed in claim 13 wherein means are
provided whereby the user has the option to use his own lexicon to define
a particular word differently than the database has done, once the user
has introduced his definition it will supersede that of the program in the
lookup and will be processed that way thereafter as if it were part of the
original program definition.
Description
BACKGROUND OF THE INVENTION
This is a continuation-in-part of application Ser. No. 07/010,989, filed
Feb. 5, 1987.
The present invention relates to the translation of documents having a
source text written in any one of a plurality of national languages being
translated into a text that is written in any one of a plurality of second
target national languages by utilizing a created international language as
an intermediate pathway between the two chosen national languages.
The desire of various nationalities speaking different languages to readily
converse has been ever present in the history of humanity. There are about
3,000 known languages in the world (the number varies according to what is
counted as a language; dialects that are clearly just that are not
included in this number), and each is the vehicle of a culture that is
different in at least some ways from any other culture. The learning and
teaching of languages, and the recording of languages in intercultural
communication, are matters of primary importance.
Languages have had to be taught and learned for centuries. Everywhere, when
speakers of different languages have come in contact, somebody had to
learn a foreign language. There have always been individuals who found it
interesting or profitable to do this. The earliest of explorers and
traders were forced by necessity to learn to understand one another's
language or to perish in the economic as well as the physical worlds.
This, as we all know, resulted in extensive and long language studies with
the erudite academicians handling the complex aspects of the
communications exchange, while the more pragmatic day-to-day traders and
businessmen developed short terse means of communication. A need arose to
satisfy the requirements of an exact but easy means for correspondence
between lay persons and small businessmen.
Small, handheld, phrase books proliferated to facilitate phonetic
intercourse by visiting tourists and servicemen. Unfortunately, the
phonics in these booklets, as well as their limited scope, limited the
amount of intercourse possible. Small dictionaries that permitted word to
word translation were available but unfortunately they did not provide a
means for transposing words to give a more accurate grammatical rendition
in the target language. Variations on these items became available upon
the appearance of the liquid and gaseous crystal readout devices which
permitted storage of a limited vocabulary of words and their direct
translatable equivalents in a phonic form. Here again, the limited
capacity did not permit the introduction of adequate grammatical
improvement of syntax.
The advent of the personal computers and the microprocessors has brought a
flood of approaches to the patent offices around the world. The devices
have ranged from direct word for word translation devices to key word
translation directly into phrases. For example, a word to word translation
device can be found in U.S. Pat. No. 4,502,128, TRANSLATION BETWEEN
NATURAL LANGUAGES, this patent being directed to an inputting of a
sentence described by a first natural language being sectioned into
individual words. Parts of speech corresponding to these individual words
are retrieved from a lexical word storage, whereby the input sentence is
described by a corresponding string of the parts-of-speech as retrieved. A
translation pattern table previously prepared compares strings of
parts-of-speech for the first natural language with those of the second
language and transforms the first strings of parts-of-speech into strings
of parts-of-speech of the second language. The output sentence described
by the second natural language is generated by sequencing target words in
accordance with the sequential order of the parts of speech of the string
pattern obtained after the transformation. This is a complex procedure at
best.
U.S. Pat. Nos. 4,412,305; 4,541,069; 4,439,836; and 4,365,315 relate to
translation devices wherein a single word is used as the input to produce
the translation of entire groups of words, such as sentences or phrases; a
single word entered will access particular sentences within limited
subject categories; letters within words or groups of words produces an
equivalency detectable by a comparison circuit resulting in the
representation in a second language of a plurality of words regardless of
whether it is a noninflected word or an inflected word; and phrases can be
tied to computer specified aural or visual control messages for use by an
operator who chooses to use a particular language in the operation of a
machine tool. Similarly, alphabetical accessing to an electronic
translator can be accomplished by storing address codes with each word, as
in U.S. Pat. No. 4,541,069; as well as utilization of a system for
automatically hyphenating and verifying the spelling of words in a
multi-lingual document can be carried out under U.S. Pat. No. 4,456,969.
As can be seen from study of these prior art references, generally found in
U.S. Cl. 364/900, a direct translation from one natural language to another
natural language has a multiplicity of roadblocks, either in the lack of
an available direct translation or in major grammatical problems due to
language structure or in the relative stage of development of one of the
languages.
Later patents relating to translation systems and apparatus can be found in
U.S. Pat. Nos. 4,774,596; 4,774,666; 4,800,522; 4,814,987; 4,814,988; and
4,833,611. These patents relate, among other things, to the use of
translation dictionaries, defined grammatical rules and tree conversion
rules which, unfortunately, are quite rigid in nature in that the
apparatus and systems involved merely utilize direct translation between
languages and rigid grammatical relationships. They do not have the
flexibility or adaptability necessary to handle the translation of unique
clauses or phrases.
None of the cited references have the universality and reversibility that
is found in the present invention and its improvements set forth
hereinafter. The cited references are useable only with a single source
language going into a single target language. To adapt such reference
devices and systems for use with other languages would require a complete
reworking of the programs.
SUMMARY OF THE INVENTION
The present invention relates to the translation between two national
languages by the utilization of an improved intermediate step or pathway
of translating into an improved created international language from a
first or source national language and then translating from the improved
created international language into a second or target national language.
Such a translation is reversible in either direction through the
intermediate path and can accommodate translation from one national
language into the created international language intermediate path and
then translation into any one of a multiplicity of second national
languages from the created international language intermediate path text.
It must be recognized that, while the term "created international language"
or "artificial language" is used herein, this invention contemplates as
well the utilization of alphabetic, numeric, alphanumeric, symbolic (or
any combination of these) that relates to a compressed vocabulary and/or
syntax (or a non-compressed vocabulary) but with each having a simplified
and regular grammar.
While the original invention generally contemplated the use of forms of
Esperanto, or other universal created languages, as the intermediate path,
the present improved form utilizes this as well as stripped words that are
primarily in their root form and capable of accepting a multiplicity of
endings that may transform the root from an adjectival form to the
adverbial form, or to transform a verb to a noun, or vice versa. Thus,
Esperanto or intermediate path language of the original invention now
includes one or more indicators or tags which provide a complete
grammatical and lexical analysis of a particular word. All of this
information is appended into the intermediate language for a particular
word, i.e., a tag on the word delineates the type of word that it is; another
tag indicates its relative part of speech; there are tags that have to do
with verb tensing information and construction information, for instance,
whether it is a part of a phrase, or a clause, or things of that nature.
This numbering and word definition must generally be non-specific in
nature since the system is utilized to go from language to language and it
is desirable to remain independent, whereby the intermediate language has
the ability of branching out into any other language that is chosen, based
on the information that is given to it.
An economic consideration of the improved invention is to provide a means
for permitting its use on PC-type computers that are readily available for
office usage rather than building up a monster size data base that will
require a huge mainframe computer in order to carry out the translations
contemplated. Therefore, it was resolved to utilize a single intermediate
path disk that can be used with any combination of source language-target
language disks to reduce the memory capacity of the operating computer
required to carry out such a translation. Thus, the universality and
reversibility are maintained with the only change being to restrict the
dictionary and morphology data base required for the selected source and
target languages.
A powerful morphological analysis is applied to the words themselves. It
strips endings and/or prefixes from words to reduce them to their roots, so
that the dictionary, for the most part, lists only the roots rather than
all of the forms of a word. The morphology analysis can also be used to
find out if, for example, a word is a verb that has been made into a noun,
or an adjective that has been made into an adverb, or some other
transitional configuration, all of which possibilities can be handled by
the morphology. Thus, while the database itself may be small, the
morphology amplifies what is in it by an order of magnitude. For example,
no plurals are listed in the database itself, and very few "-ly" adverbs
are listed, because the morphology can find those forms and mark them as
such; that marking is carried into the intermediate language along with
the root form of the particular word. If it is an adjective which has come
to be an adverb,
then this information can be used in any other language by taking the
corresponding root word in the other language and turning it into its
appropriate usage as the other language would require. This permits the
database to be kept to a minimum and to admit other more important things
into the database without taking up a lot of room with plurals and other
things of that nature which can be made or broken as needed, thereby
giving a much richer language possibility in the translation process.
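As an illustration only, the stripping and tagging described above might be sketched in Python as follows. The root table, the suffix list, and the function name strip_to_root are hypothetical stand-ins and are not taken from this disclosure; the sketch handles suffixes only and records each stripped suffix as a tag.

```python
# Hypothetical, tiny stand-ins for the root database and the suffix table.
ROOTS = {"quick": "adjective", "book": "noun", "walk": "verb"}
SUFFIXES = ["ing", "ly", "ed", "s"]  # longer suffixes are tried first


def strip_to_root(word, roots=ROOTS, suffixes=SUFFIXES):
    """Strip suffixes one at a time until the remainder matches a root.

    Returns (root, stripped_suffixes), or (None, []) when no root is found.
    """
    word = word.lower()
    stripped = []
    while word not in roots:
        for suffix in suffixes:
            candidate = word[: -len(suffix)]
            if word.endswith(suffix) and candidate:
                word = candidate          # drop the suffix and look up again
                stripped.append(suffix)   # remember it as a tag
                break
        else:
            return None, []               # nothing left to strip, no match
    return word, stripped


print(strip_to_root("quickly"))
print(strip_to_root("books"))
```

Because only roots are stored, the three-entry table above also covers forms such as "walked", "walking", and "books"; a fuller version would add prefix handling and spelling shifts, which this sketch omits.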
The use of parsing and flags to numerically keep track of the sentence or
clause being worked on also expedites the operation of the method.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a diagrammatic showing of the utilization of a created
international language, the example utilizes Esperanto although others
could be used, as an intermediate pathway in translating between two
national languages;
FIGS. 2A and 2B are an expanded block diagram showing a schematic
arrangement of the flow of information within a computer under the method
of the present invention and can be referred to in following the
description that follows;
FIG. 3 is a wheel shaped schematic showing the intermediate path created
international language module being located at the "hub" of the wheel,
while the source and target language modules are located at the ends of
the spokes with their sole interconnecting path being universally and
reversibly through the "hub" or intermediate created language path;
FIG. 3A is a schematic diagram showing in abbreviated format the keyboard
where the text is manually input, the source module of language "A", the
universal intermediate module, and the target module of language "B", with
ancillary items such as the visual CRT display, on-line storage means, and
a printer terminal; and
FIG. 4 is a schematic diagram showing the improved flow of information
utilized in the improved translation system.
DESCRIPTION OF AN EMBODIMENT
This invention contemplates the usage of a computer, such as an IBM PC or
PS/2, that utilizes MS-DOS or Micro Channel architecture and is capable of
accepting BASIC as well as other programming languages, such as C,
Assembler language, Cobol, Fortran, or any other compatible computer
language. Other software such as compilers plus other speed enhancing
arrangements can be utilized in subroutines as well as in the main stream
of this method.
As was indicated above, this method of translation between two national
languages includes the step of utilizing a created international language
bridge, whereby any one of a plurality of national languages can be
compatibly translated into the chosen created international language and,
then, can be translated from the created international language into any
chosen one of a plurality of national languages. There are several such
"created" international languages, the most common of which is Esperanto
created in the 1880's by Dr. Ludovic Lazarus Zamenhof (1859-1917) of
Poland. It contains a compressed vocabulary (roughly one-tenth the number
of words as English) and a completely simplified and regular grammar. This
eliminates the need for many complex mathematical statements to account
for the grammatical differences between existing national languages. While
other created international languages, for example, Interlingua, Modified
Esperanto, or Volapük, could be used, the present disclosure utilizes a
modified or stripped down Esperanto. It must be recognized that, while the
term "created international language" or "artificial language" is used
herein, this invention contemplates as well the utilization of alphabetic,
numeric, alphanumeric, symbolic (or any combination of these) that relates
to a compressed vocabulary and/or syntax (or a non-compressed vocabulary)
such as a data base of word roots but with each of these having a
simplified and regular grammar which can be modified by one or more of
identifiable suffixes or prefixes.
There are Esperanto textbooks available in some fifty languages. The two
national languages used in the illustrated embodiments of this
specification are English and Spanish; however, the method can be
successfully utilized with a multitude of other languages, e.g., Japanese,
German, French, Russian, and Chinese. Additionally, almost all
languages are compatible with an intermediate simplified and regularized
language, one of which is Esperanto, and they could be readily adapted for
use with this method. It must be realized that, by utilizing a regularized
Esperanto (or colloquially, Esperantoish) as the intermediate pathway
between the two national languages, the method is reversible and the
translation from language A to language B can go in the opposite
direction, from language B to language A, with equal facility, see FIGS. 1
and 3.
In the original application, Ser. No. 07/010,989, filed Feb. 5, 1987, a
limited multiple language dictionary data base, including Esperanto, was
prepared and placed on a limited access disk, along with other
subroutines that can be accessed by the computer and called upon to
smooth out the translation as it progressed. It was recognized
that it is not only possible, but also acceptable, in certain
circumstances, to utilize the simplistic approach of translating from a
base national language into Esperanto and then directly into the target
national language. This often produces an elementary type of resulting
language that is totally acceptable in instances where the recipient of
the document is not linguistically sophisticated, or where the message
being conveyed does not require additional nuances. This is often utilized
to great advantage in brief offers and acceptances in commercial
transactions, where one party orders a specific quantity of a product
having a generic name utilized in both languages and the second party
merely confirms availability and delivery information. It also is often
readily acceptable in the scientific community.
The niceties required in social intercourse, however, can be supplied by
the application of the other subroutines shown in abbreviated flowchart
form in FIG. 2, as well as the improved routines shown in FIG. 4 and which
are described now in more detail.
An operator makes a choice, from an appropriate starting menu, of the
national language that will be used in entering the text that is to be
translated. From a keyboard terminal, the source text in the chosen
language, in this example English, is introduced into the computer and
placed in a created text file <ENGTXT>. (It should be noted that the
language of the boxes in the flow chart of FIG. 2 will be utilized in the
description of each of the steps in this method.)
When the text has been fully entered into the text file it is then operated
upon and parsed into individual sentences with each sentence being placed
in its own file <SENPARSE>.
Each of the individual sentence files is preferably "flagged" whereby it is
numerically kept track of, thereby aiding the computer in ascertaining
which sentence it is working on, as well as providing a return point of a
loop for operation on successive sentences <SENROUT1>. (In the flowchart
of FIG. 2 the term "TEXT" is whimsically shown as being broken up into
individual parts and includes an additional one indicated as "n+1" which
would indicate that all of the sentences had been handled and the computer
would then proceed to the steps leading to "end".)
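The sentence parsing and numeric flagging described above might be sketched as follows. This is a minimal sketch under stated assumptions: the sentences are held in an in-memory dictionary rather than in individual files, a sentence is assumed to end at a period, exclamation mark, or question mark, and the name parse_sentences is hypothetical.

```python
import re


def parse_sentences(text):
    """Split text into sentences and flag each with a running number."""
    # Assumption: '.', '!' and '?' end a sentence; any trailing text
    # without closing punctuation is kept as a final sentence.
    parts = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text)
    sentences = [p.strip() for p in parts]
    sentences = [s for s in sentences if s]
    return dict(enumerate(sentences, start=1))


flagged = parse_sentences("I see a dog. It runs! Why?")
for flag in sorted(flagged):
    # The flag tells the loop which sentence is being worked on.
    print(flag, flagged[flag])
# Once the flags are exhausted (the "n+1" case), processing ends.
```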
With the text parsed into individual sentences and properly flagged, the
individual words are translated from the original text language, English,
into Esperanto to form the streamlined intermediate pathway. Each
individual word is assigned a grammatical tag as it is being translated.
All irregular verbs in English are "smoothed out" into regular
Esperanto verb endings. Since Esperanto uses one-tenth the number of words
that are found in the English vocabulary the number of "lookups" in the
electronic data base is drastically reduced. The dictionary data base, as
was previously noted, is provided with limited access whereby introduction
of special words that have a highly repeated volume of usage or which are
of a specialized nature, i.e., medical, scientific, or restricted
commercial, can under proper circumstances and procedures be added to the
dictionary.
As this translation continues, the output is placed in a temporary file until the
entire sentence being acted upon is completely translated into the
intermediate language.
The next step is for the computer to access another sector of the
electronic dictionary data base for the translation of all intermediate
pathway Esperanto words (except verbs) into the target language
equivalents.
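The two lookups described above, source language into the intermediate language and intermediate language into the target language, might be sketched as follows. The three-word dictionaries, the TaggedWord class, and the function names are hypothetical and serve only to show a tag travelling with the word through the intermediate pathway; they do not reproduce the data base of this disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedWord:
    root: str
    tags: dict = field(default_factory=dict)


# Illustrative entries only: English -> (Esperanto root, word type).
EN_TO_EO = {
    "house": ("domo", "noun"),
    "dog": ("hundo", "noun"),
    "water": ("akvo", "noun"),
}
# Illustrative entries only: Esperanto root -> Spanish equivalent.
EO_TO_ES = {"domo": "casa", "hundo": "perro", "akvo": "agua"}


def to_intermediate(word):
    """Look up a source word and attach a grammatical tag."""
    entry = EN_TO_EO.get(word.lower())
    if entry is None:
        return None  # not in this sector of the dictionary
    root, word_type = entry
    return TaggedWord(root, {"type": word_type})


def to_target(tagged):
    """Translate an intermediate word, keeping its tags."""
    target = EO_TO_ES.get(tagged.root)
    if target is None:
        return None
    return TaggedWord(target, dict(tagged.tags))


print(to_target(to_intermediate("house")))
```

Swapping EO_TO_ES for another target table would leave to_intermediate untouched, which is the point of routing every language pair through the one intermediate module.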
Each sentence then is parsed into individual words, each being preferably
assigned its own temporary file.
After each sentence is parsed, the program may terminate by utilizing the
path to the far left in FIG. 2 and proceed solely on the basis of the
translation from the source language into the intermediate pathway
language and thence into the target language. As has been previously
indicated, there are circumstances where such a translation is totally
adequate and has the advantage of speed. If, however, a more refined
interpretation is required then the program provides a plurality of
alternative subroutines which can be called up for action on the parsed
sentence. There is no important order or sequence in which these
subroutines must be used. Further, it is not mandatory that each of them
be used in the smoothing proces