| United States Patent | 5497319 |
| Inventor(s) | Chong; Leighton K. (New York, NY);
Kamprath; Christine K. (Austin, TX) |
| Abstract | A machine translation and telecommunications system automatically
translates input text in a source language to output text in a target
language using a dictionary database (22) containing core language
dictionaries for general words, a plurality of sublanguage dictionaries
for specialized words of different domains or user groups, and a plurality
of user dictionaries for individualized words used by different users. The
system includes a receiving interface (11) for receiving input from a
sender, in the form of electronic text, facsimile (graphics) input, or
page image data, and an output module (30) for sending translated output
text to any designated recipient(s). The input text is accompanied by a
cover page or header (50) identifying the sender, one or more recipients,
their addresses, the source/target languages of the text, any
sublanguage(s) applicable to the input text, and any formatting
requirements for the output text. The system uses the cover page or header
data to select the core language, sublanguage, and/or user dictionaries to
be used for translation processing, to format the translated output text,
and to send the output to the recipient(s) at the designated address(es).
The dictionary database (22) can cumulate and evolve over time by adding
new words as scratch entries to the user dictionaries and, through the use
of dictionary maintenance utilities, by updating and/or moving the scratch
entries to higher-level subdomain, domain, or even core dictionaries as
their usage gains currency. |
Machine translation and telecommunications system |
| Publication Date | March 5, 1996 |
| Filing Date | September 26, 1994 |
| Parent Case |
SPECIFICATION
This application is a continuation of Ser. No. 07/920,456 filed Aug. 12,
1992, now abandoned, which is a continuation-in-part of U.S. patent
application Ser. No. 636,400, entitled "Automatic Text Translation and
Routing System", filed on Dec. 31, 1990, now issued as U.S. Pat. No.
5,175,684. |
Description  |
TECHNICAL FIELD
This invention relates to a system for automatic (machine) translation of
text and, more particularly, to a telecommunications-based system for
automatically translating and sending text from a sender to a recipient in
another language.
BACKGROUND ART
After several decades of development, the field of automatic (machine)
translation of text from a source language to a target language with a
minimum of human intervention has developed to a rudimentary level where
machine translation systems with limited vocabularies or limited language
environments can produce a basic level of acceptably translated text. Some
current systems can produce translations for unconstrained input in a
selected language pair, i.e., from a chosen source language to a chosen
target language, that is perhaps 50% acceptable to a native writer in the
target language (using an arbitrary scale measure). When the translation
system is constrained to a particular vocabulary or syntax style of a
limited area of application (referred to as a "sublanguage"), the results
that can now be achieved may approach a level 90% acceptable to a native
writer. The wide difference in results is attributable to the difficulty
of producing accurate translation when the system must encompass a wide
variability in vocabulary use, syntax, and expression, as compared to the
limited vocabularies and translation equivalents of a chosen sublanguage.
One example of a machine translation system limited to a specific
sublanguage application is the TAUM-METEO system developed by the
University of Montreal for translating weather reports issued by the
Canadian Environment Department from English into French. TAUM-METEO uses
the transfer method of translation, which consists basically of the three
steps of: (1) analyzing the sequence and morphological forms of input
words of the source language and determining their phrase and sentence
structure, (2) transferring (directly translating) the input text into
sentences of equivalent words of the target language using dictionary
look-up and a developed set of transfer rules for word and/or phrase
selections, and (3) synthesizing an acceptable output text in the target
language using developed rules for target language syntax and grammar.
TAUM-METEO was designed to operate for English-to-French translation in
the narrow sublanguage of meteorology (1,500 dictionary entries, with
several hundred place names; text having no tensed verbs). It can obtain
high levels of translation accuracy of 80% to 90% by avoiding the need for
any significant level of morphological analysis of input words, by
analyzing input texts for domain-specific word markers which narrow the
range of choices for output word selection and syntax structure, and by
using ad hoc transfer rules for output word and phrase selections.
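The three transfer steps described above can be illustrated with a short Python sketch. It is a toy example only and does not reproduce TAUM-METEO: the three-word lexicon, the single reordering rule, and the function names (analyze, transfer, synthesize, translate) are all hypothetical.

```python
# Toy transfer-method pipeline: analysis -> transfer -> synthesis.
# Every dictionary entry and rule below is an illustrative assumption.

# Hypothetical English-to-French transfer dictionary:
# source word -> (target word, part of speech)
LEXICON = {
    "cloudy": ("nuageux", "ADJ"),
    "sky": ("ciel", "NOUN"),
    "tomorrow": ("demain", "ADV"),
}


def analyze(text):
    """Step 1: split the input into lowercase word tokens."""
    return text.lower().split()


def transfer(tokens):
    """Step 2: look up each source word; unknown words pass through tagged UNK."""
    return [LEXICON.get(tok, (tok, "UNK")) for tok in tokens]


def synthesize(pairs):
    """Step 3: apply one target-language rule (adjective follows noun), then join."""
    out = list(pairs)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return " ".join(word for word, _ in out)


def translate(text):
    """Run the three steps in sequence."""
    return synthesize(transfer(analyze(text)))
```

Calling translate("Cloudy sky tomorrow") yields "ciel nuageux demain". A practical system would replace each step with far richer morphological, lexical, and syntactic rules.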
Another example of a sublanguage translation system is the METAL system
developed by the Linguistics Research Center at the University of Texas at
Austin for large-volume translations from German into English of texts in
the field of telecommunications. The METAL system also uses the transfer
method, but adds a fourth step called "integration" between the analysis
and transfer steps. The integration step attempts to reduce the
variability of output word selection and syntax by performing tests on the
constituent words of the input text strings and constraining their
application based upon developed grammar and phrase structure rules.
Transfer dictionaries typically consist of roughly 10,000 word pairs. In
terms of translation quality, the METAL system is reported to have
achieved between 45% and 85% correct translations.
A strategy competing with the transfer approach is the "interlingua"
approach which attempts to decompile input texts of a source language into
an intermediate language which represents their "meaning" or semantic
content, and then convert the semantic structures into equivalent output
sentences of a target language by using a knowledge base of contextual,
lexical, and syntactic rules. Historically, transfer systems lacking a
comprehensive knowledge base and limited to translation of sentences in
isolation have had the central problem of obtaining accurate word and
phrase selections in the face of ambiguities presented by homonyms,
polysemic phrases, and anaphoric references. The interlingua approach is
favored because its representation of text meaning within a context larger
than single sentences can, in theory, greatly reduce ambiguity in the
analysis of input texts. Also, once the input text has been decompiled
into a semantic structure, it can theoretically be translated into
multiple target languages using the linguistic and semantic rules
developed for each target language. In practice, however, the interlingua
approach has proven difficult to implement because it requires the
development of a universal symbolic language for representing "meaning"
and comprehensive knowledge bases for making the conversions from source
to intermediate and then to target languages. Examples of interlingua
systems include the Distributed Language Translation (DLT) project undertaken in
Utrecht, the Netherlands, and the Knowledge-Based Machine Translation
(KBMT) system of the Center for Machine Translation at Carnegie-Mellon
University.
Other machine translation systems have been developed or are under
development using modifications or hybrids of the transfer and interlingua
approaches. For example, some systems use human pre-editing and/or
post-editing to reduce text ambiguity and improve the correctness of word
and phrase selections. Other systems attempt to combine a basic transfer
approach with knowledge bases and artificial intelligence techniques for
machine editing and enhancement. Another approach is to combine
decompilation to a syntactically-based intermediate structure with
transfer to equivalent output phrases and sentences. For a further
discussion of current developments in machine translation, reference is
made to Machine Translation, Theoretical and Methodological Issues, edited
by Sergei Nirenburg, published by Cambridge University Press, 1987, and
"Proceedings of The Third International Conference on Theoretical and
Methodological Issues in Machine Translation of Natural Language",
published by Linguistics Research Center, University of Texas at Austin,
Jun. 1990.
It is expected that machine translation (MT) systems will develop in time
to provide higher levels of translation accuracy and utility. Even now,
current MT techniques using a basic transfer approach can produce
acceptable translation accuracy in a selected sublanguage, yet they are
not in widespread use. One reason for the limited use of MT systems is
that most current systems are designed for a single, specific application,
environment and language pair context. The requirements of that context
motivate the design and development of the grammar, dictionary structure,
and parsing algorithms. Thus, the utility of the system becomes confined
to that particular context. This approach greatly limits the range of
applications and the audience of users which can be productively served by
such application- and language-specific MT systems.
SUMMARY OF INVENTION
It is therefore a principal object of the present invention to provide a
system for performing machine translation for different source languages,
target languages, and sublanguages, and automatically sending the
translated text via telecommunications links to one or more recipients in
different languages and/or in different locations. The system should be
capable of providing acceptable levels of translation accuracy and be
readily upgradable to higher levels of accuracy and utility. It is a
further object that such a system be capable of operation with a minimum
of human intervention, yet have interactive utilities for obtaining and
adding new word entries to its dictionary database. It is also desired
that such a system be capable of building and organizing a large-scale
dictionary database containing core language dictionaries, plural
sublanguage dictionaries, and individual user dictionaries in a manner
which cumulates and evolves over time.
In accordance with a principal aspect of the present invention, a machine
translation and telecommunications system comprises:
(a) a machine translation module for performing machine translation from
input text of a source language to output text of a target language;
(b) a receiving interface for receiving input via a first
telecommunications link, said input including an input text to be
translated accompanied by a control portion having at least a first
predefined field therein for designating an address of a recipient to
which translated output text is to be sent;
(c) a recognition module coupled to said receiving interface for
electronically scanning the control portion and recognizing the address of
the recipient designated in the first predefined field of the control
portion; and
(d) an output module including a sending interface for sending translated
output text generated by said machine translation module to the address of
the recipient recognized by said recognition module via a second
telecommunications link.
In a more specific aspect of the invention relating to sublanguage
selection, a machine translation system comprises:
(a) a receiving interface for receiving an input text and a sublanguage
control input indicative of a selected sublanguage applicable to the input
text from among a plurality of possible sublanguages;
(b) a machine translation module capable of performing machine translation
of an input text in a source language to an output text in a target
language using a dictionary database containing entries for words of the
target language corresponding to words of the source language;
(c) a dictionary database including a core language dictionary containing
entries for generic words of the source and target languages, and a
plurality of sublanguage dictionaries each containing entries for
specialized words of a sublanguage;
(d) a dictionary control module responsive to the sublanguage control input
for selecting a sublanguage dictionary of the dictionary database
applicable to the input text, and for causing the machine translation
module to use the selected sublanguage dictionary in performing
translation of the input text; and
(e) an output module for outputting translated text in the target language
generated by the machine translation module.
In the present invention, the sublanguage control input causes a selected
sublanguage dictionary deemed applicable to the input text to be used in
order to perform more accurate translation of the input text. The
dictionary database includes core and sublanguage dictionaries for
different source/target languages and sublanguages. The machine
translation system with this capability for multiple core languages and
sublanguages is employed in a telecommunications system which automatically
translates and transmits text from a sender to one or more recipients in
other languages. A cover page or header accompanying the input text is
used to designate the selected source/target languages, the applicable
sublanguages, and the address(es)--electronic, fax, or mail--of the
recipient(s).
In a preferred embodiment, the receiving interface receives input text as
electronic (machine-readable) text over a communications line, or as page
image data via a fax/modem board or page scanner. The receiving interface
is operated in a computer server along with a recognition module for
converting any page image data to electronic text. The recognition module
scans and recognizes designations of the cover page or header accompanying
the input text for determining the selections of the source/target
languages and sublanguage(s) applicable to the input text. In the case of
electronic text, the cover page and the input text may be introduced by
means of a disk file, by downloading an electronic file, or by online
user-system interaction. An optional interaction mode prompts the user for
information concerning the user's identity, sublanguage preferences, etc.,
in order to facilitate generation of a suitable cover page. Inferencing
algorithms may be used to assess the user and cover page information and
determine the applicable sublanguage dictionary(ies).
The output module may have a page formatting program for composing the
translated output text into a desired page format appropriate to a
particular recipient or target language. It may also have a footnoting
function for providing footnotes of ambiguous phrases of the input text in
their original source language and/or with alternate translations in the
target language. The output module includes a sending interface coupled to
a fax/modem board for facsimile transmission, or a printer for printing
output pages, or a telecommunications interface for sending output
electronic text to a recipient's electronic address. The modularity of the
receiving interface, dictionary database, dictionary control module, and
output module from the machine translation module assures that, as machine
translation improvements are developed, the machine translation module may
be upgraded or replaced without rendering the other portions of the system
dysfunctional or obsolete.
As another aspect of the invention related to a machine dictionary
database, a machine translation system comprises:
(a) a machine translation module for performing machine translation of
input text in a source language to output text in a target language using
a dictionary database containing entries for words of the target language
corresponding to words of the source language;
(b) a dictionary database including a core language dictionary containing
entries for generic words of the source/target languages, a plurality of
sublanguage dictionaries each containing entries for specialized words of
a sublanguage used by a group of users, and a plurality of user
dictionaries each containing entries for individualized words of a user;
and
(c) a dictionary control module responsive to control inputs to the machine
translation system for causing the machine translation module to use the
core language dictionary, any applicable sublanguage dictionary, and any
applicable user dictionary for performing translation of an input text
attributed to a user of the system.
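The combined use of core, sublanguage, and user dictionaries recited above can be pictured as a layered lookup. The following Python sketch is a minimal illustration under stated assumptions: the entries are invented, the function name build_lookup is hypothetical, and the precedence shown (user entries first, then sublanguage, then core) is an assumed ordering that the text above does not prescribe.

```python
from collections import ChainMap

# Hypothetical entries; real dictionaries would hold full linguistic records.
core = {"bank": "banque", "line": "ligne"}
telecom_sublanguage = {"line": "liaison", "trunk": "circuit"}
user_dict = {"trunk": "jonction"}


def build_lookup(core_dict, sublanguage_dicts, user_entries=None):
    """Return a layered view of the dictionaries.

    Assumed precedence: the most specific dictionary is consulted first.
    """
    layers = []
    if user_entries:
        layers.append(user_entries)
    layers.extend(sublanguage_dicts)
    layers.append(core_dict)
    return ChainMap(*layers)


lookup = build_lookup(core, [telecom_sublanguage], user_dict)
```

With these sample entries, lookup["trunk"] resolves from the user dictionary, lookup["line"] from the sublanguage dictionary, and lookup["bank"] from the core dictionary. Because ChainMap only references the underlying dictionaries, later additions to any layer are visible without rebuilding the view.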
In the invention, a large-scale dictionary database is maintained which has
dictionaries containing word entries specified linguistically at different
hierarchical levels of usage. At the lowest (user) level, a particular
user can enter temporary or "scratch" word entries into a respective user
dictionary. The machine translation system uses the particular user's
dictionary to perform machine translation of text which may contain
idiosyncratic or new words or phrases particularly used by that user. The
dictionary control module includes dictionary maintenance utilities which
allow such scratch entries to be entered by users into their user
dictionaries, and which assist a dictionary maintenance operator (DMO) to
review the scratch entries so that they can be confirmed as valid
dictionary entries for machine translation. The dictionary maintenance
utilities include automated programmed procedures for assessing whether
word entries appearing in lower-level dictionaries should be moved into
higher-level dictionaries.
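One way such an automated assessment might work is sketched below in Python. The criterion used here (an identical entry appearing in the dictionaries of at least min_users users) and the function name promote_common_entries are assumptions made for illustration, and the sketch omits the review by the dictionary maintenance operator.

```python
from collections import Counter


def promote_common_entries(user_dicts, sublanguage_dict, min_users=2):
    """Move entries shared identically by at least min_users users up one level.

    user_dicts: user name -> {source word: target word}
    sublanguage_dict: the higher-level dictionary, updated in place
    Returns the sorted list of promoted source words.
    """
    # Count how many users hold each (word, translation) pair.
    counts = Counter(
        (word, translation)
        for entries in user_dicts.values()
        for word, translation in entries.items()
    )
    promoted = []
    for (word, translation), n in counts.items():
        if n >= min_users and word not in sublanguage_dict:
            sublanguage_dict[word] = translation
            # Remove the now-redundant copies from the user dictionaries.
            for entries in user_dicts.values():
                if entries.get(word) == translation:
                    del entries[word]
            promoted.append(word)
    return sorted(promoted)
```

A user whose entry for the same word carries a different translation keeps that entry, so an individual usage is not overwritten by the promoted one.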
Other objects, features, and advantages of the present invention will
become apparent from the following detailed description of the preferred
embodiments of the invention, as considered with reference to the
following drawings:
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic diagram of a machine translation and
telecommunications system in accordance with the invention.
FIG. 1A is a schematic diagram of a computer server which includes a
receiving interface, recognition module, and dictionary control module,
and is coupled to a machine translation module and an output module.
FIG. 1B is a schematic diagram of a machine translation module which
includes a translation processing module and a dictionary database, and
its linkage to the computer server and the output module.
FIG. 1C is a schematic diagram of the output module, including a page
formatting module and a sending interface.
FIG. 2 is an illustration of a cover page for designating core language
pair, sublanguage(s), and recipient information, and accompanying text
pages.
FIG. 3 is an illustration of input ideographic text and output English text
as performed by the machine translation system using page formatting
functions.
FIG. 4 is a schematic diagram of the dictionary control module, including
dictionary selection and maintenance submodules, the latter containing an
(interactive) user maintenance module and a dictionary maintenance module.
FIG. 5 is a schematic representation of an interactive input editor for
interactions with users of the system.
FIG. 6 is a schematic diagram illustrating dictionary maintenance utilities
for collapsing and promotion of entries from subordinate to superordinate
dictionaries.
FIG. 7A illustrates, as a function of the dictionary maintenance utilities,
the creation of a scratch word entry from an identical word entry.
FIG. 7B illustrates the use of utilities with an interactive input editor
to scan various levels of the dictionary hierarchy for word entries on
which to base scratch word entries.
FIG. 7C illustrates a typical content of an identical word entry from which
a scratch word entry is created.
FIG. 7D illustrates the creation of a "copy-cat" word entry from a
synonymous word entry.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Referring to FIG. 1, a preferred form of the machine translation and
telecommunications system in accordance with the present invention
comprises a computer server 10, a machine translation module 20, and an
output module 30. (These and further-described components of the system
will be denoted with capital letters for clarity of reference.) The
Computer Server 10 receives electronic text input accompanied by a cover
page or header from any of a plurality of input sources, designated
generally as a telecommunications link A. The Computer Server 10 has a
function for recognizing control data in the cover page or header
designating core language and sublanguage selections applicable to the
input text to be translated. It also recognizes output addresses and page
formatting data to be used by the Output Module 30 for transmitting the
translated text to the designated recipient(s) via any of a plurality of
output devices, designated generally as a telecommunications link B. Due
to the modularity of the system, the Machine Translation Module 20 may be
updated by operator maintenance or upgraded or replaced without rendering
the other functions of the system dysfunctional or obsolete.
The Machine Translation Module 20 is capable of performing machine
translation from input text in a source language to output text in a
target language. In the examples of a machine translation (MT) system
described herein, reference is made to an MT system of the transfer type
which relies upon the use of a machine-readable dictionary for lookup of
source/target word entries. The principles of the present invention may
also be applied to an MT system of the interlingua type. Transfer-type MT
systems are more widely accepted for near-term usage than interlingua systems,
and they rely more heavily on linguistic knowledge incorporated into
machine dictionaries designed for source/target language pairs. The
operation of transfer-type MT systems is well understood by those skilled
in the machine translation field, and is not described further herein.
Input Data Reception and Extraction
FIG. 1A shows the Computer Server 10 having a Receiving Interface 11 linked
to the telecommunications link A, a Recognition Module 12, and a
Dictionary Control Module 13. The Receiving Interface 11 may include an
interactive mode program (to be described further herein) whereby a user
can provide cover page or header designations, update or create User ID
files pertinent to translation parameters associated with that user's
communications, or create specialized user dictionary entries during
interactive text entry sessions. The Recognition Module 12 includes a
character recognition (often referred to as "OCR") program which
recognizes and converts page image data into machine-readable text, and
which recognizes cover page designations or user designations referencing
cover page data stored in the User ID files. The Dictionary Control Module
13 includes a Dictionary Selection Module, which assesses the control data
it receives from the Recognition Module 12 and designates the appropriate
core language and sublanguage dictionary(ies) to be used by the Machine
Translation Module 20. It also includes a Dictionary Maintenance Module,
which allows a dictionary maintenance operator (DMO) to create and update
dictionary entries in the Dictionary Database 22.
Using the control data from a cover page or header accompanying the input
text, the Computer Server 10 allows the system to automatically recognize
a sender's designations of the source language of the input text, the
target language(s) of the output text, any particular sublanguage(s) used
in a specialized domain, user group, or correspondence type, any preferred
page format for the output text, and the address(es) of one or more
recipients to whom the output is to be sent. Thus, the system can
automatically access designated core and sublanguage dictionaries
maintained in the Dictionary Database 22 for different source/target
languages and sublanguages, and can format and transmit the translated
text to recipient(s) in respective target language(s) via
telecommunications link B, without the need for any substantial human
intervention.
The Computer Server 10 interfaces with a plurality of receiving devices.
For example, input data can be received as a facsimile transmission via a
fax/modem board plugged into the I/O bus for the server system. Such
fax/modem boards are widely available and their operation in a server
system is well understood by those skilled in this field. Input may also
be received from a conventional facsimile machine coupled to a telephone
line which prints facsimile pages converted from signals transmitted on
the telephone line. A conventional page scanner with a sheet feeder can be
used to scan in facsimile or printed pages as page image data for input to
the Computer Server. The page image data is then converted to
machine-readable form by the OCR program. Input may also be received
through a telecommunications program or network interface as electronic
text or text files (such as ASCII text), in which case conversion by the
OCR program is not required.
The OCR program may be resident as an application program in the Computer
Server 10 along with the interface programs for handling the reception of
input data. OCR programs are widely available, and their operation is well
known in this field. For example, an OCR program for recognizing Japanese
kana and ideographic characters is offered by Catena Corp., Tokyo, Japan.
An example of an OCR program for alphanumeric characters is WordScan.TM.
offered by Calera Recognition Systems, Santa Clara, Calif. The Computer
Server 10 is preferably a high-speed, multi-tasking PC computer or
workstation.
Referring to FIG. 2, the Computer Server 10 receives input data which is
divided into two parts: a cover page or header 50 and input text 60. In
the example shown, a cover page is used in conjunction with other pages of
input text in a page-oriented system. In the case of transmission of an
electronic text file or a text message, a preceding header or identifier
for the communication is used. The cover page 50 has a number of fields
for designating selections of source/target language(s), sublanguage(s),
page format, and recipient(s) for the text. The cover page 50 is organized
with data fields in a predefined format, so that the Recognition Module 12
of the Computer Server 10 can readily locate each field and recognize the
control data it contains.
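For the electronic-text case, extraction of control data from a header with predefined fields might look like the following Python sketch. The field labels (FROM, SOURCE, TARGET, SUBLANGUAGE, FORMAT, TO), the semicolon separator, the blank-line boundary between header and text, and the names ControlData and parse_header are all hypothetical; no concrete header syntax is specified here.

```python
from dataclasses import dataclass, field


@dataclass
class ControlData:
    sender: str = ""
    source_language: str = ""
    target_languages: list = field(default_factory=list)
    sublanguages: list = field(default_factory=list)
    page_format: str = ""
    recipients: list = field(default_factory=list)


# Hypothetical header labels mapped to ControlData attributes.
SINGLE_FIELDS = {"FROM": "sender", "SOURCE": "source_language", "FORMAT": "page_format"}
LIST_FIELDS = {"TARGET": "target_languages", "SUBLANGUAGE": "sublanguages", "TO": "recipients"}


def parse_header(message):
    """Split a message into (ControlData, body text) at the first blank line."""
    header, _, body = message.partition("\n\n")
    control = ControlData()
    for line in header.splitlines():
        # Split on the first colon only, so values may themselves contain colons.
        label, sep, value = line.partition(":")
        label, value = label.strip().upper(), value.strip()
        if not sep or not value:
            continue  # skip lines that are not "LABEL: value"
        if label in SINGLE_FIELDS:
            setattr(control, SINGLE_FIELDS[label], value)
        elif label in LIST_FIELDS:
            getattr(control, LIST_FIELDS[label]).extend(
                item.strip() for item in value.split(";") if item.strip()
            )
    return control, body
```

Unrecognized labels are ignored rather than rejected, which keeps the sketch tolerant of extra header lines. A production system would more likely report them back to the sender.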
For example, the cover page 50 may be laid out and formatted with field
boundaries and markings on the printed page for optically scanning with a
high level of reliability. Line dividers 51 and large type-size headers 52
may be used to mark the sender, source/target language(s), sublanguage
(communication type or subject matter), page format, and recipient address
fields. Boxes 53, which can be marked or blackened in, allow the
designated selections to be determined without error. The names of the
sender and recipients, their respective companies, addresses, and
telephone and/or facsimile transmission numbers are determined by
character recognition once the respective fields 51, 52 have been
distinguished. Any page length of input text 60 can follow the cover page
50. Alternatively, information ordinarily supplied by a cover page or
header may be stored in the User ID files and supplied automatically as a
memorized script in response to user selection.
It is the task of the Recognition Module 12 to extract data pertinent to
dictionary selection from the fields of the cover page or header. In batch
mode this data is predetermined--it is either filled into the cover page
fields by the user with each specific translation transaction, or it can
be supplied by a reference to the User Identification (ID) files resident
in the Recognition Module 12.
In the Interactive Mode for specifying the cover page or header through the
Receiving Interface 11, the user may first be presented with predetermined
sets of fill-in data and then prompted for alternative values, or provided
with a variety of alternatives from which to choose, based upon data
already stored in the User ID files, or based upon inferences drawn from
the data as it is entered by the user. For example, a User A may specify
Recipient Z by name only, and then be presented with additional data, such
as Recipient Z's address, title, or affiliation, already stored in the
User ID files for verification or correction. Alternatively, Recipient Z
may never have been addressed by User A in the past but may be a user
categorized in Domain L, which is a domain of which User A is also a
member, thus triggering the inference that the sublanguage dictionary of
Domain L may be presented to User A as an option for use.
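The shared-domain inference in the User A and Recipient Z example can be sketched in Python as a set intersection. The data layout (a mapping from users to domain sets and a history of past sender and recipient pairs) and the function name suggest_sublanguages are assumptions made for illustration only.

```python
def suggest_sublanguages(sender, recipient, memberships, history):
    """Suggest sublanguage domains to offer the sender for this recipient.

    memberships: user -> set of domain names (assumed User ID file data)
    history: (sender, recipient) -> sublanguages used in past communications
    """
    # Prefer whatever this sender has already used with this recipient.
    past = history.get((sender, recipient))
    if past:
        return list(past)
    # Otherwise infer candidates from domains both parties belong to.
    shared = memberships.get(sender, set()) & memberships.get(recipient, set())
    return sorted(shared)
```

For instance, with memberships {"A": {"L", "M"}, "Z": {"L"}} and an empty history, the call suggest_sublanguages("A", "Z", memberships, {}) returns ["L"], which would then be presented to User A as an option rather than applied silently.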
The user may be prompted in Interactive Mode to verify or choose among
field values which aid in selecting one or more sublanguage dictionaries
for a given translation, including correspondence types, subject domains,
social indicators, etc. By automating the filling-in of cover page
information, the system employs its computerized capabilities for the user
while controlling and monitoring the completeness and cohesiveness of the
data supplied.
The cover page may designate a plurality of recipients in a plurality of
address locations and target languages, each of which may have particular
formatting requirements for the output. For automated assistance, each
prospective recipient can be referenced by an identifying code indexed to
data stored in the User ID files. For example, a travel agent may have a
regular set of clients in a variety of locations and languages, with
access to a variety of communication modes, to whom he or she regularly
sends advertising material. One client may require Japanese translation
formatted as "right-to-left" vertical lines of ideographic characters, to
be printed and sent as ordinary mail. Another may require faxed
translation into German. Still another may have E-mail capability and
require a printed copy as well. These combinations of addressees and
requirements can be predefined and stored in the User ID files. The data
for the cover page fields for each of these addressees may be indexed to
mnemonic codes, such as the addressee's alphabetic name, and are retrieved
from the User ID files by the Recognition Module.
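The travel agent example can be sketched in Python as a table of stored recipient profiles keyed by mnemonic code. The profile fields, the codes, the format label "vertical-rtl", and the language assigned to the third client are hypothetical; only the combinations of delivery modes follow the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecipientProfile:
    target_language: str
    page_format: str
    channels: tuple  # delivery modes, e.g. ("fax",) or ("email", "print")


# Hypothetical stored profiles keyed by mnemonic code.
PROFILES = {
    "TANAKA": RecipientProfile("Japanese", "vertical-rtl", ("print",)),
    "SCHMIDT": RecipientProfile("German", "default", ("fax",)),
    "DUPONT": RecipientProfile("French", "default", ("email", "print")),
}


def delivery_plan(codes, profiles=PROFILES):
    """Expand mnemonic codes into (code, language, format, channel) jobs."""
    jobs = []
    for code in codes:
        key = code.upper()
        profile = profiles[key]  # raises KeyError for an unknown code
        for channel in profile.channels:
            jobs.append((key, profile.target_language, profile.page_format, channel))
    return jobs
```

Each job names one translation and one delivery mode, so a recipient who needs both E-mail and a printed copy produces two jobs in the same target language.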
The User ID files may be established at the time of subscription by a user
to a machine translation service, and updated from time to time
thereafter. Using the Interactive Mode, the user may be prompted to supply
his or her name, sex, title, company, address, group affiliations, source
language, etc., as well as data relevant to prospective recipients or
groups of recipients to be stored in the User ID files for filling in
cover pages automatically. Sublanguage selections appropriate to the user
may be identified or queried by comparing the requirements of the user
with those of other users subscribing to the service.
The user may be prompted to provide samples of typical texts expected to be
submitted for translation, as well as individualized or key words for a
thesaurus of terms. Automatic ut