Claims
What is claimed is:
1. A computer method for the automatic extraction of commonly specified
information from a business correspondence document, such as date of
letter, name of recipient, name of sender, address of sender, title of
sender, carbon copy list, subject statement, and the like, comprising the
steps of:
a first scanning step of scanning the input data stream to locate
postscripts, attachments or appendices at a first location by matching
each word from the input data stream against a list of expressions used to
indicate postscripts, attachments or appendices, said first location being
set equal to the final occurring line in said data stream if said first
scanning step does not locate any postscripts, attachments or appendices
therein, said first location alternately being set equal to a location of
postscripts, attachments or appendices found in said first scanning step;
a second scanning step of scanning the input data stream to locate the
final sentence in said document, starting from said first location and
scanning toward the beginning of said data stream, searching for words
which are verbs in the final sentence in said document, by identifying the
last occurrence of a verb in the input data stream, which will occur in
the final sentence of said document;
a first identifying step of identifying an ending portion of the document
expected to contain a sender's name, return address, title or carbon copy
list information, at a location in the input data stream occurring after
the end of said final sentence located in said second scanning step, and
occurring before said first location located in said first scanning step;
a third scanning step of scanning said input data stream to locate any
salutation by matching each word from the input data stream against a list
of natural language expressions that can be used as a salutation;
a second identifying step of identifying a beginning portion of the
document at a location which includes a portion from the start of the
input data stream to the end of a salutation, if a salutation was located
in said third scanning step;
a fourth scanning step of scanning the input data stream if no salutation
was found in said third scanning step, said fourth scanning step to locate
date, addressee, sender, return address, personal title or subject
information in the input data stream by matching each word of the input
data stream against a list of expressions that are used to indicate the
date, addressee, the sender, the return address, personal title and the
subject of the correspondence document;
a third identifying step of identifying, if no salutation was found in said
third scanning step, a beginning portion of the document at a location
which includes the date, addressee, sender, return address, personal title
or subject information of the correspondence document located in said
fourth scanning step;
isolating and storing from said beginning portion of said document, any
addressee, sender, return address, personal title or subject information
therein;
isolating and storing from said ending portion of said document, any
sender, return address, title or carbon copy list information therein.
2. A computer method for the automatic extraction of commonly specified
information from a business correspondence document, such as date of
letter, name of recipient, name of sender, address of sender, title of
sender, carbon copy list, subject statement, and the like, comprising the
steps of:
a first scanning step of scanning the input data stream to locate
postscripts, attachments or appendices at a first location by matching
each word from the input data stream against a list of expressions used to
indicate postscripts, attachments or appendices, said first location being
set equal to the final occurring line in said data stream if said first
scanning step does not locate any postscripts, attachments or appendices
therein, said first location alternately being set equal to a location of
postscripts, attachments or appendices found in said first scanning step;
a second scanning step of scanning the input data stream to locate the
final sentence in said document, starting from said first location and
scanning toward the beginning of said data stream, searching for words
which are verbs in the final sentence in said document, by identifying the
last occurrence of a verb in the input data stream, which will occur in
the final sentence of said document;
a first identifying step of identifying an ending portion of the document
expected to contain a sender's name, return address, title or carbon copy
list information, at a location in the input data stream occurring after
the end of said final sentence located in said second scanning step, and
occurring before said first location located in said first scanning step;
a third scanning step of scanning said input data stream to locate any
salutation by matching each word from the input data stream against a list
of natural language expressions that can be used as a salutation;
a second identifying step of identifying a beginning portion of the
document at a location which includes a portion from the start of the
input data stream to the end of a salutation, if a salutation was located
in said third scanning step;
a fourth scanning step of scanning the input data stream if no salutation
was found in said third scanning step, said fourth scanning step to locate
date, addressee, sender, return address, personal title or subject
information in the input data stream by matching each word of the input
data stream against a list of expressions that are used to indicate the
date, addressee, the sender, the return address, personal title and the
subject of the correspondence document;
a third identifying step of identifying, if no salutation was found in said
third scanning step, a beginning portion of the document at a location
which includes the date, addressee, sender, return address, personal title
or subject information of the correspondence document located in said
fourth scanning step;
isolating and storing from said beginning portion of said document, any
addressee, sender, return address, personal title or subject information
therein;
isolating and storing from said ending portion of said document, any
sender, return address, title or carbon copy list information therein;
storing said document in a file accessible by any addressee, sender, return
address, title, subject information, or carbon copy list information.
Description
BACKGROUND OF THE INVENTION
1. Technical Field
The invention disclosed broadly relates to data processing and more
particularly relates to linguistic applications in data processing.
2. Background Art
Text processing and word processing systems have been developed for both
stand-alone applications and distributed processing applications. The
terms text processing and word processing will be used interchangeably
herein to refer to data processing systems primarily used for the
creation, editing, communication, and/or printing of alphanumeric
character strings composing written text. A particular distributed
processing system for word processing is disclosed in the copending U.S.
patent application Ser. No. 781,862 filed Sept. 30, 1985 entitled
"Multilingual Processing for Screen Image Build and Command Decode in a
Word Processor, with Full Command, Message and Help Support," by K. W.
Borgendale, et al., assigned to IBM Corporation. The figures and
specification of the Borgendale, et al. patent application are
incorporated herein by reference, as an example of a host system within
which the subject invention herein can be applied.
Document retrieval is the function of finding stored documents which
contain information relevant to a user's query. Prior art computer methods
for document retrieval are logically divided into a first component
process for creating a document retrieval data base and a second component
process for interrogating that data base with the user's queries. In the
process of creating the data base, each document which is desired to be
entered into the data base, is associated with a unique document number.
Then the words comprising the text of the document are scanned and are
compiled into an inverted file index. The inverted file index is the
accumulation of each unique word encountered in all of the documents
scanned. As each word of a document is scanned, the corresponding document
number is associated with that word and a search is made through the
inverted file index to determine whether that particular word has been
previously encountered in either the current document or previous
documents entered into the data base. If the word has not been previously
encountered, then the word is entered as a new word in the inverted file
index and the document number is associated therewith. If, instead, the
word has been previously encountered, either in the current document or in
a previous document, then the location of the word in the inverted file
index is found and the current document number is added to the collection
of previous document numbers in which the word has been found. As
additional documents are added to the data base, each respective unique
word in the inverted file index accumulates additional document numbers
for those documents containing the particular word. The inverted file
index is stored in the memory of the data processor in the document
retrieval system. A document table can also be stored in the memory,
containing each respective document number and the corresponding document
identification such as its title, location, or other identifying
attributes. Typically, prior art techniques for creating a document
retrieval data base required a scanning of the entire document in the
compilation of the inverted file index. After the inverted file index and
the document table have been created in the computer memory, the second
stage in the prior art computer methods for document retrieval can take
place, namely the input by the user of query words or expressions selected
by the user to characterize the types of documents he is seeking in a
particular retrieval application. When the user inputs his query words,
each word is compared with the inverted file index to determine whether
that word matches with any words previously entered in the inverted file
index. Upon making a successful match with the query word, the
corresponding document numbers for the matched entry in the inverted file
index are noted. If additional words are present in the user's input
query, each respective word is subjected to the matching operation with
the words in the inverted file index and the corresponding document
numbers for matched words are noted. Then, a scoring technique is employed
to identify those documents having the largest number of matching words to
the words in the user's input query. The highest scoring documents can
then have their titles or other identifying attributes displayed on the
display monitor for the computer in the retrieval system. An example of
such a prior art document retrieval system is the IBM System/370 Storage
and Information Retrieval System (STAIRS) which is described in IBM
publication GH12-5123-1 entitled "IBM System/370 Storage and Information
Retrieval System/Virtual Storage--Thesaurus and Linguistic Integrated
System," November 1976. Another such system is described in U.S. Pat. No.
4,358,824 to Glickman, et al. entitled "Office Correspondence Storage and
Retrieval System," assigned to the IBM Corporation.
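The two-stage prior art process described above can be sketched as follows. This is a minimal illustration only, assuming whitespace tokenization and a simple count of matching query words as the scoring technique; the function names and the sample documents are hypothetical and are not taken from STAIRS or from the Glickman, et al. patent.

```python
from collections import Counter, defaultdict

PUNCTUATION = '.,;:!?"()'

def build_inverted_index(documents):
    """Map each unique word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_number, text in documents.items():
        for raw in text.lower().split():
            word = raw.strip(PUNCTUATION)
            if word:
                index[word].add(doc_number)
    return index

def score_query(index, query):
    """Rank documents by how many distinct query words they contain."""
    scores = Counter()
    query_words = {w.strip(PUNCTUATION) for w in query.lower().split()}
    for word in query_words:
        for doc_number in index.get(word, ()):
            scores[doc_number] += 1
    return scores.most_common()

# Hypothetical sample data: document number -> document text.
documents = {
    1: "Please review the attached budget report.",
    2: "The budget meeting is moved to Friday.",
    3: "Thank you for the report on the meeting.",
}
index = build_inverted_index(documents)
ranking = score_query(index, "budget report")
```

In this sketch document 1 ranks first because it is the only document containing both query words.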
Although these prior art document retrieval systems work well, because
documents have different topics and are written by different authors at
different times, the user may seek only the particular document of a
certain author and/or certain subject or date. This retrieval-related
information is referred to as the retrieval parameters. This becomes
particularly true with business correspondence where the user desiring to
retrieve a document may remember only the author, date, recipient,
address, subject statement, or other document parameter. It would
therefore be desirable to have a document retrieval system which isolates
the business correspondence parameters in the process of a data base
creation, thereby facilitating the retrieval of business correspondence
through the use of queries comprising such business correspondence
parameters. The problem of reliably retrieving business correspondence is
further compounded when the user compiles a query containing terms which
are not exactly the same as the terms in the parameters compiled into the
data base during the data base creation phase. It would be desirable to
have a document retrieval system suitable for retrieving business
correspondence using terms in a query which are different in their
linguistic structure, syntax or semantics from the terms employed in the
compilation of the data base.
Objects of the Invention
It is therefore an object of the invention to provide an improved document
retrieval system.
It is another object of the invention to provide an improved computer
method for retrieval of business correspondence.
It is still a further object of the invention to provide an improved
business correspondence document retrieval system which is based upon
parametric fields which characterize business correspondence.
It is yet a further object of the invention to provide an improved computer
method for the retrieval of business correspondence which is tolerant to
variations in the linguistic structure, syntactic, or semantic form of the
user's input query.
SUMMARY OF THE INVENTION
These and other objects, features and advantages of the invention are
accomplished by the computer method disclosed herein. A Parametric
Information Extraction (PIE) system has been developed to identify
automatically parametric fields such as author, date, recipient, address,
subject statement, etc. from documents in free format. The
program-generated data can be used directly or can be supplemented
manually to provide automatic indexing or indexing aid, respectively.
The PIE system uses structural, syntactic, and semantic knowledge to
accomplish its objective. The structural analysis identifies the document
heading, body, and ending. The heading and ending, which are the
components that contain the parametric information, are then analyzed by a
battery of morphologic, syntactic, and semantic pattern-matching
procedures that provide the parametric information in standardized forms
that can be easily manipulated by computer.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects, features and advantages of the invention can be
more fully appreciated with reference to the accompanying figures.
FIG. 1 is a data flow diagram of the parametric information extraction
process.
FIG. 2 is a discourse model of business correspondence documents.
FIG. 3 illustrates the frame slots for business correspondence.
FIG. 4 illustrates a typical business correspondence document.
FIG. 5 illustrates a list of business correspondence closing phrases.
FIG. 6 illustrates a list of the heading identifiers.
FIG. 7 illustrates a list of heading expectations.
FIG. 8 illustrates a list of ending expectations.
FIG. 9 is a data flow diagram of the date syntax.
FIG. 10 is a flow diagram of the MAINEXT program which extracts parametric
fields from a document.
FIG. 11 is a flow diagram of the END_DOC program which identifies
document endings.
FIG. 12 is a flow diagram of the HEADDOC program which identifies the
heading of a document.
FIG. 13 is a flow diagram of the HEADING program which extracts parametric
fields from a heading.
FIG. 14 is a flow diagram of the ENDING program which extracts parametric
fields from an ending.
FIG. 15 is a flow diagram of the ISOLEXT program which creates a frame of
parametric fields.
FIG. 16 is a flow diagram illustrating the operation of entering a document
identification into a data base.
FIG. 17 is a flow diagram illustrating inputting a query in order to
retrieve a document identification from a data base.
FIG. 18 is a schematic illustration of a portion of the memory in the
computer in which the inverted file index is constructed for document
retrieval, using PIE frame categories.
DESCRIPTION OF THE BEST MODE FOR CARRYING OUT THE INVENTION
Introduction
Document retrieval is the problem of finding stored documents which contain
information relevant to a user's query. Because the documents have
different topics and are written by different authors at different times,
the user may seek only the particular document of a certain author and/or
certain subject or date. This retrieval-related information is referred to
as "parameters." This paper describes a system that isolates certain
document attributes and encodes them into a structure for the storing of
office documents. The structure is suitable to establish a data base that
identifies only relevant items for user queries in a regular office
environment.
Approach
Although the task of automatically extracting parametric data appears to be
well-defined, the problem is difficult because the document format often
depends on the whims of the author, the vocabulary is unconstrained, and
the contents of the fields to be extracted are unknown. The inventive
approach used relies on computational linguistics methods for structural,
syntactic, and semantic knowledge. Each English sentence in the office
text presented to the PIE system is interpreted via a parser, a discourse
analysis procedure, a frame interpreter, and a mapping program that
converts the textual information into standard formats.
The structural (discourse) analysis uses a model of the discourse to
control the focus of the programming environment for the three
identifiable components of business correspondence discourse--the heading,
body, and ending of the document. The syntactic analysis (parsing), by
contrast, is concerned with the grammatical interpretation of text to
determine the parts of speech of the words and the phrase structure of the
sentences.
The structural and syntactic information makes it possible to set up a
framework of expectations to drive subsequent field-oriented semantic
text analysis. Finally, the actual data extraction consists of mapping the
data found in the document to the slots reserved for the data in the
output structure. This is a "data cleanup" procedure that standardizes the
format of the data as required by the information storage and retrieval
programs which use the information.
Syntactic Module
To analyze a sentence of natural language, a computer program recognizes
the words and the phrases within the sentence, builds data structures
representing their syntactic structure and combines them into a structure
that corresponds to the entire sentence. The algorithm which recognizes
the phrases and invokes the structure-building procedures is the parser.
An example of such a parser is disclosed in the copending patent
application Ser. No. 924,670, filed Oct. 29, 1986, entitled "A Parser for
Natural Language Text," by A. Zamora, et al., assigned to IBM Corporation,
and incorporated herein by reference.
The parser analyzes text for the identification of sentence components
including part of speech and phrase structure. It constructs a
bidirectional-list data structure consisting of list nodes, string nodes,
and attribute nodes. The list nodes make it possible to scan the data
structure forward and backwards. The string nodes are attached to the list
nodes; they represent each lexical item in the text and contain pointers
to the attribute nodes. The attribute nodes consist of an attribute name
and a value which may be used to indicate part of speech, level of nesting
of a phrase, start of a line, etc. The PIE system accesses the parser's
word-oriented data structure through service subroutines to get the
lexical items corresponding to the string nodes, and retrieve the
attributes associated with them.
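The bidirectional-list data structure described above might be modeled as in the following sketch. The class names, the attribute name "pos", and the helper functions are assumptions made for illustration; the actual layout used by the referenced parser is not given in this text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AttributeNode:
    name: str    # e.g. "pos" for part of speech (assumed attribute name)
    value: str

@dataclass
class StringNode:
    text: str    # one lexical item of the document
    attributes: List[AttributeNode] = field(default_factory=list)

@dataclass(eq=False)
class ListNode:
    string: StringNode
    prev: Optional["ListNode"] = field(default=None, repr=False)
    next: Optional["ListNode"] = field(default=None, repr=False)

def build_list(tagged_words):
    """Link (text, {attribute: value}) pairs into a doubly linked list."""
    head = tail = None
    for text, attrs in tagged_words:
        string = StringNode(text, [AttributeNode(n, v) for n, v in attrs.items()])
        node = ListNode(string, prev=tail)
        if tail is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head, tail

def get_attribute(node, name):
    """Stand-in for a service subroutine: look up one attribute value."""
    for attribute in node.string.attributes:
        if attribute.name == name:
            return attribute.value
    return None

def scan(start, forward=True):
    """Walk the list from start, forward or backward."""
    node = start
    while node is not None:
        yield node
        node = node.next if forward else node.prev

def last_verb(tail):
    """Scan backward from the tail and return the last verb, if any."""
    for node in scan(tail, forward=False):
        if get_attribute(node, "pos") == "verb":
            return node.string.text
    return None

head, tail = build_list([
    ("We", {"pos": "pronoun"}),
    ("enclose", {"pos": "verb"}),
    ("the", {"pos": "determiner"}),
    ("report", {"pos": "noun"}),
])
```

The backward scan is the operation the claims rely on when locating the final sentence by its last verb.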
Discourse Interpreter Module
Isolation of parametric information depends on the correct identification
of the discourse structure in the documents. This aspect of the analysis
depends heavily on the format of the document. Most of the information
that the system needs is located in the heading and ending of a document.
Therefore, specific search procedures concentrate their efforts in these
portions of the document.
In the PIE system the HEADING means the top portion of a document before
the salutation. It usually does not contain verbs in the sentences (except
in the subject or reference statements). The HEADING of a business
document contains the date, the names of sender and recipient, the
addresses, and the subject statement. It may also contain copy (cc)
information, userid/nodeid information, and reference to previous
correspondence.
The ENDING is the bottom portion of a business document that contains the
signature of the author, but it may also contain carbon copy (cc)
information, userid/nodeid information, and sender's address.
The basic purpose of the discourse structure analysis is to obtain and use
locative clues that improve the extraction of information. These clues
encode knowledge that can direct the programs to examine the locations
within the discourse where co-referents (actual data) may be found.
Therefore, clear identification of the heading and the ending of a
document is very important to eliminate ambiguities. Date information, for
example, may be located in the body of a document as well as in the
heading, but only the date from the heading portion will be extracted
after the discourse interpreter identifies the document structure.
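The identification of the heading and ending can be sketched roughly as follows, working on lines rather than on the parser data structure. The marker lists and the small verb set are illustrative assumptions standing in for the lists of expressions and the part-of-speech information that the system would actually use; the handling of a document with no salutation is omitted.

```python
import re

# Illustrative assumptions, not the lists used by the PIE system.
SALUTATIONS = ("dear", "gentlemen", "to whom it may concern")
POSTSCRIPT_MARKERS = ("p.s.", "attachment", "enclosure", "appendix")
VERBS = {"enclose", "call", "is", "are", "have", "thank"}

def words(line):
    """Lowercased alphabetic words of a line."""
    return re.findall(r"[a-z]+", line.lower())

def segment(lines):
    """Return (heading, ending) as lists of lines."""
    # First location: the first postscript/attachment line, else end of text.
    first_location = next(
        (i for i, line in enumerate(lines)
         if line.lower().startswith(POSTSCRIPT_MARKERS)),
        len(lines),
    )
    # Final sentence: scan backward from the first location for a verb.
    last_sentence = next(
        (i for i in range(first_location - 1, -1, -1)
         if VERBS & set(words(lines[i]))),
        -1,
    )
    # Heading: everything before the salutation, if one is found.
    salutation = next(
        (i for i, line in enumerate(lines[:first_location])
         if line.lower().startswith(SALUTATIONS)),
        None,
    )
    heading = lines[:salutation] if salutation is not None else []
    ending = lines[last_sentence + 1:first_location]
    return heading, ending

letter = [
    "March 3, 1987",
    "To: J. Doe",
    "Subject: Budget report",
    "Dear Mr. Doe:",
    "We enclose the budget report for your review.",
    "Please call me with any questions.",
    "Sincerely,",
    "A. Smith",
    "Manager",
    "cc: B. Jones",
    "Attachment: Budget report",
]
heading, ending = segment(letter)
```

For the sample letter, the heading holds the date, addressee, and subject lines, and the ending runs from the closing phrase through the carbon copy line.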
Frame Interpreter Module
The parametric information extracted from the parser data structure is
identified and stored in standard formats in the form of frames. A frame
provides a set of expectations that have to be fulfilled in particular
situations. For our analysis of business correspondence data, the
expectations embodied with the frame procedures are that there will be a
discourse structure with a heading, body, and ending. Within each of these
sections there are additional lower-order expectations. However, these
expectations may not always be realized because not every business
document contains all these constituents.
A frame defines a chunk of knowledge which is represented by a set of slots
and their content. It is exactly these slots that serve to associate the
concepts in an organized manner. The PIE frame has a fixed number of
categories and a variable number of slots. The categories of this frame
correspond to the 10 parameters: (1) date of the letter, (2) name of the
sender, (3) name of the recipient, (4) title of the sender, (5) address of
the sender, (6) userid/nodeid of the sender, (7) userid/nodeid of the
recipient, (8) carbon copy list, (9) the subject statement, and (10) the
reference statement. The slots of the frame correspond to each of the
above categories, but permit one or more instances of each category to
occur. This is important since an unspecified number of recipients, or
carbon copy names may exist in a document.
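A frame with a fixed set of categories and a variable number of slots per category might be represented as in this sketch. The category identifiers and the Frame class are hypothetical names chosen for illustration; only the ten categories themselves come from the description above.

```python
# The ten categories of the PIE frame; identifier spellings are assumed.
CATEGORIES = (
    "date",
    "sender_name",
    "recipient_name",
    "sender_title",
    "sender_address",
    "sender_userid",
    "recipient_userid",
    "cc_list",
    "subject",
    "reference",
)

class Frame:
    """Fixed categories, each holding zero or more slot values."""

    def __init__(self):
        self.slots = {category: [] for category in CATEGORIES}

    def fill(self, category, value):
        """Add one more instance to a category."""
        if category not in self.slots:
            raise KeyError(f"unknown category: {category}")
        self.slots[category].append(value)

frame = Frame()
frame.fill("recipient_name", "J. Doe")
frame.fill("recipient_name", "B. Jones")
frame.fill("date", "870303")
```

Keeping a list per category is what allows an unspecified number of recipients or carbon copy names to be stored.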
Different types of pattern recognition are required to isolate fields such
as addressee or date. The recognition mechanisms for personal names, for
example, depend on context (personal titles like "Mr.," "Dr.") or
syntactic structure (a prepositional phrase like "to J. Doe"). Dates, by
contrast, have more predictable formats and are recognized by application
of finite-state procedures which are described by formal languages or
syntax diagrams.
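The two kinds of name recognition mentioned above, by title context and by prepositional phrase, can be approximated with regular expressions as in the following sketch. The patterns and the function name are assumptions for illustration; they cover only initials followed by a capitalized surname and are far narrower than a parser-based procedure.

```python
import re

# A personal title followed by optional initials and a surname.
TITLE_NAME = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+((?:[A-Z]\.\s+)*[A-Z][a-z]+)")
# The preposition "to" followed by at least one initial and a surname.
TO_NAME = re.compile(r"\b[Tt]o:?\s+((?:[A-Z]\.\s+)+[A-Z][a-z]+)")

def find_names(text):
    """Return candidate personal names in order of appearance."""
    found = []
    for pattern in (TITLE_NAME, TO_NAME):
        for match in pattern.finditer(text):
            found.append((match.start(1), match.group(1)))
    return [name for _, name in sorted(found)]

names = find_names("Please forward this to J. Doe and Dr. A. Smith.")
```

Requiring an initial after "to" keeps ordinary phrases such as "to the meeting" from being reported as names in this sketch.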
Mapping Module
Whereas the frame interpreter module scans the relevant portions of a
document in search of data for specific slots, the mapping procedure
standardizes the format of the data and organizes it in the slots of the
frame. Dates, for example, can be found in both textual and numeric
formats in the text of a letter. Also, numeric dates can be in American or
European formats. The mapping procedure converts these dates to YYMMDD
format, where YY is the year, MM the month, and DD the day. Proper names
are also scanned to remove titles such as Mr., Dr., etc. The mapping
module fills the slots of the frame for the 10 categories using formal
syntactic descriptions of the data to be extracted to ascertain that the
format corresponds to what is expected.
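The date standardization performed by the mapping procedure can be illustrated with the following sketch. It assumes three input shapes (month name first, day first with a month name, and purely numeric) and takes the American or European reading of numeric dates from a caller-supplied flag; the function name, the flag, and the accepted shapes are assumptions, not the date syntax of FIG. 9.

```python
import re

_MONTH_NAMES = ("january", "february", "march", "april", "may", "june",
                "july", "august", "september", "october", "november",
                "december")
# Accept full month names and their three-letter abbreviations.
MONTHS = {}
for number, name in enumerate(_MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

YEAR = r"(\d{4}|\d{2})"

def to_yymmdd(text, european=False):
    """Convert a date string to YYMMDD, or return None if unrecognized."""
    t = text.strip().lower()
    month = day = year = None
    m = re.fullmatch(r"([a-z]+)\.?\s+(\d{1,2}),?\s+" + YEAR, t)
    if m and m.group(1) in MONTHS:            # e.g. "March 3, 1987"
        month, day, year = MONTHS[m.group(1)], int(m.group(2)), int(m.group(3))
    if month is None:
        m = re.fullmatch(r"(\d{1,2})\s+([a-z]+)\.?,?\s+" + YEAR, t)
        if m and m.group(2) in MONTHS:        # e.g. "3 March 1987"
            day, month, year = int(m.group(1)), MONTHS[m.group(2)], int(m.group(3))
    if month is None:
        m = re.fullmatch(r"(\d{1,2})[/.\-](\d{1,2})[/.\-]" + YEAR, t)
        if not m:
            return None
        first, second, year = (int(g) for g in m.groups())
        # American order is month/day; European order is day/month.
        day, month = (first, second) if european else (second, first)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    return f"{year % 100:02d}{month:02d}{day:02d}"
```

For example, to_yymmdd("March 3, 1987") and to_yymmdd("3/3/87") both yield "870303", while a string that fits none of the shapes yields None so that unrecognized data is kept out of the frame.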
The structural information used by the mapping complements that used
during the identification of the fields. The formal syntactic descriptions
ensure that only the data that is appropriately recognized is placed into
the slots of the output frames. The syntactic descriptions, in essence,
act as "cleanup" filters that standardize the format of the data selected.
Development of a formal description of text requires analysis of a
substantial amount of text to produce an accurate and comprehensive
description.
General Description
In building a natural language understanding system, programs need various
degrees of linguistic knowledge. Therefore, one of the first major
decisions to be made is how to express and organize the necessary
linguistic and conceptual knowledge. The programs to extract parametric
information from business correspondence text have to "understand" the
material to at least the extent of determining how much of the information
in the text is needed to identify parametric information, and translating
that information into the appropriate representation in the data base
while preserving the meaning.
The PIE system must isolate many different document attributes and encode
them into the format or structure suitable for establishing a data base to
identify only relevant items for the user queries in the regular office
environment. The generated structure must contain all parametric
information from a document.
We shall now discuss briefly some aspects of natural language processing in
order to provide a little perspective on the subject. Specialized
Information Extraction (SIE) systems obtain parametric information from
the text and place it in a data base. When we refer to an SIE task, we
will mean one that deals with a restricted subject matter; requires
information that can be classified under a limited number of discrete
parameters; and deals with language of a specialized type. The particular
cases of SIE that we have chosen are highly structured business
correspondence.
Programs which purport to "understand" some aspects of the language being
processed, for whatever purpose, will need various amounts of linguistic
knowledge. The degree of linguistic sophistication needed varies with the
application. A program for word processing needs essentially no linguistic
knowledge, for instance, while a program for producing a word index at
least needs to know the definition of a word.
The various levels of linguistic knowledge to build a natural language
understanding system are the following:
1. Lexical Knowledge--the words of the language and their individual
syntactic properties (their "parts of speech," and often more complex
properties, including co-occurrence relations and perhaps lexical
decomposition) and meaning.
2. Morphological Knowledge--how the words are modified in shape in
particular circumstances (e.g. how plural or past tense are formed).
3. Syntactic Knowledge--how the words are put together to make meaningful
sentences.
4. Semantic Knowledge--how the form of the sentences expresses particular
meanings.
5. Discourse Knowledge--how sentences are put together to form utterances,
i.e. how sentences in an utterance relate to one another, both in forms
and content (syntax and semantics).
An understanding of the semantics of the language depends to a certain
extent upon lexical, syntactic and discourse knowledge. The lexical
knowledge will provide information about the meaning of individual words,
and it is then necessary to express how these meanings are put together to
form meanings of sentences (or multi-sentence utterances), for each
meaningful sentence or discourse in the language. The task of mapping a
sentence's form into some representation of the meaning is called the
semantic mapping. Of course it is necessary to define some meaning
representation before one can do any semantic mapping.
Meaning representation is machine-based data representation designed to
provide a means of expressing the meaning of a language. In the fields of
Computational Linguistics and Artificial Intelligence "frames" are used to
represent knowledge in the format suitable for computer manipulation.
Frames serve to simplify the control structure necessary for assigning
attributes to conceptual entities. It is the task of semantic mapping to
attach each attribute in the corresponding slot of the frame.
In all phases of language processing, the human listener or reader brings
to bear both linguistic and non-linguistic knowledge, and a computational
system for language processing must also use both linguistic and
non-linguistic knowledge.
One type of non-linguistic knowledge is embodied in what we usually think
of as logic--not only the true/false variety, but including things like
time relationships and probabilistic reasoning. A second form of
non-linguistic knowledge constantly used in dealing with language is
empirical knowledge, which consists of facts about the world that are not
specifically linguistic or logical.
In this PIE system, the empirical knowledge is in the program in the form
of heuristics and assumptions derived from our knowledge of the subject
matter of the text. In the semantic portion (which is used to extract the
desired parametric information) empirical knowledge is represented in the
form of "frames." Although this is not always the sense in which "frame"
is used, this is the sense in which we shall use the term in our
discussion below: Frames encode non-linguistic "expectation" brought to
bear on the task.
Whether one is dealing with a natural language or an artificial one, the
extraction of information expressed in specimens of the language is done
by analyzing the form of the utterances and proceeding to the meaning,
according to the conventions of the language. The conventions that
describe the form of possible utterances are called the syntax of the
language.
In the PIE program, there are only a finite number of parameters to be
determined in a restricted universe of discourse. It is still well to
assume that there are an infinite number of ways of expressing in the
language the information desired, as both theoretical considerations and
experience show that it would be futile to treat the problem in any other
way. It is necessary, as always, to deal with these infinite possibilities
by finite means through the use of problem segmentation and formal
descriptions where applicable.
It is quite possible that some advantage can be gained by first examining
in detail the potential input material for its special characteristics. It
may be that these special characteristics render the language easier to
process. The language may have the regularities that are built into
artificial languages to make them easier to process. To cite a particular
example, it may be that the name of recipient is always preceded by the
preposition "to." Then by looking for a personal name preceded by "to,"
one would hope to extract a relevant parameter, and also to obtain a piece
of information that may help in determining other aspects of sentence
structure.
Methods used to obtain information characteristic of the specialized
corpus, but which could not be motivated linguistically for the language
as a whole, are called "ad hoc methods." As with computer methods in
general, the "ad hoc" methods may be either algorithmic or heuristic in
nature, but they are likely to be the latter. That is, they are likely to
be rules of thumb, which often, but not always, return an answer. They may
even return an incorrect answer on occasion; if they do so very often,
there must be some method to check the answer, or the method becomes
counterproductive. If an answer is not returned, then other heuristics are
applied, but in some cases none may work.
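The fallback behavior described above, in which heuristics are tried in turn until one returns an answer, might be sketched as follows. The two date heuristics and the helper `first_answer` are hypothetical names chosen for illustration; the source does not list the actual heuristics used.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

def date_numeric(text):
    """Heuristic 1 (assumed): a date written as digits separated by slashes."""
    m = re.search(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", text)
    return m.group(0) if m else None

def date_month_name(text):
    """Heuristic 2 (assumed): a month name, a day, a comma, and a four-digit year."""
    m = re.search(r"\b(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}\b", text)
    return m.group(0) if m else None

def first_answer(text, heuristics):
    """Apply each heuristic in order; return the first answer, or None if none fire."""
    for heuristic in heuristics:
        answer = heuristic(text)
        if answer is not None:
            return answer
    return None

print(first_answer("Dated March 3, 1989", [date_numeric, date_month_name]))
```

Returning `None` when every rule fails keeps the "no answer" case explicit, so a caller can tell it apart from a wrong answer.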
The grammar of the system created in this project consists of a lexicon, a
syntax, a meaning representation structure, and a semantic mapping. The
lexicon consists of the list of words in the language and one or more
grammatical categories for each word. The syntax specifies the structure
of sentences in the language in terms of the grammatical categories.
Morphological procedures recognize the regularities in the structure of
words and thereby reduce the size of the lexicon. A discourse structure,
or extrasentential syntax, is also included.
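As a rough illustration of how morphological procedures can keep the lexicon small, the sketch below stores only base forms and derives the categories of inflected forms by suffix stripping. The lexicon entries, the suffix table, and the function `categories` are assumptions made for this example and are far simpler than a real morphological analyzer.

```python
# Hypothetical lexicon: each word maps to one or more grammatical categories.
LEXICON = {
    "letter": {"noun"},
    "send": {"verb"},
    "address": {"noun", "verb"},
    "to": {"preposition"},
}

# Hypothetical suffix rules: (suffix, categories the inflected form may take).
SUFFIXES = [
    ("ing", {"verb"}),
    ("ed", {"verb"}),
    ("s", {"noun", "verb"}),
]

def categories(word):
    """Look a word up directly, then try stripping a known suffix."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix, allowed in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEXICON:
                # Keep only the categories compatible with the suffix.
                return LEXICON[stem] & allowed
    return set()

print(categories("letters"))
print(categories("addressed"))
```

With these rules, "letters", "sending", and "addressed" need no lexicon entries of their own; forms with spelling changes (such as "addresses") would need further rules that this sketch omits.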
To understand the meaning of a sentence in business correspondence text, the
invention is capable of: parsing the syntactic structure; interpreting
each sentence for its discourse purpose; disambiguating the referential
terms; and mapping the words of each sentence to a representation used by
the programs.
Therefore, the automatic process of extracting parametric information from
business correspondence may be split into four major tasks: syntactic
analysis of text; structural analysis of text; semantic analysis of text;
and semantic mapping procedure.
The establishment of a grammar is one of the fundamental tasks which has to
be accomplished before text that exhibits substantial variation, such as
natural language text, can be manipulated. The grammar is the basis of the
computer programs generated to analyze, or parse, text.
In order to be able to utilize the syntactic structure of a language to
determine the structure of individual sentences in a computational system,
it is first necessary to formalize the grammar and rid it of any
ambiguities, and second, to develop a parser. Therefore, the syntactic
analysis task of this project has been concerned with the use of a grammar
that adequately describes the business correspondence documents for
parsing purposes and parsing algorithms that extract parametric
information from business correspondence, implemented in programs.
To analyze a sentence of a natural language, a computer program recognizes
the phrases within the sentence, builds data structures for each of them
and combines those structures into one that corresponds to the entire
sentence. The algorithm which recognizes the phrases and invokes the
structure-building procedures is the parsing algorithm implemented in the
program.
Along another dimension, language understanding is embedded in a form of
discourse. Understanding language involves interpreting the language in
terms of the discourse in which it is embedded. Therefore, the semantic
analysis of any "understanding" system has to include knowledge for
understanding situations, objects and events, and also knowledge about the
conventions of the form of discourse.
The role of semantics in language analysis is to relate symbols to
concepts. The semantic mapping provides, for each syntactically correct
sentence, a meaning representation in the meaning representation language,
and it is the crux of the whole system. If the semantic mapping is
fundamentally straightforward, then the syntactic processing can often be
reduced. This is one of the virtues of SIE systems; because of the
specialized subject matter, the syntactic processing can often be
simplified through the use of either "ad hoc" or algorithmic procedures
derived from text analysis.
Semantic analysis can be considered to consist of the recognition of
references to particular objects or events and the integration of familiar
concepts into unusual ones. When language understanding goes beyond the
boundaries of single sentences, various linguistic structures are
recognized. According to current theories, if a familiar event, such as a
document parameter, is described, understanding the parameter description
involves recognizing the similarities and differences between the current
description and a description of a stereotype of a document parameter.
The complications of automatically extracting information from specialized
natural language text require sophisticated techniques, within a
methodology that combines linguistic theory and "ad hoc" heuristics (based
upon the specialized nature of the material) to provide more satisfactory
results than either the application of available linguistic knowledge or
"ad hoc" heuristics alone could provide.
One of the problems that has to be confronted in the design of a language
understanding system is how to design the system components and their
interaction. Thus, identification of the frames that are to be implemented
is a very important consideration. For the extraction of parametric
information our first impulse might be to define a frame containing the
expectations mentioned above: date, name of sender, name of recipient,
address, etc. However, consideration of how the parameters found in the
text will be used to fill the slots of the frame makes it necessary to
take into account the discourse structure of business correspondence text
and the semantic content of the information presented. The structure that
we call the "PIE model" integrates the discourse structure and provides a
logical foundation for the design of two procedures: the Discourse PIE
Module and the PIE Frame.
Each English sentence in office correspondence text presented to the PIE
system is interpreted via a parser, a discourse analysis procedure, a
frame interpreter, and a mapping program that converts the textual
information into standard formats. FIG. 1 illustrates a data flow for the
PIE system.
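The last stage named above, conversion of textual information into standard formats, can be illustrated with a date. The sketch below assumes ISO 8601 as the target format and assumes two input layouts; neither choice comes from the source, and `normalize_date` is a hypothetical helper. Month names are parsed according to the current locale, assumed here to be English.

```python
from datetime import datetime

# Assumed input layouts; a real system would need many more.
INPUT_FORMATS = ("%B %d, %Y", "%m/%d/%Y")

def normalize_date(text):
    """Return the date in ISO 8601 form, or None if no assumed layout matches."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this layout did not match; try the next one
    return None

print(normalize_date("March 3, 1989"))
```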
The following paragraphs explain the linguistic techniques and terminology
which have been used in this work.
Discourse Analysis
The basic purpose of analyzing the discourse structure is to obtain and
make use of locative clues that improve the extraction of information.
Stated in another way, knowledge of the context in which specific words
occur narrows the scope of their meaning sufficiently to eliminate
ambiguities. Discourse analysis, thus, refines specialized information
extraction tasks by identifying the heading, body, and ending of each
document.
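A simplified sketch of this three-way split is given below. It assumes that the heading runs through the first salutation line and that the ending starts after the last line containing a verb. The word lists and the function `segment` are illustrative assumptions, and the sketch ignores postscripts and attachments.

```python
# Hypothetical word lists; a real system would use a full lexicon.
SALUTATIONS = ("dear", "gentlemen", "to whom it may concern")
VERBS = {"is", "are", "was", "were", "have", "has", "will", "thank", "send"}

def segment(lines):
    """Split a document (a list of lines) into heading, body, and ending."""
    # Heading: everything up to and including the first salutation line.
    salutation = next(
        (i for i, ln in enumerate(lines)
         if ln.strip().lower().startswith(SALUTATIONS)),
        None,
    )
    # Body ends at the last line that contains a known verb.
    last_verb = None
    for i, ln in enumerate(lines):
        words = {w.strip(".,;:!?").lower() for w in ln.split()}
        if words & VERBS:
            last_verb = i
    head_end = salutation + 1 if salutation is not None else 0
    body_end = last_verb + 1 if last_verb is not None else len(lines)
    body_end = max(body_end, head_end)
    return {
        "heading": lines[:head_end],
        "body": lines[head_end:body_end],
        "ending": lines[body_end:],
    }

letter = [
    "March 3, 1989",
    "Dear Ms. Jones:",
    "We have received your order.",
    "It will ship Monday.",
    "Sincerely,",
    "John Smith",
    "Sales Manager",
]
print(segment(letter)["ending"])
```

Once the three parts are separated, a search for the sender's name or title can be confined to the ending, which is the kind of locative clue the text describes.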
Discourse is any connected piece of text of more than one sentence or more
than one independent sentence fragment. In order to interpret discourse, it
is necessary to: disambiguate the referential terms for their
intersentential and extrasentential links; and determine the purpose of
each sentence in the discourse.
The purpose of the discourse analysis in the PIE system is to fill the slots
of the frame correctly with values and required information. While the PIE system
is designed to understand the English form of business correspondence, the
design depends on the method of interpreting the discourse structure of
the business correspondence data.
One of the interesting aspects of computational linguistics is that the
specific tasks that need to be accomplished to understand text are
intertwined so that it is impossible to design a system in a purely
hierarchical manner. In the task of extracting parametric information from
office correspondence, for example, we can operate most effectively when
we have identified the three components of the model in a document:
heading, body, and ending. However, the identification and classification
of the sentences of the text into these three categories requires
algorithmic procedures that have a detailed knowledge of the
characteristics of each of the three components.
An example of the business correspondence discourse model is given in FIG.
2. Because the purpose of the PIE system is to extract parametric
information from the heading and/or ending portions of a document, the
clear identification of the heading and ending becomes very important to
eliminate ambiguities. The discourse model of the PIE system will be
discussed later.
Frame Procedure
Frame procedures provide a set of expectations that have to be fulfilled in
particular situations. For our analysis of business correspondence data,
the expectations embodied in the frame procedures are that there will be
a discourse structure with a heading, body, and ending. Within each of
these sections there are additional lower-order expectations. These
expectations may be the following: date of a letter, name of a sender,
name of recipient, title of a sender, address of a sender, and other
parameters. These are expectations which may not always be realized
because not every business document contains all these parameters.
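One way to picture these expectations is as a record whose slots may remain empty. The class `PieFrame`, its slot names, and the method `unfilled` are hypothetical and serve only to illustrate the idea; the source does not specify how the frame is stored.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class PieFrame:
    """Hypothetical frame: one optional slot per expected parameter."""
    date: Optional[str] = None
    sender_name: Optional[str] = None
    recipient_name: Optional[str] = None
    sender_title: Optional[str] = None
    sender_address: Optional[str] = None

    def unfilled(self):
        """Names of the slots whose expectation was not realized."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

frame = PieFrame(date="March 3, 1989", sender_name="John Smith")
print(frame.unfilled())
```

Defaulting every slot to `None` reflects the point that a document need not contain every parameter.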
A frame is defined as a chunk of knowledge consisting of slots and their
content. It is exactly these slots that serve the purpose of association
links to other concepts. The PIE frame has a fixed number of categories
and a variable number o