WikiPatents - Community Patent Review
Computer method for automatic extraction of commonly specified information from business correspondence    

United States Patent 4965763
Link to this page: http://www.wikipatents.com/4965763.html
Inventor(s): Zamora; Elena M. (Chevy Chase, MD)
Abstract: A Parametric Information Extraction (PIE) system has been developed to identify automatically commonly specified information such as author, date, recipient, address, subject statement, etc. from documents in free format. The program-generated data can be used directly or can be supplemented manually to provide automatic indexing or indexing aid, respectively.
   














Computer method for automatic extraction of commonly specified information from business correspondence
Inventor     Zamora; Elena M. (Chevy Chase, MD)
Owner/Assignee     International Business Machines Corporation (Armonk, NY)
Publication Date     October 23, 1990
Application Number     07/308,955
Filing Date     February 6, 1989
US Classification     704/1
Int'l Classification     G06F 015/40
Examiner     Chan; Eddie P.
Assistant Examiner    
Attorney/Law Firm     Hoel; John E.
Parent Case     This is a continuation of U.S. patent application Ser. No. 021,078, filed Mar. 3, 1987, now abandoned.
USPTO Field of Search     364/200 MS File 364/900 MS File 364/419
Patent Tags     computer automatic extraction commonly specified information business correspondence
   

References

U.S. References

Patent No.   Inventor    Class     Date
4773009      Kucera      715/531   Sep. 1988
4506326      Shaw        707/4     Mar. 1985
4417321      Chang       707/7     Nov. 1983
4384329      Rosenbaum   704/10    May 1983
4358824      Glickman    707/5     Nov. 1982
Claims


What is claimed is:

1. A computer method for the automatic extraction of commonly specified information from a business correspondence document, such as date of letter, name of recipient, name of sender, address of sender, title of sender, carbon copy list, subject statement, and the like, comprising the steps of:

a first scanning step of scanning the input data stream to locate postscripts, attachments or appendices at a first location by matching each word from the input data stream against a list of expressions used to indicate postscripts, attachments or appendices, said first location being set equal to the final occurring line in said data stream if said first scanning step does not locate any postscripts, attachments or appendices therein, said first location alternately being set equal to a location of postscripts, attachments or appendices found in said first scanning step;

a second scanning step of scanning the input data stream to locate the final sentence in said document, starting from said first location and scanning toward the beginning of said data stream, searching for words which are verbs in the final sentence in said document, by identifying the last occurrence of a verb in the input data stream, which will occur in the final sentence of said document;

a first identifying step of identifying an ending portion of the document expected to contain a sender's name, return address, title or carbon copy list information, at a location in the input data stream occurring after the end of said final sentence located in said second scanning step, and occurring before said first location located in said first scanning step;

a third scanning step of scanning said input data stream to locate any salutation by matching each word from the input data stream against a list of natural language expressions that can be used as a salutation;

a second identifying step of identifying a beginning portion of the document at a location which includes a portion from the start of the input data stream to the end of a salutation, if a salutation was located in said third scanning step;

a fourth scanning step of scanning the input data stream if no salutation was found in said third scanning step, said fourth scanning step to locate date, addressee, sender, return address, personal title or subject information in the input data stream by matching each word of the input data stream against a list of expressions that are used to indicate the date, addressee, the sender, the return address, personal title and the subject of the correspondence document;

a third identifying step of identifying, if no salutation was found in said third scanning step, a beginning portion of the document at a location which includes the date, addressee, sender, return address, personal title or subject information of the correspondence document located in said fourth scanning step;

isolating and storing from said beginning portion of said document, any addressee, sender, return address, personal title or subject information therein;

isolating and storing from said ending portion of said document, any sender, return address, title or carbon copy list information therein.

2. A computer method for the automatic extraction of commonly specified information from a business correspondence document, such as date of letter, name of recipient, name of sender, address of sender, title of sender, carbon copy list, subject statement, and the like, comprising the steps of:

a first scanning step of scanning the input data stream to locate postscripts, attachments or appendices at a first location by matching each word from the input data stream against a list of expressions used to indicate postscripts, attachments or appendices, said first location being set equal to the final occurring line in said data stream if said first scanning step does not locate any postscripts, attachments or appendices therein, said first location alternately being set equal to a location of postscripts, attachments or appendices found in said first scanning step;

a second scanning step of scanning the input data stream to locate the final sentence in said document, starting from said first location and scanning toward the beginning of said data stream, searching for words which are verbs in the final sentence in said document, by identifying the last occurrence of a verb in the input data stream, which will occur in the final sentence of said document;

a first identifying step of identifying an ending portion of the document expected to contain a sender's name, return address, title or carbon copy list information, at a location in the input data stream occurring after the end of said final sentence located in said second scanning step, and occurring before said first location located in said first scanning step;

a third scanning step of scanning said input data stream to locate any salutation by matching each word from the input data stream against a list of natural language expressions that can be used as a salutation;

a second identifying step of identifying a beginning portion of the document at a location which includes a portion from the start of the input data stream to the end of a salutation, if a salutation was located in said third scanning step;

a fourth scanning step of scanning the input data stream if no salutation was found in said third scanning step, said fourth scanning step to locate date, addressee, sender, return address, personal title or subject information in the input data stream by matching each word of the input data stream against a list of expressions that are used to indicate the date, addressee, the sender, the return address, personal title and the subject of the correspondence document;

a third identifying step of identifying, if no salutation was found in said third scanning step, a beginning portion of the document at a location which includes the date, addressee, sender, return address, personal title or subject information of the correspondence document located in said fourth scanning step;

isolating and storing from said beginning portion of said document, any addressee, sender, return address, personal title or subject information therein;

isolating and storing from said ending portion of said document, any sender, return address, title or carbon copy list information therein;

storing said document in a file accessible by any addressee, sender, return address, title, subject information, or carbon copy list information.
Description


BACKGROUND OF THE INVENTION

1. Technical Field

The invention disclosed broadly relates to data processing and more particularly relates to linguistic applications in data processing.

2. Background Art

Text processing and word processing systems have been developed for both stand-alone applications and distributed processing applications. The terms text processing and word processing will be used interchangeably herein to refer to data processing systems primarily used for the creation, editing, communication, and/or printing of alphanumeric character strings composing written text. A particular distributed processing system for word processing is disclosed in the copending U.S. patent application Ser. No. 781,862 filed Sept. 30, 1985 entitled "Multilingual Processing for Screen Image Build and Command Decode in a Word Processor, with Full Command, Message and Help Support," by K. W. Borgendale, et al., assigned to IBM Corporation. The figures and specification of the Borgendale, et al. patent application are incorporated herein by reference, as an example of a host system within which the subject invention herein can be applied.

Document retrieval is the function of finding stored documents which contain information relevant to a user's query. Prior art computer methods for document retrieval are logically divided into a first component process for creating a document retrieval data base and a second component process for interrogating that data base with the user's queries. In the process of creating the data base, each document which is desired to be entered into the data base, is associated with a unique document number. Then the words comprising the text of the document are scanned and are compiled into an inverted file index. The inverted file index is the accumulation of each unique word encountered in all of the documents scanned. As each word of a document is scanned, the corresponding document number is associated with that word and a search is made through the inverted file index to determine whether that particular word has been previously encountered in either the current document or previous documents entered into the data base. If the word has not been previously encountered, then the word is entered as a new word in the inverted file index and the document number is associated therewith. If, instead, the word has been previously encountered, either in the current document or in a previous document, then the location of the word in the inverted file index is found and the current document number is added to the collection of previous document numbers in which the word has been found. As additional documents are added to the data base, each respective unique word in the inverted file index accumulates additional document numbers for those documents containing the particular word. The inverted file index is stored in the memory of the data processor in the document retrieval system. A document table can also be stored in the memory, containing each respective document number and the corresponding document identification such as its title, location, or other identifying attributes. Typically, prior art techniques for creating a document retrieval data base required a scanning of the entire document in the compilation of the inverted file index.

After the inverted file index and the document table have been created in the computer memory, the second stage in the prior art computer methods for document retrieval can take place, namely the input by the user of query words or expressions selected by the user to characterize the types of documents he is seeking in a particular retrieval application. When the user inputs his query words, each word is compared with the inverted file index to determine whether that word matches with any words previously entered in the inverted file index. Upon making a successful match with the query word, the corresponding document numbers for the matched entry in the inverted file index are noted. If additional words are present in the user's input query, each respective word is subjected to the matching operation with the words in the inverted file index and the corresponding document numbers for matched words are noted. Then, a scoring technique is employed to identify those documents having the largest number of matching words to the words in the user's input query. The highest scoring documents can then have their titles or other identifying attributes displayed on the display monitor for the computer in the retrieval system.

An example of such a prior art document retrieval system is the IBM System/370 Storage and Information Retrieval System (STAIRS) which is described in IBM publication GH12-5123-1 entitled "IBM System/370 Storage and Information Retrieval System/Virtual Storage--Thesaurus and Linguistic Integrated System," November 1976. Another such system is described in U.S. Pat. No. 4,358,824 to Glickman, et al. entitled "Office Correspondence Storage and Retrieval System," assigned to the IBM Corporation.
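
As an illustration only, the prior art scheme described above (an inverted file index keyed by word, followed by scoring documents on the number of matching query words) might be sketched as follows. The function names, the whitespace tokenization, and the sample documents are assumptions made for this sketch and are not taken from STAIRS or from the Glickman patent.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each unique word to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_number, text in documents.items():
        # Assumption: words are separated by whitespace and compared in lowercase.
        for word in text.lower().split():
            index[word].add(doc_number)
    return index


def score_query(index, query):
    """Count, for each document, how many distinct query words it contains."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_number in index.get(word, ()):
            scores[doc_number] += 1
    # Highest score first; ties are ordered by document number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data base of three documents.
documents = {
    1: "quarterly budget report for the sales office",
    2: "travel request for the sales conference",
    3: "budget approval memo",
}
index = build_inverted_index(documents)
ranking = score_query(index, "sales budget")
```

Here `ranking` lists document 1 first because it matches both query words, which mirrors the "largest number of matching words" scoring described in the text.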

Although these prior art document retrieval systems work well, because documents have different topics and are written by different authors at different times, the user may seek only the particular document of a certain author and/or certain subject or date. This retrieval-related information is referred to as the retrieval parameters. This becomes particularly true with business correspondence where the user desiring to retrieve a document may remember only the author, date, recipient, address, subject statement, or other document parameter. It would therefore be desirable to have a document retrieval system which isolates the business correspondence parameters in the process of a data base creation, thereby facilitating the retrieval of business correspondence through the use of queries comprising such business correspondence parameters. The problem of reliably retrieving business correspondence is further compounded when the user compiles a query containing terms which are not exactly the same as the terms in the parameters compiled into the data base during the data base creation phase. It would be desirable to have a document retrieval system suitable for retrieving business correspondence using terms in a query which are different in their linguistic structure, syntax or semantics from the terms employed in the compilation of the data base.

Objects of the Invention

It is therefore an object of the invention to provide an improved document retrieval system.

It is another object of the invention to provide an improved computer method for retrieval of business correspondence.

It is still a further object of the invention to provide an improved business correspondence document retrieval system which is based upon parametric fields which characterize business correspondence.

It is yet a further object of the invention to provide an improved computer method for the retrieval of business correspondence which is tolerant to variations in the linguistic structure, syntactic, or semantic form of the user's input query.

SUMMARY OF THE INVENTION

These and other objects, features and advantages of the invention are accomplished by the computer method disclosed herein. A Parametric Information Extraction (PIE) system has been developed to identify automatically parametric fields such as author, date, recipient, address, subject statement, etc. from documents in free format. The program-generated data can be used directly or can be supplemented manually to provide automatic indexing or indexing aid, respectively.

The PIE system uses structural, syntactic, and semantic knowledge to accomplish its objective. The structural analysis identifies the document heading, body, and ending. The heading and ending, which are the components that contain the parametric information, are then analyzed by a battery of morphologic, syntactic, and semantic pattern-matching procedures that provide the parametric information in standardized forms that can be easily manipulated by computer.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects, features and advantages of the invention can be more fully appreciated with reference to the accompanying figures.

FIG. 1 is a data flow diagram of the parametric information extraction process.

FIG. 2 is a discourse model of business correspondence documents.

FIG. 3 illustrates the frame slots for business correspondence.

FIG. 4 illustrates a typical business correspondence document.

FIG. 5 illustrates a list of business correspondence closing phrases.

FIG. 6 illustrates a list of the heading identifiers.

FIG. 7 illustrates a list of heading expectations.

FIG. 8 illustrates a list of ending expectations.

FIG. 9 is a data flow diagram of the date syntax.

FIG. 10 is a flow diagram of the MAINEXT program which extracts parametric fields from a document.

FIG. 11 is a flow diagram of the END_DOC program which identifies document endings.

FIG. 12 is a flow diagram of the HEADDOC program which identifies the heading of a document.

FIG. 13 is a flow diagram of the HEADING program which extracts parametric fields from a heading.

FIG. 14 is a flow diagram of the ENDING program which extracts parametric fields from an ending.

FIG. 15 is a flow diagram of the ISOLEXT program which creates a frame of parametric fields.

FIG. 16 is a flow diagram illustrating the operation of entering a document identification into a data base.

FIG. 17 is a flow diagram illustrating inputting a query in order to retrieve a document identification from a data base.

FIG. 18 is a schematic illustration of a portion of the memory in the computer in which the inverted file index is constructed for document retrieval, using PIE frame categories.

DESCRIPTION OF THE BEST MODE FOR CARRYING OUT THE INVENTION

Introduction

Document retrieval is the problem of finding stored documents which contain information relevant to a user's query. Because the documents have different topics and are written by different authors at different times, the user may seek only the particular document of a certain author and/or certain subject or date. This retrieval-related information is referred to as "parameters." This paper describes a system that isolates certain document attributes and encodes them into a structure for storing office documents. The structure is suitable to establish a data base that identifies only relevant items for user queries in a regular office environment.

Approach

Although the task of automatically extracting parametric data appears to be well-defined, the problem is difficult because the document format often depends on the whims of the author, the vocabulary is unconstrained, and the contents of the fields to be extracted are unknown. The inventive approach used relies on computational linguistics methods for structural, syntactic, and semantic knowledge. Each English sentence in the office text presented to the PIE system is interpreted via a parser, a discourse analysis procedure, a frame interpreter, and a mapping program that converts the textual information into standard formats.

The structural (discourse) analysis uses a model of the discourse to control the focus of the programming environment for the three identifiable components of business correspondence discourse--the heading, body, and ending of the document. The syntactic analysis (parsing), by contrast, is concerned with the grammatical interpretation of text to determine the parts of speech of the words and the phrase structure of the sentences.

The structural and syntactic information makes it possible to set up a framework of expectations to drive subsequent field-oriented semantic text analysis. Finally, the actual data extraction consists of mapping the data found in the document to the slots reserved for the data in the output structure. This is a "data cleanup" procedure that standardizes the format of the data as required by the information storage and retrieval programs which use the information.

Syntactic Module

To analyze a sentence of natural language, a computer program recognizes the words and the phrases within the sentence, builds data structures representing their syntactic structure and combines them into a structure that corresponds to the entire sentence. The algorithm which recognizes the phrases and invokes the structure-building procedures is the parser. An example of such a parser is disclosed in the copending patent application Ser. No. 924,670, filed Oct. 29, 1986, entitled "A Parser for Natural Language Text," by A. Zamora, et al., assigned to IBM Corporation, and incorporated herein by reference.

The parser analyzes text for the identification of sentence components including part of speech and phrase structure. It constructs a bidirectional-list data structure consisting of list nodes, string nodes, and attribute nodes. The list nodes make it possible to scan the data structure forward and backwards. The string nodes are attached to the list nodes; they represent each lexical item in the text and contain pointers to the attribute nodes. The attribute nodes consist of an attribute name and a value which may be used to indicate part of speech, level of nesting of a phrase, start of a line, etc. The PIE system accesses the parser's word-oriented data structure through service subroutines to get the lexical items corresponding to the string nodes, and retrieve the attributes associated with them.
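
A minimal sketch of such a bidirectional word list is given below, purely as an illustration. The class and function names are hypothetical, and the attribute nodes are simplified to a dictionary of name/value pairs on each string node; the parser of the copending application may organize them differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StringNode:
    """One lexical item plus its attributes (name -> value), e.g. part of speech."""
    text: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ListNode:
    """Links a string node to its neighbours so the text can be scanned both ways."""
    string: StringNode
    prev: Optional["ListNode"] = field(default=None, repr=False, compare=False)
    next: Optional["ListNode"] = field(default=None, repr=False, compare=False)


def build_list(tagged_words):
    """Build the doubly linked list from (word, attributes) pairs."""
    head = tail = None
    for text, attrs in tagged_words:
        node = ListNode(StringNode(text, dict(attrs)))
        if tail is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head, tail


def last_with_attribute(tail, name, value):
    """Scan backward from the tail for the last item carrying name == value."""
    node = tail
    while node is not None:
        if node.string.attributes.get(name) == value:
            return node
        node = node.prev
    return None


# Hypothetical tagged sentence; the part-of-speech values are supplied by hand here.
head, tail = build_list([
    ("Please", {"pos": "adverb", "start_of_line": True}),
    ("reply", {"pos": "verb"}),
    ("soon", {"pos": "adverb"}),
])
last_verb = last_with_attribute(tail, "pos", "verb")
```

The backward scan in `last_with_attribute` is the kind of access a service subroutine could offer, for example when searching from the end of a document for the last verb.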

Discourse Interpreter Module

Isolation of parametric information depends on the correct identification of the discourse structure in the documents. This aspect of the analysis depends heavily on the format of the document. Most of the information that the system needs is located in the heading and ending of a document. Therefore, specific search procedures concentrate their efforts in these portions of the document.

In the PIE system the HEADING means the top portion of a document before the salutation. It usually does not contain verbs in the sentences (except in the subject or reference statements). The HEADING of a business document contains the date, the names of sender and recipient, the addresses, and the subject statement. It may also contain copy (cc) information, userid/nodeid information, and reference to previous correspondence.

The ENDING is the bottom portion of a business document that contains the signature of the author, but it may also contain carbon copy (cc) information, userid/nodeid information, and sender's address.

The basic purpose of the discourse structure analysis is to obtain and use locative clues that improve the extraction of information. These clues encode knowledge that can direct the programs to examine the locations within the discourse where co-referents (actual data) may be found. Therefore, clear identification of the heading and the ending of a document is very important to eliminate ambiguities. Date information, for example, may be located in the body of a document as well as in the heading, but only the date from the heading portion will be extracted after the discourse interpreter identifies the document structure.
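
The structural split can be sketched roughly as below, following the order of steps in the claims: locate a postscript or attachment marker, scan backward for the last line containing a verb, and treat the start of the document through the salutation as the heading. This is a simplified sketch under stated assumptions: the expression lists are hypothetical stand-ins for the lists shown in the figures, the verb set replaces the parser's part-of-speech attributes, and the unit of analysis is a line rather than a word.

```python
# Hypothetical expression lists (assumptions for this sketch only).
SALUTATIONS = ("dear ", "gentlemen", "to whom it may concern")
POSTSCRIPT_MARKERS = ("p.s.", "attachment", "enclosure", "appendix")
# Stand-in for verb identification by the parser (assumption).
VERBS = {"is", "are", "was", "will", "send", "call", "thank", "look"}


def split_document(lines):
    """Split a letter into heading, body, and ending line lists."""
    lowered = [line.strip().lower() for line in lines]

    # Heading: from the start of the document through the salutation line.
    heading_end = 0
    for i, line in enumerate(lowered):
        if line.startswith(SALUTATIONS):
            heading_end = i + 1
            break

    # First location: the first postscript/attachment line, else the document end.
    tail_limit = len(lines)
    for i, line in enumerate(lowered):
        if line.startswith(POSTSCRIPT_MARKERS):
            tail_limit = i
            break

    # Scan backward from the first location for the last line containing a verb.
    ending_start = tail_limit
    for i in range(tail_limit - 1, heading_end - 1, -1):
        words = {word.strip(".,;:!?") for word in lowered[i].split()}
        if words & VERBS:
            ending_start = i + 1
            break

    return {
        "heading": lines[:heading_end],
        "body": lines[heading_end:ending_start],
        "ending": lines[ending_start:tail_limit],
    }


# Hypothetical sample letter.
letter = [
    "March 3, 1987",
    "Mr. J. Doe",
    "Dear Mr. Doe:",
    "We will send the report next week.",
    "Sincerely,",
    "E. Smith",
    "cc: A. Jones",
    "P.S. Call me on Monday.",
]
parts = split_document(letter)
```

In this sample the date in the first line falls in the heading, so a date mentioned in the body or in the postscript would not be mistaken for the date of the letter.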

Frame Interpreter Module

The parametric information extracted from the parser data structure is identified and stored in standard formats in the form of frames. A frame provides a set of expectations that have to be fulfilled in particular situations. For our analysis of business correspondence data, the expectations embodied with the frame procedures are that there will be a discourse structure with a heading, body, and ending. Within each of these sections there are additional lower-order expectations. However, these expectations may not always be realized because not every business document contains all these constituents.

A frame defines a chunk of knowledge which is represented by a set of slots and their content. It is exactly these slots that serve to associate the concepts in an organized manner. The PIE frame has a fixed number of categories and a variable number of slots. The categories of this frame correspond to the 10 parameters: (1) date of the letter, (2) name of the sender, (3) name of the recipient, (4) title of the sender, (5) address of the sender, (6) userid/nodeid of the sender, (7) userid/nodeid of the recipient, (8) carbon copy list, (9) the subject statement, and (10) the reference statement. The slots of the frame correspond to each of the above categories, but permit one or more instances of each category to occur. This is important since an unspecified number of recipients, or carbon copy names may exist in a document.
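
One way to picture such a frame is sketched below: a fixed set of ten categories, each holding a variable-length list of slot values. The category identifiers and the `Frame` class are hypothetical names chosen for this illustration; only the ten categories themselves come from the text.

```python
# The ten categories named in the text; the identifier spellings are assumptions.
CATEGORIES = (
    "date",
    "sender_name",
    "recipient_name",
    "sender_title",
    "sender_address",
    "sender_userid",
    "recipient_userid",
    "carbon_copy",
    "subject",
    "reference",
)


class Frame:
    """Fixed categories, each with zero or more filled slots."""

    def __init__(self):
        self.slots = {category: [] for category in CATEGORIES}

    def fill(self, category, value):
        """Add one more instance of a category; unknown categories are rejected."""
        if category not in self.slots:
            raise KeyError(f"unknown frame category: {category}")
        self.slots[category].append(value)


# Hypothetical usage: a letter with two recipients and no carbon copy list.
frame = Frame()
frame.fill("recipient_name", "J. Doe")
frame.fill("recipient_name", "A. Jones")
frame.fill("date", "March 3, 1987")
```

Keeping a list per category lets an unspecified number of recipients or carbon copy names be recorded, while an empty list represents an expectation that the document did not fulfil.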

Different types of pattern recognition are required to isolate fields such as addressee or date. The recognition mechanisms for personal names, for example, depend on context (personal titles like "Mr.," "Dr.") or syntactic structure (a prepositional phrase like "to J. Doe"). Dates, by contrast, have more predictable formats and are recognized by application of finite state procedures which are described by formal languages or syntax diagrams.
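
The contrast between the two kinds of recognition can be illustrated with regular expressions, which are one common way to write down a finite-state pattern. The patterns below are assumptions for this sketch and cover only a few formats; the date syntax of FIG. 9 is not reproduced here.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Textual dates such as "March 3, 1987" and numeric dates such as "3/9/87".
TEXT_DATE = re.compile(r"\b(" + MONTHS + r")\s+(\d{1,2}),\s*(\d{4})\b")
NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2,4})\b")
# A personal title supplies the context that marks the following words as a name.
TITLED_NAME = re.compile(r"\b(Mr|Mrs|Ms|Dr)\.\s+((?:[A-Z]\.\s+)*[A-Z][a-z]+)")


def find_dates(text):
    """Return textual dates first, then numeric dates, as they appear in the text."""
    textual = [match.group(0) for match in TEXT_DATE.finditer(text)]
    numeric = [match.group(0) for match in NUMERIC_DATE.finditer(text)]
    return textual + numeric


def find_titled_names(text):
    """Return the names that follow a personal title, without the title itself."""
    return [match.group(2) for match in TITLED_NAME.finditer(text)]


dates = find_dates("Dated March 3, 1987 and received 3/9/87.")
names = find_titled_names("Please forward this to Dr. J. Doe and Mr. Smith.")
```

Names found through syntactic structure alone (for example "to J. Doe" with no title) would need the parser's phrase information and are outside this sketch.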

Mapping Module

Whereas the frame interpreter module scans the relevant portions of a document in search of data for specific slots, the mapping procedure standardizes the format of the data and organizes it in the slots of the frame. Dates, for example, can be found in both textual and numeric formats in the text of a letter. Also, numeric dates can be in American or European formats. The mapping procedure converts these dates to YYMMDD format, where YY is the year, MM the month, and DD the day. Proper names are also scanned to remove titles such as Mr., Dr., etc. The mapping module fills the slots of the frame for the 10 categories using formal syntactic descriptions of the data to be extracted to ascertain that the format corresponds to what is expected.
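
A minimal sketch of this standardization step is shown below, assuming only two input date formats and a short list of titles. The function names and the `european` flag are hypothetical; the sketch does not attempt to decide by itself whether a numeric date is American or European.

```python
import re

MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}


def to_yymmdd(date_text, european=False):
    """Convert 'March 3, 1987' or '3/9/87' to YYMMDD.

    For numeric dates, european=True reads day/month/year instead of
    month/day/year (the caller must know which convention applies).
    """
    cleaned = date_text.strip()
    textual = re.fullmatch(r"([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})", cleaned)
    if textual:
        month = MONTH_NUMBERS.get(textual.group(1).lower())
        if month is None:
            raise ValueError(f"unknown month in date: {date_text!r}")
        day, year = int(textual.group(2)), int(textual.group(3))
    else:
        numeric = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{2,4})", cleaned)
        if numeric is None:
            raise ValueError(f"unrecognized date format: {date_text!r}")
        first, second, year = (int(group) for group in numeric.groups())
        day, month = (first, second) if european else (second, first)
    return f"{year % 100:02d}{month:02d}{day:02d}"


def strip_title(name):
    """Remove a leading personal title such as 'Mr.' or 'Dr.' from a name."""
    return re.sub(r"^(?:Mr|Mrs|Ms|Dr)\.?\s+", "", name.strip())
```

For example, `to_yymmdd("3/9/87")` reads the date as March 9, while `to_yymmdd("3/9/87", european=True)` reads it as 3 September; rejecting inputs that match neither pattern reflects the role of the formal descriptions as filters.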

The structural information used by the mapping complements that used during the identification of the fields. The formal syntactic descriptions ensure that only the data that is appropriately recognized is placed into the slots of the output frames. The syntactic descriptions, in essence, act as "cleanup" filters that standardize the format of the data selected. Development of a formal description of text requires analysis of a substantial amount of text to produce an accurate and comprehensive description.

General Description

In building a natural language understanding system, programs need various degrees of linguistic knowledge. Therefore, one of the first major decisions to be made is how to express and organize the necessary linguistic and conceptual knowledge. The programs to extract parametric information from business correspondence text have to "understand" the material to at least the extent of determining how much of the information in the text is needed to identify parametric information, and translating that information into the appropriate representation in the data base while preserving the meaning.

The PIE system must isolate many different document attributes and encode them into the format or structure suitable for establishing a data base to identify only relevant items for the user queries in the regular office environment. The generated structure must contain all parametric information from a document.

We shall now discuss briefly some aspects of natural language processing in order to provide a little perspective on the subject. Specialized Information Extraction (SIE) systems obtain parametric information from the text and place it in a data base. When we refer to an SIE task, we will mean one that deals with a restricted subject matter; requires information that can be classified under a limited number of discrete parameters; and deals with language of a specialized type. The particular cases of SIE that we have chosen are highly structured business correspondence.

Programs which purport to "understand" some aspects of the language being processed, for whatever purpose, will need various amounts of linguistic knowledge. The degree of linguistic sophistication needed varies with the application. A program for word processing needs essentially no linguistic knowledge, for instance, while a program for producing a word index at least needs to know the definition of a word.

The various levels of linguistic knowledge to build a natural language understanding system are the following:

1. Lexical Knowledge--the words of the language and their individual syntactic properties (their "parts of speech," and often more complex properties, including co-occurrence relations and perhaps lexical decomposition) and meaning.

2. Morphological Knowledge--how the words are modified in shape in particular circumstances (e.g. how plural or past tense are formed).

3. Syntactic Knowledge--how the words are put together to make meaningful sentences.

4. Semantic Knowledge--how the form of the sentences expresses particular meanings.

5. Discourse Knowledge--how sentences are put together to form utterances, i.e. how sentences in an utterance relate to one another, both in form and content (syntax and semantics).

An understanding of the semantics of the language depends to a certain extent upon lexical, syntactic and discourse knowledge. The lexical knowledge will provide information about the meaning of individual words, and it is then necessary to express how these meanings are put together to form meanings of sentences (or multi-sentence utterances), for each meaningful sentence or discourse in the language. The task of mapping a sentence's form into some representation of the meaning is called the semantic mapping. Of course it is necessary to define some meaning representation before one can do any semantic mapping.

A meaning representation is a machine-based data representation designed to provide a means of expressing the meaning of a language. In the fields of Computational Linguistics and Artificial Intelligence, "frames" are used to represent knowledge in a format suitable for computer manipulation. Frames serve to simplify the control structure necessary for assigning attributes to conceptual entities. It is the task of semantic mapping to attach each attribute to the corresponding slot of the frame.

In all phases of language processing, the human listener or reader brings to bear both linguistic and non-linguistic knowledge, and a computational system for language processing must also use both linguistic and non-linguistic knowledge.

One type of non-linguistic knowledge is embodied in what we usually think of as logic--not only the true/false variety, but including things like time relationships and probabilistic reasoning. A second form of non-linguistic knowledge constantly used in dealing with language is empirical knowledge, which consists of facts about the world that are not specifically linguistic or logical.

In this PIE system, the empirical knowledge is in the program in the form of heuristics and assumptions derived from our knowledge of the subject matter of the text. In the semantic portion (which is used to extract the desired parametric information) empirical knowledge is represented in the form of "frames." Although this is not always the sense in which "frame" is used, this is the sense in which we shall use the term in our discussion below: frames encode the non-linguistic "expectations" brought to bear on the task.

Whether one is dealing with a natural language or an artificial one, the extraction of information expressed in specimens of the language is done by analyzing the form of the utterances and proceeding to the meaning, according to the conventions of the language. The conventions that describe the form of possible utterances are called the syntax of the language.

In the PIE program, there are only a finite number of parameters to be determined in a restricted universe of discourse. It is still well to assume that there are an infinite number of ways of expressing in the language the information desired, as both theoretical considerations and experience show that it would be futile to treat the problem in any other way. It is necessary, as always, to deal with these infinite possibilities by finite means through the use of problem segmentation and formal descriptions where applicable.

It is quite possible that some advantage can be gained by first examining in detail the potential input material for its special characteristics. It may be that these special characteristics render the language easier to process. The language may have the regularities that are built into artificial languages to make them easier to process. To cite a particular example, it may be that the name of the recipient is always preceded by the preposition "to." Then by looking for a personal name preceded by "to," one would hope to extract a relevant parameter, and also to obtain a piece of information that may help in determining other aspects of sentence structure.
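The "to" example can be sketched in a few lines of Python. This is only an illustration of the idea, not the method of the PIE programs: the pattern, the list of courtesy titles, and the name find_recipient are assumptions introduced here, and a personal name is approximated crudely as a run of capitalized words.

```python
import re

# Hypothetical pattern: the word "to" (optionally followed by a colon), then an
# optional courtesy title, then capitalized words with an optional middle initial.
RECIPIENT_PATTERN = re.compile(
    r"\b[Tt]o:?\s+"
    r"((?:(?:Mr|Mrs|Ms|Dr)\.\s+)?"      # optional title (assumed list)
    r"[A-Z][a-z]+"                      # first name
    r"(?:\s+[A-Z]\.)?"                  # optional middle initial
    r"(?:\s+[A-Z][a-z]+)*)"             # remaining capitalized name words
)


def find_recipient(line):
    """Return the name that follows "to" in the line, or None if none is found."""
    match = RECIPIENT_PATTERN.search(line)
    return match.group(1) if match else None


print(find_recipient("To: Mr. John A. Smith"))
print(find_recipient("please reply to the office"))
```

A rule of this kind will also accept any capitalized word after "to" (a weekday or a city, for instance), which is why such a locative clue is most useful when combined with other evidence.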

Methods used to obtain information characteristic of the specialized corpus, but which could not be motivated linguistically for the language as a whole are called "ad hoc methods." As with computer methods in general, the "ad hoc" methods may either be algorithmic or heuristic in nature, but they are likely to be the latter. That is, they are likely to be rules-of-thumb, which often, but not always, return an answer (they may even return an incorrect answer on occasion, but if they do this very often, there must be some method to check that answer, or the method becomes counterproductive). If an answer is not returned, then other heuristics are applied, but in some cases, none may work.
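The fallback behavior described above, in which heuristics are tried in turn and each answer is checked, might be sketched as follows. The two date heuristics, the check, and all function names are hypothetical and chosen only for illustration; they are not taken from the PIE programs.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)


def date_by_month_name(text):
    """Heuristic 1 (assumed): a date written as 'Month day, year'."""
    match = re.search(r"\b(?:" + MONTHS + r")\s+\d{1,2},\s+\d{4}\b", text)
    return match.group(0) if match else None


def date_by_slashes(text):
    """Heuristic 2 (assumed): a date written with slashes, e.g. 2/6/89."""
    match = re.search(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", text)
    return match.group(0) if match else None


def looks_like_date(candidate):
    """A deliberately weak check: the candidate must contain a digit."""
    return any(ch.isdigit() for ch in candidate)


def first_answer(text, heuristics, check):
    """Apply each heuristic in order; return the first answer that passes the check."""
    for heuristic in heuristics:
        candidate = heuristic(text)
        if candidate is not None and check(candidate):
            return candidate
    return None  # in some cases, no heuristic works


print(first_answer("Dated 2/6/89", [date_by_month_name, date_by_slashes], looks_like_date))
```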

The grammar of the system created in this project consists of a lexicon, a syntax, a meaning representation structure, and a semantic mapping. The lexicon consists of the list of words in the language and one or more grammatical categories for each word. The syntax specifies the structure of sentences in the language in terms of the grammatical categories. Morphological procedures recognize the regularities in the structure of words and thereby reduce the size of the lexicon. A discourse structure, or extrasentential syntax, is also included.
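As a minimal sketch of how morphological procedures can reduce the size of the lexicon, the following Python fragment stores only base forms and recognizes a few regular endings. The three lexicon entries, the suffix rules, and the function name categories are assumptions made for this illustration and do not reproduce the lexicon of the PIE system.

```python
# A toy lexicon: each word maps to one or more grammatical categories.
LEXICON = {
    "letter": {"noun"},
    "send": {"verb"},
    "report": {"noun", "verb"},
}

# Hypothetical suffix rules: (suffix, category the stem must already have).
SUFFIX_RULES = [
    ("ing", "verb"),
    ("ed", "verb"),
    ("s", "noun"),
    ("s", "verb"),
]


def categories(word):
    """Return the grammatical categories of a word, using morphology as a fallback."""
    word = word.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    found = set()
    for suffix, category in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The inflected form is accepted only if its stem is in the lexicon.
            if category in LEXICON.get(stem, set()):
                found.add(category)
    return found


print(categories("reports"))
```

With rules of this kind, "letters," "sending," and "reported" need no entries of their own; only irregular forms would have to be listed explicitly.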

To understand the meaning of a sentence in business correspondence text, the invention is capable of: parsing the syntactic structure; interpreting each sentence for its discourse purpose; disambiguating the referential terms; and mapping the words of each sentence to a representation used by the programs.

Therefore, the automatic process of extracting parametric information from business correspondence may be split into four major tasks: syntactic analysis of text; structural analysis of text; semantic analysis of text; and semantic mapping procedure.

The establishment of a grammar is one of the fundamental tasks which has to be accomplished before text that exhibits substantial variation, such as natural language text, can be manipulated. The grammar is the basis of the computer programs generated to analyze, or parse, text.

In order to be able to utilize the syntactic structure of a language to determine the structure of individual sentences in a computational system, it is first necessary to formalize the grammar and rid it of any ambiguities, and second, to develop a parser. Therefore, the syntactic analysis task of this project has been concerned with the use of a grammar that adequately describes the business correspondence documents for parsing purposes and parsing algorithms that extract parametric information from business correspondence, implemented in programs.

To analyze a sentence of a natural language, a computer program recognizes the phrases within the sentence, builds data structures for each of them and combines those structures into one that corresponds to the entire sentence. The algorithm which recognizes the phrases and invokes the structure-building procedures is the parsing algorithm implemented in the program.

Along another dimension, language understanding is embedded in a form of discourse. Understanding language involves interpreting the language in terms of the discourse in which it is embedded. Therefore, the semantic analysis of any "understanding" system has to include knowledge for understanding situations, objects and events, and also knowledge about the conventions of the form of discourse.

The role of semantics in language analysis is to relate symbols to concepts. The semantic mapping provides, for each syntactically correct sentence, a meaning representation in the meaning representation language, and it is the crux of the whole system. If the semantic mapping is fundamentally straightforward, then the syntactic processing can often be reduced. This is one of the virtues of SIE systems: because of the specialized subject matter, the syntactic processing can often be simplified through the use of either "ad hoc" or algorithmic procedures derived from text analysis.

Semantic analysis can be considered to consist of the recognition of references to particular objects or events and the integration of familiar concepts into unusual ones. When language understanding goes beyond the boundaries of single sentences, various linguistic structures are recognized. According to current theories, if a familiar event, such as a document parameter, is described, understanding the parameter description involves recognizing the similarities and differences between the current description and a description of a stereotype of a document parameter.

The complications of automatically extracting information from specialized natural language text require sophisticated techniques, within a methodology that combines linguistic theory and "ad hoc" heuristics (based upon the specialized nature of the material) to provide more satisfactory results than either the application of available linguistic knowledge or "ad hoc" heuristics alone could provide.

One of the problems that has to be confronted in the design of a language understanding system is how to design the system components and their interaction. Thus, identification of the frames that are to be implemented is a very important consideration. For the extraction of parametric information our first impulse might be to define a frame containing the expectations mentioned above: date, name of sender, name of recipient, address, etc. However, consideration of how the parameters found in the text will be used to fill the slots of the frame makes it necessary to take into account the discourse structure of business correspondence text and the semantic content of the information presented. The structure that we call the "PIE model" integrates the discourse structure and provides a logical foundation for the design of two procedures: the Discourse PIE Module and the PIE Frame.

Each English sentence in office correspondence text presented to the PIE system is interpreted via a parser, a discourse analysis procedure, a frame interpreter, and a mapping program that converts the textual information into standard formats. FIG. 1 illustrates a data flow for the PIE system.

The following paragraphs explain the linguistic techniques and terminology which have been used in this work.

Discourse Analysis

The basic purpose of analyzing the discourse structure is to obtain and make use of locative clues that improve the extraction of information. Stated in another way, knowledge of the context in which specific words occur narrows the scope of their meaning sufficiently to eliminate ambiguities. Discourse analysis, thus, refines specialized information extraction tasks by identifying the heading, body, and ending of each document.

Discourse is any connected piece of text of more than one sentence or more than one independent sentence fragment. In order to interpret discourse it is necessary to: disambiguate the referential terms for their intersentential and extrasentential links; and determine the purpose of each sentence in the discourse.

The purpose of the discourse analysis in the PIE system is to fill the slots of the frame correctly with values and required information. While the PIE system is designed to understand the English form of business correspondence, the design depends on the method of interpreting the discourse structure of the business correspondence data.

One of the interesting aspects of computational linguistics is that the specific tasks that need to be accomplished to understand text are intertwined so that it is impossible to design a system in a purely hierarchical manner. In the task of extracting parametric information from office correspondence, for example, we can operate most effectively when we have identified the three components of the model in a document: heading, body, and ending. However, the identification and classification of the sentences of the text into these three categories requires algorithmic procedures that have a detailed knowledge of the characteristics of each of the three components.

An example of the business correspondence discourse model is given in FIG. 2. Because the purpose of the PIE system is to extract parametric information from the heading and/or ending portions of a document, the clear identification of the heading and ending becomes very important to eliminate ambiguities. The discourse model of the PIE system will be discussed later.
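To make the three-part model concrete, a very rough segmentation of a letter into heading, body, and ending might look like the Python sketch below. The cue words, the decision to place the salutation in the heading and the complimentary close in the ending, and the name segment are all assumptions made for illustration; the discourse procedure of the PIE system is described later and is not reproduced here.

```python
# Hypothetical cue phrases; a real system would need a much richer set.
SALUTATION_CUES = ("dear ",)
CLOSING_CUES = ("sincerely", "yours truly", "regards")


def segment(lines):
    """Split the lines of a letter into heading, body, and ending sections."""
    parts = {"heading": [], "body": [], "ending": []}
    section = "heading"
    for line in lines:
        lowered = line.strip().lower()
        # A complimentary close inside the body starts the ending.
        if section == "body" and lowered.startswith(CLOSING_CUES):
            section = "ending"
        parts[section].append(line)
        # The salutation is kept in the heading; the body begins on the next line.
        if section == "heading" and lowered.startswith(SALUTATION_CUES):
            section = "body"
    return parts


letter = [
    "February 6, 1989",
    "To: Jane Doe",
    "Dear Ms. Doe:",
    "The report is attached.",
    "Sincerely,",
    "John Roe",
]
print(segment(letter)["ending"])
```

Once the sections are known, a word such as "to" or a capitalized name can be interpreted differently depending on whether it occurs in the heading, the body, or the ending.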

Frame Procedure

Frame procedures provide a set of expectations that have to be fulfilled in particular situations. For our analysis of business correspondence data, the expectations embodied in the frame procedures are that there will be a discourse structure with a heading, body, and ending. Within each of these sections there are additional lower-order expectations. These expectations may be the following: date of a letter, name of a sender, name of a recipient, title of a sender, address of a sender, and other parameters. These are expectations which may not always be realized, because not every business document contains all these parameters.
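The notion of expectations that may or may not be realized can be illustrated with a small Python data structure in which every slot starts out empty. The class name CorrespondenceFrame, the slot names, and the rule that the first value found for a slot is kept are assumptions of this sketch, not details of the PIE frame.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CorrespondenceFrame:
    """Each slot is an expectation; None means the expectation was not realized."""
    date: Optional[str] = None
    sender_name: Optional[str] = None
    sender_title: Optional[str] = None
    sender_address: Optional[str] = None
    recipient_name: Optional[str] = None

    def fill(self, slot, value):
        """Attach a value to a slot, keeping the first value found (assumed policy)."""
        if slot not in {f.name for f in fields(self)}:
            raise KeyError(slot)
        if getattr(self, slot) is None:
            setattr(self, slot, value)

    def unfilled(self):
        """List the slots whose expectations have not been realized."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


frame = CorrespondenceFrame()
frame.fill("date", "February 6, 1989")
frame.fill("recipient_name", "Jane Doe")
print(frame.unfilled())
```

A document that lacks, say, a sender title simply leaves that slot empty; the frame remains usable with whatever parameters were found.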

A frame is defined as a chunk of knowledge consisting of slots and their content. It is exactly these slots that serve the purpose of association links to other concepts. The PIE frame has a fixed number of categories and a variable number o