WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Sentence analyzer    
United States Patent4864502   
Link to this pagehttp://www.wikipatents.com/4864502.html
Inventor(s)Kucera; Henry (Providence, RI); Carus; Alwin B. (Newton, MA)
AbstractAn apparatus for the grammatical anlysis of digitally encoded text material receives encoded text, annotates each word of the text with a tag, and processes the annotated text to identify basic syntactic units such as noun phrases and verb groups. A clausal analyzer then operates on the identified nominal and predicate structures to identify clause boundaries and clause types. During processing, feature agreement between parts of successively larger entities--noun phrases, predicates, and clauses--are successively derived. When an error is detected, an error maessage identifies the error and displays a suggested correction.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 4864502
Sentence analyzer - US Patent 4864502 Drawing
Sentence analyzer
Inventor     Kucera; Henry (Providence, RI); Carus; Alwin B. (Newton, MA)
Owner/Assignee     Houghton Mifflin Company (Boston, MA)
Patent assignment
All assignments
Publication Date     September 5, 1989
Application Number     07/106,535
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     October 7, 1987
US Classification     704/8
Int'l Classification     G06F 015/38
Examiner     Fleming; Michael R.
Assistant Examiner    
Attorney/Law Firm     Lahive & Cockfield
Address
Parent Case    
Priority Data    
USPTO Field of Search     364/419 364/900 364/300
Patent Tags     sentence analyzer
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
4773009
Kucera
715/531
Sep,1988

[0 after 0 votes]
4771401
Kaufman
715/533
Sep,1988

[0 after 0 votes]
4750122
Kaji
704/1
Jun,1988

[0 after 0 votes]
4712174
Minkler, II
704/1
Dec,1987

[0 after 0 votes]
4635199
Muraki
704/2
Jan,1987

[0 after 0 votes]
4633435
Morimoto
704/5
Dec,1986

[0 after 0 votes]
4584667
Hashimoto
704/7
Apr,1986

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. Apparatus for the analysis of digitally encoded natural language, comprising

means for receiving encoded data representative of a body of natural language words such encoded data including for each word at least a provisionally-assigned syntactic tag of the word,

noun group identifying means operative on the encoded data for identifying noun groups and producing noun data structures representative thereof which augment the encoded data,

verb group identifying means operative on said encoded data augmented by the noun data structures for identifying verb groups and for producing predicate data structures representative thereof which further augment the encoded data, and

clausal analysis means operative on said encoded data augmented by said noun data structures and said predicate data structures for identifying well formed clauses of said body of words.

2. Apparatus according to claim 1, further comprising concordance means for determining agreement among syntactic parts of a clause identified by the clausal analysis means.

3. Apparatus according to claim 1, further comprising

means for identifying a syntax error, and

means for displaying an error message indicative of the identified error.

4. Apparatus according to claim 3 wherein the noun group identifying means identifies the words of a noun phrase from the ordering of syntactic tags of words.

5. Apparatus according to claim 4, wherein the noun group means comprises means for identifying noun phrase boundaries and a feature agreement property.

6. Apparatus according to claim 5, wherein the verb group identifying means comprises a finite state automaton operative on words not lying within the boundaries of a simplex noun phrase.

7. Apparatus according to claim 6, wherein the clausal analysis means comprises means for identifying clause type.

8. Apparatus according to claim 7, wherein the means for identifying a syntax error further comprises concordance means for determining agreement among parts of a clause identified by the clausal analysis means.

9. Apparatus according to claim 8, wherein the means for identifying a syntax error further comprises means for displaying an error correction message keyed to a detected error type.

10. Apparatus for the grammatical analysis of digitally encoded natural language text, such apparatus comprising

code annotation means for annotating each word of a text with data codes, such code annotation means including a stored dictionary of words together with associated data codes, wherein the data codes include tag codes representing possible grammatical or syntactic used of a word, and feature codes representing agreement properties of a word and such code annotation means also including means for looking up words of the text in the dictionary to identify said tag and feature codes for annotating the words, and

noun phrase means operative on the identified tag codes of successive words for identifying noun phrases from the ordering of the tag codes.

11. Apparatus according to claim 10, further comprising

means for processing the feature codes of the annotated words of a noun phrase identified by the noun phrase means to determine feature agreement consistency.

12. Apparatus according to claim 11, further comprising

complex phrase means for identifying complex noun phrases from the noun phrases identified by the noun phrase means.

13. Apparatus according to claim 12, wherein the complex phrase means include means for processing feature codes of constituent noun phrases to determine feature codes for a complex noun phrase.

14. Apparatus according to claim 13, further comprising

predicate means for identifying verb groups of the text.

15. Apparatus according to claim 14, further comprising

means for identifying feature agreement inconsistencies between verb groups and noun phrases.

16. Apparatus according to claim 15, further comprising

means for determining an error message associated with a feature agreement inconsistency and for displaying the error message.

17. Apparatus according to claim 16, further comprising

means for determining a correct replacement word for a word of text having feature agreement inconsistency,

said means for determining and displaying an error message including means for receiving the correct replacement word so as to display the correct replacement word together with the error message associated with the feature agreement inconsistency.

18. Apparatus according to claim 15, wherein the means for processing the identified predicates and noun phrases comprises means for checking subject-verb concordance.

19. Apparatus according to claim 15 wherein the means for processing the identified predicates and noun phrases comprises means for checking subject-verb-object concordance.

20. Apparatus according to claim 10, further comprising predicate means, for identifying verb groups in the annotated text, and

clausal analysis means for operating on identified verb groups to determine the structure of clauses in the annotated text.

21. Apparatus according to claim 20, wherein said clausal analysis means includes means for identifying types of clauses.

22. Apparatus according to claim 21, further comprising means for identifying clause-based errors of syntax.

23. A method of analyzing a digitally encoded body of natural language words for grammatical well-formedness, such method comprising, in order, the steps of

(I) annotating the words with candidate syntactic tags to produce a data structure,

(II) inspecting the ordering of the candidate tags to identify noun groups and producing noun data representative thereof thereby augmenting the data structure,

(III) processing the annotated words and the noun data to identify verb groups and producing predicate data representative thereof further augmenting the data structure, and

(IV) processing the further augmented data structure to identify well-formed clauses of said body of words.

24. A method according to claim 223, further comprising the step of displaying an error message when one of said steps of inspecting or processing detects probable errors.

25. A method according to claim 24, wherein the step of displaying an error message includes the step of displaying a message identifying a probable error together with a correction therefor.

26. An automated parser for processing digitally encoded natural language text, such parser comprising

means for annotating words of text with possible grammatical tags,

noun means operative on the annotated words of text for identifying noun phrases and for further annotating the text with noun phrase data,

predicate means operative on said annotated words of text and noun phrase data for identifying verb groups and for further annotating the text with verb group data, and

error detection means included in said means for annotating, said noun means and said predicate means, for identifying text errors during operation of the aforesaid three means and displaying an indication thereof,

whereby errors in text are identified for correction during processing stages of ascending structural complexity thereby permitting in-process correction of errors and parsing of unedited text without breakdown of the processing operations or requiring excessive processing time.

27. A method for grammatical analysis of digitally encoded natural language text, such method comprising the steps of

providing an automated means for annotating each word of a text with data codes, such means including a stored dictionary of words together with associated data codes, wherein the data codes include tag codes representing possible grammatical or syntactic uses of a word, and feature codes representing agreement properties of a word and such automated means being operative for looking up words of the text in the dictionary to identify said tag and feature codes thereby annotating the words, and

providing an automated noun phrase identifier which successively inspects the order of the identified tag codes of successive words to identify noun phrases of the text.

28. A method for an automated parsing of digitally encoded natural language text, such method comprising the steps of

annotating words of text with possible grammatical tags,

identifying noun phrases and further annotating the text with noun phrase data,

identifying verb groups and further annotating the text with verb group data,

identifying text errors during the aforesaid three steps and displaying an error indication thereof, and

correcting the displayed errors in text during processing stages of ascending structural complexity thereby permitting automated parsing of unedited text without breakdown of the processing operations or requiring excessive processing time.
 Description Submit all comments and votes
 


The present invention relates to automated language analysis systems, and relates to such systems embodied in a computer for receiving digitally encoded text composed in a natural language, and to systems for the grammatical analysis of encoded text.

In recent years a number of systems have been developed for the automated recognition of syntactic information. A survey of some systems appears in the textbook of Winograd, Language as a Cognitive Process - Syntax (ISBN 0-201-08571-2 v. 1) at pages 357-361 and pages 390-403. As a rule, although a number of theoretical linguistic formalisms have been developed to identify correct grammatical constructions, the practical construction of grammatical analyzing devices has proven difficult. Because the number of combinations of possible parts of speech for a string of words escalates exponentially with string length, syntax-recognizing systems have in general been limited to operating on text having a small, well-defined vocabulary, or to operating on more general text but dealing with a limited range of syntactic features. Extensions of either vocabulary or syntactic range require increasingly complex structures and an increasing number of special recognition rules, which would make a system large or too unwieldy for commercial implementation on commonly available computing systems. Moreover, the automated grammatical systems which have been designed are special processors, in that they are not adapted to conventional word processing or computer-aided publishing functions. For example, such systems may require that their input text be at least sufficiently pre-edited so that it is both correctly spelled and grammatically well formed. Text having a misspelling, a wrong word such as a homonym, a compound word, or even a simple syntax error may render an input sentence unanalyzable.

OBJECTS AND SUMMARY OF THE INVENTION

It is an object of the present invention to provide an improved device for the grammatical analysis of digitally encoded natural language text.

It is another object of the invention to provide a digital text analyzer which associates a tag with each word of a digitally encoded text and processes the tags to identify higher grammatical structure.

It is another object of the invention to provide a text analyzer operative on tagged text which successively identifies noun groups, then predicates and then clausal structures.

These and other features of the invention are obtained in an apparatus for the grammatical analysis of digitally encoded text material which receives encoded text, annotates each word of the text with a tag, and processes the annotated text to identify basic syntactic units such as noun phrases and verb groups. A clausal analyzer then operates on the identified nominal and predicate structures to identify clause boundaries and clause types. When an error is detected, an error message identifies the error and displays a suggested correction.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system according to the present invention;

FIG. 2 is a listing of system tags in an illustrative embodiment;

FIG. 3 is a simplified listing of classes of tags;

FIG. 4 shows the processing of a general grammatical analyzer operative on tagged text;

FIGS. 5 and 6 are flow charts of the noun phrase processing of tagged text sentences;

FIG. 7 shows verbal tag processing indices;

FIGS. 8-11 show predicational templates;

FIG. 12 is a descriptive structure for a predicate-analyzing FSA;

FIG. 12A is a state transition table for one such FSA;

FIG. 13 is a flow chart of the simplex and complex predication processor;

FIGS. 14-15 are flow charts of clausal analysis using the previously determined syntactic structures; and

FIG. 16 is a flow chart of feature concordance processing.

DETAILED DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a block diagram of a grammatical analyzer according to the present invention having a CPU/controller 2 which may, for example, be a general purpose computer such as a micro- or mini- computer having storage of at least several megabytes and sufficient (e.g. 640 kilobytes) random access memory. The computer receives input text 4, e.g., from keyboard entry, a communications link, or a data storage device, and, if necessary, runs a sentence splitter 6 which partitions the text into sentences for grammatical analysis. Alternatively, the system may receive as input discrete sentences of encoded text with sentence boundary markers inserted. Sentence splitting per se is known in the art, and is used, for example, in computerized readability analysis systems for deriving word-per-sentence and similar statistical information. A suitable sentence splitter is disclosed in U.S. Pat. No. 4,773,009 of Henry Kucera, Rachael Sokolowski and Jacqueline Russom filed June 6, 1986 as Ser. No. 872,094 entitled Method and Apparatus for Text Analysis.

The controller 2 then passes each sentence to a grammatical analyzer 10, which processes the sentences to determine higher grammatical structure and to detect errors. Analyzer 10 includes an annotator 10a which annotates each word of the sentence, e.g., by reference to a stored word dictionary 8, so as to produce an annotated sentence structure. Analyzer 10 also includes a processing module 10b which operates on the annotated structure to define noun and verb groups, and to define clausal and sentence structure. The annotated sentence, or partial parses thereof and error messages or "prompts", are displayed on display 9.

The dictionary includes a record for each word which contains a list of "tags", each tag encoding a syntactic or inflectional property of the word, and which also contains a list of special features used in the grammatical processing. The processor annotates the sentence words with this information. It further preprocesses the annotated sentences to identify a single most likely tag for each word. Such tags may be identified by employing a number of common template-matching rules to quickly scan sets of tags and identify tags which fit common syntactic patterns. Preferably, however, the single tag for each word is identified using probabilistic data defined on the sets of possible tags, as described in the research on collocational tagging published by Ian Marshall, or as set forth in the patent application Ser. No. 106,127 of Henry Kucera, Alwin Carus and Jeffrey Hopkins for Collocational Grammar System filed on the same date as this patent application.

A prototype processor was constructed with the tag annotations provided by a main dictionary with 28,223 80-byte records, each having the complete grammatical information for a given "word". Each dictionary "word" was either a base form from which various inflections are recovered through a coding and analysis technique, or an irregularly inflected form. The records included, an addition to the base form, a list of codes for "tags" indicative of all the grammatical or syntactic uses that the linguistic words derivable from the base form may assume, and further, for noun and verb words, a list of "inflection codes" which encode inflectional paradigms for generating the noun or verb inflections from the base form.

FIG. 2 is a listing of the tags used in the prototype embodiment, each of which is represented in the drawing by a one to three character mnemonic and also by a one to two digit tag code. There are ninety-three such tags, although any given text word will generally have between one and six possible tags. Each tag indicates a possible syntactic use of the word, or an inflection. The dictionary records preferably also include, for nouns and verbs, certain information encoding word features such as its number agreement behavior.

FIG. 3 shows a simpler list of tag classes, which for many of the processing steps discussed below provide all necessary information. There are thirty-two such classes. The Figure gives the class code in column one, a class description in column two, and the number of tags in the class in column three. Column four contains an "exclusive group" or "EG" number, used in connection with verbal processing as described further below.

As noted above, the dictionary is used in connection with appropriate "unflection" processing modules which analyze a text word to determine its base form and look up the base form in the dictionary to retrieve its listed tags. Related "inflection" processing, conversely, generates a given inflection from its base form and from a data code indicative of the inflectional paradigm.

One suitable unflection/inflection processor is described in detail in U.S. Pat. No. 4,724,523 and entitled "Method and Apparatus for the Electronic Storage and Retrieval of Expressions and Linguistic Information" of inventor Henry Kucera. The disclosure of that patent application is hereby incorporated by reference. For purposes of the present invention however, what is important is that text material be annotated to identify for each word of a sentence tags indicating its possible uses. The data for each word is then contained in a data structure denoted a sentence node or SEN-NODE below. In the discussion below it will further be assumed that each SEN-NODE includes an indication of a single tag, denoted "MPP", which has initially been selected as the most likely tag for the word represented by that node.

Parsing of Annotated Sentence

FIG. 4 shows the construction of an exemplary grammatical text analyzer according to the invention, in which, by way of illustration, a disambiguation processor 20 provides a data output including a SEN NODE data structure 22 for each word, with its most probable tag, denoted MPP, and other tag and feature annotations. A grammatical analyzer 30 then operates under control of control module 24 on the annotated word data to successively build up larger syntactic structures and derive a parse of a text sentence.

In this construction, the set of tagged sentence nodes is parsed in three general phases: (a) the identification of the simplex noun phrases (NPs) in the sentence, and if there is more than one simplex NP, their combination, where possible, into complex NPs; (b) the identification of the simplex verb groups (VGs) in the sentence and, if there is more than one simplex VG, their combination, where possible, into complex VGs; and (c) the identification of the simplex sentence(s) (or clauses) in the (matrix) sentence and, if there is more than one simplex sentence their combination (where possible) into complex sentences.

The NP processing 26 of the first phase is accomplished in a double-scan of the sentence. The parser first ascertains NP boundaries primarily by inspecting tagged words and applying ordering criteria to their "rank". This rank, which characterizes a word's functional role in noun phrase construction, is determined by inspection of the word's tag in conjunction with other data, which may be stored in the word's sentence node. It then operates on the simplex NP structures to detect complex phrases which include prepositional phrases, a coordinating conjunction, certain coordinating punctuation and appositions. When such a complex phrase is identified, the processor creates a complex NP record which includes pointers to the component NPs and the boundaries of the NP, and derives the feature agreement properties, e.g., number, of the complex NP.

Once the NP-structure of the sentence has been determined, a predication analyzer module 28 inspects the portions of the sentence that are not incorporated into nominalizations, and assigns predicational structure to these portions where appropriate.

After operation of module 28, the predicational structure of the sentence has been provisionally determined. Some sentential structure is also determined incident to the predicational analysis, as tentative assignments of subjects and their corresponding finite predications will have been made.

At this point the controller 24 analyzes the higher syntactic structure of the sentence by a clausal analysis module 32 that inspects the tentative sentence-level structures generated by module 28 and either confirms them or replaces them.

The noun group and verb group modules each insert boundary markers and provide other data to appropriate registers 34 which maintain the boundary data for phrase and verb groups, and also maintain the derived feature information. This allows concordance rule checking of different syntactic units and permits the clausal analysis module to match related sub-units of clauses. An error message module 36 displays error messages when errors of syntax are detected.

When agreement errors are detected, the correct noun and verb inflections are generated from the base form, by an "inflection" procedure, and are displayed with a corresponding error message as suggested corrections.

Determination of Noun Phrases

The initial noun phrase processing of the parsing module will be understood by reference to the flow charts, FIGS. 5 and 6.

First, for each pre-nominal tag, a "rank" is associated corresponding to its functional position in noun phrase construction. Thus, pre-qualifiers and pre-quantifiers have rank 0; determiners, articles and possessives have rank 1; post-determiners (tag AP) have rank 2; cardinal and ordinal numbers rank 3; comparative and superlative adjectives, adverbs, qualifiers, and semantic superlatives such as "top" have rank 4; adjectives have rank 5; and post-qualifiers rank 6. With such information available from the tag of each word, noun phrase determination is then accomplished as follows.

As shown in FIG. 5. upon entry at step 105, the noun phrase processing module, which in the prototype system is part of a module denoted GcsPrsr initializes a number of processing variables for the simplex-NP scan. The variables and flag bits are evaluated or ascertained during processing and their values loaded into a data structure NGROUPS which serves as a record of noun phrase data generated by the processing. These variables are: (a) NP-COUNT (the location in NGROUPS of the NP currently under construction--if any--or the next available slot where one can be constructed), which is set to 1; (b) NP-FLAG (an NP scope flag), which is set to `0`B to indicate no NP is presently under construction; (c) SN-PTR (a basing pointer for the SEN-NODE structure), which is set to point at the SEN-NODE that corresponds to the first word in the sentence; (d) N-CPNTS (a complex-NP endpoints array in NGROUPS), which is set to zero, and (e) N-DOM-LK (an NP-domination link array in NGROUPS), which is also set to zero).

The structure NGROUPS is an external data array of 16-byte structures for containing information concerning simplex and complex noun phrases; elements accessed by the GcsPrsr are: (a) N-SPNTS (which contains the starting and ending points for each noun phrase in NGROUPS); (b) N-CPNTS (which contains the starting and ending points for each complex noun phrase in NGROUPS); (c) N-PREPS (which contains the location of the preposition (if any) that dominates a given noun phrase); (d) N-DOM-LK (which contains the identity of the complex-NP node (if any) that dominates a given NP); (e) N-FEATS (which contains the identity of the nominal features associated with the head noun (if any) in a given simplex NP); and (f) N-WEIGHT (which contains the "weight" of the given NP, calculated as described below).

After initialization, denoted 100, the NP processor enters a simplex-NP scan loop, which iterates through the chain of SEN-NODE structures containing the syntactic representation of the current sentence, including the most likely tag of each word, and groups as many nominal and pre-nominal elements as possible into simplex noun phrases satisfying certain ordering and recognition criteria of NP well-formedness. This process calls to three sub-procedures, designated (i) NP-START, (ii) GcsNPEB, and (iii) NP-END-AFTER. Each of these inserts an NP boundary and updates the NGROUPS structure as described below.

For each iteration of the simplex-NP scan loop, the following variables are set according to the contents of the SEN-NODE structure currently under consideration: (a) CUR-PARSE is set equal to the index in the portion of the SEN-NODE array that contains the MPP tag of the current word; (b) TAG-TYPE is set equal to the code that describes the general tag class of this tag; (c) TAG is set equal to the character representation of this tag for purpose of display, if required; (d) CUR-FLAGS is set equal to a bit string that contains the features corresponding to the MPP tag; and (e) WORD is set equal to the current word. After the above variables have been set, the value of a lexical-NP recognition flag (LEX-NP-FLAG) is initialized to `0`B, and the value of NP-FLAG is then inspected. If NP-FLAG is set to `1`B (indicating that an NP is already under construction), then the following actions are taken in order to determine if the current element will close this NP (either with or without starting a new NP).

A check 106 determines if (1) the first letter of the current word's tag type stored in TAG-TYPE is equal to "V" (indicating that this word is a verb); or (2) the tag type is "UR" (indicating an "unrestricted" tag) and the NP rank of the previous word was 7 (a rank assigned by GcsPrsr when it identifies a word as a head noun or a pronoun). If so the parsing processor at 107 calls the external procedure (GcsNPEB) in order to insert a noun phrase end boundary marker, closing the current NP with its last word the word before the current word; and resets the value of NP-FLAG bit to `0`B.

Otherwise at 108 if the tag type is "UR" (and therefore the NP rank of the previous word is not 7), a check 109 determines if (1) the current element of the sentence is a comma, or (2) the MPP tag of this element is "RB" or "CC". If either condition holds, then: the tag type of the element following the current one is inspected at 110, and if this word is not pre-nominal, GcsNPEB is called in order to close the current NP with its last word being the word before the current word; and the value of NP-FLAG is reset to `0`B. Otherwise, that is, if the next word is a pre-nominal, the current word is included in the NP currently under construction.

On the other hand, if the tag type is "UR" but neither of conditions (1) or (2) given above holds, then GcsNPEB is called directly to close the current NP with its last word being the word before the current word, and the value of NP-FLAG is reset to `0`B.

If the current word's tag is neither verbal nor unrestricted, then this word may either end or be added to the current NP (and in the former case will start a new NP if the current word is a pronoun), so the tag type of this word is next inspected at stage 111 and the following actions are taken.

If the tag type is "PN" (indicating that the word is functioning as a pre-nominal), then at step 111 the NP rank of the current word is obtained and is compared at 112 with the NP rank of the previous word. If the current NP rank is less than the NP rank of the previous word, then, except in special cases, the boundary marker inserting routine GcsNPEB is called at step 113 in order to close the current NP with its last word being the word before the current node. Also the procedure NP-START is called at step 114 in order to start a new NP with the current word as its first element. NP-START modifies the simplex-NP endpoints array N-SPNTS in the processing structure NGROUPS to set the opening boundary of the current NP equal to the location of the current word in the sentence, and modifies a prepositional phrase marker N-PREPS in NGROUPS by inspecting the word previous to the start of the NP, and, if it is a preposition, setting the phrase marker to its address, thus marking the current simplex NP as a potential prepositional phrase. NP-START also sets an "open-NP" boundary flag on the current SEN NODE structure.

It should be noted that this processing may be refined in various ways. For example, detection of an NP rank decrease at step 112 may legally occur within a noun phrase in constructions such as "A good, highly regarded, book" in which an adverb (rank 4) follows one of a series of adjectives (rank 5). However, in normal circumstances during NP construction, the occurrence of a pre-nominal word of lesser NP rank than the preceding word is generally recognized as the start of a new noun phrase.

Otherwise, if at step 115 an inspection of the NP-boundary flags associated with the current word indicates that it is being interpreted as a lexical NP, then the following actions are taken: (i) GcsNPEB is called in order to close the current NP with its last word the word before the current word; (ii) NP-START is called in order to start the new NP with the current word as its first element; (iii) NP-END-AFTER is called at 116 in order to end the new NP with the current word as its last element; (iv) NP-FLAG is reset to `0`B in order to indicate that the current NP cannot be added to on the right; and (v) LEX-NP-FLAG is set to `1`B in order to indicate that the current NP is a lexical NP.

If the current word is not a lexical NP, then it is necessarily either a head noun or a noun adjunct, so the value of NP-RANK is set equal to 7.

At this point the values of both NP-FLAG and LEX-NP-FLAG are checked and if they are both equal to `0`B, indicating that the current word has not been processed yet by either adding it to an existing NP or by interpreting it as a lexical NP, then the following actions are taken.

If an inspection 117 of the word's tag type indicates that it is a noun or pronoun, then if an inspection of 118 of the NP-boundary flags associated with the current word indicates that it is being interpreted as a lexical NP, then the following actions are taken at 119: (i) the value of NP-RANK is set equal to 7; (ii) NP-START is called in order to start the new NP with the current word as its first element; and (iii) NP-END-AFTER is called in order to end the new NP with the current word as its last element.

Otherwise the NP currently being opened may have more than one element in it. For example, even if it begins with a noun, this noun may turn out to be an adjunct to one or more further nouns in the NP. Thus, the following processing is done. At step 120 NP-FLAG is set equal to `1`B and at 121 a check determines if the tag type is "PN" (indicating that the word is functioning as a pre-nominal). If so, then at 122 the NP rank of the current word is obtained and stored in the various NP-RANK.

Otherwise the current word is necessarily a noun, so at 123 NP-RANK is set equal to 7, and NP-START is called in order to set up the necessary environment for processing the simplex NP that was started in either Step 122 or 123 above.

Above the above steps have been completed, the value of NP-FLAG is insected; if it is set to `1`B, then the next word to be processed may possibly be added onto an NP that is currently under construction, so PREV-NP-RANK is set equal to NP-RANK; otherwise it is set equal to zero. The value of SN-PTR is then set to access the SEN-NODE corresponding to the next element of the sentence in the next iteration of the loop, and if this SEN-NODE is active in the current sentence, then the above procedure is repeated.

During the noun phrase processing described in detail with reference to FIG. 5, the processing routines NPEB and NP-END-AFTER are invoked (e.g., as described at steps 124 and 113 of FIG. 5) to construct an appropriate noun phrase storage entry and to insert a "close-NP" boundary before (respectively, after) the current word. These processing routines operate as follows.

The routine NP-END-AFTER modifies the NGROUPS array of structures which is the aggregate containing information concerning simplex and complex noun phrases by (a) modifying the simplex-NP endpoints array N-SPNTS in order to set the extent of the current NP. These endpoints are both set equal to the address of the current word in the overall sentence processing structure SEN-TBL since the NP-END-AFTER procedure is only invoked for lexical NP'S. It also modifies (b) the NP-feature string array N-FEATS by setting the slot that is indexed by the curren value of NP-COUNT equal to CUR-FLAGS which contains a copy of the current noun feature string; and modifies (c) the NP-weight array N-WEIGHT by setting the slot that is indexed by the current value of NP-COUNT equal to the appropriate lexical-NP value. Specifically, it sets the "NP-WEIGHT" corresponding to the current lexical NP equal to: (i) +2 (a default value for singular lexical NPs), or (ii) -2 (a default value for plural lexical NPs); or (iii) 0 (a default value for NPs that are neutral with respect to number agreement). Finally, NP-END-AFTER closes the NP currently under construction by incrementing the current NP counter variable (NP-COUNT) in the external sentence-ordering structure by one, and then returns to its calling procedure.

The processing routine GcsNPEB, which is invoked when a noun phrase is determined to have ended before the current word, inserts a close-NP boundary before the current word, and also checks the internal structure of the NP by inspecting the features of its individual elements, and, if inconsistencies are detected, initiates corrective processing.

On entry, GcsNPEB sets the feature string for the simplex NP currently under construction equal to the feature string corresponding to the word preceding the current word in SEN-TBL, and then sets the simplex-NP closing boundary equal to the address of that word (in SEN-TBL). In addition to this processing, GcsNPEB of the prototype embodiment also checks the internal number agreement consistency of the NP it has created by inspecting its features, as follows.

First of all, a loop inspects each SEN-NODE within the scope of the given simplex NP (i.e., those SEN-NODE's between and including the nodes specified by the endpoints stored in NGROUPS.N-SPNTS). Each SEN-NODE in this range that does not correspond to a word whose MPP tag indicates that it is most probably functioning as a noun adjunct is inspected to see whether its MPP tag carries features that are not neutral with respect to number agreement. Further processing is done for each of these words that is not a noun adjunct and is also not neutral with respect to number agreement, as follows. If the word is plural, then a plural counter (PL-CTR) is incremented by one and the location of the word (in SEN-TBL) is stored in the element of the plural storage matrix (PL-TBL) that as indexed by PL-CTR; similarly, if the word is singular, then a singular counter (SG-CTR) is incremented by one and the location of the word (in SEN-TBL) is stored in the element of the singular storage matrix (SG-TBL) that is indexed by SG-CTR. The value of the NP weight is also checked and modified as appropriate at this point (i.e., if the word is singular, then if the current value of N-WEIGHT is not negative, then this value is incremented by one, and if the word is plural, then if the current value of N-WEIGHT is not positive, then this value is decremented by one). Once this loop exits, the values of SG-CTR and PL-CTR are checked; if both counters hold non-zero values, then there is a number-agreement inconsistency in the NP. In that case, it is processed further as follows.

First, the magnitudes of SG-CTR and PL-CTR are compared relative to one another; if only one word is marked as being singular and more than one is marked as being plural, then the singular one is flagged and a message suggesting its transformation into the appropriate plural form is generated. On the other hand, if more than one element in the NP is singular, but the number of singular elements is still less than the number of plural ones, then an internal procedure "MULT-ERRORS" is called to generate and link together the two rings of error messages associated with the elements of the NP that are inconsistent with respect to number agreement. Otherwise, if only one word is marked as being plural and more than one is marked as being singular, then the plural one is flagged and a message suggesting its transformation into the appropriate singular form is generated. On the other hand, if more than one element in the NP is plural (but the number of plural elements is still less than the number of singular ones), then "MULT-ERRORS" is called to generate and link together the two rings of error messages associated with the elements of the NP that are inconsistent with respect to number agreement.

If none of the above cases is true, then there are equal numbers of words with singular and plural tags in the NP. By far the most common sequences of this type will involve one plural and one singular tag. In this case "MULT-ERRORS" is also called to generate and link together the two rings of error messages associated with the elements of the NP that are inconsistent with respect to number agreement.

The foregoing error messages are generated by a procedure that copies the original form of the inconsistent work, accesses an error message from a table of such messages, calls a procedure to construct the appropriate singular or plural form of the inconsistent work (e.g., from its stored base form and inflection code), and links together several error messages if the inconsistency does not suggest a unique error. It then displays the original word, error message(s) and suggested correction(s).

Finally, GcsNPEB sets the value of the current NPs number-agreement feature (according to the value of the NPs weight), and then closes the current NP by incrementing the current NP counter variable (NP-COUNT) by one and also setting (to `1`B) the "close-NP" boundary on the SEN-NODE preceding the SEN-NODE associated with the current word. GcsNPEB then returns to its calling procedure.

In this manner the noun phrase processor identifies each noun group, determining its endpoints, feature agreement properties and component words. For each identified noun group a NP data structure receives this information, with the set of such NP data structures forming an ordered NGROUPS structure.

After the above-described simplex NP processing loop exits (having applied its algorithms to each SEN-NODE in turn, scanning from left to right across the sentence), another loop cycles through the contents of the NGROUPS structure, proceeding from the last entry to the first, thus in effect scanning the noun-phrase projection of the sentence from right to left. This loop constructs complex noun phrases, where possible, based on the simplex noun phrases previously identified, and also, where appropriate, on the complex NPs that it is in the process of creating. In an early prototype embodiment this is done by applying two rules: (I) a complex NP may be constructed over an NP followed by a prepositional phrase (i.e., an NP preceded by a preposition), and (II) a complex NP may be constructed over two NPs that are separated only by a coordinating conjunction. In both cases, the second element of the pair of NPs specified by the rule may be a complex NP, thus building up, in the case of complex NPs that contain complex-NP constituents, the appropriate structures by a process of left-adjunction. FIG. 6 shows the details of this operation of the noun groups processor.

On entry to this loop, which uses the data of the NGROUPS structure to access and process successive pairs of noun phrases of the sentence, starting at the rightmost simplex noun phrase, the processor starts by examining the last simplex NP in the sentence and the next-to-last one, then continues by examining the next-to-last simplex NP and the third-from-last one, on down to examining the second simplex NP and the first one. At 130 a determination is made if there are less than "n" elements (words or non-terminal punctuation) between the two NPs and, if so, then processing described below is performed to identify certain complex noun phrases satisfying this criterion. If there are more than "n" elements separating two successive NPs then the processor skips over these elements, resets the value of NP-FLAG to `0`B, and proceeds to process the next NP. For descriptive purposes below, n is taken as two and the construction of complex noun phrases whose noun phrase components are joined by one or two elements is described. First, two bit flags, NP-FLAG and CC-FLAG, are initialized to `0`B, and at the beginning of each iteration of the loop the number of elements between the two simplex NPs currently under examination is determined, and the value of a variable, XTAG, is set equal to the null-valu