|
Claims  |
|
|
What is claimed is:
1. Apparatus for the analysis of digitally encoded natural language,
comprising
means for receiving encoded data representative of a body of natural
language words such encoded data including for each word at least a
provisionally-assigned syntactic tag of the word,
noun group identifying means operative on the encoded data for identifying
noun groups and producing noun data structures representative thereof
which augment the encoded data,
verb group identifying means operative on said encoded data augmented by
the noun data structures for identifying verb groups and for producing
predicate data structures representative thereof which further augment the
encoded data, and
clausal analysis means operative on said encoded data augmented by said
noun data structures and said predicate data structures for identifying
well formed clauses of said body of words.
2. Apparatus according to claim 1, further comprising concordance means for
determining agreement among syntactic parts of a clause identified by the
clausal analysis means.
3. Apparatus according to claim 1, further comprising
means for identifying a syntax error, and
means for displaying an error message indicative of the identified error.
4. Apparatus according to claim 3 wherein the noun group identifying means
identifies the words of a noun phrase from the ordering of syntactic tags
of words.
5. Apparatus according to claim 4, wherein the noun group means comprises
means for identifying noun phrase boundaries and a feature agreement
property.
6. Apparatus according to claim 5, wherein the verb group identifying means
comprises a finite state automaton operative on words not lying within the
boundaries of a simplex noun phrase.
7. Apparatus according to claim 6, wherein the clausal analysis means
comprises means for identifying clause type.
8. Apparatus according to claim 7, wherein the means for identifying a
syntax error further comprises concordance means for determining agreement
among parts of a clause identified by the clausal analysis means.
9. Apparatus according to claim 8, wherein the means for identifying a
syntax error further comprises means for displaying an error correction
message keyed to a detected error type.
10. Apparatus for the grammatical analysis of digitally encoded natural
language text, such apparatus comprising
code annotation means for annotating each word of a text with data codes,
such code annotation means including a stored dictionary of words together
with associated data codes, wherein the data codes include tag codes
representing possible grammatical or syntactic used of a word, and feature
codes representing agreement properties of a word and such code annotation
means also including means for looking up words of the text in the
dictionary to identify said tag and feature codes for annotating the
words, and
noun phrase means operative on the identified tag codes of successive words
for identifying noun phrases from the ordering of the tag codes.
11. Apparatus according to claim 10, further comprising
means for processing the feature codes of the annotated words of a noun
phrase identified by the noun phrase means to determine feature agreement
consistency.
12. Apparatus according to claim 11, further comprising
complex phrase means for identifying complex noun phrases from the noun
phrases identified by the noun phrase means.
13. Apparatus according to claim 12, wherein the complex phrase means
include means for processing feature codes of constituent noun phrases to
determine feature codes for a complex noun phrase.
14. Apparatus according to claim 13, further comprising
predicate means for identifying verb groups of the text.
15. Apparatus according to claim 14, further comprising
means for identifying feature agreement inconsistencies between verb groups
and noun phrases.
16. Apparatus according to claim 15, further comprising
means for determining an error message associated with a feature agreement
inconsistency and for displaying the error message.
17. Apparatus according to claim 16, further comprising
means for determining a correct replacement word for a word of text having
feature agreement inconsistency,
said means for determining and displaying an error message including means
for receiving the correct replacement word so as to display the correct
replacement word together with the error message associated with the
feature agreement inconsistency.
18. Apparatus according to claim 15, wherein the means for processing the
identified predicates and noun phrases comprises means for checking
subject-verb concordance.
19. Apparatus according to claim 15 wherein the means for processing the
identified predicates and noun phrases comprises means for checking
subject-verb-object concordance.
20. Apparatus according to claim 10, further comprising predicate means,
for identifying verb groups in the annotated text, and
clausal analysis means for operating on identified verb groups to determine
the structure of clauses in the annotated text.
21. Apparatus according to claim 20, wherein said clausal analysis means
includes means for identifying types of clauses.
22. Apparatus according to claim 21, further comprising means for
identifying clause-based errors of syntax.
23. A method of analyzing a digitally encoded body of natural language
words for grammatical well-formedness, such method comprising, in order,
the steps of
(I) annotating the words with candidate syntactic tags to produce a data
structure,
(II) inspecting the ordering of the candidate tags to identify noun groups
and producing noun data representative thereof thereby augmenting the data
structure,
(III) processing the annotated words and the noun data to identify verb
groups and producing predicate data representative thereof further
augmenting the data structure, and
(IV) processing the further augmented data structure to identify
well-formed clauses of said body of words.
24. A method according to claim 223, further comprising the step of
displaying an error message when one of said steps of inspecting or
processing detects probable errors.
25. A method according to claim 24, wherein the step of displaying an error
message includes the step of displaying a message identifying a probable
error together with a correction therefor.
26. An automated parser for processing digitally encoded natural language
text, such parser comprising
means for annotating words of text with possible grammatical tags,
noun means operative on the annotated words of text for identifying noun
phrases and for further annotating the text with noun phrase data,
predicate means operative on said annotated words of text and noun phrase
data for identifying verb groups and for further annotating the text with
verb group data, and
error detection means included in said means for annotating, said noun
means and said predicate means, for identifying text errors during
operation of the aforesaid three means and displaying an indication
thereof,
whereby errors in text are identified for correction during processing
stages of ascending structural complexity thereby permitting in-process
correction of errors and parsing of unedited text without breakdown of the
processing operations or requiring excessive processing time.
27. A method for grammatical analysis of digitally encoded natural language
text, such method comprising the steps of
providing an automated means for annotating each word of a text with data
codes, such means including a stored dictionary of words together with
associated data codes, wherein the data codes include tag codes
representing possible grammatical or syntactic uses of a word, and feature
codes representing agreement properties of a word and such automated means
being operative for looking up words of the text in the dictionary to
identify said tag and feature codes thereby annotating the words, and
providing an automated noun phrase identifier which successively inspects
the order of the identified tag codes of successive words to identify noun
phrases of the text.
28. A method for an automated parsing of digitally encoded natural language
text, such method comprising the steps of
annotating words of text with possible grammatical tags,
identifying noun phrases and further annotating the text with noun phrase
data,
identifying verb groups and further annotating the text with verb group
data,
identifying text errors during the aforesaid three steps and displaying an
error indication thereof, and
correcting the displayed errors in text during processing stages of
ascending structural complexity thereby permitting automated parsing of
unedited text without breakdown of the processing operations or requiring
excessive processing time. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
The present invention relates to automated language analysis systems, and
relates to such systems embodied in a computer for receiving digitally
encoded text composed in a natural language, and to systems for the
grammatical analysis of encoded text.
In recent years a number of systems have been developed for the automated
recognition of syntactic information. A survey of some systems appears in
the textbook of Winograd, Language as a Cognitive Process - Syntax (ISBN
0-201-08571-2 v. 1) at pages 357-361 and pages 390-403. As a rule,
although a number of theoretical linguistic formalisms have been developed
to identify correct grammatical constructions, the practical construction
of grammatical analyzing devices has proven difficult. Because the number
of combinations of possible parts of speech for a string of words
escalates exponentially with string length, syntax-recognizing systems
have in general been limited to operating on text having a small,
well-defined vocabulary, or to operating on more general text but dealing
with a limited range of syntactic features. Extensions of either
vocabulary or syntactic range require increasingly complex structures and
an increasing number of special recognition rules, which would make a
system large or too unwieldy for commercial implementation on commonly
available computing systems. Moreover, the automated grammatical systems
which have been designed are special processors, in that they are not
adapted to conventional word processing or computer-aided publishing
functions. For example, such systems may require that their input text be
at least sufficiently pre-edited so that it is both correctly spelled and
grammatically well formed. Text having a misspelling, a wrong word such as
a homonym, a compound word, or even a simple syntax error may render an
input sentence unanalyzable.
OBJECTS AND SUMMARY OF THE INVENTION
It is an object of the present invention to provide an improved device for
the grammatical analysis of digitally encoded natural language text.
It is another object of the invention to provide a digital text analyzer
which associates a tag with each word of a digitally encoded text and
processes the tags to identify higher grammatical structure.
It is another object of the invention to provide a text analyzer operative
on tagged text which successively identifies noun groups, then predicates
and then clausal structures.
These and other features of the invention are obtained in an apparatus for
the grammatical analysis of digitally encoded text material which receives
encoded text, annotates each word of the text with a tag, and processes
the annotated text to identify basic syntactic units such as noun phrases
and verb groups. A clausal analyzer then operates on the identified
nominal and predicate structures to identify clause boundaries and clause
types. When an error is detected, an error message identifies the error
and displays a suggested correction.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a system according to the present invention;
FIG. 2 is a listing of system tags in an illustrative embodiment;
FIG. 3 is a simplified listing of classes of tags;
FIG. 4 shows the processing of a general grammatical analyzer operative on
tagged text;
FIGS. 5 and 6 are flow charts of the noun phrase processing of tagged text
sentences;
FIG. 7 shows verbal tag processing indices;
FIGS. 8-11 show predicational templates;
FIG. 12 is a descriptive structure for a predicate-analyzing FSA;
FIG. 12A is a state transition table for one such FSA;
FIG. 13 is a flow chart of the simplex and complex predication processor;
FIGS. 14-15 are flow charts of clausal analysis using the previously
determined syntactic structures; and
FIG. 16 is a flow chart of feature concordance processing.
DETAILED DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a block diagram of a grammatical analyzer according to the
present invention having a CPU/controller 2 which may, for example, be a
general purpose computer such as a micro- or mini- computer having storage
of at least several megabytes and sufficient (e.g. 640 kilobytes) random
access memory. The computer receives input text 4, e.g., from keyboard
entry, a communications link, or a data storage device, and, if necessary,
runs a sentence splitter 6 which partitions the text into sentences for
grammatical analysis. Alternatively, the system may receive as input
discrete sentences of encoded text with sentence boundary markers
inserted. Sentence splitting per se is known in the art, and is used, for
example, in computerized readability analysis systems for deriving
word-per-sentence and similar statistical information. A suitable sentence
splitter is disclosed in U.S. Pat. No. 4,773,009 of Henry Kucera, Rachael
Sokolowski and Jacqueline Russom filed June 6, 1986 as Ser. No. 872,094
entitled Method and Apparatus for Text Analysis.
The controller 2 then passes each sentence to a grammatical analyzer 10,
which processes the sentences to determine higher grammatical structure
and to detect errors. Analyzer 10 includes an annotator 10a which
annotates each word of the sentence, e.g., by reference to a stored word
dictionary 8, so as to produce an annotated sentence structure. Analyzer
10 also includes a processing module 10b which operates on the annotated
structure to define noun and verb groups, and to define clausal and
sentence structure. The annotated sentence, or partial parses thereof and
error messages or "prompts", are displayed on display 9.
The dictionary includes a record for each word which contains a list of
"tags", each tag encoding a syntactic or inflectional property of the
word, and which also contains a list of special features used in the
grammatical processing. The processor annotates the sentence words with
this information. It further preprocesses the annotated sentences to
identify a single most likely tag for each word. Such tags may be
identified by employing a number of common template-matching rules to
quickly scan sets of tags and identify tags which fit common syntactic
patterns. Preferably, however, the single tag for each word is identified
using probabilistic data defined on the sets of possible tags, as
described in the research on collocational tagging published by Ian
Marshall, or as set forth in the patent application Ser. No. 106,127 of
Henry Kucera, Alwin Carus and Jeffrey Hopkins for Collocational Grammar
System filed on the same date as this patent application.
A prototype processor was constructed with the tag annotations provided by
a main dictionary with 28,223 80-byte records, each having the complete
grammatical information for a given "word". Each dictionary "word" was
either a base form from which various inflections are recovered through a
coding and analysis technique, or an irregularly inflected form. The
records included, an addition to the base form, a list of codes for "tags"
indicative of all the grammatical or syntactic uses that the linguistic
words derivable from the base form may assume, and further, for noun and
verb words, a list of "inflection codes" which encode inflectional
paradigms for generating the noun or verb inflections from the base form.
FIG. 2 is a listing of the tags used in the prototype embodiment, each of
which is represented in the drawing by a one to three character mnemonic
and also by a one to two digit tag code. There are ninety-three such tags,
although any given text word will generally have between one and six
possible tags. Each tag indicates a possible syntactic use of the word, or
an inflection. The dictionary records preferably also include, for nouns
and verbs, certain information encoding word features such as its number
agreement behavior.
FIG. 3 shows a simpler list of tag classes, which for many of the
processing steps discussed below provide all necessary information. There
are thirty-two such classes. The Figure gives the class code in column
one, a class description in column two, and the number of tags in the
class in column three. Column four contains an "exclusive group" or "EG"
number, used in connection with verbal processing as described further
below.
As noted above, the dictionary is used in connection with appropriate
"unflection" processing modules which analyze a text word to determine its
base form and look up the base form in the dictionary to retrieve its
listed tags. Related "inflection" processing, conversely, generates a
given inflection from its base form and from a data code indicative of the
inflectional paradigm.
One suitable unflection/inflection processor is described in detail in U.S.
Pat. No. 4,724,523 and entitled "Method and Apparatus for the Electronic
Storage and Retrieval of Expressions and Linguistic Information" of
inventor Henry Kucera. The disclosure of that patent application is hereby
incorporated by reference. For purposes of the present invention however,
what is important is that text material be annotated to identify for each
word of a sentence tags indicating its possible uses. The data for each
word is then contained in a data structure denoted a sentence node or
SEN-NODE below. In the discussion below it will further be assumed that
each SEN-NODE includes an indication of a single tag, denoted "MPP", which
has initially been selected as the most likely tag for the word
represented by that node.
Parsing of Annotated Sentence
FIG. 4 shows the construction of an exemplary grammatical text analyzer
according to the invention, in which, by way of illustration, a
disambiguation processor 20 provides a data output including a SEN NODE
data structure 22 for each word, with its most probable tag, denoted MPP,
and other tag and feature annotations. A grammatical analyzer 30 then
operates under control of control module 24 on the annotated word data to
successively build up larger syntactic structures and derive a parse of a
text sentence.
In this construction, the set of tagged sentence nodes is parsed in three
general phases: (a) the identification of the simplex noun phrases (NPs)
in the sentence, and if there is more than one simplex NP, their
combination, where possible, into complex NPs; (b) the identification of
the simplex verb groups (VGs) in the sentence and, if there is more than
one simplex VG, their combination, where possible, into complex VGs; and
(c) the identification of the simplex sentence(s) (or clauses) in the
(matrix) sentence and, if there is more than one simplex sentence their
combination (where possible) into complex sentences.
The NP processing 26 of the first phase is accomplished in a double-scan of
the sentence. The parser first ascertains NP boundaries primarily by
inspecting tagged words and applying ordering criteria to their "rank".
This rank, which characterizes a word's functional role in noun phrase
construction, is determined by inspection of the word's tag in conjunction
with other data, which may be stored in the word's sentence node. It then
operates on the simplex NP structures to detect complex phrases which
include prepositional phrases, a coordinating conjunction, certain
coordinating punctuation and appositions. When such a complex phrase is
identified, the processor creates a complex NP record which includes
pointers to the component NPs and the boundaries of the NP, and derives
the feature agreement properties, e.g., number, of the complex NP.
Once the NP-structure of the sentence has been determined, a predication
analyzer module 28 inspects the portions of the sentence that are not
incorporated into nominalizations, and assigns predicational structure to
these portions where appropriate.
After operation of module 28, the predicational structure of the sentence
has been provisionally determined. Some sentential structure is also
determined incident to the predicational analysis, as tentative
assignments of subjects and their corresponding finite predications will
have been made.
At this point the controller 24 analyzes the higher syntactic structure of
the sentence by a clausal analysis module 32 that inspects the tentative
sentence-level structures generated by module 28 and either confirms them
or replaces them.
The noun group and verb group modules each insert boundary markers and
provide other data to appropriate registers 34 which maintain the boundary
data for phrase and verb groups, and also maintain the derived feature
information. This allows concordance rule checking of different syntactic
units and permits the clausal analysis module to match related sub-units
of clauses. An error message module 36 displays error messages when errors
of syntax are detected.
When agreement errors are detected, the correct noun and verb inflections
are generated from the base form, by an "inflection" procedure, and are
displayed with a corresponding error message as suggested corrections.
Determination of Noun Phrases
The initial noun phrase processing of the parsing module will be understood
by reference to the flow charts, FIGS. 5 and 6.
First, for each pre-nominal tag, a "rank" is associated corresponding to
its functional position in noun phrase construction. Thus, pre-qualifiers
and pre-quantifiers have rank 0; determiners, articles and possessives
have rank 1; post-determiners (tag AP) have rank 2; cardinal and ordinal
numbers rank 3; comparative and superlative adjectives, adverbs,
qualifiers, and semantic superlatives such as "top" have rank 4;
adjectives have rank 5; and post-qualifiers rank 6. With such information
available from the tag of each word, noun phrase determination is then
accomplished as follows.
As shown in FIG. 5. upon entry at step 105, the noun phrase processing
module, which in the prototype system is part of a module denoted GcsPrsr
initializes a number of processing variables for the simplex-NP scan. The
variables and flag bits are evaluated or ascertained during processing and
their values loaded into a data structure NGROUPS which serves as a record
of noun phrase data generated by the processing. These variables are: (a)
NP-COUNT (the location in NGROUPS of the NP currently under
construction--if any--or the next available slot where one can be
constructed), which is set to 1; (b) NP-FLAG (an NP scope flag), which is
set to `0`B to indicate no NP is presently under construction; (c) SN-PTR
(a basing pointer for the SEN-NODE structure), which is set to point at
the SEN-NODE that corresponds to the first word in the sentence; (d)
N-CPNTS (a complex-NP endpoints array in NGROUPS), which is set to zero,
and (e) N-DOM-LK (an NP-domination link array in NGROUPS), which is also
set to zero).
The structure NGROUPS is an external data array of 16-byte structures for
containing information concerning simplex and complex noun phrases;
elements accessed by the GcsPrsr are: (a) N-SPNTS (which contains the
starting and ending points for each noun phrase in NGROUPS); (b) N-CPNTS
(which contains the starting and ending points for each complex noun
phrase in NGROUPS); (c) N-PREPS (which contains the location of the
preposition (if any) that dominates a given noun phrase); (d) N-DOM-LK
(which contains the identity of the complex-NP node (if any) that
dominates a given NP); (e) N-FEATS (which contains the identity of the
nominal features associated with the head noun (if any) in a given simplex
NP); and (f) N-WEIGHT (which contains the "weight" of the given NP,
calculated as described below).
After initialization, denoted 100, the NP processor enters a simplex-NP
scan loop, which iterates through the chain of SEN-NODE structures
containing the syntactic representation of the current sentence, including
the most likely tag of each word, and groups as many nominal and
pre-nominal elements as possible into simplex noun phrases satisfying
certain ordering and recognition criteria of NP well-formedness. This
process calls to three sub-procedures, designated (i) NP-START, (ii)
GcsNPEB, and (iii) NP-END-AFTER. Each of these inserts an NP boundary and
updates the NGROUPS structure as described below.
For each iteration of the simplex-NP scan loop, the following variables are
set according to the contents of the SEN-NODE structure currently under
consideration: (a) CUR-PARSE is set equal to the index in the portion of
the SEN-NODE array that contains the MPP tag of the current word; (b)
TAG-TYPE is set equal to the code that describes the general tag class of
this tag; (c) TAG is set equal to the character representation of this tag
for purpose of display, if required; (d) CUR-FLAGS is set equal to a bit
string that contains the features corresponding to the MPP tag; and (e)
WORD is set equal to the current word. After the above variables have been
set, the value of a lexical-NP recognition flag (LEX-NP-FLAG) is
initialized to `0`B, and the value of NP-FLAG is then inspected. If
NP-FLAG is set to `1`B (indicating that an NP is already under
construction), then the following actions are taken in order to determine
if the current element will close this NP (either with or without starting
a new NP).
A check 106 determines if (1) the first letter of the current word's tag
type stored in TAG-TYPE is equal to "V" (indicating that this word is a
verb); or (2) the tag type is "UR" (indicating an "unrestricted" tag) and
the NP rank of the previous word was 7 (a rank assigned by GcsPrsr when it
identifies a word as a head noun or a pronoun). If so the parsing
processor at 107 calls the external procedure (GcsNPEB) in order to insert
a noun phrase end boundary marker, closing the current NP with its last
word the word before the current word; and resets the value of NP-FLAG bit
to `0`B.
Otherwise at 108 if the tag type is "UR" (and therefore the NP rank of the
previous word is not 7), a check 109 determines if (1) the current element
of the sentence is a comma, or (2) the MPP tag of this element is "RB" or
"CC". If either condition holds, then: the tag type of the element
following the current one is inspected at 110, and if this word is not
pre-nominal, GcsNPEB is called in order to close the current NP with its
last word being the word before the current word; and the value of NP-FLAG
is reset to `0`B. Otherwise, that is, if the next word is a pre-nominal,
the current word is included in the NP currently under construction.
On the other hand, if the tag type is "UR" but neither of conditions (1) or
(2) given above holds, then GcsNPEB is called directly to close the
current NP with its last word being the word before the current word, and
the value of NP-FLAG is reset to `0`B.
If the current word's tag is neither verbal nor unrestricted, then this
word may either end or be added to the current NP (and in the former case
will start a new NP if the current word is a pronoun), so the tag type of
this word is next inspected at stage 111 and the following actions are
taken.
If the tag type is "PN" (indicating that the word is functioning as a
pre-nominal), then at step 111 the NP rank of the current word is obtained
and is compared at 112 with the NP rank of the previous word. If the
current NP rank is less than the NP rank of the previous word, then,
except in special cases, the boundary marker inserting routine GcsNPEB is
called at step 113 in order to close the current NP with its last word
being the word before the current node. Also the procedure NP-START is
called at step 114 in order to start a new NP with the current word as its
first element. NP-START modifies the simplex-NP endpoints array N-SPNTS in
the processing structure NGROUPS to set the opening boundary of the
current NP equal to the location of the current word in the sentence, and
modifies a prepositional phrase marker N-PREPS in NGROUPS by inspecting
the word previous to the start of the NP, and, if it is a preposition,
setting the phrase marker to its address, thus marking the current simplex
NP as a potential prepositional phrase. NP-START also sets an "open-NP"
boundary flag on the current SEN NODE structure.
It should be noted that this processing may be refined in various ways. For
example, detection of an NP rank decrease at step 112 may legally occur
within a noun phrase in constructions such as "A good, highly regarded,
book" in which an adverb (rank 4) follows one of a series of adjectives
(rank 5). However, in normal circumstances during NP construction, the
occurrence of a pre-nominal word of lesser NP rank than the preceding word
is generally recognized as the start of a new noun phrase.
Otherwise, if at step 115 an inspection of the NP-boundary flags associated
with the current word indicates that it is being interpreted as a lexical
NP, then the following actions are taken: (i) GcsNPEB is called in order
to close the current NP with its last word the word before the current
word; (ii) NP-START is called in order to start the new NP with the
current word as its first element; (iii) NP-END-AFTER is called at 116 in
order to end the new NP with the current word as its last element; (iv)
NP-FLAG is reset to `0`B in order to indicate that the current NP cannot
be added to on the right; and (v) LEX-NP-FLAG is set to `1`B in order to
indicate that the current NP is a lexical NP.
If the current word is not a lexical NP, then it is necessarily either a
head noun or a noun adjunct, so the value of NP-RANK is set equal to 7.
At this point the values of both NP-FLAG and LEX-NP-FLAG are checked and if
they are both equal to `0`B, indicating that the current word has not been
processed yet by either adding it to an existing NP or by interpreting it
as a lexical NP, then the following actions are taken.
If an inspection 117 of the word's tag type indicates that it is a noun or
pronoun, then if an inspection of 118 of the NP-boundary flags associated
with the current word indicates that it is being interpreted as a lexical
NP, then the following actions are taken at 119: (i) the value of NP-RANK
is set equal to 7; (ii) NP-START is called in order to start the new NP
with the current word as its first element; and (iii) NP-END-AFTER is
called in order to end the new NP with the current word as its last
element.
Otherwise the NP currently being opened may have more than one element in
it. For example, even if it begins with a noun, this noun may turn out to
be an adjunct to one or more further nouns in the NP. Thus, the following
processing is done. At step 120 NP-FLAG is set equal to `1`B and at 121 a
check determines if the tag type is "PN" (indicating that the word is
functioning as a pre-nominal). If so, then at 122 the NP rank of the
current word is obtained and stored in the various NP-RANK.
Otherwise the current word is necessarily a noun, so at 123 NP-RANK is set
equal to 7, and NP-START is called in order to set up the necessary
environment for processing the simplex NP that was started in either Step
122 or 123 above.
Above the above steps have been completed, the value of NP-FLAG is
insected; if it is set to `1`B, then the next word to be processed may
possibly be added onto an NP that is currently under construction, so
PREV-NP-RANK is set equal to NP-RANK; otherwise it is set equal to zero.
The value of SN-PTR is then set to access the SEN-NODE corresponding to
the next element of the sentence in the next iteration of the loop, and if
this SEN-NODE is active in the current sentence, then the above procedure
is repeated.
During the noun phrase processing described in detail with reference to
FIG. 5, the processing routines NPEB and NP-END-AFTER are invoked (e.g.,
as described at steps 124 and 113 of FIG. 5) to construct an appropriate
noun phrase storage entry and to insert a "close-NP" boundary before
(respectively, after) the current word. These processing routines operate
as follows.
The routine NP-END-AFTER modifies the NGROUPS array of structures which is
the aggregate containing information concerning simplex and complex noun
phrases by (a) modifying the simplex-NP endpoints array N-SPNTS in order
to set the extent of the current NP. These endpoints are both set equal to
the address of the current word in the overall sentence processing
structure SEN-TBL since the NP-END-AFTER procedure is only invoked for
lexical NP'S. It also modifies (b) the NP-feature string array N-FEATS by
setting the slot that is indexed by the curren value of NP-COUNT equal to
CUR-FLAGS which contains a copy of the current noun feature string; and
modifies (c) the NP-weight array N-WEIGHT by setting the slot that is
indexed by the current value of NP-COUNT equal to the appropriate
lexical-NP value. Specifically, it sets the "NP-WEIGHT" corresponding to
the current lexical NP equal to: (i) +2 (a default value for singular
lexical NPs), or (ii) -2 (a default value for plural lexical NPs); or
(iii) 0 (a default value for NPs that are neutral with respect to number
agreement). Finally, NP-END-AFTER closes the NP currently under
construction by incrementing the current NP counter variable (NP-COUNT) in
the external sentence-ordering structure by one, and then returns to its
calling procedure.
The processing routine GcsNPEB, which is invoked when a noun phrase is
determined to have ended before the current word, inserts a close-NP
boundary before the current word, and also checks the internal structure
of the NP by inspecting the features of its individual elements, and, if
inconsistencies are detected, initiates corrective processing.
On entry, GcsNPEB sets the feature string for the simplex NP currently
under construction equal to the feature string corresponding to the word
preceding the current word in SEN-TBL, and then sets the simplex-NP
closing boundary equal to the address of that word (in SEN-TBL). In
addition to this processing, GcsNPEB of the prototype embodiment also
checks the internal number agreement consistency of the NP it has created
by inspecting its features, as follows.
First of all, a loop inspects each SEN-NODE within the scope of the given
simplex NP (i.e., those SEN-NODE's between and including the nodes
specified by the endpoints stored in NGROUPS.N-SPNTS). Each SEN-NODE in
this range that does not correspond to a word whose MPP tag indicates that
it is most probably functioning as a noun adjunct is inspected to see
whether its MPP tag carries features that are not neutral with respect to
number agreement. Further processing is done for each of these words that
is not a noun adjunct and is also not neutral with respect to number
agreement, as follows. If the word is plural, then a plural counter
(PL-CTR) is incremented by one and the location of the word (in SEN-TBL)
is stored in the element of the plural storage matrix (PL-TBL) that as
indexed by PL-CTR; similarly, if the word is singular, then a singular
counter (SG-CTR) is incremented by one and the location of the word (in
SEN-TBL) is stored in the element of the singular storage matrix (SG-TBL)
that is indexed by SG-CTR. The value of the NP weight is also checked and
modified as appropriate at this point (i.e., if the word is singular, then
if the current value of N-WEIGHT is not negative, then this value is
incremented by one, and if the word is plural, then if the current value
of N-WEIGHT is not positive, then this value is decremented by one). Once
this loop exits, the values of SG-CTR and PL-CTR are checked; if both
counters hold non-zero values, then there is a number-agreement
inconsistency in the NP. In that case, it is processed further as follows.
First, the magnitudes of SG-CTR and PL-CTR are compared relative to one
another; if only one word is marked as being singular and more than one is
marked as being plural, then the singular one is flagged and a message
suggesting its transformation into the appropriate plural form is
generated. On the other hand, if more than one element in the NP is
singular, but the number of singular elements is still less than the
number of plural ones, then an internal procedure "MULT-ERRORS" is called
to generate and link together the two rings of error messages associated
with the elements of the NP that are inconsistent with respect to number
agreement. Otherwise, if only one word is marked as being plural and more
than one is marked as being singular, then the plural one is flagged and a
message suggesting its transformation into the appropriate singular form
is generated. On the other hand, if more than one element in the NP is
plural (but the number of plural elements is still less than the number of
singular ones), then "MULT-ERRORS" is called to generate and link together
the two rings of error messages associated with the elements of the NP
that are inconsistent with respect to number agreement.
If none of the above cases is true, then there are equal numbers of words
with singular and plural tags in the NP. By far the most common sequences
of this type will involve one plural and one singular tag. In this case
"MULT-ERRORS" is also called to generate and link together the two rings
of error messages associated with the elements of the NP that are
inconsistent with respect to number agreement.
The foregoing error messages are generated by a procedure that copies the
original form of the inconsistent work, accesses an error message from a
table of such messages, calls a procedure to construct the appropriate
singular or plural form of the inconsistent work (e.g., from its stored
base form and inflection code), and links together several error messages
if the inconsistency does not suggest a unique error. It then displays the
original word, error message(s) and suggested correction(s).
Finally, GcsNPEB sets the value of the current NPs number-agreement feature
(according to the value of the NPs weight), and then closes the current NP
by incrementing the current NP counter variable (NP-COUNT) by one and also
setting (to `1`B) the "close-NP" boundary on the SEN-NODE preceding the
SEN-NODE associated with the current word. GcsNPEB then returns to its
calling procedure.
In this manner the noun phrase processor identifies each noun group,
determining its endpoints, feature agreement properties and component
words. For each identified noun group a NP data structure receives this
information, with the set of such NP data structures forming an ordered
NGROUPS structure.
After the above-described simplex NP processing loop exits (having applied
its algorithms to each SEN-NODE in turn, scanning from left to right
across the sentence), another loop cycles through the contents of the
NGROUPS structure, proceeding from the last entry to the first, thus in
effect scanning the noun-phrase projection of the sentence from right to
left. This loop constructs complex noun phrases, where possible, based on
the simplex noun phrases previously identified, and also, where
appropriate, on the complex NPs that it is in the process of creating. In
an early prototype embodiment this is done by applying two rules: (I) a
complex NP may be constructed over an NP followed by a prepositional
phrase (i.e., an NP preceded by a preposition), and (II) a complex NP may
be constructed over two NPs that are separated only by a coordinating
conjunction. In both cases, the second element of the pair of NPs
specified by the rule may be a complex NP, thus building up, in the case
of complex NPs that contain complex-NP constituents, the appropriate
structures by a process of left-adjunction. FIG. 6 shows the details of
this operation of the noun groups processor.
On entry to this loop, which uses the data of the NGROUPS structure to
access and process successive pairs of noun phrases of the sentence,
starting at the rightmost simplex noun phrase, the processor starts by
examining the last simplex NP in the sentence and the next-to-last one,
then continues by examining the next-to-last simplex NP and the
third-from-last one, on down to examining the second simplex NP and the
first one. At 130 a determination is made if there are less than "n"
elements (words or non-terminal punctuation) between the two NPs and, if
so, then processing described below is performed to identify certain
complex noun phrases satisfying this criterion. If there are more than "n"
elements separating two successive NPs then the processor skips over these
elements, resets the value of NP-FLAG to `0`B, and proceeds to process the
next NP. For descriptive purposes below, n is taken as two and the
construction of complex noun phrases whose noun phrase components are
joined by one or two elements is described. First, two bit flags, NP-FLAG
and CC-FLAG, are initialized to `0`B, and at the beginning of each
iteration of the loop the number of elements between the two simplex NPs
currently under examination is determined, and the value of a variable,
XTAG, is set equal to the null-valu | | |