Claims
What is claimed is:
1. A computer implemented method, implemented by a programmed computer, of
building a co-occurrence dictionary describing whether phrases co-occur in
one sentence, the phrases belonging to first and second categories in a
dictionary containing phrases of a natural language which is an object,
said method comprising using the computer to build the co-occurrence
dictionary by implementing the steps of:
selecting, as a first sub-group of phrases (11), phrases from a first group
of phrases (1) comprising all phrases belonging to said first category in
said dictionary;
selecting, as a second sub-group of phrases (21), phrases from a second
group of phrases (2) comprising all phrases belonging to said second
category in the dictionary;
preparing first co-occurrence information describing whether each phrase
belonging to the first sub-group (11) and each phrase belonging to the
second sub-group (21) co-occur in one sentence of the object language;
preparing second co-occurrence information describing whether each phrase
belonging to a third sub-group of phrases (12), comprising all the phrases
in the first group (1) which do not belong to the first sub-group (11) and
each phrase belonging to the second sub-group (21), co-occur in one
sentence of the object language;
preparing third co-occurrence information describing whether each phrase
belonging to a fourth sub-group of phrases (22), comprising all the
phrases in the second group (2) which do not belong to the second
sub-group (21) and each phrase belonging to the first sub-group (11)
co-occur in one sentence of the object language;
arranging the first co-occurrence information such that each phrase
belonging to the first sub-group (11) corresponds to a real number vector
with a dimension below a common maximum dimension and each phrase
belonging to the second sub-group (21) corresponds to a real number vector
with a dimension below the common maximum dimension;
calculating a value of the real number vector corresponding to each phrase
in the first sub-group (11) and a value of the real number vector
corresponding to each phrase in the second sub-group (21) on the basis of
the first co-occurrence information so that the number of sets of two
phrases, wherein:
a value of an inner product of the real number vector corresponding to a
first phrase and the real number vector corresponding to a second phrase
becomes positive when describing, in the first co-occurrence information,
that a first phrase belonging to said first sub-group (11) and a second
phrase belonging to said second sub-group (21) co-occur in one sentence,
and
the value of an inner product of the real number vector corresponding to
said first phrase and the real number vector corresponding to said second
phrase becomes negative when describing, in said first co-occurrence
information, that said first phrase belonging to said first sub-group (11)
and said second phrase belonging to said second sub-group (21) do not
co-occur in one sentence,
becomes the greatest of all the numbers of sets each comprising phrases
belonging to said first sub-group (11) and phrases belonging to the second
sub-group (21);
arranging said second co-occurrence information such that each phrase
belonging to said third sub-group (12) corresponds to a real number vector
with a dimension below the maximum dimension;
calculating a value of the real number vector corresponding to each phrase
in said third sub-group (12) on the basis of said second co-occurrence
information so that the number of sets of two phrases, wherein:
a value of the inner product of the real number vector corresponding to a
third phrase belonging to said third sub-group (12) and the real number
vector corresponding to a fourth phrase belonging to said second sub-group
(21) and calculated on the basis of said first co-occurrence information
becomes positive when describing, in said second co-occurrence
information, that the third phrase and the fourth phrase co-occur in one
sentence, and
a value of an inner product of the real number vector corresponding to the
third phrase and the real number vector corresponding to the fourth phrase
becomes negative when describing, in said second co-occurrence
information, that the third phrase and the fourth phrase do not co-occur
in one sentence,
becomes the largest of all the numbers of sets each comprising a phrase
belonging to said third sub-group (12) and a phrase belonging to said
second sub-group (21);
arranging said third co-occurrence information such that each phrase
belonging to the fourth sub-group (22) corresponds to a real number vector
with a dimension below the maximum dimension; and
calculating a value of the real number vector corresponding to each phrase
in the fourth sub-group (22) on the basis of said third co-occurrence
information so that the number of sets of two phrases, wherein:
the inner product of the real number vector corresponding to a fifth phrase
belonging to said first sub-group (11) and calculated on the basis of said
first co-occurrence information and the real number vector corresponding
to a sixth phrase belonging to the fourth sub-group (22) becomes positive
when describing, in the third co-occurrence information, that the fifth
phrase and the sixth phrase co-occur in one sentence and, on the other
hand,
the inner product of the real number vector corresponding to the fifth
phrase calculated on the basis of the first co-occurrence information and
the real number vector corresponding to the sixth phrase becomes negative
when describing, in the third co-occurrence information, that the fifth
phrase and the sixth phrase do not co-occur in one sentence,
becomes the greatest of all the numbers of sets each comprising a phrase
belonging to said first sub-group (11) and a phrase belonging to said
fourth sub-group (22).
2. A method as claimed in claim 1, comprising the further step of:
correcting said first co-occurrence information by exceptionally reversing
the decision of the co-occurrence with respect to a portion of said first
co-occurrence information so that the number of sets of two phrases,
wherein:
the value of the inner product of the real number vector corresponding to
said first phrase belonging to said first sub-group (11) and the real
number vector corresponding to said second phrase belonging to said second
sub-group (21) becomes positive when describing that said first phrase and
said second phrase co-occur in one sentence and the value of the inner
product of the real number vector corresponding to said first phrase and
the real number vector corresponding to said second phrase becomes
negative when describing, in the first co-occurrence information, that the
first phrase and the second phrase do not co-occur in one sentence,
is above a constant rate to the number of all sets each comprising phrases
belonging to the first sub-group (11) and phrases belonging to said second
sub-group (21), and the corrected first co-occurrence information is used
as said first co-occurrence information and the co-occurrence information
are calculated in real number vector form with respect to all the phrases
of said first group of phrases (1) and said second group of phrases (2) so
as to calculate the co-occurrence information in the real number vector
form and in exception information form.
3. A method as claimed in claim 1, further comprising the steps of:
when a new seventh phrase belonging to a first category is added to the
built co-occurrence dictionary which describes whether each phrase
belonging to said first category and each phrase belonging to a second
category co-occur in one sentence in a dictionary containing phrases of a
natural language,
selecting a first select group of phrases in said dictionary, which
consists of N phrases, of phrases belonging to said second category and
which are above a maximum dimension of the corresponding vectors and in
which the absolute value of an inner product of the real number vectors
corresponding to every two phrases is below a constant value, so as to
give additional co-occurrence information indicative of whether the N
phrases and the seventh phrase co-occur in one sentence of said language;
arranging said seventh phrase so as to correspond to a real number vector
having a dimension below the maximum dimension; and
calculating a real number vector corresponding to said seventh phrase so
that the number M of sets of two phrases, wherein
a value of the inner product of the real number vector corresponding to
said seventh phrase and the real number vector corresponding to an eighth
phrase belonging to the first select group of phrases becomes positive
when describing, in said additional co-occurrence information, that said
eighth phrase and said seventh phrase co-occur in one sentence and the
value of the inner product of the real number vector corresponding to said
seventh phrase and the real number vector corresponding to said eighth
phrase becomes negative when describing, in said additional co-occurrence
information, that the eighth phrase and the seventh phrase do not co-occur
in one sentence,
has a maximum so that the calculated real number vector is added as the
co-occurrence information for the seventh phrase to said co-occurrence
dictionary.
4. A method as claimed in claim 3, comprising the further step of, when the
number M is below a predetermined number L,
selecting a second select group of phrases whose number is constant from
the second category to give readditional co-occurrence information
indicative of whether said second select group of phrases and said seventh
phrase co-occur in one sentence of the language so as to correct said
additional co-occurrence information so that the co-occurrence decisions
of said additional co-occurrence information and a portion of said
readditional co-occurrence information are exceptionally reversed, and
calculating the real number vector corresponding to said seventh phrase on
the basis of the corrected additional co-occurrence information so that
the number M becomes above the predetermined number L, and adding the
calculated real number vector as the co-occurrence information for said
seventh phrase to said co-occurrence dictionary.
5. A method as claimed in claim 3, further comprising the step of
performing a co-occurrence analysis using said co-occurrence dictionary to
automatically decide whether phrases which are included in said dictionary
and which belong to first and second categories co-occur in one sentence,
wherein, when said first phrase included in said first category in said
co-occurrence dictionary and said second phrase included in said second
category in said co-occurrence dictionary appear at positions, allowable
on morpheme and syntax, in the sentence to be analyzed, if the inner
product of the real number vector corresponding to the first phrase and
the real number vector corresponding to the second phrase is positive, a
decision is made that said first phrase and said second phrase co-occur,
and on the other hand, if the inner product of the real number vector
corresponding to said first phrase and the real number vector
corresponding to said second phrase is negative, a decision is made that
said first phrase and said second phrase do not co-occur.
6. A method as claimed in claim 5, wherein, when said first phrase and said
second phrase in the sentence to be analyzed are vague on a morpheme and
syntax, using the interpretation that the absolute value of the inner
product of the real number vector corresponding to said second phrase and
the real number vector corresponding to said first phrase calculated in
accordance with the co-occurrence analysis method is the greatest value or
a group of interpretations that the absolute value of the inner product is
above a constant value, and rejecting the other interpretations.
7. A method as claimed in claim 6, wherein the natural language is the
Japanese language, said first category is nouns, and said second category
is deep cases of a predicate.
8. A method as claimed in claim 7, further comprising the steps of:
calculating an inner product Q of a real number vector corresponding to
each deep case in a plurality of deep case patterns P of the predicate in
a sentence S to be analyzed and a real number vector corresponding to a
noun which is applied to each deep case in the sentence S; and
adding a weight proper to each deep case in the deep case patterns P to the
inner product Q to adopt the deep case pattern, that the added value E is
the greatest, as an interpretation for the deep case pattern of the
predicate in the sentence S.
9. A method as claimed in claim 7, further comprising the steps of:
calculating an inner product Q of a real number vector corresponding to
each deep case in a plurality of deep case patterns P of the predicate in
a sentence S to be analyzed and a real number vector corresponding to a
noun which is applied to each deep case in the sentence S; and
adding a weight proper to each deep case in the deep case patterns P to the
inner product Q to adopt all the deep case patterns, that the added value
E is above a predetermined constant value, as an interpretation for the
deep case pattern of the predicate in the sentence S.
Description
BACKGROUND OF THE INVENTION
The present invention relates generally to natural language processing
techniques to be used in computer applied systems such as word processors,
machine translations and interactive systems, and more particularly to an
apparatus and method of building and updating a semantic analysis
co-occurrence dictionary and an apparatus and method of analyzing
co-occurrences and meanings.
Recently, various computer application systems have been researched and
developed on the basis of natural language processing techniques, and
some of these systems are gradually taking root in our language culture.
Particularly, in Japan, the progress of the kana-kanji conversion technique
allows easy input of sentences comprising a mixture of kanji and kana to
computers, whereby text processing software on Japanese word processors and
personal computers is used widely. However, we still do not have an
effective means to represent and process the meaning of words and the
semantic relation between words for selecting a correct word from homonyms
in the kana-kanji conversion. In
the present stage, it is common practice in the machine translation or the
like to process the meaning of words in accordance with the semantic
analysis technique based on the case grammar described by C. J. Fillmore
and to use semantic labels in the co-occurrence analysis. A description
will be made hereinbelow with reference to FIGS. 8 to 13 in terms of a
conventional co-occurrence analysis method using the semantic label, a
conventional semantic analysis method using this conventional
co-occurrence analysis method, and a conventional co-occurrence dictionary
building and updating method necessary for these analyses.
FIG. 8 is a block diagram showing one example of Japanese sentence analysis
apparatus based on the conventional semantic analysis method. In FIG. 8,
numeral 701 represents an inputting means for inputting a sentence to be
analyzed, 702 designates a morphological analysis means for dividing the
inputted sentence into a list (string) of morphemes, 703
denotes a morpheme dictionary to be retrieved by the morphological
analysis means 702 when performing the morphological segmentation, 704
depicts a connection rule to be used by the morphological analysis means
702 when performing the connection test between the morphemes, 705
indicates a syntactic analysis means for inputting the list (string) of
morphemes from the morphological analysis means 702 to analyze the
syntactic structure and output the syntactic tree, 706 represents a
context-free grammar rule to be used by the syntactic analysis means 705
when performing the syntactic structure analysis, 707 designates a
semantic analysis means for inputting the syntactic tree from the
syntactic analysis means 705 to perform the case analysis and output the
semantic structure, 708 denotes a verbal case dictionary to be used by the
semantic analysis means 707, 709 depicts a noun semantic label dictionary
to be used by the semantic analysis means 707, and 710 indicates a
semantic structure storing means for storing a semantic structure
centering the case frame produced by the semantic analysis means 707,
which is referred to and operated by an external apparatus. The noun
semantic label dictionary 709 to be used for the semantic analysis
describes the meaning of each of the nouns within the morpheme dictionary
703 with one or more semantic labels in accordance with the semantic
classification standard as shown in FIG. 11 and has the contents as shown
in FIG. 12. Further, the verbal case dictionary 708 divides the meaning of
each of the verbs within the morpheme dictionary 703 into one case
pattern or more and describes them as illustrated in FIG. 13. As in
the noun semantic label dictionary 709, the meaning of the noun
co-occurring with each case slot is described with one semantic label or
more in accordance with the semantic classification standard shown in FIG.
11.
The operation of the conventional sentence analysis apparatus thus arranged
will be described hereinbelow in terms of the case of analyzing the typed
sentence "A B C V ". First, the typed sentence "A B C V " is supplied
as a character string through the inputting means 701 to the morphological
analysis means 702. The morphological analysis means 702 performs the
morphological segmentation process from the beginning of the sentence
toward the end of the sentence. If a morpheme coincident with a
portion of the inputted character string is found by the retrieval of the
morpheme dictionary 703, the possibility of its connection to the morpheme
immediately before the found portion is checked through the connection
rule 704. If the connection is possible, the morphological segmentation
process is continued on the inputted character string
subsequent to the found portion. If a plurality of coincident morphemes
are found by the retrieval of the morpheme dictionary 703,
priority is given among them in accordance with a heuristic method such
as the maximum coincidence and the minimum clause number. Thus, the
following list (string) of morphemes up to the end of the sentence can
be obtained.
"A (noun), (case post-positional particle), B (noun), (case
post-positional particle), C (noun), (case post-positional particle), V
(verb), (ending of verb), (ending of verb), (past auxiliary verb)"
The aforementioned morpheme string is supplied to the syntactic analysis
means 705 so as to analyze the syntactic structure to obtain a syntactic
tree as illustrated in FIG. 14. From this syntactic tree, it is understood
that all of the three post-positional phrases "A ", "B " and "C " are
connected or applied to the verb phrase "V ".
The syntactic tree illustrated in FIG. 14 is led to the semantic analysis
means 707 so as to perform the semantic analysis of the inputted sentence
in accordance with the procedure illustrated in FIG. 9 which shows a
procedure for the semantic analysis of a sentence "A B C V ". First, the
case patterns of the verb "V" are obtained by retrieving the verbal case
dictionary 708, and the semantic labels respectively corresponding to the
nouns "A", "B" and "C" are obtained by retrieving the noun semantic label
dictionary 709 (step 801). Secondly, it is checked, in accordance with the
co-occurrence analysis procedure illustrated in FIG. 10, whether the case
slot corresponding to the semantic label of the noun of each of the
post-positional phrases co-occurs with respect to each of the case
patterns of the verb V. That is, only the case patterns with which all the
three nouns co-occur are selected as a candidate, and further the best
case pattern is selected on the basis of the priority between the case
patterns, the filling degree of the case slot and others so that
information such as the tense and the voice is added to the selected case
pattern which is in turn outputted as the semantic structure (steps 802 to
812).
In the co-occurrence analysis procedure, as illustrated in FIG. 10 which
shows a procedure of the analysis as to whether or not the noun N, being
the C case, co-occurs with the case pattern P of the verb V, it is first
checked whether the C case is among the cases of the case pattern P (step
901). If the C case exists therein, it is checked whether there is a
common semantic label between a group of semantic labels in the case slot
of the C case of the case pattern P and a group of semantic labels of the
noun N (step 902). If the common semantic label exists therebetween, the
decision of the co-occurrence is made (step 903), and if not existing
therebetween, no co-occurrence is decided (step 904). Further, if there is
no C case in the cases of the case pattern P, it is checked whether the C
case can be taken as the optional case such as the time and the place
(step 905). If not, the decision of no co-occurrence is made (step 904).
If so, the case slot information of the optional case which does not
depend on the verb is retrieved so as to check whether there is a common
semantic label between a group of semantic labels in the optional case
slot and a group of semantic labels of the noun (step 906). If the common
semantic label exists therebetween, the decision of the co-occurrence is
made (step 903). On the other hand, if not existing therebetween, the
decision of no co-occurrence is made (step 904).
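The label-overlap test of steps 901 to 906 can be illustrated with a minimal Python sketch. The semantic labels, the case names, the optional-case table and the case pattern below are hypothetical values chosen for illustration only; they are not taken from FIGS. 11 to 13.

```python
from typing import Dict, Set

# Hypothetical optional-case slots (e.g. time, place) that do not depend on the verb.
OPTIONAL_CASES: Dict[str, Set[str]] = {
    "time": {"TIME"},
    "place": {"LOCATION"},
}


def co_occurs(noun_labels: Set[str], case: str,
              case_pattern: Dict[str, Set[str]],
              optional_cases: Dict[str, Set[str]] = OPTIONAL_CASES) -> bool:
    """Decide whether a noun filling `case` co-occurs with a case pattern."""
    if case in case_pattern:                              # step 901
        # step 902: any common label -> step 903, otherwise step 904
        return bool(case_pattern[case] & noun_labels)
    if case in optional_cases:                            # step 905
        # step 906: any common label -> step 903, otherwise step 904
        return bool(optional_cases[case] & noun_labels)
    return False                                          # step 904


# Hypothetical case pattern of a verb meaning "eat".
eat_pattern = {"agent": {"HUMAN", "ANIMAL"}, "object": {"FOOD"}}
```

Because the decision is a set intersection, the result is only as fine-grained as the label system itself, which is the limitation discussed in the following paragraphs.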
The above-mentioned verbal case dictionary 708 and noun semantic label
dictionary 709 to be used for the sentence analysis apparatus are paired
so as to construct the co-occurrence dictionary. Conventionally, this
construction is entirely effected by hand. A description will be made
hereinbelow in terms of the typical procedure of the construction of the
co-occurrence dictionary. First, one or more specialists determine the
semantic classification standard, as illustrated in FIG. 11, with
reference to dictionaries, past systems and others. Secondly, one or
more workers give one or more semantic labels to each of the nouns in
the morpheme dictionary 703 on the basis of the determined semantic
classification standard. Further, one or more workers classify each of
the verbs in the morpheme dictionary 703 into one or more subsheets
different in the case pattern and the regulation information such as the
rule, voice and phase, and successively state the case pattern information
and the other regulation information at every case subsheet as shown in
FIG. 13. If the failure of the semantic classification standard has been
found at the stage of the co-occurrence dictionary construction, the
addition to the semantic classification standard and the change of the
semantic classification standard can be performed. Further, a customary
and special co-occurrence relation such as " " is directly stated as
an exception in the verbal case dictionary and exception-processed prior
to the aforementioned semantic analysis or after a failure of the
aforementioned semantic analysis. The updating of the co-occurrence
dictionary is also effected by hand to take a matching with the
construction members of the co-occurrence dictionary totally taking into
account the semantic classification standard and the contents of the
co-occurrence dictionary built hitherto. For a large-scale updating, the
addition and change of the semantic classification standard are generally
made.
There is a problem which arises with such a conventional method, however,
in that there is no systematic and objective method for the construction
and updating of the co-occurrence dictionary, and hence the construction
and updating of the co-occurrence dictionary greatly depend upon the
know-how and skill of the language specialist or the like. That is, since
the building method of the semantic label system is not clear, the kind
and interpretation of the semantic label are required to be set by hand of
the specialist before building the noun semantic dictionary and the verbal
case dictionary, and therefore the addition and change of the system are
required in the actual dictionary construction and analysis because the
semantic label system is rough and insufficient in kind. Further, since
the interpretation of each of the semantic labels cannot be made clear,
when a large-scale dictionary is built by a plurality of persons, it is
difficult to give an adequate set of semantic labels to
each word, and discrepancies of interpretation occur between the workers.
In addition, in the case where the end user uses a computer application
system including a semantic analysis system and registers an unknown word,
it is difficult for the end user to understand the semantic label system of
the system well enough to give adequate semantic labels, so that the end
user cannot easily update the co-occurrence dictionary.
In addition, there are several problems in accuracy of the co-occurrence
analysis and semantic analysis. First, since it is difficult to
accurately build the co-occurrence dictionary, the semantic labels are
coarse, and particularly the accuracy of the co-occurrence analysis between
an abstract noun and the case slot thereof deteriorates. For
example, words pronounced as " " are above 20 in number and are
abstract nouns, and hence difficulty is encountered to convert them into
kanji in accordance with the conventional co-occurrence analysis.
Moreover, difficulty is encountered to accurately determine the case
frame, which is a principal portion of the semantic analysis, and the
priority thereof.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a method of
systematically and accurately building a co-occurrence dictionary, a
method of easily performing the consistent updating of the co-occurrence
dictionary, a co-occurrence analysis method which is capable of accurately
calculating the degree of the co-occurrence, and a semantic analysis method
which is capable of numerically and accurately calculating the ranking of
the priority between the competitive interpretations.
According to the present invention, the co-occurrence dictionary building
method includes a process for calculating three kinds of co-occurrence
information and a real number vector corresponding to each category, the
co-occurrence dictionary updating method includes a process for selecting
the peer lexicon or phrase of the co-occurrence for the additional
co-occurrence information and a process for calculating a real number
vector corresponding to an additional word on the basis of the additional
co-occurrence information, the co-occurrence analysis method includes a
process for calculating in real number the degree of the co-occurrence on
the basis of the real number vectors corresponding to two categories to be
checked in the co-occurrence relation, and the semantic analysis method
includes a process for indicating, by a numerical value, the propriety of
the interpretation on the basis of the degree of each co-occurrence.
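As a minimal sketch of the co-occurrence analysis just summarized, the degree of co-occurrence can be taken as the inner product of the two real number vectors, and its sign as the decision. The function names are hypothetical. Treating an inner product of exactly zero as "no co-occurrence" is an assumption of this sketch, since only the positive and negative cases are specified.

```python
from typing import Sequence


def degree(u: Sequence[float], v: Sequence[float]) -> float:
    """Degree of co-occurrence: inner product of two real number vectors."""
    if len(u) != len(v):
        raise ValueError("both vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def co_occur(u: Sequence[float], v: Sequence[float]) -> bool:
    """Positive inner product -> co-occur; otherwise -> do not co-occur.

    Assumption: a zero inner product is treated as no co-occurrence.
    """
    return degree(u, v) > 0
```

The absolute value of `degree` can additionally serve as a confidence score when several competing interpretations have to be ranked.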
More specifically, as illustrated in FIG. 1, for building the co-occurrence
dictionary describing whether the phrases belonging to two categories
in a dictionary containing phrases of the natural language which is an
object co-occur in one sentence, phrases are selected as a group of phrases
11 from a group of phrases 1 comprising all the phrases belonging to the
first category in the dictionary and phrases are selected as a group of
phrases 21 from a group of phrases 2 comprising all the phrases belonging
to the second category in the dictionary, and there are prepared three
kinds of co-occurrence information: first co-occurrence information
describing whether each phrase belonging to the phrase group 11 and
each phrase belonging to the phrase group 21 co-occur in one sentence of
the object language, second co-occurrence information describing
whether each phrase belonging to a group of phrases 12 comprising all the
phrases which do not belong to the phrase group 11 in the phrase group 1
and each phrase belonging to the phrase group 21 co-occur in
one sentence of the object language, and third co-occurrence information
describing whether each phrase belonging to a group of phrases 22
comprising all the phrases which do not belong to the phrase group 21 in
the phrase group 2 and each phrase belonging to the phrase group 11
co-occur in one sentence of the object language.
Secondly, the first co-occurrence information is arranged such that each
phrase belonging to the phrase group 11 corresponds to a real number
vector with a dimension below the common maximum dimension and each phrase
belonging to the phrase group 21 corresponds to a real number vector
with a dimension below the common maximum dimension, and the value of real
number vector corresponding to each phrase in the phrase group 11 and the
value of the real number vector corresponding to each phrase in the phrase
group 21 are calculated on the basis of the first co-occurrence
information so that the number of sets of two phrases that the value of
the inner product of the real number vector corresponding to the phrase 1
and the real number vector corresponding to the phrase 2 becomes positive
in the case of describing, in the first co-occurrence information, that a
phrase 1 belonging to the phrase group 11 and a phrase 2 belonging to the
phrase group 21 co-occur in one sentence and the value of the inner
product of the real number vector corresponding to the phrase 1 and the
real number vector corresponding to the phrase 2 becomes negative in the
case of describing, in the first co-occurrence information, that the
phrase 1 belonging to the phrase group 11 and the phrase 2 belonging to
the phrase group 21 do not co-occur in one sentence becomes the greatest
of all the numbers of sets each comprising a phrase(s) belonging to the
phrase group 11 and a phrase belonging to the phrase group 21. Further,
the second co-occurrence information is arranged such that each phrase
belonging to the phrase group 12 corresponds to a real number vector with
a dimension below the maximum dimension, and the value of the real number
vector corresponding to each phrase in the phrase group 12 is calculated
on the basis of the second co-occurrence information so that the number of
sets of two phrases that the value of the inner product of the real number
vector corresponding to a phrase 3 belonging to the phrase group 12 and
the real number vector corresponding to a phrase 4 belonging to the phrase
group 21 and calculated on the basis of the first co-occurrence
information becomes positive in the case of describing, in the second
co-occurrence information, that the phrase 3 and the phrase 4 co-occur in
one sentence and the value of the inner product of the real number vector
corresponding to the phrase 3 and the real number vector corresponding to
the phrase 4 becomes negative in the case of describing, in the second
co-occurrence information, that the phrase 3 and the phrase 4 do not
co-occur in one sentence becomes the largest of all the numbers of sets
each comprising a phrase(s) belonging to the phrase group 12 and a phrase
belonging to the phrase group 21. Still further, the third co-occurrence
information is arranged such that each phrase belonging to the phrase
group 22 corresponds to a real number vector with a dimension below the
maximum dimension, and the value of the real number vector corresponding
to each phrase in the phrase group 22 is calculated on the basis of the
third co-occurrence information so that the number of sets of two phrases
that the inner product of the real number vector corresponding to a phrase 5
belonging to the phrase group 11 and calculated on the basis of the first
co-occurrence information and the real number vector corresponding to a
phrase 6 belonging to the phrase group 22 becomes positive in the case of
describing, in the third co-occurrence information, that the phrase 5 and
the phrase 6 co-occur in one sentence and on the other hand the inner
product of the real number vector corresponding to the phrase 5 calculated
on the basis of the first co-occurrence information and the real number
vector corresponding to the phrase 6 becomes negative in the case of
describing, in the third co-occurrence information, that the phrase 5 and
the phrase 6 do not co-occur in one sentence becomes the greatest of all
the numbers of sets each comprising a phrase(s) belonging to the phrase
group 11 and a phrase belonging to the phrase group 22. Thus, the
co-occurrence information is calculated in real number vector form with
respect to all the phrases of the phrase group 1 and the phrase group 2.
Further, for calculating the real number vector corresponding to each
phrase on the basis of the first co-occurrence information, the first
co-occurrence information is corrected by exceptionally reversing the
decision of the co-occurrence with respect to a portion of the first
co-occurrence information so that the number of sets of two phrases that
the value of the inner product of the real number vector corresponding to the
phrase 1 belonging to the phrase group 11 and the real number vector
corresponding to the phrase 2 belonging to the phrase group 21 becomes
positive in the case of describing that the phrase 1 and the phrase 2
co-occur in one sentence and the value of the inner product of the real
number vector corresponding to the phrase 1 and the real number vector
corresponding to the phrase 2 becomes negative in the case of describing,
in the first co-occurrence information, that the phrase 1 and the phrase 2
do not co-occur in one sentence is above a constant ratio (rate) to the
number of all sets each comprising a phrase(s) belonging to the phrase group 11
and a phrase(s) belonging to the phrase group 21, and this corrected first
co-occurrence information is used as the first co-occurrence information
and the co-occurrence information is calculated in real number vector
form with respect to all the phrases of the phrase group 1 and the phrase
group 2. Thus, the co-occurrence information is calculated in the real
number vector form and in exception information form.
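The sign criterion used throughout the above paragraph can be illustrated with a small sketch. It assumes that the co-occurrence information is held as a matrix of +1 (co-occur) and -1 (do not co-occur) entries and that the phrase vectors are rows of NumPy arrays; the function name `sign_agreement` and the toy data are hypothetical and are not taken from this specification. Pairs whose inner-product sign contradicts the stored decision are returned as candidates for the exception information.

```python
import numpy as np

def sign_agreement(first_vecs, second_vecs, cooc):
    """Count phrase pairs whose inner-product sign matches the co-occurrence decision.

    first_vecs:  (n, q) array, one real number vector per phrase of the first group
    second_vecs: (v, q) array, one real number vector per phrase of the second group
    cooc:        (n, v) array of +1 (co-occur) / -1 (do not co-occur); assumed encoding
    """
    inner = first_vecs @ second_vecs.T      # all pairwise inner products
    agree = (inner * cooc) > 0              # a positive product means the signs match
    exceptions = [(int(i), int(j)) for i, j in np.argwhere(~agree)]
    return int(agree.sum()), float(agree.mean()), exceptions

# Toy data: 3 first-category phrases, 2 second-category phrases, 2-dimensional vectors.
noun_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
case_vecs = np.array([[1.0, -1.0], [-1.0, 1.0]])
cooc = np.array([[1, -1], [-1, 1], [1, -1]])

count, ratio, exceptions = sign_agreement(noun_vecs, case_vecs, cooc)
```

In this toy case the third phrase has a zero inner product with both second-category phrases, so its two pairs are reported as exceptions and `ratio` is the fraction that would be compared against the constant ratio.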
In addition, for updating the co-occurrence dictionary, in the
co-occurrence dictionary built in accordance with the above-described
method or the like which describes as to whether each phrase belonging to
a first category and each phrase belonging to a second category co-occur
in one sentence in a dictionary containing phrases of the object natural
language and describing each phrase in a real number vector form, when
adding a new phrase 7 belonging to the first category to the
aforementioned co-occurrence dictionary, a group of phrases 23 consisting
of N phrases in the above-mentioned dictionary and belonging to the second
category and being above the maximum dimension of the corresponding
vectors are selected so that the absolute value of the inner product of
the real number vectors corresponding to every two phrases of the N phrases
is below a given constant value, and additional co-occurrence information
indicative of whether the N phrases and the phrase 7 co-occur in one
sentence of the object language is added and the phrase 7 is arranged to
correspond to a real number vector having a dimension below the
aforementioned maximum dimension, and further the real number vector V
corresponding to the phrase 7 is calculated so that the number M of sets
of two phrases that the value of the inner product of the real number
vector corresponding to the phrase 7 and the real number vector
corresponding to a phrase 8 belonging to the phrase group 23 becomes
positive in the case of describing in the additional co-occurrence
information that the phrase 8 and the phrase 7 co-occur in one sentence
and the value of the inner product of the real number vector corresponding
to the phrase 7 and the real number vector corresponding to the phrase 8
becomes negative in the case of describing in the additional co-occurrence
information that the phrase 8 and the phrase 7 do not co-occur in one
sentence has a maximum, and the calculated real number vector V is
added as the co-occurrence information for the phrase 7 to the
above-mentioned co-occurrence dictionary.
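As a rough illustration of the updating step, the sketch below fits a vector V for a new phrase from its co-occurrence decisions against N already-registered phrases and then counts the number M of sign-consistent pairs. Ordinary least squares is used here only as a stand-in for maximizing M; this passage states the objective but not the numerical procedure, so the fitting method, the +1/-1 encoding, the function name `fit_new_phrase_vector`, and the toy data are all assumptions. The selection of the N phrases is not shown.

```python
import numpy as np

def fit_new_phrase_vector(pivot_vecs, labels):
    """Estimate a vector V for a new phrase from N existing phrase vectors.

    pivot_vecs: (N, q) array of existing second-category vectors (phrase group 23)
    labels:     (N,) array of +1 / -1 additional co-occurrence decisions; assumed encoding
    Least squares is a stand-in for maximizing the agreement count M.
    """
    v, *_ = np.linalg.lstsq(pivot_vecs, labels.astype(float), rcond=None)
    # M: number of pairs whose inner-product sign matches the stated decision.
    m = int(np.sum((pivot_vecs @ v) * labels > 0))
    return v, m

# Toy data: N = 3 existing phrases with 2-dimensional vectors.
pivots = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([1, -1, 1])
V, M = fit_new_phrase_vector(pivots, labels)
```

If M falls short of the required number, the pairs with a mismatched sign are the ones that would have to be handled as exceptions.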
Further, in the case that the number M is below a predetermined number L, a
group of phrases 24 whose number is constant are selected from the second
category, and readditional co-occurrence information indicative of whether
the phrase group 24 and the phrase 7 co-occur in one sentence of the
object language is added so as to correct the additional co-occurrence
information so that the co-occurrence decisions of the additional
co-occurrence information and a portion of the readditional co-occurrence
information are exceptionally reversed, and the real number vector
corresponding to the phrase 7 is calculated on the basis of the corrected
additional co-occurrence information so that the number M becomes above the
predetermined number L, and the calculated real number vector is added as
the co-occurrence information for the phrase 7 to the above-mentioned
co-occurrence dictionary.
For performing the co-occurrence analysis to mechanically decide whether
phrases which are included in a dictionary comprising the object natural
language and which belong to two kinds of categories co-occur in one
sentence, there is used the co-occurrence dictionary which is built and
updated in accordance with the above-described method or a similar method
and which describes the co-occurrence information by the real number
vectors corresponding to the phrases. When the phrase 1 included in the
first category in the above-mentioned co-occurrence dictionary and the
phrase 2 included in the second category in the above-mentioned
co-occurrence dictionary appear at morphologically and syntactically
allowable positions in the sentence to be analyzed, in the case that the
inner product of the real number vector corresponding to the phrase 1 and
the real number vector corresponding to the phrase 2 is positive, a
decision is made such that the phrase 1 and the phrase 2 co-occur, and on
the other hand, in the case that the inner product of the real number
vector corresponding to the phrase 1 and the real number vector
corresponding to the phrase 2 is negative, a decision is made such that
the phrase 1 and the phrase 2 do not co-occur.
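The decision rule of the co-occurrence analysis reduces to a sign test on one inner product, as in the following minimal sketch. The function name `decide_cooccurrence` and the example vectors are hypothetical. The text does not say what happens when the inner product is exactly zero, so returning `None` in that case is an assumption.

```python
import numpy as np

def decide_cooccurrence(vec1, vec2):
    """Return True if the phrases are judged to co-occur, False if not."""
    inner = float(np.dot(vec1, vec2))
    if inner > 0:
        return True       # positive inner product: the phrases co-occur
    if inner < 0:
        return False      # negative inner product: the phrases do not co-occur
    return None           # zero is not covered by the text; left undecided (assumption)

# Hypothetical 2-dimensional vectors for one case slot and two nouns.
case_vec = np.array([0.5, 0.1])
accepted = decide_cooccurrence(np.array([0.8, -0.2]), case_vec)
rejected = decide_cooccurrence(np.array([-0.3, 0.9]), case_vec)
```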
Further, for performing the semantic analysis, when the phrase 1 and the
phrase 2 in the sentence to be analyzed have morphological and/or
syntactic ambiguities, the interpretation for which the absolute value of the
inner product of the real number vector corresponding to the phrase 2 and
the real number vector corresponding to the phrase 1, calculated in
accordance with the above-described co-occurrence analysis method, is the
greatest, or the group of interpretations for which the aforementioned
absolute value of the inner product is above a constant value, is used, and
the other interpretations are rejected.
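The selection among ambiguous interpretations can be sketched as follows, assuming each candidate interpretation is represented by a label together with the two real number vectors it pairs. The function name `select_interpretations`, the labels, and the vectors are hypothetical.

```python
import numpy as np

def select_interpretations(candidates, threshold=None):
    """Keep interpretations by the absolute value of the inner product.

    candidates: list of (label, vec1, vec2) tuples, one per interpretation
    threshold:  if None, keep the interpretation(s) with the greatest absolute value;
                otherwise keep every interpretation whose absolute value exceeds it
    """
    scored = [(label, abs(float(np.dot(v1, v2)))) for label, v1, v2 in candidates]
    if threshold is None:
        best = max(score for _, score in scored)
        return [label for label, score in scored if score == best]
    return [label for label, score in scored if score > threshold]

candidates = [
    ("reading A", [1.0, 0.0], [0.2, 0.5]),    # |inner product| = 0.2
    ("reading B", [0.0, 1.0], [0.2, 0.5]),    # |inner product| = 0.5
    ("reading C", [1.0, 1.0], [-0.3, -0.1]),  # |inner product| = 0.4
]
best_only = select_interpretations(candidates)
above_threshold = select_interpretations(candidates, threshold=0.3)
```

All interpretations not returned are treated as rejected.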
BRIEF DESCRIPTION OF THE DRAWINGS
The object and features of the present invention will become more readily
apparent from the following detailed description of the preferred
embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is an illustration for describing a co-occurrence dictionary
building and updating method and a co-occurrence and semantic analysis
method according to the present invention;
FIG. 2 is a block diagram showing an apparatus for building and updating a
co-occurrence dictionary for the verb and the noun in the Japanese
language;
FIG. 3 is a block diagram showing a Japanese language sentence analysis
apparatus according to a second embodiment of this invention;
FIG. 4 is a flow chart for describing an operation for the semantic
analysis according to the second embodiment of this invention;
FIG. 5 is a flow chart for describing the co-occurrence analysis according
to the second embodiment of this invention;
FIG. 6 shows a portion of the contents of a noun semantic dictionary in the
second embodiment;
FIG. 7 illustrates a portion of the contents of a verb semantic dictionary
in the second embodiment;
FIG. 8 is a block diagram showing a Japanese language sentence analysis
apparatus based on a conventional semantic analysis method;
FIG. 9 is a flow chart showing an operation for the conventional semantic
analysis;
FIG. 10 is a flow chart showing an operation for a conventional
co-occurrence analysis;
FIG. 11 shows a portion of a conventional semantic label system;
FIG. 12 shows a portion of the contents of a conventional noun semantic
label dictionary;
FIG. 13 illustrates a portion of the contents of a conventional verb case
dictionary; and
FIG. 14 shows one example of syntax trees.
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of this invention will be described hereinbelow with reference
to the drawings. In the embodiments, the object natural language is the
Japanese language, the first category is the noun and the second category
is the case of the verb.
FIG. 2 is a block diagram showing an apparatus according to a first
embodiment of this invention to build and update the co-occurrence
dictionary for the case of the verb and the noun in the Japanese language.
In FIG. 2, the numeral 101 represents a noun dictionary describing the
notation, reading and others of the noun in the Japanese language, 102
designates a verbal case pattern dictionary containing the typical noun
which can be included in the surface case pattern of the verb in the
Japanese language and the case slot, 103 denotes a pivot selecting means
for selecting an element, which is the axis (center), from elements of the
noun dictionary 101 and the verbal case pattern dictionary 102, and 104
depicts a question sentence producing means for producing a question
sentence, to be shown to the co-occurrence information inputting person,
on the basis of the element selected by the pivot selecting means 103.
Further, numeral 105 is a question sentence indicating means for showing
the co-occurrence information inputting person the question sentence
produced by the question sentence producing means 104, 106 designates a
co-occurrence information inputting means by which the co-occurrence
information inputting person inputs the co-occurrence information in accordance with
the indication of the question sentence indicating means 105, 107
represents a feature vector calculating means for calculating, on the
basis of the co-occurrence information from the co-occurrence information
inputting means 106 and the selection result from the pivot selecting
means 103, a feature vector to be given to each element, 108 denotes a
noun semantic dictionary for storing the noun dictionary information
including the feature vector of the noun outputted from the feature vector
calculating means 107, and 109 depicts a verb semantic dictionary for
storing the dictionary information of the case pattern of the verb
including the feature vector of the case of the verb outputted from
the feature vector calculating means 107.
Here, in the verbal case pattern dictionary 102, the surface case pattern
of each verb and the typical example of the noun are described in the form
of " / / ] [ / ] / ".
Secondly, a description will be made in terms of the operation of the
co-occurrence dictionary building apparatus thus arranged. Prior to the
description of the operation of the co-occurrence dictionary building
apparatus, the description of the formulas to be used for the description
of the operation thereof will first be made hereinbelow. From linear
algebra, an n-row and v-column matrix C having a rank p can be expressed as
the following equation (1) on the basis of a v-row and p-column orthogonal
matrix A and an n-row and p-column orthogonal matrix B.
##EQU1##
Accordingly, the original matrix C can be rewritten as the following
equation (2).
##EQU2##
where b_k and a_k are respectively the column vectors of the k-th
column of the matrices B and A.
In the aforementioned equation (2), λ is called the singular value
of the matrix C and the right side of the equation (2) is called the
spectral decomposition.
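The decomposition can be checked numerically with NumPy. The sketch below assumes that equations (1) and (2) are the usual singular value decomposition, that is, C equal to B times a diagonal matrix of singular values times the transpose of A, and the corresponding sum of rank-one terms; the equations themselves are not reproduced in this text, so that reading is an assumption, and the matrix values are toy data.

```python
import numpy as np

# Toy n-row, v-column matrix (n = 4, v = 3).
C = np.array([[1.0, -1.0, 1.0],
              [-1.0, 1.0, 1.0],
              [1.0, 1.0, -1.0],
              [1.0, -1.0, -1.0]])

B_full, singular_values, A_t = np.linalg.svd(C, full_matrices=False)
p = int(np.linalg.matrix_rank(C))          # rank p of the matrix C

# Keep the p leading components: B is n x p, A is v x p, lam holds the singular values.
B, lam, A = B_full[:, :p], singular_values[:p], A_t[:p].T

# Spectral decomposition: C as a sum of rank-one terms lam[k] * b_k * a_k^T.
C_rebuilt = sum(lam[k] * np.outer(B[:, k], A[:, k]) for k in range(p))
```

Here the columns of `B` and `A` are orthonormal, which is the sense in which the matrices are called orthogonal, and `C_rebuilt` agrees with `C` up to rounding error.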
The spectral decomposition of the matrix C has the following properties.
Now, let it be assumed that the matrix C with the rank p is approximated
by a matrix D having a rank q smaller than the rank p. If the metric of
the poorness of the approximation is measured on the basis of the
Euclidean distance in accordance with the following equation (3), the
matrix D which minimizes the metric δ of the poorness of the
approximation can be given by the following equation (4) on the basis of
the partial sum of the spectral decomposition of the matrix C.
##EQU3##
where c_ij and d_ij are the elements of the i row