WikiPatents - Community Patent Review
Building and updating of co-occurrence dictionary and analyzing of co-occurrence and meaning    
United States Patent 5406480
Link to this page: http://www.wikipatents.com/5406480.html
Inventor(s): Kanno; Yuji (Kawasaki, JP)
Abstract: A co-occurrence dictionary is built through a process for calculating three kinds of co-occurrence information and a real number vector corresponding to each category. The co-occurrence dictionary is updated through a process for selecting the opposite phrase of the co-occurrence for the additional co-occurrence information and a process for calculating a real number vector corresponding to an additional word on the basis of the additional co-occurrence information. A co-occurrence analysis is effected through a process for calculating in real number the degree of the co-occurrence on the basis of the real number vectors corresponding to two categories to be checked in the co-occurrence relation, and a semantic analysis is effected through a process for indicating, by a numerical value, the propriety of the interpretation on the basis of the degree of each co-occurrence.
   














Owner/Assignee: Matsushita Electric Industrial Co., Ltd. (Osaka, JP)
Publication Date: April 11, 1995
Application Number: 08/004,029
Filing Date: January 15, 1993
US Classification: 704/10 704/9
Int'l Classification: G06F 015/38 G06F 015/40
Examiner: Hayes; Gail O.
Assistant Examiner: Chung-Trans; Xuong M.
Attorney/Law Firm: Lowe, Price, LeBlanc & Becker
Priority Data: Jan 17, 1992 [JP] 4-006454
USPTO Field of Search: 364/419.02 364/419.08 364/419.11
References
U.S. References
5227971, Nakajima, 704/2, Jul. 1993
4916614, Kaji, 704/2, Apr. 1990
Claims
 


What is claimed is:

1. A computer implemented method, implemented by a programmed computer, of building a co-occurrence dictionary describing whether phrases co-occur in one sentence, the phrases belonging to first and second categories in a dictionary containing phrases of a natural language which is an object, said method comprising using the computer to build the co-occurrence dictionary by implementing the steps of:

selecting, as a first sub-group of phrases (11), phrases from a first group of phrases (1) comprising all phrases belonging to said first category in said dictionary;

selecting, as a second sub-group of phrases (21), phrases from a second group of phrases (2) comprising all phrases belonging to said second category in the dictionary;

preparing first co-occurrence information describing whether each phrase belonging to the first sub-group (11) and each phrase belonging to the second sub-group (21) co-occur in one sentence of the object language;

preparing second-co-occurrence information describing whether each phrase belonging to a third sub-group of phrases (12), comprising all the phrases in the first group (1) which do not belong to the first sub-group (11) and each phrase belonging to the second sub-group (21), co-occur in one sentence of the object language;

preparing third co-occurrence information describing whether each phrase belonging to a fourth sub-group of phrases (22), comprising all the phrases in the second group (2) which do not belong to the second sub-group (21) and each phrase belonging to the first sub-group (11) co-occur in one sentence of the object language;

arranging the first co-occurrence information such that each phrase belonging to the first sub-group (11) corresponds to a real number vector with a dimension below a common maximum dimension and each phrase belonging to the second sub-group (21) corresponds to a real number vector with a dimension below the common maximum dimension;

calculating a value of the real number vector corresponding to each phrase in the first sub-group (11) and a value of the real number vector corresponding to each phrase in the second sub-group (21) on the basis of the first co-occurrence information so that the number of sets of two phrases, wherein:

a value of an inner product of the real number vector corresponding to a first phrase and the real number vector corresponding to a second phrase becomes positive when describing, in the first co-occurrence information, that a first phrase belonging to said first sub-group (11) and a second phrase belonging to said second sub-group (21) co-occur in one sentence, and

the value of an inner product of the real number vector corresponding to said first phrase and the real number vector corresponding to said second phrase becomes negative when describing, in said first co-occurrence information, that said first phrase belonging to said first sub-group (11) and said second phrase belonging to said second sub-group (21) do not co-occur in one sentence,

becomes the greatest of all the numbers of sets each comprising phrases belonging to said first sub-group (11) and phrases belonging to the second sub-group (21);

arranging said second co-occurrence information such that each phrase belonging to said third sub-group (12) corresponds to a real number vector with a dimension below the maximum dimension;

calculating a value of the real number vector corresponding to each phrase in said third sub-group (12) on the basis of said second co-occurrence information so that the number of sets of two phrases, wherein:

a value of the inner product of the real number vector corresponding to a third phrase belonging to said third sub-group (12) and the real number vector corresponding to a fourth phrase belonging to said second sub-group (21) and calculated on the basis of said first co-occurrence information becomes positive when describing, in said second co-occurrence information, that the third phrase and the fourth phrase co-occur in one sentence, and

a value of an inner product of the real number vector corresponding to the third phrase and the real number vector corresponding to the fourth phrase becomes negative when describing, in said second co-occurrence information, that the third phrase and the fourth phrase do not co-occur in one sentence,

becomes the largest of all the numbers of sets each comprising a phrase belonging to said third sub-group (12) and a phrase belonging to said second sub-group (21);

arranging said third co-occurrence information such that each phrase belonging to the fourth sub-group (22) corresponds to a real number vector with a dimension below the maximum dimension; and

calculating a value of the real number vector corresponding to each phrase in the fourth sub-group (22) on the basis of said third co-occurrence information so that the number of sets of two phrases, wherein:

the inner product of the real number vector corresponding to a fifth phrase belonging to said first sub-group (11) and calculated on the basis of said first co-occurrence information and the real number vector corresponding to a sixth phrase belonging to the fourth sub-group (22) becomes positive when describing, in the third co-occurrence information, that the fifth phrase and the sixth phrase co-occur in one sentence and, on the other hand,

the inner product of the real number vector corresponding to the fifth phrase calculated on the basis of the first co-occurrence information and the real number vector corresponding to the sixth phrase becomes negative when describing, in the third co-occurrence information, that the fifth phrase and the sixth phrase do not co-occur in one sentence,

becomes the greatest of all the numbers of sets each comprising a phrase belonging to said first sub-group (11) and a phrase belonging to said fourth sub-group (22).

2. A method as claimed in claim 1, comprising the further step of:

correcting said first co-occurrence information by exceptionally reversing the decision of the co-occurrence with respect to a portion of said first co-occurrence information so that the number of sets of two phrases, wherein:

the value of the inner product of the real number vector corresponding to said first phrase belonging to said first sub-group (11) and the real number vector corresponding to said second phrase belonging to said second sub-group (21) becomes positive when describing that said first phrase and said second phrase co-occur in one sentence and the value of the inner product of the real number vector corresponding to said first phrase and the real number vector corresponding to said second phrase becomes negative when describing, in the first co-occurrence information, that the first phrase and the second phrase do not co-occur in one sentence,

is above a constant rate to the number of all sets each comprising phrases belonging to the first sub-group (11) and phrases belonging to said second sub-group (21), and the corrected first co-occurrence information is used as said first co-occurrence information and the co-occurrence information are calculated in real number vector form with respect to all the phrases of said first group of phrases (1) and said second group of phrases (2) so as to calculate the co-occurrence information in the real number vector form and in exception information form.

3. A method as claimed in claim 1, further comprising the steps of:

when a new seventh phrase belonging to a first category is added to the built co-occurrence dictionary which describes whether each phrase belonging to said first category and each phrase belonging to a second category co-occur in one sentence in a dictionary containing phrases of a natural language,

selecting a first select group of phrases in said dictionary, which consists of N phrases, of phrases belonging to said second category and which are above a maximum dimension of the corresponding vectors and in which the absolute value of an inner product of the real number vectors corresponding to every two phrases is below a constant value, so as to give additional co-occurrence information indicative of whether the N phrases and the seventh phrase co-occur in one sentence of said language;

arranging that said seventh phrase corresponds to a real number vector having a dimension below the maximum dimension; and

calculating a real number vector corresponding to said seventh phrase so that the number M of sets of two phrases, wherein

a value of the inner product of the real number vector corresponding to said seventh phrase and the real number vector corresponding to an eighth phrase belonging to the first select group of phrases becomes positive when describing, in said additional co-occurrence information, that said eighth phrase and said seventh phrase co-occur in one sentence and the value of the inner product of the real number vector corresponding to said seventh phrase and the real number vector corresponding to said eighth phrase becomes negative when describing, in said additional co-occurrence information, that the eighth phrase and the seventh phrase do not co-occur in one sentence,

has a maximum so that the calculated real number vector is added as the co-occurrence information for the seventh phrase to said co-occurrence dictionary.

4. A method as claimed in claim 3, comprising the further step of, when the number M is below a predetermined number L,

selecting a second select group of phrases whose number is constant from the second category to give readditional co-occurrence information indicative of whether said second select group of phrases and said seventh phrase co-occur in one sentence of the language so as to correct said additional co-occurrence information so that the co-occurrence decisions of said additional co-occurrence information and a portion of said readditional co-occurrence information are exceptionally reversed, and calculating the real number vector corresponding to said seventh phrase on the basis of the corrected additional co-occurrence information so that the number M becomes above the predetermined number L, and adding the calculated real number vector as the co-occurrence information for said seventh phrase to said co-occurrence dictionary.

5. A method as claimed in claim 3, further comprising the step of performing a co-occurrence analysis using said co-occurrence dictionary to automatically decide whether phrases which are included in said dictionary and which belong to first and second categories co-occur in one sentence, wherein, when said first phrase included in said first category in said co-occurrence dictionary and said second phrase included in said second category in said co-occurrence dictionary appear at positions, allowable on morpheme and syntax, in the sentence to be analyzed, if the inner product of the real number vector corresponding to the first phrase and the real number vector corresponding to the second phrase is positive, a decision is made that said first phrase and said second phrase co-occur, and on the other hand, if the inner product of the real number vector corresponding to said first phrase and the real number vector corresponding to said second phrase is negative, a decision is made that said first phrase and said second phrase do not co-occur.

6. A method as claimed in claim 5, wherein, when said first phrase and said second phrase in the sentence to be analyzed are vague on a morpheme and syntax, using the interpretation that the absolute value of the inner product of the real number vector corresponding to said second phrase and the real number vector corresponding to said first phrase calculated in accordance with the co-occurrence analysis method is the greatest value or a group of interpretations that the absolute value of the inner product is above a constant value, and rejecting the other interpretations.

7. A method as claimed in claim 6, wherein the natural language is the Japanese language, said first category is nouns, and said second category is deep cases of a predicate.

8. A method as claimed in claim 7, further comprising the steps of: calculating an inner product Q of a real number vector corresponding to each deep case in a plurality of deep case patterns P of the predicate in a sentence S to be analyzed and a real number vector corresponding to a noun which is applied to each deep case in the sentence S; and

adding a weight proper to each deep case in the deep case patterns P to the inner product Q to adopt the deep case pattern, that the added value E is the greatest, as an interpretation for the deep case pattern of the predicate in the sentence S.

9. A method as claimed in claim 7, further comprising the steps of: calculating an inner product Q of a real number vector corresponding to each deep case in a plurality of deep case patterns P of the predicate in a sentence S to be analyzed and a real number vector corresponding to a noun which is applied to each deep case in the sentence S; and

adding a weight proper to each deep case in the deep case patterns P to the inner product Q to adopt all the deep case patterns, that the added value E is above a predetermined constant value, as an interpretation for the deep case pattern of the predicate in the sentence S.
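
The selection rule of claims 8 and 9 can be illustrated with a minimal sketch. It assumes one possible reading of the claim language: the added value E of a deep case pattern is the sum, over its deep cases, of the inner product Q plus the weight proper to that case. The names `pattern_score`, `best_pattern`, `patterns_above`, the dictionary layout, and the example vectors and weights are all hypothetical and do not come from the patent.

```python
from typing import Dict, List

Vector = List[float]
Pattern = Dict[str, str]  # deep case -> noun applied to that case (assumed layout)


def inner(u: Vector, v: Vector) -> float:
    """Inner product of two real number vectors of equal dimension."""
    return sum(a * b for a, b in zip(u, v))


def pattern_score(pattern: Pattern,
                  case_vecs: Dict[str, Vector],
                  noun_vecs: Dict[str, Vector],
                  weights: Dict[str, float]) -> float:
    """Added value E: sum over deep cases of (inner product Q + case weight).

    Summing over the cases is an assumption; the claims do not spell it out.
    """
    return sum(inner(case_vecs[case], noun_vecs[noun]) + weights[case]
               for case, noun in pattern.items())


def best_pattern(patterns: List[Pattern], case_vecs, noun_vecs, weights) -> Pattern:
    """Claim 8 style: adopt the single pattern whose E is the greatest."""
    return max(patterns,
               key=lambda p: pattern_score(p, case_vecs, noun_vecs, weights))


def patterns_above(patterns: List[Pattern], case_vecs, noun_vecs, weights,
                   threshold: float) -> List[Pattern]:
    """Claim 9 style: adopt every pattern whose E is above a constant value."""
    return [p for p in patterns
            if pattern_score(p, case_vecs, noun_vecs, weights) > threshold]


# Hypothetical example data (not taken from the patent).
CASE_VECS = {"agent": [1.0, 0.0], "object": [0.0, 1.0]}
NOUN_VECS = {"dog": [1.0, -1.0], "bone": [-1.0, 1.0]}
WEIGHTS = {"agent": 0.5, "object": 0.0}
P1 = {"agent": "dog", "object": "bone"}
P2 = {"agent": "bone", "object": "dog"}
```

With the example data, `best_pattern([P1, P2], CASE_VECS, NOUN_VECS, WEIGHTS)` picks `P1`, because both of its case-noun pairs have positive inner products.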
Description
 


BACKGROUND OF THE INVENTION

The present invention relates generally to natural language processing techniques to be used in computer applied systems such as word processors, machine translations and interactive systems, and more particularly to an apparatus and method of building and updating a semantic analysis co-occurrence dictionary and an apparatus and method of analyzing co-occurrences and meanings.

Recently, various computer application systems have been researched and developed on the basis of natural language processing techniques, and some of these systems are gradually becoming established in our language culture. Particularly, in Japan, the progress of the kana-kanji conversion technique allows easy input of sentences comprising a mixture of kanji and kana to computers, whereby text processing software on Japanese word processors and personal computers is widely used. However, we still do not have an effective means to represent and process the meaning of words and the semantic relation between words for selecting a correct word from homonyms in kana-kanji conversion. At the present stage, it is common practice in machine translation and the like to process the meaning of words in accordance with the semantic analysis technique based on the case grammar described by C. J. Fillmore and to use semantic labels in the co-occurrence analysis. A description will be made hereinbelow with reference to FIGS. 8 to 13 in terms of a conventional co-occurrence analysis method using the semantic label, a conventional semantic analysis method using this conventional co-occurrence analysis method, and a conventional co-occurrence dictionary building and updating method necessary for these analyses.

FIG. 8 is a block diagram showing one example of a Japanese sentence analysis apparatus based on the conventional semantic analysis method. In FIG. 8, numeral 701 represents an inputting means for inputting a sentence to be analyzed, 702 designates a morphological analysis means for dividing the inputted sentence into a list (string) of morphemes, 703 denotes a morpheme dictionary to be retrieved by the morphological analysis means 702 when performing the morphological segmentation, 704 depicts a connection rule to be used by the morphological analysis means 702 when performing the connection test between the morphemes, 705 indicates a syntactic analysis means for inputting the list (string) of morphemes from the morphological analysis means 702 to analyze the syntactic structure and output the syntactic tree, 706 represents a context-free grammar rule to be used by the syntactic analysis means 705 when performing the syntactic structure analysis, 707 designates a semantic analysis means for inputting the syntactic tree from the syntactic analysis means 705 to perform the case analysis and output the semantic structure, 708 denotes a verbal case dictionary to be used by the semantic analysis means 707, 709 depicts a noun semantic label dictionary to be used by the semantic analysis means 707, and 710 indicates a semantic structure storing means for storing a semantic structure centering on the case frame produced by the semantic analysis means 707, which is referred to and operated by an external apparatus. The noun semantic label dictionary 709 to be used for the semantic analysis describes the meaning of each of the nouns within the morpheme dictionary 703 with one or more semantic labels in accordance with the semantic classification standard as shown in FIG. 11 and has the contents as shown in FIG. 12. Further, the verbal case dictionary 708 divides the meaning of each of the verbs within the morpheme dictionary 703 into one or more case patterns and describes them as illustrated in FIG. 13. As in the noun semantic label dictionary 709, the meaning of the noun co-occurring with each case slot is described with one or more semantic labels in accordance with the semantic classification standard shown in FIG. 11.

The operation of the conventional sentence analysis apparatus thus arranged will be described hereinbelow in terms of the case of analyzing the typed sentence "A B C V ". First, the typed sentence "A B C V " is supplied as a character string through the inputting means 701 to the morphological analysis means 702. The morphological analysis means 702 performs the morphological segmentation process from the beginning of the sentence toward the end of the sentence. If a morpheme coincident with a portion of the inputted character string is found by the retrieval of the morpheme dictionary 703, the possibility of connection to the morpheme immediately before the found portion is checked through the connection rule 704. If the connection is possible, the morphological segmentation process is further effected on the portion of the inputted character string subsequent to the found portion. If a plurality of coincident morphemes are found by the retrieval of the morpheme dictionary 703, priority is given among them in accordance with a heuristic method such as the maximum coincidence and the minimum clause number. Thus, the following list (string) of morphemes up to the end of the sentence can be obtained.

"A (noun), (case post-positional particle), B (noun), (case post-positional particle), C (noun), (case post-positional particle), V (verb), (ending of verb), (ending of verb), (past auxiliary verb)"

The aforementioned morpheme string is supplied to the syntactic analysis means 705 so as to analyze the syntactic structure to obtain a syntactic tree as illustrated in FIG. 14. From this syntactic tree, it is understood that all of the three post-positional phrases "A ", "B " and "C " are connected or applied to the verb phrase "V ".
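
The segmentation step performed by the morphological analysis means 702 can be sketched roughly as follows. This is a simplified illustration, not the apparatus itself: it assumes a dictionary that maps each surface string to a single part of speech, uses only the longest-match heuristic (the minimum clause number heuristic is omitted), and backtracks when the connection test fails. The names `segment`, `dictionary`, and `can_connect` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Morpheme = Tuple[str, str]  # (surface string, part of speech)


def segment(text: str,
            dictionary: Dict[str, str],
            can_connect: Callable[[str, str], bool],
            prev_pos: Optional[str] = None) -> Optional[List[Morpheme]]:
    """Segment `text` from left to right into dictionary morphemes.

    Returns None when no segmentation passes the connection test.
    """
    if not text:
        return []  # reached the end of the sentence
    # Dictionary retrieval: every entry that matches the head of the text.
    candidates = [(s, pos) for s, pos in dictionary.items() if text.startswith(s)]
    # Longest-match heuristic: try the longest coincident morpheme first.
    candidates.sort(key=lambda c: len(c[0]), reverse=True)
    for surface, pos in candidates:
        # Connection test against the immediately preceding morpheme.
        if prev_pos is not None and not can_connect(prev_pos, pos):
            continue
        rest = segment(text[len(surface):], dictionary, can_connect, pos)
        if rest is not None:
            return [(surface, pos)] + rest
    return None
```

A caller would supply the connection rule as a predicate, for example a lookup in a set of allowed (previous part of speech, next part of speech) pairs.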

The syntactic tree illustrated in FIG. 14 is led to the semantic analysis means 707 so as to perform the semantic analysis of the inputted sentence in accordance with the procedure illustrated in FIG. 9 which shows a procedure for the semantic analysis of a sentence "A B C V ". First, the case patterns of the verb "V" are obtained by retrieving the verbal case dictionary 708, and the semantic labels respectively corresponding to the nouns "A", "B" and "C" are obtained by retrieving the noun semantic label dictionary 709 (step 801). Secondly, it is checked, in accordance with the co-occurrence analysis procedure illustrated in FIG. 10, whether the case slot corresponding to the semantic label of the noun of each of the post-positional phrases co-occurs with respect to each of the case patterns of the verb V. That is, only the case patterns with which all the three nouns co-occur are selected as a candidate, and further the best case pattern is selected on the basis of the priority between the case patterns, the filling degree of the case slot and others so that information such as the tense and the voice is added to the selected case pattern which is in turn outputted as the semantic structure (steps 802 to 812).

In the co-occurrence analysis procedure, as illustrated in FIG. 10 which shows a procedure of the analysis as to whether or not the noun N, being the C case, co-occurs with the case pattern P of the verb V, it is first checked whether the C case is among the cases of the case pattern P (step 901). If the C case exists therein, it is checked whether there is a common semantic label between a group of semantic labels in the case slot of the C case of the case pattern P and a group of semantic labels of the noun N (step 902). If a common semantic label exists therebetween, the decision of co-occurrence is made (step 903), and if not, the decision of no co-occurrence is made (step 904). Further, if there is no C case in the cases of the case pattern P, it is checked whether the C case can be taken as an optional case such as the time and the place (step 905). If not, the decision of no co-occurrence is made (step 904). If so, the case slot information of the optional case, which does not depend on the verb, is retrieved so as to check whether there is a common semantic label between a group of semantic labels in the optional case slot and a group of semantic labels of the noun (step 906). If a common semantic label exists therebetween, the decision of co-occurrence is made (step 903). On the other hand, if not, the decision of no co-occurrence is made (step 904).
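
The conventional label-based check of FIG. 10 amounts to a set intersection test, which the following minimal sketch illustrates. The data layout (a case pattern as a mapping from case name to a set of semantic labels), the function name `co_occurs`, and the example labels are assumptions made for illustration only.

```python
from typing import Dict, Set

CasePattern = Dict[str, Set[str]]  # case name -> semantic labels allowed in its slot

# Hypothetical optional-case slots that do not depend on the verb.
OPTIONAL_CASE_SLOTS: CasePattern = {
    "time": {"TIME"},
    "place": {"LOCATION"},
}


def co_occurs(case: str,
              noun_labels: Set[str],
              case_pattern: CasePattern,
              optional_slots: CasePattern = OPTIONAL_CASE_SLOTS) -> bool:
    """Decide whether a noun filling `case` co-occurs with a verb's case pattern."""
    if case in case_pattern:                              # step 901
        # steps 902 to 904: any common semantic label means co-occurrence
        return bool(case_pattern[case] & noun_labels)
    if case not in optional_slots:                        # step 905
        return False                                      # step 904
    # step 906: compare against the verb-independent optional case slot
    return bool(optional_slots[case] & noun_labels)
```

Because the decision is purely binary, a noun whose labels only loosely fit a slot is treated the same as a perfect fit, which is the coarseness the description goes on to criticize.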

The above-mentioned verbal case dictionary 708 and noun semantic label dictionary 709 to be used for the sentence analysis apparatus are paired so as to construct the co-occurrence dictionary. Conventionally, this construction is effected entirely by hand. A description will be made hereinbelow in terms of the typical procedure of the construction of the co-occurrence dictionary. First, one or plural specialists determine the semantic classification standard, as illustrated in FIG. 11, with reference to dictionaries, past systems and others. Secondly, one or plural workers give one or more semantic labels to each of the nouns in the morpheme dictionary 703 on the basis of the determined semantic classification standard. Further, one or plural workers classify each of the verbs in the morpheme dictionary 703 into one or more subsheets different in the case pattern and the regulation information such as the rule, voice and phase, and successively state the case pattern information and the other regulation information for every case subsheet as shown in FIG. 13. If a failure of the semantic classification standard is found at the stage of the co-occurrence dictionary construction, additions and changes to the semantic classification standard can be made. Further, a customary and special co-occurrence relation such as " " is directly stated as an exception in the verbal case dictionary and exception-processed prior to the aforementioned semantic analysis or after a failure of the aforementioned semantic analysis. The updating of the co-occurrence dictionary is also effected by hand so as to remain consistent with the existing entries of the co-occurrence dictionary, taking into account the semantic classification standard and the contents of the co-occurrence dictionary built hitherto as a whole. For a large-scale updating, additions and changes to the semantic classification standard are generally made.

There is a problem which arises with such a conventional method, however, in that there is no systematic and objective method for the construction and updating of the co-occurrence dictionary, and hence the construction and updating of the co-occurrence dictionary greatly depend upon the know-how and skill of the language specialist or the like. That is, since the method of building the semantic label system is not clear, the kinds and interpretations of the semantic labels must be set by hand by the specialist before building the noun semantic dictionary and the verbal case dictionary, and additions and changes to the system are then required in the actual dictionary construction and analysis because the semantic label system is rough and insufficient in kind. Further, since the interpretation of each of the semantic labels cannot be made clear, when a large-scale dictionary is built by a plurality of persons it is difficult to adequately give a set of semantic labels to each word, and discrepancies of interpretation occur between the workers. In addition, when the end user uses a computer application system including a semantic analysis system and registers an unknown word, it is difficult for the end user to understand the semantic label system well enough to give adequate semantic labels, and hence it is difficult for the end user to easily update the co-occurrence dictionary.

In addition, there are several problems in the accuracy of the co-occurrence analysis and semantic analysis. First, since it is difficult to accurately build the co-occurrence dictionary, the semantic labels are rough, and particularly the accuracy of the co-occurrence analysis between an abstract noun and the case slot thereof deteriorates. For example, more than 20 words are pronounced as " " and are abstract nouns, and hence it is difficult to convert them into kanji in accordance with the conventional co-occurrence analysis. Moreover, it is difficult to accurately determine the case frame, which is a principal portion of the semantic analysis, and the priority thereof.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a method of systematically and accurately building a co-occurrence dictionary, a method of easily performing the consistent updating of the co-occurrence dictionary, a co-occurrence analysis method which is capable of accurately calculating the degree of the co-occurrence, and a semantic analysis method which is capable of numerically and accurately calculating the ranking of the priority between the competitive interpretations.

According to the present invention, the co-occurrence dictionary building method includes a process for calculating three kinds of co-occurrence information and a real number vector corresponding to each category, the co-occurrence dictionary updating method includes a process for selecting the peer lexicon or phrase of the co-occurrence for the additional co-occurrence information and a process for calculating a real number vector corresponding to an additional word on the basis of the additional co-occurrence information, the co-occurrence analysis method includes a process for calculating in real number the degree of the co-occurrence on the basis of the real number vectors corresponding to two categories to be checked in the co-occurrence relation, and the semantic analysis method includes a process for indicating, by a numerical value, the propriety of the interpretation on the basis of the degree of each co-occurrence.

More specifically, as illustrated in FIG. 1, the co-occurrence dictionary describes whether phrases belonging to two categories, in a dictionary containing phrases of the object natural language, co-occur in one sentence. To build it, a group of phrases 11 is selected from a group of phrases 1 comprising all the phrases belonging to the first category in the dictionary, and a group of phrases 21 is selected from a group of phrases 2 comprising all the phrases belonging to the second category in the dictionary. Three kinds of co-occurrence information are then prepared: first co-occurrence information describing whether each phrase belonging to the phrase group 11 and each phrase belonging to the phrase group 21 co-occur in one sentence of the object language; second co-occurrence information describing whether each phrase belonging to a group of phrases 12, comprising all the phrases in the phrase group 1 which do not belong to the phrase group 11, and each phrase belonging to the phrase group 21 co-occur in one sentence of the object language; and third co-occurrence information describing whether each phrase belonging to a group of phrases 22, comprising all the phrases in the phrase group 2 which do not belong to the phrase group 21, and each phrase belonging to the phrase group 11 co-occur in one sentence of the object language.
Secondly, for the first co-occurrence information, each phrase belonging to the phrase group 11 is made to correspond to a real number vector with a dimension below a common maximum dimension, and each phrase belonging to the phrase group 21 is likewise made to correspond to a real number vector with a dimension below the common maximum dimension. The values of these real number vectors are calculated on the basis of the first co-occurrence information so as to maximize, over all sets each comprising a phrase belonging to the phrase group 11 and a phrase belonging to the phrase group 21, the number of sets of two phrases which satisfy the following condition: the inner product of the real number vector corresponding to a phrase 1 belonging to the phrase group 11 and the real number vector corresponding to a phrase 2 belonging to the phrase group 21 is positive when the first co-occurrence information describes that the phrase 1 and the phrase 2 co-occur in one sentence, and is negative when the first co-occurrence information describes that the phrase 1 and the phrase 2 do not co-occur in one sentence.
Further, for the second co-occurrence information, each phrase belonging to the phrase group 12 is made to correspond to a real number vector with a dimension below the maximum dimension. The value of the real number vector corresponding to each phrase in the phrase group 12 is calculated on the basis of the second co-occurrence information so as to maximize, over all sets each comprising a phrase belonging to the phrase group 12 and a phrase belonging to the phrase group 21, the number of sets of two phrases which satisfy the following condition: the inner product of the real number vector corresponding to a phrase 3 belonging to the phrase group 12 and the real number vector corresponding to a phrase 4 belonging to the phrase group 21, the latter calculated on the basis of the first co-occurrence information, is positive when the second co-occurrence information describes that the phrase 3 and the phrase 4 co-occur in one sentence, and is negative when the second co-occurrence information describes that the phrase 3 and the phrase 4 do not co-occur in one sentence.
Still further, for the third co-occurrence information, each phrase belonging to the phrase group 22 is made to correspond to a real number vector with a dimension below the maximum dimension. The value of the real number vector corresponding to each phrase in the phrase group 22 is calculated on the basis of the third co-occurrence information so as to maximize, over all sets each comprising a phrase belonging to the phrase group 11 and a phrase belonging to the phrase group 22, the number of sets of two phrases which satisfy the following condition: the inner product of the real number vector corresponding to a phrase 5 belonging to the phrase group 11, calculated on the basis of the first co-occurrence information, and the real number vector corresponding to a phrase 6 belonging to the phrase group 22 is positive when the third co-occurrence information describes that the phrase 5 and the phrase 6 co-occur in one sentence, and is negative when the third co-occurrence information describes that the phrase 5 and the phrase 6 do not co-occur in one sentence. Thus, the co-occurrence information is calculated in real number vector form with respect to all the phrases of the phrase group 1 and the phrase group 2.
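
The quantity that the vector calculation seeks to make as large as possible can be sketched as follows. This is only a minimal illustration of the counting criterion, not the calculation procedure itself; the function names, the phrase identifiers (`n1`, `c1`, and so on) and the two-dimensional toy vectors are assumptions introduced for the example.

```python
# Hypothetical sketch: count how many phrase pairs a vector assignment explains.
# All identifiers and the toy data are illustrative, not taken from the patent.

def dot(u, v):
    """Inner product of two equal-length real number vectors."""
    return sum(a * b for a, b in zip(u, v))


def count_consistent_pairs(vecs_a, vecs_b, cooccur):
    """Count pairs whose inner-product sign agrees with the recorded decision.

    vecs_a, vecs_b: dict mapping phrase -> list of floats
    cooccur: dict mapping (phrase_a, phrase_b) -> True (co-occur) or False
    """
    count = 0
    for (pa, pb), decision in cooccur.items():
        ip = dot(vecs_a[pa], vecs_b[pb])
        # Positive inner product must match "co-occur", negative "do not".
        if (decision and ip > 0) or (not decision and ip < 0):
            count += 1
    return count


# Toy stand-ins for phrase group 11 and phrase group 21.
group_11 = {"n1": [1.0, 0.0], "n2": [-1.0, 0.5]}
group_21 = {"c1": [1.0, 0.0], "c2": [-1.0, 1.0]}
first_info = {
    ("n1", "c1"): True,
    ("n1", "c2"): False,
    ("n2", "c1"): False,
    ("n2", "c2"): True,
}
consistent = count_consistent_pairs(group_11, group_21, first_info)
```

The same count applies unchanged to the second and third co-occurrence information, with the vectors of one group held fixed at the values obtained from the first co-occurrence information.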

Further, when calculating the real number vector corresponding to each phrase on the basis of the first co-occurrence information, the first co-occurrence information is corrected by exceptionally reversing the co-occurrence decision for a portion of the first co-occurrence information. The correction is made so that the number of sets of two phrases satisfying the following condition is above a constant ratio (rate) of the number of all sets each comprising a phrase belonging to the phrase group 11 and a phrase belonging to the phrase group 21: the inner product of the real number vector corresponding to the phrase 1 belonging to the phrase group 11 and the real number vector corresponding to the phrase 2 belonging to the phrase group 21 is positive when the first co-occurrence information describes that the phrase 1 and the phrase 2 co-occur in one sentence, and is negative when it describes that they do not co-occur in one sentence. The corrected first co-occurrence information is then used as the first co-occurrence information, and the co-occurrence information is calculated in real number vector form with respect to all the phrases of the phrase group 1 and the phrase group 2. Thus, the co-occurrence information is calculated in the real number vector form and in exception information form.
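
One way to picture the exception information is sketched below. It assumes, for simplicity, that the vectors are held fixed while decisions are reversed, and that "above a constant ratio" is read as "at least"; both are assumptions of this sketch, as are the function name `correct_with_exceptions` and the toy data.

```python
# Hypothetical sketch of "exception information": decisions that disagree with
# the sign of the inner product are reversed and recorded as exceptions.
# Holding the vectors fixed is a simplifying assumption of this sketch.

def dot(u, v):
    """Inner product of two equal-length real number vectors."""
    return sum(a * b for a, b in zip(u, v))


def correct_with_exceptions(vecs_a, vecs_b, cooccur, min_ratio):
    """Reverse disagreeing decisions until the consistent share reaches min_ratio.

    Returns (corrected, exceptions): the corrected decision table and the list
    of pairs whose decision was reversed.
    """
    corrected = dict(cooccur)
    consistent = 0
    fixable = []
    for pair, decision in cooccur.items():
        ip = dot(vecs_a[pair[0]], vecs_b[pair[1]])
        if (decision and ip > 0) or (not decision and ip < 0):
            consistent += 1
        elif ip != 0:
            # Reversing this decision would make the pair consistent.
            fixable.append(pair)
    exceptions = []
    for pair in fixable:
        if consistent >= min_ratio * len(cooccur):
            break
        corrected[pair] = not corrected[pair]
        exceptions.append(pair)
        consistent += 1
    return corrected, exceptions


vecs_11 = {"n1": [1.0, 0.2], "n2": [-0.3, 1.0]}
vecs_21 = {"c1": [1.0, 0.1], "c2": [0.5, -1.0]}
info = {
    ("n1", "c1"): True,
    ("n1", "c2"): False,   # disagrees with the positive inner product
    ("n2", "c1"): False,
    ("n2", "c2"): False,
}
corrected, exceptions = correct_with_exceptions(vecs_11, vecs_21, info, 1.0)
```

Storing only the short `exceptions` list next to the vectors is what lets the dictionary keep both the real number vector form and the exception information form.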

In addition, the co-occurrence dictionary to be updated is one built in accordance with the above-described method or the like, which describes whether each phrase belonging to a first category and each phrase belonging to a second category, in a dictionary containing phrases of the object natural language, co-occur in one sentence, and which describes each phrase in real number vector form. When a new phrase 7 belonging to the first category is added to this co-occurrence dictionary, a group of phrases 23 is selected which consists of N phrases in the above-mentioned dictionary belonging to the second category, N being above the maximum dimension of the corresponding vectors, so that the absolute value of the inner product of the real number vectors corresponding to every two phrases of the N phrases is below a given constant value. Additional co-occurrence information indicating whether each of the N phrases and the phrase 7 co-occur in one sentence of the object language is then added, and the phrase 7 is made to correspond to a real number vector having a dimension below the aforementioned maximum dimension. The real number vector V corresponding to the phrase 7 is calculated so as to maximize the number M of sets of two phrases satisfying the following condition: the inner product of the real number vector corresponding to the phrase 7 and the real number vector corresponding to a phrase 8 belonging to the phrase group 23 is positive when the additional co-occurrence information describes that the phrase 8 and the phrase 7 co-occur in one sentence, and is negative when it describes that they do not co-occur in one sentence. The calculated real number vector V is added to the above-mentioned co-occurrence dictionary as the co-occurrence information for the phrase 7.
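
The update step can be sketched as follows. The text states only the conditions the pivot phrases and the vector V should satisfy, so the greedy pivot selection and the signed-sum estimate of V used here are assumptions of this sketch, as are all function names and the toy vectors. For brevity the toy uses N equal to the vector dimension, and every stored vector is assumed to be non-zero.

```python
# Hypothetical sketch of adding a new phrase to a vector-form dictionary.
# The greedy selection and the signed-sum estimate are illustrative choices;
# the source only states the conditions that the result should satisfy.

def dot(u, v):
    """Inner product of two equal-length real number vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_pivots(vectors, n, max_abs_ip):
    """Greedily pick up to n phrases whose vectors are pairwise nearly orthogonal."""
    chosen = []
    for phrase, vec in vectors.items():
        if all(abs(dot(vec, vectors[c])) < max_abs_ip for c in chosen):
            chosen.append(phrase)
        if len(chosen) == n:
            break
    return chosen


def estimate_vector(vectors, pivots, judgments):
    """Sum the pivot vectors, signed by the judgment and scaled by 1/|v|^2."""
    dim = len(vectors[pivots[0]])
    v_new = [0.0] * dim
    for p in pivots:
        sign = 1.0 if judgments[p] else -1.0
        norm_sq = dot(vectors[p], vectors[p])  # assumed non-zero
        for i in range(dim):
            v_new[i] += sign * vectors[p][i] / norm_sq
    return v_new


def count_matches(vectors, pivots, judgments, v_new):
    """The number M of pivots whose inner-product sign agrees with the judgment."""
    m = 0
    for p in pivots:
        ip = dot(v_new, vectors[p])
        if (judgments[p] and ip > 0) or (not judgments[p] and ip < 0):
            m += 1
    return m


# Toy second-category vectors; "c2" is nearly parallel to "c1" and is skipped.
case_vectors = {"c1": [2.0, 0.0], "c2": [1.9, 0.1], "c3": [0.0, 1.0]}
pivots = select_pivots(case_vectors, n=2, max_abs_ip=0.5)
judgments = {"c1": True, "c3": False}  # answers for the new phrase
v = estimate_vector(case_vectors, pivots, judgments)
m = count_matches(case_vectors, pivots, judgments, v)
```

When the pivot vectors are exactly orthogonal, the signed sum reproduces every judgment, which suggests why pivots with small mutual inner products are preferred; with merely near-orthogonal pivots the count M may fall short of N.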

Further, in the case that the number M is below a predetermined number L, a group of phrases 24 of a constant number is selected from the second category, and readditional co-occurrence information indicating whether each phrase of the phrase group 24 and the phrase 7 co-occur in one sentence of the object language is added. The additional co-occurrence information is then corrected by exceptionally reversing the co-occurrence decisions of the additional co-occurrence information and a portion of the readditional co-occurrence information, the real number vector corresponding to the phrase 7 is calculated on the basis of the corrected additional co-occurrence information so that the number M becomes above the predetermined number L, and the calculated real number vector is added to the above-mentioned co-occurrence dictionary as the co-occurrence information for the phrase 7.

The co-occurrence analysis mechanically decides whether phrases which are included in a dictionary of the object natural language and which belong to two kinds of categories co-occur in one sentence. For this analysis, there is used the co-occurrence dictionary which is built and updated in accordance with the above-described method or a similar method and which describes the co-occurrence information by the real number vectors corresponding to the phrases. Suppose that the phrase 1 included in the first category in the above-mentioned co-occurrence dictionary and the phrase 2 included in the second category in the above-mentioned co-occurrence dictionary appear, in the sentence to be analyzed, at positions which are morphologically and syntactically allowable. In the case that the inner product of the real number vector corresponding to the phrase 1 and the real number vector corresponding to the phrase 2 is positive, a decision is made that the phrase 1 and the phrase 2 co-occur, and on the other hand, in the case that the inner product is negative, a decision is made that the phrase 1 and the phrase 2 do not co-occur.
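
The decision rule reduces to the sign of one inner product, as the following minimal sketch shows. The function name `decide_cooccurrence` is hypothetical, and returning `None` for an inner product of exactly zero is an assumption of this sketch, since the text covers only the positive and negative cases.

```python
# Hypothetical sketch of the co-occurrence decision by inner-product sign.

def dot(u, v):
    """Inner product of two equal-length real number vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide_cooccurrence(vec_1, vec_2):
    """Return True (co-occur), False (do not co-occur) or None (undecided).

    The None case for a zero inner product is an assumption of this sketch.
    """
    ip = dot(vec_1, vec_2)
    if ip > 0:
        return True
    if ip < 0:
        return False
    return None


yes = decide_cooccurrence([1.0, 0.5], [0.5, 1.0])    # inner product 1.0
no = decide_cooccurrence([1.0, 0.0], [-2.0, 3.0])    # inner product -2.0
```

Because the dictionary stores one vector per phrase rather than one entry per phrase pair, the decision for any pair costs a single inner product.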

Further, for performing the semantic analysis, when the phrase 1 and the phrase 2 in the sentence to be analyzed have morphological and/or syntactic ambiguities, the inner product of the real number vector corresponding to the phrase 2 and the real number vector corresponding to the phrase 1 is calculated for each interpretation in accordance with the above-described co-occurrence analysis method. The interpretation for which the absolute value of this inner product is the greatest, or the group of interpretations for which the absolute value is above a constant value, is adopted, and the other interpretations are rejected.
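
The selection among competing interpretations can be sketched as below. The sketch follows the text in ranking by the absolute value of the inner product; the function name `select_interpretations`, the interpretation labels and the toy vectors are assumptions made for illustration.

```python
# Hypothetical sketch of choosing among ambiguous interpretations.
# Each candidate pairs a label with the vectors of its phrase 1 and phrase 2.

def dot(u, v):
    """Inner product of two equal-length real number vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_interpretations(candidates, threshold=None):
    """Keep the best interpretation, or all whose score exceeds a threshold.

    candidates: list of (label, vec_1, vec_2) tuples
    threshold: if None, return only the label with the greatest absolute
               inner product; otherwise return every label above threshold.
    """
    scored = [(abs(dot(v1, v2)), label) for label, v1, v2 in candidates]
    if threshold is None:
        best = max(scored, key=lambda s: s[0])
        return [best[1]]
    return [label for score, label in scored if score > threshold]


candidates = [
    ("reading A", [1.0, 0.0], [0.2, 0.9]),   # |inner product| = 0.2
    ("reading B", [0.0, 1.0], [0.2, 0.9]),   # |inner product| = 0.9
    ("reading C", [1.0, 1.0], [0.5, -0.5]),  # |inner product| = 0.0
]
best_only = select_interpretations(candidates)
above = select_interpretations(candidates, threshold=0.1)
```

Every interpretation not returned by the function corresponds to one that is rejected.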

BRIEF DESCRIPTION OF THE DRAWINGS

The object and features of the present invention will become more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is an illustration for describing a co-occurrence dictionary building and updating method and a co-occurrence and semantic analysis method according to the present invention;

FIG. 2 is a block diagram showing an apparatus for building and updating a co-occurrence dictionary for the verb and the noun in the Japanese language;

FIG. 3 is a block diagram showing a Japanese language sentence analysis apparatus according to a second embodiment of this invention;

FIG. 4 is a flow chart for describing an operation for the semantic analysis according to the second embodiment of this invention;

FIG. 5 is a flow chart for describing the co-occurrence analysis according to the second embodiment of this invention;

FIG. 6 shows a portion of the contents of a noun semantic dictionary in the second embodiment;

FIG. 7 illustrates a portion of the contents of a verb semantic dictionary in the second embodiment;

FIG. 8 is a block diagram showing a Japanese language sentence analysis apparatus based on a conventional semantic analysis method;

FIG. 9 is a flow chart showing an operation for the conventional semantic analysis;

FIG. 10 is a flow chart showing an operation for a conventional co-occurrence analysis;

FIG. 11 shows a portion of a conventional semantic label system;

FIG. 12 shows a portion of the contents of a conventional noun semantic label dictionary;

FIG. 13 illustrates a portion of the contents of a conventional verb case dictionary; and

FIG. 14 shows one example of syntax trees.

DETAILED DESCRIPTION OF THE INVENTION

Embodiments of this invention will be described hereinbelow with reference to the drawings. In the embodiments, the object natural language is the Japanese language, the first category is the noun and the second category is the case of the verb.

FIG. 2 is a block diagram showing an apparatus according to a first embodiment of this invention for building and updating the co-occurrence dictionary for the case of the verb and the noun in the Japanese language. In FIG. 2, numeral 101 represents a noun dictionary describing the notation, reading and other attributes of nouns in the Japanese language, 102 designates a verbal case pattern dictionary containing the case slots and the typical nouns which can be included in the surface case pattern of each verb in the Japanese language, 103 denotes a pivot selecting means for selecting, from the elements of the noun dictionary 101 and the verbal case pattern dictionary 102, an element to serve as the axis (center), and 104 depicts a question sentence producing means for producing, on the basis of the element selected by the pivot selecting means 103, a question sentence to be shown to the person inputting the co-occurrence information. Further, numeral 105 is a question sentence indicating means for showing that person the question sentence produced by the question sentence producing means 104, and 106 designates a co-occurrence information inputting means by which the person inputs the co-occurrence information in accordance with the indication of the question sentence indicating means 105. Numeral 107 represents a feature vector calculating means for calculating a feature vector to be given to each element, on the basis of the co-occurrence information from the co-occurrence information inputting means 106 and the selection result from the pivot selecting means 103. Numeral 108 denotes a noun semantic dictionary for storing the noun dictionary information including the feature vector of each noun outputted from the feature vector calculating means 107, and 109 depicts a verb semantic dictionary for storing the dictionary information of the case pattern of each verb including the feature vector of the case of the verb outputted from the feature vector calculating means 107.

Here, in the verbal case pattern dictionary 102, the surface case pattern of each verb and the typical example of the noun are described in the form of " / / ] [ / ] / ".

Next, the operation of the co-occurrence dictionary building apparatus thus arranged will be described. Before that, the formulas used in the description are introduced. From linear algebra, an n-row, v-column matrix C having a rank p can be expressed as the following equation (1) on the basis of a v-row, p-column orthogonal matrix A and an n-row, p-column orthogonal matrix B. ##EQU1## Accordingly, the original matrix C can be rewritten as the following equation (2). ##EQU2## where b_k and a_k are respectively the k-th column vectors of the matrices B and A.

In the aforementioned equation (2), λ is called the singular value of the matrix C and the right side of the equation (2) is called the spectral decomposition.
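
The equation images for (1) and (2) are not reproduced in this text. As a hedged reconstruction from the definitions given (C is n-row and v-column with rank p, B is n-row and p-column, A is v-row and p-column, both with orthonormal columns), the standard singular value decomposition would read as follows. The diagonal matrix Λ and the subscript on λ_k are notational assumptions of this reconstruction and may differ from the typography of the original equations.

```latex
% Assumed reconstruction of equations (1) and (2); not copied from the original images.
\begin{align}
C &= B \Lambda A^{\mathsf{T}}, \qquad
     \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p) \tag{1} \\
C &= \sum_{k=1}^{p} \lambda_k \, b_k a_k^{\mathsf{T}} \tag{2}
\end{align}
```

The dimensions agree: B is n by p, Λ is p by p and the transpose of A is p by v, so the product is n by v, and each term of (2) is a rank-one n by v matrix.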

The spectral decomposition of the matrix C has the following property. Assume that the matrix C with the rank p is approximated by a matrix D having a rank q smaller than the rank p. If the poorness of the approximation is measured by the Euclidean distance in accordance with the following equation (3), the matrix D which minimizes the measure δ of the poorness of the approximation is given by the following equation (4) as a partial sum of the spectral decomposition of the matrix C. ##EQU3## where c_ij and d_ij are the elements of the i row