WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Computerized cross-language document retrieval using latent semantic indexing    
United States Patent5301109   
Link to this pagehttp://www.wikipatents.com/5301109.html
Inventor(s)Landauer; Thomas K. (Summit, NJ); Littman; Michael L. (Somerset, NJ)
AbstractA methodology for retrieving textual data objects in a multiplicity of languages is disclosed. The data objects are treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in each language under consideration. Estimates to this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5301109
Computerized cross-language document retrieval using latent semantic

     indexing - US Patent 5301109 Drawing
Computerized cross-language document retrieval using latent semantic indexing
Inventor     Landauer; Thomas K. (Summit, NJ); Littman; Michael L. (Somerset, NJ)
Owner/Assignee     Bell Communications Research, Inc. (Livingston, NJ)
Patent assignment
All assignments
Publication Date     April 5, 1994
Application Number     07/734,291
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     July 17, 1991
US Classification     704/9 704/2
Int'l Classification     G06F 015/40
Examiner     Envall Jr.; Roy N.
Assistant Examiner     Chung; Xuong
Attorney/Law Firm     Suchyta; Leonard Giordano; Joseph Charles,
Address
Parent Case     CROSS-REFERENCE TO A RELATED APPLICATION This is a continuation-in-part of application Ser. No. 07/536,029, filed Jun. 11, 1990 now abandoned.
Priority Data    
USPTO Field of Search     364/419
Patent Tags     computerized cross-language document retrieval latent semantic indexing
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5128865
Sadler
704/2
Jul,1992

[0 after 0 votes]
4916614
Kaji
704/2
Apr,1990

[0 after 0 votes]
4864503
Tolin
704/2
Sep,1989

[0 after 0 votes]
4839853
Deerwester

Jun,1989

[0 after 0 votes]
4654798
Taki
704/7
Mar,1987

[0 after 0 votes]
4593356
Hashimoto

Jun,1986

[0 after 0 votes]
4270182
Asija
704/8
May,1981

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A multi-language information retrieval method for operating a computer system, including an information file of stored data objects, to retrieve selected data objects based on a user query, the method comprising the steps of

selecting a set of training data objects from the stored data objects, said set of training data objects selected to satisfy predetermined retrieval criteria,

translating each of said data objects in said set of training data objects into multiple languages to produce multiple translations and to generate a set of multi-language training data objects corresponding to said set of training data objects, and storing said translations corresponding to each of said multi-language training data objects in the information file,

for each of said multi-language training data objects, merging all of said translations into a single merged data object composed of terms contained in all of said translations, thereby generating a set of merged data objects corresponding to said set of multi-language training data objects,

parsing each said merged data object to extract distinct ones of said terms and generating a lexicon database from said distinct terms,

generating a joint term-by-data object matrix by processing said translations as stored in the information file, wherein said matrix has t rows in correspondence to said distinct terms in said lexicon database and d columns in correspondence to the number of said merged data objects in said set of merged data objects, and wherein each (i,j) cell of said matrix registers a tabulation of the occurrence of the i.sup.th distinct term in the j.sup.th merged data object,

decomposing said matrix into a reduced singular value representation composed of a distinct term file and a data object file to create a semantic space,

generating a pseudo-object, in response to the user query, by parsing the user query to obtain query terms and applying a given mathematical algorithm to said distinct terms and said query terms, and inserting said pseudo-object into said semantic space,

examining the similarity between said pseudo-object and the stored data objects in said semantic space to generate the selected data objects corresponding to said pseudo-object, and

generating a report of the selected data objects.

2. The method as recited in claim 1 further including, after the step of decomposing, the step of folding in, as other pseudo-objects, other data objects excluded from said set of training data objects by parsing each of said other data objects to obtain data object query terms and applying a given mathematical formula to said data object query terms and said distinct terms to create an augmented semantic space to serve as said semantic space.

3. The method as recited in claim 2 wherein

said matrix is expressed as Y,

said step of decomposing produces said representation in the form Y=T.sub.o S.sub.o D.sup.T.sub.o of rank m, and an approximation representation Y.sub.R =TSD.sup.T of rank k<m, where T.sub.o and D.sub.o represent said term and data object files and S.sub.o corresponds to said singular value representation and where T, D and S represent reduced forms of T.sub.o, D.sub.o and S.sub.o, respectively,

each of said other pseudo-objects is expressible as Y.sub.q and said step of folding in includes the step of computing D.sub.q =Y.sub.q.sup.T TS.sup.-1 for each of said other pseudo-objects,

said user-query pseudo-object is expressible as Y.sub.q and said step of inserting includes the step of computing D.sub.q =Y.sub.q.sup.T TS.sup.-1, and

said step of examining includes the step of evaluating the dot products between said user-query pseudo-object and the data objects in said augmented semantic space.

4. The method as recited in claim 3 wherein the degree of similarity is measured by said dot products exceeding a predetermined threshold.

5. The method as recited in claim 4 wherein said approximation representation is obtained by setting (k+1) through m diagonal values of S.sub.o to zero.

6. The method as recited in claim 2 wherein

said matrix is expressed as Y,

said step of decomposing produces said representation in the form Y=T.sub.o S.sub.o D.sup.T.sub.o of rank m, and an approximation representation Y.sub.R =TSD.sup.T of rank k<m, where T.sub.o and D.sub.o represent said term and data object files and S.sub.o corresponds to said singular value representation and where T, D and S represent reduced forms of T.sub.o, D.sub.o and S.sub.o, respectively,

each of said other pseudo-objects is expressible as Y.sub.q and said step of folding in includes the step of computing D.sub.q =Y.sub.q.sup.T TS.sup.-1 for each of said other pseudo-objects,

said user-query pseudo-object is expressible as Y.sub.q and said step of inserting includes the step of computing D.sub.q =Y.sub.q.sup.T TS.sup.-1 for said user-query pseudo-object, and

said step of examining includes the step of evaluating the cosines between said user-query pseudo-object and the data objects in said augmented semantic space.

7. A method for retrieving information from a multi-language information file stored in a computer system based on a user query, the file including stored data objects, the method comprising the steps of

selecting a set of training data objects from the stored data objects, said set of training data objects selected to satisfy predetermined retrieval criteria,

translating each of said data objects in said set of training data objects into multiple languages to produce multiple translations and to generate a set of multi-language training data objects corresponding to said set of training data objects, and storing said translations corresponding to each of said multi-language training data objects in the information file,

for each of said multi-language training data objects, merging all of said translations into a single merged data object composed of terms contained in all of said translations, thereby generating a set of merged data objects corresponding to said set of multi-language training data objects,

parsing each said merged data object to extract distinct ones of said terms and generating a lexicon database from said distinct terms,

generating a joint term-by-data object matrix by processing said translations as stored in the information file, wherein said matrix has t rows in correspondence to said distinct terms in said lexicon database and d columns in correspondence to the number of said merged data objects in said set of merged data objects, and wherein each (i,j) cell of said matrix registers a tabulation of the occurrence of the i.sup.th distinct term in the j.sup.th merged data object,

decomposing said matrix into a reduced singular value representation composed of a distinct term file and a data object file to create a semantic space,

folding into said semantic space other data objects excluded from said set of training data objects by parsing each of said other data objects to obtain data object query terms and applying a mathematical transformation to said data object query terms and said distinct terms to create an augmented semantic space to serve as said semantic space,

generating a pseudo-object, in response to the user query, by parsing the user query to obtain query terms and applying a given mathematical algorithm to said distinct terms and said query terms, and inserting said pseudo-object into said augmented semantic space,

examining the similarity between said pseudo-object and the stored data objects in said augmented semantic space to generate the selected data objects corresponding to said pseudo-object, and

generating a report of the selected data objects.

8. A multi-language information retrieval method for operating a computer system, including an information file of stored data objects, to retrieve selected data objects based on a user query, the method comprising the steps of

selecting a set of training data objects from the stored data objects, said set of training data objects selected to satisfy predetermined retrieval criteria,

translating each of said data objects in said set of training data objects into multiple languages to produce multiple translations and to generate a set of multi-language training data objects corresponding to said translations corresponding to each of said training data objects, and storing said set of multi-language training data objects in the information file,

for each of said multi-language training data objects, merging all of said translations into a single merged data object composed of terms contained in all of said translations, thereby generating a set of merged data objects corresponding to said set of multi-language training data objects,

parsing each said merged data object to extract distinct ones of said terms and generating a lexicon database from said distinct terms,

generating a joint term-by-data object matrix by processing said translations as stored in the information file, wherein said matrix has t rows in correspondence to said distinct terms in said lexicon database and d columns in correspondence to the number of said merged data objects in said set of merged data objects, and wherein each (i,j) cell of said matrix registers a tabulation of the occurrence of the i.sup.th distinct term in the j.sup.th merged data object,

decomposing said matrix into a reduced singular value representation composed of a distinct term file and a data object file to create a semantic space,

generating a pseudo-object, in response to the user query, by parsing the user query to obtain query terms and applying a given mathematical algorithm to said distinct terms and said query terms, and inserting said pseudo-object into said semantic space,

examining the similarity between said pseudo-object and the stored data objects in said semantic space to generate the selected data objects corresponding to said pseudo-object,

processing the selected data objects to produce a coded representation of the selected data objects and storing said coded representation in the computer system in a form accessible by the user for later recall so that the user query requires no repetition, and

generating a report of the selected data objects.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

This invention relates generally to computer-based information retrieval and, in particular, to user accessibility to and display of textual material stored in computer files utilizing a request in one language to retrieve documents in other languages related to the request.

BACKGROUND OF THE INVENTION

In the field of information retrieval, a long-standing objective has been the development of an automated procedure by which documents in one language could be effectively accessed by requests in another language without needing to translate either the documents or the requests. Among other things, such a capability would allow users to determine what documents were available in languages that the users could not read before incurring the expense and delay of translation.

One technique representative of some previously proposed procedures, disclosed in an article entitled "Automatic Processing of Foreign Language Documents," was published by G. Salton in 1970 in the Journal of American Society for Information Sciences. Salton reported experimenting with a method for automatic retrieval of documents in one language in response to queries in another using a vector representation and search technique in conjunction with a manually created dual-language thesaurus. The results for test samples of abstracts and queries were promising. However, creating an adequate multi-language thesaurus is difficult and requires considerable intellectual labor. Moreover, a traditional thesaurus necessarily imposes a discrete and rather restricted model of the languages in question and of their relation to one another.

U.S. Pat. No. 4,839,853, issued to one of the present co-inventors and assigned to the same assignee as is the present invention, utilizes the Latent Semantic Indexing (LSI) approach to model the underlying correlational structure of the distribution of terms in documents. Instead of representing documents and queries directly as sets of words, the LSI technique represents them as parameters in such a way that dependencies between words and between documents are taken into account. For example, if two terms are used in exactly the same contexts, that is, have identical distribution across a target collection of documents, LSI is designed to treat them not as two independent indexing entries but as two instances of an abstract indexing variable with the same vector value. Lesser and more indirect relations between terms and between documents are represented in an appropriate analogous fashion.

In the implementation of LSI as set forth in the above-identified patent, the modeling is accomplished by approximating the original term-by-document matrix by the product of three lower rank matrices of orthogonal derived indexing variables. The first matrix represents terms as values on a smaller set of independent "basis" vectors; the second matrix contains scaling coefficients; and the third matrix represents documents as values on the smaller set of basis vectors. The method can be interpreted geometrically as a means by which each document and each term is assigned to a point in a hyperspace. The mathematics and implementation of the method construct a derived space in which terms, documents, and queries can all be represented in the hyperspace. The mathematical procedure employed is singular value decomposition (SVD), which is closely related to factor analysis and eigenvalue decomposition.

The retrieval process is the same as in standard vector methods, e.g. using document-query cosines as the similarity measure. Various preprocessing steps, such as term weighting, may also be done in standard ways. The principal difference between LSI and previous vector models as represented by the work of Salton is that the vectors are constructed in a space with many fewer dimensions than the number of original terms, and that these dimensions are the subset of linearly independent basis vectors by which the original term-by-document matrix can be best approximated in a least squares sense. The number of dimensions retained has been determined empirically; optimal retrieval performance has usually been obtained with about 100 dimensions for collections of many hundreds to several thousands of documents.

The dimension reduction step of LSI has the advantageous property that small sources of variability in term usage are dropped and only the most important sources kept. Among other things, this can cause synonyms or near synonyms to be collapsed into similar vector representations, with the result that queries can retrieve similar documents even though they share no terms. This cannot happen in the usual raw term vector representation, necessitating manually constructed thesauri with their attendant problems.

The LSI method has previously been applied only within a single language, and there has been no teaching or suggestion in the art regarding the application of LSI to multi-language information retrieval.

SUMMARY OF THE INVENTION

These shortcomings as well as other deficiencies and limitations of conventional information retrieval techniques are obviated, in accordance with the present invention, by constructing a multi-language semantic space. This is effected automatically, without the need for a thesaurus, by modeling the usage of terms in documents using an expanded latent semantic indexing framework. In the broad aspect of the method, an initial set of documents, from a usually larger set of documents, is translated into the number of languages under consideration and the documents, including all translations, are stored in a computer information file; this produces a set of multiple (dual in one special but significant case) language documents. This set of multi-lingual documents is used to "train" an automatic multi-lingual indexing system by processing a joint term-by-document matrix of data. The joint matrix is formed by including the terms used in all the translations, and each document is allocated a single vector in the matrix no matter how many languages are treated by the methodology. After training, i.e., application of the singular value decomposition, the system can index any new document or query that is presented to it according to a set of derived abstract indexing variables that are language-independent.

The organization and operation of this invention will be better understood from a consideration of the detailed description, which follows, when taken in conjunction with the accompanying drawing.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a plot of the "term" coordinates and the "document" coordinates based on a two-dimensional singular value decomposition of an original "term-by-document" matrix in a single language;

FIG. 2 shows the location of the training documents in the data object space for an example reduced to two dimensions in a dual language example; and

FIG. 3 is a flow diagram depicting the processing which generates the "term" and "document" matrices using singular value decomposition as well as the processing of a user's query.

DETAILED DESCRIPTION

Before discussing the principles and operational characteristics of this invention in detail, it is helpful to present a motivating example of latent semantic indexing for a single language case, namely, English. This also aids in introducing terminology utilized later in the discussion.

Illustrative Example of the LSI Method

The contents of Table 1 are used to illustrate how semantic structure analysis works and to point out the differences between this method and conventional keyword matching.

TABLE 1

Document Set Based on Titles

c1: Human machine interface for Lab ABC computer applications

c2: A survey of user opinion of computer system response time

c3: The EPS user interface management system

c4: Systems and human systems engineering testing of EPS-2

c5: Relation of user-perceived response time to error measurement

m1: The generation of random, binary, unordered trees

m2: The intersection graph of paths in trees

m3: Graph minors IV: Widths of trees and well-quasi-ordering

m4: Graph minors: A survey

In this example, a file of text objects consists of nine titles of technical documents with titles c1-c5 concerned with human/computer interaction and titles m1-m4 concerned with mathematical graph theory. In Table 1, words occuring in more than one title are italicized. Using conventional keyword retrieval, if a user requested papers dealing with "human computer interaction," titles c1, c2, and c4 would be returned, since these titles contain at least one keyword from the user request. However, c3 and c5, while related to the query, would not be returned since they share no words in common with the request. It is now shown how latent semantic structure analysis treats this request to return titles c3 and c5.

Table 2 depicts the "term-by-document" matrix for the 9 technical document titles. Each cell entry, (i,j), is the frequency of occurrence of term i in document j. This basic term-by-document matrix or a mathematical transformation thereof is used as input to the statistical procedure described below.

TABLE 2 ______________________________________ DOCUMENTS TERMS c1 c2 c3 c4 c5 m1 m2 m3 m4 ______________________________________ 1. human 1 0 0 1 0 0 0 0 0 2. interface 1 0 1 0 0 0 0 0 0 3. computer 1 1 0 0 0 0 0 0 0 4. user 0 1 1 0 1 0 0 0 0 5. system 0 1 1 2 0 0 0 0 0 6. response 0 1 0 0 1 0 0 0 0 7. time 0 1 0 0 1 0 0 0 0 8. EPS 0 0 1 1 0 0 0 0 0 9. survey 0 1 0 0 0 0 0 0 1 10. tree 0 0 0 0 0 1 1 1 0 11. graph 0 0 0 0 0 0 1 1 1 12. minor 0 0 0 0 0 0 0 1 1 ______________________________________

For this example the documents and terms have been carefully selected to yield a good approximation in just two dimensions for expository purposes. FIG. 1 is a two-dimensional graphical representation of the two largest dimensions resulting from the mathematical process, singular value decomposition. Both document titles and the terms used in them are placed into the same space. Terms are shown as circles and labeled by number. Document titles are represented by squares with the numbers of constituent terms indicated parenthetically. The angle between two objects (terms or documents) describe their computed similarity. In this representation, the two types of documents form two distinct groups: all the mathematical graph theory titles occupy the same region in space (basically along Dimension 1 of FIG. 1) whereas a quite distinct group is formed for human/computer interaction titles (essentially along Dimension 2 of FIG. 1).

To respond to a user query about "human computer interaction," the query is first folded into this two-dimensional space using those query terms that occur in the space (namely, "human" and "computer"). The query vector is located in the direction of the weighted average of these constituent terms, and is denoted by a directional arrow labeled "Q" in FIG. 1. A measure of closeness or similarity is the angle between the query vector and any given term or document vector. In FIG. 1 the cosine between the query vector and each c1-c5 titles is greater than 0.90; the angle corresponding to the cosine value of 0.90 with the query is shown by the dashed lines in FIG. 1. With this technique, documents c3 and c5 would be returned as matches to the user query, even though they share no common terms with the query. This is because the latent semantic structure (represented in FIG. 1) fits the overall pattern of term usage across documents.

Description of Singular Value Decomposition

To obtain the data to plot FIG. 1, the "term-by-document" matrix of Table 2 is decomposed using singular value decomposition (SVD). A reduced SVD is employed to approximate the original matrix in terms of a much smaller number of orthogonal dimensions. The reduced dimensional matrices are used for retrieval; these describe major associational structures in the term-document matrix but ignore small variations in word usage. The number of dimensions to represent adequately a particular domain is largely an empirical matter. If the number of dimensions is too large, random noise or variations in word usage will be modeled. If the number of dimensions is too small, significant semantic content will remain uncaptured. For diverse information sources, 100 or more dimensions may be needed.

To illustrate the decomposition technique, the term-by-document matrix, denoted Y, is decomposed into three other matrices, namely, the term matrix (TERM), the document matrix (DOCUMENT), and a diagonal matrix of singular values (DIAGONAL), as follows:

Y.sub.t,d =TERM.sub.t,k DIAGONAL.sub.k,k DOCUMENT.sup.T.sub.k,d

where Y is the original t-by-d matrix, TERM is the t-by-k matrix that has unit-length orthogonal columns, DOCUMENT.sup.T is the transpose of the d-by-k DOCUMENT matrix with unit-length orthogonal columns, and DIAGONAL is the k-by-k diagonal matrix of singular values typically ordered by magnitude.

The dimensionality of the solution, denoted k, is the rank of the t-by-d matrix, that is, k.ltoreq.min(t,d). Tables 3, 4 and 5 below show the TERM and DOCUMENT matrices and the diagonal elements of the DIAGONAL matrix, respectively, as found via SVD.

TABLE 3 __________________________________________________________________________ TERM MATRIX (12 terms by 9 dimensions) __________________________________________________________________________ human 0.22 -0.11 0.29 -0.41 -0.11 -0.34 -.52 -0.06 -0.41 interface 0.20 -0.07 0.14 -0.55 0.28 0.50 -0.07 -0.01 -0.11 computer 0.24 0.04 -0.16 -0.59 -0.11 -0.25 -0.30 0.06 0.49 user 0.40 0.06 -0.34 0.10 0.33 0.38 0.00 0.00 0.01 system 0.64 -0.17 0.36 0.33 -0.16 -0.21 -0.16 0.03 0.27 response 0.26 0.11 -0.42 0.07 0.08 -0.17 0.28 -0.02 -0.05 time 0.26 0.11 -0.42 0.07 0.08 -0.17 0.28 -0.02 -0.05 EPS 0.30 -0.14 0.33 0.19 0.11 0.27 0.03 -0.02 -0.16 survey 0.20 0.27 -0.18 -0.03 -0.54 0.08 -0.47 -0.04 -0.58 tree 0.01 0.49 0.23 0.02 0.59 -0.39 -0.29 0.25 -0.22 graph 0.04 0.62 0.22 0.00 -0.07 0.11 0.16 -0.68 0.23 minor 0.03 0.45 0.14 -0.01 -0.30 0.28 0.34 0.68 0.18 __________________________________________________________________________

TABLE 4 __________________________________________________________________________ DOCUMENT MATRIX (9 documents by 9 dimensions) __________________________________________________________________________ c1 0.20 -0.06 0.11 -0.95 0.04 -0.08 0.18 -0.01 -0.06 c2 0.60 0.16 -0.50 -0.03 -0.21 -0.02 -0.43 0.05 0.24 c3 0.46 -0.13 0.21 0.04 0.38 0.07 -0.24 0.01 0.02 c4 0.54 -0.23 0.57 0.27 -0.20 -0.04 0.26 -0.02 -0.08 c5 0.28 0.11 -0.50 0.15 0.33 0.03 0.67 -0.06 -0.26 m1 0.00 0.19 0.10 0.02 0.39 -0.30 -0.34 0.45 -0.62 m2 0.01 0.44 0.19 0.02 0.35 -0.21 -0.15 -0.76 0.02 m3 0.02 0.62 0.25 0.01 0.15 0.00 0.25 0.45 0.52 m4 0.08 0.53 0.08 -0.02 -0.60 0.36 0.04 -0.07 -0.45 __________________________________________________________________________

TABLE 5 ______________________________________ DIAGONAL (9 singular values) ______________________________________ 3.34 2.54 2.35 1.64 1.50 1.31 0.84 0.56 0.36 ______________________________________

As alluded to earlier, data to plot FIG. 1 was obtained by presuming that two dimensions are sufficient to capture the major associational structure of the t-by-d matrix, that is, k is set to two in the expression for Y.sub.t,d, yielding an approximation of the original matrix. Only the first two columns of the TERM and DOCUMENT matrices are considered with the remaining columns being ignored. Thus, the term data point corresponding to "human" in FIG. 1 is plotted with coordinates (0.22,-0.11), which are extracted from the first row and the two left-most columns of the TERM matrix. Similarly, the document data point corresponding to title m1 has coordinates (0.00,0.19), coming from row six and the two left-most columns of the DOCUMENT matrix. Finally, the Q vector is located from the weighted average of the terms "human" and "computer" appearing in the query. A method to compute the weighted average will be presented below.

General Model Details

It is now elucidating to describe in somewhat more detail the mathematical model underlying the latent structure, singular value decomposition technique.

Any rectangular matrix Y of t rows and d columns, for example, a t-by-d matrix of terms and documents, can be decomposed into a product of three other matrices:

Y=T.sub.o S.sub.o D.sup.T.sub.o, (1)

such that T.sub.o and D.sub.o have unit-length orthogonal columns (i.e. T.sub.o.sup.T T.sub.o =I; D.sub.o.sup.T D.sub.o =I) and S.sub.o is diagonal. This is called the singular value decomposition (SVD) of Y. (A procedure for SVD is described in the text Numerical Recipes, by Press, Flannery, Teukolsky and Vetterling, 1986, Cambridge University Press, Cambridge, England). T.sub.o and D.sub.o are the matrices of left and right singular vectors and S.sub.o is the diagonal matrix of singular values. By convention, the diagonal elements of S.sub.o are ordered in decreasing magnitude.

With SVD, it is possible to devise a simple strategy for an optimal approximation to Y using smaller matrices. The k largest singular values and their associated columns in T.sub.o and D.sub.o may be kept and the remaining entries set to zero. The product of the resulting matrices is a matrix Y.sub.R which is approximately equal to Y, and is of rank k. The new matrix Y.sub.R is the matrix of rank k which is the closest in the least squares sense to Y. Since zeros were introduced into S.sub.o, the representation of S.sub.o can be simplified by deleting the rows and columns having these zeros to obtain a new diagonal matrix S, and then deleting the corresponding columns of T.sub.o and D.sub.o to define new matrices T and D, respectively. The result is a reduced model such that

Y.sub.R =TSD.sup.T. (2)

The value of k is chosen for each application; it is generally such that k.gtoreq.100 for collections of 1000-3000 data objects.

For discussion purposes, it is useful to interpret the SVD geometrically. The rows of the reduced matrices T and D may be taken as vectors representing the terms and documents, respectively, in a k-dimensional space. With appropriate rescaling of the axes, by quantities related to the associated diagonal values of S, dot products between points in the space can be used to access and compare objects. (A simplified approach which did not involve rescaling was used to plot the data of FIG. 1, but this was strictly for expository purposes.) These techniques are now discussed.

Fundamental Comparisons

There are basically three types of comparisons of interest: (i) those comparing two terms; (ii) those comparing two documents or text objects; and (iii) those comparing a term and a document or text object. As used throughout, the notion of a text object or data object is general whereas a document is a specific instance of a text object or data object. Also, text or data objects are stored in the computer system in files.

Two Terms: In the data, the dot product between two row vectors of Y.sub.R tells the extent to which two terms have a similar pattern of occurrence across the set of documents. The matrix Y.sub.R Y.sup.T.sub.R is the square symmetric matrix approximation containing all the term-by-term dot products. Using equation (2),

Y.sub.R Y.sup.T.sub.R =(TSD.sup.T)(TSD.sup.T).sup.T =TS.sup.2 T.sup.T =(TS)(TS).sup.T. (3)

This means that the dot product between the i-th row and j-th row of Y.sub.R can be obtained by calculating the dot product between the i-th and j-th rows of the TS matrix. That is, considering the rows of TS as vectors representing the terms, dot products between these vectors give the comparison between the terms. The relation between taking the rows of T as vectors and those of TS as vectors is simple since S is a diagonal matrix; each vector element has been stretched or shrunk by the corresponding element of S.

Two Documents: In this case, the dot product is between two column vectors of Y. The document-to-document dot product is approximated by

Y.sup.T.sub.R Y.sub.R =(TSD.sup.T).sup.T (TSD.sup.T)=DS.sup.2 D.sup.T =(DS)(DS).sup.T. (4)

Thus the rows of the DS matrix are taken as vectors representing the documents, and the comparison is via the dot product between the rows of the DS matrix.

Term and Document: This comparison is somewhat different. Instead of trying to estimate the dot product between rows or between columns of Y, the fundamental comparison between a term and a document is the value of an individual cell in Y. The approximation of Y is simply equation (2), i.e., Y.sub.R =TSD.sup.T. The i,j cell of Y.sub.R may therefore be obtained by taking the dot product between the i-th row of the matrix TS.sup.1/2 and the j-th row of the matrix DS.sup.1/2. While the "within" (term or document) comparisons involved using rows of TS and DS as vectors, the "between" comparision requires TS.sup.1/2 and DS.sup.1/2 for coordinates. Thus it is not possible to make a single configuration of points in a space that will allow both "between" and "within" comparisons. They will be similar, however, differing only by a stretching or shrinking of the dimensional elements by a factor S.sup.1/2.

Representations of Pseudo-Objects

The previous results show how it is possible to compute comparisons between the various objects associated with the rows or columns of Y. It is very important in information retrieval applications to compute similar comparison quantities for objects such as queries that do not appear explicitly in Y. This is particularly important for the cross-language case considered in accordance with the present invention. For example, it is necessary to be able to take a completely novel query, find a location in the k-dimensional latent semantic space for it, and then evaluate its cosine with respect to terms or objects in the space. Another example would be trying, after-the-fact, to find representations for documents that did not appear in the original space. The new objects for both these examples are equivalent to objects in the matrix Y in that they may be represented as vectors of terms. For this reason they are called pseudo-documents specifically or pseudo-objects generically. In order to compare pseudo-documents to other documents, the starting point is defining a pseudo-document vector, designated Y.sub.q. Then a representation D.sub.q is derived such that D.sub.q can be used just like a row of D in the comparison relationships described in the foregoing sections. One criterion for such a derivation is that the insertion of a real document Y.sub.i should give D.sub.i when the model is ideal (i.e., Y=Y.sub.R). With this constraint,

Y.sub.q =TSD.sub.q.sup.T

or, since T.sup.T T equals the identity matrix,

D.sub.q.sup.T =S.sup.-1 T.sup.T Y.sub.q

or, finally,

D.sub.q =Y.sup.T.sub.q TS.sup.-1. (5)

Thus, with appropriate rescaling of the axes, this amounts to placing the pseudo-object at the vector sum of its corresponding term points. The D.sub.q may be used like any row of D and, appropriately scaled by S or S.sup.1/2, can be used like a usual document vector for making "within" and "between" comparisons. [It is to be noted that if the measure of similarity to be used in comparing the query against all the documents is one in which only the angle between the vectors is important (such as the cosine), there is no difference for comparison purposes between placing the query at the vector average or the vector sum of its terms since the average and sum differ only in magnitude.]

For the query example above ("human computer interaction"), Y.sub.q =[1010 . . . ].sup.T, so for the simplified two-dimensional representation, ##EQU1## or, finally,

D.sub.q =[0.14-0.03].

Thus, D.sub.q represents the location of the query in the document space and is basically the weighted average of the terms appearing in the query.

MULTI-LANGUAGE CASE

To extend the principles of LSI to cross-language retrieval, a document set comprising all documents of interest, in the languages to be searched, is formed. A subset of the documents, called the "training set," is selected; the "training set" is composed of documents for which translations exist in all the languages (two or more). The so-called "joint" term-by-document matrix of this set is composed from the addition of the terms in their renditions in all the languages. This joint matrix differs from the single-language LSI matrix in that each column, which represents a single multi-language document, is the combination of terms from the two (or more) languages coalesced into just a single column vector. As with the single-language technique, the joint matrix is then analyzed by singular value decomposition. The resulting representation defines vectors for the training-set terms and documents in the languages under consideration. Once the training analysis has been completed, other single-language documents can be "folded in" as pseudo-documents on the basis of terms from any one of the original languages alone. Most importantly, a user query is treated as such a new document.

In the derived indexing space there is a point representing each term in the training set. A new single-language document is assigned a point in the same space by putting it at an appropriate average of the location of all the terms it contains. For cross-language retrieval, the same number or greater of dimensions are kept as would be required to represent the collection in a single language. As outlined above, full or partial equivalence (in the sense that one term will have the same or similar effect in referencing documents as another) is induced between any two or more terms approximately to the extent that their pattern of use, or the overall pattern of association between other terms with which they co-occur, is similar across documents in the training set. Equivalent or nearly equivalent terms in different languages would, of course, be expected to be distributed in nearly the same way in a set of documents and their translations. Thus, the location of two or more equivalent terms in different languages should be almost the same in the resulting representation. Consequently, a document folded in by terms in one language is retrieved by a query containing the appropriate set of words in another language.

A simple example may aid in understanding the general procedure. For this example, a training set of "documents" is composed of f